# **Data Visualization**

# Objectives


The objective of this notebook is to explore and understand the [IBM HR Analytics Employee Attrition & Performance dataset](https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset) through descriptive statistics and visual analysis. The focus is on uncovering key patterns and relationships among employee attributes that are associated with attrition and performance.

# Input
* The input can be found [here]()
* This is a CSV file containing the cleaned data produced by the ETL process.

# Outputs

- All the visualizations have been saved as PNG files and are stored in a designated folder for easy access and reference, which can be found [here](../Images).

---

# Change working directory
Change the working directory from its current folder to its parent folder, because the notebooks are stored in a subfolder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\employee-turnover-prediction-1\\jupyter_notebooks'

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\employee-turnover-prediction-1'
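
Re-running the second cell after the third one moves the working directory up one more level, because `current_dir` has been reassigned by then. The sketch below is one way to make the step safe to re-run. It assumes the notebooks folder is named `jupyter_notebooks`, as in the output above; `project_root` and `NOTEBOOK_DIR_NAME` are hypothetical names introduced for illustration.

```python
import os
from pathlib import Path

# Folder name taken from the os.getcwd() output above (an assumption about the layout).
NOTEBOOK_DIR_NAME = "jupyter_notebooks"


def project_root(cwd: Path) -> Path:
    """Return the parent folder if cwd is the notebooks folder, else cwd unchanged."""
    return cwd.parent if cwd.name == NOTEBOOK_DIR_NAME else cwd


root = project_root(Path.cwd())
os.chdir(root)  # only moves up when started from the notebooks folder
print(f"Working directory: {root}")
```

Because the function checks the folder name first, running the cell a second time leaves the working directory where it is.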

Define the paths to the raw and processed data folders

In [4]:
# Folder holding the raw data
raw_data_dir = os.path.join(current_dir, 'data_set', 'raw')

# Folder holding the processed (cleaned) data
processed_data_dir = os.path.join(current_dir, 'data_set', 'processed')


---

# Import packages

In [5]:
import pandas as pd  # Data loading and manipulation
import matplotlib.pyplot as plt  # Static plotting
import seaborn as sns  # Statistical plotting built on matplotlib
import plotly.express as px  # High-level interactive plots
import plotly.graph_objects as go  # Lower-level Plotly figures (pie, bar traces)
from plotly.subplots import make_subplots  # Subplot grids for Plotly
from scipy import stats  # Statistical tests

sns.set_style('whitegrid')  # Default style for seaborn/matplotlib figures

---

# Load the cleaned dataset

In [6]:
# Load the cleaned dataset
df = pd.read_csv(os.path.join(processed_data_dir, 'cleaned_employee_attrition.csv'))
df.head(5)

Unnamed: 0,Age,Attrition,DistanceFromHome,JobLevel,JobRole,JobSatisfaction,MonthlyIncome,NumCompaniesWorked,OverTime,WorkLifeBalance,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,1,2,Sales Executive,4,5993,8,Yes,1,0,5
1,49,No,8,2,Research Scientist,2,5130,1,No,3,1,7
2,37,Yes,2,1,Laboratory Technician,3,2090,6,Yes,3,0,0
3,33,No,3,1,Research Scientist,3,2909,1,Yes,3,3,0
4,27,No,2,1,Laboratory Technician,2,3468,9,No,3,2,2
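
The leading `Unnamed: 0` column in the preview usually means the DataFrame index was written to the CSV and then read back as an ordinary column. The sketch below shows one way to avoid that with the `index_col` parameter of `pd.read_csv`. It uses two rows copied from the preview (with a subset of the columns) rather than the real file, and `df_demo` is a name introduced here for illustration.

```python
import io

import pandas as pd

# Two rows copied from the preview above. The empty first header field is what
# pandas reports as "Unnamed: 0" when no index column is specified.
csv_text = io.StringIO(
    ",Age,Attrition,MonthlyIncome\n"
    "0,41,Yes,5993\n"
    "1,49,No,5130\n"
)

# index_col=0 tells pandas to use the first column as the index.
df_demo = pd.read_csv(csv_text, index_col=0)
print(df_demo.columns.tolist())
```

An alternative is to write the file with `index=False` in the ETL step, so that no index column is stored in the first place.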


---

## 1. ATTRITION OVERVIEW

In [7]:
# Overall attrition rate
attrition_counts = df['Attrition'].value_counts()
attrition_rate = (attrition_counts['Yes'] / len(df)) * 100

fig = go.Figure(data=[
    go.Pie(labels=attrition_counts.index, 
           values=attrition_counts.values,
           hole=0.4,
           marker_colors=['#2ecc71', '#e74c3c'],
           textinfo='label+percent',
           textfont_size=14)
])

fig.update_layout(
    title=f'Overall Attrition Rate: {attrition_rate:.1f}%',
    title_font_size=20,
    height=500,
    showlegend=True
)

fig.show()

print("\nAttrition Statistics:")
print(f"  Total Employees: {len(df)}")
print(f"  Attrition: {attrition_counts['Yes']} ({attrition_rate:.1f}%)")
print(f"  Retained: {attrition_counts['No']} ({100-attrition_rate:.1f}%)")


Attrition Statistics:
  Total Employees: 1470
  Attrition: 237 (16.1%)
  Retained: 1233 (83.9%)
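
The printed counts also show how unbalanced the two classes are, which matters when reading the overlaid histograms and grouped bars later in the notebook. A small worked check, using only the counts printed above (the variable names are illustrative):

```python
# Counts copied from the printed statistics above.
left, stayed = 237, 1233

total = left + stayed
attrition_rate = left / total * 100   # share of employees who left, in percent
stay_to_leave_ratio = stayed / left   # how many stayers there are per leaver

print(f"{attrition_rate:.1f}% attrition, about {stay_to_leave_ratio:.1f} stayers per leaver")
```

With roughly five stayers for every leaver, raw counts per group will be dominated by the "No" class, so rates within each group are often easier to compare than counts.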


## 2. DEMOGRAPHIC ANALYSIS

In [8]:
# Age distribution by attrition
fig = px.histogram(df, x='Age', color='Attrition',
                   nbins=30, barmode='overlay',
                   title='Age Distribution by Attrition Status',
                   color_discrete_map={'Yes': '#e74c3c', 'No': '#2ecc71'},
                   opacity=0.7)

fig.update_layout(height=500, xaxis_title='Age', yaxis_title='Count')
fig.show()

# Statistics by age group (runs only if both derived columns exist in the data)
if {'AgeGroup', 'Attrition_Binary'}.issubset(df.columns):
    age_attrition = df.groupby('AgeGroup')['Attrition_Binary'].agg(['mean', 'count'])
    age_attrition['mean'] = age_attrition['mean'] * 100
    print("\nAttrition Rate by Age Group:")
    print(age_attrition)
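
The preview of the loaded data shows neither `AgeGroup` nor `Attrition_Binary`, so the guarded block above prints nothing for this file. The sketch below shows one way the two columns could be derived before grouping. The bin edges, labels, and the 1/0 encoding are assumptions made for illustration, and the five rows are copied from the `df.head()` preview.

```python
import pandas as pd

# Five rows copied from the df.head() preview above.
demo = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

# Assumed encoding: 1 = left, 0 = stayed.
demo["Attrition_Binary"] = (demo["Attrition"] == "Yes").astype(int)

# Hypothetical bin edges and labels; right=False makes each bin [lower, upper).
demo["AgeGroup"] = pd.cut(
    demo["Age"],
    bins=[18, 30, 40, 50, 61],
    labels=["18-29", "30-39", "40-49", "50-60"],
    right=False,
)

# observed=True keeps only the age groups that actually occur in the data.
age_attrition = (
    demo.groupby("AgeGroup", observed=True)["Attrition_Binary"]
    .agg(["mean", "count"])
)
age_attrition["mean"] *= 100  # express the rate as a percentage
print(age_attrition)
```

Five rows are far too few to say anything about age and attrition; the point is only the shape of the derivation.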

## 3. JOB SATISFACTION ANALYSIS

In [15]:
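# Job satisfaction levels
satisfaction_cols = ['JobSatisfaction']

# Size the subplot grid from the list (at most two columns per row)
n_cols = min(2, len(satisfaction_cols))
n_rows = (len(satisfaction_cols) + n_cols - 1) // n_cols

fig = make_subplots(rows=n_rows, cols=n_cols,
                    subplot_titles=satisfaction_cols)

for idx, col in enumerate(satisfaction_cols):
    row = idx // n_cols + 1
    col_pos = idx % n_cols + 1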
    
    sat_data = df.groupby([col, 'Attrition']).size().unstack(fill_value=0)
    
    for attrition_status in sat_data.columns:
        fig.add_trace(
            go.Bar(name=attrition_status, 
                   x=sat_data.index, 
                   y=sat_data[attrition_status],
                   marker_color='#e74c3c' if attrition_status == 'Yes' else '#2ecc71',
                   showlegend=(idx == 0)),
            row=row, col=col_pos
        )

fig.update_layout(height=800, title_text='Satisfaction Levels vs Attrition', barmode='group')
fig.show()
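
The grouped bars above show raw counts, which are dominated by the larger "No" group. A complementary view is the attrition rate within each satisfaction level. The sketch below uses `pd.crosstab` with `normalize="index"` on the five rows from the `df.head()` preview; `rate_by_level` is a name introduced here, and the resulting numbers only illustrate the mechanics.

```python
import pandas as pd

# Five rows copied from the df.head() preview above.
demo = pd.DataFrame({
    "JobSatisfaction": [4, 2, 3, 3, 2],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

# normalize="index" scales each satisfaction level's row to sum to 1,
# so multiplying by 100 gives the percentage of stayers and leavers per level.
rate_by_level = pd.crosstab(
    demo["JobSatisfaction"], demo["Attrition"], normalize="index"
) * 100
print(rate_by_level.round(1))
```

On the full dataset the same call would take `df["JobSatisfaction"]` and `df["Attrition"]` as inputs.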

## 4. COMPENSATION ANALYSIS

In [16]:
# Monthly Income distribution
fig = px.box(df, x='Attrition', y='MonthlyIncome',
             title='Monthly Income Distribution by Attrition',
             color='Attrition',
             color_discrete_map={'Yes': '#e74c3c', 'No': '#2ecc71'})

fig.update_layout(height=500)
fig.show()

print("\nIncome Statistics by Attrition:")
print(df.groupby('Attrition')['MonthlyIncome'].describe())


Income Statistics by Attrition:
            count         mean          std     min     25%     50%     75%  \
Attrition                                                                     
No         1233.0  6832.739659  4818.208001  1051.0  3211.0  5204.0  8834.0   
Yes         237.0  4787.092827  3640.210367  1009.0  2373.0  3202.0  5916.0   

               max  
Attrition           
No         19999.0  
Yes        19859.0  
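
`scipy.stats` is imported at the top of the notebook but not used in the cells shown here. As a rough check on whether the income gap is larger than chance variation, Welch's t statistic can be computed from the summary statistics alone. This is a sketch under the assumption that a comparison of means is meaningful here; in both groups the mean is well above the median, so the distributions are right-skewed, and a rank-based test such as `scipy.stats.mannwhitneyu` on the raw columns may be a better fit.

```python
import math

# Summary statistics copied from the describe() output above.
n_stay, mean_stay, std_stay = 1233, 6832.739659, 4818.208001
n_left, mean_left, std_left = 237, 4787.092827, 3640.210367

# Welch's t statistic: difference in means divided by its unpooled standard error.
std_error = math.sqrt(std_stay**2 / n_stay + std_left**2 / n_left)
t_stat = (mean_stay - mean_left) / std_error

print(f"mean gap = {mean_stay - mean_left:.0f}, Welch t = {t_stat:.2f}")
```

A t statistic of this size, with samples this large, indicates that the difference in mean income is unlikely to be due to sampling noise alone, although it says nothing about causation.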


In [17]:
# Income by Job Level and Attrition
fig = px.box(df, x='JobLevel', y='MonthlyIncome', color='Attrition',
             title='Monthly Income by Job Level and Attrition',
             color_discrete_map={'Yes': '#e74c3c', 'No': '#2ecc71'})

fig.update_layout(height=500)
fig.show()
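
The Outputs section states that the figures are saved as PNG files, while the cells shown here only call `fig.show()`. The sketch below is one possible way to build the output path. `image_path` is a hypothetical helper, and the `Images` folder name is taken from the link in the Outputs section. Static export of Plotly figures with `fig.write_image` additionally requires the `kaleido` package, so that call is left as a comment.

```python
from pathlib import Path


def image_path(name: str, folder: str = "Images") -> Path:
    """Return <folder>/<name>.png, creating the folder if it does not exist."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{name}.png"


# With a Plotly figure `fig` in scope, the export could then look like this
# (assumes kaleido is installed; the file name is illustrative):
# fig.write_image(str(image_path("monthly_income_by_job_level")))
```

Keeping the path logic in one helper means every figure lands in the same folder with a consistent naming scheme.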

## 5. WORK-LIFE BALANCE ANALYSIS

In [20]:
# Distance from home
fig = px.histogram(df, x='DistanceFromHome', color='Attrition',
                   nbins=20, barmode='overlay',
                   title='Distance from Home vs Attrition',
                   color_discrete_map={'Yes': '#e74c3c', 'No': '#2ecc71'},
                   opacity=0.7)

fig.update_layout(height=500)
fig.show()

## 6. CAREER PROGRESSION ANALYSIS

In [23]:
# Years since last promotion
fig = px.box(df, x='Attrition', y='YearsSinceLastPromotion',
             title='Years Since Last Promotion by Attrition',
             color='Attrition',
             color_discrete_map={'Yes': '#e74c3c', 'No': '#2ecc71'})

fig.update_layout(height=500)
fig.show()

print("\nPromotion Statistics by Attrition:")
print(df.groupby('Attrition')['YearsSinceLastPromotion'].describe())


Promotion Statistics by Attrition:
            count      mean       std  min  25%  50%  75%   max
Attrition                                                      
No         1233.0  2.234388  3.234762  0.0  0.0  1.0  3.0  15.0
Yes         237.0  1.945148  3.153077  0.0  0.0  1.0  2.0  15.0


## 7. CORRELATION ANALYSIS

In [26]:
# Correlation heatmap for numeric features
numeric_cols = ['Age', 'MonthlyIncome', 'YearsSinceLastPromotion',
                'JobSatisfaction', 'DistanceFromHome']
                
corr_matrix = df[numeric_cols].corr()

fig = px.imshow(corr_matrix,
                labels=dict(color="Correlation"),
                x=corr_matrix.columns,
                y=corr_matrix.columns,
                color_continuous_scale='RdBu_r',
                aspect='auto',
                title='Feature Correlation Heatmap')

fig.update_layout(height=700)
fig.show()
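
The heatmap covers correlations among the numeric features but leaves out the target itself. One way to relate each feature to attrition is to encode `Attrition` as 0/1 and use `DataFrame.corrwith`. A Pearson correlation against a binary variable is the point-biserial correlation. The sketch below uses the five rows from the `df.head()` preview; the 1/0 encoding is an assumption, and the values it prints are not meaningful at this sample size.

```python
import pandas as pd

# Five rows copied from the df.head() preview above; far too few for real conclusions.
demo = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468],
    "DistanceFromHome": [1, 8, 2, 3, 2],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
})

# Assumed encoding: 1 = left, 0 = stayed.
target = (demo["Attrition"] == "Yes").astype(int)

# Pearson correlation of each numeric feature with the binary target.
features = demo[["Age", "MonthlyIncome", "DistanceFromHome"]]
corr_with_attrition = features.corrwith(target).sort_values()
print(corr_with_attrition.round(2))
```

On the full dataset, `features` would be `df[numeric_cols]` and `target` would be built from `df["Attrition"]` in the same way.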