# Hypothesis Testing - Employee Attrition

Statistical validation of key hypotheses about employee attrition factors.

## Hypotheses to Test:
1. **Income Hypothesis**: Employees who leave earn significantly less than those who stay
2. **Overtime Hypothesis**: Overtime significantly increases attrition
3. **Satisfaction Hypothesis**: Low satisfaction correlates with higher attrition
4. **Promotion Hypothesis**: Longer time since promotion increases attrition
5. **Work-Life Balance**: Poor work-life balance increases attrition
6. **Distance Hypothesis**: Greater distance from home increases attrition
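
One way to organise the six hypotheses is a small test plan that pairs each with a column and a test family. The sketch below is only an illustration: the column names follow the dataset used in this notebook, but the test assigned to each column, the `TEST_PLAN` dictionary, and the `run_test` helper are assumptions, and the data is random stand-in data rather than the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical plan: column names follow the dataset used in this notebook;
# the test assigned to each column is an illustrative assumption.
TEST_PLAN = {
    "Income": ("MonthlyIncome", "numeric"),
    "Overtime": ("OverTime", "categorical"),
    "Satisfaction": ("JobSatisfaction", "ordinal"),
    "Promotion": ("YearsSinceLastPromotion", "numeric"),
    "Work-Life Balance": ("WorkLifeBalance", "ordinal"),
    "Distance": ("DistanceFromHome", "numeric"),
}


def run_test(df, column, kind, target="Attrition"):
    """Return (test_name, p_value) for one column against the attrition flag."""
    if kind == "categorical":
        # Chi-square test of independence on the contingency table
        table = pd.crosstab(df[column], df[target])
        _, p, _, _ = stats.chi2_contingency(table)
        return "chi-square", p
    left = df.loc[df[target] == "Yes", column]
    stayed = df.loc[df[target] == "No", column]
    if kind == "ordinal":
        # Rank-based test, suitable for 1-4 rating scales
        p = stats.mannwhitneyu(left, stayed, alternative="two-sided").pvalue
        return "mann-whitney", p
    # Welch's t-test does not assume equal variances
    return "welch-t", stats.ttest_ind(left, stayed, equal_var=False).pvalue


# Synthetic stand-in data (random, NOT the real dataset) to exercise the plan.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "Attrition": rng.choice(["Yes", "No"], size=n, p=[0.2, 0.8]),
    "MonthlyIncome": rng.integers(1000, 20000, size=n),
    "OverTime": rng.choice(["Yes", "No"], size=n),
    "JobSatisfaction": rng.integers(1, 5, size=n),
    "YearsSinceLastPromotion": rng.integers(0, 16, size=n),
    "WorkLifeBalance": rng.integers(1, 5, size=n),
    "DistanceFromHome": rng.integers(1, 30, size=n),
})

plan_results = {
    name: run_test(demo, column, kind)
    for name, (column, kind) in TEST_PLAN.items()
}
for name, (test_name, p) in plan_results.items():
    print(f"{name:<18} {test_name:<13} p = {p:.4f}")
```

Because the stand-in data is random, the printed p-values carry no meaning; the point is only the dispatch from variable type to test.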

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import chi2_contingency, ttest_ind, mannwhitneyu, f_oneway
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')


In [2]:
import os
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\employee-turnover-prediction-1\\jupyter_notebooks'

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [4]:
current_dir = os.getcwd()
current_dir

'd:\\Code Institute\\employee-turnover-prediction-1'

In [5]:
# Path to the processed data directory
processed_data_dir = os.path.join(current_dir, 'data_set', 'processed')

# Load the cleaned dataset
df = pd.read_csv(os.path.join(processed_data_dir, 'cleaned_employee_attrition.csv'))
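
Changing the working directory with `os.chdir` depends on cell execution order: re-running that cell moves up one more level each time. As an alternative, the path can be built from the notebook folder without touching the working directory. This is a minimal sketch; the helper name `processed_csv_path` and the example folder `/tmp/project/jupyter_notebooks` are hypothetical, while the `data_set/processed` layout and the file name come from the cells above.

```python
from pathlib import Path


def processed_csv_path(notebook_dir, filename="cleaned_employee_attrition.csv"):
    """Build the CSV path from the notebook folder without changing the cwd."""
    project_root = Path(notebook_dir).parent  # one level above jupyter_notebooks
    return project_root / "data_set" / "processed" / filename


# Hypothetical location, used only to show the resulting path structure.
example_path = processed_csv_path("/tmp/project/jupyter_notebooks")
print(example_path)
```

In the notebook, `Path.cwd()` could be passed as `notebook_dir`, and the result handed straight to `pd.read_csv`.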

In [6]:
df.head(5)

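| Unnamed: 0 | Age | Attrition | DistanceFromHome | JobLevel | JobRole | JobSatisfaction | MonthlyIncome | NumCompaniesWorked | OverTime | WorkLifeBalance | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | 1 | 2 | Sales Executive | 4 | 5993 | 8 | Yes | 1 | 0 | 5 |
| 1 | 49 | No | 8 | 2 | Research Scientist | 2 | 5130 | 1 | No | 3 | 1 | 7 |
| 2 | 37 | Yes | 2 | 1 | Laboratory Technician | 3 | 2090 | 6 | Yes | 3 | 0 | 0 |
| 3 | 33 | No | 3 | 1 | Research Scientist | 3 | 2909 | 1 | Yes | 3 | 3 | 0 |
| 4 | 27 | No | 2 | 1 | Laboratory Technician | 2 | 3468 | 9 | No | 3 | 2 | 2 |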
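
The `Unnamed: 0` column in the preview is usually a saved row index: pandas writes the index as an unnamed first column when `to_csv` is called with its default `index=True`, and `read_csv` then names that column `Unnamed: 0`. Assuming that is what happened here (the cleaning step is not shown, so this is a guess), reading with `index_col=0` restores it as the index. The tiny frame below is made up for the demonstration.

```python
import io

import pandas as pd

# Made-up two-row frame, standing in for the cleaned dataset.
frame = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

buffer = io.StringIO()
frame.to_csv(buffer)  # index=True by default: the index becomes an unnamed first column
csv_text = buffer.getvalue()

# Default read: the saved index shows up as a data column called 'Unnamed: 0'.
with_stray = pd.read_csv(io.StringIO(csv_text))
# Treating the first column as the index removes the stray column.
without_stray = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_stray.columns))
print(list(without_stray.columns))
```

Writing the cleaned file with `to_csv(..., index=False)` would avoid the extra column at the source.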


## Hypothesis 1: Income and Attrition
**H0:** There is no significant difference in monthly income between employees who left and those who stayed.

**H1:** Employees who left have significantly lower monthly income than those who stayed.
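
Written formally, with $\mu_L$ and $\mu_S$ as the mean monthly income of employees who left and stayed, the hypotheses and the pooled two-sample t statistic are as follows. The pooled form is what `scipy.stats.ttest_ind` computes with its default `equal_var=True`; the subscripts $L$ and $S$ are notation introduced here.

```latex
H_0:\ \mu_L = \mu_S
\qquad
H_1:\ \mu_L < \mu_S

t = \frac{\bar{x}_L - \bar{x}_S}{s_p \sqrt{\dfrac{1}{n_L} + \dfrac{1}{n_S}}},
\qquad
s_p^2 = \frac{(n_L - 1)\,s_L^2 + (n_S - 1)\,s_S^2}{n_L + n_S - 2},
\qquad
\text{df} = n_L + n_S - 2
```

Since $H_1$ is directional, a negative $t$ supports it, and the one-sided p-value is then half of the two-sided p-value that `ttest_ind` returns by default.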

In [7]:
# Separate data by attrition status
income_left = df[df['Attrition'] == 'Yes']['MonthlyIncome']
income_stayed = df[df['Attrition'] == 'No']['MonthlyIncome']

# Independent two-sample t-test (SciPy defaults: two-sided, equal variances assumed)
t_stat, p_value = stats.ttest_ind(income_left, income_stayed)

print("=" * 70)
print("HYPOTHESIS TEST 1: Monthly Income vs Attrition")
print("=" * 70)
print(f"\nEmployees who left - Mean Income: ${income_left.mean():,.2f}")
print(f"Employees who stayed - Mean Income: ${income_stayed.mean():,.2f}")
print(f"\nDifference: ${income_stayed.mean() - income_left.mean():,.2f}")
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.6f}")

if p_value < 0.05:
    print("\n✓ RESULT: Reject null hypothesis (p < 0.05)")
    print("   There IS a significant difference in income between groups.")
    # The test is two-sided, so check the sign of t before stating a direction
    if t_stat < 0:
        print("   Lower income is associated with higher attrition.")
    else:
        print("   Higher income is associated with higher attrition.")
else:
    print("\n✗ RESULT: Fail to reject null hypothesis (p >= 0.05)")
    print("   No significant difference in income between groups.")

HYPOTHESIS TEST 1: Monthly Income vs Attrition

Employees who left - Mean Income: $4,787.09
Employees who stayed - Mean Income: $6,832.74

Difference: $2,045.65

t-statistic: -6.2039
p-value: 0.000000

✓ RESULT: Reject null hypothesis (p < 0.05)
   There IS a significant difference in income between groups.
   Lower income is associated with higher attrition.
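
Because H1 is directional and income data is often right-skewed, a robustness check could pair a one-sided Welch t-test with a one-sided Mann-Whitney U test and an effect size. The sketch below is an assumption-laden illustration, not part of the original analysis: the helper `directional_income_tests` is hypothetical, and the incomes are synthetic lognormal draws rather than the real dataset, so the printed numbers say nothing about actual attrition.

```python
import numpy as np
from scipy import stats


def directional_income_tests(left, stayed):
    """One-sided tests of 'left earns less than stayed', plus Cohen's d."""
    left = np.asarray(left, dtype=float)
    stayed = np.asarray(stayed, dtype=float)

    # Welch's t-test: no equal-variance assumption, H1 is mean(left) < mean(stayed)
    welch = stats.ttest_ind(left, stayed, equal_var=False, alternative="less")
    # Mann-Whitney U: rank-based, H1 is that left is stochastically smaller
    mwu = stats.mannwhitneyu(left, stayed, alternative="less")

    # Cohen's d using the pooled sample standard deviation
    n1, n2 = len(left), len(stayed)
    pooled_var = ((n1 - 1) * left.var(ddof=1) + (n2 - 1) * stayed.var(ddof=1)) / (n1 + n2 - 2)
    cohens_d = (left.mean() - stayed.mean()) / np.sqrt(pooled_var)

    return {
        "welch_t": welch.statistic,
        "welch_p": welch.pvalue,
        "mannwhitney_p": mwu.pvalue,
        "cohens_d": cohens_d,
    }


# Synthetic, right-skewed incomes (NOT the real data); parameters are arbitrary.
rng = np.random.default_rng(42)
synthetic_stayed = rng.lognormal(mean=8.7, sigma=0.6, size=1200)
synthetic_left = rng.lognormal(mean=8.3, sigma=0.6, size=240)

results = directional_income_tests(synthetic_left, synthetic_stayed)
for name, value in results.items():
    print(f"{name}: {value:.4g}")
```

With the notebook's data, `income_left` and `income_stayed` could be passed in directly. Printing p-values with a format such as `.4g` also avoids the uninformative `0.000000` that a fixed six-decimal format produces for very small values.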


In [8]:
# Create visualization for income hypothesis test
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Income Distribution by Attrition Status', 'Box Plot Comparison'),
    specs=[[{'type': 'histogram'}, {'type': 'box'}]]
)

# Histogram
fig.add_trace(
    go.Histogram(x=income_stayed, name='Stayed', opacity=0.7, marker_color='#3b82f6'),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=income_left, name='Left', opacity=0.7, marker_color='#ef4444'),
    row=1, col=1
)

# Box plots
fig.add_trace(
    go.Box(y=income_stayed, name='Stayed', marker_color='#3b82f6'),
    row=1, col=2
)
fig.add_trace(
    go.Box(y=income_left, name='Left', marker_color='#ef4444'),
    row=1, col=2
)

fig.update_layout(
    title_text=f"Hypothesis Test 1: Income vs Attrition (p-value: {p_value:.6f})",
    barmode='overlay',  # draw the semi-transparent histograms on top of each other
    showlegend=True,
    height=400
)
fig.update_xaxes(title_text="Monthly Income ($)", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_text="Monthly Income ($)", row=1, col=2)

fig.show()

Based on the hypothesis tests conducted:

1. **Income**: Employees who leave tend to have lower monthly income (mean $4,787 vs. $6,833 for those who stayed; t = -6.20, p < 0.001)