# **Predict Employee Attrition: Transform**

## Objectives

* At the end of this phase, we will:
    - Transform the dataset to analyze it in more detail
    - Conduct statistical tests to determine and establish relationships between features
    - Create a dashboard that enables better exploration

## Inputs

* [Task outline](https://docs.google.com/document/d/e/2PACX-1vThNllbMORJoc348kFavz4mZWT1-33xyazdD2L-3AlTfORlRhuDyT0xmCBQMD2C-K2djQQipt6te6lo/pub)
* Extract phase

## Outputs

* Transform the dataset
* Statistical tests and visualizations
* PowerBI Dashboard

---

# Import packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import pingouin as pg

---

# Data reupload

In [None]:
df = pd.read_csv("../data/cleaned_data/predict_employee_attrition_copy.csv")
print(df.shape)
df.head()

---

# Data transformation

Aggregating satisfaction levels

In [None]:
# Create TotalSatisfaction feature by averaging satisfaction scores
satisfaction_cols = ["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "WorkLifeBalance"]
df["TotalSatisfaction"] = df[satisfaction_cols].mean(axis=1)
print(df[["TotalSatisfaction"] + satisfaction_cols].head())

Creating age groups

In [None]:
print(df["Age"].nunique())
print("---" * 40)
print(df["Age"].unique())
print("---" * 40)
print("Min age:", df["Age"].min())
print("---" * 40)
print("Max age:", df["Age"].max())

In [None]:
# Define age brackets
bins = [18, 25, 35, 45, 55, 65]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65']

# Create AgeBracket column
df['AgeBracket'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True, include_lowest=True)

# Display the distribution of age brackets
print(df['AgeBracket'].value_counts())
print(df[['Age', 'AgeBracket']].head())
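
To make the bracket edges explicit, the sketch below applies the same `bins` and `labels` to a few hypothetical boundary ages (the ages are made up for illustration and are not taken from the dataset). With `right=True` each bin includes its upper edge, and `include_lowest=True` keeps age 18 in the first bracket; any age outside 18 to 65 becomes `NaN`.

```python
import pandas as pd

# Same bin edges and labels as in the cell above
bins = [18, 25, 35, 45, 55, 65]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65']

# Hypothetical boundary ages, chosen only to illustrate edge handling
ages = pd.Series([18, 25, 26, 35, 36, 65, 17])

# 17 falls outside the bins and is therefore mapped to NaN
brackets = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)
print(pd.DataFrame({'Age': ages, 'AgeBracket': brackets}))
```

If the minimum or maximum age printed earlier fell outside the bin range, the bin edges would need widening before the brackets are used in a test.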

Encoding attrition and gender

In [None]:
df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

In [None]:
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

## Sanity check: Post-transformation

In [None]:
df.info()

In [None]:
(df == 0).sum()

In [None]:
df.isnull().sum()

---

# Correlation analysis: Post-transformation

In [None]:
# Compute the correlation matrix once and reuse it for the mask and the heatmap
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(15, 12))
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()

**Correlation analysis:**
The correlation analysis following transformation has not revealed any new linear correlations. We will now proceed with hypothesis testing to check whether there are any statistically significant relationships between features.


In [None]:
pd.crosstab(df['Attrition'], df['Gender'], normalize='columns')

In [None]:
# Drop columns that do not add value to the analysis
df = df.drop(columns=['EmployeeCount', 'Over18', 'StandardHours'], errors='ignore')
print(df.shape)
df.head(10)

# Copy dataset

In [None]:
df.to_csv("../data/transformed_data/predict_employee_attrition_transformed.csv", index=False)

---

# Research methodology

Testing hypotheses to determine the relationship between features

# Hypothesis 2: Age and Attrition Rate



**Null hypothesis 2:** Age and attrition rate are independent of each other and do not share a statistically significant relationship.

In [None]:
# Hypothesis testing: Chi-square test of independence between AgeBracket and Attrition
from scipy.stats import chi2_contingency

# Load the transformed dataset; df_trans stays None if the file is missing
try:
    df_trans = pd.read_csv('../data/transformed_data/predict_employee_attrition_transformed.csv')
except FileNotFoundError:
    print("Error: The file was not found. Please check the path to your CSV file.")
    df_trans = None

# Run the test only when the data was loaded successfully
if df_trans is not None:
    # Create a contingency table of age bracket against attrition
    contingency_table = pd.crosstab(df_trans['AgeBracket'], df_trans['Attrition'])
    print('Contingency Table:')
    print(contingency_table)

    # Perform Chi-square test
    chi2, p, dof, expected = chi2_contingency(contingency_table)
    print(f'Chi-square statistic: {chi2:.4f}')
    print(f'p-value: {p:.4f}')
    print(f'Degrees of freedom: {dof}')
    print('Expected frequencies:')
    print(expected)

In [None]:
# Visualization: Attrition rate by AgeBracket (plt and sns are imported at the top of the notebook)

if df_trans is not None:
    # Calculate attrition rate per age bracket
    attrition_rate = df_trans.groupby('AgeBracket')['Attrition'].mean().reset_index()

    # Create the visualization
    plt.figure(figsize=(8,5))
    sns.barplot(x='AgeBracket', y='Attrition', data=attrition_rate, palette='viridis')
    plt.title('Attrition Rate by Age Bracket')
    plt.ylabel('Attrition Rate')
    plt.xlabel('Age Bracket')
    plt.ylim(0, 1)
    plt.show()

**Key observations:**
* Contingency table: This table shows the raw counts. For example, in the 18-25 age bracket, 79 employees did not leave (Attrition=0) and 44 did (Attrition=1).
* Chi-square statistic (59.4387): A measure of how much the observed counts differ from the counts one would expect if there were no relationship at all between age and attrition. The larger the value relative to the degrees of freedom, the stronger the evidence against independence.
* p-value: The printed value of 0.0000 means p < 0.0001, which provides enough evidence to reject the null hypothesis. In the synthetic dataset, there is a statistically significant association between an employee's age bracket and their likelihood of attrition.
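
The chi-square statistic grows with sample size, so on its own it does not say how strong the association is. One common companion measure is Cramér's V, which rescales the statistic to the range 0 to 1. The sketch below is a minimal illustration on a hypothetical contingency table: only the 18-25 row (79 stayed, 44 left) comes from the output discussed above, the remaining counts are made up, and the helper name `cramers_v` is an assumption rather than part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V effect size for an r x c contingency table (hypothetical helper)."""
    chi2, _, _, _ = chi2_contingency(table)
    n = table.to_numpy().sum()           # total number of observations
    min_dim = min(table.shape) - 1       # min(rows, columns) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical counts: only the 18-25 row is taken from the notebook output
toy_table = pd.DataFrame(
    {0: [79, 300, 400], 1: [44, 60, 40]},
    index=['18-25', '26-35', '36-45'],
)

v = cramers_v(toy_table)
print(f"Cramér's V: {v:.3f}")
```

Values close to 0 point to a weak association and values close to 1 to a strong one; on the real data, the same helper could be called with `contingency_table`.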

**Business impact:**
- Rejecting the null hypothesis indicates that age is associated with attrition in the synthetic dataset.
- HR can use these insights to design age-specific programs, reduce turnover, and optimize workforce planning.

---

# Hypothesis 4: Monthly Income and Attrition Rate

Null hypothesis: Monthly income and attrition rate are independent of each other and do not share a statistically significant relationship.

In [None]:
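# Compare the incomes of employees who left (1) with those who stayed (0)
pg.mwu(df.loc[df['Attrition'] == 1, 'MonthlyIncome'], df.loc[df['Attrition'] == 0, 'MonthlyIncome'], alternative='two-sided')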

**Key observations:** 
* The reported p-value is effectively 0 (below the displayed precision), which means we have enough evidence to reject the null hypothesis. In the synthetic dataset, MonthlyIncome and Attrition share a statistically significant relationship.
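
A Mann-Whitney U test compares one numeric variable between two independent groups, so the two samples passed to it should be the `MonthlyIncome` values of employees who left and of those who stayed. The sketch below shows that split on a small hypothetical DataFrame (all values are made up) and uses `scipy.stats.mannwhitneyu` as a stand-in for `pg.mwu`, which takes the same two samples.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical data with the same column names as the notebook
toy = pd.DataFrame({
    'Attrition': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'MonthlyIncome': [2100, 2500, 2300, 3000, 5200, 4800, 6100, 3900, 7000, 4500],
})

# Split the numeric variable by group before testing
income_left = toy.loc[toy['Attrition'] == 1, 'MonthlyIncome']
income_stayed = toy.loc[toy['Attrition'] == 0, 'MonthlyIncome']

# Two-sided test: do the two income distributions differ?
u_stat, p_value = mannwhitneyu(income_left, income_stayed, alternative='two-sided')
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

In this toy sample every employee who left earns less than every employee who stayed, so the test reports a small p-value; real data will usually overlap more.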

In [None]:
fig = px.box(df, x='Attrition', y='MonthlyIncome', color='Attrition',
             title='Monthly Income vs Attrition',
             labels={'MonthlyIncome': 'Monthly Income', 'Attrition': 'Attrition Status'})
fig.show()

**Key observations:**
* Employees who left the company (Attrition = 1) generally have lower monthly incomes compared to those who stayed (Attrition = 0).
* The box plot shows a clear separation in income levels between the two groups.
* The Mann-Whitney U test result indicates a statistically significant difference in monthly income between employees who left and those who stayed (p-value ≈ 0).
* This suggests that lower monthly income is strongly associated with higher attrition rates.
* Addressing income disparities may help reduce employee attrition and improve retention.

**Business impact:**
* The strong link between lower monthly income and higher attrition rates highlights the importance of competitive compensation strategies.
* Organizations can reduce employee turnover by reviewing and adjusting salary structures, especially for roles with higher attrition.
* Improving pay equity and transparency may enhance employee satisfaction and retention.
* Addressing income disparities can lead to cost savings by reducing recruitment and training expenses associated with high attrition.
* Strategic compensation planning supports a more stable, motivated, and productive workforce.

**Key observations (Hypothesis 2, Attrition Rate by Age Bracket chart):**
* Attrition Rate: The y-axis shows the proportion of employees in each age bracket who left the company. For example, a rate of 0.4 means 40% of the employees in that group left.
* Key Insight 1: Youngest employees have the highest attrition: The "18-25" age bracket has the highest bar, with an attrition rate of over 35%. This is a very strong signal that the youngest employees are the most likely to leave.
* Key Insight 2: Mid-career stability: The attrition rate drops for the "26-35" group and is lowest for the "36-45" age bracket (at around 10%). This suggests that employees in their late 20s to mid-40s are the most stable.
* Key Insight 3: A slight rise in later careers: The rate begins to creep up again for the "46-55" and "56-65" age brackets, though it never reaches the high levels of the youngest group. This could be due to retirements or late-career changes.

**Hypotheses:**

1. Gender impacts attrition

2. Age impacts attrition

3. 

4. Monthly income impacts attrition

5. Total satisfaction level impacts attrition

    5.1 JobSatisfaction impacts attrition

# Hypothesis 1: Gender and Attrition

Null hypothesis: Gender and attrition rate are independent of each other and do not share a statistically significant relationship.

In [None]:
df["Gender"].info()

In [None]:
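# pg.chi2_independence returns the expected table first, then the observed table, then the test statistics
expected, observed, stats = pg.chi2_independence(data=df, x="Gender", y="Attrition")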

stats

**Key observations:** The p-value is greater than 0.05, which means we do not have enough evidence to reject the null hypothesis. In the synthetic dataset, Gender and Attrition do not share a statistically significant relationship.

In [None]:
fig = px.box(
    df,
    x="Gender",
    color="Attrition",
    points="all",
    title="Attrition Count by Gender",
    labels={
        "Gender": "Gender (1=Male, 0=Female)",
        "Attrition": "Attrition (0=No, 1=Yes)"
    }
)
fig.show()

**Key observations:** The boxes in the box plots overlap and the medians do not differ. The graph is consistent with the test result that Gender and Attrition are not statistically related in the synthetic dataset.

**Business impact:** While the synthetic dataset does not show a relationship, some behavior patterns might affect one gender more than the other. The business still needs to monitor closely for signals, however subtle they may be, to ensure a fair, unbiased workplace.

**Note:** As the current dataset is biased, we cannot generalize the findings to actual workplace environments. We recommend adjusting the data collection process so that it yields balanced data.

**Note to ML Engineer/Data Scientist:** Dataset is biased as both Gender and Attrition features have imbalanced values. Gender has more male samples and Attrition has more No samples. 
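
One quick way to quantify the imbalance described in the notes above is to look at the share of each class per column. The sketch below is a minimal illustration on hypothetical values (the rows are invented and use the notebook's 1/0 encoding); on the notebook's data, the same `value_counts(normalize=True)` call would run on `df`.

```python
import pandas as pd

# Hypothetical encoded columns (1=Male/Yes, 0=Female/No), not the real data
toy = pd.DataFrame({
    'Gender': [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    'Attrition': [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
})

# Share of each class per column; shares far from 0.5 indicate imbalance
gender_shares = toy['Gender'].value_counts(normalize=True).sort_index()
attrition_shares = toy['Attrition'].value_counts(normalize=True).sort_index()

print("Gender shares:")
print(gender_shares)
print("Attrition shares:")
print(attrition_shares)
```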

---

# Hypothesis 5: Total satisfaction level and Attrition

Null hypothesis: Satisfaction levels and attrition rate are independent of each other and do not share a statistically significant relationship.

In [None]:
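# Compare TotalSatisfaction of employees who left (1) with those who stayed (0)
pg.mwu(df.loc[df["Attrition"] == 1, "TotalSatisfaction"], df.loc[df["Attrition"] == 0, "TotalSatisfaction"], alternative='two-sided')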

**Key observations:** A reported p-value of effectively 0 provides enough evidence to reject the null hypothesis. In the synthetic dataset, TotalSatisfaction and Attrition share a statistically significant relationship.

In [None]:
# Box plot: TotalSatisfaction by Attrition
fig = px.box(df, x="Attrition", y="TotalSatisfaction", points="all",
             title="Total Satisfaction by Attrition Status",
             labels={"Attrition": "Attrition (0=No, 1=Yes)", "TotalSatisfaction": "Total Satisfaction"},
             color="Attrition")
fig.show()

In [None]:
# Box plots: Satisfaction indicators by Attrition
satisfaction_cols = ["JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction", "WorkLifeBalance"]
for col in satisfaction_cols:
    fig = px.box(df, x="Attrition", y=col, points="all",
                 title=f"{col} by Attrition Status",
                 labels={"Attrition": "Attrition (0=No, 1=Yes)", col: col},
                 color="Attrition")
    fig.show()

**Key observations:** 

1. TotalSatisfaction aggregates JobSatisfaction, EnvironmentSatisfaction, RelationshipSatisfaction, and WorkLifeBalance.

2. I have plotted each component feature in the second set of box plots to understand which one actually contributes to attrition within the aggregated TotalSatisfaction feature in the synthetic dataset.

3. The JobSatisfaction box plot indicates a difference in job satisfaction scores between attrited and existing employees. Although the median is the same, the scores are more widely spread among existing employees than among attrited employees. In the synthetic dataset, the two groups differ noticeably in job satisfaction score; this is tested under Hypothesis 5.1.

4. The EnvironmentSatisfaction box plot shows that the medians overlap. Additionally, the attrited employees' environment satisfaction scores show a tight distribution between 1 and 4. In the synthetic dataset, there is no apparent difference between attrited and existing employees when comparing environment satisfaction scores.

5. The RelationshipSatisfaction box plot shows that the boxes and the medians overlap. In the synthetic dataset, there is no apparent difference between attrited and existing employees when comparing relationship satisfaction scores.

6. The WorkLifeBalance box plot also shows that the boxes and the medians overlap. In the synthetic dataset, there is no apparent difference between attrited and existing employees when comparing work-life balance scores.
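
Overlapping boxes are only a visual cue. One option for backing each of the observations above with a test is to run a Mann-Whitney U test per satisfaction indicator, comparing attrited and existing employees. The sketch below assumes a DataFrame with the notebook's column names; the data is randomly generated purely so the example runs, and `scipy.stats.mannwhitneyu` stands in for `pg.mwu`.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

satisfaction_cols = ["JobSatisfaction", "EnvironmentSatisfaction",
                     "RelationshipSatisfaction", "WorkLifeBalance"]

# Randomly generated stand-in data (scores 1-4), not the notebook's dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({col: rng.integers(1, 5, size=200) for col in satisfaction_cols})
toy["Attrition"] = rng.integers(0, 2, size=200)

# One two-sided test per indicator: attrited vs existing employees
rows = []
for col in satisfaction_cols:
    left = toy.loc[toy["Attrition"] == 1, col]
    stayed = toy.loc[toy["Attrition"] == 0, col]
    u_stat, p_value = mannwhitneyu(left, stayed, alternative="two-sided")
    rows.append({"feature": col, "U": u_stat, "p_value": p_value})

results = pd.DataFrame(rows)
print(results)
```

On the notebook's data, `toy` would be replaced by `df`, and the resulting table would show which indicators differ between the two groups.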

**Business impact:** The synthetic dataset reveals that job satisfaction is closely associated with attrition. The work-life balance score may also be worth focusing on: burnout plays a major role in an employee's motivation to continue working in an environment, and identifying and addressing it helps improve retention. Here are some measures to improve retention:
1. Rewards and recognition
2. Promotions
3. Adaptive working conditions, such as remote and hybrid work

---

# Hypothesis 5.1: Job Satisfaction and Attrition

Null hypothesis: Job satisfaction levels and attrition rate are independent of each other and do not share a statistically significant relationship.

In [None]:
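# Compare JobSatisfaction of employees who left (1) with those who stayed (0)
pg.mwu(df.loc[df["Attrition"] == 1, "JobSatisfaction"], df.loc[df["Attrition"] == 0, "JobSatisfaction"], alternative='two-sided')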

**Key observations:** A reported p-value of effectively 0 provides enough evidence to reject the null hypothesis. In the synthetic dataset, job satisfaction level and attrition share a statistically significant relationship.

In [None]:
# Box plot: Job satisfaction by Attrition
fig = px.box(df, x="Attrition", y="JobSatisfaction", points="all",
             title="Job Satisfaction by Attrition Status",
             labels={"Attrition": "Attrition (0=No, 1=Yes)", "JobSatisfaction": "Job Satisfaction"},
             color="Attrition")
fig.show()

**Business impact:** Job satisfaction level impacts attrition in the synthetic dataset. In the real world, organizations can help improve job satisfaction by adopting the following measures:
1. Clarity and transparency: Explaining the motivation and impact provides a clear vision on how a person's work is impacting the organization. This helps build a sense of achievement at the workplace.
2. Respect: Hear out employee opinions to ensure they feel heard and included. Apart from making them feel respected, this helps them feel at one with the organization.
3. Culture: Build a culture of mutual respect and integrity. For instance, reflect core values in everyday work. This will help build a working culture that's not only motivated but continues motivating others.

---

# Consolidated research results

|Index|Hypothesis|Method|p-value|Interpretation|
|-----|----------------|-------------------------|-------------|-------|
|1|Gender and Attrition|Chi-squared|0.29|Fail to reject null|
|2|Age and Attrition|Chi-squared|<0.0001|Reject null|
|3| | | | |
|4|Monthly income and Attrition|MWU|0.0|Reject null|
|5|Total satisfaction level and Attrition|MWU|0.0|Reject null|
|5.1|JobSatisfaction and Attrition|MWU|0.0|Reject null|

---

# Summary