# 📘 Employee Attrition - Data Encoding

---

## 📄 Description  
This notebook performs a **Data Encoding** which identifies and handles outliers in the dataset.  

## 👨‍💻 Author  
**Kfir Tayar** 

© Copyright 2025, Kfir Tayar. All rights reserved.  

## 🔹 Notebook Overview  
- Check for missing values  
- Plot Boxplot to detect outliers  
- Replace outliers values with NaN  
- Plot Heatmap of the outliers  
- Replace NaN's using MICE Imputer  
- Save the imputed file as a Pickle file

In [87]:
# Import Libraries & Modules
import sys
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Add the path to the utils directory
sys.path.append(os.path.abspath('../utils'))

from data_prep_utils import display_category_summary

### Load Data Set

In [90]:
imputed_df = pd.read_pickle("../Data/imputed_employee_data_20250313.pkl")

In [92]:
imputed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74498 entries, 0 to 74497
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Gender                    74498 non-null  category
 1   Years at Company          74498 non-null  float64 
 2   Job Role                  74498 non-null  category
 3   Monthly Income            74498 non-null  float64 
 4   Work-Life Balance         74498 non-null  category
 5   Job Satisfaction          74498 non-null  category
 6   Performance Rating        74498 non-null  category
 7   Number of Promotions      74498 non-null  category
 8   Overtime                  74498 non-null  category
 9   Distance from Home        74498 non-null  float64 
 10  Education Level           74498 non-null  category
 11  Marital Status            74498 non-null  category
 12  Number of Dependents      74498 non-null  category
 13  Job Level                 74498 non-null  cate

### One-Hot Encoding
Perform One-Hot Encoding for a features that include low number of categories.

In [95]:
cat_df = display_category_summary(imputed_df)
cat_df.sort_values('Unique Values')

Unnamed: 0_level_0,Unique Values,Categories
Feature,Unnamed: 1_level_1,Unnamed: 2_level_1
Gender,2,"[Male, Female]"
Attrition,2,"[Stayed, Left]"
Overtime,2,"[No, Yes]"
Innovation Opportunities,2,"[No, Yes]"
Remote Work,2,"[No, Yes]"
Leadership Opportunities,2,"[No, Yes]"
Marital Status,3,"[Married, Divorced, Single]"
Job Level,3,"[Mid, Senior, Entry]"
Company Size,3,"[Medium, Small, Large]"
Work-Life Balance,4,"[Excellent, Poor, Good, Fair]"


Create dummies for spcific features: 'Gender', 'Marital Status', 'Job Level', 'Company Size'

In [98]:
imputed_df = pd.get_dummies(imputed_df, columns=['Gender', 'Marital Status', 'Job Level', 'Company Size'])

### Turn some featues to bool type

In [101]:
bool_features = ['Overtime', 'Remote Work', 'Leadership Opportunities', 'Innovation Opportunities']

for col in bool_features:
    imputed_df[col] = imputed_df[col].astype(bool)

In [103]:
imputed_df.select_dtypes(include=['bool']).head()

Unnamed: 0,Overtime,Remote Work,Leadership Opportunities,Innovation Opportunities,Gender_Female,Gender_Male,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Job Level_Entry,Job Level_Mid,Job Level_Senior,Company Size_Large,Company Size_Medium,Company Size_Small
0,True,True,True,True,False,True,False,True,False,False,True,False,False,True,False
1,True,True,True,True,True,False,True,False,False,False,True,False,False,True,False
2,True,True,True,True,True,False,False,True,False,False,True,False,False,True,False
3,True,True,True,True,True,False,False,False,True,False,True,False,False,False,True
4,True,True,True,True,False,True,True,False,False,False,False,True,False,True,False


### Encode target feature

In [106]:
imputed_df['Attrition'] = imputed_df['Attrition'].map({'stay': 1, 'Leave': 0})

### Using LabelEncoder for the rest

In [109]:
from sklearn.preprocessing import LabelEncoder

# Dictionary to store mappings
label_mappings = {}

# Encoding categorical columns and storing mappings
for col in cat_cols:
    label_encoder = LabelEncoder()
    imputed_df[col] = label_encoder.fit_transform(imputed_df[col])
    label_mappings[col] = (list(label_encoder.classes_), list(label_encoder.transform(label_encoder.classes_)))

# Convert mappings into a DataFrame for better visualization
mapping_df = pd.DataFrame({
    "Feature": label_mappings.keys(),
    "Categories": [v[0] for v in label_mappings.values()],
    "Encoded Values": [v[1] for v in label_mappings.values()]
})

# Display the mapping table
mapping_df


Unnamed: 0,Feature,Categories,Encoded Values
0,Job Role,"[Education, Finance, Healthcare, Media, Techno...","[0, 1, 2, 3, 4]"
1,Work-Life Balance,"[Excellent, Fair, Good, Poor]","[0, 1, 2, 3]"
2,Job Satisfaction,"[High, Low, Medium, Very High]","[0, 1, 2, 3]"
3,Performance Rating,"[Average, Below Average, High, Low]","[0, 1, 2, 3]"
4,Number of Promotions,"[0, 1, 2, 3, 4]","[0, 1, 2, 3, 4]"
5,Education Level,"[Associate Degree, Bachelors Degree, High Scho...","[0, 1, 2, 3, 4]"
6,Number of Dependents,"[0, 1, 2, 3, 4, 5, 6]","[0, 1, 2, 3, 4, 5, 6]"
7,Company Reputation,"[Excellent, Fair, Good, Poor]","[0, 1, 2, 3]"
8,Employee Recognition,"[High, Low, Medium, Very High]","[0, 1, 2, 3]"
9,Age Group,"[18-23, 23-30, 30-40, 40-50, >50]","[0, 1, 2, 3, 4]"


In [48]:
cat_cols = imputed_df.select_dtypes(include='category').columns

In [54]:
for col in cat_cols:
    label_encoder = LabelEncoder()
    imputed_df[col] = label_encoder.fit_transform(imputed_df[col])

### Presernt the data freame after complete encoding proccess

In [61]:
encoded_df = imputed_df.copy()

In [63]:
encoded_df.head()

Unnamed: 0,Years at Company,Job Role,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,Overtime,Distance from Home,Education Level,...,Gender_Male,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Job Level_Entry,Job Level_Mid,Job Level_Senior,Company Size_Large,Company Size_Medium,Company Size_Small
0,19.0,0,5390.0,0,2,0,2,True,22.0,0,...,True,False,True,False,False,True,False,False,True,False
1,4.0,3,5534.0,3,0,3,3,True,21.0,3,...,False,True,False,False,False,True,False,False,True,False
2,10.0,2,8159.0,2,0,3,0,True,11.0,1,...,False,False,True,False,False,True,False,False,True,False
3,7.0,0,3989.0,2,0,2,1,True,27.0,2,...,False,False,False,True,False,True,False,False,False,True
4,41.0,0,4821.0,1,3,0,0,True,71.0,2,...,True,True,False,False,False,False,True,False,True,False
