# *📊 Employee Attrition - Data Encoding*

---

## *Author*  
**Kfir Tayar** 

## *Notebook Overview*  
- Perform One-Hot encoding for some of the features 
- Convert features to bool dtype 
- Encode the target feature (Attrition) 
- Perform Label Encoding on the rest of the features   
- Save the encoded file as a Pickle file

In [1]:
# Import Libraries & Modules
import sys
import os
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Add the path to the utils directory
sys.path.append(os.path.abspath('../utils'))

from data_prep_utils import display_category_summary, save_file_as_pickle

### Load Data Set

In [6]:
extended_df = pd.read_pickle("../Data/extended_employee_data_20250325.pkl")

In [8]:
extended_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74498 entries, 0 to 74497
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   Gender                    74498 non-null  category
 1   Years at Company          74498 non-null  int64   
 2   Job Role                  74498 non-null  category
 3   Monthly Income            74498 non-null  int64   
 4   Work-Life Balance         74498 non-null  category
 5   Job Satisfaction          74498 non-null  category
 6   Performance Rating        74498 non-null  category
 7   Number of Promotions      74498 non-null  category
 8   Overtime                  74498 non-null  category
 9   Distance from Home        74498 non-null  float64 
 10  Education Level           74498 non-null  category
 11  Marital Status            74498 non-null  category
 12  Number of Dependents      74498 non-null  category
 13  Job Level                 74498 non-null  cate

### One-Hot Encoding
Perform One-Hot Encoding on features with no ordinal relationship.

In [11]:
cat_df = display_category_summary(extended_df)
cat_df.sort_values('Unique Values')

Unnamed: 0_level_0,Unique Values,Categories
Feature,Unnamed: 1_level_1,Unnamed: 2_level_1
Gender,2,"[Male, Female]"
Attrition,2,"[Stayed, Left]"
Overtime,2,"[No, Yes]"
Innovation Opportunities,2,"[No, Yes]"
Remote Work,2,"[No, Yes]"
Leadership Opportunities,2,"[No, Yes]"
Marital Status,3,"[Married, Divorced, Single]"
Job Level,3,"[Mid, Senior, Entry]"
Company Size,3,"[Medium, Small, Large]"
Work-Life Balance,4,"[Excellent, Poor, Good, Fair]"


Create dummies for spcific features: 'Marital Status', 'Job Role'

In [14]:
extended_df = pd.get_dummies(extended_df, columns=['Marital Status', 'Job Role'])

### Turn some featues to bool type

In [17]:
bool_features = ['Overtime', 'Remote Work', 'Leadership Opportunities', 'Innovation Opportunities']

for col in bool_features:
    extended_df[col] = extended_df[col].astype(bool)

In [19]:
extended_df.select_dtypes(include=['bool']).head()

Unnamed: 0,Overtime,Remote Work,Leadership Opportunities,Innovation Opportunities,At Least Decade,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Job Role_Education,Job Role_Finance,Job Role_Healthcare,Job Role_Media,Job Role_Technology
0,True,True,True,True,True,False,True,False,True,False,False,False,False
1,True,True,True,True,False,True,False,False,False,False,False,True,False
2,True,True,True,True,True,False,True,False,False,False,True,False,False
3,True,True,True,True,False,False,False,True,True,False,False,False,False
4,True,True,True,True,True,True,False,False,True,False,False,False,False


### Encode target feature

In [22]:
extended_df['Attrition'] = extended_df['Attrition'].map({'Stayed': 1, 'Left': 0})

### Using LabelEncoder for the rest

In [25]:
cat_cols = extended_df.select_dtypes(include='category').columns

In [27]:
from sklearn.preprocessing import LabelEncoder

# Dictionary to store mappings
label_mappings = {}

# Encoding categorical columns and storing mappings
for col in cat_cols:
    label_encoder = LabelEncoder()
    extended_df[col] = label_encoder.fit_transform(extended_df[col])
    label_mappings[col] = (list(label_encoder.classes_), list(label_encoder.transform(label_encoder.classes_)))

# Convert mappings into a DataFrame for better visualization
mapping_df = pd.DataFrame({
    "Feature": label_mappings.keys(),
    "Categories": [v[0] for v in label_mappings.values()],
    "Encoded Values": [v[1] for v in label_mappings.values()]
})

# Display the mapping table
mapping_df

Unnamed: 0,Feature,Categories,Encoded Values
0,Gender,"[Female, Male]","[0, 1]"
1,Work-Life Balance,"[Excellent, Fair, Good, Poor]","[0, 1, 2, 3]"
2,Job Satisfaction,"[High, Low, Medium, Very High]","[0, 1, 2, 3]"
3,Performance Rating,"[Average, Below Average, High, Low]","[0, 1, 2, 3]"
4,Number of Promotions,"[0, 1, 2, 3, 4]","[0, 1, 2, 3, 4]"
5,Education Level,"[Associate Degree, Bachelors Degree, High Scho...","[0, 1, 2, 3, 4]"
6,Number of Dependents,"[0, 1, 2, 3, 4, 5, 6]","[0, 1, 2, 3, 4, 5, 6]"
7,Job Level,"[Entry, Mid, Senior]","[0, 1, 2]"
8,Company Size,"[Large, Medium, Small]","[0, 1, 2]"
9,Company Reputation,"[Excellent, Fair, Good, Poor]","[0, 1, 2, 3]"


In [29]:
for col in cat_cols:
    label_encoder = LabelEncoder()
    extended_df[col] = label_encoder.fit_transform(extended_df[col])

### Presernt the data freame after complete encoding proccess

In [32]:
encoded_df = extended_df.copy()

In [34]:
encoded_df.head()

Unnamed: 0,Gender,Years at Company,Monthly Income,Work-Life Balance,Job Satisfaction,Performance Rating,Number of Promotions,Overtime,Distance from Home,Education Level,...,avg time for promotion,Has Dependents,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Job Role_Education,Job Role_Finance,Job Role_Healthcare,Job Role_Media,Job Role_Technology
0,1,19,5390,0,2,0,2,True,35.405568,0,...,9.5,0,False,True,False,True,False,False,False,False
1,0,4,5534,3,0,3,3,True,33.796224,3,...,1.333333,1,True,False,False,False,False,False,True,False
2,0,10,8159,2,0,3,0,True,17.702784,1,...,0.0,1,False,True,False,False,False,True,False,False
3,0,7,3989,2,0,2,1,True,43.452288,2,...,7.0,1,False,False,True,True,False,False,False,False
4,1,41,4821,1,3,0,0,True,114.263424,2,...,0.0,0,True,False,False,True,False,False,False,False


In [36]:
folder = "data"
file_name = "encoded_employee_data"

save_file_as_pickle(encoded_df, folder, file_name)

File saved as: ../data/encoded_employee_data_20250325.pkl
