## Importing libraries

In [39]:
# Data processing  
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np
import pickle

# Pandas options  
# -----------------------------------------------------------------------
pd.options.display.max_colwidth = None

# Path configuration for custom module imports
# -----------------------------------------------------------------------
import sys
sys.path.append('../')  # Adds the parent directory to the path for custom module imports

# Ignore warnings  
# -----------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")

# Custom functions and classes
# -----------------------------------------------------------------------
from src.support_encoding import Encoding, chi2_test
from src.support_scaling import scale_df
from src.support_eda import value_counts

## Data loading

In [40]:
df = pd.read_csv('../data/output/complete_data_imputed.csv', index_col=0).reset_index(drop=True)

In [41]:
df.head()

| | Age | Attrition | BusinessTravel | Department | DistanceFromHome | Education | EducationField | Gender | JobLevel | JobRole | ... | PercentSalaryHike | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | YearsAtCompany | YearsSinceLastPromotion | EnvironmentSatisfaction | JobSatisfaction | WorkLifeBalance | JobInvolvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 51 | No | Travel_Rarely | Sales | 6 | 2 | Life Sciences | Female | 1 | Healthcare Representative | ... | 11 | 0 | 1.0 | 6 | 1 | 0 | 3.0 | 4.0 | 2.0 | 3 |
| 1 | 31 | Yes | Travel_Frequently | Research & Development | 10 | 1 | Life Sciences | Female | 1 | Research Scientist | ... | 23 | 1 | 6.0 | 3 | 5 | 1 | 3.0 | 2.0 | 4.0 | 2 |
| 2 | 32 | No | Travel_Frequently | Research & Development | 17 | 4 | Other | Male | 4 | Sales Executive | ... | 15 | 3 | 5.0 | 2 | 5 | 0 | 2.0 | 2.0 | 1.0 | 3 |
| 3 | 38 | No | Non-Travel | Research & Development | 2 | 5 | Life Sciences | Male | 3 | Human Resources | ... | 11 | 3 | 13.0 | 5 | 8 | 7 | 4.0 | 4.0 | 3.0 | 2 |
| 4 | 32 | No | Travel_Rarely | Research & Development | 10 | 1 | Medical | Male | 1 | Sales Executive | ... | 12 | 2 | 9.0 | 2 | 6 | 0 | 4.0 | 1.0 | 3.0 | 3 |


## Chi2 test

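First, we run a chi-squared test of independence between each categorical variable and the target variable, `Attrition`. In the output below, each block shows the observed contingency table and, when the differences are significant, a second table with the frequencies expected under independence. Because `catgs` still contains `Attrition` itself, the first block tests the target against itself and can be ignored.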
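
`chi2_test` comes from the project's own `src.support_encoding` module, whose code is not shown here. As a rough idea of what such a test involves, the sketch below runs `scipy.stats.chi2_contingency` on the `BusinessTravel` counts that appear in the output further down. The 0.05 threshold and the variable names are assumptions, and the custom function may well be implemented differently.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Observed counts for BusinessTravel vs. Attrition, copied from the notebook output
observed = pd.DataFrame(
    {"No": [147, 220, 954], "Yes": [12, 77, 163]},
    index=["Non-Travel", "Travel_Frequently", "Travel_Rarely"],
)

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the frequencies expected if the two variables were independent
statistic, p_value, dof, expected = chi2_contingency(observed)

# Wrap the expected frequencies in a DataFrame and round them for display
expected = pd.DataFrame(
    expected, index=observed.index, columns=observed.columns
).round()

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"Significant differences, p = {p_value:.4f}")
    print(expected)
else:
    print(f"No significant differences, p = {p_value:.4f}")
```

A small p-value only says that attrition rates differ across the categories; it does not say which category drives the difference, which is why comparing the observed and expected tables is useful.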

In [42]:
catgs = df.select_dtypes(include=['O', 'category']).columns

In [43]:
chi2_test(df,  catgs, 'Attrition', show=True)

We are evaluating the variable ATTRITION


| Attrition | Attrition: No | Attrition: Yes |
|---|---|---|
| No | 1321 | 0 |
| Yes | 0 | 252 |


For the category ATTRITION there are significant differences, p = 0.0000


| Attrition | Attrition: No | Attrition: Yes |
|---|---|---|
| No | 1109.0 | 212.0 |
| Yes | 212.0 | 40.0 |


--------------------------
We are evaluating the variable BUSINESSTRAVEL


| BusinessTravel | Attrition: No | Attrition: Yes |
|---|---|---|
| Non-Travel | 147 | 12 |
| Travel_Frequently | 220 | 77 |
| Travel_Rarely | 954 | 163 |


For the category BUSINESSTRAVEL there are significant differences, p = 0.0000


| BusinessTravel | Attrition: No | Attrition: Yes |
|---|---|---|
| Non-Travel | 134.0 | 25.0 |
| Travel_Frequently | 249.0 | 48.0 |
| Travel_Rarely | 938.0 | 179.0 |


--------------------------
We are evaluating the variable DEPARTMENT


| Department | Attrition: No | Attrition: Yes |
|---|---|---|
| Human Resources | 44 | 22 |
| Research & Development | 870 | 160 |
| Sales | 407 | 70 |


For the category DEPARTMENT there are significant differences, p = 0.0004


| Department | Attrition: No | Attrition: Yes |
|---|---|---|
| Human Resources | 55.0 | 11.0 |
| Research & Development | 865.0 | 165.0 |
| Sales | 401.0 | 76.0 |


--------------------------
We are evaluating the variable EDUCATIONFIELD


| EducationField | Attrition: No | Attrition: Yes |
|---|---|---|
| Human Resources | 16 | 12 |
| Life Sciences | 547 | 108 |
| Marketing | 141 | 26 |
| Medical | 408 | 81 |
| Other | 80 | 10 |
| Technical Degree | 129 | 15 |


For the category EDUCATIONFIELD there are significant differences, p = 0.0011


| EducationField | Attrition: No | Attrition: Yes |
|---|---|---|
| Human Resources | 24.0 | 4.0 |
| Life Sciences | 550.0 | 105.0 |
| Marketing | 140.0 | 27.0 |
| Medical | 411.0 | 78.0 |
| Other | 76.0 | 14.0 |
| Technical Degree | 121.0 | 23.0 |


--------------------------
We are evaluating the variable GENDER


| Gender | Attrition: No | Attrition: Yes |
|---|---|---|
| Female | 525 | 95 |
| Male | 796 | 157 |


For the category GENDER there are NO significant differences, p = 0.5904

--------------------------
We are evaluating the variable JOBROLE


| JobRole | Attrition: No | Attrition: Yes |
|---|---|---|
| Healthcare Representative | 124 | 21 |
| Human Resources | 47 | 7 |
| Laboratory Technician | 232 | 46 |
| Manager | 91 | 16 |
| Manufacturing Director | 141 | 16 |
| Research Director | 63 | 22 |
| Research Scientist | 254 | 54 |
| Sales Executive | 290 | 58 |
| Sales Representative | 79 | 12 |


For the category JOBROLE there are NO significant differences, p = 0.1484

--------------------------
We are evaluating the variable MARITALSTATUS


| MaritalStatus | Attrition: No | Attrition: Yes |
|---|---|---|
| Divorced | 320 | 37 |
| Married | 635 | 85 |
| Single | 366 | 130 |


For the category MARITALSTATUS there are significant differences, p = 0.0000


| MaritalStatus | Attrition: No | Attrition: Yes |
|---|---|---|
| Divorced | 300.0 | 57.0 |
| Married | 605.0 | 115.0 |
| Single | 417.0 | 79.0 |


--------------------------


## Encoding

First, we convert the target variable, `Attrition`, to numeric as follows:

- `No`: `0`

- `Yes`: `1`

In [44]:
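# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated and may silently leave `df` unchanged in newer pandas versions
df['Attrition'] = df['Attrition'].replace({'Yes': 1, 'No': 0}).astype(int)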

In [45]:
df.select_dtypes(include=['O', 'category']).head()

| | BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus |
|---|---|---|---|---|---|---|
| 0 | Travel_Rarely | Sales | Life Sciences | Female | Healthcare Representative | Married |
| 1 | Travel_Frequently | Research & Development | Life Sciences | Female | Research Scientist | Single |
| 2 | Travel_Frequently | Research & Development | Other | Male | Sales Executive | Married |
| 3 | Non-Travel | Research & Development | Life Sciences | Male | Human Resources | Married |
| 4 | Travel_Rarely | Research & Development | Medical | Male | Sales Executive | Single |


The chi-squared test found significant differences for every variable except `Gender` and `JobRole`. We therefore use target encoding for the significant variables and one-hot encoding for these two.
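
The `Encoding` class is a custom helper whose internals are not shown here. To illustrate the two ideas, the sketch below applies a plain mean-based target encoding and a one-hot encoding with pandas on a small made-up frame. The toy data and variable names are hypothetical, and the real class may add smoothing or rely on a library encoder instead.

```python
import pandas as pd

# Hypothetical toy data; Attrition is already numeric (1 = Yes, 0 = No)
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Divorced", "Single"],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "Attrition": [1, 0, 0, 0, 0, 1],
})

# Target encoding: replace each category with the mean of the target in that category
category_means = toy.groupby("MaritalStatus")["Attrition"].mean()
toy["MaritalStatus_target"] = toy["MaritalStatus"].map(category_means)

# One-hot encoding: one 0/1 indicator column per category
gender_dummies = pd.get_dummies(toy["Gender"], prefix="Gender", dtype=int)

# Drop the original text columns and attach the indicator columns
encoded = pd.concat(
    [toy.drop(columns=["MaritalStatus", "Gender"]), gender_dummies], axis=1
)
print(encoded)
```

Computing the category means on the full dataset, as done here for brevity, leaks the target into the features. When a model is trained, the means should be learned on the training split only.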

In [46]:
encoding_methods = {"onehot": ['Gender', 'JobRole'],
                    "target": ['BusinessTravel', 'Department', 'EducationField', 'MaritalStatus'],
                    "ordinal" : {},
                    "frequency": []
                    }

encoder = Encoding(df, encoding_methods, 'Attrition')

In [47]:
df_encoded = encoder.execute_all_encodings()

## Scaling

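Both `OneHotEncoder` and `TargetEncoder` (fitted against the binary variable `Attrition`) produce values between 0 and 1, so a `MinMaxScaler` is a natural first choice: it brings the remaining numeric columns onto the same range. The data also did not seem to contain many outliers, to which min-max scaling is sensitive.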
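
Min-max scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. `scale_df` is a custom helper that is not shown here, so the sketch below applies the formula directly with pandas. The function `minmax_scale` and the toy frame are hypothetical and only illustrate the transformation.

```python
import pandas as pd

def minmax_scale(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with the given columns rescaled to [0, 1] (hypothetical helper)."""
    scaled = df.copy()
    for col in columns:
        col_min, col_max = df[col].min(), df[col].max()
        col_range = col_max - col_min
        # A constant column has zero range; map it to 0.0 instead of dividing by zero
        scaled[col] = 0.0 if col_range == 0 else (df[col] - col_min) / col_range
    return scaled

# Made-up values: two numeric columns plus an already binary one-hot column
toy = pd.DataFrame({
    "Age": [51, 31, 32, 38],
    "DistanceFromHome": [6, 10, 17, 2],
    "Gender_Female": [1, 1, 0, 0],
})

toy_scaled = minmax_scale(toy, ["Age", "DistanceFromHome"])
print(toy_scaled)
```

A column that already holds only 0 and 1, such as a one-hot indicator or the binary target, is left unchanged by this transformation.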

In [48]:
df_scaled, scaler = scale_df(df_encoded, df_encoded.columns.to_list(), method="minmax")

## Imbalanced data

In [49]:
value_counts(df, 'Attrition')

The number of unique values for this category is 2


| Attrition | count | proportion |
|---|---|---|
| 0 | 1321 | 0.84 |
| 1 | 252 | 0.16 |


Our dataset is clearly imbalanced (84% of the rows belong to class `0` and only 16% to class `1`), so for future models we could consider rebalancing the classes through resampling.
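
One simple form of resampling is random oversampling, which duplicates minority-class rows until the classes are the same size. The sketch below is a minimal pandas version on made-up data. The function `random_oversample` is hypothetical, and this is only one of several possible strategies, not necessarily the one later models will use.

```python
import pandas as pd

def random_oversample(df: pd.DataFrame, target: str, random_state: int = 42) -> pd.DataFrame:
    """Duplicate random rows of the smaller classes until all classes match the largest one."""
    counts = df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, count in counts.items():
        group = df[df[target] == label]
        if count < majority_size:
            # Sample with replacement to fill the gap to the majority class size
            extra = group.sample(
                n=majority_size - count, replace=True, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)
    # Shuffle so that the classes are not stored in contiguous blocks
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

# Made-up data: 8 rows of class 0 and 2 rows of class 1
toy = pd.DataFrame({"Age": list(range(25, 35)), "Attrition": [0] * 8 + [1] * 2})
balanced = random_oversample(toy, "Attrition")
print(balanced["Attrition"].value_counts())
```

Resampling should be applied to the training split only, so that the evaluation data keeps the original class proportions.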

## Save data

In [50]:
df_scaled.to_csv('../data/output/complete_data_preprocessed.csv')

---

## Alternative preprocessing

As an alternative, we preprocess the data using target encoding for every categorical variable, followed again by min-max scaling.

In [57]:
df_2 = df.copy()
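# `Attrition` was already converted to 0/1 in `df` above, so this is a no-op safeguard;
# the result is assigned back instead of using inplace=True on a column selection
df_2['Attrition'] = df_2['Attrition'].replace({'Yes': 1, 'No': 0}).astype(int)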

### Encoding

In [58]:
encoding_methods_2 = {"onehot": [],
                    "target": ['BusinessTravel', 'Department', 'EducationField', 'MaritalStatus', 'Gender', 'JobRole'],
                    "ordinal" : {},
                    "frequency": []
                    }

encoder_2 = Encoding(df_2, encoding_methods_2, 'Attrition')
df_encoded_2 = encoder_2.execute_all_encodings()

### Scaling

In [59]:
df_scaled_2, scaler_2 = scale_df(df_encoded_2, df_encoded_2.columns.to_list(), method="minmax")

### Save data

In [60]:
df_scaled_2.to_csv('../data/output/complete_data_preprocessed_2.csv')