### 0. Imports

In [203]:
%load_ext autoreload
%autoreload 2

# Data handling
# -----------------------------------------------------------------------
import pandas as pd
import numpy as np

# Visualizations
# -----------------------------------------------------------------------
import seaborn as sns
import matplotlib.pyplot as plt

# Track loop progress
# -----------------------------------------------------------------------
from tqdm import tqdm

# Manage warnings
# -----------------------------------------------------------------------
import warnings

# Modify the path
# -----------------------------------------------------------------------
import sys
sys.path.append("..")

# Import support functions
# -----------------------------------------------------------------------
import src.soporte_eda as se
import src.soporte_preprocesamiento as sp

# Evaluate literal objects
# -----------------------------------------------------------------------
from ast import literal_eval

# statistics functions
# -----------------------------------------------------------------------
from scipy.stats import pearsonr, spearmanr, pointbiserialr


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# 1. Introduction - Employee Retention

# 2. Preliminary analysis and data cleaning

## 2.1 Import data

We import the main dataframe first.

In [204]:
general_data_df = pd.read_csv("../data/general_data.csv")
general_data_df.columns = [col.lower() for col in general_data_df.columns]
general_data_df.head(2)

Unnamed: 0,age,attrition,businesstravel,department,distancefromhome,education,educationfield,employeecount,employeeid,gender,...,numcompaniesworked,over18,percentsalaryhike,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,yearsatcompany,yearssincelastpromotion,yearswithcurrmanager
0,51,No,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Female,...,1.0,Y,11,8,0,1.0,6,1,0,0
1,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,1,2,Female,...,0.0,Y,23,8,1,6.0,3,5,1,4


Then, the secondary ones. 

The employee survey:

In [205]:
employee_survey_df = pd.read_csv("../data/employee_survey_data.csv")
employee_survey_df.columns = [col.lower() for col in employee_survey_df.columns]
employee_survey_df.head(2)

Unnamed: 0,employeeid,environmentsatisfaction,jobsatisfaction,worklifebalance
0,1,3.0,4.0,2.0
1,2,3.0,2.0,4.0


And the manager survey.

In [206]:
manager_survey_df = pd.read_csv("../data/manager_survey_data.csv")
manager_survey_df.columns = [col.lower() for col in manager_survey_df.columns]
manager_survey_df.head(2)

Unnamed: 0,employeeid,jobinvolvement,performancerating
0,1,3,3
1,2,2,4


## 2.2 Join dataframes

Merge the three datasets into one, joining on the shared `employeeid` column.

In [207]:
employee_attrition = general_data_df.merge(employee_survey_df, how="inner").merge(manager_survey_df, how="inner")
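The merge above relies on pandas picking the common column as the join key. A minimal sketch of a more explicit variant follows, on toy stand-in frames (the values are illustrative, not taken from the real files). It names the key and uses the `validate` argument of `DataFrame.merge` so the join raises if an id is repeated in either table, then compares row counts because an inner join drops unmatched ids silently.

```python
import pandas as pd

# Toy stand-ins for the three tables (illustrative values only).
general = pd.DataFrame({"employeeid": [1, 2, 3], "age": [51, 31, 32]})
employee_survey = pd.DataFrame(
    {"employeeid": [1, 2, 3], "jobsatisfaction": [4.0, 2.0, 2.0]}
)
manager_survey = pd.DataFrame(
    {"employeeid": [1, 2, 3], "performancerating": [3, 4, 3]}
)

# Name the key explicitly and let pandas raise MergeError if either side
# has repeated ids (validate="one_to_one").
merged = (
    general
    .merge(employee_survey, on="employeeid", how="inner", validate="one_to_one")
    .merge(manager_survey, on="employeeid", how="inner", validate="one_to_one")
)

# An inner join drops unmatched ids without warning, so compare row counts.
rows_lost = len(general) - len(merged)
print(merged.shape, rows_lost)
```

If `rows_lost` were greater than zero, switching to `how="left"` with `indicator=True` would be one way to see which ids failed to match.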

## 2.3 Explore dataframe

Let's perform a preliminary exploration to know what cleaning might be necessary.

In [208]:
se.exploracion_dataframe(employee_attrition)

El número de datos es 4410 y el de columnas es 29

 ..................... 

Las primeras filas del dataframe son:


Unnamed: 0,age,attrition,businesstravel,department,distancefromhome,education,educationfield,employeecount,employeeid,gender,...,totalworkingyears,trainingtimeslastyear,yearsatcompany,yearssincelastpromotion,yearswithcurrmanager,environmentsatisfaction,jobsatisfaction,worklifebalance,jobinvolvement,performancerating
0,51,No,Travel_Rarely,Sales,6,2,Life Sciences,1,1,Female,...,1.0,6,1,0,0,3.0,4.0,2.0,3,3
1,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,1,2,Female,...,6.0,3,5,1,4,3.0,2.0,4.0,2,4



 ..................... 

Los tipos de las columnas y sus valores únicos son:


Unnamed: 0,tipo_dato,conteo
age,int64,43
attrition,object,2
businesstravel,object,3
department,object,3
distancefromhome,int64,29
education,int64,5
educationfield,object,6
employeecount,int64,1
employeeid,int64,4410
gender,object,2



 ..................... 

Los duplicados que tenemos en el conjunto de datos son: 0

 ..................... 

Los nulos que tenemos en el conjunto de datos son:


Unnamed: 0,%_nulos
numcompaniesworked,0.430839
totalworkingyears,0.204082
environmentsatisfaction,0.566893
jobsatisfaction,0.453515
worklifebalance,0.861678



 ..................... 

Comprobamos que no haya valores con una sola variable:
● La variable employeecount tiene 1 solo valor único. Se elimina.
● La variable over18 tiene 1 solo valor único. Se elimina.
● La variable standardhours tiene 1 solo valor único. Se elimina.

 ..................... 

Comprobamos una representación mínima para valores numéricos:
● La variable education tiene 5 < 15 valores únicos. Se convierte a objeto.
● La variable joblevel tiene 5 < 15 valores únicos. Se convierte a objeto.
● La variable numcompaniesworked tiene 10 < 15 valores únicos. Se convierte a objeto.
● La variable percentsalaryhike tiene 15 < 15 valores únicos. Se convierte a objeto.
● La variable stockoptionlevel tiene 4 < 15 valores únicos. Se convierte a objeto.
● La variable trainingtimeslastyear tiene 7 < 15 valores únicos. Se convierte a objeto.
● La variable environmentsatisfaction tiene 4 < 15 valores únicos. Se convierte a objeto.
● La variable jobsatisfaction tiene 4 < 15 valores únicos

Unnamed: 0_level_0,count,pct
attrition,Unnamed: 1_level_1,Unnamed: 2_level_1
No,3699,83.9
Yes,711,16.1


La columna BUSINESSTRAVEL tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
businesstravel,Unnamed: 1_level_1,Unnamed: 2_level_1
Travel_Rarely,3129,71.0
Travel_Frequently,831,18.8
Non-Travel,450,10.2


La columna DEPARTMENT tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
department,Unnamed: 1_level_1,Unnamed: 2_level_1
Research & Development,2883,65.4
Sales,1338,30.3
Human Resources,189,4.3


La columna EDUCATION tiene 5 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
education,Unnamed: 1_level_1,Unnamed: 2_level_1
3,1716,38.9
4,1194,27.1
2,846,19.2
1,510,11.6
5,144,3.3


La columna EDUCATIONFIELD tiene 6 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
educationfield,Unnamed: 1_level_1,Unnamed: 2_level_1
Life Sciences,1818,41.2
Medical,1392,31.6
Marketing,477,10.8
Technical Degree,396,9.0
Other,246,5.6


La columna GENDER tiene 2 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,2646,60.0
Female,1764,40.0


La columna JOBLEVEL tiene 5 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
joblevel,Unnamed: 1_level_1,Unnamed: 2_level_1
1,1629,36.9
2,1602,36.3
3,654,14.8
4,318,7.2
5,207,4.7


La columna JOBROLE tiene 9 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobrole,Unnamed: 1_level_1,Unnamed: 2_level_1
Sales Executive,978,22.2
Research Scientist,876,19.9
Laboratory Technician,777,17.6
Manufacturing Director,435,9.9
Healthcare Representative,393,8.9


La columna MARITALSTATUS tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
maritalstatus,Unnamed: 1_level_1,Unnamed: 2_level_1
Married,2019,45.8
Single,1410,32.0
Divorced,981,22.2


La columna NUMCOMPANIESWORKED tiene 10 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
numcompaniesworked,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,1558,35.3
0.0,586,13.3
3.0,474,10.7
2.0,438,9.9
4.0,415,9.4


La columna PERCENTSALARYHIKE tiene 15 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
percentsalaryhike,Unnamed: 1_level_1,Unnamed: 2_level_1
11,630,14.3
13,627,14.2
14,603,13.7
12,594,13.5
15,303,6.9


La columna STOCKOPTIONLEVEL tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
stockoptionlevel,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1893,42.9
1,1788,40.5
2,474,10.7
3,255,5.8


La columna TRAININGTIMESLASTYEAR tiene 7 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
trainingtimeslastyear,Unnamed: 1_level_1,Unnamed: 2_level_1
2,1641,37.2
3,1473,33.4
4,369,8.4
5,357,8.1
1,213,4.8


La columna ENVIRONMENTSATISFACTION tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
environmentsatisfaction,Unnamed: 1_level_1,Unnamed: 2_level_1
3.0,1350,30.6
4.0,1334,30.2
2.0,856,19.4
1.0,845,19.2


La columna JOBSATISFACTION tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobsatisfaction,Unnamed: 1_level_1,Unnamed: 2_level_1
4.0,1367,31.0
3.0,1323,30.0
1.0,860,19.5
2.0,840,19.0


La columna WORKLIFEBALANCE tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
worklifebalance,Unnamed: 1_level_1,Unnamed: 2_level_1
3.0,2660,60.3
2.0,1019,23.1
4.0,454,10.3
1.0,239,5.4


La columna JOBINVOLVEMENT tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobinvolvement,Unnamed: 1_level_1,Unnamed: 2_level_1
3,2604,59.0
2,1125,25.5
4,432,9.8
1,249,5.6


La columna PERFORMANCERATING tiene 2 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
performancerating,Unnamed: 1_level_1,Unnamed: 2_level_1
3,3732,84.6
4,678,15.4





**Duplicates**
- There are **0** row level duplicates in the dataset.

Check whether any `employeeid` appears more than once:


In [209]:
general_data_df.duplicated("employeeid").sum()

0

Check whether there are duplicates across all columns other than the id:

In [210]:
duplicated = general_data_df.drop(columns="employeeid").duplicated().sum()

duplicated_pct = duplicated / general_data_df.shape[0]

print(f"There are {duplicated} duplicated values, which represent {duplicated_pct:.2f} in proportion")

There are 2912 duplicated values, which represent 0.66 in proportion


According to the check above, a surprisingly large share of records is duplicated. It is not plausible that 66% of employees are exact copies of other employees.

Let's check their values and ids:

In [211]:
general_data_df.sort_values(by=list(general_data_df.drop(columns=["employeeid"]).columns)).head(6)

Unnamed: 0,age,attrition,businesstravel,department,distancefromhome,education,educationfield,employeecount,employeeid,gender,...,numcompaniesworked,over18,percentsalaryhike,standardhours,stockoptionlevel,totalworkingyears,trainingtimeslastyear,yearsatcompany,yearssincelastpromotion,yearswithcurrmanager
714,18,No,Non-Travel,Research & Development,1,4,Medical,1,715,Male,...,1.0,Y,22,8,1,0.0,2,0,0,0
2184,18,No,Non-Travel,Research & Development,1,4,Medical,1,2185,Male,...,1.0,Y,22,8,1,0.0,2,0,0,0
3654,18,No,Non-Travel,Research & Development,1,4,Medical,1,3655,Male,...,1.0,Y,22,8,1,0.0,2,0,0,0
1053,18,No,Non-Travel,Research & Development,2,3,Life Sciences,1,1054,Male,...,1.0,Y,24,8,2,0.0,4,0,0,0
2523,18,No,Non-Travel,Research & Development,2,3,Life Sciences,1,2524,Male,...,1.0,Y,24,8,2,0.0,4,0,0,0
3993,18,No,Non-Travel,Research & Development,2,3,Life Sciences,1,3994,Male,...,1.0,Y,24,8,2,0.0,4,0,0,0


Regardless of their employeeid, it is simply not plausible that there are three employees of the same age, in the same department, living at the same distance from work, with the same level of education, the same salary hike, and so on. Moreover, this pattern occurs all over the dataset.

The conclusion is clear: these records are duplicates and need to be dropped.
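To see how the duplicated profiles cluster, one option is to flag every member of a duplicated group and count the size of each group. The sketch below uses a small toy frame with hypothetical values; `feature_cols` is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame: ids differ, but three rows share an identical profile.
toy = pd.DataFrame({
    "employeeid": [1, 2, 3, 4, 5],
    "age": [18, 18, 18, 40, 35],
    "department": ["R&D", "R&D", "R&D", "Sales", "Sales"],
})

# Every column except the identifier.
feature_cols = [c for c in toy.columns if c != "employeeid"]

# keep=False flags all members of a duplicated group, not only the repeats.
in_dup_group = toy.duplicated(subset=feature_cols, keep=False)

# Number of rows sharing each distinct profile.
group_sizes = toy.groupby(feature_cols, dropna=False).size()

# Dropping the id and de-duplicating keeps one row per profile.
deduplicated = toy.drop(columns="employeeid").drop_duplicates()

print(in_dup_group.sum(), group_sizes.max(), len(deduplicated))
```

A group-size distribution concentrated on a single value (for example, every profile appearing exactly three times) would suggest the file was replicated rather than that employees coincidentally match.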

Drop the employee_id column and drop duplicates.

In [212]:
employee_attrition.drop(columns="employeeid", inplace=True)
employee_attrition.drop_duplicates(inplace=True)

### 2.3.1 Repeat exploration

In [213]:
se.exploracion_dataframe(employee_attrition)

El número de datos es 1573 y el de columnas es 25

 ..................... 

Las primeras filas del dataframe son:


Unnamed: 0,age,attrition,businesstravel,department,distancefromhome,education,educationfield,gender,joblevel,jobrole,...,totalworkingyears,trainingtimeslastyear,yearsatcompany,yearssincelastpromotion,yearswithcurrmanager,environmentsatisfaction,jobsatisfaction,worklifebalance,jobinvolvement,performancerating
0,51,No,Travel_Rarely,Sales,6,2,Life Sciences,Female,1,Healthcare Representative,...,1.0,6,1,0,0,3.0,4.0,2.0,3,3
1,31,Yes,Travel_Frequently,Research & Development,10,1,Life Sciences,Female,1,Research Scientist,...,6.0,3,5,1,4,3.0,2.0,4.0,2,4



 ..................... 

Los tipos de las columnas y sus valores únicos son:


Unnamed: 0,tipo_dato,conteo
age,int64,43
attrition,object,2
businesstravel,object,3
department,object,3
distancefromhome,int64,29
education,object,5
educationfield,object,6
gender,object,2
joblevel,object,5
jobrole,object,9



 ..................... 

Los duplicados que tenemos en el conjunto de datos son: 0

 ..................... 

Los nulos que tenemos en el conjunto de datos son:


Unnamed: 0,%_nulos
numcompaniesworked,1.207883
totalworkingyears,0.572155
environmentsatisfaction,1.462174
jobsatisfaction,1.14431
worklifebalance,2.225048



 ..................... 

Comprobamos que no haya valores con una sola variable:

 ..................... 

Comprobamos una representación mínima para valores numéricos:

 ..................... 

Los valores que tenemos para las columnas categóricas son: 
La columna ATTRITION tiene 2 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
attrition,Unnamed: 1_level_1,Unnamed: 2_level_1
No,1321,84.0
Yes,252,16.0


La columna BUSINESSTRAVEL tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
businesstravel,Unnamed: 1_level_1,Unnamed: 2_level_1
Travel_Rarely,1117,71.0
Travel_Frequently,297,18.9
Non-Travel,159,10.1


La columna DEPARTMENT tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
department,Unnamed: 1_level_1,Unnamed: 2_level_1
Research & Development,1030,65.5
Sales,477,30.3
Human Resources,66,4.2


La columna EDUCATION tiene 5 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
education,Unnamed: 1_level_1,Unnamed: 2_level_1
3,615,39.1
4,422,26.8
2,303,19.3
1,181,11.5
5,52,3.3


La columna EDUCATIONFIELD tiene 6 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
educationfield,Unnamed: 1_level_1,Unnamed: 2_level_1
Life Sciences,655,41.6
Medical,489,31.1
Marketing,167,10.6
Technical Degree,144,9.2
Other,90,5.7


La columna GENDER tiene 2 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,953,60.6
Female,620,39.4


La columna JOBLEVEL tiene 5 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
joblevel,Unnamed: 1_level_1,Unnamed: 2_level_1
1,587,37.3
2,571,36.3
3,229,14.6
4,111,7.1
5,75,4.8


La columna JOBROLE tiene 9 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobrole,Unnamed: 1_level_1,Unnamed: 2_level_1
Sales Executive,348,22.1
Research Scientist,308,19.6
Laboratory Technician,278,17.7
Manufacturing Director,157,10.0
Healthcare Representative,145,9.2


La columna MARITALSTATUS tiene 3 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
maritalstatus,Unnamed: 1_level_1,Unnamed: 2_level_1
Married,720,45.8
Single,496,31.5
Divorced,357,22.7


La columna NUMCOMPANIESWORKED tiene 10 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
numcompaniesworked,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,542,34.5
0.0,213,13.5
3.0,166,10.6
2.0,156,9.9
4.0,149,9.5


La columna PERCENTSALARYHIKE tiene 15 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
percentsalaryhike,Unnamed: 1_level_1,Unnamed: 2_level_1
11,223,14.2
13,220,14.0
14,220,14.0
12,214,13.6
15,107,6.8


La columna STOCKOPTIONLEVEL tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
stockoptionlevel,Unnamed: 1_level_1,Unnamed: 2_level_1
0,676,43.0
1,641,40.8
2,166,10.6
3,90,5.7


La columna TRAININGTIMESLASTYEAR tiene 7 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
trainingtimeslastyear,Unnamed: 1_level_1,Unnamed: 2_level_1
2,589,37.4
3,522,33.2
4,133,8.5
5,129,8.2
1,75,4.8


La columna ENVIRONMENTSATISFACTION tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
environmentsatisfaction,Unnamed: 1_level_1,Unnamed: 2_level_1
3.0,482,30.6
4.0,465,29.6
2.0,304,19.3
1.0,299,19.0


La columna JOBSATISFACTION tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobsatisfaction,Unnamed: 1_level_1,Unnamed: 2_level_1
4.0,490,31.2
3.0,468,29.8
1.0,301,19.1
2.0,296,18.8


La columna WORKLIFEBALANCE tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
worklifebalance,Unnamed: 1_level_1,Unnamed: 2_level_1
3.0,942,59.9
2.0,357,22.7
4.0,156,9.9
1.0,83,5.3


La columna JOBINVOLVEMENT tiene 4 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
jobinvolvement,Unnamed: 1_level_1,Unnamed: 2_level_1
3,933,59.3
2,395,25.1
4,155,9.9
1,90,5.7


La columna PERFORMANCERATING tiene 2 valores únicos, de los cuales los primeros son:


Unnamed: 0_level_0,count,pct
performancerating,Unnamed: 1_level_1,Unnamed: 2_level_1
3,1333,84.7
4,240,15.3


**Missing values**

Some columns have missing values, all at low rates:

- numcompaniesworked: 1.21%
- totalworkingyears: 0.57%
- environmentsatisfaction: 1.46%
- jobsatisfaction: 1.14%
- worklifebalance: 2.23%

These will have to be imputed, although a simple imputer should suffice: at such low percentages, the choice of imputation method is unlikely to have much impact.
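As a rough illustration of what a simple imputation could look like, the sketch below fills a continuous column with its median and a rating column with its most frequent value, using only pandas. The toy values and the choice of median versus mode are assumptions for illustration; the notebook does not state which strategy is used later.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in one continuous and one rating column.
toy = pd.DataFrame({
    "totalworkingyears": [1.0, 6.0, np.nan, 10.0],
    "jobsatisfaction": [4.0, np.nan, 4.0, 2.0],
})

imputed = toy.copy()

# Continuous column: fill with the median, which is robust to skew.
median_years = imputed["totalworkingyears"].median()
imputed["totalworkingyears"] = imputed["totalworkingyears"].fillna(median_years)

# Rating column: fill with the most frequent value (mode).
mode_satisfaction = imputed["jobsatisfaction"].mode().iloc[0]
imputed["jobsatisfaction"] = imputed["jobsatisfaction"].fillna(mode_satisfaction)

print(imputed)
```

In a modelling pipeline the fill values would normally be computed on the training split only and then reused on the test split, to avoid leaking information.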

**Low variability columns**

- After the exploration script dropped the single-value columns, no remaining column shows low variability in its categories or values.


### Categorical columns values

No major problems are observed in terms of high cardinality, data type errors or typos. There is, however, one odd value in the report for the variable 'numcompaniesworked': numcompaniesworked = 0.

It is possible that some employees have worked at no other company, and 0 might mean '0 companies prior to this one', although that contradicts the feature definition in the provided column dictionary. To check, independently of this column, whether an employee has worked elsewhere, we test whether the difference between 'totalworkingyears' and 'yearsatcompany' is greater than 0.


In [214]:
working_years_diff = (employee_attrition["totalworkingyears"] - employee_attrition["yearsatcompany"]) > 0

employee_attrition[working_years_diff & (employee_attrition["numcompaniesworked"] == 0)].shape[0]

211

Almost all records (211/213) with 'numcompaniesworked' == 0 have more total working years than years at the company, so they should have previous experience elsewhere. This indicates that 0 is likely an encoding for missing values, so these values are set to NaN.

In [215]:
employee_attrition.loc[(employee_attrition["numcompaniesworked"] == 0),"numcompaniesworked"] = np.nan
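The consistency check and replacement above can be reproduced on a toy frame to make the logic easier to follow. The values below are hypothetical. The sketch also shows a narrower alternative, not used in this notebook, that nulls only the zeros contradicted by the experience difference and leaves the rest untouched.

```python
import numpy as np
import pandas as pd

# Toy rows: only the first one reports 0 companies despite prior experience.
toy = pd.DataFrame({
    "totalworkingyears": [10.0, 5.0, 3.0, 8.0],
    "yearsatcompany": [4, 5, 3, 2],
    "numcompaniesworked": [0.0, 0.0, 1.0, 2.0],
})

# More total years than years at this company implies earlier employers.
has_prior_experience = (toy["totalworkingyears"] - toy["yearsatcompany"]) > 0
reports_zero = toy["numcompaniesworked"] == 0

# Rows where the two columns contradict each other.
contradictory = int((has_prior_experience & reports_zero).sum())

# Narrower alternative (an assumption, not the notebook's choice):
# null only the contradictory zeros.
narrow = toy.copy()
narrow.loc[has_prior_experience & reports_zero, "numcompaniesworked"] = np.nan

print(contradictory, int(narrow["numcompaniesworked"].isna().sum()))
```

With 211 of 213 zeros being contradictory in the real data, both approaches give nearly the same result; nulling all zeros, as done above, is the simpler rule.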

### Numerical columns

For numerical columns, the quickest way to assess value ranges and other statistics is through a descriptive summary:

In [216]:
display(employee_attrition.describe().T)
employee_attrition.select_dtypes(np.number).nunique().reset_index()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,1573.0,36.904005,9.105911,18.0,30.0,36.0,43.0,60.0
distancefromhome,1573.0,9.158296,8.124414,1.0,2.0,7.0,14.0,29.0
monthlyincome,1573.0,64979.459631,47121.739301,10090.0,28990.0,49000.0,83800.0,199990.0
totalworkingyears,1564.0,11.245524,7.748763,0.0,6.0,10.0,15.0,40.0
yearsatcompany,1573.0,6.970757,6.068348,0.0,3.0,5.0,9.0,40.0
yearssincelastpromotion,1573.0,2.184361,3.203105,0.0,0.0,1.0,3.0,15.0
yearswithcurrmanager,1573.0,4.102988,3.572701,0.0,2.0,3.0,7.0,17.0


Unnamed: 0,index,0
0,age,43
1,distancefromhome,29
2,monthlyincome,1349
3,totalworkingyears,40
4,yearsatcompany,37
5,yearssincelastpromotion,16
6,yearswithcurrmanager,18


No outliers are apparent from the value ranges of the features.
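For a rule-based cross-check of the visual range inspection, one common option is the 1.5 × IQR rule. The sketch below is a hypothetical helper (`iqr_outlier_counts` is not part of the support modules) applied to toy values. It is a stricter criterion than eyeballing the ranges and may flag legitimately high values in skewed columns such as monthlyincome, so a flag does not by itself mean a value is invalid.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data: one extreme income value, evenly spread ages.
toy = pd.DataFrame({
    "monthlyincome": [10, 12, 11, 13, 12, 100],
    "age": [30, 31, 32, 33, 34, 35],
})

counts = iqr_outlier_counts(toy)
print(counts)
```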



`yearswithcurrmanager` and `yearssincelastpromotion` have few unique values (18 and 16 respectively), so they can also be treated as object.


In [217]:
employee_attrition[["yearssincelastpromotion","yearswithcurrmanager"]] = employee_attrition[["yearssincelastpromotion","yearswithcurrmanager"]].astype("object")

# 4. Export

Export the cleaned data in parquet format to retain data types and compress the information; a pickle copy is written as well. This data will be used in the next phase of the project, `notebooks\2_EDA.ipynb`.

In [218]:
employee_attrition.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1573 entries, 0 to 4409
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      1573 non-null   int64  
 1   attrition                1573 non-null   object 
 2   businesstravel           1573 non-null   object 
 3   department               1573 non-null   object 
 4   distancefromhome         1573 non-null   int64  
 5   education                1573 non-null   object 
 6   educationfield           1573 non-null   object 
 7   gender                   1573 non-null   object 
 8   joblevel                 1573 non-null   object 
 9   jobrole                  1573 non-null   object 
 10  maritalstatus            1573 non-null   object 
 11  monthlyincome            1573 non-null   int64  
 12  numcompaniesworked       1341 non-null   object 
 13  percentsalaryhike        1573 non-null   object 
 14  stockoptionlevel         1573

In [219]:
employee_attrition.to_parquet("../data/cleaned/employee_attrition_clean.parquet")
employee_attrition.to_pickle("../data/cleaned/employee_attrition_clean.pkl")