# Job Change of Data Scientists | Data Science Project | Data Preprocessing

> Data from [Kaggle](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists), with a modified problem context.

*This project was completed as a part of Rakamin Academy Data Science Bootcamp.*

Ascencio, a leading Data Science agency, offers training courses to companies to enhance their employees' skills. Companies want to predict which employees are **unlikely to seek a job change** after completing the course, as well as identify those who are **likely to finish it quickly**. By focusing on employees who are committed to staying and can contribute sooner, Ascencio helps companies optimize their training investments.

To achieve this, Ascencio will build two machine learning models: one to predict the training hours needed for an employee to complete the course, and another to predict whether an employee will seek a job change or not.

# Prepare Everything!

In [None]:
# import library
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.gridspec as grid_spec
import seaborn as sns
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

print('numpy version : ',np.__version__)
print('pandas version : ',pd.__version__)
print('matplotlib version : ',mpl.__version__)
print('seaborn version : ',sns.__version__)

numpy version :  2.2.2
pandas version :  2.2.3
matplotlib version :  3.10.0
seaborn version :  0.13.2


In [26]:
# read the data
df_train = pd.read_csv(r'Data/aug_train.csv')
df_test = pd.read_csv(r'Data/aug_test.csv')

# Data Preprocessing

## A. Feature Selection

We will use only the following columns as features:<br/>
'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job'

In [27]:
# drop column enrollee_id, city, and training_hours
print("df_train Dataframe")
df_train.drop(['enrollee_id','city','training_hours'], axis=1, inplace=True)
display(df_train.head())
print("df_test Dataframe")
df_test.drop(['enrollee_id','city','training_hours'], axis=1, inplace=True)
display(df_test.head())


df_train Dataframe


Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,target
0,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,1.0
1,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,0.0
2,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,0.0
3,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,1.0
4,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,0.0


df_test Dataframe


Unnamed: 0,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
0,0.827,Male,Has relevent experience,Full time course,Graduate,STEM,9,<10,,1
1,0.92,Female,Has relevent experience,no_enrollment,Graduate,STEM,5,,Pvt Ltd,1
2,0.624,Male,No relevent experience,no_enrollment,High School,,<1,,Pvt Ltd,never
3,0.827,Male,Has relevent experience,no_enrollment,Masters,STEM,11,10/49,Pvt Ltd,1
4,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,10000+,Pvt Ltd,>4
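
As a quick sanity check on the selection above, the remaining columns can be compared against the intended feature list. This is a minimal sketch: `check_columns` and the toy frame are hypothetical names used for illustration, not part of the original notebook.

```python
import pandas as pd

# Feature list copied from the text above (raw column names, before renaming).
FEATURES = [
    "city_development_index", "gender", "relevent_experience",
    "enrolled_university", "education_level", "major_discipline",
    "experience", "company_size", "company_type", "last_new_job",
]


def check_columns(df, features, target=None):
    """Return (missing, extra) column names relative to the expected set."""
    expected = set(features) | ({target} if target else set())
    missing = sorted(expected - set(df.columns))
    extra = sorted(set(df.columns) - expected)
    return missing, extra


# Toy frame: every expected column plus one leftover identifier column.
toy = pd.DataFrame(columns=FEATURES + ["target", "enrollee_id"])
missing, extra = check_columns(toy, FEATURES, target="target")
print(missing, extra)
```

In the notebook, the call would be `check_columns(df_train, FEATURES, target="target")` and `check_columns(df_test, FEATURES)`; both lists should be empty after the drop.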


In [28]:
# drop rows with 4 or more NaN values from the train data
print(f"Rows in df_train before is {df_train.shape[0]}")
df_train = df_train[df_train.isnull().sum(axis=1) < 4]
# reset_index without drop=True keeps the old row labels as a new 'index' column
df_train.reset_index(inplace=True)
print(f"Rows in df_train after is {df_train.shape[0]}")

Rows in df_train before is 19158
Rows in df_train after is 18280


In [29]:
# drop rows with 4 or more NaN values from the test data
print(f"Rows in df_test before is {df_test.shape[0]}")
df_test = df_test[df_test.isnull().sum(axis=1) < 4]
# reset_index without drop=True keeps the old row labels as a new 'index' column
df_test.reset_index(inplace=True)
print(f"Rows in df_test after is {df_test.shape[0]}")

Rows in df_test before is 2129
Rows in df_test after is 2044
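
The same row filter can also be written with `DataFrame.dropna(thresh=...)`, which keeps rows that have at least a given number of non-missing values. The sketch below is an alternative, not the notebook's code: `drop_sparse_rows` is a hypothetical helper, and it uses `reset_index(drop=True)`, so unlike the cells above it does not add an `index` column.

```python
import numpy as np
import pandas as pd


def drop_sparse_rows(df, max_nan=3):
    """Keep rows with at most `max_nan` missing values and renumber the index."""
    min_present = df.shape[1] - max_nan  # dropna counts non-missing values
    return df.dropna(thresh=min_present).reset_index(drop=True)


# Toy frame with 0, 3, and 4 missing values per row.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, np.nan, np.nan],
    "d": [4.0, 5.0, np.nan],
    "e": [6.0, 7.0, 8.0],
})
kept = drop_sparse_rows(toy)

# Same result as the boolean-mask form used in the notebook.
masked = toy[toy.isnull().sum(axis=1) < 4].reset_index(drop=True)
print(len(kept), kept.equals(masked))
```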


## B. Feature Revision

Based on the analysis in 1_EDA.ipynb, we will rename columns, group category values, and impute some values. The revised data is written to copies: dftr (copied from df_train) and dfte (copied from df_test).

In [30]:
# feature revision for df_train
# rename relevent_experience to relevant_experience
df_train.rename(columns={'relevent_experience':'relevant_experience'}, inplace=True)

# copy the data
dftr = df_train.copy()

# grouping city_development_index
dftr['city_development_index'] = df_train['city_development_index'].apply(lambda x: '<=0.6' if x <= 0.6
                                                                                else '0.6-0.7' if x <= 0.7
                                                                                else '0.7-0.8' if x <= 0.8
                                                                                else '0.8-0.9' if x <= 0.9
                                                                                else np.nan if pd.isna(x)
                                                                                else '0.9-1.0')

# grouping relevant_experience
dftr['relevant_experience'] = df_train['relevant_experience'].apply(lambda x: True if x == "Has relevent experience" 
                                                 else np.nan if pd.isna(x) else False)

# grouping enrolled_university
dftr['enrolled_university'] = df_train['enrolled_university'].apply(lambda x: "No Enroll" if x == "no_enrollment"
                                                                        else "Full Time" if x == "Full time course" 
                                                                        else np.nan if pd.isna(x) else "Part Time")

# grouping and imputation major_discipline
# collapse every major other than "STEM" and "No Major" into "Non-STEM"
dftr['major_discipline'] = df_train['major_discipline'].apply(lambda x: "STEM" if x == "STEM"
                              else "No Major" if x == "No Major"
                              else np.nan if pd.isna(x) else "Non-STEM")
# Graduate/Masters rows with "No Major" are set to NaN (imputed later);
# Primary School/High School rows with a missing major are filled with "No Major"
dftr['major_discipline'] = np.where((dftr['education_level'].isin(['Graduate', 'Masters'])) & (dftr['major_discipline'] == 'No Major'), np.nan, 
                        np.where((dftr['education_level'].isin(['Primary School', 'High School'])) & (dftr['major_discipline'].isnull()),'No Major',dftr['major_discipline']))

# grouping experience
dftr['experience'] = df_train['experience'].apply(lambda x: "Early Career" if x in ['<1','1','2','3','4']
                                                        else "Mid Career" if x in ['5','6','7','8','9','10']
                                                        else "Senior Career" if x in ['11','12','13','14','15']
                                                        else np.nan if pd.isna(x) else "High Experience")

# grouping company_size
dftr['company_size'] = df_train['company_size'].apply(lambda x: "Medium" if x in ['100-500', '500-999']
                                                            else "Large" if x in ['1000-4999', '5000-9999']
                                                            else "Very Large" if x in ['10000+'] 
                                                            else np.nan if pd.isna(x) else "Small")

# rename company_type
dftr['company_type'] = df_train['company_type'].apply(lambda x: "Early Startup" if x == "Early Stage Startup" else x)

# check data
dftr.head(10)

Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,target
0,0,0.9-1.0,Male,True,No Enroll,Graduate,STEM,High Experience,,,1,1.0
1,1,0.7-0.8,Male,False,No Enroll,Graduate,STEM,Senior Career,Small,Pvt Ltd,>4,0.0
2,2,0.6-0.7,,False,Full Time,Graduate,STEM,Mid Career,,,never,0.0
3,3,0.7-0.8,,False,,Graduate,Non-STEM,Early Career,,Pvt Ltd,never,1.0
4,4,0.7-0.8,Male,True,No Enroll,Masters,STEM,High Experience,Small,Funded Startup,4,0.0
5,5,0.7-0.8,,True,Part Time,Graduate,STEM,Senior Career,,,1,1.0
6,6,0.9-1.0,Male,True,No Enroll,High School,No Major,Mid Career,Small,Funded Startup,1,0.0
7,7,0.7-0.8,Male,True,No Enroll,Graduate,STEM,Senior Career,Small,Pvt Ltd,>4,1.0
8,8,0.9-1.0,Male,True,No Enroll,Graduate,STEM,Mid Career,Small,Pvt Ltd,1,1.0
9,9,0.9-1.0,,True,No Enroll,Graduate,STEM,High Experience,Very Large,Pvt Ltd,>4,0.0
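
The chained conditional used for `city_development_index` can also be expressed with `pd.cut`. The following is a sketch of an equivalent binning, assuming the same right-closed edges as the lambda above; `bin_cdi`, `CDI_EDGES`, and `CDI_LABELS` are hypothetical names. `pd.cut` leaves missing values as NaN and returns an ordered categorical directly.

```python
import numpy as np
import pandas as pd

# Same thresholds as the lambda: x <= 0.6, x <= 0.7, x <= 0.8, x <= 0.9, else.
CDI_EDGES = [-np.inf, 0.6, 0.7, 0.8, 0.9, np.inf]
CDI_LABELS = ["<=0.6", "0.6-0.7", "0.7-0.8", "0.8-0.9", "0.9-1.0"]


def bin_cdi(series):
    """Bin a numeric city_development_index column into labelled ranges."""
    return pd.cut(series, bins=CDI_EDGES, labels=CDI_LABELS, right=True)


binned = bin_cdi(pd.Series([0.92, 0.776, 0.624, 0.6, np.nan]))
print(binned)
```

One design difference: the result is already a categorical with the five labels in order, so the later conversion with `pd.Categorical` would not be needed for this column.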


In [31]:
# feature revision for df_test
# rename relevent_experience to relevant_experience
df_test.rename(columns={'relevent_experience':'relevant_experience'}, inplace=True)

# copy the data
dfte = df_test.copy()

# grouping city_development_index
dfte['city_development_index'] = df_test['city_development_index'].apply(lambda x: '<=0.6' if x <= 0.6
                                                                                else '0.6-0.7' if x <= 0.7
                                                                                else '0.7-0.8' if x <= 0.8
                                                                                else '0.8-0.9' if x <= 0.9
                                                                                else np.nan if pd.isna(x)
                                                                                else '0.9-1.0')

# grouping relevant_experience
dfte['relevant_experience'] = df_test['relevant_experience'].apply(lambda x: True if x == "Has relevent experience" 
                                                 else np.nan if pd.isna(x) else False)

# grouping enrolled_university
dfte['enrolled_university'] = df_test['enrolled_university'].apply(lambda x: "No Enroll" if x == "no_enrollment"
                                                                        else "Full Time" if x == "Full time course" 
                                                                        else np.nan if pd.isna(x) else "Part Time")

# grouping and imputation major_discipline
# collapse every major other than "STEM" and "No Major" into "Non-STEM"
dfte['major_discipline'] = df_test['major_discipline'].apply(lambda x: "STEM" if x == "STEM"
                              else "No Major" if x == "No Major"
                              else np.nan if pd.isna(x) else "Non-STEM")
# Graduate/Masters rows with "No Major" are set to NaN (imputed later);
# Primary School/High School rows with a missing major are filled with "No Major"
dfte['major_discipline'] = np.where((dfte['education_level'].isin(['Graduate', 'Masters'])) & (dfte['major_discipline'] == 'No Major'), np.nan, 
                        np.where((dfte['education_level'].isin(['Primary School', 'High School'])) & (dfte['major_discipline'].isnull()),'No Major',dfte['major_discipline']))

# grouping experience
dfte['experience'] = df_test['experience'].apply(lambda x: "Early Career" if x in ['<1','1','2','3','4']
                                                        else "Mid Career" if x in ['5','6','7','8','9','10']
                                                        else "Senior Career" if x in ['11','12','13','14','15']
                                                        else np.nan if pd.isna(x) else "High Experience")

# grouping company_size
dfte['company_size'] = df_test['company_size'].apply(lambda x: "Medium" if x in ['100-500', '500-999']
                                                            else "Large" if x in ['1000-4999', '5000-9999']
                                                            else "Very Large" if x in ['10000+'] 
                                                            else np.nan if pd.isna(x) else "Small")

# rename company_type
dfte['company_type'] = df_test['company_type'].apply(lambda x: "Early Startup" if x == "Early Stage Startup" else x)

# check data
dfte.head(10)

Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
0,0,0.8-0.9,Male,True,Full Time,Graduate,STEM,Mid Career,Small,,1
1,1,0.9-1.0,Female,True,No Enroll,Graduate,STEM,Mid Career,,Pvt Ltd,1
2,2,0.6-0.7,Male,False,No Enroll,High School,No Major,Early Career,,Pvt Ltd,never
3,3,0.8-0.9,Male,True,No Enroll,Masters,STEM,Senior Career,Small,Pvt Ltd,1
4,4,0.9-1.0,Male,True,No Enroll,Graduate,STEM,High Experience,Very Large,Pvt Ltd,>4
5,5,0.8-0.9,Male,False,Part Time,Masters,STEM,Mid Career,,,2
6,6,0.6-0.7,,True,No Enroll,Graduate,STEM,Early Career,Medium,Pvt Ltd,1
7,7,0.9-1.0,Female,True,No Enroll,Graduate,STEM,High Experience,,,>4
8,8,0.8-0.9,Male,True,No Enroll,Graduate,STEM,Senior Career,,,4
9,9,0.6-0.7,Male,True,Full Time,Graduate,,Early Career,Small,Funded Startup,1
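
Because the train and test cells repeat the same grouping logic, one option is to keep each grouping in a lookup table and apply it with `Series.map`. The sketch below does this for `experience` only. It assumes the raw values run from `'<1'` through `'20'` plus `'>20'`; `EXPERIENCE_BANDS`, `EXPERIENCE_LOOKUP`, and `group_experience` are hypothetical names. Unlike the lambda above, any unexpected raw value becomes NaN instead of falling into "High Experience".

```python
import numpy as np
import pandas as pd

# Bands taken from the experience grouping in the cells above.
EXPERIENCE_BANDS = {
    "Early Career": ["<1", "1", "2", "3", "4"],
    "Mid Career": [str(n) for n in range(5, 11)],
    "Senior Career": [str(n) for n in range(11, 16)],
    "High Experience": [str(n) for n in range(16, 21)] + [">20"],
}
# Invert to a flat lookup: raw value -> band name.
EXPERIENCE_LOOKUP = {
    raw: band for band, raws in EXPERIENCE_BANDS.items() for raw in raws
}


def group_experience(series):
    """Map raw experience strings to career bands; NaN stays NaN."""
    return series.map(EXPERIENCE_LOOKUP)


grouped = group_experience(pd.Series(["<1", "9", "15", ">20", np.nan]))
print(grouped.tolist())
```

The same function could then be called once for `df_train['experience']` and once for `df_test['experience']`.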


## C. Data Types

In [32]:
# check the data types of the train data
print(dftr.info())
# relevant_experience has bool type because it only takes yes/no values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18280 entries, 0 to 18279
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   index                   18280 non-null  int64  
 1   city_development_index  18280 non-null  object 
 2   gender                  14456 non-null  object 
 3   relevant_experience     18280 non-null  bool   
 4   enrolled_university     18076 non-null  object 
 5   education_level         18133 non-null  object 
 6   major_discipline        17895 non-null  object 
 7   experience              18247 non-null  object 
 8   company_size            13185 non-null  object 
 9   company_type            12990 non-null  object 
 10  last_new_job            18052 non-null  object 
 11  target                  18280 non-null  float64
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 1.6+ MB
None


In [33]:
# check the data types of the test data
print(dfte.info())
# relevant_experience has bool type because it only takes yes/no values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2044 entries, 0 to 2043
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   index                   2044 non-null   int64 
 1   city_development_index  2044 non-null   object
 2   gender                  1606 non-null   object
 3   relevant_experience     2044 non-null   bool  
 4   enrolled_university     2026 non-null   object
 5   education_level         2024 non-null   object
 6   major_discipline        2000 non-null   object
 7   experience              2041 non-null   object
 8   company_size            1500 non-null   object
 9   company_type            1493 non-null   object
 10  last_new_job            2020 non-null   object
dtypes: bool(1), int64(1), object(9)
memory usage: 161.8+ KB
None


In [34]:
# change type to category for dftr
dftr['city_development_index'] = pd.Categorical(dftr['city_development_index'], categories=['<=0.6','0.6-0.7','0.7-0.8','0.8-0.9','0.9-1.0'], ordered=True)
dftr['gender'] = pd.Categorical(dftr['gender'], categories=["Male","Female","Other"])
dftr['relevant_experience'] = pd.Categorical(dftr['relevant_experience'], categories=[False, True], ordered=True)
dftr['enrolled_university'] = pd.Categorical(dftr['enrolled_university'], categories=['No Enroll','Part Time','Full Time'], ordered=True)
dftr['education_level'] = pd.Categorical(dftr['education_level'], categories=['Primary School','High School','Graduate','Masters','Phd'], ordered=True)
dftr['major_discipline'] = pd.Categorical(dftr['major_discipline'], categories=['No Major','Non-STEM','STEM'])
dftr['experience'] = pd.Categorical(dftr['experience'], categories=['Early Career','Mid Career','Senior Career','High Experience'], ordered=True)
dftr['company_size'] = pd.Categorical(dftr['company_size'], categories=['Small','Medium','Large','Very Large'], ordered=True)
dftr['company_type'] = pd.Categorical(dftr['company_type'], categories=['Early Startup','Funded Startup','NGO','Public Sector','Pvt Ltd','Other'])
dftr['last_new_job'] = pd.Categorical(dftr['last_new_job'], categories=['never','1','2','3','4','>4'], ordered=True)
dftr['target'] = pd.Categorical(dftr['target'], categories=[0,1], ordered=True)


In [35]:
# change type to category for dfte
dfte['city_development_index'] = pd.Categorical(dfte['city_development_index'], categories=['<=0.6','0.6-0.7','0.7-0.8','0.8-0.9','0.9-1.0'], ordered=True)
dfte['gender'] = pd.Categorical(dfte['gender'], categories=["Male","Female","Other"])
dfte['relevant_experience'] = pd.Categorical(dfte['relevant_experience'], categories=[False, True], ordered=True)
dfte['enrolled_university'] = pd.Categorical(dfte['enrolled_university'], categories=['No Enroll','Part Time','Full Time'], ordered=True)
dfte['education_level'] = pd.Categorical(dfte['education_level'], categories=['Primary School','High School','Graduate','Masters','Phd'], ordered=True)
dfte['major_discipline'] = pd.Categorical(dfte['major_discipline'], categories=['No Major','Non-STEM','STEM'])
dfte['experience'] = pd.Categorical(dfte['experience'], categories=['Early Career','Mid Career','Senior Career','High Experience'], ordered=True)
dfte['company_size'] = pd.Categorical(dfte['company_size'], categories=['Small','Medium','Large','Very Large'], ordered=True)
dfte['company_type'] = pd.Categorical(dfte['company_type'], categories=['Early Startup','Funded Startup','NGO','Public Sector','Pvt Ltd','Other'])
dfte['last_new_job'] = pd.Categorical(dfte['last_new_job'], categories=['never','1','2','3','4','>4'], ordered=True)

In [36]:
# check data
print(dftr.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18280 entries, 0 to 18279
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   index                   18280 non-null  int64   
 1   city_development_index  18280 non-null  category
 2   gender                  14456 non-null  category
 3   relevant_experience     18280 non-null  category
 4   enrolled_university     18076 non-null  category
 5   education_level         18133 non-null  category
 6   major_discipline        17895 non-null  category
 7   experience              18247 non-null  category
 8   company_size            13185 non-null  category
 9   company_type            12990 non-null  category
 10  last_new_job            18052 non-null  category
 11  target                  18280 non-null  category
dtypes: category(11), int64(1)
memory usage: 341.1 KB
None


In [37]:
# check data
print(dfte.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2044 entries, 0 to 2043
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   index                   2044 non-null   int64   
 1   city_development_index  2044 non-null   category
 2   gender                  1606 non-null   category
 3   relevant_experience     2044 non-null   category
 4   enrolled_university     2026 non-null   category
 5   education_level         2024 non-null   category
 6   major_discipline        2000 non-null   category
 7   experience              2041 non-null   category
 8   company_size            1500 non-null   category
 9   company_type            1493 non-null   category
 10  last_new_job            2020 non-null   category
dtypes: category(10), int64(1)
memory usage: 37.8 KB
None
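
Since the category lists are repeated for dftr and dfte, a shared `CategoricalDtype` is one way to guarantee both frames use identical levels and order. This is a sketch on toy data for `education_level` only; `EDU_DTYPE` and the toy series are hypothetical names, while the level list is copied from the cells above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One dtype object, reused for train and test.
EDU_DTYPE = CategoricalDtype(
    ["Primary School", "High School", "Graduate", "Masters", "Phd"],
    ordered=True,
)

train_edu = pd.Series(["Graduate", "Masters", None]).astype(EDU_DTYPE)
test_edu = pd.Series(["High School", "Phd"]).astype(EDU_DTYPE)

# Codes follow the declared order; missing values get -1.
print(train_edu.cat.codes.tolist())
print(test_edu.cat.codes.tolist())

# Ordered categoricals support comparisons against a level.
at_least_graduate = train_edu >= "Graduate"
print(at_least_graduate.tolist())
```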


## D. Handle Missing Value

We will use LGBMClassifier (a machine learning model) to impute the missing data, since missing values occur only in categorical features.

In [38]:
# handling missing value using LGBM Classifier for dftr
for col in [x for x in dftr.columns if dftr[x].isnull().sum() > 0]:
    data = dftr.copy()
    nan_ixs = np.where(data[col].isnull())[0]
    data['is_nan'] = 0
    data.loc[nan_ixs, 'is_nan'] = 1

    X = data.drop([col], axis=1)
    y = data[col]
    for col2 in X.columns:
        X[col2] = pd.factorize(X[col2], sort=True)[0]
    data = X.join(y)

    train = data[data['is_nan'] == 0]
    test = data[data['is_nan'] == 1]
    X_train = train.drop([col, 'is_nan'], axis=1)
    y_train = train[col]
    X_test = test.drop([col, 'is_nan'], axis=1)

    model = LGBMClassifier()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    dftr.loc[nan_ixs, col] = y_pred

display(dftr.head(10))
print(f"Number of null values in dftr dataframe: {dftr.isnull().sum().sum()}")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000604 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 302
[LightGBM] [Info] Number of data points in the train set: 14456, number of used features: 11
[LightGBM] [Info] Start training from score -2.463283
[LightGBM] [Info] Start training from score -0.102858
[LightGBM] [Info] Start training from score -4.374858
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000774 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 301
[LightGBM] [Info] Number of data points in the train set: 18076, number of used features: 11
[LightGBM] [Info] Start training from score -1.643824
[LightGBM] [Info] Start training from score -0.297839
[LightGBM] [Info] Start tr

Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,target
0,0,0.9-1.0,Male,True,No Enroll,Graduate,STEM,High Experience,Medium,Pvt Ltd,1,1
1,1,0.7-0.8,Male,False,No Enroll,Graduate,STEM,Senior Career,Small,Pvt Ltd,>4,0
2,2,0.6-0.7,Male,False,Full Time,Graduate,STEM,Mid Career,Small,Pvt Ltd,never,0
3,3,0.7-0.8,Male,False,No Enroll,Graduate,Non-STEM,Early Career,Small,Pvt Ltd,never,1
4,4,0.7-0.8,Male,True,No Enroll,Masters,STEM,High Experience,Small,Funded Startup,4,0
5,5,0.7-0.8,Male,True,Part Time,Graduate,STEM,Senior Career,Small,Pvt Ltd,1,1
6,6,0.9-1.0,Male,True,No Enroll,High School,No Major,Mid Career,Small,Funded Startup,1,0
7,7,0.7-0.8,Male,True,No Enroll,Graduate,STEM,Senior Career,Small,Pvt Ltd,>4,1
8,8,0.9-1.0,Male,True,No Enroll,Graduate,STEM,Mid Career,Small,Pvt Ltd,1,1
9,9,0.9-1.0,Male,True,No Enroll,Graduate,STEM,High Experience,Very Large,Pvt Ltd,>4,0


Number of null values in dftr dataframe: 0
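
The imputation loop above can be wrapped in a function so the same logic serves both dataframes. The sketch below mirrors the loop's steps (integer-encode the other columns with `pd.factorize`, fit on rows where the column is present, predict the rows where it is missing). `impute_with_model` and `MajorityClassifier` are hypothetical names; the tiny classifier is only a stand-in so the example runs without LightGBM, and it is not what the notebook uses.

```python
import numpy as np
import pandas as pd


class MajorityClassifier:
    """Hypothetical stand-in with a fit/predict interface (not LightGBM)."""

    def fit(self, X, y):
        self.most_common_ = pd.Series(y).mode().iloc[0]
        return self

    def predict(self, X):
        return np.full(len(X), self.most_common_, dtype=object)


def impute_with_model(df, make_model):
    """Fill each column's NaNs with predictions from the other columns."""
    out = df.copy()
    cols_with_nan = [c for c in out.columns if out[c].isnull().any()]
    for col in cols_with_nan:
        missing = out[col].isnull()
        # Integer-encode predictors; factorize marks NaN as -1.
        X = pd.DataFrame(
            {c: pd.factorize(out[c], sort=True)[0] for c in out.columns if c != col},
            index=out.index,
        )
        model = make_model()
        model.fit(X[~missing], out.loc[~missing, col])
        out.loc[missing, col] = model.predict(X[missing])
    return out


toy = pd.DataFrame({
    "gender": ["Male", "Male", None, "Female", "Male"],
    "experience": ["Mid Career", "Early Career", "Mid Career", "Mid Career", "Early Career"],
})
filled = impute_with_model(toy, MajorityClassifier)
print(filled["gender"].tolist())
```

With LightGBM installed, the notebook's behaviour would correspond to `impute_with_model(dftr, LGBMClassifier)` and `impute_with_model(dfte, LGBMClassifier)`.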


In [39]:
# handling missing value using LGBM Classifier for dfte
for col in [x for x in dfte.columns if dfte[x].isnull().sum() > 0]:
    data = dfte.copy()
    nan_ixs = np.where(data[col].isnull())[0]
    data['is_nan'] = 0
    data.loc[nan_ixs, 'is_nan'] = 1

    X = data.drop([col], axis=1)
    y = data[col]
    for col2 in X.columns:
        X[col2] = pd.factorize(X[col2], sort=True)[0]
    data = X.join(y)

    train = data[data['is_nan'] == 0]
    test = data[data['is_nan'] == 1]
    X_train = train.drop([col, 'is_nan'], axis=1)
    y_train = train[col]
    X_test = test.drop([col, 'is_nan'], axis=1)

    model = LGBMClassifier()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    dfte.loc[nan_ixs, col] = y_pred

display(dfte.head(10))
print(f"Number of null values in dfte dataframe: {dfte.isnull().sum().sum()}")


[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000272 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 300
[LightGBM] [Info] Number of data points in the train set: 1606, number of used features: 10
[LightGBM] [Info] Start training from score -2.461521
[LightGBM] [Info] Start training from score -0.104945
[LightGBM] [Info] Start training from score -4.246008
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000150 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 299
[LightGBM] [Info] Number of data points in the train set: 2026, number of used features: 10
[LightGBM] [Info] Start training from score -1.617367
[LightGBM] [Info] Start training from score -0.311996
[LightGBM] [Info] Start training from score -2.665059
[LightGBM] [Info] Auto-choosing row-wi

Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
0,0,0.8-0.9,Male,True,Full Time,Graduate,STEM,Mid Career,Small,Pvt Ltd,1
1,1,0.9-1.0,Female,True,No Enroll,Graduate,STEM,Mid Career,Small,Pvt Ltd,1
2,2,0.6-0.7,Male,False,No Enroll,High School,No Major,Early Career,Medium,Pvt Ltd,never
3,3,0.8-0.9,Male,True,No Enroll,Masters,STEM,Senior Career,Small,Pvt Ltd,1
4,4,0.9-1.0,Male,True,No Enroll,Graduate,STEM,High Experience,Very Large,Pvt Ltd,>4
5,5,0.8-0.9,Male,False,Part Time,Masters,STEM,Mid Career,Small,Public Sector,2
6,6,0.6-0.7,Male,True,No Enroll,Graduate,STEM,Early Career,Medium,Pvt Ltd,1
7,7,0.9-1.0,Female,True,No Enroll,Graduate,STEM,High Experience,Very Large,Pvt Ltd,>4
8,8,0.8-0.9,Male,True,No Enroll,Graduate,STEM,Senior Career,Small,Pvt Ltd,4
9,9,0.6-0.7,Male,True,Full Time,Graduate,STEM,Early Career,Small,Funded Startup,1


Number of null values in dfte dataframe: 0


## E. Feature Encoding

We will use label encoding for city_development_index, relevant_experience, enrolled_university, education_level, experience, company_size, and last_new_job (7 features).

In [40]:
# define columns for label encoding
le_cols = ['city_development_index', 'relevant_experience','enrolled_university', 
           'education_level', 'experience', 'company_size', 'last_new_job']

# label encoding using pd.factorize with sort=True
for col in le_cols:
    dftr[col] = pd.factorize(dftr[col], sort=True)[0]
    dfte[col] = pd.factorize(dfte[col], sort=True)[0]

# check data
print("dftr Dataframe")
display(dftr.head(10))
print("dfte Dataframe")
display(dfte.head(10))

dftr Dataframe


Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,target
0,0,4,Male,1,0,2,STEM,3,1,Pvt Ltd,1,1
1,1,2,Male,0,0,2,STEM,2,0,Pvt Ltd,5,0
2,2,1,Male,0,2,2,STEM,1,0,Pvt Ltd,0,0
3,3,2,Male,0,0,2,Non-STEM,0,0,Pvt Ltd,0,1
4,4,2,Male,1,0,3,STEM,3,0,Funded Startup,4,0
5,5,2,Male,1,1,2,STEM,2,0,Pvt Ltd,1,1
6,6,4,Male,1,0,1,No Major,1,0,Funded Startup,1,0
7,7,2,Male,1,0,2,STEM,2,0,Pvt Ltd,5,1
8,8,4,Male,1,0,2,STEM,1,0,Pvt Ltd,1,1
9,9,4,Male,1,0,2,STEM,3,3,Pvt Ltd,5,0


dfte Dataframe


Unnamed: 0,index,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job
0,0,3,Male,1,2,2,STEM,1,0,Pvt Ltd,1
1,1,4,Female,1,0,2,STEM,1,0,Pvt Ltd,1
2,2,1,Male,0,0,1,No Major,0,1,Pvt Ltd,0
3,3,3,Male,1,0,3,STEM,2,0,Pvt Ltd,1
4,4,4,Male,1,0,2,STEM,3,3,Pvt Ltd,5
5,5,3,Male,0,1,3,STEM,1,0,Public Sector,2
6,6,1,Male,1,0,2,STEM,0,1,Pvt Ltd,1
7,7,4,Female,1,0,2,STEM,3,3,Pvt Ltd,5
8,8,3,Male,1,0,2,STEM,2,0,Pvt Ltd,4
9,9,1,Male,1,2,2,STEM,0,0,Funded Startup,1
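
One caveat with `pd.factorize` on categorical columns: it numbers only the values that actually occur, so a level that is absent from one dataframe shifts all later codes. The sketch below, on toy data with hypothetical names, compares it with `Series.cat.codes`, which always follows the declared category order. Whether this affects dftr or dfte depends on every level being present in both, which we have not verified here.

```python
import pandas as pd

levels = ["<=0.6", "0.6-0.7", "0.7-0.8"]

# Toy column where the lowest level never occurs.
sample = pd.Series(
    pd.Categorical(["0.6-0.7", "0.6-0.7", "0.7-0.8"], categories=levels, ordered=True)
)

# factorize numbers the observed values only, starting from 0.
factorized = pd.factorize(sample, sort=True)[0]

# cat.codes uses the position in the declared category list.
fixed = sample.cat.codes

print(factorized.tolist())
print(fixed.tolist())
```

Using `dftr[col].cat.codes` and `dfte[col].cat.codes` would therefore be a more defensive choice, since both frames were given the same category lists in section C.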


We will use one-hot encoding for gender (2 features), major_discipline (2 features), and company_type (5 features). We keep only n-1 of the n unique values of each feature to avoid multicollinearity.
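
The n-1 scheme can be seen on a toy column. This is a small sketch with hypothetical data: because the column is categorical, `pd.get_dummies` creates one indicator per declared category, including levels that never occur, so the reference column can always be dropped by name.

```python
import pandas as pd

gender = pd.Categorical(
    ["Male", "Female", "Male"], categories=["Male", "Female", "Other"]
)
toy = pd.DataFrame({"gender": gender})

# One indicator column per declared category (3 here).
dummies = pd.get_dummies(toy, columns=["gender"], dtype=int)

# Drop the reference level, leaving n-1 = 2 indicator columns.
reduced = dummies.drop(columns=["gender_Other"])

print(list(dummies.columns))
print(reduced)
```

A row with zeros in every remaining indicator then represents the dropped reference level.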

In [None]:
# one hot encoding 
dftr = pd.get_dummies(dftr,columns=['gender','major_discipline','company_type'],dtype=int)
dfte = pd.get_dummies(dfte,columns=['gender','major_discipline','company_type'],dtype=int)

# drop one column of each one hot encoded to avoid multicollinearity
dftr.drop(['gender_Other','major_discipline_No Major','company_type_Other'], axis=1, inplace=True)
dfte.drop(['gender_Other','major_discipline_No Major','company_type_Other'], axis=1, inplace=True)

# rearrange so target column is the last column
dftr = pd.concat([dftr.drop('target', axis=1), dftr['target']], axis=1)

# check data
print("dftr Dataframe")
display(dftr.head(10))
print("dfte Dataframe")
display(dfte.head(10))


# Export the CSV

In [42]:
# export dataframe to csv (relative paths, matching the Data/ folder used when reading)
dftr.to_csv(r'Data/dftr.csv', index=False)
dfte.to_csv(r'Data/dfte.csv', index=False)