# Job Change of Data Scientists | Data Science Project | Data Preprocessing

> Data from [Kaggle](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists), with a modified problem context.

*This project was completed as a part of Rakamin Academy Data Science Bootcamp.*

Ascencio, a leading Data Science agency, offers training courses to companies to enhance their employees' skills. Companies want to predict which employees are **unlikely to seek a job change** after completing the course, as well as identify those who are **likely to finish it quickly**. By focusing on employees who are committed to staying and can contribute sooner, Ascencio helps companies optimize their training investments.

To achieve this, Ascencio will build two machine learning models: one to predict the training hours needed for an employee to complete the course, and another to predict whether an employee will seek a job change or not.

# Prepare Everything!

In [45]:
# import library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# import sklearn
from sklearn.model_selection import train_test_split, StratifiedKFold
from imblearn.over_sampling import SMOTE
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer

# import classification models
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, StackingClassifier, VotingClassifier

# import feature importance
import shap
from sklearn.inspection import permutation_importance


print('numpy version : ',np.__version__)
print('pandas version : ',pd.__version__)


numpy version :  1.24.4
pandas version :  2.2.3


In [46]:
# read the data
df_train = pd.read_csv(r'Data/aug_train.csv')

# Data Preprocessing

## A. Feature Selection

Only the following columns will be used as features:<br/>
'city_development_index', 'gender', 'relevent_experience', 'enrolled_university', 'education_level', 'major_discipline', 'experience', 'company_size', 'company_type', 'last_new_job'

In [47]:
# drop column enrollee_id, city, and training_hours
print("df_train Dataframe")
df_train.drop(['enrollee_id','city','training_hours'], axis=1, inplace=True)
display(df_train.head())

df_train Dataframe


|  | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.92 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 1.0 |
| 1 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 0.0 |
| 2 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 0.0 |
| 3 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 1.0 |
| 4 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 0.0 |
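
Dropping columns by name works, but selecting the wanted columns explicitly makes the feature list part of the code and fails loudly if a column is missing. The sketch below assumes a hypothetical helper `select_features` and a toy one-row frame standing in for `aug_train.csv`; neither is part of the original notebook.

```python
import pandas as pd

# Feature columns listed in this section (names as they appear in the raw data)
FEATURES = [
    'city_development_index', 'gender', 'relevent_experience',
    'enrolled_university', 'education_level', 'major_discipline',
    'experience', 'company_size', 'company_type', 'last_new_job',
]

def select_features(df: pd.DataFrame, target: str = 'target') -> pd.DataFrame:
    """Hypothetical helper: keep only the feature columns plus the target."""
    wanted = FEATURES + [target]
    missing = [c for c in wanted if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[wanted].copy()

# Toy one-row frame standing in for aug_train.csv (values are illustrative)
row = {c: None for c in FEATURES}
row.update({'enrollee_id': 1, 'city': 'city_1', 'training_hours': 10, 'target': 0.0})
toy = pd.DataFrame([row])

selected = select_features(toy)
print(list(selected.columns))
```

The result holds the same eleven columns that remain after the `drop` call above.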


## B. Feature Revision

We will rename columns and clean their values based on the analysis in 1_EDA.ipynb. The revisions are applied to `dftr`, a copy of the `df_train` dataframe, and to `dfte`, a copy of the `df_test` dataframe.

In [48]:
# feature revision for df_train
# rename relevent_experience to relevant_experience
df_train.rename(columns={'relevent_experience':'relevant_experience'}, inplace=True)

# copy the data
dftr = df_train.copy()

# change relevant_experience
dftr['relevant_experience'] = dftr['relevant_experience'].apply(lambda x: True if x == "Has relevent experience" 
                                                 else np.nan if pd.isna(x) else False)

# change enrolled_university
dftr['enrolled_university'] = dftr['enrolled_university'].apply(lambda x: "No Enroll" if x == "no_enrollment"
                                                                        else "Full Time" if x == "Full time course" 
                                                                        else np.nan if pd.isna(x) else "Part Time")

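# fix major_discipline: 'No Major' is invalid for Graduate/Masters (set to NaN),
# and a missing major for Primary/High School is filled with 'No Major'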
dftr['major_discipline'] = np.where((dftr['education_level'].isin(['Graduate', 'Masters'])) & (dftr['major_discipline'] == 'No Major'), np.nan, 
                        np.where((dftr['education_level'].isin(['Primary School', 'High School'])) & (dftr['major_discipline'].isnull()),'No Major',dftr['major_discipline']))

# normalize the company_size label '10/49' to '10-49'
dftr['company_size'] = dftr['company_size'].apply(lambda x: '10-49' if x == '10/49' else x)

# rename company_type
dftr['company_type'] = dftr['company_type'].apply(lambda x: "Early Startup" if x == "Early Stage Startup" else x)

# check data
dftr.head(10)

|  | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.92 | Male | True | No Enroll | Graduate | STEM | >20 | NaN | NaN | 1 | 1.0 |
| 1 | 0.776 | Male | False | No Enroll | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 0.0 |
| 2 | 0.624 | NaN | False | Full Time | Graduate | STEM | 5 | NaN | NaN | never | 0.0 |
| 3 | 0.789 | NaN | False | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 1.0 |
| 4 | 0.767 | Male | True | No Enroll | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 0.0 |
| 5 | 0.764 | NaN | True | Part Time | Graduate | STEM | 11 | NaN | NaN | 1 | 1.0 |
| 6 | 0.92 | Male | True | No Enroll | High School | No Major | 5 | 50-99 | Funded Startup | 1 | 0.0 |
| 7 | 0.762 | Male | True | No Enroll | Graduate | STEM | 13 | <10 | Pvt Ltd | >4 | 1.0 |
| 8 | 0.92 | Male | True | No Enroll | Graduate | STEM | 7 | 50-99 | Pvt Ltd | 1 | 1.0 |
| 9 | 0.92 | NaN | True | No Enroll | Graduate | STEM | 17 | 10000+ | Pvt Ltd | >4 | 0.0 |
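
The nested `np.where` for `major_discipline` packs two rules into one expression. As a sketch, the same rules can be written as two boolean-mask assignments on a toy frame (the toy rows are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'education_level': ['Graduate', 'High School', 'Masters', 'Primary School', 'Phd'],
    'major_discipline': ['No Major', np.nan, 'STEM', np.nan, np.nan],
})

higher = toy['education_level'].isin(['Graduate', 'Masters'])
school = toy['education_level'].isin(['Primary School', 'High School'])
was_missing = toy['major_discipline'].isna()

fixed = toy['major_discipline'].copy()
# Rule 1: a Graduate/Masters row labelled 'No Major' is treated as unknown
fixed[higher & (fixed == 'No Major')] = np.nan
# Rule 2: a Primary/High School row with no recorded major becomes 'No Major'
fixed[school & was_missing] = 'No Major'

toy['major_discipline'] = fixed
print(toy)
```

Rows outside both groups (the `Phd` row here) are left untouched, so a missing major there stays missing for the imputer to handle later.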


In [49]:
# drop rows with 4 or more NaN values from the train data
print(f"Rows in dftr before is {dftr.shape[0]}")
dftr = dftr[dftr.isnull().sum(axis=1) < 4]
dftr.reset_index(inplace=True, drop=True)
print(f"Rows in dftr after is {dftr.shape[0]}")

Rows in dftr before is 19158
Rows in dftr after is 18628
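
The same filter can be expressed with `DataFrame.dropna(thresh=...)`, which counts non-null values instead of nulls. A minimal sketch on a toy frame (column names and values are illustrative) shows that both forms keep the same rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, np.nan, 2.0, np.nan],
    'c': [1.0, np.nan, np.nan, np.nan],
    'd': [1.0, np.nan, np.nan, np.nan],
    'e': [1.0, 5.0, 3.0, np.nan],
})
max_nan = 4  # rows with this many NaN values (or more) are dropped

# Form used in the notebook: count NaN values per row
by_mask = toy[toy.isnull().sum(axis=1) < max_nan].reset_index(drop=True)

# Equivalent form: require at least (n_columns - max_nan + 1) non-null values
by_thresh = toy.dropna(thresh=toy.shape[1] - max_nan + 1).reset_index(drop=True)

print(by_mask.equals(by_thresh), len(by_mask))
```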


## C. Data Types

In [50]:
# check the data types of the train data
print(dftr.info())
# relevant_experience is bool because it only holds yes/no values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18628 entries, 0 to 18627
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   city_development_index  18628 non-null  float64
 1   gender                  14487 non-null  object 
 2   relevant_experience     18628 non-null  bool   
 3   enrolled_university     18396 non-null  object 
 4   education_level         18481 non-null  object 
 5   major_discipline        18258 non-null  object 
 6   experience              18593 non-null  object 
 7   company_size            13190 non-null  object 
 8   company_type            12992 non-null  object 
 9   last_new_job            18392 non-null  object 
 10  target                  18628 non-null  float64
dtypes: bool(1), float64(2), object(8)
memory usage: 1.4+ MB
None


## D. Feature Encoding

In [51]:
# define ordered categories as lists
relevant_experience_cats = [False, True]
enrolled_university_cats = ['No Enroll', 'Part Time', 'Full Time']
education_level_cats = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']
experience_cats = ['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '>20']
company_size_cats = ['<10', '10-49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']
last_new_job_cats = ['never', '1', '2', '3', '4', '>4']
categories = [relevant_experience_cats, enrolled_university_cats, education_level_cats, experience_cats, company_size_cats, last_new_job_cats]

# define columns for ordinal encoding
cats_oe = ['relevant_experience', 'enrolled_university', 'education_level', 'experience', 'company_size', 'last_new_job']

# ordinal encoding
oe = OrdinalEncoder(categories=categories, handle_unknown='use_encoded_value', unknown_value=np.nan)
dftr[cats_oe] = oe.fit_transform(dftr[cats_oe])
display(dftr.head(10))


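|  | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.92 | Male | 1.0 | 0.0 | 2.0 | STEM | 21.0 | NaN | NaN | 1.0 | 1.0 |
| 1 | 0.776 | Male | 0.0 | 0.0 | 2.0 | STEM | 15.0 | 2.0 | Pvt Ltd | 5.0 | 0.0 |
| 2 | 0.624 | NaN | 0.0 | 2.0 | 2.0 | STEM | 5.0 | NaN | NaN | 0.0 | 0.0 |
| 3 | 0.789 | NaN | 0.0 | NaN | 2.0 | Business Degree | 0.0 | NaN | Pvt Ltd | 0.0 | 1.0 |
| 4 | 0.767 | Male | 1.0 | 0.0 | 3.0 | STEM | 21.0 | 2.0 | Funded Startup | 4.0 | 0.0 |
| 5 | 0.764 | NaN | 1.0 | 1.0 | 2.0 | STEM | 11.0 | NaN | NaN | 1.0 | 1.0 |
| 6 | 0.92 | Male | 1.0 | 0.0 | 1.0 | No Major | 5.0 | 2.0 | Funded Startup | 1.0 | 0.0 |
| 7 | 0.762 | Male | 1.0 | 0.0 | 2.0 | STEM | 13.0 | 0.0 | Pvt Ltd | 5.0 | 1.0 |
| 8 | 0.92 | Male | 1.0 | 0.0 | 2.0 | STEM | 7.0 | 2.0 | Pvt Ltd | 1.0 | 1.0 |
| 9 | 0.92 | NaN | 1.0 | 0.0 | 2.0 | STEM | 17.0 | 7.0 | Pvt Ltd | 5.0 | 0.0 |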


In [52]:
# define column for one hot encoding
cats_ohe = ['gender','major_discipline','company_type']

# temporarily replace NaN with the placeholder category 'missing'
si_miss = SimpleImputer(strategy='constant', fill_value='missing')
dftr[cats_ohe] = si_miss.fit_transform(dftr[cats_ohe])

# one hot encoding
ohe = OneHotEncoder(sparse_output=False)

ohe_dftr_array = ohe.fit_transform(dftr[cats_ohe])
ohe_dftr_name = ohe.get_feature_names_out(cats_ohe)
dftr = dftr.drop(cats_ohe, axis=1)
for i, col in enumerate(ohe_dftr_name):
    dftr[col] = ohe_dftr_array[:,i]
dftr.loc[dftr.gender_missing == 1, dftr.columns.str.startswith("gender_")] = np.nan
dftr.loc[dftr.major_discipline_missing == 1, dftr.columns.str.startswith("major_discipline_")] = np.nan
dftr.loc[dftr.company_type_missing == 1, dftr.columns.str.startswith("company_type_")] = np.nan
dftr.drop(columns=['gender_missing','gender_Other','major_discipline_missing','major_discipline_Other','company_type_missing','company_type_Other'], inplace=True)

In [53]:
display(dftr.head(10))

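|  | city_development_index | relevant_experience | enrolled_university | education_level | experience | company_size | last_new_job | target | gender_Female | gender_Male | major_discipline_Arts | major_discipline_Business Degree | major_discipline_Humanities | major_discipline_No Major | major_discipline_STEM | company_type_Early Startup | company_type_Funded Startup | company_type_NGO | company_type_Public Sector | company_type_Pvt Ltd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.92 | 1.0 | 0.0 | 2.0 | 21.0 | NaN | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 0.776 | 0.0 | 0.0 | 2.0 | 15.0 | 2.0 | 5.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.624 | 0.0 | 2.0 | 2.0 | 5.0 | NaN | 0.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 0.789 | 0.0 | NaN | 2.0 | 0.0 | NaN | 0.0 | 1.0 | NaN | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.767 | 1.0 | 0.0 | 3.0 | 21.0 | 2.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.764 | 1.0 | 1.0 | 2.0 | 11.0 | NaN | 1.0 | 1.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 6 | 0.92 | 1.0 | 0.0 | 1.0 | 5.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.762 | 1.0 | 0.0 | 2.0 | 13.0 | 0.0 | 5.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.92 | 1.0 | 0.0 | 2.0 | 7.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 9 | 0.92 | 1.0 | 0.0 | 2.0 | 17.0 | 7.0 | 5.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |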
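
The cell above one-hot encodes with a temporary `missing` category, then turns those rows back into NaN so the imputer can fill them. The same idea can be sketched with `pd.get_dummies` on one toy column. This is an alternative illustration, not the notebook's method, and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'gender': ['Male', np.nan, 'Female', 'Other']})

# get_dummies ignores NaN by default, so a missing row comes out as all zeros
dummies = pd.get_dummies(toy['gender'], prefix='gender', dtype=float)

# Restore the missing marker so an imputer still sees these rows as unknown
dummies.loc[toy['gender'].isna(), :] = np.nan

# Drop one level ('Other', as in the cell above) to act as the baseline
dummies = dummies.drop(columns='gender_Other')
print(dummies)
```

After this step a row of zeros means the baseline category, while a row of NaN means the value was never recorded.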


## E. Handle Missing Value

We will use MICE (Multiple Imputation by Chained Equations), via scikit-learn's `IterativeImputer`, to impute the missing data.

In [54]:
# use linear regression as the MICE estimator to impute missing values
lr = LinearRegression()
mice = IterativeImputer(estimator=lr, random_state=1, n_nearest_features=2)
final_dftr = mice.fit_transform(dftr)
final_dftr = pd.DataFrame(final_dftr, columns=dftr.columns)

In [55]:
# check missing value
print(final_dftr.isnull().sum().sum())

0
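
To illustrate what the imputer does, the sketch below hides part of a synthetic column that depends linearly on another and lets `IterativeImputer` recover it. The data and the names `x` and `y` are made up for the example. `n_nearest_features` is left at its default, since there is only one other column.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(scale=0.1, size=200)})
observed = toy.copy()                   # keep the true values for comparison
toy.loc[toy.index[:20], 'y'] = np.nan   # hide the first 20 values of y

mice = IterativeImputer(estimator=LinearRegression(), random_state=1)
filled = pd.DataFrame(mice.fit_transform(toy), columns=toy.columns)

print(filled.isnull().sum().sum())
print((filled.loc[:19, 'y'] - observed.loc[:19, 'y']).abs().mean())
```

Because the estimator is a linear regression, imputed values are continuous. In the notebook this means encoded columns can receive non-integer values such as a fractional `education_level`.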


## F. Split Data and SMOTE

In [56]:
# split data
X = final_dftr.drop('target', axis=1)
y = final_dftr['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [57]:
# check data
print("Number of data points in train data :",X_train.shape)
print("Number of data points in validation data :",X_test.shape)

Number of data points in train data : (14902, 19)
Number of data points in validation data : (3726, 19)
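
The `stratify=y` argument keeps the class ratio the same in both splits. A small sketch with a made-up 75/25 target shows the effect:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 75 negatives and 25 positives
y = pd.Series([0] * 75 + [1] * 25, name='target')
X = pd.DataFrame({'feature': np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_te.mean())  # share of positives in each split
```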


In [58]:
# SMOTE
sm = SMOTE(random_state=1)
X_train, y_train = sm.fit_resample(X_train, y_train)

In [59]:
# check imbalance data
print("Proportion of target in train data :",y_train.value_counts(normalize=True))

Proportion of target in train data : target
0.0    0.5
1.0    0.5
Name: proportion, dtype: float64
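
SMOTE balances the classes by creating synthetic minority rows between existing minority rows and their nearest neighbours. The NumPy sketch below shows only that interpolation idea. The function `smote_like_sample` is a hypothetical simplification, not the `imblearn` implementation.

```python
import numpy as np

def smote_like_sample(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 1) -> np.ndarray:
    """Hypothetical simplification: interpolate between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)             # random base points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    lam = rng.random((n_new, 1))                               # weight in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(X_min, n_new=6, k=2)
print(synthetic.shape)
```

Every synthetic row lies on a segment between two real minority rows. In the notebook, resampling is applied to the training split only, so the test split keeps its original class ratio.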


In [60]:
# standard scaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
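
The scaler is fitted on the training data and only applied to the test data. A NumPy sketch of the same arithmetic, on made-up numbers, makes the reuse of training statistics explicit:

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

mean = train.mean(axis=0)
std = train.std(axis=0)            # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics; do not refit on test
print(train_scaled.ravel(), test_scaled.ravel())
```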

## G. Machine Learning

In [61]:
# make metrics dataframe
def eval_model(model:list,fit:bool=True)->pd.DataFrame:
    # Initialize empty lists to store metrics
    name = []
    pr_auc_test = []
    pr_auc_train = []
    recall_test = []
    recall_train = []
    precision_test = []
    precision_train = []
    roc_auc_test = []
    roc_auc_train = []

    for m in model:
        if fit:  # fit only when the models have not been fitted yet
            m.fit(X_train, y_train)

        if type(m).__name__ == 'GridSearchCV': # If GridSearchCV, append estimator name
            name.append(str(type(m.estimator).__name__) + '_' + str(type(m).__name__))
        else:
            name.append(type(m).__name__)

        # Predict once per split and reuse the results for every metric
        pred_test, pred_train = m.predict(X_test), m.predict(X_train)
        proba_test = m.predict_proba(X_test)[:, 1]
        proba_train = m.predict_proba(X_train)[:, 1]

        # Append metrics to lists
        # NOTE: pr_auc_* is computed from hard labels, so it reflects only the
        # default decision threshold, not the full precision-recall curve
        pr_auc_test.append(metrics.average_precision_score(y_test, pred_test))
        pr_auc_train.append(metrics.average_precision_score(y_train, pred_train))
        recall_test.append(metrics.recall_score(y_test, pred_test))
        recall_train.append(metrics.recall_score(y_train, pred_train))
        precision_test.append(metrics.precision_score(y_test, pred_test))
        precision_train.append(metrics.precision_score(y_train, pred_train))
        roc_auc_test.append(metrics.roc_auc_score(y_test, proba_test))
        roc_auc_train.append(metrics.roc_auc_score(y_train, proba_train))

    # Make a DataFrame with all metrics and differences
    df_sum = pd.DataFrame({'name': name,
                            'pr_auc_test': pr_auc_test,'pr_auc_train': pr_auc_train,
                            'recall_test': recall_test,'recall_train': recall_train,
                            'precision_test': precision_test,'precision_train': precision_train,
                            'roc_auc_test': roc_auc_test,'roc_auc_train': roc_auc_train})
    
    df_sum['diff_pr_auc'] = df_sum['pr_auc_train'] - df_sum['pr_auc_test']
    df_sum['diff_recall'] = df_sum['recall_train'] - df_sum['recall_test']
    df_sum['diff_precision'] = df_sum['precision_train'] - df_sum['precision_test']
    df_sum['diff_roc_auc'] = df_sum['roc_auc_train'] - df_sum['roc_auc_test']

    return df_sum.sort_values(by='pr_auc_test', ascending=False)
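
`average_precision_score` accepts either hard labels or continuous scores, and the two inputs give different numbers. The sketch below uses made-up labels and probabilities to show the gap. It is an illustration, not a re-computation of the notebook's results.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.6, 0.55, 0.8, 0.3, 0.9])  # hypothetical probabilities
y_label = (y_score >= 0.5).astype(int)               # hard labels at threshold 0.5

# Scores: precision is averaged over every threshold along the PR curve
ap_from_scores = metrics.average_precision_score(y_true, y_score)
# Labels: only one threshold exists, so the curve collapses to a single point
ap_from_labels = metrics.average_precision_score(y_true, y_label)

print(f'{ap_from_scores:.3f} vs {ap_from_labels:.3f}')
```

To rank models by the full precision-recall curve, pass `predict_proba(...)[:, 1]` instead of `predict(...)`.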

In [62]:
# make model explanation
def explain_model(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_train = model.predict(X_train)

    print(f'{type(model).__name__} Evaluation\n')

    print(f'AUC PR (test): {metrics.average_precision_score(y_test, y_pred):.2f}')
    print(f'AUC PR (train): {metrics.average_precision_score(y_train, y_pred_train):.2f}')
    print(f'Diff AUC PR: {np.abs(metrics.average_precision_score(y_test, y_pred)-metrics.average_precision_score(y_train, y_pred_train)):.2f}\n')
    
    print(f'Recall (test): {metrics.recall_score(y_test, y_pred):.2f}')
    print(f'Recall (train): {metrics.recall_score(y_train, y_pred_train):.2f}')
    print(f'Diff Recall: {np.abs(metrics.recall_score(y_test, y_pred)-metrics.recall_score(y_train, y_pred_train)):.2f}\n')

    print(f'Precision (test): {metrics.precision_score(y_test, y_pred):.2f}')
    print(f'Precision (train): {metrics.precision_score(y_train, y_pred_train):.2f}')
    print(f'Diff Precision: {np.abs(metrics.precision_score(y_test, y_pred)-metrics.precision_score(y_train, y_pred_train)):.2f}\n')

    # ROC-AUC uses predicted probabilities, consistent with eval_model
    roc_test = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    roc_train = metrics.roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    print(f'ROC-AUC (test): {roc_test:.2f}')
    print(f'ROC-AUC (train): {roc_train:.2f}')
    print(f'Diff ROC-AUC: {np.abs(roc_test - roc_train):.2f}\n')

    print('-'*100)

In [63]:
# ML models
lr = LogisticRegression()
knn = KNeighborsClassifier()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
ada = AdaBoostClassifier()
xgb = XGBClassifier()
lgb = LGBMClassifier()
cat = CatBoostClassifier()

init_models = [lr, knn, dt, rf, ada, xgb, lgb, cat]

In [64]:
eval_model(init_models)

[LightGBM] [Info] Number of positive: 11210, number of negative: 11210
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001252 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4841
[LightGBM] [Info] Number of data points in the train set: 22420, number of used features: 19
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Learning rate set to 0.038874

|  | name | pr_auc_test | pr_auc_train | recall_test | recall_train | precision_test | precision_train | roc_auc_test | roc_auc_train | diff_pr_auc | diff_recall | diff_precision | diff_roc_auc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | XGBClassifier | 0.634505 | 0.921662 | 0.754063 | 0.928278 | 0.760656 | 0.954241 | 0.913175 | 0.984949 | 0.287157 | 0.174215 | 0.193585 | 0.071774 |
| 6 | LGBMClassifier | 0.627842 | 0.883842 | 0.789816 | 0.9124 | 0.729 | 0.920695 | 0.907774 | 0.973918 | 0.256 | 0.122584 | 0.191695 | 0.066144 |
| 7 | CatBoostClassifier | 0.624563 | 0.906808 | 0.748646 | 0.920339 | 0.751087 | 0.94202 | 0.907971 | 0.979458 | 0.282245 | 0.171693 | 0.190933 | 0.071487 |
| 3 | RandomForestClassifier | 0.57466 | 0.988747 | 0.700975 | 0.988671 | 0.714128 | 0.994348 | 0.88593 | 0.999226 | 0.414087 | 0.287696 | 0.28022 | 0.113296 |
| 2 | DecisionTreeClassifier | 0.485788 | 0.989798 | 0.661972 | 0.98653 | 0.607356 | 0.996486 | 0.764332 | 0.999819 | 0.50401 | 0.324558 | 0.38913 | 0.235486 |
| 4 | AdaBoostClassifier | 0.445898 | 0.739065 | 0.736728 | 0.825156 | 0.516717 | 0.789721 | 0.816841 | 0.87125 | 0.293167 | 0.088428 | 0.273003 | 0.054409 |
| 1 | KNeighborsClassifier | 0.397134 | 0.804384 | 0.656555 | 0.893845 | 0.475294 | 0.840534 | 0.754579 | 0.944287 | 0.40725 | 0.23729 | 0.365239 | 0.189707 |
| 0 | LogisticRegression | 0.364506 | 0.644314 | 0.643554 | 0.663782 | 0.429191 | 0.717412 | 0.72759 | 0.745201 | 0.279809 | 0.020229 | 0.288222 | 0.017611 |
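
One way to read the table is to shortlist models whose train/test gap stays under a tolerance, then rank them by test score. The sketch below copies four rows of PR AUC values from the table above, rounded to three decimals. The `max_gap` tolerance of 0.30 is an assumption chosen for illustration, not a threshold from the original analysis.

```python
import pandas as pd

# PR AUC values copied from the results table (rounded to 3 decimals)
results = pd.DataFrame({
    'name': ['XGBClassifier', 'LGBMClassifier', 'CatBoostClassifier', 'RandomForestClassifier'],
    'pr_auc_test': [0.635, 0.628, 0.625, 0.575],
    'pr_auc_train': [0.922, 0.884, 0.907, 0.989],
})
results['diff_pr_auc'] = results['pr_auc_train'] - results['pr_auc_test']

max_gap = 0.30  # hypothetical tolerance for the train/test gap
shortlist = (
    results[results['diff_pr_auc'] <= max_gap]
    .sort_values('pr_auc_test', ascending=False)
)
print(shortlist['name'].tolist())
```

Note that the train metrics are computed on the SMOTE-resampled training set, which has a 50/50 class balance, while the test set keeps its original class ratio. Part of each gap may therefore come from that difference rather than from overfitting alone.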


In [65]:
# import joblib

# # For example, suppose the best model is XGBoost
# best_model = xgb.fit(X_train, y_train)  # Replace with the selected model

# # Save the model to a .pkl file
# joblib.dump(best_model, 'app/model/model.pkl')

# print("Model successfully saved as model.pkl")
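
For reference, a save-and-reload round trip with `joblib` can be sketched as follows. The toy `LogisticRegression` and the temporary directory are stand-ins chosen for the example; the notebook's own target path is `app/model/model.pkl`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model standing in for the selected classifier
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'model.pkl'   # temporary stand-in for app/model/model.pkl
    joblib.dump(model, path)         # serialize the fitted model
    restored = joblib.load(path)     # load it back from disk

same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

A deployed model also needs the fitted encoders, imputer, and scaler from the earlier steps, since it expects inputs transformed in the same way.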
