### NOTE:
- This model doesn't account for employees who leave their position and are re-employed some time later. We haven't measured how often this happens, but we expect it to be rare.
- This model doesn't distinguish between being fired and leaving the job voluntarily.
- Missing values for age group are replaced with the mode (the most frequent age group).
- Most records don't indicate the highest education level, so we'll leave that column as is.
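
The first note says the re-employment rate has not been measured. A minimal sketch of one way to estimate it is shown here, assuming a re-hire shows up as a gap between one record's `end_date` and the same employee's next `start_date`. The helper `count_rehired`, the 30-day threshold, and the toy values are all hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy records; the column names follow the notebook, the values are invented.
records = pd.DataFrame({
    "emplid_sec": [1, 1, 2, 2, 3],
    "start_date": pd.to_datetime(
        ["2015-01-01", "2016-01-01", "2014-03-01", "2019-06-01", "2020-02-01"]),
    "end_date": pd.to_datetime(
        ["2015-12-31", None, "2016-02-29", "2020-05-31", None]),
})

def count_rehired(df, min_gap_days=30):
    """Count employees with a gap longer than min_gap_days between two records."""
    ordered = df.sort_values(["emplid_sec", "start_date"])
    # End date of the same employee's previous record (NaT for their first record).
    prev_end = ordered.groupby("emplid_sec")["end_date"].shift()
    gap_days = (ordered["start_date"] - prev_end).dt.days
    # Comparisons against NaN are False, so first records are never counted.
    return ordered.loc[gap_days > min_gap_days, "emplid_sec"].nunique()

rehired = count_rehired(records)
total = records["emplid_sec"].nunique()
print(f"{rehired} of {total} employees have a gap between records")
```

The threshold matters: a short gap may only reflect a transfer between positions, so the count should be read as a rough estimate.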

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv("10_05_2021.csv", parse_dates = ["start_date", "end_date"], dtype={'work_postal':'str'})

We'll preprocess the data to make it easier to use in the model.

First, we'll clean the data.

In [2]:
#Rename under_29 to .under_29 so that it sorts before the other age groups
df['age_group'] = df['age_group'].replace(to_replace='under_29', value='.under_29')

#Fill missing age_group values with the mode (the most frequent value)
df['age_group'] = df['age_group'].fillna(df['age_group'].value_counts().index[0])

#Replace missing values in event column with unknown
df['event'] = df['event'].fillna('unknown')
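
To see why the rename helps, here is a small sketch on an invented column. The labels `30_39` and `40_49` are assumptions made for illustration; the real labels may differ. It uses `Series.mode()` in place of `value_counts().index[0]`; both return the most frequent value, though they may break ties differently.

```python
import pandas as pd

# Toy column; every label except 'under_29' is invented for illustration.
age = pd.Series(["30_39", "under_29", None, "30_39", "40_49"], name="age_group")

# Letters sort after digits, so 'under_29' would rank as the oldest group.
print(sorted(age.dropna().unique()))

# A leading '.' sorts before digits, which puts the youngest group first.
age = age.replace("under_29", ".under_29")

# mode() ignores missing values and returns the most frequent label(s).
age = age.fillna(age.mode()[0])
print(sorted(age.unique()))
```

This ordering is what later allows the last element of a sorted list of labels to stand in for the oldest age group on record.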

Next, we'll create a new dataframe that only consists of input and output parameters we're interested in.

In [3]:
#Temporary end_date: this end date is used if the employee is still working
temp_end_date = pd.to_datetime('2021-11-01')

#Get the list of all employees by their unique IDs
employee_ids = df.emplid_sec.unique()

#Define list of features we want in our model
duration = []
division = []
department = []
comprate = []
last_pay_raise = []
highest_educ_lvl = []
age_group = []
event = []

#Loop through each employee's records
for ID in employee_ids:
    #Get all records of the employee
    employee = df[df['emplid_sec'] == ID].copy()
    
    ##### DURATION #####
    #Add up all durations (there are some inaccuracies doing this)
    #Note: the second argument of sum() is the start value, so 1 is added per record
    duration.append(sum(employee['duration'].tolist(), employee.shape[0]))
    
    ##### DIVISION #####
    #Get the last division they were in
    employee.sort_values(by=['end_date'], inplace=True)
    division.append(employee.iloc[-1]['division'])
    
    ##### DEPARTMENT #####
    #Get the last department they were in
    department.append(employee.iloc[-1]['department'])
    
    ##### COMP RATE #####
    #Get the highest comprate
    comprate.append(max(employee['comprate'].tolist()))
    
    ##### LAST PAY RAISE #####
    #Get last date of work, or the temporary end date if any record is still open
    if employee['end_date'].isna().any():
        end = temp_end_date
    else:
        end = employee['end_date'].max()
    #Approximate the date of the last pay raise by the start date of the record with the highest comprate
    employee.sort_values(by=['comprate'], inplace=True)
    last_raise = employee.iloc[-1]['start_date']
    #Calculate the number of days since that raise
    last_pay_raise.append((end - last_raise).days)
    
    ##### EDUCATION LEVEL #####
    #Get the highest education level
    highest_educ_lvl.append(sorted(employee['highest_educ_lvl'].tolist())[-1])
    
    ##### AGE GROUP #####
    #Get the oldest age group on record, i.e. the group they were in most recently
    age_group.append(sorted(employee['age_group'].tolist())[-1])
    
    ##### EVENT #####
    #Get the employee's latest event
    employee.sort_values(by=['end_date'], inplace=True)
    event.append(employee.iloc[-1]['event'])

data = {'duration': duration,
        'division': division,
        'department': department,
        'comprate': comprate,
        'last_pay_raise': last_pay_raise,
        'highest_educ_lvl': highest_educ_lvl,
        'age_group': age_group,
        'event': event}
model_df = pd.DataFrame(data)
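
The loop above filters the full dataframe once per employee, which can be slow on large files. As a rough sketch, three of the features could be computed with a single `groupby`; this is an alternative under stated assumptions, not the notebook's method. The toy values are invented. Note two differences: `GroupBy.last()` returns the last non-null value in each group, while `iloc[-1]` returns the last row's value even if it is missing, and the result is ordered by employee ID instead of by first appearance.

```python
import pandas as pd

# Toy records; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "emplid_sec": [1, 1, 2],
    "duration": [100, 50, 400],
    "division": ["A", "B", "C"],
    "comprate": [20.0, 25.0, 30.0],
    "end_date": pd.to_datetime(["2019-01-01", None, "2020-06-30"]),
})

# Sort so that each employee's open-ended (NaT) or latest record comes last.
ordered = toy.sort_values(["emplid_sec", "end_date"], na_position="last")
grouped = ordered.groupby("emplid_sec")

features = pd.DataFrame({
    # Same arithmetic as the loop: total duration plus one per record.
    "duration": grouped["duration"].sum() + grouped.size(),
    # Division of the most recent record.
    "division": grouped["division"].last(),
    # Highest compensation rate on record.
    "comprate": grouped["comprate"].max(),
})
print(features)
```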

For our ML model, we'll be using [XGBoost](https://xgboost.readthedocs.io/en/stable/) with the help of [scikit-learn](https://scikit-learn.org/stable/).

In [4]:
#pip install xgboost
#pip install scikit-learn
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

#Define input and output parameters for model
X = model_df.iloc[:, :-1]
y = model_df.iloc[:, -1]

#One-hot encoding for division, department, highest_educ_lvl, age_group
X = pd.get_dummies(X, prefix = ['division', 'department', 'educ', 'age'], columns = ['division', 'department', 'highest_educ_lvl', 'age_group'])

#Encode labels for y as integers
y = y.replace({'unknown': 0, 'Retirement': 1, 'Termination': 2})

cross_val_score(XGBClassifier(eval_metric='mlogloss', use_label_encoder=False), X, y)

array([0.69142125, 0.76611182, 0.75234842, 0.75960717, 0.50768574])
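
With scikit-learn's defaults, `cross_val_score` runs 5-fold cross-validation and returns one score per fold, which for a classifier is accuracy. A short sketch for summarising the fold scores above follows; the numbers are copied from that output and nothing is recomputed from the data.

```python
import numpy as np

# Fold scores copied from the output above.
scores = np.array([0.69142125, 0.76611182, 0.75234842, 0.75960717, 0.50768574])

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std deviation: {scores.std():.3f}")
print(f"lowest fold:   {scores.min():.3f} (fold {scores.argmin() + 1} of {len(scores)})")
```

The mean is about 0.695, and the last fold scores well below the other four. Because the default folds are not shuffled, one possible explanation is that the rows are ordered in some way that makes the final fold differ from the rest; this has not been verified here.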