# Employee Retention Model User Manual

Table of Contents
1. Deliverables
2. Making prediction
3. Result interpretation
4. Interview pointers

## 1. Deliverables
Deliverables are located in the `deliverables` folder and include:
1. **retention_model.py**: the script needed to run the model and make predictions in the Python shell,
2. **model_dict.pkl**: the fitted prediction model and the dataframes needed to ensure that newly augmented data conforms to the data format the model was trained on,
3. **interview_sample_size_reference.csv**: a reference for HR to determine the proper sample size in order to reach the desired number of employees who are likely to leave,
4. **user_manual.ipynb**: this notebook.

## 2. Making Prediction
There are currently two ways to make predictions using the model:
1. **Execute it in the terminal**:
In the terminal, change into the directory containing **retention_model.py** and type `python retention_model.py data_location output_location model_location True True`. The last two arguments are the values of the `clean=` and `augment=` parameters of the `predict_proba` method. Therefore,
    1. if the data is raw, then `True True` is appropriate,
    2. if the data is cleaned but requires augmentation, then `False True` is appropriate,
    3. if the data is both cleaned and augmented, then `False False` is appropriate.
> For example, suppose the prediction script `retention_model.py` (provided by us), the new raw data file, e.g. `unseen_data.csv` (provided by you), and the model dictionary `model_dict.pkl` (provided by us) are all saved in the same folder. Then type
`python retention_model.py unseen_data.csv prediction.csv model_dict.pkl True True`

2. **Execute the cells below directly in the notebook**:
    1. locate the new data file, with a path like 'path/to/new_data.csv',
    2. determine whether the data needs to be cleaned, augmented, or both, following the `True`/`False` rules above.
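
For the terminal route, the five positional arguments could be parsed along the following lines. This is only a sketch: the actual contents of `retention_model.py` are not shown in this manual, so the helper names `str_to_bool` and `parse_args` are hypothetical. The point it illustrates is that `True` and `False` arrive as strings on the command line and need explicit conversion, because `bool('False')` evaluates to `True` in Python.

```python
import argparse


def str_to_bool(value):
    """Convert the command-line strings 'True'/'False' into real booleans."""
    lowered = value.strip().lower()
    if lowered == 'true':
        return True
    if lowered == 'false':
        return False
    raise argparse.ArgumentTypeError(f'expected True or False, got {value!r}')


def parse_args(argv):
    """Parse the five positional arguments described above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description='Score employees with the retention model.')
    parser.add_argument('data_location')    # path to the new data file
    parser.add_argument('output_location')  # where predictions would be written
    parser.add_argument('model_location')   # path to model_dict.pkl
    parser.add_argument('clean', type=str_to_bool)
    parser.add_argument('augment', type=str_to_bool)
    return parser.parse_args(argv)


# data that is already cleaned but still needs augmentation
args = parse_args(['unseen_data.csv', 'prediction.csv', 'model_dict.pkl', 'False', 'True'])
print(args.clean, args.augment)
```

The parsed `clean` and `augment` values could then be passed straight to `predict_proba(df, clean=args.clean, augment=args.augment)`.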
 

In [1]:
import numpy as np
import pandas as pd

import pickle
import sklearn

In [3]:
# load data: put 'path/to/new_data.csv' in the parentheses below
df = pd.read_csv('../scr/unseen_employee_data.csv')
df.head()

Unnamed: 0,avg_monthly_hrs,department,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure
0,228,management,,0.735618,2,,high,0.805661,3.0
1,229,product,,1.0,4,,low,0.719961,4.0
2,196,sales,1.0,0.557426,4,,low,0.749835,2.0
3,207,IT,,0.715171,3,,high,0.987447,3.0
4,129,management,,0.484818,2,,low,0.441219,3.0


In [12]:
# load model
class EmployeeRetentionModel:

    def __init__(self, model_dict_location):
        with open(model_dict_location, 'rb') as f:
            self.model_dict = pickle.load(f)
            self.model = self.model_dict['final_model']
            self.trained_df = self.model_dict['trained_df']

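    def predict_proba(self, X, clean=True, augment=True):
        # optionally clean raw data and add engineered features before scoring
        if clean:
            X = self.clean_data(X)
        if augment:
            X = self.engineer_features(X)
        # return the processed frame together with the class probabilities
        return X, self.model.predict_proba(X)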

    def clean_data(self, df):    
        df = df[df.department != 'temp'].copy()
        df.loc[:, 'department'] = df.department.replace('information_technology', 'IT')
        df.loc[:, 'salary'] = df.salary.replace({'low':0, 'medium':1, 'high':2})
        df.loc[:, 'filed_complaint'] = df.filed_complaint.fillna(0)
        df.loc[:, 'recently_promoted'] = df.recently_promoted.fillna(0)
        df.loc[:, 'department'] = df.department.fillna('Missing')
        df.loc[:, 'last_evaluation_missing'] = df.last_evaluation.isnull().astype(int)
        df.loc[:, 'last_evaluation'] = df.last_evaluation.fillna(0.72)    
        return df

    def engineer_features(self, df):
        trained_df = self.trained_df
        df = df.copy()
        df.loc[:,'underperformer'] = (df.last_evaluation < 0.65).astype(int)
        df.loc[:,'overqualified'] = ((df.satisfaction < 0.2) & (df.last_evaluation >0.7)).astype(int)
        df.loc[:,'overachiever'] = ((df.last_evaluation > 0.8) & (df.satisfaction > 0.7)).astype(int)    
        df.loc[:,'burnout'] = ((df.avg_monthly_hrs>240) & (df.satisfaction < 0.2)).astype(int)
        df = pd.get_dummies(df, columns = ['department'])
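        # keep exactly the columns the model was trained on, in the same order
        _, df = trained_df.align(df, join = 'left', axis = 1)
        # cast each column to its training dtype; plain assignment replaces the
        # column, whereas df.loc[:, col] = ... may keep the old dtype in newer pandas
        for col in df.columns:
            df[col] = df[col].astype(trained_df[col].dtypes.name)
        return df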
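
The column alignment step in `engineer_features` can be illustrated on toy data. The two small frames below are made up for illustration and are not part of the deliverables; the sketch only shows what `DataFrame.align(..., join='left', axis=1)` does to the new data. The `fillna(0)` call is an assumption added here: a dummy column that is absent from the new data comes back as `NaN` after alignment, and `NaN` cannot be cast to an integer dtype.

```python
import pandas as pd

# toy stand-in for the training frame stored in model_dict['trained_df']
trained = pd.DataFrame({
    'n_projects': [3],
    'department_IT': [0],
    'department_sales': [1],
})

# toy new data: no IT employees, plus a department the model never saw
new = pd.DataFrame({
    'n_projects': [2, 4],
    'department_sales': [1, 0],
    'department_legal': [0, 1],
})

# left join on columns: keep the training columns, drop everything else
_, aligned = trained.align(new, join='left', axis=1)

# columns missing from the new data are NaN after alignment; fill them
# before casting every column to the dtype seen in training
aligned = aligned.fillna(0).astype(trained.dtypes.to_dict())

print(list(aligned.columns))
```

After this step the new data has the same columns, in the same order and with the same dtypes, as the training frame, which is what the fitted model expects.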

In [22]:
X.head()

Unnamed: 0,avg_monthly_hrs,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure,last_evaluation_missing,underperformer,...,department_Missing,department_admin,department_engineering,department_finance,department_management,department_marketing,department_procurement,department_product,department_sales,department_support
0,228,0.0,0.735618,2,0.0,2,0.805661,3.0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,229,0.0,1.0,4,0.0,0,0.719961,4.0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,196,1.0,0.557426,4,0.0,0,0.749835,2.0,0,1,...,0,0,0,0,0,0,0,0,1,0
3,207,0.0,0.715171,3,0.0,2,0.987447,3.0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,129,0.0,0.484818,2,0.0,0,0.441219,3.0,0,1,...,0,0,0,0,1,0,0,0,0,0


In [31]:
# make prediction
model = EmployeeRetentionModel('model_dict.pkl')
df, predictions = model.predict_proba(df, True, True)

In [34]:
df.loc[:, 'predictions'] = predictions[:, 1]

In [35]:
df

Unnamed: 0,avg_monthly_hrs,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure,last_evaluation_missing,underperformer,...,department_admin,department_engineering,department_finance,department_management,department_marketing,department_procurement,department_product,department_sales,department_support,predictions
0,228,0.0,0.735618,2,0.0,2,0.805661,3.0,0,0,...,0,0,0,1,0,0,0,0,0,0.000
1,229,0.0,1.000000,4,0.0,0,0.719961,4.0,0,0,...,0,0,0,0,0,0,1,0,0,0.105
2,196,1.0,0.557426,4,0.0,0,0.749835,2.0,0,1,...,0,0,0,0,0,0,0,1,0,0.005
3,207,0.0,0.715171,3,0.0,2,0.987447,3.0,0,0,...,0,0,0,0,0,0,0,0,0,0.000
4,129,0.0,0.484818,2,0.0,0,0.441219,3.0,0,1,...,0,0,0,1,0,0,0,0,0,1.000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
745,211,0.0,0.599134,4,0.0,1,0.946140,3.0,0,1,...,0,0,1,0,0,0,0,0,0,0.005
746,136,0.0,0.720000,2,0.0,1,0.393581,3.0,1,0,...,0,0,0,0,0,0,0,0,0,0.995
747,258,0.0,0.809516,4,0.0,0,0.913363,2.0,0,0,...,0,0,0,0,1,0,0,0,0,0.060
748,197,1.0,0.774142,3,0.0,0,0.682195,3.0,0,0,...,0,0,0,0,0,0,0,1,0,0.005


## 3. Result Interpretation
1. If one is curious whether a specific employee would leave, simply look up the probability by their index number.
2. One could also sort the probabilities from highest to lowest and use the reference table to determine the number of employees to reach out to. See below for an example:
    1. Predictions are made on 740 employees, of whom, according to our training data, around 150 would be expected to leave.
    2. Currently, HR has the capacity to interview 74 employees, which is 10% of all employees scored.
    3. Open and load the reference table and locate the rows with a sample size of at most 10%.

In [37]:
# rank employees from most likely to leave to least
df.sort_values('predictions', ascending = False).head()

Unnamed: 0,avg_monthly_hrs,filed_complaint,last_evaluation,n_projects,recently_promoted,salary,satisfaction,tenure,last_evaluation_missing,underperformer,...,department_admin,department_engineering,department_finance,department_management,department_marketing,department_procurement,department_product,department_sales,department_support,predictions
377,250,0.0,0.788487,6,0.0,0,0.100074,4.0,0,0,...,0,0,0,0,0,0,0,0,1,1.0
393,133,0.0,0.51184,2,0.0,1,0.3489,3.0,0,1,...,0,0,0,0,0,0,0,0,0,1.0
267,139,0.0,0.521865,2,0.0,0,0.418834,3.0,0,1,...,0,0,0,0,0,0,0,0,1,1.0
108,139,0.0,0.573507,2,0.0,0,0.452876,3.0,0,1,...,0,0,0,0,0,0,0,1,0,1.0
428,306,0.0,0.843287,6,0.0,0,0.105194,4.0,0,0,...,0,0,0,0,0,0,0,0,0,1.0


In [38]:
# load reference
reference = pd.read_csv('../delieverables/interview-sample-size-reference.csv')
reference.head()

Unnamed: 0.1,Unnamed: 0,fpr,tpr,threshold,sampleSize,Accuracy
0,0,0.0,0.0,2.0,0.0,
1,1,0.0,0.267459,1.0,0.063966,1.0
2,2,0.000467,0.377415,0.995,0.090618,0.996078
3,3,0.000934,0.451709,0.99,0.108742,0.993464
4,4,0.000934,0.508172,0.985,0.122246,0.994186


In [39]:
# our capacity only allows us to interview 10% of all employees
reference[reference.sampleSize <= 0.10].head()

Unnamed: 0.1,Unnamed: 0,fpr,tpr,threshold,sampleSize,Accuracy
0,0,0.0,0.0,2.0,0.0,
1,1,0.0,0.267459,1.0,0.063966,1.0
2,2,0.000467,0.377415,0.995,0.090618,0.996078


4. We see that we can expect roughly 99% accuracy, meaning nearly all 74 employees are likely to leave. By reaching out to them, we can proactively address any issues that can be addressed.
5. Given that it takes on average 4,000 dollars and 52 days to fill a position on the most conservative estimate, we just saved roughly $100,000 and 3,500 days of work. Not bad for just a couple of lines of code, if you ask me!
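
The lookup described in the steps above could be wrapped in a small helper. The sketch below is an illustration under assumptions: `shortlist` is a hypothetical function name, the reference rows are copied from the output shown earlier, the scored frame is a handful of toy rows, and it assumes an employee is selected when the predicted probability is greater than or equal to the threshold in the chosen reference row.

```python
import pandas as pd

# first rows of the reference table, as shown in the output above
reference = pd.DataFrame({
    'threshold': [2.0, 1.0, 0.995, 0.99, 0.985],
    'sampleSize': [0.0, 0.063966, 0.090618, 0.108742, 0.122246],
    'Accuracy': [float('nan'), 1.0, 0.996078, 0.993464, 0.994186],
})

# toy stand-in for the scored employees (index = employee number)
scored = pd.DataFrame(
    {'predictions': [0.0, 0.105, 0.005, 1.0, 0.995, 0.06]},
    index=[0, 1, 2, 4, 746, 747],
)


def shortlist(scored, reference, capacity):
    """Pick the largest sample size within capacity and the employees it covers."""
    within = reference[reference.sampleSize <= capacity]
    row = within.loc[within.sampleSize.idxmax()]
    picked = scored[scored.predictions >= row.threshold]
    return row, picked.sort_values('predictions', ascending=False)


row, picked = shortlist(scored, reference, capacity=0.10)
print(row.threshold, list(picked.index))
```

On the toy rows the share of selected employees does not match `sampleSize`, since only six employees are included; on the full scored data the two should be close if the new data resembles the data used to build the reference table.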

## 4. Interview Pointers
1. Employees could be hesitant to talk to their employer about problems with their work; therefore, it helps to have an idea of what the problem is when speaking with them and/or their managers.
2. We could combine the predicted probability with the important features from our analysis to help drive the conversation.
3. Let's use employee number 377 as an example:

In [40]:
df.loc[377, :]

avg_monthly_hrs            250.000000
filed_complaint              0.000000
last_evaluation              0.788487
n_projects                   6.000000
recently_promoted            0.000000
salary                       0.000000
satisfaction                 0.100074
tenure                       4.000000
last_evaluation_missing      0.000000
underperformer               0.000000
overqualified                1.000000
overachiever                 0.000000
burnout                      1.000000
department_IT                0.000000
department_Missing           0.000000
department_admin             0.000000
department_engineering       0.000000
department_finance           0.000000
department_management        0.000000
department_marketing         0.000000
department_procurement       0.000000
department_product           0.000000
department_sales             0.000000
department_support           1.000000
predictions                  1.000000
Name: 377, dtype: float64

We observe that this employee comes from the support department, works on 6 projects for an average of 250 hours per month, and is evaluated at 0.79. Both the working hours and the number of projects are above average. His satisfaction is only 0.1. It is possible that his dissatisfaction comes from being overworked, so this could be addressed in the interview.
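
To prepare for a conversation quickly, the engineered indicator columns can be scanned for the flags that are set. The sketch below copies the relevant values for employee 377 from the output above; `active_flags` is a hypothetical helper, not part of `retention_model.py`.

```python
import pandas as pd

# selected values for employee 377, copied from the output above
employee = pd.Series({
    'avg_monthly_hrs': 250.0,
    'last_evaluation': 0.788487,
    'n_projects': 6.0,
    'satisfaction': 0.100074,
    'underperformer': 0.0,
    'overqualified': 1.0,
    'overachiever': 0.0,
    'burnout': 1.0,
    'predictions': 1.0,
}, name=377)

# indicator features created in engineer_features
FLAG_COLUMNS = ['underperformer', 'overqualified', 'overachiever', 'burnout']


def active_flags(row, flag_columns=FLAG_COLUMNS):
    """Return the names of the indicator features that are set for one employee."""
    return [col for col in flag_columns if row[col] == 1]


print(employee.name, active_flags(employee))
```

For employee 377 this surfaces `overqualified` and `burnout`, which matches the reading above: very low satisfaction combined with a good evaluation and long hours.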

**This is the end of this project. Please always feel free to reach out with questions, improvements, or anything you would like to share!**