# Employee performance prediction using Fastai

This Python notebook predicts an employee performance flag using the 1731 data points originally provided.

## Install relevant packages

If you are using Google Colab, all of the packages imported in the "Import libraries" section below are installed by default, apart from the fastai upgrade. On a local machine you may need to install some of them, depending on your existing Python setup:
1. pandas
2. numpy
3. os (part of the Python standard library, no installation needed)
4. fastai
5. seaborn
6. imblearn (installed as imbalanced-learn)
7. metrics (from scikit-learn)
8. matplotlib

In [None]:
#!pip install fastai
#!pip install fastai --upgrade

## Import libraries

In [None]:
from google.colab import drive
import pandas as pd
import os
from numpy import mean  # scipy.mean is deprecated and missing from recent SciPy releases
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, fbeta_score, cohen_kappa_score, roc_auc_score, confusion_matrix, precision_score,make_scorer
from imblearn.over_sampling import BorderlineSMOTE
from fastai import *
from fastai.tabular.all import *
import warnings
warnings.filterwarnings('ignore')

## Set working directory

If you are using Google Colab, place the CSV input file in your Google Drive directory and the next three lines of code will set the working directory. If you are using a local machine instead, comment out the first line of code in the cell below (and the `from google.colab import drive` import above) and replace root_dir with your local working directory.

In [None]:
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/MyDrive/Colab_Notebooks/Fastai_Tabular_Classification"
os.chdir(root_dir)

Mounted at /content/gdrive


**Note to users: Make sure to create a new folder in your working directory named "Fastai_Saved_Models".**
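
The folder mentioned in the note can also be created from code. The following is a minimal sketch, not part of the original notebook: `ensure_model_dir` is a hypothetical helper name, and the demo runs in a throwaway temporary directory rather than in your real working directory.

```python
import os
import tempfile

def ensure_model_dir(base_dir, name="Fastai_Saved_Models"):
    # hypothetical helper: create the folder if it is missing, leave it alone if it exists
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)
    return path

# demo in a temporary directory; in the notebook you would pass os.getcwd() instead
with tempfile.TemporaryDirectory() as tmp:
    model_dir = ensure_model_dir(tmp)
    print(os.path.isdir(model_dir))
```

Because `exist_ok=True` is set, calling the helper again on an existing folder does not raise an error.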

## Data Import

In [None]:
df = pd.read_csv("Input.csv")
df.head(3)

## Data pre-processing


Data cleaning has already been taken care of. Given the nature of the problem, the following irrelevant columns are dropped:
'Person_ID','Name','Type_ID','Service','Group','Group_ID','Score'

In [None]:
df_model = df.drop(columns = ['Person_ID','Name','Type_ID','Service','Group','Group_ID','Score'])
df_model.head(3)

## Borderline SMOTEing

Documentation - https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.BorderlineSMOTE.html

Reference article - https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/


**Note to users:
Users of this notebook don't need to run the code cell below. This cell was used during model training. The best model has been shared as a ".pkl" file that can be loaded and used to predict on the desired period in the "User section" of the notebook later.**

In [None]:
def borderline_sample(train_valid):
  
  # save the list of train_valid column names
  clmns = list(train_valid.columns)

  # k_neighbors and m_neighbors below were chosen after several separate test iterations on our data
  oversample = BorderlineSMOTE(k_neighbors=3,m_neighbors=8)

  # split into X and y
  X_train_valid = train_valid.drop(columns='Class')
  y_train_valid = train_valid['Class']
  X_train_valid, y_train_valid = oversample.fit_resample(X_train_valid, y_train_valid)

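  # join back again with the target first (assumes 'Class' is the first column of train_valid)
  # np.asarray is needed because fit_resample can return y as a pandas Series, which has no reshape
  train_valid = pd.DataFrame(np.concatenate((np.asarray(y_train_valid).reshape(-1,1),X_train_valid), axis=1))

  # restore the original column names
  train_valid.columns = clmns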

  return train_valid
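
As a rough illustration of what SMOTE-style oversampling does, the sketch below generates one synthetic minority point by interpolating between a minority sample and one of its minority neighbours. It is plain Python, it is not the imblearn implementation, and it leaves out the borderline-selection step that BorderlineSMOTE adds. The function name `interpolate` and the sample points are hypothetical.

```python
import random

def interpolate(sample, neighbour, rng):
    # draw a gap in [0, 1) and move that fraction of the way from sample towards neighbour
    gap = rng.random()
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)              # fixed seed so the demo is repeatable
minority_point = [1.0, 2.0]         # hypothetical minority-class sample
minority_neighbour = [3.0, 6.0]     # hypothetical nearest minority neighbour

synthetic = interpolate(minority_point, minority_neighbour, rng)
print(synthetic)
```

The synthetic point always lies on the line segment between the two inputs, which is why the oversampled data stays inside the region already occupied by the minority class.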

## Data preparation

In [None]:
# prepare train data for training
def prepare_train_data(df_model,p):
  
  # define the train/validation set for this training iteration: all periods up to and including p
  train_valid = df_model[df_model['Period']<=p]
  train_valid = train_valid.drop(columns='Period')

  # SMOTEing
  #train_valid = borderline_sample(train_valid)

  # SMOTEing disturbs data type of objects, hence fixing them
  train_valid['Class'] = train_valid['Class'].astype('int')
  train_valid['Class'] = train_valid['Class'].astype('category')
  train_valid = train_valid.infer_objects()

  # split criterion - first 80% of rows (by position) for training, last 20% for validation
  cond = (np.arange(len(train_valid))<round(0.80*len(train_valid)))

  # get indices of train and valid
  train_idx = np.where( cond)[0]
  valid_idx = np.where(~cond)[0]

  # create list of train and valid index, call it splits
  splits = (list(train_idx),list(valid_idx))

  return train_valid, splits

# prepare test data for evaluation
def prepare_test_data(df_model,p):
  
  # define test set in each iteration of evaluation
  test = df_model[df_model['Period']==p]

  # split into features (columns 2 onward) and target (column 1)
  X_test = test.iloc[:,2:]
  y_test = test.iloc[:,1]

  return X_test,y_test
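
The idea behind the 80/20 split in `prepare_train_data` can be illustrated in isolation. The sketch below, in plain Python with a hypothetical helper `positional_split`, assigns roughly the first 80% of row positions to training and the rest to validation, returning the (train indices, valid indices) pair that a `splits` argument expects.

```python
def positional_split(n_rows, train_frac=0.80):
    # hypothetical helper: cut the row positions once, keeping their original order
    cut = round(train_frac * n_rows)
    train_idx = list(range(cut))
    valid_idx = list(range(cut, n_rows))
    return train_idx, valid_idx

train_idx, valid_idx = positional_split(10)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(valid_idx)  # [8, 9]
```

Splitting by position rather than at random keeps the validation rows at the end of the table, which matters when the rows are ordered by period.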

## Model Definition

Documentation for fastai - https://course.fast.ai/videos/?lesson=1

**Note to users:
Users of this notebook don't need to run the code cell below. This cell was used during model training. The best model has been shared as a ".pkl" file that can be loaded and used to predict on the desired period in the "User section" of the notebook later.**

In [None]:
def fastai_classifier(train_valid,splits):
  
  # define target
  dep_var = 'Class'

  # define categorical and continuous variables split
  cont_nn,cat_nn = cont_cat_split(train_valid, dep_var=dep_var)

  # define data preprocessing steps - although not needed in this use case
  procs = [FillMissing, Categorify, Normalize]

  # define tabular pandas object and load it to data loaders
  to_nn = TabularPandas(train_valid, procs, cat_nn, cont_nn, splits=splits, y_names=dep_var, y_block = CategoryBlock())
  dls = to_nn.dataloaders()

  # define model
  model = tabular_learner(dls, layers=[20,50,100,150], metrics=[accuracy, error_rate, Recall(), Precision(),APScoreBinary()])

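  # find optimum learning rate if fine tuning seems necessary
  # recent fastai versions return only the 'valley' suggestion by default, so request minimum and steep explicitly
  _, lr_steepest = model.lr_find(suggest_funcs=(minimum, steep))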

  # define callbacks list
  #callbacks = [EarlyStoppingCallback(model, min_delta=1e-5, patience=3),SaveModelCallback(model)]

  # add callbacks
  #model.callbacks = callbacks

  # fit model
  model.unfreeze()
  model.fit_one_cycle(3,slice(lr_steepest),cbs=SaveModelCallback()) 

  # fine tune
  model.fine_tune(2)

  return model

## Model Run

**Note to users:
Users of this notebook don't need to run the code cell below. This cell was used during model training. The best model has been shared as a ".pkl" file that can be loaded and used to predict on the desired period in the "User section" of the notebook later.**

In [None]:
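# Shift data and train model. As the data is time sensitive, we need to find the best model
# (the evaluation cell below loads models for periods 3 to 6; widen the range to range(3,7) to train them all)
for p in range(3,4):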
  
  # prepare data for current iteration - SMOTEing is toggled by the borderline_sample line in prepare_train_data (currently commented out)
  train_valid, splits = prepare_train_data(df_model,p)
  
  # fit model
  model = fastai_classifier(train_valid,splits)

  # save the best performing model to file
  model.export('Fastai_Saved_Models/Model trained upto period ' + str(p) + '.pkl')

  # print training iteration
  print('Trained period upto= ',p)

## Model Evaluation

**Note to users:
Users of this notebook don't need to run the code cell below. This cell was used during model training. The best model has been shared as a ".pkl" file that can be loaded and used to predict on the desired period in the "User section" of the notebook later.**

In [None]:
# define a mean precision score dictionary at model level to evaluate best model
mean_prec_dict = {}

# load all models one by one
for mdl in range(3,7):

  # load model
  model = load_learner('Fastai_Saved_Models/Model trained upto period ' + str(mdl) + '.pkl')

  # primary evaluation dictionary, created once per model so it keeps every test period
  prec_dict = {}

  # forward chaining evaluation
  for p in range(mdl+1,26):

      # prepare test data for each iteration
      X_test,y_test = prepare_test_data(df_model,p)

      # predict on test - this predicts probabilities
      dl = model.dls.test_dl(X_test)
      pred = model.get_preds(dl=dl)
      y_pred = pred[0][:,1].numpy() # column 1 is P(Class = 1), assuming the class vocab is [0, 1]; .numpy() converts the tensor to an array

      # convert probabilities to classes using default threshold = 0.5
      y_hat = np.where(y_pred > 0.5, 1, 0)

      # precision score - want to reduce false positives
      prec = precision_score(y_test, y_hat)

      # store scores in a dictionary
      prec_dict['mdl=' + str(mdl) + '_test_p=' + str(p)] = prec
      #print(prec_dict)

  # mean precision for a model across all of its evaluation periods
  mean_prec_dict['Model=' + str(mdl)] = np.mean(list(prec_dict.values()))

mean_prec_dict
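
The bookkeeping in the evaluation loop can be sketched without fastai. The example below uses plain Python with hypothetical labels and predictions: it computes precision for each test period and then averages the values for one model. The function `precision` and the toy data are assumptions for illustration, not the notebook's results.

```python
def precision(y_true, y_pred):
    # precision = true positives / all predicted positives (0.0 if nothing was predicted positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0

# hypothetical (y_true, y_pred) pairs keyed by test period
periods = {
    4: ([1, 0, 1, 0], [1, 1, 1, 0]),  # 2 true positives, 1 false positive
    5: ([1, 1, 0, 0], [1, 0, 0, 0]),  # 1 true positive, 0 false positives
}

# one precision value per period, then a single mean for the model
prec_by_period = {p: precision(t, y) for p, (t, y) in periods.items()}
mean_precision = sum(prec_by_period.values()) / len(prec_by_period)
print(prec_by_period, mean_precision)
```

Keeping one dictionary per model and averaging only after all periods are scored is what makes the per-model means comparable.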

## User section - load best model and predict

After running the two cells above, we obtain some insight into how the different models perform on different test sets. Model-6 had the highest mean precision score of all models and is therefore selected as the final model.
Run the cell below to load model-6 and predict on any period you like. Model-6 is shared as a separate ".pkl" file. Make sure it is present in the "Fastai_Saved_Models" folder of your working directory.
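
Selecting the model with the highest mean precision from a dictionary of scores can be sketched as follows. The scores are made-up numbers for illustration only and are not the notebook's results. The keys follow the 'Model=<period>' naming used in the evaluation cell.

```python
# hypothetical mean precision per model (made-up numbers, not real results)
mean_prec = {"Model=3": 0.61, "Model=4": 0.64, "Model=5": 0.63, "Model=6": 0.70}

# pick the key whose value is largest
best_model = max(mean_prec, key=mean_prec.get)
print(best_model)  # Model=6
```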

Model-6 was trained on the first 6 periods only, yet it yielded the highest mean precision score. This suggests seasonal behaviour in the data and that the data is indeed time sensitive.

In [None]:
# load model
m = 6
model = load_learner('Fastai_Saved_Models/Model trained upto period ' + str(m) + '.pkl')

# prepare test data for the chosen period - just change p to whichever period you want to predict. Make sure not to disturb the format of the input CSV file.
p = 7
X_test,y_test = prepare_test_data(df_model,p)

# predict on test - this predicts probabilities
dl = model.dls.test_dl(X_test)
pred = model.get_preds(dl=dl)
y_pred = pred[0][:,1].numpy() # column 1 is P(Class = 1), assuming the class vocab is [0, 1]; .numpy() converts the tensor to an array

# convert probabilities to classes using default threshold = 0.5
y_hat = np.where(y_pred > 0.5, 1, 0)

# confusion matrix of trained model
#print('confusion matrix of trained model')
#interpret = ClassificationInterpretation.from_learner(model)
#interpret.plot_confusion_matrix()

# plot confusion matrix
cm=confusion_matrix(y_test,y_hat)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Low P','High P'])
ax.yaxis.set_ticklabels(['Low P','High P'])

# precision score
prec = precision_score(y_test, y_hat)
print("Precision: ", round(prec*100,2), "%")

# accuracy
acc = accuracy_score(y_test, y_hat)
print("Accuracy: ", round(acc*100,2), "%")

# cohen kappa score
kappa = cohen_kappa_score(y_test, y_hat)
print("Kappa: ", round(kappa*100,2), "%")

# f beta score, beta = 0.5 to increase weightage to precision - want to reduce false +ve
f_beta = fbeta_score(y_test, y_hat,beta=0.5)
print("F-0.5: ", round(f_beta*100,2), "%")

# roc auc score - computed here from the hard 0/1 predictions rather than from probabilities
auc = roc_auc_score(y_test, y_hat) 
print("AUC: ", round(auc*100,2), "%")

# END