# Employee performance prediction using Auto ML

This Python notebook predicts an employee performance flag using the 1731 data points originally provided.

## Install relevant packages

If you are using Google Colab, all of the packages imported in the "Import libraries" section below, except autokeras, are installed by default. On a local machine you may need to install some of these packages, depending on your existing Python setup:
1. pandas
2. numpy
3. os (part of the Python standard library, no installation needed)
4. tensorflow
5. seaborn
6. imblearn (installed as imbalanced-learn)
7. scikit-learn (provides sklearn.metrics)
8. matplotlib
9. autokeras

In [None]:
#!pip install autokeras
#!pip install tensorflow-addons

## Import libraries

In [None]:
from google.colab import drive
import pandas as pd
import os
from numpy import mean  # scipy.mean is a deprecated alias of numpy.mean
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, fbeta_score, cohen_kappa_score, roc_auc_score, confusion_matrix, precision_score,make_scorer
from imblearn.over_sampling import BorderlineSMOTE
from autokeras import StructuredDataClassifier
import tensorflow_addons as tfa
from kerastuner import Objective
import warnings
warnings.filterwarnings('ignore')

## Set working directory

If you are using Google Colab, place the CSV input file in your Google Drive folder; the three lines of code below then mount the drive and set the working directory. If you are using a local machine instead, comment out the first line of the cell below and replace root_dir with your local working directory.

In [None]:
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/MyDrive/Colab_Notebooks/AutoML_development"
os.chdir(root_dir)

Mounted at /content/gdrive


**Note to users: Make sure to create a new folder in your working directory named "Autokeras_Saved_Models".**

## Data Import

In [None]:
df = pd.read_csv("ML_input.csv")
df.head(3)

## Data pre-processing


Data cleaning has already been taken care of. Given the nature of the problem, the following irrelevant columns are dropped:
'Person_ID','Name','Type_ID','Service','Group','Group_ID','Score'

In [None]:
df_model = df.drop(columns = ['Person_ID','Name','Type_ID','Service','Group','Group_ID','Score'])
df_model.head(3)

## Data preparation

In [None]:
# prepare train data for training
def prepare_train_data(df_model,p):
  
  # training set for this iteration: all rows up to and including period p
  train = df_model[df_model['Period']<=p]

  # split into features (third column onwards) and target (second column)
  X_train = train.iloc[:,2:]
  y_train = train.iloc[:,1]

  return X_train,y_train

# prepare test data for evaluation
def prepare_test_data(df_model,p):
  
  # test set for this iteration: rows of period p only
  test = df_model[df_model['Period']==p]

  # split into features (third column onwards) and target (second column)
  X_test = test.iloc[:,2:]
  y_test = test.iloc[:,1]

  return X_test,y_test

## Borderline SMOTEing

Documentation - https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.BorderlineSMOTE.html

Reference article - https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/


**Note to users: 
You do not need to run the code cell below. It was used during model training. The best model is shared as an ".h5" file that can be loaded to predict on the desired period in the "User section" later in this notebook.**

In [None]:
def borderline_sample(X_train,y_train):
  oversample = BorderlineSMOTE()
  X_train, y_train = oversample.fit_resample(X_train, y_train)

  return X_train,y_train
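
For intuition about what the oversampler produces, here is a minimal sketch of the interpolation step that SMOTE-style methods use to create a synthetic minority sample. It is a simplified illustration, not imblearn's implementation: Borderline-SMOTE additionally restricts the starting samples to minority points that lie near the class boundary, and it picks the neighbour through a nearest-neighbour search. The helper name `synthesize_point` and the two sample points are hypothetical.

```python
import random

def synthesize_point(sample, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)        # fixed seed so the sketch is reproducible
minority_a = [1.0, 2.0]       # made-up minority-class sample
minority_b = [3.0, 6.0]       # made-up minority-class neighbour
synthetic = synthesize_point(minority_a, minority_b, rng)
print(synthetic)
```

The synthetic point always lies on the straight line between the two original samples, which is why oversampling of this kind adds plausible minority examples rather than exact duplicates.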

## Model Definition

Documentation for autokeras - https://autokeras.com/tutorial/structured_data_classification/

**Note to users: 
You do not need to run the code cell below. It was used during model training. The best model is shared as an ".h5" file that can be loaded to predict on the desired period in the "User section" later in this notebook.**

In [None]:
# Autokeras
def auto_keras(X_train,y_train):

  # build model
  model = StructuredDataClassifier(max_trials=20,
                                  overwrite=True,
                                  objective=Objective('val_true_positives', direction='max'),
                                  metrics=["TruePositives"]
                                  )

  # fit model
  model.fit(x=X_train,y=y_train)

  return model

## Model Run

**Note to users: 
You do not need to run the code cell below. It was used during model training. The best model is shared as an ".h5" file that can be loaded to predict on the desired period in the "User section" later in this notebook.**

In [None]:
# Train one model per expanding time window (periods 1..p). The data is time sensitive, so the best model has to be found across windows
for p in range(1,25):
  
  # prepare data for current iteration
  X_train,y_train = prepare_train_data(df_model,p)

  # borderline SMOTE
  X_train, y_train = borderline_sample(X_train,y_train)
  
  # fit model
  model = auto_keras(X_train,y_train)

  # get the best performing model
  exprt = model.export_model()

  # print summary
  #exprt.summary()

  # save the best performing model to file
  exprt.save('Autokeras_Saved_Models/Model trained upto period ' + str(p) + '.h5')

  # print training iteration
  print('Trained period upto= ',p)

## Model Evaluation

**Note to users: 
You do not need to run the code cell below. It was used during model training. The best model is shared as an ".h5" file that can be loaded to predict on the desired period in the "User section" later in this notebook.**

In [None]:
# mean precision score per model, used to pick the best model
mean_prec_dict = {}

# load all models one by one
for mdl in range(1,24):

  # load model
  model = tf.keras.models.load_model('Autokeras_Saved_Models/Model trained upto period ' + str(mdl) + '.h5')

  # precision per test period for the current model
  # (created once per model so that scores accumulate across periods)
  prec_dict = {}

  # forward chaining evaluation: test only on periods after the training window
  for p in range(mdl+1,26):

    # prepare test data for each iteration
    X_test,y_test = prepare_test_data(df_model,p)

    # predict on test - this predicts probabilities
    y_hat = model.predict(X_test)

    # convert probabilities to classes using threshold = 0.5
    y_hat = np.where(y_hat > 0.5, 1, 0)

    # precision score - the aim is to reduce false +ve
    prec = precision_score(y_test, y_hat)

    # store scores in a dictionary
    prec_dict['mdl=' + str(mdl) + '_test_p=' + str(p)] = prec

  print(prec_dict)

  # mean precision of this model across all of its test periods
  mean_prec_dict['Model=' + str(mdl)] = np.mean(list(prec_dict.values()))
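
The forward-chaining scheme used in the evaluation loop can be made explicit by listing which test periods each model is scored on. The sketch below is library-free; the function name `forward_chaining_pairs` is hypothetical, and the values 23 and 25 are simply read off the loop bounds in the cell above.

```python
def forward_chaining_pairs(n_models, last_period):
    """List (trained_up_to, test_period) pairs with test_period > trained_up_to."""
    return [(m, p)
            for m in range(1, n_models + 1)
            for p in range(m + 1, last_period + 1)]

pairs = forward_chaining_pairs(n_models=23, last_period=25)
print(len(pairs))   # total number of (model, test period) evaluations
print(pairs[:3])    # first evaluations of model 1
print(pairs[-2:])   # the only two evaluations of model 23
```

Models trained on longer windows are scored on fewer periods, so their mean precision is averaged over fewer values than that of the early models.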

## User section - load best model and predict

Running the model run and model evaluation cells above shows how each model performs across the different test periods. Model 6 had the highest mean precision score of all models, so it is selected as the final model.
Run the cell below to load model 6 and predict on any period you like. Model 6 is shared as a separate ".h5" file. Make sure it is present in the "Autokeras_Saved_Models" folder of your working directory.

Model 6 was trained on the first 6 periods only, yet it yielded the highest mean precision score. This suggests seasonal behaviour in the data and that the data is indeed time sensitive.
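
The selection rule described here (take the model with the highest mean precision) can be sketched as follows. The precision values are made up purely for illustration and are not the scores obtained in this notebook; the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical per-period precision scores, keyed by model number (made-up values)
precision_by_model = {
    5: [0.40, 0.50, 0.45],
    6: [0.60, 0.70, 0.65],
    7: [0.55, 0.50, 0.60],
}

# Average each model's scores, then pick the model with the highest average
mean_precision = {m: mean(scores) for m, scores in precision_by_model.items()}
best_model = max(mean_precision, key=mean_precision.get)
print(best_model, round(mean_precision[best_model], 2))
```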

In [None]:
# load model
m = 6
model = tf.keras.models.load_model('Autokeras_Saved_Models/Model trained upto period ' + str(m) + '.h5')

# prepare test data for the period to predict - change p to whichever period you want (for example period 25 and onwards for new data). Make sure not to change the format of the ML_input.csv file.
p = 7
X_test,y_test = prepare_test_data(df_model,p)

# predict on test - this predicts probabilities
y_hat = model.predict(X_test)

# convert probabilities to classes using threshold = 0.5. User can play around with this threshold of 0.5 to increase true positives.
y_hat = np.where(y_hat > 0.5, 1, 0)

# plot confusion matrix
cm=confusion_matrix(y_test,y_hat)
ax= plt.subplot()
sns.heatmap(cm, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['Low P','High P'])
ax.yaxis.set_ticklabels(['Low P','High P'])

# precision score
prec = precision_score(y_test, y_hat)
print("Precision: ", round(prec*100,2), "%")

# accuracy
acc = accuracy_score(y_test, y_hat)
print("Accuracy: ", round(acc*100,2), "%")

# cohen kappa score
kappa = cohen_kappa_score(y_test, y_hat)
print("Kappa: ", round(kappa*100,2), "%")

# f beta score, beta = 0.5 to increase weightage to precision - want to reduce false +ve
f_beta = fbeta_score(y_test, y_hat,beta=0.5)
print("F-0.5: ", round(f_beta*100,2), "%")

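# roc auc score - computed here on thresholded class labels, so it equals balanced accuracy; pass predicted probabilities instead for a threshold-free AUC
auc = roc_auc_score(y_test, y_hat)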
print("AUC: ", round(auc*100,2), "%")

# END
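
The 0.5 threshold used in the user section can be tuned. The library-free sketch below uses made-up labels and probabilities to show how lowering the threshold can raise the number of true positives while precision moves in either direction. The helper `precision_and_tp` and all values are hypothetical and not part of the notebook.

```python
def precision_and_tp(y_true, y_prob, threshold):
    """Return (precision, true positives) after thresholding probabilities."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 1)
    fp = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return precision, tp

y_true = [1, 0, 1, 0, 1, 0]                  # made-up true labels
y_prob = [0.9, 0.6, 0.45, 0.35, 0.3, 0.1]    # made-up predicted probabilities

for thr in (0.5, 0.4, 0.25):
    precision, tp = precision_and_tp(y_true, y_prob, thr)
    print(thr, round(precision, 2), tp)
```

In this toy data the true-positive count grows from 1 to 3 as the threshold drops, while precision first rises and then falls, so the threshold is best chosen against whichever metric matters most for the use case.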