##**Scenario:** Predict Employee Attrition using Classification Algorithms


###**Dataset Description**
The data set contains the following attributes:

- **satisfaction_level** 
- **last_evaluation**
- **number_project**
- **average_montly_hours**
- **time_spend_company**
- **Work_accident**
- **quit**
- **promotion_last_5years**
- **department**
- **salary**

In [None]:
  #Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport
sns.set()


import warnings
warnings.filterwarnings("ignore")    
import os

print('Libraries Imported')

In [None]:
#Loading the dataset

df = pd.read_csv('employee_data.csv?dl=0')

df.head() #Printing the first 5 rows of dataframe

###**Exploratory Data Analysis**

**Note:** If you want to learn more about Pandas-Profiling [**Click Here!**](https://pypi.org/project/pandas-profiling/)

In [None]:
import pandas_profiling
from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html') #Generating a Data Report


In [None]:
ProfileReport(df).to_notebook_iframe()

Please refer to the generated HTML report, **output.html**

___
**Observations:**

- The dataframe has **10** variables (features) and **14999** rows (instances)
- There are **5** Numeric, **3** Boolean, and **2** Categorical variables
- The **2** Categorical variables are **department** and **salary**
- The **salary** column takes the values **low**, **medium**, and **high**
- There are no missing cells in the dataset which is a big relief
___

In [None]:
import plotly.express as px
fig = px.histogram(df, x = 'average_montly_hours')
fig.show()

___
**Observations:**
- Most employees work between 125 and 265 hours per month
- Very few employees work fewer than 140 or more than 265 hours per month
___

In [None]:
fig = px.histogram(df, x = 'satisfaction_level')
fig.show()

___
**Observations:**
- More than 800 employees are not satisfied with their work and may leave the company
- Most of the employees are quite content with their job
___

**What's the Attrition percentage in the company?**


In [None]:
plt.figure(figsize=(12,8))

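ax = sns.countplot(x=df["quit"], color='green')  # pass the column by keyword; recent seaborn versions treat the first positional argument as data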
for p in ax.patches:
    x = p.get_bbox().get_points()[:,0]
    
    y = p.get_bbox().get_points()[1,1]
    
    ax.annotate('{:.2g}%'.format(100.*y/len(df)), (x.mean(), y), ha='center', va='bottom')
plt.show()

___
**Observations:**
- 76% of employees did not leave the organization while 24% did leave
___

**Which Department of the company has the highest Attrition rate?**

In [None]:
plt.figure(figsize=(12,8))

sns.countplot(data=df,x=df['department'],hue="quit")

plt.xlabel('Departments')
plt.ylabel('Frequency')

plt.show()

___
**Observations:**
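- The **sales** department has the highest number of employees leaving, followed by the **technical** and **support** departments
- The plot shows counts rather than rates, so larger departments naturally show more leavers
- **Management** recorded the lowest number of employees leaving the company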
___
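
A count plot answers "where do most leavers come from", while the question above asks about the attrition rate. The following is a minimal sketch of how the rate per department could be computed; the tiny `toy` frame and its values are hypothetical and only stand in for the notebook's `df`.

```python
import pandas as pd

# Hypothetical toy data; in the notebook the real frame is `df`.
toy = pd.DataFrame({
    "department": ["sales"] * 8 + ["hr"] * 2,
    "quit":       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of leavers within each group.
rate_by_dept = toy.groupby("department")["quit"].mean().sort_values(ascending=False)

# The sum of the same column is the number of leavers (what a count plot shows).
count_by_dept = toy.groupby("department")["quit"].sum()

print(rate_by_dept)
print(count_by_dept)
```

In this toy example sales has more leavers in absolute terms, but hr has the higher rate, which is why the two views can rank departments differently.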

####**Bi-variate Distributions**
- A Bi-variate distribution is a distribution of two random variables
- The concept generalizes to any number of random variables, giving a **Multivariate Distribution**

**How does salary affect the attrition rate?**

In [None]:
df_new = pd.crosstab(df['salary'], df['quit'])

df_new.plot(kind = 'bar')

plt.title('Employee Attrition Frequency based on Salary')
plt.xlabel('Salary')
plt.ylabel('Frequency')

plt.show()

___
**Observations:**

- People with a **low** salary are more likely to quit than people with **medium** and **high** salaries
- People with a **high** salary are much less likely to leave the organization
- Salary seems to be a significant factor in determining the turnover rate in employees
___

**Do experienced employees tend to leave the company if they are not satisfied?**

In [None]:
px.scatter(df, x=df['satisfaction_level'],y=df['time_spend_company'],color=df['quit'])

**Which department executes the most projects?**

In [None]:
fig = px.box(df, x="department",y="number_project")
fig.show()

###**Create Training and Testing Set**

In [None]:
X = df.drop('quit', axis = 1)
y = df.quit

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size=0.2, stratify = y)

###**Data Pre-processing**

####**Encode Categorical Variables**

The dataset contains **2** Categorical Variables:

- **department**
- **salary**

We have to encode them before modelling because scikit-learn estimators do not accept string data as input.
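
One thing to watch when the training and test sets are one-hot encoded separately: if a category is missing from one split, the two frames end up with different columns. The sketch below shows one possible way to guard against that with `reindex`; the two small frames are hypothetical and not taken from the data set.

```python
import pandas as pd

# Hypothetical splits: the test split happens to contain no "high" salary.
train = pd.DataFrame({"salary": ["low", "medium", "high"]})
test = pd.DataFrame({"salary": ["low", "medium"]})

train_enc = pd.get_dummies(train, columns=["salary"])
test_enc = pd.get_dummies(test, columns=["salary"])

# Force the test columns to match the training columns;
# dummy columns absent from the test split are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

With a stratified split of roughly 15,000 rows every category will most likely appear in both sets, so this is a safety net rather than a required step.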

In [None]:
cat_vars = ['department', 'salary']

for col in cat_vars:  # "col" avoids shadowing the built-in vars()
  cat_list = pd.get_dummies(X_train[col], prefix=col)
  X_train = X_train.join(cat_list)

In [None]:
cat_vars = ['department', 'salary']

for col in cat_vars:  # "col" avoids shadowing the built-in vars()
  cat_list = pd.get_dummies(X_test[col], prefix=col)
  X_test = X_test.join(cat_list)

In [None]:
#Let us drop the department and salary columns

X_train.drop(columns=['department', 'salary'], axis = 1, inplace=True)
X_train.shape

In [None]:
X_test.drop(columns=['department', 'salary'], axis = 1, inplace=True)
X_test.shape

###**Build an Interactive Decision Tree Model**

[**Click Here!**](https://ipywidgets.readthedocs.io/en/latest/) to learn more about **ipywidgets**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.tree import export_graphviz # display the tree within a Jupyter notebook
from IPython.display import SVG
from graphviz import Source
from IPython.display import display
from ipywidgets import interactive, IntSlider, FloatSlider, interact
import ipywidgets
from IPython.display import Image
from subprocess import call
import matplotlib.image as mpimg

In [None]:
@interact  # Placing "@interact" immediately before a function definition makes the function interactive
def plot_tree(
    crit = ['gini', 'entropy'],
    split = ['best','random'],
    depth = IntSlider(min = 1, max = 25, value =2, continuous_update = False),
    # min_split is the minimum number of samples required to split an internal node;
    # scikit-learn requires an integer value of at least 2
    min_split = IntSlider(min = 2, max = 5, value =2, continuous_update = False),
    min_leaf = IntSlider(min = 1, max = 5, value =1, continuous_update = False)):
  
  estimator = DecisionTreeClassifier(criterion=crit,
                                     splitter=split,
                                     max_depth = depth,
                                     min_samples_split = min_split,
                                     min_samples_leaf = min_leaf
                                     )
  estimator.fit(X_train, y_train)
  print('Decision Tree Training Accuracy:', accuracy_score(y_train, estimator.predict(X_train)))
  print('Decision Tree Testing Accuracy:', accuracy_score(y_test, estimator.predict(X_test)))

  a = accuracy_score(y_train, estimator.predict(X_train))
  b = accuracy_score(y_test, estimator.predict(X_test))

  if a > 0.99:
    print('Decision Tree Training Accuracy',a, 'Decision Tree Testing Accuracy', b)
    print('Criterion:',crit,'\n', 'Split:', split,'\n', 'Depth:', depth,'\n', 'Min_split:', min_split,'\n', 'Min_leaf:', min_leaf,'\n')

  #Let us use GraphViz to export the model and display it as an image on the screen
  graph = Source(tree.export_graphviz(estimator, out_file=None, 
                                      feature_names = X_train.columns,
                                      class_names = ['stayed', 'quit'],
                                      filled = True))
  
  display(Image(data=graph.pipe(format = 'png')))
  

**Advantages** & **Disadvantages** of Decision Tree:

**Advantages:**
- Interpretable and easy to understand
- Can Handle Missing Values
- Feature Selection happens automatically


**Disadvantages:**
- Prone to overfitting, especially when the tree is allowed to grow deep
- High variance: the fitted tree depends heavily on the particular training sample
- Small changes in the data can greatly affect predictions
 



**One problem with Decision Trees is that they have Low Bias and High Variance, which means they are prone to overfitting on the training set**



Now, let us see what **Underfit**, **Goodfit**, and **Overfit** mean:

- **Underfit**
  - Model has not learned anything
  - **Training Accuracy**: **54%**
  - **Testing Accuracy**: **49%**

- **Overfit**
  - Model has memorized everything
  - **Training Accuracy**: **99%**
  - **Testing Accuracy**: **46%**

- **Goodfit**
  - Model performs well on both the training and the testing data
  - **Training Accuracy**: **93%**
  - **Testing Accuracy**: **91%**
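
The three regimes can be seen by comparing training and testing accuracy as the tree depth changes. The following is only a sketch on synthetic data from `make_classification` (not the attrition data set), so the actual numbers will differ from the illustrative percentages above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise; an assumption made for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

scores = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print("max_depth =", depth, "train/test accuracy =", scores[depth])
```

A large gap between the two numbers for the unrestricted tree is the signature of overfitting, while low values on both sides point to underfitting.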



Now, let's use a Random Forest Classifier to reduce the variance problem and get a result that generalizes better

###**Build an Interactive Random Forest Model**

In [None]:
@interact
def plot_tree_rf(crit= ['gini','entropy'],
                 bootstrap= [True, False],  # booleans, not the strings 'True'/'False' (any non-empty string is truthy)
                 depth=IntSlider(min= 1 ,max= 20,value=3, continuous_update=False),
                 forests=IntSlider(min= 1,max= 1000,value= 100,continuous_update=False),
                 min_split=IntSlider(min= 2,max= 5,value= 2, continuous_update=False),
                 min_leaf=IntSlider(min= 1,max= 5,value= 1, continuous_update=False)):
  
  estimator = RandomForestClassifier(
      random_state = 1,
      criterion = crit,
      bootstrap = bootstrap,
      n_estimators = forests,
      max_depth = depth, 
      min_samples_split = min_split,
      min_samples_leaf = min_leaf,
      n_jobs = -1,
      verbose = False)
  
  estimator.fit(X_train, y_train)

  print('Random Forest Training Accuracy:', accuracy_score(y_train, estimator.predict(X_train)))
  print('Random Forest Testing Accuracy:', accuracy_score(y_test, estimator.predict(X_test)))  

  a = accuracy_score(y_train, estimator.predict(X_train))
  b = accuracy_score(y_test, estimator.predict(X_test))

  if a > 0.99:
    print('Random Forest Training Accuracy',a, 'Random Forest Testing Accuracy', b)
    print('Criterion:',crit,'\n', 'Bootstrap:', bootstrap,'\n', 'Depth:', depth,'\n', 'forests:', forests,'\n', 'Min_split:', min_split,'\n', 'Min_leaf:', min_leaf,'\n')


**Advantages** & **Disadvantages** of Random Forest:

**Advantages:**
- Less prone to overfitting than a single decision tree
- Runs efficiently on large data sets
- Often gives better accuracy than a single decision tree and many other classifiers

**Disadvantages:**
- Computationally slower than a single decision tree
- Can be biased when dealing with categorical variables
- Although the risk is much lower than with decision trees, overfitting is still possible



###**Implement GridSearchCV and RandomizedSearchCV Model**

**GridSearchCV**

In [None]:
#GridSearchCV
from sklearn.model_selection import GridSearchCV

rfc=RandomForestClassifier(random_state=42)

param_grid = { 
    'n_estimators': [200, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' (the same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5)
CV_rfc.fit(X_train, y_train)



In [None]:
CV_rfc.best_params_

In [None]:
rfc1=RandomForestClassifier(random_state=42, max_features='sqrt', n_estimators= 200, max_depth=8, criterion='gini')  # 'sqrt' replaces the removed alias 'auto'

In [None]:
rfc1.fit(X_train, y_train)

In [None]:
pred=rfc1.predict(X_test)

In [None]:
print("Accuracy for Random Forest on test data: ",accuracy_score(y_test,pred))

In [None]:
from sklearn.metrics import  accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
gd_roc=roc_auc_score(y_test, pred)
gd_acc = accuracy_score(y_test, pred)
gd_prec = precision_score(y_test, pred)
gd_rec = recall_score(y_test, pred)
gd_f1 = f1_score(y_test, pred)
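
Copying the best parameters by hand into a new classifier works, but a fitted search object also exposes the winning model directly. The sketch below illustrates this on synthetic data with a deliberately tiny grid so that it runs quickly; the data and grid values are assumptions, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the encoded attrition features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, 4]},
    cv=3,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ has already been refit on all of X_tr.
best_model = search.best_estimator_
print(search.best_params_)
print("test accuracy:", best_model.score(X_te, y_te))
```

Using `best_estimator_` avoids the risk of the hard-coded parameters drifting out of sync with what the search actually found.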

**RandomizedSearchCV**

In [None]:
#Randomized Search CV

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['sqrt', 'log2']  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)
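
To see why a randomized search is attractive here, it helps to count how many models an exhaustive search over this grid would need. The short calculation below uses the number of candidate values per hyperparameter in `random_grid`, together with the `n_iter = 100` and `cv = 3` settings used for `RandomizedSearchCV` in this notebook.

```python
from math import prod

# Number of candidate values per hyperparameter in random_grid.
option_counts = {
    "n_estimators": 10,      # 200, 400, ..., 2000
    "max_features": 2,
    "max_depth": 12,         # 10, 20, ..., 110 plus None
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "bootstrap": 2,
}

total_combinations = prod(option_counts.values())

n_iter, cv = 100, 3                       # settings used for RandomizedSearchCV
random_search_fits = n_iter * cv          # models trained by the randomized search
exhaustive_fits = total_combinations * cv # models an exhaustive grid search would train

print(total_combinations, random_search_fits, exhaustive_fits)
```

The randomized search therefore trains 300 models instead of the 12,960 a full grid search would need, at the cost of sampling only a fraction of the 4,320 combinations.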

In [None]:
#Base Model
rfc=RandomForestClassifier(random_state=42)

rf_random = RandomizedSearchCV(estimator = rfc, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)


In [None]:
rf_random.best_params_

In [None]:
rfc1=RandomForestClassifier(random_state=42, bootstrap = False,
 max_depth = 70,
 max_features= 'sqrt',
 min_samples_leaf = 1,
 min_samples_split= 5,
 n_estimators= 1600)


In [None]:
rfc1.fit(X_train, y_train)

In [None]:
pred=rfc1.predict(X_test)

###**Model Evaluation**

**Accuracy:** The number of correct predictions divided by the total number of predictions made

**When to use Accuracy:**

Accuracy is a good measure when the target variable classes in the data are nearly balanced, for example survival on the Titanic (roughly 40% survived, 60% did not).
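
The attrition data is less balanced than that: the count plot earlier showed 76% of employees staying and 24% quitting. A small sketch of why accuracy alone can mislead in that situation, using a hypothetical "model" that always predicts the majority class:

```python
# Class shares taken from the earlier count plot: 76% stayed (0), 24% quit (1).
y_true = [0] * 76 + [1] * 24

# A baseline that always predicts "stayed", regardless of the input.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
leavers_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print("accuracy:", accuracy)
print("leavers correctly identified:", leavers_found)
```

The baseline reaches 76% accuracy without identifying a single leaver, so any real model should be judged against that floor and with the additional metrics discussed next.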

In [None]:
print("Accuracy: ",accuracy_score(y_test,pred))

**Confusion Matrix:**
 Gives the Performance of a classification model on a set of test data for which the true values are known.

A way to visualize **Precision** and **Recall**

**When to use Confusion Matrix:** When we have an Imbalanced Classification Task



In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

- **Precision:**
  - What percentage of positive predictions made were correct? This is **Precision**
  - No. of True Positives divided by the no. of True Positives plus the No. of False Positives
 
- **Recall:** Ratio of True Positives to all the positives in your Dataset

- **When to use Precision & Recall:** 
 - In the credit card fraud detection task, let's say we modify the model slightly so that it flags only a single transaction, and that transaction really is fraud. 

 - Now, our precision will be 1.0 (no false positives) but our recall will be very low because we will still have many false negatives. 

 - If we go to the other extreme and classify all transactions as fraud, we will have a recall of 1.0 — we’ll catch every fraud transaction — but our precision will be very low and we’ll misclassify many legit transactions. In other words, as we increase precision we decrease recall and vice-versa.

- **F1-Score:**
 F1 Score is the harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall). It is usually more useful than accuracy, especially when we have an uneven class distribution

 - **When to use F1-Score:** 
   - Useful when you have data with imbalanced classes
   - Let us say, we have a model with a precision of 1, and recall of 0 which gives a simple average as 0.5 and an F1 score of 0
   - If one of the parameters is low, the second one no longer matters in the F1 score 
   - The F1 score favors classifiers that have similar precision and recall
   - F1 score is a better measure to use if you are seeking a balance between Precision and Recall
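
The two extremes from the fraud example can be worked through numerically. The sketch below computes precision, recall, and F1 (the harmonic mean, 2PR / (P + R)) from confusion-matrix counts; the function name and the transaction counts are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical data: 100 fraudulent and 9,900 legitimate transactions.
one_flagged = precision_recall_f1(tp=1, fp=0, fn=99)       # flag a single, truly fraudulent transaction
all_flagged = precision_recall_f1(tp=100, fp=9900, fn=0)   # flag every transaction as fraud

print("one flagged:", one_flagged)
print("all flagged:", all_flagged)
```

In both extremes one of the two metrics is perfect while the other is 0.01, and F1 stays below 0.02 in each case, which is the "if one is low, the other no longer matters" behaviour described above.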


In [None]:
from sklearn.metrics import  accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

roc=roc_auc_score(y_test, pred)
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
f1 = f1_score(y_test, pred)

results = pd.DataFrame([['RandomizedSearchCV', acc,prec,rec, f1,roc], 
                        ['GridSearchCV',gd_acc, gd_prec, gd_rec, gd_f1, gd_roc]],
               columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1 Score','ROC'])
results
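
A note on the ROC column: `roc_auc_score` is computed above from hard 0/1 predictions, which evaluates the model at a single threshold only. ROC AUC is normally computed from scores or probabilities so that the whole ranking is taken into account. The following sketch, on synthetic imbalanced data rather than the attrition data set, shows both variants side by side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 76/24 class split; an assumption for illustration.
X, y = make_classification(n_samples=1000, weights=[0.76, 0.24], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Hard labels: the ROC curve collapses to a single operating point.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# Probabilities of the positive class: the full ranking is evaluated.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_from_labels, auc_from_scores)
```

If the probability-based value is used, it should be used for every model in the comparison table so that the rows stay comparable.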

**Saving the model and dumping it to a pickle file**

In [None]:
import pickle 

filename = 'final_model.sav'
with open(filename, 'wb') as f:  # the context manager closes the file after writing
    pickle.dump(rfc1, f)
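
Saving is only half of the round trip. The sketch below shows writing and reading a pickle file with context managers; a plain dictionary stands in for the fitted classifier, and the temporary directory is an assumption made so the example does not touch the working folder.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted classifier rfc1.
model_like = {"name": "random_forest", "n_estimators": 1600}

path = os.path.join(tempfile.mkdtemp(), "final_model.sav")

with open(path, "wb") as f:   # file is closed automatically, even on error
    pickle.dump(model_like, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_like)
```

Pickle files should only be loaded from sources you trust, and a pickled scikit-learn model is best loaded with the same library version that created it.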
 

###**Interpreting Employee Attrition Prediction With SHAP**


**SHAP** (SHapley Additive exPlanations): breaks down a prediction to show the impact of each feature

**Install SHAP**: pip install shap

In [None]:
!pip install shap

In [None]:
import shap

**shap.summary_plot function**

- Produces the variable importance plot
- A variable importance plot lists the most significant variables in descending order
- The top variables contribute more to the model than the bottom ones and thus have high predictive power

In [None]:

shap_values = shap.TreeExplainer(rfc1).shap_values(X_train)
shap.summary_plot(shap_values, X_train, plot_type="bar")

**Dependency Plot**

- Shows the effect a single feature has on the prediction
- How much the prediction depends on a particular feature
- shap.dependence_plot(indexoffeature,matrix_shap_values,dataset_matrix)

In [None]:
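sv_quit = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]  # SHAP values for class 1 ("quit"); older SHAP returns a list, newer a 3-D array
shap.dependence_plot('satisfaction_level', sv_quit, X_train)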

###**PyCaret**


Use **PyCaret** to find the best model and perform Automatic Hyperparameter tuning

**NOTE:** It is often used in industry as a directional tool

**PyCaret** is an open source, low-code machine learning library in **Python** that allows you to go from preparing your data to deploying your model within minutes in your choice of notebook environment

[**Click Here!**](https://pycaret.org/) to learn more about **PyCaret**

**Installing PyCaret**

- !pip install pycaret

####**Tasks to be performed**

- Import PyCaret and load the data set
- Initialize or setup the environment 
- Compare Multiple Models and their Accuracy Metrics
- Create the model
- Tune the model
- Evaluate the model


####**Import PyCaret and load the data set**

In [None]:
!pip install pycaret

In [None]:
import pycaret.classification as pc
#dir(pc)

In [None]:
#Loading the dataset
import pandas as pd
df = pd.read_csv('/content/employee_data.csv?dl=0')

df.head() #Printing the first 5 rows of dataframe

In [None]:
df['department'].unique()

####**Initialize or setup the environment**

In [None]:
pc.setup(df, target='quit')

___
**Observations:**
- The target type (Serial No. 2) is **Binary** because we have two values in **quit** column i.e., **0** and **1**
- The data contains **3** Numeric Features and **6** Categorical Features
___

####**Compare Multiple Models and their Accuracy Metrics**

In [None]:
pc.compare_models()

**Note:** Don't worry about the models. You will learn most of them in the coming modules

####**Create the Model**



In [None]:
rf_model = pc.create_model('rf') #Performs K-Fold (10) CV for the selected model

####**Tune the Model**

In [None]:
tuned_rf = pc.tune_model(rf_model)

In [None]:
print(rf_model)

In [None]:
print(tuned_rf)

See the difference between the original model (**rf_model**) and the tuned model (**tuned_rf**)

####**Evaluate the Model**

In [None]:
tuned_rf_eval = pc.evaluate_model(tuned_rf)

###**Deploying the Model Using Streamlit**

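Go to your local system and use a text editor such as **Sublime Text** to deploy your app using the pickle file generated

- **Save the next cell as a .py file**
- **Run it in your local system** (streamlit run filename.py)
- The script loads a PyCaret model named **Final_model**, so the tuned model must first be saved in that format (for example with `pc.save_model(tuned_rf, 'Final_model')`); the **final_model.sav** file written with pickle earlier is a different file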

In [None]:
from pycaret.classification import load_model, predict_model
import streamlit as st
import pandas as pd
import numpy as np

#Loading Trained Model
model = load_model('Final_model')


def predict(model, input_df):
    predictions_df = predict_model(estimator=model, data=input_df)
    #predict_model function takes a trained model object and the dataset to predict
    
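    # The label column is named 'Label' in PyCaret 2.x and 'prediction_label' in PyCaret 3.x
    label_col = 'prediction_label' if 'prediction_label' in predictions_df.columns else 'Label'
    predictions = predictions_df[label_col][0]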
    return predictions


st.title('Employee Attrition Prediction Web App')

satisfaction_level=st.number_input('satisfaction_level' , min_value=0.1, max_value=1.0, value=0.1)
last_evaluation =st.number_input('last_evaluation',min_value=0.1, max_value=1.0, value=0.1)
number_project = st.number_input('number_project', min_value=0, max_value=50, value=5)
time_spend_company = st.number_input('time_spend_company', min_value=1, max_value=10, value=3)
Work_accident = st.number_input('Work_accident',  min_value=0, max_value=1, value=0)  # 0/1 flag
promotion_last_5years = st.number_input('promotion_last_5years',  min_value=0, max_value=1, value=0)  # 0/1 flag
salary = st.selectbox('salary', ['low', 'high','medium'])
average_montly_hours = st.number_input('average_montly_hours',  min_value=96, max_value=310, value=100)
department = st.selectbox('department', ['sales', 'accounting', 'hr', 'technical', 'support', 'management',
       'IT', 'product_mng', 'marketing', 'RandD'])


output=""

input_dict={'satisfaction_level':satisfaction_level,'last_evaluation':last_evaluation,'number_project':number_project,'time_spend_company':time_spend_company,'Work_accident': Work_accident,'promotion_last_5years':promotion_last_5years,'salary' : salary, 'average_montly_hours':average_montly_hours, 'department':department}


input_df = pd.DataFrame([input_dict])
print(input_df)

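# Show a result only after the button is pressed; otherwise the "stay" message
# would be displayed on every rerun before any prediction is made.
if st.button("Predict"):
    output = str(predict(model=model, input_df=input_df))
    if output == '1':
        st.success('Employee will leave the company')
    else:
        st.success('Employee will stay')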