**Case study: Logistic Regression**

**Problem Statement:**

A large company, XYZ, employs around 4000 people at any given time. However, every year around 15% of its employees leave the company and need to be replaced from the talent pool available in the job market. The management believes that this level of attrition is bad for the company, for the following reasons:
1. The former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners.
2. A sizeable department has to be maintained, for the purposes of recruiting new talent.
3. More often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company.
 
Hence, the management has contracted an analytics firm to understand what factors they should focus on, in order to curb attrition. Also, they want to know which of these variables is most important and needs to be addressed right away.

**Objective:**

You are required to model the probability of attrition using a logistic regression. The results thus obtained will be used by the management to understand what changes they should make to their workplace, in order to get most of their employees to stay.
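To make "modelling the probability of attrition" concrete: logistic regression computes a linear score (the log-odds) and passes it through the logistic function to obtain a probability between 0 and 1. The sketch below is only an illustration; `intercept` and `coef_age` are made-up numbers, not coefficients from the model fitted later in this notebook.

```python
import math

def attrition_probability(log_odds):
    """Logistic (sigmoid) function: converts log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients on a standardised predictor; NOT fitted values.
intercept = -1.7
coef_age = -0.3

for scaled_age in (-1.0, 0.0, 1.0):
    log_odds = intercept + coef_age * scaled_age
    print(scaled_age, round(attrition_probability(log_odds), 3))
```

A negative coefficient means the predicted probability falls as the predictor increases, which is how the signs in the model summaries further down can be read.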

**Data analysis process:**

Analysis will be done as follows.

1. Understanding data
2. Data cleaning (removing duplicates, treating outliers and missing values, etc.)
3. Exploratory data analysis
4. Model building
5. Model evaluation

In [None]:
# Load the following packages.

import numpy as np
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import r2_score, accuracy_score, confusion_matrix
import kds

**Understanding the data**

In [None]:
# Loading the given datasets.

employee_survey_data = pd.read_csv("employee_survey_data.csv")

general_data = pd.read_csv("general_data.csv")

in_time = pd.read_csv("in_time.csv")

out_time = pd.read_csv("out_time.csv")

manager_survey_data = pd.read_csv("manager_survey_data.csv")

In [None]:
# View the structure of dataframe employee_survey_data.

employee_survey_data

In [None]:
# View the structure of dataframe general_data.

general_data

In [None]:
# View the structure of dataframe in_time.

in_time

In [None]:
# View the structure of dataframe out_time.

out_time

In [None]:
# View the structure of dataframe manager_survey_data.

manager_survey_data

**Data Preparation**

In [None]:
## Treating the columns of "in_time" dataframe.

# Renaming column "Unnamed: 0" as "EmployeeID".

in_time.rename(columns = {'Unnamed: 0':'EmployeeID'}, inplace = True)

# Storing "EmployeeID" in "in_id" df.

in_id = in_time["EmployeeID"]

# Dropping "EmployeeID" from "in_time" df.

in_time.drop(["EmployeeID"], axis = 1, inplace = True)

# Converting all columns to datetime datatype and storing them in new df as "new_intime".

new_intime = in_time.apply(lambda x :pd.to_datetime(x))

# Extracting hours from all columns.

intime_hr = new_intime.apply(lambda x : x.dt.hour)

# Calculating row wise mean of hours.

mean_hrs = intime_hr.mean(axis = 1)

# Joining "in_id" df that contains "EmployeeID" with df "mean_hrs".

intime_df = pd.concat([in_id, mean_hrs], axis = 1)

# Assigning name to column containing mean hours as "in_time_mean".

intime_df.rename(columns = {intime_df.columns[1]:'in_time_mean'}, inplace = True)

# View intime_df.

intime_df

In [None]:
## Treating the columns of "out_time" dataframe.

# Renaming column "Unnamed: 0" as "EmployeeID".

out_time.rename(columns = {'Unnamed: 0':'EmployeeID'}, inplace = True)

# Storing "EmployeeID" in "out_id" df.

out_id = out_time["EmployeeID"]

# Dropping "EmployeeID" from "out_time" df.

out_time.drop(["EmployeeID"], axis = 1, inplace = True)

# Converting all columns to datetime datatype and storing them in new df as "new_outtime".

new_outtime = out_time.apply(lambda x :pd.to_datetime(x))

# Extracting hours from all columns.

outtime_hr = new_outtime.apply(lambda x : x.dt.hour)

# Calculating row wise mean of hours.

mean_hrs_out = outtime_hr.mean(axis = 1)

# Joining "out_id" df that contains "EmployeeID" with df "mean_hrs_out".

outtime_df = pd.concat([out_id, mean_hrs_out], axis = 1)

# Assigning name to column containing mean hours as "out_time_mean".

outtime_df.rename(columns = {outtime_df.columns[1]:'out_time_mean'}, inplace = True)

# View outtime_df.

outtime_df

In [None]:
# Merging all data frames by "EmployeeID".

merge_1 = pd.merge(employee_survey_data, general_data, on = "EmployeeID")

merge_2 = pd.merge(merge_1, manager_survey_data, on = "EmployeeID")

merge_3 = pd.merge(merge_2, intime_df , on = "EmployeeID")

merge_4 = pd.merge(merge_3, outtime_df, on = "EmployeeID")

In [None]:
# Calculating "working_hours" using "in_time_mean" and "out_time_mean".

merge_4['working_hours'] = merge_4['out_time_mean'] - merge_4['in_time_mean']
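Note that the in-time and out-time means above are built from the hour component only, so minutes and seconds are discarded before averaging. The sketch below compares the hour-only difference with the exact duration for a single day. The timestamp layout in `FMT` and the two sample timestamps are assumptions for illustration; the actual CSV format may differ.

```python
from datetime import datetime

# Assumed timestamp layout; the real in_time.csv / out_time.csv may differ.
FMT = "%Y-%m-%d %H:%M:%S"

clock_in = datetime.strptime("2015-01-02 09:43:45", FMT)
clock_out = datetime.strptime("2015-01-02 17:25:10", FMT)

# Hour-only difference, as in the cells above: minutes and seconds are ignored.
hour_only = clock_out.hour - clock_in.hour

# Exact duration in hours, using the full timestamps.
exact = (clock_out - clock_in).total_seconds() / 3600

print(hour_only, round(exact, 2))
```

Averaged over many days the truncation error partly cancels out, but the exact duration is the safer choice if working hours turn out to be an important predictor.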

In [None]:
# Storing "merge_4" df as "final_data" for further analysis.

final_data = merge_4

In [None]:
# Checking if there are missing values in final_data.

final_data.isnull().sum()

# Some columns contain missing values, but only a few rows are affected. Hence we will drop those rows instead of imputing them with mode or median values.

In [None]:
# Removing rows that have a missing value in any of these columns.

final_data = final_data.dropna(axis = 0, subset = ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'NumCompaniesWorked', 'TotalWorkingYears'])

In [None]:
# Converting categorical variables to datatype object.

categorical_vars = ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'Education',
       'Department', 'BusinessTravel', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus',
       'Over18', 'JobLevel', 'StockOptionLevel', 'JobInvolvement', 'PerformanceRating']

final_data[categorical_vars] = final_data[categorical_vars].astype(str)

In [None]:
# View final_data.

final_data

**Outlier Treatment**

In [None]:
# Extracting only numeric columns from final_data.

numeric_cols = final_data.select_dtypes(include = 'number')

In [None]:
# Dropping "EmployeeID" from df "numeric_cols".

numeric_cols.drop(["EmployeeID"], axis = 1, inplace = True)

In [None]:
# Plotting boxplots for all continuous variables in order to detect outliers.

for column in numeric_cols:
    plt.figure(figsize = (10,5))
    numeric_cols.boxplot([column])
    
# Some variables have outliers. These outliers will be treated in the next step.

In [None]:
# Flooring and capping outliers.
# clip() is used instead of chained indexing (df[col][mask] = value), which may not write back to the dataframe.

for col in numeric_cols.columns:
    Q1 = numeric_cols[col].quantile(0.25)
    Q3 = numeric_cols[col].quantile(0.75)
    IQR = Q3 - Q1
    Lower_cap = Q1 - 1.5*IQR
    Upper_cap = Q3 + 1.5*IQR
    numeric_cols[col] = numeric_cols[col].clip(lower = Lower_cap, upper = Upper_cap)
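The flooring and capping rule can be checked by hand on a small list. The sketch below is a library-free illustration on made-up numbers: it computes the quartiles with linear interpolation between order statistics and clips every value to the interval [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. The helper name `cap_outliers` is hypothetical.

```python
import statistics

def cap_outliers(values):
    """Clip values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # method="inclusive" interpolates linearly between order statistics.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_cap = q1 - 1.5 * iqr
    upper_cap = q3 + 1.5 * iqr
    return [min(max(v, lower_cap), upper_cap) for v in values]

# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped = cap_outliers(sample)
print(capped)
```

Only the extreme value is pulled in to the upper cap; values inside the interval are left unchanged, so the row count stays the same, unlike when outlier rows are dropped.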

In [None]:
# Dropping numeric columns from final_data.

final_data.drop(['Age', 'DistanceFromHome', 'EmployeeCount', 'MonthlyIncome',
       'NumCompaniesWorked', 'PercentSalaryHike', 'StandardHours',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
       'YearsSinceLastPromotion', 'YearsWithCurrManager', 'in_time_mean',
       'out_time_mean', 'working_hours'], axis = 1, inplace = True)

In [None]:
# Resetting the index of "numeric_cols" and "final_data" so that rows align when they are joined.

numeric_cols.reset_index(drop = True, inplace = True)

final_data.reset_index(drop = True, inplace = True)

In [None]:
# Joining "numeric_cols" and "final_data" after outlier treatment.

final_data = pd.concat([final_data, numeric_cols], axis = 1)

**Exploratory data analysis**

**Effect of categorical variables on employee attrition**

In [None]:
# Effect of Environment satisfaction on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "EnvironmentSatisfaction", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of Environment satisfaction on Attrition", size = 15)

# When EnvironmentSatisfaction = 1 (Low), attrition is more.

In [None]:
# Effect of Job satisfaction on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "JobSatisfaction", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of Job satisfaction on Attrition", size = 15)

# When JobSatisfaction level is low, attrition is high.

In [None]:
# Effect of WorkLifeBalance on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "WorkLifeBalance", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of WorkLifeBalance on Attrition", size = 15)

# When WorkLifeBalance = 2 and WorkLifeBalance = 3, attrition is more.

In [None]:
# Effect of Education on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "Education", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of Education on Attrition", size = 15)

# Attrition is high in highly educated employees (Education level 1 to 4).

In [None]:
# Effect of Department on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "Department", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of Department on Attrition", size = 15)

# Attrition is high in "Research and Development" department.

In [None]:
# Effect of BusinessTravel on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "BusinessTravel", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of BusinessTravel on Attrition", size = 15)

# When there is no travel component in the job, attrition is less.

In [None]:
# Effect of EducationField on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "EducationField", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of EducationField on Attrition", size = 15)

# Attrition is high in education fields like Life sciences and Medical.

In [None]:
# Effect of Gender on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "Gender", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of Gender on Attrition", size = 15)

# High level of attrition is seen in male employees.

In [None]:
# Effect of JobRole on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, y = "JobRole", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of JobRole on Attrition", size = 15)

# Research scientists and Sales executives are more prone to attrition.

In [None]:
# Effect of MaritalStatus on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "MaritalStatus", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of MaritalStatus on Attrition", size = 15)

# Employees with marital status as single are more prone to attrition.

In [None]:
# Effect of JobLevel on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "JobLevel", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of JobLevel on Attrition", size = 15)

# Higher attrition is seen in employees with job level 1 and 2.

In [None]:
# Effect of PerformanceRating on Attrition.

plt.figure(figsize = (10,10))

sns.histplot(binwidth = 0.5, x = "PerformanceRating", hue = "Attrition", data = final_data, stat = "count", multiple = "stack")

plt.title("Effect of PerformanceRating on Attrition", size = 15)

# Employees with a lower performance rating are more likely to undergo attrition.

**Effect of continuous variables on employee attrition**

In [None]:
# Effect of Age on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "Age", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# Younger employees are more prone to attrition.

In [None]:
# Effect of DistanceFromHome on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "DistanceFromHome", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# No significant difference is seen in DistanceFromHome in terms of attrition.

In [None]:
# Effect of MonthlyIncome on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "MonthlyIncome", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# As monthly income decreases, attrition increases.

In [None]:
# Effect of NumCompaniesWorked on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "NumCompaniesWorked", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# As NumCompaniesWorked increases, attrition increases.

In [None]:
# Effect of PercentSalaryHike on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "PercentSalaryHike", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# Attrition increases with PercentSalaryHike.

In [None]:
# Effect of TotalWorkingYears on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "TotalWorkingYears", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# Attrition is seen in employees with less TotalWorkingYears.

In [None]:
# Effect of TrainingTimesLastYear on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "TrainingTimesLastYear", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# Attrition is more when TrainingTimesLastYear is less.

In [None]:
# Effect of YearsAtCompany on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "YearsAtCompany", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# When YearsAtCompany is less, attrition is more.

In [None]:
# Effect of YearsWithCurrManager on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "YearsWithCurrManager", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# When YearsWithCurrManager is less, attrition is more.

In [None]:
# Effect of in_time_mean on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "in_time_mean", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# When in_time_mean increases, attrition increases.

In [None]:
# Effect of out_time_mean on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "out_time_mean", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# When out_time_mean increases, attrition increases.

In [None]:
# Effect of working_hours on attrition.

# catplot is figure-level and creates its own figure, so the size is set with height and aspect.
sns.catplot(x = "Attrition", y = "working_hours", hue = "Attrition", kind = "bar", data = final_data, height = 10, aspect = 1)

# When working_hours increases, attrition increases.

**Insights:**

Attrition is higher when:
1. EnvironmentSatisfaction level of employee is low.
2. JobSatisfaction level of employee is low.
3. WorkLifeBalance of employee is low (i.e. level 2 and 3).
4. Level of education of employee increases.
5. Department of employee is Research and Development.
6. Employees have to travel as a part of their job.
7. EducationField of employee is Life sciences or Medical.
8. Gender of employee is male.
9. JobRole of employee is Research scientist or Sales executive.
10. Marital status of employee is single.
11. JobLevel of employee is 1 or 2 (i.e. low).
12. Performance rating of employee is lower.
13. Age of employee is lower.
14. Monthly income of employee is lower.
15. NumCompaniesWorked by employee is higher.
16. PercentSalaryHike of employee is higher.
17. TotalWorkingYears of employee is lower.
18. TrainingTimesLastYear is lower.
19. YearsAtCompany is lower.
20. YearsWithCurrManager is lower.
21. Working hours are longer.

**Model building:**

In [None]:
# Extracting only categorical columns from final_data.

cat_cols = final_data.select_dtypes(include = 'object')

In [None]:
# Getting column names from df cat_cols.

cat_cols.columns

In [None]:
# Creating dummy variables for categorical variables.
# get_dummies returns the full dataframe with the listed columns replaced by their dummies,
# so no separate concat or drop is needed (concatenating would duplicate every other column).

final_data = pd.get_dummies(final_data, columns = ['EnvironmentSatisfaction', 'JobSatisfaction', 'WorkLifeBalance', 'BusinessTravel', 'Department', 'Education','EducationField', 'Gender', 'JobLevel', 'JobRole', 'MaritalStatus','Over18', 'StockOptionLevel', 'JobInvolvement', 'PerformanceRating'], drop_first = True)

In [None]:
# View final_data after dummy variable creation.

final_data

In [None]:
# Storing EmployeeID and Attrition in different dataframes and dropping them from final_data.

Emp_ID = final_data["EmployeeID"]

final_data.drop(["EmployeeID"], axis = 1, inplace = True)

Attr = final_data["Attrition"]

final_data.drop(["Attrition"], axis = 1, inplace = True)

In [None]:
# Scaling all predictor columns (numeric and dummy) before building the model.

scaler = StandardScaler()

scaled_df = pd.DataFrame(scaler.fit_transform(final_data),columns = final_data.columns)

In [None]:
# Resetting the index before joining dataframes "Attr" and "scaled_df".

Attr.reset_index(drop = True, inplace = True)

scaled_df.reset_index(drop = True, inplace = True)

In [None]:
# joining dataframes "Attr" and "scaled_df".

scaled_df = pd.concat([scaled_df, Attr], axis = 1)

In [None]:
# Removing duplicate columns from scaled_df dataset.

scaled_df = scaled_df.loc[:,~scaled_df.columns.duplicated()]

In [None]:
# Converting Attrition to 1 if the value is "Yes" and 0 otherwise.

scaled_df['Attrition'] = scaled_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)

In [None]:
# Examine variable "Attrition".

scaled_df['Attrition'].value_counts()

**Splitting the given data into train and test data.**

In [None]:
# Storing "Attrition" (the dependent variable) in "y" and the independent variables in "X".

y = scaled_df['Attrition']

X = scaled_df.drop(['Attrition'], axis = 1)

In [None]:
# Splitting X and y into train and test data.

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, test_size = 0.3, random_state = 100)

In [None]:
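# Using Recursive Feature Elimination (RFE) for feature selection.

logReg = LogisticRegression()

# n_features_to_select must be passed by keyword in recent scikit-learn versions.
rfe = RFE(logReg, n_features_to_select = 15)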

rfe = rfe.fit(X_train, y_train)

print(rfe.support_) 

print(rfe.ranking_) 

In [None]:
# Getting names of top 15 variables that have to be used for model building.

feature_names = np.array(X_train.columns)

top_cols = feature_names[rfe.support_]

top_cols

In [None]:
# Creating new dataframe as "new_train" with top 15 features.

new_train = X_train[top_cols]

new_train

In [None]:
# Creating new dataframe as "new_test" with top 15 features.

new_test = X_test[top_cols]

new_test

In [None]:
# Function to calculate Variance Inflation factor (VIF).

def Cal_VIF(X_train):
    
    vif = pd.DataFrame()
    
    X = X_train
    
    vif['Features'] = X.columns
    
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    
    vif['VIF'] = round(vif['VIF'], 2)
    
    vif = vif.sort_values(by = "VIF", ascending = False)
    
    return(vif)
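For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. With only two predictors, R² is simply the squared Pearson correlation between them, which makes the idea easy to check by hand. The sketch below uses made-up data and hypothetical helper names (`pearson_r`, `vif_two_predictors`); it illustrates the formula and is not a replacement for `Cal_VIF`.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r_squared = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)

# Made-up predictors with a fairly strong positive correlation.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]
print(round(vif_two_predictors(x1, x2), 2))
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows without bound as R² approaches 1, which is why variables with high VIF are dropped first in the steps that follow.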

In [None]:
# Building logistic regression model_1.

model_1 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_1.summary())

In [None]:
# Calculating VIF.

VIF = Cal_VIF(new_train)

VIF

In [None]:
# Removing "Department_Research & Development".

new_train = new_train.drop("Department_Research & Development", axis = 1)

In [None]:
# Building model_2.

model_2 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_2.summary())

In [None]:
# Calculating VIF.

VIF = Cal_VIF(new_train)

VIF

In [None]:
# Removing "TotalWorkingYears" as it is statistically insignificant and has a high VIF.

new_train = new_train.drop(["TotalWorkingYears"], axis = 1)

In [None]:
# Building model_3.

model_3 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_3.summary())

In [None]:
# Calculating VIF.

VIF = Cal_VIF(new_train)

VIF

In [None]:
# Removing "BusinessTravel_Travel_Frequently".

new_train = new_train.drop(["BusinessTravel_Travel_Frequently"], axis = 1)

In [None]:
# Building model_4.

model_4 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_4.summary())

In [None]:
# Calculating VIF.

VIF = Cal_VIF(new_train)

VIF

# All VIF values are below 2. In further steps, variables will be eliminated based on p-values.

In [None]:
# Removing "Department_Sales".

new_train = new_train.drop(["Department_Sales"], axis = 1)

In [None]:
# Building model_5.

model_5 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_5.summary())

In [None]:
# Removing "BusinessTravel_Travel_Rarely".

new_train = new_train.drop(["BusinessTravel_Travel_Rarely"], axis = 1)

In [None]:
# Building model_6.

model_6 = sm.GLM(y_train, (sm.add_constant(new_train)), family = sm.families.Binomial()).fit()

print(model_6.summary())

# All variables are significant. Hence model_6 will be used for predicting test data.

**Model evaluation**

In [None]:
# Dropping the variables in test data that were removed from train data during variable selection.

new_test = new_test.drop(['TotalWorkingYears',
       'BusinessTravel_Travel_Frequently', 'BusinessTravel_Travel_Rarely',
       'Department_Research & Development', 'Department_Sales'], axis = 1)

In [None]:
# Building final logistic regression model for the purpose of predicting the test data.

log_model = LogisticRegression()

log_model.fit(new_train, y_train)

In [None]:
# Using log_model for predicting test data and storing predictions as "y_pred".

y_pred = log_model.predict_proba(new_test)

In [None]:
# Converting y_pred to dataframe.

y_pred_df = pd.DataFrame(y_pred)

In [None]:
# Converting to column dataframe.

y_pred_1 = y_pred_df.iloc[:,[1]]

y_pred_1

In [None]:
# Converting y_test to dataframe.

y_test_df = pd.DataFrame(y_test)

y_test_df

In [None]:
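# Creating an "EmployeeID" column from the row index (this is the row position in scaled_df, not the EmployeeID from the source files).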

y_test_df['EmployeeID'] = y_test_df.index

y_test_df

In [None]:
# Setting indices.

y_pred_1.reset_index(drop = True, inplace = True)

y_test_df.reset_index(drop = True, inplace = True)

In [None]:
# Appending y_test_df and y_pred_1.

y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)

y_pred_final

In [None]:
# Renaming the column.

y_pred_final= y_pred_final.rename(columns={ 1 : 'Attrition_Prob'})

y_pred_final

In [None]:
# Creating new column 'predicted' with 1 if Attrition_Prob > 0.5 else 0.

y_pred_final['predicted'] = y_pred_final.Attrition_Prob.map( lambda x: 1 if x > 0.5 else 0)

y_pred_final

In [None]:
# Creating confusion matrix.

confusion = metrics.confusion_matrix( y_pred_final.Attrition, y_pred_final.predicted )

confusion

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score( y_pred_final.Attrition, y_pred_final.predicted)

In [None]:
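# Calculating TP, TN, FP and FN.
# scikit-learn's confusion matrix has actual classes in rows and predicted classes in columns, with class 0 first.

TN = confusion[0,0] # true negatives

TP = confusion[1,1] # true positives

FP = confusion[0,1] # false positives

FN = confusion[1,0] # false negatives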

In [None]:
# Let's see the sensitivity of our logistic regression model.

TP / float(TP+FN)

In [None]:
# Let us calculate specificity.

TN / float(TN+FP)

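1. Accuracy = 85.89%
2. Negative predictive value = 87.18%
3. Precision (attrition class) = 61.53%

The last two values were computed as confusion[0,0] / (confusion[0,0] + confusion[1,0]) and confusion[1,1] / (confusion[1,1] + confusion[0,1]). With scikit-learn's layout (actual classes in rows, predicted classes in columns, class 0 first), these are the negative predictive value and the precision, not the sensitivity and specificity.

These metrics will be further optimised by calculating the optimal decision threshold value.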
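Since the indexing of the confusion matrix is easy to get wrong, here is a small library-free sketch on made-up labels. It builds the table in the same layout scikit-learn uses for labels 0 and 1, `[[TN, FP], [FN, TP]]`, and derives four rates from it. The helper name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted):
    """2x2 table with actual class as row and predicted class as column:
    [[TN, FP], [FN, TP]] for labels 0 and 1."""
    table = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        table[a][p] += 1
    return table

# Made-up labels: 6 actual negatives followed by 4 actual positives.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_counts(actual, predicted)
TN, FP = cm[0]
FN, TP = cm[1]

sensitivity = TP / (TP + FN)  # share of actual positives that were caught
specificity = TN / (TN + FP)  # share of actual negatives that were cleared
precision = TP / (TP + FP)    # share of predicted positives that were right
npv = TN / (TN + FN)          # share of predicted negatives that were right

print(cm, sensitivity, specificity, precision, npv)
```

Sensitivity and specificity divide along rows (actual classes), whereas precision and negative predictive value divide along columns (predicted classes), so swapping two indices silently changes which metric is reported.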

In [None]:
# Function to draw ROC curve.

def draw_roc( actual, probs ):
    fpr, tpr, thresholds = metrics.roc_curve( actual, probs,
                                              drop_intermediate = False )
    auc_score = metrics.roc_auc_score( actual, probs )
    plt.figure(figsize=(6, 4))
    plt.plot( fpr, tpr, label='ROC curve (area = %0.2f)' % auc_score )
    plt.plot([0, 1], [0, 1], 'k--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate or [1 - True Negative Rate]')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

    return fpr, tpr, thresholds

In [None]:
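# Drawing ROC curve.
# The predicted probabilities are passed (not the 0/1 predictions) so that the curve covers all thresholds.

draw_roc(y_pred_final.Attrition, y_pred_final.Attrition_Prob)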
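An ROC curve is traced by sweeping the decision threshold over continuous scores, so it needs predicted probabilities rather than thresholded 0/1 labels. The sketch below illustrates the difference on made-up data, using the fact that the AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as one half). The helper name `auc_from_scores` is hypothetical.

```python
def auc_from_scores(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.45, 0.35, 0.85, 0.15, 0.75]

# Hard 0/1 predictions at a 0.5 cutoff throw away the ranking within each group.
hard = [1 if p > 0.5 else 0 for p in probs]

print(round(auc_from_scores(actual, probs), 3))
print(round(auc_from_scores(actual, hard), 3))
```

The two numbers differ because the hard labels collapse the curve to a single operating point, so the area computed from them describes one threshold only rather than the model's ranking ability.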

**Finding optimal threshold value.**

In [None]:
# Let's create columns with different probability cutoffs.

numbers = [float(x)/10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= y_pred_final.Attrition_Prob.map( lambda x: 1 if x > i else 0)
y_pred_final.head()

In [None]:
# Now let's calculate accuracy, sensitivity and specificity for various probability cutoffs.

cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensitivity','specificity'])
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix( y_pred_final.Attrition, y_pred_final[i] )
    total1 = cm1.sum()
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    # Rows are actual classes and columns are predicted classes (class 0 first),
    # so sensitivity comes from row 1 and specificity from row 0.
    sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensitivity,specificity]
print(cutoff_df)

In [None]:
# Let's plot accuracy, sensitivity and specificity for various probabilities.

cutoff_df.plot.line(x='prob', y=['accuracy','sensitivity','specificity'])

# From the plot, optimal threshold = 0.2.
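One common way to read such a plot is to pick the cutoff where the sensitivity and specificity curves cross. Whether the threshold above was chosen exactly this way is an assumption; the sketch below only illustrates the rule on made-up data, with hypothetical names (`sens_spec`, `best_cutoff`).

```python
def sens_spec(actual, probs, threshold):
    """Sensitivity and specificity when predicting 1 for prob > threshold."""
    tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p > threshold)
    fn = sum(1 for a, p in zip(actual, probs) if a == 1 and p <= threshold)
    tn = sum(1 for a, p in zip(actual, probs) if a == 0 and p <= threshold)
    fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p > threshold)
    return tp / (tp + fn), tn / (tn + fp)

def gap(actual, probs, threshold):
    """Absolute difference between sensitivity and specificity."""
    sensitivity, specificity = sens_spec(actual, probs, threshold)
    return abs(sensitivity - specificity)

# Made-up labels and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.45, 0.35, 0.85, 0.15, 0.75]

cutoffs = [i / 10 for i in range(10)]

# Choose the cutoff at which the two curves are closest to each other.
best_cutoff = min(cutoffs, key=lambda t: gap(actual, probs, t))
print(best_cutoff, sens_spec(actual, probs, best_cutoff))
```

Lowering the cutoff raises sensitivity at the cost of specificity, so the crossing point is a compromise; if missing a likely leaver is costlier than a false alarm, a lower cutoff than the crossing point can be justified.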

In [None]:
# Using optimal threshold value to predict test data.

y_pred_final['final_predicted'] = y_pred_final.Attrition_Prob.map( lambda x: 1 if x > 0.2 else 0)

y_pred_final

In [None]:
# Let's check the overall accuracy.

metrics.accuracy_score( y_pred_final.Attrition, y_pred_final.final_predicted)

In [None]:
# Creating confusion matrix.

confusion = metrics.confusion_matrix( y_pred_final.Attrition, y_pred_final.final_predicted)

confusion

In [None]:
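# Calculating TP, TN, FP and FN.
# scikit-learn's confusion matrix has actual classes in rows and predicted classes in columns, with class 0 first.

TN = confusion[0,0] # true negatives

TP = confusion[1,1] # true positives

FP = confusion[0,1] # false positives

FN = confusion[1,0] # false negatives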

In [None]:
# Let's see the sensitivity of our logistic regression model.

TP / float(TP+FN)

In [None]:
# Let us calculate specificity.

TN / float(TN+FP)

In [None]:
# Plotting KS Statistic Plot.

kds.metrics.plot_ks_statistic(y_pred_final["Attrition"], y_pred_final["Attrition_Prob"])

# The KS statistic is 46.67% at the 3rd decile, indicating that the model separates the two classes well.
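The KS statistic is the largest gap between the cumulative share of events (attrition) and the cumulative share of non-events when rows are ranked by predicted probability, highest first. The plot above reports it by decile; the sketch below computes the same idea row by row on made-up data, so it is a simplified illustration rather than a reproduction of the `kds` calculation. The helper name `ks_statistic` is hypothetical.

```python
def ks_statistic(actual, probs):
    """Maximum gap between cumulative event and non-event shares, ranked by score."""
    ranked = sorted(zip(probs, actual), reverse=True)
    total_events = sum(actual)
    total_non_events = len(actual) - total_events
    cum_events = 0
    cum_non_events = 0
    best_gap = 0.0
    for _, label in ranked:
        if label == 1:
            cum_events += 1
        else:
            cum_non_events += 1
        gap = abs(cum_events / total_events - cum_non_events / total_non_events)
        best_gap = max(best_gap, gap)
    return best_gap

# Made-up labels and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.45, 0.35, 0.85, 0.15, 0.75]

print(round(ks_statistic(actual, probs), 3))
```

A value near 0 means the scores rank events and non-events almost identically, while a value near 1 means the scores separate the two groups almost perfectly.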

In [None]:
# Plotting Cumulative Gain Plot.

kds.metrics.plot_cumulative_gain(y_pred_final["Attrition"], y_pred_final["Attrition_Prob"])

# By the 4th decile, the model is able to identify 80% of employees who are prone to attrition.

In [None]:
# Plotting Lift Plot.

kds.metrics.plot_lift(y_pred_final["Attrition"], y_pred_final["Attrition_Prob"])

# In the 2nd decile, the lift is 2.56, which indicates a 2.56 times advantage over a random model.
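Cumulative gain and lift come from the same ranking: gain is the share of all attrition cases found within the top-scored fraction of employees, and lift is that gain divided by the fraction itself (what random selection would be expected to capture). The sketch below works this through on made-up data; `gain_and_lift` and `top_fraction` are hypothetical names, and the plots above use deciles rather than an arbitrary fraction.

```python
def gain_and_lift(actual, probs, top_fraction):
    """Cumulative gain and lift for the top-scored fraction of rows."""
    ranked = sorted(zip(probs, actual), reverse=True)
    n_top = round(len(ranked) * top_fraction)
    captured = sum(label for _, label in ranked[:n_top])
    gain = captured / sum(actual)
    lift = gain / (n_top / len(ranked))
    return gain, lift

# Made-up labels and predicted probabilities.
actual = [0, 0, 1, 1, 0, 1]
probs = [0.05, 0.45, 0.35, 0.85, 0.15, 0.75]

gain, lift = gain_and_lift(actual, probs, top_fraction=1 / 3)
print(round(gain, 3), round(lift, 3))
```

A lift above 1 means that contacting the top-scored employees first reaches more likely leavers than contacting the same number of employees at random.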

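**Conclusions:**
1. Accuracy = 76.04%
2. Negative predictive value = 93.17% (recorded as sensitivity, but computed as confusion[0,0] / (confusion[0,0] + confusion[1,0]))
3. Precision for the attrition class = 35.34% (recorded as specificity, but computed as confusion[1,1] / (confusion[1,1] + confusion[0,1]))
4. KS statistic = 46.67% at decile 3.
5. Gain = 80% by 4th decile.
6. Lift = 2.56 at decile 2.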

From our logistic regression model, the following factors influence the rate of attrition.
1. Age
2. NumCompaniesWorked
3. YearsSinceLastPromotion
4. YearsWithCurrManager
5. working_hours
6. EnvironmentSatisfaction_2.0
7. EnvironmentSatisfaction_3.0
8. EnvironmentSatisfaction_4.0
9. JobSatisfaction_4.0
10. MaritalStatus_Single

Hence, the company should focus on the above-mentioned factors to reduce attrition among its employees.