## Employee Retention Using Logistic Regression

Download the employee retention dataset from https://www.kaggle.com/giripujar/hr-analytics.

1. Do some exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e. whether employees leave the company or continue to work).
2. Plot bar charts showing the impact of employee salaries on retention.
3. Plot bar charts showing the correlation between department and employee retention.
4. Build a logistic regression model using the variables narrowed down in step 1.
5. Measure the accuracy of the model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as pyplot
from sklearn.model_selection import train_test_split

In [3]:
HR_analytics_data = pd.read_csv("../data/HR_comma_sep.csv")

In [4]:
HR_analytics_data.head()

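| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |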


### 1. EDA - Exploratory Data Analysis

##### Let's group the data by the column 'left'
 - 0: employees who have not left
 - 1: employees who have left

In [None]:
# Department and salary are text columns, so average only the numeric ones
HR_analytics_data.groupby('left').mean(numeric_only=True)

#### The group means suggest the following about each factor
- satisfaction_level: People who left have a lower average satisfaction level
- last_evaluation: Seems to have no impact
- number_project: Seems to have minimal or no impact
- average_montly_hours: People who left spent more hours at work on average
- time_spend_company: Time spent at the company does not have a significant impact
- Work_accident: No impact
- promotion_last_5years: Seems to have an impact
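
One way to put numbers behind these eyeballed comparisons is to correlate each numeric column with `left`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and pandas' `DataFrame.corr`. On the real data the same call could be applied to `HR_analytics_data`.

```python
import pandas as pd

# Toy data for illustration only; the values are invented, not from HR_comma_sep.csv.
toy = pd.DataFrame({
    "satisfaction_level": [0.38, 0.11, 0.37, 0.80, 0.72, 0.90, 0.65, 0.85],
    "average_montly_hours": [157, 272, 159, 262, 223, 180, 200, 190],
    "left": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Pearson correlation of every numeric column with the target column.
corr_with_left = toy.corr(numeric_only=True)["left"].drop("left")
print(corr_with_left.sort_values())
```

A value near zero suggests little linear relationship with leaving; the sign shows the direction.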

#### Let's check the impact of salary and department

In [None]:
pd.crosstab(HR_analytics_data.salary, HR_analytics_data.left).plot(kind ='bar')

In [None]:
pd.crosstab(HR_analytics_data.Department, HR_analytics_data.left).plot(kind ='bar')

From the bar charts above, it appears that
- Employees in the 'high' salary band leave far less often than those in the 'low' and 'medium' bands
- Across departments, the pattern looks broadly uniform
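
Raw counts can hide differences when the groups differ in size, so a row-normalized crosstab may be easier to read. A minimal sketch on invented data, using the `normalize="index"` option of `pd.crosstab`:

```python
import pandas as pd

# Invented rows for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "salary": ["low", "low", "low", "low", "medium", "medium", "high", "high"],
    "left":   [1,     1,     0,     0,     1,        0,        0,      0],
})

# Each row sums to 1, so the column for left == 1 is the exit rate per salary band.
exit_rate = pd.crosstab(toy.salary, toy.left, normalize="index")
print(exit_rate)
```

On the real frame, the same option could be passed to the `pd.crosstab` calls above before plotting.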

- Based on the observations above, the following columns/features can be considered
    - satisfaction_level
    - average_montly_hours
    - promotion_last_5years
    - salary
- Of these, 'salary' is categorical rather than numeric, so it needs to be replaced with dummy variables

In [None]:
dummies_salary = pd.get_dummies(HR_analytics_data.salary,prefix='salary')
dummies_salary

In [None]:
subdf = HR_analytics_data[['satisfaction_level','average_montly_hours','promotion_last_5years','salary']]

In [None]:
hr_analytics_merge = pd.concat([subdf, dummies_salary],axis='columns')
hr_analytics_merge

In [None]:
hr_analytics_merge.drop('salary', axis='columns',inplace= True)

### 2. Split the data 70-30 into train and test sets and train the model

In [None]:
X_train, X_test,  y_train, y_test = train_test_split(hr_analytics_merge, HR_analytics_data.left, test_size = 0.3)
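
Because `train_test_split` shuffles randomly, the score and coefficients change from run to run. Below is a hedged sketch of a reproducible, stratified split on invented data. The `random_state` value 42 is an arbitrary choice, not something the notebook specifies, and the `*_toy`, `*_tr`, `*_te` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented features and labels: 100 rows, 30% positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape, y_te.mean())
```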

In [None]:
Model  = LogisticRegression()
Model.fit(X_train, y_train)

In [None]:
X_test

### 3. Predict

In [None]:
Model.predict(X_test)

In [None]:
Model.score(X_test, y_test)
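
For a classifier, `score` returns the mean accuracy on the given data, the same value `sklearn.metrics.accuracy_score` gives for the predictions. Accuracy alone can look good when one class dominates, so a confusion matrix is a useful companion. A minimal sketch on invented one-feature data (the names `clf`, `X_toy`, and `y_toy` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented data with a single generic feature and a binary label.
X_toy = np.array([[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X_toy, y_toy)
y_pred = clf.predict(X_toy)

# score() and accuracy_score() report the same fraction of correct predictions.
print(clf.score(X_toy, y_toy), accuracy_score(y_toy, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_toy, y_pred))
```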

In [None]:
Model.predict_proba(X_test)

In [None]:
# Coefficients of the fitted model, one per feature
# x1, x2, x3, x4, x5, x6
Model.coef_[0]

In [None]:
Model.intercept_

###### In the equation y = mx + c, the coefficient is represented as m and the intercept as c
###### Let's calculate the prediction below in two ways:
 - Using the sigmoid function
 - Using the trained model

Predict for the given use case:
 - satisfaction_level = 0.11
 - average_montly_hours = 260
 - promotion_last_5years = No
 - salary_high = No
 - salary_low = Yes
 - salary_medium = No
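
Written out, the model's probability for the six features is the sigmoid of a linear score. The symbols $w_i$ (the entries of `Model.coef_[0]`) and $b$ (`Model.intercept_`) are notation introduced here for illustration:

```latex
z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6 + b,
\qquad
P(\text{left} = 1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}
```

The employee is predicted to leave when $\sigma(z) > 0.5$, which is the same as $z > 0$.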

In [None]:
import math

def sigmoid(x):
    return 1/(1+math.exp(-x))

In [None]:
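def predict(x1,x2,x3,x4,x5,x6):
    # Hard-coded weights; they should correspond to Model.coef_[0] and Model.intercept_,
    # which differ between runs because the train/test split is random
    emp_left = False
    # Linear score (log-odds); the sigmoid turns it into a probability
    log_odds = -3.8*x1+0.002*x2-0.729*x3-1.037*x4+0.772*x5+0.311*x6-0.031
    prob_sigmoid = sigmoid(log_odds)
    if prob_sigmoid > 0.5:
        emp_left = True
    print("Employee leaving: {left} with a probability of {probability}".format(left = emp_left, probability = prob_sigmoid))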

In [None]:

predict(0.11,260,0,0,1,0)
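
Plugging the use case into the hard-coded coefficients of `predict` gives the following arithmetic, rounded to three decimals:

```latex
\begin{aligned}
z &= -3.8(0.11) + 0.002(260) - 0.729(0) - 1.037(0) + 0.772(1) + 0.311(0) - 0.031 \\
  &= -0.418 + 0.520 + 0.772 - 0.031 = 0.843 \\
\sigma(0.843) &= \frac{1}{1 + e^{-0.843}} \approx 0.699
\end{aligned}
```

Since 0.699 is above the 0.5 threshold, the function reports the employee as leaving.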

In [None]:
Model.predict([[0.11,260,0,0,1,0]])
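
Rather than copying rounded coefficients by hand, the same calculation can read them from the fitted model. The sketch below fits a throwaway model on invented data (the names `clf`, `X_toy`, and `manual_proba` are illustrative) and checks the sigmoid of the linear score against `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented two-feature data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = X @ w + b, then the sigmoid gives P(class 1).
z = X_toy @ clf.coef_[0] + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Should match the second column of predict_proba up to floating-point error.
print(np.allclose(manual_proba, clf.predict_proba(X_toy)[:, 1]))
```

With the notebook's model, `Model.coef_[0]` and `Model.intercept_[0]` would take the place of the toy values.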