# **Logistic Regression**: HR Dataset 🕵️‍♀️

This dataset describes a large company with around 4000 employees. Every year, around 15% of its employees leave and must be replaced from the talent pool available in the job market. This level of attrition (employees leaving, whether voluntarily or through dismissal) hurts the company for the following reasons:
* the former employees’ projects get delayed, which makes it difficult to meet timelines, resulting in a reputation loss among consumers and partners
* a sizeable department has to be maintained, for the purposes of recruiting new talent
* more often than not, the new employees have to be trained for the job and/or given time to acclimatise themselves to the company

Hence, it is important to understand which factors cause attrition, that is, what needs to change for most employees to stay. The goal of this study is therefore to model the probability of attrition as a function of a set of employee features.
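
For reference, the model fitted in the following sections is the standard logistic regression model. With $y = 1$ denoting attrition, $x_1, \dots, x_p$ the features and $\beta_0, \dots, \beta_p$ the coefficients to estimate (this notation is introduced here for illustration and is not part of the dataset):

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
\quad\Longleftrightarrow\quad
\log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
$$

Each coefficient $\beta_j$ is therefore the change in the log-odds of attrition for a one-unit increase in $x_j$, holding the other features fixed.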

# **Logistic Regression with Statsmodels**

$\qquad$ <span style="color:gray"><b>0.</b> Settings </span><br>
$\qquad$ <span style="color:gray"><b>1.</b> Dataset </span><br>
$\qquad$ <span style="color:gray"><b>2.</b> Data Preprocessing </span><br>
$\qquad$ <span style="color:gray"><b>3.</b> Data Preparation </span><br>
$\qquad$ <span style="color:gray"><b>4.</b> Logistic Regression with Statsmodels </span><br>

## **0.** Settings

In [42]:
# Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

import statsmodels.api as sm
import pandas as pd

%matplotlib inline

## **1.** Dataset

In [None]:
'''
    DATASET INFORMATION

    |--------------------------|------------|-------------------------------------------------------------------------------|
    | Name                     | Data Type  | Description                                                                   |
    |--------------------------|------------|-------------------------------------------------------------------------------|
    | Age                      | continuous | Age of the employee                                                           |
    | Attrition                | nominal    | Whether the employee left in the previous year or not                         |
    | BusinessTravel           | nominal    | How frequently the employees travelled for business purposes in the last year |
    | Department               | nominal    | Department in company                                                         |
    | DistanceFromHome         | continuous | Distance from home in kms                                                     |
    | Education                | continuous | Education Level                                                               |
    | EducationField           | nominal    | Field of education                                                            |
    | EmployeeCount            | continuous | Employee count                                                                |
    | EmployeeID               | nominal    | Employee number/id                                                            |
    | Gender                   | nominal    | Gender of employee                                                            |
    | JobLevel                 | continuous | Job level at company on a scale of 1 to 5                                     |
    | JobRole                  | nominal    | Name of job role in company                                                   |
    | MaritalStatus            | nominal    | Marital status of the employee                                                |
    | MonthlyIncome            | continuous | Monthly income in rupees per month                                            |
    | NumCompaniesWorked       | continuous | Total number of companies the employee has worked for                         |
    | Over18                   | nominal    | Whether the employee is above 18 years of age or not                          |
    | PercentSalaryHike        | continuous | Percent salary hike for last year                                             |
    | PerformanceRating        | continuous | Performance rating for last year                                              |
    | RelationshipSatisfaction | continuous | Relationship satisfaction level                                               |
    | StandardHours            | continuous | Standard hours of work for the employee                                       |
    | StockOptionLevel         | continuous | Stock option level of the employee                                            |
    | TotalWorkingYears        | continuous | Total number of years the employee has worked so far                          |
    | TrainingTimesLastYear    | continuous | Number of times training was conducted for this employee last year            |
    | WorkLifeBalance          | continuous | Work life balance level                                                       |
    | YearsAtCompany           | continuous | Total number of years spent at the company by the employee                    |
    | YearsSinceLastPromotion  | continuous | Number of years since last promotion                                          |
    | YearsWithCurrManager     | continuous | Number of years under current manager                                         |
    |--------------------------|------------|-------------------------------------------------------------------------------|

'''

In [3]:
# Import the dataset
data = pd.read_csv('./dataset.csv')
data.head().T

Unnamed: 0,0,1,2,3,4
Age,51,31,32,38,32
Attrition,No,Yes,No,No,No
BusinessTravel,Travel_Rarely,Travel_Frequently,Travel_Frequently,Non-Travel,Travel_Rarely
Department,Sales,Research & Development,Research & Development,Research & Development,Research & Development
DistanceFromHome,6,10,17,2,10
Education,2,1,4,5,1
EducationField,Life Sciences,Life Sciences,Other,Life Sciences,Medical
EmployeeCount,1,1,1,1,1
EmployeeID,1,2,3,4,5
Gender,Female,Female,Male,Male,Male


In [4]:
data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'EmployeeCount', 'EmployeeID', 'Gender',
       'JobLevel', 'JobRole', 'MaritalStatus', 'MonthlyIncome',
       'NumCompaniesWorked', 'Over18', 'PercentSalaryHike', 'StandardHours',
       'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
       'YearsAtCompany', 'YearsSinceLastPromotion', 'YearsWithCurrManager'],
      dtype='object')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4410 entries, 0 to 4409
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4410 non-null   int64  
 1   Attrition                4410 non-null   object 
 2   BusinessTravel           4410 non-null   object 
 3   Department               4410 non-null   object 
 4   DistanceFromHome         4410 non-null   int64  
 5   Education                4410 non-null   int64  
 6   EducationField           4410 non-null   object 
 7   EmployeeCount            4410 non-null   int64  
 8   EmployeeID               4410 non-null   int64  
 9   Gender                   4410 non-null   object 
 10  JobLevel                 4410 non-null   int64  
 11  JobRole                  4410 non-null   object 
 12  MaritalStatus            4410 non-null   object 
 13  MonthlyIncome            4410 non-null   int64  
 14  NumCompaniesWorked      

## **2.** Data Preprocessing

In [6]:
# Null elements
data.isnull().sum()

Age                         0
Attrition                   0
BusinessTravel              0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EmployeeCount               0
EmployeeID                  0
Gender                      0
JobLevel                    0
JobRole                     0
MaritalStatus               0
MonthlyIncome               0
NumCompaniesWorked         19
Over18                      0
PercentSalaryHike           0
StandardHours               0
StockOptionLevel            0
TotalWorkingYears           9
TrainingTimesLastYear       0
YearsAtCompany              0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

In [7]:
data.isnull().any()

Age                        False
Attrition                  False
BusinessTravel             False
Department                 False
DistanceFromHome           False
Education                  False
EducationField             False
EmployeeCount              False
EmployeeID                 False
Gender                     False
JobLevel                   False
JobRole                    False
MaritalStatus              False
MonthlyIncome              False
NumCompaniesWorked          True
Over18                     False
PercentSalaryHike          False
StandardHours              False
StockOptionLevel           False
TotalWorkingYears           True
TrainingTimesLastYear      False
YearsAtCompany             False
YearsSinceLastPromotion    False
YearsWithCurrManager       False
dtype: bool

In [8]:
data.shape

(4410, 24)

In [9]:
# We can either fill the nulls with 0 or delete the rows. 
# In this case we can try to delete the rows with missing 
# elements since we have quite a lot of data.

# Fill
# data.fillna(0, inplace =True)

# Drop
data = data.dropna(how='any', axis=0)
data.shape

(4382, 24)
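
The two options mentioned in the cell above can be compared on a small example. The sketch below uses a made-up toy frame (not the HR data) and, as an assumption, swaps the zero fill for a per-column median, since a value of 0 would be read by the model as a genuine measurement such as "0 working years".

```python
import pandas as pd

# Toy frame standing in for the HR data (values are made up for illustration)
toy = pd.DataFrame({
    "NumCompaniesWorked": [1.0, None, 3.0, 4.0],
    "TotalWorkingYears": [10.0, 6.0, None, 8.0],
})

# Option 1: drop incomplete rows and measure how much data is lost
dropped = toy.dropna(how="any", axis=0)
lost_fraction = 1 - len(dropped) / len(toy)

# Option 2: keep every row and fill each gap with the column median
imputed = toy.fillna(toy.median(numeric_only=True))

print(f"rows lost by dropping: {lost_fraction:.1%}")
print(imputed)
```

In the notebook itself, dropping removes 28 of 4410 rows (about 0.6%), so the loss is small and dropping is a reasonable choice.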

In [10]:
# Remove columns that carry no useful information for the model
data.drop(['EmployeeCount','EmployeeID','StandardHours', 'Over18'], axis=1, inplace=True)

#   * 'EmployeeCount' is always equal to 1 (as employees are interviewed one at a time)
#   * 'EmployeeID' is a unique identifier for each employee
#   * 'StandardHours' is (basically) always 8
#   * 'Over18' is constant, because all the employees are over 18

data.shape

(4382, 20)
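
The reasons given for dropping these columns can be checked programmatically rather than by eye. The following is a minimal sketch on a made-up toy frame (the values, including the `'Y'` flag, are assumptions for illustration): columns with a single distinct value are constant, and columns with as many distinct values as rows behave like identifiers.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the HR data (values are made up)
toy = pd.DataFrame({
    "Age": [51, 31, 31],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [8, 8, 8],
    "Over18": ["Y", "Y", "Y"],
    "EmployeeID": [1, 2, 3],
})

n_unique = toy.nunique()

# Constant columns: exactly one distinct value
constant_cols = n_unique[n_unique == 1].index.tolist()

# Identifier-like columns: one distinct value per row
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols)
print(id_like_cols)
```

The identifier check is only a hint: on a large dataset a genuinely continuous feature can also be unique per row, so the result still needs a manual look.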

## **3.** Data Preparation

Based on the earlier exploratory data analysis (EDA), the following features looked visually influential and should certainly be taken into account:

* Business Travel
* Department
* Education Field
* Marital Status

In [11]:
data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'Department', 'DistanceFromHome',
       'Education', 'EducationField', 'Gender', 'JobLevel', 'JobRole',
       'MaritalStatus', 'MonthlyIncome', 'NumCompaniesWorked',
       'PercentSalaryHike', 'StockOptionLevel', 'TotalWorkingYears',
       'TrainingTimesLastYear', 'YearsAtCompany', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

In [12]:
data.iloc[0]

Age                                               51
Attrition                                         No
BusinessTravel                         Travel_Rarely
Department                                     Sales
DistanceFromHome                                   6
Education                                          2
EducationField                         Life Sciences
Gender                                        Female
JobLevel                                           1
JobRole                    Healthcare Representative
MaritalStatus                                Married
MonthlyIncome                                 131160
NumCompaniesWorked                               1.0
PercentSalaryHike                                 11
StockOptionLevel                                   0
TotalWorkingYears                                1.0
TrainingTimesLastYear                              6
YearsAtCompany                                     1
YearsSinceLastPromotion                       

In [13]:
# Inspect the unique values of each categorical feature before encoding them
print(data['BusinessTravel'].unique())
print(data['Department'].unique())
print(data['EducationField'].unique())
print(data['Gender'].unique())
print(data['JobRole'].unique())
print(data['MaritalStatus'].unique())

['Travel_Rarely' 'Travel_Frequently' 'Non-Travel']
['Sales' 'Research & Development' 'Human Resources']
['Life Sciences' 'Other' 'Medical' 'Marketing' 'Technical Degree'
 'Human Resources']
['Female' 'Male']
['Healthcare Representative' 'Research Scientist' 'Sales Executive'
 'Human Resources' 'Research Director' 'Laboratory Technician'
 'Manufacturing Director' 'Sales Representative' 'Manager']
['Married' 'Single' 'Divorced']


In [14]:
# Encode categorical features
labelEncoder_X = LabelEncoder()

data['BusinessTravel'] = labelEncoder_X.fit_transform(data['BusinessTravel'])
data['Department']     = labelEncoder_X.fit_transform(data['Department'])
data['EducationField'] = labelEncoder_X.fit_transform(data['EducationField'])
data['Gender']         = labelEncoder_X.fit_transform(data['Gender'])
data['JobRole']        = labelEncoder_X.fit_transform(data['JobRole'])
data['MaritalStatus']  = labelEncoder_X.fit_transform(data['MaritalStatus'])

# Encode label
label_encoder_y = LabelEncoder()

data['Attrition'] = label_encoder_y.fit_transform(data['Attrition'])

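This is label encoding, not one-hot encoding: each category is mapped to an integer code, which `LabelEncoder` assigns in sorted (alphabetical) order.<br>
We proceed in this way in order to visualize the correlation of the categorical variables with the target (also categorical). Note that the integer codes impose an arbitrary ordering on nominal features, which a linear model will treat as meaningful.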
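
For comparison, this is what one-hot encoding would look like. The sketch below is not what the notebook does; it uses `pd.get_dummies` on a made-up toy frame, with `drop_first=True` so that one category per feature acts as the baseline and the dummy columns are not perfectly collinear with the intercept.

```python
import pandas as pd

# Toy frame (values are made up for illustration)
toy = pd.DataFrame({
    "MaritalStatus": ["Married", "Single", "Divorced", "Single"],
    "Age": [51, 31, 32, 38],
})

# One 0/1 column per category; the first category in sorted order
# ('Divorced') is dropped and becomes the reference level
one_hot = pd.get_dummies(toy, columns=["MaritalStatus"], drop_first=True, dtype=int)

print(one_hot)
```

With this encoding each coefficient compares one category against the reference level, at the cost of more columns (for example, `JobRole` with nine categories would become eight columns).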

In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4382 entries, 0 to 4408
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      4382 non-null   int64  
 1   Attrition                4382 non-null   int32  
 2   BusinessTravel           4382 non-null   int32  
 3   Department               4382 non-null   int32  
 4   DistanceFromHome         4382 non-null   int64  
 5   Education                4382 non-null   int64  
 6   EducationField           4382 non-null   int32  
 7   Gender                   4382 non-null   int32  
 8   JobLevel                 4382 non-null   int64  
 9   JobRole                  4382 non-null   int32  
 10  MaritalStatus            4382 non-null   int32  
 11  MonthlyIncome            4382 non-null   int64  
 12  NumCompaniesWorked       4382 non-null   float64
 13  PercentSalaryHike        4382 non-null   int64  
 14  StockOptionLevel        

In [16]:
data.head().T

Unnamed: 0,0,1,2,3,4
Age,51.0,31.0,32.0,38.0,32.0
Attrition,0.0,1.0,0.0,0.0,0.0
BusinessTravel,2.0,1.0,1.0,0.0,2.0
Department,2.0,1.0,1.0,1.0,1.0
DistanceFromHome,6.0,10.0,17.0,2.0,10.0
Education,2.0,1.0,4.0,5.0,1.0
EducationField,1.0,1.0,4.0,1.0,3.0
Gender,0.0,0.0,1.0,1.0,1.0
JobLevel,1.0,1.0,4.0,3.0,1.0
JobRole,0.0,6.0,7.0,1.0,7.0


In [28]:
X = data.drop('Attrition', axis=1)
Y = data['Attrition']

# Train and test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(3505, 19)
(877, 19)
(3505,)
(877,)
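
Because only a minority of employees leave, a purely random split can give the train and test sets slightly different attrition rates. One possible refinement, which is not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below shows it on a made-up target with 20% positives.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of them with Attrition = 1 (made up for illustration)
X_toy = pd.DataFrame({"feature": range(100)})
Y_toy = pd.Series([1] * 20 + [0] * 80, name="Attrition")

# stratify=Y_toy keeps the class proportions the same in both subsets
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=5, stratify=Y_toy
)

train_rate = Y_tr.mean()
test_rate = Y_te.mean()
print(train_rate, test_rate)
```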


## **4.** Logistic Regression with **Statsmodels**

In [29]:
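# Standardization of the features (zero mean, unit variance).
# The scaler is fitted on the training set only and then applied to the test set.
# Note: the scaler returns NumPy arrays, so the column names are lost and the
# summary below labels the features x1 ... x19 (in the original column order).
Scaler_X = StandardScaler()
X_train  = Scaler_X.fit_transform(X_train)
X_test   = Scaler_X.transform(X_test)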

# To have the intercept in the model
# (in Statsmodels the intercept has to be added manually)
X_train = sm.add_constant(X_train)
X_test  = sm.add_constant(X_test)

# Logistic Regression
model = sm.Logit(Y_train, X_train)
model = model.fit()
print(model.summary())

In [None]:
'''
FULL OUTPUT

Optimization terminated successfully.
         Current function value: 0.399419
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:              Attrition   No. Observations:                 3505
Model:                          Logit   Df Residuals:                     3485
Method:                           MLE   Df Model:                           19
Date:                Fri, 17 Jun 2022   Pseudo R-squ.:                  0.1051
Time:                        17:51:07   Log-Likelihood:                -1400.0
converged:                       True   LL-Null:                       -1564.4
Covariance Type:            nonrobust   LLR p-value:                 2.388e-58
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.8948      0.057    -33.497      0.000      -2.006      -1.784
x1            -0.2976      0.069     -4.305      0.000      -0.433      -0.162
x2            -0.0088      0.049     -0.182      0.856      -0.104       0.086
x3            -0.1451      0.048     -3.024      0.002      -0.239      -0.051
x4             0.0180      0.048      0.373      0.709      -0.077       0.113
x5            -0.0573      0.048     -1.186      0.236      -0.152       0.037
x6            -0.0933      0.049     -1.907      0.057      -0.189       0.003
x7             0.0254      0.049      0.521      0.602      -0.070       0.121
x8            -0.0308      0.049     -0.630      0.529      -0.127       0.065
x9             0.0769      0.049      1.570      0.116      -0.019       0.173
x10            0.4372      0.051      8.491      0.000       0.336       0.538
x11           -0.0780      0.050     -1.566      0.117      -0.176       0.020
x12            0.2805      0.051      5.490      0.000       0.180       0.381
x13            0.0620      0.048      1.303      0.192      -0.031       0.155
x14           -0.0635      0.049     -1.297      0.195      -0.159       0.032
x15           -0.3980      0.101     -3.930      0.000      -0.596      -0.199
x16           -0.1480      0.050     -2.960      0.003      -0.246      -0.050
x17            0.0751      0.125      0.603      0.547      -0.169       0.319
x18            0.3965      0.073      5.414      0.000       0.253       0.540
x19           -0.4898      0.089     -5.524      0.000      -0.664      -0.316
==============================================================================

'''
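
The summary above labels the features `x1` to `x19`. Assuming the column order of `X` is preserved through scaling and `add_constant` (which prepends the constant by default), the labels can be mapped back to the feature names listed in the `data.columns` output, and the coefficients turned into odds ratios. The sketch below copies the coefficients and p-values by hand from the summary; the variable names are introduced here for illustration only.

```python
import math

# Feature order after dropping 'Attrition' (from the data.columns output)
feature_names = [
    "Age", "BusinessTravel", "Department", "DistanceFromHome", "Education",
    "EducationField", "Gender", "JobLevel", "JobRole", "MaritalStatus",
    "MonthlyIncome", "NumCompaniesWorked", "PercentSalaryHike",
    "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear",
    "YearsAtCompany", "YearsSinceLastPromotion", "YearsWithCurrManager",
]

# Coefficients and p-values for x1 ... x19, copied from the summary
coefs = [-0.2976, -0.0088, -0.1451, 0.0180, -0.0573, -0.0933, 0.0254,
         -0.0308, 0.0769, 0.4372, -0.0780, 0.2805, 0.0620, -0.0635,
         -0.3980, -0.1480, 0.0751, 0.3965, -0.4898]
p_values = [0.000, 0.856, 0.002, 0.709, 0.236, 0.057, 0.602,
            0.529, 0.116, 0.000, 0.117, 0.000, 0.192, 0.195,
            0.000, 0.003, 0.547, 0.000, 0.000]

# Map the generic labels back to the feature names
label_to_name = {f"x{i}": name for i, name in enumerate(feature_names, start=1)}

# Odds ratios for the features significant at the 5% level
odds_ratios = {
    name: math.exp(coef)
    for name, coef, p in zip(feature_names, coefs, p_values)
    if p < 0.05
}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<25} odds ratio per 1 std dev: {ratio:.3f}")
```

Since the features were standardized, each odds ratio refers to an increase of one standard deviation, not one raw unit. Ratios for label-encoded nominal features such as `Department` or `MaritalStatus` should be read with caution, because the integer codes behind them have no natural order.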