# Using Sk-learn to develop a logistic regression model

### Author/Data-scientist: Leon Hamnett
### [LinkedIn](https://www.linkedin.com/in/leon-hamnett/)

#### Contents:

1. [Introduction](#introduction)
2. [Cleaning/Preprocessing](#clean_prep)
3. [Logistic regression machine learning with Sk-learn](#MLSK)
4. [Interpretation of the model](#interpret)


### Introduction:  <a name="introduction"></a>

In this notebook, we use Sk-learn to build a logistic regression model that estimates how likely an employee, given their stated reason for absence, is to spend a large amount of time out of the office. This helps the human resources department and the team leads with scheduling, so that enough people are available on any given day.

The dataset we are using contains data for a specific company, including each employee's reason for absence. The company uses internal numeric codes for the reason for absence: 0 (no reason given) and 1 to 28, covering a wide variety of medical and personal reasons a person might take time away from the office during a working day.

We also have the following columns in the dataset:

Date, Transportation Expense, Distance to Work, Age, Daily Work Load Average (minutes), Body Mass Index,
Education (level of highest education obtained: 1-high school, 2-graduate, 3-postgraduate, 4-master's or doctorate degree),
Children (how many), Pets (how many) and finally Absenteeism Time in Hours.

We want to see which factors are the most important in determining if the employee will take a large amount of time off work during a certain day.

### Data cleaning/preprocessing:  <a name="clean_prep"></a>

First we will examine our dataset, correct some issues and create dummy variables as needed for the inputs of interest.

In [3]:
#import libraries
import pandas as pd

In [4]:
raw_data = pd.read_csv('Absenteeism_data.csv')
data = raw_data.copy()

In [5]:
data.columns

Index(['ID', 'Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average', 'Body Mass Index',
       'Education', 'Children', 'Pets', 'Absenteeism Time in Hours'],
      dtype='object')

In [6]:
data.describe(include='all')

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
count,700.0,700.0,700,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
unique,,,432,,,,,,,,,
top,,,10/08/2016,,,,,,,,,
freq,,,5,,,,,,,,,
mean,17.951429,19.411429,,222.347143,29.892857,36.417143,271.801774,26.737143,1.282857,1.021429,0.687143,6.761429
std,11.028144,8.356292,,66.31296,14.804446,6.379083,40.021804,4.254701,0.66809,1.112215,1.166095,12.670082
min,1.0,0.0,,118.0,5.0,27.0,205.917,19.0,1.0,0.0,0.0,0.0
25%,9.0,13.0,,179.0,16.0,31.0,241.476,24.0,1.0,0.0,0.0,2.0
50%,18.0,23.0,,225.0,26.0,37.0,264.249,25.0,1.0,1.0,0.0,3.0
75%,28.0,27.0,,260.0,50.0,40.0,294.217,31.0,1.0,2.0,1.0,8.0


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


We see there are no missing values.

#### Drop un-needed columns:
First we drop the un-needed ID column as this adds nothing to the analysis.

In [8]:
#drop id column
cols_to_drop = ['ID']
data2 = data.drop(columns=cols_to_drop)

#### Collapse the reason for absence column into 4 groups of reasons and create the dummy variable:

In [9]:
#check reasons given by employees
data2['Reason for Absence'].value_counts()

23    147
28    110
27     66
13     52
0      38
19     36
22     32
26     31
25     29
11     24
10     22
18     21
14     18
1      16
7      13
12      8
6       6
21      6
8       5
9       4
5       3
24      3
16      3
4       2
15      2
3       1
2       1
17      1
Name: Reason for Absence, dtype: int64

In [10]:
data2['Reason for Absence'].value_counts().sort_index(ascending=True) #we see reason 20 is missing
#convert reason to dummy variables
reason_cols = pd.get_dummies(data2['Reason for Absence'],drop_first=True) #we take reason 0 as the reference category for the dummy variables
#sanity check: sum the dummies in each row (each row should contain at most one 1)
reason_cols['check'] = reason_cols.sum(axis=1)
#this comparison is False by construction: the 38 rows with reason 0 (the dropped reference category) sum to 0
reason_cols.shape[0] == reason_cols['check'].sum(axis=0)

False
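
The comparison above cannot succeed while a reference category is dropped, because the reference rows contribute nothing to the row sums. A minimal sketch of a check that accounts for this is shown below. It runs on a small made-up series (`codes` is illustrative, not the notebook's data) and assumes the goal is simply to confirm that each row carries at most one dummy and that the all-zero rows are exactly the reference rows.

```python
import pandas as pd

# Toy reason codes (illustrative only); 0 is the reference category
codes = pd.Series([0, 23, 28, 0, 13, 23], name="Reason for Absence")

dummies = pd.get_dummies(codes, drop_first=True)
row_sums = dummies.sum(axis=1)

# Every row should have at most one dummy set
at_most_one = bool(row_sums.isin([0, 1]).all())

# Rows with no dummy set should be exactly the reference-category rows
n_all_zero = int((row_sums == 0).sum())
n_reference = int((codes == 0).sum())

print(at_most_one, n_all_zero == n_reference)
```

Applied to the notebook's data, the same idea would compare the number of all-zero rows with the count of reason 0 from `value_counts()`.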

In [11]:
#remove check column as no longer needed
reason_cols_no_check = reason_cols.drop(columns=['check'])
reason_cols_no_check

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,18,19,21,22,23,24,25,26,27,28
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
696,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
697,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
698,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


Now we will group the dummies into 4 categories based broadly on the similar themes for reasons given.

In [12]:
#grouping reasons for absence
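#note: iloc is positional; code 0 was dropped and code 20 never occurs, so position i does not always hold code i+1
#reason 1 illness (positions 0-14 = codes 1-15)
reason1 = reason_cols_no_check.iloc[:,0:15].max(axis=1)
#reason 2 pregnancy (positions 15-17 = codes 16-18)
reason2 = reason_cols_no_check.iloc[:,15:18].max(axis=1)
#reason 3 poisoning or others (positions 18-21 = codes 19, 21, 22, 23)
reason3 = reason_cols_no_check.iloc[:,18:22].max(axis=1)
#reason 4 - 'light' reasons - doctor's appointment, physio (positions 22-26 = codes 24-28)
reason4 = reason_cols_no_check.iloc[:,22:29].max(axis=1)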
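
Since the slices above are positional, it can help to spell out which reason codes each one covers. The sketch below rebuilds the column labels (codes 1 to 28 without 20), translates each slice into codes, and then shows a label-based equivalent on five toy values that mirror the first five rows shown in the outputs. The names `slices`, `codes_per_group` and `code_to_group` are hypothetical helpers, and the group boundaries are simply those implied by the slices, not an independent definition.

```python
import pandas as pd

# Column labels as produced above: codes 1..28 without 20 (0 is dropped as reference)
labels = [c for c in range(1, 29) if c != 20]

# Positional slices used in the notebook
slices = {"reason1": slice(0, 15), "reason2": slice(15, 18),
          "reason3": slice(18, 22), "reason4": slice(22, 29)}

# Translate each positional slice into the reason codes it actually covers
codes_per_group = {name: labels[s] for name, s in slices.items()}
for name, codes in codes_per_group.items():
    print(name, codes)

# Label-based equivalent on toy data: map each code to its group, then one-hot encode
code_to_group = {c: g for g, cs in codes_per_group.items() for c in cs}
toy = pd.Series([26, 0, 23, 7, 23])
grouped = pd.get_dummies(toy.map(code_to_group)).reindex(
    columns=list(slices), fill_value=0).astype(int)
print(grouped)
```

A label-based mapping like this makes the grouping explicit, so it is easier to review whether a code such as 23 sits in the intended group.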


In [13]:
#concatenate reasons to main dataframe
pd.set_option('display.max_rows', None)
data2_cols = data2.columns
data3 = pd.concat([data2,reason1,reason2,reason3,reason4],axis=1)
#rename columns for clarity
data4 = data3.rename(columns={0:'reason1',1:'reason2',2:'reason3',3:'reason4'})
data4 = data4.drop(columns='Reason for Absence')
data4.head()


Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,reason1,reason2,reason3,reason4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,0,0,0,1
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,0,0,0,0
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,0,0,1,0
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,1,0,0,0
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,0,0,1,0


#### Change type of date column and extract month and day of week:

Now we will focus on the date column, converting it into the correct data type and extracting the month and the day of the week.

In [14]:
data5 = data4.copy()
#change column type
data5['Date'] = pd.to_datetime(data5['Date'],format='%d/%m/%Y')
#create month column
data5['Month'] = data5.Date.dt.month
#create day of the week column
data5['Day_of_week'] = data5.Date.dt.dayofweek
#remap to give day of week names rather than numbers
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
data5['name_of_weekday'] = data5['Day_of_week'].map(dayOfWeek)
#we check some dates to make sure we have generated the day names correctly
data5.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,reason1,reason2,reason3,reason4,Month,Day_of_week,name_of_weekday
0,2015-07-07,289,36,33,239.554,30,1,2,1,4,0,0,0,1,7,1,Tuesday
1,2015-07-14,118,13,50,239.554,31,1,1,0,0,0,0,0,0,7,1,Tuesday
2,2015-07-15,179,51,38,239.554,31,1,0,0,2,0,0,1,0,7,2,Wednesday
3,2015-07-16,279,5,39,239.554,24,1,2,0,4,1,0,0,0,7,3,Thursday
4,2015-07-23,289,36,33,239.554,30,1,2,1,2,0,0,1,0,7,3,Thursday
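
As an aside, pandas can return weekday names directly, which avoids maintaining a manual mapping. A small sketch on three of the dates shown above, assuming English day names are wanted:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["07/07/2015", "14/07/2015", "15/07/2015"]),
                       format="%d/%m/%Y")

# dayofweek counts from Monday=0 to Sunday=6
day_numbers = dates.dt.dayofweek
# day_name() returns the English weekday name for each date
day_names = dates.dt.day_name()

print(pd.DataFrame({"date": dates, "dayofweek": day_numbers, "name": day_names}))
```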


In [15]:
#get dummies for the day of the week
day_dummies = pd.get_dummies(data5['name_of_weekday'],drop_first=False)
#keep Tuesday to Sunday so that Monday is the reference category
day_dummies = day_dummies[['Tuesday','Wednesday','Thursday','Friday', 'Saturday', 'Sunday']]
#add the dummies to the main dataframe
data6 = pd.concat([data5,day_dummies],axis=1)
data6 = data6[['reason1', 'reason2', 'reason3', 'reason4','Tuesday', 'Wednesday', 'Thursday', 'Friday',
       'Saturday', 'Sunday','Month',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours']]
data6.head()

Unnamed: 0,reason1,reason2,reason3,reason4,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,Month,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,1,0,0,0,0,0,7,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,1,0,0,0,0,0,7,118,13,50,239.554,31,1,1,0,0
2,0,0,1,0,0,1,0,0,0,0,7,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,0,0,1,0,0,0,7,279,5,39,239.554,24,1,2,0,4
4,0,0,1,0,0,0,1,0,0,0,7,289,36,33,239.554,30,1,2,1,2


#### Change education method to categorical and set dummy variables

Now we will collapse the education levels into a single binary dummy variable: high school (the reference category) versus any higher level of education.

In [16]:
#1 = high school, 2 = graduate, 3 = postgraduate , 4 = master/doctors degree
data6.Education.value_counts()
#combine graduate and higher education into 1, highschool into reference category of 0
data6['higher_education'] = data6['Education'].map({1:0,2:1,3:1,4:1})
data6.head()

Unnamed: 0,reason1,reason2,reason3,reason4,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,...,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,higher_education
0,0,0,0,1,1,0,0,0,0,0,...,289,36,33,239.554,30,1,2,1,4,0
1,0,0,0,0,1,0,0,0,0,0,...,118,13,50,239.554,31,1,1,0,0,0
2,0,0,1,0,0,1,0,0,0,0,...,179,51,38,239.554,31,1,0,0,2,0
3,1,0,0,0,0,0,1,0,0,0,...,279,5,39,239.554,24,1,2,0,4,0
4,0,0,1,0,0,0,1,0,0,0,...,289,36,33,239.554,30,1,2,1,2,0


#### Change month to categorical and get dummies:

Now we will treat the month as categorical, set January as the reference category and obtain the dummy variables:

In [17]:
month_dummies = pd.get_dummies(data6['Month'],drop_first=True,prefix='month')
data7 = pd.concat([data6,month_dummies],axis=1)
month_dummies.columns

Index(['month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7',
       'month_8', 'month_9', 'month_10', 'month_11', 'month_12'],
      dtype='object')

#### Set clean data and export as csv:

Now that the data is in a suitable form, we will rearrange the columns and then export them to CSV.

In [18]:
data8 = data7[['reason1', 'reason2', 'reason3', 'reason4', 'Tuesday', 'Wednesday',
       'Thursday', 'Friday', 'Saturday', 'Sunday',
        'month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7',
       'month_8', 'month_9', 'month_10', 'month_11', 'month_12',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index','higher_education', 'Children',
       'Pets', 'Absenteeism Time in Hours']]
#set cleaned and preprocessed data to variable
df_preprocessed = data8.copy()
df_preprocessed.head()
df_preprocessed.to_csv('Absenteeism_preprocessed.csv', index=False)

### Machine learning with Sk-learn to give logistic regression for absenteeism:  <a name="MLSK"></a>

In [19]:
#import libraries
import numpy as np
from sklearn.preprocessing import StandardScaler

#### Create targets:

We will create the targets for the model to predict. We take the median absence time in hours and consider absences at or below it to be moderate absences and absences above it to be excessive absences.

In [20]:
#create two classes: moderately absent and excessively absent
#use the median as the cut-off: it is robust to outliers and splits the dataset into two roughly balanced classes
median_absent_hours = df_preprocessed['Absenteeism Time in Hours'].median()
print(median_absent_hours) # we see the median time absent is 3 hours

#at or below the median is a moderate absence, above is an excessive absence
targets = np.where(df_preprocessed['Absenteeism Time in Hours'] > median_absent_hours,1,0)
df_preprocessed['excessive_abscence'] = targets
df_with_targets = df_preprocessed.copy()
df_with_targets.drop(columns=['Absenteeism Time in Hours'],inplace = True)
df_with_targets.head()

3.0


Unnamed: 0,reason1,reason2,reason3,reason4,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,...,month_12,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,higher_education,Children,Pets,excessive_abscence
0,0,0,0,1,1,0,0,0,0,0,...,0,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,1,0,0,0,0,0,...,0,118,13,50,239.554,31,0,1,0,0
2,0,0,1,0,0,1,0,0,0,0,...,0,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,279,5,39,239.554,24,0,2,0,1
4,0,0,1,0,0,0,1,0,0,0,...,0,289,36,33,239.554,30,0,2,1,0
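
Because the comparison is a strict `>`, every row equal to the median lands in class 0, so the split is only roughly balanced when many rows tie at the median. A short sketch on made-up hours (the values in `hours` are illustrative) shows how to check the resulting class share:

```python
import numpy as np
import pandas as pd

# Toy absence hours (illustrative only), with several ties at the median
hours = pd.Series([0, 1, 2, 3, 3, 3, 4, 8, 8, 40])

median_hours = hours.median()
targets = np.where(hours > median_hours, 1, 0)

# Share of rows labelled as excessive absences
share_excessive = targets.mean()
print(median_hours, share_excessive)
```

Running the same `targets.mean()` on the notebook's targets would show how balanced the two classes actually are.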


#### Select inputs for regression and scale data:

In [21]:
#choose all columns for inputs (all except targets)
unscaled_inputs = df_with_targets.iloc[:,0:-1]
unscaled_inputs.head()

Unnamed: 0,reason1,reason2,reason3,reason4,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,...,month_11,month_12,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,higher_education,Children,Pets
0,0,0,0,1,1,0,0,0,0,0,...,0,0,289,36,33,239.554,30,0,2,1
1,0,0,0,0,1,0,0,0,0,0,...,0,0,118,13,50,239.554,31,0,1,0
2,0,0,1,0,0,1,0,0,0,0,...,0,0,179,51,38,239.554,31,0,0,0
3,1,0,0,0,0,0,1,0,0,0,...,0,0,279,5,39,239.554,24,0,2,0
4,0,0,1,0,0,0,1,0,0,0,...,0,0,289,36,33,239.554,30,0,2,1


Now we will create a custom scaler object so that we only scale the non-dummy values. This keeps the dummy variables as 0/1, so the coefficients at the end of the logistic regression are easier to interpret.

In [22]:

from sklearn.base import BaseEstimator, TransformerMixin

class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # a StandardScaler that is applied only to the columns named in `columns`;
    # all other columns are passed through unchanged
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        # store the constructor arguments as attributes so that
        # BaseEstimator.get_params() and repr() work
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # pass the options by keyword: recent scikit-learn versions
        # do not accept them positionally
        self.scaler = StandardScaler(copy=copy,with_mean=with_mean,with_std=with_std)
        self.mean_ = None
        self.var_ = None
        
    
    # the fit method learns the mean and variance of the chosen columns
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # per-column statistics (population variance, matching StandardScaler)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    # the transform method does the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # record the initial order of the columns
        init_col_order = X.columns
        
        # scale the chosen columns, keeping the original row index so that
        # the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        
        # select all columns that are not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # return a data frame containing the scaled and unscaled columns
        # in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]


In [23]:
#choose columns to scale (only non-dummy variables)
columns_to_scale = ['Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index',
       'Children', 'Pets']

# create custom scaler model
absenteeism_scaler = CustomScaler(columns_to_scale)

#fit scaler to data
absenteeism_scaler.fit(unscaled_inputs)
#scale inputs
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs.shape #700 rows, 29 inputs



(700, 29)
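
For comparison, scikit-learn's built-in `ColumnTransformer` can also scale a subset of columns and pass the rest through. The sketch below is an alternative to the custom class, not the approach used in this notebook; it runs on a tiny made-up frame, and the column names are borrowed from the dataset purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame: one dummy column and two numeric columns (values are made up)
X = pd.DataFrame({"reason1": [0, 1, 0, 1],
                  "Age": [33, 50, 38, 39],
                  "Pets": [1, 0, 0, 2]})
numeric_cols = ["Age", "Pets"]

# Scale only the numeric columns; pass the dummy column through unchanged
ct = ColumnTransformer([("num", StandardScaler(), numeric_cols)],
                       remainder="passthrough")
arr = ct.fit_transform(X)

# ColumnTransformer emits the transformed columns first, then the passthrough ones,
# so rebuild the frame and restore the original column order
out_cols = numeric_cols + [c for c in X.columns if c not in numeric_cols]
X_scaled = pd.DataFrame(arr, columns=out_cols, index=X.index)[X.columns]
print(X_scaled)
```

The custom class keeps the original column order automatically, which is convenient when building the coefficient table later; with `ColumnTransformer` the order has to be restored by hand as above.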

#### Split data into training and test sets and shuffle data

In [24]:
#get library
from sklearn.model_selection import train_test_split

In [25]:
#split data into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(scaled_inputs,targets,train_size=0.8,shuffle=True,random_state = 42)
#check shapes
print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
print(ytest.shape)

(560, 29)
(140, 29)
(560,)
(140,)
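
In the cells above the scaler is fitted on all 700 rows before the split, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on synthetic data (the arrays `X` and `y` are randomly generated, not the absenteeism data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=30, scale=5, size=(100, 2))  # synthetic numeric inputs
y = rng.integers(0, 2, size=100)                # synthetic binary targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, random_state=42)

# Statistics come from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# Test rows reuse the training statistics
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_test_scaled.mean(axis=0))
```

With this ordering the training columns have mean 0 exactly, while the test columns are only approximately centred, which is expected.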


#### Logistic regression algorithm:

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [27]:
#create variable for logistic regression
reg = LogisticRegression(verbose=2)
#fit the model
reg.fit(xtrain,ytrain)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s finished


LogisticRegression(verbose=2)

In [28]:
#checking accuracy on the training data
reg.score(xtrain,ytrain)

0.7625

In [29]:
#checking coefficients and creating a summary table
feature_names = unscaled_inputs.columns.values
summary_table = pd.DataFrame(columns=['Feature name'],data = feature_names)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table2 = summary_table.copy()
summary_table2.loc[-1] = 'Intercept',reg.intercept_[0]
summary_table2.index = summary_table2.index + 1
summary_table2 = summary_table2.sort_index()
summary_table2

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.295871
1,reason1,2.816115
2,reason2,2.386337
3,reason3,1.384314
4,reason4,0.759678
5,Tuesday,-0.221583
6,Wednesday,-0.35664
7,Thursday,-0.325011
8,Friday,-0.528899
9,Saturday,0.510989


### Interpreting weights and bias: <a name="interpret"></a>

In [30]:
# we take the exponential of the coefficients to give us the odds ratio for each feature
summary_table2['odds_ratio'] = np.exp(summary_table2['Coefficient'])
summary_table2.sort_values('odds_ratio',ascending = False)

Unnamed: 0,Feature name,Coefficient,odds_ratio
1,reason1,2.816115,16.711794
2,reason2,2.386337,10.873595
3,reason3,1.384314,3.992087
16,month_7,0.991098,2.694192
4,reason4,0.759678,2.137588
22,Transportation Expense,0.633395,1.883995
9,Saturday,0.510989,1.666939
13,month_4,0.504879,1.656785
15,month_6,0.427432,1.533315
28,Children,0.40279,1.495992


We can see that absence reasons 1 (illness), 2 (pregnancy related) and 3 (poisoning related) have the biggest impact on whether an absence will be excessive: they multiply the odds of an excessive absence by roughly 16.7, 10.9 and 4.0 respectively, compared to the reference category of no reason given.

We also see that, for absences in July, the odds of an excessive absence are around 2.7 times those in January. This could be because children are finishing school for the summer holidays, so parents need to look after them during this period.

We could use backwards elimination to remove features whose odds ratio is close to 1 (coefficient close to 0), simplifying the model in the future if desired.
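
To make the odds-ratio reading concrete, the sketch below recomputes a few odds ratios from the coefficients in the table and converts log-odds into probabilities. The "baseline" here is an assumption for illustration: all dummies at their reference category and all scaled numeric inputs at 0 (their mean), so that only the intercept contributes.

```python
import numpy as np

# Coefficients copied from the summary table above
intercept = -1.295871
coefficients = {"reason1": 2.816115, "reason2": 2.386337,
                "reason3": 1.384314, "month_7": 0.991098}

# The odds ratio is the exponential of the coefficient
odds_ratios = {name: float(np.exp(b)) for name, b in coefficients.items()}
print(odds_ratios)

def to_probability(log_odds):
    """Convert log-odds to a probability with the logistic function."""
    return 1.0 / (1.0 + np.exp(-log_odds))

# Baseline probability versus the same case with reason1 switched on
p_base = float(to_probability(intercept))
p_reason1 = float(to_probability(intercept + coefficients["reason1"]))
print(p_base, p_reason1)
```

An odds ratio multiplies the odds, not the probability, which is why the probability does not rise by a factor of 16.7 even though the odds do.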

### Testing the model on the test dataset:

In [31]:
#accuracy
reg.score(xtest,ytest)

0.7142857142857143

When the model is applied to the test data, we obtain an accuracy of around 71%, compared with around 76% on the training data. This means that the current model correctly classifies around 71% of unseen absences as either moderate or excessive.

The model could be further improved if more accurate predictions were required, possibly by adding more variables for each staff member or by using more advanced machine learning methods such as a neural network.
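
Accuracy alone does not show which kind of mistake the model makes. The `metrics` module imported earlier offers a confusion matrix and a per-class report for this. A minimal sketch on synthetic data follows (the data and the `model` name are illustrative; on the notebook's objects the inputs would be `ytest` and `reg.predict(xtest)`):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1])
accuracy = metrics.accuracy_score(y_test, y_pred)

print(cm)
print(metrics.classification_report(y_test, y_pred))
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total equals the accuracy reported by `score`.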

In [32]:
#see prediction probabilities
predicted_proba = reg.predict_proba(xtest)
#only interested in the probability of class 1 (an excessive absence)
predicted_proba_ones = predicted_proba[:,1]
predicted_proba_ones

array([0.61548705, 0.16472707, 0.34704039, 0.42957   , 0.61071643,
       0.87864466, 0.4642995 , 0.43326159, 0.1461904 , 0.25344276,
       0.1811559 , 0.2815058 , 0.83878731, 0.41518423, 0.15121937,
       0.65481399, 0.13368497, 0.62829318, 0.10957746, 0.31547568,
       0.51820714, 0.33734308, 0.37704962, 0.17769312, 0.10215172,
       0.79465649, 0.60703954, 0.34848909, 0.42071842, 0.65600239,
       0.12528516, 0.11780353, 0.72588173, 0.58039098, 0.22490442,
       0.64894394, 0.3375235 , 0.16960897, 0.8552578 , 0.33777266,
       0.64869635, 0.21721857, 0.79700956, 0.19526292, 0.19433686,
       0.60701829, 0.59380338, 0.86014599, 0.18154406, 0.15541438,
       0.1969692 , 0.39687632, 0.36662637, 0.98105487, 0.14026365,
       0.25029093, 0.94789252, 0.39164209, 0.89651932, 0.76403361,
       0.43744776, 0.07135281, 0.53489115, 0.43038081, 0.1811559 ,
       0.38211568, 0.6197357 , 0.05298041, 0.42961447, 0.64324351,
       0.25344276, 0.55893441, 0.85493444, 0.39687632, 0.35979
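
These probabilities come from applying the logistic function to the linear score of each row. The sketch below reproduces `predict_proba` by hand for a binary model fitted on synthetic data, and shows that `predict` corresponds to a 0.5 threshold. The data and variable names are illustrative; the default `LogisticRegression` settings are assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(size=150) > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Linear score, then the logistic (sigmoid) function
z = X @ model.coef_[0] + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
proba_ones = model.predict_proba(X)[:, 1]

# For the binary case, predict() matches thresholding the probability at 0.5
manual_labels = (proba_ones > 0.5).astype(int)

print(np.allclose(manual_proba, proba_ones))
print(np.array_equal(manual_labels, model.predict(X)))
```

If a different trade-off between missed and false alarms is wanted, the threshold on the class 1 probability can be moved away from 0.5.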

### Save model for production use:

We now serialize the regression (reg) and scaler (absenteeism_scaler) objects to binary files via pickling, so that the model can be applied to new data without having to be trained again.

In [33]:
import pickle
#save reg variable to file
with open('model','wb') as file:
    pickle.dump(reg,file)
#save standardisation scaler for preprocessing of new data
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
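
Loading works the same way in reverse. The sketch below round-trips a model through pickle in memory and confirms that the restored object gives the same predictions; it uses a small synthetic model rather than the files written above, and with files the calls would be `pickle.load` on a handle opened in `'rb'` mode.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic model (illustrative only)
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Serialise to bytes and restore
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict_proba(X), restored.predict_proba(X))
print(same)
```

Note that unpickling the scaler requires the `CustomScaler` class definition to be importable in the loading session, and pickles should only be loaded from trusted sources.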