# Building a Logistic Regression Model for the Absenteeism Project

Now that we have our preprocessed data, we can start building a model that takes in the predictors and predicts whether a given person is likely to be absent from work.

In this notebook, we will be creating a Logistic Regression (Logit) model to do the job.

## Import the Necessary Libraries

In [182]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Import the Data

In [183]:
data_preprocessed = pd.read_csv('../Dataset/Absenteeism_preprocessed.csv')
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2


## Create the Targets

For this project, we will differentiate between excessive and moderate absenteeism. To do so, we will create a target variable that is 1 when `Absenteeism Time in Hours` is greater than its median value and 0 otherwise, so rows exactly equal to the median are labelled 0.

Finally, we can drop the `Absenteeism Time in Hours` column from the dataframe, since the targets are derived from it and it must not be used as a predictor.

In [184]:
# create a target variable based on the median of `Absenteeism Time in Hours`
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
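
To make the thresholding rule concrete, here is a minimal sketch on a made-up array of hours. The values and the names `hours` and `toy_targets` are illustrative assumptions, not taken from the dataset. It shows that with a strict `>` comparison, rows equal to the median fall into class 0, which is one way the share of class 1 can end up below 50%.

```python
import numpy as np

# hypothetical absenteeism hours, for illustration only
hours = np.array([0, 2, 3, 3, 3, 4, 8, 8])

# the median of these eight values is 3.0
median_hours = np.median(hours)

# strictly greater than the median -> 1, everything else (including ties) -> 0
toy_targets = np.where(hours > median_hours, 1, 0)

print(median_hours)
print(toy_targets)
print(toy_targets.mean())  # share of class 1
```

Here three of the eight rows sit exactly on the median and are all assigned to class 0.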

In [185]:
# add the newly created targets to the dataframe
data_preprocessed['Excessive Absenteeism'] = targets
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0


In [186]:
# drop the `Absenteeism Time in Hours` column
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

data_with_targets

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,0


### Balancing classes

A classifier trained on heavily imbalanced classes tends to favour the majority class, so we should check that the two classes are roughly balanced before training. Balanced classes help the model generalise better.

In [187]:
# check the proportion of the 2 classes
targets.sum() / targets.shape[0]

0.45571428571428574

As we can see, around 46% of the samples belong to class 1 (Excessive Absenteeism) and the remaining 54% to class 0 (Moderate Absenteeism). Hence, the dataset is roughly balanced and needs no further intervention.
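
As a small sketch of the same check, the per-class counts can be read off with `np.bincount`. The array `toy_targets` below is a stand-in, built on the assumption that the dataset has 700 rows of which 319 are class 1 (which reproduces the proportion printed above); it is not the real `targets` array.

```python
import numpy as np

# stand-in for `targets`: 381 zeros and 319 ones (assumed counts)
toy_targets = np.concatenate([np.zeros(381, dtype=int), np.ones(319, dtype=int)])

counts = np.bincount(toy_targets)   # [rows in class 0, rows in class 1]
shares = counts / counts.sum()      # proportion of each class

print(counts)
print(shares.round(4))
```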

## Create the Inputs

In [188]:
# inputs will be all columns except the targets
unscaled_inputs = data_with_targets.iloc[:, :-1]
unscaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


## Scale the Inputs

As we can see, the inputs vary in magnitude and units, which can affect the model negatively: features with larger values can dominate the fit and make the coefficients hard to compare. To put all the features on an equal footing, we should scale them.

I will use the Standard Scaler such that all column values have a mean of 0 and standard deviation of 1.

In [189]:
# create the Standard Scaler
absenteeism_scaler = StandardScaler()

# fit the scaler to the data
absenteeism_scaler.fit(unscaled_inputs)

# transform the inputs
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [190]:
scaled_inputs

array([[-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         0.88046927,  0.26848661],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.01928035, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.91902997, -0.58968976],
       ...,
       [ 1.73205081, -0.09298136, -0.31448545, ...,  2.23224237,
        -0.91902997, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ...,  2.23224237,
        -0.91902997, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.01928035,  0.26848661]])

In [191]:
type(scaled_inputs)

numpy.ndarray
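
For intuition, standardization can be sketched by hand as `(x - mean) / std`. The example below uses only the five `Age` values visible in the table preview, so its numbers will differ from the notebook's `scaled_inputs`, which are computed over all rows. `StandardScaler` uses the population standard deviation (`ddof=0`), and the sketch does the same.

```python
import numpy as np

# `Age` values from the first five rows shown earlier
age = np.array([33, 50, 38, 39, 33], dtype=float)

# subtract the mean and divide by the population standard deviation (ddof=0)
age_scaled = (age - age.mean()) / age.std(ddof=0)

print(age_scaled.round(3))
print(age_scaled.mean().round(6), age_scaled.std(ddof=0).round(6))
```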

## Split the Dataset

We will split the dataset into training and test sets in an 80:20 ratio. We will fix the random state so the split is reproducible, and stratify on the targets so that both sets keep the same class balance.

In [192]:
# split the data into 80:20
# use random state = 1
# stratify the data
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size=0.2, stratify=targets, random_state=1)

In [193]:
# check train shape
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [194]:
# check test shape
print(x_test.shape, y_test.shape)

(140, 14) (140,)


In [195]:
# check dataset balance
print(y_train.sum() / y_train.shape[0])
print(y_test.sum() / y_test.shape[0])

0.45535714285714285
0.45714285714285713


The split was a success: the training set has 560 rows, the test set has 140, and both keep roughly the same class balance (about 46% in class 1).
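
One variation worth considering is to split first and fit the scaler on the training rows only, so that no test-set statistics enter the preprocessing. The sketch below is not what this notebook does; it uses synthetic data, and `toy_inputs`, `toy_targets`, and `toy_scaler` are hypothetical names standing in for `unscaled_inputs`, `targets`, and `absenteeism_scaler`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-ins for `unscaled_inputs` and `targets`
rng = np.random.default_rng(0)
toy_inputs = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
toy_targets = rng.integers(0, 2, size=200)

# split first, on the unscaled data
x_train_raw, x_test_raw, y_train_toy, y_test_toy = train_test_split(
    toy_inputs, toy_targets, test_size=0.2, stratify=toy_targets, random_state=1
)

# learn the mean and standard deviation from the training rows only
toy_scaler = StandardScaler().fit(x_train_raw)
x_train_scaled = toy_scaler.transform(x_train_raw)
x_test_scaled = toy_scaler.transform(x_test_raw)  # reuses the training statistics

print(x_train_scaled.mean(axis=0).round(6))  # zero up to rounding
print(x_test_scaled.mean(axis=0).round(3))   # close to, but not exactly, zero
```

With this ordering the test set is treated the way genuinely new data would be at prediction time.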

## Build the Model

In [196]:
# build a logistic regression model
reg = LogisticRegression()

# train the model
reg.fit(x_train, y_train)

In [197]:
# check the accuracy
reg.score(x_train, y_train)

0.7821428571428571

Using the model, we achieved an accuracy of about 78% on the training data.

### Manually check the accuracy

In [198]:
model_output = reg.predict(x_train)
model_output

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,

In [199]:
true_val = np.sum(model_output == y_train)
acc = true_val / model_output.shape[0]
acc

0.7821428571428571

## Find the Intercept and Coefficients

In [200]:
# find intercept
reg.intercept_

array([-0.19825352])

In [201]:
# find coefficients
reg.coef_

array([[ 2.10842293,  0.33410377,  1.48834795,  1.34865044,  0.08599849,
        -0.30055171,  0.70990632,  0.00509328, -0.21449324,  0.02097972,
         0.29245484,  0.05150941,  0.45863909, -0.26910136]])
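
To see how the intercept and coefficients combine into a prediction, here is a minimal sketch of the logistic function applied to one row of standardized inputs. The numbers are copied from the outputs above; the helper `predict_proba_one` and the all-zeros `average_row` are hypothetical, introduced only for illustration. A row of zeros corresponds to every feature sitting at its mean, so only the intercept contributes.

```python
import numpy as np

# values copied from the outputs above
intercept = -0.19825352
coefs = np.array([2.10842293, 0.33410377, 1.48834795, 1.34865044, 0.08599849,
                  -0.30055171, 0.70990632, 0.00509328, -0.21449324, 0.02097972,
                  0.29245484, 0.05150941, 0.45863909, -0.26910136])

def predict_proba_one(x_scaled):
    """Probability of class 1 for one row of standardized inputs."""
    log_odds = intercept + np.dot(coefs, x_scaled)
    return 1.0 / (1.0 + np.exp(-log_odds))

# hypothetical row: every standardized feature equal to 0 (its mean)
average_row = np.zeros(coefs.shape[0])
p = predict_proba_one(average_row)

print(round(p, 4))            # probability of excessive absenteeism
print(round(p / (1 - p), 6))  # odds, equal to exp(intercept)
```

For a binary problem, `reg.predict` returns class 1 when this probability exceeds 0.5, which is the same as the log-odds being positive.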

### Create a Summary Table

A summary table will give us a better look at the coefficients and their respective features, and the intercept.

In [210]:
# create a dataframe for features and coefficients
summary_table = pd.DataFrame(data = unscaled_inputs.columns.values, columns = ['Feature Names'])

In [211]:
# add the coefficients
# reg.coef_ has shape (1, n_features), so transpose it into a single column
summary_table['Coefficients'] = np.transpose(reg.coef_)

In [212]:
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Reason 1,2.108423
1,Reason 2,0.334104
2,Reason 3,1.488348
3,Reason 4,1.34865
4,Month Value,0.085998
5,Day of the Week,-0.300552
6,Transportation Expense,0.709906
7,Distance to Work,0.005093
8,Age,-0.214493
9,Daily Work Load Average,0.02098


In [213]:
# add the intercept
# shift the index up by 1 to free index 0, then place the intercept there
summary_table.index = summary_table.index + 1

summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

In [214]:
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Intercept,-0.198254
1,Reason 1,2.108423
2,Reason 2,0.334104
3,Reason 3,1.488348
4,Reason 4,1.34865
5,Month Value,0.085998
6,Day of the Week,-0.300552
7,Transportation Expense,0.709906
8,Distance to Work,0.005093
9,Age,-0.214493
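
As an alternative sketch, the same table can be assembled in one step, with the intercept already in the first row, which avoids the index shift and re-sort. The arrays below are hypothetical stand-ins for `unscaled_inputs.columns.values`, `reg.intercept_`, and `reg.coef_`, trimmed to three features for brevity.

```python
import numpy as np
import pandas as pd

# hypothetical stand-ins, trimmed to three features
feature_names = np.array(['Reason 1', 'Reason 2', 'Reason 3'])
intercept_ = np.array([-0.19825352])
coef_ = np.array([[2.10842293, 0.33410377, 1.48834795]])  # shape (1, n_features)

toy_summary = pd.DataFrame({
    # intercept label first, then the feature names
    'Feature Names': ['Intercept', *feature_names],
    # intercept value first, then the flattened coefficients
    'Coefficients': np.concatenate([intercept_, coef_.ravel()]),
})

print(toy_summary)
```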


## Interpret the Results

We should calculate the odds ratio, the exponential of each coefficient, to find out how much a change in each feature impacts the odds of excessive absenteeism.

In [215]:
# calculate the odds ratio
# append in the summary table
summary_table['Odds Ratio'] = np.exp(summary_table['Coefficients'])

In [216]:
summary_table

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
0,Intercept,-0.198254,0.820162
1,Reason 1,2.108423,8.235243
2,Reason 2,0.334104,1.396688
3,Reason 3,1.488348,4.429771
4,Reason 4,1.34865,3.852223
5,Month Value,0.085998,1.089805
6,Day of the Week,-0.300552,0.74041
7,Transportation Expense,0.709906,2.033801
8,Distance to Work,0.005093,1.005106
9,Age,-0.214493,0.80695


In [217]:
# sort by highest odds ratio first
summary_table.sort_values(by=['Odds Ratio'], ascending=False)

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
1,Reason 1,2.108423,8.235243
3,Reason 3,1.488348,4.429771
4,Reason 4,1.34865,3.852223
7,Transportation Expense,0.709906,2.033801
13,Children,0.458639,1.58192
2,Reason 2,0.334104,1.396688
11,Body Mass Index,0.292455,1.339712
5,Month Value,0.085998,1.089805
12,Education,0.051509,1.052859
10,Daily Work Load Average,0.02098,1.021201


A feature is not particularly important if its:
- coefficient is around 0
- odds ratio is around 1

A weight of 0 implies that, no matter the feature value, it is multiplied by 0 in the model and so contributes nothing to the prediction.

For a unit change in a standardized feature (that is, a change of one standard deviation in the original units), the odds are multiplied by the odds ratio. Hence, an odds ratio of 1 implies no change.
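
A short worked sketch of this reading, using `Reason 1`. Because the inputs were standardized, the dummy variables were rescaled too, so switching `Reason 1` from 0 to 1 is more than one standardized unit. The two scaled values below are read from the `scaled_inputs` output, on the assumption that its first column is `Reason 1`.

```python
import math

# coefficient of `Reason 1` from the summary table
coef_reason_1 = 2.10842293

# odds multiplier for a +1 change in the *standardized* feature
odds_ratio = math.exp(coef_reason_1)

# the two standardized values of `Reason 1` visible in `scaled_inputs`
# (assumes the first column of that array is `Reason 1`)
scaled_when_0 = -0.57735027
scaled_when_1 = 1.73205081

# switching the dummy from 0 to 1 moves the standardized feature by this much
step = scaled_when_1 - scaled_when_0
odds_multiplier_0_to_1 = math.exp(coef_reason_1 * step)

print(round(odds_ratio, 6))
print(round(step, 4))
print(round(odds_multiplier_0_to_1, 1))
```

The tabulated odds ratio therefore describes a one-standard-deviation step, not a 0-to-1 switch, for the dummy columns.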

I have assumed at least a 5% change in odds to be significant.

From the above table, it seems that `Daily Work Load Average` and `Distance to Work` make no meaningful difference, given all the other features: their odds ratios (about 1.02 and 1.005) are within 5% of 1.
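
The 5% rule can be sketched as a filter on the odds ratios. The coefficients are copied from the `reg.coef_` output and the names from the input columns; reading "5% change" as `|odds ratio - 1| < 0.05` is an assumption about the intended threshold.

```python
import numpy as np

feature_names = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
                 'Day of the Week', 'Transportation Expense', 'Distance to Work',
                 'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
                 'Children', 'Pets']
coefs = np.array([2.10842293, 0.33410377, 1.48834795, 1.34865044, 0.08599849,
                  -0.30055171, 0.70990632, 0.00509328, -0.21449324, 0.02097972,
                  0.29245484, 0.05150941, 0.45863909, -0.26910136])

odds_ratios = np.exp(coefs)

# assumed rule: an odds ratio within 5% of 1 counts as negligible
negligible = np.abs(odds_ratios - 1) < 0.05
flagged = [name for name, flag in zip(feature_names, negligible) if flag]

print(flagged)
```

Under this reading `Education`, with an odds ratio of about 1.053, falls just outside the band and is kept.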