# Building a Logistic Regression Model for the Absenteeism Project

Now that we have our preprocessed data, we can start building a model that takes in the predictors and predicts whether a given person is likely to be excessively absent from work.

In this notebook, we will be creating a Logistic Regression (Logit) model to do the job.

## Import the Necessary Libraries

In [245]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin

## Import the Data

In [246]:
data_preprocessed = pd.read_csv('../Dataset/Absenteeism_preprocessed.csv')
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2


## Create the Targets

For this project, we will differentiate between excessive and moderate absenteeism. To do so, we will create a target variable that is 1 if `Absenteeism Time in Hours` is greater than its median value, and 0 otherwise.

Finally, we can drop the `Absenteeism Time in Hours` column from the dataframe as it won't be needed anymore.

In [247]:
# create a target variable based on the median of `Absenteeism Time in Hours`
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
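
To make the median rule concrete, here is a small sketch on made-up hours (the values and the names `hours` and `toy_targets` are illustrative, not taken from the dataset). Values equal to the median fall into class 0, which is one reason the two classes need not split exactly 50:50.

```python
import numpy as np

# hypothetical absenteeism hours, for illustration only
hours = np.array([0, 2, 3, 3, 4, 8, 40])

# middle value of the sorted hours
median_hours = np.median(hours)

# 1 = strictly above the median, 0 = at or below the median
toy_targets = np.where(hours > median_hours, 1, 0)

print(median_hours)  # 3.0
print(toy_targets)   # [0 0 0 0 1 1 1]
```

The same pattern shows up in the dataframe output: a row with 3 hours is labelled 0, while a row with 4 hours is labelled 1.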

In [248]:
# add the newly created targets to the dataframe
data_preprocessed['Excessive Absenteeism'] = targets
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0


In [249]:
# drop the `Absenteeism Time in Hours` column
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

data_with_targets

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,0


### Balancing classes

For a classification model, it helps if the classes in the dataset are roughly balanced. A heavily imbalanced dataset can bias the model towards the majority class, while balanced classes tend to give a more generalised model.

In [250]:
# check the proportion of the 2 classes
targets.sum() / targets.shape[0]

0.45571428571428574

As we can see, around 46% of the data belongs to class 1 (Excessive Absenteeism) and the remaining 54% to class 0 (Moderate Absenteeism). Hence, the dataset is roughly balanced and needs no further intervention.

## Create the Inputs

In [251]:
# inputs will be all columns except the targets
unscaled_inputs = data_with_targets.iloc[:, :-1]
unscaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


## Scale the Inputs

As we can see, the inputs vary in magnitude and units, which can affect our model negatively. To put all the features on a comparable scale, we should standardize them.

I will use the Standard Scaler such that all column values have a mean of 0 and standard deviation of 1.
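
As a rough sketch of what standardization does to a single column, the following uses toy values (the `ages` array is made up). `StandardScaler` divides by the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np

# toy column, for illustration only
ages = np.array([28.0, 33.0, 38.0, 41.0, 50.0])

# z-score: subtract the column mean, divide by the population standard deviation
scaled = (ages - ages.mean()) / ages.std(ddof=0)

# after scaling, the column has mean 0 and standard deviation 1 (up to rounding)
print(round(float(scaled.mean()), 6), round(float(scaled.std(ddof=0)), 6))
```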

I will be creating a custom scaler that does not scale the dummy variables.

In [255]:
# create the Custom Scaler class

class CustomScaler(BaseEstimator, TransformerMixin):

    # init: the columns to scale, plus the options passed on to StandardScaler

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):

        # keep the constructor arguments as attributes so that
        # BaseEstimator.get_params() can find them
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std

        # the scaler is nothing but a StandardScaler object
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # with a 'twist': it is applied to the chosen columns only
        self.columns = columns
        self.mean_ = None
        self.var_ = None


    # the fit method, which relies on StandardScaler's own fit

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # per-column mean and population variance (ddof=0, as StandardScaler uses)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    # the transform method, which does the actual scaling

    def transform(self, X, y=None, copy=None):

        # record the initial order of the columns
        init_col_order = X.columns

        # scale the chosen features; keep the original index so that
        # the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)

        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]

        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]
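
One detail worth checking in a transformer like this: the scaler returns a plain NumPy array, so the frame rebuilt from it needs the original index for `pd.concat` to line the rows up. This only matters when the dataframe's index is not the default 0..n-1 (for instance, a subset of rows). A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# toy frame with a non-default index, for illustration only
X = pd.DataFrame({'dummy': [0, 1, 0], 'age': [30.0, 40.0, 50.0]},
                 index=[10, 11, 12])

# a plain array, as a scaler would return it (the index is lost here)
scaled_vals = (X[['age']] - X['age'].mean()).to_numpy()

# without the index: rows 0..2 do not match rows 10..12, so concat pads with NaN
misaligned = pd.concat(
    [X[['dummy']], pd.DataFrame(scaled_vals, columns=['age'])], axis=1)

# with the index: rows line up one to one
aligned = pd.concat(
    [X[['dummy']], pd.DataFrame(scaled_vals, columns=['age'], index=X.index)], axis=1)

print(misaligned.shape, aligned.shape)  # (6, 2) (3, 2)
```

In this notebook the scaler is applied to the full dataframe with its default index, so the rows align either way.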

In [256]:
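# select the columns to omit (the dummy variables)
# note: the dataframe's column names contain spaces ('Reason 1', ...), so the
# underscore names below match no column; as the scaled output further down
# shows, the Reason columns are therefore scaled and only 'Education' is omitted
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']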

# create the columns to scale, based on the columns to omit
# use list comprehension to iterate over the list
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]
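
Because the list comprehension silently ignores names that are not in the dataframe, it can help to check the omit list against the actual column names. The sketch below uses a shortened column list with the spaced names shown in the dataframe output. `unmatched` is a hypothetical helper variable.

```python
# shortened column list, using the names as they appear in the dataframe
columns = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value', 'Education']

# omit list as written with underscores
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

# names in the omit list that do not exist in the dataframe
unmatched = [c for c in columns_to_omit if c not in columns]

# columns that will actually be scaled
columns_to_scale = [c for c in columns if c not in columns_to_omit]

print(unmatched)         # ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']
print(columns_to_scale)  # ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value']
```

With the spaced names (`'Reason 1'` and so on) in the omit list, `unmatched` would be empty and the Reason columns would be left unscaled.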

In [257]:
# declare a scaler object, specifying the columns to scale
absenteeism_scaler = CustomScaler(columns_to_scale)

In [258]:
# fit the scaler to the data
absenteeism_scaler.fit(unscaled_inputs)

# transform the inputs
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)


In [259]:
scaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,-0.577350,-0.092981,-0.314485,0.821365,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,-0.577350,-0.092981,-0.314485,-1.217485,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,-0.577350,-0.092981,-0.314485,0.821365,0.182726,-0.007725,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1.732051,-0.092981,-0.314485,-1.217485,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,-0.577350,-0.092981,-0.314485,0.821365,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1.732051,-0.092981,-0.314485,-1.217485,-0.388293,-0.007725,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1.732051,-0.092981,-0.314485,-1.217485,-0.388293,-0.007725,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1.732051,-0.092981,-0.314485,-1.217485,-0.388293,0.668253,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,-0.577350,-0.092981,-0.314485,0.821365,-0.388293,0.668253,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [260]:
type(scaled_inputs)

pandas.core.frame.DataFrame

## Split the Dataset

We will split the dataset into training and testing sets in an 80:20 ratio. We will fix the random state for reproducibility, and stratify the split to preserve the class balance in both sets.

In [261]:
# split the data into 80:20
# use random state = 1
# stratify the data
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size=0.2, stratify=targets, random_state=1)

In [262]:
# check train shape
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [263]:
# check test shape
print(x_test.shape, y_test.shape)

(140, 14) (140,)


In [264]:
# check dataset balance
print(y_train.sum() / y_train.shape[0])
print(y_test.sum() / y_test.shape[0])

0.45535714285714285
0.45714285714285713


The split was a success and the class balance has also been maintained.
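
To illustrate what `stratify` does, the following sketch splits a toy dataset (the arrays `X_toy` and `y_toy` are made up) and checks that both parts keep the same share of class 1:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 20 samples, 8 of them in class 1 (40%)
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([1] * 8 + [0] * 12)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# both parts keep the 40% share of class 1
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```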

## Build the Model

In [265]:
# build a logistic regression model
reg = LogisticRegression()

# train the model
reg.fit(x_train, y_train)

In [266]:
# check the accuracy
reg.score(x_train, y_train)

0.7821428571428571

Using the model, we achieved a training accuracy of 78%.
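
For context, a classifier that always predicts the majority class (0) would be correct on every class-0 row. Using the training-set share of class 1 printed in the balance check, that baseline works out as follows (the variable names are illustrative):

```python
# share of class 1 in the training targets, as printed in the balance check
share_class_1 = 0.45535714285714285

# always predicting class 0 is right exactly when the true class is 0
baseline_accuracy = 1 - share_class_1

# training accuracy reported by reg.score
model_accuracy = 0.7821428571428571

print(round(baseline_accuracy, 4))                   # 0.5446
print(round(model_accuracy - baseline_accuracy, 4))  # 0.2375
```

Both numbers refer to the training data. The test set is the fairer yardstick for how the model generalises.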

### Manually check the accuracy

In [267]:
model_output = reg.predict(x_train)
model_output

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,

In [268]:
true_val = np.sum(model_output == y_train)
acc = true_val / model_output.shape[0]
acc

0.7821428571428571

## Find the Intercept and Coefficients

In [269]:
# find intercept
reg.intercept_

array([-0.2192659])

In [270]:
# find coefficients
reg.coef_

array([[ 2.10882947,  0.33460029,  1.48895172,  1.34893831,  0.08582456,
        -0.30031065,  0.71072768,  0.00376763, -0.21499039,  0.02048637,
         0.29134028,  0.12707367,  0.45766961, -0.26984331]])
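
To see how these numbers turn into a prediction, here is a minimal sketch of the logistic regression formula: the log-odds are the intercept plus the sum of each coefficient times its input, and the probability is the sigmoid of the log-odds. The helper `predict_proba_manual` is a hypothetical name (for the fitted model, `reg.predict_proba` performs this computation), and the two input rows are made up.

```python
import math

# intercept and coefficients copied from the outputs above
intercept = -0.2192659
coef = [2.10882947, 0.33460029, 1.48895172, 1.34893831, 0.08582456,
        -0.30031065, 0.71072768, 0.00376763, -0.21499039, 0.02048637,
        0.29134028, 0.12707367, 0.45766961, -0.26984331]

def predict_proba_manual(x):
    """Probability of class 1 for one row of model inputs."""
    log_odds = intercept + sum(w * v for w, v in zip(coef, x))
    return 1 / (1 + math.exp(-log_odds))

# made-up row: every input equal to 0
x_zero = [0.0] * 14

# same row, but Transportation Expense (index 6) raised by one unit
x_transport = x_zero.copy()
x_transport[6] = 1.0

print(round(predict_proba_manual(x_zero), 3))       # 0.445
print(round(predict_proba_manual(x_transport), 3))  # 0.62
```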

### Create a Summary Table

A summary table will give us a clearer view of the intercept and of each coefficient next to its feature.

In [271]:
# create a dataframe for features and coefficients
summary_table = pd.DataFrame(data = unscaled_inputs.columns.values, columns = ['Feature Names'])

In [272]:
# add the coefficients
# reg.coef_ has shape (1, n_features), so transpose it into a single column
summary_table['Coefficients'] = np.transpose(reg.coef_)

In [273]:
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Reason 1,2.108829
1,Reason 2,0.3346
2,Reason 3,1.488952
3,Reason 4,1.348938
4,Month Value,0.085825
5,Day of the Week,-0.300311
6,Transportation Expense,0.710728
7,Distance to Work,0.003768
8,Age,-0.21499
9,Daily Work Load Average,0.020486


In [274]:
# add the intercept
# shift all rows 1 down and add at the beginning
summary_table.index = summary_table.index + 1

summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

In [275]:
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Intercept,-0.219266
1,Reason 1,2.108829
2,Reason 2,0.3346
3,Reason 3,1.488952
4,Reason 4,1.348938
5,Month Value,0.085825
6,Day of the Week,-0.300311
7,Transportation Expense,0.710728
8,Distance to Work,0.003768
9,Age,-0.21499
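
As an alternative to shifting the index, the rows could also be assembled in their final order from the start. This is only a sketch: it uses the first few values copied from the outputs above, and `summary_sketch` is a hypothetical name.

```python
import pandas as pd

# first few values copied from the outputs above; in the notebook these would
# come from unscaled_inputs.columns.values, reg.coef_[0] and reg.intercept_[0]
feature_names = ['Reason 1', 'Reason 2', 'Reason 3']
coefficients = [2.108829, 0.3346, 1.488952]
intercept = -0.219266

# put the intercept first, then the features, so no re-indexing is needed
summary_sketch = pd.DataFrame({
    'Feature Names': ['Intercept'] + feature_names,
    'Coefficients': [intercept] + coefficients,
})

print(summary_sketch)
```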


## Interpret the Results

We should calculate the odds ratios to find out how much a change in each feature impacts the odds of excessive absenteeism.

In [276]:
# calculate the odds ratio
# append in the summary table
summary_table['Odds Ratio'] = np.exp(summary_table['Coefficients'])

In [277]:
summary_table

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
0,Intercept,-0.219266,0.803108
1,Reason 1,2.108829,8.238592
2,Reason 2,0.3346,1.397382
3,Reason 3,1.488952,4.432447
4,Reason 4,1.348938,3.853332
5,Month Value,0.085825,1.089615
6,Day of the Week,-0.300311,0.740588
7,Transportation Expense,0.710728,2.035472
8,Distance to Work,0.003768,1.003775
9,Age,-0.21499,0.806549


In [278]:
# sort by highest odds ratio first
summary_table.sort_values(by=['Odds Ratio'], ascending=False)

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
1,Reason 1,2.108829,8.238592
3,Reason 3,1.488952,4.432447
4,Reason 4,1.348938,3.853332
7,Transportation Expense,0.710728,2.035472
13,Children,0.45767,1.580387
2,Reason 2,0.3346,1.397382
11,Body Mass Index,0.29134,1.33822
12,Education,0.127074,1.135501
5,Month Value,0.085825,1.089615
10,Daily Work Load Average,0.020486,1.020698


A feature is not particularly important if its:
- coefficient is around 0
- odds ratio is around 1

A weight of 0 implies that, whatever the feature value, it is multiplied by 0 in the model and so contributes nothing to the prediction.

For a one-unit change in a standardized feature (that is, one standard deviation in the original units), the odds are multiplied by the odds ratio. Hence, an odds ratio of 1 implies no change.
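
The multiplicative reading can be checked with a few lines of arithmetic. In this sketch the two starting log-odds are arbitrary made-up values; the ratio of the odds after and before the change comes out the same regardless of the starting point.

```python
import math

# coefficient of Transportation Expense, copied from the outputs above
w = 0.71072768
odds_ratio = math.exp(w)

# odds = exp(log-odds); raising the feature by 1 adds w to the log-odds
for base_log_odds in (-1.0, 0.5):  # two arbitrary starting points
    odds_before = math.exp(base_log_odds)
    odds_after = math.exp(base_log_odds + w)
    print(round(odds_after / odds_before, 4))  # 2.0355 both times

print(round(odds_ratio, 4))  # 2.0355
```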

I have assumed that a feature needs to change the odds by at least 5% to be meaningful.

From the above table, it seems that `Daily Work Load Average` (odds ratio of about 1.02) and `Distance to Work` (about 1.00) make almost no difference, given all the other features.
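
The 5% rule can be applied mechanically. The sketch below copies the feature names and coefficients from the outputs above; `threshold` and `negligible` are illustrative names.

```python
import math

feature_names = ['Reason 1', 'Reason 2', 'Reason 3', 'Reason 4', 'Month Value',
                 'Day of the Week', 'Transportation Expense', 'Distance to Work',
                 'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
                 'Children', 'Pets']
coef = [2.10882947, 0.33460029, 1.48895172, 1.34893831, 0.08582456,
        -0.30031065, 0.71072768, 0.00376763, -0.21499039, 0.02048637,
        0.29134028, 0.12707367, 0.45766961, -0.26984331]

# the 5% cut-off assumed in the text
threshold = 0.05

# keep the features whose odds ratio is within 5% of 1
negligible = [name for name, w in zip(feature_names, coef)
              if abs(math.exp(w) - 1) < threshold]

print(negligible)  # ['Distance to Work', 'Daily Work Load Average']
```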