# Building Logistic Regression Model for Absenteeism Project

Now that we have our preprocessed data, we can start building a model that takes in the predictors and predicts whether a given person is likely to be excessively absent from work.

In this notebook, we will be creating a Logistic Regression (Logit) model to do the job.

## Import the Necessary Libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin

## Import the Data

In [2]:
data_preprocessed = pd.read_csv('../Dataset/Absenteeism_preprocessed.csv')
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2


## Create the Targets

For this project, we will differentiate between excessive and moderate absenteeism. To do that, we create a target variable that is 1 when `Absenteeism Time in Hours` is greater than its median and 0 otherwise (values equal to the median fall into class 0).

Finally, we can drop the `Absenteeism Time in Hours` column from the dataframe as it won't be needed anymore.
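
To make the handling of ties concrete, here is a minimal sketch on made-up hours (the values are an assumption for illustration, not taken from the dataset). Because the comparison is strict, rows equal to the median end up in class 0.

```python
import numpy as np

# made-up absenteeism hours, purely illustrative
hours = np.array([0, 2, 3, 3, 4, 8, 40])

median = np.median(hours)                      # 3.0 for these values
toy_targets = np.where(hours > median, 1, 0)   # strict comparison: ties become 0

print(median, toy_targets)
```

Note that the two rows with exactly 3 hours are labelled 0, which is one reason the classes need not split exactly 50:50.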

In [3]:
# create a target variable based on the median of `Absenteeism Time in Hours`
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [4]:
# add the newly created targets to the dataframe
data_preprocessed['Excessive Absenteeism'] = targets
data_preprocessed

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,8,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,3,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,8,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,2,0


In [5]:
# drop the `Absenteeism Time in Hours` column
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis=1)

data_with_targets

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0,1
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2,0
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0,1
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0,0


### Balancing classes

A classification model generally benefits from roughly balanced classes. With a heavily skewed dataset, a model can score well simply by always predicting the majority class, and it tends to generalise poorly to the minority class.

In [6]:
# check the proportion of the 2 classes
targets.sum() / targets.shape[0]

0.45571428571428574

As we can see, around 46% of the data belongs to class 1 (Excessive Absenteeism), while the remaining 54% belongs to class 0 (Moderate Absenteeism). Hence, the dataset is roughly balanced and needs no further intervention.
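
The class proportion also gives a useful reference point for later accuracy figures. As a rough sketch, a model that always predicts the larger class would be right about 54% of the time; the variable names below are illustrative only.

```python
# share of class 1, copied from the output of the cell above
positive_share = 0.45571428571428574

# accuracy of a trivial model that always predicts the majority class
baseline_accuracy = max(positive_share, 1 - positive_share)

print(round(baseline_accuracy, 4))
```

Any trained model should beat this baseline to be worth keeping.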

## Create the Inputs

In [7]:
# inputs will be all columns except the targets
unscaled_inputs = data_with_targets.iloc[:, :-1]
unscaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


## Scale the Inputs

As we can see, the inputs vary in magnitude and units, and so can impact our model negatively. To ensure equal importance for all the features, we should scale them.

I will use the Standard Scaler such that all column values have a mean of 0 and standard deviation of 1.

I will be creating a custom scaler that does not scale the dummy variables.
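
Before the full class, here is a minimal sketch of the idea on a toy dataframe (the data and the `cols` list are assumptions for illustration): only the chosen columns are standardised, while the dummy column is left untouched. It uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses as well.

```python
import pandas as pd

# toy data: one dummy column and one numeric column
toy = pd.DataFrame({
    "Reason 1": [1, 0, 0, 1],          # dummy, should stay as 0/1
    "Age": [30.0, 40.0, 50.0, 60.0],   # numeric, should be standardised
})

cols = ["Age"]  # columns chosen for scaling

scaled = toy.copy()
# z = (x - mean) / std, computed per column
scaled[cols] = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)

print(scaled)
```

The custom scaler below does the same thing, but wraps it in the scikit-learn `fit`/`transform` interface so the fitted statistics can be reused on new data.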

In [8]:
# create the Custom Scaler class

class CustomScaler(BaseEstimator,TransformerMixin):

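    # __init__ declares what a CustomScaler object needs:
    # the columns to scale plus the usual StandardScaler options

    def __init__(self,columns,copy=True,with_mean=True,with_std=True):

        # keep the constructor arguments as attributes so that
        # BaseEstimator.get_params() (used by repr and clone) can find them
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # scaler is nothing but a Standard Scaler object
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None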


    # the fit method, which is again based on StandardScaler

    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # per-column statistics (axis=0): one value for each scaled feature
        self.mean_ = np.mean(X[self.columns], axis=0)
        self.var_ = np.var(X[self.columns], axis=0)
        return self

    # the transform method which does the actual scaling

    def transform(self, X, y=None, copy=None):

        # record the initial order of the columns
        init_col_order = X.columns

        # scale all features that you chose when creating the instance of the class
        # keep X's index so that the concat below aligns the rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)

        # declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]

        # return a data frame which contains all scaled features and all 'not scaled' features
        # use the original order (that you recorded in the beginning)
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [9]:
# choose the columns to scale
columns_to_scale = ['Month Value','Day of the Week', 'Transportation Expense', 'Distance to Work','Age', 'Daily Work Load Average', 'Body Mass Index', 'Children', 'Pets']

In [10]:
# declare a scaler object, specifying the columns to scale
absenteeism_scaler = CustomScaler(columns = columns_to_scale)

In [11]:
# fit the scaler to the data
absenteeism_scaler.fit(unscaled_inputs)

# transform the inputs
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [12]:
scaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.007725,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.007725,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,-0.007725,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,0.668253,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.668253,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [13]:
type(scaled_inputs)

pandas.core.frame.DataFrame

## Split the Dataset

We will split the dataset into training and testing in an 80:20 ratio. We will fix a random state to ensure consistency, and stratify the dataset to ensure the class balance in the splits.

In [14]:
# split the data into 80:20
# use random state = 1
# stratify the data
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size=0.2, stratify=targets, random_state=1)

In [15]:
# check train shape
print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [16]:
# check test shape
print(x_test.shape, y_test.shape)

(140, 14) (140,)


In [17]:
# check dataset balance
print(y_train.sum() / y_train.shape[0])
print(y_test.sum() / y_test.shape[0])

0.45535714285714285
0.45714285714285713


The split was a success and the class balance has also been maintained.

## Build the Model

In [18]:
# build a logistic regression model
reg = LogisticRegression()

# train the model
reg.fit(x_train, y_train)

In [19]:
# check the accuracy
reg.score(x_train, y_train)

0.7821428571428571

Using the model, we achieved a training accuracy of 78%.

### Manually check the accuracy

In [20]:
model_output = reg.predict(x_train)
model_output

array([0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,

In [21]:
true_val = np.sum(model_output == y_train)
acc = true_val / model_output.shape[0]
acc

0.7821428571428571
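
Accuracy alone hides which kind of mistake the model makes. As a small sketch on made-up labels (not the notebook's data), `metrics.confusion_matrix` from scikit-learn splits the predictions into true/false negatives and positives, and the accuracy is the diagonal divided by the total.

```python
import numpy as np
from sklearn import metrics

# made-up labels and predictions, purely illustrative
toy_true = np.array([0, 0, 1, 1, 1, 0])
toy_pred = np.array([0, 1, 1, 0, 1, 0])

# rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = metrics.confusion_matrix(toy_true, toy_pred)

# correct predictions sit on the diagonal
toy_acc = np.trace(cm) / cm.sum()

print(cm)
print(toy_acc)
```

The same call with `y_train` and `model_output` would show whether the errors are mostly missed excessive absences or false alarms.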

## Find the Intercept and Coefficients

In [22]:
# find intercept
reg.intercept_

array([-1.72367645])

In [23]:
# find coefficients
reg.coef_

array([[ 2.88027971,  0.92324043,  2.90326738,  0.88967415,  0.07005452,
        -0.2952236 ,  0.59927666,  0.02972708, -0.1832303 ,  0.01310325,
         0.24716318,  0.21118   ,  0.43666114, -0.22218894]])

### Create a Summary Table

A summary table will give us a better look at the coefficients and their respective features, and the intercept.

In [24]:
# create a dataframe for features and coefficients
summary_table = pd.DataFrame(data = unscaled_inputs.columns.values, columns = ['Feature Names'])

In [25]:
# add the coeff
# transpose the coeff as it has multiple columns
summary_table['Coefficients'] = np.transpose(reg.coef_)

In [26]:
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Reason 1,2.88028
1,Reason 2,0.92324
2,Reason 3,2.903267
3,Reason 4,0.889674
4,Month Value,0.070055
5,Day of the Week,-0.295224
6,Transportation Expense,0.599277
7,Distance to Work,0.029727
8,Age,-0.18323
9,Daily Work Load Average,0.013103


In [27]:
# add the intercept
# shift all rows 1 down and add at the beginning
summary_table.index = summary_table.index + 1

summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

In [28]:
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Names,Coefficients
0,Intercept,-1.723676
1,Reason 1,2.88028
2,Reason 2,0.92324
3,Reason 3,2.903267
4,Reason 4,0.889674
5,Month Value,0.070055
6,Day of the Week,-0.295224
7,Transportation Expense,0.599277
8,Distance to Work,0.029727
9,Age,-0.18323


## Interpret the Results

We should calculate the odds ratio to find out how much a change in each feature impacts the odds of being excessively absent.

In [29]:
# calculate the odds ratio
# append in the summary table
summary_table['Odds Ratio'] = np.exp(summary_table['Coefficients'])

In [30]:
summary_table

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
0,Intercept,-1.723676,0.178409
1,Reason 1,2.88028,17.819257
2,Reason 2,0.92324,2.517435
3,Reason 3,2.903267,18.233624
4,Reason 4,0.889674,2.434336
5,Month Value,0.070055,1.072567
6,Day of the Week,-0.295224,0.744365
7,Transportation Expense,0.599277,1.820801
8,Distance to Work,0.029727,1.030173
9,Age,-0.18323,0.832576


In [31]:
# sort by highest odds ratio first
summary_table.sort_values(by=['Odds Ratio'], ascending=False)

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
3,Reason 3,2.903267,18.233624
1,Reason 1,2.88028,17.819257
2,Reason 2,0.92324,2.517435
4,Reason 4,0.889674,2.434336
7,Transportation Expense,0.599277,1.820801
13,Children,0.436661,1.547532
11,Body Mass Index,0.247163,1.280388
12,Education,0.21118,1.235135
5,Month Value,0.070055,1.072567
8,Distance to Work,0.029727,1.030173


A feature is not particularly important if its:
- coefficient is around 0
- odds ratio is around 1

A weight of 0 implies that no matter the feature value, we will multiply it by 0 in the model.

For a unit change in a standardized feature, the odds are multiplied by the odds ratio (for the unscaled dummy variables, a unit change means going from 0 to 1). Hence, an odds ratio of 1 implies no change.

I have assumed at least a 5% change in odds to be significant.

From the above table, it seems that the `Daily Work Load Average` and `Distance to Work` make no difference, given all features.
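
The 5% rule can be checked directly from the coefficients. The sketch below copies the values from `reg.coef_` above, in input-column order, and lists the features whose odds ratio lies within 5% of 1; the dictionary and the `threshold` name are illustrative only.

```python
import math

# coefficients copied from reg.coef_ above, in the order of the input columns
coefficients = {
    "Reason 1": 2.88027971,
    "Reason 2": 0.92324043,
    "Reason 3": 2.90326738,
    "Reason 4": 0.88967415,
    "Month Value": 0.07005452,
    "Day of the Week": -0.2952236,
    "Transportation Expense": 0.59927666,
    "Distance to Work": 0.02972708,
    "Age": -0.1832303,
    "Daily Work Load Average": 0.01310325,
    "Body Mass Index": 0.24716318,
    "Education": 0.21118,
    "Children": 0.43666114,
    "Pets": -0.22218894,
}

threshold = 0.05  # "at least a 5% change in odds"

# a feature is weak if its odds ratio exp(coef) is within 5% of 1
weak = [name for name, coef in coefficients.items()
        if abs(math.exp(coef) - 1.0) < threshold]

print(weak)  # ['Distance to Work', 'Daily Work Load Average']
```

`Month Value`, with an odds ratio of about 1.07, sits just above the threshold and is therefore kept.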

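According to the table, if people have poisoning (Reason 3), their odds of being excessively absent are about 18.2 times the odds of someone for whom none of the reason dummies is set, followed by diseases (Reason 1) at about 17.8 times, pregnancy and giving birth (Reason 2) at about 2.5 times and light diseases (Reason 4) at about 2.4 times. Note that these are ratios of odds, not of probabilities.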
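
To see what an odds ratio of about 18 means in terms of probability, here is a rough sketch using the intercept and the Reason 3 coefficient printed above. It assumes a hypothetical person with every standardized feature at 0 (the dataset average) and `Education` equal to 0; `sigmoid` and `odds` are helper names introduced only for this example.

```python
import math

def sigmoid(z):
    """Logistic function: turns a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

intercept = -1.72367645      # reg.intercept_ from above
coef_reason_3 = 2.90326738   # reg.coef_ entry for Reason 3

# assumed person: all standardized features at 0, no reason dummy set
p_base = sigmoid(intercept)
# same person, but with the Reason 3 dummy set to 1
p_reason3 = sigmoid(intercept + coef_reason_3)

odds_ratio = odds(p_reason3) / odds(p_base)

print(round(p_base, 3), round(p_reason3, 3))
print(round(odds_ratio, 2), round(p_reason3 / p_base, 2))
```

For this assumed person the probability rises from roughly 0.15 to roughly 0.76, which is about 5 times as likely even though the odds are about 18 times higher.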

This is understandable: a person who is poisoned or sick is more likely to miss work. In the case of pregnancy, a woman might miss work when she visits the gynaecologist and when she gives birth (which is often paid leave rather than absence).

Towards the bottom of the table, we can see that a 1 standard deviation increase in Age, Pets and Day of the Week is associated with the odds of missing work excessively going down by about 17%, 20% and 26% respectively.

One interpretation is that as a person gets older (or more senior), he/she has more responsibility at work and chooses to be absent less frequently.

In the case of pets, someone who has a lot of pets might have hired someone to take care of them, so he/she doesn't have to miss work when the pets need to visit the vet.

Also, as we enter into a new week, people might take an extended weekend and be absent, but as we progress towards the middle of the week, absence is less frequent.

## Remove Unnecessary Features

Now we will remove the unnecessary features, split the dataset and rebuild the classification model.

In [32]:
scaled_inputs

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,-1.141882,2.130803,-0.806331,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.007725,-0.654143,1.426749,0.248310,-0.806331,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.668253,0.854936,-1.682647,0.405184,-0.806331,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,0.668253,1.005844,0.412816,-0.536062,-0.806331,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.007725,-0.654143,-0.533522,0.562059,-0.853789,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,-0.007725,0.040034,-0.263140,-1.320435,-0.853789,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,0.668253,1.624567,-0.939096,-1.320435,-0.853789,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.668253,0.190942,-0.939096,-0.692937,-0.853789,-0.408580,1,-0.919030,-0.589690


In [33]:
# remove insignificant features
# update the inputs
scaled_inputs_mod = scaled_inputs.drop(['Distance to Work', 'Daily Work Load Average'], axis=1)

In [34]:
scaled_inputs_mod

Unnamed: 0,Reason 1,Reason 2,Reason 3,Reason 4,Month Value,Day of the Week,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,-0.683704,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-0.683704,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.007725,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.668253,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,0.668253,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.007725,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,-0.007725,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,0.668253,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.668253,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


### Split the Dataset

In [35]:
# split the data into 80:20
# use random state = 1
# stratify the data
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs_mod, targets, test_size=0.2, stratify=targets, random_state=1)

### Build A New Model

In [36]:
# build a logistic regression model
new_reg = LogisticRegression()

# train the model
new_reg.fit(x_train, y_train)

In [37]:
new_reg.score(x_train, y_train)

0.7821428571428571

### Create New Summary Table

In [38]:
# create a dataframe for features and coefficients
summary_table = pd.DataFrame(data = scaled_inputs_mod.columns.values, columns = ['Feature Names'])
# add the coeff
# transpose the coeff as it has multiple columns
summary_table['Coefficients'] = np.transpose(new_reg.coef_)
# add the intercept
# shift all rows 1 down and add at the beginning
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', new_reg.intercept_[0]]
# calculate the odds ratio
# append in the summary table
summary_table['Odds Ratio'] = np.exp(summary_table['Coefficients'])
# sort by highest odds ratio first
summary_table.sort_values(by=['Odds Ratio'], ascending=False)

Unnamed: 0,Feature Names,Coefficients,Odds Ratio
3,Reason 3,2.905188,18.268682
1,Reason 1,2.884277,17.89063
2,Reason 2,0.930553,2.535912
4,Reason 4,0.892699,2.441712
7,Transportation Expense,0.606298,1.833631
11,Children,0.434739,1.54456
9,Body Mass Index,0.249953,1.283965
10,Education,0.183126,1.200965
5,Month Value,0.067504,1.069834
8,Age,-0.189029,0.827763


## Test the Model

Now that training is complete, we should test the model on unseen data to check the accuracy of our model. We will also be checking the probability of a person being excessively absent, given the inputs.

In [39]:
# check test score
new_reg.score(x_test, y_test)

0.75

The model is 75% accurate in predicting excessive absenteeism on the test set. A better model could probably be built using additional inputs that are not present in this dataset but might affect the classification.

Since the test accuracy (75%) is close to the training accuracy (78%), it appears that the model did not overfit the training data and generalises reasonably well to unseen data.

In [40]:
# check the probability of classes
predicted_proba = new_reg.predict_proba(x_test)
predicted_proba

array([[0.79452391, 0.20547609],
       [0.79765248, 0.20234752],
       [0.64926124, 0.35073876],
       [0.69912802, 0.30087198],
       [0.83487669, 0.16512331],
       [0.23108961, 0.76891039],
       [0.09561371, 0.90438629],
       [0.76124205, 0.23875795],
       [0.84012218, 0.15987782],
       [0.64155305, 0.35844695],
       [0.76857971, 0.23142029],
       [0.77709918, 0.22290082],
       [0.73666243, 0.26333757],
       [0.25150345, 0.74849655],
       [0.8398086 , 0.1601914 ],
       [0.40700275, 0.59299725],
       [0.75416584, 0.24583416],
       [0.71114837, 0.28885163],
       [0.739629  , 0.260371  ],
       [0.81976551, 0.18023449],
       [0.29431389, 0.70568611],
       [0.67292592, 0.32707408],
       [0.84238444, 0.15761556],
       [0.0563535 , 0.9436465 ],
       [0.42015331, 0.57984669],
       [0.87163807, 0.12836193],
       [0.19025385, 0.80974615],
       [0.84546082, 0.15453918],
       [0.65908135, 0.34091865],
       [0.61902157, 0.38097843],
       [0.

In [41]:
# check the probability of assigning class=1
predicted_proba[:,1]

array([0.20547609, 0.20234752, 0.35073876, 0.30087198, 0.16512331,
       0.76891039, 0.90438629, 0.23875795, 0.15987782, 0.35844695,
       0.23142029, 0.22290082, 0.26333757, 0.74849655, 0.1601914 ,
       0.59299725, 0.24583416, 0.28885163, 0.260371  , 0.18023449,
       0.70568611, 0.32707408, 0.15761556, 0.9436465 , 0.57984669,
       0.12836193, 0.80974615, 0.15453918, 0.34091865, 0.38097843,
       0.18330961, 0.08860786, 0.20864038, 0.87668012, 0.66722094,
       0.53443538, 0.91845768, 0.10950621, 0.89360646, 0.7485525 ,
       0.16377528, 0.76402249, 0.79680217, 0.54189164, 0.55230928,
       0.20547609, 0.17537755, 0.29339493, 0.25719936, 0.5371035 ,
       0.34569159, 0.34526233, 0.15856448, 0.77041014, 0.7219908 ,
       0.14139781, 0.21811888, 0.15423463, 0.80444383, 0.29254906,
       0.09373584, 0.75568308, 0.56569921, 0.23946768, 0.1910733 ,
       0.87975627, 0.89556944, 0.5338549 , 0.51730302, 0.95278196,
       0.58165426, 0.57085147, 0.56569921, 0.74987513, 0.55421
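
These probabilities are what the predicted classes are derived from. As a small sketch using the first eight class-1 probabilities shown above, a threshold of 0.5 reproduces the usual 0/1 labels, and a stricter threshold (0.8 here, chosen arbitrarily for illustration) flags fewer people.

```python
import numpy as np

# first eight class-1 probabilities copied from the output above
proba_class_1 = np.array([0.20547609, 0.20234752, 0.35073876, 0.30087198,
                          0.16512331, 0.76891039, 0.90438629, 0.23875795])

# default decision rule: class 1 when the probability exceeds 0.5
labels = np.where(proba_class_1 > 0.5, 1, 0)

# a stricter, arbitrarily chosen threshold flags only high-confidence cases
stricter = np.where(proba_class_1 > 0.8, 1, 0)

print(labels)
print(stricter)
```

Keeping the probabilities rather than only the labels leaves the choice of threshold open for whoever uses the predictions.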

## Save the Model

In order to reuse this model in the future without having to perform the training once again, we need to save it. I will be saving it as a pickle.

Similarly, we also need to save the scaler object so that we can scale new data in the future.

In [44]:
# import the library to save model
import pickle

In [45]:
# pickle the model file
with open('model', 'wb') as file:
    pickle.dump(new_reg, file)

In [46]:
# pickle the scaler file
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
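
Loading works the same way in reverse. The sketch below round-trips a hypothetical stand-in object through a temporary file so that it runs on its own; the dictionary and the temporary path are assumptions for illustration, not the notebook's actual model.

```python
import os
import pickle
import tempfile

# hypothetical stand-in for a fitted model, used only to show the round trip
stand_in = {"intercept": -1.72, "coef": [0.6, -0.19]}

# write to a temporary location instead of the notebook's 'model' file
path = os.path.join(tempfile.mkdtemp(), "model")

with open(path, "wb") as file:
    pickle.dump(stand_in, file)

# read it back: note the 'rb' mode and pickle.load
with open(path, "rb") as file:
    restored = pickle.load(file)

print(restored == stand_in)
```

For the real files, the same `open(..., 'rb')` and `pickle.load` pattern applies to `'model'` and `'scaler'`. The `CustomScaler` class has to be defined or importable in the session that loads the scaler, because pickle stores a reference to the class rather than its code.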