# Absenteeism Exercise Machine Learning without Custom Scaler

#### Load the preprocessed data

In [1]:
import numpy as np
import pandas as pd

In [2]:
data_preprocessed = pd.read_csv('Absenteeism-preprocessed.csv')
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


We will use a logistic regression that takes the reason for absence, month of the year, day of the week, transportation expense, distance to work, age, daily work load average, body mass index, education, children and pets of a given employee and predicts whether their absenteeism is excessive.

The model itself will give us a fair indication of which variables are important for the analysis and which aren't.

#### Create the targets

We will use logistic regression to predict absenteeism. Logistic regression is a classification method, so it will assign each person to one of a set of classes.

What are the classes? We must settle that first and then preprocess the data to reflect the decision.

The approach here is to create two classes: one representing people who have been excessively absent, and another representing people who have been moderately absent.

We take the median of the absenteeism time in hours and use it as the cut-off line: everything at or below the median is considered moderate, and everything above the median is considered excessive.

In [3]:
# everyone who has been absent for more than the median is considered excessively absent
# everyone who has been absent for the median or less is considered moderately absent
# so the median is the cut-off line

# if an observation has been absent for no more than the median, we assign it the value 0
# otherwise we assign it the value 1

# in supervised machine learning, these 0s and 1s are the targets
# that is, they are the values we are aiming for
# the task is to predict whether we will obtain a 0 or a 1

# np.where() checks whether a condition is satisfied and assigns a value accordingly

# compute the median inside the np.where() call instead of hard-coding a number
# parametrizing like this makes the code easier to understand and follow
# and it reduces the chance of making mistakes

targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0) 

In [4]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [5]:
data_preprocessed['Excessive Absenteeism'] = targets

In [6]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


#### A comment on the targets

Using the median as a cut-off line is numerically stable and rigid. By using the median we have implicitly balanced the dataset: roughly half of the targets are 0s and the other half are 1s.

This prevents the model from learning to output one of the two classes exclusively, that is, from learning to output only 0s or only 1s.

In [7]:
targets.sum()

319

In [8]:
targets.shape

(700,)

In [9]:
# divide the number of targets that are 1s by the total number of targets
# around 46% of the targets are 1s
# so around 54% of the targets are 0s
# when balancing the dataset, the two classes needn't each represent exactly 50% of the sample
# a 60/40 split will usually work equally well for a logistic regression
# but that is not true for other algorithms such as neural networks

targets.sum() / targets.shape[0]

0.45571428571428574
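
The share of 1s falls below 50% because the comparison is strict: observations exactly equal to the median are assigned to class 0. The sketch below illustrates this on a small made-up array (the `hours` values are hypothetical and not taken from the dataset).

```python
import numpy as np

# hypothetical absenteeism hours, chosen so that several values tie with the median
hours = np.array([0, 2, 3, 3, 3, 4, 8, 8])

median = np.median(hours)
toy_targets = np.where(hours > median, 1, 0)

# ties with the median go to class 0, so fewer than half of the targets are 1s
share_of_ones = toy_targets.mean()
print(median, toy_targets, share_of_ones)
```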

In [10]:
# is this a checkpoint?
# in other words, is this new variable a separate object, or does it point to the same piece of memory as data_preprocessed?

data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours'], axis = 1)

In [11]:
# use the reserved word is
# to double-check

# if the result is True, the two variables refer to the same object, i.e. the same piece of memory
# if the result is False, the two variables refer to different objects

data_with_targets is data_preprocessed

False
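
To see what the `is` check distinguishes, the following minimal sketch contrasts plain assignment, which only creates a second name for the same object, with `drop()`, which returns a new DataFrame by default. The toy DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# toy DataFrame, made up for illustration
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

alias = df                          # plain assignment: same object, no copy
dropped = df.drop(['b'], axis=1)    # drop() returns a new DataFrame by default

print(alias is df)        # same object
print(dropped is df)      # different object
print(list(df.columns))   # the original still has both columns
```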

In [12]:
# so data_with_targets is the checkpoint

data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,0


#### Select the inputs for the regression

In [13]:
data_with_targets.shape

(700, 15)

In [14]:
# use the pandas iloc indexer
# iloc selects by position in the DataFrame
# it takes two arguments
# the first refers to the row indices
# and the second to the column indices

# to select the inputs for the regression, we must select all rows and all columns except Excessive Absenteeism
# iloc excludes the ending index
# data_with_targets.iloc[:, :14] gives the same result
# data_with_targets.iloc[:, :-1] gives the same result as well

# so these are the inputs

data_with_targets.iloc[:, 0:14]

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,2,179,22,40,237.656,22,1,2,0
696,1,0,0,0,5,2,225,26,28,237.656,24,0,1,2
697,1,0,0,0,5,3,330,16,28,237.656,25,1,0,0
698,0,0,0,1,5,3,235,16,32,237.656,25,1,0,0


In [15]:
unscaled_inputs = data_with_targets.iloc[:, :-1]
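
As a quick check of the three equivalent `iloc` selections mentioned in the comments, here is a small sketch on a made-up DataFrame (the names `toy`, `x1`, `x2` and `target` are hypothetical).

```python
import pandas as pd

# toy DataFrame with two feature columns and a target in the last position
toy = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6], 'target': [0, 1, 0]})

by_range = toy.iloc[:, 0:2]      # columns 0 and 1; the end index 2 is excluded
by_stop = toy.iloc[:, :2]        # same selection with the start omitted
by_negative = toy.iloc[:, :-1]   # everything except the last column

print(by_range.equals(by_stop), by_range.equals(by_negative))
```

The `:-1` form has the advantage of not depending on the number of columns, as long as the target stays in the last position.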

#### Standardize the data

In [16]:
from sklearn.preprocessing import StandardScaler

In [17]:
# this is an empty StandardScaler object
# it holds no information yet
# it will be used to subtract the mean and divide by the standard deviation feature by feature

absenteeism_scaler = StandardScaler()

In [18]:
# this line calculates and stores the mean and the standard deviation of each feature in unscaled_inputs
# this is the scaling mechanism

absenteeism_scaler.fit(unscaled_inputs)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [19]:
# in order to apply the scaling, we must use transform
# when we get new data, we just call absenteeism_scaler.transform(new_data) to apply the same transformation
# this is the most common and useful way to transform new data when deploying the model

scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)

In [20]:
# all the input data has been standardized

scaled_inputs

array([[-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         0.88046927,  0.26848661],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.01928035, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.91902997, -0.58968976],
       ...,
       [ 1.73205081, -0.09298136, -0.31448545, ...,  2.23224237,
        -0.91902997, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ...,  2.23224237,
        -0.91902997, -0.58968976],
       [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
        -0.01928035,  0.26848661]])

In [21]:
# the shape shows that there are 700 observations and 14 features

scaled_inputs.shape

(700, 14)
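
To make the "subtract the mean and divide by the standard deviation" step concrete, the sketch below compares `StandardScaler` with a manual computation on made-up inputs (the array `toy_inputs` is hypothetical). It assumes the scaler's default settings, under which the population standard deviation is used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy inputs, made up for illustration: 4 observations, 2 features
toy_inputs = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 60.0]])

scaler = StandardScaler()
scaler.fit(toy_inputs)
scaled = scaler.transform(toy_inputs)

# manual equivalent: subtract the column mean, divide by the population std (ddof=0)
manual = (toy_inputs - toy_inputs.mean(axis=0)) / toy_inputs.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```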

#### Split data into train & test and shuffle

Overfitting occurs when the model learns to predict the data we've given it so well that, when applied in a real-life situation with new data, it fails miserably.

One way to deal with overfitting is to hide a small part of the dataset from the algorithm, so that we train the model on most of the data but not all of it.

After that, we use the small piece of data we set aside to test whether the model will do well in real life. So we split the dataset into train and test sets, which lets us assess the model's accuracy on data it has never seen before.

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
# the output is a list of 4 arrays

# array 1 is the training dataset with inputs
# array 2 is the test dataset with inputs
# array 3 is the training dataset with targets
# array 4 is the test dataset with targets

# next we will store the 4 arrays in variables

train_test_split(scaled_inputs, targets, shuffle = True)

[array([[-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.01928035,  1.12666297],
        [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.01928035,  1.12666297],
        [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.91902997, -0.58968976],
        ...,
        [ 1.73205081, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.01928035,  1.12666297],
        [ 1.73205081, -0.09298136, -0.31448545, ...,  2.23224237,
         -0.91902997, -0.58968976],
        [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.91902997, -0.58968976]]),
 array([[-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.91902997, -0.58968976],
        [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.01928035,  1.12666297],
        [-0.57735027, -0.09298136, -0.31448545, ..., -0.44798003,
         -0.01928035,  2.8430157 ],
        ...,
        [-0.57735027, -0.09298136, -0.31448545, ..., -

In [24]:
# train_test_split splits scaled_inputs and targets into matching pieces that we can use in the machine learning part
# by default, the training set has 525 observations and the test set has 175
# in other words, 75% of the observations are used for training
# and 25% are used for testing
# this 75-25 split is the default

# usually we opt for splits like 90-10 or 80-20, because we want to train on more data
# we don't want to set aside too much data for testing, because that leaves less data for training

# shuffling is needed in many situations
# train_test_split has a shuffle parameter
# shuffle is a boolean, so it is either True or False
# by default shuffle is set to True

# random_state takes integer values
# setting random_state makes the shuffle pseudo-random
# so the method always shuffles the observations in the same way

x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)

In [25]:
# from the shapes
# we can see that the training inputs contain 560 observations along 14 features
# the training targets are a vector of length 560
# the latter corresponds to the Excessive Absenteeism column
# so there are 560 observations, with 14 inputs and 1 target value per observation

# with the default 75-25 split there would be 525 training observations; the 80-20 split gives 560

print(x_train.shape, y_train.shape)

(560, 14) (560,)


In [26]:
# from the shapes
# we can see that the test set contains 140 observations along 14 features and 1 target variable

# with the default 75-25 split there would be 175 test observations; the 80-20 split gives 140

print(x_test.shape, y_test.shape)

(140, 14) (140,)
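
The effect of `random_state` can be checked directly. The sketch below, using made-up arrays (`toy_x` and `toy_y` are hypothetical), splits the same data twice with the same seed and confirms that both splits are identical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data, made up for illustration: 10 observations, 2 features
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# the same random_state produces identical shuffles
a_train, a_test, _, _ = train_test_split(toy_x, toy_y, train_size=0.8, random_state=20)
b_train, b_test, _, _ = train_test_split(toy_x, toy_y, train_size=0.8, random_state=20)

print(np.array_equal(a_train, b_train), np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```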


#### Logistic Regression

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [28]:
reg = LogisticRegression()

In [29]:
# these are the default parameters (since we didn't specify anything)

reg.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [30]:
# the model has an accuracy of around 78% on the training data
# in other words, based on the data we used, the model learned to classify around 78% of the observations correctly

reg.score(x_train, y_train)

0.7803571428571429

#### Manually check the accuracy

It is always good to fully understand what we are doing, so let's check the accuracy manually.

What does accuracy mean? The logistic model is trained on the training inputs and, based on them, produces outputs that try to be as close to the targets as possible.

An accuracy of 78% means that 78% of the model outputs match the targets. So to find the accuracy of a model manually, we should find the outputs and compare them with the targets.

In [31]:
# in order to find the model outputs
# use the sklearn predict() method

# this method finds the predicted outputs of the regression
# we choose to predict the outputs associated with the training inputs contained in x_train
# these are the predictions of the model

model_outputs = reg.predict(x_train)
model_outputs

array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [32]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [33]:
# compare the outputs with the targets
# where there is a match, the result is True
# otherwise it is False

model_outputs == y_train

array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True, False,  True,  True, False,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [34]:
# this number is the total number of correct predictions (True entries)

np.sum((model_outputs == y_train))

437

In [35]:
# if we divide the number of matches by the total number of elements
# we get the accuracy
# this result is exactly the same as the one from sklearn

np.sum((model_outputs == y_train)) / model_outputs.shape[0] 

0.7803571428571429
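
The manual calculation can be written in a few equivalent ways. The sketch below compares them on made-up predictions and targets (`toy_outputs` and `toy_targets` are hypothetical) and also uses `metrics.accuracy_score`, the scikit-learn function for the same quantity.

```python
import numpy as np
from sklearn import metrics

# toy predictions and targets, made up for illustration
toy_outputs = np.array([0, 1, 1, 0, 1])
toy_targets = np.array([0, 1, 0, 0, 1])

# number of matches divided by the number of elements
manual = np.sum(toy_outputs == toy_targets) / toy_outputs.shape[0]

# the mean of a boolean array is the share of True entries
as_mean = np.mean(toy_outputs == toy_targets)

# scikit-learn's helper takes the true labels first, then the predictions
from_sklearn = metrics.accuracy_score(toy_targets, toy_outputs)

print(manual, as_mean, from_sklearn)
```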

#### Creating a summary table, finding the coefficients and intercept

Regression, whether linear or non-linear, is about determining coefficients, or weights, which we apply to the inputs to obtain a final result.

To use this logistic regression model outside of Python, we must get our hands on the coefficients and the intercept. We also need them in order to interpret the model.

In [36]:
reg.intercept_

array([-0.20908973])

In [37]:
# we want to know which variables these coefficients refer to

reg.coef_

array([[ 2.07044024,  0.33037751,  1.56321412,  1.31103462,  0.02578678,
        -0.08632909,  0.7233164 , -0.06154754, -0.20633005, -0.02869987,
         0.32595295, -0.1615189 ,  0.38151211, -0.32141202]])

In [38]:
# we want to get the feature names from the column names of our inputs

# we standardized the pandas DataFrame unscaled_inputs using the StandardScaler
# the result was stored in scaled_inputs
# sklearn methods accept pandas DataFrames, but whenever we apply an sklearn function, the result is returned as an ndarray
# this happened with the intercept, the coefficients and, of course, the inputs
# so the next line raises an AttributeError, because ndarrays have no columns attribute

scaled_inputs.columns.values

AttributeError: 'numpy.ndarray' object has no attribute 'columns'

In [39]:
# unscaled_inputs is still a pandas DataFrame, so it has column names

unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Day of the Week', 'Transportation Expense', 'Distance to Work',
       'Age', 'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [40]:
feature_name = unscaled_inputs.columns.values

In [41]:
# create a neat DataFrame that will contain the intercept, the feature names and the corresponding coefficients
# this DataFrame is a summary table

# we need to transpose the coef_ array, because it has shape (1, 14): one row rather than one column

summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)

summary_table['Coefficient'] = np.transpose(reg.coef_)

summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.07044
1,Reason_2,0.330378
2,Reason_3,1.563214
3,Reason_4,1.311035
4,Month Value,0.025787
5,Day of the Week,-0.086329
6,Transportation Expense,0.723316
7,Distance to Work,-0.061548
8,Age,-0.20633
9,Daily Work Load Average,-0.0287


In [42]:
# shift every index up by 1
# so that index 0 is free for the intercept

# reg.intercept_ is an array with a single element
# so we take element 0 to get a float rather than the whole array

summary_table.index = summary_table.index + 1

summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

summary_table = summary_table.sort_index()

summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-0.20909
1,Reason_1,2.07044
2,Reason_2,0.330378
3,Reason_3,1.563214
4,Reason_4,1.311035
5,Month Value,0.025787
6,Day of the Week,-0.086329
7,Transportation Expense,0.723316
8,Distance to Work,-0.061548
9,Age,-0.20633


#### Interpreting the coefficients

The coefficients are also called weights, and the intercept is also called the bias. These notions are useful because the weights show how much we weigh a certain input: the closer a weight is to 0, the smaller the influence of its feature, and the further it is from 0, whether positive or negative, the bigger the influence of that feature.

Note that this is true for this model but not universally: it holds only for models where all variables are on the same scale, such as the one we just built.

There are coefficient values and standardized coefficient values. Standardized coefficients are the coefficients of a regression in which all variables have been standardized. Some other software packages report standardized coefficients because they allow a simple, easy-to-understand comparison between variables: since the features are standardized, they all have a variance of 1, that is, the same scale. Whenever the scale is the same, we can simply say that the feature whose weight is bigger in absolute value is more important.

For machine learning purposes, and prediction in general, we usually standardize the variables, as we did here.

Another notion we must emphasize is that whenever we are dealing with a logistic regression, the coefficients are expressed in terms of the so-called log odds. This is a consequence of the choice of model: a logistic regression is nothing but a linear function predicting log odds, and these log odds are later transformed into 0s and 1s.
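
The chain from linear function to log odds to 0s and 1s can be reproduced by hand. The following is a minimal sketch on synthetic data (`toy_x`, `toy_y` and `toy_reg` are made up for illustration and are not the notebook's variables); it assumes a binary target and scikit-learn's default settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic inputs and binary targets, made up for illustration
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(200, 3))
toy_y = (toy_x[:, 0] + 0.5 * toy_x[:, 1] + rng.normal(size=200) > 0).astype(int)

toy_reg = LogisticRegression()
toy_reg.fit(toy_x, toy_y)

# linear part: log odds = intercept + inputs . coefficients
log_odds = toy_reg.intercept_[0] + toy_x @ toy_reg.coef_[0]

# the sigmoid turns log odds into probabilities; a 0.5 threshold gives the classes
probabilities = 1 / (1 + np.exp(-log_odds))
classes = (probabilities > 0.5).astype(int)

print(np.allclose(log_odds, toy_reg.decision_function(toy_x)))
print(np.allclose(probabilities, toy_reg.predict_proba(toy_x)[:, 1]))
print(np.array_equal(classes, toy_reg.predict(toy_x)))
```

This is also what using the model outside of Python amounts to: only the intercept, the coefficients and the sigmoid are needed.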

In [43]:
# odds ratio is the correct term for what we get when we take the exponentials of the coefficients
# this adds a new column to the table
# holding the exponentials of the coefficients

summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-0.20909,0.811322
1,Reason_1,2.07044,7.928313
2,Reason_2,0.330378,1.391493
3,Reason_3,1.563214,4.774141
4,Reason_4,1.311035,3.71001
5,Month Value,0.025787,1.026122
6,Day of the Week,-0.086329,0.917292
7,Transportation Expense,0.723316,2.061258
8,Distance to Work,-0.061548,0.940308
9,Age,-0.20633,0.813565


In [44]:
# sort_values requires us to choose the column by which we want to sort the whole DataFrame
# for now, the logical choice is the new Odds_ratio column

# by default sort_values sorts in ascending order
# so the features with the largest odds ratios end up at the bottom

summary_table.sort_values('Odds_ratio')

Unnamed: 0,Feature name,Coefficient,Odds_ratio
14,Pet,-0.321412,0.725124
0,Intercept,-0.20909,0.811322
9,Age,-0.20633,0.813565
12,Education,-0.161519,0.85085
6,Day of the Week,-0.086329,0.917292
8,Distance to Work,-0.061548,0.940308
10,Daily Work Load Average,-0.0287,0.971708
5,Month Value,0.025787,1.026122
11,Body Mass Index,0.325953,1.38535
2,Reason_2,0.330378,1.391493


In [45]:
# this time the rows are sorted in descending order of odds ratio
# so the features that increase the odds the most are at the top and those that decrease them are at the bottom

summary_table.sort_values('Odds_ratio', ascending = False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
1,Reason_1,2.07044,7.928313
3,Reason_3,1.563214,4.774141
4,Reason_4,1.311035,3.71001
7,Transportation Expense,0.723316,2.061258
13,Children,0.381512,1.464497
2,Reason_2,0.330378,1.391493
11,Body Mass Index,0.325953,1.38535
5,Month Value,0.025787,1.026122
10,Daily Work Load Average,-0.0287,0.971708
8,Distance to Work,-0.061548,0.940308


If a coefficient is around 0, or equivalently its odds ratio is close to 1, the corresponding feature is not particularly important.

The reasoning in terms of weights is that a weight (coefficient) of 0 implies that, no matter the feature value, we multiply it by 0, so its contribution to the model is 0.

The meaning in terms of odds ratio is the following: for a one-unit change in the standardized feature, the odds are multiplied by the odds ratio. If the odds ratio is 1, the odds don't change at all.

Example:

Odds                  |Odds ratio             |New Odds(for a unit change)
---                   |---                    |---
5:1                   |2                      |10:1
5:1                   |0.2                    |1:1
5:1                   |1                      |5:1
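
The rows of the table follow from a single multiplication, new odds = old odds × odds ratio. A short sketch of the arithmetic (the variable names are illustrative):

```python
# odds of 5:1, expressed as a single number
old_odds = 5.0

# the three odds ratios from the table
odds_ratios = [2, 0.2, 1]

# new odds after a one-unit change in the feature
new_odds = [old_odds * ratio for ratio in odds_ratios]

for ratio, odds in zip(odds_ratios, new_odds):
    print(f"odds ratio {ratio}: new odds {odds:g}:1")
```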

The daily work load average weight is -0.03, so almost 0, and its odds ratio is 0.97, so almost 1. This feature is almost useless for the model: with or without it, the result would likely be the same.

Day of the week and distance to work are in the same situation as the daily work load average.

Note that these features are not necessarily useless. A more accurate statement is that, given all the features, they seem to be the ones that make little difference. We may consider dropping them later on.

The four reasons for absence are the most important predictors. When we were creating the dummies, the one we dropped was reason 0, which represented a situation where a person was absent but no particular reason was given. Therefore the base model is the case where there is no reason, aka reason 0. From the coefficients, it seems that whenever a person has stated any reason, we have a much higher chance of getting excessive absence. A good question would be how much bigger of a chance.
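
One way to approach that question is to turn odds into probabilities. The sketch below is an illustration under stated assumptions, not a definitive answer: it uses the intercept and the Reason_1 coefficient printed earlier, assumes all other standardized features sit at 0 (their sample means), and considers a one-unit increase in the standardized Reason_1. Because the dummy variables were standardized too, one unit here is one standard deviation, not a switch from 0 to 1. The helper `odds_to_probability` is a hypothetical name.

```python
import numpy as np

# values copied from the reg.intercept_ and reg.coef_ outputs
intercept = -0.20908973
reason_1_coefficient = 2.07044024

def odds_to_probability(odds):
    # odds of p:(1-p) correspond to probability odds / (1 + odds)
    return odds / (1 + odds)

# assumption: all standardized features are 0, i.e. at their sample means
baseline_odds = np.exp(intercept)

# assumption: standardized Reason_1 goes up by one unit, everything else unchanged
raised_odds = baseline_odds * np.exp(reason_1_coefficient)

baseline_probability = odds_to_probability(baseline_odds)
raised_probability = odds_to_probability(raised_odds)
print(round(baseline_probability, 3), round(raised_probability, 3))
```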