# Creating a Logistic Regression to predict absenteeism

### Import relevant libraries

In [1]:
import pandas as pd
import numpy as np

### Load the data

In [2]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')
data_preprocessed.head()

Unnamed: 0,G1,G2,G3,G4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


In [3]:
## The model will give us a fair indication of which variables are important for the analysis

## More Preprocessing

### Create Targets

We will use logistic regression because this is a classification problem: we will assign each person to one of two classes.
First we must decide what those classes are, and then preprocess the data to reflect that decision.

Other methodologies, such as random forests or neural networks, could also be used.


### The classes are:
    Excessively absent
    Moderately absent
We will use the median of 'Absenteeism Time in Hours' as the cut-off line.

In [4]:
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

#### Everyone absent for 3.0 hours or less is considered moderately absent; everyone absent for more than 3.0 hours is considered excessively absent
#### So an observation with 3.0 hours of absence or less is assigned 0, otherwise it is assigned 1
#### In supervised machine learning we call these 0s and 1s the targets

#### Our task now is to predict whether we will obtain 0 or 1
#### We will create a variable that measures whether a person has been absent for more than 3 hours

We will use np.where(condition, value if True, value if False). It checks whether a condition is satisfied and assigns a value accordingly, like the Excel 'IF' function.
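
As a small illustration of how np.where labels observations around a median cut-off, here is a sketch on made-up numbers (the `hours` array is hypothetical and not taken from the dataset):

```python
import numpy as np

# Hypothetical absenteeism hours, for illustration only
hours = np.array([4, 0, 2, 8, 3])

# Cut-off: the median of the sample
cutoff = np.median(hours)

# 1 where the hours exceed the cut-off, 0 otherwise
labels = np.where(hours > cutoff, 1, 0)
print(cutoff, labels)
```

Note that an observation exactly equal to the median is labelled 0, because the condition uses a strict "greater than".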

In [5]:
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [6]:
data_preprocessed['Excessive Absenteeism'] = targets

In [7]:
data_preprocessed.head()

Unnamed: 0,G1,G2,G3,G4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


## A comment on targets

Using the median as a cut-off line is a simple and robust choice: it implicitly balances the dataset, since roughly half of the observations fall on each side.
This helps prevent the model from learning to output only one of the two classes. To check, let's divide the number of targets equal to 1 by the total number of targets.

In [8]:
targets.sum() / targets.shape[0]

0.45571428571428574

In [9]:
## so around 46% of our targets are 1
## when balancing the dataset, the 2 classes don't need to represent exactly 50% of the sample each
## for logistic regression a 60% to 40% split usually works about as well
## but that's not necessarily true for other algorithms, like neural networks
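
A quick way to look at the balance of both classes at once is to count them. This is a minimal sketch on a hypothetical `toy_targets` array, not the notebook's data:

```python
import numpy as np

# Hypothetical binary targets, for illustration only
toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# Count observations per class (index 0 -> class 0, index 1 -> class 1)
counts = np.bincount(toy_targets)

# Share of each class in the sample
shares = counts / toy_targets.shape[0]
print(counts, shares)
```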

In [10]:
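## now let's drop 'Absenteeism Time in Hours' (it is replaced by the targets)
## the other three columns are dropped as a result of the backward elimination step described later in this notebook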
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Daily Work Load Average', 'Distance to Work', 'Day of the week'], axis = 1)
data_with_targets

Unnamed: 0,G1,G2,G3,G4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


Now let's check whether this new variable points to the same object in memory as data_preprocessed.
We use 'is' for this: it returns True if both names refer to the same object and False otherwise.

True = two variables refer to same object

False = two variables refer to different objects
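
The difference between a second name for the same object and a genuinely new object can be sketched on a small, hypothetical DataFrame (`df`, `alias` and `dropped` are illustrative names):

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

alias = df                        # a second name for the same object
dropped = df.drop(['b'], axis=1)  # drop() returns a new DataFrame by default

print(alias is df)    # same object
print(dropped is df)  # different object
```

Because drop() returns a new object, the original frame keeps all of its columns.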

In [11]:
data_with_targets is data_preprocessed

False

In [12]:
data_with_targets.head()

Unnamed: 0,G1,G2,G3,G4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


#### data_with_targets is our checkpoint now

## Selecting Inputs for the Regression

In [13]:
data_with_targets.shape

(700, 12)

In [14]:
## DataFrame.iloc[row indices, column indices] selects (slices) data by position
## note: the frame has only 12 columns, so the slice 0:14 is clipped and returns every column,
## including 'Excessive Absenteeism'
data_with_targets.iloc[:,0:14]
## to select all columns except 'Excessive Absenteeism' (the last one) use data_with_targets.iloc[:,:-1]

Unnamed: 0,G1,G2,G3,G4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0
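
The slicing behaviour can be checked on a small, hypothetical frame (`toy` is an illustrative name): a slice end beyond the last column is simply clipped, while `:-1` always leaves out the last column.

```python
import pandas as pd

# Hypothetical frame with two inputs and a target in the last column
toy = pd.DataFrame({'x1': [1, 2], 'x2': [3, 4], 'target': [0, 1]})

# A slice end beyond the last column is clipped, so the target is still included
all_columns = toy.iloc[:, 0:14]

# Negative slicing drops the last column regardless of how many columns there are
inputs_only = toy.iloc[:, :-1]

print(list(all_columns.columns))
print(list(inputs_only.columns))
```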


In [15]:
## select the inputs
unscaled_data = data_with_targets.iloc[:,:-1]

In [16]:
# scale the data

## Standardize the data

In [17]:
# from sklearn.preprocessing import StandardScaler

# # first we must declare standard scaler object
# absenteeism_scaler = StandardScaler()

In [18]:
## absenteeism_scaler will be used to subtract the mean and divide by standard deviation

In [19]:
# the new code goes here; the explanation is in the 'Interpreting the coefficients' section below
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

In [20]:
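class CustomScaler(BaseEstimator, TransformerMixin):
    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # store the arguments so that BaseEstimator can report them
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # recent scikit-learn versions accept these arguments by keyword only
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        # learn the mean and standard deviation of the chosen columns only
        self.scaler.fit(X[self.columns], y)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self  # scikit-learn expects fit() to return the estimator

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # keep the original index so that the concat below aligns the rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # put the columns back in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

## this is a custom scaler class based on StandardScaler from sklearn
## when we declare the scaler object there is an extra argument: the columns to scale
## so our custom scaler will not standardize all inputs, only the ones we choose
## in this way we keep the dummies untouched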
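
What "standardize only the chosen columns" means can be sketched without the custom class. The example below uses a hypothetical frame (`toy`) and the plain StandardScaler, and compares it with the manual formula of subtracting the mean and dividing by the standard deviation:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one dummy column and one numeric column
toy = pd.DataFrame({'dummy': [0, 1, 0, 1], 'age': [30.0, 40.0, 50.0, 60.0]})
cols = ['age']  # the columns we choose to standardize

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
manual = toy.copy()
manual[cols] = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)

# The same result from StandardScaler applied to the chosen columns only
with_sklearn = toy.copy()
with_sklearn[cols] = StandardScaler().fit_transform(toy[cols])

print(manual)
```

The dummy column is left as zeros and ones in both versions, which is what keeps its coefficient interpretable later.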

In [21]:
#new
unscaled_data.columns.values

array(['G1', 'G2', 'G3', 'G4', 'Month Value', 'Transportation Expense',
       'Age', 'Body Mass Index', 'Education', 'Children', 'Pets'],
      dtype=object)

In [22]:
# #new
# columns_to_scale = ['Month Value', 'Day of the week',
#        'Transportation Expense', 'Distance to Work', 'Age',
#        'Daily Work Load Average', 'Body Mass Index',
#        'Children', 'Pets']
# a list comprehension is a syntactic construct that creates a list from an existing iterable using loops, conditionals, etc.

columns_to_omit = ['G1','G2','G3','G4','Education']

In [23]:
columns_to_scale = [x for x in unscaled_data.columns.values if x not in columns_to_omit]
# the list comprehension loops over all column names and keeps those that are not in
# columns_to_omit

In [24]:
#new
absenteeism_scaler = CustomScaler(columns_to_scale)



In [25]:
## next we fit our input data
absenteeism_scaler.fit(unscaled_data) # this line will calculate and store the mean and the standard deviation

In [26]:
## whenever you get new data you will know that the standardization information is contained in absenteeism_scaler

## we use another method to apply our scaling mechanism called transform
scaled_data = absenteeism_scaler.transform(unscaled_data)

In [27]:
scaled_data

Unnamed: 0,G1,G2,G3,G4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.182726,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.182726,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.182726,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.182726,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.388293,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.388293,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.388293,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.388293,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


In [28]:
scaled_data.shape

(700, 11)

In [29]:
## we are in a good shape to proceed with the final preprocessing step.

## Split the data into train and test, and shuffle

#### Import Relevant module

In [30]:
from sklearn.model_selection import train_test_split

### Split

In [31]:
train_test_split(scaled_data, targets)

[     G1  G2  G3  G4  Month Value  Transportation Expense       Age  \
 591   0   0   0   1    -1.244823               -0.654143 -1.006686   
 587   1   0   0   0    -1.244823                0.387122  1.660180   
 135   0   0   0   1    -1.530333                0.040034 -1.320435   
 233   1   0   0   0    -0.102784               -0.654143  0.248310   
 435   0   0   0   1    -0.388293                0.040034 -1.320435   
 ..   ..  ..  ..  ..          ...                     ...       ...   
 606   1   0   0   0    -1.244823               -0.654143  0.248310   
 59    0   0   0   1     0.753746                0.387122  1.660180   
 106   0   0   0   1     1.610276                0.040034 -1.320435   
 268   1   0   0   0     0.468236                2.092381 -1.320435   
 453   0   0   0   1    -0.102784                0.040034 -1.320435   
 
      Body Mass Index  Education  Children      Pets  
 591        -1.819793          1 -0.919030 -0.589690  
 587         1.237836          0  0.

In [32]:
## train_test_split returns 4 arrays, in this order: [train inputs, test inputs, train targets, test targets]
x_train, x_test, y_train, y_test = train_test_split(scaled_data, targets, train_size = 0.8, random_state = 20)
## a fixed random_state makes the shuffle reproducible: the observations are shuffled in the same way on every run
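
The effect of random_state can be sketched on hypothetical arrays (`X` and `y` below are made up): two calls with the same value produce the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical inputs and targets, for illustration only
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state shuffle the rows identically
a_train, a_test, ya_train, ya_test = train_test_split(X, y, train_size=0.8, random_state=20)
b_train, b_test, yb_train, yb_test = train_test_split(X, y, train_size=0.8, random_state=20)

print(a_train.shape, a_test.shape)
print(np.array_equal(ya_test, yb_test))
```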

In [33]:
print(x_train.shape, y_train.shape)

(560, 11) (560,)


In [34]:
print(x_test.shape, y_test.shape)

(140, 11) (140,)


## Logistic Regression with sklearn

In [35]:
## statsmodels could not provide an answer here, because training a machine learning
## model involves many mathematical (optimization) issues in the background,
## and its routines are not always numerically stable for more complicated models

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

### Train the model

In [37]:
## we will declare new variable 'reg' which will be a LogisticRegression object
reg = LogisticRegression()

In [38]:
## next we fit the model
reg.fit(x_train, y_train)

LogisticRegression()

In [39]:
## we will evaluate our model
reg.score(x_train, y_train)

0.7732142857142857

In [40]:
## our model learned to classify about 77% of the training observations correctly
## to understand this result we will check the accuracy manually,
## because it's always good to have a full understanding of what we are doing
## we will be using it later on

### Manually Check The Accuracy

In [41]:
## sklearn.linear_model.LogisticRegression.predict(inputs)
## predict class labels (logistic regression outputs) for given input samples

In [42]:
model_outputs = reg.predict(x_train)

In [43]:
model_outputs

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [44]:
y_train

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [45]:
model_outputs == y_train

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [46]:
np.sum(model_outputs == y_train)

433

In [47]:
## 433 predictions in total are correct

In [48]:
## next we divide the number of matches by the total number of elements to get the accuracy
model_outputs.shape[0]

560

In [49]:
np.sum(model_outputs == y_train) / model_outputs.shape[0]

0.7732142857142857
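
The manual check is the definition of accuracy. As a sketch on hypothetical labels (`y_true` and `y_pred` are made up), the share of matching elements agrees with sklearn's accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# The mean of a boolean array is the share of True values, i.e. the accuracy
manual_accuracy = np.mean(y_pred == y_true)

print(manual_accuracy, accuracy_score(y_true, y_pred))
```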

### Find intercept and coefficients

In [50]:
## Next to find the intercept
reg.intercept_

array([-1.6474549])

In [51]:
reg.coef_

array([[ 2.80019733,  0.95188356,  3.11555338,  0.83900082,  0.1589299 ,
         0.60528415, -0.16989096,  0.27981088, -0.21053312,  0.34826214,
        -0.27739602]])

In [52]:
## to check which variables those coefficients refer to, we look at the column names
## with the plain StandardScaler, scaled_data was a numpy.ndarray, which has no attribute 'columns',
## so we use unscaled_data: it is a pandas DataFrame with the same columns in the same order


## from the log(odds) equation ---> -1.65 + 2.80*G1 + 0.95*G2 + 3.12*G3 + ....


unscaled_data.columns.values

array(['G1', 'G2', 'G3', 'G4', 'Month Value', 'Transportation Expense',
       'Age', 'Body Mass Index', 'Education', 'Children', 'Pets'],
      dtype=object)

In [53]:
feature_name = unscaled_data.columns.values

In [54]:
## we will create a data frame that contains intercept, feature names and coefficients
summary_table = pd.DataFrame(columns = ['Feature name'], data = feature_name)
summary_table['coefficient'] = np.transpose(reg.coef_) ## we must transpose because reg.coef_ is a single row of shape (1, n_features) and we need a column
summary_table

Unnamed: 0,Feature name,coefficient
0,G1,2.800197
1,G2,0.951884
2,G3,3.115553
3,G4,0.839001
4,Month Value,0.15893
5,Transportation Expense,0.605284
6,Age,-0.169891
7,Body Mass Index,0.279811
8,Education,-0.210533
9,Children,0.348262


In [55]:
## most methods append new rows at the end of a dataframe, but we want the intercept in the first row


## so we use the following code
summary_table.index = summary_table.index + 1 ## shift all indices up by 1


## index 0 is now free, so we fill it with the intercept
summary_table.loc[0] = ['Intercept',reg.intercept_[0]]

## finally we sort dataframe by index
summary_table = summary_table.sort_index()

summary_table

Unnamed: 0,Feature name,coefficient
0,Intercept,-1.647455
1,G1,2.800197
2,G2,0.951884
3,G3,3.115553
4,G4,0.839001
5,Month Value,0.15893
6,Transportation Expense,0.605284
7,Age,-0.169891
8,Body Mass Index,0.279811
9,Education,-0.210533


In [56]:
## please note that the intercept is now in the first row (index 0), followed by the features at 1, 2, 3, ....
## in machine learning terminology: bias = intercept
## weight = coefficient

## Interpreting the coefficients

In [57]:
summary_table['odds_ratio'] = np.exp(summary_table.coefficient)
summary_table

Unnamed: 0,Feature name,coefficient,odds_ratio
0,Intercept,-1.647455,0.192539
1,G1,2.800197,16.447892
2,G2,0.951884,2.590585
3,G3,3.115553,22.545903
4,G4,0.839001,2.314054
5,Month Value,0.15893,1.172256
6,Transportation Expense,0.605284,1.831773
7,Age,-0.169891,0.843757
8,Body Mass Index,0.279811,1.32288
9,Education,-0.210533,0.810152


In [58]:
summary_table.sort_values('odds_ratio', ascending=False)

Unnamed: 0,Feature name,coefficient,odds_ratio
3,G3,3.115553,22.545903
1,G1,2.800197,16.447892
2,G2,0.951884,2.590585
4,G4,0.839001,2.314054
6,Transportation Expense,0.605284,1.831773
10,Children,0.348262,1.416604
8,Body Mass Index,0.279811,1.32288
5,Month Value,0.15893,1.172256
7,Age,-0.169891,0.843757
9,Education,-0.210533,0.810152


From Logistic Regression
log(odds) = intercept + coefficient1 * G1 + coefficient2 * G2 + ...

Standardized coefficients are the regression coefficients obtained when the variables have been standardized.
The further a coefficient is from zero (whether positive or negative), the more important the feature is.
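
To see how the log(odds) equation connects to the probabilities the model reports, here is a minimal sketch on randomly generated, hypothetical data (`X`, `y` and `model` are illustrative and unrelated to the absenteeism dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, randomly generated data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# log(odds) = intercept + coefficient_1 * x_1 + coefficient_2 * x_2 + ...
log_odds = model.intercept_[0] + X @ model.coef_[0]

# The sigmoid maps log-odds to the probability of class 1
probability = 1 / (1 + np.exp(-log_odds))

print(np.allclose(probability, model.predict_proba(X)[:, 1]))
```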


#### The logistic regression equation with our values is in cell In [52]

odds_ratio is the correct term for what we get after taking the exponential of a coefficient

A feature is not particularly important:-
      - If its coefficient is around 0
      - If its odds ratio is around 1

A weight (coefficient) of 0 implies that, no matter the feature value, we multiply it by 0 (in the model)


The meaning in terms of odds ratios is the following:-

For a unit change in the standardized feature, the odds are multiplied by the odds ratio (1 = no change)

For example, if the odds are 5:1 and the odds ratio is 2, a one-unit change moves the odds from 5:1 to 10:1

if the odds ratio is 0.2, the new odds will be 1:1

if the odds ratio is 1, the odds stay at 5:1
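
The arithmetic of these examples can be sketched in a few lines. The helper names `updated_odds` and `odds_to_probability` are hypothetical and only serve the illustration:

```python
# Starting odds of 5:1, expressed as a single number
odds = 5 / 1

def updated_odds(odds, odds_ratio):
    """Odds after a one-unit increase in a feature with the given odds ratio."""
    return odds * odds_ratio

def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)

for ratio in (2, 0.2, 1):
    new_odds = updated_odds(odds, ratio)
    print(ratio, new_odds, round(odds_to_probability(new_odds), 3))
```

Converting to probabilities shows why odds ratios are not "times more likely" in the everyday sense: doubling odds of 5:1 to 10:1 moves the probability only from about 0.83 to about 0.91.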

In an earlier run that still included this feature, the Daily Work Load Average coefficient was -0.028 (nearly zero) and its odds_ratio was 0.97 (nearly 1)

so this feature is almost useless for our model and barely affects the result

the same holds for Day of the week and Distance to Work

of all the features, these are the ones that seem to make no difference; we kept them at first and dropped them later on

G0 represents the situation where a person was absent but gave no particular reason

Therefore the base model is the case where no reason is given

So it seems that whenever a person states any reason, the odds of excessive absence are much higher

Next we will explore how much higher.

There was a big problem in the first version of this notebook: when we standardized the inputs, we also standardized the dummies. That is bad, because standardized dummies lose their interpretability. Had we left them as zeros and ones, we could have said that for a unit change it is 7.2 times more likely that the person will be excessively absent.

For a dummy variable, a unit change means going from disregarding this dummy to taking only this dummy into account.

So for Reason 1 we could have said that a person is 7.93 times more likely to be excessively absent compared to no reason given.

However, we had standardized the reasons, which made a unit change completely uninterpretable. The predictive power of that model was still valid and it was a good classifier, but we could not tell how the different reasons compare.

So we correct the code: we go back to 'Standardize the data' at cell In [17], turn the previous code into a comment, and put the new code in its place.

In [59]:
# we lose some accuracy, from 80% to 77% in reg.score(x_train, y_train), but we gain interpretability

In [60]:
summary_table.sort_values('odds_ratio', ascending=False)

Unnamed: 0,Feature name,coefficient,odds_ratio
3,G3,3.115553,22.545903
1,G1,2.800197,16.447892
2,G2,0.951884,2.590585
4,G4,0.839001,2.314054
6,Transportation Expense,0.605284,1.831773
10,Children,0.348262,1.416604
8,Body Mass Index,0.279811,1.32288
5,Month Value,0.15893,1.172256
7,Age,-0.169891,0.843757
9,Education,-0.210533,0.810152


Looking at the coefficients table:

The strongest drivers of absence are the reasons G1 to G4 (G1 = Poisoning, G3 = various disease, G2 = pregnancy, G4 = Light disease). Our base model is G0 = no reason, so comparing G1 to G4 against G0 we find that:-
        For G3 the odds of excessive absence are about 22.5 times those of G0
        For G1 the odds of excessive absence are about 16.4 times those of G0
        For G2 the odds of excessive absence are about  2.6 times those of G0
        For G4 the odds of excessive absence are about  2.3 times those of G0

Transportation Expense is the most important non-dummy feature in the model, but it is also one of our standardized variables, so it is not directly interpretable. Its odds ratio of 1.83 implies that for one standardized unit (one standard deviation) increase in transportation expense, the odds of excessive absence are close to twice as high.

PS: models trained on standardized inputs often reach higher accuracy, because the optimization algorithms work better this way.

ML engineers prefer models with higher accuracy, so they normally go for standardization.

Econometricians and statisticians prefer less accurate but more interpretable models, because they care about the underlying reasons behind different phenomena.

Data scientists may be in either position. Sometimes they need higher accuracy; other times they must find the main drivers of a problem...

The odds ratio for Pets is about 0.76 (the exponential of its coefficient, -0.277), so for each additional standardized unit of Pets the odds are 1 - 0.76 = 24% lower. A possible explanation: someone who has several pets is probably not taking care of them on their own.

The intercept is used to get more accurate predictions, but no specific meaning is attached to it.

That's why in machine learning we say that it calibrates the model:

without an intercept, each prediction would be off the mark by precisely that value.

## Backward Elimination


Since the weights of Daily Work Load Average, Distance to Work and Day of the week are almost 0, their impact is negligible, so we apply backward elimination.

The idea is that we can simplify our model by removing all features that contribute close to nothing to it.

When we have p-values, we get rid of all features whose coefficients have p-values > 0.05. sklearn does not report p-values, so we cannot use that rule here. The reasoning of the engineers who created the package is that if a weight is small enough, it won't make a difference anyway...

If we remove these features, the coefficients of the remaining ones should change very little.

So we go back to cell In [10] and add these columns to the drop list:

        Daily Work Load Average,
        Distance to Work,
        Day of the week.

In [61]:
# Checking the regression accuracy again, we see only a slight difference
# this shows that the three variables we dropped were of little use: with or without them we obtain practically the same result
# Either way, a simpler model is always preferable

## Testing the model

In [62]:
## first we check the accuracy on the test inputs and targets: x_test, y_test
reg.score(x_test, y_test)

0.75

In [63]:
# so on data it has never seen, our model classifies 75% of the observations correctly
# The test accuracy is usually lower than the train accuracy
# If we get a higher number, we most likely either got lucky or made a mistake
# If the test accuracy is dramatically lower than the train accuracy, something like 10 or 20 percentage points lower,
# it means that our model overfit: it learned the train data very well but is prone to fail in real life

# But we are in neither case

# now we get the output using the predict_proba method from sklearn (the predict method would return the classes directly)
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.71340413, 0.28659587],
       [0.58724228, 0.41275772],
       [0.44020821, 0.55979179],
       [0.78159464, 0.21840536],
       [0.08410854, 0.91589146],
       [0.33487603, 0.66512397],
       [0.29984576, 0.70015424],
       [0.13103971, 0.86896029],
       [0.78625404, 0.21374596],
       [0.74903632, 0.25096368],
       [0.49397598, 0.50602402],
       [0.22484913, 0.77515087],
       [0.07129151, 0.92870849],
       [0.73178133, 0.26821867],
       [0.30934135, 0.69065865],
       [0.5471671 , 0.4528329 ],
       [0.55052275, 0.44947725],
       [0.5392707 , 0.4607293 ],
       [0.40201117, 0.59798883],
       [0.05361575, 0.94638425],
       [0.7003009 , 0.2996991 ],
       [0.78159464, 0.21840536],
       [0.42037128, 0.57962872],
       [0.42037128, 0.57962872],
       [0.24783565, 0.75216435],
       [0.74566259, 0.25433741],
       [0.51017274, 0.48982726],
       [0.85690195, 0.14309805],
       [0.20349733, 0.79650267],
       [0.78159464, 0.21840536],
       [0.

In [64]:
predicted_proba.shape

(140, 2)

In [65]:
## the first column shows the probability the model assigns to class 0 and the second the probability of class 1
## we slice out the 2nd column

predicted_proba[:,1]

array([0.28659587, 0.41275772, 0.55979179, 0.21840536, 0.91589146,
       0.66512397, 0.70015424, 0.86896029, 0.21374596, 0.25096368,
       0.50602402, 0.77515087, 0.92870849, 0.26821867, 0.69065865,
       0.4528329 , 0.44947725, 0.4607293 , 0.59798883, 0.94638425,
       0.2996991 , 0.21840536, 0.57962872, 0.57962872, 0.75216435,
       0.25433741, 0.48982726, 0.14309805, 0.79650267, 0.21840536,
       0.36956558, 0.67906035, 0.68502567, 0.52868083, 0.21840536,
       0.53506551, 0.22147081, 0.73692105, 0.40498044, 0.60505988,
       0.21075848, 0.45224466, 0.23751292, 0.39833498, 0.82755447,
       0.56797575, 0.69113325, 0.28659587, 0.21935267, 0.2033097 ,
       0.57628256, 0.3294664 , 0.66512397, 0.26949499, 0.83321968,
       0.43491525, 0.88374612, 0.23127072, 0.33415858, 0.34432939,
       0.69909345, 0.65494263, 0.29244941, 0.79200758, 0.20750276,
       0.26840558, 0.08708566, 0.22147081, 0.73245417, 0.30530219,
       0.22147081, 0.29014408, 0.90438191, 0.46061297, 0.60174

In [66]:
## In reality, logistic regression models calculate these probabilities in the background
## If the probability is:-
    ## below 0.5, it assigns a 0
    ## above 0.5, it assigns a 1
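
The 0.5 rule can be checked by applying the cut-off by hand and comparing with predict. This is a sketch on randomly generated, hypothetical data (`X`, `y` and `model` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, randomly generated data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Probability of class 1 for each observation
proba_one = model.predict_proba(X)[:, 1]

# Applying the 0.5 cut-off by hand
manual_labels = (proba_one > 0.5).astype(int)

print(np.array_equal(manual_labels, model.predict(X)))
```

Keeping the probabilities rather than only the classes makes it possible to choose a different cut-off later if the application calls for it.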

In [67]:
## Next steps:-
    ## 1- Save a model
    ## 2- Create a module
    ## 3- Get new data, classify it, pass through SQL, and analyze it in Tableau

### Save the model

In [68]:
## saving a model is the process of creating a file that contains all the information about the trained machine learning model
## the file will contain:-
    # a logistic regression object that has
        # coef_ = ...
        # intercept_ = ...
## random_state = 20 was an argument of train_test_split, so it is not part of the saved model

In [69]:
# we use the pickle module: a Python module that converts a Python object into a byte stream
# we will save the variable into a file that can be loaded in a new notebook, so that the trained model can be used there

In [70]:
import pickle

In [71]:
with open('model', 'wb') as file:
    pickle.dump(reg, file)
# 'model' is the file's name
# 'wb' stands for 'write bytes'; we will use 'rb' ('read bytes') when we unpickle
# the dump method saves the object
# pickle.dump(object to be dumped, file to write to)
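
A full save-and-load round trip can be sketched as follows. The toy model and the file name `toy_model` are hypothetical, and a temporary folder is used so that no real file is touched:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy model, for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X, y)

# Write to a temporary folder so that no real file is overwritten
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_model')
    with open(path, 'wb') as file:   # 'wb' = write bytes
        pickle.dump(toy_model, file)
    with open(path, 'rb') as file:   # 'rb' = read bytes
        restored = pickle.load(file)

print(np.allclose(restored.coef_, toy_model.coef_))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.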

In [72]:
## we must save absenteeism_scaler too
## we used it to standardize all numerical variables
# in this way we separate the model from the training data

# the information in absenteeism_scaler is needed to preprocess any new data using the same rules as the ones
# applied to the training data

with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)

In [73]:
## The second step of deployment is creating a mechanism that loads the saved model and makes predictions
## we will store the code in a module, which allows us to reuse it without trouble
## we can then use the module's methods the same way we use numpy, sklearn and pandas methods
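
One possible shape for such a module is sketched below. The class name `SavedClassifier` and its methods are hypothetical, and the demo uses a plain StandardScaler on made-up data written to a temporary folder rather than the notebook's own files:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


class SavedClassifier:
    """Hypothetical wrapper: loads a pickled scaler and model, then predicts."""

    def __init__(self, model_file, scaler_file):
        with open(model_file, 'rb') as file:
            self.model = pickle.load(file)
        with open(scaler_file, 'rb') as file:
            self.scaler = pickle.load(file)

    def predict(self, raw_inputs):
        # Apply the stored scaling rules before classifying
        return self.model.predict(self.scaler.transform(raw_inputs))


# Demo on made-up data, written to a temporary folder
X = np.array([[10.0], [20.0], [30.0], [40.0]])
y = np.array([0, 0, 1, 1])
scaler = StandardScaler().fit(X)
model = LogisticRegression().fit(scaler.transform(X), y)

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, 'model')
    scaler_path = os.path.join(folder, 'scaler')
    for obj, path in ((model, model_path), (scaler, scaler_path)):
        with open(path, 'wb') as file:
            pickle.dump(obj, file)
    loaded = SavedClassifier(model_path, scaler_path)
    predictions = loaded.predict(X)

print(predictions)
```

With the notebook's own files, the CustomScaler class definition would also have to be available wherever the scaler is unpickled, since pickle stores a reference to the class rather than its code.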