## Creating a logistic regression to predict absenteeism


Import the relevant libraries

In [143]:
import pandas as pd
import numpy as np

#### Load the data with the pandas method .read_csv()

In [144]:
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [145]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Create Targets

To process the data further, we need to create two classes of people: those who were moderately absent and those who were excessively absent.
We will use a simple method: calculate the median of 'Absenteeism Time in Hours' and use it as the cut-off line. Everything at or below the median is considered moderate, and everything above it is considered excessive. .median() is a pandas method.

In [146]:
data_preprocessed ['Absenteeism Time in Hours'].median()

3.0

##### What are the classes?

- Moderately absent (<= 3 hours)
- Excessively absent (>= 4 hours)

If an observation is absent for 3 hours or less, we assign it a 0.
If an observation is absent for more than 3 hours, we assign it a 1.

In supervised learning, we call these 0s and 1s targets, because these are the values the model aims to predict.
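The cut-off rule can be tried on a small toy series before applying it to the real column. This is only a sketch: the hours below are made up and the names hours, cutoff and toy_targets are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical absence times in hours (not taken from the dataset)
hours = pd.Series([0, 2, 3, 4, 8, 1, 3, 16])

cutoff = hours.median()                        # middle value of the sorted hours
toy_targets = np.where(hours > cutoff, 1, 0)   # 1 only when strictly above the median

print(cutoff)
print(toy_targets)
```

Values equal to the median fall into class 0, because the comparison is strict.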

In [147]:
targets=np.where(data_preprocessed['Absenteeism Time in Hours']>
                 data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

# this line checks whether a person was absent for more than 3 hours (the median)
# we use the numpy function where, which takes 3 arguments
# syntax: np.where(condition, value if True, value if False) - checks the condition and assigns a value accordingly
# the condition is: the value in the column 'Absenteeism Time in Hours' is greater than the median
# the result is a numpy array, stored in a variable called targets

In [148]:
targets

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,

In [149]:
data_preprocessed['Excessive Absenteeism'] = targets
# this line adds the targets to the DataFrame (data_preprocessed)
# as a new column called 'Excessive Absenteeism'

In [150]:
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


### A comment on target

Using the median as a cut-off line is 'numerically stable and rigid' because it implicitly balances the dataset: roughly half of the observations fall into each class. This prevents our model from learning to output only 0s or only 1s. To check this, let's divide the number of targets that are 1 by the total number of targets. The number of targets that are 1 can be found by summing all values of targets, while the total number of targets is the shape along axis 0 (the number of rows).

In [151]:
targets.sum() / targets.shape[0]
# the result means about 46% of the targets are 1s and about 54% are 0s
# even a 60-40 split will usually work well for a logistic regression,
# although that is not necessarily true for neural networks

0.45571428571428574
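The same balance check can be written in a couple of equivalent ways. The sketch below uses a made-up array of 0/1 targets, so the numbers are illustrative only.

```python
import numpy as np

toy_targets = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])  # hypothetical 0/1 targets

share_of_ones = toy_targets.mean()   # same as sum() / shape[0] for 0/1 values
counts = np.bincount(toy_targets)    # counts[0] = number of 0s, counts[1] = number of 1s

print(share_of_ones, counts)
```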

In [152]:
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours', 'Day of the week', 'Daily Work Load Average','Distance to Work'], axis = 1)

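# let's drop the column 'Absenteeism Time in Hours', as the targets replace it, together with
# 'Day of the week', 'Daily Work Load Average' and 'Distance to Work'
# storing the result in a new variable, data_with_targets, creates a new checkpoint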

In [153]:
data_with_targets is data_preprocessed

False

In [154]:
data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


## Select the inputs for the regression with the pandas indexer iloc


In [155]:
data_with_targets.shape

(700, 12)

##### The pandas indexer iloc selects data by position. It takes 2 arguments
DataFrame.iloc[row indices, column indices] - selects (slices) the data by position, given the rows and columns wanted.
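A small sketch of positional slicing on a toy DataFrame (the column names a, b and target are made up). It also shows that a slice end beyond the last column is simply clipped.

```python
import pandas as pd

toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'target': [0, 1, 0]})

all_but_last = toy.iloc[:, :-1]   # every row, every column except the last
first_two = toy.iloc[:, :2]       # the same result here, by explicit position
oversized = toy.iloc[:, :14]      # slice end past the last column: nothing is dropped

print(list(all_but_last.columns), oversized.shape)
```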



In [156]:
data_with_targets.iloc[:,:14]
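# we selected all rows (:)
# the slice end 14 is larger than the number of columns (12), so all columns are returned,
# including the last column 'Excessive Absenteeism'; to leave it out the slice would be :11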

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0,1
696,1,0,0,0,5,225,28,24,0,1,2,0
697,1,0,0,0,5,330,28,25,1,0,0,1
698,0,0,0,1,5,235,32,25,1,0,0,0


In [157]:
data_with_targets.iloc[:,:-1]

# selects all rows and every column except the last one, 'Excessive Absenteeism'

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,7,289,33,30,0,2,1
1,0,0,0,0,7,118,50,31,0,1,0
2,0,0,0,1,7,179,38,31,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0
4,0,0,0,1,7,289,33,30,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,5,179,40,22,1,2,0
696,1,0,0,0,5,225,28,24,0,1,2
697,1,0,0,0,5,330,28,25,1,0,0
698,0,0,0,1,5,235,32,25,1,0,0


In [158]:
unscaled_inputs = data_with_targets.iloc[:,:-1]
# creates a variable called unscaled_inputs that stores our DataFrame without the last column
# that's another checkpoint

## Standardize the data 


In [159]:
# import relevant module
#from sklearn.preprocessing import StandardScaler

#absenteeism_scaler = StandardScaler()


A StandardScaler object starts out empty and is then used to scale the data: it subtracts the mean and divides by the standard deviation variable-wise (feature-wise). Because the dummy variables should not be standardized, the plain StandardScaler above is commented out and wrapped in the CustomScaler below, which scales only a chosen list of columns.
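To make "subtract the mean and divide by the standard deviation" concrete, here is a minimal sketch on made-up numbers that compares StandardScaler with the same computation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # hypothetical values

scaler = StandardScaler()
X_sklearn = scaler.fit_transform(X)

# Same computation by hand: StandardScaler uses the population standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_sklearn, X_manual))
```

After scaling, each column has mean 0 and standard deviation 1.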

In [160]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

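class CustomScaler(BaseEstimator, TransformerMixin):
    # scales only the columns listed in `columns` and leaves the others untouched

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # recent scikit-learn versions accept these options as keyword arguments only
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.columns = columns
        self.mean_ = None
        self.var_ = None

    def fit(self, X, y=None):
        # learn the mean and standard deviation of the selected columns
        self.scaler.fit(X[self.columns], y)
        # per-column statistics; ddof=0 matches the population variance np.var computes
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self

    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # keep the index of X so that the concat below aligns rows correctly
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]),
                                columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        # put the columns back in their original order
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]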

In [161]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [163]:
#columns_to_scale =['Month Value','Day of the week', 'Transportation Expense', 'Distance to Work',
       #'Age', 'Daily Work Load Average', 'Body Mass Index','Children', 'Pets']

In [164]:
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

List comprehension is a syntactic construct which allows us to create a list from existing lists based on loops, conditionals, etc.
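As a short illustration, the comprehension below does the same work as the explicit loop that follows it. The column list is shortened and purely illustrative.

```python
columns = ['Reason_1', 'Age', 'Education', 'Pets']   # a shortened, illustrative column list
to_omit = ['Reason_1', 'Education']

# Comprehension form: keep every column that is not in to_omit
kept = [c for c in columns if c not in to_omit]

# Equivalent explicit loop
kept_loop = []
for c in columns:
    if c not in to_omit:
        kept_loop.append(c)

print(kept)
```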

In [165]:
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [166]:
absenteeism_scaler = CustomScaler(columns_to_scale)

In [167]:
absenteeism_scaler.fit(unscaled_inputs)

# this line fits absenteeism_scaler on unscaled_inputs,
# meaning it calculates and stores the mean and standard deviation of each column in columns_to_scale
# the inner StandardScaler is created with with_mean=True and with_std=True; the output below shows None
# for copy, with_mean and with_std only because CustomScaler does not store them as attributes



CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pets'],
             copy=None, with_mean=None, with_std=None)

To apply the scaling we use another method, transform.
This operation transforms the unscaled inputs using the mean and standard deviation stored in absenteeism_scaler.
In other words, we subtract the mean and divide by the standard deviation.

Whenever we get new data, we will use this syntax to apply the same transformation:

new_data_raw = pd.read_csv('new_data.csv')
new_data_scaled = absenteeism_scaler.transform(new_data_raw)
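The key point is that new data is scaled with the statistics learned from the training data, not with its own. A minimal sketch, using a plain StandardScaler and made-up ages rather than the notebook's CustomScaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'Age': [30.0, 40.0, 50.0]})   # hypothetical training values
new = pd.DataFrame({'Age': [40.0, 60.0]})           # hypothetical new observations

scaler = StandardScaler().fit(train)    # mean and std come from the training data only
new_scaled = scaler.transform(new)      # reuses the stored mean (40) and std

print(scaler.mean_, new_scaled.ravel())
```

An age of 40 maps to 0 because it equals the training mean, even though it is not the mean of the new data.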

In [168]:
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)


In [169]:
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.030796,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.030796,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.030796,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.568019,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.568019,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.568019,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.568019,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690


In [170]:
scaled_inputs.shape

# we have 700 observations and 11 features

(700, 11)

### Train-Test Split 

Overfitting means that a model learns to predict the training data so well that it fails on new data.

One way to deal with overfitting is to hide a small portion of the data from the algorithm. We train the model on most of the data and test it on the rest. We will split the dataset into train and test sets to assess the accuracy of the model on data it has never seen before.

Data shuffle: we want to shuffle the data to remove any dependencies that come from the order of the dataset (to make the order random).

Import the relevant module

In [171]:
from sklearn.model_selection import train_test_split

# the train_test_split function has many arguments
# the most important ones are the inputs and the targets
# syntax: sklearn.model_selection.train_test_split(inputs, targets) splits arrays or matrices into random train and test subsets


## Split


In [172]:
train_test_split(scaled_inputs, targets)

[     Reason_1  Reason_2  Reason_3  Reason_4  Month Value  \
 259         0         0         0         1     0.330204   
 387         1         0         0         0    -1.466241   
 402         0         0         0         1    -1.166834   
 518         0         0         0         1     0.929019   
 23          0         0         0         1     0.330204   
 ..        ...       ...       ...       ...          ...   
 282         0         0         0         1     0.629611   
 688         0         0         0         0    -0.568019   
 544         0         0         0         1     1.228426   
 244         1         0         0         0     0.030796   
 517         0         0         0         1     0.929019   
 
      Transportation Expense       Age  Body Mass Index  Education  Children  \
 259                1.005844 -0.536062         0.767431          0  0.880469   
 387                0.356940  0.718933        -0.878984          0 -0.919030   
 402               -1.5746

###### the result obtained is 4 arrays:
array 1: a training dataset with inputs,
array 2: a test dataset with inputs,
array 3: a training dataset with targets,
array 4: a test dataset with targets

Let's call train_test_split again and place these arrays in 4 variables

In [173]:
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, train_size = 0.8, random_state = 20)

# notice the train_size is set to 0.8, so 80% of the observations go to the training set
# random_state = 20 is the seed of the random shuffle; fixing it makes the split reproducible
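The return order and the effect of random_state can be checked on toy arrays. This is a sketch with made-up data; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 hypothetical observations, 2 features
y = np.arange(10)                  # one target per observation

# Return order: train inputs, test inputs, train targets, test targets
x_tr, x_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=20)
print(x_tr.shape, x_te.shape, y_tr.shape, y_te.shape)

# Repeating the call with the same random_state reproduces the same shuffle
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(X, y, train_size=0.8, random_state=20)
print(np.array_equal(x_tr, x_tr2))
```

Inputs and targets are shuffled together, so each row of inputs stays paired with its own target.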

In [174]:
print(x_train.shape, y_train.shape)

# the train inputs are 560 by 11, while the training targets have shape (560,)
# this tells us that the inputs contain 560 observations along 11 features, while the targets are a vector of length 560
# that is, 560 observations with 11 inputs and one target value each

(560, 11) (560,)


In [175]:
print(x_test.shape, y_test.shape)

# same for the test data

(140, 11) (140,)


## Logistic regression with sklearn 

In [176]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Training model and assessing its Accuracy

In [177]:
reg = LogisticRegression()

In [178]:
reg.fit(x_train,y_train)

# this line fits the regression
# sklearn.linear_model.LogisticRegression.fit(x,y) fits the model according to the given training data
# the output below lists the model's parameters, all left at their default values

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [179]:
reg.score(x_train,y_train)

# sklearn.linear_model.LogisticRegression.score(inputs,targets)
# returns the mean accuracy on the given data and labels
# here the output tells us that the accuracy on the training data is about 78%

0.7767857142857143

## Manually check the accuracy


In [180]:
model_outputs = reg.predict(x_train)

#sklearn.linear_model.LogisticRegression.predict(inputs)
# this method predicts class labels (logistic regression outputs) for given input samples

In [181]:
model_outputs

# these are the predictions of our model

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0,

In [182]:
y_train

# these are the targets

array([0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,

In [183]:
model_outputs == y_train

# let's compare our predictions with our targets: if there is a match the result is True, otherwise False
# we can see which elements were predicted correctly and which were not

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False,  True, False, False,  True,  True,  True,  True,
       False,  True, False,  True, False, False,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
       False,  True, False,  True,  True, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
       False,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,

In [184]:
np.sum(model_outputs == y_train)

# sum the array above of all the True outputs

435

In [185]:
model_outputs.shape[0]

560

In [186]:
# Accuracy = correct predictions (435) / number of observations (560, the training data)
# same result as the score method
np.sum(model_outputs == y_train) / model_outputs.shape[0]

0.7767857142857143
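The metrics module imported earlier offers the same calculation through accuracy_score. A small sketch with made-up targets and predictions:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 1, 1, 0, 1])   # hypothetical targets
y_pred = np.array([0, 1, 0, 0, 1])   # hypothetical predictions

manual = np.sum(y_pred == y_true) / y_true.shape[0]   # correct predictions / observations
library = metrics.accuracy_score(y_true, y_pred)      # same quantity from sklearn.metrics

print(manual, library)
```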

## Finding the Intercept and Coefficient 

Regression analysis can be linear or non-linear; logistic regression is a non-linear example, because it passes a linear combination of the inputs through the logistic function.

To use our logistic regression model outside of python we need to find the intercept and coefficients.



In [187]:
reg.intercept_

array([-1.60957471])

In [188]:
reg.coef_

array([[ 2.77151176,  0.93168817,  3.09210221,  0.8090592 ,  0.00781237,
         0.62505482, -0.17390339,  0.28829409, -0.24081615,  0.35753531,
        -0.27337422]])

In [189]:
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [190]:
feature_name = unscaled_inputs.columns.values


In [191]:
summary_table = pd.DataFrame(columns=['Feature name'], data =feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table

Unnamed: 0,Feature name,Coefficient
0,Reason_1,2.771512
1,Reason_2,0.931688
2,Reason_3,3.092102
3,Reason_4,0.809059
4,Month Value,0.007812
5,Transportation Expense,0.625055
6,Age,-0.173903
7,Body Mass Index,0.288294
8,Education,-0.240816
9,Children,0.357535


In [192]:
summary_table.index = summary_table.index + 1

In [193]:
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.609575
1,Reason_1,2.771512
2,Reason_2,0.931688
3,Reason_3,3.092102
4,Reason_4,0.809059
5,Month Value,0.007812
6,Transportation Expense,0.625055
7,Age,-0.173903
8,Body Mass Index,0.288294
9,Education,-0.240816


## Interpreting Coefficients 

This is the logistic regression equation

log(odds) = intercept + b1x1 + b2x2 +...+ b11x11

log(odds) = -1.61 + 2.77*Reason_1 + 0.93*Reason_2 +...+ (-0.27)*Pets

So all the coefficients we have refer to Log(odds)
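To see how the log(odds) relate to the probabilities the model reports, the sketch below fits a logistic regression on made-up one-feature data (not the absenteeism dataset) and rebuilds the probability of class 1 from the intercept and coefficient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data, not the absenteeism dataset
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0], [-0.2], [0.3]])
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])

model = LogisticRegression().fit(X, y)

log_odds = model.intercept_ + X @ model.coef_.ravel()   # intercept + b1*x1
probability = 1 / (1 + np.exp(-log_odds))               # inverse of log(p / (1 - p))

print(np.allclose(probability, model.predict_proba(X)[:, 1]))
```

This is what makes the model usable outside Python: the intercept and coefficients are enough to reproduce its output.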

In [194]:
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

# finds the exponential of the coefficients

In [195]:
summary_table

Unnamed: 0,Feature name,Coefficient,Odds_ratio
0,Intercept,-1.609575,0.199973
1,Reason_1,2.771512,15.982778
2,Reason_2,0.931688,2.538791
3,Reason_3,3.092102,22.023327
4,Reason_4,0.809059,2.245794
5,Month Value,0.007812,1.007843
6,Transportation Expense,0.625055,1.868348
7,Age,-0.173903,0.840378
8,Body Mass Index,0.288294,1.33415
9,Education,-0.240816,0.785986


In [196]:
summary_table.sort_values('Odds_ratio', ascending=False)

# DataFrame.sort_values(Series) sorts the values in a data frame with respect to a given column (Series)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,3.092102,22.023327
1,Reason_1,2.771512,15.982778
2,Reason_2,0.931688,2.538791
4,Reason_4,0.809059,2.245794
6,Transportation Expense,0.625055,1.868348
10,Children,0.357535,1.429801
8,Body Mass Index,0.288294,1.33415
5,Month Value,0.007812,1.007843
7,Age,-0.173903,0.840378
9,Education,-0.240816,0.785986


A feature has little importance if:
- its coefficient is around zero
- its odds ratio is around 1

A weight (coefficient) of zero implies that, no matter the feature value, we multiply it by zero in the model, so the feature contributes nothing to the log(odds).

For a unit change in a standardized feature, the odds are multiplied by the odds ratio (1 = no change).

For example if we say:

odds * odds ratio = new odds (for a unit change)

5:1  *   2        = 10:1

5:1  *   0.2      = 1:1

5:1  *   1        = 5:1 
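The arithmetic above, together with the link between a coefficient and its odds ratio, can be sketched in a few lines. The coefficient 0.625055 is the 'Transportation Expense' value from the summary table; the variable names are illustrative.

```python
import numpy as np

odds = 5.0   # odds of 5:1, as in the example above
new_odds = [odds * odds_ratio for odds_ratio in (2.0, 0.2, 1.0)]
print(new_odds)   # odds after a one-unit change, for each odds ratio

# A coefficient b corresponds to an odds ratio of exp(b); b = 0 gives exactly 1
no_effect = np.exp(0.0)
transport_ratio = np.exp(0.625055)
print(no_effect, transport_ratio)
```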

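The weights (coefficients) of 'Daily Work Load Average', 'Day of the week' and 'Distance to Work' were almost zero and their odds ratios almost 1, so these features make no difference. They do not appear in the summary table above because they were already dropped when data_with_targets was created.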


### Backward Elimination

The idea is that we can simplify our model by removing all features that contribute close to nothing to it.
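One rough way to shortlist candidates for removal is to flag features whose coefficient is close to zero. The sketch below copies the coefficients from the outputs above; the threshold of 0.05 is a hypothetical choice made only for illustration, and the comparison is most meaningful among the standardized features.

```python
import pandas as pd

# Coefficients copied from the outputs above (intercept left out)
coefficients = pd.Series({
    'Reason_1': 2.771512, 'Reason_2': 0.931688, 'Reason_3': 3.092102,
    'Reason_4': 0.809059, 'Month Value': 0.007812,
    'Transportation Expense': 0.625055, 'Age': -0.173903,
    'Body Mass Index': 0.288294, 'Education': -0.240816,
    'Children': 0.357535, 'Pets': -0.273374,
})

threshold = 0.05   # hypothetical cut-off, chosen only for illustration
candidates = coefficients[coefficients.abs() < threshold].index.tolist()

print(candidates)
```

After dropping any flagged feature, the model would need to be refit and its accuracy checked again.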

### Testing the model: the final part of the machine learning process

In [198]:
reg.score(x_test,y_test)

0.7357142857142858

There is another method:
sklearn.linear_model.LogisticRegression.predict_proba(x) - returns the probability estimates for all possible outputs (classes). Each row holds the probability of class 0 in the first column and of class 1 in the second.

In [199]:
predicted_proba = reg.predict_proba(x_test)
predicted_proba

array([[0.75466347, 0.24533653],
       [0.60917591, 0.39082409],
       [0.48329093, 0.51670907],
       [0.75768232, 0.24231768],
       [0.08357741, 0.91642259],
       [0.3052464 , 0.6947536 ],
       [0.303675  , 0.696325  ],
       [0.11636888, 0.88363112],
       [0.7400284 , 0.2599716 ],
       [0.75596036, 0.24403964],
       [0.50609784, 0.49390216],
       [0.19501503, 0.80498497],
       [0.06248668, 0.93751332],
       [0.7055465 , 0.2944535 ],
       [0.29675526, 0.70324474],
       [0.52028649, 0.47971351],
       [0.50551315, 0.49448685],
       [0.50843643, 0.49156357],
       [0.36713074, 0.63286926],
       [0.06422143, 0.93577857],
       [0.73822433, 0.26177567],
       [0.75768232, 0.24231768],
       [0.47994423, 0.52005577],
       [0.47760936, 0.52239064],
       [0.22619725, 0.77380275],
       [0.74047815, 0.25952185],
       [0.51148533, 0.48851467],
       [0.87702735, 0.12297265],
       [0.24005377, 0.75994623],
       [0.75768232, 0.24231768],
       [0.

In [200]:
predicted_proba.shape

(140, 2)

In [201]:
predicted_proba[:,1]

array([0.24533653, 0.39082409, 0.51670907, 0.24231768, 0.91642259,
       0.6947536 , 0.696325  , 0.88363112, 0.2599716 , 0.24403964,
       0.49390216, 0.80498497, 0.93751332, 0.2944535 , 0.70324474,
       0.47971351, 0.49448685, 0.49156357, 0.63286926, 0.93577857,
       0.26177567, 0.24231768, 0.52005577, 0.52239064, 0.77380275,
       0.25952185, 0.48851467, 0.12297265, 0.75994623, 0.24231768,
       0.38859882, 0.71238179, 0.69821485, 0.49507156, 0.24231768,
       0.59772596, 0.26042186, 0.78022686, 0.4398591 , 0.60641126,
       0.24188848, 0.49713003, 0.25862385, 0.40689715, 0.80759194,
       0.59889345, 0.71944702, 0.24317762, 0.24676126, 0.2414598 ,
       0.50180816, 0.29299268, 0.6947536 , 0.24459387, 0.82033268,
       0.39193844, 0.90599346, 0.26442957, 0.32150234, 0.3220128 ,
       0.70502976, 0.69623972, 0.26579672, 0.77584327, 0.24541121,
       0.24490372, 0.07551158, 0.26087263, 0.76176924, 0.29640101,
       0.25772789, 0.31539593, 0.88408057, 0.4387068 , 0.59547

In reality, logistic regression models calculate these probabilities in the background.

If the probability of class 1 is:
- below 0.5, the model outputs a 0
- above 0.5, the model outputs a 1
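The 0.5 rule can be verified by thresholding the second column of predict_proba and comparing the result with predict. The data below is made up for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0], [-0.2], [0.3]])  # hypothetical inputs
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])                                      # hypothetical targets

model = LogisticRegression().fit(X, y)

proba_of_one = model.predict_proba(X)[:, 1]        # probability of class 1
by_threshold = (proba_of_one > 0.5).astype(int)    # apply the 0.5 rule by hand

print(np.array_equal(by_threshold, model.predict(X)))
```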

Next steps:
    1. Save the model 
    2. Create a module
    3. Get new data, classify it, pass it through SQL, and analyze it in Tableau

## Save the model

###### Saving the model means saving the 'reg' object

In [202]:
# pickle is a Python module used to convert a Python object into a byte stream

import pickle


In [203]:
with open('model', 'wb') as file:
    pickle.dump(reg,file)
    
    
# 'model' is the file name and 'wb' stands for write bytes; when we unpickle we will use 'rb' (read bytes)
# pickle.dump saves (dumps) the object into the file; when we unpickle, we load it back
# its arguments are the object to be dumped (reg) and the file

##### We must save the absenteeism_scaler, too!

We are separating the model from the training data for good.
The scaler will also be used to preprocess new data, so we must pickle it too.

In [204]:
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler,file)

Pickle is the standard Python tool for serialization and deserialization. In simple words, pickling means converting a Python object into a stream of bytes. Unpickling, logically, converts a stream of bytes (produced by pickling) back into a Python object.
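The round trip can be sketched in memory with pickle.dumps and pickle.loads, which mirror what pickle.dump and pickle.load do with a file opened in 'wb' or 'rb' mode. The model below is fitted on made-up data and stands in for reg.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.0], [0.5], [1.0], [2.0], [-0.5]])   # hypothetical inputs
y = np.array([0, 0, 1, 1, 1, 0])                              # hypothetical targets
toy_model = LogisticRegression().fit(X, y)

payload = pickle.dumps(toy_model)    # object -> bytes (what pickle.dump writes to the file)
restored = pickle.loads(payload)     # bytes -> object (what pickle.load reads back)

print(type(payload), np.array_equal(restored.predict(X), toy_model.predict(X)))
```

The restored object carries the fitted coefficients, so it can predict without access to the training data. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.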