# Machine Learning Case Study: Absenteeism 
#### by Sooyeon Won 

### Part 2: Machine Learning 

### Keywords 
- Supervised Machine Learning 
- Classification Model
- Logistic Regression 


### Contents 

<ul>    
<li><a href="#Preprocessing">1.  Data Preprocessing</a></li>
<li><a href="#Analysis">2.  Machine Learning</a></li>
<li><a href="#Deployment">3.  Model Deployment</a></li>
</ul>


### Data Preparation from Part 1 

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np

In [2]:
# Load the preprocessed data
data_preprocessed = pd.read_csv('Absenteeism_preprocessed.csv')

In [3]:
# Eyeball the data
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


<a id='Analysis'></a>
### 2. Machine Learning 
- 2.1. Create the targets
- 2.2. Select the inputs for the regression
- 2.3. Standardize & Split the data
- 2.4. Logistic regression with sklearn
- 2.5. Training the Model
- 2.6. Testing the model

#### 2. 1. Create the targets

In [4]:
# Check the median value for the cut-off line 
data_preprocessed['Absenteeism Time in Hours'].median()

3.0

In [5]:
# Create targets using parameterized code
targets = np.where(data_preprocessed['Absenteeism Time in Hours'] > 
                   data_preprocessed['Absenteeism Time in Hours'].median(), 1, 0)

# Create a Series in the original dataframe
data_preprocessed['Excessive Absenteeism'] = targets

In [6]:
# Check the result of it
data_preprocessed.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pet,Absenteeism Time in Hours,Excessive Absenteeism
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4,1
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2,0
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4,1
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2,0


> **Comments**: In this section, I create the target variable for the regression. In the dataset, the column 'Absenteeism Time in Hours' is numeric, so I classified each data point (person) into one of two classes by setting a cut-off at its median value. 
The median of 'Absenteeism Time in Hours' is 3 hours. If the value of 'Absenteeism Time in Hours' is larger than the median, the data point belongs to the class **'Excessively Absent (1)'**. If the value is smaller than or equal to the median, the data point belongs to the class **'Moderately Absent (0)'**. These are the values I aim to predict throughout this analysis. 

> The median is taken as the cut-off because it keeps the dataset balanced: there will be a roughly equal number of 0s and 1s for the logistic regression. Balancing by the median works especially well when the dataset is sufficiently large. Note that np.where() assigns 1 to anyone who has been absent for more than 3 hours, i.e. 4 hours or more. This is the equivalent of taking half a day off.

> Then I created a Series in the original data frame that contains the targets for the regression.
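
> The cut-off rule can be illustrated on a toy sample. This is only a sketch: the `hours` values below are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical absence durations in hours (illustrative values, not the real data)
hours = np.array([0, 1, 2, 3, 3, 4, 8, 16])

# The median is the cut-off; values strictly above it get class 1
cutoff = np.median(hours)
targets_toy = np.where(hours > cutoff, 1, 0)

# Share of 1s: values equal to the median fall into class 0,
# so the share can end up somewhat below 0.5
share_of_ones = targets_toy.mean()

print(cutoff)         # 3.0
print(targets_toy)    # [0 0 0 0 0 1 1 1]
print(share_of_ones)  # 0.375
```

> Ties at the median are one plausible reason why the share of 1s in a median split is not exactly 50%.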

In [7]:
# Re-check the Balanced Targets
round(targets.sum() / targets.shape[0], 2)

0.46

In [8]:
# Drop the unnecessary variables
data_with_targets = data_preprocessed.drop(['Absenteeism Time in Hours','Day of the Week',
                                            'Daily Work Load Average','Distance to Work'],axis=1)

In [9]:
# Eyeball the modified data 
data_with_targets.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pet,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


In [10]:
data_with_targets.shape

(700, 12)

> **Comments:** 
After I classified the targets, I checked whether the dataset is balanced (what % of targets are 1s). About 46% of the targets are 1s, so the classes are roughly balanced. 
>- targets.sum() gives the number of 1s in the targets 
>- shape[0] gives the length of the targets array <br><br>
> Then I created a checkpoint by dropping the unnecessary variables and the variables I 'eliminated' after exploring the weights. I ended up with 700 samples and 12 variables (11 inputs plus the target). 

#### 2. 2.  Select the inputs for the regression

In [11]:
# Create a variable that will contain the inputs (everything without the targets)
unscaled_inputs = data_with_targets.iloc[:,:-1]

#### 2. 3. Standardize the data

In [12]:
# Import the relevant libraries
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin

# Define scaler as an object
absenteeism_scaler = StandardScaler()

> Standardization is one of the most common preprocessing tools. When inputs differ in magnitude (scale), the model can be biased towards the features with larger values, so I want all inputs to be of similar magnitude. To do so, I used StandardScaler from the sklearn library and created an object that will hold the scaling information for this particular dataset.
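
> As a rough illustration of what standardization computes, the sketch below applies z = (x - mean) / std to the 'Age' and 'Transportation Expense' values of the five rows displayed earlier. The helper `standardize` is a hypothetical name used only here; it uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Values from the five rows shown by data_preprocessed.head()
age = np.array([33.0, 50.0, 38.0, 39.0, 33.0])
expense = np.array([289.0, 118.0, 179.0, 279.0, 289.0])

def standardize(x):
    """Return (x - mean) / std with the population std (ddof=0)."""
    return (x - x.mean()) / x.std()

age_z = standardize(age)
expense_z = standardize(expense)

# After standardization both columns have mean 0 and standard deviation 1,
# even though their raw magnitudes differ by roughly an order of magnitude
print(age_z.mean().round(6), age_z.std().round(6))
print(expense_z.mean().round(6), expense_z.std().round(6))
```

> The numbers differ from the scaled table further down because there the mean and standard deviation come from all 700 rows, not from five.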

In [13]:
class CustomScaler(BaseEstimator,TransformerMixin): 
    
    # init: to declare a CustomScaler object and what is calculated/declared
    
    def __init__(self,columns,copy=True,with_mean=True,with_std=True):
        
        # Scaler is nothing but a StandardScaler object
        # (its parameters are keyword-only in recent scikit-learn versions)
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        # with some columns 'twist'
        self.columns = columns
        self.mean_ = None
        self.var_ = None
        
    
    # Method 1: fit, based on StandardScaler
    
    def fit(self, X, y=None):
        self.scaler.fit(X[self.columns], y)
        # Per-column mean and population variance (ddof=0, the np.var default)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    
    # Method 2: transform, conducting the actual scaling

    def transform(self, X, y=None, copy=None):
        
        # Record the initial order of the columns
        init_col_order = X.columns
        
        # Scale the columns chosen when the instance was created,
        # keeping X's index so that the concat below aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        
        # Declare a variable containing all information that was not scaled
        X_not_scaled = X.loc[:,~X.columns.isin(self.columns)]
        
        # Return a dataframe which contains all scaled features and all 'not scaled' features
        # Use the original order recorded in the beginning
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]

In [14]:
# All columns 
unscaled_inputs.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pet'], dtype=object)

In [15]:
# Select the columns to omit
columns_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4','Education']

# Choose the columns to scale
columns_to_scale = [x for x in unscaled_inputs.columns.values if x not in columns_to_omit]

In [16]:
# Declare a scaler object, specifying the columns you want to scale
absenteeism_scaler = CustomScaler(columns_to_scale)



In [17]:
# Fit the data, meaning calculate mean and standard deviation. 
# The information will be automatically stored inside the object. 
absenteeism_scaler.fit(unscaled_inputs)



CustomScaler(columns=['Month Value', 'Transportation Expense', 'Age',
                      'Body Mass Index', 'Children', 'Pet'],
             copy=None, with_mean=None, with_std=None)

In [18]:
# Standardizes the data, using the transform method 
scaled_inputs = absenteeism_scaler.transform(unscaled_inputs)
scaled_inputs

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pet
0,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.030796,-1.574681,2.130803,1.002633,0,-0.019280,-0.589690
2,0,0,0,1,0.030796,-0.654143,0.248310,1.002633,0,-0.919030,-0.589690
3,1,0,0,0,0.030796,0.854936,0.405184,-0.643782,0,0.880469,-0.589690
4,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487
...,...,...,...,...,...,...,...,...,...,...,...
695,1,0,0,0,-0.568019,-0.654143,0.562059,-1.114186,1,0.880469,-0.589690
696,1,0,0,0,-0.568019,0.040034,-1.320435,-0.643782,0,-0.019280,1.126663
697,1,0,0,0,-0.568019,1.624567,-1.320435,-0.408580,1,-0.919030,-0.589690
698,0,0,0,1,-0.568019,0.190942,-0.692937,-0.408580,1,-0.919030,-0.589690



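> By fitting the data, I found the internal parameters (the mean and standard deviation of each selected column) that will be used to transform data. 
Transforming applies these parameters to the data. Because CustomScaler.transform rebuilds a DataFrame, scaled_inputs is still a DataFrame with the original column order: only the selected columns are scaled, while the omitted columns are left as they are. 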
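
> The difference between fitting and transforming can be sketched with a plain StandardScaler on hypothetical numbers. The arrays `train_values` and `new_values` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data (illustrative values only)
train_values = np.array([[30.0], [40.0], [50.0]])
new_values = np.array([[40.0], [60.0]])

scaler = StandardScaler()
scaler.fit(train_values)            # learns the mean and the scale, nothing else

learned_mean = scaler.mean_[0]      # mean of the fitted column
learned_std = scaler.scale_[0]      # population standard deviation of the fitted column

# transform() reuses the stored parameters, also on data it has not seen
new_scaled = scaler.transform(new_values)

print(learned_mean, learned_std)
print(new_scaled.ravel())
```

> The value 40 maps to 0 because it equals the stored mean, and 60 maps to (60 - 40) / learned_std. The same stored parameters are what gets pickled together with the scaler at the end of this notebook.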

#### Split the data into train & test and shuffle

In [19]:
# Import the relevant libraries
from sklearn.model_selection import train_test_split

In [20]:
# Declare 4 variables for the split
x_train, x_test, y_train, y_test = train_test_split(scaled_inputs, targets, test_size = 0.2, random_state = 42)

# Check the shape of the train inputs and targets
print ('Train inputs shape: ', x_train.shape, 'Train targets shape', y_train.shape)

# Check the shape of the test inputs and targets
print ('Test inputs shape: ', x_test.shape, 'Test targets shape', y_test.shape)

Train inputs shape:  (560, 11) Train targets shape (560,)
Test inputs shape:  (140, 11) Test targets shape (140,)


#### 2. 4. Logistic regression with sklearn

In [21]:
# Import the relevant libraries 
from sklearn.linear_model import LogisticRegression 
from sklearn import metrics 

#### 2. 5. Training the model

In [22]:
# Create a logistic regression object
reg = LogisticRegression()

# Fit the model on the train inputs; this is essentially the whole training part of the machine learning
reg.fit(x_train,y_train)

# Assess the train accuracy of the model
round(reg.score(x_train,y_train), 2)

0.77

####  Accuracy Check (Details)
> I re-check the accuracy by comparing the outputs and the true targets. 

In [23]:
# Model outputs according to the LogReg model
model_outputs = reg.predict(x_train)

# True Targets 
true_targets = y_train 

# find out in how many instances we predicted correctly
np.sum((model_outputs==true_targets))

433

In [24]:
# The total number of instances
model_outputs.shape[0]

560

In [25]:
# Calculate the accuracy of the model
accuracy = np.sum((model_outputs==y_train)) / model_outputs.shape[0]
round(accuracy, 2)

0.77

#### The Intercept and Coefficients

In [26]:
# The intercept (a.k.a bias) of the model
reg.intercept_

array([-1.68800249])

In [27]:
# The coefficients (a.k.a. weights) of the model
reg.coef_

array([[ 2.90600225,  0.7548816 ,  3.08731323,  0.94700556,  0.02451849,
         0.65563825, -0.25286853,  0.25704553, -0.25145264,  0.40171316,
        -0.2957514 ]])

In [28]:
# The names of the columns
unscaled_inputs.columns.values

# Save the names of the columns in an ad-hoc variable
feature_name = unscaled_inputs.columns.values

> To summarize the results, I created a dataframe with the feature names and then added the coefficients of the model. The model coefficients (reg.coef_) are stored as a single row, so they must be transposed before they are put into the dataframe as a column (a vertical organization, so that they can be multiplied by certain matrices later). Finally, I added the coefficient values and the intercept to the summary table.



In [29]:
summary_table = pd.DataFrame (columns=['Feature name'], data = feature_name)
# Add coefficients 
summary_table['Coefficient'] = np.transpose(reg.coef_)

# Add the intercept 
summary_table.index = summary_table.index + 1  # Move all indices by 1
# Add the intercept at index 0
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]

# Sort the df by index
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature name,Coefficient
0,Intercept,-1.688002
1,Reason_1,2.906002
2,Reason_2,0.754882
3,Reason_3,3.087313
4,Reason_4,0.947006
5,Month Value,0.024518
6,Transportation Expense,0.655638
7,Age,-0.252869
8,Body Mass Index,0.257046
9,Education,-0.251453
10,Children,0.401713
11,Pet,-0.295751


#### Interpretation of the coefficients

In [30]:
# Calculate the 'Odds ratio' of each feature
summary_table['Odds_ratio'] = np.exp(summary_table.Coefficient)

# Sort the table according to odds ratio
summary_table.sort_values('Odds_ratio', ascending=False)

Unnamed: 0,Feature name,Coefficient,Odds_ratio
3,Reason_3,3.087313,21.91811
1,Reason_1,2.906002,18.283559
4,Reason_4,0.947006,2.577978
2,Reason_2,0.754882,2.12736
6,Transportation Expense,0.655638,1.926372
10,Children,0.401713,1.494383
8,Body Mass Index,0.257046,1.293104
5,Month Value,0.024518,1.024822
9,Education,-0.251453,0.77767
7,Age,-0.252869,0.77657


> **Interpretation of the coefficients:** The closer a weight (coefficient) is to 0, the closer its odds ratio is to 1. Standardized coefficients are the coefficients of a regression in which all the variables have been standardized. One advantage of standardization is that the standardized weights allow a simple, straightforward comparison between the variables: the larger a weight is in absolute value, the more important its feature is to the target. Note that when dummy variables are standardized, we lose the interpretability of a dummy, which is why the columns in columns_to_omit were left unscaled.

> In logistic regression, the coefficients predict the log odds. From the table, when a feature is not particularly important to the targets, its coefficient is close to 0 and its odds ratio is close to 1. This is because a weight (i.e. coefficient) of 0 implies that, whatever the feature value is, it is multiplied by 0. In addition, for a unit change in a standardized feature, the odds are multiplied by the odds ratio. Thus, if the odds ratio is 1, the odds do not change. <br>
> Features whose odds ratio stays close to 1 make almost no difference to the prediction, so they can be dropped. This is how 'Day of the Week', 'Daily Work Load Average' and 'Distance to Work' were eliminated in section 2.1. 
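
> The link between a coefficient and its odds ratio can be checked numerically. The sketch below uses the intercept and the 'Transportation Expense' coefficient from the summary table and assumes all other features are held at 0. The helpers `sigmoid` and `odds` are hypothetical names introduced for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function: turns log odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds that correspond to a probability p."""
    return p / (1.0 - p)

intercept = -1.688002   # from the summary table
coef = 0.655638         # 'Transportation Expense', from the summary table

# Probability of class 1 before and after a one-unit increase of the feature
p_before = sigmoid(intercept + coef * 0.0)
p_after = sigmoid(intercept + coef * 1.0)

# The ratio of the two odds equals exp(coef), i.e. the odds ratio
ratio = odds(p_after) / odds(p_before)
print(round(ratio, 4), round(math.exp(coef), 4))

# With a coefficient of 0 the odds do not change at all
ratio_zero = odds(sigmoid(intercept + 0.0 * 1.0)) / odds(sigmoid(intercept))
print(ratio_zero)   # 1.0
```

> The odds are multiplied by the same factor for every one-unit step, while the change in probability depends on the starting point.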

> **4 reasons for absence**: Reason_0 is dropped when the dummy variables are created. To recall, Reason_0 represents a situation where a person was absent but gave no particular reason. Therefore, the base model is the case with no stated reason for absenteeism. All four reason coefficients are positive, so whenever a person has stated a reason, the chance of excessive absence is much higher than in the base model. 

>- **Reason_0**: Baseline
>- **Reason_1**: Various Diseases 
>- **Reason_2**: Pregnancy / Giving Birth
>- **Reason_3**: Poisoning 
>- **Reason_4**: Light Diseases

> According to the table, a person who has reported Reason_1 has about 18 times the odds of being excessively absent compared with Reason_0 (a person who didn't specify a reason). I particularly focused on Reason_2 (Pregnancy & Giving birth). It is a prominent cause of absenteeism, but at the same time it is far less pronounced than Reason_1 or Reason_3.

> The main drawback of standardisation appears when we deal with standardised non-dummy variables. For example, according to the table, the odds ratio of transportation expense is about 1.93. This implies that for an increase of 1 standardised unit in transportation expense, a person is close to twice as likely to be excessively absent. Since a standardised unit has no direct meaning in the original units, this interpretation is hard to understand. 
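
> One way to make such a coefficient easier to read is to convert it back to the original units. The sketch below is an approximation under stated assumptions: it estimates the standard deviation of 'Transportation Expense' from two rows that appear both in the raw and in the scaled table (289 and 118, scaled to 1.005844 and -1.574681) and then divides the coefficient by it.

```python
import math

# Raw and scaled 'Transportation Expense' for the first two displayed rows
raw_a, raw_b = 289.0, 118.0
scaled_a, scaled_b = 1.005844, -1.574681

# Since scaled = (raw - mean) / std, the std follows from two points
std_estimate = (raw_a - raw_b) / (scaled_a - scaled_b)

coef_standardised = 0.655638            # from the summary table
coef_per_raw_unit = coef_standardised / std_estimate

# Odds ratio for one additional unit of transportation expense
odds_ratio_per_raw_unit = math.exp(coef_per_raw_unit)

print(round(std_estimate, 2))
print(round(odds_ratio_per_raw_unit, 4))
```

> Read this way, one standardised unit corresponds to roughly 66 units of transportation expense, and each single unit changes the odds by about 1%.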


> Models trained on standardised inputs usually yield higher accuracy, because the optimisation algorithms work better this way. However, depending on the role I work in, the priority given to accuracy over interpretability could differ. Thus it makes sense to create two versions of the model, one with standardised features and one without, and then try to draw insights from both. Since I predict values later, I prefer the higher accuracy, at least throughout this analysis.


> **Interpretation of a negative coefficient (feature: pet)**: The coefficient of Pet is about -0.2958, which gives an odds ratio of exp(-0.2958) ≈ 0.744. For each additional standardised unit of Pet, the odds of excessive absenteeism are therefore 1 - 0.744 ≈ 26% lower. One possible reading is that a person with more than a single pet is probably not taking care of the pets alone; someone else may be looking after them. 

> **Interpretation of the Intercept**: The intercept is used to get more accurate predictions, but there is no specific meaning attached to it. In general, in machine learning, the intercept (or bias) calibrates the model. Without an intercept, each prediction would be off the mark by precisely that value. 

> **Backward Elimination**: Backward elimination is one method to simplify the model. The idea is to exclude the features that contribute close to nothing to the model. If such variables are removed, the coefficient values of the remaining features should hardly change. 
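
> A first screening step for backward elimination could look like the sketch below. The coefficients are copied from the model output above; the helper `weak_features` and the threshold of 0.05 are hypothetical choices for this illustration, not values used in the analysis. Comparing magnitudes is only meaningful among the standardised features, so the unscaled columns are left out.

```python
# Coefficients of the standardised features, rounded from reg.coef_ above
standardised_coefficients = {
    'Month Value': 0.024518,
    'Transportation Expense': 0.655638,
    'Age': -0.252869,
    'Body Mass Index': 0.257046,
    'Children': 0.401713,
    'Pet': -0.295751,
}

def weak_features(coefficients, threshold=0.05):
    """Return the features whose absolute coefficient is below the threshold."""
    return [name for name, weight in coefficients.items() if abs(weight) < threshold]

candidates = weak_features(standardised_coefficients)
print(candidates)   # ['Month Value']
```

> A candidate found this way would still need to be confirmed by refitting the model without it and checking that the accuracy and the remaining coefficients stay roughly the same.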

#### 2. 6. Testing the model

In [31]:
# Assess the test accuracy of the model
round(reg.score(x_test,y_test), 4)

0.7714

> On data that the model has never seen before, the model predicts correctly in about 77% of the cases whether a person is going to be excessively absent. A test accuracy well below the train accuracy (for example 10-20 percentage points lower) would indicate overfitting: the model learned the train data very well, but it is prone to fail in real life. Here the train accuracy (0.77) and the test accuracy (0.7714) are almost identical, so there is no sign of overfitting. 

In [32]:
# Find the predicted probabilities of each class
# The first column shows the probability of a particular observation to be 0, while the second one - to be 1
predicted_proba = reg.predict_proba(x_test)

predicted_proba

array([[0.82382633, 0.17617367],
       [0.86071363, 0.13928637],
       [0.79418937, 0.20581063],
       [0.58674649, 0.41325351],
       [0.59047288, 0.40952712],
       [0.0784706 , 0.9215294 ],
       [0.67472738, 0.32527262],
       [0.37480563, 0.62519437],
       [0.72183578, 0.27816422],
       [0.75625756, 0.24374244],
       [0.86419682, 0.13580318],
       [0.69223869, 0.30776131],
       [0.25934972, 0.74065028],
       [0.46136628, 0.53863372],
       [0.71570213, 0.28429787],
       [0.47841848, 0.52158152],
       [0.89102932, 0.10897068],
       [0.22758282, 0.77241718],
       [0.85983121, 0.14016879],
       [0.59755422, 0.40244578],
       [0.70535717, 0.29464283],
       [0.7576082 , 0.2423918 ],
       [0.7096897 , 0.2903103 ],
       [0.70154218, 0.29845782],
       [0.86673748, 0.13326252],
       [0.16895155, 0.83104845],
       [0.59755422, 0.40244578],
       [0.60012512, 0.39987488],
       [0.76295969, 0.23704031],
       [0.59931834, 0.40068166],
       [0.

In [33]:
predicted_proba.shape

(140, 2)

In [34]:
# select ONLY the probabilities referring to 1s
predicted_proba[:,1]

array([0.17617367, 0.13928637, 0.20581063, 0.41325351, 0.40952712,
       0.9215294 , 0.32527262, 0.62519437, 0.27816422, 0.24374244,
       0.13580318, 0.30776131, 0.74065028, 0.53863372, 0.28429787,
       0.52158152, 0.10897068, 0.77241718, 0.14016879, 0.40244578,
       0.29464283, 0.2423918 , 0.2903103 , 0.29845782, 0.13326252,
       0.83104845, 0.40244578, 0.39987488, 0.23704031, 0.40068166,
       0.12365487, 0.12933194, 0.6182887 , 0.54228018, 0.28579392,
       0.65097172, 0.29334453, 0.12933194, 0.84969171, 0.2088663 ,
       0.52524396, 0.24104626, 0.62862838, 0.13323921, 0.23177082,
       0.73803013, 0.77412439, 0.8772991 , 0.29639723, 0.13408928,
       0.24374244, 0.29793046, 0.40952712, 0.93294823, 0.13099423,
       0.23177082, 0.97566448, 0.28579392, 0.87182782, 0.23054446,
       0.5864671 , 0.12850755, 0.49468459, 0.63375504, 0.13580318,
       0.40340333, 0.69254506, 0.0562837 , 0.28280654, 0.51246653,
       0.23439529, 0.23704031, 0.69907946, 0.2872947 , 0.13157
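
> The probabilities of class 1 can be turned into class labels by applying a cut-off. The sketch below uses the first eight probabilities printed above. A cut-off of 0.5 corresponds to what reg.predict does for a binary logistic regression; the alternative cut-off of 0.4 is a hypothetical value chosen only to show the effect.

```python
import numpy as np

# First eight values of predicted_proba[:, 1] from the output above
proba_of_1 = np.array([0.17617367, 0.13928637, 0.20581063, 0.41325351,
                       0.40952712, 0.9215294, 0.32527262, 0.62519437])

# Default decision rule: class 1 when the probability exceeds 0.5
default_classes = (proba_of_1 > 0.5).astype(int)

# A lower, hypothetical cut-off flags more people as excessively absent
lenient_classes = (proba_of_1 > 0.4).astype(int)

print(default_classes)   # [0 0 0 0 0 1 0 1]
print(lenient_classes)   # [0 0 0 1 1 1 0 1]
```

> Lowering the cut-off trades more detected cases of excessive absenteeism for more false alarms, so the choice depends on which error is more costly.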

### Save the model

In [35]:
# Import the relevant module
import pickle

In [36]:
# Pickle the model file
with open('model', 'wb') as file:
    pickle.dump(reg, file)

In [37]:
# Pickle the scaler file
with open('scaler','wb') as file:
    pickle.dump(absenteeism_scaler, file)
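
> Loading a pickled object back works symmetrically with pickle.load. The sketch below is self-contained and therefore uses a stand-in dictionary and a temporary directory; in this notebook the objects would be reg and absenteeism_scaler, and the files would be 'model' and 'scaler'.

```python
import os
import pickle
import tempfile

# Stand-in for a trained object (hypothetical; the notebook pickles `reg` instead)
stand_in_model = {'intercept': -1.688002, 'n_features': 11}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model')

    # Save the object in binary mode
    with open(path, 'wb') as file:
        pickle.dump(stand_in_model, file)

    # Load it back, also in binary mode
    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == stand_in_model)   # True
```

> When the custom scaler is unpickled in another script, the CustomScaler class definition must be importable there, and pickle files should only be loaded from trusted sources.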