# Predicting Absenteeism at Work (ML)
### Load Data
Pull in pre-processed data and load relevant libraries.

In [1]:
# load libraries
import pandas as pd
import numpy as np
import sklearn

# read in data
df = pd.read_csv('S:/Matt/Data Science/Udemy/Python, SQL and Tableau/Data/absent_out_data (my_file).csv')

# peek at data
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,7,1,289,36,33,239.554,30,0,2,1,4
1,0,0,0,0,7,1,118,13,50,239.554,31,0,1,0,0
2,0,0,0,1,7,2,179,51,38,239.554,31,0,0,0,2
3,1,0,0,0,7,3,279,5,39,239.554,24,0,2,0,4
4,0,0,0,1,7,3,289,36,33,239.554,30,0,2,1,2


### Logistic Regression
Overview:
* A logistic regression produces a binary output of 0 or 1, allowing us to perform a 2 class classification.
* Here, we will classify each individual/row of our data as either 'excessively absent' (1) or not (0).
* To do this, we will use the median of the target variable (absenteeism in hours) as our class decision boundary.
* Using the median splits our data roughly 50:50. For logistic regression, it is important that both output classes are relatively evenly distributed (a 60:40 split at the most drastic). If one class dominates, the model can score well simply by predicting that class most of the time, and we would have to over/under-sample to compensate.

In [2]:
# create target variable
targets = np.where(df['Absenteeism Time in Hours'] >
                   df['Absenteeism Time in Hours'].median(), 1, 0)

# add targets to df
df['Excessive Absenteeism'] = targets

# check distribution of targets
print(targets.sum() / targets.shape)

[0.45571429]
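
The split is about 46:54 rather than exactly 50:50 because the comparison is strictly "greater than", so rows equal to the median fall into class 0. A minimal sketch of that behaviour, using only the five absenteeism values shown by `df.head()` above (the variable names here are illustrative, not part of the notebook):

```python
import numpy as np

# absenteeism hours for the five rows shown by df.head() above
hours = np.array([4, 0, 2, 4, 2])

# same rule as the cell above: strictly greater than the median -> class 1
median = np.median(hours)
toy_targets = np.where(hours > median, 1, 0)

print(median)              # 2.0
print(toy_targets)         # [1 0 0 1 0]
print(toy_targets.mean())  # 0.4
```

The two rows equal to the median (2 hours) are labelled 0, which matches the 'Excessive Absenteeism' column shown later for the same rows.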


We can now remove our previous target variable, leaving us with the binary target only. We will also checkpoint our df to ensure we have a safe copy moving forwards.

In [3]:
# remove redundant col and store in new df
df_pp = df.drop(['Absenteeism Time in Hours', 'Day of the Week', 'Daily Work Load Average', 'Distance to Work'], axis=1)

# check new df is distinct from original (i.e. successful checkpoint)
df_pp is df

False

In [4]:
# peek at data
df_pp.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets,Excessive Absenteeism
0,0,0,0,1,7,289,33,30,0,2,1,1
1,0,0,0,0,7,118,50,31,0,1,0,0
2,0,0,0,1,7,179,38,31,0,0,0,0
3,1,0,0,0,7,279,39,24,0,2,0,1
4,0,0,0,1,7,289,33,30,0,2,1,0


### Standardizing Inputs
Overview:
* We have multiple input variables/features, each on a different scale (min, max, avg etc.). If we run them through our model as they are, features with larger numeric ranges can dominate the fit and the co-efficients cannot be compared with one another.
* Therefore, we must standardize our inputs (subtract mean, divide by std.) so that each feature is centred around 0 with a standard deviation of 1. Note that this does not confine values to between -1 and 1: a value of 2 simply means 2 standard deviations above the mean.
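
As a rough illustration of the standardization step, the sketch below applies "subtract mean, divide by std." by hand to the five ages visible in `df.head()`. It uses the population standard deviation (NumPy's default), and the variable names are chosen for illustration only:

```python
import numpy as np

# ages from the five rows shown by df.head() above
age = np.array([33.0, 50.0, 38.0, 39.0, 33.0])

# standardize: subtract the mean, divide by the (population) standard deviation
age_standardized = (age - age.mean()) / age.std()

print(abs(age_standardized.mean()) < 1e-12)    # True: centred on 0
print(np.isclose(age_standardized.std(), 1.0)) # True: unit standard deviation
print(age_standardized.max() > 1)              # True: values are not capped at 1
```

Because the statistics here come from five rows only, the numbers will differ from the scaled values produced on the full dataset.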

In [5]:
# extract features and targets
#X = df_pp.iloc[:, :-1]
#y = df_pp['Excessive Absenteeism']

# load libraries
#from sklearn.preprocessing import StandardScaler

# instantiate scaler (subtract mean, divide by std.)
#scaler = StandardScaler()

# fit scaler to data
#scaler.fit(X)

# apply scaler (centre each feature around 0 with ~1 std.)
#X_scaled = scaler.transform(X)

# peek at data
#X_scaled.shape

The above (commented out) code will scale all our features, which is an issue because some of our features are dummy variables and already scaled (i.e. can only be 0 or 1 due to binary dummies).

This isn't necessarily a problem, as we could still produce a good model after scaling these dummies, but we would lose the ability to say "for every unit change in our dummy variable, we see x change in the log odds" when we come to assess our co-efficients and odds ratios later on.

Therefore, the below code implements a custom scaler class where we can choose to scale only selected features (thus excluding our dummy variables) and retain the ability to quantify unit changes in our dummy variables.

In [5]:
# extract features and targets
X = df_pp.iloc[:, :-1]
y = df_pp['Excessive Absenteeism']

# load libraries
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

# create custom scaler class
class CustomScaler(BaseEstimator, TransformerMixin):
    
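    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        # keep every init argument as an attribute so get_params()/repr() work
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        # recent scikit-learn versions require keyword arguments here
        self.scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
        self.mean_ = None
        self.var_ = None
        
    def fit(self, X, y=None):
        # learn mean/std from the selected columns only
        self.scaler.fit(X[self.columns], y)
        # per-column statistics (population variance, matching StandardScaler)
        self.mean_ = X[self.columns].mean()
        self.var_ = X[self.columns].var(ddof=0)
        return self
    
    def transform(self, X, y=None, copy=None):
        init_col_order = X.columns
        # keep the original index so the concat below aligns row by row
        X_scaled = pd.DataFrame(self.scaler.transform(X[self.columns]), columns=self.columns, index=X.index)
        X_not_scaled = X.loc[:, ~X.columns.isin(self.columns)]
        return pd.concat([X_not_scaled, X_scaled], axis=1)[init_col_order]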
    
# extract cols to scale (i.e. all except dummy features)
cols_to_omit = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']

cols_to_scale = [x for x in X.columns.values if x not in cols_to_omit]

# instantiate custom scaler (only the listed columns will be scaled)
scaler = CustomScaler(cols_to_scale)

# fit scaler to data
scaler.fit(X)

# transform data
X_scaled = scaler.transform(X)

# view data after standardization
X_scaled.head()



Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Transportation Expense,Age,Body Mass Index,Education,Children,Pets
0,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487
1,0,0,0,0,0.030796,-1.574681,2.130803,1.002633,0,-0.01928,-0.58969
2,0,0,0,1,0.030796,-0.654143,0.24831,1.002633,0,-0.91903,-0.58969
3,1,0,0,0,0.030796,0.854936,0.405184,-0.643782,0,0.880469,-0.58969
4,0,0,0,1,0.030796,1.005844,-0.536062,0.767431,0,0.880469,0.268487


### Splitting Train and Test Sets
Here we will split our dataset into train (80%) and test (20%) sets so that we can validate our results on data the model has not seen during fitting, and so avoid leaking our validation data to the model. We will also seed the randomness of the split so that our results are directly comparable from run to run, instead of changing with a different train/test split each time.

In [6]:
# import libraries
from sklearn.model_selection import train_test_split

# split dataset (seed randomness for consistent results)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, train_size = 0.8, random_state = 20) # 80:20 train:test split

# check shape of vars
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(560, 11) (140, 11) (560,) (140,)
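
In the cells above, the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A stricter variant, sketched below on synthetic data (not the absenteeism data), splits first and learns the scaling statistics from the training rows only. The seed, sizes and variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(20)  # seed mirrors random_state above; any int works

# synthetic stand-in for one numeric feature (not the absenteeism data)
feature = rng.normal(loc=36.0, scale=6.0, size=100)

# seeded 80:20 shuffle split
idx = rng.permutation(feature.shape[0])
train_idx, test_idx = idx[:80], idx[80:]

# learn mean/std from the training rows only
train_mean = feature[train_idx].mean()
train_std = feature[train_idx].std()

# apply the same training statistics to both sets
train_scaled = (feature[train_idx] - train_mean) / train_std
test_scaled = (feature[test_idx] - train_mean) / train_std

print(train_scaled.shape, test_scaled.shape)  # (80,) (20,)
```

With the notebook's objects, the equivalent would be to call `train_test_split` on the unscaled `X` and then fit the scaler on `X_train` only.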


### Accuracy of Model
Overview:
* We will build our logistic regression model, fit it to our training set and then test the score of the model.
* We score it by comparing the predicted results to the known outputs and measuring the share that match (~78% on the training data here).

In [7]:
# load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# instantiate model
reg = LogisticRegression()

# fit model to data
reg.fit(X_train, y_train)

# check in-built score of model
print(reg.score(X_train, y_train))

# check manual score of model (proof that the above method is a shorthand for the below code)
y_pred = reg.predict(X_train)
print(np.sum(y_pred == y_train) / y_train.shape[0]) # % correct matches of total values (i.e. train accuracy %)

0.7767857142857143
0.7767857142857143


### Extracting Logistic Regression Function
Overview:
* The above code is a fairly high level view of our model.
* In order to fully understand our model, tweak it and feel confident with our analysis, we should dig into the details of the model function.
* Any regression (linear or logistic) optimizes a function in order to achieve the best fit to our training data. To do this, it assigns a weight to each of the input variables/features and adjusts these weights during optimization.
* Here, we will expose these weights (a.k.a. co-efficients) as well as the intercept (a.k.a. bias, the only other part of our model function) and summarize.

In [9]:
# view intercept
reg.intercept_

array([-1.60957471])

In [10]:
# view co-efficients
reg.coef_

array([[ 2.77151176,  0.93168817,  3.09210221,  0.8090592 ,  0.00781237,
         0.62505482, -0.17390339,  0.28829409, -0.24081615,  0.35753531,
        -0.27337422]])

In [11]:
# co-efficient labels (i.e. features)
X.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value',
       'Transportation Expense', 'Age', 'Body Mass Index', 'Education',
       'Children', 'Pets'], dtype=object)

In [8]:
# build a summary table of the above
summary_table = pd.DataFrame(columns = ['Feature Names'], data = X.columns.values)
summary_table['Co-Efficients'] = reg.coef_.transpose()
summary_table.index = summary_table.index + 1 # free up 0th index
summary_table.loc[0] = ['Intercept', reg.intercept_[0]] # fill with intercept (to appear at top)
summary_table = summary_table.sort_index()
summary_table

Unnamed: 0,Feature Names,Co-Efficients
0,Intercept,-1.609575
1,Reason_1,2.771512
2,Reason_2,0.931688
3,Reason_3,3.092102
4,Reason_4,0.809059
5,Month Value,0.007812
6,Transportation Expense,0.625055
7,Age,-0.173903
8,Body Mass Index,0.288294
9,Education,-0.240816
10,Children,0.357535
11,Pets,-0.273374
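
To make the "model function" concrete, here is a minimal sketch of how the intercept and co-efficients combine into a prediction for one row. It uses the values printed by `reg.intercept_` and `reg.coef_` above and the first row of `X_scaled` as displayed earlier (rounded to 6 decimal places, so the result is approximate). The variable names are illustrative:

```python
import numpy as np

# intercept and co-efficients as reported above (same order as X.columns)
intercept = -1.60957471
coefs = np.array([2.77151176, 0.93168817, 3.09210221, 0.8090592, 0.00781237,
                  0.62505482, -0.17390339, 0.28829409, -0.24081615, 0.35753531,
                  -0.27337422])

# first row of X_scaled as displayed earlier (rounded to 6 d.p.)
row = np.array([0, 0, 0, 1, 0.030796, 1.005844, -0.536062, 0.767431, 0,
                0.880469, 0.268487])

log_odds = intercept + coefs @ row      # linear part of the model
prob = 1.0 / (1.0 + np.exp(-log_odds))  # sigmoid maps log odds to a probability

print(round(log_odds, 3), round(prob, 3))
```

A probability above 0.5 means the model assigns this row to class 1 (excessively absent).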


### Analysing the Co-Efficients
Notes:
* Co-efficients are the weights given to each feature, therefore if the co-efficients are 0 (or close to it) then our model is saying that these features have little or no significance in the prediction of our targets.
* Similarly, the odds ratio (the exponential of a co-efficient) tells you what the odds of excessive absence are multiplied by for a one unit increase in that feature. Therefore, an odds ratio of 1 (or close to it) leaves the odds unchanged, and the feature similarly has little or no predictive power.
* Therefore, we can select and remove any features which have a co-efficient of 0 and/or an odds ratio of 1 (they always occur together as they are mathematically linked: exp(0) = 1).
* **NOTE:** because we created our custom scaler class earlier on, we can now effectively discuss unit changes in our dummy variables (e.g. for 1 unit change in 'Reason_3', the odds of excessive absence are about 22 times higher). Previously, we wouldn't have been able to do this as our dummy variables had been scaled to non-easily interpretable numbers (i.e. not simply binary values).

In [9]:
# calculate odds ratios (i.e. exponentials of the co-efficients, which are log odds)
summary_table['Odds Ratio'] = np.exp(summary_table['Co-Efficients'])
summary_table.sort_values('Odds Ratio', ascending=False)

Unnamed: 0,Feature Names,Co-Efficients,Odds Ratio
3,Reason_3,3.092102,22.023327
1,Reason_1,2.771512,15.982778
2,Reason_2,0.931688,2.538791
4,Reason_4,0.809059,2.245794
6,Transportation Expense,0.625055,1.868348
10,Children,0.357535,1.429801
8,Body Mass Index,0.288294,1.33415
5,Month Value,0.007812,1.007843
7,Age,-0.173903,0.840378
9,Education,-0.240816,0.785986


In the above table, our dummy features are the 4 reasons as well as education. Because we did not standardize our dummy values (they are simply 0 or 1), we can easily interpret the results. For example, reason 3 was poisoning: its co-efficient of about 3.09 is the increase in the log odds of excessive absence for someone who has been poisoned compared with someone who has not. The odds ratio expresses the same thing on the odds scale, so the odds of being excessively absent are about 22 times higher for someone who has been poisoned. So that's how you interpret both the co-efficient and odds ratio values.
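
An odds ratio multiplies the odds, not the probability, so "22 times the odds" is not the same as "22 times as likely". The sketch below converts the Reason_3 odds ratio into a change in probability. The 20% baseline probability is a made-up assumption for illustration, not a value from the data:

```python
# odds ratio for Reason_3, taken from the table above
odds_ratio = 22.023327

# hypothetical baseline probability of excessive absence (an assumption)
p_base = 0.20

odds_base = p_base / (1 - p_base)  # probability -> odds
odds_new = odds_base * odds_ratio  # the odds ratio multiplies the odds
p_new = odds_new / (1 + odds_new)  # odds -> probability

print(round(p_new, 3))           # 0.846
print(round(p_new / p_base, 2))  # 4.23
```

With this baseline, the probability rises about 4.2 times while the odds rise about 22 times. The gap between the two depends on the baseline chosen.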

From this, we can begin picking up reasons behind our key predictors. For example, the 4 reasons for sickness (i.e. poisoning, pregnancy, general illness etc.) are the biggest contributors towards excessive absence as you might expect. Also, people with kids or who spend more money on transport are also more likely to be absent for longer. You can also see that people with a higher BMI, who are more likely to be unfit/in poor shape, are also more likely to be ill longer.

In contrast, being older, being better educated and having more pets all contribute towards a lower likelihood of excessive absence. This is perhaps because older employees take better care of themselves and have fewer sick days; a better education could lead to a more important job and a greater sense of responsibility not to take sick days; and if you have many pets, it's likely you live with other people and therefore don't take sole responsibility for taking days off for the vet etc.

**NOTE:** standardization puts all features on a comparable scale, which helps the model fit and makes the sizes of the co-efficients comparable with one another. However, you also lose some interpretability of your data, because in the summary table above you aren't looking at a change in e.g. age in years, you're looking at a change in scaled age (where 1 unit represents one standard deviation of age rather than one year). Therefore, you can produce both tables (i.e. one scaled, one not) if you want to fully explain your results for readers.
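
One way to recover an interpretation in original units without refitting is to divide a standardized co-efficient by that feature's standard deviation. The sketch below does this for Age, recovering the standard deviation from two rows shown earlier (ages 33 and 50 with their scaled values). This is an approximation based on rounded table values, and the variable names are illustrative:

```python
import math

# two rows from the tables above: raw age and the matching standardized age
age_raw = (33, 50)
age_scaled = (-0.536062, 2.130803)

# standardization is linear, so two points are enough to recover std and mean
age_std = (age_raw[1] - age_raw[0]) / (age_scaled[1] - age_scaled[0])
age_mean = age_raw[0] - age_scaled[0] * age_std

# co-efficient for standardized Age from the summary table
coef_scaled = -0.173903

# change in log odds per year of age, and the matching odds ratio
coef_per_year = coef_scaled / age_std
odds_ratio_per_year = math.exp(coef_per_year)

print(round(age_std, 2), round(age_mean, 2))
print(round(coef_per_year, 4), round(odds_ratio_per_year, 4))
```

Under these assumptions, each extra year of age multiplies the odds of excessive absence by a factor slightly below 1.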

### Simplifying the Model (Backward Elimination)
Machine learning models require a trade off between simplicity (fewer features) and accuracy (enough valuable data/features to create a good predictor). From the above table we can see that certain features appear to have little or no impact on our predictive power, therefore we will remove them, ensure that the accuracy of our model is not significantly affected and proceed.

**NOTE:** I removed day of the week, daily work load average and distance to work retrospectively (i.e. in an earlier chunk of code, where `df_pp` is created), so you will not see those features being dropped step by step here; the results above already reflect their removal.
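
As a rough sketch of how removal candidates could be flagged, the snippet below lists the features whose co-efficient is close to 0, using the co-efficients reported above. The 0.05 cut-off is a hypothetical choice for illustration, not a rule from the notebook:

```python
# co-efficients reported above (intercept left out)
coefs = {
    'Reason_1': 2.771512, 'Reason_2': 0.931688, 'Reason_3': 3.092102,
    'Reason_4': 0.809059, 'Month Value': 0.007812,
    'Transportation Expense': 0.625055, 'Age': -0.173903,
    'Body Mass Index': 0.288294, 'Education': -0.240816,
    'Children': 0.357535, 'Pets': -0.273374,
}

# hypothetical cut-off: treat anything this close to 0 as a removal candidate
threshold = 0.05

candidates = [name for name, coef in coefs.items() if abs(coef) < threshold]
print(candidates)  # ['Month Value']
```

After dropping a candidate, the model should be refitted and its accuracy compared with the previous run, as described above.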

### Testing the Model
We will now use our test data to validate the accuracy of our model. Our model scores at around 74% which is encouraging as it's close to our train accuracy. We are fairly happy then that it is not over or under fitted excessively.

In [10]:
# assess model with test data
reg.score(X_test, y_test)

0.7357142857142858

Logistic regression works by calculating a probability that a sample should be in each class. For example, you might have a 75% probability of being class 0 (not excessively absent) and 25% of being class 1. The model then sets a threshold (normally 50%) and says if you're below that then you're assigned class 0, above and you're assigned class 1.

To extract these probabilities so we have full granularity of data, we can use the following method. The first column is the probability of class 0, whilst the second is probability of class 1 (i.e. excessive absence).

In [11]:
# view predicted probabilities
X_test_pred_prob = reg.predict_proba(X_test)

# extract probability of class 1/excessive absence only
X_test_pred_prob[:, 1][:5]

array([0.24533653, 0.39082409, 0.51670907, 0.24231768, 0.91642259])
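
The thresholding step described above can be sketched by hand on the five probabilities just printed. This is only an illustration of the 50% rule, with illustrative variable names:

```python
import numpy as np

# class 1 probabilities for the first five test rows, as printed above
prob_class_1 = np.array([0.24533653, 0.39082409, 0.51670907, 0.24231768, 0.91642259])

# probabilities of class 0 and class 1 sum to 1 for each row
prob_class_0 = 1 - prob_class_1

# apply the usual 50% threshold to turn probabilities into class labels
labels = (prob_class_1 > 0.5).astype(int)
print(labels)  # [0 0 1 0 1]
```

Raising or lowering the threshold trades off how many rows are flagged as excessively absent against how confident the model must be before flagging them.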

### Save our Model
Overview:
* Save our model (i.e. the 'reg' variable) to a file which can be read/loaded again later on its own.
* We will use pickle to package and store it for transport (you then use pickle to unpack it when you want to use it again).
* **NOTE:** Pickle is a way of serializing (i.e. converting Python objects to a byte stream) and deserializing (byte stream back to Python objects). You should only unpickle files you trust, because loading a pickle can execute arbitrary code. Also, it can be quite slow for large files, and a pickled model should be loaded with the same versions of Python and scikit-learn that were used to save it.

In [16]:
# load libraries
import pickle

# store model in file called 'model'
with open('model', 'wb') as file:
    pickle.dump(reg, file) # dump serializes the object to a byte stream (load is the inverse)

# store scaler in file called 'scaler'
with open('scaler', 'wb') as file:
    pickle.dump(scaler, file)
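
Unpacking works the same way in reverse. The sketch below round-trips a stand-in object through a temporary file so that it runs on its own and does not touch the real 'model' file. In the notebook, the object would be `reg` or `scaler`, and the path would be 'model' or 'scaler':

```python
import os
import pickle
import tempfile

# stand-in object; in the notebook this would be `reg` or `scaler`
stand_in_model = {'intercept': -1.60957471, 'n_features': 11}

# write to a temporary path so this sketch does not overwrite the real file
path = os.path.join(tempfile.mkdtemp(), 'model')

with open(path, 'wb') as file:
    pickle.dump(stand_in_model, file)  # object -> bytes on disk

with open(path, 'rb') as file:
    loaded_model = pickle.load(file)   # bytes on disk -> object

print(loaded_model == stand_in_model)  # True
```

When loading the real scaler, the `CustomScaler` class definition must be importable in the loading session, because pickle stores a reference to the class rather than its code.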