# Absenteeism at work 

Problem definition: predict how long an employee will be absent, given the reason for the absence and information about the employee.

## Supervised machine learning

**Goal:** predict the duration of absenteeism in hours. In the context of the problem we don't need the duration down to the minute, only a rough estimate: is this employee going to be absent for half a day, or for 2 days?

**Type of supervised learning:** I will use a classification model such as a Decision Tree to predict the range of absenteeism. A Decision Tree model handles categorical data well, and the dataset contains many categorical features.

A regression model would also work on this type of target, but I will keep it as a possible improvement in case the classification models fail.
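
To illustrate how a duration in hours can be turned into a range that a classifier can predict, here is a minimal sketch using `pd.cut`. The hours, bin edges and labels are hypothetical values chosen for the example; they are not the bins behind the `absenteeism_bins` column used in this notebook.

```python
import pandas as pd

# Hypothetical absence durations in hours (not taken from the dataset)
hours = pd.Series([0, 1, 2, 4, 8, 8, 16, 24, 40, 120])

# Assumed bin edges: up to half a day, up to one day, up to two days, longer
edges = [-1, 4, 8, 16, float("inf")]
labels = ["<=half_day", "<=1_day", "<=2_days", ">2_days"]

# Each interval is closed on the right, e.g. (4, 8] maps to "<=1_day"
absence_range = pd.cut(hours, bins=edges, labels=labels)
counts = absence_range.value_counts().sort_index()
print(counts)
```

The resulting categorical column can then be used as the target of a classifier instead of the raw number of hours.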

**Preprocessing and modelling tasks:** 
- [x] Drop id column which is irrelevant for modelling
- [x] Check the column types and ensure the categorical data are correctly identified
- [x] Check for multicollinearity and drop columns with high correlation
- [x] Check distribution and choose the right scaling method
- [x] Check the balance of the dataset and apply over/under sampling if needed
- [x] Create train/test samples 
- [x] Build Decision tree model 
- [x] Check performance of model using accuracy score, visualize confusion matrix using heatmap 
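
For the multicollinearity check in the list above, the variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below computes it with NumPy only, on synthetic data, and includes an intercept in each regression. As far as I know, the statsmodels `variance_inflation_factor` function does not add a constant by itself, so its values can differ from this version unless a constant column is added to the input. The function name `vif_with_intercept` is made up for this example.

```python
import numpy as np

def vif_with_intercept(X):
    """Return the VIF of each column of a 2-D array (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic data: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs.round(1))
```

With the usual rule of thumb of 10, the columns `a` and `c` would be flagged and one of them dropped, while `b` would be kept.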

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

In [None]:
df = pd.read_csv('../data/absenteeism_clusterized.csv')
print(df.shape)
df.head()

In [None]:
# Splitting dataset into features and target dataframes

X = df.drop('absenteeism_bins',axis=1).copy()
print(X.shape)

y = df.absenteeism_bins
print(y.shape)

__________________________
## Preprocessing

In [None]:
# Dropping the id column because it has no impact on predicting the time of absence
X.drop('id', axis=1, inplace= True)

In [None]:
# Convert categorical dtypes to object

X[['reason_for_absence',
   'month_of_absence',
   'day_of_the_week',
   'seasons','cluster']] = X[['reason_for_absence',
                              'month_of_absence',
                              'day_of_the_week',
                              'seasons','cluster']].astype(object)
X.dtypes

In [None]:
# Checking dtype of target 

y.dtype

In [None]:
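# Checking multicollinearity through data visualization
# (numeric_only=True skips the object columns, which newer pandas versions no longer drop silently)

sns.heatmap(abs(X.corr(numeric_only=True).round(2)), annot=True);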

In [None]:
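# Checking multicollinearity between numeric columns using VIF metrics

from IPython.display import display
from statsmodels.stats.outliers_influence import variance_inflation_factor as VIF

def drop_check_vif(column, X):
    """Drop `column` (if given), display the VIFs above 10 and return the reduced frame."""
    if column:
        X = X.drop(column, axis=1)
    # One VIF per remaining column, indexed by column name
    vifs = pd.Series([VIF(X.values, i) for i in range(X.shape[1])], index=X.columns)
    display(vifs[vifs > 10])
    return X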

In [None]:
# Creating a list of columns to drop for multicollinearity (numeric columns only)
col_drop = []
X_num = X.select_dtypes(include='number')  # public equivalent of the private _get_numeric_data()

In [None]:
col_drop.append('hit_target')

X_num = drop_check_vif(col_drop[-1], X_num)

# I dropped 1 column with VIF above 10: hit_target

In [None]:
# Dropping all the columns flagged for high multicollinearity

X.drop(columns=col_drop, inplace=True)
print(X.shape)

In [None]:
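# Checking the frequency distribution of categorical features
cat_features = X[X.columns[X.dtypes==object]]

fig, axs=plt.subplots(2,3, figsize=(17,8))

# countplot replaces the deprecated distplot and suits categorical data
for i in range(cat_features.shape[1]):
    ax=axs[i//3,i%3]
    sns.countplot(x=cat_features.iloc[:,i],ax=ax)

# Only 5 categorical features, so the last of the 6 axes is removed
fig.delaxes(ax=axs[1,2])
plt.show()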

In [None]:
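# Checking the frequency distribution of numeric features
# One axis per numeric column; squeeze=False keeps axs two-dimensional
fig, axs=plt.subplots(1,X_num.shape[1], figsize=(17,4), squeeze=False)

# histplot with kde=True replaces the deprecated distplot
for i in range(X_num.shape[1]):
    sns.histplot(X_num.iloc[:,i],kde=True,ax=axs[0,i])

plt.show()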

In [None]:
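# Standardize the numeric features (zero mean, unit variance)
# Note: the scaler is fitted on the full dataset, before the train/test split.
# Decision trees are insensitive to feature scale, so this has little effect here,
# but for scale-sensitive models the scaler should be fitted on the train sample only.

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_num_scaled = sc_X.fit_transform(X_num)

X[X_num.columns] = X_num_scaled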
X.head()

In [None]:
X.dtypes

In [None]:
# Checking balance of dataset
y.hist();

### Conclusion on preprocessing

The dataset is imbalanced with respect to the frequency distribution of the target classes, but this is expected because the low-frequency categories correspond to outliers.

The numeric features are not normally distributed. A Decision Tree splits on thresholds and does not assume normally distributed inputs, so we will keep the features as they are.
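
The claim that a Decision Tree does not depend on the shape of the feature distribution can be checked on synthetic data. In this sketch, which is an illustration and not part of the original analysis, the same tree is fitted on skewed integer features and on their `log1p` transform. Because the transform keeps the ordering of the values, both trees should partition the training sample in the same way.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic, integer-valued, right-skewed features (purely illustrative)
X_raw = rng.poisson(lam=5, size=(300, 3)).astype(float)
y_toy = (X_raw[:, 0] + X_raw[:, 1] > 10).astype(int)

# Strictly increasing transform: changes the distribution, not the ordering
X_log = np.log1p(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y_toy)
tree_log = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_log, y_toy)

# Compare the predictions of both trees on the training sample
same_predictions = np.array_equal(tree_raw.predict(X_raw), tree_log.predict(X_log))
print("Identical training predictions:", same_predictions)
```

Predictions on unseen data can still differ slightly, because split thresholds are placed halfway between training values and that midpoint moves under a non-linear transform.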

The categorical features seasons and day_of_the_week are roughly uniformly distributed, so they may not have much effect on the model.

We will build the model on the data as it is and consider possible improvements afterwards.

**Possible improvements:** 
- Apply over- and under-sampling methods to balance the dataset if the imbalance has too much effect
- Apply a Box-Cox transformation if normality would improve the performance of the model
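
As an illustration of the first improvement, here is a minimal sketch of random over-sampling written with pandas and NumPy only. The helper `random_oversample` and the toy data are hypothetical; a dedicated library such as imbalanced-learn would be an alternative. Over-sampling should be applied to the train sample only, so that duplicated rows do not end up in the test sample.

```python
import numpy as np
import pandas as pd

def random_oversample(X, y, random_state=0):
    """Resample every class with replacement up to the majority class size."""
    target_size = y.value_counts().max()
    parts = [
        y[y == label].sample(n=target_size, replace=True, random_state=random_state).index
        for label in y.unique()
    ]
    idx = np.concatenate(parts)
    # The same row labels are used for features and target to keep them aligned
    return X.loc[idx].reset_index(drop=True), y.loc[idx].reset_index(drop=True)

# Toy imbalanced data: 7, 2 and 1 samples per class
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series([0] * 7 + [1] * 2 + [2] * 1)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_bal.value_counts())
```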

_______________________
## Modelling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, precision_score

In [None]:
# Creating samples for train and test data
X_train, X_test, y_train, y_test =  train_test_split(X,y,test_size = 0.3, random_state=42, stratify=y)

print('Checking shape of samples')
print('X_train',X_train.shape)
print('X_test',X_test.shape,'\n')
print('Checking stratify of samples')
print('y_train\n',y_train.value_counts(normalize=True))
print('y_test\n', y_test.value_counts(normalize=True))

In [None]:
# Building model
dtree = DecisionTreeClassifier(random_state=8)
dtree = dtree.fit(X_train, y_train)

y_pred_dtree = dtree.predict(X_test)

# Checking performance of model using evaluation metrics
print("Accuracy score:",accuracy_score(y_test,y_pred_dtree))

# Checking overfitting of model by checking the accuracy of train sample
y_train_pred = dtree.predict(X_train)
print("Accuracy score for train sample:",accuracy_score(y_train,y_train_pred))

In [None]:
feat_importance_dt = pd.DataFrame(dtree.feature_importances_, index=X.columns)
feat_importance_dt

In [None]:
ac_score = accuracy_score(y_test,y_pred_dtree)
class_names = dtree.classes_  # class labels seen during training, in sorted order

sns.heatmap(confusion_matrix(y_test,y_pred_dtree, labels=class_names, normalize='true').round(2), annot=True, xticklabels=class_names, yticklabels=class_names)
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.title(f'Normalized Confusion Matrix for Decision Tree\n accuracy = {ac_score:.4f}', fontsize=14)
plt.savefig('../img/norm_confusion_matrix_decision_tree.png')
plt.show()

### Conclusion on Decision Tree

The accuracy score is not very good, and the gap between train and test accuracy shows that the model overfits the train sample, which is common for a Decision Tree.

The stratified class frequencies clearly show the imbalance of the target. We may want to correct it to see whether accuracy improves, keeping in mind that the imbalance comes from outliers that we may want to keep track of.
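
With an imbalanced target, plain accuracy can hide poor results on the rare classes. The sketch below uses made-up labels and NumPy only to compare accuracy with the mean per-class recall, often called balanced accuracy. The helper `per_class_recall` is written for this example.

```python
import numpy as np

def per_class_recall(y_true, y_pred):
    """Return {class: share of that class predicted correctly}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return {
        c.item(): float(np.mean(y_pred[y_true == c] == c))
        for c in np.unique(y_true)
    }

# Made-up example: 8 samples of class 0, 2 samples of class 1,
# and a model that always predicts the majority class
y_true_toy = np.array([0] * 8 + [1] * 2)
y_pred_toy = np.array([0] * 10)

accuracy = float(np.mean(y_true_toy == y_pred_toy))
recalls = per_class_recall(y_true_toy, y_pred_toy)
balanced_accuracy = float(np.mean(list(recalls.values())))

print("accuracy:", accuracy)
print("per-class recall:", recalls)
print("balanced accuracy:", balanced_accuracy)
```

The always-majority model looks acceptable on accuracy but never identifies the rare class, which the per-class recall makes visible.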

**Possible improvements:**
- Use cross-validation to get a more reliable estimate of the accuracy and of the overfitting
- Test Random Forest, which is less prone to overfitting
- Under/over sample the dataset

___________________________
## Use Cross validation for Decision Tree

In [None]:
from sklearn.model_selection import StratifiedKFold # Use of Stratified to keep imbalanced samples

list_of_accuracies=[]
skf = StratifiedKFold(n_splits=4, random_state=8, shuffle=True) 
dtree_cv = DecisionTreeClassifier(random_state=8)

# skf.split returns positional indices, so .iloc is used for both X and y
for train_idx, test_idx in skf.split(X,y):
    dtree_cv = dtree_cv.fit(X.iloc[train_idx,:],y.iloc[train_idx])
    list_of_accuracies.append(accuracy_score(y.iloc[test_idx],dtree_cv.predict(X.iloc[test_idx,:])))
    
print(np.mean(list_of_accuracies))

In [None]:
from sklearn.model_selection import cross_val_score

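dtree_cv = DecisionTreeClassifier(random_state=8)
# With an integer cv and a classifier, cross_val_score uses StratifiedKFold without shuffling
cv_scores = cross_val_score(dtree_cv, X, y, cv=4)
print(cv_scores)
print(np.mean(cv_scores))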

### Conclusion on cross-validation

Cross-validation does not noticeably improve the accuracy of the model.

So we should try Random Forest to check whether accuracy improves once overfitting is better controlled, although we may want to keep cross-validation for the evaluation.
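
The two cross-validation cells above do not use the same folds: the manual loop shuffles the data, while `cross_val_score` with `cv=4` does not. The sketch below, on synthetic data from `make_classification` (an assumption made for this example), shows that passing the `StratifiedKFold` object as `cv` should make both approaches evaluate the same folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

skf_demo = StratifiedKFold(n_splits=4, shuffle=True, random_state=8)

# cross_val_score clones the estimator and refits it on every fold
scores = cross_val_score(
    DecisionTreeClassifier(random_state=8), X_demo, y_demo,
    cv=skf_demo, scoring="accuracy",
)

# Manual loop over the same splitter, with a fresh tree per fold
manual_scores = []
for train_idx, test_idx in skf_demo.split(X_demo, y_demo):
    tree = DecisionTreeClassifier(random_state=8)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(tree.score(X_demo[test_idx], y_demo[test_idx]))

print(scores, scores.mean())
print(manual_scores)
```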

_______________
## Random Forest 

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
list_of_accuracies=[]
skf = StratifiedKFold(n_splits=4, random_state=8, shuffle=True) 
randomf = RandomForestClassifier(random_state=8)

# skf.split returns positional indices, so .iloc is used for both X and y
for train_idx, test_idx in skf.split(X,y):
    randomf = randomf.fit(X.iloc[train_idx,:],y.iloc[train_idx])
    list_of_accuracies.append(accuracy_score(y.iloc[test_idx],randomf.predict(X.iloc[test_idx,:])))
    
print("Average accuracy:",np.mean(list_of_accuracies))
list_of_accuracies

In [None]:
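# Refit on the train sample only: the forest left over from the last CV fold
# has seen part of X_test, which would inflate the test accuracy
randomf = RandomForestClassifier(random_state=8).fit(X_train, y_train)
y_pred_rf = randomf.predict(X_test)
ac_score = accuracy_score(y_test,y_pred_rf)
class_names = randomf.classes_  # class labels seen during training, in sorted order

sns.heatmap(confusion_matrix(y_test,y_pred_rf, labels=class_names, normalize='true').round(2), annot=True, xticklabels=class_names, yticklabels=class_names)
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.title(f'Normalized Confusion Matrix for Random Forest\n accuracy = {ac_score:.4f}', fontsize=14)
plt.savefig('../img/norm_confusion_matrix_random_forest.png')
plt.show()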

In [None]:
feat_importance_rf = pd.DataFrame(randomf.feature_importances_,index=X.columns)

In [None]:
feat_importance = pd.merge(feat_importance_dt,feat_importance_rf, left_index=True, right_index=True, suffixes=('_dt','_rf'))
feat_importance


### Conclusion on Random Forest

Accuracy is much better with Random Forest, which suggests that it reduces the overfitting seen with the single Decision Tree and that the imbalance of the dataset did not affect the results too much.

**Possible improvements:**
- Under/over sample the dataset to check whether accuracy improves when the outlier classes are no longer in low numbers
- Test other tree-based ensemble models such as XGBoost, CatBoost and AdaBoost
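
A possible starting point for the last improvement is to compare several tree-based ensembles under the same stratified cross-validation. The sketch below uses only models shipped with scikit-learn (`AdaBoostClassifier` and `GradientBoostingClassifier` as a stand-in for XGBoost and CatBoost, which are separate packages) and synthetic data, so the scores say nothing about the absenteeism dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data standing in for the real features and target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=0,
)

candidates = {
    "random_forest": RandomForestClassifier(random_state=8),
    "adaboost": AdaBoostClassifier(random_state=8),
    "gradient_boosting": GradientBoostingClassifier(random_state=8),
}

# Same folds for every model so that the mean accuracies are comparable
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=8)
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```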