# Absenteeism at work 

Problem definition: predict how long an employee will be absent, given information about the reason for the absence and the employee's personal characteristics.

## Unsupervised Learning 

**Goal:** Create clusters from the columns that describe employees' personal characteristics.

**Methods:** I will start with KMeans and may then use a second method, such as hierarchical clustering or DBSCAN, to compare the output of two different methods.

**Unsupervised Learning tasks:** 
- [x] Define the columns to use for clustering
- [x] Check the distribution of the data and the relations between features (apply scaling if necessary)
- [x] Build a model using KMeans
- [x] Plot the clusters to check their relevance
- [x] Use the Silhouette coefficient and the Davies-Bouldin score to evaluate the model
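To make the two evaluation metrics easier to read later on, here is a minimal sketch on synthetic data (not the absenteeism dataset). It compares a sensible clustering with deliberately random labels: the Silhouette coefficient should be higher for the better clustering, while the Davies-Bouldin score should be lower. The blob centers and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic data: three well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

# A sensible clustering found by KMeans
labels_good = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Random labels, used as a deliberately poor clustering
rng = np.random.default_rng(0)
labels_random = rng.integers(0, 3, size=len(X))

sil_good = silhouette_score(X, labels_good)
sil_random = silhouette_score(X, labels_random)
db_good = davies_bouldin_score(X, labels_good)
db_random = davies_bouldin_score(X, labels_random)

print(f'Silhouette     - KMeans: {sil_good:.2f} | random: {sil_random:.2f}')
print(f'Davies-Bouldin - KMeans: {db_good:.2f} | random: {db_random:.2f}')
```

The Silhouette coefficient ranges from -1 to 1 (higher means better separated clusters), and the Davies-Bouldin score is non-negative (lower is better).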

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

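# Use the full option name: the short form 'max_columns' is ambiguous in recent pandas versions
pd.set_option('display.max_columns', 25)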

In [None]:
df = pd.read_csv('../data/absenteeism_clean.csv')
print(df.shape)
df.head()

In [None]:
df.id.nunique()

In [None]:
cluster_cols = ['education','children','social_drinker','social_smoker','pet','weight',
                'height','body_mass_index','age','distance_from_residence_to_work','transportation_expense']

In [None]:
sns.pairplot(df[cluster_cols].drop(columns=['social_drinker','social_smoker']));

In [None]:
df[cluster_cols]

## KMeans model

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

In [None]:
df_cluster = df.copy()

In [None]:
# Building KMeans model
model = KMeans(n_clusters=3, random_state=42)  # fixed seed so the cluster numbering is reproducible
labels = model.fit_predict(df[cluster_cols])

# Adding the label to the dataframe
df_cluster['kmeans_label'] = labels

# Number of unique id per cluster_label
df_cluster.groupby('id').kmeans_label.min().value_counts()

In [None]:
pd.crosstab(df_cluster.id,df_cluster.kmeans_label)

In [None]:
# Checking the relation of features and clusters

sns.pairplot(df_cluster[cluster_cols+['kmeans_label']],hue='kmeans_label');

In [None]:
# Checking the quality of the clusters using evaluation metrics

print('KMeans clusters')
print("DB score", davies_bouldin_score(df_cluster[cluster_cols], labels))
print('Avg Silhouette coef:', silhouette_score(df_cluster[cluster_cols], labels))

### Conclusion on KMeans
Here we can see that this first clustering is driven mostly by transportation expense, because it is the column
with the largest scale.

**Possible improvement:**
- Apply a scaling method so that all features carry a comparable weight

_________________________
## KMeans model on scaled data

- We will use min-max scaling so that features with larger values do not dominate the distance computation
- Build the model and compare how the clusters differ
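As a minimal sketch of what min-max scaling does, each value $x$ in a column is mapped to $(x - x_{min}) / (x_{max} - x_{min})$, so every scaled column ranges from 0 to 1. The toy values below are hypothetical and not taken from the dataset; the example also checks that the manual formula matches scikit-learn's `MinMaxScaler`, which could be used as an alternative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values (not taken from the absenteeism dataset)
toy = pd.DataFrame({
    'age': [28, 33, 41, 50],
    'transportation_expense': [118, 179, 260, 388],
})

# Manual min-max scaling, applied column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# Same transformation with scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy),
                      columns=toy.columns, index=toy.index)

print(manual.round(2))
print('Same result as MinMaxScaler:', np.allclose(manual, scaled))
```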

In [None]:
df_scaled = df.copy()

In [None]:
minmax_scale = ['weight','height','body_mass_index','age','distance_from_residence_to_work','transportation_expense']

# Min-max scale each listed column to the [0, 1] range
for col in minmax_scale:
    col_min, col_max = df_scaled[col].min(), df_scaled[col].max()
    df_scaled[col] = (df_scaled[col] - col_min) / (col_max - col_min)

df_scaled[cluster_cols].head()    
    

In [None]:
# Building the model on scaled dataset
model_scaled = KMeans(n_clusters=3, random_state=42)  # fixed seed so the cluster numbering is reproducible
labels_scaled = model_scaled.fit_predict(df_scaled[cluster_cols])

# Adding the labels to dataframe
df_cluster['kmeans_label_scaled'] = labels_scaled

# Number of unique id per cluster_label
df_cluster.groupby('id').kmeans_label_scaled.min().value_counts()

In [None]:
# Checking the relevance of the clusters through visualization - WARNING! Takes time to run.

sns.pairplot(df_cluster[cluster_cols+['kmeans_label_scaled']], kind="scatter", hue='kmeans_label_scaled', 
             markers=["o", "s", "D"], palette="Set2");


In [None]:
fig, axs=plt.subplots(4,3, figsize=(25,30))

for i in range(df_cluster[cluster_cols].shape[1]):
    tab = pd.crosstab(df_cluster[cluster_cols[i]],df_cluster['kmeans_label_scaled'], normalize='columns').round(2)
    ax = axs[i//3,i%3]
    sns.heatmap(tab,annot=True,ax=ax)

fig.delaxes(axs[3,2])
plt.show()    
    

In [None]:
# Checking the quality of the clusters using evaluation metrics

print('KMeans performance (scaled data)')
print("DB score", davies_bouldin_score(df_scaled[cluster_cols], labels_scaled))
print('Avg Silhouette coef:', silhouette_score(df_scaled[cluster_cols], labels_scaled))

In [None]:
# Scoring the labels obtained on scaled data against the non-scaled features - is it relevant?

print('KMeans performance (scaled data) - comparison with non-scaled data')
print("DB score", davies_bouldin_score(df_cluster[cluster_cols], labels_scaled))
print('Avg Silhouette coef:', silhouette_score(df_cluster[cluster_cols], labels_scaled))

### Conclusions on having scaled data for KMeans

Judging by the evaluation metrics, the clusters overlap, which is not ideal.

However, when checking the clusters visually, we can see that they differ mostly on:
- number of children (cluster 0 mostly has no children; cluster 1 has 1 or 2 children; cluster 2 has 2 or more children)
- number of pets (cluster 0 has no pets; cluster 1 mostly has 2 or more pets; cluster 2 has between 0 and 1)
- age (the majority of people in cluster 2 are around 28; cluster 1 is mostly older than 37; cluster 0 is mostly between 31 and 36)
- social drinker (cluster 0 contains both drinkers and non-drinkers; cluster 1 is mostly not social drinkers; cluster 2 is mostly social drinkers)

These differentiation criteria seem relevant to me.

**Possible improvements:**
- Iterating on the number of clusters
- Test using another method of clustering
- Add train/test split to ensure model is not overfitted
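As a rough sketch of the "another method" idea, the snippet below runs hierarchical (agglomerative) clustering next to KMeans and measures how much the two partitions agree with the adjusted Rand index. It uses synthetic blobs rather than the absenteeism data, and the Ward linkage and the number of clusters are assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data: three well-separated blobs (illustrative only)
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.0, random_state=0)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# The adjusted Rand index compares two partitions regardless of how clusters are numbered
agreement = adjusted_rand_score(kmeans_labels, hier_labels)

print('KMeans silhouette:      ', round(silhouette_score(X, kmeans_labels), 2))
print('Hierarchical silhouette:', round(silhouette_score(X, hier_labels), 2))
print('Agreement (adjusted Rand index):', round(agreement, 2))
```

An agreement close to 1 means both methods found essentially the same groups; a low value would suggest the clusters depend strongly on the method.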

____________________
## KMeans iteration on the number of clusters

Assumptions: 
- We should iterate on the number of clusters while keeping in mind that we have only 36 unique ids, and therefore at most 36 combinations of characteristics.
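One way to take this assumption into account is to cluster one row per employee instead of one row per absence, then map the label back onto the absence records. The sketch below uses a hypothetical toy dataframe and assumes that the personal characteristics are constant for a given `id`, which should be checked on the real data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy records: several absence rows per employee id
toy = pd.DataFrame({
    'id':       [1, 1, 1, 2, 2, 3, 4, 4],
    'age':      [28, 28, 28, 45, 45, 30, 47, 47],
    'children': [0, 0, 0, 2, 2, 0, 3, 3],
})

# Keep one row per employee (assumes characteristics do not change per id)
employees = toy.drop_duplicates(subset='id').set_index('id')[['age', 'children']]

# Cluster at the employee level
model = KMeans(n_clusters=2, n_init=10, random_state=0)
employee_labels = pd.Series(model.fit_predict(employees), index=employees.index)

# Map each employee's cluster back onto the absence records
toy['cluster'] = toy['id'].map(employee_labels)
print(toy)
```

With this approach every absence of the same employee gets the same cluster by construction, and employees with many absences do not weigh more than others when the centroids are computed.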

In [None]:
range_n_cluster = list(range(2,15))

df_it = df.copy()

for n_cluster in range_n_cluster:
    # Building the model
    model = KMeans(n_clusters=n_cluster)
    labels = model.fit_predict(df_scaled[cluster_cols])
    
    # Checking evaluation metrics
    print(f'n_cluster = {n_cluster}')
    print("DB score", davies_bouldin_score(df_scaled[cluster_cols], labels))
    print('Avg Silhouette coef:', silhouette_score(df_scaled[cluster_cols], labels))
    
    # Adding the cluster to a new dataframe to check the frequency per id
    df_it[f'n_cluster_{n_cluster}'] = labels
    
    # Number of unique id per cluster_label
    print(df_it.groupby('id')[f'n_cluster_{n_cluster}'].min().value_counts(),'\n')

In [None]:
df_it.head()

### Conclusions on KMeans iterations

The more clusters we add, the better the evaluation metrics get, but does it make sense to have so many clusters?

As we have a limited number of ids (people), it is probably more relevant to keep a low number of clusters.

**Possible improvements:**
- Try DBSCAN, where the number of clusters is not given in advance

_________________________
## DBSCAN

Assumptions: 
- Because DBSCAN is based on the distance between data points, we will keep using the scaled dataframe for this model
- We want to find the ideal number of clusters, so we use this model, where that parameter does not have to be given
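As a minimal sketch of how DBSCAN behaves, the snippet below runs it on synthetic data (two dense blobs plus one far-away point, not the absenteeism dataset). It shows two things worth keeping in mind: noise points receive the label `-1`, so they should be excluded when counting clusters, and the sorted distance to the k-th nearest neighbour is a common heuristic to pick `eps`. All values here are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic data: two dense blobs plus one isolated point (illustrative only)
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5]],
                  cluster_std=0.3, random_state=0)
X = np.vstack([X, [[20.0, 20.0]]])

min_samples = 4

# Distance from each point to its farthest neighbour among the min_samples nearest
# (the point itself is returned as its own first neighbour, at distance 0)
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_dist = np.sort(distances[:, -1])
print('Largest k-distances:', k_dist[-3:].round(2))

labels = DBSCAN(eps=1.0, min_samples=min_samples).fit_predict(X)

# Noise points are labelled -1 and are not a cluster
n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print('Clusters:', n_clusters, '| noise points:', n_noise)
```

Plotting `k_dist` and looking for the point where the curve bends sharply gives a candidate `eps`; points beyond the bend tend to end up as noise.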

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
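for eps in np.arange(0.5, 4, 0.5):
    for min_samples in range(2, 6):

        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(df_scaled[cluster_cols])

        # DBSCAN marks noise points with the label -1, which is not a cluster
        n_clusters = len(set(labels) - {-1})
        print(f'DBSCAN, eps={eps} & min_samples={min_samples}')
        print('Count clusters', n_clusters, '| noise points', (labels == -1).sum())

        # Both metrics raise a ValueError when there are fewer than 2 distinct labels
        if len(set(labels)) < 2:
            print('Metrics skipped: fewer than 2 distinct labels\n')
            continue

        # Checking the quality of the clusters using evaluation metrics
        print("DB score", davies_bouldin_score(df_scaled[cluster_cols], labels))
        print('Avg Silhouette coef:', silhouette_score(df_scaled[cluster_cols], labels), '\n')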

In [None]:
clustering = DBSCAN(eps=2, min_samples=2)
labels = clustering.fit_predict(df_scaled[cluster_cols])

# Checking the quality of the clusters using evaluation metrics
print('DBSCAN model')
print("DB score", davies_bouldin_score(df_scaled[cluster_cols], labels))
print('Avg Silhouette coef:', silhouette_score(df_scaled[cluster_cols], labels))

# Adding the labels to dataframe
df_cluster['dbscan_label'] = labels

# Number of unique id per cluster_label
df_cluster.groupby('id').dbscan_label.min().value_counts()

### Conclusions on DBSCAN

We can see here that the more clusters we have, the better the evaluation metrics are (the Davies-Bouldin score is low and the Silhouette coefficient is high), but having so many clusters does not seem relevant.

Conversely, the output with a low number of clusters (eps of 2 or 2.5) is not very useful either, because the clusters are imbalanced: most of the ids fall in one cluster and only a few ids fall in the others.

In the end, I would say that DBSCAN is not relevant for our dataset.

____________________
## Saving final dataframe with clusters

In [None]:
# Saving the new dataframe with the cluster

# Drop the raw features and the unused labels, and keep the scaled KMeans label as 'cluster'
df_final = (
    df_cluster
    .drop(columns=cluster_cols + ['kmeans_label', 'dbscan_label'])
    .rename(columns={'kmeans_label_scaled': 'cluster'})
)
print(df_final.shape)

df_final.to_csv('../data/absenteeism_clusterized.csv', index=False)

df_final.head()