# Data mining project  
#### Topic: Analysis and prediction of absenteeism at work.
###### BUT SD 23-24

Algassimou DIALLO

# Outline
1. [Introduction](#introduction)
2. [Data preprocessing](#datapreprocessing)
3. [Exploratory Data Analysis](#EDA)
4. [Principal component analysis](#PCA)
5. [Models](#models)
6. [Conclusion](#conclusion)


### Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve

<a name="datapreprocessing"></a>
# Data preprocessing

In [2]:
df = pd.read_csv("Absenteeism_at_work.csv", sep =";")
df.head()

- The dataset has 740 observations of 20 variables.
- There are no missing values (NA) in any variable.
- The variable types look correct.
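
The three observations above can be checked in code rather than read off the output. A minimal sketch, assuming a small hypothetical frame `toy` stands in for `df` (the column names come from the dataset, the values are made up):

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Reason for absence": [0, 23, 28, 13],
    "Month of absence": [0, 7, 3, 10],
    "Absenteeism time in hours": [0, 4, 2, 8],
})

n_rows, n_cols = toy.shape                # observations and variables
n_missing = int(toy.isna().sum().sum())   # total number of NA cells
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in toy.dtypes)

print(n_rows, n_cols, n_missing, all_numeric)
```

On the real data the same three expressions would be applied to `df` directly.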

In [3]:
df.info()

In [4]:
df.describe()

The minimum is 0 for "Reason for absence" and "Month of absence", which is odd given the dataset documentation.

 - For "Reason for absence", these zero values correspond to a "non-absence" (the individual was not absent): every such row also has an absence time of zero. This information matters and these rows are not errors, so they must be kept.

In [5]:
df[df["Reason for absence"] ==0]

- For "Month of absence", only 3 rows have the value 0, and all of them have an absence time of zero.
- We therefore replace these values with the mean month of the preceding group (reason = 0 and absence time = 0).

In [6]:
df[df["Month of absence"]==0]

In [7]:
df[df["Reason for absence"] ==0].mean()


 - We therefore replace them with July (7).

In [8]:
df["Month of absence"] = df["Month of absence"].replace([0],7)
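
The replacement value can also be derived in code instead of being typed by hand. A minimal sketch, assuming a hypothetical frame `toy` with made-up values, and assuming the mean is taken only over "non-absence" rows whose month is known (an assumption of this sketch; the cell above averages all rows with reason 0, zero months included):

```python
import pandas as pd

# Hypothetical data: four "non-absence" rows, one of them with an unknown month (0).
toy = pd.DataFrame({
    "Reason for absence": [0, 0, 0, 0, 23],
    "Month of absence": [0, 6, 7, 9, 3],
    "Absenteeism time in hours": [0, 0, 0, 0, 4],
})

# "Non-absence" rows with a valid month.
known = toy[(toy["Reason for absence"] == 0) & (toy["Month of absence"] != 0)]
fill_month = int(round(known["Month of absence"].mean()))

# Replace the unknown months with the rounded mean month.
toy["Month of absence"] = toy["Month of absence"].replace(0, fill_month)
print(fill_month, toy["Month of absence"].tolist())
```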


<a name="EDA"></a>
# Exploratory Data Analysis 

- Reason for absence & Absenteeism time

The most frequent reason for absence is reason 23 ("medical consultation"), followed by 28 ("dental consultation") and 13 ("Diseases of the musculoskeletal system and connective tissue").

In [9]:
df.groupby('Reason for absence')['Absenteeism time in hours'].mean()
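
The ranking stated above is about how often each reason occurs, which is a count rather than a mean. A minimal sketch of that count, assuming a hypothetical column of made-up reason codes and a partial, illustrative `labels` mapping:

```python
import pandas as pd

# Hypothetical reason codes; the frequencies are made up.
toy = pd.DataFrame({"Reason for absence": [23, 23, 23, 23, 28, 28, 28, 13, 13, 0]})

# Partial code-to-label mapping, for illustration only.
labels = {
    0: "no reason",
    13: "Diseases of the musculoskeletal system and connective tissue",
    23: "medical consultation",
    28: "dental consultation",
}

freq = toy["Reason for absence"].value_counts()   # sorted, most frequent first
top3 = [(code, labels[code], int(n)) for code, n in freq.head(3).items()]
print(top3)
```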

#### Plot to better visualize these data:

In [10]:
top_raison=df.groupby('Reason for absence')['Absenteeism time in hours'].count()
top_raison=np.array(top_raison) 
fig= plt.subplots(figsize=(10,10))

table=["no reason (0) ",'Certain infectious and parasitic diseases', 'Neoplasms', 'blood-forming organs and involving the immune mechanism', 'Endocrine, nutritional and metabolic diseases', 'Mental and behavioural disorders', 'Diseases of the nervous system', 'Diseases of the eye and adnexa', 'Diseases of the ear and mastoid process', 'Diseases of the circulatory system', 'Diseases of the respiratory system', 'Diseases of the digestive system', 'Diseases of the skin and subcutaneous tissue', 'Diseases of the musculoskeletal system and connective tissue', 'Diseases of the genitourinary system','Pregnancy, childbirth and the puerperium','Certain conditions originating in the perinatal period','Congenital malformations, deformations and chromosomal abnormalities', 'Symptoms, signs and abnormal clinical and laboratory findings', 'Injury, poisoning and certain other consequences of external causes', 'Factors influencing health status and contact with health services.','patient follow-up','medical consultation','blood donation','laboratory examination','unjustified absence','physiotherapy','dental consultation']

plt.barh(y=np.arange(len(top_raison)),width=top_raison,label='No. of people',color='lightblue')
plt.yticks(np.arange(len(top_raison)),table,rotation=0)

plt.ylabel('Reason of Absence')
plt.xlabel('Count of people')
plt.savefig('sample.jpg')

In [11]:
top_raison=df.groupby('Reason for absence')['Absenteeism time in hours'].mean()
top_raison=np.array(top_raison) 
fig= plt.subplots(figsize=(10,10))

table=["no reason (0) ",'Certain infectious and parasitic diseases', 'Neoplasms', 'blood-forming organs and involving the immune mechanism', 'Endocrine, nutritional and metabolic diseases', 'Mental and behavioural disorders', 'Diseases of the nervous system', 'Diseases of the eye and adnexa', 'Diseases of the ear and mastoid process', 'Diseases of the circulatory system', 'Diseases of the respiratory system', 'Diseases of the digestive system', 'Diseases of the skin and subcutaneous tissue', 'Diseases of the musculoskeletal system and connective tissue', 'Diseases of the genitourinary system','Pregnancy, childbirth and the puerperium','Certain conditions originating in the perinatal period','Congenital malformations, deformations and chromosomal abnormalities', 'Symptoms, signs and abnormal clinical and laboratory findings', 'Injury, poisoning and certain other consequences of external causes', 'Factors influencing health status and contact with health services.','patient follow-up','medical consultation','blood donation','laboratory examination','unjustified absence','physiotherapy','dental consultation']

plt.barh(y=np.arange(len(top_raison)),width=top_raison,label='Mean hours',color='lightblue')
plt.yticks(np.arange(len(top_raison)),table,rotation=0)

plt.ylabel('Reason of Absence')
plt.xlabel("Mean absenteeism time")

- Month of absence & Absenteeism

March has the most absences, followed by February and October.
This is plausible because March is festival season in Brazil (Rio festival).

In [12]:
df.groupby('Month of absence')['Absenteeism time in hours'].count()

#### Plot to better visualize the variation by month.

In [13]:
mois_absence = np.array(df.groupby('Month of absence')['Absenteeism time in hours'].mean())


plt.bar(x=np.arange(len(mois_absence)),height=mois_absence ,color='lightblue')
plt.xlabel('Month')
plt.xticks(np.arange(len(mois_absence)),['JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])
plt.title("Mean absence time by month")
plt.show()

- Day of the week & Absenteeism time

No particular day stands out. The number of absences is roughly the same for every day of the week, but the mean absence time is higher on Monday and very low on Thursday and Friday.

In [14]:
df.groupby('Day of the week')['Absenteeism time in hours'].count()

In [15]:
semaines_absence = df.groupby('Day of the week')['Absenteeism time in hours'].mean()
semaines_absence=np.array(semaines_absence)

plt.bar(x=np.arange(len(semaines_absence)),height=semaines_absence,color='lightblue')

plt.xlabel('Days of week')
plt.xticks(np.arange(len(semaines_absence)),['MON','TUE','WED','THUR','FRI'])
plt.title("Mean absence time by day of the week")
plt.show()
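
The count and the mean discussed for this variable can be computed in a single pass. A minimal sketch, assuming a hypothetical frame `toy` with made-up hours; the dataset codes Monday as 2 and Friday as 6:

```python
import pandas as pd

# Hypothetical rows: two absences each on Monday (2), Tuesday (3) and Friday (6).
toy = pd.DataFrame({
    "Day of the week": [2, 2, 3, 3, 6, 6],
    "Absenteeism time in hours": [8, 16, 2, 4, 1, 3],
})

# One row per day, with the number of absences and the mean duration side by side.
summary = toy.groupby("Day of the week")["Absenteeism time in hours"].agg(["count", "mean"])
print(summary)
```

Equal counts with very different means is the pattern described above for Monday versus Thursday and Friday.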

- Son & Absenteeism time

People with 2 or 3 children have the longest mean absence time.

In [16]:
df.groupby('Son')['Absenteeism time in hours'].mean()


In [17]:
absence_child = np.array(df.groupby('Son')['Absenteeism time in hours'].mean())

plt.bar(x=np.arange(len(absence_child )),height=absence_child  ,color='lightblue')
plt.xlabel("Number of children")
plt.ylabel("Mean absence time")
plt.title("Mean absence time by number of children")
plt.show()

- Social drinker & Absenteeism time

Social drinkers have a longer mean absence time, but the difference is only about 1 hour.

In [18]:
df.groupby('Social drinker')['Absenteeism time in hours'].mean()

- Social smoker & Absenteeism time

Almost no difference.

In [19]:
df.groupby('Social smoker')['Absenteeism time in hours'].mean()

In [20]:
df.info()

- Age & Absenteeism time

No particular pattern. There is a single outlier, and individuals in their thirties have the highest mean absence time.

In [21]:
absence_age = df.groupby('Age')['Absenteeism time in hours'].mean()

absence_age.plot(kind='bar', figsize=(10,5), color="lightblue" )
plt.xlabel('Age')
plt.ylabel('Absenteeism time in hours')
plt.title("Mean absence time by age")



#### Another way to visualize the data

In [22]:
def group_age(age):
    # Map an age to its decade label, e.g. 37 -> "30-39"; negative or missing ages give NaN.
    if age >= 0:
        lower = int(age // 10) * 10
        return f"{lower}-{lower + 9}"
    return np.nan


In [23]:
df1 = df.copy()
df1["age_range"] = df1["Age"].apply(group_age)
age_order=df1['age_range'].unique()
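
The same decade grouping can be obtained without a custom function. A minimal sketch using `pd.cut`, assuming made-up ages and a hypothetical range of 20 to 59 years; because the result is an ordered categorical, the groups keep their natural order in plots:

```python
import pandas as pd

ages = pd.Series([27, 30, 39, 40, 58])             # made-up ages
edges = list(range(20, 70, 10))                    # 20, 30, 40, 50, 60
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]   # "20-29", ..., "50-59"

# right=False makes each bin closed on the left: [20, 30), [30, 40), ...
age_range = pd.cut(ages, bins=edges, right=False, labels=labels)
print(age_range.tolist())
```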




In [24]:

sns.barplot(x='age_range', y='Absenteeism time in hours', data=df1, color="lightblue",
            order=sorted(df1['age_range'].unique()))  # show the decades in ascending order
plt.xlabel('Age Range')
plt.ylabel('Absenteeism time in hours')
plt.title("Mean absence time by age")
plt.show() 



- BMI and Absenteeism time in hours

No large differences; overweight individuals have the highest mean absence time.

In [25]:
imc = [0, 18.4, 24.9, 29.9, 39.9]
df1['bmi'] = pd.cut(df['Body mass index'], bins=imc, labels=[f'IMC {i}' for i in range(1, len(imc))])


In [26]:
bmi=df1.groupby('bmi')['Absenteeism time in hours'].mean()
plt.bar(x=np.arange(len(bmi)),height=bmi,color='lightblue')
plt.ylabel('Absenteeism time in hours')
plt.xticks(np.arange(len(bmi)),['Underweight','Normal','Overweight','Obese'],rotation=30)
plt.title("Mean absence time by BMI")
plt.show()

### Variable correlations

In [27]:
df_corr = df.drop(columns = ['ID', 'Disciplinary failure', 'Social drinker', 'Social smoker', 'Seasons', 
                            'Month of absence', 'Day of the week', 'Reason for absence', 'Education'])
corr = df_corr.corr()
plt.figure(figsize = (10,10))
sns.heatmap(corr, annot = True,cmap='Blues')
plt.title('Correlation Heatmap')

## Data transformation: adding an absence class (retard = late, normal, absenteiste = absentee)

In [28]:
df.head()

In [29]:

label_encoder = LabelEncoder() 

df = pd.concat([df  ,pd.get_dummies(df  ['Reason for absence'], prefix = 'reason')], axis=1)
df  = pd.concat([df  ,pd.get_dummies(df  ['Day of the week'], prefix = 'day')], axis=1)
df  = pd.concat([df  ,pd.get_dummies(df  ['Seasons'], prefix = 'season')], axis=1)
df = pd.concat([df  ,pd.get_dummies(df  ['Education'], prefix = 'education')], axis=1)
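
The four `concat` calls above can be collapsed into a single `get_dummies` call. A minimal sketch, assuming a hypothetical frame `toy` with made-up values. Unlike the cell above, passing `columns=` drops the original categorical columns, which the notebook otherwise does in a later step:

```python
import pandas as pd

# Hypothetical rows with two categorical columns and one numeric column.
toy = pd.DataFrame({"Seasons": [1, 2, 4], "Education": [1, 1, 3], "Age": [30, 41, 28]})

# One-hot encode both columns at once; untouched columns come first in the result.
encoded = pd.get_dummies(toy, columns=["Seasons", "Education"], prefix=["season", "education"])
print(list(encoded.columns))
```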



### Best RandomForest model obtained

In [30]:
# Features: every column except the target; target: absence time in hours, treated as a class label.
x = df.drop(columns=['Absenteeism time in hours'])
y = df['Absenteeism time in hours']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=3)
model = RandomForestClassifier(max_depth=2, random_state=3).fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred) * 100  # accuracy in percent (true labels first)
print(accuracy)

### GridSearch to find the best model.

In [31]:
"""param_grid = {
'n_estimators': [200, 500],
'max_features': ['sqrt', 'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy']
}
grid = GridSearchCV(RandomForestClassifier(),param_grid,verbose = 3)

grid.fit(x_train,y_train)

print(grid.best_params_) """
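
The search above is disabled by the surrounding triple quotes. A reduced version that runs in a few seconds is sketched below, assuming synthetic data from `make_classification` in place of the absenteeism features and a much smaller, hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train.
X, y = make_classification(n_samples=120, n_features=8, n_informative=4, random_state=3)

# Small illustrative grid; the real search would use the larger grid above.
param_grid = {"max_depth": [2, 4], "max_features": ["sqrt", "log2"]}

grid = GridSearchCV(RandomForestClassifier(n_estimators=20, random_state=3), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```

Fixing `random_state` on the estimator makes the selected parameters reproducible from one run to the next.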

In [32]:
best_model = RandomForestClassifier(criterion = 'gini', max_depth = 7, max_features = 'sqrt', n_estimators = 500 )
best_model.fit(x_train,y_train)
y_pred = best_model.predict(x_test)
accuracy = accuracy_score(y_test,y_pred)
print("Prediction best_model: \n",y_pred,"\n","Accuracy of the grid-search model: ", accuracy) 

In [33]:
# Classes: retard (late), normal, absenteiste (absentee)
def niveau(absence_time):
    if absence_time < 2:
        return 'retard'
    if absence_time < 24:
        return 'normal'
    return 'absenteiste'

df['classe'] = df['Absenteeism time in hours'].apply(niveau).astype('category')
df_pca = df.copy()
df_pca.head()
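
The class boundaries (below 2 hours, 2 to under 24 hours, 24 hours and above) can also be applied to the whole column at once. A minimal vectorized sketch with `np.select`, assuming made-up durations chosen to sit on both sides of each threshold:

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 1, 2, 8, 23, 24, 120])   # made-up absence durations
h = hours.to_numpy()

# Conditions are tested in order; the first one that holds decides the class.
classe = np.select([h < 2, h < 24], ["retard", "normal"], default="absenteiste")
print(list(classe))
```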

<a name="PCA"></a>
# Principal component analysis

In [34]:
df_pca = df_pca.drop(columns=["Reason for absence", "Seasons", "Day of the week", "Height", "Education", "Absenteeism time in hours"])
df_pca.info()

In [35]:
y = df_pca["classe"]
df_pca = df_pca .drop(columns = ["classe"])


In [36]:
X_pca = df_pca.copy()

X_pca_norm = (X_pca-X_pca.mean())/X_pca.std()
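
The manual standardization above is close to, but not identical to, what `StandardScaler` would produce: pandas divides by the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population one (`ddof=0`). A minimal sketch of the relationship, assuming a small made-up frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 0.0, 10.0, 0.0]})  # made-up values

manual = (X - X.mean()) / X.std()            # pandas std: ddof=1
scaled = StandardScaler().fit_transform(X)   # StandardScaler: ddof=0

# The two results differ only by the constant factor sqrt((n - 1) / n).
n = len(X)
same_up_to_scale = np.allclose(manual.to_numpy(), scaled * np.sqrt((n - 1) / n))
print(same_up_to_scale)
```

With 740 rows the factor is very close to 1, so either choice should give nearly the same components.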
     

In [37]:
X_pca = PCA(random_state = 3).fit(X_pca_norm)


var_cumul = np.cumsum(X_pca.explained_variance_ratio_)
plt.plot(var_cumul)
plt.title('Cumulative variance')
plt.xlabel('Number of components')
plt.ylabel('Explained variance')
     


In [38]:
var_cumul[30]  # cumulative variance explained by the first 31 components (0-based index)
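
The number of components can be read from the cumulative variance instead of being picked by eye. A minimal sketch, assuming made-up variance ratios and a hypothetical 90 % threshold; note that the first `k` components correspond to index `k - 1` of the cumulative array:

```python
import numpy as np

ratios = np.array([0.5, 0.3, 0.15, 0.05])   # made-up explained-variance ratios
var_cumul = np.cumsum(ratios)
threshold = 0.9                             # hypothetical target share of variance

# Index of the first cumulative value reaching the threshold, converted to a count.
n_components = int(np.searchsorted(var_cumul, threshold) + 1)
print(n_components, var_cumul[n_components - 1])
```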

In [39]:
X_pca = np.dot(X_pca_norm.values, X_pca.components_[:31,:].T)
X_pca = pd.DataFrame(X_pca, columns=["PC%d" % (x + 1) for x in range(31)])
X_pca.head()
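
The projection above multiplies the standardized data by the first rows of `components_`. For centered data this matches what `PCA.transform` returns, which the following sketch checks on random data (an assumption; the real input would be `X_pca_norm`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))                             # random stand-in data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # centered and scaled

pca = PCA(random_state=3).fit(X_norm)
k = 3                                                    # hypothetical number of components kept

by_dot = X_norm @ pca.components_[:k].T                  # manual projection
by_transform = pca.transform(X_norm)[:, :k]              # library projection
print(np.allclose(by_dot, by_transform))
```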
     


<a name="models"></a>
# Models 

#### Splitting the dataset into train and test sets

In [40]:
x_train, x_test, y_train, y_test = train_test_split(X_pca, y, test_size = 0.3, random_state = 3)


### Random oversampling

In [41]:
from imblearn.over_sampling import RandomOverSampler
x_train_s, y_train_s = RandomOverSampler(random_state = 3).fit_resample(x_train, y_train)
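
To make the effect of random oversampling concrete, the sketch below reproduces the idea with pandas only: minority classes are resampled with replacement until every class matches the majority count. It is an illustration on made-up labels, not the implementation used by `RandomOverSampler`:

```python
import pandas as pd

# Made-up, imbalanced training labels.
y_train = pd.Series(["normal"] * 6 + ["retard"] * 3 + ["absenteiste"] * 1)
target = y_train.value_counts().max()   # size of the majority class

parts = []
for label, group in y_train.groupby(y_train):
    missing = target - len(group)
    if missing > 0:
        # Draw extra rows from the same class, with replacement.
        extra = group.sample(n=missing, replace=True, random_state=3)
        group = pd.concat([group, extra])
    parts.append(group)

y_resampled = pd.concat(parts)
print(y_resampled.value_counts().to_dict())
```

As in the cell above, only the training split is resampled, so the test set keeps the original class proportions.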

### Naive Bayes

We start with a simple Naive Bayes model.

In [42]:
df.info()

In [43]:
model_NB = GaussianNB().fit(x_train_s,y_train_s)


In [44]:
y_pred = model_NB.predict(x_test)
accuracy = accuracy_score(y_test, y_pred) * 100
print(accuracy)

In [45]:
cm = confusion_matrix(y_test,y_pred)


# Plot confusion matrix using seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=model_NB.classes_, yticklabels=model_NB.classes_)  # label the axes with the actual class names
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
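
By default `confusion_matrix` orders rows and columns by sorted label, which is alphabetical for string classes. Passing `labels=` fixes the order explicitly. A minimal sketch on made-up predictions, where rows are true classes and columns are predicted classes:

```python
from sklearn.metrics import confusion_matrix

labels = ["retard", "normal", "absenteiste"]   # explicit row/column order
y_true = ["retard", "normal", "normal", "absenteiste", "normal"]   # made-up labels
y_pred = ["retard", "normal", "retard", "normal", "normal"]        # made-up predictions

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```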

In [46]:

# classification report
print("Classification report:\n ",classification_report(y_test,y_pred))


### Decision Tree model

In [47]:
model = DecisionTreeClassifier(criterion = "gini" ).fit(x_train_s,y_train_s)

In [48]:
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred) * 100  # true labels first, by convention
print(accuracy)


In [49]:
plt.figure(figsize=(100,70)) 
plot_tree(model, fontsize=8)
plt.show()

In [50]:
cm = confusion_matrix(y_test,y_pred)
print(cm)


plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()

In [51]:
print("Classification report:\n ",classification_report(y_test,y_pred))