# Absenteeism at work 

Problem definition: predict how long an employee will be absent, given information about the reason for the absence and the employee's profile.

## Data Cleaning 


**More info regarding data description:**
- If reason of absence is 0, the row may not describe an absence at all; it may have been used to record additional information (such as a disciplinary failure).
- If absenteeism_time_in_hours is 0, then depending on the context the absence lasted less than 1 hour or the employee was not absent at all.
- If month_of_absence, disciplinary_failure and absenteeism_time_in_hours are all 0, we can assume the employee was probably not absent.
- Service Time is the number of years spent in the company.
- Hit Target is the percentage of the employee's objectives achieved.
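
The rules above can be turned into a small row classifier. The sketch below is one possible reading of them and is not part of the original cleaning: the function name `classify_row` and the label strings are made up for illustration, and the order in which the rules are checked is an assumption.

```python
def classify_row(reason_for_absence, month_of_absence,
                 disciplinary_failure, absenteeism_time_in_hours):
    """Label a row using the zero-value rules listed above (hypothetical helper)."""
    # All three fields at 0: the employee was probably not absent.
    if (month_of_absence == 0 and disciplinary_failure == 0
            and absenteeism_time_in_hours == 0):
        return "probably not absent"
    # Reason 0: the row likely carries extra information, not an absence.
    if reason_for_absence == 0:
        return "additional information"
    # 0 hours with a real reason: under one hour, or no absence at all.
    if absenteeism_time_in_hours == 0:
        return "under 1 hour or not absent"
    return "absence"


# Made-up rows: (reason, month, disciplinary failure, hours)
examples = [(0, 0, 0, 0), (0, 7, 1, 0), (23, 7, 0, 0), (23, 7, 0, 4)]
labels = [classify_row(*row) for row in examples]
```

On the DataFrame, a function like this could be applied row by row with `df.apply(..., axis=1)` once the columns have been renamed.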


**Cleaning tasks:**
- [x] Rename some of the columns (Son and Work load Average/day)
- [x] Clean the column names
- [x] Gather outliers into bins for absenteeism time in hours
- [x] Correct dtypes of categorical data
- [x] Check missing values
- [x] Create a new column to count the total disciplinary failures for an employee
- [x] Drop the rows where reason of absence == 0, because these rows do not relate to absenteeism but rather to the behaviour of an employee (disciplinary failure).
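
To make the binning step concrete, here is a minimal sketch of how an absence duration would map to a bin. It assumes that the five labels and the bin edges used later in this notebook pair up in order, and that intervals are closed on the right (the `pd.cut` default). `bin_label` is a hypothetical helper written for this illustration, not a pandas function.

```python
from bisect import bisect_left

# Upper edges of the right-closed bins, in hours, and their assumed labels.
UPPER_EDGES = [2.0, 4.0, 8.0, 20.0, 120.0]
LABELS = ['0-2 hours', '1/2 day', '1 day', '1-2 days', 'more than 2 days']


def bin_label(hours):
    """Return the label of the right-closed bin containing `hours`, or None if out of range."""
    if hours < 0 or hours > UPPER_EDGES[-1]:
        return None  # outside the range covered by the bins
    # bisect_left finds the first edge >= hours, so 2.0 lands in '0-2 hours' and 2.5 in '1/2 day'.
    return LABELS[bisect_left(UPPER_EDGES, hours)]


binned = {h: bin_label(h) for h in (0, 2, 3, 8, 16, 24, 120)}
```

A full working day of 8 hours falls in the `'1 day'` bin, while anything above 20 hours ends up in the top bin together with the outliers.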


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set()

pd.set_option('display.max_columns', 25)

In [None]:
df = pd.read_csv('../data/Absenteeism_at_work.csv', sep=';')
print(df.shape)
df.head()

In [None]:
df.info()

In [None]:
# Clean column names: rename 'Son', then lowercase and replace separators with underscores

df.rename(columns={'Son':'children'}, inplace=True)

df.columns = [col.lower().replace('/','_').replace(' ','_').rstrip('_') for col in df.columns]
df.head()

In [None]:
# Changing the types of categorical data to object

categorical_col = ['id','reason_for_absence','month_of_absence','day_of_the_week','seasons'
                   ,'disciplinary_failure', 'education','social_drinker','social_smoker']

df[categorical_col] = df[categorical_col].astype(object)
df.dtypes

In [None]:
df.describe().round(2)

In [None]:
df['absenteeism_time_in_hours'].plot.box();

# Values above 20 hours show up as outliers: these employees were absent for more than a day

In [None]:
# Bin absence durations so that the outliers fall into the top bin

label_absenteeism_time = ['0-2 hours','1/2 day','1 day','1-2 days','more than 2 days']

df['absenteeism_bins'] = pd.cut(df.absenteeism_time_in_hours, [-0.001,2.0,4.0,8.0,20.0,120.0],
                                labels=label_absenteeism_time)
df.absenteeism_bins.value_counts()

In [None]:
# Create a new column holding the total number of disciplinary failures per employee
print("Value frequency of disciplinary failure:\n", df.disciplinary_failure.value_counts())

# disciplinary_failure was cast to object above, so convert it back to int before summing
total_disciplinary_failures = (df.assign(disciplinary_failure=df.disciplinary_failure.astype(int))
                                 .groupby('id').disciplinary_failure.sum().reset_index()
                                 .rename(columns={'disciplinary_failure':'total_disciplinary_failures'}))
df1 = pd.merge(df, total_disciplinary_failures, how='outer', on='id')
df1.head()

In [None]:
# Drop rows where reason_for_absence == 0, then drop the boolean disciplinary_failure column

row_drop = df1[df1.reason_for_absence==0].index
df1.drop(row_drop, axis=0, inplace=True)
df1.drop('disciplinary_failure',axis=1,inplace=True)
df1.head()

In [None]:
df1.shape

# 2 columns were created, 1 column dropped and 43 rows dropped

In [None]:
# Saving the clean dataset

df1.drop('absenteeism_time_in_hours',axis=1).to_csv('../data/absenteeism_clean.csv',index=False)

## Exploratory data analysis 

In [None]:
# Frequency of reason for absence
df1['reason_for_absence'].astype(int).hist(bins=df1['reason_for_absence'].nunique());

In [None]:
sns.histplot(df1.total_disciplinary_failures.astype(int), kde=True);

In [None]:
# Total time of absenteeism and number of absences per employee

count_absence_time = df1.groupby('id').absenteeism_time_in_hours.sum().sort_values(ascending=False)
avg_time_absence = count_absence_time.mean()

# Align the counts on the same employee order so both bar series share the x-axis
count_absence = df1.groupby('id').reason_for_absence.count().reindex(count_absence_time.index)
avg_count_absence = count_absence.mean()

fig, ax = plt.subplots(figsize=(14,6))

count_absence_time.plot(kind='bar', color='orange', label='Hours of absences', ax=ax)
count_absence.plot(kind='bar', color='red', label='Count of absences', ax=ax)

xmin, xmax = ax.get_xlim()
ax.hlines(y=avg_count_absence, xmin=xmin, xmax=xmax, colors='r', linestyles='dashed',
          label='Average number of absences')
ax.hlines(y=avg_time_absence, xmin=xmin, xmax=xmax, colors='orange', linestyles='dashed',
          label='Average time of absences')

ax.legend()
ax.set_title('Total of absences for absent employees')
plt.show()

In [None]:
sns.histplot(count_absence, kde=True)
plt.xlabel('Total count of absence per employee');

In [None]:
# Total absenteeism per employee
df1.groupby('id').absenteeism_time_in_hours.agg('sum').describe()

In [None]:
# trend of absenteeism for month and weekday
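
One way to look at the month and weekday trend is a pivot table of mean absence hours. The sketch below runs on a small made-up frame (the values are illustrative, not taken from the dataset) and assumes the cleaned column names used above.

```python
import pandas as pd

# Made-up sample with the cleaned column names; not real data.
toy = pd.DataFrame({
    'month_of_absence':          [1, 1, 1, 2, 2],
    'day_of_the_week':           [2, 2, 3, 2, 3],
    'absenteeism_time_in_hours': [4, 8, 2, 8, 16],
})

# Mean hours of absence for each (month, weekday) pair.
trend = toy.pivot_table(index='month_of_absence', columns='day_of_the_week',
                        values='absenteeism_time_in_hours', aggfunc='mean')
```

On the real data, `toy` would be replaced by `df1`, and the resulting table could be displayed with `sns.heatmap(trend)`.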

In [None]:
# trend of absence regarding seasons and distance
# (Are people who live far from work less likely to come in when the weather is bad?)

In [None]:
# average time and count of absences per reason 
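
A possible starting point for this cell is a grouped aggregation that returns both the number of absences and the mean duration per reason. The frame below is made up for illustration, and the output column names `n_absences` and `mean_hours` are arbitrary choices.

```python
import pandas as pd

# Made-up sample with the cleaned column names; not real data.
toy = pd.DataFrame({
    'reason_for_absence':        [23, 23, 28, 28, 28],
    'absenteeism_time_in_hours': [2, 4, 1, 2, 3],
})

# Named aggregation: one row per reason, one column per statistic.
per_reason = (toy.groupby('reason_for_absence')['absenteeism_time_in_hours']
                 .agg(n_absences='count', mean_hours='mean'))
```

Swapping `reason_for_absence` for `total_disciplinary_failures` would give the same summary per disciplinary count.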

In [None]:
# average time and count of absences per disciplinary failure count (are employees with disciplinary failures absent more often?)

In [None]:
# pairplot to detect any linearity between variables 
sns.pairplot(df1);