Dataset: https://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work (here you can download the dataset and its description)

We always start with data preprocessing: a group of operations that convert raw data into a format that is easier to understand and more useful for further processing and analysis. It also helps to organize information in a suitable and practical way. It takes most of the time and it is a crucial part of every analytical task. During preprocessing we make the raw dataset usable for a machine learning algorithm.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
rawdata = pd.read_csv('/kaggle/input/absenteeism-at-works/absenteeism.csv')

After loading the data I always explore it manually. It helps to form some first predictions, sometimes it helps to find errors (even something like importing the wrong file 😅), and it lets us dive into the problem. Jupyter Notebook and JupyterLab don't show the whole table by default, so I can use:

In [None]:
pd.options.display.max_columns = None

In [None]:
rawdata

In [None]:
df = rawdata.copy()

In [None]:
df.info()

ID - individual identification - indicates precisely who was away during working hours. It is a label variable that distinguishes the individuals from one another; it does not carry any numeric information.

We have to drop the variable “ID” because, treated as a number, it would harm the estimation.

In [None]:
df = df.drop(['ID'], axis = 1) 

The next column is ‘Reason for absence’. We have to keep in mind that its values represent categories that are equally meaningful, so it is a categorical nominal variable. We use numbers and attach descriptions to them because using fewer characters shrinks the volume of our dataset and makes it easier to digest. By the way, this comes from “database theory”.

Extracting distinct values only:

In [None]:
len(df['Reason for absence'].unique())

In [None]:
sorted(df['Reason for absence'].unique())

There is no number ’20’ in the list. That means that nobody was absent from work because of “External causes of morbidity and mortality” (we know this from the additional info in UCI_ABS_TEXT), phew! We have to change this variable into dummy variables. A dummy variable is an explanatory binary variable that equals 1 if a certain categorical effect is present and 0 if the same effect is absent.

For our data it works like this: 1 - if the person was absent because of reason 1, 0 - if the person was absent for any other reason.

Next: 1 - if the person was absent because of reason 2, 0 - if the person was absent for any other reason.

Fortunately I don’t have to do it manually: pandas makes it possible with a simple pd.get_dummies() call.

In [None]:
rcol = pd.get_dummies(df['Reason for absence'])

rcol is a new dataframe with 28 columns, one per reason code, which contains the information I described above.

To this dataframe we can add another column that holds the sum of each row:
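To make the dummy encoding easier to picture, here is a minimal sketch on made-up data. The series `toy_reasons` and its values are hypothetical and are not taken from the dataset; `dtype=int` is my own choice so that the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data: four absences with reason codes 0, 1, 2 and 1.
toy_reasons = pd.Series([0, 1, 2, 1], name='Reason for absence')

# One column per distinct code; dtype=int gives 0/1 instead of booleans.
toy_dummies = pd.get_dummies(toy_reasons, dtype=int)
print(toy_dummies)

# Each row contains exactly one 1, in the column named after its code.
print(toy_dummies.sum(axis=1).tolist())
```

Summing across each row is the same sanity check that the notebook applies to the real data.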

In [None]:
rcol['check'] = rcol.sum(axis=1)

In [None]:
rcol['check'].unique()  # check that every person has exactly one reason

In the next stage I have to drop column ‘0’ from the rcol dataframe. I'm doing this to avoid multicollinearity: for n categories we use n-1 dummies, so since I am dealing with 28 categories I need only 27 dummies. (https://www.quora.com/How-and-why-having-the-same-number-of-dummy-variables-as-categories-is-problematic-in-linear-regression-Dummy-variable-trap-Im-looking-for-a-purely-mathematical-not-intuitive-explanation-Also-please-avoid-using-the)
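The n-1 rule can be checked numerically. The sketch below uses hypothetical toy codes and assumes a regression with an intercept: with all dummies present, the dummy columns add up to the intercept column, so the design matrix loses a rank. It is only an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical reason codes with three categories.
toy_reasons = pd.Series([0, 1, 2, 1, 0, 2])

full = pd.get_dummies(toy_reasons, dtype=float)                      # 3 dummies
reduced = pd.get_dummies(toy_reasons, drop_first=True, dtype=float)  # 2 dummies

# Design matrices with an intercept column of ones in front.
intercept = np.ones((len(toy_reasons), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: collinear
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: full column rank
```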

In the original dataframe df I still have the column called ‘Reason for absence’. If we leave it, we will have duplicated information, which leads to multicollinearity, so let's drop this column from df. If we add rcol to df, we will have 27 additional columns in the dataframe. That is a bit too much. Let's group these variables; this action is called classification. We will group them based on the feature descriptions: Reason1: 1-14 - diseases; Reason2: 15-17 - pregnancy related; Reason3: 18-21 - poisonings; Reason4: 22-28 - light reasons.

We will create a new series for each group. That is also why we dropped the ID column: we treat every record as a separate case with exactly one reason for being out of work. So now we want to create a table that contains only the type of reason.


In [None]:
rcol = rcol.drop(['check'], axis=1)

In [None]:
# drop reason '0'; dtype=int keeps the dummies as 0/1 (pandas 2.0+ defaults to booleans)
rcol = pd.get_dummies(df['Reason for absence'], drop_first=True, dtype=int)
rcol

In [None]:
df = df.drop(['Reason for absence'], axis=1) #drop 'Reason for absence', replace with dummies. 
df

In [None]:
# merge the dummies into 4 groups based on the reason for absence
reasontype1 = rcol.loc[:, 1:14].max(axis=1)
reasontype2 = rcol.loc[:, 15:17].max(axis=1)
reasontype3 = rcol.loc[:, 18:21].max(axis=1)
reasontype4 = rcol.loc[:, 22:28].max(axis=1)

In [None]:
print(reasontype1.sum(), reasontype2.sum(), reasontype3.sum(), reasontype4.sum())
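The grouping relies on two details: `.loc` slices columns by label and includes both endpoints, and a label missing inside the range (like 20) does not cause a problem. A small sketch on hypothetical codes, cross-checked against a direct range test on the codes:

```python
import pandas as pd

# Hypothetical codes; 20 is deliberately absent, as in the real data.
toy_reasons = pd.Series([0, 18, 19, 21, 22, 19])
toy_dummies = pd.get_dummies(toy_reasons, drop_first=True, dtype=int)

# Label-based slice: picks columns 18, 19 and 21 (both endpoints included).
group3 = toy_dummies.loc[:, 18:21].max(axis=1)

# Same result computed directly from the codes, as a cross-check.
expected = toy_reasons.between(18, 21).astype(int)

print(group3.tolist())
print((group3 == expected).all())
```

Taking the row-wise maximum gives 1 when any dummy in the group is 1, which is what we want because each row has at most one reason.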

In [None]:
df = pd.concat([df, reasontype1, reasontype2, reasontype3, reasontype4], axis = 1)
df

We have added the grouped columns to the dataframe. Now we rename them and then reorder the columns, because we want to see the reasons first:

In [None]:
column_names = ['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours', 'Reason1', 'Reason2', 'Reason3', 'Reason4']
df.columns = column_names
df

In [None]:
reordered = ['Reason1', 'Reason2', 'Reason3', 'Reason4','Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours']

In [None]:
df = df[reordered]
df
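Typing the full column list twice is easy to get wrong. As an alternative, the sketch below renames only the four appended columns and builds the new order from the existing names. The frame `toy` and its values are hypothetical; they only mimic the shape of df after pd.concat, where unnamed series get the labels 0 to 3.

```python
import pandas as pd

# Hypothetical frame standing in for df after pd.concat: two original
# columns followed by four unnamed group columns (labelled 0..3 by concat).
toy = pd.DataFrame({'Age': [33, 50], 'Pet': [1, 0]})
groups = [pd.Series([1, 0]), pd.Series([0, 0]), pd.Series([0, 1]), pd.Series([0, 0])]
toy = pd.concat([toy] + groups, axis=1)

reason_cols = ['Reason1', 'Reason2', 'Reason3', 'Reason4']

# Rename only the last four columns and keep the others unchanged.
toy.columns = list(toy.columns[:-4]) + reason_cols

# Move the reason columns to the front.
other_cols = [c for c in toy.columns if c not in reason_cols]
toy = toy[reason_cols + other_cols]

print(toy.columns.tolist())
```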

Creating a checkpoint - an interim save of your work

I called it df_mod1, which stands for modified dataframe version 1. Creating checkpoints is very good practice: it helps to organize the work and stores the current version of our data, so we reduce the risk of losing it at later stages. We don’t have to do anything with the date columns, the month and the day of the week.

Let’s move to the next columns: Transportation Expense, Distance, Age, Daily Work Load, BMI. We are not going to manipulate them either.

Next we have: ‘Education’ (high school (1), graduate (2), postgraduate (3), master and doctor (4)), ‘Son’ - number of children, and ‘Pet’ - number of pets. We will leave the columns ‘Son’ and ‘Pet’ untouched.

We have to change Education into a dummy variable. To avoid scrolling through everything, let's check what we have in the ‘Education’ variable.

In [None]:
df_mod1 = df.copy()
df_mod1
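A checkpoint only protects us if it is a real copy. The sketch below, on hypothetical data, contrasts `.copy()` with plain assignment, which only gives the same object a second name:

```python
import pandas as pd

toy = pd.DataFrame({'Education': [1, 3, 1]})  # hypothetical data

checkpoint = toy.copy()  # independent copy (deep by default)
alias = toy              # just another name for the same object

# Modify the working dataframe after taking the checkpoint.
toy['Education'] = toy['Education'].map({1: 0, 3: 1})

print(checkpoint['Education'].tolist())  # still the original values
print(alias['Education'].tolist())       # follows the change
```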

In [None]:
#correlation matrix
cormatrix = df_mod1.corr()
plt.subplots(figsize=(8, 8))
sns.heatmap(cormatrix, vmin=-1, vmax=1, center= 0, cmap= 'coolwarm')

In [None]:
cols = cormatrix.nlargest(10, 'Absenteeism time in hours')['Absenteeism time in hours'].index
corrcoef = np.corrcoef(df_mod1[cols].values.T)
plt.subplots(figsize=(8, 8))
sns.heatmap(corrcoef, annot=True,  yticklabels=cols.values, xticklabels=cols.values, vmin=-1, vmax=1, center= 0,  cmap= 'coolwarm')

In [None]:
df_mod2 = df_mod1.copy() 

In [None]:
df_mod2 = df_mod2.drop(['Month of absence','Distance from Residence to Work','Body mass index'], axis = 1)

In [None]:
df_mod2['Education'].unique()

In [None]:
df_mod2['Education'].value_counts()

Now we can see that 611 people are in category 1 (high school) and only 129 people hold a higher degree (graduate, postgraduate, master or doctor), so the distinction between the higher degrees is not so relevant anymore. We can combine them into a single category: we assign high school to 0 and at least graduate to 1: 1 -> 0, 2 -> 1, 3 -> 1, 4 -> 1.

In [None]:
df_mod2['Education'] = df_mod2['Education'].map({1:0, 2:1, 3:1, 4:1})

In [None]:
df_mod2['Education'].value_counts()
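One thing to watch with a dictionary passed to `.map()`: any value that is missing from the dictionary becomes NaN without a warning. A short sketch on hypothetical codes shows why it is worth checking `value_counts()` or `isna()` after the mapping:

```python
import pandas as pd

toy_edu = pd.Series([1, 2, 3, 4, 1])  # hypothetical education codes

complete = toy_edu.map({1: 0, 2: 1, 3: 1, 4: 1})
incomplete = toy_edu.map({1: 0, 2: 1, 3: 1})  # code 4 left out on purpose

print(complete.tolist())
print(incomplete.isna().sum())  # number of values that had no mapping
```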

Saving the file as CSV:

In [None]:
d_pre = df_mod2.copy()
d_pre

In [None]:
d_pre.to_csv('Absenteeism_preprocessed.csv', index=False)
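The `index=False` argument matters when the file is read back later. The sketch below uses a hypothetical two-row frame and an in-memory buffer instead of a real file; `round_trip` is a helper I made up for the illustration.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Reason1': [1, 0], 'Absenteeism time in hours': [4, 8]})

def round_trip(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)

without_index = round_trip(toy, index=False)
with_index = round_trip(toy)  # default index=True writes the row labels too

print(without_index.columns.tolist())
print(with_index.columns.tolist())  # contains an extra unnamed column
print(without_index.equals(toy))
```

With `index=False` the file contains only the data columns, so the reloaded table matches the one we saved.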