# Machine Learning Case Study: Absenteeism 
#### by Sooyeon Won 

### Part 1: Data Preprocessing 

### Keywords 
- Dimensionality Reduction 
- Data Manipulation

### Contents 

<ul>    
<li><a href="#Preprocessing">1.  Data Preprocessing</a></li>
<li><a href="#Analysis">2.  Machine Learning</a></li>
<li><a href="#Deployment">3.  Model Deployment</a></li>
</ul>




<a id='Preprocessing'></a>
## 1. Data Preprocessing

In [1]:
# Import the relevant libraries
import pandas as pd
import numpy as np

# Load the data
abs_df = pd.read_csv("Absenteeism_data.csv")

abs_df
# pd.options.display.max_columns = None
# pd.options.display.max_rows = None

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,3,23,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,7,7,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,11,23,23/07/2015,289,36,33,239.554,30,1,2,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
695,17,10,23/05/2018,179,22,40,237.656,22,2,2,0,8
696,28,6,23/05/2018,225,26,28,237.656,24,1,1,2,3
697,18,10,24/05/2018,330,16,28,237.656,25,2,0,0,8
698,25,23,24/05/2018,235,16,32,237.656,25,3,0,0,2


In [2]:
abs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


#### ID

In [3]:
# Drop 'ID'
abs_df.drop(['ID'], axis = 1, inplace=True)

#### Reason for Absence

In [4]:
# Check the unique values for reason types
abs_df['Reason for Absence'].unique()
# pd.unique(abs_df['Reason for Absence'])
# sorted(abs_df['Reason for Absence'].unique())

array([26,  0, 23,  7, 22, 19,  1, 11, 14, 21, 10, 13, 28, 18, 25, 24,  6,
       27, 17,  8, 12,  5,  9, 15,  4,  3,  2, 16], dtype=int64)

In [5]:
# Check the minimum and maximum reason codes
abs_df['Reason for Absence'].min(), abs_df['Reason for Absence'].max()

(0, 28)
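
The codes run from 0 to 28, yet `unique()` returned only 28 values, so one code in that range never occurs. A minimal sketch of how to find it, assuming the array printed above is copied in by hand (the names `observed` and `missing` are illustrative, not part of the notebook):

```python
import numpy as np

# Reason codes copied from the unique() output above
observed = np.array([26, 0, 23, 7, 22, 19, 1, 11, 14, 21, 10, 13, 28, 18,
                     25, 24, 6, 27, 17, 8, 12, 5, 9, 15, 4, 3, 2, 16])

# Every integer code between the minimum and the maximum
full_range = set(range(observed.min(), observed.max() + 1))

# Codes in the range that never appear in the data
missing = sorted(full_range - set(observed.tolist()))
print(missing)
```

Knowing which code is absent helps when the dummy columns are later selected by label ranges, because the corresponding column will not exist.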

In [6]:
# One-hot encode 'Reason for Absence'; drop_first=True drops the column for reason 0
reason_columns = pd.get_dummies(abs_df['Reason for Absence'], drop_first = True)
abs_df.drop(['Reason for Absence'], axis = 1,inplace= True)


# Group the reasons for absence into four classes
# (label-based .loc slices include both endpoints)
reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type_4 = reason_columns.loc[:, 22:].max(axis=1)
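
To make the grouping step concrete, here is a small sketch on toy data. The reason codes in `reasons` are a hypothetical sample, not rows from the dataset, and the final `astype(int)` is my addition so the result prints as 0/1 regardless of the dummy dtype used by the installed pandas version.

```python
import pandas as pd

# Hypothetical sample of reason codes (not taken from the real data)
reasons = pd.Series([0, 3, 16, 20, 25, 14], name="Reason for Absence")

# One-hot encode; drop_first=True removes the column for code 0,
# so a row with reason 0 ends up with zeros in every dummy column
dummies = pd.get_dummies(reasons, drop_first=True)

# Row-wise max over a block of dummy columns is 1 (True) when the
# row's reason falls in that block, otherwise 0 (False)
grouped = pd.DataFrame({
    "Reason_1": dummies.loc[:, 1:14].max(axis=1),
    "Reason_2": dummies.loc[:, 15:17].max(axis=1),
    "Reason_3": dummies.loc[:, 18:21].max(axis=1),
    "Reason_4": dummies.loc[:, 22:].max(axis=1),
}).astype(int)

print(grouped)
```

Because each row has exactly one original reason, at most one of the four group flags is set per row, and a row with reason 0 has none set.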

In [7]:
# Concatenate Column Values 
combined_df = pd.concat([abs_df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis = 1)

# Column names after concatenation (inspect the current ones with combined_df.columns.values)
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

# Change column names 
combined_df.columns = column_names

# Reorder Columns
column_names_reordered = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 
                          'Date', 'Transportation Expense', 'Distance to Work', 'Age',
                          'Daily Work Load Average', 'Body Mass Index', 'Education',
                          'Children', 'Pets', 'Absenteeism Time in Hours']

combined_df = combined_df[column_names_reordered]
combined_df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,0,0,0,1,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,0,0,0,0,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,0,0,0,1,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,1,0,0,0,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,0,0,0,1,23/07/2015,289,36,33,239.554,30,1,2,1,2


In [8]:
# Create a Checkpoint 
df_reason_mod = combined_df.copy()

#### Date

In [9]:
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format = '%d/%m/%Y')

# Extract the month value (1-12) with the vectorized .dt accessor
df_reason_mod['Month Value'] = df_reason_mod['Date'].dt.month

In [10]:
# Extract the weekday (Monday = 0, ..., Sunday = 6)

# df_reason_mod['Date'][699].weekday()
# df_reason_mod['Date'][699]

def date_to_weekday(date_value):
    return date_value.weekday()

df_reason_mod['Day of the Week'] = df_reason_mod['Date'].apply(date_to_weekday)
df_reason_mod = df_reason_mod.drop(['Date'], axis = 1)
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,289,36,33,239.554,30,1,2,1,4,7,1
1,0,0,0,0,118,13,50,239.554,31,1,1,0,0,7,1
2,0,0,0,1,179,51,38,239.554,31,1,0,0,2,7,2
3,1,0,0,0,279,5,39,239.554,24,1,2,0,4,7,3
4,0,0,0,1,289,36,33,239.554,30,1,2,1,2,7,3
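
The explicit `format='%d/%m/%Y'` matters because the dates are written day-first. The sketch below, which is an illustration rather than part of the original notebook, parses a few dates copied from the tables above and then shows how one hypothetical ambiguous date (`03/07/2015`) yields a different month when read month-first.

```python
import pandas as pd

# Dates copied from the tables above
dates = pd.Series(["07/07/2015", "14/07/2015", "15/07/2015", "23/05/2018"])
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

months = parsed.dt.month.tolist()
weekdays = parsed.dt.weekday.tolist()  # Monday = 0, ..., Sunday = 6
print(months, weekdays)

# Hypothetical ambiguous date: both readings are valid calendar dates
ambiguous = "03/07/2015"
day_first = pd.to_datetime(ambiguous, format="%d/%m/%Y")    # 3 July 2015
month_first = pd.to_datetime(ambiguous, format="%m/%d/%Y")  # 7 March 2015
print(day_first.month, month_first.month)
```

A date such as `23/05/2018` can only be read day-first, but any date whose day is 12 or lower would be silently misread under the wrong format, which would corrupt both `Month Value` and `Day of the Week`.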


In [11]:
# Create a Checkpoint
df_reason_date_mod = df_reason_mod.copy()

#### Education
> Education is a categorical variable, so its codes carry no numeric meaning and I transform the feature into a dummy variable. The value counts show that 583 of the 700 records have code 1 (high school education only), so it makes sense to combine the remaining categories into a single class. The feature is therefore recoded as 0: high school education, 1: more than high school education.

In [12]:
# Education 
df_reason_date_mod['Education'].unique()

array([1, 3, 2, 4], dtype=int64)

In [13]:
df_reason_date_mod['Education'].value_counts()

1    583
3     73
2     40
4      4
Name: Education, dtype: int64

In [14]:
df_reason_date_mod['Education'] = df_reason_date_mod['Education'].map({1:0, 2:1, 3:1, 4:1})
df_reason_date_mod['Education'].value_counts()

0    583
1    117
Name: Education, dtype: int64
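
One caveat with a dictionary-based `map` is that any code missing from the dictionary becomes `NaN` without raising an error. The following sketch uses toy values with the same four codes, plus a hypothetical unseen code 5, to show the recoding, an equivalent comparison-based form, and a simple guard. The variable names are illustrative only.

```python
import pandas as pd

# Toy values using the same education codes as the dataset
education = pd.Series([1, 3, 2, 4, 1, 1])

# Dictionary recoding: 1 -> 0 (high school), 2/3/4 -> 1 (more than high school)
mapping = {1: 0, 2: 1, 3: 1, 4: 1}
mapped = education.map(mapping)

# Equivalent form for these codes: anything above 1 counts as "more than high school"
vectorized = (education > 1).astype(int)

# Guard: a code that is not in the dictionary (hypothetical code 5) maps to NaN
with_unseen = pd.Series([1, 5]).map(mapping)
n_unmapped = int(with_unseen.isna().sum())

print(mapped.tolist(), vectorized.tolist(), n_unmapped)
```

Checking `mapped.isna().sum()` after the recoding is a cheap way to confirm that every code in the column was covered.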

In [15]:
# Final Checkpoint 
df_preprocessed = df_reason_date_mod.copy()
df_preprocessed.head(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,0,0,0,1,289,36,33,239.554,30,0,2,1,4,7,1
1,0,0,0,0,118,13,50,239.554,31,0,1,0,0,7,1
2,0,0,0,1,179,51,38,239.554,31,0,0,0,2,7,2
3,1,0,0,0,279,5,39,239.554,24,0,2,0,4,7,3
4,0,0,0,1,289,36,33,239.554,30,0,2,1,2,7,3
5,0,0,0,1,179,51,38,239.554,31,0,0,0,2,7,4
6,0,0,0,1,361,52,28,239.554,27,0,1,4,8,7,4
7,0,0,0,1,260,50,36,239.554,23,0,4,0,4,7,4
8,0,0,1,0,155,12,34,239.554,25,0,2,0,40,7,0
9,0,0,0,1,235,11,37,239.554,29,1,1,1,8,7,0


In [16]:
df_preprocessed.shape

(700, 15)
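
Since every record has a single reason for absence, at most one of the four `Reason_*` flags should be set in any row. A minimal sanity check along those lines is sketched below. The helper `reason_flags_are_exclusive` is hypothetical, and the small frame holds only the first five rows of reason flags copied from the table above.

```python
import pandas as pd

# Reason flags of the first five rows, copied from the table above
flags = pd.DataFrame({
    "Reason_1": [0, 0, 0, 1, 0],
    "Reason_2": [0, 0, 0, 0, 0],
    "Reason_3": [0, 0, 0, 0, 0],
    "Reason_4": [1, 0, 1, 0, 1],
})


def reason_flags_are_exclusive(df):
    """Hypothetical helper: True if no row has more than one reason flag set."""
    reason_cols = [col for col in df.columns if col.startswith("Reason_")]
    return bool((df[reason_cols].sum(axis=1) <= 1).all())


is_valid = reason_flags_are_exclusive(flags)

# A deliberately broken row with two flags set should fail the check
broken = pd.DataFrame({"Reason_1": [1], "Reason_2": [1],
                       "Reason_3": [0], "Reason_4": [0]})
is_broken_valid = reason_flags_are_exclusive(broken)

print(is_valid, is_broken_valid)
```

The same helper could be applied to `df_preprocessed` before moving on to the modelling part.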