In [1]:
import pandas as pd
raw_csv_data = pd.read_csv("Absenteeism_data.csv")

In [2]:
df = raw_csv_data.copy()

In [3]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None

## Drop 'ID':

In [4]:
df = df.drop(['ID'], axis = 1)

## 'Reason for Absence':

In [5]:
df['Reason for Absence'].min()

0

In [6]:
df['Reason for Absence'].max()

28

In [8]:
len(df['Reason for Absence'].unique())

28
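The minimum is 0 and the maximum is 28, which would give 29 codes if every integer in between were present, yet only 28 unique values appear. A set difference is one quick way to find an absent code. The sketch below uses a small made-up series (`toy_reasons` is hypothetical), not the real column:

```python
import pandas as pd

# Hypothetical toy column: codes 0..5, with code 3 never used
toy_reasons = pd.Series([0, 1, 2, 4, 5, 5, 1, 0])

# Every integer between the observed minimum and maximum
expected_codes = set(range(toy_reasons.min(), toy_reasons.max() + 1))

# Codes in that range that never occur in the data
missing_codes = sorted(expected_codes - set(toy_reasons.unique()))
print(missing_codes)
```

Applied to `df['Reason for Absence']`, the same check would be expected to report code 20, the only label between 1 and 28 that is missing from the dummy columns listed later in this notebook.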

### .get_dummies()

The original 'Reason for Absence' column includes code 0, which stands for 'Unspecified reason'. With `drop_first=True`, `pd.get_dummies()` drops the first category, so code 0 becomes the baseline and gets no column of its own. The remaining codes (1–28) are later grouped into four binary features (Reason_1 to Reason_4) based on medical and administrative classifications. Observations with reason code 0 are therefore False (zero) in all of Reason_1 to Reason_4.

In [14]:
reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first = True)
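To make the baseline behavior concrete, here is a minimal sketch on a toy series (`toy_codes` is a hypothetical stand-in for the real column). It shows that the lowest category loses its column and that rows with that category end up with no active dummy:

```python
import pandas as pd

# Hypothetical toy codes; 0 plays the role of the baseline category
toy_codes = pd.Series([0, 1, 2, 0, 2])

# drop_first=True removes the column for the first (lowest) category
toy_dummies = pd.get_dummies(toy_codes, drop_first=True)
print(list(toy_dummies.columns))

# Number of active dummies per row: 0 for baseline rows, 1 otherwise
active_per_row = toy_dummies.sum(axis=1)
print(active_per_row.tolist())
```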

## Group the Reasons for Absence:

In [15]:
df.columns.values

array(['Reason for Absence', 'Date', 'Transportation Expense',
       'Distance to Work', 'Age', 'Daily Work Load Average',
       'Body Mass Index', 'Education', 'Children', 'Pets',
       'Absenteeism Time in Hours'], dtype=object)

In [16]:
reason_columns.columns.values

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 21, 22, 23, 24, 25, 26, 27, 28], dtype=int64)

In [17]:
df = df.drop(['Reason for Absence'], axis = 1)

## Concatenate Column Values

### Categorization of Absence Reasons
The 'Reason for Absence' column takes 28 distinct values: code 0 ('Unspecified reason') plus 27 non-zero codes, many of which correspond to ICD medical categories. For modeling purposes, the non-zero codes were grouped into four broader categories based on the nature and domain of the conditions:

- Reason Type 1: Serious and Chronic Medical Conditions
    Reasons 1–14, including infections, neoplasms, circulatory and respiratory diseases, digestive issues, etc. These typically reflect more severe or chronic health conditions, which may lead to longer absences.

- Reason Type 2: Pregnancy and Perinatal-Related Conditions
    Reasons 15–17, including pregnancy, childbirth, and congenital abnormalities. These causes are more specific to reproductive and early-life conditions.

- Reason Type 3: Temporary and Less Defined Medical Issues
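    Reasons 18–21, such as injuries, external causes, and symptoms not classified elsewhere. Code 20 does not occur in this dataset, so only three dummy columns (18, 19, and 21) fall in this range. These are often acute or non-specific events affecting attendance.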

- Reason Type 4: Non-Medical or Administrative Absences
    Reasons 22–28, including patient follow-ups, consultations, and unjustified absences. These may involve logistical or administrative reasons not necessarily linked to illness.

In [19]:
reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type_4 = reason_columns.loc[:, 22:].max(axis=1)

df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis = 1)
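The grouping relies on two pandas behaviors: `.loc` label slices include both endpoints, and a row-wise `max` over dummy columns is True whenever any column in the group is True. The sketch below illustrates this on hypothetical toy data (the names `group_a`, `group_b`, and `grouped` are made up for the example). It also names each series before concatenating, which is one way to avoid the default 0, 1, 2, 3 column labels:

```python
import pandas as pd

# Hypothetical dummy columns labelled 1, 2, 3, 4, 6 (label 5 is absent)
toy_dummies = pd.get_dummies(pd.Series([0, 1, 2, 3, 4, 6]), drop_first=True)

# Label slices are inclusive; a missing label inside the range is simply skipped
group_a = toy_dummies.loc[:, 1:2].max(axis=1)
group_b = toy_dummies.loc[:, 3:6].max(axis=1)

# Naming each series up front gives the concatenated frame readable columns
grouped = pd.concat([group_a.rename('Group_A'), group_b.rename('Group_B')], axis=1)

# Sanity check: each row belongs to at most one group
assert (grouped.sum(axis=1) <= 1).all()
print(grouped)
```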

In [20]:
df.columns.values

array(['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 0, 1, 2, 3],
      dtype=object)

In [21]:
column_names = ['Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']

In [22]:
df.columns = column_names

In [23]:
df.head()

Unnamed: 0,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Reason_1,Reason_2,Reason_3,Reason_4
0,07/07/2015,289,36,33,239.554,30,1,2,1,4,False,False,False,True
1,14/07/2015,118,13,50,239.554,31,1,1,0,0,False,False,False,False
2,15/07/2015,179,51,38,239.554,31,1,0,0,2,False,False,False,True
3,16/07/2015,279,5,39,239.554,24,1,2,0,4,True,False,False,False
4,23/07/2015,289,36,33,239.554,30,1,2,1,2,False,False,False,True


## Reorder Columns

In [24]:
column_names_reordered = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 
                          'Date', 'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours']

In [25]:
df = df[column_names_reordered]

In [26]:
df.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,False,False,False,True,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,False,False,False,False,14/07/2015,118,13,50,239.554,31,1,1,0,0
2,False,False,False,True,15/07/2015,179,51,38,239.554,31,1,0,0,2
3,True,False,False,False,16/07/2015,279,5,39,239.554,24,1,2,0,4
4,False,False,False,True,23/07/2015,289,36,33,239.554,30,1,2,1,2


## Create a Checkpoint

In [27]:
df_reason_mod = df.copy()

## 'Date':

In [28]:
type(df_reason_mod['Date'][0])

str

In [29]:
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format = '%d/%m/%Y')
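Passing an explicit `format` matters here because strings such as '06/07/2015' are ambiguous between day-first and month-first conventions. A small sketch with hypothetical date strings (`toy_dates` is not part of the dataset) shows the day-first interpretation that `'%d/%m/%Y'` enforces:

```python
import pandas as pd

# Hypothetical day-first date strings
toy_dates = pd.Series(['06/07/2015', '13/07/2015'])

# '%d/%m/%Y' reads the first field as the day and the second as the month
parsed = pd.to_datetime(toy_dates, format='%d/%m/%Y')

print(parsed.dt.day.tolist())
print(parsed.dt.month.tolist())
```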

In [30]:
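# Note: this inspects the column object itself, not an element;
# the datetime64[ns] dtype of the values is confirmed by .info() below
type(df_reason_mod['Date'])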

pandas.core.series.Series

In [31]:
df_reason_mod.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   Reason_1                   700 non-null    bool          
 1   Reason_2                   700 non-null    bool          
 2   Reason_3                   700 non-null    bool          
 3   Reason_4                   700 non-null    bool          
 4   Date                       700 non-null    datetime64[ns]
 5   Transportation Expense     700 non-null    int64         
 6   Distance to Work           700 non-null    int64         
 7   Age                        700 non-null    int64         
 8   Daily Work Load Average    700 non-null    float64       
 9   Body Mass Index            700 non-null    int64         
 10  Education                  700 non-null    int64         
 11  Children                   700 non-null    int64         
 12  Pets    

## Extract the Month Value:

In [32]:
df_reason_mod['Date'][0].month

7

In [33]:
list_months = []
list_months

[]

In [35]:
for i in range(df_reason_mod.shape[0]):
    list_months.append(df_reason_mod['Date'][i].month)

In [36]:
len(list_months)

700

In [37]:
df_reason_mod['Month Value'] = list_months
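The explicit loop works, but pandas also offers the `.dt` accessor, which extracts the month for the whole column at once. The following sketch, using hypothetical toy dates, checks that both approaches agree:

```python
import pandas as pd

# Hypothetical toy dates, parsed the same way as the 'Date' column
toy_dates = pd.to_datetime(
    pd.Series(['07/07/2015', '14/08/2015', '15/09/2015']), format='%d/%m/%Y'
)

# Loop version, mirroring the cells above
months_loop = [toy_dates[i].month for i in range(toy_dates.shape[0])]

# Vectorized version using the .dt accessor
months_vectorized = toy_dates.dt.month

assert months_loop == months_vectorized.tolist()
print(months_vectorized.tolist())
```

On the real data this would amount to assigning `df_reason_mod['Date'].dt.month` to the new column, assuming the column has already been converted with `pd.to_datetime`.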

In [38]:
df_reason_mod.head(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value
0,False,False,False,True,2015-07-07,289,36,33,239.554,30,1,2,1,4,7
1,False,False,False,False,2015-07-14,118,13,50,239.554,31,1,1,0,0,7
2,False,False,False,True,2015-07-15,179,51,38,239.554,31,1,0,0,2,7
3,True,False,False,False,2015-07-16,279,5,39,239.554,24,1,2,0,4,7
4,False,False,False,True,2015-07-23,289,36,33,239.554,30,1,2,1,2,7
5,False,False,False,True,2015-07-10,179,51,38,239.554,31,1,0,0,2,7
6,False,False,False,True,2015-07-17,361,52,28,239.554,27,1,1,4,8,7
7,False,False,False,True,2015-07-24,260,50,36,239.554,23,1,4,0,4,7
8,False,False,True,False,2015-07-06,155,12,34,239.554,25,1,2,0,40,7
9,False,False,False,True,2015-07-13,235,11,37,239.554,29,3,1,1,8,7


## Extract the Day of the Week:

In [41]:
def date_to_weekday(date_value):
    return date_value.weekday()

In [42]:
df_reason_mod['Day of the Week'] = df_reason_mod['Date'].apply(date_to_weekday)
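The weekday numbers follow the Python convention in which Monday is 0 and Sunday is 6. As a hedged illustration with hypothetical toy dates, `.dt.weekday` gives the same numbers without a helper function, and `.dt.day_name()` can be used to double-check the encoding:

```python
import pandas as pd

# Hypothetical toy dates: a Monday, a Tuesday, and a Friday in July 2015
toy_dates = pd.to_datetime(
    pd.Series(['06/07/2015', '07/07/2015', '10/07/2015']), format='%d/%m/%Y'
)

# Monday=0 ... Sunday=6
weekday_numbers = toy_dates.dt.weekday

# Human-readable names, handy for verifying the numeric encoding
weekday_names = toy_dates.dt.day_name()

print(list(zip(weekday_numbers.tolist(), weekday_names.tolist())))
```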

In [43]:
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,False,False,False,True,2015-07-07,289,36,33,239.554,30,1,2,1,4,7,1
1,False,False,False,False,2015-07-14,118,13,50,239.554,31,1,1,0,0,7,1
2,False,False,False,True,2015-07-15,179,51,38,239.554,31,1,0,0,2,7,2
3,True,False,False,False,2015-07-16,279,5,39,239.554,24,1,2,0,4,7,3
4,False,False,False,True,2015-07-23,289,36,33,239.554,30,1,2,1,2,7,3


In [44]:
df_reason_mod = df_reason_mod.drop(['Date'], axis = 1)
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours,Month Value,Day of the Week
0,False,False,False,True,289,36,33,239.554,30,1,2,1,4,7,1
1,False,False,False,False,118,13,50,239.554,31,1,1,0,0,7,1
2,False,False,False,True,179,51,38,239.554,31,1,0,0,2,7,2
3,True,False,False,False,279,5,39,239.554,24,1,2,0,4,7,3
4,False,False,False,True,289,36,33,239.554,30,1,2,1,2,7,3


In [45]:
df_reason_mod.columns.values

array(['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education',
       'Children', 'Pets', 'Absenteeism Time in Hours', 'Month Value',
       'Day of the Week'], dtype=object)

In [46]:
column_names_upd = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Month Value', 'Day of the Week',
       'Transportation Expense', 'Distance to Work', 'Age',
       'Daily Work Load Average', 'Body Mass Index', 'Education', 'Children',
       'Pets', 'Absenteeism Time in Hours']

In [47]:
df_reason_mod = df_reason_mod[column_names_upd]
df_reason_mod.head()

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,False,False,False,True,7,1,289,36,33,239.554,30,1,2,1,4
1,False,False,False,False,7,1,118,13,50,239.554,31,1,1,0,0
2,False,False,False,True,7,2,179,51,38,239.554,31,1,0,0,2
3,True,False,False,False,7,3,279,5,39,239.554,24,1,2,0,4
4,False,False,False,True,7,3,289,36,33,239.554,30,1,2,1,2
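Typing the full column list by hand is error-prone when there are many columns. One possible alternative, sketched here on a hypothetical toy frame (`toy`, `front`, and `rest` are illustrative names), is to list only the columns that should move and derive the rest:

```python
import pandas as pd

# Hypothetical toy frame with the date-derived columns at the end
toy = pd.DataFrame({'A': [1], 'B': [2], 'Month Value': [7], 'Day of the Week': [1]})

# Columns to move to the front, in the desired order
front = ['Month Value', 'Day of the Week']

# Remaining columns keep their existing relative order
rest = [c for c in toy.columns if c not in front]

toy = toy[front + rest]
print(list(toy.columns))
```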


In [48]:
df_reason_date_mod = df_reason_mod.copy()

## Column: 'Education'

The original Education variable included four levels:
1 = High School, 2 = Graduate, 3 = Postgraduate, 4 = Doctor.

To reduce dimensionality and simplify interpretation in the logistic regression model, these were consolidated into a binary feature:

- 0: High School (basic education)
- 1: Higher Education (any level above high school)

This transformation emphasizes the distinction between employees with and without education beyond high school. Levels 2 to 4 are sparsely populated (117 of 700 rows combined, per the counts below), so pooling them is a pragmatic choice; whether the binary feature predicts absenteeism better than the original levels is not tested here.

In [55]:
df_reason_date_mod['Education'].value_counts()

Education
1    583
3     73
2     40
4      4
Name: count, dtype: int64

In [56]:
df_reason_date_mod['Education'] = df_reason_date_mod['Education'].map({1:0, 2:1, 3:1, 4:1})
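A point worth keeping in mind with `.map()` and a dictionary is that any value missing from the dictionary becomes NaN. The sketch below, on hypothetical toy data, compares the dictionary mapping with an equivalent comparison and shows the NaN caveat for a made-up level 5:

```python
import pandas as pd

# Hypothetical toy education levels
toy_education = pd.Series([1, 2, 3, 4, 1])

# Explicit mapping, as in the cell above
mapped = toy_education.map({1: 0, 2: 1, 3: 1, 4: 1})

# Equivalent comparison: anything above level 1 becomes 1
compared = (toy_education > 1).astype(int)
assert mapped.tolist() == compared.tolist()

# Caveat: a value absent from the dictionary (hypothetical level 5) maps to NaN
unmapped = pd.Series([1, 5]).map({1: 0, 2: 1, 3: 1, 4: 1})
print(unmapped.isna().tolist())
```

Checking `value_counts()` before and after the mapping, as this notebook does, is a simple way to confirm that no rows were lost to NaN.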

In [58]:
df_reason_date_mod['Education'].value_counts()

Education
0    583
1    117
Name: count, dtype: int64

## Final Checkpoint

In [59]:
df_preprocessed = df_reason_date_mod.copy()
df_preprocessed.head(10)

Unnamed: 0,Reason_1,Reason_2,Reason_3,Reason_4,Month Value,Day of the Week,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,False,False,False,True,7,1,289,36,33,239.554,30,0,2,1,4
1,False,False,False,False,7,1,118,13,50,239.554,31,0,1,0,0
2,False,False,False,True,7,2,179,51,38,239.554,31,0,0,0,2
3,True,False,False,False,7,3,279,5,39,239.554,24,0,2,0,4
4,False,False,False,True,7,3,289,36,33,239.554,30,0,2,1,2
5,False,False,False,True,7,4,179,51,38,239.554,31,0,0,0,2
6,False,False,False,True,7,4,361,52,28,239.554,27,0,1,4,8
7,False,False,False,True,7,4,260,50,36,239.554,23,0,4,0,4
8,False,False,True,False,7,0,155,12,34,239.554,25,0,2,0,40
9,False,False,False,True,7,0,235,11,37,239.554,29,1,1,1,8


In [None]:
df_preprocessed.to_csv('Absenteeism_preprocessed_01.csv', index = False)
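Passing `index = False` keeps the row index out of the file. As a rough illustration of why that matters, the sketch below writes a hypothetical toy frame to in-memory buffers (instead of real files) with and without the index and reads each back:

```python
import io
import pandas as pd

# Hypothetical toy frame standing in for the preprocessed data
toy = pd.DataFrame({'Reason_1': [True, False], 'Age': [33, 50]})

# Without the index: the columns round-trip unchanged
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True: reading back adds an 'Unnamed: 0' column
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```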