Problem Statement: Analyzing and Predicting Employee Absenteeism
You are provided with a dataset containing records of Employee absenteeism. The dataset includes various attributes such as the reason for absence, transportation
expenses, distance to work, age, daily workload, body mass index (BMI), education level, number of children, number of pets, and the number of hours absent. Your
task is to analyze this data, derive insights, and build a predictive model to forecast absenteeism hours based on the given features.

Objectives:
1. Data Analysis and Visualization:
Perform exploratory data analysis (EDA) to understand the distribution of the data.
Identify any correlations between the features and absenteeism hours.
Visualize the data using Tableau to create insightful dashboards.

2. Data Preprocessing:
Handle missing values, if any.
Encode categorical variables appropriately.
Normalize or standardize the data if necessary.

3. Statistical Analysis:
Conduct hypothesis testing to determine if certain factors significantly affect absenteeism.
Perform regression analysis to understand the relationship between independent variables and absenteeism hours.

4. Machine Learning Model Building:
Split the data into training and testing sets.
Build and evaluate multiple regression models (e.g., Linear Regression, Decision Tree, Random Forest) to predict absenteeism hours.
Use cross-validation to ensure the robustness of your models.
Compare the performance of different models using appropriate metrics (e.g., RMSE, MAE, R²).

In [1]:
import warnings 
warnings.filterwarnings(action="ignore")

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns 

### load the data

In [3]:
df = pd.read_csv("Absenteeism.csv")
df.head(2)

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
0,11,26,07/07/2015,289,36,33,239.554,30,1,2,1,4
1,36,0,14/07/2015,118,13,50,239.554,31,1,1,0,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   ID                         700 non-null    int64  
 1   Reason for Absence         700 non-null    int64  
 2   Date                       700 non-null    object 
 3   Transportation Expense     700 non-null    int64  
 4   Distance to Work           700 non-null    int64  
 5   Age                        700 non-null    int64  
 6   Daily Work Load Average    700 non-null    float64
 7   Body Mass Index            700 non-null    int64  
 8   Education                  700 non-null    int64  
 9   Children                   700 non-null    int64  
 10  Pets                       700 non-null    int64  
 11  Absenteeism Time in Hours  700 non-null    int64  
dtypes: float64(1), int64(10), object(1)
memory usage: 65.8+ KB


In [5]:
df.describe()

Unnamed: 0,ID,Reason for Absence,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,17.951429,19.411429,222.347143,29.892857,36.417143,271.801774,26.737143,1.282857,1.021429,0.687143,6.761429
std,11.028144,8.356292,66.31296,14.804446,6.379083,40.021804,4.254701,0.66809,1.112215,1.166095,12.670082
min,1.0,0.0,118.0,5.0,27.0,205.917,19.0,1.0,0.0,0.0,0.0
25%,9.0,13.0,179.0,16.0,31.0,241.476,24.0,1.0,0.0,0.0,2.0
50%,18.0,23.0,225.0,26.0,37.0,264.249,25.0,1.0,1.0,0.0,3.0
75%,28.0,27.0,260.0,50.0,40.0,294.217,31.0,1.0,2.0,1.0,8.0
max,36.0,28.0,388.0,52.0,58.0,378.884,38.0,4.0,4.0,8.0,120.0


In [6]:
df.isna().sum()

ID                           0
Reason for Absence           0
Date                         0
Transportation Expense       0
Distance to Work             0
Age                          0
Daily Work Load Average      0
Body Mass Index              0
Education                    0
Children                     0
Pets                         0
Absenteeism Time in Hours    0
dtype: int64

In [7]:
df.isnull().sum()

ID                           0
Reason for Absence           0
Date                         0
Transportation Expense       0
Distance to Work             0
Age                          0
Daily Work Load Average      0
Body Mass Index              0
Education                    0
Children                     0
Pets                         0
Absenteeism Time in Hours    0
dtype: int64

In [8]:
df.dtypes

ID                             int64
Reason for Absence             int64
Date                          object
Transportation Expense         int64
Distance to Work               int64
Age                            int64
Daily Work Load Average      float64
Body Mass Index                int64
Education                      int64
Children                       int64
Pets                           int64
Absenteeism Time in Hours      int64
dtype: object

In [12]:
df['Date'] = df['Date'].astype('datetime64[ns]')

In [13]:
df.corr()

Unnamed: 0,ID,Reason for Absence,Date,Transportation Expense,Distance to Work,Age,Daily Work Load Average,Body Mass Index,Education,Children,Pets,Absenteeism Time in Hours
ID,1.0,-0.079111,-0.001759,-0.203788,-0.493562,0.041367,0.092873,-0.320718,-0.032889,0.026095,0.00538,-0.005469
Reason for Absence,-0.079111,1.0,0.078792,-0.13179,0.160059,-0.05521,-0.130406,0.052741,-0.060083,-0.050053,-0.032872,-0.175826
Date,-0.001759,0.078792,1.0,-0.040609,-0.051622,0.017997,-0.241125,-0.04861,0.2224,-0.121598,-0.01388,-0.011202
Transportation Expense,-0.203788,-0.13179,-0.040609,1.0,0.23494,-0.223828,0.006123,-0.140531,-0.054597,0.381749,0.446887,0.008342
Distance to Work,-0.493562,0.160059,-0.051622,0.23494,1.0,-0.131076,-0.073683,0.13619,-0.2826,0.048534,0.171585,-0.080593
Age,0.041367,-0.05521,0.017997,-0.223828,-0.131076,1.0,-0.045452,0.483762,-0.20933,0.04693,-0.252067,0.035784
Daily Work Load Average,0.092873,-0.130406,-0.241125,0.006123,-0.073683,-0.045452,1.0,-0.09843,-0.077012,0.032194,0.01049,0.029609
Body Mass Index,-0.320718,0.052741,-0.04861,-0.140531,0.13619,0.483762,-0.09843,1.0,-0.348758,-0.155711,-0.066484,-0.040203
Education,-0.032889,-0.060083,0.2224,-0.054597,-0.2826,-0.20933,-0.077012,-0.348758,1.0,-0.179521,-0.080899,-0.035621
Children,0.026095,-0.050053,-0.121598,0.381749,0.048534,0.04693,0.032194,-0.155711,-0.179521,1.0,0.116586,0.093661
