### Feature Transformation & Selection

**Project:** Data Mining I (2025/26)

**Group:** 15

**Members:**
- Beatriz Boura
- Dinis Gaspar
- Leonor Cardoso
- Margarida Cruz

#### Config & Load

In [3]:
# imports
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder # scalers and encoders
from sklearn.compose import ColumnTransformer # for combining different preprocessing steps
from sklearn.pipeline import Pipeline # for creating machine learning pipelines
from sklearn.feature_selection import VarianceThreshold # feature selection
from sklearn.decomposition import PCA # dimensionality reduction

# Constants
DATA_PATH = '../Datasets/absenteeism_data.csv' # TODO: replace with the cleaned dataset from the previous steps
EXPORT_PATH = '../Datasets'
TRANSFORMED_DATA_FILE = os.path.join(EXPORT_PATH, '4_features_transformed_absenteeism_data.csv') # file path for transformed data
READY_DATA_FILE = os.path.join(EXPORT_PATH, '4_features_ready_absenteeism_data.csv') # file path for ready data

ID_COL = 'ID' # identifier column
TARGET_COL = 'Absenteeism time in hours' # target column

# Display settings
pd.set_option('display.max_columns', 120)
pd.set_option('display.width', 160)

# Load data
data = pd.read_csv(DATA_PATH, sep=';')
print('Dimensions of the dataset:', data.shape)
display(data)

Dimensions of the dataset: (800, 22)


Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Seasons,Days since previous absence,Transportation expense,Distance from Residence to Work,Estimated commute time,Service time,Years until retirement,Date of Birth,Disciplinary failure,Education,Number of children,Social drinker,Social smoker,Number of pets,Weight,Height,Body mass index,Absenteeism time in hours
0,11,Unjustified absence,July,Tuesday,Summer,0.0,289,36,69,13,32,1992-08-15,No,1,2,Y,No,1,90,172,30,4
1,36,Unspecified,July,Tuesday,,0.0,118,13,26,18,15,1975-09-02,Yes,1,1,Y,No,0,98,178,31,0
2,3,Medical consultation,July,Wednesday,Summer,0.0,179,51,108,18,27,1987-04-08,No,1,0,Yes,No,0,89,170,31,2
3,7,Diseases of the eye and adnexa,July,Thursday,,0.0,279,5,5,14,26,1986-07-25,No,1,2,Yes,Yes,0,68,168,24,4
4,11,Medical consultation,July,Thursday,Summer,0.0,289,36,69,13,32,1992-08-15,No,1,2,Yes,No,1,90,172,30,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,11,Diseases of the genitourinary system,July,Tuesday,Summer,0.0,289,36,69,13,32,1992-08-15,No,1,2,Yes,No,1,90,172,30,8
796,1,Diseases of the digestive system,July,Tuesday,Summer,0.0,235,11,20,14,28,1988-06-01,No,3,1,No,No,1,88,172,29,4
797,4,Unspecified,,Tuesday,Summer,0.0,118,14,34,13,25,1985-10-20,No,1,1,Yes,No,8,98,170,34,0
798,8,Unspecified,,Wednesday,,0.0,231,35,63,14,26,1986-09-13,No,1,2,Yes,No,2,100,170,35,0
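The preview shows `Social drinker` spelled both as `Y` and as `Yes`, so the binary columns probably need a label clean-up before encoding. The block below is a minimal sketch on toy data. The mapping `YES_NO_MAP` and the helper `normalize_yes_no` are hypothetical names, and the `N` spelling is an assumption, since only `Y`, `Yes` and `No` appear in the rows shown above.

```python
import pandas as pd

# Toy frame mimicking the mixed labels seen in the preview (illustrative values only)
toy = pd.DataFrame({
    "Social drinker": ["Y", "Y", "Yes", "No"],
    "Social smoker": ["No", "No", "Yes", "No"],
})

# Hypothetical mapping; extend it if other spellings turn up in the real data
YES_NO_MAP = {"Y": "Yes", "Yes": "Yes", "N": "No", "No": "No"}


def normalize_yes_no(series: pd.Series) -> pd.Series:
    """Map known spellings to 'Yes'/'No'; unknown labels become NaN so they surface."""
    return series.str.strip().map(YES_NO_MAP)


toy_clean = toy.copy()
for col in ["Social drinker", "Social smoker"]:
    toy_clean[col] = normalize_yes_no(toy[col])

print(toy_clean["Social drinker"].value_counts().to_dict())
```

Labels missing from the mapping turn into NaN instead of passing through silently, which makes unexpected spellings easy to spot with `isna()`.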


### Initial Data Exploration

In [None]:
# Global view of the data
data.info()  # info() prints its summary and returns None, so it is not wrapped in display()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               800 non-null    int64  
 1   Reason for absence               740 non-null    object 
 2   Month of absence                 737 non-null    object 
 3   Day of the week                  740 non-null    object 
 4   Seasons                          573 non-null    object 
 5   Days since previous absence      781 non-null    float64
 6   Transportation expense           800 non-null    int64  
 7   Distance from Residence to Work  800 non-null    int64  
 8   Estimated commute time           800 non-null    int64  
 9   Service time                     800 non-null    object 
 10  Years until retirement           800 non-null    int64  
 11  Date of Birth                    800 non-null    object 
 12  Disciplinary failure  
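`Service time` is reported as `object` although it has no missing values, which suggests that at least one entry is not a plain number. The offending values are not visible in the output above, so the sketch below uses toy data in which `"unknown"` is a made-up bad entry. It lists the values that fail to parse before converting the column.

```python
import pandas as pd

# Toy column; "unknown" is an invented example of a non-numeric entry
toy = pd.DataFrame({"Service time": ["13", "18", "14", "unknown", "18"]})

# Values that cannot be parsed become NaN instead of raising an error
as_number = pd.to_numeric(toy["Service time"], errors="coerce")

# Entries that were present in the raw column but failed to parse
failed = as_number.isna() & toy["Service time"].notna()
bad_values = list(toy.loc[failed, "Service time"].unique())
print("Non-numeric entries:", bad_values)

toy["Service time"] = as_number
print(toy["Service time"].dtype)
```

Inspecting `bad_values` first helps decide whether those entries should be imputed, mapped to a number, or dropped, rather than silently becoming NaN.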

In [None]:
# Statistical summary
display(data.describe(include='all').T.head(15))

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
ID,800.0,,,,17.985,10.952156,1.0,10.0,18.0,28.0,36.0
Reason for absence,740.0,28.0,Medical consultation,149.0,,,,,,,
Month of absence,737.0,12.0,March,87.0,,,,,,,
Day of the week,740.0,6.0,Monday,159.0,,,,,,,
Seasons,573.0,4.0,Spring,155.0,,,,,,,
Days since previous absence,781.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Transportation expense,800.0,,,,221.9275,66.778732,118.0,179.0,225.0,260.0,388.0
Distance from Residence to Work,800.0,,,,29.79875,14.875057,5.0,16.0,26.0,50.0,52.0
Estimated commute time,800.0,,,,59.34875,31.301067,5.0,31.0,52.0,94.0,114.0
Service time,800.0,19.0,18,150.0,,,,,,,
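The summary shows `Days since previous absence` with a standard deviation of 0, so the column carries no information. One way to catch such columns, sketched here on toy data built from the first preview rows, is the `VarianceThreshold` selector already imported above. The threshold of 0.0 is an assumption that removes only constant features.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy numeric frame: the first column is constant, as in the summary above
toy = pd.DataFrame({
    "Days since previous absence": [0.0, 0.0, 0.0, 0.0],
    "Transportation expense": [289, 118, 179, 279],
    "Distance from Residence to Work": [36, 13, 51, 5],
})

# threshold=0.0 keeps only features whose variance is strictly positive
selector = VarianceThreshold(threshold=0.0)
selector.fit(toy)

mask = selector.get_support()  # boolean array, True for kept columns
kept = toy.columns[mask].tolist()
dropped = toy.columns[~mask].tolist()

print("Kept:", kept)
print("Dropped:", dropped)
```

The selector works on numeric input only, so on the full dataset it would be applied to the numeric columns after the type fixes.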


In [None]:
# Unique values per feature
display(data.nunique().sort_values(ascending=False).head(10))

ID                                 36
Date of Birth                      36
Estimated commute time             29
Reason for absence                 28
Weight                             27
Distance from Residence to Work    25
Transportation expense             24
Years until retirement             22
Absenteeism time in hours          19
Service time                       19
dtype: int64
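`ID` and `Date of Birth` both have 36 unique values, which is consistent with one birth date per employee. The sketch below checks that on toy data and derives an age in whole years. `REFERENCE_DATE` and the `Age` column are hypothetical choices, not something defined earlier in this notebook.

```python
import pandas as pd

# Toy rows taken from the preview: ID 11 appears twice with the same birth date
toy = pd.DataFrame({
    "ID": [11, 36, 3, 11],
    "Date of Birth": ["1992-08-15", "1975-09-02", "1987-04-08", "1992-08-15"],
})

# Does every ID map to exactly one birth date?
births_per_id = toy.groupby("ID")["Date of Birth"].nunique()
is_per_employee = bool((births_per_id == 1).all())
print("One birth date per ID:", is_per_employee)

# Hypothetical reference date; pick the date that fits the study period
REFERENCE_DATE = pd.Timestamp("2025-01-01")

birth = pd.to_datetime(toy["Date of Birth"], format="%Y-%m-%d")

# True where the birthday has already occurred in the reference year
had_birthday = (birth.dt.month < REFERENCE_DATE.month) | (
    (birth.dt.month == REFERENCE_DATE.month) & (birth.dt.day <= REFERENCE_DATE.day)
)

# Whole years elapsed between birth and the reference date
toy["Age"] = REFERENCE_DATE.year - birth.dt.year - (~had_birthday).astype(int)
print(toy[["ID", "Age"]])
```

If the check holds on the real data, `Date of Birth` is an employee-level attribute, and a numeric age is usually easier to scale than a raw date string.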

### Feature Transformation