#**Environment**

In [None]:
# !pip install ydata_profiling

In [1]:
# Google Colab or Locally
try:
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_PATH = "/content/drive/MyDrive/MLOps/MLOps_Project"
    print("Running in Google Colab... Drive mounted.")
except ModuleNotFoundError:
    DATA_PATH = "./"
    print("Running Locally... using local data path.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Running in Google Colab... Drive mounted.


In [2]:
# Libraries
import pandas as pd
import numpy as np
import os
import warnings
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Optional Import Ydata Profiling
try:
    from ydata_profiling import ProfileReport
    HAS_PROFILING = True
except Exception:
    HAS_PROFILING = False
    print("ydata_profiling not available (optional).\nTo install: pip install ydata-profiling")

#**Absenteeism at Work**

##**Dataset**

In [3]:
df = pd.read_csv(os.path.join(os.path.abspath(DATA_PATH), "work_absenteeism.csv"))

##**Analysis**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 754 entries, 0 to 753
Data columns (total 22 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   ID                               746 non-null    object 
 1   Reason for absence               748 non-null    object 
 2   Month of absence                 743 non-null    object 
 3   Day of the week                  746 non-null    object 
 4   Seasons                          750 non-null    object 
 5   Transportation expense           746 non-null    object 
 6   Distance from Residence to Work  743 non-null    object 
 7   Service time                     747 non-null    object 
 8   Age                              749 non-null    object 
 9   Work load Average/day            741 non-null    object 
 10  Hit target                       746 non-null    object 
 11  Disciplinary failure             747 non-null    object 
 12  Education             

In [6]:
df.head(3)

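|   | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours | mixed_type_col |
|---|----|--------------------|------------------|-----------------|---------|------------------------|---------------------------------|--------------|-----|-----------------------|-----|-----------|-----|----------------|---------------|-----|--------|--------|-----------------|---------------------------|----------------|
| 0 | 11.0 | 26.0 | 7.0 | 3.0 | 1.0 | 289.0 | 36.0 | 13.0 | 33.0 | 239.554 | ... | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 90.0 | 172.0 | 30.0 | 4.0 | 535 |
| 1 | 36.0 | 0.0 | 7.0 | 3.0 | 1.0 | 118.0 | 13.0 | 18.0 | 50.0 | 239.554 | ... | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 98.0 | 178.0 | 31.0 | 0.0 | 584 |
| 2 | 3.0 | 23.0 | 7.0 | 4.0 | 1.0 | 179.0 | 51.0 | 18.0 | 38.0 | 239.554 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 89.0 | 170.0 | 31.0 | 2.0 | 249 |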


In [7]:
baseline_info = {
    "shape": df.shape,
    "dtypes": df.dtypes.value_counts().to_dict(),
    "nulls": df.isna().sum().sum()
}
baseline_info

{'shape': (754, 22),
 'dtypes': {dtype('O'): 20, dtype('float64'): 2},
 'nulls': np.int64(275)}

In [8]:
# ProfileReport is only defined when the optional import above succeeded
if HAS_PROFILING:
    profile = ProfileReport(df, title='Summary Report')
    profile.to_notebook_iframe()


##**Data Cleaning**

The overall goal is to **correct invalid values, handle outliers, and fill in missing data**.

##### **🗑️ Removed Variables**

| Variable          | Reason for Removal                                               |
|-------------------|-----------------------------------------------------------------|
| `ID`              | It's a unique identifier with no predictive value.              |
| `mixed_type_col`  | Contained mixed or inconsistent values that didn't provide useful information to the model. |

These columns were removed at the beginning of the pipeline to reduce noise and prevent errors during later processing.

---

##### **🔠 Categorical Variables**

| Variable              | Applied Transformation                                 | Justification                                                                                  |
|-----------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------------------|
| `Reason for absence`   | Values outside the `0–28` range were replaced with `0` (`Unknown`). | Ensures consistency with the defined code mapping and avoids errors from corrupt values.      |
| `Month of absence`     | Values outside the `0–12` range were replaced with `0` (`Unknown`). | Ensures consistency with the defined code mapping and avoids errors from corrupt values.      |
| `Day of the week`      | Values outside the `2–6` range were imputed with the mode.          | Only workdays exist in this dataset; imputing with the most frequent value prevents distortion. |
| `Seasons`              | Values outside the `1–4` range were imputed with the mode.          | Limited to 4 seasons; imputing the mode avoids data loss.                                     |
| `Education`            | Values outside the `1–4` range were imputed with the mode.          | Only valid education levels are retained; the mode preserves general distribution.            |
| `Disciplinary failure` | Values other than `0` or `1` were imputed with the mode.            | This binary variable must be 0 or 1; invalid entries take the most frequent value.            |
| `Social drinker`       | Values other than `0` or `1` were imputed with the mode.            | Ensures consistency in this binary variable.                                                  |
| `Social smoker`        | Values other than `0` or `1` were imputed with the mode.            | Maintains integrity of binary data.                                                           |
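
The two rules in the table (replace with `0`, or impute with the mode) can be illustrated on toy columns. This is a minimal sketch, not the notebook's own `fix_invalid_values`: the helper name `replace_invalid` and the sample values are hypothetical, and it assumes the columns have already been coerced to numbers.

```python
import numpy as np
import pandas as pd


def replace_invalid(series, valid_values, fallback=None):
    """Keep values found in valid_values; replace the rest, including NaN.

    When fallback is None, the mode of the valid values is used.
    """
    valid_mask = series.isin(valid_values)
    if fallback is None:
        fallback = series[valid_mask].mode().iloc[0]
    return series.where(valid_mask, fallback)


# Hypothetical samples: 99 and NaN are invalid in both columns.
reason = pd.Series([26.0, 0.0, 99.0, np.nan, 23.0])
day = pd.Series([3.0, 3.0, 99.0, np.nan, 4.0])

# "Reason for absence" rule: anything outside 0-28 becomes 0 (Unknown).
reason_fixed = replace_invalid(reason, list(range(0, 29)), fallback=0)
# "Day of the week" rule: anything outside 2-6 becomes the mode.
day_fixed = replace_invalid(day, list(range(2, 7)))

print(reason_fixed.tolist())  # [26.0, 0.0, 0.0, 0.0, 23.0]
print(day_fixed.tolist())     # [3.0, 3.0, 3.0, 3.0, 4.0]
```

Note that missing values are treated as invalid here, so they receive the same replacement as out-of-range codes.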

---

##### **🔢 Numerical Variables**

| Variable                  | Applied Transformations                           | Justification                                                         |
|---------------------------|-------------------------------------------------|----------------------------------------------------------------------|
| `Transportation expense`   | IQR Winsorization + Median Imputation + Final Rounding | Controls outliers, fills in missing values, and formats as integers for modeling. |
| `Distance from Residence to Work` | IQR Winsorization + Median Imputation + Final Rounding | Caps extreme distances and fills in missing values.                  |
| `Service time`            | IQR Winsorization + Median Imputation + Final Rounding | Improves consistency in employment-related data.                      |
| `Age`                     | IQR Winsorization + Median Imputation + Final Rounding | Ensures a logical age range and complete data.                        |
| `Work load Average/day`   | IQR Winsorization + Median Imputation + Final Rounding | Reduces impact of extreme values and fills in missing data; rounding drops the decimals (e.g. `239.554` becomes `240`). |
| `Hit target`              | IQR Winsorization + Median Imputation + Final Rounding | Adjusts the variable to a realistic and coherent range.              |
| `Son`                     | IQR Winsorization + Median Imputation + Final Rounding | Though discrete, extreme and missing values are treated.             |
| `Pet`                     | IQR Winsorization + Median Imputation + Final Rounding | Kept as an integer while controlling for outliers.                   |
| `Weight`                  | IQR Winsorization + Median Imputation + Final Rounding | Ensures values fall within a physiologically plausible range.         |
| `Height`                  | IQR Winsorization + Median Imputation + Final Rounding | Controls for logical range and completes data.                        |
| `Body mass index`         | IQR Winsorization + Median Imputation + Final Rounding | Improves consistency of this index derived from weight and height.   |
| `Absenteeism time in hours` | IQR Winsorization + Median Imputation + Final Rounding | Target variable; extreme absences are capped so they do not dominate model training. |
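
As a worked example of the three numerical steps, the sketch below applies them to a single toy column. It assumes the usual fences of 1.5 × IQR beyond the quartiles, which are the bounds `winsorize_iqr` computes; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one extreme value (120) and one missing value.
hours = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 120, np.nan])

q1 = hours.quantile(0.25)      # 3.0 (NaN is ignored)
q3 = hours.quantile(0.75)      # 7.0
iqr = q3 - q1                  # 4.0
lower = q1 - 1.5 * iqr         # -3.0
upper = q3 + 1.5 * iqr         # 13.0

clipped = hours.clip(lower=lower, upper=upper)   # 120 -> 13, NaN stays NaN
filled = clipped.fillna(clipped.median())        # median of the capped values is 5
cleaned = filled.round(0).astype(int)

print(cleaned.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 13, 5]
```

Because the median is taken after clipping, the extreme value does not influence the imputed number.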

---


####**Auxiliary Functions**

In [9]:
def drop_columns(df):
    return df.drop(['ID', 'mixed_type_col'], axis=1)


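def strip_object_columns(df):
    # Strip whitespace from string cells only; NaN and numbers pass through
    # unchanged instead of being turned into the text 'nan'.
    for col in df.select_dtypes(include='object').columns:
        df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)
    return df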


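def safe_round_to_int_df(df):
    def safe_convert(val):
        # Anything that cannot be parsed as a finite number becomes NaN.
        # Note: Python's round() rounds halves to the nearest even integer.
        try:
            if pd.isnull(val):
                return np.nan
            return round(float(val))
        except (TypeError, ValueError, OverflowError):
            return np.nan

    for col in df.columns:
        df[col] = df[col].apply(safe_convert)
    return df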


def fix_invalid_values(df):
    def replace_invalid(col, valid, fallback=None):
        # Keep valid codes; replace everything else (including NaN).
        # Without an explicit fallback, use the mode of the valid values,
        # computed once per column.
        mask = df[col].isin(valid)
        if fallback is None:
            fallback = df.loc[mask, col].mode()[0]
        df[col] = df[col].where(mask, fallback)

    replace_invalid('Reason for absence', list(range(0, 29)), fallback=0)
    replace_invalid('Month of absence', list(range(0, 13)), fallback=0)
    replace_invalid('Day of the week', list(range(2, 7)))
    replace_invalid('Seasons', list(range(1, 5)))
    replace_invalid('Education', list(range(1, 5)))

    for col in ['Disciplinary failure', 'Social drinker', 'Social smoker']:
        replace_invalid(col, [0, 1])

    return df


def winsorize_iqr(df):
    num_cols = [
        'Transportation expense', 'Distance from Residence to Work', 'Service time',
        'Age', 'Work load Average/day', 'Hit target', 'Son', 'Pet', 'Weight',
        'Height', 'Body mass index', 'Absenteeism time in hours'
    ]

    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)
    return df


def fillna_with_median(df):
    return df.fillna(df.median(numeric_only=True))


def final_int_conversion(df):
    return df.round(0).astype(int)

####**Pipeline**

In [10]:
preprocessing_pipeline = Pipeline([
    ('drop_columns', FunctionTransformer(drop_columns)),
    ('strip_objects', FunctionTransformer(strip_object_columns)),
    ('safe_round', FunctionTransformer(safe_round_to_int_df)),
    ('fix_invalids', FunctionTransformer(fix_invalid_values)),
    ('winsorize', FunctionTransformer(winsorize_iqr)),
    ('fillna', FunctionTransformer(fillna_with_median)),
    ('final_int', FunctionTransformer(final_int_conversion))
])
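
When a step misbehaves, it can help to run only the first few steps and inspect the intermediate frame. The sketch below shows the idea on a toy pipeline, assuming a scikit-learn version that supports slicing a `Pipeline`; the functions `drop_id` and `fill_median` and the sample data are hypothetical stand-ins for the steps of `preprocessing_pipeline`.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def drop_id(df):
    # Stand-in for a column-dropping step.
    return df.drop(columns=["ID"])


def fill_median(df):
    # Stand-in for a median-imputation step.
    return df.fillna(df.median(numeric_only=True))


toy_pipeline = Pipeline([
    ("drop", FunctionTransformer(drop_id)),
    ("fill", FunctionTransformer(fill_median)),
])

toy = pd.DataFrame({"ID": [1, 2, 3], "Age": [30.0, None, 40.0]})

# Run only the first step to see what the second step will receive.
after_drop = toy_pipeline[:1].fit_transform(toy.copy())
# Run the whole pipeline.
result = toy_pipeline.fit_transform(toy.copy())

print(list(after_drop.columns))  # ['Age']
print(result["Age"].tolist())    # [30.0, 35.0, 40.0]
```

The same slicing pattern, for example `preprocessing_pipeline[:3]`, would return the frame as it looks after the first three cleaning steps.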

####**Clean Dataset**

In [11]:
df_clean = preprocessing_pipeline.fit_transform(df.copy())

In [12]:
# Save under the cleaned file name so the raw input file is not overwritten
df_clean.to_csv(os.path.join(os.path.abspath(DATA_PATH), "work_absenteeism_clean.csv"), index=False)

##**Analysis**

In [13]:
comparison_info = {
    "Before (shape)": df.shape,
    "After (shape)": df_clean.shape,
    "Before (nulls)": df.isna().sum().sum(),
    "After (nulls)": df_clean.isna().sum().sum(),
    "Before (dtypes)": df.dtypes.value_counts().to_dict(),
    "After (dtypes)": df_clean.dtypes.value_counts().to_dict()
}
comparison_info

{'Before (shape)': (754, 22),
 'After (shape)': (754, 20),
 'Before (nulls)': np.int64(275),
 'After (nulls)': np.int64(0),
 'Before (dtypes)': {dtype('O'): 20, dtype('float64'): 2},
 'After (dtypes)': {dtype('int64'): 20}}
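
The summary above covers shape, nulls, and dtypes, but not the value ranges listed in the cleaning tables. A minimal sketch of such a check follows; the helper `check_ranges` and the toy frame are hypothetical, while the ranges themselves are taken from the Categorical Variables table.

```python
import pandas as pd

# Valid code ranges (inclusive) from the Categorical Variables table.
VALID_RANGES = {
    "Reason for absence": (0, 28),
    "Month of absence": (0, 12),
    "Day of the week": (2, 6),
    "Seasons": (1, 4),
    "Education": (1, 4),
    "Disciplinary failure": (0, 1),
    "Social drinker": (0, 1),
    "Social smoker": (0, 1),
}


def check_ranges(df, ranges=VALID_RANGES):
    """Return {column: number of out-of-range values} for columns present in df."""
    problems = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # between() is inclusive on both ends; NaN counts as out of range.
            bad = int((~df[col].between(low, high)).sum())
            if bad:
                problems[col] = bad
    return problems


# Hypothetical frame with one invalid Education code (7).
toy_clean = pd.DataFrame({"Seasons": [1, 4, 2], "Education": [1, 3, 7]})
print(check_ranges(toy_clean))  # {'Education': 1}
```

Calling `check_ranges(df_clean)` on the cleaned frame should return an empty dictionary if every rule was applied.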

In [14]:
# ProfileReport is only defined when the optional import above succeeded
if HAS_PROFILING:
    profile = ProfileReport(df_clean, title='Summary Report')
    profile.to_notebook_iframe()


## 📘 Data Versioning

| Version | File | Description of Changes | Date | Responsible |
|---------|------|------------------------|------|-------------|
| v1 | `work_absenteeism_original.csv` | Original dataset without cleaning, contains null values and outliers. | 2025-10-09 | Team 59 |
| v2 | `work_absenteeism_clean.csv` | Cleaned dataset: removal of irrelevant columns, correction of invalid values, null value imputation, and outlier winsorization. | 2025-10-10 | Jorge / Miguel |

Both versions are managed under DVC, allowing full traceability of the data used during the analysis and modeling phases.
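
One way to confirm which version of a file is on disk is to compare its content hash with the one recorded by the versioning tool. The sketch below computes an MD5 digest with the standard library, on the assumption that the DVC metadata for the file stores an MD5 hash to compare against; the helper `file_md5` and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_digest = file_md5(demo_path)

print(demo_digest)  # 5d41402abc4b2a76b9719d911017c592
```

In the notebook, the same helper could be pointed at the dataset path built from `DATA_PATH` and the result compared by eye with the recorded hash.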