# Phase 2 - Data Preprocessing

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats.mstats import winsorize
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
import statsmodels.stats as sm_stats
import statsmodels.api as sm
plt.rcParams["font.family"] = "DejaVu Sans"

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

In [25]:
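def read_any(path):
    """Read a CSV file, letting pandas sniff the delimiter; fall back to ';'."""
    try:
        # sep=None with the python engine auto-detects the delimiter
        return pd.read_csv(path, sep=None, engine="python", encoding="utf-8-sig")
    except Exception:
        # Delimiter detection failed: assume a semicolon-separated file
        return pd.read_csv(path, sep=";", engine="python", encoding="utf-8-sig")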

patient = read_any("data/patient.csv")
station = read_any("data/station.csv")
observation = read_any("data/observation.csv")


## 2.1 Implementing Data Preprocessing

# A


In [26]:
def remove_outliers_iqr(series):
    """Return the values of `series` inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series[(series >= lower) & (series <= upper)]


low_outlier_columns = [
    "SpO₂", "PI", "RR", "PRV", "Skin Temperature", "Motion/Activity index",
    "PVI", "Hb level", "SV", "Blood Flow Index", "PPG waveform features",
    "Signal Quality Index", "Respiratory effort",
]
# Per-column preview only: the filtered series is overwritten on every iteration
# and is not used afterwards; the row-wise filtering is done by the function below.
for column in low_outlier_columns:
    data = observation[column]
    data_without_outliers = remove_outliers_iqr(data)

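def removing_outliers_low_percentatge(df, columns):
    """Drop rows that fall outside the 1.5*IQR fences of any listed column.

    Columns are processed one after another, so the quartiles of each column
    are computed on the rows that survived the previous columns.
    """
    cleaned_df = df.copy()
    for col in columns:
        if col in cleaned_df.columns:
            Q1 = cleaned_df[col].quantile(0.25)
            Q3 = cleaned_df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR

            # Keep only the rows whose value lies inside the fences
            cleaned_df = cleaned_df[(cleaned_df[col] >= lower) & (cleaned_df[col] <= upper)]

    return cleaned_df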

In [27]:
observation_clean = removing_outliers_low_percentatge(observation, low_outlier_columns)

In [28]:
observation_clean.shape

(11118, 23)
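
As a rough check on how many rows the 1.5 × IQR rule flags in each column, the share of out-of-range values can be computed before any rows are dropped. The sketch below is illustrative only: `iqr_outlier_share` is a hypothetical helper that is not part of the notebook, and the small frame is synthetic rather than the observation data.

```python
import pandas as pd

def iqr_outlier_share(df, columns, k=1.5):
    """Return, per column, the fraction of rows outside [Q1 - k*IQR, Q3 + k*IQR]."""
    shares = {}
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        outside = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        shares[col] = outside.mean()
    return pd.Series(shares).sort_values(ascending=False)

# Synthetic demo data (hypothetical values, not from observation.csv)
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],          # one obvious outlier
    "b": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],  # no outliers
})
shares = iqr_outlier_share(demo, ["a", "b"])
print(shares)
```

On the real data, the same helper could be called as `iqr_outlier_share(observation, low_outlier_columns)` to see which columns contribute most to the removed rows.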

The cleaned dataset was divided into a training and testing set using an 80/20 ratio.
The target variable is oximetry, representing a binary classification problem (critical vs. normal state).
Stratified sampling was applied to preserve the class balance between both sets.

In [29]:
X = observation_clean.drop(columns=['oximetry'])
y = observation_clean['oximetry']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)

Training set: (8894, 22)
Test set: (2224, 22)
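
To see what `stratify=y` does, the class proportions of the full, training and test sets can be compared side by side. The following is a minimal sketch on synthetic data; `X_demo` and `y_demo` are stand-ins invented for the example, and only the `train_test_split` arguments mirror the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 1000 rows, roughly 60% positives
X_demo = pd.DataFrame({"f1": rng.normal(size=1000), "f2": rng.normal(size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.6).astype(int), name="oximetry")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

# Class shares per subset; with stratification the three columns nearly match
balance = pd.DataFrame({
    "full": y_demo.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(balance.round(3))
```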


# B

- The observation_clean dataset contains 11,118 records and 23 numerical columns, including the target oximetry.
- An initial inspection using .info() and .isna() confirms that there are no missing values in any of the columns.
- All columns are stored as float64, so the dataset is already suitable for machine learning algorithms without categorical encoding.
- Three columns were dropped from the predictors: SpO₂ (directly related to the target variable, so it could cause data leakage), and latitude and longitude (location metadata that is irrelevant for the classification task).

In [30]:
observation_clean.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11118 entries, 0 to 12045
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   SpO₂                   11118 non-null  float64
 1   HR                     11118 non-null  float64
 2   PI                     11118 non-null  float64
 3   RR                     11118 non-null  float64
 4   EtCO₂                  11118 non-null  float64
 5   FiO₂                   11118 non-null  float64
 6   PRV                    11118 non-null  float64
 7   BP                     11118 non-null  float64
 8   Skin Temperature       11118 non-null  float64
 9   Motion/Activity index  11118 non-null  float64
 10  PVI                    11118 non-null  float64
 11  Hb level               11118 non-null  float64
 12  SV                     11118 non-null  float64
 13  CO                     11118 non-null  float64
 14  Blood Flow Index       11118 non-null  float64
 15  PPG wav

In [31]:
display(observation_clean.isna().sum().sort_values(ascending=False))

SpO₂                     0
HR                       0
PI                       0
RR                       0
EtCO₂                    0
FiO₂                     0
PRV                      0
BP                       0
Skin Temperature         0
Motion/Activity index    0
PVI                      0
Hb level                 0
SV                       0
CO                       0
Blood Flow Index         0
PPG waveform features    0
Signal Quality Index     0
Respiratory effort       0
O₂ extraction ratio      0
SNR                      0
oximetry                 0
latitude                 0
longitude                0
dtype: int64

In [32]:
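target = "oximetry"

# Columns excluded from the predictors: SpO₂ (leakage risk) and location metadata
drop_cols = ["SpO₂", "latitude", "longitude"]

# `target` is not listed in drop_cols, so X still contains the oximetry column
# (23 - 3 = 20 columns)
X = observation_clean.drop(columns=[c for c in drop_cols if c in observation_clean.columns])
y = observation_clean[target].astype(int)

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)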


Feature matrix shape: (11118, 20)
Target vector shape: (11118,)


In [33]:
print(X.isna().sum().sum()) 

0


In [34]:
print("Final numeric dataset shape:", X.shape)

Final numeric dataset shape: (11118, 20)
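
The train/test split earlier in this section was created before SpO₂, latitude and longitude were dropped, and the target itself is not part of `drop_cols`. If the intent is to fit the models on the reduced feature set, one option is to drop all four columns first and split afterwards. The sketch below assumes that intent; `make_split` is a hypothetical helper and the frame is synthetic, so the column values carry no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def make_split(df, target, drop_cols, test_size=0.2, seed=11):
    """Drop leakage/metadata columns and the target, then split with stratification."""
    to_drop = [c for c in drop_cols if c in df.columns] + [target]
    X = df.drop(columns=to_drop)
    y = df[target].astype(int)
    return train_test_split(X, y, test_size=test_size, random_state=seed, stratify=y)

rng = np.random.default_rng(0)
# Synthetic frame with a few of the notebook's column names (values are made up)
demo = pd.DataFrame({
    "SpO₂": rng.normal(97, 1, 200),
    "HR": rng.normal(75, 10, 200),
    "PI": rng.normal(5, 1, 200),
    "latitude": rng.uniform(-90, 90, 200),
    "longitude": rng.uniform(-180, 180, 200),
    "oximetry": rng.integers(0, 2, 200).astype(float),
})

X_tr, X_te, y_tr, y_te = make_split(demo, "oximetry", ["SpO₂", "latitude", "longitude"])
print(list(X_tr.columns))  # only HR and PI remain as predictors
```

Applied to the real data, the call would be `make_split(observation_clean, target, drop_cols)`, which would leave 19 predictor columns.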


# C

To prepare the data for machine learning, several feature transformation techniques were applied and evaluated.

**Two main categories of preprocessing were tested together with a Logistic Regression classifier:**
- Scaling methods: StandardScaler and MinMaxScaler
- Transformers: PowerTransformer (Yeo–Johnson) and PCA (Principal Component Analysis)

**Scaling Techniques**
- Both scalers produced identical classification reports on the test set: accuracy 0.85, precision 0.84 (macro) and 0.85 (weighted), and weighted F1-score 0.85.
This indicates that the choice between the two scaling methods has no measurable effect on the model here and that the features are relatively well-behaved numerically.

**Transformation Techniques**
- The PowerTransformer pipeline reached an accuracy of 0.85, the same as the two scalers, while PCA with 5 components reached 0.84.
The difference of 0.01 on a single test split is small, so it only weakly suggests that correcting skewed distributions was more effective than dimensionality reduction in this dataset.
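
Whether a power transform is worthwhile depends on how skewed the features actually are. A minimal sketch of that check, on a synthetic log-normal feature rather than the observation data, compares the sample skewness before and after a Yeo–Johnson transform:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed feature (log-normal); not taken from the observation data
skewed = pd.DataFrame({"feature": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})

pt = PowerTransformer(method="yeo-johnson")  # also standardizes by default
transformed = pd.DataFrame(pt.fit_transform(skewed), columns=skewed.columns)

skew_before = skewed["feature"].skew()
skew_after = transformed["feature"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

On the real data, `X_train.skew()` would give the same per-column measure; values close to zero for most columns would be consistent with the transform bringing little gain over plain scaling.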

In [35]:
pipe_std = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipe_minmax = Pipeline([
    ('scale', MinMaxScaler()),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipe_std.fit(X_train, y_train)
pipe_minmax.fit(X_train, y_train)

y_pred_std = pipe_std.predict(X_test)
y_pred_minmax = pipe_minmax.predict(X_test)

print("=== StandardScaler ===")
print(classification_report(y_test, y_pred_std))

print("=== MinMaxScaler ===")
print(classification_report(y_test, y_pred_minmax))

=== StandardScaler ===
              precision    recall  f1-score   support

         0.0       0.78      0.86      0.82       902
         1.0       0.90      0.84      0.87      1322

    accuracy                           0.85      2224
   macro avg       0.84      0.85      0.84      2224
weighted avg       0.85      0.85      0.85      2224

=== MinMaxScaler ===
              precision    recall  f1-score   support

         0.0       0.78      0.86      0.82       902
         1.0       0.90      0.84      0.87      1322

    accuracy                           0.85      2224
   macro avg       0.84      0.85      0.84      2224
weighted avg       0.85      0.85      0.85      2224



In [39]:
pipe_power = Pipeline([
    ('transform', PowerTransformer(method='yeo-johnson')),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])


pipe_pca = Pipeline([
    ('scale', StandardScaler()),   
    ('pca', PCA(n_components=5)),  
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced'))
])

pipe_power.fit(X_train, y_train)
y_pred_power = pipe_power.predict(X_test)

pipe_pca.fit(X_train, y_train)
y_pred_pca = pipe_pca.predict(X_test)

print("=== PowerTransformer ===")
print(classification_report(y_test, y_pred_power))

print("=== PCA Transformer ===")
print(classification_report(y_test, y_pred_pca))



=== PowerTransformer ===
              precision    recall  f1-score   support

         0.0       0.78      0.86      0.82       902
         1.0       0.90      0.83      0.87      1322

    accuracy                           0.85      2224
   macro avg       0.84      0.85      0.84      2224
weighted avg       0.85      0.85      0.85      2224

=== PCA Transformer ===
              precision    recall  f1-score   support

         0.0       0.77      0.86      0.81       902
         1.0       0.90      0.83      0.86      1322

    accuracy                           0.84      2224
   macro avg       0.83      0.84      0.84      2224
weighted avg       0.85      0.84      0.84      2224
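
Because the four reports differ by at most 0.01 on one test split, a cross-validated comparison gives a better sense of whether the differences exceed fold-to-fold noise. The sketch below rebuilds the four pipelines and scores them with 5-fold stratified cross-validation. It runs on synthetic data from `make_classification`, and the choice of macro F1 as the metric is an assumption, not something taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Synthetic stand-in for the training data (20 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

def clf():
    # Same classifier settings as in the cells above
    return LogisticRegression(max_iter=1000, class_weight="balanced")

pipelines = {
    "standard": Pipeline([("scale", StandardScaler()), ("clf", clf())]),
    "minmax": Pipeline([("scale", MinMaxScaler()), ("clf", clf())]),
    "power": Pipeline([("transform", PowerTransformer(method="yeo-johnson")), ("clf", clf())]),
    "pca": Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=5)), ("clf", clf())]),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
scores = {
    name: cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1_macro")
    for name, pipe in pipelines.items()
}
for name, s in scores.items():
    print(f"{name:>8}: {s.mean():.3f} +/- {s.std():.3f}")
```

Each preprocessing step sits inside the pipeline, so it is refit on the training folds only and the held-out fold never influences the scaler, transformer or PCA.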



# D

**Data Selection and Structure**

The analysis focused on the observation_clean dataset, which contained all relevant physiological attributes in numeric form.
The station and patient files were loaded but excluded from the preprocessing pipeline because they contained no data essential for modeling oxygen saturation states.

**Data Cleaning**

An inspection with .info() and .isna() confirmed that the dataset had no missing values and all columns were of type float64.
This eliminated the need for imputation or categorical encoding.

**Feature Preparation**

The target variable was defined as oximetry, a binary label that makes this a binary classification problem. SpO₂ was removed because it is directly related to the target and would cause data leakage. Latitude and longitude were removed because they provide only location metadata without physiological meaning.

**Scaling and Transformation**

Four preprocessing approaches were tested:
- StandardScaler: standardization to zero mean and unit variance.
- MinMaxScaler: scaling to the [0, 1] range.
- PowerTransformer (Yeo–Johnson): chosen to reduce the skewness of numerical attributes, which helps linear models capture relationships in the data and lessens the impact of extreme values.
- Principal Component Analysis (PCA): used for dimensionality reduction. It removes redundancy between correlated attributes by projecting the data onto a smaller set of components (5 here) intended to keep the essential information.
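
How much information a fixed number of components keeps can be read from the cumulative explained variance. The sketch below is a hedged illustration on synthetic, partly redundant features. It is not the observation data, and the 95% threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with correlated (redundant) features
X_demo, _ = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=10, random_state=0
)

X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_scaled)  # keep all components to inspect the full spectrum
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("variance retained by 5 components:", round(float(cumulative[4]), 3))

# Smallest number of components whose cumulative share reaches 95%
n_for_95 = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95% variance:", n_for_95)
```

Running the same lines on the scaled training features would show whether `n_components=5` discards a noticeable share of the variance, which could explain the slightly lower PCA scores.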