### **Data Cleaning Workflow** 

---

#### **1. Import Already Filtered Data**  
- Drop records where:  
   - **`'sii'`** is `NaN`.  
- Save the target variable:  
   - 🎯 **`sii`**  

---

#### **2. Feature Cleaning**  
-  Drop all **features** that are present **only in the train set**.  
-  Remove features where **`NaN` values ≥ 50%** (Threshold: **`0.5`**).  

---

#### **3. Dataset Splitting**  
- **Split the dataset** into:  
   -  **Numerical Features**  
   -  **Categorical Features**

---

#### **4. Feature Engineering**

-  Train a model and use it to **fill in `numerical missing values`** through **`missing value imputation`**.
-  Fill in each categorical **`missing value`** with the mode of its feature.
-  Merge all **FGC-Zone features**:  
   - Compute: **`feature + (zone_feature * 0.25 * feature)`**.
-  Train a **Random Forest** to select a subset of the features, keeping only the `most informative features`.
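
The FGC-Zone merge is not implemented in the cells of this section, so here is a minimal sketch of how the formula could be applied. It reads the formula as `feature + zone_feature * 0.25 * feature`. The helper name `merge_fgc_zones`, the `weight` parameter, and the pairing of columns by the `_Zone` suffix are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

def merge_fgc_zones(df, weight=0.25):
    """Fold each FGC '<name>_Zone' column into its base '<name>' column.

    Assumption: merged = feature + zone_feature * weight * feature.
    """
    out = df.copy()
    zone_cols = [c for c in out.columns
                 if c.startswith("FGC-") and c.endswith("_Zone")]
    for zone_col in zone_cols:
        base_col = zone_col[: -len("_Zone")]
        if base_col in out.columns:
            out[base_col] = out[base_col] + out[zone_col] * weight * out[base_col]
            out = out.drop(columns=zone_col)  # the zone is now folded into the base feature
    return out

# Toy data (hypothetical values)
demo = pd.DataFrame({"FGC-FGC_CU": [10.0, 20.0], "FGC-FGC_CU_Zone": [0.0, 1.0]})
merged = merge_fgc_zones(demo)
```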

---

#### **5. Categorical Feature Encoding**  
- Use **1-Hot Encoding (OHE)**:  
   -  `pd.get_dummies()`

---

#### **6. Handle Outliers (Optional)**  
- 🚨 **[IF NEEDED]**: Remove records with extremely **high values**.
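
One possible way to automate this step is a quantile cut-off, sketched below on toy data. The helper name `drop_extreme_rows` and the `upper_quantile` parameter are hypothetical, and the cut-off value is an arbitrary choice. The notebook itself removed three values by hand rather than with a rule like this.

```python
import pandas as pd

def drop_extreme_rows(df, cols, upper_quantile=0.99):
    """Drop rows where any of `cols` exceeds that column's upper quantile."""
    limits = df[cols].quantile(upper_quantile)
    # NaN > limit evaluates to False, so rows with missing values are kept
    is_extreme = (df[cols] > limits).any(axis=1)
    return df[~is_extreme]

# Toy data (hypothetical values): one NaN and one implausibly high entry
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, None, 1000.0]})
cleaned = drop_extreme_rows(toy, ["x"], upper_quantile=0.9)
```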

---

#### **7. Correlation Check (Optional)**  
-  🚨 **[IF NEEDED]**: Compute the **correlation matrix** and, if an extremely high `(positive or negative)` correlation is found:
   - Create a `new dataset` that drops the least informative feature of each strongly correlated pair
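
A minimal sketch of the correlation check, assuming a purely numerical DataFrame. The function name `highly_correlated_pairs` and the `0.95` threshold are illustrative assumptions. Deciding which feature of each pair is the least informative still requires a separate importance measure, which is not shown here.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return (feature_a, feature_b, correlation) for pairs with |corr| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skipping the diagonal
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

# Toy data (hypothetical values): 'b' is an exact multiple of 'a'
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
pairs = highly_correlated_pairs(toy)
```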

---

#### **8. Export Dataset**  
- **Export the two datasets**.  
- Proceed to:  
   - Determine which of the two exported datasets gives **better performance**.
   - Build **four classification models**.  
   - Perform **hyperparameter tuning**.  
   - Document appropriate **considerations**.

---



#### **1. Import Already Filtered Data**  

In [244]:
# The dataset is somewhat odd: after many attempts, I can firmly say it has a lot of outliers.
# While studying it, I found three specific values that were implausibly high, and I removed them by hand.
# This is why this step is called "Already Filtered".

import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np

load_dotenv()
TRAIN_SET = os.getenv("FILTERED_TRAIN_PATH")
TEST_SET  = os.getenv("TEST_PATH")

train = pd.read_csv(TRAIN_SET)
test = pd.read_csv(TEST_SET)


In [245]:
train.head()

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,,,,,,,,,


In [246]:
test.head()

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,32.6909,,,,,,,,Fall,3.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,45.9966,,,Winter,2.451,Summer,31.0,45.0,Winter,0.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,Summer,1.04,,,,,,,


In [247]:
# Remove records with 'sii' value NaN
x_train = train.dropna(subset=['sii'])

# Extract 'sii' (target variable) 
y_train = x_train['sii']

In [248]:
# Drop 'id' from both train and test, and the target 'sii' from train:
x_train = x_train.drop(columns=['id', 'sii'])

x_test = test.drop(columns=['id'])

In [249]:
x_train.shape, y_train.shape

((2733, 80), (2733,))

In [250]:
x_test.shape

(20, 58)

---
#### **2. Feature Cleaning**  

In [251]:
# FUNCTION DEFINITION

# Keep only the train features that are also present in the test set
def intersect_features(train, test):
    shared_cols = train.columns.intersection(test.columns)
    return train[shared_cols]

In [252]:
x_train = intersect_features(x_train, x_test)
x_train.shape, y_train.shape

((2733, 58), (2733,))

In [253]:
# FUNCTION DEFINITION

# Drop every feature whose fraction of missing values is at least `threshold`
def drop_columns(df, threshold):
    nan_fraction = df.isnull().mean()  # share of NaN values per column
    dropped_cols = df.columns[nan_fraction >= threshold]
    df = df.drop(columns=dropped_cols)
    return df, dropped_cols

In [254]:
x_train, dropped_cols = drop_columns(x_train, 0.5)
dropped_cols

Index(['Physical-Waist_Circumference', 'Fitness_Endurance-Season',
       'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins',
       'Fitness_Endurance-Time_Sec', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone',
       'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total'],
      dtype='object')

In [255]:
x_train.shape, y_train.shape

((2733, 47), (2733,))

---
#### **3. Dataset Splitting**  

In [256]:
# split numerical and categorical features
num_features = x_train.select_dtypes(include=[np.number]).columns
cat_features = x_train.select_dtypes(include=[object]).columns

print(f'Numerical features: {num_features}')
print(f'\nCategorical features: {cat_features}')
print(f'\n Numerical dataset shape: {x_train[num_features].shape}; Categorical dataset shape: {x_train[cat_features].shape}')

Numerical features: Index(['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP',
       'Physical-HeartRate', 'Physical-Systolic_BP', 'FGC-FGC_CU',
       'FGC-FGC_CU_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL',
       'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL',
       'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC',
       'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW',
       'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat',
       'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST',
       'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw',
       'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday'],
      dtype='object')

Categorical features: Index(['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season',
       'FGC-Season', 'BIA-Season', 'PAQ_C-Season', 'SDS-Season',
      

---
#### **4. Feature Engineering**

In [257]:
"""
# train a Random Forest to fill the missing values based on similarity scores
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer # scikit-learn tool to fill in (impute) missing values in the dataset

imp = IterativeImputer(estimator=RandomForestRegressor(
    n_estimators=45,
    random_state=0,
    n_jobs=-1
    ), max_iter=20, verbose=2, random_state=0)
# estimator -> we specify the model to use and predict the missing values
# max_iter -> the number of iterations for the imputer to refine its estimates

# fit the imputer on the train dataset, I only fill the float64 features
imp.fit(x_train[num_features.drop('Basic_Demos-Age').drop('Basic_Demos-Sex')])
x_train[num_features.drop('Basic_Demos-Age').drop('Basic_Demos-Sex')] = imp.transform(x_train[num_features.drop('Basic_Demos-Age').drop('Basic_Demos-Sex')])
# transform -> uses the learned relationships to fill in the missing values
"""

The Random Forest imputer gave bad results, so let's try a simpler estimator (KNN or Bayesian ridge).
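
The KNN option mentioned above is not used in this notebook. For reference, here is a minimal sketch of what it could look like with scikit-learn's `KNNImputer`, on toy data with hypothetical values. The choice of `n_neighbors=2` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): one missing height
toy = pd.DataFrame({
    "height": [46.0, 48.0, 56.5, 56.0, np.nan],
    "weight": [50.8, 46.0, 75.6, 81.6, 80.0],
})

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest rows, with distances computed on the observed features
knn_imp = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(knn_imp.fit_transform(toy),
                      columns=toy.columns, index=toy.index)
```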

In [258]:
# Try Bayesian ridge regression as the imputation estimator
from sklearn.linear_model import BayesianRidge
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(estimator=BayesianRidge(), max_iter=30, random_state=0, verbose=2)

# Impute only the float64 features; age and sex are int64 and have no missing values
impute_cols = num_features.drop(['Basic_Demos-Age', 'Basic_Demos-Sex'])
imp.fit(x_train[impute_cols])
x_train[impute_cols] = imp.transform(x_train[impute_cols])

[IterativeImputer] Completing matrix with shape (2733, 37)
[IterativeImputer] Ending imputation round 1/30, elapsed time 0.70
[IterativeImputer] Change: 257.21035993668295, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 2/30, elapsed time 1.24
[IterativeImputer] Change: 184.6467667103924, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 3/30, elapsed time 2.12
[IterativeImputer] Change: 150.52341683686578, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 4/30, elapsed time 2.56
[IterativeImputer] Change: 122.67949131223233, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 5/30, elapsed time 3.10
[IterativeImputer] Change: 99.98653871172833, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 6/30, elapsed time 3.53
[IterativeImputer] Change: 81.62512619667072, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 7/30, elapsed time 3.98
[IterativeImputer] Change: 66.6

The imputer converges in fewer than 30 iterations, which I take as a good result.
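
Convergence can also be checked programmatically instead of reading the verbose log. The sketch below uses synthetic data, not the notebook's dataset. It relies on the `n_iter_` attribute of a fitted `IterativeImputer`, which is smaller than `max_iter` when the early-stopping tolerance was reached.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic data: the last column is a noisy copy of the first,
# and roughly 20% of the entries are masked as missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)
X[rng.random(X.shape) < 0.2] = np.nan

demo_imp = IterativeImputer(estimator=BayesianRidge(), max_iter=30, random_state=0)
X_filled = demo_imp.fit_transform(X)

# True when the imputer stopped before exhausting max_iter
stopped_early = demo_imp.n_iter_ < demo_imp.max_iter
print(f"rounds used: {demo_imp.n_iter_}, stopped early: {stopped_early}")
```

The same check on the notebook's imputer would be `imp.n_iter_ < imp.max_iter`.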

In [259]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2733 entries, 0 to 3955
Data columns (total 47 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2733 non-null   object 
 1   Basic_Demos-Age                         2733 non-null   int64  
 2   Basic_Demos-Sex                         2733 non-null   int64  
 3   CGAS-Season                             2339 non-null   object 
 4   CGAS-CGAS_Score                         2733 non-null   float64
 5   Physical-Season                         2592 non-null   object 
 6   Physical-BMI                            2733 non-null   float64
 7   Physical-Height                         2733 non-null   float64
 8   Physical-Weight                         2733 non-null   float64
 9   Physical-Diastolic_BP                   2733 non-null   float64
 10  Physical-HeartRate                      2733 non-null   float64
 

In [260]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 2733 entries, 0 to 3955
Series name: sii
Non-Null Count  Dtype  
--------------  -----  
2733 non-null   float64
dtypes: float64(1)
memory usage: 42.7 KB


In [261]:
# I want to fill the missing values in the categorical features with the most frequent value
x_train[cat_features] = x_train[cat_features].fillna(x_train[cat_features].mode().iloc[0])
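
To see what the mode-based fill does, here is a small sketch on toy data with hypothetical values. `DataFrame.mode()` returns one row per tied mode, so `.iloc[0]` picks the first mode of each column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CGAS-Season": ["Fall", np.nan, "Fall", "Winter"],
    "SDS-Season": ["Summer", "Summer", np.nan, "Spring"],
})

modes = toy.mode().iloc[0]   # most frequent value of each column
filled = toy.fillna(modes)   # fillna matches the Series index to column names
```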

In [262]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2733 entries, 0 to 3955
Data columns (total 47 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2733 non-null   object 
 1   Basic_Demos-Age                         2733 non-null   int64  
 2   Basic_Demos-Sex                         2733 non-null   int64  
 3   CGAS-Season                             2733 non-null   object 
 4   CGAS-CGAS_Score                         2733 non-null   float64
 5   Physical-Season                         2733 non-null   object 
 6   Physical-BMI                            2733 non-null   float64
 7   Physical-Height                         2733 non-null   float64
 8   Physical-Weight                         2733 non-null   float64
 9   Physical-Diastolic_BP                   2733 non-null   float64
 10  Physical-HeartRate                      2733 non-null   float64
 

In [263]:
# split the dataset into cat_features and num_features
x_train_cat = x_train[cat_features]
x_train_num = x_train[num_features]

# one-hot encode the categorical features as 0/1 integers
x_train_cat = pd.get_dummies(x_train_cat, dtype=int)

In [264]:
# concatenate the numerical and categorical features
x_train = pd.concat([x_train_num, x_train_cat], axis=1)
x_train.head()

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,FGC-FGC_CU,...,PAQ_C-Season_Summer,PAQ_C-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
0,5,0,51.0,16.877316,46.0,50.8,67.821521,85.250561,110.753022,0.0,...,0,0,0,1,0,0,1,0,0,0
1,9,0,62.798869,14.03559,48.0,46.0,75.0,70.0,122.0,3.0,...,0,0,1,0,0,0,0,0,1,0
2,10,1,71.0,16.648696,56.5,75.6,65.0,94.0,117.0,20.0,...,1,0,1,0,0,0,0,0,1,0
3,9,0,71.0,18.292347,56.0,81.6,60.0,97.0,117.0,18.0,...,0,1,0,0,1,0,0,0,0,1
5,13,1,50.0,22.279952,59.5,112.2,60.0,73.0,102.0,12.0,...,0,0,0,0,1,0,0,1,0,0


In [265]:
# convert all features (numerical and one-hot encoded) to float64
x_train = x_train.astype('float64')
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2733 entries, 0 to 3955
Data columns (total 71 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2733 non-null   float64
 1   Basic_Demos-Sex                         2733 non-null   float64
 2   CGAS-CGAS_Score                         2733 non-null   float64
 3   Physical-BMI                            2733 non-null   float64
 4   Physical-Height                         2733 non-null   float64
 5   Physical-Weight                         2733 non-null   float64
 6   Physical-Diastolic_BP                   2733 non-null   float64
 7   Physical-HeartRate                      2733 non-null   float64
 8   Physical-Systolic_BP                    2733 non-null   float64
 9   FGC-FGC_CU                              2733 non-null   float64
 10  FGC-FGC_CU_Zone                         2733 non-null   float64
 

In [268]:
# run recursive feature elimination with cross-validation (RFECV) around a Random Forest
# to keep the most informative features and drop the rest
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

rf_small = RandomForestClassifier(n_estimators=50)

selector = RFECV(rf_small, step=1, cv=4,
                 scoring='accuracy',        # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
                 min_features_to_select=1)

selector.fit(x_train, y_train);

In [277]:
print(f"Number of selected features: {selector.n_features_}")

Number of selected features: 66
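
To see which features were eliminated, the fitted selector exposes `ranking_` (rank 1 means selected) alongside `support_`. The sketch below shows this on a small synthetic dataset. The names `X_demo`, `y_demo`, and `demo_selector`, and the reduced model sizes, are assumptions chosen to keep the example fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic classification problem with 8 features, 3 of them informative
X_arr, y_demo = make_classification(n_samples=200, n_features=8,
                                    n_informative=3, n_redundant=0,
                                    random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

demo_selector = RFECV(RandomForestClassifier(n_estimators=20, random_state=0),
                      step=1, cv=3, scoring="accuracy",
                      min_features_to_select=1)
demo_selector.fit(X_demo, y_demo)

# Rank 1 = kept; higher ranks were eliminated earlier in the recursion
ranking = pd.Series(demo_selector.ranking_, index=X_demo.columns).sort_values()
dropped = ranking[ranking > 1].index.tolist()
print(f"kept {demo_selector.n_features_} features, dropped: {dropped}")
```

Note that the notebook's `RandomForestClassifier` is created without a `random_state`, so the number of selected features may vary between runs.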


In [276]:
X_features = x_train[x_train.columns[selector.support_]]
X_features.head()

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,FGC-FGC_CU,...,PAQ_C-Season_Spring,PAQ_C-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
0,5.0,0.0,51.0,16.877316,46.0,50.8,67.821521,85.250561,110.753022,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,9.0,0.0,62.798869,14.03559,48.0,46.0,75.0,70.0,122.0,3.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,10.0,1.0,71.0,16.648696,56.5,75.6,65.0,94.0,117.0,20.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9.0,0.0,71.0,18.292347,56.0,81.6,60.0,97.0,117.0,18.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
5,13.0,1.0,50.0,22.279952,59.5,112.2,60.0,73.0,102.0,12.0,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
