### **Data Cleaning Workflow** 

---

#### **1. Import Already Filtered Data**  
- Drop records where:  
   - **`'sii'`** is `NaN`.  
- Save the target variable:  
   - 🎯 **`sii`**  

---

#### **2. Feature Cleaning**  
-  Drop all **features** that are present **only in the train set**.  
-  Remove features where **`NaN` values ≥ 50%** (Threshold: **`0.5`**).  

---

#### **3. Dataset Splitting**  
- **Split the dataset** into:  
   -  **Numerical Features**  
   -  **Categorical Features**

---

#### **4. Feature Engineering**
-  Train a **Random Forest** to select a subset of the features, keeping only the `most informative features`.
-  Train a **new Random Forest** and use its similarity scores to fill in `missing values` through **`missing value imputation`**,
    ensuring the result matches the feature type: truncate **float** to **integer** where needed.
-  Merge all **FGC-Zone features**:  
   - Compute: **`feature + (zone_feature * 0.25(feature))`**.
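
One possible reading of the formula above is `feature + zone_feature * 0.25 * feature`, i.e. each raw FGC score is increased by 25% of itself per zone level. The sketch below assumes that interpretation, and also assumes that every zone column is named `<feature>_Zone` (as in `FGC-FGC_CU` / `FGC-FGC_CU_Zone`). The helper `merge_zone_features` and its `weight` parameter are hypothetical names, not part of the notebook.

```python
import pandas as pd

def merge_zone_features(df: pd.DataFrame, weight: float = 0.25) -> pd.DataFrame:
    """Fold every '<feature>_Zone' column into its base feature.

    Assumed interpretation: merged = feature + zone * weight * feature.
    """
    out = df.copy()
    zone_cols = [c for c in out.columns if c.endswith("_Zone")]
    for zone_col in zone_cols:
        base_col = zone_col[: -len("_Zone")]
        if base_col not in out.columns:
            continue  # no matching base feature: leave the zone column untouched
        out[base_col] = out[base_col] + out[zone_col] * weight * out[base_col]
        out = out.drop(columns=zone_col)
    return out

# toy data, for illustration only
demo = pd.DataFrame({
    "FGC-FGC_CU": [10.0, 20.0],
    "FGC-FGC_CU_Zone": [0.0, 1.0],
})
merged = merge_zone_features(demo)
```

If the intended formula is different (for example, adding `0.25 * zone_feature` without multiplying by the feature), only the assignment inside the loop needs to change.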

---

#### **5. Categorical Feature Encoding**  
- Use **One-Hot Encoding (OHE)**:  
   -  `pd.get_dummies()`

---

#### **6. Handle Outliers (Optional)**  
- 🚨 **[IF NEEDED]**: Remove records with extremely **high values**.

---

#### **7. Correlation Check (Optional)**  
-  🚨 **[IF NEEDED]**: Compute the **correlation matrix**; if an extremely high `(positive or negative)` correlation is found:
   - Create a `new dataset` that drops the least informative feature of each strongly correlated pair.
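
A minimal sketch of this check, run on toy data. It assumes a cut-off of `0.9` on the absolute Pearson correlation and a dictionary of feature importances (for example taken from the Random Forest) to decide which feature of a pair is less informative. The helper `drop_correlated`, the threshold and the importance values are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, importances, threshold=0.9):
    """Drop the less important feature of every highly correlated pair."""
    corr = df.corr().abs()
    # keep only the upper triangle so that each pair is visited once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = set()
    for col in upper.columns:
        for row in upper.index[upper[col] > threshold]:
            if row in to_drop or col in to_drop:
                continue  # one feature of this pair is already dropped
            weaker = row if importances[row] < importances[col] else col
            to_drop.add(weaker)
    return df.drop(columns=sorted(to_drop)), sorted(to_drop)

# toy data: "b" is almost a multiple of "a", "c" is unrelated
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [1.0, -1.0, 1.0, -1.0],
})
reduced, dropped = drop_correlated(demo, {"a": 0.5, "b": 0.2, "c": 0.3})
```

Taking the absolute value covers both strongly positive and strongly negative correlations with a single threshold.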

---

#### **8. Export Dataset**  
- **Export the two datasets**.  
- Proceed to:  
   - Determine which of the two exported datasets gives **better performance**
   - Build **four classification models**.  
   - Perform **hyperparameter tuning**.  
   - Document appropriate **considerations**.

---



#### **1. Import Already Filtered Data**  

In [221]:
# The dataset is rather unusual: after many attempts I can say with confidence that it contains a lot of outliers.
# While studying it I found three specific, implausibly high values and decided to remove them by hand.
# This is why this step is called "Already Filtered".

import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np

load_dotenv()
TRAIN_SET = os.getenv("FILTERED_TRAIN_PATH")
TEST_SET  = os.getenv("TEST_PATH")

train = pd.read_csv(TRAIN_SET)
test = pd.read_csv(TEST_SET)


In [222]:
train.head()

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,,,,,,,,,


In [223]:
test.head()

Unnamed: 0,id,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,...,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,00008ff9,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,...,32.6909,,,,,,,,Fall,3.0
1,000fd460,Summer,9,0,,,Fall,14.03559,48.0,46.0,...,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,00105258,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,00115b9f,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,...,45.9966,,,Winter,2.451,Summer,31.0,45.0,Winter,0.0
4,0016bb22,Spring,18,1,Summer,,,,,,...,,Summer,1.04,,,,,,,


In [224]:
# Remove records with 'sii' value NaN
x_train = train.dropna(subset=['sii'])

# Extract 'sii' (target variable) 
y_train = x_train['sii']

In [225]:
# Drop 'id' and the target 'sii' from the train features, and 'id' from the test set:
x_train = x_train.drop(columns=['id', 'sii'])

x_test = test.drop(columns=['id'])

In [226]:
x_train.shape, y_train.shape

((2733, 80), (2733,))

In [227]:
x_test.shape

(20, 58)

---
#### **2. Feature Cleaning**  

In [228]:
# FUNCTION DEFINITION

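# keep only the train features that are also present in the test dataset
def intersect_features(train, test):
    shared_cols = train.columns.intersection(test.columns)
    # .copy() returns an independent DataFrame, so later column assignments do not act on a view
    return train[shared_cols].copy()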

In [229]:
x_train = intersect_features(x_train, x_test)
x_train.shape, y_train.shape

((2733, 58), (2733,))

In [230]:
# FUNCTION DEFINITION

# drop every feature whose fraction of missing values is at least `threshold`
def drop_columns(df, threshold):
    missing_ratio = df.isnull().mean()  # fraction of NaN values per column
    dropped_cols = df.columns[missing_ratio >= threshold]
    df = df.drop(columns=dropped_cols)
    return df, dropped_cols

In [231]:
x_train, dropped_cols = drop_columns(x_train, 0.5)
dropped_cols

Index(['Physical-Waist_Circumference', 'Fitness_Endurance-Season',
       'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins',
       'Fitness_Endurance-Time_Sec', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone',
       'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total'],
      dtype='object')

In [232]:
x_train.shape, y_train.shape

((2733, 47), (2733,))

---
#### **3. Dataset Splitting**  

In [233]:
# split numerical and categorical features
num_features = x_train.select_dtypes(include=[np.number]).columns
cat_features = x_train.select_dtypes(include=[object]).columns

print(f'Numerical features: {num_features}')
print(f'\nCategorical features: {cat_features}')
print(f'\n Numerical dataset shape: {x_train[num_features].shape}; Categorical dataset shape: {x_train[cat_features].shape}')

Numerical features: Index(['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI',
       'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP',
       'Physical-HeartRate', 'Physical-Systolic_BP', 'FGC-FGC_CU',
       'FGC-FGC_CU_Zone', 'FGC-FGC_PU', 'FGC-FGC_PU_Zone', 'FGC-FGC_SRL',
       'FGC-FGC_SRL_Zone', 'FGC-FGC_SRR', 'FGC-FGC_SRR_Zone', 'FGC-FGC_TL',
       'FGC-FGC_TL_Zone', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC',
       'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW',
       'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat',
       'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST',
       'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw',
       'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday'],
      dtype='object')

Categorical features: Index(['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season',
       'FGC-Season', 'BIA-Season', 'PAQ_C-Season', 'SDS-Season',
      

---
#### **4. Feature Engineering**

First step -> look at the correlation matrix and drop columns with extremely high correlation, if present.

Second step -> impute missing values with a regressor for the numerical (float64) features. The first attempt uses fully grown trees and 20 iterations for the imputer.

In [234]:
# use a Random Forest regressor inside IterativeImputer to predict each missing value from the other features
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer # scikit-learn tool to fill in (impute) missing values in the dataset

imp = IterativeImputer(estimator=RandomForestRegressor(
    n_estimators=45,
    random_state=0,
    n_jobs=-1
    ), max_iter=20, verbose=2, random_state=0)
# estimator -> the model used to predict the missing values
# max_iter -> the maximum number of imputation rounds used to refine the estimates

# only the float64 features are imputed: 'Basic_Demos-Age' and 'Basic_Demos-Sex' are int64 and have no missing values
float_features = num_features.drop(['Basic_Demos-Age', 'Basic_Demos-Sex'])

# fit -> learns the relationships between the features on the train dataset
imp.fit(x_train[float_features])
# transform -> uses the learned relationships to fill in the missing values
x_train[float_features] = imp.transform(x_train[float_features])

[IterativeImputer] Completing matrix with shape (2733, 37)
[IterativeImputer] Ending imputation round 1/20, elapsed time 17.22
[IterativeImputer] Change: 171.12939451799653, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 2/20, elapsed time 36.21
[IterativeImputer] Change: 192.40956948888913, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 3/20, elapsed time 55.18
[IterativeImputer] Change: 155.53965038066636, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 4/20, elapsed time 73.65
[IterativeImputer] Change: 141.3296961333334, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 5/20, elapsed time 92.32
[IterativeImputer] Change: 120.33330133333214, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 6/20, elapsed time 111.66
[IterativeImputer] Change: 127.12785133333315, scaled tolerance: 7.99408 
[IterativeImputer] Ending imputation round 7/20, elapsed time 131.59
[IterativeImputer] C



[IterativeImputer] Ending imputation round 1/20, elapsed time 2.52
[IterativeImputer] Ending imputation round 2/20, elapsed time 5.21
[IterativeImputer] Ending imputation round 3/20, elapsed time 8.20
[IterativeImputer] Ending imputation round 4/20, elapsed time 10.64
[IterativeImputer] Ending imputation round 5/20, elapsed time 13.07
[IterativeImputer] Ending imputation round 6/20, elapsed time 16.30
[IterativeImputer] Ending imputation round 7/20, elapsed time 18.83
[IterativeImputer] Ending imputation round 8/20, elapsed time 21.74
[IterativeImputer] Ending imputation round 9/20, elapsed time 24.61
[IterativeImputer] Ending imputation round 10/20, elapsed time 27.54
[IterativeImputer] Ending imputation round 11/20, elapsed time 30.55
[IterativeImputer] Ending imputation round 12/20, elapsed time 32.94
[IterativeImputer] Ending imputation round 13/20, elapsed time 36.12
[IterativeImputer] Ending imputation round 14/20, elapsed time 39.18
[IterativeImputer] Ending imputation round 15/

Bad results. Let's try simpler models (KNN or Bayesian).
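
The two simpler alternatives could look like the sketch below. It runs on a small synthetic frame rather than on `x_train`, and `n_neighbors=5` and `max_iter=10` are placeholder values, not tuned settings. `KNNImputer`, `IterativeImputer` and `BayesianRidge` are existing scikit-learn classes; the toy data and variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.linear_model import BayesianRidge

# synthetic data with a few missing values in one column
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
demo.loc[::7, "f2"] = np.nan

# Option 1: k-nearest-neighbours imputation
knn_imp = KNNImputer(n_neighbors=5)
demo_knn = pd.DataFrame(
    knn_imp.fit_transform(demo), columns=demo.columns, index=demo.index
)

# Option 2: iterative imputation with a Bayesian ridge regressor
bayes_imp = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
demo_bayes = pd.DataFrame(
    bayes_imp.fit_transform(demo), columns=demo.columns, index=demo.index
)
```

Both imputers return NumPy arrays, so the results are wrapped back into DataFrames to keep the column names and the original index.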

If nothing works, try deleting extreme outliers (values above the 99th percentile of some features).
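
A rough sketch of such a percentile filter, on toy data. The helper `drop_extreme_rows` is a hypothetical name, and which columns to filter is left open here; `Physical-Weight` is used only as an example.

```python
import pandas as pd

def drop_extreme_rows(df, columns, q=0.99):
    """Remove rows whose value exceeds the q-th quantile in any given column."""
    limits = df[columns].quantile(q)
    # comparisons with NaN are False, so rows with missing values are kept
    is_extreme = (df[columns] > limits).any(axis=1)
    return df.loc[~is_extreme]

# toy data: 99 ordinary values and one implausibly large one
demo = pd.DataFrame({"Physical-Weight": list(range(1, 100)) + [10_000]})
filtered = drop_extreme_rows(demo, ["Physical-Weight"])
```

Because rows are removed, the target would have to be realigned afterwards, for example with `y_train.loc[filtered.index]`.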

In [235]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2733 entries, 0 to 3955
Data columns (total 47 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2733 non-null   object 
 1   Basic_Demos-Age                         2733 non-null   int64  
 2   Basic_Demos-Sex                         2733 non-null   int64  
 3   CGAS-Season                             2339 non-null   object 
 4   CGAS-CGAS_Score                         2733 non-null   float64
 5   Physical-Season                         2592 non-null   object 
 6   Physical-BMI                            2733 non-null   float64
 7   Physical-Height                         2733 non-null   float64
 8   Physical-Weight                         2733 non-null   float64
 9   Physical-Diastolic_BP                   2733 non-null   float64
 10  Physical-HeartRate                      2733 non-null   float64
 

In [236]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 2733 entries, 0 to 3955
Series name: sii
Non-Null Count  Dtype  
--------------  -----  
2733 non-null   float64
dtypes: float64(1)
memory usage: 42.7 KB


In [237]:
# train a Random Forest to subselect the most important features
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(x_train.fillna(0), y_train)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]

ValueError: could not convert string to float: 'Fall'
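
The error is raised because `x_train` still contains the categorical season columns as strings (`'Fall'`, `'Winter'`, ...), and `fillna(0)` does not convert them to numbers. One way around it, sketched below on a toy frame, is to one-hot encode the string columns with `pd.get_dummies` before fitting. The toy data and the names `demo_x`, `demo_y`, `encoded` and `ranking` are illustrative assumptions, not the notebook's variables.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# toy data mimicking one numerical and one categorical feature
demo_x = pd.DataFrame({
    "Basic_Demos-Age": [5, 9, 10, 9, 12, 7],
    "CGAS-Season": ["Winter", None, "Fall", "Fall", "Summer", "Winter"],
})
demo_y = pd.Series([2.0, 0.0, 0.0, 1.0, 0.0, 1.0])

# one-hot encode string columns; a missing season becomes an all-zero row
encoded = pd.get_dummies(demo_x, dtype=float)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(encoded.fillna(0), demo_y)

# rank the features from most to least important
importances = pd.Series(rf.feature_importances_, index=encoded.columns)
ranking = importances.sort_values(ascending=False)
```

An alternative would be to fit the classifier on `x_train[num_features]` only, which avoids the encoding step but ignores the season columns when ranking features.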