### **Data Cleaning Workflow** 

---

#### **1. Import Already Filtered Data**  
- Drop records where:  
   - **`'sii'`** is `NaN`.  
- Save the target variable separately:  
   - 🎯 **`sii`**  

---

#### **2. Feature Cleaning**  
-  Drop all **features** that are present **only in the train set**.  
-  Remove features with **more than 60% `NaN` values**, i.e. fewer than 40% non-missing values (Threshold: **`0.4`**).  
-  Train a **Random Forest** to select a subset of the features, keeping only the `most informative features`.
-  Train a **new Random Forest** and use its similarity scores to fill in `missing values` (**`missing value imputation`**), 
    making sure the result matches the feature type: truncate **float** to **integer** where needed.
-  Merge all **FGC-Zone features**:  
   - Compute: **`feature + (zone_feature * 0.25 * feature)`**.
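
The FGC-Zone merge can be sketched as follows. This is only an illustration under stated assumptions: it reads the formula as `feature + zone_feature * 0.25 * feature`, it assumes the zone column is dropped after the merge, and the helper `merge_zone_feature` and the column names `FGC-FGC_X` / `FGC-FGC_X_Zone` are hypothetical placeholders, not names taken from the notebook.

```python
import pandas as pd

def merge_zone_feature(df, feature, zone_feature, weight=0.25):
    """Fold a zone column into its base feature (hypothetical helper).

    Assumed reading of the formula:
        feature + zone_feature * weight * feature
    The zone column is dropped afterwards (also an assumption).
    """
    merged = df.copy()
    merged[feature] = merged[feature] + merged[zone_feature] * weight * merged[feature]
    return merged.drop(columns=[zone_feature])

# Toy data with placeholder column names and made-up values.
toy = pd.DataFrame({
    "FGC-FGC_X": [10.0, 20.0, 8.0],
    "FGC-FGC_X_Zone": [0.0, 1.0, 1.0],
})
merged = merge_zone_feature(toy, "FGC-FGC_X", "FGC-FGC_X_Zone")
print(merged)
```

Under this reading, a zone value of 1 boosts the base feature by 25%, while a zone value of 0 leaves it unchanged.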

---

#### **3. Dataset Splitting**  
- **Split the dataset** into:  
   -  **Numerical Features**  
   -  **Categorical Features**
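
One possible way to do this split is by column dtype. The sketch below uses a toy frame whose column names come from the dataset preview but whose values are only illustrative; whether the notebook splits by dtype or by an explicit column list is an assumption.

```python
import pandas as pd

# Toy frame: column names follow the dataset, values are illustrative.
toy = pd.DataFrame({
    "Basic_Demos-Enroll_Season": ["Fall", "Summer", "Winter"],
    "Basic_Demos-Age": [5, 9, 10],
    "Physical-BMI": [16.88, 14.04, 16.65],
})

# Numeric dtypes go to one frame, everything else to the other.
numerical = toy.select_dtypes(include="number")
categorical = toy.select_dtypes(exclude="number")

print(list(numerical.columns))
print(list(categorical.columns))
```

Note that a dtype-based split treats integer-coded categories (for example a 0/1 sex column) as numerical, so such columns may need to be moved by hand.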

---

#### **4. Categorical Feature Encoding**  
- Use **One-Hot Encoding (OHE)**:  
   -  `pd.get_dummies()`
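
A minimal sketch of the encoding step on toy data. The column names come from the dataset preview, the values are made up, and the final `reindex` (to give the test frame the same dummy columns as the train frame) is an extra step assumed here, not something the plan above states.

```python
import pandas as pd

season = "Basic_Demos-Enroll_Season"
toy_train = pd.DataFrame({season: ["Fall", "Summer", None], "Basic_Demos-Age": [5, 9, 18]})
toy_test = pd.DataFrame({season: ["Fall", "Fall"], "Basic_Demos-Age": [10, 9]})

# One indicator column per category; missing values get all-zero indicators.
train_encoded = pd.get_dummies(toy_train, columns=[season], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=[season], dtype=int)

# Categories absent from the test set would otherwise produce fewer columns.
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded)
print(test_encoded)
```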

---

#### **5. Handle Outliers (Optional)**  
- 🚨 **[IF NEEDED]**: Remove records with extremely **high values**.
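
The plan does not say how "extremely high" is defined, so the following is only one possible criterion: drop rows whose value exceeds a per-column upper quantile. The helper `drop_high_outliers`, the cutoff, and the toy column `weight` are all hypothetical.

```python
import pandas as pd

def drop_high_outliers(df, columns, upper_quantile=0.99):
    """Drop rows whose value in any listed column exceeds that column's upper quantile.

    Hypothetical helper; the quantile cutoff is an assumed criterion.
    Rows with NaN in a listed column are kept (NaN > limit is False).
    """
    limits = df[columns].quantile(upper_quantile)
    too_high = (df[columns] > limits).any(axis=1)
    return df.loc[~too_high]

toy = pd.DataFrame({"weight": [50.0, 52.0, 48.0, 51.0, 5000.0]})
cleaned = drop_high_outliers(toy, ["weight"], upper_quantile=0.9)
print(cleaned)
```

If rows are removed from the features this way, the target must be filtered with the same index, for example `y_train.loc[cleaned.index]`.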

---

#### **6. Correlation Check (Optional)**  
-  🚨 **[IF NEEDED]**: Compute the **correlation matrix** and, if an extremely high `(positive or negative)` correlation is found:
   - Create a `new dataset` that drops the least informative of the most strongly correlated features
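
A sketch of how the strongly correlated pairs could be listed. The helper `highly_correlated_pairs` and the `0.95` cutoff are assumptions; deciding which feature of a pair is "least informative" is left out, since that depends on the feature-importance scores.

```python
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| >= threshold.

    Hypothetical helper; the threshold is an assumed cutoff.
    """
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data: "b" is a negative multiple of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [-2.0, -4.0, -6.0, -8.0],
    "c": [1.0, -1.0, 1.0, -1.0],
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

Taking the absolute value covers both strongly positive and strongly negative correlations.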

---

#### **7. Export Dataset**  
- **Export the two datasets**.  
- Proceed to:  
   - Determine which of the two exported datasets gives **better performance**
   - Build **four classification models**.  
   - Perform **hyperparameter tuning**.  
   - Document appropriate **considerations**.

---



#### **1. Import Already Filtered Data**  

In [152]:
# The dataset is unusual: after many attempts, I can say with confidence that it contains a lot of outliers.
# While studying it, I found three extremely high, implausible values and decided to remove them by hand.
# This is why this step is called "Already Filtered".

import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np

load_dotenv()
TRAIN_SET = os.getenv("FILTERED_TRAIN_PATH")
TEST_SET  = os.getenv("TEST_PATH")

train = pd.read_csv(TRAIN_SET)
test = pd.read_csv(TEST_SET)


In [153]:
train.head()

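|   | id | Basic_Demos-Enroll_Season | Basic_Demos-Age | Basic_Demos-Sex | CGAS-Season | CGAS-CGAS_Score | Physical-Season | Physical-BMI | Physical-Height | Physical-Weight | ... | PCIAT-PCIAT_18 | PCIAT-PCIAT_19 | PCIAT-PCIAT_20 | PCIAT-PCIAT_Total | SDS-Season | SDS-SDS_Total_Raw | SDS-SDS_Total_T | PreInt_EduHx-Season | PreInt_EduHx-computerinternet_hoursday | sii |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00008ff9 | Fall | 5 | 0 | Winter | 51.0 | Fall | 16.877316 | 46.0 | 50.8 | ... | 4.0 | 2.0 | 4.0 | 55.0 | NaN | NaN | NaN | Fall | 3.0 | 2.0 |
| 1 | 000fd460 | Summer | 9 | 0 | NaN | NaN | Fall | 14.03559 | 48.0 | 46.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | Fall | 46.0 | 64.0 | Summer | 0.0 | 0.0 |
| 2 | 00105258 | Summer | 10 | 1 | Fall | 71.0 | Fall | 16.648696 | 56.5 | 75.6 | ... | 2.0 | 1.0 | 1.0 | 28.0 | Fall | 38.0 | 54.0 | Summer | 2.0 | 0.0 |
| 3 | 00115b9f | Winter | 9 | 0 | Fall | 71.0 | Summer | 18.292347 | 56.0 | 81.6 | ... | 3.0 | 4.0 | 1.0 | 44.0 | Summer | 31.0 | 45.0 | Winter | 0.0 | 1.0 |
| 4 | 0016bb22 | Spring | 18 | 1 | Summer | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |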


In [154]:
test.head()

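|   | id | Basic_Demos-Enroll_Season | Basic_Demos-Age | Basic_Demos-Sex | CGAS-Season | CGAS-CGAS_Score | Physical-Season | Physical-BMI | Physical-Height | Physical-Weight | ... | BIA-BIA_TBW | PAQ_A-Season | PAQ_A-PAQ_A_Total | PAQ_C-Season | PAQ_C-PAQ_C_Total | SDS-Season | SDS-SDS_Total_Raw | SDS-SDS_Total_T | PreInt_EduHx-Season | PreInt_EduHx-computerinternet_hoursday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00008ff9 | Fall | 5 | 0 | Winter | 51.0 | Fall | 16.877316 | 46.0 | 50.8 | ... | 32.6909 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Fall | 3.0 |
| 1 | 000fd460 | Summer | 9 | 0 | NaN | NaN | Fall | 14.03559 | 48.0 | 46.0 | ... | 27.0552 | NaN | NaN | Fall | 2.34 | Fall | 46.0 | 64.0 | Summer | 0.0 |
| 2 | 00105258 | Summer | 10 | 1 | Fall | 71.0 | Fall | 16.648696 | 56.5 | 75.6 | ... | NaN | NaN | NaN | Summer | 2.17 | Fall | 38.0 | 54.0 | Summer | 2.0 |
| 3 | 00115b9f | Winter | 9 | 0 | Fall | 71.0 | Summer | 18.292347 | 56.0 | 81.6 | ... | 45.9966 | NaN | NaN | Winter | 2.451 | Summer | 31.0 | 45.0 | Winter | 0.0 |
| 4 | 0016bb22 | Spring | 18 | 1 | Summer | NaN | NaN | NaN | NaN | NaN | ... | NaN | Summer | 1.04 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |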


In [155]:
# Remove records with 'sii' value NaN
x_train = train.dropna(subset=['sii'])

# Extract 'sii' (target variable) 
y_train = x_train['sii']

In [156]:
# Drop 'id' from both train and test, and the target 'sii' from the train features:
x_train = x_train.drop(columns=['id', 'sii'])
x_test = test.drop(columns=['id'])

In [157]:
x_train.shape, y_train.shape

((2733, 80), (2733,))

In [158]:
x_test.shape

(20, 58)

---
#### **2. Feature Cleaning**  

In [159]:
# FUNCTION DEFINITION

def intersect_features(train, test):
    """Return `train` restricted to the columns that also appear in `test`,
    i.e. drop the features present only in the train dataset."""
    shared_cols = train.columns.intersection(test.columns)
    return train[shared_cols]

In [160]:
x_train = intersect_features(x_train, x_test)
x_train.shape, y_train.shape

((2733, 58), (2733,))
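
The shapes above show the feature count going from 80 to 58, so 22 train-only columns were removed. A small sketch of how such columns could be listed before dropping them, using `Index.difference` on toy frames; the toy frames and values are illustrative only.

```python
import pandas as pd

# Toy frames: the train frame has one column the test frame lacks.
toy_train = pd.DataFrame({"Physical-BMI": [16.9], "PCIAT-PCIAT_Total": [55.0]})
toy_test = pd.DataFrame({"Physical-BMI": [14.0]})

# Columns that exist only in the train frame (these would be dropped).
train_only = toy_train.columns.difference(toy_test.columns)

# Columns shared by both frames (these are kept).
shared = toy_train[toy_train.columns.intersection(toy_test.columns)]

print(list(train_only))
print(list(shared.columns))
```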

In [161]:
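# FUNCTION DEFINITION

def drop_columns(df, threshold):
    """Drop the columns whose fraction of non-missing values is below `threshold`.

    With threshold=0.4, a column is dropped when more than 60% of its values are NaN.
    Returns the reduced DataFrame and the names of the dropped columns.
    """
    min_count = len(df) * threshold  # minimum number of non-missing values required
    too_sparse = df.isnull().sum() > (len(df) - min_count)
    dropped_cols = df.columns[too_sparse]
    df = df.drop(columns=dropped_cols)
    return df, dropped_cols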

In [162]:
x_train, dropped_cols = drop_columns(x_train, 0.4)
dropped_cols

Index(['Physical-Waist_Circumference', 'Fitness_Endurance-Max_Stage',
       'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec',
       'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone',
       'PAQ_A-Season', 'PAQ_A-PAQ_A_Total'],
      dtype='object')

In [163]:
x_train.shape, y_train.shape

((2733, 48), (2733,))
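
Because the `threshold` argument is the required share of non-missing values rather than the share of missing ones, a toy example may help. The sketch below repeats the same logic as `drop_columns` on made-up data with hypothetical column names: with `threshold=0.4`, a column with exactly 60% `NaN` is kept and one with 90% `NaN` is dropped.

```python
import numpy as np
import pandas as pd

def drop_columns(df, threshold):
    # Same logic as the notebook's function, repeated to keep the sketch self-contained.
    min_count = len(df) * threshold
    dropped_cols = df.columns[df.isnull().sum() > (len(df) - min_count)]
    return df.drop(columns=dropped_cols), dropped_cols

# Toy frame with 10 rows; column names and values are made up.
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,           # 90% NaN -> dropped
    "exactly_60": [1.0, 2.0, 3.0, 4.0] + [np.nan] * 6,  # 60% NaN -> kept
    "complete": list(range(10)),                      # 0% NaN  -> kept
})

kept, dropped = drop_columns(toy, 0.4)
print(list(dropped))
print(list(kept.columns))
```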