### **Data Cleaning Workflow** 

---

#### **1. Import Data**  
- Drop records where:  
   - **`'sii'`** or **`'PCIAT-PCIAT_Total'`** is `NaN` (each target is filtered on its own copy of the data).  
- Save these as two different target variables:  
   - 🎯 **`sii`**  
   - 🎯 **`PCIAT`**

---

#### **2. Feature Cleaning**  
-  Drop all **features** that are present **only in the train set**.  
-  Remove features with **more than 60% `NaN` values** (minimum non-`NaN` fraction: **`0.4`**).  
-  Merge all **FGC-Zone features**:  
   - Compute: **`feature + (zone_feature * 0.2)`**.

---

#### **3. Dataset Splitting**  
- **Split the dataset** into:  
   -  **Numerical Features**  
   -  **Categorical Features**

---

#### **4. Categorical Feature Encoding**  
- Use **One-Hot Encoding (OHE)**:  
   -  `pd.get_dummies()`

---

#### **5. Handle Outliers (Optional)**  
- 🚨 **[EVENTUALLY]**: Remove records with extremely **high values**.

---

#### **6. Missing Value Imputation**  
-  **Numerical Features**:  
   - Apply **Random Imputation**.  
   - Ensure the result matches the feature type:  
      - Truncate **float** to **integer** where needed.

---

#### **7. Model-Based Feature Engineering**  
-  Train a **classifier** (e.g., **kNN** or **SVC**) to predict **`PCIAT-PCIAT_Total`**.  
   - If accuracy is **very high**:  
     - ➕ Add a **new feature** to the dataset filled by the model's predictions.

---

#### **8. Export Dataset**  
- **Export the cleaned dataset**.  
- Proceed to:  
   - Build **four classification models**.  
   - Perform **hyperparameter tuning**.  
   - Document appropriate **considerations**.
 


#### **1. Import Data**  

In [585]:
import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np

load_dotenv()
TRAIN_SET = os.getenv("TRAIN_PATH")
TEST_SET  = os.getenv("TEST_PATH")

train = pd.read_csv(TRAIN_SET)
test = pd.read_csv(TEST_SET)

train = train.drop(['id'], axis=1)
test = test.drop(['id'], axis=1)

In [586]:
train.head()

Unnamed: 0,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,...,PCIAT-PCIAT_18,PCIAT-PCIAT_19,PCIAT-PCIAT_20,PCIAT-PCIAT_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday,sii
0,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,...,4.0,2.0,4.0,55.0,,,,Fall,3.0,2.0
1,Summer,9,0,,,Fall,14.03559,48.0,46.0,22.0,...,0.0,0.0,0.0,0.0,Fall,46.0,64.0,Summer,0.0,0.0
2,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,,...,2.0,1.0,1.0,28.0,Fall,38.0,54.0,Summer,2.0,0.0
3,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,,...,3.0,4.0,1.0,44.0,Summer,31.0,45.0,Winter,0.0,1.0
4,Spring,18,1,Summer,,,,,,,...,,,,,,,,,,


In [587]:
test.head()

Unnamed: 0,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Waist_Circumference,...,BIA-BIA_TBW,PAQ_A-Season,PAQ_A-PAQ_A_Total,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,...,32.6909,,,,,,,,Fall,3.0
1,Summer,9,0,,,Fall,14.03559,48.0,46.0,22.0,...,27.0552,,,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,,...,45.9966,,,Winter,2.451,Summer,31.0,45.0,Winter,0.0
4,Spring,18,1,Summer,,,,,,,...,,Summer,1.04,,,,,,,


---
I split the training data into two separate datasets, one per target, because I want to process them independently. Even though the cleaning steps appear to touch the same rows and values, I get inconsistencies if I do not keep them separate.

---

In [588]:
SII_TRAIN   = train.copy()
PCIAT_TRAIN = train.copy()

In [589]:
print("SII Train Shape -> ", SII_TRAIN.shape, "\nSII Test Shape  -> ", test.shape)

SII Train Shape ->  (3960, 81) 
SII Test Shape  ->  (20, 58)


In [590]:
print("PCIAT Train Shape -> ", PCIAT_TRAIN.shape, "\nPCIAT Test Shape  -> ", test.shape)

PCIAT Train Shape ->  (3960, 81) 
PCIAT Test Shape  ->  (20, 58)


In [591]:
print("SII info: ")
SII_TRAIN.info()

SII info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 81 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               3960 non-null   object 
 1   Basic_Demos-Age                         3960 non-null   int64  
 2   Basic_Demos-Sex                         3960 non-null   int64  
 3   CGAS-Season                             2555 non-null   object 
 4   CGAS-CGAS_Score                         2421 non-null   float64
 5   Physical-Season                         3310 non-null   object 
 6   Physical-BMI                            3022 non-null   float64
 7   Physical-Height                         3027 non-null   float64
 8   Physical-Weight                         3076 non-null   float64
 9   Physical-Waist_Circumference            898 non-null    float64
 10  Physical-Diastolic_BP                   2954 non-

In [592]:
print("PCIAT info: ")
PCIAT_TRAIN.info()

PCIAT info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3960 entries, 0 to 3959
Data columns (total 81 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               3960 non-null   object 
 1   Basic_Demos-Age                         3960 non-null   int64  
 2   Basic_Demos-Sex                         3960 non-null   int64  
 3   CGAS-Season                             2555 non-null   object 
 4   CGAS-CGAS_Score                         2421 non-null   float64
 5   Physical-Season                         3310 non-null   object 
 6   Physical-BMI                            3022 non-null   float64
 7   Physical-Height                         3027 non-null   float64
 8   Physical-Weight                         3076 non-null   float64
 9   Physical-Waist_Circumference            898 non-null    float64
 10  Physical-Diastolic_BP                   2954 no

#### **2. Feature Cleaning**  

In [593]:
# drop records with missing target values
SII_TRAIN = SII_TRAIN.dropna(subset=['sii'])
PCIAT_TRAIN = PCIAT_TRAIN.dropna(subset=['PCIAT-PCIAT_Total'])

SII_target = SII_TRAIN['sii']
PCIAT_target = PCIAT_TRAIN['PCIAT-PCIAT_Total']

print("sii target   -> ", SII_target.shape, "\nPCIAT target -> ", PCIAT_target.shape)

sii target   ->  (2736,) 
PCIAT target ->  (2736,)


In [594]:
#drop train-only features
def intersect_features(train, test):
    sm_train = train[train.columns.intersection(test.columns)]
    return sm_train

In [595]:
X_SII_train   = intersect_features(SII_TRAIN, test)
X_PCIAT_train = intersect_features(PCIAT_TRAIN, test)

print("SII Train Shape   -> ", X_SII_train.shape, "\nPCIAT Train Shape -> ", X_PCIAT_train.shape)

SII Train Shape   ->  (2736, 58) 
PCIAT Train Shape ->  (2736, 58)
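
As a small illustration of what `intersect_features` does, here is a sketch on made-up toy frames (the values are hypothetical; only the column names mimic the real data). Columns that exist only in the train set, such as the targets, are not part of the intersection, which is why the 81 train columns shrink to 58 above.

```python
import pandas as pd

# Hypothetical toy frames; the column names only mimic the real dataset.
toy_train = pd.DataFrame({
    "Basic_Demos-Age": [5, 9],
    "PCIAT-PCIAT_Total": [55.0, 0.0],
    "sii": [2.0, 0.0],
})
toy_test = pd.DataFrame({
    "Basic_Demos-Age": [7],
    "PAQ_A-PAQ_A_Total": [1.04],
})

# Columns present in both frames (what intersect_features keeps)
shared = toy_train.columns.intersection(toy_test.columns)
# Columns present only in the train frame (what gets dropped)
train_only = toy_train.columns.difference(toy_test.columns)

print("shared     ->", shared.tolist())
print("train only ->", train_only.tolist())
```

Since the targets are dropped by this step, they have to be saved beforehand, as done with `SII_target` and `PCIAT_target`.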


In [596]:
#drop features with high percentage of missing values

def drop_columns(df, threshold):
    minimum_non_NaN = len(df) * threshold   
    dropped_columns = df.columns[df.isnull().sum() > (len(df) - minimum_non_NaN)].tolist()
    new_df = df.drop(columns=dropped_columns)
    
    return new_df, dropped_columns

In [597]:
X_SII_train, sii_dropped_features = drop_columns(X_SII_train, 0.4)
X_PCIAT_train, pciat_dropped_features = drop_columns(X_PCIAT_train, 0.4)
print("Dropped Features for SII TRAIN are:", len(sii_dropped_features), "   -> ", sii_dropped_features)
print("\nDropped Features for PCIAT TRAIN are:", len(pciat_dropped_features), " -> ", pciat_dropped_features)

Dropped Features for SII TRAIN are: 10    ->  ['Physical-Waist_Circumference', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total']

Dropped Features for PCIAT TRAIN are: 10  ->  ['Physical-Waist_Circumference', 'Fitness_Endurance-Max_Stage', 'Fitness_Endurance-Time_Mins', 'Fitness_Endurance-Time_Sec', 'FGC-FGC_GSND', 'FGC-FGC_GSND_Zone', 'FGC-FGC_GSD', 'FGC-FGC_GSD_Zone', 'PAQ_A-Season', 'PAQ_A-PAQ_A_Total']
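
To make the meaning of the `0.4` threshold concrete, here is a minimal sketch on a hypothetical five-row frame. The helper `drop_sparse_columns` is only a restatement of the rule used in `drop_columns`, written out for the example: a column is dropped when its `NaN` count is strictly greater than `len(df) - len(df) * threshold`, so a column with exactly 60% missing values is kept.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold):
    # Same rule as drop_columns: at least `threshold` of the rows must be non-NaN.
    max_nan = len(df) - len(df) * threshold
    dropped = df.columns[df.isnull().sum() > max_nan].tolist()
    return df.drop(columns=dropped), dropped

# Hypothetical toy data with 0%, 60% and 80% missing values
toy = pd.DataFrame({
    "dense":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "border": [1.0, 2.0, np.nan, np.nan, np.nan],
    "sparse": [1.0, np.nan, np.nan, np.nan, np.nan],
})

kept, dropped = drop_sparse_columns(toy, 0.4)
print("kept    ->", kept.columns.tolist())
print("dropped ->", dropped)
```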


In [598]:
print("SII Train:")
X_SII_train.info()

SII Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 48 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2736 non-null   object 
 1   Basic_Demos-Age                         2736 non-null   int64  
 2   Basic_Demos-Sex                         2736 non-null   int64  
 3   CGAS-Season                             2342 non-null   object 
 4   CGAS-CGAS_Score                         2342 non-null   float64
 5   Physical-Season                         2595 non-null   object 
 6   Physical-BMI                            2527 non-null   float64
 7   Physical-Height                         2530 non-null   float64
 8   Physical-Weight                         2572 non-null   float64
 9   Physical-Diastolic_BP                   2478 non-null   float64
 10  Physical-HeartRate                      2486 non-null 

In [599]:
print("\nPCIAT Train:")
X_PCIAT_train.info()


PCIAT Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 48 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Enroll_Season               2736 non-null   object 
 1   Basic_Demos-Age                         2736 non-null   int64  
 2   Basic_Demos-Sex                         2736 non-null   int64  
 3   CGAS-Season                             2342 non-null   object 
 4   CGAS-CGAS_Score                         2342 non-null   float64
 5   Physical-Season                         2595 non-null   object 
 6   Physical-BMI                            2527 non-null   float64
 7   Physical-Height                         2530 non-null   float64
 8   Physical-Weight                         2572 non-null   float64
 9   Physical-Diastolic_BP                   2478 non-null   float64
 10  Physical-HeartRate                      2486 non-nu

In [600]:
"""
# merge fitness measurement features

def merge_fitness(df):
    df['Fitness_Endurance-Time'] = df['Fitness_Endurance-Time_Sec'] + (df['Fitness_Endurance-Time_Mins']*60) + df['Fitness_Endurance-Max_Stage']
    df = df.drop(['Fitness_Endurance-Time_Mins'], axis=1)
    df = df.drop(['Fitness_Endurance-Time_Sec'], axis=1)
    df = df.drop(['Fitness_Endurance-Max_Stage'], axis=1)
    return df
"""

In [601]:
# Pointless with the 0.4 threshold, because these features have already been dropped. Uncomment to merge them if you change the threshold.

"""
X_merged_fitness = merge_fitness(X_train)
X_train = X_merged_fitness
"""


In [602]:
# merge the FGC-Attr and the FGC-Attr_Zone features
def merge_fgc(train):
    FGC_features = [col for col in train.columns if 'FGC' in col]
    if 'FGC-Season' in FGC_features: 
        FGC_features.remove('FGC-Season')
    removed_features = 0
    zone_features_to_drop = []

    for feature in FGC_features:
        zone_feature = feature + '_Zone'

        if zone_feature in train.columns:
            print(f'Feature: {feature} - Zone: {zone_feature}')
            train[feature] = train[feature] + (train[zone_feature] * 0.2)
            zone_features_to_drop.append(zone_feature)
            removed_features += 1
    train = train.drop(zone_features_to_drop, axis=1)
    return train, removed_features

In [603]:
X_SII_train, sii_removed_features = merge_fgc(X_SII_train)
print(f"Removed {sii_removed_features} features from SII Train")
print("\n")
X_PCIAT_train, pciat_removed_features = merge_fgc(X_PCIAT_train)
print(f"Removed {pciat_removed_features} features from PCIAT Train")

Feature: FGC-FGC_CU - Zone: FGC-FGC_CU_Zone
Feature: FGC-FGC_PU - Zone: FGC-FGC_PU_Zone
Feature: FGC-FGC_SRL - Zone: FGC-FGC_SRL_Zone
Feature: FGC-FGC_SRR - Zone: FGC-FGC_SRR_Zone
Feature: FGC-FGC_TL - Zone: FGC-FGC_TL_Zone
Removed 5 features from SII Train


Feature: FGC-FGC_CU - Zone: FGC-FGC_CU_Zone
Feature: FGC-FGC_PU - Zone: FGC-FGC_PU_Zone
Feature: FGC-FGC_SRL - Zone: FGC-FGC_SRL_Zone
Feature: FGC-FGC_SRR - Zone: FGC-FGC_SRR_Zone
Feature: FGC-FGC_TL - Zone: FGC-FGC_TL_Zone
Removed 5 features from PCIAT Train
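
The arithmetic inside `merge_fgc` can be checked on a few hypothetical values. This sketch only reproduces the `feature + zone * 0.2` expression on toy data; it also shows that pandas propagates `NaN` through the sum, so a row ends up missing whenever either the raw value or its zone is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for one FGC feature and its zone
toy = pd.DataFrame({
    "FGC-FGC_CU":      [10.0, 25.0, 7.0, np.nan],
    "FGC-FGC_CU_Zone": [0.0, 1.0, np.nan, 1.0],
})

# Same expression as in merge_fgc
merged = toy["FGC-FGC_CU"] + toy["FGC-FGC_CU_Zone"] * 0.2
print(merged.tolist())
```

If a raw value should survive a missing zone, one option (an assumption, not what the notebook does) would be to fill the zone with 0 before adding it.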


In [604]:
X_SII_train.head()

Unnamed: 0,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,...,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,...,38.9177,19.5413,32.6909,,,,,,Fall,3.0
1,Summer,9,0,,,Fall,14.03559,48.0,46.0,75.0,...,39.4497,15.4107,27.0552,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,65.0,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,60.0,...,58.9338,26.4798,45.9966,Winter,2.451,Summer,31.0,45.0,Winter,0.0
5,Spring,13,1,Winter,50.0,Summer,22.279952,59.5,112.2,60.0,...,79.6982,35.3804,63.1265,Spring,4.11,Summer,40.0,56.0,Spring,0.0


In [605]:
X_PCIAT_train.head()

Unnamed: 0,Basic_Demos-Enroll_Season,Basic_Demos-Age,Basic_Demos-Sex,CGAS-Season,CGAS-CGAS_Score,Physical-Season,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,...,BIA-BIA_LST,BIA-BIA_SMM,BIA-BIA_TBW,PAQ_C-Season,PAQ_C-PAQ_C_Total,SDS-Season,SDS-SDS_Total_Raw,SDS-SDS_Total_T,PreInt_EduHx-Season,PreInt_EduHx-computerinternet_hoursday
0,Fall,5,0,Winter,51.0,Fall,16.877316,46.0,50.8,,...,38.9177,19.5413,32.6909,,,,,,Fall,3.0
1,Summer,9,0,,,Fall,14.03559,48.0,46.0,75.0,...,39.4497,15.4107,27.0552,Fall,2.34,Fall,46.0,64.0,Summer,0.0
2,Summer,10,1,Fall,71.0,Fall,16.648696,56.5,75.6,65.0,...,,,,Summer,2.17,Fall,38.0,54.0,Summer,2.0
3,Winter,9,0,Fall,71.0,Summer,18.292347,56.0,81.6,60.0,...,58.9338,26.4798,45.9966,Winter,2.451,Summer,31.0,45.0,Winter,0.0
5,Spring,13,1,Winter,50.0,Summer,22.279952,59.5,112.2,60.0,...,79.6982,35.3804,63.1265,Spring,4.11,Summer,40.0,56.0,Spring,0.0


#### **3. Dataset Splitting**

In [606]:
X_SII_train_numerical = X_SII_train.select_dtypes(include=[np.number])
X_SII_train_categorical = X_SII_train.select_dtypes(exclude=[np.number])

sii_list_of_numerical = X_SII_train_numerical.columns.tolist()
sii_list_of_categorical = X_SII_train_categorical.columns.tolist()

print("Numerical Features Shape   -> ", X_SII_train_numerical.shape, "\nCategorical Features Shape -> ", X_SII_train_categorical.shape)
print("\nNumerical Features List    -> ", sii_list_of_numerical, "\nCategorical Features List  -> ", sii_list_of_categorical)

Numerical Features Shape   ->  (2736, 34) 
Categorical Features Shape ->  (2736, 9)

Numerical Features List    ->  ['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'FGC-FGC_CU', 'FGC-FGC_PU', 'FGC-FGC_SRL', 'FGC-FGC_SRR', 'FGC-FGC_TL', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday'] 
Categorical Features List  ->  ['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season', 'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', 'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season']


In [607]:
# And now the same for PCIAT

X_PCIAT_train_numerical = X_PCIAT_train.select_dtypes(include=[np.number])
X_PCIAT_train_categorical = X_PCIAT_train.select_dtypes(exclude=[np.number])

pciat_list_of_numerical = X_PCIAT_train_numerical.columns.tolist()
pciat_list_of_categorical = X_PCIAT_train_categorical.columns.tolist()  # do not overwrite pciat_dropped_features

print("Numerical Features Shape   -> ", X_PCIAT_train_numerical.shape, "\nCategorical Features Shape -> ", X_PCIAT_train_categorical.shape)
print("\nNumerical Features List    -> ", pciat_list_of_numerical, "\nCategorical Features List  -> ", pciat_list_of_categorical)

Numerical Features Shape   ->  (2736, 34) 
Categorical Features Shape ->  (2736, 9)

Numerical Features List    ->  ['Basic_Demos-Age', 'Basic_Demos-Sex', 'CGAS-CGAS_Score', 'Physical-BMI', 'Physical-Height', 'Physical-Weight', 'Physical-Diastolic_BP', 'Physical-HeartRate', 'Physical-Systolic_BP', 'FGC-FGC_CU', 'FGC-FGC_PU', 'FGC-FGC_SRL', 'FGC-FGC_SRR', 'FGC-FGC_TL', 'BIA-BIA_Activity_Level_num', 'BIA-BIA_BMC', 'BIA-BIA_BMI', 'BIA-BIA_BMR', 'BIA-BIA_DEE', 'BIA-BIA_ECW', 'BIA-BIA_FFM', 'BIA-BIA_FFMI', 'BIA-BIA_FMI', 'BIA-BIA_Fat', 'BIA-BIA_Frame_num', 'BIA-BIA_ICW', 'BIA-BIA_LDM', 'BIA-BIA_LST', 'BIA-BIA_SMM', 'BIA-BIA_TBW', 'PAQ_C-PAQ_C_Total', 'SDS-SDS_Total_Raw', 'SDS-SDS_Total_T', 'PreInt_EduHx-computerinternet_hoursday'] 
Categorical Features List  ->  ['Basic_Demos-Enroll_Season', 'CGAS-Season', 'Physical-Season', 'Fitness_Endurance-Season', 'FGC-Season', 'BIA-Season', 'PAQ_C-Season', 'SDS-Season', 'PreInt_EduHx-Season']


In [608]:
""" 
X_train_numerical = X_train.select_dtypes(include=[np.number])
X_train_categorical = X_train.select_dtypes(exclude=[np.number])

list_of_numerical = X_train_numerical.columns.tolist()
list_of_categorical = X_train_categorical.columns.tolist()

print("Numerical Features Shape   -> ", X_train_numerical.shape, "\nCategorical Features Shape -> ", X_train_categorical.shape)
print("\nNumerical Features List    -> ", list_of_numerical, "\nCategorical Features List  -> ", list_of_categorical)
""" 


#### **4. Categorical Feature Encoding**  

In [609]:
# For the sii dataset

X_SII_train_categorical_decoded = pd.get_dummies(X_SII_train_categorical)
X_SII_train_categorical_decoded *= 1
X_SII_train_categorical_decoded.shape

(2736, 36)

In [610]:
# And for the PCIAT dataset

X_PCIAT_train_categorical_decoded = pd.get_dummies(X_PCIAT_train_categorical)
X_PCIAT_train_categorical_decoded *= 1
X_PCIAT_train_categorical_decoded.shape

(2736, 36)
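
A short sketch of how `pd.get_dummies` treats missing categories, on a hypothetical single-column frame. By default a `NaN` season is not given its own column, so the row is encoded as all zeros. That is consistent with the shape above (9 season columns × 4 seasons = 36 columns) and explains why the encoded frame has no null values even though the original season columns do. Passing `dtype=int` is shown here as an alternative to the `*= 1` trick; it is not what the cells above use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing season
toy = pd.DataFrame({"CGAS-Season": ["Fall", np.nan, "Winter", "Fall"]})

# dtype=int yields 0/1 integers directly instead of booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)

# The row with the missing season has a 0 in every dummy column
print("row 1 sum ->", encoded.iloc[1].sum())
```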

In [611]:
print("SII Train:")
X_SII_train_categorical_decoded.info()

SII Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 36 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Basic_Demos-Enroll_Season_Fall    2736 non-null   int64
 1   Basic_Demos-Enroll_Season_Spring  2736 non-null   int64
 2   Basic_Demos-Enroll_Season_Summer  2736 non-null   int64
 3   Basic_Demos-Enroll_Season_Winter  2736 non-null   int64
 4   CGAS-Season_Fall                  2736 non-null   int64
 5   CGAS-Season_Spring                2736 non-null   int64
 6   CGAS-Season_Summer                2736 non-null   int64
 7   CGAS-Season_Winter                2736 non-null   int64
 8   Physical-Season_Fall              2736 non-null   int64
 9   Physical-Season_Spring            2736 non-null   int64
 10  Physical-Season_Summer            2736 non-null   int64
 11  Physical-Season_Winter            2736 non-null   int64
 12  Fitness_Endurance-Season_Fal

In [612]:
print("\nPCIAT Train:")
X_PCIAT_train_categorical_decoded.info()


PCIAT Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 36 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Basic_Demos-Enroll_Season_Fall    2736 non-null   int64
 1   Basic_Demos-Enroll_Season_Spring  2736 non-null   int64
 2   Basic_Demos-Enroll_Season_Summer  2736 non-null   int64
 3   Basic_Demos-Enroll_Season_Winter  2736 non-null   int64
 4   CGAS-Season_Fall                  2736 non-null   int64
 5   CGAS-Season_Spring                2736 non-null   int64
 6   CGAS-Season_Summer                2736 non-null   int64
 7   CGAS-Season_Winter                2736 non-null   int64
 8   Physical-Season_Fall              2736 non-null   int64
 9   Physical-Season_Spring            2736 non-null   int64
 10  Physical-Season_Summer            2736 non-null   int64
 11  Physical-Season_Winter            2736 non-null   int64
 12  Fitness_Endurance-Season_

#### **5. Handle Outliers (Let's Try!)** 
I first tried to filter outliers with the classic IQR rule. Even with a relaxed multiplier of 3 instead of 1.5, the dataset shrank from 2736 rows to 550.
I then tried a percentile-based approach: I selected the most important feature groups (columns starting with `Physical`, `BIA` and `FGC`), chose a very high exclusive percentile (99.99), and dropped every row in which at least one of those features exceeded that threshold. Even with such a high percentile, many records were still removed: after plotting histograms for every feature and applying the filter, I was left with a dataframe of about 1200 rows.
I therefore commented out every cell in this section so that I could continue with the data cleaning, and I may try other approaches in the future.
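
One thing that may contribute to the large row loss is that a comparison such as `df[col] < limit` evaluates to `False` for `NaN`, so every row with a missing value in any selected column is removed along with the true outliers. The sketch below, on hypothetical toy data and with a 90th percentile chosen only to make the effect visible, compares the strict filter with a variant that keeps missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one extreme BMI value and several missing entries
toy = pd.DataFrame({
    "Physical-BMI": [16.0, 18.0, np.nan, 17.0, 95.0],
    "BIA-BIA_Fat":  [np.nan, 20.0, 22.0, np.nan, 21.0],
})
limits = {col: np.nanpercentile(toy[col], 90) for col in toy.columns}

# Strict filter: NaN < limit is False, so rows with missing values are dropped too
strict = toy.copy()
for col in toy.columns:
    strict = strict[strict[col] < limits[col]]

# NaN-tolerant filter: a row is removed only if an observed value exceeds the limit
keep = pd.Series(True, index=toy.index)
for col in toy.columns:
    keep &= toy[col].isna() | (toy[col] < limits[col])
tolerant = toy[keep]

print("strict rows   ->", len(strict))
print("tolerant rows ->", len(tolerant))
```

Whether this accounts for the numbers seen on the real dataset would need to be checked on the data itself.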

In [613]:
"""
# Variables
filtered_data = X_SII_train_numerical.copy()
feature_clusters = ['Physical', 'FGC', 'BIA']
threshold = 99.99

filtered_data.shape 
"""

In [614]:
"""
# Filter the columns based on the feature clusters prefix
selected_columns = [col for col in X_SII_train_numerical.columns if any(col.startswith(prefix) for prefix in feature_clusters)]
print("Selected Columns:", selected_columns)
"""


In [615]:
"""
# As I have seen, filtering outliers with IQR is not the best approach for this dataset. I get to remove A LOT of rows (from 2736 records to 550 rows)
# Calculate the ?th percentile for the selected columns
percentiles = { col: np.nanpercentile(X_SII_train_numerical[col], threshold) for col in selected_columns }
print("Percentiles:", percentiles)
""" 


In [616]:
"""
import matplotlib.pyplot as plt

columns_per_row = 4
# Calculate the number of rows required to display all the selected columns
num_rows = (len(selected_columns) + columns_per_row - 1) // columns_per_row  # Ceiling division

# Create the figure and subplots grid
fig, axes = plt.subplots(num_rows, columns_per_row, figsize=(16, 5 * num_rows))

# Flatten the axes array for easier iteration (in case it's multi-dimensional)
axes = axes.flatten()

for i, col in enumerate(selected_columns):
    ax = axes[i]  # Get the subplot axis for the current column
    
    # Plot the histogram on the current axis
    X_SII_train_numerical[col].plot(kind='hist', bins=50, ax=ax, alpha=0.7, title=col)
    
    # Calculate the 99th percentile
    percentile_99 = np.nanpercentile(X_SII_train_numerical[col], 99)
    
    # Plot the vertical line at the 99th percentile
    ax.axvline(percentile_99, color='r', linestyle='dashed', linewidth=2, label=f'99th Percentile: {percentile_99:.2f}')
    
    # Add a legend to each subplot
    ax.legend()

# Hide any unused subplots (if the total number of columns is not a multiple of `columns_per_row`)
for j in range(i + 1, len(axes)):
    axes[j].axis('off')  # Hide the axes for any unused subplots

# Adjust layout to prevent overlap
plt.tight_layout()

# Show the plot
plt.show()
"""

In [617]:
"""
# Filter rows independently for each column
filtered_data = X_SII_train_numerical.copy()
for col in selected_columns:
    filtered_data = filtered_data[filtered_data[col] < percentiles[col]]

print("\nFiltered Data:\n", filtered_data)
"""


In [618]:
"""
filtered_data.shape
"""


#### **6. Missing Value Imputation** 

In [619]:
#""" """
print("SII Train:")
X_SII_train_categorical_decoded.info()
#""" """

SII Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 36 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   Basic_Demos-Enroll_Season_Fall    2736 non-null   int64
 1   Basic_Demos-Enroll_Season_Spring  2736 non-null   int64
 2   Basic_Demos-Enroll_Season_Summer  2736 non-null   int64
 3   Basic_Demos-Enroll_Season_Winter  2736 non-null   int64
 4   CGAS-Season_Fall                  2736 non-null   int64
 5   CGAS-Season_Spring                2736 non-null   int64
 6   CGAS-Season_Summer                2736 non-null   int64
 7   CGAS-Season_Winter                2736 non-null   int64
 8   Physical-Season_Fall              2736 non-null   int64
 9   Physical-Season_Spring            2736 non-null   int64
 10  Physical-Season_Summer            2736 non-null   int64
 11  Physical-Season_Winter            2736 non-null   int64
 12  Fitness_Endurance-Season_Fal

In [620]:
#""" """
print("\nPCIAT Train:")
X_PCIAT_train_numerical.info()

#X_train.info()
#"""


PCIAT Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 34 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2736 non-null   int64  
 1   Basic_Demos-Sex                         2736 non-null   int64  
 2   CGAS-CGAS_Score                         2342 non-null   float64
 3   Physical-BMI                            2527 non-null   float64
 4   Physical-Height                         2530 non-null   float64
 5   Physical-Weight                         2572 non-null   float64
 6   Physical-Diastolic_BP                   2478 non-null   float64
 7   Physical-HeartRate                      2486 non-null   float64
 8   Physical-Systolic_BP                    2478 non-null   float64
 9   FGC-FGC_CU                              1884 non-null   float64
 10  FGC-FGC_PU                              1874 non-nu

I do not fill missing values with the mean (or the median), because that would put one single value into features that, in the worst case, have 59% of their values missing, which carries too little information. I use random sampling instead:

In [621]:
#""" """
print("SII Train:")
X_SII_train_numerical.shape
#""" """

SII Train:


(2736, 34)

In [622]:
print("\nPCIAT Train:")
X_PCIAT_train_numerical.shape


PCIAT Train:


(2736, 34)

In [623]:
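# Function for random sampling imputation

def random_imputation(series):
    """Fill missing values in a pandas Series by sampling, with replacement, from its non-missing values."""
    series = series.copy()  # work on a copy so the caller's column is not mutated
    missing_indices = series[series.isnull()].index  # indices of missing values
    sampled_values = np.random.choice(series.dropna(), size=len(missing_indices), replace=True)
    # Note: pandas stores a column that contains NaN as float, so in practice the
    # integer cast below does not trigger for columns with missing values.
    series.loc[missing_indices] = sampled_values.astype(int) if series.dtype in ['int64', 'int32'] else sampled_values
    return series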

In [624]:
# Impute all features in the dataset with random sampling

def impute_all_features_with_random_sampling(df):
    # Apply random sampling imputation to all columns with missing values in a DataFrame.
    for column in df.columns:
        if df[column].isnull().sum() > 0:  # Check if column has missing values
            df[column] = random_imputation(df[column])
    return df
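
To illustrate why random sampling is preferred here over a single fill value, the following sketch compares the two on a hypothetical series that is half missing. The helper `random_impute` and the fixed seed are assumptions made for the example (a seeded `np.random.default_rng` makes the result reproducible); they are not the functions used in this notebook.

```python
import numpy as np
import pandas as pd

def random_impute(series, rng):
    """Return a copy with NaN replaced by random draws from the observed values."""
    out = series.copy()
    missing = out.isna()
    observed = out.dropna().to_numpy()
    if missing.any() and observed.size > 0:
        out.loc[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out

rng = np.random.default_rng(0)  # hypothetical fixed seed for reproducibility

# Hypothetical toy series: six observed values, six missing
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0] + [np.nan] * 6)

mean_filled = values.fillna(values.mean())
random_filled = random_impute(values, rng)

# Mean imputation piles half of the rows onto one value and shrinks the spread
print("std of observed values ->", values.std())
print("std after mean fill    ->", mean_filled.std())
print("std after random fill  ->", random_filled.std())
```

Mean imputation always reduces the standard deviation when values are missing, while random sampling draws from the observed distribution and therefore tends to preserve its shape.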

In [625]:
"""
pd.options.mode.chained_assignment = None  # default='warn'

# X_filled = impute_all_features_with_random_sampling(X_train_numerical)
X_SII_filled = impute_all_features_with_random_sampling(X_SII_train)

# Print to verify no missing values remain
print("SII Train:")
X_SII_filled.info()
"""


In [626]:
"""
pd.options.mode.chained_assignment = None  # default='warn'

# X_filled = impute_all_features_with_random_sampling(X_train_numerical)
X_PCIAT_filled = impute_all_features_with_random_sampling(X_PCIAT_train)

# Print to verify no missing values remain
print("PCIAT Train:")
X_PCIAT_filled.info()
"""


In [627]:
"""
X_PCIAT_filled.head()
"""


#### **7. Data Splitting**

In [628]:
"""
X_train_numerical = X_train.select_dtypes(include=[np.number])
X_train_categorical = X_train.select_dtypes(exclude=[np.number])

list_of_numerical = X_train_numerical.columns.tolist()
list_of_categorical = X_train_categorical.columns.tolist()

print("Numerical Features Shape   -> ", X_train_numerical.shape, "\nCategorical Features Shape -> ", X_train_categorical.shape)
print("\nNumerical Features List    -> ", list_of_numerical, "\nCategorical Features List  -> ", list_of_categorical)
"""


#### **4. Categorical Feature Encoding**

In [629]:
"""
X_train_categorical_decoded = pd.get_dummies(X_train_categorical)
X_train_categorical_decoded *= 1
X_train_categorical_decoded.shape
"""


In [630]:
"""
X_train_categorical_decoded.info()
"""


In [631]:
"""
X_train_numerical.info()
"""


#### **Merge the encoded categorical features with the numerical features**

In [632]:
# Merge the numerical features with the one-hot encoded categorical features
X_SII_train = pd.concat([X_SII_train_numerical, X_SII_train_categorical_decoded], axis=1)
X_SII_train.shape

(2736, 70)

In [633]:
X_PCIAT_train = pd.concat([X_PCIAT_train_numerical, X_PCIAT_train_categorical_decoded], axis=1)
X_PCIAT_train.shape

(2736, 70)
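
The split, encode, and merge steps above can be sketched end to end on a toy frame. This is only an illustration under stated assumptions: the column names and values are made up, and `dtype=int` is used in `pd.get_dummies` as an alternative to multiplying the boolean dummies by 1.

```python
import numpy as np
import pandas as pd

# Toy data with a non-contiguous index, as happens after dropping rows
toy = pd.DataFrame(
    {
        "age": [5, 9, 10],
        "bmi": [16.9, np.nan, 16.6],
        "season": ["Fall", None, "Summer"],
    },
    index=[0, 1, 3],
)

# Split into numerical and categorical features
numerical = toy.select_dtypes(include=[np.number])
categorical = toy.select_dtypes(exclude=[np.number])

# One-hot encode; dtype=int yields 0/1 columns directly
encoded = pd.get_dummies(categorical, dtype=int)

# Column-wise merge; rows are aligned on the shared index
merged = pd.concat([numerical, encoded], axis=1)
print(merged.shape)
print(merged.isnull().sum())
```

With the default `dummy_na=False`, a missing category becomes a row of zeros in the encoded columns rather than a `NaN`, so an imputation step that runs after the merge only has numerical gaps left to fill.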

___
After merging the numerical and categorical features, I call the imputation function to fill in the remaining NaN values.

In [634]:
pd.options.mode.chained_assignment = None  # default='warn'

# X_filled = impute_all_features_with_random_sampling(X_train_numerical)
X_SII_filled = impute_all_features_with_random_sampling(X_SII_train)

# Print to verify no missing values remain
print("SII Train:")
X_SII_filled.info()

SII Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 70 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2736 non-null   int64  
 1   Basic_Demos-Sex                         2736 non-null   int64  
 2   CGAS-CGAS_Score                         2736 non-null   float64
 3   Physical-BMI                            2736 non-null   float64
 4   Physical-Height                         2736 non-null   float64
 5   Physical-Weight                         2736 non-null   float64
 6   Physical-Diastolic_BP                   2736 non-null   float64
 7   Physical-HeartRate                      2736 non-null   float64
 8   Physical-Systolic_BP                    2736 non-null   float64
 9   FGC-FGC_CU                              2736 non-null   float64
 10  FGC-FGC_PU                              2736 non-null 

In [635]:
pd.options.mode.chained_assignment = None  # default='warn'

# X_filled = impute_all_features_with_random_sampling(X_train_numerical)
X_PCIAT_filled = impute_all_features_with_random_sampling(X_PCIAT_train)

# Print to verify no missing values remain
print("PCIAT Train:")
X_PCIAT_filled.info()

PCIAT Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 70 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2736 non-null   int64  
 1   Basic_Demos-Sex                         2736 non-null   int64  
 2   CGAS-CGAS_Score                         2736 non-null   float64
 3   Physical-BMI                            2736 non-null   float64
 4   Physical-Height                         2736 non-null   float64
 5   Physical-Weight                         2736 non-null   float64
 6   Physical-Diastolic_BP                   2736 non-null   float64
 7   Physical-HeartRate                      2736 non-null   float64
 8   Physical-Systolic_BP                    2736 non-null   float64
 9   FGC-FGC_CU                              2736 non-null   float64
 10  FGC-FGC_PU                              2736 non-nul

In [636]:
X_SII_filled.head()

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,FGC-FGC_CU,...,PAQ_C-Season_Summer,PAQ_C-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
0,5,0,51.0,16.877316,46.0,50.8,66.0,71.0,110.0,0.0,...,0,0,0,0,0,0,1,0,0,0
1,9,0,80.0,14.03559,48.0,46.0,75.0,70.0,122.0,3.0,...,0,0,1,0,0,0,0,0,1,0
2,10,1,71.0,16.648696,56.5,75.6,65.0,94.0,117.0,20.2,...,1,0,1,0,0,0,0,0,1,0
3,9,0,71.0,18.292347,56.0,81.6,60.0,97.0,117.0,18.2,...,0,1,0,0,1,0,0,0,0,1
5,13,1,50.0,22.279952,59.5,112.2,60.0,73.0,102.0,12.0,...,0,0,0,0,1,0,0,1,0,0


In [637]:
X_PCIAT_filled.head()

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,FGC-FGC_CU,...,PAQ_C-Season_Summer,PAQ_C-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter
0,5,0,51.0,16.877316,46.0,50.8,81.0,73.0,96.0,0.0,...,0,0,0,0,0,0,1,0,0,0
1,9,0,60.0,14.03559,48.0,46.0,75.0,70.0,122.0,3.0,...,0,0,1,0,0,0,0,0,1,0
2,10,1,71.0,16.648696,56.5,75.6,65.0,94.0,117.0,20.2,...,1,0,1,0,0,0,0,0,1,0
3,9,0,71.0,18.292347,56.0,81.6,60.0,97.0,117.0,18.2,...,0,1,0,0,1,0,0,0,0,1
5,13,1,50.0,22.279952,59.5,112.2,60.0,73.0,102.0,12.0,...,0,0,0,0,1,0,0,1,0,0


Now I convert all features to the same type (_float64_).

In [638]:
# Convert every feature in both datasets to float64
X_SII_filled = X_SII_filled.astype('float64')
X_PCIAT_filled = X_PCIAT_filled.astype('float64')

In [639]:
print("SII Train:")
X_SII_train = X_SII_filled
X_SII_train.info()

SII Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 70 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2736 non-null   float64
 1   Basic_Demos-Sex                         2736 non-null   float64
 2   CGAS-CGAS_Score                         2736 non-null   float64
 3   Physical-BMI                            2736 non-null   float64
 4   Physical-Height                         2736 non-null   float64
 5   Physical-Weight                         2736 non-null   float64
 6   Physical-Diastolic_BP                   2736 non-null   float64
 7   Physical-HeartRate                      2736 non-null   float64
 8   Physical-Systolic_BP                    2736 non-null   float64
 9   FGC-FGC_CU                              2736 non-null   float64
 10  FGC-FGC_PU                              2736 non-null 

In [640]:
print("\nPCIAT Train:")
X_PCIAT_train = X_PCIAT_filled
X_PCIAT_train.info()


PCIAT Train:
<class 'pandas.core.frame.DataFrame'>
Index: 2736 entries, 0 to 3958
Data columns (total 70 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   Basic_Demos-Age                         2736 non-null   float64
 1   Basic_Demos-Sex                         2736 non-null   float64
 2   CGAS-CGAS_Score                         2736 non-null   float64
 3   Physical-BMI                            2736 non-null   float64
 4   Physical-Height                         2736 non-null   float64
 5   Physical-Weight                         2736 non-null   float64
 6   Physical-Diastolic_BP                   2736 non-null   float64
 7   Physical-HeartRate                      2736 non-null   float64
 8   Physical-Systolic_BP                    2736 non-null   float64
 9   FGC-FGC_CU                              2736 non-null   float64
 10  FGC-FGC_PU                              2736 non-nu

---
#### **7. Model-Based Feature Engineering**  

In [641]:
# Use the PCIAT feature set to predict PCIAT_target with a kNN classifier

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X_PCIAT_train, PCIAT_target, test_size=0.2, random_state=42)

# Scale the features: fit on the training split only, then apply to the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

for k in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"k: {k:2d} | Accuracy {acc:.3f}")

# Accuracy is very low for every k, so no model-based feature is added for now.
# The classifier treats each distinct total score as its own class, so only exact matches count.
# I will revisit this in the future.

k:  1 | Accuracy 0.027
k:  2 | Accuracy 0.053
k:  3 | Accuracy 0.068
k:  4 | Accuracy 0.080
k:  5 | Accuracy 0.086
k:  6 | Accuracy 0.078
k:  7 | Accuracy 0.082
k:  8 | Accuracy 0.091
k:  9 | Accuracy 0.086
k: 10 | Accuracy 0.084
k: 11 | Accuracy 0.089
k: 12 | Accuracy 0.093
k: 13 | Accuracy 0.097
k: 14 | Accuracy 0.097
k: 15 | Accuracy 0.097
k: 16 | Accuracy 0.099
k: 17 | Accuracy 0.099
k: 18 | Accuracy 0.099
k: 19 | Accuracy 0.100
k: 20 | Accuracy 0.104
k: 21 | Accuracy 0.104
k: 22 | Accuracy 0.104
k: 23 | Accuracy 0.102
k: 24 | Accuracy 0.104
k: 25 | Accuracy 0.102
k: 26 | Accuracy 0.100
k: 27 | Accuracy 0.102
k: 28 | Accuracy 0.106
k: 29 | Accuracy 0.106
k: 30 | Accuracy 0.106
k: 31 | Accuracy 0.108
k: 32 | Accuracy 0.108
k: 33 | Accuracy 0.111
k: 34 | Accuracy 0.111
k: 35 | Accuracy 0.113
k: 36 | Accuracy 0.115
k: 37 | Accuracy 0.115
k: 38 | Accuracy 0.115
k: 39 | Accuracy 0.113
k: 40 | Accuracy 0.113
k: 41 | Accuracy 0.115
k: 42 | Accuracy 0.115
k: 43 | Accuracy 0.113
k: 44 | Accuracy 0.111
k: 45 | Accuracy 0.113
k: 46 | Accuracy 0.111
k: 47 | Accuracy 0.109
k: 48 | Accuracy 0.111
k: 49 | Accuracy 0.113
k: 50 | Accuracy 0.115
k: 51 | Accuracy 0.115
k: 52 | Acc
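
One possible way to revisit this step is to treat the total score as a number rather than as a class label and measure the error in score points. The sketch below is not the notebook's method: it swaps in `KNeighborsRegressor` and mean absolute error, and it runs on synthetic data (the feature matrix, the score formula, and the list of `k` values are all assumptions made for illustration).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and an integer-valued score
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = np.clip(np.round(40 + 15 * X[:, 0] + rng.normal(scale=5, size=400)), 0, 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for k in (1, 5, 15, 35):
    # The pipeline fits the scaler on the training split only
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(X_tr, y_tr)
    results[k] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"k: {k:2d} | MAE {results[k]:.2f}")

best_k = min(results, key=results.get)
print("Best k by MAE:", best_k)
```

A regression metric gives partial credit for predictions that are close to the true score, which exact-match accuracy does not. Whether the resulting predictions are good enough to add as a new feature would still have to be judged on the real data.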

#### **8. Export Dataset**

In [642]:
export_data = pd.concat([X_SII_train, SII_target], axis=1)

export_data.head()

Unnamed: 0,Basic_Demos-Age,Basic_Demos-Sex,CGAS-CGAS_Score,Physical-BMI,Physical-Height,Physical-Weight,Physical-Diastolic_BP,Physical-HeartRate,Physical-Systolic_BP,FGC-FGC_CU,...,PAQ_C-Season_Winter,SDS-Season_Fall,SDS-Season_Spring,SDS-Season_Summer,SDS-Season_Winter,PreInt_EduHx-Season_Fall,PreInt_EduHx-Season_Spring,PreInt_EduHx-Season_Summer,PreInt_EduHx-Season_Winter,sii
0,5.0,0.0,51.0,16.877316,46.0,50.8,66.0,71.0,110.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0
1,9.0,0.0,80.0,14.03559,48.0,46.0,75.0,70.0,122.0,3.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,10.0,1.0,71.0,16.648696,56.5,75.6,65.0,94.0,117.0,20.2,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,9.0,0.0,71.0,18.292347,56.0,81.6,60.0,97.0,117.0,18.2,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
5,13.0,1.0,50.0,22.279952,59.5,112.2,60.0,73.0,102.0,12.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [643]:
output_path = os.path.join('dataset/', 'v5_cleaned_dataset.csv')
export_data.to_csv(output_path, index=False)

print(f"Cleaned dataset saved to {output_path}")

Cleaned dataset saved to dataset/v5_cleaned_dataset.csv
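
The export step can be checked with a quick round trip. The sketch below is illustrative only: it writes a toy frame to a temporary directory (the file name `cleaned_sketch.csv` is hypothetical), creates the target folder first so that `to_csv` does not fail on a missing directory, and reads the file back to confirm its shape.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a non-contiguous index, similar to the cleaned dataset
export_sketch = pd.DataFrame({"feature": [1.0, 2.0], "sii": [0.0, 2.0]}, index=[0, 5])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "dataset")
    os.makedirs(out_dir, exist_ok=True)  # make sure the target folder exists
    out_path = os.path.join(out_dir, "cleaned_sketch.csv")

    # index=False drops the original row labels from the file
    export_sketch.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
print(list(reloaded.index))
```

Because the file is written with `index=False`, the reloaded frame gets a fresh 0-based index, so the original row labels cannot be used later to join back to the raw data.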
