<b><font size="6">Predictive Modelling Pipeline Template</font></b><br><br>

This notebook presents the main steps you should follow throughout your project.


<b> Important: The numbered sections and subsections are merely indicative of some of the steps you should pay attention to in your project. <br>You are not required to strictly follow this order or to execute everything in separate cells.</b>
    
<img src="image/process_ML.png" style="height:70px">

In [300]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler
import warnings
warnings.filterwarnings('ignore')

<a class="anchor" id="">

# 1. Import data (Data Integration)

</a>


<img src="image/step1.png" style="height:60px">

In [301]:
# Load the data in a simple way
obesity_train_raw = pd.read_csv('../data/obesity_train.csv')
obesity_test_raw = pd.read_csv('../data/obesity_test.csv') 

In [302]:
obesity_train_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,...,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
0,1,21.0,Never,no,up to 5,Sometimes,Female,1.62,,3.0,...,yes,,LatAm,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
1,2,23.0,Frequently,no,up to 5,Sometimes,Male,1.8,,3.0,...,yes,3 to 4,LatAm,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
2,3,,Frequently,no,up to 2,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
3,4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,,1.0,...,no,,LatAm,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
4,5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,,3.0,...,no,5 or more,LatAm,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
5,6,24.0,Frequently,yes,up to 5,Sometimes,Male,1.78,,3.0,...,yes,1 to 2,LatAm,2.0,no,Public,Always,1 to 2,64.0,Normal_Weight
6,7,21.0,Sometimes,yes,up to 5,Frequently,Female,1.72,,3.0,...,yes,3 to 4,,2.0,no,Public,Sometimes,1 to 2,80.0,Overweight_Level_II
7,8,22.0,Sometimes,no,up to 2,Sometimes,Male,1.65,,3.0,...,no,3 to 4,LatAm,1.0,no,Public,Always,more than 2,56.0,Normal_Weight
8,9,41.0,Frequently,yes,up to 5,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,0.0,no,Car,Sometimes,1 to 2,99.0,Obesity_Type_I
9,10,27.0,Sometimes,yes,up to 2,Sometimes,Male,1.93,,1.0,...,yes,1 to 2,LatAm,2.0,no,Public,Sometimes,less than 1,102.0,Overweight_Level_II


In [303]:
obesity_test_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight
0,1612,21.0,Sometimes,no,up to 2,Sometimes,Female,1.52,,3.0,yes,yes,5 or more,LatAm,3.0,yes,Public,Always,more than 2,56.0
1,1613,29.0,Sometimes,yes,up to 2,Sometimes,Male,1.62,,3.0,no,no,,LatAm,3.0,no,Car,Sometimes,1 to 2,53.0
2,1614,23.0,Sometimes,,up to 2,Sometimes,Female,1.5,,3.0,no,yes,1 to 2,LatAm,2.0,no,Motorbike,Always,1 to 2,
3,1615,22.0,Never,yes,up to 5,Sometimes,Male,1.72,,3.0,no,yes,1 to 2,LatAm,1.0,no,Public,Sometimes,1 to 2,68.0
4,1616,26.0,Sometimes,yes,more than 5,Frequently,Male,1.85,,3.0,no,yes,3 to 4,LatAm,1.0,no,Public,Always,more than 2,105.0
5,1617,23.0,Sometimes,yes,up to 5,Sometimes,Male,1.77,,1.0,no,yes,1 to 2,LatAm,2.0,no,Public,Always,less than 1,60.0
6,1618,22.0,Sometimes,no,up to 5,Always,Female,1.7,,3.0,yes,yes,3 to 4,LatAm,1.0,no,Public,Always,1 to 2,
7,1619,29.0,Never,yes,up to 2,Sometimes,Female,1.53,,1.0,no,no,,LatAm,0.0,no,Car,Sometimes,1 to 2,78.0
8,1620,30.0,Never,yes,up to 2,Frequently,Female,1.71,,4.0,no,yes,,LatAm,0.0,yes,Car,Always,less than 1,82.0
9,1621,23.0,Sometimes,yes,up to 5,Frequently,Female,1.6,,4.0,no,no,3 to 4,LatAm,3.0,no,Car,Sometimes,1 to 2,52.0


<a class="anchor" id="">

# 2. Explore data (Data access, exploration and understanding)

</a>

<img src="image/step2.png" style="height:60px">

Remember, this step is very important: it is at this stage that you really look into the data you have. Generally speaking, if you do this stage well, the following stages will go much more smoothly.

Moreover, you should also take the time to find meaningful patterns in the data: what interesting relationships can be found between the variables, and how can that knowledge inform your future decisions?
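
As one possible starting point for this exploration, the sketch below computes the share of missing values per column and the class balance of the target. The small DataFrame is made-up illustrative data, not the project dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) mirroring a few columns of the training set
toy = pd.DataFrame({
    "age": [21.0, 23.0, np.nan, 22.0],
    "weight": [64.0, 77.0, 87.0, np.nan],
    "obese_level": ["Normal_Weight", "Normal_Weight",
                    "Overweight_Level_I", "Overweight_Level_II"],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Relative frequency of each target class
class_balance = toy["obese_level"].value_counts(normalize=True)

print(missing_share)
print(class_balance)
```

Columns with a high missing share and classes with few examples are the ones most likely to need a deliberate decision later on.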

In [304]:
# Display information about the training dataset
obesity_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1611 non-null   int64  
 1   age                        1545 non-null   float64
 2   alcohol_freq               1575 non-null   object 
 3   caloric_freq               1591 non-null   object 
 4   devices_perday             1589 non-null   object 
 5   eat_between_meals          1552 non-null   object 
 6   gender                     1591 non-null   object 
 7   height                     1597 non-null   float64
 8   marrital_status            0 non-null      float64
 9   meals_perday               1602 non-null   float64
 10  monitor_calories           1572 non-null   object 
 11  parent_overweight          1591 non-null   object 
 12  physical_activity_perweek  1046 non-null   object 
 13  region                     1544 non-null   objec

<a class="anchor" id="">

# 3. Modify data (Data preparation)

</a>

<img src="image/step3.png" style="height:60px">

Use this section to apply transformations to your dataset.

Remember that your decisions at this step should be exclusively informed by your **training data**. While you will need to split your data between training and validation, how that split is made and how the appropriate transformations are applied will depend on the type of model assessment solution you select for your project (each has its own set of advantages and disadvantages that you need to consider). **Please find a list of possible methods for model assessment below**:

1. **Holdout method**
2. **Repeated Holdout method**
3. **Cross-Validation**

__Note:__ Instead of creating different sections for the treatment of training and validation data, you can make the transformations in the same cell. There is no need to create a specific section for that.
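
The three assessment options listed above can be sketched with scikit-learn as follows. The data here is synthetic (`make_classification`), and the split sizes, seeds and the helper name `fit_and_score` are arbitrary assumptions; the point is only that the scaler is fitted on the training part of each split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, StratifiedShuffleSplit,
                                     train_test_split)
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

def fit_and_score(X_tr, X_va, y_tr, y_va):
    """Fit the scaler on the training part only, then score on validation."""
    scaler = StandardScaler().fit(X_tr)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_tr), y_tr)
    return model.score(scaler.transform(X_va), y_va)

# 1. Holdout: a single stratified split
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
holdout_score = fit_and_score(X_tr, X_va, y_tr, y_va)

# 2. Repeated holdout: several independent random splits
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
repeated_scores = [fit_and_score(X[tr], X[va], y[tr], y[va])
                   for tr, va in sss.split(X, y)]

# 3. Cross-validation: every row is used for validation exactly once
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = [fit_and_score(X[tr], X[va], y[tr], y[va])
             for tr, va in skf.split(X, y)]

print(holdout_score, np.mean(repeated_scores), np.mean(cv_scores))
```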

### 3.1. Data Preparation

In [305]:
# Drop the 'marrital_status' and 'region' columns from the dataset
obesity_train = obesity_train_raw.drop(columns=['marrital_status', 'region'])
obesity_test = obesity_test_raw.drop(columns=['marrital_status', 'region'])

In [306]:
obesity_train.set_index('id', inplace=True)
obesity_test.set_index('id', inplace=True)
obesity_train

id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
1,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
2,23.0,Frequently,no,up to 5,Sometimes,Male,1.80,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
3,,Frequently,no,up to 2,Sometimes,Male,1.80,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,1.0,no,no,,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,3.0,no,no,5 or more,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1607,21.0,Sometimes,,up to 5,Sometimes,Female,1.73,3.0,no,yes,3 to 4,1.0,no,Public,Always,1 to 2,131.0,Obesity_Type_III
1608,22.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,,Always,1 to 2,134.0,Obesity_Type_III
1609,23.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,Public,Always,1 to 2,134.0,Obesity_Type_III
1610,24.0,Sometimes,yes,up to 5,Sometimes,Female,1.74,3.0,no,yes,1 to 2,0.0,no,Public,Always,more than 2,133.0,Obesity_Type_III


In [307]:
# Select outliers: age out of scope, or obesity classification suspiciously low for the given weight
outliers = obesity_train[
    ((obesity_train['age'] < 16) & ~(obesity_train['age'].isna())) |
    ((obesity_train['age'] > 56) & ~(obesity_train['age'].isna())) |
    ((obesity_train['weight'] > 167) & ~(obesity_train['weight'].isna()))
]
obesity_train.drop(outliers.index, inplace=True)
obesity_train.reset_index(drop=True, inplace=True)

In [308]:
obesity_train.shape # Shape adds up to our expectation (6 rows deleted) 1611 -> 1605 rows

(1605, 18)
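
A side note on the filter above: comparisons against `NaN` already evaluate to `False` in pandas, so the explicit `~isna()` guards are not strictly needed. A minimal sketch on made-up values (only the thresholds are taken from the cell above):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the thresholds come from the notebook
toy = pd.DataFrame({
    "age": [15.0, 30.0, np.nan, 60.0, 40.0],
    "weight": [50.0, 170.0, 80.0, 70.0, np.nan],
})

# NaN < 16, NaN > 56 and NaN > 167 are all False, so rows with missing
# values are never flagged as outliers by these comparisons
is_outlier = (toy["age"] < 16) | (toy["age"] > 56) | (toy["weight"] > 167)

cleaned = toy[~is_outlier].reset_index(drop=True)
print(cleaned)
```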

# Encoding categorical data

In [309]:
categorical_columns = obesity_train.select_dtypes(include='object').columns
numerical_columns = obesity_train.select_dtypes(exclude='object').columns

In [310]:
obesity_train.columns

Index(['age', 'alcohol_freq', 'caloric_freq', 'devices_perday',
       'eat_between_meals', 'gender', 'height', 'meals_perday',
       'monitor_calories', 'parent_overweight', 'physical_activity_perweek',
       'siblings', 'smoke', 'transportation', 'veggies_freq', 'water_daily',
       'weight', 'obese_level'],
      dtype='object')

In [311]:
# First pass at null handling for physical_activity_perweek. Will be revisited in later iterations of the model

obesity_train['physical_activity_perweek'] = obesity_train['physical_activity_perweek'].fillna(0) # ASSUMPTION: There is no 0 value in the scope. We assume nulls are the people who don't work out

In [312]:
hashmap = {
"Never": 0,
"Sometimes": 1,
"Frequently": 2,
"Always": 3,

"No Activity": 0,
"up to 2": 1,
"up to 5": 2,
"more than 5": 3,

"less than 1": 1,
"1 to 2": 2,
"more than 2": 3,
"3 to 4": 4,
"5 or more": 5,

"Bicycle": 1,
"Car": 3,
"Motorbike": 3,
"Public": 2,
"Walk": 0,

"no": 0,
"yes": 1,

"Male": 0,
"Female": 1
}

In [313]:
# Manually encode data

columns = ['alcohol_freq',
 'caloric_freq',
 'devices_perday',
 'eat_between_meals',
 'gender',
 'monitor_calories',
 'parent_overweight',
 'physical_activity_perweek',
 'smoke',
 'transportation',
 'veggies_freq',
 'water_daily',
 'meals_perday',
 "siblings"]

for target in columns:
    obesity_train[target] = obesity_train[target].replace(hashmap)
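
Because a single shared dictionary is applied to every column, a label such as `"1 to 2"` receives the same code wherever it appears. One alternative, sketched here on made-up data, is to keep one mapping per column and use `Series.map`. The per-column codes and the name `column_maps` are illustrative assumptions, not the values used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "water_daily": ["less than 1", "1 to 2", "more than 2", np.nan],
    "physical_activity_perweek": ["1 to 2", "3 to 4", "5 or more", "1 to 2"],
})

# One ordinal mapping per column (hypothetical codes)
column_maps = {
    "water_daily": {"less than 1": 0, "1 to 2": 1, "more than 2": 2},
    "physical_activity_perweek": {"1 to 2": 1, "3 to 4": 2, "5 or more": 3},
}

encoded = toy.copy()
for column, mapping in column_maps.items():
    # map() keeps missing values as NaN and turns unknown labels into NaN,
    # which makes unexpected categories easy to spot
    encoded[column] = toy[column].map(mapping)

print(encoded)
```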

In [314]:
obesity_train_encoded = obesity_train.copy()
obesity_train_encoded.isna().sum()

age                          65
alcohol_freq                 36
caloric_freq                 20
devices_perday               21
eat_between_meals            59
gender                       20
height                       13
meals_perday                  9
monitor_calories             39
parent_overweight            20
physical_activity_perweek     0
siblings                     12
smoke                        12
transportation               40
veggies_freq                 26
water_daily                  34
weight                       53
obese_level                   0
dtype: int64

# Handling Missing Values

In [315]:
obesity_train_encoded.head(3)

Unnamed: 0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
0,21.0,0.0,0.0,2.0,1.0,1.0,1.62,3.0,0.0,1.0,0,3.0,0.0,2.0,1.0,2.0,64.0,Normal_Weight
1,23.0,2.0,0.0,2.0,1.0,0.0,1.8,3.0,0.0,1.0,4,0.0,0.0,2.0,1.0,2.0,77.0,Normal_Weight
2,,2.0,0.0,1.0,1.0,0.0,1.8,3.0,0.0,0.0,4,2.0,0.0,0.0,3.0,2.0,87.0,Overweight_Level_I


In [316]:
from sklearn.model_selection import train_test_split

X = obesity_train_encoded.drop(columns='obese_level')
y = obesity_train_encoded[['obese_level']]

In [317]:
X.isna().sum()

age                          65
alcohol_freq                 36
caloric_freq                 20
devices_perday               21
eat_between_meals            59
gender                       20
height                       13
meals_perday                  9
monitor_calories             39
parent_overweight            20
physical_activity_perweek     0
siblings                     12
smoke                        12
transportation               40
veggies_freq                 26
water_daily                  34
weight                       53
dtype: int64

In [318]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Create copies for KNN imputation
data_knnimputer_train = X.copy()

# Specify the columns to scale
numerical_columns_features = numerical_columns

# Scale the data for KNN imputation
scaler = StandardScaler()
data_knnimputer_train[numerical_columns_features] = scaler.fit_transform(data_knnimputer_train[numerical_columns_features])

# Perform KNN imputation
knn_imputer = KNNImputer(n_neighbors=5, weights='uniform')
data_knnimputer_train[numerical_columns_features] = knn_imputer.fit_transform(data_knnimputer_train[numerical_columns_features])

# Inverse transform to original scale
data_knnimputer_train[numerical_columns_features] = scaler.inverse_transform(data_knnimputer_train[numerical_columns_features])

# Check for any remaining missing values
print(data_knnimputer_train.isna().sum())

age                           0
alcohol_freq                 36
caloric_freq                 20
devices_perday               21
eat_between_meals            59
gender                       20
height                        0
meals_perday                  0
monitor_calories             39
parent_overweight            20
physical_activity_perweek     0
siblings                      0
smoke                        12
transportation               40
veggies_freq                 26
water_daily                  34
weight                        0
dtype: int64
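
If a validation split is introduced later, the scaler and imputer would need to be fitted on the training rows only and then applied to the validation rows. A minimal sketch using a scikit-learn `Pipeline` on made-up numbers (the split and the values are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical [age, height, weight] rows with gaps
train = np.array([[21.0, 1.62, 64.0],
                  [23.0, 1.80, 77.0],
                  [np.nan, 1.80, 87.0],
                  [22.0, 1.78, np.nan],
                  [24.0, 1.64, 53.0]])
valid = np.array([[np.nan, 1.70, 70.0]])

# StandardScaler ignores NaN when fitting and keeps it when transforming,
# so it can run before the imputer
prep = Pipeline([
    ("scale", StandardScaler()),
    ("impute", KNNImputer(n_neighbors=2, weights="uniform")),
])

train_scaled = prep.fit_transform(train)   # statistics learned from train only
valid_scaled = prep.transform(valid)       # reused unchanged on validation

# Back to the original units
train_filled = prep.named_steps["scale"].inverse_transform(train_scaled)
valid_filled = prep.named_steps["scale"].inverse_transform(valid_scaled)
print(valid_filled)
```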


Add BMI

In [319]:
def classify_bmi_comprehensive(row):
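    """
    Classify BMI based on age and BMI value.

    Input:
    row: A Pandas row with 'weight', 'height', and 'age' columns.

    Output:
    Returns an integer BMI class (0 = underweight; higher values mean higher BMI),
    the string 'Invalid data' for non-positive height or weight, or None when
    age is missing or outside the 2-64 range.
    """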
    # Check if weight and height are valid
    if row['height'] <= 0 or row['weight'] <= 0:
        return 'Invalid data'

    # Calculate BMI
    bmi = row['weight'] / (row['height'] ** 2)

    # Age group: Children (2-19 years)
    if 2 <= row['age'] < 20:
        if bmi < 14:
            return 0 # Underweight
        elif 14 <= bmi < 18:
            return 1 # Normal weight
        elif 18 <= bmi < 21:
            return 2 # Overweight
        else:
            return 3 # Obesity 1

    # Age group: Adults (20-64 years)
    elif 20 <= row['age'] < 65:
        if bmi < 18.5:
            return 0 # "Underweight"
        elif 18.5 <= bmi < 25:
            return 1 # "Healthy Weight"
        elif 25 <= bmi < 30:
            return 2 #"Overweight"
        elif 30<= bmi < 35:
            return 3 #"Obese Class 1"
        elif 35 <= bmi < 40:
            return 4 #"Obese Class 2"
        else:
            return 5 #"Obese Class 3"
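
For the adult branch, the same thresholds can be applied to a whole column at once with `pd.cut`. This is a sketch of an alternative, not the method used in this notebook; it ignores the under-20 branch and runs on made-up rows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"height": [1.62, 1.80, 1.80, 1.75],
                    "weight": [64.0, 77.0, 87.0, 134.0]})

bmi = toy["weight"] / toy["height"] ** 2

# Left-closed bins matching the adult thresholds used above:
# <18.5 -> 0, [18.5, 25) -> 1, [25, 30) -> 2, [30, 35) -> 3, [35, 40) -> 4, >=40 -> 5
adult_bins = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
toy["bmi_class"] = pd.cut(bmi, bins=adult_bins, right=False, labels=False)

print(toy)
```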

In [320]:
data_knnimputer_train['bmi_class'] = data_knnimputer_train.apply(lambda row: classify_bmi_comprehensive(row), axis=1)

# Manually encode the target (obese_level)

In [321]:
hash_obesity = {
 'Normal_Weight': 1,
 'Overweight_Level_I': 2,
 'Overweight_Level_II': 3,
 'Obesity_Type_I': 4,
 'Insufficient_Weight': 5,
 'Obesity_Type_II': 6,
 'Obesity_Type_III': 7
 }

y = y['obese_level'].replace(hash_obesity)
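
Since the model is trained on these integer codes, predictions need to be mapped back to class names before they are reported. A minimal sketch, where the `predictions` series is made up for illustration and `inverse_hash_obesity` is a name introduced here:

```python
import pandas as pd

hash_obesity = {
    'Normal_Weight': 1,
    'Overweight_Level_I': 2,
    'Overweight_Level_II': 3,
    'Obesity_Type_I': 4,
    'Insufficient_Weight': 5,
    'Obesity_Type_II': 6,
    'Obesity_Type_III': 7,
}

# Invert the mapping: integer code -> original class name
inverse_hash_obesity = {code: name for name, code in hash_obesity.items()}

predictions = pd.Series([1, 7, 5])  # hypothetical model output
predicted_labels = predictions.map(inverse_hash_obesity)
print(predicted_labels.tolist())
```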

Impute categorical

In [322]:
data_knnimputer_train.head(2)

Unnamed: 0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,bmi_class
0,21.0,0.0,0.0,2.0,1.0,1.0,1.62,3.0,0.0,1.0,0,3.0,0.0,2.0,1.0,2.0,64.0,1
1,23.0,2.0,0.0,2.0,1.0,0.0,1.8,3.0,0.0,1.0,4,0.0,0.0,2.0,1.0,2.0,77.0,1


In [323]:
categorical_columns = categorical_columns.drop('obese_level')

In [324]:
from sklearn.experimental import enable_iterative_imputer  # Enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

# Initialize IterativeImputer with KNeighborsClassifier for categorical data imputation
iterative_imputer = IterativeImputer(estimator=KNeighborsClassifier(n_neighbors=5), max_iter=10, random_state=42, 
                                     skip_complete=True)

# Perform imputation on the encoded categorical data
data_knnimputer_train_imputed = iterative_imputer.fit_transform(data_knnimputer_train)

# Convert back to DataFrame and assign original column names
data_knnimputer_train = pd.DataFrame(data_knnimputer_train_imputed, columns=data_knnimputer_train.columns)

# Check for any remaining missing values
print(data_knnimputer_train[categorical_columns].isna().sum())

alcohol_freq                 0
caloric_freq                 0
devices_perday               0
eat_between_meals            0
gender                       0
monitor_calories             0
parent_overweight            0
physical_activity_perweek    0
smoke                        0
transportation               0
veggies_freq                 0
water_daily                  0
dtype: int64


In [325]:
encoded_train = data_knnimputer_train.copy()

In [326]:
# Transform to life score
life_columns = [
 'alcohol_freq',
 'caloric_freq',
 'devices_perday',
 'eat_between_meals',
 'monitor_calories',
 'physical_activity_perweek',
 'smoke',
 'transportation',
 'veggies_freq',
 'water_daily',
 ]

encoded_train["life"] = 0

for column in life_columns:
    encoded_train["life"] += encoded_train[column]


### 3.3. Scaling

# Value scaling - scaling is ultimately done within each cross-validation fold, so no prior scaling is needed

from sklearn.preprocessing import StandardScaler

scale_age = StandardScaler().fit(encoded_train[["age"]])
scale_height = StandardScaler().fit(encoded_train[["height"]])
scale_weight = StandardScaler().fit(encoded_train[["weight"]]) # Statistical analysis justifies the need to use RobustScaler on this one

dfs = [encoded_train] # Dataframes to transform (only the training set is listed here)
for df in dfs:
    new_age = scale_age.transform(df[["age"]])
    new_height = scale_height.transform(df[["height"]])
    new_weight = scale_weight.transform(df[["weight"]])

    # Replace columns
    df["age"] = new_age
    df["height"] = new_height
    df["weight"] = new_weight

from sklearn.preprocessing import StandardScaler

columns_to_scale = encoded_train.columns
scaler = StandardScaler()

encoded_train[columns_to_scale] = scaler.fit_transform(encoded_train[columns_to_scale])

### 3.4. Feature Selection

In [327]:
obesity_train_preproc = encoded_train.copy()

In [328]:
obesity_train_preproc.head(3)

Unnamed: 0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,bmi_class,life
0,21.0,0.0,0.0,2.0,1.0,1.0,1.62,3.0,0.0,1.0,0.0,3.0,0.0,2.0,1.0,2.0,64.0,1.0,8.0
1,23.0,2.0,0.0,2.0,1.0,0.0,1.8,3.0,0.0,1.0,4.0,0.0,0.0,2.0,1.0,2.0,77.0,1.0,14.0
2,22.2,2.0,0.0,1.0,1.0,0.0,1.8,3.0,0.0,0.0,4.0,2.0,0.0,0.0,3.0,2.0,87.0,2.0,13.0


In [329]:
X_train = obesity_train_preproc.copy()
y_train = y.copy()

In [330]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, f1_score
from sklearn.preprocessing import StandardScaler

# Initialize the Random Forest model
tree = RandomForestClassifier(random_state=42)

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Perform Stratified K-Fold Cross Validation
f1_scores = []

for train_index, val_index in skf.split(X_train, y_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
    # Initialize the scaler
    scaler = StandardScaler()
    
    # Fit the scaler on the training fold and transform both training and validation folds
    X_train_fold = scaler.fit_transform(X_train_fold)
    X_val_fold = scaler.transform(X_val_fold)
    
    # Fit the model on the training fold
    tree.fit(X_train_fold, y_train_fold)
    
    # Predict on the validation fold
    y_pred_fold = tree.predict(X_val_fold)
    
    # Calculate F1 score
    f1 = f1_score(y_val_fold, y_pred_fold, average='macro')
    f1_scores.append(f1)

# Print the average F1 score
print(f"Average F1 Score: {np.mean(f1_scores)}")

Average F1 Score: 0.9402411494955224
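
The manual loop above can also be written with a `Pipeline` and `cross_val_score`, which refits the scaler inside each training fold automatically. The sketch below uses synthetic data and a small forest to keep it fast; those choices are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic multiclass data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42)

# Scaler and model are refitted together on each training fold
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=skf, scoring="f1_macro")
print(f"Average F1 Score: {scores.mean():.3f}")
```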


In [331]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold

# Initialize the Gradient Boosting model
tree_gb = GradientBoostingClassifier(random_state=42)

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Perform Stratified K-Fold Cross Validation
f1_scores = []

for train_index, val_index in skf.split(X_train, y_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
    # Initialize the scaler
    scaler = StandardScaler()
    
    # Fit the scaler on the training fold and transform both training and validation folds
    X_train_fold = scaler.fit_transform(X_train_fold)
    X_val_fold = scaler.transform(X_val_fold)

    # Fit the model on the training fold
    tree_gb.fit(X_train_fold, y_train_fold)
    
    # Predict on the validation fold
    y_pred_fold = tree_gb.predict(X_val_fold)
    
    # Calculate F1 score
    f1 = f1_score(y_val_fold, y_pred_fold, average='macro')
    f1_scores.append(f1)

# Print the average F1 score
print(f"Average F1 Score: {np.mean(f1_scores)}")

Average F1 Score: 0.9446908138758398


# RFE feature selection

In [None]:
importances = tree_gb.feature_importances_

importances_df = pd.DataFrame(importances, index=X_train.columns, columns=['importance'])
importances_df = importances_df.sort_values('importance', ascending=False)
importances_df

In [None]:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

nof_list = np.arange(1, len(X_train.columns) + 1)
high_score = float('-inf')  # Higher is better for F1 score
nof = 0  # Variable to store the optimum number of features
score_list = []

# Store the best selected features for display
best_features = None

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for n in tqdm(nof_list):
    fold_scores = []
    for train_index, val_index in skf.split(X_train, y_train):
        X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
        y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
        
        # Initialize the scaler
        scaler = StandardScaler()
        
        # Fit the scaler on the training fold and transform both training and validation folds
        X_train_fold = scaler.fit_transform(X_train_fold)
        X_val_fold = scaler.transform(X_val_fold)
        
        model = RandomForestClassifier(random_state=42)
        rfe = RFE(model, n_features_to_select=n)
        X_train_rfe = rfe.fit_transform(X_train_fold, y_train_fold)
        X_val_rfe = rfe.transform(X_val_fold)

        model.fit(X_train_rfe, y_train_fold)
        y_pred = model.predict(X_val_rfe)
        
        # Use F1 score with 'macro' average for multiclass classification
        score = f1_score(y_val_fold, y_pred, average='macro')
        fold_scores.append(score)
    
    avg_score = np.mean(fold_scores)
    score_list.append(avg_score)
    
    # Check if this number of features gives a better score
    if avg_score > high_score:  # Higher is better for F1 score
        high_score = avg_score
        nof = n
        best_features = rfe.get_support()  # Mask of selected features (from the last fold's RFE fit)

# Display the best results
print("Optimum number of features: %d" % nof)
print("Highest F1 score with %d features: %f" % (nof, high_score))

# Convert columns to a numpy array before using the boolean mask
selected_feature_names = X_train.columns.to_numpy()[best_features]
print("Selected features with %d features:" % nof)
print(selected_feature_names)

100%|██████████| 19/19 [07:26<00:00, 23.48s/it]

Optimum number of features: 17
Highest F1 score with 17 features: 0.942982
Selected features with 17 features:
['age' 'alcohol_freq' 'caloric_freq' 'devices_perday' 'eat_between_meals'
 'gender' 'height' 'meals_perday' 'parent_overweight'
 'physical_activity_perweek' 'siblings' 'transportation' 'veggies_freq'
 'water_daily' 'weight' 'bmi_class' 'life']
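
scikit-learn also provides `RFECV`, which combines recursive elimination with cross-validation to pick the number of features in a single fit. The sketch below runs on synthetic data with a decision tree to keep it quick; it is an alternative to the loop above, not a reproduction of it.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42)

selector = RFECV(
    estimator=DecisionTreeClassifier(random_state=42),
    step=1,                      # drop one feature per elimination round
    min_features_to_select=1,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1_macro",
)
selector.fit(X_demo, y_demo)

print("Optimum number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```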





In [292]:
X_train.columns

Index(['age', 'alcohol_freq', 'caloric_freq', 'devices_perday',
       'eat_between_meals', 'gender', 'height', 'meals_perday',
       'monitor_calories', 'parent_overweight', 'physical_activity_perweek',
       'siblings', 'smoke', 'transportation', 'veggies_freq', 'water_daily',
       'weight', 'bmi_class', 'life'],
      dtype='object')

In [336]:
len(X_train.columns)

19

# Exhaustive feature selection
This method evaluates every possible feature subset, so it takes a very long time to run, but it is guaranteed to find the best-scoring subset for the chosen estimator and metric.
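
To get a feel for the cost, the number of subsets can be counted directly. The helper `n_subsets` is a hypothetical name introduced here; 14 is the number of non-mandatory features in this notebook, 19 is the total number of columns in `X_train`, and 5 is the number of folds used by the selector.

```python
from math import comb

def n_subsets(n_features, min_size, max_size):
    """Number of feature subsets with size between min_size and max_size."""
    return sum(comb(n_features, k) for k in range(min_size, max_size + 1))

n_folds = 5

# Searching only the 14 optional features, all subset sizes
optional_only = n_subsets(14, 1, 14)

# Searching all 19 columns with subset sizes 1..14
all_columns = n_subsets(19, 1, 14)

print(optional_only, optional_only * n_folds)   # subsets, model fits
print(all_columns, all_columns * n_folds)
```

Each subset requires one model fit per fold, which is why the search grows impractical so quickly as columns are added.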

In [None]:
#!pip install mlxtend



In [None]:
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

# List of must-have features
must_have_features = ['age', 'gender', 'height', 'weight', 'bmi_class']

# Separate the features for selection from the must-have features
other_features = [col for col in X_train.columns if col not in must_have_features]

# Initialize the model
model = GradientBoostingClassifier(random_state=42)

# Set up the exhaustive feature selector.
# Note: as written, the search runs over subsets of ALL columns passed to fit();
# the must-have features are not held fixed during the search and are only
# added back to the result afterwards.
efs = ExhaustiveFeatureSelector(
    estimator=model,
    min_features=1,  # Smallest subset size to evaluate
    max_features=len(other_features),  # Largest subset size to evaluate
    scoring='f1_macro',
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1  # Use all available CPU cores
)

# No scaling is applied in this cell (tree-based models do not depend on feature scale).
# The tqdm bar is only updated once fit() returns, so it does not show live progress.
with tqdm(total=len(other_features)) as progress_bar:
    # Fit the feature selector on the dataset that includes both must-have and other features
    efs = efs.fit(X_train[other_features + must_have_features], y_train)
    progress_bar.update(len(other_features))

# Get the best feature subset found by the selector
selected_features = list(efs.best_feature_names_)

# Ensure that must-have features are included in the final feature set
final_features = list(set(selected_features + must_have_features))

  0%|          | 0/14 [00:00<?, ?it/s]

In [None]:
# Perform cross-validation on the final selected features with scaling in each fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
f1_scores = []

for train_index, val_index in skf.split(X_train[final_features], y_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index][final_features], X_train.iloc[val_index][final_features]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]
    
    # Initialize and fit the scaler only on the training data within each fold to avoid leakage
    scaler = StandardScaler()
    X_train_fold = scaler.fit_transform(X_train_fold)
    X_val_fold = scaler.transform(X_val_fold)
    
    # Train the model
    model.fit(X_train_fold, y_train_fold)
    
    # Make predictions and calculate the F1 score
    y_pred_fold = model.predict(X_val_fold)
    f1 = f1_score(y_val_fold, y_pred_fold, average='macro')
    f1_scores.append(f1)

# Print the results
print(f"Average F1 Score after final evaluation: {np.mean(f1_scores)}")
print("Selected features including must-haves:", final_features)
print("Best F1 macro score from feature selection:", efs.best_score_)


  0%|          | 0/14 [01:50<?, ?it/s]


KeyboardInterrupt: 

# VIF

In [289]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each feature
def calculate_vif(df):
    vif_data = pd.DataFrame()
    vif_data["feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

# Calculate VIF for X_train
vif_scores = calculate_vif(X_train)
print(vif_scores)

                      feature       VIF
0                         age  1.629907
1                alcohol_freq       inf
2                caloric_freq       inf
3              devices_perday       inf
4           eat_between_meals       inf
5                      gender  1.979336
6                      height  3.566593
7                meals_perday  1.107993
8            monitor_calories       inf
9           parent_overweight  1.449647
10  physical_activity_perweek       inf
11                   siblings  1.010405
12                      smoke       inf
13             transportation       inf
14               veggies_freq       inf
15                water_daily       inf
16                     weight  8.128276
17                  bmi_class  5.886715
18                       life       inf
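
The `inf` values are what one would expect when a column is an exact linear combination of others, as is the case for `life`, which is the sum of the lifestyle columns. A small NumPy sketch on random data illustrates this; `vif` is a simplified helper written for this example (with an intercept), not the statsmodels function.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    # 1 / (1 - R^2) equals var(y) / var(residual); a zero residual gives inf
    with np.errstate(divide="ignore"):
        return y.var() / residual.var()

rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 200))

independent = np.column_stack([a, b, c])
with_sum = np.column_stack([a, b, c, a + b])   # last column is an exact sum

print([vif(independent, i) for i in range(3)])
print([vif(with_sum, i) for i in range(4)])
```

In the second matrix the columns involved in the sum get extremely large (or infinite) values, while the unrelated column stays close to 1.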


In [290]:
best_selected_features_df = X_train[selected_feature_names] # RFE dropped the 'monitor_calories' and 'smoke' columns

# Hyperparameter tuning Random Forest

In [None]:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit the encoder on y_train and transform y_train
y_train_encoded = label_encoder.fit_transform(y_train)

# Define the objective function for Optuna
def objective(trial):
    # Suggest values for hyperparameters
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    max_depth = trial.suggest_int('max_depth', 5, 32)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5)
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])
    
    # Initialize RandomForestClassifier with suggested hyperparameters
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        bootstrap=bootstrap,
        random_state=42
    )
    
    # Initialize StratifiedKFold
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    
    # Use cross-validation with F1 score for classification
    f1 = cross_val_score(model, best_selected_features_df, y_train_encoded, cv=skf, scoring='f1_macro', n_jobs=-1)
    return f1.mean()

# Set up Optuna study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

# Retrieve the best parameters
best_params = study.best_params
print("Best Parameters:", best_params)

# Train model with best parameters
best_rf = RandomForestClassifier(**best_params, random_state=42)
best_rf.fit(best_selected_features_df, y_train_encoded)

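# Evaluate on the training set (optimistic: the model has already seen these rows,
# so compare this value with the cross-validated F1 reported by the study)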
print('Train F1 score:', f1_score(y_train_encoded, best_rf.predict(best_selected_features_df), average='macro'))


[I 2024-11-05 16:46:53,247] A new study created in memory with name: no-name-16aa8d07-dc48-45fa-8d6e-e7f4a29e14b7
[I 2024-11-05 16:47:00,184] Trial 0 finished with value: 0.935580779812474 and parameters: {'n_estimators': 498, 'max_depth': 29, 'min_samples_split': 2, 'min_samples_leaf': 3, 'bootstrap': True}. Best is trial 0 with value: 0.935580779812474.
[I 2024-11-05 16:47:03,634] Trial 1 finished with value: 0.9437339858520464 and parameters: {'n_estimators': 409, 'max_depth': 32, 'min_samples_split': 5, 'min_samples_leaf': 2, 'bootstrap': False}. Best is trial 1 with value: 0.9437339858520464.
[I 2024-11-05 16:47:05,296] Trial 2 finished with value: 0.9383097297746525 and parameters: {'n_estimators': 437, 'max_depth': 10, 'min_samples_split': 10, 'min_samples_leaf': 1, 'bootstrap': False}. Best is trial 1 with value: 0.9437339858520464.
[I 2024-11-05 16:47:06,317] Trial 3 finished with value: 0.937402914578295 and parameters: {'n_estimators': 233, 'max_depth': 21, 'min_samples_spli

Best Parameters: {'n_estimators': 470, 'max_depth': 14, 'min_samples_split': 4, 'min_samples_leaf': 1, 'bootstrap': False}
Train F1 score: 0.9993578286816801


# Hyperparameter tuning GBC

In [291]:
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit the encoder on y_train and transform y_train
y_train_encoded = label_encoder.fit_transform(y_train)

# Define the objective function for Optuna
def objective(trial):
    # Suggest values for hyperparameters
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    max_depth = trial.suggest_int('max_depth', 3, 10)
    learning_rate = trial.suggest_float('learning_rate', 0.001, 0.3)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    
    # Initialize GradientBoostingClassifier with suggested hyperparameters
    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        subsample=subsample,
        random_state=42
    )
    
    # Initialize StratifiedKFold
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    
    # Use cross-validation with F1 score for classification
    f1 = cross_val_score(model, best_selected_features_df, y_train_encoded, cv=skf, scoring='f1_macro', n_jobs=-1)
    return f1.mean()

# Set up Optuna study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

# Retrieve the best parameters
best_params = study.best_params
print("Best Parameters:", best_params)

# Train model with best parameters
best_gb = GradientBoostingClassifier(**best_params, random_state=42)
best_gb.fit(best_selected_features_df, y_train_encoded)

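# Evaluate on the training set (optimistic: the model has already seen these rows,
# so compare this value with the cross-validated F1 reported by the study)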
print('Train F1 score:', f1_score(y_train_encoded, best_gb.predict(best_selected_features_df), average='macro'))

[I 2024-11-05 20:33:01,551] A new study created in memory with name: no-name-31cefac8-9da3-4c0d-832a-e2e1b1afa7be
[I 2024-11-05 20:33:13,653] Trial 0 finished with value: 0.9404468929725944 and parameters: {'n_estimators': 202, 'max_depth': 4, 'learning_rate': 0.05683514780729142, 'min_samples_split': 5, 'min_samples_leaf': 3, 'subsample': 0.9953730884345451}. Best is trial 0 with value: 0.9404468929725944.
[I 2024-11-05 20:33:31,696] Trial 1 finished with value: 0.9503314554079815 and parameters: {'n_estimators': 385, 'max_depth': 6, 'learning_rate': 0.08965329806897641, 'min_samples_split': 6, 'min_samples_leaf': 3, 'subsample': 0.9930683042967535}. Best is trial 1 with value: 0.9503314554079815.
[I 2024-11-05 20:33:53,122] Trial 2 finished with value: 0.9494761194976329 and parameters: {'n_estimators': 358, 'max_depth': 9, 'learning_rate': 0.0990952271899798, 'min_samples_split': 4, 'min_samples_leaf': 2, 'subsample': 0.6025110263081686}. Best is trial 1 with value: 0.95033145540798

KeyboardInterrupt: 

<a class="anchor" id="">

# 4 & 5. Model & Assess (Modelling and Assessment)

</a>

<img src="image/step4.png" style="height:60px">

### 4.1. Model Selection

In this section you should take the time to train different predictive algorithms on the data that reached this stage and **use the appropriate model assessment metrics to decide which model best addresses your problem**.

**You are expected to present in your report the performance of the different algorithms that you tested and to discuss what informed your choice of a specific algorithm.**
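
As a rough illustration of such a comparison, the sketch below scores three candidate classifiers with the same stratified folds and the same metric (macro F1, as used in the tuning cells of this notebook). The data is a synthetic stand-in produced by `make_classification`, and the names `X_demo`, `y_demo`, `candidates` and `cv_results` are assumptions made for this example only; in your project you would plug in your own preprocessed features and target.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data (assumption, not the project data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Candidate algorithms to compare (illustrative settings, not tuned)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

# Use the same folds for every model so the scores are comparable
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=skf, scoring="f1_macro")
    cv_results[name] = (scores.mean(), scores.std())

# Print the candidates from best to worst mean macro F1
for name, (mean_f1, std_f1) in sorted(
    cv_results.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{name:<20} macro-F1 = {mean_f1:.3f} +/- {std_f1:.3f}")
```

Reporting the standard deviation next to the mean helps you judge whether the gap between two algorithms is larger than the fold-to-fold variation.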

### 4.2. Model Optimization

After selecting the best algorithm (or set of algorithms), you can try to improve your model's performance by tuning the algorithm's hyper-parameters and selecting the values that result in the best overall performance.

Possible ways of doing this include:
1. [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
2. [RandomSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

**While you are not required to show the results of every combination of hyperparameters that you tried, you should at least discuss which combinations were explored and which of them resulted in your best performance.**
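
The two search strategies listed above could be set up roughly as follows. This is a minimal sketch on synthetic data: the search spaces, the small number of folds and the variable names (`X_demo`, `y_demo`, `grid`, `rand`) are assumptions chosen to keep the example fast, not recommended values for the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the preprocessed training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=42
)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Grid search: evaluates every combination in param_grid (2 x 2 = 4 candidates)
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=42),
    param_grid={"max_depth": [3, 6], "min_samples_leaf": [1, 3]},
    scoring="f1_macro",
    cv=skf,
)
grid.fit(X_demo, y_demo)
print("Grid search best params:", grid.best_params_)
print("Grid search best CV macro-F1:", round(grid.best_score_, 3))

# Random search: samples n_iter candidates from the given distributions
rand = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": randint(20, 60),
        "max_depth": randint(3, 10),
    },
    n_iter=5,
    scoring="f1_macro",
    cv=skf,
    random_state=42,
)
rand.fit(X_demo, y_demo)
print("Random search best params:", rand.best_params_)
print("Random search best CV macro-F1:", round(rand.best_score_, 3))
```

Both objects expose `cv_results_`, which holds the score of every candidate tried and can support the discussion of which combinations were explored.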

<a class="anchor" id="">

# 5. Deploy

</a>

<img src="image/step5.png" style="height:60px">

### 5.0. Training a final model

You used the previous modelling and assessment steps to determine the best strategies you could find for preprocessing, scaling, feature selection, algorithm and hyper-parameters.

**By this stage, all of those choices have already been made**. For that reason, a split between training and validation is no longer necessary. **A good practice** is to take the initial data and train a final model with all of the labeled data that you have available.

**Everything is figured out by this stage**, so, as a first step, all you need to do is replicate the exact preprocessing, scaling and feature selection decisions you made before.<br>
When it comes to the final model, all you have to do is create a new instance of your best algorithm with the best parameters that you found (no need to try all algorithms and hyper-parameters again).
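
A minimal sketch of this step is shown below, assuming a scaler followed by a Random Forest. The data is synthetic, and `final_params` holds hypothetical values standing in for whatever your own tuning selected. Wrapping the steps in a `Pipeline` is one way to make sure the same preprocessing is refit on all labeled rows together with the model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for ALL labeled data (train + validation together); assumption
X_all, y_all = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical hyper-parameters; replace them with the ones your tuning selected
final_params = {"n_estimators": 50, "max_depth": 8, "min_samples_leaf": 1}

# Same preprocessing choices as before, refit on every labeled row
final_model = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(**final_params, random_state=42)),
    ]
)
final_model.fit(X_all, y_all)

print("Classes learned by the final model:", final_model.named_steps["clf"].classes_)
```

No score is computed here on purpose: once every labeled row is used for training, there is no held-out data left to give an honest estimate, so the performance you report should come from the earlier assessment step.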

### 5.1. Import and Transform your test data

Remember, the test data does not have the `outcome` variable.
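
The key point when transforming the test data is that every transformer is fitted on the training data only and then applied to the test data with `transform`. The sketch below illustrates this on two toy frames; the values are made up, only two feature columns are used, and the choice of `StandardScaler` plus `OrdinalEncoder` is an assumption for the example rather than a required preprocessing recipe.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames with made-up values (assumption); the test frame has no target column
train_df = pd.DataFrame(
    {
        "age": [21.0, 23.0, 30.0, 22.0],
        "gender": ["Female", "Male", "Male", "Female"],
        "obese_level": ["Normal_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II"],
    }
)
test_df = pd.DataFrame({"age": [25.0, 40.0], "gender": ["Male", "Other"]})

feature_cols = ["age", "gender"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        # Categories unseen during fit are mapped to -1 instead of raising an error
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["gender"]),
    ]
)

# Fit on the training features only, then reuse the fitted transformers on the test set
X_train_ready = preprocessor.fit_transform(train_df[feature_cols])
X_test_ready = preprocessor.transform(test_df[feature_cols])

print(X_test_ready)
```

Calling `fit_transform` on the test data instead would recompute means, scales and category codes from the test rows, so the model would receive features on a different scale from the one it was trained on.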

### 5.2. Obtain Predictions on the test data from your final model
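
If the target was label-encoded before training, as in the tuning cells of this notebook, the model returns integer codes that need to be mapped back to the original class names. The sketch below shows that round trip on random toy data; the arrays `X_train_demo`, `X_test_demo` and the small Random Forest are assumptions used only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Random toy features and string labels (assumption, not the project data)
rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(60, 4))
y_train_demo = np.array(
    ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_II"] * 20
)
X_test_demo = rng.normal(size=(5, 4))

# Encode the string labels as integers before fitting
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_demo)

model = RandomForestClassifier(n_estimators=20, random_state=42)
model.fit(X_train_demo, y_train_encoded)

# Predict integer codes, then map them back to the original class names
encoded_preds = model.predict(X_test_demo)
test_preds = label_encoder.inverse_transform(encoded_preds)

print(test_preds)
```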

### 5.3. Create a DataFrame containing the index of each row and its intended prediction and export it to a CSV file

Submit the CSV file to Kaggle to obtain your model's performance on the test data.
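
A submission file could be assembled along these lines. The identifiers and predictions below are made-up placeholders, and the column names `id` and `obese_level` are assumed from the training data; check the competition's sample submission for the exact format expected. The file is written to the system temporary folder only to keep the sketch self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up identifiers and predictions (assumption); use your test ids and model output
test_ids = [1001, 1002, 1003]
test_preds = ["Normal_Weight", "Overweight_Level_I", "Normal_Weight"]

submission = pd.DataFrame({"id": test_ids, "obese_level": test_preds})

# In the project you would typically write next to the notebook instead
output_path = Path(tempfile.gettempdir()) / "submission.csv"

# index=False keeps pandas' own row index out of the file
submission.to_csv(output_path, index=False)

# Read the file back as a quick sanity check before uploading
check = pd.read_csv(output_path)
print(check)
```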