<b><font size="6">Predictive Modelling Pipeline Template</font></b><br><br>

In this notebook we present to you the main steps you should follow throughout your project.


<b> Important: The numbered sections and subsections are merely indicative of some of the steps you should pay attention to in your project. <br>You are not required to strictly follow this order or to execute everything in separate cells.</b>
    
<img src="image/process_ML.png" style="height:70px">

In [71]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler
import warnings
warnings.filterwarnings('ignore')

<a class="anchor" id="">

# 1. Import data (Data Integration)

</a>


<img src="image/step1.png" style="height:60px">

In [72]:
# Load the data in a simple way
obesity_train_raw = pd.read_csv('../data/obesity_train.csv')
obesity_test_raw = pd.read_csv('../data/obesity_test.csv')

In [73]:
obesity_train_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,...,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
0,1,21.0,Never,no,up to 5,Sometimes,Female,1.62,,3.0,...,yes,,LatAm,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
1,2,23.0,Frequently,no,up to 5,Sometimes,Male,1.8,,3.0,...,yes,3 to 4,LatAm,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
2,3,,Frequently,no,up to 2,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
3,4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,,1.0,...,no,,LatAm,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
4,5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,,3.0,...,no,5 or more,LatAm,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
5,6,24.0,Frequently,yes,up to 5,Sometimes,Male,1.78,,3.0,...,yes,1 to 2,LatAm,2.0,no,Public,Always,1 to 2,64.0,Normal_Weight
6,7,21.0,Sometimes,yes,up to 5,Frequently,Female,1.72,,3.0,...,yes,3 to 4,,2.0,no,Public,Sometimes,1 to 2,80.0,Overweight_Level_II
7,8,22.0,Sometimes,no,up to 2,Sometimes,Male,1.65,,3.0,...,no,3 to 4,LatAm,1.0,no,Public,Always,more than 2,56.0,Normal_Weight
8,9,41.0,Frequently,yes,up to 5,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,0.0,no,Car,Sometimes,1 to 2,99.0,Obesity_Type_I
9,10,27.0,Sometimes,yes,up to 2,Sometimes,Male,1.93,,1.0,...,yes,1 to 2,LatAm,2.0,no,Public,Sometimes,less than 1,102.0,Overweight_Level_II


In [74]:
obesity_test_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight
0,1612,21.0,Sometimes,no,up to 2,Sometimes,Female,1.52,,3.0,yes,yes,5 or more,LatAm,3.0,yes,Public,Always,more than 2,56.0
1,1613,29.0,Sometimes,yes,up to 2,Sometimes,Male,1.62,,3.0,no,no,,LatAm,3.0,no,Car,Sometimes,1 to 2,53.0
2,1614,23.0,Sometimes,,up to 2,Sometimes,Female,1.5,,3.0,no,yes,1 to 2,LatAm,2.0,no,Motorbike,Always,1 to 2,
3,1615,22.0,Never,yes,up to 5,Sometimes,Male,1.72,,3.0,no,yes,1 to 2,LatAm,1.0,no,Public,Sometimes,1 to 2,68.0
4,1616,26.0,Sometimes,yes,more than 5,Frequently,Male,1.85,,3.0,no,yes,3 to 4,LatAm,1.0,no,Public,Always,more than 2,105.0
5,1617,23.0,Sometimes,yes,up to 5,Sometimes,Male,1.77,,1.0,no,yes,1 to 2,LatAm,2.0,no,Public,Always,less than 1,60.0
6,1618,22.0,Sometimes,no,up to 5,Always,Female,1.7,,3.0,yes,yes,3 to 4,LatAm,1.0,no,Public,Always,1 to 2,
7,1619,29.0,Never,yes,up to 2,Sometimes,Female,1.53,,1.0,no,no,,LatAm,0.0,no,Car,Sometimes,1 to 2,78.0
8,1620,30.0,Never,yes,up to 2,Frequently,Female,1.71,,4.0,no,yes,,LatAm,0.0,yes,Car,Always,less than 1,82.0
9,1621,23.0,Sometimes,yes,up to 5,Frequently,Female,1.6,,4.0,no,no,3 to 4,LatAm,3.0,no,Car,Sometimes,1 to 2,52.0


<a class="anchor" id="">

# 2. Explore data (Data access, exploration and understanding)

</a>

<img src="image/step2.png" style="height:60px">

Remember, this step is very important as it is at this stage that you will really look into the data that you have. Generally speaking, if you do well at this stage, the following stages will be very smooth.

Moreover, you should also take the time to find meaningful patterns on the data: what interesting relationships can be found between the variables and how can that knowledge be inform your future decisions.

In [75]:
# Display information about the training dataset
obesity_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1611 non-null   int64  
 1   age                        1545 non-null   float64
 2   alcohol_freq               1575 non-null   object 
 3   caloric_freq               1591 non-null   object 
 4   devices_perday             1589 non-null   object 
 5   eat_between_meals          1552 non-null   object 
 6   gender                     1591 non-null   object 
 7   height                     1597 non-null   float64
 8   marrital_status            0 non-null      float64
 9   meals_perday               1602 non-null   float64
 10  monitor_calories           1572 non-null   object 
 11  parent_overweight          1591 non-null   object 
 12  physical_activity_perweek  1046 non-null   object 
 13  region                     1544 non-null   objec

In [76]:
obesity_train_raw.age.sort_values(ascending=False).head(3)

1013    88.0
100     61.0
882     55.0
Name: age, dtype: float64

<a class="anchor" id="">

# 3. Modify data (Data preparation)

</a>

<img src="image/step3.png" style="height:60px">

Use this section to apply transformations to your dataset.

Remember that your decisions at this step should be exclusively informed by your **training data**. While you will need to split your data between training and validation, how that split will be made and how to apply the approppriate transformations will depend on the type of model assessment solution you select for your project (each has its own set of advantages and disadvantages that you need to consider). **Please find a list of possible methods for model assessment below**:

1. **Holdout method**
2. **Repeated Holdout method**
3. **Cross-Validation**

__Note:__ Instead of creating different sections for the treatment of training and validation data, you can make the transformations in the same cell. There is no need to create a specific section for that.

### 3.1. Data Preparation

In [77]:
# Drop the 'marrital_status' and 'region' columns from the dataset
obesity_train = obesity_train_raw.drop(columns=['marrital_status', 'region'])
obesity_test = obesity_test_raw.drop(columns=['marrital_status', 'region'])

In [78]:
obesity_train.set_index('id', inplace=True)
obesity_test.set_index('id', inplace=True)
obesity_train

Unnamed: 0_level_0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
2,23.0,Frequently,no,up to 5,Sometimes,Male,1.80,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
3,,Frequently,no,up to 2,Sometimes,Male,1.80,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,1.0,no,no,,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,3.0,no,no,5 or more,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1607,21.0,Sometimes,,up to 5,Sometimes,Female,1.73,3.0,no,yes,3 to 4,1.0,no,Public,Always,1 to 2,131.0,Obesity_Type_III
1608,22.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,,Always,1 to 2,134.0,Obesity_Type_III
1609,23.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,Public,Always,1 to 2,134.0,Obesity_Type_III
1610,24.0,Sometimes,yes,up to 5,Sometimes,Female,1.74,3.0,no,yes,1 to 2,0.0,no,Public,Always,more than 2,133.0,Obesity_Type_III


In [79]:
# Selecting outliers for which the age is out of scope. Or the weight classification is suspiciously low for the value given
outliers = obesity_train[
    ((obesity_train['age'] < 16) & ~(obesity_train['age'].isna())) |
    ((obesity_train['age'] > 56) & ~(obesity_train['age'].isna())) |
    ((obesity_train['weight'] > 167) & ~(obesity_train['weight'].isna()))
]
obesity_train.drop(outliers.index, inplace=True)
obesity_train.reset_index(drop=True, inplace=True)

In [80]:
obesity_train.shape # Shape adds up to our expectation (6 rows deleted) 1611 -> 1605 rows

(1605, 18)

# Handling Missing Values

In [81]:
obesity_train.isna().sum()

age                           65
alcohol_freq                  36
caloric_freq                  20
devices_perday                21
eat_between_meals             59
gender                        20
height                        13
meals_perday                   9
monitor_calories              39
parent_overweight             20
physical_activity_perweek    564
siblings                      12
smoke                         12
transportation                40
veggies_freq                  26
water_daily                   34
weight                        53
obese_level                    0
dtype: int64

In [82]:
obesity_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 500 entries, 1612 to 2111
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        475 non-null    float64
 1   alcohol_freq               490 non-null    object 
 2   caloric_freq               491 non-null    object 
 3   devices_perday             494 non-null    object 
 4   eat_between_meals          485 non-null    object 
 5   gender                     493 non-null    object 
 6   height                     490 non-null    float64
 7   meals_perday               499 non-null    float64
 8   monitor_calories           488 non-null    object 
 9   parent_overweight          492 non-null    object 
 10  physical_activity_perweek  311 non-null    object 
 11  siblings                   494 non-null    float64
 12  smoke                      491 non-null    object 
 13  transportation             483 non-null    object 


In [83]:
# Impute with mode and median as the first null-handling resolution. Will be re-approached with further iterations on the model itself

obesity_train['physical_activity_perweek'].fillna(0, inplace=True) # ASSUMPTION: There is no 0 value in the scope. We assume nulls are the people who dont work out
obesity_test['physical_activity_perweek'].fillna(0, inplace=True)

In [84]:
categorical_columns = ['alcohol_freq','caloric_freq','devices_perday','eat_between_meals','gender',
                 'monitor_calories','parent_overweight','physical_activity_perweek','smoke','transportation',
                 'veggies_freq','water_daily']

In [85]:
from sklearn.impute import SimpleImputer
# The rest will be imputed with median or mode, depending on wheter the column in a dataframe is numerical or categorical
def fillna_simple(data_train, data_test):
    """
    Fills missing values in the dataframe.
    Categorical columns are filled with mode, numerical with median.
    """
    mode_imputer = SimpleImputer(strategy='most_frequent')
    median_imputer = SimpleImputer(strategy='median')
    
    for column in data_test.columns:
        if data_train[column].dtype == 'object':  # Categorical
            data_train[[column]] = mode_imputer.fit_transform(data_train[[column]])
            data_test[[column]] = mode_imputer.transform(data_test[[column]])
        else:  # Numerical
            data_train[[column]] = median_imputer.fit_transform(data_train[[column]])
            data_test[[column]] = median_imputer.transform(data_test[[column]])

fillna_simple(obesity_train, obesity_test)
obesity_train.isna().sum()

age                          0
alcohol_freq                 0
caloric_freq                 0
devices_perday               0
eat_between_meals            0
gender                       0
height                       0
meals_perday                 0
monitor_calories             0
parent_overweight            0
physical_activity_perweek    0
siblings                     0
smoke                        0
transportation               0
veggies_freq                 0
water_daily                  0
weight                       0
obese_level                  0
dtype: int64

In [86]:
obesity_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1605 entries, 0 to 1604
Data columns (total 18 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        1605 non-null   float64
 1   alcohol_freq               1605 non-null   object 
 2   caloric_freq               1605 non-null   object 
 3   devices_perday             1605 non-null   object 
 4   eat_between_meals          1605 non-null   object 
 5   gender                     1605 non-null   object 
 6   height                     1605 non-null   float64
 7   meals_perday               1605 non-null   float64
 8   monitor_calories           1605 non-null   object 
 9   parent_overweight          1605 non-null   object 
 10  physical_activity_perweek  1605 non-null   object 
 11  siblings                   1605 non-null   float64
 12  smoke                      1605 non-null   object 
 13  transportation             1605 non-null   objec

In [87]:
def classify_bmi_comprehensive(row):
    """
    Classify BMI based on age and BMI value.

    Input:
    row: A Pandas row with 'weight', 'height', and 'age' columns.

    Output:
    Returns a string that classifies the individual into BMI categories.
    """
    # Check if weight and height are valid
    if row['height'] <= 0 or row['weight'] <= 0:
        return 'Invalid data'

    # Calculate BMI
    bmi = row['weight'] / (row['height'] ** 2)

    # Age group: Children (2-19 years)
    if 2 <= row['age'] < 20:
        if bmi < 14:
            return 'Underweight'
        elif 14 <= bmi < 18:
            return 'Normal weight'
        elif 18 <= bmi < 21:
            return 'Overweight'
        else:
            return 'Obese'

    # Age group: Adults (20-64 years)
    elif 20 <= row['age'] < 65:
        if bmi < 18.5:
            return 'Underweight'
        elif 18.5 <= bmi < 24.9:
            return 'Normal weight'
        elif 25 <= bmi < 29.9:
            return 'Overweight'
        else:
            return 'Obese'

In [88]:
obesity_train['bmi_class'] = obesity_train.apply(lambda row: classify_bmi_comprehensive(row), axis=1)
obesity_test['bmi_class'] = obesity_test.apply(lambda row: classify_bmi_comprehensive(row), axis=1)

In [89]:
obesity_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1605 entries, 0 to 1604
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        1605 non-null   float64
 1   alcohol_freq               1605 non-null   object 
 2   caloric_freq               1605 non-null   object 
 3   devices_perday             1605 non-null   object 
 4   eat_between_meals          1605 non-null   object 
 5   gender                     1605 non-null   object 
 6   height                     1605 non-null   float64
 7   meals_perday               1605 non-null   float64
 8   monitor_calories           1605 non-null   object 
 9   parent_overweight          1605 non-null   object 
 10  physical_activity_perweek  1605 non-null   object 
 11  siblings                   1605 non-null   float64
 12  smoke                      1605 non-null   object 
 13  transportation             1605 non-null   objec

In [90]:
targets_one_hot = ["gender", "caloric_freq" ,"transportation", "parent_overweight", "smoke"]
targets_ordinal = ["alcohol_freq", "devices_perday", "eat_between_meals", "monitor_calories", "physical_activity_perweek",
                   "veggies_freq", "water_daily", "bmi_class"]
encode_config = {"onehot": targets_one_hot, "ordinal": targets_ordinal}


In [91]:
obesity_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1605 entries, 0 to 1604
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        1605 non-null   float64
 1   alcohol_freq               1605 non-null   object 
 2   caloric_freq               1605 non-null   object 
 3   devices_perday             1605 non-null   object 
 4   eat_between_meals          1605 non-null   object 
 5   gender                     1605 non-null   object 
 6   height                     1605 non-null   float64
 7   meals_perday               1605 non-null   float64
 8   monitor_calories           1605 non-null   object 
 9   parent_overweight          1605 non-null   object 
 10  physical_activity_perweek  1605 non-null   object 
 11  siblings                   1605 non-null   float64
 12  smoke                      1605 non-null   object 
 13  transportation             1605 non-null   objec

In [92]:
def encode_data(data_train, data_test, targets_one_hot, targets_ordinal):
    """
    Encodes any type of data in onehot and ordinal, using a list of columns to specify which to encode.
    Data: your dataframe, duh
    onehot_list: list of strings representing the columns to encode with 1hot
    ordinal_list: analogous as above, except it's with ordinal encoding.
    
    RETVAL: Two dataframes with the encoded data. Transformed columns' names will be deleted and replaced with the encoded ones, following this naming convention:
    - If onehot, it will follow the format target_transformed_column
    - If ordinal, it will follow the format target_encoded
    """
    
    encoder = OneHotEncoder(sparse_output=False, drop="first")  # drop to avoid multicollinearity
    
    encoded_train = data_train.copy()
    encoded_test = data_test.copy()
    
    for target in targets_one_hot:
        encoded_train[target] = encoded_train[target].astype("str")
        encoded_test[target] = encoded_test[target].astype("str")
        
        # Fit encoder on training data
        target_encoder = encoder.fit(encoded_train[[target]])
        
        # Transform training data
        transformed_train = target_encoder.transform(encoded_train[[target]])
        transformed_train = pd.DataFrame(transformed_train, columns=[f"{target}_{col}" for col in target_encoder.categories_[0][1:]])
        transformed_train.set_index(encoded_train.index, inplace=True)
        
        # Transform test data
        transformed_test = target_encoder.transform(encoded_test[[target]])
        transformed_test = pd.DataFrame(transformed_test, columns=[f"{target}_{col}" for col in target_encoder.categories_[0][1:]])
        transformed_test.set_index(encoded_test.index, inplace=True)
        
        # Merge columns
        encoded_train = encoded_train.drop(columns=[target]).join(transformed_train)
        encoded_test = encoded_test.drop(columns=[target]).join(transformed_test)
    
    for target in targets_ordinal:
        encoded_train[target] = encoded_train[target].astype("str")
        encoded_test[target] = encoded_test[target].astype("str")
        
        # Fit encoder on training data
        target_encoder = OrdinalEncoder()
        target_encoder.fit(encoded_train[[target]])
        
        # Transform training data
        transformed_train = target_encoder.transform(encoded_train[[target]])
        encoded_train[f"{target}_encoded"] = transformed_train
        
        # Transform test data
        transformed_test = target_encoder.transform(encoded_test[[target]])
        encoded_test[f"{target}_encoded"] = transformed_test
        
        # Drop original columns
        encoded_train.drop(columns=[target], inplace=True)
        encoded_test.drop(columns=[target], inplace=True)
    
    return encoded_train, encoded_test

# Encode the data
encoded_train, encoded_test = encode_data(obesity_train, obesity_test, encode_config['onehot'], encode_config['ordinal'])


In [93]:
encoded_train, encoded_test = encode_data(obesity_train, obesity_test, encode_config['onehot'], encode_config['ordinal'])

### 3.3. Scaling

In [94]:
# Value scaling
from sklearn.preprocessing import StandardScaler, RobustScaler

scale_age = StandardScaler().fit(encoded_train[["age"]])
scale_height = StandardScaler().fit(encoded_train[["height"]])
scale_weight = StandardScaler().fit(encoded_train[["weight"]]) # Statistical analysis justifies the need to use RobustScaler on this one

dfs = [encoded_train, encoded_test] # Transform both dataframes
for df in dfs:
    new_age = scale_age.transform(df[["age"]])
    new_height = scale_height.transform(df[["height"]])
    new_weight = scale_weight.transform(df[["weight"]])

    # Replace columns
    df["age"] = new_age
    df["height"] = new_height
    df["weight"] = new_weight


In [95]:
obesity_train_preproc = encoded_train.copy()
obesity_test_preproc = encoded_test.copy()

### 3.4. Feature Selection

In [96]:
obesity_train_preproc.head(3)

Unnamed: 0,age,height,meals_perday,siblings,weight,obese_level,gender_Male,caloric_freq_yes,transportation_Car,transportation_Motorbike,...,parent_overweight_yes,smoke_yes,alcohol_freq_encoded,devices_perday_encoded,eat_between_meals_encoded,monitor_calories_encoded,physical_activity_perweek_encoded,veggies_freq_encoded,water_daily_encoded,bmi_class_encoded
0,-0.535907,-0.887188,3.0,3.0,-0.892716,Normal_Weight,0.0,0.0,0.0,0.0,...,1.0,0.0,2.0,2.0,3.0,0.0,0.0,2.0,0.0,0.0
1,-0.205976,1.019586,3.0,0.0,-0.382023,Normal_Weight,1.0,0.0,0.0,0.0,...,1.0,0.0,1.0,2.0,3.0,0.0,2.0,2.0,0.0,0.0
2,-0.205976,1.019586,3.0,2.0,0.010818,Overweight_Level_I,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,3.0,0.0,2.0,0.0,0.0,2.0


In [111]:
X = obesity_train_preproc.drop(columns=['obese_level'])
y = obesity_train_preproc['obese_level']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [112]:
from sklearn.metrics import classification_report

tree = DecisionTreeClassifier(random_state=42)

tree.fit(X_train, y_train)

print(classification_report(y_val, tree.predict(X_val)))

                     precision    recall  f1-score   support

Insufficient_Weight       0.95      0.90      0.92        40
      Normal_Weight       0.82      0.73      0.77        44
     Obesity_Type_I       0.88      0.87      0.88        53
    Obesity_Type_II       0.86      0.93      0.90        46
   Obesity_Type_III       0.98      0.90      0.93        48
 Overweight_Level_I       0.81      0.87      0.84        45
Overweight_Level_II       0.78      0.87      0.82        45

           accuracy                           0.87       321
          macro avg       0.87      0.87      0.87       321
       weighted avg       0.87      0.87      0.87       321



In [113]:
from sklearn.ensemble import RandomForestClassifier

tree_rf = RandomForestClassifier(random_state=42)

tree_rf.fit(X_train, y_train)

print(classification_report(y_val, tree_rf.predict(X_val)))

tree_rf.score(X_train, y_train)

                     precision    recall  f1-score   support

Insufficient_Weight       1.00      0.88      0.93        40
      Normal_Weight       0.93      0.86      0.89        44
     Obesity_Type_I       0.89      0.96      0.93        53
    Obesity_Type_II       0.93      0.93      0.93        46
   Obesity_Type_III       1.00      0.92      0.96        48
 Overweight_Level_I       0.87      0.89      0.88        45
Overweight_Level_II       0.77      0.89      0.82        45

           accuracy                           0.91       321
          macro avg       0.91      0.90      0.91       321
       weighted avg       0.91      0.91      0.91       321



1.0

In [104]:
importances = tree_rf.feature_importances_

importances_df = pd.DataFrame(importances, index=X.columns, columns=['importance'])
importances_df = importances_df.sort_values('importance', ascending=False)
importances_df

Unnamed: 0,importance
weight,0.278404
bmi_class_encoded,0.129733
age,0.105563
height,0.095777
gender_Male,0.060965
veggies_freq_encoded,0.044369
alcohol_freq_encoded,0.03717
meals_perday,0.034664
eat_between_meals_encoded,0.033563
physical_activity_perweek_encoded,0.03144


<a class="anchor" id="">

# 4 & 5. Model & Assess (Modelling and Assessment)

</a>

<img src="image/step4.png" style="height:60px">

### 4.1. Model Selection

In this section you should take the time to train different predictive algorithms with the data that got to this stage and **use the approppriate model assessment metrics to decide which model you think is the best to address your problem**.

**You are expected to present on your report the model performances of the different algorithms that you tested and discuss what informed your choice for a specific algorithm**

### 4.2. Model Optimization

After selecting the best algorithm (set of algorithms), you can try to optimize the performance of your model by fiddling with the algorithms' hyper-parameters and select the options that result on the best overall performance.

Possible ways of doing this can be through:
1. [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
2. [RandomSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

**While you are not required to show the results of all combinations of hyperparameters that you tried, you should at least discuss the what were the possible combinations used and which of them resulted in your best performance**

<a class="anchor" id="">

# 5. Deploy

</a>

<img src="image/step5.png" style="height:60px">

### 5.0 Training a final model

You used the previous steps of modelling and assessment to determine what would be best strategies when it comes to preprocessing, scaling, feature selection, algorithm and hyper-parameters you could find.

**By this stage, all of those choices were already made**. For that reason, a split between training and validation is no longer necessary. **A good practice** would be to take the initial data and train a final model with all of the labeled data that you have available.

**Everything is figured by this stage**, so, on a first level all you need to do is replicate the exact preprocessing, scaling and feature selection decisions you made before.<br>
When it comes to the final model, all you have to do is creeate a new instance of your best algorithm with the best parameters that you uncovered (no need to try all algorithms and hyper-parameters again).

### 5.1. Import and Transform your test data

Remember, the test data does not have the `outcome` variable.

### 5.2. Obtain Predictions on the test data from your final model

### 5.3. Create a Dataframe containing the index of each row and its intended prediction and export it to a csv file

Submit the csv file to Kaggle to obtain the model performance of your model on the test data.