<b><font size="6">Predictive Modelling Pipeline Template</font></b><br><br>

In this notebook we present to you the main steps you should follow throughout your project.


<b> Important: The numbered sections and subsections are merely indicative of some of the steps you should pay attention to in your project. <br>You are not required to strictly follow this order or to execute everything in separate cells.</b>
    
<img src="image/process_ML.png" style="height:70px">

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler, RobustScaler
import warnings
warnings.filterwarnings('ignore')

<a class="anchor" id="">

# 1. Import data (Data Integration)

</a>


<img src="image/step1.png" style="height:60px">

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load the data in a simple way
obesity_train_raw = pd.read_csv('./obesity_train.csv')
obesity_test_raw = pd.read_csv('./obesity_test.csv')

In [None]:
# Load the data
obesity_train_raw = pd.read_csv('/content/drive/MyDrive/NOVA IMS/obesity_train.csv')
obesity_test_raw = pd.read_csv('/content/drive/MyDrive/NOVA IMS/obesity_test.csv')

In [None]:
obesity_train_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,...,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
0,1,21.0,Never,no,up to 5,Sometimes,Female,1.62,,3.0,...,yes,,LatAm,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
1,2,23.0,Frequently,no,up to 5,Sometimes,Male,1.8,,3.0,...,yes,3 to 4,LatAm,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
2,3,,Frequently,no,up to 2,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
3,4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,,1.0,...,no,,LatAm,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
4,5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,,3.0,...,no,5 or more,LatAm,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
5,6,24.0,Frequently,yes,up to 5,Sometimes,Male,1.78,,3.0,...,yes,1 to 2,LatAm,2.0,no,Public,Always,1 to 2,64.0,Normal_Weight
6,7,21.0,Sometimes,yes,up to 5,Frequently,Female,1.72,,3.0,...,yes,3 to 4,,2.0,no,Public,Sometimes,1 to 2,80.0,Overweight_Level_II
7,8,22.0,Sometimes,no,up to 2,Sometimes,Male,1.65,,3.0,...,no,3 to 4,LatAm,1.0,no,Public,Always,more than 2,56.0,Normal_Weight
8,9,41.0,Frequently,yes,up to 5,Sometimes,Male,1.8,,3.0,...,no,3 to 4,LatAm,0.0,no,Car,Sometimes,1 to 2,99.0,Obesity_Type_I
9,10,27.0,Sometimes,yes,up to 2,Sometimes,Male,1.93,,1.0,...,yes,1 to 2,LatAm,2.0,no,Public,Sometimes,less than 1,102.0,Overweight_Level_II


In [None]:
obesity_test_raw.head(10)

Unnamed: 0,id,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,marrital_status,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,region,siblings,smoke,transportation,veggies_freq,water_daily,weight
0,1612,21.0,Sometimes,no,up to 2,Sometimes,Female,1.52,,3.0,yes,yes,5 or more,LatAm,3.0,yes,Public,Always,more than 2,56.0
1,1613,29.0,Sometimes,yes,up to 2,Sometimes,Male,1.62,,3.0,no,no,,LatAm,3.0,no,Car,Sometimes,1 to 2,53.0
2,1614,23.0,Sometimes,,up to 2,Sometimes,Female,1.5,,3.0,no,yes,1 to 2,LatAm,2.0,no,Motorbike,Always,1 to 2,
3,1615,22.0,Never,yes,up to 5,Sometimes,Male,1.72,,3.0,no,yes,1 to 2,LatAm,1.0,no,Public,Sometimes,1 to 2,68.0
4,1616,26.0,Sometimes,yes,more than 5,Frequently,Male,1.85,,3.0,no,yes,3 to 4,LatAm,1.0,no,Public,Always,more than 2,105.0
5,1617,23.0,Sometimes,yes,up to 5,Sometimes,Male,1.77,,1.0,no,yes,1 to 2,LatAm,2.0,no,Public,Always,less than 1,60.0
6,1618,22.0,Sometimes,no,up to 5,Always,Female,1.7,,3.0,yes,yes,3 to 4,LatAm,1.0,no,Public,Always,1 to 2,
7,1619,29.0,Never,yes,up to 2,Sometimes,Female,1.53,,1.0,no,no,,LatAm,0.0,no,Car,Sometimes,1 to 2,78.0
8,1620,30.0,Never,yes,up to 2,Frequently,Female,1.71,,4.0,no,yes,,LatAm,0.0,yes,Car,Always,less than 1,82.0
9,1621,23.0,Sometimes,yes,up to 5,Frequently,Female,1.6,,4.0,no,no,3 to 4,LatAm,3.0,no,Car,Sometimes,1 to 2,52.0


<a class="anchor" id="">

# 2. Explore data (Data access, exploration and understanding)

</a>

<img src="image/step2.png" style="height:60px">

Remember, this step is very important as it is at this stage that you will really look into the data that you have. Generally speaking, if you do well at this stage, the following stages will be very smooth.

Moreover, you should also take the time to find meaningful patterns on the data: what interesting relationships can be found between the variables and how can that knowledge be inform your future decisions.

In [None]:
# Display information about the training dataset
obesity_train_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1611 entries, 0 to 1610
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         1611 non-null   int64  
 1   age                        1545 non-null   float64
 2   alcohol_freq               1575 non-null   object 
 3   caloric_freq               1591 non-null   object 
 4   devices_perday             1589 non-null   object 
 5   eat_between_meals          1552 non-null   object 
 6   gender                     1591 non-null   object 
 7   height                     1597 non-null   float64
 8   marrital_status            0 non-null      float64
 9   meals_perday               1602 non-null   float64
 10  monitor_calories           1572 non-null   object 
 11  parent_overweight          1591 non-null   object 
 12  physical_activity_perweek  1046 non-null   object 
 13  region                     1544 non-null   objec

<a class="anchor" id="">

# 3. Modify data (Data preparation)

</a>

<img src="image/step3.png" style="height:60px">

Use this section to apply transformations to your dataset.

Remember that your decisions at this step should be exclusively informed by your **training data**. While you will need to split your data between training and validation, how that split will be made and how to apply the approppriate transformations will depend on the type of model assessment solution you select for your project (each has its own set of advantages and disadvantages that you need to consider). **Please find a list of possible methods for model assessment below**:

1. **Holdout method**
2. **Repeated Holdout method**
3. **Cross-Validation**

__Note:__ Instead of creating different sections for the treatment of training and validation data, you can make the transformations in the same cell. There is no need to create a specific section for that.

### 3.1. Data Preparation

In [None]:
# Drop the 'marrital_status' and 'region' columns from the dataset
obesity_train = obesity_train_raw.drop(columns=['marrital_status', 'region'])
obesity_test = obesity_test_raw.drop(columns=['marrital_status', 'region'])

In [None]:
obesity_train.set_index('id', inplace=True)
obesity_test.set_index('id', inplace=True)
obesity_train

Unnamed: 0_level_0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
2,23.0,Frequently,no,up to 5,Sometimes,Male,1.80,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
3,,Frequently,no,up to 2,Sometimes,Male,1.80,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,1.0,no,no,,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,3.0,no,no,5 or more,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1607,21.0,Sometimes,,up to 5,Sometimes,Female,1.73,3.0,no,yes,3 to 4,1.0,no,Public,Always,1 to 2,131.0,Obesity_Type_III
1608,22.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,,Always,1 to 2,134.0,Obesity_Type_III
1609,23.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,Public,Always,1 to 2,134.0,Obesity_Type_III
1610,24.0,Sometimes,yes,up to 5,Sometimes,Female,1.74,3.0,no,yes,1 to 2,0.0,no,Public,Always,more than 2,133.0,Obesity_Type_III


In [None]:
obesity_train.age.sort_values(ascending=False).head(3)

Unnamed: 0_level_0,age
id,Unnamed: 1_level_1
1014,88.0
101,61.0
883,55.0


In [None]:
# Selecting outliers for which the age is out of scope. Or the weight classification is suspiciously low for the value given
outliers = obesity_train[
    ((obesity_train['age'] < 16) & ~(obesity_train['age'].isna())) |
    ((obesity_train['age'] > 56) & ~(obesity_train['age'].isna())) |
    ((obesity_train['weight'] > 190) & ~(obesity_train['weight'].isna()))
]
obesity_train.drop(outliers.index, inplace=True)

In [None]:
obesity_train.shape # Shape adds up to our expectation (5 rows deleted) 1611 -> 1606 rows

(1606, 18)

# Handling Missing Values

In [None]:
obesity_train.isna().sum()

Unnamed: 0,0
age,66
alcohol_freq,36
caloric_freq,20
devices_perday,21
eat_between_meals,59
gender,20
height,13
meals_perday,9
monitor_calories,39
parent_overweight,20


In [None]:
# Impute with mode and median as the first null-handling resolution. Will be re-approached with further iterations on the model itself

obesity_train['physical_activity_perweek'].fillna('0', inplace=True) # ASSUMPTION: There is no 0 value in the scope. We assume nulls are the people who dont work out
# Remember to make it a string

# The rest will be imputed with median or mode, depending on wheter the column in a dataframe is numerical or categorical
def fillna_simple(data):
    """
    Fills missing values in the dataframe.
    Categorical columns are filled with mode, numerical with median.
    """
    for column in data.columns:
        if data[column].dtype == 'object':  # Categorical
            data[column].fillna(data[column].mode()[0], inplace=True)
        else:  # Numerical
            data[column].fillna(data[column].median(), inplace=True)

fillna_simple(obesity_train)
obesity_train.isna().sum()

Unnamed: 0,0
age,0
alcohol_freq,0
caloric_freq,0
devices_perday,0
eat_between_meals,0
gender,0
height,0
meals_perday,0
monitor_calories,0
parent_overweight,0


In [40]:
def classify_bmi_comprehensive(row):
    """
    Classify BMI based on age and BMI value.

    Input:
    row: A Pandas row with 'weight', 'height', and 'age' columns.

    Output:
    Returns a string that classifies the individual into BMI categories.
    """
    # Check if weight and height are valid
    if row['height'] <= 0 or row['weight'] <= 0:
        return 'Invalid data'

    # Calculate BMI
    bmi = row['weight'] / (row['height'] ** 2)

    # Classify based on age group
    if row['age'] < 2 or row['age'] > 120:  # Handle unrealistic ages
        return 'Invalid age'

    # Age group: Children (2-19 years)
    if 2 <= row['age'] < 20:
        if bmi < 14:
            return 'Underweight'
        elif 14 <= bmi < 18:
            return 'Normal weight'
        elif 18 <= bmi < 21:
            return 'Overweight'
        else:
            return 'Obese'

    # Age group: Adults (20-64 years)
    elif 20 <= row['age'] < 65:
        if bmi < 18.5:
            return 'Underweight'
        elif 18.5 <= bmi < 24.9:
            return 'Normal weight'
        elif 25 <= bmi < 29.9:
            return 'Overweight'
        else:
            return 'Obese'

In [41]:
obesity_train['bmi_class'] = obesity_train.apply(lambda row: classify_bmi_comprehensive(row), axis=1)
obesity_train.head(3)

Unnamed: 0_level_0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level,bmi_class
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
1,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,0,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight,Normal weight
2,23.0,Frequently,no,up to 5,Sometimes,Male,1.8,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight,Normal weight
3,23.0,Frequently,no,up to 2,Sometimes,Male,1.8,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I,Overweight


In [None]:
#@title Junk (replaced with proper functional data encoding)

junk = """
# Remove multidimensional outliers from training data
outliers = [] # List of list of outliers

#New dataframe with just weight x height
weight_height = obesity_train.loc[: ,["weight", "height"]]
weight_height.head()

# Remove missing values
weight_height = weight_height.dropna()

# Get outliers
query_wh = "(weight>65 & height < 1.4) | (weight >70 & height > 2.1) | (weight > 165)"
outliers.append(weight_height.query(query_wh).index.tolist())

# Repeat
age_height = obesity_train.loc[: ,["age", "height"]]
age_height = age_height.dropna()
outliers.append(age_height.query("(age < 10 & height>2) | (age>35 & height < 1.4) | (age > 80 & height > 1.5)").index.tolist())
# Repeat
age_weight = obesity_train.loc[: ,["age", "weight"]]
age_weight = age_weight.dropna()
outliers.append(age_weight.query("(age<10 & weight>80) | (age>50) | (weight>180)").index.tolist())

# NOTE: Statistical analysis justifies the way we selected outliers

l = []
for x in outliers:
    for out in x:
        if out in l:
            continue
        else:
            l.append(out) # Make a single list of outliers

# Proceed to remove outliers from dataset
obesity_train = obesity_train.drop(l, axis=0)
"""

junk = """
# Transforming categorical to numerical with numbers
def add_1_nn(x): # Aux. function
    if x>=0:
        return x+1
    else:
        return x

def plus_2(x): # Another aux. function
    return x+2

dfs = [obesity_train, obesity_test]

for df, i in zip(dfs, [0,1]): # Transform both datasets
    if i == 0:
        to_transform = df.drop("obese_level", axis=1).select_dtypes("object").columns.tolist()
    elif i == 1:
        to_transform = df.select_dtypes("object").columns.tolist()

    to_transform.append("meals_perday")

    for transformer in to_transform:
        df[f"{transformer}"] = df[f"{transformer}"].astype("category").cat.codes

    targets = df.select_dtypes("int8").columns.tolist()
    targets.remove("physical_activity_perweek")

    for target in targets:
        df[target] = df[target].map(add_1_nn)

    df["physical_activity_perweek"]= df["physical_activity_perweek"].map(plus_2)
    """



In [None]:
# DONE: Make data encoding a functional thing
def encode_data(data, onehot_list, ordinal_list):
    """
    Encodes any type of data in onehot and ordinal, using a list of columns to specify which to encode. \n
    Data: your dataframe, duh \n
    onehot_list: list of strings representing the columns to encode with 1hot \n
    ordinal_list: analogous as above, except it's with ordinal encoding. \n
   \n
    RETVAL: Another dataframe with the encoded data. Transformed columns' names will be deleted and replaced with the encoded ones, following this naming convention:
    \n - If onehot, it will follow the format target_transformed_column
    \n - If ordinal, it will follow the format target_encoded
    """
    encoded_train = obesity_train.copy() # Will become the return value (variable name should be changed to be cleared but idc)
    for target in targets_one_hot:
        # Make encoder
        encoder = OneHotEncoder(sparse_output = False, drop = "first") # drop to avoid multicollinearity
        uniques = encoded_train[f"{target}"].unique()
        # Fit encoder
        target_encoder = encoder.fit(uniques.reshape(-1 ,1))
        transformed = target_encoder.transform(encoded_train[[target]])
        # Transform our target
        transformed = pd.DataFrame(transformed)
        transformed.columns = [str(name) for name in uniques.tolist()[1:]]
        transformed = transformed.set_index(encoded_train.index)

        # Merge columns
        for transformed_column in transformed.columns:
            encoded_train[f"{target}_{transformed_column}"] = transformed[f"{transformed_column}"].astype("bool")

        # Remove old column
        encoded_train.drop(columns=target, axis=1, inplace=True)

    for target in targets_ordinal:
        # Make encoder
        encoder = OrdinalEncoder()
        uniques = encoded_train[f"{target}"].unique()
        # Fit encoder and transform data
        target_encoder = encoder.fit(uniques.reshape(-1 ,1))
        transformed = target_encoder.transform(encoded_train[[target]])
        transformed = pd.DataFrame(transformed)
        transformed.columns = [f"{target}_encoded"]
        transformed = transformed.set_index(encoded_train.index)

        # Replace column
        encoded_train[f"{target}_encoded"] = transformed[f'{target}_encoded']

        # Remove old column
        encoded_train.drop(columns=target, axis=1, inplace=True)

    return encoded_train

# TODO: Possibly generalize on any data input to make it pipeline-efficient

In [None]:
obesity_train

Unnamed: 0_level_0,age,alcohol_freq,caloric_freq,devices_perday,eat_between_meals,gender,height,meals_perday,monitor_calories,parent_overweight,physical_activity_perweek,siblings,smoke,transportation,veggies_freq,water_daily,weight,obese_level
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
1,21.0,Never,no,up to 5,Sometimes,Female,1.62,3.0,no,yes,0,3.0,no,Public,Sometimes,1 to 2,64.0,Normal_Weight
2,23.0,Frequently,no,up to 5,Sometimes,Male,1.80,3.0,no,yes,3 to 4,0.0,no,Public,Sometimes,1 to 2,77.0,Normal_Weight
3,23.0,Frequently,no,up to 2,Sometimes,Male,1.80,3.0,no,no,3 to 4,2.0,no,Walk,Always,1 to 2,87.0,Overweight_Level_I
4,22.0,Sometimes,no,up to 2,Sometimes,Male,1.78,1.0,no,no,0,3.0,no,Public,Sometimes,1 to 2,90.0,Overweight_Level_II
5,22.0,Sometimes,no,up to 2,Sometimes,Male,1.64,3.0,no,no,5 or more,3.0,no,Public,Sometimes,1 to 2,53.0,Normal_Weight
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1607,21.0,Sometimes,yes,up to 5,Sometimes,Female,1.73,3.0,no,yes,3 to 4,1.0,no,Public,Always,1 to 2,131.0,Obesity_Type_III
1608,22.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,Public,Always,1 to 2,134.0,Obesity_Type_III
1609,23.0,Sometimes,yes,up to 5,Sometimes,Female,1.75,3.0,no,yes,1 to 2,0.0,no,Public,Always,1 to 2,134.0,Obesity_Type_III
1610,24.0,Sometimes,yes,up to 5,Sometimes,Female,1.74,3.0,no,yes,1 to 2,0.0,no,Public,Always,more than 2,133.0,Obesity_Type_III


In [None]:
targets_one_hot = ["gender", "caloric_freq" ,"transportation", "parent_overweight", "smoke"]
targets_ordinal = ["alcohol_freq", "devices_perday", "eat_between_meals", "monitor_calories", "physical_activity_perweek",
                   "veggies_freq", "water_daily"]
encode_config = {"onehot": targets_one_hot, "ordinal": targets_ordinal}

encoded = encode_data(obesity_test, encode_config['onehot'], encode_config['ordinal'])
encoded

Unnamed: 0_level_0,age,height,meals_perday,siblings,weight,obese_level,gender_Male,caloric_freq_yes,transportation_Walk,transportation_Car,...,transportation_Motorbike,parent_overweight_no,smoke_yes,alcohol_freq_encoded,devices_perday_encoded,eat_between_meals_encoded,monitor_calories_encoded,physical_activity_perweek_encoded,veggies_freq_encoded,water_daily_encoded
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,21.0,1.62,3.0,3.0,64.0,Normal_Weight,False,False,False,False,...,False,True,False,2.0,2.0,3.0,0.0,0.0,2.0,0.0
2,23.0,1.80,3.0,0.0,77.0,Normal_Weight,True,False,False,False,...,False,True,False,1.0,2.0,3.0,0.0,2.0,2.0,0.0
3,23.0,1.80,3.0,2.0,87.0,Overweight_Level_I,True,False,False,False,...,True,False,False,1.0,1.0,3.0,0.0,2.0,0.0,0.0
4,22.0,1.78,1.0,3.0,90.0,Overweight_Level_II,True,False,False,False,...,False,False,False,3.0,1.0,3.0,0.0,0.0,2.0,0.0
5,22.0,1.64,3.0,3.0,53.0,Normal_Weight,True,False,False,False,...,False,False,False,3.0,1.0,3.0,0.0,3.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1607,21.0,1.73,3.0,1.0,131.0,Obesity_Type_III,False,True,False,False,...,False,True,False,3.0,2.0,3.0,0.0,2.0,0.0,0.0
1608,22.0,1.75,3.0,0.0,134.0,Obesity_Type_III,False,True,False,False,...,False,True,False,3.0,2.0,3.0,0.0,1.0,0.0,0.0
1609,23.0,1.75,3.0,0.0,134.0,Obesity_Type_III,False,True,False,False,...,False,True,False,3.0,2.0,3.0,0.0,1.0,0.0,0.0
1610,24.0,1.74,3.0,0.0,133.0,Obesity_Type_III,False,True,False,False,...,False,True,False,3.0,2.0,3.0,0.0,1.0,0.0,2.0


In [None]:
encode_data

In [None]:
"""
# Fill missing numerical variables with LinearRegression.

# 1. Prepare data for regressors
df = obesity_train.copy()

df_weight = df.drop(df.query("weight.isna() | height.isna() | age.isna() | gender.isna() | meals_perday.astype('int')==-1").index)
weight_variables = df_weight.loc[:, ["height", "age", "gender", "meals_perday"]]
weight_target =  df_weight.loc[:, "weight"]

df_age = df.drop(df.query("weight.isna() | height.isna() | age.isna() | gender.astype('int')==-1").index)
age_variables = df_age.loc[:, ["weight", "height", "gender"]]
age_target = df_age.loc[:, "age"]

df_height = df.drop(df.query("weight.isna() | height.isna() | gender.astype('int')==-1").index)
height_variables = df_height.loc[:, ["weight", "gender"]]
height_target = df_height.loc[:, "height"]

# 2. Prepare regressors for height, weight and age
weight_predictor = LinearRegression().fit(weight_variables, weight_target)
age_predictor = LinearRegression().fit(age_variables, age_target)
height_predictor = LinearRegression().fit(height_variables, height_target)

# 3. Fill missing values
dfs = [obesity_train, obesity_test]
for df in dfs:
    # Fill weight
    to_consider_w = df.query("~(age.isna() | height.isna() | meals_perday.astype('int')==-1 | gender.astype('int')==-1) & weight.isna()")
    df.loc[to_consider_w.index, "weight"] = weight_predictor.predict(to_consider_w.loc[:, ["height", "age", "gender", "meals_perday"]])

    # Fill age
    to_consider_a = df.query("~(height.isna() | weight.isna() | gender.astype('int')==-1) & age.isna()")
    df.loc[to_consider_a.index, "age"] = age_predictor.predict(to_consider_a.loc[:, ["weight", "height", "gender"]])

    # Fill height
    to_consider_h = df.query("~(weight.isna() | gender.astype('int')==-1) & height.isna()")
    df.loc[to_consider_h.index, "height"] = height_predictor.predict(to_consider_h.loc[:, ["weight", "gender"]])

"""

In [None]:
"""
# Fill missing categorical variables with Logistic Regression

# 1. Prepare data for classificators
df = obesity_train.copy()

df_gender = df.drop(df.query("gender.astype('int')==-1 | weight.isna() | height.isna() | age.isna()").index)
gender_variables = df_gender.loc[:, ["weight", "height", "age"]]
gender_target = df_gender.loc[:, "gender"]

df_alcohol = df.loc[df.query("alcohol_freq != -1 & gender != -1 & ~age.isna() & ~weight.isna() & ~height.isna()").index, ["alcohol_freq", "age", "weight", "height", "gender"] ]
df_caloric = df.loc[df.query("caloric_freq != -1 & gender != -1 & ~age.isna() & ~weight.isna() & ~height.isna()").index, ["caloric_freq", "age", "weight", "height", "gender"] ]

alcohol_variables = df_alcohol.loc[:, ["age", "weight", "height", "gender"]]
alcohol_target = df_alcohol.loc[:, "alcohol_freq"]

caloric_variables = df_caloric.loc[:, ["age", "weight", "height", "gender"]]
caloric_target = df_caloric.loc[:, "caloric_freq"]

# 2. Prepare classificators
gender_predictor = KNeighborsClassifier().fit(gender_variables, gender_target)
alcohol_predictor = LogisticRegression().fit(alcohol_variables, alcohol_target)
caloric_predictor = LogisticRegression().fit(caloric_variables, caloric_target)

# 3. Replace data
dfs = [obesity_train, obesity_test]
for df in dfs:
    # Fill gender
    to_consider_g = df.query("~(age.isna() | weight.isna() | height.isna() ) & gender==-1")
    try:
        df.loc[to_consider_g.index, "gender"] = gender_predictor.predict(to_consider_g.loc[:, ["weight", "height", "age"]])
    except:
        pass

    # Fill alcohol
    to_consider_a = df.query("~(age.isna() | gender == -1 | weight.isna() | height.isna() ) & alcohol_freq == -1")
    try:
        df.loc[to_consider_a.index, "alcohol_freq"] = alcohol_predictor.predict(to_consider_a.loc[:, ["age", "weight", "height", "gender"]])
    except:
        pass

    # Fill caloric
    to_consider_c = df.query("~(age.isna() | gender == -1 | weight.isna() | height.isna() ) & caloric_freq == -1")
    try:
        df.loc[to_consider_c.index, "caloric_freq"] = alcohol_predictor.predict(to_consider_c.loc[:, ["age", "weight", "height", "gender"]])
    except:
        pass
"""

In [None]:

# Fill the rest with mode or mean

# DONE: Make functional stuff
def fillna2_simple(data, config, is_train, verbose=False):
    """
    Fill missing values of a given dataset in the "simplest" manner, i.e. with simple values (such as mean, mode, median, ...)
    data: Dataframe to fill. \n
    is_train: Specify whether the DF given is train data or not. If True, then it fills value by calculating the filling value of each column. If False, it fills value by using the dictionary in fill_values. \n
    config: Dictionary where you specify the value (or type of value) to fill in the missing values. Could be a specific value, or the name of an aggregator function. Automatically ignored iff is_train=False\n
        - Note: if is_train==True and config makes use of an aggregator, an exception will be raised to warn of a potential information leak. \n
        - Note: to specify aggregators, the value of the dictionary should follow the format agg;agg_name \n
    \n
    Returns RET_DATA, which is the filled DF.
    """
    RET_DATA = data.copy() # Return Value

    # Validate input
    if not is_train:
        for typ in [config[x] for x in config]:
            if type(typ) is str:
                if(len(typ.split(";"))>1):
                    if typ.split(";")[0]=="agg":
                        raise Exception("InformationLeak Exception")

    # Fill data
    for column in config:
        fill_type = config[column]
        agg_flag = False

        # Parse dictionary value
        if type(fill_type) is str:
            if len(fill_type.split(";"))>1 and fill_type.split(";")[0]=="agg":
                fill_type=fill_type.split(";")[1]
                agg_flag = True

        # Fill data according to agg_flag

        if agg_flag:
            fill_with = data[column].agg([fill_type]).squeeze()
            print(f"\nAggregated value for column {column} is {fill_with}"*verbose, end="")
            RET_DATA[column] = RET_DATA[column].fillna(fill_with)

        elif not agg_flag:
            RET_DATA[column] = RET_DATA[column].fillna(fill_type)

        print(f"\nFilled column {column} with {fill_type}\n{'='*15}X{'='*15}"*verbose, end="")

    return RET_DATA

junk = """
dfs = [obesity_train, obesity_test]
for df in dfs:
    columns = df.columns.tolist()
    columns.remove("id")
    if df is obesity_train:
        columns.remove("obese_level")

    for column in columns:
        # 1. Calculate mode or median for each column
        mode = df[f"{column}"].mode()
        mean = df[f"{column}"].mean()

        # 2. Fill values
        interested_rows_A = df.query(f"{column}.isna()").index
        interested_rows_B = df.query(f"{column} == -1").index
        df.loc[interested_rows_A, column] = df.loc[interested_rows_A, column].map(lambda x: mean)

        if df[f"{column}"].dtype == "int64":
            df.loc[interested_rows_B, column] = df.loc[interested_rows_A, column].map(lambda x: mode)

"""

In [None]:
# Use example: fill some with mean or mode, fill another with arbitrary value
config_fill = {
    "physical_activity_perweek":"None",
    "age":"agg;mean",
    "transportation":"agg;mode"
}


obesity_train_filled = fillna_simple(obesity_train, config_fill, is_train=True) # woah a cool trick with double asterisk :O

try:
    fillna_simple(obesity_train, config_fill, is_train=False)
except Exception as a:
    print("Run failed:",a)

obesity_train_filled

### 3.2. Feature Engineering

In [None]:
# Maybe we can transform height, weight into BMI and removing weight and height
obesity_BMI = obesity_train.copy()

obesity_BMI["BMI"] = obesity_BMI["weight"] / obesity_BMI["height"]**2

obesity_BMI

### 3.3. Scaling

In [None]:
# Value scaling
from sklearn.preprocessing import StandardScaler, RobustScaler

df_t = obesity_train # Code imported from another notebook, change of variables to adapt to this notebook

scale_age = StandardScaler().fit(df_t[["age"]].dropna())
scale_height = StandardScaler().fit(df_t[["height"]].dropna())
scale_weight = RobustScaler().fit(df_t[["weight"]].dropna()) # Statistical analysis justifies the need to use RobustScaler on this one

dfs = [obesity_train, obesity_test] # Transform both dataframes
for df in dfs:
    new_age = scale_age.transform(df[["age"]])
    new_height = scale_height.transform(df[["height"]])
    new_weight = scale_weight.transform(df[["weight"]])

    # Replace columns
    df["age"] = new_age
    df["height"] = new_height
    df["weight"] = new_weight


### 3.4. Feature Selection

In [None]:
# To consider

<a class="anchor" id="">

# 4 & 5. Model & Assess (Modelling and Assessment)

</a>

<img src="image/step4.png" style="height:60px">

### 4.1. Model Selection

In this section you should take the time to train different predictive algorithms with the data that got to this stage and **use the approppriate model assessment metrics to decide which model you think is the best to address your problem**.

**You are expected to present on your report the model performances of the different algorithms that you tested and discuss what informed your choice for a specific algorithm**

### 4.2. Model Optimization

After selecting the best algorithm (set of algorithms), you can try to optimize the performance of your model by fiddling with the algorithms' hyper-parameters and select the options that result on the best overall performance.

Possible ways of doing this can be through:
1. [GridSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
2. [RandomSearch](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

**While you are not required to show the results of all combinations of hyperparameters that you tried, you should at least discuss the what were the possible combinations used and which of them resulted in your best performance**

<a class="anchor" id="">

# 5. Deploy

</a>

<img src="image/step5.png" style="height:60px">

### 5.0 Training a final model

You used the previous steps of modelling and assessment to determine what would be best strategies when it comes to preprocessing, scaling, feature selection, algorithm and hyper-parameters you could find.

**By this stage, all of those choices were already made**. For that reason, a split between training and validation is no longer necessary. **A good practice** would be to take the initial data and train a final model with all of the labeled data that you have available.

**Everything is figured by this stage**, so, on a first level all you need to do is replicate the exact preprocessing, scaling and feature selection decisions you made before.<br>
When it comes to the final model, all you have to do is creeate a new instance of your best algorithm with the best parameters that you uncovered (no need to try all algorithms and hyper-parameters again).

### 5.1. Import and Transform your test data

Remember, the test data does not have the `outcome` variable.

### 5.2. Obtain Predictions on the test data from your final model

### 5.3. Create a Dataframe containing the index of each row and its intended prediction and export it to a csv file

Submit the csv file to Kaggle to obtain the model performance of your model on the test data.