# Using ML for Classification Problems

In this notebook we walk through an example of training a machine learning model to solve a classification problem.

## 0. Background

* Where does the dataset come from?
* What predictions do we want to make with this dataset?

The dataset can be found at the following link: [Obesity Levels](https://www.kaggle.com/datasets/fatemehmehrparvar/obesity-levels)

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. The data contains 17 attributes and 2111 records. The records are labeled with the class variable NObeyesdad (obesity level), which allows classification of the data using the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III.

**Objective:** Predict the correct obesity level for each record.

## Target ##
* NObeyesdad : Target, Categorical, "Obesity level"

## Features ##
* Gender: Feature, Categorical, "Gender"
* Age : Feature, Continuous, "Age"
* Height: Feature, Continuous
* Weight: Feature, Continuous
* family_history_with_overweight: Feature, Binary, " Has a family member suffered or suffers from overweight? "
* FAVC : Feature, Binary, " Do you eat high caloric food frequently? "
* FCVC : Feature, Integer, " Do you usually eat vegetables in your meals? "
* NCP : Feature, Continuous, " How many main meals do you have daily? "
* CAEC : Feature, Categorical, " Do you eat any food between meals? "
* SMOKE : Feature, Binary, " Do you smoke? "
* CH2O: Feature, Continuous, " How much water do you drink daily? "
* SCC: Feature, Binary, " Do you monitor the calories you eat daily? "
* FAF: Feature, Continuous, " How often do you have physical activity? "
* TUE : Feature, Integer, " How much time do you use technological devices such as cell phone, videogames, television, computer and others? "
* CALC : Feature, Categorical, " How often do you drink alcohol? "
* MTRANS : Feature, Categorical, " Which transportation do you usually use? "

## 1. Loading the data

Key points:
* Where is the data?
* What format is it in?
* How can it be read?
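
As a minimal sketch of the questions above, assuming the data is a comma-separated file with a header row: `pd.read_csv` accepts either a path or a file-like object, so the same call can be tried on an in-memory buffer. The helper name `load_csv_checked` and the two-row sample are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def load_csv_checked(source, required_columns):
    """Read a CSV and fail early if an expected column is missing.

    `source` may be a file path or a file-like object.
    """
    df = pd.read_csv(source)
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "Age,Gender,Height,Weight,NObeyesdad\n"
    "21,Female,1.62,64,Normal_Weight\n"
    "23,Male,1.80,77,Normal_Weight\n"
)
df_sample = load_csv_checked(sample, ["Age", "Height", "Weight", "NObeyesdad"])
print(df_sample.shape)
```

Checking the expected columns right after loading surfaces a wrong file or a renamed column before any analysis depends on it.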


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
from google.colab import drive

In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data = pd.read_csv('/content/drive/MyDrive/UVM-ML-Category/ObesityDataSet.csv')

In [None]:
data.columns

Index(['Age', 'Gender', 'Height', 'Weight', 'CALC', 'FAVC', 'FCVC', 'NCP',
       'SCC', 'SMOKE', 'CH2O', 'family_history_with_overweight', 'FAF', 'TUE',
       'CAEC', 'MTRANS', 'NObeyesdad'],
      dtype='object')

Let's rename the columns to clearer names.

In [None]:
data.columns = ['Age', 'Gender', 'Height', 'Weight', 'Drink_Alcohol', 'High_Caloric_Food', 'Eat_Vegetables', 'Num_Meals',
       'Monitor_Calories', 'SMOKE', 'Water', 'family_history_with_overweight', 'Physical_Activity', 'Use_Devices',
       'Food_between_Meals', 'Mode_Transportation', 'NObeyesdad']

In [None]:
data.head()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,Water,family_history_with_overweight,Physical_Activity,Use_Devices,Food_between_Meals,Mode_Transportation,NObeyesdad
0,21.0,Female,1.62,64.0,no,no,2.0,3.0,no,no,2.0,yes,0.0,1.0,Sometimes,Public_Transportation,Normal_Weight
1,21.0,Female,1.52,56.0,Sometimes,no,3.0,3.0,yes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,23.0,Male,1.8,77.0,Frequently,no,2.0,3.0,no,no,2.0,yes,2.0,1.0,Sometimes,Public_Transportation,Normal_Weight
3,27.0,Male,1.8,87.0,Frequently,no,3.0,3.0,no,no,2.0,no,2.0,0.0,Sometimes,Walking,Overweight_Level_I
4,22.0,Male,1.78,89.8,Sometimes,no,2.0,1.0,no,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


## 2. Exploratory analysis

Key points:

* How much data do we have?
* Which fields do we have, and how many?
* What kind of information does each field contain?
* How "clean" is the data?
* What range of values does each field have?

NOTE: Justify or support your answers with basic visualizations.

In [None]:
print(f"Obesity dataset has {data.shape[0]} rows and {data.shape[1]} columns")

Obesity dataset has 2111 rows and 17 columns


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   Drink_Alcohol                   2111 non-null   object 
 5   High_Caloric_Food               2111 non-null   object 
 6   Eat_Vegetables                  2111 non-null   float64
 7   Num_Meals                       2111 non-null   float64
 8   Monitor_Calories                2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  Water                           2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  Physical_Activity               21

Based on this output, there are no NaN values, and the dataset contains both numeric and categorical columns.

In [None]:
data.describe()

Unnamed: 0,Age,Height,Weight,Eat_Vegetables,Num_Meals,Water,Physical_Activity,Use_Devices
count,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0
mean,24.3126,1.701677,86.586058,2.419043,2.685628,2.008011,1.010298,0.657866
std,6.345968,0.093305,26.191172,0.533927,0.778039,0.612953,0.850592,0.608927
min,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,19.947192,1.63,65.473343,2.0,2.658738,1.584812,0.124505,0.0
50%,22.77789,1.700499,83.0,2.385502,3.0,2.0,1.0,0.62535
75%,26.0,1.768464,107.430682,3.0,3.0,2.47742,1.666678,1.0
max,61.0,1.98,173.0,3.0,4.0,3.0,3.0,2.0


* The first thing I noticed is that "Age" ranges from 14 to 61 years, while the mean is only about 24 years. The data may contain some outliers.
* The minimum "Weight" is 39 kg and the maximum is 173 kg. The mean is about 87 kg, so it looks OK. We can plot it later to check.
* The columns "Eat_Vegetables", "Num_Meals", "Water", "Physical_Activity" and "Use_Devices" all fall within the range 0 to 4. Let's explore them later to see how their values are distributed.

In [None]:
data.dtypes

Age                               float64
Gender                             object
Height                            float64
Weight                            float64
Drink_Alcohol                      object
High_Caloric_Food                  object
Eat_Vegetables                    float64
Num_Meals                         float64
Monitor_Calories                   object
SMOKE                              object
Water                             float64
family_history_with_overweight     object
Physical_Activity                 float64
Use_Devices                       float64
Food_between_Meals                 object
Mode_Transportation                object
NObeyesdad                         object
dtype: object

In [None]:
data_num = data.select_dtypes(include="number")
data_num.head()

Unnamed: 0,Age,Height,Weight,Eat_Vegetables,Num_Meals,Water,Physical_Activity,Use_Devices
0,21.0,1.62,64.0,2.0,3.0,2.0,0.0,1.0
1,21.0,1.52,56.0,3.0,3.0,3.0,3.0,0.0
2,23.0,1.8,77.0,2.0,3.0,2.0,2.0,1.0
3,27.0,1.8,87.0,3.0,3.0,2.0,2.0,0.0
4,22.0,1.78,89.8,2.0,1.0,2.0,0.0,0.0


Let's plot the numeric values.

In [None]:
total_boxplot = px.box(data_num)
total_boxplot.show()

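The data will need to be normalized, since the columns have very different ranges (for example, Height spans roughly 0.5 while Weight spans more than 100) and this could affect the model. Let's also look at Eat_Vegetables, Num_Meals, Water, Physical_Activity and Use_Devices to see how their values are distributed.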

In [None]:
df_FCVC = data['Eat_Vegetables'].value_counts()
df_FCVC.head()

Eat_Vegetables
3.000000    652
2.000000    600
1.000000     33
2.823179      2
2.214980      2
Name: count, dtype: int64

In [None]:
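# Scatter plot of how often each Eat_Vegetables (FCVC) value occurs.
# The log-scaled y axis keeps the rare non-integer values visible.
scatter_veg = px.scatter(df_FCVC,
                 x=df_FCVC.index,
                 y=df_FCVC.values,
                 color=df_FCVC.values,
                 color_continuous_scale='plasma',
                 title='Value Counts FCVC')
scatter_veg.update_layout(xaxis_title='FCVC',
                     yaxis_title='count',
                     yaxis=dict(type='log')
                     )

scatter_veg.show()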

In [None]:
data['Num_Meals'].value_counts().head()

Num_Meals
3.000000    1203
1.000000     199
4.000000      69
2.776840       2
3.985442       2
Name: count, dtype: int64

In [None]:
data['Water'].value_counts().head()

Water
2.000000    448
1.000000    211
3.000000    162
2.825629      3
1.636326      3
Name: count, dtype: int64

In [None]:
data['Physical_Activity'].value_counts().head()

Physical_Activity
0.000000    411
1.000000    234
2.000000    183
3.000000     75
0.110174      2
Name: count, dtype: int64

In [None]:
data['Use_Devices'].value_counts().head()

Use_Devices
0.000000    557
1.000000    292
2.000000    109
0.630866      4
1.119877      3
Name: count, dtype: int64

After reviewing these columns, I noticed that most of the values are concentrated at integer values. I think the best way to proceed is to round them.
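
Before rounding, it can help to quantify how concentrated the values really are. The sketch below uses a small made-up series (not the real column) to measure the share of entries that are already whole numbers, and then applies the same round-and-cast step used in this notebook.

```python
import pandas as pd

# Made-up values imitating a survey column with a few non-integer entries.
values = pd.Series([3.0, 2.0, 2.0, 1.0, 2.823179, 3.0, 2.21498, 2.0])

# An entry counts as "whole" when rounding leaves it unchanged.
is_whole = values == values.round()
share_whole = is_whole.mean()
print(f"{share_whole:.0%} of the values are whole numbers")

# Rounding and then casting to int mirrors the approach used in this notebook.
rounded = values.round().astype(int)
print(rounded.tolist())
```

If the share of whole numbers were low, rounding would discard a lot of information and keeping the column continuous might be the safer choice.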

In [None]:
for col in ['Age', 'Weight', 'Eat_Vegetables', 'Num_Meals','Water','Physical_Activity', 'Use_Devices']:
    data[col] = data.loc[:,col].round().astype(int)
data.head()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,Water,family_history_with_overweight,Physical_Activity,Use_Devices,Food_between_Meals,Mode_Transportation,NObeyesdad
0,21,Female,1.62,64,no,no,2,3,no,no,2,yes,0,1,Sometimes,Public_Transportation,Normal_Weight
1,21,Female,1.52,56,Sometimes,no,3,3,yes,yes,3,yes,3,0,Sometimes,Public_Transportation,Normal_Weight
2,23,Male,1.8,77,Frequently,no,2,3,no,no,2,yes,2,1,Sometimes,Public_Transportation,Normal_Weight
3,27,Male,1.8,87,Frequently,no,3,3,no,no,2,no,2,0,Sometimes,Walking,Overweight_Level_I
4,22,Male,1.78,90,Sometimes,no,2,1,no,no,2,no,0,0,Sometimes,Public_Transportation,Overweight_Level_II


In [None]:
# data_num was created before rounding, so these boxplots show the original values.
for col in data_num.columns[:3]:
    box = px.box(data_num, x=col)
    box.show()

Looking at the "Age" boxplot, there are outliers for ages greater than 35. Let's remove those rows.
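
The cutoff at 35 is consistent with the common 1.5 × IQR rule for boxplot whiskers. As a rough check, the sketch below applies that rule to the "Age" quartiles reported by `data.describe()` earlier. Plotly may compute quartiles slightly differently, so treat this as an approximation rather than the exact whisker position.

```python
# Quartiles of "Age" as reported by data.describe() earlier in this notebook.
q1 = 19.947192
q3 = 26.0

iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr  # values below this are flagged as outliers

print(f"IQR: {iqr:.3f}")
print(f"Upper fence: {upper_fence:.2f}")
print(f"Lower fence: {lower_fence:.2f}")
```

The upper fence lands just above 35, and the lower fence is below the minimum age of 14, so only the upper tail is affected.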

In [None]:
data = data[data['Age'] < 36]
data.head()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,Water,family_history_with_overweight,Physical_Activity,Use_Devices,Food_between_Meals,Mode_Transportation,NObeyesdad
0,21,Female,1.62,64,no,no,2,3,no,no,2,yes,0,1,Sometimes,Public_Transportation,Normal_Weight
1,21,Female,1.52,56,Sometimes,no,3,3,yes,yes,3,yes,3,0,Sometimes,Public_Transportation,Normal_Weight
2,23,Male,1.8,77,Frequently,no,2,3,no,no,2,yes,2,1,Sometimes,Public_Transportation,Normal_Weight
3,27,Male,1.8,87,Frequently,no,3,3,no,no,2,no,2,0,Sometimes,Walking,Overweight_Level_I
4,22,Male,1.78,90,Sometimes,no,2,1,no,no,2,no,0,0,Sometimes,Public_Transportation,Overweight_Level_II


Let's plot the rest of the numeric columns.

In [None]:
datos_num = data.select_dtypes(include="number")

In [None]:
for col in datos_num.columns[3:]:
    df_col = datos_num[col].value_counts()
    cnt_plot = px.bar(df_col,
                      x=df_col.index,
                      y=df_col.values,
                      color=df_col.values,
                      color_continuous_scale='plasma',
                      title=f'Bar Count {col}')
    cnt_plot.update_layout(yaxis_title='Count')
    cnt_plot.show()

The numeric values look good.
We will work with the categorical ones next.

In [None]:
data_cat = data.select_dtypes(include="object")
data_cat.head()

Unnamed: 0,Gender,Drink_Alcohol,High_Caloric_Food,Monitor_Calories,SMOKE,family_history_with_overweight,Food_between_Meals,Mode_Transportation,NObeyesdad
0,Female,no,no,no,no,yes,Sometimes,Public_Transportation,Normal_Weight
1,Female,Sometimes,no,yes,yes,yes,Sometimes,Public_Transportation,Normal_Weight
2,Male,Frequently,no,no,no,yes,Sometimes,Public_Transportation,Normal_Weight
3,Male,Frequently,no,no,no,no,Sometimes,Walking,Overweight_Level_I
4,Male,Sometimes,no,no,no,no,Sometimes,Public_Transportation,Overweight_Level_II


In [None]:
# Skip the last column (the NObeyesdad target); it is plotted separately.
for col in data_cat.columns[:-1]:
    df_col = data_cat[col].value_counts()
    cnt_plot = px.bar(df_col,
                      x=df_col.index,
                      y=df_col.values,
                      color=df_col.values,
                      color_continuous_scale='plasma',
                      title=f'Bar Count {col}')
    cnt_plot.update_layout(yaxis_title='Count')
    cnt_plot.show()

The categorical data looks good: each column contains only its expected, distinct category values. Finally, we will plot the "Target" column to make sure it also contains only the expected values and to check whether there are enough records for each obesity category.

In [None]:
df_obesity = data['NObeyesdad'].value_counts()
df_obesity

NObeyesdad
Obesity_Type_III       324
Obesity_Type_I         283
Normal_Weight          280
Overweight_Level_I     271
Insufficient_Weight    271
Obesity_Type_II        268
Overweight_Level_II    254
Name: count, dtype: int64
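
To put a number on "enough records for each category", the counts above can be turned into class shares and a simple imbalance ratio. This is only a sketch: the counts are copied by hand from the output above, and the ratio of largest to smallest class is one of several possible balance measures.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({
    "Obesity_Type_III": 324,
    "Obesity_Type_I": 283,
    "Normal_Weight": 280,
    "Overweight_Level_I": 271,
    "Insufficient_Weight": 271,
    "Obesity_Type_II": 268,
    "Overweight_Level_II": 254,
})

total = counts.sum()
shares = (counts / total).round(3)
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"Largest / smallest class: {imbalance_ratio:.2f}")
```

A ratio close to 1 suggests the classes are roughly balanced, in which case plain accuracy and macro-averaged metrics tell a similar story.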

In [None]:
bar_obesity = px.bar(df_obesity,
                 x=df_obesity.index,
                 y=df_obesity.values,
                 color=df_obesity.values,
                 color_continuous_scale='plasma',
                 title='Bar Count Obesity')
bar_obesity.update_layout(xaxis_title='Obesity',
                     yaxis_title='count',
                     )

bar_obesity.show()

The "Target" column contains only the expected categories, and there are enough records for each one in the dataset.

## 3. Preprocessing

Key points:

* What information is worth keeping?
* Which field(s) are the input variables?
* Which field is the output variable or label?

Let's check for NaN values and duplicates.

In [None]:
print(f"Are there any NaN values in the dataset? {data.isna().values.any()}")

Are there any NaN values in the dataset? False


In [None]:
print(f"Are there any duplicated values in the dataset? {data.duplicated().values.any()}")
duplicates = data.duplicated()
num_duplicates = duplicates.sum()
print(f"Number of duplicated rows: {num_duplicates}")

Are there any duplicated values in the dataset? True
Number of duplicated rows: 27


In [None]:
data[data.duplicated()].head()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,Water,family_history_with_overweight,Physical_Activity,Use_Devices,Food_between_Meals,Mode_Transportation,NObeyesdad
98,21,Female,1.52,42,Sometimes,no,3,1,no,no,1,no,0,0,Frequently,Public_Transportation,Insufficient_Weight
106,25,Female,1.57,55,Sometimes,yes,2,1,no,no,2,no,2,0,Sometimes,Public_Transportation,Normal_Weight
174,21,Male,1.62,70,Sometimes,yes,2,1,no,no,3,no,1,0,no,Public_Transportation,Overweight_Level_I
179,21,Male,1.62,70,Sometimes,yes,2,1,no,no,3,no,1,0,no,Public_Transportation,Overweight_Level_I
184,21,Male,1.62,70,Sometimes,yes,2,1,no,no,3,no,1,0,no,Public_Transportation,Overweight_Level_I


We will remove the duplicates since there are only a few of them. Without an identifier column we cannot be 100% sure these rows are true duplicates, but I think it is better to remove them when the entire row is repeated.
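
The sketch below illustrates the two pandas calls involved on a made-up frame: `duplicated(keep=False)` shows every member of a repeated group (useful for inspection), while `drop_duplicates(keep="first")` keeps one representative per group. The column names and rows are illustrative only.

```python
import pandas as pd

# Made-up rows: the second and third are exact copies of the first.
df = pd.DataFrame({
    "Age": [21, 21, 21, 25],
    "Weight": [70, 70, 70, 55],
    "label": ["Overweight_Level_I"] * 3 + ["Normal_Weight"],
})

# keep=False marks every member of a duplicated group, not only the repeats.
all_copies = df[df.duplicated(keep=False)]
print(len(all_copies))

# keep="first" retains one representative per group.
deduped = df.drop_duplicates(keep="first")
print(deduped.shape)
```

Note that `duplicated()` with its default `keep="first"` counts only the repeats, which is why it reports fewer rows than `keep=False`.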

In [None]:
data = data.drop_duplicates(keep='first')
data.shape

(1924, 17)

Let's create a correlation heatmap before training the model.

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
data_corr = data.copy()
encoder = LabelEncoder()
for col in data_corr.select_dtypes(include="object").columns:
    data_corr[col] = encoder.fit_transform(data_corr[col])

In [None]:
data_corr.corr()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,Water,family_history_with_overweight,Physical_Activity,Use_Devices,Food_between_Meals,Mode_Transportation,NObeyesdad
Age,1.0,0.139571,0.097961,0.321405,-0.088687,0.084002,0.058218,-0.050043,-0.142801,0.088995,0.008297,0.243622,-0.160159,-0.220008,0.123345,-0.444027,0.296267
Gender,0.139571,1.0,0.611967,0.133411,0.019929,0.073878,-0.276312,0.080032,-0.112015,0.051602,0.055404,0.126712,0.180619,-0.012643,0.080729,-0.206333,0.013501
Height,0.097961,0.611967,1.0,0.450842,-0.148652,0.205684,-0.073967,0.241639,-0.154178,0.06865,0.179607,0.269822,0.284991,0.012432,0.063113,-0.165666,0.030825
Weight,0.321405,0.133411,0.450842,1.0,-0.224045,0.281557,0.191577,0.09074,-0.208492,0.03499,0.200625,0.507582,-0.064938,-0.049493,0.304517,0.012564,0.413397
Drink_Alcohol,-0.088687,0.019929,-0.148652,-0.224045,1.0,-0.108518,-0.051254,-0.100401,-0.002157,-0.098513,-0.079623,0.023983,0.116249,0.037768,-0.052047,-0.014125,-0.174914
High_Caloric_Food,0.084002,0.073878,0.205684,0.281557,-0.108518,1.0,-0.015783,0.000706,-0.185459,-0.02714,0.012916,0.222378,-0.100664,0.075059,0.150665,-0.058876,0.039369
Eat_Vegetables,0.058218,-0.276312,-0.073967,0.191577,-0.051254,-0.015783,1.0,0.007734,0.069643,0.006438,0.069764,0.024305,0.002676,-0.099995,-0.034344,0.061022,0.016477
Num_Meals,-0.050043,0.080032,0.241639,0.09074,-0.100401,0.000706,0.007734,1.0,-0.027088,0.014981,0.096297,0.033736,0.157095,-0.014069,-0.070286,-0.089205,-0.08566
Monitor_Calories,-0.142801,-0.112015,-0.154178,-0.208492,-0.002157,-0.185459,0.069643,-0.027088,1.0,0.035837,-0.010052,-0.19885,0.061822,-0.023534,-0.107805,0.011528,-0.049635
SMOKE,0.088995,0.051602,0.06865,0.03499,-0.098513,-0.02714,0.006438,0.014981,0.035837,1.0,-0.052592,0.010647,0.017097,0.018414,-0.048507,-0.0131,-0.025054


In [None]:
heatmap = px.imshow(data_corr.corr(), text_auto=True, aspect="auto")
heatmap.show()

### 3.1 Feature selection

* We will remove columns that do not seem useful.
* We will select the columns we can use to make predictions.

After plotting and reviewing the data, I will keep all the columns. The data is clean enough at this point, so next we will encode the categorical columns.
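
The encoding in the following cells relies on `CategoricalDtype`: the position of each category in the declared list becomes its integer code. The sketch below shows that mapping on made-up answers, including what happens to a value that is not in the list. It also shows one-hot encoding via `pd.get_dummies` as a possible alternative for nominal columns with more than two categories. That alternative is a suggestion, not what this notebook does.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories: the position in the list becomes the integer code.
freq_type = CategoricalDtype(
    categories=["no", "Sometimes", "Frequently", "Always"], ordered=True
)

# "Often" is a deliberately invalid value to show what happens to typos.
answers = pd.Series(["no", "Always", "Sometimes", "Often"]).astype(freq_type)
codes = answers.cat.codes
print(codes.tolist())

# Values outside the declared categories become NaN and receive code -1,
# so a quick check for -1 catches unexpected labels.
print((codes == -1).sum())

# For a nominal column with more than two categories, one-hot encoding is an
# alternative that avoids implying an order between the categories.
transport = pd.Series(["Walking", "Bike", "Walking"])
one_hot = pd.get_dummies(transport, prefix="Mode")
print(one_hot.columns.tolist())
```

Checking for `-1` after encoding is a cheap safeguard, because a misspelled category would otherwise silently turn into a negative feature value.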

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
type_Gender = CategoricalDtype(categories=['Female', 'Male'], ordered=False)
type_Drink_Alcohol = CategoricalDtype(categories=['no', 'Sometimes', 'Frequently', 'Always'], ordered=True)
type_High_Caloric_Food = CategoricalDtype(categories=['no', 'yes'], ordered=False)
type_SMOKE = CategoricalDtype(categories=['no', 'yes'], ordered=False)
type_Monitor_Calories = CategoricalDtype(categories=['no', 'yes'], ordered=False)
type_history = CategoricalDtype(categories=['no', 'yes'], ordered=False)
type_Food_between_Meals = CategoricalDtype(categories=['no', 'Sometimes', 'Frequently', 'Always'], ordered=True)
type_Mode_Transportation = CategoricalDtype(categories=['Public_Transportation', 'Walking', 'Automobile', 'Motorbike','Bike'], ordered=False)

In [None]:
data['Gender'] = data['Gender'].astype(type_Gender)
data['Drink_Alcohol'] = data['Drink_Alcohol'].astype(type_Drink_Alcohol)
data['High_Caloric_Food'] = data['High_Caloric_Food'].astype(type_High_Caloric_Food)
data['SMOKE'] = data['SMOKE'].astype(type_SMOKE)
data['Monitor_Calories'] = data['Monitor_Calories'].astype(type_Monitor_Calories)
data['family_history_with_overweight'] = data['family_history_with_overweight'].astype(type_history)
data['Food_between_Meals'] = data['Food_between_Meals'].astype(type_Food_between_Meals)
data['Mode_Transportation'] = data['Mode_Transportation'].astype(type_Mode_Transportation)

In [None]:
data['Gender_codes'] = data['Gender'].cat.codes
data['Drink_Alcohol_codes'] = data['Drink_Alcohol'].cat.codes
data['High_Caloric_Food_codes'] = data['High_Caloric_Food'].cat.codes
data['SMOKE_codes'] = data['SMOKE'].cat.codes
data['Monitor_Calories_codes'] = data['Monitor_Calories'].cat.codes
data['family_history_with_overweight_codes'] = data['family_history_with_overweight'].cat.codes
data['Food_between_Meals_codes'] = data['Food_between_Meals'].cat.codes
data['Mode_Transportation_codes'] = data['Mode_Transportation'].cat.codes

In [None]:
data.head()

Unnamed: 0,Age,Gender,Height,Weight,Drink_Alcohol,High_Caloric_Food,Eat_Vegetables,Num_Meals,Monitor_Calories,SMOKE,...,Mode_Transportation,NObeyesdad,Gender_codes,Drink_Alcohol_codes,High_Caloric_Food_codes,SMOKE_codes,Monitor_Calories_codes,family_history_with_overweight_codes,Food_between_Meals_codes,Mode_Transportation_codes
0,21,Female,1.62,64,no,no,2,3,no,no,...,Public_Transportation,Normal_Weight,0,0,0,0,0,1,1,0
1,21,Female,1.52,56,Sometimes,no,3,3,yes,yes,...,Public_Transportation,Normal_Weight,0,1,0,1,1,1,1,0
2,23,Male,1.8,77,Frequently,no,2,3,no,no,...,Public_Transportation,Normal_Weight,1,2,0,0,0,1,1,0
3,27,Male,1.8,87,Frequently,no,3,3,no,no,...,Walking,Overweight_Level_I,1,2,0,0,0,0,1,1
4,22,Male,1.78,90,Sometimes,no,2,1,no,no,...,Public_Transportation,Overweight_Level_II,1,1,0,0,0,0,1,0


### 3.2 Arranging columns

* We will split the dataframe into the part used as input variables (X) and the labels (y).

In [None]:
data = data.drop(columns=['Gender','Drink_Alcohol','High_Caloric_Food','Monitor_Calories','SMOKE','family_history_with_overweight','Food_between_Meals','Mode_Transportation'])

In [None]:
df_X = data.drop(columns=['NObeyesdad'])
df_y = data['NObeyesdad']

In [None]:
df_X

Unnamed: 0,Age,Height,Weight,Eat_Vegetables,Num_Meals,Water,Physical_Activity,Use_Devices,Gender_codes,Drink_Alcohol_codes,High_Caloric_Food_codes,SMOKE_codes,Monitor_Calories_codes,family_history_with_overweight_codes,Food_between_Meals_codes,Mode_Transportation_codes
0,21,1.620000,64,2,3,2,0,1,0,0,0,0,0,1,1,0
1,21,1.520000,56,3,3,3,3,0,0,1,0,1,1,1,1,0
2,23,1.800000,77,2,3,2,2,1,1,2,0,0,0,1,1,0
3,27,1.800000,87,3,3,2,2,0,1,2,0,0,0,0,1,1
4,22,1.780000,90,2,1,2,0,0,1,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,21,1.710730,131,3,3,2,2,1,0,1,1,0,0,1,1,0
2107,22,1.748584,134,3,3,2,1,1,0,1,1,0,0,1,1,0
2108,23,1.752206,134,3,3,2,1,1,0,1,1,0,0,1,1,0
2109,24,1.739450,133,3,3,3,1,1,0,1,1,0,0,1,1,0


In [None]:
df_y

0             Normal_Weight
1             Normal_Weight
2             Normal_Weight
3        Overweight_Level_I
4       Overweight_Level_II
               ...         
2106       Obesity_Type_III
2107       Obesity_Type_III
2108       Obesity_Type_III
2109       Obesity_Type_III
2110       Obesity_Type_III
Name: NObeyesdad, Length: 1924, dtype: object

## 4. Model training

Key points:
* How will we split the dataset?
* Is normalization necessary? What kind of normalization?
* Which models will we try?

### 4.1 Creating the training and test datasets (train/test)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(df_X, df_y, test_size=0.30, random_state=101)

### 4.2 Normalizing (scaling) the training data

**IMPORTANT**: Fit the normalization on the training data, NOT on the test data.
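
As a minimal illustration of this rule, the sketch below fits a `MinMaxScaler` on a tiny made-up training column and then reuses it for the test column. The arrays are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up feature column.
X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[2.0], [7.0]])

# Learn min and max from the training split only.
scaler = MinMaxScaler().fit(X_train)

X_train_scaled = scaler.transform(X_train)
# The test split reuses the training statistics, so values outside the
# training range can fall outside [0, 1]. That is expected.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Fitting on the test data as well would let information about the test distribution leak into preprocessing and make the evaluation look better than it should.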

#### 4.2.1 Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

Let's try **MinMaxScaler** normalization.

In [None]:
minmax_scaler = MinMaxScaler().fit(df_X_train)

In [None]:
minmax_scaler.data_max_

array([ 35.  ,   1.98, 173.  ,   3.  ,   4.  ,   3.  ,   3.  ,   2.  ,
         1.  ,   3.  ,   1.  ,   1.  ,   1.  ,   1.  ,   3.  ,   4.  ])

In [None]:
minmax_scaler.data_min_

array([16.      ,  1.456346, 39.      ,  1.      ,  1.      ,  1.      ,
        0.      ,  0.      ,  0.      ,  0.      ,  0.      ,  0.      ,
        0.      ,  0.      ,  0.      ,  0.      ])

In [None]:
minmax_scaler.data_range_

array([ 19.      ,   0.523654, 134.      ,   2.      ,   3.      ,
         2.      ,   3.      ,   2.      ,   1.      ,   3.      ,
         1.      ,   1.      ,   1.      ,   1.      ,   3.      ,
         4.      ])

In [None]:
df_X_norm_train = minmax_scaler.transform(df_X_train)

In [None]:
df_scaled = pd.DataFrame(df_X_norm_train, columns=df_X_train.columns)
df_scaled

Unnamed: 0,Age,Height,Weight,Eat_Vegetables,Num_Meals,Water,Physical_Activity,Use_Devices,Gender_codes,Drink_Alcohol_codes,High_Caloric_Food_codes,SMOKE_codes,Monitor_Calories_codes,family_history_with_overweight_codes,Food_between_Meals_codes,Mode_Transportation_codes
0,0.368421,0.327625,0.320896,0.5,0.000000,0.5,0.000000,0.5,0.0,0.333333,1.0,0.0,0.0,1.0,0.333333,0.00
1,0.526316,0.818810,0.611940,0.5,0.666667,1.0,0.333333,0.5,1.0,0.333333,1.0,0.0,0.0,1.0,0.333333,0.00
2,0.105263,0.274330,0.156716,1.0,0.000000,0.0,0.000000,1.0,0.0,0.000000,1.0,0.0,1.0,1.0,0.333333,0.25
3,0.368421,0.789938,0.417910,0.0,0.666667,0.5,0.000000,1.0,1.0,0.666667,1.0,0.0,0.0,1.0,0.666667,0.00
4,0.315789,0.274330,0.320896,0.0,0.666667,0.5,0.666667,0.5,0.0,0.000000,1.0,0.0,0.0,1.0,0.333333,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,0.473684,0.636554,0.589552,0.0,0.666667,0.5,0.333333,0.0,1.0,0.333333,1.0,0.0,0.0,1.0,0.333333,0.00
1342,0.263158,0.443106,0.619403,1.0,0.666667,0.0,0.333333,0.5,0.0,0.333333,1.0,0.0,0.0,1.0,0.333333,0.00
1343,0.631579,0.643803,0.552239,0.5,0.666667,0.5,0.000000,0.0,1.0,0.333333,1.0,0.0,0.0,1.0,0.333333,0.00
1344,0.789474,0.865635,0.671642,0.5,0.666667,0.5,0.333333,0.5,1.0,0.333333,1.0,1.0,0.0,1.0,0.333333,0.00


In [None]:
total_boxplot = px.box(df_scaled)
total_boxplot.show()

Let's test with **StandardScaler**.

In [None]:
std_scaler = StandardScaler().fit(df_X_train)

In [None]:
std_scaler.mean_

array([2.30817236e+01, 1.70729133e+00, 8.70000000e+01, 2.42942051e+00,
       2.70133730e+00, 2.02451709e+00, 1.00371471e+00, 7.10252600e-01,
       5.14115899e-01, 7.22139673e-01, 8.87815750e-01, 2.15453195e-02,
       4.97771174e-02, 8.15007429e-01, 1.15378900e+00, 3.72956909e-01])

In [None]:
std_scaler.scale_

array([ 4.45641018,  0.09249402, 26.91152749,  0.59704258,  0.80578222,
        0.68424802,  0.88279806,  0.67704811,  0.4998007 ,  0.51144424,
        0.315593  ,  0.14519338,  0.21748415,  0.38829154,  0.47302077,
        0.78184053])

In [None]:
std_scaler.var_

array([1.98595917e+01, 8.55514313e-03, 7.24230312e+02, 3.56459842e-01,
       6.49284987e-01, 4.68195346e-01, 7.79332412e-01, 4.58394141e-01,
       2.49800741e-01, 2.61575214e-01, 9.95989438e-02, 2.10811187e-02,
       4.72993560e-02, 1.50770319e-01, 2.23748645e-01, 6.11274615e-01])

In [None]:
df_X_norm_train = std_scaler.transform(df_X_train)

In [None]:
df_scaled = pd.DataFrame(df_X_norm_train, columns=df_X_train.columns)
df_scaled

Unnamed: 0,Age,Height,Weight,Eat_Vegetables,Num_Meals,Water,Physical_Activity,Use_Devices,Gender_codes,Drink_Alcohol_codes,High_Caloric_Food_codes,SMOKE_codes,Monitor_Calories_codes,family_history_with_overweight_codes,Food_between_Meals_codes,Mode_Transportation_codes
0,-0.018338,-0.858254,-0.185794,-0.719246,-2.111411,-0.035831,-1.136970,0.427957,-1.028642,0.543286,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
1,0.654849,1.922586,1.263399,-0.719246,0.370649,1.425628,-0.004208,0.427957,0.972156,0.543286,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
2,-1.140318,-1.159981,-1.003288,0.955676,-2.111411,-1.497289,-1.136970,1.904957,-1.028642,-1.411962,0.355471,-0.148391,4.369159,0.476427,-0.325121,0.802009
3,-0.018338,1.759126,0.297270,-2.394168,0.370649,-0.035831,-1.136970,1.904957,0.972156,2.498533,0.355471,-0.148391,-0.228877,0.476427,1.788951,-0.477024
4,-0.242734,-1.159981,-0.185794,-2.394168,0.370649,-0.035831,1.128554,0.427957,-1.028642,-1.411962,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1341,0.430453,0.890746,1.151923,-2.394168,0.370649,-0.035831,-0.004208,-1.049043,0.972156,0.543286,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
1342,-0.467130,-0.204460,1.300558,0.955676,0.370649,-1.497289,-0.004208,0.427957,-1.028642,0.543286,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
1343,1.103641,0.931786,0.966129,-0.719246,0.370649,-0.035831,-1.136970,-1.049043,0.972156,0.543286,0.355471,-0.148391,-0.228877,0.476427,-0.325121,-0.477024
1344,1.776828,2.187684,1.560669,-0.719246,0.370649,-0.035831,-0.004208,0.427957,0.972156,0.543286,0.355471,6.738976,-0.228877,0.476427,-0.325121,-0.477024


In [None]:
total_boxplot = px.box(df_scaled)
total_boxplot.show()

Looking at both boxplots, standardization (StandardScaler) works better for this dataset.
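
Both scalers are simple per-column affine transforms, which can be checked by hand against the numbers printed above. The sketch below does this for "Age", assuming a raw age of 23, which is the value consistent with the first scaled training row under both scalers.

```python
# Statistics printed above for the "Age" column (first entry of each array).
age_min, age_range = 16.0, 19.0              # minmax_scaler.data_min_, data_range_
age_mean, age_std = 23.0817236, 4.45641018   # std_scaler.mean_, scale_

age = 23.0  # assumed raw age, consistent with the first scaled training row

# MinMaxScaler: (x - min) / (max - min), which maps training data to [0, 1].
age_minmax = (age - age_min) / age_range

# StandardScaler: (x - mean) / std, which gives zero mean and unit variance.
age_standard = (age - age_mean) / age_std

print(round(age_minmax, 6))
print(round(age_standard, 6))
```

The results agree with the "Age" entries in the first rows of the two scaled tables, which confirms how each scaler uses its fitted statistics.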

### 4.3 Training baseline models

In [None]:
#from sklearn.linear_model import LogisticRegression
#from sklearn.tree import DecisionTreeClassifier
#from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

In [None]:
#model = LogisticRegression()
#model = DecisionTreeClassifier(max_depth=4)
#model = SVC(kernel='rbf', C=1) # linear, poly, sigmoid, rbf
model = MLPClassifier(activation='relu', hidden_layer_sizes=[10,10,10], max_iter=700)  # activation options: identity, logistic, tanh, relu

In [None]:
model.fit(df_X_norm_train, df_y_train)

## 5. Model evaluation

Key points:
* Which metrics should we use?
* What performance do we need to reach?
* Which metric is the priority?

### 5.1 Evaluation on the training data

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

In [None]:
df_y_train_pred = model.predict(df_X_norm_train)

In [None]:
print(confusion_matrix(df_y_train, df_y_train_pred))

[[192   0   0   0   0   0   0]
 [  0 189   0   0   0   0   0]
 [  0   0 199   0   0   0   0]
 [  0   0   0 194   0   0   0]
 [  0   0   0   0 224   0   0]
 [  0   0   0   0   0 172   0]
 [  0   0   0   0   0   0 176]]


In [None]:
print(classification_report(df_y_train, df_y_train_pred))

                     precision    recall  f1-score   support

Insufficient_Weight       1.00      1.00      1.00       192
      Normal_Weight       1.00      1.00      1.00       189
     Obesity_Type_I       1.00      1.00      1.00       199
    Obesity_Type_II       1.00      1.00      1.00       194
   Obesity_Type_III       1.00      1.00      1.00       224
 Overweight_Level_I       1.00      1.00      1.00       172
Overweight_Level_II       1.00      1.00      1.00       176

           accuracy                           1.00      1346
          macro avg       1.00      1.00      1.00      1346
       weighted avg       1.00      1.00      1.00      1346



### 5.2 Evaluation on the test data

In [None]:
df_X_norm_test = std_scaler.transform(df_X_test)

In [None]:
df_y_test_pred = model.predict(df_X_norm_test)

In [None]:
print(confusion_matrix(df_y_test, df_y_test_pred))

[[ 72   1   0   0   0   0   0]
 [  8  73   0   0   0   4   1]
 [  0   0  83   0   0   0   1]
 [  0   0   1  73   0   0   0]
 [  0   0   0   0 100   0   0]
 [  0   4   0   0   0  79   2]
 [  0   0   0   0   0   5  71]]


In [None]:
print(classification_report(df_y_test, df_y_test_pred))

                     precision    recall  f1-score   support

Insufficient_Weight       0.90      0.99      0.94        73
      Normal_Weight       0.94      0.85      0.89        86
     Obesity_Type_I       0.99      0.99      0.99        84
    Obesity_Type_II       1.00      0.99      0.99        74
   Obesity_Type_III       1.00      1.00      1.00       100
 Overweight_Level_I       0.90      0.93      0.91        85
Overweight_Level_II       0.95      0.93      0.94        76

           accuracy                           0.95       578
          macro avg       0.95      0.95      0.95       578
       weighted avg       0.95      0.95      0.95       578



In [None]:
pr = precision_score(df_y_test, df_y_test_pred, average='macro')
re = recall_score(df_y_test, df_y_test_pred, average='macro')
acc = accuracy_score(df_y_test, df_y_test_pred)
f1 = f1_score(df_y_test, df_y_test_pred, average='macro')
print("Test Precision: ", pr)
print("Test Recall: ", re)
print("Test F1: ", f1)
print("Test Accuracy: ", acc)

Test Precision:  0.9526266590552304
Test Recall:  0.9533346563955336
Test F1:  0.9523435769593288
Test Accuracy:  0.9532871972318339


We don't have a specific obesity category to focus the metrics on. The model classifies most of the categories correctly, and we used macro averaging for the metrics since "weighted avg" and "macro avg" both show a precision of 95%.
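
To make the difference between the two averages concrete, the sketch below recomputes precision directly from the test confusion matrix shown above (rows are true classes, columns are predicted classes). Macro averaging weighs every class equally, while weighted averaging weighs each class by its support.

```python
import numpy as np

# Test confusion matrix copied from above (rows: true class, columns: predicted).
cm = np.array([
    [72, 1, 0, 0, 0, 0, 0],
    [8, 73, 0, 0, 0, 4, 1],
    [0, 0, 83, 0, 0, 0, 1],
    [0, 0, 1, 73, 0, 0, 0],
    [0, 0, 0, 0, 100, 0, 0],
    [0, 4, 0, 0, 0, 79, 2],
    [0, 0, 0, 0, 0, 5, 71],
])

true_positives = np.diag(cm)
support = cm.sum(axis=1)           # samples per true class
predicted_totals = cm.sum(axis=0)  # samples per predicted class

precision_per_class = true_positives / predicted_totals

macro_precision = precision_per_class.mean()  # every class counts equally
weighted_precision = np.average(precision_per_class, weights=support)
accuracy = true_positives.sum() / cm.sum()

print(f"macro precision:    {macro_precision:.4f}")
print(f"weighted precision: {weighted_precision:.4f}")
print(f"accuracy:           {accuracy:.4f}")
```

Because the class supports are similar here, the two averages end up very close and both round to 0.95. With strongly imbalanced classes they could diverge noticeably.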

### 5.3 Cross-validation

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

In [None]:
precision_scorer = make_scorer(precision_score, average='macro')

In [None]:
# std_scaler was fitted on the training split; here it scales the full dataset.
df_X_norm = std_scaler.transform(df_X)

In [None]:
results = cross_val_score(model, df_X_norm, df_y, scoring=precision_scorer, cv=5)
results


Stochastic Optimizer: Maximum iterations (700) reached and the optimization hasn't converged yet.



array([0.75678104, 0.95934773, 0.98229389, 0.9742397 , 0.98100466])
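
Two caveats apply to this result. First, the scaler was fitted once, outside cross-validation, so each fold's validation rows were scaled with statistics that partly came from those same rows. Second, the folds were taken without shuffling. If the rows of the file are ordered in some way, that might explain why the first fold scores lower than the rest, although this is only a hypothesis. The sketch below shows one way to address both points with a scikit-learn pipeline and a shuffled `StratifiedKFold`. It runs on synthetic data from `make_classification` as a stand-in for `df_X` and `df_y`, so its scores say nothing about the obesity dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_X / df_y so the sketch runs on its own.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# The pipeline refits the scaler on the training part of every fold,
# so the held-out part never influences the scaling statistics.
pipeline = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=700, random_state=0),
)

# Shuffling guards against any ordering in the rows of the dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(pipeline, X, y, scoring="precision_macro", cv=cv)
print(scores.mean())
```

Passing the unscaled features together with the pipeline keeps preprocessing inside each fold. The convergence warning seen above could additionally be addressed by raising `max_iter`, at the cost of longer training.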