# Content


**1. Introduction**

**2. A brief explanation of the dataset**

**3. Data Dictionary**

**4. Data Analysis**


4.1 Import Libraries

4.2 Import Dataset

4.3 Data Description

4.4 Visualization

4.5 Classifier Models (LogisticRegression, KNN, Decision Tree, Random Forest, SVM)


## A brief explanation of the dataset


# Dataset Information

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. It contains 17 attributes and 2111 records. Each record is labeled with the class variable NObeyesdad (obesity level), which takes the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter; the remaining 23% was collected directly from users through a web platform.
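
As a quick sanity check on the proportions quoted above, the split can be turned into approximate record counts. This is only a back-of-the-envelope sketch: the 77% / 23% figures are rounded in the dataset description, so the counts below are estimates rather than exact values.

```python
# Approximate record counts implied by the stated 77% / 23% split.
TOTAL_RECORDS = 2111

synthetic = round(TOTAL_RECORDS * 0.77)   # generated with Weka + SMOTE
collected = TOTAL_RECORDS - synthetic     # gathered through the web platform

print(f"synthetic ~ {synthetic}, collected ~ {collected}")
```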


# Dictionary

**family_history_with_overweight:** Feature, Binary, " Has a family member suffered or suffers from overweight? "

**FAVC :** Feature, Binary, " Do you eat high caloric food frequently? "

**FCVC :** Feature, Integer, " Do you usually eat vegetables in your meals? "

**NCP :** Feature, Continuous, " How many main meals do you have daily? "

**CAEC :** Feature, Categorical, " Do you eat any food between meals? "

**SMOKE :** Feature, Binary, " Do you smoke? "

**CH2O:** Feature, Continuous, " How much water do you drink daily? "

**SCC:** Feature, Binary, " Do you monitor the calories you eat daily? "

**FAF:** Feature, Continuous, " How often do you have physical activity? "

**TUE :** Feature, Integer, " How much time do you use technological devices such as cell phone, videogames, television, computer and others? "

**CALC :** Feature, Categorical, " How often do you drink alcohol? "

**MTRANS :** Feature, Categorical, " Which transportation do you usually use? "

**NObeyesdad :** Target, Categorical, "Obesity level"


# Table 1: Questions of the survey

| **Questions**                                                                                                   | **Possible Answers**                                                 | **Type**    |
| --------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- | ----------- |
| What is your gender?                                                                                            | Female, Male                                                         | Categorical |
| What is your age?                                                                                               | From 14 years old to 61 or more                                      | Continuous  |
| What is your height?                                                                                            | From 1.45 Mts to 1.98 Mts                                            | Continuous  |
| What is your weight?                                                                                            | From 39 Kg to 173 Kg                                                 | Continuous  |
| Has a family member suffered or suffers from overweight?                                                        | yes, no                                                              | Binary      |
| Do you eat high caloric food frequently?                                                                        | yes, no                                                              | Binary      |
| Do you usually eat vegetables in your meals?                                                                    | 1 = Never, 2 = Sometimes, 3 = Always                                 | Integer     |
| How many main meals do you have daily?                                                                          | Between 1 and 4                                                      | Continuous  |
| Do you eat any food between meals?                                                                              | no, Sometimes, Frequently, Always                                    | Categorical |
| Do you smoke?                                                                                                   | yes, no                                                              | Binary      |
| How much water do you drink daily?                                                                              | 1 = Less than a liter, 2 = Between 1 and 2 L, 3 = More than 2 L      | Continuous  |
| Do you monitor the calories you eat daily?                                                                      | yes, no                                                              | Binary      |
| How often do you have physical activity?                                                                        | 0 = I do not have, 1 = 1 or 2 days, 2 = 2 or 4 days, 3 = 4 or 5 days | Continuous  |
| How much time do you use technological devices such as cell phone, videogames, television, computer and others? | 0 = 0–2 hours, 1 = 3–5 hours, 2 = More than 5 hours                  | Integer     |
| How often do you drink alcohol?                                                                                 | I do not drink, Sometimes, Frequently, Always                        | Categorical |
| Which transportation do you usually use?                                                                        | Automobile, Motorbike, Bike, Public Transportation, Walking          | Categorical |
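
The numeric answer codes in Table 1 can be mapped back to their labels when reading plots or summaries. The sketch below is illustrative only: `ANSWER_LABELS` and `decode_answer` are hypothetical names, not part of the dataset or of `ucimlrepo`, and the rounding step assumes that synthetically generated rows may carry non-integer values in these columns.

```python
# Hypothetical lookup built from Table 1; the names are illustrative.
ANSWER_LABELS = {
    "FCVC": {1: "Never", 2: "Sometimes", 3: "Always"},
    "CH2O": {1: "Less than a liter", 2: "Between 1 and 2 L", 3: "More than 2 L"},
    "TUE": {0: "0-2 hours", 1: "3-5 hours", 2: "More than 5 hours"},
}


def decode_answer(column, value):
    """Return the survey label for a coded answer, rounding to the nearest code."""
    code = int(round(value))
    return ANSWER_LABELS[column].get(code, "unknown")


print(decode_answer("FCVC", 2.0))
print(decode_answer("CH2O", 2.7))
```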


# Libraries


In [None]:
from ucimlrepo import fetch_ucirepo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# fetch dataset
dataset = fetch_ucirepo(
    id=544)

# data (as pandas dataframes)
X = dataset.data.features
y = dataset.data.targets

# metadata
print(dataset.metadata)


df = pd.concat([X, y], axis=1)
df

In [None]:
df.shape

The dataset has 2111 samples, with 16 features and one target (NObeyesdad).


In [None]:
categorical_features = ['Gender', 'CALC', 'FAVC', 'SCC',
                        'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS']


continuous_features = ['Age', 'Height', 'Weight',
                       'FCVC', "NCP", 'CH2O', 'FAF', 'TUE']

Pie chart of the target


In [None]:
target_count = df['NObeyesdad'].value_counts()
target_unique = df['NObeyesdad'].unique()

In [None]:
target_unique

In [None]:
import plotly.express as px
# Take names and values from the same Series so each label matches its count
fig = px.pie(values=target_count.values, names=target_count.index, color_discrete_sequence=px.colors.qualitative.Pastel1,
             title="Number of people in each obesity level")


fig.show()
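
One detail worth checking when building a pie chart from counts: `value_counts()` orders classes by frequency, while `unique()` orders them by first appearance, so the two are not guaranteed to line up. A minimal sketch on made-up labels (the toy Series below is not the real target column):

```python
import pandas as pd

# Toy labels, invented for illustration only.
labels = pd.Series(["Normal", "Obese", "Obese", "Obese", "Over", "Over"])

counts = labels.value_counts()   # ordered by frequency, descending
first_seen = labels.unique()     # ordered by first appearance

print(list(counts.index))
print(list(first_seen))

# Taking names and values from the same Series keeps them aligned.
names, values = counts.index.tolist(), counts.tolist()
```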

# Creating new dataframes that group obesity levels of the same type


In [None]:
# obesity type I,II,III
df_ot = df[df["NObeyesdad"] == 'Obesity_Type_I']


df_ot2 = df[df["NObeyesdad"] == 'Obesity_Type_II']


df_ot3 = df[df["NObeyesdad"] == 'Obesity_Type_III']

In [None]:
# data frame of Obesity_Type I, II, III
df_ot_final = pd.concat([df_ot, df_ot2, df_ot3])
df_ot_final.reset_index(drop=True, inplace=True)

In [None]:
# over weight type I,II
df_ow = df[df["NObeyesdad"] == 'Overweight_Level_I']

df_ow2 = df[df["NObeyesdad"] == 'Overweight_Level_II']

In [None]:
# data frame of Over_weight_Type I, II
df_ow_final = pd.concat([df_ow, df_ow2])
df_ow_final.reset_index(drop=True, inplace=True)

In [None]:
# Normal Weight
df_n = df[df["NObeyesdad"] == 'Normal_Weight']

In [None]:
# Insufficient Weight
df_In = df[df["NObeyesdad"] == 'Insufficient_Weight']

In [None]:
# 4 different dataframes
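
The four groups built above can also be derived from a single mapping, which keeps the class-to-group assignment in one place. The following is a sketch under assumptions: `GROUP_OF` and the `group` column are hypothetical names, and the tiny frame stands in for `df`.

```python
import pandas as pd

# Hypothetical mapping from each class label to its broader group.
GROUP_OF = {
    "Obesity_Type_I": "obesity",
    "Obesity_Type_II": "obesity",
    "Obesity_Type_III": "obesity",
    "Overweight_Level_I": "overweight",
    "Overweight_Level_II": "overweight",
    "Normal_Weight": "normal",
    "Insufficient_Weight": "insufficient",
}

# Stand-in for df; only the target column matters here.
toy = pd.DataFrame({"NObeyesdad": ["Obesity_Type_I", "Normal_Weight",
                                   "Overweight_Level_II", "Obesity_Type_III"]})

toy["group"] = toy["NObeyesdad"].map(GROUP_OF)

# One DataFrame per group, each with a fresh index.
frames = {name: part.reset_index(drop=True) for name, part in toy.groupby("group")}
print({name: len(part) for name, part in frames.items()})
```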

# categorical features


In [None]:
categorical_features = ['Gender', 'CALC', 'FAVC', 'SCC',
                        'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS']

# Gender


Distribution of gender across different obesity levels:

- Gender can be a significant factor in obesity patterns
- Understanding gender distribution helps identify potential biases in the dataset
- It provides insights into whether certain obesity levels are more prevalent in specific genders
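
The stacked histograms below show counts; a normalized cross-tabulation expresses the same comparison as proportions. A small sketch with invented rows (the real `df` would be passed in the same way):

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "NObeyesdad": ["Obesity_Type_III", "Obesity_Type_III", "Obesity_Type_II",
                   "Obesity_Type_II", "Normal_Weight", "Normal_Weight"],
})

# Share of each gender within every obesity level (each row sums to 1).
share = pd.crosstab(toy["NObeyesdad"], toy["Gender"], normalize="index")
print(share)
```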


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]


fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)


for i in range(2):

    sns.histplot(data=data_list[i], x='Gender', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs Gender')

    sns.histplot(data=data_list[i+2], x='Gender', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs Gender')


fig.suptitle('Obesity_levels vs Gender')
plt.tight_layout()
plt.show()
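
The loop above fills the 2x2 grid column by column: on each pass, `data_list[i]` goes to the left column and `data_list[i+2]` to the right. The pure-Python sketch below only traces that index arithmetic, so it is easy to confirm which group lands in which panel.

```python
data_name = ["obesity_type", "over_weight_type", "normal", "Insufficient_Weight"]

# Reproduce the (row, col) placement used by the plotting loop.
layout = {}
for i in range(2):
    layout[(i, 0)] = data_name[i]        # left column
    layout[(i, 1)] = data_name[i + 2]    # right column

for (row, col), name in sorted(layout.items()):
    print(f"axes[{row}, {col}] -> {name}")
```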

# CALC


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='CALC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs CALC')

    sns.histplot(data=data_list[i+2], x='CALC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs CALC')

fig.suptitle('Obesity_levels vs CALC')
plt.tight_layout()
plt.show()

# FAVC


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='FAVC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs FAVC')

    sns.histplot(data=data_list[i+2], x='FAVC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs FAVC')

fig.suptitle('Obesity_levels vs FAVC')
plt.tight_layout()
plt.show()

# SCC


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='SCC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs SCC')

    sns.histplot(data=data_list[i+2], x='SCC', hue='NObeyesdad',
                 palette="turbo", ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs SCC')

fig.suptitle('Obesity_levels vs SCC')
plt.tight_layout()
plt.show()

# SMOKE


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='SMOKE', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs SMOKE')

    sns.histplot(data=data_list[i+2], x='SMOKE', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs SMOKE')

fig.suptitle('Obesity_levels vs SMOKE')
plt.tight_layout()
plt.show()

# family_history_with_overweight


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='family_history_with_overweight',
                 hue='NObeyesdad', palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs family_history_with_overweight')

    sns.histplot(data=data_list[i+2], x='family_history_with_overweight',
                 hue='NObeyesdad', palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs family_history_with_overweight')

fig.suptitle('Obesity_levels vs family_history_with_overweight')
plt.tight_layout()
plt.show()

# CAEC


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(10, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='CAEC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs CAEC')

    sns.histplot(data=data_list[i+2], x='CAEC', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs CAEC')

fig.suptitle('Obesity_levels vs CAEC')
plt.tight_layout()
plt.show()

# MTRANS


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]

data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(figsize=(15, 8), nrows=2, ncols=2)

for i in range(2):

    sns.histplot(data=data_list[i], x='MTRANS', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 0], multiple='stack')
    axes[i, 0].set_title(f'{data_name[i]} vs MTRANS')

    sns.histplot(data=data_list[i+2], x='MTRANS', hue='NObeyesdad',
                 palette='turbo', ax=axes[i, 1], multiple='stack')
    axes[i, 1].set_title(f'{data_name[i+2]} vs MTRANS')

fig.suptitle('Obesity_levels vs MTRANS')
plt.tight_layout()
plt.show()

# sunburst chart (categorical features)


# Gender


In [None]:
from plotly.subplots import make_subplots
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],


                    subplot_titles=["Obesity_Types vs Gender", "Over_weight vs Gender",
                                    "Insufficient_Weight vs Gender", "Normal_weight vs Gender"],


                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'Gender'], color='Gender',


                     color_discrete_sequence=px.colors.qualitative.Pastel1)


fig.add_trace(figaux.data[0], row=1, col=1)

############################################


figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'Gender'], color='Gender',


                     color_discrete_sequence=px.colors.qualitative.Pastel1)


fig.add_trace(figaux.data[0], row=1, col=2)

#############################################


figaux = px.sunburst(df_In, path=['NObeyesdad', 'Gender'], color='Gender',


                     color_discrete_sequence=px.colors.qualitative.Pastel1)


fig.add_trace(figaux.data[0], row=2, col=1)

#############################################


figaux = px.sunburst(df_n, path=['NObeyesdad', 'Gender'], color='Gender',


                     color_discrete_sequence=px.colors.qualitative.Pastel1)


fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800,
                  title_text=("Obesity levels vs Gender"))


fig.show()
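
A sunburst built with `path=['NObeyesdad', 'Gender']` and no `values` argument sizes each sector by the number of rows that share that path. The same numbers can be read off with a plain `groupby`, sketched here on invented rows:

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "NObeyesdad": ["Obesity_Type_I", "Obesity_Type_I", "Obesity_Type_I", "Obesity_Type_II"],
    "Gender": ["Male", "Male", "Female", "Male"],
})

# Row counts per (class, gender) pair: the outer ring of the sunburst.
outer = toy.groupby(["NObeyesdad", "Gender"]).size()
# Row counts per class: the inner ring.
inner = toy.groupby("NObeyesdad").size()

print(outer)
print(inner)
```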

# CALC


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs CALC", "Over_weight vs CALC",
                                    "Insufficient_Weight vs CALC", "Normal_weight vs CALC"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'CALC'], color='CALC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'CALC'], color='CALC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'CALC'], color='CALC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'CALC'], color='CALC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800, title_text=("Obesity levels vs CALC"))
fig.show()

# FAVC


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs FAVC", "Over_weight vs FAVC",
                                    "Insufficient_Weight vs FAVC", "Normal_weight vs FAVC"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'FAVC'], color='FAVC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'FAVC'], color='FAVC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'FAVC'], color='FAVC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'FAVC'], color='FAVC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800, title_text=("Obesity levels vs FAVC"))
fig.show()

# SCC


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs SCC", "Over_weight vs SCC",
                                    "Insufficient_Weight vs SCC", "Normal_weight vs SCC"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'SCC'], color='SCC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'SCC'], color='SCC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'SCC'], color='SCC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'SCC'], color='SCC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800, title_text=("Obesity levels vs SCC"))
fig.show()

# SMOKE


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs SMOKE", "Over_weight vs SMOKE",
                                    "Insufficient_Weight vs SMOKE", "Normal_weight vs SMOKE"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'SMOKE'], color='SMOKE',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'SMOKE'], color='SMOKE',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'SMOKE'], color='SMOKE',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'SMOKE'], color='SMOKE',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800,
                  title_text=("Obesity levels vs SMOKE"))
fig.show()

# CAEC


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs CAEC", "Over_weight vs CAEC",
                                    "Insufficient_Weight vs CAEC", "Normal_weight vs CAEC"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'CAEC'], color='CAEC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'CAEC'], color='CAEC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'CAEC'], color='CAEC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'CAEC'], color='CAEC',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800, title_text=("Obesity levels vs CAEC"))
fig.show()

# MTRANS


In [None]:
fig = make_subplots(rows=2, cols=2, specs=[[{"type": "sunburst"}, {"type": "sunburst"}], [{"type": "sunburst"}, {"type": "sunburst"}]],
                    subplot_titles=["Obesity_Types vs MTRANS", "Over_weight vs MTRANS",
                                    "Insufficient_Weight vs MTRANS", "Normal_weight vs MTRANS"],
                    vertical_spacing=0.1)


figaux = px.sunburst(df_ot_final, path=['NObeyesdad', 'MTRANS'], color='MTRANS',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=1)

############################################

figaux = px.sunburst(df_ow_final, path=['NObeyesdad', 'MTRANS'], color='MTRANS',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=1, col=2)

#############################################

figaux = px.sunburst(df_In, path=['NObeyesdad', 'MTRANS'], color='MTRANS',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=1)

#############################################

figaux = px.sunburst(df_n, path=['NObeyesdad', 'MTRANS'], color='MTRANS',
                     color_discrete_sequence=px.colors.qualitative.Pastel1)
fig.add_trace(figaux.data[0], row=2, col=2)


fig.update_layout(width=700, height=800,
                  title_text=("Obesity levels vs MTRANS"))
fig.show()

# Continuous Features


In [None]:
continuous_features = ['Age', 'Height', 'Weight',
                       'FCVC', "NCP", 'CH2O', 'FAF', 'TUE']

# KDE plot


# Age


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="Age", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs Age')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="Age", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs Age')


fig.suptitle('Obesity_levels vs Age')
plt.tight_layout()
plt.show()

1. Obesity Type III is concentrated between roughly 18 and 30 years old.
2. Most people of normal weight are young, and this group shrinks with age.
3. Overweight begins to appear in young adults and can continue into middle age.
4. Underweight is more common in adolescents and young adults.
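
Observations like these, read from the density curves, can be backed with per-class summary statistics. A sketch with made-up ages (not values from the dataset):

```python
import pandas as pd

# Made-up ages, for illustration only.
toy = pd.DataFrame({
    "NObeyesdad": ["Obesity_Type_III"] * 3 + ["Insufficient_Weight"] * 3,
    "Age": [21.0, 25.0, 26.0, 17.0, 19.0, 21.0],
})

# Median and interquartile bounds of Age within each class.
summary = toy.groupby("NObeyesdad")["Age"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```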


# Height


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="Height", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs Height')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="Height", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs Height')


fig.suptitle('Obesity_levels vs Height')
plt.tight_layout()
plt.show()

3. Being overweight is not directly associated with a specific height, although it tends to be concentrated among people of average height.
4. Underweight is not determined by height.


# Weight


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="Weight", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs Weight')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="Weight", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs Weight')


fig.suptitle('Obesity_levels vs Weight')
plt.tight_layout()
plt.show()

# FCVC


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="FCVC", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs FCVC')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="FCVC", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs FCVC')


fig.suptitle('Obesity_levels vs FCVC')
plt.tight_layout()
plt.show()

# NCP


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="NCP", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs NCP')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="NCP", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs NCP')


fig.suptitle('Obesity_levels vs NCP')
plt.tight_layout()
plt.show()

# CH2O


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="CH2O", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs CH2O')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="CH2O", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs CH2O')


fig.suptitle('Obesity_levels vs CH2O')
plt.tight_layout()
plt.show()

# FAF


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="FAF", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs FAF')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="FAF", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs FAF')


fig.suptitle('Obesity_levels vs FAF')
plt.tight_layout()
plt.show()

# TUE


In [None]:
data_list = [df_ot_final, df_ow_final, df_n, df_In]
data_name = ["obesity_type", "over_weight_type",
             "normal", "Insufficient_Weight"]

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8))

for i in range(2):

    sns.kdeplot(ax=axes[i, 0], data=data_list[i],
                x="TUE", hue="NObeyesdad", fill=True)
    axes[i, 0].set_title(f'{data_name[i]} vs TUE')

    sns.kdeplot(ax=axes[i, 1], data=data_list[i+2],
                x="TUE", hue="NObeyesdad", fill=True)
    axes[i, 1].set_title(f'{data_name[i+2]} vs TUE')


fig.suptitle('Obesity_levels vs TUE')
plt.tight_layout()
plt.show()

In [None]:
######################################## manual feature encoding ####################################################

In [None]:
df1 = df.copy()

In [None]:
df1.loc[df1['NObeyesdad'] == 'Normal_Weight', 'NObeyesdad'] = 2
df1.loc[df1['NObeyesdad'] == 'Overweight_Level_I', 'NObeyesdad'] = 3
df1.loc[df1['NObeyesdad'] == 'Overweight_Level_II', 'NObeyesdad'] = 4
df1.loc[df1['NObeyesdad'] == 'Obesity_Type_I', 'NObeyesdad'] = 5
df1.loc[df1['NObeyesdad'] == 'Insufficient_Weight', 'NObeyesdad'] = 6
df1.loc[df1['NObeyesdad'] == 'Obesity_Type_II', 'NObeyesdad'] = 7
df1.loc[df1['NObeyesdad'] == 'Obesity_Type_III', 'NObeyesdad'] = 8

###################### data to number #################

# Gender

df1.loc[df1['Gender'] == 'Female', 'Gender'] = 2
df1.loc[df1['Gender'] == 'Male', 'Gender'] = 3

# family_history_with_overweight

df1.loc[df1['family_history_with_overweight'] ==
        'no', 'family_history_with_overweight'] = 2
df1.loc[df1['family_history_with_overweight'] ==
        'yes', 'family_history_with_overweight'] = 3

# FAVC

df1.loc[df1['FAVC'] == 'no', 'FAVC'] = 2
df1.loc[df1['FAVC'] == 'yes', 'FAVC'] = 3

# CAEC

df1.loc[df1['CAEC'] == 'no', 'CAEC'] = 2
df1.loc[df1['CAEC'] == 'Sometimes', 'CAEC'] = 3
df1.loc[df1['CAEC'] == 'Frequently', 'CAEC'] = 4
df1.loc[df1['CAEC'] == 'Always', 'CAEC'] = 5

# SMOKE

df1.loc[df1['SMOKE'] == 'no', 'SMOKE'] = 2
df1.loc[df1['SMOKE'] == 'yes', 'SMOKE'] = 3

# SCC

df1.loc[df1['SCC'] == 'no', 'SCC'] = 2
df1.loc[df1['SCC'] == 'yes', 'SCC'] = 3

# CALC

df1.loc[df1['CALC'] == 'no', 'CALC'] = 2
df1.loc[df1['CALC'] == 'Sometimes', 'CALC'] = 3
df1.loc[df1['CALC'] == 'Frequently', 'CALC'] = 4
df1.loc[df1['CALC'] == 'Always', 'CALC'] = 5

# MTRANS

df1.loc[df1['MTRANS'] == 'Automobile', 'MTRANS'] = 2
df1.loc[df1['MTRANS'] == 'Motorbike', 'MTRANS'] = 3
df1.loc[df1['MTRANS'] == 'Bike', 'MTRANS'] = 4
df1.loc[df1['MTRANS'] == 'Public_Transportation', 'MTRANS'] = 5
df1.loc[df1['MTRANS'] == 'Walking', 'MTRANS'] = 6

#########################################################

df1 = df1.astype('float64')
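
The same label-to-number conversion can be written as dictionary lookups with `Series.map`. This is an alternative sketch, not the encoding used above: `TARGET_ORDER` is a hypothetical name, and it ranks the classes from lightest to heaviest, which is an assumption that changes the codes (and therefore any correlation computed from them).

```python
import pandas as pd

# Hypothetical ordinal ranking, lightest to heaviest.
TARGET_ORDER = ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                "Obesity_Type_III"]
target_codes = {label: rank for rank, label in enumerate(TARGET_ORDER)}
yes_no = {"no": 0, "yes": 1}

# Invented rows standing in for df.
toy = pd.DataFrame({
    "SMOKE": ["no", "yes", "no"],
    "NObeyesdad": ["Normal_Weight", "Obesity_Type_III", "Insufficient_Weight"],
})

# Map each column through its dictionary, then cast everything to float.
encoded = toy.assign(
    SMOKE=toy["SMOKE"].map(yes_no),
    NObeyesdad=toy["NObeyesdad"].map(target_codes),
).astype("float64")
print(encoded)
```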

# Heat Map


In [None]:
plt.figure(figsize=(15, 9))
sns.heatmap(df1.corr(), annot=True, cmap="coolwarm")
plt.title('The correlation among features', y=1.05)
plt.show()

According to the heat map, weight is strongly correlated with obesity level.
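
Because the obesity level is an ordinal code rather than a continuous measurement, a rank correlation can be a useful cross-check on the heat map. The sketch below uses synthetic stand-in data (not the obesity dataset) and hypothetical column names to rank features by their absolute Spearman correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: the ordinal level rises with weight by construction.
weight = rng.uniform(40, 160, size=200)
level = np.digitize(weight, bins=[55, 70, 85, 100, 115, 130])  # codes 0..6
noise = rng.normal(size=200)                                   # unrelated feature
demo = pd.DataFrame({"Weight": weight, "Noise": noise, "Level": level})

# Rank features by absolute Spearman correlation with the ordinal target.
ranking = (demo.corr(method="spearman")["Level"]
           .drop("Level")
           .abs()
           .sort_values(ascending=False))
print(ranking)
```

On the real data, the same call on `df1` with `"NObeyesdad"` as the target column would give the equivalent ranking, assuming the target codes follow the obesity scale.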


In [None]:
x = df1.drop(columns=["NObeyesdad"])
# Keep the target as a 1-D pandas Series so that `.values.ravel()` works in the cells below
y = df1["NObeyesdad"]

# Logistic Regression


In [None]:
import numpy as np
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def LogReg(x, y, test_size, solver_list):
    """Grid-search C and solver for a balanced, L2-penalized logistic regression."""

    # Split first, then fit the scaler on the training rows only
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)

    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    x_norm = scaler.transform(x)  # full dataset, scaled once outside the loops

    c_list = [0.1, 0.2, 0.4, 0.5, 1, 2, 4, 5, 10, 20, 50, 100, 400]
    rows = []  # one dictionary per (C, solver) configuration

    for c in c_list:
        for s in solver_list:
            logreg = LogisticRegression(
                solver=s, penalty='l2', C=c, class_weight='balanced')
            # np.ravel accepts a Series, a one-column DataFrame, or an ndarray
            logreg.fit(x_train_scaled, np.ravel(y_train))
            y_pred = logreg.predict(x_test_scaled)

            # "acc" is held-out accuracy; "score" is accuracy on the full
            # dataset (train + test), so it is an optimistic figure
            rows.append({'Test_size': test_size,
                         "acc": metrics.accuracy_score(y_test, y_pred),
                         "c": c, "solver": s,
                         "score": logreg.score(x_norm, y)})

    df_evaluation = pd.DataFrame(rows)

    return (x_train, x_test, y_train, y_test, y_pred, df_evaluation)

In [None]:

x_train, x_test, y_train, y_test, y_pred, df_evaluation = LogReg(
    x, y, .25, ['newton-cg', 'sag', 'saga', 'lbfgs'])

In [None]:
def highlight_max(s):

    is_max = s == s.max()
    return ['background-color: gray' if v else '' for v in is_max]

In [None]:
df_evaluation.style.apply(highlight_max)
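
The manual loop over `C` and solver can also be expressed with scikit-learn's `GridSearchCV`. The block below is a sketch on synthetic stand-in data (the `X_demo`/`y_demo` names and the reduced grid are assumptions for illustration); it selects hyperparameters by cross-validation on the training split, so the test split is only used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features and labels (not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo)

# The pipeline refits the scaler inside every CV fold,
# so no statistics from a validation fold leak into training.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000))

# Parameter names are prefixed with the lowercase step name created by make_pipeline.
param_grid = {
    "logisticregression__C": [0.1, 1, 10, 100],
    "logisticregression__solver": ["lbfgs", "newton-cg"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))
```

Compared with picking the row that maximizes test accuracy, this keeps the test split out of model selection, which usually gives a less optimistic final estimate.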

# Best Logistic Model


In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

In [None]:
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [None]:
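# Refit the configuration selected from the grid search above
best_logreg_model = LogisticRegression(
    solver='newton-cg',
    penalty='l2',
    C=400,
    class_weight='balanced'
)

# Flatten the labels to the 1-D array that scikit-learn expects
best_logreg_model.fit(x_train_scaled, y_train.values.ravel())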

In [None]:
x_norm = scaler.transform(x)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
confusion_matrix(y, best_logreg_model.predict(x_norm))

In [None]:
print(classification_report(y, best_logreg_model.predict(x_norm)))

# knn


In [None]:
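# Note: 'cityblock', 'l1' and 'manhattan' name the same distance, as do
# 'euclidean' and 'l2', so the grid below fits some equivalent models more than once
metric_list = ['cityblock', 'euclidean',
               'l1', 'l2', 'manhattan', 'nan_euclidean']


# Minkowski powers, kept for reference; KNN() below takes a single p and does not read this list
p_list = [1, 2]

n_neighbors_list = range(1, 30)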

In [None]:
def KNN(x, y, test_size, p):
    training_acc = []
    test_acc = []
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)

    df_evaluation = pd.DataFrame()

    ########################### normalizing data  ------------->  StandardScaler ###################################

    scaler = StandardScaler()

    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)

    ########################### knn model ###################################

    for k in n_neighbors_list:

        for metric in metric_list:

            # `p` is the Minkowski power parameter, so it matters only for
            # metric='minkowski'; the named metrics in metric_list do not use it
            knn_model = KNeighborsClassifier(k, p=p, metric=metric, n_jobs=-1)
            # Flatten the labels to the 1-D array that scikit-learn expects
            knn_model.fit(x_train_scaled, y_train.values.ravel())
            y_pred = knn_model.predict(x_test_scaled)

            training_acc.append(knn_model.score(x_train_scaled, y_train))
            test_acc.append(knn_model.score(x_test_scaled, y_test))

            x_norm = scaler.transform(x)

            # "acc" is held-out accuracy; "score" is accuracy on the full
            # dataset (train + test), so it is an optimistic figure
            row = {'Test_size': test_size, "acc": metrics.accuracy_score(
                y_test, y_pred), "metric": metric, "p": p, "n_neighbor": k, "score": knn_model.score(x_norm, y)}

            df_entry = pd.DataFrame([row])  # One-row DataFrame for this configuration
            # Append the row to the running results table
            df_evaluation = pd.concat(
                [df_evaluation, df_entry], ignore_index=True)

    return (x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation = KNN(
    x, y, 0.1, 1)

**Runtime: about 1 minute** (29 values of `k` × 6 metrics = 174 fits)


In [None]:
df_evaluation.style.apply(highlight_max)
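
Several names in `metric_list` refer to the same distance, so their rows in the table above should be identical. The sketch below checks this on synthetic stand-in data (hypothetical `X_demo`/`y_demo`): it fits one model per alias of the Manhattan distance and compares the predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0)

# Fit one model per alias and predict on the same held-out rows.
predictions = {}
for metric in ["manhattan", "cityblock", "l1"]:
    model = KNeighborsClassifier(n_neighbors=5, metric=metric)
    predictions[metric] = model.fit(X_demo[:150], y_demo[:150]).predict(X_demo[150:])

same = (np.array_equal(predictions["manhattan"], predictions["cityblock"])
        and np.array_equal(predictions["manhattan"], predictions["l1"]))
print("identical predictions:", same)
```

If the aliases agree, the grid could be reduced to one name per distance, which would cut the number of fits without changing the best score.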

# Best KNN Model


In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0)

In [None]:
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [None]:
best_knn_model = KNeighborsClassifier(1, p=1, metric='cityblock', n_jobs=-1)
best_knn_model.fit(x_train_scaled, y_train.values.ravel())

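In [None]:
# Re-scale the full dataset with the scaler fitted for this split (test_size=0.1)
# before scoring; otherwise x_norm still holds the logistic-regression scaling
x_norm = scaler.transform(x)

In [None]:
knn_acc = best_knn_model.score(x_norm, y)
knn_acc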

In [None]:
confusion_matrix(y, best_knn_model.predict(x_norm))

In [None]:
print(classification_report(y, best_knn_model.predict(x_norm)))

# Decision Tree


In [None]:
from sklearn.tree import DecisionTreeClassifier


def Decision_Tree(x, y, test_size, max_depth_list):
    training_acc, test_acc = [], []
    criterion_list = ['entropy', 'gini']
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)
    df_evaluation = pd.DataFrame()
    ########################### normalizing data  ------------->  StandardScaler ###################################

    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    ########################### Decision_Tree model ###################################

    for criterion in (criterion_list):

        for m in max_depth_list:

            tree_model = DecisionTreeClassifier(
                criterion=criterion, max_depth=m, random_state=0)

            tree_model.fit(x_train_scaled, y_train.values.ravel())
            y_pred = tree_model.predict(x_test_scaled)

            training_acc.append(tree_model.score(x_train_scaled, y_train))
            test_acc.append(tree_model.score(x_test_scaled, y_test))

            x_norm = scaler.transform(x)

            # "acc" is held-out accuracy; "score" is accuracy on the full
            # dataset (train + test), so it is an optimistic figure
            row = {'Test_size': test_size, "acc": metrics.accuracy_score(
                y_test, y_pred), "criterion": criterion, "max_depth": m, "score": tree_model.score(x_norm, y)}

            df_entry = pd.DataFrame([row])  # One-row DataFrame for this configuration

            # Append the row to the running results table
            df_evaluation = pd.concat(
                [df_evaluation, df_entry], ignore_index=True)

    return (x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation)

In [None]:
x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation = Decision_Tree(
    x, y, 0.1, range(2, 16))

In [None]:
df_evaluation.style.apply(highlight_max)
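
`Decision_Tree()` also returns `training_acc` and `test_acc`, which are not inspected above. Comparing the two per depth is a common way to see where a tree starts to overfit. The block below is a sketch of that comparison on synthetic stand-in data (hypothetical names throughout), using the same depth range as the call above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0)

rows = []
for depth in range(2, 16):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    rows.append({"max_depth": depth,
                 "train_acc": tree.score(X_tr, y_tr),
                 "test_acc": tree.score(X_te, y_te)})

curve = pd.DataFrame(rows).set_index("max_depth")
# A widening gap between training and test accuracy is the usual sign of overfitting.
curve["gap"] = curve["train_acc"] - curve["test_acc"]
print(curve.round(3))
```

Calling `curve[["train_acc", "test_acc"]].plot()` would show the same information as a line chart.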

# Best Decision Tree Model


In [None]:
best_DT_model = DecisionTreeClassifier(
    criterion='entropy', max_depth=9, random_state=0)


best_DT_model.fit(x_train_scaled, y_train.values.ravel())

In [None]:
x_norm = scaler.transform(x)

In [None]:
DT_acc = best_DT_model.score(x_norm, y)
DT_acc

In [None]:
confusion_matrix(y, best_DT_model.predict(x_norm))

In [None]:
print(classification_report(y, best_DT_model.predict(x_norm)))

# Random Forest


In [None]:
def Random_Forest(x, y, test_size, max_depth_list):
    training_acc, test_acc = [], []
    criterion_list = ['entropy', 'gini']
    estimator_list = range(10, 101)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)

    df_evaluation = pd.DataFrame()
    ########################### normalizing data  ------------->  StandardScaler ###################################

    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    ########################### Rf model ###################################

    for s in estimator_list:

        for m in max_depth_list:

            for criterion in (criterion_list):

                Rf_model = RandomForestClassifier(
                    n_estimators=s, max_depth=m, criterion=criterion, random_state=0)
                Rf_model.fit(x_train_scaled, y_train.values.ravel())
                y_pred = Rf_model.predict(x_test_scaled)

                training_acc.append(Rf_model.score(x_train_scaled, y_train))
                test_acc.append(Rf_model.score(x_test_scaled, y_test))

                x_norm = scaler.transform(x)

                # "acc" is held-out accuracy; "score" is accuracy on the full
                # dataset (train + test), so it is an optimistic figure
                row = {'Test_size': test_size, "acc": metrics.accuracy_score(y_test, y_pred),
                       "n_estimator": s, "criterion": criterion, "max_depth": m,
                       "score": Rf_model.score(x_norm, y)}

                # One-row DataFrame for this configuration
                df_entry = pd.DataFrame([row])
                # Append the row to the running results table
                df_evaluation = pd.concat(
                    [df_evaluation, df_entry], ignore_index=True)

    return (x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation)

In [None]:
from sklearn.ensemble import RandomForestClassifier
x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation = Random_Forest(
    x, y, 0.1, range(2, 16))

**Runtime: about 42 minutes** (91 values of `n_estimators` × 14 depths × 2 criteria = 2,548 forests)
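
Most of that runtime comes from fitting every combination in the grid. One possible way to shorten it is to sample a limited number of configurations with `RandomizedSearchCV`. The block below is a sketch on synthetic stand-in data; the coarser `n_estimators` step, `n_iter=10` and `cv=3` are assumptions chosen to keep the example fast, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0)

param_distributions = {
    "n_estimators": list(range(10, 101, 10)),  # coarser step than the full grid
    "max_depth": list(range(2, 16)),
    "criterion": ["entropy", "gini"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),  # trees are built in parallel
    param_distributions,
    n_iter=10,        # 10 sampled candidates instead of every combination
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

A random search does not guarantee finding the single best cell of the grid, but neighbouring forest configurations often score very similarly, so the loss is usually small.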


# Best Random Forest Model


In [None]:
best_RF_model = RandomForestClassifier(
    n_estimators=46, max_depth=15, criterion='entropy', random_state=0)

best_RF_model.fit(x_train_scaled, y_train.values.ravel())

In [None]:
x_norm = scaler.transform(x)

In [None]:
RF_acc = best_RF_model.score(x_norm, y)
RF_acc

In [None]:
confusion_matrix(y, best_RF_model.predict(x_norm))

In [None]:
print(classification_report(y, best_RF_model.predict(x_norm)))

# SVM


In [None]:
from sklearn.model_selection import train_test_split


def svm_model(x, y, test_size):
    training_acc, test_acc = [], []
    kernel_list = ['linear', 'poly', 'rbf', 'sigmoid']
    penalty_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=0)
    df_evaluation1 = pd.DataFrame()

    ########################### normalizing data  ------------->  StandardScaler ###################################

    scaler = StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)

    ########################### SVM ###################################

    for kernel in kernel_list:

        for c in penalty_list:

            # Named `clf` so the local variable does not shadow the enclosing function svm_model()
            clf = SVC(kernel=kernel, C=c)
            clf.fit(x_train_scaled, y_train.values.ravel())
            y_pred = clf.predict(x_test_scaled)
            training_acc.append(clf.score(x_train_scaled, y_train))
            test_acc.append(clf.score(x_test_scaled, y_test))
            x_norm = scaler.transform(x)
            # "acc" is held-out accuracy; "score" is accuracy on the full
            # dataset (train + test), so it is an optimistic figure
            row = {'Test_size': test_size, "acc": metrics.accuracy_score(y_test, y_pred),
                   "penalty": c, "kernel": kernel, "score": clf.score(x_norm, y)}

            df_entry = pd.DataFrame([row])  # One-row DataFrame for this configuration
            # Append the row to the running results table
            df_evaluation1 = pd.concat(
                [df_evaluation1, df_entry], ignore_index=True)

    return (x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation1)

In [None]:
from sklearn.svm import SVC

x_train, x_test, y_train, y_test, training_acc, test_acc, y_pred, df_evaluation1 = svm_model(
    x, y, 0.1)

# Visualizing results


In [None]:
print("🏆 CONFIGS BY KERNEL".center(80))

best_by_kernel = df_evaluation1.loc[df_evaluation1.groupby('kernel')[
    'acc'].idxmax()]
best_by_kernel_sorted = best_by_kernel.sort_values('acc', ascending=False)

print(f"\n{'Kernel':<12} {'C (Penalty)':<15} {'Accuracy':<12} {'Score Total':<12}")
print("-"*80)
for idx, row in best_by_kernel_sorted.iterrows():
    print(
        f"{row['kernel']:<12} {row['penalty']:<15.2f} {row['acc']:<12.4f} {row['score']:<12.4f}")

print("\n" + "-"*80)
best_overall = df_evaluation1.loc[df_evaluation1['acc'].idxmax()]
print(f"BEST GENERAL CONFIG:")
print(f"   Kernel: {best_overall['kernel']}")
print(f"   C (Penalty): {best_overall['penalty']}")
print(f"   Accuracy: {best_overall['acc']:.4f}")
print(f"   Total: {best_overall['score']:.4f}")

In [None]:
# 3. Curves of accuracy vs penalty (C) for each kernel
plt.figure(figsize=(14, 6))

# Subplot 1: All curves
plt.subplot(1, 2, 1)
for kernel in df_evaluation1['kernel'].unique():
    kernel_data = df_evaluation1[df_evaluation1['kernel'] == kernel].sort_values(
        'penalty')
    plt.plot(kernel_data['penalty'], kernel_data['acc'],
             marker='o', label=kernel, linewidth=2, markersize=6)

plt.title('Accuracy vs Penalty (C) by Kernel', fontsize=14, fontweight='bold')
plt.xlabel('Penalty (C)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.legend(title='Kernel', fontsize=10)
plt.grid(True, alpha=0.3, linestyle='--')
plt.xlim([df_evaluation1['penalty'].min() - 0.05,
         df_evaluation1['penalty'].max() + 0.05])

# Subplot 2: Heatmap of accuracy
plt.subplot(1, 2, 2)
pivot_table = df_evaluation1.pivot_table(
    values='acc', index='kernel', columns='penalty')
sns.heatmap(pivot_table, annot=True, fmt='.3f', cmap='RdYlGn', cbar_kws={'label': 'Accuracy'},
            linewidths=0.5, linecolor='gray')
plt.title('Heatmap: Accuracy Kernel & Penalty',
          fontsize=14, fontweight='bold')
plt.xlabel('Penalty (C)', fontsize=12)
plt.ylabel('Kernel', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

# Select the best configuration found in the grid search
best_config = df_evaluation1.loc[df_evaluation1['acc'].idxmax()]
print(
    f"🏆 Training best model: kernel={best_config['kernel']}, C={best_config['penalty']}\n")

# Re-train with the best configuration
scaler_best = StandardScaler()
x_train_scaled_best = scaler_best.fit_transform(x_train)
x_test_scaled_best = scaler_best.transform(x_test)

best_model = SVC(kernel=best_config['kernel'], C=best_config['penalty'])
best_model.fit(x_train_scaled_best, y_train.values.ravel())
y_pred_best = best_model.predict(x_test_scaled_best)
cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar_kws={'label': 'Count'},
            linewidths=1, linecolor='gray', square=True)
plt.title(f'Confusion Matrix\nKernel: {best_config["kernel"]}, C: {best_config["penalty"]}, Accuracy: {best_config["acc"]:.4f}',
          fontsize=14, fontweight='bold')
plt.ylabel('Real Class', fontsize=12)
plt.xlabel('Predicted Class', fontsize=12)
plt.tight_layout()
plt.show()

print("\n" + "-"*80)
print("REPORT OF CLASSIFICATION OF THE BEST MODEL".center(80))
print("-"*80)
print(classification_report(y_test, y_pred_best))

# Export results to CSV

In [168]:

df_evaluation1_sorted = df_evaluation1.sort_values('acc', ascending=False)
df_evaluation1_sorted.to_csv(
    'svm_results_full.csv', index=False, encoding='utf-8')
best_by_kernel.to_csv('svm_results_best_by_kernel.csv',
                      index=False, encoding='utf-8')

print("📋 TOP 10".center(90))
print("-"*90)
print(df_evaluation1_sorted[['kernel', 'penalty', 'acc', 'score']].head(
    10).to_string(index=False))
print("-"*90)  

print("📈 STATISTICAL SUMMARY BY KERNEL".center(90))
print("-"*90)
stats = df_evaluation1.groupby('kernel')['acc'].agg(
    ['mean', 'std', 'min', 'max', 'count'])
stats.columns = ['Mean', 'Std', 'Min', 'Max', 'N° Exps']
print(stats.to_string())
print("Results saved as 'svm_results_full.csv'")
print("Best configurations by kernel saved as 'svm_results_best_by_kernel.csv'") 

                                         📋 TOP 10                                         
------------------------------------------------------------------------------------------
kernel  penalty      acc    score
linear      0.8 0.981132 0.963998
linear      0.9 0.981132 0.963998
linear      1.0 0.976415 0.965419
linear      0.7 0.971698 0.963051
linear      0.5 0.966981 0.958314
linear      0.6 0.966981 0.960208
linear      0.4 0.966981 0.951682
linear      0.3 0.962264 0.946471
linear      0.2 0.943396 0.933207
linear      0.1 0.924528 0.909995
------------------------------------------------------------------------------------------
                             📈 STATISTICAL SUMMARY BY KERNEL                              
------------------------------------------------------------------------------------------
             Mean       Std       Min       Max  N° Exps
kernel                                                  
linear   0.964151  0.017677  0.924528  0.981132       10


# SVM with K-Fold Cross-Validation

**Improvement 1:** Use K-Fold CV instead of a single train/test split in order to:

- ✅ Get more reliable performance estimates
- ✅ Reduce the variance of the reported metrics
- ✅ Calculate confidence intervals from the per-fold scores
- ✅ Use **weighted averaging** so that per-class metrics reflect class frequencies

**Setup:**

- Classes: 7 (Insufficient Weight → Obesity Type III)
- Validation: 10-Fold Cross-Validation
- Metrics: Accuracy, Precision, Recall, F1-Score (weighted)
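
The same setup can be sketched with scikit-learn's built-in helpers. The block below is an illustration on imbalanced synthetic data, not the notebook's own evaluation: it swaps in `StratifiedKFold` (which keeps class proportions similar in every fold) and wraps the scaler and the SVM in a pipeline so the scaler is refit inside each fold. All names and data here are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Imbalanced synthetic stand-in data (not the obesity dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42)

# Scaler + classifier in one estimator: the scaler is fit on each training fold only.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scoring = ["accuracy", "precision_weighted", "recall_weighted", "f1_weighted"]

scores = cross_validate(pipe, X_demo, y_demo, cv=cv, scoring=scoring)
for name in scoring:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```

With strongly imbalanced classes, stratification avoids folds in which a rare class is almost absent, which can otherwise inflate the spread of the fold scores.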


## 🏆 Final Model Trained on the Full Dataset

**Note:** After evaluating the model with the train/test split (above), we now train the final model on **ALL** the data to maximize its predictive capacity. This final model has no evaluation metrics of its own, because no held-out test data remain.


In [169]:
import pickle

# Scale dataset
scaler_final = StandardScaler()
x_scaled_full = scaler_final.fit_transform(x)

# Train with the best configuration found
final_model = SVC(kernel=best_config['kernel'], C=best_config['penalty'])
final_model.fit(x_scaled_full, y.values.ravel())

print(f"\nModel:")
print(
    f"   - Configuration: kernel={best_config['kernel']}, C={best_config['penalty']}")
print(f"   - Samples used: {len(x)} (100% of dataset)")
print(f"   - Classes: {len(final_model.classes_)}")
print(f"   - Support vectors: {len(final_model.support_)}")

# Distribution of support vectors by class
print(f"\nSupport vectors by class:")
for i, clase in enumerate(final_model.classes_):
    n_sv = final_model.n_support_[i]
    print(
        f"   Class {clase}: {n_sv} vectors ({n_sv/len(final_model.support_)*100:.1f}%)")

print("\n" + "-"*90)
print("⚠️ This model DOES NOT have valid evaluation metrics")
print("   because it was trained with all data (without separated test set).")
print("   The metrics reported above (confusion matrix) are correct.")
print("-"*90)

# Save the model (optional)
with open('svm_final_model.pkl', 'wb') as f:
    pickle.dump({'model': final_model, 'scaler': scaler_final,
                'config': best_config}, f)
print("\nModel saved as 'svm_final_model.pkl'")


Model:
   - Configuration: kernel=linear, C=0.8
   - Samples used: 2111 (100% of dataset)
   - Classes: 7
   - Support vectors: 705

Support vectors by class:
   Class Insufficient_Weight: 77 vectors (10.9%)
   Class Normal_Weight: 141 vectors (20.0%)
   Class Obesity_Type_I: 111 vectors (15.7%)
   Class Obesity_Type_II: 49 vectors (7.0%)
   Class Obesity_Type_III: 9 vectors (1.3%)
   Class Overweight_Level_I: 155 vectors (22.0%)
   Class Overweight_Level_II: 163 vectors (23.1%)

------------------------------------------------------------------------------------------
⚠️ This model DOES NOT have valid evaluation metrics
   because it was trained with all data (without separated test set).
   The metrics reported above (confusion matrix) are correct.
------------------------------------------------------------------------------------------

Model saved as 'svm_final_model.pkl'
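
To use a saved bundle later, the stored scaler has to be applied before calling the model, because the model was trained on scaled features. The block below is a self-contained sketch of that round trip: it trains a small stand-in model on synthetic data and writes it to a temporary file (the file name `svm_demo_model.pkl` and all `*_demo` names are hypothetical), then reloads it and predicts.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in model trained on synthetic data (not the obesity dataset).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = SVC(kernel="linear", C=0.8).fit(scaler_demo.transform(X_demo), y_demo)

# Save a bundle with 'model' and 'scaler' keys to a temporary path for the demo.
path = os.path.join(tempfile.mkdtemp(), "svm_demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump({"model": model_demo, "scaler": scaler_demo}, f)

# Later, possibly in another session: reload the bundle and predict.
with open(path, "rb") as f:
    bundle = pickle.load(f)

new_rows = X_demo[:3]  # stands in for new, unscaled feature rows
# Apply the saved scaler first, then the saved model.
preds = bundle["model"].predict(bundle["scaler"].transform(new_rows))
print(preds)
```

Pickle files can execute arbitrary code when loaded, so only bundles from a trusted source should be unpickled. New rows must also be encoded with the same numeric codes and column order used for training.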


# Evaluation of SVM using K-Fold Cross-Validation

In [174]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix
)
import time


def kfold_svm_evaluation(X, y, kernel='rbf', C=1.0, gamma='scale', degree=3,
                         n_folds=10, random_state=42):
    """Evaluate an SVC with K-fold cross-validation, scaling inside each fold.

    Parameters:
    -----------
    X : array-like, dataset features
    y : array-like, dataset labels
    kernel : str, type of kernel ('linear', 'rbf', 'poly', 'sigmoid')
    C : float, regularization parameter
    gamma : str or float, kernel coefficient
    degree : int, polynomial degree (only for kernel='poly')
    n_folds : int, number of folds for cross-validation
    random_state : int, seed for reproducibility

    Returns:
    --------
    dict with aggregated metrics and results per fold
    """
    kfold = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)

    # Store results
    fold_results = []
    all_y_true = []
    all_y_pred = []

    print(f"Executing {n_folds}-Fold Cross Validation...")
    print(f"Total samples: {len(X)}")
    print(f"Kernel: {kernel} | C: {C} | Gamma: {gamma}")
    print(f"Classes in dataset: {len(np.unique(y))}\n")

    start_time = time.time()

    for fold, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
        # Divide data - Use .iloc for positional access (works with DataFrame and array)
        if hasattr(X, 'iloc'):  # If DataFrame
            X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        else:  # If numpy array
            X_train, X_test = X[train_idx], X[test_idx]

        if hasattr(y, 'iloc'):  # If Series/DataFrame
            y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        else:  # If numpy array
            y_train, y_test = y[train_idx], y[test_idx]

        # Normalize (IMPORTANT: fit on train, transform on both)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train
        model = SVC(
            kernel=kernel,
            C=C,
            gamma=gamma if kernel != 'linear' else 'scale',
            degree=degree if kernel == 'poly' else 3,
            random_state=random_state
        )
        model.fit(X_train_scaled, y_train)

        # Predict
        y_pred = model.predict(X_test_scaled)

        # Save for global confusion matrix
        all_y_true.extend(y_test)
        all_y_pred.extend(y_pred)

        # Calculate metrics with WEIGHTED AVERAGING
        fold_acc = accuracy_score(y_test, y_pred)
        fold_prec = precision_score(
            y_test, y_pred, average='weighted', zero_division=0)
        fold_rec = recall_score(
            y_test, y_pred, average='weighted', zero_division=0)
        fold_f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

        fold_results.append({
            'fold': fold,
            'accuracy': fold_acc,
            'precision': fold_prec,
            'recall': fold_rec,
            'f1_score': fold_f1,
            'train_size': len(X_train),
            'test_size': len(X_test)
        })

        print(f"  FOLD {fold:2d}: Acc={fold_acc:.4f} | Prec={fold_prec:.4f} | "
              f"Rec={fold_rec:.4f} | F1={fold_f1:.4f}")

    training_time = time.time() - start_time

    # Calculate aggregated statistics
    accuracies = [r['accuracy'] for r in fold_results]
    precisions = [r['precision'] for r in fold_results]
    recalls = [r['recall'] for r in fold_results]
    f1_scores = [r['f1_score'] for r in fold_results]

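    # Empirical 2.5th and 97.5th percentiles of the per-fold scores; with 10 folds
    # this is a rough spread of the fold results rather than a CI of the mean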
    ci_95_acc = np.percentile(accuracies, [2.5, 97.5])
    ci_95_f1 = np.percentile(f1_scores, [2.5, 97.5])

    results = {
        'fold_results': fold_results,
        'mean_accuracy': np.mean(accuracies),
        'std_accuracy': np.std(accuracies),
        'mean_precision': np.mean(precisions),
        'std_precision': np.std(precisions),
        'mean_recall': np.mean(recalls),
        'std_recall': np.std(recalls),
        'mean_f1': np.mean(f1_scores),
        'std_f1': np.std(f1_scores),
        'ci_95_accuracy': ci_95_acc,
        'ci_95_f1': ci_95_f1,
        'training_time': training_time,
        'confusion_matrix': confusion_matrix(all_y_true, all_y_pred),
        'y_true': np.array(all_y_true),
        'y_pred': np.array(all_y_pred)
    }

    # summary
    print(f"\n{'-'*70}")
    print(f"RESULTS ({n_folds}-Fold CV):")
    print(f"{'-'*70}")
    print(
        f"  Accuracy:  {results['mean_accuracy']:.4f} ± {results['std_accuracy']:.4f}")
    print(
        f"  Precision: {results['mean_precision']:.4f} ± {results['std_precision']:.4f}")
    print(
        f"  Recall:    {results['mean_recall']:.4f} ± {results['std_recall']:.4f}")
    print(f"  F1-Score:  {results['mean_f1']:.4f} ± {results['std_f1']:.4f}")
    print(f"\n  95% CI Accuracy: [{ci_95_acc[0]:.4f}, {ci_95_acc[1]:.4f}]")
    print(f"  95% CI F1-Score: [{ci_95_f1[0]:.4f}, {ci_95_f1[1]:.4f}]")
    print(f"\n  Total time: {training_time:.2f}s")
    print(f"{'-'*70}\n")

    return results

print("kfold_svm_evaluation() defined correctly")

kfold_svm_evaluation() defined correctly


## Comparing all kernels

Test the 4 kernels with default parameters (C=1.0, gamma='scale')


In [175]:
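# Raw features from the dataset object (they contain strings); kept for reference only.
# The cross-validation below uses the numerically encoded matrix `x` instead.
X = dataset.data.features
# Original string class labels
y = dataset.data.targets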

experiments_log = [] 
kernels = ['linear', 'rbf', 'poly', 'sigmoid']

for kernel in kernels:
    print(f"\n{'-'*70}")
    print(f"# TESTING KERNEL: {kernel.upper()}")
    print(f"{'-'*70}\n")

    results = kfold_svm_evaluation(
        X=x,  
        y=y.values.ravel(),  # Convert y to 1D array
        kernel=kernel,
        C=1.0,
        gamma='scale',
        degree=3,  
        n_folds=10,
        random_state=42
    )

    # Save
    experiments_log.append({
        'experiment_id': len(experiments_log) + 1,
        'kernel': kernel,
        'C': 1.0,
        'gamma': 'scale',
        'degree': 3 if kernel == 'poly' else '-',
        'mean_accuracy': results['mean_accuracy'],
        'std_accuracy': results['std_accuracy'],
        'mean_f1': results['mean_f1'],
        'std_f1': results['std_f1'],
        'ci_95_acc_lower': results['ci_95_accuracy'][0],
        'ci_95_acc_upper': results['ci_95_accuracy'][1],
        'training_time': results['training_time']
    })

# DataFrame with all results
df_experiments = pd.DataFrame(experiments_log)

print("\n" + "-"*90)
print("SUMMARY")
print("-"*90)
print(df_experiments.to_string(index=False))
print("-"*90) 

best_exp = df_experiments.loc[df_experiments['mean_accuracy'].idxmax()]
print(f"\n BEST KERNEL: {best_exp['kernel'].upper()}")
print(
    f"   Accuracy: {best_exp['mean_accuracy']:.4f} ± {best_exp['std_accuracy']:.4f}")
print(f"   F1-Score: {best_exp['mean_f1']:.4f} ± {best_exp['std_f1']:.4f}")
print(
    f"   95% CI: [{best_exp['ci_95_acc_lower']:.4f}, {best_exp['ci_95_acc_upper']:.4f}]")


----------------------------------------------------------------------
# TESTING KERNEL: LINEAR
----------------------------------------------------------------------

Executing 10-Fold Cross Validation...
Total samples: 2111
Kernel: linear | C: 1.0 | Gamma: scale
Classes in dataset: 7

  FOLD  1: Acc=0.9528 | Prec=0.9591 | Rec=0.9528 | F1=0.9528
  FOLD  2: Acc=0.9810 | Prec=0.9820 | Rec=0.9810 | F1=0.9811
  FOLD  3: Acc=0.9336 | Prec=0.9383 | Rec=0.9336 | F1=0.9330
  FOLD  4: Acc=0.9431 | Prec=0.9486 | Rec=0.9431 | F1=0.9440
  FOLD  5: Acc=0.9621 | Prec=0.9629 | Rec=0.9621 | F1=0.9619
  FOLD  6: Acc=0.9431 | Prec=0.9458 | Rec=0.9431 | F1=0.9432
  FOLD  7: Acc=0.9289 | Prec=0.9340 | Rec=0.9289 | F1=0.9281
  FOLD  8: Acc=0.9431 | Prec=0.9440 | Rec=0.9431 | F1=0.9432
  FOLD  9: Acc=0.9526 | Prec=0.9534 | Rec=0.9526 | F1=0.9518
  FOLD 10: Acc=0.9810 | Prec=0.9815 | Rec=0.9810 | F1=0.9809

----------------------------------------------------------------------
RESULTS (10-Fold CV):
-------

## Export Results as CSV file


In [177]:
df_experiments.to_csv('svm_kfold_experiments.csv',
                      index=False, encoding='utf-8')
print("Results saved as 'svm_kfold_experiments.csv'")

print("\n" + "-"*90)
print("SUMMARY")
print("-"*90)
print(f"""
Evaluation Strategy: 10-Fold Cross-Validation
- Method: K-Fold with shuffle
- Folds: 10
- Random State: 42
- Metrics: Weighted averaging (handles imbalance)

Kernels:
""")

for idx, row in df_experiments.iterrows():
    print(f"{idx+1}. {row['kernel'].upper():8s} | "
          f"Acc: {row['mean_accuracy']:.4f}±{row['std_accuracy']:.4f} | "
          f"F1: {row['mean_f1']:.4f}±{row['std_f1']:.4f} | "
          f"Time: {row['training_time']:.2f}s")

print(f"\nBEST CONFIGURATION:")
print(f"   Kernel: {best_exp['kernel'].upper()}")
print(
    f"   Accuracy: {best_exp['mean_accuracy']:.4f} ± {best_exp['std_accuracy']:.4f}")
print(
    f"   95% CI: [{best_exp['ci_95_acc_lower']:.4f}, {best_exp['ci_95_acc_upper']:.4f}]")
print(f"   F1-Score: {best_exp['mean_f1']:.4f} ± {best_exp['std_f1']:.4f}")
print(f"   Time: {best_exp['training_time']:.2f}s")
print("="*90)

Results saved as 'svm_kfold_experiments.csv'

------------------------------------------------------------------------------------------
SUMMARY
------------------------------------------------------------------------------------------

Evaluation Strategy: 10-Fold Cross-Validation
- Method: K-Fold with shuffle
- Folds: 10
- Random State: 42
- Metrics: Weighted averaging (handles imbalance)

Kernels:

1. LINEAR   | Acc: 0.9522±0.0170 | F1: 0.9520±0.0171 | Time: 0.63s
2. RBF      | Acc: 0.8929±0.0272 | F1: 0.8942±0.0268 | Time: 1.07s
3. POLY     | Acc: 0.8394±0.0251 | F1: 0.8361±0.0260 | Time: 0.97s
4. SIGMOID  | Acc: 0.5803±0.0351 | F1: 0.5813±0.0356 | Time: 0.87s

BEST CONFIGURATION:
   Kernel: LINEAR
   Accuracy: 0.9522 ± 0.0170
   95% CI: [0.9300, 0.9810]
   F1-Score: 0.9520 ± 0.0171
   Time: 0.63s


## K-Fold CV vs Train/Test Split

**1. More reliable estimation**

- K-Fold uses **all data** for both training and validation (every sample is validated exactly once)
- Reduces the **variance** of the performance estimate
- A single train/test split depends on ONE random partition

**2. Confidence intervals**

- K-Fold provides **10 measurements**, one per fold (not fully independent, since the training sets overlap)
- We calculate a **95% CI** from the fold scores to quantify uncertainty
- Train/Test only gives **1 value** without any measure of confidence

**3. Weighted metrics**

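- We use `average='weighted'` in all metrics
- Per-class scores are averaged with weights proportional to class support, which suits a dataset with **7 imbalanced classes**
- Reporting precision, recall and F1 alongside accuracy helps expose a model that scores well only by favoring the majority classes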
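
A small toy case makes the effect of the averaging mode concrete. In the sketch below (hand-made labels, not notebook data), a degenerate classifier always predicts the majority class; the block compares accuracy with the weighted and macro F1 scores.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 8 majority-class samples and 2 minority-class samples.
y_true = [0] * 8 + [1] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# Weighted: per-class F1 averaged with weights proportional to class support.
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# Macro: per-class F1 averaged with equal weight for every class.
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={acc:.3f}  f1_weighted={f1_weighted:.3f}  f1_macro={f1_macro:.3f}")
```

Here accuracy is 0.800, weighted F1 is about 0.711 and macro F1 is about 0.444. Weighted F1 sits below accuracy but still follows the majority class, while macro F1 penalizes the missed minority class most strongly, so it can be worth reporting next to the weighted scores.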

**4. Correct normalization**

- StandardScaler is **fit on the training fold and only applied to the test fold** in each split
- Prevents **data leakage**
- Correctly simulates unseen data

**5. Reproducibility**

- `random_state=42` fixed in all experiments
- Results **reproducible** for comparison
- Important for scientific validation

### 📊 **Interpretation of Results**

**Standard Deviation (±std)**

- Low std → Model is **robust** and **stable**
- High std → Model is **sensitive** to data splits

**95% Confidence Interval**

- Computed here as the 2.5th to 97.5th percentile range of the 10 fold scores, so it describes the spread of results across folds
- Narrow → Higher **certainty** in performance
- Wide → Higher **variability**, less confidence
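
For comparison, a t-based interval for the mean accuracy can be computed from the same fold scores. The sketch below uses the ten per-fold accuracies of the linear kernel printed in the 10-fold output above and treats them as approximately independent, which is a simplifying assumption because the training folds overlap.

```python
import numpy as np
from scipy import stats

# Per-fold accuracies of the linear kernel, copied from the 10-fold output above.
fold_acc = np.array([0.9528, 0.9810, 0.9336, 0.9431, 0.9621,
                     0.9431, 0.9289, 0.9431, 0.9526, 0.9810])

k = len(fold_acc)
mean = fold_acc.mean()
sem = fold_acc.std(ddof=1) / np.sqrt(k)   # standard error of the mean
t_crit = stats.t.ppf(0.975, df=k - 1)     # two-sided 95% critical value
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

# Percentile range as computed in kfold_svm_evaluation(), for comparison.
p_low, p_high = np.percentile(fold_acc, [2.5, 97.5])

print(f"mean accuracy      : {mean:.4f}")
print(f"t-based 95% CI     : [{ci_low:.4f}, {ci_high:.4f}]")
print(f"2.5-97.5 percentile: [{p_low:.4f}, {p_high:.4f}]")
```

The t-based interval describes uncertainty about the mean accuracy, whereas the percentile range describes how much individual folds vary, so the first is typically narrower.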

**F1-Score Weighted**

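- A more informative summary than accuracy alone for **imbalanced data**
- Combines precision and recall per class, then averages the per-class F1 scores weighted by class support
- Accuracy alone may be misleading when classes are imbalanced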
