# **Evidence 1: Supervised Learning**

### Rodrigo Jiménez Ortiz - A01029623
### Ricardo Villarreal Bazán - A01666859
### Bertín Flores Silva - A01660604

## **Describe the dataset:**

General information:

* Dataset Name: Estimation of Obesity Levels Based On Eating Habits and Physical Condition

* Instances: 2111 rows

* Features: 16 in total (8 categorical and 8 numerical).

* Target Variable: NObeyesdad (Obesity level)


Features (X)

There are 16 input attributes (columns) in `X`; the 17th column of the dataset is the target. They can be grouped as follows:

1. **Demographic / Anthropometric**
   - `Gender` (categorical: “Male” / “Female”)
   - `Age` (numeric: integer, in years)
   - `Height` (numeric: in meters)
   - `Weight` (numeric: in kilograms)

2. **Family History**
   - `family_history_with_overweight` (binary: “yes” / “no”)

3. **Eating Habits and Lifestyle (self-reported)**
   - `FAVC` (Frequent consumption of high-caloric food; categorical: “yes” / “no”)
   - `FCVC` (Frequency of vegetable consumption; numeric scale, where higher means more frequent)
   - `NCP` (Number of main meals per day; numeric)
   - `CAEC` (Consumption of food between meals; categorical: “no” / “Sometimes” / “Frequently” / “Always”)
   - `SMOKE` (Smoking; binary: “yes” / “no”)
   - `CH2O` (Daily water consumption; numeric scale)
   - `SCC` (Calorie consumption monitoring; categorical: “yes” / “no”)
   - `CALC` (Consumption of alcohol; categorical: “no” / “Sometimes” / “Frequently” / “Always”)

4. **Physical Condition / Activity**
   - `FAF` (Physical activity frequency; numeric scale, where higher means more frequent)
   - `TUE` (Time using technology devices; numeric scale)
   - `MTRANS` (Mode of transportation; categorical: “Automobile” / “Motorbike” / “Bike” / “Public_Transportation” / “Walking”)

Target (y)

The target `y` (the `NObeyesdad` column) contains one of these categories:

1. `Insufficient_Weight`  
2. `Normal_Weight`  
3. `Overweight_Level_I`  
4. `Overweight_Level_II`  
5. `Obesity_Type_I`  
6. `Obesity_Type_II`  
7. `Obesity_Type_III`  

Each record in `X` corresponds to exactly one of these obesity categories in `y`.
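When reading the classification reports and confusion matrices in the rest of this notebook, it helps to know that scikit-learn orders string class labels alphabetically rather than by severity. The minimal sketch below (plain Python; the variable names `severity_order` and `report_order` are introduced here only for illustration) shows the two orderings:

```python
# Classes listed by increasing severity, as in the description above.
severity_order = [
    "Insufficient_Weight",
    "Normal_Weight",
    "Overweight_Level_I",
    "Overweight_Level_II",
    "Obesity_Type_I",
    "Obesity_Type_II",
    "Obesity_Type_III",
]

# scikit-learn sorts string labels, so the rows and columns of a confusion
# matrix follow this alphabetical order instead of the severity order.
report_order = sorted(severity_order)

for row_index, label in enumerate(report_order):
    print(row_index, label)
```

In the alphabetical order, the three obesity classes come before the two overweight classes, so row 2 of a confusion matrix is `Obesity_Type_I`, not `Overweight_Level_I`.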


## **Select and train a classifier**

Four classifiers are trained and compared: Logistic Regression, Random Forest, K-Nearest Neighbors, and Decision Tree.

In [19]:
# Imports
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

#### **1.- Logistic Regression**

Advantages: 

* Simple and fast to train

* Easy to interpret (coefficients show feature importance)

* Performs well on linearly separable problems

* Works well with high-dimensional data

Disadvantages:

* Assumes a linear relationship between features and log-odds

* Doesn’t perform well with multicollinearity

* Sensitive to outliers

* May underperform on complex, non-linear problems
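
The linearity assumption listed above can be made concrete. In the multinomial formulation of logistic regression, each class gets a linear score from the features, and a softmax turns those scores into probabilities. The sketch below uses made-up weights and a made-up two-feature input; all values are illustrative assumptions, not coefficients fitted in this notebook:

```python
import math

def softmax(scores):
    """Convert raw linear scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def linear_scores(x, weights, biases):
    """One linear score per class: w_k . x + b_k."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

# Hypothetical standardized input: [weight_z, height_z]
x = [1.2, -0.3]
# Hypothetical coefficients for three classes (not taken from the fitted model)
weights = [[-2.0, 0.5], [0.1, 0.0], [2.0, -0.5]]
biases = [0.0, 0.2, -0.1]

probs = softmax(linear_scores(x, weights, biases))
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print([round(p, 3) for p in probs], predicted_class)
```

Because each score is linear in the inputs, the decision boundaries between classes are straight lines (hyperplanes), which is why the model can underperform on non-linear problems.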

Justification of the evaluation metrics:
* Accuracy: overall performance

* F1-score: useful for class imbalance

* Precision/Recall: for analyzing specific obesity categories

In [None]:
# Load data
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# Features & target
X = df.drop("NObeyesdad", axis=1)
y = df["NObeyesdad"]

# Column types
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Logistic Regression pipeline
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, solver='lbfgs'))
])

# Train and evaluate base model
lr_pipeline.fit(X_train, y_train)
y_pred_base = lr_pipeline.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))
print("\nClassification Report:\n", classification_report(y_test, y_pred_base))


Confusion Matrix:
 [[54  0  0  0  0  0  0]
 [ 5 43  0  0  0 10  0]
 [ 0  0 64  3  0  0  3]
 [ 0  0  2 58  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  8  0  0  0 43  7]
 [ 0  2  7  0  0  5 44]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.92      1.00      0.96        54
      Normal_Weight       0.81      0.74      0.77        58
     Obesity_Type_I       0.88      0.91      0.90        70
    Obesity_Type_II       0.94      0.97      0.95        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.74      0.74      0.74        58
Overweight_Level_II       0.81      0.76      0.79        58

           accuracy                           0.87       423
          macro avg       0.87      0.87      0.87       423
       weighted avg       0.87      0.87      0.87       423
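
The `preprocessor` step in the pipeline above standardizes the numeric columns and one-hot encodes the categorical ones. As a rough illustration of what those two transforms compute, here is a pure-Python sketch on made-up values (the helper names `standardize` and `one_hot` are hypothetical and are not part of scikit-learn):

```python
import statistics

def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # StandardScaler also uses the population std
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Map each category to a 0/1 indicator vector over the sorted categories."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

weights_kg = [60.0, 80.0, 100.0]            # made-up numeric column
transport = ["Walking", "Bike", "Walking"]  # made-up categorical column

scaled = standardize(weights_kg)
categories, encoded = one_hot(transport)

print(scaled)
print(categories, encoded)
```

After standardization the column has mean 0 and standard deviation 1, so features measured in different units (kilograms, meters, years) contribute on a comparable scale.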



#### **Fine-tune parameters:**

In [4]:
# Define hyperparameter grid
lr_params = {
    'classifier__C': [0.1, 1, 10],  
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs', 'saga']
}

# Grid search with cross-validation
lr_grid = GridSearchCV(lr_pipeline, lr_params, cv=5, scoring='f1_macro')
lr_grid.fit(X_train, y_train)

# Evaluate best model
best_lr_model = lr_grid.best_estimator_
y_pred_best = best_lr_model.predict(X_test)

print("Best Parameters:", lr_grid.best_params_)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))


Best Parameters: {'classifier__C': 10, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
Confusion Matrix:
 [[54  0  0  0  0  0  0]
 [ 3 48  0  0  0  7  0]
 [ 0  0 66  1  0  0  3]
 [ 0  0  0 60  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  8  0  0  0 49  1]
 [ 0  0  2  1  0  2 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.95      1.00      0.97        54
      Normal_Weight       0.86      0.83      0.84        58
     Obesity_Type_I       0.97      0.94      0.96        70
    Obesity_Type_II       0.95      1.00      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.84      0.84      0.84        58
Overweight_Level_II       0.93      0.91      0.92        58

           accuracy                           0.93       423
          macro avg       0.93      0.93      0.93       423
       weighted avg       0.93      0.93      0.93       423



#### **Train ensemble:**

In [5]:
# Define individual classifiers
log_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42))
])

rf_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Create ensemble
voting_clf_lr = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('rf', rf_clf)
], voting='soft')

# Fit on training data
voting_clf_lr.fit(X_train, y_train)
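
With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest mean. A minimal sketch with made-up probabilities for two models and three classes (the numbers are illustrative assumptions, not outputs of the models above):

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models; return (averages, winner index)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averages = [sum(p[k] for p in prob_lists) / n_models for k in range(n_classes)]
    winner = max(range(n_classes), key=averages.__getitem__)
    return averages, winner

# Hypothetical predict_proba outputs for one sample
lr_probs = [0.50, 0.40, 0.10]  # logistic regression slightly prefers class 0
rf_probs = [0.20, 0.70, 0.10]  # random forest clearly prefers class 1

averages, winner = soft_vote([lr_probs, rf_probs])
print(averages, winner)
```

Here the more confident model decides the outcome, which is the main difference from hard voting, where each model contributes a single vote regardless of confidence.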


#### **Analyze performance on training set:**

In [6]:
# Predict on training set
train_preds_lr = voting_clf_lr.predict(X_train)
print(classification_report(y_train, train_preds_lr))


                     precision    recall  f1-score   support

Insufficient_Weight       1.00      1.00      1.00       218
      Normal_Weight       0.98      1.00      0.99       229
     Obesity_Type_I       1.00      1.00      1.00       281
    Obesity_Type_II       1.00      1.00      1.00       237
   Obesity_Type_III       1.00      1.00      1.00       259
 Overweight_Level_I       0.99      0.97      0.98       232
Overweight_Level_II       0.99      0.99      0.99       232

           accuracy                           0.99      1688
          macro avg       0.99      0.99      0.99      1688
       weighted avg       0.99      0.99      0.99      1688



#### **Report performance on test set:**

In [7]:
# Predict on test set
test_preds_lr = voting_clf_lr.predict(X_test)
print(classification_report(y_test, test_preds_lr))


                     precision    recall  f1-score   support

Insufficient_Weight       0.98      0.98      0.98        54
      Normal_Weight       0.82      0.86      0.84        58
     Obesity_Type_I       0.94      0.97      0.96        70
    Obesity_Type_II       0.98      0.98      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.85      0.86      0.85        58
Overweight_Level_II       0.96      0.88      0.92        58

           accuracy                           0.93       423
          macro avg       0.93      0.93      0.93       423
       weighted avg       0.94      0.93      0.93       423



#### **2.- Random Forest**

A Random Forest builds an ensemble of decision trees on bootstrapped (randomly sampled) subsets of the training data, and at each split it considers a random subset of features. Each tree votes on the class, and the forest predicts by majority vote (or averaged probabilities).
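
The three ingredients described above (bootstrap sampling of rows, random feature subsets at each split, and voting) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the helper names are hypothetical, no trees are actually grown, and a hard majority vote is used for simplicity.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree in the forest does."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Pick k distinct candidate features for one split."""
    return rng.sample(feature_names, k)

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training rows
features = ["Age", "Height", "Weight", "FAF"]

sample = bootstrap_sample(rows, rng)
candidates = random_feature_subset(features, 2, rng)
vote = majority_vote(["Normal_Weight", "Obesity_Type_I", "Normal_Weight"])

print(sample, candidates, vote)
```

Because rows are drawn with replacement, a bootstrap sample usually contains duplicates and leaves some rows out, so every tree sees a slightly different dataset.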

Key Advantages:

* Robustness & Accuracy: By averaging many trees, it reduces overfitting compared to a single decision tree and often achieves high predictive performance with default settings.
* Handles Mixed Data Types: Works with numeric and categorical features with little preprocessing. In scikit-learn, categorical features still need to be encoded as numbers (for example, with one-hot encoding), but no feature scaling is required.
* Implicit Feature Importance: Provides a ranking of feature importances, helping to identify which inputs matter most.
* Resilient to Outliers & Noise: Since each tree sees a different subset of data and features, individual noisy points have limited influence on the overall model.

Disadvantages:

* Less Interpretable: Unlike a single decision tree (where you can read “if-then” rules), a Random Forest’s aggregate of hundreds of trees is hard to visualize or explain in simple terms.
* Slower Prediction & Training: Training many trees (and tuning hyperparameters like number of trees, depth, etc.) can be computationally intensive, especially on large datasets.
* Memory Consumption: Storing dozens or hundreds of trees requires more memory than a single model.
* Potential Overfitting with Noisy Features: Though more robust than one tree, if there are many irrelevant or highly noisy features, the forest can still learn spurious patterns unless properly tuned (e.g., via limiting tree depth or feature subsampling).

In [8]:
# 2. Preprocessing
# Start from a fresh copy of the raw features so this cell can be re-run safely
X = df.drop("NObeyesdad", axis=1).copy()
y = df["NObeyesdad"]

# Map binary categorical columns ('yes'/'no') to 0/1
binary_cols = ['family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
for col in binary_cols:
    X[col] = X[col].map({'no': 0, 'yes': 1})

# One-hot encode the remaining categorical columns (drop_first removes one redundant dummy per column)
categorical_cols = ['CAEC', 'CALC', 'MTRANS', 'Gender']
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

# 3. Train–test split (stratify on target to preserve class distribution)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 4. Baseline Random Forest Classifier (default params)
rf_baseline = RandomForestClassifier(random_state=42)
rf_baseline.fit(X_train, y_train)

# Evaluate baseline
y_pred_baseline = rf_baseline.predict(X_test)
print("Baseline Random Forest Classification Report:\n")
print(classification_report(y_test, y_pred_baseline))
print("Baseline Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_baseline))

# 5. Hyperparameter tuning with GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

# Stratified 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=cv,
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=2
)

grid_search.fit(X_train, y_train)

# Best parameters from GridSearchCV
best_params = grid_search.best_params_
print("\nBest Hyperparameters:\n", best_params)

# 6. Evaluate the best estimator on the test set
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)

print("\nTuned Random Forest Classification Report:\n")
print(classification_report(y_test, y_pred_best))
print("Tuned Random Forest Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred_best))

# 7. Feature importance (top 10)
importances = best_rf.feature_importances_
feature_names = X.columns
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

print("\nTop 10 Feature Importances:\n")
print(feature_importances.head(10))

Baseline Random Forest Classification Report:

                     precision    recall  f1-score   support

Insufficient_Weight       1.00      0.94      0.97        54
      Normal_Weight       0.75      0.95      0.84        58
     Obesity_Type_I       0.96      0.96      0.96        70
    Obesity_Type_II       1.00      0.98      0.99        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.94      0.84      0.89        58
Overweight_Level_II       0.98      0.91      0.95        58

           accuracy                           0.94       423
          macro avg       0.95      0.94      0.94       423
       weighted avg       0.95      0.94      0.94       423

Baseline Confusion Matrix:

[[51  3  0  0  0  0  0]
 [ 0 55  0  0  0  3  0]
 [ 0  2 67  0  0  0  1]
 [ 0  0  1 59  0  0  0]
 [ 0  1  0  0 64  0  0]
 [ 0  8  1  0  0 49  0]
 [ 0  4  1  0  0  0 53]]
Fitting 5 folds for each of 72 candidates, totalling 360 fits

Best Hyperparameter

In [9]:
# After preprocessing X and y:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("X_train shape:", X_train.shape)
print("X_test shape: ", X_test.shape)
print("y_train distribution:\n", y_train.value_counts())
print("y_test distribution:\n", y_test.value_counts())

X_train shape: (1688, 23)
X_test shape:  (423, 23)
y_train distribution:
 NObeyesdad
Obesity_Type_I         281
Obesity_Type_III       259
Obesity_Type_II        237
Overweight_Level_I     232
Overweight_Level_II    232
Normal_Weight          229
Insufficient_Weight    218
Name: count, dtype: int64
y_test distribution:
 NObeyesdad
Obesity_Type_I         70
Obesity_Type_III       65
Obesity_Type_II        60
Normal_Weight          58
Overweight_Level_II    58
Overweight_Level_I     58
Insufficient_Weight    54
Name: count, dtype: int64
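
The distributions printed above show what `stratify=y` buys: every class contributes close to 20% of its rows to the test set. The short check below recomputes those fractions from the printed counts (the counts are copied from the output above; the variable names are introduced here for illustration):

```python
# Class counts copied from the printed train/test distributions above.
train_counts = {
    "Obesity_Type_I": 281, "Obesity_Type_III": 259, "Obesity_Type_II": 237,
    "Overweight_Level_I": 232, "Overweight_Level_II": 232,
    "Normal_Weight": 229, "Insufficient_Weight": 218,
}
test_counts = {
    "Obesity_Type_I": 70, "Obesity_Type_III": 65, "Obesity_Type_II": 60,
    "Normal_Weight": 58, "Overweight_Level_II": 58,
    "Overweight_Level_I": 58, "Insufficient_Weight": 54,
}

# Fraction of each class that landed in the test set; stratification keeps
# every class close to the requested test_size of 0.2.
test_fraction = {
    label: test_counts[label] / (train_counts[label] + test_counts[label])
    for label in train_counts
}

for label, fraction in sorted(test_fraction.items()):
    print(f"{label:<20} {fraction:.3f}")
```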


In [None]:
# 1. Define the scoring dictionary
scoring = {
    'accuracy'      : 'accuracy',
    'f1_weighted'   : 'f1_weighted',
    'f1_macro'      : 'f1_macro'
}

# 2. Reuse your parameter grid
param_grid = {
    'n_estimators'     : [100, 200, 300],
    'max_depth'        : [None, 10, 20, 30],
    'min_samples_leaf' : [1, 2, 4],
    'criterion'        : ['gini', 'entropy']
}

# 3. Set up stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4. GridSearchCV 
rf = RandomForestClassifier(random_state=42)
grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring=scoring,
    refit='f1_weighted', 
    cv=cv,
    n_jobs=-1,
    verbose=2
)

# 5. Run the search
grid.fit(X_train, y_train)

# 6. Report results
print("Best params (by weighted F1):", grid.best_params_)
print("\nCV results summary:")
for metric in scoring:
    mean_score = grid.cv_results_[f'mean_test_{metric}'][grid.best_index_]
    print(f"  {metric:<12}: {mean_score:.3f}")

Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best params (by weighted F1): {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 2, 'n_estimators': 300}

CV results summary:
  accuracy    : 0.947
  f1_weighted : 0.947
  f1_macro    : 0.945
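
The log line "Fitting 5 folds for each of 72 candidates, totalling 360 fits" follows directly from the size of the parameter grid. A small sketch using only the standard library reproduces the count:

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}
n_folds = 5

# Every combination of values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

The total counts only the cross-validation fits; with refitting enabled, the best candidate is trained once more on the full training set afterwards.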


In [11]:
# 1. Instantiate base classifiers (you can replace params with your tuned values)
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=20, min_samples_leaf=2)
rf_clf = RandomForestClassifier(**best_params, random_state=42)  
lr_clf = LogisticRegression(max_iter=1000, random_state=42)
knn_clf = KNeighborsClassifier(n_neighbors=5)
gnb_clf = GaussianNB()

# 2. Voting Ensemble (soft voting)
voting_clf = VotingClassifier(
    estimators=[
        ('dt', dt_clf),
        ('rf', rf_clf),
        ('lr', lr_clf),
        ('knn', knn_clf),
        ('gnb', gnb_clf)
    ],
    voting='soft',
    n_jobs=-1
)

voting_clf.fit(X_train, y_train)
y_pred_voting = voting_clf.predict(X_test)

print("Voting Classifier Report:\n")
print(classification_report(y_test, y_pred_voting))
print("Voting Confusion Matrix:\n", confusion_matrix(y_test, y_pred_voting))

# 3. Stacking Ensemble
stacking_clf = StackingClassifier(
    estimators=[
        ('dt', dt_clf),
        ('rf', rf_clf),
        ('lr', lr_clf),
        ('knn', knn_clf)
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    n_jobs=-1
)

stacking_clf.fit(X_train, y_train)
y_pred_stack = stacking_clf.predict(X_test)

print("\nStacking Classifier Report:\n")
print(classification_report(y_test, y_pred_stack))

Voting Classifier Report:

                     precision    recall  f1-score   support

Insufficient_Weight       0.94      0.91      0.92        54
      Normal_Weight       0.81      0.79      0.80        58
     Obesity_Type_I       0.99      0.99      0.99        70
    Obesity_Type_II       0.95      0.98      0.97        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.84      0.88      0.86        58
Overweight_Level_II       0.98      0.97      0.97        58

           accuracy                           0.93       423
          macro avg       0.93      0.93      0.93       423
       weighted avg       0.93      0.93      0.93       423

Voting Confusion Matrix:
 [[49  5  0  0  0  0  0]
 [ 3 46  0  0  0  9  0]
 [ 0  0 69  0  0  1  0]
 [ 0  0  1 59  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  6  0  0  0 51  1]
 [ 0  0  0  2  0  0 56]]

Stacking Classifier Report:

                     precision    recall  f1-score   support

Insufficient

## **Performance Metrics & Test Set Results**

### **Selected Performance Metrics**
- **Accuracy**  
  Overall fraction of correct predictions. It is easy to interpret but can mask poor minority-class performance.
- **Macro F1-Score**  
  Unweighted average of per-class F1 scores; treats each class equally, so that mistakes on small classes aren’t hidden by large ones.
- **Weighted F1-Score**  
  Average of per-class F1 scores weighted by class support; balances precision and recall while accounting for the class distribution.
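
To make the three definitions concrete, the sketch below recomputes them in plain Python from the voting-ensemble confusion matrix printed earlier (rows are true classes, columns are predicted classes). The variable names are introduced here for illustration only.

```python
# Voting-ensemble confusion matrix printed earlier (rows = true, columns = predicted).
cm = [
    [49, 5, 0, 0, 0, 0, 0],
    [3, 46, 0, 0, 0, 9, 0],
    [0, 0, 69, 0, 0, 1, 0],
    [0, 0, 1, 59, 0, 0, 0],
    [0, 0, 0, 1, 64, 0, 0],
    [0, 6, 0, 0, 0, 51, 1],
    [0, 0, 0, 2, 0, 0, 56],
]

n_classes = len(cm)
support = [sum(row) for row in cm]  # true samples per class
predicted = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
total = sum(support)

# Accuracy: correct predictions (the diagonal) over all samples.
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

# Per-class F1 = 2*TP / (true count + predicted count).
f1_per_class = [2 * cm[k][k] / (support[k] + predicted[k]) for k in range(n_classes)]

macro_f1 = sum(f1_per_class) / n_classes
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / total

print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

All three values round to 0.93, matching the classification report printed for the voting classifier.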

### **Test Set Results**

| Classifier            | Accuracy | Macro F1 | Weighted F1 |
|-----------------------|---------:|---------:|------------:|
| Baseline Random Forest| 0.94     | 0.94     | 0.94        |
| Tuned Random Forest   | 0.95     | 0.95     | 0.96        |
| Voting Ensemble       | 0.93     | 0.93     | 0.93        |
| Stacking Ensemble     | 0.95     | 0.95     | 0.95        |



#### **3.- K-nearest neighbors**

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used mainly for classification. It finds the "k" closest points ("neighbors") to a given input and predicts the majority class among them, or the average of their values in the case of regression.
Here, k is simply the number of nearby points the algorithm considers when making a decision.

### Distance Metrics Used in KNN Algorithm
KNN uses a distance metric to identify the nearest neighbors, which are then used for the classification or regression task. Common choices are:

1. Euclidean Distance: the straight-line distance between two points.
2. Manhattan Distance: the total distance traveled when movement is restricted to horizontal and vertical lines, as on a city grid.
3. Minkowski Distance: a family of distances controlled by a parameter p; p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
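
The relationship between the three metrics can be checked with a few lines of Python. This is a minimal sketch; the function name `minkowski_distance` is chosen here for illustration:

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)

print(manhattan, euclidean)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of metric can change which neighbors count as "nearest".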

Advantages of KNN
* Simple to use: Easy to understand and implement.
* No training step: No need to train as it just stores the data and uses it during prediction.
* Few parameters: Only needs to set the number of neighbors (k) and a distance method.
* Versatile: Works for both classification and regression problems.

Disadvantages of KNN
* Slow with large data: Needs to compare every point during prediction.
* Struggles with many features: Accuracy drops when data has too many features.
* Can Overfit: It can overfit especially when the data is high-dimensional or not clean.
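
The points above (no training step, but a full comparison against the stored data at prediction time) are visible in a bare-bones implementation. The sketch below is a simplified illustration with made-up, already-scaled 2-D points, not scikit-learn's `KNeighborsClassifier`; the function name `knn_predict` is hypothetical:

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up, already-scaled 2-D points (illustrative only)
train_points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (2.0, 2.0), (2.1, 1.9)]
train_labels = ["Normal_Weight", "Normal_Weight", "Normal_Weight",
                "Obesity_Type_I", "Obesity_Type_I"]

prediction = knn_predict(train_points, train_labels, query=(1.9, 2.2), k=3)
print(prediction)
```

Every prediction sorts all stored points by distance, which is why this naive approach slows down as the dataset grows.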

In [25]:
# Load data
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# Features & target
X = df.drop("NObeyesdad", axis=1)
y = df["NObeyesdad"]

# Column types
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# 3. Transformers
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# 4. ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)

# 5. KNN Pipeline
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])

# 6. Train/test split (not stratified here, unlike the other sections, so per-class supports differ)
X_train, X_test, y_train, y_test = train_test_split(
    X, y.values.ravel(), test_size=0.2, random_state=42)

# 7. Train model
clf.fit(X_train, y_train)

# 8. Evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))

Accuracy: 0.8203309692671394
Confusion Matrix:
 [[53  2  0  0  0  1  0]
 [15 19  8  2  0 10  8]
 [ 0  0 74  2  0  0  2]
 [ 0  0  2 56  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 2  5  0  0  0 46  3]
 [ 0  1  6  3  1  3 36]]

Classification report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.76      0.95      0.84        56
      Normal_Weight       0.70      0.31      0.43        62
     Obesity_Type_I       0.82      0.95      0.88        78
    Obesity_Type_II       0.89      0.97      0.93        58
   Obesity_Type_III       0.98      1.00      0.99        63
 Overweight_Level_I       0.77      0.82      0.79        56
Overweight_Level_II       0.73      0.72      0.73        50

           accuracy                           0.82       423
          macro avg       0.81      0.82      0.80       423
       weighted avg       0.81      0.82      0.80       423



In [24]:
# KNN Hyperparameter Tuning with GridSearchCV
param_grid = {
    'classifier__n_neighbors': [3, 5, 7, 9, 11],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__metric': ['euclidean', 'manhattan']
}

# GridSearch with F1_weighted
grid_search = GridSearchCV(
    clf,
    param_grid,
    cv=5,  
    scoring='f1_weighted',  
    n_jobs=-1,  
    verbose=2
)


grid_search.fit(X_train, y_train)


print("Best hyperparameters:", grid_search.best_params_)
print("Best F1 score:", grid_search.best_score_)

y_pred = grid_search.predict(X_test)
print("\nTest F1 Score:", f1_score(y_test, y_pred, average='weighted'))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification report:\n", classification_report(y_test, y_pred))

Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best hyperparameters: {'classifier__metric': 'manhattan', 'classifier__n_neighbors': 3, 'classifier__weights': 'distance'}
Best F1 score: 0.8785808148883518

Test F1 Score: 0.8717558146191585
Confusion Matrix:
 [[54  1  0  0  0  1  0]
 [ 9 34  5  0  0 10  4]
 [ 0  0 73  2  0  1  2]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  5  0  0  0 47  4]
 [ 0  1  2  0  1  3 43]]

Classification report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.86      0.96      0.91        56
      Normal_Weight       0.83      0.55      0.66        62
     Obesity_Type_I       0.90      0.94      0.92        78
    Obesity_Type_II       0.97      0.98      0.97        58
   Obesity_Type_III       0.98      1.00      0.99        63
 Overweight_Level_I       0.76      0.84      0.80        56
Overweight_Level_II       0.81      0.86      0.83        50

           accuracy                        



In [21]:
# Define individual classifiers
log_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42))
])

rf_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

k_clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])
# Create ensemble
voting_clf_lr = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('rf', rf_clf),
    ('k', k_clf)
], voting='soft')

# Fit on training data
voting_clf_lr.fit(X_train, y_train)

In [22]:
# Predict on training set
train_preds_lr = voting_clf_lr.predict(X_train)
print(classification_report(y_train, train_preds_lr))

                     precision    recall  f1-score   support

Insufficient_Weight       0.99      1.00      1.00       216
      Normal_Weight       0.98      0.97      0.98       225
     Obesity_Type_I       1.00      0.99      1.00       273
    Obesity_Type_II       1.00      1.00      1.00       239
   Obesity_Type_III       1.00      1.00      1.00       261
 Overweight_Level_I       0.98      0.97      0.98       234
Overweight_Level_II       0.98      1.00      0.99       240

           accuracy                           0.99      1688
          macro avg       0.99      0.99      0.99      1688
       weighted avg       0.99      0.99      0.99      1688



In [23]:
# Predict on test set
test_preds_lr = voting_clf_lr.predict(X_test)
print(classification_report(y_test, test_preds_lr))

                     precision    recall  f1-score   support

Insufficient_Weight       0.86      1.00      0.93        56
      Normal_Weight       0.89      0.63      0.74        62
     Obesity_Type_I       0.95      0.95      0.95        78
    Obesity_Type_II       0.95      0.98      0.97        58
   Obesity_Type_III       1.00      1.00      1.00        63
 Overweight_Level_I       0.81      0.84      0.82        56
Overweight_Level_II       0.84      0.92      0.88        50

           accuracy                           0.90       423
          macro avg       0.90      0.90      0.90       423
       weighted avg       0.90      0.90      0.90       423



#### **4.- Decision Tree**

In [None]:
# Load data
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# Features & target
X = df.drop("NObeyesdad", axis=1)
y = df["NObeyesdad"]

# Column types
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()
numerical_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Decision Tree pipeline
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Train and evaluate base model
dt_pipeline.fit(X_train, y_train)
y_pred_dt = dt_pipeline.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))


Confusion Matrix:
 [[47  7  0  0  0  0  0]
 [ 2 49  0  0  0  7  0]
 [ 0  0 67  1  0  0  2]
 [ 0  0  3 57  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  5  0  0  0 50  3]
 [ 0  0  4  0  0  1 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.87      0.91        54
      Normal_Weight       0.80      0.84      0.82        58
     Obesity_Type_I       0.91      0.96      0.93        70
    Obesity_Type_II       0.97      0.95      0.96        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.86      0.86      0.86        58
Overweight_Level_II       0.91      0.91      0.91        58

           accuracy                           0.91       423
          macro avg       0.92      0.91      0.91       423
       weighted avg       0.92      0.91      0.92       423



#### **Fine-tune parameters:**


In [53]:
# Define hyperparameter grid
dt_params = {
    'classifier__max_depth': [5, 10, 15, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Grid search with cross-validation
dt_grid = GridSearchCV(dt_pipeline, dt_params, cv=5, scoring='f1_macro')
dt_grid.fit(X_train, y_train)

# Evaluate best model
best_dt_model = dt_grid.best_estimator_
y_pred_best_dt = best_dt_model.predict(X_test)

print("Best Parameters:", dt_grid.best_params_)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best_dt))


Best Parameters: {'classifier__max_depth': 15, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}
Confusion Matrix:
 [[47  7  0  0  0  0  0]
 [ 2 49  0  0  0  7  0]
 [ 0  0 67  1  0  0  2]
 [ 0  0  3 57  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  5  0  0  0 50  3]
 [ 0  0  4  0  0  1 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.87      0.91        54
      Normal_Weight       0.80      0.84      0.82        58
     Obesity_Type_I       0.91      0.96      0.93        70
    Obesity_Type_II       0.97      0.95      0.96        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.86      0.86      0.86        58
Overweight_Level_II       0.91      0.91      0.91        58

           accuracy                           0.91       423
          macro avg       0.92      0.91      0.91       423
       weighted avg       0.92      0.91      0.92       

#### **Train ensemble:**

In [47]:
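# Single decision tree pipeline, kept for reference (not fitted in this cell)
dt_base = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Bagging ensemble: 50 decision trees, each trained on a bootstrap sample of the training data
bagged_dt = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        random_state=42))
])

# Fit ensemble
bagged_dt.fit(X_train, y_train)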


#### **Analyze performance on training set:**

In [50]:
train_preds_dt = bagged_dt.predict(X_train)
print(classification_report(y_train, train_preds_dt))

                     precision    recall  f1-score   support

Insufficient_Weight       1.00      1.00      1.00       218
      Normal_Weight       1.00      1.00      1.00       229
     Obesity_Type_I       1.00      1.00      1.00       281
    Obesity_Type_II       1.00      1.00      1.00       237
   Obesity_Type_III       1.00      1.00      1.00       259
 Overweight_Level_I       1.00      1.00      1.00       232
Overweight_Level_II       1.00      1.00      1.00       232

           accuracy                           1.00      1688
          macro avg       1.00      1.00      1.00      1688
       weighted avg       1.00      1.00      1.00      1688



#### **Report performance on test set:**

In [49]:
test_preds_dt = bagged_dt.predict(X_test)
print(classification_report(y_test, test_preds_dt))

                     precision    recall  f1-score   support

Insufficient_Weight       0.96      0.89      0.92        54
      Normal_Weight       0.87      0.91      0.89        58
     Obesity_Type_I       0.92      1.00      0.96        70
    Obesity_Type_II       1.00      0.97      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.95      0.95      0.95        58
Overweight_Level_II       0.98      0.95      0.96        58

           accuracy                           0.95       423
          macro avg       0.95      0.95      0.95       423
       weighted avg       0.95      0.95      0.95       423

