# **Evidence 1: Supervised Learning**

### Rodrigo Jiménez Ortiz A01029623

## **Describe the dataset:**

General information:

* Dataset Name: Estimation of Obesity Levels Based On Eating Habits and Physical Condition

* Instances: 2111 rows

* Features: 16 in total (8 categorical and 8 numerical).

* Target Variable: NObeyesdad (obesity level, 7 classes)

## **Select and train a classifier**

In addition to a Decision Tree classifier, this notebook trains a Logistic Regression model. Logistic Regression is a supervised classification algorithm that estimates the probability that a given input belongs to each class. Its main advantages and disadvantages are:

Advantages: 

* Simple and fast to train

* Easy to interpret (on standardized features, the sign and magnitude of each coefficient indicate the direction and strength of that feature's effect)

* Performs well on linearly separable problems

* Works well with high-dimensional data

Disadvantages:

* Assumes a linear relationship between features and log-odds

* Coefficients become unstable and harder to interpret when features are highly correlated (multicollinearity)

* Sensitive to outliers

* May underperform on complex, non-linear problems
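
To make the "linear relationship between features and log-odds" point concrete, the following is a minimal sketch of how a multinomial (softmax) logistic regression turns one linear score per class into class probabilities. The weights, biases, and the two-feature input are invented for illustration and are not taken from the model fitted in this notebook.

```python
import math

def softmax(scores):
    """Convert raw linear scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, weights, biases):
    """Compute one linear score per class (w . x + b), then apply softmax."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)

# Hypothetical model: 3 classes, 2 standardized features
weights = [[1.5, -0.5], [0.0, 0.2], [-1.5, 0.3]]
biases = [0.1, 0.0, -0.1]
x = [2.0, 0.0]

probs = predict_proba(x, weights, biases)
predicted_class = probs.index(max(probs))
print([round(p, 3) for p in probs], "-> class", predicted_class)
```

Because each score is linear in the features, the decision boundaries between classes are straight lines (hyperplanes), which is why the model can underperform on non-linear problems.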

In [45]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

#### **1.- Logistic Regression**

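The models are evaluated with the following metrics:

* Accuracy: the fraction of all test samples that are classified correctly

* F1-score: the harmonic mean of precision and recall, useful when classes are imbalanced

* Precision/Recall: for analyzing performance on specific obesity categories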

In [35]:
# Load data
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# Features & target
X = df.drop("NObeyesdad", axis=1)
y = df["NObeyesdad"]

# Column types
categoricalCols = X.select_dtypes(include=["object"]).columns.tolist()
numericalCols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Preprocessor
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numericalCols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categoricalCols)
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Logistic Regression pipeline
lr_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, solver='lbfgs'))
])

# Train and evaluate base model
lr_pipeline.fit(X_train, y_train)
y_pred_base = lr_pipeline.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_base))
print("\nClassification Report:\n", classification_report(y_test, y_pred_base))


Confusion Matrix:
 [[54  0  0  0  0  0  0]
 [ 5 43  0  0  0 10  0]
 [ 0  0 64  3  0  0  3]
 [ 0  0  2 58  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  8  0  0  0 43  7]
 [ 0  2  7  0  0  5 44]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.92      1.00      0.96        54
      Normal_Weight       0.81      0.74      0.77        58
     Obesity_Type_I       0.88      0.91      0.90        70
    Obesity_Type_II       0.94      0.97      0.95        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.74      0.74      0.74        58
Overweight_Level_II       0.81      0.76      0.79        58

           accuracy                           0.87       423
          macro avg       0.87      0.87      0.87       423
       weighted avg       0.87      0.87      0.87       423
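
The precision, recall, and F1 values in the report above can be recomputed by hand from the confusion matrix. The sketch below copies the baseline matrix from the output (rows are true classes, columns are predicted classes); the helper name `per_class_metrics` is introduced here for illustration only.

```python
# Baseline Logistic Regression confusion matrix, copied from the output above
cm = [
    [54, 0, 0, 0, 0, 0, 0],
    [5, 43, 0, 0, 0, 10, 0],
    [0, 0, 64, 3, 0, 0, 3],
    [0, 0, 2, 58, 0, 0, 0],
    [0, 0, 0, 1, 64, 0, 0],
    [0, 8, 0, 0, 0, 43, 7],
    [0, 2, 7, 0, 0, 5, 44],
]
labels = [
    "Insufficient_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_II",
    "Obesity_Type_III", "Overweight_Level_I", "Overweight_Level_II",
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for every class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results

metrics = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(sum(row) for row in cm)

for label, (p, r, f1) in zip(labels, metrics):
    print(f"{label:<20} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the matrix this way also shows where the errors concentrate: most misclassifications fall between Normal_Weight, Overweight_Level_I, and Overweight_Level_II, which are neighboring categories.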



#### **Fine-tune parameters:**

In [36]:
# Define hyperparameter grid
lr_params = {
    'classifier__C': [0.1, 1, 10],  
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs', 'saga']
}

# Grid search with cross-validation
lr_grid = GridSearchCV(lr_pipeline, lr_params, cv=5, scoring='f1_macro')
lr_grid.fit(X_train, y_train)

# Evaluate best model
best_lr_model = lr_grid.best_estimator_
y_pred_best = best_lr_model.predict(X_test)

print("Best Parameters:", lr_grid.best_params_)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best))


Best Parameters: {'classifier__C': 10, 'classifier__penalty': 'l2', 'classifier__solver': 'lbfgs'}
Confusion Matrix:
 [[54  0  0  0  0  0  0]
 [ 3 48  0  0  0  7  0]
 [ 0  0 66  1  0  0  3]
 [ 0  0  0 60  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  8  0  0  0 49  1]
 [ 0  0  2  1  0  2 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.95      1.00      0.97        54
      Normal_Weight       0.86      0.83      0.84        58
     Obesity_Type_I       0.97      0.94      0.96        70
    Obesity_Type_II       0.95      1.00      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.84      0.84      0.84        58
Overweight_Level_II       0.93      0.91      0.92        58

           accuracy                           0.93       423
          macro avg       0.93      0.93      0.93       423
       weighted avg       0.93      0.93      0.93       423
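
As a rough sketch of what the grid search above has to evaluate, the snippet below enumerates the candidate combinations from the same parameter grid using only the standard library. It does not train anything; it only counts the work. In scikit-learn, `C` is the inverse of the regularization strength, so the selected `C=10` corresponds to weaker regularization than the default `C=1`.

```python
from itertools import product

# Same grid as in the cell above
lr_params = {
    'classifier__C': [0.1, 1, 10],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs', 'saga'],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate model
names = sorted(lr_params)
candidates = [dict(zip(names, values))
              for values in product(*(lr_params[name] for name in names))]

# Each candidate is fitted once per cross-validation fold
n_cv_fits = len(candidates) * cv_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set, which is the model exposed as `best_estimator_`.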



#### **Train ensemble:**

In [37]:
# Define individual classifiers
log_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, solver='lbfgs', random_state=42))
])

rf_clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Create ensemble
voting_clf_lr = VotingClassifier(estimators=[
    ('lr', log_clf),
    ('rf', rf_clf)
], voting='soft')

# Fit on training data
voting_clf_lr.fit(X_train, y_train)
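
With `voting='soft'` and no weights, the ensemble averages the class probabilities predicted by each model and picks the class with the highest average. The sketch below illustrates that rule for a single sample; the probability values are made up, and `soft_vote` is a hypothetical helper, not part of scikit-learn.

```python
def soft_vote(probas_per_model):
    """Average class probabilities across models; return (winning index, averages)."""
    n_models = len(probas_per_model)
    n_classes = len(probas_per_model[0])
    avg = [sum(p[c] for p in probas_per_model) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Hypothetical predicted probabilities for one sample over three classes
lr_proba = [0.50, 0.45, 0.05]   # logistic regression narrowly prefers class 0
rf_proba = [0.20, 0.70, 0.10]   # random forest clearly prefers class 1

winner, avg = soft_vote([lr_proba, rf_proba])
print("averaged probabilities:", [round(a, 3) for a in avg])
print("ensemble prediction: class", winner)
```

In this example the more confident model decides the outcome, which is the main difference from hard voting, where each model contributes one vote regardless of confidence.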


#### **Analyze performance on training set:**

In [52]:
# Predict on training set
train_preds_lr = voting_clf_lr.predict(X_train)
print(classification_report(y_train, train_preds_lr))


                     precision    recall  f1-score   support

Insufficient_Weight       1.00      1.00      1.00       218
      Normal_Weight       0.98      1.00      0.99       229
     Obesity_Type_I       1.00      1.00      1.00       281
    Obesity_Type_II       1.00      1.00      1.00       237
   Obesity_Type_III       1.00      1.00      1.00       259
 Overweight_Level_I       0.99      0.97      0.98       232
Overweight_Level_II       0.99      0.99      0.99       232

           accuracy                           0.99      1688
          macro avg       0.99      0.99      0.99      1688
       weighted avg       0.99      0.99      0.99      1688
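
The support column is a quick way to check the train/test split. The sketch below uses the per-class supports printed in the training report above and in the earlier test reports to confirm that the two sets add up to the 2111 rows of the dataset, and that `stratify=y` kept roughly 20% of every class in the test set.

```python
import math

labels = [
    "Insufficient_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_II",
    "Obesity_Type_III", "Overweight_Level_I", "Overweight_Level_II",
]
train_support = [218, 229, 281, 237, 259, 232, 232]  # from the training report
test_support = [54, 58, 70, 60, 65, 58, 58]          # from the test reports

n_train = sum(train_support)
n_test = sum(test_support)
n_total = n_train + n_test

# Assumption: the test size is rounded up, which is consistent with the 423 rows reported
expected_test = math.ceil(0.2 * n_total)

# Share of each class that ended up in the test set
test_share = {label: te / (tr + te)
              for label, tr, te in zip(labels, train_support, test_support)}

for label, share in test_share.items():
    print(f"{label:<20} {share:.3f}")
print(n_train, n_test, n_total, expected_test)
```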



#### **Report performance on test set:**

In [51]:
# Predict on test set
test_preds_lr = voting_clf_lr.predict(X_test)
print(classification_report(y_test, test_preds_lr))


                     precision    recall  f1-score   support

Insufficient_Weight       0.98      0.98      0.98        54
      Normal_Weight       0.82      0.86      0.84        58
     Obesity_Type_I       0.94      0.97      0.96        70
    Obesity_Type_II       0.98      0.98      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.85      0.86      0.85        58
Overweight_Level_II       0.96      0.88      0.92        58

           accuracy                           0.93       423
          macro avg       0.93      0.93      0.93       423
       weighted avg       0.94      0.93      0.93       423
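
The two summary rows differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded per-class values printed above, so the results are approximate.

```python
# Per-class F1 and support, copied from the test report above (already rounded)
f1_scores = [0.98, 0.84, 0.96, 0.98, 0.99, 0.85, 0.92]
supports = [54, 58, 70, 60, 65, 58, 58]

# Macro average: plain mean over classes
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: mean over classes, weighted by the number of true samples
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1    = {macro_f1:.2f}")
print(f"weighted F1 = {weighted_f1:.2f}")
```

The two averages are nearly identical here because the classes are close to balanced; on a strongly imbalanced dataset the macro average would penalize poor performance on rare classes more visibly.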



#### **2.- Decision Tree Classifier**

In [42]:
# Load data
df = pd.read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# Features & target
X = df.drop("NObeyesdad", axis=1)
y = df["NObeyesdad"]

# Column types
categoricalCols = X.select_dtypes(include=["object"]).columns.tolist()
numericalCols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numericalCols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categoricalCols)
])

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Decision Tree pipeline (base model)
dt_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Train and evaluate base model
dt_pipeline.fit(X_train, y_train)
y_pred_dt = dt_pipeline.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_dt))


Confusion Matrix:
 [[47  7  0  0  0  0  0]
 [ 2 49  0  0  0  7  0]
 [ 0  0 67  1  0  0  2]
 [ 0  0  3 57  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  5  0  0  0 50  3]
 [ 0  0  4  0  0  1 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.87      0.91        54
      Normal_Weight       0.80      0.84      0.82        58
     Obesity_Type_I       0.91      0.96      0.93        70
    Obesity_Type_II       0.97      0.95      0.96        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.86      0.86      0.86        58
Overweight_Level_II       0.91      0.91      0.91        58

           accuracy                           0.91       423
          macro avg       0.92      0.91      0.91       423
       weighted avg       0.92      0.91      0.92       423
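
The tree above was trained with scikit-learn's default split criterion, Gini impurity. As a minimal sketch of the idea, the snippet below scores two candidate splits of a small, invented set of labels: a split is better when the weighted impurity of its two child nodes is lower. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy node with two classes, four samples each
parent = ["Normal"] * 4 + ["Obese"] * 4

# Split A separates the classes perfectly; split B leaves both children mixed
split_a = (["Normal"] * 4, ["Obese"] * 4)
split_b = (["Normal", "Normal", "Obese", "Obese"], ["Normal", "Normal", "Obese", "Obese"])

print("parent impurity:", gini(parent))
print("split A impurity:", weighted_gini(*split_a))
print("split B impurity:", weighted_gini(*split_b))
```

A tree grown without depth limits keeps splitting until its leaves are pure, which is why unconstrained trees tend to fit the training data almost perfectly.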



#### **Fine-tune parameters:**


In [53]:
# Define hyperparameter grid
dt_params = {
    'classifier__max_depth': [5, 10, 15, None],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

# Grid search with cross-validation
dt_grid = GridSearchCV(dt_pipeline, dt_params, cv=5, scoring='f1_macro')
dt_grid.fit(X_train, y_train)

# Evaluate best model
best_dt_model = dt_grid.best_estimator_
y_pred_best_dt = best_dt_model.predict(X_test)

print("Best Parameters:", dt_grid.best_params_)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_best_dt))
print("\nClassification Report:\n", classification_report(y_test, y_pred_best_dt))


Best Parameters: {'classifier__max_depth': 15, 'classifier__min_samples_leaf': 1, 'classifier__min_samples_split': 2}
Confusion Matrix:
 [[47  7  0  0  0  0  0]
 [ 2 49  0  0  0  7  0]
 [ 0  0 67  1  0  0  2]
 [ 0  0  3 57  0  0  0]
 [ 0  0  0  1 64  0  0]
 [ 0  5  0  0  0 50  3]
 [ 0  0  4  0  0  1 53]]

Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.87      0.91        54
      Normal_Weight       0.80      0.84      0.82        58
     Obesity_Type_I       0.91      0.96      0.93        70
    Obesity_Type_II       0.97      0.95      0.96        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.86      0.86      0.86        58
Overweight_Level_II       0.91      0.91      0.91        58

           accuracy                           0.91       423
          macro avg       0.92      0.91      0.91       423
       weighted avg       0.92      0.91      0.92       423
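
The tuned tree produces exactly the same test predictions as the baseline tree. The sketch below compares the two confusion matrices from the outputs cell by cell. One plausible explanation, which is an assumption and not verified here, is that the selected `min_samples_split=2` and `min_samples_leaf=1` are scikit-learn's defaults and that the unconstrained tree is no deeper than 15 levels, so the search selected a tree equivalent to the baseline.

```python
# Confusion matrix of the baseline decision tree (from the earlier output)
baseline_cm = [
    [47, 7, 0, 0, 0, 0, 0],
    [2, 49, 0, 0, 0, 7, 0],
    [0, 0, 67, 1, 0, 0, 2],
    [0, 0, 3, 57, 0, 0, 0],
    [0, 0, 0, 1, 64, 0, 0],
    [0, 5, 0, 0, 0, 50, 3],
    [0, 0, 4, 0, 0, 1, 53],
]

# Confusion matrix of the tuned decision tree (from the output above)
tuned_cm = [
    [47, 7, 0, 0, 0, 0, 0],
    [2, 49, 0, 0, 0, 7, 0],
    [0, 0, 67, 1, 0, 0, 2],
    [0, 0, 3, 57, 0, 0, 0],
    [0, 0, 0, 1, 64, 0, 0],
    [0, 5, 0, 0, 0, 50, 3],
    [0, 0, 4, 0, 0, 1, 53],
]

# Collect every cell where the two matrices disagree
changed_cells = [(i, j)
                 for i, row in enumerate(baseline_cm)
                 for j, value in enumerate(row)
                 if value != tuned_cm[i][j]]

print("identical:", baseline_cm == tuned_cm)
print("changed cells:", changed_cells)
```

To check the depth explanation in the notebook, the fitted tree's depth can be read with `best_dt_model.named_steps['classifier'].get_depth()`.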

#### **Train ensemble:**

In [47]:
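# Single decision tree pipeline, kept for reference (it is not fitted in this cell)
dt_base = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

# Bagging ensemble: 50 decision trees, each fitted on a bootstrap sample of the training data
bagged_dt = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),
        n_estimators=50,
        random_state=42))
])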

# Fit ensemble
bagged_dt.fit(X_train, y_train)
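
Bagging trains each tree on a bootstrap sample (rows drawn with replacement) and then combines the trees' predictions. The sketch below illustrates both steps with the standard library only. It is a simplification: scikit-learn's `BaggingClassifier` averages predicted probabilities when the base estimator provides them, which for fully grown trees with pure leaves amounts to counting votes. The helper names and the example votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
n_samples = 1688  # size of the training set in this notebook

# One bootstrap sample: same size as the training set, but with repeated rows
sample = bootstrap_indices(n_samples, rng)
unique_share = len(set(sample)) / n_samples
print(f"share of distinct rows in one bootstrap sample: {unique_share:.2f}")

# Hypothetical predictions of three trees for one test sample
votes = ["Normal_Weight", "Overweight_Level_I", "Normal_Weight"]
winner = majority_vote(votes)
print("ensemble prediction:", winner)
```

Each bootstrap sample contains only about 63% of the distinct training rows on average, so the trees see different data and make partly different errors, which the aggregation step then averages out.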


#### **Analyze performance on training set:**

In [50]:
# Predict on training set
train_preds_dt = bagged_dt.predict(X_train)
print(classification_report(y_train, train_preds_dt))

                     precision    recall  f1-score   support

Insufficient_Weight       1.00      1.00      1.00       218
      Normal_Weight       1.00      1.00      1.00       229
     Obesity_Type_I       1.00      1.00      1.00       281
    Obesity_Type_II       1.00      1.00      1.00       237
   Obesity_Type_III       1.00      1.00      1.00       259
 Overweight_Level_I       1.00      1.00      1.00       232
Overweight_Level_II       1.00      1.00      1.00       232

           accuracy                           1.00      1688
          macro avg       1.00      1.00      1.00      1688
       weighted avg       1.00      1.00      1.00      1688



#### **Report performance on test set:**

In [49]:
# Predict on test set
test_preds_dt = bagged_dt.predict(X_test)
print(classification_report(y_test, test_preds_dt))

                     precision    recall  f1-score   support

Insufficient_Weight       0.96      0.89      0.92        54
      Normal_Weight       0.87      0.91      0.89        58
     Obesity_Type_I       0.92      1.00      0.96        70
    Obesity_Type_II       1.00      0.97      0.98        60
   Obesity_Type_III       1.00      0.98      0.99        65
 Overweight_Level_I       0.95      0.95      0.95        58
Overweight_Level_II       0.98      0.95      0.96        58

           accuracy                           0.95       423
          macro avg       0.95      0.95      0.95       423
       weighted avg       0.95      0.95      0.95       423
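
To compare the models side by side, the sketch below collects the accuracies printed in the reports of this notebook and computes the gap between training and test accuracy for the two ensembles. The numbers are copied from the outputs above; only the dictionary layout is introduced here.

```python
# Test accuracy of every model, copied from the classification reports
test_accuracy = {
    "Logistic Regression (baseline)": 0.87,
    "Logistic Regression (tuned)": 0.93,
    "Voting ensemble (LR + RF)": 0.93,
    "Decision Tree (baseline)": 0.91,
    "Decision Tree (tuned)": 0.91,
    "Bagged Decision Trees": 0.95,
}

# Training accuracy was only reported for the two ensembles
train_accuracy = {
    "Voting ensemble (LR + RF)": 0.99,
    "Bagged Decision Trees": 1.00,
}

# Rank models by test accuracy, best first
ranking = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
best_model = ranking[0][0]

# Train/test gap: a larger gap suggests more overfitting
gaps = {name: round(train_accuracy[name] - test_accuracy[name], 2)
        for name in train_accuracy}

for name, acc in ranking:
    print(f"{name:<32} test accuracy = {acc:.2f}")
print("train/test gaps:", gaps)
```

Both ensembles fit the training set almost perfectly and lose 5 to 6 points on the test set, which indicates mild overfitting; the bagged trees still generalize best among the models tried here.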

