New plan: 
1. EDA of real data
2. preprocessing
3. train, val, test with only real data
4. recall, precision, micro macro avg etc.
5. hyper param tuning?
6. identify rows that are identical in the synthetic and the real data
7. train, val, test with real and synthetic data
8. compare results
9. conclusion

---

# Multi-Class Prediction of Obesity Risk

Intro here

# Setup

In [None]:
!pip install ydata-profiling
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.16-py2.py3-none-any.whl (14 kB)
Collecting xgboost
  Downloading xgboost-3.0.0-py3-none-manylinux_2_28_x86_64.whl (253.9 MB)

In [7]:
import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin

import lazypredict
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint



ModuleNotFoundError: No module named 'lazypredict'

In [None]:
train = pd.read_csv("/kaggle/input/playground-series-s4e2/train.csv") 
test = pd.read_csv('/kaggle/input/playground-series-s4e2/test.csv')

# Understanding the Data

## Data Description

Certain attributes indicate an elevated risk of developing obesity. We want to understand which factors influence this risk, and then predict the obesity risk class from the given features.

The risk classes, defined by body mass index (BMI, in kg/m²), are:
* Underweight: less than 18.5
* Normal: 18.5 to 24.9
* Overweight: 25.0 to 29.9
* Obesity I: 30.0 to 34.9
* Obesity II: 35.0 to 39.9
* Obesity III: higher than 40
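
The thresholds above can be turned into a small lookup, sketched here under two assumptions that the list itself does not settle: the ranges are read as half-open intervals (so a BMI of 24.95 counts as Normal), and a BMI of exactly 40 is assigned to Obesity III. The function names `bmi` and `bmi_class` are illustrative only. Note also that the target column `NObeyesdad` used later has seven labels, so this six-way scheme does not map one-to-one onto it.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def bmi_class(value: float) -> str:
    """Map a BMI value to one of the six classes listed above.

    Assumption: each range is half-open, [lower, next lower bound).
    """
    if value < 18.5:
        return "Underweight"
    if value < 25.0:
        return "Normal"
    if value < 30.0:
        return "Overweight"
    if value < 35.0:
        return "Obesity I"
    if value < 40.0:
        return "Obesity II"
    return "Obesity III"


# Example: 70 kg at 1.75 m gives a BMI of roughly 22.9
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1), bmi_class(example_bmi))
```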

Features:

The attributes related to eating habits are:
* Frequent consumption of high caloric food (FAVC)
* Frequency of consumption of vegetables (FCVC)
* Number of main meals (NCP)
* Consumption of food between meals (CAEC)
* Daily consumption of water (CH2O)
* Consumption of alcohol (CALC)

The attributes related to physical condition are:
* Calories consumption monitoring (SCC)
* Physical activity frequency (FAF)
* Time using technology devices (TUE)
* Transportation used (MTRANS)

## EDA

In [None]:
profile_train = ProfileReport(train, title="Profiling Report - Train data")
profile_test = ProfileReport(test, title="Profiling Report - Test data")

In [None]:
profile_train

In [None]:
profile_test

In [None]:
# Manually specify the desired order of the NObeyesdad categories
ordered_categories = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']

# Plotting with manually specified category order
plt.figure(figsize=(12, 8))
sns.boxplot(x='NObeyesdad', y='Weight', data=train, order=ordered_categories, color='lightgrey')
plt.title('Weight vs NObeyesdad Categories')
plt.ylabel('Weight (kg)')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Plotting with manually specified category order and setting the color to light grey
plt.figure(figsize=(12, 8))
sns.boxplot(x='NObeyesdad', y='Age', data=train, order=ordered_categories, color='lightgrey')
plt.title('Age vs NObeyesdad Categories')
plt.ylabel('Age')
plt.xticks(rotation=45)  # Rotating the x-axis labels for better readability
plt.show()

In [None]:
# Plotting with manually specified category order
plt.figure(figsize=(12, 8))
sns.boxplot(x='NObeyesdad', y='Height', data=train, order=ordered_categories, color='lightgrey')
plt.title('Height vs NObeyesdad Categories')
plt.ylabel('Height')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Filter the DataFrame to include only rows where FCVC has values 1, 2, or 3
filtered_data = train[train['FCVC'].isin([1, 2, 3])]

# Create a DataFrame with counts of 'FCVC' for each 'NObeyesdad' category
counts_df_fcvc_stacked = filtered_data.groupby(['NObeyesdad', 'FCVC']).size().unstack(fill_value=0)

# Reorder the DataFrame according to the manually specified order of the NObeyesdad categories
counts_df_fcvc_stacked = counts_df_fcvc_stacked.reindex(ordered_categories)

# Plotting a stacked bar plot for FCVC
# Let pandas create the figure; a separate plt.figure() call would leave an empty extra figure
counts_df_fcvc_stacked.plot(kind='bar', stacked=True, colormap='Blues', figsize=(12, 8))
plt.title('Stacked Distribution of FCVC within NObeyesdad Categories')
plt.ylabel('Counts of FCVC Levels')
plt.xticks(rotation=45) 
plt.legend(title='FCVC Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Filter the DataFrame to include only rows where NCP has values 1, 2, 3 or 4
filtered_data_ncp = train[train['NCP'].isin([1, 2, 3, 4])]

# Create a DataFrame with counts of 'NCP' for each 'NObeyesdad' category
counts_df_ncp_stacked = filtered_data_ncp.groupby(['NObeyesdad', 'NCP']).size().unstack(fill_value=0)

# Reorder the DataFrame according to the manually specified order of the NObeyesdad categories
counts_df_ncp_stacked = counts_df_ncp_stacked.reindex(ordered_categories)

# Plotting a stacked bar plot for NCP
# Let pandas create the figure; a separate plt.figure() call would leave an empty extra figure
counts_df_ncp_stacked.plot(kind='bar', stacked=True, colormap='Blues', figsize=(12, 8))
plt.title('Stacked Distribution of NCP within NObeyesdad Categories')
plt.ylabel('Counts of NCP Levels')
plt.xticks(rotation=45)
plt.legend(title='NCP Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Filter the DataFrame to include only rows where CH2O has values 1, 2, or 3
filtered_data_ch2O = train[train['CH2O'].isin([1, 2, 3])]

# Create a DataFrame with counts of 'CH2O' for each 'NObeyesdad' category
counts_df_ch2O_stacked = filtered_data_ch2O.groupby(['NObeyesdad', 'CH2O']).size().unstack(fill_value=0)

# Reorder the DataFrame according to the manually specified order of the NObeyesdad categories
counts_df_ch2O_stacked = counts_df_ch2O_stacked.reindex(ordered_categories)

# Plotting a stacked bar plot for CH2O
# Let pandas create the figure; a separate plt.figure() call would leave an empty extra figure
counts_df_ch2O_stacked.plot(kind='bar', stacked=True, colormap='Blues', figsize=(12, 8))
plt.title('Stacked Distribution of H2O within NObeyesdad Categories')
plt.ylabel('Counts of H2O Levels')
plt.xticks(rotation=45)
plt.legend(title='H2O Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

In [None]:
# Filter the DataFrame to include only rows where FAF has values 0, 1, 2, or 3
filtered_data_faf = train[train['FAF'].isin([0, 1, 2, 3])]

# Create a DataFrame with counts of 'FAF' for each 'NObeyesdad' category
counts_df_faf_stacked = filtered_data_faf.groupby(['NObeyesdad', 'FAF']).size().unstack(fill_value=0)

# Reorder the DataFrame according to the manually specified order of the NObeyesdad categories
counts_df_faf_stacked = counts_df_faf_stacked.reindex(ordered_categories)

# Plotting a stacked bar plot for FAF
# Let pandas create the figure; a separate plt.figure() call would leave an empty extra figure
counts_df_faf_stacked.plot(kind='bar', stacked=True, colormap='Blues', figsize=(12, 8))
plt.title('Stacked Distribution of FAF within NObeyesdad Categories')
plt.ylabel('Counts of FAF Levels')
plt.xticks(rotation=45)
plt.legend(title='FAF Level', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

**Observations:**
* Weight clearly influences the obesity risk class, as expected.
* Age does not seem to influence the risk class: its distribution is almost the same in every target class.
* Height does not seem to influence the risk class either: its distribution is almost the same in every target class, although people with Obesity Type II tend to be taller than the rest.
* People with Obesity Type III who eat fewer than three main meals are rare.
* The physical activity frequency of people with Obesity Type II and III is quite low compared to the other classes.
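
Observations read off box plots can be backed by per-class summary numbers. The sketch below is one possible way to do that; the `toy` frame holds made-up values purely for illustration (it is not the competition data), and `class_medians` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only; replace `toy` with the real `train` frame
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight", "Normal_Weight",
                   "Obesity_Type_II", "Obesity_Type_II"],
    "Weight": [60.0, 70.0, 110.0, 120.0],
    "FAF": [2.0, 1.0, 0.0, 1.0],
})


def class_medians(df, target="NObeyesdad", columns=("Weight", "FAF")):
    """Median of each selected column within every target class."""
    return df.groupby(target)[list(columns)].median()


summary = class_medians(toy)
print(summary)
```

On the real data, calling `class_medians(train, columns=("Age", "Height", "Weight", "FAF"))` would give one row per class, which makes it easier to check statements such as "almost equal" against actual numbers.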

# Preprocessing

* drop id column 
* normalize age, height and weight
* transform boolean to binary
* transform categorical variables
* handle class imbalance
* add BMI column
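
The list above names class imbalance, but the notebook does not say which technique it intends. One common option, shown here only as a sketch on made-up labels, is to derive "balanced" class weights with scikit-learn's `compute_class_weight`, which assigns each class the weight `n_samples / (n_classes * class_count)`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up, deliberately imbalanced labels: 6 x "A", 3 x "B", 1 x "C"
labels = np.array(["A"] * 6 + ["B"] * 3 + ["C"] * 1)
classes = np.unique(labels)

# 'balanced' weights: n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes.tolist(), weights.tolist()))
print(class_weights)
```

A dictionary like this can be passed as `class_weight` to estimators that accept it, for example `RandomForestClassifier` or `SVC`. Whether weighting actually helps here would need to be checked against the validation scores.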

In [None]:
# Add Body-Mass-Index Column
train['BMI'] = train['Weight'] / (train['Height'] ** 2)
test['BMI'] = test['Weight'] / (test['Height'] ** 2)

In [None]:
"""
# Function to calculate Basal Metabolic Rate
def calculate_bmr(row):
    height_cm = row['Height'] * 100  # Convert height from meters to centimeters
    if row['Gender'] == 'Male':
        bmr = 66.47 + (13.75 * row['Weight']) + (5.003 * height_cm) - (6.755 * row['Age'])
    else:  # Female
        bmr = 655.1 + (9.563 * row['Weight']) + (1.850 * height_cm) - (4.676 * row['Age'])
    return bmr

# Applying the function to create a new column 'BMR'
train['BMR'] = train.apply(calculate_bmr, axis=1)
test['BMR'] = test.apply(calculate_bmr, axis=1)

# Function to calculate IBW using the Devine Formula (height in meters)
def calculate_ibw(row):
    height_cm = row['Height'] * 100  # Convert height from meters to centimeters
    if row['Gender'] == 'Male':
        ibw = 50 + 2.3 * ((height_cm - 152.4) / 2.54)
    else:  # Female
        ibw = 45.5 + 2.3 * ((height_cm - 152.4) / 2.54)
    return ibw

# Apply the function to create the 'IBW' column
train['IBW'] = train.apply(calculate_ibw, axis=1)

# TDEE Calculation Function
def calculate_tdee(row):
    bmr = calculate_bmr(row)
    activity_levels = {
        'Sedentary': 1.2,
        'Lightly_active': 1.375,
        'Moderately_active': 1.55,
        'Very_active': 1.725,
        'Extra_active': 1.9
    }
    return bmr * activity_levels[row['Activity_Level']]
"""

In [None]:
# Save the 'NObeyesdad' column as 'y_train'
y_train = train['NObeyesdad']

# Drop the 'id' and 'NObeyesdad' columns from the DataFrame to create X_train
X_train = train.drop(['id', 'NObeyesdad'], axis=1)

In [None]:
# Split the data into training and validation sets (e.g., 90% train, 10% validation)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=0)

In [None]:
# Custom transformer to convert 'yes' and 'no' to 1 and 0
class BooleanToBinaryTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.replace({'yes': 1, 'no': 0})

# Define the columns you want to standardize, convert booleans, and one-hot encode
numerical_features = ['Age', 'Height', 'Weight', "BMI"]
boolean_features = ['family_history_with_overweight', 'FAVC', 'SMOKE', 'SCC']
categorical_features = ['CAEC', 'CALC', 'MTRANS', "Gender"]

# Create transformers for numerical, boolean, and categorical features
numerical_transformer = Pipeline([
    ('scaler', StandardScaler())
])

boolean_transformer = Pipeline([
    ('boolean_to_binary', BooleanToBinaryTransformer())
])

categorical_transformer = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine transformers into a single preprocessor object
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('bool', boolean_transformer, boolean_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'  # Include remaining columns not specified in transformers
)

# Define base models
base_models = [
    ('ada', AdaBoostClassifier(n_estimators=1000, random_state=0)),
    ('xgb', XGBClassifier(n_estimators=100, random_state=0)),
    ('svc', SVC(kernel='linear', probability=True, random_state=0))
]

# Define the meta-model
meta_model = RandomForestClassifier(n_estimators=1000, random_state=0)

# Create the stacking classifier
stacking_classifier = StackingClassifier(
    estimators=base_models,
    final_estimator=meta_model,
    cv=10
)

# Create the full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', stacking_classifier)
])

# Modeling

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
# Make predictions on the training data
train_predictions = pipeline.predict(X_train)

# Generate a classification report for the training data
classification_report_train = classification_report(y_train, train_predictions)

print("Classification Report (Training Data):\n", classification_report_train)

confusion_matrix_train = confusion_matrix(y_train, train_predictions)

# Create a heatmap for the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_train, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix (Training Data)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

# Evaluation

In [None]:
# Make predictions on the validation data
val_predictions = pipeline.predict(X_val)

# Compare the predictions with the ground-truth validation labels 'y_val'
classification_report_val = classification_report(y_val, val_predictions)

print("Classification Report (Validation Data):\n", classification_report_val)

confusion_matrix_val = confusion_matrix(y_val, val_predictions)

# Create a heatmap for the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_val, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix (Validation Data)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
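
The plan at the top lists recall, precision and micro/macro averages. As a small illustration of how the two averages differ, the sketch below uses made-up labels (not model output): macro averaging takes the unweighted mean of the per-class scores, while micro averaging pools all decisions, so frequent classes dominate.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: class "A" is frequent, class "B" is rare
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]

# Per-class recall is 1.0 for "A" and 0.5 for "B"
macro_recall = recall_score(y_true, y_pred, average="macro")
micro_recall = recall_score(y_true, y_pred, average="micro")

# Per-class precision is 0.8 for "A" and 1.0 for "B"
macro_precision = precision_score(y_true, y_pred, average="macro")
micro_precision = precision_score(y_true, y_pred, average="micro")

print("recall    macro / micro:", macro_recall, micro_recall)
print("precision macro / micro:", macro_precision, micro_precision)
```

With imbalanced classes the macro scores are usually the more informative ones, because a weak rare class lowers them visibly.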

# Hyperparameter-Tuning

In [None]:
"""
# Parameter grid for random search
param_distributions = {
    'classifier__final_estimator__n_estimators': randint(50, 200),
    'classifier__final_estimator__max_depth': [None, 10, 20],
    'classifier__final_estimator__min_samples_split': randint(2, 11)
}

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=3, 
    verbose=1,  
    n_jobs=-1
)

# Fit RandomizedSearchCV to find the best parameters
random_search.fit(X_train, y_train)

print("Best parameters found: ", random_search.best_params_)
"""
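
Because the search above is disabled, here is a minimal runnable sketch of the same idea on generated toy data. It is not the notebook's stacking pipeline: a single `RandomForestClassifier` stands in for it so that the example runs quickly, and all names prefixed with `toy_` are illustrative. The `scoring="f1_macro"`, `cv=3` and `random_state=0` settings are choices made for this sketch, not something the original cell specifies.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Generated three-class toy data, not the competition data
X_toy, y_toy = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=0
)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Parameter names follow the '<step name>__<parameter>' convention
toy_distributions = {
    "classifier__n_estimators": randint(10, 40),
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": randint(2, 11),
}

toy_search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=toy_distributions,
    n_iter=3,            # number of sampled candidates
    cv=3,                # 3-fold cross-validation per candidate
    scoring="f1_macro",  # macro F1 treats all classes equally
    random_state=0,      # makes the sampled candidates reproducible
)
toy_search.fit(X_toy, y_toy)

print("Best parameters found:", toy_search.best_params_)
print("Best cross-validated macro F1:", toy_search.best_score_)
```

For the stacking pipeline the parameter names gain one more level, as in `classifier__final_estimator__n_estimators`, and every candidate refits all base models, so the run time grows quickly with `n_iter` and `cv`.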

In [None]:
"""
# Make predictions on the validation data with the tuned model
fine_tuning_predictions = random_search.predict(X_val)

# Compare the predictions with the ground-truth validation labels 'y_val'
classification_report_ft = classification_report(y_val, fine_tuning_predictions)

print("Classification Report (Validation Data, tuned model):\n", classification_report_ft)

confusion_matrix_ft = confusion_matrix(y_val, fine_tuning_predictions)

# Create a heatmap for the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_ft, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix (Validation Data)')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()
"""

# Submission to Kaggle competition

In [None]:
test_ids = test['id']

# Drop the 'id' column from the DataFrame to create X_test
X_test = test.drop(['id'], axis=1)

In [None]:
# Make predictions on the test data
test_predictions = pipeline.predict(X_test)

In [None]:
# Create a DataFrame for the submission; the prediction column is named after the target, 'NObeyesdad'
submission_df = pd.DataFrame({'id': test_ids, 'NObeyesdad': test_predictions})

# Save the DataFrame to a CSV file
submission_df.to_csv('submission.csv', index=False)

# Analysis with Synthetic Data
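
Step 6 of the plan is to identify rows that occur in both the synthetic and the real data. One possible way to do this is a left merge with pandas' `indicator` flag, sketched below on two made-up frames. The names `real`, `synthetic` and `feature_cols` are placeholders, and the comparison is an exact match on every listed column, so float values that differ only by rounding would not be detected.

```python
import pandas as pd

# Made-up frames for illustration; only the first synthetic row also exists in `real`
real = pd.DataFrame({
    "Age": [21.0, 30.0],
    "Height": [1.62, 1.80],
    "Weight": [64.0, 90.0],
})
synthetic = pd.DataFrame({
    "Age": [21.0, 25.5, 30.0],
    "Height": [1.62, 1.70, 1.80],
    "Weight": [64.0, 70.2, 91.0],
})

feature_cols = ["Age", "Height", "Weight"]

# A left merge keeps every synthetic row; '_merge' records whether a match was found.
# Dropping duplicates in `real` first avoids multiplying synthetic rows.
flagged = synthetic.merge(
    real[feature_cols].drop_duplicates(),
    on=feature_cols,
    how="left",
    indicator=True,
)
flagged["in_real"] = flagged["_merge"] == "both"

print(flagged[feature_cols + ["in_real"]])
print("Synthetic rows also present in the real data:", int(flagged["in_real"].sum()))
```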
