Dataset:
- https://www.kaggle.com/datasets/ikjotsingh221/obesity-risk-prediction-cleaned/data
- https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition 

### 1. Preparation and Overview (3 points total)

- **[2 points]** Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results. For example, would the model be deployed or used mostly for offline analysis? As in previous labs, also detail how good the classifier needs to perform in order to be useful.

### Business Understanding
- **Obesity** is a growing global health issue that raises the risk of diseases such as diabetes, cardiovascular disease, and certain cancers. According to the World Health Organization (WHO), in 2022, 2.5 billion adults were overweight, and about 890 million of those were living with obesity (source: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight). This rise in obesity has placed significant strain on healthcare systems, making early identification and management of obesity critical to addressing this epidemic.

- The **Estimation of Obesity Levels Based on Eating Habits and Physical Condition** dataset contains information on individuals' lifestyle habits (such as eating behavior, physical activity, and family health history) along with their corresponding obesity levels. The obesity levels are classified into seven categories: underweight (insufficient weight), normal weight, overweight level I, overweight level II, obesity type I, obesity type II, and obesity type III. The classification task will involve using these lifestyle attributes to predict an individual's obesity level.

- The objective is to develop a machine learning model that accurately predicts the obesity level of an individual based on their lifestyle habits, with the goal of aiding public health efforts in early obesity detection and personalized health interventions. 

### Business Use-Case
- **Public Health**: One of the critical goals of public health organizations is to prevent chronic diseases associated with obesity, such as Type 2 diabetes and heart disease. By predicting obesity levels early, individuals can be targeted with personalized interventions, reducing the risk of severe health complications later on.
- **Healthcare Providers**: Hospitals and clinics can use this classification model to create preventive healthcare plans tailored to individuals. For example, individuals classified as overweight or at risk of obesity could be enrolled in nutrition or physical activity programs to mitigate the progression to higher obesity levels.
- **Fitness and Wellness Industry**: Personalized fitness and wellness programs rely heavily on understanding an individual's current health condition. This model can assist personal trainers, dieticians, and fitness apps in recommending targeted lifestyle changes, exercises, or dietary regimens based on the user's predicted obesity level.

### Interested Parties
- **Public Health Organizations**: Government bodies and organizations like WHO and the American Obesity Foundation (AOF) can benefit from the insights provided by this model, using it to design mass intervention programs or tailor their awareness campaigns based on obesity risk.
- **Healthcare Providers and Insurers**: Hospitals and insurance companies might be interested in using these predictions to prioritize care for high-risk individuals and create preventive strategies that reduce long-term healthcare costs.
- **Fitness and Wellness Companies**: Companies offering fitness coaching, wellness programs, and nutrition planning can use this predictive tool to personalize health plans and improve client outcomes.
- **Pharmaceutical Companies**: Lastly, understanding obesity levels in populations can also be valuable for pharmaceutical companies that develop weight management drugs, as it helps in identifying the target populations for such treatments.

### Classification Task
- This task is a **multi-class classification** problem where the target variable, `NObeyesdad`, is the obesity level of an individual. The labels cover seven obesity categories:
    - underweight (insufficient weight), normal weight, overweight level I, overweight level II, obesity type I, obesity type II, obesity type III
- The goal is to build a model that can accurately predict which category a person falls into based on a variety of lifestyle factors, such as caloric intake monitoring, frequency of physical activity, and family health history.
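
As a small illustration of the target encoding, the sketch below maps integer codes to class names and computes class shares. The code-to-name mapping is an assumption: the cleaned CSV used in this notebook stores `NObeyesdad` as integers 0 to 6 but does not document which code stands for which category, so the ordering here (increasing weight status) is a guess. `OBESITY_LABELS` and `class_distribution` are hypothetical names introduced for this example.

```python
from collections import Counter

# ASSUMPTION: integer codes follow increasing weight status.
# The cleaned dataset does not document this encoding.
OBESITY_LABELS = {
    0: "Insufficient_Weight",
    1: "Normal_Weight",
    2: "Overweight_Level_I",
    3: "Overweight_Level_II",
    4: "Obesity_Type_I",
    5: "Obesity_Type_II",
    6: "Obesity_Type_III",
}


def class_distribution(codes):
    """Return {class name: share of samples} for a sequence of integer codes."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {OBESITY_LABELS[c]: counts[c] / total for c in sorted(counts)}


# Target values of the five rows displayed by df.head() in this notebook
sample_codes = [1, 1, 1, 2, 3]
distribution = class_distribution(sample_codes)
print(distribution)
```

On the full dataframe, `df["NObeyesdad"].map(OBESITY_LABELS).value_counts(normalize=True)` would give the same kind of summary and is a quick way to check how balanced the seven classes are.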

### Deployment vs Offline Analysis
- The model would be used mostly for **offline analysis**, helping public health organizations in analyzing population-wide obesity trends. It can be incorporated into health surveillance systems to monitor obesity levels in different demographics and regions, enabling organizations to tailor their awareness campaigns, interventions, and resource allocation. The model’s results could guide strategic decisions such as where to open more fitness centers or run more frequent public health programs focused on obesity prevention.
- For healthcare providers, the model can be used offline in hospitals and clinics to identify at-risk individuals and target interventions based on their lifestyle habits. Physicians could leverage the insights during patient checkups or preventive care planning to offer personalized advice on nutrition, exercise, or lifestyle changes. Although the model would mostly be used for analysis rather than real-time prediction, it could easily be integrated into electronic health record (EHR) systems to flag patients for obesity risk over time.

### Model Performance 
- To be useful, the model should reach an **accuracy of 85-90%** across all obesity levels. Accuracy alone is not sufficient, however: per-class precision and recall, particularly for the overweight and obese categories, determine how often the model produces false positives and false negatives. Misclassifying individuals as obese when they are not can lead to unnecessary anxiety and potentially unwarranted medical interventions. Conversely, failing to identify individuals at risk of obesity may result in delayed interventions and worsening health conditions.
- For this reason, the F1 score, the harmonic mean of precision and recall, will be a key metric for evaluating the overall performance of the model. An F1 score in the range of 0.85 to 0.90 would be considered satisfactory for the business use cases mentioned above.
- Machine learning models are already being used for obesity prediction:
    - A study conducted on Indonesian health data achieved 72% accuracy using logistic regression, with a specificity of 71% and precision of 69% (source: https://www.frontiersin.org/journals/nutrition/articles/10.3389/fnut.2021.669155/full). One potential limitation of this study may have been inadequate handling of imbalanced datasets, where certain obesity levels could be underrepresented. To address and improve on this, our model will employ techniques such as oversampling minority classes or using class weights within the logistic regression algorithm, which can help the model better distinguish between different obesity levels and reduce bias toward the majority class. 
    - In another study, a deep learning approach using Bi-LSTM combined with attention mechanisms achieved 96.5% accuracy for obesity prediction. The model was noted for its ability to handle more complex, multifactorial data and outperformed traditional methods (source: https://www.mdpi.com/2306-5354/11/6/533). Logistic regression is a simpler model than Bi-LSTM with attention, but it can still achieve competitive results with careful optimization. It also has the advantage of interpretability: stakeholders such as healthcare providers can understand the decision process behind predictions.
- By employing class weighting or oversampling together with careful tuning, our model could achieve an accuracy of 85-90%, with precision and recall optimized in the critical classes (overweight and obese categories). While it may not fully match the 96.5% accuracy of deep learning models like Bi-LSTM, a well-optimized logistic regression model offers the benefits of simplicity, transparency, and lower computational costs, making it more accessible.
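
The class weighting and F1 evaluation discussed above can be sketched without any libraries. This is a minimal illustration on toy labels, not the notebook's actual pipeline: `balanced_class_weights` follows the common "balanced" heuristic `n_samples / (n_classes * class_count)`, and `macro_f1` averages the per-class F1 scores with equal weight per class. All function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from one-vs-rest counts."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels for illustration only (not taken from the dataset)
y_true = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 2]

weights = balanced_class_weights(y_true)
scores = per_class_scores(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
print(weights)
print(scores)
print(round(overall, 4))
```

In the notebook itself, scikit-learn offers the same ideas through `LogisticRegression(class_weight="balanced")` and `f1_score(y_true, y_pred, average="macro")`. Reporting the per-class scores alongside the macro average makes it visible when the overweight and obese classes lag behind the others.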

- **[.5 points]** (mostly the same processes as from previous labs) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis (give reasoning). Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created). Provide a breakdown of the variables after preprocessing (such as the mean, std, etc. for all variables, including numeric and categorical).

### 1.2 Loading the Dataset & Defining Data Types 

In [1]:
# Modules & Libraries
import pandas as pd
import numpy as np

In [2]:
# Loading the dataset
path = '../../Data/estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.csv'

# Read in csv file
df = pd.read_csv(path)
df.head()

| | Height | Weight | family_history_with_overweight | SCC | MTRANS_Walking | FAVC_z | FCVC_minmax | NCP_z | CAEC_minmax | CH2O_minmax | FAF_minmax | TUE_z | CALC_z | Age_bin_minmax | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.62 | 64.0 | 1 | 0 | 0 | 2.766876 | 0.5 | 0.404704 | 0.333333 | 0.5 | 0.0 | 0.550985 | 1.439033 | 0.25 | 1 |
| 1 | 1.52 | 56.0 | 1 | 1 | 0 | 2.766876 | 1.0 | 0.404704 | 0.333333 | 1.0 | 1.0 | 1.092724 | 0.516552 | 0.25 | 1 |
| 2 | 1.8 | 77.0 | 1 | 0 | 0 | 2.766876 | 0.5 | 0.404704 | 0.333333 | 0.5 | 0.666667 | 0.550985 | 2.472136 | 0.5 | 1 |
| 3 | 1.8 | 87.0 | 0 | 0 | 1 | 2.766876 | 1.0 | 0.404704 | 0.333333 | 0.5 | 0.666667 | 1.092724 | 2.472136 | 0.75 | 2 |
| 4 | 1.78 | 89.8 | 0 | 0 | 0 | 2.766876 | 0.5 | 2.164116 | 0.333333 | 0.5 | 0.0 | 1.092724 | 0.516552 | 0.5 | 3 |


In [3]:
df.describe()

| | Height | Weight | family_history_with_overweight | SCC | MTRANS_Walking | FAVC_z | FCVC_minmax | NCP_z | CAEC_minmax | CH2O_minmax | FAF_minmax | TUE_z | CALC_z | Age_bin_minmax | NObeyesdad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 | 2086.0 |
| mean | 1.702045 | 86.622985 | 0.817354 | 0.045062 | 0.026366 | 0.639326 | 0.709818 | 0.76367 | 0.379834 | 0.504361 | 0.337873 | 0.843202 | 0.855417 | 0.49976 | 3.110259 |
| std | 0.093419 | 26.256245 | 0.386469 | 0.207491 | 0.16026 | 0.76912 | 0.267493 | 0.645761 | 0.155956 | 0.306578 | 0.283687 | 0.537726 | 0.518064 | 0.353723 | 1.993832 |
| min | 1.45 | 39.0 | 0.0 | 0.0 | 0.0 | 0.361418 | 0.0 | 0.002375 | 0.0 | 0.0 | 0.0 | 0.000146 | 0.516552 | 0.0 | 0.0 |
| 25% | 1.63 | 65.130595 | 1.0 | 0.0 | 0.0 | 0.361418 | 0.5 | 0.404704 | 0.333333 | 0.291005 | 0.042901 | 0.466622 | 0.516552 | 0.25 | 1.0 |
| 50% | 1.701383 | 83.0 | 1.0 | 0.0 | 0.0 | 0.361418 | 0.695087 | 0.404704 | 0.333333 | 0.5 | 0.333333 | 0.813973 | 0.516552 | 0.5 | 3.0 |
| 75% | 1.76877 | 108.009452 | 1.0 | 0.0 | 0.0 | 0.361418 | 1.0 | 1.031717 | 0.333333 | 0.740243 | 0.557356 | 1.092724 | 1.439033 | 0.75 | 5.0 |
| max | 1.98 | 165.057269 | 1.0 | 1.0 | 1.0 | 2.766876 | 1.0 | 2.164116 | 1.0 | 1.0 | 1.0 | 2.194694 | 4.427721 | 1.0 | 6.0 |


In [4]:
# Returns the dimensions of the dataframe as (number of rows, number of columns)
df.shape

(2086, 15)

In [5]:
# Returns an index object containing the col labels of the dataframe
df.columns

Index(['Height', 'Weight', 'family_history_with_overweight', 'SCC',
       'MTRANS_Walking', 'FAVC_z', 'FCVC_minmax', 'NCP_z', 'CAEC_minmax',
       'CH2O_minmax', 'FAF_minmax', 'TUE_z', 'CALC_z', 'Age_bin_minmax',
       'NObeyesdad'],
      dtype='object')
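
The `_minmax` and `_z` suffixes in the column names suggest that the cleaned dataset was produced with min-max scaling and z-score standardization. The sketch below shows the textbook versions of both transforms on hypothetical raw values. It is an assumption that these match the cleaning step: the `_z` columns in this file are all non-negative in the `df.describe()` output, so the exact transform applied upstream may differ from a plain z-score.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values on a 1-3 scale (not taken from the dataset)
raw = [1, 2, 3]
scaled = min_max_scale(raw)
standardized = z_score(raw)
print(scaled)
print(standardized)
```

A min-max scaled column always spans 0 to 1, which agrees with the `min` and `max` rows reported for the `_minmax` columns in the summary statistics.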

In [6]:
# Provides a summary of the dataframe including data types, non-null values, and memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2086 entries, 0 to 2085
Data columns (total 15 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Height                          2086 non-null   float64
 1   Weight                          2086 non-null   float64
 2   family_history_with_overweight  2086 non-null   int64  
 3   SCC                             2086 non-null   int64  
 4   MTRANS_Walking                  2086 non-null   int64  
 5   FAVC_z                          2086 non-null   float64
 6   FCVC_minmax                     2086 non-null   float64
 7   NCP_z                           2086 non-null   float64
 8   CAEC_minmax                     2086 non-null   float64
 9   CH2O_minmax                     2086 non-null   float64
 10  FAF_minmax                      2086 non-null   float64
 11  TUE_z                           2086 non-null   float64
 12  CALC_z                          20

### 1.3 Dividing data into Training & Testing Splits

- **[.5 points]** Divide your data into training and testing splits using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. Argue "for" or "against" splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?
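
To make the 80/20 split concrete, here is a dependency-free sketch of a stratified split, which keeps each class's share the same in the training and test sets. It is an illustration only: `stratified_split` is a hypothetical helper, and the labels are synthetic (perfectly balanced, with the same row count as this dataset). In the notebook, scikit-learn's `train_test_split(X, y, test_size=0.2, stratify=y)` from `sklearn.model_selection` would be the usual way to do this.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), drawing test_size of each class for the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)  # shuffle within the class before cutting
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# ASSUMPTION: synthetic, perfectly balanced labels (2086 rows, 7 classes)
labels = [i % 7 for i in range(2086)]
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx))
print(test_counts)
```

With 2086 rows, a 20% test set holds a little over 400 samples, or roughly 60 per class if the seven classes are close to balanced. Whether that is enough for stable per-class precision and recall is the main point to weigh when arguing for or against the 80/20 split.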