# Obesity Level Prediction

## Library

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pickle

- Load all the necessary libraries

## Understanding Data

In [None]:
df= pd.read_csv('ObesityDataSet1.csv')
print(df.shape)
df.head()

- Load the dataset and display 5 sample rows to get an overview of the data  
- I also displayed the shape of the dataset to see the number of rows and columns
- The dataset consists of 1,055 rows and 17 columns
- At this point, an anomaly can already be seen in the "Age" column due to inconsistencies in how the ages are written

In [None]:
df.info()

- Because of this inconsistent formatting, the "Age" column is read as an object data type, whereas it should be numeric  
- At this stage, we can also observe the presence of null values in the "CAEC" and "TUE" columns
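
One way to confirm which "Age" entries block numeric parsing is to coerce the column with `pd.to_numeric` and look at what fails. A minimal sketch on made-up values (the `age_sample` series is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample mimicking the mixed formatting described above
age_sample = pd.Series(["21", "44 years", "30", "19 years"])

# errors="coerce" turns anything that is not a plain number into NaN
parsed = pd.to_numeric(age_sample, errors="coerce")

# Entries that failed to parse are the inconsistently formatted ones
bad_entries = age_sample[parsed.isna()]
print(bad_entries.tolist())
```

In the notebook itself, the same idea would be applied to `df['Age']`.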

In [None]:
df= df.rename(columns= {"family_history_with_overweight": "FamilyHistory" })

- I renamed the "family_history_with_overweight" column to "FamilyHistory" because the original name was too long to type comfortably

In [None]:
df.info()

- The overly long column name is gone, so all column names are now convenient to work with

In [None]:
df.NObeyesdad.value_counts()

- The "NObeyesdad" column is the target variable. Its class distribution is relatively balanced across all categories, so there is no need to apply oversampling techniques

## Pre-processing

### Split Data

In [None]:
x= df[df.columns.drop(['NObeyesdad'])]
y= df['NObeyesdad']

- Separate the target column (`y`) from the feature columns (`x`)

In [None]:
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state= 42, stratify= y)

- The data is split into training and testing sets with an 80:20 ratio. This split allows the model to learn from the training data and be evaluated on unseen data
- The `stratify=y` parameter is used during the split to ensure that the distribution of the target classes remains consistent in both training and testing sets, even though the original class distribution is already fairly balanced
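
The effect of `stratify` can be checked on a small example. The sketch below uses hypothetical toy data (`toy_x`, `toy_y`) with a 50/30/20 class ratio and compares the class proportions before and after the split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, classes in a 50/30/20 ratio
toy_x = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 20)

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Class proportions should match in the full data and in both splits
print(toy_y.value_counts(normalize=True).sort_index())
print(ty_train.value_counts(normalize=True).sort_index())
print(ty_test.value_counts(normalize=True).sort_index())
```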

In [None]:
print("Distribution of Obesity Level in Data Train:")
print(pd.Series(y_train).value_counts())

- The training set remains balanced due to the use of `stratify= y` during the split, and because the original dataset was already fairly well distributed

### Check Distribution in Every Column

In [None]:
print(x_train['Gender'].value_counts(), '\n')

- The "Gender" column in the training set is relatively balanced, with 432 males and 412 females.  
- This balance helps prevent gender-related bias during model training, so the model is less likely to learn patterns skewed toward one gender group.

In [None]:
print(x_train['Age'].value_counts(), '\n')

- An anomaly is found in the "Age" column where some entries include the string "years" (e.g., "44 years") instead of just numeric values. This inconsistency causes the column to be treated as an object type instead of numeric

In [None]:
print(x_train['Height'].value_counts(), '\n')

- There is a wide range of values in the "Height" column, but the most frequently occurring height is 170 cm.

In [None]:
print(x_train['Weight'].value_counts(), '\n')

- Similar to height, the "Weight" column has diverse values, with 75 kg and 70 kg being the most frequently occurring weights

In [None]:
print(x_train['FamilyHistory'].value_counts(), '\n')

- Most entries in the "FamilyHistory" column indicate a family history of obesity, with "yes" being the dominant response

In [None]:
print(x_train['FAVC'].value_counts(), '\n')

- The "FAVC" column indicates whether a person frequently consumes high-calorie food. Most respondents answered "yes", suggesting that high-calorie food consumption is common among the individuals in this dataset

In [None]:
print(x_train['FCVC'].value_counts(), '\n')
print(x_train[~x_train['FCVC'].between(1, 3)])

- The "FCVC" column represents the frequency of vegetable consumption on a scale from 1 to 3. Most individuals report a frequency of 2.0 or 3.0, indicating that many people in the dataset regularly consume vegetables

In [None]:
print(x_train['NCP'].value_counts(), '\n')

- The "NCP" column represents the number of main meals per day. The majority of individuals report having 3 meals a day, which aligns with common eating patterns

In [None]:
print(x_train['CAEC'].value_counts(), '\n')

- The "CAEC" column shows how frequently individuals consume food between meals. Most responses fall under "Sometimes", followed by "Frequently", while fewer individuals answered "Always" or "no" 
- This suggests that snacking between meals is a common habit in the dataset, which could potentially contribute to higher calorie intake and weight gain

In [None]:
print(x_train['SMOKE'].value_counts(), '\n')

- The "SMOKE" column indicates whether individuals smoke. The vast majority answered "no", while only a small number reported smoking
- This suggests that smoking is not a common habit among the participants in this dataset

In [None]:
print(x_train['CH2O'].value_counts(), '\n')
print(x_train[~x_train['CH2O'].between(1, 3)])

- The "CH2O" column represents daily water intake on a scale from 1 to 3. Most individuals reported a value of 2.0, followed by 1.0 and 3.0.  
- This indicates that the majority of people in the dataset consume a moderate amount of water daily, though there is some variation.

In [None]:
print(x_train['SCC'].value_counts(), '\n')

- The "SCC" column indicates whether individuals monitor their calorie intake. Most respondents answered "no", with only a small portion answering "yes"  
- This suggests that calorie tracking is not a common practice among the participants, which may affect their awareness of daily nutritional intake

In [None]:
print(x_train['FAF'].value_counts(), '\n')
print(x_train[~x_train['FAF'].between(0, 3)])

- The "FAF" column represents the frequency of physical activity on a scale from 0 to 3. A significant number of individuals reported 0.0, indicating no physical activity, while others showed varying levels of activity
- This suggests that a notable portion of the dataset leads a sedentary lifestyle, which could contribute to higher obesity risk

In [None]:
print(x_train['TUE'].value_counts(), '\n')
print(x_train[~x_train['TUE'].between(0, 3)])

- The "TUE" column represents time spent using technology on a scale from 0 to 3. Most values fall within this expected range, with 0.0 being the most frequent  
- The code `x_train[~x_train['TUE'].between(0, 3)]` checks for anomalies, specifically values that fall outside the valid range of 0 to 3
- The output includes some rows only because of null values (missing data): `between` evaluates to `False` for `NaN`, so the negated mask selects those rows even though no value is truly out of range
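
To see why missing values show up in this check, and how to tell them apart from genuinely out-of-range values, here is a small sketch on a hypothetical series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing entry and one genuinely out of range
tue = pd.Series([0.0, 1.5, np.nan, 3.0, 4.2], name="TUE")

in_range = tue.between(0, 3)   # NaN evaluates to False here
flagged = tue[~in_range]       # contains both the NaN and 4.2

# Separate the two cases explicitly
missing = tue[tue.isna()]
out_of_range = tue[tue.notna() & ~in_range]

print(len(flagged), len(missing), out_of_range.tolist())
```

Filtering with `notna()` first leaves only the true range violations, which makes the anomaly check easier to read.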

In [None]:
print(x_train['CALC'].value_counts(), '\n')

- The "CALC" column indicates the frequency of alcohol consumption, with categories such as "no", "Sometimes", "Frequently", and "Always"
- Most individuals selected "Sometimes", followed by "no", suggesting that occasional alcohol consumption is relatively common in the dataset

In [None]:
print(x_train['MTRANS'].value_counts(), '\n')

- The "MTRANS" column represents the main mode of transportation used by individuals  
- The majority rely on public transportation, followed by automobiles. Very few individuals use walking, motorbikes, or bicycles  
- This suggests a high dependency on motorized transport, which may correlate with lower physical activity levels

### Anomaly Handling

#### Age Column

In [None]:
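import re

def clean_age(age_val):
  # Strings such as "44 years" keep only their first number, including any decimal part
  if isinstance(age_val, str):
    match= re.search(r'\d+(?:\.\d+)?', age_val)
    return float(match.group()) if match else np.nan
  return age_val

x_train['Age']= x_train['Age'].apply(clean_age)
x_test['Age']= x_test['Age'].apply(clean_age)

- A custom function `clean_age` is defined to clean the "Age" column. If the value is a string (e.g., "44 years"), it extracts the first number in the string and converts it to a float. A regular expression is used instead of keeping every digit so that a decimal age, if present, is not collapsed into a larger integer, and a string without any number becomes `NaN` for the imputer to handle later  
- This is necessary because some age values were stored as strings with additional text, which caused data type issues  
- The function is applied to both the training and testing sets to ensure that the "Age" column contains only numeric values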
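
The same cleanup can also be written without a Python-level function, using the vectorized `Series.str.extract`. This is a sketch of an alternative, not the approach used in this notebook, and the `age_raw` sample is hypothetical:

```python
import pandas as pd

# Hypothetical sample with mixed formats, including a non-string entry
age_raw = pd.Series(["21", "44 years", 30, "19.5 years"])

# Cast to string, pull out the first number (optional decimal part), convert to float
age_clean = (
    age_raw.astype(str)
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)
print(age_clean.tolist())
```

Entries without any number would come out as `NaN`, which the median imputer in the later pipeline could then fill.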

In [None]:
print(x_train['Age'].value_counts(), '\n')

- Now the "Age" column contains only clean, numeric values, and is correctly formatted for further analysis or modeling

## Handling Duplicates

In [None]:
print(x_train.duplicated().sum())

- There are 8 duplicated rows in the training data. These duplicates are removed to prevent bias and redundancy during model training

In [None]:
original_train_shape= x_train.shape[0]

train_dup_mask= x_train.duplicated(keep= 'first')

x_train_clean= x_train[~train_dup_mask]
y_train_clean= y_train[~train_dup_mask]

print(f"Training set: Removed {original_train_shape - x_train_clean.shape[0]} duplicates")

x_train, y_train= x_train_clean, y_train_clean

assert len(x_train)== len(y_train), "x-y training set misalignment!"

In [None]:
original_test_shape= x_test.shape[0]

test_dup_mask= x_test.duplicated(keep='first')

x_test_clean= x_test[~test_dup_mask]
y_test_clean= y_test[~test_dup_mask]

print(f"Test set: Removed {original_test_shape - x_test_clean.shape[0]} duplicates")

x_test, y_test= x_test_clean, y_test_clean

assert len(x_test) == len(y_test), "x-y test set misalignment!"

print("\nData shapes after cleaning:")
print(f"Train: x {x_train.shape}, y {y_train.shape}")
print(f"Test : x {x_test.shape}, y {y_test.shape}")

- Duplicate data is removed from both the training and test sets  
- In the training set, duplicates are removed to ensure the model learns from diverse and non-redundant data, which helps improve generalization.
- In the test set, duplicates are also removed to avoid biased evaluation and ensure the classification report reflects the model’s performance on truly unique samples
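
The reason the same boolean mask can filter both the features and the target is that it is aligned on the index. A minimal sketch with hypothetical toy data (`features`, `target`) and a shuffled index, as `train_test_split` would produce:

```python
import pandas as pd

# Hypothetical toy split with a non-default index
features = pd.DataFrame(
    {"Age": [21.0, 21.0, 30.0], "Gender": ["Male", "Male", "Female"]},
    index=[7, 2, 5],
)
target = pd.Series(
    ["Normal_Weight", "Normal_Weight", "Obesity_Type_I"], index=[7, 2, 5]
)

# The mask carries the same index as the features, so it filters the target consistently
dup_mask = features.duplicated(keep="first")
features_clean = features[~dup_mask]
target_clean = target[~dup_mask]

print(features_clean.index.tolist(), target_clean.index.tolist())
```

Note that the mask is computed on the features only, so two rows with identical features but different labels would also count as duplicates.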

In [None]:
print(x_train.duplicated().sum())

## Null Values Discovery

In [None]:
numerical_cols= ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
ordinal_cols= ['Gender', 'FamilyHistory', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC']
onehot_cols= ['MTRANS']

- The features are divided into three groups to be processed differently in a preprocessing pipeline:
  - `numerical_cols`: Continuous numerical features that will be scaled.
  - `ordinal_cols`: Categorical features with an implied order or limited categories, which will be encoded using ordinal encoding.
  - `onehot_cols`: Nominal categorical features with no intrinsic order, which will be encoded using one hot encoding.
- This separation allows each type of feature to be preprocessed appropriately within a unified pipeline.
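
Because `ColumnTransformer` drops any column that is not listed (its default is `remainder='drop'`), it can be worth checking that the three groups are disjoint and together cover all features. A small sketch using the lists from the cell above:

```python
numerical_cols = ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
ordinal_cols = ['Gender', 'FamilyHistory', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC']
onehot_cols = ['MTRANS']

grouped = numerical_cols + ordinal_cols + onehot_cols

# No column should appear in more than one group
assert len(grouped) == len(set(grouped))

# 17 dataset columns minus the target leaves 16 features
print(len(grouped))
```

In the notebook, `set(grouped) == set(x_train.columns)` would additionally confirm that no feature is silently left out.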

In [None]:
print('Null Values in Data Train')
print(x_train.isnull().sum())

- As observed earlier, the "CAEC" and "TUE" columns contain some null values. However, since the number of missing entries is small, I choose to impute them instead of dropping the columns  
- These features are considered important (e.g., "CAEC" relates to snacking habits and "TUE" to technology usage), so retaining them with imputation helps preserve potentially valuable information for the model

#### Check Outliers in Numerical Columns

In [None]:
for i in numerical_cols:
  plt.figure(figsize= (8,5))
  sns.boxplot(y= x_train[i])
  plt.xlabel(i)
  plt.show()

- Since there are outliers in several numerical columns, median imputation is used as it is more robust to extreme values than mean imputation  
- Additionally, the data is not normally distributed, so `RobustScaler` is applied for scaling. This scaler is less sensitive to outliers than other scaling methods such as `MinMaxScaler` or `StandardScaler`
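
To illustrate the point, the sketch below compares the mean with the median, and `RobustScaler` (which centers on the median and divides by the interquartile range) with `StandardScaler`, on a hypothetical column containing one extreme value:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical weights in kg with one extreme value
weights = np.array([[60.0], [65.0], [70.0], [75.0], [80.0], [200.0]])

print("mean:", weights.mean(), "median:", np.median(weights))

# RobustScaler: (x - median) / IQR
robust = RobustScaler().fit_transform(weights)
# StandardScaler: (x - mean) / standard deviation
standard = StandardScaler().fit_transform(weights)

print(robust.ravel())
print(standard.ravel())
```

The outlier pulls the mean and the standard deviation upward, which compresses the ordinary values after standard scaling, while the median and the interquartile range barely move.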

## Category Distributions for Encoding

In [None]:
for i in ordinal_cols:
  print(x_train[i].value_counts(), '\n')

- This step is used to inspect the distribution of values in each categorical column. It helps in understanding the data and deciding the appropriate order for ordinal encoding 
- By examining the unique values and their frequencies, we can determine how to handle each column during preprocessing and ensure the encoding makes logical sense

In [None]:
for i in onehot_cols:
  print(x_train[i].value_counts(), '\n')

## Pipeline Handling Null Values and Encoding

In [None]:
ordinal_categories= [
  ['Female', 'Male'],
  ['no', 'yes'],
  ['no', 'yes'],
  ['no', 'Sometimes', 'Frequently', 'Always'],
  ['no', 'yes'],
  ['no', 'yes'],
  ['no', 'Sometimes', 'Frequently', 'Always']
]

In [None]:
num_pipeline= Pipeline([
  ('imputer', SimpleImputer(strategy= 'median')),
  ('scaler', RobustScaler())
])

ordinal_pipeline= Pipeline([
  ('imputer', SimpleImputer(strategy= 'most_frequent')),
  ('encoder', OrdinalEncoder(categories= ordinal_categories))
])

onehot_pipeline= Pipeline([
  ('imputer', SimpleImputer(strategy= 'most_frequent')),
  ('encoder', OneHotEncoder(handle_unknown= 'ignore'))
])

- Custom category orders are defined in `ordinal_categories` to guide the `OrdinalEncoder`, ensuring that each ordinal feature is encoded in a meaningful and logical order

- Three separate pipelines are created to handle different types of features:

  - `num_pipeline`: Handles numerical columns using median imputation (to deal with outliers) and `RobustScaler` (to scale values while minimizing the influence of outliers)
  
  - `ordinal_pipeline`: Applies to ordinal categorical features, using most frequent value imputation and ordinal encoding based on the predefined order in `ordinal_categories`
  
  - `onehot_pipeline`: Used for nominal categorical features, with most frequent imputation and one hot encoding. The `handle_unknown='ignore'` parameter ensures the pipeline doesn’t break when encountering unseen categories during inference
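
The behavior of the two encoders can be seen in isolation. The sketch below uses hypothetical mini-samples; the category names follow the notebook, but the rows and the unseen value "Scooter" are made up:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical mini-sample for an ordinal column
train = pd.DataFrame({"CAEC": ["no", "Sometimes", "Always", "Frequently"]})

# The explicit list fixes the integer assigned to each level: no=0 ... Always=3
ordinal = OrdinalEncoder(categories=[["no", "Sometimes", "Frequently", "Always"]])
ordinal_encoded = ordinal.fit_transform(train).ravel()
print(ordinal_encoded)

# One-hot encoding with a category that was never seen during fitting
transport_train = pd.DataFrame({"MTRANS": ["Walking", "Bike", "Automobile"]})
transport_new = pd.DataFrame({"MTRANS": ["Scooter"]})  # made-up unseen value

onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(transport_train)

# The unseen category becomes an all-zero row instead of raising an error
unknown_encoded = onehot.transform(transport_new).toarray()
print(unknown_encoded)
```

`OrdinalEncoder` keeps its default `handle_unknown='error'` here, so an unseen level in one of the ordinal columns would still raise at prediction time.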


In [None]:
preprocessor= ColumnTransformer([
    ('num', num_pipeline, numerical_cols),
    ('ordinal', ordinal_pipeline, ordinal_cols),
    ('onehot', onehot_pipeline, onehot_cols)
])

In [None]:
preprocessor

### Check

In [None]:
x_train_check= preprocessor.fit_transform(x_train)

In [None]:
print(x_train.shape)
x_train.head()

In [None]:
feature_names = preprocessor.get_feature_names_out()
x_train_check_df = pd.DataFrame(x_train_check, columns=feature_names)

print("Shape:", x_train_check_df.shape)
x_train_check_df.head()

In [None]:
print('Null Values in Data Train')
print(x_train_check_df.isnull().sum())

- This step is used to verify that the preprocessing was correctly applied
- After transforming the training data using `preprocessor.fit_transform(x_train)`, the result is a NumPy array 
- To make it more interpretable, the array is converted back into a DataFrame using the feature names from `preprocessor.get_feature_names_out()`  
- The resulting DataFrame shows that scaling, encoding, and imputations were successfully applied to the appropriate columns as expected

## Model

In [None]:
rf_pipeline= Pipeline([
  ('preprocessing', preprocessor),
  ('classifier', RandomForestClassifier(
    n_estimators= 100,
    max_depth= 4,
    min_samples_split= 4,
    min_samples_leaf= 2,
    class_weight= 'balanced',
    random_state= 42,
    n_jobs= -1
  ))
])

In [None]:
gb_pipeline= Pipeline([
  ('preprocessing', preprocessor),
  ('classifier', GradientBoostingClassifier(
    n_estimators= 100,
    learning_rate= 0.05,
    max_depth= 4,
    min_samples_split= 3,
    min_samples_leaf= 2,
    subsample= 0.8,
    loss= 'log_loss',
    random_state= 42
   ))
])

- Two different models are used for comparison: `RandomForestClassifier` and `GradientBoostingClassifier`.  
- Both models are wrapped inside pipelines that include the preprocessing steps, ensuring consistent data transformation during training and evaluation
- This comparison allows us to evaluate which model performs better on the given dataset and should be chosen for final deployment
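
A complementary way to compare the two candidates, sketched here as an assumption rather than something this notebook does, is cross-validation on the training data, which keeps the test set untouched until the final check. The data below is synthetic (`make_classification`), and the hyperparameters are simplified stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; in the notebook this would be x_train and y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

candidates = {
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=4, random_state=42
    ),
    "gradient_boost": GradientBoostingClassifier(
        n_estimators=50, max_depth=4, random_state=42
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for name, model in candidates.items():
    # Macro F1 weights every class equally
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With the real data, `rf_pipeline` and `gb_pipeline` could be passed to `cross_val_score` directly, since each pipeline refits its preprocessing inside every fold.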

In [None]:
rf_pipeline.fit(x_train, y_train)

In [None]:
gb_pipeline.fit(x_train, y_train)

In [None]:
y_predict1= rf_pipeline.predict(x_test)

In [None]:
y_predict2= gb_pipeline.predict(x_test)

### Model Evaluation

In [None]:
print('Classification Report Random Forest\n')
print(classification_report(y_test, y_predict1))

labels= ['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I','Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II']

cm= confusion_matrix(y_test, y_predict1, labels= labels)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot= True, fmt= 'd', cmap= 'YlGnBu', xticklabels= labels, yticklabels= labels)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Random Forest')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
print('Classification Report Gradient Boost\n')
print(classification_report(y_test, y_predict2))

labels= ['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I','Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II']

cm= confusion_matrix(y_test, y_predict2, labels= labels)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot= True, fmt= 'd', cmap= 'YlGnBu', xticklabels= labels, yticklabels= labels)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Gradient Boost')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

#### Summary

##### Classification Report Comparison
| Metric                  | Random Forest | Gradient Boost |
|-------------------------|---------------|----------------|
| **Accuracy**            | 0.76          | **0.94**       |
| **Macro Avg Precision** | 0.77          | **0.94**       |
| **Macro Avg Recall**    | 0.76          | **0.94**       |
| **Macro Avg F1-score**  | 0.75          | **0.94**       |

- **Gradient Boosting** significantly outperforms **Random Forest** in all key classification metrics  
- It achieves higher precision, recall, and F1-score across most classes. This shows up in the macro averages, which weight every class equally and are therefore a suitable summary for a multi-class problem with balanced classes
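
For reference, a macro average is simply the unweighted mean of the per-class scores. The sketch below verifies this on hypothetical labels (not the notebook's predictions):

```python
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Hypothetical labels for a three-class problem
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["a", "b", "c"]
)

# Macro average: unweighted mean of the per-class F1 scores
macro_f1_manual = f1.mean()
macro_f1_sklearn = f1_score(y_true, y_pred, average="macro")
print(macro_f1_manual, macro_f1_sklearn)
```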

---

##### Per Class Performance Insights

- **Random Forest** shows:
  - High recall on `Insufficient_Weight` and `Obesity_Type_II` (1.00 and 0.97), but
  - Poor performance on `Normal_Weight` (recall = 0.48, F1 = 0.55) and `Overweight_Level_I` (F1 = 0.62).
  - Perfect classification for `Obesity_Type_III`.

- **Gradient Boost** shows:
  - High and consistent scores across all classes (recall and precision mostly >= 0.88),
  - Perfect classification for `Obesity_Type_III`.

---

##### Confusion Matrix Comparison

- **Random Forest Confusion Matrix:**
  - Misclassifies a large number of `Normal_Weight` as `Insufficient_Weight`.
  - Also confuses `Obesity_Type_I`, `Overweight_Level_I`, and `Overweight_Level_II` with one another.

- **Gradient Boost Confusion Matrix:**
  - Much cleaner diagonals indicating better true positive rates.
  - Very few misclassifications, even in closely related categories like overweight and obesity types.
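
When reading these matrices, rows are the true classes and columns are the predicted classes, so per-class recall is the diagonal divided by the row sums. A small sketch with hypothetical predictions for three of the classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels_demo = ["Normal_Weight", "Overweight_Level_I", "Obesity_Type_I"]

# Hypothetical predictions, not the notebook's results
y_true = ["Normal_Weight"] * 4 + ["Overweight_Level_I"] * 3 + ["Obesity_Type_I"] * 3
y_pred = (
    ["Normal_Weight"] * 2 + ["Overweight_Level_I"] * 2
    + ["Overweight_Level_I"] * 3
    + ["Obesity_Type_I"] * 2 + ["Overweight_Level_I"]
)

cm_demo = confusion_matrix(y_true, y_pred, labels=labels_demo)

# The diagonal holds correct predictions; each row sums to that class's support
recall_per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(cm_demo)
print(recall_per_class)
```

Off-diagonal entries in a row show which classes a given true class is mistaken for.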

---

##### Conclusion
**Gradient Boosting** is clearly the better-performing model based on overall accuracy, macro-averaged metrics, and confusion matrix clarity. It is the recommended choice for final deployment.


### Save to Pickle

In [None]:
filename= 'Obesity.pkl'
# The context manager closes the file even if pickling fails
with open(filename, 'wb') as f:
  pickle.dump(gb_pipeline, f)

- The Gradient Boosting model is saved using pickle
- This allows the model to be reused for deployment in FastAPI and Streamlit
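
On the deployment side, loading the file restores the preprocessing and the classifier together, so raw rows can be passed straight to `predict`. The sketch below shows the round trip with a tiny stand-in pipeline. The data, the file name `demo_model.pkl`, and the reduced feature set are all hypothetical; the real service would load `Obesity.pkl` instead:

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in data with one numeric and one categorical feature
demo_x = pd.DataFrame({
    "Weight": [50.0, 55.0, 90.0, 95.0, 52.0, 98.0],
    "MTRANS": ["Walking", "Bike", "Automobile", "Automobile", "Walking", "Bike"],
})
demo_y = ["Normal_Weight", "Normal_Weight", "Obesity_Type_I",
          "Obesity_Type_I", "Normal_Weight", "Obesity_Type_I"]

demo_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["MTRANS"])],
        remainder="passthrough",
    )),
    ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=42)),
])
demo_pipeline.fit(demo_x, demo_y)

# Save to a temporary location
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(demo_pipeline, f)

# Load it back, as an API or app process would at startup
with open(path, "rb") as f:
    loaded = pickle.load(f)

new_row = pd.DataFrame({"Weight": [93.0], "MTRANS": ["Automobile"]})
print(loaded.predict(new_row))
```

A pickled pipeline should be loaded with the same scikit-learn version that created it, and only pickle files from a trusted source should be opened.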