### Introduction

This dataset includes data for estimating obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.

The data contains 17 attributes and 2111 records. Each record is labeled with the class variable NObesity (Obesity Level), which allows classification into seven levels: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.

Data Details:

- Gender: Gender
- Age: Age
- Height : in metres
- Weight : in kgs
- family_history : Has a family member suffered or suffers from overweight?
- FAVC : Do you eat high caloric food frequently?
- FCVC : Do you usually eat vegetables in your meals?
- NCP : How many main meals do you have daily?
- CAEC : Do you eat any food between meals?
- SMOKE : Do you smoke?
- CH2O : How much water do you drink daily?
- SCC : Do you monitor the calories you eat daily?
- FAF: How often do you have physical activity?
- TUE : How much time do you use technological devices such as cell phone, videogames, television, computer and others?
- CALC : How often do you drink alcohol?
- MTRANS : Which transportation do you usually use?
- Obesity (Target Column) : Obesity level


https://www.semanticscholar.org/paper/Dataset-for-estimation-of-obesity-levels-based-on-Palechor-Manotas/35b40bacd2ffa9370885b7a3004d88995fd1d011

### Data Wrangling and Core Libraries Importation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [2]:
obp_df = pd.read_csv('/content/Obesity prediction.csv')
obp_df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,Obesity
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [3]:
obp_df.describe()

Unnamed: 0,Age,Height,Weight,FCVC,NCP,CH2O,FAF,TUE
count,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0,2111.0
mean,24.3126,1.701677,86.586058,2.419043,2.685628,2.008011,1.010298,0.657866
std,6.345968,0.093305,26.191172,0.533927,0.778039,0.612953,0.850592,0.608927
min,14.0,1.45,39.0,1.0,1.0,1.0,0.0,0.0
25%,19.947192,1.63,65.473343,2.0,2.658738,1.584812,0.124505,0.0
50%,22.77789,1.700499,83.0,2.385502,3.0,2.0,1.0,0.62535
75%,26.0,1.768464,107.430682,3.0,3.0,2.47742,1.666678,1.0
max,61.0,1.98,173.0,3.0,4.0,3.0,3.0,2.0


In [4]:
obp_df.describe(include='object')

Unnamed: 0,Gender,family_history,FAVC,CAEC,SMOKE,SCC,CALC,MTRANS,Obesity
count,2111,2111,2111,2111,2111,2111,2111,2111,2111
unique,2,2,2,4,2,2,4,5,7
top,Male,yes,yes,Sometimes,no,no,Sometimes,Public_Transportation,Obesity_Type_I
freq,1068,1726,1866,1765,2067,2015,1401,1580,351


In [5]:
obp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Gender          2111 non-null   object 
 1   Age             2111 non-null   float64
 2   Height          2111 non-null   float64
 3   Weight          2111 non-null   float64
 4   family_history  2111 non-null   object 
 5   FAVC            2111 non-null   object 
 6   FCVC            2111 non-null   float64
 7   NCP             2111 non-null   float64
 8   CAEC            2111 non-null   object 
 9   SMOKE           2111 non-null   object 
 10  CH2O            2111 non-null   float64
 11  SCC             2111 non-null   object 
 12  FAF             2111 non-null   float64
 13  TUE             2111 non-null   float64
 14  CALC            2111 non-null   object 
 15  MTRANS          2111 non-null   object 
 16  Obesity         2111 non-null   object 
dtypes: float64(8), object(9)
memory u

In [6]:
obp_df.isnull().sum()

Unnamed: 0,0
Gender,0
Age,0
Height,0
Weight,0
family_history,0
FAVC,0
FCVC,0
NCP,0
CAEC,0
SMOKE,0


In [7]:
obp_df.duplicated().sum()

24

#### Key Insights
1. The FCVC question ("Do you usually eat vegetables in your meals?") reads like a yes/no question that should yield a binary answer, but the column contains many distinct values.
2. The column names should be changed so that they reflect the questions more clearly.
3. The dataset does not appear to contain any null values.
4. There are 24 duplicate rows, but we cannot act on them without more information about how they arose.
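
Before deciding what to do with the duplicates, it can help to look at them side by side. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the obesity data) to show `duplicated(keep=False)`, which flags every copy of a repeated row rather than only the later ones.

```python
import pandas as pd

# Hypothetical toy data standing in for obp_df; the values are made up.
toy_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [21.0, 23.0, 21.0, 21.0],
    "Weight": [80.0, 60.0, 80.0, 80.0],
})

# duplicated() marks only the repeats (the first occurrence stays False).
n_repeats = int(toy_df.duplicated().sum())

# keep=False marks every member of a duplicated group, so they can be inspected together.
dup_rows = toy_df[toy_df.duplicated(keep=False)]

print(n_repeats)
print(dup_rows)
```

On the real data, `obp_df[obp_df.duplicated(keep=False)]` would list the repeated rows together with their first occurrences.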

### Data Cleaning

- Investigate the FCVC column closely
- Change the names of certain columns


In [8]:
obp_df['FCVC'].value_counts()

FCVC,count
3.000000,652
2.000000,600
1.000000,33
2.823179,2
2.214980,2
...,...
2.927409,1
2.706134,1
2.010684,1
2.300408,1


On inspection, we found that although 1,285 of the 2,111 values are exactly 1, 2, or 3, the rest are decimals (for example 2.823179), so we could not tell what the column is meant to convey. We therefore decided to drop it.
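
A quick way to quantify the problem is to measure how many FCVC values are not whole numbers. The sketch below does this on a few values copied from the output above (the `fcvc` series is a hand-built sample, not the full column) and shows rounding to the nearest integer as one possible alternative to dropping the column. Rounding is an assumption on our part, not something the dataset documentation prescribes.

```python
import pandas as pd

# Hand-built sample using values seen in the value_counts() output above.
fcvc = pd.Series([3.0, 2.0, 1.0, 2.823179, 2.214980, 2.927409])

# A value is fractional if it differs from its rounded version.
is_fractional = fcvc != fcvc.round()
share_fractional = is_fractional.mean()

# Possible alternative to dropping: snap each value to the nearest level (1, 2 or 3).
fcvc_rounded = fcvc.round().astype(int)

print(share_fractional)
print(fcvc_rounded.tolist())
```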

In [9]:
obp_new = obp_df.drop('FCVC', axis=1)
obp_new.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history,FAVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,Obesity
0,Female,21.0,1.62,64.0,yes,no,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


As for the second issue, renaming the columns would make them inconsistent with the descriptions in the introduction, so we will keep the names as they are and only convert them to lowercase for uniformity.

In [10]:
obp_new.columns = obp_new.columns.str.lower()
obp_new.head()

Unnamed: 0,gender,age,height,weight,family_history,favc,ncp,caec,smoke,ch2o,scc,faf,tue,calc,mtrans,obesity
0,Female,21.0,1.62,64.0,yes,no,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


### Data Preprocessing

In [11]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [12]:
def data_preprocessing(data):
  #Copy data
  data_new  = data.copy()
  # Encode categorical variable using label encoder
  le  = LabelEncoder()
  cat_col = data_new.select_dtypes(include='object').columns
  for col in cat_col:
    data_new[col] = le.fit_transform(data_new[col])

  # Separate the dependent variable (target) from the independent variables (features)
  y = data_new['obesity']
  X = data_new.drop('obesity', axis=1)

  #Split the data using train test split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  #Use standard scaler to scale the data
  scaler = StandardScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)

  return X_train_scaled, X_test_scaled, y_train, y_test, scaler


In [13]:
X_train_scaled, X_test_scaled, y_train, y_test, scaler = data_preprocessing(obp_new)

In [14]:
print("X_train_scaled;", X_train_scaled.shape)
print("X_test_scaled;", X_test_scaled.shape)
print("y_train;", y_train.shape)
print("y_test;", y_test.shape)

X_train_scaled; (1688, 15)
X_test_scaled; (423, 15)
y_train; (1688,)
y_test; (423,)
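
Because `LabelEncoder` replaces each obesity label with an integer, the reports further down refer to classes 0 to 6. The sketch below shows how the encoder assigns those integers (sorted order of the labels) and how `classes_` recovers the mapping. The label strings are written out by hand and assume the seven values follow the dataset's usual spelling; only four of them are visible in the `head()` output above.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spellings; only some of these are visible in the outputs above.
labels = [
    "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II",
    "Obesity_Type_I", "Insufficient_Weight", "Obesity_Type_II",
    "Obesity_Type_III",
]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels in sorted order; position == integer code.
mapping = {int(code): label for code, label in enumerate(le.classes_)}

for code, label in mapping.items():
    print(code, label)
```

Under that assumption, class 1 in the reports is `Normal_Weight` and class 5 is `Overweight_Level_I`. Note that `data_preprocessing` does not return its encoder, so fitting a separate `LabelEncoder` on the `obesity` column and keeping it would make this mapping available later.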


### Model training

In [41]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import VotingClassifier

We will use a heterogeneous ensemble to build our final model. Specifically, we will train Logistic Regression, Decision Tree, Random Forest, and Support Vector classifiers individually, and then combine them in a voting classifier.

### Logistic Regression

In [30]:
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train_scaled, y_train)

In [31]:
# Predictions
y_pred = model.predict(X_test_scaled)

# Evaluation
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy Score: 0.8699763593380615
Classification Report:
               precision    recall  f1-score   support

           0       0.84      1.00      0.91        56
           1       0.90      0.61      0.73        62
           2       0.96      0.87      0.91        78
           3       0.92      0.98      0.95        58
           4       0.97      1.00      0.98        63
           5       0.75      0.79      0.77        56
           6       0.74      0.84      0.79        50

    accuracy                           0.87       423
   macro avg       0.87      0.87      0.86       423
weighted avg       0.88      0.87      0.87       423

Confusion Matrix:
 [[56  0  0  0  0  0  0]
 [11 38  0  0  0  9  4]
 [ 0  0 68  5  2  0  3]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  4  0  0  0 44  8]
 [ 0  0  2  0  0  6 42]]


### Decision Tree Classifier

In [26]:
dtc_model = DecisionTreeClassifier(criterion="gini", splitter="best", max_depth=None, random_state=42)
dtc_model.fit(X_train_scaled, y_train)

In [27]:
#Prediction
dtc_pred = dtc_model.predict(X_test_scaled)
#Evaluation
print("Accuracy Score:", accuracy_score(y_test, dtc_pred))
print("Classification Report:\n", classification_report(y_test, dtc_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, dtc_pred))

Accuracy Score: 0.9361702127659575
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.96      0.93        56
           1       0.84      0.87      0.86        62
           2       0.97      0.94      0.95        78
           3       0.95      0.97      0.96        58
           4       1.00      1.00      1.00        63
           5       0.91      0.88      0.89        56
           6       0.98      0.94      0.96        50

    accuracy                           0.94       423
   macro avg       0.94      0.94      0.94       423
weighted avg       0.94      0.94      0.94       423

Confusion Matrix:
 [[54  2  0  0  0  0  0]
 [ 6 54  0  0  0  2  0]
 [ 0  1 73  3  0  0  1]
 [ 0  0  2 56  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  7  0  0  0 49  0]
 [ 0  0  0  0  0  3 47]]


### Random Forest

In [32]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth =None)
rf_model.fit(X_train_scaled, y_train)

In [33]:
#Predictions
rf_pred = rf_model.predict(X_test_scaled)

#Evaluation
print("Accuracy Score:", accuracy_score(y_test, rf_pred))
print("Classification Report:\n", classification_report(y_test, rf_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, rf_pred))

Accuracy Score: 0.9550827423167849
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.96      0.97        56
           1       0.89      0.94      0.91        62
           2       0.97      0.96      0.97        78
           3       0.97      0.98      0.97        58
           4       1.00      1.00      1.00        63
           5       0.92      0.88      0.90        56
           6       0.94      0.96      0.95        50

    accuracy                           0.96       423
   macro avg       0.95      0.95      0.95       423
weighted avg       0.96      0.96      0.96       423

Confusion Matrix:
 [[54  2  0  0  0  0  0]
 [ 1 58  0  0  0  3  0]
 [ 0  0 75  2  0  0  1]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  5  0  0  0 49  2]
 [ 0  0  1  0  0  1 48]]


### Support Vector Classifier

In [37]:
svc_model = SVC(kernel='linear', C=1.0, random_state=42)
svc_model.fit(X_train_scaled, y_train)

In [38]:
#Predictions
svc_pred = svc_model.predict(X_test_scaled)

#Evaluation
print("Accuracy Score:", accuracy_score(y_test, svc_pred))
print("Classification Report:\n", classification_report(y_test, svc_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, svc_pred))

Accuracy Score: 0.9479905437352246
Classification Report:
               precision    recall  f1-score   support

           0       0.88      1.00      0.93        56
           1       0.98      0.81      0.88        62
           2       0.99      0.96      0.97        78
           3       0.97      0.98      0.97        58
           4       1.00      1.00      1.00        63
           5       0.90      0.93      0.91        56
           6       0.92      0.96      0.94        50

    accuracy                           0.95       423
   macro avg       0.95      0.95      0.95       423
weighted avg       0.95      0.95      0.95       423

Confusion Matrix:
 [[56  0  0  0  0  0  0]
 [ 8 50  0  0  0  4  0]
 [ 0  0 75  2  0  0  1]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  1  0  0  0 52  3]
 [ 0  0  0  0  0  2 48]]


### Voting Classifier

In [42]:
# Create individual models
log_model = LogisticRegression()
svc_model = SVC(probability=True)  # Required for soft voting
dt_model = DecisionTreeClassifier()
rf_model = RandomForestClassifier()

# Create a list of individual models
estimators = [('logistic', log_model), ('svc', svc_model), ('decision_tree', dt_model), ('random_forest', rf_model)]

#Voting technique
voting_model = VotingClassifier(estimators=estimators, voting='soft')
voting_model.fit(X_train_scaled, y_train)
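
With `voting='soft'`, the ensemble averages the class probabilities predicted by each estimator and picks the class with the highest mean, whereas hard voting takes a majority of the predicted labels. The NumPy sketch below illustrates the difference on made-up probabilities for one sample and three hypothetical models; it is an illustration of the idea, not scikit-learn's internal code.

```python
import numpy as np

# Made-up class probabilities for ONE sample from three hypothetical models
# (rows = models, columns = classes 0, 1, 2).
probas = np.array([
    [0.45, 0.55, 0.00],   # model A slightly prefers class 1
    [0.40, 0.60, 0.00],   # model B slightly prefers class 1
    [0.90, 0.10, 0.00],   # model C strongly prefers class 0
])

# Hard voting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_choice = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then take the most likely class.
mean_proba = probas.mean(axis=0)
soft_choice = int(mean_proba.argmax())

print(hard_choice, soft_choice)
```

Here the two schemes disagree because one confident model outweighs two hesitant ones under soft voting. This is also why `SVC` needs `probability=True` in the cell above: soft voting relies on `predict_proba`.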

In [44]:
#predictions
voting_pred = voting_model.predict(X_test_scaled)

#Evaluation
print("Accuracy Score:", accuracy_score(y_test, voting_pred))
print("Classification Report:\n", classification_report(y_test, voting_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, voting_pred))

Accuracy Score: 0.9645390070921985
Classification Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97        56
           1       0.93      0.90      0.92        62
           2       0.99      0.96      0.97        78
           3       0.97      0.98      0.97        58
           4       1.00      1.00      1.00        63
           5       0.95      0.93      0.94        56
           6       0.96      0.98      0.97        50

    accuracy                           0.96       423
   macro avg       0.96      0.97      0.96       423
weighted avg       0.96      0.96      0.96       423

Confusion Matrix:
 [[56  0  0  0  0  0  0]
 [ 3 56  0  0  0  2  1]
 [ 0  0 75  2  0  0  1]
 [ 0  0  1 57  0  0  0]
 [ 0  0  0  0 63  0  0]
 [ 0  4  0  0  0 52  0]
 [ 0  0  0  0  0  1 49]]


### **Conclusion**

The voting classifier achieves an overall **accuracy of 96.5%** on the test set, correctly classifying 408 of the 423 samples. The detailed classification metrics reveal the following insights:

1. **Strong Performance Across Classes**:
   - Most classes have high **precision**, **recall**, and **F1-scores**, reflecting balanced and effective classification.
   - Notably, class **4** achieved perfect scores, showing the model's flawless performance for this class.

2. **Key Observations**:
   - Class **1** has slightly lower recall (**90%**), meaning 6 of its 62 test samples were assigned to other classes.
   - Misclassifications are few (15 of 423), with most errors occurring in **classes 1 and 5**. For instance, 3 instances of class **1** were classified as class **0**, and 4 instances of class **5** were classified as class **1**.

3. **Macro and Weighted Averages**:
   - The **macro average** metrics (0.96 to 0.97) give every class equal weight and show balanced performance across all classes, regardless of class size.
   - The **weighted average** metrics (0.96) align closely with the overall accuracy, which is expected because the test classes are similar in size (50 to 78 samples each).

4. **Confusion Matrix**:
   - The majority of predictions are along the diagonal, showing accurate classifications.
   - Off-diagonal values are minimal and concentrated in specific misclassifications, such as between **classes 1 and 5**, suggesting the model could benefit from further fine-tuning in these areas.
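
The figures quoted above can be reproduced directly from the voting classifier's confusion matrix, where rows are true classes and columns are predicted classes. The sketch below copies that matrix from the output and recomputes accuracy, per-class recall, and per-class precision with NumPy.

```python
import numpy as np

# Confusion matrix of the voting classifier, copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [56,  0,  0,  0,  0,  0,  0],
    [ 3, 56,  0,  0,  0,  2,  1],
    [ 0,  0, 75,  2,  0,  0,  1],
    [ 0,  0,  1, 57,  0,  0,  0],
    [ 0,  0,  0,  0, 63,  0,  0],
    [ 0,  4,  0,  0,  0, 52,  0],
    [ 0,  0,  0,  0,  0,  1, 49],
])

correct = np.diag(cm)                  # correctly classified samples per class
accuracy = correct.sum() / cm.sum()    # 408 / 423
recall = correct / cm.sum(axis=1)      # divide by row totals (true counts)
precision = correct / cm.sum(axis=0)   # divide by column totals (predicted counts)

print(round(float(accuracy), 4))
print(np.round(recall, 2))
print(np.round(precision, 2))
```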

### **Overall**:
The voting classifier performs very well, reaching 96.5% accuracy and outperforming each individual model on this test split (87.0% for logistic regression, 93.6% for the decision tree, 94.8% for the support vector classifier, and 95.5% for the random forest). Misclassifications are few, and precision, recall, and F1-scores are high across classes. To further improve performance, focus could be placed on the misclassifications in **classes 1 and 5** by exploring feature engineering or hyperparameter optimization.
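
As one way to act on the hyperparameter suggestion, the sketch below runs a small grid search over a random forest with `GridSearchCV`. It uses synthetic data from `make_classification` in place of the obesity features, and the grid values are arbitrary illustrations rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train_scaled / y_train (3 classes, 300 samples).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=42,
)

# Illustrative grid; the values are assumptions chosen to keep the search small.
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                   # 3-fold stratified cross-validation
    scoring="f1_macro",     # macro F1 weights every class equally
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Scoring on macro F1 rather than accuracy keeps the weaker classes from being hidden by the stronger ones during the search.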