# Weight Class Prediction Project

## Introduction

**Project Objective:** 

The Weight Class Prediction Model is a machine learning project designed to predict an individual's weight status based on a set of input features. The primary goal of this project is to provide a user-friendly tool that allows users to understand and assess their weight based on various lifestyle and demographic factors.

The project leverages a machine learning ensemble approach, combining multiple predictive models to enhance the accuracy and robustness of the predictions. Each model has been trained on a dataset of individuals with known weight statuses and corresponding features.

## Project Contents

This project includes the following components:

1. **Data Collection and Preprocessing:** The dataset comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition) and is preprocessed before model training. It includes each individual's gender, age, height, weight, family history of overweight, dietary habits, physical activity, and other relevant features.

2. **Machine Learning Models:**

    - Support Vector Machine (SVM)
    - Light Gradient Boosting Machine (LightGBM)
    - Logistic Regression (LR)
    - Gradient Boosting Classifier (GBC)
    - Decision Tree Classifier (DTC)
    - Random Forest Classifier (RFC)
    - CatBoost Classifier (CatBoost)
    - XGBoost Classifier (XGBoost)

   A diverse set of machine learning models is used to ensure robust predictions. Each model contributes to the ensemble prediction, and hyperparameter tuning is performed to optimize their performance.

3. **Ensemble Learning:** The project employs ensemble learning, a technique that combines the predictions from multiple models to make a final prediction. This approach often improves overall performance and generalization compared with any single model.

4. **User Interface:** A user interface has been developed to allow users to input their personal information and obtain a prediction regarding their weight status and overall health. The user's input is then processed by the ensemble of models to provide a comprehensive assessment.

The combination of various models and the ensemble approach allows the system to provide more accurate and robust predictions. Users can obtain personalized insights and recommendations based on their input, making it a valuable tool for understanding the relationship between lifestyle and health.

This project seeks to promote awareness about the impact of lifestyle choices on an individual's weight and health and is intended for educational and informational purposes.

**Note:** The predictions provided by this tool may not be 100% accurate, as this project was developed for educational purposes. For more information on weight and health, it is advisable to seek the guidance of a medical professional.

In [1]:
# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
import timeit
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Loading the primary dataset from the specified file path
df = pd.read_csv("C:/Users/gouth/Downloads/(04) Obesity Detection Project/ObesityDataSet_raw_and_data_sinthetic.csv")

# Displaying the first few rows of the dataset to get an initial overview
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,high_caloric_food,Vegetables,Main_meals,food_between_meals,SMOKE,daily_water,calorie_monitering,physical_activity,technology_use,alcohol_consumption,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [3]:
# Check for null values and data types in the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   high_caloric_food               2111 non-null   object 
 6   Vegetables                      2111 non-null   float64
 7   Main_meals                      2111 non-null   float64
 8   food_between_meals              2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  daily_water                     2111 non-null   float64
 11  calorie_monitering              2111 non-null   object 
 12  physical_activity               21

In [4]:
# Renaming the columns for better understanding of the dataset features
df.columns = ['Gender', 'Age', 'Height', 'Weight', 'Family History with Overweight',
              'Frequent consumption of high caloric food', 'Frequency of consumption of vegetables',
              'Number of main meals', 'Consumption of food between meals', 'Smoke',
              'Consumption of water daily', 'Calories consumption monitoring',
              'Physical activity frequency', 'Time using technology devices',
              'Consumption of alcohol', 'Transportation used', 'Obesity']

# Display the dataset with the updated column names
df

Unnamed: 0,Gender,Age,Height,Weight,Family History with Overweight,Frequent consumption of high caloric food,Frequency of consumption of vegetables,Number of main meals,Consumption of food between meals,Smoke,Consumption of water daily,Calories consumption monitoring,Physical activity frequency,Time using technology devices,Consumption of alcohol,Transportation used,Obesity
0,Female,21.000000,1.620000,64.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,0.000000,1.000000,no,Public_Transportation,Normal_Weight
1,Female,21.000000,1.520000,56.000000,yes,no,3.0,3.0,Sometimes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.000000,1.800000,77.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,2.000000,1.000000,Frequently,Public_Transportation,Normal_Weight
3,Male,27.000000,1.800000,87.000000,no,no,3.0,3.0,Sometimes,no,2.000000,no,2.000000,0.000000,Frequently,Walking,Overweight_Level_I
4,Male,22.000000,1.780000,89.800000,no,no,2.0,1.0,Sometimes,no,2.000000,no,0.000000,0.000000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,20.976842,1.710730,131.408528,yes,yes,3.0,3.0,Sometimes,no,1.728139,no,1.676269,0.906247,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,21.982942,1.748584,133.742943,yes,yes,3.0,3.0,Sometimes,no,2.005130,no,1.341390,0.599270,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,22.524036,1.752206,133.689352,yes,yes,3.0,3.0,Sometimes,no,2.054193,no,1.414209,0.646288,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24.361936,1.739450,133.346641,yes,yes,3.0,3.0,Sometimes,no,2.852339,no,1.139107,0.586035,Sometimes,Public_Transportation,Obesity_Type_III


In [5]:
# Transforming specific features for better analysis
df['Obesity'] = df['Obesity'].apply(lambda x: x.replace('_', ' '))  # Remove underscores from 'Obesity' values
df['Transportation used'] = df['Transportation used'].apply(lambda x: x.replace('_', ' '))  # Remove underscores from 'Transportation used' values
df['Height'] = df['Height'] * 100  # Convert 'Height' from meters to centimeters
df['Height'] = df['Height'].round(1)  # Round 'Height' to one decimal place
df['Weight'] = df['Weight'].round(1)  # Round 'Weight' to one decimal place
df['Age'] = df['Age'].round(1)  # Round 'Age' to one decimal place

# Display the dataset with the transformed features
df

Unnamed: 0,Gender,Age,Height,Weight,Family History with Overweight,Frequent consumption of high caloric food,Frequency of consumption of vegetables,Number of main meals,Consumption of food between meals,Smoke,Consumption of water daily,Calories consumption monitoring,Physical activity frequency,Time using technology devices,Consumption of alcohol,Transportation used,Obesity
0,Female,21.0,162.0,64.0,yes,no,2.0,3.0,Sometimes,no,2.000000,no,0.000000,1.000000,no,Public Transportation,Normal Weight
1,Female,21.0,152.0,56.0,yes,no,3.0,3.0,Sometimes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public Transportation,Normal Weight
2,Male,23.0,180.0,77.0,yes,no,2.0,3.0,Sometimes,no,2.000000,no,2.000000,1.000000,Frequently,Public Transportation,Normal Weight
3,Male,27.0,180.0,87.0,no,no,3.0,3.0,Sometimes,no,2.000000,no,2.000000,0.000000,Frequently,Walking,Overweight Level I
4,Male,22.0,178.0,89.8,no,no,2.0,1.0,Sometimes,no,2.000000,no,0.000000,0.000000,Sometimes,Public Transportation,Overweight Level II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,21.0,171.1,131.4,yes,yes,3.0,3.0,Sometimes,no,1.728139,no,1.676269,0.906247,Sometimes,Public Transportation,Obesity Type III
2107,Female,22.0,174.9,133.7,yes,yes,3.0,3.0,Sometimes,no,2.005130,no,1.341390,0.599270,Sometimes,Public Transportation,Obesity Type III
2108,Female,22.5,175.2,133.7,yes,yes,3.0,3.0,Sometimes,no,2.054193,no,1.414209,0.646288,Sometimes,Public Transportation,Obesity Type III
2109,Female,24.4,173.9,133.3,yes,yes,3.0,3.0,Sometimes,no,2.852339,no,1.139107,0.586035,Sometimes,Public Transportation,Obesity Type III


## Data Preprocessing

In [6]:
# Create a copy of the original dataset to ensure the original data is not modified
df1 = df.copy()

In [7]:
# Define the input variables (x) as all columns except the last one (target variable)
x = df1[df1.columns[:-1]]

# Define the target variable (y) as the 'Obesity' column
y = df1['Obesity']

## Data Preprocessing: Choosing Encoding Methods

In the data preprocessing phase, I was faced with the task of encoding categorical features into a numerical format that can be used for machine learning. The three common methods for this task are Label Encoding, One-Hot Encoding, and Ordinal Encoding.

### Label Encoding

Label encoding is a straightforward technique where each category of a categorical variable is assigned a unique integer value. For example, if we have a categorical variable for "Color" with categories "Red," "Green," and "Blue," label encoding would assign them values like 0, 1, and 2.

One challenge with label encoding is that it may introduce ordinal relationships that do not exist in the original data. In our example, the machine learning model might interpret that "Blue" is greater than "Red" and "Green," which is not the case. This ordinal relationship can lead to incorrect predictions in some cases.

### One-Hot Encoding

In contrast to label encoding, One-Hot Encoding is another method for handling categorical variables. Instead of assigning integer values to categories, One-Hot Encoding creates binary columns for each category. Each binary column represents a category, where a '1' indicates the presence of the category, and '0' indicates its absence.

One-Hot Encoding does not introduce ordinal relationships, making it a suitable choice for categorical variables with no inherent order. However, it comes with the trade-off of increasing the dimensionality of the dataset, as it creates multiple new columns.

### Ordinal Encoding

Ordinal encoding is the method I chose to use in this project. Like label encoding, it maps each category to an integer, but scikit-learn's `OrdinalEncoder` is designed for input features and encodes several columns at once, whereas `LabelEncoder` is intended for a single target column. By default the integers follow the sorted order of the category names, so the codes do carry an implied order. For features with a natural ranking, such as how often food is eaten between meals, that order can be meaningful if the categories are listed explicitly; for purely nominal features, such as the mode of transportation, it is arbitrary.

One key advantage of ordinal encoding is that each categorical feature remains a single column, which keeps the dataset compact. One-Hot Encoding, by contrast, would create one new column per category and increase the dimensionality of the input data.

I chose ordinal encoding to keep the dataset structure simple. Most models in this project are tree-based and split on thresholds, so they are relatively insensitive to an arbitrary integer order. Models that treat the codes as magnitudes, such as Logistic Regression and SVM, may be affected by it, and One-Hot Encoding is the safer option for nominal features in those cases.
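
To make the difference between these encodings concrete, here is a minimal sketch on a toy column. The three labels are taken from the dataset preview; the explicit ranking passed to `categories` is an assumption about the intended order, not something the dataset defines.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column with three of the frequency labels seen in the dataset preview.
freq = np.array([["no"], ["Sometimes"], ["Frequently"]], dtype=object)

# Default behaviour: categories are sorted lexicographically
# (uppercase before lowercase), so the codes do not follow frequency.
default_codes = OrdinalEncoder().fit_transform(freq).ravel().tolist()

# Explicit category order (an assumed ranking from least to most frequent).
ranked = OrdinalEncoder(categories=[["no", "Sometimes", "Frequently"]])
ranked_codes = ranked.fit_transform(freq).ravel().tolist()

# One-hot encoding: one binary column per category, no implied order.
one_hot = OneHotEncoder().fit_transform(freq).toarray()

print(default_codes)  # [2.0, 1.0, 0.0]
print(ranked_codes)   # [0.0, 1.0, 2.0]
print(one_hot.shape)  # (3, 3)
```

With the default settings, "no" receives the largest code simply because lowercase letters sort last, which is why an explicit category list is worth considering for ranked features.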

In [8]:
def preprocess_data():
    # Work on a copy so the original feature frame is not modified in place
    data = x.copy()

    # Identify the categorical (object-typed) columns
    categorical_columns = data.select_dtypes(include=['object']).columns

    # Encode categorical columns using OrdinalEncoder
    if not categorical_columns.empty:
        ordinal_encoder = OrdinalEncoder()
        data[categorical_columns] = ordinal_encoder.fit_transform(data[categorical_columns])

    # Return the preprocessed data
    return data

In [9]:
# Process the input features by applying the preprocessing function
x = preprocess_data()

In [10]:
x.head()

Unnamed: 0,Gender,Age,Height,Weight,Family History with Overweight,Frequent consumption of high caloric food,Frequency of consumption of vegetables,Number of main meals,Consumption of food between meals,Smoke,Consumption of water daily,Calories consumption monitoring,Physical activity frequency,Time using technology devices,Consumption of alcohol,Transportation used
0,0.0,21.0,162.0,64.0,1.0,0.0,2.0,3.0,2.0,0.0,2.0,0.0,0.0,1.0,3.0,3.0
1,0.0,21.0,152.0,56.0,1.0,0.0,3.0,3.0,2.0,1.0,3.0,1.0,3.0,0.0,2.0,3.0
2,1.0,23.0,180.0,77.0,1.0,0.0,2.0,3.0,2.0,0.0,2.0,0.0,2.0,1.0,1.0,3.0
3,1.0,27.0,180.0,87.0,0.0,0.0,3.0,3.0,2.0,0.0,2.0,0.0,2.0,0.0,1.0,4.0
4,1.0,22.0,178.0,89.8,0.0,0.0,2.0,1.0,2.0,0.0,2.0,0.0,0.0,0.0,2.0,3.0


In [11]:
# Splitting the input features and target variable into training and testing sets.
# No random_state is set, so the split (and every result below) changes from run to run.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
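
A possible variation on this split, sketched on synthetic stand-in data rather than the project dataset: passing `stratify` keeps the class proportions equal in both parts, and `random_state` makes the split repeatable. The array names and the 60/30/10 class mix are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 imbalanced classes (60/30/10).
features = np.arange(200).reshape(100, 2)
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.1,     # same hold-out fraction as the notebook
    stratify=labels,   # keep class proportions in both parts
    random_state=42,   # fixed seed so the split is repeatable
)

print(np.bincount(y_te))  # [6 3 1]
```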

### Label Encoding for Target Variable

In the context of encoding the target variable, 'Obesity,' I chose to use Label Encoding. Label Encoding is a suitable choice when dealing with the target variable, particularly when it represents different categories or classes. It simplifies the target variable by assigning unique integer values to each class, making it compatible with various machine learning algorithms.

Label Encoding represents the classes of the 'Obesity' column numerically in an efficient way. Classifiers treat these integers as class identifiers rather than magnitudes, so the encoding imposes no order on the obesity categories, and the models can predict each category from the actual characteristics of the data.

In [12]:
# Label encoding the 'Obesity' column
le = LabelEncoder()

# Fit and transform the training set
y_train_encoded = le.fit_transform(y_train)

# Transform the testing set using the same label encoder
y_test_encoded = le.transform(y_test)
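
Because `LabelEncoder` numbers the classes in sorted order, the integers that appear in later reports can be mapped back to names through `classes_` and `inverse_transform`. A small sketch using four class names from the dataset preview (the variable names are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Four of the class names visible in the dataset preview.
sample_targets = ["Normal Weight", "Overweight Level I",
                  "Overweight Level II", "Obesity Type III"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_targets)

# classes_ holds the names in sorted order; the position is the integer code.
mapping = {name: code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns predicted integers back into readable names.
print(encoder.inverse_transform([0, 3]).tolist())
```

The same idea should carry over to the fitted `le` in this notebook: passing `target_names=le.classes_` to `classification_report` is one way to print class names instead of integers.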

In [13]:
from collections import Counter

# Checking for class imbalance in the encoded target variables

# Calculate the class distribution in the training set
train_class_distribution = Counter(y_train_encoded)

# Calculate the class distribution in the testing set
test_class_distribution = Counter(y_test_encoded)

# Print class distribution in the training set
print("Class distribution in y_train_encoded:")
for class_label, count in train_class_distribution.items():
    print(f"Class {class_label}: {count} samples")

# Print class distribution in the testing set
print("\nClass distribution in y_test_encoded:")
for class_label, count in test_class_distribution.items():
    print(f"Class {class_label}: {count} samples")

Class distribution in y_train_encoded:
Class 4: 303 samples
Class 5: 258 samples
Class 0: 243 samples
Class 3: 271 samples
Class 2: 307 samples
Class 1: 263 samples
Class 6: 254 samples

Class distribution in y_test_encoded:
Class 2: 44 samples
Class 4: 21 samples
Class 5: 32 samples
Class 1: 24 samples
Class 3: 26 samples
Class 0: 29 samples
Class 6: 36 samples


In [14]:
# Import the library for correcting class imbalance using oversampling
from imblearn.over_sampling import RandomOverSampler

# Create a RandomOverSampler instance with a specified random seed (random_state = 42)
sampler = RandomOverSampler(random_state=42)

# Apply random oversampling to the training set
x_train_resampled, y_train_resampled = sampler.fit_resample(x_train, y_train_encoded)

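# Apply random oversampling to the testing set
# (this duplicates test rows, so the scores below are measured on a rebalanced
# test set rather than on the original class distribution)
x_test_resampled, y_test_resampled = sampler.fit_resample(x_test, y_test_encoded)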
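
What random oversampling does can be sketched in plain Python: rows from the smaller classes are drawn with replacement until every class is as large as the biggest one. This is an illustration of the idea, not imblearn's implementation, and the function and toy data are hypothetical. It is applied to training rows only, which is the usual practice so that evaluation reflects the original class distribution.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy training data only; held-out test rows would be left untouched.
train_rows = [[1], [2], [3], [4], [5], [6]]
train_labels = ["a", "a", "a", "a", "b", "c"]
bal_rows, bal_labels = random_oversample(train_rows, train_labels)
print(Counter(bal_labels))
```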

In [15]:
# Check for class distribution in the resampled training and testing sets
train_class_distribution = Counter(y_train_resampled)
test_class_distribution = Counter(y_test_resampled)

print("Class distribution in y_train:")
for class_label, count in train_class_distribution.items():
    print(f"Class {class_label}: {count} samples")

print("\nClass distribution in y_test:")
for class_label, count in test_class_distribution.items():
    print(f"Class {class_label}: {count} samples")

Class distribution in y_train:
Class 4: 307 samples
Class 5: 307 samples
Class 0: 307 samples
Class 3: 307 samples
Class 2: 307 samples
Class 1: 307 samples
Class 6: 307 samples

Class distribution in y_test:
Class 2: 44 samples
Class 4: 44 samples
Class 5: 44 samples
Class 1: 44 samples
Class 3: 44 samples
Class 0: 44 samples
Class 6: 44 samples


## Training a Random Forest Classifier

I've chosen the Random Forest Classifier as one of the key machine learning models. Random Forest is an ensemble method that combines many decorrelated decision trees, which usually reduces variance compared with a single tree.

#### The Math Behind Random Forest:

The Random Forest algorithm operates based on a few fundamental mathematical concepts:

1. **Decision Trees:** At its core, Random Forest is made up of multiple decision trees. Each tree is constructed using a subset of the training data and random feature selection.

2. **Bootstrap Sampling:** The randomness in Random Forest comes from bootstrapped sampling, where each decision tree is trained on a random sample of the data. This involves drawing multiple random samples (with replacement) from the training dataset.

3. **Voting Mechanism:** In classification tasks like this, Random Forest uses a majority voting mechanism. Each decision tree provides a prediction, and the final prediction is determined by a majority vote among the individual tree predictions.
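
The bootstrap and voting steps can be illustrated with a few lines of plain Python. This is a conceptual sketch with made-up tree predictions; note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def majority_vote(predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_indices(10, rng)
# Rows that were never drawn are "out-of-bag" for this tree.
out_of_bag = sorted(set(range(10)) - set(sample))

tree_votes = [2, 4, 2, 2, 5]  # hypothetical class predictions from five trees
print(majority_vote(tree_votes))  # 2
```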

In [16]:
# Train a Random Forest Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, 
                             random_state=42)

clf.fit(x_train_resampled, y_train_resampled)
y_pred = clf.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.96
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        44
           1       0.85      0.93      0.89        44
           2       1.00      0.98      0.99        44
           3       1.00      1.00      1.00        44
           4       1.00      1.00      1.00        44
           5       1.00      0.93      0.96        44
           6       0.95      0.95      0.95        44

    accuracy                           0.96       308
   macro avg       0.96      0.96      0.96       308
weighted avg       0.96      0.96      0.96       308



### Hyperparameter Tuning for Random Forest

One of the critical aspects of building a highly performant machine learning model is hyperparameter tuning. By adjusting the hyperparameters, we can optimize our model's predictive accuracy and generalization capabilities.

#### Why Hyperparameter Tuning?

Hyperparameters are settings that are not learned from the training data but play a crucial role in the model's performance. These include parameters such as the number of trees (n_estimators), maximum depth of the trees (max_depth), and minimum samples required to split an internal node (min_samples_split). Optimizing these hyperparameters is essential to ensure our model operates at its best.

#### The Hyperparameter Tuning Process

Hyperparameter tuning often involves the following steps:

1. **Selecting Hyperparameters:** We need to choose the hyperparameters to tune. For a Random Forest, this may include n_estimators (the number of trees), max_depth (the maximum depth of the trees), and other parameters that influence the tree's growth and ensemble behavior.

2. **Defining a Search Space:** We specify a range of values or options for each hyperparameter. For example, we might consider values of n_estimators between 50 and 200, max_depth from 5 to 20, and so on.

3. **Evaluation Metric:** We define the metric by which we'll evaluate the model's performance. Common choices include accuracy, F1-score, or area under the ROC curve (AUC). In this case, I went with accuracy since it's a straightforward metric for classification tasks like predicting obesity levels.

4. **Hyperparameter Search:** We employ search techniques like Grid Search or Random Search to explore the combinations of hyperparameters within the defined search space. Each combination is evaluated using the chosen evaluation metric.

5. **Validation Sets:** We split our training data into training and validation sets to estimate how well each hyperparameter combination generalizes to unseen data. Cross-validation is also an effective technique to robustly validate the model.

6. **Optimal Hyperparameters:** The search process yields the optimal hyperparameters that result in the best model performance on the validation sets.
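
To get a feel for the cost of an exhaustive search, the number of candidate combinations can be counted directly. The sketch below enumerates the Random Forest grid used in this notebook; the fold count of 5 is `GridSearchCV`'s default when `cv` is not passed.

```python
from itertools import product

param_grid = {
    "n_estimators": [25, 50, 100, 150],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 6, 9],
    "max_leaf_nodes": [3, 6, 9],
}

# Every combination of one value per hyperparameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5  # GridSearchCV's default number of folds when cv is not set
total_fits = len(combinations) * cv_folds  # excludes the final refit on all training data
print(len(combinations), total_fits)  # 108 540
```

Each extra value in any list multiplies the total, which is why grid search times grow quickly for slower models.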

In [17]:
%%time
# Optimize Random Forest Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.model_selection import GridSearchCV

clf = RandomForestClassifier()
param_grid = { 
    'n_estimators': [25, 50, 100, 150], 
    'max_features': ['sqrt', 'log2', None], 
    'max_depth': [3, 6, 9], 
    'max_leaf_nodes': [3, 6, 9], 
} 
grid_search = GridSearchCV(clf, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
clf_grid = grid_search.best_estimator_
y_pred_grid = clf_grid.predict(x_test_resampled)

accuracy = clf_grid.score(x_test_resampled, y_test_resampled)
print("Accuracy on Test Set:", accuracy)
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy on Test Set: 0.775974025974026
              precision    recall  f1-score   support

           0       0.64      0.86      0.74        44
           1       0.73      0.50      0.59        44
           2       1.00      0.50      0.67        44
           3       0.79      1.00      0.88        44
           4       1.00      1.00      1.00        44
           5       0.85      0.75      0.80        44
           6       0.62      0.82      0.71        44

    accuracy                           0.78       308
   macro avg       0.80      0.78      0.77       308
weighted avg       0.80      0.78      0.77       308

CPU times: total: 2min 35s
Wall time: 3min 21s


## Training a Decision Tree Classifier

In this phase of the project, I chose to employ the Decision Tree Classifier as one of the machine learning models to predict obesity levels. This selection was based on the model's unique characteristics and its suitability for this classification task.

#### The Math Behind Decision Trees

At the heart of Decision Trees is a set of mathematical principles that guide the process of selecting the best attribute for node splitting. The primary mathematical concepts include:

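1. **Gini Impurity:** Gini impurity measures how mixed the class labels in a node are: it is the probability that a randomly drawn sample would be misclassified if it were labelled according to the node's class distribution. The Decision Tree aims to minimize the Gini impurity by selecting the attribute whose split produces the purest child nodes.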

2. **Information Gain:** Information Gain, based on entropy, quantifies the reduction in uncertainty achieved by a specific attribute split. It plays a vital role in selecting the most informative attribute for the current node.

3. **Entropy:** Entropy is a measure of disorder or impurity in a dataset. A split that minimizes entropy is preferred, as it leads to a more organized classification in the leaves.
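
For a node with class proportions \( p_k \), Gini impurity is \( 1 - \sum_k p_k^2 \) and entropy is \( -\sum_k p_k \log_2 p_k \); information gain is the parent's entropy minus the size-weighted entropy of the children. A minimal sketch with hypothetical helper functions:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["A", "A", "B", "B"]
print(gini(parent))     # 0.5
print(entropy(parent))  # 1.0
print(information_gain(parent, ["A", "A"], ["B", "B"]))  # 1.0
```

A split that separates the two classes perfectly recovers the full bit of entropy, which is the largest gain possible for this parent node.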

In [18]:
# Train a Decision Tree Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42)

dtc.fit(x_train_resampled, y_train_resampled)
y_pred = dtc.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.93
              precision    recall  f1-score   support

           0       0.88      0.84      0.86        44
           1       0.79      0.84      0.81        44
           2       0.93      0.98      0.96        44
           3       1.00      0.98      0.99        44
           4       1.00      1.00      1.00        44
           5       0.93      0.91      0.92        44
           6       0.95      0.93      0.94        44

    accuracy                           0.93       308
   macro avg       0.93      0.93      0.93       308
weighted avg       0.93      0.93      0.93       308



### Hyperparameter Tuning for Decision Tree Classifier

In [19]:
%%time
# Optimize Decision Tree Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
dtc = DecisionTreeClassifier()
param_grid = { 
    'max_depth': [3, 6, 9], 
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
} 
grid_search = GridSearchCV(dtc, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
dtc_grid = grid_search.best_estimator_
y_pred_grid = dtc_grid.predict(x_test_resampled)

accuracy = dtc_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy:{accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy:0.94
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        44
           1       0.86      0.84      0.85        44
           2       1.00      0.93      0.96        44
           3       0.98      1.00      0.99        44
           4       1.00      1.00      1.00        44
           5       0.93      0.91      0.92        44
           6       0.93      0.98      0.96        44

    accuracy                           0.94       308
   macro avg       0.94      0.94      0.94       308
weighted avg       0.94      0.94      0.94       308

CPU times: total: 1.67 s
Wall time: 1.91 s


## Training a Gradient Boosting Classifier

Gradient Boosting Classifier is a powerful and versatile algorithm that excels in solving classification problems. This classifier is known for its capacity to produce highly accurate predictions by sequentially combining multiple "weak" learners into a strong ensemble model.

#### Key Components:

- **Weak Learners**: Gradient Boosting employs a collection of simple models, often decision trees with a limited depth. These models are referred to as "weak learners" because each one needs to perform only slightly better than random guessing.

- **Loss Function**: A loss function is chosen, which measures the difference between the actual target values and the predictions made by the ensemble. The goal is to minimize this loss.

- **Gradient Descent**: The algorithm uses gradient descent to minimize the loss function. It computes the gradient (derivative) of the loss with respect to the predictions, indicating the direction in which the predictions need to be adjusted.

- **Sequential Learning**: Weak learners are trained sequentially, with each one focusing on the mistakes of the ensemble up to that point. Each new learner is fit to the errors (negative gradients) of the current ensemble, and its output is added to the running prediction.

### The Math Behind Gradient Boosting

The mathematical core of Gradient Boosting is rooted in optimization and regression. The objective is to build an additive model, one weak learner at a time, that minimizes the chosen loss function. The following are the fundamental steps:

1. **Initialization**: The algorithm begins with a simple model, often with constant predictions.

2. **Gradient Calculation**: For each data point, the gradient (derivative) of the loss function with respect to the current prediction is computed. This gradient indicates how much the predictions need to be adjusted.

3. **Weak Learner Training**: A weak learner is trained to approximate the negative gradient, aiming to reduce the loss function.

4. **Shrinkage**: The predictions of the weak learner are multiplied by a small learning rate and added to the ensemble. This step ensures that each weak learner contributes incrementally to the final prediction.

5. **Iterative Process**: Steps 2-4 are repeated for a specified number of iterations or until convergence is achieved.
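
The five steps can be traced in a toy regression setting. The sketch below assumes squared-error loss \( \tfrac{1}{2}(y - F(x))^2 \), for which the negative gradient is simply the residual, and uses one-split stumps as weak learners. The data, helper names, and learning rate are illustrative; scikit-learn's classifier uses a different loss and deeper trees.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]
learning_rate = 0.5

# Step 1: start from a constant prediction (the mean minimises squared error).
preds = [sum(ys) / len(ys)] * len(ys)
history = [mse(ys, preds)]

for _ in range(20):
    # Step 2: for squared error, the negative gradient is the residual y - prediction.
    residuals = [y - p for y, p in zip(ys, preds)]
    # Step 3: fit a weak learner to the negative gradient.
    stump = fit_stump(xs, residuals)
    # Step 4: add the shrunken stump output to the ensemble prediction.
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print(history[0], history[-1])  # training error before and after boosting
```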

In [20]:
# Train a Gradient Boosting Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(random_state=42)

gbc.fit(x_train_resampled, y_train_resampled)
y_pred = gbc.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.96
              precision    recall  f1-score   support

           0       0.98      0.98      0.98        44
           1       0.97      0.89      0.93        44
           2       0.98      0.98      0.98        44
           3       0.98      0.98      0.98        44
           4       1.00      1.00      1.00        44
           5       0.89      0.95      0.92        44
           6       0.96      0.98      0.97        44

    accuracy                           0.96       308
   macro avg       0.97      0.96      0.96       308
weighted avg       0.97      0.96      0.96       308



### Hyperparameter Tuning for Gradient Boosting Classifier

In [21]:
%%time
# Optimize Gradient Boosting Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
gbc = GradientBoostingClassifier()
param_grid = { 
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 6, 9],
} 
grid_search = GridSearchCV(gbc, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
gbc_grid = grid_search.best_estimator_
y_pred_grid = gbc_grid.predict(x_test_resampled)

accuracy = gbc_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy :2f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy: 0.983766
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        44
           1       0.98      0.95      0.97        44
           2       1.00      0.98      0.99        44
           3       0.98      1.00      0.99        44
           4       1.00      1.00      1.00        44
           5       0.98      1.00      0.99        44
           6       1.00      0.98      0.99        44

    accuracy                           0.98       308
   macro avg       0.98      0.98      0.98       308
weighted avg       0.98      0.98      0.98       308

CPU times: total: 18min 46s
Wall time: 20min 51s


## Training a Logistic Regression Model

Logistic Regression is a popular statistical model used for binary classification tasks. It is primarily employed when the target variable is categorical. The fundamental concept underlying Logistic Regression is the logistic function (also known as the sigmoid function), which maps input values to a range between 0 and 1. This characteristic is crucial for estimating the probability that a given sample belongs to a particular class.

The logistic function is defined as follows:

\[ f(z) = \frac{1}{1 + e^{-z}} \]

where \( z \) represents the linear combination of the input features and their associated coefficients.

The logistic function transforms the input data, allowing us to model the probability of a binary outcome. In the case of multiple classes, Logistic Regression can be extended to perform multiclass classification using various techniques such as one-vs-rest or multinomial Logistic Regression.
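
As a small illustration of both cases, the sketch below evaluates the logistic function for a binary score and the softmax function used by multinomial Logistic Regression. The weights, bias, and feature values are made up for the example and are not taken from the trained model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Multiclass generalization: one probability per class, summing to 1."""
    highest = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical coefficients and one hypothetical sample
weights = [0.8, -0.4]
bias = 0.1
features = [2.0, 1.0]

# z is the linear combination of the features and their coefficients
z = bias + sum(w * x for w, x in zip(weights, features))
print(f"z = {z:.2f}, P(class 1) = {sigmoid(z):.3f}")

# One linear score per class (three hypothetical classes)
class_probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in class_probabilities])
```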

In [22]:
# Train a Logistic Regression model, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42)

lr.fit(x_train_resampled, y_train_resampled)
y_pred = lr.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.80
              precision    recall  f1-score   support

           0       0.81      0.77      0.79        44
           1       0.62      0.64      0.63        44
           2       0.83      0.86      0.84        44
           3       0.91      0.98      0.95        44
           4       0.98      0.98      0.98        44
           5       0.64      0.77      0.70        44
           6       0.81      0.57      0.67        44

    accuracy                           0.80       308
   macro avg       0.80      0.80      0.79       308
weighted avg       0.80      0.80      0.79       308



### Hyperparameter Tuning for Logistic Regression

In [23]:
%%time
# Optimize Logistic Regression using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
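lr = LogisticRegression()
# Note: the default 'lbfgs' solver supports only the 'l2' penalty, so the 'l1'
# candidates fail to fit and are scored as NaN. Pass solver='liblinear' or
# solver='saga' to LogisticRegression if 'l1' should actually be evaluated.
param_grid = { 
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
}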
grid_search = GridSearchCV(lr, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
lr_grid = grid_search.best_estimator_
y_pred_grid = lr_grid.predict(x_test_resampled)

accuracy = lr_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy:.6f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy: 0.811688
              precision    recall  f1-score   support

           0       0.81      0.77      0.79        44
           1       0.65      0.73      0.69        44
           2       0.83      0.80      0.81        44
           3       0.90      0.98      0.93        44
           4       0.98      1.00      0.99        44
           5       0.72      0.77      0.75        44
           6       0.80      0.64      0.71        44

    accuracy                           0.81       308
   macro avg       0.81      0.81      0.81       308
weighted avg       0.81      0.81      0.81       308

CPU times: total: 4.91 s
Wall time: 1.96 s


## Training a Support Vector Machine (SVM) Classifier with Linear Kernel

I used a Support Vector Machine (SVM) classifier with a linear kernel for this model. SVMs are a fundamental machine learning method, effective for both binary and multiclass classification.

#### The Math Behind SVM

The essence of SVM lies in finding the optimal hyperplane that best separates data points of different classes. In the case of a linear kernel, SVM aims to discover a linear decision boundary. The key mathematical concept is to maximize the margin between the hyperplane and the nearest data points (support vectors). This optimization problem can be formulated as a quadratic programming task, where the goal is to find the weights and bias that define the hyperplane.

The margin is defined as the perpendicular distance from the hyperplane to the nearest data point, and the objective is to find the hyperplane with the largest margin while minimizing classification errors. This leads to a dual problem in which Lagrange multipliers are employed to find the optimal solution.
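
To make the margin concrete, the sketch below computes the geometric margin of two candidate hyperplanes on a tiny two-class dataset. The points, labels, and candidate hyperplanes are invented for illustration, and the helper names are hypothetical. A real SVM solver searches over all hyperplanes rather than comparing two by hand.

```python
import math


def signed_distance(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def geometric_margin(w, b, points, labels):
    """Smallest label-signed distance; positive only if every point is on its correct side."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))


# Toy data: labels are -1 or +1, as in the usual SVM formulation
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]
labels = [-1, -1, 1, 1]

# Two hyperplanes that both separate the classes
margin_diagonal = geometric_margin((1.0, 1.0), -5.0, points, labels)   # x1 + x2 = 5
margin_vertical = geometric_margin((1.0, 0.0), -3.0, points, labels)   # x1 = 3

print(f"margin of x1 + x2 = 5: {margin_diagonal:.3f}")
print(f"margin of x1 = 3:      {margin_vertical:.3f}")
```

Both hyperplanes classify every point correctly, but the diagonal one leaves more room to the nearest points, which is the property the SVM objective rewards.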

#### Choice of Kernel: Linear

The SVM algorithm offers various kernel options, including linear, polynomial, radial basis function (RBF), and more. The choice of kernel depends on the nature of the data and the problem at hand.

For this project, we opt for a linear kernel for two reasons:
1. **Linear Separability**: The data exhibits reasonably linear separability, meaning that a linear decision boundary can effectively separate different classes. In cases where data clusters are well-differentiated along a straight line, a linear kernel is often a pragmatic choice.

2. **Lower Complexity**: Linear kernels are computationally less intensive than their non-linear counterparts. This results in faster training and prediction times, making them suitable for this project's context.

In [24]:
# Train a Support Vector Machine (SVM) Classifier with a linear kernel, make predictions, calculate accuracy, and generate a classification report for evaluation.
from sklearn.svm import SVC

svm = SVC(kernel='linear', 
          random_state=42)

svm.fit(x_train_resampled, y_train_resampled)
y_pred = svm.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.97
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        44
           1       0.93      0.95      0.94        44
           2       0.98      1.00      0.99        44
           3       1.00      1.00      1.00        44
           4       1.00      1.00      1.00        44
           5       0.98      0.95      0.97        44
           6       0.98      0.95      0.97        44

    accuracy                           0.97       308
   macro avg       0.97      0.97      0.97       308
weighted avg       0.97      0.97      0.97       308



### Hyperparameter Tuning for SVM

In [25]:
%%time
# Optimize a Support Vector Machine (SVM) Classifier with a linear kernel using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
svm = SVC(kernel='linear')
param_grid = { 
    'C': [0.01, 0.1, 1, 10],
}
grid_search = GridSearchCV(svm, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
svm_grid = grid_search.best_estimator_
y_pred_grid = svm_grid.predict(x_test_resampled)

accuracy = svm_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy:.6f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy: 0.974026
              precision    recall  f1-score   support

           0       0.95      0.95      0.95        44
           1       0.93      0.95      0.94        44
           2       0.98      1.00      0.99        44
           3       1.00      1.00      1.00        44
           4       1.00      1.00      1.00        44
           5       0.98      0.95      0.97        44
           6       0.98      0.95      0.97        44

    accuracy                           0.97       308
   macro avg       0.97      0.97      0.97       308
weighted avg       0.97      0.97      0.97       308

CPU times: total: 1.41 s
Wall time: 1.46 s


## Training a LightGBM Classifier

In this step, we are training a LightGBM Classifier to create a predictive model for our dataset. LightGBM is a gradient boosting framework that uses tree-based learning algorithms, known for its efficiency and speed in handling large datasets. The algorithm works by creating decision trees iteratively, optimizing the objective function at each step to minimize errors.

### The Math behind LightGBM

LightGBM uses gradient boosting, a machine learning technique that combines the predictions of many individual models (in this case, decision trees) into a single, more robust model. Each new tree is fit to the negative gradient of the loss with respect to the current predictions, so adding it, scaled by the learning rate, moves the ensemble a step toward lower loss. This is gradient descent carried out over the predictions themselves rather than over a fixed set of parameters.

In [26]:
# Train a LightGBM Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(random_state=42)

lgb.fit(x_train_resampled, y_train_resampled)
y_pred = lgb.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000623 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1984
[LightGBM] [Info] Number of data points in the train set: 2149, number of used features: 16
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
Accuracy: 0.98
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        44
           1       0.93      0.93      0.93        44
           2       1.00      1.00      1.00        44
           3       1.00      1.00     

### Hyperparameter Tuning for LightGBM

In [27]:
%%time
# Optimize a LightGBM Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
lgb_classifier = LGBMClassifier()
param_grid = { 
    'num_leaves': [31, 50, 100],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
}
grid_search = GridSearchCV(lgb_classifier, 
                           param_grid=param_grid,
                           scoring="accuracy") 

grid_search.fit(x_train_resampled, y_train_resampled) 
lgb_grid = grid_search.best_estimator_
y_pred_grid = lgb_grid.predict(x_test_resampled)

accuracy = lgb_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy:.6f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000324 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1964
[LightGBM] [Info] Number of data points in the train set: 1719, number of used features: 16
[LightGBM] [Info] Start training from score -1.948240
[LightGBM] [Info] Start training from score -1.948240
[LightGBM] [Info] Start training from score -1.944166
[LightGBM] [Info] Start training from score -1.944166
[LightGBM] [Info] Start training from score -1.948240
[LightGBM] [Info] Start training from score -1.944166
[LightGBM] [Info] Start training from score -1.944166
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000365 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1963
[LightGBM] [Info] Number of data points in the train set: 1719, number of used features: 16
[LightGBM] [Info] Start training from score -1

## Training a CatBoost Classifier


### Understanding the Math Behind CatBoost

CatBoost, short for "Categorical Boosting," is a powerful gradient boosting framework for machine learning. What sets CatBoost apart is its exceptional ability to handle categorical features naturally, eliminating the need for extensive preprocessing. 

At its core, CatBoost employs a gradient boosting approach, which combines multiple decision trees to make accurate predictions. However, it incorporates several mathematical techniques and optimizations to enhance performance:

1. **Ordered Boosting**: CatBoost uses ordered boosting, a permutation-driven variant of gradient boosting. The residual for each training example is computed with a model fitted only on the examples that precede it in a random permutation. This avoids the target leakage (prediction shift) of standard boosting and reduces overfitting.

2. **Categorical Feature Support**: Unlike traditional gradient boosting frameworks, CatBoost can directly work with categorical features without the need for one-hot encoding or label encoding. It internally handles the encoding of categorical data using techniques such as "ordered target encoding" and "combinations of categorical features."

3. **Regularization and Shrinkage**: CatBoost applies L2 regularization to leaf values (the `l2_leaf_reg` parameter) and scales each tree's contribution by the learning rate. Both limit model complexity, which helps prevent overfitting and improves generalization.

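4. **Bayesian Bootstrap**: CatBoost can weight training examples with a Bayesian bootstrap (`bootstrap_type='Bayesian'`), which randomizes each tree's view of the data and acts as a further regularizer. CatBoost does not tune its own hyperparameters, so the grid search later in this section is still needed to compare settings.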
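
The idea behind ordered target encoding can be sketched in a few lines. This is a simplified illustration under stated assumptions, not CatBoost's internal code: it assumes a binary target, treats the given row order as the random permutation, and uses hypothetical `prior` and `prior_weight` values. Each row is encoded using only the targets of rows that come before it, so a row never sees its own label.

```python
def ordered_target_encoding(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each category from the targets of *earlier* rows only."""
    target_sums = {}    # running sum of targets seen so far, per category
    counts = {}         # running number of rows seen so far, per category
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = target_sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of previously seen targets; falls back to the prior
        encoded.append((seen_sum + prior * prior_weight) / (seen_count + prior_weight))
        # Only now reveal this row's target to later rows
        target_sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


# Hypothetical categorical feature and binary target
categories = ["bus", "car", "bus", "bus", "car"]
targets = [1, 0, 1, 0, 1]

encoded = ordered_target_encoding(categories, targets)
for category, value in zip(categories, encoded):
    print(f"{category}: {value:.3f}")
```

The first occurrence of each category receives the prior, and later occurrences move toward the observed target mean as more history accumulates.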

In [28]:
%%time
# Train a CatBoost Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from catboost import CatBoostClassifier

catboost = CatBoostClassifier(iterations=500, 
                              depth=6, 
                              learning_rate=0.1, 
                              loss_function='MultiClass', 
                              random_state=42)

catboost.fit(x_train_resampled, y_train_resampled)
y_pred = catboost.predict(x_test_resampled)
# CatBoost returns a column vector of class labels; flatten it to a 1-D integer array
y_pred_class = np.asarray(y_pred).ravel().astype(int)
accuracy = accuracy_score(y_test_resampled, y_pred_class)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred_class)
print(class_report)

0:	learn: 1.6762834	total: 155ms	remaining: 1m 17s
1:	learn: 1.4960383	total: 164ms	remaining: 40.8s
2:	learn: 1.3744470	total: 173ms	remaining: 28.6s
3:	learn: 1.2704536	total: 182ms	remaining: 22.5s
4:	learn: 1.1894442	total: 191ms	remaining: 18.9s
5:	learn: 1.1008391	total: 201ms	remaining: 16.5s
6:	learn: 1.0318486	total: 210ms	remaining: 14.8s
7:	learn: 0.9721807	total: 219ms	remaining: 13.5s
8:	learn: 0.9170315	total: 229ms	remaining: 12.5s
9:	learn: 0.8643642	total: 238ms	remaining: 11.7s
10:	learn: 0.8196044	total: 247ms	remaining: 11s
11:	learn: 0.7828187	total: 256ms	remaining: 10.4s
12:	learn: 0.7470204	total: 265ms	remaining: 9.94s
13:	learn: 0.7180522	total: 276ms	remaining: 9.58s
14:	learn: 0.6839595	total: 286ms	remaining: 9.26s
15:	learn: 0.6573391	total: 298ms	remaining: 9.02s
16:	learn: 0.6325730	total: 310ms	remaining: 8.8s
17:	learn: 0.6109216	total: 321ms	remaining: 8.61s
18:	learn: 0.5903778	total: 333ms	remaining: 8.44s
19:	learn: 0.5702112	total: 345ms	remaining

### Hyperparameter Tuning for CatBoost

In [29]:
%%time
# Optimize a CatBoost Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
catboost_classifier = CatBoostClassifier()
param_grid = {
    'iterations': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'depth': [6, 8, 10],
    'l2_leaf_reg': [1, 3, 5],
}
grid_search = GridSearchCV(catboost_classifier, 
                           param_grid=param_grid, 
                           scoring="accuracy", 
                           cv=3)

grid_search.fit(x_train_resampled, y_train_resampled)
catboost_grid = grid_search.best_estimator_
y_pred_grid = catboost_grid.predict(x_test_resampled)

accuracy = catboost_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

0:	learn: 1.9188280	total: 11.4ms	remaining: 1.13s
1:	learn: 1.8937096	total: 23.4ms	remaining: 1.14s
2:	learn: 1.8684044	total: 35ms	remaining: 1.13s
3:	learn: 1.8417117	total: 46.8ms	remaining: 1.12s
4:	learn: 1.8174592	total: 58.2ms	remaining: 1.11s
5:	learn: 1.7980804	total: 69.2ms	remaining: 1.08s
6:	learn: 1.7785994	total: 80.7ms	remaining: 1.07s
7:	learn: 1.7565894	total: 92ms	remaining: 1.06s
8:	learn: 1.7345841	total: 103ms	remaining: 1.04s
9:	learn: 1.7120423	total: 114ms	remaining: 1.03s
10:	learn: 1.6917289	total: 125ms	remaining: 1.01s
11:	learn: 1.6698671	total: 137ms	remaining: 1s
12:	learn: 1.6521946	total: 148ms	remaining: 993ms
13:	learn: 1.6349417	total: 160ms	remaining: 984ms
14:	learn: 1.6183511	total: 172ms	remaining: 975ms
15:	learn: 1.5992908	total: 184ms	remaining: 964ms
16:	learn: 1.5846609	total: 195ms	remaining: 954ms
17:	learn: 1.5689068	total: 207ms	remaining: 942ms
18:	learn: 1.5530788	total: 219ms	remaining: 932ms
19:	learn: 1.5367786	total: 231ms	remain

## Understanding XGBoost Classifier

The XGBoost (Extreme Gradient Boosting) Classifier is an advanced and powerful implementation of the gradient boosting framework. It is designed to provide high performance and efficiency when dealing with large and complex datasets. XGBoost employs an ensemble learning technique that combines the strengths of multiple weak learners to create an accurate predictive model.

#### Mathematical Background

XGBoost utilizes an optimized gradient boosting algorithm that aims to minimize the loss function by adding new models to the ensemble. This process involves fitting new models to the residuals of the previous models, effectively reducing the errors in each iteration. The key components of XGBoost include:

1. **Gradient Tree Boosting**: XGBoost builds a sequence of decision trees to predict the residuals of the previous models. Each new tree is added to the ensemble, contributing to the overall prediction. The trees are constructed based on the gradients of the loss function, allowing the algorithm to minimize the errors effectively.

2. **Regularization**: XGBoost adds regularization terms to its objective to prevent overfitting and improve generalization. L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties act on the leaf weights of each tree, and `gamma` penalizes each additional leaf, so overly complex trees are discouraged.

3. **Weighted Quantile Sketch**: For large datasets, XGBoost can use approximate split finding instead of scanning every feature value. A weighted quantile sketch proposes candidate split points from the feature distribution, weighting each example by its second-order gradient (Hessian), which keeps training scalable.
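
The effect of the L2 penalty on a single tree can be shown with the closed-form expressions from the XGBoost formulation: the optimal leaf weight is `-G / (H + lambda)` and the gain of a split compares the scores of the two children with the score of the parent. The sketch below is a simplified illustration that omits the L1 term; the gradient and Hessian sums are made-up numbers and the function names are hypothetical.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf weight with L2 regularization: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Quality of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into a left and a right child."""
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    # gamma is the fixed cost of adding one more leaf
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient (G) and Hessian (H) sums for the two sides of a split
g_left, h_left = -4.0, 3.0
g_right, h_right = 5.0, 4.0

gain_weak_penalty = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0)
gain_strong_penalty = split_gain(g_left, h_left, g_right, h_right, reg_lambda=10.0)

print(f"left leaf weight:      {leaf_weight(g_left, h_left):.3f}")
print(f"right leaf weight:     {leaf_weight(g_right, h_right):.3f}")
print(f"gain with lambda = 1:  {gain_weak_penalty:.4f}")
print(f"gain with lambda = 10: {gain_strong_penalty:.4f}")
```

A larger `reg_lambda` shrinks both the leaf weights and the gain, so fewer splits clear the threshold set by `gamma`.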

In [30]:
# Train an XGBoost Classifier, make predictions, calculate accuracy, and generate a classification report for evaluation.
from xgboost import XGBClassifier

xgb = XGBClassifier(random_state=42)

xgb.fit(x_train_resampled, y_train_resampled)
y_pred = xgb.predict(x_test_resampled)
accuracy = accuracy_score(y_test_resampled, y_pred)

print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred)
print(class_report)

Accuracy: 0.97
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        44
           1       0.97      0.82      0.89        44
           2       1.00      1.00      1.00        44
           3       1.00      1.00      1.00        44
           4       1.00      1.00      1.00        44
           5       0.94      1.00      0.97        44
           6       0.91      0.98      0.95        44

    accuracy                           0.97       308
   macro avg       0.97      0.97      0.97       308
weighted avg       0.97      0.97      0.97       308



### Hyperparameter Tuning for XGBoost

In [31]:
%%time
# Optimize an XGBoost Classifier using Grid Search, make predictions, calculate accuracy, and generate a classification report for evaluation.
xgb_classifier = XGBClassifier()
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 2, 3],
}
grid_search = GridSearchCV(xgb_classifier, 
                           param_grid=param_grid, 
                           scoring="accuracy", 
                           cv=3)

grid_search.fit(x_train_resampled, y_train_resampled)
xgb_grid = grid_search.best_estimator_
y_pred_grid = xgb_grid.predict(x_test_resampled)

accuracy = xgb_grid.score(x_test_resampled, y_test_resampled)
print(f"Accuracy: {accuracy:.2f}")
class_report = classification_report(y_test_resampled, y_pred_grid)
print(class_report)

Accuracy: 0.98
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        44
           1       0.98      0.91      0.94        44
           2       0.98      1.00      0.99        44
           3       1.00      0.98      0.99        44
           4       1.00      1.00      1.00        44
           5       0.94      1.00      0.97        44
           6       1.00      0.98      0.99        44

    accuracy                           0.98       308
   macro avg       0.98      0.98      0.98       308
weighted avg       0.98      0.98      0.98       308

CPU times: total: 3min 24s
Wall time: 1min 24s


## Creating an Ensemble

In [45]:
# Predictions for individual models
predictions_svm = svm_grid.predict(x_test_resampled)
predictions_lgb = lgb_grid.predict(x_test_resampled)
predictions_lr = lr_grid.predict(x_test_resampled)
predictions_gbc = gbc_grid.predict(x_test_resampled)
predictions_dtc = dtc_grid.predict(x_test_resampled)
predictions_rfc = clf_grid.predict(x_test_resampled)
predictions_catboost = catboost_grid.predict(x_test_resampled)
predictions_xgb = xgb_grid.predict(x_test_resampled)

# Ravel the predictions from Cat Boost to ensure compatibility
predictions_catboost1 = predictions_catboost.ravel()

# Create a DataFrame to store the model predictions
ensemble_df = pd.DataFrame({
    'SVM': predictions_svm,
    'Light Gradient Boosting': predictions_lgb,
    'Logistic Regression': predictions_lr,
    'Gradient Boosting': predictions_gbc,
    'Decision Tree': predictions_dtc,
    'Random Forest Classifier': predictions_rfc,
    'Cat Boost': predictions_catboost1,
    'XG Boost': predictions_xgb
})

# Perform ensemble (majority vote)
ensemble_df['Ensemble'] = ensemble_df.mode(axis=1).iloc[:, 0]

# Evaluate the ensemble's performance
accuracy = accuracy_score(y_test_resampled, ensemble_df['Ensemble'])
print(f"Ensemble Accuracy: {accuracy:.2f}")

Ensemble Accuracy: 0.99
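
The majority vote can also be written without pandas, which makes the tie-breaking rule explicit. The sketch below is an illustrative equivalent, assuming that `DataFrame.mode` returns tied values in ascending order so that `.iloc[:, 0]` selects the smallest tied label. With eight models a 4-4 tie is possible. The function name and the vote rows are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(votes)
    top_count = max(counts.values())
    return min(label for label, count in counts.items() if count == top_count)


# Each row holds the eight model predictions for one sample (made-up values)
vote_rows = [
    [1, 1, 1, 5, 1, 1, 1, 1],   # clear majority for class 1
    [5, 5, 6, 6, 5, 6, 5, 6],   # 4-4 tie between classes 5 and 6
    [0, 0, 1, 1, 2, 2, 3, 3],   # four-way tie
]

ensemble_labels = [majority_vote(row) for row in vote_rows]
print(ensemble_labels)
```

If ties should favor the stronger models instead, one option is to weight each vote by that model's validation accuracy.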


## Creating Prediction System Based on User Input

In [33]:
def manual_testing(user_input):
    # Create a dictionary with user-provided input variables
    input_variables = {
        "Gender": [user_input['Gender']],
        "Age": [user_input['Age']],
        "Height": [user_input['Height']],
        "Weight": [user_input['Weight']],
        "Family History with Overweight": [user_input['Family History with Overweight']],
        "Frequent consumption of high caloric food": [user_input['Frequent consumption of high caloric food']],
        "Frequency of consumption of vegetables": [user_input['Frequency of consumption of vegetables']],
        "Number of main meals": [user_input['Number of main meals']],
        "Consumption of food between meals": [user_input['Consumption of food between meals']],
        "Smoke": [user_input['Smoke']],
        "Consumption of water daily": [user_input['Consumption of water daily']],
        "Calories consumption monitoring": [user_input['Calories consumption monitoring']],
        "Physical activity frequency": [user_input['Physical activity frequency']],
        "Time using technology devices": [user_input['Time using technology devices']],
        "Consumption of alcohol": [user_input['Consumption of alcohol']],
        "Transportation used": [user_input['Transportation used']]
    } 

    # Create a DataFrame from the user input
    new_tb = pd.DataFrame(input_variables)
    new_x_test = new_tb

    # Make predictions using the ensemble of machine learning models
    predictions_svm = svm_grid.predict(new_x_test)
    predictions_lgb = lgb_grid.predict(new_x_test)
    predictions_lr = lr_grid.predict(new_x_test)
    predictions_gbc = gbc_grid.predict(new_x_test)
    predictions_dtc = dtc_grid.predict(new_x_test)
    predictions_rfc = clf_grid.predict(new_x_test)
    predictions_catboost = catboost_grid.predict(new_x_test)
    predictions_xgb = xgb_grid.predict(new_x_test)

    # Ravel the predictions from Cat Boost to ensure compatibility
    predictions_catboost1 = predictions_catboost.ravel()

    # Create a DataFrame to store the model predictions
    ensemble_df = pd.DataFrame({
        'SVM': predictions_svm,
        'Light Gradient Boosting': predictions_lgb,
        'Logistic Regression': predictions_lr,
        'Gradient Boosting': predictions_gbc,
        'Decision Tree': predictions_dtc,
        'Random Forest Classifier': predictions_rfc,
        'Cat Boost': predictions_catboost1,
        'XG Boost': predictions_xgb
    })
    
    # Perform ensemble (majority vote) across the eight model columns
    majority_vote = ensemble_df.mode(axis=1).iloc[:, 0].astype(int)
    
    # Mapping of numeric predictions to corresponding words
    class_mapping = {
        0: "Insufficient Weight",
        1: "Normal Weight",
        2: "Obesity Type I",
        3: "Obesity Type II",
        4: "Obesity Type III",
        5: "Overweight Level I",
        6: "Overweight Level II"
    }

    # Map the winning numeric label to its class name
    ensemble_df['Ensemble'] = majority_vote.map(class_mapping)

    # Return the ensemble prediction
    return ensemble_df['Ensemble'].iloc[0]

In [None]:
# Collect user input for various health and lifestyle factors
user_input = {
    'Gender': float(input("Gender" 
                          "\n1 (Male)"
                          "\n0 (Female)"
                          "\nAns: ")),
    'Age': float(input("Age (numeric value): ")),
    'Height': float(input("Height (Centimeters): ")),
    'Weight': float(input("Weight (Kilograms): ")),
    'Family History with Overweight': float(input("Do you have a family history with overweight?"
                                                  "\n1 (yes)" 
                                                  "\n0 (no)"
                                                  "\nAns: ")),
    'Frequent consumption of high caloric food': float(input("Do you frequently consume high caloric food?"
                                                             "\n1 (yes)" 
                                                             "\n0 (no)"
                                                             "\nAns: ")),
    'Frequency of consumption of vegetables': float(input("Frequency of consuming vegetables"
                                                          "\n1 (Never)"
                                                          "\n2 (Sometimes)"
                                                          "\n3 (Always)"
                                                          "\nAns: ")),
    'Number of main meals': float(input("Number of main meals you have per day [1, 2, 3, 4]: ")),
    'Consumption of food between meals': float(input("Do you consume food between meals?"
                                                     "\n0 (always)"
                                                     "\n1 (frequently)"
                                                     "\n2 (sometimes)"
                                                     "\n3 (no)"
                                                     "\nAns: ")),
    'Smoke': float(input("Do you smoke?"
                         "\n1 (yes)"
                         "\n0 (no)"
                         "\nAns: ")),
    'Consumption of water daily': float(input("Daily consumption of water in liters [1, 2, 3, 4]: ")),
    'Calories consumption monitoring': float(input("Do you monitor your calories consumption?"
                                                   "\n1 (yes)"
                                                   "\n0 (no)"
                                                   "\nAns: ")),
    'Physical activity frequency': float(input("Frequency of physical activity"
                                               "\n0 (I do not)"
                                               "\n1 (1 - 2 days)"
                                               "\n2 (2 - 4 days)"
                                               "\n3 (4 - 5 days)"
                                               "\nAns: ")),
    'Time using technology devices': float(input("Select the time in hours using technology devices daily"
                                                 "\n0 (0-2 hours)"
                                                 "\n1 (3-5 hours)"
                                                 "\n2 (More than 5 hours)"
                                                 "\nAns: ")),
    'Consumption of alcohol': float(input("Do you consume alcohol?"
                                          "\n0 (always)"
                                          "\n1 (frequently)"
                                          "\n2 (sometimes)"
                                          "\n3 (no)"
                                          "\nAns: ")),
    'Transportation used': float(input("Mode of transportation you use"
                                       "\n0 (Automobile)"
                                       "\n1 (Bike)"
                                       "\n2 (Motorbike)"
                                       "\n3 (Public Transportation)"
                                       "\n4 (Walking)"
                                       "\nAns: "))
}

# Invoke the manual_testing function with the collected user input
manual_testing(user_input)

In [36]:
model_names = ["SVM", "Light Gradient Boosting", "Logistic Regression", "Gradient Boosting",
               "Decision Tree", "Random Forest Classifier", "Cat Boost", "XG Boost"]
accuracies = []

predictions_svm = svm_grid.predict(x_test_resampled)
predictions_lgb = lgb_grid.predict(x_test_resampled)
predictions_lr = lr_grid.predict(x_test_resampled)
predictions_gbc = gbc_grid.predict(x_test_resampled)
predictions_dtc = dtc_grid.predict(x_test_resampled)
predictions_rfc = clf_grid.predict(x_test_resampled)
predictions_catboost = catboost_grid.predict(x_test_resampled)
predictions_xgb = xgb_grid.predict(x_test_resampled)

for predictions in [predictions_svm, predictions_lgb, predictions_lr, predictions_gbc,
                   predictions_dtc, predictions_rfc, predictions_catboost, predictions_xgb]:
    accuracy = accuracy_score(y_test_resampled, predictions)   
    accuracies.append(accuracy)

model_performance = {
    "Model": model_names,
    "Accuracy": accuracies
}
performance_df = pd.DataFrame(model_performance)
print(performance_df)

                      Model  Accuracy
0                       SVM  0.974026
1   Light Gradient Boosting  0.980519
2       Logistic Regression  0.811688
3         Gradient Boosting  0.983766
4             Decision Tree  0.941558
5  Random Forest Classifier  0.775974
6                 Cat Boost  0.983766
7                  XG Boost  0.977273
