## Obesity Level Estimation

Feature Name - Description 
NObesity - Target; 
FAVC - Frequent consumption of high caloric food; 
FCVC - Frequency of consumption of vegetables; 
NCP - Number of main meals ; 
CAEC - Consumption of food between meals ; 
CH20 - Consumption of water daily ; 
CALC - Consumption of alcohol ; 
SCC - Calories consumption monitoring ; 
FAF - Pysical activity frequency ; 
TUE - Time using technology devices;
MTRANS- Transportation used ; 
SMOKE - Smokes Yes or No; 
Family - History with Overweight;
Gender - Gender is Male or Female; 
Age - Age in years ; 
Height - Height in meters;
Weight - Weight in kilograms;

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import numpy as np
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report 
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

In [2]:
# Read file
df = pd.read_csv("ObesityDataSet.csv")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   int64  
 1   Age                             2111 non-null   int64  
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   int64  
 5   FAVC                            2111 non-null   int64  
 6   FCVC                            2111 non-null   int64  
 7   NCP                             2111 non-null   int64  
 8   CAEC                            2111 non-null   int64  
 9   SMOKE                           2111 non-null   int64  
 10  CH2O                            2111 non-null   int64  
 11  SCC                             2111 non-null   int64  
 12  FAF                             21

In [9]:
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,0,21,1.62,64.0,1,0,2,3,2,0,2,0,0,1,3,3,1
1,0,21,1.52,56.0,1,0,3,3,2,1,3,1,3,0,2,3,1
2,1,23,1.8,77.0,1,0,2,3,2,0,2,0,2,1,1,3,1
3,1,27,1.8,87.0,0,0,3,3,2,0,2,0,2,0,1,4,5
4,1,22,1.78,89.8,0,0,2,1,2,0,2,0,0,0,2,3,6


In [4]:
# split dataset in features and target variable
# Features
X = df.drop(columns=["NObeyesdad"])
#X=x.drop(columns=["family_history_with_overweight"])
# Target variable
y = df['NObeyesdad'] 

In [5]:
# import sklearn packages for data treatments
from sklearn.model_selection import train_test_split # Import train_test_split function

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test


In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.preprocessing import StandardScaler # Import for standard scaling of the data
from sklearn.preprocessing import MinMaxScaler # Import for standard scaling of the data
# standard scale data
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

# tested MinMaxScaler as KNN historically does better with MinMax
mm = MinMaxScaler()
X_train_mm_scaled = ss.fit_transform(X_train)
X_test_mm_scaled = ss.transform(X_test)

# program to run multilple models though sklearn 
# Default settings output accuracy and classification report
# compares accuracy for scaled and unscaled data
def run_models(X_train: pd.DataFrame , y_train: pd.DataFrame, X_test: pd.DataFrame, y_test: pd.DataFrame):
    models = [('naive_bayes', GaussianNB()),
        ('Logistic Regression', LogisticRegression()),
        ('KNN', KNeighborsClassifier()),
        ('Random Forest', RandomForestClassifier(random_state=2020)),
        ('SVM', SVC(C=1000, gamma=1, kernel='linear')),
        ('Decision Tree', DecisionTreeClassifier())
        ]  
    
    for name, model in models:        
        # unscaled data
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        
        # scaled data
        clf_scaled = model.fit(X_train_scaled, y_train)
        y_pred_scaled = clf_scaled.predict(X_test_scaled)
        
        # mm scaled data
        clf_mm_scaled = model.fit(X_train_mm_scaled, y_train)
        y_pred_mm_scaled = clf_scaled.predict(X_test_mm_scaled)
        
        # accuracy scores
        accuracy = round(metrics.accuracy_score(y_test, y_pred),5)
        scaled_accuracy = round(metrics.accuracy_score(y_test, y_pred_scaled),5)
        scaled_mm_accuracy = round(metrics.accuracy_score(y_test, y_pred_mm_scaled),5)
        
        # output
        print(name + ':')        
        print("---------------------------------------------------------------")      
        print("Accuracy:", accuracy)
        print("Accuracy w/Scaled Data (ss):", scaled_accuracy)
        print("Accuracy w/Scaled Data (mm):", scaled_mm_accuracy)
        if (accuracy > scaled_accuracy) and (accuracy > scaled_mm_accuracy):
            print("\nClassification Report:\n", metrics.classification_report(y_test, y_pred))      
            print("--------------------------------------------------------------- \n")      
        elif (scaled_accuracy > scaled_mm_accuracy):
            print("\nClassification Report (ss):\n", metrics.classification_report(y_test, y_pred_scaled))      
            print("---------------------------------------------------------------\n")     
        else:            
            print("\nClassification Report (mm):\n", metrics.classification_report(y_test, y_pred_mm_scaled))      
            print("--------------------------------------------------------------- \n")
        

In [7]:
run_models(X_train, y_train, X_test, y_test)

naive_bayes:
---------------------------------------------------------------
Accuracy: 0.61356
Accuracy w/Scaled Data (ss): 0.61356
Accuracy w/Scaled Data (mm): 0.61356

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.72      0.90      0.80        92
           1       0.46      0.29      0.35        77
           2       0.38      0.60      0.46       114
           3       0.70      0.98      0.81        85
           4       0.98      0.98      0.98        92
           5       0.54      0.28      0.37        89
           6       0.53      0.21      0.30        85

    accuracy                           0.61       634
   macro avg       0.61      0.60      0.58       634
weighted avg       0.61      0.61      0.59       634

--------------------------------------------------------------- 



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist

Logistic Regression:
---------------------------------------------------------------
Accuracy: 0.64196
Accuracy w/Scaled Data (ss): 0.85489
Accuracy w/Scaled Data (mm): 0.85489

Classification Report (mm):
               precision    recall  f1-score   support

           0       0.87      0.97      0.92        92
           1       0.83      0.69      0.75        77
           2       0.87      0.87      0.87       114
           3       0.91      0.94      0.92        85
           4       0.98      0.99      0.98        92
           5       0.79      0.72      0.75        89
           6       0.72      0.78      0.75        85

    accuracy                           0.85       634
   macro avg       0.85      0.85      0.85       634
weighted avg       0.85      0.85      0.85       634

--------------------------------------------------------------- 

KNN:
---------------------------------------------------------------
Accuracy: 0.86751
Accuracy w/Scaled Data (ss): 0.78391
Accura