<h1>Part 1 - Preprocessing</h1><br>(This was completed in the last deliverable; skip to Part 2.)

This notebook is a quick preparation for modeling. It replaces the categorical variable with dummy fields, splits the dataframe into training and testing sets, and standardizes the features using StandardScaler().

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import os.path

import warnings
warnings.filterwarnings('ignore')

Upon inspection of my dataframe, I found that some index columns had been inserted as the data passed between the notebooks I have been using. This cell removes those unnecessary index columns, along with the Seq_Number id field that I used to merge tables together earlier in the project, as it is no longer needed.

In [2]:
df = pd.read_csv('my_data/prep_preprocess.csv')
df.rename(columns={"Unnamed: 0":"delete"},inplace=True)
df = df.drop(columns=["delete","Seq_Number"])

The only categorical field in the dataframe is the age_grp column that I inserted in the last notebook to differentiate between rows that represent children (<=18) and adults. This cell replaces that field with binary dummy fields.

In [3]:
age_grp_dummy = pd.get_dummies(df['age_grp'])
df = pd.concat([df,age_grp_dummy],axis=1).drop(columns=["age_grp"])
df.head()

Unnamed: 0,Gender,Age_yr,#_diff_foods,tot_calories,total_protein,total_carb,total_sugar,total_fiber,total_fat,avg_visc_fat,bmi,waist,weight,ave_BP,adult,child
0,1,69,11.0,1574.0,43.63,239.59,176.47,10.8,52.81,20.6,26.7,100.0,78.3,112.666667,1,0
1,1,54,8.0,5062.0,338.13,423.78,44.99,16.7,124.29,24.4,28.6,107.6,89.5,157.333333,1,0
2,1,72,27.0,1743.0,64.61,224.39,102.9,9.9,65.97,25.6,28.9,109.2,88.9,142.0,1,0
3,1,9,19.0,1490.0,77.75,162.92,80.58,10.6,58.27,14.9,17.1,61.0,32.2,104.666667,0,1
4,2,73,7.0,1421.0,55.24,178.2,87.78,12.3,55.36,20.8,19.7,88.6,52.0,137.333333,1,0
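
As a small illustration of the encoding step, here is a sketch on a toy frame. The Age_yr values are made up, and the rule used to build age_grp (18 or younger is a child) is my reading of the description above rather than the exact code from the previous notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; Age_yr matches the notebook's column name, the values are made up.
toy = pd.DataFrame({"Age_yr": [69, 54, 9, 18, 73]})

# Assumed rule from the text: 18 or younger is a child, otherwise an adult.
toy["age_grp"] = np.where(toy["Age_yr"] <= 18, "child", "adult")

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(toy["age_grp"], dtype=int)
toy = pd.concat([toy, dummies], axis=1).drop(columns=["age_grp"])
print(toy)
```

Each row ends up with exactly one of adult or child set to 1, so the two columns carry the same information and one of them could be dropped if a model is sensitive to redundant features.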


In [4]:
# Binary target: 1 when bmi is 26 or more, otherwise 0
df['obese'] = 0
df.loc[df['bmi'] >= 26, 'obese'] = 1

Next, we split the data into training and testing sets, then standardize the features using statistics learned from the training set only.

In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["obese"])
y = df.obese == 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
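
To make the fit/transform distinction concrete, here is a minimal sketch on made-up data (the array names and sizes are hypothetical). It shows that the training set ends up with mean 0 and standard deviation 1 per feature, while the test set is shifted and scaled by the training statistics rather than its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: 100 training rows and 25 test rows with 3 features.
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(25, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from the training rows
test_scaled = scaler_demo.transform(X_test_demo)        # reuses those training statistics

# Training columns are centered and scaled exactly; test columns only approximately.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.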

That concludes the preprocessing. Next, we will model our standardized data and see what insights we can come up with!

<h1>Part 2 - Modeling</h1>

I originally thought that this was a regression problem, but I have added a binary field entitled obese, where individuals are given a value of 1 if they have a bmi of 26 or more, as this would imply that they are overweight. With this in mind, this becomes a classification problem, and we can employ classification models.
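
Because obese is computed directly from bmi, it may be worth checking whether bmi should stay in the feature matrix. The sketch below uses a toy frame with made-up values (only the column names come from this notebook), and X_no_bmi is a hypothetical alternative feature set, not what the cells above use.

```python
import pandas as pd

# Toy rows; column names follow the notebook, values are made up.
toy = pd.DataFrame({
    "bmi": [26.7, 28.6, 17.1, 19.7, 26.0],
    "tot_calories": [1574.0, 5062.0, 1490.0, 1421.0, 1800.0],
})
toy["obese"] = (toy["bmi"] >= 26).astype(int)

# A one-line rule on bmi reproduces the label exactly,
# so any model that sees bmi can learn the same rule.
rule_matches_label = ((toy["bmi"] >= 26).astype(int) == toy["obese"]).all()
print(rule_matches_label)

# Hypothetical alternative: leave out the column the label was built from.
X_no_bmi = toy.drop(columns=["obese", "bmi"])
y = toy["obese"]
print(list(X_no_bmi.columns))
```

Whether to drop bmi depends on the question being asked: predicting the label from diet and body measurements alone is a different task from recovering a threshold on bmi.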

In [7]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score,confusion_matrix

In [8]:
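def display_score_results(model):
    """Print 5-fold cross-validated ROC AUC, computed separately on the test and training sets."""
    # cross_val_score refits clones of the model on each fold; the fitted model passed in is not reused.
    cv_scores_test = cross_val_score(model, X_test, y_test, cv=5, scoring='roc_auc')
    cv_scores_train = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    print("5-fold CV Scores: ", cv_scores_test)
    print('Mean testing score: ', cv_scores_test.mean())
    print('Mean training score: ', cv_scores_train.mean())
    print('Standard deviation: ', cv_scores_test.std())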
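
The helper above cross-validates within the test set and within the training set separately. A common alternative, sketched here on synthetic data, is to cross-validate on the training rows only and keep the test rows for a single final check. The data, the n_neighbors value, and all variable names below are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the notebook's data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=15)

# 5-fold CV on the training rows only; return_train_score adds the in-fold training AUC.
cv = cross_validate(model, X_tr, y_tr, cv=5, scoring="roc_auc", return_train_score=True)
print("Mean CV validation AUC:", cv["test_score"].mean())
print("Mean CV training AUC:  ", cv["train_score"].mean())

# Single final check on the held-out rows, which played no part in the CV above.
model.fit(X_tr, y_tr)
holdout_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print("Held-out test AUC:     ", holdout_auc)
```

A large gap between the training and validation AUC would suggest overfitting, and the held-out score gives one estimate that model selection never touched.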

<h2>K-Nearest Neighbor</h2>

In [9]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(p=2,weights='distance',n_neighbors=50)
knn.fit(X_train,y_train)

y_predict_knn=knn.predict(X_test)

print(confusion_matrix(y_test, y_predict_knn))
print(knn.score(X_test,y_test))

[[901  54]
 [ 56 696]]
0.9355594610427651


<h2>Support Vector Machine</h2>

In [10]:
from sklearn.svm import SVC

svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

y_predict_svm=svm.predict(X_test)

print(confusion_matrix(y_test, y_predict_svm))
print(svm.score(X_test,y_test))

[[955   0]
 [  8 744]]
0.9953134153485648


<h2>Random Forest</h2>

In [11]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(bootstrap=True,n_estimators=100,criterion='entropy')
rf.fit(X_train, y_train)

y_predict_rf = rf.predict(X_test)

print(confusion_matrix(y_test, y_predict_rf))
print(rf.score(X_test,y_test))

[[955   0]
 [  0 752]]
1.0


<h2>Gradient Boosting</h2>

In [12]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(subsample=0.8, learning_rate=0.05 , n_estimators=160, random_state=5, max_depth=9, max_leaf_nodes=100)
gbc.fit(X_train, y_train)

y_predict_gbc = gbc.predict(X_test)

print(confusion_matrix(y_test, y_predict_gbc))
print(gbc.score(X_test,y_test))

[[955   0]
 [  0 752]]
1.0


<h2>Naive Bayes</h2>

In [13]:
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train,y_train)

y_predict_nb=nb.predict(X_test)

print(confusion_matrix(y_test, y_predict_nb))
print(nb.score(X_test,y_test))

[[826 129]
 [ 54 698]]
0.8927943760984183


In [14]:
for model in [knn,svm,rf,gbc,nb]:
    print(model,":")
    display_score_results(model)
    print('------')

KNeighborsClassifier(n_neighbors=50, weights='distance') :
5-fold CV Scores:  [0.9809646  0.97468881 0.9809075  0.97933682 0.9599651 ]
Mean testing score:  0.9751725668319405
Mean training score:  0.9824374834655292
Standard deviation:  0.007940772883526574
------
SVC(kernel='linear') :
5-fold CV Scores:  [1.         0.99930654 0.99982548 0.99958115 0.99902269]
Mean testing score:  0.9995471724281405
Mean training score:  0.9999226646883074
Standard deviation:  0.0003511835513666919
------
RandomForestClassifier(criterion='entropy') :
5-fold CV Scores:  [1. 1. 1. 1. 1.]
Mean testing score:  1.0
Mean training score:  1.0
Standard deviation:  0.0
------
GradientBoostingClassifier(learning_rate=0.05, max_depth=9, max_leaf_nodes=100,
                           n_estimators=160, random_state=5, subsample=0.8) :
5-fold CV Scores:  [1. 1. 1. 1. 1.]
Mean testing score:  1.0
Mean training score:  1.0
Standard deviation:  0.0
------
GaussianNB() :
5-fold CV Scores:  [0.97052807 0.9615478  0.9670

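It appears that RandomForest and GradientBoosting performed the best, each with perfect accuracy on the test set and perfect cross-validated ROC AUC scores. Note that bmi, the field the obese label is derived from, is still among the features, which likely explains how the tree-based models reach perfect scores. Either model would be an appropriate selection, but we will go with RandomForest as it requires less computing cost.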
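
One rough way to compare computing cost is to time the fit of each model. The sketch below does this on synthetic data, so the timings will not match this project's data or any particular machine; the hyperparameters mirror the ones used above, except that random_state=0 on the random forest is my addition for repeatability.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in for the notebook's data.
X_demo, y_demo = make_classification(n_samples=1000, n_features=15, random_state=0)

candidates = {
    "rf": RandomForestClassifier(bootstrap=True, n_estimators=100,
                                 criterion="entropy", random_state=0),
    "gbc": GradientBoostingClassifier(subsample=0.8, learning_rate=0.05,
                                      n_estimators=160, random_state=5,
                                      max_depth=9, max_leaf_nodes=100),
}

# Wall-clock fit time per model, in seconds.
fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start
    print(f"{name}: {fit_seconds[name]:.2f} s")
```

Random forest trees are independent of one another, so passing n_jobs=-1 lets them be built in parallel, whereas gradient boosting builds its trees one after another.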