# Table of Contents
   1. Business Statement
      1.1 About Data Set
   2. Introduction
   3. Importing Dependencies
   4. Reading Data Set
   5. Machine Learning Models
   6. Summary

# 1. Business Statement 

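## 1.1 About Data Set
Dataset Shape: 253680 rows x 22 columns

Column Types:
1. Numeric (stored as float64): 22
   - Explanatory Variables: 21 - float64
   - Target Variable: 1 - float64 (`Diabetes_012`, with values 0, 1, and 2)

2. Categorical (object dtype): 0

Although every column is stored as float64, most explanatory variables are 0/1 flags or ordinal codes; only a few, such as BMI, are measured on a continuous scale.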
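
A dtype of float64 does not by itself say whether a column is continuous. One way to check is to count the distinct values per column. The sketch below does this on a tiny made-up frame; the column names are borrowed from the data set, but the values and the `summarize_columns` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up rows that mimic the float64 encoding; not taken from the real file.
toy = pd.DataFrame({
    "HighBP":  [0.0, 1.0, 1.0, 0.0, 1.0],
    "GenHlth": [1.0, 3.0, 5.0, 2.0, 4.0],
    "BMI":     [24.0, 31.5, 28.0, 40.0, 22.0],
})

def summarize_columns(frame):
    """Return dtype, distinct-value count, and a 0/1-only flag per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),
        "is_binary": frame.apply(lambda s: set(s.dropna().unique()) <= {0.0, 1.0}),
    })

column_summary = summarize_columns(toy)
print(column_summary)
```

Calling `summarize_columns(df)` on the loaded data would give the same three columns for all 22 variables.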


# 2. Introduction
Diabetes is among the most prevalent chronic diseases in the United States, affecting millions of Americans each year and placing a significant financial burden on the economy. It is a serious chronic disease in which individuals lose the ability to effectively regulate blood glucose levels, and it can reduce both quality of life and life expectancy. During digestion, foods are broken down into sugars, which are released into the bloodstream. This signals the pancreas to release insulin, which enables cells in the body to use those sugars for energy. Diabetes is generally characterized by the body either not making enough insulin or being unable to use the insulin it makes as effectively as needed.

Complications such as heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high blood sugar levels in people with diabetes. While there is no cure for diabetes, strategies such as losing weight, eating healthily, being active, and receiving medical treatment can mitigate the harms of the disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and for public health officials.

The scale of the problem is also important to recognize. The Centers for Disease Control and Prevention (CDC) has indicated that as of 2018, 34.2 million Americans have diabetes and 88 million have prediabetes. Furthermore, the CDC estimates that 1 in 5 people with diabetes and roughly 8 in 10 people with prediabetes are unaware of their risk. While there are different types of diabetes, type 2 diabetes is the most common form, and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of the disease also falls on those of lower socioeconomic status. Diabetes also places a massive burden on the economy, with diagnosed diabetes costing roughly $327 billion and total costs, including undiagnosed diabetes and prediabetes, approaching $400 billion annually.

# 3. Importing Dependencies

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random as rnd
import warnings
warnings.filterwarnings("ignore")

# 4. Reading Data Set

In [2]:
df = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')

In [3]:
df.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [4]:
df.shape

(253680, 22)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [6]:
# Count rows that exactly duplicate an earlier row (all 22 columns equal)
df.duplicated().sum()

23899
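
The count above covers rows that match an earlier row in every column, target included. The notebook keeps these rows for modeling. Because most fields are coarse codes, identical rows may well be distinct survey respondents, so dropping them is a judgement call. A minimal sketch of how the rows could be inspected and removed, on a made-up four-row frame (the values are illustrative, not real data):

```python
import pandas as pd

# Made-up frame: row 2 repeats row 0 in every column.
toy = pd.DataFrame({
    "HighBP":       [1.0, 0.0, 1.0, 1.0],
    "BMI":          [28.0, 25.0, 28.0, 30.0],
    "Diabetes_012": [0.0, 0.0, 0.0, 2.0],
})

# duplicated() flags every repeat of an earlier row (the first copy is not flagged).
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first copy; reset_index gives a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```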

In [7]:
df.describe()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.296921,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.69816,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,2.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0


In [8]:
df.isnull().sum()

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [9]:
df.isnull().sum()/df.shape[0]

Diabetes_012            0.0
HighBP                  0.0
HighChol                0.0
CholCheck               0.0
BMI                     0.0
Smoker                  0.0
Stroke                  0.0
HeartDiseaseorAttack    0.0
PhysActivity            0.0
Fruits                  0.0
Veggies                 0.0
HvyAlcoholConsump       0.0
AnyHealthcare           0.0
NoDocbcCost             0.0
GenHlth                 0.0
MentHlth                0.0
PhysHlth                0.0
DiffWalk                0.0
Sex                     0.0
Age                     0.0
Education               0.0
Income                  0.0
dtype: float64

In [10]:
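# Numeric columns: impute missing values with each column's median
df_num_col = df.select_dtypes(include=['int64', 'float64']).columns
for c in df_num_col:
    # Assign the result back; fillna(inplace=True) on a column selection is
    # deprecated in recent pandas and does not update df under copy-on-write
    df[c] = df[c].fillna(df[c].median())

df.select_dtypes(include=['int64', 'float64']).isna().sum()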

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [11]:
# Categorical (object) columns: impute missing values with each column's mode.
# This data set has no object columns, so the loop body never runs.
df_cat_col = df.select_dtypes(include='object').columns
for c in df_cat_col:
    df[c] = df[c].fillna(df[c].mode().values[0])

df.select_dtypes(include='object').isna().sum()

Series([], dtype: float64)

In [12]:
df.isnull().sum()

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64

In [13]:
df['Diabetes_012'].value_counts() 

0.0    213703
2.0     35346
1.0      4631
Name: Diabetes_012, dtype: int64
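
The counts above show a strongly imbalanced target. A short sketch, using only the three counts printed above, turns them into class shares and into the accuracy that a model predicting class 0 for every row would reach on the full data set. The variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0.0: 213703, 2.0: 35346, 1.0: 4631}, name="Diabetes_012")

# Share of each class in the full data set.
class_share = class_counts / class_counts.sum()

# Accuracy of always predicting the most frequent class.
majority_baseline = class_share.max()

print(class_share.round(4))
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported later is easier to judge against this baseline of roughly 0.84.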

In [14]:
X = df.drop(['Diabetes_012'], axis=1)
y = df['Diabetes_012']

In [15]:
from sklearn.model_selection import train_test_split as tts

X_train, X_test, y_train, y_test = tts(X, y, test_size = .3, random_state = 2022)
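
The split above is purely random. With a minority class as small as class 1, one option (not used in this notebook) is to pass `stratify=y` so that both splits keep the same class proportions. The sketch below shows the idea on synthetic data; `X_toy`, `y_toy`, and the class probabilities are assumptions chosen to roughly mimic the imbalance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a similar class imbalance (not the real file).
rng = np.random.default_rng(0)
n_rows = 1000
X_toy = pd.DataFrame({"BMI": rng.normal(28, 6, size=n_rows)})
y_toy = pd.Series(
    rng.choice([0.0, 1.0, 2.0], size=n_rows, p=[0.84, 0.02, 0.14]),
    name="Diabetes_012",
)

# stratify=y_toy keeps the class proportions equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2022, stratify=y_toy
)

# Compare class shares side by side.
class_shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
}).sort_index()
print(class_shares.round(3))
```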

In [16]:
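# Split X_train into two parts, categorical (non-numeric) and continuous (numeric); same for X_test.
# Every column here is numeric, so the categorical frames have rows but no columns.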
X_train_cat = X_train.select_dtypes(exclude='number')
X_train_cat

194089
7084
58863
111322
110
...
177841
103024
147629
1244
86909


In [17]:
X_test_cat = X_test.select_dtypes(exclude='number')
X_test_cat

157077
11076
10598
193439
211480
...
64398
195991
231267
136860
162945


In [18]:
X_train_cont = X_train.select_dtypes(include='number')
X_train_cont

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
194089,0.0,0.0,1.0,35.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,1.0,10.0,6.0,6.0
7084,0.0,1.0,1.0,28.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,3.0,0.0,4.0,1.0,1.0,9.0,5.0,3.0
58863,0.0,0.0,1.0,43.0,0.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,4.0,3.0,1.0,0.0,0.0,5.0,3.0,5.0
111322,1.0,1.0,1.0,36.0,1.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,4.0,0.0,1.0,0.0,0.0,12.0,4.0,4.0
110,0.0,0.0,1.0,25.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,1.0,2.0,30.0,19.0,0.0,0.0,4.0,5.0,6.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177841,1.0,0.0,1.0,34.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,2.0,1.0,0.0,0.0,1.0,7.0,5.0,4.0
103024,0.0,0.0,1.0,31.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,0.0,10.0,4.0,6.0
147629,0.0,0.0,1.0,31.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,5.0,5.0,5.0
1244,0.0,0.0,1.0,25.0,1.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,5.0,30.0,22.0,1.0,0.0,9.0,5.0,8.0


In [19]:
X_test_cont = X_test.select_dtypes(include='number')
X_test_cont

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
157077,1.0,1.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,4.0,5.0,2.0,1.0,0.0,9.0,4.0,1.0
11076,0.0,0.0,1.0,28.0,0.0,0.0,1.0,0.0,1.0,0.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,11.0,6.0,4.0
10598,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,2.0,5.0,8.0
193439,1.0,1.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,4.0
211480,1.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64398,1.0,1.0,1.0,27.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,0.0,8.0,0.0,0.0,13.0,4.0,3.0
195991,0.0,0.0,1.0,24.0,1.0,0.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,2.0,0.0,0.0,1.0,5.0,6.0,7.0
231267,0.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,1.0,10.0,6.0,8.0
136860,0.0,1.0,1.0,37.0,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,1.0,0.0,0.0,9.0,5.0,8.0


In [20]:
from sklearn.preprocessing import OneHotEncoder

# Encoding on X_train_cat and X_test_cat
ohe = OneHotEncoder(drop = 'first')

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
X_train_cat_encoded = pd.DataFrame(ohe.fit_transform(X_train_cat).toarray(), columns = ohe.get_feature_names_out(list(X_train_cat.columns)))
X_test_cat_encoded = pd.DataFrame(ohe.transform(X_test_cat).toarray(), columns = ohe.get_feature_names_out(list(X_test_cat.columns)))

In [21]:
from sklearn.preprocessing import StandardScaler

# Scaling on X_train_cont and X_test_cont
sc = StandardScaler()

X_train_cont_scaled = pd.DataFrame(sc.fit_transform(X_train_cont), columns = X_train_cont.columns)
X_test_cont_scaled = pd.DataFrame(sc.transform(X_test_cont), columns = X_test_cont.columns)

In [22]:
X_train = pd.concat([X_train_cat_encoded, X_train_cont_scaled], axis = 1)
X_test = pd.concat([X_test_cat_encoded, X_test_cont_scaled], axis = 1)
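
The manual steps above (split by dtype, encode, scale, concatenate) could also be written as one scikit-learn pipeline, which fits the preprocessing on the training data only and reapplies it at prediction time. This is an alternative sketch, not the notebook's method. The data is synthetic, and the `Region` column is a hypothetical categorical feature added only so that both branches run.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; "Region" is hypothetical and not in the real data set.
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, size=n),
    "Age": rng.integers(1, 14, size=n).astype(float),
    "Region": rng.choice(["north", "south", "west"], size=n),
})
y_toy = (X_toy["BMI"] > 30).astype(float)

# One-hot encode non-numeric columns, standardize numeric ones.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), make_column_selector(dtype_exclude="number")),
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
])

model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(random_state=10)),
])
model.fit(X_toy, y_toy)

# 2 one-hot columns (3 categories, first dropped) + 2 scaled numeric columns.
X_prepared = model.named_steps["prep"].transform(X_toy)
print(X_prepared.shape)
```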

In [23]:
colors= ["YlGnBu", "Blues", "BuPu", "Greens", "PiYG"]
rnd.choice(colors)

'PiYG'

# 5. Machine Learning Models

1. Logistic Regression
2. Decision Tree
3. Gaussian Naive Bayes
4. Random Forest
5. K-Nearest Neighbors
6. Support Vector Machine
7. Stochastic Gradient Descent
8. AdaBoost
9. Gradient Boosting

Metrics used: Accuracy, Precision, Recall, and F1 Score (precision, recall, and F1 are computed with `average = 'weighted'`)

In [24]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
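
Each model below is scored with the same eight calls (four metrics on the train split, four on the test split). A small helper could collect them in one place. The `evaluate` function is a hypothetical convenience, not part of the original notebook, and the demo uses a `DummyClassifier` on made-up labels so the block runs on its own.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(model, X_train, y_train, X_test, y_test, average="weighted"):
    """Return accuracy, precision, recall, and F1 for the train and test splits."""
    scores = {}
    for split, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
        pred = model.predict(X)
        scores[split] = {
            "accuracy": accuracy_score(y, pred),
            # zero_division=0 scores a class as 0 when the model never predicts it
            "precision": precision_score(y, pred, average=average, zero_division=0),
            "recall": recall_score(y, pred, average=average, zero_division=0),
            "f1": f1_score(y, pred, average=average, zero_division=0),
        }
    return scores

# Made-up labels; the dummy model always predicts the most frequent class (0).
X_tr, y_tr = np.zeros((8, 1)), np.array([0, 0, 0, 0, 2, 0, 1, 0])
X_te, y_te = np.zeros((4, 1)), np.array([0, 2, 0, 0])
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

scores = evaluate(baseline, X_tr, y_tr, X_te, y_te)
print(scores["test"])
```

With the notebook's variables, the call would be `evaluate(log_reg, X_train, y_train, X_test, y_test)` after fitting a model.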


# Logistic Regression

In [25]:
from sklearn.linear_model import LogisticRegression

#Created an obj for logistic regression 
log_reg = LogisticRegression(random_state = 10)
log_reg.fit(X_train, y_train)

In [26]:
#Training Accuracy
log_reg.score(X_train, y_train)

0.8464713700049556

In [27]:
#Testing Accuracy
y_pred = log_reg.predict(X_test)
accuracy_score(y_test, y_pred) 

0.8463681278250814

In [28]:
y_prad = log_reg.predict(X_train)

In [29]:
precision_score(y_train, y_prad, average = 'weighted')

0.7996629851537518

In [30]:
precision_score(y_test, y_pred, average = 'weighted')

0.7996234709085079

In [31]:
recall_score(y_train, y_prad, average = 'weighted')

0.8464713700049556

In [32]:
recall_score(y_test, y_pred, average = 'weighted')

0.8463681278250814
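
The weighted recall above is identical to the test accuracy. That is expected: weighting each class's recall by its share of the true labels and summing gives the overall fraction of correct predictions. Per-class or macro-averaged recall shows how the minority classes fare. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 8 samples of class 0, 2 samples of class 2.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 2, 2])
y_hat = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 2])

acc = accuracy_score(y_true, y_hat)                             # 9 of 10 correct
weighted_recall = recall_score(y_true, y_hat, average="weighted")
macro_recall = recall_score(y_true, y_hat, average="macro")     # plain mean over classes
per_class_recall = recall_score(y_true, y_hat, average=None)    # one value per class

print(acc, weighted_recall, macro_recall, per_class_recall)
```

Here class 0 has recall 1.0 and class 2 has recall 0.5, so the weighted value is 0.8 * 1.0 + 0.2 * 0.5 = 0.9 (equal to accuracy), while the macro value is 0.75.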

In [33]:
f1_score(y_train, y_prad, average = 'weighted')

0.8080031362510021

In [34]:
f1_score(y_test, y_pred, average = 'weighted')

0.8085056869271932

# Decision Tree

In [35]:
from sklearn.tree import DecisionTreeClassifier

#Created an obj for Decision Tree 
dec_tree = DecisionTreeClassifier(max_depth = 25, min_samples_split = 5, random_state = 10)
dec_tree.fit(X_train, y_train)

In [36]:
#Training Accuracy
dec_tree.score(X_train, y_train)

0.9404311393431545

In [37]:
#Testing Accuracy
y_pred = dec_tree.predict(X_test)
accuracy_score(y_test, y_pred)

0.792008304425523

In [38]:
y_prad = dec_tree.predict(X_train)

In [39]:
precision_score(y_train, y_prad, average = 'weighted')

0.9377982189218823

In [40]:
precision_score(y_test, y_pred, average = 'weighted')

0.7796114579615032

In [41]:
recall_score(y_train, y_prad, average = 'weighted')

0.9404311393431545

In [42]:
recall_score(y_test, y_pred, average = 'weighted')

0.792008304425523

In [43]:
f1_score(y_train, y_prad, average = 'weighted')

0.9364688487434838

In [44]:
f1_score(y_test, y_pred, average = 'weighted')

0.7855410568170653
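
The decision tree scores about 0.94 on the training data but about 0.79 on the test data, a gap that suggests overfitting at `max_depth = 25`. One way to explore this is to compare train and test accuracy at several depths. The sketch below does so on synthetic data from `make_classification`; the data, the depths tried, and the variable names are assumptions for illustration, and the numbers it prints say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with some label noise (flip_y), standing in for the real set.
X_toy, y_toy = make_classification(
    n_samples=3000, n_features=20, n_informative=6,
    n_classes=3, flip_y=0.1, random_state=10,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=2022
)

# Fit one tree per depth and record (train accuracy, test accuracy).
depth_scores = {}
for depth in (3, 10, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=10).fit(X_tr, y_tr)
    depth_scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```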

# Gaussian Naive Bayes

In [45]:
from sklearn.naive_bayes import GaussianNB

#Created an obj for Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)

In [46]:
#Training Accuracy
gnb.score(X_train, y_train)

0.7572644952020543

In [47]:
#Testing Accuracy
y_pred = gnb.predict(X_test)
accuracy_score(y_test, y_pred)

0.7564648375906654

In [48]:
y_prad = gnb.predict(X_train)

In [49]:
precision_score(y_train, y_prad, average = 'weighted')

0.8110897987470048

In [50]:
precision_score(y_test, y_pred, average = 'weighted')

0.8105697563002686

In [51]:
recall_score(y_train, y_prad, average = 'weighted')

0.7572644952020543

In [52]:
recall_score(y_test, y_pred, average = 'weighted')

0.7564648375906654

In [53]:
f1_score(y_train, y_prad, average = 'weighted')

0.7766697818729744

In [54]:
f1_score(y_test, y_pred, average = 'weighted')

0.7759471490893615

# Random Forest

In [55]:
from sklearn.ensemble import RandomForestClassifier

#Created an obj for Random Forest
rand_for = RandomForestClassifier(n_jobs = None, max_depth = 25, min_samples_split = 5, random_state = 10, n_estimators= 200)
rand_for.fit(X_train, y_train)

In [56]:
#Training Accuracy
rand_for.score(X_train, y_train)

0.9349855836374285

In [57]:
#Testing Accuracy
y_pred = rand_for.predict(X_test)
accuracy_score(y_test, y_pred) 

0.8488121517922842

In [58]:
y_prad = rand_for.predict(X_train)

In [59]:
precision_score(y_train, y_prad, average = 'weighted')

0.93867299670725

In [60]:
precision_score(y_test, y_pred, average = 'weighted')

0.8038912693366711

In [61]:
recall_score(y_train, y_prad, average = 'weighted')

0.9349855836374285

In [62]:
recall_score(y_test, y_pred, average = 'weighted')

0.8488121517922842

In [63]:
f1_score(y_train, y_prad, average = 'weighted')

0.9246085491304998

In [64]:
f1_score(y_test, y_pred, average = 'weighted')

0.8108762788200466

# K-Nearest Neighbors

In [65]:
from sklearn.neighbors import KNeighborsClassifier

#Created an obj for K - Nearest Neighbors
knc = KNeighborsClassifier(n_jobs = -1, n_neighbors = 6)
knc.fit(X_train, y_train)

In [66]:
#Training Accuracy
knc.score(X_train, y_train)

0.8632022345361986

In [67]:
#Testing Accuracy
y_pred = knc.predict(X_test)
accuracy_score(y_test, y_pred)

0.8372096079049721

In [68]:
y_prad = knc.predict(X_train)

In [69]:
precision_score(y_train, y_prad, average = 'weighted')

0.8403420145904319

In [70]:
precision_score(y_test, y_pred, average = 'weighted')

0.7844287657046991

In [71]:
recall_score(y_train, y_prad, average = 'weighted')

0.8632022345361986

In [72]:
recall_score(y_test, y_pred, average = 'weighted')

0.8372096079049721

In [73]:
f1_score(y_train, y_prad, average = 'weighted')

0.8291771551787935

In [74]:
f1_score(y_test, y_pred, average = 'weighted')

0.7983328984755222

# Support Vector Machine

In [75]:
from sklearn.svm import SVC

#Created an obj for SVM
svm = SVC(C = 0.25, kernel = 'linear', random_state = 10)
svm.fit(X_train, y_train)
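
Fitting `SVC` with a linear kernel on a training set of this size (roughly 177,000 rows) can be slow, because its fit time grows at least quadratically with the number of samples. `LinearSVC` is a faster alternative for linear decision boundaries. It is a different model, not a drop-in equivalent: it uses a squared hinge loss and a one-vs-rest scheme by default. A sketch on synthetic data, with the data and `max_iter` value as assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the scaled feature matrix.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    n_classes=3, random_state=10,
)

# Same C as the SVC cell above; max_iter is raised to help convergence.
linear_svm = LinearSVC(C=0.25, random_state=10, max_iter=5000)
linear_svm.fit(X_toy, y_toy)

train_accuracy = linear_svm.score(X_toy, y_toy)
print(f"training accuracy: {train_accuracy:.3f}")
```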

In [76]:
#Training Accuracy
svm.score(X_train, y_train)

0.8422759832409785

In [77]:
#Testing Accuracy
y_pred = svm.predict(X_test)
accuracy_score(y_test, y_pred)

0.8427283717018816

In [78]:
y_prad = svm.predict(X_train)

In [79]:
precision_score(y_train, y_prad, average = 'weighted')

0.7094288319445572

In [80]:
precision_score(y_test, y_pred, average = 'weighted')

0.7101911084713047

In [81]:
recall_score(y_train, y_prad, average = 'weighted')

0.8422759832409785

In [82]:
recall_score(y_test, y_pred, average = 'weighted')

0.8427283717018816

In [83]:
f1_score(y_train, y_prad, average = 'weighted')

0.7701656411940104

In [84]:
f1_score(y_test, y_pred, average = 'weighted')

0.7708039007565681

# Stochastic Gradient Descent

In [85]:
from sklearn.linear_model import SGDClassifier

#Created an obj for SGD
sgd = SGDClassifier(loss = 'modified_huber', max_iter = 2000, shuffle = False, n_jobs = None, early_stopping = False, random_state = 10)
sgd.fit(X_train, y_train)

In [86]:
#Training Accuracy
sgd.score(X_train, y_train)

0.8439147182051628

In [87]:
#Testing Accuracy
y_pred = sgd.predict(X_test)
accuracy_score(y_test, y_pred)

0.8447124986860086

In [88]:
y_prad = sgd.predict(X_train)

In [89]:
precision_score(y_train, y_prad, average = 'weighted')

0.7894445356202019

In [90]:
precision_score(y_test, y_pred, average = 'weighted')

0.7914340076870741

In [91]:
recall_score(y_train, y_prad, average = 'weighted')

0.8439147182051628

In [92]:
recall_score(y_test, y_pred, average = 'weighted')

0.8447124986860086

In [93]:
f1_score(y_train, y_prad, average = 'weighted')

0.7887278467728388

In [94]:
f1_score(y_test, y_pred, average = 'weighted')

0.7907177162072655

# AdaBoost

In [95]:
from sklearn.ensemble import AdaBoostClassifier

#Created an obj for Ada Boost
adab = AdaBoostClassifier(n_estimators = 200, learning_rate = 0.1 , random_state = 10)
adab.fit(X_train, y_train)

In [96]:
#Training Accuracy
adab.score(X_train, y_train)

0.848661981348831

In [97]:
#Testing Accuracy
y_pred = adab.predict(X_test)
accuracy_score(y_test, y_pred)

0.8494560075685903

In [98]:
y_prad = adab.predict(X_train)

In [99]:
precision_score(y_train, y_prad, average = 'weighted')

0.8030869388004176

In [100]:
precision_score(y_test, y_pred, average = 'weighted')

0.8046543248789475

In [101]:
recall_score(y_train, y_prad, average = 'weighted')

0.848661981348831

In [102]:
recall_score(y_test, y_pred, average = 'weighted')

0.8494560075685903

In [103]:
f1_score(y_train, y_prad, average = 'weighted')

0.8071010278570279

In [104]:
f1_score(y_test, y_pred, average = 'weighted')

0.8088870029826761

# Gradient Boosting

In [105]:
from sklearn.ensemble import GradientBoostingClassifier

#Created an obj for Gradient Boosting
gb = GradientBoostingClassifier( n_estimators = 200, learning_rate = 0.1 , random_state = 10)
gb.fit(X_train, y_train)

In [106]:
#Training Accuracy
gb.score(X_train, y_train)

0.8513425237644727

In [107]:
#Testing Accuracy
y_pred = gb.predict(X_test)
accuracy_score(y_test, y_pred)

0.8497188058446337

In [108]:
y_prad = gb.predict(X_train)

In [109]:
precision_score(y_train, y_prad, average = 'weighted')

0.8273813606947401

In [110]:
precision_score(y_test, y_pred, average = 'weighted')

0.8061646857023137

In [111]:
recall_score(y_train, y_prad, average = 'weighted')

0.8513425237644727

In [112]:
recall_score(y_test, y_pred, average = 'weighted')

0.8497188058446337

In [113]:
f1_score(y_train, y_prad, average = 'weighted')

0.8151641733540685

In [114]:
f1_score(y_test, y_pred, average = 'weighted')

0.8135830910581752
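
The test-set results reported above can be gathered into one table for comparison. The sketch below copies the test accuracy and weighted test F1 of each model from the cell outputs, rounded to four decimals, and flags which models exceed the share of class 0 in the full data set (213703 / 253680). That share is computed on all rows, not on the test split, so it is only a rough reference.

```python
import pandas as pd

# Test-set scores copied from the cell outputs above, rounded to four decimals.
results = pd.DataFrame(
    [
        ("Logistic Regression", 0.8464, 0.8085),
        ("Decision Tree", 0.7920, 0.7855),
        ("Gaussian Naive Bayes", 0.7565, 0.7759),
        ("Random Forest", 0.8488, 0.8109),
        ("K-Nearest Neighbors", 0.8372, 0.7983),
        ("Support Vector Machine", 0.8427, 0.7708),
        ("Stochastic Gradient Descent", 0.8447, 0.7907),
        ("AdaBoost", 0.8495, 0.8089),
        ("Gradient Boosting", 0.8497, 0.8136),
    ],
    columns=["model", "test_accuracy", "test_f1_weighted"],
).set_index("model")

# Share of class 0 in the full data set; a rough reference, not a test-split baseline.
majority_share = 213703 / 253680
results["above_majority_share"] = results["test_accuracy"] > majority_share

summary = results.sort_values("test_f1_weighted", ascending=False)
print(summary)
```

Sorted this way, Gradient Boosting has the highest weighted F1 and test accuracy, while several models sit close to or below the majority-class share.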