# Gradient Boosting - Lab

## Introduction

In this lab, we'll learn how to use both the AdaBoost and Gradient Boosting classifiers from scikit-learn!

## Objectives

You will be able to:

* Compare and contrast AdaBoost and Gradient Boosting
* Use AdaBoost to make predictions on a dataset
* Use Gradient Boosting to make predictions on a dataset

## Getting Started

In this lab, we'll learn how to use Boosting algorithms to make classifications on the [Pima Indians Dataset](http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names). You will find the data stored within the file `pima-indians-diabetes.csv`. Our goal is to use boosting algorithms to classify each person as having or not having diabetes. Let's get started!

We'll begin by importing everything we need for this lab. In the cell below:

* Import `numpy`, `pandas`, and `matplotlib.pyplot`, and set the standard alias for each. Also set matplotlib visualizations to display inline. 
* Set a random seed of `0` by using `np.random.seed(0)`
* Import `train_test_split` and `cross_val_score` from `sklearn.model_selection`
* Import `StandardScaler` from `sklearn.preprocessing`
* Import `AdaBoostClassifier` and `GradientBoostingClassifier` from `sklearn.ensemble`
* Import `accuracy_score`, `f1_score`, `confusion_matrix`, and `classification_report` from `sklearn.metrics`

In [3]:
!pip install sklearn-pandas

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
%matplotlib inline

# scikit-learn draws from NumPy's global random state when random_state is None
np.random.seed(0)

Collecting sklearn-pandas
  Downloading https://files.pythonhosted.org/packages/1f/48/4e1461d828baf41d609efaa720d20090ac6ec346b5daad3c88e243e2207e/sklearn_pandas-1.8.0-py2.py3-none-any.whl
Installing collected packages: sklearn-pandas
Successfully installed sklearn-pandas-1.8.0


Now, use pandas to read in the data stored in `pima-indians-diabetes.csv` and store it in a DataFrame. Display the head to inspect the data we've imported and ensure everything loaded correctly. 

In [4]:
df = pd.read_csv('pima-indians-diabetes.csv')

In [5]:
df.head()

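| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |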


## Cleaning, Exploration, and Preprocessing

The target we're trying to predict is the `'Outcome'` column. A `1` denotes a patient with diabetes. 

By now, you're quite familiar with exploring and preprocessing a dataset, so we won't hold your hand for this step. 

In the following cells:

* Store our target column in a separate variable and remove it from the dataset
* Check for null values and deal with them as you see fit (if any exist)
* Check the distribution of our target
* Scale the dataset
* Split the dataset into training and testing sets, with a `test_size` of `0.25`

In [6]:
target = df['Outcome']

In [7]:
df = df.drop(['Outcome'], axis=1)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.25)

In [9]:
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
dtype: int64
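
One of the steps listed earlier is checking the distribution of the target. A minimal sketch of one way to do that with `value_counts` follows; the `toy_target` Series is a hypothetical stand-in for the lab's `target`, and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `target`; in the lab you would use the real Series.
toy_target = pd.Series([1, 0, 1, 0, 1, 0, 0, 0], name="Outcome")

# Absolute number of rows in each class
class_counts = toy_target.value_counts()

# Share of each class; a strong skew means accuracy alone can be misleading
class_shares = toy_target.value_counts(normalize=True)

print(class_counts)
print(class_shares)
```

If one class dominates, metrics such as the F1-score are usually more informative than accuracy on its own.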

In [10]:
#scaler = None
#scaled_df = None
#scaled_df.head()

In [11]:
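from sklearn.impute import SimpleImputer
import sklearn_pandas

# Treat zeros as missing readings and replace them with the column mean
imp_mean = SimpleImputer(missing_values=0, strategy='mean')

# Each (columns, transformer) pair produces its own block of output columns:
# first the 7 imputed (unscaled) columns, then the 8 standardized columns.
mapper = sklearn_pandas.DataFrameMapper([
    (["Glucose", "BloodPressure", "SkinThickness", "Insulin", 
      "BMI", "DiabetesPedigreeFunction", "Age"], imp_mean),
    (["Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin", 
      "BMI", "DiabetesPedigreeFunction", "Age"], StandardScaler()),
])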

In [48]:
mapper.fit(X_train)

DataFrameMapper(default=False, df_out=False,
        features=[(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], SimpleImputer(copy=True, fill_value=None, missing_values=0, strategy='mean',
       verbose=0)), (['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], StandardScaler(copy=True, with_mean=True, with_std=True))],
        input_df=False, sparse=False)

In [None]:
mapper.transform(X_train)

In [62]:
X_train

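| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 760 | 2 | 88 | 58 | 26 | 16 | 28.4 | 0.766 | 22 |
| 723 | 5 | 117 | 86 | 30 | 105 | 39.1 | 0.251 | 42 |
| 435 | 0 | 141 | 0 | 0 | 0 | 42.4 | 0.205 | 29 |
| 610 | 3 | 106 | 54 | 21 | 158 | 30.9 | 0.292 | 24 |
| 137 | 0 | 93 | 60 | 25 | 92 | 28.7 | 0.532 | 22 |
| 659 | 3 | 80 | 82 | 31 | 70 | 34.2 | 1.292 | 27 |
| 395 | 2 | 127 | 58 | 24 | 275 | 27.7 | 1.600 | 25 |
| 498 | 7 | 195 | 70 | 33 | 145 | 25.1 | 0.163 | 55 |
| 28 | 13 | 145 | 82 | 19 | 110 | 22.2 | 0.245 | 57 |
| 385 | 1 | 119 | 54 | 13 | 50 | 22.3 | 0.205 | 24 |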


In [50]:
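# Note: this refits the imputer and scaler on the test set. To avoid leaking
# test-set statistics, the mapper fitted on X_train would normally be reused.
mapper.fit(X_test)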

DataFrameMapper(default=False, df_out=False,
        features=[(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], SimpleImputer(copy=True, fill_value=None, missing_values=0, strategy='mean',
       verbose=0)), (['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'], StandardScaler(copy=True, with_mean=True, with_std=True))],
        input_df=False, sparse=False)

In [51]:
mapper.transform(X_test)

array([[ 1.24000000e+02,  6.00000000e+01,  3.20000000e+01, ...,
         5.39482357e-01,  1.18646514e-01, -9.98159721e-01],
       [ 1.47000000e+02,  7.40000000e+01,  2.50000000e+01, ...,
         4.22147490e-01, -2.41304106e-01, -2.82763621e-01],
       [ 1.32000000e+02,  7.80000000e+01,  2.88496241e+01, ...,
         9.62173064e-02, -2.18981587e-01, -9.98159721e-01],
       ...,
       [ 1.24000000e+02,  7.00000000e+01,  2.00000000e+01, ...,
        -5.55643061e-01, -6.06835356e-01,  1.94167113e-01],
       [ 9.20000000e+01,  6.20000000e+01,  2.50000000e+01, ...,
        -1.58558244e+00,  2.93564379e-02, -6.80205898e-01],
       [ 1.37000000e+02,  4.00000000e+01,  3.50000000e+01, ...,
         1.49119849e+00,  5.06866512e+00, -4.42982539e-02]])
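
A common way to keep test-set statistics out of preprocessing is to fit the imputer and scaler on the training split only, then reuse the fitted objects on the test split. The sketch below illustrates that pattern with scikit-learn's `Pipeline` and `ColumnTransformer` rather than the `DataFrameMapper` used in this lab. The `toy` DataFrame, its values, and all variable names are hypothetical stand-ins, not the lab's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: a 0 in "Glucose" marks a missing reading,
# while a 0 in "Pregnancies" is a legitimate value.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "Pregnancies": rng.randint(0, 10, size=40).astype(float),
    "Glucose": rng.randint(70, 200, size=40).astype(float),
})
toy.loc[::8, "Glucose"] = 0.0  # inject some "missing" readings

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Chain the steps so imputed values are also standardized
impute_then_scale = Pipeline([
    ("impute", SimpleImputer(missing_values=0, strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("zeros_as_missing", impute_then_scale, ["Glucose"]),
    ("scale_only", StandardScaler(), ["Pregnancies"]),
])

# Learn means and standard deviations from the training split only...
toy_train_scaled = preprocess.fit_transform(toy_train)
# ...and apply those same statistics to the test split
toy_test_scaled = preprocess.transform(toy_test)

print(toy_train_scaled.shape, toy_test_scaled.shape)
```

Chaining the imputer and scaler in one pipeline also means every output column has been through both steps, instead of ending up as separate imputed and scaled blocks.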

## Training the Models

Now that we've cleaned and preprocessed our dataset, we're ready to fit some models!

In the cell below:

* Create an `AdaBoostClassifier`
* Create a `GradientBoostingClassifier`

In [52]:
adaboost_clf = AdaBoostClassifier()
gbt_clf = GradientBoostingClassifier()

Now, train each of the classifiers using the training data.

In [53]:
adaboost_clf.fit(X_train, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)

In [54]:
gbt_clf.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=100,
              n_iter_no_change=None, presort='auto', random_state=None,
              subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False)

Now, let's create some predictions using each model so that we can calculate the training and testing accuracy for each.

In [55]:
adaboost_train_preds = adaboost_clf.predict(X_train)
adaboost_test_preds = adaboost_clf.predict(X_test)
gbt_clf_train_preds = gbt_clf.predict(X_train)
gbt_clf_test_preds = gbt_clf.predict(X_test)

Now, complete the following function and use it to calculate the training and testing accuracy and F1-score for each model.

In [57]:
def display_acc_and_f1_score(true, preds, model_name):
    acc = accuracy_score(true, preds)
    f1 = f1_score(true, preds)
    print("Model: {}".format(model_name))
    print("Accuracy: {}".format(acc))
    print("F1-Score: {}".format(f1))
    
print("Training Metrics")
display_acc_and_f1_score(y_train, adaboost_train_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_train, gbt_clf_train_preds, model_name='Gradient Boosted Trees')
print("")
print("Testing Metrics")
display_acc_and_f1_score(y_test, adaboost_test_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_test, gbt_clf_test_preds, model_name='Gradient Boosted Trees')

Training Metrics
Model: AdaBoost
Accuracy: 0.8489583333333334
F1-Score: 0.7808564231738034

Model: Gradient Boosted Trees
Accuracy: 0.9270833333333334
F1-Score: 0.8944723618090453

Testing Metrics
Model: AdaBoost
Accuracy: 0.6979166666666666
F1-Score: 0.4727272727272727

Model: Gradient Boosted Trees
Accuracy: 0.7552083333333334
F1-Score: 0.584070796460177


Let's go one step further and create a confusion matrix and classification report for each. Do so in the cell below.

In [58]:
adaboost_confusion_matrix = confusion_matrix(y_test, adaboost_test_preds)
adaboost_confusion_matrix

array([[108,  19],
       [ 39,  26]])

In [59]:
gbt_confusion_matrix = confusion_matrix(y_test, gbt_clf_test_preds)
gbt_confusion_matrix

array([[112,  15],
       [ 32,  33]])
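
To see how these confusion matrices relate to the testing accuracy and F1-score printed earlier, the metrics can be recomputed by hand. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels; `metrics_from_confusion` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def metrics_from_confusion(matrix):
    """Return accuracy, precision, recall, and F1 for the positive class.

    Assumes rows are true labels and columns are predicted labels,
    so the matrix reads [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Values copied from the two confusion matrices above
adaboost_metrics = metrics_from_confusion([[108, 19], [39, 26]])
gbt_metrics = metrics_from_confusion([[112, 15], [32, 33]])

print("AdaBoost:", adaboost_metrics)
print("Gradient Boosted Trees:", gbt_metrics)
```

Both models miss a large share of the true positives (39 and 32 of the 65 patients with diabetes, respectively), which is why their F1-scores sit well below their accuracy.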

In [60]:
adaboost_classification_report = classification_report(y_test, adaboost_test_preds)
print(adaboost_classification_report)

              precision    recall  f1-score   support

           0       0.73      0.85      0.79       127
           1       0.58      0.40      0.47        65

   micro avg       0.70      0.70      0.70       192
   macro avg       0.66      0.63      0.63       192
weighted avg       0.68      0.70      0.68       192



In [61]:
gbt_classification_report = classification_report(y_test, gbt_clf_test_preds)
print(gbt_classification_report)

              precision    recall  f1-score   support

           0       0.78      0.88      0.83       127
           1       0.69      0.51      0.58        65

   micro avg       0.76      0.76      0.76       192
   macro avg       0.73      0.69      0.71       192
weighted avg       0.75      0.76      0.74       192



**_Question:_** How did the models perform? Interpret the evaluation metrics above to answer this question.

Write your answer below this line:
_______________________________________________________________________________________________________________________________

 
 
As a final performance check, let's calculate the `cross_val_score` for each model! Do so now in the cells below. 

Recall that to compute the cross-validation score, we need to pass in:

* A classifier
* All of the feature data
* All of the labels
* The number of folds we want in our cross-validation

Since we're computing a cross-validation score, we'll want to pass in the entire dataset, as well as all of the labels. We don't need to give it data that has been split into training and testing sets, because it handles the splitting itself during cross-validation.

In the cells below, compute the mean cross-validation score for each model. For the data, use the full feature DataFrame `df`. The corresponding labels are in the variable `target`. Also set `cv=5`.
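
To make the splitting step concrete, here is a rough sketch of what `cross_val_score` does for a classifier when `cv=5`: it builds five stratified folds, fits a fresh copy of the model on four of them, and scores it on the fifth. The data comes from `make_classification` and is a synthetic stand-in, not the Pima dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (hypothetical; not the lab's dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# A fixed random_state makes every fit on the same data identical
clf = GradientBoostingClassifier(n_estimators=20, random_state=0)

# For a classifier, an integer cv means unshuffled stratified K-fold
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in folds.split(X_toy, y_toy):
    fold_clf = clone(clf)  # fresh, unfitted copy for each fold
    fold_clf.fit(X_toy[train_idx], y_toy[train_idx])
    # score() reports accuracy on the held-out fold
    manual_scores.append(fold_clf.score(X_toy[test_idx], y_toy[test_idx]))

library_scores = cross_val_score(clf, X_toy, y_toy, cv=5)

print(np.mean(manual_scores), library_scores.mean())
```

Because each fold's model is trained without seeing that fold's held-out rows, the mean of the five scores is a less optimistic estimate than accuracy measured on the training data.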

In [1]:
print('Mean Adaboost Cross-Val Score (k=5):')
print(cross_val_score(adaboost_clf, df, target, cv=5).mean())
# Expected Output: 0.7631270690094218


In [2]:
print('Mean GBT Cross-Val Score (k=5):')
print(cross_val_score(gbt_clf, df, target, cv=5).mean())
# Expected Output: 0.7591715474068416


These models didn't do poorly, but we could probably do a bit better by tuning some of the important parameters such as the **_Learning Rate_**. 
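
As a rough sketch of what tuning the learning rate might look like, the example below runs a small grid search with `GridSearchCV`. The data is synthetic (from `make_classification`), and the candidate values, `n_estimators=30`, and `cv=3` are arbitrary choices made to keep the example quick, not recommendations for the Pima dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical; not the lab's dataset)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate learning rates are illustrative only
param_grid = {"learning_rate": [0.01, 0.1, 1.0]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# Best setting found and its mean cross-validated accuracy
print(search.best_params_)
print(search.best_score_)
```

The learning rate interacts with the number of estimators, so in practice the two are often searched together: a smaller learning rate typically needs more trees to reach the same fit.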

## Summary

In this lab, we learned how to use scikit-learn's implementations of popular boosting algorithms such as AdaBoost and Gradient Boosted Trees to make classification predictions on a real-world dataset!