# Gradient Boosting - Lab

## Introduction

In this lab, we'll learn how to use both Adaboost and Gradient Boosting classifiers from scikit-learn!

## Objectives

You will be able to:

- Use AdaBoost to make predictions on a dataset 
- Use Gradient Boosting to make predictions on a dataset 

## Getting Started

In this lab, we'll learn how to use boosting algorithms to make classifications on the [Pima Indians Dataset](http://ftp.ics.uci.edu/pub/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names). You will find the data stored in the file `'pima-indians-diabetes.csv'`. Our goal is to use boosting algorithms to determine whether a person has diabetes. Let's get started!

We'll begin by importing everything we need for this lab. Run cell below:

In [94]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import warnings

# Suppress all warnings
warnings.filterwarnings('ignore')


Now, use Pandas to import the data stored in `'pima-indians-diabetes.csv'` and store it in a DataFrame. Print the first five rows to inspect the data we've imported and ensure everything loaded correctly. 

In [95]:
# Load the dataset
Diabetes = ("pima-indians-diabetes.csv")
df = pd.read_csv(Diabetes)

# Display the first few rows to inspect the dataset
df.head()


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Cleaning, exploration, and preprocessing

The target we're trying to predict is the `'Outcome'` column. A `1` denotes a patient with diabetes. 

By now, you're quite familiar with exploring and preprocessing a dataset.  

In the following cells:

* Check for missing values and deal with them as you see fit (if any exist) 
* Count the number of patients with and without diabetes in this dataset 
* Store the target column in a separate variable and remove it from the dataset
* Split the dataset into training and test sets, with a `test_size` of 0.25 and a `random_state` of 42

In [96]:
# Check for missing values
# Check for missing values
missing_values = df.isnull().sum()

missing_values



Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [97]:
# Count the number of patients with and without diabetes
diabetes_counts = df['Outcome'].value_counts()

# Display the results
print("Number of patients with diabetes:", diabetes_counts.get(1, 0))
print("Number of patients without diabetes:", diabetes_counts.get(0, 0))



Number of patients with diabetes: 268
Number of patients without diabetes: 500


In [98]:
target = df['Outcome']
df = pd.read_csv(Diabetes)

In [99]:
print(type(df))  # This will show the type of `df`
print(df)        # This will display the content of `df`


<class 'pandas.core.frame.DataFrame'>
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.6

In [100]:
# Separate features (X) and target (y)
X = df.drop(columns=['Outcome'])  # Drop the Outcome column to keep only features
y = df['Outcome']                # Target variable

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the sizes of the splits for verification
print("Training set size (X_train):", X_train.shape)
print("Test set size (X_test):", X_test.shape)
print("Training target size (y_train):", y_train.shape)
print("Test target size (y_test):", y_test.shape)


Training set size (X_train): (614, 8)
Test set size (X_test): (154, 8)
Training target size (y_train): (614,)
Test target size (y_test): (154,)


## Train the models

Now that we've explored the dataset, we're ready to fit some models!

In the cell below:

* Instantiate an `AdaBoostClassifier` (set the `random_state` for 42)
* Instantiate a `GradientBoostingClassifer` (set the `random_state` for 42) 

In [101]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Instantiate an AdaBoostClassifier
adaboost_clf = AdaBoostClassifier(random_state=42)

# Instantiate a GradientBoostingClassifier
gbt_clf = GradientBoostingClassifier(random_state=42)

# Display the instantiated models
print("AdaBoostClassifier instantiated:", adaboost_clf)
print("GradientBoostingClassifier instantiated:", gbt_clf)


AdaBoostClassifier instantiated: AdaBoostClassifier(random_state=42)
GradientBoostingClassifier instantiated: GradientBoostingClassifier(random_state=42)


Now, fit the training data to both the classifiers: 

In [102]:
# Fit AdaBoostClassifier
# Fit the AdaBoostClassifier to the training data
adaboost_clf.fit(X_train, y_train)
print("AdaBoostClassifier trained successfully!")




AdaBoostClassifier trained successfully!


In [103]:
# Fit GradientBoostingClassifier
# Fit the GradientBoostingClassifier to the training data
gbt_clf.fit(X_train, y_train)
print("GradientBoostingClassifier trained successfully!")


GradientBoostingClassifier trained successfully!


Now, let's use these models to predict labels on both the training and test sets: 

In [104]:
# AdaBoost model predictions
adaboost_train_preds = adaboost_clf.predict(X_train)  # Predictions on training set
adaboost_test_preds = adaboost_clf.predict(X_test)   # Predictions on test set

# GradientBoosting model predictions
gbt_clf_train_preds = gbt_clf.predict(X_train)       # Predictions on training set
gbt_clf_test_preds = gbt_clf.predict(X_test)         # Predictions on test set


Now, complete the following function and use it to calculate the accuracy and f1-score for each model: 

In [105]:
from sklearn.metrics import accuracy_score, f1_score

# Define the function to display accuracy and F1-score
def display_acc_and_f1_score(true, preds, model_name):
    acc = accuracy_score(true, preds)  # Calculate accuracy
    f1 = f1_score(true, preds)        # Calculate F1-score
    print("Model: {}".format(model_name))
    print("Accuracy: {:.2f}".format(acc))
    print("F1-Score: {:.2f}".format(f1))

# Display training metrics
print("Training Metrics")
display_acc_and_f1_score(y_train, adaboost_train_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_train, gbt_clf_train_preds, model_name='Gradient Boosted Trees')
print("")

# Display testing metrics
print("Testing Metrics")
display_acc_and_f1_score(y_test, adaboost_test_preds, model_name='AdaBoost')
print("")
display_acc_and_f1_score(y_test, gbt_clf_test_preds, model_name='Gradient Boosted Trees')


Training Metrics
Model: AdaBoost
Accuracy: 0.84
F1-Score: 0.75

Model: Gradient Boosted Trees
Accuracy: 0.94
F1-Score: 0.91

Testing Metrics
Model: AdaBoost
Accuracy: 0.73
F1-Score: 0.63

Model: Gradient Boosted Trees
Accuracy: 0.75
F1-Score: 0.65


Let's go one step further and create a confusion matrix and classification report for each. Do so in the cell below: 

In [106]:
adaboost_confusion_matrix = confusion_matrix(y_test, adaboost_test_preds)
print("Confusion Matrix: AdaBoost")
print(adaboost_confusion_matrix)
print("\nClassification Report: AdaBoost")
print(classification_report(y_test, adaboost_test_preds))

Confusion Matrix: AdaBoost
[[78 21]
 [20 35]]

Classification Report: AdaBoost
              precision    recall  f1-score   support

           0       0.80      0.79      0.79        99
           1       0.62      0.64      0.63        55

    accuracy                           0.73       154
   macro avg       0.71      0.71      0.71       154
weighted avg       0.73      0.73      0.73       154



In [107]:
gbt_confusion_matrix = confusion_matrix(y_test, gbt_clf_test_preds)
print("Confusion Matrix: Gradient Boosting")
print(gbt_confusion_matrix)
print("\nClassification Report: Gradient Boosting")
print(classification_report(y_test, gbt_clf_test_preds))

Confusion Matrix: Gradient Boosting
[[78 21]
 [18 37]]

Classification Report: Gradient Boosting
              precision    recall  f1-score   support

           0       0.81      0.79      0.80        99
           1       0.64      0.67      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154



**_Question:_** How did the models perform? Interpret the evaluation metrics above to answer this question.

Write your answer below this line:
_______________________________________________________________________________________________________________________________
The models performed as follows:

AdaBoost Classifier:

Accuracy: 73%, meaning the model correctly predicted outcomes for 73% of the test samples.
Precision for Class 1 (Diabetes): 62%, indicating that out of all the instances predicted as diabetes, 62% were correct.
Recall for Class 1 (Diabetes): 64%, meaning the model correctly identified 64% of all actual diabetes cases.
F1-Score for Class 1: 63%, showing a balance between precision and recall.
Confusion Matrix:
20 false negatives (diabetes cases incorrectly predicted as non-diabetes).
21 false positives (non-diabetes cases incorrectly predicted as diabetes).
Conclusion: The model is moderately effective but struggles to achieve high precision and recall for identifying diabetes cases.
Gradient Boosting Classifier:

Accuracy: 75%, slightly better than AdaBoost.
Precision for Class 1 (Diabetes): 64%, a slight improvement over AdaBoost.
Recall for Class 1 (Diabetes): 67%, indicating it identified 67% of all diabetes cases.
F1-Score for Class 1: 65%, showing improved balance between precision and recall compared to AdaBoost.
Confusion Matrix:
18 false negatives, fewer than AdaBoost.
21 false positives, the same as AdaBoost.
Conclusion: Gradient Boosting outperforms AdaBoost slightly in terms of accuracy, precision, and recall. It is better at reducing false negatives, which is critical in medical contexts like diabetes prediction.

 
 
As a final performance check, let's calculate the 5-fold cross-validated score for each model! 

Recall that to compute the cross-validation score, we need to pass in:

* A classifier
* All training data
* All labels
* The number of folds we want in our cross-validation score  

Since we're computing cross-validation score, we'll want to pass in the entire dataset, as well as all of the labels. 

In the cells below, compute the mean cross validation score for each model. 

In [108]:
from sklearn.model_selection import cross_val_score

# Calculate cross-validation score for AdaBoost
adaboost_cv_scores = cross_val_score(adaboost_clf, X, y, cv=5)
print("AdaBoost Cross-Validation Mean Score:", adaboost_cv_scores.mean())

# Calculate cross-validation score for Gradient Boosting
gbt_cv_scores = cross_val_score(gbt_clf, X, y, cv=5)
print("Gradient Boosting Cross-Validation Mean Score:", gbt_cv_scores.mean())


AdaBoost Cross-Validation Mean Score: 0.7631270690094218
Gradient Boosting Cross-Validation Mean Score: 0.7591715474068416


These models didn't do poorly, but we could probably do a bit better by tuning some of the important parameters such as the **_Learning Rate_**. 

## Summary

In this lab, we learned how to use scikit-learn's implementations of popular boosting algorithms such as AdaBoost and Gradient Boosted Trees to make classification predictions on a real-world dataset!