<a href="https://colab.research.google.com/github/TanChen168/Week8_Boosting/blob/main/GradientBoostingAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Gradient Boosting

![gradient boosting image](https://media.geeksforgeeks.org/wp-content/uploads/20200721214745/gradientboosting.PNG)

Image thanks to [Geeks for Geeks](https://www.geeksforgeeks.org/ml-gradient-boosting/)

In this assignment you will:
1. import and prepare a dataset for modeling,
2. test and evaluate 3 different boosting models and compare the fit times of each,
3. tune the hyperparameters of the best model to reduce overfitting and improve performance.

In [91]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import r2_score

In this assignment you will be working with census data.  Your goal is to predict whether a person will make more or less than $50k per year in income.

The data is available [here](https://drive.google.com/file/d/1drlRzq-lIY7rxQnvv_3fsxfIfLsjQ4A-/view?usp=sharing)

In [59]:
# Get data
df = pd.read_csv('/content/sample_data/census_income.csv')
df = df.drop(columns='Unnamed: 0')


Prepare your dataset for modeling.

Remember to: 
1. Check for missing data, bad data, and duplicates.
2. Check your target class balance.
3. Perform your validation split.
4. Create a preprocessing pipeline to use with your models.
5. Fit and evaluate your models using pipelines
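
The first two checks in the list above can be sketched on a toy frame. The column names mirror the census data, but the rows and the `<=50K` / `>50K` label strings are made up for illustration, so treat this as a sketch under those assumptions rather than the assignment solution.

```python
import pandas as pd

# Toy stand-in for the census data; every value here is illustrative.
toy = pd.DataFrame({
    "age": [25.0, 40.0, 40.0, 31.0, None],
    "workclass": ["Private", "?", "?", "State-gov", "Private"],
    "income-class": ["<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Missing values per column (NaN/None only; '?' placeholders are not counted).
missing_per_column = toy.isna().sum()

# Fully duplicated rows (the first occurrence is not flagged).
duplicate_rows = int(toy.duplicated().sum())

# Placeholder strings that stand in for missing data.
placeholder_count = int(toy["workclass"].eq("?").sum())

# Share of each target class.
class_balance = toy["income-class"].value_counts(normalize=True)

print(missing_per_column, duplicate_rows, placeholder_count, class_balance, sep="\n")
```

Counting `'?'` separately matters because `isna()` does not treat placeholder strings as missing.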

In [None]:
# Drop rows where workclass holds the '?' placeholder
df = df[df.workclass != '?']
df.info()

In [61]:
# Create the feature matrix and target vector
X = df.drop(columns='income-class')
# Select the target column (kept as its original labels; no encoding is applied here)
y = df['income-class']

In [None]:
# Check target class balance
y.value_counts()

In [67]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

In [79]:
# Prepare and create pipeline

# Initialize transformers
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# This is the pipeline for numeric columns
num_pipe = make_pipeline(mean_imputer, scaler)
# This is the pipeline for categorical columns
cat_pipe = make_pipeline(freq_imputer, ohe_encoder)

# Pair each pipeline with the selector for the columns it should transform
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)
column_transformer = make_column_transformer(num_tuple, cat_tuple)



In [80]:
# Instantiate the scikit-learn gradient boosting classifier (fit happens later, inside the pipeline)
gbc = GradientBoostingClassifier()

In [81]:
# Instantiate the LightGBM classifier (fit happens later, inside the pipeline)
lgbm = LGBMClassifier()

In [82]:
# Instantiate the XGBoost classifier (fit happens later, inside the pipeline)
xgb = XGBClassifier()

In [83]:
gbc_pipe = make_pipeline(column_transformer, gbc)
lgbm_pipe = make_pipeline(column_transformer, lgbm)
xgb_pipe = make_pipeline(column_transformer, xgb)

In [84]:
gbc_pipe.fit(X_train, y_train)
lgbm_pipe.fit(X_train, y_train)
xgb_pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6d092f1090>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('onehotencoder',
                                                                   OneHotEncode

In [86]:
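# Score through the fitted pipelines so that X is preprocessed first.
# Calling .score on the bare estimators with the raw X raises a ValueError.
print('GBC training accuracy:', gbc_pipe.score(X_train, y_train))
print('GBC testing accuracy:', gbc_pipe.score(X_test, y_test))

print('LightGBM training accuracy:', lgbm_pipe.score(X_train, y_train))
print('LightGBM testing accuracy:', lgbm_pipe.score(X_test, y_test))

print('XGBoost training accuracy:', xgb_pipe.score(X_train, y_train))
print('XGBoost testing accuracy:', xgb_pipe.score(X_test, y_test))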

In [89]:
X_train

| | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22 | 35 | Federal-gov | 9th | Married-civ-spouse | Farming-fishing | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 13995 | 29 | Private | Bachelors | Never-married | Exec-managerial | Not-in-family | White | Male | 0 | 1590 | 50 | United-States |
| 3300 | 31 | Private | Assoc-acdm | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 41 | United-States |
| 8732 | 22 | Private | Some-college | Never-married | Other-service | Not-in-family | White | Male | 3325 | 0 | 40 | United-States |
| 184 | 37 | Federal-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | White | Male | 0 | 0 | 42 | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16202 | 19 | Private | Some-college | Never-married | Adm-clerical | Other-relative | White | Male | 0 | 0 | 30 | Nicaragua |
| 27802 | 19 | Private | Some-college | Never-married | Sales | Own-child | White | Female | 0 | 0 | 15 | United-States |
| 12229 | 23 | Private | Some-college | Separated | Craft-repair | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 1802 | 52 | Private | HS-grad | Divorced | Other-service | Not-in-family | White | Female | 0 | 0 | 38 | United-States |


# eXtreme Gradient Boosting
We are going to compare both metrics and fit times for our models. Notice the 'cell magic' at the top of the cell below. Putting `%%time` at the top of a notebook cell tells the notebook to report how long that cell took to run, which lets us compare the speed of our different models. Fit times can matter a great deal for models in deployment, especially with very large datasets and/or many features.

Instantiate an eXtreme Gradient Boosting Classifier (XGBClassifier) below, fit it, and print out a classification report.  Take note of the accuracy, recall, precision, and f1-score, as well as the run time of the cell to compare to our next models.
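
Outside a notebook, where `%%time` is not available, a similar wall-clock measurement can be taken with the standard library. The following is a minimal sketch on synthetic data; the helper name `timed_fit` and the variables `X_demo` and `y_demo` are hypothetical, and scikit-learn's `GradientBoostingClassifier` is used only as a stand-in for whichever model is being timed.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def timed_fit(model, X, y):
    """Fit `model` and return (fitted model, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start


# Small synthetic dataset so the example runs quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

fitted, seconds = timed_fit(
    GradientBoostingClassifier(n_estimators=20, random_state=0), X_demo, y_demo
)
print(f"fit took {seconds:.3f} s")
```

Note that `%%time` reports both CPU time and wall time, while `time.perf_counter()` measures wall time only.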

In [None]:
%%time


Which target class is your model better at predicting?  Is it significantly overfit?
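
One way to approach both questions is to compare training and testing accuracy (a large gap suggests overfitting) and to read per-class recall from the classification report. The sketch below uses synthetic, imbalanced data rather than the census data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class data with roughly a 75% / 25% class split.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=0
)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Overfitting check: how much better is the model on data it has already seen?
train_acc = clf.score(X_tr, y_tr)
test_acc = clf.score(X_te, y_te)
accuracy_gap = train_acc - test_acc

# Per-class recall: which class does the model recover more reliably?
report = classification_report(y_te, clf.predict(X_te), output_dict=True)
recall_by_class = {label: report[label]["recall"] for label in ("0", "1")}

print(f"train={train_acc:.3f} test={test_acc:.3f} gap={accuracy_gap:.3f}")
print(recall_by_class)
```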

# More Gradient Boosting

Now fit and evaluate a Light Gradient Boosting Machine (LightGBM) model and the scikit-learn (sklearn) gradient boosting model.  Remember to use the `%%time` cell magic to get the run time.

## LightGBM

In [None]:
%%time


## GradientBoostingClassifier

In [None]:
%%time



# Tuning Gradient Boosting Models

Tree-based gradient boosting models have a LOT of hyperparameters to tune.  Here are the documentation pages for each of the 3 models you used today:

1. [XGBoost Hyperparameter Documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
2. [LightGBM Hyperparameter Documentation](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
3. [Scikit-learn Gradient Boosting Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

Choose the model you felt performed best when comparing multiple metrics and the fit time, then use GridSearchCV to try at least 2 different values for each of 3 different hyperparameters of the boosting model you chose.
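
When the model sits inside a pipeline, the grid keys need the step name as a prefix. Below is a minimal sketch of such a search on synthetic data. The three hyperparameters and their values are arbitrary choices for illustration, not recommendations, and scikit-learn's `GradientBoostingClassifier` stands in for whichever model you selected.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset so the search finishes quickly.
X_grid, y_grid = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name,
# so estimator parameters are addressed as '<step>__<parameter>'.
param_grid = {
    "gradientboostingclassifier__n_estimators": [20, 40],
    "gradientboostingclassifier__max_depth": [2, 3],
    "gradientboostingclassifier__learning_rate": [0.05, 0.1],
}

# 2 x 2 x 2 = 8 candidate settings, each evaluated with 3-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_grid, y_grid)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the pipeline refit on all of the data passed to `fit`, using the best parameter combination.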

See if you can create a model with an accuracy between 86% and 90%.


In [None]:
%%time


# Evaluation

Evaluate your model using a classification report and/or a confusion matrix.  Explain in text how your model performed in terms of precision, recall, and its ability to predict each of the two classes.  Also discuss the benefits or drawbacks of that model's computation time.
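
To connect the confusion matrix to precision and recall, here is a small worked sketch with hand-written labels (the `y_true` and `y_pred` values are invented for illustration and do not come from the census data).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

# Rows are true classes and columns are predicted classes, so for two
# classes ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

# Precision: of everything predicted positive, how much was truly positive?
precision_pos = tp / (tp + fp)
# Recall: of everything truly positive, how much did the model find?
recall_pos = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision_pos, precision_score(y_true, y_pred))
print(recall_pos, recall_score(y_true, y_pred))
```

The hand-computed values should agree with `precision_score` and `recall_score`, which report the positive class (label 1) by default for binary targets.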

# Conclusion

In this assignment you practiced:
1. data cleaning,
2. instantiating, fitting, and evaluating boosting models using multiple metrics,
3. timing how long it takes a model to fit and comparing run times between multiple models,
4. choosing a final model based on multiple metrics.

