<a href="https://colab.research.google.com/github/TanChen168/Week8_Boosting/blob/main/GradientBoostingAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Gradient Boosting

![gradient boosting image](https://media.geeksforgeeks.org/wp-content/uploads/20200721214745/gradientboosting.PNG)

Image thanks to [Geeks for Geeks](https://www.geeksforgeeks.org/ml-gradient-boosting/)

In this assignment you will:
1. import and prepare a dataset for modeling
2. test and evaluate 3 different boosting models and compare the fit times of each.
3. tune the hyperparameters of the best model to reduce overfitting and improve performance.

In [67]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Import accuracy, precision, recall, classification report, and confusion matrix scoring functions
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, confusion_matrix)

In this assignment you will be working with census data.  Your goal is to predict whether a person will make more or less than $50k per year in income.

The data is available [here](https://drive.google.com/file/d/1drlRzq-lIY7rxQnvv_3fsxfIfLsjQ4A-/view?usp=sharing)

In [68]:
# Get data
df = pd.read_csv('/content/sample_data/census_income.csv')
df = df.drop(columns='Unnamed: 0')


Prepare your dataset for modeling.

Remember to: 
1. Check for missing data, bad data, and duplicates.
2. Check your target class balance.
3. Perform your validation split
4. Create a preprocessing pipeline to use with your models.
5. Fit and evaluate your models using pipelines
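
The first two checks in the list above can be sketched with pandas. The small DataFrame here is made up for illustration (it is not the census file), and the `'?'` placeholder is assumed to mark bad values the same way it does in the `workclass` column cleaned below.

```python
import pandas as pd

# Hypothetical toy data standing in for the census file.
toy = pd.DataFrame({
    'age': [39, 50, 38, 50, 28],
    'workclass': ['Private', '?', 'Private', '?', 'State-gov'],
    'income-class': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
})

# Missing values per column (NaN only; placeholders such as '?' are not counted).
missing_per_column = toy.isna().sum()

# Placeholder strings have to be counted explicitly.
placeholder_per_column = (toy == '?').sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Target class balance as proportions.
class_balance = toy['income-class'].value_counts(normalize=True)

print(missing_per_column, placeholder_per_column, n_duplicates, class_balance, sep='\n\n')
```

Note that `isna()` reports nothing for this toy data even though two rows carry a `'?'`, which is why placeholder values need their own check.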

In [69]:
# Clean up bad data
df = df[df.workclass != '?']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30725 entries, 0 to 32560
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30725 non-null  int64 
 1   workclass       30725 non-null  object
 2   education       30725 non-null  object
 3   marital-status  30725 non-null  object
 4   occupation      30725 non-null  object
 5   relationship    30725 non-null  object
 6   race            30725 non-null  object
 7   sex             30725 non-null  object
 8   capital-gain    30725 non-null  int64 
 9   capital-loss    30725 non-null  int64 
 10  hours-per-week  30725 non-null  int64 
 11  native-country  30725 non-null  object
 12  income-class    30725 non-null  object
dtypes: int64(4), object(9)
memory usage: 3.3+ MB


In [70]:
# Create feature and target dataset
X = df.drop(columns = 'income-class')
# Select the target column (labels stay as the strings '<=50K' and '>50K')
y = df['income-class']

In [71]:
# Check target class balance
y.value_counts()

<=50K    23075
>50K      7650
Name: income-class, dtype: int64
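
With roughly three `<=50K` rows for every `>50K` row, accuracy alone can look better than it is. As a rough reference point, the counts above give the accuracy of a model that always predicts the majority class. This is computed on the full dataset before the split, so the test-set figure may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {'<=50K': 23075, '>50K': 7650}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_class] / total

print(f'{majority_class}: {baseline_accuracy:.3f}')  # <=50K: 0.751
```

Any model worth keeping should clear this floor of about 75% accuracy, and recall on the minority class shows whether it does so by actually finding the `>50K` rows.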

In [72]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

In [73]:
# Prepare and create pipeline

# Initialize transformers
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')
scaler = StandardScaler()
# Note: scikit-learn 1.2+ renames this argument to sparse_output=False
ohe_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

# This is the pipeline for numeric columns
num_pipe = make_pipeline(mean_imputer, scaler)
# This is the pipeline for categorical columns
cat_pipe = make_pipeline(freq_imputer, ohe_encoder)

#Column matching 
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)
column_transformer = make_column_transformer(num_tuple, cat_tuple)

column_transformer.fit(X_train)

X_train_processed = column_transformer.transform(X_train)
X_test_processed = column_transformer.transform(X_test)
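
The cell above transforms the data once and passes arrays to each model. An alternative, in the spirit of item 5 in the list above, is to chain the preprocessor and the model in a single pipeline. The sketch below uses made-up toy data and scikit-learn's `GradientBoostingClassifier`; the column names, the `dtype_exclude` selector, and `sparse_threshold=0` are choices made for this illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
toy_X = pd.DataFrame({
    'age': [25, 38, 52, 46, 31, 60, 23, 44],
    'workclass': ['Private', 'Private', 'Self-emp', 'State-gov',
                  'Private', 'Self-emp', 'Private', 'State-gov'],
})
toy_y = pd.Series(['<=50K', '<=50K', '>50K', '>50K',
                   '<=50K', '>50K', '<=50K', '>50K'])

preprocessor = make_column_transformer(
    # Numeric columns: impute with the mean, then scale.
    (make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()),
     make_column_selector(dtype_include='number')),
    # Non-numeric columns: impute with the most frequent value, then one-hot encode.
    (make_pipeline(SimpleImputer(strategy='most_frequent'),
                   OneHotEncoder(handle_unknown='ignore')),
     make_column_selector(dtype_exclude='number')),
    sparse_threshold=0,  # always return a dense array
)

# One object that preprocesses and then fits the model.
model_pipe = make_pipeline(preprocessor, GradientBoostingClassifier(random_state=3))
model_pipe.fit(toy_X, toy_y)
toy_predictions = model_pipe.predict(toy_X)
print(toy_predictions)
```

Because the preprocessing lives inside the pipeline, tools such as `GridSearchCV` refit the imputers, scaler, and encoder on each training fold rather than on the whole training set.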

# eXtreme Gradient Boosting
We are going to compare both metrics and fit times for our models.  Notice the 'cell magic' at the top of the cell below.  By putting `%%time` at the top of a notebook cell, we can tell it to output how long that cell took to run.  We can use this to compare the speed of each of our models.  Fit times can be very important for models in deployment, especially with very large datasets and/or many features.

Instantiate an eXtreme Gradient Boosting Classifier (XGBClassifier) below, fit it, and print out a classification report.  Take note of the accuracy, recall, precision, and f1-score, as well as the run time of the cell to compare to our next models.
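
`%%time` only works inside a notebook and times the whole cell, including prediction and printing. Outside a notebook, or to time the fit alone, a small wrapper around `time.perf_counter` is one option. The helper name `timed_call` is hypothetical, and the workload here is a stand-in rather than a model fit.

```python
import time

def timed_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in this notebook the call could instead be
# timed_call(xgb.fit, X_train_processed, y_train).
total, seconds = timed_call(sum, range(1_000_000))
print(f'result={total}, wall time={seconds:.4f} s')
```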

In [74]:
%%time

xgb = XGBClassifier()
xgb.fit(X_train_processed, y_train)
xgbPredictions = xgb.predict(X_test_processed)

accuracy = accuracy_score(y_test, xgbPredictions)
# recall = recall_score(y_test, xgbPredictions)
# precisionscore = precision_score(y_test, xgbPredictions)


# print('Training accuracy:', xgb.score(X_train_processed, y_train))
# print('Testing accuracy:', xgb.score(X_test_processed, y_test))
# print('Accuracy: ',  accuracy)
# print('Recall: ', recall)
# print('Precision Score: ', precisionscore)

Creport = classification_report(y_test, xgbPredictions)
print(Creport)



              precision    recall  f1-score   support

       <=50K       0.88      0.95      0.91      5798
        >50K       0.80      0.60      0.68      1884

    accuracy                           0.86      7682
   macro avg       0.84      0.77      0.80      7682
weighted avg       0.86      0.86      0.86      7682

CPU times: user 5.87 s, sys: 26.7 ms, total: 5.9 s
Wall time: 5.9 s


Which target class is your model better at predicting?  Is it significantly overfit?
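
One way to approach the overfitting question is to compare accuracy on the training data with accuracy on the test data, as the commented-out `score` lines in the cell above would do. The sketch below shows the comparison on synthetic data with scikit-learn's `GradientBoostingClassifier`; the data and the variable names are assumptions for illustration, not results from the census set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the census set.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=3)

demo_model = GradientBoostingClassifier(random_state=3).fit(X_tr, y_tr)

# A large positive gap between these two scores suggests overfitting.
train_accuracy = demo_model.score(X_tr, y_tr)
test_accuracy = demo_model.score(X_te, y_te)
gap = train_accuracy - test_accuracy

print(f'train={train_accuracy:.3f}  test={test_accuracy:.3f}  gap={gap:.3f}')
```

How large a gap counts as "significant" is a judgment call that depends on the problem; a gap of a few points is common for boosted trees.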

# More Gradient Boosting

Now fit and evaluate a Light Gradient Boosting Machine (LightGBM) model and the scikit-learn (sklearn) gradient boosting model.  Remember to use the `%%time` cell magic command to get the run time.

## LightGBM

In [85]:
%%time

lgbm = LGBMClassifier(max_depth = 6)
lgbm.fit(X_train_processed, y_train)
lgbmPredictions = lgbm.predict(X_test_processed)

accuracy = accuracy_score(y_test, lgbmPredictions)
Creport = classification_report(y_test, lgbmPredictions)
print(Creport)


              precision    recall  f1-score   support

       <=50K       0.89      0.94      0.92      5798
        >50K       0.79      0.65      0.71      1884

    accuracy                           0.87      7682
   macro avg       0.84      0.80      0.81      7682
weighted avg       0.87      0.87      0.87      7682

CPU times: user 1.12 s, sys: 11.5 ms, total: 1.14 s
Wall time: 1.24 s


## GradientBoostingClassifier

In [76]:
%%time
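gbc = GradientBoostingClassifier()
gbc.fit(X_train_processed, y_train)
# Predict with gbc itself. The report printed below came from a run that called
# lgbm.predict here, so it shows LightGBM's metrics; rerun the cell for gbc's scores.
gbcmPredictions = gbc.predict(X_test_processed)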

accuracy = accuracy_score(y_test, gbcmPredictions)
Creport = classification_report(y_test, gbcmPredictions)
print(Creport)


              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92      5798
        >50K       0.78      0.67      0.72      1884

    accuracy                           0.87      7682
   macro avg       0.84      0.81      0.82      7682
weighted avg       0.87      0.87      0.87      7682

CPU times: user 8.25 s, sys: 7.39 ms, total: 8.25 s
Wall time: 8.27 s



# Tuning Gradient Boosting Models

Tree-based gradient boosting models have a LOT of hyperparameters to tune.  Here are the documentation pages for each of the 3 models you used today:

1. [XGBoost Hyperparameter Documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
2. [LightGBM Hyperparameter Documentation](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
3. [Scikit-learn Gradient Boosting Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

Choose the model you felt performed best when comparing multiple metrics and the fit time, and use GridSearchCV to try at least 2 different values each for 3 different hyperparameters of the boosting model you chose.

See if you can create a model with an accuracy between 86% and 90%.


In [None]:
import time
from sklearn.model_selection import GridSearchCV

start = time.time()

lgbm = LGBMClassifier()

parameters = {'num_leaves':[20, 60,100], 'min_child_samples':[5,10, 15],'max_depth':[5, 10,20]}

# Grid search over the parameters above, scored on accuracy
clf = GridSearchCV(lgbm, parameters, scoring='accuracy')
clf.fit(X=X_train_processed, y=y_train)
print(clf.best_params_)
predicted = clf.predict(X_test_processed)
print('Test accuracy of the best model:')
print(accuracy_score(y_test, predicted))

end = time.time()

print('Execution time is:')
print(end - start)

# lgbm.fit(X_train_processed, y_train)
# lgbmPredictions = lgbm.predict(X_test_processed)


# Evaluation

Evaluate your model using a classification report and/or a confusion matrix.  Explain in text how your model performed in terms of precision, recall, and its ability to predict each of the two classes.  Also talk about the benefits or drawbacks of the computation time of that model.

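-- The tuned LightGBM model performed slightly better than the earlier LightGBM fit with `max_depth = 6`: recall for the `>50K` class rose from 0.65 to 0.67 and its f1-score from 0.71 to 0.72, while its precision stayed at 0.79 and overall accuracy stayed at 0.87. The model remains much better at predicting the `<=50K` class (precision 0.90, recall 0.94) than the `>50K` class. On its own, LightGBM fits far faster than the XGBoost and scikit-learn gradient boosting models (about 1.2 s wall time versus 5.9 s and 8.3 s). The drawback is that GridSearchCV has to fit the model once for every parameter combination and cross-validation fold, so the search takes much longer than a single fit, and the time grows as more parameters or values are added.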
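
To make the search cost concrete, the number of model fits can be counted from the grid itself. This sketch assumes GridSearchCV's default of 5-fold cross-validation (the default in recent scikit-learn releases) and the default final refit on the full training set.

```python
from math import prod

# Grid copied from the GridSearchCV cell above.
parameters = {
    'num_leaves': [20, 60, 100],
    'min_child_samples': [5, 10, 15],
    'max_depth': [5, 10, 20],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in parameters.values())
n_folds = 5  # assumed: GridSearchCV's default cv
n_fits = n_candidates * n_folds + 1  # +1 for the final refit of the best candidate

print(n_candidates, n_fits)  # 27 136
```

Adding a fourth value to just one of the three parameters would raise this to 36 candidates and 181 fits, which is why the search time climbs quickly as the grid grows.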

In [97]:
lgbm = LGBMClassifier(max_depth = 10, min_child_samples=10, num_leaves=20)
lgbm.fit(X_train_processed, y_train)
lgbmPredictions = lgbm.predict(X_test_processed)

accuracy = accuracy_score(y_test, lgbmPredictions)
Creport = classification_report(y_test, lgbmPredictions)
print(Creport)

              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92      5798
        >50K       0.79      0.67      0.72      1884

    accuracy                           0.87      7682
   macro avg       0.84      0.80      0.82      7682
weighted avg       0.87      0.87      0.87      7682
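
The per-class precision and recall in a report like the one above come straight from the confusion matrix. The sketch below shows the relationship on a handful of made-up labels (not the census predictions), treating `>50K` as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up for illustration.
y_true = ['<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K', '>50K', '<=50K']
y_pred = ['<=50K', '<=50K', '<=50K', '>50K', '>50K', '<=50K', '>50K', '<=50K']

labels = ['<=50K', '>50K']
# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# With '>50K' as the positive class, the 2x2 matrix unpacks as tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
precision_high = tp / (tp + fp)  # of the rows predicted >50K, how many really are
recall_high = tp / (tp + fn)     # of the rows that really are >50K, how many were found

print(cm)
print(f'precision={precision_high:.3f}  recall={recall_high:.3f}')
```

In the census reports, the lower recall for `>50K` corresponds to a larger false-negative count: high earners that the model labels as `<=50K`.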



# Conclusion

In this assignment you practiced:
1. data cleaning
2. instantiating, fitting, and evaluating boosting models using multiple metrics
3. timing how long it takes a model to fit and comparing run times between multiple models
4. and choosing a final model based on multiple metrics.

