<a href="https://colab.research.google.com/github/AgunsBaba/Assignment/blob/master/Gradient_Boosting_Assignment_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Gradient Boosting

![gradient boosting image](https://media.geeksforgeeks.org/wp-content/uploads/20200721214745/gradientboosting.PNG)

Image thanks to [Geeks for Geeks](https://www.geeksforgeeks.org/ml-gradient-boosting/)

In this assignment you will:
1. import and prepare a dataset for modeling
2. test and evaluate 3 different boosting models and compare the fit times of each.
3. tune the hyperparameters of the best model to reduce overfitting and improve performance.

In [3]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In this assignment you will be working with census data.  Your goal is to predict whether a person will make more or less than $50k per year in income.

The data is available [here](https://drive.google.com/file/d/1drlRzq-lIY7rxQnvv_3fsxfIfLsjQ4A-/view?usp=sharing)

In [4]:
df = pd.read_csv('/content/census_income - census_income.csv')
df.head()

|   | Unnamed: 0 | age | workclass | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income-class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Prepare your dataset for modeling.

Remember to: 
1. Check for missing data, bad data, and duplicates.
2. Check your target class balance.
3. Perform your validation split
4. Create a preprocessing pipeline to use with your models.
5. Fit and evaluate your models using pipelines
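
The "bad data" check in the list above can go beyond `isna()`, because categorical columns sometimes encode missing values as placeholder strings or carry stray whitespace. Here is a minimal sketch on a toy frame. The `'?'` placeholder, the toy values, and the helper name `count_placeholders` are assumptions for illustration, so check the actual file's values before relying on them.

```python
import pandas as pd

# Toy frame standing in for the census data; the '?' placeholder and the
# stray whitespace are assumptions for illustration, not observed values.
toy = pd.DataFrame({
    "workclass": ["Private", "?", " State-gov", "Private"],
    "age": [39, 50, 38, 39],
})

def count_placeholders(frame, placeholder="?"):
    """Count cells equal to `placeholder` in each non-numeric column."""
    text_cols = frame.select_dtypes(exclude="number")
    stripped = text_cols.apply(lambda col: col.str.strip())
    return (stripped == placeholder).sum()

placeholder_counts = count_placeholders(toy)
print(placeholder_counts)

# Listing the unique values per column is a quick way to spot inconsistent labels.
for col in toy.select_dtypes(exclude="number"):
    print(col, sorted(toy[col].str.strip().unique()))
```

On the real data the same helper could be called as `count_placeholders(df)`.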

In [5]:
print(df.isna().sum()) #check for null values

Unnamed: 0        0
age               0
workclass         0
education         0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income-class      0
dtype: int64


In [6]:
print(df.duplicated().sum()) #check for duplication

0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      32561 non-null  int64 
 1   age             32561 non-null  int64 
 2   workclass       32561 non-null  object
 3   education       32561 non-null  object
 4   marital-status  32561 non-null  object
 5   occupation      32561 non-null  object
 6   relationship    32561 non-null  object
 7   race            32561 non-null  object
 8   sex             32561 non-null  object
 9   capital-gain    32561 non-null  int64 
 10  capital-loss    32561 non-null  int64 
 11  hours-per-week  32561 non-null  int64 
 12  native-country  32561 non-null  object
 13  income-class    32561 non-null  object
dtypes: int64(5), object(9)
memory usage: 3.5+ MB


In [8]:
#check income class balance
df['income-class'].value_counts(normalize=True)

<=50K    0.75919
>50K     0.24081
Name: income-class, dtype: float64

In [9]:
#define target and feature labels
X = df.drop(columns=['income-class', 'Unnamed: 0'])
y = df['income-class']

In [10]:
#Perform validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
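
Because roughly 76% of the rows are `<=50K`, a purely random split can shift the class proportions slightly between the training and test sets. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. A minimal sketch on synthetic labels (the toy data and the `test_size` here are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 76 majority labels and 24 minority labels.
y_toy = np.array(["<=50K"] * 76 + [">50K"] * 24)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

train_share = np.mean(y_tr == ">50K")
test_share = np.mean(y_te == ">50K")
print(train_share, test_share)
```

Both shares match the overall minority share of the toy target, which is the property that makes test metrics on the minority class easier to compare with training metrics.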

In [11]:
#pre-processing pipeline
cat_selector = make_column_selector(dtype_include=object) #select the categorical (object) columns
num_selector = make_column_selector(dtype_include=int) #select the numeric (integer) columns

oht = OneHotEncoder(sparse=False, handle_unknown='ignore') #instantiate the one-hot encoder transformer
scaler = StandardScaler() #instantiate the scaler transformer

#make pipelines for the categorical and numeric columns
cat_pipe = make_pipeline(oht)
num_pipe = make_pipeline(scaler)


#create pre-processing tuples
cat_tuple = (cat_pipe, cat_selector)
num_tuple = (num_pipe, num_selector)


#make column transformers
col_transformer = make_column_transformer(num_tuple, cat_tuple, remainder='passthrough')
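
To see what a column transformer like the one above produces, it can help to run a similar one on a tiny frame and inspect the output shape. This is a sketch under assumptions: the toy values are invented, the selectors use `'number'` rather than the notebook's `int`/`object` pair, and `sparse_threshold=0` is passed only to force a dense array for inspection.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric and two categorical columns (illustrative values).
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "hours-per-week": [40, 50, 20, 45],
    "sex": ["Male", "Female", "Female", "Male"],
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
})

num_cols = make_column_selector(dtype_include="number")
cat_cols = make_column_selector(dtype_exclude="number")

# sparse_threshold=0 forces a dense array so the result is easy to inspect.
toy_transformer = make_column_transformer(
    (StandardScaler(), num_cols),
    (OneHotEncoder(handle_unknown="ignore"), cat_cols),
    sparse_threshold=0,
)

toy_out = toy_transformer.fit_transform(toy)
print(toy_out.shape)
```

The output has one scaled column per numeric feature followed by one indicator column per category level, so the column count grows with the number of distinct category values.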


# eXtreme Gradient Boosting
We are going to compare both metrics and fit times for our models.  Notice the 'cell magic' at the top of the cell below.  Putting `%%time` at the top of a notebook cell makes the notebook report how long that cell took to run, which lets us compare the speed of the different models.  Fit times can be very important for models in deployment, especially with very large datasets and/or many features.
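
`%%time` only works inside a notebook. In a plain Python script, a similar wall-clock measurement can be sketched with `time.perf_counter`. The helper name `time_call` is hypothetical, and unlike `%%time` it reports wall time only, not CPU time.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be something like
# time_call(xgb_pipe.fit, X_train, y_train).
total, seconds = time_call(sum, range(1_000_000))
print(f"sum={total}, wall time: {seconds:.4f} s")
```

A single run is noisy, so for close comparisons it is usually worth repeating the measurement a few times.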

Instantiate an eXtreme Gradient Boosting Classifier (XGBClassifier) below, fit it, and print out a classification report.  Take note of the accuracy, recall, precision, and f1-score, as well as the run time of the cell to compare to our next models.

In [12]:
%%time
#instantiate the model
xgb = XGBClassifier(random_state=42)

#create an XGB pipeline
xgb_pipe = make_pipeline(col_transformer, xgb)

#fit the model
xgb_pipe.fit(X_train, y_train)

CPU times: user 5.08 s, sys: 62 ms, total: 5.14 s
Wall time: 5.21 s


In [13]:
#generate xgb classification report
pred_xgb = xgb_pipe.predict(X_test)
pred_xgb_train = xgb_pipe.predict(X_train)

print(classification_report(y_test, pred_xgb))
print(classification_report(y_train, pred_xgb_train))

              precision    recall  f1-score   support

       <=50K       0.88      0.95      0.91      7455
        >50K       0.79      0.58      0.67      2314

    accuracy                           0.86      9769
   macro avg       0.84      0.76      0.79      9769
weighted avg       0.86      0.86      0.86      9769

              precision    recall  f1-score   support

       <=50K       0.88      0.96      0.91     17265
        >50K       0.81      0.58      0.68      5527

    accuracy                           0.86     22792
   macro avg       0.84      0.77      0.80     22792
weighted avg       0.86      0.86      0.86     22792



Which target class is your model better at predicting?  Is it significantly overfit?

The model is better at predicting the `<=50K` class: its test f1-score is 0.91, versus 0.67 for `>50K`, mainly because recall on `>50K` is only 0.58. I do not think the model is significantly overfit, since accuracy is 0.86 on both the training and test sets and the per-class scores differ by at most 0.02.
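
One simple way to make the overfitting check explicit is to compute the gap between training and test scores. The sketch below uses the numbers from the two XGBoost classification reports above. The helper name `generalization_gap` is hypothetical, and what counts as a "large" gap is a judgment call rather than a fixed rule.

```python
def generalization_gap(train_score, test_score):
    """Return how much better the model scores on training data than on test data."""
    return round(train_score - test_score, 4)

# Scores copied from the two XGBoost classification reports above.
xgb_scores = {
    "accuracy": (0.86, 0.86),      # (train, test)
    "f1 for >50K": (0.68, 0.67),
}

xgb_gaps = {name: generalization_gap(tr, te) for name, (tr, te) in xgb_scores.items()}
print(xgb_gaps)
```

Gaps this small are consistent with a model that is not significantly overfit.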

# More Gradient Boosting

Now fit and evaluate a Light Gradient Boosting Machine (LightGBM) model and the scikit-learn (sklearn) gradient boosting model.  Remember to use the `%%time` cell magic to get the run time.

## LightGBM

In [14]:
%%time
#instantiate the model
lgbm = LGBMClassifier(random_state=42)

#create an LGBM pipeline
lgbm_pipe = make_pipeline(col_transformer, lgbm)

#fit the model
lgbm_pipe.fit(X_train, y_train)

CPU times: user 529 ms, sys: 0 ns, total: 529 ms
Wall time: 540 ms


In [15]:
#generate lgbm classification report
pred_lgbm = lgbm_pipe.predict(X_test)
pred_lgbm_train = lgbm_pipe.predict(X_train)

print(classification_report(y_test, pred_lgbm))
print(classification_report(y_train, pred_lgbm_train))

              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92      7455
        >50K       0.77      0.66      0.71      2314

    accuracy                           0.87      9769
   macro avg       0.83      0.80      0.81      9769
weighted avg       0.87      0.87      0.87      9769

              precision    recall  f1-score   support

       <=50K       0.91      0.95      0.93     17265
        >50K       0.82      0.69      0.75      5527

    accuracy                           0.89     22792
   macro avg       0.86      0.82      0.84     22792
weighted avg       0.88      0.89      0.88     22792



## GradientBoostingClassifier

In [16]:
%%time
#instantiate the model
gbc = GradientBoostingClassifier(random_state=42)

#create a GBC pipeline
gbc_pipe = make_pipeline(col_transformer, gbc)

#fit the model
gbc_pipe.fit(X_train, y_train)

CPU times: user 5.68 s, sys: 0 ns, total: 5.68 s
Wall time: 5.68 s


In [17]:
#generate gbc classification report
pred_gbc = gbc_pipe.predict(X_test)
pred_gbc_train = gbc_pipe.predict(X_train)

print(classification_report(y_test, pred_gbc))
print(classification_report(y_train, pred_gbc_train))

              precision    recall  f1-score   support

       <=50K       0.88      0.95      0.92      7455
        >50K       0.79      0.59      0.68      2314

    accuracy                           0.87      9769
   macro avg       0.84      0.77      0.80      9769
weighted avg       0.86      0.87      0.86      9769

              precision    recall  f1-score   support

       <=50K       0.88      0.96      0.92     17265
        >50K       0.81      0.60      0.69      5527

    accuracy                           0.87     22792
   macro avg       0.85      0.78      0.80     22792
weighted avg       0.86      0.87      0.86     22792




# Tuning Gradient Boosting Models

Tree-based gradient boosting models have a LOT of hyperparameters to tune.  Here are the documentation pages for each of the 3 models you used today:

1. [XGBoost Hyperparameter Documentation](https://xgboost.readthedocs.io/en/latest/parameter.html)
2. [LightGBM Hyperparameter Documentation](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html)
3. [Scikit-learn Gradient Boosting Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

Choose the model you felt performed best when comparing multiple metrics and the fit time, then use GridSearchCV to try at least 2 different values each for 3 different hyperparameters of the boosting model you chose.

See if you can create a model with an accuracy between 86% and 90%.


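I really like the performance of the LGBM model. It took about 540 ms to fit, roughly ten times faster than the other two models (5.2 s and 5.7 s), and it tied with the scikit-learn model for the highest test accuracy (0.87) while having the best test recall (0.66) and f1-score (0.71) on the `>50K` class.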
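
The comparison behind this choice can be written down explicitly. The sketch below copies the test-set numbers and wall times from the cells above into a dictionary and ranks the models. Ranking by minority-class f1-score first and fit time second is an assumption about priorities, not the only reasonable ordering.

```python
# Test-set numbers and wall times copied from the cells above.
results = {
    "XGBoost": {"fit_seconds": 5.21, "test_accuracy": 0.86, "f1_over_50k": 0.67},
    "LightGBM": {"fit_seconds": 0.54, "test_accuracy": 0.87, "f1_over_50k": 0.71},
    "GradientBoosting": {"fit_seconds": 5.68, "test_accuracy": 0.87, "f1_over_50k": 0.68},
}

# Rank by minority-class f1 (higher is better), breaking ties by fit time (lower is better).
ranking = sorted(
    results,
    key=lambda name: (-results[name]["f1_over_50k"], results[name]["fit_seconds"]),
)
best_model = ranking[0]
slowest_seconds = max(r["fit_seconds"] for r in results.values())
speedup_vs_slowest = slowest_seconds / results[best_model]["fit_seconds"]

print(ranking)
print(f"{best_model} fit about {speedup_vs_slowest:.1f}x faster than the slowest model")
```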

In [18]:
#get parameters for the LGBM model
lgbm_pipe.get_params()

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('pipeline-1',
                                  Pipeline(steps=[('standardscaler',
                                                   StandardScaler())]),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6651894fd0>),
                                 ('pipeline-2',
                                  Pipeline(steps=[('onehotencoder',
                                                   OneHotEncoder(handle_unknown='ignore',
                                                                 sparse=False))]),
                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6651894ad0>)]),
 'columntransformer__n_jobs': None,
 'columntransformer__pipeline-1': Pipeline(steps=[('standardscaler', StandardScaler())]),
 'columntransformer__pipeline-1__memory': None,
 'columntransformer__pipeline-1__standard

In [40]:
param_grid = {'lgbmclassifier__num_leaves': [30, 60, 100],
              'lgbmclassifier__min_child_samples': [10, 20, 30],
              'lgbmclassifier__n_estimators': [150, 450, 600]}

grid = GridSearchCV(lgbm_pipe, param_grid)
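
Before launching a grid search it is useful to estimate how many model fits it will trigger, since that drives the run time. The keys use the `step__parameter` naming that pipelines expose, where `lgbmclassifier` is the step name `make_pipeline` generated. A small sketch using scikit-learn's `ParameterGrid`, assuming the default 5-fold cross-validation and the default refit of the best candidate:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    "lgbmclassifier__num_leaves": [30, 60, 100],
    "lgbmclassifier__min_child_samples": [10, 20, 30],
    "lgbmclassifier__n_estimators": [150, 450, 600],
}

n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5  # GridSearchCV's default when cv is not given
total_fits = n_candidates * cv_folds + 1  # +1 for the final refit on all training data

print(n_candidates, total_fits)
```

Every extra value in any of the three lists multiplies the number of candidates, so grids grow quickly.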

In [41]:
%%time
grid.fit(X_train, y_train)


CPU times: user 3min 59s, sys: 6.46 s, total: 4min 6s
Wall time: 4min 5s


GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(remainder='passthrough',
                                                          transformers=[('pipeline-1',
                                                                         Pipeline(steps=[('standardscaler',
                                                                                          StandardScaler())]),
                                                                         <sklearn.compose._column_transformer.make_column_selector object at 0x7f6651894fd0>),
                                                                        ('pipeline-2',
                                                                         Pipeline(steps=[('onehotencoder',
                                                                                          OneHotEncoder(handle_unknown='ignore',
                                                                    

In [42]:
print(grid.best_params_)

{'lgbmclassifier__min_child_samples': 10, 'lgbmclassifier__n_estimators': 150, 'lgbmclassifier__num_leaves': 30}
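
With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full training set and exposes the fitted pipeline as `best_estimator_`, so rebuilding the model by hand is optional. The sketch below shows the pattern on stand-in data. The synthetic dataset, the scikit-learn `GradientBoostingClassifier`, and the small grid are all assumptions chosen to keep the example fast, not the notebook's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook would use X_train / y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

demo_pipe = make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=42))
demo_grid = {
    "gradientboostingclassifier__n_estimators": [10, 30],
    "gradientboostingclassifier__max_depth": [1, 2],
}

# refit=True (the default) retrains the best candidate on all of X_tr.
search = GridSearchCV(demo_pipe, demo_grid, cv=3)
search.fit(X_tr, y_tr)

best_pipe = search.best_estimator_  # already fitted; no manual re-instantiation needed
test_accuracy = best_pipe.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

In the notebook, the equivalent would be `grid.best_estimator_.predict(X_test)`.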


In [38]:
#instantiate lgbm with the best params
lgbm_best = LGBMClassifier(min_child_samples=10, n_estimators=150, num_leaves=30, random_state=42)

#create an LGBM pipeline
lgbm_pipe_best = make_pipeline(col_transformer, lgbm_best)

#fit the model
lgbm_pipe_best.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('pipeline-1',
                                                  Pipeline(steps=[('standardscaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6651894fd0>),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x7f6651894ad0>)])),
                ('lgbmclassifier',
  

In [39]:
best_pred = lgbm_pipe_best.predict(X_test)
best_pred_train = lgbm_pipe_best.predict(X_train)


print(classification_report(y_test, best_pred))
print(classification_report(y_train, best_pred_train))


              precision    recall  f1-score   support

       <=50K       0.90      0.94      0.92      7455
        >50K       0.77      0.67      0.71      2314

    accuracy                           0.87      9769
   macro avg       0.83      0.80      0.82      9769
weighted avg       0.87      0.87      0.87      9769

              precision    recall  f1-score   support

       <=50K       0.91      0.95      0.93     17265
        >50K       0.83      0.71      0.77      5527

    accuracy                           0.89     22792
   macro avg       0.87      0.83      0.85     22792
weighted avg       0.89      0.89      0.89     22792



# Evaluation

Evaluate your model using a classification report and/or a confusion matrix.  Explain in text how your model performed in terms of precision, recall, and its ability to predict each of the two classes.  Also talk about the benefits or drawbacks of the computation time of that model.

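Based on the classification reports, the grid search did not meaningfully improve the LGBM model: test accuracy stayed at 0.87, and on the `>50K` class precision stayed at 0.77 and f1-score at 0.71, with recall rising only from 0.66 to 0.67. The tuned model still predicts `<=50K` (test f1-score 0.92) much better than `>50K`, and its f1-score on `>50K` is 0.77 on the training set versus 0.71 on the test set, so there is a small train/test gap. The selected values are the smallest in each grid range, so extending the grid below those values might help. My hypothesis is that with more compute resources to accommodate a more extensive grid search, it would be possible to find hyperparameter values that improve the model. In terms of computation time, a single LGBM fit took about half a second, but the grid search took just over 4 minutes, which might not be suitable for a production system. Optimizing an LGBM model with grid search in production would require adequate compute resources.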
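
A confusion matrix gives the raw counts behind the precision and recall figures. The sketch below uses a handful of invented labels purely for illustration. In the notebook the inputs would be `y_test` and `best_pred`, and the counts would of course differ.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only; in the notebook these would be y_test and best_pred.
y_true = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", "<=50K"]
y_pred = ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]

labels = ["<=50K", ">50K"]
# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
tn, fp, fn, tp = cm.ravel()

recall_over_50k = tp / (tp + fn)     # share of true >50K rows that were found
precision_over_50k = tp / (tp + fp)  # share of >50K predictions that were correct
print(cm)
print(recall_over_50k, precision_over_50k)
```

Treating `>50K` as the positive class here is a choice made for the example, and it matches the class the models above find hardest to recall.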

# Conclusion

In this assignment you practiced:
1. data cleaning
2. instantiating, fitting, and evaluating boosting models using multiple metrics
3. timing how long it takes a model to fit and comparing run times between multiple models
4. and choosing a final model based on multiple metrics.

