<h1><center> Gradient Boosting </center></h1>

*https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/introducing-gradient-boosting?u=78163626*

*https://towardsdatascience.com/https-medium-com-talperetz24-mastering-the-new-generation-of-gradient-boosting-db04062a7ea2*

*https://machinelearningmastery.com/gradient-boosting-with-scikit-learn-xgboost-lightgbm-and-catboost/*

## *Written by Nathanael Hitch*

<hr>

**CatBoost** and **Light GBM (LGBM)** are libraries that implement gradient boosting over decision trees; like **Random Forest**, they build an ensemble of trees.

# Gradient Boosting

Gradient boosting is an *Ensemble Method*: it creates multiple models and then combines them to produce better results than a single model.

Unlike other ensemble methods (e.g. Random Forest), Gradient Boosting takes an iterative approach to combining weak learners to create a strong learner by focusing on mistakes in prior iterations.

The decision trees within Gradient Boosting are very basic (shallow), more basic than in other ensemble methods. The first tree is fitted and evaluated to see what it gets right and wrong. The next tree is then trained to correct those mistakes: in gradient boosting it is fitted to the errors (residuals) left by the trees before it, which in effect puts more emphasis on the examples the earlier trees got wrong. This is repeated over and over, focusing on the examples the model doesn't yet handle well, until the error has been reduced as far as possible or the chosen number of trees has been reached.
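
The loop described above can be sketched by hand for a regression problem. This is a minimal illustration under stated assumptions (squared-error loss, synthetic sine data, and made-up values for `learning_rate`, `n_rounds` and `max_depth`); it is not how scikit-learn implements gradient boosting internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                     # the new tree learns the remaining error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    errors.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {errors[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {errors[-1]:.4f}")
```

Each round only nudges the prediction by a fraction (`learning_rate`) of the new tree's output, which is why many shallow trees are needed.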

Gradient Boosting models can accept various types of input, can be used for classification or regression, and can report feature importances.
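
As a small sketch of the feature importance output, the snippet below fits scikit-learn's `GradientBoostingClassifier` on a synthetic dataset (the dataset and its sizes are assumptions for illustration) and reads the `feature_importances_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: 6 features, of which 3 carry signal.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# One score per input feature; the scores are normalised to sum to 1.
importances = model.feature_importances_
for i, score in enumerate(importances):
    print(f"feature {i}: {score:.3f}")
```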

## Differences from other Ensemble Methods, e.g. Random Forest

They are both models that use Ensemble Methods and Decision Trees.

However, Random Forest uses *bagging* while Gradient Boosting uses *boosting*: bagging trains each tree on a random sample of the data, while boosting trains each new tree with extra emphasis on what the previous trees got wrong.

Random Forest can train in parallel as each decision tree doesn't rely on the previous one, so training can be done reasonably quickly.<br>
Gradient Boosting trains iteratively as each tree relies on the trees before it. As the trees can't be trained in parallel, training a Gradient Boosting model is much slower. This is a big consideration.

Random Forest's final prediction is an unweighted vote, while a Gradient Boosting model's final prediction is a weighted vote.
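
The difference in how the final prediction is assembled can be checked by rebuilding it from the individual trees. The sketch below uses the regression versions of both models as a stand-in for voting (an assumption made to keep the arithmetic simple): the forest is a plain average of its trees, while the boosted model is an initial guess plus each tree's correction scaled by the learning rate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Random Forest: every tree gets the same say, so the forest is a plain average.
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
rf_manual = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)

# Gradient Boosting: start from an initial guess, then add each tree's
# correction scaled by the learning rate.
gb = GradientBoostingRegressor(n_estimators=20, learning_rate=0.1,
                               random_state=0).fit(X, y)
gb_manual = gb.init_.predict(X).astype(float)
for stage in gb.estimators_:           # one tree per boosting stage for regression
    gb_manual += gb.learning_rate * stage[0].predict(X)

rf_matches = np.allclose(rf_manual, rf.predict(X))
gb_matches = np.allclose(gb_manual, gb.predict(X))
print("Forest equals average of trees:", rf_matches)
print("Boosting equals scaled sum of trees:", gb_matches)
```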

Lastly, Random Forest is **easier to tune**, **quicker to train** and **harder to overfit**; all positives.<br>
Gradient Boosting is **harder to tune**, **slower to train** and **easier to overfit**; all negatives.

It is harder to tune as it has more parameters and it is more likely to overfit as it *obsesses* over the ones it got wrong.
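
One way to watch for overfitting is to score the model after every boosting round. The sketch below is an illustration on synthetic, deliberately noisy data with deliberately aggressive settings (all values here are assumptions); it uses `staged_predict`, which yields the predictions after each added tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Noisy synthetic data: flip_y randomly reassigns 20% of the labels.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                   learning_rate=0.5, random_state=0)
model.fit(X_train, y_train)

# Accuracy after 1, 2, ..., 200 trees on both splits.
train_acc = [accuracy_score(y_train, p) for p in model.staged_predict(X_train)]
test_acc = [accuracy_score(y_test, p) for p in model.staged_predict(X_test)]

best_round = int(np.argmax(test_acc)) + 1
print(f"Final train accuracy: {train_acc[-1]:.3f}")
print(f"Final test accuracy:  {test_acc[-1]:.3f}")
print(f"Best test accuracy {max(test_acc):.3f} at round {best_round}")
```

A widening gap between the two curves suggests the later trees are memorising noise in the training data rather than learning anything general.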

So why go with Gradient Boosting? Because, typically, these models are more powerful and perform better *when tuned properly*.

Gradient Boosting models usually need their input data to be **encoded** as numbers first.

<hr>

### Note

Weirdly, while the examples in ***NLP_Encoding.ipynb*** used the *Ordinal* or *One Hot* encoder on the primary data, the example below only needed the *LabelEncoder*; otherwise the metrics for *precision* and *recall* threw an error.

Try running it afterwards without using the *LabelEncoder* on the labels, and/or with the *Ordinal* and *One Hot* encoders on the body data.

When writing Gradient Boosting models, try the different encoders to see which one works.
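
One likely cause of the precision/recall error is that scikit-learn's binary metrics look for a positive class labelled `1` by default, which string labels such as `ham`/`spam` do not contain. The sketch below uses a handful of made-up labels (an assumption for illustration) to show what `LabelEncoder` produces and how the metric behaves with and without it.

```python
from sklearn.metrics import precision_score
from sklearn.preprocessing import LabelEncoder

# Made-up labels and predictions for illustration.
labels = ["ham", "spam", "ham", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham"]

encoder = LabelEncoder()
y_true = encoder.fit_transform(labels)     # classes are sorted, then numbered
y_pred = encoder.transform(predicted)
print(list(encoder.classes_), y_true.tolist())

# With 0/1 labels the default pos_label=1 refers to "spam".
encoded_precision = precision_score(y_true, y_pred)

# With raw string labels the default pos_label=1 is not a valid label.
try:
    precision_score(labels, predicted)
    raised = False
except ValueError:
    raised = True

# Naming the positive class explicitly also works on the raw strings.
string_precision = precision_score(labels, predicted, pos_label="spam")
print(encoded_precision, raised, string_precision)
```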

<hr>

# Gradient Boosting Example

The example below uses the Gradient Boosting classifier, the Label Encoder for the *label* column, and the ***text_cleaner()*** function found in *NLP_Functions.py*:

In [59]:
from sklearn.ensemble import GradientBoostingClassifier
from NLP_Functions import text_cleaner
from sklearn.preprocessing import LabelEncoder

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [60]:
df = pd.read_csv("Files/SMSSpamCollection.tsv", sep='\t')
df.columns = ['label','body_test']

encoder_lb = LabelEncoder()
y_trans = encoder_lb.fit_transform(df['label'].astype(str))

X_train, X_test, y_train, y_test = train_test_split(df['body_test'].astype(str), y_trans, test_size=0.2)

tfidf_vector = TfidfVectorizer(tokenizer = text_cleaner)

gbc = GradientBoostingClassifier(n_estimators=10)

pipe = Pipeline([("vectorizer", tfidf_vector),
                ("classifier", gbc)])

In [61]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function text_cleaner at 0x0000020624C15550>)),
                ('classifier', GradientBoostingClassifier(n_estimators=10))])

In [62]:
predicted = pipe.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Precision:",metrics.precision_score(y_test, predicted))
print("Recall:",metrics.recall_score(y_test, predicted))

Accuracy: 0.9021543985637342
Precision: 1.0
Recall: 0.1925925925925926
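
A precision of 1.0 next to a low recall means that everything the model flagged as spam really was spam, but it flagged only a small share of the actual spam. The sketch below reproduces that pattern on hypothetical toy labels (not the SMS data) by computing the three metrics from a confusion matrix.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 16 ham (0) and 4 spam (1).
y_true = [0] * 16 + [1] * 4
# The model flags only one message as spam, and that one is correct.
y_pred = [0] * 16 + [1, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the real spam messages, how many were caught

print(f"Accuracy: {accuracy:.2f}  Precision: {precision:.2f}  Recall: {recall:.2f}")
```

Because most messages are ham, accuracy stays high even though three of the four spam messages are missed, which is why accuracy alone can be misleading on imbalanced data.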


## Grid Search for Gradient Boosting

As stated earlier, the parameters need to be tuned for a Gradient Boosting model to perform well. If they are not, the model's weaknesses (easy to overfit etc.) can make it worse than most other models.

To see how the performance of Gradient Boosting varies with each parameter, we can set up a **Grid Search** over a range of values:

In [22]:
from sklearn.ensemble import GradientBoostingClassifier
from NLP_Functions import text_cleaner
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [16]:
df = pd.read_csv("Files/SMSSpamCollection.tsv", sep='\t')
df.columns = ['label','body_test']

encoder_lb = LabelEncoder()
y_trans = encoder_lb.fit_transform(df['label'].astype(str))

X_train, X_test, y_train, y_test = train_test_split(df['body_test'].astype(str), y_trans, test_size=0.2)

tfidf_vector = TfidfVectorizer(tokenizer = text_cleaner)

gbc = GradientBoostingClassifier()  # hyperparameters are left to the grid search

pipe = Pipeline([("vectorizer", tfidf_vector),
                ("classifier", gbc)])

print(pipe)

Pipeline(steps=[('vectorizer',
                 TfidfVectorizer(tokenizer=<function text_cleaner at 0x0000020624C15550>)),
                ('classifier', GradientBoostingClassifier())])


In [27]:
tuned_parameters = [{'classifier__n_estimators': [50, 100],
                    'classifier__max_depth': [3, 7, 11, 15],
                    'classifier__learning_rate': [0.01, 0.1, 1]}]
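
Given how slowly Gradient Boosting trains, it helps to know how much work a grid implies before starting the search. The sketch below counts the combinations in the grid above with scikit-learn's `ParameterGrid`; the fold count assumes `GridSearchCV`'s default of 5-fold cross-validation, since no `cv` argument is passed.

```python
from sklearn.model_selection import ParameterGrid

tuned_parameters = [{'classifier__n_estimators': [50, 100],
                     'classifier__max_depth': [3, 7, 11, 15],
                     'classifier__learning_rate': [0.01, 0.1, 1]}]

n_candidates = len(ParameterGrid(tuned_parameters))   # every combination of values
n_folds = 5                                           # GridSearchCV default when cv is not set
total_fits = n_candidates * n_folds

print(f"{n_candidates} parameter combinations x {n_folds} folds = {total_fits} model fits")
```

With the default `refit=True`, one more fit on the whole training set follows once the best combination is known.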

In [28]:
clf = GridSearchCV(
        pipe, param_grid=tuned_parameters, scoring = 'accuracy'
)

clf.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('vectorizer',
                                        TfidfVectorizer(tokenizer=<function text_cleaner at 0x0000020624C15550>)),
                                       ('classifier',
                                        GradientBoostingClassifier())]),
             param_grid=[{'classifier__learning_rate': [0.01, 0.1, 1],
                          'classifier__max_depth': [3, 7, 11, 15],
                          'classifier__n_estimators': [50, 100]}],
             scoring='accuracy')

In [29]:
print("Best parameters set found on development set:")
print()
print(clf.best_params_)
print()
print("Grid scores on development set:")
print()
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r"
            % (mean, std * 2, params))
print()

print("Detailed classification report:")
print()
print("The model is trained on the full development set.")
print("The scores are computed on the full evaluation set.")
print()
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
print("--" * 40)

Best parameters set found on development set:

{'classifier__learning_rate': 0.1, 'classifier__max_depth': 7, 'classifier__n_estimators': 100}

Grid scores on development set:

0.869 (+/-0.000) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 3, 'classifier__n_estimators': 50}
0.902 (+/-0.008) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 3, 'classifier__n_estimators': 100}
0.874 (+/-0.005) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 7, 'classifier__n_estimators': 50}
0.929 (+/-0.009) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 7, 'classifier__n_estimators': 100}
0.869 (+/-0.001) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 11, 'classifier__n_estimators': 50}
0.944 (+/-0.016) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 11, 'classifier__n_estimators': 100}
0.868 (+/-0.002) for {'classifier__learning_rate': 0.01, 'classifier__max_depth': 15, 'classifier__n_estimators': 50

The highest accuracy was 96.7% while the lowest accuracy was 86.8%: a difference of about 10 percentage points.

That is a massive difference in accuracy for the same model, caused by the parameter values alone.
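
Rather than reading the best and worst scores off the printed list, the results can be loaded into a pandas DataFrame and sorted. The sketch below runs a deliberately small search on synthetic data (the dataset and the grid are assumptions chosen to run quickly, not the SMS example) and measures the spread between the best and worst combination.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [1, 3], "learning_rate": [0.01, 0.1, 1]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

# One row per parameter combination, best first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]])

spread = results["mean_test_score"].max() - results["mean_test_score"].min()
print(f"Best minus worst mean accuracy: {spread:.3f}")
```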

# Light GBM

**Light Gradient Boosting Machine** (LGBM) is a library developed at Microsoft that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of LightGBM is its changes to the training algorithm, which make training dramatically faster and, in many cases, result in a more effective model.

The LightGBM library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via:

- LGBMClassifier
- LGBMRegressor

In [42]:
# Imports for both the classifier and the regressor

from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot

## Classifier

The example below evaluates an LGBMClassifier on the test problem using repeated k-fold cross-validation and reports the mean accuracy:

In [50]:
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score

In [51]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# evaluate the model
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.934 (0.021)


This part of the example *fits* a single model on all available data and a single prediction is made:

In [52]:
# fit the model on the whole dataset
model = LGBMClassifier()
model.fit(X, y)

# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

Prediction: 1


## Regressor

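The example below first evaluates an LGBMRegressor on the test problem using repeated k-fold cross-validation and reports the mean absolute error (MAE). scikit-learn's `neg_mean_absolute_error` scorer returns the MAE negated so that larger is always better, which is why the printed value is negative: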

In [53]:
from sklearn.datasets import make_regression
from lightgbm import LGBMRegressor
from sklearn.model_selection import RepeatedKFold

In [54]:
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# evaluate the model
model = LGBMRegressor()
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: -12.739 (1.408)
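
To report the error as a positive number, the sign of the scores can be flipped before summarising them. The sketch below uses scikit-learn's own `GradientBoostingRegressor` as a stand-in so that it runs without LightGBM installed (an assumption for illustration); the same sign flip applies to the scores from any regressor.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, n_informative=5, random_state=1)

model = GradientBoostingRegressor(n_estimators=50, random_state=1)
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)

# The scorer returns negated errors, so every value is zero or below.
neg_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv)

# Flip the sign to get the mean absolute error itself.
mae_scores = -neg_scores
print('MAE: %.3f (%.3f)' % (mean(mae_scores), std(mae_scores)))
```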


This part of the example *fits* a single model on all available data and a single prediction is made:

In [57]:
# fit the model on the whole dataset
model = LGBMRegressor()
model.fit(X, y)

# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

Prediction: -82.040


# CatBoost

**CatBoost** is a third-party library developed at Yandex that provides an efficient implementation of the gradient boosting algorithm.

The primary benefit of CatBoost (in addition to computational speed improvements) is its support for categorical input variables. This gives the library its name: CatBoost, for *Category Gradient Boosting*.

Like *LGBM*, the CatBoost library provides wrapper classes so that the efficient algorithm implementation can be used with the scikit-learn library, specifically via:

- CatBoostClassifier
- CatBoostRegressor

In [67]:
# Imports for both the classifier and the regressor

from numpy import mean
from numpy import std
from sklearn.model_selection import cross_val_score
from matplotlib import pyplot

## Classifier

The example below evaluates a CatBoost classifier (`CatBoostClassifier`) on the test problem using repeated k-fold cross-validation and reports the mean accuracy:

In [68]:
from sklearn.datasets import make_classification
from catboost import CatBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

In [69]:
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# evaluate the model
model = CatBoostClassifier(verbose=0, n_estimators=100)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Accuracy: 0.926 (0.026)


This part of the example *fits* a single model on all available data and a single prediction is made:

In [71]:
# fit the model on the whole dataset
model = CatBoostClassifier(verbose=0, n_estimators=100)
model.fit(X, y)

# make a single prediction
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951, -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

Prediction: 1


## Regressor

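The example below first evaluates a CatBoost regressor (`CatBoostRegressor`) on the test problem using repeated k-fold cross-validation and reports the mean absolute error (MAE). As the scorer is `neg_mean_absolute_error`, the printed value is the MAE with its sign flipped: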

In [72]:
from sklearn.datasets import make_regression
from catboost import CatBoostRegressor
from sklearn.model_selection import RepeatedKFold

In [73]:
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# evaluate the model
model = CatBoostRegressor(verbose=0, n_estimators=100)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1, error_score='raise')
print('MAE: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

MAE: -9.623 (0.930)


In [74]:
# fit the model on the whole dataset
model = CatBoostRegressor(verbose=0, n_estimators=100)
model.fit(X, y)

# make a single prediction
row = [[2.02220122, 0.31563495, 0.82797464, -0.30620401, 0.16003707, -1.44411381, 0.87616892, -0.50446586, 0.23009474, 0.76201118]]
yhat = model.predict(row)
print('Prediction: %.3f' % yhat[0])

Prediction: -87.936
