# Gradient Boosting
---

1.   **[Introduction to Gradient Boosting](#1.-Introduction-to-Gradient-Boosting)**
2.   **[Foundations of Gradient Boosting Machines](#2.-Foundations-of-Gradient-Boosting-Machines)**
3.   **[Gradient Model Tuning](#3.-Gradient-Model-Tuning)**
4.   **[Model Validation](#4.-Model-Validation)**
5.   **[Exploratory Data Analysis](#5.-Exploratory-Data-Analysis)**
6.   **[Model Construction](#6.-Model-Construction)**
7.   **[Model Evaluation](#7.-Model-Evaluation)**

---
<a name="1.-Introduction-to-Gradient-Boosting"></a>
### 1. Introduction to Gradient Boosting

#### 1.1 Definitions

**Boosting |** Technique that builds an ensemble of weak learners sequentially, with each consecutive learner trying to correct the errors of the one that preceded it

Comparison to bagging:

Similarities:
- Ensemble technique
- Aggregates weak learners

Differences: 
- Learners are built sequentially, not in parallel
- Not limited to tree-based learners

**Adaptive Boosting |** A boosting methodology where each consecutive base learner assigns greater weight to the observations incorrectly predicted by the preceding learner

**Gradient Boosting |** A boosting methodology where each base learner in the sequence is built to predict the residual errors of the model that precedes it

**Black-box model |** Any model whose predictions cannot be precisely explained

**Extrapolation |** A model's ability to predict new values that fall outside of the range of values in the training data

---
<a name="2.-Foundations-of-Gradient-Boosting-Machines"></a>
### 2. Foundations of Gradient Boosting Machines

#### 2.1 Concept
GBMs are model ensembles that use gradient boosting, one of the most powerful supervised learning techniques. GBMs do not have to be tree-based, but tree ensembles are the most common implementation. Two key features set tree-based gradient boosting apart from other modeling techniques:

1. It builds an ensemble of decision-tree base learners one after another. Each new tree is trained to predict the error (also known as the “residual”) of the trees before it, so adding it to the ensemble compensates for that error.
2. Its base learners are “weak learners”: trees that are generally very shallow. The shallowest case, a tree with a single split, is known as a “decision stump.”
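
The two features above can be sketched in plain Python. This is only a minimal illustration, assuming squared-error loss, a single numeric feature, and depth-1 trees. The helper names (`fit_stump`, `fit_gbm`, `predict_gbm`, `training_mse`) are invented for this example and are not part of XGBoost, which does considerably more (regularization, for instance).

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree; return (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_estimators, learning_rate):
    base = sum(ys) / len(ys)  # start from the mean of the target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # Each new stump is trained on the residuals of the ensemble so far.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict_gbm(model, x):
    boost = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * boost


def training_mse(model, xs, ys):
    return sum((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data: a curved target that a single stump cannot capture.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]

mse_1 = training_mse(fit_gbm(xs, ys, n_estimators=1, learning_rate=0.3), xs, ys)
mse_50 = training_mse(fit_gbm(xs, ys, n_estimators=50, learning_rate=0.3), xs, ys)
print(f"training MSE with 1 stump:   {mse_1:.2f}")
print(f"training MSE with 50 stumps: {mse_50:.2f}")
```

Each stump on its own is a poor model, but because every round targets what is still unexplained, the training error shrinks as stumps are added.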

##### 2.1.1 Advantages
- High accuracy
- Generally scalable
- Work well with missing data
- Don't require feature scaling

##### 2.1.2 Disadvantages
- Tuning many hyperparameters can be time-consuming
- Difficult to interpret
- Have difficulty with extrapolation
- Prone to overfitting if hyperparameters are not tuned carefully (e.g., too many trees or trees that are too deep)
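
The difficulty with extrapolation follows from how trees predict: every leaf outputs a constant, so any input beyond the training range lands in an outermost leaf. The sketch below shows this with one hand-rolled stump (`fit_stump` is an illustrative helper, not a library function). A boosted ensemble is a sum of such constants, so it flattens out in the same way.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree and return it as a prediction function."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


# Training data follows y = 2x, but only for x between 0 and 9.
xs = list(range(10))
ys = [2 * x for x in xs]
stump = fit_stump(xs, ys)

inside = stump(9)      # largest x seen during training
outside = stump(1000)  # far outside the training range (true value: 2000)
print(inside, outside)
```

The prediction at `x = 1000` is identical to the prediction at `x = 9`, and it can never exceed the largest target value seen in training.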


---
<a name="3.-Gradient-Model-Tuning"></a>
### 3. Gradient Model Tuning



#### 3.1 Models

For classification tasks:

**from xgboost import XGBClassifier**

For regression tasks:

**from xgboost import XGBRegressor**

#### 3.2 Most Important Hyperparameters for Gradient Model Tuning:

| Hyperparameter | What it does | Input type | Default Value | Considerations |
| --- | --- | --- | --- | --- |
| `n_estimators` | Specifies the number of trees the model will build in its ensemble | int | 100 | A typical range is 50–500. Consider how much data you have, how deep the trees are allowed to grow, and how many samples are drawn to grow each tree (you generally need more trees if they’re shallow or if each tree sees a smaller sample). Also consider that gradient boosting grows trees sequentially, so training can take much longer than for random forest models. |
| `max_depth` | Specifies the maximum number of levels each tree can have. If left as None, XGBoost's own default depth is used. | int | 6 (3 in older versions of the scikit-learn wrapper) | Controls the complexity of the model. Gradient boosting typically uses weak learners, i.e., shallow trees (a “decision stump” is the single-split extreme). Restricting tree depth can reduce training time and serving latency as well as prevent overfitting. Consider values 2–6. |
| `min_child_weight` | Controls threshold below which a node becomes a leaf, based on the combined weight of the samples it contains. For regression models, this value is functionally equivalent to a number of samples. For the binary classification objective, the weight of a sample in a node is dependent on its probability of response as calculated by that tree. The weight of the sample decreases the more certain the model is (i.e., the closer the probability of response is to 0 or 1). | int or float | 1 | Higher values will stop trees splitting further, and lower values will allow trees to continue to split further. If your model is underfitting, then you may want to lower it to allow for more complexity. Conversely, increase this value to stop your trees from getting too finely divided. |
| `learning_rate` | Controls how much weight each consecutive base learner contributes to the ensemble’s final prediction. Also known as eta or shrinkage. | float | 0.3 (0.1 in older versions of the scikit-learn wrapper) | Values lie in (0, 1]. Typical values range from 0.01 to 0.3. Lower values mean less weight is given to each consecutive base learner. Consider how many trees are in your ensemble: lower values typically need more trees. |
| `colsample_bytree` | Specifies the fraction, in (0, 1.0], of features that each tree randomly samples during training | float | 1.0 | Adds randomness to the model to make it robust to noise. Consider how many features the dataset has and how many trees will be grown. Sampling fewer features means more base learners might be needed. Small `colsample_bytree` values on datasets with many features can produce more weakly predictive trees in the ensemble. |
| `subsample` | Specifies the fraction, in (0, 1.0], of observations sampled from the dataset to train each base model | float | 1.0 | Adds randomness to the model to make it robust to noise. Consider the size of the dataset. With large datasets it can be beneficial to limit the number of samples used for each tree, because doing so can greatly reduce training time and still result in a robust model. |
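
The interplay between `learning_rate` and `n_estimators` can be illustrated with a back-of-the-envelope calculation. It assumes, unrealistically, that every tree predicts the current residual perfectly. Under that assumption a learning rate of η leaves a fraction (1 − η) of the residual after each round, so (1 − η)^n remains after n rounds. The helper name `rounds_needed` is invented for this sketch.

```python
import math


def rounds_needed(learning_rate, remaining_fraction=0.01):
    """Smallest n with (1 - learning_rate) ** n <= remaining_fraction.

    Idealized: assumes each tree fits the current residual exactly,
    and requires 0 < learning_rate < 1.
    """
    return math.ceil(math.log(remaining_fraction) / math.log(1.0 - learning_rate))


for lr in (0.3, 0.1, 0.01):
    print(f"learning_rate={lr}: about {rounds_needed(lr)} rounds "
          f"to shrink the residual to 1%")
```

Real trees are far from perfect, so the actual counts will differ, but the trend is the point: a smaller learning rate needs many more trees to reach the same fit.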

---
<a name="4.-Model-Validation"></a>
### 4. Model Validation

Model validation is the set of processes and activities intended to verify that models perform as expected. It covers the whole process of evaluating candidate models, selecting one, and then continuing to analyze the selected model's performance to better understand its strengths and limitations.

#### 4.1 Validation Sets 
The simplest way to maintain the objectivity of the test data is to create another partition in the data—a validation set—and save the test data for after you select the final model. The validation set is then used, instead of the test set, to compare different models.

When building a model using a separate validation set, once the final model is selected, best practice is to go back and fit the selected model to all the non-test data (i.e., the training data + validation data) before scoring this final model on the test data.


#### 4.2 Cross Validation 
Cross-validation is a process that trains and tests a model on different portions of the data over several iterations.

Cross-validation makes more efficient use of the training data by splitting the training data into k number of “folds” (partitions), training a model on k – 1 folds, and using the fold that was held out to get a validation score. The training process occurs k times, each time using a different fold as the validation set. At the end, the final validation score is the average of all k scores. This process is also commonly referred to as k-fold cross validation.

After a model is selected using cross-validation, that selected model is then refit to the entire training set (i.e., it’s retrained on all k folds combined).

Cross-validation reduces the likelihood of significant divergence of the distributions in the validation data from those in the full dataset. For this reason, it’s often the preferred technique when working with smaller datasets, which are more susceptible to randomness. The more folds you use, the more thorough the validation. However, adding folds increases the time needed to train, and may not be useful beyond a certain point.
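
The k-fold procedure described above can be sketched without any libraries. The function names (`k_fold_indices`, `cross_validate`) and the toy mean-predicting "model" are assumptions made for illustration. The folds here are contiguous and unshuffled, whereas in practice the data is usually shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def cross_validate(fit, score, X, y, k):
    """Train on k - 1 folds, score on the held-out fold, average the k scores."""
    scores = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / k


# Toy "model": always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def mean_abs_error(model, X_val, y_val):
    return sum(abs(value - model) for value in y_val) / len(y_val)


X = list(range(10))
y = [2.0 * x for x in X]

folds = list(k_fold_indices(len(X), k=4))
print("validation fold sizes:", [len(val) for _, val in folds])
avg_score = cross_validate(fit_mean, mean_abs_error, X, y, k=4)
print("average validation error:", avg_score)
```

Every observation is used for validation exactly once and for training k – 1 times, which is why the averaged score is less sensitive to one unlucky split.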

#### 4.3 Model Selection
Once candidate models have been trained and validated, a champion model is selected. Validation scores factor heavily into this selection, but score is seldom the only criterion. Other factors are often considered as well:
- How explainable is the model? 
- How complex is it? 
- How resilient is it against fluctuations in input values? 
- How well does it perform on data not found in the training data? 
- How much computational cost does it have to make predictions? 
- Does it add much latency to any production system? 

It’s not uncommon for a model with a slightly lower validation score to be selected over the highest-scoring model due to it being simpler, less computationally expensive, or more stable.
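
One way to make this trade-off explicit is to choose the simplest candidate whose score is within a small tolerance of the best one. The sketch below is just one possible rule. The candidate names, scores, complexity values, and the `pick_champion` helper are all hypothetical and exist purely to show the mechanics.

```python
def pick_champion(candidates, tolerance):
    """Return the least complex candidate scoring within `tolerance` of the best."""
    best_score = max(c["val_score"] for c in candidates)
    close_enough = [c for c in candidates
                    if best_score - c["val_score"] <= tolerance]
    return min(close_enough, key=lambda c: c["complexity"])


# Hypothetical candidates; "complexity" could be a tree count, latency, etc.
candidates = [
    {"name": "model_a", "val_score": 0.947, "complexity": 300},
    {"name": "model_b", "val_score": 0.945, "complexity": 50},
    {"name": "model_c", "val_score": 0.920, "complexity": 10},
]

champion = pick_champion(candidates, tolerance=0.005)
print(champion["name"])
```

With these made-up numbers the rule prefers `model_b`: it gives up 0.002 in score for a much smaller model, while `model_c` is simpler still but falls outside the tolerance.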


---
<a name="5.-Exploratory-Data-Analysis"></a>
### 5. Exploratory Data Analysis

#### 5.1 Imports

In [None]:
# Import relevant libraries and modules.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

from xgboost import XGBClassifier
from xgboost import plot_importance

# Load the dataset into a DataFrame and save in a variable
data = pd.read_csv("example_file.csv")

#### 5.2 Data Exploration
After loading the dataset, the next step is to prepare the data for modeling. This includes: 

*   Exploring data
*   Checking for missing values
*   Encoding categorical data 
*   Dropping irrelevant columns
*   Renaming columns
*   Creating training and testing data

In [None]:
# Display the first 10 rows of the data
data.head(10)

In [None]:
# Display number of rows, number of columns
data.shape

In [None]:
# Display the data type for each column. NB: the XGBoost model built below expects numeric data
data.dtypes

**Question to answer:** Identify the target (or predicted) variable. What is the initial hypothesis about which variables will be valuable in predicting the target variable?

#### 5.3 Prepare model for predictions

**Question to answer:** Before proceeding with modeling, consider which metrics should ultimately be leveraged to evaluate the model. Which metrics are most suited to evaluating this type of model?
- It is important to evaluate not just accuracy but also the balance of false positives and false negatives that the model predicts. Precision, recall, and F1 score are therefore the most suitable metrics for classification models.
- The ROC AUC score is also well suited to classification modeling.
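
To see why accuracy alone can mislead, the metrics can be computed directly from the four outcome counts. The function name `classification_metrics` and the counts below are made up for illustration. The formulas are the standard definitions, with a score of 0.0 returned when a denominator would be zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from outcome counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative imbalanced data: 5 positives and 95 negatives.
# A model that always predicts "negative" never finds a positive.
always_negative = classification_metrics(tp=0, fp=0, fn=5, tn=95)
print(always_negative)

# A model that finds some positives at the cost of a few false alarms.
some_positives = classification_metrics(tp=3, fp=2, fn=2, tn=93)
print(some_positives)
```

The always-negative model scores 95% accuracy while its recall and F1 are zero, which is exactly the situation these metrics are meant to expose.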

##### 5.3.1 Convert Variables to Numeric

In [None]:
# Convert the object predictor variables to numerical dummies.
data_dummies = pd.get_dummies(data, 
                                columns=['categorical_column1','categorical_column2','categorical_column3','categorical_column4'])

##### 5.3.2 Isolate Target and Predictor Variables

In [None]:
# Separate the dataset into labels (y) and features (X).
y = data_dummies["target_variable"]

# `drop` returns a new DataFrame, so `data_dummies` itself is left unchanged.
X = data_dummies.drop("target_variable", axis=1)

##### 5.3.3 Create Training and Test Data

In [None]:
# Separate into train, validate, test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify= y, random_state = 0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.25, stratify= y_train, random_state = 0)
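
The `stratify` argument used above keeps the class proportions roughly equal across the splits. A rough standard-library sketch of the idea follows. `stratified_split` is a hypothetical helper and the labels are invented. scikit-learn's own implementation handles rounding and edge cases differently.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split row indices so each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test count
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Invented labels: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=0)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Without stratification, a random split of a small or imbalanced dataset can leave one partition with noticeably fewer positives than the other.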

---
<a name="6.-Model-Construction"></a>
### 6. Model Construction


#### 6.1 Instantiate XGBClassifier

In [None]:
# Define xgb to be your XGBClassifier.
xgb = XGBClassifier(random_state = 0)

#### 6.2 Define Hyperparameters for Tuning

In [None]:
# Define parameters for tuning as `cv_params`
cv_params = {'max_depth': [4, 6],
              'min_child_weight': [3, 5],
              'learning_rate': [0.1, 0.2, 0.3],
              'n_estimators': [5,10,15],
              'subsample': [0.7],
              'colsample_bytree': [0.7]
              }

#### 6.3 Define Model Evaluation

In [None]:
# Define criteria as `scoring`. A list is used because recent scikit-learn versions reject a set here.
scoring = ['accuracy', 'precision', 'recall', 'f1']

#### 6.4 Construct GridSearchCV

In [None]:
# Construct the GridSearch.
xgb_cv = GridSearchCV(xgb, cv_params, scoring = scoring, cv= 4, refit = 'f1', n_jobs= -1, verbose= 1)

In [None]:
%%time
# fit the GridSearch model to training data
xgb_cv = xgb_cv.fit(X_train, y_train)
xgb_cv
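
Before running a grid search it helps to estimate how many models will be trained, since every parameter combination is fit once per fold. The sketch below uses only the standard library and copies the grid and fold count from the cells above.

```python
from itertools import product

# Same grid as `cv_params` above, repeated so this snippet runs on its own.
cv_params = {'max_depth': [4, 6],
             'min_child_weight': [3, 5],
             'learning_rate': [0.1, 0.2, 0.3],
             'n_estimators': [5, 10, 15],
             'subsample': [0.7],
             'colsample_bytree': [0.7]}
n_folds = 4

# Every combination of one value per hyperparameter is a candidate model.
combinations = list(product(*cv_params.values()))
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} candidates x {n_folds} folds = {n_fits} fits")
```

Because `refit` is set, the search also trains the best candidate once more on all of the data passed to `fit`. Adding one more value to any list multiplies the total, so grids grow quickly.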

**Question to answer:** Which optimal set of parameters did the GridSearch yield?

#### 6.5 Save Model for Reference

In [None]:
# Use `pickle` to save the trained model; the `with` block closes the file afterwards.
with open('xgb_cv.sav', 'wb') as to_write:
    pickle.dump(xgb_cv, to_write)
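
Reading a saved object back works the same way in reverse. The sketch below round-trips a plain dictionary as a stand-in for a fitted model, and the file name and contents are placeholders. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: any picklable Python object behaves the same way.
fitted_object = {"best_params": {"max_depth": 4, "learning_rate": 0.1}}

# Write to a temporary directory so the example leaves no files behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pickle")

with open(path, "wb") as to_write:
    pickle.dump(fitted_object, to_write)

with open(path, "rb") as to_read:
    restored = pickle.load(to_read)

print(restored == fitted_object)
```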

---
<a name="7.-Model-Evaluation"></a>
### 7. Model Evaluation

#### 7.1 Formulate Predictions on Test Set

To evaluate the model's predictions, use metrics and evaluation techniques from scikit-learn to compare the actual observed values in the test set with the model's predictions.

In [None]:
# Apply model to predict on test data
y_pred = xgb_cv.predict(X_test)

In [None]:
# 1. Print your accuracy score.
ac_score = metrics.accuracy_score(y_test, y_pred)
print('accuracy score:', ac_score)

# 2. Print your precision score.
pc_score = metrics.precision_score(y_test, y_pred)
print('precision score:', pc_score)

# 3. Print your recall score.
rc_score = metrics.recall_score(y_test, y_pred)
print('recall score:', rc_score)

# 4. Print your f1 score.
f1_score = metrics.f1_score(y_test, y_pred)
print('f1 score:', f1_score)

**Question:** When observing the precision and recall scores of your model, how do you interpret these values, and is one more accurate than the other?

**Question:** What does your model's F1 score tell you, beyond what the other metrics provide?

#### 7.2 Confusion Matrix for Clarity
A confusion matrix is a table that shows a model's true and false positives and true and false negatives. Plotting it gives a visual representation of the components feeding into the metrics above.

In [None]:
# Construct and display confusion matrix.
cm = metrics.confusion_matrix(y_test, y_pred)

# Create the display for confusion matrix
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=xgb_cv.classes_)

# Plot the visual in-line
disp.plot()
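
For a binary problem, the four cells of the matrix can be tallied by hand, which makes the link to the earlier metrics concrete. This sketch uses invented labels and a hypothetical helper, `confusion_counts`. It follows the layout scikit-learn uses: rows are actual classes and columns are predicted classes.

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] for labels coded as 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Invented labels, purely for illustration.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true_demo, y_pred_demo)
print([[tn, fp], [fn, tp]])

# Recall and precision read straight off the matrix.
recall_demo = tp / (tp + fn)
precision_demo = tp / (tp + fp)
print(recall_demo, precision_demo)
```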

**Question to answer:** When observing the confusion matrix, what do you notice? Does this correlate to any of your other calculations?

#### 7.3 Visualize Most Important Features

In [None]:
# Plot the relative feature importance of the predictor variables in your model
plot_importance(xgb_cv.best_estimator_)

**Question to answer:** Examine the feature importance outputted above. What is your assessment of the result? Did anything surprise you?

#### 7.4 Compare Models

In [None]:
# Create a table of results to compare model performance.
# The decision tree and random forest scores are hard-coded from previously built models.
# `DataFrame.append` was removed in pandas 2.0, so the table is built from a list of rows.
results = [
    {'Model': "Tuned Decision Tree",
     'F1': 0.945422,
     'Recall': 0.935863,
     'Precision': 0.955197,
     'Accuracy': 0.940864},
    {'Model': "Tuned Random Forest",
     'F1': 0.947306,
     'Recall': 0.944501,
     'Precision': 0.950128,
     'Accuracy': 0.942450},
    {'Model': "Tuned XGBoost",
     'F1': f1_score,
     'Recall': rc_score,
     'Precision': pc_score,
     'Accuracy': ac_score},
]
table = pd.DataFrame(results)

table

**Question to answer:** How does this model compare to the other models that have been built to solve the same problem?

**Sharing findings:**

- Showcase the data used to create the prediction and the performance of the model overall.
- Review the sample output of the features and the confusion matrix to reference the model's performance.
- Highlight the metric values, emphasizing the F1 score.
- Visualize the feature importance to showcase what drove the model's predictions.