## Boosting Assignment - 2
**By Shahequa Modabbera**

### Q1. What is Gradient Boosting Regression?

Ans) Gradient Boosting Regression, also known as Gradient Boosted Regression Trees (GBRT), is a machine learning algorithm that combines the concepts of boosting and regression. It is a powerful technique for solving regression problems and is widely used in various domains.

In Gradient Boosting Regression, the algorithm iteratively builds an ensemble of weak regression models (usually decision trees) in a stage-wise fashion. It trains each new model to correct the mistakes made by the previous models in the ensemble. The algorithm learns by minimizing a loss function, which measures the difference between the predicted and actual values of the target variable.

The key idea behind Gradient Boosting Regression is that each weak learner is trained to address the weaknesses of the previous models, focusing on the remaining errors. This iterative process allows the algorithm to learn complex relationships and make accurate predictions.

Gradient Boosting Regression offers several benefits, including its ability to handle a variety of data types, handle missing values, and provide robust predictions. However, it can be sensitive to noise and overfitting if not properly regularized.

The scikit-learn library in Python provides an implementation of Gradient Boosting Regression called GradientBoostingRegressor, which can be readily used for regression tasks.

### Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [1]:
import numpy as np

class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.estimators = []
    
    def fit(self, X, y):
        # Initialize the predictions with the mean of y
        y_pred = np.mean(y) * np.ones_like(y)
        
        for _ in range(self.n_estimators):
            # Calculate the negative gradient (residuals)
            residuals = y - y_pred
            
            # Fit a decision tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            
            # Update the predictions by adding the tree's predictions
            y_pred += self.learning_rate * tree.predict(X)
            
            # Add the tree to the ensemble
            self.estimators.append(tree)
    
    def predict(self, X):
        # Initialize the predictions with zeros
        y_pred = np.zeros(len(X))
        
        for tree in self.estimators:
            # Add the tree's predictions to the ensemble predictions
            y_pred += self.learning_rate * tree.predict(X)
        
        return y_pred

# Usage example
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Generate a simple regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the gradient boosting regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gb_regressor.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 31.735482349161565
R-squared: 0.9772379183627112


### Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters

In [2]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.base import BaseEstimator

class GradientBoostingRegressor(BaseEstimator):
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.estimators = []
    
    def fit(self, X, y):
        y_pred = np.mean(y) * np.ones_like(y)
        
        for _ in range(self.n_estimators):
            residuals = y - y_pred
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            y_pred += self.learning_rate * tree.predict(X)
            self.estimators.append(tree)
    
    def predict(self, X):
        y_pred = np.zeros(len(X))
        
        for tree in self.estimators:
            y_pred += self.learning_rate * tree.predict(X)
        
        return y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        return r2_score(y, y_pred)
    
    def get_params(self, deep=True):
        return {
            'n_estimators': self.n_estimators,
            'learning_rate': self.learning_rate,
            'max_depth': self.max_depth
        }
    
    def set_params(self, **params):
        for param, value in params.items():
            setattr(self, param, value)
        return self

# Generate a simple regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.1, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [2, 3, 4]
}

# Create an instance of the gradient boosting regressor
gb_regressor = GradientBoostingRegressor()

# Perform grid search
grid_search = GridSearchCV(gb_regressor, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Create and fit the gradient boosting regressor with the best hyperparameters
gb_regressor = GradientBoostingRegressor(**best_params)
gb_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gb_regressor.predict(X_test)

# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = gb_regressor.score(X_test, y_test)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 150}
Mean Squared Error: 32.456435181317
R-squared: 0.9767208193301015


### Q4. What is a weak learner in Gradient Boosting?

Ans) In Gradient Boosting, a weak learner refers to a model that is relatively simple and performs slightly better than random guessing. It is typically a decision tree with a small number of nodes or a shallow depth. Unlike traditional decision trees that aim to create the best splits at each node, weak learners focus on capturing patterns or trends in the data that can be exploited to make better predictions.

The key characteristic of a weak learner is that it should be better than random guessing on a particular task, but its individual performance may be limited. However, when combined with other weak learners through the boosting process, they can collectively form a strong learner that achieves high predictive accuracy.

In Gradient Boosting, the weak learner models are trained sequentially, with each subsequent model trying to correct the mistakes made by the previous models. The ensemble of these weak learners, weighted according to their performance, forms the final boosted model.

By iteratively adding weak learners and adjusting their weights, Gradient Boosting can effectively capture complex patterns and dependencies in the data, leading to improved predictive performance compared to using a single strong learner.

Examples of weak learners used in Gradient Boosting include decision stumps (decision trees with only one split), shallow decision trees with limited depth, or linear models with restricted complexity.

The strength of Gradient Boosting lies in its ability to combine multiple weak learners to create a powerful predictive model that can generalize well to unseen data.

### Q5. What is the intuition behind the Gradient Boosting algorithm?

Ans) The intuition behind the Gradient Boosting algorithm is to iteratively build an ensemble of weak learners that work together to improve the overall predictive performance. The algorithm focuses on minimizing the errors made by the previous weak learners and gradually refining the model.

Here's a step-by-step intuition behind the Gradient Boosting algorithm:

1. Start with an initial model: The algorithm begins with an initial weak learner, which can be a simple model like a decision tree with shallow depth.

2. Fit the model to the data: The initial model is trained on the training data to make predictions. These predictions are compared to the actual target values, and the errors (residuals) are calculated.

3. Train the next model to correct the errors: The next weak learner is trained to predict the residuals (errors) made by the previous model. The goal is to find a model that can effectively correct the mistakes of the previous model.

4. Update the ensemble: The predictions of the new model are combined with the predictions of the previous models. Each prediction is weighted according to the performance of its corresponding weak learner.

5. Repeat steps 2-4: The process is repeated iteratively, with each new model attempting to minimize the remaining errors from the previous models. The ensemble is updated at each iteration.

6. Final prediction: The final prediction is obtained by summing the predictions from all the weak learners in the ensemble. This combined prediction aims to provide a more accurate and refined estimate of the target variable.

The intuition behind Gradient Boosting is that by sequentially adding models that focus on correcting the mistakes of the previous models, the algorithm can gradually reduce the overall prediction error. It creates a strong ensemble model that can capture complex patterns and dependencies in the data, leading to improved predictive performance.

Overall, the key idea of Gradient Boosting is to iteratively build a powerful model by leveraging the strengths of multiple weak learners, allowing them to compensate for each other's weaknesses and produce a more accurate and robust prediction.

### Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

Ans) The Gradient Boosting algorithm builds an ensemble of weak learners in an iterative manner. Each weak learner is trained to correct the mistakes made by the previous learners, gradually improving the overall performance of the ensemble. Here's how the algorithm builds the ensemble:

1. Start with an initial prediction: The algorithm starts with an initial prediction, which can be a simple estimate like the mean of the target variable. This serves as the base prediction for the first weak learner.

2. Calculate the residuals: The residuals are the differences between the actual target values and the current prediction of the ensemble. For the first iteration, the residuals are simply the differences between the actual values and the initial prediction.

3. Train a weak learner on the residuals: The next step is to train a weak learner, such as a decision tree with shallow depth, on the residuals. The weak learner's objective is to capture the patterns and relationships in the data that are not accounted for by the current ensemble.

4. Update the ensemble: The weak learner's predictions are then combined with the predictions of the previous weak learners in the ensemble. The predictions are weighted based on the performance of each weak learner. The weights are determined using an optimization algorithm that minimizes a loss function, typically the mean squared error.

5. Adjust the residuals: The residuals are updated by subtracting the predictions made by the latest weak learner. This adjustment brings the ensemble closer to the true target values.

6. Repeat steps 3-5: The process is repeated for a predetermined number of iterations or until a specified stopping criterion is met. At each iteration, a new weak learner is trained on the updated residuals, the ensemble is updated, and the residuals are adjusted.

7. Final prediction: The final prediction of the Gradient Boosting ensemble is obtained by summing the predictions of all the weak learners, typically with some learning rate or shrinkage factor to control the contribution of each weak learner.

By iteratively adding weak learners that focus on the remaining errors or residuals, the Gradient Boosting algorithm gradually improves the ensemble's predictive performance. Each weak learner contributes to the ensemble by capturing different patterns and making corrections to the previous predictions. The ensemble combines the strengths of these weak learners to produce a more accurate and robust prediction.

Overall, the Gradient Boosting algorithm builds an ensemble of weak learners by sequentially training and combining them to correct the errors made by the previous learners. This iterative process leads to the creation of a powerful model that can capture complex patterns and dependencies in the data.

### Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?

Ans) To construct the mathematical intuition of the Gradient Boosting algorithm, we can break it down into the following steps:

1. Initialize the model: Start with an initial prediction or estimate, which serves as the base model. This could be a simple value like the mean of the target variable.

2. Calculate the residuals: Compute the residuals by taking the differences between the actual target values and the current prediction of the model.

3. Train a weak learner: Fit a weak learner, such as a decision tree with shallow depth, on the residuals. The weak learner's objective is to capture the patterns and relationships in the data that are not accounted for by the current model.

4. Update the model: Update the current model by combining its predictions with the predictions made by the weak learner. The combination is weighted based on the performance of the weak learner. This weight is determined by an optimization algorithm that minimizes a loss function, typically the mean squared error.

5. Adjust the residuals: Update the residuals by subtracting the predictions made by the weak learner. This adjustment brings the model closer to the true target values.

6. Repeat steps 3-5: Repeat the process of training a weak learner, updating the model, and adjusting the residuals for a predetermined number of iterations or until a stopping criterion is met. At each iteration, a new weak learner is trained on the updated residuals, and the model and residuals are updated accordingly.

7. Final prediction: The final prediction of the Gradient Boosting algorithm is obtained by summing the predictions of all the weak learners, typically with some learning rate or shrinkage factor to control the contribution of each weak learner.

Mathematically, the intuition behind the Gradient Boosting algorithm lies in the optimization of a loss function by iteratively adding weak learners that focus on the remaining errors or residuals. The algorithm aims to find the optimal combination of weak learners that minimizes the overall loss and improves the predictive performance of the model.

By following these steps, the Gradient Boosting algorithm creates an ensemble of weak learners that collectively work to capture complex patterns and dependencies in the data. Each weak learner contributes its own expertise to the ensemble, allowing the model to make more accurate predictions over time.