Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is a machine learning technique used for regression tasks, where the goal is to predict a continuous value. It is part of the broader class of ensemble methods, which combine the predictions of multiple models to achieve better performance than individual models. In Gradient Boosting Regression, the idea is to build a series of weak learners, typically decision trees, in a sequential manner. Each new tree focuses on correcting the errors (residuals) made by the previous trees.

Working:

1. Initial Model: The process begins with an initial model, often just the mean of the target values.

2. Residual Calculation: For each training sample, the residual (error) is calculated, which is the difference between the actual target value and the prediction from the current model.

3. Fit Weak Learner: A new weak learner (usually a shallow decision tree) is trained to predict these residuals (the errors of the current model).

4. Update the Model: The predictions from the new weak learner are added to the current model's predictions, with a learning rate applied to control the contribution of each new tree.

5. Repeat: This process is repeated iteratively. In each iteration, a new tree is added, and the model gets progressively better at predicting the target values by focusing on the errors of previous iterations.

6. Stop: The process stops after a predefined number of iterations or when adding new trees does not improve the model's performance.
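The six steps above can be sketched as a short loop. This is a minimal illustration, not a reference implementation: it assumes scikit-learn's `DecisionTreeRegressor` as the weak learner, and the toy data and the names `X_demo`, `y_demo`, `current`, and `mse_history` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (hypothetical, for illustration only): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 50).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.standard_normal(50)

learning_rate = 0.1
n_trees = 50  # Step 6: stop after a fixed number of iterations

# Step 1: the initial model is the mean of the targets
current = np.full(y_demo.shape, y_demo.mean(), dtype=float)
trees = []
mse_history = [np.mean((y_demo - current) ** 2)]

for _ in range(n_trees):  # Step 5: repeat
    residuals = y_demo - current                      # Step 2: residuals
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X_demo, residuals)                       # Step 3: fit weak learner
    current += learning_rate * tree.predict(X_demo)   # Step 4: damped update
    trees.append(tree)
    mse_history.append(np.mean((y_demo - current) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

To predict on new inputs, the same sum would be replayed: start from the stored mean and add `learning_rate * tree.predict(X_new)` for every tree in `trees`.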

Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a
simple regression problem as an example and train the model on a small dataset. Evaluate the model's
performance using metrics such as mean squared error and R-squared.

In [2]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Simple dataset (X: input features, y: target values)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([1.2, 1.9, 3.0, 4.1, 5.1])

# Gradient Boosting Regressor from scratch
class GradientBoostingRegressorScratch:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=2):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.models = []

    def _fit_base_model(self, residuals):
        # Placeholder base learner: a single constant (the mean residual).
        # It ignores X, so it is not a real decision stump and cannot model
        # any relationship between the features and the target.
        return np.mean(residuals)

    def fit(self, X, y):
        # Initialize with the mean prediction
        self.initial_prediction = np.mean(y)
        # dtype=float avoids silent truncation if y is an integer array
        predictions = np.full_like(y, self.initial_prediction, dtype=float)
        self.models = []

        for _ in range(self.n_estimators):
            # Compute residuals (errors from the current model)
            residuals = y - predictions
            # Fit a base model (mean of residuals for simplicity)
            model = self._fit_base_model(residuals)
            self.models.append(model)
            # Update the predictions by adding the new model's contribution
            predictions += self.learning_rate * model

    def predict(self, X):
        # Start with the initial prediction
        predictions = np.full(X.shape[0], self.initial_prediction)
        # Add the contribution from each weak learner
        for model in self.models:
            predictions += self.learning_rate * model
        return predictions

# Initialize and train the model
gbr_scratch = GradientBoostingRegressorScratch(n_estimators=10, learning_rate=0.1)
gbr_scratch.fit(X, y)

# Predict on training data
y_pred = gbr_scratch.predict(X)

# Evaluate model performance
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

mse, r2, y_pred


(2.0103999999999993, 0.0, array([3.06, 3.06, 3.06, 3.06, 3.06]))

Mean Squared Error (MSE): 2.01

R-squared (R²): 0.0

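Every input receives the same prediction, 3.06, which is simply the mean of y, so the model underfits. The cause is the base learner: it is a constant (the mean of the residuals) and never looks at X. Because the initial prediction is already the mean of y, the residuals average to zero, so every boosting step adds essentially nothing and R² stays at 0. A base learner that splits on the features, such as a decision tree, is needed for a better fit.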
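To see what changes once the base learner actually uses X, here is a NumPy-only sketch that swaps the constant for a depth-1 stump (one threshold, two leaf values) on the same five points. The helper names `fit_stump`, `predict_stump`, and `boost_with_stumps` are hypothetical and not part of the code above; the threshold search is a plain brute-force scan, chosen for readability rather than speed.

```python
import numpy as np

# Same five-point dataset as above
X = np.array([1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)
y = np.array([1.2, 1.9, 3.0, 4.1, 5.1])

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error.

    Assumes x has at least two distinct values.
    """
    best, best_sse = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for threshold in (values[:-1] + values[1:]) / 2:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse = sse
            best = (threshold, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost_with_stumps(X, y, n_estimators=100, learning_rate=0.1):
    x = X[:, 0]  # single-feature data only
    init = y.mean()
    pred = np.full(y.shape, init, dtype=float)
    stumps = []
    for _ in range(n_estimators):
        stump = fit_stump(x, y - pred)  # fit the current residuals
        stumps.append(stump)
        pred += learning_rate * predict_stump(stump, x)
    return init, stumps, pred

init, stumps, train_pred = boost_with_stumps(X, y)
mse_stumps = np.mean((y - train_pred) ** 2)
r2_stumps = 1 - np.sum((y - train_pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("Training MSE:", mse_stumps)
print("Training R^2:", r2_stumps)
```

Both metrics are still computed on the training data, so they show only that the ensemble can now fit the points, not how well it generalizes.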

Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to
optimise the performance of the model. Use grid search or random search to find the best
hyperparameters.


To optimize the performance of the Gradient Boosting model, we can experiment with different hyperparameters such as:

1. Learning Rate: Controls the contribution of each tree.
2. Number of Trees (n_estimators): The number of weak learners.
3. Tree Depth (max_depth): The complexity of the individual trees.

In [3]:
# Modified model to include tree depth simulation and grid search over hyperparameters

class GradientBoostingRegressorScratchV2:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.models = []

    def _fit_base_model(self, residuals, depth):
        # Still a constant base learner (the mean residual), not a tree.
        # max_depth is only mimicked by scaling that constant by 1 / (depth + 1);
        # no splits on X are made, so depth does not add any model capacity.
        return np.mean(residuals) / (depth + 1)

    def fit(self, X, y):
        # Initialize with the mean prediction
        self.initial_prediction = np.mean(y)
        # dtype=float avoids silent truncation if y is an integer array
        predictions = np.full_like(y, self.initial_prediction, dtype=float)
        self.models = []

        for _ in range(self.n_estimators):
            residuals = y - predictions
            model = self._fit_base_model(residuals, self.max_depth)
            self.models.append(model)
            predictions += self.learning_rate * model

    def predict(self, X):
        predictions = np.full(X.shape[0], self.initial_prediction)
        for model in self.models:
            predictions += self.learning_rate * model
        return predictions

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [10, 50, 100],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [1, 2, 3]
}

best_params = None
best_mse = float('inf')
best_r2 = float('-inf')

# Perform grid search over the parameters
for n_estimators in param_grid['n_estimators']:
    for learning_rate in param_grid['learning_rate']:
        for max_depth in param_grid['max_depth']:
            model = GradientBoostingRegressorScratchV2(
                n_estimators=n_estimators, 
                learning_rate=learning_rate, 
                max_depth=max_depth
            )
            model.fit(X, y)
            y_pred = model.predict(X)
            
            # Evaluate performance (on the training data; no held-out set is used)
            mse = mean_squared_error(y, y_pred)
            r2 = r2_score(y, y_pred)
            
            # Track the best model based on MSE
            if mse < best_mse:
                best_mse = mse
                best_r2 = r2
                best_params = {
                    'n_estimators': n_estimators,
                    'learning_rate': learning_rate,
                    'max_depth': max_depth
                }

best_params, best_mse, best_r2


({'n_estimators': 10, 'learning_rate': 0.01, 'max_depth': 1},
 2.0103999999999993,
 0.0)
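Every configuration in this grid ends up with essentially the same MSE, because the base learner is a constant whose value is (numerically) zero, so the search simply reports the first grid entry. For a search where the hyperparameters matter, the following sketch uses scikit-learn's `GradientBoostingRegressor` with `GridSearchCV`. The synthetic dataset, the train/test split, and the choice of 3-fold cross-validation are assumptions made for illustration, not part of the exercise above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration; not the five-point dataset)
rng = np.random.default_rng(42)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + 0.5 * X_syn[:, 0] + rng.normal(0, 0.2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Same grid as in the cell above
param_grid = {
    "n_estimators": [10, 50, 100],
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [1, 2, 3],
}

# Cross-validated search; scikit-learn maximizes the score, hence negative MSE
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on data the search never saw
test_pred = search.best_estimator_.predict(X_test)
test_mse = mean_squared_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)
print("Best parameters:", search.best_params_)
print("Test MSE:", test_mse)
print("Test R^2:", test_r2)
```

For larger grids, `RandomizedSearchCV` from the same module samples a fixed number of configurations instead of trying all of them.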

Q4. What is a weak learner in Gradient Boosting?

A weak learner in Gradient Boosting is a machine learning model that performs slightly better than random guessing but is not highly accurate by itself. The idea is that these weak learners can be combined, or "boosted," to create a strong learner that performs well overall.

In the context of Gradient Boosting, the weak learners are typically shallow decision trees, that is, trees with limited depth (often just one or a few levels). A tree with a single split (depth 1) is called a decision stump. These trees capture simple patterns in the data, but on their own they have limited predictive power.

## Key Characteristics of Weak Learners:

1. Low Complexity: Weak learners are intentionally kept simple to avoid overfitting. In Gradient Boosting, this is usually done by limiting the depth of the trees (e.g., max_depth = 1 or 2).

2. Slightly Better than Random: A weak learner should do marginally better than a trivial baseline: above 50% accuracy in balanced binary classification, or lower error than a constant prediction in regression.

3. Focus on Residuals: In each boosting iteration, the weak learner focuses on the residuals (errors) of the previous learners, correcting them step by step.
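As a rough illustration of "weak alone, strong together", the sketch below compares a single decision stump with an ensemble of boosted stumps on held-out data. It assumes scikit-learn; the noisy sine dataset and the variable names are invented for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a noisy sine curve that a single split cannot capture
rng = np.random.default_rng(1)
X_w = rng.uniform(0, 10, size=(300, 1))
y_w = np.sin(X_w[:, 0]) + rng.normal(0, 0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.3, random_state=1)

# One weak learner on its own: a depth-1 tree (decision stump)
stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, y_tr)

# Many stumps combined by gradient boosting
boosted = GradientBoostingRegressor(
    max_depth=1, n_estimators=200, learning_rate=0.1, random_state=1
).fit(X_tr, y_tr)

# For regressors, .score returns R^2
stump_r2 = stump.score(X_te, y_te)
boosted_r2 = boosted.score(X_te, y_te)
print(f"Single stump R^2:   {stump_r2:.3f}")
print(f"Boosted stumps R^2: {boosted_r2:.3f}")
```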

Q5. What is the intuition behind the Gradient Boosting algorithm?

The intuition behind the Gradient Boosting algorithm is based on the idea of sequentially improving a model by focusing on the mistakes (residual errors) made by previous models. Here's how it works in simple terms:

1. Start with a weak model:
The process begins by fitting a very simple model, often referred to as a weak learner. In regression problems, this could be as simple as predicting the mean of the target values. This model usually has low accuracy and will leave a lot of residual errors (the difference between the actual target values and the predicted values).

2. Focus on the errors:
The key idea in Gradient Boosting is to model what the current ensemble still gets wrong. The algorithm fits a new weak learner (typically a decision tree) to the residual errors of all training points (the difference between the actual values and the model's current predictions), so the points with the largest errors influence the fit the most.

3. Correct the mistakes:
The new weak learner tries to correct the errors made by the previous one. Instead of building a strong model all at once, Gradient Boosting gradually improves the overall performance by adding a sequence of weak learners, each focused on correcting the residuals from the previous ones.

4. Iterate and combine:
This process is repeated for a fixed number of iterations (or until the model stops improving). In each iteration, a new weak learner is trained on the residuals, and its contribution to the final model is scaled by a learning rate, so only a fraction of each new learner's prediction is added at each step. Each new model is added to the ensemble, slowly improving the overall prediction accuracy.

5. Optimize via gradients:
The term "Gradient Boosting" comes from the fact that this process can be viewed as minimizing a loss function (e.g., mean squared error for regression) by gradient descent in the space of predictions. In each step, the weak learner is fit to the negative gradient of the loss with respect to the current model's predictions. For squared error, this negative gradient is the residual (up to a constant factor), which is why fitting the residuals moves the model in the direction that reduces the error.
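The link between residuals and gradients can be checked numerically. The sketch below assumes the half squared-error loss L(F) = 0.5 * sum((y - F)^2), for which the negative gradient with respect to each prediction should equal the residual; the gradient is estimated by central finite differences rather than derived by hand. The variable names are illustrative.

```python
import numpy as np

y_true = np.array([1.2, 1.9, 3.0, 4.1, 5.1])
F = np.full(y_true.shape, y_true.mean())  # current predictions: the mean model

def loss(pred):
    # Half squared error, summed over samples
    return 0.5 * np.sum((y_true - pred) ** 2)

# Estimate dL/dF_i for each sample with central finite differences
eps = 1e-6
grad = np.zeros_like(F)
for i in range(F.size):
    step = np.zeros_like(F)
    step[i] = eps
    grad[i] = (loss(F + step) - loss(F - step)) / (2 * eps)

residuals = y_true - F
matches = np.allclose(-grad, residuals, atol=1e-6)
print("Negative gradient:", -grad)
print("Residuals:        ", residuals)
print("Match:", matches)
```

With a different loss (for example absolute error), the negative gradient is no longer the plain residual, and the weak learner would be fit to that quantity instead.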