# Q1. Ans

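Gradient boosting regression starts from an initial prediction and computes the difference between the known target value and the current prediction. This difference is called the residual. It then trains a weak model that maps the features to that residual and adds the weak model's output, scaled by a learning rate, to the current prediction. Repeating this step shrinks the remaining error a little each time.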
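A single residual-fitting step can be worked through by hand. The sketch below is only an illustration: the toy data, the split at `3.5`, and the learning rate of `0.5` are assumptions chosen to keep the arithmetic easy to follow.

```python
# Toy data following y = 2x.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

# Current prediction: a constant, the mean of the targets.
base = sum(y) / len(y)
predictions = [base] * len(y)

# Residual = known target minus current prediction.
residuals = [yi - pi for yi, pi in zip(y, predictions)]

# A hand-picked weak model (hypothetical): one split at x < 3.5,
# predicting the mean residual on each side of the split.
threshold = 3.5
left = [r for xi, r in zip(X, residuals) if xi < threshold]
right = [r for xi, r in zip(X, residuals) if xi >= threshold]
left_value = sum(left) / len(left)
right_value = sum(right) / len(right)


def weak_model(x):
    return left_value if x < threshold else right_value


# Move the predictions part of the way toward the targets.
learning_rate = 0.5
updated = [p + learning_rate * weak_model(xi) for xi, p in zip(X, predictions)]


def mse(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


mse_before = mse(y, predictions)
mse_after = mse(y, updated)
print("MSE before:", mse_before)
print("MSE after one step:", mse_after)
```

Even one coarse weak model lowers the error; later steps repeat the same procedure on the residuals that remain.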

# Q2. Ans

In [4]:
import numpy as np

class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.models = []
        self.weights = []
        
    def fit(self, X, y):
        # Reset state so that calling fit twice does not stack old models
        self.models = []
        self.weights = []

        # Start from a constant prediction (the mean of the target variable)
        # and remember it, because predict() must start from the same value
        self.initial_prediction = np.mean(y)
        predictions = np.full(len(X), self.initial_prediction, dtype=float)

        for _ in range(self.n_estimators):
            # Compute the residuals as the differences between the true values and the predicted values
            residuals = y - predictions

            # Fit a regression tree to the residuals
            model = DecisionTreeRegressor()
            model.fit(X, residuals)

            # Update the predictions by adding the predictions of the new model multiplied by the learning rate
            predictions += self.learning_rate * model.predict(X)

            # Save the model and its weight (learning rate)
            self.models.append(model)
            self.weights.append(self.learning_rate)

    def predict(self, X):
        # Start from the initial constant prediction; starting from zero would
        # shift every prediction by -mean(y)
        predictions = np.full(len(X), self.initial_prediction, dtype=float)

        # Add the predictions of all models multiplied by their weights
        for model, weight in zip(self.models, self.weights):
            predictions += weight * model.predict(X)

        return predictions

class DecisionTreeRegressor:
    def fit(self, X, y):
        self.feature_index, self.threshold, self.value = self._find_best_split(X, y)
        if self.feature_index is None:
            self.value = np.mean(y)
            return
        
        left_indices = X[:, self.feature_index] < self.threshold
        right_indices = ~left_indices
        
        self.left = DecisionTreeRegressor()
        self.left.fit(X[left_indices], y[left_indices])
        
        self.right = DecisionTreeRegressor()
        self.right.fit(X[right_indices], y[right_indices])
    
    def predict(self, X):
        if self.feature_index is None:
            # Leaf node: return one value per row so the result is always an array
            return np.full(len(X), self.value)
        
        predictions = np.zeros(len(X))
        left_indices = X[:, self.feature_index] < self.threshold
        right_indices = ~left_indices
        
        predictions[left_indices] = self.left.predict(X[left_indices])
        predictions[right_indices] = self.right.predict(X[right_indices])
        
        return predictions
    
    def _find_best_split(self, X, y):
        best_loss = np.inf
        best_feature_index = None
        best_threshold = None

        for feature_index in range(X.shape[1]):
            feature_values = X[:, feature_index]
            # np.unique returns sorted values. Skip the smallest one: splitting there
            # would leave the left side empty and make np.mean warn and return nan
            unique_values = np.unique(feature_values)[1:]

            for threshold in unique_values:
                left_indices = feature_values < threshold
                right_indices = ~left_indices

                # Sum of squared errors on each side, so that larger partitions
                # count in proportion to their size
                left_loss = np.sum((y[left_indices] - np.mean(y[left_indices])) ** 2)
                right_loss = np.sum((y[right_indices] - np.mean(y[right_indices])) ** 2)
                total_loss = left_loss + right_loss

                if total_loss < best_loss:
                    best_loss = total_loss
                    best_feature_index = feature_index
                    best_threshold = threshold

        # The third element is a placeholder; fit() sets the leaf value itself
        return best_feature_index, best_threshold, None

# Example usage
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
model.fit(X, y)

# Make predictions
X_test = np.array([[6], [7], [8]])
y_pred = model.predict(X_test)

# Evaluation
mse = np.mean((y - model.predict(X)) ** 2)
ssr = np.sum((y - model.predict(X)) ** 2)
sst = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ssr / sst

print("Mean Squared Error:", mse)
print("R-squared:", r_squared)


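Recorded output from a run in which `predict` summed the tree outputs starting from zero instead of from the mean of `y` (6 here). Every prediction is therefore 6 too low, which accounts for the error of about 36 and the negative R-squared:

Mean Squared Error: 36.00000000564406
R-squared: -3.5000000007055077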


# Q3. Ans

In [5]:
from sklearn.ensemble import GradientBoostingRegressor as SklearnGBR
from sklearn.model_selection import GridSearchCV

# GridSearchCV clones its estimator through get_params/set_params. The hand-written
# GradientBoostingRegressor from Q2 implements neither (and has no max_depth), so
# passing it here fails with:
#   TypeError: Cannot clone object ... it does not seem to be a scikit-learn
#   estimator as it does not implement a 'get_params' method.
# scikit-learn's own GradientBoostingRegressor is used instead; it accepts all
# three hyperparameters in the grid below.

# Define the hyperparameter grid
param_grid = {
    'learning_rate': [0.1, 0.05, 0.01],
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 4, 5]
}

# Create the grid search object
grid_search = GridSearchCV(
    estimator=SklearnGBR(random_state=0),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # Use negative mean squared error as the evaluation metric
    cv=5  # 5-fold cross-validation; with the 5-sample X, y from Q2 each fold holds out one sample
)

# Fit the grid search to the data
grid_search.fit(X, y)

# Print the best hyperparameters and the corresponding MSE
print("Best Hyperparameters:", grid_search.best_params_)
print("Best MSE:", -grid_search.best_score_)

# Q4. Ans

Decision trees are used as the weak learner in gradient boosting. Specifically, regression trees are used: each leaf outputs a real value, so the outputs of successive trees can be added together, which lets each new tree “correct” the residuals left in the predictions of the earlier ones.

# Q5. Ans

In gradient boosting, each new learner adjusts the current predictions in the direction of the negative gradient of the loss, which is the direction in which the loss decreases fastest. Because a lower loss means the predictions are closer to the targets, each such step improves the model's performance.
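A small numerical check makes this concrete. The sketch below assumes a squared-error loss of the form 0.5 * (y - f)^2 for a single sample; the values of `y_true`, `f_current`, and `step` are arbitrary illustrations.

```python
def loss(y, f):
    # Squared-error loss for one sample (with the conventional factor 0.5).
    return 0.5 * (y - f) ** 2


def numerical_gradient(y, f, eps=1e-6):
    # Central finite difference of the loss with respect to the prediction f.
    return (loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y_true, f_current = 10.0, 6.0

grad = numerical_gradient(y_true, f_current)
residual = y_true - f_current

# For this loss the negative gradient matches the residual.
print("negative gradient:", -grad)
print("residual:", residual)

# A small step along the negative gradient lowers the loss.
step = 0.1
f_next = f_current + step * (-grad)
print("loss before:", loss(y_true, f_current))
print("loss after:", loss(y_true, f_next))
```

This is why fitting a weak learner to the residuals, as in Q1, is a gradient step for squared error.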

# Q6. Ans

Sequential Ensemble Learning
Gradient boosting is a sequential ensemble technique: the weak learners are trained one after another, and each one is added to the ensemble during the training phase. Unlike AdaBoost, which boosts performance by assigning higher weights to incorrectly classified samples, gradient boosting trains each new learner on the residual errors of the current ensemble:

1. Initialize the ensemble: The algorithm starts by initializing the ensemble with a simple model, called the base learner. This is typically a constant such as the mean of the targets, or a decision tree with a small depth.

2. Fit the base learner: The base learner is fitted to the training data, and its predictions are computed.

3. Compute the residual errors: The difference between the actual target values and the predictions of the base learner is calculated. These differences are referred to as residual errors.

4. Fit the next weak learner: A new weak learner is fitted to the residual errors. The objective of this weak learner is to learn the patterns in the residual errors that were not captured by the previous base learner.

5. Update the ensemble: The predictions of the new weak learner are added to the predictions of the previous base learner. This update is done by multiplying the predictions of the weak learner by a small learning rate, which controls the contribution of each weak learner to the ensemble.

6. Repeat steps 3 to 5: The process is repeated by computing the residual errors based on the updated ensemble's predictions and fitting a new weak learner to the residual errors. This process continues for a specified number of iterations or until a stopping criterion is met.

7. Final ensemble prediction: The final prediction of the gradient boosting ensemble is obtained by adding the base learner's prediction and the predictions of all the weak learners, each scaled by the learning rate.
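The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation from Q2: the helper names `fit_stump` and `boost`, the one-split stump used as the weak learner, and the toy data are all assumptions made for the example.

```python
def fit_stump(X, r):
    """Fit a one-split regression stump to 1-D inputs X and residuals r."""
    best = None
    # Skip the smallest value so the left side of a split is never empty.
    for t in sorted(set(X))[1:]:
        left = [ri for xi, ri in zip(X, r) if xi < t]
        right = [ri for xi, ri in zip(X, r) if xi >= t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((ri - lv) ** 2 for ri in left) + sum((ri - rv) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x < t else rv


def boost(X, y, n_rounds=50, learning_rate=0.1):
    # Steps 1-2: the base learner is a constant, the mean of the targets.
    init = sum(y) / len(y)
    preds = [init] * len(y)
    learners = []
    for _ in range(n_rounds):
        # Step 3: residual errors of the current ensemble.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        # Step 4: fit the next weak learner to the residuals.
        h = fit_stump(X, residuals)
        # Step 5: add its predictions, scaled by the learning rate.
        preds = [pi + learning_rate * h(xi) for xi, pi in zip(X, preds)]
        learners.append(h)

    # Step 7: base prediction plus the scaled sum of all weak learners.
    def predict(x):
        return init + learning_rate * sum(h(x) for h in learners)

    return predict


def mse(targets, preds):
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


X = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

predict = boost(X, y, n_rounds=200, learning_rate=0.1)
mse_before = mse(y, [sum(y) / len(y)] * len(y))  # constant base learner only
mse_after = mse(y, [predict(xi) for xi in X])
print("MSE of the base learner:", mse_before)
print("MSE after boosting:", mse_after)
```

The stump needs at least two distinct input values; a fuller implementation would also handle several features and deeper trees.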

# Q7. Ans

The mathematical intuition behind the Gradient Boosting algorithm is gradient descent carried out on the model's predictions rather than on a set of parameters. The main steps are:

1. Define the loss function: The first step is to define a loss function that measures the error between the predicted values and the actual target values. Common loss functions for regression problems include mean squared error (MSE) and mean absolute error (MAE).

2. Initialize the model: The algorithm starts by initializing the model with a simple estimator, typically a decision tree with a shallow depth or a constant value. This initial model serves as the starting point for the ensemble.

3. Compute the negative gradient: The negative gradient of the loss function with respect to the predictions of the current model is computed. The negative gradient indicates the direction of steepest descent in the loss landscape, so it shows how each prediction should change to reduce the loss most quickly. For squared-error loss, the negative gradient is, up to a constant factor, the residual between the target and the current prediction.

4. Fit a weak learner: A weak learner, such as a decision tree, is fitted to the negative gradient. The weak learner is trained to approximate the negative gradient, aiming to correct the errors made by the current model.

5. Update the model: The predictions of the weak learner are multiplied by a small learning rate, referred to as the shrinkage parameter. This scaling factor controls the contribution of each weak learner to the ensemble. The predictions of the weak learner are added to the predictions of the current model, updating the model's predictions.

6. Repeat steps 3 to 5: Steps 3 to 5 are repeated iteratively, with each iteration focusing on fitting a new weak learner to the negative gradient of the loss function. The algorithm continues to add weak learners to the ensemble, gradually reducing the loss and improving the overall prediction accuracy.

7. Final ensemble prediction: The final prediction of the Gradient Boosting ensemble is obtained by adding the initial model's prediction and the predictions of all the weak learners, each scaled by the learning rate.
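The steps above can be written compactly. The notation is introduced here for illustration and is not taken from the answer itself: $L$ is the loss, $F_m$ the ensemble after $m$ iterations, $h_m$ the weak learner fitted at iteration $m$, $r_{im}$ the negative gradient for sample $i$, and $\nu$ the learning rate.

$$
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad i = 1, \dots, n \\
h_m &= \text{weak learner fitted to } \{(x_i, r_{im})\}_{i=1}^{n} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x)
\end{aligned}
$$

Assuming the squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, these reduce to $F_0(x) = \bar{y}$ and $r_{im} = y_i - F_{m-1}(x_i)$, so the negative gradient is the ordinary residual.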

The mathematical intuition of Gradient Boosting lies in the iterative optimization process, where each weak learner is trained to correct the mistakes made by the previous learners. By minimizing the loss function in the direction of steepest descent, the algorithm converges to an ensemble of weak learners that collectively provide a strong predictive model.