## Q1. What is Gradient Boosting Regression?
 
Gradient Boosting Regression is an ensemble learning method for regression tasks, that is, for predicting a continuous target variable. It combines the predictions of many weak regression models (typically shallow decision trees) into a single strong predictive model. Because each new model is added to correct what the current ensemble still gets wrong, the method can capture complex, nonlinear relationships in the data and often produces accurate predictions.

Here's an overview of how Gradient Boosting Regression works:

1. **Initialization:** Gradient Boosting Regression starts with an initial prediction, which is often set as the mean of the target variable for the entire dataset. This initial prediction serves as the starting point for the ensemble.

2. **Sequential Model Building:** Gradient Boosting Regression builds a sequence of decision trees (weak learners) in a sequential manner. Each decision tree is trained to correct the errors made by the ensemble of previously built trees.

3. **Error Calculation:** After each tree is built, the errors (residuals) between the predicted values and the actual target values are calculated for each data point. These errors represent the differences between the current ensemble prediction and the true target values.

4. **Building Weak Learner:** The next decision tree is trained to predict these errors instead of the original target values. The goal is to create a weak learner that can reduce the errors made by the current ensemble.

5. **Weighted Combination:** The predictions from the newly trained tree are then added to the current ensemble, scaled by a weight that controls how much the tree contributes. In most implementations this weight is a fixed learning rate (shrinkage); it can also be chosen by a line search that minimizes the loss function (e.g., Mean Squared Error) along the direction given by the new tree.

6. **Iterative Process:** Steps 3 to 5 are repeated for a specified number of iterations (controlled by a hyperparameter called the number of estimators or trees) or until a certain stopping criterion is met. In each iteration, a new weak learner is added, and the ensemble prediction is refined.

7. **Final Prediction:** The final prediction for a new data point is obtained by adding the weighted predictions of all the weak learners in the ensemble to the initial prediction. Because each learner reduces the errors left by its predecessors, the sum forms a strong predictive model.

Gradient Boosting Regression has several hyperparameters that can be tuned to optimize its performance, including the learning rate (controls the contribution of each weak learner), the maximum depth of the decision trees, and the number of estimators (the number of trees in the ensemble).

Popular implementations of Gradient Boosting Regression include XGBoost, LightGBM, and scikit-learn's GradientBoostingRegressor, among others. These libraries provide efficient and optimized versions of the algorithm, making it easier to use and experiment with Gradient Boosting for regression tasks.
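
As a rough illustration of the three hyperparameters mentioned above, the following sketch fits scikit-learn's `GradientBoostingRegressor` on a small synthetic dataset. The data, the train/test split, and the specific hyperparameter values are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small uniform noise
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.random(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=100,   # number of trees in the ensemble
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    max_depth=3,        # maximum depth of each tree
    random_state=0,
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
print("test MSE:", test_mse)
print("test R^2:", test_r2)
```

Evaluating on a held-out split, as done here, gives a more honest picture of performance than scoring on the training data.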

## Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [6]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(0)
x = np.random.rand(100, 1)                     # input variable, shape (100, 1)
y = 2 * x + 1 + 0.1 * np.random.rand(100, 1)   # output variable with some uniform noise

# np.ones_like returns an array shaped like y, filled with 1s
initial_prediction = np.mean(y) * np.ones_like(y)

residuals = y - initial_prediction   # initial residuals

def GradientBoostingRegressor(estimators, learningrate, depth, residuals=None):
    trees = []
    n_estimators = estimators
    learning_rate = learningrate
    # Work on a copy so the caller's residual array is not modified in place
    residuals = (y - initial_prediction) if residuals is None else residuals.copy()

    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(x, residuals.ravel())   # fit the tree to the current residuals

        tree_prediction = learning_rate * tree.predict(x)
        tree_prediction = tree_prediction.reshape(-1, 1)

        residuals -= tree_prediction     # what is still left to explain
        trees.append(tree)

    # Sum the trees' contributions per sample (axis=0).
    # Without axis, np.sum collapses everything into a single scalar.
    boosted = np.sum([learning_rate * tree.predict(x) for tree in trees], axis=0)
    ensemble_prediction = initial_prediction + boosted.reshape(-1, 1)

    print('mean squared error: ', mean_squared_error(y, ensemble_prediction))
    print('r2 score: ', r2_score(y, ensemble_prediction))

GradientBoostingRegressor(estimators=100, learningrate=0.1, depth=3, residuals=residuals)
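
The function above scores the model on the same data it was trained on. As a minimal sketch of how the same idea could be split into separate fit and predict steps and evaluated on held-out data, consider the following. The helper names `fit_boosting` and `predict_boosting`, the dataset, and the 150/50 split are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score


def fit_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Fit trees to successive residuals; return the baseline and the trees."""
    baseline = float(np.mean(y))   # initial prediction: mean of the targets
    residuals = y - baseline
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Shrink the tree's output, then remove it from what is left to explain
        residuals = residuals - learning_rate * tree.predict(X)
        trees.append(tree)
    return baseline, trees


def predict_boosting(X, baseline, trees, learning_rate=0.1):
    """Baseline plus the shrunken contribution of every tree, per sample."""
    prediction = np.full(X.shape[0], baseline)
    for tree in trees:
        prediction += learning_rate * tree.predict(X)
    return prediction


# Hypothetical data: y = 2x + 1 plus small uniform noise
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.random(200)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

baseline, trees = fit_boosting(X_train, y_train)
test_pred = predict_boosting(X_test, baseline, trees)

test_mse = mean_squared_error(y_test, test_pred)
test_r2 = r2_score(y_test, test_pred)
print("test MSE:", test_mse)
print("test R^2:", test_r2)
```

The learning rate used at prediction time must match the one used during fitting, which is why both helpers take it as a parameter with the same default.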


## Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters

In [9]:
from itertools import product

def gridSearchCustom(param_grid):
    # Try every combination of the hyperparameter values in the grid
    param_combinations = list(product(*param_grid.values()))
    for comb in param_combinations:
        residual = y - initial_prediction   # fresh residuals for each run
        print(comb)                         # (n_estimators, learning_rate, depth)
        GradientBoostingRegressor(estimators=comb[0], learningrate=comb[1], depth=comb[2], residuals=residual)
        print()

param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.1, 0.01],
    'depth': [1, 2, 3]
}

gridSearchCustom(param_grid)

Each combination is printed as (n_estimators, learning_rate, depth), followed by its mean squared error and R² score on the training data.
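
For comparison, the same search can be sketched with scikit-learn's `GridSearchCV`, which scores each combination by cross-validation instead of on the training data. The dataset, the 5-fold setting, and the use of scikit-learn's own `GradientBoostingRegressor` (rather than the hand-written function above) are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical data: y = 2x + 1 plus small uniform noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.random(100)

param_grid = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.1, 0.01],
    "max_depth": [1, 2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_mse = -search.best_score_      # flip the sign back to a plain MSE
print("best parameters:", best_params)
print("best cross-validated MSE:", best_cv_mse)
```

`RandomizedSearchCV` from the same module takes a similar form and samples a fixed number of combinations, which can be cheaper when the grid is large.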

## Q4. What is a weak learner in Gradient Boosting?

In the context of Gradient Boosting, a weak learner is a simple, low-complexity model that performs only slightly better than a trivial baseline: random guessing for classification, or always predicting the mean for regression. Weak learners are the building blocks of Gradient Boosting ensembles, and they are combined to create a strong predictive model.

Here are some key characteristics of weak learners in Gradient Boosting:

1. **Simplicity:** Weak learners are typically simple models with low complexity. For classification tasks, they might be shallow decision trees with a small number of levels; a tree with just one split is called a "decision stump". For regression tasks, they can also be shallow decision trees or even linear regression models.

2. **Limited Predictive Power:** Weak learners are not expected to provide highly accurate predictions on their own. In fact, they can be quite weak in terms of predictive performance, and their predictions may be far from perfect.

3. **Bias:** Weak learners typically have high bias and low variance, meaning they underfit and make systematic errors in their predictions. Boosting reduces this bias by fitting each new learner to the errors that remain after the previous ones.

4. **Sequential Training:** In Gradient Boosting, weak learners are trained sequentially. Each new weak learner is trained to correct the errors made by the previous ensemble of weak learners. They are designed to focus on the examples that are difficult to classify or predict correctly.

5. **Contribution to Ensemble:** Despite their individual weaknesses, weak learners play a crucial role in the ensemble. When combined together through a weighted sum or other aggregation method, their predictions collectively result in a strong and accurate model.


The strength of Gradient Boosting lies in its ability to iteratively add weak learners and adjust their contributions in a way that gradually reduces the ensemble's errors and improves its predictive performance. This sequential training and error-correcting process make Gradient Boosting a powerful machine learning technique for both regression and classification tasks.
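
To make the contrast concrete, the sketch below compares a single decision stump with an ensemble of boosted stumps on the same data. The sine-shaped dataset, the 200 rounds, and the learning rate of 0.1 are assumptions picked for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Hypothetical nonlinear data: one period of a sine wave plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.random((200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(200)

# A single weak learner: a decision stump (one split, two constant predictions)
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)
stump_r2 = r2_score(y, stump.predict(X))

# Many stumps, each fit to the residuals left by the ensemble so far
prediction = np.full(len(y), y.mean())
for _ in range(200):
    residuals = y - prediction
    weak = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
    prediction += 0.1 * weak.predict(X)
boosted_r2 = r2_score(y, prediction)

print("single stump, training R^2:", stump_r2)
print("200 boosted stumps, training R^2:", boosted_r2)
```

A single stump can only output two distinct values, so it cannot follow the curve; the boosted sum of many stumps can.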

## Q5. What is the intuition behind the Gradient Boosting algorithm?

The intuition behind the Gradient Boosting algorithm can be understood through the metaphor of a team of experts correcting each other's mistakes to achieve a common goal. 


1. **Initial Prediction:** Imagine you have a team of "experts," each with some knowledge of a problem but not necessarily perfect. You are trying to solve a complex problem, such as making predictions in a machine learning task.

2. **First Attempt:** The first expert makes a simple initial prediction based on limited knowledge, for example the average outcome. This is the first ensemble prediction.

3. **Identifying Mistakes:** You compare the ensemble's predictions to the true outcomes and identify the mistakes. These mistakes are the differences between the ensemble's prediction and the true values.

4. **Specialization:** To address the mistakes, you bring in a new expert whose only job is to predict the errors the team has made so far, rather than the original outcome.

5. **Cooperation:** The new expert's correction is added to the previous ensemble prediction, usually scaled down so that no single expert dominates.

6. **Iterative Process:** The process repeats for several rounds, with each expert specializing further in correcting the mistakes made by the ensemble in the previous round. Each new prediction contributes to reducing the overall mistakes of the team.

7. **Final Ensemble Prediction:** After many rounds, the team of experts becomes highly specialized in correcting the errors made in the previous rounds. The final ensemble prediction is the combined result of all these rounds of correction.

The key idea behind Gradient Boosting is that each new "expert" (weak learner or decision tree) is trained to address the errors or residuals of the previous ensemble. It's as if each expert is an "error specialist" who focuses on the examples that the ensemble struggled with, gradually improving the model's predictions.

The term "Gradient" in Gradient Boosting comes from the fact that the algorithm uses gradient information (i.e., the slope of the loss function) to guide the training of each new expert. It optimizes the ensemble by minimizing the loss of the combined predictions.
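
As a worked example of that gradient, consider two common regression losses (the symbols $y$ for the true value and $F$ for the current ensemble prediction are notation assumed here). For squared error, the negative gradient is exactly the residual, which is why fitting experts to residuals is a special case of fitting them to gradients:

$$L(y, F) = \tfrac{1}{2}(y - F)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = y - F$$

For absolute error, the negative gradient keeps only the sign of the residual, so each new expert learns the direction of the correction rather than its size:

$$L(y, F) = |y - F| \quad\Rightarrow\quad -\frac{\partial L}{\partial F} = \operatorname{sign}(y - F) \quad (y \neq F)$$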



## Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

The Gradient Boosting algorithm builds an ensemble of weak learners (typically decision trees) in a sequential manner. Each weak learner is trained to correct the errors or residuals made by the ensemble up to that point.

1. **Initialization:** Gradient Boosting starts with an initial prediction, which is often set as the mean of the target variable (for regression tasks) or an initial set of class probabilities (for classification tasks). This initial prediction serves as the starting point for the ensemble.

2. **Compute Residuals:** For regression tasks with squared-error loss, the residuals are calculated as the differences between the true target values and the current ensemble's predictions. For classification tasks, the "residuals" are pseudo-residuals: the negative gradient of the loss function with respect to the model's raw scores (for example, log-odds), evaluated at the current predictions.

3. **Train Weak Learner:** A weak learner (usually a decision tree) is trained with the residuals as its target. It aims to capture the patterns and relationships in the data that the current ensemble is struggling with.

4. **Weighted Combination:** The predictions from the newly trained weak learner are combined with the current ensemble's predictions. The combination is weighted: each weak learner's output is scaled before it is added, usually by a fixed learning rate and, in some formulations, by a step size chosen through a line search that minimizes the loss.

5. **Update Residuals:** The residuals are updated based on the new predictions made by the weak learner. In regression tasks, the residuals are reduced by the predictions, whereas in classification tasks, they are updated based on the gradient of the loss function.

6. **Iterative Process:** Steps 3 to 5 are repeated for a predetermined number of iterations (controlled by a hyperparameter called the number of estimators or trees) or until a certain stopping criterion is met. In each iteration, a new weak learner is added to the ensemble, and the process is repeated.

7. **Final Ensemble Prediction:** The final prediction for a new data point is obtained by adding the weighted predictions of all the weak learners in the ensemble to the initial prediction. Because each learner reduces the errors left by its predecessors, the sum forms a strong predictive model.

The key idea behind Gradient Boosting is that each new weak learner specializes in correcting the errors or residuals made by the ensemble up to that point. This sequential process allows Gradient Boosting to focus on the examples that are difficult to predict, ultimately resulting in a powerful and accurate ensemble model.
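
One way to watch the ensemble being built is scikit-learn's `staged_predict`, which yields the ensemble's prediction after each added tree. The sketch below records the training error at every stage; the dataset and the hyperparameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical data: y = 2x + 1 plus small uniform noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.random(100)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
)
model.fit(X, y)

# Training MSE of the ensemble after 1, 2, ..., 50 trees
stage_mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]

for n_trees in (1, 10, 50):
    print(f"trees: {n_trees:2d}  training MSE: {stage_mse[n_trees - 1]:.5f}")
```

The error after the first tree is still close to that of the initial prediction, and it shrinks as more trees are added.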


## Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?



1. **Initialization**:
   - Start with an initial prediction, often represented as $F_0(x)$, which is typically set as the mean (for regression) or the initial class probabilities (for classification) of the target variable.

2. **Residual Calculation**:
   - Calculate the residuals, denoted as $r$, by taking the differences between the true target values $y$ and the current prediction $F(x)$:
     $r = y - F(x)$

3. **Loss Function**:
   - Define a loss function $L(y, F(x))$ that quantifies the error or discrepancy between the true target values and the current prediction. Common loss functions include Mean Squared Error (MSE) for regression and log loss (cross-entropy) for classification.

4. **Weak Learner Fitting**:
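   - Train a weak learner $h(x)$ (usually a decision tree) to fit the residuals $r$. More generally, the weak learner is fit to the pseudo-residuals, so that it approximates the negative gradient of the loss function with respect to the current prediction $F(x)$:
     $h(x) \approx -\frac{\partial L(y, F(x))}{\partial F(x)}$
   - For squared-error loss $L = \frac{1}{2}(y - F(x))^2$, this negative gradient is exactly the residual $r = y - F(x)$.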

5. **Weighted Contribution**:
   - Determine the weight $\alpha$ of the weak learner's prediction based on a gradient descent step. This weight represents how much the weak learner's output contributes to the ensemble prediction. It's often computed as:
     $\alpha = \text{argmin}_{\alpha} \sum_i L(y_i, F(x_i) + \alpha h(x_i))$

6. **Update Ensemble Prediction**:
   - Update the ensemble prediction by adding the weighted prediction from the weak learner to the current prediction:
     $F(x) \leftarrow F(x) + \alpha h(x)$

7. **Update Residuals**:
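   - Update the residuals $r$. For squared-error loss this amounts to subtracting the weighted prediction from the weak learner:
     $r \leftarrow r - \alpha h(x)$
   - For other loss functions, recompute the negative gradient at the updated prediction $F(x)$.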

8. **Iterative Process**:
   - Repeat steps 4 to 7 for a predetermined number of iterations (controlled by the number of weak learners or estimators) or until a stopping criterion is met.

9. **Final Ensemble Prediction**:
   - The final ensemble prediction is the result of adding all $M$ weighted weak learner predictions to the initial prediction:
     $F_{\text{final}}(x) = F_0(x) + \sum_{m=1}^{M} \alpha_m h_m(x)$

10. **Gradient Descent Optimization**:
    - Taken together, the steps above perform gradient descent in function space: each weak learner points along the negative gradient of the loss function, and its weight $\alpha$ sets the step size. The ensemble's loss decreases through this series of weighted updates.
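
The steps above can be sketched as a loop in which only the negative-gradient function changes between losses. This is a simplified illustration under stated assumptions: the helper names (`boost`, `negative_gradient_squared`, `negative_gradient_absolute`) and the dataset are hypothetical, and a fixed learning rate stands in for the line search over $\alpha$ in step 5.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def negative_gradient_squared(y, F):
    # L = 0.5 * (y - F)^2  ->  -dL/dF = y - F (the ordinary residual)
    return y - F


def negative_gradient_absolute(y, F):
    # L = |y - F|  ->  -dL/dF = sign(y - F)
    return np.sign(y - F)


def boost(X, y, negative_gradient, F0, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Gradient boosting with a pluggable loss gradient and a fixed step size."""
    F = np.full(len(y), F0, dtype=float)             # step 1: initial prediction
    learners = []
    for _ in range(n_estimators):
        pseudo_residuals = negative_gradient(y, F)   # steps 2-3: gradient of the loss
        h = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        h.fit(X, pseudo_residuals)                   # step 4: fit the weak learner
        F += learning_rate * h.predict(X)            # steps 5-6: weighted update
        learners.append(h)
    return F, learners


# Hypothetical data: y = 2x + 1 plus small uniform noise
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.random(100)

# Squared error: start from the mean, which minimizes squared error for a constant
F_sq, _ = boost(X, y, negative_gradient_squared, F0=np.mean(y))
baseline_mse = np.mean((y - np.mean(y)) ** 2)
boosted_mse = np.mean((y - F_sq) ** 2)

# Absolute error: start from the median, which minimizes absolute error for a constant
F_abs, _ = boost(X, y, negative_gradient_absolute, F0=np.median(y))
baseline_mae = np.mean(np.abs(y - np.median(y)))
boosted_mae = np.mean(np.abs(y - F_abs))

print("MSE: baseline", baseline_mse, "-> boosted", boosted_mse)
print("MAE: baseline", baseline_mae, "-> boosted", boosted_mae)
```

Because the residual update in step 7 holds only for squared error, the loop recomputes the negative gradient from the current prediction at every iteration, which works for both losses.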

