# **ASSIGNMENT**

**Q1. What is Gradient Boosting Regression?**

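Gradient Boosting Regression is a machine learning technique for regression problems; the same gradient boosting framework also handles classification by swapping the loss function. It is an ensemble learning method that combines the predictions of several weak learners (typically shallow decision trees) to create a strong predictive model. The term "gradient boosting" refers to the optimization process: each new learner is fit to the negative gradient of the loss function with respect to the current predictions, so adding it moves the ensemble in the direction that reduces the loss.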

Here's a general overview of how Gradient Boosting Regression works:

1. **Base Learners (Weak Models):** Gradient Boosting builds an ensemble of weak learners, often decision trees, where each tree tries to correct the errors made by the previous ones.

2. **Initialization:** The algorithm starts with a simple constant model: the mean of the target variable when the loss is squared error, or the median when the loss is absolute error. This initial model serves as the first approximation.

3. **Sequential Training:** Subsequent models are trained sequentially, with each one focusing on reducing the errors of the combined ensemble of models generated so far.

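4. **Gradient Descent Optimization:** The key idea is to fit each new model to the residual errors (the differences between the actual and predicted values) of the combined ensemble. For squared-error loss these residuals equal, up to a constant factor, the negative gradient of the loss with respect to the current predictions, so fitting a model to them and adding it to the ensemble is a gradient descent step taken in the space of predictions.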

5. **Shrinkage (or Learning Rate):** A shrinkage parameter is introduced to control the contribution of each weak learner to the ensemble. A smaller shrinkage value requires a higher number of weak learners but can lead to a more robust model.

6. **Combining Weak Models:** The final prediction is the initial prediction plus the sum of the predictions from all the weak learners, each multiplied by the shrinkage factor.

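Gradient Boosting Regression is known for its high predictive accuracy, particularly on tabular data. It can still overfit if too many trees are added or the trees are too deep, so it is usually regularized through the learning rate, tree depth, subsampling, and early stopping. Popular implementations include XGBoost, LightGBM, and scikit-learn's GradientBoostingRegressor. These algorithms have been widely used in various applications, including finance, healthcare, and online advertising.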
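The steps above can be traced in a few lines. The following is a minimal sketch, not a reference implementation: it assumes squared-error loss, depth-2 scikit-learn trees as weak learners, and a synthetic sine-shaped dataset chosen only for illustration. It records the training MSE after every boosting round.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full(len(y_demo), y_demo.mean())   # step 2: constant initial model
train_mse = [np.mean((y_demo - prediction) ** 2)]  # MSE of the initial model

for _ in range(50):                                # step 3: sequential training
    residuals = y_demo - prediction                # step 4: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * tree.predict(X_demo)  # steps 5-6: shrink and add
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE of the initial model: {train_mse[0]:.4f}")
print(f"Training MSE after 50 rounds:      {train_mse[-1]:.4f}")
```

Because each tree predicts the mean residual of its leaves, a step with a learning rate between 0 and 1 cannot increase the training MSE, so the recorded values never go up. The test error carries no such guarantee.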

**Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a
simple regression problem as an example and train the model on a small dataset. Evaluate the model's
performance using metrics such as mean squared error and R-squared.**

In [1]:
import warnings
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
warnings.filterwarnings("ignore")

class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.models = []

    def fit(self, X, y):
        # Initialize with the mean for regression and store it for predict()
        self.initial_prediction = np.mean(y)
        self.models = []
        prediction = np.full(len(y), self.initial_prediction, dtype=np.float64)

        for _ in range(self.n_estimators):
            # Compute the residuals
            residuals = y - prediction

            # Fit a weak learner (decision tree) to the residuals
            model = DecisionTreeRegressor(max_depth=3)
            model.fit(X, residuals)

            # Make predictions with the weak learner
            weak_predictions = model.predict(X)

            # Update the overall prediction with a fraction of the weak learner's predictions
            prediction += self.learning_rate * weak_predictions

            # Store the weak learner in the ensemble
            self.models.append(model)

    def predict(self, X):
        # Start from the initial prediction (the training mean), then add the
        # scaled prediction of every weak learner, mirroring the updates in fit()
        predictions = np.full(X.shape[0], self.initial_prediction, dtype=np.float64)
        for model in self.models:
            predictions += self.learning_rate * model.predict(X)
        return predictions

# Generate a small dataset for regression
np.random.seed(42)
X = np.random.rand(100, 1)
y = 2 * X.squeeze() + 1 + 0.1 * np.random.randn(100)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)
gb_regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gb_regressor.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.4f}")
print(f"R-squared: {r2:.4f}")
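A quick way to see why the constant initial prediction matters at prediction time: the trees are fit to residuals, which are centered near zero, so their scaled sum alone misses the target by roughly the training mean. The sketch below is illustrative only. It regenerates a dataset of the same shape and uses a hypothetical helper, `boost`, that is not part of the class above.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_estimators=100, learning_rate=0.1):
    """Hypothetical helper: returns the initial constant and the fitted trees."""
    f0 = float(np.mean(y))
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        tree = DecisionTreeRegressor(max_depth=3).fit(X, y - prediction)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

# Same kind of data as above: y = 2x + 1 plus small Gaussian noise
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1))
y_demo = 2 * X_demo[:, 0] + 1 + 0.1 * rng.normal(size=100)
X_fit, y_fit = X_demo[:80], y_demo[:80]
X_new, y_new = X_demo[80:], y_demo[80:]

learning_rate = 0.1
f0, trees = boost(X_fit, y_fit, learning_rate=learning_rate)
correction = learning_rate * sum(tree.predict(X_new) for tree in trees)

r2_without_f0 = r2_score(y_new, correction)    # scaled trees only
r2_with_f0 = r2_score(y_new, f0 + correction)  # initial constant plus scaled trees
print(f"R-squared without the initial prediction: {r2_without_f0:.4f}")
print(f"R-squared with the initial prediction:    {r2_with_f0:.4f}")
```

A strongly negative R-squared from a boosted model on data this simple is usually a sign that the constant term was dropped somewhere between `fit` and `predict`.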


**Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to
optimise the performance of the model. Use grid search or random search to find the best
hyperparameters**

In [2]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Define the parameter grid for random search
param_dist = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'max_depth': [3, 4, 5, 6],
}

# Create a GradientBoostingRegressor
gb_regressor = GradientBoostingRegressor()

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=gb_regressor,
    param_distributions=param_dist,
    scoring='neg_mean_squared_error',  # scikit-learn maximizes scores, so the MSE is negated (higher is better)
    n_iter=10,  # Number of random combinations to try
    cv=5,  # Number of cross-validation folds
    random_state=42
)

# Fit the random search to the data
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

# Use the best model for predictions
best_gb_regressor = random_search.best_estimator_
y_pred_best = best_gb_regressor.predict(X_test)

# Evaluate the best model
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)

print(f"Mean Squared Error (Best Model): {mse_best:.4f}")
print(f"R-squared (Best Model): {r2_best:.4f}")


Best Hyperparameters: {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1}
Mean Squared Error (Best Model): 0.0071
R-squared (Best Model): 0.9811
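The question also mentions grid search. As a sketch of that alternative, `GridSearchCV` evaluates every combination in the grid instead of sampling `n_iter` of them. The grid below is deliberately smaller than `param_dist` to keep the run short, and the dataset is regenerated so the snippet runs on its own; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Regenerate a small dataset so the snippet is self-contained
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1))
y_demo = 2 * X_demo[:, 0] + 1 + 0.1 * rng.normal(size=100)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2 x 2 x 2 = 8 combinations, all of which are evaluated
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

grid_search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # scorers are maximized, hence the negated MSE
    cv=5,
)
grid_search.fit(X_tr, y_tr)

print("Best hyperparameters:", grid_search.best_params_)
print(f"Best cross-validated MSE: {-grid_search.best_score_:.4f}")
# best_estimator_.score returns R-squared for regressors
print(f"Test R-squared: {grid_search.best_estimator_.score(X_te, y_te):.4f}")
```

Grid search is exhaustive, so its cost grows with the product of the list lengths; random search keeps the cost fixed at `n_iter` fits per fold, which is why it is often preferred for larger search spaces.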


**Q4. What is a weak learner in Gradient Boosting?**

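In the context of Gradient Boosting, a weak learner refers to a base model whose predictive power on its own is limited: in classification it performs only slightly better than random chance, and in regression it improves only slightly on a constant prediction such as the mean. Typically, shallow decision trees are used as weak learners in Gradient Boosting algorithms, but other algorithms can also be employed.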

The term "weak" is relative, and it implies that the individual models in the ensemble are not highly accurate on their own. The strength of Gradient Boosting comes from the combination of these weak learners to form a robust and accurate predictive model.

In the training process of Gradient Boosting:

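1. **Initialization:** The algorithm starts with a simple constant model, often the mean of the target variable (or the median, for absolute-error loss) in the case of regression, or a constant score derived from the class frequencies, such as the log-odds of the positive class, in the case of classification.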

2. **Sequential Training:** Subsequent weak learners are added to the ensemble in a sequential manner. Each new weak learner focuses on capturing the errors made by the combined ensemble of models generated so far.

3. **Gradient Descent Optimization:** The weak learner is trained to fit the negative gradient of the loss function with respect to the ensemble's current predictions. This is done to reduce the residuals or errors of the combined model.

The concept of using weak learners is fundamental to Gradient Boosting's success. Each weak learner contributes a small piece of the overall solution, and by combining them sequentially, the algorithm is able to adapt and improve its predictions over iterations. The key is to use models that are just complex enough to capture the patterns in the data but not so complex that they overfit the training data.
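To illustrate the gap between one weak learner and the boosted ensemble, the sketch below compares a single depth-1 tree (a "stump") with 200 such stumps combined by scikit-learn's `GradientBoostingRegressor`. The sine-shaped dataset and all hyperparameter values are assumptions picked for the demonstration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic nonlinear data, assumed purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# One weak learner on its own: a tree with a single split
stump = DecisionTreeRegressor(max_depth=1).fit(X_tr, y_tr)

# 200 weak learners of the same kind, combined by gradient boosting
boosted = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_tr, y_tr)

r2_stump = stump.score(X_te, y_te)      # R-squared of the single stump
r2_boosted = boosted.score(X_te, y_te)  # R-squared of the boosted ensemble
print(f"Single stump R-squared:   {r2_stump:.4f}")
print(f"Boosted stumps R-squared: {r2_boosted:.4f}")
```

A single stump can only output two distinct values, so it cannot follow the curve; the ensemble adds many small step functions and approximates it far more closely.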

**Q5. What is the intuition behind the Gradient Boosting algorithm?**

The intuition behind the Gradient Boosting algorithm can be understood through the following key concepts:

1. **Ensemble Learning:** Gradient Boosting belongs to the family of ensemble learning methods. Ensemble learning involves combining the predictions of multiple models to create a stronger and more robust model than any of its individual components.

2. **Sequential Improvement:** Gradient Boosting builds an ensemble of weak learners sequentially. Each weak learner is trained to correct the errors made by the combined ensemble of models generated so far.

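3. **Gradient Descent Optimization:** The term "gradient" in Gradient Boosting refers to the use of gradient descent to minimize the loss of the model. In each iteration, the negative gradient of the loss with respect to the current predictions gives, for every training point, the direction and magnitude by which its prediction should change, and the next weak learner is fit to approximate these values. The descent therefore happens in the space of predictions, not in the feature space.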

4. **Weak Learners:** The individual models in Gradient Boosting are weak learners, meaning models with limited predictive power on their own (only slightly better than random guessing in classification, or than a constant prediction in regression). Shallow decision trees are commonly used as weak learners, and they are added to the ensemble in a stepwise manner.

5. **Residual Fitting:** In each iteration, a new weak learner is trained to fit the residuals or errors of the current ensemble. This process allows the model to gradually reduce the errors made by the combined set of weak learners.

6. **Shrinkage (Learning Rate):** A shrinkage parameter, often referred to as the learning rate, is introduced to control the contribution of each weak learner to the ensemble. A smaller learning rate requires more weak learners but can lead to a more robust and generalized model.

7. **Adaptive Learning:** The algorithm is adaptive in nature. As weak learners are added sequentially, the model adapts and becomes more tailored to the specific patterns in the data, capturing complex relationships and interactions.

8. **Regularization:** Gradient Boosting does not regularize itself: adding more trees keeps lowering the training error and can eventually overfit. Overfitting is controlled through shrinkage (a small learning rate), shallow trees, subsampling of rows or features, and early stopping on a validation set, all of which contribute to the model's generalization performance.

In summary, the intuition behind Gradient Boosting is to iteratively improve the model's predictions by learning from the mistakes of the previous models. By combining weak learners and using gradient descent optimization, the algorithm adapts to the data's complexity and yields a powerful predictive model.
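To make the "gradient" part concrete, the sketch below checks numerically that, for the squared-error loss written as L = ½(y − F)², the negative gradient with respect to the prediction F equals the residual y − F. It also computes the same quantity for absolute-error loss, where the negative gradient is the sign of the residual. The ½ scaling, the helper names, and the example numbers are assumptions chosen for illustration.

```python
import numpy as np

def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def numerical_negative_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of -dL/dF for each prediction."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

y_true = np.array([3.0, -1.0, 2.5, 0.0])
f_current = np.array([2.0, 0.5, 2.0, -1.0])  # current ensemble predictions

residuals = y_true - f_current
neg_grad_sq = numerical_negative_gradient(squared_loss, y_true, f_current)
neg_grad_abs = numerical_negative_gradient(absolute_loss, y_true, f_current)

print("residuals:              ", residuals)
print("-dL/dF (squared error): ", np.round(neg_grad_sq, 4))
print("-dL/dF (absolute error):", np.round(neg_grad_abs, 4))
```

Fitting the next weak learner to the residuals is therefore a special case of fitting it to the negative gradient; changing the loss changes the training target but not the rest of the procedure.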

**Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?**

The Gradient Boosting algorithm builds an ensemble of weak learners sequentially. The process involves the following steps:

1. **Initialization:**
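   - The algorithm starts with an initial prediction, often the mean of the target variable (or the median, for absolute-error loss) in the case of regression, or a constant score derived from the class frequencies, such as the log-odds of the positive class, in the case of classification.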
   - This initial prediction serves as the baseline for the ensemble.

2. **Residual Calculation:**
   - The difference between the actual target values and the current prediction (residuals) is calculated for each data point in the training set.

3. **Sequential Training:**
   - A weak learner (typically a decision tree) is trained on the residuals. The goal is to fit the weak learner to the errors made by the current ensemble of models.

4. **Gradient Descent Optimization:**
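   - In general, the weak learner is fit to the negative gradient of the loss function with respect to the current predictions (the pseudo-residuals). For squared-error loss the negative gradient is the residual from step 2, so steps 2 and 3 already perform this fit; for other losses the negative gradient replaces the plain residual as the training target.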
   - The learning rate (a hyperparameter) controls the step size in the direction of the negative gradient during optimization.

5. **Update Ensemble:**
   - The predictions of the weak learner are scaled by a factor (learning rate) and added to the ensemble.
   - The ensemble now includes the new weak learner, and its predictions are combined with the predictions from the previous weak learners.

6. **Iteration:**
   - Steps 2-5 are repeated for a predefined number of iterations or until a convergence criterion is met.
   - In each iteration, a new weak learner is added to the ensemble, and the model is refined to reduce the errors of the combined ensemble.

7. **Final Ensemble:**
   - The final prediction is the initial prediction plus the sum of the learning-rate-scaled predictions from all the weak learners in the ensemble.
   - The combination of weak learners, each addressing a different aspect of the data, leads to a strong predictive model.

The process of sequentially adding weak learners and updating the predictions based on the negative gradient is what gives the algorithm its name, "Gradient Boosting." The ensemble gradually adapts to the complexity of the data, capturing intricate patterns and relationships to improve predictive accuracy.
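The stage-by-stage construction can be observed directly in scikit-learn: `GradientBoostingRegressor.staged_predict` yields the ensemble's prediction after each added tree. The sketch below uses it to track the test error as the ensemble grows. The dataset and hyperparameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data, assumed purely for illustration
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.2 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

model = GradientBoostingRegressor(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=1
).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
test_mse = [mean_squared_error(y_te, pred) for pred in model.staged_predict(X_te)]
best_n = int(np.argmin(test_mse)) + 1

print(f"Test MSE after 1 tree:    {test_mse[0]:.4f}")
print(f"Lowest test MSE:          {test_mse[best_n - 1]:.4f} (at {best_n} trees)")
print(f"Test MSE after 300 trees: {test_mse[-1]:.4f}")
```

Comparing the lowest test MSE with the value after the last tree is a simple way to judge whether the number of iterations in step 6 was set too high.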

**Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting
algorithm?**

Constructing the mathematical intuition behind the Gradient Boosting algorithm involves understanding the key mathematical concepts and steps that drive the algorithm's learning process. Here are the steps involved in building the mathematical intuition of Gradient Boosting:

1. **Loss Function:**
   - Start with a loss function that measures the difference between the model's predictions and the actual target values. For regression problems, the mean squared error (MSE) is commonly used.

2. **Initial Prediction:**
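   - Initialize the model with a simple constant prediction, often the mean of the target variable (or the median, for absolute-error loss) for regression, or a constant score derived from the class frequencies, such as the log-odds of the positive class, for classification.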

3. **Residual Calculation:**
   - Compute the residuals by subtracting the current prediction (the initial prediction in the first iteration) from the actual target values. These residuals represent the errors made by the current model.

4. **Sequential Weak Learner Training:**
   - Train a weak learner (usually a decision tree) on the residuals. The goal is to fit the weak learner to the errors made by the current ensemble.

5. **Negative Gradient:**
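   - Calculate the negative gradient of the loss function with respect to the current predictions. This indicates the direction and magnitude of the adjustment needed to minimize the loss. For squared-error loss it coincides with the residuals from step 3 (up to a constant factor); for other losses, the weak learner in step 4 is trained on this negative gradient instead of the raw residuals.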

6. **Scaling by Learning Rate:**
   - Scale the predictions of the weak learner by a factor known as the learning rate. This controls the step size during the gradient descent optimization process.

7. **Update Ensemble:**
   - Add the scaled predictions of the weak learner to the current ensemble of models. This updates the overall prediction by incorporating the information from the new weak learner.

8. **Repeat:**
   - Repeat steps 3-7 for a predefined number of iterations or until a convergence criterion is met. In each iteration, a new weak learner is trained to capture the remaining errors.

9. **Final Prediction:**
   - The final prediction is the initial prediction plus the sum of the predictions from all the weak learners, each scaled by the learning rate.

10. **Regularization:**
    - Optionally, introduce regularization techniques to prevent overfitting, such as limiting the depth of the weak learners or adding a regularization term to the loss function.

11. **Evaluation:**
    - Evaluate the performance of the final ensemble on a validation or test dataset using appropriate metrics (e.g., mean squared error for regression, accuracy for classification).

Understanding the mathematical intuition involves grasping how the negative gradient guides the optimization process, how the weak learners are combined to reduce errors, and how the algorithm adapts to the data's complexity over iterations. It also involves recognizing the role of hyperparameters like the learning rate in controlling the update of the ensemble.
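The steps above can be written compactly. The notation is an assumption introduced here for illustration, since the text does not fix symbols: $F_m$ is the ensemble after $m$ rounds, $h_m$ the $m$-th weak learner, $\nu$ the learning rate, $L$ the loss, and $M$ the number of rounds. This is the simplified form that matches the listed steps; fuller formulations also include a per-leaf line search, which is omitted.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c)
  && \text{(initial prediction; the mean of } y \text{ for squared error)} \\
r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
  && \text{(negative gradient, or pseudo-residual)} \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl( r_{im} - h(x_i) \bigr)^2
  && \text{(fit the weak learner to } r_{im} \text{)} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x)
  && \text{(shrunken update)} \\
F_M(x) &= F_0(x) + \nu \sum_{m=1}^{M} h_m(x)
  && \text{(final prediction)}
\end{aligned}
```

With squared-error loss written as $L(y, F) = \tfrac{1}{2}(y - F)^2$, the second line reduces to $r_{im} = y_i - F_{m-1}(x_i)$, which is the ordinary residual used in steps 3 and 4.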

---------------------------