
# Boosting and Gradient Boosting

**Concept of Boosting**

- What is Boosting?

  - An ensemble technique that sequentially combines weak learners to form a strong learner

  - Each subsequent model focuses on correcting the errors made by the previous models

- How does Boosting differ from Bagging?

| Feature | Bagging | Boosting |
|:--------|:--------|:---------|
| Approach | Trains models independently on bootstrapped subsets | Trains models sequentially to correct errors |
| Purpose | Reduces variance by averaging predictions | Reduces bias by focusing on difficult cases |
| Examples | Random Forest | Gradient Boosting, AdaBoost |
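
The contrast in the table can be tried out with a minimal sketch. It assumes scikit-learn's `BaggingClassifier` and `AdaBoostClassifier`, both built on the same depth-1 decision tree (a "stump") as the weak learner. The dataset, the 50-estimator setting, and the helper `make_stump` are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def make_stump():
    # Hypothetical helper: a depth-1 tree used as the weak learner
    return DecisionTreeClassifier(max_depth=1, random_state=42)


# A single weak learner as the baseline
stump = make_stump().fit(X_train, y_train)

# Bagging: 50 stumps trained independently on bootstrapped samples
bagging = BaggingClassifier(make_stump(), n_estimators=50, random_state=42)
bagging.fit(X_train, y_train)

# Boosting: up to 50 stumps trained sequentially, each one reweighting
# the samples that the previous stumps misclassified
boosting = AdaBoostClassifier(make_stump(), n_estimators=50, random_state=42)
boosting.fit(X_train, y_train)

scores = {
    "single stump": stump.score(X_test, y_test),
    "bagging": bagging.score(X_test, y_test),
    "boosting": boosting.score(X_test, y_test),
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The exact accuracies depend on the split and the scikit-learn version, so treat the printed numbers as a single observation rather than a general ranking.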


**Gradient Boosting**

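- What is Gradient Boosting?

  - A boosting algorithm that builds an ensemble sequentially, minimizing a loss function by gradient descent: each new weak learner is fit to the negative gradient of the loss with respect to the current predictions

  - Iteratively adds weak learners to improve overall model performance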

- How Gradient Boosting Works

  - Initialize model: Start with a simple model, often one that predicts the mean of the target variable

  - Compute residuals: Calculate the difference between the actual and predicted values (for squared-error loss, this is the negative gradient of the loss)

  - Fit weak learner: Train a weak model to predict the residuals

  - Update prediction: Add the weak learner's predictions, scaled by the learning rate, to the overall model

  - Repeat: Continue adding weak learners until the desired number of iterations or a stopping criterion is reached
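
The steps above can be sketched from scratch for a regression problem with squared-error loss, where the residual equals the negative gradient. This is an illustrative sketch, not how scikit-learn implements it: the synthetic sine data, the depth-2 trees, the learning rate of 0.1, the 100 rounds, and the helper `predict` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 100

# Step 1: initialize with the mean of the target
initial_prediction = y.mean()
prediction = np.full_like(y, initial_prediction)

trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # Step 2: compute residuals of the current ensemble
    residuals = y - prediction

    # Step 3: fit a weak learner (shallow tree) to the residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)

    # Step 4: add the scaled prediction of the weak learner
    prediction += learning_rate * tree.predict(X)

    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def predict(X_new):
    """Sum the initial prediction and every tree's scaled contribution."""
    out = np.full(X_new.shape[0], initial_prediction)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Here step 5 (repeat) is a fixed number of rounds. A stopping criterion based on a held-out validation set could replace the fixed loop.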



- Key parameters in Gradient Boosting

  - Learning rate

    - Scales the contribution of each weak learner

    - Smaller values reduce overfitting but require more iterations

  - Number of estimators

    - The number of weak learners (trees) added sequentially

    - Larger values let the model fit the training data more closely but increase computation time

  - Regularization

    - Techniques such as limiting tree depth or adding penalties to prevent overfitting
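
To see how the learning rate and the number of estimators interact, one option is scikit-learn's `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below assumes the breast cancer dataset and two illustrative learning rates. It scores against the test split purely for demonstration; a validation split would be the safer choice when actually selecting parameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

staged_accuracy = {}
for learning_rate in (0.01, 0.1):
    model = GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=200,
        max_depth=3,
        random_state=42,
    )
    model.fit(X_train, y_train)

    # staged_predict yields predictions after 1, 2, ..., n_estimators trees
    staged_accuracy[learning_rate] = [
        accuracy_score(y_test, y_pred)
        for y_pred in model.staged_predict(X_test)
    ]

for learning_rate, scores in staged_accuracy.items():
    for n_trees in (10, 50, 200):
        print(
            f"learning_rate={learning_rate}, trees={n_trees}: "
            f"accuracy={scores[n_trees - 1]:.3f}"
        )
```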

**Understanding the Key Parameters**

- Learning rate (learning_rate)

  - Lower values often generalize better by reducing overfitting, but require more iterations

  - Typical range: 0.01 to 0.3

- Number of estimators (n_estimators)

  - The number of trees added to the ensemble

  - Larger values can improve performance but risk overfitting

- Tree depth (max_depth)

  - Limits the complexity of individual trees

  - Shallower trees generalize better but might underfit
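
Because too many trees can overfit, one way to choose the count is to set a generous upper bound and let training stop once a held-out validation score stops improving. The sketch below assumes scikit-learn's `n_iter_no_change` and `validation_fraction` options. The values 500, 0.2, and 10 are illustrative, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(
    learning_rate=0.1,
    n_estimators=500,         # upper bound on the number of trees
    max_depth=3,              # keep individual trees shallow
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Trees fitted: {model.n_estimators_} (upper bound: {model.n_estimators})")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The fitted attribute `n_estimators_` reports how many trees were actually trained, which may be well below the upper bound.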

**1. Train and evaluate a Gradient Boosting model on a dataset, tune key parameters, and compare its performance with a Random Forest model**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dataset information
print(f"Features: {data.feature_names}")
print(f"Classes: {data.target_names}")

# Train Gradient Boosting model with default hyperparameters
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Predict on the test set
y_pred_gb = gb_model.predict(X_test)

# Evaluate performance
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Accuracy: \n{accuracy_gb:.2f}")
print("Classification Report: \n", classification_report(y_test, y_pred_gb))

# Define hyperparameter grid (3 x 3 x 3 = 27 combinations)
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
}

# Perform grid search with 5-fold cross-validation on the training set
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

# Display best parameters and cross-validation score
print(f"Best Parameters: \n{grid_search.best_params_}")
print(f"Best Cross Validation Accuracy: \n{grid_search.best_score_:.2f}")

# Train Random Forest with default hyperparameters
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate performance
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: \n{accuracy_rf:.2f}")
```

Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Classes: ['malignant' 'benign']
Gradient Boosting Accuracy: 
0.96
Classification Report: 
               precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Best Parameters: 