
# Module 21: Gradient Boosting (Practice Notebook)
## Classification Task with TODO Blocks

In this practice notebook, you will implement **Gradient Boosting for classification** step by step.

### Learning Objectives
- Understand how Gradient Boosting works for classification
- Practice model training, prediction, and evaluation
- Explore the effect of key hyperparameters

**Important:** Complete all TODO blocks. Do not skip steps.


## Step 1: Import Required Libraries

**TODO:** Import NumPy, Pandas, Matplotlib, and required scikit-learn modules.

In [2]:
# TODO: Import necessary libraries
# Hint: numpy, pandas, matplotlib.pyplot
# Hint: load dataset, train_test_split, GradientBoostingClassifier, metrics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report




## Step 2: Load Dataset

We will use the **Breast Cancer Wisconsin dataset**, a standard binary classification dataset.

**TODO:** Load the dataset and separate features (X) and target (y).

In [3]:
# TODO: Load the breast cancer dataset
# Hint: sklearn.datasets.load_breast_cancer
# TODO: Assign features to X and labels to y
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop('target', axis=1)
y = df['target']



## Step 3: Inspect the Data

**TODO:** Display the first few rows of X and the distribution of y.

In [4]:
# TODO: View first 5 rows of X
# TODO: Check class distribution in y
print(X.head(5))
# value_counts() shows the class distribution; head() would only show the first labels
print(y.value_counts())


   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0           

## Step 4: Train-Test Split

**TODO:** Split the dataset into training and testing sets.
- Use 80% data for training
- Set random_state for reproducibility

In [5]:
# TODO: Perform train-test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
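The split above does not stratify, so the class proportions in the training and test sets can drift apart slightly. As a minimal sketch, assuming you want both splits to mirror the overall class balance, `train_test_split` accepts a `stratify` argument. The `_s` suffixed names are hypothetical and are used only to avoid overwriting the variables the rest of the notebook relies on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data so this cell runs on its own
X_s, y_s = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y_s keeps the class ratio (approximately) equal in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X_s, y_s, test_size=0.2, random_state=42, stratify=y_s
)

print("Train shape:", X_train_s.shape, "Test shape:", X_test_s.shape)
print("Share of class 1 in train:", round(y_train_s.mean(), 3))
print("Share of class 1 in test: ", round(y_test_s.mean(), 3))
```

Stratification matters more as the dataset gets smaller or the classes get more imbalanced.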


## Step 5: Train Gradient Boosting Classifier

**TODO:** Initialize and train a GradientBoostingClassifier.

Suggested starting values:
- n_estimators = 100
- learning_rate = 0.1
- max_depth = 3

In [6]:
# TODO: Initialize GradientBoostingClassifier
gbc=GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
# TODO: Fit the model on training data
gbc.fit(X_train,y_train)



## Step 6: Make Predictions

**TODO:** Predict class labels and class probabilities for the test set.

In [7]:
# TODO: Predict class labels
y_pred=gbc.predict(X_test)
# TODO: Predict class probabilities
proba=gbc.predict_proba(X_test)
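`predict` returns the more probable class, while `predict_proba` exposes the probabilities so you can choose your own cut-off. The sketch below assumes you want to flag more cases as malignant (label 0 in this dataset) by lowering the threshold. The value 0.3 is an arbitrary illustration, and the `_b` names are hypothetical, chosen so the notebook's own variables stay untouched.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_b, y_b = load_breast_cancer(return_X_y=True)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_b, y_b, test_size=0.2, random_state=42
)

model_b = GradientBoostingClassifier(random_state=42).fit(X_tr_b, y_tr_b)
proba_b = model_b.predict_proba(X_te_b)

# Columns follow model_b.classes_, here [0, 1] = [malignant, benign]
p_malignant = proba_b[:, 0]

flag_default = p_malignant >= 0.5   # roughly what predict() does
flag_cautious = p_malignant >= 0.3  # hypothetical, more cautious cut-off

print("Flagged malignant at 0.5:", int(flag_default.sum()))
print("Flagged malignant at 0.3:", int(flag_cautious.sum()))
```

A lower threshold can only flag the same cases or more, trading extra false alarms for fewer missed malignant cases.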


## Step 7: Model Evaluation

**TODO:** Evaluate the model using:
- Accuracy
- Confusion Matrix
- Classification Report

In [8]:
# TODO: Calculate accuracy score
accuracy=accuracy_score(y_test,y_pred)
# TODO: Print confusion matrix
print(confusion_matrix(y_test,y_pred))
# TODO: Print classification report
print(classification_report(y_test,y_pred))

[[40  3]
 [ 2 69]]
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
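
In the confusion matrix above, rows are true labels and columns are predicted labels, so the accuracy is (40 + 69) / 114 ≈ 0.956. All three metrics depend on the default decision threshold. As a complementary sketch, ROC AUC scores the ranking of the predicted probabilities instead. The `_c` names are hypothetical and the model uses the same settings as Step 5 (scikit-learn's defaults).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_c, y_c = load_breast_cancer(return_X_y=True)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42
)
model_c = GradientBoostingClassifier(random_state=42).fit(X_tr_c, y_tr_c)

# roc_auc_score expects the score of the positive class (label 1)
auc_c = roc_auc_score(y_te_c, model_c.predict_proba(X_te_c)[:, 1])
print("ROC AUC:", round(auc_c, 4))
```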



## Step 8: Effect of Learning Rate (Experiment)

**TODO:** Train multiple models with different learning rates and compare accuracy.

Suggested learning rates: 0.01, 0.05, 0.1, 0.2

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Learning rates
lr = [0.01, 0.05, 0.1, 0.2]

results = []

for i in lr:
    gbc = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=i,
        max_depth=3,
        random_state=42
    )

    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)

    acc = accuracy_score(y_test, y_pred)

    results.append({
        "Learning Rate": i,
        "Accuracy Score": acc
    })

# Final results table
store = pd.DataFrame(results)
print(store)


   Learning Rate  Accuracy Score
0           0.01         0.95614
1           0.05         0.95614
2           0.10         0.95614
3           0.20         0.95614
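
All four learning rates reach the same test accuracy after 100 trees, so the table alone does not show how they differ. One way to look closer, sketched here, is `staged_predict`, which yields the predictions after each boosting stage. The `_d` names and the checkpoints (10, 50, 100 trees) are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_d, y_d = load_breast_cancer(return_X_y=True)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_d, y_d, test_size=0.2, random_state=42
)

staged_acc = {}
for rate in [0.01, 0.05, 0.1, 0.2]:
    model_d = GradientBoostingClassifier(
        n_estimators=100, learning_rate=rate, max_depth=3, random_state=42
    ).fit(X_tr_d, y_tr_d)
    # staged_predict yields test predictions after 1, 2, ..., 100 trees
    staged_acc[rate] = [
        accuracy_score(y_te_d, pred) for pred in model_d.staged_predict(X_te_d)
    ]

for rate, accs in staged_acc.items():
    # Report test accuracy after 10, 50 and 100 trees
    print(rate, [round(accs[k - 1], 4) for k in (10, 50, 100)])
```

Comparing the early checkpoints shows how quickly each learning rate approaches its final accuracy.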


## Step 9: Effect of Tree Depth (Experiment)

**TODO:** Compare model performance for different tree depths.

Suggested depths: 1, 2, 3, 5

In [11]:
# TODO: Loop over max_depth values
# TODO: Train model and evaluate accuracy
Depths=[1, 2, 3, 5]

results = []

for i in Depths:
    gbc = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.01,  # note: 0.01 here, whereas Step 5 used 0.1
        max_depth=i,
        random_state=42
    )

    gbc.fit(X_train, y_train)
    y_pred = gbc.predict(X_test)

    acc = accuracy_score(y_test, y_pred)

    results.append({
        "Depth": i,
        "Accuracy Score": acc
    })

# Final results table
store = pd.DataFrame(results)
print(store)



   Depth  Accuracy Score
0      1        0.938596
1      2        0.956140
2      3        0.956140
3      5        0.938596
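
With 114 test samples, 0.938596 and 0.956140 correspond to 107 and 109 correct predictions, so the depths differ by only two samples. A single split is therefore a noisy basis for comparison. The sketch below assumes 5-fold cross-validation on the full dataset as a steadier estimate. The `_e` names and `cv=5` are choices made here, not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_e, y_e = load_breast_cancer(return_X_y=True)

cv_results = {}
for depth in [1, 2, 3, 5]:
    model_e = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.01, max_depth=depth, random_state=42
    )
    # For classifiers, an integer cv uses stratified folds
    scores = cross_val_score(model_e, X_e, y_e, cv=5, scoring="accuracy")
    cv_results[depth] = (scores.mean(), scores.std())

for depth, (mean_acc, std_acc) in cv_results.items():
    print(f"max_depth={depth}: {mean_acc:.4f} +/- {std_acc:.4f}")
```

If the standard deviations overlap, the depths are hard to tell apart on this dataset.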


## Step 10: Feature Importance

**TODO:** Extract and display the top 10 most important features.

In [12]:
# TODO: Extract feature_importances_
# Note: gbc is the last model fitted in the Step 9 loop (max_depth=5, learning_rate=0.01)
feature_importance = pd.Series(gbc.feature_importances_, index=X.columns).sort_values(ascending=False)

# TODO: Display top 10 features
print(feature_importance.head(10))

mean concave points        0.703619
worst concave points       0.066779
worst radius               0.053195
worst perimeter            0.052379
worst texture              0.031160
mean texture               0.028864
concave points error       0.016205
fractal dimension error    0.010888
concavity error            0.009551
worst smoothness           0.009392
dtype: float64
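
The values above are impurity-based importances, which are computed from the training data. As a complementary check, this sketch uses permutation importance on the test set: each feature is shuffled in turn and the drop in score is recorded. The `_f` names and `n_repeats=10` are assumptions, and the model uses the Step 5 settings (scikit-learn's defaults).

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_f, y_f = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr_f, X_te_f, y_tr_f, y_te_f = train_test_split(
    X_f, y_f, test_size=0.2, random_state=42
)
model_f = GradientBoostingClassifier(random_state=42).fit(X_tr_f, y_tr_f)

# Shuffle each feature 10 times on the test set and average the score drop
perm = permutation_importance(
    model_f, X_te_f, y_te_f, n_repeats=10, random_state=42
)
perm_importance = pd.Series(
    perm.importances_mean, index=X_f.columns
).sort_values(ascending=False)
print(perm_importance.head(10))
```

Many features in this dataset are strongly correlated, so the two rankings may disagree. That is expected and does not make either one wrong.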


## Reflection Questions (Answer in Markdown)

1. How does learning rate affect model performance?
2. Why does Gradient Boosting prefer shallow trees?
3. When might Gradient Boosting overfit?
4. Compare this model conceptually with Random Forest.


**Answer 1:** The learning rate controls the contribution of each weak learner to the final model.

A small learning rate results in slow, incremental learning, usually improving generalization but requiring more trees.

A large learning rate speeds up learning but may cause unstable updates and overfitting.

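In practice, a lower learning rate combined with more estimators tends to give better and more stable performance. In the Step 8 experiment, however, all four learning rates reached the same test accuracy (0.95614), so this particular split does not show the effect.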
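
In the usual notation (the symbols below are introduced here and do not appear in the notebook), the model after stage $m$ is the previous model plus a scaled tree:

$$
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
$$

Here $h_m$ is the tree fitted at stage $m$ and $\nu$ corresponds to `learning_rate`. A smaller $\nu$ shrinks each tree's contribution, which is why more trees are typically needed to reach the same fit.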

**Answer 2:** Gradient Boosting is based on the idea of combining many weak learners.

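Shallow trees have low variance but high bias, so each one captures only simple patterns.

Each new tree is fit to the residual errors left by the trees before it.

Deep trees may overfit those residuals and dominate the ensemble. In the Step 9 experiment, test accuracy fell from 0.956140 at depths 2 and 3 to 0.938596 at depth 5.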

Thus, shallow trees ensure controlled, incremental learning and better generalization.
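
The residual-fitting idea can be sketched by hand. This is a simplified illustration, not scikit-learn's implementation: each shallow regression tree is fit to the residual `y - p` (the negative gradient of the log loss) and added with a fixed learning rate, whereas `GradientBoostingClassifier` also re-estimates the leaf values. All names below are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_j, y_j = load_breast_cancer(return_X_y=True)
X_tr_j, X_te_j, y_tr_j, y_te_j = train_test_split(
    X_j, y_j, test_size=0.2, random_state=42
)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_from_raw(y_true, raw):
    p = sigmoid(raw)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

learning_rate = 0.1
n_trees = 100

# Start from the log-odds of the positive class in the training data
p0 = y_tr_j.mean()
raw_tr = np.full(len(y_tr_j), np.log(p0 / (1 - p0)))
raw_te = np.full(len(y_te_j), np.log(p0 / (1 - p0)))
initial_loss = log_loss_from_raw(y_tr_j, raw_tr)

for _ in range(n_trees):
    # Residual = true label minus current predicted probability
    residual = y_tr_j - sigmoid(raw_tr)
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_tr_j, residual)
    # Each shallow tree nudges the raw scores a small step toward the labels
    raw_tr += learning_rate * tree.predict(X_tr_j)
    raw_te += learning_rate * tree.predict(X_te_j)

final_loss = log_loss_from_raw(y_tr_j, raw_tr)
manual_acc = np.mean((raw_te > 0).astype(int) == y_te_j)
print("Training log loss:", round(initial_loss, 4), "->", round(final_loss, 4))
print("Test accuracy of the hand-rolled ensemble:", round(manual_acc, 4))
```

No single depth-2 tree is a strong model here. The fit comes from accumulating many small corrections.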

**Answer 3:** Gradient Boosting may overfit when:

- The number of trees is too large
- The learning rate is high
- Trees are too deep
- The dataset is small or noisy
- No early stopping or regularization is used

Overfitting is typically observed as very high training accuracy but poor test performance.
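
Several of these risks can be limited through the estimator's own parameters. The sketch below assumes early stopping via `n_iter_no_change` and `validation_fraction`, plus row subsampling via `subsample`. The specific values are illustrative rather than tuned, and the `_g` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_g, y_g = load_breast_cancer(return_X_y=True)
X_tr_g, X_te_g, y_tr_g, y_te_g = train_test_split(
    X_g, y_g, test_size=0.2, random_state=42
)

model_g = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,            # each tree sees a random 80% of the rows
    validation_fraction=0.1,  # held out internally from the training data
    n_iter_no_change=10,      # stop after 10 stages without improvement
    random_state=42,
).fit(X_tr_g, y_tr_g)

train_acc_g = model_g.score(X_tr_g, y_tr_g)
test_acc_g = model_g.score(X_te_g, y_te_g)
print("Trees actually fitted:", model_g.n_estimators_)
print("Train accuracy:", round(train_acc_g, 4))
print("Test accuracy: ", round(test_acc_g, 4))
```

A large gap between the train and test accuracy printed here is the overfitting signal described above.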

**Answer 4:** Gradient Boosting builds trees sequentially, where each tree corrects the errors of the previous ones. It mainly reduces bias but is sensitive to hyperparameters.

Random Forest builds trees independently, each on a bootstrap sample with random feature subsets, and averages their predictions. It primarily reduces variance and is more robust to hyperparameter choices.

In summary, Gradient Boosting often achieves higher accuracy with careful tuning, while Random Forest is easier to use and less prone to overfitting.
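
For a quick side-by-side check, the sketch below fits both models on the same split used in this notebook. It assumes default hyperparameters apart from `random_state`, the `_h` names are hypothetical, and a single split on one dataset is not a general ranking of the two methods.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X_h, y_h = load_breast_cancer(return_X_y=True)
X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

models_h = {
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores_h = {}
for name, model in models_h.items():
    # Sequential boosting vs. independently grown, averaged trees
    scores_h[name] = model.fit(X_tr_h, y_tr_h).score(X_te_h, y_te_h)
    print(f"{name}: test accuracy = {scores_h[name]:.4f}")
```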