<a href="https://colab.research.google.com/github/Ovizero01/Machine-Leaning/blob/main/021_Gradient%20Boosting/021_Gradient%20Boosting%20Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 21: Gradient Boosting (Practice Notebook)
## Classification Task with TODO Blocks

In this practice notebook, you will implement **Gradient Boosting for classification** step by step.

### Learning Objectives
- Understand how Gradient Boosting works for classification
- Practice model training, prediction, and evaluation
- Explore the effect of key hyperparameters

**Important:** Complete all TODO blocks. Do not skip steps.


## Step 1: Import Required Libraries

**TODO:** Import NumPy, Pandas, Matplotlib, and required scikit-learn modules.

In [18]:
# TODO: Import necessary libraries
# Hint: numpy, pandas, matplotlib.pyplot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Hint: load dataset, train_test_split, GradientBoostingClassifier, metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Step 2: Load Dataset

We will use the **Breast Cancer Wisconsin dataset**, a standard binary classification dataset.

**TODO:** Load the dataset and separate features (X) and target (y).

In [4]:
# TODO: Load the breast cancer dataset
# Hint: sklearn.datasets.load_breast_cancer
# TODO: Assign features to X and labels to y
X, y = load_breast_cancer(return_X_y=True)

## Step 3: Inspect the Data

**TODO:** Display the first few rows of X and the distribution of y.

In [13]:
# TODO: View first 5 rows of X
feature_names = load_breast_cancer().feature_names
df = pd.DataFrame(X, columns=feature_names)
print(df.head())
# TODO: Check class distribution in y
print(np.unique(y, return_counts=True))

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst radius  worst texture  worst perimeter  \
0           
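The class distribution is easier to read when the integer labels are paired with their names. The sketch below is one way to do that; it uses the `target_names` attribute of the loaded dataset, and the variable names (`data`, `class_counts`) are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the full Bunch object so that target_names is available
data = load_breast_cancer()

# Count how many samples fall into each integer class (0 and 1)
counts = np.bincount(data.target)

# Pair each class name with its count
class_counts = {str(name): int(count) for name, count in zip(data.target_names, counts)}

for name, count in class_counts.items():
    print(f"{name}: {count} samples ({count / counts.sum():.1%})")
```

A moderately imbalanced target like this one is a reason to look at precision and recall in addition to accuracy later on.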

## Step 4: Train-Test Split

**TODO:** Split the dataset into training and testing sets.
- Use 80% of the data for training
- Set random_state for reproducibility

In [5]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
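As an optional variation, the split can be stratified so that both subsets keep roughly the same class ratio. This is a sketch, not what the cell above does: it adds the `stratify=y` argument, so the resulting split (and any scores computed on it) will differ from the results reported in this notebook. The `_s` suffixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the proportion of each class similar in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of samples with label 1 in each subset
train_pos_rate = y_train_s.mean()
test_pos_rate = y_test_s.mean()
print(f"Train size: {len(y_train_s)}, share of class 1: {train_pos_rate:.3f}")
print(f"Test size:  {len(y_test_s)}, share of class 1: {test_pos_rate:.3f}")
```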

## Step 5: Train Gradient Boosting Classifier

**TODO:** Initialize and train a GradientBoostingClassifier.

Suggested starting values:
- n_estimators = 100
- learning_rate = 0.1
- max_depth = 3

In [16]:
# TODO: Initialize GradientBoostingClassifier
gbr = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# TODO: Fit the model on training data
gbr.fit(X_train, y_train)
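To make the idea behind the fitted model more concrete, here is a deliberately simplified, hand-rolled boosting loop for binary log loss. It is a sketch under stated assumptions, not scikit-learn's implementation: each round fits a small regression tree to the residuals `y - p` (the negative gradient of the log loss) and adds a scaled copy of its prediction to the running score. scikit-learn additionally refines the leaf values, so its numbers will not match this loop exactly. All helper names (`sigmoid`, `log_loss_from_raw`, `raw_train`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def sigmoid(z):
    """Map raw scores (log-odds) to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_from_raw(y_true, raw):
    """Mean binary log loss computed from raw scores."""
    p = np.clip(sigmoid(raw), 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

learning_rate = 0.1
n_rounds = 100

# Start every sample at the log-odds of the positive class
p0 = y_train.mean()
init_raw = np.log(p0 / (1 - p0))
raw_train = np.full(len(y_train), init_raw)
raw_test = np.full(len(y_test), init_raw)

losses = [log_loss_from_raw(y_train, raw_train)]
for _ in range(n_rounds):
    # Residuals are the negative gradient of the log loss w.r.t. the raw score
    residuals = y_train - sigmoid(raw_train)
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_train, residuals)
    # Take a small step in the direction suggested by the new tree
    raw_train += learning_rate * tree.predict(X_train)
    raw_test += learning_rate * tree.predict(X_test)
    losses.append(log_loss_from_raw(y_train, raw_train))

# A positive raw score corresponds to a predicted probability above 0.5
manual_train_acc = np.mean((raw_train > 0).astype(int) == y_train)
manual_test_acc = np.mean((raw_test > 0).astype(int) == y_test)
print(f"Training log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"Train accuracy: {manual_train_acc:.3f}, test accuracy: {manual_test_acc:.3f}")
```

The `learning_rate` factor is what shrinks each tree's contribution, which is why a smaller value typically needs more rounds to reach the same training loss.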

## Step 6: Make Predictions

**TODO:** Predict class labels and class probabilities for the test set.

In [17]:
# TODO: Predict class labels
pred = gbr.predict(X_test)
# TODO: Predict class probabilities
y_pred_proba = gbr.predict_proba(X_test)
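The two kinds of prediction are closely related: for this classifier, the predicted label is the class with the highest predicted probability. The sketch below refits the same model under the name `clf` (an illustrative name) so that it runs on its own, and checks that relationship.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
clf.fit(X_train, y_train)

# predict_proba returns one column per class, ordered as in clf.classes_
proba = clf.predict_proba(X_test)
labels = clf.predict(X_test)

# Picking the most probable class reproduces the hard predictions
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print("Probability matrix shape:", proba.shape)
print("Labels match argmax of probabilities:", np.array_equal(labels, labels_from_proba))
```

Keeping the probabilities around is useful if you later want a decision threshold other than 0.5, for example to trade precision against recall.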

## Step 7: Model Evaluation

**TODO:** Evaluate the model using:
- Accuracy
- Confusion Matrix
- Classification Report

In [19]:
# TODO: Calculate accuracy score
acc = accuracy_score(y_test, pred)
print("Accuracy:", acc)
# TODO: Print confusion matrix
cm = confusion_matrix(y_test, pred)
print("Confusion Matrix:\n", cm)
# TODO: Print classification report
print("Classification Report:\n", classification_report(y_test, pred))

Accuracy: 0.956140350877193
Confusion Matrix:
 [[40  3]
 [ 2 69]]
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
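The numbers in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes that matrix and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[40, 3],
               [2, 69]])

# For a binary problem, ravel() yields TN, FP, FN, TP with class 1 as "positive"
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Metrics for class 1
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Metrics for class 0 (treat class 0 as the "positive" class instead)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(f"Accuracy: {accuracy:.4f}")
print(f"Class 1: precision={precision_1:.2f}, recall={recall_1:.2f}, f1={f1_1:.2f}")
print(f"Class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

Rounded to two decimals, these values agree with the report: 0.96 / 0.97 / 0.97 for class 1 and 0.95 / 0.93 for class 0.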



## Step 8: Effect of Learning Rate (Experiment)

**TODO:** Train multiple models with different learning rates and compare accuracy.

Suggested learning rates: 0.01, 0.05, 0.1, 0.2

In [20]:
# TODO: Loop over different learning rates
learning_rates = [0.01, 0.05, 0.1, 0.2]
results = []
# TODO: Train model and store accuracy for each
for lr in learning_rates:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((lr, accuracy_score(y_test, preds)))
# TODO: Display results in a table
pd.DataFrame(results, columns=["Learning Rate", "Accuracy"])

|   | Learning Rate | Accuracy |
|---|---------------|----------|
| 0 | 0.01          | 0.95614  |
| 1 | 0.05          | 0.95614  |
| 2 | 0.10          | 0.95614  |
| 3 | 0.20          | 0.95614  |
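All four learning rates end at the same test accuracy after 100 trees, so the final score alone does not show how they differ. One way to look closer is `staged_predict`, which yields the predictions after each boosting stage. The sketch below compares only the smallest and largest rate; the dictionary name `staged_accuracy` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

staged_accuracy = {}
for lr in [0.01, 0.2]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=3, random_state=42
    )
    model.fit(X_train, y_train)
    # staged_predict yields test predictions after 1, 2, ..., n_estimators trees
    staged_accuracy[lr] = [
        accuracy_score(y_test, stage_pred)
        for stage_pred in model.staged_predict(X_test)
    ]

for lr, scores in staged_accuracy.items():
    print(
        f"learning_rate={lr}: "
        f"after 10 trees {scores[9]:.3f}, after 100 trees {scores[-1]:.3f}"
    )
```

Plotting each list against the number of trees (for example with `plt.plot`) gives a learning curve per rate.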


## Step 9: Effect of Tree Depth (Experiment)

**TODO:** Compare model performance for different tree depths.

Suggested depths: 1, 2, 3, 5

In [21]:
# TODO: Loop over max_depth values
depths = [1, 2, 3, 5]
results = []
# TODO: Train model and evaluate accuracy
for depth in depths:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((depth, accuracy_score(y_test, preds)))
# Label the first column by what it actually holds (tree depth, not learning rate)
pd.DataFrame(results, columns=["Max Depth", "Accuracy"])

|   | Max Depth | Accuracy |
|---|-----------|----------|
| 0 | 1         | 0.956140 |
| 1 | 2         | 0.956140 |
| 2 | 3         | 0.956140 |
| 3 | 5         | 0.964912 |

## Step 10: Feature Importance

**TODO:** Extract and display the top 10 most important features.

In [25]:
# TODO: Extract feature_importances_
feature_importance = gbr.feature_importances_
importance_df = pd.Series(
    feature_importance, index=feature_names
).sort_values(ascending=False)
# TODO: Display top 10 features
importance_df.head(10)

| Feature              | Importance |
|----------------------|------------|
| mean concave points  | 0.450528   |
| worst concave points | 0.240103   |
| worst radius         | 0.075589   |
| worst perimeter      | 0.051408   |
| worst texture        | 0.039886   |
| worst area           | 0.038245   |
| mean texture         | 0.027805   |
| worst concavity      | 0.018725   |
| concavity error      | 0.013068   |
| area error           | 0.008415   |
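The `feature_importances_` values are impurity-based and computed from the training data. As a cross-check, permutation importance measures how much the test score drops when a single feature is shuffled. The sketch below is one possible way to compute it with `sklearn.inspection.permutation_importance`; it refits the model as `clf` so that it runs on its own, and `n_repeats=10` is an arbitrary choice.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
clf.fit(X_train, y_train)

# Shuffle each feature several times and record the average drop in test accuracy
result = permutation_importance(
    clf, X_test, y_test, n_repeats=10, random_state=42
)

perm_importance = pd.Series(
    result.importances_mean, index=data.feature_names
).sort_values(ascending=False)

print(perm_importance.head(10))
```

The two rankings need not agree. Several features in this dataset are strongly correlated (radius, perimeter, and area, for instance), and shuffling one of them may barely hurt the model if another carries similar information.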


## Reflection Questions (Answer in Markdown)

1. How does learning rate affect model performance?
2. Why does Gradient Boosting prefer shallow trees?
3. When might Gradient Boosting overfit?
4. Compare this model conceptually with Random Forest.


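1. The learning rate scales how much each new tree contributes when correcting the errors of the previous trees. A smaller learning rate usually generalizes better but needs more trees to reach the same fit, while a larger learning rate fits faster but can overfit or become unstable. In Step 8, all four rates reached the same test accuracy (0.95614) with 100 trees, so on this dataset the difference does not show up in the final accuracy alone.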

2. Gradient Boosting prefers shallow trees because they act as weak learners: each one makes a small correction to the remaining errors without overfitting on its own, which lets the ensemble build a strong, accurate model gradually.

3. Gradient Boosting can overfit when the model uses too many trees, very deep trees, or a high learning rate, so that it fits noise in the training data rather than general patterns.

4. Gradient Boosting builds trees sequentially, with each tree correcting the errors of the ensemble so far, which often makes it accurate but more prone to overfitting and more sensitive to hyperparameters. Random Forest builds trees independently on bootstrap samples and averages their predictions, which makes it more robust and less likely to overfit.
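The last two answers can be explored empirically. The sketch below is one possible experiment, not part of the original exercise: it trains the baseline Gradient Boosting model, a variant with early stopping (`n_iter_no_change`, which holds out `validation_fraction` of the training data and stops adding trees once the validation score stops improving), and a `RandomForestClassifier` on the same split. The specific settings (500 trees, patience of 10 rounds) are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    # Baseline: the same settings as in Step 5
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    # Allow up to 500 trees, but stop once 10 rounds bring no validation gain
    "Gradient Boosting (early stopping)": GradientBoostingClassifier(
        n_estimators=500, learning_rate=0.1, max_depth=3,
        validation_fraction=0.1, n_iter_no_change=10, random_state=42
    ),
    # Independent trees on bootstrap samples, predictions averaged
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

comparison = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    comparison[name] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }
    print(f"{name}: train={comparison[name]['train']:.3f}, "
          f"test={comparison[name]['test']:.3f}")

# n_estimators_ reports how many trees early stopping actually kept
trees_used = models["Gradient Boosting (early stopping)"].n_estimators_
print("Trees kept by early stopping:", trees_used)
```

A single train-test split is a noisy basis for comparing models; repeating the comparison with cross-validation would give a more reliable picture.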