# Gradient Boosting (Practice Notebook)
## Classification Task with TODO Blocks

In this practice notebook, you will implement **Gradient Boosting for classification** step by step.

### Learning Objectives
- Understand how Gradient Boosting works for classification
- Practice model training, prediction, and evaluation
- Explore the effect of key hyperparameters

**Important:** Complete all TODO blocks. Do not skip steps.


## Step 1: Import Required Libraries

**TODO:** Import NumPy, Pandas, Matplotlib, and required scikit-learn modules.

In [38]:
# TODO: Import necessary libraries
# Hint: numpy, pandas, matplotlib.pyplot
# Hint: load dataset, train_test_split, GradientBoostingClassifier, metrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

## Step 2: Load Dataset

We will use the **Breast Cancer Wisconsin dataset**, a standard binary classification dataset.

**TODO:** Load the dataset and separate features (X) and target (y).

In [14]:
# TODO: Load the breast cancer dataset
# Hint: sklearn.datasets.load_breast_cancer
# TODO: Assign features to X and labels to y
data = load_breast_cancer(as_frame=True)

X = data.data
y = data.target
print(X.shape)
print(y.shape)

(569, 30)
(569,)


## Step 3: Inspect the Data

**TODO:** Display the first few rows of X and the distribution of y.

In [26]:
# TODO: View first 5 rows of X
# TODO: Check class distribution in y
display(X.head())
display(y.head())
print(data.target_names)

X.info()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


| | target |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |


['malignant' 'benign']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smo
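
The second TODO in this step asks for the class distribution of `y`, which `y.head()` alone does not show. Below is a minimal sketch of one way to check it, assuming the same `load_breast_cancer(as_frame=True)` call as in Step 2. The names `class_counts` and `class_share` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
y = data.target

# Count samples per class label (0 = malignant, 1 = benign in this dataset)
class_counts = y.value_counts().sort_index()

# Replace the numeric labels with the readable names from data.target_names
class_counts.index = [data.target_names[i] for i in class_counts.index]

# Share of each class, useful for spotting imbalance
class_share = class_counts / class_counts.sum()

print(class_counts)
print(class_share.round(3))
```

If one class dominated heavily, accuracy alone would be a weak metric; here the split is moderately imbalanced, so accuracy and F1 are both reasonable to report.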

## Step 4: Train-Test Split

**TODO:** Split the dataset into training and testing sets.
- Use 80% of the data for training
- Set random_state for reproducibility

In [27]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
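
As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both subsets. The sketch below illustrates this under the assumption that the same 80/20 split and `random_state=42` are used; the `_s` variable names are hypothetical, and the results reported elsewhere in this notebook come from the unstratified split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# stratify=y asks the splitter to preserve the class ratio of y in both subsets
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of class 1 (benign) overall, in the training set, and in the test set
print(round(y.mean(), 3), round(y_tr_s.mean(), 3), round(y_te_s.mean(), 3))
print(X_tr_s.shape, X_te_s.shape)
```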

## Step 5: Train Gradient Boosting Classifier

**TODO:** Initialize and train a GradientBoostingClassifier.

Suggested starting values:
- n_estimators = 100
- learning_rate = 0.1
- max_depth = 3

In [33]:
# TODO: Initialize GradientBoostingClassifier
# TODO: Fit the model on training data
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gbc.fit(X_train, y_train)
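
To connect the three hyperparameters above to what happens during training, here is a deliberately simplified sketch of gradient boosting for binary classification. It assumes log loss, starts from the log-odds of the training labels, and fits each regression tree to the residuals `y - p`. This is an illustration only: scikit-learn's `GradientBoostingClassifier` differs in details (for example, how leaf values are set), and helper names such as `sigmoid`, `F_tr`, and `F_te` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)


def sigmoid(z):
    """Map raw scores (log-odds) to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


n_estimators, learning_rate, max_depth = 100, 0.1, 3

# Initial prediction: constant log-odds of the positive class in the training set
p0 = y_tr.mean()
f0 = np.log(p0 / (1.0 - p0))
F_tr = np.full(len(y_tr), f0)
F_te = np.full(len(y_te), f0)

for _ in range(n_estimators):
    # Negative gradient of log loss with respect to the raw score
    residual = y_tr - sigmoid(F_tr)
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    tree.fit(X_tr, residual)
    # Each tree nudges the raw score; learning_rate shrinks the step
    F_tr += learning_rate * tree.predict(X_tr)
    F_te += learning_rate * tree.predict(X_te)

manual_acc = np.mean((sigmoid(F_te) > 0.5) == y_te)
print(f"Simplified boosting test accuracy: {manual_acc:.3f}")
```

The loop makes the roles visible: `n_estimators` is the number of correction steps, `learning_rate` scales each step, and `max_depth` limits how complex each individual correction can be.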

## Step 6: Make Predictions

**TODO:** Predict class labels and class probabilities for the test set.

In [34]:
# TODO: Predict class labels
# TODO: Predict class probabilities
y_pred = gbc.predict(X_test)
y_proba = gbc.predict_proba(X_test)
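
The relationship between the two outputs can be checked directly: for a binary problem, `predict` should agree with picking the more probable column of `predict_proba`. The sketch below retrains the same model so it runs on its own; the 0.7 threshold and the names `labels_from_proba` and `y_pred_strict` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbc.fit(X_train, y_train)

y_pred = gbc.predict(X_test)
y_proba = gbc.predict_proba(X_test)  # one column per class, ordered as gbc.classes_

# Recover labels from probabilities by taking the most probable class
labels_from_proba = gbc.classes_[np.argmax(y_proba, axis=1)]
agreement = np.mean(labels_from_proba == y_pred)
print("Agreement with predict():", agreement)

# A stricter, hypothetical threshold for predicting class 1 (benign)
threshold = 0.7
y_pred_strict = (y_proba[:, 1] >= threshold).astype(int)
print("Predicted benign at 0.5 vs 0.7:", y_pred.sum(), y_pred_strict.sum())
```

Keeping `y_proba` around is what makes threshold tuning possible; `predict` alone fixes the decision rule.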

## Step 7: Model Evaluation

**TODO:** Evaluate the model using:
- Accuracy
- Confusion Matrix
- Classification Report

In [37]:
# TODO: Calculate accuracy score
# TODO: Print confusion matrix
# TODO: Print classification report

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


0.956140350877193
[[40  3]
 [ 2 69]]
              precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
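
To see where the numbers in the classification report come from, the per-class metrics can be recomputed by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). This is a plain-Python sketch; the helper `precision_recall_f1` is written for this example and is not a scikit-learn function.

```python
# Confusion matrix copied from the output above
cm = [[40, 3],
      [2, 69]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision_recall_f1(cm, k):
    """Compute precision, recall, and F1 for class k from a confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                             # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in (0, 1):
    p, r, f = precision_recall_f1(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The printed values match the report: 0.95 / 0.93 / 0.94 for class 0, 0.96 / 0.97 / 0.97 for class 1, and an accuracy of 0.9561 (109 of 114 test samples correct).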



## Step 8: Effect of Learning Rate (Experiment)

**TODO:** Train multiple models with different learning rates and compare accuracy.

Suggested learning rates: 0.01, 0.05, 0.1, 0.2

In [42]:
# TODO: Loop over different learning rates
# TODO: Train model and store accuracy for each
# TODO: Display results in a table
# Effect of learning rate
learning_rates = [0.01, 0.05, 0.1, 0.2]
results = []

for lr in learning_rates:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42,
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((lr, f1_score(y_test, preds), accuracy_score(y_test, preds)))

pd.DataFrame(results, columns=["Learning Rate", "F1 Score", "Accuracy Score"])

| | Learning Rate | F1 Score | Accuracy Score |
|---|---|---|---|
| 0 | 0.01 | 0.965035 | 0.95614 |
| 1 | 0.05 | 0.965035 | 0.95614 |
| 2 | 0.1 | 0.965035 | 0.95614 |
| 3 | 0.2 | 0.965035 | 0.95614 |
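
All four learning rates give identical accuracy and F1 here, which may simply mean that hard-label metrics on 114 test samples are too coarse to separate them. A probability-based metric such as log loss, tracked across boosting stages with `staged_predict_proba`, may reveal differences. The sketch below is one possible way to do this; `lr_losses` and `best_stage` are illustrative names, and no particular outcome is assumed.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

lr_losses = {}
for lr in [0.01, 0.05, 0.1, 0.2]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=3, random_state=42
    )
    model.fit(X_train, y_train)

    # staged_predict_proba yields test probabilities after 1, 2, ..., 100 trees
    stage_losses = [
        log_loss(y_test, proba) for proba in model.staged_predict_proba(X_test)
    ]
    # Number of trees at which the test log loss was lowest
    best_stage = min(range(len(stage_losses)), key=stage_losses.__getitem__) + 1
    lr_losses[lr] = (stage_losses[-1], best_stage)

for lr, (final_loss, best_stage) in lr_losses.items():
    print(f"lr={lr}: final log loss={final_loss:.4f}, lowest at {best_stage} trees")
```

Reading the final loss together with the best stage shows how `learning_rate` and `n_estimators` trade off: a small rate may still be improving at 100 trees, while a large rate may have passed its best point earlier.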


## Step 9: Effect of Tree Depth (Experiment)

**TODO:** Compare model performance for different tree depths.

Suggested depths: 1, 2, 3, 5

In [44]:
# TODO: Loop over max_depth values
# TODO: Train model and evaluate accuracy
depths = [1, 2, 3, 5]
results = []

for depth in depths:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42,
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((depth, f1_score(y_test, preds), accuracy_score(y_test, preds)))

pd.DataFrame(results, columns=["Depth", "F1 Score", "Accuracy Score"])

| | Depth | F1 Score | Accuracy Score |
|---|---|---|---|
| 0 | 1 | 0.965035 | 0.95614 |
| 1 | 2 | 0.965035 | 0.95614 |
| 2 | 3 | 0.965035 | 0.95614 |
| 3 | 5 | 0.972222 | 0.964912 |
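
On a test set of 114 samples, the gap between 0.95614 and 0.964912 corresponds to a single additional correct prediction (109 vs 110), so it is weak evidence that depth 5 is better. Cross-validation gives a less noisy comparison. The following sketch assumes 5-fold cross-validation on the full dataset; `cv_results` is an illustrative name and the fold count is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

cv_results = {}
for depth in [1, 2, 3, 5]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=depth, random_state=42
    )
    # Accuracy on each of 5 held-out folds
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[depth] = (scores.mean(), scores.std())

for depth, (mean_acc, std_acc) in cv_results.items():
    print(f"max_depth={depth}: accuracy={mean_acc:.4f} +/- {std_acc:.4f}")
```

If the standard deviations overlap across depths, the differences in the single-split table are probably not meaningful.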


## Step 10: Feature Importance

**TODO:** Extract and display the top 10 most important features.

In [46]:
# TODO: Extract feature_importances_
# TODO: Display top 10 features
feature_importance = gbc.feature_importances_

importance_df = pd.Series(
    feature_importance, index=X.columns
).sort_values(ascending=False)

importance_df.head(10)

| Feature | Importance |
|---|---|
| mean concave points | 0.450528 |
| worst concave points | 0.240103 |
| worst radius | 0.075589 |
| worst perimeter | 0.051408 |
| worst texture | 0.039886 |
| worst area | 0.038245 |
| mean texture | 0.027805 |
| worst concavity | 0.018725 |
| concavity error | 0.013068 |
| area error | 0.008415 |
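
The values in `feature_importances_` are impurity-based, computed from the training process, and normalized to sum to 1. As a cross-check, permutation importance measures how much the test score drops when a single feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance` and retrains the same model so it runs on its own; `n_repeats=10` and the name `perm_df` are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbc.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the mean score drop
perm = permutation_importance(gbc, X_test, y_test, n_repeats=10, random_state=42)
perm_df = pd.Series(perm.importances_mean, index=X.columns).sort_values(
    ascending=False
)

print(perm_df.head(10))
print("Sum of impurity-based importances:", gbc.feature_importances_.sum())
```

Because many features in this dataset are strongly correlated (radius, perimeter, area), the two rankings may disagree; shuffling one feature can be compensated by its correlated neighbours.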


## Reflection Questions (Answer in Markdown)

1. How does learning rate affect model performance?
2. Why does Gradient Boosting prefer shallow trees?
3. When might Gradient Boosting overfit?
4. Compare this model conceptually with Random Forest.
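
As a starting point for questions 3 and 4, the sketch below trains a Gradient Boosting model and a Random Forest on the same split and prints training and test accuracy side by side. It is meant to supply observations to reason about, not an answer. The Random Forest settings (`n_estimators=100`, default depth) and the `comparison` dictionary are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    # Trees built sequentially, each correcting the errors of the ensemble so far
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
    ),
    # Trees built independently on bootstrap samples, then averaged
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

comparison = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    comparison[name] = (
        model.score(X_train, y_train),  # training accuracy
        model.score(X_test, y_test),    # test accuracy
    )

for name, (train_acc, test_acc) in comparison.items():
    print(f"{name}: train={train_acc:.4f} test={test_acc:.4f}")
```

A large gap between training and test accuracy is one common sign of overfitting; rerunning with more estimators or deeper trees shows how that gap moves for each method.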
