<a href="https://colab.research.google.com/github/DhimanTarafdar/breast-cancer-classification-gradient-boosting/blob/main/Gradient_Boosting_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 21: Gradient Boosting (Practice Notebook)
## Classification Task with TODO Blocks

In this practice notebook, you will implement **Gradient Boosting for classification** step by step.

### Learning Objectives
- Understand how Gradient Boosting works for classification
- Practice model training, prediction, and evaluation
- Explore the effect of key hyperparameters

**Important:** Complete all TODO blocks. Do not skip steps.


## Step 1: Import Required Libraries

**TODO:** Import NumPy, Pandas, Matplotlib, and required scikit-learn modules.

In [24]:
# TODO: Import necessary libraries
# Hint: numpy, pandas, matplotlib.pyplot
# Hint: load dataset, train_test_split, GradientBoostingClassifier, metrics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


## Step 2: Load Dataset

We will use the **Breast Cancer Wisconsin dataset**, a standard binary classification dataset.

**TODO:** Load the dataset and separate features (X) and target (y).

In [25]:
# TODO: Load the breast cancer dataset
# Hint: sklearn.datasets.load_breast_cancer
# TODO: Assign features to X and labels to y
data = load_breast_cancer()
X = data.data
y = data.target
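
Before interpreting any results, it helps to confirm which integer label stands for which diagnosis. A minimal check, assuming the standard scikit-learn copy of the dataset, where `target_names` lists the class names in label order (`label_to_name` is just an illustrative variable name):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that belongs to integer label i
label_to_name = {label: str(name) for label, name in enumerate(data.target_names)}
print(label_to_name)  # {0: 'malignant', 1: 'benign'}
```

So in every output of this notebook, class 0 is malignant and class 1 is benign.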

## Step 3: Inspect the Data

**TODO:** Display the first few rows of X and the distribution of y.

In [26]:
# TODO: View first 5 rows of X
# TODO: Check class distribution in y
X_df = pd.DataFrame(X, columns=data.feature_names)
X_df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


In [27]:
pd.Series(y).value_counts()

| label | count |
|---|---|
| 1 | 357 |
| 0 | 212 |


## Step 4: Train-Test Split

**TODO:** Split the dataset into training and testing sets.
- Use 80% data for training
- Set random_state for reproducibility

In [28]:
# TODO: Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
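
The classes are not balanced (357 vs. 212 samples), so a plain random split can shift the class ratio slightly between train and test. The sketch below is a variation on the cell above, not the split used for the results reported in this notebook: it passes `stratify=y` so both subsets keep roughly the same class proportions. The variable names `X_tr`, `X_te`, `y_tr`, `y_te` are chosen here only to avoid overwriting the notebook's own split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions (roughly) equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train class share:", np.bincount(y_tr) / len(y_tr))
print("test class share: ", np.bincount(y_te) / len(y_te))
```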

## Step 5: Train Gradient Boosting Classifier

**TODO:** Initialize and train a GradientBoostingClassifier.

Suggested starting values:
- n_estimators = 100
- learning_rate = 0.1
- max_depth = 3

In [29]:
# TODO: Initialize GradientBoostingClassifier
# TODO: Fit the model on training data
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbc.fit(X_train, y_train)

## Step 6: Make Predictions

**TODO:** Predict class labels and class probabilities for the test set.

In [30]:
# TODO: Predict class labels
# TODO: Predict class probabilities
y_pred = gbc.predict(X_test)
y_pred_proba = gbc.predict_proba(X_test)

print("Predict class labels", y_pred)
print("Predict class probabilities", y_pred_proba)

Predict class labels [1 0 0 1 1 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1 0
 1 1 0]
Predict class probabilities [[8.62720920e-04 9.99137279e-01]
 [9.99630807e-01 3.69192869e-04]
 [9.98355986e-01 1.64401405e-03]
 [4.37465574e-04 9.99562534e-01]
 [2.55373763e-04 9.99744626e-01]
 [9.99536029e-01 4.63970578e-04]
 [9.99608817e-01 3.91183471e-04]
 [9.87840763e-01 1.21592370e-02]
 [7.01340882e-01 2.98659118e-01]
 [9.17435088e-04 9.99082565e-01]
 [4.33124714e-03 9.95668753e-01]
 [9.98963594e-01 1.03640620e-03]
 [2.34099076e-03 9.97659009e-01]
 [9.67763139e-01 3.22368607e-02]
 [5.12605192e-04 9.99487395e-01]
 [9.99471440e-01 5.28559589e-04]
 [5.53719895e-04 9.99446280e-01]
 [3.37399868e-04 9.99662600e-01]
 [5.29010795e-04 9.99470989e-01]
 [9.99601847e-01 3.98153448e-04]
 [5.29329803e-03 9.94706702e-01]
 [4.52111825e-04 9.995478
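
The two outputs above are closely related: each row of `predict_proba` holds one probability per class, in the order given by `classes_`, and the predicted label is the class with the larger probability. A small sketch of that relationship, retraining the same model so the block runs on its own (`labels_from_proba` is an illustrative name):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Defaults match the notebook: n_estimators=100, learning_rate=0.1, max_depth=3
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
proba = gbc.predict_proba(X_test)

# Columns of predict_proba follow the order of gbc.classes_
labels_from_proba = gbc.classes_[np.argmax(proba, axis=1)]

print("classes_:", gbc.classes_)
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
print("matches predict():", np.array_equal(labels_from_proba, gbc.predict(X_test)))
```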

## Step 7: Model Evaluation

**TODO:** Evaluate the model using:
- Accuracy
- Confusion Matrix
- Classification Report

In [31]:
# TODO: Calculate accuracy score
# TODO: Print confusion matrix
# TODO: Print classification report

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("accuracy score", accuracy)
print("confusion matrix", cm)
print("classification report", report)

accuracy score 0.956140350877193
confusion matrix [[40  3]
 [ 2 69]]
classification report               precision    recall  f1-score   support

           0       0.95      0.93      0.94        43
           1       0.96      0.97      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114
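
A raw confusion matrix is easy to misread because the rows and columns carry no names. The sketch below attaches the dataset's class names to the matrix and computes recall for the malignant class (label 0), which is the share of malignant cases the model catches. It retrains the same model so it is self-contained; `cm_df` and `malignant_recall` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = gbc.predict(X_test)

# Rows are true classes, columns are predicted classes, both in label order (0, 1)
cm_df = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=[0, 1]),
    index=[f"true {n}" for n in data.target_names],
    columns=[f"pred {n}" for n in data.target_names],
)
print(cm_df)

# Recall for label 0 (malignant): fraction of malignant cases predicted as malignant
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print("malignant recall:", round(malignant_recall, 3))
```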



## Step 8: Effect of Learning Rate (Experiment)

**TODO:** Train multiple models with different learning rates and compare accuracy.

Suggested learning rates: 0.01, 0.05, 0.1, 0.2

In [32]:
# TODO: Loop over different learning rates
# TODO: Train model and store accuracy for each
# TODO: Display results in a table

learning_rates = [0.01, 0.05, 0.1, 0.2]
results = []

for lr in learning_rates:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((lr, accuracy_score(y_test, preds)))

pd.DataFrame(results, columns=["Learning Rate", "Accuracy"])

| | Learning Rate | Accuracy |
|---|---|---|
| 0 | 0.01 | 0.95614 |
| 1 | 0.05 | 0.95614 |
| 2 | 0.1 | 0.95614 |
| 3 | 0.2 | 0.95614 |


## Step 9: Effect of Tree Depth (Experiment)

**TODO:** Compare model performance for different tree depths.

Suggested depths: 1, 2, 3, 5

In [33]:
# TODO: Loop over max_depth values
# TODO: Train model and evaluate accuracy

max_depths = [1, 2, 3, 5]
results = []

for depth in max_depths:
    model = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42
    )
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results.append((depth, accuracy_score(y_test, preds)))

pd.DataFrame(results, columns=["Max Depth", "Accuracy"])


| | Max Depth | Accuracy |
|---|---|---|
| 0 | 1 | 0.95614 |
| 1 | 2 | 0.95614 |
| 2 | 3 | 0.95614 |
| 3 | 5 | 0.964912 |


## Step 10: Feature Importance

**TODO:** Extract and display the top 10 most important features.

In [34]:
# TODO: Extract feature_importances_
# TODO: Display top 10 features

feature_importance = gbc.feature_importances_

importance_df = pd.Series(
    feature_importance, index=data.feature_names
).sort_values(ascending=False)

importance_df.head(10)

| Feature | Importance |
|---|---|
| mean concave points | 0.450528 |
| worst concave points | 0.240103 |
| worst radius | 0.075589 |
| worst perimeter | 0.051408 |
| worst texture | 0.039886 |
| worst area | 0.038245 |
| mean texture | 0.027805 |
| worst concavity | 0.018725 |
| concavity error | 0.013068 |
| area error | 0.008415 |


## Step 11: Hyperparameter Tuning with Grid Search (Experiment)

In [23]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [20, 50, 100, 200],
    'learning_rate': [0.1, 0.3, 0.5, 0.7, 1.0],
    'max_depth': [1, 2, 3]
}

grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

grid.fit(X_train, y_train)

best_model = grid.best_estimator_
y_pred_grid = best_model.predict(X_test)

print("Best Parameters Found:")
print(grid.best_params_)
print("\nTest Accuracy (GridSearch):", accuracy_score(y_test, y_pred_grid))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_grid))

Best Parameters Found:
{'learning_rate': 0.5, 'max_depth': 1, 'n_estimators': 100}

Test Accuracy (GridSearch): 0.9473684210526315

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114
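
The grid search picks its winner by mean cross-validated accuracy on the training data, which is a different number from the test accuracy printed above. The sketch below shows one way to inspect this through `cv_results_` and `best_score_`. It deliberately uses a much smaller grid than the cell above so it runs quickly, so its winner need not match the one reported there; `cv_table` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small illustrative grid (4 combinations), not the grid used in the notebook
grid = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid={"learning_rate": [0.1, 0.5], "max_depth": [1, 3]},
    scoring="accuracy",
    cv=5,
).fit(X_train, y_train)

# One row per parameter combination, best cross-validated score first
cv_table = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(cv_table.to_string(index=False))

print("best CV accuracy:      ", grid.best_score_)
print("held-out test accuracy:", grid.score(X_test, y_test))
```

Comparing the spread in `std_test_score` with the gaps between candidates gives a rough sense of whether the ranking is meaningful or within fold-to-fold noise.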



## Reflection Questions (Answer in Markdown)

1. How does learning rate affect model performance?
2. Why does Gradient Boosting prefer shallow trees?
3. When might Gradient Boosting overfit?
4. Compare this model conceptually with Random Forest.


---

### 1. How does learning rate affect model performance?

Learning rate controls how much each tree contributes to the final prediction.  
A smaller learning rate means each tree has less influence, so the model learns slowly but more carefully. This usually produces better generalization but requires more trees.
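
In symbols, a common way to write this update is shown below. The notation ($F_m$, $h_m$, $\nu$) is introduced here only for illustration and is not used elsewhere in this notebook:

$$
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

Here $F_{m-1}$ is the ensemble after $m-1$ trees, $h_m$ is the new tree fitted to the negative gradient of the loss at the current predictions, and $\nu$ is the learning rate that scales the tree's contribution.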

A larger learning rate allows the model to learn faster, but it may overshoot the optimal solution and lead to instability or overfitting.

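In our experiment, all four learning rates gave the same test accuracy (0.95614, i.e. 109 of 114 test samples correct). One plausible explanation is that 100 trees were enough to compensate for the smaller step sizes on this dataset. Accuracy on 114 samples is also a coarse measure, so it can hide differences in the predicted probabilities.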
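
One way to look behind the identical final accuracies is to track test accuracy as trees are added, using `staged_predict`. This is a sketch under the same split and settings as the experiment in Step 8; the checkpoint values and the `curves` dictionary are choices made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

checkpoints = [10, 25, 50, 100]  # number of trees at which to report accuracy
curves = {}

for lr in [0.01, 0.1]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=3, random_state=42
    ).fit(X_train, y_train)
    # staged_predict yields test predictions after 1, 2, ..., n_estimators trees
    staged_acc = [accuracy_score(y_test, p) for p in model.staged_predict(X_test)]
    curves[lr] = {k: round(staged_acc[k - 1], 4) for k in checkpoints}

for lr, accs in curves.items():
    print(f"learning_rate={lr}: {accs}")
```

If the smaller learning rate needs more trees to reach the same accuracy, it shows up in the early checkpoints even when the final values coincide.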

---

### 2. Why does Gradient Boosting prefer shallow trees?

Gradient Boosting works best with shallow trees because it builds trees sequentially, where each new tree focuses on correcting the mistakes made by the previous ones.

Shallow trees (depth 1–3) are simple and help capture small patterns without overfitting. Since many trees are added together, each tree only needs to learn a small part of the overall relationship.

Deeper trees may memorize the training data and reduce the benefit of boosting. In our results, depths 1, 2 and 3 all reached 95.61% test accuracy, while depth 5 reached 96.49%, a difference of one test sample out of 114.
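
Test accuracy alone does not show memorization; comparing it with training accuracy does. The following sketch reuses the split and settings from Step 9 and adds depth 8 as an extra, assumed value to make any gap easier to see. `depth_df` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rows = []
for depth in [1, 3, 5, 8]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=depth, random_state=42
    ).fit(X_train, y_train)
    rows.append({
        "max_depth": depth,
        "train_acc": model.score(X_train, y_train),  # accuracy on data the model saw
        "test_acc": model.score(X_test, y_test),     # accuracy on held-out data
    })

depth_df = pd.DataFrame(rows)
print(depth_df.round(4))
```

A training accuracy at or near 1.0 combined with a flat or falling test accuracy is the usual sign that extra depth is being spent on memorization.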

---

### 3. When might Gradient Boosting overfit?

Gradient Boosting may overfit when:
- Too many trees are used
- The learning rate is too high
- Trees are too deep
- The dataset is small or noisy

Overfitting occurs when the model starts memorizing training data instead of learning general patterns. This can be controlled by using fewer estimators, lowering the learning rate, limiting tree depth, and applying techniques such as early stopping and cross-validation.
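
Early stopping is built into `GradientBoostingClassifier` through the `n_iter_no_change` and `validation_fraction` parameters. The sketch below is one possible configuration, not a tuned one: the upper bound of 500 trees, the 20% validation share, and the patience of 10 rounds are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=42,
).fit(X_train, y_train)

# n_estimators_ is the number of trees kept after early stopping
print("trees actually fitted:", model.n_estimators_)
print("test accuracy:", model.score(X_test, y_test))
```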

---

### 4. Compare this model conceptually with Random Forest.

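Random Forest builds many trees independently, so they can be trained in parallel, and combines their predictions by averaging or voting. Each tree is trained on a bootstrap sample of the rows and considers a random subset of features at each split, which increases diversity among the trees and reduces overfitting.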

Gradient Boosting, on the other hand, builds trees sequentially. Each new tree learns from the errors of the previous ones.

Random Forest is generally faster to train and more robust, while Gradient Boosting often achieves higher accuracy but requires careful tuning. Conceptually, Random Forest is like averaging opinions from independent experts, whereas Gradient Boosting is like improving step by step by learning from mistakes.
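
The conceptual comparison can be paired with a quick empirical one. The sketch below runs 5-fold cross-validation for both ensembles with 100 trees each and otherwise default settings; it is a rough side-by-side on this dataset, not a tuned benchmark, and `cv_scores` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each model on the full dataset
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.4f} std={scores.std():.4f}")
```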


## Model Performance Analysis and Key Decisions

### Overall Model Performance

Our Gradient Boosting Classifier achieved 95.61% accuracy on the held-out test set (114 samples) of the breast cancer dataset. In scikit-learn's encoding of this dataset, label 0 is malignant and label 1 is benign, so the model correctly identified 40 out of 43 malignant cases and 69 out of 71 benign cases. Five patients were misclassified in total.

Treating malignant as the positive class, the confusion matrix shows 3 false negatives (malignant classified as benign) and 2 false positives (benign classified as malignant). In medical diagnosis, false negatives are more dangerous because missing a cancer case is worse than a false alarm. The model misses 3 of the 43 malignant cases (recall 0.93 for class 0), which leaves room for improvement.

### Key Decisions Based on Experiments

**Learning Rate Selection:**
I tested learning rates from 0.01 to 0.2 with 100 trees, and all gave the same test accuracy of 95.61%. On this split, the model is therefore insensitive to the learning rate within that range. I would keep 0.1, the starting value suggested in Step 5. If I needed better results, I could try 0.05 with more trees.

**Tree Depth Selection:**
The experiments showed that shallow trees (depth 1-3) gave 95.61% accuracy, while depth 5 reached 96.49%. This is consistent with the idea that Gradient Boosting works well with shallow trees. I would select max_depth=3 for the final model because it is simpler and faster to train. The gain from depth 5 is 0.88 percentage points, which corresponds to a single additional correct prediction out of 114 test samples and is too small to distinguish from noise on one split.
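
A difference of one test sample is hard to interpret from a single split. One way to check it, sketched below, is repeated stratified cross-validation on the training data for the two depths in question. The choice of 5 folds and 2 repeats is an assumption made to keep the run short; `summary` is an illustrative name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5 folds repeated 2 times gives 10 accuracy estimates per depth
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

summary = {}
for depth in [3, 5]:
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=depth, random_state=42
    )
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    summary[depth] = (scores.mean(), scores.std())
    print(f"max_depth={depth}: mean={scores.mean():.4f} std={scores.std():.4f}")
```

If the two means differ by less than their standard deviations, the data does not support preferring one depth over the other on accuracy alone.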

**Feature Importance Insights:**
The model identified "mean concave points" as the most important feature (45% importance), followed by "worst concave points" (24%). This is plausible medically: the feature counts the concave portions of the cell nucleus contour, so it reflects how irregular the nucleus shape is. The most important size feature, "worst radius", ranks third at about 8%, and the texture features rank lower still. For this model, shape irregularity therefore carries more weight than size or texture, although impurity-based importances can be split unevenly between correlated features.
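
`feature_importances_` is computed from the training data and the tree splits. A complementary view, sketched here as an optional check rather than part of the original exercise, is permutation importance on the test set: shuffle one feature at a time and measure how much accuracy drops. The number of repeats (10) is an assumption; `perm` is an illustrative name.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Mean drop in test accuracy when each feature is shuffled, over 10 repeats
result = permutation_importance(
    gbc, X_test, y_test, n_repeats=10, random_state=42
)
perm = pd.Series(result.importances_mean, index=data.feature_names).sort_values(
    ascending=False
)
print(perm.head(10))
```

Features that rank high in both views are the safest ones to describe as important.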

### Final Recommendations

Based on these results, I would recommend deploying this model with the following configuration:
- n_estimators = 100
- learning_rate = 0.1
- max_depth = 3

With these parameters the model reached 95.61% accuracy on the held-out test set.

However, before using this in real medical diagnosis, we should collect more data and validate on different hospitals' datasets. The 3 missed malignant cases (malignant classified as benign; label 0 is malignant in this dataset) are concerning in a medical context, so we might want to adjust the classification threshold to reduce false negatives even if it increases false positives.
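
Adjusting the threshold can be sketched as follows. Instead of calling `predict`, which effectively uses a 0.5 cut-off, the code flags a case as malignant whenever the predicted probability of label 0 reaches a lower threshold. The thresholds 0.3 and 0.1 are arbitrary illustrations; in practice the value should be chosen on separate validation data, not on the test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
gbc = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Column 0 of predict_proba belongs to classes_[0], i.e. label 0 (malignant)
proba_malignant = gbc.predict_proba(X_test)[:, 0]

rows = []
for threshold in [0.5, 0.3, 0.1]:
    # Predict malignant (0) when its probability reaches the threshold, else benign (1)
    pred = np.where(proba_malignant >= threshold, 0, 1)
    missed_malignant = int(np.sum((y_test == 0) & (pred == 1)))
    false_alarms = int(np.sum((y_test == 1) & (pred == 0)))
    rows.append((threshold, missed_malignant, false_alarms))
    print(f"threshold={threshold}: missed malignant={missed_malignant}, "
          f"false alarms={false_alarms}")
```

Lowering the threshold can only keep or reduce the number of missed malignant cases, at the cost of keeping or increasing the number of false alarms.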

Once that validation is done, the model could assist doctors, but it should not replace human judgment. It can serve as a second opinion tool to flag suspicious cases for further examination.
