
# 🧪 Breast Cancer Detection (Jupyter Notebook)

This notebook builds a clean, end‑to‑end pipeline for **breast cancer detection** using the classic **Wisconsin Breast Cancer** dataset from `scikit-learn`.  
We'll go from **data loading → EDA → modeling → evaluation → interpretation → model export**.

**Models used:**
- Logistic Regression (with scaling)
- Random Forest (tree-based)

**What you'll learn/do:**
- Explore the data and its features
- Split data with stratification
- Build robust `scikit-learn` pipelines
- Evaluate with accuracy, precision, recall, F1, ROC AUC
- Plot ROC curves and confusion matrices
- Use cross‑validation
- Inspect feature importance & permutation importance
- Save and reuse the trained model



## 1) Setup & Imports


In [None]:

# (Optional) If running locally and you need to install:
# !pip install -U scikit-learn pandas matplotlib joblib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from sklearn.inspection import permutation_importance
import joblib

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

pd.set_option('display.max_columns', 100)



## 2) Load Dataset
We use `sklearn.datasets.load_breast_cancer`, which is bundled with scikit‑learn (no download needed).


In [None]:

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

X.head()



## 3) Quick EDA
Check shapes, class balance, and summary stats.


In [None]:

print("Shape X:", X.shape, "| Shape y:", y.shape)
print("\nClasses:", dict(zip(data.target_names, np.bincount(y))))
print("\nDescription:\n", data.DESCR.split('\n\n')[0])


In [None]:

# Class balance plot
class_counts = y.value_counts().sort_index()
labels = [data.target_names[i] for i in class_counts.index]

plt.figure()
plt.bar(labels, class_counts.values)
plt.title("Class Distribution")
plt.xlabel("Class")
plt.ylabel("Count")
plt.show()


In [None]:

# Summary statistics
X.describe()


In [None]:

# Simple correlation heatmap (no seaborn) for the first 15 features (to keep readable)
corr = X.iloc[:, :15].corr()
plt.figure(figsize=(8, 6))
plt.imshow(corr, aspect='auto', interpolation='nearest')
plt.colorbar(label='Correlation')
plt.title("Correlation Heatmap (first 15 features)")
plt.xticks(ticks=np.arange(corr.shape[1]), labels=corr.columns, rotation=90)
plt.yticks(ticks=np.arange(corr.shape[0]), labels=corr.index)
plt.tight_layout()
plt.show()



## 4) Train/Test Split
Passing `stratify=y` keeps the malignant/benign ratio the same in the training and test sets.


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

X_train.shape, X_test.shape
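
To see what stratification buys, the sketch below compares the benign fraction in the full dataset, the training split, and the test split. It reloads the data so it runs on its own, and the `strat_*` names and `benign_fraction` helper are hypothetical, chosen so they do not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data under separate names so the notebook's X and y stay untouched.
strat_data = load_breast_cancer()
strat_X, strat_y = strat_data.data, strat_data.target

_, _, strat_y_train, strat_y_test = train_test_split(
    strat_X, strat_y, test_size=0.2, random_state=42, stratify=strat_y
)

def benign_fraction(labels):
    """Share of samples labelled 1 (benign in this dataset)."""
    return float(np.mean(labels == 1))

full_frac = benign_fraction(strat_y)
train_frac = benign_fraction(strat_y_train)
test_frac = benign_fraction(strat_y_test)

print(f"benign fraction | full: {full_frac:.3f}  train: {train_frac:.3f}  test: {test_frac:.3f}")
```

With `stratify=y` the three fractions should agree closely; dropping the argument lets them drift apart by chance, which matters more on small datasets like this one.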



## 5) Models
We'll compare two baselines:

- **Logistic Regression** in a pipeline with **StandardScaler**  
- **Random Forest** (no scaling needed)


In [None]:

log_reg_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=RANDOM_STATE))
])

rf_clf = RandomForestClassifier(
    n_estimators=300, max_depth=None, random_state=RANDOM_STATE, n_jobs=-1
)

log_reg_clf, rf_clf



## 6) Train the Models


In [None]:

log_reg_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

print("Done.")



## 7) Evaluation
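We report common metrics and show ROC curves and confusion matrices. Note that scikit-learn treats label 1 as the positive class by default, and in this dataset label 1 is `benign` (label 0 is `malignant`). The precision, recall, and F1 printed below therefore describe benign detection, and the ROC curve uses the predicted probability of `benign`.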


In [None]:

def evaluate_model(name, model, X_tr, y_tr, X_te, y_te):
    y_pred = model.predict(X_te)
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_te)[:, 1]
    else:
        # For safety; most classifiers here have predict_proba
        y_proba = None

    metrics = {
        "accuracy": accuracy_score(y_te, y_pred),
        "precision": precision_score(y_te, y_pred),
        "recall": recall_score(y_te, y_pred),
        "f1": f1_score(y_te, y_pred),
    }
    if y_proba is not None:
        metrics["roc_auc"] = roc_auc_score(y_te, y_proba)

    print(f"\n=== {name} ===")
    for k, v in metrics.items():
        print(f"{k:>10}: {v:.4f}")
    print("\nClassification report:\n", classification_report(y_te, y_pred, target_names=data.target_names))

    # ROC Curve
    if y_proba is not None:
        fpr, tpr, _ = roc_curve(y_te, y_proba)
        plt.figure()
        plt.plot(fpr, tpr, label=f"{name} (AUC={metrics.get('roc_auc', float('nan')):.3f})")
        plt.plot([0, 1], [0, 1], linestyle='--')
        plt.xlabel("False Positive Rate")
        plt.ylabel("True Positive Rate")
        plt.title(f"ROC Curve — {name}")
        plt.legend(loc="lower right")
        plt.show()

    # Confusion matrix
    cm = confusion_matrix(y_te, y_pred)
    plt.figure()
    plt.matshow(cm, fignum=0)
    plt.title(f"Confusion Matrix — {name}")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    for (i, j), val in np.ndenumerate(cm):
        plt.text(j, i, int(val), ha='center', va='center')
    plt.show()

evaluate_model("Logistic Regression", log_reg_clf, X_train, y_train, X_test, y_test)
evaluate_model("Random Forest", rf_clf, X_train, y_train, X_test, y_test)
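
For a detection task, the metrics for the malignant class are usually the ones of interest. The following is a minimal sketch, not part of the original notebook, that recomputes precision, recall, and ROC AUC with `malignant` as the positive class. It refits a small Logistic Regression pipeline so it is self-contained; all `mal_*` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mal_data = load_breast_cancer()
# Look the label up by name instead of hard-coding it (it is 0 in this dataset).
MALIGNANT = list(mal_data.target_names).index("malignant")

mal_X_tr, mal_X_te, mal_y_tr, mal_y_te = train_test_split(
    mal_data.data, mal_data.target, test_size=0.2, random_state=42, stratify=mal_data.target
)

mal_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=42)),
])
mal_clf.fit(mal_X_tr, mal_y_tr)
mal_pred = mal_clf.predict(mal_X_te)

# predict_proba columns follow classes_, so find the malignant column explicitly.
mal_col = list(mal_clf.classes_).index(MALIGNANT)
mal_proba = mal_clf.predict_proba(mal_X_te)[:, mal_col]

precision_mal = precision_score(mal_y_te, mal_pred, pos_label=MALIGNANT)
recall_mal = recall_score(mal_y_te, mal_pred, pos_label=MALIGNANT)
# Binarize the target so that "malignant" is the positive outcome for the AUC.
auc_mal = roc_auc_score(mal_y_te == MALIGNANT, mal_proba)

print(f"precision (malignant): {precision_mal:.4f}")
print(f"recall    (malignant): {recall_mal:.4f}")
print(f"roc_auc   (malignant): {auc_mal:.4f}")
```

Recall for the malignant class is the share of malignant tumours the model catches, which is the number a false-negative-averse reader will want. For a binary problem the ROC AUC is the same whichever class is treated as positive, as long as the scores match that class.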



## 8) Cross‑Validation
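Stratified 5‑fold cross‑validation on the training set. `cross_val_score` fits a fresh clone of each model on every fold, so the models trained above are left unchanged.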


In [None]:

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

for name, model in [
    ("Logistic Regression", log_reg_clf),
    ("Random Forest", rf_clf),
]:
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc", n_jobs=-1)
    print(f"{name} CV ROC AUC: mean={scores.mean():.4f} ± {scores.std():.4f}")
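
`cross_val_score` reports one metric at a time. If several metrics per fold are wanted, `cross_validate` accepts a list of scorer names. The sketch below is an illustration under assumptions: it runs on the full dataset so that it is self-contained (the notebook itself cross-validates on the training split), and the `cv_*` names are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cv_data = load_breast_cancer()

cv_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=42)),
])

cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

# Returns a dict with one "test_<metric>" array per scorer, one value per fold.
cv_results = cross_validate(
    cv_pipe, cv_data.data, cv_data.target, cv=cv_folds, scoring=cv_metrics
)

for metric in cv_metrics:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric:>10}: mean={fold_scores.mean():.4f} ± {fold_scores.std():.4f}")
```

Because the scaler sits inside the pipeline, it is refitted on each fold's training portion, so no statistics from the held-out fold leak into preprocessing.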



## 9) Feature Importance & Interpretation
We inspect:
- Logistic Regression coefficients (absolute magnitude)
- Random Forest `feature_importances_`
- Permutation importance (on the test set)


In [None]:

# Logistic Regression coefficients. The model is fitted on standardized features,
# so coefficient magnitudes are comparable across features.
log_reg = log_reg_clf.named_steps["model"]

coef = pd.Series(np.abs(log_reg.coef_[0]), index=X.columns).sort_values(ascending=False)
coef.head(10)


In [None]:

plt.figure()
coef.head(15).iloc[::-1].plot(kind="barh")
plt.title("Top 15 | Logistic Regression | |coef|")
plt.xlabel("|Coefficient|")
plt.tight_layout()
plt.show()
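
Absolute magnitudes hide the direction of each effect. As a hedged sketch, the block below keeps the sign: a positive coefficient raises the log-odds of label 1 (`benign`), a negative one pushes toward `malignant`. It refits a pipeline on the full dataset purely for illustration, and the `coef_*`, `signed_coef`, and `direction` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

coef_data = load_breast_cancer()

coef_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500, random_state=42)),
]).fit(coef_data.data, coef_data.target)

# Signed coefficients, indexed by feature name.
signed_coef = pd.Series(
    coef_pipe.named_steps["model"].coef_[0], index=coef_data.feature_names
)

# Rank by magnitude, but keep the sign for interpretation.
order = signed_coef.abs().sort_values(ascending=False).index
top_signed = signed_coef.loc[order].head(10)

# Label 1 is benign, so positive coefficients push predictions toward benign.
direction = top_signed.apply(lambda c: "benign" if c > 0 else "malignant")

print(pd.DataFrame({"coef": top_signed, "pushes_toward": direction}))
```

Many features in this dataset are strongly correlated, so individual coefficients should be read as a rough guide rather than as isolated effects.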


In [None]:

# Random Forest Feature Importances
rf_importance = pd.Series(rf_clf.feature_importances_, index=X.columns).sort_values(ascending=False)
rf_importance.head(10)


In [None]:

plt.figure()
rf_importance.head(15).iloc[::-1].plot(kind="barh")
plt.title("Top 15 | Random Forest | Feature Importance")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()


In [None]:

# Permutation importance (on test set) for the best performing model (choose RF by default here)
perm = permutation_importance(rf_clf, X_test, y_test, n_repeats=20, random_state=RANDOM_STATE, n_jobs=-1)
perm_importance = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
perm_importance.head(10)


In [None]:

plt.figure()
perm_importance.head(15).iloc[::-1].plot(kind="barh")
plt.title("Top 15 | Permutation Importance (RF on test set)")
plt.xlabel("Mean decrease in accuracy")
plt.tight_layout()
plt.show()



## 10) Save the Trained Model
Export the Random Forest (change to Logistic Regression if that scores better for you).


In [None]:

# Relative path: the file is written to the notebook's working directory
model_path = "breast_cancer_model.joblib"
joblib.dump(rf_clf, model_path)
model_path
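
Reusing the saved model is a matter of calling `joblib.load` on the same file. The round trip below is a minimal sketch: it trains a small stand-in forest and writes to a temporary directory so it does not touch the notebook's model file. The `rt_*` names and the file name `demo_model.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

rt_data = load_breast_cancer()

# A small stand-in model; in the notebook this would be the fitted rf_clf.
rt_model = RandomForestClassifier(n_estimators=50, random_state=42)
rt_model.fit(rt_data.data, rt_data.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    rt_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(rt_model, rt_path)       # serialize the fitted estimator
    rt_restored = joblib.load(rt_path)   # read it back into memory

# The restored model should reproduce the original predictions exactly.
rt_same = np.array_equal(
    rt_model.predict(rt_data.data), rt_restored.predict(rt_data.data)
)
print("Predictions identical after reload:", rt_same)
```

joblib files are pickle-based, so only load files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.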



## 11) Simple Inference
Create a synthetic sample from the **feature means** and predict.


In [None]:

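# Build a single sample using the mean of each feature (for demonstration).
# A one-row DataFrame keeps the feature names the model was fitted with.
x_new = X.mean().to_frame().T
pred_class = rf_clf.predict(x_new)[0]

# Label 0 is 'malignant' and label 1 is 'benign' in this dataset, so look up
# the malignant column through classes_ instead of assuming index 1.
malignant_label = list(data.target_names).index("malignant")
malignant_col = list(rf_clf.classes_).index(malignant_label)
pred_proba = rf_clf.predict_proba(x_new)[0, malignant_col]

print(f"Predicted class: {data.target_names[pred_class]}")
print(f"Predicted probability of 'malignant': {pred_proba:.4f}")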



## 12) Notes & Responsible Use

- This is an educational demo using a small, clean benchmark dataset.  
- Real clinical workflows require rigorous validation, calibration, bias checks, and regulatory approval.  
- Do not use this model for real medical decisions.
