<a href="https://colab.research.google.com/github/IBhupen/ybi-foundation-project/blob/main/Cancer_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cancer Prediction (Breast Cancer Wisconsin Dataset)

This notebook builds machine-learning models to predict whether a tumor is **malignant** or **benign** using the classic Breast Cancer Wisconsin dataset from scikit-learn.

**Contents**
1. Setup & Data Load
2. Exploratory Data Analysis (EDA)
3. Preprocessing & Train/Test Split
4. Baseline Model (Logistic Regression)
5. Additional Models (kNN, Random Forest, SVM)
6. Cross-Validation
7. Evaluation (Accuracy, Precision, Recall, F1, ROC-AUC)
8. Feature Importance & Interpretability (Permutation Importance)
9. Save Trained Model

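**How to run:** Run the cells in order, top to bottom; later cells reuse variables defined earlier. The notebook needs NumPy, pandas, matplotlib, scikit-learn, and joblib, all of which are preinstalled on Google Colab.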

## 1) Setup & Data Load

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay,
    classification_report,
)
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import joblib

data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')
target_names = list(data.target_names)
X.head()

## 2) Exploratory Data Analysis (EDA)

In [None]:
# Basic shape and info
print('Shape:', X.shape)
print('\nMissing values per column (should be zeros):\n', X.isna().sum().sort_values(ascending=False).head())
display(X.describe().T)

In [None]:
# Class balance
classes, counts = np.unique(y, return_counts=True)
for c, cnt in zip(classes, counts):
    print(f'Class {c} ({target_names[c]}): {cnt}')
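
The class counts also give a reference point for every later score: the accuracy of a classifier that always predicts the more frequent class. The following is a minimal sketch of that majority-class baseline. It reloads the dataset so it runs on its own, and the variable names (`counts`, `majority_class`, `baseline_accuracy`) are illustrative rather than part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# Count samples per class; index 0 = malignant, 1 = benign.
counts = np.bincount(y)
majority_class = int(np.argmax(counts))

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = counts[majority_class] / counts.sum()

for label, count in zip(data.target_names, counts):
    print(f"{label}: {count}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A model whose accuracy sits close to this figure has probably learned little beyond the class prior.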

## 3) Preprocessing & Train/Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print('Train shape:', X_train.shape, ' Test shape:', X_test.shape)
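
Passing `stratify=y` is meant to keep the malignant/benign ratio roughly equal in both subsets. One way to confirm this, sketched here as a standalone check with illustrative variable names, is to compare the fraction of class 1 (benign) in the full data, the training split, and the test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same split settings as the notebook cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# y is 0/1, so its mean is the fraction of class 1 (benign).
overall_frac = y.mean()
train_frac = y_train.mean()
test_frac = y_test.mean()

print(f"Benign fraction, full data: {overall_frac:.3f}")
print(f"Benign fraction, train:     {train_frac:.3f}")
print(f"Benign fraction, test:      {test_frac:.3f}")
```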

## 4) Baseline Model — Logistic Regression

In [None]:
logreg = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500))
])
logreg.fit(X_train, y_train)
y_pred_lr = logreg.predict(X_test)
y_prob_lr = logreg.predict_proba(X_test)[:, 1]
print('LogReg Accuracy:', accuracy_score(y_test, y_pred_lr))
print(classification_report(y_test, y_pred_lr, target_names=target_names))

## 5) Additional Models — kNN, Random Forest, SVM

In [None]:
knn = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier(n_neighbors=7))
])
rf = RandomForestClassifier(n_estimators=300, random_state=42)
svm = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf', probability=True, C=2.0, gamma='scale', random_state=42))
])

models = {
    'LogisticRegression': logreg,
    'kNN': knn,
    'RandomForest': rf,
    'SVM': svm,
}

# Fit every model on the training split (logreg is refit and ends up the same).
for name, model in models.items():
    model.fit(X_train, y_train)
print('All models trained.')

## 6) Cross-Validation

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
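# cross_val_score clones each estimator, so the models fitted above are left untouched.
# Folds are drawn from the full dataset (X, y), which includes the held-out test rows.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
    cv_scores[name] = (scores.mean(), scores.std())  # (mean accuracy, std across folds)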
cv_scores
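
Accuracy alone can hide differences between models on this kind of task. As a hedged sketch, `cross_validate` can report several metrics in one pass. The example below runs only on the training split, on the assumption that the test rows should stay out of model comparison. That is a design choice, not something the notebook itself does. It uses the logistic-regression pipeline as a stand-in for any of the models.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate accepts a list of scorer names and returns one array per metric,
# stored under the key 'test_<metric>'.
metrics = ['accuracy', 'f1', 'roc_auc']
cv_results = cross_validate(model, X_train, y_train, cv=cv, scoring=metrics)

# Summarise each metric as (mean, std) across the five folds.
summary = {
    m: (cv_results[f'test_{m}'].mean(), cv_results[f'test_{m}'].std())
    for m in metrics
}
for m, (mean, std) in summary.items():
    print(f"{m}: {mean:.3f} +/- {std:.3f}")
```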

## 7) Evaluation — Metrics & ROC Curves

In [None]:
def evaluate(model, name):
    """Score `model` on the test split, plot its confusion matrix and ROC curve, return the metrics."""
    y_pred = model.predict(X_test)
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test)[:, 1]
    elif hasattr(model, 'decision_function'):
        # No predict_proba: rank-scale the decision scores into (0, 1].
        # Ranking keeps the ordering, so the ROC curve and AUC are unchanged.
        scores = model.decision_function(X_test)
        y_prob = (pd.Series(scores).rank(method='average') / len(scores)).values
    else:
        # No continuous score at all: constant scores give an uninformative ROC (AUC = 0.5).
        y_prob = np.zeros_like(y_pred, dtype=float)
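    acc = accuracy_score(y_test, y_pred)
    # precision, recall and F1 use the default pos_label=1, i.e. the benign class.
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    try:
        auc = roc_auc_score(y_test, y_prob)
    except ValueError:
        # Raised when y_test contains a single class, where AUC is undefined.
        auc = float('nan')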
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
    disp.plot()
    plt.title(f'Confusion Matrix — {name}')
    plt.show()
    # ROC curve: one figure per model
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.figure()
    plt.plot(fpr, tpr, label=f'{name}')
    plt.plot([0,1],[0,1],'--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'ROC Curve — {name} (AUC={auc:.3f})')
    plt.legend()
    plt.show()
    return {'model': name, 'accuracy': acc, 'precision': prec, 'recall': rec, 'f1': f1, 'roc_auc': auc}

results = []
for name, model in models.items():
    results.append(evaluate(model, name))
results_df = pd.DataFrame(results)
results_df.sort_values('f1', ascending=False)
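
Because the dataset encodes malignant as 0 and benign as 1, the default `pos_label=1` makes the precision, recall, and F1 figures above describe the benign class. If missed malignant tumors are the main concern, the same functions can be pointed at class 0. The sketch below assumes that framing and uses the logistic-regression pipeline as a stand-in; names such as `recall_malignant` and `missed_malignant` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
]).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Default pos_label=1 scores the benign class.
recall_benign = recall_score(y_test, y_pred)
# pos_label=0 scores the malignant class instead.
recall_malignant = recall_score(y_test, y_pred, pos_label=0)
precision_malignant = precision_score(y_test, y_pred, pos_label=0)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
missed_malignant = cm[0, 1]  # malignant tumors predicted as benign

print(f"Recall (benign):       {recall_benign:.3f}")
print(f"Recall (malignant):    {recall_malignant:.3f}")
print(f"Precision (malignant): {precision_malignant:.3f}")
print(f"Malignant cases missed: {missed_malignant}")
```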

## 8) Feature Importance (Permutation)

In [None]:
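# Permutation importance for the RandomForest, measured on the held-out test split.
# rf was already fitted in section 5; refitting with the same random_state gives the same model.
rf.fit(X_train, y_train)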
r = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
importances = pd.Series(r.importances_mean, index=X.columns).sort_values(ascending=False)
display(importances.head(15))
plt.figure()
importances.head(15)[::-1].plot(kind='barh')
plt.title('Top 15 Features (Permutation Importance) — RandomForest')
plt.tight_layout()
plt.show()
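
Permutation importance is the average drop in score when one feature column is shuffled, so each value comes with a spread across the repeats. As a sketch, the result's `importances_std` can be shown next to the mean to judge how stable the ranking is. The example uses 100 trees instead of 300 only to keep it quick, and the `table` layout is illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fewer trees than the notebook's 300, purely for speed in this sketch.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)

# One row per feature: mean drop in accuracy and its standard deviation over repeats.
table = pd.DataFrame(
    {'mean': result.importances_mean, 'std': result.importances_std},
    index=X.columns,
).sort_values('mean', ascending=False)

print(table.head(10))
```

Several of the 30 features are strongly correlated (radius, perimeter, and area, for example), so shuffling one of them may cost the model little and the importance can end up shared across the group. Small or near-zero values are best read with that in mind.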

## 9) Save Trained Model

In [None]:
# Select the model with the highest F1 on the test split (from results_df above).
best_model_name = results_df.sort_values('f1', ascending=False).iloc[0]['model']
best_model = models[best_model_name]
joblib.dump(best_model, 'best_cancer_model.joblib')
print('Best model saved as best_cancer_model.joblib')
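
A saved model is only useful if it can be restored and gives the same predictions. The following is a minimal round-trip sketch with `joblib.load`. It trains a small stand-in pipeline and writes to a temporary directory so that it runs on its own; in the notebook, the file would be `best_cancer_model.joblib` in the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stand-in for the notebook's best_model.
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
]).fit(X_train, y_train)

# Save and reload inside a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'best_cancer_model.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored pipeline includes the fitted scaler, so raw features can be passed in.
predictions_match = np.array_equal(model.predict(X_test), restored.predict(X_test))
print('Predictions identical after reload:', predictions_match)
```

Because the pipeline is saved whole, the scaler travels with the classifier. Such files are generally meant to be loaded with the same scikit-learn version that wrote them, and only from a trusted source.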

---
### Notes
- Dataset source: `sklearn.datasets.load_breast_cancer()`
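- Target classes: 0 = malignant, 1 = benign (per `data.target_names`). Since `precision_score`, `recall_score`, and `f1_score` default to `pos_label=1`, the metrics in section 7 treat benign as the positive class.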
- Hyperparameters are fixed by hand throughout; tune them with `GridSearchCV` or `RandomizedSearchCV` if needed.
- The notebook keeps charts simple (matplotlib, single-chart figures).
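
As an illustration of the tuning note above, here is a small `GridSearchCV` sketch for the SVM pipeline. The grid values and the choice of F1 as the selection metric are assumptions made for the example, not settings taken from the notebook. The `clf__` prefix routes each parameter to the pipeline step named `clf`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC(kernel='rbf')),
])

# Illustrative grid: 3 values of C x 2 values of gamma = 6 candidates.
param_grid = {
    'clf__C': [0.5, 2.0, 8.0],
    'clf__gamma': ['scale', 0.01],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Search on the training split only; the best candidate is refit on all of it.
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='f1')
search.fit(X_train, y_train)

# score() applies the same F1 scorer to the refit best estimator.
test_score = search.score(X_test, y_test)

print('Best parameters:', search.best_params_)
print(f'Best cross-validated F1: {search.best_score_:.3f}')
print(f'F1 on the test split:    {test_score:.3f}')
```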