## Import Libraries
We begin by importing all the required Python libraries for this project.  
This includes:
- **Data handling:** pandas, numpy  
- **Modeling & evaluation:** scikit-learn classifiers, pipelines, cross-validation  
- **Saving artifacts:** joblib, json  
- **Dataset access:** ucimlrepo for fetching the Breast Cancer dataset  

Keeping all imports in the first cell makes the notebook's dependencies visible at a glance and surfaces a missing package before any work is done.

In [1]:
import os
import json
import joblib
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

from ucimlrepo import fetch_ucirepo

## Fetch Dataset
We fetch the **Breast Cancer Wisconsin (Original)** dataset directly from the UCI ML Repository using `ucimlrepo`.  
This returns:
- **X**: a DataFrame of 9 cytological features  
- **y**: the target column (benign vs malignant)  
- Metadata and variable descriptions for context

In [2]:
breast_cancer = fetch_ucirepo(id=15)
X = breast_cancer.data.features.copy()
y = breast_cancer.data.targets.squeeze().copy()

print("Raw shapes:", X.shape, y.shape)
print(breast_cancer.metadata["name"])
display(breast_cancer.variables.head())

Raw shapes: (699, 9) (699,)
Breast Cancer Wisconsin (Original)


| | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Sample_code_number | ID | Categorical | | | | no |
| 1 | Clump_thickness | Feature | Integer | | | | no |
| 2 | Uniformity_of_cell_size | Feature | Integer | | | | no |
| 3 | Uniformity_of_cell_shape | Feature | Integer | | | | no |
| 4 | Marginal_adhesion | Feature | Integer | | | | no |


## Clean Data and Split
The target classes are encoded as `2 = benign` and `4 = malignant`.  
We remap them to `0 = benign` and `1 = malignant`, so malignant is the positive class.  
We also coerce every feature column to numeric (unparseable entries become `NaN`), then split the dataset into:
- **Training set (80%)**  
- **Test set (20%)**, stratified to preserve class proportions.

In [3]:
# Convert class {2,4} -> {0,1}; ensure numeric; stratified split.
y = y.replace({2: 0, 4: 1})

# Coerce all feature columns to numeric (ucimlrepo typically already does this).
for c in X.columns:
    X[c] = pd.to_numeric(X[c], errors="coerce")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", X_train.shape, " Test:", X_test.shape)
print("Target balance in train:\n", y_train.value_counts(normalize=True).rename({0:"benign",1:"malignant"}))

Train: (559, 9)  Test: (140, 9)
Target balance in train:
 Class
benign       0.654741
malignant    0.345259
Name: proportion, dtype: float64
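
To see what `stratify=y` buys us, the sketch below repeats the split on synthetic labels. The class counts (458 benign, 241 malignant) are not printed by the notebook; they are an assumption derived from the train proportions above and the test-set supports reported later. `X_demo` is a placeholder feature, not real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts implied by this notebook's outputs
# (assumed: 458 benign = 0, 241 malignant = 1, 699 rows in total).
y_demo = np.array([0] * 458 + [1] * 241)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the malignant share stays close to the overall share
# in both partitions.
print("sizes:", len(y_tr), len(y_te))
print("malignant share (all / train / test):",
      round(y_demo.mean(), 4), round(y_tr.mean(), 4), round(y_te.mean(), 4))
```

Without `stratify`, the malignant share in a 140-row test set could drift by a few percentage points from one random seed to the next.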


## Model Selection
We compare several candidate classifiers using a unified pipeline with 5-fold cross-validation:
- Logistic Regression  
- Random Forest  
- Gradient Boosting  
- Support Vector Classifier (RBF kernel)  

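Each pipeline starts with a **SimpleImputer** (`most_frequent` strategy) that fills missing values; because it sits inside the pipeline, it is refit on the training portion of every fold.  
We select the model with the highest mean cross-validation accuracy.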
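
As a small illustration of how numeric coercion and imputation fit together, here is a sketch on a toy column. The column name `toy_feature` and its values are hypothetical, and the `"?"` placeholder is an assumption about how a missing entry might appear in raw data.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column: four parseable values and one placeholder.
toy = pd.DataFrame({"toy_feature": ["1", "10", "?", "1", "3"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["toy_feature"] = pd.to_numeric(toy["toy_feature"], errors="coerce")
n_missing = int(toy["toy_feature"].isna().sum())

# most_frequent replaces NaN with the column mode, which is 1.0 here.
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)

print("missing after coercion:", n_missing)
print("imputed column:", filled.ravel().tolist())
```

The mode is a reasonable fill value for small integer scales like these, since it always yields a value that actually occurs in the data.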

In [4]:
candidates = {
    "log_reg": LogisticRegression(max_iter=500),
    "rf": RandomForestClassifier(n_estimators=300, random_state=42),
    "gb": GradientBoostingClassifier(random_state=42),
    "svc_rbf": SVC(kernel="rbf", probability=True, random_state=42)
}

results = []
best_name, best_score, best_pipe = None, -1.0, None

for name, est in candidates.items():
    pipe = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("model", est),
    ])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
    mean_acc, std_acc = scores.mean(), scores.std()
    results.append({"model": name, "cv_mean_acc": mean_acc, "cv_std": std_acc})
    if mean_acc > best_score:
        best_name, best_score, best_pipe = name, mean_acc, pipe

pd.DataFrame(results).sort_values("cv_mean_acc", ascending=False)

| | model | cv_mean_acc | cv_std |
|---|---|---|---|
| 0 | log_reg | 0.969611 | 0.013333 |
| 1 | rf | 0.967825 | 0.010671 |
| 3 | svc_rbf | 0.967793 | 0.018398 |
| 2 | gb | 0.964237 | 0.007932 |
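
The four scores are close. The quick check below uses only the numbers from the table above and compares the gap between the best and worst model with the fold-to-fold spread. It is a rough sanity check, not a statistical test.

```python
# (mean accuracy, standard deviation across folds), copied from the table above.
cv_results = {
    "log_reg": (0.969611, 0.013333),
    "rf": (0.967825, 0.010671),
    "svc_rbf": (0.967793, 0.018398),
    "gb": (0.964237, 0.007932),
}

best = max(cv_results, key=lambda name: cv_results[name][0])
worst = min(cv_results, key=lambda name: cv_results[name][0])

gap = cv_results[best][0] - cv_results[worst][0]
smallest_std = min(std for _, std in cv_results.values())

print(f"best={best}, worst={worst}, gap={gap:.6f}")
print("gap smaller than the smallest fold std:", gap < smallest_std)
```

Because the gap is smaller than even the tightest standard deviation, the ranking could plausibly change with a different fold assignment. Picking logistic regression is still defensible, since it is the simplest of the four.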


## Train and Evaluate Best Model
We fit the selected pipeline on the training set and evaluate it on the test set.  
Outputs include:
- Test accuracy  
- Precision, recall, and F1-score for both benign and malignant cases

In [5]:
best_pipe.fit(X_train, y_train)
y_pred = best_pipe.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

print(f"Selected model: {best_name} | CV mean acc: {best_score:.4f}")
print(f"Test Accuracy: {test_acc:.4f}\n")
print(classification_report(y_test, y_pred, target_names=["benign","malignant"]))

Selected model: log_reg | CV mean acc: 0.9696
Test Accuracy: 0.9500

              precision    recall  f1-score   support

      benign       0.96      0.97      0.96        92
   malignant       0.94      0.92      0.93        48

    accuracy                           0.95       140
   macro avg       0.95      0.94      0.94       140
weighted avg       0.95      0.95      0.95       140
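
To make the report easier to read, the sketch below recomputes its metrics from confusion-matrix counts. The notebook does not print a confusion matrix; the counts are inferred from the rounded report (7 errors among 140 test rows) and should be treated as an assumption.

```python
# Inferred counts (assumption): rows are true classes, columns are predictions.
tn, fp = 89, 3   # true benign: predicted benign / predicted malignant
fn, tp = 4, 44   # true malignant: predicted benign / predicted malignant

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Malignant is the positive class (label 1).
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)
f1_malignant = (2 * precision_malignant * recall_malignant
                / (precision_malignant + recall_malignant))

# For the benign class, the roles of the counts are mirrored.
precision_benign = tn / (tn + fn)
recall_benign = tn / (tn + fp)

print(f"accuracy={accuracy:.2f}")
print(f"malignant: precision={precision_malignant:.2f}, "
      f"recall={recall_malignant:.2f}, f1={f1_malignant:.2f}")
print(f"benign: precision={precision_benign:.2f}, recall={recall_benign:.2f}")
```

Under these counts, the four false negatives (malignant cases predicted benign) account for the malignant recall of 0.92, which is usually the figure that matters most in a screening setting.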



## Save Model and Schema
We persist:
- **Trained model pipeline** as `model.pkl`  
- **Schema** (`schema.json`) containing:
  - Feature order  
  - Target label mapping  
  - Model name and metrics  

These artifacts will be used by the Flask web app for predictions.

In [6]:
os.makedirs("app/templates", exist_ok=True)

model_path = "app/model.pkl"
schema_path = "app/schema.json"

joblib.dump(best_pipe, model_path)

schema = {
    "feature_order": list(X.columns),   # the input order the app will use
    "target_labels": {0: "benign", 1: "malignant"},  # JSON stores these keys as "0" and "1"
    "selected_model": best_name,
    "cv_accuracy": float(best_score),
    "test_accuracy": float(test_acc)
}

with open(schema_path, "w") as f:
    json.dump(schema, f, indent=2)

model_path, schema_path

('app/model.pkl', 'app/schema.json')
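
One detail matters when the schema is read back: JSON object keys are always strings, so the integer keys in `target_labels` do not survive the round trip. The sketch below shows this with the standard library only; `pred` is an illustrative value, not a model output.

```python
import json

labels = {0: "benign", 1: "malignant"}

# Serialise and parse again, as happens between the notebook and the app.
round_tripped = json.loads(json.dumps(labels))
print(round_tripped)  # {'0': 'benign', '1': 'malignant'}

pred = 1  # illustrative integer prediction
by_int = round_tripped.get(pred)                  # None: the int key is gone
by_str = round_tripped.get(str(pred), "unknown")  # 'malignant'
print(by_int, by_str)
```

Any consumer of `schema.json` therefore has to look labels up by `str(pred)`.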

## Generate Flask Application
We create a minimal Flask web app that:
- Loads the trained pipeline (`model.pkl`) and schema (`schema.json`)
- Renders a form for entering the nine cytological features (1–10 scale)
- Returns a prediction (benign/malignant), the predicted class probability when the model exposes `predict_proba`, and model/accuracy info

The app file will be written to `app/app.py`. This keeps the serving logic separate from the notebook.

In [7]:
app_py = r"""from flask import Flask, render_template, request
import joblib, json, numpy as np, os

app = Flask(__name__)

MODEL_PATH = os.path.join(os.path.dirname(__file__), "model.pkl")
SCHEMA_PATH = os.path.join(os.path.dirname(__file__), "schema.json")

pipe = joblib.load(MODEL_PATH)
with open(SCHEMA_PATH) as f:
    schema = json.load(f)

FEATURES = schema["feature_order"]
LABELS = schema.get("target_labels", {0:"benign",1:"malignant"})

@app.route("/", methods=["GET"])
def index():
    return render_template("index.html", features=FEATURES)

@app.route("/predict", methods=["POST"])
def predict():
    try:
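        # form.get(..., type=float) yields None for a missing or non-numeric field;
        # np.array stores None as NaN, which the pipeline's imputer then fills in.
        vals = [request.form.get(feat, type=float) for feat in FEATURES]
        x = np.array(vals, dtype=float).reshape(1, -1)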
        pred = int(pipe.predict(x)[0])
        proba = getattr(pipe, "predict_proba", None)
        prob = float(proba(x)[0, pred]) if proba else None
        label = LABELS.get(str(pred), LABELS.get(pred, "unknown"))
        return render_template(
            "result.html",
            label=label,
            prob=prob,
            model_name=schema.get("selected_model"),
            test_acc=schema.get("test_accuracy")
        )
    except Exception as e:
        return render_template("error.html", error=str(e)), 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
"""
with open("app/app.py", "w", encoding="utf-8") as f:
    f.write(app_py)

"app/app.py written."

'app/app.py written.'
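
The form parsing in `predict()` can be tried outside Flask. The sketch below uses Werkzeug's `MultiDict` as a stand-in for `request.form` (an assumption made for illustration; Flask's form object is a subclass with the same `get` behaviour) and only three of the nine feature names.

```python
import numpy as np
from werkzeug.datastructures import MultiDict

# Subset of the feature names, for brevity.
FEATURES = ["Clump_thickness", "Uniformity_of_cell_size", "Marginal_adhesion"]

# Stand-in form data: one valid value, one non-numeric value, one field absent.
form = MultiDict({"Clump_thickness": "5", "Uniformity_of_cell_size": "abc"})

# Same parsing as in predict(): failed conversions fall back to None.
vals = [form.get(feat, type=float) for feat in FEATURES]
x = np.array(vals, dtype=float).reshape(1, -1)

print(vals)  # [5.0, None, None]
print(np.isnan(x).sum(), "value(s) would be imputed")
```

Bad input therefore does not raise an error by itself; it is imputed silently. The HTML form guards against this with `required` and `type="number"`, but a client posting directly to `/predict` bypasses those checks.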

## Create HTML Templates
We create three Jinja2 templates under `app/templates/`:
- `index.html`: feature input form
- `result.html`: prediction output (with optional confidence)
- `error.html`: friendly error page

In [8]:
index_html = """<!doctype html>
<html>
  <head>
    <meta charset="utf-8"/>
    <title>Breast Cancer Predictor</title>
    <style>
      body { font-family: system-ui, Arial, sans-serif; max-width: 760px; margin: 40px auto; }
      .grid { display: grid; grid-template-columns: 1fr 1fr; gap: 12px; }
      input, button { padding: 10px; font-size: 1rem; width: 100%; box-sizing: border-box; }
      button { cursor: pointer; }
      label { font-size: .95rem; }
    </style>
  </head>
  <body>
    <h1>Breast Cancer (Original) — Prediction</h1>
    <p>Enter the nine cytological features (1–10 scale):</p>
    <form method="POST" action="/predict">
      <div class="grid">
        {% for f in features %}
          <label>
            {{ f.replace('_',' ').title() }}<br/>
            <input name="{{ f }}" type="number" min="1" max="10" step="1" required>
          </label>
        {% endfor %}
      </div>
      <p><button type="submit">Predict</button></p>
    </form>
  </body>
</html>
"""

result_html = """<!doctype html>
<html>
  <head><meta charset="utf-8"><title>Result</title></head>
  <body style="font-family:system-ui,Arial,sans-serif;max-width:760px;margin:40px auto;">
    <h1>Prediction Result</h1>
    <p><strong>Prediction:</strong> {{ label|capitalize }}</p>
    {% if prob is not none %}
      <p><strong>Confidence:</strong> {{ (prob*100)|round(2) }}%</p>
    {% endif %}
    <p><em>Model:</em> {{ model_name }} | <em>Held-out accuracy:</em> {{ (test_acc*100)|round(2) }}%</p>
    <p><a href="/">← Back</a></p>
  </body>
</html>
"""

error_html = """<!doctype html>
<html>
  <head><meta charset="utf-8"><title>Error</title></head>
  <body style="font-family:system-ui,Arial,sans-serif;max-width:760px;margin:40px auto;">
    <h1>Error</h1>
    <pre>{{ error }}</pre>
    <p><a href="/">← Back</a></p>
  </body>
</html>
"""

with open("app/templates/index.html","w",encoding="utf-8") as f: f.write(index_html)
with open("app/templates/result.html","w",encoding="utf-8") as f: f.write(result_html)
with open("app/templates/error.html","w",encoding="utf-8") as f: f.write(error_html)

"Templates written."

'Templates written.'
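
The template expressions can be checked without starting the server. This sketch renders the label and confidence expressions from `index.html` and `result.html` with Jinja2 directly; the probability value is illustrative, not a model output.

```python
from jinja2 import Template

# Expression used for the field labels in index.html.
label_tpl = Template("{{ f.replace('_', ' ').title() }}")

# Confidence line from result.html, reduced to the conditional part.
conf_tpl = Template(
    "{% if prob is not none %}{{ (prob * 100)|round(2) }}%{% endif %}"
)

field_label = label_tpl.render(f="Uniformity_of_cell_size")
with_prob = conf_tpl.render(prob=0.875)    # illustrative probability
without_prob = conf_tpl.render(prob=None)  # model without predict_proba

print(field_label)         # Uniformity Of Cell Size
print(with_prob)           # 87.5%
print(repr(without_prob))  # ''
```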

## Deployment Files (Heroku + GitHub Actions)
We generate the files needed to deploy the app to Heroku with CI/CD:
- `Procfile`: tells Heroku how to run the app with Gunicorn
- `requirements.txt`: Python dependencies
- `runtime.txt`: pins the Python version Heroku uses
- `.gitignore`: ignore caches and notebook checkpoints
- `.github/workflows/deploy.yml`: GitHub Actions to deploy on push to `main`

Before pushing, set the GitHub repo secrets:
- `HEROKU_API_KEY` (your Heroku account API key)
- `HEROKU_APP_NAME` (the name of your Heroku app)
- `HEROKU_EMAIL` (your Heroku account email)

In [9]:
procfile = "web: gunicorn app.app:app\n"
requirements = "\n".join([
    "flask",
    "gunicorn",
    "pandas",
    "numpy",
    "scikit-learn",
    "joblib",
    "ucimlrepo"
]) + "\n"
runtime = "python-3.11.9\n"
gitignore = "\n".join([
    "__pycache__/",
    ".ipynb_checkpoints/",
    "*.pyc",
    ".env"
]) + "\n"

workflow = r"""name: CI/CD to Heroku

on:
  push:
    branches: [ "main" ]
  workflow_dispatch:

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.11"

    - name: Install deps
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt

    - name: Smoke check artifacts
      run: |
        python - << 'PY'
        import os, json
        assert os.path.exists("app/model.pkl"), "Missing app/model.pkl. Run the training cells and commit artifacts."
        assert os.path.exists("app/schema.json"), "Missing app/schema.json."
        print("Artifacts present.")
        PY

    - name: Deploy to Heroku
      uses: akhileshns/heroku-deploy@v3.12.12
      with:
        heroku_api_key: ${{ secrets.HEROKU_API_KEY }}
        heroku_app_name: ${{ secrets.HEROKU_APP_NAME }}
        heroku_email: ${{ secrets.HEROKU_EMAIL }}
        usedocker: false
"""

with open("Procfile","w") as f: f.write(procfile)
with open("requirements.txt","w") as f: f.write(requirements)
with open("runtime.txt","w") as f: f.write(runtime)
with open(".gitignore","w") as f: f.write(gitignore)

os.makedirs(".github/workflows", exist_ok=True)
with open(".github/workflows/deploy.yml","w") as f: f.write(workflow)

"Deployment files written."

'Deployment files written.'
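
The `requirements.txt` above leaves versions unpinned, so Heroku may install a different scikit-learn than the one that pickled `model.pkl`, and scikit-learn does not guarantee that pickles load across versions. One possible mitigation, sketched here with a hypothetical helper `pin`, is to write the versions installed in the training environment.

```python
from importlib.metadata import PackageNotFoundError, version


def pin(packages):
    """Return 'name==version' for installed packages, bare names otherwise."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here: leave unpinned rather than guess a version.
            lines.append(name)
    return lines


packages = ["flask", "gunicorn", "pandas", "numpy",
            "scikit-learn", "joblib", "ucimlrepo"]
pinned_requirements = "\n".join(pin(packages)) + "\n"
print(pinned_requirements)
```

Writing `pinned_requirements` to `requirements.txt` instead of the unpinned list would keep the serving environment aligned with the one used for training.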