# Project Metadata & Setup

---

## **Project Title:** **Early Breast Cancer Diagnosis using Machine Learning (Terminal-Based Prototype)**

---

### **Objective:**

Develop a lightweight, terminal-driven diagnostic prototype that predicts breast tumor malignancy using the built-in Breast Cancer Wisconsin dataset.
The goal is to simulate how a clinician or technician might use a fast, interpretable tool, without a full GUI or web app.


### **Dataset Description:**

* **Name:** Breast Cancer Wisconsin (Diagnostic)
* **Source:** `sklearn.datasets.load_breast_cancer()`
* **Samples:** 569
* **Features:** 30 numerical (e.g. radius, texture, symmetry)
* **Target:** Binary classification — `malignant (0)` vs. `benign (1)`



### **Stakeholders:**

| Stakeholder          | Interest / Use Case                                              |
| -------------------- | ---------------------------------------------------------------- |
| **Clinicians**       | Fast, interpretable predictions without cloud dependency         |
| **Researchers**      | Baseline model to compare with deep learning or ensemble methods |
| **Medical Startups** | Prototype backend logic for terminal-based tools                 |
| **Educators**        | Teaching ML with meaningful, real-world data                     |


### **Key Notes for Revision:**

* **No GPU required** — uses efficient `RandomForestClassifier`
* **Runs locally** in terminal with zero external dataset download
* **Emphasis on simplicity**: ASCII feedback, clean input prompts
* **Can be modularized later** into API or UI backend
* **We’ll skip test/train splitting** for now — prototype phase only


### **Environment Requirements:**

| Package        | Use                                        |
| -------------- | ------------------------------------------ |
| `scikit-learn` | Dataset + ML model (imported as `sklearn`) |
| `pandas`       | Tabular data inspection                    |
| `colorama`     | Terminal coloring (optional)               |
| `numpy`        | Data handling                              |

---

### Install (if not present):

```bash
pip install pandas scikit-learn colorama numpy
```


### Deliverables:

* Classifier trained on breast cancer dataset
* Real-time terminal prediction system
* ASCII output or basic feedback UI for predicted diagnosis
* Clean, commented code blocks
* Professional markdown formatting with revision comments


## STEP 1: Load and Explore the Breast Cancer Dataset


### **Objective:**

* Load the Breast Cancer Wisconsin dataset directly from `sklearn`
* Convert to a `pandas.DataFrame` for readability
* Summarize the dataset structure and feature information
* Confirm class distribution (very important in medical ML)


### **Background Insight for Stakeholders:**

* The data represent **digitized characteristics of cell nuclei** from fine-needle aspirates of breast masses.
* 30 numerical features are calculated from images (e.g., mean radius, standard error of texture, worst smoothness).
* The target is **binary**:

  * `0 = malignant` (cancerous)
  * `1 = benign` (non-cancerous)
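
Since the encoding is the reverse of what many people expect (malignant is `0`, not `1`), it is worth confirming it directly from the dataset object. A minimal sketch, assuming scikit-learn is installed; `label_map` is a name introduced here for illustration:

```python
from sklearn.datasets import load_breast_cancer
import numpy as np

cancer = load_breast_cancer()

# target_names is ordered by label value: index 0 -> 'malignant', index 1 -> 'benign'
label_map = {code: str(name) for code, name in enumerate(cancer.target_names)}
print(label_map)

# Count how many samples carry each label value
counts = np.bincount(cancer.target)
for code, n in enumerate(counts):
    print(f"{code} = {label_map[code]}: {n} samples")
```

One practical consequence: scikit-learn's binary metrics such as `recall_score` default to `pos_label=1`, which here means benign. Measuring recall on malignant cases would need `pos_label=0`.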


### **Code Block: Load + Inspect Data**

```python
# STEP 1: Load & Inspect Breast Cancer Dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset from sklearn
cancer = load_breast_cancer()

# Convert to pandas DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)

# Add target labels
df['target'] = cancer.target
df['target_name'] = df['target'].map(lambda x: cancer.target_names[x])

# Summary outputs
print("🧬 Dataset Dimensions:", df.shape)
print("🔢 Number of Features:", len(cancer.feature_names))
print("🎯 Target Labels:", list(cancer.target_names))
print("\n📊 Class Distribution:")
print(df['target_name'].value_counts())

# Preview sample records
df.sample(5)
```

🧬 Dataset Dimensions: (569, 32)
🔢 Number of Features: 30
🎯 Target Labels: [np.str_('malignant'), np.str_('benign')]

📊 Class Distribution:
target_name
benign       357
malignant    212
Name: count, dtype: int64


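|     | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target | target_name |
| --- | ----------- | ------------ | -------------- | --------- | --------------- | ---------------- | -------------- | ------------------- | ------------- | ---------------------- | --- | --------------- | ---------- | ---------------- | ----------------- | --------------- | -------------------- | -------------- | ----------------------- | ------ | ----------- |
| 162 | 19.59 | 18.15 | 130.7 | 1214.0 | 0.112 | 0.1666 | 0.2508 | 0.1286 | 0.2027 | 0.06082 | ... | 174.9 | 2232.0 | 0.1438 | 0.3846 | 0.681 | 0.2247 | 0.3643 | 0.09223 | 0 | malignant |
| 491 | 17.85 | 13.23 | 114.6 | 992.1 | 0.07838 | 0.06217 | 0.04445 | 0.04178 | 0.122 | 0.05243 | ... | 127.1 | 1210.0 | 0.09862 | 0.09976 | 0.1048 | 0.08341 | 0.1783 | 0.05871 | 1 | benign |
| 291 | 14.96 | 19.1 | 97.03 | 687.3 | 0.08992 | 0.09823 | 0.0594 | 0.04819 | 0.1879 | 0.05852 | ... | 109.1 | 809.8 | 0.1313 | 0.303 | 0.1804 | 0.1489 | 0.2962 | 0.08472 | 1 | benign |
| 176 | 9.904 | 18.06 | 64.6 | 302.4 | 0.09699 | 0.1294 | 0.1307 | 0.03716 | 0.1669 | 0.08116 | ... | 73.07 | 390.2 | 0.1301 | 0.295 | 0.3486 | 0.0991 | 0.2614 | 0.1162 | 1 | benign |
| 91 | 15.37 | 22.76 | 100.2 | 728.2 | 0.092 | 0.1036 | 0.1122 | 0.07483 | 0.1717 | 0.06097 | ... | 107.5 | 830.9 | 0.1257 | 0.1997 | 0.2846 | 0.1476 | 0.2556 | 0.06828 | 0 | malignant |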



### Notes for Revision:

* **Shape:** `(569, 32)` → 569 observations, 30 features + 1 label + 1 label name
* **Target imbalance:** 357 benign vs. 212 malignant cases (about 63% / 37%) → use stratified splits when validating later
* `target_name` column added for human-readability

### Example Output (Typical):

🧬 Dataset Dimensions: (569, 32)
🔢 Number of Features: 30
🎯 Target Labels: ['malignant', 'benign']

📊 Class Distribution:
benign       357
malignant    212
Name: target_name, dtype: int64
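
The class counts above also set a floor for any accuracy figure reported later. The sketch below uses only those two counts (no dataset load) to compute the class shares and the accuracy of a trivial "always predict the majority class" rule; the variable names are illustrative.

```python
# Class counts taken from the class distribution printed above
counts = {"benign": 357, "malignant": 212}
total = sum(counts.values())

# Share of each class in the dataset
for name, n in counts.items():
    print(f"{name:<10} {n:>4}  {n / total:.1%}")

# A classifier that always answers with the majority class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total
print(f"Always predicting '{majority_class}' scores {baseline_accuracy:.1%}")
```

Any model worth keeping should clear this baseline by a wide margin, and in a diagnostic setting the errors on the malignant class deserve separate attention.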


## STEP 2: Train the Classifier


### Objective:

Train a **Random Forest Classifier** on the full dataset.
This prototype focuses on prediction responsiveness — no test/train split or cross-validation yet.


### Model Choice Justification:

| Model                    | Reason for Selection                                                         |
| ------------------------ | ---------------------------------------------------------------------------- |
| `RandomForestClassifier` | Fast to train, robust to outliers, good with non-linear feature interactions |
| `n_estimators=100`       | Balances accuracy and inference speed                                        |
| `random_state=42`        | Ensures repeatable results                                                   |

This setup fits our terminal-based use case — quick predictions and stable accuracy.


### Code Block: Train the Model




```python
# STEP 2: Train Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Prepare feature matrix X and target vector y
X = cancer.data
y = cancer.target

# Initialize classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train on entire dataset (prototype phase)
clf.fit(X, y)

# Evaluate on training set
train_accuracy = clf.score(X, y)
print(f"Training Accuracy: {train_accuracy:.2%}")
```

Training Accuracy: 100.00%


### Notes for Revision:

* `clf.fit(X, y)` trains on all 569 samples
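* Training accuracy is 100% here because the model is scored on the same samples it was trained on; that measures memorization, not generalization. Acceptable for a prototype, but it must be validated later on held-out test data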
* For a full deployment, we’d use `train_test_split()` and stratified validation folds
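
For reference, the validation step deferred above could look roughly like this. It is a sketch, not part of the prototype: the 25% test size, the 5 folds, and the reuse of `random_state=42` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the data; stratify=y keeps the benign/malignant ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")

# 5-fold stratified cross-validation on the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=cv
)
print(f"CV accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
```

The gap between the train and test figures is the number to watch: the held-out score is the honest estimate, and the cross-validation spread shows how much it depends on the particular split.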

## New Scenario:

### **“Quick Diagnosis Assistant”**

Instead of inputting all 30 features manually, we:

* **Randomly sample a case** from the dataset
* Display its features to simulate "incoming patient data"
* Ask the user:
  *"Would you like to diagnose this case?"*
* Then we **predict and show results** using the trained model

No typing 30 numbers. Still real. Still sharp. Still interactive.


## STEP 3 (Revised): Sample and Diagnose a Case


### Objective:

* Randomly pull a sample case from the dataset
* Show the 5–7 most meaningful features (not all 30)
* Let the model predict and show the result
* Give option to keep looping or exit
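
The features displayed in this step are hand-picked for readability. If "most meaningful" should instead come from the model, one option is to rank features by the forest's impurity-based importances. A sketch under that assumption; `top_k` and `order` are illustrative names, and impurity-based importances are known to spread unevenly across correlated features, so the ranking is a rough guide only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
import numpy as np

cancer = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(cancer.data, cancer.target)

# Sort feature indices by importance, largest first, and keep the top 6
top_k = 6
order = np.argsort(clf.feature_importances_)[::-1][:top_k]

for i in order:
    print(f"{cancer.feature_names[i]:<25} {clf.feature_importances_[i]:.3f}")
```

The resulting names could replace a hand-written feature list, at the cost of showing measurements that are less intuitive to a reader than radius or area.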

```python
# STEP 3 (REVISED): Simulated Patient Diagnosis
import random

import numpy as np  # used by the confidence step (STEP 4)

# Features to display (six hand-picked, visually intuitive ones)
selected_features = [
    'mean radius',
    'mean texture',
    'mean perimeter',
    'mean area',
    'worst concavity',
    'worst symmetry'
]

# Column index of each selected feature in the data matrix
feature_indices = [list(cancer.feature_names).index(f) for f in selected_features]

def sample_and_diagnose(model, X, y, feature_names):
    """Show random cases and predict on request until the user stops."""
    while True:
        # Draw one case at random (note: it comes from the training data)
        index = random.randint(0, len(X) - 1)
        sample = X[index]
        label = y[index]

        print("\n--- New Patient Case ---")
        for i in feature_indices:
            print(f"{feature_names[i]}: {sample[i]:.2f}")

        confirm = input("\nRun diagnosis? (y/n): ").strip().lower()
        if confirm != 'y':
            cont = input("Skip to next case? (y/n): ").strip().lower()
            if cont != 'y':
                print("\nSession ended.")
                break
            continue

        # Predict; the model expects a 2-D array of shape (1, n_features)
        pred = model.predict(sample.reshape(1, -1))[0]
        label_actual = cancer.target_names[label]
        label_pred = cancer.target_names[pred]

        print("\nDiagnosis Prediction:")
        print(f"  → Predicted: {label_pred.upper()}")
        print(f"  → Actual:    {label_actual.upper()}")

        cont = input("\nRun another case? (y/n): ").strip().lower()
        if cont != 'y':
            print("\nSession ended.")
            break

# Run the simulation
sample_and_diagnose(clf, cancer.data, cancer.target, cancer.feature_names)
```


--- New Patient Case ---
mean radius: 19.53
mean texture: 32.47
mean perimeter: 128.00
mean area: 1223.00
worst concavity: 0.40
worst symmetry: 0.27



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Actual:    MALIGNANT



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 15.73
mean texture: 11.28
mean perimeter: 102.80
mean area: 747.20
worst concavity: 0.40
worst symmetry: 0.26



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 13.59
mean texture: 21.84
mean perimeter: 87.16
mean area: 561.00
worst concavity: 0.15
worst symmetry: 0.24



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 12.30
mean texture: 19.02
mean perimeter: 77.88
mean area: 464.40
worst concavity: 0.04
worst symmetry: 0.26



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 11.94
mean texture: 18.24
mean perimeter: 75.71
mean area: 437.60
worst concavity: 0.09
worst symmetry: 0.28



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 15.78
mean texture: 17.89
mean perimeter: 103.60
mean area: 781.00
worst concavity: 0.40
worst symmetry: 0.38



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Actual:    MALIGNANT



Run another case? (y/n):  



Session ended.



### Notes:

* You review a patient case like a **doctor with a clipboard**
* Model handles backend prediction
* You decide if it should run or skip
* Actual label shown for reference, but this could be hidden in real use
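
The run/skip decision above treats any answer other than `y` as "no", so a typo ends the session. A stricter prompt could be sketched as below. `ask_yes_no`, `input_fn`, and `max_attempts` are hypothetical names introduced here, not part of the prototype; `input_fn` exists only so the helper can be exercised without a keyboard.

```python
def ask_yes_no(prompt, input_fn=input, max_attempts=3):
    """Return True for yes, False for no; re-prompt on anything else.

    After max_attempts unrecognized answers the helper gives up and
    returns False, which the caller can treat as "stop".
    """
    for _ in range(max_attempts):
        answer = input_fn(prompt).strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        print("Please answer y or n.")
    return False


# Scripted answers stand in for a user at the keyboard
scripted = iter(["maybe", " Y "])
print(ask_yes_no("Run diagnosis? (y/n): ", input_fn=lambda _: next(scripted)))
```

In the diagnosis loop, each `input(...).strip().lower()` comparison could then be replaced by a call to this helper.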

## STEP 4: Add Confidence Scores and Model Certainty

### Objective:

Display **how confident** the model is in its prediction, not just the class label.

This is critical in any clinical or decision-support context — we don’t just want a binary label, we want to know **how certain** the model is.


### Tools Used:

| Function              | Purpose                              |
| --------------------- | ------------------------------------ |
| `predict_proba(X)`    | Returns array of class probabilities |
| `np.max()`            | Extracts top confidence score        |
| `colorama` (optional) | Highlights high/low confidence       |
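
To make the first two rows of the table concrete: for a random forest, scikit-learn documents `predict_proba` as the mean of the per-tree class probabilities. The sketch below recomputes that average by hand for one case and then takes the top score; it retrains the same model as STEP 2 so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

sample = X[:1]  # one case, kept 2-D as scikit-learn expects

# Forest-level probabilities: one value per class, ordered as in clf.classes_
forest_proba = clf.predict_proba(sample)[0]

# The same quantity, averaged over the individual trees by hand
tree_proba = np.mean(
    [tree.predict_proba(sample)[0] for tree in clf.estimators_], axis=0
)

confidence = np.max(forest_proba)                  # top class probability
predicted = clf.classes_[np.argmax(forest_proba)]  # label that owns it
print(forest_proba, tree_proba, confidence, predicted)
```

With the default settings the trees are grown until their leaves are pure, so each tree effectively casts a 0-or-1 vote. With 100 trees the confidence therefore moves in steps of 1%, which is why the session output shows values such as 98.00% and 92.00%.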


### Code Block: Display Confidence

```python
# STEP 4: Diagnosis with confidence scores
# Relies on `cancer`, `clf`, and `feature_indices` defined in earlier steps
import random

import numpy as np
from colorama import Fore, Style

def sample_and_diagnose_with_confidence(model, X, y, feature_names):
    """Like sample_and_diagnose, but also reports the top class probability."""
    while True:
        index = random.randint(0, len(X) - 1)
        sample = X[index]
        label = y[index]

        print("\n--- New Patient Case ---")
        for i in feature_indices:
            print(f"{feature_names[i]}: {sample[i]:.2f}")

        confirm = input("\nRun diagnosis? (y/n): ").strip().lower()
        if confirm != 'y':
            cont = input("Skip to next case? (y/n): ").strip().lower()
            if cont != 'y':
                print("\nSession ended.")
                break
            continue

        # Predict and get class probabilities (one value per class)
        pred = model.predict(sample.reshape(1, -1))[0]
        prob = model.predict_proba(sample.reshape(1, -1))[0]
        confidence = np.max(prob)

        label_actual = cancer.target_names[label]
        label_pred = cancer.target_names[pred]

        # Color bands: green >= 90%, yellow >= 75%, red below that
        print("\nDiagnosis Prediction:")
        if confidence >= 0.90:
            color = Fore.GREEN
        elif confidence >= 0.75:
            color = Fore.YELLOW
        else:
            color = Fore.RED

        print(f"  → Predicted: {label_pred.upper()}")
        print(f"  → Confidence: {color}{confidence:.2%}{Style.RESET_ALL}")
        print(f"  → Actual:    {label_actual.upper()}")

        cont = input("\nRun another case? (y/n): ").strip().lower()
        if cont != 'y':
            print("\nSession ended.")
            break

# Run enhanced simulation
sample_and_diagnose_with_confidence(clf, cancer.data, cancer.target, cancer.feature_names)
```


--- New Patient Case ---
mean radius: 13.17
mean texture: 18.22
mean perimeter: 84.28
mean area: 537.30
worst concavity: 0.19
worst symmetry: 0.22



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 11.63
mean texture: 29.29
mean perimeter: 74.87
mean area: 415.10
worst concavity: 0.29
worst symmetry: 0.29



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 98.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 11.13
mean texture: 16.62
mean perimeter: 70.47
mean area: 381.10
worst concavity: 0.05
worst symmetry: 0.24



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 12.45
mean texture: 16.41
mean perimeter: 82.85
mean area: 476.70
worst concavity: 0.49
worst symmetry: 0.32



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 92.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 12.70
mean texture: 12.17
mean perimeter: 80.88
mean area: 495.00
worst concavity: 0.09
worst symmetry: 0.28



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 11.99
mean texture: 24.89
mean perimeter: 77.61
mean area: 441.30
worst concavity: 0.16
worst symmetry: 0.26



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 17.14
mean texture: 16.40
mean perimeter: 116.00
mean area: 912.70
worst concavity: 0.39
worst symmetry: 0.41



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Confidence: 99.00%
  → Actual:    MALIGNANT



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 11.74
mean texture: 14.02
mean perimeter: 74.24
mean area: 427.30
worst concavity: 0.07
worst symmetry: 0.31



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 96.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 27.22
mean texture: 21.87
mean perimeter: 182.10
mean area: 2250.00
worst concavity: 0.53
worst symmetry: 0.29



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Confidence: 100.00%
  → Actual:    MALIGNANT



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 13.64
mean texture: 15.60
mean perimeter: 87.38
mean area: 575.30
worst concavity: 0.15
worst symmetry: 0.25



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 12.67
mean texture: 17.30
mean perimeter: 81.25
mean area: 489.90
worst concavity: 0.10
worst symmetry: 0.27



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 14.44
mean texture: 15.18
mean perimeter: 93.97
mean area: 640.10
worst concavity: 0.31
worst symmetry: 0.27



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 73.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 12.95
mean texture: 16.02
mean perimeter: 83.14
mean area: 513.70
worst concavity: 0.22
worst symmetry: 0.34



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: BENIGN
  → Confidence: 100.00%
  → Actual:    BENIGN



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 16.46
mean texture: 20.11
mean perimeter: 109.30
mean area: 832.90
worst concavity: 0.59
worst symmetry: 0.31



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Confidence: 100.00%
  → Actual:    MALIGNANT



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 13.40
mean texture: 20.52
mean perimeter: 88.64
mean area: 556.70
worst concavity: 0.51
worst symmetry: 0.36



Run diagnosis? (y/n):  y



Diagnosis Prediction:
  → Predicted: MALIGNANT
  → Confidence: 99.00%
  → Actual:    MALIGNANT



Run another case? (y/n):  y



--- New Patient Case ---
mean radius: 15.78
mean texture: 22.91
mean perimeter: 105.70
mean area: 782.60
worst concavity: 0.74
worst symmetry: 0.33



Run diagnosis? (y/n):  n
Skip to next case? (y/n):  n



Session ended.


### Notes:

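* Adds `predict_proba()` to show class probabilities (for a random forest, the mean of the per-tree probabilities); because the cases are drawn from the training data, the values shown are optimistic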
* Uses color to **visually flag confidence**:

  * Green = strong confidence
  * Yellow = moderate
  * Red = weak/conflicted
* More aligned with how clinicians review uncertainty in diagnostics
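
The color bands above, together with the note that `colorama` is optional, can be captured in a small helper that falls back to plain text when the package is missing. A sketch: `confidence_band` and `format_confidence` are hypothetical names, and the 0.90 / 0.75 thresholds are the ones used in the code block for this step.

```python
try:
    from colorama import Fore, Style
    COLORS = {"green": Fore.GREEN, "yellow": Fore.YELLOW, "red": Fore.RED}
    RESET = Style.RESET_ALL
except ImportError:
    # colorama is optional: fall back to plain, uncolored text
    COLORS = {"green": "", "yellow": "", "red": ""}
    RESET = ""


def confidence_band(confidence):
    """Map a probability in [0, 1] to the band name used in the notes."""
    if confidence >= 0.90:
        return "green"
    if confidence >= 0.75:
        return "yellow"
    return "red"


def format_confidence(confidence):
    """Return the percentage, colored when possible, with the band spelled out."""
    band = confidence_band(confidence)
    return f"{COLORS[band]}{confidence:.2%}{RESET} ({band})"


for value in (1.00, 0.80, 0.73):
    print(format_confidence(value))
```

Spelling out the band name next to the number keeps the signal readable in logs or terminals that do not render color.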