# Project Metadata & Setup

---

## **Project Title:** **Early Breast Cancer Diagnosis using Machine Learning (Terminal-Based Prototype)**

---

### **Objective:**

Develop a lightweight, terminal-driven diagnostic prototype that predicts breast tumor malignancy using the built-in Breast Cancer Wisconsin dataset.
The goal is to simulate how a clinician or technician might use a fast, interpretable tool, without a full GUI or web app.


### **Dataset Description:**

* **Name:** Breast Cancer Wisconsin (Diagnostic)
* **Source:** `sklearn.datasets.load_breast_cancer()`
* **Samples:** 569
* **Features:** 30 numerical (e.g. radius, texture, symmetry)
* **Target:** Binary classification — `malignant (0)` vs. `benign (1)`



### **Stakeholders:**

| Stakeholder          | Interest / Use Case                                              |
| -------------------- | ---------------------------------------------------------------- |
| **Clinicians**       | Fast, interpretable predictions without cloud dependency         |
| **Researchers**      | Baseline model to compare with deep learning or ensemble methods |
| **Medical Startups** | Prototype backend logic for terminal-based tools                 |
| **Educators**        | Teaching ML with meaningful, real-world data                     |


### **Key Notes for Revision:**

* **No GPU required** — uses efficient `RandomForestClassifier`
* **Runs locally** in terminal with zero external dataset download
* **Emphasis on simplicity**: ASCII feedback, clean input prompts
* **Can be modularized later** into API or UI backend
* **We’ll skip test/train splitting** for now — prototype phase only
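
The notes above can be sketched as a minimal baseline: a `RandomForestClassifier` fitted on all 569 samples with no hold-out set. The hyperparameters (`n_estimators=100`, `random_state=42`) are illustrative assumptions, not values fixed by this project. Because nothing is held out, the score printed at the end is training accuracy and says little about performance on unseen cases.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Load features (X) and labels (y); 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters are illustrative assumptions, not project settings
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)  # fitted on every sample: no train/test split in this phase

# Predict one sample and inspect the class probabilities
sample = X[:1]
predicted_class = int(model.predict(sample)[0])
probabilities = model.predict_proba(sample)[0]  # order follows model.classes_

print("Predicted class:", predicted_class)
print("P(malignant), P(benign):", probabilities)
print("Training accuracy (optimistic):", model.score(X, y))
```

Once the prototype moves past this phase, the same model object can be refitted on a training split only.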


### **Environment Requirements:**

| Package        | Use                                       |
| -------------- | ----------------------------------------- |
| `scikit-learn` | Dataset + ML model (imported as `sklearn`) |
| `pandas`       | Tabular data inspection                   |
| `colorama`     | Terminal coloring (optional)              |
| `numpy`        | Data handling                             |

---

### Install (if not present):

```bash
pip install pandas scikit-learn colorama numpy
```
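
To confirm the environment before running the notebook, a small check along these lines may help. It uses only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the requirements table.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as pip knows them (scikit-learn, not sklearn)
REQUIRED = ["scikit-learn", "pandas", "numpy", "colorama"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print one line per package so missing ones are easy to spot
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:<14}{ver or 'NOT INSTALLED'}")
```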


### Deliverables:

* Classifier trained on breast cancer dataset
* Real-time terminal prediction system
* ASCII output or basic feedback UI for predicted diagnosis
* Clean, commented code blocks
* Professional markdown formatting with revision comments
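
As one possible shape for the ASCII feedback deliverable, the sketch below draws a boxed summary with a confidence bar. The function name `render_diagnosis`, its parameters, and the layout are assumptions made for illustration; the final prototype may present results differently.

```python
def render_diagnosis(label, confidence, bar_width=20):
    """Return a boxed ASCII summary for one prediction (hypothetical helper)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    # Build a fixed-width bar: '#' for the confident part, '-' for the rest
    filled = round(confidence * bar_width)
    bar = "#" * filled + "-" * (bar_width - filled)

    lines = [
        f"Diagnosis : {label.upper()}",
        f"Confidence: [{bar}] {confidence:.0%}",
    ]

    # Pad every line to the same width and wrap it in a border
    inner = max(len(line) for line in lines)
    border = "+" + "-" * (inner + 2) + "+"
    body = [f"| {line.ljust(inner)} |" for line in lines]
    return "\n".join([border, *body, border])


print(render_diagnosis("benign", 0.93))
```

Keeping the rendering in a pure function that returns a string makes it easy to reuse later behind an API or a UI.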


## STEP 1: Load and Explore the Breast Cancer Dataset


### **Objective:**

* Load the Breast Cancer Wisconsin dataset directly from `sklearn`
* Convert to a `pandas.DataFrame` for readability
* Summarize the dataset structure and feature information
* Confirm class distribution (class imbalance determines which metrics are meaningful in medical ML)


### **Background Insight for Stakeholders:**

* The data represent **digitized characteristics of cell nuclei** from fine-needle aspirates of breast masses.
* 30 numerical features are calculated from images (e.g., mean radius, standard error of texture, worst smoothness).
* The target is **binary**:

  * `0 = malignant` (cancerous)
  * `1 = benign` (non-cancerous)
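
The 30 features come in three groups of ten: a mean, a standard error, and a "worst" value for each base measurement. In the scikit-learn copy of the dataset the standard-error columns are named with an `error` suffix (for example `texture error`). A short sketch that groups the column names by that naming pattern:

```python
from sklearn.datasets import load_breast_cancer

feature_names = load_breast_cancer().feature_names

# Group columns by the naming pattern used in the scikit-learn dataset
groups = {
    "mean": [n for n in feature_names if n.startswith("mean ")],
    "standard error": [n for n in feature_names if n.endswith(" error")],
    "worst": [n for n in feature_names if n.startswith("worst ")],
}

# Show the size of each group and one example column name
for group, names in groups.items():
    print(f"{group:<15}{len(names):>3}  e.g. {names[0]}")
```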


### **Code Block: Load + Inspect Data**

```python
# STEP 1: Load & Inspect Breast Cancer Dataset
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset from sklearn
cancer = load_breast_cancer()

# Convert to pandas DataFrame
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)

# Add target labels
df['target'] = cancer.target
df['target_name'] = df['target'].map(lambda x: cancer.target_names[x])

# Summary outputs
print("🧬 Dataset Dimensions:", df.shape)
print("🔢 Number of Features:", len(cancer.feature_names))
print("🎯 Target Labels:", list(cancer.target_names))
print("\n📊 Class Distribution:")
print(df['target_name'].value_counts())

# Preview sample records
df.sample(5)
```

```text
🧬 Dataset Dimensions: (569, 32)
🔢 Number of Features: 30
🎯 Target Labels: [np.str_('malignant'), np.str_('benign')]

📊 Class Distribution:
target_name
benign       357
malignant    212
Name: count, dtype: int64
```


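|     | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target | target_name |
| --- | ----------- | ------------ | -------------- | --------- | --------------- | ---------------- | -------------- | ------------------- | ------------- | ---------------------- | --- | --------------- | ---------- | ---------------- | ----------------- | --------------- | -------------------- | -------------- | ----------------------- | ------ | ----------- |
| 283 | 16.24 | 18.77 | 108.8 | 805.1 | 0.1066 | 0.1802 | 0.1948 | 0.09052 | 0.1876 | 0.06684 | ... | 126.9 | 1031.0 | 0.1365 | 0.4706 | 0.5026 | 0.1732 | 0.277 | 0.1063 | 0 | malignant |
| 295 | 13.77 | 13.27 | 88.06 | 582.7 | 0.09198 | 0.06221 | 0.01063 | 0.01917 | 0.1592 | 0.05912 | ... | 94.17 | 661.1 | 0.117 | 0.1072 | 0.03732 | 0.05802 | 0.2823 | 0.06794 | 1 | benign |
| 356 | 13.05 | 18.59 | 85.09 | 512.0 | 0.1082 | 0.1304 | 0.09603 | 0.05603 | 0.2035 | 0.06501 | ... | 94.22 | 591.2 | 0.1343 | 0.2658 | 0.2573 | 0.1258 | 0.3113 | 0.08317 | 1 | benign |
| 98 | 11.6 | 12.84 | 74.34 | 412.6 | 0.08983 | 0.07525 | 0.04196 | 0.0335 | 0.162 | 0.06582 | ... | 82.96 | 512.5 | 0.1431 | 0.1851 | 0.1922 | 0.08449 | 0.2772 | 0.08756 | 1 | benign |
| 227 | 15.0 | 15.51 | 97.45 | 684.5 | 0.08371 | 0.1096 | 0.06505 | 0.0378 | 0.1881 | 0.05907 | ... | 114.2 | 808.2 | 0.1136 | 0.3627 | 0.3402 | 0.1379 | 0.2954 | 0.08362 | 1 | benign |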



### Notes for Revision:

* **Shape:** `(569, 32)` → 569 observations, 30 features + 1 label + 1 label name
* **Target imbalance:** 357 benign (62.7%) vs. 212 malignant (37.3%) → stratified validation will likely be needed later
* `target_name` column added for human-readability
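
For the later validation phase, a stratified split keeps the benign/malignant ratio roughly the same in each subset. The sketch below uses `train_test_split` with `stratify=y`; the 80/20 split and `random_state=42` are illustrative assumptions, not project decisions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y preserves the class ratio in both subsets
# (test_size and random_state are illustrative assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def benign_share(labels):
    """Fraction of samples labelled benign (target == 1)."""
    return float(np.mean(labels == 1))


# Compare the class ratio across the full data and both subsets
for name, labels in [("full", y), ("train", y_train), ("test", y_test)]:
    print(f"{name:<6} n={len(labels):>3}  benign share={benign_share(labels):.3f}")
```

`StratifiedKFold` from the same module applies the same idea to cross-validation.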
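
### Example Output (Typical):

Exact formatting depends on library versions. With NumPy < 2.0 and pandas < 2.0, the same cell prints plain strings for the labels and names the counts Series after the column:

```text
🧬 Dataset Dimensions: (569, 32)
🔢 Number of Features: 30
🎯 Target Labels: ['malignant', 'benign']

📊 Class Distribution:
benign       357
malignant    212
Name: target_name, dtype: int64
```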
