# Breast Cancer Binary Classification

## Objective

The goal of this task is to build and compare multiple binary classification models to predict whether a breast tumor is:

- **0 — Malignant (Cancerous)**
- **1 — Benign (Non-cancerous)**

We will train and evaluate the following models:

- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)

The focus of this task is:
- Model training
- Model evaluation
- Performance comparison

Note: feature scaling is not used in this task.

## Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report
)

## Dataset Overview

We are using the **Breast Cancer Wisconsin Dataset**, available directly in scikit-learn.

### Dataset Characteristics:
- 569 samples (patients)
- 30 numerical features
- Binary target variable (Malignant or Benign)
- No missing values

Each feature represents a measurement extracted from a digitized image of a breast mass, such as:

- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Concavity
- Symmetry

These features describe physical properties of cell nuclei and help distinguish between malignant and benign tumors.

In [2]:
data = load_breast_cancer()
X = data.data
y = data.target

print("Samples:", X.shape[0])
print("Features:", X.shape[1])
print("Target classes (0=Malignant, 1=Benign):", np.unique(y))

Samples: 569
Features: 30
Target classes (0=Malignant, 1=Benign): [0 1]


## Train-Test Split

The dataset is divided into:

- 80% Training data
- 20% Testing data

We use:

- `random_state=42` → ensures reproducibility
- `stratify=y` → preserves the class distribution in both sets

Stratification matters here because the classes are imbalanced (212 malignant vs. 357 benign), as is common in medical datasets.
It ensures that malignant and benign cases appear in the same proportions in both the training and the testing data.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Train class distribution:", np.bincount(y_train))
print("Test class distribution:", np.bincount(y_test))

Train: (455, 30) Test: (114, 30)
Train class distribution: [170 285]
Test class distribution: [42 72]
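
The effect of `stratify=y` can be checked by hand from the counts printed above. The following is a minimal sketch in plain Python; the helper name `malignant_fraction` is introduced here for illustration and is not part of the notebook.

```python
# Class counts copied from the printed output above, as [malignant, benign].
train_counts = [170, 285]
test_counts = [42, 72]

def malignant_fraction(counts):
    """Return the share of class 0 (malignant) in a [malignant, benign] pair."""
    return counts[0] / sum(counts)

# Counts for the full dataset are the element-wise sum of both splits.
full_counts = [t + s for t, s in zip(train_counts, test_counts)]

for label, counts in [("Full", full_counts),
                      ("Train", train_counts),
                      ("Test", test_counts)]:
    print(f"{label:<5} malignant share: {malignant_fraction(counts):.3f}")
```

The three shares should agree to within about one percentage point, which is what a stratified split is expected to deliver on a dataset of this size.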


## Model Selection

We use three classification algorithms covered in class:

### 1️⃣ Logistic Regression
A linear classification algorithm that models the probability of a class using the logistic (sigmoid) function.
It works well when the classes are approximately linearly separable.
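
As a rough illustration of the sigmoid step, the sketch below computes a class probability for one sample from a linear score. The weights, bias, and feature values are made up for the example and are not taken from the fitted model; in this dataset class 1 corresponds to Benign.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of class 1 for one sample under a linear model."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical two-feature model (illustrative numbers only).
weights, bias = [0.8, -1.2], 0.1
p = predict_proba(weights, bias, [1.0, 0.5])

# Threshold at 0.5 to turn the probability into a class label.
label = 1 if p >= 0.5 else 0
print(f"P(class 1) = {p:.3f}, predicted label = {label}")
```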

### 2️⃣ Support Vector Machine (SVM)
SVM tries to find the optimal separating hyperplane that maximizes the margin between classes.
It is powerful in high-dimensional spaces.
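
For the linear case, the margin idea can be sketched in a few lines. Note the assumptions: the hyperplane parameters `w` and `b` below are hypothetical, and `SVC()` as used in this notebook defaults to an RBF kernel, where the separating hyperplane lives in an implicit feature space rather than in the original 30 features.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the margin boundaries w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters (illustrative numbers only).
w, b = [3.0, 4.0], -5.0

print("Margin width:", margin_width(w))
print("Score for (3, 2):", decision_value(w, b, [3.0, 2.0]))
print("Score for (0, 0):", decision_value(w, b, [0.0, 0.0]))
```

A smaller norm of `w` gives a wider margin, which is why maximizing the margin is equivalent to minimizing the norm of `w` subject to the classification constraints.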

### 3️⃣ K-Nearest Neighbors (KNN)
A distance-based algorithm.
It classifies a new sample based on the majority class among its k closest neighbors. 
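
The majority-vote rule can be sketched in plain Python as follows. This is a toy version under stated assumptions: Euclidean distance, `k=5` by default to mirror the default of `KNeighborsClassifier`, and a small made-up dataset; it is not how scikit-learn implements the search internally.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns the label with the highest vote count.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (5.0, 5.0), k=3))
```

Because the vote is driven entirely by distances, features with large numeric ranges dominate the result when the data is not scaled.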

In [4]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "SVM": SVC(),                   # default settings (RBF kernel)
    "KNN": KNeighborsClassifier(),  # default settings (n_neighbors=5)
}

## Evaluation Metrics

To compare the models, we compute:

### Accuracy
The percentage of correctly classified samples.
However, accuracy alone may not be sufficient in medical applications.
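
To see why accuracy can mislead, consider a trivial baseline that labels every test sample as benign. The sketch below uses the test-set class counts from the split above (42 malignant, 72 benign); the baseline itself is hypothetical and is not one of the models trained in this notebook.

```python
# Test-set labels rebuilt from the class counts: 42 malignant (0), 72 benign (1).
y_true = [0] * 42 + [1] * 72

# A trivial "model" that always predicts benign.
y_pred = [1] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the malignant class: detected malignant cases / all malignant cases.
detected = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
malignant_recall = detected / y_true.count(0)

print(f"Accuracy: {accuracy:.3f}")
print(f"Malignant recall: {malignant_recall:.3f}")
```

The baseline gets roughly 63% of the samples right while detecting no malignant tumor at all, which is why class-specific metrics are reported alongside accuracy.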

### Precision (Malignant = 0)
Among all tumors predicted as malignant, how many were actually malignant?

High precision means few false positives, i.e. few benign tumors wrongly flagged as malignant.

### Recall (Malignant = 0)
Among all actual malignant tumors, how many did the model correctly detect?

This is also called **Sensitivity**.

In medical diagnosis, recall is extremely important because missing a malignant case (false negative) can delay treatment.

### F1-Score
The harmonic mean of Precision and Recall.
It balances both false positives and false negatives.
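
A small sketch of why the harmonic mean is used: unlike the arithmetic mean, it stays low when either precision or recall is low. The helper name `f1_from` and the input values are hypothetical, chosen only to make the contrast visible.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model: perfectly precise, but it finds only 10% of the positives.
precision, recall = 1.0, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")
print(f"F1 (harmonic mean): {f1:.3f}")
```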

### Confusion Matrix
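With Malignant (0) as the positive class, it shows:
- True Positives: malignant tumors predicted as malignant
- False Positives: benign tumors predicted as malignant
- True Negatives: benign tumors predicted as benign
- False Negatives: malignant tumors predicted as benign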

It gives a complete picture of model performance.
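
All four scalar metrics can be derived from the confusion matrix. The sketch below does this by hand for the Logistic Regression matrix printed in the results section of this notebook, assuming scikit-learn's layout (rows are true classes, columns are predicted classes) with `labels=[0, 1]`.

```python
# Logistic Regression confusion matrix as printed in the results section.
# Rows: true class (0 = malignant, 1 = benign). Columns: predicted class.
cm = [[39, 3],
      [1, 71]]

# Malignant (0) is treated as the positive class.
tp, fn = cm[0]  # true malignant: predicted malignant / predicted benign
fp, tn = cm[1]  # true benign:    predicted malignant / predicted benign

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Accuracy:  {accuracy:.6f}")
print(f"Precision: {precision:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"F1:        {f1:.6f}")
```

These values should agree with the Logistic Regression row of the comparison table, which is a useful sanity check on the `pos_label=0` convention.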

In [5]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Treat Malignant (0) as the "positive" class
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, pos_label=0)
    rec = recall_score(y_test, y_pred, pos_label=0)
    f1 = f1_score(y_test, y_pred, pos_label=0)
    cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

    results = {
        "Model": name,
        "Accuracy": acc,
        "Precision (Malignant=0)": prec,
        "Recall (Malignant=0)": rec,
        "F1 (Malignant=0)": f1
    }

    return results, cm, y_pred

## Training and Evaluating the Models

In [6]:
all_results = []
conf_mats = {}

for name, model in models.items():
    res, cm, y_pred = evaluate_model(name, model, X_train, y_train, X_test, y_test)
    all_results.append(res)
    conf_mats[name] = cm

    print("="*60)
    print(name)
    print("Confusion Matrix (labels=[0 Malignant, 1 Benign]):")
    print(cm)
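    # Note: classification_report takes no pos_label argument; it lists
    # precision, recall, and F1 for both classes, one row per class.
    print("\nClassification Report (default pos_label=1 => focuses on Benign):")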
    print(classification_report(y_test, y_pred, target_names=["Malignant(0)", "Benign(1)"]))

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Logistic Regression
Confusion Matrix (labels=[0 Malignant, 1 Benign]):
[[39  3]
 [ 1 71]]

Classification Report (default pos_label=1 => focuses on Benign):
              precision    recall  f1-score   support

Malignant(0)       0.97      0.93      0.95        42
   Benign(1)       0.96      0.99      0.97        72

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

SVM
Confusion Matrix (labels=[0 Malignant, 1 Benign]):
[[36  6]
 [ 2 70]]

Classification Report (default pos_label=1 => focuses on Benign):
              precision    recall  f1-score   support

Malignant(0)       0.95      0.86      0.90        42
   Benign(1)       0.92      0.97      0.95        72

    accuracy                           0.93       114
   macro avg       0.93      0.91      0.92       114
weighted avg       0.93      0.93      0.93       114

KNN
Confusion Matrix (labels=[0 Malignant, 1 Ben

## Model Comparison

After training all models, we compare them using:

- Accuracy
- Precision (Malignant)
- Recall (Malignant)
- F1-Score (Malignant)

Since this is a medical classification task, we prioritize:

1️⃣ Recall (Malignant detection rate)  
2️⃣ F1-score  
3️⃣ Accuracy  

The best model will be determined based on overall performance, with emphasis on detecting malignant tumors correctly.
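
The notebook's ranking cell sorts by F1 alone. As an alternative, the stated priority order (recall, then F1, then accuracy) can be expressed as a multi-key sort. The sketch below is plain Python with the metric values copied from the comparison table in this section; the shortened key names are chosen here for brevity.

```python
# Metric values copied from the comparison table (malignant = positive class).
results = [
    {"Model": "Logistic Regression", "Accuracy": 0.964912, "Recall": 0.928571, "F1": 0.951220},
    {"Model": "SVM", "Accuracy": 0.929825, "Recall": 0.857143, "F1": 0.900000},
    {"Model": "KNN", "Accuracy": 0.912281, "Recall": 0.904762, "F1": 0.883721},
]

# Sort by recall first; F1 and accuracy only break ties further down the tuple.
ranked = sorted(
    results,
    key=lambda r: (r["Recall"], r["F1"], r["Accuracy"]),
    reverse=True,
)

for position, row in enumerate(ranked, start=1):
    print(position, row["Model"], row["Recall"])
```

Under this ordering KNN moves ahead of SVM because of its higher malignant recall, while Logistic Regression stays first under either ranking.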

In [7]:
comparison_df = pd.DataFrame(all_results)
comparison_df = comparison_df.sort_values(by="F1 (Malignant=0)", ascending=False).reset_index(drop=True)
comparison_df

|   | Model | Accuracy | Precision (Malignant=0) | Recall (Malignant=0) | F1 (Malignant=0) |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.964912 | 0.975 | 0.928571 | 0.95122 |
| 1 | SVM | 0.929825 | 0.947368 | 0.857143 | 0.9 |
| 2 | KNN | 0.912281 | 0.863636 | 0.904762 | 0.883721 |


## Final Conclusion

Among the three models tested, **Logistic Regression** achieved the best overall performance: it leads on every reported metric, including F1-score (0.951) and Recall (0.929) for malignant cases.

### Most Important Metric in Medical Context

In medical diagnosis, **Recall (Sensitivity) for Malignant tumors** is the most critical metric.

Why?

Because:
- A false negative (predicting benign when it is malignant) can delay treatment.
- Delayed treatment may significantly increase health risks.
- A false positive, while inconvenient, is less dangerous than missing cancer.

Therefore, a model with higher recall for malignant cases is generally preferred in clinical settings.

In [8]:
best_model = comparison_df.loc[0, "Model"]
best_row = comparison_df.loc[0]

print("Best model (by F1 for Malignant=0):", best_model)
print(best_row)

Best model (by F1 for Malignant=0): Logistic Regression
Model                      Logistic Regression
Accuracy                              0.964912
Precision (Malignant=0)                  0.975
Recall (Malignant=0)                  0.928571
F1 (Malignant=0)                       0.95122
Name: 0, dtype: object
