# Breast Cancer Binary Classification

## Objective
The goal of this project is to build and compare multiple binary classification models to predict whether a tumor is:

- 0 → Malignant (Cancerous)
- 1 → Benign (Non-cancerous)

We will train and evaluate the following models:
- Logistic Regression
- Support Vector Machine (SVM)
- K-Nearest Neighbors (KNN)

Dataset:
Breast Cancer Wisconsin Dataset from scikit-learn.

## Import Required Libraries
In this step, we import the libraries needed for:
- Loading the dataset
- Splitting the data
- Training models
- Evaluating performance

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

## Load the Dataset
We load the Breast Cancer Wisconsin dataset directly from scikit-learn.
The dataset contains:
- 569 samples
- 30 numerical features
- Binary target variable

In [2]:
data = load_breast_cancer()
X = data.data
y = data.target

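As a quick sanity check on the figures listed above, the following sketch prints the feature matrix shape, the class names in label order, and the number of samples per class. The variable name `class_counts` is introduced here for illustration only and is not part of the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Shape of the feature matrix: (samples, features)
print(X.shape)

# target_names is ordered by label: index 0 names label 0, index 1 names label 1
print(list(data.target_names))

# Number of samples per label (illustrative helper variable)
class_counts = Counter(y.tolist())
print(class_counts)
```

The class names confirm the encoding stated in the Objective section: label 0 is malignant and label 1 is benign.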
## Train-Test Split
We split the dataset into training and testing sets using:
- test_size = 0.2
- random_state = 42
- stratify = y

Stratification keeps the proportion of malignant and benign samples approximately the same in the training and testing sets as in the full dataset.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

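One way to see what `stratify=y` does is to compare the share of benign samples (label 1) in the full dataset, the training set, and the testing set. This is a minimal sketch that repeats the same split; the dictionary `benign_share` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Fraction of benign samples (label 1) in each subset (illustrative helper)
benign_share = {
    name: float(np.mean(labels == 1))
    for name, labels in [("full", y), ("train", y_train), ("test", y_test)]
}

for name, share in benign_share.items():
    print(f"{name:>5}: {share:.4f}")
```

With stratification the three fractions should agree closely; without it, a random split could leave one class over- or under-represented in the test set.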
## Logistic Regression Model
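We train a Logistic Regression model with default parameters, except that `max_iter` is raised from its default of 100 to 10000 so that the solver has enough iterations to converge on the unscaled features.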

In [4]:
log_model = LogisticRegression(max_iter=10000)
log_model.fit(X_train, y_train)

y_pred_log = log_model.predict(X_test)

## Support Vector Machine (SVM)
Now we train an SVM classifier with default settings.

In [5]:
svm_model = SVC()
svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

## K-Nearest Neighbors (KNN)
We train a KNN model using default parameters.

In [6]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)

y_pred_knn = knn_model.predict(X_test)

## Model Evaluation
We evaluate each model using the following metrics:
- Accuracy
- Precision
- Recall
- F1-score
- Confusion Matrix

In [9]:
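def evaluate_model(y_test, y_pred):
    """Return accuracy, precision, recall, F1-score, and the confusion matrix."""
    acc = accuracy_score(y_test, y_pred)
    # Precision, recall, and F1 use scikit-learn's default positive class: label 1 (benign)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    # Rows are true labels, columns are predicted labels
    cm = confusion_matrix(y_test, y_pred)

    return acc, prec, rec, f1, cm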

## Evaluating All Models
Now we compute the evaluation metrics for:
- Logistic Regression
- SVM
- KNN

In [10]:
log_results = evaluate_model(y_test, y_pred_log)
svm_results = evaluate_model(y_test, y_pred_svm)
knn_results = evaluate_model(y_test, y_pred_knn)

## Confusion Matrices
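A confusion matrix shows, for each true class (rows), how many samples were predicted as each class (columns). With this dataset's labels, the first row and column correspond to malignant (0) and the second to benign (1), so the off-diagonal entries are the misclassified samples.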

In [11]:
print("Logistic Regression Confusion Matrix:")
print(log_results[4])

print("\nSVM Confusion Matrix:")
print(svm_results[4])

print("\nKNN Confusion Matrix:")
print(knn_results[4])

Logistic Regression Confusion Matrix:
[[39  3]
 [ 1 71]]

SVM Confusion Matrix:
[[36  6]
 [ 2 70]]

KNN Confusion Matrix:
[[38  4]
 [ 6 66]]

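To connect the matrices above to the reported metrics, the sketch below unpacks the Logistic Regression matrix by hand. It assumes label 1 (benign) is the positive class, which is the scikit-learn default used by `evaluate_model`; the variable names are illustrative.

```python
# Logistic Regression confusion matrix from the output above
# (rows = true labels, columns = predicted labels, label order 0, 1)
cm = [[39, 3],
      [1, 71]]

# With label 1 (benign) as the positive class:
tn, fp = cm[0]   # true malignant: predicted malignant / predicted benign
fn, tp = cm[1]   # true benign:    predicted malignant / predicted benign

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Recall for the malignant class: malignant tumors correctly flagged
malignant_recall = tn / (tn + fp)

print(f"Accuracy:         {accuracy:.6f}")
print(f"Precision:        {precision:.6f}")
print(f"Recall (benign):  {recall:.6f}")
print(f"F1-score:         {f1:.6f}")
print(f"Recall (malignant): {malignant_recall:.6f}")
```

The 3 in the first row counts malignant tumors that were predicted as benign, which is the most costly kind of error in this setting.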

## Model Comparison
We create a table comparing all models based on evaluation metrics.

In [12]:
comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "KNN"],
    "Accuracy": [log_results[0], svm_results[0], knn_results[0]],
    "Precision": [log_results[1], svm_results[1], knn_results[1]],
    "Recall": [log_results[2], svm_results[2], knn_results[2]],
    "F1-Score": [log_results[3], svm_results[3], knn_results[3]]
})

comparison

| | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.964912 | 0.959459 | 0.986111 | 0.972603 |
| 1 | SVM | 0.929825 | 0.921053 | 0.972222 | 0.945946 |
| 2 | KNN | 0.912281 | 0.942857 | 0.916667 | 0.929577 |

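To read the comparison more systematically, one option is to ask pandas which model scores highest on each metric and to rank the models by a chosen metric. The sketch below rebuilds the table from the values reported above; `best_per_metric` and `ranked` are names introduced here for illustration, and ranking by F1-score is just one possible choice.

```python
import pandas as pd

# Values copied from the comparison table above
comparison = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "KNN"],
    "Accuracy": [0.964912, 0.929825, 0.912281],
    "Precision": [0.959459, 0.921053, 0.942857],
    "Recall": [0.986111, 0.972222, 0.916667],
    "F1-Score": [0.972603, 0.945946, 0.929577],
})

# Model with the highest value for each metric
best_per_metric = comparison.set_index("Model").idxmax()
print(best_per_metric)

# Models ordered from best to worst F1-score
ranked = comparison.sort_values("F1-Score", ascending=False).reset_index(drop=True)
print(ranked)
```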

## Conclusion

### Best Performing Model
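Based on the evaluation metrics above, the best-performing model on this train-test split is **Logistic Regression**: it has the highest accuracy (0.965), precision (0.959), recall (0.986), and F1-score (0.973) of the three models. SVM ranks second on accuracy, recall, and F1-score, while KNN ranks second on precision only.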

### Most Important Metric in Medical Diagnosis
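In a medical context, **Recall** is the most important metric, provided it is computed for the malignant class.

Reason:
Recall for the malignant class measures the fraction of patients who actually have cancer (malignant tumors) that the model correctly identifies.

A low malignant recall would mean:
Some cancer patients are predicted as healthy, which can be very dangerous.

Note that this dataset labels malignant as 0 and benign as 1, while `recall_score` treats label 1 as the positive class by default. The Recall column in the comparison table is therefore the recall for benign tumors. The malignant recall can be read from the first row of each confusion matrix: 39/42 ≈ 0.929 for Logistic Regression, 36/42 ≈ 0.857 for SVM, and 38/42 ≈ 0.905 for KNN.

Therefore, minimizing false negatives (malignant tumors predicted as benign) is critical in healthcare applications.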
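
Because scikit-learn encodes malignant as 0 and benign as 1, `recall_score` with its default `pos_label=1` reports recall for the benign class. The sketch below shows how passing `pos_label=0` yields recall for malignant tumors instead. To stay self-contained it rebuilds label arrays from the Logistic Regression confusion matrix reported earlier rather than retraining the model; in the notebook one would pass `y_test` and `y_pred_log` directly.

```python
from sklearn.metrics import recall_score

# Labels rebuilt from the Logistic Regression confusion matrix:
# 42 malignant (0) and 72 benign (1) test samples
y_true = [0] * 42 + [1] * 72
# 39 malignant predicted correctly, 3 missed; 1 benign misclassified, 71 correct
y_pred = [0] * 39 + [1] * 3 + [0] * 1 + [1] * 71

# Default positive class is label 1 (benign)
benign_recall = recall_score(y_true, y_pred)

# Treat label 0 (malignant) as the positive class
malignant_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"Benign recall:    {benign_recall:.6f}")
print(f"Malignant recall: {malignant_recall:.6f}")
```

The malignant recall is the quantity that reflects missed cancer cases, so it is the one to compare across models when false negatives matter most.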