## Import Libraries and Load Dataset

- Import pandas, NumPy, and the dataset loader from scikit-learn.
- Load the Breast Cancer dataset from sklearn.
- Extract the features (X) and the target variable (y).

In [1]:
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

data = load_breast_cancer()

X = data.data
y = data.target

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)

Shape of X: (569, 30)
Shape of y: (569,)
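
Before training, it helps to know what the two target labels mean, because the metrics later in this notebook treat label 1 as the positive class. The sketch below is an addition, not part of the original cells; `label_names` and `class_counts` are illustrative names. It reads the label names from the dataset object and counts the samples per class.

```python
from sklearn.datasets import load_breast_cancer
import numpy as np

data = load_breast_cancer()

# target_names is indexed by label value: position 0 names label 0, and so on.
label_names = [str(name) for name in data.target_names]

# Count how many samples carry each label.
counts = np.bincount(data.target)
class_counts = {name: int(c) for name, c in zip(label_names, counts)}

print("Label names:", label_names)
print("Samples per class:", class_counts)
```

In this dataset label 0 is malignant and label 1 is benign, so the benign class is the larger one.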


## Train-Test Split

- Split the dataset into training and testing sets.
- Use 80% for training and 20% for testing.
- Use `stratify=y` to keep the class distribution the same in both sets.

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
print("training samples:", X_train.shape[0])
print("testing samples:", X_test.shape[0])

training samples: 455
testing samples: 114
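
To see what `stratify=y` does, one can compare the share of label 1 in the full dataset, the training set, and the test set. This is a minimal sketch that repeats the split above with the same arguments; the `*_rate` variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# y holds 0/1 labels, so the mean is the fraction of samples with label 1.
full_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()

print("share of label 1 (full): ", round(full_rate, 4))
print("share of label 1 (train):", round(train_rate, 4))
print("share of label 1 (test): ", round(test_rate, 4))
```

With stratification the three fractions should agree to within about one sample's worth of rounding.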


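## Logistic Regression

- Create a Logistic Regression model with default parameters, except `max_iter=10000`, which gives the solver enough iterations to converge on the unscaled features.
- Train the model on the training data.
- Evaluate the model using classification metrics.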

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

log_model = LogisticRegression(max_iter=10000)

log_model.fit(X_train, y_train)

y_pred_log = log_model.predict(X_test)

print("Logistic Regression Results")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print("Precision:", precision_score(y_test, y_pred_log))
print("Recall:", recall_score(y_test, y_pred_log))
print("F1-score:", f1_score(y_test, y_pred_log))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_log))

Logistic Regression Results
Accuracy: 0.9649122807017544
Precision: 0.9594594594594594
Recall: 0.9861111111111112
F1-score: 0.9726027397260274
Confusion Matrix:
 [[39  3]
 [ 1 71]]
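
The four scores printed above can be recomputed by hand from the confusion matrix, which makes clear what each one measures. The sketch below hard-codes the matrix from the output above. scikit-learn lays it out with true labels as rows and predicted labels as columns, and label 1 is the positive class by default.

```python
import numpy as np

# Confusion matrix copied from the Logistic Regression output above.
cm = np.array([[39, 3],
               [1, 71]])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # correct predictions / all predictions
precision = tp / (tp + fp)           # of predicted positives, how many are right
recall = tp / (tp + fn)              # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

These values match the ones reported by `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for this model.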


## Support Vector Machine

- Create an SVM classifier (`SVC`) with default parameters, which means an RBF kernel.
- Train the model on the training data.
- Evaluate the model using the same metrics for comparison.

In [4]:
from sklearn.svm import SVC

svm_model = SVC()

svm_model.fit(X_train, y_train)

y_pred_svm = svm_model.predict(X_test)

print("SVM Results")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Precision:", precision_score(y_test, y_pred_svm))
print("Recall:", recall_score(y_test, y_pred_svm))
print("F1-score:", f1_score(y_test, y_pred_svm))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_svm))

SVM Results
Accuracy: 0.9298245614035088
Precision: 0.9210526315789473
Recall: 0.9722222222222222
F1-score: 0.9459459459459459
Confusion Matrix:
 [[36  6]
 [ 2 70]]


## K-Nearest Neighbors

- Create a KNN model with default parameters (5 neighbors).
- Train the model on the training set.
- Evaluate its performance using the same classification metrics.

In [5]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier()

knn_model.fit(X_train, y_train)

y_pred_knn = knn_model.predict(X_test)

print("KNN Results")
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("Precision:", precision_score(y_test, y_pred_knn))
print("Recall:", recall_score(y_test, y_pred_knn))
print("F1-score:", f1_score(y_test, y_pred_knn))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_knn))

KNN Results
Accuracy: 0.9122807017543859
Precision: 0.9428571428571428
Recall: 0.9166666666666666
F1-score: 0.9295774647887324
Confusion Matrix:
 [[38  4]
 [ 6 66]]
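
Both `SVC` with its default RBF kernel and KNN rely on distances between feature vectors, so features with large numeric ranges can dominate the result. The original notebook trains on unscaled features. As an optional experiment that is not part of the original analysis, the sketch below wraps each model in a pipeline with `StandardScaler`; the dictionary names are illustrative, and no particular improvement is claimed here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each pipeline fits the scaler on the training data only,
# then applies the same scaling to the test data at predict time.
scaled_models = {
    "SVM (scaled)": make_pipeline(StandardScaler(), SVC()),
    "KNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

scaled_f1 = {}
for name, model in scaled_models.items():
    model.fit(X_train, y_train)
    scaled_f1[name] = f1_score(y_test, model.predict(X_test))
    print(name, "F1-score:", scaled_f1[name])
```

Using a pipeline keeps the scaler from seeing the test set during fitting, so the comparison with the unscaled results stays fair.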


## Model Comparison

- Summarize all evaluation metrics in one table.
- This makes comparison between models easier.
- Identify the best performing model.

In [9]:
results = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "KNN"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_log),
        accuracy_score(y_test, y_pred_svm),
        accuracy_score(y_test, y_pred_knn)
    ],
    "Precision": [
        precision_score(y_test, y_pred_log),
        precision_score(y_test, y_pred_svm),
        precision_score(y_test, y_pred_knn)
    ],
    "Recall": [
        recall_score(y_test, y_pred_log),
        recall_score(y_test, y_pred_svm),
        recall_score(y_test, y_pred_knn)
    ],
    "F1-Score": [
        f1_score(y_test, y_pred_log),
        f1_score(y_test, y_pred_svm),
        f1_score(y_test, y_pred_knn)
    ],
    "confusion_matrix":[
        confusion_matrix(y_test, y_pred_log),
        confusion_matrix(y_test, y_pred_svm),
        confusion_matrix(y_test, y_pred_knn)
    ]
})

results

|   | Model | Accuracy | Precision | Recall | F1-Score | confusion_matrix |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.964912 | 0.959459 | 0.986111 | 0.972603 | [[39, 3], [1, 71]] |
| 1 | SVM | 0.929825 | 0.921053 | 0.972222 | 0.945946 | [[36, 6], [2, 70]] |
| 2 | KNN | 0.912281 | 0.942857 | 0.916667 | 0.929577 | [[38, 4], [6, 66]] |
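
The comparison table repeats the same four metric calls for every model. One way to cut the repetition is a small helper that returns one row per model. `summarize` is a hypothetical name introduced here, and the toy labels are for illustration only, not taken from the notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def summarize(name, y_true, y_pred):
    """Return one table row of metrics for a model (hypothetical helper)."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "Recall": recall_score(y_true, y_pred),
        "F1-Score": f1_score(y_true, y_pred),
    }


# Toy labels: one true 0 is wrongly predicted as 1, everything else is correct.
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]

row = summarize("toy model", toy_true, toy_pred)
print(row)
```

In the notebook, passing a list of such rows (one each for `y_pred_log`, `y_pred_svm`, and `y_pred_knn`) to `pd.DataFrame` would build the same numeric columns as the table above.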


## Conclusion


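- Logistic Regression is the best model on this test split: it has the highest accuracy (0.965), precision (0.959), recall (0.986), and F1-score (0.973) of the three.
- In medical diagnosis, recall for the cancer class is the most important metric, because missing a cancer case can be life-threatening.
- In this dataset label 0 is malignant and label 1 is benign, and scikit-learn treats label 1 as the positive class by default, so the Recall column above measures how many benign cases were found. Missed cancers are the malignant cases predicted as benign, which is the top-right cell of each confusion matrix: 3 for Logistic Regression, 6 for SVM, and 4 for KNN. Logistic Regression also misses the fewest malignant cases.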
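
Because label 0 is the malignant class in this dataset, recall for cancer cases has to be requested explicitly with `pos_label=0`. The sketch below is an addition under one assumption: instead of refitting the models, it rebuilds label arrays from the confusion matrices printed earlier (rows are true labels, columns are predicted labels). `labels_from_matrix` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import recall_score

# Confusion matrices copied from the outputs earlier in the notebook.
matrices = {
    "Logistic Regression": np.array([[39, 3], [1, 71]]),
    "SVM": np.array([[36, 6], [2, 70]]),
    "KNN": np.array([[38, 4], [6, 66]]),
}


def labels_from_matrix(cm):
    """Expand a 2x2 confusion matrix into matching y_true / y_pred arrays."""
    y_true, y_pred = [], []
    for true_label in (0, 1):
        for pred_label in (0, 1):
            n = int(cm[true_label, pred_label])
            y_true += [true_label] * n
            y_pred += [pred_label] * n
    return np.array(y_true), np.array(y_pred)


malignant_recall = {}
for name, cm in matrices.items():
    y_true, y_pred = labels_from_matrix(cm)
    # pos_label=0 makes the malignant class the one whose recall is measured.
    malignant_recall[name] = recall_score(y_true, y_pred, pos_label=0)
    print(name, "malignant recall:", round(malignant_recall[name], 4))
```

In the notebook itself the same number comes from `recall_score(y_test, y_pred_log, pos_label=0)` and its counterparts for the other two models.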