# 0. Setting Up The Data

In [None]:
%pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# data (as pandas dataframes) 
df = breast_cancer_wisconsin_diagnostic.data.original 
  
# metadata 
print(breast_cancer_wisconsin_diagnostic.metadata) 
  
# variable information 
print(breast_cancer_wisconsin_diagnostic.variables) 


# 1. Business Understanding

**Problem:** Predict if a breast cancer tumor is malignant or benign based on diagnostic measurements.  
**Objective:** Learn to apply the kNN algorithm to classify tumors and evaluate performance.

# 2. Data Understanding

In [None]:
df.info()
df.describe().T

The dataset contains 569 instances and 32 columns:
- ID: Unique identifier
- Diagnosis: Target variable (M = Malignant, B = Benign)
- 30 numeric features: measurements of breast tumors

**Observations:**
- No null values
- Distribution: 212 malignant, 357 benign.
- Values are not normalized
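
The first two observations can be checked directly in pandas. The sketch below is a minimal illustration on a small made-up frame; the helper name `summarize_target` and the toy values are assumptions for the example, not part of the dataset. Calling the same helper on `df` with the `Diagnosis` column should reproduce the counts listed above.

```python
import pandas as pd

def summarize_target(frame, target):
    """Return (total number of missing values, class counts of the target column)."""
    n_missing = int(frame.isna().sum().sum())
    class_counts = frame[target].value_counts().to_dict()
    return n_missing, class_counts

# Toy stand-in for the real DataFrame (values are made up).
toy = pd.DataFrame({
    "Diagnosis": ["M", "B", "B", "M", "B"],
    "radius1": [17.9, 11.4, 12.3, 20.1, 9.5],
})

n_missing, class_counts = summarize_target(toy, "Diagnosis")
print(n_missing)
print(class_counts)
```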

# 3. Data Preparation

In [None]:
# Drop ID column
df = df.drop(columns=["ID"])

# Map diagnosis to binary values: Malignant = 1, Benign = 0
df["Diagnosis"] = df["Diagnosis"].map({"M":1, "B":0})

# Split features and target variable
features = df.drop(columns=["Diagnosis"])
labels = df["Diagnosis"]

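# Standardize features to zero mean and unit variance (z-score).
# Note: the mean and standard deviation are computed on the full dataset, before any split.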
features = (features - features.mean()) / features.std()

# Display summary statistics of features
features.describe().T

In [None]:
import matplotlib.pyplot as plt

# Base names of the measurements; each one appears in df with the suffixes 1, 2 and 3.
# A separate variable name is used so that the `features` DataFrame is not overwritten.
feature_names = ['radius', 'texture', 'perimeter', 'area', 'smoothness', 'compactness', 
                 'concavity', 'concave_points', 'symmetry', 'fractal_dimension']

for name in feature_names:
    cols = [f"{name}1", f"{name}2", f"{name}3"]
    df_subset = df[cols].copy()
    df_subset.columns = [f"{name.capitalize()}1", f"{name.capitalize()}2", f"{name.capitalize()}3"]
    
    # Overlay the histograms of the three columns for this measurement
    df_subset.plot(kind='hist', bins=30, alpha=0.5, figsize=(8,5),
                   title=f"Distribution of {name.capitalize()}")
    plt.xlabel(name.capitalize())
    plt.show()


# 4. Modeling

### Data splitting
The data is split into three sets using the hold-out validation technique:
- Training Set: 60% of the data for training the classifier
- Validation Set: 20% to select the best hyperparameter k value
- Test Set: 20% to evaluate the final model performance

Stratified sampling is used to maintain class distribution across all sets.

In [None]:
from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels
)

X_train, X_val, y_train, y_val = train_test_split(
    X_temp,
    y_temp,
    test_size=0.25,
    random_state=42,
    stratify=y_temp
)

In [None]:
X_train.shape, X_val.shape, X_test.shape

The training set contains 341 instances, and the validation and test sets contain 114 instances each, which confirms the 60-20-20 split.
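
The split sizes and the effect of stratification can be reproduced without the real data. The sketch below is an illustration under assumptions: it uses synthetic labels with the same class counts as the dataset (357 benign, 212 malignant) and a placeholder feature column, then compares the malignant share in each split with the share in the full label vector.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's class counts: 357 benign (0), 212 malignant (1).
y = np.array([0] * 357 + [1] * 212)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features, one column

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

split_sizes = {"train": len(y_train), "val": len(y_val), "test": len(y_test)}

# Fraction of malignant labels overall and in each split
malignant_share = {
    "all": y.mean(),
    "train": y_train.mean(),
    "val": y_val.mean(),
    "test": y_test.mean(),
}
print(split_sizes)
print({name: round(float(share), 3) for name, share in malignant_share.items()})
```

With stratification, the malignant share in each split should stay close to the overall share of 212/569.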

### Hyperparameter Tuning
We will train kNN classifiers with different odd values of k (from 1 to 21) and evaluate their accuracy on the validation set to select the best model configuration.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

k_values = range(1, 22, 2)

results = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    y_val_pred = knn.predict(X_val)
    val_accuracy = accuracy_score(y_val, y_val_pred)

    results.append({
        "k": k,
        "Validation Accuracy": val_accuracy
    })

results_df = pd.DataFrame(results)
results_df.sort_values("Validation Accuracy", ascending=False)

The validation results indicate that several k values reach the same highest accuracy. `idxmax` returns the first of these, which is the smallest k.

In [None]:
best_k = results_df.loc[results_df["Validation Accuracy"].idxmax(), "k"]

print("Best k =", best_k)
best_k

The code above reports the smallest k with the highest validation accuracy. However, apart from a slight dip at k = 3, the larger k values reach the same level of accuracy.
kNN assigns a new data point the class that wins a majority vote among its k nearest neighbours, so an odd k cannot produce a tied vote in a two-class problem. Choosing 5 as the k value therefore seems most sensible.
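
The voting argument can be illustrated with a few lines of plain Python. This is a minimal sketch on made-up one-dimensional points, not the scikit-learn implementation; the function name `neighbor_votes` and the toy values are assumptions for the example. It shows that an even k can split the vote evenly between two classes, while an odd k cannot.

```python
from collections import Counter

def neighbor_votes(train_points, train_labels, query, k):
    """Count the class labels of the k training points closest to `query` (1-D data)."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: abs(pair[0] - query),
    )
    return Counter(label for _, label in ranked[:k])

# Toy training data: one benign point and two malignant points.
points = [1.0, 2.0, 3.0]
labels_toy = ["B", "M", "M"]

votes_k2 = neighbor_votes(points, labels_toy, query=1.4, k=2)  # one vote each: a tie
votes_k3 = neighbor_votes(points, labels_toy, query=1.4, k=3)  # malignant wins 2 to 1
print(votes_k2)
print(votes_k3)
```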

# 5. Evaluation

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

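# Fit the final model with the chosen k.
# Note: the model is fitted and evaluated on the full dataset, so the
# confusion matrix includes the points the model was fitted on.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(features, labels)

labels_pred = model.predict(features)
# Rows are true labels, columns are predicted labels (0 = Benign, 1 = Malignant)
cm = confusion_matrix(labels, labels_pred)
cmd = ConfusionMatrixDisplay(cm, display_labels=["Benign", "Malignant"])
cmd.plot()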

One benign tumor is erroneously identified as malignant, and 10 malignant tumors are erroneously identified as benign.
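
The cell above fits and scores the model on the full dataset, so the confusion matrix includes points the model has already seen. A hedged alternative, in line with the hold-out plan from the Modeling section, is to fit on the training portion only and score on the held-out test rows. The sketch below illustrates this on synthetic data from `make_classification` (a stand-in, not the real measurements) and puts the scaling inside a pipeline so that the mean and standard deviation are learned from the training rows alone. It is an illustration of the idea, not the evaluation used for the numbers reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the feature matrix (569 x 30).
X, y = make_classification(n_samples=569, n_features=30, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler learns its mean and standard deviation from the training rows only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

# Score on rows the model has never seen
y_test_pred = model.predict(X_test)
cm_test = confusion_matrix(y_test, y_test_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(cm_test)
print(test_accuracy)
```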


In [None]:
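# Counts read from the confusion matrix above
accuracy = (356 + 202) / 569
precision_be = 356 / (356 + 10)   # share of benign predictions that are truly benign
precision_mal = 202 / (202 + 1)   # share of malignant predictions that are truly malignant
recall_be = 356 / 357             # share of benign tumors that were detected
recall_mal = 202 / 212            # share of malignant tumors that were detected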

print(accuracy)
print(precision_be)
print(precision_mal)
print(recall_be)
print(recall_mal)

Accuracy is the number of correctly identified tumors (356 benign and 202 malignant) divided by the total number of tumors in the dataset, 569.
This rounds to 98.1%.
Precision for benign tumors is the 356 correctly identified benign tumors divided by all tumors predicted as benign, that is, those 356 plus the 10 malignant tumors incorrectly classified as benign. This rounds to 97.3%.
Precision for malignant tumors follows the same principle and rounds to 99.5%.
Recall for benign tumors is the 356 correctly identified benign tumors divided by all 357 benign tumors in the dataset, which rounds to 99.7%.
Recall for malignant tumors rounds to 95.3%.
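
The same arithmetic can be written once as a small helper, which makes the role of each count explicit. This is a minimal sketch: the function name `binary_metrics` is an assumption for the example, malignant is treated as the positive class, and the four counts are the ones used in the calculations above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy plus per-class precision and recall, with malignant as the positive class."""
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision_benign": tn / (tn + fn),
        "precision_malignant": tp / (tp + fp),
        "recall_benign": tn / (tn + fp),
        "recall_malignant": tp / (tp + fn),
    }

# 356 benign and 202 malignant tumors classified correctly,
# 1 benign predicted as malignant, 10 malignant predicted as benign.
metrics = binary_metrics(tn=356, fp=1, fn=10, tp=202)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
print(percentages)
```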

# 6. Deployment

The completed model shows some potential for determining the nature of a tumor from its physical measurements.
However, the model appears to have a slight bias towards classifying its inputs as benign.
Both recall for malignant tumors and precision for benign tumors are noticeably lower than the other metrics of the model.
The dataset provided is substantive enough in quantity that we can consider this as statistically significant, and thus the model is not suited