**Breast Cancer Diagnosis Prediction using K-Nearest Neighbors (KNN) and K-Means Clustering** 
##
Classify breast cancer diagnoses into two categories: malignant (M) and benign (B). The project analyzes tumor characteristics from the Wisconsin Diagnostic Breast Cancer dataset.
##

**Objective**
##
Apply both unsupervised (K-Means) and supervised (KNN) learning techniques

Compare clustering vs classification approaches

Achieve high accuracy in cancer diagnosis prediction

Create reproducible and deployable machine learning models
##

**Import Libraries**
##
Libraries used across all approaches and stages of the project.
##

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pickle
import warnings
warnings.filterwarnings('ignore')

**Load and Process Data**
##
-  Load the dataset from the CSV file and clean it: encode the diagnosis as 1 (M) or 0 (B) and drop the id and unnamed columns.
-  Normalize the features to the range 0 to 1 using MinMaxScaler.
#

In [13]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_csv('../Data/dataset.csv')

df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})

df.drop(columns=['id'], inplace=True, errors='ignore')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
print("Data preprocessing complete.")
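# X_train / X_test must exist before they are reported; this split uses the same
# parameters as the K-Means cell (train_test_split is imported in the Import Libraries cell)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")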
df_preprocessed = X_scaled.copy()
df_preprocessed['diagnosis'] = y.reset_index(drop=True)

preprocess_path = '../Data/preprocess.csv'
df_preprocessed.to_csv(preprocess_path, index=False)
print(f"Saved preprocessed data to {preprocess_path}")

Data preprocessing complete.
Training samples: 455, Test samples: 114
Saved preprocessed data to ../Data/preprocess.csv
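
The MinMax normalization used above rescales each feature column independently to the range 0 to 1 via (x - min) / (max - min), which is the transformation `MinMaxScaler` applies with its default `feature_range=(0, 1)`. A minimal NumPy sketch on made-up values (`X_toy` is a hypothetical array, not taken from the dataset):

```python
import numpy as np

# Hypothetical toy matrix: 3 samples, 2 features on very different scales
X_toy = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [15.0, 300.0]])

# Per-column minimum and maximum
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Rescale every column to [0, 1]; this sketch does not guard against
# constant columns, where col_max - col_min would be zero
X_toy_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_toy_scaled)
```

After scaling, both columns span the same range, so neither feature dominates the distance computations that K-Means and KNN rely on.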


**K-Means Clustering** 
##
K-Means is an unsupervised learning algorithm that groups data points into K clusters by minimizing the distance from each point to its cluster centroid.
##
##
-  Split the data into 80% train and 20% test.
-  Run K-Means clustering with k=2.
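
The centroid idea can be illustrated with a small hand-written loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is only a sketch of the basic iteration on made-up data; the function name `kmeans_sketch` and the `init_centroids` argument are hypothetical, and scikit-learn's `KMeans` adds k-means++ initialization and multiple restarts (`n_init`) on top of it.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, n_iter=10):
    """Toy K-Means loop: nearest-centroid assignment followed by centroid update."""
    centroids = np.asarray(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        # Squared Euclidean distance of every point to every centroid: shape (n_samples, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points (unchanged if the cluster is empty)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    # Final assignment against the updated centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids

# Two well-separated toy blobs (made-up values, not the breast cancer data)
X_toy = np.array([[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
                  [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]])

# Start from one point in each blob so the result is deterministic
labels, centroids = kmeans_sketch(X_toy, init_centroids=X_toy[[0, 3]])
print(labels)
print(centroids)
```

Starting both centroids inside the same blob can leave this loop stuck in a poor solution, which is one reason the notebook cell below passes `n_init=10`.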

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

df = pd.read_csv('../Data/preprocess.csv')
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Save test set to ../Data/test.csv (features + label)
test_df = X_test.copy().reset_index(drop=True)
test_df['diagnosis'] = y_test.reset_index(drop=True)
test_path = '../Data/test.csv'
test_df.to_csv(test_path, index=False)
print(f"Saved test set to {test_path}")

# Run K-Means on the full preprocessed feature set (or use X_train if preferred)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)            # change to kmeans.fit(X_train) if you prefer clustering only on train
cluster_labels = kmeans.labels_

# Compare clusters with actual diagnosis
cluster_df = pd.DataFrame({'Actual': y.values, 'Cluster': cluster_labels})

# Determine which cluster corresponds to malignant (1) by majority
cluster_means = cluster_df.groupby('Cluster')['Actual'].mean()
print("Average diagnosis value per cluster:")
print(cluster_means)

# Map cluster id -> predicted label (cluster with larger mean -> malignant)
cluster_to_label = {}
if cluster_means.shape[0] == 2:
    malignant_cluster = int(cluster_means.idxmax())
    benign_cluster = int(cluster_means.idxmin())
    cluster_to_label = {malignant_cluster: 1, benign_cluster: 0}
else:
    # fallback: assign 1 to cluster with mean >= 0.5
    for c, m in cluster_means.items():
        cluster_to_label[int(c)] = int(m >= 0.5)

predicted_labels = pd.Series(cluster_labels).map(cluster_to_label)

# Report alignment
cross_tab = pd.crosstab(cluster_df['Actual'], cluster_df['Cluster'], 
                        rownames=['Actual'], colnames=['Cluster'])
print("\nCross-tabulation between Actual Diagnosis and K-Means Clusters:")
print(cross_tab)

accuracy = (predicted_labels.values == cluster_df['Actual'].values).mean()
print(f"\nApproximate clustering alignment accuracy (after mapping): {accuracy:.4f}")

# Visualize actual vs clusters
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.countplot(x=y.values, ax=axes[0])
axes[0].set_title('Actual Diagnosis Distribution')
axes[0].set_xlabel('Diagnosis (0=Benign, 1=Malignant)')
axes[0].set_ylabel('Count')

sns.countplot(x=cluster_labels, ax=axes[1])
axes[1].set_title('K-Means Clusters (k=2)')
axes[1].set_xlabel('Cluster Label')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()
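
Because K-Means cluster ids are arbitrary, comparing them directly with the diagnosis can understate the alignment, which is why the cell above maps clusters to labels by majority before computing accuracy. A minimal sketch of that mapping on made-up arrays (`actual` and `clusters` are hypothetical values, not results from the dataset):

```python
import numpy as np

# Made-up toy labels: 1 = malignant, 0 = benign
actual = np.array([1, 1, 1, 0, 0, 0, 0, 1])
# Cluster ids as K-Means might return them; here cluster 0 holds mostly malignant cases
clusters = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Comparing raw cluster ids with the diagnosis ignores that the ids are arbitrary
naive_alignment = (clusters == actual).mean()

# Fraction of malignant cases inside each cluster
malignant_rate = {int(c): float(actual[clusters == c].mean()) for c in np.unique(clusters)}

# Treat the cluster with the higher malignant fraction as the "malignant" cluster
malignant_cluster = max(malignant_rate, key=malignant_rate.get)
predicted = (clusters == malignant_cluster).astype(int)

mapped_alignment = (predicted == actual).mean()
print(f"naive: {naive_alignment:.2f}, after mapping: {mapped_alignment:.2f}")
```

The mapping uses the true labels, so the resulting figure describes how well the clusters line up with the diagnosis rather than a held-out prediction accuracy.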