**<span style="color: purple; font-size: 30px;">Breast Cancer Clustering Using K-Means</span>**

## Project Overview
This project applies **unsupervised machine learning** techniques to the Breast Cancer dataset in order to identify natural groupings of patients based on their cell characteristics.  
Because the dataset does not include diagnostic labels (Malignant/Benign), **K-Means clustering** is used to partition patients into clusters, which can then be interpreted as malignant-like or benign-like groups.

## Objectives
- Preprocess the data (remove uninformative columns, standardize the features).
- Apply **K-Means clustering** to group patients.
- Evaluate clustering quality using **Silhouette Score**, **Davies-Bouldin Index**, and **Calinski-Harabasz Score**.
- Analyze feature patterns within each cluster to interpret which cluster may correspond to malignant vs benign tumors.


In [3]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score


In [5]:
## Load Dataset
df = pd.read_csv("Breast_Cancer_Diagnostic - Breast_Cancer_Diagnostic.csv")
df

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883
...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016


In [12]:
df.dtypes  

radius_mean               float64
texture_mean              float64
perimeter_mean            float64
area_mean                 float64
smoothness_mean           float64
compactness_mean          float64
concavity_mean            float64
concave points_mean       float64
symmetry_mean             float64
fractal_dimension_mean    float64
dtype: object

In [13]:
# Step 1: Standardize the float data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)   # since all columns are float, we can scale directly
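
As a quick illustration of what the standardization step does, the sketch below applies `StandardScaler` to a small made-up array (not the project data) whose two columns sit on very different scales, loosely mimicking `radius_mean` and `area_mean`. The names `X_toy`, `col_means`, and `col_stds` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
X_toy = np.array([[10.0, 300.0],
                  [15.0, 700.0],
                  [20.0, 1100.0]])

X_toy_scaled = StandardScaler().fit_transform(X_toy)

# After scaling, every column has mean 0 and (population) standard deviation 1,
# so no single feature dominates the Euclidean distances used by K-Means.
col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)
print(col_means, col_stds)
```

Without this step, a large-valued feature such as area would likely dominate the distance computation.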

In [14]:
# Step 2: Apply K-Means (k=2, assuming one malignant-like and one benign-like group)
kmeans = KMeans(n_clusters=2, random_state=28)
clusters = kmeans.fit_predict(X_scaled)

In [16]:
# Step 3: add cluster labels back to dataframe
df['Cluster'] = clusters
df

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,Cluster
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0
...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,1
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0


In [19]:
# Step 4: calculate cluster averages for key features
cluster_means = df.groupby('Cluster')[['radius_mean', 'area_mean', 'concavity_mean']].mean()
print("Cluster feature means: \n", cluster_means)                                   

Cluster feature means: 
          radius_mean    area_mean  concavity_mean
Cluster                                          
0          18.122446  1050.313690        0.185352
1          12.453511   489.224938        0.048348
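
The per-cluster averages above can also be read from the fitted centroids once they are mapped back to the original units. The sketch below shows this on synthetic data, since the CSV is not reproduced here. The column names `radius_demo` and `area_demo` and all `*_demo` variables are hypothetical stand-ins, and the two groups are deliberately well separated.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features (values and column names are made up).
rng = np.random.default_rng(0)
small = rng.normal(loc=[12.0, 450.0], scale=[1.0, 40.0], size=(100, 2))
large = rng.normal(loc=[20.0, 1200.0], scale=[1.0, 60.0], size=(100, 2))
demo = pd.DataFrame(np.vstack([small, large]), columns=["radius_demo", "area_demo"])

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(demo)

kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_demo = kmeans_demo.fit_predict(X_demo_scaled)

# Centroids live in standardized space; inverse_transform maps them back to original units.
centers_original_units = pd.DataFrame(
    scaler_demo.inverse_transform(kmeans_demo.cluster_centers_),
    columns=demo.columns,
)

# Per-cluster means computed directly on the unscaled data.
group_means = demo.groupby(labels_demo).mean()

print(centers_original_units)
print(group_means)
```

The two tables should agree up to K-Means' convergence tolerance, because each centroid is the mean of its assigned points and standardization is a per-column linear transform.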


In [20]:
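# Step 5: Decide which cluster = M or B
# Heuristic: treat the cluster with the larger mean radius as malignant-like.
# (Averaging the raw means across features would be dominated by area_mean,
# which is on a much larger scale than the other two.)
malignant_cluster = cluster_means['radius_mean'].idxmax()
cluster_mapping = {c: ("M" if c == malignant_cluster else "B") for c in cluster_means.index}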

In [21]:
# Step 6: Apply mapping (a cluster-derived label, not a clinical diagnosis)
df['Diagnosis'] = df['Cluster'].map(cluster_mapping)
df

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,Cluster,Diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,0,M
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,0,M
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,0,M
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,0,M
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,0,M
...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,0,M
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,0,M
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,1,B
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,0,M


In [22]:
# Step 7: Evaluate clustering
sil_score = silhouette_score(X_scaled, clusters)
db_score = davies_bouldin_score(X_scaled, clusters)
ch_score = calinski_harabasz_score(X_scaled, clusters)

print("Silhouette Score:", sil_score)
print("Davies Bouldin Score:", db_score)
print("Calinski-Harabasz Score:", ch_score)

Silhouette Score: 0.3935334845695349
Davies Bouldin Score: 1.10046782044029
Calinski-Harabasz Score: 359.6417155726988


## 1. Silhouette Score

- **Range:** -1 to +1
- **Meaning:**
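    - Near +1 → clusters are dense and well separated.
    - Near 0 → clusters overlap; many samples sit close to the boundary between clusters.
    - Near -1 → samples are, on average, closer to a neighboring cluster than to their own (likely misassigned).
- **Benefit:** Shows how well defined the clusters are. For this dataset, a higher score means patients fall more distinctly into malignant-like and benign-like groups. The score of about 0.39 obtained above indicates moderate separation with some overlap between the two clusters.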
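
To make the definition concrete, here is a minimal sketch that computes the per-sample silhouette value s = (b - a) / max(a, b) by hand on a four-point toy dataset and compares it with scikit-learn. Here a is the mean distance to the other points in the same cluster and b is the mean distance to the nearest other cluster. The toy data and the helper `silhouette_by_hand` are illustrative assumptions, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 1-D data (made up): two tight, well-separated clusters.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Per-sample silhouette values; assumes each cluster has >= 2 points."""
    scores = []
    for i in range(len(X)):
        dists = np.linalg.norm(X - X[i], axis=1)
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dists[same].mean()
        b = min(dists[labels == other].mean()
                for other in set(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return np.array(scores)

manual = silhouette_by_hand(X_toy, labels_toy)
print(manual)                                # per-sample values
print(manual.mean())                         # overall score
print(silhouette_score(X_toy, labels_toy))   # scikit-learn's overall score
```

For the first point, a = 1 and b = 10.5, so s = 9.5 / 10.5, which is close to 1 as expected for well-separated clusters.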


## 2. Davies-Bouldin Index (DBI)

- **Range:** 0 to ∞
- **Meaning:**
    - Lower is better → clusters are compact & well-separated.
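- **Benefit:** Measures the average similarity between each cluster and its most similar neighbor, where similarity is the ratio of within-cluster scatter to the distance between centroids. A low DBI means the malignant-like and benign-like groups are compact relative to the distance between them. With two clusters, the value of about 1.10 obtained above means the combined within-cluster scatter is roughly as large as the distance between the two centroids, so the groups overlap to some degree.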
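
For two clusters the index reduces to (s0 + s1) / d, where s0 and s1 are the mean distances of each cluster's points to its centroid and d is the distance between the centroids. The sketch below checks this on a made-up four-point dataset. The variable names are illustrative, not part of the project code.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Toy 1-D data (made up): two tight, well-separated clusters.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

# Centroid of each cluster.
centroids = np.array([X_toy[labels_toy == k].mean(axis=0) for k in (0, 1)])

# Within-cluster scatter: mean distance of each cluster's points to its centroid.
scatter = np.array([
    np.linalg.norm(X_toy[labels_toy == k] - centroids[k], axis=1).mean()
    for k in (0, 1)
])

# Distance between the two centroids.
centroid_distance = np.linalg.norm(centroids[0] - centroids[1])

dbi_manual = (scatter[0] + scatter[1]) / centroid_distance
dbi_sklearn = davies_bouldin_score(X_toy, labels_toy)
print(dbi_manual, dbi_sklearn)
```

Here each scatter is 0.5 and the centroids are 10 apart, giving a DBI of 0.1, far lower (better) than the value reported for the real data.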


## 3. Calinski-Harabasz Score (CH Index)

- **Range:** 0 to ∞
- **Meaning:**
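    - Higher is better → clusters are dense within themselves and well separated from each other.
- **Benefit:** The score is the ratio of between-cluster dispersion to within-cluster dispersion, so it has no fixed scale and is most useful for comparing alternative clusterings (for example, different values of k) on the same data. The value of about 359.6 obtained above is therefore best read relative to the scores for other settings rather than on its own.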
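
Since these scores are easiest to interpret side by side, one possible way to check the k=2 assumption is to compute all three metrics for several values of k. The sketch below does this on synthetic two-blob data from `make_blobs`, not on the breast cancer features. The data, the range of k, and the `results` dictionary are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with two well-separated groups (illustrative only).
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[-5.0, -5.0], [5.0, 5.0]],
    cluster_std=1.0,
    random_state=0,
)

results = {}
for k in range(2, 6):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    results[k] = {
        "silhouette": silhouette_score(X_demo, labels_k),                # higher is better
        "davies_bouldin": davies_bouldin_score(X_demo, labels_k),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X_demo, labels_k),  # higher is better
    }

for k, scores in results.items():
    print(k, {name: round(value, 3) for name, value in scores.items()})

# Value of k with the highest Calinski-Harabasz score.
best_k_by_ch = max(results, key=lambda k: results[k]["calinski_harabasz"])
print("Best k by Calinski-Harabasz:", best_k_by_ch)
```

On the real data, the same loop could be run on `X_scaled`. If a different k scored clearly better on all three metrics, the two-group interpretation would deserve a second look.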