# Clustering

1. Download any multi-dimensional classification dataset from the UCI repository.
2. Ignore the class labels and perform clustering.
3. Experiment with various clustering techniques (agglomerative, KMeans, ...) and numbers of clusters (3 clusters, 4 clusters, ...).
4. Check and compare the performance against the ground truth using the Rand Index (RI) and Adjusted Rand Index (ARI) metrics.

UCI classification datasets repository link for reference:

### 1. Dataset selection

Here we use the [Wisconsin breast cancer original database](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/).

Attribute Information
- id
- clump_thickness
- uniformity_of_cell_size
- uniformity_of_cell_shape
- marginal_adhesion
- epithelial_cell_size
- bare_nuclei
- bland_chromatin
- normal_nucleoli
- mitosis
- class 

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("Data/wisconsin_breast_cancer.csv")
df.head()

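|   | id | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitosis | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |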


### 2. Data cleaning and clustering

The `bare_nuclei` column has 16 missing values, recorded as "?". Let's drop these rows.

In [3]:
missing = df["bare_nuclei"] == "?"
missing.sum()

16

In [4]:
df = df.drop(df[df["bare_nuclei"] == "?"].index)
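
After the "?" rows are dropped, `bare_nuclei` will typically still be stored as strings, because pandas read the column as text. A minimal sketch of converting it to numbers while dropping the missing rows in one pass, assuming a small hypothetical frame `toy` in place of the real data:

```python
import pandas as pd

# Hypothetical stand-in for the real data: "?" marks a missing value.
toy = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "4"],
    "class": [2, 4, 2, 4],
})

# errors="coerce" turns anything non-numeric (here "?") into NaN.
toy["bare_nuclei"] = pd.to_numeric(toy["bare_nuclei"], errors="coerce")

# Drop the rows that could not be parsed, then store the column as integers.
toy = toy.dropna(subset=["bare_nuclei"]).astype({"bare_nuclei": int})

print(toy.dtypes)
print(len(toy))
```

Converting explicitly avoids relying on the clustering library to cast string columns to floats behind the scenes.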

The cluster labels will be 0 and 1, but our class labels are 2 and 4, so let's replace 2 with 1 and 4 with 0. The Rand indices used later do not depend on the actual label values, so this is only for convenience.

In [5]:
true_labels = np.array(df["class"])
true_labels[true_labels == 2] = 1
true_labels[true_labels == 4] = 0
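
The remapping above is optional as far as scoring goes. A small sketch, using made-up toy labels rather than the dataset, suggesting that both the Rand index and the adjusted Rand index give the same value before and after the 2/4 to 1/0 remapping:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

# Toy labels (hypothetical): the ground truth uses 2/4, like the dataset.
truth = np.array([2, 2, 4, 4, 2, 4])
pred = np.array([0, 0, 1, 1, 1, 1])

# Remap 2 -> 1 and 4 -> 0, mirroring the cell above.
remapped = np.where(truth == 2, 1, 0)

# Both scores depend only on which samples share a cluster,
# not on the integer used to name each cluster.
print(rand_score(truth, pred), rand_score(remapped, pred))
print(adjusted_rand_score(truth, pred), adjusted_rand_score(remapped, pred))
```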

In [6]:
features = ["clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis"]
X = df[features]
X.head()

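|   | clump_thickness | uniformity_of_cell_size | uniformity_of_cell_shape | marginal_adhesion | epithelial_cell_size | bare_nuclei | bland_chromatin | normal_nucleoli | mitosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 |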


### 3. Agglomerative clustering with k = 2, 3, 4 and 5

In [7]:
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=2)
cluster.fit(X)
predicted_agglomerative = cluster.labels_
cluster.labels_[:10]

array([1, 0, 1, 0, 1, 0, 1, 1, 1, 1])

In [8]:
cluster = AgglomerativeClustering(n_clusters=3)
cluster.fit(X)
predicted_agglomerative_3 = cluster.labels_ 
cluster.labels_[:10]

array([1, 0, 1, 2, 1, 2, 1, 1, 1, 1])

In [9]:
cluster = AgglomerativeClustering(n_clusters=4)
cluster.fit(X)
predicted_agglomerative_4 = cluster.labels_ 
cluster.labels_[:10]

array([1, 2, 1, 0, 1, 0, 1, 1, 1, 1])

In [10]:
cluster = AgglomerativeClustering(n_clusters=5)
cluster.fit(X)
predicted_agglomerative_5 = cluster.labels_ 
cluster.labels_[:10]

array([4, 2, 4, 0, 4, 1, 4, 4, 4, 4])
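
The four cells above differ only in `n_clusters`, so they could be folded into one loop. A possible sketch, assuming synthetic data from `make_blobs` as a stand-in for `X` (the names `X_demo` and `agglomerative_labels` are illustrative, not from the notebook):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for X (an assumption, not the breast cancer data).
X_demo, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

# One label array per value of k, keyed by k.
agglomerative_labels = {
    k: AgglomerativeClustering(n_clusters=k).fit_predict(X_demo)
    for k in (2, 3, 4, 5)
}

for k, labels in agglomerative_labels.items():
    print(k, labels[:10])
```

Keeping the results in a dictionary keyed by `k` avoids one variable name per cluster count.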

### 4. KMeans clustering with k = 2, 3, 4 and 5

In [11]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=2)
cluster.fit(X)
predicted_kmeans = cluster.labels_
cluster.labels_[:10]

array([0, 1, 0, 1, 0, 1, 0, 0, 0, 0], dtype=int32)

In [12]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=3)
cluster.fit(X)
predicted_kmeans_3 = cluster.labels_ 
cluster.labels_[:10]

array([0, 1, 0, 1, 0, 2, 0, 0, 0, 0], dtype=int32)

In [13]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=4)
cluster.fit(X)
predicted_kmeans_4 = cluster.labels_ 
cluster.labels_[:10]

array([1, 2, 1, 3, 1, 0, 1, 1, 1, 1], dtype=int32)

In [14]:
from sklearn.cluster import KMeans

cluster = KMeans(n_clusters=5)
cluster.fit(X)
predicted_kmeans_5 = cluster.labels_ 
cluster.labels_[:10]

array([0, 4, 2, 3, 0, 1, 2, 2, 2, 0], dtype=int32)
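
KMeans starts from randomly chosen centroids, so the labels printed above can change from run to run unless a seed is fixed. A hedged sketch of a reproducible variant, again on synthetic `make_blobs` data rather than `X`; the helper `fit_kmeans` and the seed value are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X (an assumption, not the breast cancer data).
X_demo, _ = make_blobs(n_samples=60, centers=3, n_features=4, random_state=0)

def fit_kmeans(k, seed=0):
    """Fit KMeans with a fixed seed and return the cluster labels."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return model.fit_predict(X_demo)

kmeans_labels = {k: fit_kmeans(k) for k in (2, 3, 4, 5)}

# Two fits with the same seed should produce identical labels.
first = fit_kmeans(3)
second = fit_kmeans(3)
print(np.array_equal(first, second))
```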

### 5. Performance comparison using Rand Index (RI)

In [15]:
from sklearn.metrics import rand_score

score = rand_score(true_labels, predicted_agglomerative)
print(f"Rand Index for agglomerative clustering: {score}")

score = rand_score(true_labels, predicted_kmeans)
print(f"Rand Index for KMeans clustering: {score}")

Rand Index for agglomerative clustering: 0.9348226514901053
Rand Index for KMeans clustering: 0.9239511728058463
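
For intuition, the Rand index is the fraction of sample pairs on which two labelings agree: both put the pair in the same cluster, or both put it in different clusters. A minimal pair-counting sketch on toy labels, checked against `rand_score`; the function `rand_index` is written here for illustration and is not how scikit-learn computes it internally:

```python
from itertools import combinations

from sklearn.metrics import rand_score

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if it is "same" in both or "different" in both.
        agree += same_a == same_b
        total += 1
    return agree / total

# Toy labels (hypothetical), with different numbers of clusters.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]

print(rand_index(truth, pred), rand_score(truth, pred))
```

The loop is quadratic in the number of samples, so it is only meant for small examples.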


### 6. Performance comparison using Adjusted Rand Index (ARI)

In [16]:
from sklearn.metrics import adjusted_rand_score

score = adjusted_rand_score(true_labels, predicted_agglomerative)
print(f"Adjusted Rand Index for agglomerative clustering: {score}")

score = adjusted_rand_score(true_labels, predicted_kmeans)
print(f"Adjusted Rand Index for KMeans clustering: {score}")

Adjusted Rand Index for agglomerative clustering: 0.8689991723757481
Adjusted Rand Index for KMeans clustering: 0.8464675664733539
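
The adjusted Rand index corrects the pair agreement for chance, so random labelings score close to 0 and identical ones score 1. A sketch of the standard contingency-table formula on toy labels, compared with `adjusted_rand_score`; the function name `ari_from_contingency` is illustrative:

```python
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

def ari_from_contingency(labels_true, labels_pred):
    """Adjusted Rand index computed from the contingency table."""
    table = contingency_matrix(labels_true, labels_pred)
    n = table.sum()
    # Pairs that fall in the same cell, the same row, and the same column.
    sum_cells = comb(table, 2).sum()
    sum_rows = comb(table.sum(axis=1), 2).sum()
    sum_cols = comb(table.sum(axis=0), 2).sum()
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)        # largest possible agreement
    return (sum_cells - expected) / (maximum - expected)

# Toy labels (hypothetical), with different numbers of clusters.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 2, 2]

print(ari_from_contingency(truth, pred), adjusted_rand_score(truth, pred))
```

This sketch does not handle the degenerate case where the denominator is zero, for example when both labelings put every sample in a single cluster.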


> Comparing the Rand index and adjusted Rand index of the two methods at k = 2, both recover the ground-truth classes well. Agglomerative clustering scores slightly higher than KMeans (RI 0.935 vs 0.924, ARI 0.869 vs 0.846), but the margin is small.

Just for fun, let's also compare the adjusted Rand index for the clusterings with `n_clusters` = 3, 4 and 5, even though the number of clusters no longer matches the number of class labels.



In [17]:

score = adjusted_rand_score(true_labels, predicted_agglomerative_3)
print(f"Adjusted Rand Index for agglomerative clustering k = 3: {score}")

score = adjusted_rand_score(true_labels, predicted_kmeans_3)
print(f"Adjusted Rand Index for KMeans clustering k = 3: {score}")

score = adjusted_rand_score(true_labels, predicted_agglomerative_4)
print(f"Adjusted Rand Index for agglomerative clustering k = 4: {score}")

score = adjusted_rand_score(true_labels, predicted_kmeans_4)
print(f"Adjusted Rand Index for KMeans clustering k = 4: {score}")

score = adjusted_rand_score(true_labels, predicted_agglomerative_5)
print(f"Adjusted Rand Index for agglomerative clustering k = 5: {score}")

score = adjusted_rand_score(true_labels, predicted_kmeans_5)
print(f"Adjusted Rand Index for KMeans clustering k = 5: {score}")

Adjusted Rand Index for agglomerative clustering k = 3: 0.7816780097783734
Adjusted Rand Index for KMeans clustering k = 3: 0.786250221035846
Adjusted Rand Index for agglomerative clustering k = 4: 0.7755273800794471
Adjusted Rand Index for KMeans clustering k = 4: 0.7436565144794507
Adjusted Rand Index for agglomerative clustering k = 5: 0.7333405905288745
Adjusted Rand Index for KMeans clustering k = 5: 0.4019391697196678
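
Scores like these are easier to compare side by side in a table. A possible sketch that collects the adjusted Rand index per method and per `k` into a pandas DataFrame, assuming synthetic two-class `make_blobs` data (`X_demo`, `y_demo`) in place of the breast cancer features and labels:

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic two-class stand-in for X and true_labels (an assumption).
X_demo, y_demo = make_blobs(n_samples=100, centers=2, n_features=4, random_state=0)

rows = []
for k in (2, 3, 4, 5):
    models = {
        "agglomerative": AgglomerativeClustering(n_clusters=k),
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
    }
    for name, model in models.items():
        labels = model.fit_predict(X_demo)
        rows.append({"method": name, "k": k, "ari": adjusted_rand_score(y_demo, labels)})

# One row per k, one column per method.
scores = pd.DataFrame(rows).pivot(index="k", columns="method", values="ari")
print(scores.round(3))
```

With two true classes, the adjusted Rand index usually falls as `k` grows past 2, because a true class gets split across several clusters.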
