# Exploring TypiClust

To identify where TypiClust could be improved, its potential weaknesses must first be analysed. One hypothesis is that TypiClust does not sample classes evenly, so a more class-balanced selection could increase the diversity of the labelled set.

The objective of this experiment is to analyse how clustering methods, particularly K-Means and K-Means++, represent labeled data in the context of TypiClust clustering. It was observed that with a budget size of 60, TypiClust tended to oversample certain classes, potentially impacting model performance. To better understand this phenomenon, clustering results were visualised using t-SNE to compare how clusters formed relative to true labels. Through this analysis, we aim to assess whether K-Means effectively represents the feature space and explore strategies to improve diversity in sampled data.

In [1]:
import torchvision 
import torch 
import numpy as np 

from sklearn.cluster import KMeans
import pandas as pd 

from sklearn.manifold import TSNE 

import plotly.graph_objs as go
from plotly.offline import iplot


## Current Class Diversity 

In the run analysed below, TypiClust oversampled the dog class (9 of the 60 samples, against 6 for a perfectly balanced selection). In earlier runs, the automobile, dog and horse classes also tended to be oversampled, depending on the initial pool.

In [2]:
# Define the CIFAR-10 transform (tensor conversion only, no augmentation)
train_transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
])

# CIFAR10 training dataset
train_ds = torchvision.datasets.CIFAR10(root='./datasets', 
                                       train=True, 
                                       download=True, 
                                       transform=train_transform
                                       )

# Training dataset loader 
train_loader = torch.utils.data.DataLoader(
            train_ds, batch_size=10, shuffle=False,
            num_workers=0, pin_memory=True, drop_last=False)


In [3]:
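# Indices selected by TypiClust with a budget of 60
# (forward slashes keep the relative path portable across operating systems)
budget_60 = np.load('./datasets/typiclust_cifar10_budget_60.npy').tolist()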

In [4]:
# Count how many of the selected samples fall into each CIFAR-10 class
label_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
labels = [0] * 10
for i in budget_60:
    _, label = train_ds[i]
    labels[label] += 1

for name, count in zip(label_names, labels):
    print(f"{name}: {count}")

airplane: 4
automobile: 5
bird: 6
cat: 6
deer: 8
dog: 9
frog: 5
horse: 5
ship: 6
truck: 6
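
The counts above can be condensed into a single balance score. The following is a minimal sketch, not part of TypiClust: the helper name `balance_summary` is hypothetical, and the two measures (max/min ratio and normalised Shannon entropy) are just one reasonable choice. A normalised entropy of 1.0 would correspond to a perfectly uniform selection of 6 images per class.

```python
import numpy as np

def balance_summary(counts):
    """Summarise how evenly a labelling budget is spread over the classes.

    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]  # skip empty classes to avoid log(0)
    entropy = -(nonzero * np.log(nonzero)).sum()
    ratio = counts.max() / counts.min() if counts.min() > 0 else float("inf")
    return {
        "total": int(counts.sum()),
        "max_min_ratio": ratio,
        "normalised_entropy": entropy / np.log(len(counts)),
    }

# Per-class counts printed above, in the order airplane ... truck
budget_60_counts = [4, 5, 6, 6, 8, 9, 5, 5, 6, 6]
summary = balance_summary(budget_60_counts)
print(summary)
```

For the counts above, the max/min ratio is 9/4 = 2.25. Tracking such a score across several runs would make it easier to compare how strongly different initial pools skew the selection.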


# Visualising TypiClust Clustering (T-SNE)

To understand why oversampling may occur, visualising the clusters can provide insights into how they are formed relative to their true labels. By comparing K-Means clustering results with the actual labels of the training set, we can better assess the alignment between clustering structure and ground truth.

In [5]:
# Load feature representations generated from pretrained SimCLR 
feats = np.load("./datasets/feats.npy")

In [6]:
# Clustering according to TypiClust  

# Labelled set indices 
l_set_indices = []

budget_size = 10
MAX_NUM_CLUSTER = 500
perplexity = 50  # t-SNE params

n_clusters = min(len(l_set_indices) + budget_size, 
                            MAX_NUM_CLUSTER)

# K-Means 
km = KMeans(n_clusters=n_clusters, init="k-means++")
km.fit(feats)
clusters = km.predict(feats)
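
The cluster count in the cell above follows the rule `min(len(l_set_indices) + budget_size, MAX_NUM_CLUSTER)`. As a small illustration, the sketch below evaluates that rule for a few labelled-set sizes and budgets. The helper name `typiclust_num_clusters` is hypothetical and is introduced here only to make the rule easy to test.

```python
def typiclust_num_clusters(num_labelled, budget_size, max_num_cluster=500):
    """Cluster count used above: labelled samples plus budget, capped at a maximum."""
    return min(num_labelled + budget_size, max_num_cluster)

# (number of already-labelled samples, budget) pairs chosen for illustration
examples = [(0, 10), (0, 20), (0, 60), (480, 60)]
cluster_counts = [typiclust_num_clusters(n, b) for n, b in examples]

for (n, b), k in zip(examples, cluster_counts):
    print(f"labelled={n:3d}  budget={b:2d}  ->  n_clusters={k}")
```

With an empty labelled set, the rule gives as many clusters as the budget, so a budget of 60 would use 60 clusters, whereas the visualisation in this notebook sets `budget_size = 10` and therefore shows 10.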

In [7]:
feats_df = pd.DataFrame(data=feats)

# Add cluster label to each point 
feats_df['clusters'] = clusters

# Index of each data point in the training set
feats_df['idx'] = np.arange(len(feats))

In [8]:
# Sample 5000 random data points 
plotX = pd.DataFrame(np.array(feats_df.sample(5000)))
plotX.columns = feats_df.columns

In [9]:
# Find the class label of each sampled data point.
# train_ds.targets holds the labels, so no image needs to be decoded.
plotX['label'] = [train_ds.targets[int(i)] for i in plotX['idx']]

In [10]:
plotX

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 505 | 506 | 507 | 508 | 509 | 510 | 511 | clusters | idx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005079 | 0.033506 | 0.004486 | 0.000000 | 0.019639 | 0.012299 | 0.127739 | 0.000000 | 0.104995 | 0.016156 | ... | 0.000000 | 0.000000 | 0.002622 | 0.020218 | 0.025496 | 0.014650 | 0.015216 | 3.0 | 16161.0 | 8 |
| 1 | 0.356676 | 0.000908 | 0.149722 | 0.000000 | 0.027146 | 0.056867 | 0.013451 | 0.000000 | 0.094401 | 0.013550 | ... | 0.000000 | 0.210675 | 0.030789 | 0.072562 | 0.000000 | 0.000000 | 0.012705 | 7.0 | 20282.0 | 4 |
| 2 | 0.005168 | 0.000032 | 0.083838 | 0.000003 | 0.108542 | 0.082445 | 0.059671 | 0.323994 | 0.086045 | 0.048063 | ... | 0.005203 | 0.194623 | 0.002892 | 0.100071 | 0.000000 | 0.230320 | 0.000000 | 9.0 | 44813.0 | 5 |
| 3 | 0.000000 | 0.043847 | 0.155785 | 0.060298 | 0.002540 | 0.001057 | 0.019491 | 0.000000 | 0.289474 | 0.028601 | ... | 0.000000 | 0.215231 | 0.001268 | 0.000000 | 0.000000 | 0.224795 | 0.004475 | 5.0 | 3823.0 | 6 |
| 4 | 0.000000 | 0.000000 | 0.000000 | 0.000797 | 0.000000 | 0.011797 | 0.048466 | 0.000421 | 0.000000 | 0.004504 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.056612 | 0.050381 | 0.061281 | 4.0 | 12345.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 0.000000 | 0.000000 | 0.039677 | 0.303487 | 0.000000 | 0.000000 | 0.028411 | 0.000000 | 0.050262 | 0.001467 | ... | 0.000000 | 0.448323 | 0.008303 | 0.109156 | 0.000000 | 0.049678 | 0.045866 | 6.0 | 33753.0 | 4 |
| 4996 | 0.000000 | 0.142466 | 0.077517 | 0.144023 | 0.000000 | 0.003998 | 0.054043 | 0.105628 | 0.104216 | 0.071372 | ... | 0.005699 | 0.035679 | 0.098226 | 0.089646 | 0.000000 | 0.001508 | 0.000000 | 1.0 | 30111.0 | 2 |
| 4997 | 0.001409 | 0.045695 | 0.170150 | 0.010677 | 0.000000 | 0.003899 | 0.109423 | 0.058182 | 0.096817 | 0.047037 | ... | 0.000000 | 0.000000 | 0.029231 | 0.003437 | 0.000573 | 0.008003 | 0.000000 | 7.0 | 9107.0 | 3 |
| 4998 | 0.000000 | 0.000000 | 0.033023 | 0.101326 | 0.007097 | 0.099455 | 0.067268 | 0.052109 | 0.004868 | 0.000000 | ... | 0.328659 | 0.007695 | 0.033618 | 0.000875 | 0.000000 | 0.059878 | 0.014549 | 7.0 | 38242.0 | 4 |


## T-SNE K-means Plot
The sampled features are projected to 2D with t-SNE and coloured by their K-Means cluster, to visualise TypiClust's 'clustering for diversity' step.


In [11]:
#T-SNE with two dimensions
tsne_2d = TSNE(n_components=2, perplexity=perplexity)

#This DataFrame contains two dimensions, built by T-SNE
TCs_2d = pd.DataFrame(tsne_2d.fit_transform(plotX.drop(["clusters", "label", "idx"], axis=1)))

TCs_2d.columns = ["TC1_2d","TC2_2d"]

plotX = pd.concat([plotX,TCs_2d], axis=1, join='inner')

In [12]:
# Split the sampled points by K-Means cluster (n_clusters = 10, so ids run from 0 to 9)
cluster0 = plotX[plotX["clusters"] == 0]
cluster1 = plotX[plotX["clusters"] == 1]
cluster2 = plotX[plotX["clusters"] == 2]
cluster3 = plotX[plotX["clusters"] == 3]
cluster4 = plotX[plotX["clusters"] == 4]
cluster5 = plotX[plotX["clusters"] == 5]
cluster6 = plotX[plotX["clusters"] == 6]
cluster7 = plotX[plotX["clusters"] == 7]
cluster8 = plotX[plotX["clusters"] == 8]
cluster9 = plotX[plotX["clusters"] == 9]

In [13]:
# One colour per cluster, in cluster order
colours = [
    'rgba(31, 119, 180, 0.8)',   # Cluster 0
    'rgba(255, 127, 14, 0.8)',   # Cluster 1
    'rgba(44, 160, 44, 0.8)',    # Cluster 2
    'rgba(214, 39, 40, 0.8)',    # Cluster 3
    'rgba(148, 103, 189, 0.8)',  # Cluster 4
    'rgba(140, 86, 75, 0.8)',    # Cluster 5
    'rgba(227, 119, 194, 0.8)',  # Cluster 6
    'rgba(127, 127, 127, 0.8)',  # Cluster 7
    'rgba(188, 189, 34, 0.8)',   # Cluster 8
    'rgba(23, 190, 207, 0.8)',   # Cluster 9
]

cluster_frames = [cluster0, cluster1, cluster2, cluster3, cluster4,
                  cluster5, cluster6, cluster7, cluster8, cluster9]

# Build one scatter trace per K-Means cluster
data = [
    go.Scatter(
        x=frame["TC1_2d"],
        y=frame["TC2_2d"],
        mode="markers",
        name=f"Cluster {i}",
        marker=dict(color=colours[i]),
        text=None)
    for i, frame in enumerate(cluster_frames)
]

title = "K-means on Features in 2D Using T-SNE (perplexity=" + str(perplexity) + ")"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)

## T-SNE Labels Plot
The same t-SNE projection is plotted again, this time colour-coded by the true training labels rather than by K-Means cluster.

In [14]:
# Class names in CIFAR-10 label order (0-9), with one colour per class
class_names = ["Airplane", "Automobile", "Bird", "Cat", "Deer",
               "Dog", "Frog", "Horse", "Ship", "Truck"]
colours = [
    'rgba(31, 119, 180, 0.8)',   # Airplane
    'rgba(255, 127, 14, 0.8)',   # Automobile
    'rgba(44, 160, 44, 0.8)',    # Bird
    'rgba(214, 39, 40, 0.8)',    # Cat
    'rgba(148, 103, 189, 0.8)',  # Deer
    'rgba(140, 86, 75, 0.8)',    # Dog
    'rgba(227, 119, 194, 0.8)',  # Frog
    'rgba(127, 127, 127, 0.8)',  # Horse
    'rgba(188, 189, 34, 0.8)',   # Ship
    'rgba(23, 190, 207, 0.8)',   # Truck
]

# Build one scatter trace per true label
data = [
    go.Scatter(
        x=plotX.loc[plotX["label"] == i, "TC1_2d"],
        y=plotX.loc[plotX["label"] == i, "TC2_2d"],
        mode="markers",
        name=name,
        marker=dict(color=colours[i]),
        text=None)
    for i, name in enumerate(class_names)
]

title = "Visualizing feature labels in 2D Using T-SNE (perplexity=" + str(perplexity) + ")"

layout = dict(title = title,
              xaxis= dict(title= 'TC1',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'TC2',ticklen= 5,zeroline= False)
             )

fig = dict(data = data, layout = layout)

iplot(fig)



These plots help visualise and assess how well K-Means represents the different labels within the feature space. The labels plot shows a densely packed region at the centre of the graph, suggesting that images from many classes share similar feature representations. This congested area includes Truck, Cat, Bird, Frog, Dog, Deer, and Automobile.

This could present challenges: in the K-Means plot, the same congested region is covered by multiple overlapping clusters, and Cluster 6 (pink) appears to dominate much of it. Clusters therefore do not consistently correspond to a single label, which is a key consideration for the clustering approach used in TypiClust.
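
The visual impression that clusters mix several labels can also be checked numerically. The sketch below is an illustration under stated assumptions: the helper names `cluster_label_table` and `cluster_purity` are hypothetical, and the toy arrays merely stand in for `plotX["clusters"]` and `plotX["label"]`. It builds a cluster-by-label contingency table and reports, for each cluster, the fraction of points that belong to its most common label.

```python
import numpy as np
import pandas as pd

def cluster_label_table(clusters, labels):
    """Contingency table: one row per cluster, one column per true label."""
    return pd.crosstab(pd.Series(clusters, name="cluster"),
                       pd.Series(labels, name="label"))

def cluster_purity(table):
    """Fraction of each cluster that belongs to its most common label."""
    return table.max(axis=1) / table.sum(axis=1)

# Toy stand-in data (assumed values, not taken from the notebook)
toy_clusters = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_labels = np.array([3, 3, 5, 5, 1, 1, 1, 9, 9, 1])

table = cluster_label_table(toy_clusters, toy_labels)
purity = cluster_purity(table)
print(table)
print(purity)
```

A purity close to 1 means a cluster is dominated by one class, while a low purity flags clusters like the congested central ones, where a single selected sample cannot represent every class the cluster contains.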

To explore ways of better representing the true labeled feature space, the code was adapted to generate the following plots:

### K-Means++ 

K-Means++ was considered for clustering due to the high-dimensional nature of the feature representations. As noted in this [Towards Data Science article](https://towardsdatascience.com/k-means-algorithm-for-high-dimensional-data-clustering-714c6980daa9/), K-Means++ selects initial centroids probabilistically, reducing the risk of poor convergence caused by random initialization while also aiming to enhance cluster separation.

![k-means++](kmeans++betterqualitynewplot.png "K-means++ n=10")

When visualising our K-Means++ clustering with 10 clusters, we observe that it does not accurately capture the dense, labeled region of the training data. Instead, ‘Cluster 0’ appears to represent this entire congested area as a single entity. In contrast, K-Means more effectively represented this region by forming multiple overlapping clusters, providing a more granular view of the data distribution. Therefore, K-Means appears to be the more suitable clustering algorithm for this task.
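
When reproducing this comparison with scikit-learn, note that `KMeans` uses `init="k-means++"` by default, so a plain random-initialisation baseline has to be requested explicitly with `init="random"`. The settings behind the two plots are not shown in this notebook, so the following is only a sketch under assumptions: synthetic blobs stand in for the SimCLR features, and `n_init=10` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature matrix (assumption: not the real feats.npy)
X, _ = make_blobs(n_samples=1000, n_features=32, centers=10, random_state=0)

results = {}
for init in ("random", "k-means++"):
    km = KMeans(n_clusters=10, init=init, n_init=10, random_state=0)
    assignments = km.fit_predict(X)
    results[init] = {
        "inertia": km.inertia_,                            # within-cluster sum of squares
        "sizes": np.bincount(assignments, minlength=10),   # points per cluster
    }

for init, res in results.items():
    print(f"{init:10s} inertia={res['inertia']:.1f} sizes={res['sizes']}")
```

Comparing inertia and cluster sizes side by side gives a numerical complement to the t-SNE plots, which can be hard to read where clusters overlap.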


# K-means clustering when budget = 20
The budget (sample size) will not always equal the number of classes in the dataset.

Here is the clustered feature space when n=20: 

![k-means n=20](k-means20.png "K-means n=20")

There does not seem to be a clear correspondence between the clusters and their true labels. Some clusters, such as clusters 6 and 11, both appear to represent Automobiles, while others remain indiscernible or mix several labels.
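
How well a clustering lines up with the true labels can be quantified rather than judged by eye. The sketch below is a hedged illustration: it uses scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` on synthetic blobs, which stand in for the real features and labels, and the helper name `alignment_scores` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Synthetic stand-in for (feats, true labels); not the notebook's data
X, y = make_blobs(n_samples=1000, n_features=32, centers=10, random_state=0)

def alignment_scores(features, true_labels, n_clusters, seed=0):
    """Cluster with K-Means and score the agreement with the true labels."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    predicted = km.fit_predict(features)
    return {
        "ari": adjusted_rand_score(true_labels, predicted),
        "nmi": normalized_mutual_info_score(true_labels, predicted),
    }

# Compare the two cluster counts discussed in this notebook
scores = {k: alignment_scores(X, y, k) for k in (10, 20)}
for k, s in scores.items():
    print(f"n_clusters={k}: ARI={s['ari']:.3f}, NMI={s['nmi']:.3f}")
```

Both scores are at most 1, with higher values meaning closer agreement. Applied to the real features, they would show whether moving from 10 to 20 clusters improves or weakens the alignment with the CIFAR-10 labels.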




# Considerations and Final Observations 
By visualising and analysing the clustering used in TypiClust, we observe that K-Means appears to be the most representative of the labelled features. Comparing the labels plot with the K-Means plot reveals that larger clusters typically contain a balanced mix of different labels. A cluster that already contains a labelled sample may therefore still include examples of other labels, which could offer insights for more effective sampling strategies.

Furthermore, larger clusters tend to lie close together near the centre of the 2D space. However, sampling from large clusters may not always promote diversity. Notably, when the number of clusters is increased to 20, smaller clusters tend to form at the periphery, and these smaller clusters often contain only a single class.

It would be interesting to explore sampling from these smaller, more distant clusters, and to consider selecting from clusters that already contain labelled data.
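
One possible starting point for that exploration is to rank clusters by size and by how far their centroid lies from the centre of the data. The sketch below is only an assumption about how such a ranking could look: it works in the original feature space rather than the 2D t-SNE projection, uses synthetic blobs in place of the real features, and the helper name `rank_small_distant_clusters` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def rank_small_distant_clusters(features, n_clusters, seed=0):
    """Order clusters from smallest to largest, breaking ties by distance
    of the centroid from the global feature mean (farthest first)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    centre = features.mean(axis=0)
    distances = np.linalg.norm(km.cluster_centers_ - centre, axis=1)
    # np.lexsort uses the LAST key as the primary key: size ascending,
    # then negative distance ascending (i.e. distance descending)
    order = np.lexsort((-distances, sizes))
    return order, sizes, distances

# Synthetic stand-in for the feature matrix (assumption)
X, _ = make_blobs(n_samples=1000, n_features=32, centers=10, random_state=0)
order, sizes, distances = rank_small_distant_clusters(X, n_clusters=20)

for c in order[:5]:
    print(f"cluster {c}: size={sizes[c]}, centroid distance={distances[c]:.2f}")
```

Whether prioritising such clusters actually improves class balance would need to be tested against the standard TypiClust selection, for example by comparing per-class counts for the same budget.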