# Dataset
## The CIFAR-10 dataset
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

You can browse and download the dataset [here](https://www.cs.toronto.edu/~kriz/cifar.html).

## Loading Dataset Into Memory

I wrote my own `ImageDataLoader` class, designed to be extensible so that it can support more datasets in the future.
It loads the dataset from a directory path and returns four arrays: **train_X, train_y, test_X, test_y**.

Clustering is an unsupervised learning algorithm, so it does not use the labels.
They are returned anyway because `ImageDataLoader` is meant to be a general-purpose loader, not one built only for the image clustering problem.

In [2]:
from source.data_loader import ImageDataLoader
data_loader = ImageDataLoader()
train_X, train_y, test_X, test_y = data_loader.load_cifar10("./cifar-10-batches-py", num_batches=5)

100%|██████████| 5/5 [00:02<00:00,  2.02it/s]
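
As a rough idea of what such a loader does internally, the sketch below reads batch files in the format documented on the CIFAR-10 page: each file is a pickled dictionary whose `data` entry is a 10000x3072 `uint8` array and whose `labels` entry is a list of class indices. The function names `load_cifar10_batch` and `load_cifar10_train` are hypothetical, and this is not the actual `ImageDataLoader` code. The channel-first `(3, 32, 32)` layout is an assumption, chosen because the plotting cells in this notebook transpose images with `(1, 2, 0)`.

```python
import pickle
from pathlib import Path

import numpy as np


def load_cifar10_batch(batch_path):
    """Read one CIFAR-10 "python version" batch file.

    Returns images as a (N, 3, 32, 32) uint8 array and labels as a (N,) array.
    """
    with open(batch_path, "rb") as handle:
        # The files were pickled under Python 2, so the keys load as bytes.
        batch = pickle.load(handle, encoding="bytes")
    # Each row holds 3072 values: 1024 red, then 1024 green, then 1024 blue.
    images = np.asarray(batch[b"data"], dtype=np.uint8).reshape(-1, 3, 32, 32)
    labels = np.asarray(batch[b"labels"], dtype=np.int64)
    return images, labels


def load_cifar10_train(directory, num_batches=5):
    """Stack the training batches data_batch_1 ... data_batch_<num_batches>."""
    directory = Path(directory)
    parts = [
        load_cifar10_batch(directory / f"data_batch_{i}")
        for i in range(1, num_batches + 1)
    ]
    images = np.concatenate([images for images, _ in parts])
    labels = np.concatenate([labels for _, labels in parts])
    return images, labels
```

The test batch (`test_batch` in the same directory) can be read with `load_cifar10_batch` in the same way.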


## Clustering Using KMeans

The purpose of K-means is to **identify groups**, or clusters of data points in a multidimensional space. The number K in K-means is the number of clusters to create. Initial cluster means are usually chosen at random.

K-means is usually implemented as an **iterative procedure** in which each iteration involves two successive steps. The first step is to assign each of the data points to a cluster. The second step is to modify the cluster means so that they become the mean of all the points assigned to that cluster.

The **quality** of the current assignment is given by the **distortion measure**, which is the sum of squared distances between each data point and the centroid of the cluster it is assigned to.
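
To make the two steps and the distortion measure concrete, here is a minimal NumPy sketch. It is not the `source.kmeans` implementation used in this notebook; the function names are hypothetical, and the data is assumed to be a 2-D array of shape `(N, D)` (for CIFAR-10 that would mean flattening each image to 3072 values). With $r_{nk} = 1$ when point $x_n$ is assigned to cluster $k$ and $0$ otherwise, the distortion is $J = \sum_n \sum_k r_{nk} \lVert x_n - \mu_k \rVert^2$.

```python
import numpy as np


def assign_clusters(X, centroids):
    """Step 1: label each point with the index of its nearest centroid."""
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed without an (N, K, D) array.
    sq_dists = (
        (X ** 2).sum(axis=1, keepdims=True)
        - 2.0 * X @ centroids.T
        + (centroids ** 2).sum(axis=1)
    )
    return sq_dists.argmin(axis=1)


def update_centroids(X, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    X = np.asarray(X, dtype=float)
    new_centroids = np.array(centroids, dtype=float)
    for k in range(len(new_centroids)):
        members = X[labels == k]
        if len(members) > 0:  # an empty cluster keeps its previous centroid
            new_centroids[k] = members.mean(axis=0)
    return new_centroids


def distortion(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    X = np.asarray(X, dtype=float)
    return float(((X - np.asarray(centroids)[labels]) ** 2).sum())


# Toy usage: two well-separated 2-D blobs, random initial means taken from the data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centroids = X[rng.choice(len(X), size=2, replace=False)]
for _ in range(10):
    labels = assign_clusters(X, centroids)
    centroids = update_centroids(X, labels, centroids)
print(distortion(X, labels, centroids))
```

When the distortion is measured after both steps as above, it cannot increase from one iteration to the next, because each step only lowers it or leaves it unchanged.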

In [3]:
from source.kmeans import KMeans 
num_clusters = 3
model = KMeans(num_clusters=num_clusters)
model.fit(train_X)

Restart 1  Iteration 1
The Error of this iteration is  6454.0255124163195
The Distoration Measure score of this assignment is  736211338.4420099

Restart 1  Iteration 2
The Error of this iteration is  1029.543448790343
The Distoration Measure score of this assignment is  552010069.2635128

Restart 1  Iteration 3
The Error of this iteration is  247.8135654194893
The Distoration Measure score of this assignment is  564756659.3375537

Restart 1  Iteration 4
The Error of this iteration is  130.7571690482471
The Distoration Measure score of this assignment is  568177505.4971395

Restart 1  Iteration 5
The Error of this iteration is  95.38333940619506
The Distoration Measure score of this assignment is  570415171.739671

Restart 1  Iteration 6
The Error of this iteration is  82.25758515157054
The Distoration Measure score of this assignment is  571945091.8259251

Restart 1  Iteration 7
The Error of this iteration is  75.2805870753218
The Distoration Measure score of this assignment is  572958306.2255691

Restart 1  Iteration 8
The Error of this iteration is  74.28219705593563
The Distoration Measure score of this assignment is  573593609.759007

Restart 1  Iteration 9
The Error of this iteration is  67.52155304841564
The Distoration Measure score of this assignment is  574218554.9954886

Restart 1  Iteration 10
The Error of this iteration is  63.92872957034326
The Distoration Measure score of this assignment is  574679905.9101065

Restart 1  Iteration 11
The Error of this iteration is  63.953185279219184
The Distoration Measure score of this assignment is  575089722.0438623

Restart 1  Iteration 12
The Error of this iteration is  60.69715916691178
The Distoration Measure score of this assignment is  575428756.4959744

Restart 1  Iteration 13
The Error of this iteration is  60.93928674904088
The Distoration Measure score of this assignment is  575648129.1770102

Restart 1  Iteration 14
The Error of this iteration is  60.521922931804006
The Distoration Measure score of this assignment is  575854008.1501585

Restart 1  Iteration 15
The Error of this iteration is  57.85254334310766
The Distoration Measure score of this assignment is  576076691.2234118

Restart 1  Iteration 16
The Error of this iteration is  54.96209625287293
The Distoration Measure score of this assignment is  576234373.506193

Restart 1  Iteration 17
The Error of this iteration is  57.340631086531936
The Distoration Measure score of this assignment is  576327602.618876

Restart 1  Iteration 18
The Error of this iteration is  54.66892547137958
The Distoration Measure score of this assignment is  576538072.2470121

Restart 1  Iteration 19
The Error of this iteration is  55.60952137813569
The Distoration Measure score of this assignment is  576650213.0276229

Restart 1  Iteration 20
The Error of this iteration is  52.348851256114806
The Distoration Measure score of this assignment is  576771089.5563279

Restart 1  Iteration 21
The Error of this iteration is  47.453518328525526
The Distoration Measure score of this assignment is  576831528.3778876

Restart 1  Iteration 22
The Error of this iteration is  43.636016098501514
The Distoration Measure score of this assignment is  576799253.8492826

Restart 1  Iteration 23
The Error of this iteration is  47.23701725262645
The Distoration Measure score of this assignment is  576780060.3592252

Restart 1  Iteration 24
The Error of this iteration is  46.90725184974523
The Distoration Measure score of this assignment is  576733084.9296957

Restart 1  Iteration 25
The Error of this iteration is  43.33934868843201
The Distoration Measure score of this assignment is  576729914.9634187

Restart 1  Iteration 26
The Error of this iteration is  42.80627213024543
The Distoration Measure score of this assignment is  576651568.9514319

Restart 1  Iteration 27
The Error of this iteration is  40.35677325384008
The Distoration Measure score of this assignment is  576573077.6212704

Restart 1  Iteration 28
The Error of this iteration is  40.20034462749675
The Distoration Measure score of this assignment is  576453137.6178455

Restart 1  Iteration 29
The Error of this iteration is  37.08183665289788
The Distoration Measure score of this assignment is  576304516.0104527

Restart 1  Iteration 30
The Error of this iteration is  37.70824231641732
The Distoration Measure score of this assignment is  576194263.8313769

Restart 1  Iteration 31
The Error of this iteration is  36.410424747613234
The Distoration Measure score of this assignment is  576096567.0863119

Restart 1  Iteration 32
The Error of this iteration is  31.328286910927794
The Distoration Measure score of this assignment is  576027652.5480263

Restart 1  Iteration 33
The Error of this iteration is  30.809809330971667
The Distoration Measure score of this assignment is  575902301.4155818

Restart 1  Iteration 34
The Error of this iteration is  24.77441758338641
The Distoration Measure score of this assignment is  575849284.2785318

Restart 1  Iteration 35
The Error of this iteration is  25.29041973826195
The Distoration Measure score of this assignment is  575731036.8971183

Restart 1  Iteration 36
The Error of this iteration is  21.06402711936204
The Distoration Measure score of this assignment is  575629763.7545211

Restart 1  Iteration 37
The Error of this iteration is  16.399213719637444
The Distoration Measure score of this assignment is  575550831.8114974

Restart 1  Iteration 38
The Error of this iteration is  13.721820473123817
The Distoration Measure score of this assignment is  575515222.3400244

Restart 1  Iteration 39
The Error of this iteration is  12.767909272618558
The Distoration Measure score of this assignment is  575472698.0681964

Restart 1  Iteration 40
The Error of this iteration is  13.07495497837213
The Distoration Measure score of this assignment is  575377419.3134178

Restart 1  Iteration 41
The Error of this iteration is  9.874858955352073
The Distoration Measure score of this assignment is  575287214.9102302

Restart 1  Iteration 42
The Error of this iteration is  9.738305074891638
The Distoration Measure score of this assignment is  575224849.6233418

Restart 1  Iteration 43
The Error of this iteration is  8.960120967050237
The Distoration Measure score of this assignment is  575178888.1190381

Restart 1  Iteration 44
The Error of this iteration is  8.173015376408882
The Distoration Measure score of this assignment is  575102365.2569904

Restart 1  Iteration 45
The Error of this iteration is  8.533514635584401
The Distoration Measure score of this assignment is  575005927.3221633

Restart 1  Iteration 46
The Error of this iteration is  6.854462536401649
The Distoration Measure score of this assignment is  574932465.3994879

Restart 1  Iteration 47
The Error of this iteration is  6.0075820940446665
The Distoration Measure score of this assignment is  574888524.461606

Restart 1  Iteration 48
The Error of this iteration is  5.84155653993393
The Distoration Measure score of this assignment is  574866766.9356073

Restart 1  Iteration 49
The Error of this iteration is  4.681659378088499
The Distoration Measure score of this assignment is  574837213.5183212

Restart 1  Iteration 50
The Error of this iteration is  3.917024435538809
The Distoration Measure score of this assignment is  574823670.3829013

Restart 1  Iteration 51
The Error of this iteration is  2.9840497563393877
The Distoration Measure score of this assignment is  574789102.8781079

Restart 1  Iteration 52
The Error of this iteration is  3.0878491000675647
The Distoration Measure score of this assignment is  574766957.455729

Restart 1  Iteration 53
The Error of this iteration is  3.5972259517260756
The Distoration Measure score of this assignment is  574751231.9922777

Restart 1  Iteration 54
The Error of this iteration is  1.7715681864739476
The Distoration Measure score of this assignment is  574728497.0241306

Restart 1  Iteration 55
The Error of this iteration is  1.185521687664445
The Distoration Measure score of this assignment is  574725595.0330865

Restart 1  Iteration 56
The Error of this iteration is  1.0750262318812134
The Distoration Measure score of this assignment is  574711180.0724269

Restart 1  Iteration 57
The Error of this iteration is  0.9895736754500607
The Distoration Measure score of this assignment is  574700377.6950097

Restart 1  Iteration 58
The Error of this iteration is  0.9181766375235866
The Distoration Measure score of this assignment is  574689734.377918

Restart 1  Iteration 59
The Error of this iteration is  0.5386748517957548
The Distoration Measure score of this assignment is  574676084.2096045

Restart 1  Iteration 60
The Error of this iteration is  0.6919315451765322
The Distoration Measure score of this assignment is  574669150.7139432

Restart 1  Iteration 61
The Error of this iteration is  0.24604961058798924
The Distoration Measure score of this assignment is  574662753.9211885

Restart 1  Iteration 62
The Error of this iteration is  0.30866796627055654
The Distoration Measure score of this assignment is  574659316.5228906

100%|██████████| 1/1 [02:43<00:00, 163.69s/it]

Restart 1  Iteration 63
The Error of this iteration is  0.0
The Distoration Measure score of this assignment is  574655678.0976384
This Restart scored better than last one. Updating Attributes...





### Evaluating the fitted model

In [None]:
print(f"The model needed {model.iter_num_} iterations to converge.")
print(f"The model scored {model.distoration_measure_} for the distortion measure.")

### Visualizing the fitted model
Plotting the cluster centroids.

In [None]:
import matplotlib.pyplot as plt
for centroid in model.centroids_:
    centroid = centroid.transpose((1,2,0)).astype(int)
    plt.figure()
    plt.imshow(centroid)
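
The transpose above is needed because `plt.imshow` expects colour images with the channel axis last. A slightly more defensive conversion is sketched below; `to_displayable` is a hypothetical helper, and it assumes centroids have shape `(3, H, W)` with pixel values on the 0 to 255 scale.

```python
import numpy as np


def to_displayable(image_chw):
    """Convert a (3, H, W) array into a (H, W, 3) uint8 image for plt.imshow."""
    image_hwc = np.transpose(image_chw, (1, 2, 0))
    # Centroids are means, so they are floats: round, then clip to the valid
    # pixel range before casting so that no value wraps around.
    return np.clip(np.rint(image_hwc), 0, 255).astype(np.uint8)
```

It could replace the `transpose(...).astype(int)` expression in the loop above, for example `plt.imshow(to_displayable(centroid))`.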

### Visualizing the fitted model
Plotting one representative image from each cluster.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
# Index of the first training image assigned to each cluster label.
cluster_representatives_index = np.unique(model.cluster_labels_, return_index=True)[1]
for index in cluster_representatives_index:
    plt.figure()
    # Convert channel-first (3, 32, 32) to channel-last for imshow.
    plt.imshow(train_X[index].transpose((1,2,0)))

### Visualizing the fitted model
Plotting the distortion measure against the iteration number for the K-means run.

In [None]:
model.plot_measure_vs_iteration()

### Advanced Experiments: Re-running K-means with more K values: [3, 5, 10, 15]

In [None]:
models = []
for num_clusters_trial in [3,5,10,15]:
    model = KMeans(num_clusters= num_clusters_trial)
    model.fit(train_X)
    models.append(model)

#### Evaluating the experiment results: 
    Model 1 with K=3 has the minimum distortion measure score. 
    Also, the results differ widely as K changes.

In [None]:
k_values = [3, 5, 10, 15]
for index, (k, model) in enumerate(zip(k_values, models), start=1):
    print(f"****** Model {index}, K = {k} ******")
    print(f"The model needed {model.iter_num_} iterations to converge.")
    print(f"The model scored {model.distoration_measure_} for the distortion measure.")
    model.plot_measure_vs_iteration()
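
For a side-by-side comparison, the per-model numbers can be collected into one list. The helper below is a hypothetical sketch: it only assumes that each fitted model exposes the `iter_num_` and `distoration_measure_` attributes used above, and the stand-in objects carry made-up placeholder numbers, not results from this notebook. When reading the comparison, keep in mind that the lowest achievable distortion generally shrinks as K grows, so values for different K are not on an equal footing.

```python
from types import SimpleNamespace


def summarize_runs(models, k_values):
    """Return (K, iterations, distortion) rows and the K with the lowest distortion."""
    rows = [
        (k, model.iter_num_, model.distoration_measure_)
        for k, model in zip(k_values, models)
    ]
    best_k = min(rows, key=lambda row: row[2])[0]
    return rows, best_k


# Stand-in objects with placeholder numbers, only to show the call shape.
placeholder_models = [
    SimpleNamespace(iter_num_=4, distoration_measure_=9.0),
    SimpleNamespace(iter_num_=7, distoration_measure_=5.0),
]
rows, best_k = summarize_runs(placeholder_models, [2, 3])
for k, iterations, score in rows:
    print(f"K = {k}: {iterations} iterations, distortion {score}")
```

With the list built in the experiment above, the call would be `summarize_runs(models, [3, 5, 10, 15])`.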

### Advanced Experiments: Re-running K-means with 3 restarts
#### Using the best-performing model from above.

In [None]:
model = KMeans(num_clusters= 3, num_restarts = 3)
model.fit(train_X)
models.append(model)

#### Evaluating the experiment results: 
    Original run of K-means gave a distortion measure of 574652597.7368706
    First restart of K-means gave a distortion measure of 574559070.1893139
    Second restart of K-means gave a distortion measure of 574655678.0976384
    Third restart of K-means gave a distortion measure of 574655678.0976384
    
#### Notice that restarting does not change the distortion measure much. Only minor improvements are made.
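
The idea behind restarts can be sketched as follows: run K-means several times from different random initial means and keep the run with the lowest distortion. This is an illustrative NumPy version, not the `KMeans(num_restarts=...)` code from `source.kmeans`. The names `run_kmeans` and `best_of_restarts` are hypothetical, the input is assumed to be a 2-D `(N, D)` array, and stopping when the centroids no longer move is an assumption about what the logged "Error" measures.

```python
import numpy as np


def run_kmeans(X, num_clusters, rng, max_iters=300):
    """One K-means run from a random start: returns (centroids, labels, distortion)."""
    X = np.asarray(X, dtype=float)
    # Initial means: distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=num_clusters, replace=False)].copy()
    for _ in range(max_iters):
        # Simple (N, K, D) broadcast; fine for small data, memory-hungry for large.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its previous centroid
                new_centroids[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift == 0.0:  # centroids stopped moving: converged
            break
    distortion = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, distortion


def best_of_restarts(X, num_clusters, num_restarts, seed=0):
    """Run K-means `num_restarts` times and keep the lowest-distortion result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(num_restarts):
        result = run_kmeans(X, num_clusters, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best


# Toy usage on three 2-D blobs.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(c, 0.2, (15, 2)) for c in (0.0, 4.0, 8.0)])
centroids, labels, score = best_of_restarts(X, num_clusters=3, num_restarts=3)
print(score)
```

Because each restart differs only in its random initial means, runs that settle in the same local minimum report the same distortion, which is consistent with the small differences listed above.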