# C K-means MNIST
_4 points_

- What is the majority class of each cluster? 
- What is the percentage of the majority class in each cluster? 
- Does each number have a cluster?
- If not, which hasn't?

Do this for 10, 100, and 1000 iterations.

## Solution

In [25]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score
import seaborn as sns
import pandas as pd

In [5]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data() # loading dataset
# reshape dataset to reduce to 2D array
x_train_flat = x_train.reshape([x_train.shape[0],
                              x_train.shape[1] * x_train.shape[2]])
x_test_flat = x_test.reshape([x_test.shape[0],
                           x_test.shape[1] * x_test.shape[2]])

In [27]:
# create KMeans models; max_iter caps the number of iterations of each run
# note: the n_jobs argument was removed from KMeans in scikit-learn 1.0, so it is omitted here
kmeans_10_iter = KMeans(n_clusters=10, max_iter=10).fit(x_train_flat)
kmeans_100_iter = KMeans(n_clusters=10, max_iter=100).fit(x_train_flat)
kmeans_1000_iter = KMeans(n_clusters=10, max_iter=1000).fit(x_train_flat)

In [28]:
"""
KMeans only accepts a maximum number of iterations (max_iter);
a run stops earlier once it has converged.
n_iter_ shows how many iterations each run actually used.
"""
print(kmeans_10_iter.n_iter_)
print(kmeans_100_iter.n_iter_)
print(kmeans_1000_iter.n_iter_)

9
63
41
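
Why a run can stop well before `max_iter` is easier to see in a stripped-down version of the algorithm. The following is a minimal sketch of Lloyd's k-means on synthetic toy data, not scikit-learn's implementation: the function `lloyd_kmeans`, its `tol` handling, and the toy blobs are assumptions made for illustration only.

```python
import numpy as np

def lloyd_kmeans(points, init_centers, max_iter=100, tol=1e-4):
    """Minimal Lloyd's k-means; returns (centers, labels, n_iter)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the nearest center for every point.
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # converged: stop before max_iter is reached
            break
    return centers, labels, n_iter

# Two well-separated blobs (synthetic toy data, not MNIST).
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                    rng.normal(5.0, 0.1, size=(50, 2))])
init_centers = points[[0, 50]]  # one starting center per blob

for max_iter in (10, 100, 1000):
    _, _, n_iter = lloyd_kmeans(points, init_centers, max_iter=max_iter)
    print(max_iter, n_iter)
```

On data this easy, the loop exits after a handful of iterations regardless of the cap, which mirrors why the 1000-iteration run above did not need more iterations than the 100-iteration run.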


In [84]:
kmeans_10_scores = []
kmeans_100_scores = []
kmeans_1000_scores = []
for i in range(10):
    kmeans_10_scores.append([])
    kmeans_100_scores.append([])
    kmeans_1000_scores.append([])
#print(kmeans_10_scores)

#kmeans_10
count = 0
for value in kmeans_10_iter.labels_:
    kmeans_10_scores[value].append(y_train[count])
    count += 1

#kmeans_100
count = 0
for value in kmeans_100_iter.labels_:
    kmeans_100_scores[value].append(y_train[count])
    count += 1

#kmeans_1000
count = 0
for value in kmeans_1000_iter.labels_:
    kmeans_1000_scores[value].append(y_train[count])
    count += 1

In [83]:
df0 = pd.DataFrame(kmeans_10_scores)
#df0

In [89]:
kmeans_10_clusters = []
kmeans_100_clusters = []
kmeans_1000_clusters = []
for i in range(10):
    # 10 iterations
    local_array = np.bincount(kmeans_10_scores[i])
    kmeans_10_clusters.append(local_array)
    # 100 iterations
    local_array = np.bincount(kmeans_100_scores[i])
    kmeans_100_clusters.append(local_array)
    # 1000 iterations
    local_array = np.bincount(kmeans_1000_scores[i])
    kmeans_1000_clusters.append(local_array)
#print(kmeans_10_clusters)
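
The grouping and counting done in the cells above can also be written as one contingency table of cluster index against true label. This is a sketch under assumptions: the helper `cluster_label_counts` and the toy arrays are hypothetical stand-ins for `labels_` and `y_train`. Note that `argmax` returns the first maximum on a tie, whereas the notebook records `-1` in that case.

```python
import numpy as np

def cluster_label_counts(cluster_ids, true_labels, n_clusters=10, n_classes=10):
    """Return an (n_clusters, n_classes) table of label counts per cluster."""
    counts = np.zeros((n_clusters, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(counts, (cluster_ids, true_labels), 1)
    return counts

# Toy data (hypothetical, stands in for kmeans.labels_ and y_train).
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
true_labels = np.array([7, 7, 1, 3, 3, 0, 0, 0, 5])

counts = cluster_label_counts(cluster_ids, true_labels, n_clusters=3, n_classes=10)
majority = counts.argmax(axis=1)                  # most frequent label per cluster
share = counts.max(axis=1) / counts.sum(axis=1)   # fraction held by that label
print(majority)
print(share)
```

Each row of `counts` corresponds to one of the `np.bincount` arrays above, padded to a fixed length of `n_classes`. The division assumes that no cluster is empty.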

In [105]:
"""Compute the majority label of each cluster and its share of the cluster."""
print("Kmeans with 10 iterations\n")
count = 0
labels = np.arange(10)
majority_labels_10 = []
percentages_maj_cluster_10 = []
for array in kmeans_10_clusters:
    #print(max(array))
    maximum = np.where(array == max(array))[0]
    if len(maximum) == 1:
        print("Cluster: {} => majority label: {}".format(count, maximum[0]))    
        majority_labels_10.append(maximum[0])
        percentages_maj_cluster_10.append(max(array) / sum(array))
    else:
        majority_labels_10.append(-1)
        percentages_maj_cluster_10.append(-1)
    count += 1
differences_10_iter = np.where(np.isin(labels, majority_labels_10) == False)[0]
print("The following numbers have no own cluster: {}".format(differences_10_iter))
print(percentages_maj_cluster_10)

Kmeans with 10 iterations

Cluster: 0 => majority label: 8
Cluster: 1 => majority label: 4
Cluster: 2 => majority label: 2
Cluster: 3 => majority label: 3
Cluster: 4 => majority label: 6
Cluster: 5 => majority label: 0
Cluster: 6 => majority label: 1
Cluster: 7 => majority label: 7
Cluster: 8 => majority label: 1
Cluster: 9 => majority label: 0
The following numbers have no own cluster: [5 9]
[0.528322440087146, 0.35785423268447264, 0.8914613423959218, 0.5190463850738534, 0.8557491289198607, 0.8032115869017632, 0.569113263785395, 0.4270774766670415, 0.5975049244911359, 0.9046993098915543]


In [106]:
"""Compute the majority label of each cluster and its share of the cluster."""
print("Kmeans with 100 iterations\n")
count = 0
labels = np.arange(10)
majority_labels_100 = []
percentages_maj_cluster_100 = []
for array in kmeans_100_clusters:
    #print(max(array))
    maximum = np.where(array == max(array))[0]
    if len(maximum) == 1:
        print("Cluster: {} => majority label: {}".format(count, maximum[0]))    
        majority_labels_100.append(maximum[0])
        percentages_maj_cluster_100.append(max(array) / sum(array))
    else:
        majority_labels_100.append(-1)
        percentages_maj_cluster_100.append(-1)
    count += 1
differences_100_iter = np.where(np.isin(labels, majority_labels_100) == False)[0]
print("The following numbers have no own cluster: {}".format(differences_100_iter))
print(percentages_maj_cluster_100)

Kmeans with 100 iterations

Cluster: 0 => majority label: 0
Cluster: 1 => majority label: 3
Cluster: 2 => majority label: 4
Cluster: 3 => majority label: 1
Cluster: 4 => majority label: 1
Cluster: 5 => majority label: 6
Cluster: 6 => majority label: 2
Cluster: 7 => majority label: 0
Cluster: 8 => majority label: 8
Cluster: 9 => majority label: 7
The following numbers have no own cluster: [5 9]
[0.787267570122912, 0.526562919237993, 0.3569509738079248, 0.6230305062018102, 0.5323676680972819, 0.8589160839160839, 0.8960068332265642, 0.9066537467700259, 0.5295740116457248, 0.42603884372177053]


In [107]:
"""Compute the majority label of each cluster and its share of the cluster."""
print("Kmeans with 1000 iterations\n")
count = 0
labels = np.arange(10)
majority_labels_1000 = []
percentages_maj_cluster_1000 = []
for array in kmeans_1000_clusters:
    #print(max(array))
    maximum = np.where(array == max(array))[0]
    if len(maximum) == 1:
        print("Cluster: {} => majority label: {}".format(count, maximum[0]))    
        majority_labels_1000.append(maximum[0])
        percentages_maj_cluster_1000.append(max(array) / sum(array))
    else:
        majority_labels_1000.append(-1)
        percentages_maj_cluster_1000.append(-1)
    count += 1
differences_1000_iter = np.where(np.isin(labels, majority_labels_1000) == False)[0]
print("The following numbers have no own cluster: {}".format(differences_1000_iter))
print(percentages_maj_cluster_1000)

Kmeans with 1000 iterations

Cluster: 0 => majority label: 8
Cluster: 1 => majority label: 1
Cluster: 2 => majority label: 7
Cluster: 3 => majority label: 4
Cluster: 4 => majority label: 6
Cluster: 5 => majority label: 0
Cluster: 6 => majority label: 2
Cluster: 7 => majority label: 1
Cluster: 8 => majority label: 0
Cluster: 9 => majority label: 3
The following numbers have no own cluster: [5 9]
[0.5283682520263037, 0.6253994953742641, 0.42771357353274414, 0.35735904046631545, 0.8592411260709915, 0.9048693969687198, 0.8971122994652406, 0.5278368794326241, 0.7872005044136192, 0.5264004288394533]


In [117]:
combined_percentages = [percentages_maj_cluster_10, percentages_maj_cluster_100, percentages_maj_cluster_1000]
df1 = pd.DataFrame(np.array(combined_percentages).T, columns=["10", "100", "1000"])
np.array(combined_percentages).T
df1

| Cluster | 10 | 100 | 1000 |
|---|---|---|---|
| 0 | 0.528322 | 0.787268 | 0.528368 |
| 1 | 0.357854 | 0.526563 | 0.625399 |
| 2 | 0.891461 | 0.356951 | 0.427714 |
| 3 | 0.519046 | 0.623031 | 0.357359 |
| 4 | 0.855749 | 0.532368 | 0.859241 |
| 5 | 0.803212 | 0.858916 | 0.904869 |
| 6 | 0.569113 | 0.896007 | 0.897112 |
| 7 | 0.427077 | 0.906654 | 0.527837 |
| 8 | 0.597505 | 0.529574 | 0.787201 |
| 9 | 0.904699 | 0.426039 | 0.5264 |
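
Because cluster indices are assigned arbitrarily in each run, the rows of the table above do not line up across columns. One rough way to compare runs anyway is to sort each column before comparing. The sketch below does that with the values copied from the table; the helper `max_sorted_gap` is hypothetical, and similar sorted shares are only a hint that two runs found similar clusters, not a proof.

```python
# Majority-class shares copied from the table above (rounded to 6 decimals).
shares = {
    10:   [0.528322, 0.357854, 0.891461, 0.519046, 0.855749,
           0.803212, 0.569113, 0.427077, 0.597505, 0.904699],
    100:  [0.787268, 0.526563, 0.356951, 0.623031, 0.532368,
           0.858916, 0.896007, 0.906654, 0.529574, 0.426039],
    1000: [0.528368, 0.625399, 0.427714, 0.357359, 0.859241,
           0.904869, 0.897112, 0.527837, 0.787201, 0.526400],
}

def max_sorted_gap(a, b):
    """Largest difference between two lists after sorting each of them."""
    return max(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

gap_100_1000 = max_sorted_gap(shares[100], shares[1000])
gap_10_1000 = max_sorted_gap(shares[10], shares[1000])
print(gap_100_1000, gap_10_1000)
```

With these values, the 100- and 1000-iteration columns differ by less than 0.005 once sorted, while the 10-iteration column deviates from the 1000-iteration column by about 0.04 in its worst entry.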


# Answer

## Majority class of each cluster

(see also the output of the runs above)

#### kmeans with 10 iterations (actual iterations = 9)
Cluster 0 -> 8;
Cluster 1 -> 4;
Cluster 2 -> 2;
Cluster 3 -> 3;
Cluster 4 -> 6;
Cluster 5 -> 0;
Cluster 6 -> 1;
Cluster 7 -> 7;
Cluster 8 -> 1;
Cluster 9 -> 0 <br><br>

#### kmeans with 100 iterations (actual iterations = 63)
Cluster 0 -> 0;
Cluster 1 -> 3;
Cluster 2 -> 4;
Cluster 3 -> 1;
Cluster 4 -> 1;
Cluster 5 -> 6;
Cluster 6 -> 2;
Cluster 7 -> 0;
Cluster 8 -> 8;
Cluster 9 -> 7 <br><br>

#### kmeans with 1000 iterations (actual iterations = 41)
Cluster 0 -> 8;
Cluster 1 -> 1;
Cluster 2 -> 7;
Cluster 3 -> 4;
Cluster 4 -> 6;
Cluster 5 -> 0;
Cluster 6 -> 2;
Cluster 7 -> 1;
Cluster 8 -> 0;
Cluster 9 -> 3 <br><br>

## Percentage of majority of each cluster

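(see the table above; cluster indices are assigned arbitrarily in each run, so the rows are not comparable across columns)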


## Do all numbers have a cluster?
Assumption: a number "has its own cluster" if it is the majority class in at least one cluster. Ideally, each number would be the majority class in exactly one cluster.

#### kmeans with 10 iterations
No, the numbers 5 and 9 have no cluster of their own. <br>
The numbers 0 and 1 each have two clusters.

#### kmeans with 100 iterations
No, the numbers 5 and 9 have no cluster of their own. <br>
The numbers 0 and 1 each have two clusters.

#### kmeans with 1000 iterations
No, the numbers 5 and 9 have no cluster of their own. <br>
The numbers 0 and 1 each have two clusters.
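
The check behind these answers can be reproduced from the majority labels alone. The sketch below uses the labels printed by the three runs above; the helper `cluster_coverage` is a hypothetical name, and it skips `-1` entries, which the notebook uses for tied clusters.

```python
from collections import Counter

def cluster_coverage(majority_labels, n_classes=10):
    """Return (digits with no cluster, digits that are the majority in several clusters)."""
    counts = Counter(label for label in majority_labels if label != -1)
    missing = [d for d in range(n_classes) if counts[d] == 0]
    repeated = {d: c for d, c in sorted(counts.items()) if c > 1}
    return missing, repeated

# Majority labels per cluster, copied from the output of the three runs.
majority_by_run = {
    10:   [8, 4, 2, 3, 6, 0, 1, 7, 1, 0],
    100:  [0, 3, 4, 1, 1, 6, 2, 0, 8, 7],
    1000: [8, 1, 7, 4, 6, 0, 2, 1, 0, 3],
}

for max_iter, labels in majority_by_run.items():
    missing, repeated = cluster_coverage(labels)
    print(max_iter, missing, repeated)
```

For all three runs this reports `[5, 9]` as missing and `{0: 2, 1: 2}` as repeated.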