<img src='https://drive.google.com/uc?id=1tqYIvII8lJ_FnqE6ugS21n4s93kMwTLy' />

## Machine Learning
## School of Computing and Engineering, University of West London
## Massoud Zolgharni


# Tutorial: K-Means algorithm

Clustering algorithms seek to learn, from the properties of the data, an optimal division or discrete labeling of groups of points.

Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as *k-means clustering*, which is implemented in ``sklearn.cluster.KMeans``.

We begin with the standard imports:

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
Foldername = '/content/gdrive/My Drive/Colab Notebooks/UWL/ML_L6/'
#import sys
#sys.path.append('/content/gdrive/My Drive/Colab Notebooks')
#%cd /content/gdrive/My Drive/Colab Notebooks
#!ls

Next, we generate or load a two-dimensional dataset containing several distinct blobs.
To emphasise that this is an unsupervised algorithm, we leave the labels out of the visualization.

In [None]:
plt.rcParams['figure.figsize'] = (16, 9)

# Creating a sample dataset with 6 clusters (kept for reference; the saved file is loaded below)
# from sklearn.datasets import make_blobs
# X, y = make_blobs(n_samples=1000, n_features=2, centers=6)
# f = open("data_kclusters.txt", 'w')
# for p in list(range(0,len(X))):
#    f.write('%.2f\t %.2f\n' % (X[p,0], X[p,1]))
# f.close()

# Load the saved 2-D points (one point per row, two columns)
X = np.loadtxt(Foldername + "data_6clusters.txt")
plt.scatter(X[:, 0], X[:, 1])
plt.show()
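
If the data file is not at hand, a comparable dataset can be generated directly. The following is a minimal sketch based on the commented-out `make_blobs` call in the cell above. The names `X_demo` and `y_demo` and the `random_state=0` seed are assumptions added here for illustration, so the points will not match `data_6clusters.txt` exactly.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumption: 1000 points around 6 centres, mirroring the commented-out call;
# random_state is fixed only so that repeated runs give the same points.
X_demo, y_demo = make_blobs(n_samples=1000, n_features=2, centers=6, random_state=0)

print(X_demo.shape)        # (n_samples, n_features)
print(np.unique(y_demo))   # ground-truth blob ids, which k-means never sees
```

Using `X_demo` in place of `X` lets the rest of the tutorial run without mounting Google Drive.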

By eye, it is relatively easy to pick out the clusters.
The *k*-means algorithm does this automatically, and in Scikit-Learn it follows the usual estimator API: create the estimator, call `fit`, then call `predict`.

Let us start with an arbitrary value of K:

In [None]:
# Initializing KMeans
kmeans = KMeans(n_clusters=2)
# Fitting with inputs
kmeans = kmeans.fit(X)
# Predicting the clusters
labels = kmeans.predict(X)
# Getting the cluster centers
C = kmeans.cluster_centers_

Let's visualise the results by plotting the data colored by these labels.
We will also plot the cluster centers as determined by the *k*-means estimator:

In [None]:
plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(C[:, 0], C[:, 1],  marker='*', c='#050505', s=1000)
plt.show()
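
To make the fitted result less of a black box, here is a minimal NumPy sketch of the standard Lloyd iteration that k-means is based on: assign each point to its nearest centre, move each centre to the mean of its points, and repeat until the centres stop moving. It is a teaching sketch, not Scikit-Learn's implementation (which, for instance, uses k-means++ initialisation by default). The function names `nearest_centre` and `lloyd_kmeans`, the toy data and the seeds are all hypothetical choices made for this example.

```python
import numpy as np

def nearest_centre(points, centres):
    """Index of the closest centre for every point (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration with random data points as initial centres."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centre
        assign = nearest_centre(points, centres)
        # Update step: move each centre to the mean of its points
        # (an empty cluster keeps its previous centre)
        new_centres = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Recompute the assignment so labels and centres are consistent
    return centres, nearest_centre(points, centres)

# Toy data (assumption): two well-separated groups of 50 points each
rng = np.random.default_rng(1)
demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(5.0, 0.5, size=(50, 2))])

centres, assign = lloyd_kmeans(demo, k=2)
print(np.round(centres, 2))
print(np.bincount(assign))
```

Because the initial centres are picked at random, a different seed can give a different labelling order, and on harder data a different local optimum.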

# Elbow method

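The elbow method is a popular heuristic for choosing the number of clusters K. We fit the model for a range of K values, record the within-cluster sum of squares (WCSS, exposed by Scikit-Learn as `inertia_`), and look for the "elbow" where adding more clusters stops reducing the WCSS substantially: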

In [None]:
WCSS = []
K = range(2,10)
for this_k in K:
    kmeanModel = KMeans(n_clusters = this_k)
    kmeanModel.fit(X)
    WCSS.append(kmeanModel.inertia_)

# Plot the elbow
plt.figure()
plt.plot(K, WCSS, 'bx-')
plt.xlabel('k')
plt.ylabel('WCSS')
plt.title('The Elbow Method showing the optimal k')
plt.show()
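
The `inertia_` value collected in the loop above is the within-cluster sum of squares: the sum, over all points, of the squared distance to the centre of the assigned cluster. The sketch below checks this by hand. The small synthetic dataset, the seed, `n_init=10` and the variable names are assumptions chosen to keep the example self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: a small synthetic dataset, used only for this check
X_demo, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from each point to the centre of its assigned cluster
diff = X_demo - km.cluster_centers_[km.labels_]
wcss_manual = (diff ** 2).sum()

print(wcss_manual, km.inertia_)  # the two values should agree up to rounding
```

Since adding clusters tends to lower this sum, the curve keeps falling as K grows, which is why the elbow rather than the minimum is used to choose K.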

In [None]:
# Initializing KMeans
K_elbow = .... # replace .... with the value of K you read off the elbow plot
kmeans = KMeans(n_clusters = K_elbow)
# Fitting with inputs
kmeans = kmeans.fit(X)
# Predicting the clusters
labels = kmeans.predict(X)
# Getting the cluster centers
C = kmeans.cluster_centers_

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(C[:, 0], C[:, 1],  marker='*', c='#050505', s=1000)
plt.show()