# Portfolio 3 - Clustering Visualisation

K-means clustering is one of the simplest and most popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known, or labelled, outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as the randomly initialised cluster means are updated.

We first generate random data around 4 centers.

In [None]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])

# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4

data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)

plt.scatter(data[:,0], data[:,1], s=7, c='k', alpha = 0.5)
plt.show()

## 1. Generate random cluster centres

You need to generate four random centres.

This part of the portfolio should contain at least:  
- The number of clusters `k` is set to 4;
- Generate random centres via `centres = np.random.randn(k,c)*std + mean` where `std` and `mean` are the standard deviation and mean of the data. `c` represents the number of features in the data. Set the random seed to 6.
- Color the generated centers with `green`, `blue`, `yellow`, and `cyan`. Set the edgecolors to `red`.

In [None]:
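k = 4                 # number of clusters
c = data.shape[1]     # number of features in the data
std = np.std(data)    # standard deviation over all values in the data
mean = np.mean(data)  # mean over all values in the data

In [None]:
np.random.seed(6)
centres = np.random.randn(k,c)*std + mean
centres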

In [None]:
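# Plot all data points in one call, then the centres in the required colours
colours = ['green', 'blue', 'yellow', 'cyan']
plt.scatter(data[:,0], data[:,1], c = 'k', alpha = 0.7, s = 9)
plt.scatter(centres[:,0], centres[:,1], c = colours, edgecolors = 'r', s = 100)
plt.show()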

## 2. Visualise the clustering results in each iteration

### *Explanation of K-Means Algorithm* ###

The k-means algorithm comprises 3 functions: **euclidean**, **\_4means\_clustering** and **centre_updates**.

#### *Euclidean* ####

This function takes two arguments, *datapoints* and *centre*. *datapoints* should be an array of points on the 2-d plane in (x,y) format, while *centre* should be a single point. It calculates the Euclidean distance between each datapoint and the centre, which is given by the formula:

<center>$ D = \sqrt{(x-a)^2 + (y-b)^2} $ </center>

where $ (x,y) $ are the coordinates of the centre, while the point $ P $ is located at $ (a,b) $. The function returns *distlist*, a list containing the Euclidean distance of each datapoint from the centre, rounded to 5 decimal places.
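
As an illustration of the formula, the same distances can be computed without an explicit loop. This is only a sketch that assumes NumPy broadcasting; the names `euclidean_vectorised`, `points` and `origin` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

def euclidean_vectorised(points, centre):
    """Distance of each row of `points` from `centre` (hypothetical helper)."""
    # Broadcasting subtracts the centre from every row;
    # the norm is then taken along each row (axis=1).
    return np.linalg.norm(points - centre, axis=1)

# Small hand-checkable example: 3-4-5 and 6-8-10 right triangles
points = np.array([[3.0, 4.0], [0.0, 0.0], [6.0, 8.0]])
origin = np.array([0.0, 0.0])
distances = euclidean_vectorised(points, origin)
print(distances)  # distances of 5, 0 and 10
```

Unlike *euclidean*, this sketch returns a NumPy array and does not round the result.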

#### *\_4means\_clustering* ####

This function is used specifically for this case of k-means clustering with $ k = 4 $. It takes the parameters *datapoints* (the same as the *datapoints* parameter of euclidean) and *centrepoints*, a numpy array containing the coordinates of the 4 centres. It uses the euclidean function to create 4 numpy arrays, each containing the distance of every datapoint from one of the 4 centres.

It then compares the 4 arrays index by index (i.e. the first element across all arrays, the second element across all arrays, and so on) and records the minimum distance for each datapoint.

Next, it builds 4 boolean masks, which indicate whether each point is closest to centre 1, centre 2, centre 3 or centre 4. A mask holds True for the centre the point is closest to, and False otherwise.

Points closest to centre 1 are then placed in cluster 1, and so on.

We now have 4 collections, representing the 4 clusters, each containing the coordinates of all points that fall into that cluster. We store these as numpy arrays: *first*, *second*, *third* and *fourth*.
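
The assignment step described above can also be sketched for any number of centres using `argmin`. This is a hedged illustration rather than the notebook's own function: `assign_clusters`, `labels` and the toy arrays are hypothetical names introduced only for this example.

```python
import numpy as np

def assign_clusters(points, centres):
    """Return, for each point, the index of its nearest centre (hypothetical helper)."""
    # distances[i, j] is the Euclidean distance from point i to centre j
    distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    # argmin picks the nearest centre; a tie goes to the lowest index
    return distances.argmin(axis=1)

# Toy data: two points near (0, 0) and two points near (10, 10)
toy_points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [9.0, 9.0]])
toy_centres = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(toy_points, toy_centres)
# One array of member points per cluster, selected with a boolean mask
clusters = [toy_points[labels == j] for j in range(len(toy_centres))]
```

Storing one integer label per point avoids the need for a separate boolean mask per centre, at the cost of being less explicit than the step-by-step version.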

We then plot these points according to the color code:

| Cluster | Color |
| --- | --- | 
| 1 | Green |
| 2 | Blue |
| 3 | Yellow |
| 4 | Cyan |


#### *centre_updates* ####

This function is responsible for updating the centres and plotting the result. From the second iteration onwards\*, it plots the data and the centres as well as updating the centres. It also appends each updated set of centres to the list *centre_history*.

Centres are updated by taking the mean of all points in each cluster. The four means are collected into a global numpy array, *newcentres*, which stores the new centres. The function then calls \_4means\_clustering with the new centres, which reassigns the datapoints and plots them, and finally plots the centres on top.
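
The update step can be sketched on its own as follows. This is a minimal illustration, not the notebook's implementation: `update_centres` is a hypothetical helper, and keeping the old centre when a cluster is empty is an assumption made here (the *centre_updates* function does not handle empty clusters).

```python
import numpy as np

def update_centres(points, labels, old_centres):
    """Mean of each cluster; an empty cluster keeps its old centre (an assumption)."""
    new_centres = old_centres.astype(float).copy()
    for j in range(len(old_centres)):
        members = points[labels == j]
        if len(members) > 0:
            # Column-wise mean gives the new (x, y) position of centre j
            new_centres[j] = members.mean(axis=0)
    return new_centres

# Toy data: cluster 0 has two points, cluster 1 has one, cluster 2 is empty
toy_points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
toy_labels = np.array([0, 0, 1])
toy_old = np.array([[1.0, 1.0], [9.0, 9.0], [-5.0, -5.0]])

toy_new = update_centres(toy_points, toy_labels, toy_old)
# Centre 0 moves to (1, 0), centre 1 to (10, 10), centre 2 stays at (-5, -5)
```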


<small> \**Note that in the first iteration our centres were already defined, and hence there was no need to update them.*</small>

In [None]:
centre_history = []

In [None]:
def euclidean (datapoints, centre):
    ##Calculate Euclidean distance between all points in data and a particular centre
    distlist = []
    for point in datapoints:
        distlist.append(round(np.linalg.norm(point-centre), 5))
    return distlist

In [None]:
def _4means_clustering(datapoints, centrepoints):
    ##Distance of every datapoint from each of the 4 centres
    dist1 = np.array(euclidean(datapoints, centrepoints[0]))
    dist2 = np.array(euclidean(datapoints, centrepoints[1]))
    dist3 = np.array(euclidean(datapoints, centrepoints[2]))
    dist4 = np.array(euclidean(datapoints, centrepoints[3]))

    ##Minimum distance for each datapoint, compared index by index
    z = np.minimum.reduce([dist1, dist2, dist3, dist4])

    ##Boolean masks: True where the point is closest to that centre
    ##A tie goes to the lowest-numbered centre, as in an if/elif chain
    in_1 = dist1 == z
    in_2 = (dist2 == z) & ~in_1
    in_3 = (dist3 == z) & ~(in_1 | in_2)
    in_4 = ~(in_1 | in_2 | in_3)

    global first, second, third, fourth ##Create global variables indicating first, second, third, fourth clusters
    first = datapoints[in_1]
    second = datapoints[in_2]
    third = datapoints[in_3]
    fourth = datapoints[in_4]

    ##Plot clusters as per color code
    plt.scatter(first[:,0], first[:,1], c = 'green')
    plt.scatter(second[:,0], second[:,1], c = 'blue')
    plt.scatter(third[:,0], third[:,1], c = 'yellow')
    plt.scatter(fourth[:,0], fourth[:,1], c = 'cyan')

In [None]:
def centre_updates():
    global newcentres ##New centres
    c1 = np.array([first[:,0].mean(), first[:,1].mean()])
    c2 = np.array([second[:,0].mean(), second[:,1].mean()])
    c3 = np.array([third[:,0].mean(), third[:,1].mean()])
    c4 = np.array([fourth[:,0].mean(), fourth[:,1].mean()])
    centrelist = [c1,c2,c3,c4]
    
    
    newcentres = np.array(centrelist)
    
    _4means_clustering(data, newcentres)
    
    ##Plot centers as per color code
    plt.scatter(newcentres[0][0], newcentres[0][1], c = 'green', edgecolors='r', s = 70)
    plt.scatter(newcentres[1][0], newcentres[1][1], c = 'blue', edgecolors='r', s = 70)
    plt.scatter(newcentres[2][0], newcentres[2][1], c = 'yellow', edgecolors='r', s = 70)
    plt.scatter(newcentres[3][0], newcentres[3][1], c = 'cyan', edgecolors='r', s = 70)
    centre_history.append(newcentres)

In [None]:
fig = plt.figure(figsize=(50,10))
fig.add_subplot(1,10,1)
plt.scatter(data_1[:, 0], data_1[:,1])
fig.add_subplot(2,10,1)
plt.scatter(10,5)
fig.add_subplot(3,10,1)

In [None]:
_4means_clustering(data, centres)
centre_updates()

In [None]:
centre_updates()

In [None]:
centre_updates()

In [None]:
centre_updates()

In [None]:
centre_updates()

In [None]:
for centre in centre_history:
    print(centre)
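
The stored centres can also be used to check whether the algorithm has stopped moving them between iterations. The sketch below assumes a list of centre arrays shaped like *centre_history*; `has_converged`, `tol` and `example_history` are hypothetical names, and the tolerance value is an arbitrary choice.

```python
import numpy as np

def has_converged(history, tol=1e-4):
    """True if the last two sets of centres differ by at most `tol` (hypothetical helper)."""
    if len(history) < 2:
        # Fewer than two snapshots: nothing to compare yet
        return False
    return bool(np.allclose(history[-1], history[-2], atol=tol, rtol=0))

# Made-up history: the centres move once and then stay put
example_history = [
    np.array([[0.0, 0.0], [5.0, 5.0]]),
    np.array([[0.5, 0.2], [5.1, 4.9]]),
    np.array([[0.5, 0.2], [5.1, 4.9]]),
]

converged_after_two = has_converged(example_history[:2])
converged_after_three = has_converged(example_history)
```

A check of this kind could replace the fixed number of *centre_updates* calls with a loop that stops once the centres no longer change.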