# Investigating K-Means and Gaussian Mixture Models

_Motivation: Expectation-Maximization-Gaussian-Mixtures/EM-for-gmm.ipynb_

## Expectation-Maximisation for Gaussian Mixture Models: optimisation

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
import copy
from scipy.stats import multivariate_normal

import os
import clusSupport as cs

dataDir = "data"
# Make sure the outputDir subdirectory exists (no error if it already does)
outputDir = "output/Practical_C_DigitsData"
os.makedirs(outputDir, exist_ok=True)

%matplotlib inline

plot_kwds = {'alpha':0.5, 's':25, 'linewidths':0}

In [None]:
# Model parameters
init_means = [
    [5, 0], # mean of cluster 1
    [1, 1], # mean of cluster 2
    [0, 5]  # mean of cluster 3
]
init_covariances = [
    [[.5, 0.], [0, .5]], # covariance of cluster 1
    [[.92, .38], [.38, .91]], # covariance of cluster 2
    [[.5, 0.], [0, .5]]  # covariance of cluster 3
]
init_weights = [1/4., 1/2., 1/4.]  # weights of each cluster

# Generate data
np.random.seed(4)
X = cs.generate_MoG_data(100, init_means, init_covariances, init_weights)
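The internals of `cs.generate_MoG_data` are not shown here. As a rough sketch of how data can be drawn from a mixture of Gaussians, the hypothetical helper `sample_mog` below (an assumption, not the `clusSupport` implementation) first picks a component for each point according to the weights and then samples from that component's Gaussian.

```python
import numpy as np

def sample_mog(n, means, covs, weights, seed=0):
    """Draw n points from a mixture of Gaussians (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Pick a component index for every point according to the mixture weights
    components = rng.choice(len(weights), size=n, p=weights)
    # Draw each point from the Gaussian of its chosen component
    points = np.array([
        rng.multivariate_normal(means[k], covs[k]) for k in components
    ])
    return points, components

# Same parameters as the cell above
means = [[5, 0], [1, 1], [0, 5]]
covs = [[[.5, 0.], [0., .5]], [[.92, .38], [.38, .91]], [[.5, 0.], [0., .5]]]
weights = [0.25, 0.5, 0.25]
points, components = sample_mog(100, means, covs, weights)
print(points.shape)
```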

In [None]:
# Stack the generated points into an (n_samples, 2) array
data = np.asarray(X)[:, :2]

plt.figure()
plt.scatter(data.T[0], data.T[1], c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)
plt.savefig(outputDir + '/threeEllipseBlobs.pdf')

#plt.figure()
#d = np.vstack(data)
#plt.plot(d[:,0], d[:,1],'ko')
#plt.rcParams.update({'font.size':16})
#plt.tight_layout()

We can probably pick out the 3 clusters by eye, although there also appear to be some discordant points so the cluster centres are not obvious.

In [None]:
np.random.seed(4)

# Initialization of parameters

# random choice of 3 indices
chosen = np.random.choice(len(data), 3, replace=False)
# randomly pick 3 of the data points as initial centres
initial_means = [data[x] for x in chosen]
# Get 3 copies of the covariance of the overall data
initial_covs = [np.cov(data, rowvar=0)] * 3
# Get 3 copies of 1/3
initial_weights = [1/3.] * 3 

In [None]:
#from scipy.stats import multivariate_normal
#import numpy as np
#import matplotlib.pyplot as plt
#x = np.linspace(0, 5, 10, endpoint=False)
#y = multivariate_normal.pdf(x, mean=2, cov=0.5)
#plt.plot(x, y)

In [None]:
#from scipy.stats import multivariate_normal
#var = multivariate_normal(mean=[0,0], cov=[[1,0],[0,1]])
#Z = var.pdf([1,0])
#var

In [None]:
# Parameters after initialization
paletteName = 'deep'
fontSize = 10
cs.plot_contours(data, initial_means, initial_covs, 'Initial clusters', paletteName, fontSize)
plt.savefig(outputDir + '/threeEllipseBlobsInitialClusters.pdf')

As you can see, the initial cluster placements are not particularly good, because two of the initial centres belong to one apparent cluster and a whole set of points lies outside the normal range of their nearest centre.

In [None]:
# Parameters after running EM to convergence
results = cs.EM(data, initial_means, initial_covs, initial_weights)
finalWeights = results['weights']
finalMeans = results['means']
finalCovariances = results['covs']
print(init_weights)
print(finalWeights)
print(init_means)
print(finalMeans)
print(init_covariances)
print(finalCovariances)

loglikelihoods = results['loglik']
cs.plot_contours(data, results['means'], results['covs'], 'Final clusters', paletteName, fontSize)
plt.savefig(outputDir + '/threeEllipseBlobsFinalClusters.pdf')

As can be seen, the EM algorithm converges after 22 iterations. At that point, the movement of the centres is negligible and the stopping criterion ensures that the algorithm terminates. We can also see from the plot that the EM algorithm has placed the Gaussians so they are centred on the clusters and the covariance contours indicate how the distributions are aligned with the local data in the cluster.
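The body of `cs.EM` is not reproduced in this notebook. The following is a minimal sketch of one EM iteration for a Gaussian mixture, assuming the textbook E-step (responsibilities) and M-step (weighted re-estimation). The function name `em_step` and the toy data are illustrative assumptions rather than the `clusSupport` code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(data, means, covs, weights):
    """One EM iteration for a Gaussian mixture (illustrative sketch)."""
    n, k = len(data), len(weights)
    # E-step: responsibility of component j for point i
    resp = np.zeros((n, k))
    for j in range(k):
        resp[:, j] = weights[j] * multivariate_normal.pdf(
            data, mean=means[j], cov=covs[j])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means and covariances from the responsibilities
    counts = resp.sum(axis=0)
    new_weights = counts / n
    new_means = (resp.T @ data) / counts[:, None]
    new_covs = []
    for j in range(k):
        diff = data - new_means[j]
        new_covs.append((resp[:, j, None] * diff).T @ diff / counts[j])
    return new_means, new_covs, new_weights, resp

# Toy data: two well-separated blobs (made up for this example)
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
means = [toy[0], toy[-1]]                    # one starting point from each blob
covs = [np.cov(toy, rowvar=False)] * 2       # overall covariance, as in the cell above
weights = [0.5, 0.5]
for _ in range(20):
    means, covs, weights, resp = em_step(toy, means, covs, weights)
print(np.round(means, 2))
```

A full implementation would also track the log-likelihood and stop once its change falls below a tolerance; this sketch simply runs a fixed number of iterations.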

In [None]:
plt.plot(range(len(loglikelihoods)), loglikelihoods, linewidth=4)
plt.xlabel('Iteration')
plt.ylabel('Log-likelihood')
plt.rcParams.update({'font.size':16})
plt.tight_layout()
plt.savefig(outputDir + '/threeEllipseBlobsObjectiveConvergence.pdf')

EM works by maximising the log-likelihood of the data (equivalently, minimising the negative log-likelihood), so the curve above should never decrease. The plot indicates that the first 3 steps make the most progress, and that subsequent steps just refine the placement until the stopping criterion is met.
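For reference, the quantity being tracked can be computed as in the sketch below. It assumes the usual mixture log-likelihood, the sum over points of the log of the weighted component densities, and uses `logsumexp` for numerical stability. The helper name `gmm_loglik` is hypothetical and is not taken from `clusSupport`.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_loglik(data, means, covs, weights):
    """Log-likelihood of data under a Gaussian mixture (illustrative sketch)."""
    # log(weight_j * N(x_i | mean_j, cov_j)) for every point i and component j
    log_terms = np.column_stack([
        np.log(weights[j])
        + multivariate_normal.logpdf(data, mean=means[j], cov=covs[j])
        for j in range(len(weights))
    ])
    # Sum over components inside the log, then sum the logs over points
    return logsumexp(log_terms, axis=1).sum()

# Two identical standard-normal components behave like a single standard normal
pts = np.array([[0.0, 0.0], [1.0, 0.0]])
value = gmm_loglik(pts, [np.zeros(2)] * 2, [np.eye(2)] * 2, [0.5, 0.5])
print(value)
```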

## Gaussian Mixture Models and stretched data: comparison with K-Means

_Motivation: PythonDataScienceHandbook/notebooks/05.12-Gaussian-Mixtures.ipynb_

Again, we create some 'blob' data.

In [None]:
# Generate some data
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
X = X[:, ::-1] # flip axes for better plotting

plt.figure()
plt.axis('equal')
plt.scatter(X[:, 0], X[:, 1], s=25, alpha=0.3);
plt.savefig(outputDir + '/fourEllipseBlobs.pdf')

In [None]:
# Plot the data with K Means Labels
plot_kwds = {'alpha' : 0.5, 's' : 40, 'linewidths':0}
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit(X).predict(X)
title = "KMeans fit to 4 blobs"
plt = cs.plot_2dClusters(X, labels, title, paletteName, fontSize, plot_kwds)
plt.savefig(outputDir + '/fourEllipseBlobs_Kmeans.pdf')

K-Means appears to do a good job assigning points to (colour-coded) clusters. This is not surprising, because the clusters are globular and relatively well separated. 

In [None]:
cs.plot_kmeans(kmeans, X)
plt.savefig(outputDir + '/fourEllipseBlobsKMeansWithDisks.pdf')
#from scipy.spatial.distance import cdist
#centres = kmeans.cluster_centers_
#radii = [cdist(X[labels == i], [centre]).max()
#         for i, centre in enumerate(centres)]

#fc='#CCCCCC'
#plt = cs.overlayDisks(plt, centres, radii, fc, plot_kwds)

The K-means regions are circular and hence a good fit with the data. However, data for clustering does not always have this nice property. We take the data and "stretch" it by multiplying by a random matrix, which changes the shape of the clusters as seen below.

In [None]:
rng = np.random.RandomState(13)
X_stretched = np.dot(X, rng.randn(2, 2))
plt.scatter(X_stretched[:, 0], X_stretched[:, 1], s=25, alpha=0.3);
plt.savefig(outputDir + '/fourEllipseBlobsStretched.pdf')
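To see why multiplying by a matrix changes the cluster shape, note that if the points have covariance `C`, the transformed points `X @ A` have covariance `A.T @ C @ A`. The sketch below checks this on a synthetic blob; the matrix `A` is a made-up example, not the random matrix drawn above.

```python
import numpy as np

rng = np.random.default_rng(0)
blob = rng.normal(0.0, 0.6, size=(5000, 2))   # roughly circular blob
A = np.array([[1.0, 2.0], [0.5, -1.0]])       # hypothetical stretch matrix
stretched = blob @ A

cov_before = np.cov(blob, rowvar=False)
cov_after = np.cov(stretched, rowvar=False)

# The sample covariance transforms as A^T C A
print(np.allclose(cov_after, A.T @ cov_before @ A))
# Non-zero off-diagonal terms mean the blob is now a tilted ellipse
print(np.round(cov_after, 2))
```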

We add the K-means circular regions to see how K-means decides on cluster membership.

In [None]:
kmeans = KMeans(n_clusters=4, random_state=0)
cs.plot_kmeans(kmeans, X_stretched)
plt.savefig(outputDir + '/fourEllipseBlobsStretchedKMeansWithDisks.pdf')

K-means still does a good job separating the blue and green clusters, but has difficulty distinguishing between the yellow and purple ones. The circular regions are not a great choice here.
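The reason is that K-means assigns each point to the centre with the smallest Euclidean distance, regardless of how the cluster is shaped. A minimal sketch of that assignment rule follows; the helper `assign_to_nearest` and the sample points are illustrative assumptions.

```python
import numpy as np

def assign_to_nearest(points, centres):
    """Label each point with the index of its nearest centre (sketch)."""
    # Squared Euclidean distance from every point to every centre, shape (n, k)
    d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

centres = np.array([[0.0, 0.0], [4.0, 0.0]])
points = np.array([[1.0, 3.0], [3.0, 0.1], [1.9, -5.0]])
assigned = assign_to_nearest(points, centres)
print(assigned)
```

The third point lies far from both centres along the vertical axis, yet it is labelled purely by which centre is marginally closer; an elongated cluster can therefore be split along the perpendicular bisector between two centres.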

In [None]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4).fit(X)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis');
# The model above was fitted to the unstretched data X, so name the file accordingly
plt.savefig(outputDir + '/fourEllipseBlobsGMM.pdf')

The Gaussian Mixture Model places Gaussian distributions so that each is centred on a cluster, and assigns each point a probability of belonging to each Gaussian in the mixture. For many of the points, cluster membership is simple to determine. The first of the points below, however, could belong to either cluster 1 or cluster 2; cluster 1 is the more likely because its membership probability is higher.

In [None]:
probs = gmm.predict_proba(X)
print(probs[:5].round(3))
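As a quick check of how the hard labels relate to these probabilities, the sketch below refits a model on freshly generated blobs (the `_demo` names are introduced only for this example) and confirms that each row of `predict_proba` sums to one and that `predict` returns the most probable component.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
gmm_demo = GaussianMixture(n_components=4, random_state=0).fit(X_demo)

probs_demo = gmm_demo.predict_proba(X_demo)   # shape (n_samples, n_components)
labels_demo = gmm_demo.predict(X_demo)

# Each row is a probability distribution over the 4 components
print(np.allclose(probs_demo.sum(axis=1), 1.0))
# The hard label is the component with the highest membership probability
print(np.array_equal(labels_demo, probs_demo.argmax(axis=1)))
```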

If we make the size of each point depend on its cluster membership probability, the plot below shows two points, at about (2.8, 0) and (6.5, 0), that have been assigned to the purple cluster rather than to the yellow and blue clusters, respectively.

In [None]:
size = 50 * probs.max(1) ** 2  # square emphasizes differences
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=size);
plt.savefig(outputDir + '/fourEllipseBlobsGMMProbSize.pdf')

By analogy with the K-Means "fit", we can also draw the Gaussian Mixture regions, as shown below. Because of sampling variation, the regions are close to circular but not exactly so.

In [None]:
gmm = GaussianMixture(n_components=4, random_state=42)
cs.plot_gmm(gmm, X)
plt.savefig(outputDir + '/fourEllipseBlobsGMMwithDisks.pdf')
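How `cs.plot_gmm` draws its regions is not shown here. One common approach, sketched below under that assumption, is to convert each fitted covariance matrix into an ellipse via its eigen-decomposition: the eigenvectors give the orientation and the square roots of the eigenvalues give the axis lengths. The helper `covariance_to_ellipse` is hypothetical.

```python
import numpy as np

def covariance_to_ellipse(cov, n_std=1.0):
    """Return (width, height, angle in degrees) of a covariance ellipse (sketch)."""
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, -1]                      # direction of largest variance
    angle = np.degrees(np.arctan2(major[1], major[0]))
    # Full axis lengths: 2 * n_std standard deviations along each principal axis
    width, height = 2 * n_std * np.sqrt(eigvals[::-1])
    return width, height, angle

width, height, angle = covariance_to_ellipse(np.array([[4.0, 0.0], [0.0, 1.0]]))
print(width, height, angle)
```

Values like these could be passed to `matplotlib.patches.Ellipse(xy, width, height, angle=angle)`, with `xy` set to the component mean. The sign of an eigenvector is arbitrary, so the angle is only defined up to 180 degrees, which does not affect the drawn ellipse.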

We can apply the Gaussian Mixture model to stretched data and it handles it readily, because of the greater flexibility in defining the shape of its regions, see below:

In [None]:
gmm = GaussianMixture(n_components=4, covariance_type='full', random_state=42)
cs.plot_gmm(gmm, X_stretched)
plt.savefig(outputDir + '/fourEllipseBlobsStretchedGMMwithDisks.pdf')
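The flexibility comes from the `covariance_type` option of `GaussianMixture`, which accepts `'spherical'`, `'diag'`, `'tied'` and `'full'`. As a rough illustration, the sketch below fits each variant to a stretched copy of freshly generated blobs (the `_demo` names are assumptions for this example) and prints the average per-sample log-likelihood returned by `score`. No particular ordering of the scores is claimed, since EM can settle in different local optima.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_demo, _ = make_blobs(n_samples=400, centers=4,
                       cluster_std=0.60, random_state=0)
# Stretch with a random 2x2 matrix, using the same seed as the cell above
X_demo_stretched = X_demo @ np.random.RandomState(13).randn(2, 2)

scores = {}
for cov_type in ("spherical", "diag", "tied", "full"):
    model = GaussianMixture(n_components=4, covariance_type=cov_type,
                            random_state=42)
    # score() is the mean log-likelihood per sample on the given data
    scores[cov_type] = model.fit(X_demo_stretched).score(X_demo_stretched)
    print(cov_type, round(scores[cov_type], 3))
```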