# Lecture 5: 
- K-Means Clustering

__Optional Reading Material:__
- [Scikit-learn: K-Means Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [Visualizing K-Means clustering](https://www.naftaliharris.com/blog/visualizing-k-means-clustering/)

### k-Means Clustering
The idea of k-means clustering is to partition a set of data points into $k$ clusters so that each cluster is as concentrated as possible. If the clusters are $\{D_1, D_2, \ldots, D_k\}$ and $\mu_i$ denotes the mean (centroid) of the points in $D_i$, then we wish to minimize the total within-cluster variation:
$$
\displaystyle\sum_{i=1}^{k}\sum_{x\in D_i} d(x,\mu_i)^2.
$$
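
To make the objective concrete, here is a minimal NumPy sketch that evaluates it for a given assignment of points to clusters, assuming $d$ is the Euclidean distance. The helper name `within_cluster_variation` and the toy data are hypothetical and only serve as an illustration.

```python
import numpy as np

def within_cluster_variation(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for label in np.unique(labels):
        members = X[labels == label]      # points assigned to this cluster
        mu = members.mean(axis=0)         # cluster mean (centroid)
        total += ((members - mu) ** 2).sum()
    return total

# Hypothetical toy data: two tight pairs of points, far from each other
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = np.array([0, 0, 1, 1])   # groups nearby points together
bad = np.array([0, 1, 0, 1])    # mixes the two pairs

print(within_cluster_variation(X_toy, good))  # 1.0
print(within_cluster_variation(X_toy, bad))   # 50.0
```

The assignment that groups nearby points gives a much smaller value, which is exactly what k-means tries to achieve.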


The cluster assignments depend on each other: moving one point changes the cluster means and can therefore change where other points belong. The distance function $d$ is usually the Euclidean distance. Searching the space of all possible partitions is generally infeasible, so a fast iterative optimization algorithm is used instead, often restarted several times from different initial centers because it may converge to a local minimum. In sklearn, you can control which method is used (for example, through the `algorithm`, `init`, and `n_init` parameters of `KMeans`). k-Means clustering is faster and more robust when the data is as low-dimensional as possible. In particular, linearly correlated variables should be combined into fewer variables; this is called __dimensionality reduction__.
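
As a rough sketch of the two points above, the following combines linearly correlated columns with PCA before clustering and sets the restart and algorithm options explicitly. PCA is one possible choice of dimensionality reduction and is an assumption here, as are the synthetic data, the seed, and the specific parameter values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical data: two well-separated groups in 2 latent dimensions
rng = np.random.default_rng(0)
latent = np.concatenate([rng.normal(0, 1, (20, 2)),
                         rng.normal(10, 1, (20, 2))])

# Add two columns that are linear combinations of the first two,
# so the 4 columns carry only 2 dimensions of information
X_corr = np.column_stack([latent,
                          2 * latent[:, 0],
                          latent[:, 0] + latent[:, 1]])

# Dimensionality reduction: project onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_corr)

# n_init restarts the optimization from 10 different initializations;
# algorithm selects the optimization variant (values chosen for illustration)
km = KMeans(n_clusters=2, n_init=10, algorithm="elkan", random_state=0)
labels = km.fit_predict(X_reduced)

print(X_reduced.shape)
print(labels)
```

Because the extra columns are exact linear combinations, two principal components retain essentially all of the variance, and the clustering runs on half as many features.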

__k-Means clustering__ is implemented as follows:

In [1]:
import numpy as np
from sklearn.cluster import KMeans

In [3]:
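# Five samples (rows) with four features (columns)
X = np.array([[1, 1, 3, 4],
              [2, 1, 5, 5],
              [5, 5, 1, 2],
              [5, 4, 1, 3],
              [5, 5, 1, 1]])
kmeans = KMeans(n_clusters=4)  # number of clusters k
kmeans.fit(X)
kmeans.labels_  # cluster index assigned to each row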

array([1, 2, 0, 3, 0], dtype=int32)

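The label numbers themselves are arbitrary; what matters is which rows share a label. Here the third and fifth rows, which differ only in the last column, end up in the same cluster, while each of the other rows gets its own. Again, if you look at the data, these are likely the clusters that you would have detected intuitively.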
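
For comparison, here is a minimal sketch that fits the same five rows with two clusters and inspects the fitted model. The choice of `n_clusters=2`, `n_init=10`, and `random_state=0` is an assumption made for illustration, not part of the example above.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1, 3, 4],
              [2, 1, 5, 5],
              [5, 5, 1, 2],
              [5, 4, 1, 3],
              [5, 5, 1, 1]])

# Illustrative settings: 2 clusters, 10 restarts, fixed seed
kmeans2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans2.labels_)           # label numbering is arbitrary
print(kmeans2.cluster_centers_)  # one row per cluster: the mean of its members
print(kmeans2.inertia_)          # total within-cluster variation

# Assign a new point to the nearest fitted center
new_point = np.array([[5, 5, 1, 2]])
print(kmeans2.predict(new_point))
```

With two clusters, one would expect the first two rows to be grouped together and the last three rows to form the other cluster.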

Here's a more complicated example:

[Scikit-learn example: K-means clustering of the Iris dataset](http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_iris.html#sphx-glr-auto-examples-cluster-plot-cluster-iris-py)