# Clustering

In this module, we will learn about clustering in Python.

In [1]:
%matplotlib inline
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets as skdat
import scipy

## K-means clustering

As discussed in the talk, K-means is an effective clustering algorithm that is widely used. In this lab, you will implement K-means and explore its strengths and limitations.

Here is the pseudocode for K-means:<br/>
<ol>
    <li>randomly initialize $\mu_k, \mathrm{\;for\;} k=1,\dots,K$</li>
    <li>while $\mu$ not converged:</li>
    <ol>
        <li>assign data point $n$ to nearest cluster: $r_{nk} \leftarrow 1$ if $k = \arg \min_j ||x_n - \mu_j ||^2$ and $r_{nk} \leftarrow 0$ otherwise, $\mathrm{\;for\;} n=1,\dots,N$</li>
        <li>count number of data points assigned to cluster $k$, $N_k \leftarrow \sum_{n=1}^N r_{nk}, \mathrm{\;for\;} k=1,\dots,K$</li>
        <li>update cluster centers: $\mu_k \leftarrow \frac{1}{N_k} \sum_{n=1}^N x_n r_{nk}, \mathrm{\;for\;} k=1,\dots,K$</li>
    </ol>
</ol>   
Recall that $r_{nk}$ is a binary variable indicating whether data point $n$ is assigned to cluster $k$ and that $\mu_k$ is the center for cluster $k$. Note that $\mu_k$ could be a scalar or a vector; its dimensionality matches that of the data $x_{1:N}$. 
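
As a rough illustration of the assignment, counting, and update steps, the sketch below runs a single iteration on four made-up 1-D points with deliberately poor initial centers. The data values and the names `d2` and `mu_new` are assumptions chosen for this example; it is a vectorized sketch of one pass, not the loop-based implementation the task below asks for.

```python
import numpy as np

# Toy 1-D data with two obvious groups (illustrative values, not from the lab).
X = np.array([[0.0], [1.0], [9.0], [10.0]])
mu = np.array([[0.0], [1.0]])  # deliberately poor initial centers, shape (K, D)

# Assignment: squared distance from every point to every center, shape (N, K).
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
R = np.zeros_like(d2)
R[np.arange(len(X)), d2.argmin(axis=1)] = 1.0  # one-hot rows r_nk

# Counting: cluster sizes N_k.
Nk = R.sum(axis=0)

# Update: each new center is the mean of the points assigned to it.
mu_new = (R.T @ X) / Nk[:, None]

print('assignments', R.argmax(axis=1))
print('cluster sizes', Nk)
print('updated centers', mu_new.ravel())
```

After this single pass the second center has moved from 1 towards the larger group, which is the behavior the outer loop repeats until the centers stop changing.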

### Task 1

Given the above pseudocode for K-means, complete the implementation below.

Guidance: your answer will likely include a main *while* loop (one pass per outer iteration of K-means) and two *for* loops (one stepping through the data and another stepping through the clusters). 

In [2]:
def kmeans(X, K, eps=1e-5, max_iterations=200):
    """
        arguments:
            X: a (N,D) numpy array of observed data
            K: integer indicating number of clusters
            eps (optional): real threshold for change in mu for deciding when to stop
            max_iterations (optional): max num of iterations, regardless of eps threshold
        returns:
            mu: a (K,D) numpy array of cluster means after k-means converged
            R: a (N,K) numpy array of binary cluster assignments
    """
    
    #todo: put your code here
    (N,D) = X.shape
    np.random.seed(100)
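    #note: np.random.choice samples with replacement by default, so two initial
    #centers can coincide; passing replace=False would guarantee K distinct points
    rand_n = np.random.choice(range(N),size=K)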
    mu = X[rand_n,:]
    prev_mu = np.inf #always do the first iteration
    R = np.zeros((N,K)) #this will be written over before first read
    it = 0
    while np.abs(mu-prev_mu).sum() >= eps and it<max_iterations:
        #store previous value of mu
        prev_mu = mu.copy()
        #A. update cluster assignments according to nearest cluster
        for n in range(N):
            d = np.array([np.linalg.norm(X[n,:] - mu[k,:]) for k in range(K)])
            R[n,:] = 0.
            R[n,d.argmin()] = 1.
        #B. calculate cluster sizes
        Nk = R.sum(axis=0) + 1e-9 #add epsilon to avoid divide by zero errors
        #C. update cluster centers
        mu = np.dot(R.T, X) / Nk[:,np.newaxis] #results in a (K,D) matrix
        it += 1
        #print('cluster centers',mu)
        #print('cluster assignments',R)    
    
    return mu, R

#apply to some toy data generated by sklearn package:
N,D,K = 100, 2, 3
X, true_class = skdat.make_blobs(n_samples=N, n_features=D, centers=K, random_state=0)
mu, R = kmeans(X,K)
print('cluster centers',mu)
print('cluster assignments',R.argmax(axis=1))
print('true assignments',true_class)

cluster centers [[ 2.26282192  1.26005527]
 [ 0.81438285  4.08457242]
 [-1.82049132  2.84315898]]
cluster assignments [0 1 0 1 1 1 2 2 0 1 1 1 0 1 2 0 2 1 2 2 2 2 2 1 0 0 0 0 2 2 1 0 0 1 2 1 1
 0 0 2 2 0 0 1 1 1 0 0 2 2 1 0 1 0 2 2 0 0 1 0 0 2 2 2 2 0 1 2 0 1 2 0 2 0
 1 1 1 1 2 0 1 1 0 1 1 1 1 1 0 1 1 0 1 2 2 1 1 1 2 2]
true assignments [1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 2 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 2 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]


In [12]:
plt.ioff()
def plot_clusters_2D(X,R,mu,separate=False,titles=None):
    colors = ['r', 'g', 'b']
    K = R.shape[1] #number of clusters, read from R instead of relying on the global K
    A = R.argmax(axis=1) #calculate most likely cluster assignment
    for k in range(K):
        nsk, = np.where(A==k) #select data points based on assignment
        c = colors[k % len(colors)] #choose colour from list
        plt.scatter(X[nsk,0], X[nsk,1], marker='x', color=c)
        plt.scatter(mu[k,0],mu[k,1],color=c)
        if titles is not None: plt.title(titles[k])
        if separate and k<K-1: plt.figure()
plot_clusters_2D(X,R,mu)

#### Question: 
How does your implementation scale in the number of data points $N$ and the number of clusters $K$?

#### Answer:
[write here]

#### Label switching
Notice that the cluster assignments discovered by K-means differ from the known assignments used in data generation, even though the means are quite close. This phenomenon is known as "label switching". It arises because the underlying model is symmetric with respect to cluster labels: permuting the cluster ids of all the assignments leaves the grouping of the data points unchanged.
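
One way to see this is to compare groupings rather than labels. The sketch below takes the first eight entries of the two label vectors printed in the output above and checks whether every pair of points is grouped together in one labeling exactly when it is grouped together in the other. The helper name `co_membership` is an assumption introduced for this illustration.

```python
import numpy as np

# First eight entries of the two label vectors printed in the output above.
found = np.array([0, 1, 0, 1, 1, 1, 2, 2])
truth = np.array([1, 0, 1, 0, 0, 0, 2, 2])

def co_membership(labels):
    """(N,N) boolean matrix: True where two points share a cluster."""
    return labels[:, None] == labels[None, :]

# The raw labels disagree on some points...
labels_differ = bool(np.any(found != truth))
# ...but the induced groupings of the points are identical.
same_grouping = bool(np.array_equal(co_membership(found), co_membership(truth)))

print('labels differ:', labels_differ)
print('same grouping:', same_grouping)
```

For these eight points the ids 0 and 1 are simply swapped, so the labels differ while the grouping is the same.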

#### Bonus Task (optional)
Write a method to calculate how well the cluster assignments discovered by K-means match those of the generated data, regardless of label switching. Use an error of 0 for a data point whose assignments match and an error of 1 for one whose assignments do not, then average over all data points.
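
One possible approach, sketched below, is to try every relabeling of the predicted clusters and keep the one with the lowest average 0/1 error. The function name `clustering_error` and the small example arrays are assumptions made for illustration. The search visits all $K!$ permutations, so it is only practical for small $K$; for larger $K$ an assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more scalable alternative.

```python
import itertools
import numpy as np

def clustering_error(pred, true, K):
    """Average 0/1 error, minimised over all relabelings of the predicted ids."""
    pred = np.asarray(pred)
    true = np.asarray(true)
    best = 1.0
    for perm in itertools.permutations(range(K)):
        relabeled = np.asarray(perm)[pred]  # map predicted id k to perm[k]
        best = min(best, float(np.mean(relabeled != true)))
    return best

# Same grouping under swapped ids: the error should vanish.
err_switched = clustering_error([0, 1, 0, 1, 1, 2], [1, 0, 1, 0, 0, 2], K=3)
# One point genuinely in the wrong group: one mismatch out of four.
err_one_wrong = clustering_error([0, 0, 1, 1], [0, 0, 0, 1], K=2)
print(err_switched, err_one_wrong)
```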

## Limitations of K-means

### Task 2

As discussed in the talk, there are several limitations of K-means. One of them is that it is sensitive to outliers. 

#### Question:
Why is K-means sensitive to outliers (write below)?
#### Answer:
[here]

The first step in this task is to add outliers to the above 2D toy data set. Set the variable `outliers` below to a few points that lie far from the bulk of `X`, then append them to `X`:

In [4]:
outliers = np.array([
                [30., 20.],
                [-30., -30.],
                [0., -25.]
                 ])
X_outlier = np.vstack((X, outliers))
mu1, R1 = kmeans(X_outlier,K)
plot_clusters_2D(X_outlier,R1,mu1)

Now apply your *kmeans* function to `X_outlier` and visualize the results. Specifically, color-code each data point $n$ by its assignment $r_{nk}$, giving each cluster a different color. 
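
The effect of a single outlier on a cluster center can also be seen in isolation. The sketch below uses three made-up points plus one of the outliers defined above and compares the mean with and without it; the small `cluster` array is an assumption for illustration, not data from the lab.

```python
import numpy as np

# A tight, made-up cluster and one far-away point (the first outlier above).
cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
outlier = np.array([[30.0, 20.0]])

mean_clean = cluster.mean(axis=0)
mean_with_outlier = np.vstack((cluster, outlier)).mean(axis=0)

# Distance by which the single outlier drags the center.
shift = np.linalg.norm(mean_with_outlier - mean_clean)

print('mean without outlier', mean_clean)
print('mean with outlier', mean_with_outlier)
print('shift of the center', shift)
```

Because the center update is a plain average, one distant point moves the center well outside the original three points.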

## Clustering New York City collisions data

To end this lab session, we will apply the clustering methods you just implemented to a subset of the New York City collisions data set. The data set contains the locations of traffic collisions in New York City between June 1st 2016 and June 8th 2016. The data was obtained from https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data

We will use the `Pandas` library to simply load a CSV (comma-separated values) file of the data and display a summary (the first and last several rows) of the whole data file.

In [5]:
import pandas as pd
data_path = './data/nyc_collisions_01june_08june_2016.csv'
collisions_table = pd.read_csv(data_path)
collisions_table #browse data in a table format

Unnamed: 0,DATE,TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,UNIQUE KEY,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,06/01/2016,9:50,,,,,,,,73 AVENUE,...,Unspecified,Unspecified,,,3454728,PASSENGER VEHICLE,PASSENGER VEHICLE,PASSENGER VEHICLE,,
1,06/01/2016,9:50,,,,,,,,,...,Unspecified,,,,3453352,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
2,06/01/2016,9:50,,,,,,RICHMOND ROAD,TARGEE STREET,,...,Unspecified,,,,3454271,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
3,06/01/2016,9:50,,,,,,41-18 56TH ST,56 STREET,,...,Unspecified,,,,3453995,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
4,06/01/2016,9:46,,,,,,MAIN STREET,,,...,Unspecified,,,,3454727,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
5,06/01/2016,9:45,QUEENS,11417.0,40.669946,-73.842613,"(40.669946, -73.8426132)",,,150-19 CROSSBAY BLVD,...,Unspecified,,,,3453945,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
6,06/01/2016,9:44,BRONX,10454.0,40.801626,-73.909672,"(40.8016264, -73.9096721)",EAST 136 STREET,WALNUT AVENUE,,...,Unspecified,,,,3453118,PASSENGER VEHICLE,OTHER,,,
7,06/01/2016,9:40,,,,,,NORTH CONDUIT AVENUE,140 STREET,,...,Unspecified,,,,3454133,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
8,06/01/2016,9:37,BRONX,10466.0,40.889193,-73.831298,"(40.8891934, -73.8312982)",DYRE AVENUE,EAST 233 STREET,,...,Unspecified,,,,3455995,PASSENGER VEHICLE,PASSENGER VEHICLE,,,
9,06/01/2016,9:35,,,,,,AVERY AVENUE,,,...,Unspecified,,,,3454038,PASSENGER VEHICLE,,,,


### Taking a subset of the data points and features
You will notice that there are multiple columns per collision and that not every collision has a related location. A more complex model may be able to incorporate this extra information but for now let's focus on just the locations of collisions and time of day. 

We filter the data, removing columns (features) and rows (collisions), so that we end up with a `numpy` array `Xcol` of collisions with valid latitudes and longitudes.

In [6]:
loc_collisions_table = collisions_table[np.isfinite(collisions_table['LATITUDE'])].copy() #remove rows with NaNs; copy so the new column is not set on a slice
#convert 'H:MM' strings to fractional hours, e.g. '9:45' -> 9.75
loc_collisions_table['TIME_HOUR'] = loc_collisions_table.TIME.apply(lambda x: float(x.split(':')[0]) + float(x.split(':')[1])/60.)
Xcol = loc_collisions_table[['LONGITUDE','LATITUDE','TIME_HOUR']].values #as_matrix was removed in pandas 1.0
Xcol #display the data as an N-by-3 numpy array


array([[-73.8426132 ,  40.669946  ,   9.75      ],
       [-73.9096721 ,  40.8016264 ,   9.73333333],
       [-73.8312982 ,  40.8891934 ,   9.61666667],
       ..., 
       [-73.9725871 ,  40.7533182 ,  14.25      ],
       [-73.7784205 ,  40.6790298 ,  14.25      ],
       [-73.9040958 ,  40.7210307 ,  14.25      ]])
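
The time conversion used in the cell above can be checked on its own. The sketch below wraps the same arithmetic in a small helper; the name `time_to_hours` and the sample strings are assumptions introduced for illustration.

```python
def time_to_hours(hhmm):
    """Convert an 'H:MM' or 'HH:MM' string to fractional hours."""
    hours, minutes = hhmm.split(':')
    return float(hours) + float(minutes) / 60.0

# A few sample times, including the 9:45 and 14:15 values seen in the array above.
examples = [time_to_hours(t) for t in ['9:45', '14:15', '0:30']]
print(examples)
```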

### Standardizing the data
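It is good practice to standardize the data, that is, to transform each feature so that it has zero mean and unit standard deviation. Because K-means relies on Euclidean distance, this keeps a feature with a large numeric range (here, time in hours) from dominating features with a small one (longitude and latitude in degrees). It also helps with hyperparameter selection and parameter exploration. Sometimes it is useful to work in the original space, so be sure to save the transformation variables for later.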

In [7]:
tr_mn = Xcol.mean(axis=0)
tr_sd = (Xcol-tr_mn).std(axis=0)
Xcol1 = (Xcol-tr_mn)/tr_sd
#to return to the original space later: Xcol = Xcol1*tr_sd + tr_mn
print('transformation variables:\n mean=',tr_mn,'std=',tr_sd)
print('X=\n',Xcol1)
#check:
print('check transformed X properties:\n mean=',Xcol1.mean(axis=0),'std=',Xcol1.std(axis=0))

transformation variables:
 mean= [-73.91901381  40.72145239  13.69823791] std= [ 0.08740379  0.07928329  5.55553749]
X=
 [[ 0.87411096 -0.64965    -0.71068513]
 [ 0.10687995  1.01123472 -0.71368514]
 [ 1.00356759  2.11571719 -0.73468521]
 ..., 
 [-0.61294008  0.40192346  0.0993175 ]
 [ 1.60854935 -0.53507604  0.0993175 ]
 [ 0.17067924 -0.00531874  0.0993175 ]]
check transformed X properties:
 mean= [  9.36951360e-14  -6.13649684e-14  -5.29832133e-17] std= [ 1.  1.  1.]


### Visualize the data
Now we are going to make a simple visualization of the data. Since the collision locations are 2-dimensional points (longitude and latitude), we use a 2-dimensional scatter plot and leave out the time of day.

In [8]:
import seaborn as sns
sns.set(color_codes=True)
#lmplot returns a FacetGrid; fit_reg=False leaves a plain scatter plot
g = sns.lmplot(x="LONGITUDE", y="LATITUDE", data=loc_collisions_table, fit_reg=False)


### Task 4

Apply your *kmeans* implementation to the NYC collisions data. Visualize the results, including the cluster assignments. What do you find?

In [13]:
K = 3
mu3, R3 = kmeans(Xcol1[:,:2],K)
print('cluster centers',mu3)
#titles = ['%i data points, average time of day = %.1f hours' % (R3.sum(axis=0)[k], mu3[k,2]*tr_sd[2]+tr_mn[2]) for k in range(K)]
#plot_clusters_2D(Xcol1[:,:2]*tr_sd[:2]+tr_mn[:2],R3,
#                 mu3*tr_sd[:2]+tr_mn[:2],
#                 separate=True) #,titles=titles)


cluster centers [[-0.72968579 -0.84060602]
 [ 1.23576927 -0.14909324]
 [-0.06860418  1.01964794]]
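
The printed centers are in standardized units, which are hard to read on a map. As a sketch, they can be mapped back to longitude and latitude by undoing the standardization with the saved transformation variables. The numbers below are copied from the printed outputs above (first two columns only); the name `mu3_lonlat` is an assumption introduced here.

```python
import numpy as np

# Values copied from the printed outputs above (longitude and latitude columns only).
tr_mn = np.array([-73.91901381, 40.72145239])
tr_sd = np.array([0.08740379, 0.07928329])
mu3 = np.array([[-0.72968579, -0.84060602],
                [ 1.23576927, -0.14909324],
                [-0.06860418,  1.01964794]])

# Undo the standardization: x = z * std + mean, broadcast over the K rows.
mu3_lonlat = mu3 * tr_sd + tr_mn

for k, (lon, lat) in enumerate(mu3_lonlat):
    print('cluster %d center: longitude %.4f, latitude %.4f' % (k, lon, lat))
```

The same expression, applied to `Xcol1[:,:2]`, recovers the original coordinates of the data points for plotting.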
