## Lab 1b K-means
In this lab, you will implement and use a k-means clustering algorithm by following a three-step process.

The next cell contains the main processing loop for k-means. Run this cell and finish implementing each of the functions that follow it.

In [None]:
from random import sample 
from math import sqrt

def kmeans(data, K):
    """
    Runs the K-means algorithm on n rows of data using K clusters.
    Returns a matrix of K cluster centroids, one per row.
    """
    # initialize cluster centers
    centroids = init_centroids(K, data)
    for i in range(NUM_ITERATIONS):
        # assign data points to clusters
        index = assign_step(data, centroids)
        # update cluster centers
        centroids = update_step(data, index, K)
    return centroids

### Step 1 and Step 2
Review the code that we have provided, make notes on what needs to be done (Step 1), and then complete the missing code sections (Step 2).

Consider the required inputs and expected output for each function, and answer the following:

1. **What are the dimensions and type of each argument?**
1. **What are the dimensions and type of the return value?**
1. **Describe how the required calculations can be done in Python.**  

Ask a TA to review your notes on this function if you are not sure about it.  Once you have a clear understanding of the function please fill in the code necessary to implement it.  We have provided some testing code to help ensure you have correctly implemented each function.  You are welcome to examine the testing code to help get a handle on how the function is supposed to work.

Here is an example of what your solutions for steps 1 and 2 should look like:

In [None]:
def distance(a, b):
    """Returns the distance between two vectors."""
    dist = 0
    # accumulate the squared difference in each dimension
    for i, j in zip(a, b):
        dist += (i - j)**2
    return sqrt(dist)

# Testing code:
d = distance([0,1],[1,0])
print(d)
assert abs(d-sqrt(2))<.001   

#### Dimensions and type of each argument:
`a` and `b` are both vectors (lists) of numbers with the same length, d.

#### Dimensions and type of the return value:
Returns a single number (a float): the distance between `a` and `b`.

#### How the required calculations can be done in Python:
Iterate over the paired elements of `a` and `b` to compute the d-dimensional [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance) between the vectors.
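
For reference, the quantity that `distance` computes can be written as follows, where $a_i$ and $b_i$ denote the i-th elements of the two vectors and $d$ is their common length:

$$\operatorname{dist}(a, b) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$

For the test vectors `[0,1]` and `[1,0]`, this gives $\sqrt{(0-1)^2 + (1-0)^2} = \sqrt{2}$, which is the value the assertion checks.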

In [None]:
def init_centroids(K, data):
    """
    Selects K different random rows from data and returns them.  
    data is a 2D array.                
    """
    
    
    
    return centroids

# Testing code:
data = [[1,0],[0,1],[1,1]] # Three 2-D points
centroids = init_centroids(3,data)
print('Our three centroids:', centroids)
assert len(centroids)==3
assert len(centroids[0])==2
# Check K unique data points have been returned:
assert [1,0] in centroids and [0,1] in centroids and [1,1] in centroids 

STEP 1 HERE
##### Dimensions and type of each argument:
##### Dimensions and type of the return value:
##### How the required calculations can be done in Python:

hint: use ```random.sample```
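
As a quick illustration of the hint (a sketch of the building block, not the solution itself), `random.sample` draws a given number of distinct items from a sequence without replacement. The list `points` and the variable names below are made up for this example.

```python
from random import sample

# Hypothetical example data: four 2-D points stored as a list of lists.
points = [[0, 0], [1, 0], [0, 1], [1, 1]]

# Draw 2 distinct rows at random (sampling without replacement).
picked = sample(points, 2)
print(picked)

# Alternative: draw 2 distinct row indices, then look the rows up.
indices = sample(range(len(points)), 2)
picked_by_index = [points[i] for i in indices]
print(picked_by_index)
```

`random.sample` expects a sequence such as a list or a `range`. If the data arrives in another container (for example the NumPy array used in Step 3), sampling row indices as in the second variant is one way to stay within what it accepts.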

In [None]:
def assign_step(data, centroids):
    """
    Determines a centroid index for every row of the data.
    Returns a vector of centroid indices, one for each row of the data.
    """
    
    n = len(data)  # number of data points
    K = len(centroids) # number of cluster centroids
    index = [0]*n # Pre-allocated array of indices
    
    
    
    return index

data = [[0, 1.1], [1.2, 0], [.8, 0]]
centroids = [[1,0],[0,1]]
index = assign_step(data, centroids)
print('The 3 data points belong to clusters', index, 'respectively')
assert index == [1,0, 0]

STEP 1 (cont.) HERE
##### Dimensions and type of each argument:
##### Dimensions and type of the return value:
##### How the required calculations can be done in Python:
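
One building block that may help here is finding the position of the smallest value in a list. The sketch below uses made-up distances (`dists` is a hypothetical name, not part of the lab code) and shows one way to do it with the built-in `min`.

```python
# Hypothetical distances from one data point to three centroids.
dists = [2.5, 0.7, 1.9]

# Index of the smallest distance: compare indices by the value they point to.
nearest = min(range(len(dists)), key=lambda k: dists[k])
print(nearest)  # 1
```

An explicit loop that tracks the smallest distance seen so far works equally well; which form to use is a matter of taste.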

In [None]:
def update_step(data, index, K):
    """
    Computes the centroid for each cluster.  Uses K to make this
    calculation efficient.
    Returns a matrix of K centroids, one per row.
    """
    
            
    
    return centroids

data = [[0, 1.1], [1.2, 0], [.8, 0]]
index = [1,0,0]
centroids_old = [[1,0],[0,1]]
centroids_new = update_step(data, index, 2)
print(centroids_new)
assert centroids_new == [[1.0, 0], [0, 1.1]]

STEP 1 (cont.) HERE
##### Dimensions and type of each argument:
##### Dimensions and type of the return value:
##### How the required calculations can be done in Python:
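
A centroid is the coordinate-wise mean of the points in a cluster. As a small sketch of that calculation alone (the list `members` is made-up example data, and this is only one of several ways to write it), `zip(*...)` can group the values by coordinate:

```python
# Hypothetical cluster members: three 2-D points.
members = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]

# zip(*members) groups values by coordinate: (1.0, 2.0, 3.0) and (4.0, 5.0, 6.0).
mean = [sum(coord) / len(members) for coord in zip(*members)]
print(mean)  # [2.0, 5.0]
```

The function itself still needs to decide which rows belong to which cluster, using `index`, before any averaging can happen.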

### Step 3 Try out your K-means solution
Run the following code to try out your k-means solution on some synthetic data.  The data points are blue points and the final centroids are indicated by red x's.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

mu1 = 1
sigma = .04
N = 100

c1 = np.random.multivariate_normal([mu1,0], [[sigma, 0],[0, sigma]], N)
c2 = np.random.multivariate_normal([0,mu1], [[sigma, 0],[0, sigma]], N)
c3 = np.random.multivariate_normal([mu1,mu1], [[sigma, 0],[0, sigma]], N)
sd = np.vstack((c1,c2,c3))
plt.plot(sd[:, 0], sd[:, 1], '.')

NUM_ITERATIONS = 25
centroids = kmeans(sd, 3)
centroids = np.array(centroids)
plt.plot(centroids[:, 0], centroids[:, 1], 'rx')

#### Sanity Check:
Your scatter plot should look similar to this one:
    
<img src="images/scatter.png" width="500px">
    

### Step 3 Profile your K-means code
Once your k-means implementation is working, run the following cell to line-profile your solution and determine which parts of your code take the most time to execute:

In [None]:
NUM_ITERATIONS = 25

import line_profiler
l = line_profiler.LineProfiler()
l.add_function(kmeans)
l.run('kmeans(sd, 3)')
l.print_stats()

#### Sanity Check
When we run line profiling on our implementation we get the following:

```
Timer unit: 1e-06 s

Total time: 0.153622 s
File: <ipython-input-1-4668301c5461>
Function: kmeans at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           def kmeans(data, K):
     5                                               '''
     6                                               Runs the K-means algorithm on n rows of data using K clusters.
     7                                               Returns a matrix of K cluster centroids, one per row.
     8                                               '''
     9                                               # initialize cluster centers
    10         1          116    116.0      0.1      centroids = init_centroids(K, data)
    11        26           29      1.1      0.0      for i in range(NUM_ITERATIONS):
    12                                                   # assign data points to clusters
    13        25       122180   4887.2     79.5          index = assign_step(data, centroids)
    14                                                   # update cluster centers
    15        25        31297   1251.9     20.4          centroids = update_step(data, index, K)
    16         1            0      0.0      0.0      return centroids
```