# A1 Tasks 10-12: Clustering Digits

_Story:_ You are a space scavenger exploring the origin of the most important invention in the world: the fidget spinner. You just arrived on the deserted planet and your AI buddy, Bubbles, has “imaged some items she found in a garbage heap” and collected some data for you. You look at the data and it seems like you might have found an ancient language!

In [None]:
import pandas as pd

In [None]:
# Load the CSV without a header row and drop the first column
data = pd.read_csv('data/digits.csv', header=None).values[:, 1:]
print(data.shape)

__Task 10:__ Run the code below to test your k-means implementation by clustering the alien character data using k=5.

In [None]:
# TASK 10 here
from algorithms.kmeans import kmeans
NUM_CLUSTERS = 5
cluster_centers = kmeans(data, NUM_CLUSTERS)

What do your k-means clusters look like? How many clusters are there in the dataset? **Run the provided code below to visually check the accuracy of your k-means models.**

If your k-means implementation works, your result will look something like this.
![image.png](images/five-digits.png)

In [None]:
# Task 10 Continued. Run this code to visualize your cluster centers
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, NUM_CLUSTERS, figsize=(8, 3))
unflattened_centers = np.array(cluster_centers).reshape(NUM_CLUSTERS, 8, 8)
for axi, center in zip(ax.flat, unflattened_centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

In [None]:
%load_ext line_profiler
%lprun -f kmeans kmeans(data, NUM_CLUSTERS)

## Optimizing k-means

When we evaluate machine learning models such as k-means, the main factors we look to optimize include model accuracy, model interpretability, memory footprint, and execution time. In these last parts of the assignment, we will look at how to improve your model’s training time, that is, the time k-means takes to find the clusters, by speeding up the computation of centroids.

The next two tasks require you to create vectorized and parallelized versions of your k-means code. Put these versions of your code in `algorithms/kmeansvec.py` and `algorithms/kmeanspar.py`. 


### Vectorizing k-means
For numerical processing in Python, many scientists rely on the [NumPy package](http://www.numpy.org/). Because this package is built on optimized C code, it is often much faster than plain Python. For example, a Python for-loop executes many additional instructions to ensure that it can correctly iterate over a list containing multiple data types.
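
To get a feel for the difference, the following sketch times a sum of squares computed with a plain Python loop and with NumPy. The array size and repeat count are arbitrary choices made for illustration, and the measured times will vary from machine to machine.

```python
import timeit

import numpy as np

# Illustrative input: the same 100,000 integers as a list and as an array
values = list(range(100_000))
array = np.arange(100_000, dtype=np.int64)

# Both approaches compute the same sum of squares
loop_total = sum(x * x for x in values)
numpy_total = int(np.dot(array, array))

# Time 10 repetitions of each approach
loop_time = timeit.timeit(lambda: sum(x * x for x in values), number=10)
numpy_time = timeit.timeit(lambda: np.dot(array, array), number=10)

print(f"Plain Python: {loop_time:.4f} s")
print(f"NumPy:        {numpy_time:.4f} s")
```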

However, to take advantage of NumPy optimizations you need to “vectorize” your code. This means that instead of applying functions and operations element by element, we treat the function inputs as mathematical vectors and operate on them as a whole. Here are some examples of vectorization:

In [None]:
import numpy as np
# Initialize the NumPy arrays
u = np.array([1, 2, 3])
v = np.array([1, 2, 4])

# Vector sum
print('Summing in plain Python:', [u[i] + v[i] for i in range(len(u))])
print('Summing with NumPy:', u + v)

# Dot product
print('Plain Python dot product:', sum(u[i] * v[i] for i in range(len(u))))
print('NumPy dot product:', np.dot(u, v))

The idea is to avoid element-wise operations and work on whole vectors and matrices instead.
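
The same idea extends from vectors to matrices. As a minimal sketch, the snippet below computes the Euclidean distance from every row of a small matrix to a single reference point, first with explicit loops and then with one vectorized expression. The arrays `points` and `reference` are made-up illustration data, not part of the assignment.

```python
import numpy as np

# Made-up data: three 2-D points and one reference point
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
reference = np.array([0.0, 0.0])

# Loop version: one distance at a time, one coordinate at a time
loop_dist = []
for p in points:
    squared = 0.0
    for j in range(len(reference)):
        squared += (p[j] - reference[j]) ** 2
    loop_dist.append(squared ** 0.5)

# Vectorized version: subtract the reference from every row at once,
# square, sum across each row, and take the square root
vec_dist = np.sqrt(((points - reference) ** 2).sum(axis=1))

print('Loop distances:', loop_dist)
print('Vectorized distances:', vec_dist)
```

The subtraction `points - reference` relies on NumPy broadcasting: the one-dimensional `reference` is applied to each row of `points` without an explicit loop.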

__Task 11:__ Make a copy of your ```kmeans.py``` code as `algorithms/kmeansvec.py` and use vectorization to speed it up. Your improved k-means program should not contain nested for-loops. Verify its correctness by checking its output on your handwritten digits dataset below.

In [None]:
from algorithms.kmeansvec import kmeans_vec
NUM_CLUSTERS = 5
cluster_centers = kmeans_vec(data, NUM_CLUSTERS)

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

#cluster_centers = model.cluster_centers_

fig, ax = plt.subplots(1, NUM_CLUSTERS, figsize=(8, 3))
unflattened_centers = np.array(cluster_centers).reshape(NUM_CLUSTERS, 8, 8)
for axi, center in zip(ax.flat, unflattened_centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

In [None]:
%load_ext line_profiler
%lprun -f kmeans_vec kmeans_vec(data, NUM_CLUSTERS)

### Parallelizing k-means
Modern computers generally contain multiple independent processing units called cores.

When we can split a computation between cores, we can often speed up a program. For merge sort, we can partition the data into halves and send the left half to core A and the right half to core B, so that both halves are sorted at the same time. Once each half has been sorted, the two halves still need to be merged. In the ideal case, splitting the work once across two cores gives close to a twofold speedup of the sorting phase; the final merge and the cost of starting processes and copying data between them reduce the overall gain.

We can use the multiprocessing module in Python to take advantage of processing over two or more cores.

Below is an example of how one can parallelize merge sort using this approach.

In [None]:
from multiprocessing import Pool

from algorithms.mergesort import merge_sort, merge


def parallel_merge_sort(alist):
    mid = len(alist) // 2

    # Set up 2 worker processes; the with-block shuts the pool down afterwards
    with Pool(2) as p:
        # Assign one half of the list to each process
        sorted_left, sorted_right = p.map(merge_sort, [alist[:mid], alist[mid:]])
    return merge(sorted_left, sorted_right)

assert parallel_merge_sort([]) == []
assert parallel_merge_sort([10]) == [10]
assert parallel_merge_sort([2, 1, 3]) == [1, 2, 3]
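
The same split, map, and combine pattern also works on NumPy arrays. The sketch below splits the rows of a small array into chunks, lets each worker process compute the length of every row in its chunk, and concatenates the partial results. The helper names `split_rows` and `chunk_row_norms` are hypothetical and chosen only for this illustration; they are not part of the assignment code.

```python
from multiprocessing import Pool

import numpy as np


def chunk_row_norms(chunk):
    """Work done by one process: the Euclidean length of each row."""
    return np.linalg.norm(chunk, axis=1)


def split_rows(points, n_chunks):
    """Split the rows of `points` into n_chunks nearly equal pieces."""
    return np.array_split(points, n_chunks)


if __name__ == "__main__":
    # Made-up data: four 2-D points
    points = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 5.0], [5.0, 12.0]])
    chunks = split_rows(points, 2)

    # Each worker process handles one chunk
    with Pool(2) as pool:
        partial_results = pool.map(chunk_row_norms, chunks)

    # Combine the per-chunk results back into one array
    norms = np.concatenate(partial_results)
    print(norms)
```

Functions passed to `Pool.map` must be picklable, which in practice means defining them at the top level of a module. Depending on the platform's process start method, functions defined only inside a notebook cell may not be usable by worker processes, so keeping them in a `.py` file is the safer choice.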

__Task 12:__ In this task, make a copy of your ```kmeansvec.py``` code as `algorithms/kmeanspar.py` and use parallelization to speed it up: parallelize the ```assign_step()``` function using a `Pool` object from the multiprocessing module. Test your code by visualizing the output using the code below.

In [None]:
from algorithms.kmeanspar import kmeans_par
NUM_CLUSTERS = 5
cluster_centers = kmeans_par(data, NUM_CLUSTERS)

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, NUM_CLUSTERS, figsize=(8, 3))
unflattened_centers = np.array(cluster_centers).reshape(NUM_CLUSTERS, 8, 8)
for axi, center in zip(ax.flat, unflattened_centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

__Task 13:__ Use timing or profiling to examine and compare the performance of your k-means implementations. See the final part of lab1a for more information on lprun. Another useful resource is:

http://pynash.org/2013/03/06/timing-and-profiling/

Place your code below. Create any additional functions you need to achieve this and include them with your hand-in.
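
As one possible starting point, here is a small wall-clock timing helper built on `time.perf_counter`. The name `best_time` and the `repeats` parameter are hypothetical choices for this sketch, and the built-in `sorted` is only a stand-in workload; you would replace it with calls to your own k-means functions.

```python
import time


def best_time(func, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Stand-in workload: sort a reversed list of 100,000 integers
numbers = list(range(100_000, 0, -1))
elapsed = best_time(sorted, numbers)
print(f"sorted: {elapsed:.4f} s")
```

Taking the minimum over several runs reduces noise from other processes. If your implementation picks its initial centroids randomly, run times can also differ between runs for that reason, so it may be worth reporting more than one measurement.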



## Submission
When you are ready to submit, your local repository should have the following files in it:
```
algorithms/
	tests/
		__init__.py
		test_algorithms.py
	__init__.py
	bubblesort.py
	mergesort.py
	kmeans.py
	kmeansvec.py
	kmeanspar.py
preparation/
	loading.py
clustering_digits.ipynb
timing_sorts.ipynb
```
Commit your changes and then push your repository to your private GitHub repository.


## References
Parallelizable sorts: http://www.dcc.fc.up.pt/~fds/aulas/PPD/1112/sorting.pdf

