## Clustering

Clustering looks at a set of data points and automatically finds groups of points that are similar to each other.

Applications
- similar news articles
- market segmentation
- DNA analysis
- astronomical data analysis

***k-means clustering algorithm***
- Step 1: randomly initialize K cluster centroids
- Step 2: assign each training example to its closest cluster centroid (closest by L2 norm, i.e. Euclidean distance)
- Step 3: move each cluster centroid to the mean of the **points assigned to that cluster**. If no points are assigned to a cluster, either eliminate that cluster or randomly reinitialize its centroid (eliminating is more common)

- Repeat steps 2 and 3 until the assignments and centroids stop changing


**k-means optimization objective**
- c^(i) = index of the cluster (1, 2, ..., K) to which example x^(i) is currently assigned
- μ_k = cluster centroid k
- μ_c^(i) = cluster centroid of the cluster to which example x^(i) has been assigned

**Distortion Cost Function**: J = (1/m) ∑_{i=1}^{m} ||x^(i) - μ_c^(i)||^2

J is the average of the squared distances between every training example and the location of the cluster centroid to which that example has been assigned.
- on every step the cost function should decrease or stay the same
    - if it stops going down, K-means has converged
    - if it ever goes up, there is a bug in the code
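
As a small illustration of the distortion formula above, the sketch below computes J with NumPy for a hand-checkable dataset. The function name `compute_distortion` and the toy data are assumptions made for this example, and indices are 0-based as in Python.

```python
import numpy as np

def compute_distortion(X, idx, centroids):
    """Mean squared distance between each example and its assigned centroid."""
    # centroids[idx] has shape (m, n): row i is the centroid assigned to X[i]
    diffs = X - centroids[idx]
    return np.mean(np.sum(diffs ** 2, axis=1))

# Tiny hand-checkable example: two clusters on a line
X = np.array([[0.0], [2.0], [10.0], [12.0]])
centroids = np.array([[1.0], [11.0]])
idx = np.array([0, 0, 1, 1])

J = compute_distortion(X, idx, centroids)
print(J)  # every point is exactly 1 away from its centroid, so J = 1.0
```

Tracking this value after every iteration is one way to check the "never goes up" property described above.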

**Initializing K-Means**
- Choose K < m
- Repeat ~100 times:
    - randomly pick K training examples
    - set μ_1, ..., μ_K equal to these K examples
    - run K-means to get all c^(i) and μ_k
    - compute the cost function (distortion) J
- pick the set of clusters that gave the lowest cost J
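
The restart procedure above could be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (`assign`, `run_kmeans_once`, `best_of_restarts`), the synthetic data, and the choice to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import numpy as np

def assign(X, centroids):
    """Index of the closest centroid (squared L2 distance) for every example."""
    dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(dists, axis=1)

def run_kmeans_once(X, K, rng, max_iters=100):
    """One K-means run from a random initialization; returns (centroids, idx, J)."""
    # Initialize the centroids to K distinct training examples
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        idx = assign(X, centroids)
        new_centroids = centroids.copy()
        for k in range(K):
            if np.any(idx == k):  # an empty cluster keeps its old centroid
                new_centroids[k] = X[idx == k].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    idx = assign(X, centroids)
    J = np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))
    return centroids, idx, J

def best_of_restarts(X, K, n_restarts=100, seed=0):
    """Run K-means n_restarts times and keep the run with the lowest distortion."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        result = run_kmeans_once(X, K, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best

# Synthetic data: three blobs in 2-D (made up for this example)
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(loc, 0.3, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])

centroids, idx, J = best_of_restarts(X, K=3, n_restarts=20)
print(f"lowest distortion over 20 restarts: {J:.4f}")
```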

**Choosing the Number of Clusters**
- Elbow Method: run K-means with a variety of values of K, plot the distortion (cost) as a function of the number of clusters, and pick the "elbow": the value of K up to which the cost decreases sharply and after which it decreases only slowly

- In practice, you often run K-means for a later (downstream) purpose -> evaluate how well each choice of K does for that downstream purpose!
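
To make the elbow method concrete, the sketch below computes the distortion for several values of K on synthetic data with three obvious groups. The helper `kmeans_distortion`, the data, and the number of restarts are assumptions for illustration only. Plotting `distortions` against K (for example with matplotlib) would give the elbow curve.

```python
import numpy as np

def kmeans_distortion(X, K, n_restarts=10, max_iters=50, seed=0):
    """Lowest distortion J found for a given K over a few random restarts."""
    rng = np.random.default_rng(seed)
    best_J = np.inf
    for _ in range(n_restarts):
        # Start from K distinct training examples
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        for _ in range(max_iters):
            dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
            idx = np.argmin(dists, axis=1)
            for k in range(K):
                if np.any(idx == k):  # leave an empty cluster's centroid unchanged
                    centroids[k] = X[idx == k].mean(axis=0)
        # Distortion of the final centroids with their closest-centroid assignment
        dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        best_J = min(best_J, np.mean(np.min(dists, axis=1)))
    return best_J

# Synthetic data: three blobs in 2-D (made up for this example)
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc, 0.3, size=(20, 2)) for loc in (0.0, 5.0, 10.0)])

distortions = [kmeans_distortion(X, K) for K in range(1, 7)]
for K, J in zip(range(1, 7), distortions):
    print(f"K={K}: J={J:.3f}")
```

On data like this, the drop in J is expected to be large up to K=3 and small afterwards, which is the "elbow" the method looks for. Real datasets often show no such clear bend.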

## K-Means Example

The K-means algorithm is a method to automatically cluster similar
data points together. 

* Concretely, you are given a training set $\{x^{(1)}, ..., x^{(m)}\}$, and you want
to group the data into a few cohesive “clusters”. 


* K-means is an iterative procedure that
     * Starts by guessing the initial centroids, and then 
     * Refines this guess by 
         * Repeatedly assigning examples to their closest centroids, and then 
         * Recomputing the centroids based on the assignments.
         

* In pseudocode, the K-means algorithm is as follows:

    ``` python
    # Initialize centroids
    # K is the number of clusters
    centroids = kMeans_init_centroids(X, K)
    
    for iter in range(iterations):
        # Cluster assignment step: 
        # Assign each data point to the closest centroid. 
        # idx[i] corresponds to the index of the centroid 
        # assigned to example i
        idx = find_closest_centroids(X, centroids)

        # Move centroid step: 
        # Compute means based on centroid assignments
        centroids = compute_centroids(X, idx, K)
    ```


* The inner loop of the algorithm repeatedly carries out two steps: 
    1. Assigning each training example $x^{(i)}$ to its closest centroid, and
    2. Recomputing the mean of each centroid using the points assigned to it. 
    
    
* The $K$-means algorithm will always converge to some final set of means for the centroids. 

* However, the converged solution may not always be ideal and depends on the initial setting of the centroids.
    * Therefore, in practice the K-means algorithm is usually run a few times with different random initializations. 
    * One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).
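
The pseudocode above calls `kMeans_init_centroids`, which these notes do not define. A minimal sketch, assuming it simply picks K distinct training examples at random (as described in the initialization notes), might look like this. The optional `rng` parameter is an addition for reproducibility and is not part of the original signature.

```python
import numpy as np

def kMeans_init_centroids(X, K, rng=None):
    """Pick K distinct training examples as the initial centroids (assumed behavior)."""
    rng = np.random.default_rng() if rng is None else rng
    # Shuffle the example indices and take the first K, so no example is chosen twice
    randidx = rng.permutation(X.shape[0])
    return X[randidx[:K]]

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [5.0, 5.0]])
initial_centroids = kMeans_init_centroids(X, K=3, rng=np.random.default_rng(0))
print(initial_centroids.shape)  # (3, 2)
```

Calling it with a different generator (or none) gives a different starting point, which is what makes repeated runs land on different solutions.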

You will implement the two phases of the K-means algorithm separately
in the next sections. 
* You will start by completing `find_closest_centroids` and then proceed to complete `compute_centroids`.

```python
import numpy as np
import matplotlib.pyplot as plt
```

* **find_closest_centroids** takes the data matrix `X` and the locations of all
centroids inside `centroids` 
* It should output a one-dimensional array `idx` (with one element per row of `X`) that holds, for every training example, the index of the closest centroid (a value in $\{0,...,K-1\}$, where $K$ is the total number of centroids). *(Note: The index range 0 to K-1 differs slightly from what is shown in the lectures (i.e. 1 to K) because Python list indices start at 0 instead of 1.)*
* Specifically, for every example $x^{(i)}$ we set
$$c^{(i)} := j \quad \mathrm{that \; minimizes} \quad ||x^{(i)} - \mu_j||^2,$$
where 
 * $c^{(i)}$ is the index of the centroid that is closest to $x^{(i)}$ (corresponds to `idx[i]` in the starter code),
 * $\mu_j$ is the position (value) of the $j$’th centroid (stored in `centroids` in the starter code), and
 * $||x^{(i)} - \mu_j||$ is the L2-norm.

```python
def find_closest_centroids(X, centroids):
    # number of centroids
    K = centroids.shape[0]

    # idx[i] will hold the index of the centroid closest to X[i]
    idx = np.zeros(X.shape[0], dtype=int)

    for i in range(X.shape[0]):
        # squared distances between X[i] and each centroids[j]
        costs = []

        # loop through all centroids
        for j in range(K):
            norm_ij = np.linalg.norm(X[i] - centroids[j])
            cost_ij = norm_ij ** 2

            costs.append(cost_ij)

        # assign closest centroid to example i
        idx[i] = np.argmin(costs)

    return idx
```
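
The same assignment rule can also be written without explicit Python loops. The following is a sketch using NumPy broadcasting. The name `find_closest_centroids_vectorized` and the toy data are made up for this example.

```python
import numpy as np

def find_closest_centroids_vectorized(X, centroids):
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n): every example minus every centroid
    diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # squared L2 distance from each example to each centroid, shape (m, K)
    sq_dists = np.sum(diffs ** 2, axis=2)
    # for each example, the index of the centroid with the smallest distance
    return np.argmin(sq_dists, axis=1)

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])
print(find_closest_centroids_vectorized(X, centroids))  # [0 0 1 1]
```

The trade-off is memory: the intermediate array has m × K × n entries, so the looped version may be preferable for very large datasets.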

**compute_centroids** - recompute the value for each centroid

* Specifically, for every centroid $\mu_k$ we set
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}$$ 

    where 
    * $C_k$ is the set of examples that are assigned to centroid $k$
    * $|C_k|$ is the number of examples in the set $C_k$


* Concretely, if two examples, say $x^{(3)}$ and $x^{(5)}$, are assigned to centroid $k=2$,
then you should update $\mu_2 = \frac{1}{2}(x^{(3)}+x^{(5)})$.

```python
def compute_centroids(X, idx, K):
    m, n = X.shape
    centroids = np.zeros((K, n))

    for k in range(K):
        # points assigned to centroid k
        assigned_points = X[idx == k]
        # mean over the assigned points, one value per feature
        # (if no points are assigned, this evaluates to NaN with a NumPy warning)
        centroids[k] = np.mean(assigned_points, axis=0)

    return centroids
```
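
The notes earlier mention that a cluster can end up with no assigned points. One possible way to handle that case is sketched below: an empty cluster simply keeps its previous centroid. This is an assumption for illustration (the notes suggest eliminating or re-initializing such a centroid instead), and both the name `compute_centroids_safe` and the extra `old_centroids` parameter are hypothetical.

```python
import numpy as np

def compute_centroids_safe(X, idx, K, old_centroids):
    """Like compute_centroids, but an empty cluster keeps its previous centroid."""
    centroids = old_centroids.astype(float).copy()
    for k in range(K):
        assigned_points = X[idx == k]
        if assigned_points.shape[0] > 0:
            centroids[k] = assigned_points.mean(axis=0)
        # else: no points assigned, so centroids[k] stays at old_centroids[k]
    return centroids

X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]])
idx = np.array([0, 0, 2])  # no example is assigned to centroid 1
old = np.array([[0.0, 0.0], [50.0, 50.0], [9.0, 9.0]])

new = compute_centroids_safe(X, idx, K=3, old_centroids=old)
print(new)  # rows: [2, 3], [50, 50] (unchanged), [10, 10]
```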