Centroid-based clustering algorithms such as k-means are mostly straightforward: assign each point to its nearest centroid, update the centroids, and repeat until convergence.

Much of the difficulty in using clustering involves picking how many centroids to create and how to initialize those centroids. 
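
The assign/update loop described above can be sketched roughly as follows. This is a minimal illustration, not the implementation this notebook builds: the helper name `k_means_sketch`, the `n_iter` and `seed` parameters, and the choice to initialize centroids from randomly selected data points are all assumptions made for the example.

```python
import numpy as np

def k_means_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: a plain k-means loop under the assumptions stated above."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Assumed initialization: use k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update step: each centroid moves to the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # labels come from the most recent assignment step
    return centroids, labels
```

Different seeds can give different results on harder data, which is one reason the choice of initialization matters.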

Determine which cluster centroid is closest to each point.

RETURN a 2-d numpy array where each row is a one-hot vector indicating which cluster the point is closest to, and thus assigned to:

e.g. [0,1,0,...,0] indicates the point is assigned to the second cluster, and
[0,0,...,1] indicates the point is assigned to the last cluster
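
One possible way to build rows of this form from a list of nearest-cluster indices is to index into an identity matrix. The index values and the names `nearest` and `n_clusters` below are made up for illustration only.

```python
import numpy as np

# Hypothetical nearest-cluster indices for four points, with three clusters
nearest = np.array([1, 2, 0, 1])
n_clusters = 3

# Row i of the identity matrix is the one-hot vector for cluster i,
# so indexing with `nearest` yields one one-hot row per point
one_hot = np.eye(n_clusters, dtype=int)[nearest]
print(one_hot)
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
```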

In [1]:
from scipy.stats import multivariate_normal
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
### Assign points to clusters according to the k-means algorithm
### Follow directions above

def assign_clusters_k_means(points, clusters):
    """
    Determine the nearest cluster to each point, returning an array indicating the closest cluster
    
    Positional Arguments:
        points: a 2-d numpy array where each row is a different point, and each
            column indicates the location of that point in that dimension
        clusters: a 2-d numpy array where each row is a different centroid cluster;
            each column indicates the location of that centroid in that dimension
    
    Example:
        points = np.array([[0,1], [2,2], [5,4], [3,6], [4,2]])
        clusters = np.array([[0,1],[5,4]])
        cluster_weights = assign_clusters_k_means(points, clusters)
        
        print(cluster_weights) #--> np.array([[1, 0],
                                              [1, 0],
                                              [0, 1],
                                              [0, 1],
                                              [0, 1]])
    """
    # NB: "cluster_weights" is used as a common term between functions
    # the name makes more sense in soft-clustering contexts
    
    #Find the Euclidean distance between each point and each cluster centroid
    dist_to_clust = np.concatenate(
        [np.apply_along_axis(np.linalg.norm, 1, points - c).reshape((-1,1)) for c in clusters],
        axis = 1)
    
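    #Index of the nearest cluster for each point; argmin keeps only the first
    #minimum, so a point equidistant from two clusters is assigned to exactly one
    nearest = np.argmin(dist_to_clust, axis=1)
    
    #One-hot encode: 1 in the column of the nearest cluster, 0 elsewhere
    cluster_assignments = np.zeros(dist_to_clust.shape, dtype=int)
    cluster_assignments[np.arange(dist_to_clust.shape[0]), nearest] = 1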
    
    return cluster_assignments
    

points = np.array([[0,1], [2,2], [5,4], [3,6], [4,2]])
clusters = np.array([[0,1],[5,4]])
cluster_weights = assign_clusters_k_means(points, clusters)
print(cluster_weights) #--> np.array([[1, 0],[1, 0],[0, 1],[0, 1],[0, 1]])

[[1 0]
 [1 0]
 [0 1]
 [0 1]
 [0 1]]
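
Given an assignment matrix like the one printed above, the centroid update step mentioned at the start can be sketched as a weighted mean. The helper name `update_centroids_sketch` is hypothetical, and the sketch assumes every cluster has at least one assigned point (otherwise it would divide by zero).

```python
import numpy as np

def update_centroids_sketch(points, cluster_weights):
    """Hypothetical helper: move each centroid to the mean of its assigned points."""
    # Number of points assigned to each cluster (assumed to be nonzero)
    counts = cluster_weights.sum(axis=0)
    # Sum the points in each cluster, then divide by the cluster size
    return (cluster_weights.T @ points) / counts[:, None]

points = np.array([[0, 1], [2, 2], [5, 4], [3, 6], [4, 2]])
cluster_weights = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

new_centroids = update_centroids_sketch(points, cluster_weights)
print(new_centroids)  # centroids at (1, 1.5) and (4, 4)
```

Writing the update as a matrix product means the same expression would also work with fractional weights, which is why the name `cluster_weights` fits soft-clustering contexts.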
