# Worksheet 05

Name:  Matias Ou    
UID: U34955662

### Topics

- Cost Functions
- Kmeans

### Cost Function

Solving data science problems often starts by defining a metric for evaluating candidate solutions, assuming you were able to find some. This metric is called a cost function. You then work backwards to find a process or algorithm that produces solutions optimizing that cost function.

For example, suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you evaluate whether this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the above solution, if A and B are closer to each other than A and C are, and closer than B and C are, then this is a good solution. The smaller the distance between the two points in the same cluster (here A and B), the better the solution. So we can define our cost function to be that distance (between A and B here)!

The algorithm / process would involve clustering together the two closest points and putting the third in its own cluster. This process optimizes for that cost function because no other pair of points could have a lower distance (although it could equal it).
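
The three-point procedure can be sketched in a few lines of Python. This is only an illustration: the coordinates and the helper name `cluster_three` are made up here and are not part of the worksheet.

```python
from itertools import combinations
import math

# Hypothetical coordinates for the three points (not from the worksheet).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}

def cluster_three(points):
    """Return (pair, singleton, cost) for the cheapest two-cluster split."""
    # The cost of a split is the distance between the two points sharing a cluster,
    # so the best split pairs up the two closest points.
    pair = min(
        combinations(sorted(points), 2),
        key=lambda p: math.dist(points[p[0]], points[p[1]]),
    )
    cost = math.dist(points[pair[0]], points[pair[1]])
    singleton = set(points) - set(pair)
    return set(pair), singleton, cost

pair, single, cost = cluster_three(points)
print(pair, single, cost)
```

With these assumed coordinates, A and B end up together and the cost is the distance between them.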

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`

- First, assign each point to its closest centroid: `[0, 0.5], [1.5, 2, 6, 6.5, 7]`
- Recompute the center of each cluster: `[0.25, 4.6]`
- Assign each point in the data set to its closest center: `[0, 0.5, 1.5, 2], [6, 6.5, 7]`
- Recompute the center of each cluster again: `[1.0, 6.5]`
- Assign each point in the data set to its closest center: `[0, 0.5, 1.5, 2], [6, 6.5, 7]`
- The assignments did not change, so the algorithm has converged.
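
The walk-through above can be checked with a short script. This is a minimal sketch, assuming NumPy is available; the function name `lloyd_1d` and the `history` list are introduced here for illustration only.

```python
import numpy as np

def lloyd_1d(data, centers, max_iter=100):
    """Run Lloyd's algorithm on 1-D data, recording the centers after each update."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    history = []
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        new_labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its cluster.
        # (Assumes no cluster goes empty, which holds for this dataset.)
        centers = np.array([data[labels == j].mean() for j in range(len(centers))])
        history.append(centers.copy())
    return labels, centers, history

data = [0, 0.5, 1.5, 2, 6, 6.5, 7]
labels, centers, history = lloyd_1d(data, [0, 2])
for step, c in enumerate(history, start=1):
    print(f"centers after update {step}: {c}")
print("final labels:", labels)
```

The loop stops as soon as an assignment step reproduces the previous labels, which is the same stopping rule used in the hand computation.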

b) Describe in plain English what the cost function for K means is.

The K means cost function measures how spread out the data points are within their respective clusters: it is the sum, over all points, of the squared distance from each point to the centroid of its assigned cluster. A lower cost means the points are relatively tightly grouped around their centroids, which gives a better clustering.
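
As a rough illustration, the cost can be computed directly for the 1-dimensional dataset from part (a). The helper `kmeans_cost` is a name introduced here, and the sketch is written for 1-D data only.

```python
import numpy as np

def kmeans_cost(data, labels, centers):
    """Sum of squared distances from each 1-D point to its assigned center."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(np.sum((data - centers[np.asarray(labels)]) ** 2))

data = [0, 0.5, 1.5, 2, 6, 6.5, 7]
# Final clustering from part (a): {0, 0.5, 1.5, 2} and {6, 6.5, 7}.
good = kmeans_cost(data, [0, 0, 0, 0, 1, 1, 1], [1.0, 6.5])
# First assignment from part (a), measured against the initial centroids [0, 2].
bad = kmeans_cost(data, [0, 0, 1, 1, 1, 1, 1], [0.0, 2.0])
print(good, bad)  # 3.0 61.75
```

The converged clustering has a much lower cost than the starting configuration, which matches the intuition that its points sit closer to their centroids.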

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

Lloyd's algorithm only finds a local minimum of the cost function, and which one it reaches depends on the starting centroids. Because the initial centroids are chosen randomly on every run, different runs can start from different initializations and converge to very different solutions for the same K.
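
A small example makes the dependence on initialization concrete. The dataset below (the four corners of a 4-by-1 rectangle) and the function `lloyd` are assumptions made for this sketch; the initial centers are passed in explicitly instead of being sampled, so the two runs are reproducible.

```python
import numpy as np

def lloyd(data, centers, max_iter=100):
    """Lloyd's algorithm from explicit initial centers; returns (labels, centers, cost)."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Pairwise distances between points and centers, shape (n_points, k).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(len(centers)):
            members = data[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    cost = float(np.sum((data - centers[labels]) ** 2))
    return labels, centers, cost

# Hypothetical dataset: corners of a 4-by-1 rectangle.
corners = np.array([[0, 0], [0, 1], [4, 0], [4, 1]])

labels_a, _, cost_a = lloyd(corners, corners[[0, 2]])  # start at (0,0) and (4,0)
labels_b, _, cost_b = lloyd(corners, corners[[0, 1]])  # start at (0,0) and (0,1)
print(labels_a, cost_a)
print(labels_b, cost_b)
```

The first start groups the points into left and right pairs, while the second converges to top and bottom pairs with a much higher cost. Both are stable under Lloyd's updates, so neither run moves any further.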

d) Does Lloyd's Algorithm always converge? Why / why not?

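Yes, it always converges. Each iteration has two steps, and neither can increase the cost: reassigning every point to its closest centroid can only keep or shorten its distance to its centroid, and recomputing each centroid as the mean of its assigned points minimizes the sum of squared distances within that cluster. Since the cost never increases and there are only finitely many ways to assign the points to K clusters, the assignments must eventually stop changing.

e) Follow along in class with the implementation of K means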
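
One way to see this numerically is to record the cost after every assignment step and every update step on the dataset from part (a). The variable names below are chosen for this sketch only.

```python
import numpy as np

data = np.array([0, 0.5, 1.5, 2, 6, 6.5, 7])
centers = np.array([0.0, 2.0])  # initial centroids from part (a)
costs = []
labels = None

while True:
    # Assignment step: nearest center for every point.
    new_labels = np.abs(data[:, None] - centers[None, :]).argmin(axis=1)
    if labels is not None and np.array_equal(new_labels, labels):
        break  # no point changed cluster, so we are done
    labels = new_labels
    costs.append(float(np.sum((data - centers[labels]) ** 2)))  # cost after assignment
    # Update step: move each center to the mean of its cluster.
    centers = np.array([data[labels == j].mean() for j in range(len(centers))])
    costs.append(float(np.sum((data - centers[labels]) ** 2)))  # cost after update

print(costs)
```

On this dataset the recorded costs only go down, starting at 61.75 and ending at 3.0, and the loop exits once an assignment step leaves every label unchanged.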

In [4]:
%pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.3.1-cp39-cp39-macosx_10_9_x86_64.whl (10.2 MB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Collecting scipy>=1.5.0
  Downloading scipy-1.11.3-cp39-cp39-macosx_10_9_x86_64.whl (37.3 MB)
Collecting joblib>=1.1.1
  Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Installing collected packages: threadpoolctl, scipy, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.3.1 scipy-1.11.3 threadpoolctl-3.2.0
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [5]:
import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt
import sklearn.datasets as datasets

centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)

class KMeans():

    def __init__(self, data, k):
        self.data = data
        self.k = k
        self.assignment = [-1 for _ in range(len(data))]
        self.snaps = []
    
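    def snap(self, centers):
        # Render the current assignment and centers, then keep the frame for the GIF.
        TEMPFILE = "temp.png"

        fig, ax = plt.subplots()
        # Plot self.data rather than the global X so the class works on any dataset.
        ax.scatter(self.data[:, 0], self.data[:, 1], c=self.assignment)
        ax.scatter(centers[:, 0], centers[:, 1], c='r')
        fig.savefig(TEMPFILE)
        plt.close(fig)
        self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))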


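    def lloyds(self):
        # Initialize the centers as k distinct points sampled from the data.
        centers = self.data[np.random.choice(len(self.data), self.k, replace=False)]
        self.snap(centers)
        while True:
            # Assignment step: label each point with the index of its nearest center.
            new_assignment = [-1 for _ in range(len(self.data))]
            for i, x in enumerate(self.data):
                new_assignment[i] = np.argmin([np.linalg.norm(x - c) for c in centers])
            # Converged once no point changes cluster.
            if new_assignment == self.assignment:
                break
            self.assignment = new_assignment
            # Update step: move each center to the mean of its assigned points.
            for i in range(self.k):
                members = self.data[np.array(self.assignment) == i]
                if len(members) > 0:  # keep the old center if a cluster is empty
                    centers[i] = np.mean(members, axis=0)
            self.snap(centers)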
            

kmeans = KMeans(X, 6)
kmeans.lloyds()
images = kmeans.snaps

images[0].save(
    'kmeans.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)




# # Class implementation
# import numpy as np
# from PIL import Image as im
# import matplotlib.pyplot as plt
# import sklearn.datasets as datasets

# centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
# X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)

# class KMeans():

#     def __init__(self, data, k):
#         self.data = data
#         self.k = k
#         self.assignment = [-1 for _ in range(len(data))]
#         self.snaps = []
    
#     def distance(self, x, y):
#         return np.linalg.norm(x - y)
    
#     def snap(self, centers):
#         TEMPFILE = "temp.png"

#         fig, ax = plt.subplots()
#         ax.scatter(X[:, 0], X[:, 1], c=self.assignment)
#         ax.scatter(centers[:,0], centers[:, 1], c='r')
#         fig.savefig(TEMPFILE)
#         plt.close()
#         self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))

#     def initialize(self):
#         return self.data[np.random.choice(len(self.data), self.k, replace=False)]

#     def assign(self, centers):
#         for i in range(len(self.data)):
#             min = self.distance(self.data[i], centers[0])
#             self.assignment[i] = 0
#             # self.assignment = 0
#             for j in range(len(centers)):
#                 dist = self.distance(self.data[i], centers[j])
#                 if dist < min:
#                     min = dist
#                     self.assignment[i] = j

#         # for i, x in enumerate(self.data):
#         #     self.assignment[i] = np.argmin([np.linalg.norm(x - c) for c in centers])

#     def is_diff_clusters(self, centers, new_centers):
#         for i in range(len(centers)):
#             if self.distance(centers[i], new_centers[i]) != 0:
#                 return True
#         return False

#     def get_centers(self):
#         new_centers = []
#         for i in set(self.assignment): # for every different assignment
#             cluster = self.data[np.array(self.assignment) == i]
#             new_centers.append(np.mean(cluster, axis=0))

#         return np.array(new_centers)
    
#     def lloyds(self):
#         centers = self.initialize()
#         self.assign(centers)
#         self.snap(centers)
#         new_centers = self.get_centers()

#         while self.is_diff_clusters(centers, new_centers):
#             self.assign(new_centers)
#             centers = new_centers
#             self.snap(new_centers)
#             new_centers = self.get_centers()