# $k$-Means Clustering

Given the following data set with 7 objects (in $\mathbb{R}^2$):

In [6]:
import pandas as pd
from plotly import graph_objects as go
data = pd.DataFrame([[1.0,1.0],[11.0,6.0],[12.0,5.0],[5.0,6.0],[3.0,7.0],[3.0,5.0],[4.0,4.0]],columns=['x1','x2'])
fig = go.Figure(
    go.Scatter(x=data['x1'],y=data['x2'],mode='markers',name='All points'),
    layout=go.Layout(yaxis=dict(scaleanchor="x", scaleratio=1))
)
fig.add_trace(go.Scatter(x=data['x1'][:3],y=data['x2'][:3],mode='markers',name='Centers'))
fig.show()

The data set consists of all points (both red and blue circles) in the above plot. All coordinates are integers.

The three red circles are the data points chosen as *initial cluster centers*.

a) Perform $k$-means clustering using Lloyd's approach with $k=3$ on this data set.

At each iteration, give the new cluster assignment, then the new cluster centers. 

#### TODO

In [7]:
import numpy as np

In [11]:
# Working copy of the coordinates; distance and assignment columns are added later
df = data[['x1', 'x2']].copy()

centers = data[:3].copy()

colmap = {0: 'rgb(255, 0, 0)', 1: 'rgb(0, 255, 0)', 2: 'rgb(0, 0, 255)'}
colmap_center = ['rgba(255, 0, 0, 0.4)', 'rgba(0, 255, 0, 0.4)', 'rgba(0, 0, 255,0.4)']

def assignment(df,points,centers):
    for i in range(len(centers)):
        df['distance_from_{}'.format(str(i))]=(np.sqrt((df['x1']-centers['x1'][i])**2+(df['x2']-centers['x2'][i])**2))
    center_dist_cols = ['distance_from_{}'.format(i) for i in range(len(centers))]
    df['closest'] = df.loc[:,center_dist_cols].idxmin(axis=1)
    # Parse the center index after the last underscore (lstrip removes a character set, not a prefix)
    df['closest'] = df['closest'].map(lambda x: int(x.rsplit('_', 1)[1]))
    df['color'] = df['closest'].map(lambda x: colmap[x])
    return df

def update(df,centers):
    for i in range(len(centers)):
        if len(df[df['closest']==i])>0:
            # Use .loc to avoid chained assignment on a DataFrame
            centers.loc[i, 'x1'] = df.loc[df['closest'] == i, 'x1'].mean()
            centers.loc[i, 'x2'] = df.loc[df['closest'] == i, 'x2'].mean()
    return centers

x=6
for i in range(x):
    old_centers = centers.copy()
    df = assignment(df,data,centers)
    update(df,centers)
    fig = go.Figure(
        go.Scatter(x=df['x1'],y=df['x2'],mode='markers',name='All points',marker_color=df['color']),
        layout=go.Layout(yaxis=dict(scaleanchor="x", scaleratio=1))
    )
    fig.add_trace(go.Scatter(x=centers['x1'],y=centers['x2'],mode='markers',name='Centers',marker_color=colmap_center))
    fig.show()
    if old_centers.equals(centers):
        print("No new centers after",i,"iterations.")
        break

No new centers after 2 iterations.


In [12]:
centers

|   | x1 | x2 |
|---|---|---|
| 0 | 3.2 | 4.6 |
| 1 | 8.0 | 6.0 |
| 2 | 11.5 | 5.5 |


In [13]:
df

|   | x1 | x2 | distance_from_0 | distance_from_1 | distance_from_2 | closest | color |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 4.219005 | 8.602325 | 11.42366 | 0 | rgb(255, 0, 0) |
| 1 | 11.0 | 6.0 | 7.924645 | 3.0 | 0.707107 | 2 | rgb(0, 0, 255) |
| 2 | 12.0 | 5.0 | 8.809086 | 4.123106 | 0.707107 | 2 | rgb(0, 0, 255) |
| 3 | 5.0 | 6.0 | 2.280351 | 3.0 | 6.519202 | 0 | rgb(255, 0, 0) |
| 4 | 3.0 | 7.0 | 2.408319 | 5.09902 | 8.631338 | 0 | rgb(255, 0, 0) |
| 5 | 3.0 | 5.0 | 0.447214 | 5.09902 | 8.514693 | 0 | rgb(255, 0, 0) |
| 6 | 4.0 | 4.0 | 1.0 | 4.472136 | 7.648529 | 0 | rgb(255, 0, 0) |


In [53]:
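class kMean:
    def __init__(self, points, cluster, maxItter):
        self.points = points
        self.cluster = cluster
        self.maxItter = maxItter
        self.nCluster = len(cluster)

    def dist(self,x,y):
        # Euclidean distance between two points given as sequences
        d=0
        for i in range(len(x)):
            d+=(x[i]-y[i])**2
        return np.sqrt(d)

    def itter(self):
        for i in range(self.maxItter):
            # Assignment step: index of the nearest center for every point
            nearest = []
            for point in self.points:
                d = [self.dist(point,self.cluster[k]) for k in range(self.nCluster)]
                nearest.append(int(np.argmin(d)))
            # Update step: every non-empty cluster moves to the mean of its points
            # (previously the center was overwritten by the last assigned point)
            for k in range(self.nCluster):
                members = [p for p, c in zip(self.points, nearest) if c == k]
                if members:
                    self.cluster[k] = [sum(col)/len(members) for col in zip(*members)]
        return self.cluster

dat = [[1,1],[11,6],[12,5],[5,6],[3,7],[3,5],[4,4]]
dat2 = [[2,6],[4,3],[6,7],[5,6],[3,7],[3,5],[4,4]]
km = kMean(dat[3:],dat[:3],1)

km.itter()

[[3.3333333333333335, 5.333333333333333], [5.0, 6.0], [12, 5]]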

b) What problem does the algorithm encounter in performing $k$-means on this data set?

#### TODO
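
One way to look for the problem is to count the cluster sizes after every assignment step. The following is a minimal sketch in plain NumPy, assuming Euclidean distance and that a cluster without points keeps its previous center; the helper name `lloyd_cluster_sizes` is hypothetical and not part of the exercise.

```python
import numpy as np

# The seven points from part a); the first three are the initial centers.
X = np.array([[1, 1], [11, 6], [12, 5], [5, 6], [3, 7], [3, 5], [4, 4]], dtype=float)

def lloyd_cluster_sizes(X, centers, max_iter=10):
    """Hypothetical helper: run Lloyd's iterations and record the cluster sizes."""
    centers = centers.copy()
    history = []
    for _ in range(max_iter):
        # Assignment step: Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        sizes = np.bincount(labels, minlength=len(centers))
        history.append(sizes.tolist())
        # Update step: move each non-empty cluster to the mean of its points.
        new_centers = centers.copy()
        for j in range(len(centers)):
            if sizes[j] > 0:
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return history, centers

history, final_centers = lloyd_cluster_sizes(X, X[:3])
print(history)        # cluster sizes per iteration
print(final_centers)  # a cluster of size 0 keeps its previous center
```

A size of 0 in `history` indicates that one cluster has lost all of its points, so its center can no longer be updated by a mean.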

c) Propose at least two strategies (other than restarting) to handle this situation. Justify your answers.

#### TODO
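
Two options that are often suggested when a cluster loses all of its points are (1) re-seeding the empty center at the data point that lies farthest from its own center, and (2) dropping the empty cluster and continuing with fewer centers. The sketch below illustrates only option (1); the helper `reseed_empty_clusters` and its tie handling are assumptions made for illustration, not a prescribed solution.

```python
import numpy as np

def reseed_empty_clusters(X, centers, labels):
    """Hypothetical helper: give every empty cluster the point that is
    currently farthest from its own center."""
    centers = centers.copy()
    labels = labels.copy()
    for j in range(len(centers)):
        if np.any(labels == j):
            continue  # cluster j still has members
        # Distance of each point to the center it is assigned to.
        own_dist = np.linalg.norm(X - centers[labels], axis=1)
        # Do not take the last remaining member away from another cluster.
        counts = np.bincount(labels, minlength=len(centers))
        own_dist[counts[labels] < 2] = -np.inf
        far = int(own_dist.argmax())
        centers[j] = X[far]
        labels[far] = j
    return centers, labels

X = np.array([[1, 1], [11, 6], [12, 5], [5, 6], [3, 7], [3, 5], [4, 4]], dtype=float)
centers = np.array([[3.2, 4.6], [8.0, 6.0], [11.5, 5.5]])  # final state from part a)
labels = np.array([0, 2, 2, 0, 0, 0, 0])                   # cluster 1 is empty

new_centers, new_labels = reseed_empty_clusters(X, centers, labels)
print(new_centers[1], new_labels)
```

After re-seeding, the usual update step would recompute the means of the clusters that gained or lost a point.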

# Hamerly's $k$-means

Given the following data set with 10 objects (in $\mathbb{R}^2$):

In [None]:
import pandas as pd
from plotly import graph_objects as go
data = pd.DataFrame([[1,2],[4,1],[7,2],[2,4],[4,4],[3,11],[8,7],[7,11],[12,8],[11,10]],columns=['x1','x2'])
fig = go.Figure(
    go.Scatter(x=data['x1'],y=data['x2'],mode='markers',name='All points'),
    layout=go.Layout(yaxis=dict(scaleanchor="x", scaleratio=1))
)
fig.add_trace(go.Scatter(x=data['x1'][[4,8]],y=data['x2'][[4,8]],mode='markers',name='Centers'))
fig.show()

The data set consists of all points (both red and blue circles) in the above plot. All coordinates are integers.

The two red circles are the data points chosen as *initial cluster centers*.

Perform Hamerly's $k$-means algorithm with $k=2$ on this data set.

At each iteration, give the new cluster assignment, then the new cluster centers.

#### TODO

1. Compute the distances to all cluster centers as usual and assign each point to its nearest center.
2. Initialize the bounds of each point.
    1. Upper bound $o$: distance to the point's own cluster center
    2. Lower bound $u$: distance to the second-nearest cluster center $\rightarrow\;o \le u$
3. Compute the new centers as usual via the arithmetic mean.
4. Compute the distance between each old center and its new position (how far each center has moved).
    1. The upper bound of each point increases by the distance its own center has moved.
    2. The lower bound of each point decreases by the largest distance any other center has moved (for $k=2$, this is the distance the second-nearest center has moved).
    
$\Rightarrow$ The upper bound is thus the maximum possible distance to the point's own center, and the lower bound is the minimum possible distance to the second-nearest center.
5. If $o>u$ now holds for a point, its upper bound must be recomputed exactly. If $o>u$ still holds afterwards, the lower bound must be recomputed as well, and the point changes cluster if necessary.
6. If a point has changed cluster, its upper and lower bounds are swapped: for $k=2$, the former second-nearest center is now its own center and vice versa.
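
The bound bookkeeping in the steps above can be sketched as follows. This is a simplified illustration, assuming Euclidean distance, $k \ge 2$, and that an empty cluster keeps its center; the function name `hamerly_kmeans_sketch` is hypothetical, and further pruning tests of the full algorithm are left out.

```python
import numpy as np

def hamerly_kmeans_sketch(X, centers, max_iter=100):
    """Simplified Hamerly-style k-means with one upper and one lower bound per point."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    # Steps 1 and 2: exact distances, assignment, initial bounds.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    dist.sort(axis=1)
    upper, lower = dist[:, 0].copy(), dist[:, 1].copy()
    for _ in range(max_iter):
        # Step 3: new centers as arithmetic means.
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        # Step 4: movement of each center, then loosen the bounds accordingly.
        moved = np.linalg.norm(new_centers - centers, axis=1)
        centers = new_centers
        if not moved.any():
            break  # no center moved, so no assignment can change
        upper += moved[labels]
        for i in range(len(X)):
            lower[i] -= np.delete(moved, labels[i]).max()
        # Steps 5 and 6: only points with upper > lower need exact distances.
        for i in np.flatnonzero(upper > lower):
            upper[i] = np.linalg.norm(X[i] - centers[labels[i]])  # tighten upper bound
            if upper[i] <= lower[i]:
                continue  # the point provably stays in its cluster
            d = np.linalg.norm(X[i] - centers, axis=1)
            order = np.argsort(d)
            labels[i] = order[0]
            upper[i], lower[i] = d[order[0]], d[order[1]]
    return centers, labels

# The ten points of this exercise; points 4 and 8 are the initial centers.
X = np.array([[1, 2], [4, 1], [7, 2], [2, 4], [4, 4],
              [3, 11], [8, 7], [7, 11], [12, 8], [11, 10]])
centers, labels = hamerly_kmeans_sketch(X, X[[4, 8]])
print(labels)
print(centers)
```

Printing `upper` and `lower` inside the loop shows for which points the bounds already rule out a cluster change, which is the part a manual solution has to document per iteration.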

# Implement $k$-Medians clustering

In Python, implement the $k$-medians algorithm, that performs the following steps:

1. Choose initial centers using uniform random sampling from the data.
1. Assign each point to the nearest cluster center.
1. Compute new cluster centers using the median in each variable.
1. Repeat 2 and 3 until the cluster assignment does not change anymore.

Your implementation *must* allow using Manhattan *and* Euclidean distance.

a) Implement a function `choose_centers(X, k, seed)` to choose initial centers from $X \subset \mathbb{R}^d$.<br/>
This function *must* use a random generator initialized with the given seed, and sample uniformly without replacement.<br/>
Given the same seed, it must choose the same centers each time.

In [None]:
import numpy as np
def choose_centers(X, k, seed):
    # TODO: Compute and return initial centers
    return None
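
One possible way to meet the seeding requirement is NumPy's `Generator` API. The sketch below is an assumption about how the function could look, not the reference solution; it uses the hypothetical name `choose_centers_sketch` to keep it apart from the stub above.

```python
import numpy as np

def choose_centers_sketch(X, k, seed):
    """Sample k distinct rows of X uniformly at random, reproducibly."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)                 # generator initialized with the seed
    idx = rng.choice(len(X), size=k, replace=False)   # uniform, without replacement
    return X[idx].copy()

X_demo = np.arange(20, dtype=float).reshape(10, 2)
first = choose_centers_sketch(X_demo, 3, seed=0)
second = choose_centers_sketch(X_demo, 3, seed=0)
print(np.array_equal(first, second))  # same seed, same centers
```

Creating the generator inside the function, rather than relying on global random state, is what makes repeated calls with the same seed return the same centers.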

b) Implement a function `find_nearest(X, centers, p)` that returns the index of the nearest center in `centers` for each point in `X`, where `p` specifies which Minkowski norm $L_p$ should be used.

The function must return:
1. The sum of all distances from points to their assigned center.
1. An array of $n$ values, giving the index of the nearest center for each point.

Consider printing the summed distances for debugging purposes.

In [None]:
def find_nearest(X, centers, p):
    # TODO: Compute and return the sum of distances and the nearest center assignments
    return None, None
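
A broadcasting-based sketch of the nearest-center search, assuming a finite $p \ge 1$ and the hypothetical name `find_nearest_sketch`:

```python
import numpy as np

def find_nearest_sketch(X, centers, p):
    """Return (sum of L_p distances to the nearest center, index of that center)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = np.abs(X[:, None, :] - centers[None, :, :])  # shape (n, k, d)
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)         # L_p distances, shape (n, k)
    nearest = dist.argmin(axis=1)                       # ties go to the lower index
    total = dist[np.arange(len(X)), nearest].sum()
    return total, nearest

X_demo = np.array([[0, 0], [1, 1], [10, 10]])
centers_demo = np.array([[0, 0], [10, 9]])
total_l1, nearest_l1 = find_nearest_sketch(X_demo, centers_demo, p=1)
total_l2, nearest_l2 = find_nearest_sketch(X_demo, centers_demo, p=2)
print(total_l1, nearest_l1)
print(total_l2, nearest_l2)
```

The intermediate array has $n \cdot k \cdot d$ entries, which is fine for small data sets such as iris but may need chunking for large ones.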

c) Implement a function `compute_medians(X, assignments, p)` that, given all points `X` and their respective cluster `assignments` (an array of $n$ values), returns the next set of centers.<br/>
Again, `p` specifies which Minkowski norm $L_p$ should be used.

The function must return:
1. The sum of all distances from points to their assigned center
1. An array of $k$ points, the new centers.

Consider printing the summed distances for debugging purposes.

In [None]:
def compute_medians(X, assignments, p):
    # TODO: Compute new centers and return the sum of distances and the new centers
    return None, None
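
A sketch of the median update, under the assumption that the labels are exactly $0,\dots,k-1$ and that no cluster is empty (otherwise the median of an empty set is undefined). The name `compute_medians_sketch` is hypothetical.

```python
import numpy as np

def compute_medians_sketch(X, assignments, p):
    """Return (sum of L_p distances to the new centers, per-variable medians)."""
    X = np.asarray(X, dtype=float)
    assignments = np.asarray(assignments)
    k = int(assignments.max()) + 1  # assumption: labels 0..k-1, none missing
    # Median in each variable, computed separately per cluster.
    centers = np.vstack([np.median(X[assignments == j], axis=0) for j in range(k)])
    diff = np.abs(X - centers[assignments])
    total = ((diff ** p).sum(axis=1) ** (1.0 / p)).sum()
    return total, centers

X_demo = np.array([[0, 0], [2, 4], [4, 2], [10, 10], [12, 10]])
labels_demo = np.array([0, 0, 0, 1, 1])
total, centers = compute_medians_sketch(X_demo, labels_demo, p=1)
print(total)
print(centers)
```

Note that the per-variable median of the first cluster, $(2, 2)$, is not one of the data points.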

d) Implement the $k$-medians algorithm `kmedians(X, k, p, seed)` accepting
the data `X`, the number of centers `k`, the Minkowski norm `p` (at least 1 and 2), and the random generator `seed`.<br/>

When run multiple times with the same `seed`, it must produce the
same result every time, and different seeds must not always produce the same result.

The function must return:
1. An array of $k$ points, the final centers.
1. An array of $n$ values, giving the index of the nearest center for each point.

In [None]:
def kmedians(X, k, p, seed):
    # TODO: Implement the kmedians algorithm and return the final centers and the cluster assignments
    return None, None
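
The loop that ties the steps together might look like the sketch below. To keep the block self-contained, the initialization, assignment, and median steps are written inline instead of calling the functions from parts a) to c); the name `kmedians_sketch`, the `max_iter` safeguard, and the rule that an empty cluster keeps its center are assumptions.

```python
import numpy as np

def kmedians_sketch(X, k, p, seed, max_iter=100):
    """Alternate assignment and median update until the assignment is stable."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest center under the L_p norm.
        diff = np.abs(X[:, None, :] - centers[None, :, :])
        dist = (diff ** p).sum(axis=2) ** (1.0 / p)
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # the assignment did not change, so the centers would not either
        labels = new_labels
        # Move every non-empty cluster to the per-variable median of its points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = np.median(members, axis=0)
    return centers, labels

X_demo = np.array([[0, 0], [1, 0], [0, 1],
                   [100, 100], [101, 100], [100, 101]], dtype=float)
centers, labels = kmedians_sketch(X_demo, k=2, p=1, seed=0)
print(centers)
print(labels)
```

Comparing assignments rather than centers as the stopping rule matches the termination condition stated in the task.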

e) Implement an evaluation function `eval(X, centers, assignments)` that computes the
sum of distances of each point to the center given by `assignments` (which may, or may not, be the nearest center).

The function must return:
1. The sum of Manhattan distances.
1. The sum of Euclidean distances.
1. The sum of squared deviations.

In [None]:
def eval(X, centers, assignments):
    # TODO: Given some assignments compute and return the specified measures
    return None, None, None
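
The three measures can be computed from one difference array. This sketch reads "sum of squared deviations" as the sum of squared Euclidean distances and uses the hypothetical name `eval_sketch` to avoid shadowing Python's built-in `eval`.

```python
import numpy as np

def eval_sketch(X, centers, assignments):
    """Return (sum of Manhattan distances, sum of Euclidean distances, sum of squares)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Difference of every point to the center it is assigned to (not necessarily the nearest).
    diff = X - centers[np.asarray(assignments)]
    manhattan = np.abs(diff).sum()
    euclidean = np.sqrt((diff ** 2).sum(axis=1)).sum()
    squared = (diff ** 2).sum()
    return manhattan, euclidean, squared

X_demo = np.array([[0, 0], [3, 4]])
centers_demo = np.array([[0, 0], [0, 0]])
manhattan, euclidean, squared = eval_sketch(X_demo, centers_demo, [0, 1])
print(manhattan, euclidean, squared)
```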

f) In the Jupyter notebook, you are given a $k$-means function using the same initial center routine.
Run the following algorithms 10 times for seeds 0,...,9 and $k=3$ on the iris data set:
1. Your $k$-medians with Manhattan distance.
1. Your $k$-medians with Euclidean distance.
1. The provided $k$-means.

For each algorithm, print the mean and the minimum for *each* of the three metrics computed by the `eval` function.

In [None]:
# This is the provided code, just run it once.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
def sk_kmeans(X, k, seed):
    """ Run sklearn k-means """
    sk = KMeans(k, init=choose_centers(X, k, seed), n_init=1).fit(X)
    return sk.cluster_centers_, sk.labels_

iris = load_iris()
X = iris.data
y = iris.target

In [None]:
# TODO: Evaluate the algorithms on the dataset.

g) Use sklearn's PCA to project the iris data set into two dimensions (for visualization only):

In [None]:
from sklearn.decomposition import PCA
# PCA expects the feature matrix, not the Bunch object returned by load_iris()
irisP = PCA(n_components=2).fit_transform(iris.data)

Plot the cluster assignments for all three methods and `seed=1` (one figure per method).<br/>
Discuss the result.

In [None]:
# TODO: Plotting

#### TODO: Discussion