Practical: k-means
==


PART 1 - Hard coded k-means
--

### Implementing $k$-means
- In this first part of the practical we're going to implement $k$-means from scratch based on what we learnt in the lecture.
- The file `MyKmeans.py` contains code that implements $k$-means clustering.
- Open the file and go through the function, making sure you understand what every step does.
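
As a point of comparison while reading the file, here is a minimal sketch of the standard (Lloyd's) $k$-means loop in NumPy. It is not the contents of `MyKmeans.py`; the names `kmeans_sketch` and `initial_rows` are hypothetical, and the sketch assumes the initial centroids are chosen as rows of the data.

```python
import numpy as np

def kmeans_sketch(X, initial_rows, max_iter=100):
    """Hypothetical helper: plain Lloyd's algorithm on a 2-D array."""
    X = np.asarray(X, dtype=float)
    centroids = X[list(initial_rows)].copy()   # initial centroids are data rows
    for _ in range(max_iter):
        # Assignment step: squared distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members) > 0:               # an empty cluster keeps its centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                              # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups, one starting centroid taken from each
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                 rng.normal(10, 0.5, size=(20, 2))])
toy_labels, toy_centroids = kmeans_sketch(toy, initial_rows=[0, 20])
print(toy_centroids)
```

The two steps inside the loop (assign, then update) are the ones to look for in any implementation, whatever the surrounding bookkeeping looks like.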

Let's first simulate some data to use in our clustering and plot them:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
data=np.random.normal(size=250*2).reshape(250,2)
data[0:125,0]=data[0:125,0]+3 #shift the first 125 points to form the first "true" cluster
data[0:125,1]=data[0:125,1]-4

plt.xlabel("X1")
plt.ylabel("X2")
plt.scatter(data[:,0],data[:,1])

Now let's run `MyKmeans()` on this dataset and see if we identify the two clusters:

In [None]:
from MyKmeans import MyKmeans #this runs the contents of the file MyKmeans.py that you examined above

In [None]:
import pandas as pd
data_df=pd.DataFrame(data)
MyRes2 = MyKmeans(df=data_df,n_cluster=2, c_initial=range(2))
print(MyRes2)
plt.scatter(MyRes2.iloc[:,0],MyRes2.iloc[:,1], c=MyRes2.iloc[:,2])

From visual inspection it looks like $k$-means has done a pretty good job in separating the data into clusters.

Use `crosstab()` from Pandas to compare the clusters' allocation with the true clusters:

In [None]:
true=list([0]*125+[1]*125) #construct a list with "true" cluster number
pd.crosstab(MyRes2.cluster,pd.Series(true), colnames=["True"], rownames=["kmeans"])

Has it grouped all points correctly? Run the above for different numbers of clusters by adjusting `n_cluster` and `c_initial` and see how the clusters change.

In [None]:
#Insert your code here

Also try running it from different starting points, keeping `n_cluster` fixed and changing only `c_initial`, to see whether and how the resulting clustering changes.

We've seen in the lecture that $k$-means tries to minimize the total within-cluster sum of squares. The function `calculateSS()` below takes as input the results of `MyKmeans()` and calculates the total, total within-cluster and total between-cluster sums of squares.

Go through the function and make sure you understand how it works and the differences between each quantity.

In [None]:
#Function calculateSS() 
#  Input: output from MyKmeans()
#  Output: dataframe with Total within clusters, Between and Total Sum of Squares.

def calculateSS(res_clusters):

    #All columns but the last are features; the last column holds the cluster label
    features = res_clusters.iloc[:, :-1]
    labels = res_clusters.iloc[:, -1]

    #Create a list with enough elements to store a number for each cluster referenced
    Within_SS = [0]*int(labels.max()+1)

    #Total SS: squared deviations from each column's overall mean, summed over all columns
    #(apply must work column-wise here; axis=1 would centre each row on its own mean)
    Total_SS = sum(features.apply(lambda x: sum((x-x.mean())**2)))

    for i in pd.unique(labels):
        i=int(i)
        df=features[labels==i]
        #Within SS for cluster i: squared deviations from the cluster's own column means
        Within_SS[i] = sum(df.apply(lambda x: sum((x-x.mean())**2)))

    Total_Within_SS = sum(Within_SS)

    Between_SS = Total_SS-Total_Within_SS

    res=pd.DataFrame([[Total_Within_SS,Between_SS,Total_SS]])
    res.columns=["Tot_Within","Between","Total"]

    return res
    

In [None]:
calculateSS(MyRes2)
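
`calculateSS()` obtains the between-cluster sum of squares by subtraction, which relies on the identity Total = Within + Between. The sketch below checks that identity directly on a small random example, computing the between term from the cluster means. The data and the labels are arbitrary illustrations, not output from `MyKmeans()`.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.normal(size=(30, 2))
labels = np.repeat([0, 1, 2], 10)      # arbitrary illustrative cluster labels

grand_mean = points.mean(axis=0)
total_ss = ((points - grand_mean) ** 2).sum()

within_ss = 0.0
between_ss = 0.0
for j in np.unique(labels):
    members = points[labels == j]
    centre = members.mean(axis=0)
    # Spread of the cluster's points around their own mean
    within_ss += ((members - centre) ** 2).sum()
    # Spread of the cluster mean around the grand mean, weighted by cluster size
    between_ss += len(members) * ((centre - grand_mean) ** 2).sum()

print(total_ss, within_ss + between_ss)
```

The two printed numbers should agree up to floating-point rounding, whatever labelling is used, as long as each centre is the mean of its own cluster.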

In our example, we know that there are 2 clusters by construction. If we didn't, how would we choose $k$?

We'd need to run `MyKmeans()` for various values of $k$ and choose the value beyond which adding another cluster no longer reduces the total within-cluster sum of squares by much (the "elbow" of the curve).

Let's write a function that iteratively changes $k$ and each time calculates the total within clusters sum of squares using our `calculateSS()` function:

In [None]:
def choose_k(max_k, data):
    #create placeholder lists with the correct number of elements
    res = [0] * (max_k+1)
    MySS = [0] * (max_k+1)
    for i in range(1,max_k+1):
        print("Trying k means with ",i," clusters")
        res[i]=MyKmeans(df=data, n_cluster=i, c_initial=range(i))
        MySS[i]=calculateSS(res_clusters=res[i])
    return MySS[1:max_k+1]

Run the function on `data_df` for up to 10 clusters and plot the results:

In [None]:
k=10
k_res=choose_k(max_k=k, data=data_df)
k_res=pd.concat(k_res, ignore_index=True)

In [None]:
k_res

In [None]:
import matplotlib.pyplot as plt
plt.plot(range(1,k+1),k_res['Tot_Within'], 'o') #one point per value of k tried above
plt.xlabel("Number of clusters")
plt.ylabel("Total Within SS")

Looking at the above plot, would anything stop you choosing $k=3$? Not really.
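
One rough way to put numbers on what the plot shows is to look at the relative drop in total within-cluster sum of squares each time $k$ increases by one. The sketch below is only an illustration: `relative_drops` is a hypothetical helper, and the input values are made up rather than taken from `k_res`.

```python
import numpy as np

def relative_drops(within_ss):
    """Hypothetical helper: fractional reduction in within SS from k to k+1."""
    w = np.asarray(within_ss, dtype=float)
    return (w[:-1] - w[1:]) / w[:-1]

# Made-up within-cluster sums of squares for k = 1, 2, 3, 4
example_within = [1000.0, 400.0, 380.0, 370.0]
drops = relative_drops(example_within)
print(drops)
```

With the real results you could pass `k_res['Tot_Within']` instead. A large drop followed by small ones suggests that the extra clusters are adding little, but this remains a judgement call rather than a rule.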

Run `MyKmeans()` for 3 and 4 clusters and plot the results.



In [None]:
#Insert your code here


In [None]:
#Insert your code here
MyRes4 = MyKmeans(df=data_df,n_cluster=4, c_initial=range(1,5))
plt.scatter(MyRes4.iloc[:,0],MyRes4.iloc[:,1], c=MyRes4.iloc[:,2])

PART 2 - Using kmeans from scikit-learn
--

Most of the things we've done so far can easily be done using the `KMeans` class from scikit-learn.

In [None]:
from sklearn.cluster import KMeans

To run scikit-learn's `KMeans` on our data for two clusters using the same initial centroids as we did before, run:


In [None]:
model=KMeans(n_clusters=2,init=data_df.iloc[0:2,:],max_iter=40)
sk_kmeans0=model.fit(data_df)
sk_kmeans0

You can see which clusters the points are assigned to:

In [None]:
sk_kmeans0.predict(data_df)

The total within sum of squares can be obtained as follows:

In [None]:
sk_kmeans0.inertia_

Unfortunately, scikit-learn does not provide a way to directly access values for between-sum-of-squares and total-sum-of-squares. Positions of the centroids can be obtained as follows:
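
If you do want the other two quantities, they can be derived by hand from the data and `inertia_`. The following is a minimal sketch on freshly simulated data (not `data_df`); the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

fit = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Total SS: squared distances of all points from the overall mean
total_ss = ((X - X.mean(axis=0)) ** 2).sum()
# Within SS: reported by scikit-learn as inertia_
within_ss = fit.inertia_
# Between SS: whatever is left over
between_ss = total_ss - within_ss
print(within_ss, between_ss, total_ss)
```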

In [None]:
sk_kmeans0.cluster_centers_

You don't have to specify the initial centroids. You can let scikit-learn find these. (Indeed, this is what it expects by default, which is why you may have received a warning above.) For reproducible results, you can set the seed with `random_state`:

In [None]:
model=KMeans(n_clusters=2,max_iter=40, random_state=777777)
sk_kmeans0=model.fit(data_df)
sk_kmeans0

Compare the results with those obtained from our implementation.

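`KMeans` can run the algorithm several times from different starting centroids and return the run with the lowest inertia. The number of runs is set by `n_init`: older scikit-learn releases default to 10, while from version 1.4 the default is `'auto'`, which performs a single run when the default `k-means++` initialisation is used. It's good practice to try several sets of starting centroids, as the clustering results can depend on them. You can set the number of runs explicitly using the `n_init` parameter: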
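
To see why the starting centroids matter, the sketch below fits `KMeans` with a single random initialisation for several seeds and prints the resulting inertia each time, alongside a fit that keeps the best of ten initialisations. It uses simulated data rather than `data_df`, and the seeds are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 1.0, size=(40, 2)) for c in (0, 4, 8)])

# One random initialisation per fit: the final inertia may vary with the seed
single_runs = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=s).fit(X).inertia_
    for s in range(5)
]

# Ten initialisations in one fit: the run with the lowest inertia is kept
best_of_ten = KMeans(n_clusters=3, init="random", n_init=10,
                     random_state=0).fit(X).inertia_

print(single_runs)
print(best_of_ten)
```

If the single-run values differ from one another, some starting points have led to a poorer local optimum, which is the situation that a larger `n_init` is meant to guard against.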

In [None]:
model=KMeans(n_clusters=2, n_init=20, max_iter=40, random_state=777777)
sk_kmeans0=model.fit(data_df)
sk_kmeans0

Alternative `KMeans()` implementations are available in scikit-learn that can be more robust than ours. Look at the documentation for the `KMeans` class that's included in your installed version of sklearn.

Let's slightly alter our `choose_k` function to use the output of scikit-learn's `KMeans`:

In [None]:
def Schoose_k(max_k, data):
    #collect the total within-cluster sum of squares (inertia_) for k = 1..max_k
    res = []
    for i in range(1,max_k+1):
        print("Trying k means with ",i," clusters")
        model=KMeans(n_clusters=i, n_init=20, max_iter=40, random_state=777777)
        sk_kmeans0=model.fit(data)
        res.append(sk_kmeans0.inertia_)
    return res

Run the updated version:

In [None]:
sk_res = Schoose_k(max_k=k, data=data_df)

Let's plot the results along with ours and compare:

In [None]:
plt.title("Blue:MyKmeans(), Orange:scikit-learn's KMeans()")
plt.plot(range(1,k+1),k_res['Tot_Within'], 'o')
plt.plot(range(1,k+1),sk_res, 'o', markersize=4)
plt.ylabel("Total Within SS")

Let's use $k=2$ again and run `KMeans()` to get the clustering results:

In [None]:
sk_res = KMeans(n_clusters=2, n_init=20, max_iter=40, random_state=777777).fit_predict(data_df)

In [None]:
plt.subplot(1,2,1)
plt.title("MyKmeans clustering")
plt.xlabel("X1")
plt.ylabel("X2")
plt.scatter(MyRes2.iloc[:,0],MyRes2.iloc[:,1], c=MyRes2.iloc[:,2])

plt.subplot(1,2,2)
plt.title("KMeans() clustering")
plt.xlabel("X1")
plt.ylabel("X2")
plt.scatter(data_df.iloc[:,0], data_df.iloc[:,1],c=sk_res)

What do you notice in the above plots? Are the results obtained the same?
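
When comparing two clusterings numerically, keep in mind that cluster numbers are arbitrary: the same grouping can come back with the labels swapped. The sketch below uses two small hand-written label lists (not the results above) to show two comparisons that are unaffected by such relabelling: a cross-tabulation and scikit-learn's `adjusted_rand_score`.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [1, 1, 1, 0, 0, 0]   # same grouping, cluster numbers swapped

# Each row and column has a single non-zero cell when the groupings agree
table = pd.crosstab(pd.Series(labels_a, name="A"), pd.Series(labels_b, name="B"))
print(table)

# Adjusted Rand index: 1.0 means identical groupings, regardless of numbering
score = adjusted_rand_score(labels_a, labels_b)
print(score)
```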

PART 3 - KMeans() on the iris dataset
--

The iris dataset is often used to illustrate clustering and classification and it's also available in sklearn.datasets.

The dataset contains the length and width of sepals and petals of different flowers of 3 different species: virginica, versicolor and setosa.

In the plot below, the colour corresponds to the species of each observation:

In [None]:
import seaborn as sns           # the seaborn library provides a good function for making pair plots
import sklearn.datasets as skd  # we use sklearn.datasets to get the iris dataset

iris=skd.load_iris()
iris_df=pd.DataFrame(iris.data)
iris_df.columns=iris.feature_names
iris_df['label']=[iris.target_names[t] for t in iris.target]

palette=sns.color_palette("hls", 3) # gives a red-green-blue colour palette
sns.pairplot(iris_df, hue='label', palette=palette)


As we see below, the dataset has 50 observations from each species:

In [None]:
iris_df['label'].value_counts()

Looking at the pair plot above, do you think $k$-means could distinguish between the 3 species? Is there any species it might struggle to separate?

Let's try `KMeans()` on the iris dataset using the true number of clusters:

In [None]:
cl_iris=KMeans(n_clusters=3, n_init=20, max_iter=100, random_state=777777).fit_predict(iris_df.iloc[:,0:4])

In [None]:
iris_df['cluster_label']=cl_iris

Plot the data and colour the points by the assigned clusters. How do the results compare to the true groups? **NOTE:** the colour-cluster combination will not align with those of the previous plot.

In [None]:
#Insert your code here

Has _k_-means done a good job?

Compare the true class and the assigned clusters using [`crosstab` from Pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html).

In [None]:
#Insert your code here

Now try scaling your data before applying $k$-means.

In [None]:
from sklearn.preprocessing import scale
iris_scaled=scale(iris.data)

Look at the documentation of `scale` if you're not sure what it's doing.
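
As a quick check of what `scale` does with its default settings, the sketch below compares its output with a manual standardisation of each column (subtract the column mean, divide by the column standard deviation). The small array is an illustrative example, not the iris data.

```python
import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Manual standardisation; np.std defaults to the population form (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = scale(X)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances that $k$-means relies on.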

Run `KMeans()` on the scaled dataset, using the same seed, and store the results in `cl_iris_scaled`.

In [None]:
#Insert your code here


Compare with the results from `cl_iris` and the true labels.

In [None]:
#Insert your code here
#comparing with the true labels

Have the results changed? Has there been any improvement?

_Normally_ you would expect scaling to improve the clustering, because features measured on larger scales otherwise dominate the distance calculation. In this case, however, the clustering was already quite successful without it.

Run `Schoose_k` for a series of k values. Would you have chosen $k=3$?

In [None]:
# Insert your code here


---
## Acknowledgements and Reuse
<p>
<small>
Python version by Adam Carter, EPCC, The University of Edinburgh, based on an R version previously created at EPCC, The University of Edinburgh.
</small>
</p>
<p>
<small>
&copy; 2023 EPCC, The University of Edinburgh
</small>
</p>
<p>
<small>
You are welcome to re-use this notebook and its contents under the terms of CC-BY-4.0
</small>