# Semantics and Pragmatics, KIK-LG103

## Lab session 4, Part 2

---

We will now move on to the second main topic of the day, **clustering**. We will try out two different clustering methods that were introduced in the lecture. The first one is **k-means** and the second one is **hierarchical (agglomerative)** clustering.

Once again, import all the necessary stuff before you read on.

In [None]:
import sys
sys.path.append("../../src/")

from lab4utils import embed, to_feature_matrix, plot_dendrogram, plot_kmeans
from sklearn.cluster import AgglomerativeClustering, KMeans

---

### Section 2.1

---

The first method we will look at is a "flat" clustering algorithm called k-means. Flat in this case means that the resulting clusters do not have any explicit structure. Ideally, words within a cluster are maximally similar to each other, while words in different clusters are maximally different from each other. We saw a [demo](http://shabal.in/visuals/kmeans/1.html) of how k-means clustering works; if you need a refresher you can check that out again.
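
If you prefer reading code to watching the demo, here is a minimal sketch of the basic k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The function name `kmeans_sketch` and the toy 2D points are made up for illustration; this is not how `sklearn` implements `KMeans`, and it skips refinements such as smarter initialization and convergence checks.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Toy k-means: X is an (n_points, n_dims) float array."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assignment step: each point joins its nearest centroid
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical toy data: two well-separated groups of three points
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(toy, k=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is which points share an id.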

---

**Ex 2.1.1** In the code cell below we show you how to cluster a set of words and plot the results. The things you need to worry about are on the second and the third line. 

Try out different words and numbers of clusters and think about the following questions: 

- How well does the clustering work?
- Which words seem to work best?
- What kind of categories do you think the clusters represent?
- Is there a number of clusters that gives sensible results most of the time, or one that doesn't work at all?
- Do you see any problems with having to define the number of clusters yourself?
- Do the clusters change when you run the algorithm several times?
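
One way to explore the last question in a controlled way is to fix the random seed of each run and compare the outcomes. The sketch below uses random toy data (not word embeddings) and the `n_init` and `random_state` parameters of `KMeans`; `inertia_` is the sum of squared distances from each point to its centroid, so lower means tighter clusters. The variable names are only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 30 random points in 5 dimensions
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(30, 5))

runs = []
for seed in (0, 1, 2):
    # n_init=1 means a single random start, so the seed fully determines the run
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(toy_X)
    runs.append((seed, km.inertia_, km.labels_))
    print(seed, round(km.inertia_, 2))
```

You can swap `toy_X` for the matrix `X` from the cell below to run the same comparison on your own words.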

---

In [None]:
# Define the words to be clustered and plotted
words = "run jump swim walk go take cry laugh speak talk hear".split()
clusters = 2

# Represent the words in a suitable way for the clustering algorithm
X = to_feature_matrix(words)
# Initialize clustering algorithm
model = KMeans(n_clusters=clusters)
# Train model
model = model.fit(X)
    
# Plot results
plot_kmeans(model, words)

---

### Section 2.2

---

In this section we will look at the second method: **hierarchical agglomerative clustering**. The method is hierarchical because it gives us a hierarchy of clusters instead of the flat, structureless clusters of k-means. We can investigate the hierarchy at different levels, resulting in different clusters depending on the level where we decide to group the words. Agglomerative means that the algorithm works in a bottom-up manner. Initially, each word is considered its own cluster. The clusters are then iteratively merged until we end up with one cluster. This results in the hierarchical structure.

All of this is easier to see in a dendrogram. Run the code cell below to see the results.
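
The bottom-up merging described above can also be sketched in a few lines. This toy version uses single linkage (the distance between two clusters is the distance between their closest members), which is an assumption made to keep the code short; `AgglomerativeClustering` uses Ward linkage by default, so the merge order can differ. The function name, the point names and the coordinates are all made up.

```python
import numpy as np

def agglomerate(points, names):
    """Naive single-linkage agglomerative clustering; returns the merge history."""
    # Initially, every point is its own cluster
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]], d))
        # Merge cluster b into cluster a
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

toy_names = ["a", "b", "c", "d"]
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 3.0]])
merges = agglomerate(toy_points, toy_names)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

Each printed line corresponds to one junction in a dendrogram, and the distance is the height at which the two branches meet.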

---

**Ex 2.2.1** Again, try out different words. This time the number of clusters isn't the most important thing. As you might notice, the number you define doesn't change the structure of the resulting hierarchy. What it changes is the "depth" at which the algorithm groups the words into clusters. These clusters are shown after the word labels (`word/cluster_id`). Think about the following questions:

- Do you see any potential benefits to using a hierarchical clustering instead of a flat one like k-means? Any problems? 
- Do the resulting clusters from this method map onto the clusters produced by k-means?
- Do the resulting clusters change when you run the algorithm multiple times?
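
To check the claim that the number of clusters only changes where the hierarchy is cut, you can compare the fitted `children_` attribute (the list of merges) for two different values of `n_clusters`. The sketch below does this on made-up 2D points rather than word embeddings; the variable names are only for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three well-separated pairs of points
toy_X = np.array([[0.0, 0.0], [0.0, 1.0],
                  [5.0, 0.0], [5.0, 1.5],
                  [20.0, 0.0], [20.0, 2.0]])

model_2 = AgglomerativeClustering(n_clusters=2).fit(toy_X)
model_3 = AgglomerativeClustering(n_clusters=3).fit(toy_X)

# Same merge tree in both cases...
same_tree = np.array_equal(model_2.children_, model_3.children_)
print("Same hierarchy:", same_tree)
# ...but a different cut, and therefore different flat labels
print("2 clusters:", model_2.labels_)
print("3 clusters:", model_3.labels_)
```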

---

In [None]:
# Define the words to be clustered and plotted
words = "run jump swim walk go take cry laugh speak talk hear".split()
clusters = 4

# Represent the words in a suitable way for the clustering algorithm
X = to_feature_matrix(words)
# Initialize clustering algorithm
model = AgglomerativeClustering(n_clusters=clusters)

# Train model
model = model.fit(X)
    
# Plot results
plot_dendrogram(model, labels=words)