In [None]:
%pip install numpy scipy pandas
%pip install --no-cache-dir --force-reinstall https://dm.cs.tu-dortmund.de/nats/nats25_03_01_spherical_kmeans_clustering-0.1-py3-none-any.whl
import nats25_03_01_spherical_kmeans_clustering

# Clustering
## Spherical k-Means Clustering

In this assignment, your task is to implement spherical k-means clustering *yourself*.

Pay attention to performance: Python `for` loops over all instances and variables will be too slow, so you need to use efficient vectorized operations instead.

In [None]:
import numpy as np, pandas as pd, scipy
# Load the input data
import gzip, json, urllib.request  # urlretrieve lives in the urllib.request submodule
file_path, _ = urllib.request.urlretrieve("https://dm.cs.tu-dortmund.de/nats/data/minecraft-articles.json.gz")
with gzip.open(file_path, "rt", encoding="utf-8") as f:
    raw = json.load(f)
titles, texts, classes = [x["title"] for x in raw], [x["text"] for x in raw], [x["heuristic"] for x in raw]

Before you do anything else, always have a look at the data you are dealing with!

In [None]:
# Have a look at the data set!

## Vectorize the text

Vectorize the wiki texts using the standard TF-IDF from the lecture (the SMART `ltc` variant on lowercased text, *not* the scikit-learn variant), as discussed in the previous assignments. Use a minimum document frequency of 5 and the standard English stopwords to reduce the vocabulary.
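
As a reminder of what the `ltc` weighting does, here is a minimal sketch on a hypothetical toy count matrix. It is not the assignment solution: tokenization, stopword removal, and the minimum document frequency are left out, and the use of the natural logarithm is an assumption. All names (`counts`, `toy_idf`, `toy_tfidf`) are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Toy term counts: 3 documents x 4 terms (hypothetical data, not the assignment corpus).
counts = sparse.csr_matrix(np.array([
    [2, 1, 0, 0],
    [0, 1, 3, 0],
    [1, 1, 0, 4],
], dtype=float))

n_docs = counts.shape[0]
# Document frequency: number of documents that contain each term.
df = np.asarray((counts > 0).sum(axis=0)).ravel()
# "t": plain inverse document frequency (natural log is an assumption here).
toy_idf = np.log(n_docs / df)

# "l": logarithmic term frequency 1 + log(tf), applied to the stored nonzero entries only.
weighted = counts.copy()
weighted.data = 1.0 + np.log(weighted.data)
# Scale each column by its IDF value without densifying the matrix.
weighted = weighted @ sparse.diags(toy_idf)
# "c": cosine normalization, so that every row has unit Euclidean length.
norms = np.sqrt(np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel())
toy_tfidf = sparse.diags(1.0 / norms) @ weighted

print(toy_tfidf.toarray().round(3))
```

The second term occurs in every toy document, so its IDF is zero and its column vanishes. Because of the cosine normalization, the dot product of two rows is their cosine similarity.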

In [None]:
tfidf = None # sparse tf-idf matrix
vocabulary = None # vocabulary
idf = None # IDF values
pass # Your solution here

In [None]:
pd.DataFrame.sparse.from_spmatrix(tfidf, columns=vocabulary)

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_7_0(idf, vocabulary, tfidf, texts)

## Reassignment step

Implement the reassignment step of **spherical** k-means. Use **vectorized code**, or it will likely be too slow.

Do *not* use a Python `for` loop, and do *not* convert the input data to a dense matrix (slow).
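
The idea can be sketched on toy data, assuming that all rows and all centers have unit length, so that a dot product equals the cosine similarity. The names `toy_points`, `toy_centers`, and `toy_assignment` are hypothetical, and the centers are assumed to be a dense array here.

```python
import numpy as np
from scipy import sparse

# Toy unit-length rows (hypothetical data): 4 points in 3 dimensions.
toy_points = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]))
# Two unit-length centers, given as a dense array (an assumption of this sketch).
toy_centers = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# One sparse-times-dense product yields all point-to-center similarities at once.
similarities = np.asarray(toy_points @ toy_centers.T)  # shape (4, 2)
# Index of the most similar center per row; argmax over axis 1 gives a flat array.
toy_assignment = similarities.argmax(axis=1)
print(toy_assignment)  # [0 0 1 1]
```

If the centers are themselves sparse, as in the test run of the cell that follows, the product is sparse as well and has to be handled accordingly before taking the maximum.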

In [None]:
def reassign(tfidf, centers):
    """Reassign each object in tfidf to the most similar center.
       Return a flat array, not a matrix."""
    pass # Your solution here
    
# Test run
print(reassign(tfidf[:20], tfidf[:5]))

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_10_0(range, tfidf, reassign)

## Recompute the cluster centers

Given a cluster assignment, recompute the cluster centers as used by *spherical* k-means.

Vectorize your code: do not iterate over all points with a Python `for` loop.

Hint: for the assignment, it is okay to assume that a cluster never becomes empty.
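
One possible way to avoid a loop over the points is a sparse indicator matrix, sketched here on hypothetical toy data. In spherical k-means a center is the sum of its members scaled to unit length. The sketch assumes no cluster is empty, and all names are made up for illustration.

```python
import numpy as np
from scipy import sparse

toy_points = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]))
toy_assignment = np.array([0, 0, 1, 1])
n, k = toy_points.shape[0], 2

# Indicator matrix of shape (k, n): entry (j, i) is 1 if point i belongs to cluster j.
indicator = sparse.csr_matrix(
    (np.ones(n), (toy_assignment, np.arange(n))), shape=(k, n)
)
# Per-cluster sums of the member vectors, computed as one sparse product.
sums = (indicator @ toy_points).toarray()
# Scale each sum to unit length; dividing by the cluster size first
# would not change the direction, so it can be skipped.
toy_new_centers = sums / np.linalg.norm(sums, axis=1, keepdims=True)
print(toy_new_centers.round(3))
```

Only the small `k`-row result is converted to a dense array; the input data stays sparse.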

In [None]:
def new_centers(tfidf, assignment):
    """Return a matrix containing the new cluster centers for spherical k-means."""
    centers = [] # Okay to use a list or an array for the assignment
    pass # Your solution here
    return np.array(centers) # Always return an array, copying is okay for the assignment

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_13_0(new_centers, tfidf)

## Initialization

Now write the initialization code. Given a random generator *seed*, choose `k` objects as initial cluster centers without replacement. Please use numpy.
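
A minimal sketch of seeded sampling without replacement on hypothetical toy data follows. It uses `np.random.default_rng`. Which numpy generator the hidden tests expect is not stated in the text, so treat that choice as an assumption.

```python
import numpy as np
from scipy import sparse

toy_points = sparse.csr_matrix(np.eye(6))  # 6 hypothetical unit-length points
k, seed = 3, 42

# A generator created from the seed makes the choice reproducible.
rng = np.random.default_rng(seed)
# replace=False guarantees k distinct row indices.
chosen = rng.choice(toy_points.shape[0], size=k, replace=False)
# Row selection on a sparse matrix keeps the result sparse.
toy_initial = toy_points[chosen]
print(chosen, toy_initial.shape)
```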

In [None]:
def initial_centers(tfidf, k, seed):
    """Choose k initial cluster centers."""
    pass # Your solution here

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_16_0(initial_centers, tfidf)

## Implement a Quality Measure

As a quality measure, compute the *sum* of the cosine similarities of every point to its cluster center.
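
On toy data, this measure can be sketched as follows, assuming unit-length rows and centers so that dot products are cosine similarities. All names are hypothetical.

```python
import numpy as np
from scipy import sparse

toy_points = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]))
toy_centers = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
toy_assignment = np.array([0, 0, 1, 1])

# All point-to-center similarities, shape (n, k).
similarities = np.asarray(toy_points @ toy_centers.T)
# Pick, for every row, the similarity to its own assigned center.
own = similarities[np.arange(toy_points.shape[0]), toy_assignment]
toy_quality = own.sum()
print(toy_quality)
```

Here the four similarities are 1, 0.8, 1, and 0.8, so the sum is 3.6 up to floating-point rounding.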

In [None]:
def quality(tfidf, centers, assignment):
    """Evaluate the quality given the current centers and cluster assignment."""
    s = None # sum of cosine similarities
    pass # Your solution here
    return s

In [None]:
# This test is likely slow if you use a "for" loop in quality(). But that is okay.
nats25_03_01_spherical_kmeans_clustering.hidden_tests_19_0(quality, tfidf)

As a reference value, compute the quality of assigning every object to the global *spherical* center.

Hint: you can use `new_centers` here.
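
A useful sanity check, sketched on hypothetical toy data: for unit-length rows $x_i$ with sum $s$, the global spherical center is $s / \lVert s \rVert$, and the similarity sum is $\sum_i x_i \cdot s / \lVert s \rVert = \lVert s \rVert$. The names below are made up for illustration.

```python
import numpy as np
from scipy import sparse

toy_points = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.6, 0.8],
]))

# Sum of all rows as a flat dense vector.
total = np.asarray(toy_points.sum(axis=0)).ravel()
# Global spherical center: the sum scaled to unit length.
toy_center = total / np.linalg.norm(total)
# Sum of the similarities of every row to that center.
toy_sim = float(np.asarray(toy_points @ toy_center).sum())
# Both printed values should agree up to rounding.
print(toy_sim, np.linalg.norm(total))
```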

In [None]:
center1 = None # Compute the overall center
sim1 = 0 # Compute the overall similarity

pass # Your solution here

print("Similarity sum to center:", sim1)
print("Average similarity to center:", sim1 / tfidf.shape[0])

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_22_0(sim1, center1, tfidf)

## Implement Spherical k-Means

Now use these methods to implement spherical k-means clustering. Stop after a maximum number of iterations, or if no point is reassigned.

Return the cluster centers, the final cluster assignment, and an array of quality scores evaluated every time *after* reassigning the points to the clusters.
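
The control flow can be sketched with a self-contained dense toy version. It is only an illustration: the real data must stay sparse, the helper `toy_spherical_kmeans` is hypothetical, and skipping the quality score in the final iteration where nothing changes is one possible reading of the text above.

```python
import numpy as np

def toy_spherical_kmeans(points, centers, max_iter=100):
    """Dense toy version; points and centers are arrays with unit-length rows."""
    assignment = None
    qualities = []
    for _ in range(max_iter):
        similarities = points @ centers.T
        new_assignment = similarities.argmax(axis=1)
        # Stop once no point changes its cluster.
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Record the quality right after reassigning, before the centers move.
        qualities.append(similarities[np.arange(len(points)), assignment].sum())
        # Recompute centers as normalized sums (assumes no cluster becomes empty).
        # This loops over the k clusters only, never over the points.
        sums = np.stack([points[assignment == j].sum(axis=0)
                         for j in range(len(centers))])
        centers = sums / np.linalg.norm(sums, axis=1, keepdims=True)
    return centers, assignment, qualities

toy_points = np.array([[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0]])
toy_centers, toy_assignment, toy_qualities = toy_spherical_kmeans(
    toy_points, toy_points[[0, 3]]
)
print(toy_assignment, toy_qualities)
```

With the first and last toy point as initial centers, the first two points end up in cluster 0 and the last two in cluster 1.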

In [None]:
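def spherical_kmeans(tfidf, initial_centers, max_iter=100):
    """Run spherical k-means; return the centers, the assignment, and the quality scores."""
    centers = initial_centers # current cluster centers
    assignment = None # current cluster assignment
    qualities = []
    pass # Your solution here
    return centers, assignment, qualities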

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_25_0(spherical_kmeans, quality, tfidf, reassign, new_centers)

## CLUSTER!

Now check whether your code works! First, cluster with `k=2`.

In [None]:
c = initial_centers(tfidf, 2, 42)
c, a, q = spherical_kmeans(tfidf, c, 100)
for i, x in enumerate(q): print(i, x)

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_28_0(quality, spherical_kmeans, tfidf)

## Study the Clusters

As we cannot rely on heuristics such as the "knee" to choose the number of clusters, we need to perform manual inspection:

- What are the most important words of each cluster?
- What are the most central documents in each cluster?
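
Both questions come down to a top-`k` selection, which can be sketched with `np.argsort` on hypothetical dense toy data. The vocabulary words, the center weights, and the documents below are made up for illustration.

```python
import numpy as np

toy_vocabulary = np.array(["creeper", "diamond", "pickaxe", "redstone", "zombie"])
toy_center = np.array([0.1, 0.7, 0.6, 0.0, 0.3])

# Most important words: argsort is ascending, so reverse it and keep the first k.
top_words = np.argsort(toy_center)[::-1][:3]
print(toy_vocabulary[top_words].tolist())  # ['diamond', 'pickaxe', 'zombie']

# Four toy documents with unit-length rows and a toy cluster assignment.
toy_docs = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
])
toy_assignment = np.array([0, 0, 1, 0])

# Most central documents of cluster 0: members with the highest similarity to the center.
members = np.flatnonzero(toy_assignment == 0)
member_sims = toy_docs[members] @ toy_center
central = members[np.argsort(member_sims)[::-1][:2]]
print(central.tolist())  # [1, 0]
```

The indices in `central` refer to rows of the full data set, so they can be used directly to look up titles.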

In [None]:
def most_important(vocabulary, center, k=10):
    """Find the k most important words of a cluster center."""
    pass # Your solution here

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_31_0(most_important, vocabulary, tfidf)

In [None]:
def most_central(tfidf, centers, assignment, i, k=5):
    """Find the k most central documents of cluster i."""
    pass # Your solution here

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_33_0(most_central, tfidf)

## Explain your Clusters

Write a function that prints a cluster explanation using the functions above, and run it for `k=20`.

In [None]:
def explain(tfidf, vocabulary, titles, centers, assignment):
    """Use what you built."""
    pass # Your solution here    

In [None]:
# Cluster with k=20, and explain!
pass # Your solution here

In [None]:
nats25_03_01_spherical_kmeans_clustering.hidden_tests_37_0(titles, explain, print, most_central, vocabulary, most_important, tfidf)