# EXERCISE

## Merging bag of words with K-means clustering

* Generate 300 text documents, each 50 characters long, with 4 true clusters. You can use the following command to generate each document.

`text = ''.join(random.choice(string.ascii_letters + string.digits + ' ') for _ in range(text_length))`

* Preprocess the text data and create BoW representation

* Determine the optimal number of clusters using silhouette coefficients, trying every number of clusters from 2 to 10

* Select the optimal number of clusters (K) based on the highest silhouette coefficient using

`from sklearn.metrics import silhouette_score`

`silhouette_score(X, kmeans.fit_predict(X))`

* Apply K-means clustering with the selected K

* Evaluate the results with purity. Recall that the purity calculation requires the true cluster labels.

`purity = max(completeness, homogeneity)`

You may use

`sklearn.metrics.homogeneity_score(labels_true, labels_pred)`

`sklearn.metrics.completeness_score(labels_true, labels_pred)`

$\text{Purity} = \frac{1}{N} \sum_{k} \max_{j} |C_k \cap T_j|$

$N$ is the total number of data points.

$k$ represents the clusters created by the clustering algorithm.

$j$ represents the true classes or ground truth labels.

$C_k$ is the set of data points in cluster $k$.

$T_j$ is the set of data points in true class $j$.
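
The formula above can also be evaluated directly from the two label sequences. The following is a minimal sketch using only the Python standard library; `purity_score` is a hypothetical helper name introduced here for illustration, not a scikit-learn function. Note that the value it returns is in general not the same number as `max(completeness, homogeneity)`, since those two scores are entropy-based rather than count-based.

```python
from collections import Counter, defaultdict


def purity_score(labels_true, labels_pred):
    """Purity = (1/N) * sum over clusters k of max over classes j of |C_k ∩ T_j|.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labels_true and labels_pred must have the same length")
    if len(labels_true) == 0:
        raise ValueError("at least one data point is required")

    # For every predicted cluster k, count how many of its members
    # belong to each true class j (this is |C_k ∩ T_j|).
    counts_per_cluster = defaultdict(Counter)
    for true_label, cluster_label in zip(labels_true, labels_pred):
        counts_per_cluster[cluster_label][true_label] += 1

    # Take the size of the dominant true class in each cluster,
    # sum over clusters, and normalise by the number of points N.
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(labels_true)


# Toy example: cluster 0 holds two points of class 0,
# cluster 1 holds one point of class 0 and three of class 1.
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 1]
print(f"Purity: {purity_score(toy_true, toy_pred):.4f}")
```

In the toy example the dominant classes contribute 2 and 3 points, so the purity is 5/6. The same function accepts NumPy arrays such as the outputs of `fit_predict`.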

* Print out the highest silhouette score

* Print out the purity score
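
To report the highest silhouette score together with the K that produced it, one option is to keep the scores in a list parallel to the candidate K values and pick the maximum. The sketch below assumes such a pair of sequences already exists; `best_k_by_silhouette` is a hypothetical helper name, and the example scores are made-up numbers used only to show the call, not results from this exercise.

```python
def best_k_by_silhouette(k_values, silhouette_scores):
    """Return (best_k, best_score) for parallel sequences of K values and scores.

    Hypothetical helper for illustration only.
    """
    k_list = list(k_values)
    score_list = list(silhouette_scores)
    if not k_list or len(k_list) != len(score_list):
        raise ValueError("k_values and silhouette_scores must be non-empty and equally long")

    # Index of the largest score; ties resolve to the smallest index (smallest K).
    best_index = max(range(len(score_list)), key=score_list.__getitem__)
    return k_list[best_index], score_list[best_index]


# Made-up scores for K = 2..10, purely illustrative.
example_k_values = range(2, 11)
example_scores = [0.12, 0.31, 0.27, 0.22, 0.20, 0.18, 0.15, 0.14, 0.11]

best_k, best_score = best_k_by_silhouette(example_k_values, example_scores)
print(f"Highest silhouette score: {best_score:.3f} (K={best_k})")
```

Returning the score alongside K avoids recomputing the silhouette coefficient after the final model is fitted.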






In [1]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import random
import string
from sklearn.metrics import completeness_score
from sklearn.metrics import homogeneity_score

# Generate 300 text documents, each 50 characters long, with 4 true clusters
random.seed(42)
np.random.seed(42)  # also seed NumPy so the randomly drawn true labels are reproducible
n_samples = 300
text_length = 50
n_clusters = 4

def generate_random_text(text_length):
    text = ''.join(random.choice(string.ascii_letters + string.digits + ' ') for _ in range(text_length))
    # string.ascii_letters represents all uppercase and lowercase letters of the English alphabet
    # string.digits represents all numbers from 0 to 9
    # ' ' represents a space
    # The for loop creates a text string by randomly selecting characters from these three categories for a specified length, which is text_length

    return text

text_data = [generate_random_text(text_length) for _ in range(n_samples)]
labels = np.random.randint(0, n_clusters, n_samples)  # "labels" contains the IDs of the actual clusters

# BoW
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text_data)

# Determine the optimal number of clusters using silhouette coefficients for K from 2 to 10
k_values = range(2, 11)  # the upper bound of range() is exclusive, so 11 includes K = 10
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=0)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Select the optimal number of clusters (K) as the one with the highest silhouette coefficient
optimal_k = k_values[np.argmax(silhouette_scores)]

# Apply K-means clustering with the selected K
kmeans = KMeans(n_clusters=optimal_k, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Evaluate the results with the purity score
# Remember that for calculating purity, it's necessary to use the actual cluster labels ("labels")
completeness = completeness_score(labels, cluster_labels)
homogeneity = homogeneity_score(labels, cluster_labels)
purity = max(completeness, homogeneity)  # Choose the highest score as the purity

print(f"Number of clusters (K): {optimal_k}")
print(f"Purity: {purity}")


Number of clusters (K): 2
Purity: 0.1941950278616768
