
In [1]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

Function build_context_matrices

This function takes `tokens`, `vocab_size`, and `word_to_id` as inputs. It first computes the word counts with `np.bincount(tokens)`, which counts the number of occurrences of each token id in `tokens`.

For example, if

tokens = [0, 1, 2, 3, 3, 0, 2]

vocab size = 4

word_counts = [2, 1, 2, 2] (0 occurs twice, 1 once, 2 twice and 3 twice)

Then, based on the value of `w1`, we sort `word_counts` and take the ids of the `w1` most frequent words.

`np.argsort(word_counts)[-self.w1:]`

`np.argsort` returns the ids ordered by ascending count, so the last `w1` entries are the ids of the `w1` most frequent words.
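
As a small illustration of the counting and selection step, here is a sketch on made-up token ids (the array and the value of `w1` are assumptions chosen for the example, not values from the notebook run):

```python
import numpy as np

# Toy token ids in the range 0..vocab_size-1 (made up for illustration).
tokens = np.array([3, 3, 3, 3, 2, 2, 2, 0, 0, 1])
w1 = 2  # number of most frequent words to keep

# word_counts[i] is the number of times id i occurs in tokens.
word_counts = np.bincount(tokens)

# argsort orders ids by ascending count, so the last w1 ids are the most frequent.
top_words = np.argsort(word_counts)[-w1:]

print(word_counts)
print(top_words)
```

Note that `top_words` is ordered from least to most frequent within the selected set, so column `j` of the context matrices corresponds to `top_words[j]`, not to the j-th most frequent word.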

Afterwards, the function initializes two context matrices, left and right (L, R), which record how often each word occurs next to its previous and next word.

Each matrix is a 2D array of size (vocab_size, w1).

Entry (i, j) counts how often word i appears directly after (L) or directly before (R) the j-th of the top w1 words.

Next we iterate through every token. If the previous or the next word is in `top_words`, we increment the corresponding entry of L or R by 1. We do this for every word id in tokens (keep in mind that the word ids in the tokens array range from `0` to `vocab_size - 1` inclusive).

```
for i in range(len(tokens)-1):
    curr_word = tokens[i]
    # Right context
    next_word = tokens[i+1]
    if next_word in top_words:
        R[curr_word, np.where(top_words == next_word)[0][0]] += 1

    # Left context
    if i > 0:
        prev_word = tokens[i-1]
        if prev_word in top_words:
            L[curr_word, np.where(top_words == prev_word)[0][0]] += 1
```

`np.where(top_words == prev_word)[0][0]` gives the index in `top_words` of the matching left (or right) neighbour, which is the column to increment.

The function returns the L and R matrices.
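
To see what the two matrices look like, here is a minimal sketch on a six-token toy corpus. The token ids and `top_words` are made up, and the `col_of` dictionary is a hypothetical helper that replaces the repeated `np.where` lookup. Unlike the loop in the notebook, this sketch also records the left context of the final token.

```python
import numpy as np

tokens = np.array([0, 1, 2, 0, 1, 3])   # toy corpus (assumed)
vocab_size = 4
top_words = np.array([0, 1])            # assumed ids of the w1 = 2 most frequent words

# Hypothetical helper: map each top word id to its column index.
col_of = {int(w): j for j, w in enumerate(top_words)}

L = np.zeros((vocab_size, len(top_words)))
R = np.zeros((vocab_size, len(top_words)))

for i, curr in enumerate(tokens):
    # Right context: the word that follows curr.
    if i + 1 < len(tokens) and int(tokens[i + 1]) in col_of:
        R[curr, col_of[int(tokens[i + 1])]] += 1
    # Left context: the word that precedes curr.
    if i > 0 and int(tokens[i - 1]) in col_of:
        L[curr, col_of[int(tokens[i - 1])]] += 1

print(L)
print(R)
```

Row 0 of `R` ends up with a count of 2 in column 1 because word 0 is followed by word 1 twice in the toy corpus.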



Function svd_transform

Here we compute the SVD of the L and R matrices using `np.linalg.svd`.
SVD is needed to represent the relationships between words in a reduced-dimensionality space.

U: Captures the relationship between words (rows of the matrix) and the latent dimensions (reduced features).

Σ: Scales these relationships by their importance (singular values).

V: Captures the relationship between the context words (columns of the matrix) and the latent dimensions.


We care about word-level embeddings in this implementation, for which we only need U @ Σ and not V transpose.

We reduce the rank of the matrix by keeping only the top `rank` singular values (`r1` in the first pass, `r2` in the second).

In short,

Perform SVD: Extract the most important latent features of the data.

Reduce Dimensionality: Retain only the top rank dimensions, filtering out noise or less significant dimensions.

Normalize: Prepare the data for tasks like clustering, classification, or similarity computations.
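
The three steps can be sketched on a small made-up count matrix. The matrix `M` and `rank = 2` are assumptions for illustration; the calls mirror what `svd_transform` does.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy (words x context words) count matrix, made up for illustration.
M = np.array([[2, 0, 1, 0],
              [1, 3, 0, 0],
              [0, 1, 4, 1],
              [3, 0, 0, 2],
              [0, 2, 1, 1],
              [1, 1, 1, 1]], dtype=float)
rank = 2

# 1. Perform SVD (singular values come back sorted in descending order).
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# 2. Reduce dimensionality: keep the top `rank` columns of U, scaled by the
#    singular values. This equals U[:, :rank] @ np.diag(S[:rank]).
embeddings = U[:, :rank] * S[:rank]

# 3. Normalize: scale each row (word) to unit L2 norm.
descriptors = normalize(embeddings)

print(descriptors.shape)
print(np.linalg.norm(descriptors, axis=1))
```

After normalization every word is a point on the unit sphere, so k-means compares words by the direction of their context profile rather than by how frequent they are.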

Function cluster_descriptors

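This function performs k-means clustering on the descriptors.
The centroids are initialized with the descriptors of the `n_clusters` most frequent tokens (`k1` in the first pass, `k2` in the second), because frequently occurring words are assumed to provide a good starting point for clustering: they tend to represent central themes or POS categories in the data. Although the docstring calls this weighted k-means, the code passes no sample weights to `KMeans`, so every word counts equally.
`n_init=1` makes sure that k-means runs only once, starting from these centroids.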
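
A minimal sketch of the frequency-seeded initialization on made-up 2D descriptors follows. The data and frequencies are assumptions. The last lines show one possible way to make the clustering frequency-weighted through scikit-learn's `sample_weight` argument; this is a variation for illustration, not what the notebook code does.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated groups of toy descriptors (made up for illustration).
descriptors = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                        [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
word_counts = np.array([1, 2, 9, 1, 2, 8])  # assumed word frequencies
n_clusters = 2

# Seed the centroids with the descriptors of the most frequent words.
top_indices = np.argsort(word_counts)[-n_clusters:]
init_centroids = descriptors[top_indices]

# n_init=1 because the starting centroids are given explicitly.
kmeans = KMeans(n_clusters=n_clusters, init=init_centroids, n_init=1)
labels = kmeans.fit_predict(descriptors)
print(labels)

# Variation (not in the notebook): weight each word by its frequency.
weighted = KMeans(n_clusters=n_clusters, init=init_centroids, n_init=1)
weighted_labels = weighted.fit_predict(descriptors, sample_weight=word_counts)
print(weighted_labels)
```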


Function fit

This function trains the model by performing SVD and clustering twice.
First it computes the left and right context matrices and performs SVD on both to reduce the rank and keep only the important parts of the data.
It uses `np.hstack` to horizontally stack the transformed left and right context matrices and then performs k-means clustering on the reduced-rank descriptors, giving `k1` clusters.
These cluster assignments are then used to repeat the same procedure, but this time the columns of the left and right matrices are the `k1` clusters instead of individual words: each entry is incremented at the cluster id of the neighbouring word, which gives a more refined clustering.
The final `k2` clusters are then returned.
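
The second-pass matrices can be sketched without an explicit loop. The token ids and the first-pass cluster assignments below are made up, and `np.add.at` is swapped in for the notebook's Python loop. As a small difference, this version also counts the left context of the final token.

```python
import numpy as np

tokens = np.array([0, 1, 2, 0, 1, 3])    # toy corpus (assumed)
vocab_size = 4
k1 = 2
first_clusters = np.array([0, 1, 1, 0])  # assumed first-pass cluster id per word id

L2 = np.zeros((vocab_size, k1))
R2 = np.zeros((vocab_size, k1))

curr_for_right, next_words = tokens[:-1], tokens[1:]
curr_for_left, prev_words = tokens[1:], tokens[:-1]

# np.add.at accumulates correctly even when the same (row, column) pair repeats.
np.add.at(R2, (curr_for_right, first_clusters[next_words]), 1)
np.add.at(L2, (curr_for_left, first_clusters[prev_words]), 1)

print(L2)
print(R2)
```

Because the columns are now clusters rather than the top `w1` words, every neighbouring token contributes a count, including rare words that were ignored in the first pass.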

In [18]:
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from collections import defaultdict

class SVD2Tagger:
    def __init__(self, w1=1000, r1=100, k1=500, r2=300, k2=45):
        """
        Initialize SVD2 tagger with parameters
        w1: number of most frequent words for context
        r1: rank for first SVD
        k1: number of clusters for first clustering
        r2: rank for second SVD
        k2: final number of POS tags
        """
        self.w1 = w1
        self.r1 = r1
        self.k1 = k1
        self.r2 = r2
        self.k2 = k2

    def build_context_matrices(self, tokens, vocab_size, word_to_id):
        """Build left and right context matrices"""
        # Get most frequent word indices
        word_counts = np.bincount(tokens)
        print("After using bincount on tokens:")
        print(word_counts[:100])
        top_words = np.argsort(word_counts)[-self.w1:]

        # Initialize context matrices
        L = np.zeros((vocab_size, self.w1))
        R = np.zeros((vocab_size, self.w1))

        # Fill context matrices
        for i in range(len(tokens)-1):
            curr_word = tokens[i]
            # Right context
            next_word = tokens[i+1]
            if next_word in top_words:
                R[curr_word, np.where(top_words == next_word)[0][0]] += 1

            # Left context
            if i > 0:
                prev_word = tokens[i-1]
                if prev_word in top_words:
                    L[curr_word, np.where(top_words == prev_word)[0][0]] += 1

        return L, R

    def svd_transform(self, matrix, rank):
        """Perform reduced rank SVD and return transformed matrix"""
        U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
        # Keep only top rank singular values
        S = np.diag(S[:rank])
        U = U[:, :rank]
        return normalize(U @ S)

    def cluster_descriptors(self, descriptors, n_clusters, word_counts):
        """Perform weighted k-means clustering"""
        # Initialize centroids with most frequent words
        top_indices = np.argsort(word_counts)[-n_clusters:]
        init_centroids = descriptors[top_indices]

        kmeans = KMeans(n_clusters=n_clusters, init=init_centroids, n_init=1)
        return kmeans.fit_predict(descriptors)

    def fit(self, tokens, vocab_size, word_to_id):
        """Perform full SVD2 training process"""
        # First pass
        print("Building initial context matrices...")
        L1, R1 = self.build_context_matrices(tokens, vocab_size, word_to_id)

        print("Performing first SVD transformation...")
        L1_transformed = self.svd_transform(L1, self.r1)
        R1_transformed = self.svd_transform(R1, self.r1)

        # Concatenate descriptors
        descriptors1 = np.hstack([L1_transformed, R1_transformed])

        print("Performing first clustering...")
        word_counts = np.bincount(tokens)
        first_clusters = self.cluster_descriptors(descriptors1, self.k1, word_counts)

        # Second pass
        print("Building refined context matrices...")
        L2 = np.zeros((vocab_size, self.k1))
        R2 = np.zeros((vocab_size, self.k1))

        # Fill second pass context matrices using cluster assignments
        for i in range(len(tokens)-1):
            curr_word = tokens[i]
            # Right context
            next_word = tokens[i+1]
            R2[curr_word, first_clusters[next_word]] += 1

            # Left context
            if i > 0:
                prev_word = tokens[i-1]
                L2[curr_word, first_clusters[prev_word]] += 1

        print("Performing second SVD transformation...")
        L2_transformed = self.svd_transform(L2, self.r2)
        R2_transformed = self.svd_transform(R2, self.r2)

        # Final clustering
        print("Performing final clustering...")
        descriptors2 = np.hstack([L2_transformed, R2_transformed])
        self.final_clusters = self.cluster_descriptors(descriptors2, self.k2, word_counts)

        return self.final_clusters

    def get_cluster_examples(self, tokens, word_to_id, n_examples=5):
        """Get example words for each cluster"""
        id_to_word = {v: k for k, v in word_to_id.items()}
        cluster_examples = defaultdict(list)

        for word_id in range(len(self.final_clusters)):
            cluster = self.final_clusters[word_id]
            if word_id in id_to_word:
                word = id_to_word[word_id]
                cluster_examples[cluster].append(word)

        # Print top n examples for each cluster
        for cluster in sorted(cluster_examples.keys()):
            examples = cluster_examples[cluster][:n_examples]
            print(f"Cluster {cluster}: {', '.join(examples)}")

# # Example usage
# tagger = SVD2Tagger(w1=1000, r1=100, k1=500, r2=300, k2=45)
# clusters = tagger.fit(train_data, vocab_size, word_to_id)
# print("\nCluster Examples:")
# tagger.get_cluster_examples(train_data, word_to_id)

Function prepare_treebank_data

This function takes a list of sentences, where each sentence is a list of 2-tuples: the first element of the tuple is a word of the sentence and the second element is the POS tag for that word in the NLTK Penn Treebank corpus.

The function first converts all the words to lowercase and stores the frequency of every word in `word_freq`. Only words whose frequency is greater than 1 are added to the `word_to_id` hashmap. After that, the `<UNK>` key is added to the hashmap to stand for all the words that were left out (words with frequency = 1). Its id is the length of the hashmap (`len(word_to_id)`) at that point in time.

In the final step, the function builds the `tokens` array by running through all the words again and checking whether the current word exists in the `word_to_id` mapping. If it does, the corresponding id is appended to `tokens`; otherwise the id of `<UNK>` is appended, which is `len(word_to_id) - 1`.

Example

Function Argument: [[("I", ), ("like", ), ("to", ), ("eat", ), ("bananas", )], [("I", ), ("like", ), ("to", ), ("eat", ), ("apples", )]]

The above input shows two sentences that differ in one word (the tags are left out here because the function ignores them).

`word_freq` (after lowercasing):

i - 2

like - 2

to - 2

eat - 2

bananas - 1

apples - 1

`word_to_id`:

i - 0

like - 1

to - 2

eat - 3

`<UNK>` - 4

bananas and apples do not appear in the dictionary because their frequency is 1, so both are mapped to `<UNK>`.

tokens = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

len(word_to_id) = 5


What each return value represents

tokens: Id value for every word in the corpus.

word_to_id: Mapping from each unique word with frequency > 1, plus `<UNK>`, to its id.

vocab_size: Number of unique word ids that we are dealing with, i.e. `len(word_to_id)`.
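
The vocabulary-building steps can be checked on a two-sentence toy input. This is a condensed sketch of the same logic rather than the function itself. The POS tags in the tuples are placeholders added for illustration and are ignored, and `Counter` is swapped in for the `defaultdict` used in the notebook.

```python
from collections import Counter

# Two toy sentences of (word, tag) tuples; the tags are placeholders.
sents = [
    [("I", "PRP"), ("like", "VBP"), ("to", "TO"), ("eat", "VB"), ("bananas", "NNS")],
    [("I", "PRP"), ("like", "VBP"), ("to", "TO"), ("eat", "VB"), ("apples", "NNS")],
]

# Lowercase every word and count frequencies.
words = [word.lower() for sent in sents for word, _ in sent]
word_freq = Counter(words)

# Keep only words seen more than once; ids follow first-seen order.
word_to_id = {}
for word in word_freq:
    if word_freq[word] > 1:
        word_to_id[word] = len(word_to_id)

# <UNK> gets the last id and stands in for every dropped word.
word_to_id["<UNK>"] = len(word_to_id)
unk_id = word_to_id["<UNK>"]

tokens = [word_to_id.get(word, unk_id) for word in words]
vocab_size = len(word_to_id)

print(word_to_id)
print(tokens)
print(vocab_size)
```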


In [19]:
import nltk
import numpy as np
from collections import defaultdict
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
from typing import List, Dict, Tuple
from scipy.optimize import linear_sum_assignment

def prepare_treebank_data(tagged_sents: List[List[Tuple[str, str]]]) -> Tuple[np.ndarray, Dict, int]:
    """
    Prepare Penn Treebank data for SVD2Tagger

    Args:
        tagged_sents: List of sentences, where each sentence is a list of (word, tag) tuples

    Returns:
        tokens: numpy array of word IDs
        word_to_id: dictionary mapping words to IDs
        vocab_size: size of vocabulary
    """
    # Build vocabulary
    word_to_id = {}
    words = [word.lower() for sent in tagged_sents for word, _ in sent]
    # Add words to vocabulary with frequency > 1
    word_freq = defaultdict(int)
    for word in words:
        word_freq[word] += 1

    # Create word_to_id mapping
    for word in word_freq:
        if word_freq[word] > 1:  # Only include words that appear more than once
            word_to_id[word] = len(word_to_id)

    # Add <UNK> token
    word_to_id['<UNK>'] = len(word_to_id)

    # Convert words to IDs
    tokens = []
    for sent in tagged_sents:
        for word, _ in sent:
            word = word.lower()
            if word in word_to_id:
                tokens.append(word_to_id[word])
            else:
                tokens.append(word_to_id['<UNK>'])

    return np.array(tokens), word_to_id, len(word_to_id)

def calculate_M1_score(true_tags, predicted_clusters):
    """
    Calculate Many-to-One accuracy score
    Each cluster is mapped to its most common true tag
    """
    # Create mapping from cluster to most common tag
    cluster_to_tag = defaultdict(lambda: defaultdict(int))

    # Count tag occurrences for each cluster
    for cluster, tag in zip(predicted_clusters, true_tags):
        cluster_to_tag[cluster][tag] += 1

    # Assign each cluster to its most frequent tag
    cluster_tag_mapping = {}
    for cluster in cluster_to_tag:
        majority_tag = max(cluster_to_tag[cluster].items(), key=lambda x: x[1])[0]
        cluster_tag_mapping[cluster] = majority_tag

    # Calculate accuracy
    correct = 0
    total = len(true_tags)

    for pred, true in zip(predicted_clusters, true_tags):
        if pred in cluster_tag_mapping and cluster_tag_mapping[pred] == true:
            correct += 1

    return correct / total

def calculate_1to1_score(true_tags, predicted_clusters):
    """
    Calculate One-to-One accuracy score using the Hungarian algorithm
    Each cluster can only be mapped to one true tag
    """
    # Get unique tags and clusters
    unique_tags = sorted(set(true_tags))
    unique_clusters = sorted(set(predicted_clusters))

    # Create confusion matrix
    confusion_matrix = np.zeros((len(unique_clusters), len(unique_tags)))

    # Fill confusion matrix
    for cluster, tag in zip(predicted_clusters, true_tags):
        cluster_idx = unique_clusters.index(cluster)
        tag_idx = unique_tags.index(tag)
        confusion_matrix[cluster_idx][tag_idx] += 1

    # Use Hungarian algorithm to find optimal one-to-one mapping
    row_ind, col_ind = linear_sum_assignment(-confusion_matrix)

    # Create mapping from cluster to tag
    cluster_tag_mapping = {unique_clusters[i]: unique_tags[j]
                          for i, j in zip(row_ind, col_ind)}

    # Calculate accuracy
    correct = 0
    total = len(true_tags)

    for pred, true in zip(predicted_clusters, true_tags):
        if pred in cluster_tag_mapping and cluster_tag_mapping[pred] == true:
            correct += 1

    return correct / total

def evaluate_clusters(tagger, test_sents, true_tags):
    """
    Evaluate cluster quality against true POS tags
    """
    from sklearn.metrics import v_measure_score, adjusted_rand_score

    # Get cluster assignments for test words
    test_clusters = []
    true_pos_tags = []

    # Collect valid clusters and tags
    for sent, true_sent in zip(test_sents, true_tags):
        for word, true_tag in zip(sent, true_sent):
            word = word.lower()
            if word in tagger.word_to_id:
                word_id = tagger.word_to_id[word]
                if word_id < len(tagger.final_clusters):  # Add check for valid word_id
                    test_clusters.append(tagger.final_clusters[word_id])
                    true_pos_tags.append(true_tag)

    if not test_clusters:
        print("Warning: No valid clusters found for evaluation")
        return {
            'v_measure': 0.0,
            'ari': 0.0,
            'many_to_one': 0.0,
            'one_to_one': 0.0
        }

    # Calculate clustering metrics
    v_measure = v_measure_score(true_pos_tags, test_clusters)
    ari = adjusted_rand_score(true_pos_tags, test_clusters)
    m1_score = calculate_M1_score(true_pos_tags, test_clusters)
    one_to_one_score = calculate_1to1_score(true_pos_tags, test_clusters)

    # Get cluster distribution for each POS tag
    tag_cluster_dist = defaultdict(lambda: defaultdict(int))
    for tag, cluster in zip(true_pos_tags, test_clusters):
        tag_cluster_dist[tag][cluster] += 1

    # Print detailed distribution
    print("\nCluster Distribution for each POS tag:")
    for tag in sorted(tag_cluster_dist.keys()):
        clusters = tag_cluster_dist[tag]
        total = sum(clusters.values())
        main_clusters = sorted(clusters.items(), key=lambda x: x[1], reverse=True)[:3]
        print(f"\n{tag}:")
        for cluster, count in main_clusters:
            print(f"  Cluster-{cluster}: {count/total*100:.1f}%")

    return {
        'v_measure': v_measure,
        'ari': ari,
        'many_to_one': m1_score,
        'one_to_one': one_to_one_score
    }

def main():
    # Load Penn Treebank data
    print("Loading Penn Treebank data...")
    corpus = nltk.corpus.treebank.tagged_sents()
    # Split into train and test
    train_size = int(len(corpus) * 0.8)
    train_data = corpus[:train_size]
    test_data = corpus[train_size:]

    # Prepare data
    print("Preparing data...")
    tokens, word_to_id, vocab_size = prepare_treebank_data(train_data)
    print("Tokens for each selected word")
    print(tokens[:100])
    print("Word to id mappings hashmap.")
    print(word_to_id)

    # Initialize and train SVD2Tagger
    print("Training SVD2Tagger...")
    tagger = SVD2Tagger(w1=1000, r1=100, k1=500, r2=300, k2=45)
    clusters = tagger.fit(tokens, vocab_size, word_to_id)

    # Save word_to_id mapping in tagger
    tagger.word_to_id = word_to_id
    tagger.id_to_word = {v: k for k, v in word_to_id.items()}

    # Print cluster examples
    print("\nCluster Examples:")
    tagger.get_cluster_examples(tokens, word_to_id)

    # Prepare test data for evaluation
    test_words = [[word.lower() for word, _ in sent] for sent in test_data]
    test_tags = [[tag for _, tag in sent] for sent in test_data]

    # Evaluate clusters
    print("\nEvaluating clusters...")
    metrics = evaluate_clusters(tagger, test_words, test_tags)
    print(f"V-measure score: {metrics['v_measure']:.4f}")
    print(f"Adjusted Rand Index: {metrics['ari']:.4f}")
    print(f"Many-to-One (M:1) Accuracy: {metrics['many_to_one']:.4f}")
    print(f"One-to-One (1:1) Accuracy: {metrics['one_to_one']:.4f}")

    # Example tagging
    print("\nExample Tagging:")
    example_sent = "banana banana banana banana apple".lower().split()
    word_ids = [word_to_id.get(word, word_to_id['<UNK>']) for word in example_sent]
    valid_word_ids = [wid for wid in word_ids if wid < len(tagger.final_clusters)]
    clusters = [tagger.final_clusters[wid] for wid in valid_word_ids]

    for word, cluster in zip(example_sent[:len(clusters)], clusters):
        print(f"{word}: Cluster-{cluster}")

if __name__ == "__main__":
    main()

Loading Penn Treebank data...
Preparing data...
Tokens for each selected word
[4896    0    1    2    3    4    1    5    6    7    8    9   10   11
   12   13   14   15   16    0   17   18   19 4896   20    1    7   21
   22   23   15   24 4896    1   25    3    4   26   27   18   19   28
   29   30   31    1   32   33   34   10   11   12   19   35   36   37
 4896   15   10   38   19   39   40   41   42   42   43   44   45   46
   47   48   49   10   50   51   19   52   53   54   10   23   19   55
   56   42   43   57   58   59   60    3   61    1   62   63   64   65
   15    7]
Word to id mappings hashmap.
Training SVD2Tagger...
Building initial context matrices...
After using bincount on tokens:
[   2 3955    3  103   27  216    4 3841   55  336 1539    5   25   19
    3 3068  318  555   40 1854    3    3   16   53    3    7 1274   15
    2    7   13    7  293   17  864  182    9   14   15   11   16   25
  802 1718   68    7    4    7  280    9   49   11    8    6   36   17
    2  4

In [4]:
import nltk
import numpy as np
from collections import defaultdict
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# SVD2Tagger is not imported here; it must already be defined by an earlier cell
from typing import List, Dict, Tuple

def prepare_treebank_data(tagged_sents: List[List[Tuple[str, str]]]) -> Tuple[np.ndarray, Dict, int]:
    """
    Prepare Penn Treebank data for SVD2Tagger

    Args:
        tagged_sents: List of sentences, where each sentence is a list of (word, tag) tuples

    Returns:
        tokens: numpy array of word IDs
        word_to_id: dictionary mapping words to IDs
        vocab_size: size of vocabulary
    """
    # Build vocabulary
    word_to_id = {}
    words = [word.lower() for sent in tagged_sents for word, _ in sent]

    # Add words to vocabulary with frequency > 1
    word_freq = defaultdict(int)
    for word in words:
        word_freq[word] += 1

    # Create word_to_id mapping
    for word in word_freq:
        if word_freq[word] > 1:  # Only include words that appear more than once
            word_to_id[word] = len(word_to_id)

    # Add <UNK> token
    word_to_id['<UNK>'] = len(word_to_id)

    # Convert words to IDs
    tokens = []
    for sent in tagged_sents:
        for word, _ in sent:
            word = word.lower()
            if word in word_to_id:
                tokens.append(word_to_id[word])
            else:
                tokens.append(word_to_id['<UNK>'])

    return np.array(tokens), word_to_id, len(word_to_id)

def evaluate_clusters(tagger, test_sents, true_tags):
    """
    Evaluate cluster quality against true POS tags
    """
    from sklearn.metrics import v_measure_score, adjusted_rand_score

    # Get cluster assignments for test words
    test_clusters = []
    true_pos_tags = []

    for sent, true_sent in zip(test_sents, true_tags):
        for word, true_tag in zip(sent, true_sent):
            word = word.lower()
            if word in tagger.word_to_id:
                word_id = tagger.word_to_id[word]
                test_clusters.append(tagger.final_clusters[word_id])
                true_pos_tags.append(true_tag)

    # Calculate clustering metrics
    v_measure = v_measure_score(true_pos_tags, test_clusters)
    ari = adjusted_rand_score(true_pos_tags, test_clusters)

    return {
        'v_measure': v_measure,
        'ari': ari
    }

def main():
    # Load Penn Treebank data
    print("Loading Penn Treebank data...")
    corpus = nltk.corpus.treebank.tagged_sents()

    # Split into train and test
    train_size = int(len(corpus) * 0.8)
    train_data = corpus[:train_size]
    test_data = corpus[train_size:]

    # Prepare data
    print("Preparing data...")
    tokens, word_to_id, vocab_size = prepare_treebank_data(train_data)

    # Initialize and train SVD2Tagger
    print("Training SVD2Tagger...")
    tagger = SVD2Tagger(w1=1000, r1=100, k1=500, r2=300, k2=45)
    clusters = tagger.fit(tokens, vocab_size, word_to_id)

    # Save word_to_id mapping in tagger
    tagger.word_to_id = word_to_id
    tagger.id_to_word = {v: k for k, v in word_to_id.items()}

    # Print cluster examples
    print("\nCluster Examples:")
    tagger.get_cluster_examples(tokens, word_to_id)

    # Prepare test data for evaluation
    test_words = [[word.lower() for word, _ in sent] for sent in test_data]
    test_tags = [[tag for _, tag in sent] for sent in test_data]

    # Evaluate clusters
    print("\nEvaluating clusters...")
    metrics = evaluate_clusters(tagger, test_words, test_tags)
    print(f"V-measure score: {metrics['v_measure']:.4f}")
    print(f"Adjusted Rand Index: {metrics['ari']:.4f}")

    # Print some example tags
    print("\nExample Tagging:")
    example_sent = "banana banana banana banana apple.".lower().split()
    word_ids = [word_to_id.get(word, word_to_id['<UNK>']) for word in example_sent]
    clusters = [tagger.final_clusters[wid] for wid in word_ids]

    for word, cluster in zip(example_sent, clusters):
        print(f"{word}: Cluster-{cluster}")

if __name__ == "__main__":
    main()

Loading Penn Treebank data...
Preparing data...
Training SVD2Tagger...
Building initial context matrices...
Performing first SVD transformation...
Performing first clustering...
Building refined context matrices...
Performing second SVD transformation...
Performing final clustering...

Cluster Examples:
Cluster 0: 55, 30, 1.5, 9, 27
Cluster 1: lorillard, york-based, 's, heard, questionable
Cluster 2: board, director, form, deaths, decades
Cluster 3: nov., early, hollingsworth, virtually, money-market
Cluster 4: old, dutch, former, british, kent
Cluster 5: vinken, vitulli, spoon, ross, cray
Cluster 6: once, it, researchers, even, although
Cluster 7: join, make, findings, bring, be
Cluster 8: *-5, like, banned, hopes, attempts
Cluster 9: still, looking, worried, closed, priced
Cluster 10: risk, appears, continues, continue, expects
Cluster 11: reported, stopped, worked, dumped, suspended
Cluster 12: will, is, was, has, were
Cluster 13: *-1, *-2, attention, *-3, *-4
Cluster 14: mr., dr., 