# <u>Chapter 10</u>: Clustering Speech-to-Text Transcriptions

`Topic modeling` refers to the task of identifying groups of items, in our case words, that best describe a collection of documents or sentences. The topics are not given in advance but emerge during the modeling process; hence they are called _latent_. A popular topic modeling technique for extracting the hidden topics from a given corpus is `Latent Dirichlet Allocation` (LDA).
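
To make the idea of latent topics more concrete, the sketch below simulates the generative story that LDA assumes: every document draws its own mixture over topics, and every word is drawn from one of those topics. The two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are illustrative assumptions; they are not part of the dataset or of any library.

```python
import random

# Hypothetical toy topics: each maps words to probabilities (assumed values).
TOPICS = {
    "cooking": {"flour": 0.4, "sugar": 0.3, "butter": 0.3},
    "politics": {"court": 0.5, "government": 0.3, "president": 0.2},
}

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in TOPICS]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, alpha=0.5, seed=0):
    """Generate one document by following the LDA generative process."""
    rng = random.Random(seed)
    names = list(TOPICS)
    # 1. Draw the (latent) topic mixture of the document.
    mixture = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(num_words):
        # 2. Draw a latent topic for this word position.
        topic = rng.choices(names, weights=mixture)[0]
        # 3. Draw the observed word from the word distribution of that topic.
        vocab = list(TOPICS[topic])
        probs = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

mixture, words = generate_document(8)
print(mixture, words)
```

Fitting an LDA model runs this story in reverse: only the words are observed, and the topic mixtures and the word distributions of the topics are inferred from them.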

In [None]:
import sys
import subprocess
import pkg_resources

# Find out which packages are missing.
# Note: dist.key is always lowercase, so the required names must be lowercase too.
installed_packages = {dist.key for dist in pkg_resources.working_set}
required_packages = {'pandas', 'spacy', 'gensim', 'pyldavis'}
missing_packages = required_packages - installed_packages

# If there are missing packages install them.
if missing_packages:
    print('Installing the following packages: ' + str(missing_packages))
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing_packages], stdout=subprocess.DEVNULL)

## Data preparation

First, we obtain a copy of the dataset from the CSV file.

In [46]:
import pandas as pd

# Read the hypotheses from the speech-to-text.
hypothesis_df = pd.read_csv('data/hypothesis.csv', names=['hypothesis'], skiprows=1)
corpus = hypothesis_df[hypothesis_df['hypothesis'] != "<ERROR>"]

corpus.head()

Unnamed: 0,hypothesis
0,the cheap paper should not sacrifice toughness...
1,it is obvious that legibility is the first thi...
3,America which is the worst conceivable
4,it must be said that it is in no way like the ...
5,this experiment was so far successful that abo...


Next, we tokenize the sentences and perform a light preprocessing of the words.

In [47]:
from spacy.lang.en import English

nlp = English()

# Tokenize the input text.
def tokenize(text):
    tokens = []
    doc = nlp(text)

    for word in doc:
        # Checks whether the word consists of whitespace.
        if word.orth_.isspace():
            continue
        # Does the word resemble a URL?
        elif word.like_url:
            tokens.append('URL')
        # Does the word resemble an email address?
        elif word.like_email:
            tokens.append('EMAIL')
        else:
            tokens.append(word.lower_)

    return tokens

We load a set of stop words to be used later.

In [48]:
import spacy

sp = spacy.load("en_core_web_sm")

# Define the list of stopwords.
stop_words = sp.Defaults.stop_words

Let's define the standard method for lemmatization.

In [49]:
# Lemmatize the input text.
def lemmatize(text):
    return ' '.join(token.lemma_ for token in sp(text)).strip()

An important design decision for this experiment is to keep only the nouns and adjectives of the input.

In [50]:
# Keep only the nouns and adjectives.
def filter_nouns_adj(text):
    sentence = sp(text.lower())
    return ' '.join(token.text for token in sentence if token.pos_ in ('NOUN', 'ADJ')).strip()

We define a parse sequence for the data:
* Keep only nouns and adjectives.
* Tokenize the input.
* Remove stop words.
* Lemmatize the tokens.
* Keep tokens with more than 4 characters.

In [51]:
# Extract the text for LDA.
def extract_text_for_lda(text):
    filtered_text = filter_nouns_adj(text)
    tokens = tokenize(filtered_text)
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if len(t) > 4]

    return tokens

text_data = []

# Parse all data from the corpus.
for _, row in corpus.iterrows():
    tokens = extract_text_for_lda(row.hypothesis)
    text_data.append(tokens)

We are now ready to apply LDA using _gensim_, a library that provides a suite of tools for topic modeling. First, we need to transform the data into the bag-of-words format expected by the library. We also save this data to a file for future reference.

In [52]:
import gensim
import pickle

# Transform the data for gensim.
dictionary = gensim.corpora.Dictionary(text_data)
corpus = [dictionary.doc2bow(text) for text in text_data]

# Save the data in a file.
with open('./data/corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('./data/dictionary.gensim')
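
The bag-of-words format pairs each token id with the number of times that token occurs in a document. The following is a simplified, pure-Python stand-in for what the dictionary and `doc2bow` produce. The helper names `build_vocabulary` and `to_bow` are hypothetical, and the id assignment used here (order of first appearance) is an assumption that may differ from the ids _gensim_ assigns.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(tokens, vocabulary):
    """Return sorted (token_id, count) pairs; tokens not in the vocabulary are ignored."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

docs = [["paper", "letter", "paper"], ["water", "plant", "letter"]]
vocab = build_vocabulary(docs)
print(vocab)                   # {'paper': 0, 'letter': 1, 'water': 2, 'plant': 3}
print(to_bow(docs[0], vocab))  # [(0, 2), (1, 1)]
```

Word order is discarded in this representation; only the counts per document remain, which is all that LDA uses.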

### Identify 3 topics

Let's create the LDA model with 3 topics and obtain the 5 most common words for each topic.

In [58]:
# Create and save the model for 3 topics.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15, random_state=123)
ldamodel.save('./data/model3.gensim')

# Get the 5 most common words per topic.
topics = ldamodel.print_topics(num_words=5)
for t in topics:
    print(t)

(0, '0.007*"court" + 0.006*"government" + 0.006*"public" + 0.006*"morning" + 0.006*"century"')
(1, '0.011*"great" + 0.008*"paper" + 0.006*"modern" + 0.006*"service" + 0.006*"bread"')
(2, '0.019*"letter" + 0.012*"plant" + 0.012*"water" + 0.009*"animal" + 0.008*"house"')
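
Each topic is printed as a string of `weight*"word"` terms joined by `+`. When the weights are needed as numbers, such a string can be parsed. The sketch below applies a regular expression to the first topic string printed above; the helper name `parse_topic` is hypothetical. The model's `show_topic` method should return the same information directly as pairs, so parsing is mainly useful when only the printed output is available.

```python
import re

# Matches terms of the form 0.007*"court".
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

topic_0 = '0.007*"court" + 0.006*"government" + 0.006*"public" + 0.006*"morning" + 0.006*"century"'
terms = parse_topic(topic_0)
print(terms)
```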


Next, we extract the topic distribution for a new input text.

In [54]:
# Text to identify a topic.
test = 'the assassination of president kennedy took place at dallas, texas'
test = extract_text_for_lda(test)
test_bow = dictionary.doc2bow(test)

print(ldamodel.get_document_topics(test_bow))

[(0, 0.6977441), (1, 0.11669667), (2, 0.18555923)]
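
The result is a list of `(topic_id, probability)` pairs. A common follow-up is to pick the most probable topic for the document. A minimal sketch using the values printed above (the helper name `dominant_topic` is hypothetical):

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_distribution, key=lambda pair: pair[1])

distribution = [(0, 0.6977441), (1, 0.11669667), (2, 0.18555923)]
topic_id, probability = dominant_topic(distribution)
print(f"Topic {topic_id} with probability {probability:.2f}")  # Topic 0 with probability 0.70
```

For this sentence, topic 0 (whose top words include "court" and "government") clearly dominates the other two.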


### Identify 5 topics

We repeat the same process for five topics.

In [55]:
# Create and save the model for 5 topics.
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15, random_state=123)
ldamodel.save('./data/model5.gensim')

# Get the 5 most common words per topic.
topics = ldamodel.print_topics(num_words=5)
for t in topics:
    print(t)

(0, '0.013*"government" + 0.010*"tablespoon" + 0.008*"business" + 0.007*"sugar" + 0.007*"butter"')
(1, '0.008*"house" + 0.007*"rifle" + 0.007*"president" + 0.007*"paper" + 0.006*"number"')
(2, '0.029*"letter" + 0.009*"beautiful" + 0.008*"great" + 0.008*"character" + 0.008*"roman"')
(3, '0.015*"court" + 0.014*"great" + 0.009*"bread" + 0.007*"modern" + 0.006*"people"')
(4, '0.019*"plant" + 0.012*"water" + 0.011*"flour" + 0.008*"animal" + 0.008*"service"')


## pyLDAvis

Next, we create an interactive visualization to examine the newly constructed LDA models.

_pyLDAvis_ is a Python library for interactive topic model visualization. In each plot:
* the red bars show the estimated frequency of each term within the selected topic, which indicates how much the term tells us about that topic
* the size of a bubble measures the prevalence of its topic in the data
* bubbles that are closer together represent more similar topics
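
The distance between bubbles reflects how different the word distributions of the topics are. As far as we know, _pyLDAvis_ measures this by default with the Jensen-Shannon divergence before projecting the topics onto two dimensions. The sketch below computes that divergence for three toy topics; the distributions and the function names are illustrative assumptions, not values taken from the models in this chapter.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions over the same four-word vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]
topic_c = [0.05, 0.05, 0.4, 0.5]

print(jensen_shannon(topic_a, topic_b))  # small value: similar topics
print(jensen_shannon(topic_a, topic_c))  # larger value: dissimilar topics
```

Topics whose divergence is small end up close to each other in the plot, while topics with very different vocabularies are placed far apart.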



### Visualize 3 topics

In [56]:
import pyLDAvis
import pyLDAvis.gensim_models

# Load the dictionary and the corpus.
dictionary = gensim.corpora.Dictionary.load('./data/dictionary.gensim')
with open('./data/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)

# Read the LDA model, store and show the visualization in HTML.
lda = gensim.models.ldamodel.LdaModel.load('./data/model3.gensim')
lda_display = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_display, './data/lda-3-topics.html')
pyLDAvis.display(lda_display)



### Visualize 5 topics

In [57]:
# Read the LDA model, store and show the visualization in HTML.
lda = gensim.models.ldamodel.LdaModel.load('./data/model5.gensim')
lda_display = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.save_html(lda_display, './data/lda-5-topics.html')
pyLDAvis.display(lda_display)



## What we have learned …

| | | |
| --- | --- | --- |
| **Visualizations**<ul><li>pyLDAvis plots</li></ul> | **ML concepts** <ul><li>Clustering</li></ul> | **ML algorithms & models** <ul><li>Latent Dirichlet Allocation</li></ul> |
| | | |