# Week 5, Lesson 4, Activity 8: LDA for Topic Modelling

&copy;2021, Ekaterina Kochmar \
(updated to newer library versions: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a topic modelling approach using Latent Dirichlet Allocation (LDA) and apply it to the `20 Newsgroups` dataset.

## Step 1: Load the data

First, let's import the libraries that we are going to use in this notebook. Then, let's define a function to load the *training* and *test* subsets using a predefined list of categories. Note that you are working with the same dataset as last week.

In [None]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np

def load_dataset(a_set, cats):
    dataset = fetch_20newsgroups(subset=a_set, categories=cats,
                          remove=('headers', 'footers', 'quotes'),
                          shuffle=True)
    return dataset

categories = ["comp.windows.x", "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball"]
categories += ["rec.sport.hockey", "sci.crypt", "sci.med", "sci.space", "talk.politics.mideast"]

newsgroups_all = load_dataset(# load_dataset using 'all' as a_set and the list of categories from above
                              )
print(len(newsgroups_all.data))

## Step 2: Preprocess the data

Convert word forms to stems to get concise representations for the documents: 

In [None]:
import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem(text):
    # return stemmed text

Import `gensim` and preprocess the data:

In [None]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS as stopwords

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text, min_len=4):
        # if token is not in stopwords, stem it and append to the result list
    return result
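
The shape of this preprocessing pipeline (tokenise, drop short tokens, drop stopwords, normalise) can be illustrated with a minimal pure-Python sketch. It is not the `gensim` implementation: the names `TOY_STOPWORDS` and `toy_preprocess` and the tiny stopword list are hypothetical, and the `normalise` argument is only a placeholder for where a stemmer would be applied.

```python
import re

# Tiny hypothetical stopword list, for illustration only
TOY_STOPWORDS = {"this", "that", "with", "from", "have"}

def toy_preprocess(text, min_len=4, normalise=lambda tok: tok):
    """Lowercase, tokenise, filter by length and stopwords, then normalise."""
    # Keep alphabetic runs only (an assumption made for this sketch)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalise(tok) for tok in tokens
            if len(tok) >= min_len and tok not in TOY_STOPWORDS]

sample = "This engine runs on less fuel than that old model"
result = toy_preprocess(sample)
print(result)
```

Short tokens such as "on" and "old" are removed by the length filter, and "this" and "that" by the stopword filter.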

Check how each document is represented. For example, look into the very first one:

In [None]:
doc_sample = newsgroups_all.data[0]
print('Original document: ')
print(doc_sample)

print('\n\nTokenized document: ')
words = []
for token in gensim.utils.tokenize(doc_sample):
    words.append(token)
print(words)

print('\n\nPreprocessed document: ')
# print preprocessed document for comparison

What do the first 10 look like?

In [None]:
for i in range(0, 10):
    print(str(i) + "\t" + ", ".join(preprocess(newsgroups_all.data[i])[:10]))

Now let's preprocess every document and build a dictionary of the words that occur across the whole collection. Each word (*value* in the dictionary) has a unique integer identifier (*key*):

In [None]:
processed_docs = []
for i in range(0, len(newsgroups_all.data)):
    processed_docs.append(preprocess(newsgroups_all.data[i]))

print(len(processed_docs))
    
dictionary = gensim.corpora.Dictionary(processed_docs)
print(len(dictionary))

index = 0
# print(key, value) for the first 10 items in dictionary 


Put some constraints on the dictionary of terms: for instance, keep only words that occur in at least $10$ documents (`no_below`) and in no more than $50\%$ of the documents (`no_above`), and of those keep at most the $100000$ most frequent (`keep_n`). This should help you extract the most useful terms, while still keeping a reasonable number of them.

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)
print(len(dictionary))
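
To make the effect of `no_below` and `no_above` concrete, here is a minimal pure-Python sketch of filtering by document frequency on a toy corpus. It only illustrates the idea and is not `gensim`'s code: the helper name `filter_by_doc_freq`, the toy documents, and the alphabetical tie-breaking are assumptions made for this example.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above, keep_n):
    """Hypothetical helper: keep tokens by how many documents contain them."""
    # Document frequency: each token counts once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # no_above is a fraction of the corpus; convert it to a document count
    max_docs = int(no_above * len(docs))
    kept = [tok for tok, df in doc_freq.items() if no_below <= df <= max_docs]
    # Of the survivors, keep the keep_n tokens found in the most documents
    kept.sort(key=lambda tok: (-doc_freq[tok], tok))
    return kept[:keep_n]

toy_docs = [
    ["engin", "car", "drive"],
    ["car", "price", "sale"],
    ["car", "engin", "oil"],
    ["car", "game", "team"],
]
vocab = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75, keep_n=10)
print(vocab)
```

Here "car" is dropped because it appears in every document, and words found in only one document are dropped as too rare, which leaves only "engin".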

Let's see how a particular document is represented in this dictionary: for example, look into the very first post.

In [None]:
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# take a look into the very first post in bow_corpus – print the contents

Let's decode what each index (key) in this dictionary points to:

In [None]:
bow_doc = bow_corpus[0]

# Each entry in bow_doc is a (token id, number of occurrences) pair
for token_id, count in bow_doc:
    print(f'Key {token_id} ="{dictionary[token_id]}": occurrences={count}')
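
The bag-of-words representation decoded above can be mimicked in plain Python, which may help when reading the `(id, count)` pairs. This is a simplified sketch, not how `gensim` assigns identifiers: the helpers `build_token_ids` and `to_bow` are hypothetical, and ids are assigned here in order of first appearance.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token id, count) pairs."""
    # Tokens missing from the mapping are ignored
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

toy_docs = [["space", "launch", "space"], ["launch", "orbit"]]
token2id = build_token_ids(toy_docs)
bow = to_bow(toy_docs[0], token2id)
print(bow)
```

The word order of the document is lost: only which words occur, and how often, is retained.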

## Step 3: Train an LDA model


In [None]:
# Reuse the filtered dictionary as the id-to-word mapping
id2word = dictionary

# Reuse the bag-of-words corpus ((token id, count) pairs per document)
corpus = bow_corpus

# Build the LDA model
# Check gensim documentation and familiarise yourself with LdaModel functionality
# (see https://radimrehurek.com/gensim/models/ldamodel.html)
# Experiment with other LDA model settings
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=1000,
                                           passes=10,
                                           alpha='symmetric',
                                           iterations=100,
                                           per_word_topics=True)


for index, topic in lda_model.print_topics(-1):
    print(f"Topic: {index} \nWords: {topic}")

## Step 4: Analyse the results

What is the most representative topic in each document?

In [None]:
def analyse_topics(ldamodel, corpus, texts):
    main_topic = {}
    percentage = {}
    keywords = {}
    text_snippets = {}
    # Get main topic in each document
    for i, topic_list in enumerate(ldamodel[corpus]):
        topic = topic_list[0] if ldamodel.per_word_topics else topic_list            
        topic = sorted(topic, key=lambda x: (x[1]), reverse=True)
        # Get the main topic, contribution (%) and keywords for each document
        for j, (topic_num, prop_topic) in enumerate(topic):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp[:5]])
                main_topic[i] = int(topic_num)
                percentage[i] = round(prop_topic,4)
                keywords[i] = topic_keywords
                text_snippets[i] = texts[i][:8]
            else:
                break
    return main_topic, percentage, keywords, text_snippets


main_topic, percentage, keywords, text_snippets = analyse_topics(# apply to the relevant structures
                                                                 )

# Show the first 10 documents; 'Contribution' is a proportion between 0 and 1
indexes = list(range(10))
rows = [['ID', 'Main Topic', 'Contribution', 'Keywords', 'Snippet']]

for idx in indexes:
    rows.append([str(idx), f"{main_topic.get(idx)}", 
                f"{percentage.get(idx):.4f}",
                f"{keywords.get(idx)}\n",
                f"{text_snippets.get(idx)}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]
for row in rows:
    print(''.join(' {:{width}} '.format(row[i], width=column_widths[i]) 
                  for i in range(0, len(row)))) 
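
The core of the analysis above is picking, for each document, the topic with the highest probability. A minimal sketch of that single step on a hand-written distribution is shown below; the function name `dominant_topic` and the numbers in `toy_dist` are made up for illustration and do not come from the trained model.

```python
def dominant_topic(topic_dist):
    """Return the (topic id, probability) pair with the highest probability."""
    return max(topic_dist, key=lambda pair: pair[1])

# Hypothetical per-document topic distribution as (topic id, probability) pairs
toy_dist = [(0, 0.12), (3, 0.61), (7, 0.27)]
topic_id, prob = dominant_topic(toy_dist)
print(f"Main topic: {topic_id}, contribution: {prob:.2%}")
```

Note that the contribution is a proportion between 0 and 1; the `.2%` format specifier converts it to a percentage only for display.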
    

Finally, here is how you can explore words and topics with `pyLDAvis` (for installation instructions, see https://pyldavis.readthedocs.io/en/latest/readme.html):

In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary=lda_model.id2word)
vis