# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [35]:
import os
from collections import defaultdict

import gensim
import nltk

In [36]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

For tokenizing words and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [37]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/david/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/david/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim, which contains 300 news articles from the Australian Broadcasting Corporation.

In [38]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
# one article per line; the context manager closes the file afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

Preprocess the text by tokenizing, lowercasing, removing stopwords, and (optionally) stemming. Then build the dictionary and drop words that are too rare or too frequent.

In [39]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article, stem=False):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    if stem:
        article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents, or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # indexing once populates dictionary.id2token, which is built lazily
corpus_bow = [dictionary.doc2bow(article) for article in articles]

2025-02-27 12:31:55,565 : INFO : adding document #0 to Dictionary<0 unique tokens: []>
2025-02-27 12:31:55,593 : INFO : built Dictionary<7349 unique tokens: ["'ve", ',', '.', '100', '4:00pm']...> from 300 documents (total 40467 corpus positions)
2025-02-27 12:31:55,594 : INFO : Dictionary lifecycle event {'msg': 'built Dictionary<7349 unique tokens: ["\'ve", \',\', \'.\', \'100\', \'4:00pm\']...> from 300 documents (total 40467 corpus positions)', 'datetime': '2025-02-27T12:31:55.594628', 'gensim': '4.3.3', 'python': '3.9.21 (main, Dec 11 2024, 16:24:11) \n[GCC 11.2.0]', 'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'event': 'created'}
2025-02-27 12:31:55,602 : INFO : discarding 3829 tokens: [(',', 294), ('.', 300), ('associated', 1), ('burn', 1), ('claire', 1), ('cranebrook', 1), ('deterioration', 1), ('directions', 1), ('falls', 1), ('finger', 1)]...
2025-02-27 12:31:55,602 : INFO : keeping 3520 tokens which were in no less than 2 and no more than 150 
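
To make the dictionary and bag-of-words steps more concrete, here is a minimal pure-Python sketch of document-frequency filtering and the token-to-count mapping. It is an illustration under simplifying assumptions, not gensim's implementation: the toy documents and the helper names `build_vocab` and `to_bow` are hypothetical, and the upper cutoff is assumed to be `int(no_above * num_docs)`, which matches the "no more than 150" in the log above for 300 documents.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.5):
    """Map each kept token to an integer id, filtering by document frequency."""
    # document frequency: number of documents that contain the token at least once
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))  # assumed cutoff, see lead-in
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs; out-of-vocabulary tokens are dropped."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# hypothetical toy corpus of already tokenized documents
toy_docs = [
    ["fire", "bushfire", "sydney", "fire"],
    ["fire", "rain", "sydney"],
    ["cricket", "test", "sydney"],
    ["cricket", "rain", "match"],
]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows[0])
```

In this toy corpus, "sydney" is dropped because it appears in 3 of 4 documents (more than 50%), and words seen in only one document are dropped as too rare. Tokens that are not in the vocabulary simply vanish from the bag-of-words vector, which is why a query has to go through the same preprocessing as the corpus.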


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [40]:
from gensim import models


tfidf = models.TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]

print(corpus_bow[0])
print(corpus_tfidf[0])

2025-02-27 12:31:55,631 : INFO : collecting document frequencies
2025-02-27 12:31:55,632 : INFO : PROGRESS: processing document #0
2025-02-27 12:31:55,643 : INFO : TfidfModel lifecycle event {'msg': 'calculated IDF weights for 300 documents and 3520 features (22934 matrix non-zeros)', 'datetime': '2025-02-27T12:31:55.643612', 'gensim': '4.3.3', 'python': '3.9.21 (main, Dec 11 2024, 16:24:11) \n[GCC 11.2.0]', 'platform': 'Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35', 'event': 'initialize'}


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]
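
As a rough illustration of what the TF-IDF transformation does to a bag-of-words vector, the sketch below weights each count by `log2(N / df)` and then scales the vector to unit length. This is assumed to mirror `TfidfModel`'s default settings; the function name `tfidf_vector` and the toy numbers are hypothetical, and the code is a from-scratch sketch rather than gensim's own implementation.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by tf * log2(N / df), then L2-normalize."""
    weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in bow]
    # tokens that occur in every document get weight 0 and are dropped
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm > 0.0 else []

# hypothetical toy data: 4 documents, document frequency per token id
doc_freq = {0: 1, 1: 2, 2: 4}
bow = [(0, 1), (1, 2), (2, 3)]
vec = tfidf_vector(bow, doc_freq, num_docs=4)
print(vec)
```

Token 2 is the most frequent term in the document, yet it disappears from the result because it occurs in all 4 documents and therefore carries no discriminating information. Tokens 0 and 1 end up with equal weight: the rarer token appears once, the more common one twice.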

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [41]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
num_topics = 10

# Mapping from token id to token (filled lazily, hence the dictionary[0] lookup earlier).
id2word = dictionary.id2token

model_lda = LdaModel(
    corpus=corpus_tfidf,
    id2word=id2word,
    num_topics=num_topics,
)

2025-02-27 12:31:55,656 : INFO : using symmetric alpha at 0.1
2025-02-27 12:31:55,658 : INFO : using symmetric eta at 0.1
2025-02-27 12:31:55,660 : INFO : using serial LDA version on this node
2025-02-27 12:31:55,664 : INFO : running online (single-pass) LDA training, 10 topics, 1 passes over the supplied corpus of 300 documents, updating model once every 300 documents, evaluating perplexity every 300 documents, iterating 50x with a convergence threshold of 0.001000
2025-02-27 12:31:55,814 : INFO : -27.008 per-word bound, 134934897.8 perplexity estimate based on a held-out corpus of 300 documents with 2210 words
2025-02-27 12:31:55,815 : INFO : PROGRESS: pass 0, at document #300/300
2025-02-27 12:31:55,902 : INFO : topic #2 (0.100): 0.002*"qantas" + 0.002*"workers" + 0.002*"afghanistan" + 0.002*"hewitt" + 0.002*"industrial" + 0.001*"afghan" + 0.001*"agreement" + 0.001*"britain" + 0.001*"maintenance" + 0.001*"commission"
2025-02-27 12:31:55,904 : INFO : topic #1 (0.100): 0.002*"union" +

Let's inspect 5 of the topics of our model.

In [42]:
model_lda.print_topics(5)

2025-02-27 12:31:55,921 : INFO : topic #0 (0.100): 0.002*"alliance" + 0.002*"northern" + 0.001*"test" + 0.001*"bill" + 0.001*"taliban" + 0.001*"afghanistan" + 0.001*"mr" + 0.001*"palestinian" + 0.001*"interim" + 0.001*"us"
2025-02-27 12:31:55,923 : INFO : topic #3 (0.100): 0.002*"detainees" + 0.001*"reid" + 0.001*"stage" + 0.001*"palestinian" + 0.001*"road" + 0.001*"human" + 0.001*"government" + 0.001*"mr" + 0.001*"timor" + 0.001*"south"
2025-02-27 12:31:55,925 : INFO : topic #6 (0.100): 0.002*"mr" + 0.001*"arafat" + 0.001*"$" + 0.001*"new" + 0.001*"zimbabwe" + 0.001*"giuliani" + 0.001*"year" + 0.001*"hih" + 0.001*"afp" + 0.001*"nauru"
2025-02-27 12:31:55,927 : INFO : topic #8 (0.100): 0.003*"palestinian" + 0.002*"man" + 0.002*"hamas" + 0.002*"israeli" + 0.002*"india" + 0.002*"melbourne" + 0.002*"say" + 0.002*"security" + 0.001*"gaza" + 0.001*"club"
2025-02-27 12:31:55,928 : INFO : topic #5 (0.100): 0.002*"test" + 0.002*"south" + 0.002*"lee" + 0.002*"bowler" + 0.002*"virgin" + 0.002*"p

[(0,
  '0.002*"alliance" + 0.002*"northern" + 0.001*"test" + 0.001*"bill" + 0.001*"taliban" + 0.001*"afghanistan" + 0.001*"mr" + 0.001*"palestinian" + 0.001*"interim" + 0.001*"us"'),
 (3,
  '0.002*"detainees" + 0.001*"reid" + 0.001*"stage" + 0.001*"palestinian" + 0.001*"road" + 0.001*"human" + 0.001*"government" + 0.001*"mr" + 0.001*"timor" + 0.001*"south"'),
 (6,
  '0.002*"mr" + 0.001*"arafat" + 0.001*"$" + 0.001*"new" + 0.001*"zimbabwe" + 0.001*"giuliani" + 0.001*"year" + 0.001*"hih" + 0.001*"afp" + 0.001*"nauru"'),
 (8,
  '0.003*"palestinian" + 0.002*"man" + 0.002*"hamas" + 0.002*"israeli" + 0.002*"india" + 0.002*"melbourne" + 0.002*"say" + 0.002*"security" + 0.001*"gaza" + 0.001*"club"'),
 (5,
  '0.002*"test" + 0.002*"south" + 0.002*"lee" + 0.002*"bowler" + 0.002*"virgin" + 0.002*"palestinian" + 0.001*"match" + 0.001*"new" + 0.001*"mr" + 0.001*"macgill"')]

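We see 5 of the 10 topics. Unlike LSA, LDA has no natural ordering of topics, so the subset returned here is arbitrary and can change between training runs. For each topic, the 10 most probable words are shown, together with their weight (probability) within the topic.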

## Document Similarity
We now use our LDA model to measure the similarity between new documents (*queries*) and the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [43]:
from gensim import similarities
index = similarities.MatrixSimilarity(model_lda[corpus_tfidf])

2025-02-27 12:31:56,069 : INFO : creating matrix with 300 documents and 10 features


Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [44]:
def get_lda_representation(query):
    # apply the same preprocessing as for the corpus documents
    tokens = preprocess(query)
    # bag-of-words -> TF-IDF -> LDA topic distribution
    query_bow = dictionary.doc2bow(tokens)
    query_tfidf = tfidf[query_bow]
    return model_lda[query_tfidf]

query = "An earthquake is really dangerous and can kill many people"
print(get_lda_representation(query))

[(0, 0.032822028), (1, 0.032820974), (2, 0.032821283), (3, 0.032839775), (4, 0.032827053), (5, 0.18322982), (6, 0.55416155), (7, 0.03283031), (8, 0.032826815), (9, 0.032820377)]


Print the 5 most similar documents, together with their similarity scores, using the index created above.

In [45]:
query_lda = get_lda_representation(query)
sims = index[query_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims[:5]:
    print(f"Similarity: {doc_score}, Article: {articles_orig[doc_position]}")

Similarity: 0.9687249064445496, Article: Australia's quicks and opening batsmen have put the side in a dominant position going into day three of the Boxing Day Test match against South Africa at the MCG. Australia is no wicket for 126, only 151 runs shy of South Africa after Andy Bichel earlier starred as the tourists fell for 277. When play was abandoned due to rain a few overs short of scheduled stumps yesterday, Justin Langer was not out 67 and Matthew Hayden 55. The openers went on the attack from the start, with Langer's innings including six fours and Hayden's eight. Earlier, Shaun Pollock and Nantie Haywood launched a vital rearguard action to help South Africa to a respectable first innings total. The pair put on 44 runs for the final wicket to help the tourists to 277. The South Africans had slumped to 9 for 233 through a combination of Australia's good bowling, good fielding and good luck. After resuming at 3 for 89 yesterday morning, the tourists looked to be cruising as Jac
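
The scores returned by the `MatrixSimilarity` index are cosine similarities between the query's topic vector and each document's topic vector. The following pure-Python sketch reproduces that ranking step on made-up topic distributions; the helper names `to_dense` and `cosine` and all numbers are hypothetical, chosen only to show the mechanics.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand (topic_id, weight) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = weight
    return dense

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# hypothetical topic distributions for three documents and one query
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(1, 0.8), (2, 0.2)],
    [(0, 0.5), (2, 0.5)],
]
query_topics = [(0, 0.7), (2, 0.3)]

num_topics = 3
q = to_dense(query_topics, num_topics)
scores = [cosine(q, to_dense(d, num_topics)) for d in doc_topics]

# same ranking idiom as in the cell above: sort by descending similarity
ranking = sorted(enumerate(scores), key=lambda item: -item[1])
for doc_position, score in ranking[:2]:
    print(doc_position, round(score, 3))
```

Cosine similarity only compares directions in topic space. Two texts that load on the same dominant topic therefore score close to 1 even when they share no words, which is worth keeping in mind when judging the results above.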

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?