# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [2]:
import os
from collections import defaultdict

import gensim
import nltk

For tokenization and stopword removal, download the NLTK punkt tokenizer and stopword list.

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim, which contains 300 news articles from the Australian Broadcasting Corporation.

In [4]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
with open(train_file) as f:  # close the file handle once the text is read
    articles_orig = f.read().splitlines()

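Preprocess the text by tokenizing, lowercasing, and removing stopwords; stemming is optional and is commented out below. Then build the dictionary and drop words that are very rare or very common.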

In [5]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # access one entry so gensim builds its id -> word mapping
corpus_bow = [dictionary.doc2bow(article) for article in articles]
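
To make the effect of `filter_extremes` concrete, here is a minimal pure-Python sketch of a document-frequency filter on a toy corpus. The function name `filter_by_document_frequency` and the toy documents are made up for illustration; gensim's own implementation differs in details (for example, it also caps the vocabulary size via `keep_n`), so treat this as an approximation rather than the library's code.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Return the tokens that survive a document-frequency filter (illustrative helper)."""
    # count in how many documents each token appears, not how often overall
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # absolute upper bound on document frequency
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# toy corpus: "fire" is in every document, "qantas", "hih", "report" in only one
toy_docs = [
    ["fire", "sydney", "police"],
    ["fire", "qantas", "police"],
    ["fire", "police", "hih"],
    ["fire", "sydney", "report"],
]
kept = filter_by_document_frequency(toy_docs)
print(sorted(kept))
```

With these toy documents, only words that appear in exactly two of the four documents pass both thresholds.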


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [6]:
model_tfidf = gensim.models.TfidfModel(corpus_bow)
corpus_tfidf = model_tfidf[corpus_bow]

print(corpus_bow[0])
print(corpus_tfidf[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]
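
As a rough illustration of what the TF-IDF transformation does to a bag-of-words vector like the one above, the sketch below weights each term count by `log2(N / df)` and then normalizes the vector to unit length. This mirrors what I understand to be gensim's default settings, but the helper `tfidf_vector` and the toy numbers are assumptions made for this example, not gensim code.

```python
import math

def tfidf_vector(doc_bow, doc_freq, num_docs):
    """Weight a bag-of-words vector as tf * log2(N / df), then L2-normalize (illustrative helper)."""
    weighted = [(tid, tf * math.log2(num_docs / doc_freq[tid])) for tid, tf in doc_bow]
    # terms that occur in every document get weight 0 and are dropped
    weighted = [(tid, w) for tid, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted]

# toy setting: 4 documents; term 0 occurs in 1 of them, term 1 in 2, term 2 in all 4
toy_doc_freq = {0: 1, 1: 2, 2: 4}
toy_bow = [(0, 1), (1, 2), (2, 3)]
toy_tfidf = tfidf_vector(toy_bow, toy_doc_freq, num_docs=4)
print(toy_tfidf)
```

Term 2 has the highest raw count but appears in every document, so it carries no weight after the transformation.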

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [7]:
model_lda = gensim.models.LdaModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=10)

Let's inspect 5 of our model's topics.

In [8]:
model_lda.print_topics(5)

[(2,
  '0.002*"detainees" + 0.002*"woomera" + 0.002*"arafat" + 0.002*"centre" + 0.002*"fire" + 0.002*"palestinian" + 0.002*"night" + 0.001*"labor" + 0.001*"south" + 0.001*"firefighters"'),
 (0,
  '0.002*"afghanistan" + 0.002*"river" + 0.002*"bin" + 0.002*"laden" + 0.002*"man" + 0.002*"metres" + 0.002*"mr" + 0.002*"world" + 0.001*"states" + 0.001*"palestinian"'),
 (9,
  '0.003*"hih" + 0.002*"company" + 0.002*"commission" + 0.002*"report" + 0.002*"palestinian" + 0.002*"royal" + 0.002*"mr" + 0.001*"killed" + 0.001*"year" + 0.001*"road"'),
 (7,
  '0.002*"radio" + 0.001*"world" + 0.001*"south" + 0.001*"two" + 0.001*"abloy" + 0.001*"assa" + 0.001*"river" + 0.001*"krishna" + 0.001*"warne" + 0.001*"week"'),
 (6,
  '0.002*"qantas" + 0.002*"reid" + 0.002*"israeli" + 0.002*"palestinian" + 0.002*"mr" + 0.001*"australian" + 0.001*"suicide" + 0.001*"hewitt" + 0.001*"workers" + 0.001*"maintenance"')]

We see 5 of the 10 topics. LDA topics have no natural ordering, so which topics are shown is arbitrary and can change between training runs. For each topic, the 10 most probable words are shown, together with their probability under that topic.

## Document Similarity
We now use our LDA model to measure the similarity between new documents (*queries*) and the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [9]:
index = gensim.similarities.MatrixSimilarity(model_lda[corpus_tfidf])
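
`MatrixSimilarity` scores a query against every indexed document by cosine similarity. The following pure-Python sketch shows that measure on sparse `(topic_id, weight)` vectors of the kind LDA returns; the function `cosine_similarity` and the toy vectors are illustrative assumptions, not part of gensim.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as dissimilar to everything
    return dot / (norm_a * norm_b)

toy_query = [(0, 0.1), (5, 0.9)]        # mostly topic 5
toy_doc_same = [(0, 0.2), (5, 1.8)]     # same direction, different length
toy_doc_other = [(3, 1.0)]              # no topic in common with the query

print(cosine_similarity(toy_query, toy_doc_same))
print(cosine_similarity(toy_query, toy_doc_other))
```

Because cosine similarity ignores vector length, documents dominated by the same topic as the query score close to 1 regardless of how long they are.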

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [17]:
query = "Human computer interaction"
pre_query = preprocess(query)

vec_bow = dictionary.doc2bow(pre_query)
vec_lsi = model_lda[vec_bow]  # convert the query to LDA topic space
print(vec_lsi)

[(0, 0.05001586), (1, 0.050014295), (2, 0.050014537), (3, 0.05001436), (4, 0.050014295), (5, 0.5498681), (6, 0.05001476), (7, 0.050014295), (8, 0.050015204), (9, 0.050014295)]
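
One possible way to wrap these steps into the function the exercise asks for is sketched below. The name `query_to_lda` is hypothetical, and the notebook objects (`preprocess`, `dictionary`, `model_tfidf`, `model_lda`) are passed in as parameters. Note one assumption that differs from the cell above: because the LDA model and the index were built from TF-IDF vectors, the sketch also sends the query through `model_tfidf` before applying `model_lda`. With that extra step the topic weights will likely differ from the output printed above.

```python
def query_to_lda(query, preprocess, dictionary, model_tfidf, model_lda):
    """Map a raw query string to its LDA topic distribution (hypothetical helper)."""
    tokens = preprocess(query)            # same tokenizing, lowercasing, stopword removal as the corpus
    vec_bow = dictionary.doc2bow(tokens)  # words not in the dictionary are ignored
    vec_tfidf = model_tfidf[vec_bow]      # assumption: reweight the query like the training corpus
    return model_lda[vec_tfidf]           # list of (topic_id, weight) pairs
```

In the notebook this could be called as `query_to_lda("Human computer interaction", preprocess, dictionary, model_tfidf, model_lda)`.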


Print the top 5 most similar documents, together with their similarities, using your index created above.

In [18]:
sims = index[vec_lsi]
print(list(enumerate(sims)))

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims[:5]:
    print(doc_score, articles_orig[doc_position])

[(0, 0.10241625), (1, 0.11014203), (2, 0.11761003), (3, 0.109698236), (4, 0.11270343), (5, 0.109107696), (6, 0.087748684), (7, 0.114298336), (8, 0.12805778), (9, 0.12096505), (10, 0.97500217), (11, 0.9754217), (12, 0.106940836), (13, 0.11324691), (14, 0.11486876), (15, 0.11321192), (16, 0.11692338), (17, 0.10992997), (18, 0.11799764), (19, 0.114796124), (20, 0.97821873), (21, 0.113405384), (22, 0.11777974), (23, 0.11245378), (24, 0.11595931), (25, 0.12004268), (26, 0.114118755), (27, 0.11547435), (28, 0.110628165), (29, 0.11037162), (30, 0.11573291), (31, 0.1321513), (32, 0.98098457), (33, 0.9265822), (34, 0.110535875), (35, 0.14734425), (36, 0.9750854), (37, 0.08775028), (38, 0.112546995), (39, 0.11048309), (40, 0.109421395), (41, 0.11492355), (42, 0.11039954), (43, 0.11251092), (44, 0.109232135), (45, 0.11113871), (46, 0.114339076), (47, 0.11196843), (48, 0.10619465), (49, 0.10646482), (50, 0.11291571), (51, 0.11684278), (52, 0.10700049), (53, 0.11541241), (54, 0.11445456), (55, 0.10
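
Since only the top 5 scores are needed, a full sort of all similarities is not required. As an alternative sketch, `heapq.nlargest` from the standard library picks the k best `(position, score)` pairs directly; the helper name `top_k` and the toy scores below are made up for illustration.

```python
import heapq

def top_k(sims, k=5):
    """Return the k (position, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, enumerate(sims), key=lambda item: item[1])

toy_sims = [0.10, 0.97, 0.11, 0.98, 0.12, 0.92]
print(top_k(toy_sims, k=3))
```

The same call should work on the `sims` array returned by the index, since `enumerate` accepts any iterable of scores.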

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?
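
To put a number on the comparison, one option is to measure how much the two top-5 lists overlap. The sketch below computes the Jaccard overlap of the top-k document positions from two similarity lists; the helper names and the toy scores are assumptions for illustration, not results from the notebook. Keep in mind that LDA training is randomized, so the lists can also differ between runs of the same configuration unless a seed is fixed (for example via `LdaModel`'s `random_state` parameter).

```python
def top_k_ids(sims, k=5):
    """Positions of the k highest similarity scores."""
    ranked = sorted(enumerate(sims), key=lambda item: -item[1])
    return {position for position, _ in ranked[:k]}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# toy similarity scores standing in for the 10-topic and 100-topic models
toy_sims_10 = [0.10, 0.97, 0.11, 0.98, 0.12, 0.92, 0.95, 0.99]
toy_sims_100 = [0.02, 0.80, 0.01, 0.10, 0.03, 0.75, 0.05, 0.90]

overlap = jaccard(top_k_ids(toy_sims_10, k=3), top_k_ids(toy_sims_100, k=3))
print(overlap)
```

A value near 1 means both models retrieve largely the same documents; a value near 0 means the rankings have little in common.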