# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [2]:
import os
from collections import defaultdict

import gensim
import nltk

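For tokenization and stopword removal, download the NLTK punkt tokenizer and stopwords list. (Recent NLTK releases may ask for the `punkt_tab` resource instead of `punkt`; if `word_tokenize` raises a `LookupError`, download the resource named in the error message.)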

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf), which is included with gensim and contains 300 news articles from the Australian Broadcasting Corporation. The file stores one article per line.

In [4]:
from gensim.test.utils import datapath

train_file = datapath('lee_background.cor')
# read one article per line and close the file handle afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

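Preprocess the text by tokenizing, lowercasing, and removing stopwords. Stemming is optional and commented out in the code below. Afterwards, build the dictionary and filter out words that are very rare or very frequent.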

In [5]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'"', "'", "''", '`', '``', "'s"}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # indexing once populates the lazily built id-to-token mapping
corpus_bow = [dictionary.doc2bow(article) for article in articles]
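
To make the effect of the two thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not gensim's implementation: the function name `filter_by_document_frequency` and the toy documents are hypothetical, and gensim's exact rounding of the `no_above` bound may differ.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5):
    """Return the tokens that survive both document-frequency thresholds."""
    num_docs = len(tokenized_docs)
    # count each token once per document, however often it occurs in it
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df / num_docs <= no_above
    }

# hypothetical toy corpus: "said" occurs everywhere, several words occur only once
toy_docs = [
    ["fire", "sydney", "said"],
    ["fire", "rain", "said"],
    ["cricket", "test", "said"],
    ["cricket", "match", "said"],
]
kept = filter_by_document_frequency(toy_docs)
print(sorted(kept))
```

In this toy corpus only `fire` and `cricket` survive: words seen in a single document are too rare, and `said` appears in every document.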


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [9]:
model_tfidf = gensim.models.TfidfModel(corpus_bow)
corpus_tfidf = model_tfidf[corpus_bow]

print(corpus_bow[0])
print(corpus_tfidf[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110, 1)]
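
The weighting behind the TF-IDF vectors can be sketched in a few lines. This sketch assumes what we understand to be the default `TfidfModel` settings (raw term count, base-2 logarithm of the inverse document frequency, normalization to unit length); the function `tfidf_vectors` and the toy corpus are hypothetical and only illustrate the arithmetic.

```python
import math

def tfidf_vectors(bow_corpus):
    """bow_corpus: list of documents, each a list of (token_id, count) pairs."""
    num_docs = len(bow_corpus)

    # document frequency: in how many documents does each token id occur?
    doc_freq = {}
    for doc in bow_corpus:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    result = []
    for doc in bow_corpus:
        # raw count times log2(N / document frequency)
        weights = [(t, c * math.log2(num_docs / doc_freq[t])) for t, c in doc]
        # tokens present in every document get weight 0 and are dropped
        weights = [(t, w) for t, w in weights if w > 0.0]
        # scale the vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(t, w / norm) for t, w in weights] if norm else [])
    return result

# hypothetical bag-of-words corpus with three documents and three token ids
toy_bow = [[(0, 1), (1, 2)], [(0, 1), (2, 1)], [(0, 3), (1, 1), (2, 1)]]
toy_tfidf = tfidf_vectors(toy_bow)
for vec in toy_tfidf:
    print(vec)
```

Token 0 occurs in all three toy documents, so it carries no discriminative weight and disappears from every vector. The remaining weights are rescaled so that each vector has length 1.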

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [10]:
model_lda = gensim.models.LdaModel(corpus=corpus_tfidf, id2word=dictionary, num_topics=10)

Let's inspect 5 of the 10 topics of our model.

In [31]:
model_lda.print_topics(5)

[(9,
  '0.002*"government" + 0.002*"party" + 0.002*"federal" + 0.002*"group" + 0.001*"asylum" + 0.001*"cut" + 0.001*"palestinian" + 0.001*"pacific" + 0.001*"workers" + 0.001*"economy"'),
 (5,
  '0.002*"asic" + 0.002*"afghanistan" + 0.001*"best" + 0.001*"guides" + 0.001*"river" + 0.001*"company" + 0.001*"adventure" + 0.001*"canyoning" + 0.001*"interlaken" + 0.001*"afghan"'),
 (2,
  '0.003*"metres" + 0.002*"50" + 0.002*"agreement" + 0.002*"event" + 0.001*"bill" + 0.001*"militants" + 0.001*"palestinian" + 0.001*"reid" + 0.001*"mr" + 0.001*"us"'),
 (1,
  '0.002*"palestinian" + 0.002*"south" + 0.002*"arafat" + 0.002*"israel" + 0.002*"hamas" + 0.002*"australia" + 0.001*"fire" + 0.001*"hewitt" + 0.001*"new" + 0.001*"club"'),
 (6,
  '0.003*"palestinian" + 0.002*"people" + 0.002*"israeli" + 0.002*"security" + 0.002*"south" + 0.002*"suicide" + 0.001*"test" + 0.001*"police" + 0.001*"west" + 0.001*"mr"')]

The output shows 5 of the 10 topics. LDA topics have no natural ordering, so the selection says nothing about importance and may change between training runs. For each topic, the 10 highest-probability words are listed, each with its probability under that topic.

## Document Similarity
We now use our LDA model to measure how similar new documents (*queries*) are to the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [38]:
index = gensim.similarities.MatrixSimilarity(model_lda[corpus_tfidf])

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [39]:
query = "Human computer interaction"
pre_query = preprocess(query)

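vec_bow = dictionary.doc2bow(pre_query)
vec_tfidf = model_tfidf[vec_bow]  # apply the same TF-IDF weighting as the indexed documents
vec_lsi = model_lda[vec_tfidf]  # project the query into LDA topic space (name kept for the cells below)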
print(vec_lsi)

[(0, 0.050017152), (1, 0.050016955), (2, 0.5498435), (3, 0.050016955), (4, 0.050018985), (5, 0.05001799), (6, 0.050017077), (7, 0.050017323), (8, 0.050017122), (9, 0.050016955)]
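
The values 0.55 and 0.05 in this output can be reproduced with simple arithmetic. The sketch below is an idealization, not gensim's inference procedure. It assumes a symmetric prior of `1 / num_topics` per topic, that only one query token (total weight 1.0) is in the dictionary, and that this token is attributed entirely to a single topic.

```python
num_topics = 10
alpha = 1.0 / num_topics   # assumed symmetric Dirichlet prior per topic
query_weight = 1.0         # assumed total weight of in-vocabulary query tokens
dominant_topic = 2         # topic that receives the token in the output above

# pseudo-counts: the prior on every topic, plus the token weight on one topic
gamma = [alpha] * num_topics
gamma[dominant_topic] += query_weight

# normalize the pseudo-counts to obtain a topic distribution
total = sum(gamma)
topic_dist = [g / total for g in gamma]
print([round(p, 3) for p in topic_dist])
```

Under these assumptions the dominant topic receives 1.1 / 2 = 0.55 and every other topic 0.1 / 2 = 0.05, which is consistent with the printed vector. It also suggests that most of the query words did not survive preprocessing and dictionary filtering.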


Print the top 5 most similar documents, together with their similarities, using your index created above.

In [41]:
sims = index[vec_lsi]
print(list(enumerate(sims)))

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims[:5]:
    print(doc_score, articles[doc_position])

[(0, 0.08776023), (1, 0.11016472), (2, 0.11762812), (3, 0.109708965), (4, 0.11271897), (5, 0.109133095), (6, 0.96472156), (7, 0.114336915), (8, 0.10710348), (9, 0.12098105), (10, 0.1111706), (11, 0.11224067), (12, 0.1226941), (13, 0.97582626), (14, 0.114903174), (15, 0.9758115), (16, 0.11691949), (17, 0.109932184), (18, 0.11802079), (19, 0.11478705), (20, 0.1250358), (21, 0.9758907), (22, 0.117768794), (23, 0.11247062), (24, 0.11598759), (25, 0.9733644), (26, 0.114097595), (27, 0.11557376), (28, 0.11066702), (29, 0.11033342), (30, 0.11582889), (31, 0.11038124), (32, 0.12694208), (33, 0.08775689), (34, 0.11050568), (35, 0.110986635), (36, 0.9750492), (37, 0.08775702), (38, 0.11258303), (39, 0.11051073), (40, 0.100570664), (41, 0.97649825), (42, 0.11038694), (43, 0.11252522), (44, 0.12987262), (45, 0.9749904), (46, 0.11437751), (47, 0.11200681), (48, 0.1062413), (49, 0.1267344), (50, 0.112948805), (51, 0.11688079), (52, 0.10707758), (53, 0.1154563), (54, 0.11446591), (55, 0.13366735), (5
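
`MatrixSimilarity` scores each indexed document by its cosine similarity to the query. The following pure-Python sketch shows that ranking step on dense topic vectors; the helpers `cosine` and `top_k` and the three-topic toy vectors are hypothetical and stand in for the index used above.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_vec, doc_vecs, k=5):
    """Return (document position, score) pairs, highest score first."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda item: -item[1])[:k]

# hypothetical topic distributions over 3 topics
toy_doc_vecs = [
    [0.90, 0.05, 0.05],  # dominated by topic 0
    [0.05, 0.90, 0.05],  # dominated by topic 1
    [0.10, 0.80, 0.10],  # mostly topic 1
]
toy_query = [0.05, 0.90, 0.05]  # query dominated by topic 1

ranked = top_k(toy_query, toy_doc_vecs, k=3)
for position, score in ranked:
    print(position, round(score, 3))
```

Documents that share the query's dominant topic score close to 1, all others score low. This likely explains why the similarities printed above fall into two groups, one near 0.97 and one near 0.11.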

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?