# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [2]:
import os
from collections import defaultdict

import gensim
import nltk

For tokenization and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nevin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Nevin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim, which contains 300 news articles from the Australian Broadcasting Corporation.

In [4]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
# read one article per line; the context manager closes the file afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

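Preprocess the text by tokenizing, lowercasing, and removing stopwords (stemming is optional and disabled by default). Afterwards, words that are very rare or very frequent are filtered out of the dictionary.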

In [5]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents, or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # indexing once forces gensim to build the id -> token mapping
corpus_bow = [dictionary.doc2bow(article) for article in articles]
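To make the filtering and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. It is not gensim's implementation: the helper names `filter_by_document_frequency` and `to_bow` are hypothetical, the ids are assigned alphabetically only for readability, and gensim's additional `keep_n` limit is ignored.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents; return a token -> id mapping."""
    # count every token once per document (document frequency)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    # assumption of this sketch: ids follow alphabetical order
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping filtered-out tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# toy corpus (made up for illustration)
toy_docs = [
    ['fire', 'storm', 'sydney'],
    ['fire', 'sydney', 'fire'],
    ['qantas', 'strike', 'sydney'],
    ['qantas', 'sydney'],
]
toy_token2id = filter_by_document_frequency(toy_docs)
toy_bow = to_bow(toy_docs[1], toy_token2id)
print(toy_token2id)
print(toy_bow)
```

In this toy corpus, `storm` and `strike` occur in only one document and `sydney` occurs in all four, so only `fire` and `qantas` survive the filter.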


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [6]:
model_tfidf = gensim.models.TfidfModel(corpus_bow)
corpus_tfidf = model_tfidf[corpus_bow]

print('BOW:')
print(corpus_bow[0])

print('TF-IDF:')
print(corpus_tfidf[0])


BOW:
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 3), (46, 1), (47, 1), (48, 2), (49, 2), (50, 3), (51, 3), (52, 1), (53, 2), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 2), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 8), (73, 1), (74, 1), (75, 1), (76, 2), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 3), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 5), (90, 1), (91, 2), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1), (98, 3), (99, 1), (100, 1), (101, 3), (102, 1), (103, 1), (104, 1), (105, 4), (106, 2), (107, 1), (108, 1), (109, 1), (110
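As a rough illustration of what the TF-IDF transformation does to a bag-of-words vector, the sketch below recomputes the weights by hand. It assumes gensim's default settings (raw term count multiplied by `log2(N / df)`, followed by L2 normalization); the function name `tfidf_vector` and the toy numbers are hypothetical.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight each (token_id, count) pair by count * log2(num_docs / df),
    then scale the vector to unit Euclidean length."""
    weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
               for tid, count in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # tokens that occur in every document get weight 0 and are dropped
    return [(tid, w / norm) for tid, w in weights if w > 0]

# toy statistics (made up): 4 documents, token 2 occurs in all of them
toy_doc_freq = {0: 1, 1: 2, 2: 4}
toy_tfidf = tfidf_vector([(0, 1), (1, 2), (2, 3)], toy_doc_freq, num_docs=4)
print(toy_tfidf)
```

The token that appears in every document carries no discriminative information, so it disappears from the vector even though it has the highest raw count.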

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [7]:
#model_lda = gensim.models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10)
model_lda = gensim.models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=10, passes=20, iterations=400)

model_lda.print_topics(5)

[(0,
  '0.004*"palestinian" + 0.004*"mr" + 0.003*"australia" + 0.003*"government" + 0.003*"australian" + 0.003*"south" + 0.003*"new" + 0.003*"israeli" + 0.003*"arafat" + 0.003*"afghanistan"'),
 (6,
  '0.002*"ses" + 0.002*"japan" + 0.002*"argentina" + 0.002*"hewitt" + 0.002*"car" + 0.002*"club" + 0.002*"roads" + 0.002*"road" + 0.002*"japanese" + 0.001*"crisis"'),
 (2,
  '0.003*"hollingworth" + 0.003*"dr" + 0.003*"governor-general" + 0.003*"space" + 0.002*"abuse" + 0.002*"adventure" + 0.002*"school" + 0.002*"guides" + 0.002*"anglican" + 0.002*"canyoning"'),
 (9,
  '0.007*"qantas" + 0.005*"workers" + 0.004*"industrial" + 0.004*"maintenance" + 0.003*"unions" + 0.003*"dispute" + 0.002*"freeze" + 0.002*"relations" + 0.002*"wage" + 0.002*"airline"'),
 (7,
  '0.003*"firefighters" + 0.002*"zimbabwe" + 0.002*"fires" + 0.002*"service" + 0.002*"pay" + 0.002*"rural" + 0.002*"rates" + 0.002*"storm" + 0.002*"homes" + 0.002*"lording"')]

Let's inspect all topics of our model, showing the 5 highest-weighted words of each.

In [8]:
topics = model_lda.print_topics(num_words=5)
print('LDA topics:')
for topic in topics:
    print(topic)

LDA topics:
(0, '0.004*"palestinian" + 0.004*"mr" + 0.003*"australia" + 0.003*"government" + 0.003*"australian"')
(1, '0.002*"warne" + 0.002*"innings" + 0.002*"wicket" + 0.002*"asic" + 0.002*"kallis"')
(2, '0.003*"hollingworth" + 0.003*"dr" + 0.003*"governor-general" + 0.003*"space" + 0.002*"abuse"')
(3, '0.003*"metres" + 0.002*"karzai" + 0.002*"kandahar" + 0.002*"event" + 0.002*"petrol"')
(4, '0.002*"friedli" + 0.002*"replied" + 0.002*"hih" + 0.002*"projects" + 0.001*"related"')
(5, '0.003*"reid" + 0.002*"cancer" + 0.002*"child" + 0.002*"sergeant" + 0.002*"lung"')
(6, '0.002*"ses" + 0.002*"japan" + 0.002*"argentina" + 0.002*"hewitt" + 0.002*"car"')
(7, '0.003*"firefighters" + 0.002*"zimbabwe" + 0.002*"fires" + 0.002*"service" + 0.002*"pay"')
(8, '0.002*"labor" + 0.002*"gang" + 0.002*"factory" + 0.002*"goshen" + 0.001*"pacific"')
(9, '0.007*"qantas" + 0.005*"workers" + 0.004*"industrial" + 0.004*"maintenance" + 0.003*"unions"')


The output lists all 10 topics, each with its 5 highest-weighted words. The number in front of each word is the probability that the topic assigns to that word.

## Document Similarity
We now use our LDA model to compare the similarity of new documents (*queries*) to documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [9]:
index = gensim.similarities.MatrixSimilarity(model_lda[corpus_tfidf])
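`MatrixSimilarity` ranks documents by cosine similarity. The following pure-Python sketch shows that computation on sparse `(topic_id, weight)` vectors like the ones LDA returns. The helper names `cosine_similarity` and `rank_documents` and the toy topic vectors are hypothetical and only meant to illustrate the idea, not to replace the index.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse (id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(tid, 0.0) for tid, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_n=5):
    """Return the top_n (doc_index, similarity) pairs, most similar first."""
    sims = [(i, cosine_similarity(query_vec, doc)) for i, doc in enumerate(doc_vecs)]
    return sorted(sims, key=lambda item: -item[1])[:top_n]

# toy topic distributions (made up) for three documents and one query
toy_doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(1, 1.0)],
    [(0, 0.5), (1, 0.5)],
]
toy_ranking = rank_documents([(0, 1.0)], toy_doc_topics)
print(toy_ranking)
```

Because cosine similarity only compares the direction of the topic vectors, two documents dominated by the same topic score close to 1 even if their wording differs. With few topics, this may make many documents look nearly equally similar to a query.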


Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [10]:
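def lda_representation(query):
    """Map a raw query string to its LDA topic vector."""
    # apply the same pipeline as for the documents:
    # tokens -> bag-of-words -> TF-IDF -> LDA
    query = preprocess(query)
    query_bow = dictionary.doc2bow(query)
    query_tfidf = model_tfidf[query_bow]
    query_lda = model_lda[query_tfidf]
    return query_lda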

Print the top 5 most similar documents, together with their similarities, using your index created above.

In [11]:
query = 'Prime Minister of Australia'
query_lda = lda_representation(query)
sims = index[query_lda]

sims = sorted(enumerate(sims), key=lambda item: -item[1])
for i, sim in sims[:5]:
    print('Document:', i, 'Similarity:', sim)
    print(articles_orig[i])
    print()




Document: 276 Similarity: 0.99407464
Defence Minister Robert Hill has confirmed Australian troops arrived in Afghanistan this morning. Senator Hill says it is an advance party and the rest of the troops will arrive within the next few days. He says Australian forces will operate with US troops in southern Afghanistan to fight the Taliban and Al Qaeda networks. Senator Hill says the operation could take several months. 

Document: 280 Similarity: 0.99380845
The Greens have officially won their second Senate spot in Federal Parliament. The Senate count for New South Wales has been finalised with Kerry Nettle from the Greens taking the final position from long time Democrats Senator Vicki Bourne. Senator Bourne says she is very lucky to have served in the Parliament for 12 years and has nominated serving as an observer at the East Timor independence ballot as the high point of her career. She has wished Kerry Nettle well, saying it is a great honour and a great responsibility to be electe

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?
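One possible way to support the qualitative comparison with a number is to measure how many documents the two top-5 lists share. The helper `top_k_overlap` below is a hypothetical sketch, and the two rankings are made-up placeholders, not results from the models in this exercise.

```python
def top_k_overlap(ranking_a, ranking_b, k=5):
    """Fraction of document ids shared by the top-k entries of two
    rankings, each given as a list of (doc_id, similarity) pairs."""
    top_a = {doc_id for doc_id, _ in ranking_a[:k]}
    top_b = {doc_id for doc_id, _ in ranking_b[:k]}
    return len(top_a & top_b) / k

# placeholder rankings (made up), standing in for the 10-topic and 100-topic runs
ranking_small = [(1, 0.99), (2, 0.98), (3, 0.97), (4, 0.96), (5, 0.95)]
ranking_large = [(3, 0.80), (9, 0.70), (1, 0.60), (7, 0.50), (8, 0.40)]
overlap = top_k_overlap(ranking_small, ranking_large)
print(overlap)
```

In practice, you would pass the sorted `sims` lists from both runs. Since LDA training is randomized, the overlap can vary between runs unless the model's `random_state` is fixed.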