![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab3: Latent Semantic Analysis with Gensim

by Andrei Popescu-Belis (HES-SO), based on material by Fabian Märki (FHNW) and Heiho Hahn (FHSG)

## Summary
The aim of this lab is to perform LSA on a small corpus of news.  You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query. 

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [1]:
!pip install gensim
!pip install contractions



In [2]:
import os    
import nltk
import gensim
import pandas as pd
from TextPreprocessor import *
from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

The data used in this lab is the same set of 300 Australian news articles that you used in Lab 2.  It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed.  The following code loads the documents into a Pandas dataframe.

In [3]:
# Code adapted from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

# The corpus ships inside the gensim package, under test/test_data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# One document per line
with open(lee_train_file) as f:
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})
data_df.head(3)

Unnamed: 0,text
0,Hundreds of people have been forced to vacate ...
1,Indian security forces have shot dead eight su...
2,The national road toll for the Christmas-New Y...


## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering (optional)
4. lemmatization or stemming (optional)
5. addition of bigrams to each document (optional)
6. filtering of infrequent words
7. inspection and filtering of frequent words

You can use NLTK as in Lab 1, or our in-house `TextPreprocessor.py` file as in Lab 2.
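
As a rough illustration of tokenization, stopword removal and bigram addition, here is a minimal pure-Python sketch. The tiny stopword set and the helper name `preprocess` are assumptions made for this example only; in the lab itself, NLTK's stopword list or `TextPreprocessor.py` would take their place.

```python
import re

# Hypothetical, deliberately tiny stopword list (NLTK's list is much longer).
TOY_STOPWORDS = {"the", "of", "to", "have", "been", "a", "in"}

def preprocess(text, stopwords=TOY_STOPWORDS, add_bigrams=True):
    """Lowercase, tokenize on letters, drop stopwords, optionally append bigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if add_bigrams:
        # Bigrams are built from adjacent tokens that survived stopword removal.
        bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
        tokens = tokens + bigrams
    return tokens

example = preprocess("Hundreds of people have been forced to vacate")
print(example)
```

Building bigrams after stopword removal is one possible design choice; building them before would keep pairs such as `forced_to`, which are usually less informative.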

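<font color='green'>Please state here which solution you use and list the stages you implement.</font>
I use `TextPreprocessor.py`, with stopword removal, POS-based filtering (adjectives and nouns only), lemmatization, stemming and number removal. Infrequent words are filtered afterwards.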

In [4]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py
language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here:
for sw in ['\"', '\'', '\'\'', '`', '``', '\'s']:
    stop_words.add(sw)


processor = TextPreprocessor(
# Add options here:
    language = language,
    pos_tags = {wordnet.ADJ, wordnet.NOUN},
    stopwords = stop_words,
    lemmantize = True,
    stem = True,
    remove_numbers=True,
)

In [5]:
data_df['processed'] = processor.transform(data_df['text'])
print(data_df['processed'].iloc[120])

union qanta mainten worker industri action compani reject offer disput parti privat talk yesterday industri relat commiss 3,000 mainten worker reject qanta wage freez nation secretari australian manufactur worker union amwu doug cameron union everyth possibl resolv disput qanta prepar accept privat arbitr altern worker industri action escal industri action necessari fair compani crush underfoot


In [6]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [7]:
# Please write here the preprocessing instructions if you use NLTK
# Not used here.

In [8]:
print(data_df['tokenized'].iloc[120])

['union', 'qanta', 'mainten', 'worker', 'industri', 'action', 'compani', 'reject', 'offer', 'disput', 'parti', 'privat', 'talk', 'yesterday', 'industri', 'relat', 'commiss', '3,000', 'mainten', 'worker', 'reject', 'qanta', 'wage', 'freez', 'nation', 'secretari', 'australian', 'manufactur', 'worker', 'union', 'amwu', 'doug', 'cameron', 'union', 'everyth', 'possibl', 'resolv', 'disput', 'qanta', 'prepar', 'accept', 'privat', 'arbitr', 'altern', 'worker', 'industri', 'action', 'escal', 'industri', 'action', 'necessari', 'fair', 'compani', 'crush', 'underfoot']


Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 

The frequency distribution shows that numbers are still present in the corpus, although the text preprocessor should have removed them, so they are filtered out here. Words occurring three times or fewer are then removed, which shrinks the vocabulary from 4278 to 1521 words.
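
The filtering step can be sketched independently of NLTK with `collections.Counter` (of which `nltk.FreqDist` is a subclass). The function name `filter_by_frequency`, its parameters and the toy documents below are assumptions for illustration, not part of the lab material.

```python
from collections import Counter

def filter_by_frequency(docs, min_count=4, drop=()):
    """Keep alphabetic tokens that occur at least `min_count` times corpus-wide
    and are not listed in `drop` (e.g. uninformative frequent words)."""
    counts = Counter(w for doc in docs for w in doc if w.isalpha())
    vocab = {w for w, c in counts.items() if c >= min_count and w not in drop}
    return [[w for w in doc if w in vocab] for doc in docs], vocab

# Toy corpus: "3,000" is not alphabetic and "home" occurs only once.
toy_docs = [["power", "storm", "3,000"], ["power", "home"], ["storm", "power"]]
filtered_docs, vocab = filter_by_frequency(toy_docs, min_count=2)
print(filtered_docs)
print(len(vocab))
```

The `drop` argument is where uninformative frequent words would go once they have been spotted in the list of most common words.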

In [9]:
# Please write here all the necessary instructions.  You may use several cells.

# keep only purely alphabetic tokens (this drops numbers such as '3,000')
words = [w for ws in data_df['tokenized'] for w in ws if w.isalpha()]
freq_dist = nltk.FreqDist(words)
# most common
freq_dist.most_common(10)

[('mr', 306),
 ('australian', 178),
 ('new', 171),
 ('palestinian', 168),
 ('australia', 157),
 ('peopl', 153),
 ('govern', 150),
 ('two', 136),
 ('u', 136),
 ('day', 131)]

In [10]:
# least common
freq_dist.most_common()[-10:]

[('cedric', 1),
 ('piolin', 1),
 ('fabric', 1),
 ('santoro', 1),
 ('apprais', 1),
 ('pro', 1),
 ('con', 1),
 ('overcam', 1),
 ('sebastien', 1),
 ('grosjean', 1)]

In [11]:
# keep only words that occur more than 3 times (i.e. remove words with 3 or fewer occurrences)
filtered = dict((word, freq) for word, freq in freq_dist.items() if freq > 3)
freq_dist_filtered = nltk.FreqDist(filtered)
freq_dist_filtered.most_common()[-10:]

[('prudenti', 4),
 ('requir', 4),
 ('goodin', 4),
 ('hornsbi', 4),
 ('hindu', 4),
 ('saxeten', 4),
 ('liverpool', 4),
 ('arthur', 4),
 ('rubber', 4),
 ('cow', 4)]

In [12]:
# compare the vocabulary size before and after filtering
voci = set(word for word, _ in freq_dist.items())
voci_filtered = set(word for word, _ in freq_dist_filtered.items())
print('Before filtering: %s' % len(voci))
print('After filtering: %s' % len(voci_filtered))

Before filtering: 4278
After filtering: 1521


In [13]:
# filter the data
data_df['filtered'] = [[w for w in ws if w in voci_filtered] for ws in data_df['tokenized']]
print(data_df['filtered'].iloc[10])

['work', 'morn', 'restor', 'power', 'suppli', 'ten', 'thousand', 'home', 'storm', 'struck', 'queensland', 'last', 'night', 'forc', 'wind', 'tree', 'brought', 'power', 'line', 'home', 'car', 'energi', 'everi', 'avail', 'person', 'night', 'restor', 'power', 'locat', 'around', 'brisban', 'west', 'toowoomba', 'north', 'coast', 'brisban', 'protect', 'home', 'sever', 'storm', 'christma', 'four', 'peopl', 'high', 'power', 'line', 'across', 'car', 'insid', 'fierc', 'wind', 'sent', 'larg', 'tree', 'hous', 'one', 'injur']


## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [14]:
def train_lsa(filtered_texts, num_topics = 10):
    # dictionary creation
    dictionary = corpora.Dictionary(filtered_texts)
    # corpus mapping, bow
    corpus = [dictionary.doc2bow(text) for text in filtered_texts]
    # TF-IDF model
    tfidf = models.TfidfModel(corpus, normalize=True)
    corpus_tfidf = tfidf[corpus]
    # LSA model
    lsa = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)

    return lsa, dictionary, corpus, corpus_tfidf
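
To make the steps wrapped by `train_lsa` more concrete, the same pipeline can be sketched on a toy matrix with plain NumPy. The counts and term names below are made up, and the TF-IDF variant (raw term frequency times `log2(N / df)`, then L2 normalisation per document) is an assumption about what a typical default looks like rather than a statement about Gensim's internals.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["gaza", "israel", "cow", "milk"]
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 1],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# TF-IDF with idf = log2(N / df), followed by L2 normalisation of each document.
n_docs = counts.shape[1]
df = (counts > 0).sum(axis=1)
idf = np.log2(n_docs / df)
tfidf = counts * idf[:, None]
tfidf /= np.linalg.norm(tfidf, axis=0, keepdims=True)

# SVD, truncated to the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k approximation of the TF-IDF matrix and k-dimensional document vectors.
approx = U_k @ np.diag(s_k) @ Vt_k
doc_vectors = (np.diag(s_k) @ Vt_k).T
print(doc_vectors.shape)
```

Here the columns of `U_k` play the role of the topics, and each row of `doc_vectors` is one document expressed in the `k`-dimensional latent space.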

<font color="green">Please fix a `number_of_topics`, on the lower side of the range mentioned in the course.  Then, execute the cell that performs `train_lsa`.</font>

In [15]:
number_of_topics = 2

In [16]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>

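The topics display the latent dimensions of the LSA transformation. Each topic is printed as a weighted combination of terms, showing the terms with the largest absolute weights. In the first topic, the words 'palestinian', 'isra' and 'arafat' are related and contribute the most.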

In [17]:
lsa_model.print_topics(number_of_topics)

[(0,
  '0.336*"palestinian" + 0.222*"isra" + 0.197*"arafat" + 0.126*"mr" + 0.125*"israel" + 0.120*"hama" + 0.111*"attack" + 0.107*"afghanistan" + 0.106*"forc" + 0.103*"gaza"'),
 (1,
  '-0.430*"palestinian" + -0.283*"isra" + -0.253*"arafat" + -0.157*"israel" + -0.155*"hama" + -0.135*"gaza" + 0.133*"afghanistan" + 0.108*"bin" + 0.107*"laden" + -0.106*"sharon"')]
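
The negative weights in topic 1 should be read with some care: the sign of each SVD dimension is arbitrary, because flipping a left singular vector together with the matching right singular vector leaves the decomposition unchanged. A small NumPy check on made-up data (the matrix `A` is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 4))  # made-up term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the second dimension in both U and Vt.
U_flipped, Vt_flipped = U.copy(), Vt.copy()
U_flipped[:, 1] *= -1
Vt_flipped[1, :] *= -1

# Both factorisations reconstruct the same matrix.
same = np.allclose(U @ np.diag(s) @ Vt, U_flipped @ np.diag(s) @ Vt_flipped)
print(same)  # True
```

What matters within a topic is therefore which terms share a sign and which have opposite signs, not whether a given weight is positive or negative.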

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please exemplify its value on two different word pairs, one of which should be obviously more similar than the other, and comment on the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

The first pair ('palestinian', 'israel') is expected to be more similar than the second pair ('cow', 'gaza'). The LSA space was reduced to two dimensions to make the similarities easier to examine.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

In [19]:
def wordsim(word1, word2, model, dictionary):
    # get bow from words
    w1_bow = dictionary.doc2bow([word1])
    w2_bow = dictionary.doc2bow([word2])
    
    if len(w1_bow) == 0 or len(w2_bow) == 0:
        raise Exception('Words not in dictionary!')
    
    # convert to LSA space
    w1_lsa = model[w1_bow]
    w2_lsa = model[w2_bow]
    
    # compute the similarity
    sim = cosine_similarity(w1_lsa, w2_lsa)
    
    return sim

In [20]:
# print here the cosine similarities of several pairs and comment on the results.
sim_high = wordsim('palestinian', 'israel', lsa_model, dictionary)
sim_low = wordsim('cow', 'gaza', lsa_model, dictionary)

print(sim_high)
print(sim_low)

[[ 1.         -0.15555533]
 [-0.39505025  0.96892901]]
[[ 1.         -0.13398335]
 [ 0.00271741  0.99061584]]
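
The printed values are 2×2 matrices rather than single scores. A likely reason is that `model[bow]` returns a list of `(topic_id, weight)` pairs, which `cosine_similarity` then interprets as two 2-D samples per word. Below is a minimal sketch of a scalar version, assuming the sparse format just described; the helper names `to_dense` and `cosine` and the example weights are made up for illustration.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Turn [(topic_id, weight), ...] into a plain list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, weight in sparse_vec:
        dense[topic_id] = weight
    return dense

def cosine(sparse_a, sparse_b, num_topics):
    """Cosine similarity between two sparse topic vectors (0.0 if one is all zeros)."""
    a, b = to_dense(sparse_a, num_topics), to_dense(sparse_b, num_topics)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up 2-topic vectors in the (topic_id, weight) format.
word_a = [(0, 0.3), (1, -0.4)]
word_b = [(0, 0.6), (1, -0.8)]   # same direction as word_a
word_c = [(0, 0.4), (1, 0.3)]    # orthogonal to word_a

print(round(cosine(word_a, word_b, 2), 5))  # 1.0
print(round(cosine(word_a, word_c, 2), 5))  # 0.0
```

In the notebook, the two arguments would presumably be `model[w1_bow]` and `model[w2_bow]`, with `model.num_topics` as the dimension. Gensim also provides `gensim.matutils.cossim` for sparse vectors, which may be a simpler option.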


<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word (giving the score too).  You won't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment on the results.</font>

In [21]:
from gensim import similarities

In [22]:
def word_ranking(word0, word_list, model, dictionary):
    word0_bow = dictionary.doc2bow([word0])
    words_bow = [dictionary.doc2bow([w]) for w in word_list]
    
    word0_lsa = model[word0_bow]
    words_lsa = model[words_bow]
    
    index = similarities.MatrixSimilarity(words_lsa)
    
    # perform a similarity query against the corpus
    sims = index[word0_lsa]  
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    for i, (doc_position, doc_score) in enumerate(sims):
        print('{0}: "{1}"", score: {2:.5f}'.format(i, word_list[doc_position], doc_score))

In [23]:
# call here the function on your choice of words
word_ranking('gaza', ['cow', 'water', 'palestinian', 'israel'], lsa_model, dictionary)

0: "palestinian"", score: 0.99993
1: "israel"", score: 0.99983
2: "cow"", score: -0.15248
3: "water"", score: -0.25935


In [24]:
# Please write here your comments on the rankings

The query word "gaza" scores highest with "palestinian" and "israel" (both close to 1), since these words often appear in the same articles. It scores much lower with "cow" and "water" (both negative), since these words rarely appear together with it.
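
The ranking logic itself can be sketched without Gensim, which also makes it easy to skip candidate words that are missing from the vocabulary. The function name `rank_by_similarity` and the 2-D vectors below are hypothetical and only serve as an illustration.

```python
import math

def rank_by_similarity(query, candidates, vectors):
    """Return [(word, cosine)] sorted by decreasing similarity to `query`.
    Candidate words missing from `vectors` are skipped."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.hypot(*a) * math.hypot(*b)
        return dot / norm if norm else 0.0

    if query not in vectors:
        raise KeyError(f"{query!r} is not in the vocabulary")
    scored = [(w, cos(vectors[query], vectors[w])) for w in candidates if w in vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 2-D word vectors, for illustration only.
toy_vectors = {
    "gaza": (0.10, -0.13),
    "israel": (0.12, -0.16),
    "cow": (0.02, 0.03),
}
ranking = rank_by_similarity("gaza", ["cow", "israel", "unknownword"], toy_vectors)
for word, score in ranking:
    print(f"{word}: {score:.5f}")
```

Returning the sorted list instead of printing inside the function keeps the ranking reusable, for example to compare two models on the same word list.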

<font color="green">Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>

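The first model, with few dimensions (2-D), performs much better: it gives high scores to the related words and low scores to the unrelated ones. The model with many dimensions (300-D) no longer finds these similarities; for instance, it ranks "israel" below "cow" and "water". A plausible reason is that 300 topics equals the number of documents, so hardly any dimensionality reduction takes place and co-occurring words are no longer merged into shared dimensions.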

In [25]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=300)
word_ranking('gaza', ['cow', 'water', 'palestinian', 'israel'], lsa_model, dictionary)

0: "palestinian"", score: 0.15100
1: "cow"", score: 0.01348
2: "water"", score: -0.01740
3: "israel"", score: -0.07435
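
The effect of the number of topics can be reproduced on a toy example. The sketch below assumes that a word is represented by its row in the truncated left singular matrix `U` (roughly what happens when a one-word document is projected into the space without scaling by the singular values); the matrix and term names are made up.

```python
import numpy as np

terms = ["gaza", "israel", "cow", "milk"]
# Made-up term-document matrix with two clear themes.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def word_cosine(i, j, k):
    """Cosine between rows i and j of U truncated to its first k columns."""
    a, b = U[i, :k], U[j, :k]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(word_cosine(0, 1, k=2))  # gaza vs israel with 2 topics
print(word_cosine(0, 1, k=4))  # gaza vs israel without truncation
```

With two topics the two related words end up with a cosine close to 1, whereas without truncation the rows of `U` are orthonormal and the cosine drops to about 0. Real corpora are noisier, but the trend matches the rankings above.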


## LDA with Gensim
For a simple tutorial on using LDA with Gensim, please see https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21.

Thank you for your work!  Please submit the notebook on Moodle before the next course.