
# AnTeDe Lab 5: Latent Semantic Analysis with Gensim

## Objective
The goal of this lab is to perform LSA on a small corpus of news.  You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query. 

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [64]:
! pip install contractions
import os    
import nltk
import gensim
import pandas as pd
from TextPreprocessor import *
from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

The data used in this lab is the same set of 300 Australian news articles that you used in Lab 4 on Document Representation.  It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed.  The following code will load the documents into a Pandas dataframe.

In [65]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# One document per line; the context manager closes the file afterwards
with open(lee_train_file) as f:
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})

## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering
4. lemmatization or stemming
5. addition of bigrams to each document
6. filtering of infrequent words
7. inspection and filtering of frequent words

You can use NLTK or our in-house `TextPreprocessor.py` file, as explained in Lab 1.

<font color='green'>Please state here which solution you use, and apply (a) POS-based filtering for adjectives and nouns only, (b) lemmatization, not stemming. Addition of bigrams is not to be implemented for this lab. </font>

In [66]:
# I'm using the in-house TextPreprocessor.py

In [67]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py
# Keep adjectives ('a') and nouns ('n') only; lemmatize instead of stemming
processor = TextPreprocessor(
    language='english',
    stopwords=set(stopwords.words('english')),
    pos_tags={'a', 'n'},
    lemmatize=True,
    stem=False,
    n_jobs=-1,
)

In [68]:
data_df['processed'] = processor.transform(data_df['text'])

In [69]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [70]:
print(data_df['tokenized'].iloc[120])

['union', 'qantas', 'maintenance', 'worker', 'industrial', 'action', 'company', 'reject', 'offer', 'dispute', 'party', 'private', 'talk', 'yesterday', 'industrial', 'relation', 'commission', '3,000', 'maintenance', 'worker', 'reject', 'qantas', 'wage', 'freeze', 'national', 'secretary', 'australian', 'manufacturing', 'worker', 'union', 'amwu', 'doug', 'cameron', 'union', 'everything', 'possible', 'resolve', 'dispute', '``', 'qantas', 'prepared', 'accept', 'private', 'arbitration', 'alternative', 'worker', 'industrial', 'action', 'escalate', 'industrial', 'action', 'necessary', 'fair', 'company', 'crush', 'underfoot', "''"]


Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 
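
As a minimal sketch of this filtering idea on toy data, the snippet below uses `collections.Counter` from the standard library (`nltk.FreqDist` is a subclass of it, so the counting behaves the same way). The documents, the `min_count` threshold and the `too_frequent` set are invented for illustration and are not values taken from the lab.

```python
from collections import Counter

# Toy tokenized documents (hypothetical, not taken from the Lee corpus).
docs = [
    ["mr", "fire", "sydney", "fire"],
    ["mr", "bank", "fire", "mr"],
    ["mr", "sydney", "vinyl"],
]

# Count every token over the whole corpus.
counts = Counter(token for doc in docs for token in doc)

min_count = 2          # assumed threshold: keep words seen at least twice
too_frequent = {"mr"}  # assumed hand-picked uninformative frequent words

# Vocabulary that survives both filters.
vocab = {w for w, c in counts.items() if c >= min_count and w not in too_frequent}

# Apply the vocabulary to every document, preserving token order.
filtered_docs = [[w for w in doc if w in vocab] for doc in docs]

print(sorted(vocab))
print(filtered_docs)
```

Rare words carry too little co-occurrence evidence for SVD to place them reliably, while very frequent uninformative words appear almost everywhere, which is one possible line of justification for both filters.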

In [71]:
# Please write here all the necessary instructions.  You may use several cells.

# Flatten all articles into one list, keeping purely alphabetic tokens only
all_words = [w for ws in data_df['tokenized'] for w in ws if w.isalpha()]

frequency_distribution = nltk.FreqDist(all_words)

print(f'Most frequent words (n=50): \n {frequency_distribution.most_common(50)}')
print(f'Least frequent words (n=50): \n {frequency_distribution.most_common()[-50:]}')

# Keep only the words that occur more than 3 times in the corpus
filtered = {word: freq for word, freq in frequency_distribution.items() if freq > 3}
fdist_filtered = nltk.FreqDist(filtered)
print(f'Least frequent words (n=10): {fdist_filtered.most_common()[-10:]}')

# Vocabulary size before and after the frequency filter
vocab = set(frequency_distribution)
vocab_filtered = set(fdist_filtered)
print(f'\nBefore filtering: {len(vocab)}')
print(f'After filtering: {len(vocab_filtered)}')

Most frequent words (n=50): 
 [('mr', 306), ('australian', 178), ('new', 171), ('palestinian', 168), ('australia', 157), ('people', 153), ('government', 146), ('two', 136), ('u', 136), ('day', 131), ('south', 130), ('state', 128), ('attack', 127), ('year', 126), ('would', 115), ('one', 114), ('israeli', 112), ('force', 109), ('minister', 106), ('last', 99), ('arafat', 96), ('fire', 95), ('afghanistan', 90), ('united', 89), ('three', 87), ('police', 85), ('world', 84), ('security', 83), ('official', 83), ('could', 82), ('time', 80), ('area', 79), ('today', 77), ('leader', 77), ('told', 75), ('group', 74), ('company', 73), ('union', 71), ('authority', 69), ('laden', 69), ('bin', 68), ('report', 67), ('sydney', 64), ('month', 63), ('man', 63), ('president', 62), ('bank', 61), ('around', 60), ('four', 60), ('test', 58)]
Least frequent words (n=50): 
 [('onto', 1), ('vinyl', 1), ('resentment', 1), ('withdrew', 1), ('limelight', 1), ('plagiarism', 1), ('reformation', 1), ('beset', 1), ('thro

In [72]:
# filter the data
data_df['filtered'] = [[w for w in ws if w in vocab_filtered] for ws in data_df['tokenized']]
print(data_df['filtered'].iloc[50])
data_df['filtered']

['afghan', 'security', 'force', 'arab', 'al', 'qaeda', 'fighter', 'seven', 'others', 'weapon', 'explosive', 'remain', 'hospital', 'southern', 'city', 'kandahar', 'spokesman', 'governor', 'gul', 'agha', 'man', 'left', 'ward', 'one', 'arab', 'custody', 'ward', 'mr', 'seven', 'weapon', 'grenade', 'explosive', 'explosive', 'surrender', 'weapon', 'mr', 'concerned', 'safety', 'arab', 'u', 'bombing', 'kandahar', 'airport', 'hospital', 'taliban', 'militia', 'month', 'taliban', 'weapon', 'grenade', 'explosive', 'arab', 'could', 'protect', 'blow', 'hospital', 'room', 'attempt', 'arrest']


0      [hundred, people, home, southern, highland, ne...
1      [indian, security, force, shot, dead, eight, m...
2      [national, road, toll, year, holiday, period, ...
3      [argentina, political, economic, crisis, inter...
4      [six, midwife, hospital, south, sydney, inappr...
                             ...                        
295    [team, australian, israeli, scientist, success...
296    [today, world, aid, day, late, figure, show, m...
297    [federal, national, party, possible, stage, op...
298    [university, canberra, proposal, republic, one...
299    [australia, france, double, rubber, davis, cup...
Name: filtered, Length: 300, dtype: object

## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 
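
For reference, these three steps correspond to the standard LSA formulation below. The notation is chosen here for illustration and may differ from the course slides: $X$ is the $m \times n$ term-document matrix ($m$ terms, $n$ documents) of TF-IDF weights, and $k$ is the number of topics kept.

```latex
X = U \Sigma V^{\top},
\qquad
X_k = U_k \Sigma_k V_k^{\top},
\qquad k \ll \min(m, n)
```

Here $U_k$ contains the first $k$ left singular vectors (one row per term), $\Sigma_k$ the $k$ largest singular values, and $V_k$ the first $k$ right singular vectors (one row per document). Among all matrices of rank at most $k$, $X_k$ is the closest to $X$ in the Frobenius norm.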

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [73]:
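def train_lsa(filtered_texts, num_topics=10):
  """Train an LSA model on tokenized texts; also return the intermediate objects."""
  # Map each unique token to an integer id
  dictionary = corpora.Dictionary(filtered_texts)

  # Represent every document as a bag of words: a list of (token_id, count) pairs
  corpus = [dictionary.doc2bow(text) for text in filtered_texts]

  # Re-weight the raw counts with TF-IDF (unit-length document vectors)
  tfidf = models.TfidfModel(corpus, normalize=True)
  corpus_tfidf = tfidf[corpus]

  # Truncated SVD of the TF-IDF matrix, keeping num_topics dimensions
  lsa = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)

  return lsa, dictionary, corpus, corpus_tfidf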
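
To make the TF-IDF step inside `train_lsa` more concrete, here is a standard-library sketch on toy documents. It assumes the weighting "raw term frequency times log2(N / document frequency), then L2 normalization", which is believed to match the `TfidfModel` defaults but should be treated as an assumption; `tfidf_vectors` is a hypothetical helper, not a Gensim function.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term frequency times log2 inverse document frequency.
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        # Drop zero weights, i.e. terms that occur in every document.
        weights = {t: w for t, w in weights.items() if w > 0}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Toy corpus (hypothetical): "fire" occurs in every document.
docs = [
    ["fire", "sydney", "fire", "bank"],
    ["fire", "bank"],
    ["fire", "union"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in vec.items()})
```

In this toy corpus `fire` gets weight zero because it appears in all three documents, which illustrates why TF-IDF dampens words that do not discriminate between articles.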

<font color="green">Please fix the `number_of_topics` to 10, on the lower side of the range mentioned in the course.  Then, execute the cell that performs `train_lsa`.</font>

In [74]:
number_of_topics = 2

In [75]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function using `num_topics=4` and `num_words=10`.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>

In [76]:
lsa_model.print_topics(number_of_topics)

[(0,
  '0.349*"palestinian" + 0.230*"israeli" + 0.204*"arafat" + 0.129*"israel" + 0.127*"mr" + 0.123*"hamas" + 0.113*"attack" + 0.107*"gaza" + 0.106*"force" + 0.105*"afghanistan"'),
 (1,
  '-0.421*"palestinian" + -0.278*"israeli" + -0.246*"arafat" + -0.153*"israel" + -0.150*"hamas" + 0.135*"afghanistan" + -0.132*"gaza" + 0.110*"bin" + 0.109*"laden" + -0.104*"sharon"')]
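
Each topic is printed as a weighted sum of words, where the weight is that word's coefficient in the topic's direction. As a small aid for reading such a line, the sketch below splits topic 1 (copied from the output above) into words with positive and negative weights. `parse_topic` is a hypothetical helper written for this illustration; note that the overall sign of a singular vector is arbitrary, so only the contrast between the two groups is meaningful.

```python
import re

# Topic 1 exactly as printed by print_topics above.
topic_1 = ('-0.421*"palestinian" + -0.278*"israeli" + -0.246*"arafat" + '
           '-0.153*"israel" + -0.150*"hamas" + 0.135*"afghanistan" + '
           '-0.132*"gaza" + 0.110*"bin" + 0.109*"laden" + -0.104*"sharon"')

def parse_topic(topic):
    """Turn a 'weight*"word" + ...' string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic)]

pairs = parse_topic(topic_1)

# Group the words by the sign of their weight.
positive = [word for word, weight in pairs if weight > 0]
negative = [word for word, weight in pairs if weight < 0]

print("positive:", positive)
print("negative:", negative)
```

Read this way, topic 1 appears to separate articles about the Israeli-Palestinian conflict from articles about Afghanistan and bin Laden, whereas topic 0 gives positive weight to both themes.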

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please exemplify its value on the two word pairs "morning" and "night" as well as "morning" and "qantas", and two additional word pairs of your choice, and comment on the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [77]:
from sklearn.metrics.pairwise import cosine_similarity

In [78]:
def wordsim(word1, word2, model, dictionary):
  word1_bow = dictionary.doc2bow([word1])
  word2_bow = dictionary.doc2bow([word2])

  if len(word1_bow) == 0 or len(word2_bow) == 0:
        raise Exception('Words not in dictionary!')

  # convert to LSA space
  word1_lsa = model[word1_bow]
  word2_lsa = model[word2_bow]
    
  # compute the similarity
  sim = cosine_similarity(word1_lsa, word2_lsa)

  return sim

In [79]:
# print here the cosine similarities of several pairs and comment the results.
sim_high = wordsim('arafat', 'israeli', lsa_model, dictionary)
sim_low = wordsim('hindu', 'gaza', lsa_model, dictionary)

print(sim_high)
print(sim_low)

[[ 1.         -0.26790075]
 [-0.23886647  0.99954958]]
[[ 1.         -0.13071564]
 [ 0.00122172  0.99125946]]
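
The outputs above are 2×2 matrices rather than single scores. A likely cause is that `model[bow]` returns a list of `(topic_id, weight)` pairs, and `cosine_similarity` then treats each pair as a separate row vector instead of treating the weights as one vector. The sketch below shows a scalar cosine over such sparse vectors using only the standard library. The example vectors are invented, not real model output, and `sparse_cosine` is a hypothetical helper name; Gensim also offers `gensim.matutils.cossim` for the same purpose.

```python
import math

def sparse_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    # Dot product over the ids present in the first vector.
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm1 * norm2)

# Invented 2-topic vectors in the (topic_id, weight) format.
a = [(0, 0.3), (1, -0.4)]
b = [(0, 0.6), (1, -0.8)]  # same direction as a
c = [(0, 0.4), (1, 0.3)]   # orthogonal to a

print(sparse_cosine(a, b))  # close to 1
print(sparse_cosine(a, c))  # close to 0
```

Inside `wordsim`, the same idea would apply to `word1_lsa` and `word2_lsa`, yielding one number per word pair.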


<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word, showing the score too.  You don't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment on the results.</font>

In [80]:
from gensim import similarities

In [81]:
def word_ranking(word0, word_list, model, dictionary):
  word0_bow = dictionary.doc2bow([word0])
  words_bow = [dictionary.doc2bow([w]) for w in word_list]
    
  word0_lsa = model[word0_bow]
  words_lsa = model[words_bow]
    
  index = similarities.MatrixSimilarity(words_lsa)
    
  # perform a similarity query against the corpus
  sims = index[word0_lsa]  
  sims = sorted(enumerate(sims), key=lambda item: -item[1])

  for i, (doc_position, doc_score) in enumerate(sims):
    print('{0}: "{1}"", score: {2:.5f}'.format(i, word_list[doc_position], doc_score))

In [82]:
# call here the function on your choice of words
word_ranking('gaza', ['hot', 'water', 'palestinian', 'israel', 'hindu', 'black', 'wind', 'arafat', 'hamas', 'force'], lsa_model, dictionary)

0: "hamas"", score: 0.99999
1: "arafat"", score: 0.99994
2: "palestinian"", score: 0.99994
3: "israel"", score: 0.99984
4: "hot"", score: 0.21829
5: "force"", score: 0.14463
6: "hindu"", score: 0.00942
7: "black"", score: 0.00000
8: "wind"", score: -0.24296
9: "water"", score: -0.25200


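#### Comments

The words `hamas` and `arafat` score very high against `gaza` because they frequently occur in the same articles. `wind` and `water` rarely appear in that context and therefore get the lowest scores. With only two topics, words that load on the same dominant topic end up pointing in almost the same direction, which likely explains why the top four scores are all close to 1.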

<font color="green">Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>

In [83]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=500)
word_ranking('gaza', ['hot', 'water', 'palestinian', 'israel', 'hindu', 'black', 'wind', 'arafat', 'hamas', 'force'], lsa_model, dictionary)

0: "palestinian"", score: 0.14483
1: "arafat"", score: 0.10987
2: "hamas"", score: 0.04349
3: "force"", score: 0.03042
4: "hindu"", score: 0.00719
5: "black"", score: 0.00000
6: "wind"", score: -0.00702
7: "water"", score: -0.01631
8: "hot"", score: -0.05414
9: "israel"", score: -0.07023


## End of Lab 5
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).