![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab3: Latent Semantic Analysis with Gensim

by Andrei Popescu-Belis (HES-SO), based on material by Fabian Märki (FHNW) and Heiho Hahn (FHSG)

## Summary
The aim of this lab is to perform LSA on a small corpus of news articles. You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query.

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [None]:
import os    
import nltk
import gensim
import pandas as pd
from TextPreprocessor import *
from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore

The data used in this lab is the same set of 300 Australian news articles that you used in Lab 2. It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed. The following code will load the documents into a Pandas dataframe, one article per row.

In [None]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

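# Locate the test data shipped inside the installed gensim package
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# One article per line; the context manager closes the file after reading
with open(lee_train_file, encoding='utf-8') as f:
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})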

## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering (optional)
4. lemmatization or stemming (optional)
5. addition of bigrams to each document (optional)
6. filtering of infrequent words
7. inspection and filtering of frequent words
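
As a rough illustration of tokenization, stopword removal and bigram addition, here is a minimal standard-library sketch. The stopword set and the helper names (`tokenize`, `remove_stopwords`, `add_bigrams`) are assumptions made for this example only; they are not part of NLTK or of `TextPreprocessor.py`.

```python
import re

# Hypothetical, deliberately tiny stopword list; in the lab you would rely on
# the list provided by NLTK or by TextPreprocessor.py instead.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}

def tokenize(text):
    """Lowercase the text and keep purely alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def add_bigrams(tokens):
    """Append each pair of adjacent tokens as a single 'w1_w2' token."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

sentence = "The prime minister is in the capital of Australia."
tokens = remove_stopwords(tokenize(sentence))
print(add_bigrams(tokens))
```

This naive version adds every adjacent pair. In practice one usually keeps only the bigrams that occur often enough in the corpus, for instance with Gensim's `Phrases` model.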

You can use NLTK as in Lab 1, or our in-house `TextPreprocessor.py` file as in Lab 2.

<font color='green'>Please state here which solution you use and list the stages you implement.</font>

In [None]:
#

In [None]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py


In [None]:
data_df['processed'] = processor.transform(data_df['text'])

In [None]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [None]:
# Please write here the preprocessing instructions if you use NLTK


In [None]:
print(data_df['tokenized'].iloc[120])

Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 
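
The frequency-based filtering can be sketched on toy data as follows. The documents, the hand-picked `too_frequent` set and the `min_count` threshold are assumptions for illustration; on the real corpus you should choose them after inspecting the counts. The sketch uses `collections.Counter`, of which `nltk.FreqDist` is a subclass, so `most_common()` works the same way on both.

```python
from collections import Counter

# Toy tokenized documents standing in for data_df['tokenized'].
docs = [
    ["said", "fire", "sydney", "said"],
    ["said", "police", "fire"],
    ["said", "cricket", "sydney", "fire"],
]

# Count every token over the whole corpus.
freq = Counter(token for doc in docs for token in doc)
print(freq.most_common(3))

too_frequent = {"said"}   # chosen by hand after inspecting most_common()
min_count = 2             # drop words seen fewer than min_count times

# Keep words that are frequent enough and not in the hand-picked set.
keep = {w for w, c in freq.items() if c >= min_count and w not in too_frequent}
filtered = [[t for t in doc if t in keep] for doc in docs]

vocabulary = sorted({t for doc in filtered for t in doc})
print(filtered)
print("vocabulary size:", len(vocabulary))
```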

In [None]:
# Please write here all the necessary instructions.  You may use several cells.


In [None]:
print(data_df['filtered'].iloc[10])

## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result. To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim.
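
To make the "SVD, then truncate" step concrete, here is a small NumPy sketch on a hand-made term-document matrix. It is not the Gensim code asked for in this lab: the matrix values, the choice `k = 2` and the scaling of the vectors by the singular values are assumptions for illustration, and Gensim's `LsiModel` relies on its own SVD implementation.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
terms = ["fire", "smoke", "cricket", "wicket"]
A = np.array([
    [2.0, 1.0, 0.0, 0.0],   # fire
    [1.0, 2.0, 0.0, 0.0],   # smoke
    [0.0, 0.0, 3.0, 1.0],   # cricket
    [0.0, 0.0, 1.0, 2.0],   # wicket
])

# SVD: A = U @ diag(s) @ Vt, with singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions ("topics") that are kept

# Truncation: keep only the first k singular values and vectors.
term_vectors = U[:, :k] * s[:k]              # one k-dim vector per term
doc_vectors = (Vt[:k, :] * s[:k, None]).T    # one k-dim vector per document
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of A

print(term_vectors.shape, doc_vectors.shape)
print(np.round(A_k, 2))
```

In the lab itself the matrix holds TF-IDF weights rather than raw counts, and the number of kept dimensions is the `num_topics` argument of your `train_lsa` function.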

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [None]:
def train_lsa(filtered_texts, num_topics = 10):

    return lsa,dictionary,corpus,corpus_tfidf

<font color="green">Please choose a value for `number_of_topics`, on the lower side of the range mentioned in the course.  Then, execute the cell that calls `train_lsa`.</font>

In [None]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How does it relate to what was explained in the course on LSA?</font>

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please illustrate its value on two different word pairs, one of which should be obviously more similar than the other, and comment on the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).
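
As a reminder of what cosine similarity measures, here is a standard-library sketch on made-up two-dimensional vectors. The `word_vectors` dictionary and the `toy_wordsim` helper are hypothetical; in your own `wordsim` the vectors must come from the trained LSA model and the vocabulary test must use the Gensim dictionary.

```python
import math

# Hypothetical 2-dimensional word vectors, standing in for LSA word vectors.
word_vectors = {
    "fire":    [0.9, 0.1],
    "smoke":   [0.8, 0.3],
    "cricket": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_wordsim(word1, word2, vectors):
    """Return the cosine similarity, or None if a word is out of vocabulary."""
    if word1 not in vectors or word2 not in vectors:
        return None
    return cosine(vectors[word1], vectors[word2])

print(toy_wordsim("fire", "smoke", word_vectors))    # related pair
print(toy_wordsim("fire", "cricket", word_vectors))  # unrelated pair
print(toy_wordsim("fire", "volcano", word_vectors))  # out-of-vocabulary word
```

Because cosine similarity ignores vector length, it compares only the directions of the two word vectors in the latent space.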

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def wordsim(word1, word2, model, dictionary):


In [None]:
# Print here the cosine similarities of several pairs and comment on the results.


<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word (giving the score too).  You won't have to use the `cosine_similarity` function here.  Please choose a "query" word and ten other words, apply your function, and comment on the results.</font>
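
The ranking logic itself can be sketched with NumPy on made-up vectors. This is only an illustration of "score every candidate against the query, then sort"; the `words` list, the `vectors` array and the `rank_words` helper are hypothetical, and your `word_ranking` should instead build a Gensim similarity index as shown in the tutorial.

```python
import numpy as np

# Hypothetical word vectors (one row per word), standing in for LSA vectors.
words = ["fire", "smoke", "police", "cricket"]
vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.3],
    [0.5, 0.5],
    [0.1, 0.9],
])

def rank_words(query, candidates, words, vectors):
    """Return (word, cosine) pairs for the candidates, most similar first."""
    # After normalizing each row to unit length, a dot product is a cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}
    q = unit[index[query]]
    # Candidates missing from the vocabulary are silently skipped.
    scored = [(w, float(unit[index[w]] @ q)) for w in candidates if w in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_words("fire", ["cricket", "police", "smoke"], words, vectors)
for word, score in ranking:
    print(f"{word:10s} {score:.3f}")
```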

In [None]:
from gensim import similarities

In [None]:
def word_ranking(word0, word_list, model, dictionary):


In [None]:
# call here the function on your choice of words


In [None]:
# Please write here your comments on the rankings

<font color="green">Please now select a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better, and why?</font>

## LDA with Gensim
For a simple tutorial on using LDA with Gensim, please see https://towardsdatascience.com/topic-modelling-in-python-with-nltk-and-gensim-4ef03213cd21.

Thank you for your work!  Please submit the notebook on Moodle before the next course.