In this notebook, we will use functions from the [Scikit-learn](https://scikit-learn.org/stable/index.html) library to create a word-document matrix and compute the Singular Value Decomposition to implement Latent Semantic Analysis.

The sentences, already tokenized, are taken from the [Parallel Meaning Bank](https://pmb.let.rug.nl/) version 1.0.0.

In [None]:
!wget https://pmb.let.rug.nl/releases/pmb-1.0.0.zip
!unzip pmb-1.0.0.zip

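from glob import glob

corpus = []
# One tokenized file per document, matched under pmb-1.0.0/data/p*/d*/.
for filename in glob("pmb-1.0.0/data/p*/d*/en.tok.off"):
    with open(filename) as f:
        # Drop the first three space-separated fields of each line and keep the token text.
        tokens = [" ".join(line.strip().split(" ")[3:]) for line in f]
    # Each document becomes a single space-joined string.
    corpus.append(" ".join(tokens))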

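A word-document matrix (A) has one row per word and one column per document. Scikit-learn's `CountVectorizer` produces the opposite layout, with documents as rows and words as columns, so A is simply the transpose of its output.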

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
corpus_vectorized = cv.fit_transform(corpus)
A = corpus_vectorized.T

We also save the vocabulary computed by the `CountVectorizer` object as a list. Its order matches the rows of A, so we will use it later to retrieve the vector representation of each word.

In [None]:
vocabulary = list(cv.get_feature_names_out())

Let's define a utility function that, given a vector space model and two words, computes the cosine similarity of their vectors.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

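def similarity(model, word1, word2):
    """Cosine similarity between the rows of `model` for word1 and word2."""
    # list.index raises ValueError if a word is not in the vocabulary.
    vector1 = model[vocabulary.index(word1)]
    vector2 = model[vocabulary.index(word2)]
    # cosine_similarity returns a 1x1 array for two single-row inputs.
    return cosine_similarity(vector1, vector2)[0][0]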
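
The `vocabulary.index(...)` lookup scans the list on every call, which can get slow for large vocabularies. One possible alternative, sketched here with a hypothetical toy vocabulary and a hypothetical helper `row_index`, is to build a word-to-row dictionary once and reuse it.

```python
# Hypothetical toy vocabulary standing in for the notebook's `vocabulary` list.
toy_vocabulary = ["car", "cat", "dog", "man", "plane"]

# Build the mapping once; each later lookup is a constant-time dictionary access.
word_to_index = {word: i for i, word in enumerate(toy_vocabulary)}

def row_index(word):
    """Return the matrix row for `word`, with a readable error for unseen words."""
    try:
        return word_to_index[word]
    except KeyError:
        raise KeyError(f"{word!r} is not in the vocabulary") from None

print(row_index("dog"))
```

This is only a convenience; the list-based lookup gives the same row numbers.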

The word-document matrix is a very sparse vector space model that suffers from the *curse of dimensionality*: two words that never occur in the same document have a cosine similarity of exactly zero, as many of the scores below show.

In [None]:
print(similarity(A, 'cat', 'dog'))
print(similarity(A, 'cat', 'man'))
print(similarity(A, 'cat', 'car'))
print(similarity(A, 'cat', 'plane'))

0.0816496580927726
0.0
0.0
0.0
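
The zeros above follow directly from the definition of cosine similarity. The sketch below uses made-up count vectors over four documents, not rows taken from the real matrix, and only the standard library so that the arithmetic stays visible. When two words share no document, every term of the dot product is zero.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical rows of a word-document matrix over four documents.
cat = [1, 1, 0, 0]
dog = [0, 1, 1, 0]    # shares the second document with "cat"
plane = [0, 0, 0, 1]  # never appears in the same document as "cat"

print(cosine(cat, dog))    # positive: one shared document
print(cosine(cat, plane))  # zero: no shared documents
```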


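The `TruncatedSVD` class computes a truncated Singular Value Decomposition, keeping only the top N singular values and their singular vectors. Its `fit_transform` method returns the rows of the input projected onto those N components (the left singular vectors scaled by the singular values), so each word is represented by a dense N-dimensional vector. This reduced representation is what Latent Semantic Analysis uses in place of the original matrix; here N is 50.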

In [None]:
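from sklearn.decomposition import TruncatedSVD
from scipy import sparse

# Keep the 50 largest singular components; each word becomes a 50-dimensional vector.
svd = TruncatedSVD(n_components=50)
# fit_transform returns a dense array; wrapping it in a sparse matrix keeps row
# indexing two-dimensional, which similarity() relies on.
A_transformed = sparse.csc_matrix(svd.fit_transform(A))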
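
To see what `fit_transform` returns, the following sketch compares it with a full SVD from NumPy on a small random matrix. The matrix, the seed and the `toy_` names are assumptions made for illustration. Up to the sign of each component, the result should match the first k left singular vectors scaled by their singular values.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical small dense matrix standing in for the word-document matrix.
rng = np.random.default_rng(0)
toy = rng.random((6, 4))
k = 2

toy_svd = TruncatedSVD(n_components=k, random_state=0)
reduced = toy_svd.fit_transform(toy)  # one k-dimensional vector per row

# Reference: full SVD, then keep the top-k components as U_k * S_k.
U, S, Vt = np.linalg.svd(toy, full_matrices=False)
expected = U[:, :k] * S[:k]

print(reduced.shape)
# Compare absolute values because the sign of each component is arbitrary.
print(np.allclose(np.abs(reduced), np.abs(expected)))
```

The output has as many rows as the input and k columns, which is why the transformed matrix can still be indexed by word.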

The LSA vectors are dense and give more plausible similarity scores: "cat" is now closest to "dog" and farthest from "plane".

In [None]:
print(similarity(A_transformed, 'cat', 'dog'))
print(similarity(A_transformed, 'cat', 'man'))
print(similarity(A_transformed, 'cat', 'car'))
print(similarity(A_transformed, 'cat', 'plane'))

0.3726249551384404
0.2868626933317482
0.13929558292776395
0.01619782189079399
