
# Word Embeddings

Outline of this notebook

- Latent semantic analysis (SVD)
- Skip-gram model

## 0. Loading Data

In [None]:
import urllib.request
from os.path import isfile
if not isfile("abstract-filtered.txt"):
    url = "https://yangfengji.net/uva-nlp-course/data/abstract-filtered.txt.zip"
    print("Downloading ...")
    filename, headers = urllib.request.urlretrieve(url, filename="abstract-filtered.txt.zip")

    print("Decompressing the file ...")
    !unzip abstract-filtered.txt.zip

# Read one sentence per line; the context manager closes the file afterwards
with open("abstract-filtered.txt") as f:
    sents = f.read().split("\n")
print("Read {} sentences".format(len(sents)))

## 1. Latent Semantic Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse.linalg import svds
from sklearn.decomposition import TruncatedSVD as SVD
from sklearn.metrics.pairwise import cosine_similarity

### 1.1 Construct the sentence-word matrix

The following code will

- construct a data matrix of size words x sentences (the sentence-by-word count matrix, transposed)
- build the vocab (named `vocab1`), which maps a word to its index
- build the inverse vocab (named `ivocab1`), which maps an index to its corresponding word
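
To make the three steps listed above concrete, here is a minimal sketch on a toy corpus. The sentences and the `toy_*` names are made up for illustration, and `min_df=1` is used only because the toy corpus is too small for the `min_df=5` setting used on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real sentences
toy_sents = [
    "word embeddings capture meaning",
    "embeddings map each word to a vector",
    "matrix factorization yields embeddings",
]

toy_vectorizer = CountVectorizer(lowercase=True, min_df=1)
toy_counts = toy_vectorizer.fit_transform(toy_sents)  # sentences x words

toy_vocab = toy_vectorizer.vocabulary_                       # word -> index
toy_ivocab = {idx: word for word, idx in toy_vocab.items()}  # index -> word

toy_mat = toy_counts.T                                       # words x sentences
emb_idx = toy_vocab["embeddings"]
print(toy_mat.shape)
print(emb_idx, toy_ivocab[emb_idx])
```

Note that the default tokenizer of `CountVectorizer` drops single-character tokens, so "a" does not enter the toy vocab.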

In [None]:
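# Keep unigrams that appear in at least 5 sentences
vectorizer = CountVectorizer(lowercase=True, min_df=5, max_df=1.0, ngram_range=(1,1))
mat = vectorizer.fit_transform(sents) # Dim: Sent x Word
vocab1 = vectorizer.vocabulary_ # word -> index
ivocab1 = {val:key for (key, val) in vocab1.items()} # index -> word
# Cast the integer counts to float for the SVD, then transpose
mat = mat.astype(np.float64).T
print("Matrix shape = {}".format(mat.shape)) # Words x Sentences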

### 1.2 SVD

For a given matrix $\mathbf{M}\in\mathbb{R}^{v\times m}$ and a predefined rank $k$, truncated SVD approximates the matrix by the product of three components
$$\mathbf{M} \approx \mathbf{U}\cdot\mathbf{D}\cdot\mathbf{V}^{t}$$
where

- $\mathbf{U}\in\mathbb{R}^{v\times k}$
- $\mathbf{D}\in\mathbb{R}^{k\times k}$: a diagonal matrix holding the $k$ largest singular values
- $\mathbf{V}^{t}\in\mathbb{R}^{k\times m}$

The word embeddings we get from SVD are

$$\mathbf{W} = \mathbf{U}\cdot\mathbf{D}$$

where each row of $\mathbf{W}\in\mathbb{R}^{v\times k}$ is the word embedding of the corresponding word in the vocab.
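
As a small worked example of the decomposition above, the sketch below uses plain NumPy on a made-up $4\times 3$ count matrix. The matrix values, the choice `k = 2`, and the variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy count matrix: v = 4 words, m = 3 sentences
M = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])
k = 2  # number of singular values to keep

# Thin SVD; NumPy returns the singular values in decreasing order
U, d, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k components
U_k, d_k, Vt_k = U[:, :k], d[:k], Vt[:k, :]

# W has shape (v, k): one k-dimensional vector per word
W = U_k * d_k  # same as U_k @ np.diag(d_k)

# Rank-k reconstruction of M and its Frobenius error
M_k = W @ Vt_k
print(W.shape)
print(np.linalg.norm(M - M_k))
```

The reconstruction error equals the square root of the sum of the squared discarded singular values, which is why dropping small singular values loses little information.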

In [None]:
k_max = 500
svd = SVD(n_components=k_max)
W1 = svd.fit_transform(mat) # = U*D, Size: Word x k_max
print(W1.shape)

The plot of the singular values

In [None]:
sigma = svd.singular_values_
plt.plot(range(len(sigma)), sigma, '.')
plt.ylim((0, 280))
plt.xlabel("Dimension indices")
plt.ylabel("Singular values")
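
One way to read such a plot is to ask how many dimensions are needed to capture most of the squared singular values. The sketch below illustrates this on made-up values; `toy_sigma` and the 0.95 threshold are assumptions, not values from this dataset. Note also that `svd.singular_values_` only holds the top `k_max` values, so on the real data the share would be relative to the retained components.

```python
import numpy as np

# Hypothetical singular values, sorted in decreasing order
toy_sigma = np.array([10.0, 6.0, 3.0, 1.0, 0.5])

# Share of the total squared singular values captured by the first k dimensions
energy = np.cumsum(toy_sigma ** 2) / np.sum(toy_sigma ** 2)

# Smallest k whose cumulative share reaches the (arbitrary) threshold
threshold = 0.95
k_small = int(np.searchsorted(energy, threshold) + 1)
print(energy.round(3), k_small)
```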

### 1.3 Word Similarity

In [None]:
def print_sim_words(cossim, vocab, ivocab, word='embeddings'):
    widx = vocab[word] # get word index
    sim_scores = cossim[:,widx] # similarity between the query word and every word
    sim_indices = np.argsort(sim_scores)[::-1] # rank the indices by decreasing similarity
    sim_words = [ivocab[idx] for idx in sim_indices] # map the ranked indices back to words
    print(sim_words[:20]) # print the top 20 words (the query word itself normally ranks first)

Based on the new word representations $\mathbf{W}$, for a given word $x$ we can use cosine similarity to find similar words in the vocab,

$$\cos(x,x') = \frac{\langle\mathbf{w}_{x},\mathbf{w}_{x'}\rangle}{\|\mathbf{w}_{x}\|_2\|\mathbf{w}_{x'}\|_2}$$
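
The formula can be checked by hand on a few made-up vectors. In the sketch below, `w_a`, `w_b`, and `w_c` are hypothetical embeddings; the last lines show that normalizing every row to unit length turns the all-pairs computation into a single matrix product.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-dimensional embeddings
w_a = np.array([1.0, 2.0, 0.0])
w_b = np.array([2.0, 4.0, 0.0])   # same direction as w_a
w_c = np.array([-2.0, 1.0, 0.0])  # orthogonal to w_a

print(cosine(w_a, w_b))  # close to 1: same direction
print(cosine(w_a, w_c))  # close to 0: orthogonal

# All pairs at once: normalize the rows, then take inner products
W_toy = np.stack([w_a, w_b, w_c])
W_unit = W_toy / np.linalg.norm(W_toy, axis=1, keepdims=True)
cossim_toy = W_unit @ W_unit.T
print(cossim_toy.round(3))
```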

In [None]:
# Compute the cosine similarity based on word embeddings

cossim1 = cosine_similarity(W1,W1)

In [None]:
# Print the top 20 similar words

print_sim_words(cossim1, vocab1, ivocab1, word='embeddings')
# print(ivocab1)
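
Computing the full vocab-by-vocab similarity matrix can be memory-hungry for a large vocab. As a possible alternative, the sketch below scores a single query word against all rows and leaves the query word out of the result. The function `top_k_similar` and the toy embeddings are hypothetical, and the sketch assumes that no row of the embedding matrix is all zeros.

```python
import numpy as np

def top_k_similar(W, vocab, ivocab, word, k=5):
    """Return the k words most similar to `word`, excluding the word itself."""
    W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-length rows
    scores = W_unit @ W_unit[vocab[word]]                  # cosine to every word
    scores[vocab[word]] = -np.inf                          # drop the query word
    best = np.argsort(scores)[::-1][:k]                    # indices of the top k
    return [(ivocab[int(i)], float(scores[i])) for i in best]

# Hypothetical toy embeddings
toy_vocab = {"cat": 0, "dog": 1, "car": 2, "truck": 3}
toy_ivocab = {i: w for w, i in toy_vocab.items()}
toy_W = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])
print(top_k_similar(toy_W, toy_vocab, toy_ivocab, "cat", k=2))
```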

## 2. Skip-gram

### 2.1 The implementation from fastText

In this section, we will first use the [fastText](https://pypi.org/project/fasttext/) implementation for a preliminary study of the skip-gram model. 
This code fully implements the technical details discussed in class. 
Please refer to the fastText documentation for more information on using this code. 

In [None]:
import fasttext

model = fasttext.train_unsupervised('abstract-filtered.txt', model='skipgram', # file downloaded in Section 0
                                    ws = 3, # context window size 
                                    dim = 50, # word embedding dimension
                                    epoch = 3, # training epochs
                                    minCount=5) # the minimal count of words in the vocab
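
To illustrate what the context window size `ws` controls, the sketch below enumerates the (center, context) training pairs for one toy sentence. It is a simplified illustration, not fastText's actual code: fastText also samples the effective window size at each position, subsamples frequent words, and uses subword n-grams, all of which are omitted here. The helper `skipgram_pairs` and the toy sentence are hypothetical.

```python
def skipgram_pairs(tokens, ws=3):
    """Yield (center, context) pairs using a fixed symmetric window of size ws."""
    for i, center in enumerate(tokens):
        lo = max(0, i - ws)                # left edge of the window
        hi = min(len(tokens), i + ws + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                     # a word is not its own context
                yield center, tokens[j]

toy_tokens = "we learn word embeddings from text".split()
pairs = list(skipgram_pairs(toy_tokens, ws=1))
print(len(pairs))
print(pairs[:4])
```

With `ws=1` each word is paired only with its immediate neighbors; a larger `ws` adds more distant words as contexts.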

After training the model, we can collect the word embedding matrix and the vocabulary for evaluation. 

In [None]:
W2 = model.get_output_matrix() # output embedding matrix, one row per word in model.get_words()
vocab2 = {word:idx for (idx, word) in enumerate(model.get_words())} # word -> index
ivocab2 = {idx:word for (idx, word) in enumerate(model.get_words())} # index -> word

Compute the cosine similarity between all pairs of words in the vocab

In [None]:
cossim2 = cosine_similarity(W2,W2)

Now, we can pick any word from the vocab and find out its similar words based on the cosine similarity of word embeddings. 

In [None]:
# Print the top 20 similar words

print_sim_words(cossim2, vocab2, ivocab2, word='embeddings')