# Lab 1: Word Embeddings


In this lab, we'll use [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/), a semantically meaningful vector representation of words.

These embeddings are trained with the intuition that words that appear near each other in natural language are semantically related. We can quantify this semantic similarity by constructing a **co-occurrence matrix**: given a corpus of text, a vocabulary list and a fixed window size, we count the number of times each vocab word occurs in the context of every other vocab word. If our window size is $k$, the $ij$-th entry of our co-occurrence matrix $X$ is the number of times that vocab word $i$ occurs within $k$ words of vocab word $j$.
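To make the counting concrete, here is a minimal sketch of building a co-occurrence matrix for a toy sentence. The function name `cooccurrence_matrix` and the toy corpus are illustrative assumptions, not part of the GloVe codebase, and real implementations typically use sparse storage and may weight counts by distance.

```python
def cooccurrence_matrix(tokens, vocab, k):
    """Count how often vocab word i occurs within k tokens of vocab word j.

    Illustrative helper (not from GloVe): returns a dense list-of-lists X
    where X[i][j] is the co-occurrence count for vocab[i] and vocab[j].
    """
    index = {word: i for i, word in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in vocab]
    for pos, word in enumerate(tokens):
        if word not in index:
            continue  # skip out-of-vocabulary tokens
        # Positions within k tokens on either side of the current word.
        window = range(max(0, pos - k), min(len(tokens), pos + k + 1))
        for ctx_pos in window:
            context = tokens[ctx_pos]
            if ctx_pos != pos and context in index:
                X[index[word]][index[context]] += 1
    return X


toy_tokens = 'the cat sat on the mat'.split()
toy_vocab = ['the', 'cat', 'sat', 'on', 'mat']
toy_X = cooccurrence_matrix(toy_tokens, toy_vocab, k=1)

for word, row in zip(toy_vocab, toy_X):
    print(f'{word:>4}: {row}')
```

With a symmetric window like this one, the resulting matrix is symmetric: `X[i][j]` equals `X[j][i]`.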

GloVe embeddings are trained such that for the embeddings $w_i$ and $w_j$ of words $i$ and $j$,

$$w_i^Tw_j \propto \log(X_{ij})$$

where $X_{ij}$ is the co-occurrence count of words $i$ and $j$ in the training corpus (the full GloVe objective also adds per-word bias terms, omitted here for simplicity). Note that $w_i^Tw_j$ is simply the dot product of $w_i$ and $w_j$, which equals the cosine of the angle between the two vectors multiplied by their norms. As such, this objective roughly encodes the semantic similarity between two words as the cosine similarity between their embedding vectors. Below are some examples of pre-trained GloVe embeddings projected onto a 2D plane:

![glove.png](https://drive.google.com/uc?id=1acKr4dLPgyt-45U624YF7kUB_SIQKcK4)

(Image taken from [here](https://nlp.stanford.edu/projects/glove/).)

In [1]:
import numpy as np
import pickle

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# === Download data and GloVe word embeddings
# !wget http://nlp.stanford.edu/data/glove.6B.zip

# === Download Preprocessed version
!wget https://docs.google.com/uc?id=1-F_x9wnGENF8E6usLPEP8Ll4vQH5WyKI -O embeddings1.pickle
!wget https://docs.google.com/uc?id=1-dbkzj_KaOCa4MTiqpAjyTb8aFBgxDis -O embeddings2.pickle
!wget https://docs.google.com/uc?id=1-npUmtBI08nKzyx9Wai-vJxW6OhnU2xA -O embeddings3.pickle
!wget https://docs.google.com/uc?id=1-rd3UL9O8mIkn0hJGgFav4YYO7zvCTcF -O vocabulary.pickle

--2024-01-24 23:43:22--  https://docs.google.com/uc?id=1-F_x9wnGENF8E6usLPEP8Ll4vQH5WyKI
Resolving docs.google.com (docs.google.com)... 64.233.183.102, 64.233.183.138, 64.233.183.113, ...
Connecting to docs.google.com (docs.google.com)|64.233.183.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1-F_x9wnGENF8E6usLPEP8Ll4vQH5WyKI [following]
--2024-01-24 23:43:22--  https://drive.usercontent.google.com/download?id=1-F_x9wnGENF8E6usLPEP8Ll4vQH5WyKI
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 142.251.120.132, 2607:f8b0:4001:c2e::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|142.251.120.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39998539 (38M) [application/octet-stream]
Saving to: ‘embeddings1.pickle’


2024-01-24 23:43:25 (81.9 MB/s) - ‘embeddings1.pickle’ saved [39998539/39998539]

--2024-01-24 23:43:25--  htt

In [3]:
with open('embeddings1.pickle', 'rb') as handle:
    emb1 = pickle.load(handle)

with open('embeddings2.pickle', 'rb') as handle:
    emb2 = pickle.load(handle)

with open('embeddings3.pickle', 'rb') as handle:
    emb3 = pickle.load(handle)

with open('vocabulary.pickle', 'rb') as handle:
    vocab = pickle.load(handle)

embeddings = np.concatenate((emb1, emb2, emb3))

if embeddings.shape[0] != len(vocab):
    raise ValueError('Embeddings and vocabulary have different lengths')

In [4]:
print(f'Length of vocab: {len(vocab)}')
print(vocab[0:5])

Length of vocab: 50000
['the', ',', '.', 'of', 'to']


In [5]:
print(f'Shape of embeddings:{str(embeddings.shape)}')
print(embeddings[0])

Shape of embeddings:(50000, 300)
[ 4.6560e-02  2.1318e-01 -7.4364e-03 -4.5854e-01 -3.5639e-02  2.3643e-01
 -2.8836e-01  2.1521e-01 -1.3486e-01 -1.6413e+00 -2.6091e-01  3.2434e-02
  5.6621e-02 -4.3296e-02 -2.1672e-02  2.2476e-01 -7.5129e-02 -6.7018e-02
 -1.4247e-01  3.8825e-02 -1.8951e-01  2.9977e-01  3.9305e-01  1.7887e-01
 -1.7343e-01 -2.1178e-01  2.3617e-01 -6.3681e-02 -4.2318e-01 -1.1661e-01
  9.3754e-02  1.7296e-01 -3.3073e-01  4.9112e-01 -6.8995e-01 -9.2462e-02
  2.4742e-01 -1.7991e-01  9.7908e-02  8.3118e-02  1.5299e-01 -2.7276e-01
 -3.8934e-02  5.4453e-01  5.3737e-01  2.9105e-01 -7.3514e-03  4.7880e-02
 -4.0760e-01 -2.6759e-02  1.7919e-01  1.0977e-02 -1.0963e-01 -2.6395e-01
  7.3990e-02  2.6236e-01 -1.5080e-01  3.4623e-01  2.5758e-01  1.1971e-01
 -3.7135e-02 -7.1593e-02  4.3898e-01 -4.0764e-02  1.6425e-02 -4.4640e-01
  1.7197e-01  4.6246e-02  5.8639e-02  4.1499e-02  5.3948e-01  5.2495e-01
  1.1361e-01 -4.8315e-02 -3.6385e-01  1.8704e-01  9.2761e-02 -1.1129e-01
 -4.2085e-01  1.39

**Questions**:

* What is the length of each word vector?
* What does each number in an embedding vector represent?
* How does an algorithm like GloVe deal with low-frequency words?

## Visualizing Word Embeddings

Visualizing embeddings is important for understanding and improving them. However, as we saw above, embeddings are 300-dimensional vectors.

**Question**: How would you visualize the meaning of a word in an embedded space?

In [6]:
words = [
    'bad', 'dislike', 'worst', # negative polarity
    'fantastic', 'good', 'wonderful', # positive polarity
    'cat', 'cow', 'birds', # animals
    'sing', 'read' # verbs
    ]

In [14]:
word = 'to'
idx = vocab.index(word)
embedding = embeddings[idx]
print(f'word: {word}, index: {idx}, embedding: {embedding}')

word: to, index: 4, embedding: [-2.5756e-01 -5.7132e-02 -6.7190e-01 -3.8082e-01 -3.6421e-01 -8.2155e-02
 -1.0955e-02 -8.2047e-02  4.6056e-01 -1.8477e+00 -1.1258e-01 -1.2955e-01
  2.7254e-01  7.2891e-03  2.6038e-01  1.2096e-01 -2.3193e-01  3.2260e-02
 -2.9472e-01 -6.7594e-01 -3.3844e-01 -2.3297e-01  1.1020e-01  1.8816e-01
 -4.5184e-01 -3.3833e-01  1.1274e-01  4.9490e-01 -4.2132e-02  7.9961e-02
 -1.3146e-02  6.2284e-02  2.0223e-01  3.8279e-02 -1.1154e+00 -1.2140e-01
  8.9846e-02  2.9702e-01 -5.5794e-02 -4.6021e-01 -1.3194e-01  8.7357e-02
 -2.7865e-01  1.4981e-01  2.5536e-01  1.6698e-01 -4.4520e-02  6.7588e-02
 -1.1772e-01 -1.3452e-01  2.8694e-01 -3.9844e-01 -1.2806e-01 -4.7818e-01
  6.7802e-02  2.0353e-01 -3.0677e-01  6.0789e-01 -1.8588e-01  1.1997e-01
 -4.0508e-02 -6.5860e-02  3.0621e-01 -5.5824e-02  3.9448e-02 -4.5570e-01
  2.1081e-01  2.5889e-01  1.4666e-01  3.0950e-01  1.4343e-01  1.0524e-01
  1.5788e-01  1.0300e-01  3.2211e-01 -2.7939e-01 -1.7139e-01  3.2202e-01
  1.0784e-01 -2.8209

In [None]:
filtered_embeddings = []
for word in words:
  idx = vocab.index(word)
  embedding = embeddings[idx]
  filtered_embeddings.append(embedding)

In [None]:
filtered_embeddings = np.array(filtered_embeddings)
print(filtered_embeddings.shape)

In [None]:
X_embedded = TSNE(
    n_components=2, learning_rate='auto',
    init='random', perplexity=2,
    random_state=24
    ).fit_transform(filtered_embeddings)

In [None]:
X_embedded.shape

In [None]:
plt.figure()
# Plot each projected word as a point and label it with the word itself.
for i, coords in enumerate(X_embedded):
    plt.scatter(coords[0], coords[1])
    plt.annotate(words[i],
                 xy=(coords[0], coords[1]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
plt.title('Two-dimensional projection of GloVe embeddings')
plt.show()

**Questions**:

*   What clusters can you see in the figure?
*   Can we interpret the X and Y axes in the figure above?



## Nearest neighbors

Another way of exploring word embeddings is to list the words most similar to a given word, ordered by **cosine similarity**.
<br>
Remember that the cosine similarity of two vectors **v** and **w** can be computed as:

$$\cos(\pmb v, \pmb w) = \frac{\pmb v \cdot \pmb w}{\lVert \pmb v \rVert \, \lVert \pmb w \rVert}$$

This metric is 1 when **v** and **w** point in the same direction, 0 when they are orthogonal, and -1 when they point in opposite directions.
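As a quick sanity check, the formula above can be written directly in NumPy. This is a minimal sketch; the helper name `cosine` is an assumption made for illustration, and it assumes neither input is the zero vector.

```python
import numpy as np


def cosine(v, w):
    """Cosine similarity of two 1-D vectors (assumes neither is all zeros)."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))


v = np.array([1.0, 2.0, 3.0])
same_direction = cosine(v, 2 * v)             # parallel vectors
opposite_direction = cosine(v, -v)            # antiparallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors

print(same_direction, opposite_direction, orthogonal)
```

Note that scaling a vector does not change the result: only the direction matters, not the length.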

In [None]:
for word in words:
  idx = vocab.index(word)
  input_word_embedding = embeddings[idx]
  # Cosine similarity between this word and every word in the vocabulary.
  similarity = cosine_similarity([input_word_embedding], embeddings).flatten()
  # Sort in descending order and skip position 0, which is the word itself.
  ind = np.argsort(similarity)[::-1][1:5]
  similar_words = ', '.join(np.array(vocab)[ind])
  print(f'{word}: {similar_words}')
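The loop above relies on the query word sorting first so that it can be skipped. One possible alternative, sketched below under the assumption that no embedding row is all zeros, normalizes the rows once and masks out the query explicitly. The function `nearest_neighbors` and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np


def nearest_neighbors(query, vocab, embeddings, top_k=4):
    """Return the top_k vocab words closest to `query` by cosine similarity."""
    # Normalize each row to unit length so a dot product equals the cosine.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    idx = vocab.index(query)
    similarity = unit @ unit[idx]
    similarity[idx] = -np.inf  # exclude the query word itself
    order = np.argsort(similarity)[::-1][:top_k]
    return [vocab[i] for i in order]


# Toy 2-D "embeddings", made up for this example.
demo_vocab = ['good', 'great', 'bad', 'cat']
demo_embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [-1.0, 0.1],
    [0.0, 1.0],
])
demo_neighbors = nearest_neighbors('good', demo_vocab, demo_embeddings, top_k=2)
print(demo_neighbors)
```

With the lab's data, the same helper could be called as `nearest_neighbors('cat', vocab, embeddings)`.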

**Question**: What kinds of similarities/differences do you see among the nearest neighbors to each word?


