# GloVe (Global Vectors for word representation)

paper: https://nlp.stanford.edu/pubs/glove.pdf

website: https://nlp.stanford.edu/projects/glove/

- GloVe is an unsupervised learning algorithm

- similar to PCA: it finds a low-rank matrix approximation

- advantage: it can be trained efficiently with **stochastic gradient descent**

- performs well on word analogy tasks

## Motivation

Motivation: **ratios of word-word co-occurrence probabilities** are semantically meaningful

For example, consider the co-occurrence probabilities for the **target words** "ice" and "steam" with various **probe words** (solid, gas, water, fashion) from the vocabulary.

Note:

- ice is related to solid

- steam is related to gas

- both ice and steam are related to water

- both ice and steam are unrelated to fashion


Here are some actual probabilities from a 6 billion word corpus:

<img src="https://nlp.stanford.edu/projects/glove/images/table.png"/>

As one might expect, **ice** co-occurs more frequently with **solid** than it does with **gas**, 

whereas **steam** co-occurs more frequently with **gas** than it does with **solid**. 

Both words co-occur with their shared property **water** frequently, 

and both co-occur with the **unrelated** word **fashion** infrequently. 

Only in the **ratio of probabilities** does noise from **non-discriminative** words like **water and fashion** cancel out, 

so that **large values (>> 1)** correlate well with properties specific to **ice**, 

and **small values (<< 1)** correlate well with properties specific to **steam**. 

In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.
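The cancellation described above can be checked numerically. The sketch below recomputes the ratios from the rounded probabilities in the table (Table 1 of the GloVe paper); the dictionary names are illustrative only.

```python
# Co-occurrence probabilities P(k | target) as reported (rounded) in Table 1 of the GloVe paper.
probes = ["solid", "gas", "water", "fashion"]
p_ice = {"solid": 1.9e-4, "gas": 6.6e-5, "water": 3.0e-3, "fashion": 1.7e-5}
p_steam = {"solid": 2.2e-5, "gas": 7.8e-4, "water": 2.2e-3, "fashion": 1.8e-5}

# Ratio P(k | ice) / P(k | steam) for every probe word k.
ratios = {k: p_ice[k] / p_steam[k] for k in probes}

for k in probes:
    print(f"{k:>8}: P(k|ice) / P(k|steam) = {ratios[k]:.3g}")
```

Because the inputs are rounded, the recomputed ratios can differ slightly from the ratio row printed in the table, but the pattern is the same: much greater than 1 for solid, much less than 1 for gas, and close to 1 for water and fashion.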

## Objective

- Objective: optimize over the embedding function $\phi$ to minimize a weighted least-squares error (reconstruction error),

    i.e., use the dot product of word embeddings as a low-rank approximation of the logarithm of the word co-occurrence counts, $\log(C_{ij})$

- This is a log-bilinear model

$$
\hat \phi = \arg \min_{\phi}L(\phi)=\arg \min_{\phi} \sum_{i=1}^{|V|}\sum_{j=1}^{|V|}f(C_{i, j})\left [ \phi(w_i)^T \phi(w_j)-\log C_{i, j}\right ]^2
$$

where $\phi(w_i) \in \mathbb{R}^d$ is the embedding of word $w_i$

$\phi(w_i)^T \phi(w_j)=\left \langle \phi(w_i), \phi(w_j) \right \rangle$ is the inner product of the embeddings of words $w_i$ and $w_j$

$|V|$ is the size of the vocabulary

- $C \in \mathbb{R}^{|V| \times |V|}$ is the **global word-word co-occurrence matrix**, a square matrix

    that tabulates how frequently words co-occur with one another in a given corpus.

    $C_{i, j}$ is the $(i, j)$-th entry of the matrix: the co-occurrence count, i.e., the number of times target word $w_i$ occurs in the context of word $w_j$

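- $f(x)$ is the weight function

    It grows with the count $x$ but is capped at 1, which prevents the model from learning only from extremely frequent word pairs; because $\alpha < 1$, infrequent word-context pairs receive more weight than they would under weighting proportional to $x$.

$$
f(x)=\min \left(1, \left(\frac{x}{x_{max} }\right)^{\alpha} \right) \le 1
$$

where $x_{max}$ is a cutoff hyperparameter (the GloVe paper uses $x_{max} = 100$); every count at or above it gets weight 1

$\alpha$ is set empirically to $\frac{3}{4}$

if $0<\frac{x}{x_{max}} \ll 1$, e.g. $\frac{x}{x_{max}} = 10^{-4}$, 

the exponent boosts the weight from $10^{-4}$ (linear weighting) to $f(x) = \min\left(1, \left(10^{-4}\right)^{\frac{3}{4}} \right)=10^{-3}$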
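A minimal NumPy sketch of the weight function follows. The function name `glove_weight` and the default `x_max=100.0` (the value reported in the GloVe paper) are assumptions for illustration, not part of these notes.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight f(x) = min(1, (x / x_max) ** alpha), applied elementwise."""
    x = np.asarray(x, dtype=float)
    return np.minimum(1.0, (x / x_max) ** alpha)

# Counts from absent (0) through rare to very frequent pairs.
counts = np.array([0.0, 0.01, 1.0, 50.0, 100.0, 1000.0])
weights = glove_weight(counts)

for c, w in zip(counts, weights):
    print(f"count = {c:8.2f}  weight = {w:.4f}")
```

With `x_max=100`, the count `0.01` corresponds to the ratio $10^{-4}$ used in the example above, and every count at or above `x_max` is capped at weight 1.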

An entry of the word-context matrix can be positive or negative:

- positive: high correlation/similarity, e.g., "Mickey" and "Mouse"

- negative: low correlation/similarity, e.g., "Mickey" and "Egg"

People readily associate "Mouse" with "Mickey" but rarely think of "Egg" when given "Mickey", so positive correlations are more informative than negative ones.

**Thus we can set negative values in the word-context matrix to 0.**
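As a rough illustration of this clipping step, the sketch below assumes the word-context matrix holds pointwise mutual information (PMI) values. That is an assumption: these notes do not name the association measure, and the toy counts are made up.

```python
import numpy as np

# Hypothetical toy co-occurrence counts for ["mickey", "mouse", "egg"].
C = np.array([[0.0, 8.0, 1.0],
              [8.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])

total = C.sum()
p_ij = C / total                              # joint probabilities
p_i = C.sum(axis=1, keepdims=True) / total    # row marginals
p_j = C.sum(axis=0, keepdims=True) / total    # column marginals

# PMI is log(p_ij / (p_i * p_j)); zero counts give -inf, so silence that warning.
with np.errstate(divide="ignore"):
    pmi = np.log(p_ij / (p_i * p_j))

# Clip every negative value (including -inf) to 0.
ppmi = np.maximum(pmi, 0.0)
print(np.round(ppmi, 3))
```

In this toy matrix the "mickey"/"egg" entry has negative PMI and is clipped to 0, while the "mickey"/"mouse" entry stays positive.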

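## Relation to the analogy task

This objective associates the logarithm of ratios of co-occurrence probabilities $\frac{P_{ik}}{P_{jk}}$ with vector differences in the embedding space $\phi(w_i) - \phi(w_j)$, which means that the vector difference encodes the same meaning as the ratio:

$$
[\phi(w_i) - \phi(w_j)]^T \phi(w_k) \approx \log \frac{P_{ik}}{P_{jk}} = \log P_{ik} - \log P_{jk} = \log \left(\frac{C_{ik}}{\sum_{l=1}^{|V|}C_{il}}\right)- \log \left(\frac{C_{jk}}{\sum_{l=1}^{|V|}C_{jl}}\right)
$$

where $\phi(w_k) \in \mathbb{R}^{d}$ is the embedding of the context word $w_k$, and

$P_{ik}=\frac{C_{ik}}{\sum_{l=1}^{|V|}C_{il}}$ is the co-occurrence probability obtained by row-normalizing the co-occurrence counts.

The relation is approximate: since the objective fits $\phi(w_i)^T \phi(w_k) \approx \log C_{ik}$, the left-hand side differs from the log-ratio by the row-sum term $\log \sum_l C_{il} - \log \sum_l C_{jl}$, which does not depend on $k$ (the GloVe paper absorbs it into bias terms).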
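The row normalization and the log-ratio can be sketched on a toy count matrix. The counts and the indices `i`, `j`, `k` below are made up for illustration.

```python
import numpy as np

# Hypothetical symmetric co-occurrence counts for a 4-word vocabulary.
C = np.array([[0.0, 4.0, 1.0, 5.0],
              [4.0, 0.0, 6.0, 2.0],
              [1.0, 6.0, 0.0, 3.0],
              [5.0, 2.0, 3.0, 0.0]])

# Row-normalize the counts: P[i, k] = C[i, k] / sum_l C[i, l].
row_sums = C.sum(axis=1)
P = C / row_sums[:, None]

i, j, k = 0, 1, 3  # two target words and one probe (context) word
log_ratio = np.log(P[i, k] / P[j, k])

# Same quantity from raw counts: the log-count difference minus a
# row-sum term that depends on i and j but not on the probe word k.
count_diff = np.log(C[i, k]) - np.log(C[j, k])
row_term = np.log(row_sums[i]) - np.log(row_sums[j])

print(log_ratio, count_diff - row_term)
```

The two printed values agree, which shows that the difference of log-counts and the log-ratio of probabilities differ only by a term that is constant across probe words.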

## Implementation

- full gradient descent

    https://www.youtube.com/watch?v=mC7zSvYj60g

- use PyTorch to compute the gradient + AdaGrad

    https://towardsdatascience.com/a-comprehensive-python-implementation-of-glove-c94257c2813d
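The first option, full gradient descent, can be sketched in NumPy for the simplified objective written in these notes (one embedding per word, no bias terms). This is not the full GloVe model, and it is not the code from either link. The function names, the toy matrix, the learning rate, and `x_max=100` are assumptions for illustration. The sum is restricted to nonzero counts because $\log 0$ is undefined.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x) = min(1, (x / x_max) ** alpha), elementwise."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def glove_loss_and_grad(Phi, C, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over nonzero counts and its gradient w.r.t. Phi."""
    mask = C > 0
    log_C = np.zeros_like(C, dtype=float)
    log_C[mask] = np.log(C[mask])
    W = np.where(mask, glove_weight(C, x_max, alpha), 0.0)  # zero weight where C == 0
    E = Phi @ Phi.T - log_C        # reconstruction error for every pair
    R = W * E                      # weighted error
    loss = np.sum(R * E)           # sum_ij W_ij * E_ij ** 2
    grad = 2.0 * (R + R.T) @ Phi   # each row of Phi appears as both target and context
    return loss, grad

def fit_glove(C, dim=2, lr=0.02, n_steps=1000, seed=0):
    """Plain full-batch gradient descent from a small random initialization."""
    rng = np.random.default_rng(seed)
    Phi = 0.1 * rng.standard_normal((C.shape[0], dim))
    losses = []
    for _ in range(n_steps):
        loss, grad = glove_loss_and_grad(Phi, C)
        losses.append(loss)
        Phi -= lr * grad
    return Phi, losses

# Hypothetical toy counts: words 0/1 co-occur often, as do words 2/3.
C = np.array([[0.0, 30.0, 2.0, 1.0],
              [30.0, 0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0, 40.0],
              [1.0, 2.0, 40.0, 0.0]])

Phi, losses = fit_glove(C)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

For a real corpus the dense $|V| \times |V|$ products would be too large; a practical version would iterate over the nonzero entries of a sparse $C$ in mini-batches, as in the stochastic approach of the second link.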

### Compute the co-occurrence matrix

Populating the co-occurrence matrix requires a single pass through the entire corpus.

For a large corpus, this pass can be computationally expensive.

Subsequent training iterations are faster because the co-occurrence matrix is sparse.

#### Method 1: product of a sparse matrix with its transpose

- the co-occurrence matrix $C \in \mathbb{R}^{3292 \times 3292}$ is the product of the transpose of the sparse playlist-by-artist matrix $X \in \mathbb{R}^{18111 \times 3292}$ with $X$ itself

$$
C = X^T X
$$

In [None]:
import pandas as pd
import scipy.sparse  # import the submodule explicitly; a bare `import scipy` may not expose scipy.sparse
import numpy as np 

In [None]:
leadstring = '/Users/wenxinxu/Desktop/SDS565/data/playlists/'
with open(leadstring+'artists.txt','r') as f:
    artists=f.readlines()
pl = pd.read_csv(leadstring+'playlists.txt',header=None)

# a dict mapping code to artist e.g., 941: 'By The Tree'
codetoartist = {j : artists[j].strip() for j in range(len(artists))}

# a dict mapping artist to code e.g., 'By The Tree': 941
artisttocode = {artists[j].strip() : j for j in range(len(artists))}

# build the sparse playlist-by-artist matrix X, shape (18111, 3292)
d = pl.to_dict()[0]  # row index -> string of space-separated artist codes
inds = [(j, [int(k) for k in d[j].split()]) for j in range(len(d))]
vals = np.ones(sum(len(codes) for _, codes in inds))  # one 1 per (playlist, artist) occurrence, (189900,)
i2 = [([j[0]]*len(j[1]), j[1]) for j in inds]

row_ind = [k for j in i2 for k in j[0]]  # playlist index of each occurrence
col_ind = [k for j in i2 for k in j[1]]  # artist code of each occurrence

# Compressed Sparse Row matrix with X[row_ind[k], col_ind[k]] = vals[k]; duplicate (row, col) pairs are summed
X = scipy.sparse.csr_matrix((vals, (row_ind, col_ind)), shape=(len(d), len(artists)))  # (18111, 3292)
# co-occurrence matrix (3292, 3292)
C = X.T @ X
C.setdiag(0)  # an artist's co-occurrence with itself is set to zero

#### Method 2: for-loop

In [None]:
leadstring = '/Users/wenxinxu/Desktop/SDS565/data/playlists/'
with open(leadstring+'artists.txt','r') as f:
    artists=f.readlines()

with open(leadstring+'playlists.txt','r') as f:
    sentences=f.readlines()

# each playlist as a list of integer artist codes
sentenceslist = [list(map(int, line.split())) for line in sentences]

# co-occurrence matrix (3292, 3292)
C = np.zeros((len(artists), len(artists)))
for sentence in sentenceslist:  # each sentence is one playlist
    for context in sentence:  # context word (artist)
        for target in sentence:  # target word (artist)
            # count how often the target artist appears in the same playlist as the context artist
            C[context, target] += 1

np.fill_diagonal(C, 0)  # an artist's co-occurrence with itself is set to zero
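As a sanity check, the two methods can be compared on a small example. The toy playlists below are made up, and the variable names are illustrative rather than taken from the cells above.

```python
import numpy as np
import scipy.sparse

# Hypothetical toy data: 4 playlists over 4 artist codes (0..3).
playlists = [[0, 1, 2], [1, 2], [0, 3], [2, 3, 1]]
n_artists = 4

# Method 1: sparse playlist-by-artist matrix X, then C = X^T X.
rows = [r for r, playlist in enumerate(playlists) for _ in playlist]
cols = [c for playlist in playlists for c in playlist]
X = scipy.sparse.csr_matrix(
    (np.ones(len(rows)), (rows, cols)), shape=(len(playlists), n_artists)
)
C_sparse = (X.T @ X).toarray()
np.fill_diagonal(C_sparse, 0)

# Method 2: explicit loops over every (context, target) pair in a playlist.
C_loop = np.zeros((n_artists, n_artists))
for playlist in playlists:
    for context in playlist:
        for target in playlist:
            C_loop[context, target] += 1
np.fill_diagonal(C_loop, 0)

print(np.array_equal(C_sparse, C_loop))
print(C_loop)
```

Artists 1 and 2 share three of the toy playlists, so the corresponding entry is 3 in both matrices; the matrix product avoids the Python-level triple loop.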