## 4. Competition of **GloVe**s

[Original paper](https://nlp.stanford.edu/pubs/glove.pdf)

[Competition](https://www.kaggle.com/t/b32583e2b6054d6f90f16c9f638ce84d)

GloVe is based on the idea that the learned embeddings should preserve the ratio of co-occurrence probabilities:

\begin{equation}
F\left(w_{i}, w_{j}, \tilde{w}_{k}\right)=\frac{P_{i k}}{P_{j k}}
\end{equation}

where $i$, $j$, $k$ are word indices, $k$ denotes the context word, and $w$ is an embedding from the model (same as Word2Vec). The probability $P_{ik}$ is calculated from the corpus as the probability that word $k$ appears in the context of word $i$. Thus $P_{ik}$ is determined by the text and depends on the context window length, while the embeddings are trainable.
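
The ratio can be made concrete with a toy example. The sketch below uses made-up counts for four words and the paper's row-normalized definition $P_{ik} = X_{ik} / X_i$; the vocabulary and all numbers are assumptions for illustration only.

```python
import numpy as np

# Toy symmetric co-occurrence counts (made-up numbers); rows/columns follow `vocab`.
vocab = ["ice", "steam", "solid", "gas"]
X = np.array([
    [0, 2, 8, 1],
    [2, 0, 1, 8],
    [8, 1, 0, 1],
    [1, 8, 1, 0],
], dtype=float)

# P[i, k] = X[i, k] / X_i, where X_i is the total count in row i.
P = X / X.sum(axis=1, keepdims=True)

i, j = vocab.index("ice"), vocab.index("steam")
ratios = {}
for k_word in ("solid", "gas"):
    k = vocab.index(k_word)
    ratios[k_word] = P[i, k] / P[j, k]
    print(k_word, ratios[k_word])
```

With these counts the ratio is large for a context word related to "ice" but not "steam" (about 8 for "solid") and small for the opposite case (about 1/8 for "gas"), which is the signal the embeddings are asked to reproduce.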


Applying a chain of assumptions about the embeddings to the initial formula gives the final relationship:

\begin{equation}
w_{i}^{T} \tilde{w}_{k}+b_{i}+\tilde{b}_{k}=\log \left(X_{i k}\right)
\end{equation}

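where $X_{ik}$ is the number of times word $k$ occurs in the context of word $i$, and $b_i$, $\tilde{b}_k$ are bias terms. This immediately yields a weighted least-squares objective: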

\begin{equation}
J=\sum_{i, j=1}^{V} f\left(X_{i j}\right)\left(w_{i}^{T} \tilde{w}_{j}+b_{i}+\tilde{b}_{j}-\log X_{i j}\right)^{2}
\end{equation}

where $V$ is the vocabulary size and $f$ is a weighting function that caps the influence of very frequent co-occurrences and down-weights rare ones:

\begin{equation}
f(x)=\left\{\begin{array}{cc}
\left(x / x_{\max }\right)^\alpha & \text { if } x<x_{\max } \\
1 & \text { otherwise }
\end{array}\right.
\end{equation}

where $\alpha$ and $x_{\max}$ are hyperparameters. In the original experiments, $\alpha = 0.75$ and $x_{\max} = 100$.
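
To get a feel for the weighting, here is a minimal NumPy sketch of $f$ evaluated at a few counts. The name `glove_weight` is made up for this illustration; the defaults are the values from the original experiments quoted above.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Vectorized f(x): (x / x_max)^alpha below x_max, 1 at or above it."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

counts = np.array([0, 1, 10, 100, 1000])
weights_f = glove_weight(counts)
for c, w in zip(counts, weights_f):
    print(c, w)
```

A count of zero gets zero weight, so pairs that never co-occur drop out of the sum, and every count at or above $x_{\max}$ gets the same weight of 1.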

### 4.1 Load dataset

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import itertools
from collections import Counter

nltk.download('brown')
nltk.download('stopwords')
stopwords = stopwords.words('english')

[nltk_data] Downloading package brown to /usr/share/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 4.2 Finish the implementation of GloVe

Several parts of the code are missing:

- implementation of the weighting function f
- calculation of the derivatives of the loss function J
- implementation of J
- implementation of AdaGrad
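
The derivatives in the list above follow from differentiating $J$ term by term. A sketch, writing $e_{ij}$ for the residual of a pair (a shorthand introduced here, not taken from the paper):

\begin{equation}
e_{i j}=w_{i}^{T} \tilde{w}_{j}+b_{i}+\tilde{b}_{j}-\log X_{i j}
\end{equation}

\begin{equation}
\frac{\partial J}{\partial w_{i}}=\sum_{j=1}^{V} 2 f\left(X_{i j}\right) e_{i j} \tilde{w}_{j}, \qquad
\frac{\partial J}{\partial \tilde{w}_{j}}=\sum_{i=1}^{V} 2 f\left(X_{i j}\right) e_{i j} w_{i}
\end{equation}

\begin{equation}
\frac{\partial J}{\partial b_{i}}=\sum_{j=1}^{V} 2 f\left(X_{i j}\right) e_{i j}, \qquad
\frac{\partial J}{\partial \tilde{b}_{j}}=\sum_{i=1}^{V} 2 f\left(X_{i j}\right) e_{i j}
\end{equation}

Pairs with $X_{i j}=0$ are skipped in these sums, assuming the usual convention that they carry zero weight, so $\log 0$ is never evaluated.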

**You are free to use other libraries and implementations (for example, in torch or tf), but not pretrained models.**

Clear pseudo-code for AdaGrad can [be found here](https://pytorch.org/docs/stable/generated/torch.optim.Adagrad.html)
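
The AdaGrad update can be sketched in plain NumPy on a toy problem. This is a minimal illustration, assuming the basic form of the linked pseudo-code without learning-rate decay or weight decay; `adagrad_minimize` and the quadratic objective are hypothetical and not part of the assignment.

```python
import numpy as np

def adagrad_minimize(grad_fn, x0, lr=0.5, n_steps=500, eps=1e-10):
    """Minimal AdaGrad loop: each coordinate's step shrinks with its accumulated squared gradients."""
    x = np.array(x0, dtype=float)
    state_sum = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_fn(x)
        state_sum += g ** 2                        # running sum of squared gradients
        x -= lr * g / (np.sqrt(state_sum) + eps)   # per-coordinate scaled step
    return x

# Toy objective (assumption): 0.5 * sum(scale * (x - target)^2), with very uneven curvature.
target = np.array([1.0, -2.0, 3.0])
scale = np.array([1.0, 10.0, 100.0])
x_opt = adagrad_minimize(lambda x: scale * (x - target), x0=np.zeros(3))
print(x_opt)
```

Because each gradient is divided by the root of its own accumulated squares, the three coordinates make comparable progress even though their curvatures differ by two orders of magnitude.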

In [2]:
def get_co_occurence_matrix(tokens, processed_sents, token2int, window_size=200):
    '''
    Calculate co-occurrence of words, i.e. the number of times a token appears in the context of another

    @tokens - vocabulary 
    @processed_sents - sentences from the Brown corpus
    @token2int - dictionary that maps words to their indices
    @window_size - the radius of the context, i.e. the number of words to the left and right of a pivot token
    '''
    
    # Collect the context windows of token_ across all sentences and count the tokens in them
    def get_co_occurences(token_):
        co_occurences = []
        for sent in processed_sents:
            for idx in (np.array(sent)==token_).nonzero()[0]:
                co_occurences.append(sent[max(0, idx-window_size):min(idx+window_size+1, len(sent))])

        co_occurences = list(itertools.chain(*co_occurences))
        co_occurence_idxs = list(map(lambda x: token2int[x], co_occurences))
        co_occurence_dict = Counter(co_occurence_idxs)
        co_occurence_dict = dict(sorted(co_occurence_dict.items()))
        return co_occurence_dict

    co_occurence_matrix = np.zeros(shape=(len(tokens), len(tokens)), dtype='int')
    for token in tokens:
        token_idx = token2int[token]
        co_occurence_dict = get_co_occurences(token)
        co_occurence_matrix[token_idx, list(co_occurence_dict.keys())] = list(co_occurence_dict.values())
        
    # Zeroing the co-occurence of tokens with themselves    
    np.fill_diagonal(co_occurence_matrix, 0)    
    return co_occurence_matrix
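
The counting scheme above can be checked on a toy input. The following is a simplified sketch, not the function used in this notebook: `cooccurrence_counts` and the toy sentences are invented here, and the code is intended to follow the same rule (symmetric window of radius `window_size`, diagonal zeroed).

```python
from collections import Counter
import numpy as np

def cooccurrence_counts(sentences, token2int, window_size=2):
    """Symmetric window co-occurrence counts with a zeroed diagonal."""
    n = len(token2int)
    counts = Counter()
    for sent in sentences:
        ids = [token2int[t] for t in sent]
        for pos, center in enumerate(ids):
            lo = max(0, pos - window_size)
            hi = min(len(ids), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:                  # skip the pivot position itself
                    counts[(center, ids[ctx_pos])] += 1
    X = np.zeros((n, n), dtype=int)
    for (i, j), c in counts.items():
        X[i, j] = c
    np.fill_diagonal(X, 0)                          # repeated words inside a window are ignored
    return X

toy_sents = [["cat", "sat", "mat"], ["cat", "ate", "fish"]]
toy_vocab = sorted({t for s in toy_sents for t in s})
toy_token2int = {t: i for i, t in enumerate(toy_vocab)}
X_toy = cooccurrence_counts(toy_sents, toy_token2int, window_size=1)
print(toy_vocab)
print(X_toy)
```

With `window_size=1`, "cat" co-occurs with "sat" and "ate" but not with "mat", and the matrix is symmetric because every pair is counted once from each side.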

[Numba](https://numba.pydata.org) is critical in this competition because of the computational complexity of the algorithm. This library accelerates Python code by JIT-compiling it and does not require installing additional components, compilers or languages.

In [3]:
!pip install numba



In [4]:
from numba import njit, extending, types

@njit
def f(X_wc, X_max=100, alpha=0.75):
    # Weight of a token in the loss function
    # Write your code. Find the formula in the paper
    # (f(x) = (x/X_max)^alpha if x < X_max, else 1, and 0 if x=0)
    if X_wc == 0:
        return 0.0
    elif X_wc < X_max:
        return (X_wc / X_max) ** alpha
    else:
        return 1.0

@njit
def loss_fn(weights, bias, co_occurence_matrix, n_tokens, X_max, alpha):
    total_loss = 0.0  # float accumulator keeps the type stable under Numba
    for idx_word in range(n_tokens):
        for idx_context in range(n_tokens):
            w_word = weights[idx_word]
            w_context = weights[n_tokens+idx_context]
            b_word = bias[idx_word]
            b_context = bias[n_tokens+idx_context]
            X_wc = co_occurence_matrix[idx_word, idx_context]
            # Write your code. Implement the loss function
            if X_wc > 0:
                total_loss += f(X_wc, X_max, alpha) * (
                    (np.dot(w_word, w_context) + b_word + b_context - np.log(X_wc)) ** 2
                )
    return total_loss

@njit
def gradient(weights, bias, co_occurence_matrix, n_tokens, embedding_size, X_max, alpha):
    dw = np.zeros((2*n_tokens, embedding_size))
    db = np.zeros(2*n_tokens)

    # building word vectors
    for idx_word in range(n_tokens):
        w_word = weights[idx_word]
        b_word = bias[idx_word]

        for idx_context in range(n_tokens):
            w_context = weights[n_tokens+idx_context]
            b_context = bias[n_tokens+idx_context]
            X_wc = co_occurence_matrix[idx_word, idx_context]
            # Derivative over loss function with respect to w_word 
            if X_wc > 0:
                error = (np.dot(w_word, w_context) + b_word + b_context - np.log(X_wc))
                value = 2.0 * f(X_wc, X_max, alpha) * error
                db[idx_word] += value
                dw[idx_word] += value * w_context

    # building context vectors
    for idx_context in range(n_tokens):
        w_context = weights[n_tokens + idx_context]
        b_context = bias[n_tokens + idx_context]

        for idx_word in range(n_tokens):
            w_word = weights[idx_word]
            b_word = bias[idx_word]
            X_wc = co_occurence_matrix[idx_word, idx_context]
            # Derivative over loss function with respect to w_context 
            if X_wc > 0:
                error = (np.dot(w_word, w_context) + b_word + b_context - np.log(X_wc))
                value = 2.0 * f(X_wc, X_max, alpha) * error
                db[n_tokens + idx_context] += value
                dw[n_tokens + idx_context] += value * w_word
    return dw, db
        
@njit
def adagrad(co_occurence_matrix_, n_epochs, lr, alpha):
    # Word and context weights are kept in the same matrix for simplicity.
    # embedding_size is read from the global scope; Numba freezes a global's value at compile time.
    n_tokens = co_occurence_matrix_.shape[0]

    weights = np.random.random((2 * n_tokens, embedding_size))
    bias = np.random.random((2 * n_tokens,))

    state_sum_weights = np.zeros(weights.shape)
    state_sum_bias = np.zeros(bias.shape)    
    # Write your code. Choose an appropriate value for maximum co-occurence 
    # so that too-frequent words do not dominate. (The paper uses 100.)
    X_max = 2
    
    for i in range(n_epochs):
        dw, db = gradient(weights, bias, co_occurence_matrix_, n_tokens, embedding_size, X_max, alpha)

        # Write your code. Finish the implementation of adagrad
        state_sum_weights += dw**2
        state_sum_bias += db**2

        weights -= lr * dw / np.sqrt(state_sum_weights + 1e-8)
        bias   -= lr * db / np.sqrt(state_sum_bias + 1e-8)

        loss = loss_fn(weights, bias, co_occurence_matrix_, n_tokens, X_max, alpha)
        print("Epoch ", i, "| Loss  ", loss)
    return weights, bias
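
One way to gain confidence in hand-written derivatives is a finite-difference check. The sketch below is an illustration under stated assumptions rather than the notebook's implementation: `glove_loss_and_grads` is a hypothetical vectorized NumPy version of the loss and gradients that keeps word and context parameters in separate arrays, and the random inputs are made up.

```python
import numpy as np

def glove_loss_and_grads(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Vectorized GloVe loss J and its gradients; pairs with X == 0 contribute nothing."""
    mask = X > 0
    log_X = np.log(np.where(mask, X, 1.0))            # log(1) = 0 placeholder for empty cells
    f = np.where(X < x_max, (X / x_max) ** alpha, 1.0) * mask
    err = (W @ W_ctx.T + b[:, None] + b_ctx[None, :] - log_X) * mask
    loss = np.sum(f * err ** 2)
    G = 2.0 * f * err                                  # derivative of J w.r.t. each pair's prediction
    return loss, G @ W_ctx, G.T @ W, G.sum(axis=1), G.sum(axis=0)

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.integers(0, 5, size=(n, n)).astype(float)
X = X + X.T                                            # make the toy counts symmetric
np.fill_diagonal(X, 0.0)
W, W_ctx = rng.normal(size=(n, d)), rng.normal(size=(n, d))
b, b_ctx = rng.normal(size=n), rng.normal(size=n)

loss, dW, dW_ctx, db, db_ctx = glove_loss_and_grads(W, W_ctx, b, b_ctx, X, x_max=2.0)

# Central finite difference on a single coordinate of W
h = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (glove_loss_and_grads(W_plus, W_ctx, b, b_ctx, X, x_max=2.0)[0]
           - glove_loss_and_grads(W_minus, W_ctx, b, b_ctx, X, x_max=2.0)[0]) / (2 * h)
print(numeric, dW[1, 2])
```

The two printed numbers should agree closely. The same check can be repeated for a context vector or a bias by perturbing the corresponding array instead.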

Setting hyperparameters...

In [5]:
# The dataset is small, so many epochs are needed
n_epochs = 10
# number of sentences to consider. Please don't reduce it, otherwise some words from the test set might disappear
n_sents = 350
# token embedding size
embedding_size = 5
# learning rate
lr = 0.7
# GloVe weights parameter
alpha = 0.75

In [6]:
import nltk
import itertools
from collections import Counter

brown = nltk.corpus.brown
sents = brown.sents()[:n_sents]

print('Processing sentences..')
processed_sents = []
for sent in sents:
    # Convert to lowercase; keep only alphabetic tokens that are not stopwords
    cleaned_sent = [word.lower() for word in sent if word.isalpha() and word.lower() not in stopwords]
    processed_sents.append(cleaned_sent)

tokens = list(set(itertools.chain(*processed_sents)))   
n_tokens = len(tokens)
print(f'Number of Sentences: {len(sents)}') 
print(f'Number of Tokens: {n_tokens}')

token2int = dict(zip(tokens, range(len(tokens))))
int2token = {v: k for k, v in token2int.items()}


Processing sentences..
Number of Sentences: 350
Number of Tokens: 1820


### 4.3 Training

In [7]:
print('Building co-occurence matrix..')
co_occurence_matrix = get_co_occurence_matrix(tokens, processed_sents, token2int)
print('Co-occurence matrix shape:', co_occurence_matrix.shape)
assert co_occurence_matrix.shape == (n_tokens, n_tokens)

# the co-occurence matrix must be symmetric
assert np.all(co_occurence_matrix.T == co_occurence_matrix)

print('Training word vectors..')

weights, bias = adagrad(co_occurence_matrix, n_epochs, lr, alpha)
# Optional save
# np.save('weights.npy', weights)

Building co-occurence matrix..
Co-occurence matrix shape: (1820, 1820)
Training word vectors..
Epoch  0 | Loss   13129.918563723835
Epoch  1 | Loss   4553.670548879382
Epoch  2 | Loss   2978.4609459361245
Epoch  3 | Loss   2488.667690094923
Epoch  4 | Loss   2278.3930183800903
Epoch  5 | Loss   2158.6890727628465
Epoch  6 | Loss   2075.539493046873
Epoch  7 | Loss   2010.2465488013358
Epoch  8 | Loss   1955.304914085816
Epoch  9 | Loss   1907.0985482966205


Make sure that the model returns reasonable similar words. If it doesn't, consider increasing the corpus size and/or the number of epochs.

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_words(csim, token):
    token_idx = token2int[token]
    closest_words = list(map(lambda x: int2token[x], np.argsort(csim[token_idx])[::-1][:5]))
    return closest_words

# getting cosine similarities between all combinations of word vectors
csim = cosine_similarity(weights[:n_tokens])

# masking diagonal values since they will be most similar
np.fill_diagonal(csim, 0)

token = 'learn'
closest_words = find_similar_words(csim, token)
print(f'Similar words to {token}:', closest_words)

Similar words to learn: ['charge', 'saba', 'impossible', 'ayes', 'request']
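
The original paper reports using the sum of word and context vectors as the final embedding, whereas the cell above uses only the word half of `weights`. A minimal sketch of that variant, using a random stand-in for the trained matrix (`top_k_neighbors`, `weights_demo` and the sizes are hypothetical):

```python
import numpy as np

def top_k_neighbors(vectors, query_idx, k=5):
    """Indices of the k rows most cosine-similar to row `query_idx`, excluding the row itself."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)       # guard against zero vectors
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf                          # never return the query word
    return np.argsort(sims)[::-1][:k]

# Stand-in for a trained matrix of shape (2 * n_tokens, embedding_size):
# word vectors in the first half, context vectors in the second half.
rng = np.random.default_rng(0)
n_tokens_demo, dim_demo = 8, 5
weights_demo = rng.normal(size=(2 * n_tokens_demo, dim_demo))
# Make token 3 point in the same direction as token 0 so the expected neighbor is known.
weights_demo[3] = 2.0 * weights_demo[0]
weights_demo[n_tokens_demo + 3] = 2.0 * weights_demo[n_tokens_demo]

combined = weights_demo[:n_tokens_demo] + weights_demo[n_tokens_demo:]
neighbors = top_k_neighbors(combined, query_idx=0, k=3)
print(neighbors)
```

Whether the summed vectors give better neighbors on a corpus this small is an open question; the sketch only shows how to build and query them.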


### 4.4 Calculate the embedding similarity for given pairs of words

In [9]:
import pandas as pd
import numpy as np
df = pd.read_csv("/kaggle/input/nlp-week-2-glo-ve/test.csv")

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# Similarity between pairs of synonyms 
cos_sims = {}
for ind, row in df.iterrows():
    word1 = row["word 1"]
    word2 = row["word 2"]

    # Out-of-vocabulary words fall back to a zero vector
    try:
        token_idx_1 = token2int[word1]
        word_1_weight = np.expand_dims(weights[token_idx_1], axis=0)
    except KeyError:
        print(f"Word {word1} is not in the vocabulary")
        word_1_weight = np.zeros((1, embedding_size))

    try:
        token_idx_2 = token2int[word2]
        word_2_weight = np.expand_dims(weights[token_idx_2], axis=0)
    except KeyError:
        print(f"Word {word2} is not in the vocabulary")
        word_2_weight = np.zeros((1, embedding_size))

    csim = cosine_similarity(word_1_weight, word_2_weight)
    cos_sims[ind] = csim[0][0]

In [11]:
df_submision = pd.DataFrame(cos_sims.items(), columns=["ID", "sims"])
df_submision.to_csv("submission.csv", index=False)

In [12]:
df_submision['sims'].mean()

0.576218579207498

In [13]:
df_submision['sims'].std()

0.30602015190867704