# GLoVe

In this task we will implement the GLoVe algorithm for generating word embeddings

We will Game of Thrones dialogue from all seasons as our corpus. Each line consists of a dialogue spoken by a character in a scene.

1) Given the corpus, define a function that removes all the punctuations and stop words from the text. (3 points)

In [51]:
import nltk
#nltk.download()
from nltk.corpus import stopwords
from nltk import sent_tokenize
import string
from gensim import utils
import numpy as np
import tqdm

In [37]:
def normalize_text_ex(corpus):
    s_words = stopwords.words('english')
    puncts = string.punctuation

    n_sentences = []

    for sent in corpus:
        n_sent = [w.lower() for w in sent if w not in puncts and w.lower() not in s_words]
        n_sentences.append(n_sent)

    return n_sentences


In [44]:
def normalize_text(corpus):
    s_words = stopwords.words('english')
    puncts = string.punctuation + "â€¦" # special '...' character contained in the csv file

    sents = sent_tokenize(corpus)
    for i in range(len(sents)): # for one line in the corpus, containing multiple sentences
        sents[i] = sents[i].translate(str.maketrans('','',puncts))
        split_sent = [word.lower() for word in sents[i].split() if word not in s_words]
        sents[i] = ' '.join(split_sent)
    return sents

In [74]:
class MyCorpus:
    def __iter__(self):
        for line in open('GOT_dialogues.csv', encoding="utf8"):               
            yield line

In [75]:
corpus = MyCorpus()
n_sentences = []
for line in corpus:
    n_sentences.extend(normalize_text(line))
print(n_sentences)



2) From normalized sentences obtained in the previous step, create word-word frequency matrix with all the unique words. You will also need to create a word2index mapping (3 points)

In [76]:
word2index = {}
unique_tokens = set()
for sent in n_sentences:
    words = sent.split(' ')
    unique_tokens.update(words)
i = 0
for w in unique_tokens:
    word2index[w] = i
    i += 1

def generate_frequency_matrix(corpus: list[str], window_size=3):
    freq_mat = np.zeros((len(word2index), len(word2index)), dtype=np.float32)
    for sent in corpus: # iterate over the normalized sentences
        words_in_sent = sent.split(' ')
        for i in range(len(words_in_sent)): # iterate over each word in the sentence being the centre word
            w_i = word2index[words_in_sent[i]] # row index of centre word in freq_mat
            for j in range(1, window_size+1):
                if i-j >= 0: # to its left
                    w_j = word2index[words_in_sent[i-j]]
                    freq_mat[w_i][w_j]+=1
                if i+j < len(words_in_sent): # to its right
                    w_j = word2index[words_in_sent[i+j]]
                    freq_mat[w_i][w_j]+=1
    return freq_mat

freq_mat = generate_frequency_matrix(n_sentences)

3) Define weighting function used in GLoVe. (4 points)

$f(x) = (\frac{x}{x_{max}})^\alpha$ if $x < x_{max}$, 1 otherwise

In [77]:
def weighting_func(x, x_max=100):
    # write your code snippet here
    alpha = .75 # see https://aclanthology.org/D14-1162.pdf, p.4
    return (x/x_max)**alpha if x < x_max else 1

4) Create the Glove model class using pytorch. (10 points)
   
   Hints: 
   1. The forward pass will compute $W_i^T \hat{W_j} + b_i + \hat{b}_j$
   2. $W, \hat{W}, b_i, \hat{b}_j$ will be the parameters
    

In [78]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [102]:
class Glove(nn.Module):
    def __init__(self, v_size, e_size, x_max=100):
        super().__init__()
        self.w = nn.Parameter(torch.rand(v_size, e_size))
        self.w_hat = nn.Parameter(torch.rand(v_size, e_size))

        self.b = nn.Parameter(torch.rand(v_size))
        self.b_hat = nn.Parameter(torch.rand(v_size))

        self.weighting_func = lambda x : weighting_func(x, x_max)
    
    def forward(self, i, j, x):
        out = torch.mul(torch.transpose(self.w[i]), self.w_hat[j])
        out = (out + self.b[i] + self.b_hat[j] - np.log(freq_mat[i][j]))**2
        out = torch.mul(self.weighting_func(x), out)
        return out

In [103]:
model = Glove(len(unique_tokens), e_size=4)

5) Write a function which trains the model on the frequency matrix. Ignore the 0 entries in the matrix. (15 points)

In [88]:
# Since many entries in the matrix would be 0, it makes sense to explicitly keep track of the positive entries and iterate
# over them rather than writing a nested for loop...
# You can wrap this entries in a torch Dataset class
####################### optional ####################
class GOT_data(Dataset):
    def __init__(self):
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self):
        pass

#####################################################

# Adopt your code to incorporate mini-batch training
def train(model, data, epochs=5, learning_rate=0.001):
    # write your code snippet here
    
    pass

6) Write a function to generate embedding of a given word. Note that the embeddings of a word ($i$) would be $W_i + \hat{W}_i$ (5 points) 

In [104]:
def generate_embedding(model, word):
    return model.w[word2index[word]] + model.w_hat[word2index[word]]

In [105]:
generate_embedding(model, "what")

torch.Size([10088, 4])
torch.Size([10088, 4])


# Intrinsic evaluation of embeddings

(Slide 47, lecture_4)
Word similarity task is often used as an intrinsic evaluation criteria. In the dataset file you will find a list of word pairs with their similarity scores as judged by humans. The task would be to judge how well are the word vectors aligned to human judgement. We will use word2vec embedding vectors trained on the google news corpus. (Ignore the pairs where at least one the words is absent in the corpus)

In [116]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

7) Write a function which takes as input two words and computes the cosine similarity between them. (3 points)

In [117]:
def similarity(word1, word2):
    pass

8) Compute the similarity between all the word pairs in the list and sort them based on the similarity scores. (3 points)

9) Sort the word pairs in the list based on the human judgement scores. (2 points)

10) Compute spearman rank correlation between the two ranked lists obtained in the previous two steps. (2 points)

# Word embedding based classifier

We will design a simple sentiment classifier based on the pre-trained word embeddings (google news).

Each data point is a movie review and the sentiment could be either positive (1) or negative (0)

In [118]:
import pickle

In [119]:
with open('sentiment_test_X.p', 'rb') as fs:
    test_X = pickle.load(fs)

In [120]:
len(test_X)

1821

In [121]:
with open('sentiment_test_y.p', 'rb') as fs:
    test_y = pickle.load(fs)

In [122]:
len(test_y)

1821

In [123]:
test_X[0]

['If',
 'you',
 'sometimes',
 'like',
 'to',
 'go',
 'to',
 'the',
 'movies',
 'to',
 'have',
 'fun',
 ',',
 'Wasabi',
 'is',
 'a',
 'good',
 'place',
 'to',
 'start',
 '.']

In [124]:
test_y[0]

1

In [125]:
with open('sentiment_train_X.p', 'rb') as fs:
    train_X = pickle.load(fs)
with open('sentiment_train_y.p', 'rb') as fs:
    train_y = pickle.load(fs)
with open('sentiment_val_X.p', 'rb') as fs:
    val_X = pickle.load(fs)
with open('sentiment_val_y.p', 'rb') as fs:
    val_y = pickle.load(fs)        

11) Given a review, compute its embedding by averaging over the embedding of its constituent words. Define a function which given a review as a list of words, generates its embeddings by averaging over the constituent word embeddings. (5 points)

In [126]:
def generate_embedding(review):
    # return embedding
    pass

12) Create a feed-forward network class with pytorch. (Hyperparamter choice such as number of layers, hidden size is left to you) (10 points)

In [127]:
class Classifier(nn.Module):
    pass

13) Create a Dataset class for efficiently enumerating over the dataset. (5 points)

In [128]:
class sent_data(Dataset):
    def __init__(self):
        pass
    
    def __len__(self):
        pass
    
    def __getitem__(self):
        pass

14) Write a train function to train model. At the end of each epoch compute the validation accuracy and save the model with the best validation accuracy. (15 points)

In [129]:
# Adopt your code to incorporate mini-batch training
# Use cross-entropy as your loss function
def train(model, train_data, val_data, epochs=5, learning_rate=0.001):
    # write your code snippet here
    
    pass

15) Evaluate the trained model on the test set and report the test accuracy. (5 points)

In [130]:
def evaluate(model, test_data):
    pass