# Task 1: Word Embeddings (10 points)

This notebook will guide you through all steps necessary to train a word2vec model (Detailed description in the PDF).

## Imports

This code block is reserved for your imports. 

You are free to use the following packages: 

(List of packages)

In [183]:
# Imports
import string
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext import data as dt
from torchtext import datasets

# 1.1 Get the data (0.5 points)

The Hindi portion HASOC corpus from [github.io](https://hasocfire.github.io/hasoc/2019/dataset.html) is already available in the repo, at data/hindi_hatespeech.tsv . Load it into a data structure of your choice. Then, split off a small part of the corpus as a development set (~100 data points).

If you are using Colab the first two lines will let you upload folders or files from your local file system.

In [269]:
# 0 for whole dataset, 1 for development set
dev = 1

text_id = dt.Field()
text = dt.Field()
task_1 = dt.Field()
task_2 = dt.Field()
task_3 = dt.Field()

columns = [
    ('id', text_id),
    ('content', text),
    ('t_1', task_1),
    (None, None),
    (None, None)
]

data, v, t = dt.TabularDataset.splits(
    path = 'data',
    train = 'hindi_hatespeech.tsv',
    validation = 'hindi_hatespeech.tsv',
    test = 'hindi_hatespeech.tsv',
    format = 'tsv',
    fields = columns,
    skip_header = True
)

# Take the first 100 datapoints of the corpus into the development set
dev_set = data[0:100]

## 1.2 Data preparation (0.5 + 0.5 points)

* Prepare the data by removing everything that does not contain information. 
User names (starting with '@') and punctuation symbols clearly do not convey information, but we also want to get rid of so-called [stopwords](https://en.wikipedia.org/wiki/Stop_word), i. e. words that have little to no semantic content (and, but, yes, the...). Hindi stopwords can be found [here](https://github.com/stopwords-iso/stopwords-hi/blob/master/stopwords-hi.txt) Then, standardize the spelling by lowercasing all words.
Do this for the development section of the corpus for now.

* What about hashtags (starting with '#') and emojis? Should they be removed too? Justify your answer in the report, and explain how you accounted for this in your implementation.

+ Decided not to exclude hashtags, since they do definitely contain information that could be able to help deciding if the post contains hatespeech.
+ Didn't exclude emojis as well, since they are perfect to capture emotions, especially anger or disappointment 

In [270]:
# Load the linked stopwords file into a list
with open('data/stopwords-hi.txt') as f:
    stopwords = [line.rstrip() for line in f]
if dev == 0:
    dev_set = data
def prepare(data_set):
    for line in data_set:
        text = vars(line)["content"]
        remove_list = []
        for i in range(len(text)):
            text[i] = text[i].lower()
            word = text[i]
            # find user names
            if (len(word) > 0 and word[0] == '@' or len(word) > 7 and word[0:5] == 'https'):
                remove_list.append(word)
            # find stopwords
            if (word in stopwords):
                remove_list.append(word)
        # finally remove the names and stopwords
        for word in remove_list:
            text.remove(word)
        
    for line in data_set:
        text = vars(line)["content"]
        remove_list = []
        for i in range(len(text)):
            # delete punctuation symbols # TODO somehow not deleting the characters
            for p_symbol in string.punctuation:
                if p_symbol not in ['#', '@']:
                    text[i] = text[i].replace(p_symbol, '')
            if text[i] == '':
                remove_list.append('')
        for word in remove_list:
            text.remove(word)
if dev == 1:
    prepare(dev_set)
if dev == 0:
    prepare(data)
# TODO LOWERCASE EVERYTHING
contents = [vars(line)["content"] for line in dev_set]
if dev == 0:
    contents = [vars(line)["content"] for line in data]
for line in contents:
    if len(line) < 0:
        contents.remove(line)
print(vars(dev_set[6])["content"])
for i in range(50):
    print(vars(dev_set[i])["content"])
c = 0
# remove empty sentences
contents = list(filter(None, contents))
for sentence in contents:
    if len(sentence) < 1:
        c += 1
print(c,type(contents))

['ahmeds', 'dad', 'beta', 'aaj', 'teri', 'mammy', 'kyu', 'nahi', 'baat', 'kr', 'rhi', 'h', 'ahmed']
['बांग्लादेश', 'शानदार', 'वापसी', 'भारत', '314', 'रन', 'रोका', '#indvban', '#cwc19']
['सब', 'रंडी', 'नाच', 'देखने', 'व्यस्त', '#शांतीदूत', 'होगा', 'सब', '#रंडीरोना', 'शुरू', 'देंगे']
['तुम', 'हरामियों', 'बस', 'जूतों', 'कमी', 'शुक्र', 'तुम्हारी', 'लिंचिंग', 'हिंदुओं', 'जागने', 'देर', 'सच', 'होगी', 'तुम', 'हरामी', 'सुवर', 'ड्रामा', 'बनाएं', 'सुवर', 'कहीं', 'मौलाना।', 'तुम', 'हरामियों', 'कुत्ते', 'मौत', 'मारना', 'चाहिए', 'सुवर', 'जैसी', 'शक्ल', 'रंडी', 'औलाद', 'सुवर', 'कहीं', '।।।।']
['बीजेपी', 'mla', 'आकाश', 'विजयवर्गीय', 'जेल', 'रिहा', 'जमानत', 'मिलने', 'खुशी', 'समर्थक', 'इंदौर', 'हर्ष', 'फायरिंग', '#akashvijayvargiya', '…']
['चमकी', 'बुखार', 'विधानसभा', 'परिसर', 'आरजेडी', 'प्रदर्शन', 'तेजस्वी', 'यादव', 'नदारद', '#biharencephalitisdeaths', '…', 'रिपोर्ट']
['मुंबई', 'बारिश', 'लोगों', 'काफी', 'समस्या', 'रही']
['ahmeds', 'dad', 'beta', 'aaj', 'teri', 'mammy', 'kyu', 'nahi', 'baat', 'kr', 'rh

## 1.3 Build the vocabulary (0.5 + 0.5 points)

The input to the first layer of word2vec is an one-hot encoding of the current word. The output of the model is then compared to a numeric class label of the words within the size of the skip-gram window. Now

* Compile a list of all words in the development section of your corpus and save it in a variable ```V```.

In [271]:
V = set()
for line in dev_set:
    for word in vars(line)["content"]:
        V.add(word)
       
V_corpus = set()
for line in data:
    for word in vars(line)["content"]:
        V_corpus.add(word)
        
V = list(V)
V_corpus = list(V_corpus)

* Then, write a function ```word_to_one_hot``` that returns a one-hot encoding of an arbitrary word in the vocabulary. The size of the one-hot encoding should be ```len(v)```.

In [272]:
def word_to_one_hot_dev(word, dev):
    res = []
    if dev == 0:
        for i in range(len(V_corpus)):
            if (V[i] == word):
                res.append(1)
            else:
                res.append(0)
    else:
        for i in range(len(V)):
            if (V[i] == word):
                res.append(1)
            else:
                res.append(0)
    return res
def word_to_one_hot(word):
    return torch.Tensor(word_to_one_hot_dev(word, dev)).to(torch.int64)

## 1.4 Subsampling (0.5 points)

The probability to keep a word in a context is given by:

$P_{keep}(w_i) = \Big(\sqrt{\frac{z(w_i)}{0.001}}+1\Big) \cdot \frac{0.001}{z(w_i)}$

Where $z(w_i)$ is the relative frequency of the word $w_i$ in the corpus. Now,
* Calculate word frequencies
* Define a function ```sampling_prob``` that takes a word (string) as input and returns the probabiliy to **keep** the word in a context.

In [273]:
frequencies = {}
word_count = 0
if dev == 0:
    for i in V_corpus:
        frequencies[i] = 0


    for line in data:
        for word in vars(line)["content"]:
            frequencies[word] += 1
            word_count += 1

    for key in frequencies:
        frequencies[key] = frequencies[key]/word_count
else:
    for i in V:
        frequencies[i] = 0


    for line in dev_set:
        for word in vars(line)["content"]:
            frequencies[word] += 1
            word_count += 1

    for key in frequencies:
        frequencies[key] = frequencies[key]/word_count
    
def sampling_prob(word):
    return (math.sqrt(frequencies[word] * 1000) + 1) * (0.001/frequencies[word])

# 1.5 Skip-Grams (1 point)

Now that you have the vocabulary and one-hot encodings at hand, you can start to do the actual work. The skip gram model requires training data of the shape ```(current_word, context)```, with ```context``` being the words before and/or after ```current_word``` within ```window_size```. 

* Have closer look on the original paper. If you feel to understand how skip-gram works, implement a function ```get_target_context``` that takes a sentence as input and [yield](https://docs.python.org/3.9/reference/simple_stmts.html#the-yield-statement)s a ```(current_word, context)```.

* Use your ```sampling_prob``` function to drop words from contexts as you sample them. 

In [274]:
window_size = 10
def get_target_context(sentence):
	# get a random word (its position in the sentence)
    current_index = random.randint(0, len(sentence) - 1)
    context = []
    min_index = 0 if current_index < window_size else current_index - window_size
    max_index = len(sentence)-1 if current_index + window_size > len(sentence) else current_index + window_size
    context = sentence[min_index:current_index] + sentence[current_index+1:max_index+1]
    current_word = sentence[current_index]
    sampled_out = []
    for word in context:
        p_keep = sampling_prob(word)
        if (random.random() > p_keep):
            sampled_out.append(word)
    for word in sampled_out:
        context.remove(word)
    valid = 1
    if (type(current_word) is not str or type(current_word) is tuple):
        valid = 0
    for w in context:
        if (type(w) is not str or type(w) is tuple):
            valid = 0
    if valid == 0:
        sys.exit(0)
    #print(type(current_word))
    return current_word, context

# 1.6 Hyperparameters (0.5 points)

According to the word2vec paper, what would be a good choice for the following hyperparameters? 

* Embedding dimension
* Window size

Initialize them in a dictionary or as independent variables in the code block below. 

In [275]:
# Set hyperparameters
window_size = 10
embedding_size = 300

# More hyperparameters
learning_rate = 0.01
epochs = 100

# 1.7 Pytorch Module (0.5 + 0.5 + 0.5 points)

Pytorch provides a wrapper for your fancy and super-complex models: [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The code block below contains a skeleton for such a wrapper. Now,

* Initialize the two weight matrices of word2vec as fields of the class.

* Override the ```forward``` method of this class. It should take a one-hot encoding as input, perform the matrix multiplications, and finally apply a log softmax on the output layer.

* Initialize the model and save its weights in a variable. The Pytorch documentation will tell you how to do that.

In [276]:
# Create model 

class Word2Vec(nn.Module):
  def __init__(self):
    super().__init__()
    self.window_size = window_size
    self.embedding_size = embedding_size
    if dev == 0:
        self.inl = nn.Embedding(len(V_corpus), self.embedding_size)
        self.out = nn.Linear(self.embedding_size, len(V_corpus))
    else:
        self.inl = nn.Embedding(len(V), self.embedding_size)
        self.out = nn.Linear(self.embedding_size, len(V))
    


  def forward(self, one_hot):
    res = self.out(self.inl(one_hot)) 
    #return F.softmax(torch.Tensor(res))
    return torch.Tensor(res)

model = Word2Vec()
path = 'task1.pt'
torch.save(model.state_dict(), path)

# 1.8 Loss function and optimizer (0.5 points)

Initialize variables with [optimizer](https://pytorch.org/docs/stable/optim.html#module-torch.optim) and loss function. You can take what is used in the word2vec paper, but you can use alternative optimizers/loss functions if you explain your choice in the report.

In [277]:
# Define optimizer and loss

# [2] says they use Adagrad
optimizer = torch.optim.Adagrad(model.parameters(), lr=learning_rate)

# [1, 2] say nothing (?) , use negative log likelihood (Stanford video) ? -> weird af
criterion = nn.NLLLoss()
criterion = nn.CrossEntropyLoss()

# 1.9 Training the model (3 points)

As everything is prepared, implement a training loop that performs several passes of the data set through the model. You are free to do this as you please, but your code should:

* Load the weights saved in 1.6 at the start of every execution of the code block
* Print the accumulated loss at least after every epoch (the accumulate loss should be reset after every epoch)
* Define a criterion for the training procedure to terminate if a certain loss value is reached. You can find the threshold by observing the loss for the development set.

You can play around with the number of epochs and the learning rate.

In [278]:
# load initial weights
model.load_state_dict(torch.load(path))

def collate_fn(batch):
    return tuple(zip(*batch))


def sampled(contents):
    res = []
    for sentence in contents:
            res.append(get_target_context(sentence))
    return res

# Define train procedure
def train():
    for epoch in range(epochs):
        train_set = sampled(contents)
        train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, collate_fn=collate_fn)
        accu_loss = 0
        i = 0
        model.train()
        for center, context in train_loader:
            i += 1
            if type(center) is tuple:
                center = center[0]
            if type(context[0]) is tuple:
                context = [tup[0] for tup in context]
            optimizer.zero_grad()
            word = word_to_one_hot(center)
            prediction = model(word)
            expectation = word_to_one_hot(context[0])
            for i in range(1, len(context)):
                expectation = torch.add(expectation, word_to_one_hot(context[i]))
            loss = criterion(prediction, expectation)
            loss.backward()
            optimizer.step()
            accu_loss += loss.item()
        print('Epoch ', epoch, " - loss : ", accu_loss/i)

 
print("Training started")

train()

print("Training finished")

Training started
Epoch  0  - loss :  0.07494012273923316
Epoch  1  - loss :  0.025600890920619773
Epoch  2  - loss :  0.0033748140840819387
Epoch  3  - loss :  0.00136854477000959
Epoch  4  - loss :  0.0009109845215624028
Epoch  5  - loss :  0.0006944758422446973
Epoch  6  - loss :  0.0005664099195990899
Epoch  7  - loss :  0.00048123966112281335
Epoch  8  - loss :  0.00042026798532466696
Epoch  9  - loss :  0.00037435844841629567
Epoch  10  - loss :  0.0003384657655701493
Epoch  11  - loss :  0.00030958336411100447
Epoch  12  - loss :  0.0002858231991830498
Epoch  13  - loss :  0.0002658975244772555
Epoch  14  - loss :  0.0002489388365336139
Epoch  15  - loss :  0.00023429618790896253
Epoch  16  - loss :  0.0002215337948967712
Epoch  17  - loss :  0.000210289532939593
Epoch  18  - loss :  0.00020029860539267762
Epoch  19  - loss :  0.00019136257469654083
Epoch  20  - loss :  0.00018331547728692643
Epoch  21  - loss :  0.0001760170315251206
Epoch  22  - loss :  0.0001693744438164162
Ep

# 1.10 Train on the full dataset (0.5 points)

Now, go back to 1.1 and remove the restriction on the number of sentences in your corpus. Then, reexecute code blocks 1.2, 1.3 and 1.6 (or those relevant if you created additional ones). 

* Then, retrain your model on the complete dataset.

* Now, the input weights of the model contain the desired word embeddings! Save them together with the corresponding vocabulary items (Pytorch provides a nice [functionality](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for this).