# Task 1: Word Embeddings (10 points)

This notebook will guide you through all steps necessary to train a word2vec model (Detailed description in the PDF).

## Imports

This code block is reserved for your imports. 

You are free to use the following packages: 

(List of packages)

In [1]:
# Imports
import re
import string
import json
from datetime import datetime
from collections import defaultdict, Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.nn import Module
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split

from nltk.corpus import stopwords

device = 'cuda'

import random

random.seed(26)
np.random.seed(62)
torch.manual_seed(2021)

<torch._C.Generator at 0x7f76ec823f70>

# 1.1 Get the data (0.5 points)

The Hindi portion HASOC corpus from [github.io](https://hasocfire.github.io/hasoc/2019/dataset.html) is already available in the repo, at data/hindi_hatespeech.tsv . Load it into a data structure of your choice. Then, split off a small part of the corpus as a development set (~100 data points).

If you are using Colab the first two lines will let you upload folders or files from your local file system.

In [2]:
data = pd.read_csv('../data/hindi_hatespeech.tsv', sep='\t')

# uncomment the line below to use a small sample data
# data = data.sample(100, replace=False).reset_index(drop=True)

sentences = data['text'].to_numpy()

### Print out data/statistics

In [3]:
print('Number of sentences:', len(sentences))
print('sentence examples:', sentences[:3])

Number of sentences: 4665
sentence examples: ['बांग्लादेश की शानदार वापसी, भारत को 314 रन पर रोका #INDvBAN #CWC19'
 'सब रंडी नाच देखने मे व्यस्त जैसे ही कोई #शांतीदूत के साथ कुछ होगा सब #रंडीरोना शुरू कर देंगे   '
 'तुम जैसे हरामियों के लिए बस जूतों की कमी है शुक्र कर अभी तुम्हारी लिंचिंग हुई नहीं है हिंदुओं के जागने की देर है सच में होगी अभी तो तुम जैसे हरामी सुवर ड्रामा बनाएं हो   सुवर कहीं का मौलाना।   तुम जैसे हरामियों कुत्ते की मौत मारना चाहिए सुवर जैसी शक्ल  रंडी की औलाद सुवर कहीं का ।।।।']


## 1.2 Data preparation (0.5 + 0.5 points)

* Prepare the data by removing everything that does not contain information. 
User names (starting with '@') and punctuation symbols clearly do not convey information, but we also want to get rid of so-called [stopwords](https://en.wikipedia.org/wiki/Stop_word), i. e. words that have little to no semantic content (and, but, yes, the...). Hindi stopwords can be found [here](https://github.com/stopwords-iso/stopwords-hi/blob/master/stopwords-hi.txt) Then, standardize the spelling by lowercasing all words.
Do this for the development section of the corpus for now.

* What about hashtags (starting with '#') and emojis? Should they be removed too? Justify your answer in the report, and explain how you accounted for this in your implementation.

In [4]:
# remove user taggings
user_tag_pattern = re.compile(r'\@\w*')
sentences = [re.sub(user_tag_pattern, ' ', sentence) for sentence in sentences]

In [5]:
# remove punctuations
punctuation = string.punctuation[:2] + string.punctuation[3:] # all punctuation without "#"
translator = str.maketrans(punctuation, ' '*len(punctuation))
def remove_punc(s):
    s = s.translate(translator)
    return s

sentences = [remove_punc(sentence) for sentence in sentences]

In [6]:
# lower case
sentences = [sentence.lower() for sentence in sentences]

In [7]:
# remove stopwords
stopwords = ['अंदर', 'अत', 'अदि', 'अप', 'अपना', 'अपनि', 'अपनी', 'अपने', 'अभि', 'अभी', 'आदि', 
             'आप', 'इंहिं', 'इंहें', 'इंहों', 'इतयादि', 'इत्यादि', 'इन', 'इनका', 'इन्हीं', 'इन्हें', 'इन्हों', 
             'इस', 'इसका', 'इसकि', 'इसकी', 'इसके', 'इसमें', 'इसि', 'इसी', 'इसे', 'उंहिं', 'उंहें', 
             'उंहों', 'उन', 'उनका', 'उनकि', 'उनकी', 'उनके', 'उनको', 'उन्हीं', 'उन्हें', 'उन्हों', 'उस', 
             'उसके', 'उसि', 'उसी', 'उसे', 'एक', 'एवं', 'एस', 'एसे', 'ऐसे', 'ओर', 'और', 'कइ', 
             'कई', 'कर', 'करता', 'करते', 'करना', 'करने', 'करें', 'कहते', 'कहा', 'का', 'काफि', 
             'काफ़ी', 'कि', 'किंहें', 'किंहों', 'कितना', 'किन्हें', 'किन्हों', 'किया', 'किर', 'किस', 
             'किसि', 'किसी', 'किसे', 'की', 'कुछ', 'कुल', 'के', 'को', 'कोइ', 'कोई', 'कोन', 
             'कोनसा', 'कौन', 'कौनसा', 'गया', 'घर', 'जब', 'जहाँ', 'जहां', 'जा', 'जिंहें', 'जिंहों', 
             'जितना', 'जिधर', 'जिन', 'जिन्हें', 'जिन्हों', 'जिस', 'जिसे', 'जीधर', 'जेसा', 'जेसे', 
             'जैसा', 'जैसे', 'जो', 'तक', 'तब', 'तरह', 'तिंहें', 'तिंहों', 'तिन', 'तिन्हें', 'तिन्हों', 
             'तिस', 'तिसे', 'तो', 'था', 'थि', 'थी', 'थे', 'दबारा', 'दवारा', 'दिया', 'दुसरा', 'दुसरे', 
             'दूसरे', 'दो', 'द्वारा', 'न', 'नहिं', 'नहीं', 'ना', 'निचे', 'निहायत', 'नीचे', 'ने', 'पर', 
             'पहले', 'पुरा', 'पूरा', 'पे', 'फिर', 'बनि', 'बनी', 'बहि', 'बही', 'बहुत', 'बाद', 'बाला', 
             'बिलकुल', 'भि', 'भितर', 'भी', 'भीतर', 'मगर', 'मानो', 'मे', 'में', 'यदि', 'यह', 'यहाँ', 
             'यहां', 'यहि', 'यही', 'या', 'यिह', 'ये', 'रखें', 'रवासा', 'रहा', 'रहे', 'ऱ्वासा', 'लिए', 
             'लिये', 'लेकिन', 'व', 'वगेरह', 'वरग', 'वर्ग', 'वह', 'वहाँ', 'वहां', 'वहिं', 'वहीं', 'वाले', 
             'वुह', 'वे', 'वग़ैरह', 'संग', 'सकता', 'सकते', 'सबसे', 'सभि', 'सभी', 'साथ', 'साबुत', 
             'साभ', 'सारा', 'से', 'सो', 'हि', 'ही', 'हुअ', 'हुआ', 'हुइ', 'हुई', 'हुए', 'हे', 'हें', 
             'है', 'हैं', 'हो', 'होता', 'होति', 'होती', 'होते', 'होना', 'होने']

sentences = [[word for word in sentence.split() if word not in stopwords] for sentence in sentences]

### Print out data/statistics

In [8]:
print('pre-processing is finished')
for sentence in sentences[:10]:
    print(sentence)

pre-processing is finished
['बांग्लादेश', 'शानदार', 'वापसी', 'भारत', '314', 'रन', 'रोका', '#indvban', '#cwc19']
['सब', 'रंडी', 'नाच', 'देखने', 'व्यस्त', '#शांतीदूत', 'होगा', 'सब', '#रंडीरोना', 'शुरू', 'देंगे']
['तुम', 'हरामियों', 'बस', 'जूतों', 'कमी', 'शुक्र', 'तुम्हारी', 'लिंचिंग', 'हिंदुओं', 'जागने', 'देर', 'सच', 'होगी', 'तुम', 'हरामी', 'सुवर', 'ड्रामा', 'बनाएं', 'सुवर', 'कहीं', 'मौलाना।', 'तुम', 'हरामियों', 'कुत्ते', 'मौत', 'मारना', 'चाहिए', 'सुवर', 'जैसी', 'शक्ल', 'रंडी', 'औलाद', 'सुवर', 'कहीं', '।।।।']
['बीजेपी', 'mla', 'आकाश', 'विजयवर्गीय', 'जेल', 'रिहा', 'जमानत', 'मिलने', 'खुशी', 'समर्थक', 'इंदौर', 'हर्ष', 'फायरिंग', '#akashvijayvargiya', 'https', 'abpnews', 'abplive', 'in', 'india', 'news', 'celebratory', 'firing', 'outside', 'bjp', 'mla', 'akash', 'vijayvargiya', 'office', 'in', 'indore', '1157241', '…']
['चमकी', 'बुखार', 'विधानसभा', 'परिसर', 'आरजेडी', 'प्रदर्शन', 'तेजस्वी', 'यादव', 'नदारद', '#biharencephalitisdeaths', 'https', 'abpnews', 'abplive', 'in', 'bihar', 'news', 'aes

## 1.3 Build the vocabulary (0.5 + 0.5 points)

The input to the first layer of word2vec is an one-hot encoding of the current word. The output od the model is then compared to a numeric class label of the words within the size of the skip-gram window. Now

* Compile a list of all words in the development section of your corpus and save it in a variable ```V```.

In [9]:
flattened_words = [word for sentence in sentences for word in sentence]
V = list(set(flattened_words))

* Then, write a function ```word_to_one_hot``` that returns a one-hot encoding of an arbitrary word in the vocabulary. The size of the one-hot encoding should be ```len(v)```.

In [10]:
vocab_size = len(V)
word_to_int = {}
int_to_word = {}
for i, word in enumerate(V):
    word_to_int[word] = i
    int_to_word[i] = word

def word_to_one_hot(word):
    word_id = word_to_int[word]
    one_hot_vector = torch.zeros(vocab_size, dtype=torch.long)
    one_hot_vector[word_id] = 1
    return one_hot_vector

### Print out data/statistics

In [11]:
print(f'vocab_size: {vocab_size}')

vocab_size: 20402


## 1.4 Subsampling (0.5 points)

The probability to keep a word in a context is given by:

$P_{keep}(w_i) = \Big(\sqrt{\frac{z(w_i)}{0.001}}+1\Big) \cdot \frac{0.001}{z(w_i)}$

Where $z(w_i)$ is the relative frequency of the word $w_i$ in the corpus. Now,
* Calculate word frequencies
* Define a function ```sampling_prob``` that takes a word (string) as input and returns the probabiliy to **keep** the word in a context.

In [12]:
word_counter = Counter(flattened_words)
def sampling_prob(word):
    z = word_counter[word] / len(flattened_words)
    p_keep = ((z/0.000001)**0.5 + 1) * (0.000001/z)
    return p_keep

# 1.5 Skip-Grams (1 point)

Now that you have the vocabulary and one-hot encodings at hand, you can start to do the actual work. The skip gram model requires training data of the shape ```(current_word, context)```, with ```context``` being the words before and/or after ```current_word``` within ```window_size```. 

* Have closer look on the original paper. If you feel to understand how skip-gram works, implement a function ```get_target_context``` that takes a sentence as input and [yield](https://docs.python.org/3.9/reference/simple_stmts.html#the-yield-statement)s a ```(current_word, context)```.

* Use your ```sampling_prob``` function to drop words from contexts as you sample them. 

In [13]:
def get_target_context(sentence: list(str())):
    for i, word in enumerate(sentence):
        word_one_hot = word_to_one_hot(word)
        for j, context_word in enumerate(sentence[i-window_size:i+window_size+1]):
            if j != i and random.random() < sampling_prob(context_word):
                yield (torch.tensor(word_to_int[word], dtype=torch.long).unsqueeze(0), 
                       torch.tensor(word_to_int[context_word], dtype=torch.long).unsqueeze(0))

# 1.6 Hyperparameters (0.5 points)

According to the word2vec paper, what would be a good choice for the following hyperparameters? 

* Embedding dimension
* Window size

Initialize them in a dictionary or as independent variables in the code block below. 

In [14]:
# Set hyperparameters
window_size = 10 # in the first paper, the authors use 10.
embedding_size = 300 # the first paper doesn't state the best dimension size, they try several options. In the linked code, they use embedding_size = 300.

# More hyperparameters
learning_rate = 0.01
batch_size = 256
epochs = 100

# 1.7 Pytorch Module (0.5 + 0.5 + 0.5 points)

Pytorch provides a wrapper for your fancy and super-complex models: [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The code block below contains a skeleton for such a wrapper. Now,

* Initialize the two weight matrices of word2vec as fields of the class.

* Override the ```forward``` method of this class. It should take a one-hot encoding as input, perform the matrix multiplications, and finally apply a log softmax on the output layer.

* Initialize the model and save its weights in a variable. The Pytorch documentation will tell you how to do that.

In [15]:
# Create model 
class Word2Vec(Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # use the Embedding provided by torch instead of the above word_to_one_hot() function.
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.fc = nn.Linear(embedding_size, vocab_size)

    def forward(self, word_id):
        out = self.embed(word_id)
        out = self.fc(out)
        return out.squeeze(1)
    
    def to_embed(self, word_id):
        return self.embed(word_id)
    
word2vec = Word2Vec()
save_path = './save/word2vec.pt'
torch.save(word2vec.state_dict(), save_path)

### Print out data/statistics

In [16]:
word2vec.parameters

<bound method Module.parameters of Word2Vec(
  (embed): Embedding(20402, 300)
  (fc): Linear(in_features=300, out_features=20402, bias=True)
)>

# 1.8 Loss function and optimizer (0.5 points)

Initialize variables with [optimizer](https://pytorch.org/docs/stable/optim.html#module-torch.optim) and loss function. You can take what is used in the word2vec paper, but you can use alternative optimizers/loss functions if you explain your choice in the report.

In [17]:
# Define optimizer and loss
optimizer = optim.Adam(word2vec.parameters(), lr=learning_rate) # in paper, AdaGrad is used.
criterion = nn.CrossEntropyLoss(reduction='sum') # i.e. equivalent to NLL on LogSoftmax

# 1.9 Training the model (3 points)

As everything is prepared, implement a training loop that performs several passes of the data set through the model. You are free to do this as you please, but your code should:

* Load the weights saved in 1.6 at the start of every execution of the code block
* Print the accumulated loss at least after every epoch (the accumulate loss should be reset after every epoch)
* Define a criterion for the training procedure to terminate if a certain loss value is reached. You can find the threshold by observing the loss for the development set.

You can play around with the number of epochs and the learning rate.

### Dataset

In [18]:
class W2VDataset(Dataset):
    def __init__(self, sentences):
        self.data = []
        for sentence in sentences:
            for data_point in get_target_context(sentence):
                self.data.append(data_point)
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        return self.data[index]

In [19]:
# load initial weights
word2vec.load_state_dict(torch.load(save_path))
word2vec = word2vec.to(device)

# train model
list_loss = []
for epoch in range(1, epochs+1):
    train_dataset = W2VDataset(sentences)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    
    losses = 0.
    cnt = 0
    word2vec.train()
    for words, context_words in tqdm(train_loader):
        optimizer.zero_grad()
        pred = word2vec(words.to(device))
        loss = criterion(pred, context_words.squeeze(1).to(device))
        loss.backward()
        optimizer.step()
        losses += loss.detach().item()
        cnt += len(words)

    epoch_loss = losses / cnt
    print(f'Epoch {epoch:2}: training loss: {epoch_loss:.4f} over {cnt} training points.')
    
    if epoch % 10 == 0:
        # save embedding
        embedding_weights = word2vec.embed.state_dict()
        embedding_weights['weight']
        torch.save(embedding_weights, f'./save/embedding_checkpoints/embedding_weights_{epoch}_epoch_{embedding_size}_dim_{window_size}_wsize.pt')
        # save full model
        torch.save(word2vec.state_dict(), f'./save/model_checkpoints/word2vec_{epoch}_epoch_{embedding_size}_dim_{window_size}_wsize.pt')
    
    # early-stop when training loss is not decreasing.
    list_loss.append(epoch_loss)
    if len(list_loss) > 5 and min(list_loss[-5:]) > min(list_loss[:-5]):
        print('Training loss is not reducing anymore, terminate.')
        break

print("Training finished")

100%|██████████| 446/446 [00:22<00:00, 19.70it/s]


Epoch  1: training loss: 12.7457 over 113965 training points.


100%|██████████| 445/445 [00:21<00:00, 20.25it/s]


Epoch  2: training loss: 12.0367 over 113758 training points.


100%|██████████| 446/446 [00:22<00:00, 19.58it/s]


Epoch  3: training loss: 10.6651 over 114004 training points.


100%|██████████| 447/447 [00:22<00:00, 20.18it/s]


Epoch  4: training loss: 9.6758 over 114423 training points.


100%|██████████| 445/445 [00:22<00:00, 19.97it/s]


Epoch  5: training loss: 8.9723 over 113814 training points.


100%|██████████| 444/444 [00:22<00:00, 19.68it/s]


Epoch  6: training loss: 8.4702 over 113638 training points.


100%|██████████| 446/446 [00:22<00:00, 19.48it/s]


Epoch  7: training loss: 8.1053 over 113997 training points.


100%|██████████| 446/446 [00:23<00:00, 19.10it/s]


Epoch  8: training loss: 7.8091 over 114011 training points.


100%|██████████| 446/446 [00:22<00:00, 20.06it/s]


Epoch  9: training loss: 7.6485 over 114006 training points.


100%|██████████| 447/447 [00:22<00:00, 19.85it/s]


Epoch 10: training loss: 7.5325 over 114246 training points.


100%|██████████| 447/447 [00:22<00:00, 19.87it/s]


Epoch 11: training loss: 7.4568 over 114274 training points.


100%|██████████| 443/443 [00:22<00:00, 19.88it/s]


Epoch 12: training loss: 7.3913 over 113211 training points.


100%|██████████| 444/444 [00:22<00:00, 19.32it/s]


Epoch 13: training loss: 7.3485 over 113496 training points.


100%|██████████| 445/445 [00:22<00:00, 19.60it/s]


Epoch 14: training loss: 7.3170 over 113817 training points.


100%|██████████| 446/446 [00:23<00:00, 19.23it/s]


Epoch 15: training loss: 7.2866 over 114067 training points.


100%|██████████| 446/446 [00:23<00:00, 19.25it/s]


Epoch 16: training loss: 7.2755 over 113997 training points.


100%|██████████| 445/445 [00:23<00:00, 19.23it/s]


Epoch 17: training loss: 7.2622 over 113802 training points.


100%|██████████| 446/446 [00:21<00:00, 20.39it/s]


Epoch 18: training loss: 7.2387 over 114032 training points.


100%|██████████| 445/445 [00:23<00:00, 18.87it/s]


Epoch 19: training loss: 7.2327 over 113798 training points.


100%|██████████| 446/446 [00:21<00:00, 20.47it/s]


Epoch 20: training loss: 7.2269 over 114035 training points.


100%|██████████| 446/446 [00:23<00:00, 19.01it/s]


Epoch 21: training loss: 7.2080 over 114118 training points.


100%|██████████| 444/444 [00:23<00:00, 18.88it/s]


Epoch 22: training loss: 7.1785 over 113504 training points.


100%|██████████| 445/445 [00:22<00:00, 19.54it/s]


Epoch 23: training loss: 7.1598 over 113692 training points.


100%|██████████| 445/445 [00:22<00:00, 19.98it/s]


Epoch 24: training loss: 7.1359 over 113819 training points.


100%|██████████| 445/445 [00:22<00:00, 20.03it/s]


Epoch 25: training loss: 7.1132 over 113774 training points.


100%|██████████| 445/445 [00:21<00:00, 20.49it/s]


Epoch 26: training loss: 7.0920 over 113847 training points.


100%|██████████| 446/446 [00:21<00:00, 20.48it/s]


Epoch 27: training loss: 7.0859 over 114080 training points.


100%|██████████| 446/446 [00:21<00:00, 20.38it/s]


Epoch 28: training loss: 7.0721 over 114171 training points.


100%|██████████| 445/445 [00:22<00:00, 19.54it/s]


Epoch 29: training loss: 7.0620 over 113710 training points.


100%|██████████| 445/445 [00:22<00:00, 19.53it/s]


Epoch 30: training loss: 7.0382 over 113743 training points.


100%|██████████| 444/444 [00:23<00:00, 19.01it/s]


Epoch 31: training loss: 6.9990 over 113556 training points.


100%|██████████| 444/444 [00:22<00:00, 20.06it/s]


Epoch 32: training loss: 6.9990 over 113453 training points.


100%|██████████| 445/445 [00:21<00:00, 20.35it/s]


Epoch 33: training loss: 6.9706 over 113886 training points.


100%|██████████| 445/445 [00:22<00:00, 19.89it/s]


Epoch 34: training loss: 6.9559 over 113712 training points.


100%|██████████| 444/444 [00:22<00:00, 19.53it/s]


Epoch 35: training loss: 6.9539 over 113614 training points.


100%|██████████| 446/446 [00:23<00:00, 18.85it/s]


Epoch 36: training loss: 6.9340 over 114094 training points.


100%|██████████| 444/444 [00:22<00:00, 19.56it/s]


Epoch 37: training loss: 6.9223 over 113420 training points.


100%|██████████| 445/445 [00:22<00:00, 19.98it/s]


Epoch 38: training loss: 6.9060 over 113914 training points.


100%|██████████| 446/446 [00:24<00:00, 18.10it/s]


Epoch 39: training loss: 6.9014 over 114024 training points.


100%|██████████| 445/445 [00:24<00:00, 18.22it/s]


Epoch 40: training loss: 6.8978 over 113768 training points.


100%|██████████| 445/445 [00:22<00:00, 19.81it/s]


Epoch 41: training loss: 6.8741 over 113686 training points.


100%|██████████| 445/445 [00:21<00:00, 20.25it/s]


Epoch 42: training loss: 6.8541 over 113669 training points.


100%|██████████| 446/446 [00:22<00:00, 20.20it/s]


Epoch 43: training loss: 6.8698 over 113942 training points.


100%|██████████| 444/444 [00:21<00:00, 20.34it/s]


Epoch 44: training loss: 6.8516 over 113476 training points.


100%|██████████| 444/444 [00:22<00:00, 20.18it/s]


Epoch 45: training loss: 6.8389 over 113561 training points.


100%|██████████| 444/444 [00:22<00:00, 19.86it/s]


Epoch 46: training loss: 6.8280 over 113523 training points.


100%|██████████| 445/445 [00:22<00:00, 20.23it/s]


Epoch 47: training loss: 6.8158 over 113731 training points.


100%|██████████| 447/447 [00:22<00:00, 19.91it/s]


Epoch 48: training loss: 6.7931 over 114181 training points.


100%|██████████| 445/445 [00:21<00:00, 20.53it/s]


Epoch 49: training loss: 6.7950 over 113870 training points.


100%|██████████| 446/446 [00:21<00:00, 20.34it/s]


Epoch 50: training loss: 6.7869 over 113929 training points.


100%|██████████| 445/445 [00:22<00:00, 20.13it/s]


Epoch 51: training loss: 6.7724 over 113691 training points.


100%|██████████| 445/445 [00:22<00:00, 20.00it/s]


Epoch 52: training loss: 6.7817 over 113687 training points.


100%|██████████| 445/445 [00:22<00:00, 19.81it/s]


Epoch 53: training loss: 6.7644 over 113886 training points.


100%|██████████| 443/443 [00:21<00:00, 20.58it/s]


Epoch 54: training loss: 6.7640 over 113359 training points.


100%|██████████| 443/443 [00:22<00:00, 19.36it/s]


Epoch 55: training loss: 6.7528 over 113390 training points.


100%|██████████| 445/445 [00:22<00:00, 19.72it/s]


Epoch 56: training loss: 6.7506 over 113904 training points.


100%|██████████| 446/446 [00:22<00:00, 19.44it/s]


Epoch 57: training loss: 6.7427 over 114020 training points.


100%|██████████| 444/444 [00:21<00:00, 20.23it/s]


Epoch 58: training loss: 6.7462 over 113423 training points.


100%|██████████| 446/446 [00:22<00:00, 19.52it/s]


Epoch 59: training loss: 6.7408 over 113934 training points.


100%|██████████| 444/444 [00:21<00:00, 20.21it/s]


Epoch 60: training loss: 6.7324 over 113607 training points.


100%|██████████| 446/446 [00:21<00:00, 20.33it/s]


Epoch 61: training loss: 6.7120 over 113924 training points.


100%|██████████| 445/445 [00:21<00:00, 20.32it/s]


Epoch 62: training loss: 6.7232 over 113815 training points.


100%|██████████| 445/445 [00:22<00:00, 19.88it/s]


Epoch 63: training loss: 6.7105 over 113760 training points.


100%|██████████| 444/444 [00:21<00:00, 20.39it/s]


Epoch 64: training loss: 6.7083 over 113470 training points.


100%|██████████| 445/445 [00:21<00:00, 20.28it/s]


Epoch 65: training loss: 6.7191 over 113717 training points.


100%|██████████| 445/445 [00:22<00:00, 20.06it/s]


Epoch 66: training loss: 6.7022 over 113822 training points.


100%|██████████| 444/444 [00:22<00:00, 20.15it/s]


Epoch 67: training loss: 6.6988 over 113577 training points.


100%|██████████| 447/447 [00:22<00:00, 20.28it/s]


Epoch 68: training loss: 6.6903 over 114266 training points.


100%|██████████| 444/444 [00:22<00:00, 20.14it/s]


Epoch 69: training loss: 6.6928 over 113549 training points.


100%|██████████| 444/444 [00:22<00:00, 20.06it/s]


Epoch 70: training loss: 6.7012 over 113622 training points.


100%|██████████| 444/444 [00:21<00:00, 20.22it/s]


Epoch 71: training loss: 6.6875 over 113527 training points.


100%|██████████| 448/448 [00:22<00:00, 19.74it/s]


Epoch 72: training loss: 6.6833 over 114506 training points.


100%|██████████| 442/442 [00:21<00:00, 20.37it/s]


Epoch 73: training loss: 6.6760 over 112984 training points.


100%|██████████| 444/444 [00:22<00:00, 20.18it/s]


Epoch 74: training loss: 6.6736 over 113560 training points.


100%|██████████| 443/443 [00:21<00:00, 20.30it/s]


Epoch 75: training loss: 6.6801 over 113375 training points.


100%|██████████| 445/445 [00:21<00:00, 20.31it/s]


Epoch 76: training loss: 6.6531 over 113818 training points.


100%|██████████| 444/444 [00:21<00:00, 20.35it/s]


Epoch 77: training loss: 6.6563 over 113490 training points.


100%|██████████| 444/444 [00:21<00:00, 20.35it/s]


Epoch 78: training loss: 6.6592 over 113517 training points.


100%|██████████| 446/446 [00:21<00:00, 20.30it/s]


Epoch 79: training loss: 6.6681 over 114048 training points.


100%|██████████| 445/445 [00:22<00:00, 20.21it/s]


Epoch 80: training loss: 6.6693 over 113737 training points.


100%|██████████| 444/444 [00:22<00:00, 20.09it/s]

Epoch 81: training loss: 6.6702 over 113569 training points.
Training loss is not reducing anymore, terminate.
Training finished





# 1.10 Train on the full dataset (0.5 points)

Now, go back to 1.1 and remove the restriction on the number of sentences in your corpus. Then, reexecute code blocks 1.2, 1.3 and 1.6 (or those relevant if you created additional ones). 

* Then, retrain your model on the complete dataset.

* Now, the input weights of the model contain the desired word embeddings! Save them together with the corresponding vocabulary items (Pytorch provides a nice [functionality](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for this).

In [20]:
# save the word-embedding layer weights
embedding_weights = word2vec.embed.state_dict()
torch.save(embedding_weights, f'save/embedding_weights_.pt')

# save dicts for transformation word <-> int
with open(f'save/word_to_int_dict.json', 'w') as f:
    json.dump(word_to_int, f)
with open(f'save/int_to_word_dict.json', 'w') as f:
    json.dump(int_to_word, f)

# Note
- We keep the hash-tag words.
- We make use of the nn.Embedding layer from torch instead of using word_to_one_hot() function.
- We use Adam instead of AdaGrad as stated in the paper.