### Disclaimer
This notebook assumes familiarity with:
* Python
* Neural networks
* PyTorch Datasets and Modules
* The general machine learning workflow

### Word Embeddings Example
The goal of this notebook is to get hands-on experience with word embeddings.\
We will do the following:
* Load a set of Arabic text as trigrams
* Build a simple neural network
* Train the network to **predict the next word**
* Use the network weights as embeddings

#### Build Dataset

To train an embedding, we scraped a famous Arabic magazine and collected 1000 articles. The articles are\
found in the `alaraby1k.json` file. Each entry in the dataset contains the `author` name, the `issue` number,\
and the article `text`.

The task we use to train the embedding is to predict the third word of a trigram given the two preceding words.\
It is framed as a straightforward classification task: the model should predict the id of the third word.\
With approximately 165,000 words in the vocab, the output layer has 165,000 classes.
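
To make the trigram task concrete, here is a minimal sketch on a made-up English sentence. The function name `generate_trigrams` and the sentence are illustrative assumptions; the notebook's dataset class applies the same sliding window to each Arabic article.

```python
def generate_trigrams(text):
    """Return every run of three consecutive whitespace-separated words."""
    words = text.split()
    # A text with fewer than three words yields no trigrams.
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


# Hypothetical sentence, used only for illustration.
sentence = "the cat sat on the mat"
trigrams = generate_trigrams(sentence)

# The first two words are the model input, the third is the class to predict.
for first, second, target in trigrams:
    print(f"({first}, {second}) -> {target}")
```

A sentence of `n` words produces `n - 2` training examples, so even a modest corpus yields a large number of trigrams.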

The code chunk below loads the text of all the articles and generates all possible trigrams.\
We wrap the dataset in a custom PyTorch `Dataset` class.

As in most language-model training processes, we need a vocab. The vocab is the set of\
unique words that our data is composed of.

**During the dataset initialization, we find all unique words and store them in a set**, and\
then define two dictionaries: one maps a given word to its index, and the other\
maps a given index back to its word. We need both because we read words but feed\
the network indices, and the network outputs indices that we need to turn back into words.

Once the network is trained, we will run some tests, and it will happen\
that a word in the test set is not in the vocabulary of the model. We need to map\
such a word to a dedicated index and a common placeholder token.\
**We added the token `<UNKNOWN>` to the vocab at index zero to replace words that are not in the vocab.** See `__compute_vocab__()`.
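
The two mappings and the unknown-word fallback can be sketched on a toy corpus as follows. The names `build_vocab`, `encode`, and `decode` are hypothetical helpers for this illustration, and sorting the words is an assumption made here so that the ids are reproducible.

```python
UNKNOWN_TOKEN = "<UNKNOWN>"
UNKNOWN_ID = 0


def build_vocab(texts):
    """Collect the unique words and build both lookup directions."""
    words = set()
    for text in texts:
        words.update(text.split())
    # Index 0 is reserved for the unknown token; sorting keeps ids stable across runs.
    id_to_word = [UNKNOWN_TOKEN] + sorted(words)
    word_to_id = {word: idx for idx, word in enumerate(id_to_word)}
    return id_to_word, word_to_id


# Hypothetical two-sentence corpus.
id_to_word, word_to_id = build_vocab(["the cat sat", "the dog sat"])


def encode(word):
    # Words never seen during training fall back to the unknown id.
    return word_to_id.get(word, UNKNOWN_ID)


def decode(idx):
    # Ids outside the vocab fall back to the unknown token.
    return id_to_word[idx] if 0 <= idx < len(id_to_word) else UNKNOWN_TOKEN


print(encode("cat"), encode("bird"))  # in-vocab id, then the unknown id
print(decode(encode("bird")))
```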

In [1]:
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
import json

class MyDataset(Dataset):
    """Trigram dataset: each item holds the ids of three consecutive words."""

    UNKNOWN_TOKEN = "<UNKNOWN>"
    UNKNOWN_ID = 0

    def __init__(self, alaraby_filepath, is_train):
        with open(alaraby_filepath, "r", encoding="utf-8") as f:
            self.raw_data = [article["text"] for article in json.load(f)]
        # A fixed random_state gives the train and test instances the same split.
        self.train_raw_data, self.test_raw_data = train_test_split(self.raw_data, test_size=0.1, random_state=42)
        self.train_trigrams = self.__generate_trigrams__(self.train_raw_data)
        self.test_trigrams = self.__generate_trigrams__(self.test_raw_data)
        # The vocab is built from the training articles only.
        self.vocab, self.id_to_word, self.word_to_id = self.__compute_vocab__(self.train_raw_data)
        self.is_train = is_train

    def __generate_trigrams__(self, texts):
        trigrams = []
        for text in texts:
            words = text.split()
            # Slide a window of three words over the article.
            trigrams += [words[i:i+3] for i in range(len(words)-2)]
        return trigrams

    def __compute_vocab__(self, texts):
        words = set()
        for text in texts:
            words.update(text.split())
        # Sort so that word ids are identical across runs (set order is not);
        # otherwise a reloaded checkpoint would no longer match the ids.
        words_list = [self.UNKNOWN_TOKEN] + sorted(words)
        id_to_word = dict(enumerate(words_list))
        word_to_id = {word: idx for idx, word in enumerate(words_list)}
        return words_list, id_to_word, word_to_id

    def __len__(self):
        return len(self.train_trigrams) if self.is_train else len(self.test_trigrams)

    def __getitem__(self, idx):
        trigrams = self.train_trigrams if self.is_train else self.test_trigrams
        return tuple(self.get_word_id(word) for word in trigrams[idx])

    def get_word_from_id(self, idx):
        # Ids outside the vocab map to the unknown token.
        return self.id_to_word.get(idx, self.UNKNOWN_TOKEN)

    def get_word_id(self, word):
        # Words outside the vocab map to the unknown id (0).
        return self.word_to_id.get(word, self.UNKNOWN_ID)

    def get_vocab_size(self):
        # Counts the unknown token as well.
        return len(self.vocab)

In [None]:
train_dataset = MyDataset("../Dataset/alaraby1k.json", is_train= True)
test_dataset = MyDataset("../Dataset/alaraby1k.json", is_train= False)

print(f"Id of unknown word: {train_dataset.get_word_id('<UNKNOWN>')}")
print(f"Word of unknown id: {train_dataset.get_word_from_id(-1)}")
print(f"Vocab size: {train_dataset.get_vocab_size()}")

# Show the first three trigrams, first as ids and then as words.
for item in range(3):
    x1, x2, y = train_dataset[item]
    print(f"Item {item} in dataset: [{x1} - {x2}] -> {y}")
    print(f"Item {item} in words: [{train_dataset.get_word_from_id(x1)} - {train_dataset.get_word_from_id(x2)}] -> {train_dataset.get_word_from_id(y)}")


#### Build Neural Network

**The network task is framed as a straightforward classification task.** The model should predict\
the id of the third word. With approximately 165,000 words in the vocab, we have 165,000 classes\
in the output layer.

We will use the `nn.Embedding` layer from PyTorch. Internally, the layer is a weight matrix of\
size `(vocab size, embedding size)`, with one row per word. The `embedding size` is the size of the vector\
we use to represent each word; it is chosen arbitrarily and can be tuned with hyperparameter\
optimization. The `vocab size` is the size of the vocab, i.e., the number of words we have.\
The matrix is updated and optimized using backpropagation.

Our vocab size is 163101, so we will round up and use an embedding with a vocab size of 165,000. We will\
choose the embedding size to be `1024`, which means each word will be represented by\
a vector of size `1024`. Given the complex nature of languages in general\
and the difficulty of the given task, one should use a large embedding size:\
this gives the model more capacity to learn about the semantics of the data.
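
These sizes have a direct cost in parameters. The rough count below assumes the architecture of the `WordPredictor` class in this notebook (one embedding table followed by one linear layer over two concatenated word vectors) and assumes `float32` weights; it ignores optimizer state and activations.

```python
vocab_size = 165_000
embedding_dim = 1024
context_words = 2  # the two words used to predict the third

# Embedding table: one row of size embedding_dim per word.
embedding_params = vocab_size * embedding_dim

# Output layer: weights of shape (vocab_size, context_words * embedding_dim) plus one bias per class.
linear_params = (context_words * embedding_dim) * vocab_size + vocab_size

total_params = embedding_params + linear_params
bytes_per_float32 = 4

print(f"Embedding parameters: {embedding_params:,}")  # 168,960,000
print(f"Linear parameters:    {linear_params:,}")     # 338,085,000
print(f"Total parameters:     {total_params:,}")      # 507,045,000
print(f"Weights in float32:   {total_params * bytes_per_float32 / 1e9:.2f} GB")  # 2.03 GB
```

Under these assumptions, two thirds of the parameters sit in the output layer rather than in the embedding itself.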

Finally, note that `nn.Embedding` is quite similar to an `nn.Linear` layer without a bias.\
The main difference is that the embedding layer takes word indices in the vocabulary,\
while the linear layer takes one-hot encoded vectors.
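
The similarity can be checked numerically. The sketch below uses toy sizes chosen only for illustration: it copies the embedding matrix into a bias-free linear layer and compares an index lookup against a one-hot matrix product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 6, 3  # toy sizes, not the notebook's values

embedding = nn.Embedding(vocab_size, embedding_dim)

# nn.Linear stores its weight as (out_features, in_features),
# which is the transpose of the embedding matrix.
linear = nn.Linear(vocab_size, embedding_dim, bias=False)
with torch.no_grad():
    linear.weight.copy_(embedding.weight.T)

indices = torch.tensor([4, 1])
one_hot = F.one_hot(indices, num_classes=vocab_size).float()

from_embedding = embedding(indices)  # row lookup by index
from_linear = linear(one_hot)        # one-hot vector times the same matrix

print(embedding.weight.shape)                        # torch.Size([6, 3])
print(torch.allclose(from_embedding, from_linear))   # True
```

The lookup avoids building a one-hot vector with one entry per vocab word, which is why the embedding layer is the practical choice for large vocabularies.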

In [5]:
import torch
import torch.nn as nn

class WordPredictor(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(WordPredictor, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim*2, vocab_size)
    
    def forward(self, x1, x2):
        embedded1 = self.embedding(x1)
        embedded2 = self.embedding(x2)
        concatenated = torch.cat((embedded1, embedded2), dim=1)
        output = self.linear(concatenated)
        return output

    def embed(self, x):
        return self.embedding(x)

    def word(self, x):
        # Index of the vocab word whose embedding is closest (Euclidean distance) to x.
        distance = torch.norm(self.embedding.weight.data - x, dim=1)
        nearest = torch.argmin(distance)
        return nearest

#### Train Embedding

In this code chunk, we simply train the network with a given set of hyperparameters.

In [None]:
from torch.utils.data import DataLoader
from torch.optim import SGD
import os

device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

batch_size = 1500
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Shuffling is not needed for evaluation data.
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

vocab_size = 165_000
embbeding_dim = 1024
model = WordPredictor(vocab_size, embbeding_dim)
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=0.01)

chkpnt_path = f"word_predictor_{vocab_size}_{embbeding_dim}.chk"
if os.path.exists(chkpnt_path):
  model.load_state_dict(torch.load(chkpnt_path, map_location=device))


progress_path = "/word_predictor_progress.json"
if os.path.exists(progress_path):
  # Resume from the progress saved by a previous run.
  with open(progress_path, "r") as f:
    progress = json.load(f)
else:
  progress = {"chkpnt" : 0, "progress" : []}

# Training loop: resume from the first epoch that has not been completed yet
epochs = 100
for epoch in range(progress["chkpnt"], epochs):
  for i, batch in enumerate(train_dataloader):
    optimizer.zero_grad()
    x1, x2, target = batch
    x1 = x1.to(device)
    x2 = x2.to(device)
    target = target.to(device)
    output = model(x1, x2)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    if i % 10 == 0:
      print(f"Epoch {epoch} Batch {i}, Loss: {loss.item()}")
  # Record the loss of the last batch and save a checkpoint after every epoch
  progress["progress"].append({"epoch": epoch, "loss": loss.item()})
  progress["chkpnt"] = epoch + 1  # next epoch to run when resuming
  with open(progress_path, "w") as f:
    json.dump(progress, f)
  torch.save(model.state_dict(), chkpnt_path)
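
The cell above builds `test_dataloader`, but the training loop never reads it. One possible way to score the held-out trigrams is sketched below. The function `evaluate`, the stand-in model `TinyPredictor`, and the toy data are hypothetical names introduced for this illustration; they are not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def evaluate(model, dataloader, criterion, device):
    """Return (mean loss per example, top-1 accuracy) over (x1, x2, target) batches."""
    model.eval()
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for x1, x2, target in dataloader:
            x1, x2, target = x1.to(device), x2.to(device), target.to(device)
            output = model(x1, x2)
            # CrossEntropyLoss averages over the batch, so weight it by the batch size.
            total_loss += criterion(output, target).item() * target.size(0)
            correct += (output.argmax(dim=1) == target).sum().item()
            count += target.size(0)
    model.train()
    return total_loss / count, correct / count


class TinyPredictor(nn.Module):
    """Hypothetical stand-in with the same forward(x1, x2) interface as WordPredictor."""

    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim * 2, vocab_size)

    def forward(self, x1, x2):
        return self.linear(torch.cat((self.embedding(x1), self.embedding(x2)), dim=1))


# Toy trigrams of word ids, standing in for the test dataset.
toy_data = [(1, 2, 3), (2, 3, 4), (3, 4, 0)]
toy_loader = DataLoader(toy_data, batch_size=2)
toy_model = TinyPredictor(vocab_size=5, embedding_dim=4)

mean_loss, accuracy = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"Mean loss: {mean_loss:.4f}, accuracy: {accuracy:.2%}")
```

With the notebook's own objects, the call would take the form `evaluate(model, test_dataloader, criterion, device)`, run once per epoch if a validation curve is wanted.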