# Course AI Homework 5
In Homework 5, we will train our own CBOW Word2Vec embedding on the WikiText2 dataset (a small dataset).
- Change the Runtime option above to GPU if you can (max 12 hours per user).
- Save and submit the outputs of this notebook, along with the model and vocab files you trained.
- You are not allowed to use other Python files or import a pretrained model.

In [51]:
# Run this command if you train the model in the Colab environment
! pip install datasets transformers



In [52]:
import argparse
import yaml
import os
import torch
import torch.nn as nn
import torchtext

import json
import numpy as np

from functools import partial
from torch.utils.data import DataLoader
from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer

from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2 # WikiText103

import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

from datasets import load_dataset



In [53]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch_seed_numb = 0
torch.manual_seed(torch_seed_numb)  # seed the CPU generator as well
if device.type == 'cuda':
    torch.cuda.manual_seed(torch_seed_numb)

In [54]:
device

device(type='cuda')

In [55]:
# If you use the Google Colab environment, mount your Google Drive here to save the model and vocab
from google.colab import drive
drive.mount('/content/drive')
root_dir = '/content/drive/MyDrive/course_ai_hw5'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Constant Setting

In [56]:
# You can change these parameters if you want.

train_batch_size =  96
val_batch_size = 96
shuffle =  True

optimizer =  'Adam'
learning_rate =  0.001
epochs = 50

result_dir = 'weights/'

# Parameters about CBOW model architecture and Vocab.
CBOW_N_WORDS = 4

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

In [57]:
result_dir = os.path.join(root_dir, result_dir)
# makedirs also creates root_dir if it does not exist yet
os.makedirs(result_dir, exist_ok=True)


## Prepare dataset and vocab

In [58]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = datasets["train"]
val_dataset = datasets['validation']
test_dataset = datasets['test']
#train_dataset.map(tokenizing_word , batched= True, batch_size = 5000)


In [59]:
# Let's print one example
train_dataset['text'][11]

" Troops are divided into five classes : Scouts , Shocktroopers , Engineers , Lancers and Armored Soldier . Troopers can switch classes by changing their assigned weapon . Changing class does not greatly affect the stats gained while in a previous class . With victory in battle , experience points are awarded to the squad , which are distributed into five different attributes shared by the entire squad , a feature differing from early games ' method of distributing to different unit types . \n"

As you can see, we need to clean up the sentences, lowercase them, tokenize them, and convert each word to a vocabulary index (the integer that identifies its one-hot vector). Before going through the whole process, we need to build a vocabulary from the training dataset.

In [60]:
tokenizer = get_tokenizer("basic_english", language="en")

# TODO 1): make vocabulary
# Hint) use build_vocab_from_iterator on train_dataset and set the special tokens
tokens = [tokenizer(text) for text in train_dataset['text']]
vocab = build_vocab_from_iterator(tokens, min_freq=MIN_WORD_FREQUENCY, specials=['<unk>', '<pad>', '<bos>', '<eos>'])
vocab.set_default_index(vocab['<unk>'])

len(vocab.get_stoi())

4122
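
To make the effect of MIN_WORD_FREQUENCY concrete, here is a minimal pure-Python sketch of a frequency-filtered vocabulary. The helpers `build_vocab` and `encode` and the toy corpus are hypothetical and only illustrate the idea; they are not torchtext's implementation, and the tie-breaking order is an assumption.

```python
from collections import Counter

def build_vocab(token_lists, min_freq, specials=("<unk>",)):
    """Hypothetical helper: map tokens seen at least `min_freq` times to indices."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens take the first indices, as in the cell above.
    itos = list(specials)
    # Assumed ordering: most frequent first, ties broken alphabetically.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: i for i, tok in enumerate(itos)}

def encode(tokens, stoi, unk="<unk>"):
    # Out-of-vocabulary tokens fall back to the <unk> index,
    # mirroring vocab.set_default_index(vocab['<unk>']).
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

toy_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
toy_stoi = build_vocab(toy_corpus, min_freq=2)
toy_ids = encode(["the", "dog", "sat"], toy_stoi)
```

With `min_freq=2`, only "the" and "cat" survive, so "dog" and "sat" are both encoded as the `<unk>` index. The same mechanism is why the notebook's vocabulary ends up much smaller than the number of distinct tokens in the training text.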

We need a collate function to turn the dataset into the CBOW training format. The collate function should slide a window over each text in the batch and build the train/validation examples. Each example consists of CBOW_N_WORDS words on the left and on the right of the center word as input, and the center word as the target output.  
Make the collate function return the CBOW dataset as tensors.

In [61]:
# Here is a lambda function that tokenizes a sentence and converts its words to vocab indices.
text_pipeline = lambda x: vocab(tokenizer(x))

![cbow](https://user-images.githubusercontent.com/74028313/204695601-51d44a38-4bd3-4a69-8891-2854aa57c034.png)

In [62]:
def roll(x, window_size, step_size=1):
    # unfold dimension to make our rolling window
    return x.unfold(0,window_size,step_size)

def collate(batch, text_pipeline):

    batch_input, batch_output = [], []
    # TODO 2): make collate function
    for text in batch:
        indexed_text = text_pipeline(text)
        indexed_text = torch.tensor(indexed_text, dtype=torch.int, device=device)
        length = indexed_text.shape[0]
        if length < CBOW_N_WORDS * 2 + 1:
            continue
        _batch_input = torch.zeros(length - CBOW_N_WORDS * 2, CBOW_N_WORDS * 2, dtype=torch.int, device=device)
        _batch_input[:, :CBOW_N_WORDS] = roll(indexed_text[:-CBOW_N_WORDS - 1], CBOW_N_WORDS, 1)
        _batch_input[:, CBOW_N_WORDS:] = roll(indexed_text[CBOW_N_WORDS + 1:], CBOW_N_WORDS, 1)
        batch_input.append(_batch_input)
        batch_output.append(indexed_text[CBOW_N_WORDS:-CBOW_N_WORDS])

    batch_input = torch.cat(batch_input, dim=0)
    batch_output = torch.cat(batch_output, dim=0).to(torch.long)
    #batch_output = torch.tensor(batch_output, dtype=torch.int, device=device)
    return batch_input, batch_output
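
The windowing done by `collate` can be checked against a plain-Python reference. The sketch below assumes a single already-indexed text; `cbow_pairs` is a hypothetical helper, not part of the assignment code, and it uses lists instead of tensors.

```python
def cbow_pairs(indices, n_words):
    """Hypothetical reference: return (contexts, targets) for one indexed text."""
    contexts, targets = [], []
    # A center word needs n_words neighbours on each side.
    for center in range(n_words, len(indices) - n_words):
        left = indices[center - n_words:center]
        right = indices[center + 1:center + 1 + n_words]
        contexts.append(left + right)
        targets.append(indices[center])
    return contexts, targets

toy_contexts, toy_targets = cbow_pairs([10, 11, 12, 13, 14, 15], n_words=2)
```

For a text of length 6 and two context words per side, only positions 2 and 3 have a full window, so two examples are produced. A text shorter than `2 * n_words + 1` yields no examples, which corresponds to the `continue` branch in `collate`.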

In [63]:
train_dataloader = DataLoader(
    train_dataset['text'],
    batch_size=train_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

val_dataloader = DataLoader(
    val_dataset['text'],
    batch_size=val_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

## Make CBOW Model
![image](https://user-images.githubusercontent.com/74028313/204701161-cd9df4bf-78b8-4b4d-b8b7-ed4a3b5c3922.png)

The main idea of the CBOW model is to predict the center (target) word from its context words. As the simple architecture above shows, the 2 × CBOW_N_WORDS input words are projected onto the projection layer. Converting each word to an embedding requires a lookup table, for which we use torch's nn.Embedding. After combining the context embeddings, the model uses a shallow linear layer to predict the target word and compares the result with the center word's index using cross-entropy loss. Finally, the embedding layer (lookup table) of the trained model serves as the word embedding.

In [64]:
class CBOW_Model(nn.Module):
    def __init__(self, vocab_size: int, EMBED_DIMENSION, EMBED_MAX_NORM):
        super(CBOW_Model, self).__init__()
        # TODO 3-1): make CBOW model using nn.Embedding and nn.Linear function
        self.embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=EMBED_DIMENSION, max_norm=EMBED_MAX_NORM)
        self.linear_layer = nn.Linear(EMBED_DIMENSION, vocab_size)

    def forward(self, _inputs):
        # TODO 3-2): make forward function
        # _inputs: (batch_size, 2 * window_size), each element indicates vocab index
        # projection_layer: (batch_size, embedding_dim)
        # _outputs: (batch_size, vocab_size)

        projection_layer = self.embedding_layer(_inputs)
        projection_layer = torch.sum(projection_layer, dim=1)
        _outputs = self.linear_layer(projection_layer)

        return _outputs
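
To see what the forward pass computes for a single example, the following pure-Python sketch sums the context embeddings, applies a linear layer, and evaluates the cross-entropy loss. The helpers `cbow_forward` and `cross_entropy` and all toy values are assumptions for illustration; the sketch leaves out the `max_norm` renormalization that nn.Embedding applies in the model above.

```python
import math

def cbow_forward(context_ids, embedding, weight, bias):
    """Hypothetical pure-Python forward pass for a single example."""
    dim = len(embedding[0])
    # Projection: sum the context embeddings, as torch.sum(..., dim=1) does above.
    hidden = [sum(embedding[i][d] for i in context_ids) for d in range(dim)]
    # Linear layer: one logit per vocabulary entry.
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]

def cross_entropy(logits, target):
    # Numerically stable log-softmax followed by negative log-likelihood.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

# Toy values (assumptions): vocabulary of 3 words, embedding dimension 2.
toy_embedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_bias = [0.0, 0.0, 0.0]

toy_logits = cbow_forward([0, 1], toy_embedding, toy_weight, toy_bias)
toy_loss = cross_entropy(toy_logits, target=2)
```

Here the context words 0 and 1 sum to the hidden vector [1, 1], and the third vocabulary entry receives the largest logit, so the loss for target 2 is smaller than it would be for targets 0 or 1.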

## Train the model

Let's write the _train_epoch and _validate_epoch functions to train the CBOW model.  
- model.train() and model.eval() switch the mode of layers such as Dropout and BatchNorm, which behave differently during training and inference.
- The lr_scheduler option changes the learning rate from epoch to epoch. Try it if you are interested.

In [65]:
vocab_size = len(vocab.get_stoi())

model = CBOW_Model(vocab_size=vocab_size, EMBED_DIMENSION = EMBED_DIMENSION, EMBED_MAX_NORM = EMBED_MAX_NORM)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

In [66]:
from tqdm import tqdm

class Train_CBOW:

    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        val_dataloader,
        loss_function,
        optimizer,
        device,
        model_dir,
        lr_scheduler = None
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.device = device
        self.model_dir = model_dir

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()


    def _train_epoch(self):
        self.model.train() # set model as train
        loss_list = []
        # TODO 4-1):
        for b_input, b_output in tqdm(self.train_dataloader):
            pred = self.model(b_input)
            loss = self.loss_function(pred, b_output)
            loss_list.append(loss.item())
            # backward
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

            # for X, y in zip(b_input, b_output):
            #     #print(f"X: {X}")
            #     #print(f"y: {y}")
            #     pred = self.model(X)
            #     #print(f"pred: {pred}")
            #     loss = self.loss_function(pred, y)
            #     #print(f"train loss: {loss}")
            #     loss_list.append(loss)
            #     # backward
            #     self.optimizer.zero_grad()
            #     loss.backward()
            #     self.optimizer.step()
        # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        self.model.eval()
        loss_list = []

        with torch.no_grad():
            # TODO 4-2):
            for b_input, b_output in self.val_dataloader:
                pred = self.model(b_input)
                loss = self.loss_function(pred, b_output)
                loss_list.append(loss.item())
            # end of TODO
        epoch_loss = np.mean(loss_list)
        self.loss["val"].append(epoch_loss)


    def save_model(self):
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

In [67]:
# Optional: you can add or change the lr_scheduler; pass lr_scheduler=scheduler to Train_CBOW to use it
scheduler = LambdaLR(optimizer, lr_lambda = lambda epoch: 0.95 ** epoch)
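
As a rough illustration of what this schedule does, the sketch below lists the learning rate per epoch in plain Python. It relies on LambdaLR multiplying the initial learning rate by `lr_lambda(epoch)`, with the epoch counter starting at 0 and advancing once per `step()` call; `lambda_lr` is a hypothetical helper, not a PyTorch function.

```python
def lambda_lr(base_lr, lr_lambda, epochs):
    """Hypothetical helper: learning rate used in each epoch under LambdaLR."""
    # LambdaLR sets lr = base_lr * lr_lambda(epoch), with epoch counted from 0.
    return [base_lr * lr_lambda(epoch) for epoch in range(epochs)]

toy_lrs = lambda_lr(base_lr=0.001, lr_lambda=lambda epoch: 0.95 ** epoch, epochs=50)
```

With the notebook's settings, the rate starts at 0.001 and decays by 5% per epoch, ending below 1e-4 in the last epoch.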

In [69]:
trainer = Train_CBOW(
    model=model,
    epochs=epochs,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    loss_function=loss_function,
    optimizer=optimizer,
    lr_scheduler=None,
    device=device,
    model_dir=result_dir,
)

trainer.train()
print("Training finished.")


100%|██████████| 383/383 [00:10<00:00, 36.30it/s]


Epoch: 1/50, Train Loss=4.64625, Val Loss=4.69610


100%|██████████| 383/383 [00:09<00:00, 41.12it/s]


Epoch: 2/50, Train Loss=4.59170, Val Loss=4.66492


100%|██████████| 383/383 [00:09<00:00, 38.61it/s]


Epoch: 3/50, Train Loss=4.54616, Val Loss=4.64343


100%|██████████| 383/383 [00:10<00:00, 35.77it/s]


Epoch: 4/50, Train Loss=4.50738, Val Loss=4.62399


100%|██████████| 383/383 [00:11<00:00, 33.95it/s]


Epoch: 5/50, Train Loss=4.47179, Val Loss=4.60493


100%|██████████| 383/383 [00:09<00:00, 40.08it/s]


Epoch: 6/50, Train Loss=4.44104, Val Loss=4.58757


100%|██████████| 383/383 [00:09<00:00, 38.58it/s]


Epoch: 7/50, Train Loss=4.41329, Val Loss=4.58004


100%|██████████| 383/383 [00:10<00:00, 37.43it/s]


Epoch: 8/50, Train Loss=4.38715, Val Loss=4.57182


100%|██████████| 383/383 [00:09<00:00, 38.40it/s]


Epoch: 9/50, Train Loss=4.36312, Val Loss=4.56348


100%|██████████| 383/383 [00:09<00:00, 40.95it/s]


Epoch: 10/50, Train Loss=4.34074, Val Loss=4.55489


100%|██████████| 383/383 [00:10<00:00, 37.78it/s]


Epoch: 11/50, Train Loss=4.32033, Val Loss=4.55946


100%|██████████| 383/383 [00:10<00:00, 37.95it/s]


Epoch: 12/50, Train Loss=4.30077, Val Loss=4.54949


100%|██████████| 383/383 [00:09<00:00, 38.87it/s]


Epoch: 13/50, Train Loss=4.28225, Val Loss=4.54711


100%|██████████| 383/383 [00:09<00:00, 40.09it/s]


Epoch: 14/50, Train Loss=4.26567, Val Loss=4.54216


100%|██████████| 383/383 [00:10<00:00, 36.62it/s]


Epoch: 15/50, Train Loss=4.24891, Val Loss=4.53720


100%|██████████| 383/383 [00:10<00:00, 37.72it/s]


Epoch: 16/50, Train Loss=4.23334, Val Loss=4.53500


100%|██████████| 383/383 [00:10<00:00, 35.28it/s]


Epoch: 17/50, Train Loss=4.21876, Val Loss=4.53452


100%|██████████| 383/383 [00:10<00:00, 36.08it/s]


Epoch: 18/50, Train Loss=4.20350, Val Loss=4.51901


100%|██████████| 383/383 [00:11<00:00, 32.18it/s]


Epoch: 19/50, Train Loss=4.19101, Val Loss=4.52959


100%|██████████| 383/383 [00:10<00:00, 38.19it/s]


Epoch: 20/50, Train Loss=4.17801, Val Loss=4.53796


100%|██████████| 383/383 [00:10<00:00, 37.74it/s]


Epoch: 21/50, Train Loss=4.16603, Val Loss=4.52527


100%|██████████| 383/383 [00:11<00:00, 32.37it/s]


Epoch: 22/50, Train Loss=4.15182, Val Loss=4.52549


100%|██████████| 383/383 [00:09<00:00, 38.66it/s]


Epoch: 23/50, Train Loss=4.14190, Val Loss=4.53145


100%|██████████| 383/383 [00:09<00:00, 39.30it/s]


Epoch: 24/50, Train Loss=4.12993, Val Loss=4.52866


100%|██████████| 383/383 [00:10<00:00, 35.67it/s]


Epoch: 25/50, Train Loss=4.11929, Val Loss=4.53191


100%|██████████| 383/383 [00:10<00:00, 35.80it/s]


Epoch: 26/50, Train Loss=4.10853, Val Loss=4.53052


100%|██████████| 383/383 [00:14<00:00, 27.04it/s]


Epoch: 27/50, Train Loss=4.09753, Val Loss=4.52968


100%|██████████| 383/383 [00:10<00:00, 37.84it/s]


Epoch: 28/50, Train Loss=4.08837, Val Loss=4.53052


100%|██████████| 383/383 [00:09<00:00, 39.23it/s]


Epoch: 29/50, Train Loss=4.07831, Val Loss=4.53084


100%|██████████| 383/383 [00:10<00:00, 36.18it/s]


Epoch: 30/50, Train Loss=4.06977, Val Loss=4.52953


100%|██████████| 383/383 [00:10<00:00, 35.68it/s]


Epoch: 31/50, Train Loss=4.06011, Val Loss=4.53133


100%|██████████| 383/383 [00:09<00:00, 39.25it/s]


Epoch: 32/50, Train Loss=4.05223, Val Loss=4.54148


100%|██████████| 383/383 [00:09<00:00, 39.96it/s]


Epoch: 33/50, Train Loss=4.04227, Val Loss=4.54175


100%|██████████| 383/383 [00:10<00:00, 37.90it/s]


Epoch: 34/50, Train Loss=4.03459, Val Loss=4.54366


100%|██████████| 383/383 [00:10<00:00, 37.82it/s]


Epoch: 35/50, Train Loss=4.02590, Val Loss=4.53779


100%|██████████| 383/383 [00:09<00:00, 40.33it/s]


Epoch: 36/50, Train Loss=4.01919, Val Loss=4.54041


100%|██████████| 383/383 [00:10<00:00, 36.81it/s]


Epoch: 37/50, Train Loss=4.01061, Val Loss=4.54820


100%|██████████| 383/383 [00:10<00:00, 35.13it/s]


Epoch: 38/50, Train Loss=4.00313, Val Loss=4.54388


100%|██████████| 383/383 [00:11<00:00, 33.64it/s]


Epoch: 39/50, Train Loss=3.99553, Val Loss=4.54829


100%|██████████| 383/383 [00:12<00:00, 29.91it/s]


Epoch: 40/50, Train Loss=3.98846, Val Loss=4.55051


100%|██████████| 383/383 [00:10<00:00, 35.92it/s]


Epoch: 41/50, Train Loss=3.98123, Val Loss=4.55832


100%|██████████| 383/383 [00:09<00:00, 40.02it/s]


Epoch: 42/50, Train Loss=3.97489, Val Loss=4.56517


100%|██████████| 383/383 [00:10<00:00, 37.40it/s]


Epoch: 43/50, Train Loss=3.96854, Val Loss=4.55796


100%|██████████| 383/383 [00:10<00:00, 38.00it/s]


Epoch: 44/50, Train Loss=3.96111, Val Loss=4.55676


100%|██████████| 383/383 [00:09<00:00, 40.29it/s]


Epoch: 45/50, Train Loss=3.95559, Val Loss=4.56282


100%|██████████| 383/383 [00:09<00:00, 39.55it/s]


Epoch: 46/50, Train Loss=3.94968, Val Loss=4.56204


100%|██████████| 383/383 [00:10<00:00, 37.76it/s]


Epoch: 47/50, Train Loss=3.94344, Val Loss=4.56119


100%|██████████| 383/383 [00:10<00:00, 38.13it/s]


Epoch: 48/50, Train Loss=3.93832, Val Loss=4.57399


100%|██████████| 383/383 [00:15<00:00, 24.56it/s]


Epoch: 49/50, Train Loss=3.93088, Val Loss=4.57215


100%|██████████| 383/383 [00:09<00:00, 40.72it/s]


Epoch: 50/50, Train Loss=3.92646, Val Loss=4.57333
Training finished.
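
In the log above, the training loss falls every epoch, while the validation loss reaches its lowest value (4.51901) at epoch 18 and drifts up to 4.57333 by epoch 50, which suggests the model starts to overfit in later epochs. A minimal sketch for locating the best epoch from a list of validation losses follows; `best_epoch` is a hypothetical helper, and in practice the full list would come from `trainer.loss["val"]`.

```python
def best_epoch(val_losses):
    """Hypothetical helper: 1-based position and value of the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]

# Validation losses for epochs 16-20, copied from the log above.
logged_val = [4.53500, 4.53452, 4.51901, 4.52959, 4.53796]
offset_epoch, lowest = best_epoch(logged_val)
epoch_of_minimum = offset_epoch + 15  # the slice starts at epoch 16
```

One possible use is to reduce `epochs` or to save the model whenever the validation loss improves, rather than keeping only the final weights.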


In [70]:
# save model
trainer.save_model()
trainer.save_loss()

vocab_path = os.path.join(result_dir, "vocab.pt")
torch.save(vocab, vocab_path)

### Result
Let's run inference with the trained word embeddings and visualize them.

In [71]:
import pandas as pd
import sys

from sklearn.manifold import TSNE
import plotly.graph_objects as go

sys.path.append("../")

In [72]:
result_dir

'/content/drive/MyDrive/course_ai_hw5/weights/'

In [73]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# reload saved model and vocab
model = torch.load(os.path.join(result_dir,"model.pt"), map_location=device)
vocab = torch.load(os.path.join(result_dir,"vocab.pt"))

# embedding is model's first layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape



(4122, 300)

### Make t-SNE graph of trained embedding and color numeric values

In [74]:
embeddings_df = pd.DataFrame(embeddings_norm)
fig = go.Figure()
# TODO 5-1) : make a 2-d t-SNE plot of the whole vocab and color only the numeric tokens (color all others black)
tsne = TSNE(n_components=2)
words_embedded = tsne.fit_transform(embeddings_df)
print(words_embedded[0])

[-38.123863  17.07595 ]


In [75]:
numeric_values = [word for word in vocab.get_stoi() if word.isdigit()]
numeric_indices = [vocab[word] for word in numeric_values]
non_numeric_indices = [i for i in range(len(vocab.get_stoi())) if i not in numeric_indices]
print(len(numeric_indices))
print(len(non_numeric_indices))

216
3906


In [76]:
fig.add_trace(go.Scatter(
    x=words_embedded[numeric_indices, 0],
    y=words_embedded[numeric_indices, 1],
    mode='markers',
    marker=dict(
        color="blue"
    )
))
fig.add_trace(go.Scatter(
    x=words_embedded[non_numeric_indices, 0],
    y=words_embedded[non_numeric_indices, 1],
    mode='markers',
    marker=dict(
        color="black"
    )
))


fig.show()

### Find top N similar words


In [92]:
def find_top_similar(word: str, vocab, embeddings_norm, topN: int = 10):
    # TODO 5-2) : make a function returning the top N similar words and their similarity scores
    topN_dict = {}
    cos_list = []
    cos = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    if word not in vocab:
        print("Input word not in vocab")
        return None

    word_embedding = embeddings_norm[vocab[word]]

    for idx, embed in enumerate(embeddings_norm):
        cos_list.append((idx, cos(word_embedding, embed)))
    cos_list.sort(key=lambda x:x[1], reverse=True)

    for idx, similarity in cos_list[1:topN+1]:
        topN_dict[vocab.lookup_token(idx)] = similarity

    return topN_dict


In [93]:
for word, sim in find_top_similar("english", vocab, embeddings_norm).items():
    print("{}: {:.3f}".format(word, sim))


kannada: 0.325
bible: 0.310
celtic: 0.283
irish: 0.270
mole: 0.256
georgian: 0.253
sir: 0.223
scotland: 0.220
1962: 0.215
institution: 0.213
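
The ranking used by `find_top_similar` can be reproduced on a small example in plain Python. The sketch below uses toy 2-d vectors, which are assumptions and not trained embeddings; `cosine` and `top_similar` are hypothetical helpers.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_similar(word, vectors, top_n=2):
    """Hypothetical helper: rank the other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = {w: cosine(query, v) for w, v in vectors.items() if w != word}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy 2-d vectors (assumptions, not trained embeddings).
toy_vectors = {
    "one": [1.0, 0.0],
    "two": [0.9, 0.1],
    "cat": [0.0, 1.0],
    "dog": [0.1, 0.9],
}
toy_neighbours = top_similar("one", toy_vectors)
```

This variant excludes the query word by name instead of dropping the first entry of the sorted list, so the result does not depend on the query sorting ahead of any word that ties with it.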


### Result Report

Save the Colab results and submit the notebook with your trained model and vocab file. Double-check that the submitted notebook contains the outputs.

You can change the CBOW model parameters, training parameters, and other details if you want.