# Homework 4: Word Embedding

In this exercise, you will work on the skip-gram neural network architecture for Word2Vec. You will be using PyTorch Lightning to train your model.

You must complete the following tasks:
1. Read/clean text files
2. Indexing (Assign a number to each word)
3. Create skip-grams (inputs for your model)
4. Create the skip-gram neural network model
5. Visualization
6. Evaluation (with and without pre-trained embeddings): classify topics into 4 categories
    


In [1]:
import numpy as np
import pandas as pd
import math
import glob
import re
import random
import collections
import os
import sys

import torch
import torchtext
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

random.seed(42)

# Step 1: Read/clean text files

The given code can be used to process the pre-tokenized text file from the Wikipedia corpus. In your homework, you must replace those text files with raw text files and use your own tokenizer to process them.
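
Since the homework asks for your own tokenizer, here is a minimal sketch of dictionary-based longest matching, a common baseline for text written without spaces between words. The function name `longest_match_tokenize` and the tiny dictionary are illustrative assumptions, not part of the provided code; a real tokenizer would need a full lexicon and better handling of unknown words.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    # longest dictionary entry bounds how far ahead we need to look
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace only separates tokens
            continue
        match = None
        # try the longest candidate first, then shrink
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        tokens.append(match)
        i += len(match)
    return tokens


# hypothetical toy dictionary; "วิกิ" is a prefix of "วิกิพีเดีย"
toy_dictionary = {"หน้า", "หลัก", "วิกิ", "วิกิพีเดีย"}
print(longest_match_tokenize("หน้าหลักวิกิพีเดีย", toy_dictionary))
```

Trying the longest candidate first is what makes the sketch prefer `วิกิพีเดีย` over its prefix `วิกิ`; greedy matching can still mis-segment ambiguous text.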

In [2]:
!wget https://www.dropbox.com/s/eexden7246sgfzf/BEST-TrainingSet.zip
!wget https://www.dropbox.com/s/n87fiy25f2yc3gt/wiki.zip
!unzip -q wiki.zip
!unzip -q BEST-TrainingSet.zip


In [3]:
#Step 1: read the wikipedia text file
with open("wiki/thwiki_chk.txt", encoding="utf-8") as f:
    #the delimiter is one or more whitespace characters
    input_text = re.compile(r"\s+").split(f.read()) 
    #exclude an empty string from our input
    input_text = [word for word in input_text if word != ''] 

In [4]:
tokens = input_text
print(tokens[:10])
print("total word count:", len(tokens))

['หน้า', 'หลัก', 'วิกิพีเดีย', 'ดำเนินการ', 'โดย', 'มูลนิธิ', 'วิกิ', 'มีเดีย', 'องค์กร', 'ไม่']
total word count: 36349066


In [5]:
! pip install wandb


# Step 2: Indexing (Assign a number to each word)




## <font color='violet'>TODO 1:</font>
<font color='violet'>“UNK” is often used to represent an unknown word (a word that does not exist in your dictionary/training set). You can also use this token to represent rare words. Read this [link](https://pytorch.org/text/stable/vocab.html#build-vocab-from-iterator) to learn how to add special tokens and set a frequency threshold for rare words in torchtext. You will also need to add a special 'PAD' token (you will need it later).

Then create a torchtext vocabulary and encode (index) the dataset with your own vocabulary.</font>
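
To make the roles of the special tokens and the frequency threshold concrete, here is a small plain-Python sketch of the same idea. It is not torchtext's implementation; `build_vocab` and `encode` are hypothetical helpers, and the toy corpus is made up for illustration.

```python
from collections import Counter


def build_vocab(token_lists, specials=("UNK", "PAD"), min_freq=1):
    """Return (stoi, itos) with the special tokens first, then frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # most frequent tokens first; ties broken alphabetically for determinism
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def encode(tokens, stoi, unk_token="UNK"):
    """Map tokens to indices; rare or unseen tokens fall back to the UNK index."""
    unk = stoi[unk_token]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [["a", "b", "a"], ["a", "c", "b"]]
stoi, itos = build_vocab(corpus, min_freq=2)
print(itos)                             # ['UNK', 'PAD', 'a', 'b']
print(encode(["a", "c", "zzz"], stoi))  # [2, 0, 0]
```

With `min_freq=2`, the token `c` (seen once) is dropped from the vocabulary, so both `c` and the unseen `zzz` are encoded as the `UNK` index.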


## Use your knowledge from this section to answer question 1 on MyCourseville

In [7]:
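#create vocab
def yield_tokens(file_path):
    # lazily yield one list of tokens per line so the whole file is not held in memory
    with open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()

filepath = "wiki/thwiki_chk.txt"
# specials are inserted first, so 'UNK' gets index 0 and 'PAD' gets index 1
# min_freq=1 keeps every token; raise it to map rare words to 'UNK'
vocab = torchtext.vocab.build_vocab_from_iterator(yield_tokens(filepath), specials=['UNK', 'PAD'], min_freq=1)
# out-of-vocabulary tokens should map to 'UNK' (index 0), not to 'PAD' (index 1)
vocab.set_default_index(vocab['UNK'])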

In [8]:
#encode the tokens
encoded_dataset = np.array([vocab[token] for token in tokens])

In [9]:
vocab.lookup_tokens(encoded_dataset[:20])

['หน้า',
 'หลัก',
 'วิกิพีเดีย',
 'ดำเนินการ',
 'โดย',
 'มูลนิธิ',
 'วิกิ',
 'มีเดีย',
 'องค์กร',
 'ไม่',
 'แสวง',
 'ผล',
 'กำไร',
 'ผู้',
 'ดำเนินการ',
 'อีก',
 'หลาย',
 'ได้แก่',
 '__NOEDITSECTION__',
 'ดาราศาสตร์']

# Step 3: Create skip-grams (inputs for your model)
The cell below contains code for generating a sampling table and generating the input for our skipgram model.






## <font color='violet'>TODO 2 & Question 2+3 On MyCourseville:</font>
<font color='violet'>Look through the code and answer questions 2 and 3 on MyCourseville</font>
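
Before reading the full function, it may help to see the windowing idea on its own. The sketch below only produces the positive (label 1) pairs and leaves out subsampling, negative sampling, and shuffling; `positive_pairs` is a hypothetical helper, not part of the provided code.

```python
def positive_pairs(sequence, window_size=1):
    """Pair every word with each neighbour at most `window_size` positions away."""
    pairs = []
    for i, center in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, sequence[j]))
    return pairs


print(positive_pairs([10, 20, 30], window_size=1))
# [(10, 20), (20, 10), (20, 30), (30, 20)]
```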

In [10]:
# SOURCE : https://github.com/keras-team/keras/blob/v2.11.0/keras/preprocessing/sequence.py

def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.
    Used for generating the `sampling_table` argument for `skipgrams`.
    `sampling_table[i]` is the probability of sampling
    the i-th most common word in a dataset
    (more common words should be sampled less frequently, for balance).
    The sampling probabilities are generated according
    to the sampling distribution used in word2vec:
    ```
    p(word) = (min(1, sqrt(word_frequency / sampling_factor) /
        (word_frequency / sampling_factor)))
    ```
    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
    `frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))`
    where `gamma` is the Euler-Mascheroni constant.
    Args:
        size: Int, number of possible words to sample.
        sampling_factor: The sampling factor in the word2vec formula.
    Returns:
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1.0, f / np.sqrt(f))

def skipgrams(
    sequence,
    vocabulary_size,
    window_size=4,
    negative_samples=1.0,
    shuffle=True,
    categorical=False,
    sampling_table=None,
    seed=None,
):
    """Generates skipgram word pairs.
    This function transforms a sequence of word indexes (list of integers)
    into tuples of words of the form:
    - (word, word in the same window), with label 1 (positive samples).
    - (word, random word from the vocabulary), with label 0 (negative samples).
    Read more about Skipgram in this gnomic paper by Mikolov et al.:
    [Efficient Estimation of Word Representations in
    Vector Space](http://arxiv.org/pdf/1301.3781v3.pdf)
    Args:
        sequence: A word sequence (sentence), encoded as a list
            of word indices (integers). If using a `sampling_table`,
            word indices are expected to match the rank
            of the words in a reference dataset (e.g. 10 would encode
            the 10-th most frequently occurring token).
            Note that index 0 is expected to be a non-word and will be skipped.
        vocabulary_size: Int, maximum possible word index + 1
        window_size: Int, size of sampling windows (technically half-window).
            The window of a word `w_i` will be
            `[i - window_size, i + window_size+1]`.
        negative_samples: Float >= 0. 0 for no negative (i.e. random) samples.
            1 for same number as positive samples.
        shuffle: Whether to shuffle the word couples before returning them.
        categorical: bool. if False, labels will be
            integers (eg. `[0, 1, 1 .. ]`),
            if `True`, labels will be categorical, e.g.
            `[[1,0],[0,1],[0,1] .. ]`.
        sampling_table: 1D array of size `vocabulary_size` where the entry i
            encodes the probability to sample a word of rank i.
        seed: Random seed.
    Returns:
        couples, labels: where `couples` are int pairs and
            `labels` are either 0 or 1.
    Note:
        By convention, index 0 in the vocabulary is
        a non-word and will be skipped.
    """
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        # Note that this negative sampling implementation is not very efficient.
        # Optionally, you can use torch.multinomial for negative sampling instead to speed up your training.
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [
            [words[i % len(words)], random.randint(1, vocabulary_size - 1)]
            for i in range(num_negative_samples)
        ]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            seed = random.randint(0, int(10e6))  # randint expects integer bounds
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)

    return couples, labels
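
The subsampling formula in the `make_sampling_table` docstring simplifies to `min(1, sqrt(sampling_factor / word_frequency))`, because `sqrt(x) / x = 1 / sqrt(x)`. A minimal sketch of that formula in plain Python follows; `keep_probability` is a hypothetical helper that takes a relative word frequency directly instead of estimating it from the rank as the function above does.

```python
import math


def keep_probability(word_frequency, sampling_factor=1e-5):
    """Probability of keeping a word whose relative corpus frequency is `word_frequency`."""
    ratio = word_frequency / sampling_factor
    return min(1.0, math.sqrt(ratio) / ratio)


for freq in (1e-2, 1e-3, 1e-5, 1e-7):
    print(f"frequency={freq:g}  keep probability={keep_probability(freq):.4f}")
```

Words that are more frequent than `sampling_factor` are kept with a probability below 1, and the probability keeps shrinking as the frequency grows; words rarer than the threshold are always kept.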

In [11]:
# Step 3: Create data samples
vocab_size = len(vocab)
skip_window = 1       # How many words to consider left and right.

sample_set= encoded_dataset[:10]
sampling_table = make_sampling_table(vocab_size)
couples, labels = skipgrams(sample_set, vocab_size, window_size=skip_window, sampling_table=sampling_table)
word_target, word_context = zip(*couples)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(couples[:8], labels[:8])
for i in range(8):
    print(vocab.lookup_token(couples[i][0]),vocab.lookup_token(couples[i][1]))

[[209, 2454], [3126, 471030], [2454, 209], [3126, 682], [25, 617890], [209, 230], [209, 291705], [3126, 681454]] [1, 0, 1, 1, 0, 1, 0, 0]
หลัก วิกิพีเดีย
มีเดีย ซึ่งปรียานุช
วิกิพีเดีย หลัก
มีเดีย องค์กร
ไม่ อ้างกิง
หลัก หน้า
หลัก โรเมน
มีเดีย โป๊กเกอร์เฟซ
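
The negative pairs printed above (label 0) use a context index drawn uniformly with `random.randint`. The original word2vec work instead draws negatives from the unigram distribution raised to the power 0.75. The sketch below shows that alternative in plain Python; it is a different technique from the one in `skipgrams`, and `make_negative_sampler` is a hypothetical helper.

```python
import random
from collections import Counter


def make_negative_sampler(encoded_tokens, power=0.75, seed=0):
    """Return a function that samples word ids proportionally to count ** power."""
    counts = Counter(encoded_tokens)
    ids = sorted(counts)
    weights = [counts[i] ** power for i in ids]
    rng = random.Random(seed)  # private generator so the global seed is untouched

    def sample(k):
        return rng.choices(ids, weights=weights, k=k)

    return sample


# toy encoded corpus: id 5 appears four times, id 7 twice, id 9 once
sample_negatives = make_negative_sampler([5, 5, 5, 5, 7, 7, 9])
print(sample_negatives(10))
```

Raising counts to a power below 1 flattens the distribution, so frequent words are still sampled most often but rare words appear more than their raw frequency would suggest.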


# Step 4: create the skip-gram model


## <font color='violet'>TODO 3:</font>
 <font color='violet'> Complete the code to create a skip-gram model. You will need to:
 1. write the forward pass
 2. choose a proper loss function
 3. calculate loss 
 </font>
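
As a rough guide to the three steps, the sketch below works through one (center, context) pair in plain Python: the score is the dot product of the two word vectors, and a binary cross-entropy on that raw score (logit) compares it with the 0/1 label. The vectors are made-up numbers and the helper names are hypothetical; in the model itself this is done on batched tensors.

```python
import math


def pair_logit(center_vec, context_vec):
    """Dot product of the center and context vectors."""
    return sum(w * c for w, c in zip(center_vec, context_vec))


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit.

    Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


center = [0.5, -1.0, 2.0]
context = [1.0, 0.5, 0.25]
logit = pair_logit(center, context)  # 0.5 - 0.5 + 0.5 = 0.5
print("logit:", logit)
print("loss if the pair is positive:", bce_with_logits(logit, 1))
print("loss if the pair is negative:", bce_with_logits(logit, 0))
```

A positive logit means the model currently believes the pair is a true context pair, so the loss is lower for label 1 than for label 0.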

In [12]:
!pip install pytorch-lightning
# in case you want to train your own skipgram model, you can use pytorch 2.0 for almost 2x speed up. Don't forget to restart the runtime after installation.
# !pip3 install numpy --pre torch[dynamo] --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117
# !pip3 install torchtext==0.15.0.dev20230102+cpu --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu117

Successfully installed lightning-utilities-0.6.0.post0 pytorch-lightning-1.9.0 torchmetrics-0.11.1


In [13]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, IterableDataset
import pytorch_lightning as pl
from pytorch_lightning import Trainer

In [63]:
class SkipGram(pl.LightningModule):
    def __init__(self):
                
        super().__init__()
        self.embedddings_dim = 32 
        self.learning_rate = 0.003
        # self.criterion = nn.NLLLoss()
        self.criterion = nn.BCEWithLogitsLoss(reduction = "mean")

        self.embedding = nn.Embedding(len(vocab), self.embedddings_dim)
        
        nn.init.uniform_(self.embedding.weight)

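    def forward(self, input_ids_w, input_ids_c):
        # look up the center (w) and context (c) word vectors; both use the same embedding table
        embed_w = self.embedding(input_ids_w)
        embed_c = self.embedding(input_ids_c)
        # the dot product of each (center, context) pair gives one logit per pair
        output = torch.sum(torch.mul(embed_w, embed_c), dim=1)
        # return raw logits: BCEWithLogitsLoss applies the sigmoid internally
        return output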

    def training_step(self, batch, batch_idx):
        input_ids_w = batch["input_ids_w"]
        input_ids_c = batch["input_ids_c"]
        output = self.forward(input_ids_w, input_ids_c)
        loss = self.criterion(output, batch["labels"])

        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [37]:
class SkipGramDataset(Dataset):
  def __init__(self, dataset):
    # we reshape our dataset into chunks of 2500 words to prevent memory overflow
    # higher number can probably be used to make training run a bit faster
    self.dataset= dataset[:len(dataset) - (len(dataset)%2500)].reshape(-1,2500)

  def __getitem__(self, idx):
    return self.dataset[idx]
  def __len__(self):
    return len(self.dataset)

## TODO 3.1
Complete the collate_fn function

The function takes the outputs of a Dataset object (the outputs of the `__getitem__` function) and compiles them into the format that is passed to the model.

As an illustration
(using the examples from this notebook):

- Say we use a batch size of 2. 
- Two data points will be randomly selected from the `SkipGramDataset` dataset via the `__getitem__` method.
- The data is then passed to the `collate_fn` method.
- The method outputs a dict with `input_ids_w`, `input_ids_c`, and `labels`.
- The dict is finally passed to the training (or possibly testing) step in the model.


In [64]:
class SkipGramDataModule(pl.LightningDataModule):

  def __init__(self, batch_size, num_workers=0):
      super().__init__()
      self.batch_size = batch_size
      self.num_workers = num_workers
 

  def setup(self, stage: str):
    pass

  def collate_fn(self, batch):
    #note that we use batch_size of 1 because it is already a chunk of 2500 words
    #the input batch will look like this: [[1, 2, 3, ..., 2500]]
    #if the batch size is 2 then it will look like this: [[1, 2, 3, ..., 2500], [1, 2, 3, ..., 2500]]
    
    #input_ids_w: a tensor of token ids of center words with shape (num_words)
    #input_ids_c: a tensor of token ids of context words with shape (num_words)
    #labels: a float tensor of 0/1 with shape (num_words), as expected by BCEWithLogitsLoss
    couples, labels = skipgrams(batch[0], len(vocab), window_size=1, sampling_table=sampling_table)
    input_ids_w, input_ids_c = zip(*couples)
    input_ids_w = torch.tensor(input_ids_w, dtype=torch.int32)
    input_ids_c = torch.tensor(input_ids_c, dtype=torch.int32)
    labels = torch.tensor(labels, dtype=torch.float32) #float32 matches the dtype of the model output
    return {"input_ids_w": input_ids_w, "input_ids_c": input_ids_c, "labels": labels}

  def train_dataloader(self):
      train_dataset = SkipGramDataset(encoded_dataset)
      train_loader = DataLoader(train_dataset, 
                                batch_size = self.batch_size, 
                                num_workers = self.num_workers,
                                collate_fn = self.collate_fn,
                                shuffle=True,
                                persistent_workers = self.num_workers > 0 #persistent workers require num_workers > 0
                                )
      
      return train_loader
    

In [65]:
skipgram_dm = SkipGramDataModule(batch_size = 1, num_workers=2)

In [66]:
skipgram_model = SkipGram()

In [67]:
from pytorch_lightning import Trainer

In [68]:
if "2.0.0" in torch.__version__: #just for speed up
  skipgram_model = torch.compile(skipgram_model)

In [27]:
!wandb login


In [69]:
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(project="hw4_word_rep")

## Question 4 MyCourseville

Paste your wandb run link. Don't forget to make it public.

In [70]:
trainer = Trainer(
    max_epochs=10,
    accelerator="gpu", devices=1, #replaces the deprecated gpus=1 argument
    accumulate_grad_batches=40, #we can accumulate gradient from multiple batches to simulate a very large batch size.
    precision=16, #just for speed up
    logger=wandb_logger
)

INFO:pytorch_lightning.utilities.rank_zero:Using 16bit None Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
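
The `accumulate_grad_batches=40` setting in the `Trainer` call above relies on the fact that averaging the gradients of several equally sized mini-batches gives the gradient of one large batch. The sketch below checks this on a toy one-parameter least-squares problem in plain Python; the data and `grad_mse` are made up for illustration and do not show Lightning's internals.

```python
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

# gradient computed on all four examples at once
full_batch_grad = grad_mse(w, data)

# gradient accumulated over two mini-batches of two examples each
mini_batches = [data[0:2], data[2:4]]
accumulated_grad = 0.0
for mini_batch in mini_batches:
    accumulated_grad += grad_mse(w, mini_batch) / len(mini_batches)

print(full_batch_grad, accumulated_grad)
```

The optimizer step is taken only after the last mini-batch, so memory use stays at the mini-batch level while the update behaves like a large-batch update.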


In [71]:
trainer.fit(skipgram_model, skipgram_dm)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name      | Type              | Params
------------------------------------------------
0 | criterion | BCEWithLogitsLoss | 0     
1 | embedding | Embedding         | 22.4 M
------------------------------------------------
22.4 M    Trainable params
0         Non-trainable params
22.4 M    Total params
44.887    Total estimated model params size (MB)


Training: 0it [00:00, ?it/s]

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7fc8745c04c0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.8/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

In [73]:
torch.save(skipgram_model.state_dict(), "myskipgram32_weights.bin")

# From here on, use the provided weights and vocab to answer the questions in MyCourseville.

In [74]:
# !wget https://www.dropbox.com/s/09xlk9qs03o6txl/skipgram32_weights.bin?dl=0 -O skipgram32_weights.bin
# !wget https://www.dropbox.com/s/bz2f88ugjv6ttxf/vocab.pickle?dl=0 -O vocab.pickle


In [88]:
weights = torch.load("skipgram32_weights.bin", map_location="cpu") #load onto the CPU so this also works without a GPU

In [89]:
final_embeddings = weights["embedding.weight"].clone().cpu().numpy()
print(final_embeddings)
print(final_embeddings.shape)

[[ 0.25884312  0.28340966 -0.36874643 ... -0.2384111   0.07282966
   0.17210421]
 [ 5.037373    1.602091    0.13676442 ...  4.7182174  -0.9798973
   1.4447719 ]
 [ 4.219983    1.8910178   0.17435275 ...  4.085619   -0.48468605
   1.9791931 ]
 ...
 [-0.2814809  -0.34960973 -0.6617193  ... -0.1898246  -0.8117112
  -0.4188458 ]
 [-0.6271959  -0.23369864 -0.35343444 ... -0.47042614 -0.4450108
  -0.09961639]
 [ 0.25751215 -0.2845752   0.17050262 ... -0.09321738  0.07451092
   0.15666145]]
(197826, 32)


# Step 5: Intrinsic Evaluation: Word Vector Analogies
## <font color='violet'>Homework Question 4: </font>
<font color='violet'> Read sections 2.1 and 2.3 in this [lecture note](http://web.stanford.edu/class/cs224n/readings/cs224n-2019-notes02-wordvecs2.pdf) and complete the code for word vector analogies. Use cosine similarity as the similarity measure (Euclidean distance and dot product might yield different answers). **Note that the answers you get may not make sense, since both our model and our dataset are very small.**
</font>


In case you are unsure if you implemented it correctly, you can download a proper Thai word2vec (called Thai2fit) and test your method out.


The Thai2fit output for the following words should be:
```
print(wordVecAnalogy('ผู้ชาย','พระราชา','ผู้หญิง'))
พระราชินี

print(wordVecAnalogy('อเมริกา','นิวยอร์ก','ประเทศไทย'))
กรุงเทพมหานคร

print(wordVecAnalogy('ครู','โรงเรียน','หมอ'))
โรงพยาบาล
```
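
To illustrate the method on something small enough to check by hand, the sketch below runs the analogy `a : b :: c : ?` on made-up 2-D vectors in plain Python. The vectors and the `analogy` helper are hypothetical and chosen so that the answer is obvious; real embeddings will rarely line up this cleanly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d (other than a, b, c) closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.1],
    "queen": [3.0, 1.1],
    "prince": [2.5, 0.0],
    "apple": [-1.0, 0.2],
}
print(analogy("man", "king", "woman", toy_vectors))  # queen
```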



In [104]:
import pickle
with open("vocab.pickle","rb") as f:
  vocab = pickle.load(f)
print(len(vocab))

197826


In [83]:
!wget https://github.com/PyThaiNLP/pythainlp-corpus/releases/download/thai2fit_wv-v0.1/thai2vec.bin -O thai2vec.bin
import gensim
vec = gensim.models.KeyedVectors.load_word2vec_format("thai2vec.bin", binary=True)
itos = {i:k for i,(k,v) in enumerate(vec.vocab.items())}
thai2fit = vec.vectors
stoi = {v:k for k,v in itos.items()}


## Now use the provided word vector (not the thai2fit one) to answer question 5 in MyCourseville.

In [128]:
def wordVecAnalogy(a, b, c):
    # look up the vocabulary index of each query word
    idx_a, idx_b, idx_c = vocab([a, b, c])

    # a is to b as c is to d, so the target vector is b - a + c
    target = final_embeddings[idx_b] - final_embeddings[idx_a] + final_embeddings[idx_c]

    # cosine similarity between the target and every word vector
    norms = np.linalg.norm(final_embeddings, axis=1) * np.linalg.norm(target)
    cos_sim = final_embeddings @ target / np.maximum(norms, 1e-12)

    # Note that if the most similar token (d) is either a, b, or c,
    # ignore it and use the next one.
    cos_sim[[idx_a, idx_b, idx_c]] = -np.inf

    return vocab.lookup_token(int(np.argmax(cos_sim)))

print(wordVecAnalogy('ผู้ชาย','พระราชา','ผู้หญิง'))
print(wordVecAnalogy('อเมริกา','นิวยอร์ก','ประเทศไทย'))
print(wordVecAnalogy('ครู','โรงเรียน','หมอ'))

แปรเปลี่ยน
แขวน
ออสเตรเลีย


# Step 6: Extrinsic Evaluation




## <font color='violet'>TODO 5:</font>
<font color='violet'>
Use the word embeddings from the skip-gram model as pre-trained weights in a classification model. Compare the result with that of the same classification model without the pre-trained weights.

The code for data processing steps and model training have been provided for you. All you need to do is **find out how to load the pretrained embedding into the model.**
</font>

In [129]:
all_news_filepath = glob.glob('BEST-TrainingSet/news/*.txt')
all_novel_filepath = glob.glob('BEST-TrainingSet/novel/*.txt')
all_article_filepath = glob.glob('BEST-TrainingSet/article/*.txt')
all_encyclopedia_filepath = glob.glob('BEST-TrainingSet/encyclopedia/*.txt')

In [130]:
#preparing data for the classification model
#In this homework, we will only use the first 2000 words in each text file
#any text file that has fewer than 2000 words will be padded
#reason: just to make this homework feasible under limited time and resources
max_length = 2000
random.seed(42)
def prep_data():
    input_text = list()
    for textfile_path in [all_news_filepath, all_novel_filepath, all_article_filepath, all_encyclopedia_filepath]:
        for input_file in textfile_path:
            with open(input_file, "r", encoding="utf-8") as f: #open file with name of "*.txt"
                text = re.sub(r'\|', ' ', f.read()) # replace separation symbol with white space
            text = re.sub(r'<\W?\w+>', '', text) # remove <NE> </NE> <AB> </AB> tags
            text = text.split() #split() method without an argument splits on whitespace
            indexed_text = vocab(text[:max_length]) #map raw word string to its index
            if 'news' in input_file:
                input_text.append([indexed_text,0])
            elif 'novel' in input_file:
                input_text.append([indexed_text,1])
            elif 'article' in input_file:
                input_text.append([indexed_text,2])
            elif 'encyclopedia' in input_file:
                input_text.append([indexed_text,3])
    random.shuffle(input_text)
    return input_text

input_data = prep_data()
train_data = input_data[:int(len(input_data)*0.6)]
val_data = input_data[int(len(input_data)*0.6):int(len(input_data)*0.8)]
test_data = input_data[int(len(input_data)*0.8):]

train_input = [data[0] for data in train_data]
train_target = [data[1] for data in train_data]

val_input = [data[0] for data in val_data]
val_target = [data[1] for data in val_data]

test_input = [data[0] for data in test_data]
test_target = [data[1] for data in test_data]

In [131]:
class ClassificationModel(pl.LightningModule):
    def __init__(self):
                
        super().__init__()
        self.embedddings_dim = 32 
        self.hidden_size = 64
        self.learning_rate = 0.005
        self.criterion = nn.CrossEntropyLoss() 

        self.embedding = nn.Embedding(len(vocab), self.embedddings_dim)
        self.gru = nn.GRU(input_size = self.embedddings_dim, 
                          hidden_size = self.hidden_size, 
                          num_layers = 2, 
                          batch_first = True)
        self.dropout = nn.Dropout(p=0.5)
        self.dense = nn.Linear(self.hidden_size,4)

    def forward(self, input_ids):
        embedded = self.embedding(input_ids)
        hidden, _ = self.gru(embedded)
        hidden = self.dropout(hidden[:,-1])
        logit = self.dense(hidden)
       
        return logit

    def training_step(self, batch, batch_idx):
         
        labels = batch.pop("labels")
  
        logit = self(**batch)
        loss = self.criterion(logit, labels)

        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):

        labels = batch.pop("labels")
  
        logit = self(**batch)
        loss = self.criterion(logit, labels)
        self.log("val_loss", loss, prog_bar=True)
        return loss
    
    def test_step(self, batch, batch_idx):

        labels = batch.pop("labels")
  
        logit = self(**batch)
        probs = F.softmax(logit, dim=-1)
        preds = torch.argmax(probs, dim=-1)
        acc = (labels == preds).float().mean()
        self.log("test_acc", acc)
        return acc

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

In [132]:
class ClassificationDataset(Dataset):
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __getitem__(self, idx):
        return (self.data[idx], self.labels[idx])
    
    def __len__(self):
        return len(self.data)

In [133]:
from torch.nn.utils.rnn import pad_sequence

In [134]:
class ClassificationDataModule(pl.LightningDataModule):

    def __init__(self, batch_size, num_workers=0):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers

    def setup(self, stage: str):
        pass

    def collate_fn(self, batch):
        data = [torch.tensor(b[0]) for b in batch]
        labels = [b[1] for b in batch]
        # Right-pad every sequence with index 0 up to the longest sequence in the batch.
        input_ids = pad_sequence(data, batch_first=True, padding_value=0)
        labels = torch.tensor(labels)

        return {"input_ids": input_ids, "labels": labels}

    def train_dataloader(self):
        train_dataset = ClassificationDataset(train_input, train_target)
        train_loader = DataLoader(train_dataset, 
                                batch_size = self.batch_size, 
                                num_workers = self.num_workers,
                                collate_fn = self.collate_fn,
                                shuffle=True,
                                # persistent_workers=True raises a ValueError when num_workers == 0
                                persistent_workers = self.num_workers > 0
                                )

        return train_loader
    
    def val_dataloader(self):
        val_dataset = ClassificationDataset(val_input, val_target)
        val_loader = DataLoader(val_dataset, 
                                batch_size = self.batch_size, 
                                num_workers = self.num_workers,
                                collate_fn = self.collate_fn,
                                shuffle=False,
                                # persistent_workers=True raises a ValueError when num_workers == 0
                                persistent_workers = self.num_workers > 0
                                )

        return val_loader
    
    def test_dataloader(self):
        test_dataset = ClassificationDataset(test_input, test_target)
        test_loader = DataLoader(test_dataset, 
                                batch_size = self.batch_size, 
                                num_workers = self.num_workers,
                                collate_fn = self.collate_fn,
                                shuffle=False,
                                # persistent_workers=True raises a ValueError when num_workers == 0
                                persistent_workers = self.num_workers > 0
                                )

        return test_loader
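
To make the output of `collate_fn` concrete, here is a minimal sketch on a toy batch. The token ids and labels are made up, and the sketch assumes that index 0 is reserved for padding in the vocabulary, which this part of the notebook does not show.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical toy batch of (token_ids, label) pairs with different lengths.
toy_batch = [([5, 8, 2], 1), ([7, 3], 0), ([4], 3)]

data = [torch.tensor(ids) for ids, _ in toy_batch]
labels = torch.tensor([label for _, label in toy_batch])

# Shorter sequences are right-padded with 0 up to the longest one in the batch.
input_ids = pad_sequence(data, batch_first=True, padding_value=0)

print(input_ids)  # one row per example, one column per time step
print(labels)
```

The resulting `input_ids` tensor has shape `(batch, max_len)`, which matches the `batch_first = True` setting of the GRU in `ClassificationModel`.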

In [135]:
cls_dm = ClassificationDataModule(batch_size = 32, num_workers=4)

## Without pretrained embedding

In [136]:
cls_model = ClassificationModel()

In [137]:
trainer = Trainer(
    max_epochs=10,
    accelerator='gpu', 
    devices=1,
    log_every_n_steps = 5,
    callbacks=[pl.callbacks.ModelCheckpoint(filename='best', monitor='val_loss')],
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [138]:
trainer.fit(cls_model,cls_dm)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | embedding | Embedding        | 6.3 M 
2 | gru       | GRU              | 43.8 K
3 | dropout   | Dropout          | 0     
4 | dense     | Linear           | 260   
-----------------------------------------------
6.4 M     Trainable params
0         Non-trainable params
6.4 M     Total params
25.498    Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]



Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [139]:
trainer.test(cls_model, cls_dm, ckpt_path='best')

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_0/checkpoints/best.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/lightning_logs/version_0/checkpoints/best.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.44117647409439087
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_acc': 0.44117647409439087}]

## With pretrained embedding



In [140]:
cls_model = ClassificationModel()

In [141]:
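# Load the pretrained skip-gram embedding weights into the classifier.
# from_pretrained freezes the weights by default (freeze=True).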
cls_model.embedding = nn.Embedding.from_pretrained(weights["embedding.weight"])
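
`nn.Embedding.from_pretrained` freezes the loaded weights by default, which is why the model summary for this run reports the embedding parameters as non-trainable. The sketch below illustrates the difference between the frozen default and `freeze=False`. The random `pretrained` tensor is only a stand-in for `weights["embedding.weight"]`, and its shape is arbitrary.

```python
import torch
import torch.nn as nn

# Stand-in for weights["embedding.weight"]; the shape here is arbitrary.
pretrained = torch.randn(10, 4)

# Default behaviour: the embedding is frozen and excluded from gradient updates.
frozen = nn.Embedding.from_pretrained(pretrained)

# With freeze=False the pretrained vectors are fine-tuned along with the rest of the model.
tunable = nn.Embedding.from_pretrained(pretrained, freeze=False)

print(frozen.weight.requires_grad)
print(tunable.weight.requires_grad)
```

Whether fine-tuning would help on this dataset is not something the notebook tests; the frozen variant is the one evaluated here.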

In [142]:
trainer = Trainer(
    max_epochs=10,
    accelerator='gpu', 
    devices=1,
    log_every_n_steps = 5,
    callbacks=[pl.callbacks.ModelCheckpoint(filename='best',monitor='val_loss')],
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [143]:
trainer.fit(cls_model,cls_dm)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name      | Type             | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0     
1 | embedding | Embedding        | 6.3 M 
2 | gru       | GRU              | 43.8 K
3 | dropout   | Dropout          | 0     
4 | dense     | Linear           | 260   
-----------------------------------------------
44.0 K    Trainable params
6.3 M     Non-trainable params
6.4 M     Total params
25.498    Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


In [144]:
trainer.test(cls_model, cls_dm, ckpt_path='best')

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at /content/lightning_logs/version_1/checkpoints/best.ckpt
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from checkpoint at /content/lightning_logs/version_1/checkpoints/best.ckpt


Testing: 0it [00:00, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.6274510025978088
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────


[{'test_acc': 0.6274510025978088}]