In [None]:
%pip install datasets


Note: you may need to restart the kernel to use updated packages.


In [2]:
import time
from datasets import load_dataset, Dataset

In [3]:
start_time = time.time()

dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

end_time = time.time()
print(f"Elapsed time to load dataset: {end_time - start_time:.4f} seconds")

Elapsed time to load dataset: 7.5930 seconds


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [5]:
train_dataset[:5]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic',
  'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."],
 'label': [1, 1, 1, 1, 1]}

# Part 1. Preparing Word Embeddings   
As the first step of building your model, you need to prepare the word embeddings to form the
input layer of your model. You are required to choose only from Word2vec or GloVe to initialize
your word embedding matrix. The word embedding matrix stores the pre-trained word vectors
(taken from Word2vec or GloVe), where each row corresponds to the vector for a specific word in the
vocabulary formed from your task dataset.
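
As a small illustration of that row-to-word correspondence, the sketch below builds an embedding matrix for a made-up three-word vocabulary. The vectors and the names `toy_pretrained`, `toy_vocab`, and `build_matrix` are hypothetical stand-ins (not real GloVe values), and plain lists are used in place of a NumPy array only to keep the example dependency-free.

```python
# Hypothetical 3-dimensional "pre-trained" vectors (not real GloVe values)
toy_pretrained = {
    "movie": [0.1, 0.2, 0.3],
    "good": [0.4, 0.5, 0.6],
    "bad": [-0.4, -0.5, -0.6],
}

# Vocabulary formed from a made-up task dataset
toy_vocab = {"good", "movie", "bad"}


def build_matrix(vocab, pretrained):
    """Return (word2idx, matrix) such that matrix[word2idx[w]] is the vector of w."""
    # Sorting makes the word-to-row assignment reproducible across runs
    word2idx = {word: idx for idx, word in enumerate(sorted(vocab))}
    matrix = [None] * len(word2idx)
    for word, idx in word2idx.items():
        matrix[idx] = list(pretrained[word])
    return word2idx, matrix


toy_word2idx, toy_matrix = build_matrix(toy_vocab, toy_pretrained)
print(toy_word2idx)                       # word -> row index
print(toy_matrix[toy_word2idx["good"]])   # row holding the vector for "good"
```

Looking a word up is then just an index into the matrix, which is what an embedding layer does for every token of the input.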


In [6]:
import json
import os
from collections import defaultdict

import numpy as np
import regex as re

UNK_TOKEN = "<UNK>"
EMBEDDING_DIM = 100  # GloVe embeddings usually come in 50, 100, 200, or 300 dimensions
SAVE_DIR = './result/'
VOCAB_PATH = os.path.join(SAVE_DIR, 'vocab.json')
EMBEDDING_MATRIX_PATH = os.path.join(SAVE_DIR, 'embedding_matrix.npy')
WORD2IDX_PATH = os.path.join(SAVE_DIR, 'word2idx.json')
IDX2WORD_PATH = os.path.join(SAVE_DIR, 'idx2word.json')

os.makedirs(SAVE_DIR, exist_ok=True)

## Preparing Vocab

In [7]:
train_string = train_dataset[4]["text"]
train_string

"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."

In [8]:
# Pattern is adapted from GPT2: https://github.com/huggingface/transformers/blob/4fb28703adc2b44ed66a44dd04740787010c5e11/src/transformers/models/gpt2/tokenization_gpt2.py#L167
pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d|\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
tokens_re = pattern.findall(train_string)
tokens_re = set(tokens_re)
tokens_re

{' ',
 "'s",
 "'t",
 ',',
 '.',
 'an',
 'and',
 'as',
 'doesn',
 'emerges',
 'feel',
 'honest',
 'issue',
 'it',
 'keenly',
 'like',
 'movie',
 'observed',
 'one',
 'rare',
 'so',
 'something',
 'that'}

In [9]:
%pip install nltk



Note: you may need to restart the kernel to use updated packages.


In [10]:
import nltk
nltk.download('punkt_tab')
tokens_nltk = nltk.word_tokenize(train_string)
tokens_nltk = set(tokens_nltk)
tokens_nltk

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/ngtzekean/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


{"'s",
 ',',
 '.',
 'an',
 'and',
 'as',
 'does',
 'emerges',
 'feel',
 'honest',
 'issue',
 'it',
 'keenly',
 'like',
 'movie',
 "n't",
 'observed',
 'one',
 'rare',
 'so',
 'something',
 'that'}

In [11]:
tokens_nltk - tokens_re

{'does', "n't"}

In [12]:
tokens_re - tokens_nltk

{' ', "'t", 'doesn'}

On this example, the NLTK tokenizer produces the cleaner split: it separates "doesn't" into `does` and `n't`,
whereas the regex pattern yields `doesn`, `'t`, and a stray whitespace token.
In light of that, we will use the NLTK tokenizer, which was introduced in the lectures, to tokenize the whole dataset.
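
The difference in contraction handling can be reproduced with a few regular expressions. The sketch below is only an approximation written for illustration: `GPT2_LIKE` replaces `\p{L}` and `\p{N}` with ASCII classes and drops whitespace tokens, and `treebank_like_tokenize` imitates the "n't" split with a hand-written rule. It is not NLTK's actual implementation, and all names in it are hypothetical.

```python
import re

# Stdlib-only approximation of the GPT-2-style pattern (ASCII classes
# stand in for \p{L} and \p{N}; whitespace tokens are dropped)
GPT2_LIKE = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d|[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]+")

# Simplified Treebank-style rule: treat "n't" as its own token
NT_SPLIT = re.compile(r"([A-Za-z]+)(n't)\b")
TREEBANK_LIKE = re.compile(r"n't|'s|[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]")


def gpt2_like_tokenize(text):
    """Split contractions at the apostrophe: doesn't -> doesn + 't."""
    return GPT2_LIKE.findall(text)


def treebank_like_tokenize(text):
    """Split negations before the n: doesn't -> does + n't."""
    # Insert a space before "n't" so the negation is matched as one token
    return TREEBANK_LIKE.findall(NT_SPLIT.sub(r"\1 \2", text))


sample = "it doesn't feel like one ."
print(gpt2_like_tokenize(sample))
print(treebank_like_tokenize(sample))
```

The second split keeps the verb `does` intact, which matters when tokens are later matched against a pre-trained vocabulary.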

In [13]:
def tokenize(dataset: Dataset) -> set:
    """Tokenize the text in the dataset using NLTK

    :param dataset: The dataset to tokenize
    :type dataset: Dataset
    :return: The set of tokens in the dataset
    :rtype: set
    """
    vocab = set()
    
    for example in dataset:
        tokens = nltk.word_tokenize(example['text'])
        vocab.update(tokens)
    
    print(f"Vocabulary Size: {len(vocab)}")

    with open(VOCAB_PATH, 'w', encoding='utf-8') as f:
        json.dump(list(vocab), f, ensure_ascii=False, indent=4)

    print(f"Vocabulary saved to {VOCAB_PATH}")
    return vocab

In [14]:
vocab = tokenize(train_dataset)

## Initialize Word Embedding Matrix with GloVe

In [15]:
def load_glove_vocab():
    """Load Glove vocab

    :return: GloVe vocab
    :rtype: Set
    """
    print("Loading GloVe vocab...")
    glove_vocab:set = set()
    # https://huggingface.co/datasets/SLU-CSCI4750/glove.6B.100d.txt
    dataset = load_dataset("SLU-CSCI4750/glove.6B.100d.txt")
    dataset = dataset['train']
    
    for example in dataset:
        word = example["text"].split()[0]
        glove_vocab.add(word)
    print("GloVe vocab loaded.")
    return glove_vocab

In [16]:
glove_vocab = load_glove_vocab()
print(f"Size of GloVe vocab: {len(glove_vocab)}")

oov_words = vocab - glove_vocab
print(f"Number of OOV Words: {len(oov_words)}")

In [17]:
def load_glove_embeddings() -> dict:
    """Load GloVe embeddings

    :return: GloVe embeddings
    :rtype: Dict
    """
    print("Loading GloVe embeddings...")
    glove_dict = {}
    word_embedding_glove = load_dataset("SLU-CSCI4750/glove.6B.100d.txt")
    word_embedding_glove = word_embedding_glove['train']
    
    for example in word_embedding_glove:
        split_line = example["text"].strip().split()
        word = split_line[0]
        vector = np.array(split_line[1:], dtype='float32')
        glove_dict[word] = vector

    print(f"Total GloVe words loaded: {len(glove_dict)}")
    return glove_dict

In [18]:
glove_dict = load_glove_embeddings()

# Collect words to be removed
missing_words = []
for word in vocab:
    if word not in glove_dict:
        missing_words.append(word)

# Remove missing words from vocab
for word in missing_words:
    vocab.remove(word)
        
print(f"Number of missing words: {len(missing_words)}")
print(f"The missing words are: {missing_words}")

# mapping of words to indices and vice versa
word2idx = {word: idx for idx, word in enumerate(sorted(vocab))}
idx2word = {idx: word for word, idx in word2idx.items()}

print("Building embedding matrix...")
vocab_size = len(word2idx)
print(f"Vocab size: {vocab_size}")
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

for word, idx in word2idx.items():
    embedding_matrix[idx] = glove_dict[word]

print()
print("Embedding matrix built successfully.")

np.save(EMBEDDING_MATRIX_PATH, embedding_matrix)
print(f"Embedding matrix saved as '{EMBEDDING_MATRIX_PATH}'.")

with open(WORD2IDX_PATH, 'w', encoding='utf-8') as f:
    json.dump(word2idx, f, ensure_ascii=False, indent=4)
print(f"Mapping 'word2idx' saved as '{WORD2IDX_PATH}'.")

with open(IDX2WORD_PATH, 'w', encoding='utf-8') as f:
    json.dump(idx2word, f, ensure_ascii=False, indent=4)
print(f"Mapping 'idx2word' saved as '{IDX2WORD_PATH}'.")


Loading GloVe embeddings...


Repo card metadata block was not found. Setting CardData to empty.


Total GloVe words loaded: 400000
Number of missing words: 1867
The missing words are: ["'artístico", 'kalesniko', 'old-movie', 'sugar-sweet', 'daneses', "'bowling", 'beseechingly', 'musicais', 'stuporously', 'pulpiness', 'gay-niche', "'under", 'clarke-williams', 'gender-bender-baller', 'good-bad', 'últimos', 'sermonize', 'cliche-drenched', 'sogginess', 'not-so-bright', 'cinema-besotted', 'flck', 'sock-you-in-the-eye', 'sugar-coating', 'jirí', 'crime-film', 'out-of-kilter', 'marine/legal', 'smashups', 'pseudo-educational', 'out-bad-act', 'blighter', 'mcklusky', 'glass-shattering', 'beat-charged', 'family-film', 'direção', "'no", 'multi-layers', 'resultan', 'obligada', 'coriat', "'wayne", "'sophisticated", "'some", 'torna-se', 'thekids', 'nebrida', 'everlyn', 'anti-hollywood', 'bug-eye', 'travel-agency', 'navel-gazing', 'over-indulgent', 'mid-film', 'socio-histo-political', 'runteldat', 'again-courage', 'well-meaningness', 'bling-bling', "'rare", "'worse", 'stortelling', 'idoosyncratic',

In [19]:
import numpy as np 

# Load the embedding matrix
embedding_matrix = np.load('result/embedding_matrix.npy')

# Display basic information
print("Shape of the embedding matrix: ", embedding_matrix.shape)
print("Data type of the elements: ", embedding_matrix.dtype)

# Display the first row
print("First row of the embedding matrix: ")
print(embedding_matrix[0])

Let's also define a class that wraps the loading and handling of the embedding matrix.
This will help with the downstream tasks as well.

Its core API loads the saved embeddings and returns the embedding for a given word.

In [20]:
import torch


class EmbeddingMatrix():
    def __init__(self) -> None:
        self.d = 0 
        self.v = 0
        self.pad_idx:int
        self.embedding_matrix:np.ndarray
        self.word2idx:dict
    @classmethod
    def load(cls) -> "EmbeddingMatrix":
        # load vectors from file
        embedding_matrix:np.ndarray = np.load(EMBEDDING_MATRIX_PATH)
        # set attributes
        em = cls()
        em.embedding_matrix = embedding_matrix
        
        with open(WORD2IDX_PATH, 'r', encoding='utf-8') as f:
            word2idx:dict = json.load(f)
            em.word2idx = word2idx
        em.v, em.d = embedding_matrix.shape
        return em
    @property
    def to_tensor(self) -> torch.Tensor:
        return torch.tensor(self.embedding_matrix, dtype=torch.float64)
    def add_padding(self) -> None:
        if "<PAD>" in self.word2idx:
            return
        padding = np.zeros((1, self.d), dtype='float32')
        self.embedding_matrix = np.vstack((self.embedding_matrix, padding))
        
        self.v += 1
        self.pad_idx = self.v - 1
        self.word2idx["<PAD>"] = self.pad_idx 
    @property
    def dimension(self) -> int:
        """Dimension of the embedding matrix
        
        :return: The dimension of the embedding matrix
        :rtype: int
        """
        return self.d
    @property
    def vocab_size(self) -> int:
        """Vocabulary size of the embedding matrix

        :return: The vocabulary size of the embedding matrix
        :rtype: int
        """
        return self.v
    @property
    def vocab(self) -> set[str]:
        """Vocabulary of the embedding matrix
        
        Set of words in the embedding matrix

        :return: The vocabulary of the embedding matrix
        :rtype: set[str]
        """
        return set(self.word2idx.keys())
    def __getitem__(self, word:str) -> np.ndarray:
        return self.embedding_matrix[self.word2idx[word]]
    def get_idx(self, word:str) -> "int | None":
        # Return None if the word is not in the vocab
        return self.word2idx.get(word)

# Question 1. Word Embedding
(a) What is the size of the vocabulary formed from your training data?   
The size of the vocabulary is `18030`. After removing the words that are OOV with respect to the
GloVe embeddings, the final size of the vocabulary is `16163`.

(b) We use OOV (out-of-vocabulary) to refer to words that appear in the training data but not in the Word2vec (or GloVe) dictionary. How many OOV words exist in your training data?    
`1867`
   
(c) The existence of OOV words is one of the well-known limitations of Word2vec (or GloVe).
Without using any transformer-based language models (e.g., BERT, GPT, T5), what do you
think is the best strategy to mitigate this limitation? Implement your solution in your source
code. Show the corresponding code snippet. 

Answer:

(1) Use an `<UNK>` token with a randomly initialized embedding, and map every OOV word to that token.

The code snippet below shows how the embedding matrix is filled:

```python
# Draw the <UNK> vector first so it exists before any OOV row copies it
unk_idx = word2idx[UNK_TOKEN]
embedding_matrix[unk_idx] = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))

for word in extended_vocab:
    idx = word2idx[word]
    if word == UNK_TOKEN:
        continue  # already initialized above
    if word in glove_dict:
        # word has a pre-trained GloVe vector
        embedding_matrix[idx] = glove_dict[word]
    else:
        # OOV word: share the <UNK> vector
        embedding_matrix[idx] = embedding_matrix[unk_idx]
```

This strategy gives every unknown word a usable representation. A word that was
never seen during training is mapped to the `<UNK>` token at lookup time, and a
training-vocabulary word that has no pre-trained vector shares the embedding of
the `<UNK>` token.
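
The lookup-time half of this strategy can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: `demo_word2idx` and `encode` are hypothetical names, and the tiny mapping stands in for the real `word2idx`.

```python
UNK_TOKEN = "<UNK>"

# Hypothetical toy mapping; the real word2idx is built from the training vocabulary
demo_word2idx = {UNK_TOKEN: 0, "a": 1, "good": 2, "movie": 3}


def encode(tokens, word2idx, unk_token=UNK_TOKEN):
    """Map tokens to row indices, sending unseen tokens to the <UNK> row."""
    unk_idx = word2idx[unk_token]
    return [word2idx.get(token, unk_idx) for token in tokens]


# "stuporously" is not in the toy mapping, so it falls back to index 0
print(encode(["a", "stuporously", "good", "movie"], demo_word2idx))
```

Because every token resolves to some row, the model never fails on unseen input, at the cost of treating all unknown words as identical.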

(2) Use subword embeddings. There are many kinds of static embeddings. fastText (Bojanowski et al., 2017), an extension of word2vec, addresses a problem that word2vec has no good way to deal with: unknown words, that is, words that appear in a test corpus but were unseen in the training corpus.

A related problem is word sparsity, such as in languages with rich morphology, where some of the many forms of each noun and verb may occur only rarely. fastText deals with these problems by using subword models, representing each word as itself plus a bag of constituent n-grams, with the special boundary symbols `<` and `>` added to each word.
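
To make the subword idea concrete, the sketch below extracts boundary-marked character n-grams from a word. It only illustrates the decomposition and is not the fastText library's API; `char_ngrams` is a hypothetical helper, and the default range of 3 to 6 characters is an assumption taken from the range reported by Bojanowski et al.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the boundary-marked word plus its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]  # the whole word is kept as its own unit
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            gram = marked[start:start + n]
            if gram != marked:  # avoid listing the whole word twice
                grams.append(gram)
    return grams


print(char_ngrams("where", n_min=3, n_max=3))

# An OOV word still shares n-grams with related words, so a vector can be
# composed for it from n-gram vectors learned on other words.
shared = set(char_ngrams("sogginess", 3, 3)) & set(char_ngrams("soggy", 3, 3))
print(sorted(shared))
```

In a subword model, the vector of a word is built from the vectors of these n-grams, so a word such as "sogginess" that is missing from the pre-trained dictionary can still receive a meaningful embedding.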