In [1]:
%pip install datasets

Note: you may need to restart the kernel to use updated packages.


DEPRECATION: Loading egg at c:\users\snorl\downloads\output\pygrads-3.0.b1\dir\lib\site-packages\pygrads-3.0b1-py3.10.egg is deprecated. pip 23.3 will enforce this behaviour change. A possible replacement is to use pip for package installation..

[notice] A new release of pip is available: 23.2.1 -> 24.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
import time
from datasets import load_dataset, Dataset



In [3]:
start_time = time.time()

dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

end_time = time.time()
print(f"Elapsed time to load dataset: {end_time - start_time:.4f} seconds")

Elapsed time to load dataset: 9.0530 seconds


In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1066
    })
})

In [5]:
train_dataset[:5]

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic',
  'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .',
  "emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."],
 'label': [1, 1, 1, 1, 1]}

# Part 1. Preparing Word Embeddings   
As the first step of building your model, you need to prepare the word embeddings to form the
input layer of your model. You are required to choose only from Word2vec or Glove to initialize
your word embedding matrix. The word embedding matrix stores the pre-trained word vectors
(taken from Word2vec or Glove) where each row corresponds to a vector for a specific word in the
vocabulary formed from your task dataset.


In [6]:
import json
import os
from collections import defaultdict

import numpy as np
import regex as re

UNK_TOKEN = "<UNK>"
EMBEDDING_DIM = 100  # GloVe 6B vectors are available in 50, 100, 200, and 300 dimensions
SAVE_DIR = './result/'
VOCAB_PATH = os.path.join(SAVE_DIR, 'vocab.json')
EMBEDDING_MATRIX_PATH = os.path.join(SAVE_DIR, 'embedding_matrix.npy')
WORD2IDX_PATH = os.path.join(SAVE_DIR, 'word2idx.json')
IDX2WORD_PATH = os.path.join(SAVE_DIR, 'idx2word.json')

os.makedirs(SAVE_DIR, exist_ok=True)

## Preparing Vocab

In [7]:
train_string = train_dataset[4]["text"]
train_string

"emerges as something rare , an issue movie that's so honest and keenly observed that it doesn't feel like one ."

In [8]:
# Pattern is adapted from GPT2: https://github.com/huggingface/transformers/blob/4fb28703adc2b44ed66a44dd04740787010c5e11/src/transformers/models/gpt2/tokenization_gpt2.py#L167
pattern = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d|\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
tokens_re = pattern.findall(train_string)
tokens_re = set(tokens_re)
tokens_re

{' ',
 "'s",
 "'t",
 ',',
 '.',
 'an',
 'and',
 'as',
 'doesn',
 'emerges',
 'feel',
 'honest',
 'issue',
 'it',
 'keenly',
 'like',
 'movie',
 'observed',
 'one',
 'rare',
 'so',
 'something',
 'that'}

In [9]:
%pip install nltk









In [10]:
import nltk
nltk.download('punkt_tab')
tokens_nltk = nltk.word_tokenize(train_string)
tokens_nltk = set(tokens_nltk)
tokens_nltk

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\snorl\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


{"'s",
 ',',
 '.',
 'an',
 'and',
 'as',
 'does',
 'emerges',
 'feel',
 'honest',
 'issue',
 'it',
 'keenly',
 'like',
 'movie',
 "n't",
 'observed',
 'one',
 'rare',
 'so',
 'something',
 'that'}

In [11]:
tokens_nltk - tokens_re

{'does', "n't"}

In [12]:
tokens_re - tokens_nltk

{' ', "'t", 'doesn'}

On this example sentence, the NLTK tokenizer does the better job: it splits `doesn't` into `does` and `n't` (Penn Treebank style), whereas the regex pattern yields `doesn`, `'t`, and a whitespace token.
In light of that, we will use the NLTK tokenizer, which was also the one introduced in the lectures,
to tokenize the whole dataset.
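
A single sentence is thin evidence, so one way to firm up this choice is to score each tokenizer by how many of its tokens exist in the embedding vocabulary. The sketch below does this for the tokens on which the two tokenizers disagree in the outputs above. `coverage` is an illustrative helper (not part of NLTK), and `reference_vocab` is a small hypothetical stand-in for the GloVe vocabulary, so the numbers it prints say nothing about real GloVe coverage.

```python
def coverage(tokens: set, reference_vocab: set) -> float:
    """Return the fraction of non-whitespace tokens found in reference_vocab."""
    tokens = {t for t in tokens if t.strip()}  # ignore whitespace-only tokens
    if not tokens:
        return 0.0
    return len(tokens & reference_vocab) / len(tokens)


# Tokens on which the two tokenizers disagree for the example sentence.
regex_only = {" ", "'t", "doesn"}
nltk_only = {"does", "n't"}

# Hypothetical stand-in for the embedding vocabulary (an assumption);
# replace it with the real GloVe vocabulary once that has been loaded.
reference_vocab = {"does", "n't", "'s", "it", "feel"}

print(f"regex-only tokens covered: {coverage(regex_only, reference_vocab):.2f}")
print(f"NLTK-only tokens covered:  {coverage(nltk_only, reference_vocab):.2f}")
```

Running the same helper over the tokens of the full training split, with the real embedding vocabulary, would give a dataset-level comparison rather than a single-sentence one.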

In [13]:
def tokenize(dataset: Dataset) -> set:
    """Tokenize the text in the dataset using NLTK

    :param dataset: The dataset to tokenize
    :type dataset: Dataset
    :return: The set of tokens in the dataset
    :rtype: set
    """
    vocab = set()
    
    for example in dataset:
        tokens = nltk.word_tokenize(example['text'])
        vocab.update(tokens)
    
    print(f"Vocabulary Size: {len(vocab)}")

    with open(VOCAB_PATH, 'w', encoding='utf-8') as f:
        json.dump(list(vocab), f, ensure_ascii=False, indent=4)

    print(f"Vocabulary saved to {VOCAB_PATH}")
    return vocab

In [14]:
vocab = tokenize(train_dataset)

## Initialize Word Embedding Matrix with GloVe

In [15]:
def load_glove_vocab() -> set:
    """Load the set of words that have a GloVe vector

    :return: GloVe vocab
    :rtype: set
    """
    print("Loading GloVe vocab...")
    glove_vocab: set = set()
    # https://huggingface.co/datasets/SLU-CSCI4750/glove.6B.100d.txt
    dataset = load_dataset("SLU-CSCI4750/glove.6B.100d.txt")
    dataset = dataset['train']
    
    for example in dataset:
        word = example["text"].split()[0]
        glove_vocab.add(word)
    print("GloVe vocab loaded.")
    return glove_vocab

In [16]:
glove_vocab = load_glove_vocab()
print(f"Size of GloVe vocab: {len(glove_vocab)}")

oov_words = vocab - glove_vocab
print(f"Number of OOV Words: {len(oov_words)}")

Loading GloVe vocab...


Repo card metadata block was not found. Setting CardData to empty.


GloVe vocab loaded.
Size of GloVe vocab: 400000
Number of OOV Words: 1867


In [17]:
def load_glove_embeddings() -> dict:
    """Load GloVe embeddings

    :return: GloVe embeddings
    :rtype: Dict
    """
    print("Loading GloVe embeddings...")
    glove_dict = {}
    word_embedding_glove = load_dataset("SLU-CSCI4750/glove.6B.100d.txt")
    word_embedding_glove = word_embedding_glove['train']
    
    for example in word_embedding_glove:
        split_line = example["text"].strip().split()
        word = split_line[0]
        vector = np.array(split_line[1:], dtype='float32')
        glove_dict[word] = vector

    print(f"Total GloVe words loaded: {len(glove_dict)}")
    return glove_dict

In [18]:
# mapping of words to indices and vice versa
word2idx = {word: idx for idx, word in enumerate(sorted(vocab))}
idx2word = {idx: word for word, idx in word2idx.items()}

vocab_size = len(vocab)
embedding_matrix = np.zeros((vocab_size, EMBEDDING_DIM))

print("Building embedding matrix...")

glove_dict = load_glove_embeddings()

for word in vocab:
    idx = word2idx[word]
    if word in glove_dict:
        # word is in the GloVe vocab: copy its pre-trained vector
        embedding_matrix[idx] = glove_dict[word]
    else:
        # word is not in the GloVe vocab: drop it from both mappings;
        # its row in embedding_matrix is left as all zeros
        word2idx.pop(word)
        idx2word.pop(idx)

print("Embedding matrix built successfully.")

np.save(EMBEDDING_MATRIX_PATH, embedding_matrix)
print(f"Embedding matrix saved as '{EMBEDDING_MATRIX_PATH}'.")

with open(WORD2IDX_PATH, 'w', encoding='utf-8') as f:
    json.dump(word2idx, f, ensure_ascii=False, indent=4)
print(f"Mapping 'word2idx' saved as '{WORD2IDX_PATH}'.")

with open(IDX2WORD_PATH, 'w', encoding='utf-8') as f:
    json.dump(idx2word, f, ensure_ascii=False, indent=4)
print(f"Mapping 'idx2word' saved as '{IDX2WORD_PATH}'.")


Building embedding matrix...
Loading GloVe embeddings...


Repo card metadata block was not found. Setting CardData to empty.


Total GloVe words loaded: 400000
Embedding matrix built successfully.
Embedding matrix saved as './result/embedding_matrix.npy'.
Mapping 'word2idx' saved as './result/word2idx.json'.
Mapping 'idx2word' saved as './result/idx2word.json'.
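
The loop above removes OOV words from `word2idx` and `idx2word` but leaves their all-zero rows in `embedding_matrix`, so the saved mappings have gaps in their indices. A possible alternative, sketched below on toy data, keeps only the words that have a pre-trained vector and renumbers them so that row `i` of the matrix always belongs to `idx2word[i]`. The helper `build_compact_embeddings` and the three-dimensional toy vectors are hypothetical and exist only for illustration.

```python
import numpy as np


def build_compact_embeddings(vocab, pretrained, dim):
    """Keep only words with a pre-trained vector and index them contiguously."""
    kept = sorted(w for w in vocab if w in pretrained)
    word2idx = {w: i for i, w in enumerate(kept)}
    idx2word = {i: w for w, i in word2idx.items()}
    matrix = np.zeros((len(kept), dim), dtype="float32")
    for word, idx in word2idx.items():
        matrix[idx] = pretrained[word]
    return word2idx, idx2word, matrix


# Toy stand-ins for `vocab` and `glove_dict`; the vector values are made up.
toy_vocab = {"movie", "rare", "too-tepid"}
toy_glove = {
    "movie": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "rare": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

w2i, i2w, emb = build_compact_embeddings(toy_vocab, toy_glove, dim=3)
print(w2i)        # the word without a toy vector has no entry
print(emb.shape)  # one row per kept word
```

The trade-off is that dropped words can no longer be looked up at all, which is what the `<UNK>` strategy in Question 1(c) addresses.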


In [19]:
import numpy as np

# Load the embedding matrix from the path it was saved to
embedding_matrix = np.load(EMBEDDING_MATRIX_PATH)

# Display basic information
print("Shape of the embedding matrix: ", embedding_matrix.shape)
print("Data type of the elements: ", embedding_matrix.dtype)

# Display the first row
print("First row of the embedding matrix: ")
print(embedding_matrix[0])

Shape of the embedding matrix:  (18030, 100)
Data type of the elements:  float64
First row of the embedding matrix: 
[ 0.38472     0.49351001  0.49096    -1.54340005 -0.33614001  0.62220001
  0.32264999  0.075331    0.65591002 -0.23517001  1.21140003  0.06193
 -0.62004     0.31371     0.38947999 -0.24381    -0.065643    0.58797002
 -0.86382002  0.63165998  0.68362999  0.39647001 -0.62388003 -0.25094
  0.92830998  1.51520002 -0.43917     0.22249     1.36950004 -0.53097999
  0.39811     0.77113998  0.49043     0.58853     0.2376      0.31619999
 -0.011962   -0.047074    0.34584999 -1.29439998  0.18596999  0.27002001
 -0.70602    -0.20652001 -0.25194001 -0.48679999 -0.71538001 -0.23886999
 -0.041612   -0.55488002 -0.54225999  0.21235999  0.025341    0.96517003
 -0.88182998 -1.86810005  0.32657     1.16890001  1.17589998 -0.17393
 -0.3371      0.87535    -1.01139998 -0.61809999  1.00800002  0.31505999
  0.24417     0.064393    0.33678001  0.33632001  0.45975     0.22813
 -0.37505001 -

# Question 1. Word Embedding
(a) What is the size of the vocabulary formed from your training data?   
`18030`

(b) We use OOV (out-of-vocabulary) to refer to words that appear in the training data but not in the Word2vec (or Glove) dictionary. How many OOV words exist in your training data?    
`1867`
   
(c) The existence of OOV words is one of the well-known limitations of Word2vec (or Glove).
Without using any transformer-based language models (e.g., BERT, GPT, T5), what do you
think is the best strategy to mitigate this limitation? Implement your solution in your source
code. Show the corresponding code snippet. 

Answer:

(1) Add an `<UNK>` token with a randomly initialized embedding, and map every OOV word to it.

The snippet below shows this:

```python
# Draw the <UNK> vector once, before the loop, so that every OOV row copies
# the same initialized vector regardless of the iteration order of the vocab.
unk_vector = np.random.normal(scale=0.6, size=(EMBEDDING_DIM,))
embedding_matrix[word2idx[UNK_TOKEN]] = unk_vector

for word in extended_vocab:
    idx = word2idx[word]
    if word in glove_dict:
        # word is in the GloVe vocab: use its pre-trained vector
        embedding_matrix[idx] = glove_dict[word]
    else:
        # OOV word (including <UNK> itself): share the <UNK> vector
        embedding_matrix[idx] = unk_vector
```

This strategy is useful because a single `<UNK>` embedding stands in for every
word that has no pre-trained vector: training-set words that are missing from
GloVe share it, and any unseen word encountered later is mapped to the `<UNK>`
token as well.
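
At lookup time, the same idea means that any token missing from `word2idx` falls back to the `<UNK>` index. A minimal sketch, assuming a `word2idx` that already contains `UNK_TOKEN`: the helper `encode` and the toy mapping are hypothetical, and `str.split` stands in for `nltk.word_tokenize` to keep the example dependency-free.

```python
UNK_TOKEN = "<UNK>"


def encode(text: str, word2idx: dict) -> list:
    """Map each whitespace-separated token to its index, using <UNK> for unknown tokens."""
    unk_idx = word2idx[UNK_TOKEN]
    return [word2idx.get(token, unk_idx) for token in text.split()]


# Toy mapping (made up); a real one would come from the saved word2idx.json.
toy_word2idx = {UNK_TOKEN: 0, "effective": 1, "but": 2, "biopic": 3}

ids = encode("effective but too-tepid biopic", toy_word2idx)
print(ids)  # "too-tepid" is not in the toy mapping, so it maps to index 0
```

With this fallback, OOV words need no rows of their own in the embedding matrix; only the single `<UNK>` row has to be stored.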

(2) Use a subword-based static embedding. There are many kinds of static embeddings. One of them, fastText (Bojanowski et al., 2017), is an extension of word2vec that addresses a weakness of word2vec: it has no good way to deal with unknown words, that is, words that appear in a test corpus but were unseen in the training corpus.

A related problem is word sparsity, for example in languages with rich morphology, where some of the many forms of each noun and verb occur only rarely. fastText deals with both problems by using subword models: each word is represented as itself plus a bag of constituent character n-grams, with the special boundary symbols `<` and `>` added to each word.
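
To make the subword idea concrete, the sketch below extracts the boundary-marked character n-grams of a word, following the description in Bojanowski et al. (2017). The function `char_ngrams` is an illustrative helper, not the fastText library API; its default range of 3 to 6 characters mirrors the range used by fastText.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> set:
    """Return the word with boundary markers plus all its character n-grams."""
    marked = f"<{word}>"  # add the special boundary symbols
    grams = {marked}      # the whole word is kept as its own unit
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.add(marked[start:start + n])
    return grams


# With n fixed to 3, "where" yields five trigrams plus the marked word itself.
print(sorted(char_ngrams("where", n_min=3, n_max=3)))
```

An OOV word can then be given a vector composed from the vectors of its n-grams, which is how a subword model avoids falling back to a single `<UNK>` embedding.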