# Assignment 4 

This notebook uses RoBERTa to generate static word embeddings by averaging 768-dimensional contextual embeddings. The contextual embeddings are calculated from the provided `dataset.txt`. Ensure that `dataset.txt` is placed in the same directory as this notebook before running.

## Initialization

Import required libraries.

In [1]:
import random
import psutil
from operator import itemgetter

import torch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader

from transformers import RobertaModel, RobertaTokenizerFast

Initialize the compute device (GPU or CPU, depending on what's available) and set the random seed.

In [2]:
# enable tqdm in pandas
tqdm.pandas()

# set to True to use the gpu (if there is one available)
use_gpu = True

# select device
device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
print(f'device: {device.type}')

# random seed
seed = 1234

# set random seed
if seed is not None:
    print(f'random seed: {seed}')
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


device: cuda
random seed: 1234


Initialize global constants.

In [3]:
## Data specifics

# Location on disk of the dataset to generate contextual embeddings from.
DATA_FILE_PATH = "dataset.txt"

# Location on disk of the glove vocabulary dataset, used when generating word embeddings.
GLOVE_FILE_PATH = "glove.6B.300d-vocabulary.txt"

# Max number of sentences used when generating contextualized embeddings.
MAX_SENTENCES = 250_000

# Max number of tokens per sentence. Extra tokens are truncated. Sentences with fewer than this many tokens are padded.
MAX_TOKENIZATION_LENGTH = 200

# Size of batches to process on the GPU in parallel while generating embeddings.
BATCH_SIZE = 350

## Model specifics

# Model name; only valid RoBERTa checkpoints are accepted.
MODEL = "roberta-base"

# The embedding size used by the model (set this to match the model you're using!)
EMBEDDING_SIZE = 768

# Special token ids to ignore when averaging: 0 is <s> and 1 is <pad> in roberta-base.
EMBEDDING_TOKEN_IGNORE_SET = {0, 1}

## Data Pre-Processing 

Load in the dataset, sentence by sentence, into the global array `sentences`.

In [4]:
sentences = []

linecount = 0
wordcount = 0 

lengths = []

with open(DATA_FILE_PATH, 'r') as dataset_file:
    while line := dataset_file.readline():
        sentences += [line]
        linecount += 1
        wordcount += len(line.split())
        lengths += [len(line)]

print("Loaded " + str(linecount) + " lines and " + str(wordcount) + " words.")
print("Average length: " + str(np.average(lengths)))
print("Max length: " + str(np.max(lengths)))

sentences = sentences[:MAX_SENTENCES]
sentence_count = len(sentences)

Loaded 4468825 lines and 47820302 words.
Average length: 67.34097665493726
Max length: 3263


## Dataset

We'll handle tokenization ahead of time, in the Dataset constructor, to avoid memory-leak issues related to batch iteration. See this [GitHub issue](https://github.com/pytorch/pytorch/issues/13246) for more information. A Dataset is used so that we can take advantage of the DataLoader and its automatic batching.

In [5]:
class RobertaDataset(Dataset):
	def __init__(self, sentences: list, max_length: int):
		sentences_tokenized = []

		for sentence in sentences:
			tokens = tokenizer.encode_plus(sentence, padding = "max_length", max_length = max_length, truncation = True, return_tensors='pt')
			
			ids = torch.LongTensor(tokens['input_ids'][0])
			mask = torch.LongTensor(tokens['attention_mask'][0]) 

			sentences_tokenized += [np.array([ids, mask])]

			print(f"{len(sentences_tokenized) / len(sentences) * 100.0}% complete.\t\t\t", end ='\r')

		self.sentences_tokenized = np.array(sentences_tokenized)

	def __len__(self):
		return len(self.sentences_tokenized)
	
	def __getitem__(self, index):
		return (self.sentences_tokenized[index][0], self.sentences_tokenized[index][1])

Initialize the tokenizer. We'll be using `RobertaTokenizerFast` for performance reasons.

In [6]:
tokenizer = RobertaTokenizerFast.from_pretrained("FacebookAI/roberta-base", add_prefix_space = True, clean_up_tokenization_spaces = True)

Initialize the Dataset & DataLoader (will take a couple minutes).

In [7]:
dataset = RobertaDataset(sentences, MAX_TOKENIZATION_LENGTH)
dataloader = DataLoader(dataset, batch_size = BATCH_SIZE, shuffle = True, num_workers = 0)

100.0% complete.

## Embedding Calculations

Setup the model and load it on the compute device.

In [8]:
model = RobertaModel.from_pretrained(MODEL).to(device)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Calculate cumulative embeddings, batch by batch, from the dataset. For every token, the sum of its embeddings is accumulated (`token_to_embedding_sums`) and a tally of the number of times the token occurs is kept (`token_to_embedding_counts`). After all batches are processed, each sum is divided by the number of times its token occurred to get an average embedding per token.

In [9]:
token_size = tokenizer.vocab_size

token_to_embedding_sums = np.zeros((token_size, EMBEDDING_SIZE))
token_to_embedding_counts = np.zeros((token_size, 1))

processed_tokens = 0

def calculate_embeddings(model) -> dict:
	processed_sentances = 0
	
	model.eval()
	
	token_to_avg_embedding_map = {}
	avg_token_embedding = None
	
	with torch.no_grad():
		for batch in dataloader:
			
			ids = batch[0].to(device)
			mask = batch[1].to(device)
			
			output = model(ids, mask)
	
			####################################################################
		
			# shape: [batch, tokens per sentence, embedding size]
			embeddings = output[0].detach().cpu().numpy()
			
			# Accumulate per-token embedding sums and occurrence counts
			for sentence_embedding_index in range(len(embeddings)):
				sentence_embedding = embeddings[sentence_embedding_index]
	
				for token_index in range(len(sentence_embedding)):
					token = ids[sentence_embedding_index][token_index]
					token_to_embedding_sums[token] += sentence_embedding[token_index]
					token_to_embedding_counts[token] += 1

			pct_virt_ram = psutil.virtual_memory().percent
			processed_sentances += BATCH_SIZE
			pct_complete = processed_sentances / (sentence_count) * 100.0 

			if (pct_virt_ram) > 90.0:
				print("Aborting embedding generation early to avoid running out of RAM!")
				print(f"{pct_complete}% complete. {pct_virt_ram}% RAM utilization. \t\t\t", end ='\r')
				print(f"{psutil.virtual_memory().used / 1e9} GB used.")
				return token_to_avg_embedding_map, avg_token_embedding
			
			print(f"{pct_complete}% complete. {pct_virt_ram}% RAM utilization. \t\t\t", end ='\r')
			
	return token_to_avg_embedding_map, avg_token_embedding

token_to_avg_embedding_map, avg_token_embedding = calculate_embeddings(model)
del model

100.1% complete. 14.1% RAM utilization.
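The nested Python loops in the cell above update one token at a time. As a minimal sketch of a vectorized alternative, the same sums and counts can be accumulated with `np.add.at`. The helper name `accumulate_batch` and the optional `mask` argument are assumptions for illustration and are not part of the notebook. Passing the attention mask would also exclude padding positions, which the original loop does not do.

```python
import numpy as np

def accumulate_batch(sums, counts, ids, embeddings, mask=None):
    """Add one batch of contextual embeddings into running per-token totals.

    sums:       (vocab_size, dim) float array, updated in place
    counts:     (vocab_size, 1) float array, updated in place
    ids:        (batch, seq_len) integer token ids
    embeddings: (batch, seq_len, dim) contextual embeddings
    mask:       optional (batch, seq_len) attention mask; 0 marks padding
    """
    flat_ids = ids.reshape(-1)
    flat_emb = embeddings.reshape(-1, embeddings.shape[-1])
    if mask is not None:
        keep = mask.reshape(-1).astype(bool)
        flat_ids, flat_emb = flat_ids[keep], flat_emb[keep]
    # np.add.at is unbuffered, so repeated ids accumulate correctly
    # (plain `sums[flat_ids] += flat_emb` would apply each id only once).
    np.add.at(sums, flat_ids, flat_emb)
    np.add.at(counts, flat_ids, 1)


# Toy check: vocabulary of 4 tokens, 2-d embeddings, token 2 appears 3 times.
sums = np.zeros((4, 2))
counts = np.zeros((4, 1))
ids = np.array([[0, 2, 2], [3, 2, 1]])
embeddings = np.arange(12, dtype=float).reshape(2, 3, 2)
accumulate_batch(sums, counts, ids, embeddings)
```

In the notebook this would be called once per batch with `ids.cpu().numpy()` and the `embeddings` array, in place of the two inner `for` loops.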

Finish the embedding calculations: compute the average embedding per token (`token_to_embedding_averages`) and the global average embedding (`average_embedding`).

In [10]:
token_to_embedding_averages = np.zeros((token_size, EMBEDDING_SIZE))
average_embedding = np.sum(token_to_embedding_sums, axis=0) / np.sum(token_to_embedding_counts)

token_to_embedding_counts[token_to_embedding_counts == 0] = 1
token_to_embedding_averages = token_to_embedding_sums / token_to_embedding_counts

# set all un-encountered tokens (rows whose embedding sum is all zeros) to 0
# would be worth looking into the effects of using the average token here!
token_to_embedding_averages[~np.any(token_to_embedding_sums, axis=1)] = 0
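The comment in the cell above suggests trying the global average for tokens that never appear in the dataset. The following is a minimal sketch of that idea, not something the notebook evaluates. The function name `average_with_fallback` is hypothetical, and it assumes the counts still contain their original zeros (that is, it runs before zero counts are overwritten with 1).

```python
import numpy as np

def average_with_fallback(sums, counts):
    """Per-token averages; tokens with a zero count get the global average.

    sums:   (vocab_size, dim) summed embeddings per token
    counts: (vocab_size, 1) occurrence counts per token (zeros allowed)
    """
    total = counts.sum()
    if total > 0:
        global_average = sums.sum(axis=0) / total
    else:
        global_average = np.zeros(sums.shape[1])
    seen = counts[:, 0] > 0
    # Start every row at the global average, then overwrite the seen tokens.
    averages = np.tile(global_average, (sums.shape[0], 1))
    averages[seen] = sums[seen] / counts[seen]
    return averages


# Toy check: token 1 was never seen, so it falls back to the global average.
sums = np.array([[2.0, 4.0], [0.0, 0.0], [3.0, 3.0]])
counts = np.array([[2.0], [0.0], [1.0]])
averages = average_with_fallback(sums, counts)
```

Whether this fallback helps or hurts the similarity results is an open question; it only changes how unseen tokens contribute to a word's averaged embedding.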

## Problem One Complete.
The `token_to_embedding_averages` matrix contains a mapping between sub-word tokens and their average embedding in the dataset. 

e.g. 

```
TOKEN_EMBEDDING = token_to_embedding_averages[TOKEN]
```

## Problem 2
This section implements the `most_similar()` functions from chapter 9 and tests them using the specified examples. 

Generate word to embedding mappings (`word_to_embedding`) for the contents of the glove vocabulary file (`GLOVE_FILE_PATH`).

In [11]:
def get_average_embedding(word):
	tokens = tokenizer.encode_plus(word, padding = "max_length", max_length = MAX_TOKENIZATION_LENGTH, truncation = True)['input_ids']
	tokens = np.array(tokens) 

	embedding = np.zeros(EMBEDDING_SIZE)
	token_count = 0
	for token in tokens:
		if token not in EMBEDDING_TOKEN_IGNORE_SET: 
			embedding += token_to_embedding_averages[token]	
			token_count += 1
	return embedding / token_count


def generate_word_embedding_map(words: list) -> dict:
	word_embedding_map = {}
	processed_words = 0
	for word in words:
		embedding = get_average_embedding(word)
		word_embedding_map[word] = embedding
	
		processed_words += 1
		print(f"{processed_words / len(words) * 100.0}% complete. {len(word_embedding_map)} word embeddings generated.\t\t\t", end ='\r')
	return word_embedding_map


def load_words(from_file: str) -> list:
	words = []
	with open(from_file, 'r') as file:
		while line := file.readline():
			words += [line.strip()]

	return words

In [12]:
words = load_words(GLOVE_FILE_PATH)
word_to_embedding = generate_word_embedding_map(words)

100.0% complete. 400000 word embeddings generated.

Implement the `most_similar()` function.

In [27]:
def get_word_embedding(word):
    if word in word_to_embedding:
        emb = word_to_embedding[word]
    else:
        emb = get_average_embedding(word)
        word_to_embedding[word] = emb
    return emb

def most_similar(word, topn=10):
    emb = get_word_embedding(word)

    # score every word in our vocabulary against the query embedding
    # (plain dot product, so vector magnitude affects the ranking)
    similarities = []
    for candidate, embedding in word_to_embedding.items():
        similarity = embedding @ emb

        similarities += [(float(similarity), str(candidate))]

    # highest similarity first
    similarities.sort(key = itemgetter(0))
    similarities.reverse()
    
    return similarities[:topn]
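`most_similar()` ranks candidates by raw dot product, which grows with vector length, so candidates with large norms can outrank candidates that point in a closer direction. As a hedged sketch of one alternative, the ranking can use cosine similarity instead. The name `most_similar_cosine` and the toy vectors are assumptions for illustration; this is not presented as the chapter's implementation.

```python
import numpy as np

def most_similar_cosine(query, word_vectors, topn=10):
    """Rank words by cosine similarity to `query` (a 1-d vector).

    word_vectors: dict mapping word -> 1-d numpy array,
                  shaped like the notebook's `word_to_embedding`.
    """
    words = list(word_vectors)
    matrix = np.stack([word_vectors[w] for w in words])
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Guard against zero vectors (e.g. words built only from unseen tokens).
    denom = np.where(denom == 0, 1.0, denom)
    scores = (matrix @ query) / denom
    order = np.argsort(-scores, kind="stable")[:topn]
    return [(float(scores[i]), words[i]) for i in order]


# Toy check: "long" wins on dot product (10 vs 2) but not on cosine.
toy = {
    "long": np.array([10.0, 0.0]),
    "near": np.array([1.0, 1.0]),
    "same": np.array([0.0, 2.0]),
    "zero": np.array([0.0, 0.0]),
}
ranked = most_similar_cosine(np.array([1.0, 1.0]), toy, topn=2)
```

With the notebook's data this could be called as `most_similar_cosine(get_word_embedding("cactus"), word_to_embedding)`. Note that stacking 400,000 vectors of size 768 makes a second in-memory copy of the embeddings, so the matrix is worth building once and reusing.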

## 6 Examples

In [28]:
most_similar("cactus")

[(131.35084119944005, 'higher-dimensional'),
 (131.3331053641163, 'one-dimensional'),
 (131.33226782229394, 'best-kept'),
 (131.16419286968562, 'human-animal'),
 (131.1133371650292, 'near-earth-object'),
 (131.0245007210086, 'other-dimensional'),
 (130.9287293293737, 'high-dimensional'),
 (130.7789125005217, 'use-value'),
 (130.72282521956038, 'writer-editor'),
 (130.60795817867975, 'part-owner')]

In [23]:
most_similar("cake")

[(39.20734567809385, 'moslems'),
 (39.20734567809385, 'beholder'),
 (39.20734567809385, '----------------------------------------------'),
 (39.20734567809385, 'ghouls'),
 (39.20734567809385, 'disobedient'),
 (39.20734567809385, 'reimburses'),
 (39.20734567809385, 'orgasms'),
 (39.20734567809385, '------------------------------------------------'),
 (39.20734567809385, 'relearning'),
 (39.20734567809385, '!!!!!')]

In [25]:
most_similar("Angry")

[(39.80380926749184, 'moslems'),
 (39.80380926749184, 'beholder'),
 (39.80380926749184, '----------------------------------------------'),
 (39.80380926749184, 'ghouls'),
 (39.80380926749184, 'disobedient'),
 (39.80380926749184, 'reimburses'),
 (39.80380926749184, 'orgasms'),
 (39.80380926749184, '------------------------------------------------'),
 (39.80380926749184, 'relearning'),
 (39.80380926749184, '!!!!!')]

In [26]:
most_similar("quickly")

[(38.45007287102884, 'moslems'),
 (38.45007287102884, 'beholder'),
 (38.45007287102884, '----------------------------------------------'),
 (38.45007287102884, 'ghouls'),
 (38.45007287102884, 'disobedient'),
 (38.45007287102884, 'reimburses'),
 (38.45007287102884, 'orgasms'),
 (38.45007287102884, '------------------------------------------------'),
 (38.45007287102884, 'relearning'),
 (38.45007287102884, '!!!!!')]

In [18]:
most_similar("between")

[(141.83930279848408, 'higher-dimensional'),
 (141.6022133899604, 'one-dimensional'),
 (141.13505233585803, 'other-dimensional'),
 (140.92190495429995, 'use-value'),
 (140.781117957326, 'part-owner'),
 (140.6535624685956, 'best-kept'),
 (140.65225236056526, 'high-dimensional'),
 (140.62948610969406, 'near-earth-object'),
 (140.2567879639438, 'part-time'),
 (140.25607619236177, 'low-dimensional')]

In [1]:
most_similar("the")

NameError: name 'most_similar' is not defined

## Problem Two Complete.