# Assignment 4 

This notebook uses Roberta to generate a single dictionary which contains a mapping between a token (as a string) and a 756 dimensional averaged embedding over the provided text. The corpus to be used must be placed in the same directory as this notebook and named 'dataset.txt'.

## Initialization

Import required libraries.

In [None]:
# Standard ML libaries
import random
import torch
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.metrics import classification_report

# RobertaModel and Tockenizer
from transformers import RobertaTokenizer, RobertaModel, pipeline


Initialize environment with GPU (or CPU as fallback!).

In [2]:
# enable tqdm in pandas
tqdm.pandas()

# set to True to use the gpu (if there is one available)
use_gpu = True

# select device
device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
print(f'device: {device.type}')

# random seed
seed = 1234

# set random seed
if seed is not None:
    print(f'random seed: {seed}')
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


device: cuda
random seed: 1234


## Data Pre-Processing 

Load in dataset, sentence by sentence.

In [3]:
sentences = []

linecount = 0
wordcount = 0

with open("dataset.txt", 'r') as dataset_file:
    while line := dataset_file.readline():
        sentences += [line]
        linecount += 1
        wordcount += len(line.split())

print("Loaded " + str(linecount) + " lines and " + str(wordcount) + " words.")

Loaded 4468825 lines and 47820302 words.


Initialize tokenizer.

In [4]:
tokenizer = RobertaTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space = True, clean_up_tokenization_spaces = True)

Quick sanity check to ensure tokenizer is working.

In [5]:
sentence = sentences[1]
tokens = tokenizer(sentence, is_split_into_words = True, return_tensors='pt', \
			padding="max_length", max_length=500, truncation=True)
ids = tokens['input_ids'][0]
mask = tokens['attention_mask'][0] 
print(tokenizer.decode(ids[7]))

 Pictures


## Dataset

We'll be handling tokenization in a Dataset so we can take advantage of the DataLoader for auto batching.

In [7]:
from torch.utils.data import Dataset, DataLoader
class RobertaDataset(Dataset):
	def __init__(self, sentances: list, tokenizer_instance: object, max_length: int):
		self.tokenizer = tokenizer_instance
		self.max_length = max_length
		self.sentences = sentences

	def __len__(self):
		return len(self.sentences)
	
	def __getitem__(self, index):
		sentence = self.sentences[index]
		tokens = self.tokenizer(sentence, is_split_into_words = True, return_tensors='pt', \
					 padding="max_length", max_length=self.max_length, truncation=True)
		ids = tokens['input_ids'][0]
		mask = tokens['attention_mask'][0] 
		return (torch.LongTensor(ids), torch.LongTensor(mask))

In [8]:
dataset = RobertaDataset(sentences, tokenizer, 500)
BATCH_SIZE = 256
dataloader = DataLoader(dataset, batch_size = BATCH_SIZE, shuffle = True, num_workers = 16)

## Embedding Calculations

Calculate a single embedding just to test.

In [9]:
batch = next(iter(dataloader))

ids = batch[0].to(device)
mask = batch[1].to(device)
model = RobertaModel.from_pretrained('roberta-base').to(device)

with torch.no_grad():
	output = model(ids, mask)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calculate the embeddings of our batches

In [None]:
processed_sentances = 0

model = RobertaModel.from_pretrained('roberta-base').to(device)
model.eval()
token_to_avg_embedding_map = {}

with torch.no_grad():
	for batch in dataloader:

		ids = batch[0].to(device)
		mask = batch[1].to(device)

		output = model(ids, mask)

		####################################################################
	
		### shape, [batch, tokens in sentance, embeddings of each token]
		embeddings = output[0].detach()
		
		# Update average embeddings, 
		for sentence_embedding_index in range(len(embeddings)):
			sentence_embedding = embeddings[sentence_embedding_index]

			for token_index in range(len(sentence_embedding)):
				token = ids[sentence_embedding_index][token_index]
				token_str = tokenizer.decode(token)
				token_embedding = sentence_embedding[token_index]

				if token_str in token_to_avg_embedding_map:
					token_to_avg_embedding_map[token_str] += token_embedding
					token_to_avg_embedding_map[token_str] /= 2
				else:
					token_to_avg_embedding_map[token_str] = token_embedding
		
		processed_sentances += BATCH_SIZE
		print(f"{processed_sentances / len(sentences) * 100.0}% complete. {len(token_to_avg_embedding_map)} tokens processed.\t\t\t", end ='\r')
		

		## Cuda's leaking memeory and this seems to help?
		del ids
		del mask 
		del output
		del embeddings
		torch.cuda.empty_cache()
		import gc
		gc.collect()
		# print("gc:\n")
		# for obj in gc.get_objects():
		# 	try:
		# 		if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
		# 			print(type(obj), obj.size())
		# 	except:
		# 		pass
		# print("")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.005728575184752144% complete. 1894 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.011457150369504288% complete. 3261 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.01718572555425643% complete. 4361 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.022914300739008575% complete. 5483 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.028642875923760722% complete. 6369 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.03437145110851286% complete. 7185 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.040100026293265006% complete. 8024 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.04582860147801715% complete. 8783 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.0515571766627693% complete. 9449 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.057285751847521445% complete. 10052 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.06301432703227358% complete. 10649 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.06874290221702573% complete. 11175 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.07447147740177787% complete. 11748 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.08020005258653001% complete. 12188 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.08592862777128216% complete. 12729 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.0916572029560343% complete. 13236 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.09738577814078644% complete. 13765 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.1031143533255386% complete. 14174 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


0.10884292851029073% complete. 14551 tokens processed.			

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


KeyboardInterrupt: 

In [None]:
print("tokens : " + str(len(token_to_avg_embedding_map)))
print("embedding " + str(token_to_avg_embedding_map["red"]))

tokens : 18382
embedding tensor([-6.7960e-02,  1.8660e-01, -1.4361e-01,  7.2157e-02, -9.8492e-02,
         2.4160e-01,  5.6699e-02, -4.1833e-02, -7.1487e-02, -7.0940e-02,
         5.5407e-02, -2.5917e-01, -1.7565e-01,  1.3531e-01, -2.9793e-01,
        -6.4282e-01,  1.6737e-01, -4.7400e-02, -4.4321e-02, -1.3563e-01,
        -2.8828e-01, -2.1811e-02, -2.6930e-01,  1.4016e-01, -6.2870e-02,
        -1.4870e-01, -1.0622e-01, -7.9012e-02,  3.6224e-02,  8.8775e-02,
        -8.6289e-02,  1.8704e-01,  1.7647e-02,  1.4020e-01, -1.1914e-02,
         1.5171e-01,  1.1326e-01, -9.6438e-02, -2.4445e-01,  8.8952e-02,
         2.4561e-02, -9.2861e-01, -1.5765e-01, -1.2588e-01, -5.9391e-03,
         2.3183e-01, -1.8782e-01, -6.5659e-01,  9.0372e-02, -1.4454e-01,
        -7.5906e-02,  3.7736e-01, -4.2175e-02,  3.9214e-01,  4.0338e-02,
        -1.0264e-01,  2.0975e-02, -3.3435e-01, -1.9228e-01, -2.0601e-01,
         2.9846e-01,  1.2260e+00, -1.3871e-01, -4.1784e-01,  2.6139e-01,
        -4.4210e-01, -1.36