# NLP Assignment 5
Created by Prof. [Mohammad M. Ghassemi](https://ghassemi.xyz)

Submitted by: <span style="color:red"> INSERT YOUR NAME HERE </span>

In collaboration with: <span style="color:red"> INSERT YOUR (OPTIONAL) HOMEWORK PARTNER'S NAME HERE </span>

<hr> 

## Assignment Goals
The goal of this assignment is to familiarize yourself with:

1. Data-driven Tokenization
2. Seq2seq with Attention
3. Transformers

The assignment combines tutorial components with learning exercises that you must complete and submit. The learning exercise sections are clearly demarcated within the assignment.

## Before you start
1. PULL THE LATEST VERSION OF THE `course-materials` REPOSITORY, AND COPY `homework/HW5/` INTO THE CORRESPONDING DIRECTORY OF YOUR SUBMISSION FOLDER
2. CREATE AND ATTACH TO A VIRTUAL ENVIRONMENT, AND INSTALL THE REQUIREMENTS IN `requirements.txt`
3. IMPORT THE COURSE UTILITIES AND RELEVANT LIBRARIES BY RUNNING THE CODE BLOCK BELOW


In [1]:
import importlib
from materials.code import utils
import matplotlib.pyplot as plt
import os
import requests

# IMPORT SOME BASIC TOOLS:
from pprint import pprint
import pyarrow

<hr>

# Part 0: Data 

[Common Crawl](https://commoncrawl.org/) is a non-profit organization that freely provides web crawler data in [WARC format](http://fileformats.archiveteam.org/wiki/WARC); this is useful for budding NLP researchers (and open-source Google competitors) because crawling the web at any kind of meaningful scale is challenging. If you want to add "processed the entire internet - literally" to your list of resume accomplishments without touching any WARC files, you can take a look at the [Oscar Corpus](https://oscar-corpus.com/) which provides a (somewhat) pre-processed version of the internet archives. 

I did consider doing an assignment asking you to process the entire web, and whilst the thought of your machine's suffering was amusing, my better nature compelled me to use a humble subset of our wonderful world wide web - Wikipedia. More specifically, we'll be using the [WikiText language modeling dataset](https://huggingface.co/nlp/viewer/?dataset=wikitext&config=wikitext-103-raw-v1), a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Let's begin by pulling the `wikitext` data from the web and saving it to disk.

In [2]:
#-------------------------------------------------
# Import the wikitext dataset:
#-------------------------------------------------
from datasets import load_dataset
dataset   = load_dataset('wikitext', 'wikitext-103-raw-v1')

#-------------------------------------------------
# Flatten the dataset splits into a single list of text lines
#-------------------------------------------------
sentences = dataset['train']['text']  + dataset['validation']['text'] + dataset['test']['text']

#-------------------------------------------------
# Store the Wikipedia data (training split only) on disk
#-------------------------------------------------
with open("materials/data/wikitext.txt", "w", encoding="utf-8") as f:
    f.write(''.join(dataset['train']['text']))

Reusing dataset wikitext (/Users/ghamut/.cache/huggingface/datasets/wikitext/wikitext-103-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91)


<br><br>The Wikipedia data subset provided by HuggingFace (saved at `materials/data/wikitext.txt`) is a humble half a gigabyte - rather small by NLP standards, but it will do for our tutorial's purposes. By the way, you can also download the full [daily Wikipedia dumps here](https://dumps.wikimedia.org/enwiki/) in case that's something of interest to you.

<hr>

# Part 1: Data-Driven Tokenization
At the start of this course, we covered the topic of tokenization. As you may recall, the traditional tokenization approaches use hand-crafted rules and human knowledge to separate our text into structured lists of lists, which are eventually converted into tensors. In all of the assignments to date, we've used the traditional approach to tokenization via `nltk` as a foundational pre-processing step. However, as you probably recall from your first homework assignment, there are also data-driven approaches to tokenization; one of the approaches we discussed was `Byte Pair Encoding` (BPE).

The reason we've ignored BPE in favor of the traditional approaches is that, up until now, the data we've been working with has been rather small, and data-driven approaches require lots of data (shocking, I know). My hope is that 500MB of `wikitext` data is enough for BPE to start yielding more sensible tokenization results.
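As a quick refresher on what BPE training does, here is a minimal from-scratch sketch on a toy corpus. It is not the implementation used by the `tokenizers` library: the function names (`learn_bpe`, `merge_pair`, `segment`) and the toy word counts are made up for illustration, it works on characters rather than bytes, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy, character-level)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (first seen wins on ties) and merge it everywhere.
        best = max(pairs, key=pairs.get)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
        merges.append(best)
    return merges


def segment(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word frequencies chosen only for illustration.
toy_words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
toy_merges = learn_bpe(toy_words, num_merges=4)
print(toy_merges)                     # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(segment("lowest", toy_merges))  # ['low', 'est']
```

Note that `lowest` never appears in the toy corpus, yet it is still split into two sensible pieces. With only a handful of words the learned merges are fragile, which is why more data tends to help.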

Below I've provided a pre-processing and tokenization pipeline that we will use to train BPE on the wikipedia corpus:

In [3]:
# Import the libraries needed to build and train the tokenizer
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
from tokenizers import BertWordPieceTokenizer, CharBPETokenizer, ByteLevelBPETokenizer, SentencePieceBPETokenizer
from tokenizers.normalizers    import Lowercase, NFD, Sequence, StripAccents
from tokenizers.processors     import BertProcessing
from tokenizers.pre_tokenizers import Punctuation, Digits, ByteLevel
#-------------------------------------------------
# Training data for the Tokenizer
#-------------------------------------------------
training_data = ['materials/data/wikitext.txt']    

#-------------------------------------------------
# Type of Tokenizer
#-------------------------------------------------
tokenizer     = ByteLevelBPETokenizer()

#-------------------------------------------------
# Pre-Tokenizers
#-------------------------------------------------
tokenizer.pre_tokenizer=  [ Punctuation(),         # Split on Punctuation
                            Digits(),              # Split on Digits
                           ]

#-------------------------------------------------
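# Text normalization approach
# NOTE: the recorded output of In [4] still contains uppercase tokens ('I', 'ĠNew'),
# so check that this assignment reaches the wrapped tokenizer in your `tokenizers` version.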
#-------------------------------------------------
tokenizer.normalizer = Sequence([ NFD(),           # Fix potential unicode problems
                                  StripAccents(),  # Remove Accents 
                                  Lowercase()      # Cast the text to lowercase
                                ])

#-------------------------------------------------
# Specify Special Tokens that are not in the text
#-------------------------------------------------
special_tokens = ["<s>",      # indicates start of text block
                  "<pad>",    # padding for tensors 
                  "</s>",     # indicates end of a text block
                  "<unk>",    # indicates out-of-vocabulary token
                  "<mask>"]   # will be used to artificially "corrupt" data when training.  


#-------------------------------------------------
# Train the tokenizer
#-------------------------------------------------
tokenizer.train(files          = training_data,                            
                vocab_size     = 50000, 
                min_frequency  = 2, 
                special_tokens = special_tokens,
                show_progress  = True)


#-------------------------------------------------
# Save the tokenizer for later use too.
#-------------------------------------------------
tokenizer.save_model("materials/tokenizers", "bpe.tokenizer.50k.json")


#-------------------------------------------------
# Add a post processor 
#-------------------------------------------------
tokenizer._tokenizer.post_processor = BertProcessing(
                                                    ("</s>", tokenizer.token_to_id("</s>")),
                                                    ("<s>", tokenizer.token_to_id("<s>")),
                                                    )
tokenizer.enable_truncation(max_length=512)

<br>This tokenizer is now ready to use. Also, the tokenizer conveniently creates two files for us: 

1. `materials/tokenizers/bpe.tokenizer.50k.json-vocab.json` contains the vocabulary that was generated by BPE; it is ranked by frequency.
2. `materials/tokenizers/bpe.tokenizer.50k.json-merges.txt` contains the merges followed to perform the tokenization
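If you are curious what is inside these files, the sketch below reads them with plain Python. The helper name `load_bpe_files` is hypothetical, and the layout it assumes (a JSON token-to-id mapping, plus a merges file with an optional `#version` header line followed by one space-separated pair per line) is an assumption about what `save_model` writes, so check it against your own output.

```python
import json


def load_bpe_files(vocab_path, merges_path):
    """Read a BPE vocabulary (JSON) and merge list (text) from disk."""
    # Assumed layout: vocab file is a JSON object mapping token string -> integer id.
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)

    # Assumed layout: merges file has one "left right" pair per line, in merge order.
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#version"):
                continue  # skip blank lines and the optional header
            left, right = line.split(" ")
            merges.append((left, right))
    return vocab, merges
```

Calling `load_bpe_files` on the two paths listed above should return the vocabulary dictionary and the ordered list of merges, which you can then inspect with `pprint`.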

<br>These two files contain everything we need to use the tokenizer without having to train the model again. Let's apply the model to encode, and then decode, a sample sentence:

In [4]:
#-------------------------------------------------
# Encode, and then decode, a sample sentence:
#-------------------------------------------------
encoded = tokenizer.encode("I didn't win a $2,000 trip to New York!! 😭")
decoded = tokenizer.decode(encoded.ids)

print("Encoded string: {}".format(encoded.tokens))
print("Word ids:       {}".format(encoded.ids))
print("Decoded string: {}".format(decoded))

Encoded string: ['<s>', 'I', 'Ġdidn', "'", 't', 'Ġwin', 'Ġa', 'Ġ$', '2', ',', '000', 'Ġtrip', 'Ġto', 'ĠNew', 'ĠYork', '!', '!', 'Ġ', 'ð', 'Ł', 'ĺ', 'Ń', '</s>']
Word ids:       [0, 45, 5271, 11, 88, 1197, 263, 1136, 22, 16, 17371, 6015, 294, 762, 1252, 5, 5, 225, 177, 258, 251, 260, 2]
Decoded string: I didn't win a $2,000 trip to New York!! 😭


<br><br> A couple of things I want to point out about these results:

1. Note that many of the tokens start with the `Ġ` character; this is a representation of whitespace when things are computed at the Byte-level with BPE. 
2. Also note that the data-driven approach here decided to split `didn't` into `didn`, `'`, and `t`. This may not be the best tokenization approach for all problems, but the advantage of techniques like BPE is that they do a great job for tokenization without any prior human knowledge about the problem domain. 
3. Note that even though our training data didn't contain a 😭, the tokenizer still created a representation: `['Ġ', 'ð', 'Ł', 'ĺ', 'Ń']`.
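To see where `Ġ` and the four emoji tokens come from, here is a small sketch of the byte-to-character mapping popularized by GPT-2's byte-level BPE. That the `tokenizers` library uses this same alphabet is an assumption on my part, although the mapping does reproduce the characters in the output above. The helper name `to_byte_level` is hypothetical.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible unicode character (GPT-2 style)."""
    # Bytes that are already printable characters keep their own code point.
    printable = (list(range(ord("!"), ord("~") + 1))
                 + list(range(ord("¡"), ord("¬") + 1))
                 + list(range(ord("®"), ord("ÿ") + 1)))
    byte_values = printable[:]
    code_points = printable[:]
    # Every other byte (whitespace, control bytes, ...) is shifted to 256 and above.
    shift = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_MAP = bytes_to_unicode()


def to_byte_level(text):
    """Encode text as UTF-8 and render each byte with its visible stand-in character."""
    return "".join(BYTE_MAP[b] for b in text.encode("utf-8"))


print(list("😭".encode("utf-8")))  # [240, 159, 152, 173]
print(to_byte_level("😭"))         # ðŁĺŃ
print(to_byte_level(" win"))       # Ġwin
```

Because every string is first reduced to bytes, and all 256 byte values are in the base alphabet, a byte-level tokenizer can represent any input, including characters it never saw during training.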

Personally, I prefer to use [spaCy](https://spacy.io/usage/linguistic-features#how-tokenizer-works) or even just simple `nltk` for smaller tokenization projects, and BPE for larger tokenization projects, or projects where I'm dealing with a text modality I'm unfamiliar with (e.g. a language I don't know). You can [read more here](https://blog.floydhub.com/tokenization-nlp/#unigram) about various tokenization approaches, as well as their pros/cons.



## Pre-trained tokenizers

Because we saved our vocabulary and merges to disk, we can load them instead of re-training from scratch.

In [5]:
from tokenizers.implementations import ByteLevelBPETokenizer
from tokenizers.processors      import BertProcessing


tokenizer = ByteLevelBPETokenizer("materials/tokenizers/bpe.tokenizer.50k.json-vocab.json",
                                  "materials/tokenizers/bpe.tokenizer.50k.json-merges.txt")

tokenizer._tokenizer.post_processor = BertProcessing(("</s>", tokenizer.token_to_id("</s>")),
                                                     ("<s>", tokenizer.token_to_id("<s>")))



encoded = tokenizer.encode("A journey to the center of the earth.")
encoded.tokens


['<s>',
 'A',
 'Ġjourney',
 'Ġto',
 'Ġthe',
 'Ġcenter',
 'Ġof',
 'Ġthe',
 'Ġearth',
 '.',
 '</s>']

<br>We can also use pre-trained tokenizers from contemporary machine learning models, such as OpenAI's [GPT2](https://openai.com/blog/better-language-models/), by downloading the [gpt2-vocab.json](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json) and [gpt2-merges.txt](https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt) files. I'm bringing this up because new models will (sometimes) make these files available, and if someone trained a tokenizer in a domain that's related to yours, it might save you some time to use it.

<br> I should also note that the tokenizers used by many of the more popular NLP models (e.g. [BERT](https://arxiv.org/pdf/1810.04805.pdf), [XLNet](https://arxiv.org/pdf/1906.08237.pdf)) are freely available from the `transformers` library:

In [6]:
from transformers import BertTokenizer, XLNetTokenizer

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')    # WordPiece based
xlnet_tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased') # SentencePiece based

tokenized = bert_tokenizer.tokenize("I didn't win a $2,000 trip to New York!! 😭")
print(tokenized)

tokenized = xlnet_tokenizer.tokenize("I didn't win a $2,000 trip to New York!! 😭")
print(tokenized)

['I', 'didn', "'", 't', 'win', 'a', '$', '2', ',', '000', 'trip', 'to', 'New', 'York', '!', '!', '[UNK]']
['▁I', '▁didn', "'", 't', '▁win', '▁a', '▁$2,000', '▁trip', '▁to', '▁New', '▁York', '!!', '▁', '😭']


<hr> 

## Learning Exercise 1: 
#### Worth 1/5 Points
#### A. The Unigram Tokenization Approach
Implement the unigram tokenization approach (from scratch) described [in this paper](https://arxiv.org/pdf/1804.10959.pdf). Demonstrate the tokenization on the Wikipedia data. Comment on the advantages of this approach over BPE.

In [7]:
################################################################################
# INSERT YOUR CODE HERE
# DO NOT FORGET TO PRINT YOUR MEANINGFUL RESULTS TO THE SCREEN.
################################################################################

<span style="color:red"> INSERT AN INTERPRETATION OF YOUR RESULTS HERE </span>

<hr>
<h1><span style="color:red"> Self Assessment </span></h1>
Please provide an assessment of how successfully you accomplished the learning exercises in this assignment according to the instructions provided; do not assign yourself points for effort. This self assessment will be used as a starting point when I grade your assignments. Please note that if you overestimate your grade on a given learning exercise, you will face a 50% penalty on the total points granted for that exercise. If you underestimate your grade, there will be no penalty.

* Learning Exercise: 
    * <span style="color:red">X</span>/1 points