# NanoGPT
In this assignment you will learn to implement a GPT model from scratch. This includes implementing the Transformer architecture (with causal input mask), the GPT model, a byte-level BPE tokenizer, the embedding layer, and the positional encoding layer.
We will train a GPT-2 model ([Original Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)).

In [1]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Imports
from grader import *
import typing
import numpy as np
import torch


In [3]:
# Setup autograder submission
submitter = AutograderSubmitter()

## Part 1: Tokenizer

To turn text into a format that a neural net can use, we first need to tokenize it, that is, convert the text corpus into a sequence of indexes into our dictionary. The GPT-2 model uses a byte-level BPE tokenizer.

Before we start, please take a look at [this tutorial video by Hugging Face](https://youtu.be/HEikzVL-lZU).

In the tutorial video above, you saw that a BPE tokenizer starts from a base dictionary of characters. These base characters are usually Unicode characters. Unicode is the standard way to represent text in virtually all human languages, plus our favorite emojis 😀😙, and encodings such as UTF-8, UTF-16, and UTF-32 turn Unicode characters into byte streams. However, because Unicode defines so many characters, a character-level tokenizer would need a **huge** base dictionary to begin with. This is why the authors of GPT-2 chose to use a byte-level BPE tokenizer instead. That is, we split each character into smaller fragments (1 byte, or 8 bits) and use those as our base dictionary. This way, we reduce the base dictionary size from 100,000+ Unicode characters to just 256.

> As a side note, UTF-8 is a variable-length encoding, not a strict 8-bits-per-character one: a single character takes between 1 and 4 bytes in a UTF-8 stream.
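
To make the byte-level view concrete, here is a small sketch that only uses Python's built-in `str.encode`. The sample characters are arbitrary choices for illustration and are not part of the assignment.

```python
# Each character encodes to between 1 and 4 bytes in UTF-8.
samples = ["a", "é", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)  # {'a': 1, 'é': 2, '€': 3, '😀': 4}

# Every individual byte is an integer in range(256), so a 256-entry
# base dictionary is enough to represent any UTF-8 text.
emoji_bytes = list("😀".encode("utf-8"))
print(emoji_bytes)  # [240, 159, 152, 128]
```

Whatever the input language, the tokenizer therefore never meets a symbol outside its base dictionary.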

To implement a byte-level BPE tokenizer, we first need a base dictionary of 256 entries, one per possible byte value. This base dictionary is constructed and passed into our `Dictionary` class via the `__init__` method. Then we need to be able to expand our vocabulary, one item at a time. We will do this by implementing both the `expand_dictionary` method and the `find_combination_to_expand` method.

Think of our current dictionary as, say, all the letters of the alphabet plus a space. We first tokenize our text into a list of indexes into the dictionary. Then we scan through the list and find the most frequent pair of adjacent indexes. We combine that pair into a new vocabulary item, add it to the dictionary, and give it a new index. We repeat this process until we reach our desired vocabulary size.
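
The loop described above can be sketched on a toy byte string. This is a minimal standalone illustration, not the graded `Dictionary` interface: the helper names `most_frequent_pair` and `merge_pair` are hypothetical, and using `collections.Counter` is just one convenient way to count pairs.

```python
from collections import Counter
from typing import List, Tuple

def most_frequent_pair(tokens: List[int]) -> Tuple[int, int]:
    """Return the adjacent pair of token ids that occurs most often."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return pair_counts.most_common(1)[0][0]

def merge_pair(tokens: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # both halves of the pair are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list(b"abababc")               # [97, 98, 97, 98, 97, 98, 99]
pair = most_frequent_pair(tokens)       # (97, 98) occurs three times
tokens = merge_pair(tokens, pair, 256)  # 256 is the first free index
print(pair, tokens)  # (97, 98) [256, 256, 256, 99]
```

Repeating these two steps, each time with the next free index, grows the vocabulary by one item per iteration.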

The `Dictionary` class has two member attributes, `dictionary_array` and `combinations_to_index`. The `dictionary_array` attribute holds all vocabulary items, and `combinations_to_index` maps each merged pair of indexes to the index of the resulting item; it is used later in `tokenize()`.
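
As a rough picture of how these two attributes evolve, the sketch below uses a plain list and a plain dict with the same names instead of the `Dictionary` class. The merged pair (`b"h"` and `b"i"`) is an arbitrary example.

```python
from typing import Dict, List, Tuple

# Base vocabulary: one entry per possible byte value.
dictionary_array: List[bytes] = [bytes([i]) for i in range(256)]
combinations_to_index: Dict[Tuple[int, int], int] = {}

# Merge the entries for b"h" (index 104) and b"i" (index 105).
left, right = 104, 105
dictionary_array.append(dictionary_array[left] + dictionary_array[right])
combinations_to_index[(left, right)] = len(dictionary_array) - 1

print(dictionary_array[256])   # b'hi'
print(combinations_to_index)   # {(104, 105): 256}
```

The list answers "which bytes does index 256 stand for?", while the dict answers "which index replaces the pair (104, 105)?".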

In [4]:
class Dictionary:
    def __init__(self, base_dictionary : typing.List[bytes] = [i.to_bytes(1,'big') for i in range(256)]) -> None:
        
        # dictionary_array holds all vocabulary items; the index of each item in this array is the input idx to the model
        self.dictionary_array : typing.List[bytes] = base_dictionary.copy()

        # This dictionary maps a pair of vocab indexes to the index of the merged vocab item added later
        self.combinations_to_index : typing.Dict[typing.Tuple[int, int], int] = {}
    
    def __len__(self) -> int:
        return len(self.dictionary_array)
    
    def __getitem__(self, key: int) -> bytes:
        return self.dictionary_array[key]
    
    def __contains__(self, key: bytes) -> bool:
        return key in self.dictionary_array
    
    def expand_dictionary(self, combination_vocab : typing.Tuple[int, int]) -> None:
        """
        This function should expand the dictionary with one more vocabulary item, 
        the item should be the concatenation of the two vocab items in combination_vocab
        You need to modify both the dictionary_array and combinations_to_index

        Parameters
        ----------
        combination_vocab : typing.Tuple[int, int]
            The combination of two vocab items to expand the dictionary with
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        # self.dictionary_array.append(self.dictionary_array[combination_vocab[0]] + self.dictionary_array[combination_vocab[1]])
        # self.combinations_to_index[combination_vocab] = len(self.dictionary_array) - 1
    

    def find_combination_to_expand(self, corpus_of_text: typing.List[int]) -> typing.Tuple[int, int]:
        """
        This function should find the combination of two vocab items that occurs the most in the corpus of text and return it
        
        Parameters
        ----------
        corpus_of_text : typing.List[int]
            The corpus of text represented by a list of integers (with each integer representing a vocab in the dictionary) to expand the dictionary with
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        # count_dict = {}
        # for i in range(len(corpus_of_text) - 1):
        #     if (corpus_of_text[i], corpus_of_text[i+1]) in count_dict:
        #         count_dict[(corpus_of_text[i], corpus_of_text[i+1])] += 1
        #     else:
        #         count_dict[(corpus_of_text[i], corpus_of_text[i+1])] = 1
        # return max(count_dict, key=count_dict.get)


In [5]:
# Check your implementation
grade_dictionary_class_expand_dictionary(Dictionary())
grade_dictionary_class_find_combination_to_expand(Dictionary())
submitter.submission_data["tokenizer_comb"] = generate_dictionary_class_find_combination_to_expand_dat(
    Dictionary()
)

Awesome! Now we're able to expand our vocabulary list. But how do we tokenize our text into a list of indexes? We will do this by implementing the `tokenize` function. 
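
Whichever merge strategy `tokenize` ends up using, a useful sanity check is that concatenating the byte strings of the output tokens recovers the original text. The sketch below assumes a plain list `vocab` in place of the `Dictionary` class, and `detokenize` is a hypothetical helper that the assignment does not require.

```python
from typing import List

def detokenize(token_ids: List[int], vocab: List[bytes]) -> str:
    """Concatenate the byte strings of the tokens and decode them as UTF-8."""
    return b"".join(vocab[t] for t in token_ids).decode("utf-8")

vocab = [bytes([i]) for i in range(256)]
vocab.append(b"hi")  # index 256: a merged entry

text = "hi 😀"
base_ids = list(text.encode("utf-8"))  # one index per byte
merged_ids = [256] + base_ids[2:]      # b"h", b"i" replaced by the merged token

print(base_ids)    # [104, 105, 32, 240, 159, 152, 128]
print(merged_ids)  # [256, 32, 240, 159, 152, 128]
print(detokenize(base_ids, vocab) == text, detokenize(merged_ids, vocab) == text)
```

Both token lists decode to the same string, which is the property a correct merge pass should preserve.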

In [6]:
def tokenize(text : str, dictionary : Dictionary) -> typing.List[int]:
    """
    This function should tokenize the text using the dictionary and return the tokenized text as a list of integers

    Parameters
    ----------
    text : str
        The text to tokenize
    
    dictionary : Dictionary
        The dictionary to use for tokenization
    """

    text_bytestream = bytes(text, "utf-8") # convert text to bytestream
    tokenized_text : typing.List[int] = [] # initialize tokenized text
    for i in range(len(text_bytestream)):
        tokenized_text.append(
            dictionary.dictionary_array.index(text_bytestream[i:i+1])
        )
    
    num_tokenized_last_pass = len(tokenized_text)
    # Sweep through the tokenized text and replace each pair of adjacent vocab items that has a merged vocab item with that merged item, until a full pass makes no replacement
    while num_tokenized_last_pass > 0:
        # YOUR CODE HERE
        raise NotImplementedError()
        # num_tokenized_last_pass = 0
        # new_tokenized_text = []
        # i = 0
        # while i < len(tokenized_text):
        #     pair = tuple(tokenized_text[i:i+2])  # 1-tuple at the last token, never a key
        #     if pair in dictionary.combinations_to_index:
        #         new_tokenized_text.append(dictionary.combinations_to_index[pair])
        #         num_tokenized_last_pass += 1
        #         i += 2  # skip both tokens that were merged
        #     else:
        #         new_tokenized_text.append(tokenized_text[i])
        #         i += 1
        
        # tokenized_text = new_tokenized_text
    
    return tokenized_text


In [7]:
grade_tokenizer(tokenize, Dictionary())
submitter.submission_data["tokenizer"] = generate_tokenizer_submission(tokenize, Dictionary())

## Gather your submissions

The following code will generate a `submission.npz` file. Please submit this file to Gradescope.

In [8]:
submitter.generate_submission_file("submission.npz")

In [None]:
# Internal use
from grader_internal import Autograder
autograder = Autograder()
grade = autograder.grade(submitter.submission_data)
grade_2 = autograder.grade(np.load("submission.npz"))
assert np.all(grade == grade_2)
print("Grade:", np.mean(grade) * 100)