# NanoGPT
In this assignment you will learn to implement a GPT model from scratch. This includes implementing the Transformer architecture (with causal input mask), the GPT model, a byte-level BPE tokenizer, the embedding layer, and the positional encoding layer.
We will train a GPT-2 model ([Original Paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf)).

In [2]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Imports
from grader import *
import typing
import numpy as np
import torch


## Part 1: Tokenizer

To transform text into a format that a neural net can use, we first need to tokenize it, that is, convert the text corpus into indices into our dictionary. The GPT-2 model uses a byte-level BPE tokenizer. However, we will only try to implement a Unicode tokenizer in this section for simplicity.

Before we start, please watch [this tutorial video by Hugging Face](https://youtu.be/HEikzVL-lZU).

In the tutorial video above, you saw that a BPE tokenizer starts with a base dictionary of characters. This base set is usually made up of Unicode characters. Unicode is the standard way to represent text in all human languages, plus our favorite emojis 😀😙, and encodings such as UTF-8, UTF-16, and UTF-32 turn those characters into byte streams. However, because Unicode defines more than 100,000 characters, a tokenizer built directly on characters has to start with a **huge** base dictionary. This is why the authors of GPT-2 chose a byte-level BPE tokenizer instead. That is, we split characters into smaller fragments (1 byte, or 8 bits, each) and use those as our base dictionary. This reduces the base dictionary size from 100,000+ Unicode characters to just 256.

> As a side note, UTF-8 is not a fixed 8-bit-per-character encoding. It is variable-width: a single character takes between 1 and 4 bytes in a UTF-8 stream.
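
To make the difference between the two views concrete, here is a small sketch (not part of the assignment code) that looks at the same string once as Unicode code points and once as UTF-8 bytes. The sample string and the variable names are only illustrative.

```python
# Sample text: "h", "é" (U+00E9), "llo", a space, and the emoji U+1F600.
text = "h\u00e9llo \U0001F600"

# Character-level view: one id per Unicode code point.
# Ids can be very large, so the base dictionary must be large too.
code_points = [ord(ch) for ch in text]

# Byte-level view: UTF-8 encodes each character as 1 to 4 bytes,
# so every id is guaranteed to lie in the range 0..255.
byte_ids = list(text.encode("utf-8"))

print(len(code_points), max(code_points))  # 7 128512
print(len(byte_ids), max(byte_ids))        # 11 240

# Decoding the byte ids recovers the original text.
assert bytes(byte_ids).decode("utf-8") == text
```

The byte-level sequence is longer (11 ids instead of 7), which is the price paid for the tiny base dictionary. BPE merges then shorten the sequence again by adding frequent byte combinations to the dictionary.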

In [6]:
class Dictionary:
    def __init__(self, base_dictionary : typing.Optional[typing.List[bytes]] = None) -> None:
        # Default to the 256 single-byte values. A None default avoids sharing one mutable list between instances.
        if base_dictionary is None:
            base_dictionary = [i.to_bytes(1,'big') for i in range(256)]

        # dictionary_array holds all vocabulary items; the index of each item in this array is the input idx to the model
        self.dictionary_array : typing.List[bytes] = list(base_dictionary)

        # Maps a pair of vocab indices to the index of the vocab item created by merging them
        self.combinations_to_index : typing.Dict[typing.Tuple[int, int], int] = {}
    
    def __len__(self) -> int:
        return len(self.dictionary_array)
    
    def __getitem__(self, key: int) -> bytes:
        return self.dictionary_array[key]
    
    def __contains__(self, key: bytes) -> bool:
        return key in self.dictionary_array
    
    def expand_dictionary(self, combination_vocab : typing.Tuple[int, int]) -> None:
        """
        Expand the dictionary with one more vocabulary item: the concatenation of the two vocab items
        indexed by combination_vocab. The pair should also be recorded in combinations_to_index.

        Parameters
        ----------
        combination_vocab : typing.Tuple[int, int]
            The indices of the two vocab items to concatenate into the new item
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        # self.dictionary_array.append(self.dictionary_array[combination_vocab[0]] + self.dictionary_array[combination_vocab[1]])
        # self.combinations_to_index[combination_vocab] = len(self.dictionary_array) - 1
    

    @staticmethod
    def find_combination_to_expand(corpus_of_text: typing.List[int]) -> typing.Tuple[int, int]:
        """
        Find the pair of adjacent vocab items that occurs most often in the corpus of text and return it.

        Parameters
        ----------
        corpus_of_text : typing.List[int]
            The corpus of text as a list of integers, each integer being the index of a vocab item in the dictionary
        """
        # YOUR CODE HERE
        raise NotImplementedError()
        # count_dict = {}
        # for i in range(len(corpus_of_text) - 1):
        #     if (corpus_of_text[i], corpus_of_text[i+1]) in count_dict:
        #         count_dict[(corpus_of_text[i], corpus_of_text[i+1])] += 1
        #     else:
        #         count_dict[(corpus_of_text[i], corpus_of_text[i+1])] = 1
        # return max(count_dict, key=count_dict.get)


In [None]:
# Check your implementation
grade_dictionary_class_expand_dictionary(Dictionary(), n = 50)
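
For context, the two methods above are the building blocks of a BPE training loop: find the most frequent adjacent pair, add its concatenation to the dictionary, rewrite the corpus with the new index, and repeat. The standalone sketch below illustrates that loop on a toy byte string. It does not use the `Dictionary` class or the grader; `most_frequent_pair` and `merge_pair` are hypothetical helper names, and the tie-breaking (first pair seen wins) is an assumption rather than something the assignment specifies.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_pair(ids: List[int]) -> Tuple[int, int]:
    """Return the adjacent pair of ids that occurs most often (first seen wins ties)."""
    counts = Counter(zip(ids, ids[1:]))
    return max(counts, key=counts.get)


def merge_pair(ids: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`, scanning left to right."""
    merged = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(ids[i])
            i += 1
    return merged


# Base vocabulary: the 256 single-byte values.
vocab = [bytes([i]) for i in range(256)]

# Toy corpus, already converted to byte ids.
corpus = b"aaabdaaabac"
ids = list(corpus)

# Three merge steps; a real tokenizer would run many more.
for _ in range(3):
    pair = most_frequent_pair(ids)
    vocab.append(vocab[pair[0]] + vocab[pair[1]])
    ids = merge_pair(ids, pair, len(vocab) - 1)

print(vocab[256:])
print(ids)

# Concatenating the vocab items for the final ids gives back the corpus.
assert b"".join(vocab[i] for i in ids) == corpus
```

Each merge adds exactly one entry to the vocabulary and shortens the corpus, so the number of loop iterations directly controls the final vocabulary size.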