<a href="https://colab.research.google.com/github/Cohegen/gpt2-from-scratch/blob/main/gpt_2_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1 Introduction: Understanding Large Language Models through GPT-2

* Large language models (LLMs) are a class of deep learning models designed to understand and generate human-like text.
* They are built on the transformer architecture and are widely used in applications such as chatbots, translation, and text generation.


## 2.1 Understanding word embeddings

* Deep neural network models, including LLMs, cannot process raw text directly.
* Because text is categorical, it is not compatible with the mathematical operations used to implement and train neural networks. We therefore need a way to represent words as continuous-valued vectors.
* Converting data into a vector format is referred to as embedding.
* In simple terms, an embedding is a mapping from discrete objects, such as words, images, or even entire documents, to points in a continuous vector space. Its purpose is to convert nonnumeric data into a format that neural networks can process.
* Instead of treating each word as a unique symbol, embeddings map words into a continuous vector space where similar words are close together.
* For example, the vectors for "king" and "queen" or "run" and "jog" will be near each other because they share similar meanings. These embeddings are learned from data and capture semantic relationships, making them a fundamental building block in modern NLP models like GPT-2.
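
To make "close together" concrete, here is a minimal sketch using hand-picked 3-dimensional vectors. The numbers are invented for illustration (real embeddings are learned during training and have far more dimensions; the smallest GPT-2 model uses 768), and `cosine_similarity` is a helper defined here, not a library function.

```python
import math

# Toy embeddings: the values are made up for illustration only.
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "run":   [0.10, 0.20, 0.90],
    "jog":   [0.15, 0.10, 0.85],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_jog = cosine_similarity(toy_embeddings["king"], toy_embeddings["jog"])

# Related words should score higher than unrelated ones.
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs jog:   {sim_king_jog:.3f}")
```

With these toy values, the related pair scores much closer to 1.0 than the unrelated pair, which is the property the bullet points above describe.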

## 2.2 Tokenizing text

* LLMs take text as input, but before the words are mapped to embeddings they go through a stage known as tokenization.
* Tokenization is the process of breaking text into smaller units called tokens, which can be words, subwords, or even characters.
* This step is crucial because language models like GPT-2 don't understand raw text; they work with numbers.
* Tokenization converts text into a sequence of tokens that can be mapped to numerical IDs.
* The text we'll tokenize for LLM training is Dante's Inferno.
* You can find this text on the official Project Gutenberg website.
* Let's get coding

In [1]:
# Download Dante's Inferno from the Project Gutenberg website.
import urllib.request

url = 'https://www.gutenberg.org/cache/epub/41537/pg41537.txt'
file_path = "dante's-inferno.txt"
urllib.request.urlretrieve(url,file_path)

("dante's-inferno.txt", <http.client.HTTPMessage at 0x7c9d7f12ce10>)

* Let's load the `dante's-inferno.txt` file using Python's file handling utilities.

In [2]:
with open("dante's-inferno.txt",'r',encoding='utf-8') as f:
  raw_text = f.read()
print("Total number of characters:",len(raw_text))
print(raw_text[:99])

Total number of characters: 700670
﻿The Project Gutenberg eBook of The Divine Comedy of Dante Alighieri: The Inferno
    
This ebook i


* Our goal is to tokenize this 700,670-character text into individual words and special characters that we can turn into embeddings for LLM training.
* What options do we have for splitting the text? For illustration purposes, we can use Python's regular expression library `re`.

* Using a short sample text, we can call `re.split` with the following syntax to split the text on whitespace characters.

In [3]:
import re
text = "Hello, world. I am Antonius, nice to meet you."
result = re.split(r'(\s)',text)
print(result)

['Hello,', ' ', 'world.', ' ', 'I', ' ', 'am', ' ', 'Antonius,', ' ', 'nice', ' ', 'to', ' ', 'meet', ' ', 'you.']


* This simple tokenization scheme mostly works for separating the sample text into individual words; however, some words are still attached to punctuation characters that we want as separate entries.
* We also skip the common step of lowercasing all input text, because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization.


* Let's modify the regular expression so that it splits on whitespace (`\s`), commas, and periods (`[,.]`).

In [4]:
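# No spaces inside the pattern: a space there would be matched literally.
result = re.split(r'([,.]|\s)',text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'I', ' ', 'am', ' ', 'Antonius', ',', '', ' ', 'nice', ' ', 'to', ' ', 'meet', ' ', 'you', '.', '']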


Next, let's widen the range of punctuation our simple tokenizer can handle, for example question marks, quotation marks, and double dashes.

In [7]:
text = "Hello, world. Is attention is all you need-- a good research paper?"
result = re.split(r'([,.:;?_!"()\']|\s+|--)', text)
print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'Is', ' ', 'attention', ' ', 'is', ' ', 'all', ' ', 'you', ' ', 'need', '--', '', ' ', 'a', ' ', 'good', ' ', 'research', ' ', 'paper', '?', '']


Now that we have a basic tokenizer working, let's apply it to Dante's Inferno, dropping the empty strings and whitespace-only entries that `re.split` leaves behind.

In [8]:
preprocessed_text = re.split(r'([,.:;?_!"()\']|\s+|--)',raw_text)
preprocessed_text = [item.strip() for item in preprocessed_text if item.strip()]
print(len(preprocessed_text))

145180


The print statement shows that the text contains `145180` tokens (excluding whitespace).
Now let's print the first 70 tokens for a quick visual check.

In [9]:
print(preprocessed_text[:70])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Divine', 'Comedy', 'of', 'Dante', 'Alighieri', ':', 'The', 'Inferno', 'This', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'ebook', 'or', 'online', 'at']


## 2.3 Converting tokens into token IDs

* Now, let's convert these tokens from Python strings to integers to produce the token IDs.
* This is an intermediate step before converting the token IDs into embedding vectors.

* Since we have tokenized Dante's Inferno and assigned the result to a Python variable called `preprocessed_text`, let's create a list of all unique tokens and sort them alphabetically to determine the vocabulary size.

In [14]:
all_words = sorted(set(preprocessed_text))
vocab_size = len(all_words)
print(vocab_size)

13909


* Now that we know `vocab_size = 13909`, we create the vocabulary and print its first 500 entries.

In [18]:
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
  print(item)
  if i >= 499:  # stop after the first 500 entries (i counts from 0)
    break

('!', 0)
('"', 1)
('#41537]', 2)
('$1', 3)
('$5', 4)
("'", 5)
('(', 6)
(')', 7)
('***', 8)
(',', 9)
('-', 10)
('--', 11)
('.', 12)
('000', 13)
('1', 14)
('10', 15)
('100', 16)
('101', 17)
('102', 18)
('103', 19)
('1037', 20)
('104', 21)
('105', 22)
('106', 23)
('107', 24)
('108', 25)
('1085', 26)
('109', 27)
('10th', 28)
('11', 29)
('110', 30)
('1106', 31)
('111', 32)
('1115', 33)
('112', 34)
('113', 35)
('114', 36)
('115', 37)
('1152-1190', 38)
('116', 39)
('117', 40)
('118', 41)
('1180', 42)
('1185', 43)
('119', 44)
('1193', 45)
('1198', 46)
('12', 47)
('120', 48)
('121', 49)
('1215', 50)
('122', 51)
('1220', 52)
('123', 53)
('1238', 54)
('1239', 55)
('124', 56)
('1248', 57)
('1249', 58)
('125', 59)
('1250', 60)
('1252', 61)
('1258', 62)
('1259', 63)
('126', 64)
('1260', 65)
('1261', 66)
('1264', 67)
('1265', 68)
('1266', 69)
('1267', 70)
('1268', 71)
('1269', 72)
('127', 73)
('1270', 74)
('1271', 75)
('1273', 76)
('1275', 77)
('1277', 78)
('1278', 79)
('128', 80)
('1280', 81)
('1281

* We see that the dictionary maps each individual token to a unique integer label.
* Next, we will apply this vocabulary to convert new text into token IDs.

* Let's create a tokenizer class in Python with an `encode` method that splits text into tokens and maps each string to an integer via the vocabulary to produce token IDs.
* We also include a `decode` method that carries out the reverse integer-to-string mapping to convert token IDs back into text.

In [28]:
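import re
class SimpleTokenizer1:
  def __init__(self,vocab):
    self.str_to_int = vocab                            # token string -> ID
    self.int_to_str = {i:s for s,i in vocab.items()}   # ID -> token string

  def encode(self,text):
    # Split on punctuation, whitespace, and double dashes, keeping the delimiters
    preprocessed = re.split(r'([,.:;?_!"()\']|\s|--)', text)
    # Drop empty strings and whitespace-only entries
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    ids = [self.str_to_int[s] for s in preprocessed]   # raises KeyError for unknown tokens
    return ids

  def decode(self,ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    # Remove the space that the join inserted before punctuation
    text = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', text)

    return text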

* Let's test this class using a sample paragraph from Dante's Inferno

In [29]:
# Instantiate SimpleTokenizer1
tokenizer = SimpleTokenizer1(vocab)
text = """CANTO XXXIV. The Ninth Circle--the Fourth Ring or Judecca, the deepest point
of the Inferno and the Centre of the Universe--it is the place
of those treacherous to their Lords or Benefactors--Lucifer with
Judas, Brutus, and Cassius hanging from his mouths--passage
through the Centre of the Earth--ascent from the depths to the
light of the stars in the Southern Hemisphere,"""
ids = tokenizer.encode(text)
print(ids)

[763, 3412, 12, 3024, 2221, 942, 11, 12706, 1433, 2637, 10090, 1830, 9, 12706, 6385, 10524, 10012, 12706, 1763, 4624, 12706, 883, 10012, 12706, 3193, 11, 8933, 8924, 12706, 10451, 10012, 12769, 12969, 12857, 12710, 1958, 10090, 658, 11, 1969, 13708, 1829, 9, 739, 9, 4624, 843, 8244, 7856, 8442, 9748, 11, 10256, 12806, 12706, 883, 10012, 12706, 1233, 11, 4781, 7856, 12706, 6484, 12857, 12706, 9222, 10012, 12706, 12239, 8672, 12706, 2889, 1663, 9]


* Let's see if we can turn these token IDs back into text using the `decode` method.

In [30]:
print(tokenizer.decode(ids))

CANTO XXXIV. The Ninth Circle -- the Fourth Ring or Judecca, the deepest point of the Inferno and the Centre of the Universe -- it is the place of those treacherous to their Lords or Benefactors -- Lucifer with Judas, Brutus, and Cassius hanging from his mouths -- passage through the Centre of the Earth -- ascent from the depths to the light of the stars in the Southern Hemisphere,


* That works. Now let's apply it to a new text sample that is not contained in the training set:

In [31]:
text = "Hello, welcome to Bogota"
print(tokenizer.encode(text))

KeyError: 'Hello'

* Running this code raises the `KeyError` shown above.
* The error occurs because the word `Hello` does not appear in Dante's Inferno.
* Hence, it is not contained in the vocabulary.
* This highlights the need for large and diverse training sets to extend the vocabulary when working on LLMs.
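
Before encoding new text, it can help to list which tokens are missing from the vocabulary. The sketch below reuses the same split pattern as `SimpleTokenizer1`; the tiny `toy_vocab` and the helper name `find_unknown_tokens` are assumptions made for this example so that it runs on its own.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens in `text` that have no entry in `vocab`."""
    tokens = re.split(r'([,.:;?_!"()\']|\s|--)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Hypothetical miniature vocabulary standing in for the Inferno vocabulary.
toy_vocab = {",": 0, "to": 1, "welcome": 2}

unknown = find_unknown_tokens("Hello, welcome to Bogota", toy_vocab)
print(unknown)
```

Passing the real `vocab` dictionary instead of `toy_vocab` would report the words that trigger the `KeyError`.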

* Next, we'll test the tokenizer further on text that contains unknown words, and we'll look at special tokens that can provide further context for an LLM during training.

## 2.4 Adding special context tokens

* Our task now is to modify the tokenizer to handle unknown words.
* We also need to address the use and addition of special context tokens that can enhance a model's understanding of context or other relevant information.
* These special tokens include markers for unknown words and document boundaries.

* Let's modify the vocabulary to include two special tokens, `<|unk|>` and `<|endoftext|>`, by adding them to our list of all unique words.

In [32]:
all_tokens = sorted(list(set(preprocessed_text)))
all_tokens.append("<|endoftext|>") # Append as individual string
all_tokens.append("<|unk|>") # Append as individual string
vocab = {token:integer for integer,token in enumerate(all_tokens)}

print(len(vocab))

13911


* Now let's print the last 500 entries of the updated vocabulary:

In [33]:
for item in list(vocab.items())[-500:]:
  print(item)

('visage', 13411)
('visible', 13412)
('vision', 13413)
('visions', 13414)
('visit', 13415)
('visited', 13416)
('visiting', 13417)
('vital', 13418)
('vitally', 13419)
('vitiate', 13420)
('vivid', 13421)
('vividly', 13422)
('vizor', 13423)
('vobis', 13424)
('vocal', 13425)
('vocation', 13426)
('vogue', 13427)
('voice', 13428)
('voices', 13429)
('void', 13430)
('vol', 13431)
('volition', 13432)
('volume', 13433)
('voluntarily', 13434)
('volunteer', 13435)
('volunteers', 13436)
('vomiting', 13437)
('voracity', 13438)
('voted', 13439)
('vouches', 13440)
('vow', 13441)
('voyage', 13442)
('vulgar', 13443)
('wafted', 13444)
('wage', 13445)
('waged', 13446)
('wager', 13447)
('wages', 13448)
('wagged', 13449)
('waging', 13450)
('wags', 13451)
('wail', 13452)
('wailed', 13453)
('wailed[530]', 13454)
('wailing', 13455)
('wailings', 13456)
('wails', 13457)
('waist', 13458)
('wait', 13459)
('waiting', 13460)
('waits', 13461)
('wake', 13462)
('waketh', 13463)
('waking', 13464)
('walk', 13465)
('walks

* Based on the output, we can confirm that the special tokens have been successfully incorporated into the vocabulary.
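
Scanning a long printout is error-prone, so a direct lookup is a more reliable check. The sketch below builds a vocabulary the same way as the cell above, but from a three-word stand-in list (`toy_tokens` is an assumption for this example) so that it runs on its own.

```python
# Stand-in for sorted(set(preprocessed_text)); the words are arbitrary.
toy_tokens = sorted({"voice", "vision", "wail"})
toy_tokens.extend(["<|endoftext|>", "<|unk|>"])
toy_vocab = {token: integer for integer, token in enumerate(toy_tokens)}

# The special tokens were appended last, so they get the two highest IDs.
for special in ("<|endoftext|>", "<|unk|>"):
    print(special, "->", toy_vocab[special])
```

With the full vocabulary of 13911 entries, the same lookup should return 13909 for `<|endoftext|>` and 13910 for `<|unk|>`.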

* Now let's update our tokenizer.

In [34]:
class SimpleTokenizer2:
  def __init__(self,vocab):
    self.str_to_int = vocab
    self.int_to_str = {i:s for s,i in vocab.items()}

  def encode(self,text):
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)',text)
    preprocessed = [
        item.strip() for item in preprocessed if item.strip()
    ]
    preprocessed = [
        item if item in self.str_to_int else "<|unk|>" for item in preprocessed
    ] #replaces unknown words by <|unk|> tokens
    ids = [self.str_to_int[s] for s in preprocessed]
    return ids

  def decode(self,ids):
    text = " ".join([self.int_to_str[i] for i in ids])
    text = re.sub(r"\s+([,.:;?_!\"()'])", r"\1", text) # replaces spaces before the specified punctuation.

    return text

* Now let's test our new tokenizer.
* For this, we'll use a text sample that we build by concatenating two independent and unrelated sentences.

In [36]:
text1 = "In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple circuits."
text2 = "In deep learning, transformer is an architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table."
text =  "<|endoftext|>".join((text1,text2))
print(text)

In electrical engineering, a transformer is a passive component that transfers electrical energy from one electrical circuit to another circuit, or multiple circuits.<|endoftext|>In deep learning, transformer is an architecture based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.


* Now let's tokenize the sample text using `SimpleTokenizer2` with the vocabulary we created earlier in this section.

In [37]:
tokenizer = SimpleTokenizer2(vocab)
print(tokenizer.encode(text))

[1751, 13910, 7072, 9, 4316, 13910, 8924, 4316, 13910, 13910, 12704, 13910, 13910, 7065, 7856, 10049, 13910, 5630, 12857, 4647, 5630, 9, 10090, 13910, 5631, 12, 13910, 6382, 9141, 9, 13910, 8924, 4612, 13910, 4986, 10046, 12706, 13910, 4861, 13910, 9, 8672, 13623, 12694, 8924, 6053, 12857, 13910, 11197, 5414, 13910, 9, 4624, 6889, 13910, 8924, 6053, 8870, 4316, 13910, 13910, 13910, 7856, 4316, 13755, 13910, 12584, 12]


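* The list contains many `13910` IDs, which stand for unknown words. Note that `13909`, the ID of the `<|endoftext|>` separator, does not appear: the two sentences were joined without surrounding spaces, so the split pattern leaves `<|endoftext|>In` as a single token, which is not in the vocabulary and is therefore mapped to `<|unk|>`. Joining with `" <|endoftext|> "` (a space on each side) would let the tokenizer recognize the separator.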


* Let's detokenize the text for a quick sanity check.

In [38]:
print(tokenizer.decode(tokenizer.encode(text)))

In <|unk|> engineering, a <|unk|> is a <|unk|> <|unk|> that <|unk|> <|unk|> energy from one <|unk|> circuit to another circuit, or <|unk|> circuits. <|unk|> deep learning, <|unk|> is an <|unk|> based on the <|unk|> attention <|unk|>, in which text is converted to <|unk|> representations called <|unk|>, and each <|unk|> is converted into a <|unk|> <|unk|> <|unk|> from a word <|unk|> table.


* Comparing this detokenized text with the original input, we can see that the training text, Dante's Inferno, does not contain words such as `electrical` and `architecture`.
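
One detail is worth checking in the output above: the split pattern only breaks on whitespace and the listed punctuation, so a separator joined without surrounding spaces stays attached to the following word. The sketch below compares the two ways of joining; the sentences are shortened stand-ins, and the helper `split_tokens` is an assumption that mirrors only the splitting step of `SimpleTokenizer2.encode`.

```python
import re

PATTERN = r'([,.:;?_!"()\']|--|\s)'  # same pattern as in SimpleTokenizer2.encode

def split_tokens(text):
    """Split text into tokens, before any vocabulary lookup."""
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]

parts = ("One circuit.", "In deep learning.")

fused = split_tokens("<|endoftext|>".join(parts))     # no spaces around separator
spaced = split_tokens(" <|endoftext|> ".join(parts))  # one space on each side

print(fused)
print(spaced)
```

In the fused case the token `<|endoftext|>In` is not in the vocabulary, so `SimpleTokenizer2` maps it to `<|unk|>`; in the spaced case the separator and `In` come out as separate tokens.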

* Depending on the LLM, researchers also consider additional special tokens such as the following:

  * [BOS] (beginning of sequence) - This token marks the start of a text. It signals to the LLM where a piece of content begins.
  * [EOS] (end of sequence) - This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to `<|endoftext|>`.
  * [PAD] (padding) - When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended, or "padded", using the [PAD] token, up to the length of the longest text in the batch.
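
The padding step can be sketched in a few lines. The token IDs and the value of `pad_id` below are made up for illustration, and the function name `pad_batch` is an assumption, not part of any library.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in `batch` to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Hypothetical token IDs for three texts of different lengths.
batch = [[5, 9, 2], [7], [3, 4]]
padded = pad_batch(batch, pad_id=0)
print(padded)
```

After padding, every sequence has the same length, so the batch can be stacked into a single rectangular array.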

## 2.5 Byte pair encoding

* A byte pair encoding (BPE) tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT.
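
The core idea of BPE is to start from small units and repeatedly merge the most frequent adjacent pair into a new token. The sketch below shows a single merge step on characters. It is a simplified illustration, not GPT-2's actual tokenizer (which works on bytes and ships with a pretrained merge table), and the helper names are assumptions.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("abababc")            # start from single characters
pair = most_frequent_pair(symbols)   # the pair to merge in this step
symbols_after = merge_pair(symbols, pair)
print(pair, symbols_after)
```

Repeating this step builds up subword tokens. Because a byte-level BPE tokenizer can always fall back to smaller pieces, it can represent words it has never seen without needing an `<|unk|>` token.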

* We first need to install `tiktoken`, a library that implements BPE, using the command below:

In [39]:
!pip install tiktoken



* Once it is installed, we instantiate the BPE tokenizer from tiktoken as follows:

In [41]:
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')