# **Tokenization for Generative Pretrained Transformers (GPTs)**

This notebook covers the text tokenization techniques used to train Generative Pretrained Transformers (GPTs), specifically GPT-2 and GPT-3.

---

## **Table of Contents**
1. [Introduction](#introduction)
2. [Loading Raw Text](#loading-raw-text)
3. [Word-Level Tokenization](#word-level-tokenization)
4. [Byte Pair Encoding (BPE)](#byte-pair-encoding-bpe)

**Install all the required libraries or packages.**

In [29]:
!pip install --quiet --upgrade pypdf tiktoken==0.6.0

In [30]:
# import all the required libraries, modules and packages for this project.
import os
from pypdf import PdfReader
import re
import tiktoken

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<a id="introduction"></a>
## **Introduction**

In simple terms, **tokenization** is the process of converting raw text (web data, books, or any other documents) into sequences of integers that a neural network can process during training. This notebook explores two tokenization approaches:

- **Word-Level Tokenization**: This tokenization approach splits text into words and punctuation. It is the simplest type of tokenization.
- **Byte Pair Encoding (BPE)**: This tokenization approach splits text into subword units, which keeps token sequences compact and handles rare words that were not seen when the vocabulary was built. Variants of BPE are used by many of today's leading Large Language Models (LLMs), including the OpenAI GPT family and the Meta Llama family.

---

<a id="loading-raw-text"></a>
## **Loading Raw Text**

First, we load the raw text from PDF files and convert it to `.txt` files. A model cannot be trained without data, and here that data is the raw text extracted from the book.

In [8]:
pdf_path = "/content/drive/MyDrive/Datasets/Book/"

### **PDF to Text Conversion**

Before we can tokenize text, we need to extract it from PDF files. We've created two utility functions:

- **`pdf_to_text()`**: Extracts text from a single PDF file and optionally saves it as a `.txt` file
- **`batch_pdf_to_text()`**: Processes multiple PDFs in a folder at once, converting each to text

These functions use the `pypdf` library to read PDFs page-by-page and combine the extracted text into a single string.

In [13]:
def pdf_to_text(pdf_path, output_txt_path=None, save_to_file=True):
    """reads pdf and converts to text, optionally saving to .txt file."""

    # check if pdf exists.
    if not os.path.exists(pdf_path):
        print(f"error: '{pdf_path}' not found.")
        return None

    try:
        # open pdf in binary mode.
        with open(pdf_path, 'rb') as file:

            # create pdf reader.
            reader = PdfReader(file)
            num_pages = len(reader.pages)
            print(f"processing {num_pages} pages...")

            # extract all text.
            full_text = []
            for page_num in range(num_pages):
                page = reader.pages[page_num]
                text = page.extract_text()

                if text:
                    full_text.append(text)
                else:
                    print(f"warning: page {page_num + 1} had no extractable text.")

            # combine all pages.
            combined_text = "\n\n".join(full_text)

            # save to file if requested.
            if save_to_file:
                if output_txt_path is None:
                    output_txt_path = os.path.splitext(pdf_path)[0] + '.txt'

                with open(output_txt_path, 'w', encoding='utf-8') as txt_file:
                    txt_file.write(combined_text)

                print(f"saved to: {output_txt_path}")

            return combined_text

    except Exception as e:
        print(f"error reading pdf: {e}")
        return None

In [14]:
def batch_pdf_to_text(folder_path, output_folder=None):
    """converts all pdfs in a folder to text files."""

    # check if folder exists.
    if not os.path.exists(folder_path):
        print(f"error: folder '{folder_path}' not found.")
        return

    # create output folder if needed.
    if output_folder and not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # get all pdf files.
    pdf_files = [f for f in os.listdir(folder_path) if f.endswith('.pdf')]

    if not pdf_files:
        print("no pdf files found in folder.")
        return

    print(f"found {len(pdf_files)} pdf files. converting...")

    combined_texts = []

    # process each pdf.
    for pdf_file in pdf_files:
        pdf_path = os.path.join(folder_path, pdf_file)

        if output_folder:
            output_path = os.path.join(output_folder, os.path.splitext(pdf_file)[0] + '.txt')
        else:
            output_path = None

        print(f"\nprocessing: {pdf_file}")

        combined_texts.append(pdf_to_text(pdf_path, output_path))

    print("\nbatch conversion complete!")

    return combined_texts

In [31]:
# batch convert pdfs to text.
raw_text_contents = batch_pdf_to_text(pdf_path)

found 1 pdf files. converting...

processing: Outliers_Malcolm_Gladwell.pdf
processing 249 pages...
saved to: /content/drive/MyDrive/Datasets/Book/Outliers_Malcolm_Gladwell.txt

batch conversion complete!


In [32]:
# get first pdf's text content.
raw_text_content = raw_text_contents[0]

# display character count and preview.
print("Total number of characters:", len(raw_text_content), "\n")
print(raw_text_content[:99])

Total number of characters: 460987 

OUTLIERS
The Story of Success
MALCOLM GLADWELL
BACK BAY BOOKS
LITTLE, BROWN AND COMPANY
NEW YORK   


### **Text Preprocessing**

We preprocess the raw text by splitting it on punctuation and whitespace using regex patterns. This function tokenizes the text and removes empty strings, giving us a clean list of words and punctuation marks for building our vocabulary.

In [18]:
def preprocess_text(raw_text, max_preview=30):
    """splits text on punctuation and whitespace, removes empty strings."""

    # split on common punctuation and whitespace.
    pattern = r'([,.:;?_!"()\'\[\]{}\/\\|—–-]+|\.\.\.|\s+)'

    tokens = re.split(pattern, raw_text)

    # remove whitespace and filter empty strings.
    preprocessed = [item.strip() for item in tokens if item.strip()]

    # preview first n tokens.
    if max_preview:
        print(f"first {max_preview} tokens:")
        print(preprocessed[:max_preview])
        print(f"\ntotal word level tokens: {len(preprocessed)}\n")

    return preprocessed

In [33]:
# preprocess and tokenize raw text.
preprocessed = preprocess_text(raw_text_content, max_preview=30)

# get unique words and vocabulary size.
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

first 30 tokens:
['OUTLIERS', 'The', 'Story', 'of', 'Success', 'MALCOLM', 'GLADWELL', 'BACK', 'BAY', 'BOOKS', 'LITTLE', ',', 'BROWN', 'AND', 'COMPANY', 'NEW', 'YORK', '•', 'BOSTON', '•', 'LONDON', 'Begin', 'Reading', 'Table', 'of', 'Contents', 'Reading', 'Group', 'Guide', 'Copyright']

total word level tokens: 92758

11324


In [34]:
# get all unique tokens and sort them.
all_tokens = sorted(list(set(preprocessed)))

# add special tokens to vocabulary.
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

# create vocabulary mapping from tokens to integers.
vocab = {token: integer for integer, token in enumerate(all_tokens)}
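
The vocabulary construction above can be illustrated on a toy token list. This is a minimal sketch; `toy_tokens`, `toy_vocab`, and `unk_id` are hypothetical names used only for this example. It shows that the two special tokens take the last two ids, which is consistent with `<|unk|>` receiving id 11325 after the 11324 unique tokens found in the book.

```python
# Toy token list standing in for the preprocessed book tokens (illustrative only).
toy_tokens = ["the", "cat", "sat", ",", "the", "end", "."]

# Deduplicate and sort, then append the special tokens at the end.
unique_tokens = sorted(set(toy_tokens))
unique_tokens.extend(["<|endoftext|>", "<|unk|>"])

# Map each token to its position in the sorted list.
toy_vocab = {token: idx for idx, token in enumerate(unique_tokens)}
unk_id = toy_vocab["<|unk|>"]

# Look up a sentence; "dog" is not in the vocabulary and falls back to <|unk|>.
ids = [toy_vocab.get(tok, unk_id) for tok in ["the", "dog", "sat", "."]]

print(toy_vocab)
print(ids)
```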

<a id="word-level-tokenization"></a>
## **Word-Level Tokenization**

Word-level tokenization is a straightforward approach where text is split into individual words and punctuation marks. Each unique word in the vocabulary is assigned a specific numerical ID. This method works well for languages with clear word boundaries and limited vocabulary sizes. However, it struggles with rare or unseen words, which are typically replaced with a special `<|unk|>` (unknown) token. This is known as the **out-of-vocabulary (OOV)** problem. While simple to implement and understand, word-level tokenization can produce large vocabularies and handles morphologically rich languages poorly.

### **Building Custom Word Tokenizer**

This Word Tokenizer splits text based on whitespace and punctuation marks.

In [21]:
class WordTokenizer:
    """tokenizes text into ids and decodes ids back to text. it focuses on word-level tokenization."""

    def __init__(self, vocab):
        """initializes tokenizer with vocabulary mappings."""

        self.tok_to_int = vocab
        self.int_to_tok = {integer: token for token, integer in vocab.items()}
        self.pattern = r'([,.:;?_!"()\'\[\]{}\/\\|—–-]+|\.\.\.|\s+)'

    def encode(self, text):
        """converts text to list of token ids."""

        # split on punctuation and whitespace.
        preprocessed = re.split(self.pattern, text)

        # remove empty strings and whitespace.
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # replace unknown tokens with <|unk|>.
        preprocessed = [token if token in self.tok_to_int else "<|unk|>" for token in preprocessed]

        # convert tokens to ids.
        ids = [self.tok_to_int[tok] for tok in preprocessed]

        # return ids
        return ids


    def decode(self, ids):
        """converts list of token ids back to text."""

        # map ids to tokens.
        tokens = [self.int_to_tok[token_id] for token_id in ids]

        # join tokens with spaces.
        text = " ".join(tokens)

        # remove the spaces that the join inserted before punctuation.
        text = re.sub(r'\s+([,.:;?_!"()\'\[\]{}\/\\|—–-])', r'\1', text)

        return text

In [22]:
# initialize the wordtokenizer class.
tokenizer = WordTokenizer(vocab)

# sample text from (and outside) outliers book.
text1 = "Chris Langan's mother was from San Francisco and was estranged from her family."

# sample text with unknown words for testing.
text2 = "do you know about smartphone cryptocurrency?"


# combine texts with special separator token.
text = " <|endoftext|> ".join((text1, text2))

print(f"original text: {text}")

# convert text to token ids.
encoded_text = tokenizer.encode(text)
print(f"\nencoded text: {encoded_text}")

# convert token ids back to text.
decoded_text = tokenizer.decode(encoded_text)
print(f"\ndecoded text: {decoded_text}")

original text: Chris Langan's mother was from San Francisco and was estranged from her family. <|endoftext|> do you know about smartphone cryptocurrency?

encoded text: [915, 1905, 11325, 11325, 7500, 10581, 5993, 2676, 1342, 3561, 10581, 5524, 5993, 6370, 5700, 29, 11325, 11325, 11325, 11325, 11325, 5192, 10835, 6940, 3291, 11325, 11325, 430]

decoded text: Chris Langan <|unk|> <|unk|> mother was from San Francisco and was estranged from her family. <|unk|> <|unk|> <|unk|> <|unk|> <|unk|> do you know about <|unk|> <|unk|>?
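
In the output above, the `<|endoftext|>` separator comes back as five `<|unk|>` tokens. The likely cause is that the splitting pattern treats `|` as punctuation, so the separator is broken into `<`, `|`, `endoftext`, `|`, and `>` before the vocabulary lookup. One possible remedy, sketched below under that assumption, is to match the special tokens before the punctuation alternatives. `SPECIAL_TOKENS`, `protected_pattern`, and `split_keeping_specials` are hypothetical names, and the punctuation class is an abbreviated copy of the notebook's pattern.

```python
import re

SPECIAL_TOKENS = ["<|endoftext|>", "<|unk|>"]

# Abbreviated copy of the notebook's punctuation pattern (assumption: dash variants omitted).
PUNCT_PATTERN = r'[,.:;?_!"()\'\[\]{}\/\\|-]+|\s+'

# Special tokens are listed first so the regex engine tries them before punctuation.
special_alternatives = "|".join(re.escape(tok) for tok in SPECIAL_TOKENS)
protected_pattern = re.compile(f"({special_alternatives}|{PUNCT_PATTERN})")


def split_keeping_specials(text):
    """Split text into tokens while keeping special tokens in one piece."""
    pieces = protected_pattern.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


sample = "her family. <|endoftext|> do you know?"
print(split_keeping_specials(sample))
```

With this change the separator would map to its own vocabulary id instead of to `<|unk|>`, provided the same pattern is used both when building the vocabulary and when encoding.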


<a id="byte-pair-encoding-bpe"></a>
## **Byte Pair Encoding (BPE) Tokenizer**

The GPT series uses a more sophisticated technique to convert text into subword units for training: the Byte Pair Encoding (BPE) tokenizer. Compared with word-level tokenization, BPE needs no unknown-token fallback and keeps the vocabulary at a fixed, manageable size while still producing compact token sequences.

Implementing BPE from scratch is fairly involved and time-consuming, so I used the open-source tiktoken library instead. Its `encode`/`decode` interface mirrors the WordTokenizer implemented above. The encoding it provides is reversible and lossless, so decoding the token ids recovers the original text. It works on arbitrary text, even text that is not present in the tokenizer's training data. It also compresses the text: the token sequence is typically shorter than the byte sequence of the original text.

BPE also tends to expose common subwords to the model. For instance, "ing" is a common subword in English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing" (instead of, e.g., "enc" and "oding"). Because the model then sees the "ing" token repeatedly in different contexts, it can generalize better and pick up grammatical patterns more easily.
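
To make the "ing" example concrete, here is a minimal sketch of how BPE merges could be learned on a toy corpus. It is a simplified, character-level illustration and not tiktoken's implementation: GPT-2's tokenizer works on bytes, pre-splits text with a regex, and ships with pre-trained merges. The function names and the toy corpus are assumptions made for this example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_toy_bpe(corpus, num_merges):
    """Learn `num_merges` merges by repeatedly fusing the most frequent pair."""
    # Start with every word split into single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


toy_corpus = "in sing ring king eating walking"
merges, words = train_toy_bpe(toy_corpus, num_merges=2)

print(merges)
print(list(words))
```

On this corpus the first merge fuses `i` and `n`, and the second fuses `in` and `g`, so every word ending in "ing" now shares a single `ing` symbol.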

In [23]:
# instantiate bpe tokenizer from tiktoken.
tokenizer_encoder = tiktoken.get_encoding("gpt2")

# convert text to token ids.
encoded_text = tokenizer_encoder.encode(text, allowed_special={"<|endoftext|>"})
print(f"\nbpe tokenizer's encoded text: {encoded_text}")

# convert token ids back to text.
decoded_text = tokenizer_encoder.decode(encoded_text)
print(f"\nbpe tokenizer's decoded text: {decoded_text}")

# display the gpt-2 bpe vocabulary size.
print(f"\nthe vocabulary size of gpt-2 tokenizer is {tokenizer_encoder.n_vocab}")


bpe tokenizer's encoded text: [15645, 16332, 272, 338, 2802, 373, 422, 2986, 6033, 290, 373, 44585, 422, 607, 1641, 13, 220, 50256, 466, 345, 760, 546, 11745, 20210, 30]

bpe tokenizer's decoded text: Chris Langan's mother was from San Francisco and was estranged from her family. <|endoftext|> do you know about smartphone cryptocurrency?

the vocabulary size of gpt-2 tokenizer is 50257


In [24]:
class BPETokenizer:
    """tokenizes text using byte pair encoding (bpe). uses tiktoken's gpt-2 tokenizer."""

    def __init__(self, model_name="gpt2"):
        """initializes bpe tokenizer with specified model."""

        # load tiktoken encoder for the specified model.
        self.tokenizer = tiktoken.get_encoding(model_name)

        # obtain and store the tokenizer's vocab size.
        self.vocab_size = self.tokenizer.n_vocab

    def encode(self, text, allowed_special=None):
        """converts text to list of token ids."""

        # set default allowed special tokens.
        if allowed_special is None:
            allowed_special = {"<|endoftext|>"}

        # convert text to token ids.
        ids = self.tokenizer.encode(text, allowed_special=allowed_special)

        return ids

    def decode(self, ids):
        """converts list of token ids back to text."""

        # convert token ids back to text.
        text = self.tokenizer.decode(ids)

        return text

    def get_vocab_size(self):
        """returns the vocabulary size of the tokenizer."""

        return self.vocab_size

In [25]:
# initialize tokenizer.
tokenizer = BPETokenizer()

# get total character count.
total_characters = len(raw_text_content)

# get total token count after encoding.
total_tokens = len(tokenizer.encode(raw_text_content))

print("The total characters:", total_characters)
print("The total tokens:", total_tokens)

The total characters: 460987
The total tokens: 115003
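
The counts reported in this notebook give a rough sense of how compact the BPE encoding is. The short calculation below uses only the figures printed above (460987 characters, 115003 BPE tokens, 92758 word-level tokens). Note that it counts characters rather than bytes, so it is only an approximation of the byte-level compression ratio.

```python
# Figures printed earlier in this notebook.
total_characters = 460_987
total_bpe_tokens = 115_003
total_word_tokens = 92_758

# Average number of characters covered by one BPE token.
chars_per_bpe_token = total_characters / total_bpe_tokens

# How many BPE tokens are needed, on average, per word-level token.
bpe_per_word_token = total_bpe_tokens / total_word_tokens

print(f"characters per BPE token: {chars_per_bpe_token:.2f}")
print(f"BPE tokens per word-level token: {bpe_per_word_token:.2f}")
```

This works out to about 4.01 characters per BPE token and about 1.24 BPE tokens per word-level token. BPE therefore produces a somewhat longer sequence than the word-level tokenizer on this book, in exchange for never needing an `<|unk|>` fallback.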


## **Next Steps**

In the next notebook (`model_architecture.ipynb`), we'll build the GPT-series model architecture that takes these tokens as input.