# Preparing the Tiny Shakespeare Dataset (BPE Tokenization)

Welcome! This notebook guides you through preparing the Tiny Shakespeare dataset for training a language model using Byte Pair Encoding (BPE), specifically with the tokenizer used by GPT-2.

**What is Tokenization?**

Tokenization is a fundamental first step in Natural Language Processing (NLP). It involves breaking down a piece of text into smaller units called "tokens." These tokens are then converted into numerical representations that machine learning models can understand.

**Byte Pair Encoding (BPE)**

BPE is a popular tokenization algorithm that strikes a balance between word-level and character-level tokenization. Here's a high-level idea:
*   It starts with a base vocabulary of individual symbols: characters in the classic formulation, raw bytes in the byte-level variant that GPT-2 uses.
*   It iteratively merges the most frequent pair of adjacent tokens (bytes or characters) to form new, longer tokens.
*   This process continues for a set number of merges, resulting in a vocabulary that includes common words as single tokens and breaks down rare words into sub-word units.
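
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not GPT-2's actual training procedure: it works on characters rather than bytes, it skips the pre-splitting of text into words that real tokenizers perform, and the function names (`most_frequent_pair`, `merge_pair`, `train_toy_bpe`) are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_toy_bpe(text, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


tokens, merges = train_toy_bpe("low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)  # ['low', ' ', 'low', 'e', 'r', ' ', 'low', 'e', 's', 't']
```

After only two merges, the frequent sequence `low` has become a single token, while the rarer endings are still spelled out character by character.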

**BPE vs. Character-Level Tokenization**

Unlike the character-level approach (where every single character is a token, e.g., `H`, `e`, `l`, `l`, `o`), BPE is more efficient:
*   **Manages Vocabulary Size:** It can represent a large corpus with a much smaller vocabulary than one token per unique word would require, while still being far more expressive than a vocabulary of single characters.
*   **Handles Out-of-Vocabulary (OOV) Words:** Rare or new words can often be represented as a sequence of known sub-word tokens, rather than being mapped to a generic "unknown" token.
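
To make the out-of-vocabulary point concrete, here is a small sketch of how an ordered list of merges could be applied to a word that was never seen when the merges were learned. The merge list `HYPOTHETICAL_MERGES` and the helper `apply_merges` are invented for illustration; they are not GPT-2's real merge table or `tiktoken`'s implementation.

```python
# Hypothetical merges, listed in the order they were learned.
HYPOTHETICAL_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]


def apply_merges(word, merges):
    """Encode `word` by applying each learned merge in order."""
    tokens = list(word)
    for pair in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


unseen_word = "lowering"
subword_tokens = apply_merges(unseen_word, HYPOTHETICAL_MERGES)
print(subword_tokens)     # ['low', 'er', 'i', 'n', 'g']
print(len(unseen_word))   # 8 tokens at character level
print(len(subword_tokens))  # 5 tokens with the toy merges
```

The unseen word is still fully representable: the parts covered by known merges collapse into sub-word tokens and the rest falls back to single characters, so no "unknown" token is needed.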

This notebook will use `tiktoken`, OpenAI's library, to apply the pre-trained GPT-2 BPE tokenizer to the Shakespeare dataset.

## 1. Imports

Let's import the necessary Python libraries:

*   `os`: For interacting with the operating system, like constructing file paths and checking for file existence.
*   `requests`: To make HTTP requests for downloading the dataset from the internet.
*   `tiktoken`: OpenAI's fast BPE tokenizer library. We'll use it to get the GPT-2 tokenizer and encode our text.
*   `numpy`: A library for numerical operations, especially useful for creating and handling arrays of token IDs efficiently.

In [None]:
import os
import requests
import tiktoken
import numpy as np

## 2. Setup Script Directory

We define `script_dir` using `os.getcwd()`. This assumes you are running this notebook from its location within the `data/shakespeare/` directory. All output files (`input.txt`, `train.bin`, `val.bin`) will be saved relative to this directory.

In [None]:
script_dir = os.getcwd()
# For consistency with the original prepare.py, we could also use:
# script_dir = os.path.dirname(__file__) 
# However, __file__ is not defined in interactive notebook environments by default.
# os.getcwd() works well if the notebook is run from its directory.
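
If the same code needs to run both as a script and inside a notebook, one possible approach is to prefer `__file__` when it exists and fall back to the working directory otherwise. The helper name `resolve_script_dir` is hypothetical and not part of the original `prepare.py`.

```python
import os


def resolve_script_dir():
    """Directory of the running script, or the working directory in a notebook."""
    script_file = globals().get("__file__")  # absent in most interactive sessions
    if script_file:
        return os.path.dirname(os.path.abspath(script_file))
    return os.getcwd()


script_dir = resolve_script_dir()
print(script_dir)
```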

## 3. Download the Dataset

We'll download the Tiny Shakespeare dataset, which is a single text file containing many of Shakespeare's works.

The code defines `input_file_path` (as `input.txt` in our `script_dir`). It then checks if this file already exists. If not, it downloads the data from a URL hosted on `raw.githubusercontent.com` (from Andrej Karpathy's char-rnn project) and saves it. We specify `encoding='utf-8'` to ensure correct handling of text characters.

In [None]:
input_file_path = os.path.join(script_dir, 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    print(f"Downloading dataset from {data_url}...")
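    # Fetch first and fail loudly on HTTP errors, so a failed request
    # never leaves an empty or partial input.txt behind.
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)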
    print("Dataset downloaded.")
else:
    print("Dataset already exists locally.")

## 4. Load and Split the Data

First, we read the entire content of `input.txt` into a string variable `data`.

Then, we split this data into a training set and a validation set. Evaluating the model on text it never saw during training gives a more honest estimate of how well it generalizes. The split is made by character position, before tokenization, and is contiguous rather than shuffled.
*   `train_data`: The first 90% of the dataset, used for training the model.
*   `val_data`: The remaining 10% of the dataset, used for validation.

In [None]:
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()

n = len(data)
split_idx = int(n * 0.9)  # character index where the validation split begins
train_data = data[:split_idx]
val_data = data[split_idx:]

print(f"Total characters in dataset: {n:,}")
print(f"Training data characters: {len(train_data):,}")
print(f"Validation data characters: {len(val_data):,}")
print(f"First 100 characters of training data: {train_data[:100]}")
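
The slicing logic is easy to check on a tiny string. The sketch below uses a hypothetical helper, `split_text`, that mirrors the 90/10 split above; it shows that the two parts never overlap and that joining them restores the original text.

```python
def split_text(text, train_fraction=0.9):
    """Split `text` into a leading training part and a trailing validation part."""
    cut = int(len(text) * train_fraction)
    return text[:cut], text[cut:]


toy_text = "abcdefghij"  # 10 characters
toy_train, toy_val = split_text(toy_text)
print(repr(toy_train))  # 'abcdefghi'
print(repr(toy_val))    # 'j'
```

Because the cut is a character index, it can land in the middle of a word or line. For a corpus of this size that affects at most a token or two at the boundary.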

## 5. BPE Tokenization with `tiktoken`

Now, we'll tokenize our text data using the GPT-2 BPE tokenizer provided by the `tiktoken` library.

1.  **Get the Encoder:**
    `enc = tiktoken.get_encoding("gpt2")`
    This line fetches the pre-trained GPT-2 tokenizer. `enc` is an `Encoding` object that contains the vocabulary and the logic to encode text into tokens and decode tokens back into text.
    The GPT-2 tokenizer has a vocabulary size of 50,257: 256 single-byte base tokens, 50,000 learned merges, and one special token (`<|endoftext|>`). Because every possible byte is in the vocabulary, any input string can be encoded.

2.  **Encode the Data:**
    `train_ids = enc.encode_ordinary(train_data)`
    `val_ids = enc.encode_ordinary(val_data)`
    The `encode_ordinary()` method converts a string into a list of integer token IDs while ignoring special tokens. If the text happens to contain a string such as `<|endoftext|>`, it is tokenized as ordinary characters rather than mapped to the reserved special-token ID. For training on raw text like Shakespeare, this is what we want.

Let's see an example:

In [None]:
enc = tiktoken.get_encoding("gpt2")

# Vocabulary size and max token value for GPT-2
print(f"GPT-2 vocabulary size: {enc.n_vocab}") # Should be 50257
print(f"GPT-2 max token value: {enc.max_token_value}") # Should be 50256

# Example of encoding and decoding
sample_text = "First Citizen:"
encoded_sample = enc.encode_ordinary(sample_text)
decoded_sample = enc.decode(encoded_sample)

print(f"\nOriginal sample: '{sample_text}'")
print(f"Encoded sample (token IDs): {encoded_sample}")
print(f"Decoded sample: '{decoded_sample}'")

# Apply encoding to our actual train and validation data
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)

## 6. Inspect Tokenized Output

After encoding, `train_ids` and `val_ids` are lists of integers. Let's check how many tokens we have in each split.

You'll notice that the number of tokens is significantly smaller than the number of characters. For the Tiny Shakespeare dataset (approx. 1.1 million characters):
*   Character-level tokenization would result in ~1.1 million tokens.
*   BPE tokenization (GPT-2) results in far fewer tokens (around 300k-350k), because common words and sequences of characters are represented by single tokens. This makes the input sequences shorter for the language model, which can be more computationally efficient.
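
As a rough back-of-the-envelope check, the approximate figures above translate into an average number of characters per token. The helper `chars_per_token` is made up for this example, and the inputs are the rounded estimates quoted in the text, not measured values.

```python
def chars_per_token(num_chars, num_tokens):
    """Average number of characters covered by one token."""
    return num_chars / num_tokens


approx_chars = 1_100_000  # approximate dataset size quoted above
low = chars_per_token(approx_chars, 350_000)
high = chars_per_token(approx_chars, 300_000)
print(f"{low:.2f} to {high:.2f} characters per token")  # 3.14 to 3.67 characters per token
```

The exact counts printed by the next cell give the precise ratio for your copy of the dataset.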

In [None]:
print(f"Number of tokens in train_ids: {len(train_ids):,}")
print(f"Number of tokens in val_ids: {len(val_ids):,}")

## 7. Export to Binary Files

For efficient loading during model training, we save these token ID lists as binary files.

**Steps:**
1.  **Convert to NumPy Arrays:** We convert `train_ids` and `val_ids` (which are Python lists) into NumPy arrays.
2.  **Specify Data Type (`dtype`):** We use `dtype=np.uint16`. This means each token ID will be stored as an unsigned 16-bit integer.
    *   `unsigned` means it can only represent non-negative numbers (which token IDs are).
    *   `16-bit` means it can store values from 0 to 2<sup>16</sup> - 1 (i.e., 0 to 65,535).
    The GPT-2 tokenizer has `enc.max_token_value` of 50,256. Since 50,256 is less than 65,535, every token ID fits in `np.uint16`. At 2 bytes per token, this takes half the space of a 32-bit type like `np.int32` or `np.uint32`.
3.  **Save to File:** The `.tofile()` method of a NumPy array writes the raw array data (just the numbers) to a binary file, with no header recording the dtype or shape. This results in `train.bin` and `val.bin`. Whoever reads these files must therefore already know that they contain `uint16` values.

These `.bin` files are compact and can be read very quickly by the training script, as it just needs to load a stream of these 16-bit integers.
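
The range and size reasoning above can be checked with a little arithmetic. This sketch uses only plain Python; `fits_in_uint16` and `storage_bytes` are hypothetical helpers, and the token count of 300,000 is a round illustrative figure, not a measured one.

```python
UINT16_MAX = 2**16 - 1  # 65,535


def fits_in_uint16(token_ids):
    """True if every ID can be stored as an unsigned 16-bit integer."""
    return all(0 <= t <= UINT16_MAX for t in token_ids)


def storage_bytes(num_tokens, bytes_per_token):
    """Size of a raw binary file holding `num_tokens` fixed-width IDs."""
    return num_tokens * bytes_per_token


GPT2_MAX_TOKEN_VALUE = 50_256  # enc.max_token_value, as stated above
print(fits_in_uint16([0, GPT2_MAX_TOKEN_VALUE]))  # True
print(fits_in_uint16([70_000]))                   # False

approx_tokens = 300_000  # illustrative round number
print(storage_bytes(approx_tokens, 2))  # 600000 bytes as 16-bit integers
print(storage_bytes(approx_tokens, 4))  # 1200000 bytes as 32-bit integers
```

A check like `fits_in_uint16` is worth running before the cast if you ever switch to a tokenizer with a larger vocabulary, since IDs above 65,535 cannot be stored in this format.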

In [None]:
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)

train_ids.tofile(os.path.join(script_dir, 'train.bin'))
val_ids.tofile(os.path.join(script_dir, 'val.bin'))

print("train.bin and val.bin saved.")
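
Because the files have no header, they can be read back by anything that knows the element type. The sketch below uses the standard library's `array` module instead of NumPy to illustrate this on a small temporary file; `read_token_file` is a hypothetical helper and the demo IDs are arbitrary values in the `uint16` range. It assumes a platform where C `unsigned short` is 2 bytes, which the code asserts.

```python
import array
import os
import tempfile

BYTES_PER_TOKEN = 2  # one uint16 per token ID
assert array.array("H").itemsize == BYTES_PER_TOKEN  # "H" = C unsigned short


def read_token_file(path):
    """Read a headerless file of native-endian uint16 token IDs into a list."""
    num_tokens = os.path.getsize(path) // BYTES_PER_TOKEN
    ids = array.array("H")
    with open(path, "rb") as f:
        ids.fromfile(f, num_tokens)
    return ids.tolist()


demo_ids = [5962, 22307, 25, 50256]  # arbitrary example values
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.bin")
    with open(demo_path, "wb") as f:
        array.array("H", demo_ids).tofile(f)
    file_size = os.path.getsize(demo_path)
    restored = read_token_file(demo_path)

print(file_size)  # 8 bytes: 4 tokens * 2 bytes each
print(restored)   # [5962, 22307, 25, 50256]
```

In a NumPy-based training script, `np.fromfile(path, dtype=np.uint16)` or `np.memmap(path, dtype=np.uint16, mode='r')` reads the same data. A quick sanity check on the real outputs is that each file's size in bytes equals twice its token count.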

## 8. Conclusion

We have successfully prepared the Tiny Shakespeare dataset for training a language model using GPT-2's Byte Pair Encoding.

The key outputs in your `data/shakespeare/` directory are:
*   `input.txt`: The original raw dataset.
*   `train.bin`: The BPE-tokenized training data (90% of the dataset), stored as a binary sequence of 16-bit integers.
*   `val.bin`: The BPE-tokenized validation data (10% of the dataset), also stored as 16-bit integers.

Unlike the character-level preparation, we don't need to save a separate `meta.pkl` for the tokenizer itself. Calling `tiktoken.get_encoding("gpt2")` in the training or sampling scripts reconstructs the exact same GPT-2 tokenizer: the encoding is predefined in `tiktoken`, which fetches the vocabulary files on first use and caches them locally.

These files are now ready to be used for training a nanoGPT model that expects BPE-tokenized input.