# Preparing the Tiny Shakespeare Dataset (Token-Level) for nanoGPT

This notebook walks through the `prepare.py` script in `data/shakespeare/`, which prepares the Tiny Shakespeare dataset for training nanoGPT models with token-level encoding (GPT-2 BPE). Explanations from the accompanying documentation are integrated alongside each step of the script.

## 1. Introduction and Setup

### Purpose and Scope (from documentation)
The `prepare.py` script for the Shakespeare dataset transforms the raw text of Shakespeare's works into tokenized binary files. This specific version uses the standard GPT-2 BPE tokenizer. These files (`train.bin` and `val.bin`) can be efficiently loaded during model training and are part of nanoGPT's overall data preparation strategy.

### Overview of Data Preparation (from documentation)
Data preparation in nanoGPT, as exemplified by this script, converts raw text into arrays of integer token IDs. These are then stored in binary files. This method allows for efficient memory-mapping during training, meaning the entire dataset doesn't need to fit in RAM. The Shakespeare token-level dataset is one of the smaller datasets provided as an example with nanoGPT.
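
To make the memory-mapping point concrete, here is a minimal sketch using NumPy only. The file name `toy.bin` and the toy IDs are hypothetical stand-ins for the real token files; the idea is that `np.memmap` exposes the file as an array without reading it all into RAM.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy data standing in for real token IDs.
toy_ids = np.arange(1000, dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp_dir:
    bin_path = os.path.join(tmp_dir, "toy.bin")
    toy_ids.tofile(bin_path)  # raw 2-byte values in native byte order, no header

    # Map the file instead of reading it; data is paged in only when accessed.
    mapped = np.memmap(bin_path, dtype=np.uint16, mode="r")
    mapped_len = len(mapped)
    window = np.array(mapped[100:108])  # copy one small slice into RAM
    del mapped  # release the mapping before the temporary directory is removed
```

Only the requested slice is copied into memory here, which is why the same approach scales to files much larger than this dataset.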

In [None]:
import os
import requests
import tiktoken
import numpy as np
# The original script resolves paths relative to the script via os.path.dirname(__file__).
# In a notebook, __file__ is not defined, so we use the current working directory instead,
# which should be 'data/shakespeare/' if you are running this notebook from there as intended.
script_dir = '.'

## 2. Downloading and Loading the Dataset

### The Tiny Shakespeare Dataset (from documentation & script)
This script uses the "Tiny Shakespeare" dataset, a collection of Shakespeare's works. It's a relatively small dataset, making it good for quick experimentation.

The first step is to download the dataset if `input.txt` does not already exist locally. The data is fetched from a URL pointing to a plain text file on GitHub.

In [None]:
input_file_path = os.path.join(script_dir, 'input.txt') # Use script_dir defined earlier
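if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()  # fail loudly rather than saving an error page as the dataset
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)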
    print(f"Downloaded and saved dataset to {input_file_path}")
else:
    print(f"Dataset {input_file_path} already exists.")

## 3. Data Splitting and Tokenization

### Splitting the Data (from script)
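The text is read from `input.txt` and split by character position into training and validation sets. The first 90% of the characters are used for training, and the remaining 10% are used for validation. The split is contiguous, with no shuffling, and it happens before tokenization.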
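
The split can be illustrated on a toy string. This is a small sketch with hypothetical names (`toy_text`, `split_at`) that mirrors the slicing used in the script:

```python
# A 100-character stand-in for the contents of input.txt.
toy_text = "abcdefghij" * 10

n_chars = len(toy_text)
split_at = int(n_chars * 0.9)  # index of the first validation character

toy_train = toy_text[:split_at]  # first 90% of the characters
toy_val = toy_text[split_at:]    # remaining 10%, taken from the end of the text
```

Because the slices are adjacent, concatenating `toy_train` and `toy_val` reproduces the original text, so no character is lost or duplicated.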

### Tokenization Process (from documentation and script)
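The script uses the standard GPT-2 Byte Pair Encoding (BPE) tokenizer from the `tiktoken` library.
The `enc.encode_ordinary()` method converts the text of the train and validation sets into lists of integer token IDs. It ignores special tokens, so a string such as `<|endoftext|>` appearing in the text would be encoded as ordinary text.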

In [None]:
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

## 4. Data Format, Storage, and Exporting to Binary Files

### Data Format and Storage (from documentation)
As with other nanoGPT datasets, the tokenized Shakespeare data is stored as arrays of integer token IDs in binary files:
* Training data: `train.bin`
* Validation data: `val.bin`
* Data type: `np.uint16` (GPT-2's vocabulary has 50,257 tokens, so the largest token ID, 50256, is below 2^16 = 65,536).
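
The dtype choice can be checked directly. The sketch below assumes the GPT-2 vocabulary size of 50,257; the helper `to_uint16` is hypothetical and not part of the script. It refuses IDs that would not fit in 16 bits.

```python
import numpy as np

GPT2_VOCAB_SIZE = 50257               # GPT-2 BPE vocabulary; IDs run from 0 to 50256
uint16_max = np.iinfo(np.uint16).max  # largest value a uint16 can hold

def to_uint16(ids):
    """Convert token IDs to uint16, rejecting values outside the representable range."""
    arr = np.asarray(ids, dtype=np.int64)
    if arr.size and (arr.min() < 0 or arr.max() > uint16_max):
        raise ValueError("token ID out of range for uint16")
    return arr.astype(np.uint16)

safe = to_uint16([0, GPT2_VOCAB_SIZE - 1])  # both ends of the GPT-2 ID range fit
```

A tokenizer with a vocabulary larger than 65,536 entries would need a wider dtype such as `np.uint32`, at twice the storage per token.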

The documentation table for Shakespeare (word) indicates:
* Training Tokens: ~302K
* Validation Tokens: ~36K
* Format: GPT-2 BPE tokens

These binary files are memory-mapped during training for efficient data access.

### Exporting to Files (from script)
The token ID lists (`train_ids` and `val_ids`) are converted to NumPy arrays with `dtype=np.uint16`. These arrays are then written directly to `.bin` files using the `tofile()` method.

In [None]:
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(script_dir, 'train.bin')) # Use script_dir
val_ids.tofile(os.path.join(script_dir, 'val.bin'))   # Use script_dir
print(f"Finished writing train.bin and val.bin to {script_dir}")
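
As a sanity check, a written file can be read back and compared with the array it came from. The sketch below uses a temporary file and arbitrary example IDs rather than the real `train.bin`; the same `np.fromfile` call should apply to the real files, provided the dtype matches.

```python
import os
import tempfile

import numpy as np

toy_ids = np.array([464, 2068, 7586, 21831], dtype=np.uint16)  # arbitrary example IDs

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.bin")
    toy_ids.tofile(path)                            # raw values only, no header or shape
    size_bytes = os.path.getsize(path)              # 2 bytes per uint16 token
    restored = np.fromfile(path, dtype=np.uint16)   # dtype must be supplied by the reader

round_trip_ok = np.array_equal(toy_ids, restored)
```

Since `tofile()` stores no metadata, the reader has to know the dtype in advance; dividing the file size by 2 gives the token count.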

## 5. Final Output and Statistics

The script produces two binary files in the current directory (`data/shakespeare/`):
* `train.bin`: Contains the token IDs for the training data.
* `val.bin`: Contains the token IDs for the validation data.

The script output reports 301,966 tokens for `train.bin` and 36,059 tokens for `val.bin`, which matches the approximate counts given in the documentation.

These files are ready for use with nanoGPT's training script.
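
For intuition about how a training loop might consume these files, here is a NumPy-only sketch of drawing random input/target windows from a memory-mapped token file. It is an approximation for illustration, not nanoGPT's actual training code: the function `sample_batch`, its parameters, and the toy file are all hypothetical.

```python
import os
import tempfile

import numpy as np

def sample_batch(data, batch_size, block_size, rng):
    """Draw random windows; each target row is its input row shifted by one token."""
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([np.asarray(data[i:i + block_size], dtype=np.int64) for i in starts])
    y = np.stack([np.asarray(data[i + 1:i + 1 + block_size], dtype=np.int64) for i in starts])
    return x, y

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train.bin")
    np.arange(500, dtype=np.uint16).tofile(path)  # toy stand-in for train.bin

    data = np.memmap(path, dtype=np.uint16, mode="r")
    x, y = sample_batch(data, batch_size=4, block_size=8, rng=np.random.default_rng(0))
    del data  # release the mapping before the temporary directory is removed
```

Each row of `y` holds the next-token targets for the corresponding row of `x`, which is the usual setup for next-token prediction.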