# Preparing the Tiny Shakespeare Dataset (Character-Level) for nanoGPT

This notebook walks through the `prepare.py` script in `data/shakespeare_char/`. The script prepares the Tiny Shakespeare dataset for training nanoGPT models with character-level encoding, in which each distinct character is its own token. Explanations from the accompanying documentation are included alongside each step.

## 1. Introduction and Setup

### Purpose and Scope (from documentation)
The `prepare.py` script for the character-level Shakespeare dataset transforms the raw text of Shakespeare's works into two binary files of token IDs (`train.bin` and `val.bin`) plus a metadata file (`meta.pkl`). Unlike preparation scripts that use BPE tokenization (such as GPT-2's), this one maps each character directly to an integer. The three files are then used for training character-based language models.

### Overview of Character-Level Preparation (from documentation)
Data preparation in nanoGPT, as exemplified by this script, converts raw text into arrays of integer token IDs. For character-level models, the 'tokens' are individual characters.

Key characteristics of this preparation (from documentation):
*   Vocabulary size: 65 (based on unique characters in the Tiny Shakespeare dataset).
*   Simple integer mapping for each character.
*   Metadata (character-to-ID mapping, vocabulary size) saved in `meta.pkl` for encoding/decoding during generation.
*   A 90/10 train/validation split is used.

In [None]:
import os
import pickle
import requests
import numpy as np
# The original script derives its paths from os.path.dirname(__file__).
# In a notebook, __file__ is not defined, so we use the current working directory instead,
# which should be 'data/shakespeare_char/' if you are running this notebook from there as intended.
script_dir = '.'

## 2. Downloading the Dataset

### The Tiny Shakespeare Dataset (from script)
The script begins by downloading the Tiny Shakespeare dataset if it's not already present locally as `input.txt`. This is the same raw text file used by the BPE token-level Shakespeare preparation.

In [None]:
input_file_path = os.path.join(script_dir, 'input.txt') # Use script_dir defined earlier
if not os.path.exists(input_file_path):
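    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()  # fail loudly instead of saving an error page as the dataset
    with open(input_file_path, 'w', encoding='utf-8') as f: # explicit encoding, independent of the platform default
        f.write(response.text)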
    print(f"Downloaded and saved dataset to {input_file_path}")
else:
    print(f"Dataset {input_file_path} already exists.")

with open(input_file_path, 'r', encoding='utf-8') as f: # read with the same explicit encoding used when writing
    data = f.read()
print(f"Length of dataset in characters: {len(data):,}")

## 3. Character Vocabulary Creation and Encoding

### Building the Vocabulary (from script & documentation)
This is the core of character-level tokenization:
1.  The set of all unique characters in the dataset is extracted.
2.  This set is sorted to ensure consistent mapping.
3.  The vocabulary size is determined by the number of unique characters.
4.  Two dictionaries are created:
    *   `stoi` (string-to-integer): Maps each character to a unique integer ID.
    *   `itos` (integer-to-string): Maps each integer ID back to its character.

The documentation confirms the vocabulary size for this dataset is 65.

In [None]:
# get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("All the unique characters:", ''.join(chars))
print(f"Vocab size: {vocab_size:,}")

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Example encoding/decoding
print("Example encoding of 'hello':", encode('hello'))
print("Example decoding of [46, 43, 50, 50, 53]:", decode([46, 43, 50, 50, 53]))
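
One property of this mapping is worth noting: `encode` looks each character up in `stoi` directly, so a character that never appears in the source text has no ID and raises a `KeyError`. The sketch below illustrates this on a toy corpus. The names `toy_text`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are hypothetical and used only for illustration; they are not part of `prepare.py`.

```python
# Toy corpus standing in for the Shakespeare text (illustrative assumption).
toy_text = "hello world"

# Build the vocabulary the same way as above: sorted unique characters.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """Map each character of s to its integer ID."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map a list of integer IDs back to a string."""
    return ''.join(toy_itos[i] for i in ids)

# Round trip: decoding the encoded text recovers the original string.
round_trip = toy_decode(toy_encode(toy_text))
print("Round trip matches:", round_trip == toy_text)

# A character that never appeared in the corpus has no ID.
try:
    toy_encode("hello!")
    unseen_ok = True
except KeyError as err:
    unseen_ok = False
    print("Unseen character:", err)
```

For Tiny Shakespeare this rarely matters during training, since the vocabulary is built from the full text, but it can matter when encoding a custom prompt at generation time.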

### Data Splitting and Encoding (from script)
The dataset is split into training (90%) and validation (10%) sets. The `encode` function is then used to convert the raw character strings of these splits into lists of integer IDs.

In [None]:
# create the train and validation splits
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode both to integers
train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train has {len(train_ids):,} tokens (characters)")
print(f"val has {len(val_ids):,} tokens (characters)")
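
The split is a plain slice at `int(n * 0.9)`, so the training set is the first 90% of the text and the validation set is the final 10%, with no shuffling. The sketch below checks that arithmetic. It assumes a dataset length of 1,115,394 characters, which is the figure commonly reported for Tiny Shakespeare; treat it as an assumption and compare it with the length printed earlier in this notebook. `split_sizes` is a hypothetical helper, not a function from the script.

```python
def split_sizes(n, train_fraction=0.9):
    """Return (train_len, val_len) for a contiguous split at int(n * train_fraction)."""
    cut = int(n * train_fraction)
    return cut, n - cut

# Assumed dataset length; compare against the value printed by the notebook.
assumed_n = 1_115_394
train_len, val_len = split_sizes(assumed_n)
print(f"train: {train_len:,}  val: {val_len:,}")

# On a toy string, the two slices are contiguous and together cover the whole text.
toy = "abcdefghij"
cut = int(len(toy) * 0.9)
toy_train, toy_val = toy[:cut], toy[cut:]
print(repr(toy_train), repr(toy_val))
```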

## 4. Exporting to Binary Files and Saving Metadata

### Data Format and Storage (from documentation)
The integer ID sequences are stored in binary files:
*   Training data: `train.bin`
*   Validation data: `val.bin`
*   Data type: `np.uint16` (2 bytes per token). This is suitable because every ID (0 to 64 for a vocabulary of 65) is far below the `uint16` maximum of 65,535.

The documentation table for Shakespeare (char) indicates:
*   Training Tokens: ~1M
*   Validation Tokens: ~111K
*   Format: Character-level
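
As a quick check of the `np.uint16` choice, the sketch below confirms that the largest ID (vocabulary size minus one) fits in the type's range and estimates the on-disk size at two bytes per token. The token counts are the rounded figures from the list above, so the byte totals are approximate; `vocab_size_char` and `approx_tokens` are illustrative names.

```python
import numpy as np

vocab_size_char = 65  # vocabulary size stated above
uint16_info = np.iinfo(np.uint16)
print("uint16 range:", uint16_info.min, "to", uint16_info.max)

# The largest token ID is vocab_size - 1, which must fit in uint16.
fits = (vocab_size_char - 1) <= uint16_info.max
print("IDs fit in uint16:", fits)

# Each uint16 token occupies 2 bytes on disk.
bytes_per_token = np.dtype(np.uint16).itemsize
approx_tokens = {"train.bin": 1_000_000, "val.bin": 111_000}  # rounded counts
approx_bytes = {name: count * bytes_per_token for name, count in approx_tokens.items()}
for name, size in approx_bytes.items():
    print(f"{name}: about {size / 1e6:.2f} MB")
```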

### Exporting Binary Files (from script)
The lists of token IDs are converted to NumPy arrays and saved to `.bin` files.

In [None]:
# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(script_dir, 'train.bin')) # Use script_dir
val_ids.tofile(os.path.join(script_dir, 'val.bin'))   # Use script_dir
print(f"Finished writing train.bin and val.bin to {script_dir}")
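
Because `tofile` writes raw bytes with no header, the reader must supply the dtype when loading the file back. The sketch below round-trips a small array through a temporary directory using `np.fromfile` and `np.memmap`. The file name `demo.bin` and the sample IDs are illustrative assumptions, not files produced by the script.

```python
import os
import tempfile
import numpy as np

sample_ids = np.array([46, 43, 50, 50, 53], dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.bin')  # hypothetical file name
    sample_ids.tofile(demo_path)

    # Raw binary: the file size is exactly 2 bytes per uint16 token.
    file_size = os.path.getsize(demo_path)

    # Load everything into memory; the dtype must match what was written.
    loaded = np.fromfile(demo_path, dtype=np.uint16)

    # Or map the file lazily, which helps with files too large for RAM.
    mapped = np.memmap(demo_path, dtype=np.uint16, mode='r')
    mapped_copy = np.array(mapped)  # copy out before the file is removed
    del mapped  # release the mapping so the temp directory can be cleaned up

print(file_size, loaded.tolist(), mapped_copy.tolist())
```

Reading with the wrong dtype would not raise an error; it would silently reinterpret the bytes, so the dtype is effectively part of the file format.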

### Saving Metadata (from script & documentation)
A crucial step for character-level models is saving the vocabulary and the encoding/decoding mappings. They are stored in `meta.pkl` using Python's `pickle` module.
The metadata includes:
*   `vocab_size`: The number of unique characters.
*   `itos`: The integer-to-string mapping.
*   `stoi`: The string-to-integer mapping.

This `meta.pkl` file is essential later for generating text from a trained character-level model, as it allows converting the model's output (integer IDs) back into readable characters.

In [None]:
# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(os.path.join(script_dir, 'meta.pkl'), 'wb') as f: # Use script_dir
    pickle.dump(meta, f)
print(f"Saved metadata to {os.path.join(script_dir, 'meta.pkl')}")
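
At generation time, the saved mappings can be loaded to rebuild the encoder and decoder. The sketch below shows that round trip with a toy `meta` dictionary written to a temporary file. The toy vocabulary and the names `loaded_meta`, `encode_fn`, and `decode_fn` are assumptions for illustration; only the three dictionary keys come from the cell above.

```python
import os
import pickle
import tempfile

# Toy metadata with the same three keys as the real meta.pkl.
toy_vocab = sorted(set("to be"))
toy_meta = {
    'vocab_size': len(toy_vocab),
    'itos': {i: ch for i, ch in enumerate(toy_vocab)},
    'stoi': {ch: i for i, ch in enumerate(toy_vocab)},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = os.path.join(tmp_dir, 'meta.pkl')
    with open(meta_path, 'wb') as f:
        pickle.dump(toy_meta, f)
    # Only unpickle files you created or trust; pickle can execute code on load.
    with open(meta_path, 'rb') as f:
        loaded_meta = pickle.load(f)

def encode_fn(s):
    """Encode a string with the loaded character-to-ID mapping."""
    return [loaded_meta['stoi'][c] for c in s]

def decode_fn(ids):
    """Decode a list of IDs with the loaded ID-to-character mapping."""
    return ''.join(loaded_meta['itos'][i] for i in ids)

ids = encode_fn("bet")
print(ids, decode_fn(ids))
```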

## 5. Final Output and Statistics

The script produces three files in the `data/shakespeare_char/` directory:
* `train.bin`: Character IDs for the training data (approx. 1 million tokens).
* `val.bin`: Character IDs for the validation data (approx. 111,000 tokens).
* `meta.pkl`: Contains the vocabulary size and both mappings (`stoi` for character-to-integer, `itos` for integer-to-character).

These files are now ready for training a character-level language model with nanoGPT.