# Preparing the OpenWebText Dataset for nanoGPT

This notebook walks through the `prepare.py` script, which processes the OpenWebText dataset for training nanoGPT models. Each step is paired with the relevant explanation from the provided documentation.

## 1. Introduction and Setup

### Purpose and Scope (from documentation)
The `prepare.py` script transforms raw text from the OpenWebText dataset into tokenized binary files that can be loaded efficiently during model training. It is one step of nanoGPT's data preparation pipeline, which is designed to handle datasets too large to process in memory at once.

### Overview of Data Preparation (from documentation)
Data preparation in nanoGPT, as exemplified by this script, converts raw text into arrays of integer token IDs. These are then stored in binary files (`train.bin` and `val.bin`). This method allows for efficient memory-mapping during training, meaning the entire dataset doesn't need to fit in RAM.

In [None]:
import os
from tqdm import tqdm
import numpy as np
import tiktoken
from datasets import load_dataset
# prepare.py uses os.path.dirname(__file__) to locate its output directory.
# In a notebook, __file__ is not defined, so we use the current working directory instead,
# which should be 'data/openwebtext/' if you are running this notebook from there as intended.
script_dir = '.'
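
If the same code needs to run both as a script and inside a notebook, one option is to fall back to the working directory only when `__file__` is missing. This is a minimal sketch of that idea, not something `prepare.py` itself does:

```python
import os

try:
    # Defined when the code runs as a .py file.
    script_dir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    # Not defined in a notebook kernel, so fall back to the working directory.
    script_dir = os.getcwd()

print(script_dir)
```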

## 2. Loading and Splitting the Dataset

### The OpenWebText Dataset (from documentation)
The OpenWebText dataset is a large corpus of web text, built as an open reproduction of the WebText dataset used to train the original GPT-2 model. It contains approximately 8 million documents.

The first step in `prepare.py` is to download (if not already cached by HuggingFace `datasets`) and load the OpenWebText dataset. Because OpenWebText ships with only a `train` split, the script then carves a small validation split out of the training data.

In [None]:
# number of workers in .map() call
# good number to use is ~order number of cpu cores // 2
num_proc = 8

# number of workers in load_dataset() call
# best number might be different from num_proc above as it also depends on NW speed.
# it is better than 1 usually though
num_proc_load_dataset = num_proc

# takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)

# owt by default only contains the 'train' split, so create a test split
split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
split_dataset['val'] = split_dataset.pop('test') # rename the test split to val

# this results in:
# >>> split_dataset
# DatasetDict({
#     train: Dataset({
#         features: ['text'],
#         num_rows: 8009762
#     })
#     val: Dataset({
#         features: ['text'],
#         num_rows: 4007
#     })
# })
print(split_dataset)
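
The split sizes shown in the comment above can be sanity-checked with a little arithmetic. The sketch below assumes the held-out count is the requested fraction rounded up, which reproduces the numbers in the comment; it does not call the `datasets` library.

```python
import math

n_docs = 8_013_769   # document count quoted in the loading cell
test_size = 0.0005   # fraction passed to train_test_split

# Assumption: the validation count is the fraction of documents, rounded up.
n_val = math.ceil(n_docs * test_size)
n_train = n_docs - n_val

print(n_train, n_val)  # → 8009762 4007
```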

## 3. Tokenization

### Tokenization Process (from documentation and script)
Once the dataset is loaded and split, the text is tokenized. This script uses the GPT-2 Byte Pair Encoding (BPE) tokenizer provided by the `tiktoken` library.
The `process` function in the script handles tokenization:
1. It encodes the text into token IDs using `enc.encode_ordinary()`, which ignores special tokens.
2. It appends an end-of-text (`eot_token`) to each document. For GPT-2 BPE, this token is `50256`.

The `split_dataset.map()` function applies this tokenization process in parallel to all documents in the train and validation splits.

In [None]:
enc = tiktoken.get_encoding("gpt2")
def process(example):
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
    ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
    # note: I think eot should be prepended not appended... hmm. it's called "eot" though...
    out = {'ids': ids, 'len': len(ids)}
    return out

# tokenize the dataset
tokenized = split_dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the splits",
    num_proc=num_proc,
)
print(tokenized)
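
To see the shape of the records that `process` produces without downloading the tokenizer, here is a small sketch with a stand-in encoder. `toy_encode` is hypothetical (one ID per UTF-8 byte) and is not how GPT-2 BPE works; only the "encode, then append the end-of-text ID" structure mirrors the cell above.

```python
EOT_TOKEN = 50256  # GPT-2 end-of-text ID, as stated above

def toy_encode(text):
    # Hypothetical stand-in for enc.encode_ordinary: one ID per UTF-8 byte.
    return list(text.encode("utf-8"))

def process(example):
    ids = toy_encode(example["text"])
    ids.append(EOT_TOKEN)  # mark the end of this document
    return {"ids": ids, "len": len(ids)}

records = [process({"text": t}) for t in ["hi", "abc"]]

# Concatenating the documents gives one token stream with EOT as the delimiter.
stream = [tok for r in records for tok in r["ids"]]
print(stream)  # → [104, 105, 50256, 97, 98, 99, 50256]
```

Because every document ends with the same ID, document boundaries stay recoverable after all documents are concatenated into a single array.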

## 4. Data Format, Storage, and Writing to Binary Files

### Data Format and Storage (from documentation)
All datasets in nanoGPT are prepared and stored as arrays of integer token IDs in binary files:
* Training data is stored in `train.bin`.
* Validation data is stored in `val.bin`.
* The data type is typically `np.uint16` because the maximum token ID in GPT-2's vocabulary (50256) is less than 2^16.

During training, these binary files are memory-mapped for efficient access. This allows the system to train on datasets that might not fit entirely in RAM.
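
The `np.uint16` choice can be checked directly. The sketch below also shows why the check matters: casting an out-of-range value to `uint16` wraps silently, so a tokenizer with a larger vocabulary would need a wider type such as `np.uint32` (the value 70000 is a made-up example).

```python
import numpy as np

max_token_id = 50256  # largest GPT-2 token ID, as stated above
info = np.iinfo(np.uint16)

fits = info.min <= max_token_id <= info.max
bytes_per_token = np.dtype(np.uint16).itemsize
print(fits, info.max, bytes_per_token)  # → True 65535 2

# A hypothetical ID above 65535 wraps around instead of raising an error.
wrapped = int(np.array([70000], dtype=np.int64).astype(np.uint16)[0])
print(wrapped)  # → 4464
```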

### Writing to Files (from script)
The script iterates through the tokenized train and validation sets. For each set:
1. It calculates the total length of all token sequences (`arr_len`).
2. It creates a memory-mapped NumPy array (`np.memmap`) with the appropriate filename (`train.bin` or `val.bin`), data type (`np.uint16`), and shape.
3. It writes the token IDs into this memory-mapped array in batches for efficiency.
4. Finally, `arr.flush()` ensures all data is written to disk.

In [None]:
for split, dset in tokenized.items():
    arr_len = np.sum(dset['len'], dtype=np.uint64)
    filename = os.path.join(script_dir, f'{split}.bin') # Use script_dir defined earlier
    dtype = np.uint16 # (can do since enc.max_token_value == 50256 is < 2**16)
    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
    total_batches = 1024

    idx = 0
    for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
        # Batch together samples for faster write
        batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
        arr_batch = np.concatenate(batch['ids'])
        # Write into mmap
        arr[idx : idx + len(arr_batch)] = arr_batch
        idx += len(arr_batch)
    arr.flush()
    print(f"Finished writing {filename}")
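
The same write pattern can be tried on a toy scale with NumPy alone. In this sketch the four hard-coded arrays are hypothetical stand-ins for tokenized documents, and plain list slicing plays the role of `dset.shard(..., contiguous=True)`; the file goes to a temporary directory.

```python
import os
import tempfile
import numpy as np

# Hypothetical tokenized documents, each ending with the GPT-2 EOT ID.
docs = [
    np.array([10, 11, 12, 50256], dtype=np.uint16),
    np.array([20, 21, 50256], dtype=np.uint16),
    np.array([30, 50256], dtype=np.uint16),
    np.array([40, 41, 42, 43, 50256], dtype=np.uint16),
]
arr_len = sum(len(d) for d in docs)  # total tokens across all documents

filename = os.path.join(tempfile.mkdtemp(), "toy.bin")
arr = np.memmap(filename, dtype=np.uint16, mode="w+", shape=(arr_len,))

total_batches = 2  # the script uses 1024; 2 is enough for four toy documents
idx = 0
for batch_idx in range(total_batches):
    # Contiguous slices of the document list stand in for dataset shards.
    start = batch_idx * len(docs) // total_batches
    stop = (batch_idx + 1) * len(docs) // total_batches
    arr_batch = np.concatenate(docs[start:stop])
    arr[idx : idx + len(arr_batch)] = arr_batch
    idx += len(arr_batch)
arr.flush()

# Reopen read-only to confirm that document order and token values survived.
readback = np.memmap(filename, dtype=np.uint16, mode="r")
print(readback.tolist())
```

Contiguous shards matter here: they keep documents in their original order, so `idx` can simply advance by the length of each batch.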

## 5. Final Output and Statistics

The script produces two binary files:
* `train.bin`: Approximately 17GB, containing around 9 billion tokens.
* `val.bin`: Approximately 8.5MB, containing around 4 million tokens.

These files are now ready to be used by the `train.py` script in nanoGPT.
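
As a rough cross-check, file size should be about two bytes per token, since each token is stored as a `np.uint16`. The sketch below uses the rounded token counts from the list above, so expect the estimates to differ slightly from the quoted sizes.

```python
BYTES_PER_TOKEN = 2  # each token is stored as np.uint16

def size_in_bytes(n_tokens):
    # The files are flat arrays with no header, so size is tokens times width.
    return n_tokens * BYTES_PER_TOKEN

train_bytes = size_in_bytes(9_000_000_000)  # "around 9 billion tokens"
val_bytes = size_in_bytes(4_000_000)        # "around 4 million tokens"

print(f"train.bin ~ {train_bytes / 2**30:.1f} GiB")
print(f"val.bin   ~ {val_bytes / 2**20:.1f} MiB")
```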

### Reading the Binary Files (from script comments)
To read these binary files later, for example with NumPy, you can use:
```python
import numpy as np

# Open the file read-only; data is paged in from disk only when it is indexed.
m = np.memmap('train.bin', dtype=np.uint16, mode='r')
```
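
Once a file is memory-mapped, a training loop only needs to slice out the windows it wants. The following is a NumPy-only sketch of drawing random input/target windows from such a file. `get_batch` here is a hypothetical helper, the toy file stands in for `train.bin`, and nanoGPT's actual training code is not reproduced.

```python
import os
import tempfile
import numpy as np

# Hypothetical toy file standing in for train.bin: token IDs 0..99 as uint16.
path = os.path.join(tempfile.mkdtemp(), "toy_train.bin")
np.arange(100, dtype=np.uint16).tofile(path)

def get_batch(path, batch_size, block_size, rng):
    data = np.memmap(path, dtype=np.uint16, mode="r")
    # Random start offsets, leaving room for block_size inputs plus one shifted target.
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in starts]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in starts]).astype(np.int64)
    return x, y

x, y = get_batch(path, batch_size=4, block_size=8, rng=np.random.default_rng(0))
print(x.shape, y.shape)  # → (4, 8) (4, 8)
```

Each target row is the input row shifted one position to the right, which is the next-token prediction setup. Only the sliced windows are read from disk, so the file can be much larger than RAM.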

## 6. Custom Dataset Preparation (from documentation)

The documentation also provides general steps for preparing a custom dataset for nanoGPT:
1.  Load or download your text data.
2.  Choose a tokenization method:
    *   GPT-2 BPE tokenization (recommended for most cases).
    *   Character-level tokenization (for smaller datasets or specific applications).
3.  Tokenize the text and convert to integer IDs.
4.  Split into training and validation sets.
5.  Save as binary files using `numpy.ndarray.tofile()` or memory-mapped arrays.

The binary format allows nanoGPT's training process to efficiently load and process the data, regardless of dataset size.
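
The steps above can be sketched for the character-level case as follows. The corpus string, the 90/10 split, and the temporary output directory are all illustrative assumptions, not values taken from the documentation.

```python
import os
import tempfile
import numpy as np

# 1. Load the text (a hypothetical in-memory corpus here).
text = "hello world, hello nanoGPT\n" * 20

# 2. Character-level tokenization: one integer ID per distinct character.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# 3. Convert the text to integer IDs.
ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# 4. Split into training and validation sets (90/10 is an assumed ratio).
split = int(len(ids) * 0.9)
train_ids, val_ids = ids[:split], ids[split:]

# 5. Save as flat binary files.
out_dir = tempfile.mkdtemp()
train_ids.tofile(os.path.join(out_dir, "train.bin"))
val_ids.tofile(os.path.join(out_dir, "val.bin"))

# Round-trip check: read the file back and decode the first few tokens.
restored = np.fromfile(os.path.join(out_dir, "train.bin"), dtype=np.uint16)
decoded = "".join(itos[int(i)] for i in restored[:11])
print(decoded)  # → hello world
```

With a character-level vocabulary, the `stoi`/`itos` mappings also need to be saved alongside the binary files, since they are required to decode model output later.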