# Preparing the OpenWebText Dataset for Language Modeling

Welcome! This notebook walks through preparing the OpenWebText dataset for training a GPT-style language model with nanoGPT.

**What is OpenWebText?**

OpenWebText is a large, open-source corpus of high-quality web content. It was created as an open alternative to the private WebText dataset used to train OpenAI's GPT-2 model. It contains text extracted from URLs shared on Reddit, filtered for quality. The dataset is substantial: around 38 GB of text, comparable to the roughly 40 GB WebText corpus that GPT-2 was trained on. The Hugging Face version occupies more space on disk, e.g., ~54GB in the cache for the raw download.

**Handling Large Datasets**

Unlike the smaller Shakespeare dataset, OpenWebText is too large to be comfortably processed entirely in RAM on most machines. This notebook (and the underlying `prepare.py` script) demonstrates techniques for handling such large datasets efficiently:

*   **Hugging Face `datasets` Library:** This library provides powerful tools for downloading, loading, processing, and managing large datasets with features like memory-mapping and multiprocessing.
*   **`numpy.memmap`:** This NumPy feature allows us to treat a file on disk as if it were an in-memory array. This is crucial for creating large binary files of tokenized data without needing to fit the entire array into RAM.

This notebook will guide you through downloading OpenWebText, tokenizing it using the GPT-2 BPE tokenizer, and saving it into `train.bin` and `val.bin` files suitable for training nanoGPT.

## 1. Imports

Let's start by importing the necessary Python libraries:

*   `os`: For operating system interactions, like constructing file paths.
*   `tqdm`: A library for creating smart progress bars. It's very helpful for tracking the progress of long-running operations, especially with large datasets.
*   `numpy` (as `np`): The fundamental package for numerical computation in Python. We'll use it for array manipulations and, critically, for `np.memmap`.
*   `tiktoken`: OpenAI's fast BPE tokenizer library. We'll use it to get the GPT-2 tokenizer.
*   `datasets.load_dataset`: The primary function from the Hugging Face `datasets` library used to download and load datasets from the Hugging Face Hub or local files.

In [None]:
import os
from tqdm import tqdm
import numpy as np
import tiktoken
from datasets import load_dataset

# Determine the script directory (current working directory for a notebook)
script_dir = os.getcwd()
print(f"Files will be saved in: {script_dir}")

## 2. Configuration: Multiprocessing

Processing large datasets can be time-consuming. We can speed things up by using multiple CPU cores in parallel.

*   `num_proc`: This variable sets the number of worker processes to use for parallel operations within the `datasets` library's `.map()` method (which we'll use for tokenization).
*   `num_proc_load_dataset`: This variable sets the number of worker processes for the `load_dataset()` function itself. Downloading and initially processing data can also benefit from parallelism.

**Benefits of Multiprocessing:**
Using multiple processes can significantly reduce the time taken for CPU-bound tasks like tokenizing text or I/O-bound tasks like downloading and decompressing data (up to a point where network bandwidth or disk speed becomes the bottleneck).

**Choosing the Right Number:**
A good rule of thumb is to set `num_proc` to a value around the number of CPU cores available, or perhaps half that if memory is a concern or if other tasks are running. The optimal number can depend on the specific task (CPU-bound vs. I/O-bound), the number of CPU cores, available RAM, and disk/network speed. The script defaults to 8, which is a reasonable starting point for many modern CPUs.
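As a rough illustration of this rule of thumb, the sketch below derives a worker count from the machine's logical core count. The helper `suggest_num_proc` and its `cap` parameter are hypothetical names introduced here; they are not part of `prepare.py`, which simply hard-codes 8.

```python
import os


def suggest_num_proc(cap=8):
    """Hypothetical helper: half the logical cores, at least 1 and at most `cap`."""
    cores = os.cpu_count() or 1  # os.cpu_count() can return None on some platforms
    return max(1, min(cap, cores // 2))


suggested = suggest_num_proc()
print(f"Logical cores: {os.cpu_count()}, suggested num_proc: {suggested}")
```

Treat the result as a starting point only: each worker holds its own copy of the tokenizer and its own buffers, so a machine with little RAM may need fewer workers than this suggests.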

In [None]:
# Number of workers in .map() call
# Good number to use is ~order number of cpu cores // 2
num_proc = 8

# Number of workers in load_dataset() call
# Best number might be different from num_proc above as it also depends on network speed.
# It is better than 1 usually though
num_proc_load_dataset = num_proc

print(f"Using {num_proc} processes for .map() and {num_proc_load_dataset} for load_dataset().")

## 3. Loading OpenWebText with `datasets.load_dataset`

We use `load_dataset("openwebtext", num_proc=num_proc_load_dataset)` to fetch the OpenWebText dataset.

*   **Fetching from Hugging Face Hub:** This command downloads the dataset from the Hugging Face Hub if it's not already available locally.
*   **Dataset Caching:** The `datasets` library is smart about caching. Downloaded datasets are typically stored in `~/.cache/huggingface/datasets` (on Linux/macOS). If you run this command again, it will load the data from the cache, saving download time. The initial download and preparation can take a while and consume significant disk space (e.g., OpenWebText might take ~54GB in the cache).

The `dataset` object returned is usually a `DatasetDict` if the dataset has predefined splits (like 'train', 'test', 'validation'). For OpenWebText, it primarily contains a 'train' split.
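Because the download needs roughly 54 GB of cache space, it can be worth checking free disk space first. The following is a minimal sketch using only the standard library. It assumes the default cache location mentioned above (your setup may use a different directory), and `free_gb` is a hypothetical helper name.

```python
import os
import shutil

REQUIRED_GB = 54  # approximate cache size quoted above

# Assumption: the default cache root; it may have been reconfigured on your machine.
cache_root = os.path.expanduser("~/.cache/huggingface/datasets")


def free_gb(path):
    """Free space (GiB) on the filesystem holding `path` or its nearest existing parent."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free / 1024**3


available = free_gb(cache_root)
print(f"Free space near {cache_root}: {available:.1f} GiB")
if available < REQUIRED_GB:
    print(f"Warning: less than ~{REQUIRED_GB} GB free; the download may not fit.")
```

Keep in mind that the tokenized dataset and the final `.bin` files need additional space on top of the raw download.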

In [None]:
# Takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
print("Loading OpenWebText dataset...")
dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
print("Dataset loaded.")
print(f"\nDataset structure:\n{dataset}")
print(f"\nNumber of documents in 'train' split: {len(dataset['train']):,}")
print(f"Example document:\n{dataset['train'][0]['text'][:200]}...") # Show first 200 chars of an example

## 4. Creating Train and Validation Splits

The OpenWebText dataset, as loaded from Hugging Face, typically only has a 'train' split. We need to create our own validation split to evaluate the model during training.

*   `split_dataset = dataset["train"].train_test_split(...)`:
    This method from the `datasets` library splits the 'train' dataset into two new datasets: a new (smaller) 'train' and a 'test' dataset.
*   `test_size=0.0005`:
    This specifies that 0.05% of the data should be used for the 'test' set (which we will rename to 'validation'). For very large datasets like OpenWebText (around 8 million documents), even a small percentage results in a sufficiently large validation set. For OpenWebText, 0.05% is about 4,000 documents.
*   `seed=2357`:
    Setting a seed ensures that the random split is reproducible. If you run the script again with the same seed, you'll get the exact same train/validation split.
*   `shuffle=True`:
    This shuffles the dataset before splitting, which is important to ensure that both the training and validation sets are representative of the overall data distribution.
*   `split_dataset['val'] = split_dataset.pop('test')`:
    The `train_test_split` method creates a split named 'test'. We rename it to 'val' (for validation) to match the naming convention often used in training scripts (including nanoGPT's).

The result, `split_dataset`, will be a `DatasetDict` containing 'train' and 'val' splits.
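To make the effect of `seed` and `test_size` concrete, here is a small pure-Python illustration of a seeded shuffle-and-split. It is not how `train_test_split` is implemented internally: `toy_split` is a hypothetical stand-in, and its rounding rule is an assumption made for this sketch.

```python
import random


def toy_split(n_docs, test_size, seed):
    """Shuffle document indices with a fixed seed, then carve off a validation slice."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    n_val = round(n_docs * test_size)  # rounding rule is an assumption of this sketch
    return indices[n_val:], indices[:n_val]


train_a, val_a = toy_split(10_000, 0.0005, seed=2357)
train_b, val_b = toy_split(10_000, 0.0005, seed=2357)

print(len(train_a), len(val_a))  # 9995 5
print(val_a == val_b)            # True: same seed, same split

# Rough size of the validation split for the full dataset:
print(round(8_013_769 * 0.0005))  # 4007
```

The same seed always yields the same validation documents, which keeps validation losses comparable across runs.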

In [None]:
# OWT by default only contains the 'train' split, so create a test split
print("Splitting dataset into train and validation...")
split_dataset = dataset["train"].train_test_split(test_size=0.0005, seed=2357, shuffle=True)
split_dataset['val'] = split_dataset.pop('test') # Rename the test split to val
print("Dataset split.")
print(f"\nSplit dataset structure:\n{split_dataset}")
print(f"Number of documents in new 'train' split: {len(split_dataset['train']):,}")
print(f"Number of documents in 'val' split: {len(split_dataset['val']):,}")

## 5. Tokenization Process

Now we convert the raw text in our datasets into sequences of token IDs that the model can understand. We'll use the GPT-2 Byte Pair Encoding (BPE) tokenizer, accessed via the `tiktoken` library.

**Tokenizer Initialization:**
`enc = tiktoken.get_encoding("gpt2")`
This loads the pre-trained GPT-2 tokenizer. It knows how to map strings to sequences of integers (token IDs) and vice-versa. The GPT-2 tokenizer has a vocabulary of 50,257 tokens.

**The `process(example)` Function:**
This function defines how each individual document (example) in the dataset is tokenized.
```python
def process(example):
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
    ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
    out = {'ids': ids, 'len': len(ids)}
    return out
```
*   `ids = enc.encode_ordinary(example['text'])`: The `encode_ordinary()` method takes the raw text from the `example['text']` field and converts it into a list of token IDs. "Ordinary" means it doesn't look for or process special tokens (like prompt tokens) that some models might use; it just tokenizes the plain text.
*   `ids.append(enc.eot_token)`: It's common practice to append an End-Of-Text (EOT) token to each sequence. For GPT-2, `enc.eot_token` is typically 50256. This special token signals to the model that the document has ended. The script's author notes a slight uncertainty about whether it should be prepended or appended, but "eot" (end of text) implies appending.
*   `out = {'ids': ids, 'len': len(ids)}`: The function returns a dictionary. The `.map()` method of the `datasets` library expects the processing function to return a dictionary whose keys correspond to new column names (or existing ones to be overwritten).
    *   `'ids'`: The list of token IDs for the document.
    *   `'len'`: The length of this token list (number of tokens in the document).
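Since all documents are later concatenated into one long token stream, the EOT token is the only marker of where one document ends and the next begins. The sketch below illustrates this with made-up token IDs standing in for real tokenizer output; only the EOT value 50256 comes from the text above.

```python
import numpy as np

EOT = 50256  # GPT-2 end-of-text token ID, as noted above

# Made-up token IDs standing in for three tokenized documents.
docs = [[464, 3290], [1212, 318, 257], [15496]]

# Append EOT to each document and flatten everything into one stream.
stream = np.array([tok for doc in docs for tok in doc + [EOT]], dtype=np.uint16)

# Recover the documents by cutting the stream right after every EOT token.
boundaries = np.flatnonzero(stream == EOT) + 1
recovered = [chunk[:-1].tolist() for chunk in np.split(stream, boundaries) if len(chunk)]

print(stream)
print(recovered == docs)  # True
```

Without the appended EOT tokens, the flattened stream would give neither the model nor any later tooling a way to tell the documents apart.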

**Applying Tokenization with `.map()`:**
`tokenized = split_dataset.map(...)`
The `.map()` method is a powerful feature of the `datasets` library. It applies the `process` function to every example in each split of `split_dataset`.
*   `process`: The function to apply.
*   `remove_columns=['text']`: After tokenization, we no longer need the original raw text column. Dropping it keeps the tokenized dataset, which the `datasets` library caches on disk, smaller.
*   `desc="tokenizing the splits"`: This provides a description for the `tqdm` progress bar, so you can see which split is currently being tokenized.
*   `num_proc=num_proc`: This enables parallel processing, using the number of workers specified earlier. This significantly speeds up tokenization for large datasets.

The `tokenized` object will be another `DatasetDict`, similar in structure to `split_dataset`, but the 'text' column will be replaced by 'ids' and 'len' columns.

In [None]:
enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
    ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
    # note: I think eot should be prepended not appended... hmm. it's called "eot" though...
    out = {'ids': ids, 'len': len(ids)}
    return out

# Tokenize the dataset
print("Tokenizing the dataset splits (this can take a while)...")
tokenized = split_dataset.map(
    process,
    remove_columns=['text'],
    desc="tokenizing the splits",
    num_proc=num_proc,
)
print("Tokenization complete.")
print(f"\nTokenized dataset structure:\n{tokenized}")
print(f"\nExample of tokenized 'train' entry features: {tokenized['train'][0].keys()}")
print(f"First 20 token IDs of an example document: {tokenized['train'][0]['ids'][:20]}")
print(f"Length of that example document: {tokenized['train'][0]['len']}")

## 6. Saving to Binary Files with `numpy.memmap`

This is the most critical part for handling large datasets. We want to concatenate all the token IDs from all documents in each split into a single, massive array, and then save this array to a binary file. Doing this entirely in RAM would be infeasible for OpenWebText (which has ~9 billion tokens for the training split).

**The `numpy.memmap` Solution:**
Memory-mapping (`memmap`) allows us to create a NumPy array that is directly mapped to a file on disk. The operating system handles the complexities of reading and writing data to/from the disk as you access or modify the array, but from Python's perspective, it looks like a regular NumPy array. This means we can work with arrays that are much larger than the available RAM.
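A tiny round trip shows the idea: create a file-backed array, write a few values, flush, and reopen it read-only. This is a minimal sketch with a hypothetical file name in a temporary directory; the real script writes `train.bin` and `val.bin`.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.bin")  # hypothetical file name

    # Create a file-backed array of one million uint16 values.
    writer = np.memmap(path, dtype=np.uint16, mode="w+", shape=(1_000_000,))
    writer[:5] = [10, 20, 30, 40, 50]
    writer[-1] = 65535  # largest value a uint16 can hold
    writer.flush()
    del writer  # release the mapping

    file_size = os.path.getsize(path)  # 2 bytes per element

    # Reopen read-only; the length is inferred from the file size and dtype.
    reader = np.memmap(path, dtype=np.uint16, mode="r")
    n_items = len(reader)
    first_five = reader[:5].tolist()
    last_value = int(reader[-1])
    del reader  # release the mapping before the temporary directory is removed

print(file_size, n_items, first_five, last_value)
```

Only the slices that are actually touched get read from or written to disk, which is what makes the same pattern workable for files far larger than RAM.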

**The Process:**

The code iterates through each split ('train' and 'val') in the `tokenized` DatasetDict:
`for split, dset in tokenized.items():`

1.  **Calculate Total Length:**
    `arr_len = np.sum(dset['len'], dtype=np.uint64)`
    First, we calculate the total number of tokens in the current split (`dset`). We sum up the 'len' column (which contains the length of each tokenized document). `dtype=np.uint64` is used for the sum to avoid overflow, as the total number of tokens can be very large (billions).

2.  **Define Filename and Data Type:**
    `filename = os.path.join(script_dir, f'{split}.bin')`
    This creates the output filename, e.g., `train.bin` or `val.bin` in the current script directory.
    `dtype = np.uint16`
    We'll store each token ID as an unsigned 16-bit integer. This is suitable because the GPT-2 tokenizer has a maximum token value of 50,256, which fits within the range of `np.uint16` (0 to 65,535). This is memory-efficient.

3.  **Create the Memory-Mapped Array:**
    `arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))`
    This is the core step:
    *   `filename`: The file to be created/used on disk.
    *   `dtype=dtype`: Sets the data type of the array elements (`np.uint16`).
    *   `mode='w+'`: This mode opens the file for both reading and writing. If the file doesn't exist, it's created. If it exists, it's overwritten (be careful!).
    *   `shape=(arr_len,)`: This is crucial. We pre-allocate the entire file on disk to the exact size needed to hold all `arr_len` tokens. The `arr` object is now a NumPy array-like interface to this file.

4.  **Writing Data in Batches:**
    Writing billions of tokens one by one would be extremely slow. Instead, we write them in large batches.
    `total_batches = 1024`
    This is an arbitrary number of batches. The dataset will be divided into this many chunks for processing.
    The script then loops `total_batches` times:
    `for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):`

    *   **Get a Shard of the Dataset:**
        `batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')`
        The `.shard()` method from the `datasets` library divides the dataset `dset` into `total_batches` pieces (shards). `index=batch_idx` selects the current shard. `contiguous=True` makes each shard a consecutive run of rows rather than every N-th row, so the documents are written to the file in their dataset order. `.with_format('numpy')` returns the shard's columns as NumPy data; the 'ids' column still holds one variable-length sequence per document, which is why `np.concatenate` is used next.

    *   **Concatenate Token IDs in the Batch:**
        `arr_batch = np.concatenate(batch['ids'])`
        `batch['ids']` holds one sequence of token IDs per document in the shard. `np.concatenate()` flattens these into a single one-dimensional NumPy array of token IDs for the current batch.

    *   **Write Batch to Memmap Array:**
        `arr[idx : idx + len(arr_batch)] = arr_batch`
        This is where the magic happens. We slice the memory-mapped array `arr` from the current position `idx` to `idx + len(arr_batch)` and assign the `arr_batch` (the tokens from the current shard) to this slice. NumPy and the OS handle writing this data to the underlying disk file.
        `idx += len(arr_batch)`: We update `idx` to point to the next position in the `memmap` file for the next batch.

5.  **Flush Changes to Disk:**
    `arr.flush()`
    Although writes to a `memmap` array are generally passed to the OS to be written to disk, `arr.flush()` explicitly ensures that all buffered changes are written to the file. This is good practice to ensure data integrity, especially at the end of the writing process.
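The five steps above can be tried end to end on synthetic data without downloading anything. In this sketch, random integers stand in for tokenized documents, and `np.array_split` over the document indices stands in for `dset.shard(..., contiguous=True)`. Both substitutions are assumptions made for illustration, as are the file name and the batch count of 8.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "tokenized documents": variable-length lists of fake token IDs.
docs = [rng.integers(0, 50257, size=rng.integers(1, 20)).tolist() for _ in range(100)]
lengths = [len(d) for d in docs]

arr_len = int(np.sum(lengths, dtype=np.uint64))  # step 1: total token count
total_batches = 8

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "toy_train.bin")  # step 2 (hypothetical name)
    arr = np.memmap(filename, dtype=np.uint16, mode="w+", shape=(arr_len,))  # step 3

    idx = 0
    # Step 4: contiguous chunks of document indices play the role of shards.
    for doc_indices in np.array_split(np.arange(len(docs)), total_batches):
        arr_batch = np.concatenate([docs[i] for i in doc_indices])
        arr[idx : idx + len(arr_batch)] = arr_batch
        idx += len(arr_batch)

    arr.flush()  # step 5
    del arr

    written = np.fromfile(filename, dtype=np.uint16)  # read back for checking

expected = np.concatenate(docs).astype(np.uint16)
print(idx == arr_len, np.array_equal(written, expected))
```

The final check confirms that the write offset ends exactly at `arr_len` and that the file holds the documents in their original order.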


In [None]:
# Concatenate all the ids in each dataset into one large file we can use for training
for split, dset in tokenized.items():
    arr_len = np.sum(dset['len'], dtype=np.uint64)
    filename = os.path.join(script_dir, f'{split}.bin')
    dtype = np.uint16 # (can do since enc.max_token_value == 50256 is < 2**16)
    
    print(f"Creating memory-mapped file: {filename} with size: {arr_len:,} tokens ({arr_len * np.dtype(dtype).itemsize / 1024**3:.2f} GB)")
    arr = np.memmap(filename, dtype=dtype, mode='w+', shape=(arr_len,))
    total_batches = 1024 # Process in 1024 batches

    idx = 0
    for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
        # Batch together samples for faster write
        batch = dset.shard(num_shards=total_batches, index=batch_idx, contiguous=True).with_format('numpy')
        arr_batch = np.concatenate(batch['ids'])
        # Write into mmap
        arr[idx : idx + len(arr_batch)] = arr_batch
        idx += len(arr_batch)
    arr.flush()
    print(f"Finished writing {filename}")

## 7. Output Summary

After running the script, you will have two main files:

*   `train.bin`: This file contains all the token IDs for the training split. For OpenWebText, this file is quite large, approximately 17 GB, representing around 9 billion tokens (`9,035,582,198` tokens according to the script comments).
*   `val.bin`: This file contains all the token IDs for the validation split. It's much smaller, around 8.5 MB, representing about 4.4 million tokens (`4,434,897` tokens).
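The file sizes follow directly from the token counts, since every token is stored as a 2-byte `np.uint16`. The short calculation below reproduces them from the counts quoted above. It also shows why a 64-bit accumulator is used for the token sum: the training count does not fit in 32 bits.

```python
BYTES_PER_TOKEN = 2  # np.uint16

# Token counts quoted above.
token_counts = {"train": 9_035_582_198, "val": 4_434_897}

sizes = {}
for split, n_tokens in token_counts.items():
    n_bytes = n_tokens * BYTES_PER_TOKEN
    sizes[split] = n_bytes
    print(f"{split}.bin: {n_bytes:,} bytes "
          f"= {n_bytes / 1024**3:.2f} GiB = {n_bytes / 1024**2:.1f} MiB")

# The training token count exceeds the largest unsigned 32-bit integer.
fits_in_uint32 = token_counts["train"] < 2**32
print(f"train token count fits in uint32: {fits_in_uint32}")
```

If your files differ noticeably from these sizes, the write was probably interrupted or a different dataset version was downloaded.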

These files are now ready to be used by the training script (`train.py` in nanoGPT), which will also use `numpy.memmap` to read them efficiently during training.

## 8. Reading Back the Data (Example)

To use these `.bin` files later, for example, in a training script, you can again use `numpy.memmap`. This time, you would open the file in read-only mode (`mode='r'`):

```python
# Example of how to read the data back:
# train_data_mmap = np.memmap('train.bin', dtype=np.uint16, mode='r')
# val_data_mmap = np.memmap('val.bin', dtype=np.uint16, mode='r')

# You can then access parts of it like any NumPy array:
# first_100_tokens = train_data_mmap[:100]
# print(first_100_tokens)
```
This approach is memory-efficient because it doesn't load the entire dataset into RAM. Instead, parts of the file are read on demand as you access slices of the memmapped array.
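To illustrate on-demand access, the sketch below samples a few random training windows from a memory-mapped token file. It is a NumPy-only illustration and not nanoGPT's actual batching code; `sample_batch`, the toy file, and the batch and block sizes are all assumptions made for this example.

```python
import os
import tempfile

import numpy as np


def sample_batch(data, batch_size, block_size, rng):
    """Pick random windows; y is x shifted by one token (next-token targets)."""
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([np.asarray(data[s : s + block_size], dtype=np.int64) for s in starts])
    y = np.stack([np.asarray(data[s + 1 : s + 1 + block_size], dtype=np.int64) for s in starts])
    return x, y


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.bin")  # hypothetical stand-in for train.bin
    np.arange(1000, dtype=np.uint16).tofile(path)  # fake "tokens" 0..999

    data = np.memmap(path, dtype=np.uint16, mode="r")
    x, y = sample_batch(data, batch_size=4, block_size=8, rng=np.random.default_rng(0))
    del data  # release the mapping before the temporary directory is removed

print(x.shape, y.shape)
print(np.array_equal(y, x + 1))  # holds here because the fake tokens are consecutive
```

Each call reads only `batch_size` short windows from the file, so memory use stays small no matter how large the `.bin` file is.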

In [None]:
# Example: Peeking into the created files
print("\n--- Verifying file creation and content (first 10 tokens) ---")
for split_name in ['train', 'val']:
    filename = os.path.join(script_dir, f'{split_name}.bin')
    if os.path.exists(filename):
        try:
            m = np.memmap(filename, dtype=np.uint16, mode='r')
            print(f"Successfully opened {filename} with {len(m):,} tokens.")
            print(f"First 10 tokens from {split_name}.bin: {m[:10]}")
            # It's important to delete the memmap object to close the file handle properly
            # especially if you were to open it in 'w+' mode elsewhere or try to delete the file.
            del m 
        except Exception as e:
            print(f"Could not read {filename}: {e}")
    else:
        print(f"{filename} does not exist.")

## 9. Conclusion

Congratulations! You've walked through the process of downloading, tokenizing, and saving the large OpenWebText dataset into a format suitable for training GPT-style language models with nanoGPT.

Key techniques covered:
*   Using the Hugging Face `datasets` library to download and manage datasets.
*   Leveraging multiprocessing with `num_proc` in `load_dataset` and `.map()` for faster processing.
*   Applying BPE tokenization using `tiktoken`.
*   Most importantly, using `numpy.memmap` to handle and save arrays much larger than available RAM by directly mapping them to disk files (`train.bin`, `val.bin`).

These methods are essential for scaling up language model training to large, real-world datasets.