# Preparing the Tiny Shakespeare Dataset (Character-Level)

Welcome! This notebook walks through preparing the Tiny Shakespeare dataset for training a character-level language model with nanoGPT.

**What is Character-Level Tokenization?**

In Natural Language Processing (NLP), tokenization is the process of breaking down a piece of text into smaller units called tokens. These tokens can be words, sub-words, or, in this case, individual characters.

For example, the sentence "Hello, world!" would be tokenized at the character level as: `['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']`.
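
As a quick illustration of the example above, Python's built-in `list()` already splits a string into its individual characters. This is only a sketch of the idea; the variable names `text` and `tokens` are chosen here for illustration.

```python
# Character-level tokenization of a short string using only built-ins.
text = "Hello, world!"
tokens = list(text)  # one token per character, including punctuation and spaces

print(tokens)
print(len(tokens))  # number of character tokens
```

Joining the tokens back together with `''.join(tokens)` recovers the original string, so no information is lost.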

**Why Use Character-Level Tokenization?**

Character-level models have a few advantages:
*   **Smaller Vocabulary:** The vocabulary (the set of unique tokens) is much smaller, consisting only of the unique characters in the text (letters, numbers, punctuation, spaces, etc.). This means the model needs fewer embeddings.
*   **Handles Out-of-Vocabulary (OOV) Words:** Word-level models can struggle with words not seen during training. A character model can represent any word built from characters in its vocabulary, although a character that never appeared in the training text still has no ID.
*   **Can Model Sub-Word Information:** They can potentially learn patterns within words (like prefixes and suffixes).

However, they also have disadvantages:
*   **Longer Sequences:** Representing text requires much longer sequences of tokens compared to word-level models, which can make it harder for the model to learn long-range dependencies.
*   **Less Semantic Meaning per Token:** Individual characters carry less semantic meaning than whole words.
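
To make the sequence-length trade-off concrete, the sketch below compares the token count of one sentence under character-level tokenization and under a naive whitespace split. The whitespace split is only a stand-in for word-level tokenization (real word tokenizers treat punctuation differently), and the sentence and variable names are chosen here for illustration.

```python
# Compare sequence lengths: characters versus whitespace-separated words.
sentence = "To be, or not to be, that is the question"

char_tokens = list(sentence)    # character-level: one token per character
word_tokens = sentence.split()  # naive word-level: split on whitespace only

print(f"Character tokens: {len(char_tokens)}")
print(f"Word tokens: {len(word_tokens)}")
```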

This notebook will guide you through downloading the data, creating a character vocabulary, encoding the text into numerical format, and saving it in a way that's ready for training.

## 1. Imports

First, let's import the necessary Python libraries:

*   `os`: This module provides a way of using operating system dependent functionality like reading or writing to the file system. We'll use it for path manipulations and checking if files exist.
*   `pickle`: This module is used for serializing and de-serializing Python object structures, also known as "pickling" and "unpickling". We'll use it to save our vocabulary metadata.
*   `requests`: This library allows us to send HTTP requests. We'll use it to download the dataset if it's not already present locally.
*   `numpy`: NumPy is a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays. We'll use it to efficiently store our tokenized data.

In [None]:
import os
import pickle
import requests
import numpy as np

## 2. Setup Script Directory

We define `script_dir` using `os.getcwd()`, the current working directory. This assumes you are running this notebook from its location within the `data/shakespeare_char/` directory; if you launch it from elsewhere, the files are written to that directory instead. All output files (`input.txt`, `train.bin`, `val.bin`, `meta.pkl`) will be saved relative to `script_dir`.

In [None]:
script_dir = os.getcwd()
# For consistency with the original prepare.py, we could also use:
# script_dir = os.path.dirname(__file__) 
# However, __file__ is not defined in interactive notebook environments by default.
# os.getcwd() works well if the notebook is run from its directory.
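
If the same code has to run both as a script and inside a notebook, one possible approach is to try `__file__` first and fall back to the working directory. This is a sketch, not part of the original `prepare.py`; the helper name `resolve_script_dir` is hypothetical.

```python
import os

def resolve_script_dir():
    """Return the script's directory, or the working directory in a notebook.

    Hypothetical helper: __file__ is normally undefined in interactive
    notebook sessions, so a NameError triggers the os.getcwd() fallback.
    """
    try:
        return os.path.dirname(os.path.abspath(__file__))
    except NameError:
        return os.getcwd()

print(resolve_script_dir())
```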

## 3. Download the Dataset

Next, we'll download the Tiny Shakespeare dataset. This dataset consists of a collection of Shakespeare's works concatenated into a single text file.

The code first defines the `input_file_path` where the data will be saved (`input.txt` in the `script_dir`).
It then checks if this file already exists. If it doesn't, the code downloads the data from the specified URL (on `raw.githubusercontent.com`, provided by Andrej Karpathy's char-rnn project) and saves it to `input_file_path`.

In [None]:
input_file_path = os.path.join(script_dir, 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    print(f"Downloading dataset from {data_url}...")
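    # Fetch first, so a failed request does not leave an empty input.txt behind
    response = requests.get(data_url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)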
    print("Dataset downloaded.")
else:
    print("Dataset already exists locally.")

## 4. Load and Inspect the Data

Now that we have the `input.txt` file, we'll read its content into a string variable called `data`.
We then print the total length of the dataset (number of characters) to get an idea of its size.

In [None]:
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
print(f"Length of dataset in characters: {len(data):,}")
print(f"First 100 characters: {data[:100]}")

## 5. Create Character Vocabulary

This is a crucial step in preparing data for a character-level language model.

**Why a Vocabulary?**
Language models work with numbers, not raw text. We need a way to convert each character in our dataset into a unique numerical representation (an integer ID). A vocabulary defines this mapping.

**Steps:**
1.  **Extract Unique Characters:** We first find all the unique characters present in the `data`. The `set(data)` operation creates a collection of unique characters, and `sorted(list(...))` converts this set into a sorted list. A set has no guaranteed iteration order, so sorting ensures that every character gets the same ID each time we run this script.
2.  **Vocabulary Size:** The number of unique characters determines our `vocab_size`.
3.  **String-to-Integer (stoi) Mapping:** We create a dictionary called `stoi` where keys are characters and values are their corresponding integer IDs (from 0 to `vocab_size - 1`). For example, if `chars = ['a', 'b', 'c']`, then `stoi` would be `{'a':0, 'b':1, 'c':2}`.
4.  **Integer-to-String (itos) Mapping:** We also create the reverse mapping, `itos`, where keys are integer IDs and values are the corresponding characters. For the example above, `itos` would be `{0:'a', 1:'b', 2:'c'}`. This is useful for decoding the model's output back into readable text.

We then define two helper functions:
*   `encode(s)`: Takes a string `s` and returns a list of integers representing that string according to the `stoi` mapping.
*   `decode(l)`: Takes a list of integers `l` and returns the corresponding string using the `itos` mapping.

In [None]:
# Get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("All the unique characters:", ''.join(chars))
print(f"Vocabulary size: {vocab_size:,}")

# Create a mapping from characters to integers and vice-versa
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

# Encoder: take a string, output a list of integers
def encode(s):
    return [stoi[c] for c in s] 

# Decoder: take a list of integers, output a string
def decode(l):
    return ''.join([itos[i] for i in l])

# Example of encoding and decoding
test_string = "hello world"
encoded_string = encode(test_string)
decoded_string = decode(encoded_string)
print(f"Original string: {test_string}")
print(f"Encoded string: {encoded_string}")
print(f"Decoded string: {decoded_string}")
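
Two properties of these helpers are worth checking: decoding an encoded string returns the original, and encoding a character that is not in the vocabulary raises a `KeyError`. The sketch below shows both on a toy vocabulary built from a short string. The `toy_` names are hypothetical and are used so that the real `stoi`, `itos`, `encode`, and `decode` are not overwritten.

```python
# Build a tiny vocabulary from a short sample (illustrative only).
toy_sample = "hello world"
toy_chars = sorted(set(toy_sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    return ''.join(toy_itos[i] for i in ids)

# Round trip: decoding the encoding gives back the original string.
assert toy_decode(toy_encode(toy_sample)) == toy_sample

# A character outside the vocabulary ('?' here) has no ID, so lookup fails.
unknown_char_rejected = False
try:
    toy_encode("hello?")
except KeyError as err:
    unknown_char_rejected = True
    print(f"Character not in vocabulary: {err}")
```

The same applies to the full Shakespeare vocabulary: any prompt passed to `encode` must consist only of characters that occur in `input.txt`.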

## 6. Create Train and Validation Splits

**Why Split Data?**
It's standard practice in machine learning to split your dataset into at least two parts:
*   **Training Set:** This is the data the model learns from. The model's parameters are adjusted based on its performance on this set.
*   **Validation Set:** This data is held out and not used during the training phase. Instead, it's used to evaluate the model's performance on unseen data, helping to tune hyperparameters and check for overfitting (where the model performs well on training data but poorly on new data).

Here, we split the dataset such that the first 90% of the text forms the training data (`train_data`) and the remaining 10% forms the validation data (`val_data`).

In [None]:
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

print(f"Length of training data: {len(train_data):,} characters")
print(f"Length of validation data: {len(val_data):,} characters")
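
Note that this split is contiguous rather than shuffled: the validation set is simply the tail of the text. A small sketch on a 20-character toy string makes this visible; the `toy_` names are hypothetical.

```python
# Contiguous 90/10 split on a toy string (illustrative only).
toy_text = "abcdefghij" * 2                  # 20 characters
split_at = int(len(toy_text) * 0.9)          # index where validation begins
toy_train = toy_text[:split_at]
toy_val = toy_text[split_at:]

# The two pieces cover the text exactly once, with no overlap or gap.
assert toy_train + toy_val == toy_text
print(len(toy_train), len(toy_val), repr(toy_val))
```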

## 7. Encode Data Splits

Now that we have our training and validation text data, we use the `encode` function (defined earlier) to convert both splits from strings of characters into sequences of integer IDs.

The resulting `train_ids` and `val_ids` are plain Python lists of integers, one ID per character of the corresponding split.

In [None]:
train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train_ids has {len(train_ids):,} tokens (integers)")
print(f"val_ids has {len(val_ids):,} tokens (integers)")
print(f"First 10 tokens of train_ids: {train_ids[:10]}")

## 8. Export to Binary Files

For efficient loading during training, especially with large datasets, it's common to save the tokenized data in a compact binary format.

**Steps:**
1.  **Convert to NumPy Arrays:** We first convert our lists of token IDs (`train_ids` and `val_ids`) into NumPy arrays.
2.  **Specify Data Type (`dtype`):** We use `dtype=np.uint16`. This means each token ID will be stored as an unsigned 16-bit integer. 
    *   `unsigned` means it can only store non-negative numbers.
    *   `16-bit` means it can store values from 0 to 2<sup>16</sup> - 1 (which is 0 to 65,535).
    Every token ID lies between 0 and `vocab_size - 1` (0 to 64 for the 65-character Shakespeare vocabulary), so `np.uint16` is more than sufficient and keeps each token at 2 bytes. A vocabulary with more than 65,536 tokens would need a wider type such as `np.uint32`.
3.  **Save to File:** The `.tofile()` method of a NumPy array writes the array's data to a binary file. This results in `train.bin` and `val.bin`.

These `.bin` files will contain a flat sequence of these 16-bit integers, which can be quickly read and loaded into memory by the training script.
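
The range argument above can be checked directly with `np.iinfo`, which reports the limits of an integer type. The helper `fits_in_uint16` below is a hypothetical name, introduced only to illustrate the check.

```python
import numpy as np

# Largest value an unsigned 16-bit integer can hold.
uint16_max = np.iinfo(np.uint16).max

def fits_in_uint16(vocab_size):
    """Return True if every ID in range(vocab_size) fits in np.uint16."""
    return vocab_size - 1 <= uint16_max

print(uint16_max)
print(fits_in_uint16(65))      # the Shakespeare character vocabulary
print(fits_in_uint16(65_537))  # one ID too many for 16 bits
```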

In [None]:
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)

train_ids.tofile(os.path.join(script_dir, 'train.bin'))
val_ids.tofile(os.path.join(script_dir, 'val.bin'))

print("train.bin and val.bin saved.")
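
Because `.tofile()` writes raw bytes with no header, whoever reads the file has to supply the same `dtype`. The sketch below round-trips a small array through a temporary file with `np.fromfile`; the file name `toy.bin` and the `toy_` variables are hypothetical.

```python
import os
import tempfile

import numpy as np

toy_ids = np.array([3, 2, 4, 4, 5], dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy.bin')
    toy_ids.tofile(toy_path)                 # raw bytes, no shape or dtype stored
    file_size = os.path.getsize(toy_path)    # 2 bytes per uint16 token
    restored = np.fromfile(toy_path, dtype=np.uint16)  # dtype must match the writer

print(file_size)
print(restored.tolist())
```

For files too large to load at once, `np.memmap` with the same `dtype` is a common alternative to `np.fromfile`.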

## 9. Save Metadata

Finally, we need to save the information required to interpret our processed data later. This includes:

*   `vocab_size`: The total number of unique characters.
*   `itos`: The integer-to-string mapping (dictionary).
*   `stoi`: The string-to-integer mapping (dictionary).

This metadata is crucial because:
*   The training script might need `vocab_size` to define the model architecture (e.g., the size of the embedding layer).
*   When we want to generate text from the trained model, it will output sequences of integer IDs. We'll need `itos` to decode these IDs back into human-readable characters.
*   If we want to feed new, unseen text to the model (e.g., for a prompt), we'll need `stoi` to tokenize that text.

We store this information in a Python dictionary called `meta` and then use the `pickle` library to serialize this dictionary and save it to a file named `meta.pkl`.
`pickle.dump(obj, file)` writes the pickled representation of `obj` to the open file handle `file`. An optional third argument, `protocol`, selects the pickle format version; we leave it at its default. The file is opened in `'wb'` (write-binary) mode because pickle produces bytes rather than text.

In [None]:
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(os.path.join(script_dir, 'meta.pkl'), 'wb') as f:
    pickle.dump(meta, f)

print("meta.pkl saved.")

# To load this metadata later, you would use:
# with open('meta.pkl', 'rb') as f:
#     meta = pickle.load(f)
# print(meta)
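
As a sketch of how this metadata might be used later, the example below saves a toy `meta` dictionary to a temporary file, loads it back, and decodes a list of IDs with the restored `itos` mapping. The toy vocabulary, the `toy_` names, and the ID list are hypothetical.

```python
import os
import pickle
import tempfile

# Toy vocabulary with the same structure as the real meta dictionary.
toy_chars = sorted(set("hello world"))
toy_meta = {
    'vocab_size': len(toy_chars),
    'itos': {i: ch for i, ch in enumerate(toy_chars)},
    'stoi': {ch: i for i, ch in enumerate(toy_chars)},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_meta_path = os.path.join(tmp_dir, 'meta.pkl')
    with open(toy_meta_path, 'wb') as f:
        pickle.dump(toy_meta, f)
    with open(toy_meta_path, 'rb') as f:
        loaded_meta = pickle.load(f)

# Decode a sequence of IDs using the restored integer-to-string mapping.
toy_output_ids = [3, 2, 4, 4, 5]
decoded_text = ''.join(loaded_meta['itos'][i] for i in toy_output_ids)
print(decoded_text)
```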

## 10. Conclusion

That's it! We have successfully processed the Tiny Shakespeare dataset for character-level language modeling.

As a result of running this notebook, you should now have the following files in your `data/shakespeare_char/` directory (or wherever `script_dir` was pointing):

*   `input.txt`: The raw dataset.
*   `train.bin`: The training data, tokenized and stored as a binary sequence of 16-bit integers.
*   `val.bin`: The validation data, similarly processed.
*   `meta.pkl`: A pickle file containing the vocabulary size and the character-to-integer (`stoi`) and integer-to-character (`itos`) mappings.

These files are now ready to be used for training a character-level nanoGPT model.