# Data Preparation — OD Documentation Dataset

This notebook:
1. Parses all `.md` files under `./od_docu/`, including subfolders
2. Cleans text: removes links, images, code blocks, HTML, markdown formatting
3. Filters out files with fewer than 100 characters (before or after cleaning)
4. Applies data cleaning (paragraph length filter, repetition filter, deduplication)
5. Tokenizes with the GPT-2 tokenizer (tiktoken) and writes `.npy` shards for post-training

In [25]:
import warnings
warnings.filterwarnings("ignore")

In [26]:
import os
import re
import glob
import heapq
import numpy as np
import datasets
import tiktoken

## 1. Parse and Clean Markdown Files

### 1.1 Define the cleaning function

The cleaning pipeline removes:
- Code blocks (fenced with triple backticks)
- HTML comments (`<!-- ... -->`)
- HTML tags (`<p>`, `<img>`, `<br>`, etc.)
- Images (`![alt](url)`)
- Markdown links → keep link text only (`[text](url)` → `text`)
- Bare URLs (http/https)
- Markdown tables
- Markdown formatting (headers `#`, bold `**`, italic `_`, blockquotes `>`, horizontal rules `---`)
- Template/variable syntax (`{{ ... }}`)
- Inline code (every single-backtick span is removed together with its content)
- Excessive whitespace
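
The order of these rules matters. As a minimal sketch on a made-up sentence (the sample string and its URL are illustrative and not taken from the corpus), three of the rules above are applied in pipeline order:

```python
import re

# Made-up sample; not taken from the documentation corpus.
sample = "See the [user guide](https://example.com/guide) ![logo](logo.png) and run `make build`."

# Images go first: the link rule alone would turn ![logo](logo.png) into "!logo".
step1 = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', sample)
# Links keep only their visible text.
step2 = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', step1)
# Inline code is dropped together with its content.
step3 = re.sub(r'`[^`]*`', '', step2)

print(repr(step3))  # 'See the user guide  and run .'
```

The trailing `run .` shows the cost of dropping inline code wholesale: sentences that referred to a command or path can end up with a gap.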

In [27]:
def clean_markdown(text: str) -> str:
    """Clean markdown text by removing links, images, code blocks,
    HTML tags, and markdown formatting. Returns plain text."""

    # 1. Remove fenced code blocks (```...```)
    text = re.sub(r'```[\s\S]*?```', '', text)

    # 2. Remove HTML comments (<!-- ... -->)
    text = re.sub(r'<!--[\s\S]*?-->', '', text)

    # 3. Remove HTML tags (including self-closing and with attributes)
    #    e.g. <p align="center">, <img src="...">, <br>, </p>
    text = re.sub(r'<[^>]+>', '', text)

    # 4. Remove images: ![alt text](url)
    text = re.sub(r'!\[[^\]]*\]\([^)]*\)', '', text)

    # 5. Remove template/variable syntax: {{ ... }}
    text = re.sub(r'\{\{[^}]*\}\}', '', text)

    # 6. Convert markdown links to just the link text: [text](url) → text
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)

    # 7. Remove bare URLs
    text = re.sub(r'https?://[^\s)>]+', '', text)

    # 8. Remove all inline code spans (`...`), content included.
    #    This can leave small gaps in sentences that referred to the code.
    text = re.sub(r'`[^`]*`', '', text)

    # 9. Remove markdown table separators: |---|---|---|
    text = re.sub(r'^\|?[-|: ]+\|?$', '', text, flags=re.MULTILINE)

    # 10. Remove markdown table rows (lines starting/ending with |)
    text = re.sub(r'^\|.*\|\s*$', '', text, flags=re.MULTILINE)

    # 11. Remove markdown heading markers (#)
    text = re.sub(r'^#{1,6}\s*', '', text, flags=re.MULTILINE)

    # 12. Remove blockquote markers (>)
    text = re.sub(r'^>\s?', '', text, flags=re.MULTILINE)

    # 13. Remove horizontal rules (---, ***, ___)
    text = re.sub(r'^([-*_])\1{2,}\s*$', '', text, flags=re.MULTILINE)

    # 14. Remove bold/italic markers (**, __, *, _) but keep the text
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # **bold**
    text = re.sub(r'__([^_]+)__', r'\1', text)        # __bold__
    text = re.sub(r'\*([^*]+)\*', r'\1', text)        # *italic*
    text = re.sub(r'_([^_\s][^_]*)_', r'\1', text)    # _italic_

    # 15. Remove list markers (-, *, +, numbered)
    text = re.sub(r'^\s*[-*+]\s+', '', text, flags=re.MULTILINE)
    text = re.sub(r'^\s*\d+\.\s+', '', text, flags=re.MULTILINE)

    # 16. Unescape backslash-escaped markdown characters (e.g. \* → *)
    text = re.sub(r'\\([\\`*_{}\[\]()#+\-.!>])', r'\1', text)

    # 17. Clean up excessive blank lines (3+ newlines → 2)
    text = re.sub(r'\n{3,}', '\n\n', text)

    # 18. Strip leading/trailing whitespace from each line
    lines = [line.strip() for line in text.split('\n')]
    text = '\n'.join(lines)

    # 19. Final strip
    text = text.strip()

    return text

### 1.2 Parse all .md files from the ingestion folder

In [28]:
# Path to the ingestion documentation folder
ingestion_dir = "./od_docu/"

# Recursively find all .md files
md_files = glob.glob(os.path.join(ingestion_dir, "**/*.md"), recursive=True)
print(f"Found {len(md_files)} markdown files")

Found 255 markdown files


In [29]:
# Read and clean all files
MIN_SYMBOLS = 100

raw_texts = []
cleaned_texts = []
skipped_short = 0
skipped_empty = 0

for filepath in sorted(md_files):
    with open(filepath, 'r', encoding='utf-8') as f:
        raw_text = f.read()
    
    # Skip files whose raw text has fewer than MIN_SYMBOLS characters
    if len(raw_text) < MIN_SYMBOLS:
        skipped_short += 1
        continue
    
    cleaned = clean_markdown(raw_text)
    
    # Also skip if the cleaned text is too short
    if len(cleaned) < MIN_SYMBOLS:
        skipped_empty += 1
        continue
    
    raw_texts.append(raw_text)
    cleaned_texts.append(cleaned)

print(f"Kept: {len(cleaned_texts)} files")
print(f"Skipped (raw < {MIN_SYMBOLS} symbols): {skipped_short}")
print(f"Skipped (cleaned < {MIN_SYMBOLS} symbols): {skipped_empty}")

Kept: 255 files
Skipped (raw < 100 symbols): 0
Skipped (cleaned < 100 symbols): 0


### 1.3 Inspect cleaning results

Let's look at a few examples to verify the cleaning is working correctly.

In [30]:
# Show first 3 cleaned examples
for i in range(min(3, len(cleaned_texts))):
    print(f"{'='*80}")
    print(f"EXAMPLE {i+1} (length: {len(cleaned_texts[i])} chars)")
    print(f"{'='*80}")
    print(cleaned_texts[i][:500])
    print(f"\n... [truncated, total {len(cleaned_texts[i])} chars]\n")

EXAMPLE 1 (length: 3482 chars)
The Architecture of the Orchestrator

The input for the Orchestrator is an Ingestion Plan. As described above, these can be configured in the .
For one plan, several items can be configured that are executed together once the overall Ingestion Plan is triggered.
For a single item, the following properties need to be defined:
Data Connection Name: The One Data Connection to the Source System on which the ingestion should take place.
Schema or Domain ID: References either the One Data Domain to in

... [truncated, total 3482 chars]

EXAMPLE 2 (length: 3121 chars)
Archiving and Cleaning Mechanisms of Cartography

One Data Cartography has Archiving and Cleaning Mechanisms.
As a result of historization processes, different entries get deleted or overwritten. By listing them in archiving tables, entries that are not primarily necessary anymore are saved and archived in a central place, thus keeping the necessary tables clearly arranged. Additionally, as an incr

### 1.4 Build a Hugging Face Dataset

In [31]:
od_dataset = [{'text': text} for text in cleaned_texts]
dataset = datasets.Dataset.from_list(od_dataset)
print(dataset)
print(f"\nTotal characters: {sum(len(t) for t in cleaned_texts):,}")

Dataset({
    features: ['text'],
    num_rows: 255
})

Total characters: 747,911


## 2. Data Cleaning

Apply the same cleaning steps from the reference notebook:
1. Filter out samples that are too short
2. Remove repetitions within a single text example
3. Remove duplicated documents

In [32]:
print(f"Starting dataset size: {dataset.num_rows} rows")

Starting dataset size: 255 rows


### 2.1 Remove examples that are too short

In [33]:
def paragraph_length_filter(x):
    """Return False iff the text has fewer than 3 lines or its
    third-longest line is shorter than 3 characters."""
    lines = x['text'].split('\n')
    if (
        len(lines) < 3
        or min(heapq.nlargest(3, [len(line) for line in lines])) < 3
    ):
        return False
    return True

dataset = dataset.filter(
    paragraph_length_filter,
    load_from_cache_file=False
)
print(f"After paragraph length filter: {dataset.num_rows} rows")

Filter: 100%|██████████| 255/255 [00:00<00:00, 65053.68 examples/s]

After paragraph length filter: 255 rows
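
To make the threshold concrete, here is a small standalone sketch of the same check on made-up inputs. The helper names `third_longest_line` and `passes` are hypothetical and only restate the condition used in `paragraph_length_filter`:

```python
import heapq

def third_longest_line(text):
    """Length of the third-longest line, the quantity the filter thresholds."""
    lengths = [len(line) for line in text.split('\n')]
    return min(heapq.nlargest(3, lengths))

def passes(text):
    """Same condition as the filter: at least 3 lines, third-longest >= 3 chars."""
    return len(text.split('\n')) >= 3 and third_longest_line(text) >= 3

print(passes("Title\n\nA real paragraph of text.\nAnother line."))  # True
print(passes("Title\nOnly two lines"))                              # False
print(passes("Title\n\n\nok\n-"))                                   # False
```

The bar is deliberately low, which is consistent with all 255 documents passing here.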





### 2.2 Remove repeated text within training examples

In [34]:
def find_duplicates(paragraphs):
    """
    Count repeated paragraphs. Returns a tuple
    (number of repeated paragraphs, total characters in those repeats).
    """
    unique_x = set()
    duplicate_chars = 0
    duplicate_elements = 0
    for element in paragraphs:
        if element in unique_x:
            duplicate_chars += len(element)
            duplicate_elements += 1
        else:
            unique_x.add(element)
    return duplicate_elements, duplicate_chars


def paragraph_repetition_filter(x):
    """
    Return False iff more than 30% of the paragraphs, or more than 20% of
    the characters, are repeats of an earlier paragraph.
    """
    text = x['text']
    paragraphs = re.compile(r"\n{2,}").split(text.strip())
    paragraphs_duplicates, char_duplicates = find_duplicates(paragraphs)
    if len(paragraphs) == 0:
        return False
    if paragraphs_duplicates / len(paragraphs) > 0.3:
        return False
    if len(text) == 0:
        return False
    if char_duplicates / len(text) > 0.2:
        return False
    return True

dataset = dataset.filter(
    paragraph_repetition_filter,
    load_from_cache_file=False
)
print(f"After paragraph repetition filter: {dataset.num_rows} rows")

Filter: 100%|██████████| 255/255 [00:00<00:00, 25280.63 examples/s]

After paragraph repetition filter: 230 rows
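
A worked example of the two ratios may help. The sketch below inlines the counting logic on a made-up five-paragraph text, using the same 0.3 and 0.2 thresholds as the filter above:

```python
import re

# Made-up text: one paragraph appears three times.
text = "intro\n\nrepeat me\n\nrepeat me\n\nrepeat me\n\noutro"
paragraphs = re.split(r"\n{2,}", text.strip())

seen, dup_count, dup_chars = set(), 0, 0
for p in paragraphs:
    if p in seen:
        dup_count += 1          # second and third "repeat me"
        dup_chars += len(p)
    else:
        seen.add(p)

paragraph_ratio = dup_count / len(paragraphs)  # 2 / 5
char_ratio = dup_chars / len(text)             # 18 / 45
keep = paragraph_ratio <= 0.3 and char_ratio <= 0.2
print(paragraph_ratio, char_ratio, keep)       # 0.4 0.4 False
```

Only the second and later occurrences count as repeats, so a paragraph that appears three times contributes two duplicates.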





### 2.3 Deduplication

In [35]:
def deduplication(ds):
    """Drop exact duplicate texts, keeping the first occurrence of each."""
    unique_text = set()

    def dedup_func(x):
        if x['text'] in unique_text:
            return False
        unique_text.add(x['text'])
        return True

    # The filter relies on the shared `unique_text` set, so it must run in a
    # single process (num_proc=1).
    return ds.filter(dedup_func, load_from_cache_file=False, num_proc=1)

dataset = deduplication(dataset)
print(f"After deduplication: {dataset.num_rows} rows")

Filter (num_proc=1): 100%|██████████| 230/230 [00:00<00:00, 2345.03 examples/s]

After deduplication: 230 rows
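
The step above only catches byte-identical texts. As a possible variation (not what this notebook runs), the sketch below compares documents after collapsing whitespace and lowercasing. The helpers `normalized_key` and `dedup_normalized` are hypothetical names:

```python
import hashlib
import re

def normalized_key(text):
    """Hash of the text with whitespace collapsed and case folded."""
    normalized = re.sub(r'\s+', ' ', text).strip().lower()
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

def dedup_normalized(texts):
    """Keep the first text for each normalized key, preserving order."""
    seen, kept = set(), []
    for t in texts:
        key = normalized_key(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

# Made-up examples: the first two differ only in spacing and case.
docs = ["One Data  Cartography", "one data cartography\n", "Orchestrator"]
print(dedup_normalized(docs))  # ['One Data  Cartography', 'Orchestrator']
```

Whether such near-duplicates should be merged depends on the corpus; for documentation with meaningful casing, exact matching may be the safer default.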





### 2.4 Summary of data cleaning

In [36]:
print(f"Final cleaned dataset: {dataset.num_rows} rows")
print(f"Total characters: {sum(len(t) for t in dataset['text']):,}")

# Show text length distribution
lengths = [len(t) for t in dataset['text']]
print(f"\nText length statistics:")
print(f"  Min:    {min(lengths):,} chars")
print(f"  Max:    {max(lengths):,} chars")
print(f"  Mean:   {np.mean(lengths):,.0f} chars")
print(f"  Median: {np.median(lengths):,.0f} chars")

Final cleaned dataset: 230 rows
Total characters: 634,203

Text length statistics:
  Min:    110 chars
  Max:    26,589 chars
  Mean:   2,757 chars
  Median: 1,811 chars


## 3. Save the preprocessed dataset to disk

In [37]:
os.makedirs("./data", exist_ok=True)

file_path = "post_training_od_docu_dataset.parquet"
dataset.to_parquet(file_path)
print(f"Saved preprocessed dataset to {file_path}")
print(f"File size: {os.path.getsize(file_path) / 1024:.1f} KB")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 257.07ba/s]

Saved preprocessed dataset to post_training_od_docu_dataset.parquet
File size: 283.0 KB





## 4. Tokenization and Packaging

### 4.1 Load the preprocessed dataset and tokenize

The GPT-2 tokenizer (via tiktoken) is used because it matches the model in `train_gpt2.py`.

In [38]:
# Reload from parquet to verify persistence
dataset = datasets.load_dataset(
    "parquet",
    data_files="post_training_od_docu_dataset.parquet",
    split="train"
)
print(dataset)

Generating train split: 230 examples [00:00, 23474.06 examples/s]

Dataset({
    features: ['text'],
    num_rows: 230
})





In [39]:
# Initialize the GPT-2 tokenizer (same as used in train_gpt2.py)
enc = tiktoken.get_encoding('gpt2')
eot = enc.eot_token  # id of the <|endoftext|> token (public accessor)

print(f"Vocab size: {enc.n_vocab}")
print(f"EOT token id: {eot}")
print(f"\nExample tokenization:")
print(f"  'Hello world' → {enc.encode('Hello world')}")

Vocab size: 50257
EOT token id: 50256

Example tokenization:
  'Hello world' → [15496, 995]


In [40]:
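def tokenization(example):
    """Tokenize text and wrap it in <|endoftext|> delimiter tokens."""
    tokens = enc.encode_ordinary(example["text"])
    # GPT-2 has a single special token, <|endoftext|>; it is used here as both
    # start and end delimiter, so concatenated documents end up separated by
    # two consecutive delimiter tokens.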
    token_ids = [eot] + tokens + [eot]
    example["input_ids"] = token_ids
    example["num_tokens"] = len(token_ids)
    return example

dataset = dataset.map(tokenization, load_from_cache_file=False)
print(dataset)

Map: 100%|██████████| 230/230 [00:00<00:00, 4742.38 examples/s]

Dataset({
    features: ['text', 'input_ids', 'num_tokens'],
    num_rows: 230
})
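
To see what the wrapping does once documents are concatenated, here is a toy sketch in plain Python. The two token lists are stand-ins for real documents; only the `<|endoftext|>` id (50256) comes from the output above:

```python
EOT = 50256  # GPT-2 <|endoftext|> id, as printed above

# Stand-in token ids for two very short documents.
docs = [[464, 29778], [15496, 995]]

# Same wrapping as in tokenization(): delimiter before and after each document.
wrapped = [[EOT] + tokens + [EOT] for tokens in docs]
flat = [t for doc in wrapped for t in doc]
print(flat)
# [50256, 464, 29778, 50256, 50256, 15496, 995, 50256]

# Count boundaries where two delimiters sit next to each other.
doubled = sum(1 for a, b in zip(flat, flat[1:]) if a == b == EOT)
print(doubled)  # 1
```

Each document boundary therefore carries two delimiter tokens. Prepending a single delimiter per document would avoid the doubling, at the cost of changing the token counts reported in this notebook.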





In [41]:
# Inspect a sample
sample = dataset[0]
print("Text (first 200 chars):", sample["text"][:200])
print(f"\nInput IDs (first 30): {sample['input_ids'][:30]}")
print(f"Num tokens: {sample['num_tokens']}")

Text (first 200 chars): The Architecture of the Orchestrator

The input for the Orchestrator is an Ingestion Plan. As described above, these can be configured in the .
For one plan, several items can be configured that are e

Input IDs (first 30): [50256, 464, 29778, 286, 262, 30369, 2536, 1352, 198, 198, 464, 5128, 329, 262, 30369, 2536, 1352, 318, 281, 554, 3495, 295, 5224, 13, 1081, 3417, 2029, 11, 777, 460]
Num tokens: 749


In [42]:
total_tokens = np.sum(dataset["num_tokens"])
print(f"Total tokens in dataset: {total_tokens:,}")

Total tokens in dataset: 141,732


### 4.2 Save as .npy shards (format expected by `DataLoaderLite` in `train_gpt2.py`)

`DataLoaderLite` expects flat `.npy` arrays of `uint16` token IDs, with filenames containing `"train"` or `"val"`. It handles windowing into 1024-length sequences internally — **no pre-packing needed**.

This matches the format produced by `fineweb.py`.
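
Because the dataset is small, it is worth checking how many batches a shard can actually supply. The sketch below assumes the loader reads non-overlapping chunks of `B*T + 1` tokens (inputs plus targets shifted by one); that is an assumption about `DataLoaderLite`, and the batch sizes are hypothetical. The token counts are the split sizes reported later in this notebook:

```python
def full_batches(num_tokens: int, B: int, T: int = 1024) -> int:
    """Number of non-overlapping (B, T) batches, assuming each batch
    needs B*T + 1 tokens and the read position advances by B*T."""
    return max(0, (num_tokens - 1) // (B * T))

for name, n in {"train": 127_558, "val": 14_174}.items():
    for B in (4, 64):  # hypothetical micro-batch sizes
        print(f"{name}: B={B} -> {full_batches(n, B)} full batches")
# train: B=4 -> 31 full batches
# train: B=64 -> 1 full batches
# val: B=4 -> 3 full batches
# val: B=64 -> 0 full batches
```

Under this assumption, a large micro-batch size would not fit into the validation shard at all, so `B` may need to be reduced for post-training on this dataset.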

In [43]:
# Concatenate all input_ids into a single flat array (same as fineweb.py)
all_tokens = np.concatenate(dataset["input_ids"])

# Convert to uint16 (GPT-2 vocab is 50257 < 2^16 = 65536, so uint16 is fine)
assert (0 <= all_tokens).all() and (all_tokens < 2**16).all(), "token values out of uint16 range"
all_tokens = all_tokens.astype(np.uint16)

print(f"Total tokens: {len(all_tokens):,}")
print(f"Dtype: {all_tokens.dtype}")

# Split into train (90%) and val (10%) by token position;
# the cut can fall in the middle of a document.
split_idx = int(len(all_tokens) * 0.9)
splits = {
    "val": all_tokens[split_idx:],
    "train": all_tokens[:split_idx],
}

for split_name, tokens in splits.items():
    print(f"  {split_name}: {len(tokens):,} tokens")

Total tokens: 141,732
Dtype: uint16
  val: 14,174 tokens
  train: 127,558 tokens
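
The split above cuts the flat token stream at a fixed position. If a split on document boundaries is preferred, one option is sketched below; this is an alternative, not what the notebook does, and `split_by_document` is a hypothetical helper operating on per-document token lists:

```python
def split_by_document(docs, train_fraction=0.9):
    """Assign whole documents to train until the token budget is reached.

    `docs` is a list of token-id lists. The document that crosses the
    threshold stays in train, so train may slightly exceed the fraction.
    """
    target = sum(len(d) for d in docs) * train_fraction
    train, val, seen = [], [], 0
    for doc in docs:
        (train if seen < target else val).extend(doc)
        seen += len(doc)
    return train, val

EOT = 50256
# Made-up token ids for three tiny documents, each starting with EOT.
docs = [[EOT, 1, 2, 3], [EOT, 5, 6], [EOT, 7, 8]]
train, val = split_by_document(docs, train_fraction=0.6)
print(len(train), len(val))  # 7 3
print(val[0] == EOT)         # True: val starts at a document boundary
```

In this notebook the per-document lists would be `dataset["input_ids"]`; the resulting lists can then be converted to `uint16` arrays exactly as above.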


In [44]:
# Save as .npy shards into data/ directory
# Naming convention matches fineweb.py: {prefix}_{split}_{shard_index:06d}.npy
data_dir = "./data"
os.makedirs(data_dir, exist_ok=True)

for split_name, tokens in splits.items():
    filename = os.path.join(data_dir, f"od_docu_{split_name}_000000")
    np.save(filename, tokens)
    fsize = os.path.getsize(filename + ".npy") / 1024
    print(f"Saved {filename}.npy  ({len(tokens):,} tokens, {fsize:.1f} KB)")

Saved ./data/od_docu_val_000000.npy  (14,174 tokens, 27.8 KB)
Saved ./data/od_docu_train_000000.npy  (127,558 tokens, 249.3 KB)


In [45]:
# Verify the saved shards can be loaded (same way DataLoaderLite does it)
for f in sorted(os.listdir(data_dir)):
    if f.endswith(".npy"):
        loaded = np.load(os.path.join(data_dir, f), mmap_mode='r')
        print(f"{f}: shape={loaded.shape}, dtype={loaded.dtype}, first 10 tokens={loaded[:10]}")

od_docu_train_000000.npy: shape=(127558,), dtype=uint16, first 10 tokens=[50256   464 29778   286   262 30369  2536  1352   198   198]
od_docu_val_000000.npy: shape=(14174,), dtype=uint16, first 10 tokens=[ 2099   198 29665   952    12  3351  3021    14 25628    13]


## 5. Summary

The `.npy` shards in `./data/` are ready for post-training. To use them with `train_gpt2.py`, set:
```python
data_root = "post-training/data"  # or the appropriate relative/absolute path
```

In [46]:
print(f"{'='*60}")
print(f"DATASET SUMMARY")
print(f"{'='*60}")
print(f"Source .md files:   {len(md_files)}")
print(f"After cleaning:     {len(cleaned_texts)} documents")
print(f"After filters:      {dataset.num_rows} documents")
print(f"Total tokens:       {total_tokens:,}")
print(f"  Train tokens:     {len(splits['train']):,} (90%)")
print(f"  Val tokens:       {len(splits['val']):,} (10%)")
print(f"Output directory:   {os.path.abspath(data_dir)}")
print(f"Output files:")
for f in sorted(os.listdir(data_dir)):
    if f.endswith(".npy"):
        print(f"  {f} ({os.path.getsize(os.path.join(data_dir, f)) / 1024:.1f} KB)")
print(f"\nReady for post-training with train_gpt2.py!")

DATASET SUMMARY
Source .md files:   255
After cleaning:     255 documents
After filters:      230 documents
Total tokens:       141,732
  Train tokens:     127,558 (90%)
  Val tokens:       14,174 (10%)
Output directory:   /Users/alexander.hafer/Desktop/LLM/Karpathy/reproduce_gpt-2/post-training/data
Output files:
  od_docu_train_000000.npy (249.3 KB)
  od_docu_val_000000.npy (27.8 KB)

Ready for post-training with train_gpt2.py!


In [47]:
# Cleanup: remove the intermediate parquet file (no longer needed)
parquet_file = "post_training_od_docu_dataset.parquet"
if os.path.exists(parquet_file):
    os.remove(parquet_file)
    print(f"Removed intermediate file: {parquet_file}")

Removed intermediate file: post_training_od_docu_dataset.parquet
