# 📊 Data Exploration: Pre-1986 Training Streams

This notebook explores the training data for our from-scratch Small Language Model (SLM). We're building a 300M-parameter model trained **exclusively on pre-1986 knowledge** for counterfactual safety reasoning.

**Goal:** Understand what we're working with before we start training.

---
## 1. Setup & Imports

Just the basics - nothing fancy here.

In [None]:
import os
from pathlib import Path
from collections import Counter

# We'll use matplotlib for some simple visualizations
import matplotlib.pyplot as plt

# Where our training data lives
DATA_DIR = Path("../data/pre1986_training_streams_v1_FINAL")

# Quick sanity check
if DATA_DIR.exists():
    print(f"✓ Data directory found: {DATA_DIR.resolve()}")
    print(f"  Files: {list(DATA_DIR.glob('*.txt'))}")
else:
    print("✗ Data directory not found - check your path!")

---
## 2. The Training Streams

Our dataset consists of **4 text streams**, each serving a different purpose:

| File | Purpose | Training Phase |
|------|---------|----------------|
| `base_stream.txt` | General knowledge (books, textbooks, essays) | Phase A: Pretraining |
| `finetune_control.txt` | Control systems theory | Phase C: Fine-tuning |
| `finetune_nuclear.txt` | Nuclear engineering concepts | Phase C: Fine-tuning |
| `finetune_reliability.txt` | System reliability analysis | Phase C: Fine-tuning |

Let's see what we're dealing with.

In [None]:
def get_file_stats(filepath):
    """Get basic stats about a text file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Count some basic things
    lines = content.split('\n')
    words = content.split()
    chars = len(content)
    
    return {
        'size_mb': os.path.getsize(filepath) / (1024 * 1024),
        'lines': len(lines),
        'words': len(words),
        'chars': chars,
        'sample': content[:500]  # First 500 chars as a preview
    }

# Gather stats for all our files
files = ['base_stream.txt', 'finetune_control.txt', 
         'finetune_nuclear.txt', 'finetune_reliability.txt']

stats = {}
for fname in files:
    fpath = DATA_DIR / fname
    if fpath.exists():
        stats[fname] = get_file_stats(fpath)
        print(f"{fname}:")
        print(f"  Size: {stats[fname]['size_mb']:.2f} MB")
        print(f"  Words: {stats[fname]['words']:,}")
        print()

---
## 3. Visualizing the Data Distribution

Let's see how the data is split between base pretraining and fine-tuning.

In [None]:
# Simple bar chart of file sizes
names = list(stats.keys())
sizes = [stats[n]['size_mb'] for n in names]

# Make it look decent
plt.figure(figsize=(10, 5))
colors = ['#2ecc71', '#3498db', '#e74c3c', '#9b59b6']  # Nice colors
bars = plt.bar(names, sizes, color=colors)

plt.ylabel('Size (MB)')
plt.title('Training Data Size by Stream')
plt.xticks(rotation=15)

# Add size labels on top of bars
for bar, size in zip(bars, sizes):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
             f'{size:.1f} MB', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

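# The base stream is MUCH larger - that's intentional
# Guard against a missing base file so this cell doesn't fail with a KeyError
if 'base_stream.txt' in stats and sum(sizes) > 0:
    base_pct = stats['base_stream.txt']['size_mb'] / sum(sizes) * 100
    print(f"\nBase pretraining data is {base_pct:.1f}% of total - this is by design.")
    print("Fine-tuning is meant to *shape* reasoning, not add bulk knowledge.")
else:
    print("base_stream.txt was not loaded - check DATA_DIR before comparing sizes.")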

---
## 4. Peeking at the Content

Let's look at actual samples from each stream to understand what the model will learn from.

In [None]:
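def show_sample(filename, start_pos=0, length=1000):
    """Show up to `length` bytes from a file, starting at byte `start_pos`."""
    # Read in binary mode: seeking to an arbitrary offset is only well-defined for bytes
    with open(DATA_DIR / filename, 'rb') as f:
        f.seek(start_pos)
        raw = f.read(length)
    # A byte offset can land inside a multi-byte UTF-8 character; replace the broken edges
    sample = raw.decode('utf-8', errors='replace')
    
    print(f"=== {filename} (starting at byte {start_pos}) ===")
    print(sample)
    print("\n" + "="*50 + "\n")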

# Show samples from each file
# Starting from different positions to see variety
show_sample('base_stream.txt', start_pos=1000)

In [None]:
# Fine-tuning data - control systems
show_sample('finetune_control.txt', start_pos=500)

In [None]:
# Fine-tuning data - nuclear engineering
show_sample('finetune_nuclear.txt', start_pos=1000)
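
Since the streams are meant to contain only pre-1986 text, one rough sanity check while peeking at the content is to scan for four-digit years at or after the cutoff. The sketch below is an assumption about how such a check could look, not part of the original pipeline: `CUTOFF_YEAR`, `YEAR_RE`, and `years_at_or_after_cutoff` are hypothetical names, and the toy string is made up for illustration.

```python
import re
from collections import Counter

CUTOFF_YEAR = 1986  # assumption: "pre-1986" means years strictly before 1986
# Standalone four-digit numbers that look like years (1500-2099); hypothetical pattern
YEAR_RE = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")

def years_at_or_after_cutoff(text, cutoff=CUTOFF_YEAR):
    """Count year-like numbers in `text` that are >= cutoff."""
    years = Counter(int(m.group(1)) for m in YEAR_RE.finditer(text))
    return {year: n for year, n in sorted(years.items()) if year >= cutoff}

# Made-up sample text, not real training data
toy_text = "Published in 1979. Revised 1984. See the 1991 edition; part no. 19860 is unrelated."
print(years_at_or_after_cutoff(toy_text))
```

Hits are only candidates for manual review: a number such as "2000" may be a quantity rather than a year, and older texts can legitimately mention future dates.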

---
## 5. Document Structure: EOS Markers

Our data uses `<EOS>` (End Of Sequence) markers to separate documents. This is important because:
- The model learns that `<EOS>` means "this thought is complete"
- We don't want the model to blend unrelated documents together

Let's count them.

In [None]:
def count_eos_markers(filename):
    """Count how many documents are in a file."""
    with open(DATA_DIR / filename, 'r', encoding='utf-8') as f:
        content = f.read()
    return content.count('<EOS>')

print("Document counts (based on <EOS> markers):")
print("-" * 40)
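for fname in files:
    # Skip missing files, consistent with the stats cell above
    if not (DATA_DIR / fname).exists():
        print(f"{fname}: not found, skipping")
        continue
    count = count_eos_markers(fname)
    print(f"{fname}: {count:,} documents")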
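
Counting markers gives the number of separators, which may differ from the number of documents if a stream ends without a trailing `<EOS>`. A minimal sketch of splitting on the marker and summarizing per-document lengths follows; `split_documents` and `doc_length_summary` are hypothetical helpers, and the toy stream is made up for illustration.

```python
from statistics import mean, median

EOS = "<EOS>"

def split_documents(text, marker=EOS):
    """Split a stream into documents on the marker, dropping empty pieces."""
    return [doc.strip() for doc in text.split(marker) if doc.strip()]

def doc_length_summary(text, marker=EOS):
    """Summarize per-document word counts for one stream."""
    lengths = [len(doc.split()) for doc in split_documents(text, marker)]
    if not lengths:
        return {"documents": 0}
    return {
        "documents": len(lengths),
        "min_words": min(lengths),
        "median_words": median(lengths),
        "mean_words": mean(lengths),
        "max_words": max(lengths),
    }

# Tiny made-up stream, not real training data; the last document has no marker
toy_stream = "First document here.<EOS>Second one is a bit longer.<EOS>Trailing text without a marker"
summary = doc_length_summary(toy_stream)
print(summary)
```

On a real stream, the same function could be applied to the file contents to spot very short or very long documents before choosing a context length.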

---
## 6. Character & Word Distribution

Let's look at which characters appear in our data. This helps us answer two questions:
- Are there any encoding issues?
- Which special characters do we need to handle?

In [None]:
# Analyze base_stream since it's the largest
with open(DATA_DIR / 'base_stream.txt', 'r', encoding='utf-8') as f:
    base_content = f.read()

# Count characters
char_counts = Counter(base_content)

# Show top 30 most common characters
print("Top 30 most common characters in base_stream.txt:")
print("-" * 50)
for char, count in char_counts.most_common(30):
    # Make whitespace visible
    display_char = repr(char) if char in '\n\t\r ' else char
    print(f"  {display_char}: {count:,}")

In [None]:
# Check for any unusual unicode characters
unusual = {char: count for char, count in char_counts.items() 
           if ord(char) > 127}  # Non-ASCII

if unusual:
    print("Non-ASCII characters found:")
    for char, count in sorted(unusual.items(), key=lambda x: -x[1])[:20]:
        print(f"  U+{ord(char):04X} '{char}': {count:,}")
else:
    print("✓ All characters are ASCII - clean data!")
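
To decide which special characters need handling, it can help to see each non-ASCII character's Unicode name and general category next to its count. The sketch below uses the standard-library `unicodedata` module; `describe_non_ascii` is a hypothetical helper, and the toy text is made up for illustration.

```python
import unicodedata
from collections import Counter

def describe_non_ascii(text, top_n=20):
    """Return (char, code point, Unicode name, category, count) for common non-ASCII characters."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    rows = []
    for ch, n in counts.most_common(top_n):
        # Fall back to a placeholder for characters without an assigned name
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, f"U+{ord(ch):04X}", name, unicodedata.category(ch), n))
    return rows

# Made-up sample with a Greek letter, a math symbol, and the micro sign
toy_text = "\u0394x \u2248 0.5 \u00b5m, and \u0394t \u2248 2 s"
for row in describe_non_ascii(toy_text):
    print(row)
```

Categories starting with `S` (symbols) or `L` (letters) are usually legitimate content in scientific text, while unexpected control or private-use characters (categories `Cc`, `Co`) are more likely to point to extraction problems.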

---
## 7. Key Takeaways

Before moving on to tokenization, here's what we learned:

1. **Data Scale:** The base pretraining stream is ~50 MB; the fine-tuning streams are much smaller (~2.5 MB combined)
2. **Structure:** Documents are separated by `<EOS>` markers
3. **Content Quality:** Pre-1986 scientific and engineering text - exactly what we need
4. **Encoding:** UTF-8 encoded, mostly ASCII with some special characters for equations

**Next:** In notebook 02, we'll train a BPE tokenizer on this data.

In [None]:
# Summary stats to remember
print("=" * 50)
print("SUMMARY")
print("=" * 50)
total_mb = sum(s['size_mb'] for s in stats.values())
total_words = sum(s['words'] for s in stats.values())
print(f"Total data size: {total_mb:.2f} MB")
print(f"Total word count: {total_words:,}")
print(f"Files: {len(stats)}")