# 1.11a: Gatsby Corpus Preparation

**Goal:** Download *The Great Gatsby* from Project Gutenberg and clean it into a plain-text corpus for training experiments.

## Processing Steps

1. Download from Project Gutenberg
2. Strip Gutenberg front/back matter
3. Unwrap hard-wrapped paragraphs (single line breaks → spaces, preserve paragraph breaks)
4. Export clean corpus for reuse

## Output

- **Clean corpus:** `../data/gatsby_clean.txt`
- One paragraph per line

## Parameters

In [51]:
# Source URL (Project Gutenberg)
GATSBY_URL = "https://www.gutenberg.org/files/64317/64317-0.txt"

# Output paths
RAW_PATH = "../data/gatsby_raw.txt"
CLEAN_PATH = "../data/gatsby_clean.txt"

## Imports

In [52]:
import urllib.request
import re
from pathlib import Path

## Create Output Directory

In [53]:
Path("../data").mkdir(parents=True, exist_ok=True)
print("✓ Created ../data directory")

✓ Created ../data directory


## Download Raw Corpus

In [54]:
raw_path = Path(RAW_PATH)

if raw_path.exists():
    print(f"✓ Raw corpus already exists at {RAW_PATH}")
else:
    print("Downloading from Project Gutenberg...")
    urllib.request.urlretrieve(GATSBY_URL, RAW_PATH)
    print(f"✓ Downloaded to {RAW_PATH}")

# Read raw text
with open(RAW_PATH, 'r', encoding='utf-8') as f:
    raw_text = f.read()

print("\nRaw corpus stats:")
print(f"  Characters: {len(raw_text):,}")
print(f"  Lines: {len(raw_text.splitlines()):,}")

✓ Raw corpus already exists at ../data/gatsby_raw.txt

Raw corpus stats:
  Characters: 270,822
  Lines: 6,407
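
The cell above uses `urllib.request.urlretrieve`, which takes no timeout argument and which the Python documentation lists under its legacy interface. As a minimal sketch of a more defensive variant, the hypothetical helper `fetch_text` below downloads with `urlopen` and writes through a temporary file so that an interrupted download cannot leave a partial file at the cached path. The helper name, the 30-second timeout, and the User-Agent string are illustrative assumptions, not part of this notebook.

```python
import urllib.request
from pathlib import Path


def fetch_text(url: str, dest: Path, timeout: float = 30.0) -> str:
    """Download url to dest unless dest already exists, then return its text.

    Hypothetical helper: the timeout and User-Agent values are illustrative.
    """
    if not dest.exists():
        request = urllib.request.Request(
            url, headers={"User-Agent": "gatsby-corpus-prep"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            data = response.read()
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Write under a temporary name first, then rename, so the cached
        # path only ever holds a complete download.
        tmp = dest.with_name(dest.name + ".part")
        tmp.write_bytes(data)
        tmp.replace(dest)
    # "utf-8-sig" drops a leading byte-order mark if the file has one
    # and behaves like plain UTF-8 otherwise.
    return dest.read_text(encoding="utf-8-sig")
```

With the parameters defined earlier, a call would look like `raw_text = fetch_text(GATSBY_URL, Path(RAW_PATH))`.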


## Strip Gutenberg Front/Back Matter

In [55]:
# Find the Gutenberg delimiters
start_pattern = r'\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*'
start_match = re.search(start_pattern, raw_text, re.IGNORECASE)

if not start_match:
    raise ValueError("Could not find Gutenberg START marker")

end_pattern = r'\*\*\* END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*'
end_match = re.search(end_pattern, raw_text, re.IGNORECASE)

if not end_match:
    raise ValueError("Could not find Gutenberg END marker")

# Extract novel text
novel_text = raw_text[start_match.end():end_match.start()].strip()

print("✓ Stripped Gutenberg front/back matter")
print(f"  Novel length: {len(novel_text):,} characters")

✓ Stripped Gutenberg front/back matter
  Novel length: 270,690 characters
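
The marker search above can also be wrapped in a function and exercised on a small synthetic string, which makes it easy to confirm that the slice excludes both marker lines. This is a sketch: `strip_gutenberg` and the sample text are hypothetical, and the only change from the cell above is that the END marker is searched for after the START marker rather than from the beginning of the file.

```python
import re

START_RE = re.compile(
    r"\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*",
    re.IGNORECASE,
)
END_RE = re.compile(
    r"\*\*\* END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*",
    re.IGNORECASE,
)


def strip_gutenberg(raw: str) -> str:
    """Return the text between the Gutenberg START and END marker lines."""
    start = START_RE.search(raw)
    if start is None:
        raise ValueError("Could not find Gutenberg START marker")
    # Search for the END marker only after the START marker.
    end = END_RE.search(raw, start.end())
    if end is None:
        raise ValueError("Could not find Gutenberg END marker")
    return raw[start.end():end.start()].strip()


# Synthetic stand-in for the raw file, not real Gutenberg boilerplate.
sample = (
    "Header and license text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "\n"
    "In my younger and more vulnerable years\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "Footer and license text\n"
)
body = strip_gutenberg(sample)
print(body)  # In my younger and more vulnerable years
```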


## Unwrap Paragraphs

In [56]:
# Normalize line endings first
text = novel_text.replace('\r\n', '\n').replace('\r', '\n')

# Split into paragraphs (separated by blank lines)
paragraphs = re.split(r'\n\n+', text)

# Unwrap each paragraph (join lines with spaces)
unwrapped = []
for para in paragraphs:
    # Join all lines in the paragraph with spaces
    lines = para.split('\n')
    unwrapped_para = ' '.join(line.strip() for line in lines if line.strip())
    if unwrapped_para:
        unwrapped.append(unwrapped_para)

# Join paragraphs with single newlines
clean_text = '\n'.join(unwrapped)

print("✓ Unwrapped paragraphs")
print(f"  Before: {len(novel_text):,} characters")
print(f"  After:  {len(clean_text):,} characters")
print(f"  Paragraphs: {len(unwrapped):,}")

✓ Unwrapped paragraphs
  Before: 270,690 characters
  After:  268,403 characters
  Paragraphs: 1,650
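
One edge case in the split above: `r'\n\n+'` only recognizes completely empty lines as paragraph breaks, so a separator line that contains stray spaces would cause the two neighboring paragraphs to be joined. Whether the Gutenberg file contains such lines has not been checked here. As a sketch, the hypothetical function `unwrap_paragraphs` below keeps the same joining logic but treats whitespace-only lines as blank.

```python
import re


def unwrap_paragraphs(text: str) -> list[str]:
    """Split text into paragraphs and join each paragraph's lines with spaces."""
    # Normalize line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is a newline, optional whitespace, then a newline,
    # so lines holding only spaces or tabs also count as blank.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        joined = " ".join(
            line.strip() for line in block.split("\n") if line.strip()
        )
        if joined:
            paragraphs.append(joined)
    return paragraphs


# Synthetic sample: the separator after the first paragraph contains spaces.
sample = (
    "In my younger\nand more vulnerable years\n   \n"
    "He didn't say\r\nany more\n\n\nI"
)
result = unwrap_paragraphs(sample)
print(result)
# ['In my younger and more vulnerable years', "He didn't say any more", 'I']
```

With the original `r'\n\n+'` pattern, the first two paragraphs of this sample would come out as a single line.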


## Save Clean Corpus

In [57]:
# Write clean corpus
with open(CLEAN_PATH, 'w', encoding='utf-8') as f:
    f.write(clean_text)

print(f"✓ Saved clean corpus to {CLEAN_PATH}")
print(f"  Size: {len(clean_text):,} characters")
print(f"  Lines: {len(clean_text.splitlines()):,}")

✓ Saved clean corpus to ../data/gatsby_clean.txt
  Size: 268,403 characters
  Lines: 1,650
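
A quick round-trip check can confirm that the file on disk really holds one paragraph per line. The sketch below uses a hypothetical helper, `corpus_round_trips`, and a temporary file in place of `CLEAN_PATH`. It splits on `"\n"` rather than calling `str.splitlines()`, because `splitlines()` also breaks on characters such as form feed and U+2028, which could inflate the line count.

```python
import tempfile
from pathlib import Path


def corpus_round_trips(path: Path, paragraphs: list[str]) -> bool:
    """Return True if the file at path holds exactly the given paragraphs, one per line."""
    lines = path.read_text(encoding="utf-8").split("\n")
    # Every line must be non-empty and match the in-memory paragraph list.
    return all(line.strip() for line in lines) and lines == paragraphs


# Stand-in data; in the notebook this would be `unwrapped` and CLEAN_PATH.
paragraphs = ["First paragraph.", "Second paragraph."]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clean.txt"
    path.write_text("\n".join(paragraphs), encoding="utf-8")
    ok = corpus_round_trips(path, paragraphs)
print(ok)  # True
```

In the notebook itself, the equivalent call would be `corpus_round_trips(Path(CLEAN_PATH), unwrapped)`.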


## Preview

In [58]:
# Show first 10 paragraphs
lines = clean_text.split('\n')
print("\nFirst 10 paragraphs:")
print("=" * 80)
for i, line in enumerate(lines[:10]):
    print(f"{i+1}: {line[:100]}{'...' if len(line) > 100 else ''}")
print("=" * 80)
print()

# Show last 5 paragraphs
print("Last 5 paragraphs:")
print("=" * 80)
for i, line in enumerate(lines[-5:]):
    print(f"{len(lines)-5+i+1}: {line[:100]}{'...' if len(line) > 100 else ''}")
print("=" * 80)


First 10 paragraphs:
1: The Great Gatsby by F. Scott Fitzgerald
2: Table of Contents
3: I II III IV V VI VII VIII IX
4: Once again to Zelda
5: Then wear the gold hat, if that will move her; If you can bounce high, bounce for her too, Till she ...
6: Thomas Parke d’Invilliers
7: I
8: In my younger and more vulnerable years my father gave me some advice that I’ve been turning over in...
9: “Whenever you feel like criticizing anyone,” he told me, “just remember that all the people in this ...
10: He didn’t say any more, but we’ve always been unusually communicative in a reserved way, and I under...

Last 5 paragraphs:
1646: On the last night, with my trunk packed and my car sold to the grocer, I went over and looked at tha...
1647: Most of the big shore places were closed now and there were hardly any lights except the shadowy, mo...
1648: And as I sat there brooding on the old, unknown world, I thought of Gatsby’s wonder when he first pi...
1649: Gatsby believed in the green light, the 

## Summary

In [59]:
print("\n" + "=" * 80)
print("CORPUS PREPARATION COMPLETE")
print("=" * 80)
print()
print("Processing:")
print("  ✓ Stripped Gutenberg front/back matter")
print("  ✓ Unwrapped hard-wrapped paragraphs")
print("  ✓ One paragraph per line")
print()
print("Output:")
print(f"  Clean corpus: {CLEAN_PATH}")
print(f"  Final size: {len(clean_text):,} characters")
print(f"  Paragraphs: {len(unwrapped):,}")
print()
print("Next steps:")
print("  → Eyeball check gatsby_clean.txt")
print("  → 1.11b: Create ASCII tokenizer based on this corpus")
print("  → 1.12a: Train Lil Gatsby model")
print()
print("=" * 80)


CORPUS PREPARATION COMPLETE

Processing:
  ✓ Stripped Gutenberg front/back matter
  ✓ Unwrapped hard-wrapped paragraphs
  ✓ One paragraph per line

Output:
  Clean corpus: ../data/gatsby_clean.txt
  Final size: 268,403 characters
  Paragraphs: 1,650

Next steps:
  → Eyeball check gatsby_clean.txt
  → 1.11b: Create ASCII tokenizer based on this corpus
  → 1.12a: Train Lil Gatsby model
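
Since the next notebook builds an ASCII tokenizer and the preview above shows typographic quotes (`’`, `“`, `”`) in the cleaned text, it may help to list the non-ASCII characters during the eyeball check. The following is a sketch only: `non_ascii_counts` is a hypothetical helper, and the sample string stands in for `clean_text`.

```python
import unicodedata
from collections import Counter


def non_ascii_counts(text: str) -> Counter:
    """Count each character in text whose code point is outside the ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Stand-in for clean_text, based on lines shown in the preview.
sample = (
    "“Whenever you feel like criticizing anyone,” he told me. "
    "He didn’t say any more."
)
counts = non_ascii_counts(sample)
for ch, n in counts.most_common():
    # Print the code point, the Unicode name, and the number of occurrences.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}: {n}")
```

Running `non_ascii_counts(clean_text)` would show which characters an ASCII tokenizer needs to map or drop.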

