# 1.11a: Gatsby Corpus Preparation

**Goal:** Download The Great Gatsby from Project Gutenberg and prepare a clean ASCII corpus.

## Processing Steps

1. Download UTF-8 plain text from Project Gutenberg
2. Strip Gutenberg front/back matter
3. Unwrap hard-wrapped paragraphs (preserve paragraph breaks)
4. Transliterate to ASCII with Unidecode (for example, curly quotes become straight quotes and em-dashes become `--`)
5. Export clean ASCII corpus

## Output

- **Clean corpus:** `../data/gatsby_clean.txt`
- Pure 7-bit ASCII encoding
- One paragraph per line
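
Both output properties can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `check_corpus` (not part of this notebook) that inspects the corpus as an in-memory string:

```python
def check_corpus(text):
    """Check the two output properties on a corpus string.

    `check_corpus` is a hypothetical helper used for illustration only.
    """
    lines = text.split("\n")
    return {
        # str.isascii() is True only if every code point is below 128
        "is_ascii": text.isascii(),
        # one paragraph per line implies no empty lines between paragraphs
        "no_blank_lines": all(line.strip() for line in lines),
        "paragraphs": len(lines),
    }


sample = 'First paragraph.\nSecond paragraph, "quoted".'
print(check_corpus(sample))
```

Applied to the saved file, the same checks would run on the result of reading `../data/gatsby_clean.txt` back in.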

## Parameters

In [60]:
# Source URL (Project Gutenberg - UTF-8 plain text)
GATSBY_URL = "https://www.gutenberg.org/files/64317/64317-0.txt"

# Output paths
RAW_PATH = "../data/gatsby_raw.txt"
CLEAN_PATH = "../data/gatsby_clean.txt"

## Imports

In [61]:
import urllib.request
import re
from pathlib import Path
from unidecode import unidecode  # third-party package: pip install Unidecode

print("✓ Imports complete")

✓ Imports complete


## Create Output Directory

In [62]:
Path("../data").mkdir(parents=True, exist_ok=True)
print("✓ Created ../data directory")

✓ Created ../data directory


## Download Raw Corpus

In [63]:
raw_path = Path(RAW_PATH)

if raw_path.exists():
    print(f"✓ Raw corpus already exists at {RAW_PATH}")
else:
    print(f"Downloading from Project Gutenberg...")
    urllib.request.urlretrieve(GATSBY_URL, RAW_PATH)
    print(f"✓ Downloaded to {RAW_PATH}")

# Read raw text (UTF-8)
with open(RAW_PATH, 'r', encoding='utf-8') as f:
    raw_text = f.read()

print(f"\nRaw corpus stats:")
print(f"  Characters: {len(raw_text):,}")
print(f"  Lines: {len(raw_text.splitlines()):,}")
print(f"  Encoding: UTF-8")

✓ Raw corpus already exists at ../data/gatsby_raw.txt

Raw corpus stats:
  Characters: 270,822
  Lines: 6,407
  Encoding: UTF-8


## Strip Gutenberg Front/Back Matter

In [64]:
# Find the Gutenberg delimiters
start_pattern = r'\*\*\* START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*'
start_match = re.search(start_pattern, raw_text, re.IGNORECASE)

if not start_match:
    raise ValueError("Could not find Gutenberg START marker")

end_pattern = r'\*\*\* END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*'
end_match = re.search(end_pattern, raw_text, re.IGNORECASE)

if not end_match:
    raise ValueError("Could not find Gutenberg END marker")

# Extract novel text
novel_text = raw_text[start_match.end():end_match.start()].strip()

print(f"✓ Stripped Gutenberg front/back matter")
print(f"  Novel length: {len(novel_text):,} characters")

✓ Stripped Gutenberg front/back matter
  Novel length: 270,690 characters
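
To see how the marker search behaves, here is a small sketch on a toy string. The sample text and the names `marker`, `start`, `end`, and `body` are made up for illustration; only the regular expression follows the cell above.

```python
import re

# Toy stand-in for a Gutenberg file; the marker lines mimic the real format.
sample = (
    "License header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "\n"
    "Body line one.\n"
    "Body line two.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "License footer text\n"
)

# Same pattern as the notebook cell, with START/END filled in via format().
marker = r'\*\*\* {} OF (?:THIS|THE) PROJECT GUTENBERG EBOOK .+ \*\*\*'
start = re.search(marker.format("START"), sample, re.IGNORECASE)
end = re.search(marker.format("END"), sample, re.IGNORECASE)

# Keep only what lies between the end of START and the beginning of END.
body = sample[start.end():end.start()].strip()
print(body)
```

Because `.` does not match a newline by default, each pattern stays within a single marker line, so the slice cannot swallow body text.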


## Unwrap Paragraphs

In [65]:
# Normalize line endings first (defensive: reading in text mode already
# translates '\r\n' and '\r' to '\n' by default)
text = novel_text.replace('\r\n', '\n').replace('\r', '\n')

# Split into paragraphs (separated by blank lines)
paragraphs = re.split(r'\n\n+', text)

# Unwrap each paragraph (join lines with spaces)
unwrapped = []
for para in paragraphs:
    # Join all lines in the paragraph with spaces
    lines = para.split('\n')
    unwrapped_para = ' '.join(line.strip() for line in lines if line.strip())
    if unwrapped_para:
        unwrapped.append(unwrapped_para)

# Join paragraphs with single newlines
clean_text = '\n'.join(unwrapped)

print(f"✓ Unwrapped paragraphs")
print(f"  Before: {len(novel_text):,} characters")
print(f"  After:  {len(clean_text):,} characters")
print(f"  Paragraphs: {len(unwrapped):,}")

✓ Unwrapped paragraphs
  Before: 270,690 characters
  After:  268,403 characters
  Paragraphs: 1,650
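
The unwrapping logic can be illustrated on a toy input. This is a sketch with made-up sample text and hypothetical variable names (`wrapped`, `parts`, `result`), following the same split-and-join approach as the cell above:

```python
import re

# Two hard-wrapped paragraphs separated by two blank lines.
wrapped = (
    "In my younger and more\n"
    "vulnerable years\n"
    "\n"
    "\n"
    "He didn't say\n"
    "any more\n"
)

# Split on runs of blank lines, then join each paragraph's lines with spaces.
parts = re.split(r'\n\n+', wrapped)
joined = [
    ' '.join(line.strip() for line in part.split('\n') if line.strip())
    for part in parts
]
# Drop empty paragraphs and put one paragraph on each line.
result = '\n'.join(p for p in joined if p)
print(result)
```

Replacing an inner newline with a space leaves the length unchanged, so the drop in character count comes from collapsing each run of blank lines to a single newline and from stripping whitespace at line edges.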


## Convert to ASCII (Unidecode)

In [66]:
# Count non-ASCII characters before conversion
non_ascii_before = sum(1 for c in clean_text if ord(c) >= 128)

# Convert UTF-8 to ASCII using Unidecode
# This properly transliterates: smart quotes → straight quotes, em-dash → --, etc.
ascii_text = unidecode(clean_text)

# Verify it's pure ASCII
non_ascii_after = sum(1 for c in ascii_text if ord(c) >= 128)

print(f"✓ Converted to ASCII using Unidecode")
print(f"  Non-ASCII characters before: {non_ascii_before:,}")
print(f"  Non-ASCII characters after:  {non_ascii_after:,}")
print(f"  Character count: {len(clean_text):,} → {len(ascii_text):,}")

if non_ascii_after > 0:
    raise ValueError(f"Conversion failed: {non_ascii_after} non-ASCII characters remain!")

print(f"  ✓ Verified: 100% ASCII (all characters < 128)")

✓ Converted to ASCII using Unidecode
  Non-ASCII characters before: 4,785
  Non-ASCII characters after:  0
  Character count: 268,403 → 268,928
  ✓ Verified: 100% ASCII (all characters < 128)
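
The character count grows during conversion because some single characters are replaced by several ASCII characters. The sketch below uses a small hand-written table with `str.translate` as a stand-in. It is not Unidecode's actual mapping table, `TRANSLIT` is a hypothetical name, and the ellipsis entry is an assumed example of a one-to-many replacement alongside the em-dash case mentioned in the cell comment.

```python
# Hand-written stand-in for a few mappings; NOT Unidecode's real table.
TRANSLIT = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2014": "--",   # em-dash expands to two characters
    "\u2026": "...",  # ellipsis expands to three characters (assumed example)
})

# Made-up sample containing curly quotes, an em-dash, and an ellipsis.
sample = "\u201cI\u2019m glad\u2014very glad\u2026\u201d"
converted = sample.translate(TRANSLIT)

print(converted)
print(len(sample), "->", len(converted))
```

One-to-one replacements such as quotes keep the length constant, while each expanded character adds to it.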


## Save Clean Corpus

In [67]:
# Write clean corpus (ASCII encoding)
with open(CLEAN_PATH, 'w', encoding='ascii') as f:
    f.write(ascii_text)

print(f"✓ Saved clean corpus to {CLEAN_PATH}")
print(f"  Size: {len(ascii_text):,} characters")
print(f"  Lines: {len(ascii_text.splitlines()):,}")
print(f"  Encoding: ASCII (pure 7-bit)")

✓ Saved clean corpus to ../data/gatsby_clean.txt
  Size: 268,928 characters
  Lines: 1,650
  Encoding: ASCII (pure 7-bit)


## Preview

In [68]:
# Show first 10 paragraphs
lines = ascii_text.split('\n')
print("\nFirst 10 paragraphs:")
print("=" * 80)
for i, line in enumerate(lines[:10], start=1):
    print(f"{i}: {line[:100]}{'...' if len(line) > 100 else ''}")
print("=" * 80)
print()

# Show last 5 paragraphs (numbered by their position in the corpus)
print("Last 5 paragraphs:")
print("=" * 80)
for i, line in enumerate(lines[-5:], start=len(lines) - 4):
    print(f"{i}: {line[:100]}{'...' if len(line) > 100 else ''}")
print("=" * 80)


First 10 paragraphs:
1: The Great Gatsby by F. Scott Fitzgerald
2: Table of Contents
3: I II III IV V VI VII VIII IX
4: Once again to Zelda
5: Then wear the gold hat, if that will move her; If you can bounce high, bounce for her too, Till she ...
6: Thomas Parke d'Invilliers
7: I
8: In my younger and more vulnerable years my father gave me some advice that I've been turning over in...
9: "Whenever you feel like criticizing anyone," he told me, "just remember that all the people in this ...
10: He didn't say any more, but we've always been unusually communicative in a reserved way, and I under...

Last 5 paragraphs:
1646: On the last night, with my trunk packed and my car sold to the grocer, I went over and looked at tha...
1647: Most of the big shore places were closed now and there were hardly any lights except the shadowy, mo...
1648: And as I sat there brooding on the old, unknown world, I thought of Gatsby's wonder when he first pi...
1649: Gatsby believed in the green light, the 

## Summary

In [70]:
print("\n" + "=" * 80)
print("CORPUS PREPARATION COMPLETE")
print("=" * 80)
print()
print(f"Processing:")
print(f"  ✓ Downloaded UTF-8 text from Project Gutenberg")
print(f"  ✓ Stripped Gutenberg front/back matter")
print(f"  ✓ Unwrapped hard-wrapped paragraphs")
print(f"  ✓ Converted to ASCII using Unidecode library")
print(f"  ✓ Verified 100% ASCII (all bytes < 128)")
print()
print(f"Output:")
print(f"  File: {CLEAN_PATH}")
print(f"  Size: {len(ascii_text):,} characters")
print(f"  Paragraphs: {len(unwrapped):,}")
print(f"  Encoding: ASCII (7-bit)")
print()

# Analyze character coverage
ascii_bytes = ascii_text.encode('ascii')
unique_bytes = set(ascii_bytes)
present_tokens = sorted(unique_bytes)
missing_tokens = sorted(set(range(128)) - unique_bytes)

print(f"Character coverage (ASCII 0-127):")
print(f"  Present: {len(present_tokens)} / 128 ({100*len(present_tokens)/128:.1f}%)")
print(f"  Missing: {len(missing_tokens)} / 128 ({100*len(missing_tokens)/128:.1f}%)")
print()

if len(missing_tokens) <= 20:
    # Show missing chars if not too many
    missing_chars = []
    for b in missing_tokens:
        if 32 <= b < 127:  # Printable
            missing_chars.append(f"{chr(b)} ({b})")
        else:  # Control char
            missing_chars.append(f"0x{b:02x} ({b})")
    print(f"  Missing characters: {', '.join(missing_chars)}")
    print()

print(f"Implication for training:")
print(f"  {len(present_tokens)} tokens will be trained (appear in corpus)")
print(f"  {len(missing_tokens)} tokens will be untrained (never seen)")
print()

print(f"Next steps:")
print(f"  → Use this corpus for ASCII-based training experiments")
print(f"  → All characters guaranteed to be in range [0, 127]")
print(f"  → Expect {len(missing_tokens)} untrained tokens to form structure")
print()
print("=" * 80)


CORPUS PREPARATION COMPLETE

Processing:
  ✓ Downloaded UTF-8 text from Project Gutenberg
  ✓ Stripped Gutenberg front/back matter
  ✓ Unwrapped hard-wrapped paragraphs
  ✓ Converted to ASCII using Unidecode library
  ✓ Verified 100% ASCII (all bytes < 128)

Output:
  File: ../data/gatsby_clean.txt
  Size: 268,928 characters
  Paragraphs: 1,650
  Encoding: ASCII (7-bit)

Character coverage (ASCII 0-127):
  Present: 79 / 128 (61.7%)
  Missing: 49 / 128 (38.3%)

Implication for training:
  79 tokens will be trained (appear in corpus)
  49 tokens will be untrained (never seen)

Next steps:
  → Use this corpus for ASCII-based training experiments
  → All characters guaranteed to be in range [0, 127]
  → Expect 49 untrained tokens to form structure
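
The summary cell lists the missing byte values only when there are at most 20 of them, so with 49 missing nothing is listed. A minimal sketch for breaking the missing set into control codes and printable characters, assuming a hypothetical helper `split_missing` that is not part of this notebook:

```python
def split_missing(text):
    """Return (missing control codes, missing printable codes) for ASCII text.

    `split_missing` is a hypothetical helper used for illustration only.
    """
    present = set(text.encode("ascii"))
    missing = sorted(set(range(128)) - present)
    # ASCII control codes are 0-31 plus 127 (DEL); 32-126 are printable.
    control = [b for b in missing if b < 32 or b == 127]
    printable = [b for b in missing if 32 <= b < 127]
    return control, printable


# Toy input: three letters and a newline.
control, printable = split_missing("abc\n")
print(len(control), "control codes missing;", len(printable), "printable missing")
```

Calling it on `ascii_text` would show how many of the missing values are control codes that plain prose is unlikely to contain.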

