# Notebook 01: Deterministic Ingestion & Cleaning

## Goal

Download public-domain text from Project Gutenberg and clean it for downstream processing.

We'll remove front/back matter, normalize whitespace, and preserve structural markers (chapters, paragraphs).


## Why Project Gutenberg?

Project Gutenberg provides stable, public-domain texts in plain UTF-8 format. For this project, we'll use:

- **The Iliad** (Homer, translated)
- **The Picture of Dorian Gray** (Oscar Wilde)

These texts are well-structured and suitable for demonstrating retrieval-augmented generation (RAG) with verbatim citations.


## Text Variability & Cleaning Challenges

Project Gutenberg texts include:

1. **Front matter**: Title page, copyright notices, table of contents
2. **Back matter**: End-of-book notices, legal text
3. **Encoding issues**: Mixed line endings and special characters in some files
4. **Structural markers**: Chapter headings, line numbers (in poetry), paragraph breaks

Our cleaning strategy:

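- Strip everything outside the known separator lines (e.g., "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***" and its "*** END OF ..." counterpart; the exact wording varies between files)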
- Normalize whitespace (collapse multiple spaces, preserve paragraph breaks)
- Keep chapter markers for citation purposes
- Ensure UTF-8 encoding throughout
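
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the project's `clean_text`: the names `strip_gutenberg_wrapper`, `normalize_whitespace`, and `clean_text_sketch` are hypothetical, and the separator patterns assume the marker lines read "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ..." and "*** END OF ...", which should be checked against the actual file.

```python
import re

# Assumption: separator lines look like
# "*** START OF THE PROJECT GUTENBERG EBOOK <TITLE> ***" (or "THIS" instead of "THE").
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_wrapper(text: str) -> str:
    """Keep only the text between the START and END separator lines, if present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text


def normalize_whitespace(text: str) -> str:
    """Unify line endings, collapse runs of spaces, keep paragraph breaks."""
    text = text.lstrip("\ufeff")                           # drop a leading BOM if present
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # CRLF / CR -> LF
    text = re.sub(r"[ \t]+", " ", text)                    # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # at most one blank line in a row
    return text.strip()


def clean_text_sketch(raw: str) -> str:
    """Strip front/back matter first, then normalize whitespace."""
    return normalize_whitespace(strip_gutenberg_wrapper(raw))
```

Chapter headings pass through untouched because nothing here matches on them. Note that trimming spaces around newlines also removes leading indentation, which may matter for verse; relax that step if line layout needs to survive.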


## Risks & Considerations

- **Different editions**: Project Gutenberg may host multiple editions; choose a stable URL
- **Re-run behavior**: Skip the download if the file already exists, so re-running the notebook is idempotent
- **Reproducibility**: Document the exact URL and date of download
- **File size**: Full texts can be large; ensure we're working with cleaned, manageable chunks downstream
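
One way to cover the re-run and reproducibility points together is to skip existing files and write a small provenance record next to each download. The sketch below is an assumption-laden illustration: `download_if_missing`, `fetch_bytes`, and the `.meta.json` sidecar layout are hypothetical names, not part of the project, and the URL is supplied by the caller.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: plain HTTP GET via the standard library."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def download_if_missing(
    url: str, dest: Path, fetch: Callable[[str], bytes] = fetch_bytes
) -> Path:
    """Download url to dest unless a non-empty dest already exists."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # idempotent: a re-run does no network work
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))
    # Hypothetical sidecar file recording the exact URL and download date.
    meta = {"url": url, "downloaded": date.today().isoformat()}
    meta_path = dest.with_name(dest.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return dest
```

Passing the fetcher as a parameter keeps the function testable without network access, and the sidecar file gives later notebooks a record of which edition was used.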


## Step 1: Load Configuration

We'll read `configs/app.yaml` to get the book choice and output directories.


In [1]:
# === TODO (you code this) ===
# Load configs/app.yaml; expose book choice and output dirs.
# Acceptance: dict with keys: book, chunk_size, chunk_overlap, embedding_model

def load_config(path="configs/app.yaml"):
    raise NotImplementedError
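
If you want a starting point, here is one possible shape for this step. It is a sketch under assumptions: it uses PyYAML's `yaml.safe_load` (a third-party dependency the project may or may not already use), and `validate_config` and `REQUIRED_KEYS` are hypothetical helpers that only check the keys named in the acceptance criteria.

```python
from pathlib import Path

# Keys listed in the acceptance criteria of the cell above.
REQUIRED_KEYS = ("book", "chunk_size", "chunk_overlap", "embedding_model")


def validate_config(cfg: dict) -> dict:
    """Raise if the parsed config is not a mapping or lacks a required key."""
    if not isinstance(cfg, dict):
        raise TypeError(f"expected a mapping, got {type(cfg).__name__}")
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return cfg


def load_config(path="configs/app.yaml") -> dict:
    """Parse the YAML file and validate its keys."""
    # PyYAML is third-party; imported lazily so validate_config works without it.
    import yaml

    with Path(path).open(encoding="utf-8") as fh:
        return validate_config(yaml.safe_load(fh))
```

Failing early on a missing key tends to be easier to debug than a `KeyError` several notebooks later.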


## Step 2: Download Book

Download the requested book from Project Gutenberg into `data/raw/`.


In [2]:
# === TODO (you code this) ===
# Download book into data/raw/ and return path.
# Acceptance: file exists; non-empty; UTF-8.

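# Note: `src` must be importable, so the project root has to be on sys.path
# (otherwise this import raises ModuleNotFoundError).
from src.ingest import download_book

# Implement download_book in src/ingest.py first.
# Then call it here with the book name from config.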
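
The acceptance criteria for this step (file exists, non-empty, UTF-8) can be checked mechanically. The helper below is a hypothetical sketch, not part of `src.ingest`; it uses only the standard library.

```python
from pathlib import Path


def check_raw_file(path) -> int:
    """Return the character count if the file exists, is non-empty, and is valid UTF-8."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(path)
    data = path.read_bytes()
    if not data:
        raise ValueError(f"{path} is empty")
    # "utf-8-sig" also accepts a leading byte-order mark;
    # bytes that are not valid UTF-8 raise UnicodeDecodeError.
    return len(data.decode("utf-8-sig"))
```

Calling it on the path returned by `download_book` right after the download surfaces a bad file before any cleaning runs.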



## Step 3: Clean Raw Text

Remove front/back matter, normalize whitespace, and save cleaned text to `data/interim/`.


In [None]:
# === TODO (you code this) ===
# Clean raw text and preview first/last 20 lines; save cleaned to data/interim.
# Acceptance: cleaned length >> 0; obvious headers stripped.

from src.clean import clean_text

# Call clean_text with the downloaded file path.
# Preview the result, then save to data/interim/{book}_cleaned.txt
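
The preview and save parts of this step might look like the sketch below. `head_and_tail` and `save_cleaned` are hypothetical helper names; the output path follows the `data/interim/{book}_cleaned.txt` pattern from the comment above.

```python
from pathlib import Path


def head_and_tail(text: str, n: int = 20) -> tuple:
    """Return (first n lines, last n lines) of text."""
    lines = text.splitlines()
    return lines[:n], lines[-n:]


def save_cleaned(text: str, book: str, out_dir="data/interim") -> Path:
    """Write cleaned text to {out_dir}/{book}_cleaned.txt as UTF-8."""
    if not text.strip():
        raise ValueError("cleaned text is empty")
    out_path = Path(out_dir) / f"{book}_cleaned.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Printing both ends of the text is the quickest way to spot a header or license block that survived cleaning.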


## Quick Visual Check

Display a few cleaned paragraphs to verify the cleaning worked correctly. Document any rules you used (e.g., "removed lines matching 'Project Gutenberg', preserved CHAPTER markers").


In [None]:
# Preview cleaned text: show first 500 characters and a sample paragraph
# This helps verify that front matter was removed and structure is preserved.
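
A small helper can gather what this check needs in one place. This is a sketch: `preview_cleaned` is a hypothetical name, and the heading pattern assumes headings start a line with `CHAPTER` or `BOOK`, which may not match every edition.

```python
import re


def preview_cleaned(text: str, n_chars: int = 500, paragraph_index: int = 1) -> dict:
    """Collect a few quick facts for eyeballing the cleaned text."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if paragraphs:
        sample = paragraphs[min(paragraph_index, len(paragraphs) - 1)]
    else:
        sample = ""
    return {
        "head": text[:n_chars],
        "sample_paragraph": sample,
        # Assumption: headings begin a line with CHAPTER or BOOK.
        "headings": re.findall(r"^(?:CHAPTER|BOOK)\b.*$", text, flags=re.MULTILINE),
        # Any surviving mention suggests front/back matter was not fully stripped.
        "leftover_gutenberg_lines": [
            line for line in text.splitlines() if "project gutenberg" in line.lower()
        ],
    }
```

An empty `leftover_gutenberg_lines` list and a non-empty `headings` list are a reasonable first signal that headers were removed and structure was preserved.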


## Summary

At this point, you should have:

- ✅ Raw text downloaded to `data/raw/{book}.txt`
- ✅ Cleaned text saved to `data/interim/{book}_cleaned.txt`
- ✅ Visual confirmation that cleaning removed headers/footers

**Next notebook**: We'll chunk the cleaned text and generate embeddings.
