# Notebook 01: Deterministic Ingestion & Cleaning

## Goal

Download public-domain text from Project Gutenberg and clean it for downstream processing.

We'll remove front/back matter, normalize whitespace, and preserve structural markers (chapters, paragraphs).


## Why Project Gutenberg?

Project Gutenberg provides stable, public-domain texts in plain UTF-8 format. For this project, we'll use:

- **The Iliad** (Homer, translated)
- **The Picture of Dorian Gray** (Oscar Wilde)

These texts are well-structured and suitable for demonstrating RAG with verbatim citations.


## Text Variability & Cleaning Challenges

Project Gutenberg texts include:

1. **Front matter**: Title page, copyright notices, table of contents
2. **Back matter**: End-of-book notices, legal text
3. **Encoding issues**: Mixed line endings and special characters in some files
4. **Structural markers**: Chapter headings, line numbers (in poetry), paragraph breaks

Our cleaning strategy:

- Strip known separators (e.g., lines beginning "*** START OF THE PROJECT GUTENBERG EBOOK" or, in older files, "*** START OF THIS PROJECT GUTENBERG EBOOK")
- Normalize whitespace (collapse multiple spaces, preserve paragraph breaks)
- Keep chapter markers for citation purposes
- Ensure UTF-8 encoding throughout
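
The strategy above can be sketched roughly as follows. This is a minimal illustration, not the project's `clean_text` implementation: the helper name `strip_and_normalize` is hypothetical, and the marker patterns are assumptions about how the separator lines usually look (real files vary in wording and trailing title text).

```python
import re

# Assumed marker shapes; real files vary ("THE" vs "THIS", trailing title text).
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_and_normalize(raw: str) -> str:
    """Hypothetical helper: drop Gutenberg front/back matter and tidy whitespace."""
    # Unify Windows and old-Mac line endings.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Keep only what follows the START marker, if one is found.
    start = START_RE.search(text)
    if start:
        text = text[start.end():]

    # Keep only what precedes the END marker, if one is found.
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]

    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r" +\n", "\n", text)       # drop trailing spaces on each line
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()
```

Chapter headings such as `CHAPTER I.` pass through untouched because the sketch only removes text outside the markers and never filters lines inside the body.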


## Risks & Considerations

- **Different editions**: Project Gutenberg may host multiple editions of the same work; pin one by using a fixed URL
- **Re-run behavior**: Skip download if file already exists (idempotent)
- **Reproducibility**: Document the exact URL and date of download
- **File size**: Full texts can be large; ensure we're working with cleaned, manageable chunks downstream
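
One way to address the re-run and reproducibility points together is sketched below. It is an illustration under assumptions, not the project's `download_book`: the name `download_once`, the injectable `fetch` parameter, and the `.meta.json` sidecar file are all hypothetical choices made for this example.

```python
import json
from datetime import date
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Default fetcher: plain HTTP GET via the standard library."""
    with urlopen(url) as response:
        return response.read()


def download_once(url: str, dest: Path, fetch=fetch_bytes) -> Path:
    """Hypothetical helper: download only if dest is missing, and log provenance."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # idempotent: nothing to do on a re-run

    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))

    # Sidecar file recording where and when the text came from.
    meta = {"url": url, "downloaded_on": date.today().isoformat()}
    meta_path = dest.with_suffix(".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return dest
```

Passing the fetcher as an argument keeps the function testable offline: a stub that returns fixed bytes can stand in for the network call.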


## Step 1: Load Configuration

We'll read `configs/app.yaml` to get the book choice and output directories.


In [37]:
# === TODO (you code this) ===
# Load configs/app.yaml; expose book choice and output dirs.
# Acceptance: dict with keys: book, chunk_size, chunk_overlap, embedding_model
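import yaml

def load_config(path="../configs/app.yaml"):
    """Read the YAML config file and return its contents as a dict."""
    # Explicit encoding keeps behavior the same across platforms.
    with open(path, 'r', encoding='utf-8') as f:
        config = yaml.safe_load(f)
    return config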
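
The acceptance criterion lists four expected keys. A small check along these lines could catch a missing key before later cells fail. The helper `validate_config` is hypothetical and is not part of the notebook or `src`.

```python
# Keys taken from the acceptance criterion in the cell above.
REQUIRED_KEYS = {"book", "chunk_size", "chunk_overlap", "embedding_model"}


def validate_config(config: dict) -> dict:
    """Hypothetical check: raise early if an expected key is missing."""
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise KeyError(f"Missing config keys: {sorted(missing)}")
    return config
```

It could be called as `config = validate_config(load_config())`.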


## Step 2: Download Book

Download the requested book from Project Gutenberg into `data/raw/`.


In [38]:
# === TODO (you code this) ===
# Download book into data/raw/ and return path.
# Acceptance: file exists; non-empty; UTF-8.

import sys
from pathlib import Path
import importlib

# Make the project root importable so `src` resolves from the notebook directory.
sys.path.append(str(Path("..").resolve()))
from src import ingest
importlib.reload(ingest)  # Reload to pick up edits to src/ingest.py
from src.ingest import download_book


# Implement download_book in src first.
# Then call it here with the book name from config.
config = load_config()

# Use the book-specific link from config instead of just the book name
book_name = config['book']
if book_name == 'iliad':
    book_url = config.get('iliad_link')
elif book_name == 'dorian':
    book_url = config.get('dorian_gray_link')
else:
    raise ValueError(f"Unknown book: {book_name}")

# Download the book (pass URL from config if available).
# To switch between 'iliad' and 'dorian', change `book` in configs/app.yaml.
raw_file_path = download_book(book_name, '../data/raw', url=book_url)
print(f"Downloaded book to: {raw_file_path}")


File already exists: /Users/franciscoteixeirabarbosa/Dropbox/books_rag_project/classics-rag-qa/data/raw/dorian.txt
Downloaded book to: /Users/franciscoteixeirabarbosa/Dropbox/books_rag_project/classics-rag-qa/data/raw/dorian.txt


## Step 3: Clean Raw Text

Remove front/back matter, normalize whitespace, and save cleaned text to `data/interim/`.


In [39]:
# === TODO (you code this) ===
# Clean raw text and preview first/last 20 lines; save cleaned to data/interim.
# Acceptance: cleaned length >> 0; obvious headers stripped.

import sys
from pathlib import Path
import importlib

sys.path.append(str(Path("..").resolve()))
from src import clean
importlib.reload(clean)  # Reload to get latest changes
from src.clean import clean_text

# Call clean_text with the downloaded file path.
cleaned_text = clean_text(raw_file_path)
print(f"Cleaned text length: {len(cleaned_text)} characters")

# Preview first/last 20 lines (split by newlines)
cleaned_lines = cleaned_text.split('\n')
print(f"\nTotal lines: {len(cleaned_lines)}")
print("\nFirst 20 lines of cleaned text:")
print('\n'.join(cleaned_lines[:20]))
print("\n" + "="*60)
print("Last 20 lines of cleaned text:")
print('\n'.join(cleaned_lines[-20:]))

# Save cleaned text to data/interim/{book}_cleaned.txt
config = load_config()
book_name = config['book']
output_path = Path('../data/interim') / f"{book_name}_cleaned.txt"
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text(cleaned_text, encoding='utf-8')
print(f"\n✅ Saved cleaned text to: {output_path}")



Cleaned text length: 429167 characters

Total lines: 8432

First 20 lines of cleaned text:
The Picture of Dorian Gray

by Oscar Wilde

Contents

THE PREFACE
CHAPTER I.
CHAPTER II.
CHAPTER III.
CHAPTER IV.
CHAPTER V.
CHAPTER VI.
CHAPTER VII.
CHAPTER VIII.
CHAPTER IX.
CHAPTER X.
CHAPTER XI.
CHAPTER XII.
CHAPTER XIII.

Last 20 lines of cleaned text:
them was Sir Henry Ashton’s uncle.

Inside, in the servants’ part of the house, the half-clad domestics
were talking in low whispers to each other. Old Mrs. Leaf was crying
and wringing her hands. Francis was as pale as death.

After about a quarter of an hour, he got the coachman and one of the
footmen and crept upstairs. They knocked, but there was no reply. They
called out. Everything was still. Finally, after vainly trying to force
the door, they got on the roof and dropped down on to the balcony. The
windows yielded easily—their bolts were old.

When they entered, they found hanging upon the wall a splendid portrait
of their master as the

## Quick Visual Check

Display a few cleaned paragraphs to verify the cleaning worked correctly. Document any rules you used (e.g., "removed lines matching 'Project Gutenberg', preserved CHAPTER markers").


In [40]:
# Preview cleaned text: show first 500 characters and a sample paragraph
# This helps verify that front matter was removed and structure is preserved.

first_500 = cleaned_text[:500]
print("First 500 characters of cleaned text:")
print(first_500)

# Sample slice: characters 1000-1500 (a fixed window, so it may start or end mid-word)
paragraph_sample = cleaned_text[1000:1500]
print("\nSample paragraph:")
print(paragraph_sample)

First 500 characters of cleaned text:
The Picture of Dorian Gray

by Oscar Wilde

Contents

THE PREFACE
CHAPTER I.
CHAPTER II.
CHAPTER III.
CHAPTER IV.
CHAPTER V.
CHAPTER VI.
CHAPTER VII.
CHAPTER VIII.
CHAPTER IX.
CHAPTER X.
CHAPTER XI.
CHAPTER XII.
CHAPTER XIII.
CHAPTER XIV.
CHAPTER XV.
CHAPTER XVI.
CHAPTER XVII.
CHAPTER XVIII.
CHAPTER XIX.
CHAPTER XX.

THE PREFACE

The artist is the creator of beautiful things. To reveal art and
conceal the artist is art’s aim. The critic is he who can translate
into another manner or a new materi

Sample paragraph:
h century dislike of realism is the rage of Caliban seeing
his own face in a glass.

The nineteenth century dislike of romanticism is the rage of Caliban
not seeing his own face in a glass. The moral life of man forms part of
the subject-matter of the artist, but the morality of art consists in
the perfect use of an imperfect medium. No artist desires to prove
anything. Even things that are true can be proved. No artist has
ethical sympat
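
The visual check can be complemented with a small programmatic one. The sketch below is hypothetical: the function name `check_cleaned` is invented here, and the heading pattern is an assumption based on the `CHAPTER I.` style visible in the preview. Because the table of contents survives cleaning in the output above, each heading may be counted twice.

```python
import re


def check_cleaned(text: str) -> dict:
    """Hypothetical sanity check on cleaned text."""
    lines = text.split("\n")
    # Any surviving mention of Project Gutenberg suggests leftover boilerplate.
    leftover = [ln for ln in lines if "project gutenberg" in ln.lower()]
    # Headings assumed to look like "CHAPTER XII." on a line of their own.
    chapters = [
        ln for ln in lines if re.fullmatch(r"CHAPTER [IVXLC]+\.", ln.strip())
    ]
    return {
        "leftover_boilerplate": len(leftover),
        "chapter_headings": len(chapters),
    }
```

Running `check_cleaned(cleaned_text)` would return the two counts for the current book; a non-zero `leftover_boilerplate` is a prompt to revisit the cleaning rules.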

## Summary

At this point, you should have:

- ✅ Raw text downloaded to `data/raw/{book}.txt`
- ✅ Cleaned text saved to `data/interim/{book}_cleaned.txt`
- ✅ Visual confirmation that cleaning removed the Project Gutenberg header/footer boilerplate

**Next notebook**: We'll chunk the cleaned text and generate embeddings.
