# Lab: Brochure Generator Playground

A sandbox to try the brochure pipeline (scraper → cleaner → optional Gemini) without touching the main app. Follow the sections below to set up, scrape, clean, and optionally generate a brochure.

Outline covered here:
- Setup project and kernel
- Load and inspect sample data
- Define helper functions (using existing services)
- Run the main flow with toggles
- Add quick tests
- Validate outputs


## 1. Setup Project and Install Dependencies

Activate the project venv, install the dependencies from `requirements.txt`, then select that environment as the notebook kernel. The cell below adds the repo root to `sys.path` and loads `.env` so the notebook sees the same settings as the app.


In [None]:
# Kernel & env setup
import os
import sys
from pathlib import Path

from dotenv import load_dotenv

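# Add the repo root to sys.path so project imports resolve.
# __file__ is not defined inside a notebook kernel, so derive the root from the
# working directory (assumes the notebook runs from a folder one level below the repo root).
PROJECT_ROOT = Path.cwd().resolve().parent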
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

load_dotenv(PROJECT_ROOT / ".env")

from config.settings import settings  # type: ignore

# Toggle the target and whether to hit the LLM step
target_url = "https://example.com"
RUN_LLM = False  # set True only if APP_GOOGLE_API_KEY is configured

print(f"PROJECT_ROOT: {PROJECT_ROOT}")
print(f"App: {settings.app_name} | Env: {settings.app_env}")
print(f"Google API key set: {bool(settings.google_api_key)}")
print(f"LLM enabled in this run: {RUN_LLM}")
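
If the notebook is launched from a different working directory, a fixed parent offset can point at the wrong folder. The sketch below shows one alternative: walk upward until a marker file is found. The helper name `find_project_root` and the choice of `requirements.txt` as the marker are assumptions for illustration, not part of the app.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "requirements.txt") -> Optional[Path]:
    """Walk upward from `start`; return the first directory that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return None  # marker not found anywhere above `start`


# Hypothetical usage: detect the root from wherever the kernel was started.
detected_root = find_project_root(Path.cwd())
print(f"Detected project root: {detected_root}")
```

When the helper returns `None`, it is probably safer to stop and set the path by hand than to guess.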


## 2. Load and Inspect Dataset

A tiny in-memory HTML sample serves as the fallback dataset for offline runs. When `target_url` is reachable, the live pipeline scrapes it instead.


In [None]:
# In-memory fallback HTML (used if live scraping fails)
fallback_main_html = """
<html>
  <head><title>Example Co</title></head>
  <body>
    <header><h1>Welcome to Example Co</h1></header>
    <section>
      <h2>About</h2>
      <p>Example Co builds reliable web solutions for small businesses.</p>
    </section>
    <section>
      <h2>Services</h2>
      <ul>
        <li>Web development</li>
        <li>Content strategy</li>
        <li>Marketing automation</li>
      </ul>
    </section>
    <footer>Contact: hello@example.com</footer>
  </body>
</html>
""".strip()

fallback_related_pages = [
    {
        "url": "https://example.com/contact",
        "html": """
        <html><body><h1>Contact Us</h1><p>Email: hello@example.com</p><p>Phone: 555-123-4567</p></body></html>
        """.strip(),
    },
]

print(f"Fallback main HTML length: {len(fallback_main_html)}")
print(f"Fallback related pages: {len(fallback_related_pages)}")
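
Length counts alone say little about what the sample contains. As a rough inspection aid, the sketch below lists the title and headings of an HTML string using only the standard library. `HeadingCollector` and `sample_html` are illustrative names, and the project's cleaner may parse HTML quite differently.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text found inside <title>, <h1>, and <h2> elements."""

    TRACKED = {"title", "h1", "h2"}

    def __init__(self) -> None:
        super().__init__()
        self._current = None  # tag currently being read, if tracked
        self.found = []       # list of (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.found.append((self._current, text))


# A shortened stand-in for the fallback HTML defined in the cell above.
sample_html = (
    "<html><head><title>Example Co</title></head>"
    "<body><h1>Welcome to Example Co</h1><h2>About</h2><p>ignored</p></body></html>"
)

collector = HeadingCollector()
collector.feed(sample_html)
for tag, text in collector.found:
    print(f"{tag}: {text}")
```

In the notebook, passing `fallback_main_html` to `collector.feed` in place of `sample_html` should give the same kind of summary for the full sample.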


## 3. Define Core Functions and Classes

We reuse existing services (scraper, cleaner, summarizer) and wrap them with small helpers to make the notebook resilient (graceful fallbacks, optional LLM).


In [None]:
import logging
from typing import Optional

from app.services.brochure_generator.scraper import (
    filter_related_links,
    scrape_main_page,
    scrape_related_pages,
)
from app.services.brochure_generator.content_cleaner import combine_and_clean
from app.services.brochure_generator.llm_summarizer import generate_brochure

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("lab")


def scrape_all(url: str) -> tuple[str, list[str], list[dict[str, str]]]:
    """Scrape main page + filtered related pages. Falls back to the in-memory sample on errors."""
    try:
        main_html, links = scrape_main_page(url)
        kept_links = filter_related_links(url, links)
        related = scrape_related_pages(kept_links)
        return main_html, kept_links, related
    except Exception as exc:  # noqa: BLE001
        logger.warning("Scrape failed (%s); using fallback data.", exc)
        return fallback_main_html, [], fallback_related_pages


def clean_content(main_html: str, related_pages: list[dict[str, str]]) -> str:
    """Run the cleaner to produce a single text blob."""
    return combine_and_clean(main_html, related_pages)


def maybe_generate_brochure(cleaned_text: str, *, run_llm: bool) -> Optional[str]:
    """Call the LLM only if requested and an API key is configured; return None if skipped or on failure."""
    has_key = bool(settings.google_api_key)
    if not run_llm or not has_key:
        logger.info("LLM step skipped (run_llm=%s, key=%s)", run_llm, has_key)
        return None
    try:
        return generate_brochure(cleaned_text)
    except Exception as exc:  # noqa: BLE001
        logger.warning("LLM generation failed: %s", exc)
        return None
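
The implementation of `filter_related_links` is not shown in this notebook. To illustrate what a link filter of this kind might do, here is a minimal sketch that keeps only same-host HTTP(S) links and drops duplicates. The function `same_host_links` is hypothetical, and the real service may apply different rules.

```python
from urllib.parse import urljoin, urlparse


def same_host_links(base_url: str, links: list[str]) -> list[str]:
    """Resolve links against base_url; keep unique same-host http(s) URLs."""
    base_host = urlparse(base_url).netloc
    kept: list[str] = []
    seen: set[str] = set()
    for link in links:
        absolute = urljoin(base_url, link)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, tel:, javascript:, and similar
        if parsed.netloc != base_host:
            continue  # skip external sites
        normalized = absolute.split("#", 1)[0]  # ignore in-page anchors
        if normalized.rstrip("/") == base_url.rstrip("/"):
            continue  # skip links back to the main page itself
        if normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept


candidates = [
    "/contact",
    "https://other.org/x",
    "mailto:hello@example.com",
    "/contact#form",
    "/about",
]
print(same_host_links("https://example.com", candidates))
```

Filtering before scraping keeps the number of follow-up requests small, which matters in a sandbox that may be re-run often.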


## 4. Implement Main Execution Flow

This ties everything together: scrape → filter → scrape related → clean → optional LLM.


In [None]:
from textwrap import shorten

# Run the pipeline (scrape → clean → optional LLM)
main_html, kept_links, related_pages = scrape_all(target_url)
print(f"Main HTML length: {len(main_html)}")
print(f"Related links kept: {len(kept_links)} -> {kept_links[:5]}")
print(f"Related pages scraped: {len(related_pages)}")

cleaned = clean_content(main_html, related_pages)
print(f"Cleaned text length: {len(cleaned)}")
print("Preview:\n", shorten(cleaned.replace("\n", " "), width=400, placeholder=" ..."))

brochure_md = maybe_generate_brochure(cleaned, run_llm=RUN_LLM)
if brochure_md:
    print("\n---\nBrochure preview (first 800 chars):\n")
    print(brochure_md[:800])
else:
    print("\nLLM step skipped or failed (see logs above).")
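
The preview above relies on `textwrap.shorten`, which collapses runs of whitespace (newlines included) before truncating at a word boundary. The short sketch below demonstrates that behavior on a made-up string. `raw` is illustrative data, not pipeline output.

```python
from textwrap import shorten

raw = "Line one\n\n   line two\tline three"

# Wide enough to fit everything: whitespace is collapsed, nothing is cut.
full = shorten(raw, width=80, placeholder=" ...")
print(repr(full))

# Too narrow: text is cut at a word boundary and the placeholder is appended.
clipped = shorten(raw, width=20, placeholder=" ...")
print(repr(clipped))
```

Because `shorten` already normalizes whitespace, replacing newlines beforehand is harmless but not strictly required.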


## 5. Add Unit Tests

Quick pytest-style checks to ensure the cleaner works on the fallback HTML and that the LLM step is skipped when `run_llm=False`.


In [None]:
def test_cleaner_handles_basic_html():
    cleaned = clean_content(fallback_main_html, fallback_related_pages)
    assert "Example Co" in cleaned
    assert "Contact" in cleaned


def test_llm_guard_flag():
    brochure = maybe_generate_brochure("short text", run_llm=False)
    assert brochure is None


# Run the quick tests in-notebook
if __name__ == "__main__":
    test_cleaner_handles_basic_html()
    test_llm_guard_flag()
    print("Tests passed.")
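
The fallback path in `scrape_all` is harder to test directly because it calls the real scraper. One option is to pass the scraper in as an argument so a test can substitute a failing stub. The sketch below is a self-contained analogue of that idea. `scrape_with_fallback` and `failing_scraper` are hypothetical names and do not exist in the app.

```python
import logging
from typing import Callable

sketch_logger = logging.getLogger("lab.sketch")


def scrape_with_fallback(url: str, scraper: Callable[[str], str], fallback_html: str) -> str:
    """Return scraper(url), or fallback_html if the scraper raises."""
    try:
        return scraper(url)
    except Exception as exc:  # noqa: BLE001 - mirrors the notebook's broad catch
        sketch_logger.warning("Scrape failed (%s); using fallback data.", exc)
        return fallback_html


def failing_scraper(url: str) -> str:
    """Stub that simulates an unreachable site."""
    raise ConnectionError(f"cannot reach {url}")


def test_fallback_is_used_when_scrape_fails():
    html = scrape_with_fallback("https://example.com", failing_scraper, "<h1>Fallback</h1>")
    assert html == "<h1>Fallback</h1>"


def test_live_result_is_used_when_scrape_succeeds():
    html = scrape_with_fallback("https://example.com", lambda url: "<h1>Live</h1>", "<h1>Fallback</h1>")
    assert html == "<h1>Live</h1>"


test_fallback_is_used_when_scrape_fails()
test_live_result_is_used_when_scrape_succeeds()
print("Fallback tests passed.")
```

No network access is needed, so these checks can run offline alongside the two tests above.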


## 6. Run and Validate Outputs

Execute the main flow and inspect the previews. This cell reuses `cleaned` and `brochure_md` from Section 4, so run that cell first. To exercise Gemini, set `APP_GOOGLE_API_KEY` in `.env` and switch `RUN_LLM` to `True`.


In [None]:
# Simple validation snapshot
print(f"Cleaned text chars: {len(cleaned)}")
print("Cleaned excerpt:\n", cleaned[:600])

if brochure_md:
    print("\nBrochure excerpt:\n", brochure_md[:600])
else:
    print("Brochure not generated in this run.")
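
Printed excerpts need a human to read them. For a repeatable check, the sketch below turns a few sanity rules into a function that returns a list of problems. `validate_outputs`, its rules, and the `min_chars` threshold are assumptions chosen for illustration, not project requirements.

```python
import re
from typing import Optional


def validate_outputs(cleaned: str, brochure: Optional[str], *, min_chars: int = 50) -> list[str]:
    """Return human-readable problems; an empty list means the snapshot looks sane."""
    problems: list[str] = []
    if len(cleaned.strip()) < min_chars:
        problems.append(f"cleaned text shorter than {min_chars} chars")
    if re.search(r"<[a-zA-Z/][^>]*>", cleaned):
        problems.append("cleaned text may still contain HTML tags")
    if brochure is not None and not brochure.strip():
        problems.append("brochure was generated but is empty")
    return problems


# Illustrative inputs; in the notebook, pass `cleaned` and `brochure_md` instead.
good_text = (
    "Example Co builds reliable web solutions for small businesses. "
    "Contact: hello@example.com"
)
print(validate_outputs(good_text, None))
print(validate_outputs("<p>hi</p>", ""))
```

A brochure of `None` is treated as acceptable here because the LLM step is optional in this lab.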
