A personal watchlist digest. Checks a handful of websites each morning, compares current state to a saved baseline, and emails you only what is new.
This repository is the companion codebase for Chapter 14 of Code We Understand — a complete walk-through of building Tidepool from scratch using the book's seven practices. Each commit on main corresponds to a step in the chapter; reading the commit log top-to-bottom is the chapter's narrative in condensed form.
Every morning at a time you schedule, Tidepool:
- Loads a list of sites from `watchlist.yaml`.
- Fetches each one (respecting `robots.txt`, with a 10-second timeout and no retry).
- Parses the configured CSS selectors to extract candidate items.
- Diffs against the SQLite baseline at `data/tidepool.db` to find items not seen before.
- Emails you the new items. If there is nothing new and no site broke, no email is sent.
When a site breaks — selectors stop matching, the host stops responding, robots disallows — Tidepool flags it explicitly in the digest rather than silently dropping it.
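The `robots.txt` check in the fetch step can be done entirely with the stdlib `urllib.robotparser` the project already uses. A minimal, testable sketch (the function name `can_fetch_url` and the `"tidepool"` user agent are illustrative, not taken from the repo):

```python
from urllib.robotparser import RobotFileParser

def can_fetch_url(robots_txt: str, url: str, user_agent: str = "tidepool") -> bool:
    """Check an already-downloaded robots.txt body against a URL.

    Keeping the parser separate from the download makes the policy
    decision easy to unit-test without touching the network.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that `RobotFileParser` allows everything by default when the rules are empty, so a missing or unreadable `robots.txt` degrades to "fetch allowed" unless the caller decides otherwise.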
- Python 3.11+ (uses `str | None` syntax and timezone-aware `datetime`).
- uv for environment management.
- An SMTP account for sending digests (Gmail works, using an app password).
```sh
git clone <this-repo> tidepool
cd tidepool
uv sync
cp .env.example .env
cp watchlist.example.yaml watchlist.yaml
```
Edit `.env` with your SMTP credentials and both `FROM_EMAIL` / `TO_EMAIL` (often the same address for personal use):
```
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASS=your-app-password
FROM_EMAIL=you@gmail.com
TO_EMAIL=you@gmail.com
SUBJECT_PREFIX=
```
Edit watchlist.yaml with the sites you want to monitor (see Watchlist schema below).
```sh
uv run python -m tidepool --dry-run
```
Fetches and compares as normal, but prints the composed `EmailMessage` to stdout instead of sending, and does not update the baseline. Use this to validate selectors before letting a real run "mark" items as seen.
```sh
uv run python -m tidepool
```
Sends the digest via SMTP and updates the baseline. On the first real run for a new site, baseline items are recorded silently — you do not get an "everything is new" blast.
| Flag | Default | Purpose |
|---|---|---|
| `--watchlist PATH` | `watchlist.yaml` | Use an alternate watchlist file. |
| `--dry-run` | off | Print instead of send; skip baseline updates. |
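The two flags in the table map directly onto a small `argparse` setup. A plausible sketch of the wiring (the actual `__main__.py` may differ in details):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI surface matching the flag table: one path option, one boolean."""
    parser = argparse.ArgumentParser(prog="tidepool")
    parser.add_argument(
        "--watchlist", metavar="PATH", default="watchlist.yaml",
        help="Use an alternate watchlist file.",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Print instead of send; skip baseline updates.",
    )
    return parser
```

With `action="store_true"`, `--dry-run` defaults to `False`, so the send path is the default and the safe path is opt-in per invocation.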
Tidepool is a flat Python package. Each subsystem is a single module, not a subpackage — the whole tool is under 400 lines, so folders-with-__init__.py would be ceremony.
```
tidepool/
    watch.py     # Load and validate watchlist.yaml -> list[Site]
    fetch.py     # Retrieve a URL over HTTP -> FetchResult
    compare.py   # Parse HTML, diff against baseline -> list[NewItem] | SiteBroken
    notify.py    # Compose and send the digest email
    data.py      # SQLite baseline read/write
    __main__.py  # Orchestrator
```
Data flow for one morning's run:
```
__main__.main()
  -> watch.load_watchlist(path)          -> list[Site]
  -> for each Site:
       fetch.fetch_site(site.url)        -> FetchResult
       compare.parse_items(site, fr)     -> list[ParsedItem] | SiteBroken
       data.load_baseline(site.name)     -> Baseline | None
       compare.compare_to_baseline(...)  -> list[NewItem]
       data.save_baseline(...)              (unless --dry-run)
  -> notify.compose_digest(...)          -> EmailMessage | None
  -> notify.send_digest(msg, dry_run)    -> None
```
- `fetch_site(url) -> FetchResult` — the transport layer. Errors become data on `FetchResult.error`, not exceptions. HTTP 4xx/5xx populate `status`; their semantic meaning (transient or permanent "broken") is `compare`'s call, not `fetch`'s.
- `parse_items(site, fetch_result) -> list[ParsedItem] | SiteBroken` — applies CSS selectors via BeautifulSoup. Returns `SiteBroken` on fetch error, status ≥ 400, selector mismatch, or universal title/link extraction failure.
- `compare_to_baseline(site_name, baseline, parsed, fetched_at) -> list[NewItem]` — returns `[]` when `baseline is None` (first-run silent establishment).
- `compose_digest(new_items, broken_sites, sites, run_at) -> EmailMessage | None` — returns `None` when both lists are empty, which is how the orchestrator knows not to send.
- `send_digest(msg, dry_run=False)` — STARTTLS on the configured port. `dry_run=True` just prints `msg.as_string()`.
The orchestrator wraps each site's loop body in try/except Exception. Expected failures (fetch errors, selector misses) are already converted to SiteBroken upstream. The catch-all exists so that one unexpected bug on one site does not kill the morning run — the broken site shows up in the digest with internal error: <ExceptionType>: <msg>, and the other sites keep processing.
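The per-site isolation described above can be sketched in a few lines. `process_site` stands in for the real fetch/parse/compare pipeline, and this `SiteBroken` is a local stand-in for the type in `compare.py`:

```python
from dataclasses import dataclass

@dataclass
class SiteBroken:  # stand-in for the real type in compare.py
    site_name: str
    reason: str

def run_all(sites, process_site):
    """Process each site; convert unexpected exceptions to SiteBroken."""
    new_items, broken = [], []
    for site in sites:
        try:
            new_items.extend(process_site(site))
        except Exception as exc:  # deliberate catch-all: isolate one bad site
            broken.append(SiteBroken(
                site["name"],
                f"internal error: {type(exc).__name__}: {exc}",
            ))
    return new_items, broken
```

One buggy site thus contributes a `SiteBroken` row to the digest while every other site still processes normally.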
| Variable | Required for… | Notes |
|---|---|---|
| `SMTP_HOST` | real send | e.g. `smtp.gmail.com` |
| `SMTP_PORT` | real send | STARTTLS-friendly, typically 587 |
| `SMTP_USER` | real send | SMTP auth username |
| `SMTP_PASS` | real send | SMTP auth password (app password for Gmail) |
| `FROM_EMAIL` | always | Sent as the `From:` header; can differ from `SMTP_USER` for Gmail send-as aliases |
| `TO_EMAIL` | always | Recipient |
| `SUBJECT_PREFIX` | optional | Inbox tag, e.g. `[tidepool]` |
For a dry run, only FROM_EMAIL and TO_EMAIL need to be set — SMTP credentials are not read.
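That conditional requirement is easy to express as a small validation helper. A sketch only; the helper name and error wording are ours, and the real code may structure this differently:

```python
import os

def smtp_settings(dry_run: bool) -> dict[str, str]:
    """Collect required env vars; SMTP credentials only matter for a real send."""
    required = ["FROM_EMAIL", "TO_EMAIL"]
    if not dry_run:
        required += ["SMTP_HOST", "SMTP_PORT", "SMTP_USER", "SMTP_PASS"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```

Failing up front with the full list of missing names is friendlier than dying mid-send on the first absent credential.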
```yaml
sites:
  - name: Hacker News
    url: https://news.ycombinator.com/
    item_selector: "tr.athing"
    title_selector: ".titleline > a"
    link_selector: ".titleline > a"
    category: tech # optional
  - name: Simon Willison
    url: https://simonwillison.net/
    item_selector: "div.entry"
    title_selector: "h3 a"
    link_selector: "h3 a"
    category: blogs
```
Rules:
- `name` must be unique across the file — it is the identity key for baselines. Renaming a site is equivalent to starting over.
- The five fields `name`, `url`, `item_selector`, `title_selector`, `link_selector` are all required. `load_watchlist` raises `ValueError` if any are missing.
- `category` is optional. If any site has a `category`, the digest groups items by category; otherwise it lists them flat by `site_name`.
- Selectors are standard CSS selectors, parsed by BeautifulSoup's `html.parser`.
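The required-fields and unique-name rules can be sketched as a validator over the already-parsed YAML. The function name `validate_sites` is illustrative; the repo's `load_watchlist` also handles file loading:

```python
REQUIRED = ("name", "url", "item_selector", "title_selector", "link_selector")

def validate_sites(sites: list[dict]) -> list[dict]:
    """Enforce the watchlist rules: all five fields present, names unique."""
    seen: set[str] = set()
    for site in sites:
        missing = [f for f in REQUIRED if f not in site]
        if missing:
            raise ValueError(f"site {site.get('name', '?')!r} missing fields: {missing}")
        if site["name"] in seen:
            raise ValueError(f"duplicate site name: {site['name']!r}")
        seen.add(site["name"])
    return sites
```

Raising `ValueError` at load time means a malformed watchlist fails the whole run loudly, consistent with the "fail loud, not silent" principle below.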
```
tidepool/
├── CLAUDE.md              # Project context for the agent (the book's seed document)
├── NOTES.md               # Session-by-session working notes with reasoning
├── README.md              # This file
├── pyproject.toml         # uv project, 4 runtime deps, Python 3.11+
├── uv.lock                # Resolved dependency tree
├── .env.example           # Template — copy to .env
├── .gitignore
├── watchlist.example.yaml # Template — copy to watchlist.yaml
├── data/                  # Runtime SQLite database (gitignored)
└── tidepool/              # The package
    ├── __init__.py
    ├── __main__.py
    ├── watch.py
    ├── fetch.py
    ├── compare.py
    ├── notify.py
    └── data.py
```
`data/`, `.env`, and `watchlist.yaml` are gitignored — they hold runtime state and secrets. `.env.example` and `watchlist.example.yaml` are committed as templates.
Lifted from CLAUDE.md and enforced throughout the code:
- Fail loud, not silent. Broken sites surface in the digest with a reason. Transport errors become `FetchResult.error` data. Unexpected exceptions are caught per-site and rendered as `SiteBroken`. Selector mismatches are a reported failure, not an ignored row.
- Email only when there is something to report. If `new_items` and `broken_sites` are both empty, `compose_digest` logs to stderr and returns `None` — no empty-inbox noise.
- Keep the dependency list short. Four runtime deps (`httpx`, `beautifulsoup4`, `pyyaml`, `python-dotenv`). Everything else is stdlib: `smtplib`, `sqlite3`, `urllib.robotparser`, `argparse`, `email`, `dataclasses`.
- Local-only operation. No cloud services beyond the outbound SMTP connection. SQLite on local disk. Designed to run as a scheduled local job.
Tidepool itself has no scheduler — that's an OS concern. Two common options:
Create `~/Library/LaunchAgents/local.tidepool.plist` with a `StartCalendarInterval` for your chosen hour, pointing at `uv run python -m tidepool` inside the repo. Load with `launchctl load`.
```sh
0 7 * * * cd /path/to/tidepool && /usr/local/bin/uv run python -m tidepool 2>> ~/.tidepool.log
```
On a normal real run, Tidepool is silent on stdout and only writes to stderr for warnings. Redirect stderr if you want to keep the warning history.
Each commit on main is a moment in the chapter. Reading the log is the fastest way to see how Tidepool was built.
```
initial scaffold: project structure, empty subsystems, pyproject
fetch: implement fetch_site with FetchResult dataclass, robots.txt warning, no retry
compare: implement compare_to_baseline with first-run baseline establishment and site-broken detection
notify: implement send_digest with dry-run flag, smart subject line, empty-case logging
main: implement orchestrator with per-site isolation and dry-run flag
```
The chapter applies the book's seven practices in sequence:
- Load context deliberately — the `CLAUDE.md` in the repo root. Every session starts by reading it.
- Plan before code — each subsystem was planned with ambiguities surfaced before implementation. See the "Decisions" sections in `NOTES.md`.
- Probe real data — fetch was first exercised against live sites (`example.com`, an httpbin redirect, a bogus host) before anything depended on its output.
- Adjudicate with reasoning — every nontrivial decision in `NOTES.md` has a Why: line, not just the chosen option.
- Verify end-to-end — after each subsystem landed, it was run against real input and the output shown.
- Commit in small increments — one subsystem per commit, in the `subsystem: what changed` format.
- Stop when the work is done — sessions end at a clean resting point.
NOTES.md is the per-session trail: what was done, what changed, what was decided and why, what was deferred, and what comes next. Treat it as the companion read while flipping through Chapter 14.
Writing good selectors is the only craft you add. The loop:
- Open the site in a browser, view source (or DevTools).
- Find the repeating element that wraps each post or story — that's your `item_selector`.
- Inside that element, find the selector for the title text and the `<a>` that holds the URL. Often the same selector serves both `title_selector` and `link_selector`.
- Add the entry to `watchlist.yaml`.
- Run `uv run python -m tidepool --dry-run` and read the composed output.
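Before committing a selector trio to `watchlist.yaml`, you can probe it interactively with plain BeautifulSoup. This helper is ours, not the project's, but it mirrors the extraction the chapter describes:

```python
from bs4 import BeautifulSoup

def probe(html: str, item_selector: str,
          title_selector: str, link_selector: str) -> list[tuple[str, str]]:
    """Return the (title, href) pairs the three selectors would extract."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for el in soup.select(item_selector):
        title = el.select_one(title_selector)
        link = el.select_one(link_selector)
        # Skip items where either piece fails to resolve, as the real
        # parser treats missing titles/links as a breakage signal.
        if title is not None and link is not None and link.get("href"):
            items.append((title.get_text(strip=True), link["href"]))
    return items
```

An empty result for markup you can see in DevTools usually means the `item_selector` is wrong; results with the wrong text mean the inner selectors need tightening.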
Diagnosing a broken entry:
- `"selector matched no items: '…'"` → your `item_selector` is wrong.
- `"every item missing title or link (…)"` → `item_selector` matches, but `title_selector`/`link_selector` inside those elements do not resolve to both text and href.
- `"HTTP 403"` or `"blocked by robots.txt"` → the site does not want to be fetched. Drop it from the watchlist or keep it for the "still blocked" daily signal.
- The first digest was empty. First run per site establishes the baseline silently. You will see items the next time that site updates.
- Gmail auth fails. Use an app password, not your normal password. Enable 2-Step Verification first, then generate an app password at Google Account → Security → App Passwords.
- A site shows up broken every day. Run `--dry-run` and read the `reason`. Common causes: selector drift after a redesign, login-wall, Cloudflare challenge, or an intentional `robots.txt` disallow (which we respect).
- `Could not fetch robots.txt for …` on stderr, twice. When a host is unreachable, the robots check and the main fetch each fail on the same DNS error. Cosmetic, not a bug.
- `decode failed: <encoding>`. The site's declared encoding doesn't match its bytes. Rare; file an issue with the watchlist entry and we can look.
The shape is designed to make common extensions surgical:
- Another output channel (Slack, SMS, iMessage, Discord webhook): add a parallel function in `notify.py` with the same `compose_*`/`send_*` split. The orchestrator stays unchanged.
- Per-site rate-limiting: `fetch_site` is intentionally stateless. A caller-side semaphore or `time.sleep` in the orchestrator is the right place.
- Per-site parsing hooks: `parse_items` is the natural extension point. Switch on `site.name`, or add a `parser` field to `Site` and dispatch.
- JSON feed support: `fetch` stays the same; add a `parse_items` variant that handles `application/json` content.
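As one example, the caller-side rate-limiting extension could look like this. `fetch_site` stays stateless; the orchestrator spaces out the calls. Names and the injectable `sleep` parameter (handy for testing) are illustrative:

```python
import time

def fetch_all(sites, fetch_site, delay_seconds: float = 1.0, sleep=time.sleep):
    """Fetch each site with a politeness gap between requests."""
    results = []
    for i, site in enumerate(sites):
        if i > 0:
            sleep(delay_seconds)  # pause between hosts, not before the first
        results.append(fetch_site(site))
    return results
```

Keeping the delay in the orchestrator rather than inside `fetch_site` preserves the fetch layer's "one URL in, one `FetchResult` out" contract.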
Keep the dependency list short. Every new package requires justification.
The book is the narrative; the repo is the artifact. They answer different questions:
- Why does the code look like this? — the book.
- What exactly did the final code end up being? — the repo.
- What was the reasoning at each decision? — `NOTES.md` and the commit messages.
Reading them in parallel — one screen with the book, another with `git log -p` and `NOTES.md` — is the intended experience.