
Tidepool

A personal watchlist digest. Checks a handful of websites each morning, compares current state to a saved baseline, and emails you only what is new.

This repository is the companion codebase for Chapter 14 of Code We Understand — a complete walk-through of building Tidepool from scratch using the book's seven practices. Each commit on main corresponds to a step in the chapter; reading the commit log top-to-bottom is the chapter's narrative in condensed form.


What it does

Every morning at a time you schedule, Tidepool:

  1. Loads a list of sites from watchlist.yaml.
  2. Fetches each one (respecting robots.txt, with a 10-second timeout and no retry).
  3. Parses the configured CSS selectors to extract candidate items.
  4. Diffs against the SQLite baseline at data/tidepool.db to find items not seen before.
  5. Emails you the new items. If there is nothing new and no site broke, no email is sent.

When a site breaks — selectors stop matching, the host stops responding, robots.txt disallows the fetch — Tidepool flags it explicitly in the digest rather than silently dropping it.
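The diff in step 4 is conceptually a set difference on item identity. A minimal sketch — the field names and the use of the link as the identity key are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    title: str
    link: str

def new_since(baseline_links: set[str], parsed: list[ParsedItem]) -> list[ParsedItem]:
    # Anything whose link is not in the saved baseline counts as new.
    return [item for item in parsed if item.link not in baseline_links]

baseline = {"https://example.com/a"}
parsed = [ParsedItem("A", "https://example.com/a"),
          ParsedItem("B", "https://example.com/b")]
print([i.title for i in new_since(baseline, parsed)])  # → ['B']
```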

Requirements

  • Python 3.11+ (uses str | None syntax and timezone-aware datetime).
  • uv for environment management.
  • An SMTP account for sending digests (Gmail works, using an app password).

Setup

git clone <this-repo> tidepool
cd tidepool
uv sync
cp .env.example .env
cp watchlist.example.yaml watchlist.yaml

Edit .env with your SMTP credentials and both FROM_EMAIL / TO_EMAIL (often the same address for personal use):

SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASS=your-app-password
FROM_EMAIL=you@gmail.com
TO_EMAIL=you@gmail.com
SUBJECT_PREFIX=

Edit watchlist.yaml with the sites you want to monitor (see Watchlist schema below).

Usage

Dry run (no email sent, baseline not updated)

uv run python -m tidepool --dry-run

Fetches and compares as normal, but prints the composed EmailMessage to stdout instead of sending, and does not update the baseline. Use this to validate selectors before letting a real run "mark" items as seen.

Real run

uv run python -m tidepool

Sends the digest via SMTP and updates the baseline. On the first real run for a new site, baseline items are recorded silently — you do not get an "everything is new" blast.

Flags

Flag              Default         Purpose
--watchlist PATH  watchlist.yaml  Use an alternate watchlist file.
--dry-run         off             Print instead of send; skip baseline updates.

Architecture

Tidepool is a flat Python package. Each subsystem is a single module, not a subpackage — the whole tool is under 400 lines, so folders-with-__init__.py would be ceremony.

tidepool/
  watch.py     # Load and validate watchlist.yaml -> list[Site]
  fetch.py     # Retrieve a URL over HTTP -> FetchResult
  compare.py   # Parse HTML, diff against baseline -> list[NewItem] | SiteBroken
  notify.py    # Compose and send the digest email
  data.py      # SQLite baseline read/write
  __main__.py  # Orchestrator

Data flow for one morning's run:

__main__.main()
  -> watch.load_watchlist(path)               list[Site]
  -> for each Site:
       fetch.fetch_site(site.url)             FetchResult
       compare.parse_items(site, fr)          list[ParsedItem] | SiteBroken
       data.load_baseline(site.name)          Baseline | None
       compare.compare_to_baseline(...)       list[NewItem]
       data.save_baseline(...)                (unless --dry-run)
  -> notify.compose_digest(...)               EmailMessage | None
  -> notify.send_digest(msg, dry_run)         None

Key interfaces

  • fetch_site(url) -> FetchResult — the transport layer. Errors become data on FetchResult.error, not exceptions. HTTP 4xx/5xx populate status; their semantic meaning (transient or permanent "broken") is compare's call, not fetch's.
  • parse_items(site, fetch_result) -> list[ParsedItem] | SiteBroken — applies CSS selectors via BeautifulSoup. Returns SiteBroken on fetch error, status ≥ 400, selector mismatch, or when title/link extraction fails for every item.
  • compare_to_baseline(site_name, baseline, parsed, fetched_at) -> list[NewItem] — returns [] when baseline is None (first-run silent establishment).
  • compose_digest(new_items, broken_sites, sites, run_at) -> EmailMessage | None — returns None when both lists are empty, which is how the orchestrator knows not to send.
  • send_digest(msg, dry_run=False) — STARTTLS on configured port. dry_run=True just prints msg.as_string().
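The first-run behavior of compare_to_baseline can be sketched as follows — here baseline is simplified to a set of previously seen links, which is an assumption about the real Baseline type:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NewItem:
    site_name: str
    title: str
    link: str

def compare_to_baseline(site_name, baseline, parsed, fetched_at):
    # First real run for a site: no baseline row exists yet, so establish
    # it silently by reporting nothing as new.
    if baseline is None:
        return []
    return [NewItem(site_name, title, link)
            for (title, link) in parsed if link not in baseline]

first = compare_to_baseline("hn", None, [("A", "/a")], "2025-01-01")
later = compare_to_baseline("hn", {"/a"}, [("A", "/a"), ("B", "/b")], "2025-01-02")
```

This is why the first digest after adding a site is empty: the None baseline short-circuits before any comparison happens.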

Per-site failure isolation

The orchestrator wraps each site's loop body in try/except Exception. Expected failures (fetch errors, selector misses) are already converted to SiteBroken upstream. The catch-all exists so that one unexpected bug on one site does not kill the morning run — the broken site shows up in the digest with internal error: <ExceptionType>: <msg>, and the other sites keep processing.
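The isolation described above amounts to a loop of this shape — a sketch, with SiteBroken's fields and the process_site callable assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class SiteBroken:
    site_name: str
    reason: str

def run_all(sites, process_site):
    new_items, broken = [], []
    for site in sites:
        try:
            new_items.extend(process_site(site))
        except Exception as exc:  # deliberate catch-all: isolate per-site bugs
            # One site's unexpected bug becomes a digest entry,
            # not a dead morning run.
            broken.append(SiteBroken(site, f"internal error: {type(exc).__name__}: {exc}"))
    return new_items, broken

def process(site):
    if site == "broken.example":
        raise ValueError("bad selector")
    return [f"{site}: item"]

items, broken = run_all(["a.example", "broken.example", "b.example"], process)
```

The two healthy sites still produce items; the broken one arrives in the digest with its exception type and message.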

Configuration reference

.env

Variable        Required for…  Notes
SMTP_HOST       real send      e.g. smtp.gmail.com
SMTP_PORT       real send      STARTTLS-friendly, typically 587
SMTP_USER       real send      SMTP auth username
SMTP_PASS       real send      SMTP auth password (app password for Gmail)
FROM_EMAIL      always         Sent as the From: header; can differ from SMTP_USER for Gmail send-as aliases
TO_EMAIL        always         Recipient
SUBJECT_PREFIX  optional       Inbox tag, e.g. [tidepool]

For a dry run, only FROM_EMAIL and TO_EMAIL need to be set — SMTP credentials are not read.

Watchlist schema

sites:
  - name: Hacker News
    url: https://news.ycombinator.com/
    item_selector: "tr.athing"
    title_selector: ".titleline > a"
    link_selector: ".titleline > a"
    category: tech        # optional

  - name: Simon Willison
    url: https://simonwillison.net/
    item_selector: "div.entry"
    title_selector: "h3 a"
    link_selector: "h3 a"
    category: blogs

Rules:

  • name must be unique across the file — it is the identity key for baselines. Renaming a site is equivalent to starting over.
  • The five fields name, url, item_selector, title_selector, link_selector are all required. load_watchlist raises ValueError if any are missing.
  • category is optional. If any site has a category, the digest groups items by category; otherwise it lists them flat by site_name.
  • Selectors are standard CSS selectors, parsed by BeautifulSoup's html.parser.
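The required-field and unique-name rules can be sketched as a validator — a simplified version that works on already-parsed dicts; the real load_watchlist also reads the YAML and builds Site objects:

```python
REQUIRED = ("name", "url", "item_selector", "title_selector", "link_selector")

def validate_sites(sites: list[dict]) -> list[dict]:
    seen_names: set[str] = set()
    for site in sites:
        missing = [field for field in REQUIRED if field not in site]
        if missing:
            raise ValueError(f"site {site.get('name', '<unnamed>')!r} missing: {missing}")
        if site["name"] in seen_names:
            raise ValueError(f"duplicate site name: {site['name']!r}")
        seen_names.add(site["name"])
    return sites

ok = validate_sites([{
    "name": "Hacker News",
    "url": "https://news.ycombinator.com/",
    "item_selector": "tr.athing",
    "title_selector": ".titleline > a",
    "link_selector": ".titleline > a",
}])
```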

Project layout

tidepool/
├── CLAUDE.md                # Project context for the agent (the book's seed document)
├── NOTES.md                 # Session-by-session working notes with reasoning
├── README.md                # This file
├── pyproject.toml           # uv project, 4 runtime deps, Python 3.11+
├── uv.lock                  # Resolved dependency tree
├── .env.example             # Template — copy to .env
├── .gitignore
├── watchlist.example.yaml   # Template — copy to watchlist.yaml
├── data/                    # Runtime SQLite database (gitignored)
└── tidepool/                # The package
    ├── __init__.py
    ├── __main__.py
    ├── watch.py
    ├── fetch.py
    ├── compare.py
    ├── notify.py
    └── data.py

data/, .env, and watchlist.yaml are gitignored — they hold runtime state and secrets. .env.example and watchlist.example.yaml are committed as templates.

Operating principles

Lifted from CLAUDE.md and enforced throughout the code:

  1. Fail loud, not silent. Broken sites surface in the digest with a reason. Transport errors become FetchResult.error data. Unexpected exceptions are caught per-site and rendered as SiteBroken. Selector mismatches are a reported failure, not an ignored row.
  2. Email only when there is something to report. If new_items and broken_sites are both empty, compose_digest logs to stderr and returns None — no empty-inbox noise.
  3. Keep the dependency list short. Four runtime deps (httpx, beautifulsoup4, pyyaml, python-dotenv). Everything else is stdlib: smtplib, sqlite3, urllib.robotparser, argparse, email, dataclasses.
  4. Local-only operation. No cloud services beyond the outbound SMTP connection. SQLite on local disk. Designed to run as a scheduled local job.

Scheduling a daily run

Tidepool itself has no scheduler — that's an OS concern. Two common options:

macOS (launchd)

Create ~/Library/LaunchAgents/local.tidepool.plist with a StartCalendarInterval for your chosen hour, pointing at uv run python -m tidepool inside the repo. Load with launchctl load.
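A plist of that shape might look like the following — the label, uv path, and repo path are placeholders to adjust for your machine:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>local.tidepool</string>
  <key>WorkingDirectory</key><string>/Users/you/tidepool</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/local/bin/uv</string>
    <string>run</string>
    <string>python</string>
    <string>-m</string>
    <string>tidepool</string>
  </array>
  <key>StartCalendarInterval</key>
  <dict>
    <key>Hour</key><integer>7</integer>
    <key>Minute</key><integer>0</integer>
  </dict>
</dict>
</plist>
```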

Linux (cron)

0 7 * * * cd /path/to/tidepool && /usr/local/bin/uv run python -m tidepool 2>> ~/.tidepool.log

On a normal real run, Tidepool is silent on stdout and only writes to stderr for warnings. Redirect stderr if you want to keep the warning history.

The commit arc

Each commit on main is a moment in the chapter. Reading the log is the fastest way to see how Tidepool was built.

initial scaffold: project structure, empty subsystems, pyproject
fetch:    implement fetch_site with FetchResult dataclass, robots.txt warning, no retry
compare:  implement compare_to_baseline with first-run baseline establishment and site-broken detection
notify:   implement send_digest with dry-run flag, smart subject line, empty-case logging
main:     implement orchestrator with per-site isolation and dry-run flag

The chapter applies the book's seven practices in sequence:

  1. Load context deliberately — the CLAUDE.md in the repo root. Every session starts by reading it.
  2. Plan before code — each subsystem was planned with ambiguities surfaced before implementation. See the "Decisions" sections in NOTES.md.
  3. Probe real data — fetch was first exercised against live sites (example.com, an httpbin redirect, a bogus host) before anything depended on its output.
  4. Adjudicate with reasoning — every nontrivial decision in NOTES.md has a Why: line, not just the chosen option.
  5. Verify end-to-end — after each subsystem landed, it was run against real input and the output shown.
  6. Commit in small increments — one subsystem per commit, in the "subsystem: what changed" message format.
  7. Stop when the work is done — sessions end at a clean resting point.

NOTES.md is the per-session trail: what was done, what changed, what was decided and why, what was deferred, and what comes next. Treat it as the companion read while flipping through Chapter 14.

Customizing your watchlist

Writing good selectors is the only craft you add. The loop:

  1. Open the site in a browser, view source (or DevTools).
  2. Find the repeating element that wraps each post or story — that's your item_selector.
  3. Inside that element, find the selector for the title text and the <a> that holds the URL. Often the same selector serves both title_selector and link_selector.
  4. Add the entry to watchlist.yaml.
  5. uv run python -m tidepool --dry-run and read the composed output.

Diagnosing a broken entry:

  • "selector matched no items: '…'" → your item_selector is wrong.
  • "every item missing title or link (…)"item_selector matches, but title_selector / link_selector inside those elements do not resolve to both text and href.
  • "HTTP 403" or "blocked by robots.txt" → the site does not want to be fetched. Drop it from the watchlist or keep it for the "still blocked" daily signal.

Troubleshooting

  • The first digest was empty. First run per site establishes the baseline silently. You will see items the next time that site updates.
  • Gmail auth fails. Use an app password, not your normal password. Enable 2-Step Verification first, then generate an app password at Google Account → Security → App Passwords.
  • A site shows up broken every day. Run --dry-run and read the reason. Common causes: selector drift after a redesign, login-wall, Cloudflare challenge, or an intentional robots.txt disallow (which we respect).
  • Could not fetch robots.txt for … appears on stderr twice. When a host is unreachable, the robots check and the main fetch each fail on the same DNS error. Cosmetic, not a bug.
  • decode failed: <encoding>. The site's declared encoding doesn't match its bytes. Rare; file an issue with the watchlist entry and we can look.

Extending

The shape is designed to make common extensions surgical:

  • Another output channel (Slack, SMS, iMessage, Discord webhook): add a parallel function in notify.py with the same compose_* / send_* split. The orchestrator stays unchanged.
  • Per-site rate-limiting: fetch_site is intentionally stateless. A caller-side semaphore or time.sleep in the orchestrator is the right place.
  • Per-site parsing hooks: parse_items is the natural extension point. Switch on site.name, or add a parser field to Site and dispatch.
  • JSON feed support: fetch stays the same; add a parse_items variant that handles application/json content.
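For example, the caller-side rate-limiting from the second bullet could be as small as this — a sketch, with the wrapper name and delay invented for illustration; fetch_site itself stays untouched:

```python
import time

def fetch_all_politely(sites, fetch_site, delay_seconds=2.0):
    # Space out requests from the orchestrator side; fetch stays stateless.
    results = {}
    for i, site in enumerate(sites):
        if i > 0:
            time.sleep(delay_seconds)
        results[site] = fetch_site(site)
    return results

results = fetch_all_politely(["a", "b"], lambda s: f"fetched {s}", delay_seconds=0.0)
```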

Keep the dependency list short. Every new package requires justification.

Why read this repo alongside the book

The book is the narrative; the repo is the artifact. They answer different questions:

  • Why does the code look like this? — the book.
  • What exactly did the final code end up being? — the repo.
  • What was the reasoning at each decision? — NOTES.md and the commit messages.

Reading them in parallel — one screen with the book, another with git log -p and NOTES.md — is the intended experience.
