OpenDataLand/podcast_summarize

podcast_summarize

Tools for monitoring open-source geospatial podcasts, generating structured episode notes, and publishing them to subscribers by email. The project has two cooperating pieces:

  1. Summarizer pipeline – pulls new podcast episodes, formats transcripts, writes Google Docs (optional), and emails the weekly digest.
  2. FastAPI mini-app – exposes subscribe/unsubscribe endpoints backed by a Mailgun mailing list so readers can manage their own subscriptions securely.

Features

  • Chunked transcript formatting and summary synthesis using the configured OpenAI model.
  • Opinionated prompt tuning for open-source geospatial software developers.
  • Google Drive export hooks for sharing transcripts/summaries (optional).
  • Mailgun List API integration with per-subscriber unsubscribe tokens and one-click headers.
  • Double opt-in flow with confirmation links and expiring tokens.
  • FastAPI server (summary-api) for subscription management, confirmation, and health checks.
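
To illustrate what "chunked transcript formatting" can look like, here is a minimal sketch that splits a transcript on paragraph boundaries so each piece stays under a size limit before it is sent to the model. The function name chunk_transcript and the max_chars limit are assumptions made for this example; the project's actual chunking logic lives in summarize.py and may work differently.

```python
def chunk_transcript(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible;
    a single paragraph longer than max_chars is hard-split.
    """
    chunks: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # A paragraph that cannot fit in any chunk is cut into fixed-size pieces.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be formatted or summarized independently, and the partial results combined in a final synthesis pass.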

Getting Started

Prerequisites

  • Python 3.10+
  • ffmpeg installed and reachable at the path set in FFMPEG_BIN.
  • Mailgun domain with an API key and mailing list (e.g., geospatial_podcasts@mg.opendata.land).
  • OpenAI API key with access to the summarization model you choose.

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# or: pip install -e ".[gdrive]"   # if you need Google Drive exports (quoted so shells like zsh do not expand the brackets)

Configuration

  1. Copy the example environment file and edit the values:
    cp .env.example .env
    $EDITOR .env
  2. At minimum set:
    • OPENAI_API_KEY
    • MAILGUN_DOMAIN, MAILGUN_API_KEY, MAILGUN_LIST_ADDRESS, MAIL_FROM
    • APP_BASE_URL (public URL where the FastAPI app is hosted)
    • APP_SIGNING_SECRET (random string for unsubscribe token signing)
    • Optionally SUBSCRIBE_FORM_TOKEN to require a shared secret for subscription posts
  3. Optionally configure Google Drive (GDRIVE_*).

The application writes working files to WORK_DIR and state to STATE_DIR; both paths are ignored by git.
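
Before the first run it can help to check that the minimum settings listed above are present. The following is a rough sketch, not part of the project: missing_settings and REQUIRED_VARS are hypothetical names, and the check assumes the values from .env have already been loaded into the process environment.

```python
import os
from typing import Mapping

# Minimum settings named in the Configuration steps above.
REQUIRED_VARS = (
    "OPENAI_API_KEY",
    "MAILGUN_DOMAIN",
    "MAILGUN_API_KEY",
    "MAILGUN_LIST_ADDRESS",
    "MAIL_FROM",
    "APP_BASE_URL",
    "APP_SIGNING_SECRET",
)


def missing_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling missing_settings() with no arguments inspects the current environment; an empty list means every required value is set.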

Running the summarizer (cron/CLI)

Use the existing CLI entry point (summarize) to process feeds. Example dry-run:

source .venv/bin/activate
summarize --dry-run

When MAILGUN_LIST_ADDRESS is configured, the pipeline sends each digest to that list address. If MAIL_TO is populated, its addresses are treated as a one-time seed list and imported into Mailgun; otherwise, leave it blank and rely solely on the subscribe/unsubscribe flow. Subscribers must confirm their email address before they are added to the list.
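
Outgoing digests carry one-click unsubscribe headers (see Features). As a sketch of what those headers typically look like under RFC 8058, the helper below builds them from APP_BASE_URL and a per-subscriber token. The function name one_click_headers is hypothetical, and how the project actually attaches the headers to Mailgun messages is not shown here.

```python
from urllib.parse import quote


def one_click_headers(base_url: str, token: str) -> dict[str, str]:
    """Build RFC 8058 one-click unsubscribe headers for one subscriber."""
    # The link targets the /unsubscribe/{token} endpoint of the API.
    url = f"{base_url.rstrip('/')}/unsubscribe/{quote(token, safe='')}"
    return {
        "List-Unsubscribe": f"<{url}>",
        "List-Unsubscribe-Post": "List-Unsubscribe=One-Click",
    }
```

Mail clients that support one-click unsubscribe send a POST to the URL in List-Unsubscribe, which is why the token endpoint needs to accept POST as well as GET.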

Running the FastAPI server

Launch the API with the provided script entry:

source .venv/bin/activate
summary-api  # wraps uvicorn podcast_summarize.server:app

Deploy this behind HTTPS at APP_BASE_URL. Key endpoints:

  • GET /healthz – simple health check.
  • POST /subscribe – accepts JSON { "email": "...", "name": "optional" } and sends a confirmation email. If SUBSCRIBE_FORM_TOKEN is set you must also provide "token": "<value>" or the X-Subscribe-Token header. A minimal HTML form is available at GET /subscribe for browser-based signups.
  • GET|POST /confirm/{token} – handles double opt-in confirmation links. Tokens expire after SUBSCRIBE_CONFIRM_TTL_HOURS hours.
  • POST /unsubscribe – accepts { "email": "..." } to remove a subscriber.
  • GET|POST /unsubscribe/{token} – one-click link used in outgoing emails.
  • GET /unsubscribe – renders a simple email-only form for manual removals.
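
As an illustration of the POST /subscribe contract described above, this sketch builds the request with the standard library only. The helper name build_subscribe_request is an assumption for this example; the JSON fields and the X-Subscribe-Token header are the ones listed in the endpoint description.

```python
import json
import urllib.request
from typing import Optional


def build_subscribe_request(
    base_url: str,
    email: str,
    name: Optional[str] = None,
    form_token: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /subscribe request."""
    payload = {"email": email}
    if name:
        payload["name"] = name
    headers = {"Content-Type": "application/json"}
    if form_token:
        # Only needed when the server has SUBSCRIBE_FORM_TOKEN set.
        headers["X-Subscribe-Token"] = form_token
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/subscribe",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends it; the server should then email a confirmation link to the address.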

Docker Compose wrapper

If you want the API in a self-contained container, a minimal compose file is included:

cp .env.example .env  # ensure the API has its env vars
docker compose -f docker-compose.api.yaml up --build

This binds the app to http://localhost:8002, mounts ./state and ./work inside the container, and restarts the service automatically if it exits. Adjust the published port in docker-compose.api.yaml or set API_PORT in .env if you need a different binding.

Environment variables reference

| Variable | Description |
| --- | --- |
| OPENAI_API_KEY | API key used for summarization requests |
| MODEL_SUMMARY | OpenAI model name (default gpt-4o-mini) |
| MAILGUN_DOMAIN | Mailgun domain (e.g., mg.opendata.land) |
| MAILGUN_API_KEY | Mailgun REST API key |
| MAILGUN_LIST_ADDRESS | Mailing list alias for fan-out |
| MAIL_FROM | From header used for outbound mail |
| MAIL_TO | Optional seed addresses imported once into the list (leave empty otherwise) |
| APP_BASE_URL | Public HTTPS base URL where the FastAPI app is served |
| APP_SIGNING_SECRET | Secret used to sign unsubscribe tokens |
| SUBSCRIBE_FORM_TOKEN | Optional shared secret required for subscription requests |
| APP_BRAND_NAME | Label used in emails and forms (defaults to Geospatial Podcast Summaries) |
| SUBSCRIBE_CONFIRM_TTL_HOURS | Hours before confirmation links expire (default 48) |
| WORK_DIR, STATE_DIR | Paths for generated files and state cache |
| FFMPEG_BIN | Path to ffmpeg executable |
| GDRIVE_* | Optional Google Drive integration switches |
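
To show how a signing secret and a TTL can combine into an expiring link token, here is a minimal HMAC-based sketch. It is an assumption about the general technique, not the project's actual token format: make_token, verify_token, and the payload layout are hypothetical, with secret standing in for APP_SIGNING_SECRET and ttl_hours for SUBSCRIBE_CONFIRM_TTL_HOURS.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def _unb64(text: str) -> bytes:
    # Restore the padding stripped by _b64 before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_token(email: str, secret: str, ttl_hours: int = 48,
               now: Optional[float] = None) -> str:
    """Return 'payload.signature', where payload holds the email and expiry."""
    issued = time.time() if now is None else now
    payload = f"{email}|{int(issued + ttl_hours * 3600)}".encode("utf-8")
    sig = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(sig)}"


def verify_token(token: str, secret: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the email if the token is authentic and unexpired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload, sig = _unb64(payload_b64), _unb64(sig_b64)
    except ValueError:  # wrong shape or invalid base64
        return None
    expected = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    try:
        email, expires = payload.decode("utf-8").rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return None
    current = time.time() if now is None else now
    return email if current <= expires_at else None
```

The signature is checked with hmac.compare_digest before the payload is trusted, so a tampered email or expiry is rejected without leaking timing information.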

Repository Layout

src/podcast_summarize/
├── EpisodeProcessor.py    # pipeline for fetching, summarizing, emailing episodes
├── summarize.py           # prompt configuration and transcript formatting helpers
├── emailer.py             # Mailgun delivery logic (list-aware)
├── subscriptions.py       # Subscriber store + Mailgun list adapters
├── server.py              # FastAPI application
└── config.py              # Environment configuration helpers

Development

  • The project uses pyproject.toml for dependencies and exposes script entry points.
  • Run python3 -m compileall src/podcast_summarize before committing larger changes to catch syntax errors.
  • Keep .env and other secret material out of version control; .env.example documents all required settings.

License

This repository is maintained by OpenDataLand. Choose and add a license file (LICENSE) if you plan to distribute the project publicly.
