Script that:
- Reads a list of sources from a file (each line:
type url— type isrssorhtml) - Parses the latest N items: RSS/Atom with feedparser, HTML blog listing pages with BeautifulSoup + readability
- Calls an LLM to generate 10 post ideas from the parsed content (structured: source links, source insight, post idea, description, format, how to use)
Uses OpenAI for LLM calls and LangFuse for tracing (via langfuse.openai).
# Use uv (recommended)
uv sync
# Or pip
pip install -e .Copy .env.example to .env and set:
OPENAI_API_KEY— requiredLANGFUSE_PUBLIC_KEY/LANGFUSE_SECRET_KEY— optional; if set, traces are sent to LangFuse
- Put sources in
urls.txt. Format:type url(tab or space), one per line.rss— RSS/Atom feed URL (e.g.rss https://hnrss.org/frontpage).html— blog listing page URL; the script will find article links on the page and fetch each article (e.g.html https://www.forrester.com/blogs/).
- Run:
python run.pyOptions (env vars):
URLS_FILE— path to URLs file (default:urls.txt)TOP_N— max number of latest entries to use (default:10)OPENAI_MODEL— model name (default:gpt-4o)
Output: 10 post ideas (with source links, insights, format) printed to stdout; LLM calls are logged to LangFuse when configured.
When adding new URLs (especially html sources), follow docs/ADDING_SOURCES.md so that:
- You verify how the site is parsed and which links are collected.
- You add exclusions in
parsers/html.pyfor category/section/landing URLs if the parser picks them up by mistake.
content-engine/
├── run.py # Entry point (loads .env, calls main)
├── main.py # Pipeline: load sources → fetch → LLM → output
├── config.py # Constants (DEFAULT_*, USER_AGENT, SOURCE_TYPES)
├── models.py # FeedEntry, Source dataclasses
├── sources.py # load_sources() from urls file
├── fetcher.py # fetch_entries() — RSS + HTML, merge by date
├── parsers/
│ ├── __init__.py
│ ├── rss.py # fetch_entries_rss()
│ └── html.py # fetch_entries_html()
├── prompt_loader.py # load_prompt(name, **variables)
├── llm.py # generate_post_ideas()
├── prompts/ # Prompt templates ({{placeholder}})
│ ├── post_ideas_system.txt
│ └── post_ideas_user.txt # {{count}}, {{content}}, {{sources}}
├── urls.txt
└── ...
Run from the project root so that imports resolve (python run.py).