A multi-city event aggregator. Scrapes multiple sources — club listings, Eventbrite categories, venue calendars, regional aggregators, and meetup pages — deduplicates, tags, scores by personal interest, calculates transit times, and generates a self-contained HTML viewer per city.
Ships with two cities: NYC (at /nyc) and Seattle (at /seattle), both served from the same Lambda + CloudFront. Adding a third city is a config change; see Adding a City.
Built for someone who wants one page that answers "what's happening this week?" across nightlife, food, lectures, dating events, comedy, and more.
- How It Works
- Sources
- Setup
- Usage
- Scraping Patterns
- Caching
- Interest Scoring
- Adding a Source
- Infrastructure
- Local Development
- Project Structure
- Adding a City
cities/<id>.json Per-city config (metadata, RA area, region filter, etc.)
cities/<id>.sources.json Per-city source definitions (URLs, scrape method, TTL, type)
|
fetch_events.py Fetches in parallel, caches per-source, deduplicates
|
tags + scoring Pattern-matches genres, cuisines, vibes -> weighted interest score
|
Google Maps API Transit time from home to each venue (optional, cached 7 days)
|
events-viewer.html Injects JSON into template -> output/index.html
The HTML output is fully self-contained (no external JS dependencies) with filtering by date, tag, source, and transit time. It works offline once generated.
The viewer encodes the current filter state in the URL so any view can be shared as a link. Hit the Share button to copy the current URL to the clipboard. Supported query params:
| Param | Example | Meaning |
|---|---|---|
| `date` | `2026-05-08` | Restrict to a single date |
| `transit` | `30` | Max transit minutes (10/20/30/45) |
| `type` | `edm` | Event type (`edm`, `dating`, `food-drink`, `lecture`, `social`, `concert`, `party`, `comedy`, `art`) |
| `status` | `avail` | `avail` hides sold-out, `sold` shows only sold-out |
| `q` | `tresor` | Free-text search over name/venue/tags |
| `tags` | `EDM,Techno` | Include events tagged with ANY of these (comma-separated) |
| `xtags` | `Dating` | Exclude events with ANY of these tags |
| `sources` | `Partiful,Venue` | Include only events from these sources |
| `xsources` | `RA` | Exclude events from these sources (default: `RA`) |
Source filters (Partiful, RA, etc.) and tag filters (EDM, Dating, etc.) are ANDed across groups and ORed within each, so ?sources=Partiful&tags=Dating returns Partiful events that are tagged Dating — not the union.
Example — EDM events in NYC on Friday, May 8:
https://local-events.wolff.sh/nyc/?date=2026-05-08&tags=EDM
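A minimal Python sketch of the AND-across-groups / OR-within-group semantics (the real logic lives in the viewer's JavaScript; `matches` and its parameter names here are illustrative, not the actual API):

```python
def matches(event_tags, event_source, tags=None, xtags=None,
            sources=None, xsources=None):
    """AND across filter groups, OR within each group."""
    tagset = set(event_tags)
    if tags and not tagset & set(tags):          # OR within `tags`
        return False
    if xtags and tagset & set(xtags):            # any excluded tag rejects
        return False
    if sources and event_source not in sources:  # OR within `sources`
        return False
    if xsources and event_source in xsources:
        return False
    return True                                  # every group passed -> AND

# ?sources=Partiful&tags=Dating — both groups must pass
matches(["Dating"], "Partiful", tags=["Dating"], sources=["Partiful"])  # True
matches(["Dating"], "RA", tags=["Dating"], sources=["Partiful"])        # False
```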
Sources are listed per-city in cities/<id>.sources.json. Current inventory:
Shared across cities (NYC + Seattle):
| Source | Method | What It Covers |
|---|---|---|
| Resident Advisor | GraphQL API | EDM / club events |
| Dice.fm | `__NEXT_DATA__` extraction (per-city URL + venue-city-id filter) | EDM / live music |
| Eventbrite | `ld+json` schema.org parsing | Nightlife, EDM, food, comedy (+ NYC-only: lectures, dating categories) |
| Meetup | shot-scraper | Local in-person meetups |
NYC-only:
| Source | Method | What It Covers |
|---|---|---|
| Elsewhere | `__NEXT_DATA__` extraction | Elsewhere Brooklyn (all rooms) |
| Partiful | `__NEXT_DATA__` extraction | Social events, parties, pop-ups |
| Brooklyn Storehouse | shot-scraper | Warehouse club nights |
| Basement NY | shot-scraper (two-hop) | Underground club |
| SILO Brooklyn | shot-scraper | Club nights |
| Mission NYC | shot-scraper | Manhattan club |
| House of Yes | shot-scraper | Parties / performance |
| Film Forum | custom HTML parse | Arthouse cinema Q&As and series |
| Metrograph | custom HTML parse | Arthouse cinema (filtered to "special" screenings) |
| Nitehawk Williamsburg | custom HTML parse | Arthouse cinema |
Seattle-only:
| Source | Method | What It Covers |
|---|---|---|
| 19hz | regional listing scrape (filtered by `nineteenhz_locations` / `nineteenhz_venues`) | Seattle-area EDM aggregator — covers Kremwerk, Timbre Room, Neumos, etc. without per-venue scrapers |
- Python 3.10+
- Chromium (installed automatically by Playwright via `shot-scraper`)
- A Google Maps API key (optional, for transit times)
git clone https://github.com/MatthewWolff/local-events.git
cd local-events
bash setup.sh

The interactive setup walks you through Python dependencies, API keys, location config, a test run, and optionally AWS infrastructure.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
shot-scraper install # downloads Chromium
cp .env.example .env # then edit with your values

Copy the example env file and fill in your values:

cp .env.example .env

Edit .env:
GOOGLE_MAPS_API_KEY=your-key-here
HOME_ADDRESS_NYC=123 Your St, Brooklyn, NY 11211
WORK_ADDRESS_NYC=456 Office Ave, Manhattan, NY 10001
# HOME_ADDRESS_SEATTLE=... # optional — enables the transit column for Seattle
Addresses are per-city and keyed by the uppercased city id: HOME_ADDRESS_<CITY> / WORK_ADDRESS_<CITY>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored (no fallback). A city without a configured home address renders with no transit UI at all.
WORK_ADDRESS_<CITY> flags events that are closer to work on weekday evenings. Both are optional.
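The per-city lookup amounts to the following sketch (`address_for` is a hypothetical name; the real resolution lives in fetch_events.py):

```python
import os

def address_for(kind: str, city_id: str):
    """Resolve HOME_ADDRESS_<CITY> / WORK_ADDRESS_<CITY> from the environment.

    Only the _<CITY_UPPER> form is consulted; a bare HOME_ADDRESS is ignored.
    Returns None when unset, which hides the transit UI for that city.
    """
    return os.environ.get(f"{kind}_ADDRESS_{city_id.upper()}") or None

os.environ["HOME_ADDRESS_NYC"] = "123 Your St, Brooklyn, NY 11211"
address_for("HOME", "nyc")      # -> the NYC address
address_for("HOME", "seattle")  # -> None unless HOME_ADDRESS_SEATTLE is set
```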
GOOGLE_MAPS_API_KEY requires the Distance Matrix API enabled in your Google Cloud project. Without it, everything works except transit times.
These go in .env or are passed as CLI flags:
| Variable | Default | Description |
|---|---|---|
| `DATE_RANGE_DAYS` | `14` | How many days ahead to look |
| `MAX_TRANSIT_MINUTES` | `45` | Hide events farther than this |
| `CACHE_DIR` | `./cache` | Root for cache files. Writes land in `cache/<city>/` per-city. |
| `OUTPUT_DIR` | `./output` | Root for generated HTML. Writes land in `output/<city>/` per-city. |
| `LOG_DIR` | `./logs` | Structured JSON logs |
| `SHOT_SCRAPER_TIMEOUT` | `30` | Timeout (seconds) for headless browser scrapes |
| `GEMINI_API_KEY` | (none) | Gemini API key for LLM tag enrichment (optional) |
By default commands run against NYC. Pass --city=<id> to target a different city.
# Full run for NYC (default) — fetches, generates HTML, opens in browser
python fetch_events.py
# Same, for Seattle
python fetch_events.py --city=seattle
# No browser, verbose logging to console
python fetch_events.py --no-open --verbose
# Bypass all caches (re-scrape everything)
python fetch_events.py --force
# Bypass cache for one source only
python fetch_events.py --force-source=ra
# Smoke test all sources
python fetch_events.py --health-check
# Refresh sold-out status only (no re-scraping)
python fetch_events.py --status-only --no-open
# Custom date range
python fetch_events.py --date-from 2026-05-01 --date-to 2026-05-14
# Also dump raw JSON
python fetch_events.py --json events.json

| Method | How it works | Sources |
|---|---|---|
| `graphql` | POST to ra.co/graphql, area configured per-city | RA |
| `next_data` | Extract `__NEXT_DATA__` JSON from server-rendered pages | Dice, Elsewhere, Partiful |
| `eventbrite_listing` | Scrape listing page, extract event URLs, parse schema.org JSON from each | Eventbrite categories |
| `shot_scraper` | Headless Chromium renders the page, parse `document.body.innerText` | Storehouse, SILO, Mission, House of Yes, Meetup |
| `shot_scraper_twohop` | Headless browser gets event URLs first, then fetches each page | Basement NY |
| `ldjson_direct` | Fetch a single event-detail page and parse schema.org `ld+json` | (generic fallback) |
| `19hz` | Fetch a regional 19hz listing and filter by city-scoped allowlists | Seattle 19hz |
| `filmforum` / `metrograph` / `nitehawk` | Venue-specific HTML parsers for arthouse cinema calendars | Film Forum, Metrograph, Nitehawk |
Each source has a TTL defined in cities/<id>.sources.json (typically 2-6 hours). Cached responses are stored as JSON under cache/<id>/ with a version field — bumping CACHE_VERSION in fetch_events.py invalidates all caches. Results below min_expected_events are not cached to avoid persisting render failures.
Each city runs on its own schedule (typically twice a day, ~12h apart in local time) — all TTLs expire between runs, so every scheduled run fetches fresh data. The S3 cache/<city>/ prefix persists cache across Lambda cold starts and keeps cities isolated (deleting one city's cache doesn't affect the other).
Transit times are cached separately for 7 days since venue locations don't change often.
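The TTL + version check can be sketched as follows (`read_cached_source` is an illustrative stand-in for the project's `read_cache`; the file layout shown is an assumption):

```python
import json
import time
from pathlib import Path

CACHE_VERSION = 1  # bumping this invalidates every cached source

def read_cached_source(cache_dir: Path, source_id: str, ttl_seconds: int):
    """Return cached events, or None if missing, stale, or version-mismatched."""
    path = cache_dir / f"{source_id}.json"
    if not path.exists():
        return None
    payload = json.loads(path.read_text())
    if payload.get("version") != CACHE_VERSION:
        return None  # schema changed since this was written
    if time.time() - payload["fetched_at"] > ttl_seconds:
        return None  # TTL expired — re-scrape
    return payload["events"]
```

A miss on any of the three checks falls through to a fresh fetch, and results below `min_expected_events` are simply never written, so a broken render can't poison the next run.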
Events are tagged and scored based on keyword matching against event names and venues. The INTEREST_SCORES dict at the top of fetch_events.py defines weights:
- Genre tags (Techno, House, DnB): matched by artist/venue keywords in `GENRE_RULES`
- Culture tags (Japanese, Italian, etc.): matched by food/drink keywords in `CULTURE_RULES`
- Activity tags (Science, Comedy, Dating, etc.): matched by event type
- Vibe tags (Date Night, Social, Solo-Friendly): inferred from keywords and event type
Events with only generic tags (Social, Party) are sent to Gemini 2.5 Flash-Lite for richer classification. This is a fallback -- keyword rules handle the majority of events, and the system works without a Gemini key.
Edit these dicts to match your own interests. Higher scores float to the top within each date.
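The match-then-score shape looks roughly like this (rule contents and the `tag_and_score` helper are made up for illustration; the real dicts are in fetch_events.py):

```python
import re

# Hypothetical miniature versions of the real rule/score dicts
GENRE_RULES = {"Techno": r"\b(techno|tresor)\b", "House": r"\bhouse\b"}
INTEREST_SCORES = {"Techno": 9, "House": 7}

def tag_and_score(name: str, venue: str):
    """Keyword-match the event text, then sum the weights of matched tags."""
    text = f"{name} {venue}".lower()
    tags = [tag for tag, pat in GENRE_RULES.items() if re.search(pat, text)]
    score = sum(INTEREST_SCORES.get(t, 0) for t in tags)
    return tags, score

tag_and_score("Tresor Night", "Basement")  # -> (["Techno"], 9)
```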
A small local web app at tools/score_tuner.py lets you drag tags to reorder them — top of the list becomes the highest score, scores auto-distribute linearly from 10 down to 1, and clicking Save rewrites INTEREST_SCORES in fetch_events.py.
python tools/score_tuner.py
# opens http://127.0.0.1:5001

The UI shuffles tags on load so you rank from scratch rather than anchoring on the current order. Great for forking this repo and dialing in your own preferences without reading any code.
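The rank-to-weight step is plain linear interpolation; a sketch (`distribute_scores` is an illustrative name, not the tuner's actual function):

```python
def distribute_scores(ranked_tags, top=10.0, bottom=1.0):
    """Top of the list gets `top`; scores fall linearly to `bottom`."""
    n = len(ranked_tags)
    if n == 1:
        return {ranked_tags[0]: top}
    step = (top - bottom) / (n - 1)
    return {tag: round(top - i * step, 2) for i, tag in enumerate(ranked_tags)}

distribute_scores(["Techno", "Comedy", "Dating", "Art"])
# -> {"Techno": 10.0, "Comedy": 7.0, "Dating": 4.0, "Art": 1.0}
```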
Add an entry to the city's cities/<id>.sources.json:
{
"id": "my-venue",
"name": "My Venue",
"method": "shot_scraper",
"url": "https://myvenue.com/events",
"venue_address": "123 Street, Brooklyn, NY 11211",
"ttl_seconds": 21600,
"type": "edm",
"min_expected_events": 3,
"enabled": true
}

For `shot_scraper` sources, you also need a parser function in fetch_events.py and an entry in the `SHOT_PARSERS` dict. The parser receives the raw `document.body.innerText` output and extracts events from it.
Supported methods:
- `graphql` -- POST to a GraphQL endpoint (currently RA-specific)
- `next_data` -- extract `__NEXT_DATA__` JSON from server-rendered pages
- `eventbrite_listing` -- scrape Eventbrite listing pages, then parse `ld+json` from each event
- `shot_scraper` -- headless browser renders the page, then text is parsed
- `shot_scraper_twohop` -- headless browser gets event URLs first, then fetches each page
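For a new `shot_scraper` source, the parser is just innerText in, events out. A hypothetical skeleton (the line format and exact parser signature are assumptions; real examples live in fetch_events.py):

```python
import re

def parse_my_venue_text(text: str) -> list[dict]:
    """Pull events out of the rendered page text.

    Assumes one event per line like '05/08 My Event Name' — a real site
    will need its own line format.
    """
    events = []
    for line in text.splitlines():
        m = re.match(r"(\d{2}/\d{2})\s+(.+)", line.strip())
        if m:
            events.append({"date": m.group(1), "name": m.group(2)})
    return events

# Register it so fetch_events.py can find it by source id:
# SHOT_PARSERS["my-venue"] = parse_my_venue_text

parse_my_venue_text("05/08 Warehouse Rave\nabout us")
# -> [{"date": "05/08", "name": "Warehouse Rave"}]
```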
All AWS infrastructure is managed by Terraform in infra/. The scraper runs on Lambda with Chromium for full headless browser support.
Full scrape (7am + 7pm ET)
→ Lambda scrapes all sources, tags, Gemini enrichment, transit
→ output uploaded to S3 → CloudFront invalidation
Status refresh (every 2 hours, 9am-9pm ET)
→ Lambda loads cached events, refreshes sold-out status
→ reapplies tags + transit from cache (no re-scraping)
→ ~5 seconds vs ~55 seconds for full scrape
| Resource | Detail |
|---|---|
| S3 bucket | local-events-wolff-sh (private, OAC) |
| CloudFront | Distribution with ACM cert for local-events.wolff.sh |
| Lambda | local-events-scraper (container image, 3GB RAM, 15min timeout) |
| ECR | local-events-scraper repository |
| EventBridge | Full scrape 7am + 7pm ET, status refresh every 2h 9am-9pm ET |
| SSM | Secrets under /local-events/ prefix |
cd infra
terraform init
terraform apply

The ACM certificate requires a DNS validation CNAME in name.com (output after the first apply).
Set these GitHub repository secrets (Settings > Secrets > Actions):
| Secret | Description |
|---|---|
| `AWS_ACCESS_KEY_ID` | IAM user access key with admin permissions |
| `AWS_SECRET_ACCESS_KEY` | IAM user secret key |
| `GOOGLE_MAPS_API_KEY` | Google Distance Matrix API key |
| `HOME_ADDRESS` | Home address for transit calculations |
| `WORK_ADDRESS` | Work address |
| `GEMINI_API_KEY` | Gemini API key for LLM tag enrichment |
| `ALARM_EMAIL` | Email for CloudWatch alarm notifications |
Pushing to mainline auto-deploys via GitHub Actions:
- `terraform apply` — deploys any infrastructure changes
- Docker build + push to ECR
- Update Lambda function code
- Invoke Lambda to regenerate the site
Triggers on changes to: fetch_events.py, lambda_handler.py, Dockerfile, requirements.txt, cities/**, templates/**, infra/**.
For manual deploys: bash deploy/build-and-push.sh
aws lambda invoke --function-name local-events-scraper /tmp/response.json && cat /tmp/response.json

| Resource | Detail |
|---|---|
| Status page | local-events.wolff.sh/status.html |
| CloudWatch dashboard | LocalEvents in AWS console |
| Metrics namespace | LocalEvents -- EventCount, SourceError, TotalEvents, RunDuration, SourcesOk, SourcesFailed. Every metric carries a City dimension (and per-source metrics also carry Source). |
| Alarms | run-failed, low-events (<50), high-failures (4+), no-invocation (24h), per-source chronic failure (RA, Dice, Partiful, Lectures on Tap) -- all notify via SNS email |
# CloudWatch logs
aws logs tail /aws/lambda/local-events-scraper --follow
# Last run manifest
aws s3 cp s3://local-events-wolff-sh/last_run.json - | python -m json.tool

python fetch_events.py --no-open --verbose # all sources
python fetch_events.py --force # bypass cache
python fetch_events.py --health-check # smoke test

Both Python (pytest) and JavaScript (node --test) suites run through a single entry point:
./test.sh # unit tests (py + js, parallel)
./test.sh py # Python unit tests only
./test.sh js # JavaScript unit tests only
./test.sh coverage # pytest with coverage (add `html` for drilldown)
./test.sh integration # slow E2E: real browser + full pipeline run

Unit tests run on every commit via the pre-commit hook. Integration tests run in CI before deploy — they're slower (~5s for 15 tests) because they launch a real Chromium via `shot-scraper` and exercise `run()` / `run_status_only()` end-to-end.
Coverage on the tested logic surface is 97.8% lines / 92.4% branches. The overall file coverage is ~52% because network-facing fetchers (RA GraphQL, Dice/Elsewhere __NEXT_DATA__, Eventbrite listings, Partiful, the shot-scraper venues) and CLI orchestration are deliberately not unit-tested — mocking the whole HTTP stack would test the mocks, not the code.
What IS covered:
- Dedup pipeline — `normalize_name`, `normalize_venue`, `names_overlap`, `detail_score`, `merge_events`, `deduplicate`
- Tagging — `_source_tag`, `apply_tags` across all genre/culture/vibe rules
- Partiful classification — `_classify_partiful_type` and section fallbacks
- Event validation — `_check_event_date`, `validate_events`
- Cache — `read_cache` (TTL, version, date-range), `write_cache` (atomic rename)
- Parsers — SILO, Mission, House of Yes, Meetup, Basement, Storehouse text parsers
- Gemini enrichment — response parsing, vocab enforcement, caching, error paths
- HTML generation — placeholder substitution, filter-logic inlining, field remapping
- Viewer filter + URL state — source/tag AND semantics, encode/decode roundtrip
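To make the dedup list above concrete, the name-matching step can be sketched like this (the function names come from the repo, but these bodies and the threshold are illustrative, not the real implementations):

```python
import re

STOP_WORDS = {"the", "presents", "nyc"}  # city-scoped extras come from config

def normalize_name(name: str) -> set[str]:
    """Lowercase, tokenize, and drop stop words before comparing names."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def names_overlap(a: str, b: str, threshold: float = 0.6) -> bool:
    """Two listings refer to the same event if their token sets mostly agree."""
    ta, tb = normalize_name(a), normalize_name(b)
    if not ta or not tb:
        return False
    return len(ta & tb) / min(len(ta), len(tb)) >= threshold

names_overlap("Tresor Presents: Ben Klock", "Ben Klock (Tresor)")  # True
```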
A pre-commit hook (.githooks/pre-commit) runs ./test.sh before every commit. setup.sh enables it via git config core.hooksPath .githooks. Bypass in emergencies with git commit --no-verify.
docker build -t local-events-scraper .
docker run --rm -e HOME=/tmp -e PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
--entrypoint bash local-events-scraper -c \
'shot-scraper javascript https://example.com "document.title"'

.
├── fetch_events.py # Single script — fetchers, parsers, scoring, HTML generation
├── lambda_handler.py # Lambda entry point — SSM secrets, S3 cache, CF invalidation
├── Dockerfile # Container image for Lambda (Playwright + Chromium)
├── cities/
│ ├── nyc.json # Per-city config (metadata, RA area, etc.)
│ └── nyc.sources.json # Per-city source definitions
├── requirements.txt # Python dependencies
├── og-image.jpg # OpenGraph social share image
├── CLAUDE.md # Development gotchas for AI-assisted coding
├── setup.sh # Interactive setup script
├── .env.example # Environment variable template
├── templates/
│ ├── events-viewer.html # HTML template with embedded JS viewer
│ ├── filter_logic.js # Filter + URL-state module, inlined into the template at generation
│ └── status.html # Source health dashboard (reads last_run.json)
├── tests/ # pytest + node --test, run via ./test.sh
│ ├── conftest.py
│ ├── filter_logic.test.js # viewer filter + URL state (JS)
│ ├── test_cache.py
│ ├── test_dedup.py
│ ├── test_gemini.py
│ ├── test_html_generation.py
│ ├── test_parsers.py
│ ├── test_partiful_classifier.py
│ ├── test_tags.py
│ ├── test_validation.py
│ └── integration/ # opt-in via ./test.sh integration
│ ├── conftest.py
│ ├── test_viewer_browser.py # shot-scraper drives real HTML
│ ├── test_status_only.py # cache → status-only → HTML + manifest
│ └── test_full_pipeline.py # fetch_source mocked, full run() E2E
├── .githooks/
│ └── pre-commit # runs ./test.sh before every commit
├── test.sh # single entry point for all tests
├── infra/ # Terraform infrastructure (S3 backend)
│ ├── main.tf # Provider, backend
│ ├── s3.tf # S3 bucket + policy
│ ├── cloudfront.tf # CloudFront distribution + OAC
│ ├── lambda.tf # Lambda function + ECR
│ ├── eventbridge.tf # EventBridge schedule
│ ├── iam.tf # Lambda execution role
│ ├── ssm.tf # SSM parameters for secrets
│ ├── acm.tf # ACM certificate
│ ├── monitoring.tf # CloudWatch dashboard, alarms, SNS
│ ├── variables.tf # Input variables
│ └── outputs.tf # Output values
├── deploy/
│ └── build-and-push.sh # Build container, push to ECR, update Lambda
├── docs/
│ └── index.html # GitHub Pages redirect to local-events.wolff.sh
└── .github/workflows/
└── deploy.yml # CI/CD: Terraform + Docker + Lambda deploy
The pipeline is multi-city — NYC and Seattle are both supported out of the box. Adding a third city is a pure configuration change; you should not need to edit fetch_events.py unless you're wiring up a new venue-specific parser.
Create cities/<id>.json. See cities/nyc.json / cities/seattle.json for the full schema. Key fields:
| Field | Example | Purpose |
|---|---|---|
| `id` | `"la"` | Must match the filename stem. Lowercase. |
| `name` | `"LA"` | Short label shown in the nav bar. |
| `header_location` | `"Los Angeles"` | Subhead under the `<h1>`. |
| `site_title` / `og_title` / `meta_description` / `og_description` | `"LA Events \| What's On"` | Page and OpenGraph title/description metadata. |
| `gtag_page_title` | `"la-events"` | GA4 `page_title` — must be distinct per city. |
| `timezone` | `"America/Los_Angeles"` | Used by Dice date conversion. |
| `ra_area_id` | `23` | Find by inspecting RA's GraphQL `areas.eq` request in devtools. |
| `ra_referer` | `"https://ra.co/events/us/losangeles"` | Matches the RA area. |
| `eventbrite_allowed_regions` | `["CA"]` | Events outside these states are dropped. |
| `dice_url` | `"https://dice.fm/browse?location=los-angeles"` | Per-city Dice listing. |
| `partiful_url` | `null` | Set to a real URL if Partiful has a city page, else `null`. |
| `venue_address_overrides` | `{"TBA - ...": "real address"}` | Address fixes for TBA-named venues. |
| `dedup_stop_words_extra` | `["la", "los", "angeles"]` | City-scoped tokens to strip during name dedup. |
| `favorite_venues` | `[]` | Venues you want auto-tagged "Venue" even when sourced from RA/Dice. |
| `venue_sources` | `[...]` | Source names that scrape a single venue (get the "Venue" source tag). |
| `film_sources` | `[]` | Arthouse cinema sources — get the "Film" tag automatically. |
Mirror the schema in cities/nyc.sources.json. Each entry has id, name, method, url, ttl_seconds, type, min_expected_events, and enabled. The method must be a key in FETCHER_MAP (see fetch_events.py).
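A loader that enforces that schema might look like this (`load_sources` and `REQUIRED` are hypothetical; `FETCHER_MAP` is the real dict referenced above):

```python
import json
from pathlib import Path

REQUIRED = {"id", "name", "method", "url", "ttl_seconds", "type",
            "min_expected_events", "enabled"}

def load_sources(path: Path, fetcher_map: dict) -> list[dict]:
    """Parse cities/<id>.sources.json, reject malformed entries,
    and return only enabled sources."""
    sources = json.loads(path.read_text())
    for src in sources:
        missing = REQUIRED - src.keys()
        if missing:
            raise ValueError(f"{src.get('id', '?')}: missing {sorted(missing)}")
        if src["method"] not in fetcher_map:
            raise ValueError(f"{src['id']}: unknown method {src['method']!r}")
    return [s for s in sources if s["enabled"]]
```

Failing fast on an unknown `method` catches the most common config typo before a scheduled run silently drops the source.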
If you want the transit column/filter/legend to render for the new city, set its home address in .env:
HOME_ADDRESS_LA=123 Main St, Los Angeles, CA
WORK_ADDRESS_LA=...
Variable names must use the suffix _<CITY_UPPER>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored. Without a home address, the transit UI is hidden entirely for that city.
These are global across cities (they describe personal taste, not city features). Customize to match your interests:
- `INTEREST_SCORES` — weight each tag (higher = more prominent)
- `GENRE_RULES` — keyword patterns for music genres
- `CULTURE_RULES` — keyword patterns for cuisines/cultures
- `VIBE_KEYWORDS_DATE` / `VIBE_KEYWORDS_FRIENDS` — vibe detection keywords
- Append your city id to the `var.cities` default in `infra/variables.tf`.
- `terraform apply` — this adds a per-city EventBridge schedule, rebuilds the CloudFront function to route `/<id>`, and adds per-city alarms keyed on the `City` dimension.
If the new city uses a venue site that doesn't match any existing fetcher, add a parser function in fetch_events.py and register it in SHOT_PARSERS (keyed by the source id). See parse_silo_text / parse_storehouse_text for examples.
In the Dockerfile, update `TZ=America/New_York` to your city's timezone. This affects how venue websites render event dates in the headless browser:
ENV TZ=America/Los_Angeles # e.g. for LA

Also update the EventBridge schedule timezone in infra/eventbridge.tf:
schedule_expression_timezone = "America/Los_Angeles"

- Update the title, description, and OG metadata in `templates/events-viewer.html`
- Update the header location text ("Williamsburg, BK")
MIT