
NYC Events Aggregator

A multi-city event aggregator. Scrapes multiple sources — club listings, Eventbrite categories, venue calendars, regional aggregators, and meetup pages — deduplicates, tags, scores by personal interest, calculates transit times, and generates a self-contained HTML viewer per city.

Ships with two cities: NYC (at /nyc) and Seattle (at /seattle), both served from the same Lambda + CloudFront. Adding a third city is a config change; see Adding a City.

Built for someone who wants one page that answers "what's happening this week?" across nightlife, food, lectures, dating events, comedy, and more.

Table of Contents

  • How It Works
  • Sources
  • Setup
  • Usage
  • Scraping Patterns
  • Caching
  • Interest Scoring
  • Adding a Source
  • Infrastructure
  • CI/CD
  • Monitoring
  • Local Development
  • Project Structure
  • Adding a City
  • License

How It Works

cities/<id>.json      Per-city config (metadata, RA area, region filter, etc.)
cities/<id>.sources.json  Per-city source definitions (URLs, scrape method, TTL, type)
       |
  fetch_events.py     Fetches in parallel, caches per-source, deduplicates
       |
  tags + scoring      Pattern-matches genres, cuisines, vibes -> weighted interest score
       |
  Google Maps API     Transit time from home to each venue (optional, cached 7 days)
       |
  events-viewer.html  Injects JSON into template -> output/index.html

The HTML output is fully self-contained (no external JS dependencies) with filtering by date, tag, source, and transit time. It works offline once generated.
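The dedup step in fetch_events.py collapses the same event seen by multiple sources into one record, keeping the most detailed copy. A minimal sketch of the idea — the function and field names here are illustrative, not the actual fetch_events.py API:

```python
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def dedup_key(event: dict) -> tuple:
    """Events with the same normalized name, venue, and date collapse to one."""
    return (normalize(event["name"]), normalize(event["venue"]), event["date"])

def deduplicate(events: list[dict]) -> list[dict]:
    buckets = defaultdict(list)
    for e in events:
        buckets[dedup_key(e)].append(e)
    # Keep the most detailed copy from each bucket (here: most populated fields)
    return [max(group, key=lambda e: sum(1 for v in e.values() if v))
            for group in buckets.values()]
```

The real pipeline is fancier (the tests mention names_overlap, detail_score, and merge_events, which suggests fuzzy matching and field-level merging rather than a simple winner-takes-all), but the key/bucket/pick-best shape is the core of any dedup pass.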

Shareable filter URLs

The viewer encodes the current filter state in the URL so any view can be shared as a link. Hit the Share button to copy the current URL to the clipboard. Supported query params:

| Param | Example | Meaning |
| --- | --- | --- |
| `date` | `2026-05-08` | Restrict to a single date |
| `transit` | `30` | Max transit minutes (10/20/30/45) |
| `type` | `edm` | Event type (edm, dating, food-drink, lecture, social, concert, party, comedy, art) |
| `status` | `avail` | `avail` hides sold-out events; `sold` shows only sold-out events |
| `q` | `tresor` | Free-text search over name/venue/tags |
| `tags` | `EDM,Techno` | Include events tagged with ANY of these (comma-separated) |
| `xtags` | `Dating` | Exclude events with ANY of these tags |
| `sources` | `Partiful,Venue` | Include only events from these sources |
| `xsources` | `RA` | Exclude events from these sources (default: RA) |

Source filters (Partiful, RA, etc.) and tag filters (EDM, Dating, etc.) are ANDed across groups and ORed within each, so ?sources=Partiful&tags=Dating returns Partiful events that are tagged Dating — not the union.
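The AND-across-groups / OR-within-a-group rule fits in a few lines. This is an illustrative Python sketch of the semantics, not the viewer's actual implementation (which lives in templates/filter_logic.js):

```python
def matches(event: dict, sources: set[str], tags: set[str]) -> bool:
    """OR within each group, AND across groups; an empty group matches everything."""
    source_ok = not sources or event["source"] in sources
    tag_ok = not tags or bool(tags & set(event["tags"]))
    return source_ok and tag_ok

# ?sources=Partiful&tags=Dating -> Partiful events tagged Dating, not the union
events = [
    {"source": "Partiful", "tags": ["Dating", "Social"]},
    {"source": "Partiful", "tags": ["Party"]},
    {"source": "RA", "tags": ["Dating"]},
]
hits = [e for e in events if matches(e, {"Partiful"}, {"Dating"})]
```

Only the first event survives: the second fails the tag group, the third fails the source group.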

Example — EDM events in NYC on Friday, May 8:

https://local-events.wolff.sh/nyc/?date=2026-05-08&tags=EDM

Sources

Sources are listed per-city in cities/<id>.sources.json. Current inventory:

Shared across cities (NYC + Seattle):

| Source | Method | What It Covers |
| --- | --- | --- |
| Resident Advisor | GraphQL API | EDM / club events |
| Dice.fm | `__NEXT_DATA__` extraction (per-city URL + venue-city-id filter) | EDM / live music |
| Eventbrite | ld+json schema.org parsing | Nightlife, EDM, food, comedy (+ NYC-only: lectures, dating categories) |
| Meetup | shot-scraper | Local in-person meetups |

NYC-only:

| Source | Method | What It Covers |
| --- | --- | --- |
| Elsewhere | `__NEXT_DATA__` extraction | Elsewhere Brooklyn (all rooms) |
| Partiful | `__NEXT_DATA__` extraction | Social events, parties, pop-ups |
| Brooklyn Storehouse | shot-scraper | Warehouse club nights |
| Basement NY | shot-scraper (two-hop) | Underground club |
| SILO Brooklyn | shot-scraper | Club nights |
| Mission NYC | shot-scraper | Manhattan club |
| House of Yes | shot-scraper | Parties / performance |
| Film Forum | custom HTML parse | Arthouse cinema Q&As and series |
| Metrograph | custom HTML parse | Arthouse cinema (filtered to "special" screenings) |
| Nitehawk Williamsburg | custom HTML parse | Arthouse cinema |

Seattle-only:

| Source | Method | What It Covers |
| --- | --- | --- |
| 19hz | regional listing scrape (filtered by `nineteenhz_locations` / `nineteenhz_venues`) | Seattle-area EDM aggregator — covers Kremwerk, Timbre Room, Neumos, etc. without per-venue scrapers |

Setup

Requirements

  • Python 3.10+
  • Chromium (installed automatically by Playwright via shot-scraper)
  • A Google Maps API key (optional, for transit times)

Quick start

git clone https://github.com/MatthewWolff/local-events.git
cd local-events
bash setup.sh

The interactive setup walks you through Python dependencies, API keys, location config, a test run, and optionally AWS infrastructure.

Manual install

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
shot-scraper install  # downloads Chromium
cp .env.example .env  # then edit with your values

Configure

Copy the example env file and fill in your values:

cp .env.example .env

Edit .env:

GOOGLE_MAPS_API_KEY=your-key-here
HOME_ADDRESS_NYC=123 Your St, Brooklyn, NY 11211
WORK_ADDRESS_NYC=456 Office Ave, Manhattan, NY 10001
# HOME_ADDRESS_SEATTLE=...     # optional — enables the transit column for Seattle

Addresses are per-city and keyed by the uppercased city id: HOME_ADDRESS_<CITY> / WORK_ADDRESS_<CITY>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored (no fallback). A city without a configured home address renders with no transit UI at all.

WORK_ADDRESS_<CITY> flags events that are closer to work on weekday evenings. Both are optional.

GOOGLE_MAPS_API_KEY requires the Distance Matrix API enabled in your Google Cloud project. Without it, everything works except transit times.

Optional settings

These go in .env or are passed as CLI flags:

| Variable | Default | Description |
| --- | --- | --- |
| `DATE_RANGE_DAYS` | 14 | How many days ahead to look |
| `MAX_TRANSIT_MINUTES` | 45 | Hide events farther than this |
| `CACHE_DIR` | ./cache | Root for cache files; writes land in cache/<city>/ per city |
| `OUTPUT_DIR` | ./output | Root for generated HTML; writes land in output/<city>/ per city |
| `LOG_DIR` | ./logs | Structured JSON logs |
| `SHOT_SCRAPER_TIMEOUT` | 30 | Timeout (seconds) for headless browser scrapes |
| `GEMINI_API_KEY` | (none) | Gemini API key for LLM tag enrichment (optional) |

Usage

By default commands run against NYC. Pass --city=<id> to target a different city.

# Full run for NYC (default) — fetches, generates HTML, opens in browser
python fetch_events.py

# Same, for Seattle
python fetch_events.py --city=seattle

# No browser, verbose logging to console
python fetch_events.py --no-open --verbose

# Bypass all caches (re-scrape everything)
python fetch_events.py --force

# Bypass cache for one source only
python fetch_events.py --force-source=ra

# Smoke test all sources
python fetch_events.py --health-check

# Refresh sold-out status only (no re-scraping)
python fetch_events.py --status-only --no-open

# Custom date range
python fetch_events.py --date-from 2026-05-01 --date-to 2026-05-14

# Also dump raw JSON
python fetch_events.py --json events.json

Scraping Patterns

| Method | How it works | Sources |
| --- | --- | --- |
| `graphql` | POST to ra.co/graphql, area configured per-city | RA |
| `next_data` | Extract `__NEXT_DATA__` JSON from server-rendered pages | Dice, Elsewhere, Partiful |
| `eventbrite_listing` | Scrape the listing page, extract event URLs, parse schema.org JSON from each | Eventbrite categories |
| `shot_scraper` | Headless Chromium renders the page; `document.body.innerText` is parsed | Storehouse, SILO, Mission, House of Yes, Meetup |
| `shot_scraper_twohop` | Headless browser gets event URLs first, then fetches each page | Basement NY |
| `ldjson_direct` | Fetch a single event-detail page and parse schema.org ld+json | (generic fallback) |
| `19hz` | Fetch a regional 19hz listing and filter by city-scoped allowlists | Seattle 19hz |
| `filmforum` / `metrograph` / `nitehawk` | Venue-specific HTML parsers for arthouse cinema calendars | Film Forum, Metrograph, Nitehawk |

Caching

Each source has a TTL defined in cities/<id>.sources.json (typically 2-6 hours). Cached responses are stored as JSON under cache/<id>/ with a version field — bumping CACHE_VERSION in fetch_events.py invalidates all caches. Results below min_expected_events are not cached to avoid persisting render failures.

Each city runs on its own schedule (typically twice a day, ~12h apart in local time) — all TTLs expire between runs, so every scheduled run fetches fresh data. The S3 cache/<city>/ prefix persists cache across Lambda cold starts and keeps cities isolated (deleting one city's cache doesn't affect the other).

Transit times are cached separately for 7 days since venue locations don't change often.

Interest Scoring

Events are tagged and scored based on keyword matching against event names and venues. The INTEREST_SCORES dict at the top of fetch_events.py defines weights:

  • Genre tags (Techno, House, DnB): matched by artist/venue keywords in GENRE_RULES
  • Culture tags (Japanese, Italian, etc.): matched by food/drink keywords in CULTURE_RULES
  • Activity tags (Science, Comedy, Dating, etc.): matched by event type
  • Vibe tags (Date Night, Social, Solo-Friendly): inferred from keywords and event type

Events with only generic tags (Social, Party) are sent to Gemini 2.5 Flash-Lite for richer classification. This is a fallback -- keyword rules handle the majority of events, and the system works without a Gemini key.

Edit these dicts to match your own interests. Higher scores float to the top within each date.
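The rule-based pass reduces to regex lookups over name + venue, then a max over the matched tags' weights. An illustrative sketch — the real INTEREST_SCORES and GENRE_RULES are much larger, and the keywords below are made up:

```python
import re

INTEREST_SCORES = {"Techno": 9, "House": 7, "Comedy": 5, "Party": 2}  # illustrative weights
GENRE_RULES = {  # tag -> pattern matched against "name venue"
    "Techno": r"\btechno\b|tresor",
    "House": r"\bhouse\b",
    "Comedy": r"\bcomedy\b|stand.?up",
}

def apply_tags(event: dict) -> dict:
    haystack = f"{event['name']} {event['venue']}".lower()
    tags = [tag for tag, pat in GENRE_RULES.items() if re.search(pat, haystack)]
    event["tags"] = tags or ["Party"]  # generic fallback, candidate for LLM enrichment
    event["score"] = max(INTEREST_SCORES.get(t, 0) for t in event["tags"])
    return event
```

Events that fall through to the generic fallback are exactly the ones the Gemini enrichment step picks up.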

Tuning scores interactively

A small local web app at tools/score_tuner.py lets you drag tags to reorder them — top of the list becomes the highest score, scores auto-distribute linearly from 10 down to 1, and clicking Save rewrites INTEREST_SCORES in fetch_events.py.

python tools/score_tuner.py
# opens http://127.0.0.1:5001

The UI shuffles tags on load so you rank from scratch rather than anchoring on the current order. Great for forking this repo and dialing in your own preferences without reading any code.

Adding a Source

Add an entry to the city's cities/<id>.sources.json:

{
  "id": "my-venue",
  "name": "My Venue",
  "method": "shot_scraper",
  "url": "https://myvenue.com/events",
  "venue_address": "123 Street, Brooklyn, NY 11211",
  "ttl_seconds": 21600,
  "type": "edm",
  "min_expected_events": 3,
  "enabled": true
}

For shot_scraper sources, you also need a parser function in fetch_events.py and an entry in the SHOT_PARSERS dict. The parser receives the raw document.body.innerText output and extracts events from it.
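A shot_scraper parser is just a text-to-events function. A hedged sketch of the shape — the parser name, line format, and event fields here are invented for illustration; real venue pages render very differently:

```python
import re

def parse_my_venue_text(text: str) -> list[dict]:
    """Parse lines like 'FRI MAY 8 - Techno Night' out of rendered innerText."""
    events = []
    for line in text.splitlines():
        m = re.match(r"(MON|TUE|WED|THU|FRI|SAT|SUN)\s+(\w+ \d+)\s+-\s+(.+)",
                     line.strip())
        if m:
            events.append({"date_text": m.group(2), "name": m.group(3).strip()})
    return events

# Registered by source id so fetch_events.py can dispatch to it:
SHOT_PARSERS = {"my-venue": parse_my_venue_text}
```

Because innerText is unstructured, these parsers tend to be defensive: anchor on day-of-week or date tokens and ignore everything else (nav, footers, cookie banners).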

Supported methods:

  • graphql -- POST to a GraphQL endpoint (currently RA-specific)
  • next_data -- extract __NEXT_DATA__ JSON from server-rendered pages
  • eventbrite_listing -- scrape Eventbrite listing pages, then parse ld+json from each event
  • shot_scraper -- headless browser renders the page, then text is parsed
  • shot_scraper_twohop -- headless browser gets event URLs first, then fetches each page

Infrastructure

All AWS infrastructure is managed by Terraform in infra/. The scraper runs on Lambda with Chromium for full headless browser support.

Full scrape (7am + 7pm ET)
  → Lambda scrapes all sources, tags, Gemini enrichment, transit
  → output uploaded to S3 → CloudFront invalidation

Status refresh (every 2 hours, 9am-9pm ET)
  → Lambda loads cached events, refreshes sold-out status
  → reapplies tags + transit from cache (no re-scraping)
  → ~5 seconds vs ~55 seconds for full scrape

AWS resources

| Resource | Detail |
| --- | --- |
| S3 bucket | local-events-wolff-sh (private, OAC) |
| CloudFront | Distribution with ACM cert for local-events.wolff.sh |
| Lambda | local-events-scraper (container image, 3 GB RAM, 15 min timeout) |
| ECR | local-events-scraper repository |
| EventBridge | Full scrape 7am + 7pm ET; status refresh every 2h, 9am-9pm ET |
| SSM | Secrets under the /local-events/ prefix |

Deploying infrastructure

cd infra
terraform init
terraform apply

ACM certificate requires a DNS validation CNAME in name.com (output after first apply).

CI/CD

Set these GitHub repository secrets (Settings > Secrets > Actions):

| Secret | Description |
| --- | --- |
| `AWS_ACCESS_KEY_ID` | IAM user access key with admin permissions |
| `AWS_SECRET_ACCESS_KEY` | IAM user secret key |
| `GOOGLE_MAPS_API_KEY` | Google Distance Matrix API key |
| `HOME_ADDRESS` | Home address for transit calculations |
| `WORK_ADDRESS` | Work address |
| `GEMINI_API_KEY` | Gemini API key for LLM tag enrichment |
| `ALARM_EMAIL` | Email for CloudWatch alarm notifications |

Pushing to mainline auto-deploys via GitHub Actions:

  1. terraform apply — deploys any infrastructure changes
  2. Docker build + push to ECR
  3. Update Lambda function code
  4. Invoke Lambda to regenerate the site

Triggers on changes to: fetch_events.py, lambda_handler.py, Dockerfile, requirements.txt, cities/**, templates/**, infra/**.

For manual deploys: bash deploy/build-and-push.sh

Manual Lambda invoke

aws lambda invoke --function-name local-events-scraper /tmp/response.json && cat /tmp/response.json

Monitoring

| Resource | Detail |
| --- | --- |
| Status page | local-events.wolff.sh/status.html |
| CloudWatch dashboard | LocalEvents in the AWS console |
| Metrics | Namespace LocalEvents — EventCount, SourceError, TotalEvents, RunDuration, SourcesOk, SourcesFailed. Every metric carries a City dimension (per-source metrics also carry Source). |
| Alarms | run-failed, low-events (<50), high-failures (4+), no-invocation (24h), per-source chronic failure (RA, Dice, Partiful, Lectures on Tap) — all notify via SNS email |

# CloudWatch logs
aws logs tail /aws/lambda/local-events-scraper --follow

# Last run manifest
aws s3 cp s3://local-events-wolff-sh/last_run.json - | python -m json.tool

Local Development

Running locally

python fetch_events.py --no-open --verbose   # all sources
python fetch_events.py --force               # bypass cache
python fetch_events.py --health-check        # smoke test

Tests

Both Python (pytest) and JavaScript (node --test) suites run through a single entry point:

./test.sh            # unit tests (py + js, parallel)
./test.sh py         # Python unit tests only
./test.sh js         # JavaScript unit tests only
./test.sh coverage   # pytest with coverage (add `html` for drilldown)
./test.sh integration  # slow E2E: real browser + full pipeline run

Unit tests run on every commit via the pre-commit hook. Integration tests run in CI before deploy — they're slower (~5s for 15 tests) because they launch a real Chromium via shot-scraper and exercise run() / run_status_only() end-to-end.

Coverage on the tested logic surface is 97.8% lines / 92.4% branches. The overall file coverage is ~52% because network-facing fetchers (RA GraphQL, Dice/Elsewhere __NEXT_DATA__, Eventbrite listings, Partiful, the shot-scraper venues) and CLI orchestration are deliberately not unit-tested — mocking the whole HTTP stack would test the mocks, not the code.

What IS covered:

  • Dedup pipeline — normalize_name, normalize_venue, names_overlap, detail_score, merge_events, deduplicate
  • Tagging — _source_tag, apply_tags across all genre/culture/vibe rules
  • Partiful classification — _classify_partiful_type and section fallbacks
  • Event validation — _check_event_date, validate_events
  • Cache — read_cache (TTL, version, date-range), write_cache (atomic rename)
  • Parsers — SILO, Mission, House of Yes, Meetup, Basement, Storehouse text parsers
  • Gemini enrichment — response parsing, vocab enforcement, caching, error paths
  • HTML generation — placeholder substitution, filter-logic inlining, field remapping
  • Viewer filter + URL state — source/tag AND semantics, encode/decode roundtrip

A pre-commit hook (.githooks/pre-commit) runs ./test.sh before every commit. setup.sh enables it via git config core.hooksPath .githooks. Bypass in emergencies with git commit --no-verify.

Running with Docker (matches Lambda environment)

docker build -t local-events-scraper .
docker run --rm -e HOME=/tmp -e PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
  --entrypoint bash local-events-scraper -c \
  'shot-scraper javascript https://example.com "document.title"'

Project Structure

.
├── fetch_events.py              # Single script — fetchers, parsers, scoring, HTML generation
├── lambda_handler.py            # Lambda entry point — SSM secrets, S3 cache, CF invalidation
├── Dockerfile                   # Container image for Lambda (Playwright + Chromium)
├── cities/
│   ├── nyc.json                 # Per-city config (metadata, RA area, etc.)
│   └── nyc.sources.json         # Per-city source definitions
├── requirements.txt             # Python dependencies
├── og-image.jpg                 # OpenGraph social share image
├── CLAUDE.md                    # Development gotchas for AI-assisted coding
├── setup.sh                     # Interactive setup script
├── .env.example                 # Environment variable template
├── templates/
│   ├── events-viewer.html       # HTML template with embedded JS viewer
│   ├── filter_logic.js          # Filter + URL-state module, inlined into the template at generation
│   └── status.html              # Source health dashboard (reads last_run.json)
├── tests/                        # pytest + node --test, run via ./test.sh
│   ├── conftest.py
│   ├── filter_logic.test.js      # viewer filter + URL state (JS)
│   ├── test_cache.py
│   ├── test_dedup.py
│   ├── test_gemini.py
│   ├── test_html_generation.py
│   ├── test_parsers.py
│   ├── test_partiful_classifier.py
│   ├── test_tags.py
│   ├── test_validation.py
│   └── integration/              # opt-in via ./test.sh integration
│       ├── conftest.py
│       ├── test_viewer_browser.py    # shot-scraper drives real HTML
│       ├── test_status_only.py       # cache → status-only → HTML + manifest
│       └── test_full_pipeline.py     # fetch_source mocked, full run() E2E
├── .githooks/
│   └── pre-commit                # runs ./test.sh before every commit
├── test.sh                       # single entry point for all tests
├── infra/                       # Terraform infrastructure (S3 backend)
│   ├── main.tf                  # Provider, backend
│   ├── s3.tf                    # S3 bucket + policy
│   ├── cloudfront.tf            # CloudFront distribution + OAC
│   ├── lambda.tf                # Lambda function + ECR
│   ├── eventbridge.tf           # EventBridge schedule
│   ├── iam.tf                   # Lambda execution role
│   ├── ssm.tf                   # SSM parameters for secrets
│   ├── acm.tf                   # ACM certificate
│   ├── monitoring.tf            # CloudWatch dashboard, alarms, SNS
│   ├── variables.tf             # Input variables
│   └── outputs.tf               # Output values
├── deploy/
│   └── build-and-push.sh        # Build container, push to ECR, update Lambda
├── docs/
│   └── index.html               # GitHub Pages redirect to local-events.wolff.sh
└── .github/workflows/
    └── deploy.yml               # CI/CD: Terraform + Docker + Lambda deploy

Adding a City

The pipeline is multi-city — NYC and Seattle are both supported out of the box. Adding a third city is a pure configuration change; you should not need to edit fetch_events.py unless you're wiring up a new venue-specific parser.

1. City config (cities/<id>.json)

Create cities/<id>.json. See cities/nyc.json / cities/seattle.json for the full schema. Key fields:

| Field | Example | Purpose |
| --- | --- | --- |
| `id` | `"la"` | Must match the filename stem. Lowercase. |
| `name` | `"LA"` | Short label shown in the nav bar. |
| `header_location` | `"Los Angeles"` | Subhead under the `<h1>`. |
| `site_title` / `og_title` / `meta_description` / `og_description` | `"LA Events What's On"` | Page title and OpenGraph/social metadata. |
| `gtag_page_title` | `"la-events"` | GA4 page_title — must be distinct per city. |
| `timezone` | `"America/Los_Angeles"` | Used by Dice date conversion. |
| `ra_area_id` | `23` | Find by inspecting RA's GraphQL areas.eq request in devtools. |
| `ra_referer` | `"https://ra.co/events/us/losangeles"` | Matches the RA area. |
| `eventbrite_allowed_regions` | `["CA"]` | Events outside these states are dropped. |
| `dice_url` | `"https://dice.fm/browse?location=los-angeles"` | Per-city Dice listing. |
| `partiful_url` | `null` | Set to a real URL if Partiful has a city page, else null. |
| `venue_address_overrides` | `{"TBA - ...": "real address"}` | Address fixes for TBA-named venues. |
| `dedup_stop_words_extra` | `["la", "los", "angeles"]` | City-scoped tokens stripped during name dedup. |
| `favorite_venues` | `[]` | Venues auto-tagged "Venue" even when sourced from RA/Dice. |
| `venue_sources` | `[...]` | Source names that scrape a single venue (get the "Venue" source tag). |
| `film_sources` | `[]` | Arthouse cinema sources — get the "Film" tag automatically. |

2. Source list (cities/<id>.sources.json)

Mirror the schema in cities/nyc.sources.json. Each entry has id, name, method, url, ttl_seconds, type, min_expected_events, and enabled. The method must be a key in FETCHER_MAP (see fetch_events.py).

3. Location config (.env)

If you want the transit column/filter/legend to render for the new city, set its home address in .env:

HOME_ADDRESS_LA=123 Main St, Los Angeles, CA
WORK_ADDRESS_LA=...

Variable names must use the suffix _<CITY_UPPER>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored. Without a home address, the transit UI is hidden entirely for that city.

4. Interest scoring (fetch_events.py)

These are global across cities (they describe personal taste, not city features). Customize to match your interests:

  • INTEREST_SCORES — weight each tag (higher = more prominent)
  • GENRE_RULES — keyword patterns for music genres
  • CULTURE_RULES — keyword patterns for cuisines/cultures
  • VIBE_KEYWORDS_DATE / VIBE_KEYWORDS_FRIENDS — vibe detection keywords

5. Infrastructure (infra/)

  • Append your city id to var.cities default in infra/variables.tf.
  • terraform apply — this adds a per-city EventBridge schedule, rebuilds the CloudFront function to route /<id>, and adds per-city alarms keyed on the City dimension.

6. Venue parsers (only if adding a new venue source)

If the new city uses a venue site that doesn't match any existing fetcher, add a parser function in fetch_events.py and register it in SHOT_PARSERS (keyed by the source id). See parse_silo_text / parse_storehouse_text for examples.

7. Container timezone (Dockerfile)

Update TZ=America/New_York to your city's timezone. This affects how venue websites render event dates in the headless browser:

ENV TZ=America/Los_Angeles  # e.g. for LA

Also update the EventBridge schedule timezone in infra/eventbridge.tf:

schedule_expression_timezone = "America/Los_Angeles"

8. HTML template

  • Update the title, description, and OG metadata in templates/events-viewer.html
  • Update the header location text ("Williamsburg, BK")

License

MIT
