
NYC Events Aggregator

A multi-city event aggregator. Scrapes multiple sources — club listings, Eventbrite categories, venue calendars, regional aggregators, and meetup pages — deduplicates, tags, scores by personal interest, calculates transit times, and generates a self-contained HTML viewer per city.

Ships with two cities: NYC (at /nyc) and Seattle (at /seattle), both served from the same Lambda + CloudFront. Adding a third city is a config change; see Adding a City.

Built for someone who wants one page that answers "what's happening this week?" across nightlife, food, lectures, dating events, comedy, and more.

Table of Contents

  • How It Works
  • Sources
  • Setup
  • Usage
  • Scraping Patterns
  • Caching
  • Interest Scoring
  • Adding a Source
  • Infrastructure
  • CI/CD
  • Monitoring
  • Local Development
  • Project Structure
  • Adding a City
  • License

How It Works

cities/<id>.json      Per-city config (metadata, RA area, region filter, etc.)
cities/<id>.sources.json  Per-city source definitions (URLs, scrape method, TTL, type)
       |
  fetch_events.py     Fetches in parallel, caches per-source, deduplicates
       |
  tags + scoring      Pattern-matches genres, cuisines, vibes -> weighted interest score
       |
  Google Maps API     Transit time from home to each venue (optional, cached 7 days)
       |
  events-viewer.html  Injects JSON into template -> output/index.html

The HTML output is fully self-contained (no external JS dependencies) with filtering by date, tag, source, and transit time. It works offline once generated.
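The dedup step in fetch_events.py collapses the same event seen by multiple sources into one record, keeping the most detailed copy. A minimal sketch of the idea — the function and field names here are illustrative, not the actual fetch_events.py API:

```python
import re
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def dedup_key(event: dict) -> tuple:
    """Events with the same normalized name, venue, and date collapse to one."""
    return (normalize(event["name"]), normalize(event["venue"]), event["date"])

def deduplicate(events: list[dict]) -> list[dict]:
    buckets = defaultdict(list)
    for e in events:
        buckets[dedup_key(e)].append(e)
    # Keep the most detailed copy from each bucket (here: most populated fields)
    return [max(group, key=lambda e: sum(1 for v in e.values() if v))
            for group in buckets.values()]
```

The real pipeline is fancier (the tests mention names_overlap, detail_score, and merge_events, which suggests fuzzy matching and field-level merging rather than a simple winner-takes-all), but the key/bucket/pick-best shape is the core of any dedup pass.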

Shareable filter URLs

The viewer encodes the current filter state in the URL so any view can be shared as a link. Hit the Share button to copy the current URL to the clipboard. Supported query params:

| Param | Example | Meaning |
| --- | --- | --- |
| `date` | `2026-05-08` | Restrict to a single date |
| `transit` | `30` | Max transit minutes (10/20/30/45) |
| `type` | `edm` | Event type (edm, dating, food-drink, lecture, social, concert, party, comedy, art) |
| `status` | `avail` | `avail` hides sold-out events; `sold` shows only sold-out events |
| `q` | `tresor` | Free-text search over name/venue/tags |
| `tags` | `EDM,Techno` | Include events tagged with ANY of these (comma-separated) |
| `xtags` | `Dating` | Exclude events with ANY of these tags |
| `sources` | `Partiful,Venue` | Include only events from these sources |
| `xsources` | `RA` | Exclude events from these sources (default: RA) |

Source filters (Partiful, RA, etc.) and tag filters (EDM, Dating, etc.) are ANDed across groups and ORed within each, so ?sources=Partiful&tags=Dating returns Partiful events that are tagged Dating — not the union.
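The AND-across-groups / OR-within-a-group rule fits in a few lines. This is an illustrative Python sketch of the semantics, not the viewer's actual implementation (which lives in templates/filter_logic.js):

```python
def matches(event: dict, sources: set[str], tags: set[str]) -> bool:
    """OR within each group, AND across groups; an empty group matches everything."""
    source_ok = not sources or event["source"] in sources
    tag_ok = not tags or bool(tags & set(event["tags"]))
    return source_ok and tag_ok

# ?sources=Partiful&tags=Dating -> Partiful events tagged Dating, not the union
events = [
    {"source": "Partiful", "tags": ["Dating", "Social"]},
    {"source": "Partiful", "tags": ["Party"]},
    {"source": "RA", "tags": ["Dating"]},
]
hits = [e for e in events if matches(e, {"Partiful"}, {"Dating"})]
```

Only the first event survives: the second fails the tag group, the third fails the source group.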

Example — EDM events in NYC on Friday, May 8:

https://local-events.wolff.sh/nyc/?date=2026-05-08&tags=EDM

Sources

Sources are listed per-city in cities/<id>.sources.json. Current inventory:

Shared across cities (NYC + Seattle):

| Source | Method | What It Covers |
| --- | --- | --- |
| Resident Advisor | GraphQL API | EDM / club events |
| Dice.fm | `__NEXT_DATA__` extraction (per-city URL + venue-city-id filter) | EDM / live music |
| Eventbrite | ld+json schema.org parsing | Nightlife, EDM, food, comedy (+ NYC-only: lectures, dating categories) |
| Meetup | shot-scraper | Local in-person meetups |

NYC-only:

| Source | Method | What It Covers |
| --- | --- | --- |
| Elsewhere | `__NEXT_DATA__` extraction | Elsewhere Brooklyn (all rooms) |
| Partiful | `__NEXT_DATA__` extraction | Social events, parties, pop-ups |
| Brooklyn Storehouse | shot-scraper | Warehouse club nights |
| Basement NY | shot-scraper (two-hop) | Underground club |
| SILO Brooklyn | shot-scraper | Club nights |
| Mission NYC | shot-scraper | Manhattan club |
| House of Yes | shot-scraper | Parties / performance |
| Film Forum | custom HTML parse | Arthouse cinema Q&As and series |
| Metrograph | custom HTML parse | Arthouse cinema (filtered to "special" screenings) |
| Nitehawk Williamsburg | custom HTML parse | Arthouse cinema |

Seattle-only:

| Source | Method | What It Covers |
| --- | --- | --- |
| 19hz | regional listing scrape (filtered by `nineteenhz_locations` / `nineteenhz_venues`) | Seattle-area EDM aggregator — covers Kremwerk, Timbre Room, Neumos, etc. without per-venue scrapers |

Setup

Requirements

  • Python 3.10+
  • Chromium (installed automatically by Playwright via shot-scraper)
  • A Google Maps API key (optional, for transit times)

Quick start

git clone https://github.com/MatthewWolff/local-events.git
cd local-events
bash setup.sh

The interactive setup walks you through Python dependencies, API keys, location config, a test run, and optionally AWS infrastructure.

Manual install

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
shot-scraper install  # downloads Chromium
cp .env.example .env  # then edit with your values

Configure

Copy the example env file and fill in your values:

cp .env.example .env

Edit .env:

GOOGLE_MAPS_API_KEY=your-key-here
HOME_ADDRESS_NYC=123 Your St, Brooklyn, NY 11211
WORK_ADDRESS_NYC=456 Office Ave, Manhattan, NY 10001
# HOME_ADDRESS_SEATTLE=...     # optional — enables the transit column for Seattle

Addresses are per-city and keyed by the uppercased city id: HOME_ADDRESS_<CITY> / WORK_ADDRESS_<CITY>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored (no fallback). A city without a configured home address renders with no transit UI at all.

WORK_ADDRESS_<CITY> flags events that are closer to work on weekday evenings. Both are optional.

GOOGLE_MAPS_API_KEY requires the Distance Matrix API enabled in your Google Cloud project. Without it, everything works except transit times.

Optional settings

These go in .env or are passed as CLI flags:

| Variable | Default | Description |
| --- | --- | --- |
| `DATE_RANGE_DAYS` | 14 | How many days ahead to look |
| `MAX_TRANSIT_MINUTES` | 45 | Hide events farther than this |
| `CACHE_DIR` | ./cache | Root for cache files; writes land in cache/<city>/ per city |
| `OUTPUT_DIR` | ./output | Root for generated HTML; writes land in output/<city>/ per city |
| `LOG_DIR` | ./logs | Structured JSON logs |
| `SHOT_SCRAPER_TIMEOUT` | 30 | Timeout (seconds) for headless browser scrapes |
| `GEMINI_API_KEY` | (none) | Gemini API key for LLM tag enrichment (optional) |

Usage

By default commands run against NYC. Pass --city=<id> to target a different city.

# Full run for NYC (default) — fetches, generates HTML, opens in browser
python fetch_events.py

# Same, for Seattle
python fetch_events.py --city=seattle

# No browser, verbose logging to console
python fetch_events.py --no-open --verbose

# Bypass all caches (re-scrape everything)
python fetch_events.py --force

# Bypass cache for one source only
python fetch_events.py --force-source=ra

# Smoke test all sources
python fetch_events.py --health-check

# Refresh sold-out status only (no re-scraping)
python fetch_events.py --status-only --no-open

# Custom date range
python fetch_events.py --date-from 2026-05-01 --date-to 2026-05-14

# Also dump raw JSON
python fetch_events.py --json events.json

Scraping Patterns

| Method | How it works | Sources |
| --- | --- | --- |
| `graphql` | POST to ra.co/graphql, area configured per-city | RA |
| `next_data` | Extract `__NEXT_DATA__` JSON from server-rendered pages | Dice, Elsewhere, Partiful |
| `eventbrite_listing` | Scrape the listing page, extract event URLs, parse schema.org JSON from each | Eventbrite categories |
| `shot_scraper` | Headless Chromium renders the page; `document.body.innerText` is parsed | Storehouse, SILO, Mission, House of Yes, Meetup |
| `shot_scraper_twohop` | Headless browser gets event URLs first, then fetches each page | Basement NY |
| `ldjson_direct` | Fetch a single event-detail page and parse schema.org ld+json | (generic fallback) |
| `19hz` | Fetch a regional 19hz listing and filter by city-scoped allowlists | Seattle 19hz |
| `filmforum` / `metrograph` / `nitehawk` | Venue-specific HTML parsers for arthouse cinema calendars | Film Forum, Metrograph, Nitehawk |

Caching

Each source has a TTL defined in cities/<id>.sources.json (typically 2-6 hours). Cached responses are stored as JSON under cache/<id>/ with a version field — bumping CACHE_VERSION in fetch_events.py invalidates all caches. Results below min_expected_events are not cached to avoid persisting render failures.

Each city runs on its own schedule (typically twice a day, ~12h apart in local time) — all TTLs expire between runs, so every scheduled run fetches fresh data. The S3 cache/<city>/ prefix persists cache across Lambda cold starts and keeps cities isolated (deleting one city's cache doesn't affect the other).

Transit times are cached separately for 7 days since venue locations don't change often.

Interest Scoring

Events are tagged and scored based on keyword matching against event names and venues. The INTEREST_SCORES dict at the top of fetch_events.py defines weights:

  • Genre tags (Techno, House, DnB): matched by artist/venue keywords in GENRE_RULES
  • Culture tags (Japanese, Italian, etc.): matched by food/drink keywords in CULTURE_RULES
  • Activity tags (Science, Comedy, Dating, etc.): matched by event type
  • Vibe tags (Date Night, Social, Solo-Friendly): inferred from keywords and event type

Events with only generic tags (Social, Party) are sent to Gemini 2.5 Flash-Lite for richer classification. This is a fallback -- keyword rules handle the majority of events, and the system works without a Gemini key.

Edit these dicts to match your own interests. Higher scores float to the top within each date.
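The rule-based pass reduces to regex lookups over name + venue, then a max over the matched tags' weights. An illustrative sketch — the real INTEREST_SCORES and GENRE_RULES are much larger, and the keywords below are made up:

```python
import re

INTEREST_SCORES = {"Techno": 9, "House": 7, "Comedy": 5, "Party": 2}  # illustrative weights
GENRE_RULES = {  # tag -> pattern matched against "name venue"
    "Techno": r"\btechno\b|tresor",
    "House": r"\bhouse\b",
    "Comedy": r"\bcomedy\b|stand.?up",
}

def apply_tags(event: dict) -> dict:
    haystack = f"{event['name']} {event['venue']}".lower()
    tags = [tag for tag, pat in GENRE_RULES.items() if re.search(pat, haystack)]
    event["tags"] = tags or ["Party"]  # generic fallback, candidate for LLM enrichment
    event["score"] = max(INTEREST_SCORES.get(t, 0) for t in event["tags"])
    return event
```

Events that fall through to the generic fallback are exactly the ones the Gemini enrichment step picks up.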

Tuning scores interactively

A small local web app at tools/score_tuner.py lets you drag tags to reorder them — top of the list becomes the highest score, scores auto-distribute linearly from 10 down to 1, and clicking Save rewrites INTEREST_SCORES in fetch_events.py.

python tools/score_tuner.py
# opens http://127.0.0.1:5001

The UI shuffles tags on load so you rank from scratch rather than anchoring on the current order. Great for forking this repo and dialing in your own preferences without reading any code.

Adding a Source

Add an entry to the city's cities/<id>.sources.json:

{
  "id": "my-venue",
  "name": "My Venue",
  "method": "shot_scraper",
  "url": "https://myvenue.com/events",
  "venue_address": "123 Street, Brooklyn, NY 11211",
  "ttl_seconds": 21600,
  "type": "edm",
  "min_expected_events": 3,
  "enabled": true
}

For shot_scraper sources, you also need a parser function in fetch_events.py and an entry in the SHOT_PARSERS dict. The parser receives the raw document.body.innerText output and extracts events from it.
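A shot_scraper parser is just a text-to-events function. A hedged sketch of the shape — the parser name, line format, and event fields here are invented for illustration; real venue pages render very differently:

```python
import re

def parse_my_venue_text(text: str) -> list[dict]:
    """Parse lines like 'FRI MAY 8 - Techno Night' out of rendered innerText."""
    events = []
    for line in text.splitlines():
        m = re.match(r"(MON|TUE|WED|THU|FRI|SAT|SUN)\s+(\w+ \d+)\s+-\s+(.+)",
                     line.strip())
        if m:
            events.append({"date_text": m.group(2), "name": m.group(3).strip()})
    return events

# Registered by source id so fetch_events.py can dispatch to it:
SHOT_PARSERS = {"my-venue": parse_my_venue_text}
```

Because innerText is unstructured, these parsers tend to be defensive: anchor on day-of-week or date tokens and ignore everything else (nav, footers, cookie banners).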

Supported methods:

  • graphql -- POST to a GraphQL endpoint (currently RA-specific)
  • next_data -- extract __NEXT_DATA__ JSON from server-rendered pages
  • eventbrite_listing -- scrape Eventbrite listing pages, then parse ld+json from each event
  • shot_scraper -- headless browser renders the page, then text is parsed
  • shot_scraper_twohop -- headless browser gets event URLs first, then fetches each page

Infrastructure

All AWS infrastructure is managed by Terraform in infra/. The scraper runs on Lambda with Chromium for full headless browser support.

Full scrape (7am + 7pm ET)
  → Lambda scrapes all sources, tags, Gemini enrichment, transit
  → output uploaded to S3 → CloudFront invalidation

Status refresh (every 2 hours, 9am-9pm ET)
  → Lambda loads cached events, refreshes sold-out status
  → reapplies tags + transit from cache (no re-scraping)
  → ~5 seconds vs ~55 seconds for full scrape

AWS resources

| Resource | Detail |
| --- | --- |
| S3 bucket | local-events-wolff-sh (private, OAC) |
| CloudFront | Distribution with ACM cert for local-events.wolff.sh |
| Lambda | local-events-scraper (container image, 3 GB RAM, 15 min timeout) |
| ECR | local-events-scraper repository |
| EventBridge | Full scrape 7am + 7pm ET; status refresh every 2h, 9am-9pm ET |
| SSM | Secrets under the /local-events/ prefix |

Deploying infrastructure

cd infra
terraform init
terraform apply

ACM certificate requires a DNS validation CNAME in name.com (output after first apply).

CI/CD

Set these GitHub repository secrets (Settings > Secrets > Actions):

| Secret | Description |
| --- | --- |
| `AWS_ACCESS_KEY_ID` | IAM user access key with admin permissions |
| `AWS_SECRET_ACCESS_KEY` | IAM user secret key |
| `GOOGLE_MAPS_API_KEY` | Google Distance Matrix API key |
| `HOME_ADDRESS` | Home address for transit calculations |
| `WORK_ADDRESS` | Work address |
| `GEMINI_API_KEY` | Gemini API key for LLM tag enrichment |
| `ALARM_EMAIL` | Email for CloudWatch alarm notifications |

Pushing to mainline auto-deploys via GitHub Actions:

  1. terraform apply — deploys any infrastructure changes
  2. Docker build + push to ECR
  3. Update Lambda function code
  4. Invoke Lambda to regenerate the site

Triggers on changes to: fetch_events.py, lambda_handler.py, Dockerfile, requirements.txt, cities/**, templates/**, infra/**.

For manual deploys: bash deploy/build-and-push.sh

Manual Lambda invoke

aws lambda invoke --function-name local-events-scraper /tmp/response.json && cat /tmp/response.json

Monitoring

| Resource | Detail |
| --- | --- |
| Status page | local-events.wolff.sh/status.html |
| CloudWatch dashboard | LocalEvents in the AWS console |
| Metrics | Namespace LocalEvents — EventCount, SourceError, TotalEvents, RunDuration, SourcesOk, SourcesFailed. Every metric carries a City dimension (per-source metrics also carry Source). |
| Alarms | run-failed, low-events (<50), high-failures (4+), no-invocation (24h), per-source chronic failure (RA, Dice, Partiful, Lectures on Tap) — all notify via SNS email |

# CloudWatch logs
aws logs tail /aws/lambda/local-events-scraper --follow

# Last run manifest
aws s3 cp s3://local-events-wolff-sh/last_run.json - | python -m json.tool

Local Development

Running locally

python fetch_events.py --no-open --verbose   # all sources
python fetch_events.py --force               # bypass cache
python fetch_events.py --health-check        # smoke test

Tests

Both Python (pytest) and JavaScript (node --test) suites run through a single entry point:

./test.sh            # unit tests (py + js, parallel)
./test.sh py         # Python unit tests only
./test.sh js         # JavaScript unit tests only
./test.sh coverage   # pytest with coverage (add `html` for drilldown)
./test.sh integration  # slow E2E: real browser + full pipeline run

Unit tests run on every commit via the pre-commit hook. Integration tests run in CI before deploy — they're slower (~5s for 15 tests) because they launch a real Chromium via shot-scraper and exercise run() / run_status_only() end-to-end.

Coverage on the tested logic surface is 97.8% lines / 92.4% branches. The overall file coverage is ~52% because network-facing fetchers (RA GraphQL, Dice/Elsewhere __NEXT_DATA__, Eventbrite listings, Partiful, the shot-scraper venues) and CLI orchestration are deliberately not unit-tested — mocking the whole HTTP stack would test the mocks, not the code.

What IS covered:

  • Dedup pipeline — normalize_name, normalize_venue, names_overlap, detail_score, merge_events, deduplicate
  • Tagging — _source_tag, apply_tags across all genre/culture/vibe rules
  • Partiful classification — _classify_partiful_type and section fallbacks
  • Event validation — _check_event_date, validate_events
  • Cache — read_cache (TTL, version, date-range), write_cache (atomic rename)
  • Parsers — SILO, Mission, House of Yes, Meetup, Basement, Storehouse text parsers
  • Gemini enrichment — response parsing, vocab enforcement, caching, error paths
  • HTML generation — placeholder substitution, filter-logic inlining, field remapping
  • Viewer filter + URL state — source/tag AND semantics, encode/decode roundtrip

A pre-commit hook (.githooks/pre-commit) runs ./test.sh before every commit. setup.sh enables it via git config core.hooksPath .githooks. Bypass in emergencies with git commit --no-verify.

Running with Docker (matches Lambda environment)

docker build -t local-events-scraper .
docker run --rm -e HOME=/tmp -e PLAYWRIGHT_BROWSERS_PATH=/ms-playwright \
  --entrypoint bash local-events-scraper -c \
  'shot-scraper javascript https://example.com "document.title"'

Project Structure

.
├── fetch_events.py              # Single script — fetchers, parsers, scoring, HTML generation
├── lambda_handler.py            # Lambda entry point — SSM secrets, S3 cache, CF invalidation
├── Dockerfile                   # Container image for Lambda (Playwright + Chromium)
├── cities/
│   ├── nyc.json                 # Per-city config (metadata, RA area, etc.)
│   └── nyc.sources.json         # Per-city source definitions
├── requirements.txt             # Python dependencies
├── og-image.jpg                 # OpenGraph social share image
├── CLAUDE.md                    # Development gotchas for AI-assisted coding
├── setup.sh                     # Interactive setup script
├── .env.example                 # Environment variable template
├── templates/
│   ├── events-viewer.html       # HTML template with embedded JS viewer
│   ├── filter_logic.js          # Filter + URL-state module, inlined into the template at generation
│   └── status.html              # Source health dashboard (reads last_run.json)
├── tests/                        # pytest + node --test, run via ./test.sh
│   ├── conftest.py
│   ├── filter_logic.test.js      # viewer filter + URL state (JS)
│   ├── test_cache.py
│   ├── test_dedup.py
│   ├── test_gemini.py
│   ├── test_html_generation.py
│   ├── test_parsers.py
│   ├── test_partiful_classifier.py
│   ├── test_tags.py
│   ├── test_validation.py
│   └── integration/              # opt-in via ./test.sh integration
│       ├── conftest.py
│       ├── test_viewer_browser.py    # shot-scraper drives real HTML
│       ├── test_status_only.py       # cache → status-only → HTML + manifest
│       └── test_full_pipeline.py     # fetch_source mocked, full run() E2E
├── .githooks/
│   └── pre-commit                # runs ./test.sh before every commit
├── test.sh                       # single entry point for all tests
├── infra/                       # Terraform infrastructure (S3 backend)
│   ├── main.tf                  # Provider, backend
│   ├── s3.tf                    # S3 bucket + policy
│   ├── cloudfront.tf            # CloudFront distribution + OAC
│   ├── lambda.tf                # Lambda function + ECR
│   ├── eventbridge.tf           # EventBridge schedule
│   ├── iam.tf                   # Lambda execution role
│   ├── ssm.tf                   # SSM parameters for secrets
│   ├── acm.tf                   # ACM certificate
│   ├── monitoring.tf            # CloudWatch dashboard, alarms, SNS
│   ├── variables.tf             # Input variables
│   └── outputs.tf               # Output values
├── deploy/
│   └── build-and-push.sh        # Build container, push to ECR, update Lambda
├── docs/
│   └── index.html               # GitHub Pages redirect to local-events.wolff.sh
└── .github/workflows/
    └── deploy.yml               # CI/CD: Terraform + Docker + Lambda deploy

Adding a City

The pipeline is multi-city — NYC and Seattle are both supported out of the box. Adding a third city is a pure configuration change; you should not need to edit fetch_events.py unless you're wiring up a new venue-specific parser.

1. City config (cities/<id>.json)

Create cities/<id>.json. See cities/nyc.json / cities/seattle.json for the full schema. Key fields:

| Field | Example | Purpose |
| --- | --- | --- |
| `id` | `"la"` | Must match the filename stem. Lowercase. |
| `name` | `"LA"` | Short label shown in the nav bar. |
| `header_location` | `"Los Angeles"` | Subhead under the `<h1>`. |
| `site_title` / `og_title` / `meta_description` / `og_description` | `"LA Events What's On"` | Page title and OpenGraph/social metadata. |
| `gtag_page_title` | `"la-events"` | GA4 page_title — must be distinct per city. |
| `timezone` | `"America/Los_Angeles"` | Used by Dice date conversion. |
| `ra_area_id` | `23` | Find by inspecting RA's GraphQL areas.eq request in devtools. |
| `ra_referer` | `"https://ra.co/events/us/losangeles"` | Matches the RA area. |
| `eventbrite_allowed_regions` | `["CA"]` | Events outside these states are dropped. |
| `dice_url` | `"https://dice.fm/browse?location=los-angeles"` | Per-city Dice listing. |
| `partiful_url` | `null` | Set to a real URL if Partiful has a city page, else null. |
| `venue_address_overrides` | `{"TBA - ...": "real address"}` | Address fixes for TBA-named venues. |
| `dedup_stop_words_extra` | `["la", "los", "angeles"]` | City-scoped tokens stripped during name dedup. |
| `favorite_venues` | `[]` | Venues auto-tagged "Venue" even when sourced from RA/Dice. |
| `venue_sources` | `[...]` | Source names that scrape a single venue (get the "Venue" source tag). |
| `film_sources` | `[]` | Arthouse cinema sources — get the "Film" tag automatically. |

2. Source list (cities/<id>.sources.json)

Mirror the schema in cities/nyc.sources.json. Each entry has id, name, method, url, ttl_seconds, type, min_expected_events, and enabled. The method must be a key in FETCHER_MAP (see fetch_events.py).

3. Location config (.env)

If you want the transit column/filter/legend to render for the new city, set its home address in .env:

HOME_ADDRESS_LA=123 Main St, Los Angeles, CA
WORK_ADDRESS_LA=...

Variable names must use the suffix _<CITY_UPPER>. Bare HOME_ADDRESS / WORK_ADDRESS are silently ignored. Without a home address, the transit UI is hidden entirely for that city.

4. Interest scoring (fetch_events.py)

These are global across cities (they describe personal taste, not city features). Customize to match your interests:

  • INTEREST_SCORES — weight each tag (higher = more prominent)
  • GENRE_RULES — keyword patterns for music genres
  • CULTURE_RULES — keyword patterns for cuisines/cultures
  • VIBE_KEYWORDS_DATE / VIBE_KEYWORDS_FRIENDS — vibe detection keywords

5. Infrastructure (infra/)

  • Append your city id to var.cities default in infra/variables.tf.
  • terraform apply — this adds a per-city EventBridge schedule, rebuilds the CloudFront function to route /<id>, and adds per-city alarms keyed on the City dimension.

6. Venue parsers (only if adding a new venue source)

If the new city uses a venue site that doesn't match any existing fetcher, add a parser function in fetch_events.py and register it in SHOT_PARSERS (keyed by the source id). See parse_silo_text / parse_storehouse_text for examples.

7. Container timezone (Dockerfile)

Update TZ=America/New_York to your city's timezone. This affects how venue websites render event dates in the headless browser:

ENV TZ=America/Los_Angeles  # e.g. for LA

Also update the EventBridge schedule timezone in infra/eventbridge.tf:

schedule_expression_timezone = "America/Los_Angeles"

8. HTML template

  • Update the title, description, and OG metadata in templates/events-viewer.html
  • Update the header location text ("Williamsburg, BK")

License

MIT
