A 5-minute, game-like walkthrough that places every fan into one of 12 LA28 fan tribes, hands them a tribe-specific LA28 watchlist with Gemini-grounded reasons, drops them into an anonymous tribe forum, and shows them the other fans in their tribe — including the surprising ones who landed on different sports.
Live demo: https://tribe28-mr2rqybiwq-uc.a.run.app
Built for the Team USA x Google Cloud Hackathon. Addresses Challenge 5 — Choose Your Own, with strong overlap on Challenge 1 (Olympic/Paralympic parity) and Challenge 3 (LA28 momentum).
For a fresh local setup, start here:
cp .env.example .env
bash scripts/bootstrap.shIf this is a brand-new clone with no existing BigQuery tables or offline ML artifacts, follow the full step-by-step build sequence here:
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite, react-router-dom, inline SVG |
| Backend | FastAPI, Pydantic v2, Python 3.12, uv |
| LLM | Gemini 2.5 Pro (composer) + Gemini 2.5 Flash (critic) via Vertex AI |
| Data | BigQuery (US region), Cloud Storage (fact-pack cache) |
| Hosting | Cloud Run (us-central1), Cloud Build, Artifact Registry |
| ML | scikit-learn (KMeans, silhouette), pandas, pyarrow, BigQuery ML |
| Source | What | USA rows | Years |
|---|---|---|---|
| Kaggle — 120 Years of Olympic History | Olympic athlete-event records | 18,853 | 1896–2016 |
| Kaggle — Paralympic Dataset 1960–2018 | Paralympic medal-athlete records (downloaded as .xlsx, converted to medal_athlete.csv) |
3,059 | 1960–2018 |
| la28.org/en/games-plan/olympics.html + /paralympics.html | LA28 programme (hand-curated JSON) | 75 sports | — |
Not used: finish times, athlete names, photos, hometowns, weather data, biometrics.
Placement stored as bands only: 1 (Gold) / 2 (Silver) / 3 (Bronze) / qualified.
- 7-chapter walkthrough. A narrative quiz projects you onto six axes (precision, solo, endurance, ground, combat, equipment). Each answer animates a live dot; Chapter 4 reveals a mini-cohort of three similarly-answering anonymous fans; Chapter 6 shows an "era affinity" callout grounded in seven decades of Team USA data.
- Tribe assignment. Your 6-axis position is matched to one of 12 KMeans tribe centroids (silhouette
0.6587) trained on US-only Team USA aggregate records. - Grounded watchlist. A hybrid scorer ranks all 75 LA28 sports by affinity × archetype × momentum × tribemate signal × Paralympic-parity guard. For each top pick, Gemini 2.5 Pro writes a fact-pack-grounded reason; Gemini 2.5 Flash verifies grounding before the text ships. Deterministic template prose is the fallback if the critic rejects.
- Community view. k-anonymous BigQuery aggregates (k ≥ 10) surface tribe share %, top US states (tile-cartogram), bridge tribes (closest by axis distance with shared sports), and tribemate cards with a "surprise pick" sport.
- Tribe forum. Per-tribe anonymous threads seeded from
forum_seed.json. Anon IDs only (T01-001), NIL banlist applied to every post. - Share card. A 6-axis spider chart of your axis profile, no athletes, no event imagery.
flowchart TD
subgraph Browser["Browser (React + Vite)"]
UI["/walkthrough · /reveal · /community\n/forum · /share · /methodology"]
end
subgraph CloudRun["Cloud Run — FastAPI backend (us-central1)"]
direction TB
API["/tribe/assign\n/sport · /la28\n/walkthrough · /forum"]
SCORER["score_sports()\nhybrid: affinity × archetype\n× momentum × CF × parity"]
AGENT["explain_sport()\ncomposer → critic loop"]
COMPOSER["Gemini 2.5 Pro\ncomposer.py\nthinking_budget=256\nresponse_mime_type=json"]
CRITIC["Gemini 2.5 Flash\ncritic.py\nthinking_budget=128\nbinary grounding check"]
FALLBACK["deterministic template\nper-tribe prose\n(fires when critic rejects)"]
NIL["compliance.py\n153k-name NIL banlist\n(applied to every output)"]
API --> SCORER
API --> AGENT
SCORER --> API
AGENT --> COMPOSER
COMPOSER --> CRITIC
CRITIC -->|pass| API
CRITIC -->|fail| FALLBACK
FALLBACK --> API
COMPOSER --> NIL
CRITIC --> NIL
end
subgraph GCP["Google Cloud Platform"]
BQ["BigQuery (US)\npropeller_team_usa\n─ athletes_aggregated (21,912 rows)\n─ cluster_assignments\n─ classification_reference\n─ fan_sessions (live + 5k synthetic seed)\n─ v_sport_factpack (view)\n─ v_tribe_definitions (view)\n─ v_tribe_community_stats (view)"]
GCS["Cloud Storage\nfact-pack cache\n(GCS_CACHE_BUCKET)"]
VERTEX["Vertex AI\nGemini 2.5 Pro · Flash\nGOOGLE_GENAI_USE_VERTEXAI=true"]
end
subgraph OfflineML["Offline ML pipeline (one-shot, not in Cloud Run)"]
direction LR
RAW["Raw CSVs\nKaggle Olympic (271k rows)\nParalympic medal-athlete (29k rows)"]
PREMIG["premigration/data/load/build.py\n→ athletes_aggregated\n→ athlete_names_banlist"]
KMEANS6["premigration/ml/archetypes/train_kmeans.py\nBigQuery ML KMEANS k=6\n→ cluster_assignments (6 archetypes)"]
AXIS["ml_offline/build_sport_axis.py\n101-entry hand-authored axis table\n→ sport_axis.json"]
DERIVE["ml_offline/derive_athlete_axes.py\njoins athletes_aggregated\n+ cluster_assignments\n→ athlete_axes.parquet (80k rows)"]
TRIBES["ml_offline/fit_tribes.py\nsklearn KMeans k=12\nsilhouette = 0.6587\n→ tribe_centroids.json"]
RAW --> PREMIG --> KMEANS6
KMEANS6 --> DERIVE
AXIS --> DERIVE
DERIVE --> TRIBES
end
Browser <-->|"JSON / HTTPS"| CloudRun
CloudRun <--> BQ
CloudRun <--> GCS
CloudRun <--> VERTEX
PREMIG --> BQ
KMEANS6 --> BQ
TRIBES -->|"artifacts baked\ninto Docker image"| CloudRun
Everything below assumes you are starting from a fresh clone with no pre-existing BigQuery tables. If the live GCP project already has the tables (as in the deployed version), skip to Step 4.
| Tool | Minimum version | Install |
|---|---|---|
| Python | 3.12 | python.org or pyenv install 3.12 |
uv |
0.4+ | curl -LsSf https://astral.sh/uv/install.sh | sh |
| Node.js | 20 LTS | nodejs.org |
| Google Cloud CLI | latest | gcloud components update |
| Kaggle CLI | latest | pip install kaggle |
You need a GCP project with the following APIs enabled:
bigquery.googleapis.com
bigquerystorage.googleapis.com
aiplatform.googleapis.com
run.googleapis.com
cloudbuild.googleapis.com
artifactregistry.googleapis.com
storage.googleapis.com
Enable them all at once:
gcloud services enable bigquery.googleapis.com bigquerystorage.googleapis.com \
aiplatform.googleapis.com run.googleapis.com cloudbuild.googleapis.com \
artifactregistry.googleapis.com storage.googleapis.com- Create a Kaggle account at https://www.kaggle.com and accept the dataset licence.
- Go to your Kaggle account → Settings → API → Create new token. This downloads
kaggle.json. - Place
kaggle.jsonat~/.kaggle/kaggle.jsonandchmod 600 ~/.kaggle/kaggle.json. - Download the dataset:
kaggle datasets download heesoo37/120-years-of-olympic-history-athletes-and-results \
--unzip \
--path premigration/data/raw/kaggle/
# This puts athlete_events.csv (271,116 rows) into premigration/data/raw/kaggle/Verify:
wc -l premigration/data/raw/kaggle/athlete_events.csv
# Expected: 271117 (header + 271116 rows)Do not commit this file. It contains athlete names and is in
.gitignore.
This dataset contains two .xlsx files and must be downloaded via kagglehub and converted to CSV.
Install kagglehub and openpyxl if needed:
pip install kagglehub openpyxl pandasRun this script from the project root to download, convert, and place both files:
python3 - <<'PY'
import kagglehub, shutil, os, pandas as pd
print("Downloading Paralympic dataset from Kaggle...")
cache_path = kagglehub.dataset_download("arpitsinghaiml/paralympic-dataset-1960-2018")
print(f"Cache: {cache_path}\n")
dest_dir = "premigration/data/raw/paralympic_org"
os.makedirs(dest_dir, exist_ok=True)
for item in os.listdir(cache_path):
src = os.path.join(cache_path, item)
if os.path.isfile(src) and item.endswith(".xlsx"):
csv_name = item.replace(".xlsx", ".csv")
dest = os.path.join(dest_dir, csv_name)
df = pd.read_excel(src)
df.to_csv(dest, index=False)
print(f" {item} -> {dest} ({len(df)} rows)")
elif os.path.isfile(src):
shutil.copy2(src, os.path.join(dest_dir, item))
print(f" copied {item}")
print("\nDone.")
PYThis produces two CSV files:
| File | Rows | Used by pipeline |
|---|---|---|
medal_athlete.csv |
29,170 | Yes — loaded as athletes_raw_paralympic, filtered to npc/npc_new = 'USA' (3,059 rows) |
paramedals.csv |
2,346 | No — country-level medal aggregates, not used in athlete aggregation |
Verify:
wc -l premigration/data/raw/paralympic_org/medal_athlete.csv
# Expected: 29171 (header + 29170 rows)Do not commit these files. They contain athlete names and are covered by
.gitignore.
Set your project:
export PROJECT_ID="your-gcp-project-id"
export BIGQUERY_LOCATION="US"Install the premigration dependencies (uses the project's own pyproject.toml):
uv sync
source .venv/bin/activateRun the loader. This is idempotent — safe to re-run:
python premigration/data/load/build.pyWhat it creates:
| Dataset | Table | Rows | Notes |
|---|---|---|---|
propeller_team_usa_internal |
athletes_raw_olympic |
271,116 | Raw Olympic rows with names — internal only |
propeller_team_usa_internal |
athletes_raw_paralympic |
29,170 | Raw Paralympic rows with names — internal only |
propeller_team_usa_internal |
athlete_names_banlist |
~153,590 | Distinct names from both sources — used by NIL check |
propeller_team_usa |
classification_reference |
120 | Para class-code descriptions |
propeller_team_usa |
athletes_aggregated |
21,912 | USA-only rows, names removed, SHA-256 IDs, placement bands |
Verify the key output table:
bq query --nouse_legacy_sql \
"SELECT COUNT(*) as n, COUNTIF(program='OLYMPIC') as olympic, COUNTIF(program='PARALYMPIC') as para
FROM \`${PROJECT_ID}.propeller_team_usa.athletes_aggregated\`"
# Expected: n=21912, olympic=18853, para=3059python premigration/ml/archetypes/train_kmeans.pyWhat it creates:
| Dataset | Table | Notes |
|---|---|---|
propeller_team_usa |
feature_matrix |
17 ordinal + one-hot features, 21,912 rows |
propeller_team_usa |
archetype_kmeans |
BigQuery ML KMEANS model, k=6 |
propeller_team_usa |
cluster_assignments |
21,912 rows + cluster_id (1–6) |
The 6 archetypes (aerobic_engine, wheel_velocity, forge_hammer, aquatic_spectrum, combat_grace, precision_rotational) are human-labelled names stored in premigration/ml/archetypes/archetype_definitions.json.
Verify:
bq query --nouse_legacy_sql \
"SELECT cluster_id, COUNT(*) as n
FROM \`${PROJECT_ID}.propeller_team_usa.cluster_assignments\`
GROUP BY 1 ORDER BY 1"
# Expected: 6 rows, total 21912# Copy and fill in your environment variables
cp .env.example .envEdit .env:
PROJECT_ID=your-gcp-project-id # GCP project with the BQ tables above
BIGQUERY_LOCATION=US
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=true # routes Gemini calls through Vertex AI
GEMINI_PRO_MODEL=gemini-2.5-pro
GEMINI_CRITIC_MODEL=gemini-2.5-flash
GCS_CACHE_BUCKET=your-gcs-bucket-name # bucket for Gemini fact-pack cacheMake sure Application Default Credentials are set:
gcloud auth application-default loginRun the bootstrap script (creates venv, installs deps, verifies artifacts, installs frontend deps, runs tests):
bash scripts/bootstrap.shIf this is the first time on a fresh clone (no ML artifacts yet), the bootstrap will fail listing missing files. Continue to Step 5 first.
Skip this step if
ml_offline/artifacts/already containssport_axis.json,tribe_centroids.json,tribe_definitions.json, andforum_seed.json(they are committed to the repo).
source .venv/bin/activate
# 1. Build sport axis table (101-entry hand-authored vectors, 75 LA28 sports)
uv run python -m ml_offline.build_sport_axis
# → writes ml_offline/artifacts/sport_axis.json
# 2. Derive per-athlete 6-axis positions (requires BigQuery access)
uv run python -m ml_offline.derive_athlete_axes
# → fetches athletes_aggregated + cluster_assignments from BQ
# → writes ml_offline/artifacts/athlete_axes.parquet (80k rows)
# → uploads to GCS_CACHE_BUCKET/tribe28/athlete_axes.parquet
# 3. Fit 12 fan tribes via KMeans on the 6-axis positions
uv run python -m ml_offline.fit_tribes
# → writes ml_offline/artifacts/tribe_centroids.json
# → prints silhouette score (expected ≈ 0.6587)
# 4. Propose tribe names with Gemini (outputs need human review)
uv run python -m ml_offline.name_tribes
# → writes ml_offline/artifacts/tribe_name_proposals.json
# Review this file, then manually edit tribe_definitions.json
# (tribe_definitions.json is already committed with reviewed names)
# 5. Seed the forum
uv run python -m ml_offline.seed_forum
# → writes ml_offline/artifacts/forum_seed.json
# 6. Seed synthetic fan sessions in BigQuery (5,000 rows, random_state=42)
uv run python -m ml_offline.seed_fan_sessions
# → creates / truncates propeller_team_usa.fan_sessions
# → inserts 5,000 synthetic sessions as a community cold-startThe three SQL views must exist before the backend can serve requests:
source .venv/bin/activate
for sql in sql/010_v_sport_factpack.sql sql/020_v_tribe_definitions.sql sql/030_v_tribe_community_stats.sql; do
echo "Applying $sql ..."
sed "s/eternal-tempest-495921-m5/${PROJECT_ID}/g" "$sql" | \
bq query --nouse_legacy_sql --project_id="${PROJECT_ID}"
doneVerify all three views exist:
bq ls --filter labels.type:view ${PROJECT_ID}:propeller_team_usa 2>/dev/null || \
bq query --nouse_legacy_sql \
"SELECT table_name FROM \`${PROJECT_ID}.propeller_team_usa.INFORMATION_SCHEMA.VIEWS\`"source .venv/bin/activate
uvicorn backend.main:app --reload --port 8000Smoke-test:
curl http://localhost:8000/healthz
# {"status":"ok"}
curl -s http://localhost:8000/la28 | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['sports']), 'sports')"
# 75 sportscd frontend
npm run dev
# → http://localhost:5173 (proxies /api/* to localhost:8000)uv run pytest -qExpected: 73 passed. Tests cover scoring, projection, tribes, composer/critic orchestration, compliance (NIL, finish-time guard, Paralympic parity), forum store, and HTTP smoke tests against every endpoint.
gcloud builds submit \
--config cloudbuild.yaml \
--substitutions=SHORT_SHA=$(git rev-parse --short HEAD)cloudbuild.yaml runs a multi-stage Docker build (Vite → Python), pushes to Artifact Registry, and rolls out a new Cloud Run revision in us-central1. The ML artifacts (sport_axis.json, tribe_centroids.json, tribe_definitions.json, forum_seed.json) are baked into the image at build time via COPY ml_offline/artifacts/ ./ml_offline/artifacts/.
After deploy, verify:
SERVICE_URL=$(gcloud run services describe tribe28 --region=us-central1 --format='value(status.url)')
curl "${SERVICE_URL}/healthz"backend/
agent/ composer (Gemini 2.5 Pro) + critic (Flash) + orchestrator + fallback templates
api/ FastAPI routers: /tribe, /sport, /la28, /walkthrough, /forum
community/ BigQuery aggregate helpers (k-anonymous)
forum/ in-process anonymous forum store + moderation + NIL glyph
ml/ projection (6 axes), scoring (hybrid), tribes (assignment)
compliance.py NIL banlist loader + check function
schemas.py Pydantic v2 API contracts
frontend/
src/
pages/ Entry, Walkthrough, TribeLoading, Reveal, Community,
Watchlist, Share, ForumIndex, TribeForum, ThreadPage,
Profile, Methodology
components/ TribeGlyph, MiniConstellation, USStateMap, Thread, ComposeBox, …
hooks/ useSession, useTribes, useWatchlist (custom-event live sync)
api/ typed fetch clients per backend router
ml_offline/
artifacts/ la28_program.json, sport_axis.json, tribe_centroids.json,
tribe_definitions.json, forum_seed.json, athlete_axes.parquet
build_sport_axis.py hand-authored 6-axis vectors for 101 sports
derive_athlete_axes.py BQ join → parquet
fit_tribes.py KMeans k=12 → centroids
name_tribes.py Gemini name proposals
seed_fan_sessions.py synthetic community cold-start
seed_forum.py Gemini-seeded forum threads
premigration/
data/load/build.py CSV → BigQuery loader (idempotent)
ml/archetypes/train_kmeans.py BigQuery ML KMEANS k=6
ml/archetypes/tsne_layout.py optional t-SNE layout
sql/ BigQuery view DDL (010 factpack, 020 definitions, 030 community stats)
scripts/bootstrap.sh one-command local setup
tests/ 73-test pytest suite
Every constraint is enforced in code, not just documented:
- No athlete NIL. 153k-name banlist checked on every model output; violations force the deterministic fallback. Forum posts use
T01-001-style anon IDs. - No IOC/IPC/sponsor marks. No third-party logos — all visuals are abstract SVG glyphs.
- No finish times or specific scoring results. Enforced in
SportFactPackschema, the critic's grounding rules, andtests/test_compliance.py. - Paralympic parity. Scorer guarantees at least one Paralympic sport in every top-pick list; class codes appear as labels only, never interpreted.
- US-scope only.
NOC = 'USA'(Olympic) andnpc/npc_new = 'USA'(Paralympic) filters hard-coded inpremigration/data/load/build.py. - k-anonymity. All community aggregates gated by
HAVING count(*) >= 10at the BigQuery view level. - Public data only. Kaggle Olympic history (heesoo37), Kaggle Paralympic dataset 1960–2018 (arpitsinghaiml, converted from xlsx), and the official LA28 games programme at la28.org.
Apache 2.0 — see LICENSE.