Tribe28

A 5-minute, game-like walkthrough that places every fan into one of 12 LA28 fan tribes, hands them a tribe-specific LA28 watchlist with Gemini-grounded reasons, drops them into an anonymous tribe forum, and shows them the other fans in their tribe — including the surprising ones who landed on different sports.

Live demo: https://tribe28-mr2rqybiwq-uc.a.run.app

Built for the Team USA x Google Cloud Hackathon. Addresses Challenge 5 — Choose Your Own, with strong overlap on Challenge 1 (Olympic/Paralympic parity) and Challenge 3 (LA28 momentum).

Build instructions

For a fresh local setup, start here:

cp .env.example .env
bash scripts/bootstrap.sh

If this is a brand-new clone with no existing BigQuery tables or offline ML artifacts, follow the full step-by-step build sequence here:

Full build instructions

Tech stack

Layer	Technology
Frontend	React 18, TypeScript, Vite, react-router-dom, inline SVG
Backend	FastAPI, Pydantic v2, Python 3.12, `uv`
LLM	Gemini 2.5 Pro (composer) + Gemini 2.5 Flash (critic) via Vertex AI
Data	BigQuery (US region), Cloud Storage (fact-pack cache)
Hosting	Cloud Run (us-central1), Cloud Build, Artifact Registry
ML	scikit-learn (KMeans, silhouette), pandas, pyarrow, BigQuery ML

Data sources

Source	What	USA rows	Years
Kaggle — 120 Years of Olympic History	Olympic athlete-event records	18,853	1896–2016
Kaggle — Paralympic Dataset 1960–2018	Paralympic medal-athlete records (downloaded as `.xlsx`, converted to `medal_athlete.csv`)	3,059	1960–2018
la28.org/en/games-plan/olympics.html + /paralympics.html	LA28 programme (hand-curated JSON)	75 sports	—

Not used: finish times, athlete names, photos, hometowns, weather data, biometrics. Placement stored as bands only: 1 (Gold) / 2 (Silver) / 3 (Bronze) / qualified.

What it does

7-chapter walkthrough. A narrative quiz projects you onto six axes (precision, solo, endurance, ground, combat, equipment). Each answer animates a live dot; Chapter 4 reveals a mini-cohort of three similarly-answering anonymous fans; Chapter 6 shows an "era affinity" callout grounded in seven decades of Team USA data.
Tribe assignment. Your 6-axis position is matched to one of 12 KMeans tribe centroids (silhouette 0.6587) trained on US-only Team USA aggregate records.
Grounded watchlist. A hybrid scorer ranks all 75 LA28 sports by affinity × archetype × momentum × tribemate signal × Paralympic-parity guard. For each top pick, Gemini 2.5 Pro writes a fact-pack-grounded reason; Gemini 2.5 Flash verifies grounding before the text ships. Deterministic template prose is the fallback if the critic rejects.
Community view. k-anonymous BigQuery aggregates (k ≥ 10) surface tribe share %, top US states (tile-cartogram), bridge tribes (closest by axis distance with shared sports), and tribemate cards with a "surprise pick" sport.
Tribe forum. Per-tribe anonymous threads seeded from forum_seed.json. Anon IDs only (T01-001), NIL banlist applied to every post.
Share card. A 6-axis spider chart of your axis profile, no athletes, no event imagery.

Architecture

flowchart TD
    subgraph Browser["Browser (React + Vite)"]
        UI["/walkthrough · /reveal · /community\n/forum · /share · /methodology"]
    end

    subgraph CloudRun["Cloud Run — FastAPI backend (us-central1)"]
        direction TB
        API["/tribe/assign\n/sport · /la28\n/walkthrough · /forum"]
        SCORER["score_sports()\nhybrid: affinity × archetype\n× momentum × CF × parity"]
        AGENT["explain_sport()\ncomposer → critic loop"]
        COMPOSER["Gemini 2.5 Pro\ncomposer.py\nthinking_budget=256\nresponse_mime_type=json"]
        CRITIC["Gemini 2.5 Flash\ncritic.py\nthinking_budget=128\nbinary grounding check"]
        FALLBACK["deterministic template\nper-tribe prose\n(fires when critic rejects)"]
        NIL["compliance.py\n153k-name NIL banlist\n(applied to every output)"]
        API --> SCORER
        API --> AGENT
        SCORER --> API
        AGENT --> COMPOSER
        COMPOSER --> CRITIC
        CRITIC -->|pass| API
        CRITIC -->|fail| FALLBACK
        FALLBACK --> API
        COMPOSER --> NIL
        CRITIC --> NIL
    end

    subgraph GCP["Google Cloud Platform"]
        BQ["BigQuery (US)\npropeller_team_usa\n─ athletes_aggregated (21,912 rows)\n─ cluster_assignments\n─ classification_reference\n─ fan_sessions (live + 5k synthetic seed)\n─ v_sport_factpack (view)\n─ v_tribe_definitions (view)\n─ v_tribe_community_stats (view)"]
        GCS["Cloud Storage\nfact-pack cache\n(GCS_CACHE_BUCKET)"]
        VERTEX["Vertex AI\nGemini 2.5 Pro · Flash\nGOOGLE_GENAI_USE_VERTEXAI=true"]
    end

    subgraph OfflineML["Offline ML pipeline (one-shot, not in Cloud Run)"]
        direction LR
        RAW["Raw CSVs\nKaggle Olympic (271k rows)\nParalympic medal-athlete (29k rows)"]
        PREMIG["premigration/data/load/build.py\n→ athletes_aggregated\n→ athlete_names_banlist"]
        KMEANS6["premigration/ml/archetypes/train_kmeans.py\nBigQuery ML KMEANS k=6\n→ cluster_assignments (6 archetypes)"]
        AXIS["ml_offline/build_sport_axis.py\n101-entry hand-authored axis table\n→ sport_axis.json"]
        DERIVE["ml_offline/derive_athlete_axes.py\njoins athletes_aggregated\n+ cluster_assignments\n→ athlete_axes.parquet (80k rows)"]
        TRIBES["ml_offline/fit_tribes.py\nsklearn KMeans k=12\nsilhouette = 0.6587\n→ tribe_centroids.json"]
        RAW --> PREMIG --> KMEANS6
        KMEANS6 --> DERIVE
        AXIS --> DERIVE
        DERIVE --> TRIBES
    end

    Browser <-->|"JSON / HTTPS"| CloudRun
    CloudRun <--> BQ
    CloudRun <--> GCS
    CloudRun <--> VERTEX
    PREMIG --> BQ
    KMEANS6 --> BQ
    TRIBES -->|"artifacts baked\ninto Docker image"| CloudRun

Detailed build instructions

Everything below assumes you are starting from a fresh clone with no pre-existing BigQuery tables. If the live GCP project already has the tables (as in the deployed version), skip to Step 4.

Prerequisites

Tool	Minimum version	Install
Python	3.12	python.org or `pyenv install 3.12`
`uv`	0.4+	`curl -LsSf https://astral.sh/uv/install.sh \| sh`
Node.js	20 LTS	nodejs.org
Google Cloud CLI	latest	`gcloud components update`
Kaggle CLI	latest	`pip install kaggle`

You need a GCP project with the following APIs enabled:

bigquery.googleapis.com
bigquerystorage.googleapis.com
aiplatform.googleapis.com
run.googleapis.com
cloudbuild.googleapis.com
artifactregistry.googleapis.com
storage.googleapis.com

Enable them all at once:

gcloud services enable bigquery.googleapis.com bigquerystorage.googleapis.com \
  aiplatform.googleapis.com run.googleapis.com cloudbuild.googleapis.com \
  artifactregistry.googleapis.com storage.googleapis.com

Step 1 — Download raw source data

1a. Kaggle Olympic dataset

Create a Kaggle account at https://www.kaggle.com and accept the dataset licence.
Go to your Kaggle account → Settings → API → Create new token. This downloads kaggle.json.
Place kaggle.json at ~/.kaggle/kaggle.json and chmod 600 ~/.kaggle/kaggle.json.
Download the dataset:

kaggle datasets download heesoo37/120-years-of-olympic-history-athletes-and-results \
  --unzip \
  --path premigration/data/raw/kaggle/
# This puts athlete_events.csv (271,116 rows) into premigration/data/raw/kaggle/

Verify:

wc -l premigration/data/raw/kaggle/athlete_events.csv
# Expected: 271117 (header + 271116 rows)

Do not commit this file. It contains athlete names and is in .gitignore.

1b. Kaggle Paralympic dataset

This dataset contains two .xlsx files and must be downloaded via kagglehub and converted to CSV.

Install kagglehub and openpyxl if needed:

pip install kagglehub openpyxl pandas

Run this script from the project root to download, convert, and place both files:

python3 - <<'PY'
import kagglehub, shutil, os, pandas as pd

print("Downloading Paralympic dataset from Kaggle...")
cache_path = kagglehub.dataset_download("arpitsinghaiml/paralympic-dataset-1960-2018")
print(f"Cache: {cache_path}\n")

dest_dir = "premigration/data/raw/paralympic_org"
os.makedirs(dest_dir, exist_ok=True)

for item in os.listdir(cache_path):
    src = os.path.join(cache_path, item)
    if os.path.isfile(src) and item.endswith(".xlsx"):
        csv_name = item.replace(".xlsx", ".csv")
        dest = os.path.join(dest_dir, csv_name)
        df = pd.read_excel(src)
        df.to_csv(dest, index=False)
        print(f"  {item} -> {dest}  ({len(df)} rows)")
    elif os.path.isfile(src):
        shutil.copy2(src, os.path.join(dest_dir, item))
        print(f"  copied {item}")

print("\nDone.")
PY

This produces two CSV files:

File	Rows	Used by pipeline
`medal_athlete.csv`	29,170	Yes — loaded as `athletes_raw_paralympic`, filtered to `npc/npc_new = 'USA'` (3,059 rows)
`paramedals.csv`	2,346	No — country-level medal aggregates, not used in athlete aggregation

Verify:

wc -l premigration/data/raw/paralympic_org/medal_athlete.csv
# Expected: 29171 (header + 29170 rows)

Do not commit these files. They contain athlete names and are covered by .gitignore.

Step 2 — Load raw data into BigQuery (premigration)

Set your project:

export PROJECT_ID="your-gcp-project-id"
export BIGQUERY_LOCATION="US"

Install the premigration dependencies (uses the project's own pyproject.toml):

uv sync
source .venv/bin/activate

Run the loader. This is idempotent — safe to re-run:

python premigration/data/load/build.py

What it creates:

Dataset	Table	Rows	Notes
`propeller_team_usa_internal`	`athletes_raw_olympic`	271,116	Raw Olympic rows with names — internal only
`propeller_team_usa_internal`	`athletes_raw_paralympic`	29,170	Raw Paralympic rows with names — internal only
`propeller_team_usa_internal`	`athlete_names_banlist`	~153,590	Distinct names from both sources — used by NIL check
`propeller_team_usa`	`classification_reference`	120	Para class-code descriptions
`propeller_team_usa`	`athletes_aggregated`	21,912	USA-only rows, names removed, SHA-256 IDs, placement bands

Verify the key output table:

bq query --nouse_legacy_sql \
  "SELECT COUNT(*) as n, COUNTIF(program='OLYMPIC') as olympic, COUNTIF(program='PARALYMPIC') as para
   FROM \`${PROJECT_ID}.propeller_team_usa.athletes_aggregated\`"
# Expected: n=21912, olympic=18853, para=3059

Step 3 — Train the archetype clustering (BigQuery ML, k=6)

python premigration/ml/archetypes/train_kmeans.py

What it creates:

Dataset	Table	Notes
`propeller_team_usa`	`feature_matrix`	17 ordinal + one-hot features, 21,912 rows
`propeller_team_usa`	`archetype_kmeans`	BigQuery ML KMEANS model, k=6
`propeller_team_usa`	`cluster_assignments`	21,912 rows + `cluster_id` (1–6)

The 6 archetypes (aerobic_engine, wheel_velocity, forge_hammer, aquatic_spectrum, combat_grace, precision_rotational) are human-labelled names stored in premigration/ml/archetypes/archetype_definitions.json.

Verify:

bq query --nouse_legacy_sql \
  "SELECT cluster_id, COUNT(*) as n
   FROM \`${PROJECT_ID}.propeller_team_usa.cluster_assignments\`
   GROUP BY 1 ORDER BY 1"
# Expected: 6 rows, total 21912

Step 4 — Local dev environment

# Copy and fill in your environment variables
cp .env.example .env

Edit .env:

PROJECT_ID=your-gcp-project-id          # GCP project with the BQ tables above
BIGQUERY_LOCATION=US
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=true          # routes Gemini calls through Vertex AI
GEMINI_PRO_MODEL=gemini-2.5-pro
GEMINI_CRITIC_MODEL=gemini-2.5-flash
GCS_CACHE_BUCKET=your-gcs-bucket-name   # bucket for Gemini fact-pack cache

Make sure Application Default Credentials are set:

gcloud auth application-default login

Run the bootstrap script (creates venv, installs deps, verifies artifacts, installs frontend deps, runs tests):

bash scripts/bootstrap.sh

If this is the first time on a fresh clone (no ML artifacts yet), the bootstrap will fail listing missing files. Continue to Step 5 first.

Step 5 — Run the offline ML pipeline

Skip this step if ml_offline/artifacts/ already contains sport_axis.json, tribe_centroids.json, tribe_definitions.json, and forum_seed.json (they are committed to the repo).

source .venv/bin/activate

# 1. Build sport axis table (101-entry hand-authored vectors, 75 LA28 sports)
uv run python -m ml_offline.build_sport_axis
# → writes ml_offline/artifacts/sport_axis.json

# 2. Derive per-athlete 6-axis positions (requires BigQuery access)
uv run python -m ml_offline.derive_athlete_axes
# → fetches athletes_aggregated + cluster_assignments from BQ
# → writes ml_offline/artifacts/athlete_axes.parquet  (80k rows)
# → uploads to GCS_CACHE_BUCKET/tribe28/athlete_axes.parquet

# 3. Fit 12 fan tribes via KMeans on the 6-axis positions
uv run python -m ml_offline.fit_tribes
# → writes ml_offline/artifacts/tribe_centroids.json
# → prints silhouette score (expected ≈ 0.6587)

# 4. Propose tribe names with Gemini (outputs need human review)
uv run python -m ml_offline.name_tribes
# → writes ml_offline/artifacts/tribe_name_proposals.json
# Review this file, then manually edit tribe_definitions.json
# (tribe_definitions.json is already committed with reviewed names)

# 5. Seed the forum
uv run python -m ml_offline.seed_forum
# → writes ml_offline/artifacts/forum_seed.json

# 6. Seed synthetic fan sessions in BigQuery (5,000 rows, random_state=42)
uv run python -m ml_offline.seed_fan_sessions
# → creates / truncates propeller_team_usa.fan_sessions
# → inserts 5,000 synthetic sessions as a community cold-start

Step 6 — Apply BigQuery views

The three SQL views must exist before the backend can serve requests:

source .venv/bin/activate

for sql in sql/010_v_sport_factpack.sql sql/020_v_tribe_definitions.sql sql/030_v_tribe_community_stats.sql; do
  echo "Applying $sql ..."
  sed "s/eternal-tempest-495921-m5/${PROJECT_ID}/g" "$sql" | \
    bq query --nouse_legacy_sql --project_id="${PROJECT_ID}"
done

Verify all three views exist:

bq ls --filter labels.type:view ${PROJECT_ID}:propeller_team_usa 2>/dev/null || \
bq query --nouse_legacy_sql \
  "SELECT table_name FROM \`${PROJECT_ID}.propeller_team_usa.INFORMATION_SCHEMA.VIEWS\`"

Step 7 — Start the backend

source .venv/bin/activate
uvicorn backend.main:app --reload --port 8000

Smoke-test:

curl http://localhost:8000/healthz
# {"status":"ok"}

curl -s http://localhost:8000/la28 | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['sports']), 'sports')"
# 75 sports

Step 8 — Start the frontend dev server

cd frontend
npm run dev
# → http://localhost:5173  (proxies /api/* to localhost:8000)

Step 9 — Run the test suite

uv run pytest -q

Expected: 73 passed. Tests cover scoring, projection, tribes, composer/critic orchestration, compliance (NIL, finish-time guard, Paralympic parity), forum store, and HTTP smoke tests against every endpoint.

Step 10 — Deploy to Cloud Run

gcloud builds submit \
  --config cloudbuild.yaml \
  --substitutions=SHORT_SHA=$(git rev-parse --short HEAD)

cloudbuild.yaml runs a multi-stage Docker build (Vite → Python), pushes to Artifact Registry, and rolls out a new Cloud Run revision in us-central1. The ML artifacts (sport_axis.json, tribe_centroids.json, tribe_definitions.json, forum_seed.json) are baked into the image at build time via COPY ml_offline/artifacts/ ./ml_offline/artifacts/.

After deploy, verify:

SERVICE_URL=$(gcloud run services describe tribe28 --region=us-central1 --format='value(status.url)')
curl "${SERVICE_URL}/healthz"

Project structure

backend/
  agent/            composer (Gemini 2.5 Pro) + critic (Flash) + orchestrator + fallback templates
  api/              FastAPI routers: /tribe, /sport, /la28, /walkthrough, /forum
  community/        BigQuery aggregate helpers (k-anonymous)
  forum/            in-process anonymous forum store + moderation + NIL glyph
  ml/               projection (6 axes), scoring (hybrid), tribes (assignment)
  compliance.py     NIL banlist loader + check function
  schemas.py        Pydantic v2 API contracts
frontend/
  src/
    pages/          Entry, Walkthrough, TribeLoading, Reveal, Community,
                    Watchlist, Share, ForumIndex, TribeForum, ThreadPage,
                    Profile, Methodology
    components/     TribeGlyph, MiniConstellation, USStateMap, Thread, ComposeBox, …
    hooks/          useSession, useTribes, useWatchlist (custom-event live sync)
    api/            typed fetch clients per backend router
ml_offline/
  artifacts/        la28_program.json, sport_axis.json, tribe_centroids.json,
                    tribe_definitions.json, forum_seed.json, athlete_axes.parquet
  build_sport_axis.py   hand-authored 6-axis vectors for 101 sports
  derive_athlete_axes.py   BQ join → parquet
  fit_tribes.py     KMeans k=12 → centroids
  name_tribes.py    Gemini name proposals
  seed_fan_sessions.py  synthetic community cold-start
  seed_forum.py     Gemini-seeded forum threads
premigration/
  data/load/build.py          CSV → BigQuery loader (idempotent)
  ml/archetypes/train_kmeans.py  BigQuery ML KMEANS k=6
  ml/archetypes/tsne_layout.py   optional t-SNE layout
sql/                BigQuery view DDL (010 factpack, 020 definitions, 030 community stats)
scripts/bootstrap.sh  one-command local setup
tests/              73-test pytest suite

Compliance

Every constraint is enforced in code, not just documented:

No athlete NIL. 153k-name banlist checked on every model output; violations force the deterministic fallback. Forum posts use T01-001-style anon IDs.
No IOC/IPC/sponsor marks. No third-party logos — all visuals are abstract SVG glyphs.
No finish times or specific scoring results. Enforced in SportFactPack schema, the critic's grounding rules, and tests/test_compliance.py.
Paralympic parity. Scorer guarantees at least one Paralympic sport in every top-pick list; class codes appear as labels only, never interpreted.
US-scope only. NOC = 'USA' (Olympic) and npc/npc_new = 'USA' (Paralympic) filters hard-coded in premigration/data/load/build.py.
k-anonymity. All community aggregates gated by HAVING count(*) >= 10 at the BigQuery view level.
Public data only. Kaggle Olympic history (heesoo37), Kaggle Paralympic dataset 1960–2018 (arpitsinghaiml, converted from xlsx), and the official LA28 games programme at la28.org.

License

Apache 2.0 — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tribe28

Build instructions

Tech stack

Data sources

What it does

Architecture

Detailed build instructions

Prerequisites

Step 1 — Download raw source data

1a. Kaggle Olympic dataset

1b. Kaggle Paralympic dataset

Step 2 — Load raw data into BigQuery (premigration)

Step 3 — Train the archetype clustering (BigQuery ML, k=6)

Step 4 — Local dev environment

Step 5 — Run the offline ML pipeline

Step 6 — Apply BigQuery views

Step 7 — Start the backend

Step 8 — Start the frontend dev server

Step 9 — Run the test suite

Step 10 — Deploy to Cloud Run

Project structure

Compliance

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
backend		backend
frontend		frontend
ml_offline		ml_offline
premigration		premigration
scripts		scripts
sql		sql
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.python-version		.python-version
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
cloudbuild.yaml		cloudbuild.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Tribe28

Build instructions

Tech stack

Data sources

What it does

Architecture

Detailed build instructions

Prerequisites

Step 1 — Download raw source data

1a. Kaggle Olympic dataset

1b. Kaggle Paralympic dataset

Step 2 — Load raw data into BigQuery (premigration)

Step 3 — Train the archetype clustering (BigQuery ML, k=6)

Step 4 — Local dev environment

Step 5 — Run the offline ML pipeline

Step 6 — Apply BigQuery views

Step 7 — Start the backend

Step 8 — Start the frontend dev server

Step 9 — Run the test suite

Step 10 — Deploy to Cloud Run

Project structure

Compliance

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages