Skip to content

S-Sigdel/tribe28

Repository files navigation

Tribe28

A 5-minute, game-like walkthrough that places every fan into one of 12 LA28 fan tribes, hands them a tribe-specific LA28 watchlist with Gemini-grounded reasons, drops them into an anonymous tribe forum, and shows them the other fans in their tribe — including the surprising ones who landed on different sports.

Live demo: https://tribe28-mr2rqybiwq-uc.a.run.app

Built for the Team USA x Google Cloud Hackathon. Addresses Challenge 5 — Choose Your Own, with strong overlap on Challenge 1 (Olympic/Paralympic parity) and Challenge 3 (LA28 momentum).


Build instructions

For a fresh local setup, start here:

cp .env.example .env
bash scripts/bootstrap.sh

If this is a brand-new clone with no existing BigQuery tables or offline ML artifacts, follow the full step-by-step build sequence here:


Tech stack

Layer Technology
Frontend React 18, TypeScript, Vite, react-router-dom, inline SVG
Backend FastAPI, Pydantic v2, Python 3.12, uv
LLM Gemini 2.5 Pro (composer) + Gemini 2.5 Flash (critic) via Vertex AI
Data BigQuery (US region), Cloud Storage (fact-pack cache)
Hosting Cloud Run (us-central1), Cloud Build, Artifact Registry
ML scikit-learn (KMeans, silhouette), pandas, pyarrow, BigQuery ML

Data sources

Source What USA rows Years
Kaggle — 120 Years of Olympic History Olympic athlete-event records 18,853 1896–2016
Kaggle — Paralympic Dataset 1960–2018 Paralympic medal-athlete records (downloaded as .xlsx, converted to medal_athlete.csv) 3,059 1960–2018
la28.org/en/games-plan/olympics.html + /paralympics.html LA28 programme (hand-curated JSON) 75 sports

Not used: finish times, athlete names, photos, hometowns, weather data, biometrics. Placement stored as bands only: 1 (Gold) / 2 (Silver) / 3 (Bronze) / qualified.


What it does

  1. 7-chapter walkthrough. A narrative quiz projects you onto six axes (precision, solo, endurance, ground, combat, equipment). Each answer animates a live dot; Chapter 4 reveals a mini-cohort of three similarly-answering anonymous fans; Chapter 6 shows an "era affinity" callout grounded in seven decades of Team USA data.
  2. Tribe assignment. Your 6-axis position is matched to one of 12 KMeans tribe centroids (silhouette 0.6587) trained on US-only Team USA aggregate records.
  3. Grounded watchlist. A hybrid scorer ranks all 75 LA28 sports by affinity × archetype × momentum × tribemate signal × Paralympic-parity guard. For each top pick, Gemini 2.5 Pro writes a fact-pack-grounded reason; Gemini 2.5 Flash verifies grounding before the text ships. Deterministic template prose is the fallback if the critic rejects.
  4. Community view. k-anonymous BigQuery aggregates (k ≥ 10) surface tribe share %, top US states (tile-cartogram), bridge tribes (closest by axis distance with shared sports), and tribemate cards with a "surprise pick" sport.
  5. Tribe forum. Per-tribe anonymous threads seeded from forum_seed.json. Anon IDs only (T01-001), NIL banlist applied to every post.
  6. Share card. A 6-axis spider chart of your axis profile, no athletes, no event imagery.

Architecture

flowchart TD
    subgraph Browser["Browser (React + Vite)"]
        UI["/walkthrough · /reveal · /community\n/forum · /share · /methodology"]
    end

    subgraph CloudRun["Cloud Run — FastAPI backend (us-central1)"]
        direction TB
        API["/tribe/assign\n/sport · /la28\n/walkthrough · /forum"]
        SCORER["score_sports()\nhybrid: affinity × archetype\n× momentum × CF × parity"]
        AGENT["explain_sport()\ncomposer → critic loop"]
        COMPOSER["Gemini 2.5 Pro\ncomposer.py\nthinking_budget=256\nresponse_mime_type=json"]
        CRITIC["Gemini 2.5 Flash\ncritic.py\nthinking_budget=128\nbinary grounding check"]
        FALLBACK["deterministic template\nper-tribe prose\n(fires when critic rejects)"]
        NIL["compliance.py\n153k-name NIL banlist\n(applied to every output)"]
        API --> SCORER
        API --> AGENT
        SCORER --> API
        AGENT --> COMPOSER
        COMPOSER --> CRITIC
        CRITIC -->|pass| API
        CRITIC -->|fail| FALLBACK
        FALLBACK --> API
        COMPOSER --> NIL
        CRITIC --> NIL
    end

    subgraph GCP["Google Cloud Platform"]
        BQ["BigQuery (US)\npropeller_team_usa\n─ athletes_aggregated (21,912 rows)\n─ cluster_assignments\n─ classification_reference\n─ fan_sessions (live + 5k synthetic seed)\n─ v_sport_factpack (view)\n─ v_tribe_definitions (view)\n─ v_tribe_community_stats (view)"]
        GCS["Cloud Storage\nfact-pack cache\n(GCS_CACHE_BUCKET)"]
        VERTEX["Vertex AI\nGemini 2.5 Pro · Flash\nGOOGLE_GENAI_USE_VERTEXAI=true"]
    end

    subgraph OfflineML["Offline ML pipeline (one-shot, not in Cloud Run)"]
        direction LR
        RAW["Raw CSVs\nKaggle Olympic (271k rows)\nParalympic medal-athlete (29k rows)"]
        PREMIG["premigration/data/load/build.py\n→ athletes_aggregated\n→ athlete_names_banlist"]
        KMEANS6["premigration/ml/archetypes/train_kmeans.py\nBigQuery ML KMEANS k=6\n→ cluster_assignments (6 archetypes)"]
        AXIS["ml_offline/build_sport_axis.py\n101-entry hand-authored axis table\n→ sport_axis.json"]
        DERIVE["ml_offline/derive_athlete_axes.py\njoins athletes_aggregated\n+ cluster_assignments\n→ athlete_axes.parquet (80k rows)"]
        TRIBES["ml_offline/fit_tribes.py\nsklearn KMeans k=12\nsilhouette = 0.6587\n→ tribe_centroids.json"]
        RAW --> PREMIG --> KMEANS6
        KMEANS6 --> DERIVE
        AXIS --> DERIVE
        DERIVE --> TRIBES
    end

    Browser <-->|"JSON / HTTPS"| CloudRun
    CloudRun <--> BQ
    CloudRun <--> GCS
    CloudRun <--> VERTEX
    PREMIG --> BQ
    KMEANS6 --> BQ
    TRIBES -->|"artifacts baked\ninto Docker image"| CloudRun
Loading

Detailed build instructions

Everything below assumes you are starting from a fresh clone with no pre-existing BigQuery tables. If the live GCP project already has the tables (as in the deployed version), skip to Step 4.

Prerequisites

Tool Minimum version Install
Python 3.12 python.org or pyenv install 3.12
uv 0.4+ curl -LsSf https://astral.sh/uv/install.sh | sh
Node.js 20 LTS nodejs.org
Google Cloud CLI latest gcloud components update
Kaggle CLI latest pip install kaggle

You need a GCP project with the following APIs enabled:

bigquery.googleapis.com
bigquerystorage.googleapis.com
aiplatform.googleapis.com
run.googleapis.com
cloudbuild.googleapis.com
artifactregistry.googleapis.com
storage.googleapis.com

Enable them all at once:

gcloud services enable bigquery.googleapis.com bigquerystorage.googleapis.com \
  aiplatform.googleapis.com run.googleapis.com cloudbuild.googleapis.com \
  artifactregistry.googleapis.com storage.googleapis.com

Step 1 — Download raw source data

1a. Kaggle Olympic dataset

  1. Create a Kaggle account at https://www.kaggle.com and accept the dataset licence.
  2. Go to your Kaggle account → SettingsAPICreate new token. This downloads kaggle.json.
  3. Place kaggle.json at ~/.kaggle/kaggle.json and chmod 600 ~/.kaggle/kaggle.json.
  4. Download the dataset:
kaggle datasets download heesoo37/120-years-of-olympic-history-athletes-and-results \
  --unzip \
  --path premigration/data/raw/kaggle/
# This puts athlete_events.csv (271,116 rows) into premigration/data/raw/kaggle/

Verify:

wc -l premigration/data/raw/kaggle/athlete_events.csv
# Expected: 271117 (header + 271116 rows)

Do not commit this file. It contains athlete names and is in .gitignore.

1b. Kaggle Paralympic dataset

This dataset contains two .xlsx files and must be downloaded via kagglehub and converted to CSV.

Install kagglehub and openpyxl if needed:

pip install kagglehub openpyxl pandas

Run this script from the project root to download, convert, and place both files:

python3 - <<'PY'
import kagglehub, shutil, os, pandas as pd

print("Downloading Paralympic dataset from Kaggle...")
cache_path = kagglehub.dataset_download("arpitsinghaiml/paralympic-dataset-1960-2018")
print(f"Cache: {cache_path}\n")

dest_dir = "premigration/data/raw/paralympic_org"
os.makedirs(dest_dir, exist_ok=True)

for item in os.listdir(cache_path):
    src = os.path.join(cache_path, item)
    if os.path.isfile(src) and item.endswith(".xlsx"):
        csv_name = item.replace(".xlsx", ".csv")
        dest = os.path.join(dest_dir, csv_name)
        df = pd.read_excel(src)
        df.to_csv(dest, index=False)
        print(f"  {item} -> {dest}  ({len(df)} rows)")
    elif os.path.isfile(src):
        shutil.copy2(src, os.path.join(dest_dir, item))
        print(f"  copied {item}")

print("\nDone.")
PY

This produces two CSV files:

File Rows Used by pipeline
medal_athlete.csv 29,170 Yes — loaded as athletes_raw_paralympic, filtered to npc/npc_new = 'USA' (3,059 rows)
paramedals.csv 2,346 No — country-level medal aggregates, not used in athlete aggregation

Verify:

wc -l premigration/data/raw/paralympic_org/medal_athlete.csv
# Expected: 29171 (header + 29170 rows)

Do not commit these files. They contain athlete names and are covered by .gitignore.


Step 2 — Load raw data into BigQuery (premigration)

Set your project:

export PROJECT_ID="your-gcp-project-id"
export BIGQUERY_LOCATION="US"

Install the premigration dependencies (uses the project's own pyproject.toml):

uv sync
source .venv/bin/activate

Run the loader. This is idempotent — safe to re-run:

python premigration/data/load/build.py

What it creates:

Dataset Table Rows Notes
propeller_team_usa_internal athletes_raw_olympic 271,116 Raw Olympic rows with names — internal only
propeller_team_usa_internal athletes_raw_paralympic 29,170 Raw Paralympic rows with names — internal only
propeller_team_usa_internal athlete_names_banlist ~153,590 Distinct names from both sources — used by NIL check
propeller_team_usa classification_reference 120 Para class-code descriptions
propeller_team_usa athletes_aggregated 21,912 USA-only rows, names removed, SHA-256 IDs, placement bands

Verify the key output table:

bq query --nouse_legacy_sql \
  "SELECT COUNT(*) as n, COUNTIF(program='OLYMPIC') as olympic, COUNTIF(program='PARALYMPIC') as para
   FROM \`${PROJECT_ID}.propeller_team_usa.athletes_aggregated\`"
# Expected: n=21912, olympic=18853, para=3059

Step 3 — Train the archetype clustering (BigQuery ML, k=6)

python premigration/ml/archetypes/train_kmeans.py

What it creates:

Dataset Table Notes
propeller_team_usa feature_matrix 17 ordinal + one-hot features, 21,912 rows
propeller_team_usa archetype_kmeans BigQuery ML KMEANS model, k=6
propeller_team_usa cluster_assignments 21,912 rows + cluster_id (1–6)

The 6 archetypes (aerobic_engine, wheel_velocity, forge_hammer, aquatic_spectrum, combat_grace, precision_rotational) are human-labelled names stored in premigration/ml/archetypes/archetype_definitions.json.

Verify:

bq query --nouse_legacy_sql \
  "SELECT cluster_id, COUNT(*) as n
   FROM \`${PROJECT_ID}.propeller_team_usa.cluster_assignments\`
   GROUP BY 1 ORDER BY 1"
# Expected: 6 rows, total 21912

Step 4 — Local dev environment

# Copy and fill in your environment variables
cp .env.example .env

Edit .env:

PROJECT_ID=your-gcp-project-id          # GCP project with the BQ tables above
BIGQUERY_LOCATION=US
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=true          # routes Gemini calls through Vertex AI
GEMINI_PRO_MODEL=gemini-2.5-pro
GEMINI_CRITIC_MODEL=gemini-2.5-flash
GCS_CACHE_BUCKET=your-gcs-bucket-name   # bucket for Gemini fact-pack cache

Make sure Application Default Credentials are set:

gcloud auth application-default login

Run the bootstrap script (creates venv, installs deps, verifies artifacts, installs frontend deps, runs tests):

bash scripts/bootstrap.sh

If this is the first time on a fresh clone (no ML artifacts yet), the bootstrap will fail listing missing files. Continue to Step 5 first.


Step 5 — Run the offline ML pipeline

Skip this step if ml_offline/artifacts/ already contains sport_axis.json, tribe_centroids.json, tribe_definitions.json, and forum_seed.json (they are committed to the repo).

source .venv/bin/activate

# 1. Build sport axis table (101-entry hand-authored vectors, 75 LA28 sports)
uv run python -m ml_offline.build_sport_axis
# → writes ml_offline/artifacts/sport_axis.json

# 2. Derive per-athlete 6-axis positions (requires BigQuery access)
uv run python -m ml_offline.derive_athlete_axes
# → fetches athletes_aggregated + cluster_assignments from BQ
# → writes ml_offline/artifacts/athlete_axes.parquet  (80k rows)
# → uploads to GCS_CACHE_BUCKET/tribe28/athlete_axes.parquet

# 3. Fit 12 fan tribes via KMeans on the 6-axis positions
uv run python -m ml_offline.fit_tribes
# → writes ml_offline/artifacts/tribe_centroids.json
# → prints silhouette score (expected ≈ 0.6587)

# 4. Propose tribe names with Gemini (outputs need human review)
uv run python -m ml_offline.name_tribes
# → writes ml_offline/artifacts/tribe_name_proposals.json
# Review this file, then manually edit tribe_definitions.json
# (tribe_definitions.json is already committed with reviewed names)

# 5. Seed the forum
uv run python -m ml_offline.seed_forum
# → writes ml_offline/artifacts/forum_seed.json

# 6. Seed synthetic fan sessions in BigQuery (5,000 rows, random_state=42)
uv run python -m ml_offline.seed_fan_sessions
# → creates / truncates propeller_team_usa.fan_sessions
# → inserts 5,000 synthetic sessions as a community cold-start

Step 6 — Apply BigQuery views

The three SQL views must exist before the backend can serve requests:

source .venv/bin/activate

for sql in sql/010_v_sport_factpack.sql sql/020_v_tribe_definitions.sql sql/030_v_tribe_community_stats.sql; do
  echo "Applying $sql ..."
  sed "s/eternal-tempest-495921-m5/${PROJECT_ID}/g" "$sql" | \
    bq query --nouse_legacy_sql --project_id="${PROJECT_ID}"
done

Verify all three views exist:

bq ls --filter labels.type:view ${PROJECT_ID}:propeller_team_usa 2>/dev/null || \
bq query --nouse_legacy_sql \
  "SELECT table_name FROM \`${PROJECT_ID}.propeller_team_usa.INFORMATION_SCHEMA.VIEWS\`"

Step 7 — Start the backend

source .venv/bin/activate
uvicorn backend.main:app --reload --port 8000

Smoke-test:

curl http://localhost:8000/healthz
# {"status":"ok"}

curl -s http://localhost:8000/la28 | python3 -c "import json,sys; d=json.load(sys.stdin); print(len(d['sports']), 'sports')"
# 75 sports

Step 8 — Start the frontend dev server

cd frontend
npm run dev
# → http://localhost:5173  (proxies /api/* to localhost:8000)

Step 9 — Run the test suite

uv run pytest -q

Expected: 73 passed. Tests cover scoring, projection, tribes, composer/critic orchestration, compliance (NIL, finish-time guard, Paralympic parity), forum store, and HTTP smoke tests against every endpoint.


Step 10 — Deploy to Cloud Run

gcloud builds submit \
  --config cloudbuild.yaml \
  --substitutions=SHORT_SHA=$(git rev-parse --short HEAD)

cloudbuild.yaml runs a multi-stage Docker build (Vite → Python), pushes to Artifact Registry, and rolls out a new Cloud Run revision in us-central1. The ML artifacts (sport_axis.json, tribe_centroids.json, tribe_definitions.json, forum_seed.json) are baked into the image at build time via COPY ml_offline/artifacts/ ./ml_offline/artifacts/.

After deploy, verify:

SERVICE_URL=$(gcloud run services describe tribe28 --region=us-central1 --format='value(status.url)')
curl "${SERVICE_URL}/healthz"

Project structure

backend/
  agent/            composer (Gemini 2.5 Pro) + critic (Flash) + orchestrator + fallback templates
  api/              FastAPI routers: /tribe, /sport, /la28, /walkthrough, /forum
  community/        BigQuery aggregate helpers (k-anonymous)
  forum/            in-process anonymous forum store + moderation + NIL glyph
  ml/               projection (6 axes), scoring (hybrid), tribes (assignment)
  compliance.py     NIL banlist loader + check function
  schemas.py        Pydantic v2 API contracts
frontend/
  src/
    pages/          Entry, Walkthrough, TribeLoading, Reveal, Community,
                    Watchlist, Share, ForumIndex, TribeForum, ThreadPage,
                    Profile, Methodology
    components/     TribeGlyph, MiniConstellation, USStateMap, Thread, ComposeBox, …
    hooks/          useSession, useTribes, useWatchlist (custom-event live sync)
    api/            typed fetch clients per backend router
ml_offline/
  artifacts/        la28_program.json, sport_axis.json, tribe_centroids.json,
                    tribe_definitions.json, forum_seed.json, athlete_axes.parquet
  build_sport_axis.py   hand-authored 6-axis vectors for 101 sports
  derive_athlete_axes.py   BQ join → parquet
  fit_tribes.py     KMeans k=12 → centroids
  name_tribes.py    Gemini name proposals
  seed_fan_sessions.py  synthetic community cold-start
  seed_forum.py     Gemini-seeded forum threads
premigration/
  data/load/build.py          CSV → BigQuery loader (idempotent)
  ml/archetypes/train_kmeans.py  BigQuery ML KMEANS k=6
  ml/archetypes/tsne_layout.py   optional t-SNE layout
sql/                BigQuery view DDL (010 factpack, 020 definitions, 030 community stats)
scripts/bootstrap.sh  one-command local setup
tests/              73-test pytest suite

Compliance

Every constraint is enforced in code, not just documented:

  • No athlete NIL. 153k-name banlist checked on every model output; violations force the deterministic fallback. Forum posts use T01-001-style anon IDs.
  • No IOC/IPC/sponsor marks. No third-party logos — all visuals are abstract SVG glyphs.
  • No finish times or specific scoring results. Enforced in SportFactPack schema, the critic's grounding rules, and tests/test_compliance.py.
  • Paralympic parity. Scorer guarantees at least one Paralympic sport in every top-pick list; class codes appear as labels only, never interpreted.
  • US-scope only. NOC = 'USA' (Olympic) and npc/npc_new = 'USA' (Paralympic) filters hard-coded in premigration/data/load/build.py.
  • k-anonymity. All community aggregates gated by HAVING count(*) >= 10 at the BigQuery view level.
  • Public data only. Kaggle Olympic history (heesoo37), Kaggle Paralympic dataset 1960–2018 (arpitsinghaiml, converted from xlsx), and the official LA28 games programme at la28.org.

License

Apache 2.0 — see LICENSE.

About

Built for the Team USA x Google Cloud Hackathon; Licensed Under Apache 2.0

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors