From 22200c3e7b5ea2ab2d91eca08af9de107a3f9344 Mon Sep 17 00:00:00 2001
From: Ross Gardler
Date: Wed, 3 Dec 2025 20:46:19 -0800
Subject: [PATCH 1/3] Checkpoint from Copilot CLI for coding agent session

---
 .pm/tracker.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.pm/tracker.md b/.pm/tracker.md
index 1f0d07af..5c49f019 100644
--- a/.pm/tracker.md
+++ b/.pm/tracker.md
@@ -148,7 +148,7 @@
 **Recommended Next Tasks:**

-1. **9.4.1 - AI Tournaments & Balance Tooling** (Priority: MEDIUM, Effort: High)
+1. **9.4.1 - AI Tournaments & Balance Tooling** (Priority: MEDIUM, Effort: High)
    - Issue [#49](https://github.com/TheWizardsCode/GEngine/issues/49)
    - Why: Final Phase 9 task; all AI infrastructure complete (observer, actor, hybrid strategy)
    - Owner needed: Gamedev agent
    - Impact: Balance validation and AI testing at scale

From 850db887ee3a340931cbad34c7b0377715413e2b Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Thu, 4 Dec 2025 05:08:15 +0000
Subject: [PATCH 2/3] Implement AI Tournaments & Balance Tooling (Issue #49)

Co-authored-by: SorraTheOrc <250240+SorraTheOrc@users.noreply.github.com>
---
 .github/workflows/ai-tournament.yml           |  68 ++
 README.md                                     |  43 +-
 docs/gengine/how_to_play_echoes.md            | 167 +++++
 ...emergent_story_game_implementation_plan.md |  23 +-
 gamedev-agent-thoughts.txt                    | 107 +++
 scripts/analyze_ai_games.py                   | 610 ++++++++++++++++++
 scripts/run_ai_tournament.py                  | 510 +++++++++++++++
 tests/scripts/test_ai_analysis.py             | 502 ++++++++++++++
 tests/scripts/test_ai_tournament.py           | 419 ++++++++++++
 9 files changed, 2433 insertions(+), 16 deletions(-)
 create mode 100644 .github/workflows/ai-tournament.yml
 create mode 100644 scripts/analyze_ai_games.py
 create mode 100644 scripts/run_ai_tournament.py
 create mode 100644 tests/scripts/test_ai_analysis.py
 create mode 100644 tests/scripts/test_ai_tournament.py

diff --git a/.github/workflows/ai-tournament.yml b/.github/workflows/ai-tournament.yml
new file mode 100644
index 00000000..94127080
--- /dev/null
+++ b/.github/workflows/ai-tournament.yml
@@ -0,0 +1,68 @@
+name: AI Tournament
+
+on:
+  schedule:
+    # Run nightly at 2:00 AM UTC
+    - cron: '0 2 * * *'
+  workflow_dispatch:
+    inputs:
+      games:
+        description: 'Number of games to run'
+        required: false
+        default: '100'
+      ticks:
+        description: 'Ticks per game'
+        required: false
+        default: '100'
+
+permissions:
+  contents: read
+
+jobs:
+  tournament:
+    runs-on: ubuntu-latest
+
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -e ".[dev]"
+
+      - name: Run AI tournament
+        run: |
+          python scripts/run_ai_tournament.py \
+            --games ${{ github.event.inputs.games || '100' }} \
+            --ticks ${{ github.event.inputs.ticks || '100' }} \
+            --strategies balanced aggressive diplomatic \
+            --seed 42 \
+            --verbose \
+            --output build/tournament-results.json
+
+      - name: Analyze tournament results
+        run: |
+          python scripts/analyze_ai_games.py \
+            --input build/tournament-results.json \
+            --world default \
+            --output build/tournament-analysis.json
+
+      - name: Archive tournament results
+        uses: actions/upload-artifact@v4
+        with:
+          name: tournament-results-${{ github.run_id }}
+          path: |
+            build/tournament-results.json
+            build/tournament-analysis.json
+          retention-days: 90
+
+      - name: Print analysis summary
+        run: |
+          python scripts/analyze_ai_games.py \
+            --input build/tournament-results.json \
+            --world default
diff --git a/README.md b/README.md
index b33154c7..264a1803 100644
--- a/README.md
+++ b/README.md
@@ -1330,15 +1330,50 @@ service names as hostnames:
 - Gateway → Simulation: `http://simulation:8000`
 - Gateway → LLM: `http://llm:8001`

+## AI Tournaments & Balance Tooling
+
+Phase 9 M9.4 provides tournament infrastructure for automated balance testing:
+
+### Running Tournaments
+
+```bash
+# Run 100 games
with default strategies
+uv run python scripts/run_ai_tournament.py --games 100 --output build/tournament.json
+
+# Run with specific strategies and more ticks
+uv run python scripts/run_ai_tournament.py \
+  --games 50 --ticks 200 --strategies balanced aggressive diplomatic --verbose
+```
+
+### Analyzing Results
+
+```bash
+# Analyze tournament results
+uv run python scripts/analyze_ai_games.py --input build/tournament.json
+
+# Compare against authored story seeds
+uv run python scripts/analyze_ai_games.py --input build/tournament.json --world default
+```
+
+The analysis identifies:
+- Win rate deltas between strategies
+- Dominant or underperforming strategies
+- Unused story seeds
+- Overpowered actions
+
+### CI Integration
+
+The `.github/workflows/ai-tournament.yml` workflow runs nightly tournaments
+and archives results. Trigger manual runs via the GitHub Actions UI.
+
+See `docs/gengine/how_to_play_echoes.md` Section 13 for the complete balance
+iteration workflow.
+
 ## Next Steps

 1. **Phase 8 – Kubernetes Deployment** – create Kubernetes manifests for local
    minikube deployment, enabling multi-container orchestration and service
    discovery. Docker containerization is complete (see Docker section above).
-2. **Phase 9 M9.4 – AI tournaments and balance tooling** – create tournament
-   scripts that run multiple AI strategies in parallel, aggregate comparative
-   reports (win rates, stability curves, story seed coverage), and identify
-   balance outliers. See the Phase 9 section of the implementation plan.

 Progress is tracked in the implementation plan document; update this README as
 new phases land (CLI tooling, services, Kubernetes manifests, etc.).
diff --git a/docs/gengine/how_to_play_echoes.md b/docs/gengine/how_to_play_echoes.md
index db5f890d..1fa838f8 100644
--- a/docs/gengine/how_to_play_echoes.md
+++ b/docs/gengine/how_to_play_echoes.md
@@ -951,3 +951,170 @@ Post-mortem summary:

 The post-mortem is saved alongside the campaign data for later review.
Ended campaigns can still be resumed if you want to continue playing.
+
+## 13. AI Tournaments & Balance Tooling
+
+The repository includes AI tournament infrastructure for automated balance
+testing and validation. Tournaments run multiple AI players with different
+strategies in parallel, then aggregate results to identify balance anomalies.
+
+### Running Tournaments
+
+Run a tournament with default settings:
+
+```bash
+uv run python scripts/run_ai_tournament.py \
+  --games 100 --ticks 100 --output build/tournament.json
+```
+
+The tournament script supports several options:
+
+| Flag           | Description                                          |
+| -------------- | ---------------------------------------------------- |
+| `--games/-g`   | Total number of games to run (default: 100)          |
+| `--ticks/-t`   | Ticks per game (default: 100)                        |
+| `--strategies` | Strategies to test (balanced, aggressive, diplomatic)|
+| `--seed`       | Base random seed for deterministic runs (default: 42)|
+| `--workers`    | Max parallel workers (default: auto)                 |
+| `--output/-o`  | Path to write JSON results                           |
+| `--verbose/-v` | Print progress during tournament                     |
+
+Example output shows win rates and stability metrics per strategy:
+
+```
+================================================================================
+AI TOURNAMENT RESULTS
+================================================================================
+
+Games: 100/100 completed (0 failed)
+Total duration: 45.2s
+
+Strategy       Win Rate   Avg Stab   Min Stab   Max Stab  Avg Actions
+--------------------------------------------------------------------------------
+balanced          65.0%      0.720      0.450      1.000          5.2
+aggressive        72.0%      0.680      0.380      1.000          8.1
+diplomatic        58.0%      0.750      0.520      1.000          3.4
+--------------------------------------------------------------------------------
+```
+
+### Analyzing Results
+
+After running a tournament, analyze the results for balance insights:
+
+```bash
+uv run python scripts/analyze_ai_games.py \
+  --input build/tournament.json --world default
+```
+
+The analysis
script:
+
+- Compares win rates across strategies
+- Flags strategy imbalance (win rate delta > 15%) and dominant strategies (delta > 20%)
+- Flags unused or underused story seeds
+- Detects overpowered actions
+- Generates actionable recommendations
+
+Example analysis output:
+
+```
+================================================================================
+AI TOURNAMENT ANALYSIS REPORT
+================================================================================
+
+Tournament: 100 games, 100 ticks each
+Strategies: balanced, aggressive, diplomatic
+
+--------------------------------------------------------------------------------
+WIN RATE ANALYSIS
+--------------------------------------------------------------------------------
+Best strategy: aggressive (72.0%)
+Worst strategy: diplomatic (58.0%)
+Win rate delta: 14.0%
+Balance status: ✓ Balanced
+
+--------------------------------------------------------------------------------
+ACTION ANALYSIS
+--------------------------------------------------------------------------------
+Most used: INSPECT (450 times)
+Least used: NEGOTIATE (120 times)
+
+--------------------------------------------------------------------------------
+RECOMMENDATIONS
+--------------------------------------------------------------------------------
+1. No significant balance issues detected - system appears well-tuned
+================================================================================
+```
+
+### Balance Iteration Workflow
+
+When tuning game balance, follow this workflow:
+
+1. **Run baseline tournament**: Capture initial metrics with `--seed 42` for
+   reproducibility.
+
+   ```bash
+   uv run python scripts/run_ai_tournament.py \
+     --games 100 --output build/baseline.json --seed 42
+   ```
+
+2. **Analyze baseline**: Review strategy balance, action distribution, and seed
+   coverage.
+
+   ```bash
+   uv run python scripts/analyze_ai_games.py \
+     --input build/baseline.json --world default
+   ```
+
+3.
**Adjust parameters**: Based on analysis findings, modify config values in
+   `content/config/simulation.yml`:
+
+   - Strategy thresholds affect AI decision-making
+   - Economy settings influence resource pressure
+   - Director pacing controls narrative density
+
+4. **Run comparison tournament**: Use the same seed for deterministic comparison.
+
+   ```bash
+   uv run python scripts/run_ai_tournament.py \
+     --games 100 --output build/tuned.json --seed 42
+   ```
+
+5. **Compare results**: Diff the analysis reports to validate improvements.
+
+   ```bash
+   # Compare win rates between runs
+   uv run python scripts/analyze_ai_games.py --input build/baseline.json --json > /tmp/a.json
+   uv run python scripts/analyze_ai_games.py --input build/tuned.json --json > /tmp/b.json
+   diff /tmp/a.json /tmp/b.json
+   ```
+
+6. **Iterate**: Repeat steps 3-5 until balance metrics fall within acceptable
+   ranges.
+
+### CI Integration
+
+The repository includes a GitHub Actions workflow (`.github/workflows/ai-tournament.yml`)
+that runs nightly tournaments:
+
+- Executes 100 games with all strategies
+- Archives results as artifacts for 90 days
+- Prints analysis summary in the job log
+
+To trigger a manual tournament run, use the GitHub Actions UI and select
+"Run workflow" with optional game/tick counts.
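A CI job could also act on the archived analysis instead of only printing it. The sketch below is not part of this patch. It assumes the file written by `analyze_ai_games.py --output` contains an `anomalies` list whose entries carry the `severity` and `description` keys emitted by `BalanceAnomaly.to_dict()`; the names `SEVERITY_RANK`, `blocking_anomalies`, and `exit_code` are hypothetical.

```python
import json
from pathlib import Path
from typing import Any

# Ordering of the "low" / "medium" / "high" severities used by BalanceAnomaly.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def blocking_anomalies(
    report: dict[str, Any], threshold: str = "high"
) -> list[dict[str, Any]]:
    """Return the anomalies whose severity is at or above ``threshold``."""
    minimum = SEVERITY_RANK[threshold]
    return [
        anomaly
        for anomaly in report.get("anomalies", [])
        # Unknown or missing severities are treated as "low".
        if SEVERITY_RANK.get(anomaly.get("severity", "low"), 0) >= minimum
    ]


def exit_code(report: dict[str, Any], threshold: str = "high") -> int:
    """Return 1 when any blocking anomaly exists, otherwise 0."""
    found = blocking_anomalies(report, threshold)
    for anomaly in found:
        severity = str(anomaly.get("severity", "?")).upper()
        print(f"[{severity}] {anomaly.get('description', '')}")
    return 1 if found else 0


def load_report(path: Path) -> dict[str, Any]:
    """Read an analysis report (assumed to be UTF-8 JSON) from disk."""
    return json.loads(path.read_text(encoding="utf-8"))
```

A workflow step might call `sys.exit(exit_code(load_report(Path("build/tournament-analysis.json"))))` so that a `high` severity anomaly fails the nightly run. Whether failing the build is desirable is a project decision; passing `"medium"` as the threshold would make the gate stricter.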
+
+### Interpreting Anomalies
+
+The analysis script flags several types of balance issues:
+
+| Anomaly Type            | Severity | Meaning                                    |
+| ----------------------- | -------- | ------------------------------------------ |
+| `dominant_strategy`     | High     | Win rate delta (best vs. worst) above 20%  |
+| `strategy_imbalance`    | Medium   | Win rate delta between 15-20%              |
+| `dominant_action`       | Medium   | One action accounts for > 50% of all uses  |
+| `unused_story_seeds`    | High/Low | Story seeds never triggered during games   |
+| `low_seed_coverage`     | Medium   | Less than 50% of seeds were activated      |
+| `low_activity_strategy` | Low      | A strategy averages < 1 action per game    |
+
+Recommendations are generated automatically based on detected anomalies. Use
+them as starting points for parameter tuning rather than prescriptive fixes.
diff --git a/docs/simul/emergent_story_game_implementation_plan.md b/docs/simul/emergent_story_game_implementation_plan.md
index f793a02c..20ac9ca0 100644
--- a/docs/simul/emergent_story_game_implementation_plan.md
+++ b/docs/simul/emergent_story_game_implementation_plan.md
@@ -684,22 +684,21 @@ playthrough validation.
   - ✅ Documentation includes prompt and trade-off guidance
   - ✅ Telemetry distinguishes rule-based vs. LLM-driven decisions

-- **M9.4 AI tournaments and balance tooling** (1-2 days): Create
+- ✅ **M9.4 AI tournaments and balance tooling** (COMPLETED): Created
   `scripts/run_ai_tournament.py` that executes N parallel games with varied AI
-  strategies, world configs, and random seeds. Aggregate results into
+  strategies, world configs, and random seeds. Aggregates results into
   comparative reports (win rates, average stability curves, story seed coverage,
-  resource efficiency). Add analysis scripts under `scripts/analyze_ai_games.py`
-  that identify dominant strategies, balance outliers, and underutilized content
-.
-  Document tournament workflow and balance iteration loops in the README.
+  resource efficiency).
Added `scripts/analyze_ai_games.py` that identifies
+  dominant strategies, balance outliers, and underutilized content. Documented
+  tournament workflow and balance iteration loops in gameplay guide.

-  **Acceptance Criteria:**
+  **Acceptance Criteria (all met):**

-  - Tournament script runs 100+ games in parallel with configurable strategies
-  - Comparative reports surface win rate deltas and balance anomalies
-  - Analysis identifies unused story seeds or overpowered actions
-  - Documentation guides designers through balance iteration workflow
-  - CI integration runs nightly tournaments and archives results
+  - ✅ Tournament script runs 100+ games in parallel with configurable strategies
+  - ✅ Comparative reports surface win rate deltas and balance anomalies
+  - ✅ Analysis identifies unused story seeds or overpowered actions
+  - ✅ Documentation guides designers through balance iteration workflow
+  - ✅ CI integration runs nightly tournaments and archives results

 **Phase 9 Dependencies:**

diff --git a/gamedev-agent-thoughts.txt b/gamedev-agent-thoughts.txt
index ede0feff..221d9a8f 100644
--- a/gamedev-agent-thoughts.txt
+++ b/gamedev-agent-thoughts.txt
@@ -610,3 +610,110 @@ All acceptance criteria for Issue #34 are met.
 5. ✅ Documentation distinguishes health vs. metrics endpoints with example responses

 ### Task Complete: Issue #39 - Dedicated Metrics Endpoints for Gateway and LLM Services
+
+
+## Issue #49 - AI Tournaments & Balance Tooling (Task 9.4.1) - 2025-12-04T04:49 UTC
+=========================================================
+
+### Initial Assessment
+- Working branch: copilot/resulting-meadowlark
+- Repository synced and clean
+- All 822 tests pass
+- Coverage: 92.12%
+
+### Acceptance Criteria from Issue:
+1. Tournament script runs 100+ games in parallel with configurable strategies
+2. Comparative reports surface win rate deltas and balance anomalies
+3. Analysis identifies unused story seeds or overpowered actions
+4.
Documentation guides designers through balance iteration workflow
+5. CI integration runs nightly tournaments and archives results
+
+### Implementation Plan:
+
+1. **scripts/run_ai_tournament.py** - Tournament runner
+   - Execute N parallel games using ProcessPoolExecutor
+   - Support configurable strategies: BALANCED, AGGRESSIVE, DIPLOMATIC, HYBRID
+   - Support different world configs and random seeds
+   - Capture per-game telemetry: final stability, story seed activations, actions taken
+   - Aggregate results into structured JSON reports
+   - CLI interface similar to run_difficulty_sweeps.py
+
+2. **scripts/analyze_ai_games.py** - Analysis script
+   - Read tournament results JSON
+   - Calculate win rates per strategy (based on final stability threshold)
+   - Identify story seeds that never triggered
+   - Flag dominant strategies and balance outliers
+   - Generate human-readable summary reports
+
+3. **tests/ai_player/test_tournament.py** - Tournament tests
+   - Test tournament runner with small game count
+   - Test result aggregation
+   - Test parallel execution
+
+4. **tests/ai_player/test_analysis.py** - Analysis tests
+   - Test win rate calculation
+   - Test unused seed detection
+   - Test outlier detection
+
+5. **.github/workflows/ai-tournament.yml** - CI workflow
+   - Nightly tournament runs
+   - Archive results as artifacts
+
+6.
**Documentation Updates**
+   - docs/gengine/how_to_play_echoes.md - balance iteration workflow
+   - docs/simul/emergent_story_game_implementation_plan.md - mark M9.4 complete
+
+### Implementation Log:
+
+[2025-12-04T04:49 UTC] Started implementation
+
+**Step 1: Created scripts/run_ai_tournament.py**
+- Tournament runner with ProcessPoolExecutor for parallel game execution
+- GameResult, TournamentConfig, TournamentReport dataclasses
+- run_single_game() function for isolated game execution
+- run_tournament() with stats aggregation per strategy
+- CLI interface with --games, --ticks, --strategies, --seed, --output, --verbose
+
+**Step 2: Created scripts/analyze_ai_games.py**
+- Analysis module for tournament results
+- BalanceAnomaly, AnalysisReport dataclasses
+- analyze_win_rates(), analyze_actions(), analyze_story_seeds() functions
+- detect_anomalies() for identifying balance issues
+- generate_recommendations() for actionable guidance
+- CLI interface with --input, --world, --output, --json
+
+**Step 3: Created tests/scripts/test_ai_tournament.py**
+- 19 tests covering GameResult, TournamentConfig, TournamentReport
+- run_single_game and run_tournament tests
+- Tournament determinism tests
+- CLI tests
+
+**Step 4: Created tests/scripts/test_ai_analysis.py**
+- 29 tests covering BalanceAnomaly, AnalysisReport
+- Win rate, action, and story seed analysis tests
+- Anomaly detection and recommendation tests
+- CLI tests
+
+**Step 5: Created .github/workflows/ai-tournament.yml**
+- Nightly tournament workflow (2:00 AM UTC)
+- Manual trigger with configurable games/ticks
+- Archives results for 90 days
+
+**Step 6: Updated documentation**
+- docs/gengine/how_to_play_echoes.md: Added Section 13 (AI Tournaments & Balance Tooling)
+- docs/simul/emergent_story_game_implementation_plan.md: Marked M9.4 complete
+- README.md: Added AI Tournaments section
+
+**Test Results:**
+- All 870 tests pass (822 existing + 48 new)
+- Coverage: 92.10%
+- Linting: All checks passed
+
+### Task 9.4.1 Status: COMPLETED
+
+All acceptance criteria met:
+✅ Tournament script runs 100+ games in parallel with configurable strategies
+✅ Comparative reports surface win rate deltas and balance anomalies
+✅ Analysis identifies unused story seeds or overpowered actions
+✅ Documentation guides designers through balance iteration workflow
+✅ CI integration runs nightly tournaments and archives results
diff --git a/scripts/analyze_ai_games.py b/scripts/analyze_ai_games.py
new file mode 100644
index 00000000..5c52ab25
--- /dev/null
+++ b/scripts/analyze_ai_games.py
@@ -0,0 +1,610 @@
+#!/usr/bin/env python3
+"""Analyze AI tournament results for balance and coverage insights.
+
+Reads tournament results JSON and generates reports identifying:
+- Win rate comparisons across strategies
+- Balance anomalies (dominant strategies, overpowered actions)
+- Story seed coverage (unused/underused seeds)
+- Action distribution patterns
+
+Examples
+--------
+Analyze a tournament results file::
+
+    uv run python scripts/analyze_ai_games.py --input build/tournament.json
+
+Generate detailed report::
+
+    uv run python scripts/analyze_ai_games.py \\
+        --input build/tournament.json --verbose --output build/analysis.json
+
+Compare against authored story seeds::
+
+    uv run python scripts/analyze_ai_games.py \\
+        --input build/tournament.json --world default
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Sequence
+
+
+@dataclass
+class BalanceAnomaly:
+    """Represents a detected balance issue."""
+
+    anomaly_type: str
+    severity: str  # "low", "medium", "high"
+    description: str
+    data: dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "type": self.anomaly_type,
+            "severity": self.severity,
+            "description": self.description,
+            "data": self.data,
+        }
+
+
+@dataclass
+class AnalysisReport:
+    """Complete
analysis report from tournament data."""
+
+    tournament_config: dict[str, Any]
+    strategy_comparison: dict[str, dict[str, Any]]
+    win_rate_analysis: dict[str, Any]
+    action_analysis: dict[str, Any]
+    story_seed_analysis: dict[str, Any]
+    anomalies: list[BalanceAnomaly]
+    recommendations: list[str]
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "tournament_config": self.tournament_config,
+            "strategy_comparison": self.strategy_comparison,
+            "win_rate_analysis": self.win_rate_analysis,
+            "action_analysis": self.action_analysis,
+            "story_seed_analysis": self.story_seed_analysis,
+            "anomalies": [a.to_dict() for a in self.anomalies],
+            "recommendations": self.recommendations,
+        }
+
+
+def load_tournament_results(path: Path) -> dict[str, Any]:
+    """Load tournament results from JSON file."""
+    with open(path) as f:
+        return json.load(f)
+
+
+def get_authored_story_seeds(world: str) -> list[str]:
+    """Load story seed IDs from the world content."""
+    try:
+        import yaml
+
+        seeds_path = Path(f"content/worlds/{world}/story_seeds.yml")
+        if not seeds_path.exists():
+            return []
+
+        with open(seeds_path) as f:
+            data = yaml.safe_load(f)
+            if not data or "seeds" not in data:
+                return []
+            return [s.get("id", "") for s in data.get("seeds", []) if s.get("id")]
+    except Exception:
+        return []
+
+
+def analyze_win_rates(strategy_stats: dict[str, dict[str, Any]]) -> dict[str, Any]:
+    """Analyze win rate patterns across strategies."""
+    win_rates = {s: stats.get("win_rate", 0.0) for s, stats in strategy_stats.items()}
+
+    if not win_rates:
+        return {"error": "No strategy data available"}
+
+    sorted_rates = sorted(win_rates.items(), key=lambda x: x[1], reverse=True)
+    best_strategy, best_rate = sorted_rates[0]
+    worst_strategy, worst_rate = sorted_rates[-1]
+    win_rate_delta = best_rate - worst_rate
+
+    avg_win_rate = sum(win_rates.values()) / len(win_rates)
+    variance = sum((r - avg_win_rate) ** 2 for r in win_rates.values()) / len(win_rates)
+
+    return {
+        "win_rates": {k:
round(v, 4) for k, v in win_rates.items()},
+        "best_strategy": best_strategy,
+        "best_win_rate": round(best_rate, 4),
+        "worst_strategy": worst_strategy,
+        "worst_win_rate": round(worst_rate, 4),
+        "win_rate_delta": round(win_rate_delta, 4),
+        "average_win_rate": round(avg_win_rate, 4),
+        "win_rate_variance": round(variance, 6),
+        "is_balanced": win_rate_delta < 0.15,
+    }
+
+
+def analyze_actions(strategy_stats: dict[str, dict[str, Any]]) -> dict[str, Any]:
+    """Analyze action usage patterns across strategies."""
+    # Aggregate action counts
+    total_actions: dict[str, int] = {}
+    action_by_strategy: dict[str, dict[str, int]] = {}
+
+    for strategy, stats in strategy_stats.items():
+        breakdown = stats.get("action_breakdown", {})
+        action_by_strategy[strategy] = breakdown
+        for action, count in breakdown.items():
+            total_actions[action] = total_actions.get(action, 0) + count
+
+    if not total_actions:
+        return {"error": "No action data available"}
+
+    # Find most/least used actions
+    sorted_actions = sorted(total_actions.items(), key=lambda x: x[1], reverse=True)
+    most_used = sorted_actions[0] if sorted_actions else ("none", 0)
+    least_used = sorted_actions[-1] if sorted_actions else ("none", 0)
+
+    # Calculate action dominance
+    total = sum(total_actions.values())
+    action_percentages = (
+        {a: c / total for a, c in total_actions.items()} if total else {}
+    )
+
+    return {
+        "total_actions": total_actions,
+        "action_by_strategy": action_by_strategy,
+        "most_used_action": most_used[0],
+        "most_used_count": most_used[1],
+        "least_used_action": least_used[0],
+        "least_used_count": least_used[1],
+        "action_percentages": {k: round(v, 4) for k, v in action_percentages.items()},
+        "dominant_action": (
+            most_used[0] if action_percentages.get(most_used[0], 0) > 0.5 else None
+        ),
+    }
+
+
+def analyze_story_seeds(
+    results: dict[str, Any],
+    authored_seeds: list[str] | None = None,
+) -> dict[str, Any]:
+    """Analyze story seed activation patterns."""
+    seen_seeds =
set(results.get("all_story_seeds_seen", []))
+
+    # Aggregate per-game seed data
+    seed_counts: dict[str, int] = {}
+    total_games = 0
+
+    for _strategy, games in results.get("games", {}).items():
+        for game in games:
+            if game.get("error") is None:
+                total_games += 1
+                for seed in game.get("story_seeds_activated", []):
+                    seed_counts[seed] = seed_counts.get(seed, 0) + 1
+
+    # Compare against authored seeds
+    unused_seeds: list[str] = []
+    if authored_seeds:
+        unused_seeds = [s for s in authored_seeds if s not in seen_seeds]
+
+    # Calculate activation rates
+    activation_rates = {}
+    if total_games > 0:
+        activation_rates = {s: c / total_games for s, c in seed_counts.items()}
+
+    return {
+        "seeds_seen": sorted(seen_seeds),
+        "seed_counts": seed_counts,
+        "activation_rates": {k: round(v, 4) for k, v in activation_rates.items()},
+        "total_games_analyzed": total_games,
+        "authored_seeds": authored_seeds or [],
+        "unused_seeds": unused_seeds,
+        "coverage_rate": (
+            len(seen_seeds) / len(authored_seeds)
+            if authored_seeds
+            else 1.0
+        ),
+    }
+
+
+def detect_anomalies(
+    win_rate_analysis: dict[str, Any],
+    action_analysis: dict[str, Any],
+    story_seed_analysis: dict[str, Any],
+    strategy_stats: dict[str, dict[str, Any]],
+) -> list[BalanceAnomaly]:
+    """Detect balance anomalies from analysis results."""
+    anomalies: list[BalanceAnomaly] = []
+
+    # Check for dominant strategy
+    if win_rate_analysis.get("win_rate_delta", 0) > 0.2:
+        anomalies.append(
+            BalanceAnomaly(
+                anomaly_type="dominant_strategy",
+                severity="high",
+                description=(
+                    f"Strategy '{win_rate_analysis['best_strategy']}' has "
+                    f"significantly higher win rate "
+                    f"({win_rate_analysis['best_win_rate']:.1%}) "
+                    f"than '{win_rate_analysis['worst_strategy']}' "
+                    f"({win_rate_analysis['worst_win_rate']:.1%})"
+                ),
+                data={
+                    "best_strategy": win_rate_analysis["best_strategy"],
+                    "worst_strategy": win_rate_analysis["worst_strategy"],
+                    "delta": win_rate_analysis["win_rate_delta"],
+                },
+            )
+        )
+    elif
win_rate_analysis.get("win_rate_delta", 0) > 0.15:
+        anomalies.append(
+            BalanceAnomaly(
+                anomaly_type="strategy_imbalance",
+                severity="medium",
+                description=(
+                    f"Moderate win rate gap between strategies "
+                    f"({win_rate_analysis['win_rate_delta']:.1%})"
+                ),
+                data={"delta": win_rate_analysis["win_rate_delta"]},
+            )
+        )
+
+    # Check for dominant action
+    dominant_action = action_analysis.get("dominant_action")
+    if dominant_action:
+        pct = action_analysis["action_percentages"].get(dominant_action, 0)
+        anomalies.append(
+            BalanceAnomaly(
+                anomaly_type="dominant_action",
+                severity="medium",
+                description=(
+                    f"Action '{dominant_action}' accounts for {pct:.1%} of all actions"
+                ),
+                data={
+                    "action": dominant_action,
+                    "percentage": pct,
+                },
+            )
+        )
+
+    # Check for unused story seeds
+    unused = story_seed_analysis.get("unused_seeds", [])
+    if unused:
+        severity = "high" if len(unused) > 2 else "low"
+        anomalies.append(
+            BalanceAnomaly(
+                anomaly_type="unused_story_seeds",
+                severity=severity,
+                description=(
+                    f"{len(unused)} story seeds never activated: "
+                    f"{', '.join(unused)}"
+                ),
+                data={"unused_seeds": unused},
+            )
+        )
+
+    # Check for low seed coverage
+    coverage = story_seed_analysis.get("coverage_rate", 1.0)
+    if coverage < 0.5 and story_seed_analysis.get("authored_seeds"):
+        anomalies.append(
+            BalanceAnomaly(
+                anomaly_type="low_seed_coverage",
+                severity="medium",
+                description=f"Only {coverage:.1%} of story seeds were activated",
+                data={"coverage_rate": coverage},
+            )
+        )
+
+    # Check for strategy with very low action count
+    for strategy, stats in strategy_stats.items():
+        avg_actions = stats.get("avg_actions", 0)
+        if avg_actions < 1.0:
+            anomalies.append(
+                BalanceAnomaly(
+                    anomaly_type="low_activity_strategy",
+                    severity="low",
+                    description=(
+                        f"Strategy '{strategy}' averages only {avg_actions:.1f} "
+                        "actions per game"
+                    ),
+                    data={"strategy": strategy, "avg_actions": avg_actions},
+                )
+            )
+
+    return anomalies
+
+
+def
generate_recommendations(
+    anomalies: list[BalanceAnomaly],
+    win_rate_analysis: dict[str, Any],
+    action_analysis: dict[str, Any],
+    story_seed_analysis: dict[str, Any],
+) -> list[str]:
+    """Generate actionable recommendations based on analysis."""
+    recommendations: list[str] = []
+
+    # Strategy balance recommendations
+    if not win_rate_analysis.get("is_balanced", True):
+        best = win_rate_analysis.get("best_strategy", "")
+        worst = win_rate_analysis.get("worst_strategy", "")
+        recommendations.append(
+            f"Consider buffing '{worst}' strategy or adding constraints to '{best}'"
+        )
+
+    # Action balance recommendations
+    dominant = action_analysis.get("dominant_action")
+    if dominant:
+        recommendations.append(
+            f"Review effectiveness of '{dominant}' action - may be overpowered"
+        )
+
+    least_used = action_analysis.get("least_used_action")
+    if least_used and action_analysis.get("least_used_count", 0) < 5:
+        recommendations.append(
+            f"Action '{least_used}' is rarely used - consider making it more attractive"
+        )
+
+    # Story seed recommendations
+    unused = story_seed_analysis.get("unused_seeds", [])
+    if unused:
+        recommendations.append(
+            f"Review trigger conditions for unused seeds: {', '.join(unused[:3])}"
+        )
+
+    coverage = story_seed_analysis.get("coverage_rate", 1.0)
+    if coverage < 0.7 and story_seed_analysis.get("authored_seeds"):
+        recommendations.append(
+            "Consider increasing game length or lowering seed activation thresholds "
+            "to improve coverage"
+        )
+
+    # General recommendations
+    if not recommendations:
+        recommendations.append(
+            "No significant balance issues detected - system appears well-tuned"
+        )
+
+    return recommendations
+
+
+def analyze_tournament(
+    results: dict[str, Any],
+    authored_seeds: list[str] | None = None,
+) -> AnalysisReport:
+    """Perform complete analysis on tournament results.
+
+    Parameters
+    ----------
+    results
+        Tournament results dictionary (from JSON file).
+    authored_seeds
+        Optional list of all authored story seed IDs for coverage comparison.
+
+    Returns
+    -------
+    AnalysisReport
+        Complete analysis with findings, anomalies, and recommendations.
+    """
+    strategy_stats = results.get("strategy_stats", {})
+
+    win_rate_analysis = analyze_win_rates(strategy_stats)
+    action_analysis = analyze_actions(strategy_stats)
+    story_seed_analysis = analyze_story_seeds(results, authored_seeds)
+
+    anomalies = detect_anomalies(
+        win_rate_analysis,
+        action_analysis,
+        story_seed_analysis,
+        strategy_stats,
+    )
+
+    recommendations = generate_recommendations(
+        anomalies,
+        win_rate_analysis,
+        action_analysis,
+        story_seed_analysis,
+    )
+
+    return AnalysisReport(
+        tournament_config=results.get("config", {}),
+        strategy_comparison={
+            strategy: {
+                "win_rate": stats.get("win_rate", 0.0),
+                "avg_stability": stats.get("avg_stability", 0.0),
+                "avg_actions": stats.get("avg_actions", 0.0),
+                "games_completed": stats.get("games_completed", 0),
+            }
+            for strategy, stats in strategy_stats.items()
+        },
+        win_rate_analysis=win_rate_analysis,
+        action_analysis=action_analysis,
+        story_seed_analysis=story_seed_analysis,
+        anomalies=anomalies,
+        recommendations=recommendations,
+    )
+
+
+def print_analysis_report(report: AnalysisReport) -> None:
+    """Print a human-readable analysis report."""
+    print("\n" + "=" * 80)
+    print("AI TOURNAMENT ANALYSIS REPORT")
+    print("=" * 80)
+
+    # Tournament config
+    config = report.tournament_config
+    print(f"\nTournament: {config.get('num_games', 0)} games, "
+          f"{config.get('ticks_per_game', 0)} ticks each")
+    print(f"Strategies: {', '.join(config.get('strategies', []))}")
+
+    # Strategy comparison
+    print("\n" + "-" * 80)
+    print("STRATEGY COMPARISON")
+    print("-" * 80)
+    print(
+        f"{'Strategy':<12} {'Win Rate':>10} {'Avg Stab':>10} "
+        f"{'Avg Actions':>12} {'Games':>8}"
+    )
+    print("-" * 52)
+
+    for strategy, stats in report.strategy_comparison.items():
+        print(
+            f"{strategy:<12}
{stats['win_rate']:>10.1%} " + f"{stats['avg_stability']:>10.3f} " + f"{stats['avg_actions']:>12.1f} {stats['games_completed']:>8}" + ) + + # Win rate analysis + wra = report.win_rate_analysis + print("\n" + "-" * 80) + print("WIN RATE ANALYSIS") + print("-" * 80) + print( + f"Best strategy: {wra.get('best_strategy', 'N/A')} " + f"({wra.get('best_win_rate', 0):.1%})" + ) + print( + f"Worst strategy: {wra.get('worst_strategy', 'N/A')} " + f"({wra.get('worst_win_rate', 0):.1%})" + ) + print(f"Win rate delta: {wra.get('win_rate_delta', 0):.1%}") + balanced_str = "✓ Balanced" if wra.get("is_balanced") else "⚠ Imbalanced" + print(f"Balance status: {balanced_str}") + + # Action analysis + aa = report.action_analysis + print("\n" + "-" * 80) + print("ACTION ANALYSIS") + print("-" * 80) + most_used = aa.get("most_used_action", "N/A") + most_count = aa.get("most_used_count", 0) + print(f"Most used: {most_used} ({most_count} times)") + least_used = aa.get("least_used_action", "N/A") + least_count = aa.get("least_used_count", 0) + print(f"Least used: {least_used} ({least_count} times)") + if aa.get("dominant_action"): + print(f"⚠ Dominant action: {aa['dominant_action']}") + + # Story seed analysis + ssa = report.story_seed_analysis + print("\n" + "-" * 80) + print("STORY SEED COVERAGE") + print("-" * 80) + print(f"Seeds activated: {len(ssa.get('seeds_seen', []))}") + if ssa.get("authored_seeds"): + print(f"Authored seeds: {len(ssa['authored_seeds'])}") + print(f"Coverage rate: {ssa.get('coverage_rate', 0):.1%}") + if ssa.get("unused_seeds"): + print(f"Unused seeds: {', '.join(ssa['unused_seeds'])}") + + # Anomalies + if report.anomalies: + print("\n" + "-" * 80) + print("DETECTED ANOMALIES") + print("-" * 80) + for anomaly in report.anomalies: + severity_icon = {"low": "ℹ", "medium": "⚠", "high": "❌"}.get( + anomaly.severity, "?" 
+ ) + print(f"{severity_icon} [{anomaly.severity.upper()}] {anomaly.description}") + + # Recommendations + print("\n" + "-" * 80) + print("RECOMMENDATIONS") + print("-" * 80) + for i, rec in enumerate(report.recommendations, 1): + print(f"{i}. {rec}") + + print("\n" + "=" * 80) + + +def main(argv: Sequence[str] | None = None) -> int: + """CLI entry point for analyzing AI tournament results.""" + parser = argparse.ArgumentParser( + description="Analyze AI tournament results for balance and coverage insights.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Analyze tournament results + uv run python scripts/analyze_ai_games.py --input build/tournament.json + + # Compare against authored story seeds + uv run python scripts/analyze_ai_games.py \\ + --input build/tournament.json --world default + + # Save analysis to JSON + uv run python scripts/analyze_ai_games.py \\ + --input build/tournament.json -o build/analysis.json +""", + ) + parser.add_argument( + "--input", + "-i", + type=Path, + required=True, + help="Path to tournament results JSON file", + ) + parser.add_argument( + "--world", + "-w", + default=None, + help="World name to load authored story seeds for coverage comparison", + ) + parser.add_argument( + "--output", + "-o", + type=Path, + default=None, + help="Path to write JSON analysis report", + ) + parser.add_argument( + "--json", + action="store_true", + help="Output as JSON instead of human-readable report", + ) + parser.add_argument( + "--verbose", + "-v", + action="store_true", + help="Include detailed data in output", + ) + + args = parser.parse_args(argv) + + if not args.input.exists(): + sys.stderr.write(f"Error: Input file not found: {args.input}\n") + return 1 + + results = load_tournament_results(args.input) + + # Load authored seeds if world specified + authored_seeds = None + if args.world: + authored_seeds = get_authored_story_seeds(args.world) + if args.verbose: + sys.stderr.write( + f"Loaded 
{len(authored_seeds)} authored story seeds " + f"from '{args.world}'\n" + ) + + report = analyze_tournament(results, authored_seeds) + + if args.output: + args.output.parent.mkdir(parents=True, exist_ok=True) + args.output.write_text(json.dumps(report.to_dict(), indent=2, sort_keys=True)) + if args.verbose: + sys.stderr.write(f"Analysis written to {args.output}\n") + + if args.json: + print(json.dumps(report.to_dict(), indent=2, sort_keys=True)) + else: + print_analysis_report(report) + + return 0 + + +if __name__ == "__main__": # pragma: no cover + raise SystemExit(main()) diff --git a/scripts/run_ai_tournament.py b/scripts/run_ai_tournament.py new file mode 100644 index 00000000..39558799 --- /dev/null +++ b/scripts/run_ai_tournament.py @@ -0,0 +1,510 @@ +#!/usr/bin/env python3 +"""Run AI tournaments with parallel games and varied strategies. + +Executes N parallel games with varied AI strategies, world configs, and seeds +to produce comparative balance reports. + +Examples +-------- +Basic tournament with default settings:: + + uv run python scripts/run_ai_tournament.py \\ + --games 100 --output build/tournament.json + +Tournament with specific strategies:: + + uv run python scripts/run_ai_tournament.py \\ + --games 50 --strategies balanced aggressive --ticks 200 + +Verbose mode with progress output:: + + uv run python scripts/run_ai_tournament.py --games 20 --verbose +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +from concurrent.futures import ProcessPoolExecutor, as_completed +from dataclasses import dataclass, field +from pathlib import Path +from time import perf_counter +from typing import Any, Sequence + +# Set environment to avoid import issues in worker processes +os.environ.setdefault("ECHOES_CONFIG_ROOT", "content/config") + + +@dataclass +class GameResult: + """Result from a single tournament game.""" + + game_id: int + strategy: str + seed: int + ticks_run: int + final_stability: float + actions_taken: int +
story_seeds_activated: list[str] = field(default_factory=list) + action_counts: dict[str, int] = field(default_factory=dict) + duration_seconds: float = 0.0 + error: str | None = None + + def to_dict(self) -> dict[str, Any]: + return { + "game_id": self.game_id, + "strategy": self.strategy, + "seed": self.seed, + "ticks_run": self.ticks_run, + "final_stability": round(self.final_stability, 4), + "actions_taken": self.actions_taken, + "story_seeds_activated": self.story_seeds_activated, + "action_counts": self.action_counts, + "duration_seconds": round(self.duration_seconds, 3), + "error": self.error, + } + + +@dataclass +class TournamentConfig: + """Configuration for running a tournament.""" + + num_games: int = 100 + ticks_per_game: int = 100 + strategies: list[str] = field( + default_factory=lambda: ["balanced", "aggressive", "diplomatic"] + ) + base_seed: int = 42 + world: str = "default" + max_workers: int | None = None + stability_win_threshold: float = 0.5 + + +@dataclass +class TournamentReport: + """Aggregated report from a tournament.""" + + config: dict[str, Any] + total_games: int + completed_games: int + failed_games: int + games_by_strategy: dict[str, list[GameResult]] + strategy_stats: dict[str, dict[str, Any]] + all_story_seeds: set[str] + unused_story_seeds: list[str] + total_duration_seconds: float + + def to_dict(self) -> dict[str, Any]: + return { + "config": self.config, + "total_games": self.total_games, + "completed_games": self.completed_games, + "failed_games": self.failed_games, + "strategy_stats": self.strategy_stats, + "all_story_seeds_seen": sorted(self.all_story_seeds), + "unused_story_seeds": self.unused_story_seeds, + "total_duration_seconds": round(self.total_duration_seconds, 2), + "games": { + strategy: [g.to_dict() for g in games] + for strategy, games in self.games_by_strategy.items() + }, + } + + +def run_single_game( + game_id: int, + strategy_name: str, + seed: int, + ticks: int, + world: str, +) -> GameResult: + """Run a 
single game with the given parameters. + + This function is designed to be called in a separate process. + """ + start_time = perf_counter() + try: + # Import inside function for process isolation + from gengine.ai_player import ActorConfig, AIActor + from gengine.ai_player.strategies import StrategyType + from gengine.echoes.sim import SimEngine + + # Map strategy name to type + strategy_map = { + "balanced": StrategyType.BALANCED, + "aggressive": StrategyType.AGGRESSIVE, + "diplomatic": StrategyType.DIPLOMATIC, + "hybrid": StrategyType.HYBRID, + } + strategy_type = strategy_map.get(strategy_name.lower(), StrategyType.BALANCED) + + # Initialize engine state for the requested world + engine = SimEngine() + engine.initialize_state(world=world) + + # Set seed by advancing one tick + engine.advance_ticks(1, seed=seed) + + # Create actor with config + config = ActorConfig( + strategy_type=strategy_type, + tick_budget=ticks, + analysis_interval=10, + log_decisions=False, + ) + actor = AIActor(engine=engine, config=config) + + # Run the game + report = actor.run() + + # Extract story seeds from final state + final_state = engine.query_view("summary") + story_seeds = [] + seed_data = final_state.get("story_seeds", []) + if isinstance(seed_data, list): + for seed_info in seed_data: + if isinstance(seed_info, dict): + seed_id = seed_info.get("seed_id") or seed_info.get("id", "unknown") + story_seeds.append(seed_id) + + duration = perf_counter() - start_time + + return GameResult( + game_id=game_id, + strategy=strategy_name, + seed=seed, + ticks_run=report.ticks_run, + final_stability=report.final_stability, + actions_taken=report.actions_taken, + story_seeds_activated=story_seeds, + action_counts=report.telemetry.get("action_counts", {}), + duration_seconds=duration, + ) + + except Exception as e: + duration = perf_counter() - start_time + return GameResult( + game_id=game_id, + strategy=strategy_name, + seed=seed, + ticks_run=0, + final_stability=0.0, + actions_taken=0, +
duration_seconds=duration, + error=str(e), + ) + + +def run_tournament( + config: TournamentConfig, + verbose: bool = False, +) -> TournamentReport: + """Run a complete tournament with the given configuration. + + Parameters + ---------- + config + Tournament configuration. + verbose + If True, print progress to stderr. + + Returns + ------- + TournamentReport + Aggregated results from all games. + """ + start_time = perf_counter() + + # Build list of game tasks + tasks: list[tuple[int, str, int, int, str]] = [] + game_id = 0 + games_per_strategy = config.num_games // len(config.strategies) + remainder = config.num_games % len(config.strategies) + + for i, strategy in enumerate(config.strategies): + num_games = games_per_strategy + (1 if i < remainder else 0) + for _j in range(num_games): + seed = config.base_seed + game_id + tasks.append((game_id, strategy, seed, config.ticks_per_game, config.world)) + game_id += 1 + + if verbose: + sys.stderr.write( + f"Starting tournament: {len(tasks)} games, " + f"{len(config.strategies)} strategies, " + f"{config.ticks_per_game} ticks each\n" + ) + + # Run games in parallel + results: list[GameResult] = [] + completed = 0 + + max_workers = config.max_workers or min(4, os.cpu_count() or 1) + + with ProcessPoolExecutor(max_workers=max_workers) as executor: + futures = { + executor.submit( + run_single_game, gid, strategy, seed, ticks, world + ): gid + for gid, strategy, seed, ticks, world in tasks + } + + for future in as_completed(futures): + result = future.result() + results.append(result) + completed += 1 + + if verbose and completed % 10 == 0: + sys.stderr.write( + f"Progress: {completed}/{len(tasks)} games completed\n" + ) + + # Aggregate results by strategy + games_by_strategy: dict[str, list[GameResult]] = {} + for strategy in config.strategies: + games_by_strategy[strategy] = [r for r in results if r.strategy == strategy] + + # Calculate statistics per strategy + strategy_stats: dict[str, dict[str, Any]] = {} + 
all_story_seeds: set[str] = set() + + for strategy, games in games_by_strategy.items(): + successful = [g for g in games if g.error is None] + failed = [g for g in games if g.error is not None] + + stabilities = [g.final_stability for g in successful] + threshold = config.stability_win_threshold + wins = [g for g in successful if g.final_stability >= threshold] + total_actions = sum(g.actions_taken for g in successful) + + # Collect story seeds + for g in successful: + all_story_seeds.update(g.story_seeds_activated) + + # Action breakdown + action_totals: dict[str, int] = {} + for g in successful: + for action, count in g.action_counts.items(): + action_totals[action] = action_totals.get(action, 0) + count + + strategy_stats[strategy] = { + "games_played": len(games), + "games_completed": len(successful), + "games_failed": len(failed), + "win_rate": len(wins) / len(successful) if successful else 0.0, + "avg_stability": ( + sum(stabilities) / len(stabilities) if stabilities else 0.0 + ), + "min_stability": min(stabilities) if stabilities else 0.0, + "max_stability": max(stabilities) if stabilities else 0.0, + "total_actions": total_actions, + "avg_actions": total_actions / len(successful) if successful else 0.0, + "action_breakdown": action_totals, + "avg_duration_seconds": ( + sum(g.duration_seconds for g in successful) / len(successful) + if successful + else 0.0 + ), + } + + # Identify unused story seeds (compare with known seeds from content) + # For now, we'll just report what we saw + unused_story_seeds: list[str] = [] # Would compare against authored seeds + + total_duration = perf_counter() - start_time + + if verbose: + sys.stderr.write( + f"\nTournament complete: {len(results)} games in {total_duration:.1f}s\n" + ) + + return TournamentReport( + config={ + "num_games": config.num_games, + "ticks_per_game": config.ticks_per_game, + "strategies": config.strategies, + "base_seed": config.base_seed, + "world": config.world, + "stability_win_threshold": 
config.stability_win_threshold, + }, + total_games=len(tasks), + completed_games=sum(1 for r in results if r.error is None), + failed_games=sum(1 for r in results if r.error is not None), + games_by_strategy=games_by_strategy, + strategy_stats=strategy_stats, + all_story_seeds=all_story_seeds, + unused_story_seeds=unused_story_seeds, + total_duration_seconds=total_duration, + ) + + +def print_summary_table(report: TournamentReport) -> None: + """Print a human-readable summary of tournament results.""" + print("\n" + "=" * 80) + print("AI TOURNAMENT RESULTS") + print("=" * 80) + print( + f"\nGames: {report.completed_games}/{report.total_games} completed " + f"({report.failed_games} failed)" + ) + print(f"Total duration: {report.total_duration_seconds:.1f}s") + print() + + # Strategy comparison table + print( + f"{'Strategy':<12} {'Win Rate':>10} {'Avg Stab':>10} {'Min Stab':>10} " + f"{'Max Stab':>10} {'Avg Actions':>12}" + ) + print("-" * 80) + + for strategy, stats in report.strategy_stats.items(): + print( + f"{strategy:<12} {stats['win_rate']:>10.1%} " + f"{stats['avg_stability']:>10.3f} " + f"{stats['min_stability']:>10.3f} {stats['max_stability']:>10.3f} " + f"{stats['avg_actions']:>12.1f}" + ) + + print("-" * 80) + + # Story seeds summary + if report.all_story_seeds: + print(f"\nStory seeds activated: {', '.join(sorted(report.all_story_seeds))}") + + # Balance observations + print("\n" + "=" * 80) + print("BALANCE OBSERVATIONS") + print("=" * 80) + + # Find dominant strategy + win_rates = [(s, stats["win_rate"]) for s, stats in report.strategy_stats.items()] + if win_rates: + win_rates.sort(key=lambda x: x[1], reverse=True) + best = win_rates[0] + worst = win_rates[-1] + delta = best[1] - worst[1] + + print(f"\nBest strategy: {best[0]} ({best[1]:.1%} win rate)") + print(f"Worst strategy: {worst[0]} ({worst[1]:.1%} win rate)") + print(f"Win rate delta: {delta:.1%}") + + if delta > 0.2: + print("\n⚠️ WARNING: Large win rate delta suggests balance issues") + + 
print("=" * 80) + + +def main(argv: Sequence[str] | None = None) -> int: + """CLI entry point for running AI tournaments.""" + parser = argparse.ArgumentParser( + description="Run AI tournaments with parallel games and varied strategies.", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Run 100 games with default settings + uv run python scripts/run_ai_tournament.py --games 100 + + # Run with specific strategies + uv run python scripts/run_ai_tournament.py --games 50 --strategies balanced aggressive + + # Save results to file + uv run python scripts/run_ai_tournament.py --games 100 --output build/tournament.json +""", + ) + parser.add_argument( + "--games", + "-g", + type=int, + default=100, + help="Total number of games to run (default: 100)", + ) + parser.add_argument( + "--ticks", + "-t", + type=int, + default=100, + help="Ticks per game (default: 100)", + ) + parser.add_argument( + "--strategies", + "-s", + nargs="+", + choices=["balanced", "aggressive", "diplomatic", "hybrid"], + default=["balanced", "aggressive", "diplomatic"], + help="Strategies to test (default: balanced aggressive diplomatic)", + ) + parser.add_argument( + "--seed", + type=int, + default=42, + help="Base random seed (default: 42)", + ) + parser.add_argument( + "--world", + "-w", + default="default", + help="World bundle to use (default: default)", + ) + parser.add_argument( + "--workers", + type=int, + default=None, + help="Max parallel workers (default: auto)", + ) + parser.add_argument( + "--win-threshold", + type=float, + default=0.5, + help="Stability threshold for a 'win' (default: 0.5)", + ) + parser.add_argument( + "--output", + "-o", + type=Path, + default=None, + help="Path to write JSON results", + ) + parser.add_argument( + "--json", + action="store_true", + help="Output as JSON instead of table", + ) + parser.add_argument( + "--verbose", + "-v", + action="store_true", + help="Print progress during tournament", + ) + + args = 
parser.parse_args(argv) + + config = TournamentConfig( + num_games=args.games, + ticks_per_game=args.ticks, + strategies=args.strategies, + base_seed=args.seed, + world=args.world, + max_workers=args.workers, + stability_win_threshold=args.win_threshold, + ) + + report = run_tournament(config, verbose=args.verbose) + + if args.output: + args.output.parent.mkdir(parents=True, exist_ok=True) + args.output.write_text(json.dumps(report.to_dict(), indent=2, sort_keys=True)) + if args.verbose: + sys.stderr.write(f"\nResults written to {args.output}\n") + + if args.json: + print(json.dumps(report.to_dict(), indent=2, sort_keys=True)) + else: + print_summary_table(report) + + return 0 + + +if __name__ == "__main__": # pragma: no cover + raise SystemExit(main()) diff --git a/tests/scripts/test_ai_analysis.py b/tests/scripts/test_ai_analysis.py new file mode 100644 index 00000000..77bd9599 --- /dev/null +++ b/tests/scripts/test_ai_analysis.py @@ -0,0 +1,502 @@ +"""Tests for AI tournament analysis module.""" + +from __future__ import annotations + +import json +import sys +import tempfile +from importlib import util +from pathlib import Path + +import pytest + +_MODULE_PATH = ( + Path(__file__).resolve().parents[2] / "scripts" / "analyze_ai_games.py" +) + + +def _load_analysis_module(): + spec = util.spec_from_file_location("analysis_driver", _MODULE_PATH) + assert spec and spec.loader + module = util.module_from_spec(spec) + sys.modules.setdefault("analysis_driver", module) + spec.loader.exec_module(module) + return module + + +_driver = _load_analysis_module() +AnalysisReport = _driver.AnalysisReport +BalanceAnomaly = _driver.BalanceAnomaly +analyze_actions = _driver.analyze_actions +analyze_story_seeds = _driver.analyze_story_seeds +analyze_tournament = _driver.analyze_tournament +analyze_win_rates = _driver.analyze_win_rates +detect_anomalies = _driver.detect_anomalies +generate_recommendations = _driver.generate_recommendations +load_tournament_results = 
_driver.load_tournament_results +main = _driver.main + + +class TestBalanceAnomaly: + """Tests for the BalanceAnomaly dataclass.""" + + def test_anomaly_to_dict(self) -> None: + anomaly = BalanceAnomaly( + anomaly_type="dominant_strategy", + severity="high", + description="Strategy 'aggressive' dominates", + data={"strategy": "aggressive", "win_rate": 0.95}, + ) + + result = anomaly.to_dict() + + assert result["type"] == "dominant_strategy" + assert result["severity"] == "high" + assert result["description"] == "Strategy 'aggressive' dominates" + assert result["data"]["strategy"] == "aggressive" + + +class TestAnalysisReport: + """Tests for the AnalysisReport dataclass.""" + + def test_report_to_dict(self) -> None: + anomaly = BalanceAnomaly( + anomaly_type="test", severity="low", description="Test anomaly" + ) + report = AnalysisReport( + tournament_config={"num_games": 100}, + strategy_comparison={"balanced": {"win_rate": 0.5}}, + win_rate_analysis={"is_balanced": True}, + action_analysis={"most_used_action": "INSPECT"}, + story_seed_analysis={"seeds_seen": ["seed-1"]}, + anomalies=[anomaly], + recommendations=["Test recommendation"], + ) + + result = report.to_dict() + + assert result["tournament_config"] == {"num_games": 100} + assert "balanced" in result["strategy_comparison"] + assert len(result["anomalies"]) == 1 + assert result["recommendations"] == ["Test recommendation"] + + +class TestAnalyzeWinRates: + """Tests for the analyze_win_rates function.""" + + def test_analyze_balanced_strategies(self) -> None: + strategy_stats = { + "balanced": {"win_rate": 0.50}, + "aggressive": {"win_rate": 0.55}, + "diplomatic": {"win_rate": 0.52}, + } + + result = analyze_win_rates(strategy_stats) + + assert result["best_strategy"] == "aggressive" + assert result["best_win_rate"] == 0.55 + assert result["worst_strategy"] == "balanced" + assert result["worst_win_rate"] == 0.50 + assert result["win_rate_delta"] == pytest.approx(0.05) + assert result["is_balanced"] is True + + def 
test_analyze_imbalanced_strategies(self) -> None: + strategy_stats = { + "balanced": {"win_rate": 0.30}, + "aggressive": {"win_rate": 0.80}, + } + + result = analyze_win_rates(strategy_stats) + + assert result["win_rate_delta"] == 0.50 + assert result["is_balanced"] is False + + def test_analyze_empty_stats(self) -> None: + result = analyze_win_rates({}) + assert "error" in result + + def test_analyze_single_strategy(self) -> None: + strategy_stats = {"balanced": {"win_rate": 0.75}} + + result = analyze_win_rates(strategy_stats) + + assert result["best_strategy"] == "balanced" + assert result["worst_strategy"] == "balanced" + assert result["win_rate_delta"] == 0.0 + assert result["is_balanced"] is True + + +class TestAnalyzeActions: + """Tests for the analyze_actions function.""" + + def test_analyze_action_distribution(self) -> None: + strategy_stats = { + "balanced": {"action_breakdown": {"INSPECT": 20, "NEGOTIATE": 15}}, + "aggressive": {"action_breakdown": {"INSPECT": 10, "DEPLOY_RESOURCE": 25}}, + } + + result = analyze_actions(strategy_stats) + + assert result["total_actions"]["INSPECT"] == 30 + assert result["total_actions"]["NEGOTIATE"] == 15 + assert result["total_actions"]["DEPLOY_RESOURCE"] == 25 + assert result["most_used_action"] == "INSPECT" + assert result["most_used_count"] == 30 + assert result["least_used_action"] == "NEGOTIATE" + assert result["least_used_count"] == 15 + + def test_analyze_dominant_action(self) -> None: + strategy_stats = { + "balanced": {"action_breakdown": {"INSPECT": 100}}, + "aggressive": {"action_breakdown": {"INSPECT": 100, "NEGOTIATE": 10}}, + } + + result = analyze_actions(strategy_stats) + + # INSPECT is over 50% of total actions + assert result["dominant_action"] == "INSPECT" + + def test_analyze_no_dominant_action(self) -> None: + strategy_stats = { + "balanced": {"action_breakdown": {"INSPECT": 30, "NEGOTIATE": 30}}, + } + + result = analyze_actions(strategy_stats) + + assert result["dominant_action"] is None + + def 
test_analyze_empty_actions(self) -> None: + result = analyze_actions({}) + assert "error" in result + + +class TestAnalyzeStorySeeds: + """Tests for the analyze_story_seeds function.""" + + def test_analyze_seed_coverage(self) -> None: + results = { + "all_story_seeds_seen": ["seed-1", "seed-2"], + "games": { + "balanced": [ + {"story_seeds_activated": ["seed-1"], "error": None}, + {"story_seeds_activated": ["seed-1", "seed-2"], "error": None}, + ] + }, + } + + analysis = analyze_story_seeds(results) + + assert "seed-1" in analysis["seeds_seen"] + assert "seed-2" in analysis["seeds_seen"] + assert analysis["seed_counts"]["seed-1"] == 2 + assert analysis["seed_counts"]["seed-2"] == 1 + assert analysis["total_games_analyzed"] == 2 + + def test_analyze_with_authored_seeds(self) -> None: + results = { + "all_story_seeds_seen": ["seed-1"], + "games": { + "balanced": [ + {"story_seeds_activated": ["seed-1"], "error": None}, + ] + }, + } + authored = ["seed-1", "seed-2", "seed-3"] + + analysis = analyze_story_seeds(results, authored) + + assert analysis["unused_seeds"] == ["seed-2", "seed-3"] + assert analysis["coverage_rate"] == pytest.approx(1 / 3) + + def test_analyze_full_coverage(self) -> None: + results = { + "all_story_seeds_seen": ["seed-1", "seed-2"], + "games": {}, + } + authored = ["seed-1", "seed-2"] + + analysis = analyze_story_seeds(results, authored) + + assert analysis["unused_seeds"] == [] + assert analysis["coverage_rate"] == 1.0 + + +class TestDetectAnomalies: + """Tests for the detect_anomalies function.""" + + def test_detect_dominant_strategy(self) -> None: + win_rate = { + "win_rate_delta": 0.25, + "best_strategy": "aggressive", + "best_win_rate": 0.85, + "worst_strategy": "diplomatic", + "worst_win_rate": 0.60, + } + + anomalies = detect_anomalies(win_rate, {}, {}, {}) + + assert len(anomalies) >= 1 + dominant = [a for a in anomalies if a.anomaly_type == "dominant_strategy"] + assert len(dominant) == 1 + assert dominant[0].severity == "high" + + 
def test_detect_strategy_imbalance(self) -> None: + win_rate = { + "win_rate_delta": 0.18, # Between 0.15 and 0.2 + "best_strategy": "aggressive", + "best_win_rate": 0.75, + "worst_strategy": "diplomatic", + "worst_win_rate": 0.57, + } + + anomalies = detect_anomalies(win_rate, {}, {}, {}) + + imbalance = [a for a in anomalies if a.anomaly_type == "strategy_imbalance"] + assert len(imbalance) == 1 + assert imbalance[0].severity == "medium" + + def test_detect_dominant_action(self) -> None: + action_analysis = { + "dominant_action": "INSPECT", + "action_percentages": {"INSPECT": 0.75}, + } + + anomalies = detect_anomalies({}, action_analysis, {}, {}) + + dominant = [a for a in anomalies if a.anomaly_type == "dominant_action"] + assert len(dominant) == 1 + + def test_detect_unused_story_seeds(self) -> None: + story_seed_analysis = { + "unused_seeds": ["seed-1", "seed-2", "seed-3"], + "authored_seeds": ["seed-1", "seed-2", "seed-3", "seed-4"], + } + + anomalies = detect_anomalies({}, {}, story_seed_analysis, {}) + + unused = [a for a in anomalies if a.anomaly_type == "unused_story_seeds"] + assert len(unused) == 1 + assert unused[0].severity == "high" + + def test_detect_low_activity_strategy(self) -> None: + strategy_stats = { + "balanced": {"avg_actions": 0.5}, + } + + anomalies = detect_anomalies({}, {}, {}, strategy_stats) + + low_activity = [ + a for a in anomalies if a.anomaly_type == "low_activity_strategy" + ] + assert len(low_activity) == 1 + + def test_no_anomalies_when_balanced(self) -> None: + win_rate = { + "win_rate_delta": 0.05, + "is_balanced": True, + } + action_analysis = {"dominant_action": None} + story_seed_analysis = {"unused_seeds": [], "coverage_rate": 1.0} + strategy_stats = {"balanced": {"avg_actions": 5.0}} + + anomalies = detect_anomalies( + win_rate, action_analysis, story_seed_analysis, strategy_stats + ) + + assert len(anomalies) == 0 + + +class TestGenerateRecommendations: + """Tests for the generate_recommendations function.""" + + def 
test_recommends_for_imbalanced(self) -> None: + anomalies = [ + BalanceAnomaly( + "dominant_strategy", "high", "Strategy imbalanced" + ) + ] + win_rate = { + "is_balanced": False, + "best_strategy": "aggressive", + "worst_strategy": "diplomatic", + } + + recs = generate_recommendations(anomalies, win_rate, {}, {}) + + assert len(recs) >= 1 + assert any("diplomatic" in r.lower() or "aggressive" in r.lower() for r in recs) + + def test_recommends_for_dominant_action(self) -> None: + action_analysis = {"dominant_action": "INSPECT"} + + recs = generate_recommendations([], {}, action_analysis, {}) + + assert any("INSPECT" in r for r in recs) + + def test_recommends_for_unused_seeds(self) -> None: + story_seed_analysis = { + "unused_seeds": ["seed-1", "seed-2"], + "coverage_rate": 0.5, + } + + recs = generate_recommendations([], {}, {}, story_seed_analysis) + + assert any("seed" in r.lower() for r in recs) + + def test_default_recommendation_when_balanced(self) -> None: + recs = generate_recommendations( + [], {"is_balanced": True}, {"dominant_action": None}, {} + ) + + assert len(recs) >= 1 + assert any("no significant" in r.lower() for r in recs) + + +class TestAnalyzeTournament: + """Tests for the full analyze_tournament function.""" + + def test_full_analysis(self) -> None: + results = { + "config": {"num_games": 10, "strategies": ["balanced", "aggressive"]}, + "strategy_stats": { + "balanced": { + "win_rate": 0.6, + "avg_stability": 0.7, + "avg_actions": 5.0, + "games_completed": 5, + "action_breakdown": {"INSPECT": 20}, + }, + "aggressive": { + "win_rate": 0.5, + "avg_stability": 0.65, + "avg_actions": 8.0, + "games_completed": 5, + "action_breakdown": {"INSPECT": 10, "DEPLOY_RESOURCE": 30}, + }, + }, + "all_story_seeds_seen": ["seed-1"], + "games": { + "balanced": [ + {"story_seeds_activated": ["seed-1"], "error": None}, + ], + "aggressive": [ + {"story_seeds_activated": [], "error": None}, + ], + }, + } + + report = analyze_tournament(results) + + assert 
report.tournament_config["num_games"] == 10 + assert "balanced" in report.strategy_comparison + assert "aggressive" in report.strategy_comparison + assert report.win_rate_analysis["is_balanced"] is True + assert len(report.recommendations) >= 1 + + def test_analysis_with_authored_seeds(self) -> None: + results = { + "config": {}, + "strategy_stats": { + "balanced": { + "win_rate": 0.5, + "avg_stability": 0.5, + "avg_actions": 5.0, + "games_completed": 1, + "action_breakdown": {}, + }, + }, + "all_story_seeds_seen": ["seed-1"], + "games": {}, + } + authored = ["seed-1", "seed-2", "seed-3"] + + report = analyze_tournament(results, authored) + + assert report.story_seed_analysis["unused_seeds"] == ["seed-2", "seed-3"] + + +class TestLoadTournamentResults: + """Tests for loading tournament results from files.""" + + def test_load_valid_json(self) -> None: + results = { + "config": {"num_games": 10}, + "total_games": 10, + } + + with tempfile.TemporaryDirectory() as tmpdir: + path = Path(tmpdir) / "results.json" + path.write_text(json.dumps(results)) + + loaded = load_tournament_results(path) + + assert loaded["config"]["num_games"] == 10 + assert loaded["total_games"] == 10 + + +class TestAnalysisCLI: + """Tests for the analysis CLI.""" + + def test_cli_basic_run( + self, tmp_path: Path, capsys: pytest.CaptureFixture + ) -> None: + """Test CLI with minimal arguments.""" + results = { + "config": {"num_games": 10}, + "strategy_stats": { + "balanced": { + "win_rate": 0.5, + "avg_stability": 0.5, + "avg_actions": 5.0, + "games_completed": 10, + "action_breakdown": {"INSPECT": 50}, + }, + }, + "all_story_seeds_seen": [], + "games": {}, + } + input_path = tmp_path / "results.json" + input_path.write_text(json.dumps(results)) + + exit_code = main(["--input", str(input_path)]) + + assert exit_code == 0 + captured = capsys.readouterr() + assert "AI TOURNAMENT ANALYSIS REPORT" in captured.out + + def test_cli_json_output( + self, tmp_path: Path, capsys: pytest.CaptureFixture + 
) -> None: + """Test CLI with JSON output format.""" + results = { + "config": {"num_games": 10}, + "strategy_stats": { + "balanced": { + "win_rate": 0.5, + "avg_stability": 0.5, + "avg_actions": 5.0, + "games_completed": 10, + "action_breakdown": {}, + }, + }, + "all_story_seeds_seen": [], + "games": {}, + } + input_path = tmp_path / "results.json" + input_path.write_text(json.dumps(results)) + + exit_code = main(["--input", str(input_path), "--json"]) + + assert exit_code == 0 + captured = capsys.readouterr() + data = json.loads(captured.out) + assert "tournament_config" in data + assert "anomalies" in data + + def test_cli_missing_input_file(self, capsys: pytest.CaptureFixture) -> None: + """Test CLI with missing input file.""" + exit_code = main(["--input", "/nonexistent/path/results.json"]) + + assert exit_code == 1 + captured = capsys.readouterr() + assert "Error" in captured.err diff --git a/tests/scripts/test_ai_tournament.py b/tests/scripts/test_ai_tournament.py new file mode 100644 index 00000000..7c68d185 --- /dev/null +++ b/tests/scripts/test_ai_tournament.py @@ -0,0 +1,419 @@ +"""Tests for AI tournament infrastructure.""" + +from __future__ import annotations + +import json +import sys +import tempfile +from importlib import util +from pathlib import Path + +import pytest + +_MODULE_PATH = ( + Path(__file__).resolve().parents[2] / "scripts" / "run_ai_tournament.py" +) + + +def _load_tournament_module(): + spec = util.spec_from_file_location("tournament_driver", _MODULE_PATH) + assert spec and spec.loader + module = util.module_from_spec(spec) + sys.modules.setdefault("tournament_driver", module) + spec.loader.exec_module(module) + return module + + +_driver = _load_tournament_module() +GameResult = _driver.GameResult +TournamentConfig = _driver.TournamentConfig +TournamentReport = _driver.TournamentReport +run_single_game = _driver.run_single_game +run_tournament = _driver.run_tournament +main = _driver.main + + +class TestGameResult: + """Tests for 
the GameResult dataclass."""
+
+    def test_game_result_default_values(self) -> None:
+        result = GameResult(
+            game_id=1,
+            strategy="balanced",
+            seed=42,
+            ticks_run=100,
+            final_stability=0.75,
+            actions_taken=10,
+        )
+        assert result.story_seeds_activated == []
+        assert result.action_counts == {}
+        assert result.duration_seconds == 0.0
+        assert result.error is None
+
+    def test_game_result_to_dict(self) -> None:
+        result = GameResult(
+            game_id=1,
+            strategy="balanced",
+            seed=42,
+            ticks_run=100,
+            final_stability=0.7567,
+            actions_taken=10,
+            story_seeds_activated=["seed-1", "seed-2"],
+            action_counts={"INSPECT": 5, "NEGOTIATE": 5},
+            duration_seconds=1.234,
+        )
+
+        data = result.to_dict()
+
+        assert data["game_id"] == 1
+        assert data["strategy"] == "balanced"
+        assert data["seed"] == 42
+        assert data["ticks_run"] == 100
+        assert data["final_stability"] == 0.7567
+        assert data["actions_taken"] == 10
+        assert data["story_seeds_activated"] == ["seed-1", "seed-2"]
+        assert data["action_counts"] == {"INSPECT": 5, "NEGOTIATE": 5}
+        assert data["duration_seconds"] == 1.234
+        assert data["error"] is None
+
+    def test_game_result_with_error(self) -> None:
+        result = GameResult(
+            game_id=1,
+            strategy="balanced",
+            seed=42,
+            ticks_run=0,
+            final_stability=0.0,
+            actions_taken=0,
+            error="Connection failed",
+        )
+
+        data = result.to_dict()
+        assert data["error"] == "Connection failed"
+
+
+class TestTournamentConfig:
+    """Tests for the TournamentConfig dataclass."""
+
+    def test_default_config(self) -> None:
+        config = TournamentConfig()
+        assert config.num_games == 100
+        assert config.ticks_per_game == 100
+        assert config.strategies == ["balanced", "aggressive", "diplomatic"]
+        assert config.base_seed == 42
+        assert config.world == "default"
+        assert config.max_workers is None
+        assert config.stability_win_threshold == 0.5
+
+    def test_custom_config(self) -> None:
+        config = TournamentConfig(
+            num_games=50,
+            ticks_per_game=200,
+            strategies=["balanced", "hybrid"],
+
base_seed=123,
+            world="test",
+            max_workers=2,
+            stability_win_threshold=0.6,
+        )
+        assert config.num_games == 50
+        assert config.ticks_per_game == 200
+        assert config.strategies == ["balanced", "hybrid"]
+        assert config.base_seed == 123
+        assert config.world == "test"
+        assert config.max_workers == 2
+        assert config.stability_win_threshold == 0.6
+
+
+class TestTournamentReport:
+    """Tests for the TournamentReport dataclass."""
+
+    def test_report_to_dict(self) -> None:
+        game1 = GameResult(
+            game_id=1,
+            strategy="balanced",
+            seed=42,
+            ticks_run=100,
+            final_stability=0.8,
+            actions_taken=5,
+        )
+        game2 = GameResult(
+            game_id=2,
+            strategy="aggressive",
+            seed=43,
+            ticks_run=100,
+            final_stability=0.6,
+            actions_taken=10,
+        )
+
+        report = TournamentReport(
+            config={"num_games": 2, "strategies": ["balanced", "aggressive"]},
+            total_games=2,
+            completed_games=2,
+            failed_games=0,
+            games_by_strategy={
+                "balanced": [game1],
+                "aggressive": [game2],
+            },
+            strategy_stats={
+                "balanced": {"win_rate": 1.0, "avg_stability": 0.8},
+                "aggressive": {"win_rate": 0.5, "avg_stability": 0.6},
+            },
+            all_story_seeds={"seed-1"},
+            unused_story_seeds=[],
+            total_duration_seconds=5.5,
+        )
+
+        data = report.to_dict()
+
+        assert data["total_games"] == 2
+        assert data["completed_games"] == 2
+        assert data["failed_games"] == 0
+        assert "balanced" in data["strategy_stats"]
+        assert "aggressive" in data["strategy_stats"]
+        assert data["all_story_seeds_seen"] == ["seed-1"]
+        assert data["total_duration_seconds"] == 5.5
+        assert len(data["games"]["balanced"]) == 1
+        assert len(data["games"]["aggressive"]) == 1
+
+
+class TestRunSingleGame:
+    """Tests for the run_single_game function."""
+
+    def test_run_single_game_balanced(self) -> None:
+        result = run_single_game(
+            game_id=1,
+            strategy_name="balanced",
+            seed=42,
+            ticks=10,
+            world="default",
+        )
+
+        assert result.game_id == 1
+        assert result.strategy == "balanced"
+        assert result.seed == 42
+        assert result.ticks_run == 10
+
assert result.error is None
+        assert 0.0 <= result.final_stability <= 1.0
+        assert result.duration_seconds > 0
+
+    def test_run_single_game_aggressive(self) -> None:
+        result = run_single_game(
+            game_id=2,
+            strategy_name="aggressive",
+            seed=43,
+            ticks=10,
+            world="default",
+        )
+
+        assert result.game_id == 2
+        assert result.strategy == "aggressive"
+        assert result.error is None
+
+    def test_run_single_game_diplomatic(self) -> None:
+        result = run_single_game(
+            game_id=3,
+            strategy_name="diplomatic",
+            seed=44,
+            ticks=10,
+            world="default",
+        )
+
+        assert result.game_id == 3
+        assert result.strategy == "diplomatic"
+        assert result.error is None
+
+    def test_run_single_game_invalid_world(self) -> None:
+        result = run_single_game(
+            game_id=1,
+            strategy_name="balanced",
+            seed=42,
+            ticks=10,
+            world="nonexistent_world",
+        )
+
+        assert result.error is not None
+        assert result.ticks_run == 0
+
+
+class TestRunTournament:
+    """Tests for the run_tournament function."""
+
+    def test_run_tournament_small(self) -> None:
+        """Run a small tournament to verify basic functionality."""
+        config = TournamentConfig(
+            num_games=6,
+            ticks_per_game=10,
+            strategies=["balanced", "aggressive"],
+            base_seed=42,
+            max_workers=2,
+        )
+
+        report = run_tournament(config, verbose=False)
+
+        assert report.total_games == 6
+        assert report.completed_games == 6
+        assert report.failed_games == 0
+        assert "balanced" in report.games_by_strategy
+        assert "aggressive" in report.games_by_strategy
+        assert len(report.games_by_strategy["balanced"]) == 3
+        assert len(report.games_by_strategy["aggressive"]) == 3
+
+    def test_run_tournament_calculates_stats(self) -> None:
+        """Verify that tournament calculates strategy statistics."""
+        config = TournamentConfig(
+            num_games=4,
+            ticks_per_game=10,
+            strategies=["balanced", "diplomatic"],
+            base_seed=42,
+            max_workers=1,
+        )
+
+        report = run_tournament(config, verbose=False)
+
+        assert "balanced" in report.strategy_stats
+        assert "diplomatic" in
report.strategy_stats
+
+        for _strategy, stats in report.strategy_stats.items():
+            assert "games_played" in stats
+            assert "win_rate" in stats
+            assert "avg_stability" in stats
+            assert "total_actions" in stats
+            assert "avg_actions" in stats
+
+    def test_run_tournament_with_single_strategy(self) -> None:
+        """Test tournament with only one strategy."""
+        config = TournamentConfig(
+            num_games=3,
+            ticks_per_game=10,
+            strategies=["balanced"],
+            base_seed=42,
+            max_workers=1,
+        )
+
+        report = run_tournament(config, verbose=False)
+
+        assert report.total_games == 3
+        assert len(report.games_by_strategy["balanced"]) == 3
+
+    def test_run_tournament_collects_story_seeds(self) -> None:
+        """Test that tournament collects story seed information."""
+        config = TournamentConfig(
+            num_games=2,
+            ticks_per_game=50,  # More ticks to potentially trigger seeds
+            strategies=["balanced"],
+            base_seed=42,
+            max_workers=1,
+        )
+
+        report = run_tournament(config, verbose=False)
+
+        # Story seeds may or may not be activated, just verify the field exists
+        assert isinstance(report.all_story_seeds, set)
+
+
+class TestTournamentDeterminism:
+    """Tests for tournament determinism with fixed seeds."""
+
+    def test_same_seed_produces_same_result(self) -> None:
+        """Running the same game twice with same seed should produce same result."""
+        result1 = run_single_game(
+            game_id=1,
+            strategy_name="balanced",
+            seed=42,
+            ticks=20,
+            world="default",
+        )
+        result2 = run_single_game(
+            game_id=1,
+            strategy_name="balanced",
+            seed=42,
+            ticks=20,
+            world="default",
+        )
+
+        assert result1.final_stability == result2.final_stability
+        assert result1.actions_taken == result2.actions_taken
+
+    def test_different_seeds_may_differ(self) -> None:
+        """Different seeds may produce different outcomes."""
+        result1 = run_single_game(
+            game_id=1,
+            strategy_name="balanced",
+            seed=42,
+            ticks=20,
+            world="default",
+        )
+        result2 = run_single_game(
+            game_id=2,
+            strategy_name="balanced",
+            seed=12345,
+
ticks=20,
+            world="default",
+        )
+
+        # Results may or may not differ, but both should complete successfully
+        assert result1.error is None
+        assert result2.error is None
+
+
+class TestTournamentOutputFile:
+    """Tests for tournament JSON output."""
+
+    def test_tournament_writes_json_output(self) -> None:
+        """Test that tournament can write results to JSON file."""
+        config = TournamentConfig(
+            num_games=2,
+            ticks_per_game=10,
+            strategies=["balanced"],
+            max_workers=1,
+        )
+
+        report = run_tournament(config, verbose=False)
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            output_path = Path(tmpdir) / "results.json"
+            output_path.write_text(json.dumps(report.to_dict(), indent=2))
+
+            # Verify file was written and can be read back
+            assert output_path.exists()
+            loaded = json.loads(output_path.read_text())
+            assert loaded["total_games"] == 2
+            assert "balanced" in loaded["games"]
+
+
+class TestTournamentCLI:
+    """Tests for the tournament CLI."""
+
+    def test_cli_basic_run(
+        self, tmp_path: Path, capsys: pytest.CaptureFixture
+    ) -> None:
+        """Test CLI with minimal arguments."""
+        output_path = tmp_path / "results.json"
+
+        exit_code = main([
+            "--games", "2",
+            "--ticks", "5",
+            "--strategies", "balanced",
+            "--output", str(output_path),
+        ])
+
+        assert exit_code == 0
+        assert output_path.exists()
+
+        captured = capsys.readouterr()
+        assert "AI TOURNAMENT RESULTS" in captured.out
+
+    def test_cli_json_output(
+        self, tmp_path: Path, capsys: pytest.CaptureFixture
+    ) -> None:
+        """Test CLI with JSON output format."""
+        exit_code = main([
+            "--games", "2",
+            "--ticks", "5",
+            "--strategies", "balanced",
+            "--json",
+        ])
+
+        assert exit_code == 0
+        captured = capsys.readouterr()
+        data = json.loads(captured.out)
+        assert "total_games" in data

From db92a52e489de1f26cb0153f29660e1fbd828ee6 Mon Sep 17 00:00:00 2001
From: Ross Gardler
Date: Wed, 3 Dec 2025 21:35:52 -0800
Subject: [PATCH 3/3] docs: add Section 13 AI Tournament & Balance Analysis, update guides and
README for new tooling
---
 README.md                                     |  9 ++--
 .../ai_tournament_and_balance_analysis.md     | 54 +++++++++++++++++++
 docs/gengine/how_to_play_echoes.md            |  8 +--
 3 files changed, 62 insertions(+), 9 deletions(-)
 create mode 100644 docs/gengine/ai_tournament_and_balance_analysis.md

diff --git a/README.md b/README.md
index 264a1803..d716feda 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,9 @@
 # GEngine: Echoes of Emergence

-A staged simulation project that prototypes the "Echoes of Emergence" CLI + LLM
-experience. The long-term goal is a service-first architecture (simulation
-service, CLI gateway, LLM intent service) designed for Kubernetes. This README
-summarizes the current state of development and the immediate workflows you can
-run locally.
+
+A staged simulation project that prototypes the "Echoes of Emergence" CLI + LLM experience. The long-term goal is a service-first architecture (simulation service, CLI gateway, LLM intent service) designed for Kubernetes. This README summarizes the current state of development and the immediate workflows you can run locally.
+
+**For AI tournament and balance analysis tooling, see [Section 13: AI Tournament & Balance Analysis](docs/gengine/ai_tournament_and_balance_analysis.md).**

 ## Current Status (Phases 1–4)

diff --git a/docs/gengine/ai_tournament_and_balance_analysis.md b/docs/gengine/ai_tournament_and_balance_analysis.md
new file mode 100644
index 00000000..209b77e1
--- /dev/null
+++ b/docs/gengine/ai_tournament_and_balance_analysis.md
@@ -0,0 +1,54 @@
+# Section 13: AI Tournament & Balance Analysis
+
+**Last Updated:** 2025-12-03
+
+## Overview
+This section describes how to use the AI tournament and balance analysis tooling introduced in Phase 9. These tools help designers and developers run large batches of AI-driven games in parallel, compare strategy performance, and identify balance issues or underutilized content.
+
+## Running AI Tournaments
+
+The tournament script executes multiple games in parallel, each using a configurable AI strategy (BALANCED, AGGRESSIVE, DIPLOMATIC, HYBRID). Telemetry is captured for each game, and results are aggregated into a single JSON file.
+
+**Example:**
+```bash
+uv run python scripts/run_ai_tournament.py --games 100 --output build/tournament.json
+```
+- `--games`: Number of games to run (default: 100)
+- `--output`: Path to save the aggregated results
+- Additional flags allow you to specify strategies, seeds, and world configs.
+
+## Analyzing Tournament Results
+
+After running a tournament, use the analysis script to generate comparative reports. This tool surfaces win rate differences, balance anomalies, and unused story seeds.
+
+**Example:**
+```bash
+uv run python scripts/analyze_ai_games.py --input build/tournament.json --output build/analysis.json
+```
+- `--input`: Path to the tournament results JSON; `--output`: Path to save the analysis output
+
+The report includes:
+- Win rate comparison across strategies
+- Detection of unused story seeds
+- Flagging of balance outliers
+
+## Balance Iteration Workflow
+
+1. Run a tournament with a large number of games and varied strategies.
+2. Analyze the results to identify dominant strategies, underpowered/overpowered actions, and unused content.
+3. Adjust simulation parameters or authored content as needed.
+4. Repeat the process to validate improvements.
+
+## CI Integration
+
+A nightly CI workflow automatically runs tournaments and archives results for ongoing balance review. See `.github/workflows/ai-tournament.yml` for details.
+
+## Usage Tips
+- Use different world configs and seeds to stress-test balance across scenarios.
+- Review the analysis report regularly to guide design iteration.
+- Archived CI artifacts provide a historical record of balance changes.
+
+## See Also
+- [How to Play Echoes](./how_to_play_echoes.md)
+- [Implementation Plan](../simul/emergent_story_game_implementation_plan.md)
+- [README](../../README.md)
diff --git a/docs/gengine/how_to_play_echoes.md b/docs/gengine/how_to_play_echoes.md
index 1fa838f8..cc5b14d2 100644
--- a/docs/gengine/how_to_play_echoes.md
+++ b/docs/gengine/how_to_play_echoes.md
@@ -1,9 +1,9 @@
 # How to Play Echoes of Emergence

-This guide explains how to run the current Echoes of Emergence prototype,
-interpret its outputs, and iterate on the simulation while new systems are
-under construction. It assumes you have cloned the repository and installed all
-runtime/dev dependencies via `uv sync --group dev`.
+
+This guide explains how to run the current Echoes of Emergence prototype, interpret its outputs, and iterate on the simulation while new systems are under construction. It assumes you have cloned the repository and installed all runtime/dev dependencies via `uv sync --group dev`.
+
+**New!** For large-scale AI playtesting and balance iteration, see [Section 13: AI Tournament & Balance Analysis](./ai_tournament_and_balance_analysis.md).

 ## 1. Launching the Shell