# Tutorial 1: Goal Delivery - Learning to Navigate and Deposit

Welcome to the first CoGames tutorial! In this notebook, you'll train an agent to perform the simplest possible task: **navigate to a chest and deposit hearts** that are already in its inventory.

## 🎯 Learning Objectives

- Understand the CoGames environment structure
- Learn how agents interact via the **move** action
- Train a policy with sparse rewards
- Visualize learning progress
- Interpret where the trained agent spends its time (value-function visualization is still a placeholder)

## 📋 Task Overview

**Starting State:**
- Agent spawns randomly in a 10x10 map
- Agent starts with 3 hearts in inventory
- One chest is placed randomly on the map

**Goal:**
- Navigate to the chest
- Deposit hearts by moving into the chest
- Maximize reward: +1 per heart deposited

**Expected Training Time:** 10-20k steps (~2-3 minutes on CPU)
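
To make the task mechanics concrete, here is a toy, pure-Python stand-in for the rules above. It is a sketch, not the MettaGrid implementation: the `step` function, the coordinate convention, and the assumption that each move into the chest deposits exactly one heart are all illustrative.

```python
# Toy model of the task rules (hypothetical; not the MettaGrid engine).
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

def step(agent, chest, hearts, move, size=10):
    """Apply one move and return (new_agent_pos, hearts_left, reward)."""
    dx, dy = MOVES[move]
    target = (agent[0] + dx, agent[1] + dy)
    if target == chest:
        if hearts > 0:
            # Assumed: moving into the chest deposits one heart, agent stays put.
            return agent, hearts - 1, 1.0
        return agent, hearts, 0.0  # nothing left to deposit
    if 0 <= target[0] < size and 0 <= target[1] < size:
        return target, hearts, 0.0  # ordinary move, no reward
    return agent, hearts, 0.0  # blocked by the map edge

agent, chest, hearts = (2, 5), (4, 5), 3
total = 0.0
for move in ["E", "E", "E", "E"]:
    agent, hearts, reward = step(agent, chest, hearts, move)
    total += reward
print(agent, hearts, total)
```

In this toy run the agent takes one step toward the chest and then deposits on each of the next three moves, which is why the reward stays at zero until the agent is adjacent to the chest.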

---


## 1. Setup and Imports

First, let's import all necessary modules and set up our visualization utilities.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from pathlib import Path

# CoGames imports
from cogames.cogs_vs_clips.scenarios import make_game
from cogames.policy.simple import SimplePolicy
from cogames.train import train
from mettagrid import MettaGridEnv

# Import visualization utilities
from tutorial_viz import (
    plot_episode_returns,
    plot_success_rate,
    plot_value_heatmap,
    plot_position_heatmap,
    evaluate_policy,
    print_metrics_table,
    smooth_curve
)

# Set random seeds for reproducibility (NumPy and PyTorch)
np.random.seed(42)
torch.manual_seed(42)

print("✅ Imports complete!")


## 2. Configure the Environment

We'll create a minimal 10x10 grid with:
- 1 agent (starting with 3 hearts)
- 1 chest (accepts deposits from all sides)
- No crafting stations (hearts already in inventory)

The agent receives **+1 reward** for each heart deposited (via the `heart.lost` stat).
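
One way to picture a stat-based reward is as a weighted sum of stat increases between two steps. The sketch below assumes that model; the `stat_reward` helper and the `heart.gained` stat are hypothetical, and the actual MettaGrid reward computation may differ.

```python
def stat_reward(prev_stats, new_stats, weights):
    """Sum weight * (increase in stat) over every rewarded stat."""
    return sum(
        weight * (new_stats.get(name, 0) - prev_stats.get(name, 0))
        for name, weight in weights.items()
    )

weights = {"heart.lost": 1.0}                    # same mapping as the tutorial config
before = {"heart.lost": 0, "heart.gained": 0}    # hypothetical stats snapshot
after = {"heart.lost": 2, "heart.gained": 0}     # two hearts left the inventory
reward = stat_reward(before, after, weights)
print(reward)
```

Stats that are not listed in `weights` contribute nothing, so only hearts leaving the inventory are rewarded here.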


In [None]:
# Create base configuration
config = make_game(
    num_cogs=1,
    width=10,
    height=10,
    num_assemblers=0,  # No crafting yet
    num_chests=1,
    num_chargers=0,
    num_carbon_extractors=0,
    num_oxygen_extractors=0,
    num_germanium_extractors=0,
    num_silicon_extractors=0,
)

# Agent starts with hearts in inventory
config.game.agent.initial_inventory = {
    "energy": 100,
    "heart": 3,  # Start with 3 hearts to deposit
}

# Increase heart carrying capacity
config.game.agent.resource_limits["heart"] = 5

# Configure chest to accept deposits from all sides, but no withdrawals
config.game.objects["chest"].deposit_positions = ["N", "S", "E", "W"]
config.game.objects["chest"].withdrawal_positions = []  # Disable withdrawals

# Reward: +1 per heart deposited (agent-specific stat)
config.game.agent.rewards.stats = {
    "heart.lost": 1.0  # Reward when heart leaves agent's inventory
}

print("✅ Environment configured!")
print(f"   Map size: {config.game.width}x{config.game.height}")
print(f"   Initial hearts: {config.game.agent.initial_inventory['heart']}")
print(f"   Reward per heart: {config.game.agent.rewards.stats['heart.lost']}")


## 3. Create the Policy

We'll use a simple feedforward neural network policy (`SimplePolicy`). This policy:
- Takes observations as input (grid view + inventory + stats)
- Outputs movement actions (North, South, East, West, or NoOp)
- Will be trained using PPO (via PufferLib)
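
To illustrate the last step of such a policy, the sketch below turns five action scores (logits) into probabilities with a softmax and samples one action. The action names and the `sample_action` helper are illustrative assumptions; `SimplePolicy` may order or name its actions differently.

```python
import math
import random

ACTIONS = ["noop", "north", "south", "east", "west"]  # illustrative names only

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng):
    """Sample one action in proportion to its softmax probability."""
    return rng.choices(ACTIONS, weights=softmax(logits), k=1)[0]

rng = random.Random(42)
logits = [0.0, 2.0, 0.0, 0.0, 0.0]  # the network "prefers" north here
probs = softmax(logits)
action = sample_action(logits, rng)
print([round(p, 3) for p in probs], action)
```

Sampling rather than always taking the highest-scoring action keeps some exploration in the early phase of training, when the scores are still close to random.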


In [None]:
# Create a dummy environment to infer spaces
dummy_env = MettaGridEnv(env_cfg=config)
device = torch.device("cpu")

# Create policy with environment and device
policy = SimplePolicy(dummy_env, device)

print("✅ Policy created!")
print(f"   Policy type: {type(policy).__name__}")
print(f"   Device: {device}")


## 4. Train the Agent

Now we'll train the agent for 20,000 steps. This should take 2-3 minutes on CPU.

**What to expect:**
- Initial episodes: Random wandering, low/zero rewards
- After ~5k steps: Agent starts finding chest occasionally  
- After ~15k steps: Consistent chest navigation and deposits

The training will collect:
- Episode returns (total reward per episode)
- Episode lengths (steps taken per episode)
- Success metrics (heart depositions)
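
As a rough picture of how per-episode metrics come out of a stream of environment steps, the sketch below splits step rewards into episode returns and lengths using a done flag. The `summarize_episodes` helper is hypothetical; the trainer's own bookkeeping is not shown in this tutorial.

```python
def summarize_episodes(rewards, dones):
    """Split a flat stream of step rewards into per-episode returns and lengths."""
    returns, lengths = [], []
    ep_return, ep_length = 0.0, 0
    for reward, done in zip(rewards, dones):
        ep_return += reward
        ep_length += 1
        if done:  # episode boundary: record the totals and reset the counters
            returns.append(ep_return)
            lengths.append(ep_length)
            ep_return, ep_length = 0.0, 0
    return returns, lengths

rewards = [0, 0, 1, 1, 1, 0, 0, 0, 1]
dones = [False, False, False, False, True, False, False, False, True]
returns, lengths = summarize_episodes(rewards, dones)
print(returns, lengths)
```

Here the first episode deposits all three hearts in five steps, while the second deposits only one heart in four steps.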


In [None]:
%%time

# Set up checkpoint directory
checkpoint_dir = Path("./checkpoints")
checkpoint_dir.mkdir(parents=True, exist_ok=True)

# Train the policy
print("🚀 Starting training...")
print("=" * 60)

train(
    env_cfg=config,
    policy_class_path="cogames.policy.simple.SimplePolicy",
    device=device,
    initial_weights_path=None,
    num_steps=20_000,
    checkpoints_path=checkpoint_dir,
    seed=42,
    batch_size=512,
    minibatch_size=512,
    vector_num_envs=4,  # Parallel environments for faster training
    vector_num_workers=1,  # Serial on macOS/CPU
)

print("=" * 60)
print("✅ Training complete!")


## 5. Load the Trained Policy

Now let's load the trained policy from the saved checkpoint and evaluate it to collect metrics.
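
When picking the "latest" checkpoint, ordering by modification time tends to be safer than ordering by filename, because names with unpadded step counts sort lexicographically. The sketch below demonstrates the difference with made-up filenames; `latest_checkpoint_path` is a hypothetical helper, and the real checkpoint naming scheme is not documented here.

```python
import os
import tempfile
from pathlib import Path

def latest_checkpoint_path(directory, pattern="*.pt"):
    """Return the most recently modified file matching pattern, or None."""
    files = list(Path(directory).glob(pattern))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical names: model_10.pt is written after model_9.pt.
    for i, name in enumerate(["model_9.pt", "model_10.pt"]):
        path = Path(tmp) / name
        path.write_bytes(b"")
        os.utime(path, (1_000 + i, 1_000 + i))  # set explicit modification times
    newest = latest_checkpoint_path(tmp).name
    by_name = sorted(Path(tmp).glob("*.pt"))[-1].name
    missing = latest_checkpoint_path(tmp, "*.ckpt")

print(newest, by_name, missing)
```

Sorting by name picks `model_9.pt` in this example, even though `model_10.pt` was written later.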


In [None]:
# Find the latest checkpoint (most recently modified file)
run_dir = checkpoint_dir / "cogames.cogs_vs_clips"
checkpoint_files = sorted(run_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)
if not checkpoint_files:
    raise FileNotFoundError(f"No checkpoints found in {run_dir}")

latest_checkpoint = checkpoint_files[-1]
print(f"📂 Loading checkpoint: {latest_checkpoint.name}")

# Load the trained policy
trained_policy = SimplePolicy(dummy_env, device)
trained_policy.load_policy_data(str(latest_checkpoint))

print("✅ Policy loaded!")

# Evaluate the policy to collect metrics
print("\n📊 Evaluating policy (100 episodes)...")
metrics = evaluate_policy(
    config=config,
    policy=trained_policy,
    num_episodes=100,
    max_steps=200,
    seed=42
)
print(f"✅ Evaluation complete! Collected {len(metrics['episode_returns'])} episodes")


## 6. Visualize Training Progress

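Let's plot the per-episode returns and lengths to see whether our agent learned to navigate and deposit hearts. Note that these 100 episodes come from evaluating the trained policy in Section 5, not from the training run itself.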


In [None]:
# Plot episode returns and lengths
fig = plot_episode_returns(
    metrics['episode_returns'],
    metrics['episode_lengths'],
    window_size=50
)
plt.tight_layout()
plt.show()

# Print summary statistics
print_metrics_table({
    "Final Avg Return": np.mean(metrics['episode_returns'][-50:]),
    "Max Return": np.max(metrics['episode_returns']),
    "Final Avg Length": np.mean(metrics['episode_lengths'][-50:]),
    "Total Episodes": len(metrics['episode_returns']),
})


### Interpreting the Results

**Episode Return Curve:**
- A well-trained policy should sit near 3.0 (all hearts deposited); a random policy stays near 0
- Because these episodes come from evaluating the final policy, expect a roughly flat curve rather than an upward trend
- Large swings between episodes suggest the policy has not fully converged

**Episode Length Curve:**
- Shorter episodes mean the agent takes more efficient paths
- Wandering policy: ~100-200 steps
- Trained policy: ~20-50 steps (direct path to chest)

**Success Criteria:** 
- Average return > 2.5 (depositing most/all hearts)
- Episode length < 60 steps
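
The `window_size` argument suggests the plots smooth noisy per-episode values with a sliding window. The implementation inside `tutorial_viz` is not shown here, so the sketch below is only one common approach: a trailing moving average that uses a shorter window at the start. The `moving_average` name is hypothetical.

```python
def moving_average(values, window_size):
    """Trailing mean over the last `window_size` values (shorter at the start)."""
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window_size:
            running -= values[i - window_size]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window_size))
    return smoothed

returns = [0, 3, 3, 0, 3, 3]
smoothed = moving_average(returns, window_size=3)
print(smoothed)
```

With only 100 evaluation episodes and a window of 50, the first half of a curve smoothed this way is averaged over fewer episodes than the second half, so early points are noisier.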


## 7. Success Rate Analysis

Let's compute the success rate over evaluation. We'll consider an episode "successful" if the agent deposited at least 2 hearts (return >= 2).


In [None]:
# Define success as depositing at least 2 hearts (return >= 2)
success_threshold = 2.0
successes = [1 if r >= success_threshold else 0 for r in metrics['episode_returns']]

# Plot success rate
fig = plot_success_rate(successes, window_size=50)
plt.tight_layout()
plt.show()

# Print final success rate
final_success_rate = np.mean(successes[-50:]) * 100
print(f"\n📊 Final Success Rate (last 50 episodes): {final_success_rate:.1f}%")
print(f"   Target: 60%+")
if final_success_rate >= 60:
    print("   ✅ Target achieved!")
else:
    print("   ⚠️  Consider training longer or adjusting hyperparameters")


## 8. Visualize the Trained Agent

To see the trained agent in action, you can use the CoGames play command:

```bash
cogames play <game> --policy simple --policy-data ./checkpoints/cogames.cogs_vs_clips/<checkpoint>.pt
```

Note: GIF creation isn't available in the notebook because MettaGrid uses a GUI-based visualizer (mettascope).


In [None]:
# To visualize, save the checkpoint path for later use
print(f"💾 Trained model checkpoint: {latest_checkpoint}")
print("\n📺 To visualize the trained agent:")
print("   1. Exit this notebook")
print(f"   2. Run: cogames play <game> --policy simple --policy-data {latest_checkpoint}")


## 9. Value Function Analysis

The value function shows what the agent has learned about which states are valuable. Areas near the chest should have high values (because they lead to reward).

**Note:** This is a placeholder - full value function visualization requires manually constructing observations for each position, which we'll implement in a future update.
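
As a rough reference for what a learned heatmap might resemble, the sketch below computes an idealized value for each cell: the discounted reward an agent would collect by walking straight to the chest. This is not the policy's learned value function. The discount factor, the single lump-sum reward on arrival, and the `ideal_value` helper are all assumptions made for illustration.

```python
GAMMA = 0.99        # assumed discount factor
TOTAL_REWARD = 3.0  # assumed: treat all three deposits as one reward on arrival

def ideal_value(pos, chest, gamma=GAMMA, reward=TOTAL_REWARD):
    """Discounted reward from `pos` if the agent walks straight to the chest."""
    distance = abs(pos[0] - chest[0]) + abs(pos[1] - chest[1])  # Manhattan distance
    if distance == 0:
        return 0.0  # the chest cell itself is not a position the agent can occupy
    # distance - 1 moves reach a neighboring cell; the next move deposits.
    return reward * gamma ** (distance - 1)

chest = (7, 2)
# grid[y][x] holds the idealized value of the cell in column x, row y.
grid = [[ideal_value((x, y), chest) for x in range(10)] for y in range(10)]

for row in grid:
    print(" ".join(f"{value:4.2f}" for value in row))
```

Values fall off smoothly with distance from the chest, which is the qualitative pattern a trained critic would be expected to approximate.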


In [None]:
# Placeholder for value function heatmap
# Full implementation requires observation construction for each grid position
print("ℹ️  Value function visualization coming in future update")
print("   For now, use position heatmap to see where agent spends time")


## 10. Position Heatmap

Where does the agent spend its time? We already collected position data during evaluation, so let's visualize it.
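
Conceptually, a position heatmap is a grid of visit counts. The sketch below assumes positions arrive as `(x, y)` pairs, which may not match the actual format of `metrics['positions']`; the `visit_counts` helper is hypothetical.

```python
def visit_counts(positions, width, height):
    """Count visits per cell; grid[y][x] is the count for column x, row y."""
    grid = [[0] * width for _ in range(height)]
    for x, y in positions:
        if 0 <= x < width and 0 <= y < height:  # ignore out-of-bounds samples
            grid[y][x] += 1
    return grid

# Hypothetical trajectory: walk east, step south, then wait next to a chest.
positions = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 1), (2, 1)]
grid = visit_counts(positions, width=4, height=3)
for row in grid:
    print(row)
```

A cell where the agent waits or repeatedly deposits accumulates the highest count, which is why bright spots tend to appear next to the chest.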


In [None]:
# Plot position heatmap from evaluation data
if metrics['positions']:
    print("📍 Generating position heatmap from evaluation...")
    fig = plot_position_heatmap(
        metrics['positions'],
        config.game.width,
        config.game.height
    )
    plt.tight_layout()
    plt.show()
    
    print("""
📖 Interpretation:
- **Bright spots**: Areas where agent spends more time
- Should show concentration near chest after training
- Before training: uniform distribution (random wandering)
- After training: focused on chest location
    """)
else:
    print("ℹ️  No position data collected (env.get_agent_positions() not available)")


## 🎓 Summary and Key Takeaways

Congratulations! You've trained your first CoGames agent. Here's what we learned:

### Core Concepts
1. **Sparse Rewards**: The agent only receives reward when depositing hearts
2. **Move Action**: Agents interact with objects by moving into them from valid positions
3. **Agent-Specific Stats**: `heart.lost` is tracked per agent, so each agent is rewarded only for hearts leaving its own inventory (unlike global stats)
4. **Navigation Learning**: The policy learned spatial relationships between agent and chest

### Training Results
- ✅ Agent learned to navigate to chest consistently
- ✅ Episode length decreased (more efficient paths)
- ✅ Return increased to ~3.0 (depositing all 3 hearts)
- ✅ Position heatmap concentrates near the chest (value-function visualization is still a placeholder)

### Next Steps: Tutorial 2 - Simple Assembly

In the next tutorial, we'll increase complexity:
- **Add crafting**: Agent must craft hearts from raw resources
- **Multi-step planning**: Navigate to assembler → craft → navigate to chest → deposit
- **Resource management**: Limited materials require efficient use
- **Longer episodes**: More exploration needed

Ready to continue? Open `02_simple_assembly.ipynb`!

---

### 💾 Save Your Work

Training already saved your policy as a checkpoint. The cell below prints its path and shows how to reload it later:


In [None]:
# The policy is already saved during training
print(f"✅ Trained policy checkpoint: {latest_checkpoint}")
print(f"\n📝 To use this policy later:")
print(f"   1. In Python:")
print(f"      policy = SimplePolicy(env, device)")
print(f"      policy.load_policy_data('{latest_checkpoint}')")
print(f"\n   2. From command line:")
print(f"      cogames play <game> --policy simple --policy-data {latest_checkpoint}")
