# Tutorial 2: Simple Assembly - Learning to Craft and Deposit

Welcome to the second CoGames tutorial! Building on Tutorial 1, you'll now train an agent to perform a more complex task: **craft hearts from raw materials, then deposit them**.

## 🎯 Learning Objectives

- Understand crafting mechanics in CoGames
- Learn multi-step task planning (craft → navigate → deposit)
- Use transfer learning from a simpler task
- Visualize subtask completion
- Track resource flow through agent inventory

## 📋 Task Overview

**Starting State:**
- Agent spawns randomly in a 12x12 map
- Agent starts with raw materials: 5 carbon, 5 oxygen, 5 germanium, 5 silicon
- One assembler (for crafting) and one chest (for depositing) are placed randomly

**Goal:**
1. Navigate to the assembler
2. Craft hearts from resources (1C + 1O + 1Ge + 1Si → 1 heart)
3. Navigate to the chest
4. Deposit crafted hearts
5. Maximize reward: +1 per heart deposited

**Expected Training Time:** 50k steps (~5-7 minutes on CPU)

**Key Difference from Tutorial 1:** The agent must now complete two subtasks in sequence (craft, then deposit) instead of one.

---


## 1. Setup and Imports


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from pathlib import Path

# CoGames imports
from cogames.cogs_vs_clips.scenarios import make_game
from cogames.policy.simple import SimplePolicy
from cogames.train import train
from mettagrid import MettaGridEnv
from mettagrid.config.mettagrid_config import RecipeConfig

# Import visualization utilities
from tutorial_viz import (
    plot_episode_returns,
    plot_success_rate,
    plot_crafting_subtasks,
    plot_inventory_timeline,
    evaluate_policy,
    print_metrics_table,
)

# Set random seeds for reproducibility (NumPy and PyTorch)
np.random.seed(42)
torch.manual_seed(42)

print("✅ Imports complete!")


## 2. Transfer Learning: Load Stage 1 Policy (Optional)

One powerful technique in RL is **transfer learning**: using a policy trained on a simpler task as the starting point for a harder one.

Since our agent already learned to navigate and deposit in Tutorial 1, we can use that checkpoint as our initial weights. This often leads to:
- Faster training
- Better final performance
- More stable learning

If you completed Tutorial 1, specify the checkpoint path below. Otherwise, we'll train from scratch.
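
Conceptually, warm-starting copies parameters from the old policy into the new one wherever names and shapes line up, and leaves everything else at its fresh initialization. The sketch below illustrates that idea with plain Python dictionaries standing in for weight tensors. `warm_start`, `shape_of`, and the parameter names are hypothetical, and the sketch does not describe how `train` actually consumes `initial_weights_path`.

```python
def shape_of(value):
    """Return the shape of a nested list as a tuple, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0] if value else None
    return tuple(shape)


def warm_start(fresh_params, pretrained_params):
    """Copy pretrained values into fresh_params where name and shape both match.

    Returns the merged parameters and the list of names that were transferred.
    """
    merged, transferred = {}, []
    for name, value in fresh_params.items():
        source = pretrained_params.get(name)
        if source is not None and shape_of(source) == shape_of(value):
            merged[name] = source
            transferred.append(name)
        else:
            merged[name] = value  # keep the fresh initialization
    return merged, transferred


# Hypothetical parameter names; a real policy would hold tensors, not lists.
stage1 = {"encoder.weight": [[0.5, -0.2], [0.1, 0.9]], "head.weight": [[0.3, 0.3]]}
stage2 = {"encoder.weight": [[0.0, 0.0], [0.0, 0.0]], "head.weight": [[0.0, 0.0], [0.0, 0.0]]}

merged, transferred = warm_start(stage2, stage1)
print(transferred)  # ['encoder.weight']
```

In this toy case the encoder is reused while the differently shaped head starts from scratch, which is the usual pattern when the new task changes the output layer but not the input processing.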


In [None]:
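# Try to find Stage 1 checkpoints, oldest first. Sorting by modification time
# avoids misordering names with unpadded step counts (e.g. 9000 vs. 10000).
stage1_checkpoint_dir = Path("./checkpoints/cogames.cogs_vs_clips")
stage1_checkpoints = sorted(stage1_checkpoint_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime) if stage1_checkpoint_dir.exists() else []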

if stage1_checkpoints:
    # Use the latest checkpoint from Stage 1
    initial_weights_path = str(stage1_checkpoints[-1])
    print(f"✅ Found Stage 1 checkpoint: {stage1_checkpoints[-1].name}")
    print(f"   Will use transfer learning from Tutorial 1")
else:
    # Train from scratch
    initial_weights_path = None
    print(f"ℹ️  No Stage 1 checkpoint found")
    print(f"   Will train from scratch (this is fine, just takes a bit longer)")


## 3. Configure the Environment

Now we'll create a more complex environment with:
- 1 agent (starting with raw materials: C, O, Ge, Si)
- 1 assembler (for crafting hearts)
- 1 chest (for depositing crafted hearts)
- Larger 12x12 map (more exploration needed)

**Crafting Recipe:** 1 Carbon + 1 Oxygen + 1 Germanium + 1 Silicon → 1 Heart
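
To make the recipe concrete, the sketch below applies it to a plain dictionary inventory. `craft_once` is a hypothetical helper written for illustration, not part of CoGames or MettaGrid; it only mirrors the inputs and outputs described above.

```python
RECIPE_INPUTS = {"carbon": 1, "oxygen": 1, "germanium": 1, "silicon": 1}
RECIPE_OUTPUTS = {"heart": 1}


def craft_once(inventory, inputs=RECIPE_INPUTS, outputs=RECIPE_OUTPUTS):
    """Return a new inventory after one craft, or None if any input is short."""
    if any(inventory.get(res, 0) < need for res, need in inputs.items()):
        return None
    updated = dict(inventory)
    for res, need in inputs.items():
        updated[res] -= need
    for res, gain in outputs.items():
        updated[res] = updated.get(res, 0) + gain
    return updated


# Starting inventory from this tutorial (energy omitted, the recipe does not use it).
inventory = {"carbon": 5, "oxygen": 5, "germanium": 5, "silicon": 5}
hearts_crafted = 0
while (next_inventory := craft_once(inventory)) is not None:
    inventory = next_inventory
    hearts_crafted += 1

print(hearts_crafted)  # 5
print(inventory)
```

With 5 units of each input, the loop stops after five crafts, which is why the best possible return in this tutorial is 5.0.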


In [None]:
# Create base configuration
config = make_game(
    num_cogs=1,
    width=12,
    height=12,
    num_assemblers=1,  # Add assembler for crafting
    num_chests=1,
    num_chargers=0,
    num_carbon_extractors=0,
    num_oxygen_extractors=0,
    num_germanium_extractors=0,
    num_silicon_extractors=0,
)

# Agent starts with crafting materials (no hearts yet!)
config.game.agent.initial_inventory = {
    "energy": 100,
    "carbon": 5,
    "oxygen": 5,
    "germanium": 5,
    "silicon": 5,
}

# Increase resource carrying capacity
config.game.agent.resource_limits["heart"] = 5

# Configure simplified crafting recipe
config.game.objects["assembler"].recipes = [
    (
        ["Any"],  # Can approach from any direction (easier to learn)
        RecipeConfig(
            input_resources={
                "carbon": 1,
                "oxygen": 1,
                "germanium": 1,
                "silicon": 1,
            },
            output_resources={"heart": 1},
            cooldown=1,  # Can craft every step
        ),
    )
]

# Configure chest to accept deposits from all sides, but no withdrawals
config.game.objects["chest"].deposit_positions = ["N", "S", "E", "W"]
config.game.objects["chest"].withdrawal_positions = []  # Disable withdrawals

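# Reward: +1 per heart deposited (same as Tutorial 1).
# The reward is keyed on the "heart.lost" stat, which here presumably counts
# hearts leaving the agent's inventory, i.e. deposits into the chest.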
config.game.agent.rewards.stats = {
    "heart.lost": 1.0
}

print("✅ Environment configured!")
print(f"   Map size: {config.game.width}x{config.game.height}")
print(f"   Initial resources: C={config.game.agent.initial_inventory['carbon']}, "
      f"O={config.game.agent.initial_inventory['oxygen']}, "
      f"Ge={config.game.agent.initial_inventory['germanium']}, "
      f"Si={config.game.agent.initial_inventory['silicon']}")
print(f"   Crafting recipe: 1C + 1O + 1Ge + 1Si → 1 Heart")
print(f"   Max craftable hearts: 5")


## 5. Train the Agent

Now let's train for 50,000 steps (~5-7 minutes). If we found a Stage 1 checkpoint, we'll use it as our starting point.

**What to expect:**
- Initial episodes: Random exploration, might accidentally craft or deposit
- After ~10k steps: Agent starts reliably finding assembler
- After ~30k steps: Agent learns the full sequence (craft → deposit)
- After ~50k steps: Consistent multi-step execution


In [None]:
%%time

# Set up checkpoint directory for Stage 2
checkpoint_dir = Path("./checkpoints_stage2")
checkpoint_dir.mkdir(parents=True, exist_ok=True)

# Train the policy
print("🚀 Starting training...")
print("=" * 60)

train(
    env_cfg=config,
    policy_class_path="cogames.policy.simple.SimplePolicy",
    device=torch.device("cpu"),
    initial_weights_path=initial_weights_path,  # Transfer learning from Stage 1!
    num_steps=50_000,
    checkpoints_path=checkpoint_dir,
    seed=42,
    batch_size=512,
    minibatch_size=512,
    vector_num_envs=4,
    vector_num_workers=1,
)

print("=" * 60)
print("✅ Training complete!")


## 6. Load the Trained Policy and Evaluate


In [None]:
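# Find the latest checkpoint (sorted by modification time, newest last)
stage2_checkpoint_dir = checkpoint_dir / "cogames.cogs_vs_clips"
checkpoint_files = sorted(stage2_checkpoint_dir.glob("*.pt"), key=lambda p: p.stat().st_mtime)
if not checkpoint_files:
    raise FileNotFoundError(f"No checkpoints found in {stage2_checkpoint_dir}")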

latest_checkpoint = checkpoint_files[-1]
print(f"📂 Loading checkpoint: {latest_checkpoint.name}")

# Create environment and load policy
dummy_env = MettaGridEnv(env_cfg=config)
device = torch.device("cpu")
trained_policy = SimplePolicy(dummy_env, device)
trained_policy.load_policy_data(str(latest_checkpoint))

print("✅ Policy loaded!")

# Evaluate the policy to collect metrics
print("\n📊 Evaluating policy (100 episodes)...")
metrics = evaluate_policy(
    config=config,
    policy=trained_policy,
    num_episodes=100,
    max_steps=300,  # Longer episodes for multi-step task
    seed=42
)
print(f"✅ Evaluation complete! Collected {len(metrics['episode_returns'])} episodes")


## 7. Visualize Training Progress


In [None]:
# Plot episode returns and lengths
fig = plot_episode_returns(
    metrics['episode_returns'],
    metrics['episode_lengths'],
    window_size=50
)
plt.tight_layout()
plt.show()

# Print summary statistics
print_metrics_table({
    "Final Avg Return": np.mean(metrics['episode_returns'][-50:]),
    "Max Return": np.max(metrics['episode_returns']),
    "Final Avg Length": np.mean(metrics['episode_lengths'][-50:]),
    "Total Episodes": len(metrics['episode_returns']),
})


### Interpreting the Results

**Episode Return Curve:**
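- During training, returns should rise from ~0 toward ~5.0 (all 5 hearts crafted and deposited); the episodes plotted above come from evaluating the final policy, so they should already sit near its final level
- Training may plateau at 1.0, 2.0, or 3.0 as the agent learns to craft more hearts
- Transfer learning should show faster initial improvement than training from randomly initialized weights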

**Episode Length Curve:**
- Expect ~80-150 steps per episode, longer than in Tutorial 1
- Requires visiting TWO locations (assembler + chest)
- Multiple crafting actions increase episode time

**Success Criteria:**
- Average return > 2.0 (crafting and depositing at least 2 hearts)
- Episode length < 150 steps
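
The plots smooth noisy per-episode values with a moving window (`window_size=50` above). How `plot_episode_returns` computes this internally is not shown in this tutorial; the sketch below is one common approach, a trailing mean, applied to made-up returns. `trailing_mean` is a hypothetical helper.

```python
from collections import deque


def trailing_mean(values, window_size):
    """Mean of the last `window_size` values at each position (shorter at the start)."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    window = deque(maxlen=window_size)
    running_sum = 0.0
    smoothed = []
    for value in values:
        if len(window) == window.maxlen:
            running_sum -= window[0]  # this value is evicted by the append below
        window.append(value)
        running_sum += value
        smoothed.append(running_sum / len(window))
    return smoothed


# Made-up episode returns for illustration only.
returns = [0, 1, 0, 2, 3, 5, 4, 5]
smoothed = trailing_mean(returns, window_size=4)
print(smoothed)
```

A larger window gives a smoother curve but reacts more slowly to real changes, so with only 100 evaluation episodes a window of 50 leaves relatively few fully averaged points.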


## 8. Success Rate Analysis


In [None]:
# Define success as crafting and depositing at least 2 hearts
success_threshold = 2.0
successes = [1 if r >= success_threshold else 0 for r in metrics['episode_returns']]

# Plot success rate
fig = plot_success_rate(successes, window_size=50)
plt.tight_layout()
plt.show()

# Print final success rate
final_success_rate = np.mean(successes[-50:]) * 100
print(f"\n📊 Final Success Rate (last 50 episodes): {final_success_rate:.1f}%")
print(f"   Target: 60%+")
if final_success_rate >= 60:
    print("   ✅ Target achieved!")
else:
    print("   ⚠️  Consider training longer or using transfer learning")


## 9. Subtask Completion Analysis (NEW!)

One of the key insights for multi-step tasks is understanding **which subtasks** the agent has learned.

For this task, we can track:
1. **Assembler visits**: Did the agent reach the assembler?
2. **Hearts crafted**: Did crafting trigger?
3. **Hearts deposited**: Did the agent complete the full sequence?

This helps us diagnose where learning might be stuck (e.g., good at crafting but not depositing).

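**Note:** This requires tracking environment state during evaluation. The next cell uses the per-episode `crafting_events` counts collected by `evaluate_policy`, and treats the episode return as the number of deposits, since each deposited heart is worth +1.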
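
The three signals listed above can be read as a funnel: of all episodes, how many reached each stage? The sketch below assumes per-episode counts are available as plain lists. The helper names and the numbers are made up for illustration and do not reflect the actual output format of `evaluate_policy`.

```python
def subtask_funnel(assembler_visits, hearts_crafted, hearts_deposited):
    """Fraction of episodes that reached each stage at least once."""
    n = len(assembler_visits)
    if n == 0 or not (n == len(hearts_crafted) == len(hearts_deposited)):
        raise ValueError("need equally sized, non-empty per-episode lists")
    return {
        "reached_assembler": sum(v > 0 for v in assembler_visits) / n,
        "crafted": sum(c > 0 for c in hearts_crafted) / n,
        "deposited": sum(d > 0 for d in hearts_deposited) / n,
    }


def deposit_efficiency(hearts_crafted, hearts_deposited):
    """Total deposited hearts divided by total crafted hearts (0.0 if none crafted)."""
    total_crafted = sum(hearts_crafted)
    return sum(hearts_deposited) / total_crafted if total_crafted else 0.0


# Made-up data for four episodes.
visits = [1, 2, 0, 1]
crafted = [5, 3, 0, 0]
deposited = [5, 1, 0, 0]

funnel = subtask_funnel(visits, crafted, deposited)
efficiency_example = deposit_efficiency(crafted, deposited)
print(funnel)
print(efficiency_example)  # 0.75
```

A large drop between two adjacent stages points at the subtask to focus on. In this made-up data, one episode reached the assembler without crafting, and another crafted three hearts but deposited only one.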


In [None]:
# Plot subtask completion (crafting vs depositing)
print("📊 Analyzing subtask completion...")
print(f"   Average crafts per episode: {np.mean(metrics['crafting_events']):.2f}")
print(f"   Average deposits per episode: {np.mean(metrics['episode_returns']):.2f}")

fig = plot_crafting_subtasks(metrics)
plt.tight_layout()
plt.show()

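# Interpretation (last 20 episodes)
avg_crafted = np.mean(metrics['crafting_events'][-20:])
avg_deposited = np.mean(metrics['episode_returns'][-20:])
# Guard only against division by zero; clamping the denominator to 1 would
# understate efficiency whenever fewer than one heart is crafted per episode
efficiency = avg_deposited / avg_crafted if avg_crafted > 0 else 0.0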

print(f"\n📈 Final Performance Analysis:")
print(f"   Efficiency: {efficiency:.1%} (deposited / crafted)")
if efficiency >= 0.9:
    print(f"   ✅ Excellent! Agent rarely loses hearts")
elif efficiency >= 0.7:
    print(f"   ✅ Good! Agent completes craft→deposit sequence reliably")
elif efficiency >= 0.5:
    print(f"   ⚠️  Moderate. Agent sometimes loses hearts or fails to deposit")
else:
    print(f"   ❌ Low efficiency. Agent struggling with craft→deposit sequence")


## 10. Compare with Stage 1 (Transfer Learning Impact)

If we used transfer learning, let's recap the benefits to expect and compare the final return against Tutorial 1.


In [None]:
if initial_weights_path:
    print("🔄 Transfer Learning Was Used!")
    print(f"   Initial weights from: {Path(initial_weights_path).name}")
    print(f"\n📊 Expected benefits of transfer learning:")
    print(f"   • Faster initial learning (agent already knows navigation)")
    print(f"   • Higher final performance (builds on existing skills)")
    print(f"   • More stable training (starts from good policy)")
    print(f"\n💡 Tip: Compare final return here vs Tutorial 1:")
    print(f"   Tutorial 1 max return: 3.0 (3 hearts)")
    print(f"   Tutorial 2 avg return: {np.mean(metrics['episode_returns'][-50:]):.2f} (out of 5.0)")
else:
    print("ℹ️  Training from scratch (no transfer learning)")
    print(f"   This is fine! The agent can still learn the task.")
    print(f"   Transfer learning would have given:")
    print(f"   • ~20-30% faster training")
    print(f"   • ~10-15% higher final performance")


## 🎓 Summary and Key Takeaways

Congratulations! You've successfully trained an agent on a multi-step task.

### Core Concepts Learned

1. **Crafting Mechanics**: Crafting is triggered by a MOVE action at the assembler while the agent holds the required resources
2. **Multi-Step Planning**: Agent learned: navigate → craft → navigate → deposit
3. **Transfer Learning**: Reusing knowledge from simpler tasks speeds up learning
4. **Resource Management**: Agent tracks inventory and executes conditional actions
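
As a sanity check on episode lengths, the plan in item 2 can be turned into a rough step estimate for a scripted agent. This is a simplified model, not CoGames behavior: it assumes an obstacle-free grid, one action per move, craft, and deposit, and it ignores whether the agent must stand on or next to an object. The function names and coordinates are hypothetical.

```python
def manhattan(a, b):
    """Grid distance between two (x, y) cells when only axis-aligned moves are allowed."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def scripted_steps(agent, assembler, chest, hearts):
    """Rough step count for navigate -> craft -> navigate -> deposit."""
    to_assembler = manhattan(agent, assembler)
    to_chest = manhattan(assembler, chest)
    # One action per craft and one per deposit (an assumption of this model).
    return to_assembler + hearts + to_chest + hearts


# A spread-out placement on a 12x12 map (coordinates run from 0 to 11).
steps = scripted_steps(agent=(0, 0), assembler=(11, 11), chest=(0, 11), hearts=5)
print(steps)  # 43
```

Under these assumptions even a spread-out layout needs only a few dozen actions, so the ~80-150 steps expected from a learned policy leave room for exploration and imperfect paths.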

### Training Results

Let's summarize what the agent achieved:


In [None]:
# Final summary
print("=" * 60)
print("📊 STAGE 2 TRAINING SUMMARY")
print("=" * 60)
print(f"Task: Craft hearts from resources, then deposit")
print(f"Map Size: {config.game.width}x{config.game.height}")
print(f"Training Steps: 50,000")
print(f"Transfer Learning: {'Yes ✅' if initial_weights_path else 'No'}")
print()
print(f"Results:")
print(f"  Average Return: {np.mean(metrics['episode_returns'][-50:]):.2f} / 5.0")
print(f"  Max Return: {np.max(metrics['episode_returns']):.2f}")
print(f"  Success Rate: {final_success_rate:.1f}%")
print(f"  Average Episode Length: {np.mean(metrics['episode_lengths'][-50:]):.1f} steps")
print()
print(f"Checkpoint: {latest_checkpoint}")
print("=" * 60)


### Next Steps: Tutorial 3 - Full Cycle with Multi-Agent

In the next tutorial, we'll add the final piece of complexity:
- **Resource Foraging**: Agent must forage raw materials from extractors
- **Full Cycle**: Forage → craft → deposit (complete production pipeline)
- **Multi-Agent Coordination**: Scale to 2-4 agents working in parallel
- **Emergent Behavior**: Observe specialization and coordination

The complete task chain:
```
Stage 1: ........... deposit (given hearts)
Stage 2: craft → deposit (given resources)
Stage 3: forage → craft → deposit (full autonomy!)
```

Ready to continue? Open `03_full_cycle_multiagent.ipynb`!

---

### 💾 Using Your Trained Policy


In [None]:
print(f"✅ Trained policy checkpoint: {latest_checkpoint}")
print(f"\n📝 To use this policy:")
print(f"   1. As initial weights for Tutorial 3 (transfer learning)")
print(f"   2. Visualize: cogames play --policy simple --policy-data {latest_checkpoint}")
print(f"   3. Further training: Set initial_weights_path='{latest_checkpoint}'")
