# Interactive Scorecard Widget Example

This notebook demonstrates how to use the `ScorecardWidget`, an anywidget-based implementation of the observatory scorecard component for Jupyter notebooks.

The widget renders interactive policy evaluation scorecards and lets you:
- Control the number of policies displayed
- Select which metrics to render
- Choose which runs to take policies from
- Fetch data dynamically from the Observatory API
- Display scorecards for custom metrics


## Import and Basic Setup


In [None]:
%load_ext autoreload
%autoreload 2
%load_ext anywidget

print("Setup complete! Auto-reload enabled.")

## Real Data from Metta Database

Now let's write code that fetches real evaluation data from metta's databases:

First we'll need some API variables.

Then we'll make an API client for the metta API.

Then we'll use the client to retrieve policy data.

Then we'll render it with the ScorecardWidget.


In [None]:
from metta.app_backend.clients.scorecard_client import ScorecardClient
from experiments.notebooks.utils.scorecard_widget.scorecard_widget.ScorecardWidget import ScorecardWidget

client = ScorecardClient()

scorecard_widget = ScorecardWidget(client=client)
scorecard_widget

## Example: Finding policies with search text

Now let's explore what's available in the database and create a scorecard with real data:


In [None]:
# For now, let's try with some common metrics and see what we find:

# You can search for policies with a list of search texts
restrict_to_policy_names = ["zfogg.1753775626", "daveey.arena.rnd."]

# You can get training run policies by exact training run names
# restrict_to_policy_names = [
#     "daveey.arena.rnd.16x4.2",
#     "relh.skypilot.fff.j20.666",
#     "bullm.navigation.low_reward.baseline",
#     "bullm.navigation.low_reward.baseline.07-17", 
#     "bullm.navigation.low_reward.baseline.07-23",
#     "relh.multigpu.fff.1",
#     "relh.skypilot.fff.j21.2",
# ]

scorecard2 = ScorecardWidget(client=client)
await scorecard2.fetch_real_scorecard_data(
    restrict_to_policy_names=restrict_to_policy_names,
    restrict_to_metrics=["reward", "heart.get", "action.move.success"],
    policy_selector="latest",  # "best" or "latest"
    max_policies=50,  # Limit display to keep it manageable
)
    
scorecard2

## Example: Finding policies using training run names

Here's how to create a scorecard with your own training runs and metrics:


In [None]:
# Example: Create a custom scorecard with specific training runs and metrics

# Step 1: List training run names (recommended; a policy is chosen from each run via policy_selector)
my_training_runs = [
    # Add your training run names here, for example:
    "gregorypylypovych.navigation.ffa_NAV_DEFAULT_vtm_4rooms_of_4_seed0.07-27",
    "gregorypylypovych.navigation.ffa_DEFAULT_vtm_4rooms_of_4_seed0.07-27",
    "gregorypylypovych.navigation.ffa_DEFAULT_tfn_4rooms_of_4_seed0.07-27",
    "gregorypylypovych.navigation.ffa_DEFAULT_vtb_4rooms_of_4_seed0.07-27",
]

# Step 2: Define metrics you want to compare
my_metrics = [
    "reward",
    "heart.get",           # Example game-specific metric
    "action.move.success", # Example action success rate
    # Add more metrics as needed
]

print("🎯 Creating custom scorecard with best policies from training runs...")
# Select best policies from training runs
custom_scorecard = ScorecardWidget(client=client)
await custom_scorecard.fetch_real_scorecard_data(
    restrict_to_policy_names=my_training_runs,
    restrict_to_metrics=my_metrics,
    policy_selector="best",
    max_policies=20
)

print("📊 Custom scorecard created! Try:")
print("   - Hovering over cells to see detailed values")
print("   - Changing metrics with: custom_scorecard.update_metric('heart.get')")
print("   - Adjusting policies shown: custom_scorecard.set_num_policies(15)")

custom_scorecard

## Example: Setting the metric

`update_metric` switches the metric shown by an existing scorecard. The cells below change the metric on `custom_scorecard`, and the scorecard rendered above updates to show the new values:


In [None]:
# Scroll up to the scorecard above to see the change.
print("\n🔄 Changing metric to 'action.move.success'...")
custom_scorecard.update_metric('action.move.success')

print("The scorecard rendered by the earlier cell should now show the new metric.")

In [None]:
# And now let's change it to reward.
custom_scorecard.update_metric('reward')
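
Conceptually, switching the metric changes which entry of each cell's `metrics` dict is displayed. The sketch below illustrates that idea on plain dictionaries shaped like the cell format described later in this notebook. It is not the widget's implementation: `values_for_metric` and the `None` fallback for missing metrics are hypothetical names and choices made for illustration.

```python
def values_for_metric(cells, metric, default=None):
    """Map policy -> eval -> value of the selected metric (hypothetical helper)."""
    return {
        policy: {
            eval_name: cell.get("metrics", {}).get(metric, default)
            for eval_name, cell in evals.items()
        }
        for policy, evals in cells.items()
    }


# Hypothetical data; only the 'metrics' part of each cell matters here
sample_cells = {
    "policy_a": {
        "task_a/level1": {"metrics": {"reward": 50, "heart.get": 98}},
        "task_b/challenge1": {"metrics": {"reward": 12}},
    },
}

by_reward = values_for_metric(sample_cells, "reward")
by_hearts = values_for_metric(sample_cells, "heart.get")
print(by_reward)  # {'policy_a': {'task_a/level1': 50, 'task_b/challenge1': 12}}
print(by_hearts)  # {'policy_a': {'task_a/level1': 98, 'task_b/challenge1': None}}
```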

## Example: Custom metrics

Cells can carry any metric data you like, which is useful because the set of metrics is expected to grow. The example below builds a scorecard around an arbitrary custom metric.

First, you need to know the data format:

### Info: Data Cell Format Reference

The scorecard widget expects data in a specific format that matches the
observatory dashboard:

```python
cells = {
    'policy_name': {
        'eval_name': {
            'metrics': {
                'reward': 50,
                'heart.get': 98,
                'action.move.success': 5,
                'ore_red.get': 24.2,
                # ... more metrics
            },
            'replayUrl': str,         # URL to replay file
            'evalName': str,          # Should match the key
        },
        # ... more evaluations
    },
    # ... more policies
}
```
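
Before handing hand-built data to the widget, it can help to check it against the format above. The following is a minimal sketch of such a check. `validate_cells` is a hypothetical helper, not part of the widget, and its rules (numeric metric values, a string `replayUrl`, `evalName` equal to its key) are assumptions read off the reference above; the widget itself may be more lenient.

```python
from typing import Any


def validate_cells(cells: dict[str, dict[str, dict[str, Any]]]) -> list[str]:
    """Return a list of human-readable problems found in a cells dict."""
    problems = []
    for policy_name, evals in cells.items():
        for eval_name, cell in evals.items():
            where = f"{policy_name} / {eval_name}"
            metrics = cell.get("metrics")
            if not isinstance(metrics, dict):
                problems.append(f"{where}: 'metrics' must be a dict")
            else:
                for metric, value in metrics.items():
                    # bool is a subclass of int, so exclude it explicitly
                    if isinstance(value, bool) or not isinstance(value, (int, float)):
                        problems.append(f"{where}: metric '{metric}' is not a number")
            if cell.get("evalName") != eval_name:
                problems.append(f"{where}: 'evalName' does not match its key")
            if not isinstance(cell.get("replayUrl"), str):
                problems.append(f"{where}: 'replayUrl' must be a string")
    return problems


example_cells = {
    "my_policy_v1": {
        "task_a/level1": {
            "metrics": {"custom_score": 85.2},
            "replayUrl": "https://example.com/replay1.json",
            "evalName": "task_a/level1",
        },
    },
}
print(validate_cells(example_cells))  # []
```

An empty list means no problems were found under these assumed rules.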

**Important notes:**
- Evaluation names with "/" will be grouped by category (the part before "/")
- The scorecard shows policies sorted by average score (worst to best, bottom to top)
- Policy names that contain ":v" will have WandB URLs generated automatically
- Replay URLs should be accessible URLs or file paths
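
The category grouping in the first note can be pictured with a short sketch. This is an illustration only, not the widget's code: `group_by_category` is a hypothetical helper, splitting on the first "/" is an assumption, and so is the "uncategorized" bucket for names that contain no "/".

```python
from collections import defaultdict


def group_by_category(eval_names: list[str]) -> dict[str, list[str]]:
    """Group evaluation names by the part before the first '/'."""
    groups = defaultdict(list)
    for name in eval_names:
        # partition() splits on the first "/" only; sep is "" when there is none
        category, sep, _ = name.partition("/")
        groups[category if sep else "uncategorized"].append(name)
    return dict(groups)


print(group_by_category(["task_a/level1", "task_a/level2", "task_b/challenge1"]))
# {'task_a': ['task_a/level1', 'task_a/level2'], 'task_b': ['task_b/challenge1']}
```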

This widget provides the same interactive functionality as the observatory dashboard, but in a Python environment, which makes it well suited to exploratory analysis and to sharing results via Jupyter notebooks.


### Here's how to create a scorecard with your own data:


In [None]:
# Create a new scorecard widget
from experiments.notebooks.utils.scorecard_widget.scorecard_widget.ScorecardWidget import create_scorecard_widget

custom_widget = create_scorecard_widget()

# Define your data structure
# This should match the format expected by the observatory dashboard
cells_data = {
    'my_policy_v1': {
        'task_a/level1': {
            'metrics': {
                'custom_score': 85.2,
            },
            'replayUrl': 'https://example.com/replay1.json', 
            'evalName': 'task_a/level1'
        },
        'task_a/level2': {
            'metrics': {
                'custom_score': 87.5,
            },
            'replayUrl': 'https://example.com/replay2.json', 
            'evalName': 'task_a/level2'
        },
        'task_b/challenge1': {
            'metrics': {
                'custom_score': 92.5,
            },
            'replayUrl': 'https://example.com/replay3.json', 
            'evalName': 'task_b/challenge1'
        },
    },
    'my_policy_v2': {
        'task_a/level1': {
            'metrics': {
                'custom_score': 22.5,
            },
            'replayUrl': 'https://example.com/replay4.json', 
            'evalName': 'task_a/level1'
        },
        'task_a/level2': {
            'metrics': {
                'custom_score': 42.5,
            },
            'replayUrl': 'https://example.com/replay5.json', 
            'evalName': 'task_a/level2'
        },
        'task_b/challenge1': {
            'metrics': {
                'custom_score': 62.5,
            },
            'replayUrl': 'https://example.com/replay6.json', 
            'evalName': 'task_b/challenge1'
        },
    },
}

eval_names = ['task_a/level1', 'task_a/level2', 'task_b/challenge1']
policy_names = ['my_policy_v1', 'my_policy_v2']
# Mean of custom_score across each policy's three evaluations above
policy_averages = {
    'my_policy_v1': 88.4,
    'my_policy_v2': 42.5,
}

# Set the data
custom_widget.set_data(
    cells=cells_data,
    eval_names=eval_names,
    policy_names=policy_names,
    policy_average_scores=policy_averages,
    selected_metric="custom_score"
)

# Display the widget
custom_widget
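
Rather than typing the values for `policy_average_scores` by hand, they can be derived from the cells themselves. The sketch below shows one way to do that, assuming the average is a plain mean of the selected metric over a policy's evaluations. `average_scores` is a hypothetical helper, and the sample data is made up for illustration.

```python
from statistics import mean


def average_scores(cells, metric):
    """Return {policy_name: mean of `metric` over that policy's evaluations}."""
    averages = {}
    for policy_name, evals in cells.items():
        values = [
            cell["metrics"][metric]
            for cell in evals.values()
            if metric in cell.get("metrics", {})
        ]
        if values:  # skip policies that never report this metric
            averages[policy_name] = round(mean(values), 1)
    return averages


# Hypothetical data in the same shape as cells_data (replayUrl/evalName omitted for brevity)
sample_cells = {
    "policy_a": {
        "task_a/level1": {"metrics": {"custom_score": 80.0}},
        "task_a/level2": {"metrics": {"custom_score": 90.0}},
    },
    "policy_b": {
        "task_a/level1": {"metrics": {"custom_score": 40.0}},
        "task_a/level2": {"metrics": {"custom_score": 50.0}},
    },
}

averages = average_scores(sample_cells, "custom_score")
ranked = sorted(averages, key=averages.get, reverse=True)  # best first
print(averages)  # {'policy_a': 85.0, 'policy_b': 45.0}
print(ranked)    # ['policy_a', 'policy_b']
```

The resulting dict can be passed as `policy_average_scores`, and the ranked list gives a best-first ordering for `policy_names`.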
