# Metrics Notebook

## Imports & Setup

In [16]:
import json
import time
from typing import List, Dict, Any
import pandas as pd
from pathlib import Path
import sys
import openai
from dotenv import load_dotenv
import os

from llama_index.core import (
    StorageContext,
    load_index_from_storage
)


In [None]:
BASE_DIR = Path().resolve().parent
sys.path.append(str(BASE_DIR / "5-game_physics_awareness"))

from engine import MagicJudgeEngine 


## Query Function

In [2]:


def ask_judge(
    judge, 
    question: str,
    history: List[Dict] = None,
    collect_tokens: bool = False
) -> Dict[str, Any]:
    """
    Send a query to the judge.
    Failure-tolerant: if the query raises, return a dictionary describing the error
    instead of stopping the whole evaluation script.
    """
    if history is None:
        history = []

    t_start = time.time()
    
    # Control variables
    full_response = ""
    tokens = []
    error_msg = None
    success = False

    try:
        # Run the query
        stream = judge.query(question, history=history)

        # Consume the stream
        for token in stream:
            # Defensive handling in case LlamaIndex changes the structure of the object
            delta = getattr(token, "delta", str(token))
            
            if delta:
                full_response += delta
                if collect_tokens:
                    tokens.append(delta)
        
        success = True

    except Exception as e:
        # Capture the error so the 100-question loop does not stop
        error_msg = str(e)
        print(f"⚠️ Error processing question: {question[:30]}... | {error_msg}")

    latency = time.time() - t_start

    return {
        "question": question,
        "generated_answer": full_response.strip(),  # Strip surrounding whitespace
        "ground_truth": None,  # Filled in later when joining with the dataset
        "latency": latency,
        "success": success,
        "error": error_msg,
        "tokens": tokens if collect_tokens else None
    }
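
The token loop in `ask_judge` can be checked without building the real engine. The sketch below is only an illustration: `FakeToken` and `collect_stream` are hypothetical names that do not exist in the notebook or in LlamaIndex, and the stream is a plain list. It shows the same `getattr(token, "delta", str(token))` fallback handling a mix of objects with a `.delta` attribute and bare strings.

```python
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class FakeToken:
    """Hypothetical stand-in for a streamed chunk that exposes `.delta`."""
    delta: str


def collect_stream(stream: Iterable[Any]) -> str:
    """Concatenate a token stream, falling back to str(token) when `.delta` is missing."""
    parts = []
    for token in stream:
        delta = getattr(token, "delta", str(token))
        if delta:  # skips None and empty strings
            parts.append(delta)
    return "".join(parts)


# Mixed stream: objects with `.delta`, a bare string, and an empty delta.
mixed_stream = [FakeToken("Assign "), "1 ", FakeToken(""), FakeToken("damage.")]
answer = collect_stream(mixed_stream)
print(answer)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which may matter for long answers.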

In [3]:
judge = MagicJudgeEngine()

[LOG] Building Rules Index...




[LOG] Starting Rules Parsing...
[LOG] Processing 2217 lines of glossary...

DATASET STATISTICS (RULES)
   Total Nodes:      3834
   Rules Parsed:     3109
   Glossary Terms:   725


[SAMPLE RULE] {
  "rule_id": "805.5",
  "chapter_id": "8",
  "chapter_title": "Multiplayer Rules",
  "section_id": "805",
  "section_title": "Shared Team Turns Option",
  "text": "Teams have priority, not individual players."
}

[SAMPLE GLOSSARY] {
  "rule_id": "Venture into [Quality]",
  "chapter_id": "G",
  "chapter_title": "Glossary",
  "section_id": null,
  "section_title": null,
  "text": "A variant of the venture into the dungeon ability that allows a player to bring a dungeon card with [quality] into the game or move a player\u2019s venture marker. See rule 701.49, \u201cVenture into the Dungeon.\u201d"
}


Parsing nodes: 100%|██████████| 3834/3834 [00:00<00:00, 11878.34it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:26<00:00, 77.21it/s] 
Generating embeddings: 100%|██████████| 1786/1786 [00:16<00:00, 105.61it/s]


[LOG] Building Cards Index...
[LOG] Downloading AtomicCards.json...
[LOG] Processing Cards JSON...


100%|██████████| 33331/33331 [00:00<00:00, 99268.52card/s] 



DATASET STATISTICS (CARDS)
   Cards Indexed:    30225
   Format Skipped:   3106 (Not Legal)


[SAMPLE CARD] {
  "card_name": "Dispel",
  "text": "Card Name: Dispel\nFormat Legality: Pioneer: Legal, Modern: Legal, Legacy: Legal, Vintage: Legal, Commander: Legal, Pauper: Legal\nCost: {U}\nType: Instant\nOracle Text:\nCounter target instant spell.\n"
}



Parsing nodes: 100%|██████████| 30225/30225 [00:03<00:00, 9358.32it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:20<00:00, 100.77it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:20<00:00, 99.07it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:19<00:00, 106.72it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:24<00:00, 84.93it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:21<00:00, 93.85it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:19<00:00, 102.84it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:21<00:00, 95.48it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:20<00:00, 99.92it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:23<00:00, 87.16it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:23<00:00, 85.46it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:20<00:00, 99.88it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:23<00:00, 85.57it/s] 
Generating embedd

In [4]:
# test with one question
question = "If I attack with a creature with Deathtouch and Trample and it gets blocked, how much damage do I need to assign to the blocker?"
response = ask_judge(judge, question)

[LOG] Search Query: If I attack with a creature with Deathtouch and Trample and it gets blocked, how much damage do I need to assign to the blocker?
[LOG] No exact cards found. Running semantic search.

[LOG] RETRIEVAL CANDIDATES (After Filtering)
 - [Rule] 702.19b                        (sc: 0.70)
 - [Rule] 702.19d                        (sc: 0.64)
 - [Rule] 510.1c                         (sc: 0.63)
 - [Rule] 702.19e                        (sc: 0.61)
 - [Rule] 510.1d                         (sc: 0.60)
 - [Rule] 702.2c                         (sc: 0.59)
 - [Card] Enlarge                        (sc: 0.59)
 - [Rule] 510.1a                         (sc: 0.58)
 - [Rule] 510.1                          (sc: 0.58)
 - [Card] Ride Down                      (sc: 0.57)
 - [Card] Mirror Shield                  (sc: 0.55)
 - [Card] Deathcoil Wurm                 (sc: 0.54)
 - [Card] Fight to the Death             (sc: 0.54)



In [5]:
response


{'question': 'If I attack with a creature with Deathtouch and Trample and it gets blocked, how much damage do I need to assign to the blocker?',
 'generated_answer': 'Ah, the intricacies of combat mechanics! Let us delve into the nuances of damage assignment when a creature with both deathtouch and trample engages in battle.\n\n### 1. The Interaction\nWe are examining a scenario where an attacking creature possesses both deathtouch and trample, and it is subsequently blocked by another creature.\n\n### 2. The Logic (Step-by-Step)\n- **Deathtouch Mechanic**: A creature with deathtouch only needs to assign a single point of damage to a blocking creature for that damage to be considered lethal. This is crucial because it allows the attacking creature to potentially assign less damage than its total power while still fulfilling the requirement for lethal damage.\n  \n- **Trample Mechanic**: When a creature with trample is blocked, it must assign enough damage to the blocking creature to me

## Dataset Questions and Answers

In [6]:
BASE_DIR = Path().resolve().parent
sys.path.append(str(BASE_DIR / "7-grader_ai_metrics"))

# 1. Load the JSON file
with open('questions_v1.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# 2. Build the block definitions DataFrame (metadata)
df_bloques = pd.DataFrame(data['blocks_definition'])

# 3. Build the questions DataFrame
df_questions = pd.DataFrame(data['questions'])

In [7]:
df_bloques

| | block | theme | focus |
|---|---|---|---|
| 0 | 1 | Combat & Basic Keywords | Fundamentals of damage and common keywords. |
| 1 | 2 | The Stack & Spell Interaction | LIFO order, counterspells, and independence of... |
| 2 | 3 | Targets & Triggers | Legal targets, ward, and ETB/LTB triggers. |
| 3 | 4 | Resources & Mana | Mana pool, land types, and cost taxes. |
| 4 | 5 | State-Based Actions (SBA) | Legend rule, 0 toughness, and game loss condit... |
| 5 | 6 | Replacement Effects & Control | Instead effects and continuous control changes. |
| 6 | 7 | Commander & Multiplayer | Color identity, commander tax, and command zon... |
| 7 | 8 | Copies & Transformations | Copyable values, morph, and double-faced cards. |
| 8 | 9 | Costs & X Spells | Mana value on stack vs other zones and cost ca... |
| 9 | 10 | Elite Interactions (Pro Level) | Layers, complex dependencies, and timestamp lo... |


In [8]:
df_questions['block'] = (df_questions['id'] - 1) // 10 + 1
df_questions = df_questions.merge(df_bloques[['block', 'theme']], on='block')
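
The block assignment above relies on integer division. As a small worked example, assuming ids start at 1 and every block holds exactly 10 questions (an assumption read off the formula, not confirmed by the dataset file), the hypothetical helper `block_for` maps ids to blocks like this:

```python
def block_for(question_id: int, block_size: int = 10) -> int:
    """Map a 1-based question id to its 1-based block number."""
    return (question_id - 1) // block_size + 1


sample_ids = [1, 10, 11, 20, 91, 100]
mapping = {qid: block_for(qid) for qid in sample_ids}
print(mapping)
```

Subtracting 1 before dividing keeps id 10 in block 1 and id 100 in block 10, so the boundaries do not spill into the next block.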

In [9]:
df_questions.head(20)

| | id | difficulty | category | question | ground_truth | block | theme |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Easy | Keywords | If I attack with a creature with [[Lifelink]] ... | Yes. Lifelink causes you to gain life equal to... | 1 | Combat & Basic Keywords |
| 1 | 2 | Easy | Keywords | Can a creature with [[Flying]] be blocked by a... | Yes. Reach is a keyword specifically designed ... | 1 | Combat & Basic Keywords |
| 2 | 3 | Easy | Keywords | If my creature has [[Vigilance]], does it stil... | Yes. Vigilance means the creature does not tap... | 1 | Combat & Basic Keywords |
| 3 | 4 | Easy | Keywords | Does [[Ward 2]] counter a spell if the opponen... | Yes. Ward is a triggered ability. When the cre... | 1 | Combat & Basic Keywords |
| 4 | 5 | Medium | Keywords | If I attack with a creature with [[Deathtouch]... | You only need to assign 1 point of damage to t... | 1 | Combat & Basic Keywords |
| 5 | 6 | Easy | Keywords | Can a creature with [[Haste]] activate an abil... | Yes. Haste removes the restriction that preven... | 1 | Combat & Basic Keywords |
| 6 | 7 | Medium | Turn Structure | Can I cast a creature spell during my opponent... | No. Without Flash, creatures can only be cast ... | 1 | Combat & Basic Keywords |
| 7 | 8 | Easy | Keywords | If my creature has [[First Strike]] and my opp... | No. Your creature deals damage in the first st... | 1 | Combat & Basic Keywords |
| 8 | 9 | Medium | Keywords | If a creature has [[Indestructible]], does it ... | Yes. Indestructible only prevents destruction ... | 1 | Combat & Basic Keywords |
| 9 | 10 | Easy | Keywords | If I have [[Double Strike]], do I get two trig... | Yes. The creature deals damage in two separate... | 1 | Combat & Basic Keywords |


In [10]:
# Generate answers using the judge class

def get_responses(row):
    question = row['question']
    response = ask_judge(judge, question)
    return response['generated_answer']

df_questions['model_answer'] = df_questions.apply(get_responses, axis=1)

[LOG] Search Query: If I attack with a creature with [[Lifelink]] and my opponent blocks with a 1/1, do I still gain life equal to my creature's power?
[LOG] Target Cards Identified: ['Lifelink']
   >>> Found Card: Lifelink

[LOG] RETRIEVAL CANDIDATES (After Filtering)
 - [Card] Lifelink                       (sc: 2.00)
 - [Rule] 702.15e                        (sc: 0.64)
 - [Rule] 120.3f                         (sc: 0.60)
 - [Rule] 702.15b                        (sc: 0.60)
 - [Rule] 702.15d                        (sc: 0.54)
 - [Rule] 510.1c                         (sc: 0.53)
 - [Rule] 120.4d                         (sc: 0.53)
 - [Rule] 510.1a                         (sc: 0.53)
 - [Rule] 119.9                          (sc: 0.53)

[LOG] Search Query: Can a creature with [[Flying]] be blocked by a creature with [[Reach]]?
[LOG] Target Cards Identified: ['Flying', 'Reach']
[LOG] No exact cards found. Running semantic search.

[LOG] RETRIEVAL CANDIDATES (After Filtering)
 - [Rule] 702.9b   

In [11]:
df_questions.head()

| | id | difficulty | category | question | ground_truth | block | theme | model_answer |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Easy | Keywords | If I attack with a creature with [[Lifelink]] ... | Yes. Lifelink causes you to gain life equal to... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat and lifelink! Le... |
| 1 | 2 | Easy | Keywords | Can a creature with [[Flying]] be blocked by a... | Yes. Reach is a keyword specifically designed ... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat mechanics! Let u... |
| 2 | 3 | Easy | Keywords | If my creature has [[Vigilance]], does it stil... | Yes. Vigilance means the creature does not tap... | 1 | Combat & Basic Keywords | Ah, the nuances of vigilance and tapping mecha... |
| 3 | 4 | Easy | Keywords | Does [[Ward 2]] counter a spell if the opponen... | Yes. Ward is a triggered ability. When the cre... | 1 | Combat & Basic Keywords | Ah, the intricacies of ward mechanics! Let us ... |
| 4 | 5 | Medium | Keywords | If I attack with a creature with [[Deathtouch]... | You only need to assign 1 point of damage to t... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat mechanics! Let u... |


In [12]:
df_questions.to_csv("questions_with_model_answers_v1.csv", index=False)
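
The `get_responses` cell keeps only `generated_answer`, so the `latency`, `success`, and `error` fields returned by `ask_judge` are discarded. One possible way to keep them is sketched below. It assumes pandas is available and uses a hypothetical `fake_ask` function in place of `ask_judge(judge, question)` so that it runs without the engine; the tiny two-row frame is made up for illustration.

```python
import pandas as pd


def fake_ask(question: str) -> dict:
    """Hypothetical stand-in for ask_judge(judge, question); returns the same kind of keys."""
    return {
        "generated_answer": f"Answer to: {question}",
        "latency": 0.01,
        "success": True,
        "error": None,
    }


df = pd.DataFrame({"id": [1, 2], "question": ["Q1?", "Q2?"]})

# One result dict per row, expanded into columns and aligned on the original index.
records = [fake_ask(q) for q in df["question"]]
df_out = df.join(pd.DataFrame(records, index=df.index))
print(df_out.columns.tolist())
```

With the extra columns in place, failed queries can be filtered on `success` before grading instead of being scored as empty answers.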

## Metric Evaluation

In [13]:
# Prompt for evaluating answers

def get_eval_prompt(question, ground_truth, model_answer):
    return f"""
    ROLE:
    You are a Senior Magic: The Gathering Level 3 Judge. Your task is to evaluate the accuracy of a Rules Bot's response compared to an official Ground Truth.

    INPUT DATA:
    - User Question: {question}
    - Ground Truth (Correct Answer): {ground_truth}
    - Bot's Answer: {model_answer}

    EVALUATION CRITERIA:
    1. Technical Accuracy (Critical): Does the bot provide the correct ruling? 
       - If the bot says "Yes" when the answer is "No", or provides a wrong number (e.g., "3 damage" instead of "10"), the score MUST be 0.
       - Logic errors regarding Layers, Timestamps, or State-Based Actions must be heavily penalized.
    2. Completeness: Does the bot explain *why* based on the rules?
    3. Source Citation: Does the bot mention relevant rules or card names correctly?

    SCORING SCALE (0-5):
    - 5: Perfectly accurate, explains the logic, and matches the Ground Truth.
    - 4: Correct ruling but missing some nuance or explanation.
    - 3: Correct ruling but with slightly confusing or redundant explanation.
    - 1-2: Major technical inaccuracies or misleading information.
    - 0: Completely wrong ruling (e.g., opposite outcome) or hallucination.

    OUTPUT FORMAT:
    You must return ONLY a JSON object with the following keys:
    {{
        "score": int,
        "verdict": "CORRECT" or "INCORRECT",
        "reasoning": "A brief explanation of why the score was given, focusing on technical MTG rules."
    }}
    """
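
The prompt asks the grader for a JSON object with `score`, `verdict`, and `reasoning`, but nothing guarantees that the reply respects the 0-5 range or the two allowed verdicts. The sketch below shows one way such a reply could be checked before it is stored. `parse_evaluation` and `ALLOWED_VERDICTS` are hypothetical names introduced here, not part of the notebook.

```python
import json

ALLOWED_VERDICTS = {"CORRECT", "INCORRECT"}


def parse_evaluation(raw: str) -> dict:
    """Parse the grader's JSON reply and check it against the prompt's output format."""
    data = json.loads(raw)
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 5:
        raise ValueError(f"score must be an int in 0..5, got {score!r}")
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data


good = parse_evaluation('{"score": 4, "verdict": "CORRECT", "reasoning": "Right ruling."}')
print(good["score"], good["verdict"])

try:
    parse_evaluation('{"score": 7, "verdict": "CORRECT", "reasoning": "x"}')
except ValueError as exc:
    print("rejected:", exc)
```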

In [17]:
# Evaluation function
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

client = openai.OpenAI(api_key=openai_api_key)

def run_evaluation(df):
    results = []
    
    for idx, row in df.iterrows():
        print(f"Judging question {row['id']}...")
        
        prompt = get_eval_prompt(row['question'], row['ground_truth'], row['model_answer'])
        
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "system", "content": "You are a precise MTG Judge Evaluator."},
                          {"role": "user", "content": prompt}],
                response_format={ "type": "json_object" },
                temperature=0  # We want consistency, not creativity
            )
            
            evaluation = json.loads(response.choices[0].message.content)
            results.append(evaluation)
        except Exception as e:
            results.append({"score": 0, "verdict": "ERROR", "reasoning": str(e)})

    # Attach the results to the original DataFrame, aligning on its index
    df_results = pd.concat([df, pd.DataFrame(results, index=df.index)], axis=1)
    return df_results
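
Column-wise concatenation with `pd.concat(..., axis=1)` aligns rows by index label, not by position. This matters when the frame passed to `run_evaluation` is a slice whose index does not start at 0. The made-up frames below are only an illustration of that behavior, assuming pandas is available.

```python
import pandas as pd

# A slice-like frame whose index labels are 10 and 11 rather than 0 and 1.
subset = pd.DataFrame({"id": [11, 12]}, index=[10, 11])
results = [{"score": 5}, {"score": 0}]

# Default index (0, 1) on the results: labels do not match, so rows are not paired.
misaligned = pd.concat([subset, pd.DataFrame(results)], axis=1)

# Reusing the subset's index pairs each result with its question.
aligned = pd.concat([subset, pd.DataFrame(results, index=subset.index)], axis=1)

print(len(misaligned), len(aligned))
```

The misaligned frame ends up with four rows padded with `NaN`, while the aligned one keeps two rows with each score next to its question.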


In [18]:
# Run a test on the first 10 questions
df_test = df_questions.head(10)
df_test_eval = run_evaluation(df_test)

Judging question 1...
Judging question 2...
Judging question 3...
Judging question 4...
Judging question 5...
Judging question 6...
Judging question 7...
Judging question 8...
Judging question 9...
Judging question 10...


In [19]:
df_test_eval

| | id | difficulty | category | question | ground_truth | block | theme | model_answer | score | verdict | reasoning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Easy | Keywords | If I attack with a creature with [[Lifelink]] ... | Yes. Lifelink causes you to gain life equal to... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat and lifelink! Le... | 5 | CORRECT | The bot accurately explains that lifelink allo... |
| 1 | 2 | Easy | Keywords | Can a creature with [[Flying]] be blocked by a... | Yes. Reach is a keyword specifically designed ... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat mechanics! Let u... | 5 | CORRECT | The bot accurately explains that a creature wi... |
| 2 | 3 | Easy | Keywords | If my creature has [[Vigilance]], does it stil... | Yes. Vigilance means the creature does not tap... | 1 | Combat & Basic Keywords | Ah, the nuances of vigilance and tapping mecha... | 0 | INCORRECT | The bot's response does not provide a correct ... |
| 3 | 4 | Easy | Keywords | Does [[Ward 2]] counter a spell if the opponen... | Yes. Ward is a triggered ability. When the cre... | 1 | Combat & Basic Keywords | Ah, the intricacies of ward mechanics! Let us ... | 2 | INCORRECT | The bot does not provide a clear ruling on whe... |
| 4 | 5 | Medium | Keywords | If I attack with a creature with [[Deathtouch]... | You only need to assign 1 point of damage to t... | 1 | Combat & Basic Keywords | Ah, the intricacies of combat mechanics! Let u... | 5 | CORRECT | The bot accurately explains that only 1 damage... |
| 5 | 6 | Easy | Keywords | Can a creature with [[Haste]] activate an abil... | Yes. Haste removes the restriction that preven... | 1 | Combat & Basic Keywords | Ah, the intricacies of haste and activated abi... | 1 | INCORRECT | The bot does not provide a clear answer to the... |
| 6 | 7 | Medium | Turn Structure | Can I cast a creature spell during my opponent... | No. Without Flash, creatures can only be cast ... | 1 | Combat & Basic Keywords | Ah, the nuances of timing and spellcasting! Le... | 5 | CORRECT | The bot accurately explains that a creature sp... |
| 7 | 8 | Easy | Keywords | If my creature has [[First Strike]] and my opp... | No. Your creature deals damage in the first st... | 1 | Combat & Basic Keywords | Ah, a query that delves into the intricacies o... | 5 | CORRECT | The bot accurately explains that the creature ... |
| 8 | 9 | Medium | Keywords | If a creature has [[Indestructible]], does it ... | Yes. Indestructible only prevents destruction ... | 1 | Combat & Basic Keywords | Ah, the intricacies of creature interactions a... | 0 | INCORRECT | The bot's response incorrectly states that a c... |
| 9 | 10 | Easy | Keywords | If I have [[Double Strike]], do I get two trig... | Yes. The creature deals damage in two separate... | 1 | Combat & Basic Keywords | Ah, a query regarding the nuances of combat da... | 5 | CORRECT | The bot accurately explains that a creature wi... |
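
The per-question grades above can be rolled up into summary metrics. The sketch below assumes pandas is available and re-enters the `difficulty`, `score`, and `verdict` columns by hand from the 10 rows shown, so it runs on its own; in the notebook the same expressions could presumably be applied to `df_test_eval` directly. Rows with an `ERROR` verdict, which `run_evaluation` can produce, would need to be filtered out first.

```python
import pandas as pd

# difficulty, score and verdict copied from the 10-row evaluation output above.
df_eval = pd.DataFrame({
    "difficulty": ["Easy", "Easy", "Easy", "Easy", "Medium",
                   "Easy", "Medium", "Easy", "Medium", "Easy"],
    "score": [5, 5, 0, 2, 5, 1, 5, 5, 0, 5],
    "verdict": ["CORRECT", "CORRECT", "INCORRECT", "INCORRECT", "CORRECT",
                "INCORRECT", "CORRECT", "CORRECT", "INCORRECT", "CORRECT"],
})

mean_score = df_eval["score"].mean()                     # average grade on the 0-5 scale
accuracy = (df_eval["verdict"] == "CORRECT").mean()      # share of CORRECT verdicts
by_difficulty = df_eval.groupby("difficulty")["score"].agg(["mean", "count"])

print(f"mean score: {mean_score:.2f}, accuracy: {accuracy:.0%}")
print(by_difficulty)
```

On these 10 rows the mean score is 3.3 and 6 of 10 answers are graded CORRECT; a sample this small is only a smoke test of the pipeline, not a measurement of the bot.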


In [31]:
print(df_test_eval.loc[2, 'question'])
print(df_test_eval.loc[2, 'ground_truth'])
print(df_test_eval.loc[2, 'model_answer'])
print(df_test_eval.loc[2, 'reasoning'])

If my creature has [[Vigilance]], does it still tap to activate an ability with the {T} symbol after it has attacked?
Yes. Vigilance means the creature does not tap to attack. Since it remains untapped after the declare attackers step, it is available to be tapped for other abilities later.
Ah, the nuances of vigilance and tapping mechanics! Let us delve into the intricacies of this interaction.

### 1. The Interaction
The key card in question is **Vigilance**, an enchantment that grants the enchanted creature the ability to attack without tapping. The query revolves around whether a creature with vigilance can still tap to activate an ability that requires the {T} symbol after it has attacked.

### 2. The Logic (Step-by-Step)
- **Vigilance Effect**: When a creature is enchanted with Vigilance, it does not tap when it attacks. This means that during the Declare Attackers step, the creature remains untapped despite being declared as an attacker.
- **Activated Abilities**: According to R

In [30]:
print(df_test_eval.loc[5, 'question'])
print(df_test_eval.loc[5, 'ground_truth'])
print(df_test_eval.loc[5, 'model_answer'])

Can a creature with [[Haste]] activate an ability with a {T} symbol the turn it comes into play?
Yes. Haste removes the restriction that prevents a creature from attacking or using tap abilities until you have controlled it since the beginning of your most recent turn.
Ah, the intricacies of haste and activated abilities! Let us delve into this fascinating interaction.

### 1. The Interaction
We are examining the relationship between the keyword ability Haste and the activation of abilities that require tapping a creature, denoted by the {T} symbol.

### 2. The Logic (Step-by-Step)
- **Understanding Haste**: Haste is a static ability that allows a creature to ignore the "summoning sickness" rule, which typically prevents a creature from attacking or activating abilities that require tapping or untapping unless it has been under its controller's control since the beginning of their most recent turn (Rule 302.6).
- **The Role of Haste**: When a creature gains haste, it can attack and act