# TSG-Bench Tutorial: Understanding Temporal-Spatial Scene Graph Evaluation

This notebook provides a comprehensive walkthrough of the TSG-Bench repository, demonstrating:
- How scene graphs are structured
- The 4 evaluation tasks
- How to use OpenAI models for evaluation
- Where evaluation code lives and how it works
- Running hands-on examples

**Repository:** TSG-Bench - First unified benchmark for scene graph understanding and generation

**Website:** https://tsg-bench.netlify.app/

## 1. Setup and Imports

First, let's import necessary libraries and add the repo to our path.

In [1]:
import json
import sys
from pathlib import Path
from pprint import pprint

# Add repo to path
repo_path = Path.cwd()
if str(repo_path) not in sys.path:
    sys.path.insert(0, str(repo_path))

print(f"Working directory: {repo_path}")
print(f"Python path updated: {repo_path in [Path(p) for p in sys.path]}")

Working directory: /home/jtu9/sgg/tsg-bench
Python path updated: True


## 2. Repository Overview

### What is TSG-Bench?

TSG-Bench evaluates Large Language Models on their ability to:
1. **Generate** scene graphs from textual descriptions
2. **Understand** scene graphs and reason over them

### The 4 Evaluation Tasks:

| Task | Type | Description | Dataset Size | Metrics |
|------|------|-------------|--------------|----------|
| **SA-SGG** | Generation | Single-action scene graph generation | 1,188 samples | P/R/F1 |
| **MA-SGG** | Generation | Multi-action scene graph generation | 853 samples | P/R/F1 |
| **SGQA** | Understanding | Scene graph question answering | 100 samples (500 QA pairs) | Accuracy |
| **SGDS** | Understanding | Scene graph description selection | 250 samples | Accuracy |

### Repository Structure:

```
tsg-bench/
├── evaluation/          # 4 task evaluation scripts
│   ├── generation/      # sa-sgg.py, ma-sgg.py
│   └── understanding/   # sgqa.py, sgds.py
├── models/models.py     # LLM wrappers (OpenAI, Anthropic, etc.)
├── resource/
│   ├── dataset/         # JSONL datasets for all tasks
│   └── prompts/         # Task-specific prompt templates
└── utils/               # Config and path utilities
```

## 3. Understanding Scene Graphs

### What is a Scene Graph?

A scene graph represents actions and spatial relationships as a graph of **triplets**:
```
[source_node, edge_type, target_node]
```

### Node Types:
- **person** (always the starting node)
- **Actions:** pick-up, place, tighten, drill, etc.
- **Objects:** screwdriver, bowl, wood, clamp, etc.
- **Hands:** hand1 (left), hand2 (right)

### Edge Types:
- **verb:** connects person → action
- **dobj:** direct object (action → object)
- **Prepositions:** with, from, into, on, under, etc.

### Example:
**Text:** "Pick up the screwdriver with your left hand"

**Scene Graph:**
```python
[
    ["person", "verb", "pick-up"],
    ["pick-up", "dobj", "screwdriver"],
    ["pick-up", "with", "hand1"]
]
```
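
To make the triplet structure concrete, the sketch below groups triplets by their source node so the graph can be walked from `person` outward. `index_by_source` is a hypothetical helper written for this tutorial, not part of the TSG-Bench code.

```python
from collections import defaultdict


def index_by_source(triplets):
    """Group (edge, target) pairs under their source node (hypothetical helper)."""
    graph = defaultdict(list)
    for source, edge, target in triplets:
        graph[source].append((edge, target))
    return dict(graph)


scene_graph = [
    ["person", "verb", "pick-up"],
    ["pick-up", "dobj", "screwdriver"],
    ["pick-up", "with", "hand1"],
]

indexed = index_by_source(scene_graph)
action = indexed["person"][0][1]  # follow the verb edge out of the person node
print(action)                     # pick-up
print(indexed[action])            # [('dobj', 'screwdriver'), ('with', 'hand1')]
```

Walking from `person` through the `verb` edge gives the action, and the action node then carries the object and hand edges.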

## 4. Loading and Exploring Datasets

Let's load examples from each task dataset to see their structure.

### 4.1 SA-SGG Dataset (Single-Action Scene Graph Generation)

In [2]:
# Load SA-SGG dataset
sa_sgg_path = repo_path / "resource" / "dataset" / "generation" / "sa-sgg.jsonl"

with open(sa_sgg_path, 'r') as f:
    sa_sgg_samples = [json.loads(line) for line in f]

print(f"Total SA-SGG samples: {len(sa_sgg_samples)}")
print("\n" + "="*80)
print("Example SA-SGG Sample:")
print("="*80)

example = sa_sgg_samples[0]
print(f"\nData ID: {example['data_id']}")
print(f"\nContext: {example['context'][:200]}...")
print(f"\nTarget Sentence: {example['target_sentence']}")
print(f"\nGround Truth Scene Graph:")
pprint(example['graphs'][0]['triplets'])
print(f"\nAvailable Objects: {example['mandatory_space']['object'][:10]}...")
print(f"Available Verbs: {example['mandatory_space']['verb'][:10]}...")
print(f"Available Relationships: {example['mandatory_space']['relationship'][:10]}...")

Total SA-SGG samples: 1188

Example SA-SGG Sample:

Data ID: e9be1118-a5cf-4431-b2e8-e3edcfa9f949

Context: The task began by preparing the necessary tools, picking up the screwdriver and the screw. The screw was then positioned and secured into the wood using the screwdriver. Once the screw was firmly in p...

Target Sentence: The fastener was then secured tightly with both hands to ensure stability.

Ground Truth Scene Graph:
[['person', 'verb', 'tighten'],
 ['tighten', 'dobj', 'clamp'],
 ['tighten', 'with', 'hand1'],
 ['tighten', 'with', 'hand2']]

Available Objects: ['person', 'clamp', 'hand1', 'hand2', 'screwdriver', 'wood', 'wood-piece', 'wood-pieces', 'workbench']...
Available Verbs: ['align', 'drill', 'hold', 'loosen', 'pick-up', 'place', 'position', 'put-down', 'release', 'remove']...
Available Relationships: ['dobj', 'from', 'into', 'on', 'with', 'verb', 'with']...
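
The `mandatory_space` field appears to list the vocabulary a graph may draw from. As a rough sanity check, the sketch below flags any triplet element that falls outside that vocabulary. It assumes that nodes come from `object` plus `verb` and edges from `relationship`; `out_of_vocabulary` and `toy_sample` are hypothetical and hand-written, mirroring only the field names printed above.

```python
def out_of_vocabulary(triplets, mandatory_space):
    """Return (kind, value) pairs for elements missing from mandatory_space."""
    # Assumption: valid nodes are the union of the object and verb lists.
    nodes = set(mandatory_space["object"]) | set(mandatory_space["verb"])
    edges = set(mandatory_space["relationship"])
    problems = []
    for source, edge, target in triplets:
        if source not in nodes:
            problems.append(("node", source))
        if edge not in edges:
            problems.append(("edge", edge))
        if target not in nodes:
            problems.append(("node", target))
    return problems


# Hand-written sample for illustration; not taken from the dataset file.
toy_sample = {
    "graphs": [{"triplets": [
        ["person", "verb", "tighten"],
        ["tighten", "dobj", "clamp"],
        ["tighten", "with", "hand1"],
    ]}],
    "mandatory_space": {
        "object": ["person", "clamp", "hand1", "hand2"],
        "verb": ["tighten", "pick-up"],
        "relationship": ["verb", "dobj", "with"],
    },
}

space = toy_sample["mandatory_space"]
print(out_of_vocabulary(toy_sample["graphs"][0]["triplets"], space))  # []
print(out_of_vocabulary([["person", "verb", "glue"]], space))         # [('node', 'glue')]
```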


### 4.2 MA-SGG Dataset (Multi-Action Scene Graph Generation)

In [3]:
# Load MA-SGG dataset
ma_sgg_path = repo_path / "resource" / "dataset" / "generation" / "ma-sgg.jsonl"

with open(ma_sgg_path, 'r') as f:
    ma_sgg_samples = [json.loads(line) for line in f]

print(f"Total MA-SGG samples: {len(ma_sgg_samples)}")
print("\n" + "="*80)
print("Example MA-SGG Sample (Multiple Actions):")
print("="*80)

example = ma_sgg_samples[0]
print(f"\nTarget Sentence: {example['target_sentence']}")
print(f"\nNumber of Scene Graphs: {len(example['graphs'])}")
for i, graph in enumerate(example['graphs']):
    print(f"\nScene Graph {i+1} (Action {graph['action_id']}):")
    pprint(graph['triplets'])

Total MA-SGG samples: 853

Example MA-SGG Sample (Multiple Actions):

Target Sentence: The task began by preparing the necessary tools, picking up the screwdriver and the screw.

Number of Scene Graphs: 2

Scene Graph 1 (Action 14):
[['person', 'verb', 'pick-up'],
 ['pick-up', 'dobj', 'screwdriver'],
 ['pick-up', 'with', 'hand1']]

Scene Graph 2 (Action 13):
[['person', 'verb', 'pick-up'],
 ['pick-up', 'dobj', 'screw'],
 ['pick-up', 'with', 'hand2']]


### 4.3 SGQA Dataset (Scene Graph Question Answering)

In [4]:
# Load SGQA dataset
sgqa_path = repo_path / "resource" / "dataset" / "understanding" / "sgqa.jsonl"

with open(sgqa_path, 'r') as f:
    sgqa_samples = [json.loads(line) for line in f]

print(f"Total SGQA samples: {len(sgqa_samples)}")

# Count total QA pairs
total_qa = sum(len(sample['qa_pairs']) for sample in sgqa_samples)
print(f"Total QA pairs: {total_qa}")

print("\n" + "="*80)
print("Example SGQA Sample:")
print("="*80)

example = sgqa_samples[0]
print(f"\nContext Scene Graphs ({len(example['context_graphs'])} graphs):")
for i, graph in enumerate(example['context_graphs'][:2]):
    print(f"\nGraph {i+1}:")
    pprint(graph)

print(f"\nQuestion-Answer Pairs ({len(example['qa_pairs'])} pairs):")
for i, qa in enumerate(example['qa_pairs'][:3]):
    print(f"\nQ{i+1}: {qa['Q']}")
    print(f"A{i+1}: {qa['A']}")

Total SGQA samples: 100
Total QA pairs: 500

Example SGQA Sample:

Context Scene Graphs (11 graphs):

Graph 1:
[['pick-up', 'with', 'hand1'],
 ['pick-up', 'with', 'hand2'],
 ['mop-stick', 'from', 'floor'],
 ['person', 'verb', 'pick-up'],
 ['pick-up', 'dobj', 'mop-stick']]

Graph 2:
[['sweep', 'with', 'hand1'],
 ['sweep', 'with', 'hand2'],
 ['sweep', 'with', 'mop-stick'],
 ['sweep', 'dobj', 'floor'],
 ['sweep', 'in', 'car'],
 ['person', 'verb', 'sweep']]

Question-Answer Pairs (5 pairs):

Q1: What object was picked up before sweeping the floor?
A1: mop-stick

Q2: Which location did the person interact with after using the cloth?
A2: wall

Q3: What object was handled immediately after opening the cabinet?
A3: cloth


### 4.4 SGDS Dataset (Scene Graph Description Selection)

In [5]:
# Load SGDS dataset
sgds_path = repo_path / "resource" / "dataset" / "understanding" / "sgds.jsonl"

with open(sgds_path, 'r') as f:
    sgds_samples = [json.loads(line) for line in f]

print(f"Total SGDS samples: {len(sgds_samples)}")
print("\n" + "="*80)
print("Example SGDS Sample:")
print("="*80)

example = sgds_samples[0]
print(f"\nTarget Scene Graph (triplets):")
pprint(example['triplet'])

print(f"\nCorrect Answer Position: {example['position']} (0-indexed)")
print(f"\nCandidate Descriptions (5 options):")
for i, desc in enumerate(example['variations']):
    marker = "<-- CORRECT" if i == example['position'] else ""
    print(f"[{chr(65+i)}] {desc} {marker}")

Total SGDS samples: 250

Example SGDS Sample:

Target Scene Graph (triplets):
[['roll', 'dobj', 'dough'],
 ['roll', 'with', 'roller'],
 ['roll', 'with', 'hand1'],
 ['roll', 'with', 'hand2'],
 ['person', 'verb', 'roll']]

Correct Answer Position: 1 (0-indexed)

Candidate Descriptions (5 options):
[A] This step was skipped to avoid achieving the desired consistency. 
[B] This step was repeated to achieve the desired consistency. <-- CORRECT
[C] This step was repeated to disrupt the desired consistency. 
[D] This step was altered to prevent the desired consistency. 
[E] This step was ignored to ensure the undesired consistency. 


## 5. Model Setup - OpenAI

Let's configure and test the OpenAI model wrapper.

In [6]:
# Import model classes
from models.models import GPT4o, GPT4oMini

# Check if config exists
config_path = repo_path / "conf.d" / "config.yaml"
if config_path.exists():
    print("✓ Config file found at conf.d/config.yaml")
    print("\nNote: Make sure your OpenAI API key is configured in this file:")
    print("""\nopenai:
  key: your_api_key_here""")
else:
    print("⚠ Config file not found!")
    print("Please create conf.d/config.yaml with your OpenAI API key")

✓ Config file found at conf.d/config.yaml

Note: Make sure your OpenAI API key is configured in this file:

openai:
  key: your_api_key_here


In [7]:
# Initialize models
try:
    gpt4o = GPT4o()
    gpt4o_mini = GPT4oMini()
except Exception as e:
    print(f"✗ Error initializing models: {e}")
    print("Make sure your OpenAI API key is configured in conf.d/config.yaml")
else:
    print("✓ OpenAI models initialized successfully")
    # GPT4o exposes no `model_name` attribute, so report the wrapper class names instead.
    print(f"  - GPT-4o wrapper: {type(gpt4o).__name__}")
    print(f"  - GPT-4o-mini wrapper: {type(gpt4o_mini).__name__}")

✓ OpenAI models initialized successfully
  - GPT-4o wrapper: GPT4o
  - GPT-4o-mini wrapper: GPT4oMini


In [8]:
# Test model invocation
test_prompt = "What is a scene graph? Answer in one sentence."

try:
    response = gpt4o_mini.invoke(test_prompt)
    print("✓ Model invocation successful!")
    print(f"\nPrompt: {test_prompt}")
    print(f"\nResponse: {response}")
except Exception as e:
    print(f"✗ Error invoking model: {e}")

✓ Model invocation successful!

Prompt: What is a scene graph? Answer in one sentence.

Response: A scene graph is a data structure that represents the spatial and hierarchical relationships between objects in a scene, often used in computer graphics and visualization to manage and render complex environments efficiently.


## 6. Task Deep Dive: SA-SGG (Single-Action Scene Graph Generation)

### Where is the evaluation code?
**Location:** `evaluation/generation/sa-sgg.py`

### Key Classes:
1. **GraphGeneration** - Formats prompts and invokes model
2. **GraphScorer** - Parses responses and calculates metrics
3. **GraphEvaluator** - Orchestrates the evaluation pipeline

### Method:
1. Load sample with context, target sentence, available nodes/edges
2. Format prompt using template from `resource/prompts/sa-sgg.txt`
3. Invoke model to generate scene graph triplets
4. Parse response (format: `node -> edge -> node`)
5. Calculate Precision, Recall, F1 vs ground truth
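
Step 4 can be pictured with a small parser for the `node -> edge -> node` format. This is a minimal sketch, not the repository's `GraphScorer.parse_response`; `parse_triplets` is a hypothetical name, and skipping malformed lines is an assumption.

```python
def parse_triplets(response):
    """Parse 'node -> edge -> node' lines into [node, edge, node] lists."""
    triplets = []
    for line in response.splitlines():
        parts = [part.strip() for part in line.split("->")]
        # Keep only well-formed lines: exactly three non-empty fields.
        if len(parts) == 3 and all(parts):
            triplets.append(parts)
    return triplets


# Trailing spaces are deliberate: model output often carries them.
response = "person -> verb -> put  \nrag -> dobj -> floor  \nrag -> from -> floor"
parsed = parse_triplets(response)
print(parsed)
# [['person', 'verb', 'put'], ['rag', 'dobj', 'floor'], ['rag', 'from', 'floor']]
```

Hyphenated node names such as `pick-up` are safe here because the split is on the two-character arrow `->`, not on a bare hyphen.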

**Note:** The evaluation files have hyphens in their names (e.g., `sa-sgg.py`), so we use `importlib` to import them (Python doesn't allow hyphens in standard import statements).

Let's see how this works:

In [9]:
# Import SA-SGG evaluation classes using importlib
# (Files have hyphens in names, so we can't use standard import)
import importlib.util

spec = importlib.util.spec_from_file_location(
    "sa_sgg",
    str(repo_path / "evaluation" / "generation" / "sa-sgg.py")
)
sa_sgg_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(sa_sgg_module)

# Now we can access the classes
GraphGeneration = sa_sgg_module.GraphGeneration
GraphScorer = sa_sgg_module.GraphScorer

print("✓ SA-SGG module imported successfully")

# Load prompt template
prompt_path = repo_path / "resource" / "prompts" / "sa-sgg.txt"
with open(prompt_path, 'r') as f:
    sa_sgg_prompt_template = f.read()

print("\nSA-SGG Prompt Template (first 500 chars):")
print("="*80)
print(sa_sgg_prompt_template[:500] + "...")
print("\n[Full prompt is {} characters]".format(len(sa_sgg_prompt_template)))

✓ SA-SGG module imported successfully

SA-SGG Prompt Template (first 500 chars):
You are an AI model tasked with generating a scene graph based on a given sentence, adhering to specific rules for the graph, nodes, and edges, while considering the provided context, available nodes, and available edges.

Rules for Scene Graph Representation:
1. A graph is composed of one or more triplets of nodes and edges.
2. A triplet starts with a node and another node is connected by an edge. (Format: node -> edge -> node)
3. Each triplet is split with a new line.
4. There must be a triple...

[Full prompt is 2451 characters]


In [10]:
# Example: Generate scene graph for one sample
sample = sa_sgg_samples[10]  # Pick an interesting example

print("Input Sample:")
print("="*80)
print(f"Target Sentence: {sample['target_sentence']}")
print(f"\nGround Truth Triplets:")
pprint(sample['graphs'][0]['triplets'])

# Prepare inputs
context = sample['context']
target_sentence = sample['target_sentence']
available_nodes = ", ".join(sample['mandatory_space']['object'])
available_edges = ", ".join(sample['mandatory_space']['relationship'])
verbs = sample['mandatory_space']['verb']

print("\n" + "="*80)
print("Inputs prepared for model:")
print("="*80)
print(f"Context: {context[:200]}...")
print(f"Available nodes: {available_nodes[:100]}...")
print(f"Available edges: {available_edges}")
print(f"Available verbs: {verbs[:10]}...")

Input Sample:
Target Sentence: Finally, the rag was placed back on the floor.

Ground Truth Triplets:
[['put-down', 'with', 'hand1'],
 ['put-down', 'dobj', 'rag'],
 ['put-down', 'on', 'floor'],
 ['person', 'verb', 'put-down']]

Inputs prepared for model:
Context: The book collection was organized with both hands to ensure everything was in its proper place. A book was relocated from the floor to the bookshelf with both hands, ensuring it was positioned correct...
Available nodes: person, book, bookshelf, brush, dust, floor, hand1, hand2, rag, shelf...
Available edges: dobj, from, on, verb, with
Available verbs: ['align', 'arrange', 'check', 'hold', 'pick', 'pick-up', 'put', 'put-down', 'remove', 'take']...


In [11]:
# Invoke model
try:
    graph_gen = GraphGeneration(gpt4o_mini)
    response = graph_gen.invoke(
        context=context,
        target_sentence=target_sentence,
        available_nodes=available_nodes,
        available_edges=available_edges,
        verbs=verbs
    )
    
    print("\n" + "="*80)
    print("Model Response:")
    print("="*80)
    print(response)
    
    # Parse and score
    scorer = GraphScorer()
    parsed_scene_graphs = scorer.parse_response(response)
    
    print("\n" + "="*80)
    print("Parsed Scene Graphs:")
    print("="*80)
    if parsed_scene_graphs:
        predicted_triplets = parsed_scene_graphs[0]  # Take first scene graph
        pprint(predicted_triplets)
        
        # Calculate metrics
        metrics = scorer.calculate_scores(
            sample['graphs'][0]['triplets'],
            predicted_triplets
        )
        
        print("\n" + "="*80)
        print("Evaluation Metrics:")
        print("="*80)
        print(f"Precision: {metrics['precision']:.3f}")
        print(f"Recall: {metrics['recall']:.3f}")
        print(f"F1 Score: {metrics['f1']:.3f}")
    else:
        print("No valid scene graph parsed from response")
    
except Exception as e:
    print(f"Error during generation: {e}")
    import traceback
    traceback.print_exc()


Model Response:
person -> verb -> put  
rag -> dobj -> floor  
rag -> from -> floor

Parsed Scene Graphs:
[['person', 'verb', 'put'], ['rag', 'dobj', 'floor'], ['rag', 'from', 'floor']]

Evaluation Metrics:
Precision: 0.000
Recall: 0.000
F1 Score: 0.000


## 7. Task Deep Dive: MA-SGG (Multi-Action Scene Graph Generation)

### Where is the evaluation code?
**Location:** `evaluation/generation/ma-sgg.py`

### Key Difference from SA-SGG:
- Model must generate **exactly N scene graphs** (one per action)
- Requires implicit action segmentation
- Graphs separated by blank lines in output

### Method:
Same as SA-SGG but with multiple graphs per sample.
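
Because the graphs are separated by blank lines, a multi-action response can be split into blocks before each block is parsed. The sketch below shows one way to do that and to check the graph count. `split_graphs` is a hypothetical helper, and the repository's scorer may handle separators or count mismatches differently.

```python
import re


def split_graphs(response):
    """Split a response on blank lines and parse each block into triplets."""
    graphs = []
    for block in re.split(r"\n\s*\n", response.strip()):
        triplets = [
            [part.strip() for part in line.split("->")]
            for line in block.splitlines()
            if line.count("->") == 2  # keep only 'node -> edge -> node' lines
        ]
        if triplets:
            graphs.append(triplets)
    return graphs


response = (
    "person -> verb -> pick-up\n"
    "pick-up -> dobj -> screwdriver\n"
    "pick-up -> with -> hand1\n"
    "\n"
    "person -> verb -> pick-up\n"
    "pick-up -> dobj -> screw\n"
    "pick-up -> with -> hand2\n"
)

graphs = split_graphs(response)
expected_count = 2  # number of ground-truth graphs for this sentence
print(len(graphs) == expected_count)  # True
print(graphs[1][1])                   # ['pick-up', 'dobj', 'screw']
```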

**Note:** Using `importlib` to import the hyphenated file name.

In [12]:
# Import MA-SGG evaluation classes using importlib
spec_ma = importlib.util.spec_from_file_location(
    "ma_sgg",
    str(repo_path / "evaluation" / "generation" / "ma-sgg.py")
)
ma_sgg_module = importlib.util.module_from_spec(spec_ma)
spec_ma.loader.exec_module(ma_sgg_module)

MAGraphGeneration = ma_sgg_module.GraphGeneration
MAGraphScorer = ma_sgg_module.GraphScorer

print("✓ MA-SGG module imported successfully")

# Load prompt template
ma_prompt_path = repo_path / "resource" / "prompts" / "ma-sgg.txt"
with open(ma_prompt_path, 'r') as f:
    ma_sgg_prompt_template = f.read()

print("\nMA-SGG Prompt Template (first 500 chars):")
print("="*80)
print(ma_sgg_prompt_template[:500] + "...")

# Example with multiple actions
sample = ma_sgg_samples[5]
print("\n" + "="*80)
print("Input Sample:")
print("="*80)
print(f"Target Sentence: {sample['target_sentence']}")
print(f"\nNumber of Actions: {len(sample['graphs'])}")
print(f"\nGround Truth Scene Graphs:")
for i, graph in enumerate(sample['graphs']):
    print(f"\nGraph {i+1}:")
    pprint(graph['triplets'])

✓ MA-SGG module imported successfully

MA-SGG Prompt Template (first 500 chars):
You are an AI model tasked with generating scene graphs based on a given sentence. Your goal is to create exactly the specified number of scene graphs by extracting meaningful relationships between entities, actions, and objects while ensuring that the scene graphs represent actions that would visually appear in a scene.

Rules for Generating Multiple Scene Graphs:
1. Generate precisely {num_scene_graphs} scene graphs—no more, no less.
2. Each scene graph must depict an action that would be expl...

Input Sample:
Target Sentence: The wood piece was carefully adjusted using both hands.

Number of Actions: 2

Ground Truth Scene Graphs:

Graph 1:
[['person', 'verb', 'pick-up'],
 ['pick-up', 'dobj', 'wood-piece'],
 ['pick-up', 'with', 'hand1'],
 ['pick-up', 'with', 'hand2']]

Graph 2:
[['person', 'verb', 'align'],
 ['align', 'dobj', 'wood-piece'],
 ['align', 'with', 'hand1'],
 ['align', 'with', 'hand2']]


## 8. Task Deep Dive: SGQA (Scene Graph Question Answering)

### Where is the evaluation code?
**Location:** `evaluation/understanding/sgqa.py`

### Key Classes:
1. **QA** - Prompts model with scene graphs + question
2. **QADataLoader** - Loads and structures QA pairs
3. **QAEvaluator** - Processes questions and calculates accuracy

### Method:
1. Load QA pairs with context scene graphs
2. Format prompt with scene graphs + question
3. Extract answer using regex `\[(.*?)\]`
4. Compare with ground truth (case-insensitive exact match)
5. Calculate accuracy
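
Steps 3 and 4 can be sketched in a few lines. The regex is the one named above; stripping whitespace inside the brackets and treating a missing bracket as a wrong answer are assumptions, and `extract_answer` and `is_correct` are hypothetical names rather than functions from `sgqa.py`.

```python
import re


def extract_answer(response):
    """Return the text inside the first [...] pair, or None if there is none."""
    match = re.search(r"\[(.*?)\]", response)
    return match.group(1).strip() if match else None


def is_correct(response, ground_truth):
    """Case-insensitive exact match between extracted answer and ground truth."""
    answer = extract_answer(response)
    return answer is not None and answer.lower() == ground_truth.lower()


print(is_correct("[Tap]", "tap"))    # True
print(is_correct("[onion]", "tap"))  # False
print(is_correct("tap", "tap"))      # False: no brackets to extract from
```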

**Note:** Using `importlib` to import the module.

In [13]:
# Import SGQA evaluation classes using importlib
spec_sgqa = importlib.util.spec_from_file_location(
    "sgqa",
    str(repo_path / "evaluation" / "understanding" / "sgqa.py")
)
sgqa_module = importlib.util.module_from_spec(spec_sgqa)
spec_sgqa.loader.exec_module(sgqa_module)

QA = sgqa_module.QA

print("✓ SGQA module imported successfully")

# Load prompt template
sgqa_prompt_path = repo_path / "resource" / "prompts" / "sgqa.txt"
with open(sgqa_prompt_path, 'r') as f:
    sgqa_prompt_template = f.read()

print("\nSGQA Prompt Template:")
print("="*80)
print(sgqa_prompt_template)

# Example QA
sample = sgqa_samples[2]
qa_pair = sample['qa_pairs'][0]

print("\n" + "="*80)
print("Input:")
print("="*80)
print(f"Context Graphs: {len(sample['context_graphs'])} scene graphs")
for i, graph in enumerate(sample['context_graphs'][:3]):
    print(f"\nGraph {i+1}: {graph}")
print(f"\nQuestion: {qa_pair['Q']}")
print(f"Ground Truth Answer: {qa_pair['A']}")

✓ SGQA module imported successfully

SGQA Prompt Template:
You are a highly advanced language model specialized in answering questions based on a given scene graph and question. Your task is to analyze the scene graph and provide the correct answer in a single word. Your output must strictly follow the format [answer], and nothing else should be printed. Ensure that your answer is concise, accurate, and matches the format exactly.

Scene Graph: {scene_graph}
Question: {question}


Input:
Context Graphs: 19 scene graphs

Graph 1: [['turn-on', 'dobj', 'tap'], ['turn-on', 'with', 'hand1'], ['person', 'verb', 'turn-on']]

Graph 2: [['wash', 'dobj', 'onion'], ['onion', 'in', 'hand2'], ['wash', 'under', 'water'], ['person', 'verb', 'wash']]

Graph 3: [['turn-off', 'dobj', 'tap'], ['turn-off', 'with', 'hand1'], ['person', 'verb', 'turn-off']]

Question: What object was first manipulated before water flow began?
Ground Truth Answer: tap


In [14]:
# Test SGQA
try:
    qa = QA(gpt4o_mini)
    
    # Format scene graphs as string
    scene_graph_str = str(sample['context_graphs'])
    
    print("Full Input:")
    print("="*80)
    print(f"Scene Graphs: {scene_graph_str[:500]}...")
    print(f"\nQuestion: {qa_pair['Q']}")
    
    # Invoke model
    predicted_answer = qa.invoke(
        scene_graph=scene_graph_str,
        question=qa_pair['Q']
    )
    
    print("\n" + "="*80)
    print("Model Response:")
    print("="*80)
    print(f"Predicted Answer: {predicted_answer}")
    
    print("\n" + "="*80)
    print("Evaluation:")
    print("="*80)
    print(f"Predicted: {predicted_answer}")
    print(f"Ground Truth: {qa_pair['A']}")
    print(f"Correct: {predicted_answer.lower() == qa_pair['A'].lower()}")
    
except Exception as e:
    print(f"Error during QA: {e}")
    import traceback
    traceback.print_exc()

Full Input:
Scene Graphs: [[['turn-on', 'dobj', 'tap'], ['turn-on', 'with', 'hand1'], ['person', 'verb', 'turn-on']], [['wash', 'dobj', 'onion'], ['onion', 'in', 'hand2'], ['wash', 'under', 'water'], ['person', 'verb', 'wash']], [['turn-off', 'dobj', 'tap'], ['turn-off', 'with', 'hand1'], ['person', 'verb', 'turn-off']], [['pick-up', 'dobj', 'pot'], ['pick-up', 'with', 'hand1'], ['person', 'verb', 'pick-up']], [['transfer', 'dobj', 'onion'], ['onion', 'from', 'board'], ['transfer', 'into', 'pot'], ['transfer', 'with', 'h...

Question: What object was first manipulated before water flow began?

Model Response:
Predicted Answer: onion

Evaluation:
Predicted: onion
Ground Truth: tap
Correct: False


## 9. Task Deep Dive: SGDS (Scene Graph Description Selection)

### Where is the evaluation code?
**Location:** `evaluation/understanding/sgds.py`

### Key Classes:
1. **SceneGraphToText** - Prompts model with scene graph + candidates
2. **SceneGraphEvaluator** - Evaluates across dataset

### Method:
1. Load samples with target scene graph + 5 candidate descriptions
2. Model selects best matching description (A-E)
3. Parse answer letter, convert to index
4. Compare with ground truth position
5. Calculate accuracy

**Note:** Using `importlib` to import the module.

In [15]:
# Import SGDS evaluation classes using importlib
spec_sgds = importlib.util.spec_from_file_location(
    "sgds",
    str(repo_path / "evaluation" / "understanding" / "sgds.py")
)
sgds_module = importlib.util.module_from_spec(spec_sgds)
spec_sgds.loader.exec_module(sgds_module)

SceneGraphToText = sgds_module.SceneGraphToText

print("✓ SGDS module imported successfully")

# Load prompt template
sgds_prompt_path = repo_path / "resource" / "prompts" / "sgds.txt"
with open(sgds_prompt_path, 'r') as f:
    sgds_prompt_template = f.read()

print("\nSGDS Prompt Template:")
print("="*80)
print(sgds_prompt_template)

# Example
sample = sgds_samples[5]
print("\n" + "="*80)
print("Input:")
print("="*80)
print(f"Target Scene Graph:")
pprint(sample['triplet'])
print(f"\nCandidate Descriptions:")
for i, desc in enumerate(sample['variations']):
    marker = "<-- CORRECT" if i == sample['position'] else ""
    print(f"[{chr(65+i)}] {desc} {marker}")

✓ SGDS module imported successfully

SGDS Prompt Template:
You are an AI that analyzes a Scene Graph based on the context and select the best text description of it among the given candidates.

1. Input:
   - Context: A list of scene graphs representing the preceding context.
     - Each graph is composed of a set of triplets `[node1, edge, node2]`. `node1` and `node2` are one of person, action, object and hand. `edge` represents the relationship between them (e.g., `verb`, `dobj`, `from`, `with`).
   - Target Scene Graph: A set of triplets that should be described into text correctly.
   - Description Candidates: Candidates of sentence descriptions of the Target Scene Graph based on the Context.

2. Task:
   - Determine which description best matches the Target Scene Graph.

3. Output:  
   - Be sure to choose only one letter of the matching description.  
   - Do not output any additional text or explanation. Only the letter in [ ] (e.g., [A]).

Key rules of edges in a triplet:
   

In [16]:
# Test SGDS
import re  # needed for parsing the answer letter below
import traceback

try:
    sgds = SceneGraphToText(gpt4o_mini)

    # Format sentences (variations) as string
    sentences_str = "\n".join([f"[{chr(65+i)}] {desc}" for i, desc in enumerate(sample['variations'])])
    triplet_str = str(sample['triplet'])
    context_str = str(sample.get('context_graphs', []))

    print("Full Input:")
    print("="*80)
    print(f"Target Scene Graph: {triplet_str}")
    print(f"\nCandidate Descriptions:\n{sentences_str}")

    # Invoke model
    response = sgds.invoke(
        sentences=sentences_str,
        triplet=triplet_str,
        context_graphs=context_str
    )

    print("\n" + "="*80)
    print("Model Response:")
    print("="*80)
    print(response)

    # Parse answer: extract the letter in [ ] and convert it to a 0-based index
    match = re.search(r'\[([A-E])\]', response)
    predicted_letter = match.group(1) if match else None
    predicted_index = ord(predicted_letter) - ord('A') if predicted_letter else None

    print("\n" + "="*80)
    print("Evaluation:")
    print("="*80)
    print(f"Predicted: {predicted_letter} (index {predicted_index})")
    print(f"Ground Truth: {chr(65 + sample['position'])} (index {sample['position']})")
    print(f"Correct: {predicted_index == sample['position'] if predicted_index is not None else False}")

except Exception as e:
    print(f"Error during SGDS: {e}")
    traceback.print_exc()

Full Input:
Target Scene Graph: [['person', 'verb', 'apply'], ['apply', 'dobj', 'glue'], ['apply', 'with', 'brush'], ['apply', 'with', 'hand1'], ['apply', 'onto', 'bolt']]

Candidate Descriptions:
[A] Adhesive was then removed from the fastener using a scraper held in one hand.
[B] Paint was then applied onto the wall using a roller held in one hand.
[C] Glue was then spread onto the paper using a brush held in both hands.
[D] Adhesive was then wiped off the fastener using a cloth held in one hand.
[E] Adhesive was then used onto the fastener using an applicator held in one hand.

Model Response:
[E]


## 10. Running Full Evaluations

The evaluation scripts can be run directly from command line:

```bash
# SA-SGG evaluation
python evaluation/generation/sa-sgg.py

# MA-SGG evaluation
python evaluation/generation/ma-sgg.py

# SGQA evaluation
python evaluation/understanding/sgqa.py

# SGDS evaluation
python evaluation/understanding/sgds.py
```

Each script:
1. Loops through all configured models
2. Uses ThreadPoolExecutor for parallel API calls (5-15 workers)
3. Calculates metrics across entire dataset
4. Prints results to console
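
The parallel step can be sketched with the standard library alone. The example below uses an offline stand-in so it runs without an API key. `StubModel` and `run_parallel` are hypothetical, and the only thing assumed about the real wrappers is the `invoke(message)` method used earlier in this notebook.

```python
from concurrent.futures import ThreadPoolExecutor


class StubModel:
    """Offline stand-in for an LLM wrapper; only mimics invoke()."""

    def invoke(self, message):
        return f"[{message.upper()}]"


def run_parallel(model, prompts, max_workers=5):
    """Send prompts to the model concurrently and return responses in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # even when individual calls finish out of order.
        return list(pool.map(model.invoke, prompts))


responses = run_parallel(StubModel(), ["a", "b", "c"])
print(responses)  # ['[A]', '[B]', '[C]']
```

Threads suit this workload because each call spends most of its time waiting on the network. A higher `max_workers` shortens wall-clock time but makes it easier to hit API rate limits.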

### Evaluation Code Structure:

All evaluation scripts follow a similar pattern:
```python
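# Schematic outline only: class and method names vary per script.
class TaskGeneration:  # e.g. GraphGeneration, QA, SceneGraphToText
    def __init__(self, model): ...
    def prompt_format(self, **inputs): ...   # Format prompt with template
    def invoke(self, **inputs): ...          # Call model

class TaskScorer:  # e.g. GraphScorer
    def parse_response(self, response): ...  # Extract predictions
    def calculate_scores(self, ground_truth, predicted): ...  # Calculate metrics

class TaskEvaluator:  # e.g. GraphEvaluator, QAEvaluator
    def evaluate(self, data_path): ...       # Run full evaluation
    def _parallel_process(self): ...         # ThreadPoolExecutor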
```

## 11. Mini-Evaluation Example

Let's run a small evaluation on 5 samples from SA-SGG to see the complete pipeline in action.

In [17]:
# Mini SA-SGG evaluation
# GraphEvaluator.evaluate() opens a dataset path (it does not accept a list),
# so write the first 5 samples to a temporary JSONL file and pass that path.
import tempfile
import traceback

GraphEvaluator = sa_sgg_module.GraphEvaluator

# Take first 5 samples
mini_dataset = sa_sgg_samples[:5]

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for record in mini_dataset:
        tmp.write(json.dumps(record) + "\n")
mini_path = Path(tmp.name)

print(f"Running SA-SGG evaluation on {len(mini_dataset)} samples with GPT-4o-mini...")
print("="*80)

try:
    evaluator = GraphEvaluator(gpt4o_mini)
    results = evaluator.evaluate(str(mini_path))

    # The structure of the return value is not shown in this notebook,
    # so print whatever comes back rather than assuming specific keys.
    if results is not None:
        print("\nResults:")
        print("="*80)
        pprint(results)

except Exception as e:
    print(f"Error during evaluation: {e}")
    traceback.print_exc()
finally:
    mini_path.unlink(missing_ok=True)  # remove the temporary file

Running SA-SGG evaluation on 5 samples with GPT-4o-mini...


## 12. Understanding Evaluation Metrics

### Generation Tasks (SA-SGG, MA-SGG):

**Precision:** How many predicted triplets are correct?
```python
precision = correct_triplets / predicted_triplets
```

**Recall:** How many ground truth triplets were found?
```python
recall = correct_triplets / ground_truth_triplets
```

**F1 Score:** Harmonic mean of precision and recall
```python
f1 = 2 * (precision * recall) / (precision + recall)
```

**Macro-averaging:** Average metrics across all samples/graphs
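
The formulas above can be checked with a short worked example. This sketch assumes that a predicted triplet counts as correct only when all three elements match a ground-truth triplet exactly, which is consistent with the 0.000 scores in Section 6. `triplet_scores` is a hypothetical helper, and the repository's `calculate_scores` may differ in details such as duplicate handling.

```python
def triplet_scores(ground_truth, predicted):
    """Set-based precision, recall and F1 over exact triplet matches."""
    gt = {tuple(t) for t in ground_truth}
    pred = {tuple(t) for t in predicted}
    correct = len(gt & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gt) if gt else 0.0
    # Guard against 0/0 when nothing matches.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


ground_truth = [
    ["put-down", "with", "hand1"],
    ["put-down", "dobj", "rag"],
    ["put-down", "on", "floor"],
    ["person", "verb", "put-down"],
]

# The model response from Section 6: no triplet matches exactly.
miss = triplet_scores(ground_truth, [
    ["person", "verb", "put"],
    ["rag", "dobj", "floor"],
    ["rag", "from", "floor"],
])

# A hand-written prediction with 2 of 3 triplets correct.
partial = triplet_scores(ground_truth, [
    ["person", "verb", "put-down"],
    ["put-down", "dobj", "rag"],
    ["put-down", "on", "table"],
])

print(miss)  # {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
print({k: round(v, 3) for k, v in partial.items()})
# {'precision': 0.667, 'recall': 0.5, 'f1': 0.571}

# Macro-average: the mean of the per-sample scores.
macro_f1 = (miss["f1"] + partial["f1"]) / 2
print(round(macro_f1, 3))  # 0.286
```

The first case shows why a near-miss such as `put` versus `put-down` scores zero: once the verb node differs, every triplet that contains it stops matching.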

### Understanding Tasks (SGQA, SGDS):

**Accuracy:** Percentage of correct predictions
```python
accuracy = correct_predictions / total_predictions
```

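**Matching:** SGQA uses a case-insensitive exact string match on the extracted answer; SGDS compares the index of the selected letter with the ground-truth `position`.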

## 13. Key Insights and Next Steps

### What We've Learned:

1. **Scene Graph Structure:**
   - Triplets: `[source_node, edge_type, target_node]`
   - Every graph is rooted at the `person` node, though that triplet may appear anywhere in the list
   - Captures both actions and spatial relationships

2. **Four Evaluation Tasks:**
   - **SA-SGG:** Single action generation (1,188 samples)
   - **MA-SGG:** Multi-action generation (853 samples)
   - **SGQA:** Question answering (100 samples, 500 QA pairs)
   - **SGDS:** Description selection (250 samples)

3. **Evaluation Pipeline:**
   - Load data from `resource/dataset/`
   - Format prompts using templates from `resource/prompts/`
   - Invoke models via unified interface in `models/models.py`
   - Parse responses and calculate metrics
   - Use parallel processing for efficiency

4. **Code Locations:**
   - **SA-SGG:** `evaluation/generation/sa-sgg.py`
   - **MA-SGG:** `evaluation/generation/ma-sgg.py`
   - **SGQA:** `evaluation/understanding/sgqa.py`
   - **SGDS:** `evaluation/understanding/sgds.py`
   - **Models:** `models/models.py`

### How to Extend:

1. **Add New Models:**
   ```python
   class NewModel(LLM):
       def __init__(self):
           self.model_name = "new-model"
       
       def invoke(self, message):
           # Your implementation
           return response
   ```

2. **Customize Prompts:**
   - Edit files in `resource/prompts/`
   - Modify `prompt_format()` methods in evaluation scripts

3. **Save Results:**
   - Uncomment `_save_results()` in evaluation scripts
   - Or redirect stdout: `python evaluation/generation/sa-sgg.py > results.txt`

4. **Batch Processing:**
   - Adjust `max_workers` in ThreadPoolExecutor
   - Balance API rate limits vs speed

### Resources:

- **Website:** https://tsg-bench.netlify.app/
- **Dataset:** `resource/dataset/`
- **Prompts:** `resource/prompts/`
- **Config:** `conf.d/config.yaml`

## 14. Summary

This notebook demonstrated:

✓ How scene graphs represent actions and spatial relationships as triplets  
✓ The 4 evaluation tasks and their datasets  
✓ How to configure and use OpenAI models  
✓ Where evaluation code lives and how it works  
✓ The complete evaluation pipeline from prompt to metrics  
✓ How to run evaluations and interpret results  

**Next steps:**
- Run full evaluations on all datasets
- Compare different models (GPT-4o vs GPT-4o-mini vs Claude, etc.)
- Analyze error patterns
- Customize prompts for better performance

Happy experimenting with TSG-Bench! 🚀