# Notebook 05: Semantic QA Generation

## Objective

Generate questions that require genuine semantic reasoning over relations between entities, rather than vocabulary transitions (identifier lookups from one vocabulary to another).

## v2 vs v3 QA Comparison

```python
# v2 (vocabulary transition - NOT semantic)
{
    "question": "What DRUGBANK ID is linked to glucose via CHEBI?",
    "answer": "DRUGBANK:DB00117",
    "type": "vocabulary_transition"
}

# v3 (TRUE semantic)
{
    "question": "What pathway does glucose participate in during anaerobic metabolism?",
    "answer": "Glycolysis",
    "answer_id": "REACT:R-HSA-70171",
    "reasoning_chain": [{"subject": "glucose", "predicate": "participates_in", "object": "glycolysis"}],
    "type": "semantic_1_hop"
}
```
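
The record shapes above can be sketched as plain dataclasses. This is only an illustration: the real `SemanticTriple` and `SemanticQAPair` classes live in `kg_o1_v3_utils` and are not shown in this notebook, so the class names `TripleSketch` and `QAPairSketch` are hypothetical and the fields are inferred from how the cells below use them (`qa.question`, `qa.answer`, `qa.qa_type`, `qa.num_hops`, `qa.reasoning_chain`, `qa.validation`, `qa.to_dict()`).

```python
from dataclasses import dataclass, field, asdict


@dataclass
class TripleSketch:
    """Hypothetical stand-in for SemanticTriple (fields inferred from usage)."""
    subject_id: str
    subject_name: str
    predicate: str
    object_id: str
    object_name: str
    object_category: str = ""


@dataclass
class QAPairSketch:
    """Hypothetical stand-in for SemanticQAPair (fields inferred from usage)."""
    question: str
    answer: str
    answer_id: str
    reasoning_chain: list
    qa_type: str
    validation: dict = field(default_factory=dict)

    @property
    def num_hops(self) -> int:
        # Assumption: one reasoning step per hop.
        return len(self.reasoning_chain)

    def to_dict(self) -> dict:
        # Serializes the dataclass fields only; num_hops is derived, not stored.
        return asdict(self)


qa = QAPairSketch(
    question="What pathway does glucose participate in during anaerobic metabolism?",
    answer="Glycolysis",
    answer_id="REACT:R-HSA-70171",
    reasoning_chain=[
        {"subject": "glucose", "predicate": "participates_in", "object": "glycolysis"}
    ],
    qa_type="semantic_1_hop",
)
print(qa.num_hops, qa.to_dict()["answer_id"])
```

Keeping `validation` as a free-form dict mirrors the way later cells attach keys such as `is_valid` and `confirmed_by` to each pair.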

## Ground Truth Validation

We validate a sample of QA pairs against external sources (Reactome, KEGG) to check that the answers are correct.

In [1]:
# Standard imports
import sys
from pathlib import Path
from datetime import datetime
from collections import Counter

# Add project root to path
PROJECT_ROOT = Path.cwd().parents[1]
sys.path.insert(0, str(PROJECT_ROOT / 'src'))
sys.path.insert(0, str(Path.cwd()))

# Import utilities
from kg_o1_v3_utils import (
    save_json, load_json,
    SemanticQAPair, SemanticTriple,
    generate_qa_from_triple, generate_qa_from_path,
    validate_qa_pair, validate_with_reactome, validate_with_kegg,
)

# Output directory
OUTPUT_DIR = Path.cwd() / 'outputs'
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Output directory: {OUTPUT_DIR}")

Project root: /home/trentleslie/Insync/projects/biomapper2
Output directory: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs


## 1. Load Subgraphs and Paths

In [2]:
# Load subgraphs from NB03
subgraph_file = OUTPUT_DIR / 'semantic_subgraphs.json'

if subgraph_file.exists():
    subgraph_data = load_json(subgraph_file)
    subgraphs = subgraph_data.get('subgraphs', [])
    print(f"Loaded {len(subgraphs)} subgraphs from NB03")
else:
    print("WARNING: NB03 output not found.")
    subgraphs = []

# Load paths from NB04
paths_file = OUTPUT_DIR / 'multi_hop_paths.json'

if paths_file.exists():
    paths_data = load_json(paths_file)
    paths = paths_data.get('paths', [])
    found_paths = [p for p in paths if p.get('path_length', 0) > 0]
    print(f"Loaded {len(found_paths)} multi-hop paths from NB04")
else:
    print("WARNING: NB04 output not found.")
    found_paths = []

Loaded 20 subgraphs from NB03
Loaded 3 multi-hop paths from NB04


## 2. Generate 1-Hop QA from Subgraphs

In [3]:
# Generate 1-hop QA pairs from subgraph relations
print("="*60)
print("Generating 1-hop QA pairs from subgraphs")
print("="*60)

one_hop_qa = []

for sg in subgraphs:
    entity_id = sg['center_entity_id']
    entity_name = sg['center_entity_name']
    
    # Generate from outgoing relations
    for rel in sg.get('outgoing_relations', []):
        triple = SemanticTriple(
            subject_id=entity_id,
            subject_name=entity_name,
            predicate=rel.get('predicate', ''),
            object_id=rel.get('object_id', ''),
            object_name=rel.get('object_name', ''),
            object_category=rel.get('object_category', ''),
        )
        
        qa = generate_qa_from_triple(triple)
        if qa:
            one_hop_qa.append(qa)

print(f"\nGenerated {len(one_hop_qa)} 1-hop QA pairs")

# Show examples
print("\nSample 1-hop QA pairs:")
for qa in one_hop_qa[:5]:
    print(f"\n  Q: {qa.question}")
    print(f"  A: {qa.answer}")
    print(f"  Chain: {qa.reasoning_chain}")

Generating 1-hop QA pairs from subgraphs

Generated 3356 1-hop QA pairs

Sample 1-hop QA pairs:

  Q: What entity does glucose mentioned in clinical trials for?
  A: hypoglycemia
  Chain: [{'subject': 'glucose', 'predicate': 'biolink:mentioned_in_clinical_trials_for', 'object': 'hypoglycemia'}]

  Q: What entity does glucose in clinical trials for?
  A: hypoglycemia
  Chain: [{'subject': 'glucose', 'predicate': 'biolink:in_clinical_trials_for', 'object': 'hypoglycemia'}]

  Q: What entity does glucose treat?
  A: hypoglycemia
  Chain: [{'subject': 'glucose', 'predicate': 'biolink:treats', 'object': 'hypoglycemia'}]

  Q: What entity does glucose related to?
  A: hypoglycemia
  Chain: [{'subject': 'glucose', 'predicate': 'biolink:related_to', 'object': 'hypoglycemia'}]

  Q: What entity does glucose in clinical trials for?
  A: hypoglycemia
  Chain: [{'subject': 'glucose', 'predicate': 'biolink:in_clinical_trials_for', 'object': 'hypoglycemia'}]
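
Several of the sample questions above are ungrammatical (for example "What entity does glucose in clinical trials for?"), which suggests the question text is built by dropping the predicate into one fixed sentence pattern. One possible remedy is a per-predicate template table with a neutral fallback. The sketch below is an assumption about how that could look, not the behavior of `generate_qa_from_triple`; the names `QUESTION_TEMPLATES` and `question_for` are hypothetical.

```python
# Hypothetical templates keyed by Biolink predicate (not taken from kg_o1_v3_utils).
QUESTION_TEMPLATES = {
    "biolink:treats": "What condition does {subject} treat?",
    "biolink:in_clinical_trials_for": "For what condition is {subject} in clinical trials?",
    "biolink:mentioned_in_clinical_trials_for": (
        "For what condition is {subject} mentioned in clinical trials?"
    ),
    "biolink:related_to": "What entity is {subject} related to?",
}


def question_for(subject_name: str, predicate: str) -> str:
    """Return a grammatical question for a (subject, predicate) pair."""
    template = QUESTION_TEMPLATES.get(predicate)
    if template is None:
        # Fallback: strip the CURIE prefix and quote the relation instead of
        # trying to conjugate it inside the sentence.
        readable = predicate.split(":", 1)[-1].replace("_", " ")
        return f"Which entity is linked to {subject_name} by the relation '{readable}'?"
    return template.format(subject=subject_name)


print(question_for("glucose", "biolink:in_clinical_trials_for"))
print(question_for("glucose", "biolink:has_part"))
```

The fallback keeps unseen predicates usable while making it obvious which ones still need a hand-written template.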


## 3. Generate Multi-Hop QA from Paths

In [4]:
# Generate multi-hop QA pairs from paths
print("="*60)
print("Generating multi-hop QA pairs from paths")
print("="*60)

multi_hop_qa = []

for path_data in found_paths:
    path_steps = path_data.get('path', [])
    
    if len(path_steps) >= 2:
        # Convert to SemanticTriple objects
        triples = [
            SemanticTriple(
                subject_id=step.get('subject_id', ''),
                subject_name=step.get('subject', ''),
                predicate=step.get('predicate', ''),
                object_id=step.get('object_id', ''),
                object_name=step.get('object', ''),
                object_category='',
            )
            for step in path_steps
        ]
        
        qa = generate_qa_from_path(triples)
        if qa:
            multi_hop_qa.append(qa)

print(f"\nGenerated {len(multi_hop_qa)} multi-hop QA pairs")

# Show examples
print("\nSample multi-hop QA pairs:")
for qa in multi_hop_qa[:3]:
    print(f"\n  Q: {qa.question}")
    print(f"  A: {qa.answer}")
    print(f"  Hops: {qa.num_hops}")
    print(f"  Chain:")
    for step in qa.reasoning_chain:
        print(f"    {step['subject']} --[{step['predicate']}]--> {step['object']}")

Generating multi-hop QA pairs from paths

Generated 3 multi-hop QA pairs

Sample multi-hop QA pairs:

  Q: How is CHEBI:4167 connected to water?
  A: water
  Hops: 3
  Chain:
    CHEBI:4167 --[biolink:subclass_of]--> D-glucose
    CHEBI:17634 --[biolink:has_part]--> royal jelly
    CHEBI:78665 --[biolink:has_part]--> water

  Q: How is CHEBI:15846 connected to water?
  A: water
  Hops: 3
  Chain:
    CHEBI:15846 --[biolink:related_to]--> NADH
    CHEBI:16908 --[biolink:related_to]--> ALDH1A1
    NCBIGene:216 --[biolink:interacts_with]--> water

  Q: How is CHEBI:15846 connected to L-glutamic acid?
  A: L-glutamic acid
  Hops: 3
  Chain:
    CHEBI:15846 --[biolink:related_to]--> NADH
    CHEBI:16908 --[biolink:related_to]--> GLUD1
    NCBIGene:2746 --[biolink:interacts_with]--> L-glutamic acid
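
The printed chains show an identifier on the left of each step and a name on the right, which makes it hard to see at a glance whether consecutive steps actually connect. A small continuity check on the IDs can make that explicit. This is a sketch under the assumption that each path step carries the `subject_id` and `object_id` keys read by the cell above; `find_chain_breaks` is a hypothetical helper, not part of `kg_o1_v3_utils`.

```python
def find_chain_breaks(steps):
    """Return the indexes of steps that do not start where the previous step ended."""
    breaks = []
    for i in range(1, len(steps)):
        if steps[i]["subject_id"] != steps[i - 1]["object_id"]:
            breaks.append(i)
    return breaks


# First sample path from the output above. The intermediate object IDs are
# inferred from the subject printed on the following line (an assumption);
# the final object ID is not shown in the output, so a placeholder is used.
chain = [
    {"subject_id": "CHEBI:4167", "predicate": "biolink:subclass_of", "object_id": "CHEBI:17634"},
    {"subject_id": "CHEBI:17634", "predicate": "biolink:has_part", "object_id": "CHEBI:78665"},
    {"subject_id": "CHEBI:78665", "predicate": "biolink:has_part", "object_id": "PLACEHOLDER:water"},
]

print(find_chain_breaks(chain))
```

An empty result means every hop starts at the node the previous hop reached; any index returned points at a step worth inspecting before the path is turned into a question.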


## 4. Combine All QA Pairs

In [5]:
# Combine all QA pairs
all_qa_pairs = one_hop_qa + multi_hop_qa

print(f"Total QA pairs: {len(all_qa_pairs)}")
print(f"  - 1-hop: {len(one_hop_qa)}")
print(f"  - Multi-hop: {len(multi_hop_qa)}")

# Distribution by type
type_counts = Counter(qa.qa_type for qa in all_qa_pairs)
print(f"\nDistribution by type:")
for qa_type, count in type_counts.most_common():
    print(f"  {qa_type}: {count}")

Total QA pairs: 3359
  - 1-hop: 3356
  - Multi-hop: 3

Distribution by type:
  semantic_1_hop: 3356
  semantic_3_hop: 3


## 5. Validate QA Pairs Against External Sources

Use Reactome and KEGG to validate the correctness of QA pairs.
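
To make the validation record concrete, here is a sketch of how results from several sources could be merged into the `is_valid` / `confirmed_by` shape that the cells below read. It is not the implementation of `validate_qa_pair`, which is not shown in this notebook. The function `validate_with_sources`, its `checkers` parameter, and the `errors` key are assumptions, and the source checks are injected as plain callables so that no real Reactome or KEGG API call is implied.

```python
def validate_with_sources(subject_id, predicate, object_id, checkers):
    """Merge per-source checks into one validation record.

    checkers maps a source name to a callable(subject_id, predicate, object_id)
    that returns True when the source confirms the relation.
    """
    confirmed_by = []
    errors = {}
    for name, check in checkers.items():
        try:
            if check(subject_id, predicate, object_id):
                confirmed_by.append(name)
        except Exception as exc:  # a failing source should not abort the others
            errors[name] = str(exc)

    if confirmed_by:
        is_valid = True
    elif len(errors) == len(checkers):
        # No source could be consulted, so the result is unknown, not invalid.
        is_valid = None
    else:
        is_valid = False
    return {"is_valid": is_valid, "confirmed_by": confirmed_by, "errors": errors}


def failing_source(subject_id, predicate, object_id):
    """Hypothetical checker that simulates an unavailable source."""
    raise TimeoutError("source unavailable")


result = validate_with_sources(
    "CHEBI:17634",
    "biolink:participates_in",
    "REACT:R-HSA-70171",
    {"reactome": lambda s, p, o: True, "kegg": failing_source},
)
print(result)
```

Separating "no source confirmed it" (`False`) from "no source could be reached" (`None`) matters when reading the summary, because the two cases call for different follow-up.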

In [6]:
# Validate QA pairs (sample to avoid rate limiting)
import time

print("="*60)
print("Validating QA pairs against external sources")
print("="*60)

# Limit validation to avoid rate limiting
max_validate = min(50, len(all_qa_pairs))
validated_qa = []

for i, qa in enumerate(all_qa_pairs[:max_validate]):
    if i % 10 == 0:
        print(f"  Validating {i+1}/{max_validate}...")
    
    try:
        validated = validate_qa_pair(qa)
        validated_qa.append(validated)
    except Exception as e:
        print(f"  Warning: Validation failed for QA {i}: {e}")
        qa.validation = {'error': str(e), 'is_valid': None}
        validated_qa.append(qa)
    
    # Rate limiting pause
    time.sleep(0.2)

# Add unvalidated QA pairs
for qa in all_qa_pairs[max_validate:]:
    qa.validation = {'not_validated': True, 'is_valid': None}
    validated_qa.append(qa)

print(f"\nValidated {max_validate} of {len(all_qa_pairs)} QA pairs")

Validating QA pairs against external sources
  Validating 1/50...
  Validating 11/50...
  Validating 21/50...
  Validating 31/50...
  Validating 41/50...

Validated 50 of 3359 QA pairs
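
Because `all_qa_pairs` lists every 1-hop pair before the multi-hop pairs, taking the first 50 entries checks 1-hop pairs only, and they probably come from the first subgraphs processed. A seeded, stratified random sample is one way to spread the same validation budget across question types. The sketch below is an assumption about how that could be done; `stratified_sample` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_group, seed=0):
    """Draw up to per_group random items from each group defined by key(item)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for group_key in sorted(groups):  # sorted for a deterministic group order
        members = groups[group_key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy data with the same imbalance as this notebook: many 1-hop, few 3-hop.
toy_pairs = [("semantic_1_hop", i) for i in range(100)] + [
    ("semantic_3_hop", i) for i in range(3)
]
picked = stratified_sample(toy_pairs, key=lambda pair: pair[0], per_group=10)
print(len(picked))
```

With the real objects, the key would be something like `lambda qa: qa.qa_type`, and grouping by center entity instead would spread the sample across subgraphs.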


In [7]:
# Validation summary
print("\n" + "="*60)
print("VALIDATION SUMMARY")
print("="*60)

# Compare against the True/False/None singletons with `is`, not `==`
validated_count = sum(1 for qa in validated_qa if qa.validation.get('is_valid') is True)
invalid_count = sum(1 for qa in validated_qa if qa.validation.get('is_valid') is False)
unknown_count = sum(1 for qa in validated_qa if qa.validation.get('is_valid') is None)

total_checked = validated_count + invalid_count
validation_rate = validated_count / total_checked if total_checked > 0 else 0

print(f"\nValidation results (of {max_validate} checked):")
print(f"  Valid (confirmed by external source): {validated_count}")
print(f"  Invalid (not confirmed): {invalid_count}")
print(f"  Unknown (validation failed/skipped): {unknown_count}")
print(f"\nValidation rate: {100*validation_rate:.1f}%")

# Sources that confirmed
source_counts = Counter()
for qa in validated_qa:
    for source in qa.validation.get('confirmed_by', []):
        source_counts[source] += 1

if source_counts:
    print(f"\nConfirmations by source:")
    for source, count in source_counts.most_common():
        print(f"  {source}: {count}")


VALIDATION SUMMARY

Validation results (of 50 checked):
  Valid (confirmed by external source): 0
  Invalid (not confirmed): 50
  Unknown (validation failed/skipped): 3309

Validation rate: 0.0%


## 6. Show Validated Examples

In [8]:
# Show validated examples
print("="*60)
print("VALIDATED QA EXAMPLES")
print("="*60)

valid_examples = [qa for qa in validated_qa if qa.validation.get('is_valid') is True]

if valid_examples:
    for qa in valid_examples[:5]:
        print(f"\n{'='*40}")
        print(f"Q: {qa.question}")
        print(f"A: {qa.answer}")
        print(f"Type: {qa.qa_type}")
        print(f"Confirmed by: {qa.validation.get('confirmed_by', [])}")
        print(f"Chain:")
        for step in qa.reasoning_chain:
            print(f"  {step['subject']} --[{step['predicate']}]--> {step['object']}")
else:
    print("\nNo QA pairs were validated by external sources.")
    print("This could mean:")
    print("  1. The relations are correct but not in Reactome/KEGG")
    print("  2. The answer IDs don't match the external format")
    print("  3. The external APIs returned errors")

VALIDATED QA EXAMPLES

No QA pairs were validated by external sources.
This could mean:
  1. The relations are correct but not in Reactome/KEGG
  2. The answer IDs don't match the external format
  3. The external APIs returned errors
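
The second explanation (answer IDs that do not match the external format) can be probed cheaply by counting the CURIE prefixes of the answer IDs. If most answers carry prefixes that Reactome or KEGG do not index, a 0% confirmation rate may simply reflect a namespace mismatch. The sketch below assumes each QA pair exposes an answer identifier, as in the v3 example at the top of this notebook; `curie_prefix` and `prefix_distribution` are hypothetical helpers.

```python
from collections import Counter


def curie_prefix(curie: str) -> str:
    """Return the part before the first colon, or a marker when there is none."""
    return curie.split(":", 1)[0] if ":" in curie else "(no prefix)"


def prefix_distribution(answer_ids):
    """Count how many answer IDs fall under each CURIE prefix."""
    return Counter(curie_prefix(answer_id) for answer_id in answer_ids)


# Identifiers that appear elsewhere in this notebook, plus one empty ID.
example_ids = [
    "REACT:R-HSA-70171",
    "DRUGBANK:DB00117",
    "CHEBI:4167",
    "CHEBI:17634",
    "NCBIGene:216",
    "",
]
for prefix, count in prefix_distribution(example_ids).most_common():
    print(f"{prefix}: {count}")
```

Running the same count over the 50 checked pairs would show whether any of them could have been confirmed by the two sources in the first place.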


## 7. Save Semantic QA Dataset

In [9]:
# Save QA dataset
output_data = {
    'timestamp': datetime.now().isoformat(),
    'summary': {
        'total_qa_pairs': len(validated_qa),
        'one_hop_qa': len(one_hop_qa),
        'multi_hop_qa': len(multi_hop_qa),
        'validated_count': validated_count,
        'validation_rate': validation_rate,
        'type_distribution': dict(type_counts),
        'source_confirmations': dict(source_counts),
    },
    'qa_pairs': [qa.to_dict() for qa in validated_qa],
}

save_json(output_data, OUTPUT_DIR / 'semantic_qa_dataset.json')
print(f"\nSemantic QA dataset saved to: {OUTPUT_DIR / 'semantic_qa_dataset.json'}")


Semantic QA dataset saved to: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs/semantic_qa_dataset.json


## Summary

In [10]:
# Final summary
print("\n" + "="*60)
print("NOTEBOOK 05 COMPLETE")
print("="*60)

print(f"\nQA Generation Results:")
print(f"  - Total QA pairs: {len(validated_qa)}")
print(f"  - 1-hop: {len(one_hop_qa)}")
print(f"  - Multi-hop: {len(multi_hop_qa)}")
print(f"  - Validation rate: {100*validation_rate:.1f}%")

# Success criteria (as implemented below): at least 50 QA pairs in total and a validation rate of at least 30%
if len(validated_qa) >= 50 and validation_rate >= 0.3:
    print(f"\nSuccess criteria met! Proceed to NB06.")
elif len(validated_qa) >= 20:
    print(f"\nPartial success. QA dataset is usable for evaluation.")
else:
    print(f"\nLimited QA pairs. May need to expand subgraph extraction.")

print(f"\nNext step: NB06 - Semantic QA Evaluation")


NOTEBOOK 05 COMPLETE

QA Generation Results:
  - Total QA pairs: 3359
  - 1-hop: 3356
  - Multi-hop: 3
  - Validation rate: 0.0%

Partial success. QA dataset is usable for evaluation.

Next step: NB06 - Semantic QA Evaluation
