# Notebook 07: Semantic vs Vocabulary Gap Analysis

## Objective

Quantify the capability difference between:
- **v2 Vocabulary QA**: Entity resolution via search (95.2% EM)
- **v3 Semantic QA**: Multi-hop reasoning via graph traversal

## Key Questions

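1. Does KRAKEN support semantic reasoning, meaning it can answer questions that relate different entities rather than map one entity across identifiers?
2. What types of questions can each approach answer?
3. When should search be used, and when graph traversal?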
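
The two metrics compared in this notebook, exact match (EM) for search and recall for graph traversal, measure different things, so a minimal sketch of each may help when reading the numbers. The function names and the normalization step (trimming whitespace, lowercasing) are assumptions made for illustration; they are not taken from the v2 or NB06 evaluation code.

```python
from typing import Iterable


def exact_match(predicted: str, gold: str) -> bool:
    """Return True when the predicted ID equals the gold ID after light normalization."""
    # Assumption: IDs are compared case-insensitively with whitespace trimmed.
    return predicted.strip().lower() == gold.strip().lower()


def recall(retrieved: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold answers that appear among the retrieved answers."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0  # Convention chosen here for an empty gold set.
    return len(set(retrieved) & gold_set) / len(gold_set)


# Toy identifiers, not real KRAKEN data.
print(exact_match(" toy:001 ", "TOY:001"))
print(recall(["TOY:A", "TOY:B"], ["TOY:A", "TOY:C"]))
```

EM scores a single answer as right or wrong, whereas recall credits partial coverage of a set of answers, so the two percentages are not directly interchangeable.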

In [1]:
# Standard imports
import sys
from pathlib import Path
from datetime import datetime

# Add project root to path
PROJECT_ROOT = Path.cwd().parents[1]
sys.path.insert(0, str(PROJECT_ROOT / 'src'))
sys.path.insert(0, str(Path.cwd()))

# Import utilities
from kg_o1_v3_utils import save_json, load_json

# Output directory
OUTPUT_DIR = Path.cwd() / 'outputs'
V2_OUTPUT_DIR = Path.cwd().parent / 'kg_o1_v2' / 'outputs'

print(f"v3 output directory: {OUTPUT_DIR}")
print(f"v2 output directory: {V2_OUTPUT_DIR}")

v3 output directory: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs
v2 output directory: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v2/outputs


## 1. Load All Results

In [2]:
# Load v3 results
v3_results = {}

# NB01: API Audit
if (OUTPUT_DIR / 'one_hop_api_audit.json').exists():
    v3_results['api_audit'] = load_json(OUTPUT_DIR / 'one_hop_api_audit.json')
    print(f"Loaded API audit: GO/NO-GO = {v3_results['api_audit'].get('go_no_go_decision', {}).get('decision')}")

# NB02: Predicate mapping
if (OUTPUT_DIR / 'predicate_mapping.json').exists():
    v3_results['predicate_mapping'] = load_json(OUTPUT_DIR / 'predicate_mapping.json')
    print(f"Loaded predicate mapping: {v3_results['predicate_mapping'].get('summary', {}).get('total_predicates', 0)} predicates")

# NB03: Semantic subgraphs
if (OUTPUT_DIR / 'semantic_subgraphs.json').exists():
    v3_results['subgraphs'] = load_json(OUTPUT_DIR / 'semantic_subgraphs.json')
    print(f"Loaded subgraphs: {len(v3_results['subgraphs'].get('subgraphs', []))} subgraphs")

# NB04: Multi-hop paths
if (OUTPUT_DIR / 'multi_hop_paths.json').exists():
    v3_results['paths'] = load_json(OUTPUT_DIR / 'multi_hop_paths.json')
    print(f"Loaded paths: {v3_results['paths'].get('summary', {}).get('paths_found', 0)} paths found")

# NB05: Semantic QA dataset
if (OUTPUT_DIR / 'semantic_qa_dataset.json').exists():
    v3_results['qa_dataset'] = load_json(OUTPUT_DIR / 'semantic_qa_dataset.json')
    print(f"Loaded QA dataset: {len(v3_results['qa_dataset'].get('qa_pairs', []))} QA pairs")

# NB06: Evaluation results
if (OUTPUT_DIR / 'semantic_qa_evaluation.json').exists():
    v3_results['evaluation'] = load_json(OUTPUT_DIR / 'semantic_qa_evaluation.json')
    print("Loaded evaluation results")

Loaded API audit: GO/NO-GO = GO
Loaded predicate mapping: 88 predicates
Loaded subgraphs: 20 subgraphs
Loaded paths: 3 paths found
Loaded QA dataset: 3359 QA pairs
Loaded evaluation results
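
The six load blocks in the cell above share one pattern: check that a file exists, then load it. A table-driven variant might look like the sketch below. It is an illustration under assumptions: `load_available` and `RESULT_FILES` are hypothetical names, and the standard-library `json` module stands in for `load_json`, whose implementation is not shown in this notebook.

```python
import json
from pathlib import Path

# File names as used in the cell above; the dictionary keys mirror v3_results.
RESULT_FILES = {
    'api_audit': 'one_hop_api_audit.json',
    'predicate_mapping': 'predicate_mapping.json',
    'subgraphs': 'semantic_subgraphs.json',
    'paths': 'multi_hop_paths.json',
    'qa_dataset': 'semantic_qa_dataset.json',
    'evaluation': 'semantic_qa_evaluation.json',
}


def load_available(output_dir, files=RESULT_FILES):
    """Load every JSON file that exists; return (loaded, missing_filenames)."""
    loaded, missing = {}, []
    for key, filename in files.items():
        path = Path(output_dir) / filename
        if path.exists():
            with path.open(encoding='utf-8') as fh:
                loaded[key] = json.load(fh)
        else:
            # Record the gap instead of skipping it silently.
            missing.append(filename)
    return loaded, missing
```

Calling `load_available(OUTPUT_DIR)` would return the loaded results together with a list of missing files, which makes it visible when a downstream metric is a default rather than a measured value.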


In [3]:
# Load v2 results for comparison
v2_results = {}

if V2_OUTPUT_DIR.exists():
    # v2 search evaluation
    if (V2_OUTPUT_DIR / 'search_evaluation_results.json').exists():
        v2_results['evaluation'] = load_json(V2_OUTPUT_DIR / 'search_evaluation_results.json')
        print("Loaded v2 evaluation results")
    
    # v2 QA dataset
    if (V2_OUTPUT_DIR / 'multihop_qa_dataset.json').exists():
        v2_results['qa_dataset'] = load_json(V2_OUTPUT_DIR / 'multihop_qa_dataset.json')
        print("Loaded v2 QA dataset")
else:
    print("v2 results not available - using reference values")
    v2_results = {
        'reference': {
            'vocabulary_qa_em': 0.952,  # From v2 final report
            'hybrid_search_em': 0.952,
            'text_search_em': 1.0,
            'vector_search_em': 0.0,
        }
    }

## 2. Capability Comparison

In [4]:
# Build capability comparison
print("="*60)
print("CAPABILITY COMPARISON: v2 vs v3")
print("="*60)

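# v2 metrics
# Falls back to 0.952 (the EM from the v2 final report) when neither the
# reference values nor a loaded v2 evaluation is available.
v2_vocab_em = v2_results.get('reference', {}).get('vocabulary_qa_em', 0.952)
if 'evaluation' in v2_results:
    v2_vocab_em = v2_results['evaluation'].get('summary', {}).get('overall_metrics', {}).get('hybrid_em', v2_vocab_em)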

# v3 metrics
v3_search_em = 0
v3_graph_recall = 0

if 'evaluation' in v3_results:
    v3_search_em = v3_results['evaluation'].get('search_method', {}).get('metrics', {}).get('exact_match', 0)
    v3_graph_recall = v3_results['evaluation'].get('graph_method', {}).get('metrics', {}).get('recall', 0)

comparison = {
    'v2_vocabulary_qa': {
        'method': 'hybrid_search + reranking',
        'exact_match': v2_vocab_em,
        'use_case': 'Entity resolution (same entity, different IDs)',
        'example': 'What HMDB ID corresponds to glucose?',
    },
    'v3_semantic_qa': {
        'search_em': v3_search_em,
        'graph_recall': v3_graph_recall,
        'use_case': 'Multi-hop reasoning (different entities, semantic relations)',
        'example': 'What pathway does glucose participate in?',
    },
}

print(f"\nVocabulary QA (v2):")
print(f"  Method: {comparison['v2_vocabulary_qa']['method']}")
print(f"  Exact Match: {100*comparison['v2_vocabulary_qa']['exact_match']:.1f}%")
print(f"  Use case: {comparison['v2_vocabulary_qa']['use_case']}")
print(f"  Example: {comparison['v2_vocabulary_qa']['example']}")

print(f"\nSemantic QA (v3):")
print(f"  Search EM: {100*comparison['v3_semantic_qa']['search_em']:.1f}%")
print(f"  Graph Recall: {100*comparison['v3_semantic_qa']['graph_recall']:.1f}%")
print(f"  Use case: {comparison['v3_semantic_qa']['use_case']}")
print(f"  Example: {comparison['v3_semantic_qa']['example']}")

CAPABILITY COMPARISON: v2 vs v3

Vocabulary QA (v2):
  Method: hybrid_search + reranking
  Exact Match: 95.2%
  Use case: Entity resolution (same entity, different IDs)
  Example: What HMDB ID corresponds to glucose?

Semantic QA (v3):
  Search EM: 0.0%
  Graph Recall: 63.2%
  Use case: Multi-hop reasoning (different entities, semantic relations)
  Example: What pathway does glucose participate in?
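
The metric lookups in the cell above chain several `.get(..., {})` calls and default to 0, which makes a missing metric look the same as a measured score of zero. One possible alternative is a small helper that returns `None` when any key is absent. `nested_get` is a hypothetical name, and the toy dictionary below only imitates the shape of the evaluation file; its values are placeholders.

```python
from collections.abc import Mapping
from typing import Any


def nested_get(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings key by key; return default if any step is missing."""
    current = data
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


# Toy structure shaped like the evaluation results read above.
toy_eval = {'graph_method': {'metrics': {'recall': 0.5}}}
print(nested_get(toy_eval, 'graph_method', 'metrics', 'recall', default=0))
print(nested_get(toy_eval, 'search_method', 'metrics', 'exact_match'))
```

Returning `None` for an absent metric lets later cells distinguish "not evaluated" from "evaluated and scored 0.0".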


## 3. Key Insights

In [5]:
# Generate insights
print("="*60)
print("KEY INSIGHTS")
print("="*60)

insights = []

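# Insight 1: Search performance on semantic QA
# Compare only when v3 evaluation results were loaded; otherwise v3_search_em is a placeholder 0.
if 'evaluation' in v3_results and v3_search_em < v2_vocab_em: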
    diff = v2_vocab_em - v3_search_em
    insights.append({
        'finding': 'Search performs worse on semantic QA',
        'detail': f'Search EM drops from {100*v2_vocab_em:.1f}% (vocabulary) to {100*v3_search_em:.1f}% (semantic) - a {100*diff:.1f}pp decrease',
        'implication': 'Search is optimized for entity resolution, not semantic reasoning',
    })

# Insight 2: Graph traversal capability
if v3_graph_recall > 0:
    insights.append({
        'finding': 'Graph traversal provides semantic capability',
        'detail': f'Graph recall of {100*v3_graph_recall:.1f}% on semantic QA',
        'implication': 'KRAKEN has semantic relations accessible via /one-hop',
    })
else:
    insights.append({
        'finding': 'Limited graph traversal capability',
        'detail': 'Graph traversal did not find answers effectively',
        'implication': 'KRAKEN may be primarily vocabulary-focused',
    })

# Insight 3: Complementary approaches
if v3_search_em > 0 or v3_graph_recall > 0:
    insights.append({
        'finding': 'Complementary use cases',
        'detail': 'Search excels at entity resolution; graph traversal for semantic reasoning',
        'implication': 'Hybrid approach recommended: search for entities, graph for relations',
    })

# Display insights
for i, insight in enumerate(insights, 1):
    print(f"\n{i}. {insight['finding']}")
    print(f"   Detail: {insight['detail']}")
    print(f"   Implication: {insight['implication']}")

KEY INSIGHTS

1. Search performs worse on semantic QA
   Detail: Search EM drops from 95.2% (vocabulary) to 0.0% (semantic) - a 95.2pp decrease
   Implication: Search is optimized for entity resolution, not semantic reasoning

2. Graph traversal provides semantic capability
   Detail: Graph recall of 63.2% on semantic QA
   Implication: KRAKEN has semantic relations accessible via /one-hop

3. Complementary use cases
   Detail: Search excels at entity resolution; graph traversal for semantic reasoning
   Implication: Hybrid approach recommended: search for entities, graph for relations


## 4. Semantic Gap Assessment

In [6]:
# Semantic gap assessment
print("="*60)
print("SEMANTIC GAP ASSESSMENT")
print("="*60)

# Get predicate data
semantic_predicates = 0
equiv_predicates = 0
semantic_edge_pct = 0

if 'predicate_mapping' in v3_results:
    summary = v3_results['predicate_mapping'].get('summary', {})
    semantic_predicates = summary.get('semantic_predicates', 0)
    equiv_predicates = summary.get('equivalency_predicates', 0)
    semantic_edge_pct = summary.get('semantic_edge_percent', 0)

# Determine semantic capability level
if semantic_edge_pct > 50:
    capability_level = 'HIGH'
    capability_desc = 'KRAKEN has robust semantic capability'
elif semantic_edge_pct > 20:
    capability_level = 'MEDIUM'
    capability_desc = 'KRAKEN has moderate semantic capability'
elif semantic_edge_pct > 5:
    capability_level = 'LOW'
    capability_desc = 'KRAKEN has limited semantic capability'
else:
    capability_level = 'MINIMAL'
    capability_desc = 'KRAKEN is primarily vocabulary-focused'

semantic_gap_assessment = {
    'semantic_predicates': semantic_predicates,
    'equivalency_predicates': equiv_predicates,
    'semantic_edge_percent': semantic_edge_pct,
    'capability_level': capability_level,
    'capability_description': capability_desc,
    'v2_conclusion': f'KRAKEN is vocabulary-focused (v2 achieved {100*v2_vocab_em:.1f}% EM on entity resolution)',
    'v3_finding': f'Semantic capability is {capability_level}',
}

print(f"\nPredicate Analysis:")
print(f"  Semantic predicates: {semantic_predicates}")
print(f"  Equivalency predicates: {equiv_predicates}")
print(f"  Semantic edge %: {semantic_edge_pct:.1f}%")
print(f"\nCapability Level: {capability_level}")
print(f"Assessment: {capability_desc}")
print(f"\nv2 Conclusion: {semantic_gap_assessment['v2_conclusion']}")
print(f"v3 Finding: {semantic_gap_assessment['v3_finding']}")

SEMANTIC GAP ASSESSMENT

Predicate Analysis:
  Semantic predicates: 26
  Equivalency predicates: 1
  Semantic edge %: 89.1%

Capability Level: HIGH
Assessment: KRAKEN has robust semantic capability

v2 Conclusion: KRAKEN is vocabulary-focused (v2 achieved 95.2% EM on entity resolution)
v3 Finding: Semantic capability is HIGH
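
The capability level above is derived from the semantic edge percentage alone, using the cutoffs 50, 20, and 5 from the cell. A sketch of the same rule as a standalone function makes the boundary behavior easy to check. The function name is hypothetical; the thresholds and labels are copied from the cell.

```python
def classify_semantic_capability(semantic_edge_pct: float) -> str:
    """Map a semantic edge percentage to a capability level (strict > comparisons)."""
    if semantic_edge_pct > 50:
        return 'HIGH'
    if semantic_edge_pct > 20:
        return 'MEDIUM'
    if semantic_edge_pct > 5:
        return 'LOW'
    return 'MINIMAL'


# Values exactly on a cutoff fall into the lower band because the comparisons are strict.
for pct in (89.1, 50, 20, 5):
    print(pct, classify_semantic_capability(pct))
```

Note that this rule looks only at the share of semantic edges; it does not take the measured graph recall into account.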


## 5. Recommendations

In [7]:
# Generate recommendations
print("="*60)
print("RECOMMENDATIONS")
print("="*60)

recommendations = []

# Recommendation 1: Entity resolution
recommendations.append({
    'use_case': 'Entity Resolution',
    'recommended_method': 'Hybrid Search + Reranking',
    'expected_performance': f'{100*v2_vocab_em:.0f}% EM',
    'rationale': 'Search is optimized for finding same entity across vocabularies',
})

# Recommendation 2: Semantic queries
if capability_level in ['HIGH', 'MEDIUM']:
    recommendations.append({
        'use_case': 'Semantic Queries (pathways, diseases, etc.)',
        'recommended_method': 'Graph Traversal via /one-hop',
        'expected_performance': f'{100*v3_graph_recall:.0f}% Recall',
        'rationale': 'Graph traversal follows semantic edges directly',
    })
else:
    recommendations.append({
        'use_case': 'Semantic Queries',
        'recommended_method': 'External APIs (Reactome, KEGG)',
        'expected_performance': 'Varies by API',
        'rationale': 'KRAKEN has limited semantic relations',
    })

# Recommendation 3: Hybrid approach
recommendations.append({
    'use_case': 'Complex Queries',
    'recommended_method': 'Search + Graph Hybrid',
    'expected_performance': 'Depends on query type',
    'rationale': 'Use search to find entity, then graph to explore relations',
})

for i, rec in enumerate(recommendations, 1):
    print(f"\n{i}. {rec['use_case']}")
    print(f"   Method: {rec['recommended_method']}")
    print(f"   Expected: {rec['expected_performance']}")
    print(f"   Rationale: {rec['rationale']}")

RECOMMENDATIONS

1. Entity Resolution
   Method: Hybrid Search + Reranking
   Expected: 95% EM
   Rationale: Search is optimized for finding same entity across vocabularies

2. Semantic Queries (pathways, diseases, etc.)
   Method: Graph Traversal via /one-hop
   Expected: 63% Recall
   Rationale: Graph traversal follows semantic edges directly

3. Complex Queries
   Method: Search + Graph Hybrid
   Expected: Depends on query type
   Rationale: Use search to find entity, then graph to explore relations
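
Recommendation 3 describes a two-step flow: resolve the entity with search, then follow relations in the graph. The sketch below shows only that control flow. Everything in it is hypothetical: `answer_complex_query`, the toy index, the toy edges, and the predicate name are stand-ins, and none of them represents the KRAKEN search or `/one-hop` API.

```python
from typing import Callable, List, Optional


def answer_complex_query(
    entity_name: str,
    predicate: str,
    search_fn: Callable[[str], Optional[str]],
    one_hop_fn: Callable[[str, str], List[str]],
) -> List[str]:
    """Step 1: resolve the name to an ID via search. Step 2: follow one graph hop."""
    entity_id = search_fn(entity_name)
    if entity_id is None:
        return []  # Nothing to traverse from if entity resolution fails.
    return one_hop_fn(entity_id, predicate)


# In-memory stand-ins with made-up identifiers (not real KRAKEN data).
TOY_INDEX = {'glucose': 'TOY:1'}
TOY_EDGES = {('TOY:1', 'participates_in'): ['TOY:PATHWAY:9']}


def toy_search(name: str) -> Optional[str]:
    return TOY_INDEX.get(name.strip().lower())


def toy_one_hop(entity_id: str, predicate: str) -> List[str]:
    return TOY_EDGES.get((entity_id, predicate), [])


print(answer_complex_query('Glucose', 'participates_in', toy_search, toy_one_hop))
```

Passing the two steps in as callables keeps the routing logic independent of whichever search and traversal clients are actually used.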


## 6. Save Gap Analysis

In [8]:
# Save gap analysis
output_data = {
    'timestamp': datetime.now().isoformat(),
    'capability_comparison': comparison,
    'semantic_gap_assessment': semantic_gap_assessment,
    'insights': insights,
    'recommendations': recommendations,
    'conclusion': {
        'search_best_for': 'Entity resolution (vocabulary QA)',
        'graph_best_for': 'Semantic reasoning (pathway/disease queries)',
        'overall': capability_desc,
    },
}

save_json(output_data, OUTPUT_DIR / 'semantic_gap_analysis_v3.json')
print(f"\nGap analysis saved to: {OUTPUT_DIR / 'semantic_gap_analysis_v3.json'}")


Gap analysis saved to: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs/semantic_gap_analysis_v3.json
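
Before NB08 consumes the saved file, it can be worth reading it back and confirming that the expected top-level sections are present. The sketch below assumes that `save_json` writes a plain JSON object; `missing_top_level_keys` is a hypothetical helper, and the key names are taken from `output_data` in the cell above.

```python
import json
from pathlib import Path

# Top-level keys of output_data as built in the cell above.
EXPECTED_KEYS = {
    'timestamp', 'capability_comparison', 'semantic_gap_assessment',
    'insights', 'recommendations', 'conclusion',
}


def missing_top_level_keys(path):
    """Return the expected top-level keys that are absent from the saved JSON file."""
    with Path(path).open(encoding='utf-8') as fh:
        data = json.load(fh)
    return sorted(EXPECTED_KEYS - set(data))
```

An empty list from `missing_top_level_keys(OUTPUT_DIR / 'semantic_gap_analysis_v3.json')` would indicate that all six sections were written.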


## Summary

In [9]:
# Final summary
print("\n" + "="*60)
print("NOTEBOOK 07 COMPLETE")
print("="*60)

print(f"\nKey Findings:")
print(f"  - v2 Vocabulary QA EM: {100*v2_vocab_em:.1f}%")
print(f"  - v3 Semantic Search EM: {100*v3_search_em:.1f}%")
print(f"  - v3 Graph Recall: {100*v3_graph_recall:.1f}%")
print(f"  - Semantic Capability: {capability_level}")

print(f"\nConclusion: {capability_desc}")

print(f"\nNext step: NB08 - Integration Recommendations")


NOTEBOOK 07 COMPLETE

Key Findings:
  - v2 Vocabulary QA EM: 95.2%
  - v3 Semantic Search EM: 0.0%
  - v3 Graph Recall: 63.2%
  - Semantic Capability: HIGH

Conclusion: KRAKEN has robust semantic capability

Next step: NB08 - Integration Recommendations
