# Notebook 02: Predicate & Relationship Mapping

## Objective

Discover and classify all predicates (relationship types) exposed by KRAKEN, separating semantic relationships from vocabulary-equivalency mappings.

**This notebook depends on NB01 finding that `/one-hop` works and returns edges.**

## Key Questions

1. Which predicates are available in KRAKEN?
2. How are they classified (semantic vs. equivalency)?
3. What percentage of the key Biolink predicates for metabolite reasoning does KRAKEN expose?
4. Are the available predicates useful for metabolite reasoning?

In [1]:
# Standard imports
import sys
import json
from pathlib import Path
from datetime import datetime
from collections import Counter, defaultdict

# Add project root to path
PROJECT_ROOT = Path.cwd().parents[1]
sys.path.insert(0, str(PROJECT_ROOT / 'src'))
sys.path.insert(0, str(Path.cwd()))

# Import utilities
from kg_o1_v3_utils import (
    test_one_hop, get_predicates, parse_one_hop_edges,
    classify_predicate, classify_all_predicates,
    hybrid_search, text_search,
    save_json, load_json,
    TEST_ENTITIES, SEMANTIC_PREDICATES, EQUIVALENCY_PREDICATES,
)

# Output directory
OUTPUT_DIR = Path.cwd() / 'outputs'
OUTPUT_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Output directory: {OUTPUT_DIR}")

Project root: /home/trentleslie/Insync/projects/biomapper2
Output directory: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs


## 1. Load NB01 Audit Results

Load NB01's GO/NO-GO decision and warn if it was anything other than GO.

In [2]:
# Load NB01 results
audit_file = OUTPUT_DIR / 'one_hop_api_audit.json'

if audit_file.exists():
    nb01_results = load_json(audit_file)
    decision = nb01_results.get('go_no_go_decision', {}).get('decision')
    print(f"NB01 Decision: {decision}")
    print(f"Reasoning: {nb01_results.get('go_no_go_decision', {}).get('reasoning')}")
    
    if decision != 'GO':
        print("\n" + "!"*60)
        print("WARNING: NB01 decision was not GO.")
        print("This notebook may not produce meaningful results.")
        print("!"*60)
else:
    print("WARNING: NB01 audit file not found. Run NB01 first.")
    nb01_results = None

NB01 Decision: GO
Reasoning: Found 15 semantic predicates with 99.0% semantic edges. Proceed with v3.
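
The cell above only prints a warning when the decision is not GO. If a hard stop is preferred, the gate could be written as a small helper. This is a minimal sketch, assuming the audit JSON has the `go_no_go_decision` structure read above; `require_go` is a hypothetical name and is not part of `kg_o1_v3_utils`.

```python
def require_go(audit):
    """Return NB01's reasoning string if the decision was GO; raise otherwise.

    `audit` is the dict loaded from one_hop_api_audit.json, or None if the
    file was missing. (Hypothetical helper, for illustration only.)
    """
    if audit is None:
        raise RuntimeError("NB01 audit file not found. Run NB01 first.")
    verdict = audit.get("go_no_go_decision", {})
    if verdict.get("decision") != "GO":
        raise RuntimeError(f"NB01 decision was {verdict.get('decision')!r}, not 'GO'.")
    return verdict.get("reasoning")


# Illustrative input that mirrors the structure used in this notebook.
example_audit = {"go_no_go_decision": {"decision": "GO", "reasoning": "example"}}
print(require_go(example_audit))
```

Raising instead of printing stops a "Run All" execution at this cell, so later cells never run against an API that NB01 judged unusable.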


## 2. Get All Predicates from API

In [3]:
# Get predicates from /predicates endpoint
print("="*60)
print("STEP 1: Fetching predicates from /predicates endpoint")
print("="*60)

api_predicates = get_predicates()

if api_predicates:
    print(f"\nTotal predicates from API: {len(api_predicates)}")
    print("\nSample predicates:")
    for p in api_predicates[:20]:
        print(f"  - {p}")
    if len(api_predicates) > 20:
        print(f"  ... and {len(api_predicates) - 20} more")
else:
    print("\n/predicates endpoint not available. Will discover from one-hop.")

STEP 1: Fetching predicates from /predicates endpoint

Total predicates from API: 87

Sample predicates:
  - biolink:actively_involved_in
  - biolink:affects
  - biolink:affects_likelihood_of
  - biolink:affects_response_to
  - biolink:ameliorates_condition
  - biolink:applied_to_treat
  - biolink:assesses
  - biolink:associated_with
  - biolink:associated_with_resistance_to
  - biolink:beneficial_in_models_for
  - biolink:binds
  - biolink:biomarker_for
  - biolink:broad_match
  - biolink:capable_of
  - biolink:catalyzes
  - biolink:causes
  - biolink:chemically_similar_to
  - biolink:close_match
  - biolink:coexists_with
  - biolink:colocalizes_with
  ... and 67 more


## 3. Sample Usage for Each Predicate

Sample one-hop edges from a set of seed entities and keep up to five example triples per predicate to understand its semantics.

In [4]:
# Extended seed entities for better coverage
EXTENDED_SEED_ENTITIES = [
    # Metabolites
    ("CHEBI:4167", "glucose"),
    ("CHEBI:15846", "NAD+"),
    ("CHEBI:16113", "cholesterol"),
    ("CHEBI:17234", "alanine"),
    ("CHEBI:30616", "ATP"),
    ("CHEBI:15377", "water"),
    ("CHEBI:17368", "ethanol"),
    ("CHEBI:16842", "fructose"),
    ("CHEBI:17754", "glycerol"),
    ("CHEBI:16199", "urea"),
    # Try other ID formats
    ("HMDB:HMDB0000122", "glucose (HMDB)"),
    ("KEGG.COMPOUND:C00031", "glucose (KEGG)"),
    ("DRUGBANK:DB00117", "L-histidine"),
]

print(f"Extended seed entities: {len(EXTENDED_SEED_ENTITIES)}")

Extended seed entities: 13


In [5]:
# Discover predicates by sampling one-hop from many entities
print("="*60)
print("STEP 2: Discovering predicates via one-hop sampling")
print("="*60)

discovered_predicates = Counter()
predicate_examples = defaultdict(list)

for entity_id, entity_name in EXTENDED_SEED_ENTITIES:
    result = test_one_hop(entity_id, direction="both", limit=50)
    
    # Use parse_one_hop_edges to correctly extract edges with predicates
    # (fixes predicate metadata loss bug)
    edges = parse_one_hop_edges(result)
    
    for edge in edges:
        pred = edge.get('predicate', 'unknown')
        discovered_predicates[pred] += 1
        
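        # Store example (up to 5 per predicate).
        # NOTE: with direction="both" the seed may be the object of the edge,
        # yet it is always recorded under subject_* here.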
        if len(predicate_examples[pred]) < 5:
            predicate_examples[pred].append({
                'subject_id': entity_id,
                'subject_name': entity_name,
                'object_id': edge.get('object_id', edge.get('end_node_id', '?')),
                'object_name': edge.get('object_name', edge.get('end_node_name', '?')),
                'object_category': edge.get('object_category', edge.get('category', '?')),
            })

print(f"\nDiscovered {len(discovered_predicates)} unique predicates")
print(f"Total edges examined: {sum(discovered_predicates.values())}")

STEP 2: Discovering predicates via one-hop sampling

Discovered 20 unique predicates
Total edges examined: 2494
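
Because the sampling cell queries with `direction="both"`, the seed entity can sit on either end of an edge. The sketch below shows one way to tally predicates while recording which role the seed played. It assumes each edge dict carries explicit `subject_id` and `object_id` fields, which is an assumption about the `parse_one_hop_edges` output rather than something confirmed here. `tally_edges`, the `seed_role` field, and the `EX:` identifiers are hypothetical.

```python
from collections import Counter, defaultdict

MAX_EXAMPLES = 5  # same cap as the sampling cell above


def tally_edges(seed_id, edges, counts, examples):
    """Count predicates and keep up to MAX_EXAMPLES oriented examples each."""
    for edge in edges:
        pred = edge.get("predicate", "unknown")
        counts[pred] += 1
        if len(examples[pred]) < MAX_EXAMPLES:
            # ASSUMPTION: each edge carries explicit subject_id/object_id fields.
            subject_id = edge.get("subject_id", seed_id)
            examples[pred].append({
                "subject_id": subject_id,
                "object_id": edge.get("object_id", "?"),
                "seed_role": "subject" if subject_id == seed_id else "object",
            })


counts, examples = Counter(), defaultdict(list)
mock_edges = [  # made-up edges with placeholder IDs, not real API output
    {"predicate": "biolink:treats", "subject_id": "CHEBI:4167", "object_id": "EX:disease"},
    {"predicate": "biolink:has_participant", "subject_id": "EX:process", "object_id": "CHEBI:4167"},
]
tally_edges("CHEBI:4167", mock_edges, counts, examples)
print(dict(counts))
print(examples["biolink:has_participant"][0]["seed_role"])
```

Keeping the seed's role alongside each example makes inverse predicates such as `biolink:has_participant` easier to read when the examples are printed later.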


## 4. Classify All Predicates

In [6]:
# Classify predicates
print("="*60)
print("STEP 3: Classifying predicates")
print("="*60)

# Guard against a missing /predicates response (api_predicates may be None or empty)
all_predicates = list(set((api_predicates or []) + list(discovered_predicates.keys())))
classification = classify_all_predicates(all_predicates)

# Count by classification
semantic_count = sum(discovered_predicates.get(p, 0) for p in classification['semantic'])
equiv_count = sum(discovered_predicates.get(p, 0) for p in classification['equivalency'])
unknown_count = sum(discovered_predicates.get(p, 0) for p in classification['unknown'])
total_edge_count = semantic_count + equiv_count + unknown_count

print(f"\n{'Classification':<20} {'Types':<10} {'Edges':<10} {'%':<10}")
print("-" * 50)
print(f"{'Semantic':<20} {len(classification['semantic']):<10} {semantic_count:<10} {100*semantic_count/max(total_edge_count,1):.1f}%")
print(f"{'Equivalency':<20} {len(classification['equivalency']):<10} {equiv_count:<10} {100*equiv_count/max(total_edge_count,1):.1f}%")
print(f"{'Unknown':<20} {len(classification['unknown']):<10} {unknown_count:<10} {100*unknown_count/max(total_edge_count,1):.1f}%")
print("-" * 50)
print(f"{'TOTAL':<20} {len(all_predicates):<10} {total_edge_count:<10}")

STEP 3: Classifying predicates

Classification       Types      Edges      %         
--------------------------------------------------
Semantic             26         2222       89.1%
Equivalency          1          8          0.3%
Unknown              61         264        10.6%
--------------------------------------------------
TOTAL                88         2494      
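
Note that the table mixes two denominators: the Types column counts every listed predicate, including those with zero sampled edges, while the Edges column counts only edges seen during sampling. The sketch below shows how per-class edge shares can be derived from per-predicate counts. The `SEMANTIC` and `EQUIVALENCY` sets are small illustrative stand-ins, not the sets defined in `kg_o1_v3_utils`, and `classify` and `edge_shares` are hypothetical names. The four counts are copied from the outputs in this notebook.

```python
# ASSUMPTION: illustrative predicate sets, not the ones in kg_o1_v3_utils.
SEMANTIC = {"biolink:treats", "biolink:has_part"}
EQUIVALENCY = {"biolink:same_as"}


def classify(pred):
    """Map a predicate to 'semantic', 'equivalency', or 'unknown'."""
    if pred in EQUIVALENCY:
        return "equivalency"
    if pred in SEMANTIC:
        return "semantic"
    return "unknown"


def edge_shares(edge_counts):
    """Aggregate per-predicate edge counts into per-class percentages."""
    totals = {"semantic": 0, "equivalency": 0, "unknown": 0}
    for pred, n in edge_counts.items():
        totals[classify(pred)] += n
    grand_total = sum(totals.values())
    return {cls: (100 * n / grand_total if grand_total else 0.0)
            for cls, n in totals.items()}


sample_counts = {
    "biolink:treats": 57,
    "biolink:has_part": 140,
    "biolink:same_as": 8,
    "biolink:interacts_with": 158,
}
shares = edge_shares(sample_counts)
for cls, pct in shares.items():
    print(f"{cls:<12} {pct:.1f}%")
```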


In [7]:
# Show semantic predicates with examples
print("\n" + "="*60)
print("SEMANTIC PREDICATES (useful for reasoning)")
print("="*60)

for pred in classification['semantic']:
    count = discovered_predicates.get(pred, 0)
    examples = predicate_examples.get(pred, [])
    
    print(f"\n{pred} ({count} edges)")
    for ex in examples[:2]:
        print(f"  {ex['subject_name']} --> {ex['object_name']} [{ex['object_category']}]")


SEMANTIC PREDICATES (useful for reasoning)

biolink:subclass_of (405 edges)
  glucose --> D-glucose []
  glucose --> D-glucose []

biolink:actively_involved_in (0 edges)

biolink:treats (57 edges)
  glucose --> hypoglycemia []
  glucose --> hypoglycemia []

biolink:coexists_with (0 edges)

biolink:correlated_with (0 edges)

biolink:associated_with (0 edges)

biolink:has_part (140 edges)
  glucose --> Glucose 50 MG/ML includes Injectable Solution & Injection []
  glucose --> Glucose 500 MG/ML includes Injection & Prefilled Syringe []

biolink:expressed_in (0 edges)

biolink:participates_in (0 edges)

biolink:close_match (115 edges)
  glucose --> aldehydo-L-glucose []
  glucose --> aldehydo-L-glucose []

biolink:capable_of (0 edges)

biolink:mentioned_in_clinical_trials_for (38 edges)
  glucose --> hypoglycemia []
  glucose --> hyperglycemia []

biolink:applied_to_treat (12 edges)
  glucose --> hypoglycemia []
  glucose --> type 2 diabetes mellitus []

biolink:related_to (862 edges)
  g

In [8]:
# Show equivalency predicates with examples
print("\n" + "="*60)
print("EQUIVALENCY PREDICATES (vocabulary mappings only)")
print("="*60)

for pred in classification['equivalency']:
    count = discovered_predicates.get(pred, 0)
    examples = predicate_examples.get(pred, [])
    
    print(f"\n{pred} ({count} edges)")
    for ex in examples[:2]:
        print(f"  {ex['subject_name']} --> {ex['object_id']}")


EQUIVALENCY PREDICATES (vocabulary mappings only)

biolink:same_as (8 edges)
  glucose --> CHEBI:4167
  alanine --> CHEBI:4167


In [9]:
# Show unknown predicates for manual classification
print("\n" + "="*60)
print("UNKNOWN PREDICATES (need manual classification)")
print("="*60)

unknown_sorted = sorted(
    classification['unknown'],
    key=lambda p: discovered_predicates.get(p, 0),
    reverse=True
)

for pred in unknown_sorted[:20]:
    count = discovered_predicates.get(pred, 0)
    examples = predicate_examples.get(pred, [])
    
    print(f"\n{pred} ({count} edges)")
    for ex in examples[:2]:
        print(f"  {ex['subject_name']} --> {ex['object_name']} [{ex['object_category']}]")


UNKNOWN PREDICATES (need manual classification)

biolink:interacts_with (158 edges)
  glucose --> GCK []
  NAD+ --> SIRT4 []

biolink:physically_interacts_with (102 edges)
  glucose --> GCK []
  glucose --> GCK []

biolink:produces (4 edges)
  glucose --> gluconeogenesis []
  alanine --> gluconeogenesis []

biolink:ameliorates_condition (0 edges)

biolink:gene_associated_with_condition (0 edges)

biolink:manifestation_of (0 edges)

biolink:regulates (0 edges)

biolink:composed_primarily_of (0 edges)

biolink:affects_likelihood_of (0 edges)

biolink:consumes (0 edges)

biolink:disrupts (0 edges)

biolink:associated_with_resistance_to (0 edges)

biolink:opposite_of (0 edges)

biolink:precedes (0 edges)

biolink:derives_into (0 edges)

biolink:develops_from (0 edges)

biolink:has_increased_amount (0 edges)

biolink:binds (0 edges)

biolink:is_sequence_variant_of (0 edges)

biolink:assesses (0 edges)


## 5. Compare with Biolink Predicate Hierarchy

Cross-reference against a curated list of 19 Biolink predicates relevant to metabolite reasoning to estimate coverage.

In [10]:
# Key Biolink predicates for metabolite/biochemistry reasoning
BIOLINK_METABOLITE_PREDICATES = {
    # Participation
    'biolink:participates_in': 'Entity participates in a process/pathway',
    'biolink:has_participant': 'Process has this entity as participant',
    'biolink:actively_involved_in': 'Active involvement in process',
    
    # Catalysis
    'biolink:catalyzes': 'Enzyme catalyzes reaction',
    'biolink:is_substrate_of': 'Entity is substrate of enzyme',
    'biolink:has_substrate': 'Enzyme has this substrate',
    'biolink:has_product': 'Reaction produces this entity',
    
    # Metabolism
    'biolink:has_metabolite': 'Entity has this metabolite',
    'biolink:metabolite_of': 'Entity is metabolite of',
    'biolink:affects': 'Entity affects another',
    
    # Transport
    'biolink:transports': 'Entity transports another',
    'biolink:transported_by': 'Entity is transported by',
    
    # Location
    'biolink:located_in': 'Entity is located in',
    'biolink:location_of': 'Entity is location of',
    
    # Structure
    'biolink:part_of': 'Entity is part of',
    'biolink:has_part': 'Entity has part',
    
    # Association
    'biolink:associated_with': 'General association',
    'biolink:correlated_with': 'Correlation relationship',
    'biolink:related_to': 'General relation',
}

print("="*60)
print("BIOLINK COVERAGE ANALYSIS")
print("="*60)
print(f"\nKey Biolink predicates for metabolite reasoning: {len(BIOLINK_METABOLITE_PREDICATES)}")

# Check which are available in KRAKEN
available = []
missing = []

# Case-insensitive lookup set, built once instead of once per predicate
all_predicates_lower = {p.lower() for p in all_predicates}

for pred, desc in BIOLINK_METABOLITE_PREDICATES.items():
    # Check both with and without biolink prefix
    variants = [pred, pred.replace('biolink:', '')]
    
    found = any(v.lower() in all_predicates_lower for v in variants)
    
    if found:
        count = discovered_predicates.get(pred, 0)
        available.append((pred, desc, count))
    else:
        missing.append((pred, desc))

print(f"\nAvailable in KRAKEN: {len(available)}/{len(BIOLINK_METABOLITE_PREDICATES)} ({100*len(available)/len(BIOLINK_METABOLITE_PREDICATES):.0f}%)")
print("\n  Available:")
for pred, desc, count in available:
    print(f"    [{count:3d}] {pred}: {desc}")

print(f"\n  Missing ({len(missing)}):")
for pred, desc in missing:
    print(f"    [ - ] {pred}: {desc}")

BIOLINK COVERAGE ANALYSIS

Key Biolink predicates for metabolite reasoning: 19

Available in KRAKEN: 11/19 (58%)

  Available:
    [  0] biolink:participates_in: Entity participates in a process/pathway
    [336] biolink:has_participant: Process has this entity as participant
    [  0] biolink:actively_involved_in: Active involvement in process
    [  0] biolink:catalyzes: Enzyme catalyzes reaction
    [  0] biolink:has_metabolite: Entity has this metabolite
    [  9] biolink:affects: Entity affects another
    [  1] biolink:located_in: Entity is located in
    [140] biolink:has_part: Entity has part
    [  0] biolink:associated_with: General association
    [  0] biolink:correlated_with: Correlation relationship
    [862] biolink:related_to: General relation

  Missing (8):
    [ - ] biolink:is_substrate_of: Entity is substrate of enzyme
    [ - ] biolink:has_substrate: Enzyme has this substrate
    [ - ] biolink:has_product: Reaction produces this entity
    [ - ] biolink:metabolite_
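
The coverage check above compares predicates with and without the `biolink:` prefix, ignoring case. The same idea can be expressed by normalizing both sides once and comparing sets. This is a minimal sketch; `normalize` and `coverage` are hypothetical helpers, and the two input lists are small made-up examples rather than KRAKEN output.

```python
def normalize(pred):
    """Lowercase a predicate and strip an optional 'biolink:' prefix."""
    pred = pred.strip().lower()
    prefix = "biolink:"
    return pred[len(prefix):] if pred.startswith(prefix) else pred


def coverage(wanted, available):
    """Split `wanted` into found/missing against `available`; return the percentage found."""
    available_norm = {normalize(p) for p in available}
    found = [p for p in wanted if normalize(p) in available_norm]
    missing = [p for p in wanted if normalize(p) not in available_norm]
    pct = 100 * len(found) / len(wanted) if wanted else 0.0
    return found, missing, pct


wanted = ["biolink:has_part", "biolink:is_substrate_of", "biolink:related_to"]
listed = ["biolink:has_part", "related_to", "biolink:treats"]
found, missing, pct = coverage(wanted, listed)
print(found, missing, f"{pct:.0f}%")
```

Normalizing once keeps the membership test constant-time per predicate and puts the prefix and case handling in a single place.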

## 6. Predicate Utility Assessment

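Assess whether the available predicates are useful for the types of questions we want to answer. A question type is marked SUPPORTED when at least one of its required predicates is listed by KRAKEN, even if no sampled edge used that predicate (shown as 0 edges).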

In [11]:
# Question types and required predicates
QUESTION_TYPES = {
    'pathway_participation': {
        'example': 'What pathway does glucose participate in?',
        'required_predicates': ['participates_in', 'biolink:participates_in', 'in_pathway'],
    },
    'enzyme_substrate': {
        'example': 'What enzyme catalyzes the phosphorylation of glucose?',
        'required_predicates': ['catalyzes', 'is_substrate_of', 'biolink:catalyzes'],
    },
    'disease_association': {
        'example': 'What diseases are associated with high cholesterol?',
        'required_predicates': ['associated_with', 'biolink:associated_with', 'related_to'],
    },
    'drug_target': {
        'example': 'What drugs target glucose metabolism?',
        'required_predicates': ['treats', 'affects', 'biolink:treats', 'biolink:affects'],
    },
    'metabolite_product': {
        'example': 'What is produced from glucose in glycolysis?',
        'required_predicates': ['has_metabolite', 'produces', 'biolink:has_metabolite'],
    },
}

print("="*60)
print("QUESTION TYPE SUPPORT ANALYSIS")
print("="*60)

for qtype, info in QUESTION_TYPES.items():
    # Check if any required predicate is available
    available_for_qtype = []
    for req_pred in info['required_predicates']:
        if req_pred in all_predicates or req_pred.lower() in [p.lower() for p in all_predicates]:
            count = discovered_predicates.get(req_pred, 0)
            available_for_qtype.append((req_pred, count))
    
    status = "SUPPORTED" if available_for_qtype else "NOT SUPPORTED"
    marker = "OK" if available_for_qtype else "XX"
    
    print(f"\n[{marker}] {qtype}")
    print(f"    Example: {info['example']}")
    print(f"    Status: {status}")
    if available_for_qtype:
        for pred, count in available_for_qtype:
            print(f"      - {pred} ({count} edges)")

QUESTION TYPE SUPPORT ANALYSIS

[OK] pathway_participation
    Example: What pathway does glucose participate in?
    Status: SUPPORTED
      - biolink:participates_in (0 edges)

[OK] enzyme_substrate
    Example: What enzyme catalyzes the phosphorylation of glucose?
    Status: SUPPORTED
      - biolink:catalyzes (0 edges)

[OK] disease_association
    Example: What diseases are associated with high cholesterol?
    Status: SUPPORTED
      - biolink:associated_with (0 edges)

[OK] drug_target
    Example: What drugs target glucose metabolism?
    Status: SUPPORTED
      - biolink:treats (57 edges)
      - biolink:affects (9 edges)

[OK] metabolite_product
    Example: What is produced from glucose in glycolysis?
    Status: SUPPORTED
      - biolink:has_metabolite (0 edges)
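
Several question types above are reported as SUPPORTED on the basis of predicates with 0 sampled edges. A stricter reading would separate predicates that are merely listed from those actually observed in the sample. The sketch below is one possible way to do that, not the notebook's method; `support_level` and its three labels are hypothetical, and the edge counts are copied from the output above.

```python
def support_level(required, listed, edge_counts):
    """Return 'observed', 'listed_only', or 'unsupported' for one question type."""
    listed_set = set(listed)
    hits = [p for p in required if p in listed_set]
    if not hits:
        return "unsupported"
    if any(edge_counts.get(p, 0) > 0 for p in hits):
        return "observed"
    return "listed_only"


# Counts taken from the output above; predicates absent from the dict had 0 edges.
listed = ["biolink:participates_in", "biolink:treats", "biolink:affects"]
edge_counts = {"biolink:treats": 57, "biolink:affects": 9}

print(support_level(["biolink:treats", "biolink:affects"], listed, edge_counts))
print(support_level(["biolink:participates_in"], listed, edge_counts))
print(support_level(["biolink:is_substrate_of"], listed, edge_counts))
```

A zero count only means the seed sample did not surface the predicate, so `listed_only` is better read as "unverified" than as "absent".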


## 7. Save Predicate Mapping

In [12]:
# Compile predicate mapping
predicate_mapping = {
    'timestamp': datetime.now().isoformat(),
    'summary': {
        'total_predicates': len(all_predicates),
        'semantic_predicates': len(classification['semantic']),
        'equivalency_predicates': len(classification['equivalency']),
        'unknown_predicates': len(classification['unknown']),
        'total_edges_sampled': sum(discovered_predicates.values()),
        'semantic_edge_percent': 100 * semantic_count / max(total_edge_count, 1),
        'biolink_coverage': 100 * len(available) / len(BIOLINK_METABOLITE_PREDICATES),
    },
    'classification': {
        'semantic': classification['semantic'],
        'equivalency': classification['equivalency'],
        'unknown': classification['unknown'],
    },
    'predicate_counts': dict(discovered_predicates),
    'predicate_examples': {
        pred: examples for pred, examples in predicate_examples.items()
    },
    'biolink_analysis': {
        'available': [(p, d, c) for p, d, c in available],
        'missing': [(p, d) for p, d in missing],
    },
    'question_type_support': {
        qtype: {
            'supported': any(p in all_predicates or p.lower() in [x.lower() for x in all_predicates]
                             for p in info['required_predicates']),
            'example': info['example'],
        }
        for qtype, info in QUESTION_TYPES.items()
    },
}

save_json(predicate_mapping, OUTPUT_DIR / 'predicate_mapping.json')
print(f"\nPredicate mapping saved to: {OUTPUT_DIR / 'predicate_mapping.json'}")


Predicate mapping saved to: /home/trentleslie/Insync/projects/biomapper2/notebooks/kg_o1_v3/outputs/predicate_mapping.json
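
One detail to keep in mind when a later notebook reloads this file: `biolink_analysis` stores tuples, and JSON has no tuple type. The sketch below uses the standard `json` module to show the round trip. It assumes `save_json` and `load_json` are built on `json`, which is not confirmed here.

```python
import json

# One entry in the same (predicate, description, count) shape as `available`.
available_example = [("biolink:has_part", "Entity has part", 140)]

# Serialize and parse again, as a save followed by a load would do.
restored = json.loads(json.dumps({"available": available_example}))["available"]

print(type(restored[0]).__name__)
pred, desc, count = restored[0]  # unpacking works the same on a list
print(pred, count)
```

Code that consumes the file should therefore unpack entries or index them rather than compare them against tuples.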


## Summary

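This notebook mapped the predicates available in KRAKEN (from the `/predicates` endpoint and from one-hop sampling), classified them as semantic, equivalency, or unknown, and assessed their utility for semantic reasoning.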

In [13]:
# Final summary
print("\n" + "="*60)
print("NOTEBOOK 02 COMPLETE")
print("="*60)

print("\nKey Findings:")
print(f"  - Total predicates: {len(all_predicates)}")
print(f"  - Semantic predicates: {len(classification['semantic'])} ({100*semantic_count/max(total_edge_count,1):.1f}% of edges)")
print(f"  - Biolink coverage: {100*len(available)/len(BIOLINK_METABOLITE_PREDICATES):.0f}%")

# Assessment
if len(classification['semantic']) >= 5 and semantic_count > 0:
    print("\nAssessment: KRAKEN has semantic capability. Proceed to NB03.")
elif len(classification['semantic']) > 0:
    print("\nAssessment: Limited semantic capability. May need focused approach.")
else:
    print("\nAssessment: No semantic predicates found. v3 approach not viable.")

print("\nNext step: NB03 - Semantic Subgraph Extraction")


NOTEBOOK 02 COMPLETE

Key Findings:
  - Total predicates: 88
  - Semantic predicates: 26 (89.1% of edges)
  - Biolink coverage: 58%

Assessment: KRAKEN has semantic capability. Proceed to NB03.

Next step: NB03 - Semantic Subgraph Extraction
