# Open Cure: Exploring Biomedical Knowledge Graphs

This notebook explores the downloaded knowledge graphs and demonstrates basic drug repurposing queries.

## Setup

First, make sure you've downloaded the knowledge graphs:
```bash
python scripts/download_graphs.py --all
```

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from pathlib import Path
from collections import Counter

# Project paths
PROJECT_ROOT = Path('.').resolve().parent
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

## 1. Explore DRKG (Drug Repurposing Knowledge Graph)

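DRKG contains roughly 97K entities (13 entity types) and 5.8M edges (107 relation types), drawn from six databases (DrugBank, Hetionet, GNBR, STRING, IntAct, DGIdb) plus data collected from COVID-19 publications.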

In [None]:
# Load DRKG
drkg_path = RAW_DIR / 'drkg' / 'drkg.tsv'

if drkg_path.exists():
    drkg = pd.read_csv(drkg_path, sep='\t', header=None, names=['head', 'relation', 'tail'])
    print(f"DRKG loaded: {len(drkg):,} edges")
    # A bare expression inside an if-block is not rendered, so display it explicitly
    display(drkg.head(10))
else:
    print(f"DRKG not found at {drkg_path}")
    print("Run: python scripts/download_graphs.py --drkg")

In [None]:
# Analyze entity types in DRKG
if 'drkg' in dir():
    head_types = drkg['head'].apply(lambda x: x.split('::')[0] if '::' in x else 'Unknown')
    tail_types = drkg['tail'].apply(lambda x: x.split('::')[0] if '::' in x else 'Unknown')
    
    all_types = pd.concat([head_types, tail_types])
    type_counts = all_types.value_counts()
    
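    # Counts are per edge endpoint, so a node is counted once for every edge it appears in
    print("Edge endpoints per entity type in DRKG:")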
    print(type_counts)
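
The counts above are per edge endpoint. To count distinct entities per type instead, the node IDs can be deduplicated first. The following is a minimal sketch on a toy edge list with made-up IDs (`C1`, `G1`, `D1` are illustrative, not real DRKG identifiers); it assumes every node ID has the form `Type::identifier`.

```python
import pandas as pd

# Toy edge list in DRKG's head/relation/tail layout (hypothetical IDs).
toy = pd.DataFrame(
    [
        ("Compound::C1", "treats", "Disease::D1"),
        ("Compound::C1", "binds", "Gene::G1"),
        ("Compound::C2", "binds", "Gene::G1"),
        ("Gene::G1", "associates", "Disease::D1"),
    ],
    columns=["head", "relation", "tail"],
)

# Unique node IDs across both endpoint columns.
nodes = pd.concat([toy["head"], toy["tail"]]).drop_duplicates()

# Vectorized prefix extraction: the text before the first '::'.
node_types = nodes.str.split("::", n=1).str[0]
unique_per_type = node_types.value_counts()
print(unique_per_type)
```

On the real graph, the same three lines can be applied to `drkg` in place of `toy`; the total should then be comparable to the entity count quoted for DRKG.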

In [None]:
# Analyze relation types
if 'drkg' in dir():
    relation_counts = drkg['relation'].value_counts()
    
    print(f"\nNumber of relation types: {len(relation_counts)}")
    print("\nTop 20 relations:")
    print(relation_counts.head(20))
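
Relation names in DRKG appear to carry their source database as a prefix before the first `::` (for example `Hetionet::CtD::Compound:Disease`). Assuming that pattern holds, edges can be grouped by source as in this small sketch, which uses a hand-written sample rather than the real file:

```python
import pandas as pd

# A few relation strings in the style used by DRKG (sample only).
relations = pd.Series(
    [
        "Hetionet::CtD::Compound:Disease",
        "Hetionet::CtD::Compound:Disease",
        "DRUGBANK::treats::Compound:Disease",
        "GNBR::T::Compound:Disease",
    ]
)

# Source database = text before the first '::'.
source = relations.str.split("::", n=1).str[0]
edges_per_source = source.value_counts()
print(edges_per_source)
```

Only the prefix is parsed here because the rest of the relation string is not guaranteed to follow one layout across all sources.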

## 2. Find Drug-Disease Connections

Let's look for existing drug-disease relationships in the graph.

In [None]:
# Find Compound-Disease edges
if 'drkg' in dir():
    drug_disease_edges = drkg[
        (drkg['head'].str.startswith('Compound::')) & 
        (drkg['tail'].str.startswith('Disease::'))
    ]
    
    print(f"Drug-Disease edges: {len(drug_disease_edges):,}")
    print("\nRelation types for Drug-Disease:")
    print(drug_disease_edges['relation'].value_counts())

In [None]:
# Look for a specific disease (e.g., Castleman disease, the focus of Fajgenbaum's research)
# Note: DRKG node IDs are typically database identifiers (e.g. 'Disease::MESH:...'),
# not disease names, so a name search like this one may return nothing.
if 'drkg' in dir():
    castleman_edges = drkg[
        drkg['head'].str.contains('castleman', case=False, regex=False) | 
        drkg['tail'].str.contains('castleman', case=False, regex=False)
    ]
    
    if len(castleman_edges) > 0:
        print(f"Found {len(castleman_edges)} edges related to Castleman disease:")
        display(castleman_edges.head(20))
    else:
        print("No Castleman disease entries found in DRKG")
        print("\nSearching for rare diseases...")
        # Same caveat: this only matches node IDs that literally contain 'rare'
        rare_disease = drkg[drkg['tail'].str.contains('rare', case=False, regex=False)]
        print(f"Found {len(rare_disease)} edges mentioning 'rare'")

## 3. Build and Explore the Unified Graph

Run the unified graph builder first:
```bash
python src/ingest/build_unified_graph.py
```

In [None]:
# Load unified graph
unified_nodes_path = PROCESSED_DIR / 'unified_nodes.csv'
unified_edges_path = PROCESSED_DIR / 'unified_edges.csv'

if unified_nodes_path.exists() and unified_edges_path.exists():
    unified_nodes = pd.read_csv(unified_nodes_path)
    unified_edges = pd.read_csv(unified_edges_path)
    
    print(f"Unified graph: {len(unified_nodes):,} nodes, {len(unified_edges):,} edges")
    print("\nNode types:")
    print(unified_nodes['type'].value_counts())
else:
    print("Unified graph not built yet.")
    print("Run: python src/ingest/build_unified_graph.py")
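
Once the edge table is loaded, it can be turned into a `networkx` graph for exploration. The sketch below uses a toy edge table; the column names `source`, `target`, and `relation` are assumptions for illustration, since the actual schema of `unified_edges.csv` is not shown here and may differ.

```python
import networkx as nx
import pandas as pd

# Toy edge table; column names are hypothetical, adjust to the real CSV header.
edges = pd.DataFrame(
    {
        "source": ["drug:A", "drug:A", "gene:X", "drug:B"],
        "target": ["gene:X", "disease:D", "disease:D", "gene:X"],
        "relation": ["targets", "treats", "associated_with", "targets"],
    }
)

# A MultiDiGraph keeps edge direction and allows parallel edges between
# the same pair of nodes, which knowledge graphs commonly have.
G = nx.from_pandas_edgelist(
    edges,
    source="source",
    target="target",
    edge_attr="relation",
    create_using=nx.MultiDiGraph,
)

n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
in_deg = dict(G.in_degree())
print(n_nodes, n_edges, in_deg)
```

For a graph with millions of edges, building the full `networkx` object can be slow and memory-hungry, so it may be preferable to filter the edge table to a subgraph of interest first.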

## 4. Simple Drug Repurposing Query

Find drugs that target genes associated with a disease.

In [None]:
def find_repurposing_candidates(disease_term, drkg_df, top_k=20):
    """
    Simple drug repurposing: find drugs that target genes associated with a disease.
    
    Logic:
    1. Find disease nodes whose ID contains `disease_term`
    2. Find genes linked to those diseases (either edge direction)
    3. Find drugs linked to those genes (either edge direction)
    4. Rank drugs by number of distinct shared gene targets
    
    Note: DRKG node IDs are typically database identifiers rather than names,
    so `disease_term` may need to be an ID fragment (e.g. a MeSH ID).
    """
    head, tail = drkg_df['head'], drkg_df['tail']
    
    # Disease nodes matching the term (literal, case-insensitive).
    # Only the matching nodes are kept, not the other endpoint of a matching edge.
    nodes = pd.concat([head, tail]).drop_duplicates()
    is_match = (
        nodes.str.startswith('Disease::')
        & nodes.str.contains(disease_term, case=False, regex=False)
    )
    disease_ids = set(nodes[is_match])
    
    if not disease_ids:
        print(f"No entries found for '{disease_term}'")
        return pd.DataFrame(columns=['drug', 'shared_gene_targets'])
    
    print(f"Found disease IDs: {disease_ids}")
    
    # Genes associated with the disease, in either edge direction
    disease_genes = set(head[tail.isin(disease_ids) & head.str.startswith('Gene::')])
    disease_genes |= set(tail[head.isin(disease_ids) & tail.str.startswith('Gene::')])
    
    print(f"Found {len(disease_genes)} genes associated with disease")
    
    if not disease_genes:
        return pd.DataFrame(columns=['drug', 'shared_gene_targets'])
    
    # Drug-gene pairs, in either edge direction
    fwd = drkg_df.loc[head.str.startswith('Compound::') & tail.isin(disease_genes), ['head', 'tail']]
    rev = drkg_df.loc[tail.str.startswith('Compound::') & head.isin(disease_genes), ['tail', 'head']]
    pairs = pd.concat([
        fwd.rename(columns={'head': 'drug', 'tail': 'gene'}),
        rev.rename(columns={'tail': 'drug', 'head': 'gene'}),
    ]).drop_duplicates()
    
    # Count distinct genes per drug; several relations to one gene count once
    drug_counts = pairs['drug'].value_counts().head(top_k)
    
    return pd.DataFrame({
        'drug': drug_counts.index,
        'shared_gene_targets': drug_counts.values
    })
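
To make the two-hop logic concrete, here is a self-contained toy illustration with made-up node IDs. It also shows why the ranking should count distinct genes rather than raw edges: a drug linked to the same gene by several relation types would otherwise be over-counted.

```python
import pandas as pd

# Toy graph (hypothetical IDs): D1 is associated with genes G1 and G2.
toy = pd.DataFrame(
    [
        ("Gene::G1", "assoc", "Disease::D1"),
        ("Gene::G2", "assoc", "Disease::D1"),
        ("Gene::G3", "assoc", "Disease::D2"),
        ("Compound::C1", "binds", "Gene::G1"),
        ("Compound::C1", "inhibits", "Gene::G1"),  # second relation, same gene
        ("Compound::C1", "binds", "Gene::G2"),
        ("Compound::C2", "binds", "Gene::G2"),
        ("Compound::C3", "binds", "Gene::G3"),
    ],
    columns=["head", "relation", "tail"],
)

# Hop 1: genes linked to the disease of interest.
disease_genes = set(toy.loc[toy["tail"] == "Disease::D1", "head"])

# Hop 2: compound -> gene edges that land on one of those genes.
hits = toy[toy["head"].str.startswith("Compound::") & toy["tail"].isin(disease_genes)]

# Rank by distinct genes per drug, and compare with the raw edge count.
ranking = hits.groupby("head")["tail"].nunique().sort_values(ascending=False)
raw_edge_counts = hits["head"].value_counts()
print(ranking)
print(raw_edge_counts)
```

Here `Compound::C1` shares two genes with the disease but has three matching edges, so the two counts diverge. `Compound::C3` only touches a gene of a different disease and is excluded.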

In [None]:
# Example: Find repurposing candidates for a disease
if 'drkg' in dir():
    # Try different diseases
    for disease in ['diabetes', 'alzheimer', 'parkinson']:
        print(f"\n{'='*60}")
        print(f"Repurposing candidates for: {disease}")
        print('='*60)
        candidates = find_repurposing_candidates(disease, drkg)
        if len(candidates) > 0:
            display(candidates.head(10))

## 5. Next Steps

This is just the beginning. The full pipeline would:

1. **Train embedding models** (TransE, RotatE) to learn latent representations
2. **Use GNNs** to capture multi-hop relationships
3. **Generate explanations** for predictions using path analysis
4. **Validate** predictions against clinical literature
5. **Focus on rare diseases** where data is sparse

See:
- `src/models/link_prediction.py` for model implementations
- `src/models/explainer.py` for explainability tools
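
As a rough idea of what an embedding model such as TransE scores, the sketch below computes the TransE plausibility of a triple as the negative distance between `head + relation` and `tail`. It is illustrative only: the vectors are hand-constructed rather than trained, and `transe_score` is a hypothetical helper, not the interface of `src/models/link_prediction.py`.

```python
import numpy as np

def transe_score(head, relation, tail):
    """TransE plausibility: higher (closer to 0) means a better fit."""
    return -np.linalg.norm(head + relation - tail)

rng = np.random.default_rng(0)
dim = 8

# Untrained, randomly drawn vectors standing in for learned embeddings.
h = rng.normal(size=dim)   # e.g. a drug
r = rng.normal(size=dim)   # e.g. a 'treats' relation

# A tail that satisfies h + r = t exactly, and one that is offset from it.
t_good = h + r
t_bad = h + r + np.ones(dim)

score_good = transe_score(h, r, t_good)
score_bad = transe_score(h, r, t_bad)
print(score_good, score_bad)
```

In a trained model, candidate diseases for a drug would be ranked by this score, with training pushing true triples toward zero distance and corrupted triples away from it.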

In [None]:
# Preview the model interface
from src.models import DrugDiseasePredictor, PredictionExplainer

print("DrugDiseasePredictor methods:")
print([m for m in dir(DrugDiseasePredictor) if not m.startswith('_')])

print("\nPredictionExplainer methods:")
print([m for m in dir(PredictionExplainer) if not m.startswith('_')])