# on2vec: Ontology Embeddings Demo

This notebook demonstrates how to use the on2vec toolkit to generate embeddings from OWL ontologies.

We'll use the Cardiovascular Disease Ontology (CVDO) as an example. CVDO is a structured vocabulary for cardiovascular disease concepts, risk factors, and related terms.

## Overview

1. **Download**: Get the Cardiovascular Disease Ontology OWL file
2. **Train**: Create a GNN model from the ontology structure 
3. **Embed**: Generate embedding vectors for all concepts
4. **Analyze**: Explore the embeddings with metadata and vector operations
5. **Export**: Convert to different formats for downstream analysis

## Setup

First, let's install the notebook dependencies and import required modules:

In [1]:
# Install notebook dependencies if needed
import sys
import subprocess

def install_if_missing(package):
    try:
        __import__(package)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])

install_if_missing("requests")
install_if_missing("IPython")

In [2]:
import os
import sys
import requests
import numpy as np
from pathlib import Path
from IPython.display import display, Markdown, HTML

# Add the on2vec package to the path
sys.path.insert(0, '.')

# Import on2vec modules
from on2vec import (
    train_ontology_embeddings,
    embed_ontology_with_model,
    inspect_parquet_metadata,
    load_embeddings_as_dataframe,
    convert_parquet_to_csv,
    add_embedding_vectors,
    subtract_embedding_vectors,
    get_embedding_vector
)

print("✅ Imports successful!")

✅ Imports successful!


## Step 1: Download the Cardiovascular Disease Ontology

Let's download the CVDO OWL file from its OBO Library PURL. The helper below streams the file to disk and skips the download if a local copy already exists:

In [3]:
def download_owl_file(url, filename):
    """Download an OWL file from URL with progress indication."""
    if os.path.exists(filename):
        print(f"📁 {filename} already exists (size: {os.path.getsize(filename):,} bytes)")
        return filename
    
    print(f"⬇️  Downloading {filename} from {url}")
    
    try:
        # A timeout keeps the cell from hanging indefinitely on a stalled connection
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()
        
        total_size = int(response.headers.get('content-length', 0))
        
        with open(filename, 'wb') as f:
            downloaded = 0
            for chunk in response.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    downloaded += len(chunk)
                    if total_size > 0:
                        percent = (downloaded / total_size) * 100
                        print(f"\rProgress: {percent:.1f}% ({downloaded:,}/{total_size:,} bytes)", end="")
        
        print(f"\n✅ Downloaded {filename} ({os.path.getsize(filename):,} bytes)")
        return filename
        
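    except Exception as e:
        print(f"❌ Error downloading {filename}: {e}")
        # Remove any partial file so a re-run does not mistake it for a complete download
        if os.path.exists(filename):
            os.remove(filename)
        return None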

# Download the Cardiovascular Disease Ontology
owl_url = "http://purl.obolibrary.org/obo/cvdo.owl"
owl_file = "cvdo.owl"

downloaded_file = download_owl_file(owl_url, owl_file)

if downloaded_file:
    display(Markdown(f"**📊 Cardiovascular Disease Ontology downloaded:** `{downloaded_file}`"))
    display(Markdown(f"**📏 File size:** {os.path.getsize(downloaded_file):,} bytes ({os.path.getsize(downloaded_file)/(1024*1024):.1f} MB)"))
else:
    display(Markdown("❌ **Failed to download ontology file!**"))

📁 cvdo.owl already exists (size: 1,339,226 bytes)


**📊 Cardiovascular Disease Ontology downloaded:** `cvdo.owl`

**📏 File size:** 1,339,226 bytes (1.3 MB)

## Step 2: Train a Model

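Now let's train a Graph Neural Network on the cardiovascular disease ontology structure. To keep the demo fast, we use a small configuration: a GCN with 32 hidden units, 16-dimensional output embeddings, and 20 epochs of training with a triplet loss: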

In [4]:
if downloaded_file:
    print("🚀 Training GNN model on Cardiovascular Disease Ontology...")
    print("This may take a few minutes depending on ontology size.")
    
    try:
        # Train the model using the high-level function
        training_result = train_ontology_embeddings(
            owl_file=downloaded_file,
            model_output='cvdo_model.pt',
            model_type='gcn',
            hidden_dim=32,      # Smaller for demo
            out_dim=16,         # 16-dimensional embeddings
            epochs=20,          # Fewer epochs for demo
            loss_fn_name='triplet'
        )
        
        display(Markdown(f"### ✅ Training Complete!"))
        display(Markdown(f"**Model saved to:** `{training_result['model_path']}`"))
        display(Markdown(f"**Training time:** {training_result.get('training_time', 'N/A')} seconds"))
        
        # Display model info
        checkpoint = training_result.get('checkpoint', {})
        model_info = checkpoint.get('model_config', {})
        
        info_html = f"""
        <div style="background-color: #f0f8ff; padding: 15px; border-radius: 5px; border-left: 4px solid #0066cc;">
        <h4>🤖 Model Configuration</h4>
        <ul>
            <li><strong>Architecture:</strong> {model_info.get('model_type', 'GCN').upper()}</li>
            <li><strong>Hidden Dimensions:</strong> {model_info.get('hidden_dim', 32)}</li>
            <li><strong>Output Dimensions:</strong> {model_info.get('out_dim', 16)}</li>
            <li><strong>Loss Function:</strong> {model_info.get('loss_function', 'triplet')}</li>
            <li><strong>Epochs:</strong> {model_info.get('epochs', 20)}</li>
        </ul>
        </div>
        """
        display(HTML(info_html))
        
        model_file = training_result['model_path']
        
    except Exception as e:
        display(Markdown(f"❌ **Training failed:** {e}"))
        model_file = None
else:
    display(Markdown("⏭️ **Skipping training** - no ontology file available"))
    model_file = None

🚀 Training GNN model on Cardiovascular Disease Ontology...
This may take a few minutes depending on ontology size.


### ✅ Training Complete!

**Model saved to:** `cvdo_model.pt`

**Training time:** N/A seconds
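
The `loss_fn_name='triplet'` setting used above selects a triplet-style objective. As a rough illustration only, here is a standard triplet margin loss on plain NumPy vectors. How on2vec samples anchors, positives, and negatives is not shown in this notebook, so the function name `triplet_margin_loss`, the margin value, and the toy vectors are all assumptions:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to the related concept
    d_neg = np.linalg.norm(anchor - negative)   # distance to the unrelated concept
    return float(max(0.0, d_pos - d_neg + margin))

# Toy 2-D vectors (hypothetical, not taken from the trained model)
anchor = np.array([0.0, 0.0])
positive = np.array([0.0, 1.0])   # distance 1 from the anchor
negative = np.array([3.0, 4.0])   # distance 5 from the anchor

loss_easy = triplet_margin_loss(anchor, positive, negative)  # positive is already closer
loss_hard = triplet_margin_loss(anchor, negative, positive)  # roles swapped: penalized
print(loss_easy, loss_hard)
```

The loss is zero once the related concept is closer to the anchor than the unrelated one by at least the margin, which is what pushes connected ontology terms together in the embedding space.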

## Step 3: Generate Embeddings

With our trained model, let's generate embeddings for all concepts in the ontology:

In [5]:
if model_file and downloaded_file:
    print("📊 Generating embeddings for all concepts...")
    
    embedding_file = "cvdo_embeddings.parquet"
    
    try:
        # Generate embeddings using the trained model
        embedding_result = embed_ontology_with_model(
            model_path=model_file,
            owl_file=downloaded_file,
            output_file=embedding_file
        )
        
        display(Markdown(f"### ✅ Embeddings Generated!"))
        display(Markdown(f"**File:** `{embedding_file}`"))
        display(Markdown(f"**Embeddings:** {len(embedding_result['node_ids']):,} concept vectors"))
        display(Markdown(f"**Dimensions:** {embedding_result['embeddings'].shape[1]} per vector"))
        
        # Display alignment info
        alignment = embedding_result.get('alignment_info', {})
        
        alignment_html = f"""
        <div style="background-color: #f0fff0; padding: 15px; border-radius: 5px; border-left: 4px solid #00cc66;">
        <h4>🔗 Ontology Alignment</h4>
        <ul>
            <li><strong>Aligned Classes:</strong> {alignment.get('aligned_classes', 0):,}</li>
            <li><strong>Total Classes:</strong> {alignment.get('total_classes', 0):,}</li>
            <li><strong>Alignment Ratio:</strong> {alignment.get('alignment_ratio', 0):.1%}</li>
        </ul>
        </div>
        """
        display(HTML(alignment_html))
        
    except Exception as e:
        display(Markdown(f"❌ **Embedding generation failed:** {e}"))
        embedding_file = None
else:
    display(Markdown("⏭️ **Skipping embedding generation** - no trained model available"))
    embedding_file = None

📊 Generating embeddings for all concepts...


### ✅ Embeddings Generated!

**File:** `cvdo_embeddings.parquet`

**Embeddings:** 720 concept vectors

**Dimensions:** 16 per vector

## Step 4: Inspect Embeddings Metadata

Let's examine the metadata stored in our embedding file:

In [6]:
if embedding_file and os.path.exists(embedding_file):
    print("🔍 Inspecting embedding file metadata:")
    print("=" * 50)
    
    # Use our inspect function to show metadata
    metadata = inspect_parquet_metadata(embedding_file)
    
    # Also show file size info
    file_size = os.path.getsize(embedding_file)
    display(Markdown(f"**💾 File Size:** {file_size:,} bytes ({file_size/(1024*1024):.2f} MB)"))
else:
    display(Markdown("⏭️ **No embedding file to inspect**"))

🔍 Inspecting embedding file metadata:

📊 Embedding File: cvdo_embeddings.parquet
📈 Embeddings: 720 vectors
📐 Dimensions: 16

🏷️  Metadata:
------------------------------
📄 Source Ontology: cvdo.owl
⏰ Generated: 2025-09-17 11:26:34
🤖 Model: GCN
   └─ Hidden: 32, Output: 16
🔗 Alignment: 720/0 classes
   └─ Ratio: 100.0%
📊 Source Size: 1.3 MB

📋 Additional Metadata:
   timestamp: 2025-09-17T11:26:34.210683
   embedding_dimension: 16
   num_embeddings: 720



**💾 File Size:** 98,314 bytes (0.09 MB)

## Step 5: Load and Explore Embeddings

Let's load the embeddings as a DataFrame for analysis:

In [7]:
if embedding_file and os.path.exists(embedding_file):
    # Load embeddings as DataFrame
    df, metadata = load_embeddings_as_dataframe(embedding_file, return_metadata=True)
    
    display(Markdown(f"### 📊 DataFrame Overview"))
    display(Markdown(f"**Shape:** {df.shape[0]:,} rows × {df.shape[1]} columns"))
    
    # Show first few rows
    display(Markdown("### 🔎 First 10 Concept IDs:"))
    
    for i, node_id in enumerate(df['node_id'].head(10).to_list(), 1):
        # Make URIs more readable by showing just the end part
        short_id = node_id.split('/')[-1] if '/' in node_id else node_id
        display(Markdown(f"**{i:2d}.** `{short_id}` → `{node_id}`"))
    
    if len(df) > 10:
        display(Markdown(f"... and {len(df) - 10:,} more concepts"))
    
    # Show embedding statistics
    sample_embedding = np.array(df['embedding'][0])
    display(Markdown(f"### 📈 Embedding Statistics"))
    display(Markdown(f"**Vector dimensions:** {len(sample_embedding)}"))
    display(Markdown(f"**Sample vector range:** [{sample_embedding.min():.3f}, {sample_embedding.max():.3f}]"))
    display(Markdown(f"**Sample vector mean:** {sample_embedding.mean():.3f}"))
    
else:
    display(Markdown("⏭️ **No embedding file to analyze**"))

### 📊 DataFrame Overview

**Shape:** 720 rows × 2 columns

### 🔎 First 10 Concept IDs:

** 1.** `BFO_0000020` → `http://purl.obolibrary.org/obo/BFO_0000020`

** 2.** `BFO_0000006` → `http://purl.obolibrary.org/obo/BFO_0000006`

** 3.** `BFO_0000004` → `http://purl.obolibrary.org/obo/BFO_0000004`

** 4.** `BFO_0000017` → `http://purl.obolibrary.org/obo/BFO_0000017`

** 5.** `BFO_0000015` → `http://purl.obolibrary.org/obo/BFO_0000015`

** 6.** `BFO_0000002` → `http://purl.obolibrary.org/obo/BFO_0000002`

** 7.** `BFO_0000031` → `http://purl.obolibrary.org/obo/BFO_0000031`

** 8.** `BFO_0000035` → `http://purl.obolibrary.org/obo/BFO_0000035`

** 9.** `BFO_0000029` → `http://purl.obolibrary.org/obo/BFO_0000029`

**10.** `BFO_0000040` → `http://purl.obolibrary.org/obo/BFO_0000040`

... and 710 more concepts

### 📈 Embedding Statistics

**Vector dimensions:** 16

**Sample vector range:** [-0.928, 0.873]

**Sample vector mean:** -0.033
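
The statistics above describe a single vector. A minimal sketch for summarizing all vectors at once follows, assuming the `embedding` column holds one equal-length list of floats per concept (as the 720 × 2 shape and the 16-value sample suggest). The three toy rows are made up for illustration and are not taken from the file:

```python
import numpy as np

# Stand-in for the `embedding` column: one list of floats per concept (toy values)
embedding_rows = [
    [0.5, -1.0, 0.25],
    [1.5, 0.0, -0.25],
    [-2.0, 1.0, 0.0],
]

matrix = np.asarray(embedding_rows, dtype=np.float32)  # shape: (n_concepts, n_dims)
per_dim_mean = matrix.mean(axis=0)                     # mean of each dimension
norms = np.linalg.norm(matrix, axis=1)                 # L2 norm of each concept vector

print(matrix.shape)
print(per_dim_mean)
print(norms)
```

With the real data, `embedding_rows` would presumably be `df['embedding'].to_list()`, mirroring how the notebook reads `node_id`.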

## Step 6: Vector Operations

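on2vec includes helpers that add, subtract, and retrieve embedding vectors by concept ID, reading them directly from the Parquet file. Let's apply them to the first three concepts: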

In [8]:
if embedding_file and os.path.exists(embedding_file):
    # Get first few concept IDs for demonstration
    df = load_embeddings_as_dataframe(embedding_file)
    concept_ids = df['node_id'].head(5).to_list()
    
    if len(concept_ids) >= 3:
        concept1 = concept_ids[0]
        concept2 = concept_ids[1]
        concept3 = concept_ids[2]
        
        display(Markdown("### ➕ Vector Addition"))
        display(Markdown(f"Computing: `{concept1.split('/')[-1]}` + `{concept2.split('/')[-1]}`"))
        
        # Add two vectors
        try:
            sum_vector = add_embedding_vectors(embedding_file, concept1, embedding_file, concept2)
            display(Markdown(f"**Result:** {len(sum_vector)}-dimensional vector"))
            display(Markdown(f"**Range:** [{sum_vector.min():.3f}, {sum_vector.max():.3f}]"))
            display(Markdown(f"**Mean:** {sum_vector.mean():.3f}"))
        except Exception as e:
            display(Markdown(f"❌ Addition failed: {e}"))
        
        display(Markdown("### ➖ Vector Subtraction"))
        display(Markdown(f"Computing: `{concept1.split('/')[-1]}` - `{concept3.split('/')[-1]}`"))
        
        # Subtract two vectors
        try:
            diff_vector = subtract_embedding_vectors(embedding_file, concept1, embedding_file, concept3)
            display(Markdown(f"**Result:** {len(diff_vector)}-dimensional vector"))
            display(Markdown(f"**Range:** [{diff_vector.min():.3f}, {diff_vector.max():.3f}]"))
            display(Markdown(f"**Mean:** {diff_vector.mean():.3f}"))
        except Exception as e:
            display(Markdown(f"❌ Subtraction failed: {e}"))
        
        display(Markdown("### 🎯 Individual Vector Retrieval"))
        display(Markdown(f"Getting embedding for: `{concept1.split('/')[-1]}`"))
        
        try:
            vector = get_embedding_vector(embedding_file, concept1)
            display(Markdown(f"**Dimensions:** {len(vector)}"))
            display(Markdown(f"**First 5 values:** {vector[:5].tolist()}"))
            display(Markdown(f"**Last 5 values:** {vector[-5:].tolist()}"))
        except Exception as e:
            display(Markdown(f"❌ Vector retrieval failed: {e}"))
    
    else:
        display(Markdown("⚠️ **Not enough concepts for vector operations demo**"))
else:
    display(Markdown("⏭️ **No embedding file for vector operations**"))

### ➕ Vector Addition

Computing: `BFO_0000020` + `BFO_0000006`

**Result:** 16-dimensional vector

**Range:** [-2.032, 1.828]

**Mean:** 0.157

### ➖ Vector Subtraction

Computing: `BFO_0000020` - `BFO_0000004`

**Result:** 16-dimensional vector

**Range:** [-0.458, 0.328]

**Mean:** -0.116

### 🎯 Individual Vector Retrieval

Getting embedding for: `BFO_0000020`

**Dimensions:** 16

**First 5 values:** [-0.6471280455589294, 0.5326902866363525, -0.0022261682897806168, 0.2935771942138672, -0.702740490436554]

**Last 5 values:** [-0.0542389415204525, 0.8728678226470947, 0.09189876168966293, 0.454987108707428, 0.06532125920057297]
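
The addition result is consistent with plain element-wise arithmetic. As a sanity check (assuming, not confirming, that `add_embedding_vectors` adds the two vectors component by component), the first two components of `BFO_0000020` (printed above) and of `BFO_0000006` (from the CSV preview in Step 7) can be combined by hand:

```python
import numpy as np

# First two components only, copied from the outputs printed in this notebook
bfo_20 = np.array([-0.6471280455589294, 0.5326902866363525])   # BFO_0000020
bfo_06 = np.array([-1.3850938081741333, 0.7696705460548401])   # BFO_0000006

total = bfo_20 + bfo_06        # element-wise sum
difference = bfo_20 - bfo_06   # element-wise difference

print(np.round(total, 3))
print(np.round(difference, 3))
```

The first component of `total` rounds to -2.032, which matches the lower bound of the range reported for the addition above.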

## Step 7: Format Conversion

Finally, let's convert our embeddings to CSV format for use with other tools:

In [9]:
if embedding_file and os.path.exists(embedding_file):
    display(Markdown("### 📁 Converting to CSV Format"))
    
    try:
        csv_file = convert_parquet_to_csv(embedding_file)
        
        csv_size = os.path.getsize(csv_file)
        parquet_size = os.path.getsize(embedding_file)
        
        display(Markdown(f"✅ **Conversion complete!**"))
        display(Markdown(f"**CSV file:** `{csv_file}`"))
        display(Markdown(f"**CSV size:** {csv_size:,} bytes ({csv_size/(1024*1024):.2f} MB)"))
        display(Markdown(f"**Parquet size:** {parquet_size:,} bytes ({parquet_size/(1024*1024):.2f} MB)"))
        display(Markdown(f"**Size ratio:** CSV is {csv_size/parquet_size:.1f}x larger than Parquet"))
        
        # Show CSV preview
        display(Markdown("### 👀 CSV Preview (first 3 lines):"))
        
        try:
            with open(csv_file, 'r') as f:
                lines = [f.readline().strip() for _ in range(3)]
            
            for i, line in enumerate(lines, 1):
                # Truncate long lines for display
                display_line = line[:100] + "..." if len(line) > 100 else line
                display(Markdown(f"**Line {i}:** `{display_line}`"))
                
        except Exception as e:
            display(Markdown(f"⚠️ Could not read CSV preview: {e}"))
    
    except Exception as e:
        display(Markdown(f"❌ **CSV conversion failed:** {e}"))
else:
    display(Markdown("⏭️ **No embedding file to convert**"))

### 📁 Converting to CSV Format

✅ **Conversion complete!**

**CSV file:** `cvdo_embeddings.csv`

**CSV size:** 261,784 bytes (0.25 MB)

**Parquet size:** 98,314 bytes (0.09 MB)

**Size ratio:** CSV is 2.7x larger than Parquet

### 👀 CSV Preview (first 3 lines):

**Line 1:** `node_id,dim_0,dim_1,dim_2,dim_3,dim_4,dim_5,dim_6,dim_7,dim_8,dim_9,dim_10,dim_11,dim_12,dim_13,dim_...`

**Line 2:** `http://purl.obolibrary.org/obo/BFO_0000020,-0.6471280455589294,0.5326902866363525,-0.002226168289780...`

**Line 3:** `http://purl.obolibrary.org/obo/BFO_0000006,-1.3850938081741333,0.7696705460548401,0.5669081807136536...`
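
To consume the exported file elsewhere, the layout shown in the preview (a `node_id` column followed by `dim_0`, `dim_1`, ... columns) can be parsed with the standard library alone. The sketch below runs on a small in-memory sample with made-up IDs and values and only three dimensions. The helper name `read_embedding_csv` is hypothetical and not part of on2vec:

```python
import csv
import io

def read_embedding_csv(handle):
    """Parse rows of `node_id,dim_0,dim_1,...` into {node_id: [floats]}."""
    reader = csv.DictReader(handle)
    dim_columns = [name for name in reader.fieldnames if name.startswith("dim_")]
    # Sort numerically so dim_10 comes after dim_9 rather than after dim_1
    dim_columns.sort(key=lambda name: int(name.split("_")[1]))
    return {row["node_id"]: [float(row[col]) for col in dim_columns] for row in reader}

# In-memory stand-in for the exported file (toy values)
sample = io.StringIO(
    "node_id,dim_0,dim_1,dim_2\n"
    "http://example.org/A,0.5,-1.0,0.25\n"
    "http://example.org/B,1.5,0.0,-0.25\n"
)
vectors = read_embedding_csv(sample)
print(vectors["http://example.org/A"])
```

For the real export, pass `open('cvdo_embeddings.csv', newline='')` in place of the `StringIO` sample.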

## Summary & Next Steps

🎉 **Congratulations!** You've successfully:

1. ⬇️ Downloaded a real-world ontology (Cardiovascular Disease Ontology)
2. 🤖 Trained a Graph Neural Network on the ontology structure
3. 📊 Generated 16-dimensional embeddings for all 720 concepts
4. 🔍 Inspected the rich metadata stored with embeddings
5. ➕➖ Performed vector arithmetic operations
6. 📁 Converted the embeddings from Parquet to CSV

### 🚀 What You Can Do Next:

- **Analyze similarities:** Use cosine similarity to find related cardiovascular concepts
- **Cluster concepts:** Apply K-means or hierarchical clustering to discover disease patterns
- **Visualize:** Create UMAP or t-SNE plots of the cardiovascular disease embedding space
- **Cross-ontology mapping:** Train on CVDO, embed other medical ontologies
- **Semantic search:** Find diseases similar to a query condition
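
The first suggestion, similarity analysis, can be sketched with NumPy alone. The function name `most_similar`, the toy IDs, and the toy vectors are illustrative assumptions. With the real data, the matrix would be built from the `embedding` column and the IDs from `node_id`:

```python
import numpy as np

def most_similar(query_index, matrix, node_ids, top_k=2):
    """Rank concepts by cosine similarity to the row at `query_index`."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.clip(norms, 1e-12, None)   # guard against zero vectors
    scores = unit @ unit[query_index]             # cosine similarity to the query
    order = np.argsort(-scores)                   # highest similarity first
    ranked = [(node_ids[i], float(scores[i])) for i in order if i != query_index]
    return ranked[:top_k]

# Toy data: four hypothetical concepts in a 2-D embedding space
node_ids = ["A", "B", "C", "D"]
matrix = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])

neighbours = most_similar(0, matrix, node_ids)
print(neighbours)
```

Normalizing the rows first means a single matrix-vector product yields all cosine similarities, which scales comfortably to the 720 concepts in this ontology.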

### 📚 Learn More:

- Check the project README for CLI tools and advanced usage
- Explore the `parquet_tools.py` script for more utilities
- Try different GNN architectures (GCN, GAT) and loss functions
- Experiment with different embedding dimensions and training parameters

---

### 🧹 Cleanup (Optional)

Run this cell if you want to remove the generated files:

In [None]:
# Uncomment and run to clean up generated files
# import os

# files_to_remove = ['cvdo.owl', 'cvdo_model.pt', 'cvdo_embeddings.parquet', 'cvdo_embeddings.csv']

# for filename in files_to_remove:
#     if os.path.exists(filename):
#         os.remove(filename)
#         print(f"🗑️ Removed {filename}")
#     else:
#         print(f"⚠️ {filename} not found")

# print("✅ Cleanup complete!")