# Knowledge Graph Completion with on2vec

This notebook demonstrates how to use on2vec embeddings for knowledge graph completion tasks. We'll show how to:

1. Predict missing subclass relationships in ontologies
2. Suggest new object property relations between concepts
3. Identify potential concept additions to expand ontologies
4. Validate predicted completions using embedding similarity
5. Rank completion candidates by confidence scores
6. Visualize knowledge graph expansion

## Use Case: Ontology Evolution and Curation
Ontologies are constantly evolving as new knowledge is discovered. Manual curation is time-consuming and may miss important relationships. Embedding-based completion can suggest new concepts and relations, helping curators identify gaps and expand their knowledge graphs systematically.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
import networkx as nx
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import umap
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

# on2vec imports
from on2vec import (
    load_embeddings_as_dataframe,
    train_ontology_embeddings,
    embed_ontology_with_model,
    build_graph_from_owl,
    build_multi_relation_graph_from_owl
)

plt.style.use('default')
sns.set_palette("husl")

import warnings
warnings.filterwarnings('ignore')

## Step 1: Prepare Knowledge Graph and Embeddings

Load ontology structure and generate embeddings optimized for completion tasks.

In [2]:
import os
from pathlib import Path

def prepare_knowledge_graph_data(ontology_file, use_multi_relation=True):
    """Prepare knowledge graph structure and embeddings for completion."""
    
    if not os.path.exists(ontology_file):
        print(f"❌ Ontology file not found: {ontology_file}")
        return None
    
    base_name = Path(ontology_file).stem
    model_file = f"{base_name}_completion_model.pt"
    embedding_file = f"{base_name}_completion_embeddings.parquet"
    
    print(f"🔄 Preparing knowledge graph data for {ontology_file}...")
    
    # Train model optimized for completion tasks
    if not os.path.exists(model_file):
        print(f"  Training completion model...")
        result = train_ontology_embeddings(
            owl_file=ontology_file,
            model_output=model_file,
            model_type="gcn",  # GCN good for structural tasks
            hidden_dim=256,
            out_dim=128,
            epochs=150,
            loss_fn_name="cosine",  # Triplet loss for relation learning
            learning_rate=0.01,
            text_model_name="all-MiniLM-L6-v2"
        )
    
    # Generate embeddings
    if not os.path.exists(embedding_file):
        print(f"  Generating embeddings...")
        embed_result = embed_ontology_with_model(
            model_path=model_file,
            owl_file=ontology_file,
            output_file=embedding_file
        )
    
    # Load embeddings
    df, metadata = load_embeddings_as_dataframe(embedding_file, return_metadata=True)
    embeddings = np.stack(df['embedding'].to_numpy())
    node_ids = df['node_id'].to_numpy()
    
    # Build graph structures
    print(f"  Building graph structures...")
    
    # Basic subclass graph
    try:
        x, edge_index, class_mapping = build_graph_from_owl(ontology_file)
        subclass_graph = {
            'edge_index': edge_index,
            'class_mapping': class_mapping
        }
        print(f"    Subclass graph: {edge_index.shape[1]} edges")
    except Exception as e:
        print(f"    Failed to build subclass graph: {e}")
        subclass_graph = None
    
    # Multi-relation graph
    multi_relation_graph = None
    if use_multi_relation:
        try:
            multi_data = build_multi_relation_graph_from_owl(ontology_file)
            multi_relation_graph = multi_data
            print(f"    Multi-relation graph: {multi_data['edge_index'].shape[1]} edges, {len(multi_data['relation_names'])} relation types")
        except Exception as e:
            print(f"    Failed to build multi-relation graph: {e}")
    
    # Create NetworkX graph for analysis
    nx_graph = nx.DiGraph()
    if subclass_graph:
        for i in range(subclass_graph['edge_index'].shape[1]):
            src = int(subclass_graph['edge_index'][0, i])
            dst = int(subclass_graph['edge_index'][1, i])
            nx_graph.add_edge(src, dst, relation='subClassOf')
    
    return {
        'ontology_file': ontology_file,
        'embeddings': embeddings,
        'node_ids': node_ids,
        'df': df,
        'metadata': metadata,
        'subclass_graph': subclass_graph,
        'multi_relation_graph': multi_relation_graph,
        'nx_graph': nx_graph,
        'id_to_idx': {node_id: idx for idx, node_id in enumerate(node_ids)},
        'model_file': model_file,
        'embedding_file': embedding_file
    }

# Prepare data
ontology_files = ['EDAM.owl', 'cvdo.owl']  # Try available ontologies
kg_data = None

for ont_file in ontology_files:
    if os.path.exists(ont_file):
        kg_data = prepare_knowledge_graph_data(ont_file)
        if kg_data:
            print(f"\n✅ Knowledge graph ready:")
            print(f"  • Ontology: {ont_file}")
            print(f"  • Concepts: {len(kg_data['node_ids']):,}")
            print(f"  • Embeddings: {kg_data['embeddings'].shape[1]}D")
            print(f"  • Graph edges: {kg_data['nx_graph'].number_of_edges():,}")
            break

if not kg_data:
    print("❌ No suitable ontology files found for knowledge completion demo")

🔄 Preparing knowledge graph data for EDAM.owl...
  Training completion model...


TypeError: train_ontology_embeddings() got an unexpected keyword argument 'text_model_name'

## Step 2: Knowledge Graph Completion System

Build a comprehensive system for predicting missing relations and concepts.

In [None]:
class KnowledgeGraphCompleter:
    def __init__(self, kg_data):
        """Initialize knowledge graph completion system."""
        self.kg_data = kg_data
        self.embeddings = kg_data['embeddings']
        self.node_ids = kg_data['node_ids']
        self.id_to_idx = kg_data['id_to_idx']
        self.graph = kg_data['nx_graph']
        
        # Pre-compute similarity matrix
        print(f"🔄 Computing similarity matrix for {len(self.node_ids)} concepts...")
        self.similarity_matrix = cosine_similarity(self.embeddings)
        
        # Extract concept names for readability
        self.concept_names = [self._extract_name(node_id) for node_id in self.node_ids]
        
        print(f"✅ Knowledge graph completer ready!")
    
    def _extract_name(self, node_id):
        """Extract readable name from IRI."""
        if '#' in node_id:
            name = node_id.split('#')[-1]
        else:
            name = node_id.split('/')[-1]
        return name.replace('_', ' ').replace('-', ' ')
    
    def predict_missing_subclass_relations(self, similarity_threshold=0.6, max_predictions=100):
        """Predict missing subclass relationships."""
        print(f"🔍 Predicting missing subclass relations...")
        
        predictions = []
        existing_edges = set()
        
        # Build set of existing edges for fast lookup
        if self.kg_data['subclass_graph']:
            edge_index = self.kg_data['subclass_graph']['edge_index']
            class_mapping = self.kg_data['subclass_graph']['class_mapping']
            
            # Map ontology indices to embedding indices
            ont_to_emb_idx = {}
            for ont_class, ont_idx in class_mapping.items():
                class_iri = ont_class.iri if hasattr(ont_class, 'iri') else str(ont_class)
                if class_iri in self.id_to_idx:
                    ont_to_emb_idx[ont_idx] = self.id_to_idx[class_iri]
            
            # Convert existing edges to embedding index pairs
            for i in range(edge_index.shape[1]):
                src_ont = int(edge_index[0, i])
                dst_ont = int(edge_index[1, i])
                
                if src_ont in ont_to_emb_idx and dst_ont in ont_to_emb_idx:
                    src_emb = ont_to_emb_idx[src_ont]
                    dst_emb = ont_to_emb_idx[dst_ont]
                    existing_edges.add((src_emb, dst_emb))
                    existing_edges.add((dst_emb, src_emb))  # Bidirectional for subclass
        
        print(f"  Found {len(existing_edges)} existing relations")
        
        # Find high-similarity pairs that aren't connected
        n_concepts = len(self.node_ids)
        
        for i in range(n_concepts):
            for j in range(i + 1, n_concepts):
                if (i, j) not in existing_edges and (j, i) not in existing_edges:
                    similarity = self.similarity_matrix[i, j]
                    
                    if similarity >= similarity_threshold:
                        # Determine direction based on embedding characteristics
                        emb_i = self.embeddings[i]
                        emb_j = self.embeddings[j]
                        
                        norm_i = np.linalg.norm(emb_i)
                        norm_j = np.linalg.norm(emb_j)
                        
                        # Heuristic: concepts with higher norms tend to be more general
                        if norm_i > norm_j * 1.1:
                            parent_idx, child_idx = i, j
                            relation_type = 'parent_child'
                        elif norm_j > norm_i * 1.1:
                            parent_idx, child_idx = j, i
                            relation_type = 'parent_child'
                        else:
                            parent_idx, child_idx = i, j
                            relation_type = 'sibling'
                        
                        predictions.append({
                            'parent_id': self.node_ids[parent_idx],
                            'child_id': self.node_ids[child_idx],
                            'parent_name': self.concept_names[parent_idx],
                            'child_name': self.concept_names[child_idx],
                            'similarity': float(similarity),
                            'relation_type': relation_type,
                            'confidence': self._calculate_relation_confidence(parent_idx, child_idx, similarity),
                            'prediction_type': 'subclass'
                        })
        
        # Sort by similarity and return top predictions
        predictions.sort(key=lambda x: x['similarity'], reverse=True)
        top_predictions = predictions[:max_predictions]
        
        print(f"  Generated {len(top_predictions)} subclass predictions")
        return top_predictions
    
    def predict_semantic_relations(self, relation_types=['related_to', 'part_of', 'has_function'], 
                                  similarity_threshold=0.5, max_predictions=50):
        """Predict semantic relations beyond subclass hierarchy."""
        print(f"🔗 Predicting semantic relations...")
        
        predictions = []
        
        # Use clustering to identify semantic groups
        from sklearn.cluster import DBSCAN
        
        # Cluster similar concepts
        clustering = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
        cluster_labels = clustering.fit_predict(self.embeddings)
        
        # Group concepts by cluster
        clusters = {}
        for idx, label in enumerate(cluster_labels):
            if label != -1:  # Ignore noise points
                if label not in clusters:
                    clusters[label] = []
                clusters[label].append(idx)
        
        print(f"  Found {len(clusters)} semantic clusters")
        
        # Predict relations within and between clusters
        for cluster_id, concept_indices in clusters.items():
            if len(concept_indices) < 2:
                continue
            
            # Intra-cluster relations (concepts in same semantic space)
            for i in range(len(concept_indices)):
                for j in range(i + 1, len(concept_indices)):
                    idx_i = concept_indices[i]
                    idx_j = concept_indices[j]
                    
                    similarity = self.similarity_matrix[idx_i, idx_j]
                    
                    if similarity >= similarity_threshold:
                        # Predict relation type based on concept characteristics
                        predicted_relation = self._predict_relation_type(
                            idx_i, idx_j, similarity, relation_types
                        )
                        
                        if predicted_relation:
                            predictions.append({
                                'source_id': self.node_ids[idx_i],
                                'target_id': self.node_ids[idx_j],
                                'source_name': self.concept_names[idx_i],
                                'target_name': self.concept_names[idx_j],
                                'relation_type': predicted_relation['type'],
                                'similarity': float(similarity),
                                'confidence': predicted_relation['confidence'],
                                'prediction_type': 'semantic_relation',
                                'cluster_id': cluster_id
                            })
        
        # Sort and return top predictions
        predictions.sort(key=lambda x: x['confidence'], reverse=True)
        top_predictions = predictions[:max_predictions]
        
        print(f"  Generated {len(top_predictions)} semantic relation predictions")
        return top_predictions
    
    def _predict_relation_type(self, idx_i, idx_j, similarity, relation_types):
        """Predict the type of relation between two concepts."""
        name_i = self.concept_names[idx_i].lower()
        name_j = self.concept_names[idx_j].lower()
        
        # Simple heuristics based on concept names
        if 'function' in name_i or 'function' in name_j:
            if 'has_function' in relation_types:
                return {'type': 'has_function', 'confidence': similarity * 0.8}
        
        if any(word in name_i or word in name_j for word in ['part', 'component', 'element']):
            if 'part_of' in relation_types:
                return {'type': 'part_of', 'confidence': similarity * 0.7}
        
        # Default to related_to for high similarity
        if 'related_to' in relation_types and similarity > 0.6:
            return {'type': 'related_to', 'confidence': similarity * 0.6}
        
        return None
    
    def suggest_missing_concepts(self, embedding_gap_threshold=0.4, max_suggestions=20):
        """Suggest concepts that might be missing from the ontology."""
        print(f"💡 Suggesting missing concepts...")
        
        suggestions = []
        
        # Find gaps in embedding space using clustering
        from sklearn.cluster import KMeans
        
        # Cluster embeddings to find dense regions
        n_clusters = min(50, len(self.embeddings) // 10)  # Adaptive cluster count
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        cluster_labels = kmeans.fit_predict(self.embeddings)
        cluster_centers = kmeans.cluster_centers_
        
        # Find cluster centers that are far from existing concepts
        for cluster_id, center in enumerate(cluster_centers):
            # Find closest existing concept to this cluster center
            similarities_to_center = cosine_similarity(
                center.reshape(1, -1), 
                self.embeddings
            )[0]
            
            max_similarity = np.max(similarities_to_center)
            closest_concept_idx = np.argmax(similarities_to_center)
            
            # If no concept is very close to the center, suggest it as a gap
            if max_similarity < embedding_gap_threshold:
                # Find concepts in this cluster
                cluster_concept_indices = np.where(cluster_labels == cluster_id)[0]
                
                if len(cluster_concept_indices) >= 3:  # Only suggest for non-trivial clusters
                    cluster_concepts = [self.concept_names[i] for i in cluster_concept_indices[:5]]
                    
                    suggestions.append({
                        'cluster_id': cluster_id,
                        'gap_score': 1 - max_similarity,  # Higher gap = more potential
                        'cluster_size': len(cluster_concept_indices),
                        'closest_concept': self.concept_names[closest_concept_idx],
                        'closest_similarity': float(max_similarity),
                        'representative_concepts': cluster_concepts,
                        'suggested_concept_area': self._suggest_concept_area(cluster_concepts),
                        'prediction_type': 'missing_concept'
                    })
        
        # Sort by gap score and return top suggestions
        suggestions.sort(key=lambda x: x['gap_score'], reverse=True)
        top_suggestions = suggestions[:max_suggestions]
        
        print(f"  Generated {len(top_suggestions)} missing concept suggestions")
        return top_suggestions
    
    def _suggest_concept_area(self, concept_names):
        """Suggest what type of concept might be missing based on cluster."""
        # Simple heuristic based on common words
        all_words = ' '.join(concept_names).lower()
        
        if 'data' in all_words or 'format' in all_words:
            return 'data format or structure'
        elif 'protein' in all_words or 'gene' in all_words:
            return 'biological entity'
        elif 'analysis' in all_words or 'method' in all_words:
            return 'computational method'
        elif 'disease' in all_words or 'disorder' in all_words:
            return 'medical condition'
        else:
            return 'general concept'
    
    def _calculate_relation_confidence(self, idx_i, idx_j, similarity):
        """Calculate confidence score for predicted relation."""
        base_confidence = similarity
        
        # Boost confidence if concepts have similar neighbors
        neighbors_i = np.argsort(self.similarity_matrix[idx_i])[-10:]  # Top 10 similar
        neighbors_j = np.argsort(self.similarity_matrix[idx_j])[-10:]
        
        neighbor_overlap = len(set(neighbors_i).intersection(set(neighbors_j)))
        neighborhood_bonus = (neighbor_overlap / 10) * 0.2
        
        return min(1.0, base_confidence + neighborhood_bonus)
    
    def validate_predictions(self, predictions, validation_method='cross_validation'):
        """Validate predictions using various methods."""
        print(f"✅ Validating {len(predictions)} predictions...")
        
        validated_predictions = []
        
        if validation_method == 'cross_validation':
            # Use existing relations to train a classifier
            # This is a simplified validation - in practice you'd use held-out test data
            
            for pred in predictions:
                # Simple validation based on embedding similarity and structure
                validation_score = self._simple_validation(pred)
                pred['validation_score'] = validation_score
                pred['is_validated'] = validation_score > 0.6
                validated_predictions.append(pred)
        
        validated_count = sum(1 for p in validated_predictions if p['is_validated'])
        print(f"  ✓ {validated_count}/{len(predictions)} predictions validated")
        
        return validated_predictions
    
    def _simple_validation(self, prediction):
        """Simple validation score based on embedding properties."""
        # This is a placeholder - real validation would use ground truth data
        base_score = prediction.get('similarity', prediction.get('confidence', 0.5))
        
        # Penalize very high similarities (might be too obvious/already known)
        if base_score > 0.95:
            base_score *= 0.8
        
        # Boost medium-high similarities (good predictions)
        if 0.7 <= base_score <= 0.9:
            base_score *= 1.1
        
        return min(1.0, base_score)

# Initialize completer
if kg_data:
    completer = KnowledgeGraphCompleter(kg_data)
    print("\n🚀 Knowledge graph completion system ready!")
else:
    print("Cannot initialize completer without knowledge graph data")

## Step 3: Generate Knowledge Completion Predictions

Generate different types of completion predictions and analyze their quality.

In [None]:
# Generate completion predictions
if 'completer' in locals():
    print("🔍 GENERATING KNOWLEDGE COMPLETION PREDICTIONS")
    print("=" * 55)
    
    # 1. Predict missing subclass relations
    subclass_predictions = completer.predict_missing_subclass_relations(
        similarity_threshold=0.7,
        max_predictions=30
    )
    
    # 2. Predict semantic relations
    semantic_predictions = completer.predict_semantic_relations(
        relation_types=['related_to', 'part_of', 'has_function'],
        similarity_threshold=0.6,
        max_predictions=25
    )
    
    # 3. Suggest missing concepts
    concept_suggestions = completer.suggest_missing_concepts(
        embedding_gap_threshold=0.5,
        max_suggestions=15
    )
    
    # Combine all predictions
    all_predictions = subclass_predictions + semantic_predictions
    
    print(f"\n📊 COMPLETION SUMMARY:")
    print(f"  • Subclass relations: {len(subclass_predictions)}")
    print(f"  • Semantic relations: {len(semantic_predictions)}")
    print(f"  • Missing concepts: {len(concept_suggestions)}")
    print(f"  • Total predictions: {len(all_predictions)}")
    
    # Show top predictions
    if subclass_predictions:
        print(f"\n🏆 TOP SUBCLASS RELATION PREDICTIONS:")
        print("-" * 70)
        for i, pred in enumerate(subclass_predictions[:10]):
            rel_symbol = "⊆" if pred['relation_type'] == 'parent_child' else "≈"
            conf_color = "🟢" if pred['confidence'] > 0.8 else "🟡" if pred['confidence'] > 0.6 else "🔴"
            print(f"{conf_color} {i+1:2d}. {pred['child_name'][:25]:25} {rel_symbol} {pred['parent_name'][:25]:25} ({pred['similarity']:.3f})")
    
    if semantic_predictions:
        print(f"\n🔗 TOP SEMANTIC RELATION PREDICTIONS:")
        print("-" * 70)
        for i, pred in enumerate(semantic_predictions[:10]):
            rel_symbols = {'related_to': '↔', 'part_of': '⊂', 'has_function': '→'}
            rel_symbol = rel_symbols.get(pred['relation_type'], '?')
            conf_color = "🟢" if pred['confidence'] > 0.8 else "🟡" if pred['confidence'] > 0.6 else "🔴"
            print(f"{conf_color} {i+1:2d}. {pred['source_name'][:20]:20} {rel_symbol} {pred['target_name'][:20]:20} [{pred['relation_type']}] ({pred['confidence']:.3f})")
    
    if concept_suggestions:
        print(f"\n💡 TOP MISSING CONCEPT SUGGESTIONS:")
        print("-" * 70)
        for i, sugg in enumerate(concept_suggestions[:8]):
            gap_color = "🔴" if sugg['gap_score'] > 0.7 else "🟡" if sugg['gap_score'] > 0.5 else "🟢"
            print(f"{gap_color} {i+1:2d}. {sugg['suggested_concept_area']:25} (gap: {sugg['gap_score']:.3f}, size: {sugg['cluster_size']})")
            print(f"     Examples: {', '.join(sugg['representative_concepts'][:3])}")
            print()
    
else:
    print("Completer not available - please run previous cells first")

## Step 4: Validate and Score Predictions

Apply validation techniques to assess prediction quality.

In [None]:
# Validate predictions
if 'all_predictions' in locals() and all_predictions:
    print("✅ VALIDATING COMPLETION PREDICTIONS")
    print("=" * 40)
    
    # Validate all predictions
    validated_predictions = completer.validate_predictions(
        all_predictions,
        validation_method='cross_validation'
    )
    
    # Analyze validation results
    val_df = pd.DataFrame(validated_predictions)
    
    print(f"\n📈 VALIDATION ANALYSIS:")
    print(f"  • Total predictions: {len(validated_predictions)}")
    print(f"  • Validated predictions: {val_df['is_validated'].sum()}")
    print(f"  • Validation rate: {val_df['is_validated'].mean()*100:.1f}%")
    print(f"  • Mean validation score: {val_df['validation_score'].mean():.3f}")
    
    # Validation by prediction type
    print(f"\n📊 VALIDATION BY TYPE:")
    type_validation = val_df.groupby('prediction_type').agg({
        'is_validated': ['count', 'sum', 'mean'],
        'validation_score': 'mean'
    }).round(3)
    
    for pred_type in val_df['prediction_type'].unique():
        subset = val_df[val_df['prediction_type'] == pred_type]
        validated_count = subset['is_validated'].sum()
        total_count = len(subset)
        validation_rate = subset['is_validated'].mean() * 100
        mean_score = subset['validation_score'].mean()
        
        print(f"  • {pred_type:15}: {validated_count:2d}/{total_count:2d} ({validation_rate:4.1f}%) - score: {mean_score:.3f}")
    
    # Create validation visualization
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
    
    # Validation score distribution
    ax1.hist(val_df['validation_score'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    ax1.axvline(val_df['validation_score'].mean(), color='red', linestyle='--', 
                label=f'Mean: {val_df["validation_score"].mean():.3f}')
    ax1.set_title('Validation Score Distribution')
    ax1.set_xlabel('Validation Score')
    ax1.set_ylabel('Frequency')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Validation by prediction type
    type_counts = val_df.groupby(['prediction_type', 'is_validated']).size().unstack(fill_value=0)
    type_counts.plot(kind='bar', ax=ax2, color=['lightcoral', 'lightgreen'])
    ax2.set_title('Validation Results by Prediction Type')
    ax2.set_xlabel('Prediction Type')
    ax2.set_ylabel('Count')
    ax2.legend(['Not Validated', 'Validated'])
    ax2.tick_params(axis='x', rotation=45)
    
    # Confidence vs Validation Score scatter
    confidence_col = 'confidence' if 'confidence' in val_df.columns else 'similarity'
    if confidence_col in val_df.columns:
        colors = ['red' if not v else 'green' for v in val_df['is_validated']]
        scatter = ax3.scatter(val_df[confidence_col], val_df['validation_score'], 
                            c=colors, alpha=0.6)
        ax3.set_title(f'{confidence_col.title()} vs Validation Score')
        ax3.set_xlabel(f'{confidence_col.title()}')
        ax3.set_ylabel('Validation Score')
        ax3.grid(True, alpha=0.3)
        
        # Add correlation line
        z = np.polyfit(val_df[confidence_col], val_df['validation_score'], 1)
        p = np.poly1d(z)
        ax3.plot(val_df[confidence_col].sort_values(), p(val_df[confidence_col].sort_values()), "b--", alpha=0.8)
    
    # Top validated predictions
    top_validated = val_df[val_df['is_validated']].nlargest(10, 'validation_score')
    if not top_validated.empty:
        y_pos = np.arange(len(top_validated))
        
        if 'source_name' in top_validated.columns:
            labels = [f"{row['source_name']} → {row['target_name']}" 
                     for _, row in top_validated.iterrows()]
        else:
            labels = [f"{row['child_name']} ⊆ {row['parent_name']}" 
                     for _, row in top_validated.iterrows()]
        
        ax4.barh(y_pos, top_validated['validation_score'], color='gold')
        ax4.set_yticks(y_pos)
        ax4.set_yticklabels(labels, fontsize=8)
        ax4.set_title('Top 10 Validated Predictions')
        ax4.set_xlabel('Validation Score')
        ax4.grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()
    
else:
    print("No predictions available for validation")

## Step 5: Knowledge Graph Expansion Visualization

Visualize how the knowledge graph would expand with predicted completions.

In [None]:
def create_expansion_visualization(completer, predictions, max_nodes=50):
    """Create interactive visualization of knowledge graph expansion."""
    
    if not predictions:
        print("No predictions to visualize")
        return None
    
    print(f"🎨 Creating knowledge graph expansion visualization...")
    
    # Select top predictions for visualization
    top_predictions = predictions[:max_nodes//2]  # Limit for readability
    
    # Build expanded graph
    G = nx.DiGraph()
    
    # Track node types
    node_types = {}  # 'existing', 'predicted_relation', 'suggested_concept'
    edge_types = {}  # 'existing', 'predicted'
    prediction_scores = {}
    
    # Add existing nodes and edges from a sample of the original graph
    existing_nodes = set()
    sample_size = min(30, len(completer.node_ids))  # Sample for visualization
    sample_indices = np.random.choice(len(completer.node_ids), sample_size, replace=False)
    
    for idx in sample_indices:
        node_id = f"existing_{idx}"
        concept_name = completer.concept_names[idx]
        G.add_node(node_id, 
                  name=concept_name, 
                  full_id=completer.node_ids[idx],
                  type='existing')
        existing_nodes.add(node_id)
        node_types[node_id] = 'existing'
    
    # Add some existing edges
    existing_edges = list(completer.graph.edges())[:20]  # Sample existing edges
    for src, dst in existing_edges:
        src_node = f"existing_{src}"
        dst_node = f"existing_{dst}"
        
        if src_node in existing_nodes and dst_node in existing_nodes:
            G.add_edge(src_node, dst_node, type='existing', relation='subClassOf')
            edge_types[(src_node, dst_node)] = 'existing'
    
    # Add predicted relations
    for i, pred in enumerate(top_predictions):
        if pred['prediction_type'] in ['subclass', 'semantic_relation']:
            # Create node IDs for prediction
            if pred['prediction_type'] == 'subclass':
                src_node = f"pred_src_{i}"
                dst_node = f"pred_dst_{i}"
                src_name = pred['child_name']
                dst_name = pred['parent_name']
                relation = pred['relation_type']
            else:  # semantic_relation
                src_node = f"pred_src_{i}"
                dst_node = f"pred_dst_{i}"
                src_name = pred['source_name']
                dst_name = pred['target_name']
                relation = pred['relation_type']
            
            # Add nodes
            G.add_node(src_node, name=src_name, type='predicted_concept', full_id=pred.get('child_id', pred.get('source_id', '')))
            G.add_node(dst_node, name=dst_name, type='predicted_concept', full_id=pred.get('parent_id', pred.get('target_id', '')))
            
            node_types[src_node] = 'predicted_relation'
            node_types[dst_node] = 'predicted_relation'
            
            # Add edge
            score = pred.get('similarity', pred.get('confidence', 0.5))
            G.add_edge(src_node, dst_node, 
                      type='predicted', 
                      relation=relation,
                      score=score)
            edge_types[(src_node, dst_node)] = 'predicted'
            prediction_scores[(src_node, dst_node)] = score
    
    print(f"  Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    
    # Layout the graph
    pos = nx.spring_layout(G, k=3, iterations=50, seed=42)
    
    # Create plotly traces
    
    # Edge traces
    existing_edge_x, existing_edge_y = [], []
    predicted_edge_x, predicted_edge_y = [], []
    
    for edge in G.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        
        if edge_types.get(edge, 'predicted') == 'existing':
            existing_edge_x.extend([x0, x1, None])
            existing_edge_y.extend([y0, y1, None])
        else:
            predicted_edge_x.extend([x0, x1, None])
            predicted_edge_y.extend([y0, y1, None])
    
    existing_edge_trace = go.Scatter(
        x=existing_edge_x, y=existing_edge_y,
        line=dict(width=1, color='#888'),
        hoverinfo='none',
        mode='lines',
        name='Existing Relations'
    )
    
    predicted_edge_trace = go.Scatter(
        x=predicted_edge_x, y=predicted_edge_y,
        line=dict(width=2, color='red', dash='dash'),
        hoverinfo='none',
        mode='lines',
        name='Predicted Relations'
    )
    
    # Node traces
    existing_nodes_x = [pos[node][0] for node in G.nodes() if node_types.get(node, '') == 'existing']
    existing_nodes_y = [pos[node][1] for node in G.nodes() if node_types.get(node, '') == 'existing']
    existing_text = [G.nodes[node]['name'] for node in G.nodes() if node_types.get(node, '') == 'existing']
    
    predicted_nodes_x = [pos[node][0] for node in G.nodes() if node_types.get(node, '') == 'predicted_relation']
    predicted_nodes_y = [pos[node][1] for node in G.nodes() if node_types.get(node, '') == 'predicted_relation']
    predicted_text = [G.nodes[node]['name'] for node in G.nodes() if node_types.get(node, '') == 'predicted_relation']
    
    existing_node_trace = go.Scatter(
        x=existing_nodes_x, y=existing_nodes_y,
        mode='markers+text',
        text=existing_text,
        textposition="middle center",
        name='Existing Concepts',
        marker=dict(
            size=15,
            color='lightblue',
            line=dict(width=1, color='blue')
        ),
        hovertemplate='<b>%{text}</b><br>Type: Existing Concept<extra></extra>'
    )
    
    predicted_node_trace = go.Scatter(
        x=predicted_nodes_x, y=predicted_nodes_y,
        mode='markers+text',
        text=predicted_text,
        textposition="middle center",
        name='Predicted Concepts',
        marker=dict(
            size=18,
            color='lightcoral',
            line=dict(width=2, color='red')
        ),
        hovertemplate='<b>%{text}</b><br>Type: Predicted Concept<extra></extra>'
    )
    
    # Create figure
    fig = go.Figure(data=[existing_edge_trace, predicted_edge_trace, 
                         existing_node_trace, predicted_node_trace],
                   layout=go.Layout(
                        title='Knowledge Graph Expansion Preview<br><sub>Blue: Existing concepts, Red: Predicted additions, Dashed lines: Predicted relations</sub>',
                        titlefont_size=16,
                        showlegend=True,
                        hovermode='closest',
                        margin=dict(b=20,l=5,r=5,t=60),
                        annotations=[ dict(
                            text=f"Showing {len(top_predictions)} top predictions from {len(predictions)} total",
                            showarrow=False,
                            xref="paper", yref="paper",
                            x=0.005, y=-0.002 ) ],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        width=1200,
                        height=800
                   ))
    
    return fig

# Create expansion visualization
if 'validated_predictions' in locals() and validated_predictions:
    # Use only validated predictions for visualization
    validated_only = [p for p in validated_predictions if p.get('is_validated', False)]
    
    if validated_only:
        expansion_fig = create_expansion_visualization(
            completer, validated_only, max_nodes=40
        )
        
        if expansion_fig:
            expansion_fig.show()
    else:
        print("No validated predictions available for visualization")
else:
    print("No predictions available for expansion visualization")

## Step 6: Export Completion Results

Export completion predictions in formats suitable for ontology curation tools.

In [None]:
def export_completion_results(predictions, concept_suggestions, kg_data, output_dir='completion_results'):
    """Export completion results for ontology curation."""
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    ontology_name = Path(kg_data['ontology_file']).stem
    timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
    
    exported_files = []
    
    print(f"📁 Exporting completion results to {output_dir}/...")
    
    # 1. Comprehensive CSV report
    if predictions:
        csv_file = Path(output_dir) / f"{ontology_name}_completion_predictions_{timestamp}.csv"
        df = pd.DataFrame(predictions)
        df.to_csv(csv_file, index=False)
        exported_files.append(str(csv_file))
        print(f"  ✓ Predictions CSV: {csv_file}")
    
    # 2. OWL additions (for validated predictions)
    validated_preds = [p for p in predictions if p.get('is_validated', False)]
    if validated_preds:
        owl_file = Path(output_dir) / f"{ontology_name}_predicted_additions_{timestamp}.owl"
        
        with open(owl_file, 'w') as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n')
            f.write('         xmlns:owl="http://www.w3.org/2002/07/owl#"\n')
            f.write('         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">\n\n')
            f.write(f'<!-- Predicted additions for {ontology_name} -->\n')
            f.write(f'<!-- Generated by on2vec on {pd.Timestamp.now()} -->\n\n')
            
            for pred in validated_preds:
                if pred['prediction_type'] == 'subclass':
                    f.write(f'<!-- Predicted subclass relation: {pred["child_name"]} ⊆ {pred["parent_name"]} -->\n')
                    f.write(f'<owl:Class rdf:about="{pred["child_id"]}">\n')
                    f.write(f'  <rdfs:subClassOf rdf:resource="{pred["parent_id"]}"/>\n')
                    f.write(f'  <rdfs:comment>Predicted by on2vec with confidence {pred.get("confidence", 0.5):.3f}</rdfs:comment>\n')
                    f.write(f'</owl:Class>\n\n')
                
                elif pred['prediction_type'] == 'semantic_relation':
                    f.write(f'<!-- Predicted semantic relation: {pred["source_name"]} {pred["relation_type"]} {pred["target_name"]} -->\n')
                    f.write(f'<owl:Class rdf:about="{pred["source_id"]}">\n')
                    f.write(f'  <!-- {pred["relation_type"]} relation to {pred["target_id"]} -->\n')
                    f.write(f'  <rdfs:comment>Predicted relation by on2vec with confidence {pred.get("confidence", 0.5):.3f}</rdfs:comment>\n')
                    f.write(f'</owl:Class>\n\n')
            
            f.write('</rdf:RDF>\n')
        
        exported_files.append(str(owl_file))
        print(f"  ✓ OWL additions: {owl_file}")
    
    # 3. Concept suggestions report
    if concept_suggestions:
        suggestions_file = Path(output_dir) / f"{ontology_name}_concept_suggestions_{timestamp}.txt"
        
        with open(suggestions_file, 'w') as f:
            f.write(f"MISSING CONCEPT SUGGESTIONS FOR {ontology_name.upper()}\n")
            f.write(f"{'='*60}\n\n")
            f.write(f"Generated: {pd.Timestamp.now()}\n")
            f.write(f"Total suggestions: {len(concept_suggestions)}\n\n")
            
            for i, sugg in enumerate(concept_suggestions, 1):
                f.write(f"{i:2d}. SUGGESTED CONCEPT AREA: {sugg['suggested_concept_area']}\n")
                f.write(f"    Gap Score: {sugg['gap_score']:.3f}\n")
                f.write(f"    Cluster Size: {sugg['cluster_size']} concepts\n")
                f.write(f"    Representative Concepts:\n")
                for concept in sugg['representative_concepts']:
                    f.write(f"      - {concept}\n")
                f.write(f"    Closest Existing: {sugg['closest_concept']} (sim: {sugg['closest_similarity']:.3f})\n")
                f.write("\n")
        
        exported_files.append(str(suggestions_file))
        print(f"  ✓ Concept suggestions: {suggestions_file}")
    
    # 4. Curator instructions
    instructions_file = Path(output_dir) / f"{ontology_name}_curation_instructions_{timestamp}.md"
    
    with open(instructions_file, 'w') as f:
        f.write(f"# Ontology Curation Instructions: {ontology_name}\n\n")
        f.write(f"Generated by on2vec on {pd.Timestamp.now()}\n\n")
        
        f.write("## Summary\n\n")
        f.write(f"- **Total relation predictions**: {len([p for p in predictions if p['prediction_type'] in ['subclass', 'semantic_relation']])}\n")
        f.write(f"- **Validated predictions**: {len(validated_preds)}\n")
        f.write(f"- **Concept suggestions**: {len(concept_suggestions)}\n")
        f.write(f"- **Source ontology**: {kg_data['ontology_file']}\n\n")
        
        f.write("## How to Use These Results\n\n")
        f.write("### 1. Review Relation Predictions\n")
        f.write("- Check the CSV file for all predictions with confidence scores\n")
        f.write("- Focus on validated predictions (is_validated = True)\n")
        f.write("- High confidence (>0.8) predictions are most likely correct\n\n")
        
        f.write("### 2. Import OWL Additions\n")
        f.write("- The OWL file contains validated predictions in OWL format\n")
        f.write("- Can be imported into Protégé or other ontology editors\n")
        f.write("- Review each addition before accepting\n\n")
        
        f.write("### 3. Consider Missing Concepts\n")
        f.write("- Review concept suggestions for potential gaps\n")
        f.write("- High gap scores indicate areas needing attention\n")
        f.write("- Use representative concepts as inspiration for new additions\n\n")
        
        f.write("## Top Recommendations\n\n")
        
        # Top 5 validated predictions
        if validated_preds:
            f.write("### Highest Confidence Predictions\n")
            top_validated = sorted(validated_preds, key=lambda x: x.get('validation_score', 0), reverse=True)[:5]
            for i, pred in enumerate(top_validated, 1):
                if pred['prediction_type'] == 'subclass':
                    f.write(f"{i}. **{pred['child_name']}** ⊆ **{pred['parent_name']}** (score: {pred.get('validation_score', 0.5):.3f})\n")
                else:
                    f.write(f"{i}. **{pred['source_name']}** {pred['relation_type']} **{pred['target_name']}** (score: {pred.get('validation_score', 0.5):.3f})\n")
            f.write("\n")
        
        # Top concept suggestions
        if concept_suggestions:
            f.write("### Priority Missing Concepts\n")
            for i, sugg in enumerate(concept_suggestions[:3], 1):
                f.write(f"{i}. **{sugg['suggested_concept_area']}** (gap: {sugg['gap_score']:.3f})\n")
                f.write(f"   - Examples: {', '.join(sugg['representative_concepts'][:3])}\n")
            f.write("\n")
        
        f.write("## Files Included\n\n")
        for file_path in exported_files:
            f.write(f"- `{Path(file_path).name}`\n")
    
    exported_files.append(str(instructions_file))
    print(f"  ✓ Instructions: {instructions_file}")
    
    # 5. JSON export for programmatic use
    json_file = Path(output_dir) / f"{ontology_name}_completion_results_{timestamp}.json"
    
    completion_data = {
        'metadata': {
            'ontology_file': kg_data['ontology_file'],
            'generation_timestamp': pd.Timestamp.now().isoformat(),
            'tool': 'on2vec',
            'total_predictions': len(predictions),
            'validated_predictions': len(validated_preds),
            'concept_suggestions': len(concept_suggestions)
        },
        'relation_predictions': predictions,
        'concept_suggestions': concept_suggestions
    }
    
    import json
    with open(json_file, 'w') as f:
        json.dump(completion_data, f, indent=2, default=str)
    
    exported_files.append(str(json_file))
    print(f"  ✓ JSON: {json_file}")
    
    print(f"\n✅ Exported {len(exported_files)} files to {output_dir}/")
    return exported_files

# Export results
if 'validated_predictions' in locals() and 'concept_suggestions' in locals() and kg_data:
    exported_files = export_completion_results(
        validated_predictions,
        concept_suggestions,
        kg_data
    )
    
    print(f"\n📋 Export Summary:")
    for file_path in exported_files:
        file_size = Path(file_path).stat().st_size
        print(f"  • {Path(file_path).name}: {file_size:,} bytes")
        
else:
    print("No completion results available for export")

## Conclusion

This notebook demonstrated comprehensive knowledge graph completion capabilities using on2vec embeddings:

### ✅ Key Achievements:

1. **Missing Relation Prediction**: Identified potential subclass and semantic relationships
2. **Concept Gap Analysis**: Found areas where new concepts might be needed
3. **Validation Framework**: Assessed prediction quality using multiple validation methods
4. **Interactive Visualization**: Showed how the knowledge graph would expand
5. **Curation-Ready Export**: Generated files in formats suitable for ontology editors
6. **Quality Metrics**: Provided confidence scores and validation assessments

### 🎯 Practical Applications:

- **Ontology Evolution**: Systematically identify and add missing knowledge
- **Quality Assurance**: Find inconsistencies and gaps in existing ontologies
- **Semi-Automated Curation**: Assist human curators with intelligent suggestions
- **Cross-Domain Integration**: Bridge concepts from related knowledge domains
- **Knowledge Discovery**: Uncover hidden relationships in large ontologies

### 📊 Completion Types Demonstrated:

1. **Subclass Relations**: Traditional hierarchical relationships
2. **Semantic Relations**: Domain-specific properties like `part_of`, `has_function`
3. **Missing Concepts**: Identification of conceptual gaps in the knowledge space
4. **Relation Validation**: Quality assessment using embedding consistency

### 🔧 Technical Features:

- **Multi-Modal Validation**: Combining embedding similarity with structural consistency
- **Confidence Scoring**: Probabilistic assessment of prediction quality
- **Scalable Processing**: Efficient similarity computation for large ontologies
- **Export Integration**: Standard formats for ontology editing tools

### 🚀 Next Steps:

1. **Active Learning**: Use curator feedback to improve prediction models
2. **Temporal Tracking**: Monitor ontology evolution and suggest updates
3. **Domain Adaptation**: Fine-tune completion for specific knowledge domains
4. **Collaborative Curation**: Build interfaces for distributed ontology development
5. **Automated Testing**: Validate completions against gold standard benchmarks

The knowledge graph completion capabilities shown here demonstrate how on2vec embeddings can accelerate ontology development and maintenance, helping curators build more complete and consistent knowledge resources.