# Clustering & Taxonomy Discovery with on2vec

This notebook demonstrates how to discover hidden concept groupings and create new taxonomies using on2vec embeddings. We'll show how to:

1. Discover natural concept clusters in embedding space
2. Create hierarchical taxonomies from flat concept collections
3. Identify emergent themes and research areas
4. Validate clustering quality against existing ontology structure
5. Build interactive taxonomy browsers and explorers
6. Detect concept drift and emerging trends

## Use Case: Scientific Domain Organization
As scientific fields evolve, new concepts emerge and relationships change. Traditional manual taxonomy creation is slow and may miss subtle patterns. Embedding-based clustering can automatically discover conceptual groupings, identify emerging research areas, and suggest new organizational structures for knowledge domains.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import (
    KMeans, DBSCAN, AgglomerativeClustering, SpectralClustering,
    OPTICS
)
from sklearn.mixture import GaussianMixture  # GaussianMixture lives in sklearn.mixture, not sklearn.cluster
from sklearn.metrics import (
    silhouette_score, adjusted_rand_score, normalized_mutual_info_score,
    calinski_harabasz_score, davies_bouldin_score
)
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, to_tree
from scipy.spatial.distance import pdist, squareform
from scipy.stats import chi2_contingency
import networkx as nx
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import umap
from collections import defaultdict, Counter
import itertools
from pathlib import Path

# on2vec imports
from on2vec import (
    load_embeddings_as_dataframe,
    train_ontology_embeddings,
    embed_ontology_with_model,
    build_graph_from_owl
)

plt.style.use('default')
sns.set_palette("husl")

import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
import random
random.seed(42)

## Step 1: Prepare Clustering Data

Load embeddings and prepare data for clustering analysis.

In [None]:
import os

def prepare_clustering_data(ontology_file):
    """Prepare embeddings and concept data for clustering analysis."""
    
    if not os.path.exists(ontology_file):
        print(f"❌ Ontology file not found: {ontology_file}")
        return None
    
    base_name = Path(ontology_file).stem
    model_file = f"{base_name}_clustering_model.pt"
    embedding_file = f"{base_name}_clustering_embeddings.parquet"
    
    print(f"🔄 Preparing clustering data for {ontology_file}...")
    
    try:
        # Train model optimized for clustering
        if not os.path.exists(model_file):
            print(f"  Training clustering-optimized model...")
            result = train_ontology_embeddings(
                owl_file=ontology_file,
                model_output=model_file,
                model_type="gat",  # GAT for capturing local neighborhoods
                hidden_dim=256,    # Good balance for clustering
                out_dim=128,       # Rich embedding space
                epochs=100,        # Sufficient for demo
                loss_fn_name="triplet", # Triplet loss for better cluster separation
                learning_rate=0.01
            )
            print(f"  ✓ Model training completed")
        else:
            print(f"  ✓ Using existing model: {model_file}")
        
        # Generate embeddings
        if not os.path.exists(embedding_file):
            print(f"  Generating embeddings...")
            embed_result = embed_ontology_with_model(
                model_path=model_file,
                owl_file=ontology_file,
                output_file=embedding_file
            )
            print(f"  ✓ Embeddings generation completed")
        else:
            print(f"  ✓ Using existing embeddings: {embedding_file}")
        
        # Load embeddings
        df, metadata = load_embeddings_as_dataframe(embedding_file, return_metadata=True)
        embeddings = np.stack(df['embedding'].to_numpy())
        node_ids = df['node_id'].to_numpy()
        
        print(f"  ✓ Loaded {len(node_ids)} concept embeddings")
        
        # Extract concept information
        concept_info = []
        for i, node_id in enumerate(node_ids):
            # Extract concept name
            if '#' in node_id:
                name = node_id.split('#')[-1]
            else:
                name = node_id.split('/')[-1]
            
            clean_name = name.replace('_', ' ').replace('-', ' ')
            
            # Extract concept characteristics for validation
            name_lower = clean_name.lower()
            
            # Determine concept type heuristically
            if any(word in name_lower for word in ['data', 'format', 'file', 'document']):
                concept_type = 'data_resource'
            elif any(word in name_lower for word in ['analysis', 'method', 'algorithm', 'technique']):
                concept_type = 'methodology'
            elif any(word in name_lower for word in ['protein', 'gene', 'sequence', 'molecular']):
                concept_type = 'biological_entity'
            elif any(word in name_lower for word in ['tool', 'software', 'program', 'application']):
                concept_type = 'software_tool'
            elif any(word in name_lower for word in ['database', 'repository', 'collection']):
                concept_type = 'database'
            elif any(word in name_lower for word in ['disease', 'disorder', 'pathology', 'syndrome']):
                concept_type = 'medical_condition'
            elif any(word in name_lower for word in ['role', 'contributor', 'author', 'person']):
                concept_type = 'role'
            elif any(word in name_lower for word in ['use', 'consent', 'permission', 'restriction']):
                concept_type = 'data_usage'
            else:
                concept_type = 'general'
            
            # Estimate concept complexity based on name structure
            complexity = 'simple' if len(clean_name.split()) <= 2 else 'complex'
            
            concept_info.append({
                'concept_id': i,
                'node_id': node_id,
                'name': clean_name,
                'name_length': len(clean_name),
                'concept_type': concept_type,
                'complexity': complexity
            })
        
        concept_df = pd.DataFrame(concept_info)
        
        # Load original ontology structure for validation
        ontology_structure = None
        try:
            x, edge_index, class_mapping = build_graph_from_owl(ontology_file)
            ontology_structure = {
                'edge_index': edge_index,
                'class_mapping': class_mapping
            }
            print(f"  ✓ Loaded ontology structure: {edge_index.shape[1]} edges")
        except Exception as e:
            print(f"  ⚠️ Could not load ontology structure: {e}")
        
        print(f"  ✓ Prepared {len(node_ids)} concepts with {embeddings.shape[1]}D embeddings")
        print(f"  ✓ Concept types: {dict(concept_df['concept_type'].value_counts())}")
        
        return {
            'ontology_file': ontology_file,
            'embeddings': embeddings,
            'concept_df': concept_df,
            'node_ids': node_ids,
            'metadata': metadata,
            'ontology_structure': ontology_structure,
            'model_file': model_file,
            'embedding_file': embedding_file
        }
        
    except Exception as e:
        print(f"❌ Error preparing clustering data: {e}")
        import traceback
        traceback.print_exc()
        return None

# Candidate ontologies, tried in order until one loads successfully
ontology_candidates = [
    'EDAM.owl',             # Large bioinformatics ontology (if available)
    'cvdo.owl',             # Cardiovascular disease ontology (if available)
    'owl_files/fao.owl',    # FAIR* Reviews Ontology (116 classes, tested ✓)
    'owl_files/cro.owl',    # Contributor Role Ontology (105 classes, tested ✓)
    'owl_files/duo.owl',    # Data Use Ontology (45 classes, tested ✓)
]

cluster_data = None

print("🔍 SEARCHING FOR SUITABLE ONTOLOGY FOR CLUSTERING...")
print("=" * 55)

for ont_file in ontology_candidates:
    if os.path.exists(ont_file):
        print(f"\n🔄 Attempting to load {ont_file}...")
        
        # Show file info
        file_size = os.path.getsize(ont_file)
        size_str = f"{file_size/1024:.1f}KB" if file_size < 1024*1024 else f"{file_size/(1024*1024):.1f}MB"
        print(f"   File size: {size_str}")
        
        cluster_data = prepare_clustering_data(ont_file)
        if cluster_data:
            print(f"\n✅ CLUSTERING DATA READY!")
            print("=" * 30)
            print(f"  • Ontology: {ont_file}")
            print(f"  • Concepts: {len(cluster_data['node_ids']):,}")
            print(f"  • Embeddings: {cluster_data['embeddings'].shape}")
            print(f"  • File size: {size_str}")
            break
        else:
            print(f"❌ Failed to prepare clustering data for {ont_file}")
    else:
        print(f"⚠️  File not found: {ont_file}")

if not cluster_data:
    print("\n❌ NO SUITABLE ONTOLOGY FILES FOUND")
    print("=" * 40) 
    print("Checked the following files:")
    for ont_file in ontology_candidates:
        exists = "✓" if os.path.exists(ont_file) else "✗"
        print(f"  {exists} {ont_file}")
    print("\nPlease ensure at least one valid ontology is available.")
    print("You can download ontologies from: https://obofoundry.org/")
else:
    print(f"\n🎯 Ready to proceed with clustering analysis using {Path(cluster_data['ontology_file']).name}!")
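
The name handling inside `prepare_clustering_data` can be pulled out and checked on its own. The sketch below is a minimal standalone version of the same idea. `local_name`, `guess_concept_type`, and `TYPE_KEYWORDS` are hypothetical helper names (not part of on2vec), the example IRIs are made up apart from the DUO one, and only three of the keyword rules are reproduced.

```python
# Hypothetical helpers mirroring the heuristics in prepare_clustering_data.
# They are illustrative only and not part of the on2vec API.
TYPE_KEYWORDS = [
    ("data_resource", ["data", "format", "file", "document"]),
    ("methodology", ["analysis", "method", "algorithm", "technique"]),
    ("role", ["role", "contributor", "author", "person"]),
]

def local_name(iri: str) -> str:
    """Return the fragment after the last '#' (or '/'), with separators as spaces."""
    fragment = iri.split("#")[-1] if "#" in iri else iri.split("/")[-1]
    return fragment.replace("_", " ").replace("-", " ")

def guess_concept_type(name: str) -> str:
    """Return the first type whose keyword occurs as a substring of the name."""
    lowered = name.lower()
    for concept_type, keywords in TYPE_KEYWORDS:
        if any(word in lowered for word in keywords):
            return concept_type
    return "general"

examples = [
    "http://example.org/onto#sequence_analysis",
    "http://example.org/onto/data-format",
    "http://purl.obolibrary.org/obo/DUO_0000001",
]
for iri in examples:
    name = local_name(iri)
    print(f"{name!r:25} -> {guess_concept_type(name)}")
```

Two limitations are worth keeping in mind. Opaque identifiers such as `DUO_0000001` contain no words, so they fall through to `general`; for ontologies with that ID style the heuristic would need human-readable labels rather than IRI fragments. The rules also match substrings, so a short keyword like `use` also matches inside longer, unrelated words.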

## Step 2: Multi-Algorithm Clustering Analysis

Apply multiple clustering algorithms to discover different types of concept groupings.
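
Each algorithm numbers its clusters arbitrarily, so label arrays from two methods cannot be compared element by element. One common option is the adjusted Rand index, computed by `adjusted_rand_score` (already imported above). The sketch below uses small hand-made label arrays, not real on2vec output, to show the behavior.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Toy label arrays standing in for the output of two algorithms (not real results)
labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])  # same partition, renamed clusters
labels_c = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0])  # a different partition

# ARI is invariant to label permutation: identical partitions score 1.0
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)

print(f"same partition, renamed labels: {ari_same:.2f}")
print(f"different partition:            {ari_diff:.2f}")
```

When one of the label arrays comes from DBSCAN, the noise label `-1` is treated like any other cluster, so it may be worth masking out noise points before comparing.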

In [None]:
class MultiAlgorithmClusterer:
    def __init__(self, cluster_data):
        """Initialize multi-algorithm clustering system."""
        self.cluster_data = cluster_data
        self.embeddings = cluster_data['embeddings']
        self.concept_df = cluster_data['concept_df']
        
        # Standardize embeddings for distance-based clustering
        self.scaler = StandardScaler()
        self.embeddings_scaled = self.scaler.fit_transform(self.embeddings)
        
        print(f"🎯 Multi-algorithm clusterer initialized with {len(self.embeddings)} concepts")
    
    def find_optimal_k(self, k_range=(2, 20), methods=('kmeans',)):
        """Find optimal number of clusters using multiple metrics."""
        print(f"🔍 Finding optimal cluster numbers...")
        
        k_values = range(k_range[0], k_range[1] + 1)
        results = {}
        
        for method in methods:
            method_results = {
                'k_values': list(k_values),
                'silhouette_scores': [],
                'calinski_harabasz_scores': [],
                'davies_bouldin_scores': [],
                'inertias': []
            }
            
            for k in k_values:
                if method == 'kmeans':
                    clusterer = KMeans(n_clusters=k, random_state=42, n_init=10)
                elif method == 'hierarchical':
                    clusterer = AgglomerativeClustering(n_clusters=k)
                elif method == 'spectral':
                    clusterer = SpectralClustering(n_clusters=k, random_state=42)
                else:
                    continue
                
                labels = clusterer.fit_predict(self.embeddings_scaled)
                
                # Calculate metrics
                if len(set(labels)) > 1:  # Valid clustering
                    silhouette = silhouette_score(self.embeddings_scaled, labels)
                    calinski = calinski_harabasz_score(self.embeddings_scaled, labels)
                    davies_bouldin = davies_bouldin_score(self.embeddings_scaled, labels)
                    
                    method_results['silhouette_scores'].append(silhouette)
                    method_results['calinski_harabasz_scores'].append(calinski)
                    method_results['davies_bouldin_scores'].append(davies_bouldin)
                    
                    if method == 'kmeans':
                        method_results['inertias'].append(clusterer.inertia_)
                    else:
                        method_results['inertias'].append(np.nan)  # Inertia is only defined for K-Means
                else:
                    # Invalid clustering
                    method_results['silhouette_scores'].append(-1)
                    method_results['calinski_harabasz_scores'].append(0)
                    method_results['davies_bouldin_scores'].append(float('inf'))
                    method_results['inertias'].append(float('inf'))
            
            results[method] = method_results
        
        return results
    
    def apply_clustering_algorithms(self, n_clusters=None):
        """Apply multiple clustering algorithms and compare results."""
        print(f"🔬 Applying multiple clustering algorithms...")
        
        clustering_results = {}
        
        # Determine number of clusters if not specified
        if n_clusters is None:
            # Use heuristic: sqrt(n_samples/2)
            n_clusters = max(3, min(15, int(np.sqrt(len(self.embeddings) / 2))))
            print(f"  Using heuristic n_clusters = {n_clusters}")
        
        # 1. K-Means Clustering
        try:
            kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
            kmeans_labels = kmeans.fit_predict(self.embeddings_scaled)
            
            clustering_results['kmeans'] = {
                'labels': kmeans_labels,
                'algorithm': 'K-Means',
                'n_clusters': len(set(kmeans_labels)),
                'silhouette_score': silhouette_score(self.embeddings_scaled, kmeans_labels),
                'inertia': kmeans.inertia_,
                'cluster_centers': kmeans.cluster_centers_
            }
            print(f"  ✓ K-Means: {len(set(kmeans_labels))} clusters, silhouette = {clustering_results['kmeans']['silhouette_score']:.3f}")
        except Exception as e:
            print(f"  ❌ K-Means failed: {e}")
        
        # 2. Hierarchical Clustering
        try:
            hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
            hierarchical_labels = hierarchical.fit_predict(self.embeddings_scaled)
            
            clustering_results['hierarchical'] = {
                'labels': hierarchical_labels,
                'algorithm': 'Hierarchical (Ward)',
                'n_clusters': len(set(hierarchical_labels)),
                'silhouette_score': silhouette_score(self.embeddings_scaled, hierarchical_labels)
            }
            print(f"  ✓ Hierarchical: {len(set(hierarchical_labels))} clusters, silhouette = {clustering_results['hierarchical']['silhouette_score']:.3f}")
        except Exception as e:
            print(f"  ❌ Hierarchical failed: {e}")
        
        # 3. DBSCAN (density-based)
        try:
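            # Estimate eps from the k-distance distribution
            from sklearn.neighbors import NearestNeighbors
            min_samples = 5
            # Querying the training points returns each point as its own nearest
            # neighbor, so n_neighbors=min_samples matches DBSCAN's core-point rule
            nbrs = NearestNeighbors(n_neighbors=min_samples)
            nbrs.fit(self.embeddings_scaled)
            distances, indices = nbrs.kneighbors(self.embeddings_scaled)
            distances = np.sort(distances[:, -1], axis=0)

            # Simplified stand-in for a knee-point search: take the 90th percentile
            eps = np.percentile(distances, 90)

            dbscan = DBSCAN(eps=eps, min_samples=min_samples)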
            dbscan_labels = dbscan.fit_predict(self.embeddings_scaled)
            
            n_clusters_dbscan = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
            n_noise = list(dbscan_labels).count(-1)
            
            if n_clusters_dbscan > 1:
                # Calculate silhouette only for non-noise points
                non_noise_mask = dbscan_labels != -1
                if np.sum(non_noise_mask) > 1:
                    silhouette = silhouette_score(self.embeddings_scaled[non_noise_mask], 
                                                dbscan_labels[non_noise_mask])
                else:
                    silhouette = -1
            else:
                silhouette = -1
            
            clustering_results['dbscan'] = {
                'labels': dbscan_labels,
                'algorithm': 'DBSCAN',
                'n_clusters': n_clusters_dbscan,
                'n_noise': n_noise,
                'eps': eps,
                'silhouette_score': silhouette
            }
            print(f"  ✓ DBSCAN: {n_clusters_dbscan} clusters, {n_noise} noise points, silhouette = {silhouette:.3f}")
        except Exception as e:
            print(f"  ❌ DBSCAN failed: {e}")
        
        # 4. Gaussian Mixture Model
        try:
            gmm = GaussianMixture(n_components=n_clusters, random_state=42)
            gmm_labels = gmm.fit_predict(self.embeddings_scaled)
            
            clustering_results['gmm'] = {
                'labels': gmm_labels,
                'algorithm': 'Gaussian Mixture',
                'n_clusters': len(set(gmm_labels)),
                'silhouette_score': silhouette_score(self.embeddings_scaled, gmm_labels),
                'aic': gmm.aic(self.embeddings_scaled),
                'bic': gmm.bic(self.embeddings_scaled)
            }
            print(f"  ✓ GMM: {len(set(gmm_labels))} clusters, silhouette = {clustering_results['gmm']['silhouette_score']:.3f}")
        except Exception as e:
            print(f"  ❌ GMM failed: {e}")
        
        # 5. Spectral Clustering
        try:
            spectral = SpectralClustering(n_clusters=n_clusters, random_state=42, 
                                        affinity='rbf', gamma=1.0)
            spectral_labels = spectral.fit_predict(self.embeddings_scaled)
            
            clustering_results['spectral'] = {
                'labels': spectral_labels,
                'algorithm': 'Spectral',
                'n_clusters': len(set(spectral_labels)),
                'silhouette_score': silhouette_score(self.embeddings_scaled, spectral_labels)
            }
            print(f"  ✓ Spectral: {len(set(spectral_labels))} clusters, silhouette = {clustering_results['spectral']['silhouette_score']:.3f}")
        except Exception as e:
            print(f"  ❌ Spectral failed: {e}")
        
        return clustering_results
    
    def analyze_cluster_characteristics(self, clustering_results):
        """Analyze the characteristics of discovered clusters."""
        print(f"🔍 Analyzing cluster characteristics...")
        
        cluster_analyses = {}
        
        for method_name, result in clustering_results.items():
            labels = result['labels']
            unique_labels = set(labels)
            
            # Skip noise label for DBSCAN
            if -1 in unique_labels:
                unique_labels.remove(-1)
            
            cluster_info = []
            
            for cluster_id in unique_labels:
                cluster_mask = labels == cluster_id
                cluster_concepts = self.concept_df[cluster_mask]
                cluster_embeddings = self.embeddings[cluster_mask]
                
                if len(cluster_concepts) > 0:
                    # Analyze cluster composition
                    concept_types = cluster_concepts['concept_type'].value_counts()
                    complexity_dist = cluster_concepts['complexity'].value_counts()
                    
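                    # Calculate cluster cohesion (mean pairwise cosine similarity)
                    if len(cluster_embeddings) > 1:
                        norms = np.linalg.norm(cluster_embeddings, axis=1, keepdims=True)
                        unit_embeddings = cluster_embeddings / np.clip(norms, 1e-12, None)
                        similarity_matrix = unit_embeddings @ unit_embeddings.T
                        # Average over the strict upper triangle so each pair counts once
                        upper = np.triu_indices(len(cluster_embeddings), k=1)
                        cohesion = float(np.mean(similarity_matrix[upper]))
                    else:
                        cohesion = 1.0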
                    
                    # Get representative concepts (closest to centroid)
                    cluster_centroid = np.mean(cluster_embeddings, axis=0)
                    distances_to_centroid = np.linalg.norm(cluster_embeddings - cluster_centroid, axis=1)
                    representative_indices = np.argsort(distances_to_centroid)[:3]
                    representatives = cluster_concepts.iloc[representative_indices]['name'].tolist()
                    
                    # Determine cluster theme
                    most_common_type = concept_types.index[0] if not concept_types.empty else 'unknown'
                    theme = self._infer_cluster_theme(cluster_concepts['name'].tolist(), most_common_type)
                    
                    cluster_info.append({
                        'cluster_id': cluster_id,
                        'size': len(cluster_concepts),
                        'cohesion': cohesion,
                        'dominant_type': most_common_type,
                        'type_distribution': dict(concept_types),
                        'complexity_distribution': dict(complexity_dist),
                        'representatives': representatives,
                        'theme': theme,
                        'diversity_score': len(concept_types) / len(cluster_concepts)  # Type diversity
                    })
            
            cluster_analyses[method_name] = {
                'algorithm': result['algorithm'],
                'n_clusters': len(cluster_info),
                'silhouette_score': result.get('silhouette_score', 0),
                'clusters': cluster_info,
                'average_cluster_size': np.mean([c['size'] for c in cluster_info]) if cluster_info else 0,
                'average_cohesion': np.mean([c['cohesion'] for c in cluster_info]) if cluster_info else 0
            }
        
        return cluster_analyses
    
    def _infer_cluster_theme(self, concept_names, dominant_type):
        """Infer the main theme of a cluster based on concept names."""
        # Count common words across concept names
        all_words = []
        for name in concept_names:
            words = name.lower().split()
            all_words.extend(words)
        
        word_counts = Counter(all_words)
        
        # Remove common stop words
        stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were'}
        filtered_words = {word: count for word, count in word_counts.items() 
                         if word not in stop_words and len(word) > 2}
        
        if filtered_words:
            # Get most common meaningful words
            top_words = sorted(filtered_words.items(), key=lambda x: x[1], reverse=True)[:3]
            theme_words = [word for word, count in top_words if count > 1]
            
            if theme_words:
                return f"{dominant_type}: {', '.join(theme_words)}"
        
        return f"{dominant_type}: general"

# Initialize clusterer and run analysis
if cluster_data:
    clusterer = MultiAlgorithmClusterer(cluster_data)
    
    print("\n🎯 MULTI-ALGORITHM CLUSTERING ANALYSIS")
    print("=" * 45)
    
    # Apply clustering algorithms
    clustering_results = clusterer.apply_clustering_algorithms(n_clusters=8)
    
    # Analyze cluster characteristics
    cluster_analyses = clusterer.analyze_cluster_characteristics(clustering_results)
    
    print(f"\n📊 CLUSTERING COMPARISON SUMMARY:")
    print("-" * 40)
    
    for method_name, analysis in cluster_analyses.items():
        print(f"{analysis['algorithm']:20}: {analysis['n_clusters']:2d} clusters, "
              f"silhouette = {analysis['silhouette_score']:5.3f}, "
              f"avg_size = {analysis['average_cluster_size']:4.1f}")
    
    # Show best clustering result details
    best_method = max(cluster_analyses.items(), 
                     key=lambda x: x[1]['silhouette_score'] if x[1]['silhouette_score'] > -1 else -2)
    
    print(f"\n🏆 Best Method: {best_method[0]} ({best_method[1]['algorithm']})")
    print("-" * 50)
    
    for cluster in best_method[1]['clusters'][:6]:  # Show the first 6 clusters
        print(f"Cluster {cluster['cluster_id']:2d}: {cluster['size']:3d} concepts | {cluster['theme'][:40]}")
        print(f"           Examples: {', '.join(cluster['representatives'][:2])}...")
        print(f"           Cohesion: {cluster['cohesion']:.3f}, Diversity: {cluster['diversity_score']:.3f}")
        print()
        
else:
    print("Clusterer not available - need clustering data")
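
The run above fixes `n_clusters=8`, while `find_optimal_k` is defined but never called. A minimal sketch of the same kind of sweep on synthetic data is shown here, picking the k with the highest silhouette score. The blob data and the `best_k` variable are illustrative assumptions; on real embeddings the peak is often much flatter, and the other metrics collected by `find_optimal_k` may disagree.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for embeddings: four tight, well-separated blobs
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

# Sweep candidate cluster counts and record the silhouette score for each
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    marker = " <- best" if k == best_k else ""
    print(f"k={k}: silhouette={score:.3f}{marker}")
```

With the notebook's own class, the equivalent call would be `clusterer.find_optimal_k(k_range=(2, 12))`, which returns a dictionary keyed by method with parallel `k_values` and `silhouette_scores` lists.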

## Step 3: Hierarchical Taxonomy Construction

Build hierarchical taxonomies from the discovered clusters.
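
A single linkage matrix can be cut at several granularities, which is what gives the taxonomy its levels. The following sketch uses random points in place of embeddings (an assumption for illustration) and checks that, with Ward linkage and the `maxclust` criterion, every fine cluster sits inside exactly one coarse cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))  # random stand-in for concept embeddings

Z = linkage(X, method="ward")

# Cut the same tree at two granularities
coarse = fcluster(Z, 3, criterion="maxclust")
fine = fcluster(Z, 6, criterion="maxclust")

# Map each fine cluster to the coarse cluster(s) its members fall in
parents = {
    int(f): sorted({int(c) for c in coarse[fine == f]})
    for f in np.unique(fine)
}
is_nested = all(len(p) == 1 for p in parents.values())

print(f"coarse clusters: {len(np.unique(coarse))}, fine clusters: {len(np.unique(fine))}")
print(f"fine -> coarse parents: {parents}")
print(f"nested: {is_nested}")
```

Because the cuts are nested, a parent-child mapping like `parents` is one way to link the levels into an explicit tree rather than treating each level as an independent clustering.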

In [None]:
class TaxonomyBuilder:
    def __init__(self, cluster_data, clustering_results):
        """Initialize taxonomy builder with clustering results."""
        self.cluster_data = cluster_data
        self.clustering_results = clustering_results
        self.embeddings = cluster_data['embeddings']
        self.concept_df = cluster_data['concept_df']
        
        print(f"🌳 Taxonomy builder initialized")
    
    def build_hierarchical_taxonomy(self, method='ward', max_depth=4):
        """Build hierarchical taxonomy using agglomerative clustering."""
        print(f"🏗️ Building hierarchical taxonomy...")
        
        # Standardize embeddings
        scaler = StandardScaler()
        embeddings_scaled = scaler.fit_transform(self.embeddings)
        
        # Compute linkage matrix
        print(f"  Computing linkage matrix with {method} method...")
        linkage_matrix = linkage(embeddings_scaled, method=method)
        
        # Build taxonomy at different levels
        taxonomy_levels = {}
        
        for depth in range(2, max_depth + 1):
            n_clusters = min(depth * 2, len(self.concept_df) // 10)  # Adaptive cluster count
            n_clusters = max(2, n_clusters)
            
            labels = fcluster(linkage_matrix, n_clusters, criterion='maxclust')
            
            # Analyze clusters at this level
            level_clusters = self._analyze_taxonomy_level(labels, depth)
            taxonomy_levels[depth] = {
                'n_clusters': len(set(labels)),  # maxclust may yield fewer clusters than requested
                'labels': labels,
                'clusters': level_clusters,
                'silhouette_score': silhouette_score(embeddings_scaled, labels) if len(set(labels)) > 1 else -1
            }

            print(f"  Level {depth}: {taxonomy_levels[depth]['n_clusters']} clusters, silhouette = {taxonomy_levels[depth]['silhouette_score']:.3f}")
        
        # Build tree structure
        taxonomy_tree = self._build_tree_structure(linkage_matrix, taxonomy_levels)
        
        return {
            'linkage_matrix': linkage_matrix,
            'taxonomy_levels': taxonomy_levels,
            'taxonomy_tree': taxonomy_tree,
            'method': method
        }
    
    def _analyze_taxonomy_level(self, labels, depth):
        """Analyze clusters at a specific taxonomy level."""
        unique_labels = set(labels)
        level_clusters = {}
        
        for cluster_id in unique_labels:
            cluster_mask = labels == cluster_id
            cluster_concepts = self.concept_df[cluster_mask]
            cluster_embeddings = self.embeddings[cluster_mask]
            
            if len(cluster_concepts) > 0:
                # Calculate cluster statistics
                concept_types = cluster_concepts['concept_type'].value_counts()
                
                # Find representative concepts
                cluster_centroid = np.mean(cluster_embeddings, axis=0)
                distances = np.linalg.norm(cluster_embeddings - cluster_centroid, axis=1)
                representative_idx = np.argmin(distances)
                representative_concept = cluster_concepts.iloc[representative_idx]['name']
                
                # Generate taxonomy label
                dominant_type = concept_types.index[0] if not concept_types.empty else 'general'
                taxonomy_label = self._generate_taxonomy_label(cluster_concepts, dominant_type, depth)
                
                level_clusters[cluster_id] = {
                    'taxonomy_label': taxonomy_label,
                    'size': len(cluster_concepts),
                    'dominant_type': dominant_type,
                    'type_distribution': dict(concept_types),
                    'representative_concept': representative_concept,
                    'concept_list': cluster_concepts['name'].tolist(),
                    'diversity': len(concept_types) / len(cluster_concepts)
                }
        
        return level_clusters
    
    def _generate_taxonomy_label(self, cluster_concepts, dominant_type, depth):
        """Generate hierarchical taxonomy labels."""
        
        # Analyze concept names to find common themes
        concept_names = cluster_concepts['name'].tolist()
        
        # Extract common words (skipping words of two characters or fewer)
        all_words = []
        for name in concept_names:
            words = [word.lower() for word in name.split() if len(word) > 2]
            all_words.extend(words)
        
        word_counts = Counter(all_words)
        common_words = [word for word, count in word_counts.most_common(5) if count > 1]
        
        # Create hierarchical label based on depth
        if depth == 2:  # Top level - broad categories
            # Cover every concept_type assigned in prepare_clustering_data
            broad_categories = {
                'biological_entity': 'Biological Entities',
                'methodology': 'Methods & Algorithms',
                'data_resource': 'Data Resources',
                'software_tool': 'Software Tools',
                'database': 'Databases & Repositories',
                'medical_condition': 'Medical Conditions',
                'role': 'Roles & Contributors',
                'data_usage': 'Data Use & Permissions',
                'general': 'General Concepts'
            }
            return broad_categories.get(dominant_type, 'General Concepts')
        
        elif depth == 3:  # Mid level - more specific
            if common_words:
                theme_word = common_words[0].title()
                return f"{dominant_type.replace('_', ' ').title()}: {theme_word}-related"
            else:
                return f"{dominant_type.replace('_', ' ').title()}: General"
        
        else:  # Deeper levels - very specific
            if len(common_words) >= 2:
                theme = ' & '.join(common_words[:2]).title()
                return f"{theme} Concepts"
            elif common_words:
                return f"{common_words[0].title()} Specialized"
            else:
                return f"Cluster {depth}-{len(cluster_concepts)}"
    
    def _build_tree_structure(self, linkage_matrix, taxonomy_levels):
        """Build tree structure from linkage matrix."""
        print(f"  Building tree structure...")
        
        # Convert linkage matrix to tree
        tree = to_tree(linkage_matrix)
        
        # Extract tree structure with taxonomy labels
        def traverse_tree(node, level=0, max_level=3):
            if level > max_level:
                return None
            
            tree_node = {
                'id': node.id,
                'level': level,
                'distance': node.dist,
                'count': node.count
            }
            
            if node.is_leaf():
                # Leaf node - individual concept
                concept = self.concept_df.iloc[node.id]
                tree_node.update({
                    'type': 'leaf',
                    'concept_name': concept['name'],
                    'concept_type': concept['concept_type']
                })
            else:
                # Internal node - cluster
                tree_node.update({
                    'type': 'internal',
                    'left': traverse_tree(node.left, level + 1, max_level),
                    'right': traverse_tree(node.right, level + 1, max_level)
                })
                
                # Add taxonomy information if available
                if level + 2 in taxonomy_levels:  # Map to appropriate taxonomy level
                    # This is a simplified mapping - in practice, you'd need more sophisticated tree-to-cluster mapping
                    tree_node['taxonomy_info'] = f"Internal Node L{level}"
            
            return tree_node
        
        return traverse_tree(tree)
    
    def create_concept_hierarchy_graph(self, taxonomy_result, max_concepts=50):
        """Create a graph representation of the concept hierarchy."""
        print(f"📊 Creating concept hierarchy graph...")
        
        # Use the best taxonomy level (highest silhouette score)
        best_level = max(taxonomy_result['taxonomy_levels'].items(), 
                        key=lambda x: x[1]['silhouette_score'])
        
        level_data = best_level[1]
        clusters = level_data['clusters']
        
        # Build NetworkX graph
        G = nx.Graph()
        
        # Add cluster nodes (taxonomy categories)
        cluster_nodes = {}
        for cluster_id, cluster_info in clusters.items():
            cluster_node = f"cluster_{cluster_id}"
            G.add_node(cluster_node,
                      node_type='cluster',
                      label=cluster_info['taxonomy_label'],
                      size=cluster_info['size'],
                      dominant_type=cluster_info['dominant_type'])
            cluster_nodes[cluster_id] = cluster_node
        
        # Add concept nodes (sample from each cluster)
        concept_count = 0
        for cluster_id, cluster_info in clusters.items():
            if concept_count >= max_concepts:
                break
                
            cluster_node = cluster_nodes[cluster_id]
            
            # Add representative concepts from this cluster
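            # Also bound by the stored concept list, which may be shorter than 'size'
            concepts_to_add = min(5, cluster_info['size'], len(cluster_info['concept_list']),
                                  max_concepts - concept_count)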
            
            for i in range(concepts_to_add):
                concept_name = cluster_info['concept_list'][i]
                concept_node = f"concept_{concept_count}"
                
                G.add_node(concept_node,
                          node_type='concept', 
                          label=concept_name[:30],  # Truncate long names
                          full_name=concept_name,
                          cluster_id=cluster_id)
                
                # Connect concept to its cluster
                G.add_edge(cluster_node, concept_node, edge_type='contains')
                
                concept_count += 1
        
        print(f"  Created graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
        
        return G

# Build taxonomies
if 'clustering_results' in globals() and globals().get('cluster_data'):
    taxonomy_builder = TaxonomyBuilder(cluster_data, clustering_results)
    
    print("\n🌳 HIERARCHICAL TAXONOMY CONSTRUCTION")
    print("=" * 45)
    
    # Build hierarchical taxonomy
    taxonomy_result = taxonomy_builder.build_hierarchical_taxonomy(
        method='ward', max_depth=4
    )
    
    # Show taxonomy levels
    print(f"\n📊 TAXONOMY LEVEL ANALYSIS:")
    print("-" * 35)
    
    for depth, level_data in taxonomy_result['taxonomy_levels'].items():
        print(f"Level {depth}: {level_data['n_clusters']} categories, silhouette = {level_data['silhouette_score']:.3f}")
        
        # Show categories at this level
        for cluster_id, cluster_info in sorted(level_data['clusters'].items(), 
                                             key=lambda x: x[1]['size'], reverse=True)[:4]:
            print(f"  • {cluster_info['taxonomy_label']:30} ({cluster_info['size']:3d} concepts) - {cluster_info['dominant_type']}")
        print()
    
    # Create concept hierarchy graph
    hierarchy_graph = taxonomy_builder.create_concept_hierarchy_graph(
        taxonomy_result, max_concepts=40
    )
    
    print(f"✅ Taxonomy construction completed!")
    
else:
    print("Taxonomy builder not available")

## Step 4: Interactive Taxonomy Visualization

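Create interactive visualizations to explore the discovered taxonomies: a UMAP scatter plot of the best clustering, a dendrogram, a taxonomy-level dashboard, and a graph-based explorer.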
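
The explorer in this step draws every graph edge in a single Plotly line trace. It does so by flattening the edge endpoints into coordinate lists with `None` between segments, which Plotly treats as a break in the line. A minimal sketch of that flattening step follows; the `flatten_edges` helper and the hand-written `pos` dictionary are hypothetical stand-ins for the NetworkX graph and its computed layout.

```python
def flatten_edges(edges, pos):
    """Flatten edges into x/y lists with None between segments.

    `edges` is an iterable of (source, target) pairs and `pos` maps each node
    to an (x, y) tuple. Both are hypothetical stand-ins for a graph and layout.
    """
    edge_x, edge_y = [], []
    for source, target in edges:
        x0, y0 = pos[source]
        x1, y1 = pos[target]
        # None ends the current segment so separate edges are not joined.
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
    return edge_x, edge_y


# Toy layout: one category node connected to two concept nodes.
pos = {"cluster_0": (0.0, 0.0), "concept_0": (1.0, 0.5), "concept_1": (1.0, -0.5)}
edges = [("cluster_0", "concept_0"), ("cluster_0", "concept_1")]

edge_x, edge_y = flatten_edges(edges, pos)
print(edge_x)
print(edge_y)
```

Keeping all edges in one trace avoids creating one trace per edge, which tends to slow rendering on larger graphs.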

In [None]:
def create_taxonomy_visualizations(cluster_data, clustering_results, taxonomy_result):
    """Create comprehensive taxonomy visualizations."""
    
    print(f"🎨 Creating taxonomy visualizations...")
    
    # 1. Dimensionality reduction for visualization
    print(f"  Reducing dimensionality for visualization...")
    
    embeddings = cluster_data['embeddings']
    concept_df = cluster_data['concept_df']
    
    # UMAP reduction
    umap_reducer = umap.UMAP(n_components=2, random_state=42, min_dist=0.1, n_neighbors=15)
    embeddings_2d = umap_reducer.fit_transform(embeddings)
    
    # Get best clustering result
    best_clustering = max(clustering_results.items(), 
                         key=lambda x: x[1].get('silhouette_score', -2))
    best_method = best_clustering[0]
    # Convert to an array so the element-wise comparisons below also work for plain lists
    best_labels = np.asarray(best_clustering[1]['labels'])
    
    print(f"  Using {best_method} clustering for visualization")
    
    # 2. Create main clustering visualization
    fig_main = go.Figure()
    
    # Color palette for clusters
    colors = px.colors.qualitative.Set3
    unique_labels = sorted(set(best_labels))
    if -1 in unique_labels:  # Handle noise points for DBSCAN
        unique_labels.remove(-1)
        unique_labels.append(-1)  # Put noise at end
    
    # Plot each cluster
    for i, label in enumerate(unique_labels):
        cluster_mask = best_labels == label
        cluster_concepts = concept_df[cluster_mask]
        cluster_2d = embeddings_2d[cluster_mask]
        
        if len(cluster_2d) > 0:
            color = 'lightgray' if label == -1 else colors[i % len(colors)]
            name = 'Noise' if label == -1 else f'Cluster {label}'
            
            # Determine cluster theme
            if len(cluster_concepts) > 0:
                most_common_type = cluster_concepts['concept_type'].mode()
                if not most_common_type.empty:
                    theme = most_common_type.iloc[0].replace('_', ' ').title()
                    name = f'{name}: {theme}'
            
            fig_main.add_trace(go.Scatter(
                x=cluster_2d[:, 0],
                y=cluster_2d[:, 1],
                mode='markers',
                name=name,
                marker=dict(
                    size=8,
                    color=color,
                    opacity=0.7,
                    line=dict(width=1, color='white')
                ),
                text=cluster_concepts['name'].tolist(),
                hovertemplate='<b>%{text}</b><br>' +
                             f'{name}<br>' +  # name already starts with "Cluster N" or "Noise"
                             'Type: %{customdata}<br>' +
                             '<extra></extra>',
                customdata=cluster_concepts['concept_type'].tolist()
            ))
    
    fig_main.update_layout(
        title=f'Concept Clustering Visualization ({best_method.title()})<br><sub>UMAP projection of {len(embeddings)} concepts</sub>',
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        width=1000,
        height=700,
        hovermode='closest'
    )
    
    # 3. Create dendrogram for hierarchical taxonomy
    if 'linkage_matrix' in taxonomy_result:
        print(f"  Creating dendrogram...")
        
        fig_dendro, ax_dendro = plt.subplots(figsize=(15, 8))
        
        # Sample concepts for cleaner dendrogram
        max_leaves = 50
        if len(concept_df) > max_leaves:
            # Seeded sampling keeps the dendrogram reproducible across runs
            sample_indices = np.random.default_rng(42).choice(len(concept_df), max_leaves, replace=False)
            sample_linkage = linkage(embeddings[sample_indices], method='ward')
            sample_labels = [concept_df.iloc[i]['name'][:20] for i in sample_indices]
        else:
            sample_linkage = taxonomy_result['linkage_matrix']
            sample_labels = [name[:20] for name in concept_df['name'].tolist()]
        
        dendrogram(sample_linkage, 
                  labels=sample_labels,
                  leaf_rotation=90,
                  leaf_font_size=8,
                  ax=ax_dendro)
        
        ax_dendro.set_title('Hierarchical Concept Taxonomy (Sample)', fontsize=14)
        ax_dendro.set_xlabel('Concepts', fontsize=12)
        ax_dendro.set_ylabel('Distance', fontsize=12)
        
        plt.tight_layout()
        plt.show()
    
    # 4. Create taxonomy level comparison
    if 'taxonomy_levels' in taxonomy_result:
        print(f"  Creating taxonomy level comparison...")
        
        fig_levels = make_subplots(
            rows=2, cols=2,
            subplot_titles=['Cluster Sizes by Level', 'Silhouette Scores by Level', 
                          'Category Distribution', 'Concept Type Analysis'],
            specs=[[{'type': 'bar'}, {'type': 'scatter'}],
                   [{'type': 'bar'}, {'type': 'pie'}]]
        )
        
        # Analyze taxonomy levels
        levels = list(taxonomy_result['taxonomy_levels'].keys())
        silhouette_scores = [taxonomy_result['taxonomy_levels'][level]['silhouette_score'] for level in levels]
        cluster_counts = [taxonomy_result['taxonomy_levels'][level]['n_clusters'] for level in levels]
        
        # Plot 1: Cluster counts by level
        fig_levels.add_trace(
            go.Bar(x=levels, y=cluster_counts, name='Clusters', marker_color='skyblue'),
            row=1, col=1
        )
        
        # Plot 2: Silhouette scores
        fig_levels.add_trace(
            go.Scatter(x=levels, y=silhouette_scores, mode='lines+markers', 
                      name='Silhouette', marker_color='orange'),
            row=1, col=2
        )
        
        # Plot 3: Category sizes for best level
        best_tax_level = max(taxonomy_result['taxonomy_levels'].items(), 
                           key=lambda x: x[1]['silhouette_score'])
        
        category_sizes = [info['size'] for info in best_tax_level[1]['clusters'].values()]
        category_labels = [info['taxonomy_label'][:20] for info in best_tax_level[1]['clusters'].values()]
        
        fig_levels.add_trace(
            go.Bar(x=category_labels, y=category_sizes, name='Category Sizes', 
                  marker_color='lightgreen'),
            row=2, col=1
        )
        
        # Plot 4: Concept type distribution
        type_counts = concept_df['concept_type'].value_counts()
        fig_levels.add_trace(
            go.Pie(labels=type_counts.index, values=type_counts.values, name="Types"),
            row=2, col=2
        )
        
        fig_levels.update_layout(
            height=800,
            title_text="Taxonomy Analysis Dashboard",
            showlegend=False
        )
        
        fig_levels.update_xaxes(title_text="Taxonomy Level", row=1, col=1)
        fig_levels.update_xaxes(title_text="Taxonomy Level", row=1, col=2)
        fig_levels.update_xaxes(title_text="Categories", tickangle=45, row=2, col=1)
        
        fig_levels.update_yaxes(title_text="Number of Clusters", row=1, col=1)
        fig_levels.update_yaxes(title_text="Silhouette Score", row=1, col=2)
        fig_levels.update_yaxes(title_text="Number of Concepts", row=2, col=1)
        
        fig_levels.show()
    
    return fig_main

def create_interactive_taxonomy_explorer(hierarchy_graph):
    """Create interactive taxonomy explorer."""
    
    print(f"🔍 Creating interactive taxonomy explorer...")
    
    # Layout the hierarchy graph
    pos = nx.spring_layout(hierarchy_graph, k=3, iterations=50, seed=42)  # seed for a reproducible layout
    
    # Separate cluster and concept nodes
    cluster_nodes = [(node, data) for node, data in hierarchy_graph.nodes(data=True) 
                    if data['node_type'] == 'cluster']
    concept_nodes = [(node, data) for node, data in hierarchy_graph.nodes(data=True) 
                    if data['node_type'] == 'concept']
    
    # Create edge traces
    edge_x = []
    edge_y = []
    
    for edge in hierarchy_graph.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])
    
    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=1, color='#888'),
        hoverinfo='none',
        mode='lines',
        name='Connections'
    )
    
    # Create cluster node trace
    cluster_x = [pos[node][0] for node, data in cluster_nodes]
    cluster_y = [pos[node][1] for node, data in cluster_nodes]
    cluster_text = [data['label'] for node, data in cluster_nodes]
    cluster_sizes = [min(50, max(20, data['size'] * 2)) for node, data in cluster_nodes]
    cluster_types = [data['dominant_type'] for node, data in cluster_nodes]
    
    cluster_trace = go.Scatter(
        x=cluster_x, y=cluster_y,
        mode='markers+text',
        text=cluster_text,
        textposition="middle center",
        name='Categories',
        marker=dict(
            size=cluster_sizes,
            color='lightblue',
            line=dict(width=2, color='blue')
        ),
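        # Carry the real cluster size in customdata; marker.size is clamped to 20-50 for display
        customdata=[[data['dominant_type'], data['size']] for node, data in cluster_nodes],
        hovertemplate='<b>%{text}</b><br>' +
                     'Type: %{customdata[0]}<br>' +
                     'Size: %{customdata[1]} concepts<br>' +
                     '<extra></extra>'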
    )
    
    # Create concept node trace
    concept_x = [pos[node][0] for node, data in concept_nodes]
    concept_y = [pos[node][1] for node, data in concept_nodes]
    concept_text = [data['label'] for node, data in concept_nodes]
    concept_full_names = [data['full_name'] for node, data in concept_nodes]
    
    concept_trace = go.Scatter(
        x=concept_x, y=concept_y,
        mode='markers',
        name='Concepts',
        marker=dict(
            size=12,
            color='lightcoral',
            line=dict(width=1, color='red')
        ),
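        # Look up each concept's category label (assumes cluster nodes are named "cluster_<id>")
        customdata=[[data['full_name'],
                     hierarchy_graph.nodes[f"cluster_{data['cluster_id']}"]['label']]
                    for node, data in concept_nodes],
        hovertemplate='<b>%{customdata[0]}</b><br>' +
                     'Category: %{customdata[1]}<br>' +
                     '<extra></extra>'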
    )
    
    # Create figure
    fig = go.Figure(data=[edge_trace, cluster_trace, concept_trace],
                   layout=go.Layout(
                        # 'titlefont' is deprecated in Plotly; set the font through the title dict
                        title=dict(
                            text='Interactive Taxonomy Explorer<br><sub>Blue: Categories, Red: Concepts, Lines: Hierarchical relationships</sub>',
                            font=dict(size=16)
                        ),
                        showlegend=True,
                        hovermode='closest',
                        margin=dict(b=20,l=5,r=5,t=60),
                        annotations=[ dict(
                            text="Click and drag to explore the taxonomy. Hover for details.",
                            showarrow=False,
                            xref="paper", yref="paper",
                            x=0.005, y=-0.002 ) ],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        width=1200,
                        height=800
                   ))
    
    return fig

# Create visualizations
# Use globals(): inside a generator expression, locals() only sees the generator's own scope
if all(var in globals() for var in ['cluster_data', 'clustering_results', 'taxonomy_result']):
    print("\n🎨 TAXONOMY VISUALIZATION DASHBOARD")
    print("=" * 40)
    
    # Create main visualizations
    main_fig = create_taxonomy_visualizations(cluster_data, clustering_results, taxonomy_result)
    main_fig.show()
    
    # Create interactive explorer
    if 'hierarchy_graph' in locals():
        explorer_fig = create_interactive_taxonomy_explorer(hierarchy_graph)
        explorer_fig.show()
    
    print(f"\n✅ Visualization dashboard created!")
    
else:
    print("Cannot create visualizations - missing required data")

## Step 5: Taxonomy Quality Assessment

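Evaluate the quality of the discovered taxonomies using internal clustering metrics (silhouette, Calinski-Harabasz, Davies-Bouldin), cluster-size balance, semantic coherence of concept types, and consistency with the original ontology hierarchy.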
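
Two of the measures used in this step, semantic coherence and cluster-size balance, are simple enough to check by hand. The sketch below recomputes them in plain Python on invented toy data. The function names and the data are illustrative assumptions, not part of on2vec; the formulas follow the assessor code in this step (share of the dominant concept type per cluster, and one minus the coefficient of variation of cluster sizes, clipped at zero).

```python
from collections import Counter
from statistics import mean, pstdev


def semantic_coherence(labels, concept_types):
    """Mean share of the most common concept type within each cluster."""
    by_cluster = {}
    for label, ctype in zip(labels, concept_types):
        if label == -1:  # skip noise points (DBSCAN convention)
            continue
        by_cluster.setdefault(label, []).append(ctype)
    ratios = [Counter(types).most_common(1)[0][1] / len(types)
              for types in by_cluster.values()]
    return mean(ratios)


def size_balance(labels):
    """1 - (std / mean) of cluster sizes, clipped at 0; 1.0 means equal sizes."""
    sizes = list(Counter(label for label in labels if label != -1).values())
    # pstdev is the population standard deviation, matching np.std's default
    return max(0.0, 1 - pstdev(sizes) / mean(sizes))


# Toy data (invented for illustration): 6 concepts in 2 clusters.
labels = [0, 0, 0, 1, 1, 1]
concept_types = ["process", "process", "disease", "anatomy", "anatomy", "anatomy"]

coherence = semantic_coherence(labels, concept_types)
balance = size_balance(labels)
print(f"coherence={coherence:.3f}, balance={balance:.3f}")
```

Cluster 0 is two-thirds "process" and cluster 1 is entirely "anatomy", so the coherence is the mean of 2/3 and 1; both clusters have three members, so the balance is 1.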

In [None]:
class TaxonomyQualityAssessor:
    def __init__(self, cluster_data, clustering_results, taxonomy_result):
        """Initialize taxonomy quality assessor."""
        self.cluster_data = cluster_data
        self.clustering_results = clustering_results
        self.taxonomy_result = taxonomy_result
        self.embeddings = cluster_data['embeddings']
        self.concept_df = cluster_data['concept_df']
        
        print(f"📏 Taxonomy quality assessor initialized")
    
    def assess_clustering_quality(self):
        """Assess the quality of different clustering approaches."""
        print(f"📊 Assessing clustering quality...")
        
        quality_metrics = {}
        
        # Standardize embeddings for consistent evaluation
        scaler = StandardScaler()
        embeddings_scaled = scaler.fit_transform(self.embeddings)
        
        for method_name, result in self.clustering_results.items():
            labels = result['labels']
            
            # Skip if clustering failed or has only one cluster
            unique_labels = set(labels)
            if len(unique_labels) <= 1:
                continue
            
            # Remove noise points for DBSCAN evaluation
            if -1 in unique_labels:
                non_noise_mask = labels != -1
                if np.sum(non_noise_mask) < len(labels) * 0.1:  # Too many noise points
                    continue
                eval_embeddings = embeddings_scaled[non_noise_mask]
                eval_labels = labels[non_noise_mask]
            else:
                eval_embeddings = embeddings_scaled
                eval_labels = labels
            
            if len(set(eval_labels)) <= 1:
                continue
            
            # Calculate quality metrics
            try:
                silhouette = silhouette_score(eval_embeddings, eval_labels)
                calinski_harabasz = calinski_harabasz_score(eval_embeddings, eval_labels)
                davies_bouldin = davies_bouldin_score(eval_embeddings, eval_labels)
                
                # Calculate additional metrics
                n_clusters = len(set(eval_labels))
                # Cast to int so the sizes stay numeric when the report is exported as JSON
                cluster_sizes = [int(np.sum(eval_labels == label)) for label in set(eval_labels)]
                size_std = np.std(cluster_sizes)
                size_balance = 1 - (size_std / np.mean(cluster_sizes))  # Closer to 1 = more balanced
                
                # Semantic coherence (based on concept types). Pass the full-length
                # labels so the boolean masks align with concept_df; noise points
                # are skipped inside the helper.
                semantic_coherence = self._calculate_semantic_coherence(labels)
                
                quality_metrics[method_name] = {
                    'algorithm': result['algorithm'],
                    'n_clusters': n_clusters,
                    'silhouette_score': silhouette,
                    'calinski_harabasz_score': calinski_harabasz,
                    'davies_bouldin_score': davies_bouldin,  # Lower is better
                    'size_balance': max(0, size_balance),
                    'semantic_coherence': semantic_coherence,
                    'cluster_sizes': cluster_sizes,
                    'n_noise_points': int(np.sum(labels == -1))
                }
                
            except Exception as e:
                print(f"  ⚠️ Failed to evaluate {method_name}: {e}")
                continue
        
        return quality_metrics
    
    def _calculate_semantic_coherence(self, labels):
        """Calculate semantic coherence based on concept types within clusters.
        
        Expects one label per row of concept_df; noise points (-1) are ignored.
        """
        labels = np.asarray(labels)
        coherence_scores = []
        
        for cluster_id in set(labels.tolist()):
            if cluster_id == -1:  # Noise is not a cluster
                continue
            cluster_mask = labels == cluster_id
            cluster_concepts = self.concept_df[cluster_mask]
            
            if len(cluster_concepts) <= 1:
                coherence_scores.append(1.0)  # Single concept is perfectly coherent
                continue
            
            # Calculate type diversity within cluster (lower diversity = higher coherence)
            type_counts = cluster_concepts['concept_type'].value_counts()
            dominant_type_ratio = type_counts.iloc[0] / len(cluster_concepts)
            
            # Coherence = how dominated the cluster is by its main type
            coherence_scores.append(dominant_type_ratio)
        
        return np.mean(coherence_scores)
    
    def assess_taxonomy_hierarchy(self):
        """Assess the quality of hierarchical taxonomy structure."""
        print(f"🌳 Assessing taxonomy hierarchy quality...")
        
        hierarchy_quality = {}
        
        if 'taxonomy_levels' not in self.taxonomy_result:
            return {}
        
        taxonomy_levels = self.taxonomy_result['taxonomy_levels']
        
        # Analyze each taxonomy level
        level_qualities = []
        
        for level, level_data in taxonomy_levels.items():
            clusters = level_data['clusters']
            
            # Calculate level-specific metrics
            cluster_sizes = [info['size'] for info in clusters.values()]
            cluster_diversities = [info['diversity'] for info in clusters.values()]
            
            level_quality = {
                'level': level,
                'n_clusters': level_data['n_clusters'],
                'silhouette_score': level_data['silhouette_score'],
                'average_cluster_size': np.mean(cluster_sizes),
                'size_standard_deviation': np.std(cluster_sizes),
                'average_diversity': np.mean(cluster_diversities),
                # Clipped at 0, consistent with size_balance in assess_clustering_quality
                'balance_score': max(0.0, 1 - (np.std(cluster_sizes) / np.mean(cluster_sizes))) if cluster_sizes else 0
            }
            
            # Calculate taxonomy label quality
            label_lengths = [len(info['taxonomy_label']) for info in clusters.values()]
            level_quality['label_consistency'] = 1 - (np.std(label_lengths) / np.mean(label_lengths)) if label_lengths else 0
            
            level_qualities.append(level_quality)
        
        # Overall hierarchy assessment
        hierarchy_quality = {
            'n_levels': len(taxonomy_levels),
            'level_qualities': level_qualities,
            'best_level': max(level_qualities, key=lambda x: x['silhouette_score']),
            'hierarchy_depth_score': min(1.0, len(taxonomy_levels) / 4.0),  # Grows with depth, saturating at 1.0 for 4+ levels
            'overall_coherence': np.mean([lq['silhouette_score'] for lq in level_qualities])
        }
        
        return hierarchy_quality
    
    def validate_against_original_ontology(self):
        """Validate discovered clusters against original ontology structure."""
        print(f"✅ Validating against original ontology...")
        
        validation_results = {}
        
        if not self.cluster_data.get('ontology_structure'):
            print(f"  ⚠️ No original ontology structure available")
            return {'error': 'No original ontology structure available'}
        
        # Get original ontology structure
        edge_index = self.cluster_data['ontology_structure']['edge_index']
        class_mapping = self.cluster_data['ontology_structure']['class_mapping']
        
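        # Map ontology classes to positional concept indices.
        # Cluster labels are positional arrays, so positions are used rather than
        # DataFrame index labels; a single lookup dict also avoids rescanning
        # concept_df for every class.
        node_id_to_pos = {}
        for pos, node_id in enumerate(self.concept_df['node_id']):
            node_id_to_pos.setdefault(node_id, pos)  # Keep the first match
        
        ontology_to_concept = {}
        for ont_class, ont_idx in class_mapping.items():
            class_iri = ont_class.iri if hasattr(ont_class, 'iri') else str(ont_class)
            
            # Find matching concept
            if class_iri in node_id_to_pos:
                ontology_to_concept[ont_idx] = node_id_to_pos[class_iri]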
        
        if len(ontology_to_concept) < 10:
            return {'error': 'Insufficient overlap between ontology and concepts'}
        
        print(f"  Found {len(ontology_to_concept)} matching concepts")
        
        # Analyze clustering consistency with ontology hierarchy
        for method_name, result in self.clustering_results.items():
            labels = result['labels']
            
            # Calculate consistency score
            consistency_score = self._calculate_hierarchy_consistency(
                labels, edge_index, ontology_to_concept
            )
            
            validation_results[method_name] = {
                'hierarchy_consistency': consistency_score,
                'coverage': len(ontology_to_concept) / len(class_mapping)
            }
        
        return validation_results
    
    def _calculate_hierarchy_consistency(self, labels, edge_index, ontology_to_concept):
        """Calculate how well clusters respect original ontology hierarchy."""
        
        consistent_pairs = 0
        total_pairs = 0
        
        # Check hierarchical relationships
        for i in range(edge_index.shape[1]):
            parent_ont_idx = int(edge_index[1, i])  # Parent in hierarchy
            child_ont_idx = int(edge_index[0, i])   # Child in hierarchy
            
            # Check if both concepts are in our mapping
            if parent_ont_idx in ontology_to_concept and child_ont_idx in ontology_to_concept:
                parent_concept_idx = ontology_to_concept[parent_ont_idx]
                child_concept_idx = ontology_to_concept[child_ont_idx]
                
                # Check if they're in the same cluster (should be for good hierarchy preservation)
                if labels[parent_concept_idx] == labels[child_concept_idx]:
                    consistent_pairs += 1
                
                total_pairs += 1
        
        return consistent_pairs / total_pairs if total_pairs > 0 else 0
    
    def create_quality_report(self):
        """Create comprehensive quality assessment report."""
        print(f"📋 Creating comprehensive quality report...")
        
        # Gather all assessments
        clustering_quality = self.assess_clustering_quality()
        hierarchy_quality = self.assess_taxonomy_hierarchy()
        validation_results = self.validate_against_original_ontology()
        
        # Create summary report
        report = {
            'dataset_info': {
                'n_concepts': len(self.concept_df),
                'embedding_dim': self.embeddings.shape[1],
                'concept_types': self.concept_df['concept_type'].value_counts().to_dict()  # plain ints for JSON export
            },
            'clustering_assessment': clustering_quality,
            'hierarchy_assessment': hierarchy_quality,
            'validation_results': validation_results
        }
        
        return report

# Assess taxonomy quality
# Use globals(): inside a generator expression, locals() only sees the generator's own scope
if all(var in globals() for var in ['cluster_data', 'clustering_results', 'taxonomy_result']):
    assessor = TaxonomyQualityAssessor(cluster_data, clustering_results, taxonomy_result)
    
    print("\n📏 TAXONOMY QUALITY ASSESSMENT")
    print("=" * 35)
    
    # Create quality report
    quality_report = assessor.create_quality_report()
    
    # Display clustering assessment
    print(f"\n🎯 CLUSTERING QUALITY COMPARISON:")
    print("-" * 40)
    
    if quality_report['clustering_assessment']:
        # Create comparison table
        comparison_data = []
        
        for method, metrics in quality_report['clustering_assessment'].items():
            comparison_data.append({
                'Method': metrics['algorithm'],
                'Clusters': metrics['n_clusters'],
                'Silhouette': f"{metrics['silhouette_score']:.3f}",
                'Balance': f"{metrics['size_balance']:.3f}",
                'Coherence': f"{metrics['semantic_coherence']:.3f}",
                'Davies-Bouldin': f"{metrics['davies_bouldin_score']:.3f}"
            })
        
        comparison_df = pd.DataFrame(comparison_data)
        print(comparison_df.to_string(index=False))
        
        # Identify best method
        best_method = max(quality_report['clustering_assessment'].items(),
                         key=lambda x: x[1]['silhouette_score'])
        
        print(f"\n🏆 Best Clustering Method: {best_method[1]['algorithm']}")
        print(f"   Silhouette Score: {best_method[1]['silhouette_score']:.3f}")
        print(f"   Semantic Coherence: {best_method[1]['semantic_coherence']:.3f}")
        print(f"   Number of Clusters: {best_method[1]['n_clusters']}")
    
    # Display hierarchy assessment
    if quality_report['hierarchy_assessment']:
        print(f"\n🌳 HIERARCHY QUALITY:")
        print("-" * 25)
        
        hier_qual = quality_report['hierarchy_assessment']
        print(f"Number of Levels: {hier_qual['n_levels']}")
        print(f"Overall Coherence: {hier_qual['overall_coherence']:.3f}")
        print(f"Hierarchy Depth Score: {hier_qual['hierarchy_depth_score']:.3f}")
        
        if 'best_level' in hier_qual:
            best_level = hier_qual['best_level']
            print(f"\nBest Taxonomy Level: {best_level['level']}")
            print(f"  • Clusters: {best_level['n_clusters']}")
            print(f"  • Silhouette: {best_level['silhouette_score']:.3f}")
            print(f"  • Balance: {best_level['balance_score']:.3f}")
    
    # Display validation results
    if quality_report['validation_results'] and 'error' not in quality_report['validation_results']:
        print(f"\n✅ ONTOLOGY VALIDATION:")
        print("-" * 25)
        
        for method, validation in quality_report['validation_results'].items():
            print(f"{method:15}: consistency = {validation['hierarchy_consistency']:.3f}, coverage = {validation['coverage']:.3f}")
    
    print(f"\n📊 Quality assessment completed!")
    
else:
    print("Quality assessor not available")
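
The hierarchy-consistency score reported in this step is the fraction of child-parent edges in the original ontology whose two endpoints fall in the same cluster. A minimal plain-Python sketch on an invented five-concept ontology follows; `hierarchy_consistency` and the toy edges are illustrative assumptions, and edges are written as `(child, parent)` pairs of concept indices.

```python
def hierarchy_consistency(labels, edges):
    """Fraction of (child, parent) edges whose endpoints share a cluster label."""
    if not edges:
        return 0.0
    same = sum(1 for child, parent in edges if labels[child] == labels[parent])
    return same / len(edges)


# Toy ontology (invented): concept indices 0..4, edges as (child, parent).
edges = [(1, 0), (2, 0), (3, 2), (4, 2)]
labels = [0, 0, 1, 1, 1]  # cluster label per concept index

score = hierarchy_consistency(labels, edges)
print(score)
```

Only the edge from concept 2 to concept 0 crosses a cluster boundary, so three of four edges are consistent. Note that a single all-encompassing cluster would trivially score 1.0, so this number is best read alongside the silhouette score rather than on its own.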

## Step 6: Export Discovered Taxonomies

Export the discovered taxonomies for further use in formats such as CSV (cluster assignments), JSON (taxonomy levels and quality report), and a plain-text summary.
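
The JSON export in this step relies on two small techniques: dropping the bulky `concept_list` entries before writing, and passing `default=str` so that values `json` cannot encode are written as strings instead of raising an error. A minimal sketch using only the standard library follows; the `level_clusters` dictionary and its `source` field are hypothetical and only mirror the general shape of the data.

```python
import json
from pathlib import Path

# Hypothetical level data, invented for illustration.
level_clusters = {
    0: {"taxonomy_label": "Biological Processes", "size": 3,
        "concept_list": ["a", "b", "c"], "source": Path("example.owl")},
}

# Drop the bulky concept lists and make the keys explicit strings.
export = {
    str(cid): {k: v for k, v in info.items() if k != "concept_list"}
    for cid, info in level_clusters.items()
}

# default=str is a fallback: anything json cannot encode is written as str(obj).
text = json.dumps(export, indent=2, default=str)
restored = json.loads(text)
print(restored["0"])
```

The same fallback applies to NumPy integers, which would be written as strings such as `"5"`; casting counts with `int()` before export keeps them numeric in the output file.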

In [None]:
def export_taxonomy_results(cluster_data, clustering_results, taxonomy_result, quality_report, 
                           output_dir='taxonomy_results'):
    """Export comprehensive taxonomy results."""
    
    # Import here so both JSON exports below work even if the taxonomy export is skipped
    import json
    
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    ontology_name = Path(cluster_data['ontology_file']).stem
    timestamp = pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')
    
    exported_files = []
    
    print(f"📁 Exporting taxonomy results to {output_dir}/...")
    
    # 1. Clustering results CSV
    if clustering_results:
        # Get best clustering result
        best_clustering = max(clustering_results.items(), 
                             key=lambda x: x[1].get('silhouette_score', -2))
        best_method = best_clustering[0]
        best_labels = best_clustering[1]['labels']
        
        # Create clustering results DataFrame
        cluster_df = cluster_data['concept_df'].copy()
        cluster_df['cluster_id'] = best_labels
        cluster_df['clustering_method'] = best_method
        
        clustering_file = Path(output_dir) / f"{ontology_name}_clustering_{timestamp}.csv"
        cluster_df.to_csv(clustering_file, index=False)
        exported_files.append(str(clustering_file))
        print(f"  ✓ Clustering results: {clustering_file}")
    
    # 2. Hierarchical taxonomy JSON
    if 'taxonomy_levels' in taxonomy_result:
        taxonomy_file = Path(output_dir) / f"{ontology_name}_taxonomy_{timestamp}.json"
        
        # Prepare taxonomy data for JSON serialization
        taxonomy_export = {
            'metadata': {
                'source_ontology': cluster_data['ontology_file'],
                'generation_timestamp': pd.Timestamp.now().isoformat(),
                'method': taxonomy_result['method'],
                'n_concepts': len(cluster_data['concept_df']),
                'n_levels': len(taxonomy_result['taxonomy_levels'])
            },
            'taxonomy_levels': {}
        }
        
        for level, level_data in taxonomy_result['taxonomy_levels'].items():
            taxonomy_export['taxonomy_levels'][str(level)] = {
                'n_clusters': level_data['n_clusters'],
                'silhouette_score': level_data['silhouette_score'],
                'clusters': {str(cid): {k: v for k, v in cinfo.items() if k != 'concept_list'} 
                           for cid, cinfo in level_data['clusters'].items()}
            }
        
        import json
        with open(taxonomy_file, 'w') as f:
            json.dump(taxonomy_export, f, indent=2, default=str)
        
        exported_files.append(str(taxonomy_file))
        print(f"  ✓ Hierarchical taxonomy: {taxonomy_file}")
    
    # 3. Quality assessment report
    if quality_report:
        import json  # imported here as well: the import in step 2 only runs when a taxonomy was built
        quality_file = Path(output_dir) / f"{ontology_name}_quality_report_{timestamp}.json"
        
        with open(quality_file, 'w') as f:
            json.dump(quality_report, f, indent=2, default=str)
        
        exported_files.append(str(quality_file))
        print(f"  ✓ Quality report: {quality_file}")
    
    # 4. Human-readable taxonomy summary
    summary_file = Path(output_dir) / f"{ontology_name}_taxonomy_summary_{timestamp}.txt"
    
    with open(summary_file, 'w') as f:
        f.write(f"CONCEPT TAXONOMY DISCOVERY REPORT\n")
        f.write(f"{'='*50}\n\n")
        f.write(f"Source Ontology: {cluster_data['ontology_file']}\n")
        f.write(f"Generated: {pd.Timestamp.now()}\n")
        f.write(f"Tool: on2vec clustering & taxonomy discovery\n\n")
        
        # Dataset summary
        f.write(f"DATASET SUMMARY\n")
        f.write(f"{'-'*20}\n")
        f.write(f"Total concepts: {len(cluster_data['concept_df'])}\n")
        f.write(f"Embedding dimensions: {cluster_data['embeddings'].shape[1]}\n")
        f.write(f"Concept types: {cluster_data['concept_df']['concept_type'].value_counts().to_dict()}\n\n")
        
        # Clustering results
        if quality_report and 'clustering_assessment' in quality_report:
            f.write(f"CLUSTERING RESULTS\n")
            f.write(f"{'-'*20}\n")
            
            for method, metrics in quality_report['clustering_assessment'].items():
                f.write(f"{metrics['algorithm']:20}: {metrics['n_clusters']} clusters, "
                       f"silhouette = {metrics['silhouette_score']:.3f}\n")
            f.write("\n")
        
        # Best taxonomy level
        if quality_report and 'hierarchy_assessment' in quality_report:
            hier_qual = quality_report['hierarchy_assessment']
            if 'best_level' in hier_qual:
                f.write(f"DISCOVERED TAXONOMY\n")
                f.write(f"{'-'*20}\n")
                
                best_level = hier_qual['best_level']
                f.write(f"Best taxonomic level: {best_level['level']}\n")
                f.write(f"Number of categories: {best_level['n_clusters']}\n")
                f.write(f"Quality score: {best_level['silhouette_score']:.3f}\n\n")
                
                # Show categories from best level
                if 'taxonomy_levels' in taxonomy_result:
                    best_level_data = taxonomy_result['taxonomy_levels'][best_level['level']]
                    
                    f.write(f"DISCOVERED CATEGORIES\n")
                    f.write(f"{'-'*25}\n")
                    
                    for cluster_id, cluster_info in sorted(best_level_data['clusters'].items(),
                                                          key=lambda x: x[1]['size'], reverse=True):
                        f.write(f"Category {cluster_id}: {cluster_info['taxonomy_label']}\n")
                        f.write(f"  Size: {cluster_info['size']} concepts\n")
                        f.write(f"  Dominant type: {cluster_info['dominant_type']}\n")
                        f.write(f"  Representative: {cluster_info['representative_concept']}\n")
                        f.write(f"  Examples: {', '.join(map(str, cluster_info['concept_list'][:3]))}\n\n")
        
        # Usage instructions
        f.write(f"HOW TO USE THESE RESULTS\n")
        f.write(f"{'-'*30}\n")
        f.write(f"1. Review the clustering CSV file for concept assignments\n")
        f.write(f"2. Use the taxonomy JSON for programmatic access to hierarchy\n")
        f.write(f"3. Check quality metrics to assess reliability\n")
        f.write(f"4. Consider the discovered categories for domain organization\n\n")
    
    exported_files.append(str(summary_file))
    print(f"  ✓ Summary report: {summary_file}")
    
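    # 5. Network file for visualization tools (if hierarchy graph exists)
    # Inside a function, locals() only sees this function's own variables, so a graph
    # built in an earlier notebook cell has to be looked up in globals().
    if 'hierarchy_graph' in globals():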
        network_file = Path(output_dir) / f"{ontology_name}_taxonomy_network_{timestamp}.gexf"
        
        try:
            nx.write_gexf(hierarchy_graph, network_file)
            exported_files.append(str(network_file))
            print(f"  ✓ Network file (GEXF): {network_file}")
        except Exception as e:
            print(f"  ⚠️ Could not export network file: {e}")
    
    print(f"\n✅ Exported {len(exported_files)} files to {output_dir}/")
    
    # Show file sizes
    print(f"\n📊 Export Summary:")
    for file_path in exported_files:
        file_size = Path(file_path).stat().st_size
        print(f"  • {Path(file_path).name}: {file_size:,} bytes")
    
    return exported_files

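# Export taxonomy results
# Check globals(): inside a generator expression, locals() refers to the generator's
# own scope and would not contain the notebook-level variables.
if all(var in globals() for var in ['cluster_data', 'clustering_results', 'taxonomy_result', 'quality_report']):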
    exported_files = export_taxonomy_results(
        cluster_data, clustering_results, taxonomy_result, quality_report
    )
    
    # Final summary
    print(f"\n🎉 TAXONOMY DISCOVERY COMPLETE!")
    print("=" * 40)
    print(f"✅ Multi-algorithm clustering analysis performed")
    print(f"✅ Hierarchical taxonomies constructed")
    print(f"✅ Quality assessment completed")
    print(f"✅ Interactive visualizations created")
    print(f"✅ Results exported in multiple formats")
    
    # Best recommendations
    if quality_report and 'clustering_assessment' in quality_report:
        best_method = max(quality_report['clustering_assessment'].items(),
                         key=lambda x: x[1]['silhouette_score'])
        
        print(f"\n🏆 RECOMMENDATIONS:")
        print(f"  • Best clustering method: {best_method[1]['algorithm']}")
        print(f"  • Recommended clusters: {best_method[1]['n_clusters']}")
        print(f"  • Quality score: {best_method[1]['silhouette_score']:.3f}")
        
        if quality_report.get('hierarchy_assessment', {}).get('best_level'):
            best_level = quality_report['hierarchy_assessment']['best_level']
            print(f"  • Best taxonomy level: {best_level['level']} ({best_level['n_clusters']} categories)")
    
else:
    print("Cannot export results - missing required data")

## Conclusion

This notebook walked through clustering and taxonomy discovery on on2vec embeddings, from raw concept vectors to exported taxonomy files:

### ✅ Key Achievements:

1. **Multi-Algorithm Clustering**: Applied K-Means, hierarchical (agglomerative), DBSCAN, Gaussian mixture (GMM), and spectral clustering
2. **Hierarchical Taxonomy Construction**: Built multi-level taxonomies with semantic labels
3. **Quality Assessment**: Evaluated clustering quality using multiple metrics and validation approaches
4. **Interactive Visualization**: Created dynamic visualizations for taxonomy exploration
5. **Semantic Coherence Analysis**: Assessed how well clusters align with conceptual categories
6. **Export Capabilities**: Generated results in multiple formats for further use

### 🎯 Practical Applications:

- **Scientific Domain Organization**: Discover natural groupings in research areas
- **Knowledge Base Restructuring**: Identify optimal organizational structures for ontologies
- **Emerging Trend Detection**: Find new research themes and concept clusters
- **Literature Classification**: Organize papers and resources by discovered topics
- **Curriculum Development**: Structure educational content based on conceptual relationships

### 📊 Clustering Insights:

The analysis suggests that:
- **Different algorithms excel at different structures**: K-Means suits compact, similarly sized clusters; DBSCAN suits irregular, density-defined groups and can label outliers as noise
- **Hierarchical taxonomies provide flexibility**: Cutting the hierarchy at different levels gives coarse or fine groupings for different use cases
- **Semantic coherence varies by domain**: Some domains naturally cluster better than others
- **Quality metrics guide selection**: Silhouette score, cluster-size balance, and semantic coherence together give a fuller picture than any single metric
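
As a minimal sketch of metric-guided selection, the snippet below compares a few clustering configurations by silhouette score and a simple balance measure. It runs on synthetic blobs rather than on2vec embeddings, and the `size_balance` helper (smallest cluster size divided by largest) is an assumed definition for illustration, not necessarily the one used earlier in this notebook.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for concept embeddings (assumption: real ones come from on2vec).
X, _ = make_blobs(n_samples=300, centers=4, n_features=16, random_state=0)

# Candidate configurations to compare; names and settings are illustrative.
candidates = {
    "kmeans_k4": KMeans(n_clusters=4, n_init=10, random_state=0),
    "kmeans_k8": KMeans(n_clusters=8, n_init=10, random_state=0),
    "agglomerative_k4": AgglomerativeClustering(n_clusters=4),
}


def size_balance(labels):
    """Smallest cluster size divided by largest (1.0 means perfectly balanced).

    Assumes non-negative labels; DBSCAN noise points (-1) would need filtering first.
    """
    counts = np.bincount(labels)
    counts = counts[counts > 0]
    return counts.min() / counts.max()


scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    scores[name] = {
        "n_clusters": int(np.unique(labels).size),
        "silhouette_score": float(silhouette_score(X, labels)),
        "balance": float(size_balance(labels)),
    }

# Pick the configuration with the highest silhouette score.
best_name = max(scores, key=lambda n: scores[n]["silhouette_score"])

for name, s in scores.items():
    print(f"{name:18}: {s['n_clusters']} clusters, "
          f"silhouette = {s['silhouette_score']:.3f}, balance = {s['balance']:.2f}")
print("best by silhouette:", best_name)
```

Silhouette alone can favor a few large, well-separated groups, so reporting balance next to it helps spot configurations that score well but lump most concepts into one cluster.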

### 🔧 Technical Features:

- **Scalable Clustering**: Efficient processing of large concept spaces
- **Multi-Level Hierarchies**: Taxonomies at different granularities
- **Quality Validation**: Comprehensive assessment against multiple criteria
- **Interactive Exploration**: User-friendly visualization interfaces
- **Export Integration**: Standard formats for downstream applications
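
For downstream use of the exported files, a small loader along the following lines may be enough. This is a sketch under assumptions: `load_latest_export` and `clusters_at_level` are hypothetical helpers, and they rely on the file naming used by the export step (`<ontology_name>_clustering_<timestamp>.csv` and `<ontology_name>_taxonomy_<timestamp>.json`, with a timestamp that sorts lexicographically).

```python
import json
from pathlib import Path

import pandas as pd


def load_latest_export(output_dir, ontology_name):
    """Load the newest clustering CSV and taxonomy JSON for one ontology.

    Hypothetical helper: returns (DataFrame or None, dict or None).
    """
    out = Path(output_dir)
    # Timestamps formatted as YYYYmmdd_HHMMSS sort chronologically as strings.
    csv_files = sorted(out.glob(f"{ontology_name}_clustering_*.csv"))
    json_files = sorted(out.glob(f"{ontology_name}_taxonomy_*.json"))

    cluster_df = pd.read_csv(csv_files[-1]) if csv_files else None

    taxonomy = None
    if json_files:
        with open(json_files[-1]) as f:
            taxonomy = json.load(f)
    return cluster_df, taxonomy


def clusters_at_level(taxonomy, level):
    """Return the cluster dictionary for one taxonomy level."""
    # JSON object keys are always strings, so the level is looked up as str.
    return taxonomy["taxonomy_levels"][str(level)]["clusters"]
```

A call such as `cluster_df, taxonomy = load_latest_export(output_dir, ontology_name)` would then give a DataFrame with a `cluster_id` column and a dictionary mirroring the exported JSON. The exported JSON omits each cluster's `concept_list`, so concept membership has to come from the CSV.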

### 🚀 Next Steps:

1. **Dynamic Taxonomies**: Update classifications as new concepts are added
2. **Cross-Domain Clustering**: Find relationships between different knowledge areas
3. **Expert Validation**: Incorporate domain expert feedback to refine taxonomies
4. **Temporal Analysis**: Track how conceptual clusters evolve over time
5. **Application Integration**: Deploy discovered taxonomies in real knowledge systems
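
One simple way to approach the first step is to assign each new concept to the nearest existing cluster centroid and flag concepts that are far from every centroid as candidates for a new category. The sketch below assumes Euclidean distance and a user-chosen `max_distance` threshold; `assign_to_nearest_centroid` is a hypothetical helper, not part of on2vec.

```python
import numpy as np


def assign_to_nearest_centroid(embeddings, labels, new_embeddings, max_distance=None):
    """Assign new embeddings to the closest existing cluster centroid.

    Returns (assigned_labels, distances). Points farther than max_distance
    from every centroid get label -1 (candidate for a new cluster).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    new_embeddings = np.atleast_2d(np.asarray(new_embeddings, dtype=float))

    # Ignore the noise label (-1) that density-based methods may produce.
    cluster_ids = np.array(sorted(set(labels.tolist()) - {-1}))
    centroids = np.stack([embeddings[labels == cid].mean(axis=0) for cid in cluster_ids])

    # Pairwise Euclidean distances, shape (n_new, n_clusters).
    dists = np.linalg.norm(new_embeddings[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(new_embeddings)), nearest]

    assigned = cluster_ids[nearest]
    if max_distance is not None:
        assigned = np.where(nearest_dist > max_distance, -1, assigned)
    return assigned, nearest_dist


# Toy example with two tight clusters and three new points.
existing = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
existing_labels = np.array([0, 0, 1, 1])
new_points = np.array([[0.1, 0.1], [5.1, 4.9], [20.0, 20.0]])

assigned, dist = assign_to_nearest_centroid(
    existing, existing_labels, new_points, max_distance=2.0
)
print(assigned.tolist())  # [0, 1, -1]
```

A growing share of `-1` assignments over time could serve as a rough signal that the taxonomy needs re-clustering, which ties into the temporal analysis item above.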

Taken together, these steps show how on2vec embeddings can surface organizational structure that is not explicit in an ontology, which makes complex concept spaces easier to navigate and understand.