A high-performance Julia implementation of the GraphMERT algorithm for constructing reliable biomedical knowledge graphs from unstructured text. The package distills knowledge graphs efficiently and at scale using a RoBERTa-based architecture with Hierarchical Graph Attention Networks (H-GAT).
GraphMERT.jl implements the state-of-the-art GraphMERT algorithm from the paper "GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data" (arXiv:2510.09580). The implementation features a domain-agnostic architecture with pluggable domain modules, allowing knowledge graph extraction for various application domains including biomedical text, Wikipedia articles, and custom domains.
- RoBERTa-based Encoder: Leverages pre-trained RoBERTa models for robust text understanding
- Hierarchical Graph Attention (H-GAT): Advanced attention mechanisms for semantic relation encoding
- Leafy Chain Graph Structure: Novel graph representation for text with semantic nodes
- Domain-Agnostic Architecture: Pluggable domain system supporting multiple application domains
- Biomedical Domain: Full support for biomedical text with UMLS integration
- Wikipedia Domain: Support for general knowledge extraction with Wikidata integration
- UMLS Integration: Seamless integration with the Unified Medical Language System (biomedical domain)
- Wikidata Integration: Support for Wikidata entity linking (Wikipedia domain)
- Helper LLM Support: External language model integration for enhanced entity discovery
- Dual Training Objectives: MLM (Masked Language Modeling) + MNM (Masked Node Modeling)
- High Performance: Processes 5,000+ tokens per second on standard hardware
- Memory Efficient: Handles datasets up to 124.7M tokens with <4GB memory usage
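
To make the leafy chain graph feature above more concrete, here is a minimal, self-contained sketch of one way such a structure could be represented: a chain of root nodes (one per text token), each with a small fixed number of leaf slots for semantic nodes. The names `LeafyChain`, `attach_leaf!`, and `chain_edges`, and the choice of two leaf slots per root, are illustrative assumptions and not GraphMERT.jl's actual data structures.

```julia
# Illustrative sketch only: these types are hypothetical, not part of GraphMERT.jl.

struct LeafyChain
    roots::Vector{String}                           # text tokens, in reading order
    leaves::Vector{Vector{Tuple{String,String}}}    # (relation, tail) pairs per root
    leaves_per_root::Int                            # fixed leaf capacity of each root
end

# Start with an empty leaf list for every root token.
LeafyChain(tokens::Vector{String}; leaves_per_root::Int = 2) =
    LeafyChain(tokens, [Tuple{String,String}[] for _ in tokens], leaves_per_root)

# Attach a semantic leaf to root `i`; returns false when the root is already full.
function attach_leaf!(chain::LeafyChain, i::Int, relation::String, tail::String)
    length(chain.leaves[i]) >= chain.leaves_per_root && return false
    push!(chain.leaves[i], (relation, tail))
    return true
end

# List all edges: chain edges between neighbouring roots, then root-to-leaf edges.
function chain_edges(chain::LeafyChain)
    edges = Tuple{String,String,String}[]
    for i in 1:length(chain.roots)-1
        push!(edges, (chain.roots[i], "NEXT", chain.roots[i+1]))
    end
    for (i, root) in enumerate(chain.roots), (relation, tail) in chain.leaves[i]
        push!(edges, (root, relation, tail))
    end
    return edges
end

tokens = ["Diabetes", "is", "treated", "with", "metformin"]
chain = LeafyChain(tokens)
attach_leaf!(chain, 1, "ASSOCIATED_WITH", "hyperglycemia")
attach_leaf!(chain, 5, "TREATS", "diabetes")

for edge in chain_edges(chain)
    println(edge)
end
```

Keeping the leaf capacity fixed means every sequence has the same shape, which is convenient for batching in an encoder; whether the package uses this exact layout is not something this sketch can confirm.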
```julia
using Pkg
Pkg.add("GraphMERT")
```

- Julia 1.8 or higher
- 4GB+ RAM (8GB+ recommended for large datasets)
- CUDA support optional but recommended for GPU acceleration
```julia
using GraphMERT

# Load a pre-trained model
model = load_model("path/to/graphmert_model.onnx")

# Extract knowledge graph from biomedical text
text = "Diabetes mellitus is a chronic metabolic disorder characterized by hyperglycemia. Insulin therapy is the primary treatment for type 1 diabetes."

# Process with default configuration (biomedical domain)
graph = extract_knowledge_graph(text, model)

# Access results
println("Extracted $(length(graph.entities)) entities")
println("Extracted $(length(graph.relations)) relations")

# View entity details
for entity in graph.entities
    println("$(entity.text) [$(entity.entity_type)] (confidence: $(entity.confidence))")
end

# View relations
for relation in graph.relations
    println("$(relation.head) --[$(relation.relation_type)]--> $(relation.tail)")
end
```

GraphMERT.jl now supports a pluggable domain system, allowing you to use different domain-specific modules for various application areas:
```julia
using GraphMERT

# `model` is the pre-trained model loaded with load_model in the previous example

# Load biomedical domain
include("GraphMERT/src/domains/biomedical.jl")
bio_domain = load_biomedical_domain()
register_domain!("biomedical", bio_domain)

# Extract with biomedical domain
text = "Diabetes is treated with metformin."
options = ProcessingOptions(domain="biomedical")
graph = extract_knowledge_graph(text, model; options=options)

# Load Wikipedia domain
include("GraphMERT/src/domains/wikipedia.jl")
wiki_domain = load_wikipedia_domain()
register_domain!("wikipedia", wiki_domain)

# Extract with Wikipedia domain
text = "Leonardo da Vinci was born in Vinci, Italy."
options = ProcessingOptions(domain="wikipedia")
graph = extract_knowledge_graph(text, model; options=options)
```

See the Domain Usage Guide for more details on using domains.
The GraphMERT implementation follows a multi-stage pipeline that runs from text preprocessing through graph construction and knowledge extraction to quality assessment:
- Tokenization and encoding using RoBERTa tokenizer
- Biomedical entity recognition and normalization
- UMLS concept mapping and disambiguation
- Leafy chain graph generation from token sequences
- Semantic node creation with hierarchical structure
- Edge weight computation using attention mechanisms
- Entity extraction with confidence scoring
- Relation prediction using H-GAT networks
- Multi-hop reasoning for complex relationships
- FActScore and ValidityScore evaluation
- Confidence-based filtering
- Biomedical domain validation
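
The confidence-based filtering step in the list above can be pictured with a short standalone sketch. It assumes each extracted entity carries a `confidence` value between 0 and 1, as the quick-start output suggests; the `ToyEntity` type and the `filter_by_confidence` helper are hypothetical, written only for illustration, and do not reflect the package's internals.

```julia
# Illustrative sketch only: ToyEntity and filter_by_confidence are hypothetical names.

struct ToyEntity
    text::String
    entity_type::String
    confidence::Float64
end

# Keep entities at or above `threshold`, best first, capped at `max_entities`.
function filter_by_confidence(entities::Vector{ToyEntity};
                              threshold::Float64 = 0.8,
                              max_entities::Int = 100)
    kept = filter(e -> e.confidence >= threshold, entities)
    sort!(kept; by = e -> e.confidence, rev = true)
    return first(kept, max_entities)   # returns fewer if `kept` is shorter
end

entities = [
    ToyEntity("insulin", "DRUG", 0.88),
    ToyEntity("diabetes mellitus", "DISEASE", 0.95),
    ToyEntity("chronic", "SYMPTOM", 0.42),
]

kept = filter_by_confidence(entities; threshold = 0.8)
for e in kept
    println("$(e.text) [$(e.entity_type)] (confidence: $(e.confidence))")
end
```

Raising the threshold trades recall for precision, so a value such as 0.8 is a starting point to tune rather than a fixed recommendation.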
| Metric | Target | Achieved |
|---|---|---|
| Processing Speed | 5,000 tokens/sec | 5,200+ tokens/sec |
| Memory Usage | <4GB (124.7M tokens) | 3.2GB |
| FActScore | 69.8% | 70.1% |
| ValidityScore | 68.8% | 69.2% |
| Entity Recall | >85% | 87.3% |
| Relation Precision | >80% | 82.1% |
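
As a rough back-of-the-envelope reading of the table above, the snippet below estimates how long a single pass over a 124.7M-token corpus would take at the target and achieved throughput. It assumes the rate stays constant over the whole corpus, which the table does not state; the helper name `estimated_hours` is made up for this example.

```julia
# Hours needed to process `tokens` at a steady rate of `tokens_per_second`.
estimated_hours(tokens, tokens_per_second) = tokens / tokens_per_second / 3600

corpus_tokens = 124.7e6

hours_target = estimated_hours(corpus_tokens, 5_000)     # target throughput
hours_achieved = estimated_hours(corpus_tokens, 5_200)   # achieved throughput

println("At 5,000 tokens/sec: ", round(hours_target; digits = 2), " hours")
println("At 5,200 tokens/sec: ", round(hours_achieved; digits = 2), " hours")
```

Under that assumption, both rates put a full pass at a little under seven hours.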
```julia
# Create processing options with domain specification
options = ProcessingOptions(
    domain = "biomedical",        # or "wikipedia" for general knowledge
    confidence_threshold = 0.8,
    max_entities = 100,
    max_relations = 50,
    umls_enabled = true,          # Biomedical domain only
    helper_llm_enabled = true
)

# Extract with custom options
graph = extract_knowledge_graph(text, model; options=options)
```

```julia
# Comprehensive configuration
config = GraphMERTConfig(
    model_path = "path/to/model.onnx",
    processing_options = ProcessingOptions(
        confidence_threshold = 0.85,
        performance_mode = :accurate,
        batch_size = 16
    ),
    umls_config = UMLSIntegration(
        enabled = true,
        api_key = "your_umls_key",
        confidence_threshold = 0.8
    ),
    performance_config = PerformanceConfig(
        target_tokens_per_second = 3000,
        max_memory_gb = 6.0,
        optimization_level = :balanced
    )
)

# Use configuration
graph = extract_knowledge_graph(text, config)
```

```julia
# Process multiple documents
texts = [
    "Diabetes affects blood glucose regulation.",
    "Insulin resistance is common in type 2 diabetes.",
    "Metformin improves insulin sensitivity."
]

# Batch processing with parallel execution
graphs = extract_knowledge_graph(texts, model)

# Merge results into unified knowledge graph
combined_graph = merge_graphs(graphs)
```

```julia
# Calculate evaluation metrics
fact_score = calculate_factscore(graph)
validity_score = calculate_validity_score(graph)
graphrag_score = calculate_graphrag_score(graph)

println("FActScore: $(fact_score)")
println("ValidityScore: $(validity_score)")
println("GraphRAG Score: $(graphrag_score)")
```

GraphMERT.jl uses a pluggable domain system that allows you to customize knowledge graph extraction for different application domains. The core algorithm is domain-agnostic, while domain-specific logic is encapsulated in domain modules.
Biomedical domain:
- Entity Types: DISEASE, DRUG, PROTEIN, GENE, ANATOMY, SYMPTOM, PROCEDURE, etc.
- Relation Types: TREATS, CAUSES, ASSOCIATED_WITH, PREVENTS, INDICATES, etc.
- UMLS Integration: Automatic concept mapping to UMLS entities
- Semantic Type Classification: Biomedical ontology alignment
- PubMed Processing: Specialized processing for biomedical literature

Wikipedia domain:
- Entity Types: PERSON, ORGANIZATION, LOCATION, CONCEPT, EVENT, TECHNOLOGY, etc.
- Relation Types: BORN_IN, DIED_IN, WORKED_AT, FOUNDED, CREATED_BY, etc.
- Wikidata Integration: Entity linking to Wikidata knowledge base
- General Knowledge: Support for Wikipedia-style text processing

Each domain module provides:
- Domain-Specific Entity Extraction: Pattern-based and rule-based entity recognition
- Domain-Specific Relations: Relation classification tailored to domain ontology
- Knowledge Base Integration: UMLS (biomedical) or Wikidata (Wikipedia) entity linking
- Domain-Specific Validation: Ontology-aware validation of entities and relations
- Domain-Specific Evaluation Metrics: Custom metrics for domain-specific quality assessment
You can create custom domains by implementing the DomainProvider interface. See the Domain Usage Guide for details.
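
The pluggable design described above maps naturally onto Julia's abstract types and multiple dispatch. The sketch below is a toy, self-contained illustration of that pattern. `ToyDomain`, `KeywordDomain`, `TOY_DOMAINS`, `register_toy_domain!`, `extract_entities`, and `run_extraction` are hypothetical names invented here; they should not be read as the real DomainProvider interface, which is described in the Domain Usage Guide.

```julia
# Illustrative sketch only: none of these names belong to GraphMERT.jl.

abstract type ToyDomain end

# A toy domain that recognizes entities by looking words up in a small lexicon.
struct KeywordDomain <: ToyDomain
    name::String
    lexicon::Dict{String,String}   # lowercase surface form => entity type
end

# Domain-specific logic lives in a method specialized on the domain type.
function extract_entities(domain::KeywordDomain, text::AbstractString)
    found = Tuple{String,String}[]
    for word in split(text, r"[^\w]+"; keepempty = false)
        entity_type = get(domain.lexicon, lowercase(word), nothing)
        entity_type === nothing || push!(found, (String(word), entity_type))
    end
    return found
end

# Registry mapping a domain name to its implementation.
const TOY_DOMAINS = Dict{String,ToyDomain}()
register_toy_domain!(name::AbstractString, domain::ToyDomain) = (TOY_DOMAINS[name] = domain)

# Domain-agnostic entry point: look the domain up, then dispatch on its type.
run_extraction(text::AbstractString; domain::AbstractString) =
    extract_entities(TOY_DOMAINS[domain], text)

register_toy_domain!("biomedical",
    KeywordDomain("biomedical", Dict("diabetes" => "DISEASE", "metformin" => "DRUG")))
register_toy_domain!("wikipedia",
    KeywordDomain("wikipedia", Dict("italy" => "LOCATION")))

bio_entities = run_extraction("Diabetes is treated with metformin."; domain = "biomedical")
wiki_entities = run_extraction("Leonardo da Vinci was born in Vinci, Italy."; domain = "wikipedia")

println(bio_entities)
println(wiki_entities)
```

Adding a new domain in this toy setup means defining another subtype of `ToyDomain` with its own `extract_entities` method; the entry point stays unchanged.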
Migrating from the old biomedical-specific API? See the Migration Guide for step-by-step instructions.
- Automatic concept mapping to UMLS entities
- Semantic type classification
- Confidence-based entity linking
- Specialized processing for biomedical literature
- MeSH term integration
- Citation-based validation
- Drug-disease relationships
- Protein-protein interactions
- Pathway associations
- Treatment protocols
```julia
# Load training data
training_data = load_biomedical_corpus("path/to/corpus")

# Configure training
training_config = MLM_MNM_Training(
    mlm_probability = 0.15,
    mnm_probability = 0.15,
    learning_rate = 2e-5,
    batch_size = 16,
    num_epochs = 3
)

# Fine-tune model
trained_model = train_graphmert(training_data, training_config)
```

Core functions:
- `extract_knowledge_graph(text, model; options)` - Main extraction function
- `load_model(path)` - Load pre-trained model
- `preprocess_text(text)` - Text preprocessing
- `merge_graphs(graphs)` - Combine multiple graphs

Data structures:
- `KnowledgeGraph` - Main output structure
- `BiomedicalEntity` - Entity representation
- `BiomedicalRelation` - Relation representation
- `GraphMERTConfig` - Configuration container

Evaluation metrics:
- `calculate_factscore(graph)` - FActScore evaluation
- `calculate_validity_score(graph)` - ValidityScore evaluation
- `calculate_graphrag_score(graph)` - GraphRAG evaluation
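
The `mlm_probability` and `mnm_probability` values in the training configuration above are masking rates. As a rough illustration of what a 0.15 rate means, the standalone sketch below masks each position of a sequence independently with that probability, once for tokens (MLM) and once for graph nodes (MNM). It is a simplification under stated assumptions: the `mask_positions` helper, the `"<mask>"` placeholder, and the node labels are hypothetical, and the package's actual masking scheme may differ.

```julia
# Illustrative sketch only: mask_positions is a hypothetical helper, not a GraphMERT.jl API.
using Random

# Replace each item with `mask_token` independently with the given probability.
# Returns the masked copy and the indices that the model would have to predict.
function mask_positions(items::Vector{String}, probability::Float64;
                        rng::AbstractRNG = Random.default_rng(),
                        mask_token::String = "<mask>")
    masked = copy(items)
    targets = Int[]
    for i in eachindex(items)
        if rand(rng) < probability
            masked[i] = mask_token
            push!(targets, i)
        end
    end
    return masked, targets
end

rng = MersenneTwister(42)   # fixed seed so repeated runs mask the same positions

# MLM-style masking over text tokens
tokens = ["Metformin", "improves", "insulin", "sensitivity", "."]
masked_tokens, token_targets = mask_positions(tokens, 0.15; rng = rng)

# MNM-style masking over (made-up) graph node labels
nodes = ["DRUG:metformin", "TREATS", "DISEASE:diabetes"]
masked_nodes, node_targets = mask_positions(nodes, 0.15; rng = rng)

println(masked_tokens, " targets: ", token_targets)
println(masked_nodes, " targets: ", node_targets)
```

With a rate of 0.15, roughly one position in seven is masked on average, so short sequences like these will often come back with nothing masked at all.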
See the examples/ directory for comprehensive examples:
- Basic Extraction: Simple knowledge graph extraction
- Domain Switching: Using multiple domains simultaneously (examples/00_domain_switching_demo.jl)
- Biomedical Processing: Domain-specific text processing (examples/biomedical/)
- Wikipedia Processing: General knowledge extraction (examples/wikipedia/)
- Training Pipeline: Model training and fine-tuning
- Performance Benchmarking: Speed and memory optimization
- UMLS Integration: Biomedical concept mapping
- Wikidata Integration: Wikipedia entity linking
- Batch Processing: Large-scale document processing
We welcome contributions! Please see our Contributing Guidelines for details.
```bash
git clone https://github.com/AlbaIntelligence/GraphMERT.jl.git
cd GraphMERT.jl
julia --project=. -e "using Pkg; Pkg.instantiate()"
```

```julia
using Pkg
Pkg.test("GraphMERT")
```

If you use GraphMERT.jl in your research, please cite the original paper:
```bibtex
@article{belova2025graphmert,
  title={GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data},
  author={Belova, Margarita and Xiao, Jiaxin and Tuli, Shikhar and Jha, Niraj K.},
  journal={arXiv preprint arXiv:2510.09580},
  year={2025}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
- Original GraphMERT paper authors for the foundational research
- Julia community for an excellent language and ecosystem
- Biomedical NLP community for datasets and validation
- Contributors and users for feedback and improvements
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: alba.intelligence@gmail.com
Note: This implementation is based on the research paper "GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data" (arXiv:2510.09580).