# RDF Conversion Workflow Demo

This notebook demonstrates converting PMC XML articles to RDF for integration with knowledge graphs.

## Use Cases

- **Semantic Integration**: Integrate scientific literature with knowledge graphs
- **SPARQL Queries**: Query literature data using SPARQL
- **Ontology Alignment**: Align with standard ontologies (BIBO, FOAF, DCTERMS)
- **GraphDB Loading**: Prepare data for triple stores

In [1]:
# Import required modules
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS
import os

## 1. Load and Parse a PMC XML File

Let's start by loading a real PMC article:

In [2]:
# Load a fixture file
fixture_path = "../tests/fixtures/fulltext_downloads/PMC3359999.xml"

if os.path.exists(fixture_path):
    # Read as UTF-8 explicitly rather than relying on the platform default encoding
    with open(fixture_path, "r", encoding="utf-8") as f:
        xml_content = f.read()
    print(f"Loaded XML file: {len(xml_content)} characters")
else:
    print(f"File not found: {fixture_path}")
    print("Please adjust the path to point to a PMC XML file")

Loaded XML file: 64374 characters
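
Before handing the text to the parser, it can help to confirm that the file is well-formed XML and to peek at the article title. The following is a minimal sketch using only the standard library. It assumes the file follows the JATS layout, where the title sits in an `article-title` element. The helper `peek_title` and the sample string are hypothetical stand-ins, not part of PyEuropePMC or the real fixture.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def peek_title(xml_text: str) -> Optional[str]:
    """Return the first <article-title> text, or None if there is none.

    Raises xml.etree.ElementTree.ParseError if the input is not well-formed.
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//article-title")
    if node is None:
        return None
    # itertext() also collects text nested in inline markup such as <italic>
    return " ".join("".join(node.itertext()).split())


# Tiny stand-in for a JATS article (assumed structure, not the real fixture)
sample = """<article><front><article-meta><title-group>
<article-title>Risk Factors of <italic>Porcine</italic> Cysticercosis</article-title>
</title-group></article-meta></front></article>"""

print(peek_title(sample))
```

A `ParseError` at this point usually means the download was truncated or is not XML at all, which is cheaper to catch here than inside the entity builders.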


In [3]:
# Parse the XML
parser = FullTextXMLParser(xml_content)

# Build entities
paper, authors, sections, tables, figures, references = build_paper_entities(parser)

print(f"Paper: {paper.title}")
print(f"PMCID: {paper.pmcid}")
print(f"DOI: {paper.doi}")
print(f"\nStatistics:")
print(f"  Authors: {len(authors)}")
print(f"  Sections: {len(sections)}")
print(f"  Tables: {len(tables)}")
print(f"  Figures: {len(figures)}")
print(f"  References: {len(references)}")

Paper: Risk Factors of Porcine Cysticercosis in the Eastern Cape Province, South Africa
PMCID: 3359999
DOI: 10.1371/journal.pone.0037718

Statistics:
  Authors: 8
  Sections: 10
  Tables: 2
  Figures: 0
  References: 28


## 2. Normalize and Validate Entities

Before converting to RDF, normalize and validate the data:

In [4]:
# Show DOI before normalization
print(f"DOI before normalization: {paper.doi}")

# Normalize paper
paper.normalize()
print(f"DOI after normalization: {paper.doi}")

# Validate paper
try:
    paper.validate()
    print("✓ Paper validation passed")
except ValueError as e:
    print(f"✗ Paper validation failed: {e}")

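# Normalize and validate all authors, counting failures
failed = 0
for author in authors:
    author.normalize()
    try:
        author.validate()
    except ValueError as e:
        failed += 1
        print(f"✗ Author validation failed: {e}")

# Report success only when every author passed
if failed == 0:
    print(f"✓ All {len(authors)} authors validated")
else:
    print(f"✗ {failed} of {len(authors)} authors failed validation")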

DOI before normalization: 10.1371/journal.pone.0037718
DOI after normalization: 10.1371/journal.pone.0037718
✓ Paper validation passed
✓ All 8 authors validated
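
To make the normalize-then-validate pattern concrete, here is a minimal sketch of what such an entity could look like. `PaperSketch` is a hypothetical stand-in written for this example. The normalization rules (stripping resolver prefixes, lowercasing the DOI) and the DOI pattern are assumptions and may differ from what PyEuropePMC actually does.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Resolver prefixes that sometimes precede a bare DOI
DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)", re.IGNORECASE)


@dataclass
class PaperSketch:
    """Hypothetical stand-in for a paper entity; not PyEuropePMC's class."""

    title: str
    doi: Optional[str] = None

    def normalize(self) -> None:
        # Collapse runs of whitespace in the title
        self.title = " ".join(self.title.split())
        if self.doi:
            # Strip resolver prefixes and lowercase (DOIs are case-insensitive)
            self.doi = DOI_PREFIX.sub("", self.doi.strip()).lower()

    def validate(self) -> None:
        if not self.title:
            raise ValueError("title is required")
        if self.doi and not re.match(r"^10\.\d{4,9}/\S+$", self.doi):
            raise ValueError(f"malformed DOI: {self.doi!r}")


paper_sketch = PaperSketch(
    title="  Risk  Factors ",
    doi="https://doi.org/10.1371/Journal.pone.0037718",
)
paper_sketch.normalize()
paper_sketch.validate()
print(paper_sketch.doi)
```

Running normalization before validation means the validator only has to accept one canonical form, which is why a DOI that is already canonical passes through unchanged, as in the output above.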


## 3. Convert to RDF

Now convert the entities to RDF triples:

In [5]:
# Initialize RDF mapper and graph
mapper = RDFMapper()
g = Graph()

# Bind namespaces for proper serialization
mapper._bind_namespaces(g)

# Add paper to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
print(f"Paper URI: {paper_uri}")
print(f"Triples after adding paper: {len(g)}")

# Add authors
for i, author in enumerate(authors):
    author_uri = author.to_rdf(g, mapper=mapper)
    if i == 0:
        print(f"\nFirst author URI: {author_uri}")

print(f"Triples after adding {len(authors)} authors: {len(g)}")

# Add sections
for section in sections[:3]:  # Add first 3 sections for demo
    section.to_rdf(g, mapper=mapper)

print(f"Triples after adding 3 sections: {len(g)}")

# Add tables
for table in tables:
    table.to_rdf(g, mapper=mapper)

print(f"Triples after adding {len(tables)} tables: {len(g)}")

# Add references
for reference in references[:5]:  # Add first 5 references for demo
    reference.to_rdf(g, mapper=mapper)

print(f"\nTotal triples in graph: {len(g)}")

Paper URI: https://doi.org/10.1371/journal.pone.0037718
Triples after adding paper: 13

First author URI: http://example.org/data/author/rosina-claudia-krecek
Triples after adding 8 authors: 123
Triples after adding 3 sections: 141
Triples after adding 2 tables: 438

Total triples in graph: 478
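
The URIs printed above suggest two patterns: a DOI-based URI for the paper and a name-derived slug under `http://example.org/data/author/` for authors. The sketch below shows one way such URIs could be minted. The helper names are hypothetical, and the actual rules inside `RDFMapper` may differ.

```python
import re
import unicodedata
from urllib.parse import quote

BASE = "http://example.org/data/"  # base URI seen in the output above


def slugify(name: str) -> str:
    """Lowercase ASCII slug, e.g. 'Rosina Claudia Krecek' -> 'rosina-claudia-krecek'."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def author_uri(full_name: str) -> str:
    return f"{BASE}author/{slugify(full_name)}"


def paper_uri_from_doi(doi: str) -> str:
    # Keep "/" readable; percent-encode anything unsafe in a URI path
    return "https://doi.org/" + quote(doi, safe="/")


print(author_uri("Rosina Claudia Krecek"))
print(paper_uri_from_doi("10.1371/journal.pone.0037718"))
```

Name-derived slugs are readable but collide when two authors share a name. A persistent identifier such as an ORCID, where available, is a more robust key.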


## 4. Serialize to Turtle Format

Let's view the RDF in Turtle format:

In [6]:
# Serialize to Turtle
ttl = mapper.serialize_graph(g, format="turtle")

# Display first 1000 characters
print("RDF/Turtle Output (first 1000 characters):")
print("=" * 60)
print(ttl[:1000])
print("...")
print(f"\nTotal output length: {len(ttl)} characters")

RDF/Turtle Output (first 1000 characters):
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/data/reference/41f7a5d8-016f-4aae-91b2-c357291e203e> a bibo:Document ;
    dcterms:title "Seroprevalence of antibodies against Taenia solium cysticerci among refugees resettled in United States." ;
    bibo:journalTitle "Emerg Infect Dis" ;
    prov:generatedAtTime "2025-11-25T12:04:33.235435" ;
    prov:wasGeneratedBy "pyeuropepmc_parser" .

<http://example.org/data/reference/

## 5. Knowledge Graph Structure Options

PyEuropePMC supports different knowledge graph structures for different use cases:

- **Complete KG**: All entities (metadata + content) - best for comprehensive analysis
- **Metadata KG**: Only bibliographic metadata (papers, authors, institutions) - best for citation networks
- **Content KG**: Only document content (sections, references, tables) - best for text analysis

Let's demonstrate each approach:

In [7]:
# Prepare entities data in the expected format
entities_data = {
    paper.doi or paper.pmcid: {
        "entity": paper,
        "related_entities": {
            "authors": authors,
            "sections": sections,
            "tables": tables,
            "figures": figures,
            "references": references,
        }
    }
}

print("Entities data prepared:")
print(f"  Paper: {paper.title}")
print(f"  Authors: {len(authors)}")
print(f"  Sections: {len(sections)}")
print(f"  Tables: {len(tables)}")
print(f"  Figures: {len(figures)}")
print(f"  References: {len(references)}")

Entities data prepared:
  Paper: Risk Factors of Porcine Cysticercosis in the Eastern Cape Province, South Africa
  Authors: 8
  Sections: 10
  Tables: 2
  Figures: 0
  References: 28
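
The three structures differ only in which related-entity groups end up in the graph. The sketch below expresses that selection in plain Python. The grouping follows the descriptions above, with figures assumed to count as content. `KG_GROUPS` and `select_related` are hypothetical names, not the mapper's API.

```python
from typing import Any, Dict, List

# Assumed grouping, following the three descriptions above
KG_GROUPS = {
    "metadata": ["authors"],
    "content": ["sections", "tables", "figures", "references"],
}
KG_GROUPS["complete"] = KG_GROUPS["metadata"] + KG_GROUPS["content"]


def select_related(related: Dict[str, List[Any]], kg_type: str) -> Dict[str, List[Any]]:
    """Keep only the related-entity groups that belong to the requested KG type."""
    if kg_type not in KG_GROUPS:
        raise ValueError(f"unknown kg_type: {kg_type!r}")
    return {name: related.get(name, []) for name in KG_GROUPS[kg_type]}


# Placeholder strings stand in for entity objects
related_demo = {
    "authors": ["author-1", "author-2"],
    "sections": ["intro"],
    "tables": ["table-1"],
    "figures": [],
    "references": ["ref-1"],
}

for kg_type in ("metadata", "content", "complete"):
    selected = select_related(related_demo, kg_type)
    print(kg_type, {name: len(items) for name, items in selected.items()})
```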


In [8]:
# Create metadata-only knowledge graph
print("Creating Metadata-Only Knowledge Graph...")
metadata_graphs = mapper.save_metadata_rdf(
    entities_data=entities_data,
    output_dir="/tmp/rdf_output",
    extraction_info={"method": "notebook_demo", "timestamp": "2024-01-01T00:00:00Z"}
)

print(f"Metadata KG saved: {len(metadata_graphs)} graphs")
# Avoid the name `g` here: it holds the graph built in section 3
for identifier, kg_graph in metadata_graphs.items():
    print(f"  {identifier}: {len(kg_graph)} triples")

Creating Metadata-Only Knowledge Graph...
Converting to RDF: 10.1371/journal.pone.0037718 (paper)
  [OK] Successfully converted to RDF (139 triples)
  [OK] Saved to: /tmp/rdf_output/metadata_paper_10_1371_journal_pone_0037718.ttl
Successfully converted 1 entities to RDF
Metadata KG saved: 1 graphs
  10.1371/journal.pone.0037718: 139 triples


In [9]:
# Create content-only knowledge graph
print("\nCreating Content-Only Knowledge Graph...")
content_graphs = mapper.save_content_rdf(
    entities_data=entities_data,
    output_dir="/tmp/rdf_output",
    extraction_info={"method": "notebook_demo", "timestamp": "2024-01-01T00:00:00Z"}
)

print(f"Content KG saved: {len(content_graphs)} graphs")
# Avoid the name `g` here: it holds the graph built in section 3
for identifier, kg_graph in content_graphs.items():
    print(f"  {identifier}: {len(kg_graph)} triples")


Creating Content-Only Knowledge Graph...
Converting to RDF: 10.1371/journal.pone.0037718 (paper)
  [OK] Successfully converted to RDF (680 triples)
  [OK] Saved to: /tmp/rdf_output/content_paper_10_1371_journal_pone_0037718.ttl
Successfully converted 1 entities to RDF
Content KG saved: 1 graphs
  10.1371/journal.pone.0037718: 680 triples


In [10]:
# Create complete knowledge graph
print("\nCreating Complete Knowledge Graph...")
complete_graphs = mapper.save_complete_rdf(
    entities_data=entities_data,
    output_dir="/tmp/rdf_output",
    extraction_info={"method": "notebook_demo", "timestamp": "2024-01-01T00:00:00Z"}
)

print(f"Complete KG saved: {len(complete_graphs)} graphs")
# Avoid the name `g` here: it holds the graph built in section 3
for identifier, kg_graph in complete_graphs.items():
    print(f"  {identifier}: {len(kg_graph)} triples")


Creating Complete Knowledge Graph...
Converting to RDF: 10.1371/journal.pone.0037718 (paper)
  [OK] Successfully converted to RDF (806 triples)
  [OK] Saved to: /tmp/rdf_output/paper_10_1371_journal_pone_0037718.ttl
Successfully converted 1 entities to RDF
Complete KG saved: 1 graphs
  10.1371/journal.pone.0037718: 806 triples


In [11]:
# Compare the different KG structures
print("Knowledge Graph Structure Comparison:")
print("=" * 60)

# File name prefixes, taken from the "Saved to" paths printed above
prefixes = {
    "Metadata": "metadata_paper_",
    "Content": "content_paper_",
    "Complete": "paper_",
}

# The saved file names replace "/" and "." in the identifier with "_"
safe_id = (paper.doi or paper.pmcid).replace("/", "_").replace(".", "_")

# Load the saved graphs. The loop variable is not named `g`, so the graph
# built in section 3 stays available to later cells.
graphs = {}
for kg_type, prefix in prefixes.items():
    filename = f"/tmp/rdf_output/{prefix}{safe_id}.ttl"
    kg_graph = Graph()
    if os.path.exists(filename):
        kg_graph.parse(filename, format="turtle")
        print(f"{kg_type} KG: {len(kg_graph)} triples")
    else:
        print(f"{kg_type} KG: file not found ({filename})")
    graphs[kg_type] = kg_graph

# Count entities per rdf:type. The count variable is ?n because `row.count`
# would resolve to the tuple method rather than the SPARQL binding.
type_query = """
SELECT ?type (COUNT(?entity) AS ?n)
WHERE {
    ?entity a ?type .
}
GROUP BY ?type
ORDER BY DESC(?n)
"""

for kg_type, kg_graph in graphs.items():
    if len(kg_graph) > 0:
        print(f"\n{kg_type} KG Entity Types:")
        for row in kg_graph.query(type_query):
            type_name = str(row.type).split("/")[-1].split("#")[-1]
            print(f"  {type_name}: {row.n}")


In [12]:
# Use the convenience save_rdf method with configuration
print("Using Configured KG Type:")

# The mapper uses configuration from conf/rdf_map.yml
# By default it's set to "complete", but you can override
configured_graphs = mapper.save_rdf(
    entities_data=entities_data,
    output_dir="/tmp/rdf_output",
    kg_type="metadata",  # Override to create metadata-only
    extraction_info={"method": "notebook_demo"}
)

print(f"Configured KG saved: {len(configured_graphs)} graphs")

Using Configured KG Type:
Converting to RDF: 10.1371/journal.pone.0037718 (paper)
  [OK] Successfully converted to RDF (139 triples)
  [OK] Saved to: /tmp/rdf_output/metadata_paper_10_1371_journal_pone_0037718.ttl
Successfully converted 1 entities to RDF
Configured KG saved: 1 graphs


## 6. Export to Different RDF Formats

RDF can be serialized to various formats:

In [13]:
# Export to different formats
formats = ["turtle", "xml", "json-ld"]

for fmt in formats:
    output = mapper.serialize_graph(g, format=fmt)
    print(f"{fmt.upper()} format: {len(output)} characters")
    print(f"First 200 characters:")
    print(output[:200])
    print("...\n")





## 7. Save to Files

Save the RDF to files for loading into a triple store:

In [14]:
# Save to files
output_dir = "/tmp/rdf_output"
os.makedirs(output_dir, exist_ok=True)

# Save as Turtle
ttl_path = f"{output_dir}/paper_{paper.pmcid}.ttl"
mapper.serialize_graph(g, format="turtle", destination=ttl_path)
print(f"Saved Turtle: {ttl_path}")

# Save as JSON-LD
jsonld_path = f"{output_dir}/paper_{paper.pmcid}.jsonld"
mapper.serialize_graph(g, format="json-ld", destination=jsonld_path)
print(f"Saved JSON-LD: {jsonld_path}")

# Save as RDF/XML
xml_path = f"{output_dir}/paper_{paper.pmcid}.rdf"
mapper.serialize_graph(g, format="xml", destination=xml_path)
print(f"Saved RDF/XML: {xml_path}")

print(f"\nAll files saved to: {output_dir}")

Saved Turtle: /tmp/rdf_output/paper_PMC3359999.ttl
Saved JSON-LD: /tmp/rdf_output/paper_PMC3359999.jsonld
Saved RDF/XML: /tmp/rdf_output/paper_PMC3359999.rdf

All files saved to: /tmp/rdf_output
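
Many triple stores accept uploads over the SPARQL 1.1 Graph Store HTTP Protocol, where a POST with the right media type adds the triples to a graph. The sketch below builds such a request without sending it. The endpoint URL is hypothetical and the helper `build_upload_request` is written for this example, so check your store's documentation for the actual Graph Store URL.

```python
import urllib.request
from pathlib import Path

# Standard media types for the three serializations saved above
MEDIA_TYPES = {
    ".ttl": "text/turtle",
    ".jsonld": "application/ld+json",
    ".rdf": "application/rdf+xml",
}


def build_upload_request(endpoint: str, path: str, data: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST that adds RDF to the store's default graph."""
    suffix = Path(path).suffix
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"unsupported RDF file extension: {suffix!r}")
    return urllib.request.Request(
        url=f"{endpoint}?default",
        data=data,
        headers={"Content-Type": MEDIA_TYPES[suffix]},
        method="POST",
    )


# Hypothetical endpoint; substitute the Graph Store URL of your own triple store
request = build_upload_request(
    "http://localhost:3030/dataset/data",
    "/tmp/rdf_output/paper_PMC3359999.ttl",
    b"<http://example.org/s> <http://example.org/p> <http://example.org/o> .",
)
print(request.get_method(), request.full_url, request.get_header("Content-type"))
# To send it: urllib.request.urlopen(request)
```

In practice the `data` argument would be the bytes of the saved file, for example `Path(ttl_path).read_bytes()`.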


## 8. Graph Statistics

Let's analyze the generated RDF graph:

In [15]:
# Count entities by rdf:type with a SPARQL aggregate. The count variable is
# named ?n because `row.count` would resolve to the tuple method, not the binding.
type_query = """
SELECT ?type (COUNT(?entity) AS ?n)
WHERE {
    ?entity a ?type .
}
GROUP BY ?type
ORDER BY DESC(?n)
"""

print("Entity Type Distribution:")
print("=" * 60)
results = g.query(type_query)
for row in results:
    type_name = str(row.type).split("/")[-1].split("#")[-1]
    print(f"{type_name}: {row.n}")

print(f"\nTotal triples: {len(g)}")
print(f"Total unique subjects: {len(set(g.subjects()))}")
print(f"Total unique predicates: {len(set(g.predicates()))}")
print(f"Total unique objects: {len(set(g.objects()))}")




## 9. Using the CLI Tool

For batch processing, use the CLI tool:

In [16]:
# Example CLI usage (run in terminal)
print("""
Command-line usage:

# Convert single file to Turtle and JSON
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --json output.json -v

# Convert multiple files
for file in *.xml; do
    python scripts/xml_to_rdf.py "$file" --ttl "${file%.xml}.ttl" -v
done

# Use custom RDF mapping configuration
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --config custom_map.yml
""")


Command-line usage:

# Convert single file to Turtle and JSON
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --json output.json -v

# Convert multiple files
for file in *.xml; do
    python scripts/xml_to_rdf.py "$file" --ttl "${file%.xml}.ttl" -v
done

# Use custom RDF mapping configuration
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --config custom_map.yml
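
The shell loop above can also be driven from Python, which makes it easier to add logging or error handling per file. The sketch below only builds the command lines, using the `--ttl` and `-v` flags shown above. The helper `build_commands` is hypothetical, and the demo runs on a throwaway directory of empty placeholder files.

```python
import shlex
import tempfile
from pathlib import Path
from typing import List


def build_commands(input_dir: Path, output_dir: Path) -> List[List[str]]:
    """One xml_to_rdf.py invocation per *.xml file, mirroring the shell loop above."""
    commands = []
    for xml_file in sorted(input_dir.glob("*.xml")):
        ttl_file = output_dir / xml_file.with_suffix(".ttl").name
        commands.append(
            ["python", "scripts/xml_to_rdf.py", str(xml_file), "--ttl", str(ttl_file), "-v"]
        )
    return commands


# Demo on a throwaway directory; only the two .xml files should be picked up
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for name in ("PMC1.xml", "PMC2.xml", "notes.txt"):
        (tmp_path / name).touch()
    commands = build_commands(tmp_path, tmp_path)

for command in commands:
    print(shlex.join(command))
# To execute: subprocess.run(command, check=True) for each command
```

Passing each command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.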



## Summary

This notebook demonstrated:

1. Loading and parsing PMC XML files
2. Building typed entity models
3. Normalizing and validating data
4. Converting to RDF with ontology alignment
5. Building metadata, content, and complete knowledge graphs and comparing them with SPARQL
6. Exporting to multiple RDF formats
7. Saving for triple store integration
8. Analyzing graph statistics
9. Using the CLI tool for batch processing

## Next Steps

- Load RDF into GraphDB, Blazegraph, or other triple stores
- Validate with SHACL shapes (see `shacl/pub.shacl.ttl`)
- Extend ontology mappings in `conf/rdf_map.yml`
- Build federated queries across multiple papers
- Integrate with existing knowledge graphs