# RDF Conversion Workflow Demo

This notebook demonstrates converting PMC XML articles to RDF for integration with knowledge graphs.

## Use Cases

- **Semantic Integration**: Link scientific literature to existing knowledge graphs
- **SPARQL Queries**: Query literature data using SPARQL
- **Ontology Alignment**: Map entities to standard vocabularies (BIBO, FOAF, Dublin Core Terms)
- **GraphDB Loading**: Prepare data for triple stores

In [None]:
# Import required modules
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS
import os

## 1. Load and Parse a PMC XML File

Let's start by loading a real PMC article:

In [None]:
# Load a fixture file
fixture_path = "../tests/fixtures/fulltext_downloads/PMC3359999.xml"

if os.path.exists(fixture_path):
    with open(fixture_path, "r", encoding="utf-8") as f:
        xml_content = f.read()
    print(f"Loaded XML file: {len(xml_content)} characters")
else:
    # Stop here: the following cells need xml_content to be defined
    raise FileNotFoundError(
        f"File not found: {fixture_path}. "
        "Please adjust the path to point to a PMC XML file."
    )
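
PMC full-text files follow the JATS tag set, so it can help to peek at the raw metadata before handing the document to the parser. The following is a minimal sketch using only the standard library; the sample XML, its placeholder identifiers, and the `peek_metadata` helper are illustrative assumptions and not part of pyeuropepmc.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written JATS-style snippet (illustrative only, not the fixture file).
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmid">00000000</article-id>
      <article-id pub-id-type="doi">10.1234/example.doi</article-id>
      <title-group>
        <article-title>An <italic>Example</italic> Article</article-title>
      </title-group>
    </article-meta>
  </front>
</article>"""


def peek_metadata(xml_text):
    """Return the title and article IDs found in a JATS-style document."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//article-title")
    # itertext() also collects text nested in inline markup such as <italic>
    title = "".join(title_el.itertext()).strip() if title_el is not None else None
    ids = {
        el.get("pub-id-type"): (el.text or "").strip()
        for el in root.iter("article-id")
    }
    return {"title": title, "ids": ids}


meta = peek_metadata(SAMPLE_XML)
print(meta["title"])
print(meta["ids"])
```

A check like this only confirms that the file is well-formed XML with the expected front matter; the parser in the next cell does the real extraction.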

In [None]:
# Parse the XML
parser = FullTextXMLParser(xml_content)

# Build entities
paper, authors, sections, tables, references = build_paper_entities(parser)

print(f"Paper: {paper.title}")
print(f"PMCID: {paper.pmcid}")
print(f"DOI: {paper.doi}")
print("\nStatistics:")
print(f"  Authors: {len(authors)}")
print(f"  Sections: {len(sections)}")
print(f"  Tables: {len(tables)}")
print(f"  References: {len(references)}")

## 2. Normalize and Validate Entities

Before converting to RDF, normalize and validate the data:

In [None]:
# Show DOI before normalization
print(f"DOI before normalization: {paper.doi}")

# Normalize paper
paper.normalize()
print(f"DOI after normalization: {paper.doi}")

# Validate paper
try:
    paper.validate()
    print("✓ Paper validation passed")
except ValueError as e:
    print(f"✗ Paper validation failed: {e}")

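# Normalize and validate all authors, counting failures
failed_authors = 0
for author in authors:
    author.normalize()
    try:
        author.validate()
    except ValueError as e:
        failed_authors += 1
        print(f"✗ Author validation failed: {e}")

if failed_authors:
    print(f"✗ {failed_authors} of {len(authors)} authors failed validation")
else:
    print(f"✓ All {len(authors)} authors validated")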
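
To make the normalize-then-validate pattern concrete, here is a minimal sketch of what DOI normalization and validation could look like. It is not the implementation behind `paper.normalize()` or `paper.validate()`; the `normalize_doi` and `validate_doi` helpers, the prefix list, and the regular expression are assumptions chosen for illustration.

```python
import re

# Common ways a DOI is written; matched case-insensitively below.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Loose shape check: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Strip whitespace and resolver prefixes, then lowercase the DOI."""
    if raw is None:
        return None
    doi = raw.strip()
    for prefix in DOI_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # DOI names are case-insensitive, so lowercasing gives one canonical form
    return doi.lower()


def validate_doi(doi):
    """Raise ValueError if the value does not look like a DOI."""
    if not doi or not DOI_PATTERN.match(doi):
        raise ValueError(f"Invalid DOI: {doi!r}")


clean = normalize_doi("  https://doi.org/10.1234/Example.DOI ")
validate_doi(clean)
print(clean)
```

Normalizing first means the validator only has to recognize one canonical form, which also makes the DOI safer to use as part of a URI later on.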

## 3. Convert to RDF

Now convert the entities to RDF triples:

In [None]:
# Initialize RDF mapper and graph
mapper = RDFMapper()
g = Graph()

# Bind namespaces for prettier output
g.bind("dct", Namespace("http://purl.org/dc/terms/"))
g.bind("bibo", Namespace("http://purl.org/ontology/bibo/"))
g.bind("foaf", Namespace("http://xmlns.com/foaf/0.1/"))
g.bind("prov", Namespace("http://www.w3.org/ns/prov#"))
g.bind("nif", Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"))

# Add paper to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
print(f"Paper URI: {paper_uri}")
print(f"Triples after adding paper: {len(g)}")

# Add authors
for i, author in enumerate(authors):
    author_uri = author.to_rdf(g, mapper=mapper)
    if i == 0:
        print(f"\nFirst author URI: {author_uri}")

print(f"Triples after adding {len(authors)} authors: {len(g)}")

# Add sections
demo_sections = sections[:3]  # Add first 3 sections for demo
for section in demo_sections:
    section.to_rdf(g, mapper=mapper)

print(f"Triples after adding {len(demo_sections)} sections: {len(g)}")

# Add tables
for table in tables:
    table.to_rdf(g, mapper=mapper)

print(f"Triples after adding {len(tables)} tables: {len(g)}")

# Add references
for reference in references[:5]:  # Add first 5 references for demo
    reference.to_rdf(g, mapper=mapper)

print(f"\nTotal triples in graph: {len(g)}")

## 4. Serialize to Turtle Format

Let's view the RDF in Turtle format:

In [None]:
# Serialize to Turtle
ttl = mapper.serialize_graph(g, format="turtle")

# Display first 1000 characters
print("RDF/Turtle Output (first 1000 characters):")
print("=" * 60)
print(ttl[:1000])
print("...")
print(f"\nTotal output length: {len(ttl)} characters")

## 5. Query the RDF Graph with SPARQL

Now we can query the graph using SPARQL:

In [None]:
# Query 1: Get paper metadata
query1 = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?paper ?title ?doi ?journal
WHERE {
    ?paper a bibo:AcademicArticle .
    ?paper dct:title ?title .
    OPTIONAL { ?paper bibo:doi ?doi }
    OPTIONAL { ?paper bibo:journal ?journal }
}
"""

print("Query 1: Paper Metadata")
print("=" * 60)
results = g.query(query1)
for row in results:
    print(f"Paper: {row.paper}")
    print(f"Title: {row.title}")
    print(f"DOI: {row.doi}")
    print(f"Journal: {row.journal}")

In [None]:
# Query 2: Get all authors
query2 = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?author ?name
WHERE {
    ?author a foaf:Person .
    ?author foaf:name ?name .
}
"""

print("\nQuery 2: Authors")
print("=" * 60)
results = g.query(query2)
for i, row in enumerate(results, 1):
    print(f"{i}. {row.name}")

In [None]:
# Query 3: Get document sections
query3 = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>

SELECT ?section ?title ?content
WHERE {
    ?section a bibo:DocumentPart .
    ?section dct:title ?title .
    OPTIONAL { ?section nif:isString ?content }
}
"""

print("\nQuery 3: Document Sections")
print("=" * 60)
results = g.query(query3)
for i, row in enumerate(results, 1):
    content_preview = str(row.content)[:100] if row.content else "No content"
    print(f"{i}. {row.title}")
    print(f"   Content preview: {content_preview}...")
    print()

## 6. Export to Different RDF Formats

RDF can be serialized to various formats:

In [None]:
# Export to different formats
formats = ["turtle", "xml", "json-ld"]

for fmt in formats:
    output = mapper.serialize_graph(g, format=fmt)
    print(f"{fmt.upper()} format: {len(output)} characters")
    print("First 200 characters:")
    print(output[:200])
    print("...\n")

## 7. Save to Files

Save the RDF to files for loading into a triple store:

In [None]:
# Save to files
output_dir = "/tmp/rdf_output"
os.makedirs(output_dir, exist_ok=True)

# Save as Turtle
ttl_path = f"{output_dir}/paper_{paper.pmcid}.ttl"
mapper.serialize_graph(g, format="turtle", destination=ttl_path)
print(f"Saved Turtle: {ttl_path}")

# Save as JSON-LD
jsonld_path = f"{output_dir}/paper_{paper.pmcid}.jsonld"
mapper.serialize_graph(g, format="json-ld", destination=jsonld_path)
print(f"Saved JSON-LD: {jsonld_path}")

# Save as RDF/XML
xml_path = f"{output_dir}/paper_{paper.pmcid}.rdf"
mapper.serialize_graph(g, format="xml", destination=xml_path)
print(f"Saved RDF/XML: {xml_path}")

print(f"\nAll files saved to: {output_dir}")

## 8. Graph Statistics

Let's analyze the generated RDF graph:

In [None]:
# Count entities by type
# Note: the count variable is named ?total rather than ?count, because
# row.count would resolve to the built-in tuple.count method on the result row.
type_query = """
SELECT ?type (COUNT(?entity) AS ?total)
WHERE {
    ?entity a ?type .
}
GROUP BY ?type
ORDER BY DESC(?total)
"""

print("Entity Type Distribution:")
print("=" * 60)
results = g.query(type_query)
for row in results:
    # Keep only the local name after the last "/" or "#"
    type_name = str(row.type).split("/")[-1].split("#")[-1]
    print(f"{type_name}: {row.total}")

print(f"\nTotal triples: {len(g)}")
print(f"Total unique subjects: {len(set(g.subjects()))}")
print(f"Total unique predicates: {len(set(g.predicates()))}")
print(f"Total unique objects: {len(set(g.objects()))}")
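
The same tally can be computed in plain Python, which is sometimes handier for quick checks than a SPARQL aggregate. The sketch below uses string tuples as stand-ins for triples so that it runs without a graph; the sample data and the `local_name` helper are illustrative assumptions.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def local_name(uri):
    """Return the part of a URI after its last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]


# (subject, predicate, object) stand-ins for graph triples
sample_triples = [
    ("http://example.org/paper/1", RDF_TYPE, "http://purl.org/ontology/bibo/AcademicArticle"),
    ("http://example.org/author/1", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/author/2", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/paper/1", RDF_TYPE, "http://www.w3.org/ns/prov#Entity"),
    ("http://example.org/author/1", "http://xmlns.com/foaf/0.1/name", "Jane Doe"),
]

# Count only rdf:type statements, keyed by the short type name
type_counts = Counter(
    local_name(obj) for _, pred, obj in sample_triples if pred == RDF_TYPE
)

for type_name, total in type_counts.most_common():
    print(f"{type_name}: {total}")
```

With a real rdflib graph, the list of tuples could be replaced by iterating over `g.triples((None, RDF.type, None))` and converting each object with `str()`.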

## 9. Using the CLI Tool

For batch processing, use the CLI tool:

In [None]:
# Example CLI usage (run in terminal)
print("""
Command-line usage:

# Convert single file to Turtle and JSON
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --json output.json -v

# Convert multiple files
for file in *.xml; do
    python scripts/xml_to_rdf.py "$file" --ttl "${file%.xml}.ttl" -v
done

# Use custom RDF mapping configuration
python scripts/xml_to_rdf.py input.xml --ttl output.ttl --config custom_map.yml
""")

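The shell loop above can also be driven from Python, which makes it easier to add logging or error handling. The following is a minimal sketch that only builds the command lines; the `build_commands` helper and the demo file names are hypothetical, while the script path and the `--ttl` and `-v` flags are taken from the usage text above.

```python
import sys
import tempfile
from pathlib import Path


def build_commands(xml_dir, script="scripts/xml_to_rdf.py"):
    """Build one converter command per *.xml file in xml_dir."""
    commands = []
    for xml_file in sorted(Path(xml_dir).glob("*.xml")):
        ttl_file = xml_file.with_suffix(".ttl")  # same name, .ttl extension
        commands.append(
            [sys.executable, script, str(xml_file), "--ttl", str(ttl_file), "-v"]
        )
    return commands


# Demo on a temporary directory with two XML files and one unrelated file
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("PMC1.xml", "PMC2.xml", "notes.txt"):
        (Path(demo_dir) / name).touch()
    commands = build_commands(demo_dir)

for cmd in commands:
    print(" ".join(cmd))
    # To actually run the converter:
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing each command as a list avoids shell quoting issues with file names that contain spaces.
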
## Summary

This notebook demonstrated:

1. Loading and parsing PMC XML files
2. Building typed entity models
3. Normalizing and validating data
4. Converting to RDF with ontology alignment
5. Querying RDF with SPARQL
6. Exporting to multiple RDF formats
7. Saving for triple store integration
8. Analyzing graph statistics
9. Using the CLI tool for batch processing

## Next Steps

- Load RDF into GraphDB, Blazegraph, or other triple stores
- Validate with SHACL shapes (see `shacl/pub.shacl.ttl`)
- Extend ontology mappings in `conf/rdf_map.yml`
- Build federated queries across multiple papers
- Integrate with existing knowledge graphs