# PyEuropePMC Data Models Demo

This notebook demonstrates the new structured data models and RDF mapping capabilities in PyEuropePMC.

## Features

- **Typed Entity Models**: PaperEntity, AuthorEntity, SectionEntity, TableEntity, ReferenceEntity
- **RDF Serialization**: Convert entities to RDF/Turtle format aligned with ontologies (BIBO, FOAF, DCT, etc.)
- **Builder Pattern**: Easy conversion from FullTextXMLParser outputs to typed entities
- **Validation & Normalization**: Built-in data validation and normalization

In [None]:
# Import required modules
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper
from pyeuropepmc.models import PaperEntity, AuthorEntity
from rdflib import Graph
import json

## 1. Create Entities Directly

You can create entity instances directly with typed fields:

In [None]:
# Create a paper entity
paper = PaperEntity(
    pmcid="PMC1234567",
    doi="10.1234/example.2024.001",
    title="Example Scientific Article",
    journal="Nature",
    volume="580",
    issue="7805",
    pages="123-127",
    pub_date="2024-01-15",
    keywords=["bioinformatics", "data science", "RDF"]
)

# Normalize the data (lowercase DOI, trim whitespace, etc.)
paper.normalize()

# Validate the data
paper.validate()

print(f"Paper: {paper.title}")
print(f"DOI (normalized): {paper.doi}")
print(f"Keywords: {paper.keywords}")
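
The comment in the cell above mentions lowercasing the DOI and trimming whitespace, but the notebook does not spell out the full rule set. As a rough illustration only, the sketch below uses a hypothetical `SketchPaper` dataclass (not part of PyEuropePMC) to show one plausible shape for the two steps. The validation rules in particular (required title, `PMC` plus digits) are assumptions, not the library's documented behavior.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchPaper:
    """Hypothetical stand-in for PaperEntity; not the library's class."""

    title: Optional[str] = None
    doi: Optional[str] = None
    pmcid: Optional[str] = None
    keywords: List[str] = field(default_factory=list)

    def normalize(self) -> None:
        # Assumed rules: trim whitespace everywhere, lowercase the DOI,
        # and drop keywords that are empty after trimming.
        if self.title is not None:
            self.title = self.title.strip()
        if self.doi is not None:
            self.doi = self.doi.strip().lower()
        if self.pmcid is not None:
            self.pmcid = self.pmcid.strip()
        self.keywords = [k.strip() for k in self.keywords if k.strip()]

    def validate(self) -> None:
        # Assumed rules: a title is required; a PMCID, if set, is "PMC" + digits.
        if not self.title:
            raise ValueError("title is required")
        if self.pmcid is not None and not re.fullmatch(r"PMC\d+", self.pmcid):
            raise ValueError(f"malformed PMCID: {self.pmcid!r}")


sketch = SketchPaper(
    title="  Example Scientific Article ",
    doi=" 10.1234/EXAMPLE.2024.001",
    pmcid="PMC1234567",
    keywords=["bioinformatics", " RDF ", ""],
)
sketch.normalize()
sketch.validate()
print(sketch.doi)
```

Normalizing before validating means the checks see cleaned values, which is why the cell above calls the two methods in that order.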

In [None]:
# Create author entities
author1 = AuthorEntity(
    full_name="Jane Doe",
    first_name="Jane",
    last_name="Doe",
    orcid="0000-0001-2345-6789"
)

author2 = AuthorEntity(
    full_name="John Smith",
    first_name="John",
    last_name="Smith"
)

print(f"Author 1: {author1.full_name} ({author1.orcid})")
print(f"Author 2: {author2.full_name}")

## 2. Convert Entities to JSON

All entities can be converted to plain dictionaries, which can then be serialized to JSON:

In [None]:
# Convert to dictionary
paper_dict = paper.to_dict()
print(json.dumps(paper_dict, indent=2))
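
When the JSON is stored or exchanged, it can help to drop unset fields and confirm that the output round-trips. The sketch below uses a hand-written dictionary as a stand-in for the result of `paper.to_dict()`; the actual keys the library returns may differ, and the `abstract` key is purely illustrative.

```python
import json

# Stand-in for the result of paper.to_dict(); the real keys may differ.
paper_dict = {
    "pmcid": "PMC1234567",
    "doi": "10.1234/example.2024.001",
    "title": "Example Scientific Article",
    "abstract": None,  # illustrative unset field
    "keywords": ["bioinformatics", "data science", "RDF"],
}

# Drop unset fields so the JSON stays compact.
compact = {key: value for key, value in paper_dict.items() if value is not None}

# sort_keys gives stable output for diffs; ensure_ascii=False keeps non-ASCII readable.
encoded = json.dumps(compact, indent=2, sort_keys=True, ensure_ascii=False)

# Round-trip check: decoding should give back the same dictionary.
decoded = json.loads(encoded)
print(encoded)
```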

## 3. Convert Entities to RDF

Entities can be serialized to RDF using the RDFMapper:

In [None]:
# Initialize RDF mapper and graph
mapper = RDFMapper()
g = Graph()

# Add entities to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
author1_uri = author1.to_rdf(g, mapper=mapper)
author2_uri = author2.to_rdf(g, mapper=mapper)

# Serialize to Turtle format
ttl = mapper.serialize_graph(g, format="turtle")
print(ttl)
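
To make the mapping idea concrete without relying on the library, the standard-library sketch below writes a few N-Triples lines for a paper by hand. It is not how `RDFMapper` works internally. The subject URI scheme (`https://example.org/paper/...`) and the choice of `dct:title` and `bibo:doi` as predicates are assumptions made for illustration; the real mapping is configured by the library.

```python
BIBO = "http://purl.org/ontology/bibo/"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def paper_to_ntriples(pmcid: str, title: str, doi: str) -> str:
    """Return one N-Triples line per statement about the paper."""
    # Hypothetical URI scheme; RDFMapper may mint subject URIs differently.
    subject = f"<https://example.org/paper/{pmcid}>"
    triples = [
        (subject, f"<{RDF_TYPE}>", f"<{BIBO}AcademicArticle>"),
        (subject, f"<{DCT}title>", literal(title)),
        (subject, f"<{BIBO}doi>", literal(doi)),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


nt = paper_to_ntriples(
    "PMC1234567",
    'Example "Scientific" Article',
    "10.1234/example.2024.001",
)
print(nt)
```

In practice a library such as rdflib handles escaping, prefixes, and serialization formats, which is why the notebook delegates this work to `RDFMapper` and `Graph`.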

## 4. Parse XML and Build Entities

The builder layer can convert FullTextXMLParser outputs to entities automatically:

In [None]:
# Sample XML content
sample_xml = '''<?xml version="1.0"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-title>Test Journal</journal-title>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmcid">1234567</article-id>
<article-id pub-id-type="doi">10.1234/test.2021.001</article-id>
<title-group>
<article-title>Sample Test Article Title</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Smith</surname>
<given-names>John</given-names>
</name>
</contrib>
</contrib-group>
<pub-date pub-type="ppub">
<year>2021</year>
<month>12</month>
<day>15</day>
</pub-date>
<volume>10</volume>
<issue>5</issue>
<kwd-group>
<kwd>keyword1</kwd>
<kwd>keyword2</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>Introduction</title>
<p>This is the introduction section with some text.</p>
</sec>
</body>
</article>
'''

# Parse XML
parser = FullTextXMLParser(sample_xml)

# Build entities from parser
paper, authors, sections, tables, references = build_paper_entities(parser)

print(f"Paper: {paper.title}")
print(f"PMCID: {paper.pmcid}")
print(f"DOI: {paper.doi}")
print(f"\nAuthors: {len(authors)}")
for author in authors:
    print(f"  - {author.full_name}")
print(f"\nSections: {len(sections)}")
for section in sections:
    print(f"  - {section.title}")
print(f"\nKeywords: {paper.keywords}")
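
For readers curious where these fields live in the XML, the sketch below pulls the title, DOI, and author names out of a trimmed copy of the sample document using only `xml.etree.ElementTree`. It is an illustration of the JATS layout, not the code path that `FullTextXMLParser` or `build_paper_entities` uses; the variable names are chosen here to avoid clashing with the cell above.

```python
import xml.etree.ElementTree as ET

jats = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmcid">1234567</article-id>
      <article-id pub-id-type="doi">10.1234/test.2021.001</article-id>
      <title-group>
        <article-title>Sample Test Article Title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>John</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""

root = ET.fromstring(jats)
meta = root.find("front/article-meta")

# Attribute predicates select the right <article-id> among several siblings.
sketch_title = meta.findtext("title-group/article-title")
sketch_doi = meta.findtext("article-id[@pub-id-type='doi']")

# Build "Given Surname" strings for every contributor marked as an author.
author_names = [
    f"{name.findtext('given-names')} {name.findtext('surname')}"
    for name in meta.findall("contrib-group/contrib[@contrib-type='author']/name")
]

print(sketch_title, sketch_doi, author_names)
```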

## 5. Complete Pipeline: XML to RDF

Putting it all together: parse the XML, build entities, and convert them to RDF. The cell below reuses the `paper`, `authors`, and `sections` built in section 4:

In [None]:
# Normalize all entities
paper.normalize()
for author in authors:
    author.normalize()
for section in sections:
    section.normalize()

# Create RDF graph
mapper = RDFMapper()
g = Graph()

# Add all entities to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
for author in authors:
    author.to_rdf(g, mapper=mapper)
for section in sections:
    section.to_rdf(g, mapper=mapper)

# Serialize and display
ttl = mapper.serialize_graph(g, format="turtle")
print("RDF/Turtle Output:")
print("=" * 60)
print(ttl)

## 6. Working with Real PMC Files

You can use the same approach with real PMC XML files:

In [None]:
# Example with a fixture file (adjust path as needed)
import os

fixture_path = "../tests/fixtures/fulltext_downloads/PMC3359999.xml"

if os.path.exists(fixture_path):
    with open(fixture_path, "r", encoding="utf-8") as f:
        xml_content = f.read()
    
    # Parse and build entities
    parser = FullTextXMLParser(xml_content)
    paper, authors, sections, tables, references = build_paper_entities(parser)
    
    # Normalize
    paper.normalize()
    
    print(f"Paper: {paper.title}")
    print(f"Authors: {len(authors)}")
    print(f"Sections: {len(sections)}")
    print(f"Tables: {len(tables)}")
    print(f"References: {len(references)}")
    
    # Convert to RDF
    mapper = RDFMapper()
    g = Graph()
    paper.to_rdf(g, mapper=mapper)
    
    print(f"\nRDF triples generated: {len(g)}")
else:
    print(f"Fixture file not found: {fixture_path}")
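
The same pattern extends to a whole directory of downloads. The sketch below defines a hypothetical helper, `iter_xml_documents` (not part of PyEuropePMC), that yields the path and text of each `.xml` file in a folder. The directory path is the fixture folder used above and may need adjusting.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_xml_documents(directory: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (path, text) for each .xml file in a directory, sorted by name."""
    if not directory.is_dir():
        return  # Nothing to yield for a missing directory.
    for path in sorted(directory.glob("*.xml")):
        yield path, path.read_text(encoding="utf-8")


# Each text could then be passed to FullTextXMLParser as in the cell above.
for xml_path, xml_text in iter_xml_documents(Path("../tests/fixtures/fulltext_downloads")):
    print(f"{xml_path.name}: {len(xml_text)} characters")
```

Sorting the paths keeps the processing order stable between runs, which makes output easier to compare.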

## 7. Querying the RDF Graph

You can query the generated RDF graph (the `g` created in the previous cells) using SPARQL:

In [None]:
# Example SPARQL query to find all papers with titles

query = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle .
    ?paper dct:title ?title .
}
"""

results = g.query(query)
print("Papers in graph:")
for row in results:
    print(f"  - {row.title}")

## Summary

This notebook demonstrated:

1. Creating typed entity models directly
2. Converting entities to JSON
3. Converting entities to RDF/Turtle
4. Building entities from XML parser outputs
5. Complete pipeline from XML to RDF
6. Working with real PMC files
7. Querying the RDF graph with SPARQL

## Next Steps

- Use the CLI script `scripts/xml_to_rdf.py` for batch processing
- Validate RDF output with SHACL shapes in `shacl/pub.shacl.ttl`
- Load RDF into a triple store (e.g., GraphDB, Blazegraph)
- Extend the ontology mappings in `conf/rdf_map.yml`