# PyEuropePMC Data Models Demo

This notebook demonstrates the new structured data models and RDF mapping capabilities in PyEuropePMC.

## Features

- **Typed Entity Models**: PaperEntity, AuthorEntity, SectionEntity, TableEntity, ReferenceEntity
- **RDF Serialization**: Convert entities to RDF/Turtle format aligned with ontologies (BIBO, FOAF, DCT, etc.)
- **Builder Pattern**: Easy conversion from FullTextXMLParser outputs to typed entities
- **Validation & Normalization**: Built-in data validation and normalization

In [20]:
# Import required modules
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper
from pyeuropepmc.models import PaperEntity, AuthorEntity
from rdflib import Graph
import json

## 1. Create Entities Directly

You can create entity instances directly with typed fields:

In [21]:
# Create a paper entity
paper = PaperEntity(
    pmcid="PMC1234567",
    doi="10.1234/example.2024.001",
    title="Example Scientific Article",
    journal="Nature",
    volume="580",
    issue="7805",
    pages="123-127",
    pub_date="2024-01-15",
    keywords=["bioinformatics", "data science", "RDF"]
)

# Normalize the data (lowercase DOI, trim whitespace, etc.)
paper.normalize()

# Validate the data
paper.validate()

print(f"Paper: {paper.title}")
print(f"DOI (normalized): {paper.doi}")
print(f"Keywords: {paper.keywords}")

Paper: Example Scientific Article
DOI (normalized): 10.1234/example.2024.001
Keywords: ['bioinformatics', 'data science', 'RDF']
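
The example DOI above is already in canonical form, so `normalize()` leaves it unchanged. To make the idea concrete, here is a minimal standalone sketch of what DOI normalization could look like. `normalize_doi` is a hypothetical helper written for illustration, not part of PyEuropePMC, and stripping resolver prefixes is an assumption that goes beyond the "lowercase DOI, trim whitespace" behavior mentioned in the cell comment.

```python
# Hypothetical helper for illustration only; PyEuropePMC's own
# normalize() may apply different rules.
RESOLVER_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return a bare, lowercase DOI with surrounding whitespace removed."""
    doi = raw.strip().lower()
    # Assumption: a resolver URL or "doi:" prefix should be dropped.
    for prefix in RESOLVER_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip()


print(normalize_doi("  https://doi.org/10.1234/Example.2024.001 "))
# prints: 10.1234/example.2024.001
```

DOIs are case-insensitive, so lowercasing them gives two records of the same article the same key.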


In [22]:
# Create author entities
author1 = AuthorEntity(
    full_name="Jane Doe",
    first_name="Jane",
    last_name="Doe",
    orcid="0000-0001-2345-6789"
)

author2 = AuthorEntity(
    full_name="John Smith",
    first_name="John",
    last_name="Smith"
)

print(f"Author 1: {author1.full_name} ({author1.orcid})")
print(f"Author 2: {author2.full_name}")

Author 1: Jane Doe (0000-0001-2345-6789)
Author 2: John Smith
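
ORCID iDs carry a check digit (ISO 7064 MOD 11-2), so a validation step can catch mistyped identifiers. The sketch below is a standalone illustration of that check. Whether `AuthorEntity.validate()` performs it is not shown in this notebook, and `is_valid_orcid` is a hypothetical name.

```python
import re

# Four groups of four characters; only the last character may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check both the layout and the check digit of an ORCID iD."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


print(is_valid_orcid("0000-0001-2345-6789"))  # True
print(is_valid_orcid("0000-0001-2345-6788"))  # False (wrong check digit)
```

The placeholder ORCID used for `author1` passes this checksum.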


## 2. Convert Entities to JSON

Every entity exposes `to_dict()`, which returns a plain dictionary that `json.dumps` can serialize:

In [23]:
# Convert to dictionary
paper_dict = paper.to_dict()
print(json.dumps(paper_dict, indent=2))

{
  "id": null,
  "label": "Example Scientific Article",
  "source_uri": null,
  "confidence": null,
  "types": [
    "bibo:AcademicArticle"
  ],
  "title": "Example Scientific Article",
  "doi": "10.1234/example.2024.001",
  "volume": 580,
  "pages": "123-127",
  "authors": null,
  "publication_year": null,
  "publication_date": null,
  "pmcid": "PMC1234567",
  "pmid": null,
  "semantic_scholar_id": null,
  "journal": "Nature",
  "issue": "7805",
  "pub_date": "2024-01-15",
  "keywords": [
    "bioinformatics",
    "data science",
    "RDF"
  ],
  "abstract": null,
  "citation_count": null,
  "influential_citation_count": null,
  "topics": null,
  "fields_of_study": null,
  "is_oa": null,
  "oa_status": null,
  "oa_url": null,
  "has_pdf": null,
  "has_supplementary": null,
  "in_epmc": null,
  "in_pmc": null,
  "cited_by_count": null,
  "reference_count": null,
  "pub_type": null,
  "journal_issn": null,
  "page_info": null,
  "first_page": null,
  "last_page": null,
  "publisher": n

## 3. Convert Entities to RDF

Entities can be serialized to RDF using the RDFMapper:

In [24]:
# Initialize RDF mapper and graph
mapper = RDFMapper()
g = Graph()

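# Bind namespace prefixes so the Turtle output uses readable prefixes.
# Note: _bind_namespaces is a private method (leading underscore), so it may change between releases.
mapper._bind_namespaces(g)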

# Add entities to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
author1_uri = author1.to_rdf(g, mapper=mapper)
author2_uri = author2.to_rdf(g, mapper=mapper)

# Serialize to Turtle format
ttl = mapper.serialize_graph(g, format="turtle")
print(ttl)

@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix datacite: <http://purl.org/spar/datacite/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mesh: <http://id.nlm.nih.gov/mesh/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/data/author/john-smith> a foaf:Person ;
    rdfs:label "John Smith" ;
    prov:generatedAtTime "2025-11-24T15:43:13.992891" ;
    prov:wasGeneratedBy "pyeuropepmc_parser" ;
    foaf:familyName "Smith" ;
    foaf:givenName "John" ;
    foaf:name "John Smith" .

<https://doi.org/10.1234/example.2024.001> a bibo:AcademicArticle ;
    rdfs:label "Example Scientific Article" ;
    mesh:hasSubject "RDF",
        "bioinformatics",
        "data science" ;
    dcterms:identifier "PMC1234567" ;
    dcterms:issued "2024-01-15" ;
    dcterms:su
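
In the output above, the author without an ORCID gets a name-based URI under `http://example.org/data/author/`, and the paper gets a DOI-based URI. The following sketch shows one way such URIs could be minted. The scheme is inferred from the visible output only. `slugify`, `author_uri`, and `paper_uri` are hypothetical helpers, and the library's actual rules may differ.

```python
import re
import unicodedata

# Base URI taken from the Turtle output above.
BASE = "http://example.org/data/"


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join alphanumeric runs with hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def author_uri(full_name: str) -> str:
    """Mint a name-based author URI (assumed fallback when no ORCID exists)."""
    return f"{BASE}author/{slugify(full_name)}"


def paper_uri(doi: str) -> str:
    """Mint a DOI-based paper URI."""
    return f"https://doi.org/{doi.strip().lower()}"


print(author_uri("John Smith"))
# prints: http://example.org/data/author/john-smith
print(paper_uri("10.1234/example.2024.001"))
# prints: https://doi.org/10.1234/example.2024.001
```

Name-based slugs collide when two different authors share a name, so a persistent identifier such as an ORCID is the safer key when one is available.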

## 4. Parse XML and Build Entities

The builder layer can convert FullTextXMLParser outputs to entities automatically:

In [25]:
# Sample XML content
sample_xml = '''<?xml version="1.0"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-title>Test Journal</journal-title>
</journal-meta>
<article-meta>
<article-id pub-id-type="pmcid">1234567</article-id>
<article-id pub-id-type="doi">10.1234/test.2021.001</article-id>
<title-group>
<article-title>Sample Test Article Title</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Smith</surname>
<given-names>John</given-names>
</name>
</contrib>
</contrib-group>
<pub-date pub-type="ppub">
<year>2021</year>
<month>12</month>
<day>15</day>
</pub-date>
<volume>10</volume>
<issue>5</issue>
<kwd-group>
<kwd>keyword1</kwd>
<kwd>keyword2</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>Introduction</title>
<p>This is the introduction section with some text.</p>
</sec>
</body>
</article>
'''

# Parse XML
parser = FullTextXMLParser(sample_xml)

# Build entities from parser
paper, authors, sections, tables, figures, references = build_paper_entities(parser)

print(f"Paper: {paper.title}")
print(f"PMCID: {paper.pmcid}")
print(f"DOI: {paper.doi}")
print(f"\nAuthors: {len(authors)}")
for author in authors:
    print(f"  - {author.full_name}")
print(f"\nSections: {len(sections)}")
for section in sections:
    print(f"  - {section.title}")
print(f"\nTables: {len(tables)}")
print(f"Figures: {len(figures)}")
print(f"References: {len(references)}")
print(f"\nKeywords: {paper.keywords}")

Paper: Sample Test Article Title
PMCID: 1234567
DOI: 10.1234/test.2021.001

Authors: 1
  - John Smith

Sections: 1
  - Introduction

Tables: 0
Figures: 0
References: 0

Keywords: ['keyword1', 'keyword2']
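
For intuition about what the parser and builder do, the same fields can be read from JATS XML with the standard library alone. This is a simplified sketch using `xml.etree.ElementTree` on a trimmed copy of the sample document. It is not how `FullTextXMLParser` or `build_paper_entities` are implemented.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample article used in the cell above.
JATS = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1234/test.2021.001</article-id>
      <title-group>
        <article-title>Sample Test Article Title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Smith</surname><given-names>John</given-names></name>
        </contrib>
      </contrib-group>
      <kwd-group><kwd>keyword1</kwd><kwd>keyword2</kwd></kwd-group>
    </article-meta>
  </front>
</article>"""

root = ET.fromstring(JATS)
meta = root.find("front/article-meta")

# Attribute predicates select the right <article-id> and <contrib> elements.
title = meta.findtext("title-group/article-title")
doi = meta.findtext("article-id[@pub-id-type='doi']")
authors = [
    f"{name.findtext('given-names')} {name.findtext('surname')}"
    for name in meta.iterfind("contrib-group/contrib[@contrib-type='author']/name")
]
keywords = [kwd.text for kwd in meta.iterfind("kwd-group/kwd")]

print(title)     # Sample Test Article Title
print(doi)       # 10.1234/test.2021.001
print(authors)   # ['John Smith']
print(keywords)  # ['keyword1', 'keyword2']
```

Real PMC files add namespaces, nested sections, and inline markup, which is where a dedicated parser and typed entities earn their keep.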


## 5. Complete Pipeline: XML to RDF

Putting it all together: parse the XML, build entities, and convert them to RDF.

In [26]:
# Normalize all entities
paper.normalize()
for author in authors:
    author.normalize()
for section in sections:
    section.normalize()

# Create RDF graph
mapper = RDFMapper()
g = Graph()

# Add all entities to graph
paper_uri = paper.to_rdf(g, mapper=mapper)
for author in authors:
    author.to_rdf(g, mapper=mapper)
for section in sections:
    section.to_rdf(g, mapper=mapper)

# Serialize and display
ttl = mapper.serialize_graph(g, format="turtle")
print("RDF/Turtle Output:")
print("=" * 60)
print(ttl)

RDF/Turtle Output:
============================================================
@prefix bibo: <http://purl.org/ontology/bibo/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix mesh: <http://id.nlm.nih.gov/mesh/> .
@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.org/data/author/john-smith> a foaf:Person ;
    rdfs:label "John Smith" ;
    prov:generatedAtTime "2025-11-24T15:43:14.022577" ;
    prov:wasGeneratedBy "pyeuropepmc_parser" ;
    foaf:familyName "Smith" ;
    foaf:givenName "John" ;
    foaf:name "John Smith" .

<http://example.org/data/section/f7f32db7-21ca-4020-98c8-fea084d6a720> a nif:Context,
        bibo:DocumentPart ;
    rdfs:label "Introduction" ;
    nif:isString "This is the introduction section with some text." ;
    dcterms:tit

## 6. Working with Real PMC Files

You can use the same approach with real PMC XML files:

In [27]:
# Example with a fixture file (adjust path as needed)
import os

fixture_path = "../tests/fixtures/fulltext_downloads/PMC3359999.xml"

if os.path.exists(fixture_path):
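    # Read with an explicit encoding so the result does not depend on the platform default
    with open(fixture_path, encoding="utf-8") as f: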
        xml_content = f.read()

    # Parse and build entities
    parser = FullTextXMLParser(xml_content)
    paper, authors, sections, tables, figures, references = build_paper_entities(parser)

    # Normalize
    paper.normalize()

    print(f"Paper: {paper.title}")
    print(f"Authors: {len(authors)}")
    print(f"Sections: {len(sections)}")
    print(f"Tables: {len(tables)}")
    print(f"Figures: {len(figures)}")
    print(f"References: {len(references)}")

    # Convert to RDF
    mapper = RDFMapper()
    g = Graph()
    paper.to_rdf(g, mapper=mapper)

    print(f"\nRDF triples generated: {len(g)}")
else:
    print(f"Fixture file not found: {fixture_path}")

Paper: Risk Factors of Porcine Cysticercosis in the Eastern Cape Province, South Africa
Authors: 8
Sections: 10
Tables: 2
Figures: 0
References: 28

RDF triples generated: 15


In [28]:
authors

[AuthorEntity(id=None, label='Rosina Claudia Krecek', source_uri=None, confidence=None, types=['foaf:Person'], full_name='Rosina Claudia Krecek', first_name='Rosina Claudia', last_name='Krecek', initials=None, affiliation_text='1\nDepartment of Research, Ross University School of Veterinary Medicine (RUSVM), Basseterre, St. Kitts, West Indies; 2\nDepartment of Zoology, University of Johannesburg, Auckland Park, South Africa', orcid=None, name=None, openalex_id=None, semantic_scholar_id=None, institutions=None, position=None, sources=None, email=None, semantic_scholar_author_id=None, scopus_author_id=None, researcher_id=None, orcid_works_count=None, h_index=None, citation_count=None, paper_count=None, data_sources=[], last_updated=None),
 AuthorEntity(id=None, label='Hamish Mohammed', source_uri=None, confidence=None, types=['foaf:Person'], full_name='Hamish Mohammed', first_name='Hamish', last_name='Mohammed', initials=None, affiliation_text='3\nUniversity of Trinidad and Tobago, Arima

## 7. Querying the RDF Graph

You can query the generated RDF graph using SPARQL:

In [29]:
# Example SPARQL query: list every bibo:AcademicArticle in the graph with its title.
# `g` is the graph built from the fixture file in the previous cell.

query = """
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?paper ?title
WHERE {
    ?paper a bibo:AcademicArticle .
    ?paper dct:title ?title .
}
"""

results = g.query(query)
print("Papers in graph:")
for row in results:
    print(f"  - {row.title}")

Papers in graph:
  - Risk Factors of Porcine Cysticercosis in the Eastern Cape Province, South Africa


## Summary

This notebook demonstrated:

1. Creating typed entity models directly
2. Converting entities to JSON
3. Converting entities to RDF/Turtle
4. Building entities from XML parser outputs
5. Complete pipeline from XML to RDF
6. Working with real PMC files
7. Querying the RDF graph with SPARQL

## Next Steps

- Use the CLI script `scripts/xml_to_rdf.py` for batch processing
- Validate RDF output with SHACL shapes in `shacl/pub.shacl.ttl`
- Load RDF into a triple store (e.g., GraphDB, Blazegraph)
- Extend the ontology mappings in `conf/rdf_map.yml`
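
As a rough picture of what batch processing involves, the loop below converts every XML file in a directory and writes one Turtle file per input. It does not describe the interface of `scripts/xml_to_rdf.py`. `batch_convert` is a hypothetical function, the file name `PMC1.xml` is a placeholder, and the `convert` callable stands in for the parse, build, and serialize steps from section 5.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def batch_convert(
    src_dir: Path, out_dir: Path, convert: Callable[[str], str]
) -> List[Path]:
    """Run `convert` on each *.xml file and write the result as <stem>.ttl."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for xml_path in sorted(src_dir.glob("*.xml")):
        turtle = convert(xml_path.read_text(encoding="utf-8"))
        target = out_dir / f"{xml_path.stem}.ttl"
        target.write_text(turtle, encoding="utf-8")
        written.append(target)
    return written


# Demo with a stand-in converter; a real one would wrap the section 5 pipeline.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "xml"
    src.mkdir()
    (src / "PMC1.xml").write_text("<article/>", encoding="utf-8")
    outputs = batch_convert(
        src,
        Path(tmp) / "ttl",
        convert=lambda xml: f"# {len(xml)} characters of XML\n",
    )
    names = [path.name for path in outputs]

print(names)  # ['PMC1.ttl']
```

Passing the converter in as a callable keeps the file handling testable without PyEuropePMC or rdflib installed.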