# RDF Mapping Demo: From Search to Enriched Knowledge Graph

This notebook walks through the complete workflow for converting Europe PMC search results into RDF triples, first from the basic metadata in the full-text XML and then from metadata enriched through external APIs. Comparing the two graphs shows what enrichment adds.

## Overview

1. **Basic Search**: Retrieve paper metadata from Europe PMC
2. **XML Retrieval**: Get full-text XML for selected papers
3. **RDF Conversion**: Convert XML to RDF triples
4. **Enrichment**: Enhance metadata using external APIs
5. **Enriched RDF**: Generate RDF from enriched data
6. **Comparison**: Analyze differences between basic and enriched RDF

## Import Required Libraries

In [1]:
# Import required libraries
from pyeuropepmc import SearchClient, FullTextClient, PaperEnricher, EnrichmentConfig
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper
from pyeuropepmc.models import PaperEntity, AuthorEntity, InstitutionEntity
from rdflib import Graph
import json
import os

print("Libraries imported successfully!")

Libraries imported successfully!


## Basic Europe PMC Search

Let's start with a basic search to retrieve paper metadata from Europe PMC.

In [2]:
# Perform a basic search
query = "TITLE:cancer AND immunotherapy AND OPEN_ACCESS:y AND (PUB_YEAR:[2020 TO 2025])"
print(f"Searching for: '{query}'")

with SearchClient() as client:
    results = client.search(query, resultType="core", pageSize=10)

    # Extract papers
    papers = results.get("resultList", {}).get("result", [])
    print(f"Found {len(papers)} papers")

    # Display first 5 papers
    for i, paper in enumerate(papers[:5], 1):
        print(f"\n{i}. {paper.get('title', 'No title')[:80]}...")
        print(f"   PMCID: {paper.get('pmcid', 'N/A')}")
        print(f"   DOI: {paper.get('doi', 'N/A')}")
        print(f"   PMID: {paper.get('pmid', 'N/A')}")

    # Store first 5 papers for further processing
    selected_papers = papers[:5]

Searching for: 'TITLE:cancer AND immunotherapy AND OPEN_ACCESS:y AND (PUB_YEAR:[2020 TO 2025])'
Found 10 papers

1. Characterizing Inquiries About Novel Cancer Therapies: Findings from the Nationa...
   PMCID: PMC12550823
   DOI: 10.36401/jipo-25-15
   PMID: 41143138

2. Immunotherapeutics in metastatic cancer management....
   PMCID: PMC12618787
   DOI: 10.1007/s12672-025-03561-5
   PMID: 41239147

3. Identification of cancer-associated fibroblast subpopulation and construction of...
   PMCID: PMC12605440
   DOI: 10.21037/tcr-2025-996
   PMID: 41234876

4. A Rare Case of Chemotherapy Combined with Immunotherapy for Dual Primary AFP-Pos...
   PMCID: PMC12628704
   DOI: 10.2147/cmar.s548762
   PMID: 41267890

5. Overcoming barriers in primary bone cancer: Nanomaterial-enabled immunotherapy...
   PMCID: PMC12639873
   DOI: N/A
   PMID: N/A

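The fifth result above has no DOI or PMID, which matters later because enrichment is keyed on the DOI. A quick tally of the available identifiers can make such gaps visible before moving on. The sketch below uses a hypothetical helper, `identifier_coverage`, and hand-written sample records that only mimic the `pmcid`, `doi` and `pmid` keys seen in the search results.

```python
from collections import Counter


def identifier_coverage(papers):
    """Count how many records carry each identifier type."""
    counts = Counter()
    for paper in papers:
        for key in ("pmcid", "doi", "pmid"):
            # Missing keys and empty strings both count as "absent"
            if paper.get(key):
                counts[key] += 1
    return dict(counts)


# Hypothetical sample shaped like Europe PMC search records
sample_papers = [
    {"pmcid": "PMC12550823", "doi": "10.36401/jipo-25-15", "pmid": "41143138"},
    {"pmcid": "PMC12639873"},
]

coverage = identifier_coverage(sample_papers)
print(coverage)
```

In the notebook, the same function could be called on `selected_papers` to see how many of the five papers can be enriched by DOI.
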
## Retrieve 5 Papers as XML

Now let's fetch the full-text XML for the selected papers.

In [3]:
# Retrieve XML for selected papers
xml_data = {}

with FullTextClient() as client:
    for i, paper in enumerate(selected_papers, 1):
        pmcid = paper.get('pmcid')
        if pmcid:
            try:
                xml_content = client.get_fulltext_content(pmcid, format_type="xml")
                xml_data[pmcid] = xml_content
                print(f"Retrieved XML for paper {i}: {pmcid} ({len(xml_content)} chars)")
            except Exception as e:
                print(f"Failed to retrieve XML for paper {i}: {pmcid} - {e}")
        else:
            print(f"Paper {i} has no PMCID, skipping XML retrieval")

print(f"\nSuccessfully retrieved XML for {len(xml_data)} papers")

Retrieved XML for paper 1: PMC12550823 (47950 chars)
Retrieved XML for paper 2: PMC12618787 (184154 chars)
Retrieved XML for paper 3: PMC12605440 (84377 chars)
Retrieved XML for paper 4: PMC12628704 (41440 chars)
Retrieved XML for paper 5: PMC12639873 (201680 chars)

Successfully retrieved XML for 5 papers


## Convert XML to RDF

Convert the retrieved XML data to RDF triples using the parsing and mapping utilities.

In [4]:
# Convert XML to RDF
basic_rdf_graph = Graph()
mapper = RDFMapper()

for pmcid, xml_content in xml_data.items():
    try:
        # Parse XML
        parser = FullTextXMLParser(xml_content)

        # Build entities (returns: paper, authors, sections, tables, figures, references)
        paper, authors, sections, tables, figures, references = build_paper_entities(parser)

        # Add to RDF graph
        paper_uri = paper.to_rdf(basic_rdf_graph, mapper=mapper)

        print(f"Added paper {pmcid} to RDF graph (URI: {paper_uri})")

    except Exception as e:
        print(f"Failed to process XML for {pmcid}: {e}")

print(f"\nBasic RDF graph contains {len(basic_rdf_graph)} triples")

# Serialize basic RDF
basic_rdf_ttl = mapper.serialize_graph(basic_rdf_graph, format='turtle')
print(f"Basic RDF serialized to {len(basic_rdf_ttl)} characters")

Added paper PMC12550823 to RDF graph (URI: https://doi.org/10.36401/JIPO-25-15)
Added paper PMC12618787 to RDF graph (URI: https://doi.org/10.1007/s12672-025-03561-5)
Added paper PMC12605440 to RDF graph (URI: https://doi.org/10.21037/tcr-2025-996)
Added paper PMC12628704 to RDF graph (URI: https://doi.org/10.2147/CMAR.S548762)
Added paper PMC12639873 to RDF graph (URI: https://doi.org/10.1016/j.mtbio.2025.102530)

Basic RDF graph contains 125 triples
Basic RDF serialized to 5962 characters

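The URIs above keep the DOI casing found in the XML (for example `JIPO-25-15`), while the search results report the same DOIs in lower case. DOIs are case-insensitive, but RDF compares URIs as exact strings, so two spellings yield two distinct subjects. The sketch below shows one way to normalize a DOI before minting a URI. `doi_to_uri` is a hypothetical helper, not part of pyeuropepmc, and lower-casing is an assumed convention.

```python
def doi_to_uri(doi):
    """Return a canonical https://doi.org/ URI for a DOI string."""
    value = doi.strip()
    lowered = value.lower()
    # Drop common resolver or scheme prefixes if present
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    # Lower-case the DOI so both spellings map to one URI
    return "https://doi.org/" + value.lower()


print(doi_to_uri("10.36401/JIPO-25-15"))
print(doi_to_uri("https://doi.org/10.2147/CMAR.S548762"))
```

Applying the same normalization to both the XML-derived and the enriched data would let triples about one paper share a single subject URI.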

## Enrich Papers Using Enrichment Client

Enrich the paper metadata with data from external APIs (Crossref, Semantic Scholar, OpenAlex, DataCite and ROR) through the `PaperEnricher`.

In [5]:
# Configure enrichment
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_datacite=True,
    # Note: Unpaywall requires email, disabled for demo
    enable_unpaywall=False,
    enable_ror=True
)

# Enrich the selected papers
enriched_data = {}

with PaperEnricher(config) as enricher:
    for paper in selected_papers:
        doi = paper.get('doi')
        pmcid = paper.get('pmcid')

        # Try DOI first, then PMCID if DOI is not available
        identifier = doi or pmcid
        identifier_type = "DOI" if doi else "PMCID"

        if identifier:
            try:
                print(f"Enriching {identifier} ({identifier_type})...")
                result = enricher.enrich_paper(identifier=identifier)
                if result is not None and result.get('sources'):
                    enriched_data[identifier] = result
                    print(f"  Sources: {result.get('sources', [])}")
                else:
                    print(f"  No enrichment data found for {identifier}")
            except Exception as e:
                print(f"  Failed to enrich {identifier}: {e}")
        else:
            print("Paper has neither DOI nor PMCID, skipping enrichment")

print(f"\nEnriched {len(enriched_data)} papers")

Enriching 10.36401/jipo-25-15 (DOI)...


No data found for identifier: 10.36401/jipo-25-15
No data from datacite


  Sources: ['crossref', 'semantic_scholar', 'openalex', 'ror']
Enriching 10.1007/s12672-025-03561-5 (DOI)...


No data found for identifier: 10.1007/s12672-025-03561-5
No data from datacite


  Sources: ['crossref', 'semantic_scholar', 'openalex', 'ror']
Enriching 10.21037/tcr-2025-996 (DOI)...


No data found for identifier: 10.21037/tcr-2025-996
No data from datacite


  Sources: ['crossref', 'semantic_scholar', 'openalex', 'ror']
Enriching 10.2147/cmar.s548762 (DOI)...


No data found for identifier: 10.2147/cmar.s548762
No data from datacite


  Sources: ['crossref', 'semantic_scholar', 'openalex']
Enriching PMC12639873 (PMCID)...


Error resolving PMCID PMC12639873: No DOI found for PMCID PMC12639873


  Failed to enrich PMC12639873: Could not resolve PMCID PMC12639873 to DOI

Enriched 4 papers

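The last paper failed because its search record carries no DOI and the PMCID could not be resolved, yet the XML conversion step printed a DOI-based URI for the same PMCID. One possible fallback is to read the DOI from the full-text XML already held in `xml_data`. The sketch assumes the XML follows the JATS convention of `<article-id pub-id-type="doi">` inside `<article-meta>`. `extract_doi_from_jats` is a hypothetical helper, and the sample document is a hand-written fragment rather than the real article.

```python
import xml.etree.ElementTree as ET


def extract_doi_from_jats(xml_text):
    """Return the first DOI found in <article-meta>, or None."""
    root = ET.fromstring(xml_text)
    # Restrict the search to article metadata, not the reference list
    for meta in root.iter("article-meta"):
        for node in meta.iter("article-id"):
            if node.get("pub-id-type") == "doi" and node.text:
                return node.text.strip()
    return None


# Hand-written fragment (assumption: JATS-style identifiers)
sample_xml = """<article>
  <front>
    <article-meta>
      <article-id pub-id-type="pmcid">PMC12639873</article-id>
      <article-id pub-id-type="doi">10.1016/j.mtbio.2025.102530</article-id>
    </article-meta>
  </front>
</article>"""

fallback_doi = extract_doi_from_jats(sample_xml)
print(fallback_doi)
```

In the enrichment loop, a value recovered this way could stand in for `doi` when the search record has none.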

## Generate RDF from Enriched Data

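Build paper, author and institution entities from the enriched metadata, add them to a new graph, and link papers to authors and authors to institutions.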

In [6]:
# Create enriched RDF graph (Graph was imported in the first cell)
enriched_rdf_graph = Graph()

# Process each enriched paper
for doi, enrichment_result in enriched_data.items():
    try:
        # Create entities from enrichment result
        paper = PaperEntity.from_enrichment_result(enrichment_result)

        # Create AuthorEntity objects from enrichment data
        merged = enrichment_result.get("merged", {})
        authors_data = merged.get("authors", [])
        authors = []
        for author_dict in authors_data:
            author = AuthorEntity.from_enrichment_dict(author_dict)
            authors.append(author)

        # Create InstitutionEntity objects from authors
        institutions = []
        seen_inst_ids = set()
        for author in authors:
            if author.institutions:
                for inst_dict in author.institutions:
                    inst_id = inst_dict.get("ror_id") or inst_dict.get("id")
                    if inst_id and inst_id not in seen_inst_ids:
                        seen_inst_ids.add(inst_id)
                        institution = InstitutionEntity.from_enrichment_dict(inst_dict)
                        institutions.append(institution)

        # Add entities to RDF graph
        paper_uri = paper.to_rdf(enriched_rdf_graph, mapper=mapper)

        # Add authors
        for author in authors:
            author.to_rdf(enriched_rdf_graph, mapper=mapper)

        # Add institutions
        for institution in institutions:
            institution.to_rdf(enriched_rdf_graph, mapper=mapper)

        # Map paper-to-author relationships
        related_entities = {"authors": authors}
        mapper.map_relationships(enriched_rdf_graph, paper_uri, paper, related_entities)

        # Link authors to institutions
        inst_map = {}
        for institution in institutions:
            inst_id = institution.ror_id or institution.openalex_id
            if inst_id:
                inst_map[inst_id] = institution

        for author in authors:
            if author.institutions:
                author_uri = mapper._generate_entity_uri(author)
                author_insts = []
                for inst_dict in author.institutions:
                    inst_id = inst_dict.get("ror_id") or inst_dict.get("id")
                    if inst_id and inst_id in inst_map:
                        author_insts.append(inst_map[inst_id])
                if author_insts:
                    mapper.map_relationships(enriched_rdf_graph, author_uri, author, {"institutions": author_insts})

        print(f"Added enriched paper {doi} to RDF graph")

    except Exception as e:
        print(f"Failed to process enriched data for {doi}: {e}")

print(f"\nEnriched RDF graph contains {len(enriched_rdf_graph)} triples")

# Serialize enriched RDF
enriched_rdf_ttl = mapper.serialize_graph(enriched_rdf_graph, format='turtle')
print(f"Enriched RDF serialized to {len(enriched_rdf_ttl)} characters")

Added enriched paper 10.36401/jipo-25-15 to RDF graph
Added enriched paper 10.1007/s12672-025-03561-5 to RDF graph
Added enriched paper 10.21037/tcr-2025-996 to RDF graph
Added enriched paper 10.2147/cmar.s548762 to RDF graph

Enriched RDF graph contains 426 triples
Enriched RDF serialized to 34322 characters


## Compare RDF Versions

Compare the RDF generated from basic XML versus enriched data to highlight the differences.

In [7]:
# Compare RDF versions
print("RDF Comparison Summary")
print("=" * 50)
print(f"Basic RDF triples: {len(basic_rdf_graph)}")
print(f"Enriched RDF triples: {len(enriched_rdf_graph)}")
print(f"Difference: {len(enriched_rdf_graph) - len(basic_rdf_graph)} additional triples")

# Find unique triples in enriched vs basic
enriched_only = enriched_rdf_graph - basic_rdf_graph
basic_only = basic_rdf_graph - enriched_rdf_graph

print(f"\nTriples only in enriched RDF: {len(enriched_only)}")
print(f"Triples only in basic RDF: {len(basic_only)}")

# Show some examples of enriched triples
print("\nSample enriched triples (first 10):")
for i, (s, p, o) in enumerate(enriched_only):
    if i >= 10:
        break
    print(f"  {s} {p} {o}")

# Analyze predicates used
basic_predicates = set(basic_rdf_graph.predicates())
enriched_predicates = set(enriched_rdf_graph.predicates())

new_predicates = enriched_predicates - basic_predicates
print(f"\nNew predicates in enriched RDF: {len(new_predicates)}")
for pred in sorted(new_predicates)[:10]:
    print(f"  {pred}")

print("\n" + "=" * 50)
print("Key Benefits of Enrichment:")
print("• Additional citation metrics and impact data")
print("• Author affiliations and institutional data")
print("• Funding information and grant details")
print("• Topic classifications and research areas")
print("• Enhanced metadata completeness for knowledge graphs")
print("=" * 50)

RDF Comparison Summary
Basic RDF triples: 125
Enriched RDF triples: 426
Difference: 301 additional triples

Triples only in enriched RDF: 407
Triples only in basic RDF: 106

Sample enriched triples (first 10):
  https://openalex.org/A2331896408 http://example.org/openAlexId https://openalex.org/A2331896408
  https://orcid.org/https://orcid.org/0000-0001-8551-6468 http://www.w3.org/2000/01/rdf-schema#label Guanghui Wang
  https://orcid.org/https://orcid.org/0000-0001-8551-6468 http://xmlns.com/foaf/0.1/name Guanghui Wang
  https://ror.org/02tbvhh96 http://xmlns.com/foaf/0.1/homepage http://www.dyyy.xjtu.edu.cn/
  https://orcid.org/https://orcid.org/0000-0002-8429-4194 http://www.w3.org/2000/01/rdf-schema#label Diane Ng
  https://doi.org/10.2147/cmar.s548762 http://purl.org/ontology/bibo/issn ['1179-1322']
  https://openalex.org/A5102086970 http://www.w3.org/ns/prov#generatedAtTime 2025-11-24T11:19:45.010152
  https://doi.org/10.1007/s12672-025-03561-5 http://purl.org/ontology/bibo/issn 

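Two of the sample triples hint at input that could be cleaned before mapping: the ORCID subjects repeat the `https://orcid.org/` prefix, and one ISSN object appears to be the string form of a Python list. The sketch below shows one way to normalize such values ahead of the RDF step. Both helpers are hypothetical and work on plain Python values; they do not change how pyeuropepmc builds URIs internally.

```python
import re

# ORCID iDs are four groups of four characters; the last may be "X"
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_to_uri(value):
    """Return a single-prefix ORCID URI, or None if no ORCID iD is found."""
    match = ORCID_PATTERN.search(value.upper())
    return f"https://orcid.org/{match.group(0)}" if match else None


def as_values(value):
    """Return a list of individual values so each can become its own literal."""
    if value is None:
        return []
    if isinstance(value, (list, tuple, set)):
        return [str(item) for item in value]
    return [str(value)]


print(orcid_to_uri("https://orcid.org/https://orcid.org/0000-0001-8551-6468"))
print(as_values(["1179-1322"]))
```

With these in place, an author identifier would be reduced to one canonical URI, and a list-valued field such as the ISSN would produce one literal per entry instead of a bracketed string.
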
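## Save RDF Graphs to Files

Write the basic, enriched and combined graphs to disk as Turtle and N-Triples files.

In [8]: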
# Save RDF graphs to TTL files

# Create output directory if it does not exist
output_dir = "rdf_output"
os.makedirs(output_dir, exist_ok=True)

# Save basic RDF
basic_rdf_file = os.path.join(output_dir, "basic_rdf.ttl")
with open(basic_rdf_file, 'w', encoding='utf-8') as f:
    f.write(basic_rdf_ttl)
print(f"Basic RDF saved to: {basic_rdf_file}")

# Save enriched RDF
enriched_rdf_file = os.path.join(output_dir, "enriched_rdf.ttl")
with open(enriched_rdf_file, 'w', encoding='utf-8') as f:
    f.write(enriched_rdf_ttl)
print(f"Enriched RDF saved to: {enriched_rdf_file}")

# Create combined RDF graph (basic + enriched)
combined_rdf_graph = basic_rdf_graph + enriched_rdf_graph
combined_rdf_ttl = mapper.serialize_graph(combined_rdf_graph, format='turtle')

# Save combined RDF (final result)
final_rdf_file = os.path.join(output_dir, "final_combined_rdf.ttl")
with open(final_rdf_file, 'w', encoding='utf-8') as f:
    f.write(combined_rdf_ttl)
print(f"Final combined RDF saved to: {final_rdf_file}")
print(f"Combined RDF contains {len(combined_rdf_graph)} triples")

# Also save as N-Triples for easier processing
basic_nt_file = os.path.join(output_dir, "basic_rdf.nt")
enriched_nt_file = os.path.join(output_dir, "enriched_rdf.nt")
combined_nt_file = os.path.join(output_dir, "final_combined_rdf.nt")

mapper.serialize_graph(basic_rdf_graph, format='nt', destination=basic_nt_file)
mapper.serialize_graph(enriched_rdf_graph, format='nt', destination=enriched_nt_file)
mapper.serialize_graph(combined_rdf_graph, format='nt', destination=combined_nt_file)

print(f"Basic RDF also saved as N-Triples: {basic_nt_file}")
print(f"Enriched RDF also saved as N-Triples: {enriched_nt_file}")
print(f"Combined RDF also saved as N-Triples: {combined_nt_file}")

print(f"\nAll RDF files saved successfully in {output_dir}/ directory")
print("Final combined RDF includes both basic XML-derived data and enriched metadata from external APIs")

Basic RDF saved to: rdf_output/basic_rdf.ttl
Enriched RDF saved to: rdf_output/enriched_rdf.ttl
Final combined RDF saved to: rdf_output/final_combined_rdf.ttl
Combined RDF contains 532 triples
Basic RDF also saved as N-Triples: rdf_output/basic_rdf.nt
Enriched RDF also saved as N-Triples: rdf_output/enriched_rdf.nt
Combined RDF also saved as N-Triples: rdf_output/final_combined_rdf.nt

All RDF files saved successfully in rdf_output/ directory
Final combined RDF includes both basic XML-derived data and enriched metadata from external APIs

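The reported counts are consistent with set semantics over triples. The basic graph has 125 triples, of which 106 are unique to it, so 19 are shared with the enriched graph, and 125 + 426 - 19 = 532 matches the combined total. The sketch below reproduces that bookkeeping with plain Python sets standing in for the graphs; the tuples are placeholders, not real triples.

```python
# Placeholder tuples stand in for (subject, predicate, object) triples
shared = {("shared", "p", i) for i in range(19)}
basic = shared | {("basic", "p", i) for i in range(106)}
enriched = shared | {("enriched", "p", i) for i in range(407)}

# Same operations as the graph comparison: difference and union
enriched_only = enriched - basic
basic_only = basic - enriched
combined = basic | enriched

print(len(basic), len(enriched))            # 125 426
print(len(enriched_only), len(basic_only))  # 407 106
print(len(combined))                        # 532
```

Only 19 shared triples across five papers suggests the two graphs rarely describe the same subject with the same predicate and object, which is worth keeping in mind when reading the combined file.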

