# End-to-End Demo: Long COVID Research Analysis

This notebook demonstrates a complete workflow for analyzing scientific literature on Long COVID using PyEuropePMC:

1. Search for Long COVID papers
2. Filter for papers with full-text XML available
3. Select the 5 most cited papers
4. Download XML content
5. Extract structured metadata
6. Convert to RDF format

This showcases the full pipeline from search to structured knowledge representation.

## 1. Import Required Libraries

Import PyEuropePMC components and additional libraries for data processing and RDF handling.

In [1]:
# Import PyEuropePMC components
from pyeuropepmc.clients import SearchClient, FullTextClient
from pyeuropepmc.processing.fulltext_parser import FullTextXMLParser
from pyeuropepmc.builders import build_paper_entities
from pyeuropepmc.mappers import RDFMapper

# Import cache configuration
from pyeuropepmc.cache.cache import CacheConfig

# Additional libraries for data processing
import pandas as pd
from typing import List, Optional
import time

# For RDF handling
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS, XSD

print("Libraries imported successfully!")

Libraries imported successfully!


In [None]:
# Configure caching to reduce API calls and improve performance
cache_config = CacheConfig(
    enabled=True
)


Cache configured with:
  - Size limit: 500 MB
  - Search TTL: 1800 seconds
  - Full-text TTL: 7200 seconds
This will significantly reduce API calls and improve performance!
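
The TTL values reported above mean that a cached response is reused until it is older than the stated number of seconds. As a rough illustration of that idea only, here is a minimal in-memory sketch. The `TTLCache` class and its injectable clock are hypothetical and say nothing about how PyEuropePMC's cache is actually implemented.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Remember when the value was stored so its age can be checked later
        self._store[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # stale entry: drop it and report a miss
            return None
        return value
```

With `ttl=1800`, a search result stored once would be served from memory for the next 30 minutes and fetched again after that.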


## 2. Search for Long COVID Papers

Use the SearchClient with QueryBuilder to construct a structured query for papers related to Long COVID. We'll search for papers containing "long covid", "post-acute sequelae of SARS-CoV-2", or "PASC" using the fluent QueryBuilder API.
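
The query mixes OR alternatives with AND filters. How the service groups mixed operators when no parentheses are present is not stated in this notebook, so explicit grouping removes the ambiguity. The sketch below builds the query string by hand, using only the field syntax that appears in the built query printed by the next cell (`CITED:[5 TO *]`, `OPEN_ACCESS:y`). The helper `build_grouped_query` is hypothetical and not part of PyEuropePMC.

```python
from typing import Iterable


def build_grouped_query(terms: Iterable[str], filters: Iterable[str]) -> str:
    """Join search terms with OR inside parentheses, then AND each filter."""
    # Quote multi-word terms so they are treated as phrases
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    grouped_terms = "(" + " OR ".join(quoted) + ")"
    return " AND ".join([grouped_terms, *filters])


query_string = build_grouped_query(
    ["long covid", "post-acute sequelae of SARS-CoV-2", "PASC"],
    ["CITED:[5 TO *]", "OPEN_ACCESS:y"],
)
print(query_string)
```

The resulting string applies the citation and open-access filters to all three search terms rather than leaving the grouping to operator precedence.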

In [3]:
# Import QueryBuilder
from pyeuropepmc import QueryBuilder

# Initialize the search client with caching
search_client = SearchClient(cache_config=cache_config)

# Build a structured query for Long COVID papers using QueryBuilder
qb = QueryBuilder(validate=False)
query = (
    qb.keyword("long covid")
    .or_()
    .keyword("post-acute sequelae of SARS-CoV-2")
    .or_()
    .keyword("PASC")
    .and_()
    .citation_count(min_count=5)
    .and_()
    .field("open_access", True)
    .build()
)

print(f"Built query using QueryBuilder: {query}")

# Perform the search with a reasonable limit for demo purposes
results = search_client.search(query, synonym=True, limit=100)

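# `results` is the raw response dictionary, so len(results) would only count its
# top-level keys; report the total hit count and the size of the returned page instead
print(f"Found {results.get('hitCount', 0)} papers related to Long COVID "
      f"({len(results['resultList']['result'])} returned on this page)")
print("\nSample of results:")
print(results)

Built query using QueryBuilder: "long covid" OR "post-acute sequelae of SARS-CoV-2" OR PASC AND (CITED:[5 TO *]) AND OPEN_ACCESS:y
Found 8436 papers related to Long COVID (25 returned on this page)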

Sample of results:
{'version': '6.9', 'hitCount': 8436, 'nextCursorMark': 'AoIIQNNpfCg1Mjk0MTAzNA==', 'nextPageUrl': 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query="long covid" OR "post-acute sequelae of SARS-CoV-2" OR PASC AND (CITED:[5 TO *]) AND OPEN_ACCESS:y&cursorMark=AoIIQNNpfCg1Mjk0MTAzNA==&resultType=lite&synonym=true&format=json', 'request': {'queryString': '"long covid" OR "post-acute sequelae of SARS-CoV-2" OR PASC AND (CITED:[5 TO *]) AND OPEN_ACCESS:y', 'resultType': 'lite', 'cursorMark': '%2A', 'pageSize': 25, 'sort': '', 'synonym': True}, 'resultList': {'result': [{'id': '39575583', 'source': 'MED', 'pmid': '39575583', 'pmcid': 'PMC11740262', 'fullTextIdList': {'fullTextId': ['PMC11740262']}, 'doi': '10.1002/ana.27128', 'title': 'Neurologic Manifestations of Long COVID Disproportionately A
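
The response above reports 8436 hits but a `pageSize` of 25, and it carries a `nextCursorMark` for requesting the following page. A minimal sketch of cursor-based paging is shown below. The `fetch_page` callable is a hypothetical stand-in for whatever function issues the request, and `fake_fetch` only simulates three small pages so the loop can run offline.

```python
from typing import Callable, Dict, Iterator, List


def iter_results(fetch_page: Callable[[str], Dict], max_results: int) -> Iterator[Dict]:
    """Yield results page by page, following nextCursorMark until exhausted."""
    cursor = "*"  # the request above shows '%2A', the URL-encoded form of '*'
    yielded = 0
    while yielded < max_results:
        page = fetch_page(cursor)
        records: List[Dict] = page.get("resultList", {}).get("result", [])
        if not records:
            return
        for record in records:
            if yielded >= max_results:
                return
            yield record
            yielded += 1
        next_cursor = page.get("nextCursorMark")
        if not next_cursor or next_cursor == cursor:
            return  # an unchanged or missing cursor means there are no further pages
        cursor = next_cursor


# Simulated three-page response, for illustration only
_PAGES = {
    "*": {"nextCursorMark": "c1", "resultList": {"result": [{"id": "1"}, {"id": "2"}]}},
    "c1": {"nextCursorMark": "c2", "resultList": {"result": [{"id": "3"}, {"id": "4"}]}},
    "c2": {"nextCursorMark": "c2", "resultList": {"result": [{"id": "5"}]}},
}


def fake_fetch(cursor: str) -> Dict:
    return _PAGES[cursor]


all_ids = [r["id"] for r in iter_results(fake_fetch, max_results=100)]
print(all_ids)
```

Collecting more than one page matters for the later ranking step, because the most cited papers overall are not guaranteed to be on the first page.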

## 3. Filter Papers with XML Available

Filter the search results to papers whose full-text XML can be downloaded and parsed. A paper is kept if its `inEPMC` field is 'Y' or its `fullTextIdList` contains at least one PMC ID.

In [4]:
# Filter papers that have full-text XML available
xml_available_papers = []

# Access the actual results list from the response dictionary
results_list = results["resultList"]["result"]

for paper in results_list:
    # Check if the paper has full text XML available
    # Use inEPMC field and fullTextIdList to determine availability
    has_xml = False

    # Check inEPMC field
    if paper.get('inEPMC') == 'Y':
        has_xml = True

    # Also check for fullTextIdList which contains PMC IDs
    full_text_id_list = paper.get('fullTextIdList')
    if full_text_id_list and full_text_id_list.get('fullTextId'):
        has_xml = True

    if has_xml:
        xml_available_papers.append(paper)

print(f"Found {len(xml_available_papers)} papers with XML available out of {len(results_list)} total results")

# Display some information about the filtered papers
if xml_available_papers:
    print("\nPapers with XML available:")
    for i, paper in enumerate(xml_available_papers[:5]):
        pmc_ids = []
        full_text_id_list = paper.get('fullTextIdList')
        if full_text_id_list and full_text_id_list.get('fullTextId'):
            pmc_ids = full_text_id_list['fullTextId']
        print(f"{i+1}. {paper.get('title')}(Citations: {paper.get('citedByCount', 'N/A')}, PMC: {pmc_ids})")
else:
    print("No papers with XML available found. Using a subset for demonstration.")
    xml_available_papers = results_list[:10]  # Fallback for demo

Found 25 papers with XML available out of 25 total results

Papers with XML available:
1. Neurologic Manifestations of Long COVID Disproportionately Affect Young and Middle-Age Adults.(Citations: 12, PMC: ['PMC11740262'])
2. Real-world effectiveness and causal mediation study of BNT162b2 on long COVID risks in children and adolescents.(Citations: 13, PMC: ['PMC11667630'])
3. Long COVID or Post-Acute Sequelae of SARS-CoV-2 Infection (PASC) and the Urgent Need to Identify Diagnostic Biomarkers and Risk Factors.(Citations: 11, PMC: ['PMC11418572'])
4. Long COVID after SARS-CoV-2 during pregnancy in the United States.(Citations: 5, PMC: ['PMC11961632'])
5. Long COVID‑19 and pregnancy: A systematic review.(Citations: 5, PMC: ['PMC11609607'])
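
The availability check in the cell above can be factored into two small helpers, which makes the rule easier to test in isolation. This is a sketch that mirrors the cell's logic. The function names are hypothetical, and the sample records contain only the fields the check reads.

```python
from typing import Any, Dict, List


def fulltext_ids(paper: Dict[str, Any]) -> List[str]:
    """Return the PMC IDs listed under fullTextIdList, or an empty list."""
    id_list = paper.get("fullTextIdList") or {}
    return list(id_list.get("fullTextId") or [])


def has_fulltext_xml(paper: Dict[str, Any]) -> bool:
    """Mirror the cell's rule: inEPMC == 'Y' OR at least one full-text ID."""
    return paper.get("inEPMC") == "Y" or bool(fulltext_ids(paper))


sample = [
    {"id": "a", "inEPMC": "Y"},
    {"id": "b", "fullTextIdList": {"fullTextId": ["PMC11740262"]}},
    {"id": "c", "inEPMC": "N", "fullTextIdList": None},
]
available = [p["id"] for p in sample if has_fulltext_xml(p)]
print(available)
```

Because the download step needs a PMC ID, a stricter variant could require `fulltext_ids(paper)` to be non-empty instead of accepting `inEPMC` alone.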


## 4. Select Top 5 Most Cited Papers

Sort the filtered papers by citation count and select the top 5 most cited papers for detailed analysis.

In [5]:
# Sort papers by citation count (citedByCount field)
def get_citation_count(paper):
    """Extract citation count from paper dictionary, defaulting to 0 if not available."""
    count = paper.get('citedByCount', 0)
    if count is None:
        return 0
    try:
        return int(count)
    except (ValueError, TypeError):
        return 0

# Sort by citation count in descending order
sorted_papers = sorted(xml_available_papers, key=get_citation_count, reverse=True)

# Select top 5 most cited papers
top_5_papers = sorted_papers[:5]

print(f"Selected top 5 most cited papers from {len(sorted_papers)} available papers:")
print()

for i, paper in enumerate(top_5_papers, 1):
    citations = get_citation_count(paper)
    title = paper.get('title', 'No title available')
    title_display = title[:80] + "..." if len(title) > 80 else title
    doi = paper.get('doi', 'No DOI available')
    journal = paper.get('journalTitle', 'N/A')
    print(f"{i}. Citations: {citations}")
    print(f"   Title: {title_display}")
    print(f"   DOI: {doi}")
    print(f"   Journal: {journal}")
    print()

Selected top 5 most cited papers from 25 available papers:

1. Citations: 27
   Title: Early biological markers of post-acute sequelae of SARS-CoV-2 infection.
   DOI: 10.1038/s41467-024-51893-7
   Journal: Nat Commun

2. Citations: 27
   Title: Prevalence and determinants of post-acute sequelae after SARS-CoV-2 infection (L...
   DOI: 10.1016/j.lana.2024.100688
   Journal: Lancet Reg Health Am

3. Citations: 20
   Title: Identification of risk factors of Long COVID and predictive modeling in the RECO...
   DOI: 10.1038/s43856-024-00549-0
   Journal: Commun Med (Lond)

4. Citations: 13
   Title: Real-world effectiveness and causal mediation study of BNT162b2 on long COVID ri...
   DOI: 10.1016/j.eclinm.2024.102962
   Journal: EClinicalMedicine

5. Citations: 12
   Title: Neurologic Manifestations of Long COVID Disproportionately Affect Young and Midd...
   DOI: 10.1002/ana.27128
   Journal: Ann Neurol
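
Two of the selected papers tie at 27 citations, and with a single sort key their relative order depends on the order of the search results. If a reproducible ranking is wanted, a secondary key can break ties. The sketch below uses the DOI for that purpose, which is an arbitrary choice made for illustration. The record with DOI `10.0000/example` is a made-up placeholder.

```python
from typing import Any, Dict, List


def citation_count(paper: Dict[str, Any]) -> int:
    """Return citedByCount as an int, treating missing or invalid values as 0."""
    try:
        return int(paper.get("citedByCount") or 0)
    except (TypeError, ValueError):
        return 0


def top_cited(papers: List[Dict[str, Any]], n: int = 5) -> List[Dict[str, Any]]:
    """Highest citation count first; ties broken by DOI in ascending order."""
    return sorted(papers, key=lambda p: (-citation_count(p), p.get("doi") or ""))[:n]


papers = [
    {"doi": "10.1016/j.lana.2024.100688", "citedByCount": 27},
    {"doi": "10.1038/s41467-024-51893-7", "citedByCount": "27"},
    {"doi": "10.1002/ana.27128", "citedByCount": 12},
    {"doi": "10.0000/example", "citedByCount": None},  # placeholder record
]
ranked = [p["doi"] for p in top_cited(papers, n=3)]
print(ranked)
```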



## 5. Download XML for Selected Papers

Download the full-text XML content for the top 5 selected papers using the FullTextClient.

In [6]:
# Initialize the full-text client with caching
fulltext_client = FullTextClient(cache_config=cache_config)

# Download XML for each of the top 5 papers
downloaded_xml = {}
paper_metadata = {}

print("Downloading XML for top 5 papers...")
print()

for i, paper in enumerate(top_5_papers, 1):
    try:
        doi = paper.get('doi', f'paper_{i}')
        print(f"Downloading paper {i}/{len(top_5_papers)}: {doi}")

        # Get the PMC ID for full-text access from fullTextIdList
        pmc_id = None
        if 'fullTextIdList' in paper and paper['fullTextIdList']:
            full_text_id_list = paper['fullTextIdList']
            if 'fullTextId' in full_text_id_list and full_text_id_list['fullTextId']:
                # Take the first PMC ID (usually there's only one)
                pmc_id = full_text_id_list['fullTextId'][0]

        # Fallback to pmcid field if fullTextIdList doesn't work
        if not pmc_id:
            pmc_id = paper.get('pmcid', None)

        if not pmc_id:
            print(f"  No PMC ID available for {doi}, skipping...")
            continue

        # Download the XML content using get_fulltext_content
        xml_content = fulltext_client.get_fulltext_content(pmc_id, format_type="xml")

        if xml_content:
            downloaded_xml[doi] = xml_content
            paper_metadata[doi] = paper
            print(f"  ✓ Successfully downloaded XML ({len(xml_content)} characters)")
        else:
            print(f"  ✗ Failed to download XML for {doi}")

        # Add a small delay to be respectful to the API
        time.sleep(1)

    except Exception as e:
        print(f"  ✗ Error downloading {doi}: {str(e)}")

print(f"\nSuccessfully downloaded XML for {len(downloaded_xml)} out of {len(top_5_papers)} papers")

Downloading XML for top 5 papers...

Downloading paper 1/5: 10.1038/s41467-024-51893-7
  ✓ Successfully downloaded XML (96554 characters)
Downloading paper 2/5: 10.1016/j.lana.2024.100688
  ✓ Successfully downloaded XML (113834 characters)
Downloading paper 3/5: 10.1038/s43856-024-00549-0
  ✓ Successfully downloaded XML (118648 characters)
Downloading paper 4/5: 10.1016/j.eclinm.2024.102962
  ✓ Successfully downloaded XML (129216 characters)
Downloading paper 5/5: 10.1002/ana.27128
  ✓ Successfully downloaded XML (177836 characters)

Successfully downloaded XML for 5 out of 5 papers
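
All five downloads succeeded here, but a transient network error would cause the loop above to skip that paper. One option is to wrap the call in a small retry helper with exponential backoff. The sketch below is an assumption-laden illustration: `fetch_with_retry` is a hypothetical helper, and it catches `Exception` broadly because the client's specific exception types are not shown in this notebook.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[str], Optional[str]],
    identifier: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(identifier) up to `attempts` times with exponential backoff."""
    for attempt in range(attempts):
        try:
            content = fetch(identifier)
            if content:
                return content
        except Exception as exc:  # broad on purpose; see the note above
            print(f"  attempt {attempt + 1} failed for {identifier}: {exc}")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ... between tries
    return None
```

In the download loop it could be used as `fetch_with_retry(lambda pid: fulltext_client.get_fulltext_content(pid, format_type="xml"), pmc_id)`, reusing the same call as the cell above.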


## 6. Extract Metadata from XML

Parse the downloaded XML files to extract structured metadata including title, authors, abstract, and other bibliographic information.

In [7]:
# Import build_paper_entities
from pyeuropepmc.builders import build_paper_entities

# Initialize the XML parser
parser = FullTextXMLParser()

# Parse XML and extract metadata for each downloaded paper
parsed_papers = {}

print("Parsing XML and extracting metadata...")
print()

for doi, xml_content in downloaded_xml.items():
    try:
        print(f"Parsing XML for: {doi}")

        # Parse the XML content
        parser.parse(xml_content)

        # Build entity objects from parser
        paper, authors, sections, tables, figures, references = build_paper_entities(parser)

        if paper:
            parsed_papers[doi] = (paper, authors, sections, tables, figures, references)
            print(f"  ✓ Successfully parsed - Title: {paper.title[:50] if paper.title else 'No title'}...")
            print(f"    Authors: {len(authors)}")
            print(f"    Sections: {len(sections)}")
            print(f"    Tables: {len(tables)}")
            print(f"    References: {len(references)}")
        else:
            print(f"  ✗ Failed to parse XML for {doi}")

    except Exception as e:
        print(f"  ✗ Error parsing {doi}: {str(e)}")

print(f"\nSuccessfully parsed {len(parsed_papers)} out of {len(downloaded_xml)} XML files")

# Display a summary of extracted metadata
if parsed_papers:
    print("\nExtracted Metadata Summary:")
    print("-" * 50)
    for doi, (paper, authors, sections, tables, figures, references) in parsed_papers.items():
        print(f"DOI: {doi}")
        print(f"Title: {paper.title[:60] if paper.title else 'No title'}...")
        print(f"Authors: {len(authors)} authors")
        print(f"Sections: {len(sections)} sections")
        print(f"Tables: {len(tables)} tables")
        print(f"References: {len(references)} references")
        print()
else:
    print("No papers were successfully parsed.")

Parsing XML and extracting metadata...

Parsing XML for: 10.1038/s41467-024-51893-7
  ✓ Successfully parsed - Title: Early biological markers of post-acute sequelae of...
    Authors: 33
    Sections: 21
    Tables: 2
    References: 45
Parsing XML for: 10.1016/j.lana.2024.100688
  ✓ Successfully parsed - Title: Prevalence and determinants of post-acute sequelae...
    Authors: 9
    Sections: 40
    Tables: 2
    References: 55
Parsing XML for: 10.1038/s43856-024-00549-0
  ✓ Successfully parsed - Title: Identification of risk factors of Long COVID and p...
    Authors: 20
    Sections: 33
    Tables: 2
    References: 42
Parsing XML for: 10.1016/j.eclinm.2024.102962
  ✓ Successfully parsed - Title: Real-world effectiveness and causal mediation stud...
    Authors: 27
    Sections: 28
    Tables: 5
    References: 60
Parsing XML for: 10.1002/ana.27128
  ✓ Successfully parsed - Title: Neurologic Manifestations of Long COVID Disproport...
    Authors: 13
    Sections: 21
    Tables: 4
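
To give a sense of what the parser is pulling out, the sketch below extracts a title, author names, and top-level section titles from a tiny hand-written document using only the standard library. It assumes JATS-style tag names (`article-title`, `contrib`, `sec`) and is not how `FullTextXMLParser` works internally. The sample XML and the function `extract_basic_metadata` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>A made-up example title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>Text.</p></sec>
    <sec><title>Methods</title><p>Text.</p></sec>
  </body>
</article>"""


def extract_basic_metadata(xml_text: str) -> Dict[str, object]:
    """Extract title, author names, and top-level section titles."""
    root = ET.fromstring(xml_text)
    title_el = root.find(".//article-title")
    authors: List[str] = []
    for contrib in root.iterfind(".//contrib[@contrib-type='author']"):
        surname = contrib.findtext("name/surname", default="")
        given = contrib.findtext("name/given-names", default="")
        authors.append(f"{given} {surname}".strip())
    # Only direct children of <body> count as top-level sections here
    section_titles = [sec.findtext("title", default="") for sec in root.iterfind("./body/sec")]
    return {
        "title": "".join(title_el.itertext()).strip() if title_el is not None else None,
        "authors": authors,
        "sections": section_titles,
    }


metadata = extract_basic_metadata(SAMPLE_XML)
print(metadata)
```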
  

## 7. Convert Metadata to RDF

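Convert the extracted metadata into RDF using `RDFMapper` for structured knowledge representation. This cell relies on `parsed_papers` from step 6, so run it in the same kernel session as the preceding cells; the `NameError` in the output below comes from running it in a fresh session.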

In [1]:
# Import RDFMapper for RDF conversion
from pyeuropepmc.mappers import RDFMapper
from datetime import datetime

# Initialize the RDF mapper
mapper = RDFMapper()

# Convert parsed papers to RDF with enhanced mappings
rdf_graphs = {}

print("Converting metadata to RDF format with relationships and provenance...")
print(f"Number of parsed papers: {len(parsed_papers)}")
print()

for doi, (paper, authors, sections, tables, figures, references) in parsed_papers.items():
    try:
        print(f"Converting to RDF: {doi}")

        # Create RDF graph
        g = Graph()

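        # Prepare extraction info for provenance
        extraction_info = {
            # Timezone-aware local timestamp; appending "Z" to a naive local
            # time would mislabel it as UTC
            "timestamp": datetime.now().astimezone().isoformat(),
            "method": "pyeuropepmc_xml_parser",
            "quality": {
                # Placeholder values for this demo, not computed from the parsed data
                "validation_passed": True,
                "completeness_score": 0.95
            }
        }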

        # Ensure pmcid is prefixed with PMC
        if paper.pmcid and not paper.pmcid.startswith('PMC'):
            paper.pmcid = f"PMC{paper.pmcid}"

        # Convert paper entity to RDF with relationships (excluding sections for now)
        related_entities = {
            "authors": authors,
            "tables": tables,
            "references": references
        }
        paper_uri = paper.to_rdf(g, mapper=mapper, related_entities=related_entities,
                    extraction_info=extraction_info)

        # Convert authors to RDF with paper relationships
        for author in authors:
            author_related = {"papers": [paper]}
            author.to_rdf(g, mapper=mapper, related_entities=author_related,
                         extraction_info=extraction_info)

        # Convert sections to RDF with paper relationships
        for i, section in enumerate(sections, 1):
            section_uri = URIRef(str(paper_uri) + f"/section/{i}")
            section_related = {"paper": [paper]}
            section.to_rdf(g, uri=section_uri, mapper=mapper, related_entities=section_related,
                          extraction_info=extraction_info)
            # Add section relationships manually to ensure correct URIs
            g.add((paper_uri, mapper._resolve_predicate("dct:hasPart"), section_uri))
            g.add((section_uri, mapper._resolve_predicate("dct:isPartOf"), paper_uri))

        # Convert tables to RDF with paper relationships
        for table in tables:
            table_related = {"paper": [paper]}
            table.to_rdf(g, mapper=mapper, related_entities=table_related,
                        extraction_info=extraction_info)

        # Convert references to RDF with citing paper relationships
        for reference in references:
            ref_related = {"citing_paper": [paper]}
            reference.to_rdf(g, mapper=mapper, related_entities=ref_related,
                           extraction_info=extraction_info)

        if g:
            rdf_graphs[doi] = g
            # Count triples in the graph
            triple_count = len(g)
            print(f"  [OK] Successfully converted to RDF ({triple_count} triples)")
        else:
            print(f"  [ERROR] Failed to convert {doi} to RDF")

    except Exception as e:
        print(f"  [ERROR] Error converting {doi}: {str(e)}")

print(f"Successfully converted {len(rdf_graphs)} papers to RDF")


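# Display RDF summary and save to files
import os

if rdf_graphs:
    # Create the output directory up front in case serialize_graph does not
    os.makedirs("rdf_output", exist_ok=True)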
    print("RDF Conversion Summary:")
    print("-" * 50)

    for doi, graph in rdf_graphs.items():
        triple_count = len(graph)
        print(f"DOI: {doi}")
        print(f"Triples: {triple_count}")

        # Save RDF to file
        filename = f"rdf_output/long_covid_paper_{doi.replace('/', '_').replace('.', '_')}.ttl"
        try:
            mapper.serialize_graph(graph, format='turtle', destination=filename)
            print(f"Saved to: {filename}")
        except Exception as e:
            print(f"Error saving {filename}: {str(e)}")
        print()

    print("RDF conversion complete! Files saved in Turtle format.")
    print("You can now use these RDF files for:")
    print("- Knowledge graph construction")
    print("- Semantic search and querying")
    print("- Integration with other biomedical ontologies")
    print("- Linked data applications")
else:
    print("No RDF graphs were created.")

Converting metadata to RDF format with relationships and provenance...


NameError: name 'parsed_papers' is not defined
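
The save step above derives each filename from the DOI by replacing `/` and `.` with underscores. DOIs can contain other characters that are awkward in filenames, so a slightly more defensive variant replaces every run of non-alphanumeric characters. This is a sketch only: `rdf_output_path` is a hypothetical helper, and unlike the original expression it also replaces hyphens.

```python
import re
from pathlib import Path


def rdf_output_path(doi: str, out_dir: str = "rdf_output", suffix: str = ".ttl") -> Path:
    """Build a filesystem-safe Turtle file path for a DOI (no files are created)."""
    # Collapse every run of characters outside A-Z, a-z, 0-9 into one underscore
    safe = re.sub(r"[^A-Za-z0-9]+", "_", doi).strip("_")
    return Path(out_dir) / f"long_covid_paper_{safe}{suffix}"


example_path = rdf_output_path("10.1038/s41467-024-51893-7")
print(example_path.name)
```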