# PyEuropePMC FullText Client Examples

This notebook demonstrates how to use the FullTextClient for:

1. Checking full text availability for articles
2. Downloading PDFs and XML files with validation and fallback mechanisms
3. Bulk XML download from Europe PMC FTP OA archives
4. Retrieving full text content as strings
5. Performing batch downloads
6. Handling errors gracefully
7. Integration with SearchClient

## Important Note: Europe PMC Endpoints

The Europe PMC API has different endpoint structures for different formats:

- **XML**: Uses direct REST API endpoint (`/PMC{id}/fullTextXML`) with bulk FTP fallback
- **PDF**: Uses render endpoint (`?pdf=render`) with backend and ZIP fallbacks
- **HTML**: Uses web interface when available (supports both PMC and MED formats)
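
As a rough illustration, the XML endpoint can be assembled with plain string formatting. The base URL is the one that appears in this notebook's log output; the helper name `fulltext_xml_url` is hypothetical and is not part of PyEuropePMC.

```python
# Base URL as it appears in the request logs of this notebook
REST_BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def fulltext_xml_url(pmcid: str) -> str:
    """Build the fullTextXML URL for an ID given with or without the PMC prefix.

    Hypothetical helper for illustration only.
    """
    number = pmcid.upper().replace("PMC", "").strip()
    if not number.isdigit():
        raise ValueError(f"PMC ID must be numeric, got {pmcid!r}")
    return f"{REST_BASE}/PMC{number}/fullTextXML"


print(fulltext_xml_url("PMC3257301"))
```

Accepting both `3257301` and `PMC3257301` keeps the prefix from being doubled when the URL is built.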

## PDF Download & Validation

The client validates every PDF before saving it:
- Checks PDF header (`%PDF`) to ensure valid format
- Verifies minimum file size to avoid empty downloads
- Uses temporary files during download to prevent corrupted saves
- Only saves files that pass validation checks
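
The checks listed above can be sketched in a few lines. This is not the client's actual implementation: the function names and the 1 KB threshold constant are assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

MIN_PDF_BYTES = 1024  # assumed threshold, mirroring the ">1KB" rule in this notebook


def is_valid_pdf(data: bytes, min_size: int = MIN_PDF_BYTES) -> bool:
    """Return True if the payload starts with the PDF magic bytes and is large enough."""
    return data.startswith(b"%PDF") and len(data) > min_size


def save_pdf_if_valid(data: bytes, dest: Path) -> Optional[Path]:
    """Write data to dest via a temporary file; return None if validation fails."""
    if not is_valid_pdf(data):
        return None
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write next to the destination so the final rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=str(dest.parent), suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, dest)  # move the finished file into place
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)  # never leave a partial file behind
        raise
    return dest
```

Writing to a temporary file and renaming it afterwards means an interrupted download cannot leave a truncated file at the destination path.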

## XML Download with Bulk FTP Support

XML downloads support multiple sources:
- **Primary**: REST API endpoint for individual articles
- **Fallback**: Europe PMC FTP OA bulk archives (`.xml.gz` files)
- **Smart fallback**: Automatically determines correct archive based on PMC ID range
- **Bulk-only option**: Direct bulk download method available

The bulk archives are organized by PMC ID ranges (e.g., `PMC1000000_PMC1099999.xml.gz`) and are automatically unpacked to extract the specific article XML.
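
To make the naming scheme concrete, here is a minimal sketch that derives an archive name from a PMC ID. It assumes fixed, aligned ranges of 100,000 IDs, which the single example above suggests but does not confirm; the real archive boundaries may differ, and the client's own lookup remains authoritative. The function name `bulk_archive_name` is hypothetical.

```python
def bulk_archive_name(pmcid_num: int, width: int = 100_000) -> str:
    """Guess the archive file name, ASSUMING fixed-width, aligned ID ranges."""
    if pmcid_num <= 0:
        raise ValueError("PMC ID number must be positive")
    start = (pmcid_num // width) * width  # round down to the start of the range
    end = start + width - 1
    return f"PMC{start}_PMC{end}.xml.gz"


print(bulk_archive_name(1_000_000))  # PMC1000000_PMC1099999.xml.gz
print(bulk_archive_name(1_050_123))  # PMC1000000_PMC1099999.xml.gz
```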

## Setup

First, let's import the necessary libraries and set up logging.

In [1]:
import logging
from pathlib import Path

from pyeuropepmc import FullTextClient, FullTextError

# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

## Example 1: Check Full Text Availability

Let's check what full text formats are available for different articles.

**Note**: 
- XML availability is checked via REST API
- PDF availability is checked via render endpoint
- HTML availability is checked via web interface (supports both PMC and MED formats)

In [2]:
print("=== Checking Full Text Availability ===")

pmcids = ["3312970", "4123456", "invalid"]  # Mix of valid and invalid IDs

with FullTextClient() as client:
    for pmcid in pmcids:
        try:
            print(f"\nChecking availability for PMC{pmcid}:")
            availability = client.check_fulltext_availability(pmcid)

            for format_type, available in availability.items():
                status = "✓ Available" if available else "✗ Not available"
                if format_type == "html":
                    status += " (requires MED ID to check)"
                print(f"  {format_type.upper()}: {status}")

        except FullTextError as e:
            print(f"  Error: {e}")

2025-07-15 14:40:14,524 - pyeuropepmc.base - INFO - Checking fulltext availability for PMC ID: 3312970


=== Checking Full Text Availability ===

Checking availability for PMC3312970:


2025-07-15 14:40:14,857 - pyeuropepmc.base - INFO - Availability for PMC3312970: {'pdf': False, 'xml': False, 'html': True}
2025-07-15 14:40:14,858 - pyeuropepmc.base - INFO - Checking fulltext availability for PMC ID: 4123456


  PDF: ✗ Not available
  XML: ✗ Not available
  HTML: ✓ Available (requires MED ID to check)

Checking availability for PMC4123456:


2025-07-15 14:40:15,072 - pyeuropepmc.base - INFO - Availability for PMC4123456: {'pdf': False, 'xml': False, 'html': True}
2025-07-15 14:40:15,073 - pyeuropepmc.base - INFO - Checking fulltext availability for PMC ID: invalid
2025-07-15 14:40:15,073 - pyeuropepmc.base - ERROR - Invalid PMC ID format: invalid


  PDF: ✗ Not available
  XML: ✗ Not available
  HTML: ✓ Available (requires MED ID to check)

Checking availability for PMCinvalid:
  Error: [FULL002] Invalid PMC ID format. Must be numeric.


## Example 2: Download PDF Files

Let's download PDF files for available articles.

**Note**: PDF download tries multiple endpoints in order:
1. Render endpoint (`?pdf=render`)
2. Backend render service (`ptpmcrender.fcgi`)
3. ZIP archive from OA collection

**Validation**: Each downloaded PDF is validated to ensure:
- Valid PDF header (`%PDF`)
- Minimum file size (>1KB)
- Only valid PDFs are saved to disk
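
The try-in-order behaviour described above can be sketched generically. The three fetcher functions are hypothetical stand-ins that simulate each source; they are not PyEuropePMC functions, and the validity check is passed in as a parameter rather than taken from the library.

```python
from typing import Callable, Iterable, Optional, Tuple

Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(
    pmcid: str,
    sources: Iterable[Tuple[str, Fetcher]],
    is_valid: Callable[[bytes], bool],
) -> Optional[Tuple[str, bytes]]:
    """Try each source in order; return (source name, data) for the first valid payload."""
    for name, fetch in sources:
        try:
            data = fetch(pmcid)
        except OSError:
            continue  # network-style failure: move on to the next source
        if data is not None and is_valid(data):
            return name, data
    return None


# Hypothetical stand-ins that simulate the three sources
def render_endpoint(pmcid: str) -> Optional[bytes]:
    return b"<html>not found</html>"  # responds, but not with a PDF


def backend_service(pmcid: str) -> Optional[bytes]:
    raise ConnectionError("backend unavailable")


def zip_archive(pmcid: str) -> Optional[bytes]:
    return b"%PDF-1.5\n" + b"0" * 4096


result = fetch_with_fallback(
    "3312970",
    [("render", render_endpoint), ("backend", backend_service), ("zip", zip_archive)],
    is_valid=lambda data: data.startswith(b"%PDF") and len(data) > 1024,
)
print(result[0] if result else "no valid PDF")  # zip
```

Validating inside the loop matters: a source that answers with an HTML error page counts as a miss, so the next source is still tried.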

In [3]:
print("=== Downloading PDF Files ===")

pmcids = ["3312970"]  # Open access article available as PDF but not as XML full text
download_dir = Path("downloads")
download_dir.mkdir(exist_ok=True)

with FullTextClient() as client:
    for pmcid in pmcids:
        try:
            print(f"\nDownloading PDF for PMC{pmcid}...")
            print("  Trying render endpoint → backend service → ZIP archive...")
            output_path = download_dir / f"PMC{pmcid}.pdf"

            result = client.download_pdf_by_pmcid(pmcid, output_path)

            if result:
                file_size = result.stat().st_size / 1024  # Size in KB
                print(f"  ✓ Downloaded to {result} ({file_size:.1f} KB)")
            else:
                print("  ✗ Download failed (not available via any endpoint)")

        except FullTextError as e:
            print(f"  ✗ Error: {e}")

2025-07-15 14:40:15,082 - pyeuropepmc.base - INFO - Starting PDF download for PMC3312970


=== Downloading PDF Files ===

Downloading PDF for PMC3312970...
  Trying render endpoint → backend service → ZIP archive...


2025-07-15 14:40:15,512 - pyeuropepmc.base - INFO - Downloaded valid PDF via render endpoint: downloads/PMC3312970.pdf


  ✓ Downloaded to downloads/PMC3312970.pdf (2376.5 KB)


## Example 3: Download XML Files

XML files often contain more structured information than PDFs.
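
As a small illustration of that structure, the sketch below pulls the title and section headings out of a JATS-style document with the standard library. The sample XML is made up for this example and is far simpler than a real Europe PMC record, which may need more care (entities, nested sections, namespaced attributes).

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified JATS-style sample (not a real article)
SAMPLE_JATS = """<article>
  <front><article-meta><title-group>
    <article-title>A made-up sample article</article-title>
  </title-group></article-meta></front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
    <sec><title>Methods</title><p>Second paragraph.</p></sec>
  </body>
</article>"""

root = ET.fromstring(SAMPLE_JATS)
title = root.findtext(".//article-title")
sections = [sec.findtext("title") for sec in root.iter("sec")]

print(title)     # A made-up sample article
print(sections)  # ['Introduction', 'Methods']
```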

In [4]:
print("=== Downloading XML Files ===")

pmcids = [
    "PMC3257301",  # Open access article with XML full text
    "PMC4123456",  # Open access article without XML full text
    "invalid",  # Invalid ID to test error handling
]
download_dir = Path("downloads")
download_dir.mkdir(exist_ok=True)

with FullTextClient() as client:
    for pmcid in pmcids:
        try:
            # The IDs above already carry the "PMC" prefix, so do not add it again
            print(f"\nDownloading XML for {pmcid}...")
            print("  Trying REST API → bulk FTP archive fallback...")
            output_path = download_dir / f"{pmcid}.xml"

            result = client.download_xml_by_pmcid(pmcid, output_path)

            if result:
                file_size = result.stat().st_size / 1024  # Size in KB
                print(f"  ✓ Downloaded to {result} ({file_size:.1f} KB)")
            else:
                print("  ✗ Download failed")

        except FullTextError as e:
            print(f"  ✗ Error: {e}")
        except Exception as e:
            print(f"  ✗ Unexpected error: {e}")

2025-07-15 14:40:15,522 - pyeuropepmc.base - INFO - Starting XML download for PMC ID: PMC3257301
2025-07-15 14:40:15,524 - pyeuropepmc.base - INFO - Downloading XML for PMC3257301


=== Downloading XML Files ===

Downloading XML for PMC3257301...
  Trying REST API → bulk FTP archive fallback...


2025-07-15 14:40:15,811 - pyeuropepmc.base - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3257301/fullTextXML succeeded with status 200
2025-07-15 14:40:16,861 - pyeuropepmc.base - INFO - Successfully downloaded XML to downloads/PMC3257301.xml
2025-07-15 14:40:16,862 - pyeuropepmc.base - INFO - Starting XML download for PMC ID: PMC4123456
2025-07-15 14:40:16,862 - pyeuropepmc.base - INFO - Downloading XML for PMC4123456
2025-07-15 14:40:16,924 - pyeuropepmc.base - ERROR - [BaseAPIClient] GET request failed


  ✓ Downloaded to downloads/PMC3257301.xml (120.7 KB)

Downloading XML for PMC4123456...
  Trying REST API → bulk FTP archive fallback...


2025-07-15 14:40:17,926 - pyeuropepmc.base - INFO - Trying bulk FTP download for PMC4123456
2025-07-15 14:40:18,052 - pyeuropepmc.base - INFO - Starting XML download for PMC ID: invalid
2025-07-15 14:40:18,053 - pyeuropepmc.base - ERROR - Invalid PMC ID format: invalid


  ✗ Error: [FULL003] Content not found for PMC ID 4123456.

Downloading XML for invalid...
  Trying REST API → bulk FTP archive fallback...
  ✗ Error: [FULL002] Invalid PMC ID format. Must be numeric.


## Example 3b: Bulk XML Download Only

Sometimes you might want to download XML files directly from the Europe PMC FTP OA bulk archives without trying the REST API first. This is useful when you know the REST API is temporarily unavailable or when you specifically want to use the bulk download method.

In [5]:
print("=== Bulk-Only XML Download ===")

pmcids = ["PMC3257301", "PMC1000000"]  # Test different PMC ID ranges
download_dir = Path("downloads")
download_dir.mkdir(exist_ok=True)

with FullTextClient() as client:
    for pmcid in pmcids:
        try:
            # The IDs above already carry the "PMC" prefix, so do not add it again
            print(f"\nDownloading XML for {pmcid} (bulk FTP only)...")
            output_path = download_dir / f"bulk_{pmcid}.xml"

            result = client.download_xml_by_pmcid_bulk(pmcid, output_path)

            if result:
                file_size = result.stat().st_size / 1024  # Size in KB
                print(f"  ✓ Downloaded from bulk archive to {result} ({file_size:.1f} KB)")

                # Show archive range info (relies on a private helper, which may change)
                pmcid_num = int(pmcid.replace("PMC", ""))
                archive_range = client._determine_bulk_archive_range(pmcid_num)
                if archive_range:
                    start_id, end_id = archive_range
                    print(f"  📦 From archive: PMC{start_id}_PMC{end_id}.xml.gz")
            else:
                print("  ✗ Download failed")

        except FullTextError as e:
            print(f"  ✗ Error: {e}")
        except Exception as e:
            print(f"  ✗ Unexpected error: {e}")

2025-07-15 14:40:18,065 - pyeuropepmc.base - INFO - Starting bulk XML download for PMC ID: PMC3257301
2025-07-15 14:40:18,188 - pyeuropepmc.base - INFO - Starting bulk XML download for PMC ID: PMC1000000


=== Bulk-Only XML Download ===

Downloading XML for PMC3257301 (bulk FTP only)...
  ✗ Error: [FULL003] Content not found for PMC ID 3257301.

Downloading XML for PMC1000000 (bulk FTP only)...
  ✗ Error: [FULL003] Content not found for PMC ID 1000000.


## Example 4: Get Content as Strings

Sometimes we want to work with the content directly in memory rather than downloading files.

In [6]:
print("=== Getting Content as Strings ===")

pmcid = "PMC3257301"

with FullTextClient() as client:
    # Get XML content
    try:
        # pmcid already carries the "PMC" prefix, so do not add it again
        print(f"\nRetrieving XML content for {pmcid}...")
        xml_content = client.get_fulltext_content(pmcid, "xml")

        # Show first 200 characters
        preview = xml_content[:200] + "..." if len(xml_content) > 200 else xml_content
        print(f"  ✓ XML content ({len(xml_content)} characters):")
        print(f"  {preview}")

    except Exception as e:
        print(f"  ✗ Error getting XML: {e}")

2025-07-15 14:40:18,318 - pyeuropepmc.base - INFO - Retrieving fulltext content for PMC ID: PMC3257301, format: xml
2025-07-15 14:40:18,319 - pyeuropepmc.base - INFO - Retrieving XML content for PMC3257301


=== Getting Content as Strings ===

Retrieving XML content for PMC3257301...


2025-07-15 14:40:18,542 - pyeuropepmc.base - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3257301/fullTextXML succeeded with status 200


  ✓ XML content (123550 characters):
  <!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.1 20151215//EN" "JATS-archivearticle1.dtd"> 
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml=...


## Example 5: Batch Download

For downloading multiple articles efficiently, we can use the batch download feature.

In [7]:
print("=== Batch Download ===")

pmcids = ["3312970", "4123456", "5678901"]  # Mix of articles
download_dir = Path("downloads/batch")
download_dir.mkdir(parents=True, exist_ok=True)

with FullTextClient() as client:
    print(f"\nBatch downloading PDFs for {len(pmcids)} articles...")

    results = client.download_fulltext_batch(
        pmcids, format_type="pdf", output_dir=download_dir, skip_errors=True
    )

    print("\nBatch download results:")
    successful = 0
    for pmcid, result_path in results.items():
        if result_path:
            file_size = result_path.stat().st_size / 1024
            print(f"  PMC{pmcid}: ✓ Downloaded ({file_size:.1f} KB)")
            successful += 1
        else:
            print(f"  PMC{pmcid}: ✗ Failed")

    print(f"\nSummary: {successful}/{len(pmcids)} downloads successful")

2025-07-15 14:40:19,603 - pyeuropepmc.base - INFO - Starting batch download for PMC IDs: ['3312970', '4123456', '5678901'], format: pdf
2025-07-15 14:40:19,604 - pyeuropepmc.base - INFO - Starting PDF download for PMC3312970


=== Batch Download ===

Batch downloading PDFs for 3 articles...


2025-07-15 14:40:19,926 - pyeuropepmc.base - INFO - Downloaded valid PDF via render endpoint: downloads/batch/PMC3312970.pdf
2025-07-15 14:40:19,927 - pyeuropepmc.base - INFO - Successfully downloaded PDF for PMC3312970
2025-07-15 14:40:19,927 - pyeuropepmc.base - INFO - Starting PDF download for PMC4123456
2025-07-15 14:40:20,200 - pyeuropepmc.base - INFO - Downloaded valid PDF via render endpoint: downloads/batch/PMC4123456.pdf
2025-07-15 14:40:20,200 - pyeuropepmc.base - INFO - Successfully downloaded PDF for PMC4123456
2025-07-15 14:40:20,201 - pyeuropepmc.base - INFO - Starting PDF download for PMC5678901


Batch download results:
  PMC3312970: ✓ Downloaded (2376.5 KB)
  PMC4123456: ✓ Downloaded (118.7 KB)
  PMC5678901: ✓ Downloaded (680.8 KB)

Summary: 3/3 downloads successful
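
When some downloads fail, it helps to separate the IDs worth retrying. The sketch below assumes, as the cell above does, that the batch result maps each PMC ID to a path or to `None`; the helper name `split_results` and the sample dictionary are made up for illustration.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def split_results(results: Dict[str, Optional[Path]]) -> Tuple[List[str], List[str]]:
    """Split a batch result mapping into (downloaded IDs, failed IDs)."""
    ok = [pmcid for pmcid, path in results.items() if path is not None]
    failed = [pmcid for pmcid, path in results.items() if path is None]
    return ok, failed


# Made-up sample shaped like the batch result used above
sample = {"3312970": Path("downloads/batch/PMC3312970.pdf"), "99999999": None}
ok, failed = split_results(sample)
print(f"ok={ok} failed={failed}")  # ok=['3312970'] failed=['99999999']
```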


## Example 6: Error Handling

Let's demonstrate how different error conditions are handled.

In [8]:
print("=== Error Handling Examples ===")

with FullTextClient() as client:
    # Invalid PMC ID format
    try:
        print("\nTrying invalid PMC ID format...")
        client.download_pdf_by_pmcid("invalid_id")
    except FullTextError as e:
        print(f"  ✓ Caught expected error: {e}")

    # Non-existent PMC ID
    try:
        print("\nTrying non-existent PMC ID...")
        result = client.download_pdf_by_pmcid("99999999")
        if result is None:
            print("  ✓ Correctly returned None for non-existent ID")
    except FullTextError as e:
        print(f"  ✓ Caught expected error: {e}")

    # Invalid format for text content
    try:
        print("\nTrying invalid format for text content...")
        client.get_fulltext_content("3312970", "pdf")
    except FullTextError as e:
        print(f"  ✓ Caught expected error: {e}")

2025-07-15 14:40:20,567 - pyeuropepmc.base - ERROR - Invalid PMC ID format: invalid_id
2025-07-15 14:40:20,569 - pyeuropepmc.base - INFO - Starting PDF download for PMC99999999


=== Error Handling Examples ===

Trying invalid PMC ID format...
  ✓ Caught expected error: [FULL002] Invalid PMC ID format. Must be numeric.

Trying non-existent PMC ID...


2025-07-15 14:40:20,944 - pyeuropepmc.base - ERROR - PDF not available or invalid for PMC99999999
2025-07-15 14:40:20,945 - pyeuropepmc.base - INFO - Retrieving fulltext content for PMC ID: 3312970, format: pdf


  ✓ Correctly returned None for non-existent ID

Trying invalid format for text content...
  ✓ Caught expected error: [FULL004] Invalid format type. Use 'pdf', 'xml', or 'html'.
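
The messages printed above all start with a bracketed code such as `[FULL002]`, which makes it possible to branch on the kind of failure. The sketch below parses that code out of the message text; relying on the message format is an assumption, and the exception class may expose the code more directly (not verified here). The helper name `error_code` is hypothetical.

```python
import re
from typing import Optional

# Matches a leading code such as "[FULL002]"
ERROR_CODE = re.compile(r"^\[([A-Z]+\d+)\]")


def error_code(message: str) -> Optional[str]:
    """Extract the bracketed error code from a message, or None if absent."""
    match = ERROR_CODE.match(message)
    return match.group(1) if match else None


print(error_code("[FULL002] Invalid PMC ID format. Must be numeric."))  # FULL002
print(error_code("connection reset"))  # None
```

In an `except FullTextError as e:` block this would be called as `error_code(str(e))`.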


## Example 7: Integration with Search Client

Let's combine search functionality with full text retrieval.

In [9]:
print("=== Integration with Search Client ===")

try:
    from pyeuropepmc import SearchClient

    # Search for articles first
    with SearchClient() as search_client:
        print("\nSearching for articles about 'machine learning'...")
        results = search_client.search("machine learning", pageSize=5)
        print(f"This is the search result: {results}")


        if results and "resultList" in results and "result" in results["resultList"]:
            articles = results["resultList"]["result"]
            print(f"  Found {len(articles)} articles in search results")

            # Extract PMC IDs from search results
            pmcids = []
            for article in articles:
                if article.get("isOpenAccess") == "Y" and "pmcid" in article:
                    pmcid = article["pmcid"].replace("PMC", "")
                    pmcids.append(pmcid)
                    print(
                        f"  Found open access article: PMC{pmcid} - {article.get('title', 'No title')[:60]}..."
                    )

            # Now check full text availability
            if pmcids:
                print(f"\nChecking full text availability for {len(pmcids)} articles...")

                with FullTextClient() as fulltext_client:
                    for pmcid in pmcids[:3]:  # Check first 3 articles
                        try:
                            availability = fulltext_client.check_fulltext_availability(pmcid)
                            available_formats = [
                                fmt for fmt, avail in availability.items() if avail
                            ]

                            if available_formats:
                                print(
                                    f"  PMC{pmcid}: {', '.join(available_formats)} available"
                                )
                            else:
                                print(f"  PMC{pmcid}: No full text available")

                        except FullTextError as e:
                            print(f"  PMC{pmcid}: Error - {e}")
            else:
                print("  No PMC IDs found in search results")
        else:
            print("  No search results found")

except ImportError:
    print("  SearchClient not available - install pyeuropepmc to use this example")
except Exception as e:
    print(f"  Error in integration example: {e}")

2025-07-15 14:40:20,956 - pyeuropepmc.base - INFO - Performing search with params: {'query': 'machine learning', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 5, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Integration with Search Client ===

Searching for articles about 'machine learning'...


2025-07-15 14:40:21,171 - pyeuropepmc.base - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200


This is the search result: {'version': '6.9', 'hitCount': 476039, 'nextCursorMark': 'AoIIQZT7ACg1MzMyODc5MA==', 'nextPageUrl': 'https://www.ebi.ac.uk/europepmc/webservices/rest/search?query=machine learning&cursorMark=AoIIQZT7ACg1MzMyODc5MA==&resultType=lite&pageSize=5&format=json', 'request': {'queryString': 'machine learning', 'resultType': 'lite', 'cursorMark': '%2A', 'pageSize': 5, 'sort': '', 'synonym': False}, 'resultList': {'result': [{'id': 'PPR1048238', 'source': 'PPR', 'doi': '10.20944/preprints202507.0912.v1', 'title': 'Unsupervised Machine Learning in Astronomy', 'authorString': 'Kuo C, Xu D, Friesen R.', 'pubYear': '2025', 'pubType': 'preprint', 'bookOrReportDetails': {'publisher': 'Preprints.org', 'yearOfPublication': 2025}, 'isOpenAccess': 'N', 'inEPMC': 'N', 'inPMC': 'N', 'hasPDF': 'N', 'hasBook': 'N', 'hasSuppl': 'N', 'citedByCount': 0, 'hasReferences': 'N', 'hasTextMinedTerms': 'N', 'hasDbCrossReferences': 'N', 'hasLabsLinks': 'N', 'versionNumber': 1, 'hasTMAccessionN
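
The ID extraction step in the cell above can be isolated into a small function that is easy to test without network access. The response below is made-up data shaped like the search result printed above; only the keys used in this notebook (`resultList`, `result`, `isOpenAccess`, `pmcid`) are assumed, and `open_access_pmcids` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def open_access_pmcids(response: Dict[str, Any]) -> List[str]:
    """Return numeric PMC IDs of open access articles in a search response."""
    articles = response.get("resultList", {}).get("result", [])
    return [
        article["pmcid"].replace("PMC", "")
        for article in articles
        if article.get("isOpenAccess") == "Y" and "pmcid" in article
    ]


# Made-up sample data, not a real API response
sample_response = {
    "resultList": {
        "result": [
            {"id": "PPR0000001", "isOpenAccess": "N"},
            {"id": "00000002", "isOpenAccess": "Y", "pmcid": "PMC1234567"},
            {"id": "00000003", "isOpenAccess": "Y"},  # open access but no PMC ID
        ]
    }
}
print(open_access_pmcids(sample_response))  # ['1234567']
```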

## Summary

This notebook demonstrated the key features of the PyEuropePMC FullTextClient:

1. **Availability checking** - Determine what formats are available before downloading
2. **PDF downloads** - Download PDF files with automatic fallback mechanisms and validation
3. **XML downloads** - Download structured XML content with REST API and bulk FTP fallback
4. **Bulk XML downloads** - Direct download from Europe PMC FTP OA archives (.xml.gz)
5. **Content retrieval** - Get content as strings for in-memory processing
6. **Batch operations** - Efficiently download multiple articles
7. **Error handling** - Robust handling of various error conditions
8. **Search integration** - Combine with SearchClient for complete workflows

### Key Features

**PDF Downloads:**
- Uses multiple endpoints with automatic fallback (render → backend → ZIP)
- Validates PDF content (header, size) before saving
- Only saves valid PDFs to disk

**XML Downloads:**
- Primary: REST API endpoint (`/PMC{id}/fullTextXML`)
- Fallback: Bulk FTP archives (`https://europepmc.org/ftp/oa/PMC{start}_PMC{end}.xml.gz`)
- Automatic unpacking of gzipped archives
- Intelligent archive range determination based on PMC ID

**Bulk Download Support:**
- Europe PMC FTP OA archives organized by PMC ID ranges
- Automatic determination of correct archive for any PMC ID
- Efficient handling of compressed (.xml.gz) archives
- Dedicated bulk-only download method available

### Notes

- Downloaded files are saved in the `downloads/` directory (batch downloads go to `downloads/batch/`)
- The client automatically handles rate limiting to be respectful to the API
- XML downloads use REST API first, then fall back to bulk FTP archives
- PDF downloads use multiple endpoints with automatic fallback and validation
- Not all articles have full text available; availability depends on publisher policies
- Check the logs for detailed information about API calls and any issues