# PyEuropePMC Basic Usage Examples

This notebook demonstrates various ways to use PyEuropePMC to search and retrieve scientific literature from Europe PMC.

## Setup

First, let's import the necessary libraries and configure logging.

In [1]:
import logging
import pprint
from typing import Any, Dict, List, Optional

# Import the main classes
from pyeuropepmc.clients.search import EuropePMCError, SearchClient

# Configure logging to see what's happening
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
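
The INFO-level log lines are useful while learning the client but can crowd the cell output. A minimal sketch for quieting them follows, assuming the library creates its loggers under the `pyeuropepmc` namespace (this logger name is an assumption, not confirmed by the notebook):

```python
import logging

# Assumption: the library's loggers are named "pyeuropepmc" or "pyeuropepmc.<module>".
library_logger = logging.getLogger("pyeuropepmc")

# Child loggers without their own level inherit this one,
# so INFO messages from the library are filtered out.
library_logger.setLevel(logging.WARNING)
```

If the log lines still appear, the library probably logs under a different name; in that case raising the level passed to `logging.basicConfig` has the same effect for all loggers.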

## Basic Search Example

Let's start with a simple search for papers about CRISPR gene editing.

In [2]:
print("=== Basic Search Example ===")

with SearchClient(rate_limit_delay=0.5) as client:
    try:
        # Simple search
        results: Dict[str, Any] = client.search(
            "CRISPR gene editing", page_size=5, format="json"
        ) # type: ignore

        print(f"Found {results.get('hitCount', 0)} total papers")
        print(f"Retrieved {len(results.get('resultList', {}).get('result', []))} papers")

        # Display first few results
        for i, paper in enumerate(results.get("resultList", {}).get("result", [])[:3]):
            print(f"\n{i + 1}. {paper.get('title', 'No title')}")
            print(f"   Authors: {paper.get('authorString', 'N/A')}")
            print(f"   Journal: {paper.get('journalTitle', 'N/A')}")
            print(f"   Year: {paper.get('pubYear', 'N/A')}")

    except EuropePMCError as e:
        print(f"Search failed: {e}")

2025-11-21 12:09:24,986 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:24,987 - INFO - Cache miss - performing search with params: {'query': 'CRISPR gene editing', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 5, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Basic Search Example ===


2025-11-21 12:09:25,233 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:25,734 - INFO - Multi-layer cache closed successfully


Found 113469 total papers
Retrieved 5 papers

1. LeDNA: a cut-and-build toolkit to democratize CRISPR gene editing technology education.
   Authors: Kundlatsch GE, Rodrigues ASL, Zocca VFB, Amorim LAS, de Paiva GB, Neto APS, Campos JADB, Pedrolli DB.
   Journal: Nat Biotechnol
   Year: 2025

2. Lipid Nanoparticles for Delivery of CRISPR Gene Editing Components.
   Authors: Wu F, Li N, Xiao Y, Palanki R, Yamagata H, Mitchell MJ, Han X.
   Journal: Small Methods
   Year: 2025

3. Next-generation CRISPR gene editing tools in the precision treatment of Alzheimer's and Parkinson's disease.
   Authors: Meshram HK, Gupta SK, Gupta A, Nagori K, Ajazuddin.
   Journal: Ageing Res Rev
   Year: 2025


### Raw Output Inspection

Let's examine the raw structure of a search result:

In [3]:
# Show raw output of the first result
if results.get("resultList", {}).get("result", []):
    print("Raw output of first result:")
    first_result: Dict[str, Any] = results.get("resultList", {}).get("result", [])[0]
    pprint.pprint(first_result)

Raw output of first result:
{'authorString': 'Kundlatsch GE, Rodrigues ASL, Zocca VFB, Amorim LAS, de '
                 'Paiva GB, Neto APS, Campos JADB, Pedrolli DB.',
 'citedByCount': 0,
 'doi': '10.1038/s41587-025-02849-9',
 'firstIndexDate': '2025-10-16',
 'firstPublicationDate': '2025-10-01',
 'hasBook': 'N',
 'hasDbCrossReferences': 'N',
 'hasLabsLinks': 'Y',
 'hasPDF': 'N',
 'hasReferences': 'Y',
 'hasSuppl': 'N',
 'hasTMAccessionNumbers': 'N',
 'hasTextMinedTerms': 'N',
 'id': '41087699',
 'inEPMC': 'N',
 'inPMC': 'N',
 'isOpenAccess': 'N',
 'issue': '10',
 'journalIssn': '1087-0156; 1546-1696; ',
 'journalTitle': 'Nat Biotechnol',
 'journalVolume': '43',
 'pageInfo': '1730-1735',
 'pmid': '41087699',
 'pubType': 'journal article',
 'pubYear': '2025',
 'source': 'MED',
 'title': 'LeDNA: a cut-and-build toolkit to democratize CRISPR gene editing '
          'technology education.'}
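
The raw record stores flags as `'Y'`/`'N'` strings and ISSNs as one semicolon-separated string. The sketch below shows one way to convert them into booleans and a list before further processing. `normalize_record` is a hypothetical helper written for this example, not part of PyEuropePMC.

```python
from typing import Any, Dict


def normalize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with 'Y'/'N' flags as booleans and ISSNs as a list."""
    normalized: Dict[str, Any] = {}
    for key, value in record.items():
        # Flag fields such as hasPDF or isOpenAccess hold exactly 'Y' or 'N'
        normalized[key] = (value == "Y") if value in ("Y", "N") else value
    if "journalIssn" in record:
        parts = record["journalIssn"].split(";")
        # Drop the empty piece left by the trailing "; "
        normalized["journalIssn"] = [p.strip() for p in parts if p.strip()]
    return normalized


# A subset of the fields shown in the raw output above
sample = {
    "id": "41087699",
    "hasPDF": "N",
    "hasReferences": "Y",
    "journalIssn": "1087-0156; 1546-1696; ",
    "citedByCount": 0,
}
clean = normalize_record(sample)
print(clean["hasReferences"], clean["journalIssn"])
```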


## Advanced Search with Parsing

Now let's demonstrate more advanced search functionality with automatic parsing.

In [4]:
print("=== Advanced Search with Parsing ===")

with SearchClient() as client:
    try:
        # Search and parse results automatically
        papers: List[Dict[str, Any]] = client.search_and_parse(
            query="COVID-19 AND vaccine",
            format="json",
            pageSize=10,
            sort="CITED desc",  # Most cited first
        )

        print(f"Retrieved {len(papers)} papers")

        # Display top cited papers
        for i, paper in enumerate(papers[:5]):
            citations: int = paper.get("citedByCount", 0)
            print(f"\n{i + 1}. [{citations} citations] {paper.get('title', 'No title')}")
            print(f"   DOI: {paper.get('doi', 'N/A')}")

    except EuropePMCError as e:
        print(f"Advanced search failed: {e}")

2025-11-21 12:09:25,760 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:25,761 - INFO - Cache miss - performing search with params: {'query': 'COVID-19 AND vaccine', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 10, 'format': 'json', 'cursorMark': '*', 'sort': 'CITED desc'}


=== Advanced Search with Parsing ===


2025-11-21 12:09:26,091 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:27,092 - INFO - Multi-layer cache closed successfully


Retrieved 10 papers

1. [11148 citations] Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.
   DOI: 10.1056/nejmoa2034577

2. [10454 citations] Integrated analysis of multimodal single-cell data.
   DOI: 10.1016/j.cell.2021.04.048

3. [8205 citations] Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis.
   DOI: 10.1016/s0140-6736(21)02724-0

4. [8070 citations] Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.
   DOI: 10.1056/nejmoa2035389

5. [6533 citations] Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation.
   DOI: 10.1126/science.abb2507
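
Once the parsed records are in memory, they can be re-ranked or linked without another request. The sketch below sorts locally by `citedByCount` and builds `https://doi.org/` links. `top_cited` and `doi_url` are hypothetical helpers for illustration, and the sample records reuse values from the output above.

```python
from typing import Any, Dict, List, Optional


def top_cited(papers: List[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the n most-cited papers; missing counts are treated as 0."""
    return sorted(papers, key=lambda p: int(p.get("citedByCount") or 0), reverse=True)[:n]


def doi_url(paper: Dict[str, Any]) -> Optional[str]:
    """Build a https://doi.org/ link, or None when the record has no DOI."""
    doi = paper.get("doi")
    return f"https://doi.org/{doi}" if doi else None


# Two records from the output above, plus one without citation data or a DOI
papers = [
    {"title": "Efficacy and Safety of the mRNA-1273 SARS-CoV-2 Vaccine.",
     "citedByCount": 8070, "doi": "10.1056/nejmoa2035389"},
    {"title": "Record without citation data"},
    {"title": "Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.",
     "citedByCount": 11148, "doi": "10.1056/nejmoa2034577"},
]

for paper in top_cited(papers, n=2):
    print(paper["citedByCount"], doi_url(paper))
```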


## Pagination Example

For large result sets, we can automatically fetch multiple pages.

In [5]:
print("=== Pagination Example ===")

with SearchClient() as client:
    try:
        # Fetch multiple pages automatically
        all_papers: List[Dict[str, Any]] = client.fetch_all_pages(
            query="machine learning bioinformatics",
            page_size=25,
            max_results=100,  # Limit to 100 papers total
        )

        print(f"Retrieved {len(all_papers)} papers across multiple pages")

        # Analyze publication years
        years: Dict[str, int] = {}
        for paper in all_papers:
            year: Optional[str] = paper.get("pubYear")
            if year:
                years[year] = years.get(year, 0) + 1

        print("\nPublication years distribution:")
        for year in sorted(years.keys(), reverse=True)[:5]:
            print(f"  {year}: {years[year]} papers")

    except EuropePMCError as e:
        print(f"Pagination example failed: {e}")

2025-11-21 12:09:27,100 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:27,101 - INFO - Cache miss - performing search with params: {'query': 'machine learning bioinformatics', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 25, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Pagination Example ===


2025-11-21 12:09:27,454 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:28,455 - INFO - Cache miss - performing search with params: {'query': 'machine learning bioinformatics', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 25, 'format': 'json', 'cursorMark': 'AoIIQJf+rSg1NDI3MzQ2NQ==', 'sort': ''}
2025-11-21 12:09:28,696 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:29,697 - INFO - Cache miss - performing search with params: {'query': 'machine learning bioinformatics', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 25, 'format': 'json', 'cursorMark': 'AoIIQJMaRig1NDE5MjI3NA==', 'sort': ''}
2025-11-21 12:09:29,943 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:30,944 - INFO - Cache miss - performing search with params: {'query': 'machine learning b

Retrieved 100 papers across multiple pages

Publication years distribution:
  2026: 9 papers
  2025: 88 papers
  2024: 2 papers
  2021: 1 papers
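
The log lines above show each request carrying a new `cursorMark`, starting from `*`. The sketch below illustrates that cursor-following loop against a stubbed page source so it runs offline. It assumes each response exposes the next cursor in a `nextCursorMark` field; `collect_pages` and `fake_fetch` are hypothetical names, and this is not how `fetch_all_pages` is necessarily implemented.

```python
from typing import Any, Callable, Dict, List

Page = Dict[str, Any]


def collect_pages(fetch_page: Callable[[str], Page], max_results: int) -> List[Dict[str, Any]]:
    """Follow cursor marks until max_results records are collected or pages run out."""
    collected: List[Dict[str, Any]] = []
    cursor = "*"  # the initial cursor, as in the first logged request
    while len(collected) < max_results:
        page = fetch_page(cursor)
        batch = page.get("resultList", {}).get("result", [])
        if not batch:
            break  # empty page: nothing more to read
        collected.extend(batch)
        next_cursor = page.get("nextCursorMark")
        if not next_cursor or next_cursor == cursor:
            break  # an unchanged or missing cursor marks the last page
        cursor = next_cursor
    return collected[:max_results]


# Stub standing in for the network call; cursors and IDs are placeholders
FAKE_PAGES: Dict[str, Page] = {
    "*": {"nextCursorMark": "c1", "resultList": {"result": [{"id": "1"}, {"id": "2"}]}},
    "c1": {"nextCursorMark": "c2", "resultList": {"result": [{"id": "3"}, {"id": "4"}]}},
    "c2": {"nextCursorMark": "c2", "resultList": {"result": [{"id": "5"}]}},
}


def fake_fetch(cursor: str) -> Page:
    return FAKE_PAGES[cursor]


ids = [record["id"] for record in collect_pages(fake_fetch, max_results=3)]
print(ids)
```

Trimming with `collected[:max_results]` keeps the cap exact even when the last page overshoots it.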


## Search Parameters Example

Let's explore various search parameters and query syntax.

In [6]:
print("=== Search Parameters Example ===")

with SearchClient() as client:
    try:
        # Search with specific parameters
        results: Dict[str, Any] = client.search(
            query='AUTHOR:"Smith J" AND JOURNAL:"Nature"',
            resultType="core",  # Get full metadata
            pageSize=5,
            format="json",
        ) # type: ignore

        papers: List[Dict[str, Any]] = results.get("resultList", {}).get("result", [])
        print(f"Found {len(papers)} papers by 'Smith J' in 'Nature'")

        for paper in papers:
            print(f"\n- {paper.get('title', 'No title')}")
            abstract: str = paper.get("abstractText", "N/A")
            print(f"  Abstract: {abstract[:100]}...")

    except EuropePMCError as e:
        print(f"Parameter search failed: {e}")

2025-11-21 12:09:32,181 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:32,182 - INFO - Cache miss - performing search with params: {'query': 'AUTHOR:"Smith J" AND JOURNAL:"Nature"', 'resultType': 'core', 'synonym': 'FALSE', 'pageSize': 5, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Search Parameters Example ===


2025-11-21 12:09:32,381 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:33,382 - INFO - Multi-layer cache closed successfully


Found 5 papers by 'Smith J' in 'Nature'

- Daily briefing: Greenhouse-gas emissions should peak by 2030, say researchers.
  Abstract: N/A...

- GREGoR: accelerating genomics for rare diseases.
  Abstract: Rare diseases are collectively common, affecting approximately 1 in 20 individuals worldwide. In rec...

- Daily briefing: 'Mind captioning' AI describes the images in your head.
  Abstract: N/A...

- Daily briefing: The bowhead whale's secret to living to 200.
  Abstract: N/A...

- Daily briefing: Surprise illnesses had a role in the demise of Napoleon's army.
  Abstract: N/A...


## Hit Count Example

Sometimes we just want to know how many papers match our query without retrieving the results.

In [7]:
print("=== Hit Count Example ===")

with SearchClient() as client:
    queries: List[str] = [
        "artificial intelligence",
        "CRISPR",
        "COVID-19",
        "climate change biology",
    ]

    for query in queries:
        try:
            count: int = client.get_hit_count(query)
            print(f"'{query}': {count:,} papers")
        except EuropePMCError as e:
            print(f"Failed to get count for '{query}': {e}")

2025-11-21 12:09:33,389 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:33,390 - INFO - Cache miss - performing search with params: {'query': 'artificial intelligence', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 1, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:33,566 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200


=== Hit Count Example ===


2025-11-21 12:09:34,567 - INFO - Cache miss - performing search with params: {'query': 'CRISPR', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 1, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:34,633 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200


'artificial intelligence': 348,971 papers


2025-11-21 12:09:35,634 - INFO - Cache miss - performing search with params: {'query': 'COVID-19', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 1, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:35,781 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200


'CRISPR': 200,095 papers


2025-11-21 12:09:36,782 - INFO - Cache miss - performing search with params: {'query': 'climate change biology', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 1, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:36,892 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200


'COVID-19': 946,343 papers


2025-11-21 12:09:37,893 - INFO - Multi-layer cache closed successfully


'climate change biology': 119,201 papers
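
The logs show that each count is obtained from a `pageSize=1` search, so the number comes from the `hitCount` field of an ordinary response. The sketch below reads that field defensively from a response dictionary. `extract_hit_count` is a hypothetical helper, not the library's implementation of `get_hit_count`.

```python
from typing import Any


def extract_hit_count(response: Any) -> int:
    """Return the hitCount of a parsed search response, or 0 if unavailable."""
    if not isinstance(response, dict):
        return 0  # e.g. a raw XML string when a non-JSON format was requested
    try:
        return int(response.get("hitCount", 0))
    except (TypeError, ValueError):
        return 0  # hitCount present but not numeric


count = extract_hit_count({"hitCount": 200095, "resultList": {"result": []}})
print(f"{count:,} papers")
```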


## Error Handling Example

Let's demonstrate how to handle various error conditions gracefully.

In [8]:
print("=== Error Handling Example ===")

with SearchClient() as client:
    # Try various problematic queries
    problematic_queries: List[str] = [
        "",  # Empty query
        "a",  # Too short
        'query with "unmatched quotes',  # Invalid syntax
        "normal query",  # This should work
    ]

    for query in problematic_queries:
        try:
            if client.validate_query(query):
                results: Dict[str, Any] = client.search(query, pageSize=1) # type: ignore
                print(f"✓ '{query}': {results.get('hitCount', 0)} results")
            else:
                print(f"✗ '{query}': Invalid query")
        except EuropePMCError as e:
            print(f"✗ '{query}': {e}")

2025-11-21 12:09:37,900 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:37,901 - INFO - Cache miss - performing search with params: {'query': 'normal query', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 1, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Error Handling Example ===
✗ '': Invalid query
✗ 'a': Invalid query
✗ 'query with "unmatched quotes': Invalid query


2025-11-21 12:09:38,125 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:39,126 - INFO - Multi-layer cache closed successfully


✓ 'normal query': 88815 results
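
The three rejected queries share simple defects: an empty string, a single character, and an unbalanced double quote. The sketch below reproduces those checks locally. It is an illustration of the observed behaviour, not the actual logic of `validate_query`, and the minimum length of 2 is an assumption.

```python
def looks_valid(query: str, min_length: int = 2) -> bool:
    """Rough local pre-check mirroring the rejections shown above."""
    stripped = query.strip()
    if len(stripped) < min_length:
        return False  # empty or too short
    if stripped.count('"') % 2 != 0:
        return False  # unbalanced double quotes
    return True


for candidate in ["", "a", 'query with "unmatched quotes', "normal query"]:
    print(repr(candidate), looks_valid(candidate))
```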


In [9]:
# Synonym comparison: run each query with synonym=False and synonym=True
import time

print("=== Synonym comparison (False vs True) ===")
queries = ["aspirin", "CRISPR gene editing", "diabetes clinical trial"]

# summary[query][mode] holds the hit count, top-20 IDs, and elapsed time
summary: Dict[str, Any] = {}
with SearchClient(rate_limit_delay=1) as client:
    for q in queries:
        summary[q] = {}
        for syn in (False, True):
            mode = "TRUE" if syn else "FALSE"
            try:
                t0 = time.time()
                res = client.search(q, pageSize=20, format="json", synonym=syn)
                t1 = time.time()
                if isinstance(res, dict):
                    hit = int(res.get("hitCount", 0))
                    top20 = [r.get("id") for r in res.get("resultList", {}).get("result", [])][:20]
                else:
                    # Non-dict responses carry no parsed results
                    hit = 0
                    top20 = []
                summary[q][mode] = {
                    "hitCount": hit,
                    "top20_ids": top20,
                    # Wall-clock time; per the logs this includes the rate-limit delay
                    "query_time_s": round(t1 - t0, 3),
                }
            except EuropePMCError as e:
                summary[q][mode] = {"error": str(e)}

pprint.pprint(summary)

2025-11-21 12:09:39,134 - INFO - SearchClient initialized with cache disabled
2025-11-21 12:09:39,134 - INFO - Cache miss - performing search with params: {'query': 'aspirin', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 20, 'format': 'json', 'cursorMark': '*', 'sort': ''}


=== Synonym comparison (False vs True) ===


2025-11-21 12:09:39,388 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:40,390 - INFO - Cache miss - performing search with params: {'query': 'aspirin', 'resultType': 'lite', 'synonym': 'TRUE', 'pageSize': 20, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:40,553 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:41,555 - INFO - Cache miss - performing search with params: {'query': 'CRISPR gene editing', 'resultType': 'lite', 'synonym': 'FALSE', 'pageSize': 20, 'format': 'json', 'cursorMark': '*', 'sort': ''}
2025-11-21 12:09:41,818 - INFO - GET request to https://www.ebi.ac.uk/europepmc/webservices/rest/search succeeded with status 200
2025-11-21 12:09:42,819 - INFO - Cache miss - performing search with params: {'query': 'CRISPR gene editing', 'resultType': 'lite', 'synonym': 'TRUE', 'pageSize': 20, 'format': 'json', 'cur

{'CRISPR gene editing': {'FALSE': {'hitCount': 113469,
                                   'query_time_s': 1.264,
                                   'top20_ids': ['41087699',
                                                 '40434188',
                                                 '40752775',
                                                 '40430848',
                                                 '39950370',
                                                 '40729503',
                                                 '39554638',
                                                 'PPR1043652',
                                                 '40054045',
                                                 '39704125',
                                                 'PPR921554',
                                                 'PPR914989',
                                                 '39455854',
                                                 'PPR881821',
                            

## Synonym Search Comparison: Explanation and Recommendations

The code cell above runs three representative queries with synonym expansion disabled and enabled, and prints a compact summary (hit counts, top-20 IDs, and per-query time).

Key observations you will typically see:

- Enabling synonym expansion (`synonym=True`) usually increases hit counts, because query terms are expanded with MeSH synonyms (for example, "aspirin" ↔ "acetylsalicylic acid").
- Top results usually overlap substantially between modes, but `synonym=True` can surface additional relevant records that use alternate terminology.
- The reported `query_time_s` includes the client's 1-second `rate_limit_delay`: the 1.264 s measured for "CRISPR gene editing" corresponds to a request of roughly 0.26 s in the logs. For a single small query (`pageSize=20`) the request time itself is dominated by network latency and is similar between modes; fetching many pages may take longer if synonym expansion increases the number of matching documents you retrieve.
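
The overlap between the two modes can be quantified from the `top20_ids` lists in the summary. The sketch below computes the shared count and the Jaccard index for two ID lists. `overlap_stats` is a hypothetical helper, and the IDs are placeholders rather than real results.

```python
from typing import Dict, List


def overlap_stats(ids_a: List[str], ids_b: List[str]) -> Dict[str, float]:
    """Compare two ID lists: shared count, unique counts, and Jaccard index."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    union = set_a | set_b
    # Two empty lists are treated as identical
    jaccard = len(shared) / len(union) if union else 1.0
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "only_b": len(set_b - set_a),
        "jaccard": round(jaccard, 3),
    }


# Placeholder IDs standing in for summary[q]["FALSE"]["top20_ids"] and the TRUE list
without_synonyms = ["id1", "id2", "id3", "id4"]
with_synonyms = ["id1", "id3", "id5", "id6"]
stats = overlap_stats(without_synonyms, with_synonyms)
print(stats)
```

A Jaccard index near 1 means synonym expansion barely changed the first page; lower values indicate that the expanded query reordered or replaced top results.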

Practical recommendations:

- Keep the library default `synonym=False` to preserve precision and reproducibility. Let callers opt in via `client.search(..., synonym=True)` when they want maximum recall.
- For discovery or systematic review prep, run `synonym=True` to broaden results and inspect the additional hits for relevance.
- If synonym expansion produces very long queries, use the POST search endpoint (`search_post`) or the `search` method with parameters that trigger POST usage in your environment to avoid URL-length issues.
- Consider adding a client-level default (for example, a hypothetical `SearchClient(default_synonym=False)` option) if a project prefers one mode consistently.
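
The POST recommendation above can be turned into a simple length check before sending a request. The sketch below estimates the GET URL length for a query using the search endpoint shown in the logs. `needs_post` is a hypothetical helper, and the 2000-character limit is an assumed conservative threshold, not a documented Europe PMC value.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
MAX_URL_LENGTH = 2000  # assumption: a conservative cap for GET URLs


def needs_post(query: str, extra_params: Optional[Dict[str, str]] = None,
               limit: int = MAX_URL_LENGTH) -> bool:
    """Return True when the encoded GET URL would exceed the length limit."""
    params = {"query": query, "format": "json"}
    params.update(extra_params or {})
    url = f"{BASE_URL}?{urlencode(params)}"
    return len(url) > limit


short_query = "aspirin"
long_query = " OR ".join(f"term{i}" for i in range(300))
print(needs_post(short_query), needs_post(long_query))
```

When the check returns `True`, the query is a candidate for `search_post` rather than `search`.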

## Summary

This notebook demonstrated:

1. **Basic search functionality** - Simple queries and result display
2. **Advanced search with parsing** - Using built-in parsing methods
3. **Pagination** - Handling large result sets across multiple pages
4. **Search parameters** - Using specific query syntax and parameters
5. **Hit counts** - Getting result counts without full retrieval
6. **Error handling** - Gracefully handling various error conditions
7. **Synonym expansion** - Comparing results with `synonym=False` and `synonym=True`

The PyEuropePMC SearchClient provides a robust interface for searching scientific literature with built-in error handling, rate limiting, and pagination support.