# 🔬 Europe PMC API Exploration

This notebook explores the Europe PMC REST API to understand the data we'll be working with for biodiversity genomics publication analysis.

**Contents:**
1. Search for "Darwin Tree of Life" articles
2. Inspect article metadata fields
3. Count publications by year
4. Explore text-mined annotations
5. Test citation retrieval
6. Compare positive vs negative queries

In [1]:
import sys
from pathlib import Path

# Add project root to path
sys.path.insert(0, str(Path.cwd().parent))

from src.data.europepmc_client import EuropePMCClient

# Initialize client
client = EuropePMCClient(rate_limit_delay=0.5)
print(f"\u2705 Client initialized: {client}")

✅ Client initialized: EuropePMCClient(base_url='https://www.ebi.ac.uk/europepmc/webservices/rest', page_size=100, rate_limit_delay=0.5)
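
For orientation, here is a minimal sketch of the kind of request a search against this base URL involves. `build_search_url` is a hypothetical helper and not part of `EuropePMCClient`; the parameter names (`query`, `format`, `resultType`, `pageSize`, `cursorMark`) are taken from the public Europe PMC search documentation, and whether the project client sends exactly these is an assumption.

```python
from urllib.parse import urlencode

# Base URL as reported by the client repr above
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def build_search_url(query, page_size=100, cursor_mark="*", result_type="core"):
    """Build a Europe PMC search URL (hypothetical helper, for illustration only)."""
    params = {
        "query": query,
        "format": "json",
        "resultType": result_type,   # "core" is assumed to return full metadata
        "pageSize": page_size,
        "cursorMark": cursor_mark,   # "*" is assumed to request the first page
    }
    return f"{BASE_URL}/search?{urlencode(params)}"


print(build_search_url('"Darwin Tree of Life"', page_size=20))
```

Building the URL separately from sending it makes it easy to paste a query into a browser when a result looks surprising.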


## 1. Search for Darwin Tree of Life Articles

Let's start by searching for articles mentioning "Darwin Tree of Life" and examine what we get back.

In [2]:
# Search for Darwin Tree of Life articles
query = '"Darwin Tree of Life"'
dtol_articles = client.search_articles(query, max_results=20)

print(f"Found {len(dtol_articles)} articles for query: {query}\n")

# Display first 5 articles
for i, article in enumerate(dtol_articles[:5]):
    meta = client.extract_article_metadata(article)
    print(f"{'='*70}")
    print(f"[{i+1}] {meta['title']}")
    print(f"    Journal: {meta['journal']} ({meta['year']})")
    print(f"    DOI: {meta['doi']}")
    print(f"    Cited by: {meta['cited_by_count']}")
    print(f"    Open Access: {meta['is_open_access']}")
    print(f"    PMID: {meta['pmid']} | PMCID: {meta['pmcid']}")
    
    # Show first 200 chars of abstract
    abstract = meta['abstract']
    if abstract:
        print(f"    Abstract: {abstract[:200]}...")
    print()

Found 20 articles for query: "Darwin Tree of Life"

[1] The genome sequence of the cryptorhynchine weevil, Kyklioacalles roboris (Stüben, 2003) (Coleoptera: Curculionidae) [version 1; peer review: 2 approved]
    Journal:  (2026)
    DOI: 
    Cited by: 0
    Open Access: True
    PMID:  | PMCID: PMC12895102

[2] The genome sequence of the leafhopper, Arthaldeus pascuellus (Fallén, 1826) (Hemiptera: Cicadellidae) [version 1; peer review: 2 approved]
    Journal:  (2026)
    DOI: 
    Cited by: 0
    Open Access: True
    PMID:  | PMCID: PMC12859417

[3] The genome sequence of the red gurnard, Chelidonichthys lucerna (Linnaeus, 1758) (Perciformes: Triglidae) [version 1; peer review: 2 approved]
    Journal:  (2026)
    DOI: 
    Cited by: 0
    Open Access: True
    PMID:  | PMCID: PMC12856256

[4] The genome sequence of True's beaked whale, &lt;i&gt;Mesoplodon mirus&lt;/i&gt; True, 1913 (Artiodactyla: Ziphiidae).
    Journal:  (2026)
    DOI: 10.12688/wellcomeopenres.25413.1
    Cite
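
Some titles in the output carry escaped inline markup (`&lt;i&gt;...&lt;/i&gt;`). It is not clear from the output alone whether the escaping comes from the API or from the notebook export, so the following is only a sketch of one way to normalise such titles before analysis. `clean_title` is a hypothetical helper, not part of the project client.

```python
import html
import re

# Matches simple inline tags such as <i>, </i>, <sub>, </sup>.
# Note: this would also strip any literal "<...>" span in a title.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def clean_title(raw_title):
    """Unescape HTML entities, drop inline tags, and collapse whitespace."""
    text = html.unescape(raw_title)
    text = TAG_RE.sub("", text)
    return " ".join(text.split())


raw = ("The genome sequence of True's beaked whale, "
       "&lt;i&gt;Mesoplodon mirus&lt;/i&gt; True, 1913 (Artiodactyla: Ziphiidae).")
print(clean_title(raw))
```

The tag pattern requires a letter right after `<`, so comparisons such as `p < 0.05` are left alone.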

## 2. Inspect Article Metadata Fields

Let's look at ALL fields available in a raw Europe PMC response to understand what data we can work with.

In [3]:
# Take the first article and print all available keys
if dtol_articles:
    sample = dtol_articles[0]
    print("Available fields in raw Europe PMC article response:")
    print("=" * 50)
    for key in sorted(sample.keys()):
        value = sample[key]
        val_type = type(value).__name__
        val_preview = str(value)[:80] if not isinstance(value, (dict, list)) else f"({val_type})"
        print(f"  {key:30s} [{val_type:6s}] = {val_preview}")

Available fields in raw Europe PMC article response:
  authMan                        [str   ] = N
  authorList                     [dict  ] = (dict)
  authorString                   [str   ] = Crowley L.
  citedByCount                   [int   ] = 0
  dateOfCreation                 [str   ] = 2026-02-14
  dateOfRevision                 [str   ] = 2026-02-14
  electronicPublicationDate      [str   ] = 2026-01-01
  epmcAuthMan                    [str   ] = N
  firstIndexDate                 [str   ] = 2026-02-14
  firstPublicationDate           [str   ] = 2026-01-01
  fullTextIdList                 [dict  ] = (dict)
  fullTextReceivedDate           [str   ] = 2026-02-14
  fullTextUrlList                [dict  ] = (dict)
  grantsList                     [dict  ] = (dict)
  hasBook                        [str   ] = N
  hasData                        [str   ] = N
  hasDbCrossReferences           [str   ] = N
  hasEvaluations                 [str   ] = N
  hasLabsLinks                   [st
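
Several of the fields above (`authMan`, `hasBook`, `hasData`, ...) are `'Y'`/`'N'` strings rather than booleans. A minimal sketch of converting them, assuming every value that is exactly `'Y'` or `'N'` is a flag; `flags_to_bool` is a hypothetical helper and the sample record only mirrors a few of the fields printed above.

```python
def flags_to_bool(record):
    """Return a copy of the record with 'Y'/'N' string values mapped to True/False."""
    converted = {}
    for key, value in record.items():
        if isinstance(value, str) and value in ("Y", "N"):
            converted[key] = (value == "Y")
        else:
            converted[key] = value  # leave non-flag values untouched
    return converted


# Illustrative subset of the fields shown in the output above
sample_record = {
    "authMan": "N",
    "authorString": "Crowley L.",
    "citedByCount": 0,
    "hasBook": "N",
    "hasData": "N",
}
print(flags_to_bool(sample_record))
```

If a free-text field could ever legitimately hold `'Y'` or `'N'`, restricting the conversion to an explicit list of flag names would be the safer design.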

## 3. Count Publications by Year

Let's count how many publications match each of four biodiversity genomics queries per year, from 2018 to 2025.

In [4]:
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend: figures are written to disk and plt.show() displays nothing
import matplotlib.pyplot as plt

# Count publications per year for each query (DToL and related searches)
queries_to_count = {
    '"Darwin Tree of Life"': [],
    '"Earth BioGenome Project"': [],
    '"Genome Note" AND "Wellcome Open Research"': [],
    '"reference genome" AND "biodiversity"': [],
}

years = range(2018, 2026)  # 2018 through 2025 inclusive

for query_name, counts in queries_to_count.items():
    for year in years:
        full_query = f'{query_name} AND (FIRST_PDATE:[{year}-01-01 TO {year}-12-31])'
        count = client.count_results(full_query)
        counts.append(count)
        print(f"  {query_name[:40]:40s} | {year}: {count}")

# Plot
fig, ax = plt.subplots(figsize=(12, 6))
x = list(years)
for query_name, counts in queries_to_count.items():
    label = query_name.replace('"', '').replace(' AND ', ' & ')[:35]
    ax.plot(x, counts, marker='o', linewidth=2, label=label)

ax.set_xlabel("Year", fontsize=12)
ax.set_ylabel("Number of Publications", fontsize=12)
ax.set_title("Biodiversity Genomics Publications by Year", fontsize=14, fontweight='bold')
ax.legend(fontsize=9, loc='upper left')
ax.grid(True, alpha=0.3)
ax.set_xticks(x)

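plt.tight_layout()
# Make sure the output directory exists before saving (Path is imported in the first cell)
Path("../results/figures").mkdir(parents=True, exist_ok=True)
plt.savefig("../results/figures/publications_by_year_exploration.png", dpi=150, bbox_inches='tight')
plt.show()
print("\n\u2705 Saved to results/figures/publications_by_year_exploration.png")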

  "Darwin Tree of Life"                    | 2018: 0
  "Darwin Tree of Life"                    | 2019: 1
  "Darwin Tree of Life"                    | 2020: 13
  "Darwin Tree of Life"                    | 2021: 75
  "Darwin Tree of Life"                    | 2022: 120
  "Darwin Tree of Life"                    | 2023: 355
  "Darwin Tree of Life"                    | 2024: 391
  "Darwin Tree of Life"                    | 2025: 514
  "Earth BioGenome Project"                | 2018: 10
  "Earth BioGenome Project"                | 2019: 31
  "Earth BioGenome Project"                | 2020: 34
  "Earth BioGenome Project"                | 2021: 46
  "Earth BioGenome Project"                | 2022: 72
  "Earth BioGenome Project"                | 2023: 75
  "Earth BioGenome Project"                | 2024: 117
  "Earth BioGenome Project"                | 2025: 406
  "Genome Note" AND "Wellcome Open Researc | 2018: 0
  "Genome Note" AND "Wellcome Open Researc | 2019: 0
  "Genome Note" AND "Wellc

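
The per-year query string and the growth visible in the counts can be sketched as follows. `year_query` and `yearly_change` are hypothetical helpers; the `FIRST_PDATE` range syntax is copied from the cell above, and the counts are the "Darwin Tree of Life" figures printed in the output.

```python
def year_query(base_query, year):
    """Restrict a query to articles first published in the given calendar year."""
    return f"{base_query} AND (FIRST_PDATE:[{year}-01-01 TO {year}-12-31])"


def yearly_change(counts):
    """Map each year (except the first) to its change in count versus the previous year."""
    ordered = sorted(counts)
    return {year: counts[year] - counts[prev] for prev, year in zip(ordered, ordered[1:])}


# "Darwin Tree of Life" counts taken from the output above
dtol_counts = {2018: 0, 2019: 1, 2020: 13, 2021: 75,
               2022: 120, 2023: 355, 2024: 391, 2025: 514}

print(year_query('"Darwin Tree of Life"', 2021))
print(yearly_change(dtol_counts))
```

The largest single-year jump in these figures is 2022 to 2023 (120 to 355).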


## 4. Explore Text-Mined Annotations

Europe PMC provides automatic text-mined annotations: organism names, gene/protein mentions, diseases, etc. Let's see what's available for a DToL article.

In [5]:
# Get annotations for a sample article
if dtol_articles:
    # Find first article with a PMID (some are PMC-only)
    sample_article = None
    sample_id = ""
    source = "MED"
    
    for a in dtol_articles:
        if a.get("pmid"):
            sample_article = a
            sample_id = a["pmid"]
            source = "MED"
            break
        elif a.get("pmcid") and sample_article is None:
            # Fallback: remember the first PMC-only record, keep looking for a PMID
            sample_article = a
            sample_id = a["pmcid"]
            source = "PMC"
    
    if sample_id:
        print(f"Getting annotations for {source}:{sample_id}")
        print(f"Title: {sample_article.get('title', 'N/A')}\n")
        
        annotations = client.get_annotations([sample_id], source=source)
        
        if annotations:
            print(f"Found {len(annotations)} annotation records\n")
            for ann_record in annotations[:3]:
                article_id = ann_record.get("extId", "unknown")
                ann_list = ann_record.get("annotations", [])
                print(f"Article {article_id}: {len(ann_list)} annotations")
                
                # Group by type
                type_counts = {}
                for ann in ann_list:
                    ann_type = ann.get("type", "unknown")
                    type_counts[ann_type] = type_counts.get(ann_type, 0) + 1
                
                for ann_type, count in sorted(type_counts.items()):
                    print(f"  {ann_type}: {count}")
                    examples = [a for a in ann_list if a.get("type") == ann_type]
                    if examples:
                        ex = examples[0]
                        print(f"    Example: '{ex.get('exact', 'N/A')}' "
                              f"(prefix: '{ex.get('prefix', '')[:30]}...')")
                print()
        else:
            print("No annotations found")
    else:
        print("No article ID available")

Getting annotations for MED:41625987
Title: The genome sequence of True's beaked whale, &lt;i&gt;Mesoplodon mirus&lt;/i&gt; True, 1913 (Artiodactyla: Ziphiidae).

Found 1 annotation records

Article 41625987: 51 annotations
  Chemicals: 3
    Example: 'Nucleic acid' (prefix: '...')
  Experimental Methods: 8
    Example: 'Assay' (prefix: 'dsDNA High Sensitivity...')
  Gene Ontology: 7
    Example: 'sex' (prefix: 'including the X...')
  Gene_Proteins: 3
    Example: 'nuclease' (prefix: 'bead clean-up, and...')
  Organisms: 29
    Example: 'Mammalia' (prefix: 'beaked whale; Chordata;...')
  Resources: 1
    Example: 'Ensembl' (prefix: 'of this assembly on ...')
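
The grouping step in the cell above can also be written with `collections.Counter`. This is a sketch under the assumption that each annotation is a dict with `type` and `exact` keys, as the cell's `.get()` calls suggest; `summarise_annotations` is a hypothetical helper and the sample list is illustrative rather than real API output.

```python
from collections import Counter


def summarise_annotations(ann_list):
    """Count annotations per type and keep the first 'exact' string seen for each type."""
    type_counts = Counter(ann.get("type", "unknown") for ann in ann_list)
    first_example = {}
    for ann in ann_list:
        # setdefault keeps the earliest example and ignores later ones
        first_example.setdefault(ann.get("type", "unknown"), ann.get("exact", "N/A"))
    return type_counts, first_example


# Illustrative records shaped like the annotations printed above
demo_annotations = [
    {"type": "Organisms", "exact": "Mammalia"},
    {"type": "Organisms", "exact": "Chordata"},
    {"type": "Resources", "exact": "Ensembl"},
]

counts, examples = summarise_annotations(demo_annotations)
for ann_type, count in sorted(counts.items()):
    print(f"{ann_type}: {count} (e.g. {examples[ann_type]!r})")
```

This makes one pass per result instead of re-filtering the full list for every type.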



## 5. Test Citation Retrieval

Let's see how many articles cite a prominent DToL publication.

In [6]:
# Use get_article_by_id for a well-cited DToL-related paper
# PMID 33420778 = genome assembly curation paper (more than 1,900 citations)
article = client.get_article_by_id("33420778", source="MED")

if article:
    meta = client.extract_article_metadata(article)
    
    print("Well-known DToL-related article:")
    print(f"  Title: {meta['title']}")
    print(f"  Year: {meta['year']}")
    print(f"  Journal: {meta['journal']}")
    print(f"  Cited by: {meta['cited_by_count']} articles")
    print(f"  PMID: {meta['pmid']}")
    print()
    
    # Get actual citations
    citations = client.get_citations(meta['pmid'], source="MED", max_results=10)
    print(f"Retrieved {len(citations)} citing articles (showing first 5):\n")
    for i, cit in enumerate(citations[:5]):
        print(f"  [{i+1}] {cit.get('title', 'N/A')}")
        print(f"      {cit.get('journalAbbreviation', 'N/A')} ({cit.get('pubYear', 'N/A')})")
        print()
else:
    print("Could not retrieve article")

Well-known DToL-related article:
  Title: Significantly improving the quality of genome assemblies through curation.
  Year: 2021
  Journal: 
  Cited by: 1926 articles
  PMID: 33420778

Retrieved 10 citing articles (showing first 5):

  [1] The genome sequence of the Carline Skipper, Pyrgus carlinae (Rambur, 1839) (Lepidoptera: Hesperiidae)
      N/A (2026)

  [2] The genome sequence of the Beautiful Pearl, Agrotera nemoralis (Scopoli, 1763) (Lepidoptera: Crambidae)
      N/A (2026)

  [3] The genome sequence of the Mountain Burnet, Zygaena exulans (Hohenwarth, 1792) (Lepidoptera: Zygaenidae)
      N/A (2026)

  [4] The genome sequence of the Spanish Festoon, Zerynthia rumina (Linnaeus, 1758) (Lepidoptera: Papilionidae)
      N/A (2026)

  [5] The genome sequence of the Purple Treble-bar, Aplocera praeformata (Hübner, 1826) (Lepidoptera: Geometridae)
      N/A (2026)
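
Only 10 of the 1926 citing articles were retrieved here. A rough sketch of what fetching all of them would involve: the URL shape (`/{source}/{id}/citations` with `page`, `pageSize` and `format` parameters) is an assumption based on the public Europe PMC documentation, and both helpers are hypothetical rather than part of `EuropePMCClient`.

```python
import math
from urllib.parse import urlencode

# Base URL as reported by the client repr in the first cell
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def citations_url(article_id, source="MED", page=1, page_size=10):
    """Build a citations URL for one page of results (assumed endpoint layout)."""
    params = {"page": page, "pageSize": page_size, "format": "json"}
    return f"{BASE_URL}/{source}/{article_id}/citations?{urlencode(params)}"


def pages_needed(total_citations, page_size):
    """Number of page requests required to retrieve every citing article."""
    return math.ceil(total_citations / page_size)


print(citations_url("33420778"))
# With the client's page_size of 100, 1926 citations need 20 requests
print(pages_needed(1926, 100))
```

At the client's `rate_limit_delay` of 0.5 seconds, those 20 requests would add roughly 10 seconds of waiting.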



## 6. Compare Positive vs Negative Queries

As a spot check, let's compare the first few titles returned by a positive (biodiversity) query and a negative (non-biodiversity) query.

In [7]:
# Positive query
pos_query = '"Darwin Tree of Life"'
pos_articles = client.search_articles(pos_query, max_results=5)

# Negative query
neg_query = '"clinical trial" AND "drug"'
neg_articles = client.search_articles(neg_query, max_results=5)

print("POSITIVE (Biodiversity Genomics):")
print("=" * 60)
for a in pos_articles[:3]:
    meta = client.extract_article_metadata(a)
    print(f"  [+] {meta['title'][:80]}")
    print(f"     Journal: {meta['journal']} | Year: {meta['year']}")
    print()

print("\nNEGATIVE (Non-Biodiversity):")
print("=" * 60)
for a in neg_articles[:3]:
    meta = client.extract_article_metadata(a)
    print(f"  [-] {meta['title'][:80]}")
    print(f"     Journal: {meta['journal']} | Year: {meta['year']}")
    print()

print("Clear separation between positive and negative queries confirmed!")

POSITIVE (Biodiversity Genomics):
  [+] The genome sequence of the cryptorhynchine weevil, Kyklioacalles roboris (Stüben
     Journal:  | Year: 2026

  [+] The genome sequence of the leafhopper, Arthaldeus pascuellus (Fallén, 1826) (Hem
     Journal:  | Year: 2026

  [+] The genome sequence of the red gurnard, Chelidonichthys lucerna (Linnaeus, 1758)
     Journal:  | Year: 2026


NEGATIVE (Non-Biodiversity):
  [-] Advances and perspectives in CEA-targeted therapies: From classic biomarker towa
     Journal:  | Year: 2026

  [-] To investigate the effect of a high-fat diet on pharmacokinetics/renal function/
     Journal:  | Year: 2026

  [-] A Single-Center, Randomized, Open-Label, Two-Formulation, Two-Sequence, Two-Peri
     Journal:  | Year: 2026

Clear separation between positive and negative queries confirmed!
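
Reading three titles per query is a useful sanity check, but the separation can also be measured. One possible sketch is the Jaccard overlap between the ID sets of the two result lists. `record_key` and `jaccard_overlap` are hypothetical helpers; the positive IDs are the PMCIDs from the Section 1 output, while the negative IDs are made up for illustration.

```python
def record_key(article):
    """Pick a stable identifier for a record: PMID if present, otherwise PMCID."""
    return article.get("pmid") or article.get("pmcid")


def jaccard_overlap(articles_a, articles_b):
    """Share of distinct article IDs that appear in both result lists (0.0 to 1.0)."""
    keys_a = {record_key(a) for a in articles_a} - {None}
    keys_b = {record_key(b) for b in articles_b} - {None}
    union = keys_a | keys_b
    if not union:
        return 0.0
    return len(keys_a & keys_b) / len(union)


# Positive IDs come from the Section 1 output; negative IDs are invented placeholders
pos_demo = [{"pmcid": "PMC12895102"}, {"pmcid": "PMC12859417"}, {"pmcid": "PMC12856256"}]
neg_demo = [{"pmid": "00000001"}, {"pmid": "00000002"}, {"pmid": "00000003"}]

print(f"Overlap: {jaccard_overlap(pos_demo, neg_demo):.2f}")
```

An overlap of zero shows only that the two samples share no records; judging topical separation on a larger sample would still need a check on content.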


## Summary

| Item | Finding |
|------|--------|
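| API Status | Working; Europe PMC REST API accessible |
| DToL Articles | 20 retrieved; `journal` was empty in every displayed record, and DOI and PMID were missing for the PMC-only records |
| Available Fields | title, abstract, journal, year, authors, DOI, citations, keywords, pub type |
| Annotations API | Returns organisms, gene/protein mentions, chemicals, Gene Ontology terms, experimental methods, and resources |
| Citations API | Returns citing articles (10 retrieved for PMID 33420778, which has 1926 citations) |
| Positive/Negative Separation | Sampled titles are topically distinct (three per query, inspected by eye) |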
