# Paper Metadata Enrichment Demo

This notebook demonstrates how to use the pyEuropePMC enrichment features to enhance paper metadata with information from external APIs like CrossRef, Unpaywall, Semantic Scholar, and OpenAlex.

## Overview

The enrichment module allows you to:
- Fetch additional metadata from multiple academic APIs
- Get citation counts from various sources
- Check open access availability and locations
- Retrieve funding information
- Access topic classifications and fields of study
- Combine data from multiple sources intelligently
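
As a rough illustration of the last point, the sketch below merges per-source records using a fixed precedence order and keeps every reported citation count. The function `merge_records`, the precedence order, and the sample records are assumptions made for this example; they are not pyEuropePMC's actual merge logic.

```python
from typing import Any

# Hypothetical precedence: earlier sources win when several provide a field.
PRECEDENCE = ["crossref", "semantic_scholar", "openalex"]


def merge_records(per_source: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge per-source metadata, keeping the first non-empty value per field."""
    merged: dict[str, Any] = {}
    for source in PRECEDENCE:
        for field, value in per_source.get(source, {}).items():
            if value in (None, "", [], {}):
                continue  # skip empty values so a later source can fill the field
            merged.setdefault(field, value)
    # Keep every reported citation count so disagreements stay visible.
    merged["citation_counts"] = [
        {"source": s, "count": per_source[s]["citation_count"]}
        for s in PRECEDENCE
        if per_source.get(s, {}).get("citation_count") is not None
    ]
    return merged


# Illustrative records, not real API responses.
sample = {
    "crossref": {"title": "Example paper", "citation_count": 3, "funders": None},
    "openalex": {"title": "Example Paper", "citation_count": 4, "is_oa": True},
}
merged_sample = merge_records(sample)
print(merged_sample["title"], merged_sample["citation_count"], merged_sample["is_oa"])
```

Keeping the per-source counts next to the merged value makes it easy to spot cases where sources disagree.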

## Available APIs

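- **CrossRef**: Bibliographic metadata, citations, licensing
- **Unpaywall**: Open access status and full-text availability
- **Semantic Scholar**: Academic impact metrics
- **OpenAlex**: Comprehensive academic metadata graph
- **DataCite**: DOI metadata, mainly for datasets and other research outputs
- **ROR**: Research organization identifiers for affiliations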

## Import Required Libraries

In [None]:
# Import Required Libraries
from pyeuropepmc import PaperEnricher, EnrichmentConfig
from pyeuropepmc.cache.cache import CacheConfig
import os

# Example DOI for demonstration
DOI = "10.1371/journal.pone.0308090"

print("Libraries imported successfully!")
print(f"Demo DOI: {DOI}")

Libraries imported successfully!
Demo DOI: 10.1371/journal.pone.0308090


## Basic Enrichment Example

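Let's start with a simple enrichment that enables every source used in this notebook: CrossRef, Semantic Scholar, OpenAlex, DataCite, Unpaywall, and ROR. Contact emails and API keys are read from environment variables; Unpaywall in particular requires an email address.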
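
Because the configuration in this section reads its emails and keys from environment variables, it can help to check which of them are actually set before running it. The sketch below does that with the standard library only; `missing_env_vars` is a hypothetical helper, and the variable names are the ones used in the configuration cell.

```python
import os

# Environment variable names used in the configuration cell of this notebook.
ENV_VARS = [
    "CROSSREF_EMAIL",
    "DATACITE_EMAIL",
    "UNPAYWALL_EMAIL",
    "SEMANTIC_SCHOLAR_API_KEY",
    "OPENALEX_EMAIL",
    "ROR_CLIENT_ID",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(ENV_VARS)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All credentials found in the environment.")
```

A missing variable simply becomes `None` in `os.environ.get(...)`, so a check like this makes it visible which services are running without credentials.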

In [2]:
# Configure enrichment with all sources enabled; credentials come from environment variables
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_datacite=True,
    enable_unpaywall=True,
    enable_ror=True,
    crossref_email=os.environ.get("CROSSREF_EMAIL"),
    datacite_email=os.environ.get("DATACITE_EMAIL"),
    unpaywall_email=os.environ.get("UNPAYWALL_EMAIL"),
    semantic_scholar_api_key=os.environ.get("SEMANTIC_SCHOLAR_API_KEY"),
    openalex_email=os.environ.get("OPENALEX_EMAIL"),
    ror_client_id=os.environ.get("ROR_CLIENT_ID"),
)

print("Enrichment configuration:")
print(f"- CrossRef: {'Enabled' if config.enable_crossref else 'Disabled'}")
print(f"- Semantic Scholar: {'Enabled' if config.enable_semantic_scholar else 'Disabled'}")
print(f"- OpenAlex: {'Enabled' if config.enable_openalex else 'Disabled'}")
print(f"- DataCite: {'Enabled' if config.enable_datacite else 'Disabled'}")
print(f"- Unpaywall: {'Enabled' if config.enable_unpaywall else 'Disabled'}")
print(f"- ROR: {'Enabled' if config.enable_ror else 'Disabled'}")

Enrichment configuration:
- CrossRef: Enabled
- Semantic Scholar: Enabled
- OpenAlex: Enabled
- DataCite: Enabled
- Unpaywall: Enabled
- ROR: Enabled


In [3]:
# Create enricher and enrich a paper
with PaperEnricher(config) as enricher:
    print(f"Enriching metadata for DOI: {DOI}")
    print("-" * 60)

    # Enrich the paper
    result = enricher.enrich_paper(identifier=DOI)

    # Display results
    print(f"Data sources used: {', '.join(result['sources'])}")
    print("\n" + "=" * 60)
    print("MERGED METADATA")
    print("=" * 60)

    merged = result.get("merged", {})

    # Title
    if "title" in merged:
        print(f"Title: {merged['title']}")

    # Authors
    if "authors" in merged:
        print(f"Authors ({len(merged['authors'])} total):")
        authors = merged["authors"][:5]
        for author in authors:
            if isinstance(author, dict):
                print(f"  - {author.get('name', author.get('display_name', 'Unknown'))}")
            else:
                print(f"  - {author}")
        if len(merged["authors"]) > 5:
            print(f"  ... and {len(merged['authors']) - 5} more")

    # Citation metrics
    if "citation_count" in merged:
        print(f"Citation Count: {merged['citation_count']}")
        if "citation_counts" in merged:
            print("  Citation counts by source:")
            for cite_info in merged["citation_counts"]:
                print(f"    - {cite_info['source']}: {cite_info['count']}")

    # Open access status
    if "is_oa" in merged:
        oa_status = "Yes" if merged["is_oa"] else "No"
        print(f"Open Access: {oa_status}")
        if merged.get("oa_status"):
            print(f"  Status: {merged['oa_status']}")

Enriching metadata for DOI: 10.1371/journal.pone.0308090
------------------------------------------------------------


No data found for identifier: 10.1371/journal.pone.0308090
No data from datacite
No data from datacite
No data from datacite
No data from ror
No data from ror
No data from ror


Data sources used: crossref, unpaywall, semantic_scholar, openalex, ror

MERGED METADATA
Title: Associations of dietary pattern, insulin resistance and risk of developing metabolic syndrome among Chinese population
Authors (11 total):
  - Liyong Kou
  - Jing Ping Sun
  - Ping Wu
  - Cheng Zhou
  - Ping Zhou
  ... and 6 more
Citation Count: 3
  Citation counts by source:
    - crossref: 3
    - semantic_scholar: 3
    - openalex: 3
Open Access: Yes
  Status: gold


In [4]:
result

{'identifier': '10.1371/journal.pone.0308090',
 'doi': '10.1371/journal.pone.0308090',
 'sources': ['crossref', 'unpaywall', 'semantic_scholar', 'openalex', 'ror'],
 'crossref': {'source': 'crossref',
  'title': 'Associations of dietary pattern, insulin resistance and risk of developing metabolic syndrome among Chinese population',
  'authors': ['Liyong Kou',
   'Jing Sun',
   'Ping Wu',
   'Zhou Cheng',
   'Ping Zhou',
   'Nana Li',
   'Liang Cheng',
   'Pengfei Xu',
   'Yunzhuo Xue',
   'Jiamin Tian',
   'Wei Chen'],
  'abstract': '<jats:p>Evidence regarding the role of dietary patterns in metabolic syndrome (MetS) is limited. The mechanistic links between dietary patterns, insulin resistance, and MetS are not fully understood. This study aimed to evaluate the associations between dietary patterns and the risk of MetS in a Chinese population using a longitudinal design. Data from the China Health and Nutrition Survey, a nationally representative survey, were analyzed. MetS cases were

### Journal and Publication Information

In [5]:
# Display journal and publication information
if "journal" in merged:
    journal = merged["journal"]
    if isinstance(journal, dict):
        print(f"Journal: {journal.get('display_name', journal)}")
    else:
        print(f"Journal: {journal}")
else:
    print("Journal: Not available")

if "publication_date" in merged:
    print(f"Publication Date: {merged['publication_date']}")
elif "publication_year" in merged:
    print(f"Publication Year: {merged['publication_year']}")
else:
    print("Publication Date: Not available")

Journal: PLOS ONE
Publication Date: 2024-08-06


### Additional Citation Metrics

In [6]:
# Display additional citation metrics
if "influential_citation_count" in merged:
    print(f"Influential Citations: {merged['influential_citation_count']}")
else:
    print("Influential Citations: Not available")

# Show citation velocity or other metrics if available
if "citation_velocity" in merged:
    print(f"Citation Velocity: {merged['citation_velocity']}")
if "recent_citations" in merged:
    print(f"Recent Citations: {merged['recent_citations']}")

Influential Citations: 0


### Topics and Fields of Study

In [7]:
# Display topics/concepts from OpenAlex
if "topics" in merged and merged["topics"]:
    print("Topics:")
    for topic in merged["topics"][:5]:
        if isinstance(topic, dict):
            name = topic.get("display_name", "Unknown")
            score = topic.get("score", "")
            if score:
                print(f"  - {name} (relevance: {score:.3f})")
            else:
                print(f"  - {name}")
        else:
            print(f"  - {topic}")
else:
    print("Topics: Not available")

# Display fields of study
if "fields_of_study" in merged and merged["fields_of_study"]:
    print("\nFields of Study:")
    for field in merged["fields_of_study"][:5]:
        print(f"  - {field}")
else:
    print("Fields of Study: Not available")

Topics:
  - Nutritional Studies and Diet (relevance: 0.994)
  - Metabolomics and Mass Spectrometry Studies (relevance: 0.993)
  - Diet and metabolism studies (relevance: 0.986)

Fields of Study:
  - Medicine
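
The topic scores shown above can also be used to keep only the most relevant topics. The following is a minimal sketch, assuming topics are dictionaries with `display_name` and `score` keys as in the output above; `top_topics` and the 0.99 threshold are illustrative choices, not part of pyEuropePMC.

```python
def top_topics(topics, min_score=0.99, limit=5):
    """Return display names of dict-shaped topics scoring at least min_score, best first."""
    scored = [
        t for t in topics
        if isinstance(t, dict) and isinstance(t.get("score"), (int, float))
    ]
    scored.sort(key=lambda t: t["score"], reverse=True)
    names = [t.get("display_name", "Unknown") for t in scored if t["score"] >= min_score]
    return names[:limit]


# Values rounded from the notebook output above.
sample_topics = [
    {"display_name": "Nutritional Studies and Diet", "score": 0.994},
    {"display_name": "Metabolomics and Mass Spectrometry Studies", "score": 0.993},
    {"display_name": "Diet and metabolism studies", "score": 0.986},
]
print(top_topics(sample_topics))
```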


### License and Funding Information

In [8]:
# Display license information
if "license" in merged and merged["license"]:
    license_info = merged["license"]
    if isinstance(license_info, dict) and license_info.get("url"):
        print(f"License: {license_info['url']}")
    else:
        print(f"License: {license_info}")
else:
    print("License: Not available")

# Display funding information
if "funders" in merged and merged["funders"]:
    print(f"\nFunding: {len(merged['funders'])} funder(s)")
    for funder in merged["funders"][:3]:
        if isinstance(funder, dict):
            funder_name = funder.get('name', 'Unknown')
            print(f"  - {funder_name}")
            if funder.get("award"):
                awards = funder["award"]
                if isinstance(awards, list):
                    print(f"    Awards: {', '.join(awards)}")
                else:
                    print(f"    Award: {awards}")
else:
    print("Funding: Not available")

License: http://creativecommons.org/licenses/by/4.0/
Funding: Not available
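
License URLs such as the one printed above are precise but not very readable. The sketch below turns a Creative Commons URL into a short label using only the standard library; `cc_license_label` is a hypothetical helper, and it only recognizes the common `licenses/<code>/<version>` and `publicdomain/zero/<version>` URL shapes.

```python
from urllib.parse import urlparse


def cc_license_label(url):
    """Turn a Creative Commons license URL into a short label, or return None."""
    parsed = urlparse(url)
    if not parsed.netloc.endswith("creativecommons.org"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    # Expected path shape: licenses/<code>/<version>
    if len(parts) >= 3 and parts[0] == "licenses":
        return f"CC {parts[1].upper()} {parts[2]}"
    # Public-domain dedication: publicdomain/zero/<version>
    if len(parts) >= 3 and parts[:2] == ["publicdomain", "zero"]:
        return f"CC0 {parts[2]}"
    return None


print(cc_license_label("http://creativecommons.org/licenses/by/4.0/"))
```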


### Individual Source Data

In [9]:
# Show individual source data
print("=" * 60)
print("INDIVIDUAL SOURCE DATA")
print("=" * 60)

for source in result["sources"]:
    print(f"\n{source.upper()}:")
    print("-" * 40)
    source_data = result.get(source, {})
    if source_data:
        # Show a few key fields from each source
        if "title" in source_data:
            print(f"  Title: {source_data['title'][:60]}...")
        if "citation_count" in source_data:
            print(f"  Citations: {source_data['citation_count']}")
        if "is_oa" in source_data:
            print(f"  Open Access: {source_data['is_oa']}")
        if "topics" in source_data and source_data["topics"]:
            topics_count = len(source_data["topics"])
            print(f"  Topics: {topics_count} available")
    else:
        print("  No data available")

INDIVIDUAL SOURCE DATA

CROSSREF:
----------------------------------------
  Title: Associations of dietary pattern, insulin resistance and risk...
  Citations: 3

UNPAYWALL:
----------------------------------------
  Open Access: True

SEMANTIC_SCHOLAR:
----------------------------------------
  Title: Associations of dietary pattern, insulin resistance and risk...
  Citations: 3

OPENALEX:
----------------------------------------
  Title: Associations of dietary pattern, insulin resistance and risk...
  Citations: 3
  Open Access: True
  Topics: 3 available

ROR:
----------------------------------------


### Complete Data per Service

In [10]:
# Show complete data from each individual service
import pprint

print("=" * 80)
print("COMPLETE DATA PER SERVICE")
print("=" * 80)

for source in result["sources"]:
    print(f"\n🔍 {source.upper()} - COMPLETE DATA:")
    print("=" * 60)
    source_data = result.get(source, {})

    if source_data:
        # Use pprint to show all data in a readable format
        pprint.pprint(source_data, width=100, depth=3)
    else:
        print("  No data available from this service")

    print("\n" + "-" * 80)

COMPLETE DATA PER SERVICE

🔍 CROSSREF - COMPLETE DATA:
{'abstract': '<jats:p>Evidence regarding the role of dietary patterns in metabolic syndrome (MetS) '
             'is limited. The mechanistic links between dietary patterns, insulin resistance, and '
             'MetS are not fully understood. This study aimed to evaluate the associations between '
             'dietary patterns and the risk of MetS in a Chinese population using a longitudinal '
             'design. Data from the China Health and Nutrition Survey, a nationally representative '
             'survey, were analyzed. MetS cases were identified based on biomarker data collected '
             'in 2009. Factor analysis was employed to identify dietary patterns, while logistic '
             'regression models were utilized to examine the association between dietary patterns '
             'and MetS. Mediation models were applied to assess multiple mediation effects. Two '
             'dietary patterns were revealed b

### Data Structure per Service

In [11]:
# Show data structure/keys available from each service
print("=" * 80)
print("DATA STRUCTURE PER SERVICE")
print("=" * 80)

def print_data_structure(data, prefix="", max_depth=2, current_depth=0):
    """Recursively print the structure of nested data"""
    if current_depth > max_depth:
        print(f"{prefix}... (nested data)")
        return

    if isinstance(data, dict):
        for key, value in data.items():
            if isinstance(value, (dict, list)):
                if isinstance(value, list) and value and isinstance(value[0], (dict, list)):
                    print(f"{prefix}{key}: [{len(value)} items]")
                    if len(value) > 0:
                        print_data_structure(value[0], f"{prefix}  ", max_depth, current_depth + 1)
                else:
                    print(f"{prefix}{key}: {type(value).__name__}")
                    if current_depth < max_depth:
                        print_data_structure(value, f"{prefix}  ", max_depth, current_depth + 1)
            else:
                print(f"{prefix}{key}: {type(value).__name__}")
    elif isinstance(data, list):
        print(f"{prefix}[{len(data)} items of type {type(data[0]).__name__ if data else 'empty'}]")
    else:
        print(f"{prefix}{type(data).__name__}")

for source in result["sources"]:
    print(f"\n📊 {source.upper()} - DATA STRUCTURE:")
    print("=" * 60)
    source_data = result.get(source, {})

    if source_data:
        print_data_structure(source_data)
    else:
        print("  No data available from this service")

    print("\n" + "-" * 80)

DATA STRUCTURE PER SERVICE

📊 CROSSREF - DATA STRUCTURE:
source: str
title: str
authors: list
  [11 items of type str]
abstract: str
journal: str
publication_date: str
citation_count: int
references_count: int
license: dict
  url: str
  start: str
  delay_in_days: int
funders: NoneType
type: str
issn: list
  [1 items of type str]
volume: str
issue: str
page: str
publisher: str

--------------------------------------------------------------------------------

📊 UNPAYWALL - DATA STRUCTURE:
source: str
is_oa: bool
oa_status: str
best_oa_location: dict
  url: str
  url_for_pdf: NoneType
  url_for_landing_page: str
  version: str
  license: str
  repository_type: str
  evidence: str
oa_locations: [2 items]
  url: str
  url_for_pdf: NoneType
  version: str
  license: str
  repository_type: str
oa_locations_embargoed: NoneType
first_oa_date: NoneType
journal_is_oa: bool
journal_is_in_doaj: bool
publisher: str
year: int

-------------------------------------------------------------------------

### Available Fields per Service

In [12]:
# Show available fields/keys from each service
print("=" * 80)
print("AVAILABLE FIELDS PER SERVICE")
print("=" * 80)

for source in result["sources"]:
    print(f"\n🔑 {source.upper()} - AVAILABLE FIELDS:")
    print("=" * 60)
    source_data = result.get(source, {})

    if source_data:
        fields = list(source_data.keys())
        print(f"Total fields: {len(fields)}")
        print("Fields:", ", ".join(fields))

        # Show data types for each field
        print("\nField types:")
        for field in fields:
            value = source_data[field]
            field_type = type(value).__name__
            if isinstance(value, list) and value:
                field_type += f"[{len(value)}]"
            elif isinstance(value, dict):
                field_type += f"{{{len(value)}}}"
            print(f"  {field}: {field_type}")
    else:
        print("  No data available from this service")

    print("\n" + "-" * 80)

AVAILABLE FIELDS PER SERVICE

🔑 CROSSREF - AVAILABLE FIELDS:
Total fields: 16
Fields: source, title, authors, abstract, journal, publication_date, citation_count, references_count, license, funders, type, issn, volume, issue, page, publisher

Field types:
  source: str
  title: str
  authors: list[11]
  abstract: str
  journal: str
  publication_date: str
  citation_count: int
  references_count: int
  license: dict{3}
  funders: NoneType
  type: str
  issn: list[1]
  volume: str
  issue: str
  page: str
  publisher: str

--------------------------------------------------------------------------------

🔑 UNPAYWALL - AVAILABLE FIELDS:
Total fields: 11
Fields: source, is_oa, oa_status, best_oa_location, oa_locations, oa_locations_embargoed, first_oa_date, journal_is_oa, journal_is_in_doaj, publisher, year

Field types:
  source: str
  is_oa: bool
  oa_status: str
  best_oa_location: dict{7}
  oa_locations: list[2]
  oa_locations_embargoed: NoneType
  first_oa_date: NoneType
  journal_is_

### Service Comparison Summary

In [13]:
# Create a comparison summary of what each service provides
print("=" * 100)
print("SERVICE COMPARISON SUMMARY")
print("=" * 100)

comparison_data = []

for source in result["sources"]:
    source_data = result.get(source, {})
    if source_data:
        fields = list(source_data.keys())
        info = {
            'Service': source.upper(),
            'Total Fields': len(fields),
            'Has Title': 'title' in source_data,
            'Has Authors': 'authors' in source_data,
            'Has Citations': 'citation_count' in source_data,
            'Has Abstract': 'abstract' in source_data,
            'Has OA Info': 'is_oa' in source_data,
            # bool(...) keeps the table to True/False instead of printing raw lists or None
            'Has Topics': bool(source_data.get('topics')),
            'Has License': 'license' in source_data,
            'Has Funding': bool(source_data.get('funders')),
        }
        comparison_data.append(info)

# Print comparison table
print(f"{'Service':<15} {'Fields':<6} {'Title':<5} {'Authors':<7} {'Cites':<5} {'Abstract':<8} {'OA':<2} {'Topics':<6} {'License':<7} {'Funding':<7}")
print("-" * 100)

for info in comparison_data:
    print(f"{info['Service']:<15} {info['Total Fields']:<6} {str(info['Has Title']):<5} {str(info['Has Authors']):<7} {str(info['Has Citations']):<5} {str(info['Has Abstract']):<8} {str(info['Has OA Info']):<2} {str(info['Has Topics']):<6} {str(info['Has License']):<7} {str(info['Has Funding']):<7}")

print("\n" + "=" * 100)
print("Legend: True/False indicates whether the service provides that type of data")
print("=" * 100)

SERVICE COMPARISON SUMMARY
Service         Fields Title Authors Cites Abstract OA Topics License Funding
----------------------------------------------------------------------------------------------------
CROSSREF        16     True  True    True  True     False False  True    False
UNPAYWALL       11     False False   False False    True False  False   False
SEMANTIC_SCHOLAR 19     True  True    True  True     False False  False   False
OPENALEX        21     True  True    True  False    True True   False   False
ROR             2      False False   False False    False False  False   False

Legend: True/False indicates whether the service provides that type of data

## Basic Search vs. Enriched Search

This section demonstrates the difference between a basic paper search using pyEuropePMC's core search functionality and the enhanced results provided by the enrichment module.

We'll search for the same paper using both methods and compare the metadata available.

In [14]:
# Basic search using pyEuropePMC's search client
from pyeuropepmc import SearchClient

# Example DOI for demonstration
DOI = "10.1371/journal.pone.0308090"

print("=" * 80)
print("BASIC SEARCH RESULTS FROM EUROPE PMC")
print("=" * 80)

with SearchClient() as search_client:
    # Search for the paper by DOI
    results = search_client.search(query=f"DOI:{DOI}", limit=1)

    # Extract the actual results from the response
    papers = results.get('resultList', {}).get('result', [])

    if papers:
        paper = papers[0]
        print(f"Title: {paper.get('title', 'Not available')}")
        print(f"Authors: {paper.get('authorString', 'Not available')}")
        print(f"Journal: {paper.get('journalTitle', 'Not available')}")
        print(f"Publication Year: {paper.get('pubYear', 'Not available')}")
        print(f"DOI: {paper.get('doi', 'Not available')}")
        print(f"PMCID: {paper.get('pmcid', 'Not available')}")
        print("Abstract: Not available (not included in basic search)")

        # Show what fields are available
        available_fields = list(paper.keys())
        print(f"\nAvailable fields from basic search: {len(available_fields)}")
        print("Key fields:", ", ".join(available_fields[:10]) + ("..." if len(available_fields) > 10 else ""))
    else:
        print("No results found for this DOI.")

BASIC SEARCH RESULTS FROM EUROPE PMC


[BaseAPIClient] GET request failed
Unexpected error in request
Unexpected error in request
Unexpected error in request


SearchError: [NET001] Network connection failed. Check internet connectivity.
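
The traceback above shows the basic search failing on a network error. Transient failures like this are often worth retrying. The sketch below is a generic retry wrapper with exponential backoff; `with_retries` is a hypothetical helper, and it catches a broad `Exception` because the import path of the library's `SearchError` is not shown in this notebook.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on any exception, wait and retry with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose: no library-specific exception is assumed
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x ... the base delay
    raise last_error


# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


print(with_retries(flaky, sleep=lambda _: None))
```

With the search client from the cell above, the call could be wrapped as `with_retries(lambda: search_client.search(query=f"DOI:{DOI}", limit=1))`.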

In [None]:
# Now show the enriched results
from pyeuropepmc import PaperEnricher, EnrichmentConfig

# Example DOI for demonstration
DOI = "10.1371/journal.pone.0308090"

# Configure enrichment with default settings
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_unpaywall=False,  # Disabled by default (requires email)
    unpaywall_email="your-email@example.com",  # Placeholder email for Unpaywall
)

print("\n" + "=" * 80)
print("ENRICHED SEARCH RESULTS")
print("=" * 80)

with PaperEnricher(config) as enricher:
    enriched_result = enricher.enrich_paper(identifier=DOI)
    merged = enriched_result.get("merged", {})

    print(f"Title: {merged.get('title', 'Not available')}")
    print(f"Authors: {', '.join([author.get('name', author.get('display_name', 'Unknown')) if isinstance(author, dict) else str(author) for author in merged.get('authors', [])])}")
    print(f"Journal: {merged.get('journal', {}).get('display_name', 'Not available') if isinstance(merged.get('journal'), dict) else merged.get('journal', 'Not available')}")
    print(f"Publication Date: {merged.get('publication_date', merged.get('publication_year', 'Not available'))}")
    print(f"DOI: {DOI}")
    print(f"Citation Count: {merged.get('citation_count', 'Not available')}")
    print(f"Open Access: {'Yes' if merged.get('is_oa') else 'No'}")
    # "or []" guards against sources that report these fields as None
    print(f"Topics: {len(merged.get('topics') or [])} available")
    print(f"Funding: {len(merged.get('funders') or [])} funders")
    print(f"Abstract: {merged.get('abstract', 'Not available')[:300]}..." if merged.get('abstract') else "Abstract: Not available")

    # Show enhanced fields
    available_fields = list(merged.keys())
    print(f"\nAvailable fields from enriched search: {len(available_fields)}")
    print("Key fields:", ", ".join(available_fields[:15]) + ("..." if len(available_fields) > 15 else ""))


ENRICHED SEARCH RESULTS
Title: Associations of dietary pattern, insulin resistance and risk of developing metabolic syndrome among Chinese population
Authors: Liyong Kou, Jing Ping Sun, Ping Wu, Cheng Zhou, Ping Zhou, Nana Li, Liang Cheng, Pengfei Xu, Yunzhuo Xue, Jiamin Tian, Wei Chen
Journal: PLOS ONE
Publication Date: 2024-08-06
DOI: 10.1371/journal.pone.0308090
Citation Count: 3
Open Access: Yes
Topics: 3 available
Funding: 0 funders
Abstract: <jats:p>Evidence regarding the role of dietary patterns in metabolic syndrome (MetS) is limited. The mechanistic links between dietary patterns, insulin resistance, and MetS are not fully understood. This study aimed to evaluate the associations between dietary patterns and the risk of MetS in a Chi...

Available fields from enriched search: 18
Key fields: title, authors, abstract, journal, publication_date, citation_counts, citation_count, is_oa, oa_status, oa_url, influential_citation_count, fields_of_study, topics, license, external_ids...
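
The abstract printed above still carries JATS XML markup such as `<jats:p>`. A minimal sketch for cleaning it up before display is shown below; `strip_jats` is a hypothetical helper based on regular expressions, which is adequate for simple abstracts but is not a full XML parser.

```python
import html
import re

# Block-level tags become spaces so adjacent paragraphs do not run together.
_BLOCK_TAG_RE = re.compile(r"</?(?:jats:)?(?:p|title|sec)\b[^>]*>")
# Any remaining tag (for example inline italics) is removed without adding a space.
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_jats(abstract):
    """Remove JATS/XML tags, unescape entities, and collapse whitespace."""
    text = _BLOCK_TAG_RE.sub(" ", abstract)
    text = _ANY_TAG_RE.sub("", text)
    # Unescape after tag removal so "&lt;" in the text is not mistaken for a tag.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


raw = (
    "<jats:p>Evidence regarding the role of dietary patterns in metabolic "
    "syndrome (MetS) is limited.</jats:p>"
)
print(strip_jats(raw))
```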

In [None]:
# Show external IDs available for referencing
print("\n" + "=" * 80)
print("EXTERNAL IDs FOR REFERENCE")
print("=" * 80)

print("External identifiers that can be used to reference this paper:")
print()

# DOI (from merged or original)
doi = merged.get('doi') or DOI
print(f"DOI: {doi}")

# Check individual services for additional IDs
print("\nChecking individual services for external IDs:")

for source in enriched_result["sources"]:
    print(f"\n{source.upper()}:")
    source_data = enriched_result.get(source, {})
    if source_data:
        # Common external ID fields to check
        id_fields = ['pmid', 'pmcid', 'issn', 'isbn', 'openalex_id', 'semanticscholar_id', 'crossref_id', 'id', 'paper_id', 'external_ids']
        found_ids = {}
        for field in id_fields:
            if field in source_data and source_data[field]:
                found_ids[field] = source_data[field]

        if found_ids:
            for field, value in found_ids.items():
                print(f"  {field}: {value}")
        else:
            print("  No additional external IDs found")
    else:
        print("  No data available")

print("\n" + "=" * 80)
print("These external IDs can be used to:")
print("- Cite the paper in academic writing")
print("- Link to the paper from other databases")
print("- Cross-reference with other systems")
print("- Build citation networks and bibliometric analyses")
print("=" * 80)


EXTERNAL IDs FOR REFERENCE
External identifiers that can be used to reference this paper:

DOI: 10.1371/journal.pone.0308090

Checking individual services for external IDs:

CROSSREF:
  issn: ['1932-6203']

SEMANTIC_SCHOLAR:
  external_ids: {'PubMedCentral': '11302861', 'DOI': '10.1371/journal.pone.0308090', 'CorpusId': 271742135, 'PubMed': '39106225'}

OPENALEX:
  openalex_id: https://openalex.org/W4401361300

These external IDs can be used to:
- Cite the paper in academic writing
- Link to the paper from other databases
- Cross-reference with other systems
- Build citation networks and bibliometric analyses
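
To make the identifiers above directly usable for linking, they can be mapped to resolvable URLs. The sketch below assumes the key names from the Semantic Scholar `external_ids` output; `identifier_urls` is a hypothetical helper, and the URL templates are the commonly used public patterns, which are worth verifying before relying on them.

```python
def identifier_urls(external_ids):
    """Map Semantic Scholar style external IDs to resolvable URLs."""
    templates = {
        "DOI": "https://doi.org/{}",
        "PubMed": "https://pubmed.ncbi.nlm.nih.gov/{}/",
        "PubMedCentral": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}/",
    }
    # IDs without a known template (for example CorpusId) are skipped.
    return {
        key: templates[key].format(value)
        for key, value in external_ids.items()
        if key in templates and value
    }


ids = {
    "PubMedCentral": "11302861",
    "DOI": "10.1371/journal.pone.0308090",
    "CorpusId": 271742135,
    "PubMed": "39106225",
}
for name, url in identifier_urls(ids).items():
    print(f"{name}: {url}")
```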


In [None]:
# Show citation information and references
print("\n" + "=" * 80)
print("CITATION INFORMATION AND REFERENCES")
print("=" * 80)

print("Citation metrics:")
print(f"- Total citations: {merged.get('citation_count', 'Not available')}")
print(f"- Influential citations: {merged.get('influential_citation_count', 'Not available')}")

# Check for references/citations in individual services
print("\nChecking for references/citations in individual services:")

for source in enriched_result["sources"]:
    print(f"\n{source.upper()}:")
    source_data = enriched_result.get(source, {})
    if source_data:
        # Check for references, citations, or related works
        ref_fields = ['references', 'citations', 'cited_by', 'referenced_works', 'related_works']
        found_refs = {}
        for field in ref_fields:
            if field in source_data and source_data[field]:
                value = source_data[field]
                if isinstance(value, list):
                    found_refs[field] = f"{len(value)} items"
                    # Show first few items if it's related_works
                    if field == 'related_works' and len(value) > 0:
                        print(f"  {field}: {len(value)} items")
                        print("  Sample related works:")
                        for i, work in enumerate(value[:3]):
                            if isinstance(work, dict):
                                work_id = work.get('id', work.get('openalex_id', f'Work {i+1}'))
                                print(f"    - {work_id}")
                            else:
                                print(f"    - {work}")
                        if len(value) > 3:
                            print(f"    ... and {len(value) - 3} more")
                        found_refs.pop(field)  # Remove from general display since we showed details
                else:
                    found_refs[field] = str(value)[:100] + "..." if len(str(value)) > 100 else str(value)

        if found_refs:
            for field, value in found_refs.items():
                print(f"  {field}: {value}")
        elif not any(field in source_data and source_data[field] for field in ['related_works']):  # Don't show "no data" if we already showed related_works
            print("  No references/citations data found")
    else:
        print("  No data available")

print("\n" + "=" * 80)
print("The enrichment provides citation data that can be used to:")
print("- Analyze the paper's impact and influence")
print("- Build citation networks")
print("- Identify related research")
print("- Perform bibliometric analysis")
print("- Discover related works for further reading")
print("=" * 80)


CITATION INFORMATION AND REFERENCES
Citation metrics:
- Total citations: 3
- Influential citations: 0

Checking for references/citations in individual services:

CROSSREF:
  No references/citations data found

SEMANTIC_SCHOLAR:
  No references/citations data found

OPENALEX:
  related_works: 10 items
  Sample related works:
    - https://openalex.org/W4297998612
    - https://openalex.org/W1604849300
    - https://openalex.org/W4231328776
    ... and 7 more

The enrichment provides citation data that can be used to:
- Analyze the paper's impact and influence
- Build citation networks
- Identify related research
- Perform bibliometric analysis
- Discover related works for further reading
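
The related works above are returned as full OpenAlex URLs. For deduplication or later lookups it is often handier to work with the short work IDs. The sketch below extracts them; `openalex_short_id` is a hypothetical helper that accepts either a URL string or a dictionary with an `id` key, mirroring the two shapes handled in the cell above.

```python
def openalex_short_id(work):
    """Return the short OpenAlex ID from a URL string or a dict with an 'id' key."""
    raw = work.get("id", "") if isinstance(work, dict) else str(work)
    # The short ID is the last path segment of the URL.
    return raw.rstrip("/").rsplit("/", 1)[-1]


related = [
    "https://openalex.org/W4297998612",
    "https://openalex.org/W1604849300",
    {"id": "https://openalex.org/W4231328776"},
]
short_ids = [openalex_short_id(w) for w in related]
print(", ".join(short_ids))
```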


In [None]:
# Detailed author information analysis
print("\n" + "=" * 80)
print("DETAILED AUTHOR INFORMATION")
print("=" * 80)

print("Checking what author information is available from each service:")
print()

for source in enriched_result["sources"]:
    print(f"\n🔍 {source.upper()} - AUTHOR DETAILS:")
    print("=" * 60)
    source_data = enriched_result.get(source, {})

    if source_data and "authors" in source_data:
        authors = source_data["authors"]
        if isinstance(authors, list) and authors:
            print(f"Number of authors: {len(authors)}")
            print()

            # Show detailed info for first 2 authors
            for i, author in enumerate(authors[:2]):
                print(f"Author {i+1}:")
                if isinstance(author, dict):
                    # Print all available fields for this author
                    for key, value in author.items():
                        if value:  # Only show non-empty values
                            if isinstance(value, list):
                                value = ", ".join(str(v) for v in value if v)
                            print(f"  {key}: {value}")
                else:
                    print(f"  Name: {author}")
                print()

            if len(authors) > 2:
                print(f"... and {len(authors) - 2} more authors")
        else:
            print("Authors data available but not in expected format")
    else:
        print("No author information available from this service")

    print("\n" + "-" * 80)

print("\n" + "=" * 80)
print("AUTHOR INFORMATION SUMMARY:")
print("=" * 80)
print("The enrichment module provides detailed author information including:")
print("- Full names and display names")
print("- Author affiliations and institutions")
print("- Author IDs (ORCID, OpenAlex, etc.)")
print("- Author positions and roles")
print("- Institutional information")
print("=" * 80)


DETAILED AUTHOR INFORMATION
Checking what author information is available from each service:


🔍 CROSSREF - AUTHOR DETAILS:
Number of authors: 11

Author 1:
  Name: Liyong Kou

Author 2:
  Name: Jing Sun

... and 9 more authors

--------------------------------------------------------------------------------

🔍 SEMANTIC_SCHOLAR - AUTHOR DETAILS:
Number of authors: 11

Author 1:
  name: Liyong Kou
  author_id: 2315086174
  url: https://www.semanticscholar.org/author/2315086174

Author 2:
  name: Jing Sun
  author_id: 2315555663
  url: https://www.semanticscholar.org/author/2315555663

... and 9 more authors

--------------------------------------------------------------------------------

🔍 OPENALEX - AUTHOR DETAILS:
Number of authors: 11

Author 1:
  id: https://openalex.org/A5113322668
  display_name: Liyong Kou
  institutions: {'id': 'https://openalex.org/I111599522', 'display_name': 'Jiangnan University', 'country': 'China', 'type': 'education', 'ror_id': 'https://ror.org/04mkzax54

In [None]:
# Enhanced Semantic Scholar Author Information
print("\n" + "=" * 80)
print("ENHANCED SEMANTIC SCHOLAR AUTHOR INFORMATION")
print("=" * 80)

semantic_scholar_data = enriched_result.get("semantic_scholar", {})
if semantic_scholar_data and "authors" in semantic_scholar_data:
    authors = semantic_scholar_data["authors"]
    if isinstance(authors, list) and authors:
        print(f"Semantic Scholar provides author information for {len(authors)} authors:")
        print("Each author includes a direct link to their Semantic Scholar profile.")
        print()

        # Show enhanced info for first 3 authors
        for i, author in enumerate(authors[:3]):
            print(f"Author {i+1}:")
            if isinstance(author, dict):
                print(f"  Name: {author.get('name', 'Unknown')}")
                print(f"  Author ID: {author.get('author_id', 'Not available')}")
                if author.get('url'):
                    print(f"  Profile URL: {author['url']}")
                    print("    (Click to view author's Semantic Scholar profile)")
                if author.get('affiliations'):
                    print(f"  Affiliations: {', '.join(author['affiliations'])}")
                if author.get('homepage'):
                    print(f"  Homepage: {author['homepage']}")
            else:
                print(f"  Name: {author}")
            print()

        if len(authors) > 3:
            print(f"... and {len(authors) - 3} more authors with profile links")

        print("\n" + "-" * 60)
        print("Semantic Scholar Author Features:")
        print("• Direct links to author profiles on Semantic Scholar")
        print("• Author IDs for programmatic access")
        print("• Affiliation information (when available)")
        print("• Homepage links (when available)")
    else:
        print("No detailed author information available from Semantic Scholar")
else:
    print("Semantic Scholar data not available")

print("\n" + "=" * 80)



ENHANCED SEMANTIC SCHOLAR AUTHOR INFORMATION
Semantic Scholar provides author information for 11 authors:
Each author includes a direct link to their Semantic Scholar profile.

Author 1:
  Name: Liyong Kou
  Author ID: 2315086174
  Profile URL: https://www.semanticscholar.org/author/2315086174
    (Click to view author's Semantic Scholar profile)

Author 2:
  Name: Jing Sun
  Author ID: 2315555663
  Profile URL: https://www.semanticscholar.org/author/2315555663
    (Click to view author's Semantic Scholar profile)

Author 3:
  Name: Ping Wu
  Author ID: 2315102821
  Profile URL: https://www.semanticscholar.org/author/2315102821
    (Click to view author's Semantic Scholar profile)

... and 8 more authors with profile links

------------------------------------------------------------
Semantic Scholar Author Features:
• Direct links to author profiles on Semantic Scholar
• Author IDs for programmatic access
• Affiliation information (when available)
• Homepage links (when available)



In [None]:
# Examine merged author data for affiliations and IDs
print("\n" + "=" * 80)
print("MERGED AUTHOR DATA ANALYSIS")
print("=" * 80)

merged_authors = merged.get('authors', [])
if merged_authors:
    print(f"Total authors in merged data: {len(merged_authors)}")
    print()

    for i, author in enumerate(merged_authors[:3]):  # Show first 3 authors
        print(f"Author {i+1}:")
        if isinstance(author, dict):
            # Check for common author fields
            name = author.get('name') or author.get('display_name') or author.get('full_name')
            if name:
                print(f"  Name: {name}")

            # Affiliation information
            affiliation = author.get('affiliation') or author.get('affiliations') or author.get('institution')
            if affiliation:
                if isinstance(affiliation, list):
                    print(f"  Affiliations: {', '.join(affiliation)}")
                else:
                    print(f"  Affiliation: {affiliation}")

            # Author IDs
            orcid = author.get('orcid') or author.get('orcid_id')
            if orcid:
                print(f"  ORCID: {orcid}")

            openalex_id = author.get('openalex_id') or author.get('id')
            if openalex_id:
                print(f"  OpenAlex ID: {openalex_id}")

            # Other potential fields
            for key in ['position', 'role', 'email', 'department', 'country']:
                value = author.get(key)
                if value:
                    print(f"  {key.title()}: {value}")

            # Show all available fields if they have additional info
            all_fields = list(author.keys())
            if len(all_fields) > 2:  # More than just basic name field
                print(f"  All available fields: {', '.join(all_fields)}")
        else:
            print(f"  Name: {author}")
        print()

    if len(merged_authors) > 3:
        print(f"... and {len(merged_authors) - 3} more authors")
else:
    print("No author data available in merged results")

print("\n" + "=" * 80)


MERGED AUTHOR DATA ANALYSIS
Total authors in merged data: 11

Author 1:
  Name: Liyong Kou
  OpenAlex ID: https://openalex.org/A5113322668
  Position: first
  All available fields: name, orcid, openalex_id, institutions, position, sources

Author 2:
  Name: Jing Ping Sun
  ORCID: https://orcid.org/0000-0003-2110-5189
  OpenAlex ID: https://openalex.org/A5058673686
  Position: middle
  All available fields: name, orcid, openalex_id, institutions, position, sources

Author 3:
  Name: Ping Wu
  OpenAlex ID: https://openalex.org/A5111281865
  Position: middle
  All available fields: name, orcid, openalex_id, institutions, position, sources

... and 8 more authors



In [None]:
# Show detailed author information from OpenAlex (which has the most comprehensive data)
print("\n" + "=" * 80)
print("COMPREHENSIVE AUTHOR INFORMATION FROM OPENALEX")
print("=" * 80)

openalex_data = enriched_result.get("openalex", {})
if openalex_data and "authors" in openalex_data:
    authors = openalex_data["authors"]
    if isinstance(authors, list) and authors:
        print(f"OpenAlex provides detailed information for {len(authors)} authors:")
        print()

        # Show detailed info for first author
        author = authors[0]
        if isinstance(author, dict):
            print("First author details:")
            print(f"  OpenAlex ID: {author.get('id', 'Not available')}")
            print(f"  Display Name: {author.get('display_name', 'Not available')}")
            print(f"  ORCID: {author.get('orcid', 'Not available')}")

            # Institutions/affiliations
            institutions = author.get('institutions', [])
            if institutions:
                print(f"  Institutions ({len(institutions)}):")
                for inst in institutions:
                    if isinstance(inst, dict):
                        inst_name = inst.get('display_name', inst.get('name', 'Unknown'))
                        inst_id = inst.get('id', '')
                        country = inst.get('country_code', '')
                        info = f"    - {inst_name}"
                        if inst_id:
                            info += f" (ID: {inst_id})"
                        if country:
                            info += f" ({country})"
                        print(info)
                    else:
                        print(f"    - {inst}")

            # Author position
            position = author.get('position', '')
            if position:
                print(f"  Position: {position}")

            print(f"\n  All available fields: {', '.join(author.keys())}")

        print("\n" + "-" * 60)
        print("Summary of author information available from each service:")
        print("- CrossRef: Author names only")
        print("- Semantic Scholar: Author names + internal author IDs")
        print("- OpenAlex: Comprehensive - names, ORCID, institutional affiliations, positions, OpenAlex IDs")
    else:
        print("No detailed author information available")
else:
    print("OpenAlex data not available")

print("\n" + "=" * 80)
print("AUTHOR ENRICHMENT CAPABILITIES:")
print("=" * 80)
print("The enrichment module provides varying levels of author detail:")
print("• Basic merged data: Author names only")
print("• CrossRef: Author names as strings")
print("• Semantic Scholar: Author names + author IDs")
print("• OpenAlex: Full profile - ORCID, affiliations, institutions, positions")
print("=" * 80)


COMPREHENSIVE AUTHOR INFORMATION FROM OPENALEX
OpenAlex provides detailed information for 11 authors:

First author details:
  OpenAlex ID: https://openalex.org/A5113322668
  Display Name: Liyong Kou
  ORCID: None
  Institutions (2):
    - Jiangnan University (ID: https://openalex.org/I111599522) (CN)
    - Wuxi Fourth People's Hospital (ID: https://openalex.org/I4210114781) (CN)
  Position: first

  All available fields: id, display_name, orcid, institutions, position

------------------------------------------------------------
Summary of author information available from each service:
- CrossRef: Author names only
- Semantic Scholar: Author names + internal author IDs
- OpenAlex: Comprehensive - names, ORCID, institutional affiliations, positions, OpenAlex IDs

AUTHOR ENRICHMENT CAPABILITIES:
The enrichment module provides varying levels of author detail:
• Basic merged data: Author names only
• CrossRef: Author names as strings
• Semantic Scholar: Author names + author IDs
• OpenA

## Enhanced DataCite Integration

DataCite provides comprehensive metadata for datasets, software, and research outputs. The enhanced DataCite client now extracts extensive metadata including usage statistics, relationships, contributors, and temporal information.

In [None]:
# Demonstrate enhanced DataCite capabilities with a DOI that has rich DataCite metadata
DATACITE_DOI = "10.14454/FXWS-0523"  # DataCite Metadata Schema documentation

print("Enhanced DataCite Integration Demo")
print("=" * 50)
print(f"DOI: {DATACITE_DOI}")
print("-" * 50)

from pyeuropepmc.enrichment import DataCiteClient

with DataCiteClient() as datacite_client:
    print("Fetching comprehensive DataCite metadata...")
    datacite_result = datacite_client.enrich(identifier=DATACITE_DOI)

    if datacite_result:
        print("✅ DataCite enrichment successful!")
        print(f"Source: {datacite_result.get('source')}")
        print(f"DOI: {datacite_result.get('doi')}")
        print(f"Title: {datacite_result.get('title', 'N/A')[:100]}...")
        print(f"Publisher: {datacite_result.get('publisher', 'N/A')}")
        print(f"Publication Year: {datacite_result.get('publication_year', 'N/A')}")
        print(f"Resource Type: {datacite_result.get('resource_type_general', 'N/A')}")

        # Usage statistics
        print(f"\n📊 Usage Statistics:")
        print(f"  Views: {datacite_result.get('view_count', 0)}")
        print(f"  Downloads: {datacite_result.get('download_count', 0)}")
        print(f"  Citations: {datacite_result.get('citation_count', 0)}")
        print(f"  References: {datacite_result.get('reference_count', 0)}")

        # Authors and contributors
        creators = datacite_result.get('creators', [])
        contributors = datacite_result.get('contributors', [])
        print(f"\n👥 Author Information:")
        print(f"  Creators: {len(creators)}")
        print(f"  Contributors: {len(contributors)}")

        if creators:
            print("  Sample creators:")
            for i, creator in enumerate(creators[:2]):
                name = creator.get('name', 'Unknown')
                orcid = creator.get('orcid', '')
                orcid_info = f" (ORCID: {orcid})" if orcid else ""
                print(f"    {i+1}. {name}{orcid_info}")

        # Relationships
        relationships = datacite_result.get('relationships', {})
        print(f"\n🔗 Relationships:")
        for rel_type, items in relationships.items():
            if items:
                print(f"  {rel_type}: {len(items)} items")

        # State and metadata info
        print(f"\n📋 Metadata Info:")
        print(f"  State: {datacite_result.get('state', 'N/A')}")
        print(f"  Created: {datacite_result.get('created', 'N/A')}")
        print(f"  Updated: {datacite_result.get('updated', 'N/A')}")
        print(f"  Schema Version: {datacite_result.get('schema_version', 'N/A')}")

        # Show XML availability
        has_xml = bool(datacite_result.get('xml'))
        print(f"  XML Metadata: {'Available' if has_xml else 'Not available'}")

    else:
        print("❌ DataCite enrichment failed")

print("\n" + "=" * 50)
print("ENHANCED DATACITE FEATURES:")
print("=" * 50)
print("The improved DataCite client now provides:")
print("• Comprehensive metadata extraction from DataCite REST API")
print("• Usage statistics (views, downloads, citations)")
print("• Author and contributor information with ORCID IDs")
print("• Relationship data (citations, references, versions)")
print("• Temporal information (creation, update dates)")
print("• Resource type classifications")
print("• Rights and licensing information")
print("• Geographic and funding references")
print("• Base64-encoded XML metadata for advanced use")
print("• Client and provider relationship information")
print("=" * 50)

Enhanced DataCite Integration Demo
DOI: 10.14454/FXWS-0523
--------------------------------------------------
Fetching comprehensive DataCite metadata...
✅ DataCite enrichment successful!
Source: datacite
DOI: 10.14454/fxws-0523
Title: DataCite Metadata Schema for the Publication and Citation of Research Data and Other Research Output...
Publisher: DataCite
Publication Year: 2021
Resource Type: Text

📊 Usage Statistics:
  Views: 0
  Downloads: 0
  Citations: 1
  References: 0

👥 Author Information:
  Creators: 1
  Contributors: 14
  Sample creators:
    1. DataCite Metadata Working Group

🔗 Relationships:
  client: 2 items
  provider: 2 items
  citations: 1 items

📋 Metadata Info:
  State: findable
  Created: 2021-03-30T10:19:45.000Z
  Updated: 2025-06-09T21:30:41.000Z
  Schema Version: http://datacite.org/schema/kernel-4
  XML Metadata: Available

ENHANCED DATACITE FEATURES:
The improved DataCite client now provides:
• Comprehensive metadata extraction from DataCite REST API
• Usage s

## PMCID Support

The enricher accepts PMCIDs as well as DOIs. Given a PMCID, it searches Europe PMC for the corresponding DOI and then runs enrichment against that DOI.
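
Before an identifier can be resolved, it has to be recognized as a PMCID or a DOI. The sketch below shows one way that check could look. `classify_identifier` and both regular expressions are hypothetical and written for illustration only; they are not taken from pyEuropePMC, whose internal detection logic may differ.

```python
import re

# Hypothetical helper for illustration; NOT part of pyEuropePMC's API.
_PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_identifier(identifier: str) -> str:
    """Return 'pmcid', 'doi', or 'unknown' for a raw identifier string."""
    value = identifier.strip()
    if _PMCID_RE.match(value):
        return "pmcid"
    if _DOI_RE.match(value):
        return "doi"
    return "unknown"


for raw in ["PMC3258128", "10.1093/nar/gkr715", "not-an-id"]:
    print(f"{raw!r} -> {classify_identifier(raw)}")
```

Only identifiers classified as `pmcid` would need the extra Europe PMC lookup; DOIs can go straight to the enrichment APIs.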

In [None]:
# Example PMCID for demonstration
PMCID = "PMC3258128"

print("Enriching using PMCID:")
print(f"PMCID: {PMCID}")
print("-" * 50)

with PaperEnricher(config) as enricher:
    pmcid_result = enricher.enrich_paper(identifier=PMCID)

    print(f"Resolved DOI: {pmcid_result.get('doi')}")
    print(f"Data sources used: {', '.join(pmcid_result['sources'])}")

    merged = pmcid_result.get("merged", {})
    if merged.get("title"):
        print(f"Title: {merged['title'][:80]}...")
    if merged.get("citation_count"):
        print(f"Citation Count: {merged['citation_count']}")

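# Report success only if the PMCID actually resolved to a DOI
if pmcid_result.get("doi"):
    print("\n✅ PMCID enrichment successful!")
else:
    print("\n❌ PMCID could not be resolved to a DOI")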

Enriching using PMCID:
PMCID: PMC3258128
--------------------------------------------------
Resolved DOI: 10.1093/nar/gkr715
Data sources used: crossref, semantic_scholar, openalex
Title: Hepato-specific microRNA-122 facilitates accumulation of newly synthesized miRNA...
Citation Count: 38

✅ PMCID enrichment successful!


## Batch Processing

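The enricher can process several identifiers (DOIs and PMCIDs, mixed freely) in one call, and can also enrich from existing metadata files. A failure for one identifier is reported in that identifier's result and does not abort the rest of the batch.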

In [None]:
# Batch processing example with mixed identifiers
batch_identifiers = [
    "10.1371/journal.pone.0308090",  # DOI
    "PMC3258128",                    # PMCID
    "10.1038/s41586-020-2649-2",     # Another DOI
]

print("Batch processing multiple identifiers (DOIs and PMCIDs):")
print(f"Processing {len(batch_identifiers)} identifiers")
print("-" * 60)

with PaperEnricher(config) as enricher:
    batch_results = enricher.enrich_papers_batch(batch_identifiers)

    for identifier, result in batch_results.items():
        if "error" in result:
            print(f"❌ {identifier}: {result['error']}")
        else:
            sources = result.get('sources', [])
            doi = result.get('doi', 'N/A')
            citation_count = result.get('merged', {}).get('citation_count', 'N/A')
            author_count = len(result.get('merged', {}).get('authors', []))
            print(f"✅ {identifier} → DOI: {doi}, {len(sources)} sources, {citation_count} citations, {author_count} authors")

print("\n" + "=" * 70)
print("BATCH PROCESSING SUMMARY")
print("=" * 70)
print("The enricher now supports:")
print("• enrich_papers_batch(): Process multiple DOIs or PMCIDs at once")
print("• enrich_from_metadata_files(): Enrich from existing JSON files")
print("• Automatic identifier resolution (PMCID → DOI)")
print("• Automatic error handling for failed enrichments")
print("• Progress tracking and result aggregation")

Batch processing multiple identifiers (DOIs and PMCIDs):
Processing 3 identifiers
------------------------------------------------------------


Failed to enrich 10.1038/s41586-020-2649-2: 'NoneType' object is not iterable


✅ 10.1371/journal.pone.0308090 → DOI: 10.1371/journal.pone.0308090, 3 sources, 3 citations, 11 authors
✅ PMC3258128 → DOI: 10.1093/nar/gkr715, 3 sources, 38 citations, 12 authors
❌ 10.1038/s41586-020-2649-2: 'NoneType' object is not iterable

BATCH PROCESSING SUMMARY
The enricher now supports:
• enrich_papers_batch(): Process multiple DOIs or PMCIDs at once
• enrich_from_metadata_files(): Enrich from existing JSON files
• Automatic identifier resolution (PMCID → DOI)
• Automatic error handling for failed enrichments
• Progress tracking and result aggregation
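
Because failed identifiers come back as entries with an `"error"` key, it is often convenient to separate them from the successful ones before further processing. The sketch below assumes the result shape used in the loop above (a dict keyed by identifier). `split_batch_results` is a hypothetical helper, and the sample data is an abbreviated stand-in for real results.

```python
from typing import Any, Dict, Tuple


def split_batch_results(
    results: Dict[str, Dict[str, Any]],
) -> Tuple[Dict[str, Dict[str, Any]], Dict[str, str]]:
    """Split a batch result dict into (succeeded, failed) by the 'error' key."""
    succeeded: Dict[str, Dict[str, Any]] = {}
    failed: Dict[str, str] = {}
    for identifier, result in results.items():
        if "error" in result:
            failed[identifier] = str(result["error"])
        else:
            succeeded[identifier] = result
    return succeeded, failed


# Abbreviated sample shaped like the batch output above (not a real API response).
sample_results = {
    "10.1371/journal.pone.0308090": {"doi": "10.1371/journal.pone.0308090"},
    "PMC3258128": {"doi": "10.1093/nar/gkr715"},
    "10.1038/s41586-020-2649-2": {"error": "'NoneType' object is not iterable"},
}

succeeded, failed = split_batch_results(sample_results)
print(f"Succeeded: {len(succeeded)}, failed: {len(failed)}")
for identifier, message in failed.items():
    print(f"  retry candidate: {identifier} ({message})")
```

The `failed` mapping can then be logged or passed to a second, slower pass.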


In [None]:
# Example ROR ID for Harvard University
ROR_ID = "03vek6s52"  # Harvard University

print("ROR Institutional Enrichment Demo")
print("=" * 50)
print(f"ROR ID: {ROR_ID}")
print("-" * 30)

from pyeuropepmc.enrichment import RorClient

with RorClient() as ror_client:
    ror_data = ror_client.enrich(ROR_ID)

    if ror_data:
        print("✅ ROR data retrieved successfully!")
        print(f"Institution: {ror_data.get('display_name', 'Unknown')}")
        print(f"Status: {ror_data.get('status', 'Unknown')}")
        print(f"Types: {', '.join(ror_data.get('types', []))}")
        print(f"Country: {ror_data.get('country', 'Unknown')}")
        print(f"Established: {ror_data.get('established', 'Unknown')}")

        # External IDs
        ext_ids = ror_data.get('external_ids', [])
        if ext_ids:
            print(f"External IDs: {len(ext_ids)} found")
            for ext_id in ext_ids[:3]:  # Show first 3
                id_type = ext_id.get('type')
                preferred = ext_id.get('preferred')
                if preferred:
                    print(f"  {id_type}: {preferred}")

        # Links
        links = ror_data.get('links', [])
        if links:
            for link in links:
                if link.get('type') == 'website':
                    print(f"Website: {link.get('value')}")
                    break
    else:
        print("❌ Failed to retrieve ROR data")

print("\n" + "=" * 50)
print("ROR INTEGRATION FEATURES:")
print("=" * 50)
print("• Standalone ROR client for institutional lookups")
print("• Automatic ROR enrichment in OpenAlex author affiliations")
print("• Comprehensive institutional metadata including:")
print("  - Names, locations, and establishment dates")
print("  - External identifiers (GRID, ISNI, Wikidata, etc.)")
print("  - Relationships with other organizations")
print("  - Website and contact information")
print("=" * 50)

ROR Institutional Enrichment Demo
ROR ID: 03vek6s52
------------------------------
✅ ROR data retrieved successfully!
Institution: Harvard University
Status: active
Types: education, funder
Country: United States
Established: 1636
External IDs: 4 found
  fundref: 100007229
  grid: grid.38142.3c
Website: https://www.harvard.edu

ROR INTEGRATION FEATURES:
• Standalone ROR client for institutional lookups
• Automatic ROR enrichment in OpenAlex author affiliations
• Comprehensive institutional metadata including:
  - Names, locations, and establishment dates
  - External identifiers (GRID, ISNI, Wikidata, etc.)
  - Relationships with other organizations
  - Website and contact information

In [None]:
# Configure enrichment with OpenAlex and ROR enabled
config_with_ror = EnrichmentConfig(
    enable_crossref=False,
    enable_datacite=False,
    enable_unpaywall=False,
    enable_semantic_scholar=False,
    enable_openalex=True,
    enable_ror=True,  # Enable ROR for institutional enrichment
    openalex_email="your@email.com",  # Replace with your email
)

print("Enhanced OpenAlex with ROR Integration")
print("=" * 50)

with PaperEnricher(config_with_ror) as enricher:
    result_with_ror = enricher.enrich_paper(identifier=DOI)

    print(f"Data sources used: {', '.join(result_with_ror['sources'])}")

    merged = result_with_ror.get("merged", {})
    authors = merged.get("authors", [])

    if authors:
        print(f"\nShowing institutional data for first author:")
        first_author = authors[0]
        institutions = first_author.get("institutions", [])

        if institutions:
            for i, inst in enumerate(institutions[:2]):  # Show first 2 institutions
                print(f"Institution {i+1}:")
                print(f"  Name: {inst.get('display_name', 'Unknown')}")
                print(f"  Country: {inst.get('country', 'Unknown')}")
                print(f"  Type: {inst.get('type', 'Unknown')}")

                # ROR-enhanced fields
                if inst.get('ror_id'):
                    print(f"  ROR ID: {inst['ror_id']}")
                if inst.get('website'):
                    print(f"  Website: {inst['website']}")
                if inst.get('fundref_id'):
                    print(f"  FundRef ID: {inst['fundref_id']}")
                if inst.get('grid_id'):
                    print(f"  GRID ID: {inst['grid_id']}")

                print()
        else:
            print("No institutional data available")
    else:
        print("No author data available")

print("\n" + "=" * 50)
print("ROR-ENHANCED INSTITUTIONAL DATA:")
print("=" * 50)
print("When ROR is enabled with OpenAlex, you get:")
print("• Full institutional names and acronyms")
print("• Geographic locations (country, city, coordinates)")
print("• External identifiers (GRID, ISNI, Wikidata, FundRef)")
print("• Website and contact information")
print("• Organizational relationships and hierarchy")
print("• Enhanced validation to prevent invalid ROR ID lookups")
print("=" * 50)

Enhanced OpenAlex with ROR Integration


No data from ror


Data sources used: openalex

Showing institutional data for first author:
Institution 1:
  Name: Jiangnan University
  Country: China
  Type: education
  ROR ID: https://ror.org/04mkzax54

Institution 2:
  Name: Wuxi Fourth People's Hospital
  Country: China
  Type: healthcare
  ROR ID: https://ror.org/02ar02c28


ROR-ENHANCED INSTITUTIONAL DATA:
When ROR is enabled with OpenAlex, you get:
• Full institutional names and acronyms
• Geographic locations (country, city, coordinates)
• External identifiers (GRID, ISNI, Wikidata, FundRef)
• Website and contact information
• Organizational relationships and hierarchy
• Enhanced validation to prevent invalid ROR ID lookups
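
ROR IDs show up in two forms in this notebook: the bare ID passed to `RorClient` (`03vek6s52`) and the full URL returned inside OpenAlex institutions (`https://ror.org/04mkzax54`). The sketch below shows one way to normalize both forms and reject malformed values before a lookup. `normalize_ror_id` is a hypothetical helper, not the library's own validation. It checks only the documented ID shape (a leading `0`, six Crockford base32 characters, two digits) and does not verify the checksum.

```python
import re
from typing import Optional

# Shape of a ROR ID: "0" + 6 Crockford base32 characters + 2 checksum digits.
_ROR_ID_RE = re.compile(r"^0[a-hj-km-np-tv-z0-9]{6}[0-9]{2}$")


def normalize_ror_id(value: str) -> Optional[str]:
    """Return the bare ROR ID, or None if the value is not ROR-shaped."""
    candidate = value.strip().lower()
    for prefix in ("https://ror.org/", "http://ror.org/", "ror.org/"):
        if candidate.startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    return candidate if _ROR_ID_RE.match(candidate) else None


for raw in ["03vek6s52", "https://ror.org/04mkzax54", "not-a-ror-id"]:
    print(f"{raw!r} -> {normalize_ror_id(raw)!r}")
```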


## Caching Control

All enrichment APIs support caching control through the `use_cache` parameter. This allows you to control whether results are retrieved from cache or fetched fresh from the APIs.
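
To make the `use_cache` semantics concrete, here is a toy in-memory model of the behavior. `CachedFetcher` and `fake_api` are hypothetical and exist only for this illustration. The real clients use a persistent cache configured through `CacheConfig`, and whether a forced fetch also refreshes the stored entry is an assumption of this sketch.

```python
from typing import Any, Callable, Dict


class CachedFetcher:
    """Toy model of a use_cache flag (illustration only, not pyEuropePMC code)."""

    def __init__(self, fetch: Callable[[str], Dict[str, Any]]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, Dict[str, Any]] = {}
        self.fetch_calls = 0  # counts how often the "API" was really called

    def enrich(self, identifier: str, use_cache: bool = True) -> Dict[str, Any]:
        if use_cache and identifier in self._cache:
            return self._cache[identifier]
        self.fetch_calls += 1
        result = self._fetch(identifier)
        # Assumption: a forced fetch also refreshes the cached entry.
        self._cache[identifier] = result
        return result


def fake_api(identifier: str) -> Dict[str, Any]:
    """Stand-in for a network call."""
    return {"doi": identifier, "citation_count": 3}


fetcher = CachedFetcher(fake_api)
doi = "10.1371/journal.pone.0308090"
first = fetcher.enrich(doi)                   # cache miss: calls fake_api
second = fetcher.enrich(doi)                  # cache hit: no call
third = fetcher.enrich(doi, use_cache=False)  # bypass: calls fake_api again
print(f"API calls made: {fetcher.fetch_calls}")
```

Three requests result in two underlying calls, which is the pattern the cell below exercises against the real Semantic Scholar client.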

In [None]:
# Demonstrate caching functionality
from pyeuropepmc.enrichment import SemanticScholarClient
from pyeuropepmc.cache.cache import CacheConfig
import time
import tempfile

# Create a temporary cache directory for demonstration
cache_dir = tempfile.mkdtemp()
cache_config = CacheConfig(enabled=True, cache_dir=cache_dir)

print("Caching Control Demonstration")
print("=" * 50)
print(f"Cache directory: {cache_dir}")
print()

# Create Semantic Scholar client with caching enabled
with SemanticScholarClient(cache_config=cache_config) as ss_client:
    DOI = "10.1371/journal.pone.0308090"

    print("1. First request (will fetch from API and cache):")
    start_time = time.time()
    result1 = ss_client.enrich(identifier=DOI, use_cache=True)
    first_request_time = time.time() - start_time
    print(f"   Citation count: {result1.get('citation_count', 'N/A') if result1 else 'N/A'}")

    print("\n2. Second request (will use cached result):")
    start_time = time.time()
    result2 = ss_client.enrich(identifier=DOI, use_cache=True)
    second_request_time = time.time() - start_time
    print(f"   Results identical: {result1 == result2}")

    print("\n3. Third request (force fresh fetch, bypassing cache):")
    start_time = time.time()
    result3 = ss_client.enrich(identifier=DOI, use_cache=False)
    third_request_time = time.time() - start_time

    print("\n4. Batch processing with caching:")
    batch_dois = [
        "10.1371/journal.pone.0308090",
        "10.1038/s41586-020-2649-2"
    ]

    start_time = time.time()
    batch_result = ss_client.enrich_batch(identifiers=batch_dois, use_cache=True)
    batch_time = time.time() - start_time
    print(f"   Processed {len(batch_result)} papers")

    print("\n5. Author enrichment with caching:")
    # Get an author ID from the paper data
    if result1 and result1.get('authors'):
        first_author = result1['authors'][0]
        author_id = first_author.get('author_id')
        if author_id:
            start_time = time.time()
            author_result = ss_client.enrich_author(author_id=author_id, use_cache=True)
            author_time = time.time() - start_time
            print(f"   Time: {author_time:.2f} seconds")
            if author_result:
                print(f"   Author name: {author_result.get('name', 'N/A')}")
                print(f"   Paper count: {author_result.get('paper_count', 'N/A')}")

    print("\n6. Search with caching:")
    start_time = time.time()
    search_result = ss_client.search_papers(query="CRISPR gene editing", limit=5, use_cache=True)
    search_time = time.time() - start_time
    print(f"   Found {len(search_result)} papers")

print("\n" + "=" * 50)
print("CACHING FEATURES:")
print("=" * 50)
print("• Automatic caching of API responses")
print("• Configurable cache directory and size limits")
print("• use_cache parameter for all methods:")
print("  - enrich(): Control caching for individual papers")
print("  - enrich_batch(): Control caching for batch operations")
print("  - enrich_author(): Control caching for author data")
print("  - search_papers(): Control caching for search results")
print("• Cache persistence across sessions")
print("• Significant performance improvements for repeated requests")
print("=" * 50)

# Clean up temporary cache
import shutil
shutil.rmtree(cache_dir)
print(f"\nCleaned up temporary cache directory: {cache_dir}")

Caching Control Demonstration
Cache directory: /tmp/tmpj3eqm71_

1. First request (will fetch from API and cache):
   Citation count: 3

2. Second request (will use cached result):
   Results identical: True

3. Third request (force fresh fetch, bypassing cache):

4. Batch processing with caching:
   Processed 2 papers

5. Author enrichment with caching:
.2f
   Author name: Liyong Kou
   Paper count: 2

6. Search with caching:


HTTP error for https://api.semanticscholar.org/graph/v1/paper/search: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=CRISPR+gene+editing&limit=5&fields=title%2Cabstract%2Cvenue%2Cyear%2CcitationCount%2CinfluentialCitationCount%2Cauthors%2CfieldsOfStudy%2CexternalIds%2CpaperId


APIClientError: [GENERIC002] HTTP error: 429 Client Error:  for url: https://api.semanticscholar.org/graph/v1/paper/search?query=CRISPR+gene+editing&limit=5&fields=title%2Cabstract%2Cvenue%2Cyear%2CcitationCount%2CinfluentialCitationCount%2Cauthors%2CfieldsOfStudy%2CexternalIds%2CpaperId
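
The search step above ended with an HTTP 429 (rate limit) response from Semantic Scholar. One common way to cope with this is to retry with exponential backoff. The following is a minimal sketch only: `call_with_backoff` and `flaky_search` are hypothetical names for illustration and are not part of pyEuropePMC, and the demo uses a stand-in function rather than a real API call.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry.

    Hypothetical helper for illustration; not part of pyEuropePMC.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # narrow this to the client's error type in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


# Stand-in for a rate-limited call: fails twice, then succeeds.
calls = {"count": 0}
delays = []


def flaky_search():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429 Too Many Requests")
    return ["paper-1", "paper-2"]


# Record the delays instead of sleeping so the demo finishes instantly.
found = call_with_backoff(flaky_search, sleep=delays.append)
print(found)
print(delays)
```

In this notebook the same helper could wrap the failing call, for example `call_with_backoff(lambda: ss_client.search_papers(query="CRISPR gene editing", limit=5, use_cache=True))`. Supplying a Semantic Scholar API key in the configuration may also reduce how often the limit is hit.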

## External ID Flattening and Conflict Detection

The enrichment system flattens external identifiers from multiple sources into standardized fields and flags a conflict whenever different APIs report different values for the same ID type. This keeps the merged record consistent and makes potential data quality issues visible.
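
To make the idea concrete, here is a minimal sketch of flattening and conflict detection in plain Python. It is not the library's implementation: `flatten_external_ids`, the "first value seen wins" rule, the case-insensitive comparison, and the sample identifiers are all assumptions for illustration. Only the shape of a conflict entry (`source`, `value`, `conflicts_with`) mirrors what the demo cell below reads from the merged result.

```python
def flatten_external_ids(ids_by_source):
    """Merge per-source ID dicts; the first value seen for a field wins.

    Returns (merged, conflicts). Hypothetical sketch, not the library's code.
    """
    merged = {}
    conflicts = {}
    for source, ids in ids_by_source.items():
        for field, value in ids.items():
            if value is None:
                continue  # a missing value is not a conflict
            value = str(value).strip()
            if field not in merged:
                merged[field] = value
            elif merged[field].lower() != value.lower():
                # Same ID type, different value: record it instead of overwriting
                conflicts.setdefault(field, []).append(
                    {"source": source, "value": value, "conflicts_with": merged[field]}
                )
    return merged, conflicts


# Made-up identifiers, chosen so that the PMID disagrees between two sources.
sample_ids = {
    "crossref": {"doi": "10.1000/example"},
    "semantic_scholar": {"doi": "10.1000/EXAMPLE", "pmid": "111"},
    "openalex": {"doi": "10.1000/example", "pmid": "222"},
}
demo_merged, demo_conflicts = flatten_external_ids(sample_ids)
print(demo_merged)
print(demo_conflicts)
```

The DOI differs only in letter case, so it is treated as the same value, while the two PMIDs are reported as a conflict attributed to the later source.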

In [None]:
# Demonstrate external ID flattening and conflict detection
print("External ID Flattening and Conflict Detection Demo")
print("=" * 60)

# Use a DOI that exists in multiple sources to show alignment
DOI = "10.1371/journal.pone.0308090"

config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_datacite=False,
    enable_unpaywall=False,
    enable_ror=False,
)

with PaperEnricher(config) as enricher:
    result = enricher.enrich_paper(identifier=DOI)

print("🔍 EXTERNAL ID ANALYSIS")
print("=" * 60)

# Show raw external IDs from each source
print("Raw external IDs from each API:")
print()

crossref_data = result.get("crossref", {})
if crossref_data:
    doi = crossref_data.get("DOI")
    print(f"CrossRef DOI: {doi}")

semantic_data = result.get("semantic_scholar", {})
if semantic_data:
    ss_ids = semantic_data.get("external_ids", {})
    print(f"Semantic Scholar CorpusId: {ss_ids.get('CorpusId')}")
    print(f"Semantic Scholar PubMed: {ss_ids.get('PubMed')}")

openalex_data = result.get("openalex", {})
if openalex_data:
    oa_ids = openalex_data.get("ids", {})
    print(f"OpenAlex ID: {oa_ids.get('openalex')}")
    print(f"OpenAlex DOI: {oa_ids.get('doi')}")
    print(f"OpenAlex PMID: {oa_ids.get('pmid')}")

print("\n📊 MERGED AND FLATTENED EXTERNAL IDS")
print("=" * 60)

merged = result.get("merged", {})
external_ids = merged.get("external_ids", {})
external_id_conflicts = merged.get("external_id_conflicts", {})

print("Flattened external IDs (standardized field names):")
for key, value in external_ids.items():
    conflict_marker = " ⚠️ CONFLICT" if key in external_id_conflicts else ""
    print(f"  {key}: {value}{conflict_marker}")

if external_id_conflicts:
    print("\n🚨 DETECTED CONFLICTS:")
    for field, conflicts in external_id_conflicts.items():
        print(f"  Field '{field}' has conflicts:")
        for conflict in conflicts:
            source = conflict["source"]
            value = conflict["value"]
            conflicts_with = conflict["conflicts_with"]
            print(f"    {source}: '{value}' conflicts with '{conflicts_with}'")

print("\n🏗️ PAPER ENTITY WITH FLATTENED IDS")
print("=" * 60)

# Create PaperEntity to show flattened access
from pyeuropepmc.models.paper import PaperEntity

paper = PaperEntity.from_enrichment_result(result)
print(f"Paper DOI: {paper.doi}")
print(f"Paper PMID: {paper.pmid}")
print(f"Semantic Scholar Corpus ID: {paper.semantic_scholar_corpus_id}")
print(f"OpenAlex ID: {paper.openalex_id}")
print(f"External IDs dict: {list(paper.external_ids.keys()) if paper.external_ids else 'None'}")

if paper.external_id_conflicts:
    print(f"Conflicts detected: {len(paper.external_id_conflicts)} fields")
else:
    print("No conflicts detected ✓")

print("\n🔗 RDF MAPPING OF FLATTENED IDS")
print("=" * 60)

# Show RDF mapping
from pyeuropepmc.mappers.rdf_mapper import RDFMapper
from rdflib import Graph

mapper = RDFMapper()
g = Graph()
paper.to_rdf(g, mapper=mapper)

# Find external ID triples by matching on the predicate URI
external_id_triples = [
    triple
    for triple in g
    if "semanticScholarCorpusId" in str(triple[1]) or "openAlexId" in str(triple[1])
]

print("RDF triples generated for flattened external IDs:")
for triple in external_id_triples:
    # Show only the local name of the predicate (after '#' or the last '/')
    predicate_uri = str(triple[1])
    separator = "#" if "#" in predicate_uri else "/"
    predicate = predicate_uri.split(separator)[-1]
    print(f"  {predicate}: {triple[2]}")

print(f"\nTotal RDF triples generated: {len(g)}")

print("\n" + "=" * 60)
print("EXTERNAL ID FEATURES SUMMARY:")
print("=" * 60)
print("✅ Flattened external IDs: Direct access to semantic_scholar_corpus_id, openalex_id, etc.")
print("✅ Conflict detection: Automatic flagging when sources provide conflicting values")
print("✅ Standardized naming: Consistent field names across different APIs")
print("✅ RDF mapping: Flattened IDs generate proper RDF triples")
print("✅ Backward compatibility: Original external_ids dict preserved")
print("✅ Data quality: Transparency about potential data inconsistencies")
print("=" * 60)

In [None]:
# Demonstrate saving raw responses
print("Saving Raw API Responses Demo")
print("=" * 50)

DOI = "10.1371/journal.pone.0308090"

config = EnrichmentConfig(
    enable_crossref=True,
    enable_datacite=True,
    enable_unpaywall=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_ror=True,
)

with PaperEnricher(config) as enricher:
    result = enricher.enrich_paper(
        identifier=DOI,
        save_responses=True  # Enable saving
    )

print(f"Enrichment complete. Sources: {result['sources']}")
print("Raw responses and merged result saved to: src/outputs/enrichment_responses/")
print("Files saved:")
print("- raw_crossref_{doi}.json")
print("- raw_semantic_scholar_{doi}.json")
print("- raw_openalex_{doi}.json")
print("- raw_datacite_{doi}.json")
print("- raw_unpaywall_{doi}.json")
print("- raw_ror_{doi}.json")
print("- merged_{doi}.json")
print("\nThese files contain the exact responses from each API and the final merged result.")
print("Useful for debugging, documentation, and understanding the enrichment process.")

print("\n" + "=" * 50)
print("RESPONSE SAVING FEATURES:")
print("=" * 50)
print("• save_responses parameter enables automatic saving")
print("• Raw responses from each API saved separately")
print("• Merged result saved with full enrichment data")
print("• Files stored in src/outputs/enrichment_responses/ by default")
print("• DOI sanitized for safe filenames")
print("• JSON format with proper indentation and Unicode support")
print("=" * 50)

Saving Raw API Responses Demo
Enrichment complete. Sources: ['crossref', 'semantic_scholar', 'openalex']
Raw responses and merged result saved to: src/outputs/enrichment_responses/
Files saved:
- raw_crossref_{doi}.json
- raw_semantic_scholar_{doi}.json
- raw_openalex_{doi}.json
- merged_{doi}.json

These files contain the exact responses from each API and the final merged result.
Useful for debugging, documentation, and understanding the enrichment process.

RESPONSE SAVING FEATURES:
• save_responses parameter enables automatic saving
• Raw responses from each API saved separately
• Merged result saved with full enrichment data
• Files stored in src/outputs/enrichment_responses/ by default
• DOI sanitized for safe filenames
• JSON format with proper indentation and Unicode support
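
The notebook does not show how the DOI is sanitized or how the JSON is written. As a rough sketch of what those two points could look like, assuming a simple "replace unsafe characters with underscores" rule (the library's actual scheme may differ, and `sanitize_for_filename` is a hypothetical helper):

```python
import json
import re


def sanitize_for_filename(identifier):
    """Replace every character outside [A-Za-z0-9._-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", identifier)


doi = "10.1371/journal.pone.0308090"
filename = f"raw_crossref_{sanitize_for_filename(doi)}.json"

# indent=2 gives readable output; ensure_ascii=False keeps Unicode as-is.
payload = json.dumps({"title": "Café study"}, indent=2, ensure_ascii=False)

print(filename)
print(payload)
```

The slash in the DOI would otherwise be interpreted as a directory separator, which is why some form of sanitization is needed before building the filename.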