Add PMC full-text fallback + DOI prefix bugfix to literature fetcher #52

Merged
realmarcin merged 3 commits into main from claude/wider-cache-fallback
May 16, 2026

@realmarcin
Contributor

Summary

Extends the literature fetcher fallback chain to cover OA papers that are in PubMed Central but not MEDLINE-indexed and lack a CrossRef abstract.

  • New `fetch_pmcid_for_doi()` resolves DOI → PMCID via NCBI esearch (`db=pmc`)
  • New `fetch_pmc_abstract()` fetches the PMC JATS XML, regex-extracts the `<abstract>` element, strips tags, collapses whitespace, and caches the result as `pmc_<id>.txt`
  • `fetch_paper()` fallback chain extended to: CrossRef → PMID → PubMed → PMCID → PMC XML
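
The extraction step in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `literature.py`: the function name `extract_abstract` and the sample XML are hypothetical, and it works on an in-memory string rather than a real efetch response.

```python
import re
from typing import Optional


def extract_abstract(jats_xml: str) -> Optional[str]:
    """Return the plain-text abstract from a JATS XML string, or None."""
    # Grab the first <abstract ...>...</abstract> element; DOTALL lets it span lines.
    match = re.search(
        r"<abstract\b[^>]*>(.*?)</abstract>",
        jats_xml,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return None
    # Replace nested JATS tags (<p>, <title>, <italic>, ...) with a space.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    # Collapse runs of whitespace left behind by the tags and line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


# Hypothetical sample document, not taken from a real PMC record.
sample = """<article><front><abstract>
  <title>Abstract</title>
  <p>Acidophiles thrive at <italic>low</italic> pH.</p>
</abstract></front></article>"""

print(extract_abstract(sample))
```

Regex extraction is fragile compared with a real XML parser (for example, replacing inline tags with spaces can leave a gap before punctuation), so this is best read as the simplest thing that matches the description above.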

Also fixes a latent bug: `doi.replace("doi:", "")` only stripped the lowercase prefix, so `reference_id` values like `DOI:10.1007/...` were sent to Unpaywall/CrossRef/esearch with the prefix preserved (yielding 422/404). Replaced 5 occurrences with `re.sub(r"^(?i:doi:)", "", ...)` and added `import re` at module scope.
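
For illustration, the difference between the old and new normalization looks roughly like this. The helper name `strip_doi_prefix` is hypothetical; only the `re.sub` pattern comes from the PR description.

```python
import re


def strip_doi_prefix(identifier: str) -> str:
    """Remove a leading 'doi:' prefix regardless of letter case."""
    # (?i:...) applies case-insensitivity only to the prefix group,
    # and ^ anchors it so a 'doi:' elsewhere in the string is untouched.
    return re.sub(r"^(?i:doi:)", "", identifier)


prefixed = "DOI:10.3389/fmicb.2018.01853"

# Old behaviour: str.replace is case-sensitive, so the prefix survives.
print(prefixed.replace("doi:", ""))

# New behaviour: the prefix is removed whatever its case.
print(strip_doi_prefix(prefixed))
```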

Impact

  • Cache refresh on the 17 references previously marked `content_type: unavailable`: 3 abstracts now available from PMC (`pmc_10490939`, `pmc_6417678`, `pmc_6746925`).
  • The remaining 14 are genuinely paywalled with no PMC record (older Springer/Elsevier/Wiley journals) — those stay `unavailable`, but the infrastructure is in place for future-curated DOIs that happen to be OA.
  • Net `just validate-references-all`: ~120 → 110 errors after this PR.

Test plan

  • Smoke test: PMC fallback resolves PLoS/Frontiers DOIs not present in PubMed
  • DOI prefix bugfix verified by re-running fetch on `DOI:`-prefixed references previously returning 422
  • No regressions in `just test` (102 pass, 9 skip)

🤖 Generated with Claude Code

When CrossRef has no abstract and PubMed has no PMID for a DOI, try
mapping the DOI to a PMC ID (via NCBI esearch on db=pmc) and extract
the <abstract> element from the JATS full-text XML. This covers OA
papers (preprints, BMC, PLoS, Frontiers) that aren't MEDLINE-indexed
but are in PMC.

- fetch_pmcid_for_doi(): NCBI esearch DOI -> PMCID
- fetch_pmc_abstract(): NCBI efetch PMC XML, regex-extract <abstract>,
  strip JATS tags, collapse whitespace, cache as pmc_<id>.txt
- fetch_paper(): extend the DOI fallback chain so PMC is tried after
  the existing CrossRef -> PMID -> PubMed chain falls through
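
A simplified sketch of the fallback idea, assuming each source can be modelled as a function from DOI to optional abstract text. All names here (`first_available`, `from_crossref`, `from_pubmed`, `from_pmc`) are hypothetical stand-ins; the real `fetch_paper()` resolves PMID and PMCID as intermediate steps and is not reproduced here.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[str]]


def first_available(
    doi: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (source_name, abstract) from the first source that yields text."""
    for name, fetch in sources:
        try:
            text = fetch(doi)
        except Exception:
            # A failing source counts as a miss so the chain keeps going.
            continue
        if text:
            return name, text
    return None, None


# Hypothetical stand-ins for the network-backed fetchers.
def from_crossref(doi: str) -> Optional[str]:
    return None  # CrossRef has no abstract for this DOI


def from_pubmed(doi: str) -> Optional[str]:
    return None  # no PMID, so no PubMed abstract


def from_pmc(doi: str) -> Optional[str]:
    return "Abstract recovered from PMC JATS XML."


chain = [("crossref", from_crossref), ("pubmed", from_pubmed), ("pmc", from_pmc)]
print(first_available("10.3389/fmicb.2018.01853", chain))
```

Keeping the sources in an ordered list makes the precedence explicit: PMC is consulted only after the earlier sources return nothing.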

Also fix case-insensitive DOI prefix stripping (the previous
"doi:" -> "" only handled the lowercase prefix, so "DOI:10.1007/..."
identifiers from reference_id fields hit Unpaywall/CrossRef/esearch
with the prefix preserved and returned 422/404). Add `import re` at
the module level and use `re.sub(r"^(?i:doi:)", "", ...)` across the
five DOI normalization sites.

Cache refresh on 17 references_cache files previously marked
content_type: unavailable: 3 additional abstracts now available from
PMC. Most of the remaining 14 are genuinely paywalled with no PMC
record (older Springer/Elsevier/Wiley journals) and stay unavailable.

Net validate-references-all: ~120 -> 110 errors after this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 07:36

Copilot AI left a comment

Pull request overview

This PR extends `LiteratureFetcher`'s DOI-based fallback chain to retrieve abstracts from PubMed Central (PMC) when CrossRef and PubMed don't provide one, and fixes DOI prefix normalization so an uppercase `DOI:` prefix is handled consistently across fetch paths.

Changes:

  • Added DOI → PMCID resolution (`esearch`, `db=pmc`) and PMCID → abstract extraction from PMC XML (`efetch`, `db=pmc`).
  • Updated DOI normalization to strip `doi:` case-insensitively using a regex (fixing failures on `DOI:`-prefixed identifiers).
  • Added/updated cached reference artifacts and PMC abstract cache files.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/communitymech/literature.py` | Adds PMC fallback methods and updates DOI prefix stripping to be case-insensitive. |
| `references_cache/pmc_6746925.txt` | New cached PMC abstract text. |
| `references_cache/pmc_6417678.txt` | New cached PMC abstract text. |
| `references_cache/pmc_10490939.txt` | New cached PMC abstract text. |
| `references_cache/DOI_10.4056_sigs.922139.md` | New cached reference markdown for a DOI record. |
| `references_cache/DOI_10.1099_00207713-36-2-197.md` | Updates cached reference markdown to `abstract_only` and adds abstract content. |

Comment thread src/communitymech/literature.py Outdated
Comment thread src/communitymech/literature.py Outdated
Comment thread references_cache/DOI_10.1099_00207713-36-2-197.md Outdated
realmarcin and others added 2 commits May 16, 2026 00:44
- fetch_pmcid_for_doi: switch from NCBI esearch to the PMC ID
  Converter API. esearch silently falls back to an `[All Fields]`
  token search when a DOI isn't indexed in PMC, returning unrelated
  PMC records (observed: DOI 10.1099/00207713-36-2-197 fuzzy-matched
  to PMC 10490939, a 2023 acidophile proteostasis paper). The ID
  converter performs an exact DOI lookup and returns an explicit
  `errmsg: "Identifier not found in PMC"` for non-PMC DOIs.
- Revert references_cache/DOI_10.1099_00207713-36-2-197.md back to
  content_type: unavailable (it was overwritten with the wrong-paper
  abstract from the fuzzy esearch match). Delete the orphaned
  pmc_10490939.txt that was cached as part of that bad match.
- fetch_pmc_abstract: fix PMCID normalization order — strip "PMCID:"
  before stripping the bare "PMC" prefix; the reverse order would
  mangle "PMCID:3035377" to "ID:3035377".
- Drop the redundant inner `import re` in fetch_pmc_abstract; re is
  already imported at module scope.
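
The two fixes above can be illustrated with a small sketch. Both function names are hypothetical, and the converter record shape (a dict with `pmcid` and `errmsg` keys) is an assumption based on the error message quoted in this commit; no network call is made.

```python
import re
from typing import Optional


def normalize_pmcid(raw: str) -> str:
    """Return the numeric part of a PMC identifier."""
    value = raw.strip()
    # Strip the longer "PMCID:" prefix first ...
    value = re.sub(r"^(?i:pmcid:)", "", value).strip()
    # ... and only then the bare "PMC" prefix.
    return re.sub(r"^(?i:pmc)", "", value)


def pmcid_from_record(record: dict) -> Optional[str]:
    """Pick the PMCID out of one converter record, or None on an explicit miss."""
    # Assumed shape: a miss carries an "errmsg", a hit carries a "pmcid".
    if record.get("errmsg"):
        return None
    return record.get("pmcid") or None


# Reverse order, as described above: removing "PMC" first mangles the value.
print("PMCID:3035377".replace("PMC", "", 1))

print(normalize_pmcid("PMCID:3035377"))
print(pmcid_from_record({"errmsg": "Identifier not found in PMC"}))
```

Treating an explicit `errmsg` as a miss is what keeps a non-PMC DOI from being matched to an unrelated record, which was the failure mode of the fuzzy esearch lookup.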

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The .claude/ directory holds Claude Code's local agent state (e.g.,
scheduled_tasks.lock); it's per-machine and doesn't belong in the
repo. Add it to .gitignore and untrack the lock file that was
accidentally committed in a622262.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit b8ac7c9 into main May 16, 2026
@realmarcin realmarcin deleted the claude/wider-cache-fallback branch May 16, 2026 07:49
realmarcin added a commit that referenced this pull request May 16, 2026
Drive down `just validate-references-all` errors from 101 to 62 by
adding a project-level config for linkml-reference-validator and
opportunistically refreshing one cache via the new PMC fallback.

Changes:
- New `conf/reference_validator.yaml` config wired into the
  `validate-references` and `validate-references-all` just targets
  via the validator's `--config` flag:
  - `skip_prefixes: [BIOPROJECT]` so BIOPROJECT accessions like
    PRJNA1272773 stop erroring (the validator's Entrez fetch path
    hits a DTD tag-validation issue on these and there's no
    abstract to validate anyway).
  - `unknown_prefix_severity: WARNING` so the ~35 paywalled DOIs
    that fail both Crossref and DataCite degrade from ERROR to
    WARNING. The references stay in the data; the validator just
    stops failing the run for content it cannot fetch.
- Refresh `references_cache/DOI_10.3389_fmicb.2018.01853.md` via
  the PR #52 PMC fallback chain (DOI -> PMCID PMC6119820 -> JATS
  abstract); recovers 4 "No content available" errors across the
  communities that cite it.

The remaining 62 errors are all "No content available" for caches
that are genuinely paywalled with no OA mirror (older Springer,
Elsevier, IJSEM, etc.) - the validator's hardcoded ERROR severity
on empty cache content can't be relaxed via config, so these would
require either upstream curation changes or a feature ask to the
validator package.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>