Add PMC full-text fallback + DOI prefix bugfix to literature fetcher by realmarcin · Pull Request #52 · CultureBotAI/CommunityMech

realmarcin · 2026-05-16T07:36:11Z

Summary

Extends the literature fetcher fallback chain to cover OA papers that are in PubMed Central but not MEDLINE-indexed and lack a CrossRef abstract.

New `fetch_pmcid_for_doi()` resolves DOI → PMCID via NCBI esearch (`db=pmc`)
New `fetch_pmc_abstract()` fetches the PMC JATS XML, regex-extracts the `<abstract>` element, strips tags, collapses whitespace, and caches the result as `pmc_<id>.txt`
`fetch_paper()` fallback chain extended to: CrossRef → PMID → PubMed → PMCID → PMC XML
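
The fallback order above can be pictured as a list of sources tried in turn until one returns text. The following is only a minimal sketch of that idea, not the code in `literature.py`: the helper `first_abstract` and the three stand-in source functions are hypothetical names, and the real `fetch_paper()` may be structured differently.

```python
from typing import Callable, List, Optional, Tuple

# A source takes a bare DOI and returns abstract text, or None when it has nothing.
Source = Callable[[str], Optional[str]]


def first_abstract(doi: str, sources: List[Tuple[str, Source]]) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in order and return (source_name, text) for the first hit."""
    for name, source in sources:
        text = source(doi)
        if text:  # None or "" means "fall through to the next source"
            return name, text
    return None, None


# Hypothetical stand-ins for the real network-backed steps.
def crossref_abstract(doi: str) -> Optional[str]:
    return None  # CrossRef record has no abstract


def pubmed_abstract(doi: str) -> Optional[str]:
    return None  # DOI -> PMID lookup found nothing, so PubMed is skipped


def pmc_abstract(doi: str) -> Optional[str]:
    return "Abstract text recovered from PMC."  # DOI -> PMCID -> PMC XML


chain = [
    ("crossref", crossref_abstract),
    ("pubmed", pubmed_abstract),
    ("pmc", pmc_abstract),
]
source, abstract = first_abstract("10.1234/example", chain)
print(source, abstract)
```

Keeping the sources in an ordered list makes it cheap to append another fallback later without touching the loop.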

Also fixes a latent bug: `doi.replace("doi:", "")` only stripped the lowercase prefix, so `reference_id` values like `DOI:10.1007/...` were sent to Unpaywall/CrossRef/esearch with the prefix preserved (yielding 422/404). Replaced 5 occurrences with `re.sub(r"^(?i:doi:)", "", ...)` and added `import re` at module scope.
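
A small illustration of the prefix bug, using the two expressions quoted above. The wrapper function names and the DOI suffix `example` are made up for the demonstration; only the `replace` call and the `re.sub` pattern come from the PR description.

```python
import re


def strip_doi_prefix_old(reference_id: str) -> str:
    # Old behaviour: only the lowercase literal "doi:" is removed.
    return reference_id.replace("doi:", "")


def strip_doi_prefix(reference_id: str) -> str:
    # New behaviour: remove a leading "doi:" in any letter case, and only at the start.
    return re.sub(r"^(?i:doi:)", "", reference_id)


for ref in ("doi:10.1007/example", "DOI:10.1007/example", "10.1007/example"):
    print(ref, "->", strip_doi_prefix_old(ref), "|", strip_doi_prefix(ref))
```

The scoped flag `(?i:...)` limits case-insensitivity to the prefix, and the `^` anchor means a `doi:` substring later in the identifier is left alone, which `str.replace` would also have removed.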

Impact

Cache refresh on the 17 references previously marked `content_type: unavailable`: 3 abstracts now available from PMC (`pmc_10490939`, `pmc_6417678`, `pmc_6746925`).
The remaining 14 are genuinely paywalled with no PMC record (older Springer/Elsevier/Wiley journals) — those stay `unavailable`, but the infrastructure is in place for future-curated DOIs that happen to be OA.
Net `just validate-references-all`: ~120 → 110 errors after this PR.

Test plan

Smoke test: PMC fallback resolves PLoS/Frontiers DOIs not present in PubMed
DOI prefix bugfix verified by re-running fetch on `DOI:`-prefixed references previously returning 422
No regressions in `just test` (102 pass, 9 skip)

🤖 Generated with Claude Code

When CrossRef has no abstract and PubMed has no PMID for a DOI, try mapping the DOI to a PMC ID (via NCBI esearch on db=pmc) and extract the <abstract> element from the JATS full-text XML. This covers OA papers (preprints, BMC, PLoS, Frontiers) that aren't MEDLINE-indexed but are in PMC.

- fetch_pmcid_for_doi(): NCBI esearch DOI -> PMCID
- fetch_pmc_abstract(): NCBI efetch PMC XML, regex-extract <abstract>, strip JATS tags, collapse whitespace, cache as pmc_<id>.txt
- fetch_paper(): extend the DOI fallback chain so PMC is tried after the existing CrossRef -> PMID -> PubMed chain falls through

Also fix case-insensitive DOI prefix stripping (the previous "doi:" -> "" only handled the lowercase prefix, so "DOI:10.1007/..." identifiers from reference_id fields hit Unpaywall/CrossRef/esearch with the prefix preserved and returned 422/404). Add `import re` at the module level and use `re.sub(r"^(?i:doi:)", "", ...)` across the five DOI normalization sites.

Cache refresh on 17 references_cache files previously marked content_type: unavailable: 3 additional abstracts now available from PMC. Most of the remaining 14 are genuinely paywalled with no PMC record (older Springer/Elsevier/Wiley journals) and stay unavailable.

Net validate-references-all: ~120 -> 110 errors after this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
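
The extract, strip, and collapse steps described for `fetch_pmc_abstract()` might look roughly like the sketch below. It is an assumption-laden illustration: the function name `extract_abstract`, the exact regexes, and the sample XML are invented here, and the real method also handles the efetch request and caching, which are omitted.

```python
import re
from typing import Optional


def extract_abstract(jats_xml: str) -> Optional[str]:
    """Return the plain-text content of the first <abstract> element, or None."""
    match = re.search(
        r"<abstract\b[^>]*>(.*?)</abstract>",
        jats_xml,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        return None
    text = re.sub(r"<[^>]+>", " ", match.group(1))  # replace each JATS tag with a space
    text = re.sub(r"\s+", " ", text).strip()        # collapse runs of whitespace
    return text or None


# Invented sample document, not taken from a real PMC record.
sample = """<article><front><article-meta>
<abstract><title>Abstract</title>
<p>Acidophiles thrive at <italic>low</italic> pH.</p>
</abstract></article-meta></front></article>"""

print(extract_abstract(sample))
```

Replacing tags with a space keeps adjacent paragraphs from running together, at the cost of splitting inline markup inside a word (for example, a subscript in a chemical formula). A regex is a pragmatic shortcut here; an XML parser would be more robust against nested or unusual markup.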

Copilot

Pull request overview

This PR extends LiteratureFetcher’s DOI-based fallback chain to retrieve abstracts from PubMed Central (PMC) when CrossRef and PubMed don’t provide one, and fixes DOI prefix normalization so DOI: (uppercase) is handled consistently across fetch paths.

Changes:

Added DOI → PMCID resolution (esearch db=pmc) and PMCID → abstract extraction from PMC XML (efetch db=pmc).
Updated DOI normalization to strip doi: case-insensitively using a regex (fixing DOI:-prefixed failures).
Added/updated cached reference artifacts and PMC abstract cache files.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| src/communitymech/literature.py | Adds PMC fallback methods and updates DOI prefix stripping to be case-insensitive. |
| references_cache/pmc_6746925.txt | New cached PMC abstract text. |
| references_cache/pmc_6417678.txt | New cached PMC abstract text. |
| references_cache/pmc_10490939.txt | New cached PMC abstract text. |
| references_cache/DOI_10.4056_sigs.922139.md | New cached reference markdown for a DOI record. |
| references_cache/DOI_10.1099_00207713-36-2-197.md | Updates cached reference markdown to `abstract_only` and adds abstract content. |


- fetch_pmcid_for_doi: switch from NCBI esearch to the PMC ID Converter API. esearch silently falls back to an `[All Fields]` token search when a DOI isn't indexed in PMC, returning unrelated PMC records (observed: DOI 10.1099/00207713-36-2-197 fuzzy-matched to PMC 10490939, a 2023 acidophile proteostasis paper). The ID converter performs an exact DOI lookup and returns an explicit `errmsg: "Identifier not found in PMC"` for non-PMC DOIs.
- Revert references_cache/DOI_10.1099_00207713-36-2-197.md back to content_type: unavailable (it was overwritten with the wrong-paper abstract from the fuzzy esearch match). Delete the orphaned pmc_10490939.txt that was cached as part of that bad match.
- fetch_pmc_abstract: fix PMCID normalization order: strip "PMCID:" before stripping the bare "PMC" prefix; the reverse order would mangle "PMCID:3035377" to "ID:3035377".
- Drop the redundant inner `import re` in fetch_pmc_abstract; re is already imported at module scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
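
Two of the fixes in this commit are easy to show in isolation. The sketch below is illustrative only: the function names are hypothetical, and the JSON shape passed to `pmcid_from_idconv` (a `records` list whose entries carry either `pmcid` or `errmsg`) is an assumption built around the `errmsg` text quoted in the commit message, not a verified copy of the ID Converter response. No network call is made.

```python
from typing import Optional


def normalize_pmcid(raw: str) -> str:
    """Strip "PMCID:" first, then a bare "PMC", leaving only the numeric part."""
    value = raw.strip()
    for prefix in ("PMCID:", "PMC"):  # longer prefix must be tried first
        if value.upper().startswith(prefix):
            value = value[len(prefix):]
    return value


def normalize_pmcid_wrong_order(raw: str) -> str:
    # Stripping "PMC" first eats the start of "PMCID:" and leaves "ID:" behind.
    return raw.replace("PMC", "").replace("PMCID:", "")


def pmcid_from_idconv(payload: dict) -> Optional[str]:
    """Return the PMCID from an (assumed) ID Converter payload, or None if absent."""
    records = payload.get("records") or []
    if not records:
        return None
    record = records[0]
    if record.get("errmsg") or record.get("status") == "error":
        return None  # exact lookup failed: the DOI is not in PMC
    return record.get("pmcid")


found = {"records": [{"doi": "10.3389/fmicb.2018.01853", "pmcid": "PMC6119820"}]}
missing = {"records": [{"doi": "10.1099/00207713-36-2-197",
                        "status": "error",
                        "errmsg": "Identifier not found in PMC"}]}

print(normalize_pmcid("PMCID:3035377"), normalize_pmcid_wrong_order("PMCID:3035377"))
print(pmcid_from_idconv(found), pmcid_from_idconv(missing))
```

Treating an explicit error record as "no PMCID" is what prevents the wrong-paper match described above: a failed exact lookup yields `None` instead of a loosely related search hit.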

The .claude/ directory holds Claude Code's local agent state (e.g., scheduled_tasks.lock); it's per-machine and doesn't belong in the repo. Add it to .gitignore and untrack the lock file that was accidentally committed in a622262.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drive down `just validate-references-all` errors from 101 to 62 by adding a project-level config for linkml-reference-validator and opportunistically refreshing one cache via the new PMC fallback.

Changes:

- New `conf/reference_validator.yaml` config wired into the `validate-references` and `validate-references-all` just targets via the validator's `--config` flag:
  - `skip_prefixes: [BIOPROJECT]` so BIOPROJECT accessions like PRJNA1272773 stop erroring (the validator's Entrez fetch path hits a DTD tag-validation issue on these and there's no abstract to validate anyway).
  - `unknown_prefix_severity: WARNING` so the ~35 paywalled DOIs that fail both Crossref and DataCite degrade from ERROR to WARNING. The references stay in the data; the validator just stops failing the run for content it cannot fetch.
- Refresh `references_cache/DOI_10.3389_fmicb.2018.01853.md` via the PR #52 PMC fallback chain (DOI -> PMCID PMC6119820 -> JATS abstract); recovers 4 "No content available" errors across the communities that cite it.

The remaining 62 errors are all "No content available" for caches that are genuinely paywalled with no OA mirror (older Springer, Elsevier, IJSEM, etc.). The validator's hardcoded ERROR severity on empty cache content can't be relaxed via config, so these would require either upstream curation changes or a feature ask to the validator package.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 16, 2026 07:36

Copilot started reviewing on behalf of realmarcin May 16, 2026 07:36

Copilot AI reviewed May 16, 2026

Comment thread src/communitymech/literature.py Outdated

Comment thread src/communitymech/literature.py Outdated

Comment thread references_cache/DOI_10.1099_00207713-36-2-197.md Outdated

realmarcin and others added 2 commits May 16, 2026 00:44

realmarcin merged commit b8ac7c9 into main May 16, 2026

realmarcin deleted the claude/wider-cache-fallback branch May 16, 2026 07:49

realmarcin mentioned this pull request May 16, 2026

Configure reference validator to silence unfetchable refs (101 -> 62 errors) #57

Merged

3 tasks

realmarcin mentioned this pull request May 18, 2026

Add pytest coverage for new LiteratureFetcher fallback methods #70

Merged

2 tasks

realmarcin merged 3 commits into main from claude/wider-cache-fallback
