Add PMC full-text fallback + DOI prefix bugfix to literature fetcher #52
Merged
When CrossRef has no abstract and PubMed has no PMID for a DOI, try mapping the DOI to a PMC ID (via NCBI esearch on db=pmc) and extract the <abstract> element from the JATS full-text XML. This covers OA papers (preprints, BMC, PLoS, Frontiers) that aren't MEDLINE-indexed but are in PMC.

- fetch_pmcid_for_doi(): NCBI esearch DOI -> PMCID
- fetch_pmc_abstract(): NCBI efetch PMC XML, regex-extract <abstract>, strip JATS tags, collapse whitespace, cache as pmc_<id>.txt
- fetch_paper(): extend the DOI fallback chain so PMC is tried after the existing CrossRef -> PMID -> PubMed chain falls through

Also make DOI prefix stripping case-insensitive (the previous "doi:" -> "" replacement only handled the lowercase prefix, so "DOI:10.1007/..." identifiers from reference_id fields hit Unpaywall/CrossRef/esearch with the prefix preserved and returned 422/404). Add `import re` at the module level and use `re.sub(r"^(?i:doi:)", "", ...)` across the five DOI normalization sites.

Cache refresh on 17 references_cache files previously marked content_type: unavailable: 3 additional abstracts are now available from PMC. Most of the remaining 14 are genuinely paywalled with no PMC record (older Springer/Elsevier/Wiley journals) and stay unavailable.

Net validate-references-all: ~120 -> 110 errors after this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
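The extraction step described for fetch_pmc_abstract() (regex-extract <abstract>, strip JATS tags, collapse whitespace) can be illustrated with a minimal sketch. This is not the code from the PR: the function name `extract_jats_abstract` and the sample XML are hypothetical, and the entity unescaping is an extra assumption not mentioned in the commit message.

```python
import html
import re
from typing import Optional


def extract_jats_abstract(xml_text: str) -> Optional[str]:
    """Return the first <abstract> of a JATS document as plain text, or None.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # DOTALL lets '.' span newlines; the non-greedy group stops at the
    # first closing </abstract> tag.
    match = re.search(r"<abstract\b[^>]*>(.*?)</abstract>", xml_text, flags=re.DOTALL)
    if match is None:
        return None
    # Replace every tag with a space so neighbouring words do not merge.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    # Decode entities such as &amp; (an assumption beyond the commit message).
    text = html.unescape(text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


# Made-up JATS fragment, used only to exercise the helper.
sample = """<article><front><abstract abstract-type="summary">
  <title>Background</title>
  <p>Open-access papers are <italic>often</italic> deposited in PMC.</p>
</abstract></front></article>"""

print(extract_jats_abstract(sample))
```

A regex is enough for a single well-formed <abstract> element, but it does not handle nested or multiple abstracts; an XML parser would be the more robust choice if those cases matter.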
Pull request overview
This PR extends LiteratureFetcher’s DOI-based fallback chain to retrieve abstracts from PubMed Central (PMC) when CrossRef and PubMed don’t provide one, and fixes DOI prefix normalization so DOI: (uppercase) is handled consistently across fetch paths.
Changes:
- Added DOI → PMCID resolution (`esearch db=pmc`) and PMCID → abstract extraction from PMC XML (`efetch db=pmc`).
- Updated DOI normalization to strip `doi:` case-insensitively using a regex (fixing `DOI:`-prefixed failures).
- Added/updated cached reference artifacts and PMC abstract cache files.
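The fallback order summarized in this overview (CrossRef, then PubMed, then PMC) can be sketched as a chain of fetchers tried in sequence. The names below (`first_abstract`, the `*_stub` functions, the example DOI) are hypothetical stand-ins, not LiteratureFetcher's actual methods, and no network calls are made.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes a bare DOI and returns an abstract, or None on a miss.
Fetcher = Callable[[str], Optional[str]]


def first_abstract(doi: str, fetchers: Iterable[Fetcher]) -> Optional[str]:
    """Try each fetcher in order and return the first non-empty abstract."""
    for fetch in fetchers:
        abstract = fetch(doi)
        if abstract:
            return abstract
    return None


# Hypothetical stubs standing in for the real CrossRef / PubMed / PMC lookups.
def crossref_stub(doi: str) -> Optional[str]:
    return None  # CrossRef record has no abstract


def pubmed_stub(doi: str) -> Optional[str]:
    return None  # no PMID for this DOI


def pmc_stub(doi: str) -> Optional[str]:
    return "Abstract recovered from PMC."


result = first_abstract("10.1234/example", [crossref_stub, pubmed_stub, pmc_stub])
print(result)
```

Keeping the sources in an ordered list makes it cheap to append another fallback later without touching the earlier steps.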
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/communitymech/literature.py | Adds PMC fallback methods and updates DOI prefix stripping to be case-insensitive. |
| references_cache/pmc_6746925.txt | New cached PMC abstract text. |
| references_cache/pmc_6417678.txt | New cached PMC abstract text. |
| references_cache/pmc_10490939.txt | New cached PMC abstract text. |
| references_cache/DOI_10.4056_sigs.922139.md | New cached reference markdown for a DOI record. |
| references_cache/DOI_10.1099_00207713-36-2-197.md | Updates cached reference markdown to abstract_only and adds abstract content. |
- fetch_pmcid_for_doi: switch from NCBI esearch to the PMC ID Converter API. esearch silently falls back to an `[All Fields]` token search when a DOI isn't indexed in PMC, returning unrelated PMC records (observed: DOI 10.1099/00207713-36-2-197 fuzzy-matched to PMC 10490939, a 2023 acidophile proteostasis paper). The ID converter performs an exact DOI lookup and returns an explicit `errmsg: "Identifier not found in PMC"` for non-PMC DOIs.
- Revert references_cache/DOI_10.1099_00207713-36-2-197.md back to content_type: unavailable (it was overwritten with the wrong-paper abstract from the fuzzy esearch match). Delete the orphaned pmc_10490939.txt that was cached as part of that bad match.
- fetch_pmc_abstract: fix PMCID normalization order: strip "PMCID:" before stripping the bare "PMC" prefix; the reverse order would mangle "PMCID:3035377" to "ID:3035377".
- Drop the redundant inner `import re` in fetch_pmc_abstract; re is already imported at module scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
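Two points from this commit lend themselves to a small illustration: the prefix-stripping order for PMCIDs, and treating an `errmsg` from the ID converter as "not in PMC". The sketch below is an assumption-laden example, not the PR's code: `normalize_pmcid` and `pmcid_from_idconv` are hypothetical names, and the payload shape (a `records` list whose entries carry either `pmcid` or `errmsg`) is assumed from the error text quoted above rather than taken from the API documentation.

```python
import re
from typing import Optional


def normalize_pmcid(raw: str) -> str:
    """Reduce 'PMCID:3035377', 'PMC3035377' or '3035377' to the bare number."""
    value = raw.strip()
    # Strip the longer "PMCID:" label first, then the bare "PMC" prefix.
    value = re.sub(r"^(?i:pmcid:)\s*", "", value)
    value = re.sub(r"^(?i:pmc)", "", value)
    return value


def pmcid_from_idconv(payload: dict) -> Optional[str]:
    """Pick the PMCID out of an ID-converter-style response, or None.

    The payload layout is an assumption made for this sketch.
    """
    records = payload.get("records") or []
    if not records:
        return None
    record = records[0]
    # An explicit errmsg (e.g. "Identifier not found in PMC") means no match.
    if record.get("errmsg") or not record.get("pmcid"):
        return None
    return record["pmcid"]


# Reverse order, as described in the commit message: "PMC" is removed first,
# leaving "ID:3035377", which the later "PMCID:" replacement cannot repair.
mangled = "PMCID:3035377".replace("PMC", "").replace("PMCID:", "")

# Hypothetical payloads used only to exercise the parser.
hit = {"records": [{"doi": "10.1234/example", "pmcid": "PMC1234567"}]}
miss = {"records": [{"doi": "10.1234/other", "errmsg": "Identifier not found in PMC"}]}

print(normalize_pmcid("PMCID:3035377"), mangled)
print(pmcid_from_idconv(hit), pmcid_from_idconv(miss))
```

Returning None on any `errmsg` keeps the caller's fallback logic simple: a miss is indistinguishable from "no PMC record", so a wrong-paper abstract is never cached.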
The .claude/ directory holds Claude Code's local agent state (e.g., scheduled_tasks.lock); it's per-machine and doesn't belong in the repo. Add it to .gitignore and untrack the lock file that was accidentally committed in a622262.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request on May 16, 2026
Drive down `just validate-references-all` errors from 101 to 62 by
adding a project-level config for linkml-reference-validator and
opportunistically refreshing one cache via the new PMC fallback.
Changes:
- New `conf/reference_validator.yaml` config wired into the
  `validate-references` and `validate-references-all` just targets
  via the validator's `--config` flag:
  - `skip_prefixes: [BIOPROJECT]` so BIOPROJECT accessions like
    PRJNA1272773 stop erroring (the validator's Entrez fetch path
    hits a DTD tag-validation issue on these and there's no
    abstract to validate anyway).
  - `unknown_prefix_severity: WARNING` so the ~35 paywalled DOIs
    that fail both Crossref and DataCite degrade from ERROR to
    WARNING. The references stay in the data; the validator just
    stops failing the run for content it cannot fetch.
- Refresh `references_cache/DOI_10.3389_fmicb.2018.01853.md` via
  the PR #52 PMC fallback chain (DOI -> PMCID PMC6119820 -> JATS
  abstract); recovers 4 "No content available" errors across the
  communities that cite it.
The remaining 62 errors are all "No content available" for caches
that are genuinely paywalled with no OA mirror (older Springer,
Elsevier, IJSEM, etc.) - the validator's hardcoded ERROR severity
on empty cache content can't be relaxed via config, so these would
require either upstream curation changes or a feature ask to the
validator package.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Extends the literature fetcher fallback chain to cover OA papers that are in PubMed Central but not MEDLINE-indexed and lack a CrossRef abstract.
Also fixes a latent bug: `doi.replace("doi:", "")` only stripped the lowercase prefix, so `reference_id` values like `DOI:10.1007/...` were sent to Unpaywall/CrossRef/esearch with the prefix preserved (yielding 422/404). Replaced 5 occurrences with `re.sub(r"^(?i:doi:)", "", ...)` and added `import re` at module scope.
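The difference between the old and new prefix handling can be shown with a short sketch. The regex is the one quoted above; the wrapper name `strip_doi_prefix` and the example DOI `10.1234/example` are hypothetical and used only for illustration.

```python
import re


def strip_doi_prefix(identifier: str) -> str:
    """Remove a leading 'doi:' label in any letter case; leave bare DOIs alone."""
    # (?i:...) scopes case-insensitivity to the prefix only; ^ anchors it to
    # the start so a 'doi:' substring later in the string is never touched.
    return re.sub(r"^(?i:doi:)", "", identifier)


# Old behaviour: str.replace is case-sensitive, so the uppercase prefix survives.
old_result = "DOI:10.1234/example".replace("doi:", "")

print(old_result)                               # prefix still present
print(strip_doi_prefix("DOI:10.1234/example"))  # prefix removed
print(strip_doi_prefix("doi:10.1234/example"))  # prefix removed
print(strip_doi_prefix("10.1234/example"))      # unchanged
```

Anchoring with `^` is also slightly stricter than `str.replace`, which would remove the substring wherever it appeared in the identifier.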
Impact
Test plan
🤖 Generated with Claude Code