Publisher-meta scraper fallback + 3 Springer cache refreshes (43 -> 39) #59
Merged
Drive `just validate-references-all` errors from 43 to 39 (and from session-start 101 to 39) by adding a last-resort DOI page scraper to the literature fetcher and refreshing the three Springer caches it unblocks.

Fetcher (src/communitymech/literature.py):
- fetch_publisher_meta_abstract(): GET https://doi.org/<DOI>, follow redirects, and pull the abstract excerpt out of the page's twitter:description / og:description / description meta tag. Springer publishes the first ~200 characters of the abstract in twitter:description even for paywalled articles where Crossref / OpenAlex / Semantic Scholar / Europe PMC have no abstract. Includes on-disk caching as publisher_<safe-doi>.txt and strips the "Journal Name - " prefix Springer adds to that field. Elsevier ScienceDirect intentionally serves a bot-detection page and yields nothing - that's the residual cap.
- fetch_paper() fallback chain now: CrossRef -> PMID -> PMC -> OpenAlex -> Semantic Scholar -> Europe PMC -> publisher meta scrape.

Cache refresh (recovers 4 ERROR rows):
- DOI_10.1007_s10311-019-00911-y (Ewaste copper bioleaching, Springer)
- DOI_10.1007_s10230-008-0059-z (Iberian meromictic pit lakes, Springer)
- DOI_10.1007_BF02106205 (Acidobacterium taxonomy paper, Current Microbiology / Springer; cited 2x in AMD_Acidophile_Heterotroph_Network)

Snippet repairs:
- Ewaste_Bioleaching_Consortium: replace title quote with the abstract's verbatim e-waste bioleaching framing.
- Iberian_Pit_Lake_Stratified_Community: upgrade PARTIAL to SUPPORT and expand the snippet to the abstract's vertical-gradient quote.
- AMD_Acidophile_Heterotroph_Network: replace two title quotes with the abstract's verbatim genus proposal.

Remaining 39 "No content available" errors are all Elsevier 2024-2025 papers (j.jece.2025.120403, j.cej.2024.153492, j.ibiod.2025.106190, 10889868.2024.2407240) plus one ResearchGate preprint - their abstracts are not in any aggregator we can query and the publisher pages serve bot-detection HTML.
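For readers who want the gist of the meta-tag step without opening the diff, here is a minimal sketch of the extraction described in the Fetcher notes above. It is not the code in `literature.py`: the names `extract_meta_abstract` and `_MetaCollector` are hypothetical, the journal name is passed in explicitly (how the real fetcher recognizes the "Journal Name - " prefix is not stated here), and the HTTP GET against `https://doi.org/<DOI>` is left out so the parsing can be exercised offline.

```python
from html.parser import HTMLParser
from typing import Optional

# Preference order taken from the commit message:
# twitter:description, then og:description, then description.
META_KEYS = ("twitter:description", "og:description", "description")


class _MetaCollector(HTMLParser):
    """Collects the first non-empty content of each meta tag of interest."""

    def __init__(self) -> None:
        super().__init__()
        self.found: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # Twitter cards usually use name=..., Open Graph uses property=...
        key = (attr_map.get("name") or attr_map.get("property") or "").lower()
        content = attr_map.get("content", "").strip()
        if key in META_KEYS and content and key not in self.found:
            self.found[key] = content


def extract_meta_abstract(html: str, journal: Optional[str] = None) -> Optional[str]:
    """Return the best available description excerpt, or None if absent.

    `journal` is an assumption of this sketch: when given, a leading
    "<journal> - " prefix is stripped from the excerpt.
    """
    collector = _MetaCollector()
    collector.feed(html)
    for key in META_KEYS:
        text = collector.found.get(key)
        if not text:
            continue
        prefix = f"{journal} - " if journal else ""
        if prefix and text.startswith(prefix):
            text = text[len(prefix):]
        return text
    return None
```

A page with none of the three tags (for example a bot-detection interstitial) returns `None`, which matches the "yields nothing" behavior noted for ScienceDirect.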
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a last-resort DOI landing-page meta-tag scraper to the literature fetcher to recover abstract excerpts when aggregator APIs fail, then refreshes three Springer reference caches and updates the corresponding YAML snippets to use the recovered abstracts.
Changes:
- New `fetch_publisher_meta_abstract()` method that scrapes `twitter:description` / `og:description` / `description` meta tags from `https://doi.org/<DOI>` with on-disk caching, wired into the end of the `fetch_paper()` fallback chain.
- Refreshed three Springer reference caches (`s10311-019-00911-y`, `s10230-008-0059-z`, `BF02106205`) from `unavailable` to `abstract_only` with abstract excerpts.
- Updated three community YAMLs to use verbatim abstract snippets (and upgraded one `PARTIAL` → `SUPPORT`).
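As a rough illustration of the "last-resort fallback with on-disk caching" pattern summarized above, the sketch below tries a list of sources in order and caches one source's result as `publisher_<safe-doi>.txt`. All names (`safe_doi`, `cached`, `first_available`) are hypothetical, and the slash-to-underscore sanitizing rule is an assumption inferred from the reference cache filenames in this PR, not the confirmed scheme. The real `fetch_paper()` chain also includes PMID and PMC steps that are not modeled here.

```python
import re
from pathlib import Path
from typing import Callable, Optional, Sequence, Tuple

# A source takes a DOI and returns abstract text, or None when it has nothing.
Fetcher = Callable[[str], Optional[str]]


def safe_doi(doi: str) -> str:
    """Make a DOI filesystem-safe (assumed rule: non [A-Za-z0-9._-] -> '_')."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi)


def cached(fetch: Fetcher, cache_dir: Path, prefix: str) -> Fetcher:
    """Wrap a source so successful results are stored as <prefix>_<safe-doi>.txt."""

    def wrapper(doi: str) -> Optional[str]:
        path = cache_dir / f"{prefix}_{safe_doi(doi)}.txt"
        if path.exists():
            return path.read_text(encoding="utf-8") or None
        text = fetch(doi)
        if text:  # only cache hits, so a blocked page can be retried later
            cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_text(text, encoding="utf-8")
        return text

    return wrapper


def first_available(
    doi: str, sources: Sequence[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, str]]:
    """Try each (name, fetcher) in order; return the first non-empty result."""
    for name, fetch in sources:
        text = fetch(doi)
        if text:
            return name, text
    return None
```

Placing the scraper last in `sources` keeps it a true last resort: it only runs when every aggregator has returned nothing.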
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/communitymech/literature.py | Adds DOI-page meta-tag scraper fallback and integrates into fetch_paper() chain. |
| references_cache/DOI_10.1007_s10311-019-00911-y.md | Updates content_type and adds abstract excerpt. |
| references_cache/DOI_10.1007_s10230-008-0059-z.md | Updates content_type and adds abstract excerpt. |
| references_cache/DOI_10.1007_BF02106205.md | Updates content_type and adds abstract excerpt. |
| kb/communities/Iberian_Pit_Lake_Stratified_Community.yaml | Upgrades support level and replaces snippet with abstract quote. |
| kb/communities/Ewaste_Bioleaching_Consortium.yaml | Replaces title-based snippet with abstract quote. |
| kb/communities/AMD_Acidophile_Heterotroph_Network.yaml | Replaces two title-based snippets with abstract genus-proposal quote. |
Summary
Drive `just validate-references-all` errors from 43 → 39 (and from session-start 101 → 39, a −62 net) by adding a last-resort DOI-page scraper to the literature fetcher and refreshing the three Springer caches it unblocks.
Fetcher
Cache refreshes (recovers 4 ERROR rows)
Snippet repairs
Remaining 39
All Elsevier 2024-2025 papers (`j.jece.2025.120403`, `j.cej.2024.153492`, `j.ibiod.2025.106190`, `10889868.2024.2407240`) plus one ResearchGate preprint — abstracts not in any aggregator we can query and publisher pages serve bot-detection HTML.
Test plan
🤖 Generated with Claude Code