Publisher-meta scraper fallback + 3 Springer cache refreshes (43 -> 39) #59

Merged
realmarcin merged 1 commit into `main` from `claude/exhaust-fallback` on May 16, 2026

@realmarcin
Contributor

Summary

Drive `just validate-references-all` errors from 43 → 39 (and from session-start 101 → 39, a −62 net) by adding a last-resort DOI-page scraper to the literature fetcher and refreshing the three Springer caches it unblocks.

Fetcher

  • `fetch_publisher_meta_abstract()`: `GET https://doi.org/<DOI>`, follow redirects, and pull the abstract excerpt from the page's `twitter:description` / `og:description` / `description` meta tag. Springer publishes the first ~200 chars of the abstract in `twitter:description` even for paywalled articles where Crossref / OpenAlex / Semantic Scholar / Europe PMC have no abstract. Includes on-disk caching as `publisher_<safe-doi>.txt` and strips the `Journal Name - ` prefix Springer adds to that field. Elsevier ScienceDirect intentionally serves a bot-detection page and yields nothing; that's the residual cap.
  • `fetch_paper()` chain now: CrossRef → PMID/PubMed → PMCID/PMC → OpenAlex → Semantic Scholar → Europe PMC → publisher meta scrape.
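
The meta-tag step described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `src/communitymech/literature.py`: the helper names `extract_meta_abstract` and `strip_journal_prefix`, the regex used for the prefix, and the choice of `html.parser` are all assumptions. It works on an HTML string, so the HTTP request and redirect handling are left out.

```python
import re
from html.parser import HTMLParser

# Meta keys in the priority order listed in the PR description.
META_PRIORITY = ("twitter:description", "og:description", "description")


class _MetaCollector(HTMLParser):
    """Collect the first non-empty content value for each meta key of interest."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        # Pages use either name="..." or property="..." for these tags.
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content", "").strip()
        if key in META_PRIORITY and content and key not in self.found:
            self.found[key] = content


def strip_journal_prefix(text):
    # Hypothetical heuristic: drop one short leading "Journal Name - " segment.
    return re.sub(r"^[^-]{1,80}? - ", "", text, count=1)


def extract_meta_abstract(html):
    """Return the best available description excerpt, or None if absent."""
    collector = _MetaCollector()
    collector.feed(html)
    for key in META_PRIORITY:
        if key in collector.found:
            return strip_journal_prefix(collector.found[key])
    return None
```

The prefix heuristic is deliberately narrow: a journal name that itself contains a hyphen would not be stripped, and an excerpt with an early ` - ` would lose its opening words, so a real implementation may need a stricter rule.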

Cache refreshes (recovers 4 ERROR rows)

  • `DOI_10.1007_s10311-019-00911-y` (E-waste copper bioleaching, Springer)
  • `DOI_10.1007_s10230-008-0059-z` (Iberian meromictic pit lakes, Springer)
  • `DOI_10.1007_BF02106205` (Acidobacterium taxonomy, Current Microbiology / Springer; cited 2× in `AMD_Acidophile_Heterotroph_Network`)

Snippet repairs

  • Ewaste_Bioleaching_Consortium: replace title quote with the abstract's verbatim e-waste bioleaching framing.
  • Iberian_Pit_Lake_Stratified_Community: upgrade PARTIAL→SUPPORT and expand snippet to the abstract's vertical-gradient quote.
  • AMD_Acidophile_Heterotroph_Network: replace two title quotes with the abstract's verbatim genus proposal.

Remaining 39

All remaining errors are Elsevier 2024-2025 papers (`j.jece.2025.120403`, `j.cej.2024.153492`, `j.ibiod.2025.106190`, `10889868.2024.2407240`) plus one ResearchGate preprint: their abstracts are not in any aggregator we can query, and the publisher pages serve bot-detection HTML.

Test plan

  • `just validate-references-all` reports 39 ERROR rows (was 43 / session-start 101); 0 text-mismatch
  • All 3 modified YAMLs still pass `just validate`
  • Caching smoke-tested: first call hits HTTP, second reads from `publisher_<safe-doi>.txt`
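
The caching smoke test in the last item could look roughly like this. It is a sketch under stated assumptions: `cached_publisher_abstract`, `safe_doi`, and the injected `fetch` callable are hypothetical names, and replacing `/` with `_` is only a guess at how the safe DOI is derived. Whether the real fetcher also caches empty results is not stated in this PR; this version does not.

```python
from pathlib import Path


def safe_doi(doi):
    # Assumed sanitization: make the DOI usable as a file name.
    return doi.replace("/", "_")


def cached_publisher_abstract(doi, cache_dir, fetch):
    """Return the cached excerpt if present; otherwise call fetch(doi) and cache it."""
    cache_file = Path(cache_dir) / f"publisher_{safe_doi(doi)}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = fetch(doi)  # stands in for the HTTP request to the DOI landing page
    if text:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(text, encoding="utf-8")
    return text
```

Passing the fetcher in as a parameter lets the check run offline: a counting stub shows that the second call for the same DOI never reaches the network.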

🤖 Generated with Claude Code

Drive `just validate-references-all` errors from 43 to 39 (and from
session-start 101 to 39) by adding a last-resort DOI page scraper to
the literature fetcher and refreshing the three Springer caches it
unblocks.

Fetcher (src/communitymech/literature.py):
- fetch_publisher_meta_abstract(): GET https://doi.org/<DOI>, follow
  redirects, and pull the abstract excerpt out of the page's
  twitter:description / og:description / description meta tag. Springer
  publishes the first ~200 characters of the abstract in
  twitter:description even for paywalled articles where Crossref /
  OpenAlex / Semantic Scholar / Europe PMC have no abstract. Includes
  on-disk caching as publisher_<safe-doi>.txt and strips the
  "Journal Name - " prefix Springer adds to that field. Elsevier
  ScienceDirect intentionally serves a bot-detection page and yields
  nothing - that's the residual cap.
- fetch_paper() fallback chain now: CrossRef -> PMID -> PMC -> OpenAlex
  -> Semantic Scholar -> Europe PMC -> publisher meta scrape.

Cache refresh (recovers 4 ERROR rows):
- DOI_10.1007_s10311-019-00911-y (Ewaste copper bioleaching, Springer)
- DOI_10.1007_s10230-008-0059-z (Iberian meromictic pit lakes, Springer)
- DOI_10.1007_BF02106205 (Acidobacterium taxonomy paper, Current
  Microbiology / Springer; cited 2x in AMD_Acidophile_Heterotroph_Network)

Snippet repairs:
- Ewaste_Bioleaching_Consortium: replace title quote with the abstract's
  verbatim e-waste bioleaching framing.
- Iberian_Pit_Lake_Stratified_Community: upgrade PARTIAL to SUPPORT and
  expand the snippet to the abstract's vertical-gradient quote.
- AMD_Acidophile_Heterotroph_Network: replace two title quotes with the
  abstract's verbatim genus proposal.

Remaining 39 "No content available" errors are all Elsevier 2024-2025
papers (j.jece.2025.120403, j.cej.2024.153492, j.ibiod.2025.106190,
10889868.2024.2407240) plus one ResearchGate preprint - their abstracts
are not in any aggregator we can query and the publisher pages serve
bot-detection HTML.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 16, 2026 10:15

Copilot AI left a comment

Pull request overview

Adds a last-resort DOI landing-page meta-tag scraper to the literature fetcher to recover abstract excerpts when aggregator APIs fail, then refreshes three Springer reference caches and updates the corresponding YAML snippets to use the recovered abstracts.

Changes:

  • New fetch_publisher_meta_abstract() method that scrapes twitter:description/og:description/description meta tags from https://doi.org/<DOI> with on-disk caching, wired into the end of the fetch_paper() fallback chain.
  • Refreshed three Springer reference caches (s10311-019-00911-y, s10230-008-0059-z, BF02106205) from unavailable to abstract_only with abstract excerpts.
  • Updated three community YAMLs to use verbatim abstract snippets (and upgraded one from PARTIAL to SUPPORT).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `src/communitymech/literature.py` | Adds DOI-page meta-tag scraper fallback and integrates into `fetch_paper()` chain. |
| `references_cache/DOI_10.1007_s10311-019-00911-y.md` | Updates content_type and adds abstract excerpt. |
| `references_cache/DOI_10.1007_s10230-008-0059-z.md` | Updates content_type and adds abstract excerpt. |
| `references_cache/DOI_10.1007_BF02106205.md` | Updates content_type and adds abstract excerpt. |
| `kb/communities/Iberian_Pit_Lake_Stratified_Community.yaml` | Upgrades support level and replaces snippet with abstract quote. |
| `kb/communities/Ewaste_Bioleaching_Consortium.yaml` | Replaces title-based snippet with abstract quote. |
| `kb/communities/AMD_Acidophile_Heterotroph_Network.yaml` | Replaces two title-based snippets with abstract genus-proposal quote. |

@realmarcin realmarcin merged commit 4801893 into main May 16, 2026
4 checks passed
@realmarcin realmarcin deleted the claude/exhaust-fallback branch May 16, 2026 10:20