v3.2.0

@syswave-dev syswave-dev released this 25 Jun 15:23
33f1947

Added

  • X-Transcript-Status response header (#37, closes #36). YouTube conversions expose the transcript state (ok / none / blocked / error) as a header, so programmatic consumers can tell a transient block from a genuinely absent transcript without parsing the body.
  • ScienceDaily lead-image recipe. A shipped site recipe for *.sciencedaily.com/releases/** unwraps the article's #text container so the lead image survives extraction (Readability otherwise drops the sibling <figure> that holds it). General sites should still rely on the image frontmatter field (from og:image).
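A consumer of the X-Transcript-Status header might branch on it roughly as follows. This is a minimal sketch, not the project's client code: the header name and its four values come from the note above, while the function name `next_step`, the returned action labels, and the reading of `blocked` as retryable are assumptions made for illustration.

```python
from typing import Mapping

# Header name and the four values are taken from the release note.
HEADER = "X-Transcript-Status"


def next_step(headers: Mapping[str, str]) -> str:
    """Map the transcript status header to a hypothetical client action."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get(HEADER.lower(), "").strip().lower()

    if status == "ok":
        return "use-transcript"
    if status == "none":
        return "no-transcript"   # assumed: the video has no transcript
    if status == "blocked":
        return "retry-later"     # assumed: transient, worth retrying
    if status == "error":
        return "report-failure"
    return "unknown"             # header absent or an unrecognised value
```

Because the decision uses only the header, the response body never has to be parsed to tell a retryable block from an absent transcript.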

Fixed

  • Cache hits now serve complete metadata. The cached-response path rebuilt a minimal {title, url, quality} object, so on a cache hit the frontmatter lost image (og:image/twitter:image), description, author, language, and site, and format=json returned metadata: null. The full metadata persisted at extraction time is now served on cache hits in both the frontmatter and the format=json response; no schema change was needed.
  • Tracking pixels no longer leak into the Markdown. 1×1 invisible beacon images (e.g. republish/analytics counters embedded inline in article text) were kept by Readability and rendered as bogus Markdown images. cleanDom now drops any <img> whose declared width and height are both ≤ 1; genuine wide/tall banners are spared.
  • YouTube: transient 429 distinguished from a missing transcript (#34, closes #33). A per-IP 429 on the timedtext endpoint was mislabeled as a permanent block; it is now classified as transient and the result is not cached.
  • Sidecar: bundle yt_transcript.py in the markitdown image (#35).
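The tracking-pixel rule from the list above can be sketched as a small predicate. This is an illustration of the stated rule (drop an <img> only when its declared width and height are both ≤ 1), not the actual cleanDom code; the names `declared_px` and `is_tracking_pixel`, the tolerance for a trailing "px", and the choice to keep images with a missing or non-numeric dimension are all assumptions.

```python
from typing import Optional


def declared_px(value: Optional[str]) -> Optional[float]:
    """Parse a declared width/height attribute; None if missing or non-numeric."""
    if value is None:
        return None
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    try:
        return float(text)
    except ValueError:
        # Values such as "100%" or "" are treated as undeclared.
        return None


def is_tracking_pixel(width: Optional[str], height: Optional[str]) -> bool:
    """True only when both declared dimensions are at most 1."""
    w = declared_px(width)
    h = declared_px(height)
    if w is None or h is None:
        return False  # assumption: an undeclared size is never dropped
    return w <= 1 and h <= 1
```

Requiring both dimensions to be ≤ 1 is what spares a thin banner such as a 728×1 rule: only one of its dimensions is tiny, so the predicate returns False.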