Added
X-Transcript-Status response header (#37, closes #36). YouTube conversions expose the transcript state (ok / none / blocked / error) as a header, so programmatic consumers can tell a transient block from a genuinely absent transcript without parsing the body.
ScienceDaily lead-image recipe. A shipped site recipe for *.sciencedaily.com/releases/** unwraps the article's #text container so the lead image survives extraction (Readability otherwise drops the sibling <figure> that holds it). General sites should still rely on the image frontmatter field (from og:image).
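As an illustration of how a client might consume the X-Transcript-Status header described above, here is a minimal Python sketch. The helper names `transcript_status` and `should_retry` are hypothetical and not part of this project, and the retry policy (retry only on `blocked`) is an assumption based on the "transient block" wording, not documented behavior.

```python
from typing import Mapping, Optional

# Status values listed in the release notes for X-Transcript-Status.
KNOWN_STATUSES = {"ok", "none", "blocked", "error"}


def transcript_status(headers: Mapping[str, str]) -> Optional[str]:
    """Return the lower-cased header value, or None if it is absent or unrecognized."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive, so compare in lower case.
        if name.lower() == "x-transcript-status":
            status = value.strip().lower()
            return status if status in KNOWN_STATUSES else None
    return None


def should_retry(headers: Mapping[str, str]) -> bool:
    """Assumed policy: only a 'blocked' transcript is worth retrying later."""
    return transcript_status(headers) == "blocked"
```

With this policy a caller would treat `none` as a final answer (the video has no transcript) and schedule a later attempt only for `blocked`. How `error` should be handled is left to the caller.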
Fixed
Cache hits now serve complete metadata. The cached-response path rebuilt a minimal {title, url, quality} object, so on a cache hit the frontmatter lost image (og:image/twitter:image), description, author, language, and site, and format=json returned metadata: null. The full metadata persisted at extraction time is now served on cache hits in both the frontmatter and the format=json response; no schema change is needed.
Tracking pixels no longer leak into the Markdown. 1×1 invisible beacon images (e.g. republish/analytics counters embedded inline in article text) were kept by Readability and rendered as bogus Markdown images. cleanDom now drops any <img> whose declared width and height are both ≤ 1; genuine wide/tall banners are spared.
YouTube: transient 429 distinguished from a missing transcript (#34, closes #33). A per-IP 429 on the timedtext endpoint was mislabeled as a permanent block; it is now classified honestly and not cached.
Sidecar: bundle yt_transcript.py in the markitdown image (#35).
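The tracking-pixel rule from the fix above (drop an <img> only when both declared dimensions are ≤ 1) can be sketched in Python as follows. This is not the project's cleanDom code. `is_tracking_pixel` and `KeptImages` are hypothetical names, and treating a missing or non-numeric dimension as "not declared" (so the image is kept) is an assumption.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


def _declared_px(value: Optional[str]) -> Optional[float]:
    """Parse a width/height attribute such as "1" or "1px"; None if missing or non-numeric."""
    if value is None:
        return None
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2].strip()
    try:
        return float(text)
    except ValueError:
        return None


def is_tracking_pixel(attrs: Dict[str, Optional[str]]) -> bool:
    """True only when BOTH declared dimensions are <= 1."""
    width = _declared_px(attrs.get("width"))
    height = _declared_px(attrs.get("height"))
    if width is None or height is None:
        return False  # assumption: undeclared dimensions are never treated as beacons
    return width <= 1 and height <= 1


class KeptImages(HTMLParser):
    """Collects the src of every <img> that survives the filter."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        if not is_tracking_pixel(attr_map):
            self.sources.append(attr_map.get("src") or "")
```

Requiring both dimensions to be small is what spares a 728×1 divider or a 1×600 sidebar strip: only one of their dimensions is ≤ 1, so the predicate returns False.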