Srcset images not being archived #243

@JubilantJerry

grab-site does not seem to archive the URLs listed in the srcset attributes inside <picture> elements on a Substack blog that I tried the tool on. I believe this may be an issue in wpull.
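For reference, here is a minimal sketch of how srcset candidate URLs could be pulled out of a page using only the Python standard library. It is not wpull's implementation; the names `parse_srcset`, `SrcsetCollector`, and `extract_srcset_urls`, and the example.com markup, are hypothetical and used only for illustration. The parser is a simplified reading of the srcset syntax: it deliberately does not split on every comma, because a candidate URL may itself contain commas.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def parse_srcset(value):
    """Return the candidate URLs in a srcset attribute value (simplified)."""
    urls = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and separator commas before a candidate.
        while pos < n and (value[pos].isspace() or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # The URL is the next run of non-whitespace characters.
        start = pos
        while pos < n and not value[pos].isspace():
            pos += 1
        url = value[start:pos]
        if url.endswith(","):
            # A trailing comma ends the candidate; there is no descriptor.
            url = url.rstrip(",")
        else:
            # Skip the descriptor (e.g. "424w" or "2x") up to the next comma.
            while pos < n and value[pos] != ",":
                pos += 1
        if url:
            urls.append(url)
    return urls


class SrcsetCollector(HTMLParser):
    """Collect absolute srcset URLs from <img> and <source> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("img", "source"):
            return
        for name, value in attrs:
            if name == "srcset" and value:
                for url in parse_srcset(value):
                    self.urls.append(urljoin(self.base_url, url))


def extract_srcset_urls(html, base_url):
    collector = SrcsetCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Comparing the output of something like `extract_srcset_urls` for the saved page HTML against the URLs actually present in the WARC would show which srcset candidates the crawl skipped.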

Step-by-step reproduction instructions

First I run:

grab-site --level=2 --concurrency=20 --page-requisites-level=2 --import-ignores=$(pwd)/ignores 'https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting' 'https://substackcdn.com/bundle/assets/store.modern-3dec36e9.js' 'https://substack-post-media.s3.amazonaws.com/public/images/4206cf36-9fcc-4b06-95e1-d751f9f4c3b7_388x388.jpeg'

I include the other two URLs so that their domains are not considered "offsite".

The contents of the ignores file are:

platform.openai.com
reddit.com
discord.com
discordapp.com
^https?://[^p][^.]+.substack.com
shopify.com
^https://static.airtable.com/esbuild/by_sha
https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget
https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=
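To rule out the ignores file as the cause, the patterns can be checked against the URLs from the command above. This is a rough sketch that assumes each line is treated as a Python regular expression searched anywhere in the full URL; that matching rule is an assumption here, not something confirmed from grab-site's code. `matching_ignores` is a hypothetical helper, and `https://other.substack.com/p/some-post` is a made-up URL used only to show a pattern that does match.

```python
import re

# The lines of the ignores file, copied verbatim.
IGNORE_PATTERNS = [
    r"platform.openai.com",
    r"reddit.com",
    r"discord.com",
    r"discordapp.com",
    r"^https?://[^p][^.]+.substack.com",
    r"shopify.com",
    r"^https://static.airtable.com/esbuild/by_sha",
    r"https://promptingweekly.substack.com/account\?utm_medium=web&utm_source=subscribe-widget",
    r"https://promptingweekly.substack.com/p/[^?/]+\?utm_source=substack&utm_medium=email&utm_content=share&action=share&token=",
]


def matching_ignores(url, patterns=IGNORE_PATTERNS):
    """Return every ignore pattern that matches somewhere in the URL."""
    return [p for p in patterns if re.search(p, url)]


if __name__ == "__main__":
    for url in [
        "https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting",
        "https://substackcdn.com/bundle/assets/store.modern-3dec36e9.js",
        "https://substack-post-media.s3.amazonaws.com/public/images/4206cf36-9fcc-4b06-95e1-d751f9f4c3b7_388x388.jpeg",
        "https://other.substack.com/p/some-post",  # hypothetical URL
    ]:
        print(url, "->", matching_ignores(url) or "not ignored")
```

Under that assumption, none of the three URLs passed to grab-site is ignored, so the ignores file alone would not explain images missing from those hosts.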

Then I open the archive using ReplayWeb.page-2.2.4.AppImage, and navigate to the page: https://promptingweekly.substack.com/p/prompting-principle-if-youre-fighting

You can download the WARC here: https://drive.google.com/file/d/1fJuWwgSTVfh9IdD47RC2lw67tWSryG4S/view?usp=sharing

Appearance of replayed page

Several images on the page are displayed immediately when opening the live site. However, after archiving the page with grab-site and replaying it with ReplayWeb.page, those images do not load and appear as broken images or blank spaces.

Archived: [screenshot]
Live site: [screenshot]

Archived: [screenshot]
Live site: [screenshot]

The same issues are observed with pywb.

In addition, some scripts don't work properly. When navigating to the previous or next blog page, ReplayWeb.page will first display a page saying "Post not found". Refreshing the page will make it load properly (but still with the missing images).

My belief is that both the missing images and the script errors are caused by missing files in the crawl.
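One way to test this would be to list the URLs recorded in the WARC and compare them against the URLs the page needs. The following is a rough, standard-library-only sketch: it scans the decompressed file for `WARC-Target-URI` header lines rather than parsing records properly, so a payload that happens to contain such a line would produce a false entry. A dedicated WARC parsing library would be more robust. `warc_target_uris` and `missing_from_warc` are hypothetical helper names.

```python
import gzip


def warc_target_uris(path):
    """Collect WARC-Target-URI header values from a .warc.gz file."""
    uris = set()
    # gzip.open reads through all concatenated gzip members in the file.
    with gzip.open(path, "rb") as f:
        for line in f:
            if line.lower().startswith(b"warc-target-uri:"):
                value = line.split(b":", 1)[1].strip()
                # Some WARC writers wrap the URI in angle brackets.
                uris.add(value.decode("utf-8", "replace").strip("<>"))
    return uris


def missing_from_warc(path, expected_urls):
    """Return the expected URLs that have no record in the WARC."""
    captured = warc_target_uris(path)
    return [url for url in expected_urls if url not in captured]
```

Calling `missing_from_warc` with the path to the downloaded WARC and the list of image and script URLs taken from the live page would return the ones the crawl never fetched.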

Additional details

I run Ubuntu 20.04 LTS.
