Support downloading seed file from URL #852

Merged: tw4l merged 15 commits into main from issue-841-online-seed-file on Jul 3, 2025

tw4l (Member) commented Jun 17, 2025

Fixes #841

Crawler-side work toward supporting long URL lists in Browsertrix. This PR moves seed handling from the arg parser's validation step to the crawler's bootstrap step so that the seed file can be fetched asynchronously from a URL.
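To illustrate why the move matters: argument validation is synchronous, while downloading a seed file is not. The sketch below is not taken from the PR. The names `isRemoteSeedFile`, `fetchText`, and `loadRemoteSeedText` are hypothetical, and it assumes a runtime with a global `fetch` (such as Node 18+).

```typescript
// Hypothetical sketch: none of these names come from the crawler's codebase.

/** True when the seed file argument is an http(s) URL rather than a local path. */
function isRemoteSeedFile(source: string): boolean {
  try {
    const { protocol } = new URL(source);
    return protocol === "http:" || protocol === "https:";
  } catch {
    // Not parseable as an absolute URL, so treat it as a local path.
    return false;
  }
}

type FetchText = (url: string) => Promise<string>;

// Default fetcher; assumes a global fetch is available.
const fetchText: FetchText = async (url) => {
  const res = await fetch(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch seed file ${url}: HTTP ${res.status}`);
  }
  return res.text();
};

/**
 * Meant to run in an async bootstrap step, not in synchronous arg validation.
 * Returns the raw seed file text for a remote source, or null when the
 * source is a local path that the caller should read from disk instead.
 */
async function loadRemoteSeedText(
  source: string,
  fetcher: FetchText = fetchText,
): Promise<string | null> {
  if (!isRemoteSeedFile(source)) {
    return null;
  }
  return fetcher(source);
}
```

Passing the fetcher as a parameter keeps the function testable without network access; that is a design choice of this sketch, not something the PR states.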

tw4l requested a review from ikreymer June 17, 2025 14:59
tw4l requested a review from ikreymer July 1, 2025 14:54
tw4l force-pushed the issue-841-online-seed-file branch from 4408c5f to ba8041f July 1, 2025 15:05

tw4l (Member, Author) commented Jul 2, 2025

GitHub appears to always return a Content-Type of text/plain for raw files such as https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/custom-behaviors/custom-2.js, so we may want to determine the MIME type from the content we receive rather than trusting the header the web server sends.

This might be harder than it appears at first glance. The mime and mime-types packages check only by extension, while file-type checks by magic number and so doesn't work well for text-based files. We may be better off reverting to checking the file extension, or at least doing so as a fallback when the Content-Type header doesn't match the expected values.

Alternatively we could use a file characterization tool like Siegfried, which may net much more accurate results once the content is written to a file.
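
The extension fallback described above could be sketched roughly as follows. This is an illustration only, not code from the PR (the MIME check changes were later reverted). The `ExpectedFile` shape and `matchesExpectedFile` are hypothetical, and the accepted MIME types and extensions are whatever the caller supplies.

```typescript
// Hypothetical helper; names and shapes are assumptions for illustration.
interface ExpectedFile {
  mimeTypes: string[]; // e.g. ["text/javascript", "application/javascript"]
  extensions: string[]; // e.g. [".js"]
}

function matchesExpectedFile(
  url: string,
  contentType: string | null,
  expected: ExpectedFile,
): boolean {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const [rawMime = ""] = (contentType ?? "").split(";");
  const mime = rawMime.trim().toLowerCase();
  if (expected.mimeTypes.includes(mime)) {
    return true;
  }
  // Fallback: the header did not match, so compare the URL path's extension.
  const pathname = new URL(url).pathname.toLowerCase();
  return expected.extensions.some((ext) => pathname.endsWith(ext));
}
```

With this shape, a raw GitHub URL ending in `.js` that is served as text/plain would still be accepted through the extension check, while a header match alone is enough for URLs without an extension.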

tw4l (Member, Author) commented Jul 2, 2025

@ikreymer I reverted the MIME check changes and think this should be good to go now.

I'm not sure we need to enforce a .txt extension for seed files, since the crawler will fail later anyway if it can't parse any valid seeds from what's passed. It would be nice for the crawler to be flexible enough to accept a file without an extension, even if we do enforce one in the Browsertrix UI.
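
A minimal sketch of that idea: validate the content rather than the file name, and fail only when nothing usable is found. The function name `parseSeedUrls`, the one-URL-per-line format, and the restriction to http(s) URLs are assumptions for illustration; the crawler's actual parsing and error handling may differ.

```typescript
// Hypothetical sketch; not the crawler's real seed parser.
function parseSeedUrls(text: string): string[] {
  const seeds: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const candidate = line.trim();
    if (!candidate) {
      continue; // ignore blank lines
    }
    try {
      const url = new URL(candidate);
      if (url.protocol === "http:" || url.protocol === "https:") {
        seeds.push(url.href);
      }
    } catch {
      // Skip lines that are not absolute URLs.
    }
  }
  if (seeds.length === 0) {
    // The file's extension never mattered; only its content does.
    throw new Error("No valid seeds found in seed file");
  }
  return seeds;
}
```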

tw4l merged commit 2af94ff into main Jul 3, 2025
4 checks passed
tw4l deleted the issue-841-online-seed-file branch July 3, 2025 14:49
Successfully merging this pull request may close these issues.

Support downloading seedFile from online source