Support downloading seed file from URL #852

tw4l · 2025-06-17T14:59:18Z

Fixes #841

Crawler work toward supporting long URL lists in Browsertrix. This PR moves seed handling from the arg parser's validation step to the crawler's bootstrap step so that the seed file can be fetched asynchronously from a URL.
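To illustrate why the move matters: argument validation typically runs synchronously, while downloading a seed file requires awaiting a network request, which fits more naturally in an async bootstrap step. The following is a minimal sketch under stated assumptions, not the code from this PR. The names `parseSeedLines`, `isHttpUrl`, and `loadSeedFile` are hypothetical, and it assumes a seed file holds one URL per line and that the global `fetch` of Node 18+ is available.

```typescript
import { readFile } from "node:fs/promises";

// Hypothetical helper: split seed file text into one entry per non-empty line.
export function parseSeedLines(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Hypothetical helper: decide whether the seed file argument is a remote URL.
export function isHttpUrl(source: string): boolean {
  return /^https?:\/\//i.test(source);
}

// Hypothetical bootstrap-time loader: awaits either a download or a local read.
// This cannot run inside a synchronous validation callback, which is the
// motivation for moving seed handling to an async step.
export async function loadSeedFile(source: string): Promise<string[]> {
  if (isHttpUrl(source)) {
    const resp = await fetch(source);
    if (!resp.ok) {
      throw new Error(`Failed to fetch seed file: HTTP ${resp.status}`);
    }
    return parseSeedLines(await resp.text());
  }
  return parseSeedLines(await readFile(source, "utf8"));
}
```

A caller would `await loadSeedFile(...)` once during startup and pass the resulting list on to whatever builds the scoped seeds.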

tw4l · 2025-07-02T14:49:15Z

It looks like GitHub always returns a content-type of text/plain for raw files such as https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/custom-behaviors/custom-2.js, so we may want to compute the mime type from what we receive rather than relying on what the web server says it is.

This might be harder than it appears at first glance. mime and mime-types check only by extension, while file-type checks by magic number and so doesn't work well for text-based files. We may be better off reverting to checking the file extension, or at least doing so as a fallback when the content-type header doesn't match expected values.

Alternatively we could use a file characterization tool like Siegfried, which may net much more accurate results once the content is written to a file.
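The fallback idea from this comment could be sketched roughly as follows. This is an illustration only, not what was merged (the MIME check was later reverted). The function names, the expected-type lists, and the example URL in the usage note are all assumptions. Only the part of the header before the semicolon is compared, in lowercase, and the URL's file extension is consulted when the header does not match.

```typescript
// Hypothetical helper: reduce a Content-Type header to its bare media type,
// e.g. "Text/Plain; charset=utf-8" becomes "text/plain".
export function mediaType(header: string | null): string {
  if (!header) return "";
  return (header.split(";")[0] ?? "").trim().toLowerCase();
}

// Hypothetical helper: lowercase extension (with dot) of the URL path,
// or an empty string if the last path segment has none.
export function extensionOf(url: string): string {
  const pathname = new URL(url).pathname;
  const base = pathname.slice(pathname.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Accept the response if the server-reported type matches; otherwise fall
// back to the file extension, since some hosts serve every raw file as
// text/plain regardless of its real type.
export function looksLikeExpected(
  url: string,
  contentType: string | null,
  expectedTypes: string[],
  expectedExts: string[],
): boolean {
  if (expectedTypes.includes(mediaType(contentType))) return true;
  return expectedExts.includes(extensionOf(url));
}
```

For example, a `.js` file served as `text/plain; charset=utf-8` from a placeholder URL such as `https://host.example/behaviors/custom-2.js` would still be accepted when `expectedExts` contains `".js"`.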

tw4l · 2025-07-02T19:12:53Z

@ikreymer I reverted the MIME check changes and think this should be good to go now.

I'm not sure that we need to enforce a .txt extension for seed files, since the crawler will later fail anyway if it's not able to parse any valid seeds from what's passed. It seems like it might be nice for the crawler to be flexible enough to accept a file that doesn't have an extension, even if we do enforce that in the Browsertrix UI.
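The reasoning here, accept any file and let the crawl fail later if nothing usable is parsed, might look something like the sketch below. The helper names are hypothetical and the real crawler's seed parsing and scoping is more involved. The `hasQaSource` flag mirrors the "Only fail crawl if no seeds and no qa source" commit in this PR, but how it is wired up here is an assumption.

```typescript
// Hypothetical helper: keep only lines that parse as http(s) URLs,
// regardless of what the seed file was named.
export function parseValidSeeds(lines: string[]): string[] {
  const seeds: string[] = [];
  for (const line of lines) {
    try {
      const url = new URL(line.trim());
      if (url.protocol === "http:" || url.protocol === "https:") {
        seeds.push(url.href);
      }
    } catch {
      // Not a parseable URL: skip it rather than rejecting the whole file.
    }
  }
  return seeds;
}

// Hypothetical check: fail only when there is nothing to crawl at all.
export function assertHasSeeds(seeds: string[], hasQaSource: boolean): void {
  if (seeds.length === 0 && !hasQaSource) {
    throw new Error("No valid seeds specified, aborting crawl.");
  }
}
```

With a check like this in place, the file extension carries no weight: a file without `.txt` that contains valid URLs works, and a `.txt` file with no valid URLs still fails.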

tw4l requested a review from ikreymer June 17, 2025 14:59

ikreymer reviewed Jul 1, 2025

tw4l requested a review from ikreymer July 1, 2025 14:54

tw4l and others added 6 commits July 1, 2025 11:04

Support specifying seed file by URL

70656e7

Fix scope tests

2565fb9

Remove scopedSeeds from replaycrawler

51153fb

Only fail crawl if no seeds and no qa source

2857bd8

Update src/util/file_reader.ts

f216b04

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>

Run lint:fix

ba8041f

tw4l force-pushed the issue-841-online-seed-file branch from 4408c5f to ba8041f Compare July 1, 2025 15:05

ikreymer and others added 6 commits July 1, 2025 11:32

some improved type checking for args data

221ea40

default to 'index.html' if basename is empty

3d985d2

consolidate getting temp file into writeUrlContentsToFile(), check mime types and use default ext ensure exceptions logged correctly using formatErr

cefae50

lowercase

8bce779

Only check content-type before semicolon

e60b8e8

Fix linting

49bcdab

tw4l added 3 commits July 2, 2025 15:03

Revert mimetype checking

ec3c312

Remove extra empty line

fead251

Fix linting

477e778

ikreymer approved these changes Jul 3, 2025

tw4l merged commit 2af94ff into main Jul 3, 2025
4 checks passed

tw4l deleted the issue-841-online-seed-file branch July 3, 2025 14:49
