Extend WARC file with all requests made via all archive methods #130
I've been investigating using pywb. So far it looks promising; I'm just resolving this before pushing it: webrecorder/pywb#434
Have you considered swapping out wget for ArchiveTeam/wpull? It's a Python re-implementation of wget built specifically for web crawling and archiving, and might provide the flexibility you seek.
I have considered it; I just talked to the wpull authors this week in San Francisco.
A quick update: this is still blocked by python-hyper/brotlicffi#146
This can now proceed, as pywb now disables brotli when it's unavailable.
Any updates on this?
There is a way to do this already right now:
Correct me if I am wrong, but I don't think there is currently a way to pass Chrome arguments via the CLI. The following are the only options it reads from the config file (@pirate, lines 219 to 263 in 49faec8).
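For reference, config options like these typically follow a read-from-environment-with-default pattern. A minimal sketch of that pattern (the helper name is illustrative, not ArchiveBox's actual code; only `FETCH_WARC` comes from this thread):

```python
import os

def env_config(name, default):
    # Hypothetical helper: read an option from the environment,
    # falling back to a default. This mirrors the config-reading
    # pattern discussed above, not ArchiveBox's real implementation.
    return os.environ.get(name, default)

# Boolean options arrive as strings, so normalize common "false" spellings.
FETCH_WARC = env_config("FETCH_WARC", "True") not in ("False", "0", "")
```

The limitation raised above is that only a fixed list of names is consulted, so arbitrary Chrome flags cannot be injected this way.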
Right now the `FETCH_WARC` option only creates a simple HTML-file WARC with wget; it doesn't save the requests Chrome headless makes dynamically after JS executes. We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.
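The wiring would look roughly like the following sketch: run warcprox as a recording HTTP proxy, then point Chrome headless at it so every request (including JS-triggered ones) lands in the WARC. The flag names here are assumptions from warcprox's and Chrome's CLIs and may differ by version; check `warcprox --help` before relying on them.

```shell
# Sketch only: wire Chrome headless through a recording warcprox instance.
if command -v warcprox >/dev/null 2>&1; then
    # 1. Start the recording proxy; it writes rolling WARC files to ./warcs.
    warcprox --port 8000 --dir ./warcs &
    PROXY_PID=$!
    # 2. Point Chrome headless at the proxy. --ignore-certificate-errors is
    #    needed because warcprox MITMs HTTPS with its own CA certificate.
    chromium-browser --headless --proxy-server=http://127.0.0.1:8000 \
        --ignore-certificate-errors --dump-dom 'https://example.com' > /dev/null
    kill "$PROXY_PID"
    STATUS=recorded
else
    STATUS=skipped   # warcprox not installed; nothing to record
fi
echo "$STATUS"
```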
In the ideal scenario, the WARC should include:
I think we can record the `wget` WARC first, then use `warcat` to merge it with a warcprox-created WARC containing all the Chrome headless requests.