Extend WARC file with all requests made via all archive methods #130

Open
pirate opened this issue Jan 13, 2019 · 8 comments
Labels
size: hard · status: wip (Work is in-progress / has already been partially completed) · why: functionality (Intended to improve ArchiveBox functionality or features)

Comments

@pirate
Member

pirate commented Jan 13, 2019

Right now the FETCH_WARC option only creates a simple WARC of the HTML file with wget; it doesn't save the requests that headless Chrome makes dynamically after JS executes.

We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.

In the ideal scenario, the WARC should include:

  • √ base html for the page
  • √ all assets like images, styles, fonts, js
  • all dynamically requested assets after JS executes in chrome (e.g. images, ajax requests, etc)
  • any media files requested

I think we can record the wget WARC first, then use warcat to merge it with a warcprox-created WARC containing all the headless Chrome requests.
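As a rough illustration of the merge step, here is a minimal sketch that does not use warcat at all. It relies on the fact that a record-compressed .warc.gz file is a sequence of gzip members, so two such files can be joined by plain concatenation. The function name and the file names in the usage note are hypothetical, and the sketch assumes every input uses the same compression; it does no deduplication or validation.

```python
import gzip
import shutil
from pathlib import Path
from typing import Iterable


def merge_warcs(sources: Iterable[Path], dest: Path) -> Path:
    """Concatenate several .warc.gz files into one file.

    Assumption: every source is gzip-compressed the same way, so the
    result is still a valid multi-member gzip stream of WARC records.
    """
    with open(dest, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                # Copy the raw bytes; no records are parsed or rewritten.
                shutil.copyfileobj(f, out)
    return dest
```

For example, merge_warcs([Path("wget.warc.gz"), Path("chrome.warc.gz")], Path("merged.warc.gz")) would append the proxy-recorded requests to the wget capture. A dedicated tool is still the better choice if records need to be validated or deduplicated.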

@pirate pirate added the size: hard, status: idea-phase (Work is tentatively approved and is being planned / laid out, but is not ready to be implemented yet), and why: functionality (Intended to improve ArchiveBox functionality or features) labels Jan 13, 2019
@pirate
Member Author

pirate commented Jan 28, 2019

I've been investigating using pywb's wayback --proxy-record --proxy archivebox and google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security to pipe all chrome and wget requests into a warc file.
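To make the Chrome side of this concrete, the following sketch builds the command line from the flags quoted above, plus --headless. The helper name chrome_proxy_cmd and its defaults are hypothetical, and it assumes a recording proxy is already listening on the given address; it only builds the argument list and does not launch anything.

```python
from typing import List


def chrome_proxy_cmd(
    url: str,
    proxy: str = "http://localhost:8080",  # assumed address of the recording proxy
    binary: str = "google-chrome",         # assumed binary name on PATH
) -> List[str]:
    """Build a headless Chrome command that sends all traffic through a proxy."""
    return [
        binary,
        "--headless",
        f"--proxy-server={proxy}",
        # The proxy re-signs HTTPS traffic, so certificate checks are relaxed.
        "--ignore-certificate-errors",
        "--disable-web-security",
        url,
    ]
```

The resulting list can be handed to subprocess.run once the proxy is up. Keeping it as a list avoids shell quoting problems with the URL.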

So far it looks promising, I'm just resolving this before pushing it: webrecorder/pywb#434

@brandongalbraith

Have you considered swapping out wget for ArchiveTeam/wpull? It's a python re-implementation of wget specifically for web crawling and archiving, and might provide the flexibility you seek.

@pirate
Member Author

pirate commented Feb 6, 2019

I have considered it; I talked to the wpull authors this week in San Francisco. For now I think we'll stick with wget, because it's nice to keep dependencies to a minimum and many people already have wget installed. We may switch to wpull later in order to use pure Python dependencies when packaging via pip.

@pirate
Member Author

pirate commented Feb 20, 2019

A quick update, this is still blocked by python-hyper/brotlicffi#146

@pirate
Member Author

pirate commented Feb 28, 2019

This can now proceed as pywb now disables brotli when it's unavailable:
webrecorder/pywb#434 (comment)

@goelayu

goelayu commented May 18, 2022

Any updates on this?
Specifically "all requests made during the archiving process (using Chrome for example) are saved to a unified WARC file" seems like a really helpful feature.
@pirate

@pirate
Member Author

pirate commented May 18, 2022

There is already a way to do this:

  1. Uncomment the example pywb proxy server in the docker-compose file
  2. Point Chrome (or any other dependency you want to record) at that proxy with a CLI flag, set via archivebox config CHROME_ARGS

@goelayu

goelayu commented May 19, 2022

Correct me if I am wrong, but I don't think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate

from typing import List


def chrome_args(**options) -> List[str]:
    """helper to build up a chrome shell command with arguments"""
    from .config import CHROME_OPTIONS

    # explicit keyword arguments override the configured defaults
    options = {**CHROME_OPTIONS, **options}

    if not options['CHROME_BINARY']:
        raise Exception('Could not find any CHROME_BINARY installed on your system')

    cmd_args = [options['CHROME_BINARY']]

    if options['CHROME_HEADLESS']:
        cmd_args += ('--headless',)

    if not options['CHROME_SANDBOX']:
        # assume this means we are running inside a docker container
        # in docker, GPU support is limited, sandboxing is unnecessary,
        # and SHM is limited to 64MB by default (which is too low to be usable).
        cmd_args += (
            '--no-sandbox',
            '--disable-gpu',
            '--disable-dev-shm-usage',
            '--disable-software-rasterizer',
            '--run-all-compositor-stages-before-draw',
            '--hide-scrollbars',
            '--single-process',
            '--no-zygote',
        )

    if not options['CHECK_SSL_VALIDITY']:
        cmd_args += ('--disable-web-security', '--ignore-certificate-errors')

    if options['CHROME_USER_AGENT']:
        cmd_args += ('--user-agent={}'.format(options['CHROME_USER_AGENT']),)

    if options['RESOLUTION']:
        cmd_args += ('--window-size={}'.format(options['RESOLUTION']),)

    if options['TIMEOUT']:
        cmd_args += ('--timeout={}'.format(options['TIMEOUT'] * 1000),)

    if options['CHROME_USER_DATA_DIR']:
        cmd_args.append('--user-data-dir={}'.format(options['CHROME_USER_DATA_DIR']))

    return cmd_args
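One way the gap described here could be closed is a pass-through option for arbitrary flags. The sketch below is a standalone, simplified stand-in for the function quoted above, not ArchiveBox code: the CHROME_EXTRA_ARGS key and the helper name are hypothetical, and only two of the existing options are reproduced to keep it short.

```python
from typing import Any, Dict, List


def chrome_args_with_extras(options: Dict[str, Any]) -> List[str]:
    """Build a Chrome command, appending user-supplied extra flags last."""
    if not options.get("CHROME_BINARY"):
        raise ValueError("CHROME_BINARY is not set")

    cmd_args = [options["CHROME_BINARY"]]

    if options.get("CHROME_HEADLESS"):
        cmd_args.append("--headless")

    # Hypothetical option: a list of raw flags such as
    # ["--proxy-server=http://localhost:8080"].
    for arg in options.get("CHROME_EXTRA_ARGS", []):
        if arg not in cmd_args:  # skip flags that are already present
            cmd_args.append(arg)

    return cmd_args
```

With something like this in place, a proxy could be enabled by setting the extra-args option to ["--proxy-server=http://localhost:8080"] instead of patching the function.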
