Extend WARC file with all requests made via all archive methods #130

pirate · 2019-01-13T01:18:28Z

Right now the FETCH_WARC option only creates a simple WARC of the HTML file with wget; it doesn't save the requests made dynamically after JS executes in headless Chrome.

We should set up https://github.com/internetarchive/warcprox so that all requests made during the archiving process are saved to a unified WARC file.

In the ideal scenario, the WARC should include:

- [x] base HTML for the page
- [x] all assets like images, styles, fonts, JS
- [ ] all dynamically requested assets after JS executes in Chrome (e.g. images, AJAX requests, etc.)
- [ ] any media files requested

I think we can record the wget WARC first, then use warcat to merge it with a warcprox-created WARC containing all the headless Chrome requests.
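
As a rough sketch of the merge step: a WARC file is a sequence of independent records, and gzipped WARCs are conventionally compressed one record per gzip member, so two WARCs can usually be combined by plain byte concatenation. The snippet below does that with the standard library instead of warcat (a swapped-in technique, not the tool proposed here); the function names and paths are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Iterable

GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC


def merge_warcs(inputs: Iterable[Path], output: Path) -> Path:
    """Concatenate WARC files byte-for-byte into a single output file.

    Assumes every input is a complete, well-formed WARC and that the inputs
    are either all gzipped or all uncompressed (mixing the two would produce
    an unreadable file).
    """
    paths = [Path(p) for p in inputs]
    output = Path(output)
    if not paths:
        raise ValueError("no input WARC files given")
    if len({is_gzipped(p) for p in paths}) > 1:
        raise ValueError("cannot mix gzipped and uncompressed WARC files")

    with output.open("wb") as out:
        for path in paths:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)
    return output
```

For example, `merge_warcs([Path("wget.warc.gz"), Path("chrome.warc.gz")], Path("unified.warc.gz"))` would write both recordings into one file. Concatenation does not deduplicate records, so assets fetched by both tools would appear twice.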


pirate · 2019-01-28T09:11:43Z

I've been investigating using pywb's `wayback --proxy-record --proxy archivebox` and `google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security` to pipe all Chrome and wget requests into a WARC file.
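
A minimal sketch of how those flags could be wired up from Python. The Chrome flags are the ones quoted above; the helper names, the default binary name, and the use of wget's standard `http_proxy`/`https_proxy` environment variables are assumptions for illustration, not ArchiveBox code.

```python
import os
from typing import Dict, List

# Address of the recording proxy, taken from the command quoted above.
PROXY_URL = "http://localhost:8080"


def chrome_proxy_cmd(url: str, binary: str = "google-chrome", proxy: str = PROXY_URL) -> List[str]:
    """Build a headless Chrome command that routes its traffic through the proxy."""
    return [
        binary,
        "--headless",
        f"--proxy-server={proxy}",
        # The proxy intercepts HTTPS with its own certificate, so the
        # certificate checks are relaxed, as in the quoted command.
        "--ignore-certificate-errors",
        "--disable-web-security",
        url,
    ]


def wget_proxy_env(proxy: str = PROXY_URL) -> Dict[str, str]:
    """Return a copy of the current environment with wget's proxy variables set."""
    env = dict(os.environ)
    env["http_proxy"] = proxy
    env["https_proxy"] = proxy
    return env
```

Either result can be handed to `subprocess.run`, e.g. `subprocess.run(chrome_proxy_cmd(url))` or `subprocess.run(["wget", url], env=wget_proxy_env())`, so that both tools record into the same proxy.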

So far it looks promising; I'm just resolving webrecorder/pywb#434 before pushing it.

brandongalbraith · 2019-02-06T01:41:13Z

Have you considered swapping out wget for ArchiveTeam/wpull? It's a python re-implementation of wget specifically for web crawling and archiving, and might provide the flexibility you seek.

pirate · 2019-02-06T23:36:06Z

I have considered it; I just talked to the wpull authors this week in San Francisco. ~~For now I think we'll stick with wget because it's nice to keep dependencies to a minimum, and many people already have wget installed.~~ We're switching to wpull in order to use pure-Python dependencies when packaging via pip.

pirate · 2019-02-20T06:43:40Z

A quick update: this is still blocked by python-hyper/brotlicffi#146

pirate · 2019-02-28T20:40:23Z

This can now proceed as pywb now disables brotli when it's unavailable:
webrecorder/pywb#434 (comment)

goelayu · 2022-05-18T20:28:44Z

Any updates on this?
Specifically "all requests made during the archiving process (using Chrome for example) are saved to a unified WARC file" seems like a really helpful feature.
@pirate

pirate · 2022-05-18T23:58:02Z

There is already a way to do this right now:

1. Uncomment the example pywb proxy server in the docker-compose file.
2. Enable that proxy via a CLI flag on Chrome or whichever other dependencies you want to use it with (archivebox config CHROME_ARGS).

goelayu · 2022-05-19T14:03:06Z

Correct me if I am wrong, but I don't think there is a way to pass Chrome arguments using the CLI as of now. The following are the only options it reads from the config file. @pirate

ArchiveBox/archivebox/util.py

Lines 219 to 263 in 49faec8

    
    def chrome_args(**options) -> List[str]:
        """helper to build up a chrome shell command with arguments"""

        from .config import CHROME_OPTIONS

        options = {**CHROME_OPTIONS, **options}

        if not options['CHROME_BINARY']:
            raise Exception('Could not find any CHROME_BINARY installed on your system')

        cmd_args = [options['CHROME_BINARY']]

        if options['CHROME_HEADLESS']:
            cmd_args += ('--headless',)

        if not options['CHROME_SANDBOX']:
            # assume this means we are running inside a docker container
            # in docker, GPU support is limited, sandboxing is unnecessary,
            # and SHM is limited to 64MB by default (which is too low to be usable).
            cmd_args += (
                '--no-sandbox',
                '--disable-gpu',
                '--disable-dev-shm-usage',
                '--disable-software-rasterizer',
                '--run-all-compositor-stages-before-draw',
                '--hide-scrollbars',
                '--single-process',
                '--no-zygote',
            )

        if not options['CHECK_SSL_VALIDITY']:
            cmd_args += ('--disable-web-security', '--ignore-certificate-errors')

        if options['CHROME_USER_AGENT']:
            cmd_args += ('--user-agent={}'.format(options['CHROME_USER_AGENT']),)

        if options['RESOLUTION']:
            cmd_args += ('--window-size={}'.format(options['RESOLUTION']),)

        if options['TIMEOUT']:
            cmd_args += ('--timeout={}'.format(options['TIMEOUT'] * 1000),)

        if options['CHROME_USER_DATA_DIR']:
            cmd_args.append('--user-data-dir={}'.format(options['CHROME_USER_DATA_DIR']))
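
To illustrate the gap being described: none of the options read in the excerpt above lets a user append an arbitrary flag such as `--proxy-server`. Below is a minimal sketch of one way extra arguments could be layered on top of the built command, assuming a hypothetical `CHROME_EXTRA_ARGS` option and a hypothetical helper `with_extra_args`; neither appears in the quoted code.

```python
import shlex
from typing import Dict, List, Union


def with_extra_args(
    cmd_args: List[str],
    options: Dict[str, Union[str, List[str], None]],
) -> List[str]:
    """Append user-supplied flags from the hypothetical CHROME_EXTRA_ARGS option.

    Accepts either a list of flags or a single shell-style string, and skips
    flags that are already present so no flag is passed twice.
    """
    extra = options.get("CHROME_EXTRA_ARGS") or []
    if isinstance(extra, str):
        # Split a string like "--a=1 --b" into separate arguments.
        extra = shlex.split(extra)

    result = list(cmd_args)  # copy so the caller's list is not mutated
    for flag in extra:
        if flag not in result:
            result.append(flag)
    return result
```

With something like this in place, setting the option to `--proxy-server=http://localhost:8080` would be enough to route Chrome through the recording proxy without touching the argument builder itself.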

pirate added the size: hard, status: idea-phase, and why: functionality labels on Jan 13, 2019

pirate mentioned this issue Jan 13, 2019

Archive Method: Add WARC file output #6

Closed

pirate mentioned this issue Mar 11, 2019

Archive Interactive Site #166

Closed

pirate added the status: wip label and removed the status: idea-phase label on Mar 11, 2019

This was referenced Mar 15, 2019

Add ability to run JS scripts during archiving with Playwright/Puppeteer #51

Open

Add official support for taking multiple snapshots of websites over time #179

Open

This was referenced Apr 30, 2019

Architecture: Block ads and trackers during archiving #211

Open

Ability to import user-provided/3rd-party WARCs from other archiving services (e.g. if user tries to archive a URL that is already down) #160

Open
