WARC Fixes: Payload hash and CDX records #360

Open
wants to merge 2 commits into
base: master

Commits on Apr 6, 2019

  1. warc: Fix bad payload hash when HTTP response headers have extra whitespace
    
    The payload offset was being obtained by taking len(response.to_bytes()),
    but since leading/trailing whitespace is discarded from the response
    class's name/value pairs, the length of the generated string would not
    necessarily reflect the actual size of the received headers, leading to a
    checksum being calculated from the wrong position in the file.
    
    To prevent this, the WARC recorder will now independently figure out where
    the headers really end in the file.
    Frogging101 committed Apr 6, 2019
    9187761
  2. warc: Fix CDX fields missing with multi-line HTTP headers

    The regex used to find the end of the HTTP headers would not match if
    there was a newline in between, so get_http_header would return nothing
    and the status code and MIME type fields would be empty in the CDX record.
    Frogging101 committed Apr 6, 2019
    451cd2e
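
The two commit messages above describe locating the real end of the HTTP headers in the recorded bytes instead of trusting a re-serialised header block. The following is a minimal sketch of that idea, not the code from this pull request: `HEADER_END`, `find_payload_offset`, and `payload_sha1` are hypothetical names, the sample response is made up, and the hex digest encoding is an arbitrary choice for illustration. The pattern looks only for the blank line that terminates the headers, so folded (multi-line) header values and extra whitespace do not affect it.

```python
import hashlib
import re

# Hypothetical pattern: the header block ends at the first blank line.
# It only looks for the terminator, so folded header lines and stray
# whitespace inside header values cannot make it miss.
HEADER_END = re.compile(rb"\r?\n\r?\n")


def find_payload_offset(raw_response: bytes) -> int:
    """Return the offset of the first payload byte in a raw HTTP response."""
    match = HEADER_END.search(raw_response)
    if match is None:
        raise ValueError("no end-of-headers marker found")
    return match.end()


def payload_sha1(raw_response: bytes) -> str:
    """Hex SHA-1 of the payload, located from the bytes as received."""
    offset = find_payload_offset(raw_response)
    return hashlib.sha1(raw_response[offset:]).hexdigest()


# Made-up response with extra whitespace and a folded header value.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type:   text/plain  \r\n"
    b"X-Folded: first part\r\n"
    b"    second part\r\n"
    b"\r\n"
    b"hello"
)

# What a parser that strips whitespace might re-serialise the headers to.
# Its length differs from the header length actually received.
normalised = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"X-Folded: first part second part\r\n"
    b"\r\n"
)

offset = find_payload_offset(raw)
print(offset, len(normalised), raw[offset:])
```

Using `len(normalised)` as the payload offset here would start the digest inside the header block, which is the kind of mismatch the first commit message describes.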