Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARC Fixes: Payload hash and CDX records #360

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Frogging101
Copy link

@Frogging101 Frogging101 commented Mar 26, 2017

I remembered to:

  • Update or add unit tests if needed
  • Update or add documentation/comments if needed
  • Made sure stray files or whitespace didn't get committed
  • If significant changes, branch from develop and set to merge into develop
  • Read the guidelines for contributing

Changes:

  • Fix a problem where HTTP headers with values containing extra leading/trailing whitespace would cause wpull to calculate an incorrect payload hash.
  • Fix another problem that I noticed after adding the test case for the above issue, where CDX records would be missing status code and MIME type if the HTTP headers had multiple lines (i.e. anything more than the status line).

@chfoo chfoo added the todo label Aug 30, 2017
…espace

The payload offset was being obtained by taking len(response.to_bytes()),
but since leading/trailing whitespace is discarded from the response
class's name/value pairs, the length of the generated string would not
necessarily reflect the actual size of the received headers, leading to a
checksum being calculated from the wrong position in the file.

To prevent this, the WARC recorder will now independently figure out where
the headers really end in the file.
The regex used to find the end of the HTTP headers would not match if
there was a newline in between, so get_http_header would return nothing
and the status code and MIME type fields would be empty in the CDX record.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants