Retain all relevant header information for resources #68

Closed
2 tasks done
nightsh opened this issue Mar 17, 2020 · 2 comments · Fixed by #87

@nightsh
Collaborator

nightsh commented Mar 17, 2020

Because we don't download resources, we currently have no way of knowing basic information about them without hitting each URL.

Downloading them would be costly in both time and disk space, but we can probably fetch only the headers instead.

One idea is to use Scrapy's cache, if that turns out to be possible, but this needs investigation.

Examples of useful headers to fetch for each downloadable file:

  • Content-Type
  • Content-Length
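
To illustrate the "headers only" idea, here is a minimal sketch using an HTTP HEAD request from the Python standard library, so the response body is never transferred. The names `WANTED_HEADERS`, `pick_headers` and `fetch_resource_headers` are hypothetical and not part of the project; inside a Scrapy spider the equivalent would presumably be a `scrapy.Request` with `method="HEAD"`, which is not shown here.

```python
from urllib.request import Request, urlopen

# Headers named in this issue as useful; the tuple is an assumption and can be extended.
WANTED_HEADERS = ("Content-Type", "Content-Length")


def pick_headers(headers, wanted=WANTED_HEADERS):
    """Return only the wanted headers, matching header names case-insensitively."""
    lowered = {str(name).lower(): value for name, value in headers.items()}
    return {name: lowered[name.lower()] for name in wanted if name.lower() in lowered}


def fetch_resource_headers(url, timeout=10):
    """Send a HEAD request so that only headers, not the file body, are fetched."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return pick_headers(response.headers)
```

Note that some servers reject HEAD requests or omit `Content-Length`, so a real implementation would likely need a fallback and should tolerate missing values.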

Acceptance criteria:

  • the JSON dumps generated by scrapers contain headers for all resources
  • all the scrapers still run with no errors
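
The first criterion could be checked with something like the sketch below. The dump layout is an assumption made purely for illustration: it supposes each scraped item has a `resources` list and that each resource stores its headers under a `headers` key. The actual structure of the dumps may differ, and `resources_missing_headers` is a hypothetical helper.

```python
import json


def resources_missing_headers(dump_text, resources_key="resources", headers_key="headers"):
    """Return the resources in a JSON dump that carry no header information.

    The key names are assumptions about the dump layout, not the project's schema.
    """
    data = json.loads(dump_text)
    # Accept either a single item or a list of items at the top level.
    items = data if isinstance(data, list) else [data]
    missing = []
    for item in items:
        for resource in item.get(resources_key, []):
            if not resource.get(headers_key):
                missing.append(resource)
    return missing
```

An empty result would mean every resource in the dump has headers attached.
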
@nightsh
Collaborator Author

nightsh commented Mar 23, 2020

ETA: 3h

@nightsh
Collaborator Author

nightsh commented Mar 30, 2020

Implemented in #87, pending review and merge.
