Support for HTTP header "If-Modified-Since" #637

Closed
jetnet opened this issue Sep 3, 2019 · 2 comments

jetnet commented Sep 3, 2019

Hello Pascal,

it seems the crawler always downloads documents from web servers to check whether they have changed, e.g.:

www.zoo-dresden.de: 2019-08-31 05:13:52 INFO -          DOCUMENT_FETCHED: https://www.zoo-dresden.de/daten/1/slider/Roter-Panda.jpg
www.zoo-dresden.de: 2019-08-31 05:13:52 INFO -       REJECTED_UNMODIFIED: https://www.zoo-dresden.de/daten/1/slider/Roter-Panda.jpg

even though the web server supports the If-Modified-Since header (run the curl command twice to see it):

curl -s -v -o Roter-Panda.jpg -z Roter-Panda.jpg https://www.zoo-dresden.de/daten/1/slider/Roter-Panda.jpg

> GET /daten/1/slider/Roter-Panda.jpg HTTP/1.1
> Host: www.zoo-dresden.de
> User-Agent: curl/7.59.0
> Accept: */*
> If-Modified-Since: Tue, 03 Sep 2019 08:40:31 GMT
>
< HTTP/1.1 304 Not Modified

I quickly searched the git repository for If-Modified-Since, but could not find any place where the header is used.
Could you please clarify that? And if the header is not used, could you please add support for it in a future release in order to minimize download traffic?
Thanks a lot!
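
To make the request concrete, here is a minimal sketch of the conditional GET that the curl call above performs, written in Python with only the standard library. It is not the crawler's code: the function names `if_modified_since_header` and `fetch_if_modified` and the `opener` parameter are hypothetical, chosen for illustration. It assumes the client has stored the time of its last successful fetch.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request


def if_modified_since_header(last_fetched: datetime) -> dict:
    """Build the conditional header from the time of the last fetch (hypothetical helper)."""
    # HTTP dates are expressed in GMT, e.g. "Tue, 03 Sep 2019 08:40:31 GMT".
    as_utc = last_fetched.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


def fetch_if_modified(url, last_fetched, opener=urllib.request.urlopen):
    """Return the body bytes if the document changed, or None on 304 Not Modified."""
    request = urllib.request.Request(url, headers=if_modified_since_header(last_fetched))
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged, nothing downloaded".
        if err.code == 304:
            return None
        raise
```

A caller would treat `None` as "unmodified" and skip the document without transferring its body, which is the traffic saving asked for here.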


jetnet commented Sep 3, 2019

I guess that, in the meantime, the Metadata Fetcher can be used to check whether remote documents have changed:

image

I'll try it!

essiembre commented

You are correct that adding the HTTP Metadata fetcher will save you from downloading the files, as long as Last-Modified is provided in the HTTP response (but it costs an extra call).

If-Modified-Since is indeed not supported right now. I am all for reducing traffic where we can.
I am marking this as a feature request.
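
For comparison, the metadata-first workaround discussed above can be sketched roughly as follows. This is only an illustration in Python, not the collector's implementation. It assumes the metadata step is a HEAD request, that the decision is based solely on the Last-Modified response header, and that `last_fetched` is a timezone-aware datetime. The names `is_modified` and `head_then_get` are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.request


def is_modified(last_modified_header, last_fetched: datetime) -> bool:
    """Decide from a Last-Modified header value whether a re-download is needed."""
    if not last_modified_header:
        return True  # no header: the server gives no hint, so download
    try:
        remote = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable date: be conservative and download
    return remote > last_fetched


def head_then_get(url, last_fetched, opener=urllib.request.urlopen):
    """HEAD first (the extra call), then GET only if the document looks newer."""
    with opener(urllib.request.Request(url, method="HEAD")) as response:
        last_modified = response.headers.get("Last-Modified")
    if not is_modified(last_modified, last_fetched):
        return None
    with opener(urllib.request.Request(url)) as response:
        return response.read()
```

Unchanged documents cost one small HEAD request instead of a full download, while changed documents cost two requests. A conditional GET with If-Modified-Since avoids that second round trip.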

@essiembre essiembre added this to the 3.0.0 milestone Sep 11, 2019
@essiembre essiembre added this to To do in Version 3.0.0 Dec 24, 2019
@essiembre essiembre moved this from To do to In progress in Version 3.0.0 Feb 18, 2020
@essiembre essiembre moved this from In progress to Done in Version 3.0.0 Feb 29, 2020