
Returns 200 and success even when the URL requested is down or offline the second time [Cache issue?] #26

Open
kringo opened this issue Aug 9, 2022 · 3 comments
Labels
help wanted Extra attention is needed

Comments


kringo commented Aug 9, 2022

Hello,

First of all, great work creating a declarative scraper. We were testing out the worker using Docker, started with

docker run -d -p 8080:8080 montferret/worker

and it's running great. We sent a POST request to the endpoint above with the payload below and got 200 OK, which is good.

{
  "text": "LET doc = DOCUMENT(@url, { driver: \"cdp\", userAgent: \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome 76.0.3809.87 Safari/537.36\"}) RETURN {}",
  "params": { "url": "http://192.168.0.10/test" }
}
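For anyone who wants to reproduce this, here is a minimal sketch of the same request in Python (stdlib only). The endpoint path `/` on port 8080 is an assumption taken from the `docker run` command above; adjust it to whatever route your worker build actually exposes.

```python
import json
import urllib.request

# The same payload as above, built programmatically so the FQL text
# and the long User-Agent string stay readable.
query = (
    'LET doc = DOCUMENT(@url, { driver: "cdp", '
    'userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome 76.0.3809.87 '
    'Safari/537.36"}) RETURN {}'
)
payload = {"text": query, "params": {"url": "http://192.168.0.10/test"}}
body = json.dumps(payload).encode("utf-8")

def send(endpoint: str = "http://localhost:8080/") -> int:
    """POST the query to the worker and return the HTTP status code.

    NOTE: the exact endpoint path is an assumption; check your
    worker's API before relying on it.
    """
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Sending this twice, with the target taken down between the two calls, is enough to observe the behavior described below.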

However, the problem is that when the URL is down/offline (we intentionally took http://192.168.0.10/test down), we still get the same 200 OK. It looks like the previously successful response is cached, since http://192.168.0.10/test was up when the very first request went through. (If we restart the Docker container while http://192.168.0.10/test is down and then send a fresh request, it returns net::ERR_ADDRESS_UNREACHABLE as expected.)

Not sure if this is due to Chrome caching or ferret caching?

If it is a cache, is there a way to disable it so that every request hits the live URL instead of the cached version?

If there is such a flag, how do we pass it to the Docker image?

Appreciate your help, thanks in advance.


kringo commented Aug 16, 2022

@ziflex @3timeslazy any thoughts?


ziflex commented Nov 16, 2022

Hey, sorry for the late reply.

Worker caches compiled queries only.
Chrome itself might cache the target page though.
On the other hand, by default Ferret uses Incognito mode for each page, so you should not be getting a previously fetched cached version unless you set params.keepCookies to true.
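To make that explicit in a request, the cookie-retention option can be spelled out in the query. A sketch (the `keepCookies` name comes from the comment above; where exactly it is accepted may vary by version, so treat the placement here as an assumption):

```python
import json

# Query that explicitly opts OUT of reusing cookies/session state,
# passed as a DOCUMENT option alongside the driver. "keepCookies: false"
# mirrors the default Incognito behavior described above.
query = (
    'LET doc = DOCUMENT(@url, { driver: "cdp", keepCookies: false }) '
    'RETURN {}'
)
payload = {"text": query, "params": {"url": "http://192.168.0.10/test"}}
print(json.dumps(payload, indent=2))
```

If the stale 200 persists even with cookie reuse disabled, that would point at Chrome's own page cache rather than anything Ferret or the worker keeps.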

@ziflex ziflex added the help wanted Extra attention is needed label Nov 16, 2022

kringo commented Dec 1, 2022

Thanks. That's interesting, let's grab a new build and try it out again.
