Continue to use content after exception PhantomJS #729
Comments
Your page has PhantomJS return an exit value of I just deployed a new 2.9.1-SNAPSHOT release that adds the ability to specify which exit values should be considered "valid". I tried your page with the following addition to the <validExitCodes>0,1</validExitCodes> Please try and confirm.
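For readers unfamiliar with the idea behind `validExitCodes`, here is a minimal Python sketch of treating a configurable set of exit codes as success when running an external process. It is not the collector's actual implementation; the function name `run_renderer`, the `VALID_EXIT_CODES` constant, and the timeout value are assumptions made for illustration only.

```python
import subprocess
import sys

# Hypothetical default, mirroring <validExitCodes>0,1</validExitCodes> above.
VALID_EXIT_CODES = frozenset({0, 1})


def run_renderer(command, valid_exit_codes=VALID_EXIT_CODES, timeout=60):
    """Run an external command; fail only on exit codes not listed as valid."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    if result.returncode not in valid_exit_codes:
        raise RuntimeError(
            f"Unexpected exit code {result.returncode}: {result.stderr.strip()}"
        )
    return result


if __name__ == "__main__":
    # Stand-in for the real renderer: a child process that exits with code 1.
    outcome = run_renderer([sys.executable, "-c", "import sys; sys.exit(1)"])
    print("accepted exit code:", outcome.returncode)
```

Accepting an exit code this way only stops the caller from aborting; it does not guarantee that the process produced usable output, so the output still needs to be checked separately.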
Pascal, thank you for getting back to me. I am not sure why the return codes are different, but on my machine it was 143. I added this to the config, and the good news is that the crawler finished. The bad news is that it seems to be missing the tmp file.
Spidex: 2020-12-14 13:15:32 INFO - PhantomJS screenshot enabled: false
Spidex: 2020-12-14 13:15:44 ERROR - Cannot fetch document: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (/tmp/1607951732879000000 (No such file or directory))
That's interesting. I had that error a few times and it seemed intermittent. When I run it with the debugger (pausing at each line around where it writes that file), the file is always there. That leads me to believe PhantomJS sometimes returns before that file is written (pausing in the debugger gives it extra time). I made the crawler look again for that missing file every 0.5 seconds for up to 10 seconds, and that fixed it for me (this has been added to the code as well). I wonder if one of the reasons could be that the page has not finished rendering when PhantomJS returns, which could be caused by timeouts being reached. Look at the timeout options and see if increasing them significantly changes anything. If that does not change anything for you, we'll have to look deeper into why this happens.
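The retry described above (check every 0.5 seconds, give up after 10 seconds) can be sketched roughly as follows in Python. This is an illustration of the polling idea, not the code that was added to the collector; the function name `wait_for_file` is hypothetical, and requiring the file to be non-empty is an extra assumption that the comment above does not mention.

```python
import time
from pathlib import Path


def wait_for_file(path, timeout=10.0, interval=0.5):
    """Poll until `path` is a non-empty file; return False if `timeout` elapses."""
    target = Path(path)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Assumption: an empty file counts as "not written yet".
            if target.is_file() and target.stat().st_size > 0:
                return True
        except FileNotFoundError:
            # The file vanished between the two checks; keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would read the file only when `wait_for_file(...)` returns `True` and report the fetch as failed otherwise. Polling hides a short delay between process exit and file creation, but it cannot help if the file is never written at all.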
For me it is constant. I raised timeout to
This made no difference. No files in /tmp.
@OkkeKlein, do you still have the issue? I just tried and now I get a 404 when trying to access the page directly, so I cannot reproduce. Were there some changes done to the site?
Yes @essiembre I still need a fix. The domain moved to production. https://www.voordeelvloeren.nl/klantenservice/retourneren Thank you and best wishes.
Hello @OkkeKlein, I am getting a "Not Found" error from PhantomJS, which seems accurate since I get the same thing in my browser, as per the attached picture. Has the page disappeared? Do you have another URL that I can use to reproduce the original issue?
@essiembre The URL works for me. Maybe the developer was working on it. Please try again. Or try https://voordeelvloeren-development.web.app/klantenservice as an alternative.
I tried with both URLs and they are no longer failing. On the other hand, what I get for content is:
The PhantomJS logs have this:
It seems it has trouble interpreting the JavaScript on your page. This is a recurring issue (and increasingly frequent). PhantomJS has not been updated in quite a while (it is no longer supported by its authors) and may already be too dated. Some people have created polyfills to help with JavaScript interpreter issues on more recent JavaScript versions/libraries, but I fear there is a limit to how far the old PhantomJS can be hacked to keep it compatible with newer sites. I think your options are getting more and more limited with PhantomJS. You can try to research workarounds by injecting scripts that help with the page rendering, or you can try HTTP Collector v3. HTTP Collector v3 has native browser support (Chrome, Firefox, etc.). It is available for download but not "officially" released yet (snapshot releases only), so using it depends on your comfort level. Another approach is to create your own Finally, you can try the PhantomJS support community at https://github.com/ariya/phantomjs/issues but I am not sure how active it is.
OK. Because the content changed since I created this issue, it is hard to replicate. But the original issue (which you confirmed on the 15th of December) was not with PhantomJS but with the fact that the collector could not find the temp file anymore. The temp file at that time contained the content I needed. Feel free to close this issue, as the original issue can't be replicated anymore.
OK, I could not reproduce the original issue once I applied my fix to it (but you kept having it). If it pops up again, feel free to reopen, or create a new ticket.
When running the PhantomJS command below and reviewing the content of the temp file, I can see that the HTML in there has the content I am looking for. I would like to ignore the exception and work with the content that was stored in the temp file.
Please advise.
Thank you!