Continue to use content after exception PhantomJS #729

Closed
OkkeKlein opened this issue Dec 10, 2020 · 11 comments

@OkkeKlein

OkkeKlein commented Dec 10, 2020

When I run the PhantomJS command below, the fetch fails with an exception, yet the temp file it writes contains the HTML I am looking for. I would like to ignore the exception and work with the content that was stored in the temp file.

Please advise.

Thank you!

Exception in thread "StreamConsumer-STDOUT" com.norconex.commons.lang.io.StreamException: Problem consuming input stream.
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:102)
Caused by: java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:97)
ERROR [SystemCommand] Command returned with exit value 143 (command properly escaped?). Command: /opt/phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /opt/crawler/scripts/phantom.js https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ /tmp/1607605936369000000 3000 -1 https   1.0 Error: ""
ERROR [PhantomJSDocumentFetcher] PhantomJS:
  ReferenceError: Can't find variable: fetch
INFO  [PhantomJSDocumentFetcher] PhantomJS:
  ERROR TypeError: undefined is not a constructor (evaluating '_angular_core__WEBPACK_IMPORTED_MODULE_0__["ɵɵpureFunction0"](2, _c0).includes(link_r1.type)')

    https://voordeelvloeren-c005d.web.app/vendor-es5.js:26884 in defaultErrorLogger
@essiembre
Contributor

Your page causes PhantomJS to return an exit value of 1. A non-zero exit value is considered an error, so the page is rejected.

I just deployed a new 2.9.1-SNAPSHOT release that adds the ability to specify which exit values should be considered "valid". I tried your page with the following addition to the PhantomJSDocumentFetcher and it worked just fine:

<validExitCodes>0,1</validExitCodes>

Please try and confirm.
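
To illustrate the idea behind `validExitCodes`, here is a minimal Java sketch of how a comma-separated list such as `0,1` could be turned into a set and checked against a process exit value. The class and method names (`ValidExitCodes`, `parse`, `isAccepted`) are hypothetical and chosen for illustration; this is not the collector's actual implementation.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only (not Norconex code).
class ValidExitCodes {

    // Parses a comma-separated list such as "0,1" into a set of integers.
    static Set<Integer> parse(String csv) {
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    // Returns true when the exit value should not be treated as an error.
    static boolean isAccepted(int exitValue, Set<Integer> validCodes) {
        return validCodes.contains(exitValue);
    }

    public static void main(String[] args) {
        Set<Integer> valid = parse("0,1");
        System.out.println(isAccepted(1, valid));   // true
        System.out.println(isAccepted(143, valid)); // false
    }
}
```

With a check like this, an exit value that is not in the configured list (for example 143) would still be rejected unless it is added explicitly.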

@OkkeKlein
Author

Pascal, thank you for getting back to me.

I am not sure why the return codes differ, but on my machine it was 143, so I added that value to the config. The good news is that the crawler finished. The bad news is that it seems to be missing the tmp file.

Spidex: 2020-12-14 13:15:32 INFO - PhantomJS screenshot enabled: false
Spidex: 2020-12-14 13:15:34 ERROR - Command returned with exit value 143 (command properly escaped?). Command: /opt/phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /opt/crawler/scripts/phantom.js https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ /tmp/1607951732879000000 3000 -1 https 1.0 30000 Error: ""
Spidex: 2020-12-14 13:15:34 ERROR - PhantomJS:
ReferenceError: Can't find variable: fetch
Spidex: 2020-12-14 13:15:34 INFO - PhantomJS:
ERROR TypeError: undefined is not a constructor (evaluating '_angular_core__WEBPACK_IMPORTED_MODULE_0__["ɵɵpureFunction0"](2, _c0).includes(link_r1.type)')

https://voordeelvloeren-c005d.web.app/vendor-es5.js:26884 in defaultErrorLogger

https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in requiredApisAvailable
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in fn
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in hn
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in getOrInitializeService
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in getImmediate

Spidex: 2020-12-14 13:15:44 ERROR - Cannot fetch document: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (/tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: REJECTED_ERROR: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (com.norconex.collector.core.CollectorException: java.io.FileNotFoundException: /tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: Could not process document: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (java.io.FileNotFoundException: /tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: DOCUMENT_COMMITTED_REMOVE: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/
Spidex: 2020-12-14 13:15:44 INFO - Spidex: 100% completed (1 processed/1 total)
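
As a side note on the differing exit values: on Linux, a process terminated by a signal is conventionally reported with an exit value of 128 plus the signal number, so 143 corresponds to signal 15 (SIGTERM). This suggests, but does not prove, that PhantomJS was terminated from outside rather than exiting on its own; a program can also return such a value explicitly. The sketch below decodes an exit value under that convention. The class name `ExitValueInfo` and its method are hypothetical.

```java
// Hypothetical helper, for illustration only.
class ExitValueInfo {

    // Describes an exit value using the common Unix convention
    // "128 + signal number" for processes terminated by a signal.
    static String describe(int exitValue) {
        if (exitValue > 128 && exitValue <= 128 + 64) {
            return "terminated by signal " + (exitValue - 128);
        }
        return "exited with status " + exitValue;
    }

    public static void main(String[] args) {
        System.out.println(describe(143)); // terminated by signal 15
        System.out.println(describe(1));   // exited with status 1
    }
}
```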

@essiembre
Contributor

That's interesting. I had that error a few times and it seemed intermittent. When I run it with the debugger (pausing at each line around where it writes that file), the file is always there. That leads me to believe PhantomJS sometimes returns before that file is written, and that pausing in the debugger gives it enough time to finish. I made the crawler look again for the missing file every 0.5 seconds for up to 10 seconds, and that fixed it for me (the change has been added to the code as well).
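
The retry described above (checking for the file every 0.5 seconds for up to 10 seconds) can be sketched in Java as follows. This is an illustration under assumptions, not the code that was added to the collector; `FileWaiter` and `waitForFile` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;

// Hypothetical helper, for illustration only (not Norconex code).
class FileWaiter {

    // Polls until the file exists or the timeout elapses.
    // Returns true if the file was found, false on timeout or interruption.
    static boolean waitForFile(Path file, Duration interval, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.exists(file)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // The temp file path from the logs is used here only as an example.
        Path tmp = Paths.get("/tmp/1607951732879000000");
        boolean found = waitForFile(tmp, Duration.ofMillis(500), Duration.ofSeconds(10));
        System.out.println(found ? "File is ready" : "File never appeared");
    }
}
```

Polling like this only helps when the file is written late; if PhantomJS never writes the file at all, the wait simply times out.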

I wonder if one of the reasons could be that the page has not finished rendering when PhantomJS returns, which could be caused by timeouts being reached. Look at the timeout options and see if increasing them significantly changes anything.

If that does not change anything for you, we'll have to look deeper into why this happens.

@OkkeKlein
Author

OkkeKlein commented Dec 15, 2020

For me it happens every time. I raised the timeouts to:

   <renderWaitTime>300000</renderWaitTime>
   <resourceTimeout>600000</resourceTimeout>

This made no difference. There are still no files in /tmp.

@essiembre
Contributor

@OkkeKlein, do you still have the issue? I just tried and now I get a 404 when trying to access the page directly, so I cannot reproduce it. Were any changes made to the site?

@OkkeKlein
Author

OkkeKlein commented Jan 4, 2021

Yes @essiembre I still need a fix. The domain moved to production. https://www.voordeelvloeren.nl/klantenservice/retourneren

Thank you and best wishes.

@essiembre
Contributor

Hello @OkkeKlein, I am getting a "Not Found" error from PhantomJS, which seems accurate since I get the same thing in my browser, as per the attached picture.

image

Has the page disappeared? Do you have another URL that I can use to reproduce the original issue?

@OkkeKlein
Author

@essiembre The URL works for me. Maybe the developer was working on it.

Please try again. Or try https://voordeelvloeren-development.web.app/klantenservice as an alternative.

@essiembre
Contributor

I tried with both URLs and they are no longer failing. On the other hand, what I get for content is:

Please enable JavaScript to continue using this application.

The PhantomJS logs have this:

ReferenceError: Can't find variable: globalThis

It seems PhantomJS has trouble interpreting the JavaScript on your page. This is a recurring issue, and it is becoming more frequent. PhantomJS has not been updated in quite a while (it is no longer supported by its authors), so it may already be too dated. Some people have created polyfills to help with JavaScript interpreter issues on more recent JavaScript versions and libraries, but I fear there is a limit to how far the old PhantomJS can be hacked to keep it compatible with newer sites.

I think your options are getting more and more limited with PhantomJS. You can try to research workarounds by injecting scripts that help with the page rendering, or you can try with HTTP Collector v3.

HTTP Collector v3 has native browser support (Chrome, Firefox, etc.). It is available for download but not "officially" released yet (snapshot releases only), so using it depends on your comfort level.

Another approach is to create your own IHttpDocumentFetcher, which would use a local browser or some other JavaScript interpreter. At that point, though, it is probably easier to migrate to v3.

Finally, you can try the PhantomJS support community at https://github.com/ariya/phantomjs/issues but I am not sure how active it is.

@OkkeKlein
Author

OkkeKlein commented Jan 15, 2021

OK. Because the content has changed since I created this issue, it is hard to replicate. But the original issue (which you confirmed on the 15th of December) was not with PhantomJS itself but with the fact that the collector could no longer find the temp file. At that time, the temp file contained the content I needed.

Feel free to close this issue, as the original issue can't be replicated anymore.

@essiembre
Contributor

OK, I could not reproduce the original issue once I applied my fix to it (but you kept having it). If it pops up again, feel free to reopen, or create a new ticket.
