Continue to use content after exception PhantomJS #729

OkkeKlein · 2020-12-10T13:30:11Z

When I run the PhantomJS command below, an exception is thrown, but the temp file still contains the HTML I am looking for. I would like to ignore the exception and work with the content that was stored in the temp file.

Please advise.

Thank you!

Exception in thread "StreamConsumer-STDOUT" com.norconex.commons.lang.io.StreamException: Problem consuming input stream.
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:102)
Caused by: java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:336)
        at java.io.FilterInputStream.read(FilterInputStream.java:107)
        at com.norconex.commons.lang.io.InputStreamConsumer.run(InputStreamConsumer.java:97)
ERROR [SystemCommand] Command returned with exit value 143 (command properly escaped?). Command: /opt/phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /opt/crawler/scripts/phantom.js https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ /tmp/1607605936369000000 3000 -1 https   1.0 Error: ""
ERROR [PhantomJSDocumentFetcher] PhantomJS:
  ReferenceError: Can't find variable: fetch
INFO  [PhantomJSDocumentFetcher] PhantomJS:
  ERROR TypeError: undefined is not a constructor (evaluating '_angular_core__WEBPACK_IMPORTED_MODULE_0__["ɵɵpureFunction0"](2, _c0).includes(link_r1.type)')

    https://voordeelvloeren-c005d.web.app/vendor-es5.js:26884 in defaultErrorLogger

essiembre · 2020-12-14T03:24:55Z

Your page causes PhantomJS to return an exit value of 1. A non-zero exit value is considered an error and the page is rejected.

I just deployed a new 2.9.1-SNAPSHOT release that adds the ability to specify which exit values should be considered "valid". I tried your page with the following addition to the PhantomJSDocumentFetcher and it worked just fine:

<validExitCodes>0,1</validExitCodes>

Please try and confirm.
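
The check that `validExitCodes` enables can be pictured roughly as follows. This is a minimal sketch, not the collector's actual code: the `ExitCodeCheck` class and its method names are hypothetical, and only the comma-separated format (`0,1`) and the default of `0` come from this thread.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only (not part of the collector API).
final class ExitCodeCheck {
    private ExitCodeCheck() {}

    // Parses a comma-separated list such as "0,1".
    // Blank or missing input falls back to the default of {0}.
    static Set<Integer> parse(String csv) {
        if (csv == null || csv.trim().isEmpty()) {
            return Set.of(0);
        }
        return Arrays.stream(csv.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .map(Integer::valueOf)
                .collect(Collectors.toSet());
    }

    // A command result is accepted only if its exit value is in the set.
    static boolean isAccepted(int exitValue, Set<Integer> validCodes) {
        return validCodes.contains(exitValue);
    }
}
```

With `parse("0,1")`, an exit value of 1 would be accepted while any other non-zero value would still be rejected.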

OkkeKlein · 2020-12-14T13:21:00Z

Pascal, thank you for getting back to me.

Not sure why the return codes are different, but on my machine it was 143. So I added this to the config, and the good news is the crawler finished. The bad news is that it seems to be missing the tmp file.
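
One possible reason for the different codes: on Linux, an exit value above 128 usually means the process was terminated by a signal, with the signal number being the exit value minus 128. Under that convention, 143 corresponds to signal 15 (SIGTERM), which would suggest PhantomJS was killed rather than exiting on its own. The sketch below only illustrates the arithmetic; `ExitValueInfo` is a hypothetical name and not part of the collector.

```java
// Hypothetical helper, for illustration only.
final class ExitValueInfo {
    private ExitValueInfo() {}

    // Interprets an exit value using the common Unix "128 + signal" convention.
    static String describe(int exitValue) {
        if (exitValue > 128 && exitValue < 256) {
            return "terminated by signal " + (exitValue - 128);
        }
        return "exited with status " + exitValue;
    }
}
```

For example, `ExitValueInfo.describe(143)` reports signal 15, while `ExitValueInfo.describe(1)` reports a plain exit status of 1.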

Spidex: 2020-12-14 13:15:32 INFO - PhantomJS screenshot enabled: false
Spidex: 2020-12-14 13:15:34 ERROR - Command returned with exit value 143 (command properly escaped?). Command: /opt/phantomjs-2.1.1-linux-x86_64/bin/phantomjs --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file=/tmp/cookies.txt --load-images=false /opt/crawler/scripts/phantom.js https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ /tmp/1607951732879000000 3000 -1 https 1.0 30000 Error: ""
Spidex: 2020-12-14 13:15:34 ERROR - PhantomJS:
ReferenceError: Can't find variable: fetch
Spidex: 2020-12-14 13:15:34 INFO - PhantomJS:
ERROR TypeError: undefined is not a constructor (evaluating '_angular_core__WEBPACK_IMPORTED_MODULE_0__["ɵɵpureFunction0"](2, _c0).includes(link_r1.type)')

https://voordeelvloeren-c005d.web.app/vendor-es5.js:26884 in defaultErrorLogger

https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in requiredApisAvailable
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in fn
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in hn
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in getOrInitializeService
https://www.gstatic.com/firebasejs/7.21.1/firebase-performance-standalone.js:1 in getImmediate

Spidex: 2020-12-14 13:15:44 ERROR - Cannot fetch document: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (/tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: REJECTED_ERROR: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (com.norconex.collector.core.CollectorException: java.io.FileNotFoundException: /tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: Could not process document: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/ (java.io.FileNotFoundException: /tmp/1607951732879000000 (No such file or directory))
Spidex: 2020-12-14 13:15:44 INFO - Spidex: DOCUMENT_COMMITTED_REMOVE: https://voordeelvloeren-c005d.web.app/klantenservice/retourneren/
Spidex: 2020-12-14 13:15:44 INFO - Spidex: 100% completed (1 processed/1 total)

essiembre · 2020-12-15T06:28:15Z

That's interesting. I had that error a few times and it seemed intermittent. When I run it with the debugger (pausing at each line around where it writes that file) the file is always there. That leads me to believe PhantomJS sometimes returns before that file is written (when I debug that gives it time while I pause). I made the crawler re-look for that missing file every 0.5 seconds for up to 10 seconds and it fixed it for me (that's been added to the code as well).
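
The retry described above (look for the file every 0.5 seconds, for up to 10 seconds) could be sketched like this. It is an illustration under assumptions, not the code that was committed: `FileWaiter` and `waitForFile` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper, for illustration only (not the collector's implementation).
final class FileWaiter {
    private FileWaiter() {}

    // Polls for the file until it exists or the timeout elapses.
    // Returns true as soon as the file is found, false on timeout.
    static boolean waitForFile(Path file, Duration timeout, Duration interval)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (Files.exists(file)) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Sleep for one interval, but never past the deadline.
            long remainingMillis = Math.max(1L, remainingNanos / 1_000_000L);
            Thread.sleep(Math.min(interval.toMillis(), remainingMillis));
        }
    }
}
```

A call matching the numbers above would be `FileWaiter.waitForFile(path, Duration.ofSeconds(10), Duration.ofMillis(500))`. Polling like this only helps if the file is written late; it cannot recover a file that PhantomJS never writes.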

I wonder if one of the reasons could be that the page has not finished rendering when PhantomJS returns, which could be caused by timeouts being reached. Look at the timeout options and see if increasing them significantly changes anything.

If that does not change anything for you, we'll have to look deeper into why this happens.

OkkeKlein · 2020-12-15T10:11:33Z

For me it is constant. I raised the timeouts to

   <renderWaitTime>300000</renderWaitTime>
   <resourceTimeout>600000</resourceTimeout>

This made no difference. No files in /tmp

essiembre · 2021-01-04T03:36:46Z

@OkkeKlein, do you still have the issue? I just tried and now I get a 404 when trying to access the page directly, so I cannot reproduce. Were there some changes done to the site?

OkkeKlein · 2021-01-04T10:11:01Z

Yes @essiembre I still need a fix. The domain moved to production. https://www.voordeelvloeren.nl/klantenservice/retourneren

Thank you and best wishes.

essiembre · 2021-01-07T03:31:33Z

Hello @OkkeKlein, I am getting a "Not Found" error from PhantomJS, which seems accurate since I get the same thing in my browser, as per the attached picture.

Has the page disappeared? Do you have another URL that I can use to reproduce the original issue?

OkkeKlein · 2021-01-07T10:24:16Z

@essiembre The URL works for me. Maybe the developer was working on it.

Please try again. Or try https://voordeelvloeren-development.web.app/klantenservice as an alternative.

essiembre · 2021-01-11T19:15:55Z

I tried with both URLs and they are no longer failing. On the other hand, what I get for content is:

Please enable JavaScript to continue using this application.

The PhantomJS logs have this:

ReferenceError: Can't find variable: globalThis

It seems PhantomJS has trouble interpreting the JavaScript on your page. This is a recurring issue, and it is becoming more frequent. PhantomJS has not been updated in quite a while (it is no longer supported by its authors), so it may already be too dated. Some people have created polyfills to help with JavaScript interpreter issues on more recent JavaScript versions and libraries, but I fear there is a limit to how far the old PhantomJS can be hacked to keep it compatible with newer sites.

I think your options are getting more and more limited with PhantomJS. You can try to research workarounds by injecting scripts that help with the page rendering, or you can try with HTTP Collector v3.

HTTP Collector v3 has native browser support (Chrome, Firefox, etc.). It is available for download but not "officially" released yet (snapshot releases only), so using it depends on your comfort level.

Another approach is to create your own IHttpDocumentFetcher, which would use a local browser or some other JavaScript interpreter. At that point though, it is probably easier to migrate to v3.

Finally, you can try the PhantomJS support community at https://github.com/ariya/phantomjs/issues but I am not sure how active it is.

OkkeKlein · 2021-01-15T10:59:27Z

OK. Because the content has changed since I created this issue, it is hard to replicate. But the original issue (which you confirmed on the 15th of December) was not with PhantomJS but with the fact that the collector could not find the temp file anymore. The temp file at that time contained the content I needed.

Feel free to close this issue, as the original issue can't be replicated anymore.

essiembre · 2021-01-19T03:19:08Z

OK, I could not reproduce the original issue once I applied my fix to it (but you kept having it). If it pops up again, feel free to reopen, or create a new ticket.

essiembre added a commit that referenced this issue Dec 14, 2020

New "validExitCodes" on PhantomJSDocumentFetcher to support other than 0 (default). #729

85998d3

essiembre closed this as completed Jan 19, 2021
