
PhantomJSFetcher returns -1 StatusCode #383

Closed · popthink opened this issue Sep 4, 2017 · 6 comments
@popthink commented Sep 4, 2017

I found that PhantomJS returns a -1 status code for some URLs, even though the connection itself succeeds.

I set a breakpoint in CmdGrabber to track down the problem.

The URLs that return a -1 status code produce no output at all, so the grabber cannot process them.

My guess is that a JavaScript error on the page breaks phantom.js.

I'll send the URLs by email if you can take a look.

Thank you.

@popthink (Author) commented Sep 5, 2017

  • Sometimes PhantomJSFetcher hangs.

This is because of PhantomJS: the PhantomJS process never returns, and PhantomJSFetcher waits forever for it to finish.

Thread [pool-1-thread-1] (Suspended)	
	waiting for: UNIXProcess  (id=199)	
	Object.wait(long) line: not available [native method]	
	UNIXProcess(Object).wait() line: 502	
	UNIXProcess.waitFor() line: 395	
	ExecUtil.watchProcess(Process, InputStream, IInputStreamListener[], IInputStreamListener[]) line: 159	
	SystemCommand.execute(InputStream, boolean) line: 299	
	SystemCommand.execute(boolean) line: 230	
	SystemCommand.execute() line: 211	
	PhantomJSDocumentFetcher.fetchDocument(HttpClient, HttpDocument) line: 493	
	DocumentFetcherStage.executeStage(HttpImporterPipelineContext) line: 42	
	DocumentFetcherStage(AbstractImporterStage).execute(ImporterPipelineContext) line: 31	
	DocumentFetcherStage(AbstractImporterStage).execute(Object) line: 24	
	HttpImporterPipeline(Pipeline<T>).execute(T) line: 91	
	HttpCrawler.executeImporterPipeline(ICrawler, ImporterDocument, ICrawlDataStore, BaseCrawlData, BaseCrawlData) line: 358	
	HttpCrawler(AbstractCrawler).processNextQueuedCrawlData(BaseCrawlData, ICrawlDataStore, boolean) line: 521	
	HttpCrawler(AbstractCrawler).processNextReference(ICrawlDataStore, JobStatusUpdater, boolean) line: 407	
	AbstractCrawler$ProcessReferencesRunnable.run() line: 789	
	ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142	
	ThreadPoolExecutor$Worker.run() line: 617	
	Thread.run() line: 745	

OS: Linux (Debian)
PhantomJS: 2.1
Norconex: 2.7.1
Rendering wait: default (3000)
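For reference, the stack trace above shows the fetcher thread blocked in Process.waitFor(), which waits indefinitely. A minimal sketch (not Norconex code) of how Java 8's bounded waitFor(long, TimeUnit) caps such a wait; the Unix sleep command stands in for a stuck PhantomJS process, matching the Linux environment above:

```java
import java.util.concurrent.TimeUnit;

public class BoundedWait {
    public static void main(String[] args) throws Exception {
        // Stand-in for a PhantomJS process that never exits on its own.
        Process p = new ProcessBuilder("sleep", "60").start();

        // Unlike the plain waitFor() seen in the stack trace, the bounded
        // variant returns false once the timeout elapses instead of
        // blocking the worker thread forever.
        boolean finished = p.waitFor(2, TimeUnit.SECONDS);
        if (!finished) {
            // Kill the stuck process so the crawler thread can move on.
            p.destroyForcibly();
            System.out.println("timed out; process killed");
        }
    }
}
```

The thread below reaches a similar outcome from the PhantomJS side instead, via its resourceTimeout setting.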

@essiembre (Contributor) commented:
Are you using it only for HTML pages (best) or for ALL pages? One thing you can try is setting a short-enough timeout in PhantomJS. You can play with the /scripts/phantom.js script. Try putting this line anywhere below where var page = ... is declared:

page.settings.resourceTimeout = 3000;

See if it makes any difference. Otherwise please share a URL that can help reproduce the problem if you can.

@popthink (Author) commented Sep 5, 2017

Some pages return a 'fail' status even though page.content is collected properly (200 response code).

So I modified phantom.js as follows:


page.open(url, function (status) {
    if (status !== 'success') {
        system.stderr.writeLine('Unable to load: ' + url + ' (status=' + status + ').');
        system.stderr.writeLine('Content: ' + page.content);
        // Even on 'fail', write out whatever content was collected.
        if (page.content) {
            fs.write(outfile, page.content, 'w');
        }
        phantom.exit();
    } else {
        window.setTimeout(function () {
            if (thumbnailFile) {
                page.render(thumbnailFile);
            }
            fs.write(outfile, page.content, 'w');
            phantom.exit();
        }, timeout);
    }
});

I don't know why PhantomJS misbehaves here (I think it is a PhantomJS problem).

Anyway, this handles the condition where the content was crawled but PhantomJS returned 'fail'.

I sent them by email.

Thank you.

  • The resourceTimeout option resolved the hanging. Thank you.

@essiembre (Contributor) commented:
Thanks! I have added your page.content fix to the latest snapshot version, along with the ability to specify a <resourceTimeout> (milliseconds) in XML configuration. Can you please test and confirm when you have a chance?

@popthink (Author) commented Sep 6, 2017

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
<documentFetcher
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
    detectContentType="true" detectCharset="true">
  <exePath>/path/to/phantomjs</exePath>
  <scriptPath>/path/to/phantom.js</scriptPath>
  <validStatusCodes>200, -1</validStatusCodes>
  <renderWaitTime>3000</renderWaitTime>
  <resourceTimeout>5000</resourceTimeout>
</documentFetcher>

Done!

It works well.

I added '-1' to validStatusCodes so that URLs returning '-1' despite a 200 response are still accepted.

Thank you.

@essiembre (Contributor) commented:

Great! Thanks for confirming.
