
PhantomJSFetcher returns -1 StatusCode #383

Closed · popthink opened this issue Sep 4, 2017 · 6 comments
@popthink commented Sep 4, 2017

I found that PhantomJS returns a -1 status code for some URLs, even though the connection itself succeeds.

I set a breakpoint in CmdGrabber to track down the problem.

The URLs that return a -1 status code produce no output at all, so the grabber cannot process them.

My guess is that a JavaScript error on the page breaks phantom.js.

I'll send the URLs by email if you can take a look.

Thank you.

@popthink (Author) commented Sep 5, 2017

  • Sometimes PhantomJSFetcher hangs.

This is because of PhantomJS: the PhantomJS process never returns, and PhantomJSFetcher waits forever for it to finish.

Thread [pool-1-thread-1] (Suspended)	
	waiting for: UNIXProcess  (id=199)	
	Object.wait(long) line: not available [native method]	
	UNIXProcess(Object).wait() line: 502	
	UNIXProcess.waitFor() line: 395	
	ExecUtil.watchProcess(Process, InputStream, IInputStreamListener[], IInputStreamListener[]) line: 159	
	SystemCommand.execute(InputStream, boolean) line: 299	
	SystemCommand.execute(boolean) line: 230	
	SystemCommand.execute() line: 211	
	PhantomJSDocumentFetcher.fetchDocument(HttpClient, HttpDocument) line: 493	
	DocumentFetcherStage.executeStage(HttpImporterPipelineContext) line: 42	
	DocumentFetcherStage(AbstractImporterStage).execute(ImporterPipelineContext) line: 31	
	DocumentFetcherStage(AbstractImporterStage).execute(Object) line: 24	
	HttpImporterPipeline(Pipeline<T>).execute(T) line: 91	
	HttpCrawler.executeImporterPipeline(ICrawler, ImporterDocument, ICrawlDataStore, BaseCrawlData, BaseCrawlData) line: 358	
	HttpCrawler(AbstractCrawler).processNextQueuedCrawlData(BaseCrawlData, ICrawlDataStore, boolean) line: 521	
	HttpCrawler(AbstractCrawler).processNextReference(ICrawlDataStore, JobStatusUpdater, boolean) line: 407	
	AbstractCrawler$ProcessReferencesRunnable.run() line: 789	
	ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142	
	ThreadPoolExecutor$Worker.run() line: 617	
	Thread.run() line: 745	

OS: Linux (Debian)
PhantomJS: 2.1
Norconex: 2.7.1
Rendering wait: default (3000)
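For reference, the stack trace above shows the fetcher thread blocked in Process.waitFor(), which waits indefinitely. A minimal sketch (not Norconex code) of how Java 8's bounded waitFor(long, TimeUnit) caps such a wait; the Unix sleep command stands in for a stuck PhantomJS process, matching the Linux environment above:

```java
import java.util.concurrent.TimeUnit;

public class BoundedWait {
    public static void main(String[] args) throws Exception {
        // Stand-in for a PhantomJS process that never exits on its own.
        Process p = new ProcessBuilder("sleep", "60").start();

        // Unlike the plain waitFor() seen in the stack trace, the bounded
        // variant returns false once the timeout elapses instead of
        // blocking the worker thread forever.
        boolean finished = p.waitFor(2, TimeUnit.SECONDS);
        if (!finished) {
            // Kill the stuck process so the crawler thread can move on.
            p.destroyForcibly();
            System.out.println("timed out; process killed");
        }
    }
}
```

The thread below reaches a similar outcome from the PhantomJS side instead, via its resourceTimeout setting.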

@essiembre (Contributor) commented:
Are you using it only for HTML pages (best) or for ALL pages? One thing you can try is setting a short-enough timeout in PhantomJS. You can play with the /scripts/phantom.js script. Try putting this line anywhere below where var page = ... is declared:

page.settings.resourceTimeout = 3000;

See if it makes any difference. Otherwise please share a URL that can help reproduce the problem if you can.

@popthink (Author) commented Sep 5, 2017

Some pages return a 'fail' status even though page.content is collected properly (200 response code).

So I modified phantom.js as follows:


page.open(url, function (status) {
    if (status !== 'success') {
        system.stderr.writeLine('Unable to load: ' + url + ' (status=' + status + ').');
        system.stderr.writeLine('Content: ' + page.content);
        // Even on 'fail', write out whatever content was collected.
        if (page.content) {
            fs.write(outfile, page.content, 'w');
        }
        phantom.exit();
    } else {
        window.setTimeout(function () {
            if (thumbnailFile) {
                page.render(thumbnailFile);
            }
            fs.write(outfile, page.content, 'w');
            phantom.exit();
        }, timeout);
    }
});

I don't know why PhantomJS misbehaves here (I think it is a PhantomJS problem).

Anyway, this handles the condition where the content was crawled but PhantomJS returned 'fail'.

I sent them by email.

Thank you.

  • The resourceTimeout option resolved the hanging. Thank you.

@essiembre (Contributor) commented:
Thanks! I have added your page.content fix to the latest snapshot version, along with the ability to specify a <resourceTimeout> (milliseconds) in XML configuration. Can you please test and confirm when you have a chance?

@popthink (Author) commented Sep 6, 2017

<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
<documentFetcher
    class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher"
    detectContentType="true" detectCharset="true">
  <exePath>/path/to/phantomjs</exePath>
  <scriptPath>/path/to/phantom.js</scriptPath>
  <validStatusCodes>200, -1</validStatusCodes>
  <renderWaitTime>3000</renderWaitTime>
  <resourceTimeout>5000</resourceTimeout>
</documentFetcher>

Done!

It works well.

I added '-1' to validStatusCodes so that URLs returning '-1' despite a 200 response are still accepted.

Thank you.

@essiembre (Contributor) commented:

Great! Thanks for confirming.
