
PhantomJS Page Screen Shot Configuration #512

Closed
HappyCustomers opened this issue Aug 20, 2018 · 17 comments


HappyCustomers commented Aug 20, 2018

Please find attached the configuration for the HTTP Collector. I am trying to capture screenshots of the web pages being crawled. I need help with the following issues:

1. Storing the image path in the MySQL database, along with the content data, in the crawlerimage table.
2. I am not able to get images of all the pages on the website.
3. How to fetch page images of an HTTPS website.

Thank you

@HappyCustomers (Author)

One more observation: every time I rerun the crawler on the same website, it gives me a different number of screenshots. It is not consistent with the number of URLs in the database table.

@essiembre (Contributor)

  1. You are currently specifying that you want screenshots saved to disk. If you want them stored in the database instead, you can use this:
     <screenshotStorage>inline</screenshotStorage>
     <screenshotStorageInlineField>MyImageField</screenshotStorageInlineField>
  2. PhantomJS will not be able to take screenshots of all pages. First, for non-HTML pages, it typically tries to download the files, so they are not "rendered" (nothing is displayed to screenshot). Also, some pages are generated with JavaScript only after a certain delay or certain user interactions. It can be difficult to have such a page fully rendered in an automated way so you can take a screenshot.

  3. If HTTPS does not work out of the box, you can try using the HTTP Client proxy, as described in the class documentation here.

As for the varying number of images, are you starting fresh each time (wiping out your DB and the crawler workdir)? If not, it will perform incremental indexing by default, so the number may vary each time. If you are concerned about having more URLs than images, it may be that for some URLs the screenshots could not be obtained.
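A disk-based counterpart of the inline option, which stores each screenshot file's path in a document field, could look like the following sketch (the field name "MyImagePathField" and the directory are illustrative; the elements are those of the 2.x PhantomJSDocumentFetcher):

```xml
<!-- Sketch: save screenshots to disk and record each file's path in a
     document field ("MyImagePathField" is an illustrative name). -->
<screenshotStorage>disk</screenshotStorage>
<screenshotStorageDiskDir structure="url2path">./screenshots</screenshotStorageDiskDir>
<screenshotStorageDiskField>MyImagePathField</screenshotStorageDiskField>
```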


HappyCustomers commented Aug 21, 2018

Thank you for the quick response.

  1. I want to store the images on disk and the image folder path in a database field.
  2. I want images of web pages only, not of other document types like PDF or Word.
  3. I will try the HTTP Client proxy.
  4. Varying number of images: each time I run a fresh crawl on the same website, for example www.xyz.com with, say, 20 web pages, the first run gives me screenshots of 12 of the 20 pages. On a second fresh run (after deleting the work folder and the data in the MySQL table), I get 8-10 pages. By "varying" I mean I do not get the same number of screenshots, and I am not getting screenshots of all the pages: one run captures certain pages and another run captures different ones.

I need screenshots of all the pages of given website to show to the user the webpage thumbnails, while displaying the URLs.

Also, I tried this config:
    <screenshotStorage>inline</screenshotStorage>
    <screenshotStorageInlineField>image</screenshotStorageInlineField>

CREATE TABLE ${tableName} (
wid INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (wid),
${targetReferenceField} VARCHAR(3000) NOT NULL,
${targetContentField} LONGTEXT,
title VARCHAR(1000),
keywords VARCHAR(2000),
description VARCHAR(2000),
Server VARCHAR(500),
classification VARCHAR(500),
Date VARCHAR(100),
Last_Modified VARCHAR(100),
document_reference VARCHAR(3000),
image BLOB
)

I am getting the following error, "ERROR - Screenshot file not created for", and the image is not being stored in the database. I am using the same configuration file attached earlier.
Thank you

@essiembre (Contributor)

This is what you had then. The path should be stored in your "image" field, according to your config.

As for why you get an inconsistent number of screenshots, I am not sure. Do you have any indication in the logs? Have you tried increasing the PhantomJS-related timeouts to significantly higher values?


HappyCustomers commented Aug 22, 2018

Yes, you are right that the path should be stored in the "image" field; however, the path is not getting stored. The field is null.

I tested again with the configuration below, and the image path is not stored in the database field "imagepath". Please advise.

		<exePath>D:\hh_dev\noroconex\norconex-collector-http-2.8.0\bin\phantomjs.exe</exePath>
		<scriptPath>D:\hh_dev\noroconex\norconex-collector-http-2.8.0\scripts\phantom.js</scriptPath>
		<resourceTimeout>30000</resourceTimeout>
		<validStatusCodes>200,302,403</validStatusCodes>
		<notFoundStatusCodes>404</notFoundStatusCodes>
		<referencePattern>^http://.*</referencePattern>
		<referencePattern>^https://.*</referencePattern>
		<renderWaitTime>30000</renderWaitTime>
		<screenshotDimensions>1600X900</screenshotDimensions>
		<screenshotZoomFactor>1</screenshotZoomFactor>
		<screenshotScaleDimensions>1000</screenshotScaleDimensions>
		<screenshotScaleStretch>false</screenshotScaleStretch>
		<screenshotScaleQuality>max</screenshotScaleQuality>
		<screenshotImageFormat>jpg</screenshotImageFormat>
		
		<!-- <screenshotStorage>inline</screenshotStorage> -->
		<!-- <screenshotStorageInlineField>screenshot</screenshotStorageInlineField> -->
		
		<screenshotStorage>disk</screenshotStorage>
		<screenshotStorageDiskDir structure="url2path">./hhi_2000JPG/screenshot</screenshotStorageDiskDir>
		<screenshotStorageDiskField>imagepath</screenshotStorageDiskField>
	</documentFetcher>

CREATE TABLE ${tableName} (
wid INT NOT NULL AUTO_INCREMENT,
PRIMARY KEY (wid),
${targetReferenceField} VARCHAR(3000) NOT NULL,
${targetContentField} LONGTEXT,
title VARCHAR(1000),
keywords VARCHAR(2000),
description VARCHAR(2000),
Server VARCHAR(500),
classification VARCHAR(500),
Date VARCHAR(100),
Last_Modified VARCHAR(100),
document_reference VARCHAR(3000),
imagepath LONGTEXT
)

@essiembre (Contributor)

I had a second look at your config. The reason the image paths are not stored is that you are getting rid of the "image" field by not listing it in the KeepOnlyTagger. If you add "image" there, it will be committed to your SQL table.
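A sketch of what that KeepOnlyTagger entry could look like (the other field names are illustrative; exact placement and syntax depend on your Importer configuration and version):

```xml
<!-- Sketch: inside the importer handlers, keep every field you commit,
     including the screenshot path field "image". -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
  <fields>title,keywords,description,image</fields>
</tagger>
```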


HappyCustomers commented Aug 23, 2018

Thank you very much; the image path is now saved in the MySQL database.

I found another issue while indexing using PhantomJS: whenever there is an error in the PhantomJS screen capture, that page does not get indexed. Below is the screen-capture error; the affected page does not get indexed in MySQL. Please advise.

I would like the page to be indexed even if there is an error in the page image capture.

Example: 2018-08-23 15:52:20 ERROR - Command returned with exit value 1 (command properly escaped?). Command: cmd.exe /C "D:\hh_dev\noroconex\norconex-collector-http-2.8.0\bin\phantomjs.exe --ssl-protocol=any --ignore-ssl-errors=true --web-security=false --cookies-file="C:\Users\lenovo\AppData\Local\Temp\cookies.txt" --load-images=true "D:\hh_dev\noroconex\norconex-collector-http-2.8.0\scripts\phantom.js" "http://www.xyz.com/?page_id=29" "C:\Users\lenovo\AppData\Local\Temp\1535019705996000001" 30000 -1 http "C:\Users\lenovo\AppData\Local\Temp\1535019705996000000.png" "1600.0x900.0" 1.0 30000" Error: "http://www.xyz.com/wp-content/themes/veniteck/style.css?ver=3.8.1: Operation canceled"
Example: 2018-08-23 15:52:20 ERROR - PhantomJS:
http://www.xyz.com/wp-content/themes/veniteck/style.css?ver=3.8.1: Operation canceled
ReferenceError: Can't find variable: google
Hello Hotel Example: 2018-08-23 15:52:20 INFO - PhantomJS:

undefined:1 in eval code
:0 in eval

Example: 2018-08-23 15:52:20 ERROR - Screenshot file not created for http://www.xyz.com/?page_id=29

@essiembre (Contributor)

As a test, if you disable screenshots, does the page get indexed? I would like to confirm whether the page failing to load has anything to do with taking a screenshot.


HappyCustomers commented Aug 26, 2018

Yes, I retested just now: if I disable the screenshot config, the collector indexes the page in question. Once I enable the documentFetcher, the log shows the above error and the page does not get indexed. I have attached the config file for your testing.
Thank you

@essiembre (Contributor)

I tested with your latest config, and sometimes it works, sometimes it does not. What makes a difference are the timeout values (<renderWaitTime> and <resourceTimeout>): the higher the values (10, 20, or even 30 seconds), the fewer errors I get.

When I get errors, I seem to get them regardless of whether the screenshot is enabled or not, but they happen more frequently when it is enabled. Loading images increases the total page load time, so you hit the specified maximum more regularly. When screenshots are disabled, images are not loaded by PhantomJS, so it goes faster.

I recommend you tell PhantomJS to wait significantly longer. If that makes your crawl slower, you can increase the number of threads to compensate (if you have enough resources for that).
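As a sketch, the longer waits go inside the document fetcher, while the thread count is a crawler-level setting (the values below are illustrative, not recommendations):

```xml
<!-- Sketch: illustrative values only. renderWaitTime/resourceTimeout are
     PhantomJSDocumentFetcher settings; numThreads is set on the crawler. -->
<renderWaitTime>60000</renderWaitTime>
<resourceTimeout>60000</resourceTimeout>
<!-- ...elsewhere, at the crawler level: -->
<numThreads>4</numThreads>
```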

@HappyCustomers (Author)

Sorry for the delay in responding. I have tried setting higher values for the wait time and timeout. However, whenever there is an error taking a screenshot with PhantomJS, that page is consistently not indexed; all other pages are indexed. My observation is that even with a 180-second timeout, there is an error while taking the screenshot of the page and the page does not get indexed.

Is there a way to configure the page to be indexed even when PhantomJS fails?

@essiembre (Contributor)

There is possibly something that can be done. Do you have a specific URL that consistently works fine with screenshots disabled but always fails when they are enabled?


HappyCustomers commented Sep 5, 2018

Sorry for the delay in responding. You can try this URL:
http://www.xyz.com/?page_id=29. It consistently fails when the screenshot is enabled.

@essiembre (Contributor)

OK, with that URL I can reproduce it, but it does not always fail for me. And when the screenshot fails, the content IS most often processed as expected.

Once in a while, though, the content is not obtained. It appears to happen only when PhantomJS fails to download the page, that is, when the return code from PhantomJS is 1 and I get the following:

ERROR [SystemCommand] Command returned with exit value 1 [...]
  ReferenceError: Can't find variable: google
INFO  [PhantomJSDocumentFetcher] PhantomJS:
  
    undefined:1 in eval code
    :0 in eval

In such a case, there is no file to process (no download), which is why the document is not committed. It does seem to occur only when the screenshot is enabled (or maybe screenshots just make it fail more frequently).

Short of fixing PhantomJS, we can only try to work around this. The only options I can think of are an optional parameter that specifies how many times to retry a failed page, or retrying without the screenshot enabled when it fails with it. I can turn this into a feature request if either of these options would work for you. If you have another approach to suggest, let me know.

In the meantime, I am afraid the only available (non-coding) workaround is to recrawl the site more frequently in the hope that bad pages eventually go through (they will be retried on each crawl).

If the HTTP response last-modified date can be relied upon on that site, you can enable an HTTP metadata fetcher to check whether a page has changed before re-downloading it, making your re-crawls faster.
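Enabling a metadata fetcher is a one-line addition to the crawler configuration; a sketch, assuming the 2.x GenericMetadataFetcher class:

```xml
<!-- Sketch: have the crawler issue a metadata (headers-only) request first,
     so pages unchanged per their HTTP headers are not fully re-downloaded. -->
<metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" />
```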


HappyCustomers commented Sep 6, 2018

Thanks for your response. The issue is that I am getting the PhantomJS error on a considerable number of web pages, resulting in those pages not getting indexed. Today, one of the websites failed on its index page itself, so the whole website did not get indexed.

Also, recrawling the website does not guarantee the indexation of a page.

If any one page is not indexed, it results in wrong search results in our application.

As mentioned earlier, if the screenshot is disabled, the pages get indexed. Is there a way to continue indexing the pages irrespective of the output of the screenshot process, something like:

<ignore>on DocumentFetch Error</ignore>

If not, can you please add a feature to continue indexing the pages even if there is an error in the screenshot process.

Thanks

@essiembre (Contributor)

Right now, no: when that specific scenario occurs with PhantomJS, the crawler will not continue with the content, since no content was produced by PhantomJS. As we have limited control over how PhantomJS behaves on certain errors, we will need to add to the crawler the ability to retry a URL when it fails. I am marking this as a feature request.

In the meantime, you can always try to learn the scripting API used by PhantomJS and modify the phantom.js script to change its current behavior on failure (if possible).


HappyCustomers commented Sep 7, 2018

Thanks for the quick response. I am getting the PhantomJS errors on a considerable number of websites, preventing me from moving forward with taking screenshots and indexing them. I request you to add the feature to index the page even if there is a PhantomJS screenshot error. I have sent you by email the configuration of another website where you can reproduce the error.
