PhantomJS Page Screen Shot Configuration #512
Comments
One more observation: every time I rerun the crawler on the same website, it gives me a different number of screenshots. It is not consistent with the number of URLs in the database table.
As for the varying number of images, are you starting fresh each time (wiping out your DB and crawler workdir)? If not, it will perform incremental indexing by default, so the number may vary each time. If you are concerned about having more URLs than images, it may be that the screenshots could not be obtained for some URLs.
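A fresh, non-incremental crawl can be forced by deleting the crawler state before each run. A minimal shell sketch, assuming the working directory is `./workdir` and the committer writes to a `crawlerimage` table (both names are assumptions; match them to your own config):

```shell
# Path of the crawler's <workDir> -- an assumption, adjust to your config.
CRAWLER_WORKDIR="./workdir"

# Remove the working directory so no incremental crawl state survives.
rm -rf "$CRAWLER_WORKDIR"

# Also clear previously committed rows, e.g. (table/db names are assumptions):
#   mysql -u myuser -p -e "TRUNCATE TABLE crawlerimage;" crawldb

echo "Cleared $CRAWLER_WORKDIR"
```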
Thank you for the quick response.
I need screenshots of all the pages of given website to show to the user the webpage thumbnails, while displaying the URLs.
I am getting the following error, "ERROR - Screenshot file not created for", and the image path is not being stored in the database. I am using the same configuration file attached earlier.
As for why you get an inconsistent number of screenshots, I am not sure. Do you have any indication in the logs? Have you tried increasing the PhantomJS-related timeouts to significantly higher values?
I tested again with the below configuration, and the image path is not getting stored in the database field imagepath. Please advise.
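For reference, a minimal sketch of what the SQL Committer section could look like, assuming the target table is `crawlerimage` and the image path goes in an `imagepath` column (table, column, and connection details are assumptions; element names follow the Norconex SQL Committer 2.x documentation, so verify them against your version):

```xml
<committer class="com.norconex.committer.sql.SQLCommitter">
  <driverClass>com.mysql.jdbc.Driver</driverClass>
  <connectionUrl>jdbc:mysql://localhost:3306/crawldb</connectionUrl>
  <tableName>crawlerimage</tableName>
  <!-- ${tableName} is substituted by the committer at runtime. -->
  <createTableSQL>
    CREATE TABLE ${tableName} (
      id VARCHAR(255) NOT NULL PRIMARY KEY,
      content TEXT,
      imagepath VARCHAR(1024)
    )
  </createTableSQL>
</committer>
```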
I had a second look at your config, and the reason the image paths are not stored is that you are getting rid of the "image" field you specified by not having it listed in the …
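A common way to lose a field like this is restricting the kept fields in the importer. A sketch assuming a KeepOnlyTagger is the handler involved (the truncated sentence above does not say which one; the field names here are examples, not from the original config):

```xml
<importer>
  <postParseHandlers>
    <!-- Any field not listed here is discarded before committing, so the
         screenshot field ("image" here, an example name) must be included. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
            fields="title,description,image"/>
  </postParseHandlers>
</importer>
```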
Thank you very much; the image path is now saving in the MySQL database. I found another issue while indexing using PhantomJS: whenever I get an error in the PhantomJS screen capture, that page does not get indexed. Below is the screen-capture error, and the corresponding page is not indexed in MySQL. Please advise; I would like the page to be indexed even if there is an error capturing the page image.
Example: 2018-08-23 15:52:20 ERROR - Screenshot file not created for http://www.xyz.com/?page_id=29
As a test, if you disable screenshots, does it get indexed? I would like to confirm whether the page failing to load has anything to do with taking a screenshot.
Yes, I retested now, and if I disable the screenshots config, the collector indexes the concerned page. Once I enable the documentFetcher, the log shows the above error and the concerned page does not get indexed. I have attached the config file for your testing.
I tested with your latest config, and sometimes it works, sometimes it does not. What makes a difference are the timeout values. When I get errors, I seem to get them regardless of whether the screenshot is enabled or not, so I recommend you tell PhantomJS to wait significantly longer. If that makes your crawl slower, you can increase the number of threads to compensate (if you have enough resources for that).
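The PhantomJS wait/timeout values mentioned above live on the document fetcher. A sketch with deliberately generous values (element names follow the PhantomJSDocumentFetcher documentation for HTTP Collector 2.x and the numbers are arbitrary starting points; verify both against your version and site):

```xml
<documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/usr/local/bin/phantomjs</exePath>
  <!-- Milliseconds. Generous values give slow pages time to render. -->
  <renderWaitTime>10000</renderWaitTime>
  <resourceTimeout>60000</resourceTimeout>
</documentFetcher>

<!-- Compensate for the slower fetches with more crawler threads: -->
<numThreads>4</numThreads>
```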
Sorry for the delay in responding. I have tried setting higher values for waitTime and timeout. However, whenever there is an error taking a screenshot with PhantomJS, that page is consistently not indexed; all other pages are indexed. Is there a way to configure the page to be indexed even when PhantomJS fails?
There is possibly something to be done. Do you have a specific URL that consistently works fine with screenshots disabled but always fails when screenshots are enabled?
Sorry for the delay in responding. You can try this URL
OK, with that URL I can reproduce, but it does not always fail for me. And when the screenshot fails, the content IS most often processed as expected. Once in a while, though, the content is not obtained. It appears to happen only when PhantomJS fails to download the page, which is when the return code from PhantomJS is 1 and I get the following:
In such a case, there is no file to process (no download), so this is why you do not get the document committed. It does seem to occur only when the screenshot is enabled (or maybe the screenshot just makes it fail more frequently). Short of fixing PhantomJS, we can only try to work around this.

The only things I can think of are an optional parameter that specifies how many times to retry a failed page, or retrying without the screenshot enabled when it fails with it. I can turn this into a feature request if either of these options would work for you. If you have another approach to suggest, let me know.

In the meantime, I am afraid the only available (non-coding) workaround is to recrawl the site more frequently, in the hope that bad pages eventually go through (since they will be retried on each crawl). If the HTTP response last-modified date can be relied upon on that site, you can enable an HTTP metadata fetcher to check whether a page has changed before downloading it again, making your re-crawls faster.
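The metadata-fetcher workaround at the end could look like the sketch below (class names are taken from the HTTP Collector 2.x documentation; verify against your version). The metadata fetcher issues a lightweight HEAD request first, and the checksummer skips re-downloading documents whose Last-Modified header has not changed:

```xml
<metadataFetcher
    class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher"/>
<metadataChecksummer
    class="com.norconex.collector.http.checksum.impl.LastModifiedMetadataChecksummer"/>
```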
Thanks for your response. The issue is that I am getting the PhantomJS error on a considerable number of web pages, resulting in those pages not getting indexed. Today one of the websites failed on the index page itself, so the whole website did not get indexed. Re-crawling the website also does not guarantee the indexation of the page, and if any one page is not indexed, it results in wrong search results in our application. As mentioned earlier, if the screenshot is disabled, the pages get indexed. Is there a way to continue indexing the pages irrespective of the output of the screenshot process?
If not, can you please add a feature to continue indexing the pages even if an error occurs in the screenshot process. Thanks
Right now, no: when that specific scenario occurs with PhantomJS, it will not continue with the content, since no content is produced by PhantomJS. As we have limited control over how PhantomJS behaves on certain errors, we will need to add to the crawler the ability to retry a URL when it fails. I am marking this as a feature request. In the meantime, you can always try to learn about the scripting API used by PhantomJS and modify the phantom.js script to change its current behavior on failure (if possible).
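For anyone attempting that phantom.js modification, a sketch of what a retry-on-failure loop looks like with the PhantomJS page API. This runs under the PhantomJS runtime (not Node.js) and is an illustration only, not a drop-in replacement for the script shipped with the collector, which also handles screenshots and file output:

```javascript
// Sketch only: requires the PhantomJS runtime.
// Retries page.open() a few times before giving up, instead of
// exiting with code 1 on the first failure.
var page = require('webpage').create();
var system = require('system');
var url = system.args[1];
var attempts = 0;
var MAX_ATTEMPTS = 3; // retry count is an arbitrary choice

function load() {
    attempts++;
    page.open(url, function (status) {
        if (status === 'success') {
            console.log(page.content); // hand the HTML back to the caller
            phantom.exit(0);
        } else if (attempts < MAX_ATTEMPTS) {
            setTimeout(load, 2000);    // wait 2 seconds, then retry
        } else {
            phantom.exit(1);           // give up after MAX_ATTEMPTS
        }
    });
}
load();
```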
Thanks for the quick response. I am getting the PhantomJS errors on a considerable number of websites, preventing me from moving forward with taking screenshots and indexing the websites. I request that you add the feature to index the page even if there is a PhantomJS screenshot error. I have sent you the configuration of another website by email where you can reproduce the error.
Please find attached the configuration for the HTTP Collector. I am trying to get screenshots of the web pages being crawled. I need help with the following issues:
1. Store the image path in the MySQL database, along with the content data, in the crawlerimage table.
2. I am not able to get images of all the pages in the website.
3. How to fetch page images of an HTTPS website.
Thank you
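On points 1 and 3, a sketch of the screenshot and HTTPS-related fetcher settings (element names per the PhantomJSDocumentFetcher documentation for HTTP Collector 2.x; the paths and the `image` field name are examples, so verify everything against your version):

```xml
<documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <exePath>/usr/local/bin/phantomjs</exePath>
  <!-- PhantomJS command-line flags needed for many HTTPS sites: -->
  <options>
    <opt>--ssl-protocol=any</opt>
    <opt>--ignore-ssl-errors=true</opt>
  </options>
  <!-- Save screenshots to disk and put each file's path in the "image"
       field, which the committer can then store in the database: -->
  <screenshotEnabled>true</screenshotEnabled>
  <screenshotStorage>disk</screenshotStorage>
  <screenshotStorageDiskDir>/path/to/screenshots</screenshotStorageDiskDir>
  <screenshotStorageDiskField>image</screenshotStorageDiskField>
</documentFetcher>
```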