running collector-http examples with Solr #52
Comments
Do I assume right that you meant the "lib" folder? The content of the lib folder in the norconex-committer-solr-2.0.0 zip file should go in the lib folder of your HTTP Collector installation (i.e. Jars with Jars). If you find duplicate Jars (different versions), you can delete the older ones. Once you have done this, you can look here for configuration options. For instance, you want to replace this from the minimum-config.xml ...
... to something like this (change localhost:8080 to match your Solr instance)...
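(The original code snippets were stripped from this thread. Below is a sketch of the kind of change meant here, assuming HTTP Collector 2.x and Solr Committer 2.0.0; the field names are examples only.) The default committer in minimum-config.xml writes to disk:

```xml
<!-- Default in minimum-config.xml: writes crawled documents to the file system -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>./committed</directory>
</committer>
```

... replaced with a Solr committer along these lines:

```xml
<!-- Sends crawled documents to Solr; adjust the URL to your instance -->
<committer class="com.norconex.committer.solr.SolrCommitter">
  <solrURL>http://localhost:8080/solr/collection1</solrURL>
  <targetReferenceField>id</targetReferenceField>
  <targetContentField>content</targetContentField>
</committer>
```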
The target reference and content fields need to match what you have defined in your Solr config/schema for the Solr unique key and default fulltext field, respectively. The source reference field is the default. "document.reference" is always a field of every document crawled, unless you explicitly take it off. Let me know if that works for you.
I've run collector-http following your advice but it seems that I've got an error. The following text is an excerpt of the execution. If there is a way to send you the full text, let me know.

INFO [AbstractFileQueueCommitter] Committing 2 files
Caused by: java.lang.IllegalArgumentException: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabi
Can you paste the configuration portion you have for Solr? From the stacktrace, it seems that your Solr URL has an invalid character in it. Can it be you have not specified the protocol properly? I see http:\ in the stacktrace, while it should be http:// Can you double-check that?
I'm very, very sorry! You were right. ...

Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8939/solr/example/solr/collection1
It is trying to connect to this URL: http://localhost:8939/solr/example/solr/collection1 Have you tried contacting this URL in your browser, from the same computer that's running the HTTP Collector? What do you get? It looks to me that URL is wrong. Should it be http://localhost:8939/solr/collection1 instead? (dropping solr/example/) Having a copy of the relevant portion of your configuration would help.
This is the access to the Solr Admin from the browser. In the picture, on the right, you can see that the instance is at c:\solr\example\solr\collection1, although the URL reads http://localhost:8983/solr/#/collection1. With http://localhost:8983/solr/collection1 in the browser, I get error 404. Please let me know what more information you need me to send you. Thank you. Excerpt from minimum-config-solr.xml:
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.norconex.com/p
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ... 13 more
You are making progress. I can see content is sent to Solr now, so your committer configuration is OK. This last error is about a field being sent to Solr but not defined in your Solr schema. This is a typical error with Solr, and it is fairly easy to fix. Here are two options: Option 1) Add a wildcard field in your Solr schema.xml and Solr will automatically create a new Solr field for every crawled field sent its way. Option 2) Tell the HTTP Collector to only keep the fields you have configured in your Solr schema. You can do this easily by configuring a KeepOnlyTagger in your Importer configuration.
The comma-separated list of fields you specify must exist in your Solr config.
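(The original snippets were not preserved; these are sketches only. The field and type names are examples, and the class name assumes Norconex Importer 2.x.) Option 1, a catch-all dynamic field in schema.xml:

```xml
<!-- Option 1: Solr creates a field on the fly for anything not declared -->
<dynamicField name="*" type="text_general" indexed="true" stored="true"/>
```

Option 2, keeping only known fields in the HTTP Collector config:

```xml
<!-- Option 2: drop every crawled field except those listed -->
<importer>
  <postParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
        fields="id,title,content"/>
  </postParseHandlers>
</importer>
```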
Since I want to collect and archive in Solr web page contents and files referenced in the web pages, I'm not sure I can decide the field names. I suppose it depends on the site's web pages. What do you advise? I'm very new to crawlers and Solr. Just to follow on, in my test I'll try option 1. Thank you
Good idea while you are developing. You will have a clear picture of all fields captured by the crawl activities. About page references. The HTTP Collector will store all URLs found in a document in a metadata field. That allows you to build search features, such as "find all pages that link to this URL". There is another HTTP Collector feature you may want to turn on (it is off by default). That is, for every document, store which page linked to it (if many pages point to the same file, only one will be kept). You can enable this by having this config:
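(The exact snippet from this reply was not preserved; this is a sketch assuming HTTP Collector 2.x, and the keepReferrerData attribute name is an assumption on my part.) The config would look something like:

```xml
<!-- keepReferrerData: for each document, record which page linked to it -->
<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor"
      keepReferrerData="true"/>
</linkExtractors>
```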
Documentation on HtmlLinkExtractor can be found here.
I have "minimum" web pages recorded in Solr and have an idea of the great variety of fields. "Congratulations! If you read this text from your target repository (e.g. file system, search engine, ...) we are excited that you are trying the Norconex HTTP Collector. This standalone web page was created to help you test your installation is running properly. Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you." How can this text be indexed? And, finally, I'd like pdf, word, etc. files, referenced in web pages, to be indexed. Could you give me any advice on getting the files indexed in Solr? Thank you very much.
The text you mention should be in Solr. Please provide the following:
As for PDFs and other non-HTML files, they are picked up by default. So unless you explicitly exclude them somehow, you'll get them.
Another thing... the field that you map the content to in Solr... did you define it with the "store" flag being true? |
The field:
What configuration file? The selected Solr URL:
The xml result:
I mean the HTTP Collector configuration file. I see from above you kept
You are right. In the HTTP Collector, it is "text" by default, and in schema.xml, it states that "content" is for highlighting document content and "text" is to search the content.
HTTP Collector configuration file:
It all depends what you want to do with the document content you crawl. Typically, you want to search on it, and then it's OK that Solr has the "text" field as the search field. If you do not want to search on it, but you would like to display the content to your application users, then make the target field "content" in the HTTP Collector config (or change the "text" field in Solr to be stored). If you want to do both search and display, you can leave it as is, but also mark the "text" field in your schema as stored. After you change your Solr schema, if you experience issues, the safest approach is to wipe out the existing content in Solr, restart it, and index again.
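For the both-search-and-display case, the "text" field definition in schema.xml would look something like this (the field type name is an assumption based on the Solr example schema):

```xml
<!-- indexed="true": searchable; stored="true": retrievable for display -->
<field name="text" type="text_general" indexed="true" stored="true"
    multiValued="true"/>
```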
Thank you |
No problem. As it seems you got everything working now, I am closing this issue. Feel free to re-open if you encounter a related issue, or create a new issue. Thanks for using the Norconex HTTP Collector and good luck with your project! |
One final question. |
Your expectations are good, but your configuration does not match your expectations. :-) The sample configuration is limited to crawl only one page on purpose (since that's just a test). There are two configuration settings at play here: the crawler's maximum depth and a reference filter restricting which URLs are kept.
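(A sketch of those two settings, assuming HTTP Collector 2.x; the domain in the regex is an example only:)

```xml
<!-- How many link "hops" from the start URL to follow (-1 = unlimited) -->
<maxDepth>20</maxDepth>
<referenceFilters>
  <!-- Only keep URLs on the site you intend to crawl -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">https?://www\.example\.com/.*</filter>
</referenceFilters>
```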
I recommend you do not remove these, but change them instead to match the site you want to crawl. Put a reasonable max depth (e.g. 20), and change the reference filter to match the domain name you are crawling (unless you want to crawl the entire internet!). I welcome your questions anytime and I am glad to see you are making good progress, but I would appreciate you create new tickets/issues for new questions. It will separate better your questions with answers being "on-topic" with your title, helping others find answers more easily when looking at the closed issue list. |
I'd like to test the "minimum" and "complex" examples with Solr but I'm not sure what changes to make to minimum-config.xml and complex-config.xml. I'm trying Solr at the same time, so my repository is collection1 (C:\solr\example\solr\collection1).
I've downloaded "norconex-committer-solr-2.0.0" and copied bin directory onto collector-http's.
I'd appreciate some advice.
Thanks
Carlos