
running collector-http examples with Solr #52

Closed
csaezl opened this issue Feb 6, 2015 · 20 comments

@csaezl commented Feb 6, 2015

I'd like to test the "minimum" and "complex" examples with Solr, but I'm not sure what changes to make to minimum-config.xml and complex-config.xml. I'm also trying out Solr at the same time, so my repository is collection1 (C:\solr\example\solr\collection1).
I've downloaded "norconex-committer-solr-2.0.0" and copied its bin directory onto the collector-http one.
I'd appreciate some advice.
Thanks
Carlos

@essiembre (Contributor)

Am I right to assume you meant the "lib" folder? The content of the lib folder in the norconex-committer-solr-2.0.0 zip file should go in the lib folder of your HTTP Collector installation (i.e., Jars with Jars). If you find duplicate Jars (different versions), you can delete the older ones.

Once you have done this, you can look here for configuration options. For instance, you want to replace this in minimum-config.xml ...

      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./examples-output/minimum/crawledFiles</directory>
      </committer>

... with something like this (change localhost:8080 to match your Solr instance)...

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8080/solr/collection1</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>text</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>/optional/queue/path/</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>

The target reference and content fields need to match what you have defined in your Solr config/schema for the Solr unique key and default fulltext field, respectively.

The source reference field shown is the default: "document.reference" is always a field of every crawled document, unless you explicitly remove it.
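For context, the target reference and content fields above assume schema entries along these lines. This is a hypothetical excerpt of a stock Solr example schema.xml; check your own schema for the actual unique key and fulltext field names:

```xml
<!-- Hypothetical schema.xml excerpt: the committer's targetReferenceField
     maps to the uniqueKey field, targetContentField to a fulltext field. -->
<uniqueKey>id</uniqueKey>
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
```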

Let me know if that works for you.

@csaezl commented Feb 9, 2015

I've run collector-http following your advice, but it seems I've got an error. The following text is an excerpt of the execution. If there is a way to send you the full text, let me know.
Thank you
Carlos
....................
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.

INFO [AbstractFileQueueCommitter] Committing 2 files
INFO [SolrCommitter] Sending 2 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:198)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:249)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:246)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:207)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

Caused by: java.lang.IllegalArgumentException: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabin&version=2
at java.net.URI.create(Unknown Source)
at org.apache.http.client.methods.HttpPost.<init>(HttpPost.java:76)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:328)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)
... 13 more
Caused by: java.net.URISyntaxException: Illegal character in opaque part at index 5: http:\localhost:8939\solr\example\solr\collection1/update?wt=javabin&version=2
at java.net.URI$Parser.fail(Unknown Source)
at java.net.URI$Parser.checkChars(Unknown Source)
at java.net.URI$Parser.parse(Unknown Source)
at java.net.URI.<init>(Unknown Source)
... 19 more
....................

@essiembre (Contributor)

Can you paste the configuration portion you have for Solr? From the stack trace, it seems that your Solr URL has an invalid character in it. Could it be that you have not specified the protocol properly?

I see http:\ in the stacktrace, while it should be http://

Can you double-check that?
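Not part of the thread, but as an illustration of what Java's URI parser is rejecting: with backslashes after the scheme there is no `//`, so the URL has no authority (host) part. A minimal Python sketch of the same sanity check (function name is made up for illustration):

```python
from urllib.parse import urlparse

def looks_like_valid_solr_url(url: str) -> bool:
    """Rough check: the URL must have an http(s) scheme and a host part.
    Backslashes instead of "//" leave the netloc (host) empty."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(looks_like_valid_solr_url("http://localhost:8939/solr/collection1"))   # True
print(looks_like_valid_solr_url(r"http:\localhost:8939\solr\collection1"))   # False
```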

@csaezl commented Feb 9, 2015

I'm very, very sorry! You were right.
But I'm still getting errors.
Thank you
Carlos

....................
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8939/solr/example/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@1851003[queueSize=100,docCount=6,queue=com.norconex.committer.core.impl.FileSystemCommitter@715c6f[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=,keepSourceContentField=false]])
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 6 files
INFO [SolrCommitter] Sending 6 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:198)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:249)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:246)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:207)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8939/solr/example/solr/collection1
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:500)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)
... 13 more
Caused by: java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.connect0(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:117)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:178)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:395)
... 16 more
....................

@essiembre (Contributor)

It is trying to connect to this URL: http://localhost:8939/solr/example/solr/collection1

Have you tried contacting this URL in your browser, from the same computer that's running the HTTP Collector? What do you get?

It looks to me like that URL is wrong. Should it be http://localhost:8939/solr/collection1 instead? (dropping solr/example/)

Having a copy of the relevant portion of your configuration would help.

@csaezl commented Feb 9, 2015

This is the Solr Admin page from the browser:

[screenshot: Solr Admin UI, 09-02-2015 18-18-49]

In the picture, to the right, you can see that the instance is at C:\solr\example\solr\collection1, although the URL reads http://localhost:8983/solr/#/collection1

With http://localhost:8983/solr/collection1 in the browser, I get a 404 error.
Anyway, I've put http://localhost:8983/solr/collection1 in minimum-config-solr.xml, and it seems to work (not sure), but I still get errors.

Please, let me know what more information you need me to send you.

Thank you
Carlos


Excerpt from minimum-config-solr.xml:

  <!-- Decide what to do with your files by specifying a Committer. -->
  <committer class="com.norconex.committer.solr.SolrCommitter">
    <solrURL>http://localhost:8983/solr/collection1</solrURL>
    <sourceReferenceField keep="false">document.reference</sourceReferenceField>
    <targetReferenceField>id</targetReferenceField>
    <targetContentField>text</targetContentField>
    <commitBatchSize>10</commitBatchSize>
    <queueDir>/optional/queue/path/</queueDir>
    <queueSize>100</queueSize>
    <maxRetries>2</maxRetries>
    <maxRetryWait>5000</maxRetryWait>
  </committer>

INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.norconex.com/product/collector-http-test/minimum.php (Subject: SolrCommitter [solrURL=http://localhost:8983/solr/collection1, updateUrlParams={}, solrServerFactory=DefaultSolrServerFactory [server=null], com.norconex.committer.solr.SolrCommitter@de3cea[queueSize=100,docCount=9,queue=com.norconex.committer.core.impl.FileSystemCommitter@6ba7bf[directory=/optional/queue/path/],commitBatchSize=10,maxRetries=2,maxRetryWait=5000,operations=[],docCount=0,targetReferenceField=id,sourceReferenceField=document.reference,keepSourceReferenceField=false,targetContentField=text,sourceContentField=,keepSourceContentField=false]])
INFO [AbstractCrawler] Norconex Minimum Test Page: Crawler finishing: committing documents.
INFO [AbstractFileQueueCommitter] Committing 9 files
INFO [SolrCommitter] Sending 9 documents to Solr for update/deletion.
ERROR [AbstractBatchCommitter] Could not commit batched operations.
com.norconex.committer.core.CommitterException: Cannot index document batch to Solr.
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:198)
at com.norconex.committer.core.AbstractBatchCommitter.commitAndCleanBatch(AbstractBatchCommitter.java:178)
at com.norconex.committer.core.AbstractBatchCommitter.commitComplete(AbstractBatchCommitter.java:158)
at com.norconex.committer.core.AbstractFileQueueCommitter.commit(AbstractFileQueueCommitter.java:249)
at com.norconex.collector.core.crawler.AbstractCrawler.execute(AbstractCrawler.java:246)
at com.norconex.collector.core.crawler.AbstractCrawler.doExecute(AbstractCrawler.java:207)
at com.norconex.collector.core.crawler.AbstractCrawler.startExecution(AbstractCrawler.java:169)
at com.norconex.jef4.job.AbstractResumableJob.execute(AbstractResumableJob.java:49)
at com.norconex.jef4.suite.JobSuite.runJob(JobSuite.java:351)
at com.norconex.jef4.suite.JobSuite.doExecute(JobSuite.java:301)
at com.norconex.jef4.suite.JobSuite.execute(JobSuite.java:171)
at com.norconex.collector.core.AbstractCollector.start(AbstractCollector.java:116)
at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:69)
at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)

Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=http://www.norconex.com/product/collector-http-test/complex1.php] unknown field 'Content-Length'
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at com.norconex.committer.solr.SolrCommitter.commitBatch(SolrCommitter.java:195)

... 13 more

@essiembre (Contributor)

You are making progress. I can see content is being sent to Solr now, so your committer configuration is OK.

This last error is about a field being sent to Solr that is not defined in your Solr schema. This is a typical error with Solr, and it is fairly easy to fix. Here are two options:

Option 1) Add a wildcard field to your Solr schema.xml, and Solr will automatically create a new field for every crawled field sent its way.
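For Option 1, the wildcard entry would look something like this in schema.xml (a sketch; adjust the field type and attributes to your schema):

```xml
<!-- Catch-all dynamic field: any field name not otherwise defined
     is accepted and stored as a string. -->
<dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>
```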

Option 2) Tell the HTTP Collector to keep only the fields you have configured in your Solr schema. You can do this easily by adding a KeepOnlyTagger in the importer section of your configuration file, like this:

<importer>
    <postParseHandlers>
        <!-- This is what you need to add: -->
        <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
                fields="document.reference,id,title,myField,myOtherField,etc" />
    </postParseHandlers>
</importer>

The comma-separated list of fields you specify must exist in your Solr schema.

@csaezl commented Feb 9, 2015

Since I want to collect and archive in Solr both web page contents and the files referenced in those pages, I'm not sure I can decide the field names in advance; I suppose it depends on the site's web pages. What do you advise? I'm very new to crawlers and Solr.

Just to follow on my test I'll try option 1.

Thank you
Carlos

@essiembre (Contributor)

Good idea while you are developing. You will get a clear picture of all fields captured by the crawl activities.

About page references: the HTTP Collector stores all URLs found in a document in a metadata field. That allows you to build search features such as "find all pages that link to this URL".

There is another HTTP Collector feature you may want to turn on (it is off by default): for every document, store which page linked to it (if many pages point to the same file, only one is kept). You can enable it with this config:

  <extractor class="com.norconex.collector.http.url.impl.HtmlLinkExtractor" keepReferrerData="true" />

Documentation on HtmlLinkExtractor can be found here.

@csaezl commented Feb 9, 2015

I have the "minimum" web pages recorded in Solr and have an idea of the great variety of fields.
There is something that doesn't work as I expected: the text on the web pages. For example, on http://www.norconex.com/product/collector-http-test/minimum.php, every text should be a candidate for the index. Texts such as:

"Congratulations! If you read this text from your target repository (e.g. file system, search engine, ...)
it means that you successfully ran the Norconex HTTP Collector minimum example."

"We are excited that you are trying the Norconex HTTP Collector. This standalone web page was created to help you test your installation is running properly. Once you're done working with this document, make sure to familiarize yourself with the many configuration options available to you
on the Norconex HTTP Collector web site"

How can this text be indexed?

And finally, I'd like PDF, Word, etc. files referenced in web pages to be indexed. Could you give me any advice on getting those files indexed in Solr?

Thank you very much.
Carlos

@essiembre (Contributor)

The text you mention should be in Solr. Please provide the following:

  • Your configuration file
  • URL you use to query Solr and the response you get back.

As for PDFs and other non-HTML files, they are picked up by default, so unless you explicitly exclude them somehow, you'll get them.

@essiembre (Contributor)

Another thing: for the field you map the content to in Solr, did you define it with the "stored" flag set to true?

@csaezl commented Feb 9, 2015

The field:

   <dynamicField name="*" type="string" indexed="true" stored="true" multiValued="true"/>

What configuration file?


The Select Solr URL:

http://localhost:8983/solr/collection1/select?q=norconex&wt=xml&indent=true

The xml result:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
    <str name="indent">true</str>
    <str name="q">norconex</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="3" start="0">
  <doc>
    <arr name="Content-Length">
      <str>3623</str>
    </arr>
    <arr name="Connection">
      <str>close</str>
    </arr>
    <arr name="X-Powered-By">
      <str>PleskLin</str>
    </arr>
    <arr name="Server">
      <str>Apache</str>
    </arr>
    <str name="id">http://www.norconex.com/product/collector-http-test/complex1.php</str>
    <arr name="SITE">
      <str>Norconex Test Site</str>
    </arr>
    <arr name="collector.referenced-urls">
      <str>http://www.norconex.com/collectors/img/collector-http.png</str>
      <str>http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png</str>
    </arr>
    <str name="author">Norconex Inc.</str>
    <str name="author_s">Norconex Inc.</str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    </arr>
    <arr name="MS-Author-Via">
      <str>DAV</str>
    </arr>
    <arr name="Date">
      <str>Mon, 09 Feb 2015 16:13:39 GMT</str>
    </arr>
    <arr name="Content-Location">
      <str>http://www.norconex.com/product/collector-http-test/complex1.php</str>
    </arr>
    <arr name="Content-Encoding">
      <str>UTF-8</str>
    </arr>
    <arr name="collector.content-type">
      <str>text/html</str>
    </arr>
    <arr name="document.contentFamily">
      <str>html</str>
    </arr>
    <arr name="collector.content-encoding">
      <str>text/html</str>
    </arr>
    <arr name="Content-Type">
      <str>text/html</str>
      <str>text/html; charset=UTF-8</str>
    </arr>
    <arr name="document.contentType">
      <str>text/html</str>
    </arr>
    <arr name="dc:title">
      <str>Norconex HTTP Collector Test Page</str>
    </arr>
    <arr name="collector.depth">
      <str>0</str>
    </arr>
    <long name="_version_">1492660268137709568</long></doc>
  <doc>
    <arr name="Content-Length">
      <str>3623</str>
    </arr>
    <arr name="Connection">
      <str>close</str>
    </arr>
    <arr name="X-Powered-By">
      <str>PleskLin</str>
    </arr>
    <arr name="Server">
      <str>Apache</str>
    </arr>
    <str name="id">http://www.norconex.com/product/collector-http-test/complex2.php</str>
    <arr name="SITE">
      <str>Norconex Test Site</str>
    </arr>
    <arr name="collector.referenced-urls">
      <str>http://www.norconex.com/collectors/img/collector-http.png</str>
      <str>http://www.norconex.com/collectors/img/norconex-logo-blue-241x51.png</str>
    </arr>
    <str name="author">Norconex Inc.</str>
    <str name="author_s">Norconex Inc.</str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    </arr>
    <arr name="MS-Author-Via">
      <str>DAV</str>
    </arr>
    <arr name="Date">
      <str>Mon, 09 Feb 2015 16:13:36 GMT</str>
    </arr>
    <arr name="Content-Location">
      <str>http://www.norconex.com/product/collector-http-test/complex2.php</str>
    </arr>
    <arr name="Content-Encoding">
      <str>UTF-8</str>
    </arr>
    <arr name="collector.content-type">
      <str>text/html</str>
    </arr>
    <arr name="document.contentFamily">
      <str>html</str>
    </arr>
    <arr name="collector.content-encoding">
      <str>text/html</str>
    </arr>
    <arr name="Content-Type">
      <str>text/html</str>
      <str>text/html; charset=UTF-8</str>
    </arr>
    <arr name="document.contentType">
      <str>text/html</str>
    </arr>
    <arr name="dc:title">
      <str>Norconex HTTP Collector Test Page</str>
    </arr>
    <arr name="collector.depth">
      <str>0</str>
    </arr>
    <long name="_version_">1492660268141903872</long></doc>
  <doc>
    <str name="id">http://www.norconex.com/product/collector-http-test/minimum.php</str>
    <arr name="title">
      <str>Norconex HTTP Collector Test Page</str>
    </arr>
    <long name="_version_">1492660268145049601</long></doc>
</result>
</response>

@essiembre (Contributor)

I mean the HTTP Collector configuration file. I see from the above that you kept "text" as the field name where the content is stored in Solr (defined in <targetContentField>). Is this field explicitly defined in your Solr schema? I suspect it is, but it is not flagged to be stored.

@csaezl commented Feb 9, 2015

You are right. In the HTTP Collector it is "text" by default, and schema.xml states that "content" is for highlighting document content and "text" is for searching the content.
Should I change stored="false" for "text", or change <targetContentField> to content?

        <targetContentField>text</targetContentField>

   <!-- Main body of document extracted by SolrCell.
        NOTE: This field is not indexed by default, since it is also copied to "text"
        using copyField below. This is to save space. Use this field for returning and
        highlighting document content. Use the "text" field to search the content. -->

   <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

HTTP Collector configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- 
   Copyright 2010-2014 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<!-- This configuration shows the minimum required and minimum recommended to 
     run a crawler.  
     -->
<httpcollector id="Minimum Config HTTP Collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./examples-output/minimum/progress</progressDir>
  <logsDir>./examples-output/minimum/logs</logsDir>

  <crawlers>
    <crawler id="Norconex Minimum Test Page">

      <!-- === Minimum required: =========================================== -->

      <!-- Requires at least one start URL. -->
      <startURLs>
        <url>http://www.norconex.com/product/collector-http-test/minimum.php</url>
      </startURLs>

      <!-- === Minimum recommended: ======================================== -->

      <!-- Where the crawler default directory to generate files is. -->
      <workDir>./examples-output/minimum</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- At a minimum make sure you stay on your domain. -->
      <referenceFilters>
        <filter 
            class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="include" >
          http://www\.norconex\.com/product/collector-http-test/.*
        </filter>
      </referenceFilters>

      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"
                  fields="title,keywords,description,document.reference"/>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8983/solr/collection1</solrURL>
        <sourceReferenceField keep="false">document.reference</sourceReferenceField>
        <targetReferenceField>id</targetReferenceField>
        <targetContentField>text</targetContentField>
        <commitBatchSize>10</commitBatchSize>
        <queueDir>/optional/queue/path/</queueDir>
        <queueSize>100</queueSize>
        <maxRetries>2</maxRetries>
        <maxRetryWait>5000</maxRetryWait>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

@essiembre (Contributor)

It all depends on what you want to do with the document content you crawl. Typically you want to search on it, and then it is fine for Solr to have the "text" field as indexed="true". If that's all you want to do, leave it as is.

If you do not want to search on it but would like to display the content to your application users, then make the target field "content" in the HTTP Collector config (or change the "text" field in Solr to stored="true", indexed="false").

If you want to both search on it and display it, you can leave the config as is, but also mark the "text" field in your schema as stored="true".

After you change your Solr schema, if you experience issues, the safest is to wipe out the existing content in Solr, restart it, and index again.
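Not spelled out in the thread, but one common way to wipe a core is to post a delete-all update message to Solr's update handler (host and core name here are the ones from this thread; adjust to yours):

```xml
<!-- POST this XML to http://localhost:8983/solr/collection1/update?commit=true -->
<delete>
  <query>*:*</query>
</delete>
```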

@csaezl commented Feb 9, 2015

Thank you
Carlos

@essiembre (Contributor)

No problem. As it seems you got everything working now, I am closing this issue. Feel free to re-open if you encounter a related issue, or create a new issue.

Thanks for using the Norconex HTTP Collector and good luck with your project!

@csaezl commented Feb 10, 2015

One final question.
I've realized that when processing the "minimum" test (the same for "complex"), only the page http://www.norconex.com/product/collector-http-test/minimum.php is processed.
I assumed the page http://www.norconex.com/collectors/collector-http/configuration would also be processed, because it is referenced in minimum.php.
Isn't that the way the crawler is supposed to work?
Carlos

@essiembre (Contributor)

Your expectations are reasonable, but your configuration does not match them. :-)

The sample configuration is limited to crawling only one page on purpose (since it's just a test). There are two configuration settings at play here:

  • maxDepth is set to zero, which means it won't crawl any deeper than the URL(s) you provide (so 1 page only in this case).
  • There is a filter set to only accept URLs that match the test page URL (http://www\.norconex\.com/product/collector-http-test/.*).

I recommend you do not remove these, but change them instead to match the site you want to crawl. Put a reasonable max depth (e.g. 20), and change the reference filter to match the domain name you are crawling (unless you want to crawl the entire internet!).
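Putting that advice together, the two settings would change to something like this (www.example.com is a placeholder for your own domain):

```xml
<!-- Crawl up to 20 links deep instead of only the start URL(s). -->
<maxDepth>20</maxDepth>

<!-- Stay on your own domain. -->
<referenceFilters>
  <filter
      class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include" >
    http://www\.example\.com/.*
  </filter>
</referenceFilters>
```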

I welcome your questions anytime and am glad to see you are making good progress, but I would appreciate it if you created new tickets/issues for new questions. That keeps each question and its answers on-topic with the title, helping others find answers more easily when looking through the closed issue list.
