Rejected URL using RegexLinkExtractor with "java.lang.NullPointerException" #422

Closed
dhildreth opened this issue Nov 8, 2017 · 3 comments

@dhildreth

dhildreth commented Nov 8, 2017

I think I've stumbled upon a bug here. I'm attempting to use a .txt file as a sitemap of sorts. The file has one URL per line. It looks something like this:

https://wiki.mydomain.com/QC-Procedure-Continual+Improvement
https://wiki.mydomain.com/QC-Procedure-DMR+EPICOR+Entry+and+Processing
https://wiki.mydomain.com/QC-Procedure-Fire+Safety
https://wiki.mydomain.com/QC-Procedure-General+Safety+and+Health
https://wiki.mydomain.com/QC-Procedure-Hazard+Communication
https://wiki.mydomain.com/QC-Procedure-Incident+Investigation
https://wiki.mydomain.com/QC-Procedure-Incoming+EPICOR+Entry
https://wiki.mydomain.com/QC-Procedure-Internal+Audits
https://wiki.mydomain.com/QC-Procedure-Non+Emergency+Injury

Anyway, I'm using the RegexLinkExtractor like this:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <match><![CDATA[(?m)(^.*)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

When running the crawler, I get this error for each of the URLs in the sitemap.txt file:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://wiki.mydomain.com/Customer (java.lang.NullPointerException)
ERROR [AbstractCrawler] Internal CMS Crawler: Could not process document: https://wiki.mydomain.com/Customer (null)
java.lang.NullPointerException
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.toCleanAbsoluteURL(RegexLinkExtractor.java:347)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:329)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:201)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Any suggestions would be greatly appreciated.

@essiembre
Contributor

This will be investigated, but is it possible to share a copy of your config? In the meantime, have you tried defining your start URLs with <urlsFile>...</urlsFile> instead of <url>...</url>? It lets you point to a file containing one URL per line, exactly the way you want to do it, so you would not need the regex link extractor at all.
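For reference, a minimal sketch of what that could look like (the file path here is just a placeholder; adjust it to your setup):

<startURLs>
    <!-- Plain-text file with one start URL per line (path shown is an example). -->
    <urlsFile>/path/to/sitemap.txt</urlsFile>
</startURLs>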

essiembre added a commit that referenced this issue Nov 9, 2017
@essiembre
Contributor

Found and fixed the issue. I was able to reproduce it when the URL file had blank lines in it. The latest snapshot now has this fix.

Please confirm.
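
If upgrading to the snapshot is not an option right away, one possible workaround (assuming blank lines are indeed what triggers the NullPointerException) is to tighten the extraction pattern so it only matches URL-like lines, for example:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <!-- Match only non-blank lines that look like URLs, so empty
                   lines never yield a null/blank extracted link. -->
              <match><![CDATA[(?m)^(https?://\S+)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

That said, the <urlsFile> approach mentioned earlier sidesteps link extraction for the start URLs entirely.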

@dhildreth
Author

Thank you so much! Works for me. 👍

I will use the urlsFile option. It's amazing what tools this crawler offers. Just when I think I understand most of the features, I learn something new!
