Rejected URL using RegexLinkExtractor with "java.lang.NullPointerException" #422

Closed
dhildreth opened this issue Nov 8, 2017 · 3 comments

@dhildreth

dhildreth commented Nov 8, 2017

I think I've stumbled upon a bug here. I'm attempting to use a .txt file as a sitemap of sorts. The file has one URL per line. It looks something like this:

https://wiki.mydomain.com/QC-Procedure-Continual+Improvement
https://wiki.mydomain.com/QC-Procedure-DMR+EPICOR+Entry+and+Processing
https://wiki.mydomain.com/QC-Procedure-Fire+Safety
https://wiki.mydomain.com/QC-Procedure-General+Safety+and+Health
https://wiki.mydomain.com/QC-Procedure-Hazard+Communication
https://wiki.mydomain.com/QC-Procedure-Incident+Investigation
https://wiki.mydomain.com/QC-Procedure-Incoming+EPICOR+Entry
https://wiki.mydomain.com/QC-Procedure-Internal+Audits
https://wiki.mydomain.com/QC-Procedure-Non+Emergency+Injury

Anyway, I'm using the RegexLinkExtractor like this:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <match><![CDATA[(?m)(^.*)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

When running the crawler, I get this error for each of the URLs in the sitemap.txt file:

INFO  [CrawlerEventManager]          DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]       CREATED_ROBOTS_META: https://wiki.mydomain.com/Customer
INFO  [CrawlerEventManager]            REJECTED_ERROR: https://wiki.mydomain.com/Customer (java.lang.NullPointerException)
ERROR [AbstractCrawler] Internal CMS Crawler: Could not process document: https://wiki.mydomain.com/Customer (null)
java.lang.NullPointerException
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.toCleanAbsoluteURL(RegexLinkExtractor.java:347)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:329)
        at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:201)
        at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
        at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
        at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
        at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
        at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
        at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

Any suggestions would be greatly appreciated.

@essiembre
Contributor

This will be investigated, but is it possible to share a copy of your config? In the meantime, have you tried defining your start URLs with <urlsFile>...</urlsFile> instead of <url>...</url>? It lets you point to a file containing one URL per line, exactly the way you want to do it, so you would not need the regex link extractor at all.
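For reference, a minimal sketch of what that could look like (the file path here is just a placeholder; adjust it to your setup):

<startURLs>
    <!-- Plain-text file with one start URL per line (path shown is an example). -->
    <urlsFile>/path/to/sitemap.txt</urlsFile>
</startURLs>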

essiembre added a commit that referenced this issue Nov 9, 2017
@essiembre
Contributor

Found and fixed the issue. I was able to reproduce it when the URL file had blank lines in it. The latest snapshot now has this fix.

Please confirm.
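
If upgrading to the snapshot is not an option right away, one possible workaround (assuming blank lines are indeed what triggers the NullPointerException) is to tighten the extraction pattern so it only matches URL-like lines, for example:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
      <linkExtractionPatterns>
          <pattern>
              <!-- Match only non-blank lines that look like URLs, so empty
                   lines never yield a null/blank extracted link. -->
              <match><![CDATA[(?m)^(https?://\S+)]]></match>
              <replace>$1</replace>
          </pattern>
      </linkExtractionPatterns>
  </extractor>
</linkExtractors>

That said, the <urlsFile> approach mentioned earlier sidesteps link extraction for the start URLs entirely.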

@dhildreth
Author

Thank you so much! Works for me. 👍

I will use the urlsFile option. It's amazing what tools this crawler offers. Just when I think I understand most of the features, I learn something new!
