I think I've stumbled upon a bug here. I'm attempting to use a .txt file as a sitemap of sorts. The file has one URL per line. It looks something like this:
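(Illustrative reconstruction: the domain matches the log output below; the page names are placeholders.)

https://wiki.mydomain.com/Customer
https://wiki.mydomain.com/Products
https://wiki.mydomain.com/Support

Anyways, I'm using the RegexLinkExtractor like this (a sketch of the relevant config fragment; the pattern is illustrative and the exact tags may vary slightly by Collector version):

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.RegexLinkExtractor">
    <linkExtractionPatterns>
      <!-- Illustrative pattern: match absolute http(s) URLs in the page body. -->
      <pattern>
        <match>https?://[^\s"'&lt;&gt;]+</match>
      </pattern>
    </linkExtractionPatterns>
  </extractor>
</linkExtractors>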
When running the crawler, I get this error for each of the URLs in the sitemap.txt file:
INFO [CrawlerEventManager] DOCUMENT_FETCHED: https://wiki.mydomain.com/Customer
INFO [CrawlerEventManager] CREATED_ROBOTS_META: https://wiki.mydomain.com/Customer
INFO [CrawlerEventManager] REJECTED_ERROR: https://wiki.mydomain.com/Customer (java.lang.NullPointerException)
ERROR [AbstractCrawler] Internal CMS Crawler: Could not process document: https://wiki.mydomain.com/Customer (null)
java.lang.NullPointerException
at com.norconex.collector.http.url.impl.RegexLinkExtractor.toCleanAbsoluteURL(RegexLinkExtractor.java:347)
at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:329)
at com.norconex.collector.http.url.impl.RegexLinkExtractor.extractLinks(RegexLinkExtractor.java:201)
at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:73)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:360)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:812)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Any suggestions would be greatly appreciated.
It will be investigated, but could you share a copy of your config? In the meantime, have you tried defining your start URLs with <urlsFile>...</urlsFile> instead of <url>...</url>? It lets you pass the path to a file containing one URL per line, exactly the way you want to do it, so you won't need the regex link extractor at all.
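A minimal sketch of what that looks like in the crawler configuration (the file path is a placeholder; other settings omitted):

<startURLs>
  <urlsFile>/path/to/sitemap.txt</urlsFile>
</startURLs>

Each line of that file is queued as a start URL, so the crawl is seeded directly from the file and no link extraction is needed to get going.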
I will use the urlsFile option. It's amazing what tools this crawler offers. Just when I think I understand most of the features, I learn something new!