
Javascript generated URLs #119

Closed
AntonioAmore opened this issue Jun 15, 2015 · 6 comments

@AntonioAmore

I tried to crawl a site and got the following error in the log:

site: 2015-06-10 21:38:12 DEBUG - ACCEPTED document reference. Reference=http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+' Filter=com.norconex.collector.core.filter.impl.ExtensionReferenceFilter@3ec300f1[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,swf,css,js,pdf,doc,xls,ppt,txt,odt,zip,rar,gz,swf,xlsx,docx,pptx,mp3,wav,mid,caseSensitive=false]
site: 2015-06-10 21:38:12 ERROR - site: Could not process document: http://www.site.com/dccom/0-5-7171-1-1810165-1-0-0-0-0-0-9296-0-0-0-0-0-0-0-0.html (Invalid URL syntax: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+')
com.norconex.commons.lang.url.URLException: Invalid URL syntax: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:181)
    at com.norconex.collector.http.url.impl.GenericURLNormalizer.normalizeURL(GenericURLNormalizer.java:159)
    at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline$URLNormalizerStage.executeStage(HttpQueuePipeline.java:133)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:94)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Illegal character in path at index 60: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:588)
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:179)
    ... 16 more

After reading the page's HTML I found the following fragment there:

<SCRIPT language=javascript><!--- 
 document.write('<img src="/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'"  border="0">');
// --->
...

It seems this fragment caused the exception, and it looks like it is the reason no pages were imported during the crawl.

Is there a way to avoid such links, or to keep the import from breaking when something like that happens?
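For context, the failure is reproducible outside the crawler: java.net.URI enforces RFC 2396 syntax, and characters such as [ and ] are not legal in a path, which is exactly what the URLNormalizer constructor hits in the stack trace above. A minimal sketch (the URL is the one from the log):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class BadUrlDemo {
    public static void main(String[] args) {
        // The "URL" extracted from the document.write(...) fragment
        String bad = "http://www.site.com/Projects/c2c/channel/images/"
                + "'+L140413[1+Math.round(Math.random()*(1-1))]+'";
        try {
            new URI(bad); // java.net.URI rejects illegal path characters
            System.out.println("parsed (unexpected)");
        } catch (URISyntaxException e) {
            // Same failure mode as the URLNormalizer constructor
            System.out.println("rejected: " + e.getReason());
        }
    }
}
```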

@essiembre
Contributor

I'll flag this as a bug, since a single broken URL should not prevent the extraction of good ones in the same document (and that document should not be rejected).

Whether we should consider URLs generated via script is another question. More often than not, since JavaScript is not interpreted, we will get errors. So I think we should also make sure script-generated URLs are not considered in the first place. If one day we add JavaScript support, it should be with something browser-like that interprets the JavaScript.
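One simple way to keep script-generated text out of link extraction is to strip script blocks from the HTML before scanning for URLs. A minimal sketch of that idea (an illustration only, not the collector's actual HtmlLinkExtractor implementation):

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // (?is): case-insensitive, and DOTALL so a block may span lines.
    // [^>]* tolerates attributes such as <SCRIPT language=javascript>.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    public static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/ok.html\">ok</a>"
                + "<SCRIPT language=javascript><!--- \n"
                + " document.write('<img src=\"/images/'+x[0]+'\">');\n"
                + "// ---></SCRIPT>";
        // The anchor survives; the script block (and its fake URL) is gone.
        System.out.println(stripScripts(html));
    }
}
```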

@essiembre essiembre added the bug label Jun 17, 2015
@essiembre essiembre added this to the 2.2.0 milestone Jun 17, 2015
@AntonioAmore
Author

Thank you.

It makes sense that the collector does not process JS-generated URLs, since it has no JS interpreter (at least not yet), and correctly continues with the valid ones.

essiembre added a commit that referenced this issue Jun 19, 2015
- ... so that JavaScript-generated URLs can no longer cause trouble (GitHub
#119).
- Fixed HTML documents being skipped when HtmlLinkExtractor found a URL
of invalid format. Now a warning is logged for each bad URL instead, the
document is processed anyway, and good URLs are extracted (GitHub
#119).
@essiembre
Contributor

The fix is in the latest snapshot release.

@AntonioAmore
Author

Thanks a lot! It is working perfectly.

@OkkeKlein

Looking at the regex that matches the script tag, it will only match <script> and not <SCRIPT language=javascript>.
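To illustrate the point (these are not the collector's actual patterns): a literal lowercase pattern misses that tag, while a case-insensitive pattern that tolerates attributes matches it:

```java
import java.util.regex.Pattern;

public class ScriptTagMatch {
    public static void main(String[] args) {
        String tag = "<SCRIPT language=javascript>";

        // Literal lowercase pattern: no match (wrong case, attributes present)
        System.out.println(Pattern.compile("<script>")
                .matcher(tag).find()); // false

        // (?i) makes it case-insensitive; [^>]* allows attributes
        System.out.println(Pattern.compile("(?i)<script\\b[^>]*>")
                .matcher(tag).find()); // true
    }
}
```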

@essiembre
Contributor

Norconex HTTP Collector 2.2.0 official release is out. It includes this fix. You can download it here.
