
Javascript generated URLs #119

Closed
AntonioAmore opened this issue Jun 15, 2015 · 6 comments

@AntonioAmore

I tried to crawl a site and got the following error in the log:

site: 2015-06-10 21:38:12 DEBUG - ACCEPTED document reference. Reference=http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+' Filter=com.norconex.collector.core.filter.impl.ExtensionReferenceFilter@3ec300f1[onMatch=EXCLUDE,extensions=jpg,gif,png,ico,swf,css,js,pdf,doc,xls,ppt,txt,odt,zip,rar,gz,swf,xlsx,docx,pptx,mp3,wav,mid,caseSensitive=false]
site: 2015-06-10 21:38:12 ERROR - site: Could not process document: http://www.site.com/dccom/0-5-7171-1-1810165-1-0-0-0-0-0-9296-0-0-0-0-0-0-0-0.html (Invalid URL syntax: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+')
com.norconex.commons.lang.url.URLException: Invalid URL syntax: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:181)
    at com.norconex.collector.http.url.impl.GenericURLNormalizer.normalizeURL(GenericURLNormalizer.java:159)
    at com.norconex.collector.http.pipeline.queue.HttpQueuePipeline$URLNormalizerStage.executeStage(HttpQueuePipeline.java:133)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:31)
    at com.norconex.collector.http.pipeline.queue.AbstractQueueStage.execute(AbstractQueueStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:94)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:90)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:213)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:473)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:373)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessURLsRunnable.run(AbstractCrawler.java:631)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.URISyntaxException: Illegal character in path at index 60: http://www.site.com/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'
    at java.net.URI$Parser.fail(URI.java:2848)
    at java.net.URI$Parser.checkChars(URI.java:3021)
    at java.net.URI$Parser.parseHierarchical(URI.java:3105)
    at java.net.URI$Parser.parse(URI.java:3053)
    at java.net.URI.<init>(URI.java:588)
    at com.norconex.commons.lang.url.URLNormalizer.<init>(URLNormalizer.java:179)
    ... 16 more

After reading the page's HTML I found the following fragment there:

<SCRIPT language=javascript><!--- 
 document.write('<img src="/Projects/c2c/channel/images/'+L140413[1+Math.round(Math.random()*(1-1))]+'"  border="0">');
// --->
...

It seems this fragment caused the exception, and it looks like it is the reason no pages were imported during the crawl.

Is there a way to avoid such links, or to keep the import from breaking when something like that happens?
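For context, the failure is reproducible outside the crawler: java.net.URI enforces RFC 2396 syntax, and characters such as [ and ] are not legal in a path, which is exactly what the URLNormalizer constructor hits in the stack trace above. A minimal sketch (the URL is the one from the log):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class BadUrlDemo {
    public static void main(String[] args) {
        // The "URL" extracted from the document.write(...) fragment
        String bad = "http://www.site.com/Projects/c2c/channel/images/"
                + "'+L140413[1+Math.round(Math.random()*(1-1))]+'";
        try {
            new URI(bad); // java.net.URI rejects illegal path characters
            System.out.println("parsed (unexpected)");
        } catch (URISyntaxException e) {
            // Same failure mode as the URLNormalizer constructor
            System.out.println("rejected: " + e.getReason());
        }
    }
}
```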

@essiembre
Contributor

I'll flag this as a bug, since a single broken URL should not prevent the extraction of good ones in the same document (and that document should not be rejected).

Whether we should consider URLs generated via script is another question. More often than not, since JavaScript is not interpreted, we will get errors. So I think we should also make sure script-generated URLs are not considered in the first place. If one day we add JavaScript support, it should be with something browser-like that interprets the JavaScript.
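One simple way to keep script-generated text out of link extraction is to strip script blocks from the HTML before scanning for URLs. A minimal sketch of that idea (an illustration only, not the collector's actual HtmlLinkExtractor implementation):

```java
import java.util.regex.Pattern;

public class ScriptStripper {
    // (?is): case-insensitive, and DOTALL so a block may span lines.
    // [^>]* tolerates attributes such as <SCRIPT language=javascript>.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    public static String stripScripts(String html) {
        return SCRIPT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/ok.html\">ok</a>"
                + "<SCRIPT language=javascript><!--- \n"
                + " document.write('<img src=\"/images/'+x[0]+'\">');\n"
                + "// ---></SCRIPT>";
        // The anchor survives; the script block (and its fake URL) is gone.
        System.out.println(stripScripts(html));
    }
}
```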

@essiembre essiembre added the bug label Jun 17, 2015
@essiembre essiembre added this to the 2.2.0 milestone Jun 17, 2015
@AntonioAmore
Author

Thank you.

It makes sense that the collector does not process JS-generated URLs, since it has no JS interpreter (at least not yet), and correctly continues with the valid ones.

essiembre added a commit that referenced this issue Jun 19, 2015
- ... so that JavaScript-generated URLs can no longer cause trouble (GitHub
#119).
- Fixed HTML documents being skipped when HtmlLinkExtractor found a URL
of invalid format. Now a warning is logged for each bad URL instead, the
document is processed anyway, and good URLs are extracted (GitHub
#119).
@essiembre
Contributor

The fix is in the latest snapshot release.

@AntonioAmore
Author

Thanks a lot! It is working perfectly.

@OkkeKlein

Looking at the regex that matches the script tag, it will only match <script> and not <SCRIPT language=javascript>.
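To illustrate the point (these are not the collector's actual patterns): a literal lowercase pattern misses that tag, while a case-insensitive pattern that tolerates attributes matches it:

```java
import java.util.regex.Pattern;

public class ScriptTagMatch {
    public static void main(String[] args) {
        String tag = "<SCRIPT language=javascript>";

        // Literal lowercase pattern: no match (wrong case, attributes present)
        System.out.println(Pattern.compile("<script>")
                .matcher(tag).find()); // false

        // (?i) makes it case-insensitive; [^>]* allows attributes
        System.out.println(Pattern.compile("(?i)<script\\b[^>]*>")
                .matcher(tag).find()); // true
    }
}
```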

@essiembre
Contributor

Norconex HTTP Collector 2.2.0 official release is out. It includes this fix. You can download it here.
