Javascript generated URLs #119
Comments
I'll flag this as a bug, since a single broken URL should not prevent the extraction of good ones in the same document (and that document should not be rejected). Whether we should consider URLs generated via script at all is another question. More often than not, since JavaScript is not interpreted, I think we'll get errors. So we should also make sure script-generated URLs are not considered in the first place. If one day we add JavaScript support, it should be with something browser-like that actually interprets JavaScript.
Thank you. It makes sense that the collector doesn't process JS-generated URLs, since it has no JS interpreter (at least not yet), and that it should simply carry on with the valid ones.
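To illustrate the two behaviours described above (ignore anything inside script blocks, and skip a single bad URL instead of failing the whole document), here is a minimal sketch. It is not the collector's actual code: the class name `LinkExtractionSketch`, the method `extractLinks`, and both regular expressions are assumptions made up for this example, and the `href` pattern is deliberately simplified (quoted attribute values only).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Norconex's implementation.
public class LinkExtractionSketch {

    // Matches whole <script ...>...</script> blocks, ignoring case and
    // spanning line breaks (assumed pattern, not the one used by the collector).
    private static final Pattern SCRIPT_BLOCK = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Simplified matcher for quoted href attribute values.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static List<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);

        // Step 1: drop script blocks so script-generated URLs are never seen.
        String withoutScripts = SCRIPT_BLOCK.matcher(html).replaceAll(" ");

        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(withoutScripts);
        while (m.find()) {
            String raw = m.group(2).trim();
            try {
                // Step 2: resolve each URL on its own, so one invalid
                // value cannot abort extraction for the whole document.
                urls.add(base.resolve(new URI(raw)).toString());
            } catch (URISyntaxException e) {
                System.err.println("Skipping invalid URL: " + raw);
            }
        }
        return urls;
    }
}
```

Calling `LinkExtractionSketch.extractLinks(html, "http://example.com/dir/page.html")` on a page that mixes valid links, a `document.write(...)` call inside a script block, and one malformed `href` should return only the valid links, resolved against the base URL. A real HTML parser would be more robust than regular expressions; the regex approach is used here only to keep the sketch short.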
The fix is in the latest snapshot release.
Thanks a lot! It is working perfectly.
Looking at the regex that matches the script tag, it will only match
Norconex HTTP Collector 2.2.0 official release is out. It includes this fix. You can download it here.
I tried to crawl a site and got the following error in the log:
After reading the page's HTML, I found the following fragment there:
It seems to be causing the exception, and it looks like the reason no pages are imported during the crawl.
Is there a way to avoid such links, or to keep the import from breaking when something like that happens?