Is it possible to extract links using a custom pattern? #236
Comments
Right now, anything that is not a regular URL (absolute or relative) is ignored by the GenericLinkExtractor. For now, to extract the JavaScript links, you would need to create your own implementation of ILinkExtractor. I will make this a feature request: the ability to provide a custom regex to extract just about anything you like as URLs.
I take back my last comment. It turns out you should be able to get your URLs by adding "javascript" as a supported scheme, like this:
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
  <schemes>http,https,ftp,javascript</schemes>
</extractor>
Please give it a try and confirm. I will still keep this as a feature request, since I think supporting custom regex is a nice-to-have.
OK, I will give it a try, thanks.
Could you please tell me whether it worked? I have the same issue with JavaScript.
Maybe you can paste your config and/or sample HTML code that does not get extracted how you would expect?
I want to extract links that have JavaScript as the href, like the one below. Is that possible?
<a href="javascript:removeFiltersProc();" class="btn btn-success btn-binder-success btn-sm">Remove Selected</a>
There is also another situation: on a page that is still loading, I can't catch any links because the page has not loaded yet. Is it possible to handle that?
The crawler does not interpret the JavaScript, so it is currently not possible to do what you want out-of-the-box. There is already a feature request for interpreting JavaScript in issue #95. We are working on a solution for the next feature release (no set time-frame for it).
The latest snapshot has a new RegexLinkExtractor that makes it possible to use custom patterns to extract URLs. For JavaScript generated URLs, you should have a look at the PhantomJSDocumentFetcher, tracked in #95.
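To illustrate what a custom pattern for this case could look like, here is a minimal sketch in plain Java using only java.util.regex. It is not the RegexLinkExtractor API or its configuration; the class name JavascriptHrefExtractor and the pattern itself are assumptions made for illustration, and the pattern only handles double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: collects href values
// that start with "javascript:" from a chunk of HTML.
public class JavascriptHrefExtractor {

    // Assumed pattern: href="javascript:..." with a double-quoted value.
    // Group 1 captures the full value, including the "javascript:" prefix.
    private static final Pattern JS_HREF = Pattern.compile(
            "href\\s*=\\s*\"(javascript:[^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = JS_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html =
                "<a href=\"javascript:removeFiltersProc();\" class=\"btn\">Remove Selected</a>"
                + "<a href=\"/about.html\">About</a>";
        // Only the javascript: link is reported; the regular URL is skipped.
        System.out.println(extract(html));
    }
}
```

Note that this only pulls the href text out of the markup. As mentioned above, the crawler does not interpret JavaScript, so a value such as javascript:removeFiltersProc(); still cannot be followed like a regular URL.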
For some reason, I have to extract links that have JavaScript as the href, like the one below. Is it possible?
<a href="javascript:__doPostBack('MoreInfoList1$Pager','2')" title=""><img src="/sczw/images/page/2n.gif" align="Baseline" border="0"></a>