Is it possible to extract links using a custom pattern? #236

Closed
bruce-genhot opened this issue Mar 15, 2016 · 8 comments

Comments

@bruce-genhot

For some reason, I need to extract links that have JavaScript as the href, like the one below. Is it possible?
<a href="javascript:__doPostBack('MoreInfoList1$Pager','2')" title=""><img src="/sczw/images/page/2n.gif" align="Baseline" border="0"></a>

@essiembre
Contributor

Right now, anything that is not a regular URL (absolute or relative) is ignored by the GenericLinkExtractor. At the moment, to extract JavaScript links, you would need to create your own implementation of ILinkExtractor.

I will make this a feature request to add the ability to provide custom regex to extract just about anything you like as URLs.
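
To illustrate the kind of custom regex mentioned above, here is a minimal sketch in plain `java.util.regex`. It is not Norconex code and does not implement ILinkExtractor; the class name `JavascriptHrefExtractor` and its `extract` method are hypothetical, and the pattern assumes double-quoted href attributes like the one in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: collects href values that use
// the "javascript:" scheme from double-quoted href attributes.
class JavascriptHrefExtractor {

    // Group 1 captures everything between the quotes, scheme included.
    private static final Pattern JS_HREF = Pattern.compile(
            "href\\s*=\\s*\"(javascript:[^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = JS_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:__doPostBack('MoreInfoList1$Pager','2')\""
                + " title=\"\"><img src=\"/sczw/images/page/2n.gif\"></a>";
        // Prints each captured href value on its own line.
        extract(html).forEach(System.out::println);
    }
}
```

A regex like this only captures the text of the attribute. Single-quoted or unquoted attributes would need a broader pattern.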

@essiembre essiembre added this to the 2.5.0 milestone Mar 20, 2016
@essiembre essiembre self-assigned this Mar 20, 2016
@essiembre
Contributor

I take back my last comment. It turns out you should be able to get your URLs by adding "javascript" as a supported scheme. Like this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
    <schemes>http,https,ftp,javascript</schemes>
</extractor>

Please give it a try and confirm.

I will still keep this as a feature request since I think supporting custom regex is a nice to have.

@essiembre essiembre removed this from the 2.5.0 milestone Mar 20, 2016
@bruce-genhot
Author

OK, will have a try, thanks.

@MohamedElsakka

Could you tell me whether it worked? I have the same issue with JavaScript.
In my case, I edited the GenericLinkExtractor class as follows. Is that OK?
private static final String[] DEFAULT_SCHEMES =
new String[] { "http", "https", "ftp", "javascript" };

@essiembre
Contributor

Maybe you can paste your config and/or sample HTML code that does not get extracted how you would expect?

@MohamedElsakka

MohamedElsakka commented Jan 4, 2017

I want to extract links that have JavaScript as the href, like the one below. Is it possible?

<a href="javascript:removeFiltersProc();" class="btn btn-success btn-binder-success btn-sm">Remove Selected</a>

There is also another situation with a loading page: I can't capture any links because the page has not loaded yet. Is that possible?

@essiembre
Contributor

The crawler does not interpret the JavaScript, so it is currently not possible to do what you want out-of-the-box. There is already a feature request for interpreting JavaScript in issue #95. We are working on a solution for the next feature release (no set time-frame for it).

@essiembre essiembre changed the title from "Is it possible to extract links having javascript as href ?" to "Is it possible to extract links using a custom pattern?" Feb 20, 2017
@essiembre
Contributor

The latest snapshot has a new RegexLinkExtractor that makes it possible to use custom patterns to extract URLs. For JavaScript-generated URLs, you should have a look at the PhantomJSDocumentFetcher, tracked in #95.
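
As a rough idea of what a custom pattern with capture groups could look like for the `__doPostBack` link from the original question, here is a sketch in plain `java.util.regex`. It does not show RegexLinkExtractor's actual configuration; the class name `PostBackArgs` and its `parse` method are hypothetical, and the pattern assumes both arguments are single-quoted strings.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: pulls the two quoted arguments
// out of an href such as javascript:__doPostBack('target','argument').
class PostBackArgs {

    // Group 1 is the first argument, group 2 the second.
    private static final Pattern POSTBACK = Pattern.compile(
            "javascript:__doPostBack\\('([^']*)'\\s*,\\s*'([^']*)'\\)");

    static Optional<String[]> parse(String href) {
        Matcher m = POSTBACK.matcher(href);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2) });
    }

    public static void main(String[] args) {
        String href = "javascript:__doPostBack('MoreInfoList1$Pager','2')";
        // Prints the two captured arguments when the href matches.
        parse(href).ifPresent(a ->
                System.out.println("target=" + a[0] + ", argument=" + a[1]));
    }
}
```

The captured values are plain strings, not fetchable URLs. Since the crawler does not interpret JavaScript, turning them into a request is a separate problem.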

@essiembre essiembre added this to the 2.7.0 milestone Feb 23, 2017