Is it possible to extract links using a custom pattern? #236

bruce-genhot · 2016-03-15T05:52:47Z

For certain reasons, I have to extract links that have JavaScript as their href, like the one below. Is it possible?
<a href="javascript:__doPostBack('MoreInfoList1$Pager','2')" title=""><img src="/sczw/images/page/2n.gif" align="Baseline" border="0"></a>


essiembre · 2016-03-20T19:59:15Z

Right now, anything that is not a regular URL (absolute or relative) is ignored by the GenericLinkExtractor. For now, to extract the JavaScript links, you would need to create your own implementation of ILinkExtractor.

I will make this a feature request to add the ability to provide a custom regex to extract just about anything you like as URLs.

essiembre · 2016-03-20T20:10:06Z

I take back my last comment. It turns out you should be able to get your URLs by adding "javascript" as a supported scheme. Like this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
    <schemes>http,https,ftp,javascript</schemes>
</extractor>

Please give it a try and confirm.

I will still keep this as a feature request, since I think support for custom regex is a nice-to-have.
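
To illustrate the idea behind the schemes setting above, here is a minimal Java sketch of a scheme allow-list. It is not the GenericLinkExtractor source: the `SchemeFilter` class and its `accepts` method are hypothetical names made up for this example, and the real extractor may decide differently. The sketch assumes that an href without a scheme is a relative URL and is always kept.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: keeps an href only if it is
// relative or if its scheme is in the configured allow-list.
public class SchemeFilter {

    // A URI scheme: a letter followed by letters, digits, '+', '.' or '-', then ':'.
    private static final Pattern SCHEME =
            Pattern.compile("^([A-Za-z][A-Za-z0-9+.-]*):");

    private final Set<String> allowed = new HashSet<>();

    public SchemeFilter(String... schemes) {
        for (String s : schemes) {
            allowed.add(s.toLowerCase(Locale.ROOT));
        }
    }

    public boolean accepts(String href) {
        Matcher m = SCHEME.matcher(href.trim());
        if (!m.find()) {
            return true; // no scheme: treated as a relative URL (assumption)
        }
        return allowed.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String href = "javascript:__doPostBack('MoreInfoList1$Pager','2')";
        // Default-like list: the javascript href is rejected.
        System.out.println(new SchemeFilter("http", "https", "ftp").accepts(href));
        // With "javascript" added, the same href is kept.
        System.out.println(new SchemeFilter("http", "https", "ftp", "javascript").accepts(href));
    }
}
```

With the first list the sample href is rejected, and with "javascript" added it is kept, which is the effect the configuration change is expected to have.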

bruce-genhot · 2016-03-21T02:25:01Z

OK, will have a try, thanks.

MohamedElsakka · 2016-12-25T08:15:07Z

Can you please tell me whether it worked? I have the same issue with JavaScript.
In my case, I edited the GenericLinkExtractor class like this:
private static final String[] DEFAULT_SCHEMES =
new String[] { "http", "https", "ftp", "javascript" };
Is that OK?

essiembre · 2016-12-26T06:00:48Z

Maybe you can paste your config and/or sample HTML code that does not get extracted the way you expect?

MohamedElsakka · 2017-01-04T07:24:32Z

I want to extract links that have JavaScript as their href, like the one below. Is it possible?

<a href="javascript:removeFiltersProc();" class="btn btn-success btn-binder-success btn-sm">Remove Selected</a>

There is also another situation: on a page that is still loading, I can't catch any links because the page has not finished loading yet. Is it possible to handle that?

essiembre · 2017-01-05T05:26:29Z

The crawler does not interpret the JavaScript, so it is currently not possible to do what you want out-of-the-box. There is already a feature request for interpreting JavaScript in issue #95. We are working on a solution for the next feature release (no set time-frame for it).

essiembre · 2017-02-23T06:10:45Z

The latest snapshot has a new RegexLinkExtractor that makes it possible to use custom patterns to extract URLs. For JavaScript-generated URLs, you should have a look at the PhantomJSDocumentFetcher, tracked in #95.
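
For readers who want to see what custom-pattern extraction amounts to, here is a minimal sketch using only `java.util.regex`. It does not use or reproduce the RegexLinkExtractor API or its configuration: the `CustomPatternLinks` class, the `extract` method, and the pattern itself are hypothetical, chosen to match the sample HTML from the first comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example, not Norconex code: pulls "javascript:" hrefs
// out of raw HTML with a custom regular expression.
public class CustomPatternLinks {

    // Group 1 captures the value of a double-quoted href starting with "javascript:".
    private static final Pattern JS_HREF = Pattern.compile(
            "href=\"(javascript:[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = JS_HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"javascript:__doPostBack('MoreInfoList1$Pager','2')\" title=\"\">"
                + "<img src=\"/sczw/images/page/2n.gif\" align=\"Baseline\" border=\"0\"></a>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Note that this only captures the href text. As mentioned earlier in the thread, the crawler does not interpret JavaScript, so a captured `javascript:` value is not a page that can be fetched as-is.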

essiembre added the feature-request label Mar 20, 2016

essiembre added this to the 2.5.0 milestone Mar 20, 2016

essiembre self-assigned this Mar 20, 2016

essiembre removed this from the 2.5.0 milestone Mar 20, 2016

essiembre added the question label Mar 22, 2016

essiembre changed the title ~~Is it possible to extract links having javascript as href ?~~ Is it possible to extract links using a custom pattern? Feb 20, 2017

essiembre mentioned this issue Feb 20, 2017

[GenericLinkExtractor] RSS - Links with CDATA are not extracted #319

Closed

essiembre added a commit that referenced this issue Feb 23, 2017

New XMLFeedLinkExtractor and RegexLinkExtractor (github #319 and #236).

e0a3ff4

essiembre added the resolved label Feb 23, 2017

essiembre added this to the 2.7.0 milestone Feb 23, 2017

essiembre closed this as completed Apr 3, 2017
