[GenericLinkExtractor] RSS - Links with CDATA are not extracted #319

sylvainroussy · 2017-02-20T15:09:15Z

Hi,

I would like to parse an RSS feed (https://medlineplus.gov/feeds/news_en.xml), but the URL values of the link tags are enclosed in CDATA and they seem to be ignored by the GenericLinkExtractor.

Is there a way to catch them?

Thanks.
Sylvain

Edit: some corrections


essiembre · 2017-02-20T17:06:29Z

Right now, this would be possible by writing your own ILinkExtractor.

The default link extractor is GenericLinkExtractor, which is geared towards HTML pages. There is a feature request (#236) to add the ability to extract links with custom regex patterns, but that's not there yet.

To keep the generic link extractor and add yours for RSS feeds via XML config, you can do it like this:

    <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"/>
        <extractor class="your.custom.Extractor"/>
    </linkExtractors>
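
For the extraction logic itself, here is a minimal sketch of how the CDATA-wrapped link values could be read with the JDK's built-in DOM parser. The class name `RssLinkSketch` and the method `extractLinks` are hypothetical, and the ILinkExtractor method signatures are not shown in this thread, so wiring this into `your.custom.Extractor` is left out on purpose:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the collector's API.
public class RssLinkSketch {

    // Returns the trimmed text of every <link> element found in the feed.
    public static List<String> extractLinks(String feedXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations, a common precaution for untrusted XML.
        // Assumption: the feeds being crawled do not declare a DOCTYPE.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(
                new InputSource(new StringReader(feedXml)));

        List<String> links = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName("link");
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() returns plain text and CDATA content alike.
            String url = nodes.item(i).getTextContent().trim();
            if (!url.isEmpty()) {
                links.add(url);
            }
        }
        return links;
    }
}
```

Because a DOM parser treats CDATA sections as ordinary text nodes, the same loop handles both wrapped and unwrapped link values. A custom extractor would still need to resolve relative URLs and hand the results back in whatever form ILinkExtractor expects.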

sylvainroussy · 2017-02-22T09:59:55Z

Ok, thank you.

essiembre · 2017-02-23T06:09:32Z

FYI, the latest snapshot release has a new XMLFeedLinkExtractor you can use.

essiembre added the question label Feb 20, 2017

sylvainroussy closed this as completed Feb 22, 2017

essiembre added a commit that referenced this issue Feb 23, 2017

New XMLFeedLinkExtractor and RegexLinkExtractor (github #319 and #236).

e0a3ff4
