[GenericLinkExtractor] RSS - Links with CDATA are not extracted #319

Closed
sylvainroussy opened this issue Feb 20, 2017 · 3 comments

sylvainroussy commented Feb 20, 2017

Hi,

I would like to parse an RSS feed (https://medlineplus.gov/feeds/news_en.xml), but the URL values of the link tags are enclosed in CDATA, and it seems they are ignored by the GenericLinkExtractor.

Is there a way to extract them?

Thanks.
Sylvain

Edit: some corrections

@essiembre
Contributor

Right now, this would be possible by writing your own ILinkExtractor.

The default link extractor is GenericLinkExtractor, which is geared towards HTML pages. There is a feature request (#236) to support extracting links with custom regex patterns, but that is not implemented yet.

To keep the generic link extractor and add yours for RSS feeds via XML config, you can do it like this:

    <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor"/>
        <extractor class="your.custom.Extractor"/>
    </linkExtractors>
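
For the extraction logic itself, here is a minimal sketch of how the CDATA-wrapped URLs could be read, using only the JDK's DOM parser. The class name `RssLinkSketch` and its methods are hypothetical, and the sketch deliberately does not implement ILinkExtractor, since that interface's exact method signatures are not shown in this thread. A custom extractor could presumably delegate to something like this:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Norconex: collects the text of every
// <link> element in an RSS document, whether or not it is wrapped in CDATA.
class RssLinkSketch {

    static List<String> extractLinks(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Reject DOCTYPE declarations so untrusted feeds cannot trigger
        // external entity resolution (assumes feeds carry no DOCTYPE).
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xml);

        // The factory is not namespace-aware by default, so this matches
        // plain <link> elements and skips prefixed ones such as <atom:link>.
        NodeList nodes = doc.getElementsByTagName("link");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() returns CDATA content like ordinary text.
            String url = nodes.item(i).getTextContent().trim();
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return urls;
    }

    // Convenience overload for in-memory feeds; wraps checked exceptions.
    static List<String> extractLinks(String xml) {
        try (InputStream in = new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8))) {
            return extractLinks(in);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss version=\"2.0\"><channel><item>"
                + "<link><![CDATA[https://example.com/news/1]]></link>"
                + "</item></channel></rss>";
        extractLinks(feed).forEach(System.out::println);
    }
}
```

An XML parser is used here rather than a regex because it handles CDATA and plain-text link values uniformly. The URLs returned are raw strings; resolving relative URLs against the feed's reference would still be up to the extractor that wraps this.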

@sylvainroussy
Author

Ok, thank you.

@essiembre
Contributor

FYI, the latest snapshot release has a new XMLFeedLinkExtractor you can use.
