I would like to parse an RSS feed (https://medlineplus.gov/feeds/news_en.xml), but the URL values of the link tags are enclosed in CDATA and it seems they are ignored by the GenericLinkExtractor.
Is there a way to catch them?
Thanks.
Sylvain
Edit: some corrections
Right now, this would be possible by writing your own ILinkExtractor.
The default link extractor is GenericLinkExtractor and is geared towards HTML pages. There is a feature request (#236) to add the ability to use regex to extract links using custom patterns but that's not there yet.
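To illustrate what a custom extractor for RSS would have to do, here is a minimal standalone sketch in Python. It is not the collector's API and not an ILinkExtractor implementation; the function name `extract_rss_links` and the sample feed are made up for illustration. It only shows that a real XML parser returns the text of a CDATA section as the element's text, so the URLs inside the link tags become reachable once the feed is parsed as XML rather than as HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: two links wrapped in CDATA, one plain.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link><![CDATA[https://example.org/]]></link>
    <item>
      <title>First item</title>
      <link><![CDATA[https://example.org/news/1.html]]></link>
    </item>
    <item>
      <title>Second item</title>
      <link>https://example.org/news/2.html</link>
    </item>
  </channel>
</rss>
"""


def extract_rss_links(xml_text):
    """Return the URLs found in every <link> element, in document order.

    The XML parser unwraps CDATA sections, so CDATA-wrapped and plain
    URLs are handled the same way.
    """
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter("link"):
        url = (element.text or "").strip()
        if url:  # skip empty <link/> elements
            links.append(url)
    return links


if __name__ == "__main__":
    for url in extract_rss_links(SAMPLE_FEED):
        print(url)
```

A custom ILinkExtractor would apply the same idea in Java with an XML parser of your choice and hand the resulting URLs back to the crawler.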
To keep the generic link extractor and add yours for RSS feeds via XML config, you can do it like this: