tidyRSS fails to parse feeds: "xmlXPathEval: evaluation failed" #31

Closed
alastairrushworth opened this issue Jan 11, 2020 · 4 comments

@alastairrushworth

Hi @RobertMyles

Thanks for the amazing tidyRSS package, I find it very useful indeed! Thought I'd get in touch to file a quick issue as I've noticed that quite a number of feeds don't parse correctly.

For example:

# tested with v1.2.11
library(tidyRSS)
tidyfeed("http://abigailsee.com/feed.xml")

Returns the error:

Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = 1) : 
  xmlXPathEval: evaluation failed

I think the feed itself is OK, and tidyfeed seems to fetch it without trouble, but something goes awry during parsing. I noticed the same error with several other feeds, which I've copied below:

feed_vec <- 
  c("http://abigailsee.com/feed.xml",
    "https://adamgoodkind.com/feed.xml",
    "http://adomingues.github.io/feed.xml",
    "http://aebou.rbind.io/index.xml",
    "http://agrarianresearch.org/blog/?feed=rss2",
    "http://akosm.netlify.com/index.xml",
    "http://alburez.me/feed.xml",
    "http://alexmorley.me/feed.xml",
    "https://alexwhan.com/index.xml",
    "http://allthingsr.blogspot.com/feeds/posts/default?alt=rss",
    "http://allthiswasfield.blogspot.com/feeds/posts/default?alt=rss",
    "http://almostrandom.netlify.com/index.xml",
    "http://altran-data-analytics.netlify.com/index.xml",
    "https://www.amitkohli.com/index.xml",
    "http://analisisydecision.es/feed/",
    "http://andysouth.github.io/feed.xml",
    "http://annakrystalli.me/index.xml",
    "http://annarborrusergroup.github.io/feed.xml",
    "http://anotherblogaboutr.blogspot.com/feeds/posts/default?alt=rss",
    "http://anpefi.eu/index.xml",
    "https://fishandwhistle.net/index.xml",
    "https://www.ardata.fr/index.xml",
    "http://arnab.org/blog/atom.xml",
    "http://arunatma.blogspot.com/feeds/posts/default?alt=rss",
    "http://asbcllc.com/feed.xml",
    "http://ashiklom.github.io/feed.xml",
    "http://aurielfournier.github.io/feed.xml",
    "http://austinwehrwein.com/index.xml")

I'm working on a side project at the moment that involves about 3K RSS feeds, which I'm happy to share once I've tidied it up a bit. It might help with identifying other edge cases; I know how finicky RSS feeds can be! I'm also happy to help with this issue if you can point me in the right direction.

Thanks,

Alastair

@RobertMyles
Owner

Hi Alastair,

Yeah, RSS feeds can be a pain, and I've seen this error a few times. I'm not sure off the top of my head where exactly it pops up. I'll have a look as soon as I can, but if you're interested in contributing, it's probably happening in one of the *_parse functions. I'm trying to clean up a lot of little things in the package for a 1.3 version, so your list will help a lot. In the meantime, I'll leave this open until I can figure out the source of the error.

Rob
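
For anyone looking into the `*_parse` functions: one situation that, as far as I know, makes xml2 stop with this message is an XPath that uses a namespace prefix the document never declares (for example `dc:creator` in a feed without the `dc` namespace). Whether that is what happens inside tidyRSS is an assumption and is not confirmed in this thread. The helper `find_first_or_null()` below is hypothetical and not part of tidyRSS; it is only a minimal sketch of guarding such a lookup.

```r
library(xml2)

# A tiny RSS document that does NOT declare the "dc" namespace.
feed <- read_xml(
  '<rss version="2.0"><channel><item><title>Post</title></item></channel></rss>'
)

# Hypothetical helper: look up the first match, returning NULL instead of
# stopping when xml2 cannot evaluate the XPath.
find_first_or_null <- function(doc, xpath) {
  tryCatch(
    suppressWarnings(xml_find_first(doc, xpath)),
    error = function(e) NULL
  )
}

# A plain path that exists in the document.
title <- find_first_or_null(feed, ".//item/title")

# A path with an undeclared prefix; assumed to be the kind of lookup that fails.
creator <- find_first_or_null(feed, ".//item/dc:creator")

print(xml_text(title))
```

A parser written this way can fill the column with `NA` when the lookup comes back empty, rather than aborting the whole feed.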

@RobertMyles
Owner

I had a quick chance to look at this today and with the dev version I'm getting:

> tidyfeed("http://abigailsee.com/feed.xml")
# A tibble: 5 x 5
  feed_link    item_title          item_date_published item_description                 item_link             
  <chr>        <chr>               <dttm>              <chr>                            <chr>                 
1 http://abig… What makes a good… 2019-08-13 00:00:00 "<!--excerpt.start-->\n<p><em>T… http://abigailsee.com…
2 http://abig… Deep Learning, Str… 2018-02-21 00:00:00 "<!--excerpt.start-->\n<html>\n… http://abigailsee.com…
3 http://abig… Four deep learning… 2017-08-30 00:00:00 "<head>\n<script src=\"https://… http://abigailsee.com…
4 http://abig… Four deep learning… 2017-08-30 00:00:00 "<head>\n<script src=\"https://… http://abigailsee.com…
5 http://abig… Taming Recurrent N… 2017-04-16 00:00:00 "<!--excerpt.start-->\n<p><em>T… http://abigailsee.com…

With your vector of feeds, I get:

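# Wrap tidyfeed once so that an error is captured instead of aborting the loop
stfeed <- purrr::safely(tidyfeed)

purrr::map(feed_vec, ~ {
  ret <- stfeed(.x)
  # safely() returns list(result, error); error is NULL on success
  if (is.null(ret$error)) {
    print("Feed OK")
  } else {
    print("Feed unavailable")
  }
})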

[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed unavailable"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"
[1] "Feed OK"

I'll try to get 1.3 finished as soon as possible. As you can see, it fixes most of these problems. In the meantime, if you'd like to try the dev version, it should help.
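
The pass/fail check above can be extended to keep the error message for each feed, which makes it easier to tell a parsing failure from a feed that is simply offline. This is only a sketch in base R: `check_feeds()` and `fake_parser()` are hypothetical names, not part of tidyRSS, and the example URLs are placeholders so that it runs without network access.

```r
# Run `parser` on every URL and record the outcome instead of stopping
# at the first error. Returns one row per URL.
check_feeds <- function(urls, parser = tidyRSS::tidyfeed) {
  results <- lapply(urls, function(url) {
    tryCatch(
      {
        parser(url)
        data.frame(url = url, ok = TRUE, error = NA_character_,
                   stringsAsFactors = FALSE)
      },
      error = function(e) {
        data.frame(url = url, ok = FALSE, error = conditionMessage(e),
                   stringsAsFactors = FALSE)
      }
    )
  })
  do.call(rbind, results)
}

# Stand-in parser for an offline demonstration: fails for URLs containing "bad".
fake_parser <- function(url) {
  if (grepl("bad", url)) stop("xmlXPathEval: evaluation failed")
  data.frame(feed_link = url)
}

report <- check_feeds(
  c("http://good.example/feed.xml", "http://bad.example/feed.xml"),
  parser = fake_parser
)
print(report[!report$ok, ])
```

With tidyRSS installed, `check_feeds(feed_vec)` would run the same check over the vector from the first comment using the default parser.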

@alastairrushworth
Author

Hi Rob - that's perfect, I think that fixes it completely for me. Thanks a lot for that, I'll stick to dev until 1.3.

I'll drop you a note when I've got the long list of feeds tidied up, in case it can help.

Cheers!

@RobertMyles
Owner

That would be a help, thanks Alastair.
