Low Retrieval Rate #28

grantnolasco · 2019-03-11T23:39:57Z

A user of the Bibscan library asked for tips on improving the retrieval rate, so my task for today was to figure out why it is so low. First, I ran the code given to me and got the same number of successful PDF retrievals. Based on the error messages, it appears that the links don't work (I'm not sure whether this is obvious, since I don't know this package well).

To look into it further, I checked the first ten documents. The documents from Elsevier and Wiley were not working. While trying to figure out why, I landed on this page: CrossRef/rest-api-doc#96. Also, in the crminer package, they said "At least Elsevier and I think Wiley also check your IP address in addition to requiring the authentication token". So maybe that's why these websites aren't working. For SpringerLink, the page says "Page Not Found". The Cambridge website shows the warning pop-up "Unfortunately you do not have access to this content, please use the Get access link below for information on how to access this content."

These are the websites/links from the first ten rows. Beyond these error messages, I'm not sure what else to look at, since I'm pretty new to how this code (especially crminer) works.

nathanhwangbo · 2019-10-18T20:33:10Z

Looks like the problem is in the call to crminer::crm_text(). We use crminer::crm_links() to map DOI -> URL.

However, this mapping often fails to find PDF links (it finds HTML/XML links instead). I'll keep playing around with it and see if I can improve the results.
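To make the failure mode concrete, here is a minimal sketch of picking a PDF link out of a set of candidate links for one DOI. It is written in Python for illustration only and is not how crminer does it. The record shape (`URL` and `content-type` keys) is an assumption modeled on Crossref-style link metadata, and `pick_pdf_link` and the example URLs are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def pick_pdf_link(links: list[dict]) -> Optional[str]:
    """Return the best candidate PDF URL from a list of link records, or None.

    Each record is assumed (hypothetically) to look like
    {"URL": "...", "content-type": "..."}.
    """
    # First preference: a link explicitly declared as a PDF.
    for link in links:
        content_type = (link.get("content-type") or "").lower()
        if content_type == "application/pdf" and link.get("URL"):
            return link["URL"]

    # Fallback: a link whose path ends in ".pdf", even if its declared
    # content type is missing or unspecified.
    for link in links:
        url = link.get("URL") or ""
        if urlparse(url).path.lower().endswith(".pdf"):
            return url

    # Only HTML/XML (or nothing) was offered for this DOI.
    return None


# Hypothetical link records for a single DOI.
example_links = [
    {"URL": "https://example.org/article.xml", "content-type": "text/xml"},
    {"URL": "https://example.org/article.pdf", "content-type": "application/pdf"},
]
print(pick_pdf_link(example_links))
```

A `None` result here corresponds to the case described above, where only HTML/XML links are found for a DOI.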

nathanhwangbo · 2019-10-22T00:01:46Z

I think we can partition the errors into a few different types:

1. Bad links (e.g. Elsevier, mislabeled PDFs)
2. Bad permissions, i.e. the link returns an HTTP 403 error (e.g. syndication.highwire)
3. "pdf error": illegal characters stopping us from downloading
4. Unknown: no idea why it's not working

For the most part, I don't think we can fix 1 or 2, but 3 and 4 might be fixable?

1 and 4 look like the biggest sources of error, at least using my sample bib file.
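The four categories above could be tallied automatically with something like the sketch below. It is illustrative Python, not part of Bibscan or crminer: `classify_failure`, its parameters, and the "illegal character" message text are all assumptions. The only external fact it relies on is that a real PDF file starts with the bytes `%PDF`.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    BAD_LINK = 1         # dead link, or a "PDF" link that serves something else
    BAD_PERMISSIONS = 2  # server refuses access (e.g. HTTP 403)
    PDF_ERROR = 3        # download blocked by illegal characters
    UNKNOWN = 4          # none of the above


def classify_failure(
    status_code: Optional[int],
    body_prefix: bytes = b"",
    error_message: str = "",
) -> Failure:
    """Map one failed retrieval onto the four error types in this thread."""
    if status_code in (401, 403):
        return Failure.BAD_PERMISSIONS
    if status_code == 404:
        return Failure.BAD_LINK
    # A successful response that is not actually a PDF is a mislabeled link.
    if status_code == 200 and not body_prefix.startswith(b"%PDF"):
        return Failure.BAD_LINK
    # Hypothetical message text; the real error string may differ.
    if "illegal character" in error_message.lower():
        return Failure.PDF_ERROR
    return Failure.UNKNOWN


print(classify_failure(403))
print(classify_failure(200, b"<html>"))
```

Counting the results over a sample bib file would show which category dominates without inspecting each document by hand.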

nathanhwangbo self-assigned this Oct 18, 2019
