Low Retrieval Rate #28

Open
grantnolasco opened this issue Mar 11, 2019 · 2 comments
@grantnolasco
Contributor

grantnolasco commented Mar 11, 2019

A user of the Bibscan library asked for tips on improving the retrieval rate, so my task for today was to figure out why the retrieval rate is so low. First, I ran the code given to me and got the same number of successful PDF retrievals. Judging by the error messages, the links themselves don't work (this may be obvious to someone who knows the package better than I do).

To look into it further, I checked the first ten documents. The documents from Elsevier and Wiley were not downloading. While trying to figure out why, I landed on this page: CrossRef/rest-api-doc#96. Also, in the crminer package, they say "At least Elsevier and I think Wiley also check your IP address in addition to requiring the authentication token". So that may be why these websites aren't working. For SpringerLink, the link returns "Page Not Found". For the Cambridge website, I get a pop-up warning: "Unfortunately you do not have access to this content, please use the Get access link below for information on how to access this content."

Those are all the websites/links that appear in the first ten rows. Beyond these error messages, I'm not sure what else to look at, since I'm still fairly new to how this code (especially crminer) works.

nathanhwangbo self-assigned this Oct 18, 2019
@nathanhwangbo

It looks like the problem is in the call to crminer::crm_text(). We use crminer::crm_links() to map DOI -> URL.

However, this mapping often fails to find PDF links and finds HTML/XML links instead. I'll keep playing around with it and see whether I can improve the results.
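
To make the DOI -> URL problem concrete, here is a minimal sketch of picking the PDF link out of a set of candidate links for one DOI. It is written in Python rather than R and does not call crminer. The record shape (a list of dicts with `URL` and `content-type` keys) is an assumption modelled loosely on Crossref-style link metadata, and `select_pdf_link` and the example.org URLs are hypothetical.

```python
from typing import Dict, List, Optional


def select_pdf_link(links: List[Dict[str, str]]) -> Optional[str]:
    """Return the URL of the first link declared as PDF, or None if there is none."""
    for link in links:
        # Assumed key names; the real metadata may label PDFs differently.
        content_type = str(link.get("content-type", "")).lower()
        if content_type == "application/pdf":
            return link.get("URL")
    return None


# Hypothetical link metadata for a single DOI (placeholder URLs).
example_links = [
    {"URL": "https://example.org/article.xml", "content-type": "text/xml"},
    {"URL": "https://example.org/article.pdf", "content-type": "application/pdf"},
    {"URL": "https://example.org/article.html", "content-type": "text/html"},
]

print(select_pdf_link(example_links))
```

Returning `None` instead of falling back to the HTML/XML link makes the "no PDF link found" case easy to count separately from download failures.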

@nathanhwangbo

I think we can partition the errors into a few different types:

  1. Bad links (e.g. Elsevier, mislabeled PDFs)
  2. Bad permissions, i.e. the link returns an HTTP 403 error (e.g. syndication.highwire)
  3. "pdf error": illegal characters stopping us from downloading
  4. Unknown: no idea why it's not working
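
A rough sketch of how failed downloads could be sorted into these buckets automatically, so the share of each type can be counted over a whole bib file. This is Python, not the package's R code. The function name `classify_failure` and its inputs (the HTTP status code plus the first bytes of the response body) are assumptions. Type 3 is left out because the thread does not say how that error shows up.

```python
def classify_failure(status_code: int, body_prefix: bytes) -> str:
    """Assign one download attempt to a rough error category."""
    if status_code == 403:
        # Type 2: the server refuses access.
        return "bad permissions"
    if status_code == 404:
        # Type 1: the link points nowhere.
        return "bad link"
    if status_code == 200:
        # A real PDF file starts with the bytes "%PDF-".
        if body_prefix.startswith(b"%PDF-"):
            return "ok"
        # Type 1: link labeled as PDF but serving something else.
        return "bad link"
    # Type 4: anything not explained above.
    return "unknown"


print(classify_failure(403, b""))
print(classify_failure(200, b"<!DOCTYPE html>"))
print(classify_failure(200, b"%PDF-1.4"))
```

Checking the leading bytes rather than trusting the declared content type is what catches the "mislabeled PDFs" case.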

For the most part, I don't think we can fix 1 or 2, but 3 and 4 might be fixable.
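
One possible angle on type 3, sketched in Python: if the illegal characters come from the DOI itself (an assumption, since the thread does not say where they occur), then percent-encoding the DOI before building a URL and sanitizing it before using it as a file name would avoid them. Both helper names and the sample DOI are hypothetical.

```python
import re
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a doi.org URL, percent-encoding everything except '/'."""
    return "https://doi.org/" + quote(doi, safe="/")


def doi_to_filename(doi: str) -> str:
    """Replace every character outside [A-Za-z0-9._-] with '_' and add '.pdf'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi) + ".pdf"


# Hypothetical DOI containing characters that are awkward in URLs and file names.
sample_doi = "10.1234/abc<1::def>2.0;3"

print(doi_to_url(sample_doi))
print(doi_to_filename(sample_doi))
```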

Types 1 and 4 look like the biggest sources of error, at least with my sample bib file.
