New Extractor Idea: scihub-dl
to auto-detect inline DOI numbers and download academic paper PDFs
#720
Captchas would be a problem.
In my testing so far it's not a problem as long as you're not doing dozens of papers at a time.
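One cheap way to stay under whatever request rate triggers the captcha is to pause between downloads. A minimal sketch (the 5–15 second range is a guess, not a measured limit):

```python
import random
import time

def polite_pause(min_s: float = 5.0, max_s: float = 15.0) -> None:
    """Sleep a random interval between downloads to avoid tripping captchas."""
    time.sleep(random.uniform(min_s, max_s))
```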
Two more avenues to explore:
Would be great to capture full text for academic publications where it's available.
Parsing DOIs out of PDF/HTML content:
Found good candidate extractor dependencies to try for the major free scientific paper DBs:
and more via...
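For the DOI-parsing avenue, a minimal sketch of the matching step using Crossref's recommended DOI pattern; the helper name and the punctuation cleanup are illustrative, not details from this thread:

```python
import re

# Crossref's recommended pattern: matches the vast majority of modern DOIs.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+')

def extract_dois(text: str) -> list[str]:
    """Return unique DOI strings found in raw HTML or extracted PDF text."""
    # Strip trailing punctuation that the greedy character class can over-capture.
    return sorted({m.group(0).rstrip('.,;)') for m in DOI_RE.finditer(text)})
```

For example, `extract_dois("see https://doi.org/10.1016/j.cub.2019.11.030 for details")` returns `['10.1016/j.cub.2019.11.030']`.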
Something like this might be interesting for linking together citations too: https://inciteful.xyz/
I have a rough script for this working. Should I just wait for the plugin system? Or I can create a repo with the script and the modified ArchiveBox code.
If you can post it in a gist and share the link, that would be helpful @benmuth!
Here's what I have so far: https://gist.github.com/benmuth/b2c12cbb40ca4d8183c6f17f819e2f2d @pirate
Usage: pass the script a DOI directly, or pass it the URL of a page to scan for DOIs.
It should either download the paper matching the given DOI, or find every DOI on the page at the given URL and download a paper for each one. In the second case, we can probably be smarter about only downloading the DOIs intended (based on title or something?), but it's pretty dumb right now.
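For context, the core download step in a script like this usually looks something like the sketch below: fetch the Sci-Hub page for a DOI, pull the PDF URL out of the embedded viewer, and save it. The mirror URL, the embed/iframe selector, and the helper name are assumptions, not details taken from the gist:

```python
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

SCIHUB_MIRROR = "https://sci-hub.se"  # assumption: mirrors change often

def download_pdf(doi: str, out_dir: str = ".") -> Path | None:
    """Fetch the Sci-Hub page for a DOI and save the embedded PDF, if any."""
    page = requests.get(f"{SCIHUB_MIRROR}/{doi}", timeout=30)
    page.raise_for_status()
    # Sci-Hub pages typically embed the PDF in an <embed> or <iframe> tag;
    # the exact markup varies by mirror, so treat this selector as a guess.
    tag = BeautifulSoup(page.text, "html.parser").find(["embed", "iframe"])
    if tag is None or not tag.get("src"):
        return None  # captcha page, or paper not available
    pdf_url = tag["src"]
    if pdf_url.startswith("//"):
        pdf_url = "https:" + pdf_url  # protocol-relative links are common
    pdf = requests.get(pdf_url, timeout=60)
    pdf.raise_for_status()
    dest = Path(out_dir) / (re.sub(r"[^\w.-]", "_", doi) + ".pdf")
    dest.write_bytes(pdf.content)
    return dest
```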
@benmuth it might take me a month or more till I'm able to merge this, as I'm working on some paid ArchiveBox projects right now for a client that are taking up most of my time. Don't worry, I'm keen to merge it though; I've been wanting this feature personally for a long time.
No worries! In the meantime maybe I can keep adding to it. I've kept it simple for demonstration purposes, but it seems straightforward to add some of the ideas you linked here, like the PDF parsing. |
@benmuth after thinking about this a bit more, I think it could be fun to release your script as its own standalone PyPI package. Publishing a package is great resume candy and it'll make this tool usable to a much wider audience than just ArchiveBox users. Then we can just add it to ArchiveBox as a dependency. Here's how I'm imagining it could work (starting simple at first, lots of this can wait for v2, v3, etc.):
Here are some examples as inspiration for CLI interface, PyPI packaging setup, README, etc.:
Your tool could even call out to these and others ^ if you want to be an all-in-one paper / research downloader. We can keep brainstorming on names too. What do you think? Does it sound like a project that you'd be excited about? You could own the package yourself and help maintain it, or we could do it under the ArchiveBox org. I think if done well, this tool has the potential to be super popular; I'm sure it'd get a bunch of attention on GitHub/HN/Reddit because there are so many people trying to scrape scientific research for AI training right now.
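To make the proposed interface concrete, here's a minimal, hypothetical sketch of the CLI surface such a package could expose; none of these flags or defaults were agreed on in the thread:

```python
import argparse

def main(argv: list[str] | None = None) -> int:
    parser = argparse.ArgumentParser(
        prog="scihub-dl",
        description="Download academic paper PDFs for DOIs, or for pages containing DOIs.",
    )
    parser.add_argument("targets", nargs="+", help="DOIs and/or page URLs to scan for DOIs")
    parser.add_argument("-o", "--out-dir", default=".", help="directory to save PDFs into")
    parser.add_argument("--mirror", default="https://sci-hub.se", help="Sci-Hub mirror to use")
    parser.add_argument("--delay", type=float, default=5.0, help="seconds to wait between downloads")
    args = parser.parse_args(argv)
    for target in args.targets:
        # Dispatch: bare DOIs start with "10."; anything else is treated as
        # a page URL to scan for DOIs (see the parsing sketch earlier).
        kind = "doi" if target.startswith("10.") else "url"
        print(f"would download {kind}: {target} -> {args.out_dir}")
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```

Usage would then look like `scihub-dl 10.1016/j.cub.2019.11.030 -o papers/`.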
@pirate Yeah, I think that's a great idea, I'd be happy to try to work on this. I think a more comprehensive tool should definitely exist. Thanks for the overview, that's really helpful. As for ownership, I'm not really sure. Maybe I can start it and transfer ownership to ArchiveBox if it grows and the maintenance burden proves too much for whatever reason? I don't feel strongly about it either way. I'll start a repo with what I have so far soon. I think I'll go with scihub-dl for the name.
Is this happening? I would be very interested in using a tool like this. |
New extractor idea: SCIHUB
Take this academic paper for example: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1
If a full paper PDF is available on scihub, e.g. https://sci-hub.se/https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1, it could be downloaded to a
./archive/<timestamp>/scihub/
output folder. We could also look for a DOI number in the page URL or page HTML contents, e.g.:
10.1016/j.cub.2019.11.030
using a regex and try downloading that.

New Dependencies:
New Extractors:
extractors/scihub.py
New Config Options:
SAVE_SCIHUB=True
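To tie the pieces above together, here's a rough, hypothetical shape for extractors/scihub.py. The function signature, the environment-variable config check, and the idea of shelling out to a standalone scihub-dl tool are illustrative guesses, not ArchiveBox's actual extractor API:

```python
import os
import re
import subprocess
from pathlib import Path

SAVE_SCIHUB = os.environ.get("SAVE_SCIHUB", "True") == "True"
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:a-zA-Z0-9]+")

def save_scihub(url: str, page_html: str, out_dir: str) -> list[Path]:
    """Hypothetical extractor: find DOIs in the URL/HTML and fetch their PDFs."""
    if not SAVE_SCIHUB:
        return []
    dest = Path(out_dir) / "scihub"  # i.e. ./archive/<timestamp>/scihub/
    dest.mkdir(parents=True, exist_ok=True)
    dois = set(DOI_RE.findall(url)) | set(DOI_RE.findall(page_html))
    for doi in dois:
        # Delegate the actual download to the standalone scihub-dl CLI
        # discussed above (assumed interface, not a published package).
        # check=False: tolerate per-DOI failures without aborting the loop.
        subprocess.run(["scihub-dl", "--out-dir", str(dest), doi], check=False)
    return sorted(dest.glob("*.pdf"))
```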