Not all open access articles have "full text" on the web (eg. arXiv) #6

Open
imrehg opened this issue Jun 23, 2014 · 2 comments
@imrehg
Contributor

imrehg commented Jun 23, 2014

arXiv is a large collection of preprint papers that is still often cited. All of its articles are PDFs, so the web page shows only the abstract, not the full text. The papers are still open access and would be good to scrape, especially because the full text is not actually stored in the search anyway.

I could imagine hacking the fulltext field in the arXiv extractor so that it exceeds the 1000-character limit that is currently the heuristic for detecting open access. Is that acceptable in this case?

I'm almost finished with the relevant extractor, just want to check this aspect.

@jure
Member

jure commented Jun 23, 2014

Hacking the fulltext field to be longer is not a good idea, I think. However, the 1000 chars limit is a rather crude indication of whether an article is indexable or not. I can imagine an "indexable" method that returns true or false, based on a number of things. Perhaps indexable can also be a part of the rule set.
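
A minimal sketch of what such an "indexable" check could look like, written in Python purely for illustration. The `Article` class, the rule functions, and `is_indexable` are all hypothetical names, not existing code in this project, and treating the limit as "more than 1000 characters" is an assumption:

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Article:
    """Hypothetical stand-in for whatever an extractor returns."""
    source: str                 # e.g. "arxiv"
    fulltext: str = ""
    abstract: str = ""
    open_access: bool = False   # set by the extractor, not guessed from length


# A rule takes an article and says whether it alone makes the article indexable.
Rule = Callable[[Article], bool]

# The current heuristic from this thread; the exact comparison is an assumption.
MIN_FULLTEXT_CHARS = 1000


def has_long_fulltext(article: Article) -> bool:
    """The existing crude check: enough full text was scraped."""
    return len(article.fulltext) > MIN_FULLTEXT_CHARS


def is_open_access_with_abstract(article: Article) -> bool:
    """Covers sources like arXiv: open access, but only the abstract is on the web."""
    return article.open_access and bool(article.abstract.strip())


DEFAULT_RULES: Sequence[Rule] = (has_long_fulltext, is_open_access_with_abstract)


def is_indexable(article: Article, rules: Sequence[Rule] = DEFAULT_RULES) -> bool:
    """Return True if any rule in the rule set accepts the article."""
    return any(rule(article) for rule in rules)
```

With something like this, an arXiv extractor would set `open_access=True` and leave `fulltext` alone, so nothing has to be padded past the character limit.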

@imrehg
Contributor Author

imrehg commented Jun 24, 2014

I'll have to think about the "indexable" method; I like the current simplicity of the extractors. Scraping is always less straightforward than using an API, unfortunately.

Also, arXiv papers don't have a DOI but use their own identifier, which would have to be taken into account, I guess...

So all in all maybe a more robust document model is needed, so such open resources could be included too?
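
One possible shape for such a document model, again as a Python sketch only. `Identifier`, `Document`, and `key` are hypothetical names; the idea is just that the identifier carries its own scheme, so a DOI is no longer assumed:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    """An identifier together with the scheme it belongs to."""
    scheme: str  # e.g. "doi" or "arxiv"
    value: str   # the identifier itself, kept exactly as the source gives it


@dataclass
class Document:
    """Hypothetical document model that does not require a DOI."""
    identifier: Identifier
    title: str
    abstract: str = ""
    fulltext: Optional[str] = None  # None when only the abstract is on the web
    open_access: bool = False

    def key(self) -> str:
        """Storage/lookup key that stays unique across identifier schemes."""
        return f"{self.identifier.scheme}:{self.identifier.value}"
```

An arXiv extractor would then fill in `Identifier("arxiv", ...)` with the paper's arXiv reference, while the existing extractors keep using `Identifier("doi", ...)`.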
