arXiv is a large collection of preprint papers, and it is still cited often. The articles themselves are PDFs, so the web page for each one carries only the abstract, not the full text. They are still open access and would be good to scrape, especially because the full text is not actually stored in the search anyway.
I could imagine hacking the fulltext field in the arXiv extractor so that it goes over the 1000-character limit, which is currently the heuristic for detecting open access. Is that acceptable in this case?
I'm almost finished with the relevant extractor and just want to check this aspect.
Hacking the fulltext field to be longer is not a good idea, I think. However, the 1000-character limit is a rather crude indication of whether an article is indexable or not. I can imagine an "indexable" method that returns true or false based on a number of things. Perhaps indexable could also be part of the rule set.
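To make the "indexable" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Document` class, the `open_access` flag, and the `indexable` function are hypothetical names, not the project's actual code, and the threshold only mirrors the 1000-character heuristic mentioned above.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed value, taken from the 1000-character heuristic discussed in this thread.
MIN_FULLTEXT_CHARS = 1000


@dataclass
class Document:
    """Hypothetical document record produced by an extractor."""
    fulltext: str = ""
    # Hypothetical flag: an extractor (or a rule) may state open access explicitly.
    # None means "unknown, fall back to the length heuristic".
    open_access: Optional[bool] = None


def indexable(doc: Document) -> bool:
    """Return True if the document should be indexed."""
    # An explicit statement from the extractor wins over the heuristic.
    if doc.open_access is not None:
        return doc.open_access
    # Fallback: the crude length check that is used today.
    return len(doc.fulltext) > MIN_FULLTEXT_CHARS
```

With something like this, an arXiv extractor could set `open_access=True` for an abstract-only page instead of padding the fulltext field, and the extractors that set nothing would keep today's behaviour.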
I'll have to think about the "indexable" method; I like the current simplicity of the extractors. Scraping is always less straightforward than using an API, unfortunately.
Also, arXiv papers don't have a DOI but have their own reference, which would have to be taken into account, I guess...
So all in all, maybe a more robust document model is needed so that open resources like this can be included too?
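One way such a document model could handle both kinds of reference is to store an identifier together with its scheme. The sketch below is only an illustration: `Identifier` and `parse_identifier` are hypothetical names, the DOI pattern is a common loose approximation rather than a full validator, and only new-style arXiv IDs (such as `1706.03762`) are recognised, not the older `archive/number` form.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Identifier:
    """Hypothetical identifier: a scheme name plus the value within that scheme."""
    scheme: str  # "doi" or "arxiv"
    value: str


# New-style arXiv IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Loose DOI shape: "10.", a registrant code, a slash, then a suffix.
DOI = re.compile(r"^10\.\d{4,9}/\S+$")


def parse_identifier(raw: str) -> Identifier:
    """Classify a raw reference string as an arXiv ID or a DOI."""
    text = raw.strip()
    # Accept the usual "arXiv:" prefix, case-insensitively.
    if text.lower().startswith("arxiv:"):
        text = text[len("arxiv:"):]
    if ARXIV_NEW.match(text):
        return Identifier("arxiv", text)
    if DOI.match(text):
        return Identifier("doi", text)
    raise ValueError(f"unrecognised identifier: {raw!r}")
```

For example, `parse_identifier("arXiv:1706.03762v5")` yields an `Identifier` with scheme `"arxiv"`, while a string shaped like `10.1000/xyz123` is classified as `"doi"`. Lookups and deduplication could then key on the `(scheme, value)` pair instead of assuming that every document has a DOI.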