Not all open access articles have "full text" on the web (eg. arXiv) #6

imrehg · 2014-06-23T03:32:16Z

arXiv is a large collection of pre-print papers, and it is still often cited. All of its articles are PDFs, so the web page for each one contains only the abstract, not the full text. The articles are still open access and would be good to scrape, especially because the full text is not actually stored in the search anyway.

I could imagine hacking the fulltext field in the arXiv extractor so that it goes over the 1000-character limit that is currently the heuristic for detecting open access. Is that acceptable in this case?

I'm almost finished with the relevant extractor, just want to check this aspect.

jure · 2014-06-23T05:24:02Z

Hacking the fulltext field to be longer is not a good idea, I think. However, the 1000 chars limit is a rather crude indication of whether an article is indexable or not. I can imagine an "indexable" method that returns true or false, based on a number of things. Perhaps indexable can also be a part of the rule set.
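To make the idea concrete, here is a minimal sketch of what such an "indexable" check could look like. It is written in Python purely for illustration; the `Article` fields, the `OPEN_ACCESS_SOURCES` set, and the function name are assumptions and do not reflect the project's actual code. Only the 1000-character threshold comes from this thread.

```python
from dataclasses import dataclass

# Threshold from the current heuristic discussed in this thread.
MIN_FULLTEXT_CHARS = 1000

# Hypothetical rule set: sources treated as open access even when
# only the abstract is available on the web.
OPEN_ACCESS_SOURCES = {"arxiv"}


@dataclass
class Article:
    # All field names here are assumptions made for this sketch.
    source: str
    fulltext: str = ""
    abstract: str = ""


def indexable(article: Article) -> bool:
    """Return True if the article should be indexed."""
    if article.source.lower() in OPEN_ACCESS_SOURCES:
        # Known open access source: any non-empty text is enough.
        return bool(article.abstract.strip() or article.fulltext.strip())
    # Fallback: the existing length heuristic on the full text.
    return len(article.fulltext) > MIN_FULLTEXT_CHARS
```

With something like this, the arXiv extractor could leave the fulltext field untouched and the decision would live in one place, which could later be driven by the rule set instead of a hard-coded set.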

imrehg · 2014-06-24T07:48:50Z

I'll have to think about the "indexable" method; I like the current simplicity of the extractors. Scraping is always less straightforward than using APIs, unfortunately.

Also, arXiv papers don't have a DOI but have their own reference, which would have to be taken into account, I guess...

So all in all maybe a more robust document model is needed, so such open resources could be included too?
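As a rough illustration of what a more flexible document model might carry, the sketch below tags each identifier with its scheme, so that a document can be keyed by either a DOI or an arXiv ID. The `Identifier` type, the `parse_identifier` helper, and the regular expressions are all assumptions for this example. The patterns only approximate the common DOI and arXiv ID shapes and are not a full validation.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Approximate patterns, assumed for illustration only.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
# New-style arXiv IDs, e.g. 1406.1234 or 1406.1234v2
ARXIV_NEW_RE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style arXiv IDs, e.g. hep-th/9901001 or math.GT/0309136
ARXIV_OLD_RE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


@dataclass(frozen=True)
class Identifier:
    scheme: str  # "doi" or "arxiv"
    value: str


def parse_identifier(raw: str) -> Optional[Identifier]:
    """Classify a raw string as a DOI or an arXiv ID, or return None."""
    value = raw.strip()
    # Drop an optional "arXiv:" prefix before matching.
    if value.lower().startswith("arxiv:"):
        value = value[len("arxiv:"):]
    if DOI_RE.match(value):
        return Identifier("doi", value)
    if ARXIV_NEW_RE.match(value) or ARXIV_OLD_RE.match(value):
        return Identifier("arxiv", value)
    return None
```

A document record could then store an `Identifier` instead of a bare DOI string, and anything that looks documents up by DOI would switch on `scheme`.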
