In [1]:
from arxiv import Paper, arXiv, inSPIRE

%load_ext autoreload
%autoreload 2

# Harvest papers on arXiv

## Get papers (that were modified) in a date range 

In order to scrape arxiv.org, we first need to instantiate a `Paper` object, which specifies the date range (via the `from_` and `to_` arguments) and the category (via the `set_` argument). For example:

In [29]:
paper = Paper(from_="2010-01-01",
              to_="2010-02-01",
              set_="physics:astro-ph",
             )

In [30]:
paper

Papers from 2010-01-01 to 2010-02-01 in the physics:astro-ph category

By passing this object to an instance of `arXiv`, we start harvesting the records:

In [31]:
arxiv = arXiv(paper)

requesting records from 2010-01-01 to 2010-02-01 in the physics:astro-ph category

request url: http://export.arxiv.org/oai2?verb=ListRecords&from=2010-01-01&until=2010-02-01&metadataPrefix=arXiv&set=physics:astro-ph

response ok? : True

No resumption tokens found...

1
bowls = 646
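
The request URL and the resumption-token message in the log above follow the OAI-PMH `ListRecords` pattern. As a minimal sketch (not the package's actual implementation), the URL can be rebuilt and a response checked for a resumption token with the standard library alone. The helper names `build_request_url` and `find_resumption_token` are hypothetical and only serve this illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_BASE = "http://export.arxiv.org/oai2"          # base URL copied from the log above
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def build_request_url(from_, to_, set_):
    """Hypothetical helper: assemble a ListRecords request URL."""
    params = {
        "verb": "ListRecords",
        "from": from_,
        "until": to_,
        "metadataPrefix": "arXiv",
        "set": set_,
    }
    # safe=":" keeps the colon in "physics:astro-ph" unescaped, as in the log
    return f"{OAI_BASE}?{urlencode(params, safe=':')}"


def find_resumption_token(xml_text):
    """Hypothetical helper: return the resumption token, or None if absent or empty."""
    root = ET.fromstring(xml_text)
    node = root.find(f".//{OAI_NS}resumptionToken")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


url = build_request_url("2010-01-01", "2010-02-01", "physics:astro-ph")
print(url)
```

A non-empty token means the server has more records to send; an absent or empty one corresponds to the "No resumption tokens found..." message.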


In [32]:
paper.pile.head()

Unnamed: 0,id,title,abstract,author,setSpec,categories,created,updated,comments,doi,datestamp,n_authors
0,704.2196,Absolute Calibration and Characterization of t...,The absolute calibration and characterization ...,"[Gordon Karl D., Engelbracht Charles W., Fadda...",physics:astro-ph,[astro-ph],2007-04-17,NaT,"19 pages, PASP, in press",10.1086/522675,2010-01-07,22
1,704.3441,Anisotropic Locations of Satellite Galaxies: C...,We investigate the locations of the satellites...,"[Agustsson Ingolfur, Brainerd Tereasa G.]",physics:astro-ph,[astro-ph],2007-04-25,2009-12-03,"43 pages, 13 figures, ApJ in press",10.1088/0004-637X/709/2/1321,2010-01-21,2
2,704.3704,Multimodal nested sampling: an efficient and r...,In performing a Bayesian analysis of astronomi...,"[Feroz Farhan, Hobson M. P.]",physics:astro-ph,[astro-ph],2007-04-27,2007-07-23,"14 pages, 11 figures, submitted to MNRAS, some...",10.1111/j.1365-2966.2007.12353.x,2010-01-11,2
3,704.3839,Multiband optical surface brightness profile d...,We present preliminary results of the Johnson-...,"[Mihov Boyko, Slavcheva-Mihova Lyuba]",physics:astro-ph,[astro-ph],2007-04-29,2010-02-01,A poster presented at the scientific conferenc...,,2010-02-01,2
4,706.0148,The radio properties of type II quasars,Quasars (of type I) are the luminous analogs o...,"[Lal Dharam Vir, Ho Luis C.]",physics:astro-ph,[astro-ph],2007-06-01,NaT,"4 pages, 3 figures, To appear in ""The Central ...",,2010-01-22,2


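Note that the `from_`/`to_` date range does not select on the creation or update date of a paper. It filters on the record's `datestamp`, the date on which the record was last modified in the repository, as the ranges below show: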

In [33]:
print(f"'datestamps' range between: {paper.pile.datestamp.min()} and {paper.pile.datestamp.max()}")

'datestamps' range between: 2010-01-02 00:00:00 and 2010-02-01 00:00:00


In [34]:
print(f"'created' range between: {paper.pile.created.min()} and {paper.pile.created.max()}")

'created' range between: 1993-06-10 00:00:00 and 2010-01-29 00:00:00


In [35]:
print(f"'updated' range between: {paper.pile.updated.min()} and {paper.pile.updated.max()}")

'updated' range between: 1994-10-27 00:00:00 and 2010-02-01 00:00:00
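
To make the difference between the three date columns concrete, here is a small sketch that uses only the five rows displayed by `paper.pile.head()` above and counts, per column, how many dates fall inside the requested window. It uses plain `datetime.date` objects rather than the DataFrame, and `None` stands in for `NaT`; the helper `in_window` is a hypothetical name.

```python
from datetime import date

# Requested window, as passed to Paper(from_=..., to_=...)
window_start, window_end = date(2010, 1, 1), date(2010, 2, 1)

# The five rows shown by paper.pile.head(); None stands in for NaT
rows = [
    {"id": "704.2196", "created": date(2007, 4, 17), "updated": None,
     "datestamp": date(2010, 1, 7)},
    {"id": "704.3441", "created": date(2007, 4, 25), "updated": date(2009, 12, 3),
     "datestamp": date(2010, 1, 21)},
    {"id": "704.3704", "created": date(2007, 4, 27), "updated": date(2007, 7, 23),
     "datestamp": date(2010, 1, 11)},
    {"id": "704.3839", "created": date(2007, 4, 29), "updated": date(2010, 2, 1),
     "datestamp": date(2010, 2, 1)},
    {"id": "706.0148", "created": date(2007, 6, 1), "updated": None,
     "datestamp": date(2010, 1, 22)},
]


def in_window(day, start, end):
    """True if `day` is set and lies within [start, end] (inclusive)."""
    return day is not None and start <= day <= end


# Count, per date column, how many of the five records fall inside the window
counts = {
    field: sum(in_window(row[field], window_start, window_end) for row in rows)
    for field in ("created", "updated", "datestamp")
}
print(counts)
```

For these five rows, `datestamp` is the only column that lies inside the window for every record.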


## Save to file

We can save the harvested data with `save_to_file`:

In [36]:
paper.save_to_file()

pile saved to 
set=physics:astro-ph-from=2010-01-01-to=2010-02-01.csv


By default, a CSV file is saved in the current directory under the following name:

In [37]:
paper.get_file_name()

'set=physics:astro-ph-from=2010-01-01-to=2010-02-01.csv'
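
The name above appears to be composed from the three `Paper` arguments. The following sketch reproduces that pattern; it is inferred from the printed file name only, and `build_file_name` is a hypothetical helper, not the package's `get_file_name` method.

```python
def build_file_name(set_, from_, to_, extension="csv"):
    """Hypothetical helper: mimic the naming pattern seen in the output above."""
    return f"set={set_}-from={from_}-to={to_}.{extension}"


file_name = build_file_name("physics:astro-ph", "2010-01-01", "2010-02-01")
print(file_name)
```

Note that the colon in the set name ends up in the file name, which some file systems (for example on Windows) do not accept.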

## Get today's papers

If no dates are passed to `Paper`, the date range defaults to the most recent day (from yesterday to today, as the request below shows). The default category for `set_` is `physics:astro-ph`.

In [38]:
paper = Paper()
arxiv = arXiv(paper)

requesting records from 2019-10-06 to 2019-10-07 in the physics:astro-ph category

request url: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-10-06&until=2019-10-07&metadataPrefix=arXiv&set=physics:astro-ph

response ok? : True

No resumption tokens found...

1
bowls = 77
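
The default window in the log above runs from the previous day to the current day. A minimal sketch of how such a default could be computed is shown here; `default_date_range` is a hypothetical helper, and the one-day offset is an assumption read off the request (2019-10-06 to 2019-10-07).

```python
from datetime import date, timedelta


def default_date_range(today=None):
    """Hypothetical helper: return (from_, to_) as ISO strings, yesterday to today."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    return yesterday.isoformat(), today.isoformat()


# Reproduce the window from the request above by pinning "today"
from_, to_ = default_date_range(today=date(2019, 10, 7))
print(from_, to_)
```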


In [39]:
paper.pile.head()

Unnamed: 0,id,title,abstract,author,setSpec,categories,created,updated,comments,doi,datestamp,n_authors
0,1607.07881,Galaxy Quenching from Cosmic Web Detachment,We propose the Cosmic Web Detachment (CWD) mod...,"[Aragon-Calvo Miguel A., Neyrinck Mark C., Sil...",physics:astro-ph,"[astro-ph.GA, astro-ph.CO]",2016-07-26,2019-07-28,"20 pages, accepted for publication in OJA. Hig...",10.21105/astro.1607.07881,2019-10-07,3
1,1803.1004,Bimodal distribution of short gamma-ray bursts...,"Recently, GRB 170817A was confirmed to be asso...","[Yu Y. B., Li L. B., Li B., Geng J. J., Huang ...",physics:astro-ph,[astro-ph.HE],2018-03-27,2019-10-04,"12 pages, 2 figures, 1 table, accepted by New ...",,2019-10-07,5
2,1810.02581,Search for gravitational waves from a long-liv...,One unanswered question about the binary neutr...,"[The LIGO Scientific Collaboration, the Virgo ...",physics:astro-ph,"[gr-qc, astro-ph.HE]",2018-10-05,2019-10-04,main paper: 9 pages and 3 figures; total with ...,10.3847/1538-4357/ab0f3d,2019-10-07,1141
3,1811.01963,Subaru High-z Exploration of Low-Luminosity Qu...,We present new measurements of the quasar lumi...,"[Matsuoka Yoshiki, Strauss Michael A., Kashika...",physics:astro-ph,[astro-ph.GA],2018-11-05,NaT,Accepted for publication in ApJ,10.3847/1538-4357/aaee7a,2019-10-07,47
4,1811.05159,IAU MDC Meteor Orbits Database -- A Sample of ...,We announce an upgrade of the IAU MDC photogra...,"[Narziev M., Chebotarev R. P., Jopek T. J., Ne...",physics:astro-ph,[astro-ph.EP],2018-11-13,2019-10-04,"11, pages, 2 figures",,2019-10-07,13


# Harvest number of citations from inSPIRE

The number of citations can be scraped via the `inSPIRE` object. Let's load the papers that we previously scraped from arXiv.

In [2]:
paper = Paper(from_="2010-01-01",
              to_="2010-02-01",
              set_="physics:astro-ph",
             )
paper.load_from_file()

pile loaded from 
set=physics:astro-ph-from=2010-01-01-to=2010-02-01.csv


By passing `paper` to an instance of `inSPIRE` we start scraping the number of citations. The optional argument `n_chunks` parallelizes this process by splitting the papers into chunks. By default `n_chunks` is set to the value returned by `multiprocessing.cpu_count()`.
Let's take the first 100 papers and look up on inSPIRE how many times they have been cited so far.

In [3]:
paper.pile = paper.pile[:100]

In [4]:
inspire = inSPIRE(paper,)

Running...
Splitting the data into 4 chunks...

Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
Still Running...
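
The "Splitting the data into 4 chunks..." step above can be illustrated with a generic split-and-map sketch. This is not how `inSPIRE` is implemented: it swaps in `concurrent.futures.ThreadPoolExecutor` for a process pool to keep the example self-contained, and `lookup_citations` is a placeholder that makes no network request. All function names here are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split `items` into `n_chunks` contiguous, nearly equal parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(items[start:end])
        start = end
    return chunks


def lookup_citations(chunk):
    """Placeholder worker: returns a dummy value per paper id (no network access)."""
    return [len(paper_id) for paper_id in chunk]


def count_in_parallel(paper_ids, n_chunks=None):
    """Process each chunk in its own worker and return results in input order."""
    n_chunks = n_chunks or os.cpu_count() or 1
    chunks = split_into_chunks(paper_ids, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        per_chunk = pool.map(lookup_citations, chunks)  # map preserves chunk order
    return [value for part in per_chunk for value in part]


ids = [f"704.{i:04d}" for i in range(100)]
results = count_in_parallel(ids, n_chunks=4)
print(len(results))
```

Because the chunks are contiguous and `map` preserves their order, the results line up with the input rows and can be written back as a single column.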


In [14]:
paper.pile["n_citations"].head()

0    117
1     53
2    587
3      0
4      1
Name: n_citations, dtype: int64

**Voila!**