# arXiv API

The arXiv API allows programmatic access to arXiv's e-print content and metadata. "The goal of the interface is to facilitate new and creative use of the vast body of material on the arXiv by providing a low barrier to entry for application developers." https://arxiv.org/help/api

The API's user manual (https://arxiv.org/help/api/user-manual) provides helpful documentation for using the API and retrieving article metadata.

Our examples below will introduce you to the basics of querying the arXiv API.

## Import Packages

In [None]:
import urllib
import arxiv
import requests
import json
import csv
import pandas as pd
from collections import Counter, defaultdict
import numpy as np # for array manipulation
import matplotlib.pyplot as plt # for data visualization
%matplotlib inline 
import datetime

## Query the API

Perform a simple query for "graphene." We'll limit results to the titles of the 10 most recent papers. 

In [None]:
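search = arxiv.Search(
  query="graphene",
  max_results=10,
  sort_by=arxiv.SortCriterion.SubmittedDate
)

# Fetch results through a client; Search.results() is deprecated in recent releases of the arxiv package.
client = arxiv.Client()
for result in client.results(search):
  print(result.title)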
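The `arxiv` package wraps the HTTP query interface described in the API user manual. To make that concrete, here is a minimal sketch that builds (but does not send) a query URL by hand. The endpoint and the parameter names `search_query`, `start`, `max_results`, `sortBy`, and `sortOrder` come from the user manual; the helper `build_query_url` is a hypothetical name introduced for illustration, and the `all:` field prefix is one of several the manual describes.

```python
from urllib.parse import urlencode

# Query endpoint described in the arXiv API user manual.
ARXIV_QUERY_ENDPOINT = "http://export.arxiv.org/api/query"

def build_query_url(search_query, start=0, max_results=10,
                    sort_by="submittedDate", sort_order="descending"):
    """Build (but do not send) a URL for the arXiv API query endpoint.

    Hypothetical helper for illustration; not part of the arxiv package.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # urlencode percent-encodes special characters such as ":".
    return f"{ARXIV_QUERY_ENDPOINT}?{urlencode(params)}"

graphene_url = build_query_url("all:graphene")
print(graphene_url)
```

Fetching that URL (for example with `requests.get`) should return an Atom XML feed; the `arxiv` package takes care of the request, paging, and parsing for you.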

Do another query for the topic "quantum dots," but note that you could swap in a topic of your liking.

You can define a custom arXiv API client with specialized pagination behavior. This time we'll process each paper as it's fetched rather than exhausting the result-generator into a `list`; this is useful for running analysis while the client sleeps.

Because this `arxiv.Search` doesn't bound the number of results with `max_results`, it will fetch *every* matching paper (roughly 10,000). This may take several minutes.

In [None]:
results_generator = arxiv.Client(
  page_size=1000,
  delay_seconds=3,
  num_retries=3
).results(arxiv.Search(
  query='"quantum dots"',
  id_list=[],
  sort_by=arxiv.SortCriterion.Relevance,
  sort_order=arxiv.SortOrder.Descending,
))

quantum_dots = []
for paper in results_generator:
  # You could do per-paper analysis here; for now, just collect them in a list.
  quantum_dots.append(paper)
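If you only want a quick sample while experimenting, you can stop consuming the generator early rather than waiting for every page. The following is a minimal sketch using `itertools.islice`; `take` and `fake_results` are hypothetical names, and `fake_results` stands in for the client's generator so the example runs offline.

```python
from itertools import islice

def take(results, limit):
    """Return at most `limit` items from an iterable, consuming no more than that."""
    return list(islice(results, limit))

# Stand-in for the client's result generator so this sketch runs without network access.
def fake_results():
    n = 0
    while True:
        n += 1
        yield f"paper-{n}"

sample = take(fake_results(), 5)
print(sample)
```

With the real client you would pass `results_generator` in place of `fake_results()`. Passing `max_results` to `arxiv.Search`, as in the graphene query, is the other way to bound a search.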

## Organize and analyze your results

Create a dataframe to better analyze your results. This example uses Python's [`vars`](https://docs.python.org/3/library/functions.html#vars) built-in function to convert search results into Python dictionaries of paper metadata.

In [None]:
qd_df = pd.DataFrame([vars(paper) for paper in quantum_dots])
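To see what `vars` contributes, here is a small offline sketch. `FakePaper` is a hypothetical stand-in for a search result (it is not part of the `arxiv` package) with only two attributes; each attribute becomes a dictionary key and then a dataframe column.

```python
import pandas as pd

class FakePaper:
    """Hypothetical stand-in for a search result; not part of the arxiv package."""
    def __init__(self, title, published):
        self.title = title
        self.published = published

fake_papers = [
    FakePaper("Paper A", "2020-01-01"),
    FakePaper("Paper B", "2021-06-15"),
]

# vars(obj) returns the object's attribute dictionary (its __dict__).
rows = [vars(p) for p in fake_papers]
demo_df = pd.DataFrame(rows)
print(list(demo_df.columns))
```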

We'll look at the first 10 results.

In [None]:
qd_df.head(10)

Next, we'll create a list of all the columns in the dataframe to see what else is there:

In [None]:
list(qd_df)

We have 14 columns overall. We'll add two derived columns (the name of the first listed author and a reference to the original `arxiv.Result` object), then narrow the dataframe to paper titles, `published` dates, and first authors to run some analysis of publishing patterns over time.

In [None]:
# Add a first_author column: the name of the first author among each paper's list of authors.
qd_df['first_author'] = [authors_list[0].name for authors_list in qd_df['authors']]
# Keep a reference to the original results in the dataframe: this is useful for downloading PDFs.
qd_df['_result'] = quantum_dots

# Narrow our dataframe to just the columns we want for our analysis.
qd_df = qd_df[['title', 'published', 'first_author', '_result']]
qd_df

## Visualize your results

Get a sense of how your topic has trended over time. When did research on your topic take off? Create a bar chart of the number of articles published in each year.

In [None]:
qd_df["published"].groupby(qd_df["published"].dt.year).count().plot(kind="bar")
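One thing to watch for: a year with no papers simply does not appear in the grouped counts, so the bar chart can silently skip it. A minimal sketch of filling those gaps, using made-up dates rather than real search results (the variable names here are illustrative):

```python
import pandas as pd

# Toy publication dates (made up for illustration), with no papers in 2021.
published = pd.Series(pd.to_datetime([
    "2019-03-01", "2019-11-20", "2020-05-05", "2022-01-10",
], utc=True))

# Same grouping as above: count papers per calendar year.
per_year = published.groupby(published.dt.year).count()

# Reindex over the full span of years so missing years show up as zero.
all_years = range(per_year.index.min(), per_year.index.max() + 1)
per_year_filled = per_year.reindex(all_years, fill_value=0)
print(per_year_filled)
```

Calling `.plot(kind="bar")` on `per_year_filled` would then draw an empty slot for 2021 instead of omitting it.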

Explore authors to see who is publishing on your topic. Group by first author, then sort and select the top 20 authors.

In [None]:
qd_authors = qd_df.groupby(qd_df["first_author"])["first_author"].count().sort_values(ascending=False)
qd_authors.head(20)
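Counting only first authors undercounts people who mostly appear later in the author list. As a rough sketch of counting every listed author instead, the example below uses `collections.Counter` on made-up author lists; with real results you would take each name from the `name` attribute of the entries in the `authors` column.

```python
from collections import Counter

# Made-up author lists, one per paper, standing in for the `authors` column.
author_lists = [
    ["A. Author", "B. Author"],
    ["B. Author", "C. Author"],
    ["B. Author"],
]

# Flatten the per-paper lists and tally how often each name appears.
all_author_counts = Counter(name for authors in author_lists for name in authors)
print(all_author_counts.most_common(2))
```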

## Identify and download papers

Let's download the oldest paper about quantum dots with Piotr Trocha as first author:

In [None]:
qd_Trocha_sorted = qd_df[qd_df['first_author'] == 'Piotr Trocha'].sort_values('published')
qd_Trocha_sorted

In [None]:
# Use the arxiv.Result object stored in the _result column to trigger a PDF download.
qd_Trocha_oldest = qd_Trocha_sorted.iloc[0]
qd_Trocha_oldest['_result'].download_pdf()

Confirm that the PDF has downloaded!
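You can also check programmatically. The sketch below defines a hypothetical helper, `looks_like_pdf`, that tests whether a file exists and starts with the `%PDF-` signature that PDF files begin with. It is demonstrated on a temporary stand-in file; for a real download you would pass the path where the PDF was saved (this assumes you know that path, for example from the return value of `download_pdf`, which is worth confirming against the package documentation).

```python
import tempfile
from pathlib import Path

def looks_like_pdf(path):
    """Return True if `path` is an existing file that begins with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        return f.read(5) == b"%PDF-"

# Demonstrate on a temporary stand-in file rather than a real download.
with tempfile.TemporaryDirectory() as tmp:
    fake_pdf = Path(tmp) / "demo.pdf"
    fake_pdf.write_bytes(b"%PDF-1.5\n")
    demo_ok = looks_like_pdf(fake_pdf)
    demo_missing = looks_like_pdf(Path(tmp) / "missing.pdf")

print(demo_ok, demo_missing)
```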

## Bibliography

- Tim Head: https://betatim.github.io/posts/analysing-the-arxiv/
- Lukas Schwab: https://github.com/lukasschwab/arxiv.py
- arXiv API user manual: https://arxiv.org/help/api/user-manual