<a href="https://colab.research.google.com/github/MJMortensonWarwick/large_scale_data_for_research/blob/main/extracting_data_from_the_arXiv_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Data from the arXiv API
In this tutorial we will extract data from an API. To keep things simple we will use an API that doesn't require the developer (i.e. you) to hold an API key, although (as per the lecture) most important APIs do have this requirement. To keep things relevant we will use the API of arXiv, one of the biggest academic databases for machine learning (as well as other computer science and general science topics).

In fact, as is the case for many useful APIs, there is a Python package designed to interact with the arXiv API, imaginatively named arxiv. In the real world it would make more sense to use this package, but to gain a slightly fuller experience we will access the API by writing our own code.

Let's begin with the example straight out of the arXiv docs:

In [1]:
import urllib.request as libreq

# Send a GET request to the arXiv API and read the raw response body (bytes)
with libreq.urlopen('http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1') as response:
    r = response.read()
    print(r)

b'<?xml version="1.0" encoding="UTF-8"?>\n<feed xmlns="http://www.w3.org/2005/Atom">\n  <link href="http://arxiv.org/api/query?search_query%3Dall%3Aelectron%26id_list%3D%26start%3D0%26max_results%3D1" rel="self" type="application/atom+xml"/>\n  <title type="html">ArXiv Query: search_query=all:electron&amp;id_list=&amp;start=0&amp;max_results=1</title>\n  <id>http://arxiv.org/api/cHxbiOdZaP56ODnBPIenZhzg5f8</id>\n  <updated>2024-01-22T00:00:00-05:00</updated>\n  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">205411</opensearch:totalResults>\n  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>\n  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:itemsPerPage>\n  <entry>\n    <id>http://arxiv.org/abs/cond-mat/0102536v1</id>\n    <updated>2001-02-28T20:12:09Z</updated>\n    <published>2001-02-28T20:12:09Z</published>\n    <title>Impact of Electron-Electron C

The key part of our request is the URL we open:

'http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=1'

This includes all the key elements of our request.

We are passing a "search query" (as opposed to requesting a specific article ID). This is essentially the same as typing a keyword into the website's search engine. In the example our keyword is "electron" and the "all:" prefix searches all fields (author, title, abstract/summary, etc.). We then pass two further parameters. The first is "start=0" (i.e. start at the first result rather than at an offset) and the second, "max_results=1", says we want only one result. You can read more about the different options in the arXiv API documentation.
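As a minimal sketch of how these three parameters fit together, the URL can also be assembled from its parts rather than typed by hand. The helper name `build_query_url` and the constant `BASE_URL` are illustrative names introduced here and are not part of the arXiv docs; the encoding is done with the standard library's `urllib.parse.urlencode`.

```python
from urllib.parse import urlencode, quote

# Base endpoint taken from the example request above
BASE_URL = 'http://export.arxiv.org/api/query'

def build_query_url(search_query, start=0, max_results=1):
    """Assemble an arXiv API query URL (hypothetical helper for illustration)."""
    params = {
        'search_query': search_query,  # e.g. 'all:electron'
        'start': start,                # offset of the first result to return
        'max_results': max_results,    # how many results to return
    }
    # quote_via=quote encodes spaces as %20; safe=':' leaves the field prefix readable
    return BASE_URL + '?' + urlencode(params, safe=':', quote_via=quote)

print(build_query_url('all:electron', start=0, max_results=1))
```

Building the URL this way makes it easy to change one parameter at a time, for example raising `start` to page through a larger result set.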

Given this, let's modify our code to get a slightly different result:

In [2]:
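import urllib.request as libreq

# Same request as before, but with a two-word search term (%20 encodes the space) and two results
with libreq.urlopen('http://export.arxiv.org/api/query?search_query=all:big%20data&start=0&max_results=2') as response:
    r = response.read()
    print(r)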

b'<?xml version="1.0" encoding="UTF-8"?>\n<feed xmlns="http://www.w3.org/2005/Atom">\n  <link href="http://arxiv.org/api/query?search_query%3Dall%3Abig%20data%26id_list%3D%26start%3D0%26max_results%3D2" rel="self" type="application/atom+xml"/>\n  <title type="html">ArXiv Query: search_query=all:big data&amp;id_list=&amp;start=0&amp;max_results=2</title>\n  <id>http://arxiv.org/api/IUokeS+hVBNCJX8CzzelXRfmcnA</id>\n  <updated>2024-01-22T00:00:00-05:00</updated>\n  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">496292</opensearch:totalResults>\n  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>\n  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">2</opensearch:itemsPerPage>\n  <entry>\n    <id>http://arxiv.org/abs/2012.09109v1</id>\n    <updated>2020-12-15T16:18:52Z</updated>\n    <published>2020-12-15T16:18:52Z</published>\n    <title>Big Data</title>\n    <summary>  

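Here we have changed the search term to "big data". Note that we use the standard URL encoding for the space between the words, which is "%20". Be aware that the API may treat the two words as separate search terms; the arXiv documentation describes wrapping a phrase in double quotes (encoded as "%22") to match it as a whole. We have also requested two results ("max_results=2") rather than one. Now that we have some data, we can use tools such as BeautifulSoup to extract the information we need (see the previous tutorial).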
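As one possible illustration of that extraction step, the sketch below uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, purely so that it runs without extra installs. The `sample` bytes are a trimmed-down stand-in built from fields visible in the response above, and `parse_entries` is a hypothetical helper name; in practice you would pass it the `r` returned by the request.

```python
import xml.etree.ElementTree as ET

# Atom namespace, as declared on the <feed> element in the response above
ATOM = {'atom': 'http://www.w3.org/2005/Atom'}

# Trimmed sample assembled from fields shown in the output above (not a full response)
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2012.09109v1</id>
    <updated>2020-12-15T16:18:52Z</updated>
    <published>2020-12-15T16:18:52Z</published>
    <title>Big Data</title>
  </entry>
</feed>"""

def parse_entries(xml_bytes):
    """Return a list of dicts, one per <entry> in an Atom feed (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    entries = []
    for entry in root.findall('atom:entry', ATOM):
        entries.append({
            'id': entry.findtext('atom:id', default='', namespaces=ATOM),
            'title': entry.findtext('atom:title', default='', namespaces=ATOM),
            'published': entry.findtext('atom:published', default='', namespaces=ATOM),
        })
    return entries

for item in parse_entries(sample):
    print(item['published'], item['title'], item['id'])
```

The namespace mapping matters here: every tag in the feed lives in the Atom namespace, so a plain search for `entry` without it would find nothing.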