Early documentation
Fixes part of #5
Structure:

- Installation
- Tutorials
  - Tutorial 1: Retrieve a single article
  - Tutorial 2: Retrieve an article from various APIs
  - Tutorial 3: Retrieving a large number of articles from different APIs
- Reference
  - list of apis
  - results set
Nikoleta-v3 committed Mar 26, 2017
1 parent 08998b5 commit dc39af3
Showing 17 changed files with 268 additions and 26 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -11,7 +11,7 @@ result.json
*arcas.egg-info*
build/
dist/
-docs/
+docs/_build
.hypothesis/
results.json
Notes.ipynb
14 changes: 3 additions & 11 deletions README.md
@@ -3,20 +3,12 @@ Status](https://travis-ci.org/Nikoleta-v3/Arcas.svg?branch=master)](https://trav

# Arcas

-Arcas is python tool designed to help scraping APIs for academic articles.
-Currently it supports the following APIs:
-- IEEE
-- springer
-- arXiv
-- nature
-
-A more analytic list with various APIS can be found here: http://guides.lib
-.berkeley.edu/information-studies/apis.
-
+Arcas is a Python tool designed to help with collecting academic articles
+from various APIs.

## Installation

-The easiest way to install is from pypi:
+The easiest way to install is:

```bash
$ pip install arcas
9 changes: 9 additions & 0 deletions docs/Reference/Apis/arxiv.rst
@@ -0,0 +1,9 @@
arXiv
=====

arXiv, hosted at arXiv.org, is a document submission and retrieval system
that is heavily used by the physics, mathematics and computer science
communities.

arXiv is the default API in Arcas. For more information visit
the official site: https://arxiv.org/help/api/user-manual#Architecture.
9 changes: 9 additions & 0 deletions docs/Reference/Apis/ieee.rst
@@ -0,0 +1,9 @@
IEEE Xplore
============

Query the Institute of Electrical and Electronics Engineers content
repository and retrieve results for manipulation and presentation on local
web interfaces.

For more information on IEEE Xplore visit the official site:
http://ieeexplore.ieee.org/gateway/.
13 changes: 13 additions & 0 deletions docs/Reference/Apis/index.rst
@@ -0,0 +1,13 @@
List of available APIs
======================
A list of the APIs you can ping with Arcas.
Contents:

.. toctree::
   :maxdepth: 2

   arxiv.rst
   ieee.rst
   nature.rst
   springer.rst
   plos.rst
8 changes: 8 additions & 0 deletions docs/Reference/Apis/nature.rst
@@ -0,0 +1,8 @@
Nature
======

The nature.com OpenSearch API provides an open, bibliographic search service
for content hosted on nature.com, comprising around half a million news and
research articles and citations.

For more information please visit the official site: http://www.nature.com/developers/documentation/api-references/opensearch-api/.
8 changes: 8 additions & 0 deletions docs/Reference/Apis/plos.rst
@@ -0,0 +1,8 @@
PLOS
====

Query content from the seven open-access peer-reviewed journals from the
Public Library of Science using any of the twenty-three terms in the PLOS Search.

For more information on PLOS Search API visit the official site:
http://api.plos.org/solr/faq/.
10 changes: 10 additions & 0 deletions docs/Reference/Apis/springer.rst
@@ -0,0 +1,10 @@
Springer
========

Springer Open Access API - Provides metadata and full-text content for more than
370,000 online documents from Springer open access xml, including BioMed Central
and SpringerOpen journals.

Note that Springer does not support searching by abstract and requires the
user to register for an API key. For more
information visit the official site: https://dev.springer.com/restfuloperations.
10 changes: 10 additions & 0 deletions docs/Reference/index.rst
@@ -0,0 +1,10 @@
Reference
=========

Contents:

.. toctree::
   :maxdepth: 2

   Apis/index.rst
   results_set.rst
34 changes: 34 additions & 0 deletions docs/Reference/results_set.rst
@@ -0,0 +1,34 @@
.. _results-set:

Results set
===========

Each API returns its own list of metadata for a given article, and this list
differs from one API to another. Arcas is designed to return a similar set of
metadata for any given API. Thus the json results of Arcas have the following
list of metadata:

- :code:`key`
    - A generated key containing an author's name and the publication year (e.g. Glynatsi2017)
- :code:`unique_key`
    - A unique key generated using the `hashlib <https://docs.python.org/2/library/hashlib.html>`_
      python library. The hashable string is created from: [author name, title,
      year, abstract]
- :code:`title`
    - Title of the article
- :code:`author`
    - A single author from the list of authors of the respective article
- :code:`abstract`
    - The abstract of the article
- :code:`date`
    - Date of publication
- :code:`journal`
    - Journal of publication
- :code:`pages`
    - Pages of publication
- :code:`key_word`
    - A single keyword assigned to the article by the given journal
- :code:`provenance`
    - Scholarly database from which the article was collected
- :code:`score`
    - Score given to the article by the given journal
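
As a rough illustration of how the :code:`key` and :code:`unique_key` fields
described above could be derived, here is a minimal sketch. The helper name
``make_keys``, the choice of SHA-256, the plain concatenation of the fields and
the use of the last word of the author's name are all assumptions made for this
example; the actual Arcas implementation may differ.

```python
import hashlib


def make_keys(author, title, year, abstract):
    """Return (key, unique_key) for one article record (illustrative only)."""
    # Readable key: last word of the author's name followed by the year.
    surname = author.split()[-1]
    key = "{}{}".format(surname, year)

    # Unique key: hash of the concatenated identifying fields.
    # SHA-256 and plain concatenation are assumptions, not Arcas's own choice.
    hashable = "".join([author, title, str(year), abstract])
    unique_key = hashlib.sha256(hashable.encode("utf-8")).hexdigest()
    return key, unique_key


key, unique_key = make_keys("Nikoleta Glynatsi", "A title", 2017, "An abstract")
print(key)
print(unique_key)
```

Because the hash is computed from the record's own fields, the same article
always maps to the same :code:`unique_key`, which makes it usable for spotting
duplicates across result files.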
18 changes: 18 additions & 0 deletions docs/Tutorial/index.rst
@@ -0,0 +1,18 @@
Tutorials
=========

Arcas' tutorials cover the basic usage of the library:
retrieving a single article from a single API, retrieving the same article from
various APIs and finally retrieving a large number of articles from different
APIs.


Contents:

.. toctree::
   :maxdepth: 2

   tutorial_i.rst
   tutorial_ii.rst
   tutorial_iii.rst
48 changes: 48 additions & 0 deletions docs/Tutorial/tutorial_i.rst
@@ -0,0 +1,48 @@
.. _tutorial-i:

=======================================
Tutorial I: Retrieving a single article
=======================================

In this tutorial the aim is to retrieve a single article from arXiv,
where the word 'Game' is contained in the title or the abstract.

First, let us import Arcas::

    >>> import arcas

The APIs are implemented as classes. Here we create an instance of the
arXiv API::

    >>> api = arcas.Arxiv()

We now create the query that will be sent to arXiv. :code:`records` is the
number of records we are requesting::

    >>> parameters = api.parameters_fix(title='Game', abstract='Game', records=1)
    >>> url = api.create_url_search(parameters)

The query is used to ping the API, and afterwards we parse the xml file
that has been retrieved::

    >>> request = api.make_request(url)
    >>> root = api.get_root(request)
    >>> raw_article = api.parse(root)
    >>> article = api.to_dataframe(raw_article[0])

Note that we are using the library `pandas <http://pandas.pydata.org/>`_ to
store the results. The data frame contains the metadata of an article as it
is recorded in arXiv. We can type the following to see the
columns of the data frame::

    >>> article.columns
    Index(['key', 'unique_key', 'title', 'author', 'abstract', 'date', 'journal',
    'pages', 'key_word', 'provenance'],dtype='object')

and we can ask for the title::

    >>> article.title.unique()
    array([ 'A New Approach to Solve a Class of Continuous-Time Nonlinear
    Quadratic Zero-Sum Game Using ADP'], dtype=object)

The structure of the results is discussed in depth in :ref:`result set<results-set>`.
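
To make the query-building step of this tutorial more concrete, the sketch
below shows how a search URL for the public arXiv API could be assembled with
the standard library alone. The function name ``build_arxiv_url`` is
hypothetical, and joining the title and abstract terms with ``OR`` is an
assumption based on the wording above; this is not Arcas's own code, and the
URL produced by :code:`create_url_search` may look different. The endpoint, the
``search_query``, ``start`` and ``max_results`` parameters and the ``ti`` and
``abs`` field prefixes are taken from the arXiv API user manual.

```python
from urllib.parse import urlencode

ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"


def build_arxiv_url(title=None, abstract=None, records=1, start=0):
    """Build an arXiv search URL (illustrative; not Arcas's implementation)."""
    terms = []
    if title is not None:
        terms.append("ti:{}".format(title))      # search in the title field
    if abstract is not None:
        terms.append("abs:{}".format(abstract))  # search in the abstract field

    query = {
        "search_query": " OR ".join(terms),  # match either field (assumption)
        "start": start,                      # index of the first result
        "max_results": records,              # number of results requested
    }
    return "{}?{}".format(ARXIV_ENDPOINT, urlencode(query))


url = build_arxiv_url(title="Game", abstract="Game", records=1)
print(url)
```

Nothing is sent over the network here; the function only builds the string
that a request step would then fetch.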
29 changes: 29 additions & 0 deletions docs/Tutorial/tutorial_ii.rst
@@ -0,0 +1,29 @@
.. _tutorial-ii:

===================================================
Tutorial II: Retrieve an article from various APIs
===================================================

In this tutorial we make a query similar to the one in
:ref:`tutorial I <tutorial-i>`, but against several APIs.

To achieve that we use a :code:`for` loop over a list of the given
API classes. For each instance we repeat the same procedure::

    >>> for p in [arcas.Ieee, arcas.Plos, arcas.Arxiv, arcas.Springer, arcas.Nature]:
    ...
    ...     api = p()
    ...     parameters = api.parameters_fix(title='Game', abstract='Game', records=1)
    ...     url = api.create_url_search(parameters)
    ...     request = api.make_request(url)
    ...     root = api.get_root(request)
    ...     raw_article = api.parse(root)
    ...
    ...     for art in raw_article:
    ...         article = api.to_dataframe(art)
    ...         api.export(article, 'results_{}.json'.format(api.__class__.__name__))


The :code:`export` method writes the results to a `json
<http://www.json.org/>`_ file. Here the results of each API are stored in
a separate file named after the API they come from.
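
The file-naming idea used in this tutorial can be tried out without contacting
any API. The sketch below is only an illustration: ``FakeArxiv`` and
``FakePlos`` are hypothetical stand-ins for the Arcas classes, and the
``export`` function here is a simplified substitute written with the standard
``json`` module, not Arcas's own :code:`export`.

```python
import json
import os
import tempfile


class FakeArxiv:
    """Hypothetical stand-in for an API class such as arcas.Arxiv."""


class FakePlos:
    """Hypothetical stand-in for an API class such as arcas.Plos."""


def export(records, filename):
    """Write a list of article dictionaries to a JSON file."""
    with open(filename, "w") as handle:
        json.dump(records, handle, indent=2)


output_dir = tempfile.mkdtemp()  # scratch directory for the example files
written = []
for cls in [FakeArxiv, FakePlos]:
    api = cls()
    # Placeholder record; a real run would hold the parsed article metadata.
    records = [{"title": "Example article", "provenance": cls.__name__}]
    # One file per API, named after the class of the instance.
    name = "results_{}.json".format(api.__class__.__name__)
    export(records, os.path.join(output_dir, name))
    written.append(name)

print(written)
```

Using :code:`api.__class__.__name__` in the file name keeps the results of the
different APIs apart without maintaining a separate list of names.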
35 changes: 35 additions & 0 deletions docs/Tutorial/tutorial_iii.rst
@@ -0,0 +1,35 @@
.. _tutorial-iii:

====================================================
Tutorial III: Retrieving a large number of articles
====================================================

Now that we have learned to ping several APIs for a single article, we will
repeat the procedure for a large number of articles. In this example we
would like to retrieve 20 articles from each API.

Often, we are looking for hundreds of articles. Rather than returning
all the results at once, the APIs offer a paging mechanism through
:code:`start` and :code:`records`. That way we can receive the result set
one chunk at a time. :code:`start` defines the index of the first returned
article and :code:`records` the number of articles returned by the query::

    >>> for p in [arcas.Ieee, arcas.Plos, arcas.Arxiv, arcas.Springer, arcas.Nature]:
    ...     for start in range(2):
    ...
    ...         api = p()
    ...         parameters = api.parameters_fix(title='Game', abstract='Game',
    ...                                         records=10, start=(start * 10))
    ...         url = api.create_url_search(parameters)
    ...         request = api.make_request(url)
    ...         root = api.get_root(request)
    ...         raw_article = api.parse(root)
    ...
    ...         for art in raw_article:
    ...             article = api.to_dataframe(art)
    ...             api.export(article, 'results_{}.json'.format(api.__class__.__name__))

With only 20 articles this might not seem like an important difference. But
assume you were asking for hundreds of articles. Some APIs limit the number of
articles that can be returned by a single query, so requesting the results in
chunks keeps each request within that limit and avoids overloading the API.
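
The paging arithmetic used in this tutorial can be sketched on its own, without
any API calls. The helper ``page_offsets`` below is a hypothetical name
introduced for this example; it only computes the (:code:`start`,
:code:`records`) pairs that would be passed to each query.

```python
def page_offsets(total, records):
    """Yield (start, records) pairs covering `total` articles in chunks."""
    for start in range(0, total, records):
        # The last chunk may be smaller than the page size.
        yield start, min(records, total - start)


pages = list(page_offsets(total=20, records=10))
print(pages)

uneven = list(page_offsets(total=25, records=10))
print(uneven)
```

For 20 articles in pages of 10 this gives the same two starting indices, 0 and
10, as :code:`start * 10` for :code:`start` in :code:`range(2)`.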
9 changes: 7 additions & 2 deletions docs/conf.py
@@ -17,7 +17,7 @@
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
-# import os
+import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

@@ -120,7 +120,12 @@
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
-html_theme = 'alabaster'
+on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
+
+if not on_rtd:  # only import and set the theme if we're building docs locally
+    import sphinx_rtd_theme
+    html_theme = 'sphinx_rtd_theme'
+    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
22 changes: 10 additions & 12 deletions docs/index.rst
@@ -1,27 +1,25 @@
Welcome to Arcas's documentation!
=================================
-Arcas is python tool designed to help scraping APIs for academic articles.
-Currently only the following APIs have been implement:
-- IEEE
-- springer
-- arXiv
-- nature
+A large number of scholarly databases and collections offer some form of API
+access. An API is an online tool to access data straight from the databases.

-Contents:
+Arcas is a Python tool designed to help communicate with (ping) several of these APIs.
+
+Table of Contents
+=================

.. toctree::
   :maxdepth: 2

   installation.rst
-   Apis/index.rst
-   Example/index.rst

+   Tutorial/index.rst
+   Reference/index.rst


Indices and tables
==================

-* :ref:`genindex`
-* :ref:`modindex`
+.. * :ref:`genindex`
+.. * :ref:`modindex`
* :ref:`search`

16 changes: 16 additions & 0 deletions docs/installation.rst
@@ -0,0 +1,16 @@
================
Installing Arcas
================

From PyPI::

    $ pip install arcas

From GitHub::

    $ git clone https://github.com/Nikoleta-v3/Arcas.git
    $ cd Arcas
    $ pip install -r requirements.txt
    $ python setup.py install

Arcas supports Python 3.5.
