# Appendix A: Loading PYBOSSA Tasks into a Dataframe

The code in this notebook downloads all PYBOSSA task objects via the PYBOSSA API and loads them into a dataframe. It is not explained in detail, as the main purpose of the notebooks in this repository is not to explore how to download PYBOSSA domain objects.

At the time of writing, the LibCrowds rate limit is set to 1000 requests per 15 minutes and the PYBOSSA API returns a maximum of 100 task objects per request, so we can retrieve up to 100,000 tasks every 15 minutes. The code below reads the rate limit headers returned with each response and, if the limit is hit, sleeps until it is reset.
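
The waiting logic described above can be illustrated in isolation with a small sketch. The helper name `seconds_until_reset` and the example header values are hypothetical. The sketch assumes, as the notebook code does, that `x-ratelimit-reset` holds a Unix timestamp and `x-ratelimit-remaining` holds the number of requests left in the current window.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before sending the next request."""
    now = time.time() if now is None else now
    # Header values arrive as strings, so convert before comparing.
    remaining = int(headers.get('x-ratelimit-remaining', 1))
    if remaining > 0:
        return 0.0
    reset = float(headers.get('x-ratelimit-reset', now))
    # Never return a negative wait if the reset time has already passed.
    return max(0.0, reset - now)


# Hypothetical header values, for illustration only.
example = {'x-ratelimit-remaining': '0', 'x-ratelimit-reset': '1060'}
wait = seconds_until_reset(example, now=1000.0)
```

A plain dictionary is used here for simplicity; the headers object returned by `requests` is case-insensitive, whereas a plain dictionary is not.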

For more details about the PYBOSSA API please refer to the [PYBOSSA documentation](http://docs.pybossa.com).

In [2]:
import requests
import datetime
import pandas
from time import sleep


API_BASE = 'https://www.libcrowds.com/api/'


def get(domain_obj, offset=0):
    """Get a set of domain objects."""
    r = requests.get(API_BASE + domain_obj, params={
        'offset': offset,
        'limit': 100,
        'all': 1
    })
    r.raise_for_status()
    return r


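def load(domain_obj):
    """Load all of the chosen domain objects."""
    data = []
    last_fetched = []
    while _not_exhausted(last_fetched):
        # Use the number of objects fetched so far as the next offset.
        r = get(domain_obj, len(data))
        last_fetched = r.json()
        if not last_fetched:
            # An empty page means there is nothing left to fetch.
            break
        data += last_fetched
        respect_rate_limits(r)
    return data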


def _not_exhausted(last_fetched):
    """Return True while more objects may remain to be fetched."""
    # Either nothing has been fetched yet or the last page was full.
    return len(last_fetched) == 0 or len(last_fetched) == 100


def respect_rate_limits(response):
    """If we have exceeded the rate limit sleep until it is refreshed."""
    reset = response.headers['x-ratelimit-reset']
    reset_dt = datetime.datetime.fromtimestamp(float(reset))
    # Header values are strings, so convert before comparing with zero.
    remaining = int(response.headers['x-ratelimit-remaining'])
    if remaining == 0:
        while reset_dt > datetime.datetime.now():
            sleep(1)

            
items = load('task')
df = pandas.DataFrame(items)
df.to_json('../data/pybossa_tasks.gz', compression='gzip')
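
A file saved this way can be read back with `pandas.read_json`. The following is a minimal sketch of that round trip, using made-up records and a temporary path rather than real task data; the field names are placeholders only.

```python
import os
import tempfile

import pandas

# Hypothetical records standing in for downloaded task objects.
items = [{'id': 1, 'project_id': 10}, {'id': 2, 'project_id': 10}]
df = pandas.DataFrame(items)

# Write to a temporary location instead of ../data/ for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'pybossa_tasks.gz')
df.to_json(path, compression='gzip')

# Read the gzip-compressed JSON back into a dataframe.
restored = pandas.read_json(path, compression='gzip')
```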