# Appendix A: Loading PYBOSSA Tasks into a Dataframe

The code in this notebook downloads all [PYBOSSA](https://docs.pybossa.com) task objects and loads them into a dataframe. It is not explained in detail, because downloading PYBOSSA domain objects is not the main focus of the notebooks in this repository. In short, the code queries the PYBOSSA API iteratively to request all task objects associated with [*In the Spotlight*](https://www.libcrowds.com/collection/playbills) projects.

At the time of writing, the LibCrowds rate limit is set to 1,000 requests per 15 minutes and the PYBOSSA API returns a maximum of 100 task objects per request, so we can request up to 100,000 tasks every 15 minutes. The `respect_rate_limits` function below reads the rate limit headers returned with each response and, if the limit has been reached, sleeps until it resets.
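As a minimal sketch of the arithmetic and header handling described above, and separate from the notebook's own code, the following computes the per-window task ceiling and the pause needed before the next request. The helper names `max_tasks_per_window` and `seconds_to_wait` are hypothetical, and the headers are passed in as a plain dictionary so that the example runs without network access.

```python
import datetime

REQUESTS_PER_WINDOW = 1000  # LibCrowds rate limit quoted above
TASKS_PER_REQUEST = 100     # PYBOSSA maximum page size quoted above


def max_tasks_per_window():
    """Upper bound on tasks retrievable in one 15 minute window."""
    return REQUESTS_PER_WINDOW * TASKS_PER_REQUEST


def seconds_to_wait(headers, now):
    """Return how long to pause before the next request (0.0 if none needed)."""
    # Header values arrive as strings, so convert before comparing.
    remaining = int(headers['x-ratelimit-remaining'])
    if remaining > 0:
        return 0.0
    reset_dt = datetime.datetime.fromtimestamp(float(headers['x-ratelimit-reset']))
    return max(0.0, (reset_dt - now).total_seconds())


# Hypothetical header values: the limit is exhausted and resets in 30 seconds.
now = datetime.datetime(2018, 1, 1, 12, 0, 0)
reset = str(now.timestamp() + 30)
exhausted = {'x-ratelimit-remaining': '0', 'x-ratelimit-reset': reset}
available = {'x-ratelimit-remaining': '5', 'x-ratelimit-reset': reset}

print(max_tasks_per_window())
print(seconds_to_wait(exhausted, now))
print(seconds_to_wait(available, now))
```

Returning a duration rather than sleeping inside the helper keeps the calculation easy to test; a caller could pass the result straight to `time.sleep`.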

import datetime
from time import sleep

import pandas
import requests


API_BASE = 'https://www.libcrowds.com/api/'
CATEGORY_ID = 22


def get(domain_obj, offset=0):
    return requests.get(API_BASE + domain_obj, params={
        'offset': offset,
        'limit': 100,
        'all': 1
    })


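def load(domain_obj):
    data = []
    last_fetched = None  # None means no request has been made yet
    while _not_exhausted(last_fetched):
        r = get(domain_obj, len(data))
        last_fetched = r.json()
        data += last_fetched
        respect_rate_limits(r)
    return data


def _not_exhausted(last_fetched):
    # Continue before the first request, or while the last page was full.
    # A short or empty page means there is nothing left to fetch.
    return last_fetched is None or len(last_fetched) == 100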


def respect_rate_limits(response):
    reset = response.headers['x-ratelimit-reset']
    reset_dt = datetime.datetime.fromtimestamp(float(reset))
    # Header values are strings, so convert before comparing with 0.
    remaining = int(response.headers['x-ratelimit-remaining'])
    if remaining == 0:
        # Wait until the rate limit window resets.
        while reset_dt > datetime.datetime.now():
            sleep(1)


# 'task' is the PYBOSSA domain object served at /api/task.
items = load('task')
df = pandas.DataFrame(items)
path = '../data/pybossa_tasks.pkl'
df.to_pickle(path)
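
The offset-based pagination used by `load` can be tried without network access. The sketch below is an illustration under stated assumptions, not the notebook's code: `load_all` and `make_fake_api` are hypothetical names, and the fake API simply serves slices of an in-memory list in pages of up to 100 records. It shows the loop stopping at the first page that comes back short, including the case where the total is an exact multiple of the page size.

```python
PAGE_SIZE = 100  # maximum number of objects per request, as stated above


def load_all(fetch_page):
    """Collect every item from fetch_page(offset), stopping at the first short page."""
    data = []
    while True:
        page = fetch_page(len(data))
        data += page
        if len(page) < PAGE_SIZE:
            return data


def make_fake_api(total):
    """Hypothetical stand-in for the API: serves `total` numbered records."""
    records = [{'id': i} for i in range(total)]
    calls = []  # offsets requested, recorded for inspection

    def fetch_page(offset):
        calls.append(offset)
        return records[offset:offset + PAGE_SIZE]

    return fetch_page, calls


fetch_page, calls = make_fake_api(250)
fake_items = load_all(fetch_page)
print(len(fake_items), calls)
```

Using the number of items already collected as the next offset mirrors the approach in `load`, so no separate page counter is needed.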