# API Pagination

In this section, we'll finish up the initial discussion of using APIs by looking at how we can use pagination to retrieve larger amounts of data than we're able to get in a single query.


In [67]:
from requests import get 
import pandas as pd 
import numpy as np
import yaml
import time

# reading in our keys
with open('../../keys.yml', 'r') as file:
    keys = yaml.safe_load(file)


We'll start by sending a query to the New York Times article search API:

In [54]:
nyt_key = keys['nyt_api_key']
article_base = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'


In [55]:
inflation_1_24 = {'q':'inflation',
            'begin_date':'20240101',
            'end_date':'20240107',
            'api-key':nyt_key}



response_01_2024 = get(article_base, params= inflation_1_24)


Take a look at the number of hits and compare it to the number of documents returned. You should notice that we've only retrieved 10 results, but there are a lot more than that available.





In [None]:
# response_01_2024.json()
# Get the number of hits/number returned: 
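One way to make that comparison is sketched below on a hand-built dictionary that mimics the parts of the response used later in this section (`['response']['docs']` and `['response']['meta']['hits']`). The `sample_payload` values and the `summarize_search` helper are made up for illustration; with a live response you would pass `response_01_2024.json()` instead.

```python
def summarize_search(payload):
    """Return (total hits reported, documents actually returned)."""
    response = payload['response']
    hits = response['meta']['hits']
    returned = len(response['docs'])
    return hits, returned

# Hypothetical payload with the same nesting as the article search response
sample_payload = {
    'response': {
        'docs': [{'headline': f'article {n}'} for n in range(10)],
        'meta': {'hits': 75},
    }
}

hits, returned = summarize_search(sample_payload)
print(f'{returned} of {hits} articles returned')
```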


This is a pretty common problem: many APIs won't return all of the relevant data on the first query. Instead, you'll need to send multiple queries to assemble a full list of relevant results. To retrieve all of the available articles, we'd typically use an **offset** or **page** parameter. If there are 30 results and each query only returns 10 articles, then with a zero-indexed page parameter:

- page = 0 would return articles  1-10
- page = 1 would return articles 11-20
- page = 2 would return articles 21-30

(With an offset parameter, the equivalent values would be 0, 10, and 20.)
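The arithmetic linking a page number, an offset, and the articles returned can be sketched as follows. This assumes zero-indexed pages of 10 results each; `articles_on_page` and `offset_for_page` are illustrative helpers, not part of any API.

```python
def articles_on_page(page, page_size=10):
    """Return the (first, last) article numbers, counting from 1, on a zero-indexed page."""
    first = page * page_size + 1
    last = (page + 1) * page_size
    return first, last

def offset_for_page(page, page_size=10):
    """Return how many results are skipped before a zero-indexed page begins."""
    return page * page_size

for page in range(3):
    first, last = articles_on_page(page)
    print(f'page={page} (offset={offset_for_page(page)}): articles {first}-{last}')
```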

Of course, we'd want to write this out using a loop instead of performing each query manually. The process will vary depending on our API, but in general we'll want to set up some kind of loop that sends a request, stores the result, and then increments an offset or page counter until there's no more data to retrieve.

Since we don't know in advance how many requests we'll need, we can use a `while` loop instead of a `for` loop. Unlike `for` loops, `while` loops just run until their condition evaluates to `False` or until they encounter a `break` statement. Also unlike `for` loops, they don't automatically keep track of the number of iterations run, so we may need to manually increment any relevant counters.

Here's an example of a while loop that just runs for 10 iterations and prints a value:

In [51]:
counter = 0 # manually creating a counter
while counter < 10:
    print(counter, end=' ')
    counter += 1
    time.sleep(.3)  # waiting a tiny bit
    

0 1 2 3 4 5 6 7 8 9 

An alternative way to set up a while loop is to just use `while True`. This creates a loop that runs indefinitely, but we can use a conditional `break` statement to end the loop when a certain condition is reached. Note that this does the exact same thing as the previous loop; it's just a slightly different way of assembling it:

In [52]:
counter = 0 
while True:
    print(counter, end=' ')
    counter += 1
    time.sleep(.3) # waiting a tiny bit
    if counter >= 10:
        break

0 1 2 3 4 5 6 7 8 9 

So, for our pagination process we need to create a while loop that does the following:

1. send a get request to return page i
2. append the result to a list
3. check to see if we've reached the last page yet
   - if yes: break the while loop
   - if no: increment the counter by 1, wait briefly to avoid sending too many queries at once, and return to step 1

So here's an example of how we could set this up:

In [56]:
# Start with parameters for the basic search and set the page to 0
article_base = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'
params = {'q':'inflation',
            'begin_date':'20240101',
            'end_date':'20240107',
            'page':0,
            'api-key':nyt_key}


In [202]:
all_articles = []

while True:                                                 # runs until the break statement triggers
    print('current page:', params['page'], end='\r')        # track the progress of the loop
    r = get(article_base, params=params).json()             # step 1. send the request
    articles = r['response']['docs']
    all_articles.extend(articles)                           # step 2. add this page to the full list
    # step 3. if we have at least as many articles as the total number of hits
    # (or the page came back empty), we're done and don't need to send more requests
    if not articles or len(all_articles) >= r['response']['meta']['hits']:
        break                                               # the break statement stops the loop
    params['page'] += 1                                     # otherwise, increment the page parameter
    time.sleep(12)                                          # the NYT API asks for at most 5 queries per minute, so wait 60/5 = 12 seconds
   

current page: 7

Now we've got the full result!

In [203]:

len(all_articles)


75
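The same loop can be wrapped in a reusable function. The sketch below is one way to do that, not the only one: `fetch_page` is a hypothetical callable standing in for the `get(...).json()` call, `fake_fetch` simulates an API with 25 results so the example runs without a key, and `max_pages` is an added safety cap so that a bug can't keep the loop running forever.

```python
import time

def collect_all(fetch_page, delay=0.0, max_pages=100):
    """Request pages until every hit is collected or a page comes back empty."""
    all_docs = []
    page = 0
    while page < max_pages:
        payload = fetch_page(page)
        docs = payload['response']['docs']
        all_docs.extend(docs)
        total = payload['response']['meta']['hits']
        if not docs or len(all_docs) >= total:
            break                # nothing left to retrieve
        page += 1                # move on to the next page
        time.sleep(delay)        # pause between requests
    return all_docs

# Simulated results (hypothetical data, no network access needed)
FAKE_RESULTS = [f'article {n}' for n in range(25)]

def fake_fetch(page, page_size=10):
    """Return one page of FAKE_RESULTS in the same nested shape used above."""
    start = page * page_size
    return {'response': {'docs': FAKE_RESULTS[start:start + page_size],
                         'meta': {'hits': len(FAKE_RESULTS)}}}

articles = collect_all(fake_fetch)
print(len(articles))
```

With a real API, `fetch_page` would send the request for the given page, and `delay` would be set to whatever pause the API's rate limit calls for (12 seconds in the loop above).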

## Pagination in the World Bank API


This query returns total greenhouse gas emissions for all countries in 2020:



In [33]:
wdi_params = {'format':'json',
              'per_page':100,
              'date':2020
             }
url = 'https://api.worldbank.org/v2/country/all/indicator/EN.GHG.ALL.MT.CE.AR5'
response = get(url, params = wdi_params)


However, our query only returns the first 100 results. The metadata in the first element of the response shows that there are 266 results spread over 3 pages:

In [34]:
response.json()[0]

{'page': 1,
 'pages': 3,
 'per_page': 100,
 'total': 266,
 'sourceid': '2',
 'lastupdated': '2025-01-28'}

[According to the documentation for this API](https://datahelpdesk.worldbank.org/knowledgebase/articles/898581), we can get the next page of results by incrementing the `page` parameter in our request. So the next page of results would just add "&page=2" to the URL we just requested. 
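As a rough sketch of what that looks like, the snippet below builds the URL for page 2 and works out how many pages to expect from the `total` and `per_page` values in the metadata. It uses `urllib.parse.urlencode` from the standard library instead of `requests` so that it runs without sending anything; `page_url` is an illustrative helper, not part of the World Bank API.

```python
from math import ceil
from urllib.parse import urlencode

base_url = 'https://api.worldbank.org/v2/country/all/indicator/EN.GHG.ALL.MT.CE.AR5'

def page_url(page, per_page=100, date=2020):
    """Build the request URL for a given page of results."""
    query = urlencode({'format': 'json', 'per_page': per_page, 'date': date, 'page': page})
    return f'{base_url}?{query}'

# Metadata values copied from the response shown above
total, per_page = 266, 100
n_pages = ceil(total / per_page)   # round up so the partial last page is counted

print(n_pages)
print(page_url(2))
```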

We could just write all three links out separately, but a more generalizable approach would be to write a loop that makes use of the pagination information that the API gives us. The code below uses a `while` loop to repeatedly send requests until we reach the final page. After running it, we'll have a list of responses that we can then concatenate into a single data frame.

In [63]:
# start with an empty list
results_list = []

url = 'https://api.worldbank.org/v2/country/all/indicator/EN.GHG.ALL.MT.CE.AR5'
morepages = True
i = 1  # this API starts counting pages at 1

while morepages:
    wdi_params = {'format':'json',
                  'per_page':100,
                  'date':2020, 
                  'page':i}
    response = get(url, params = wdi_params).json()
    # append page i to results_list
    results_list.append(response)
    # check to see if we've reached the final page:
    morepages = i < response[0].get('pages')
    
    time.sleep(1)  # short pause between requests
    i += 1






Now we just need to format and concatenate all the results. To do that, I've written a function that takes a single response from the WDI API and turns it into a data frame. I'll apply it to each list element using a list comprehension, and then use `pd.concat` to create a single data frame.

In [64]:

def wdi_parser(resp):
    result_dict = [{'country_id':i['countryiso3code'],
                    'country_name':i['country']['value'],
                    'date': int(i['date']),
                    'indicator': i['indicator']['id'],
                    'indicator_description' : i['indicator']['value'],
                    'indicator_value': np.float64(i['value'])} for i in resp[1]]
    return pd.DataFrame(result_dict)


In [65]:
parsed_responses = [wdi_parser(i) for i in results_list]
wdi_df = pd.concat(parsed_responses)
wdi_df.shape

(266, 6)

Now we should have all 266 results:

In [66]:
wdi_df.tail()

| | country_id | country_name | date | indicator | indicator_description | indicator_value |
|---|---|---|---|---|---|---|
| 61 | VIR | Virgin Islands (U.S.) | 2020 | EN.GHG.ALL.MT.CE.AR5 | Total greenhouse gas emissions excluding LULUC... | 0.0244 |
| 62 | PSE | West Bank and Gaza | 2020 | EN.GHG.ALL.MT.CE.AR5 | Total greenhouse gas emissions excluding LULUC... | NaN |
| 63 | YEM | Yemen, Rep. | 2020 | EN.GHG.ALL.MT.CE.AR5 | Total greenhouse gas emissions excluding LULUC... | 32.8781 |
| 64 | ZMB | Zambia | 2020 | EN.GHG.ALL.MT.CE.AR5 | Total greenhouse gas emissions excluding LULUC... | 27.3441 |
| 65 | ZWE | Zimbabwe | 2020 | EN.GHG.ALL.MT.CE.AR5 | Total greenhouse gas emissions excluding LULUC... | 26.7706 |


### Pagination with offsets
Keep in mind that the process of paginating through data will not always be the same across all APIs. For instance, the [Nobel Prize API](https://app.swaggerhub.com/apis/NobelMedia/NobelMasterData/2.1) uses an offset parameter rather than a page parameter. You would write something like `offset=0&limit=100` to get results 1-100, then increase the offset by 100 (`offset=100&limit=100`) to get results 101-200, and so on, continuing until your offset is greater than or equal to the total number of results.
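The offset bookkeeping can be sketched without touching any API. Here `offset_windows` is a hypothetical helper and the total of 250 results is made up; the point is just to show the offsets you would send and where the loop stops.

```python
def offset_windows(total, limit=100):
    """Yield the offset for each request needed to cover `total` results."""
    offset = 0
    while offset < total:      # stop once the offset reaches the total
        yield offset
        offset += limit        # move forward by one full window

# Hypothetical total of 250 results, fetched 100 at a time
for offset in offset_windows(250):
    print(f'offset={offset}&limit=100')
```

In practice the total usually comes from the first response, so you would send one request, read the total, and then loop over the remaining offsets.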

However, while the specific parameters might be different, the basic ingredients for pagination are more-or-less the same:
1. You need code that takes a response object and then creates a URL to retrieve the next page of data
2. You need code that can detect when there are no pages left
3. You need code to format all of the pages into a single data frame

<b style="color:red;">
<h3>Question 1A</h3> The request below gets a single page of results from the PokeApi (see <a href ='https://pokeapi.co/docs/v2#pokemon'>documentation</a>). Start by writing code that will retrieve/create a request for the next page of data.</b>

(Note that you can either use an offset parameter or the "next" url to get results here.)

In [None]:
params = {'offset':0,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)

request.url

In [None]:
# code to get the next page of results

<b style="color:red;">
<h3>Question 1B</h3> The request below shows you what the final page of data would look like. Use this response to write some code that will return `False` if we've reached the final page.</b>


In [None]:
params = {'offset':request.json()['count']-10,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)
request.url


<b style="color:red;">
<h3>Question 1C</h3>
Use the code above to create a while loop that iterates through each page of results and collects the name and url of each Pokemon in a list. Remember to put a short pause between each iteration of the loop. </b>

If you find your loop runs for a really long time, you might want to interrupt the kernel by pressing the stop button at the top of your notebook.


In [None]:
# code to create a list with all the responses


### A note on gathering complex data

Depending on how the data are structured, there may be cases where you need to query one part of the API to get a URL for a separate endpoint that has more detailed data about that subject. The PokeApi is a good example of this: we retrieved a list of names and URLs, but if we navigate to any one of those URLs we'll get even more detailed information about the selected Pokemon. So if we wanted to create a data set with detailed information on each Pokemon, we would need to iterate over all of these URLs and then format all of our results in a data frame. The way that data are organized is really up to the person who maintains the data set, so you'll want to spend some time getting to know an API before you can really get a good sense of what you can do with it.
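Here's a minimal sketch of that two-step pattern. Everything in it is a stand-in: `fetch_details` is a hypothetical helper, the `example.com` URLs are placeholders, and `FAKE_DETAILS` replaces the real detail endpoints with made-up values so the code runs without sending any requests.

```python
def fetch_details(listing, fetch):
    """Follow each item's URL and combine the listing entry with its detail record."""
    rows = []
    for item in listing:
        detail = fetch(item['url'])                 # second request: the detail endpoint
        rows.append({'name': item['name'], **detail})
    return rows

# Hypothetical stand-in for the detail endpoints;
# a real version would call something like get(url).json()
FAKE_DETAILS = {
    'https://example.com/item/1/': {'height': 1, 'weight': 2},
    'https://example.com/item/2/': {'height': 3, 'weight': 4},
}

# First step: a list of names and URLs, like the one a paginated listing returns
listing = [
    {'name': 'example-a', 'url': 'https://example.com/item/1/'},
    {'name': 'example-b', 'url': 'https://example.com/item/2/'},
]

rows = fetch_details(listing, FAKE_DETAILS.get)
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame`. With a real API you would also want a short pause inside the loop, since this sends one request per item.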

# Extra code



Here's an example of using a custom pagination function to automatically retrieve data from the congress.gov API. To make this code work, you'll need to sign up for a data.gov API key:

https://api.data.gov/signup/

And then add that key to the keys.yml file in the root directory for BSOS326.

In [68]:
import requests
from requests.models import PreparedRequest
import time
import pandas as pd
def member_parser(response):
    """A function to parse a response from the members endpoint on Congress.gov"""
    members_json = response.json()['members']
    member = [{'bioguideId' : i.get('bioguideId'),
              'district' : i.get('district'),
              'name' : i.get('name'),
              'partyName' : i.get('partyName'),
              'state': i.get('state'),
              'chamber':i.get('terms').get('item')[-1].get('chamber'),
              'startYear':i.get('terms').get('item')[-1].get('startYear'),
              'endYear':i.get('terms').get('item')[-1].get('endYear'),
              'url' : i.get('url')} for i in members_json]
    member_frame = pd.DataFrame(member)
    return member_frame

def congress_paginate(initial_url, params):
    """A function that automatically paginates a query to the congress.gov API"""
    # remove the API key from the parameters list 
    apikey = params.pop('api_key')
    req = PreparedRequest()
    req.prepare_url(initial_url, params) # create a url
    nextpage = req.url 
    responses_list = []
    # iterate over next page URLs
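    while nextpage is not None:
        # the "next" URL doesn't include the API key, so add it back in
        nextpage_url = nextpage + '&api_key=' + apikey
        response = requests.get(nextpage_url)
        responses_list.append(response)
        # "next" is missing on the final page, so .get() returns None and the loop ends
        nextpage = response.json().get('pagination').get('next')
        time.sleep(5000/3600)  # pause roughly 1.4 seconds between requests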
    return responses_list

In [72]:
cong = congress_paginate('https://api.congress.gov/v3/member/congress/119', params = {'currentMember':False, 'api_key':keys['data_gov']})

In [None]:
members = pd.concat([member_parser(i) for i in cong])

Now we can do things like look at the longest-serving members of the current Congress:

In [83]:
members.sort_values(['startYear'])[:10]

| | bioguideId | district | name | partyName | state | chamber | startYear | endYear | url |
|---|---|---|---|---|---|---|---|---|---|
| 2 | Y000033 | 0.0 | Young, Don | Republican | Alaska | House of Representatives | 1973 | 2022.0 | https://api.congress.gov/v3/member/Y000033?for... |
| 8 | L000174 | NaN | Leahy, Patrick J. | Democratic | Vermont | Senate | 1975 | 2023.0 | https://api.congress.gov/v3/member/L000174?for... |
| 9 | H000874 | 5.0 | Hoyer, Steny H. | Democratic | Maryland | House of Representatives | 1981 | NaN | https://api.congress.gov/v3/member/H000874?for... |
| 19 | R000395 | 5.0 | Rogers, Harold | Republican | Kentucky | House of Representatives | 1981 | NaN | https://api.congress.gov/v3/member/R000395?for... |
| 0 | G000386 | NaN | Grassley, Chuck | Republican | Iowa | Senate | 1981 | NaN | https://api.congress.gov/v3/member/G000386?for... |
| 18 | S000522 | 4.0 | Smith, Christopher H. | Republican | New Jersey | House of Representatives | 1981 | NaN | https://api.congress.gov/v3/member/S000522?for... |
| 12 | K000009 | 9.0 | Kaptur, Marcy | Democratic | Ohio | House of Representatives | 1983 | NaN | https://api.congress.gov/v3/member/K000009?for... |
| 2 | M000355 | NaN | McConnell, Mitch | Republican | Kentucky | Senate | 1985 | NaN | https://api.congress.gov/v3/member/M000355?for... |
| 17 | D000191 | 4.0 | DeFazio, Peter A. | Democratic | Oregon | House of Representatives | 1987 | 2023.0 | https://api.congress.gov/v3/member/D000191?for... |
| 15 | P000034 | 6.0 | Pallone, Frank | Democratic | New Jersey | House of Representatives | 1987 | NaN | https://api.congress.gov/v3/member/P000034?for... |
