# Gathering data in loops

Your final project will probably require you to send more than one API request or scrape more than one page. The next sections offer some general tips on how you can write code that can collect large amounts of data from an API. The final section gives a few tips on organizing your projects. 

Regardless of how you choose to organize it, your final project must have code that can be used to replicate or update the results you use in your analysis. 

In [None]:
from requests import get
import numpy as np
import pandas as pd
import time


## API Pagination

Many APIs will place a limit on the number of results you can retrieve with a single query. In order to collect a complete data set, you'll usually need to write a loop that sends multiple requests until you've collected all of the relevant data.

The exact process for doing this will vary depending on the API, but usually it will involve using either an offset or a pagination parameter.

We can use an example from the World Bank Development Indicators API to illustrate how to do this. This query returns per-capita CO2 emissions (indicator `EN.ATM.CO2E.PC`) for all countries in 2020:



In [None]:
wdi_params = {'format':'json',
              'per_page':100,
              'date':2020
             }
url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
response = get(url, params = wdi_params)


This query only returns the first 100 results, but the response object tells us how many more results are available:

In [None]:
response.json()[0]
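As a rough sketch of how this metadata can be used, the number of requests needed follows from the total number of results and the page size. The dictionary below is a hypothetical stand-in for `response.json()[0]`; its key names and values are assumptions for illustration, not copied output.

```python
import math

# Hypothetical metadata, standing in for response.json()[0].
# The key names and values here are assumptions for illustration.
meta = {'page': 1, 'per_page': 100, 'total': 266}

# Number of requests needed to collect every result
n_pages = math.ceil(int(meta['total']) / int(meta['per_page']))
print(n_pages)
```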

[According to the documentation for this API](https://datahelpdesk.worldbank.org/knowledgebase/articles/898581), we can get the next page of results by incrementing the `page` parameter in our request. So to get the next page of results, we would just add `&page=2` to the URL we requested above.
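One way to see what such a request looks like, without sending anything, is to build the URL by hand. This is only a sketch: it uses the standard library's `urlencode` to mimic the query string, and the name `page_2_params` is made up for this example.

```python
from urllib.parse import urlencode

url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
wdi_params = {'format': 'json', 'per_page': 100, 'date': 2020}

# Copy the original parameters and add the page we want next
page_2_params = {**wdi_params, 'page': 2}

# Build the URL by hand to see what would be requested
page_2_url = url + '?' + urlencode(page_2_params)
print(page_2_url)
```

In practice you would pass the dictionary straight to `get(url, params=page_2_params)` and let it assemble the URL for you.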

We could just write all three links out separately, but a more generalizable approach is to write a loop that makes use of the pagination information the API gives us. The code below uses a `while` loop to keep sending requests until we reach the final page. After running it, we'll have a list of responses that we can then concatenate into a single data frame.

In [None]:
# start with an empty list
results_list = []

morepages = True
i = 1

while morepages:
    # same query as before, plus the page number we want
    wdi_params = {'format':'json',
                  'per_page':100,
                  'date':2020,
                  'page':i}
    url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
    response = get(url, params = wdi_params)
    # append page i to results_list
    results_list.append(response)
    # check to see if we've reached the final page:
    morepages = i < response.json()[0].get('pages')

    # pause briefly between requests, then move on to the next page
    time.sleep(1)
    i += 1






Now we just need to format and concatenate all the results. To do that, I've written a function that takes a single response from the WDI API and turns it into a data frame. I'll apply it to each list element using a list comprehension, and then use `pd.concat` to create a single data frame.

In [None]:


def wdi_parser(resp):
    result_dict = [{'country_id':i['countryiso3code'],
                    'country_name':i['country']['value'],
                    'date': int(i['date']),
                    'indicator': i['indicator']['id'],
                    'indicator_description' : i['indicator']['value'],
                    'indicator_value': np.float64(i['value'])} for i in resp.json()[1]]
    return pd.DataFrame(result_dict)


In [None]:
parsed_responses = [wdi_parser(i) for i in results_list]
wdi_df = pd.concat(parsed_responses)
wdi_df.shape

Now we should have all 266 results (the `all` query returns regional and income-group aggregates alongside individual countries):

In [None]:
wdi_df.tail()
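One detail worth knowing when stacking pages: by default `pd.concat` keeps each frame's original row labels, so the index restarts on every page. A small sketch with made-up data shows the difference that `ignore_index=True` makes.

```python
import pandas as pd

# Two tiny made-up "pages" of results
page_1 = pd.DataFrame({'country_id': ['AAA', 'BBB']})
page_2 = pd.DataFrame({'country_id': ['CCC', 'DDD']})

stacked = pd.concat([page_1, page_2])
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(list(stacked.index))      # labels restart for each page
print(list(renumbered.index))   # labels run from 0 to n - 1
```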

### Pagination with offsets
Keep in mind that the process of paginating through data is not the same across all APIs. For instance, the [Nobel Prize API](https://app.swaggerhub.com/apis/NobelMedia/NobelMasterData/2.1) uses an offset parameter rather than a page parameter. You would request `offset=0&limit=100` to get results 1-100, then increase the offset by 100 (`offset=100&limit=100`) to get results 101-200, and so on, continuing until the offset is greater than or equal to the total number of results.
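The arithmetic behind offsets can be sketched without touching any API. The helper below, `offset_params`, is a hypothetical name invented for this example; it simply lists the parameter dictionaries you would need to cover a given number of results.

```python
def offset_params(total, limit=100):
    """Return one {'offset', 'limit'} dict per request needed to cover `total` results."""
    return [{'offset': start, 'limit': limit} for start in range(0, total, limit)]

# 250 results at 100 per request means three requests
pages = offset_params(total=250, limit=100)
print(pages)
```

Because `range` stops before it reaches `total`, the list ends exactly when the offset would be greater than or equal to the number of results.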

However, while the specific parameters might be different, the basic ingredients for pagination are more-or-less the same:
1. You need code that takes a response object and then creates a URL to retrieve the next page of data
2. You need code that can detect when there are no pages left
3. You need code to format all of the pages into a single data frame
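The three ingredients above can be sketched with a pretend API that lives entirely in memory. Everything here (`FAKE_PAGES`, `fetch_page`, `next_page`, `has_more`, `to_frame`) is a hypothetical name made up for illustration; a real API would need its own version of each piece.

```python
import pandas as pd

# Hypothetical in-memory "API": three pages of made-up records.
FAKE_PAGES = {
    1: {'pages': 3, 'results': [{'id': 1}, {'id': 2}]},
    2: {'pages': 3, 'results': [{'id': 3}, {'id': 4}]},
    3: {'pages': 3, 'results': [{'id': 5}]},
}

def fetch_page(page):
    """Stand-in for sending a request; returns the parsed page."""
    return FAKE_PAGES[page]

def next_page(page):
    """Ingredient 1: work out what to request next."""
    return page + 1

def has_more(payload, page):
    """Ingredient 2: detect whether any pages are left."""
    return page < payload['pages']

def to_frame(payload):
    """Ingredient 3: turn one page into a data frame."""
    return pd.DataFrame(payload['results'])

collected = []
page = 1
while True:
    payload = fetch_page(page)
    collected.append(payload)
    if not has_more(payload, page):
        break
    page = next_page(page)

# Combine every page into a single data frame
combined = pd.concat([to_frame(p) for p in collected], ignore_index=True)
print(combined.shape)
```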

<b style="color:red;">
<h3>Question 1A</h3> The request below gets a single page of results from the PokeApi (see <a href='https://pokeapi.co/docs/v2#pokemon'>documentation</a>). Start by writing code that will retrieve or create a request for the next page of data.</b>

(Note that you can either use an offset parameter or the "next" url to get results here.)

In [None]:
params = {'offset':0,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)

request.url

In [None]:
# code to get the next page of results

<b style="color:red;">
<h3>Question 1B</h3> The request below shows you what the final page of data would look like. Use this response to write some code that will return `False` if we've reached the final page.</b>


In [None]:
params = {'offset':request.json()['count']-10,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)
request.url


<b style="color:red;">
<h3>Question 1C</h3>
Use the code above to create a while loop that iterates through each page of results and collects the name and url of each Pokemon in a list. Remember to put a short pause between each iteration of the loop. </b>

If you find that your loop runs for a really long time, you might want to interrupt the kernel by pressing the stop button at the top of your notebook.
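A safety cap on the number of iterations is one way to keep a faulty stopping condition from looping forever. The sketch below is generic: `MAX_PAGES` and `has_more_pages` are made-up names, and the check is deliberately broken so that only the cap ends the loop.

```python
MAX_PAGES = 50  # arbitrary safety cap; pick one that suits your API

def has_more_pages(page):
    """Hypothetical (and deliberately buggy) check that never says stop."""
    return True

page = 1
pages_seen = 0
while has_more_pages(page) and pages_seen < MAX_PAGES:
    # ... send the request for `page` and store the response here ...
    pages_seen += 1
    page += 1

if pages_seen == MAX_PAGES:
    print('Stopped at the safety cap; check the stopping condition.')
```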


In [None]:
# code to create a list with all the responses


<b style="color:red;">
<h3>Question 1D</h3>
Take a single element from your list of responses and write a function that will turn it into a data frame. Then apply that function to your list of results from the previous step using a list comprehension, and use `pd.concat` to combine them all together.
</b>

In [None]:
# code to concatenate everything in a data frame 


Once we're reasonably confident that we know how to navigate the pagination process, we might want to write a pagination function that can take any query and return the entire list of results. You can see an example of doing that with the Congress.gov API in the `congress_api_functions.py` file in the extra code folder in this directory.

### A note on gathering complex data

Depending on how the data are structured, there may be cases where you need to query one part of the API to get a URL for a separate endpoint that has more detailed data about that subject. The PokeApi is a good example of this: we retrieved a list of names and URLs, but if we navigate to any one of those URLs we'll get even more detailed information about the selected Pokemon. So if we wanted to create a data set with detailed information on each Pokemon, we would need to iterate over all of these URLs and then format all of our results into a data frame. The way that data are organized is really up to the person who maintains the data set, so you'll want to spend some time getting to know an API before you can get a good sense of what you can do with it.
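A two-step collection like this might be sketched as follows. The function `collect_details` and the fake data are hypothetical, and the `height` and `weight` fields are assumptions about what a detailed record might contain; check the API's documentation for the fields you actually need.

```python
import time

def collect_details(items, fetch_json, pause=1.0):
    """Visit each item's URL and keep a few fields from the detailed record."""
    rows = []
    for item in items:
        detail = fetch_json(item['url'])
        rows.append({'name': item['name'],
                     'height': detail.get('height'),   # assumed field name
                     'weight': detail.get('weight')})  # assumed field name
        time.sleep(pause)  # short pause between requests
    return rows

# Made-up data standing in for the list of names/URLs and the detail pages
fake_items = [{'name': 'alpha', 'url': 'https://example.com/1'},
              {'name': 'beta', 'url': 'https://example.com/2'}]
fake_details = {'https://example.com/1': {'height': 7, 'weight': 69},
                'https://example.com/2': {'height': 10}}

rows = collect_details(fake_items, fake_details.get, pause=0)
print(rows)
```

With a real API, `fetch_json` could be something like `lambda url: get(url).json()`, and the resulting list of dictionaries can be passed to `pd.DataFrame`.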