# Gathering data in loops

Your final project will probably require you to send more than one API request or scrape more than one page. The next sections offer some general tips on how you can write code that can collect large amounts of data from an API. The final section gives a few tips on organizing your projects. 

Regardless of how you choose to organize it, your final project must have code that can be used to replicate or update the results you use in your analysis. 

In [None]:
from requests import get
import numpy as np
import pandas as pd
import time


## API Pagination

Many APIs will place a limit on the number of results you can retrieve with a single query. In order to collect a complete data set, you'll usually need to write a loop that sends multiple requests until you've collected all of the relevant data.

The exact process for doing this will vary depending on the API, but usually it will involve using either an offset or a pagination parameter.

We can use the example from the World Bank Development Indicators API to illustrate how to do this. This query returns per-capita carbon dioxide emissions for all countries in 2020:



In [None]:
wdi_params = {'format':'json',
              'per_page':100,
              'date':2020
             }
url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
response = get(url, params = wdi_params)


This query only returns the first 100 results, but the response object tells us how many more results are available:

In [None]:
response.json()[0]

[According to the documentation for this API](https://datahelpdesk.worldbank.org/knowledgebase/articles/898581), we can get the next page of results by incrementing the `page` parameter in our request. So the next page of results would just add "&page=2" to the URL we just requested. 
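
To make the "&page=2" idea concrete, here is a minimal sketch that builds the second page's URL by hand with the standard library's `urlencode`. No request is sent, and the variable names `base_url`, `page_two_params`, and `page_two_url` are only for illustration. When we pass a `params` dictionary to `get`, the same kind of encoding is done for us.

```python
from urllib.parse import urlencode

base_url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
# same parameters as the first request, plus an explicit page number
page_two_params = {'format': 'json', 'per_page': 100, 'date': 2020, 'page': 2}

# urlencode turns the dictionary into "key=value" pairs joined by "&"
page_two_url = base_url + '?' + urlencode(page_two_params)
print(page_two_url)
```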

We could just write all three links out separately, but a more generalizable approach would be to write a loop that makes use of the pagination information that the API gives us. The code below uses a `while` loop to keep sending requests until we reach the final page. After running it, we'll have a list of responses that we can then concatenate into a single data frame.

In [None]:
# start with an empty list
results_list = []

# the URL never changes, so we only need to define it once
url = 'https://api.worldbank.org/v2/country/all/indicator/EN.ATM.CO2E.PC'
morepages = True
i = 1

while morepages:
    wdi_params = {'format': 'json',
                  'per_page': 100,
                  'date': 2020,
                  'page': i}
    response = get(url, params=wdi_params)
    # append page i to results_list
    results_list.append(response)
    # check to see if we've reached the final page:
    morepages = i < response.json()[0].get('pages')
    # short pause between requests
    time.sleep(1)
    i += 1






Now we just need to format and concatenate all the results. To do that, I've written a function that takes a single response from the WDI API and turns it into a data frame. I'll apply it to each list element using a list comprehension, and then use `pd.concat` to create a single data frame.

In [None]:


def wdi_parser(resp):
    result_dict = [{'country_id':i['countryiso3code'],
                    'country_name':i['country']['value'],
                    'date': int(i['date']),
                    'indicator': i['indicator']['id'],
                    'indicator_description' : i['indicator']['value'],
                    'indicator_value': np.float64(i['value'])} for i in resp.json()[1]]
    return pd.DataFrame(result_dict)


In [None]:
parsed_responses = [wdi_parser(i) for i in results_list]
wdi_df = pd.concat(parsed_responses)
wdi_df.shape

Now, we should have results for all 266 countries:

In [None]:
wdi_df.tail()

In [None]:
response.json()[0]

### Pagination with offsets
Keep in mind that the process of paginating through data will not always be the same across all APIs. For instance, the [Nobel Prize API](https://app.swaggerhub.com/apis/NobelMedia/NobelMasterData/2.1) uses an offset parameter rather than a pagination parameter. You would write something like `offset=0&limit=100` to get results 1-100, then increment the offset by 100 (`offset=100&limit=100`) to get results 101-200, and so on. You would continue until your offset was greater than or equal to the total number of results.

However, while the specific parameters might be different, the basic ingredients for pagination are more-or-less the same:
1. You need code that takes a response object and then creates a URL to retrieve the next page of data
2. You need code that can detect when there are no pages left
3. You need code to format all of the pages into a single data frame
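
For an offset-based API like the one described above, the first two ingredients come down to simple arithmetic. The sketch below is only an illustration: the helper name `offset_params` and the total of 250 results are made up, and a real API will report its own total (check its documentation for the exact field name).

```python
def offset_params(total, limit=100):
    """Return one parameter dictionary per page for an offset-based API."""
    # range() stops before `total`, so the last offset is always < total,
    # which is the "no pages left" condition
    return [{'offset': start, 'limit': limit} for start in range(0, total, limit)]

# hypothetical example: 250 results, 100 per page
page_params = offset_params(250)
for p in page_params:
    print(p)
```

Each dictionary could then be passed as the `params` argument of a separate request.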

<b style="color:red;">
<h3>Question 1A</h3> The request below gets a single page of results from the PokeApi (see the <a href ='https://pokeapi.co/docs/v2#pokemon'>documentation</a>). Start by writing code that will retrieve/create a request for the next page of data.</b>

(Note that you can either use an offset parameter or the "next" url to get results here.)

In [None]:
params = {'offset':0,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)

request.url

In [None]:
# code to get the next page of results

<b style="color:red;">
<h3>Question 1B</h3> The request below shows you what the final page of data would look like. Use this response to write some code that will return `False` if we've reached the final page.</b>


In [None]:
params = {'offset':request.json()['count']-10,
         'limit':100
         }
request = get('https://pokeapi.co/api/v2/pokemon', params=params)
request.url


<b style="color:red;">
<h3>Question 1C</h3>
Use the code above to create a while loop that iterates through each page of results and collects the name and url of each Pokemon in a list. Remember to put a short pause between each iteration of the loop. </b>

If you find your loop runs for a really long time, you might want to interrupt the kernel by pressing the stop button at the top of your notebook.


In [None]:
# code to create a list with all the responses


<b style="color:red;">
<h3>Question 1D</h3>
Take a single element from your list of responses and write a function that will turn it into a data frame. Then apply that function to your list of results from the previous step using a list comprehension, and use `pd.concat` to combine them all together.
</b>

In [None]:
# code to concatenate everything in a data frame 


Once we're reasonably confident that we know how to navigate the pagination process, we might want to write a pagination function that can take any query and return the entire list of results. You can see an example of doing that with the Congress.gov API in the `congress_api_functions.py` file, which is discussed at the end of this document.
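
As a rough illustration of the idea (this is not the contents of `congress_api_functions.py`), a pagination function for an API shaped like the World Bank example might look like the sketch below. The names `paginate`, `fetch`, and `fake_fetch` are assumptions made for this example. `fetch` is any function that takes a parameter dictionary and returns the parsed JSON, which lets us try the loop on a fake API before sending real requests.

```python
import time

def paginate(fetch, params, pause=1.0):
    """Collect every page from an API that reports a total page count.

    Assumes each parsed response looks like [metadata, records], where
    metadata has a 'pages' key (the shape of the World Bank responses).
    """
    pages = []
    page = 1
    while True:
        body = fetch({**params, 'page': page})
        pages.append(body)
        # stop once the current page is the last one the API reports
        if page >= body[0].get('pages', 1):
            break
        page += 1
        time.sleep(pause)  # short rest between requests
    return pages

# a fake three-page API so the loop can be tested without network access
def fake_fetch(params):
    return [{'page': params['page'], 'pages': 3}, [f"row {params['page']}"]]

all_pages = paginate(fake_fetch, {'format': 'json'}, pause=0)
print(len(all_pages))
```

With a real API, `fetch` could be something like `lambda p: get(url, params=p).json()`.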

### A note on gathering complex data

Depending on how the data are structured, there may be cases where you need to query one part of the API to get a URL for a separate endpoint that has more detailed data about that subject. The PokeApi is a good example of this: we retrieved a list of names and URLs, but if we navigate to any one of those URLs we'll get even more detailed information about the selected Pokemon. So if we wanted to create a data set with detailed information on each Pokemon, we would need to iterate over all of these URLs and then format all of our results in a data frame. The way that data are organized is really up to the person who maintains the data set, so you'll want to spend some time getting to know an API before you can get a good sense of what you can do with it.
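
A minimal sketch of that two-step pattern is shown below. Everything here is a stand-in: `collect_details` is a hypothetical helper, the `example.com` URLs and the values are invented, and `height` and `weight` are used only as example detail fields (check the API documentation for what each endpoint actually returns). The `fetch` argument is any function that takes a URL and returns a dictionary, so the loop can be tried offline first.

```python
import time

def collect_details(items, fetch, pause=1.0):
    """Visit each item's URL and combine the detail record with the summary."""
    rows = []
    for item in items:
        detail = fetch(item['url'])  # one extra request per item
        rows.append({'name': item['name'],
                     'height': detail.get('height'),
                     'weight': detail.get('weight')})
        time.sleep(pause)  # short rest between requests
    return rows

# stand-ins for real responses, so the sketch runs without network access
summary = [{'name': 'a', 'url': 'https://example.com/1'},
           {'name': 'b', 'url': 'https://example.com/2'}]
fake_details = {'https://example.com/1': {'height': 7, 'weight': 69},
                'https://example.com/2': {'height': 10}}

rows = collect_details(summary, fake_details.get, pause=0)
print(rows)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame`. Missing fields come back as `None` rather than raising an error because the code uses `.get`.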

## Scraping Multiple Pages

Large scale web scraping can sometimes require us to do a different kind of "pagination" in order to visit multiple links on a page and extract some text or data from each one. For instance: [this URL](https://lite.cnn.com/) has a list of top stories from CNN.com and hyperlinks to (a minimal HTML version of) each article.

In [None]:
from bs4 import BeautifulSoup
from urllib.parse import urljoin

site = get('https://lite.cnn.com/')
content = BeautifulSoup(site.content, 'html.parser')



I can extract the list of links and headlines and place it in a data frame like this:

In [None]:
headlines = content.select('ul a')

fmted_result ={'links' : [urljoin(site.url, i.get('href')) for i in headlines],
               'headline_text' : [i.get_text().strip() for i in headlines]}

article_df = pd.DataFrame(fmted_result)
article_df.head()

Note: the `urljoin` function here just turns a relative url into an absolute url (see [here](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_URL#absolute_urls_vs._relative_urls) for some discussion)
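
A quick illustration of what `urljoin` does, using a made-up article path (the path and the `example.com` link are hypothetical):

```python
from urllib.parse import urljoin

base = 'https://lite.cnn.com/'

# a relative link is resolved against the page it was found on
relative_result = urljoin(base, '/2024/01/01/example-story')
# a link that is already absolute is returned unchanged
absolute_result = urljoin(base, 'https://example.com/other')

print(relative_result)
print(absolute_result)
```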

On its own, this list of links isn't very useful. More often I'll want to get the actual article text as well. To do that, I would need to write a loop that visits each page and extracts the full text.

For the sake of this example, I'll just take the first five headlines here. Once we've got code that works, we can easily re-run this loop on the entire list of headlines.

In [None]:
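# adding an empty column to the data frame
#article_df['article_text'] = np.nan

# taking the first 5 rows just for this example
# (.copy() gives us an independent data frame, so adding a column later
# doesn't trigger a SettingWithCopyWarning)
article_sample = article_df[:5].copy()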

In [None]:
article_sample.loc[1, 'links']

In [None]:
pages = []
for link in article_sample['links']:
    # navigate to link i
    resp = get(link)
    pages.append(resp)
    time.sleep(1)


<b style="color:red;">
<h3>Question 2A</h3>
Write a loop to get a response object from each page in the sample of articles. Make sure to put a `time.sleep` call in your loop to put a short rest between each request.
</b>

Hint: since you already know exactly how many URLs you need to visit, you can use a `for` loop instead of a `while` loop here.


In [None]:
# a loop to get responses from each link


<b style="color:red;">
<h3>Question 2B</h3>
Write a function that takes a single response and returns one long string of text from a CNN article. Call the function <code>cnn_text</code>
</b>

Note that you can concatenate a list of strings like this:

`' '.join(string_list)`

In [None]:
# try working with a single result to start, then wrap your code in a function that takes "x" as an argument:
# x = pages[0]


In [None]:
def cnn_text(resp):
    content = BeautifulSoup(resp.content, 'html.parser')
    text = '\n'.join([i.get_text() for i in content.select('.paragraph--lite')])
    
    return text
    

Now that we have a working function, we can apply it to the entire list and then use the `insert` function to add it to our existing data frame:

In [None]:
text = [cnn_text(i) for i in pages]
article_sample.insert(0, "article_text", text, False)

In [None]:
article_sample.head()


When we're done, it might be a good idea to save a copy of `article_sample` so we can work with it later. This is especially true when we're running code that sends a lot of requests.

We can use the `.to_csv` method to store a result as a csv file. After running this, you should see a file called `cnn_articles.csv` in your working directory.

In [None]:
# usually set index=False to avoid writing the index names as a new column
article_sample.to_csv('cnn_articles.csv', index=False)

If you want to restore this data, then you would just run

In [None]:
arts = pd.read_csv("cnn_articles.csv")
arts.head()

# Managing larger projects
<a id='managing'></a>

Up to now, we've mostly run all of our analyses in a single notebook file. This is fine for quick analysis, but when we start to assemble larger projects, we'll often want to maintain multiple scripts or notebooks to help us organize and separate our code. Whether and how you choose to do this is a matter of personal judgement. However, whatever you choose to do, all the code needed to replicate your analysis should be saved somewhere along with comments or written instructions on how you did things. 






### Separating your code and saving results

If we're just sending a couple of API requests or scraping a single page, it often makes sense to do this "in memory": we scrape a page or send a request to an API, store the results in a Python variable, and run our analyses. The results of our analyses will go away whenever we exit Python, but we can just re-run the request next time we open Python. This is fine for small-scale analysis, but when we have code that sends a lot of requests, we probably don't want to re-run it over and over again. In those cases we probably want to write a separate script that collects our data and then saves the results to a file. Then we can just re-load that file next time we open Python.

For instance, rev.com has transcriptions of speeches from the 2024 presidential campaign. I want to scrape the full text of each speech. There are about 200 separate pages with transcripts, and I put a short pause of 1 second between each request, so this takes over 3 minutes to run. I don't want to do this every time I open Python, so I wrote a separate script called `speeches_scraper.py` that does the data collection and stores the results in a .csv. Then I can reload that data by running `pd.read_csv` at the top of my script:




In [None]:
import pandas as pd
speeches = pd.read_csv("extra_code/speeches.csv")
speeches.head()

Whether and when it makes sense to separate code this way is often a matter of personal judgement, but if your data collection takes more than a few seconds to run and you don't need the data to update, then this is probably your best option.

If you've written a Python script in a separate file, you can call it directly from within a Jupyter notebook using the `%run` magic command.

Here's an example of conditionally running our scraper script from within a Jupyter notebook. The script runs only if the file `extra_code/speeches.csv` doesn't exist; otherwise, it is skipped.

In [None]:
import os.path
# only run the scraper when the saved results are missing
if not os.path.isfile('extra_code/speeches.csv'):
    print("speeches file doesn't exist, creating")
    %run ./extra_code/pres_speeches.py
else:
    print("file already exists. Skipping")
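
The `%run` magic only works inside IPython or Jupyter. In a plain `.py` script, the same "only collect the data if the file is missing" logic can be wrapped in a function. The sketch below is one possible way to do that, not the approach used in the course files: `load_or_build`, `fake_scraper`, and the file name `speeches_demo.csv` are all hypothetical, and the "scraper" just returns a hard-coded row.

```python
import csv
import os
import tempfile

def load_or_build(path, build):
    """Return cached rows from `path`, calling `build()` only if the file is missing."""
    if not os.path.isfile(path):
        rows = build()  # the slow step, e.g. a scraper
        with open(path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    # whether or not we just built it, read the file back from disk
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

calls = []
def fake_scraper():
    calls.append(1)  # record each time the "slow" step runs
    return [{'speaker': 'A', 'text': 'hello'}]

cache_path = os.path.join(tempfile.mkdtemp(), 'speeches_demo.csv')
first = load_or_build(cache_path, fake_scraper)   # builds the file
second = load_or_build(cache_path, fake_scraper)  # re-uses the file
print(len(calls))
```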


### Creating a functions file

Another scenario where we might want to keep separate files is when we have a lot of function definitions that get re-used throughout our analysis and we want to avoid cluttering a written report.

For instance, I've written a couple of functions for working with the congress.gov API. The `member_parser` function takes a response from the members endpoint and re-formats it as a data frame. The `congress_paginate` function takes an initial request and then automatically paginates until it gets a complete list of results from the API. Instead of including those functions in this document, I've placed them in a separate file, and I make them available in this notebook by calling `from extra_code.congress_api_functions import congress_paginate, member_parser`.

Note: this code assumes you've got a Congress.gov API key. You can sign up for one here: https://gpo.congress.gov/sign-up/


In [None]:
import pandas as pd
# import my custom functions: 
from extra_code.congress_api_functions import congress_paginate, member_parser

# import the congress API key
with open('extra_code/congress_gov.txt', 'r') as f:
    # strip() drops the trailing newline that readline() keeps
    congress_key = f.readline().strip()

# the members endpoint
member_url = 'https://api.congress.gov/v3/member'
# my additional parameters: 
congress_parameters = {'currentMember': 'true',
                       'page':'1',
                       'limit': 250,
                       'api_key':congress_key}
# running the pagination function:
responses_list = congress_paginate(member_url, params= congress_parameters)

In [None]:
# iterate over the list of responses, parse each one, and then create a single concatenated data frame: 
member_frame = pd.concat([member_parser(i) for i in responses_list])
# look at the first few results: 
member_frame.head()

As with the previous example, the decision to organize code this way is largely a matter of personal judgement. The main disadvantage of this approach is that it can make it harder for people to understand what your code is doing. On the other hand, it can make our notebook file more concise. 

Another advantage is that, if I'm writing multiple analyses that all require this set of functions, I can maintain a single functions file and then import it into each of my analyses. This helps me avoid writing redundant code, and it also makes it a lot easier to make modifications or correct errors in my functions, since I only need to edit a single file instead of one copy per analysis.


[![](https://mermaid.ink/img/pako:eNp9zrEKwjAQBuBXCTcptIO6ZRDUrk51Mw5HcmkDTVLSC1JK390IOupN9x_fD7eAjoZAgh3iU_eYWNwaFUSZZnOdhc1Bs4th2oq6PorTvdXJjSx2j9_o_EX7P-jyRYeCoAJPyaMz5ZPlXVLAPXlSIMtqyGIeWIEKa6GYObZz0CA5Zaogxdz1IC0OU0l5NMjUOOwS-s91fQHh7kqx?type=png)](https://mermaid.live/edit#pako:eNp9zrEKwjAQBuBXCTcptIO6ZRDUrk51Mw5HcmkDTVLSC1JK390IOupN9x_fD7eAjoZAgh3iU_eYWNwaFUSZZnOdhc1Bs4th2oq6PorTvdXJjSx2j9_o_EX7P-jyRYeCoAJPyaMz5ZPlXVLAPXlSIMtqyGIeWIEKa6GYObZz0CA5Zaogxdz1IC0OU0l5NMjUOOwS-s91fQHh7kqx)

## Installing Additional Packages

We've only used packages that were already installed in our BSOS JupyterHub environment so far. However, you might want to use packages that aren't already installed (or you might want to run JupyterLab on your own computer and need to install them yourself). To do this within Jupyter, you can use the `!` notation to run the pip installer. For example:


In [None]:
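!pip install beautifulsoup4

The command above installs `beautifulsoup4`, the package that provides the `bs4` module we used earlier. Some useful tools need no installation at all: `sqlite3`, a Python interface for SQLite (a lightweight SQL database engine), ships with Python's standard library, so it cannot (and need not) be installed with `pip`. While SQL is somewhat outside the scope of this class, it's a useful tool to have in our tool kit if we want to create and interact with very large databases, because it allows us to work with datasets that are too large to hold in memory.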

If you're interested in trying out the sqlite3 package, there's a script in the extra code directory called `scraper_db.py` that gives a toy example of a script that creates/updates an SQL database with the transcripts from rev.com.

Unlike the `speeches_scraper.py` code, this code checks scraped links against the links that are already in the database, and only scrapes them if they're new. Code like this can be used to efficiently update a database at regular intervals (although in a really large dataset, you would probably want to handle this with an "upsert" operation).
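
The "check before scraping" idea can be sketched in a few lines. This is a toy illustration and not the contents of `scraper_db.py`: the table name `speeches`, its columns, the `example.com` links, and the placeholder transcript are all invented, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # a throwaway in-memory database for the demo
con.execute('CREATE TABLE speeches (link TEXT PRIMARY KEY, transcript TEXT)')
con.execute("INSERT INTO speeches VALUES ('https://example.com/a', 'old text')")

# pretend these links were just collected from an index page
scraped_links = ['https://example.com/a', 'https://example.com/b']

# look up which links are already stored, and keep only the new ones
known = {row[0] for row in con.execute('SELECT link FROM speeches')}
new_links = [link for link in scraped_links if link not in known]

for link in new_links:
    transcript = 'placeholder text'  # a real script would scrape the page here
    con.execute('INSERT INTO speeches VALUES (?, ?)', (link, transcript))
con.commit()

n_rows = con.execute('SELECT COUNT(*) FROM speeches').fetchone()[0]
print(new_links, n_rows)
```

Only the link that isn't already in the table triggers a (simulated) scrape, which is what keeps repeated updates cheap.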



If you're interested in seeing what a really well-designed project of this sort might look like, you can check out the open-source `count-love` crawler on GitHub: https://github.com/count-love/crawler