# Automated data retrieval

Today we will cover data retrieval, that is, how to obtain data from a website.

The game plan is to first go over the two examples from class by obtaining data from FRED and the World Bank.

Next, I will demo data retrieval through a few other means mentioned in class:

First, using pd.read_html, which is how we retrieved data for the last two labs.

Next, using requests to obtain data repeatedly from the same website.

Finally, using an API, in this case Google's API.


In [1]:
import pandas as pd
import datetime
#Used to request data from certain websites using pandas
import pandas_datareader.data as web
#Used to request data from the world bank using pandas
from pandas_datareader import wb
#Library used to obtain data from websites
import requests
#Used to delay requests
import time
#json is part of the Python standard library, so nothing needs to be installed.
#It is used to turn strings obtained through APIs, which look like
#dictionaries, into actual Python dictionaries
import json
#pretty print just makes it easier to see complicated dictionaries
from pprint import pprint
#Setting path: every page we scrape will be saved into this folder.
#I don't think it is necessary to use os for this.
path = r'C:\Users\John\Documents\GitHub\John-Notebooks\TA Session 8'

## Accessing a file that is directly part of the URL

The pandas data reader takes the name of the series, the data_source, the start and end dates, the number of retries, how long to pause between queries, and an API key if necessary. In the help printout below, you can see examples of other supported data sources.

In [2]:
help(web.DataReader)

Help on function DataReader in module pandas_datareader.data:

DataReader(name, data_source=None, start=None, end=None, retry_count=3, pause=0.1, session=None, api_key=None)
    Imports data from a number of online sources.
    
    Currently supports Google Finance, St. Louis FED (FRED),
    and Kenneth French's data library, among others.
    
    Parameters
    ----------
    name : str or list of strs
        the name of the dataset. Some data sources (IEX, fred) will
        accept a list of names.
    data_source: {str, None}
        the data source ("iex", "fred", "ff")
    start : string, int, date, datetime, Timestamp
        left boundary for range (defaults to 1/1/2010)
    end : string, int, date, datetime, Timestamp
        right boundary for range (defaults to today)
    retry_count : {int, 3}
        Number of times to retry query request.
    pause : {numeric, 0.001}
        Time, in seconds, to pause between consecutive queries of chunks. If
        single value given 

The following code is very similar to the code from class. It obtains data from FRED on the unemployment rate in Washington state.

In [3]:
start = datetime.date(year=1990, month=1,  day=1)
end   = datetime.date(year=2019, month=12, day=31)
series = 'WAUR'
source = 'fred'

In [4]:
df = web.DataReader(series, source, start, end)
df.head()


DATE,WAUR
1990-01-01,5.4
1990-02-01,5.3
1990-03-01,5.2
1990-04-01,5.1
1990-05-01,5.1


In [5]:
sum(df['WAUR'])/len(df)

6.029722222222223

In [6]:
#Plain strings are enough here because nothing is interpolated into them
print('Highest unemployment rate:', df['WAUR'].max())
print('Lowest unemployment rate:', df['WAUR'].min())
print('Average unemployment rate:', sum(df['WAUR'])/len(df))

Highest unemployment rate: 9.3
Lowest unemployment rate: 3.9
Average unemployment rate: 6.029722222222223
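
As a side note, pandas can compute these summary statistics directly, without sum() and len(). The sketch below is only an illustration: it rebuilds the first five WAUR rows from the df.head() output above by hand (so the numbers differ from the full 1990-2019 results), and the name `sample` is made up for this example.

```python
import pandas as pd

# First five WAUR values copied from the df.head() output above.
# This is a hand-built stand-in, not the full series from FRED.
sample = pd.DataFrame(
    {"WAUR": [5.4, 5.3, 5.2, 5.1, 5.1]},
    index=pd.to_datetime(
        ["1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01", "1990-05-01"]
    ),
)
sample.index.name = "DATE"

# agg() with a list of names returns one value per statistic.
summary = sample["WAUR"].agg(["max", "min", "mean"])
print(summary)

# idxmax() returns the index label (here, the date) of the highest value.
peak_date = sample["WAUR"].idxmax()
print("Peak month in the sample:", peak_date.date())
```

With the real df, the same calls would be df['WAUR'].agg(['max', 'min', 'mean']) and df['WAUR'].idxmax().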


If you want the data from multiple states, it's also simple. The series is passed as the first argument, 'name'. As you can see in the documentation, 'name' can be a list.

In [7]:
series = ['WAUR','ILUR', 'WIUR', 'MIUR']
df = web.DataReader(series, source, start, end)
df.head()

DATE,WAUR,ILUR,WIUR,MIUR
1990-01-01,5.4,6.3,4.2,7.7
1990-02-01,5.3,6.2,4.1,7.6
1990-03-01,5.2,6.1,4.1,7.5
1990-04-01,5.1,6.1,4.1,7.4
1990-05-01,5.1,6.1,4.1,7.4
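
Once several series sit side by side, comparing them is a one-liner per question. The following is a small sketch on the five rows shown above, rebuilt by hand; the names `sample` and `long_df` are assumptions for this example and are not part of the lab code.

```python
import pandas as pd

# The five rows from the df.head() output above, typed in by hand.
sample = pd.DataFrame(
    {
        "WAUR": [5.4, 5.3, 5.2, 5.1, 5.1],
        "ILUR": [6.3, 6.2, 6.1, 6.1, 6.1],
        "WIUR": [4.2, 4.1, 4.1, 4.1, 4.1],
        "MIUR": [7.7, 7.6, 7.5, 7.4, 7.4],
    },
    index=pd.to_datetime(
        ["1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01", "1990-05-01"]
    ),
)
sample.index.name = "DATE"

# mean() works column by column, giving one average per state series.
means = sample.mean()
print(means)
print("Highest average in the sample:", means.idxmax())

# melt() reshapes the wide table into one row per (date, series) pair,
# which is often handier for plotting or grouping.
long_df = sample.reset_index().melt(
    id_vars="DATE", var_name="series", value_name="rate"
)
print(long_df.head())
```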


Next let's go over an example from the World Bank.

wb.download takes kwargs for the country, indicator, and the start and end dates. As you can see below, the default start date is 2003, and the default end date is 2005. The indicator is taken from the id field in WDIsearch(); a list of indicators can be found here: https://data.worldbank.org/indicator.

For country, use the ISO codes listed on Wikipedia here: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes. You can use either the 2-character or 3-character codes interchangeably.

In [9]:
help(wb.download)

Help on function download in module pandas_datareader.wb:

download(country=None, indicator=None, start=2003, end=2005, freq=None, errors='warn', **kwargs)
    Download data series from the World Bank's World Development Indicators
    
    Parameters
    ----------
    indicator: string or list of strings
        taken from the ``id`` field in ``WDIsearch()``
    country: string or list of strings.
        ``all`` downloads data for all countries
        2 or 3 character ISO country codes select individual
        countries (e.g.``US``,``CA``) or (e.g.``USA``,``CAN``).  The codes
        can be mixed.
    
        The two ISO lists of countries, provided by wikipedia, are hardcoded
        into pandas as of 11/10/2014.
    start: int
        First year of the data series
    end: int
        Last year of the data series (inclusive)
    freq: str
        frequency or periodicity of the data to be retrieved (e.g. 'M' for
        monthly, 'Q' for quarterly, and 'A' for annual). None defa

The following retrieves data on CO2 emissions for all countries.

In [10]:
indicator = 'EN.ATM.CO2E.KT'
country = 'all'

In [11]:
df = wb.download(indicator=indicator,                  
                 country = country,                  
                 start=2000, 
                 end=2010)
df.head()

country,year,EN.ATM.CO2E.KT
Arab World,2010,1643369.717
Arab World,2009,1581643.106
Arab World,2008,1491783.271
Arab World,2007,1360284.651
Arab World,2006,1382459.0


In [12]:
df[df['EN.ATM.CO2E.KT']==max(df['EN.ATM.CO2E.KT'])]

country,year,EN.ATM.CO2E.KT
World,2010,31927780.0
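
Note that the largest value belongs to 'World', which is an aggregate rather than a country, because country='all' also returns regional and global totals. One possible way to drop such rows is sketched below. It uses a handful of values copied from the outputs in this notebook, and the `aggregates` set is a hand-written assumption, not an official list. Keeping the years as strings is also an assumption about the index.

```python
import pandas as pd

col = "EN.ATM.CO2E.KT"

# A few (country, year) values copied from outputs in this notebook.
rows = {
    ("World", "2010"): 31927780.0,
    ("Arab World", "2010"): 1643369.717,
    ("China", "2016"): 9893038.0,
    ("United States", "2016"): 5006302.0,
}
index = pd.MultiIndex.from_tuples(list(rows), names=["country", "year"])
sample = pd.DataFrame({col: list(rows.values())}, index=index)

# Hypothetical, hand-written set of aggregate names to exclude.
aggregates = {"World", "Arab World"}

# Build a boolean mask from the 'country' level of the index.
is_aggregate = sample.index.get_level_values("country").isin(aggregates)
countries_only = sample[~is_aggregate]

# idxmax() on a MultiIndex returns the (country, year) tuple of the maximum.
top = countries_only[col].idxmax()
print(top)
```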


Again, if we want to look at several specific countries, we can pass a list of country codes to the country kwarg.

In [13]:
country = ['US', 'CN', 'RU']
df = wb.download(indicator=indicator,                  
                 country=country,                  
                 start=2015, end=2016)
df=df.reset_index()
df


Unnamed: 0,country,year,EN.ATM.CO2E.KT
0,China,2016,9893038.0
1,China,2015,10145000.0
2,Russian Federation,2016,1732027.0
3,Russian Federation,2015,1698213.0
4,United States,2016,5006302.0
5,United States,2015,5126913.0
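
After reset_index, the data is in "long" form, one row per country and year. If you would rather compare countries side by side, a pivot is one option. The sketch below retypes the six rows above by hand; `sample`, `wide`, and `change` are names chosen for this example, and treating the years as strings is an assumption.

```python
import pandas as pd

col = "EN.ATM.CO2E.KT"

# The six rows from the output above, typed in by hand.
sample = pd.DataFrame(
    {
        "country": [
            "China", "China",
            "Russian Federation", "Russian Federation",
            "United States", "United States",
        ],
        "year": ["2016", "2015", "2016", "2015", "2016", "2015"],
        col: [9893038.0, 10145000.0, 1732027.0, 1698213.0, 5006302.0, 5126913.0],
    }
)

# One row per year, one column per country.
wide = sample.pivot(index="year", columns="country", values=col)
print(wide)

# Year-over-year change per country.
change = wide.loc["2016"] - wide.loc["2015"]
print(change)
```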


## Parsing data out of the html of a website

First up, our old friend pd.read_html! If you recall, this is what we used to collect the data for the past two lab sessions.

Below you can see all of the different parameters that pd.read_html takes. It relies on an HTML parser (lxml by default, falling back to BeautifulSoup, which you will learn about next week) to determine what is a table based on the HTML code of the source website, and turns those tables into dataframes. It works really well on Wikipedia.

In [14]:
help(pd.read_html)

Help on function read_html in module pandas.io.html:

read_html(io: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], match: Union[str, Pattern] = '.+', flavor: Union[str, NoneType] = None, header: Union[int, Sequence[int], NoneType] = None, index_col: Union[int, Sequence[int], NoneType] = None, skiprows: Union[int, Sequence[int], slice, NoneType] = None, attrs: Union[Dict[str, str], NoneType] = None, parse_dates: bool = False, thousands: Union[str, NoneType] = ',', encoding: Union[str, NoneType] = None, decimal: str = '.', converters: Union[Dict, NoneType] = None, na_values=None, keep_default_na: bool = True, displayed_only: bool = True) -> List[pandas.core.frame.DataFrame]
    Read HTML tables into a ``list`` of ``DataFrame`` objects.
    
    Parameters
    ----------
    io : str, path object or file-like object
        A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only

Here pd.read_html returns a list of length 6. Each element is a pandas dataframe, but as you can see below, not everything in the list really belongs in a dataframe: some of the "tables" are just navigation boxes and notices from the page.

In [15]:
df=pd.read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_income', header=0)
print(f'The df length is: {len(df)}')
type(df)

The df length is: 6


list

In [16]:
df[0]

Unnamed: 0,This article is part of a series on
0,Income in theUnited States of America
1,Topics Household Personal Affluence Social cla...
2,Lists by income States (by equality (Gini)) Co...
3,United States portal
4,.mw-parser-output .navbar{display:inline;font-...


In [17]:
df[1]

Unnamed: 0,Rank,State or territory,2018,2017,2016,2015,2014[note 2]
0,1,"Washington, D.C.","$85,203","$82,372","$75,506","$75,628","$71,648"
1,2,Maryland,"$83,242","$80,776","$78,945","$75,847","$73,971"
2,3,New Jersey,"$81,740","$80,088","$76,126","$72,222","$71,919"
3,4,Hawaii,"$80,212","$77,765","$74,511","$73,486","$69,592"
4,5,Massachusetts,"$79,835","$77,385","$75,297","$70,628","$69,160"
5,6,Connecticut,"$76,348","$74,168","$73,433","$71,346","$70,048"
6,7,California,"$75,277","$71,805","$67,739","$64,500","$61,933"
7,8,New Hampshire,"$74,991","$73,381","$70,936","$70,303","$66,532"
8,9,Alaska,"$74,346","$73,181","$76,440","$73,355","$71,583"
9,10,Washington,"$74,073","$70,979","$67,106","$64,129","$61,366"
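
Notice that the income columns come back as strings such as "$85,203", so you cannot do arithmetic on them yet. Here is a minimal sketch of one way to clean them, using three rows copied from the table above. The names `sample`, `cleaned`, and `growth` are made up for this example.

```python
import pandas as pd

# Three rows and two year columns copied from df[1] above.
sample = pd.DataFrame(
    {
        "State or territory": ["Washington, D.C.", "Maryland", "New Jersey"],
        "2018": ["$85,203", "$83,242", "$81,740"],
        "2017": ["$82,372", "$80,776", "$80,088"],
    }
)

year_cols = ["2018", "2017"]
cleaned = sample.copy()
for c in year_cols:
    # Strip the dollar sign and thousands separators, then convert to int.
    cleaned[c] = cleaned[c].str.replace(r"[$,]", "", regex=True).astype(int)

# Now the columns are numeric and can be subtracted.
growth = cleaned["2018"] - cleaned["2017"]
print(cleaned)
print(growth)
```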


In [18]:
df[2]

Unnamed: 0.1,Unnamed: 0,This section needs to be updated. Please update this article to reflect recent events or newly available information. (July 2019)


In [19]:
df[3]

Unnamed: 0,Rank,State or territory,Per capitaincome,Medianhouseholdincome,Medianfamilyincome,Population,Number ofhouseholds,Number offamilies
0,1.0,District of Columbia,"$45,877","$71,648","$84,094",658893,277378,117864
1,2.0,Connecticut,"$39,373","$70,048","$88,819",3596677,1355817,887263
2,3.0,New Jersey,"$37,288","$69,160","$87,951",8938175,2549336,1610581
3,4.0,Massachusetts,"$36,593","$71,919","$88,419",6938608,3194844,2203675
4,5.0,Maryland,"$36,338","$73,971","$89,678",5976407,2165438,1445972
5,6.0,New Hampshire,"$34,691","$66,532","$80,581",1326813,519756,345901
6,7.0,Virginia,"$34,052","$64,902","$78,290",8326289,3083820,2058820
7,8.0,New York,"$33,095","$58,878","$71,115",19746227,7282398,4621954
8,9.0,North Dakota,"$33,071","$59,029","$75,221",739482,305431,187800
9,10.0,Alaska,"$33,062","$71,583","$82,307",736732,249659,165015


In [20]:
df[4]

Unnamed: 0,vteUnited States locations by per capita income,vteUnited States locations by per capita income.1
0,Nationwide,U.S. states and territories by per capita inco...
1,State locations,Alabama Alaska Arizona Arkansas California Col...
2,Federal district,District of Columbia
3,Territory locations,American Samoa Guam Northern Mariana Islands P...
4,Related lists,Highest-income counties in the United States H...


In [21]:
df[5]

Unnamed: 0,vteUnited States state-related lists,vteUnited States state-related lists.1
0,List of states and territories of the United S...,List of states and territories of the United S...
1,Demographics,Population African American Amish Asian Birth ...
2,Economy,Billionaires Budgets Companies Credit ratings ...
3,Environment,Botanical gardens Carbon dioxide emissions Par...
4,Geography,Area Bays Beaches Coastline Elevation Extreme ...
5,Government,Agriculture commissioners Attorneys general Ca...
6,Health,Fertility rates Hospitals Human Development In...
7,History,Date of statehood Name etymologies Historical ...
8,Law,Abortion Age of consent Alcohol Dry communitie...
9,Miscellaneous,Abbreviations Airports Bus transit systems Cas...
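
Rather than eyeballing each element of the list, you could keep only the tables that have the columns you expect. The sketch below fakes a short list of dataframes shaped like the ones above, so it runs without touching Wikipedia. The `tables` and `wanted` names are assumptions for this example.

```python
import pandas as pd

# Hand-built stand-ins for three of the six tables returned above.
tables = [
    pd.DataFrame({"This article is part of a series on": ["United States portal"]}),
    pd.DataFrame(
        {
            "Rank": [1, 2],
            "State or territory": ["Washington, D.C.", "Maryland"],
            "2018": ["$85,203", "$83,242"],
        }
    ),
    pd.DataFrame(
        {
            "Rank": [1.0, 2.0],
            "State or territory": ["District of Columbia", "Connecticut"],
            "Per capitaincome": ["$45,877", "$39,373"],
        }
    ),
]

# Keep only the tables that contain both expected column names.
required = {"Rank", "State or territory"}
wanted = [t for t in tables if required.issubset(t.columns)]
print(len(wanted))
```

The match parameter shown in the read_html signature above offers a related filter at read time: it only returns tables containing text that matches the given string or regex.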


Now let's use requests to collect the raw HTML code of a website!

Let's scrape the search page of Aljazeera. This won't produce useful data on its own; you would need to use BeautifulSoup to parse the output into something useful.

First let's check the robots.txt of aljazeera.com to make sure we abide by their rules, that way we won't get banned (also we won't be rude).

If you look at this website: https://www.aljazeera.com/robots.txt

You will see the following output:



In [22]:
'''
User-agent: *
Sitemap: https://www.aljazeera.com/sitemap.xml
Disallow: /api
Disallow: /asset-manifest.json
'''

'\nUser-agent: *\nSitemap: https://www.aljazeera.com/sitemap.xml\nDisallow: /api\nDisallow: /asset-manifest.json\n'

This means that they don't want you crawling any part of their website whose path starts with /api or /asset-manifest.json.
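
You can also let Python read the rules for you. The standard library's urllib.robotparser can parse robots.txt text and answer whether a given URL may be fetched. The sketch below feeds it the rules copied above instead of downloading them, and the /api/anything URL is a made-up example path.

```python
from urllib.robotparser import RobotFileParser

# The rules copied from the robots.txt shown above.
robots_txt = """\
User-agent: *
Sitemap: https://www.aljazeera.com/sitemap.xml
Disallow: /api
Disallow: /asset-manifest.json
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(robots_txt.splitlines())

# The search page we plan to scrape.
search_ok = parser.can_fetch("*", "https://www.aljazeera.com/search/gaza")
# A hypothetical URL under the disallowed /api prefix.
api_ok = parser.can_fetch("*", "https://www.aljazeera.com/api/anything")

print("Search page allowed:", search_ok)
print("API path allowed:", api_ok)
```

Against the live site you would instead call set_url with the robots.txt address and then read(), which downloads the current rules.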

So, to start, we set our url. Here we are going to collect the first page of search results for 'Gaza' from Aljazeera. requests.get(url) collects the HTML code from the web page, and adding .text to the end turns the output into a string.

Below you can see the resulting output. If we search it (Ctrl+F) for 'enclave', we can see that the information on the page that we care about is in this string.

In [23]:
url = 'https://www.aljazeera.com/search/gaza'
response = requests.get(url)
data = response.text
data



In [24]:
type(data)

str

That's all well and good if we only want to scrape one page of one search result, but what if we want to scrape many pages?

Let's write a function with some of the good practices Dr. Levy talked about in class!

As you can see, the function get_subject is generalized, so that it can scrape any search term or page number for that search term. 

time.sleep(2) pauses the function for two seconds before each request is sent. This means that if we run this function hundreds or thousands of times, we are much less likely to get banned from Aljazeera. In reality we could probably reduce this considerably (the pandas data reader only pauses for a fraction of a second by default).

The 'if pg_number > 1:' conditional is there because the link is slightly different for the first page number and subsequent ones.

The 'with open' line will save the HTML code as a txt file in the path location. The 'w' means that we are writing a file. Use 'r' for reading in a file, and 'rb' for reading in a binary file (like a PDF). encoding='utf-8' tells Python to write the text as UTF-8; without specifying this, we may get a UnicodeEncodeError or very strange symbols in our text file.

**Careful because this code will overwrite any file in the same location with the same name.**
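
If overwriting worries you, Python's open also has an 'x' mode, which creates a new file and raises FileExistsError when the file is already there. The sketch below demonstrates this in a temporary folder so it does not touch your real path; the file contents are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Gaza_1.txt"

    # 'w' creates the file (or silently replaces it if it exists).
    with open(target, "w", encoding="utf-8") as f:
        f.write("first download")

    # 'x' refuses to touch a file that already exists.
    try:
        with open(target, "x", encoding="utf-8") as f:
            f.write("second download")
        overwritten = True
    except FileExistsError:
        overwritten = False

    kept = target.read_text(encoding="utf-8")

print("Overwritten:", overwritten)
print("File still contains:", kept)
```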

In [25]:
def get_subject(subject, pg_number=1):
    url = 'https://www.aljazeera.com/search/'
    url = url + subject
    time.sleep(2)
    if pg_number > 1:
        url = url + f'?page={pg_number}'
    response = requests.get(url)
    data = response.text
    #Cite: https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters
    with open(f'{path}/{subject}_{pg_number}.txt', 'w', encoding="utf-8") as f:
        f.write(data)
    return data
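
The URL logic inside get_subject can also be pulled out and checked on its own, without sending any requests. Here is a rough sketch. The helper name build_search_url is hypothetical, and percent-encoding the subject with quote is an added assumption (the original function concatenates the subject as-is), meant to cope with search terms that contain spaces.

```python
from urllib.parse import quote

BASE_URL = "https://www.aljazeera.com/search/"


def build_search_url(subject, pg_number=1):
    """Return the search URL for a subject and page number (hypothetical helper)."""
    # quote() percent-encodes characters such as spaces.
    url = BASE_URL + quote(subject)
    # Only pages after the first carry the ?page= suffix.
    if pg_number > 1:
        url += f"?page={pg_number}"
    return url


print(build_search_url("Gaza"))
print(build_search_url("Gaza", 2))
print(build_search_url("West Bank", 3))
```

Separating the URL building from the download makes it easy to confirm the links look right before you start scraping.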


Let's try out the new function! As you can see, the code produces the same output as the code we used above, with two key differences: it takes longer to run (because of the time delay), and the string is saved as a text file in your path.

In [26]:
data = get_subject('Gaza')
data



The get_subject function is a huge step up from the loose code, but it isn't complete until we pair it with the function get_page. get_page wraps get_subject to ensure that we don't scrape Aljazeera again every time we run our code. If you specify update=True, it will run get_subject and overwrite any text file in the path location.

If not, the code will first try to open the text file with the matching name. You can see that the 'with open' now has an 'r' and .read instead of a 'w' and .write.

If the code raises a FileNotFoundError, then it will call get_subject to scrape the subject and page, because the file isn't in our repository.

In [27]:
   
def get_page(subject, pg_number = 1, update = False):
    if update:
        data = get_subject(subject, pg_number)
    else:
        try: 
            #cite https://stackoverflow.com/questions/8369219/how-to-read-a-text-file-into-a-string-variable-and-strip-newlines
            with open(f'{path}/{subject}_{pg_number}.txt', 'r', encoding="utf-8") as file:
                data = file.read()
        except FileNotFoundError:
            data = get_subject(subject, pg_number)
    return data
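
To see the caching behaviour without hitting Aljazeera at all, here is a sketch of the same idea with a fake download function. Everything in it (fake_fetch, cached_page, the calls list) is hypothetical and stands in for get_subject and get_page. It checks whether the file exists instead of catching FileNotFoundError, which is a slightly different but equivalent route.

```python
import tempfile
from pathlib import Path

calls = []  # records every simulated download


def fake_fetch(subject, pg_number):
    """Stand-in for get_subject: pretends to download a page."""
    calls.append((subject, pg_number))
    return f"<html>{subject} page {pg_number}</html>"


def cached_page(cache_dir, subject, pg_number=1, update=False):
    """Return the saved page if present, otherwise fetch it and save it."""
    cache_file = Path(cache_dir) / f"{subject}_{pg_number}.txt"
    if cache_file.exists() and not update:
        return cache_file.read_text(encoding="utf-8")
    data = fake_fetch(subject, pg_number)
    cache_file.write_text(data, encoding="utf-8")
    return data


with tempfile.TemporaryDirectory() as tmp:
    first = cached_page(tmp, "Gaza")                # no file yet: fetches
    second = cached_page(tmp, "Gaza")               # file exists: read from disk
    third = cached_page(tmp, "Gaza", update=True)   # forced refresh: fetches

print("Number of simulated downloads:", len(calls))
```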

Because we ran get_subject('Gaza') earlier, Gaza_1.txt is already saved in our repository, so running get_page('Gaza') will simply open that txt file. This is why the code runs instantly without the 2 second delay. Running get_page('Gaza', 2) will trigger the 2 second delay the first time it is run, because Gaza_2.txt is not in our repository.

In [28]:
gaza_1 = get_page('Gaza')

In [29]:
gaza_2 = get_page('Gaza', 2)

Let's make a for loop to show how you could apply this code to many subjects and page numbers! You can see that the first time we run this code, it takes a while, because every page that is not saved yet triggers the 2 second delay.

In [30]:
subject_lst = ['USA', 'Iran', 'Iraq']
num_lst = [1,2,3,4,5]
country_files = []
for country in subject_lst:
    for num in num_lst:
        page = get_page(country, num)
        country_files.append(page)
print(len(country_files))

15
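
One drawback of appending everything to a single list is that you lose track of which string belongs to which subject and page. A possible alternative, sketched below, stores the pages in a dictionary keyed by (subject, page number). The stub_get_page function is a made-up stand-in for get_page so the example runs offline.

```python
from itertools import product


def stub_get_page(subject, pg_number):
    """Stand-in for get_page that returns a placeholder string."""
    return f"{subject} page {pg_number}"


subject_lst = ["USA", "Iran", "Iraq"]
num_lst = [1, 2, 3, 4, 5]

# product() yields every (subject, page) combination, like the nested loops.
pages = {
    (subject, num): stub_get_page(subject, num)
    for subject, num in product(subject_lst, num_lst)
}

print(len(pages))
print(pages[("Iran", 3)])
```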


## Using a data API

We won't spend as much time here because you won't use APIs in this class or the next one (unless Dr. Levy changes the rubric from what it was last year).

You won't be able to run the code below unless you have an API key for Google. If you do, replace 'Your API key' with your key. 

We can use APIs to exchange data between programs. Let's use the Google API to obtain information about the Keller Center from Google Maps!

In [31]:
params = {'key': 'Your API key',
          'address': 'Keller Center, Chicago'}
          
url= 'https://maps.googleapis.com/maps/api/geocode/json?'
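
For intuition, the params dictionary ends up as a query string appended to the url. The sketch below builds that string with the standard library's urlencode, which is roughly what happens when the dictionary is passed to requests. The placeholder key is kept as-is, so this URL would not return real results.

```python
from urllib.parse import urlencode

params = {"key": "Your API key",
          "address": "Keller Center, Chicago"}

url = "https://maps.googleapis.com/maps/api/geocode/json?"

# urlencode turns the dictionary into key=value pairs joined by '&',
# escaping spaces and commas along the way.
full_url = url + urlencode(params)
print(full_url)
```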

Like before, we use requests.get, only now we pass an API key along with the request to access the service.

There is a problem, however: while the output looks like a dictionary, it isn't. It's a string.

In [37]:
response = requests.get(url, params=params)
data = response.text
print(type(data))
print(data)

<class 'str'>
{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "1307",
               "short_name" : "1307",
               "types" : [ "street_number" ]
            },
            {
               "long_name" : "East 60th Street",
               "short_name" : "E 60th St",
               "types" : [ "route" ]
            },
            {
               "long_name" : "Woodlawn",
               "short_name" : "Woodlawn",
               "types" : [ "neighborhood", "political" ]
            },
            {
               "long_name" : "Chicago",
               "short_name" : "Chicago",
               "types" : [ "locality", "political" ]
            },
            {
               "long_name" : "Cook County",
               "short_name" : "Cook County",
               "types" : [ "administrative_area_level_2", "political" ]
            },
            {
               "long_name" : "Illinois",
               "short_name" : "IL",
     

While we could write code using regular expressions (the re library) to convert this string into a dictionary filled with lists and dictionaries, it's easier to use the json library to turn that dictionary-like string into a dictionary, like so:

In [33]:
data = json.loads(response.text)
pprint(data)


{'results': [{'address_components': [{'long_name': '1307',
                                      'short_name': '1307',
                                      'types': ['street_number']},
                                     {'long_name': 'East 60th Street',
                                      'short_name': 'E 60th St',
                                      'types': ['route']},
                                     {'long_name': 'Woodlawn',
                                      'short_name': 'Woodlawn',
                                      'types': ['neighborhood', 'political']},
                                     {'long_name': 'Chicago',
                                      'short_name': 'Chicago',
                                      'types': ['locality', 'political']},
                                     {'long_name': 'Cook County',
                                      'short_name': 'Cook County',
                                      'types': ['administrative_area_level_2',
 

In [34]:
type(data)

dict
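
If you do not have an API key, you can still try the string-to-dictionary step on a small piece of the response. The sketch below uses a shortened JSON string typed in by hand from the output above; the names `raw` and `parsed` are made up for this example.

```python
import json

# A shortened, hand-typed piece of the response shown above.
raw = (
    '{"results": [{"address_components": ['
    '{"long_name": "1307", "short_name": "1307", "types": ["street_number"]}'
    ']}]}'
)

print(type(raw))     # still a string at this point

parsed = json.loads(raw)
print(type(parsed))  # now a dictionary

# Once parsed, normal indexing works.
first_component = parsed["results"][0]["address_components"][0]
print(first_component["long_name"])
```

The response object from requests also has a .json() method that does the same conversion in one step.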

As you can see, we can now parse the data like any other Python object.

If, for example, we wanted to find out how Google Maps classifies the Keller Center, we could do that like so:

In [35]:
pprint(data['results'][0]['types'])

['establishment', 'point_of_interest', 'school', 'university']
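
The same kind of indexing works for the address components. Below is a sketch that rebuilds the street address from the components shown earlier. The `data` dictionary here is typed in by hand and contains only the components visible in the output above, and `by_type` is a name chosen for this example.

```python
# Hand-typed subset of the parsed response shown above.
data = {
    "results": [
        {
            "address_components": [
                {"long_name": "1307", "short_name": "1307",
                 "types": ["street_number"]},
                {"long_name": "East 60th Street", "short_name": "E 60th St",
                 "types": ["route"]},
                {"long_name": "Woodlawn", "short_name": "Woodlawn",
                 "types": ["neighborhood", "political"]},
                {"long_name": "Chicago", "short_name": "Chicago",
                 "types": ["locality", "political"]},
            ]
        }
    ]
}

components = data["results"][0]["address_components"]

# Map each component's first type to its long name for easy lookup.
by_type = {c["types"][0]: c["long_name"] for c in components}

street = f'{by_type["street_number"]} {by_type["route"]}'
print(street)
print(by_type["locality"])
```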
