# In-Class Activity: Getting Data from APIs

Today we will learn how to:
* Use requests to access an API
* Parse the data we get from an API (in JSON)
* Retrieve historical web data from the Internet Archive
* Play around with the TV Maze API

A very big thank you to Brian Keegan (my advisor!) for the materials in his [Web Data Scraping course](https://github.com/CU-ITSS/Web-Data-Scraping-S2023). Check that out if you want to dig deeper ;) Another thank you to Jason Zeitz and Anas Buhayh at the University of Colorado, Boulder, who developed some of the TV Maze API content.

But first... a warm-up!

### Refresher: Lists and Dictionaries in Python
Often we get data in the form of lists. Lists are an **ordered** data structure that can contain integers, strings, or other objects (like lists or dictionaries). Here's an example:

In [19]:
# Make classrooms as lists with student names as strings
classroom0 = ['Alice','Bob','Carol','Dave']
classroom1 = ['Eve','Frank','Grace','Harold']
classroom2 = ['Isabel','Jack','Katy','Lloyd']
classroom3 = ['Maria','Nate','Olivia','Philip']
classroom4 = ['Quinn','Rachel','Steve','Terry','Ursula']
classroom5 = ['Violet','Walter','Xavier','Yves','Zoe']

# Make schools that contain classrooms
school0 = [classroom0,classroom1]
school1 = [classroom2,classroom3]
school2 = [classroom4,classroom5]

# Make a school district that contains schools
school_district = [school0,school1,school2]

In [20]:
# Task 1: How would you access classroom0 **FROM** school_district, using index notation?
school_district[0][0]

# YOUR CODE HERE

['Alice', 'Bob', 'Carol', 'Dave']

In [21]:
# Task 2: How would you access the 0th student in classroom3 (Maria) **FROM** school_district, using index notation?
school_district[1][1][0]
# YOUR CODE HERE

'Maria'

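We also often get data in the form of dictionaries. Dictionaries are a data structure containing key-value pairs, kind of like a phonebook: you look values up by **key** rather than by position. (Since Python 3.7, dictionaries remember the order in which keys were inserted, but you still can't index into them by number.)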

Here's a dictionary with information about the states in the Pacific Northwest:

In [22]:
pacific_northwest = {
    'Washington' : {
        'Abbreviation': 'WA',
        'Area': 71362,
        'Capital': 'Olympia',
        'Established': '1889-11-11',
        'Largest city': 'Seattle',
        'Population': 7887965,
        'Representatives': 10
    },
    'Idaho': {
        'Abbreviation': 'ID',
        'Area': 83569,
        'Capital': 'Boise',
        'Established': '1890-07-03',
        'Largest city': 'Boise',
        'Population': 1839106,
        'Representatives': 2
    },
    'Oregon': {
        'Abbreviation': 'OR',
        'Area': 98381,
        'Capital': 'Salem',
        'Established': '1859-02-14',
        'Largest city': 'Portland',
        'Population': 4246155,
        'Representatives': 6
    }
}

In [23]:
# Task: How would you get a list of the keys in this dictionary?
pacific_northwest.keys()
# YOUR CODE HERE

dict_keys(['Washington', 'Idaho', 'Oregon'])

In [24]:
# Task: How would you access all of the values?
pacific_northwest.values()
# YOUR CODE HERE

dict_values([{'Abbreviation': 'WA', 'Area': 71362, 'Capital': 'Olympia', 'Established': '1889-11-11', 'Largest city': 'Seattle', 'Population': 7887965, 'Representatives': 10}, {'Abbreviation': 'ID', 'Area': 83569, 'Capital': 'Boise', 'Established': '1890-07-03', 'Largest city': 'Boise', 'Population': 1839106, 'Representatives': 2}, {'Abbreviation': 'OR', 'Area': 98381, 'Capital': 'Salem', 'Established': '1859-02-14', 'Largest city': 'Portland', 'Population': 4246155, 'Representatives': 6}])

In [25]:
# Task: How would you access *all* of the information about Washington?
pacific_northwest['Washington']
# YOUR CODE HERE

{'Abbreviation': 'WA',
 'Area': 71362,
 'Capital': 'Olympia',
 'Established': '1889-11-11',
 'Largest city': 'Seattle',
 'Population': 7887965,
 'Representatives': 10}

In [26]:
# Task: How would you access the population of Oregon?
pacific_northwest['Oregon']['Population']
# YOUR CODE HERE

4246155

Nested data structures do not need to be the same data type. Here's the same information above, but as a list of dictionaries:

In [27]:
pacific_northwest_list = [
    {'Name': 'Washington',
     'Abbreviation': 'WA',
     'Area': 71362,
     'Capital': 'Olympia',
     'Established': '1889-11-11',
     'Largest city': 'Seattle',
     'Population': 7887965,
     'Representatives': 10
    },
    {'Name':'Idaho',
     'Abbreviation': 'ID',
     'Area': 83569,
     'Capital': 'Boise',
     'Established': '1890-07-03',
     'Largest city': 'Boise',
     'Population': 1839106,
     'Representatives': 2
    },
    {'Name': 'Oregon',
     'Abbreviation': 'OR',
     'Area': 98381,
     'Capital': 'Salem',
     'Established': '1859-02-14',
     'Largest city': 'Portland',
     'Population': 4246155,
     'Representatives': 6
    }
]

In [28]:
# Task: How would you access the capital of Idaho?

In [29]:
pacific_northwest_list[1]['Capital']


'Boise'

In [30]:
# Task: How would you print out all of the state names and populations?
for state in pacific_northwest_list:
    name=state['Name']
    pop=state['Population']
    print(name+':'+ str(pop))

Washington:7887965
Idaho:1839106
Oregon:4246155


Ok, now we are ready to access APIs, which often return data in the JSON format. [JavaScript Object Notation (JSON)](https://www.json.org/) is probably the most popular data-interchange format and is especially ubiquitous when retrieving data from the application programming interfaces (APIs) of popular platforms like Twitter, Reddit, Wikipedia, etc.

JSON is attractive for programmers using JavaScript and Python because it can represent a mix of different data types.

What you need to know is that JSON is very similar to the form of a Python dictionary, and it can contain other data structures (for example, lists).
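
To make the JSON-to-dictionary connection concrete, here is a minimal sketch using Python's built-in `json` module. The JSON text below is a made-up sample (it reuses the Washington values from the warm-up), not the output of a real API.

```python
import json

# A JSON document is just text; this sample string is made up for illustration
json_text = '{"Name": "Washington", "Population": 7887965, "Cities": ["Seattle", "Olympia"]}'

# json.loads() turns JSON text into Python objects (dicts, lists, strings, numbers)
state = json.loads(json_text)

print(type(state))          # a Python dict
print(state['Cities'][0])   # a list nested inside the dict

# json.dumps() goes the other way: Python objects back to JSON text
round_trip = json.dumps(state, indent=2)
print(round_trip)
```

Once the text has been parsed, everything from the warm-up applies: square brackets with keys for dictionaries, square brackets with positions for lists.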

_**As you start to work with APIs, remember to always put on your data detective hats and figure out what structure you are in and how to extract information from it!**_

### Getting historical web pages from the Wayback Machine API

Now we are ready to start using APIs! For our first example, we'll use the Wayback Machine, a service from the [Internet Archive](https://archive.org/), which is a database of historical webpages and media content.

For fun, let's look at a few:
* [CNN in August 2000](https://web.archive.org/web/20000815052826/http://www.cnn.com/)
* [Apple in April 1997](https://web.archive.org/web/19970404064444/http://www.apple.com:80/)
* [Whitman College in 2002](https://web.archive.org/web/20020124214454/http://www.whitman.edu/)


---
**Activity:**

Visit the Wayback Machine at [https://web.archive.org](https://web.archive.org/) and check out a historical version of a page that interests you. Share it with your partner.

---

Pretty fun, huh? Even better, we can access much of this information via an API! Here's some [info from the Wayback Machine about how their API works](https://archive.org/help/wayback_api.php). Let's try it out!

In [31]:
# First import the packages we need

# Lets us talk to other servers on the web
import requests

# APIs spit out data in JSON
import json

# Use BeautifulSoup to parse some HTML
from bs4 import BeautifulSoup

# Safely quoting strings for URLs
from urllib.parse import unquote, quote

# Handling dates and times
from datetime import datetime

# DataFrames!
import pandas as pd
import numpy as np

# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb


The simplest API request we can make asks for the most recent snapshot of a webpage archived by the Wayback Machine. For example:

In [32]:
# API calls come in the form of URLs
wayback_url = 'http://archive.org/wayback/available?url=whitman.edu'

# We can use requests.get() to get the contents of that URL
wayback_response = requests.get(wayback_url)

# Finally, we render the response as JSON using .json()
wayback_response.json()


{'url': 'whitman.edu',
 'archived_snapshots': {'closest': {'status': '200',
   'available': True,
   'url': 'http://web.archive.org/web/20231016153703/https://www.whitman.edu/',
   'timestamp': '20231016153703'}}}

What do you notice about the response above? What information does it include? How is this information structured?

---
__Fun note: APIs are just URLs!__ You don't need to write any code to check them out. Try pasting this URL, http://archive.org/wayback/available?url=whitman.edu, into your web browser. What happens?
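
Since an API call is just a URL, you can also assemble it from its parts. The sketch below is one possible way to do that with the standard library's `urlencode`; the variable names `endpoint`, `params`, and `query_url` are chosen here for illustration.

```python
from urllib.parse import urlencode

# The part of the URL before the '?' (taken from the example above)
endpoint = 'http://archive.org/wayback/available'

# The part after the '?' is a set of key=value pairs, which fits a dictionary
params = {'url': 'whitman.edu'}

# urlencode() turns the dictionary into 'key=value' text and escapes special characters
query_url = endpoint + '?' + urlencode(params)
print(query_url)

# requests can do the same encoding for you:
# requests.get(endpoint, params=params)
```

Keeping the parameters in a dictionary makes it easier to change one piece of the query later without retyping the whole URL.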

In [33]:
# Task: Extract the URL from this request (this is the location of the page)
# Save it as a variable called recent_whitman_url
recent_whitman_url = wayback_response.json()['archived_snapshots']['closest']['url']
recent_whitman_url
# YOUR CODE HERE

'http://web.archive.org/web/20231016153703/https://www.whitman.edu/'

In [34]:
# Show how to build it up bit by bit

In [35]:
# And now we can use requests.get() to grab the HTML
recent_whitman_response = requests.get(recent_whitman_url)

# Turn it into soup -- remember to use .text first!
# Naming the parser explicitly avoids a BeautifulSoup warning
recent_whitman_soup = BeautifulSoup(recent_whitman_response.text, 'html.parser')

# And find all the links
links_text = [link.text for link in recent_whitman_soup.find_all('a')]

# And print them out
for l in links_text:
    print(l)

Ok, this is cool ... but it's much more fun to get HISTORICAL data. With the Wayback Machine API, we can also search for content around a given timestamp.

In [None]:
# Notice how we now have '&timestamp=20080201' in the URL
# What do you think this means?
wb_url = 'http://archive.org/wayback/available?url=whitman.edu&timestamp=20080201'

# Use requests.get() to get the response
wb_response = requests.get(wb_url)

# Render it as JSON
wb_response_json = wb_response.json()

# And examine
wb_response_json

# What do you notice?
# When was this page scraped by the Wayback Machine?
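
Not every page has been archived, so it can help to extract the snapshot defensively. The sketch below assumes that a response with no snapshot has an empty `archived_snapshots` dictionary; check the Wayback Machine documentation to confirm that shape. The function name `closest_snapshot_url` and the `example.invalid` address are hypothetical.

```python
def closest_snapshot_url(response_json):
    """Return the closest snapshot's URL, or None if no snapshot is listed."""
    # .get() returns a default instead of raising KeyError when a key is missing
    snapshots = response_json.get('archived_snapshots', {})
    closest = snapshots.get('closest')
    if closest is None:
        return None
    return closest.get('url')

# Shaped like the whitman.edu response shown earlier
found = {
    'url': 'whitman.edu',
    'archived_snapshots': {'closest': {
        'status': '200',
        'available': True,
        'url': 'http://web.archive.org/web/20231016153703/https://www.whitman.edu/',
        'timestamp': '20231016153703'}}}

# Assumed shape of a response when nothing was archived
missing = {'url': 'example.invalid', 'archived_snapshots': {}}

print(closest_snapshot_url(found))
print(closest_snapshot_url(missing))
```

With plain square brackets, the second call would stop with a `KeyError`; returning `None` lets a loop over many pages keep going.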

In [None]:
# Task: Make an API request to find out when Facebook's privacy policy (http://www.facebook.com/policy.php)
# was archived in the Wayback Machine closest to January 1, 2008.

# First construct the API url (the timestamp parameter asks for the snapshot closest to that date)
url = 'http://archive.org/wayback/available?url=facebook.com/policy.php&timestamp=20080101'

# Then use requests.get() to get the response
fb_response = requests.get(url)

# Turn it into JSON
fb_json = fb_response.json()

# And extract the timestamp
fb_json['archived_snapshots']['closest']['timestamp']

# What day was it archived?
# What day was it archived?

In [None]:
# First construct the API url
url = 'http://archive.org/wayback/available?url=facebook.com/policy.php&timestamp=20080101'

# Then use requests.get() to get the response
response = requests.get(url)

# Turn it into JSON
response_json = response.json()

# And extract the timestamp
response_json['archived_snapshots']['closest']['timestamp']
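
The timestamp comes back as a string of digits (year, month, day, hour, minute, second). One way to turn it into something readable is `datetime.strptime`. This sketch uses the whitman.edu timestamp shown earlier rather than a live response.

```python
from datetime import datetime

# Timestamp copied from the whitman.edu response earlier in this notebook
timestamp = '20231016153703'

# Describe the layout: 4-digit year, then 2 digits each for month, day, hour, minute, second
archived_at = datetime.strptime(timestamp, '%Y%m%d%H%M%S')

print(archived_at)                        # 2023-10-16 15:37:03
print(archived_at.strftime('%Y-%m-%d'))   # 2023-10-16
```

Once the value is a `datetime`, you can compare snapshots, sort them, or compute how far a snapshot is from the date you asked for.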

What might you do with this? For example, you could examine how Facebook's (or any company's) privacy policies or terms of service changed over time. This would be your starting point -- then you could compile the text and do some natural language processing to analyze it!

A simple way to analyze how the privacy policies and terms of service have changed over time would be to see how the number of words has changed. Brian Keegan has an example of how to do this in his web scraping course -- I encourage you to [check it out!](https://github.com/CU-ITSS/Web-Data-Scraping-S2023/blob/main/Class%2004%20-%20Internet%20Archive%20and%20Wikipedia%20APIs/Class%2004%20-%20Scraping%20Internet%20Archive%20and%20Wikipedia.ipynb)
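
As a rough starting point for that kind of analysis (this is not the code from Brian Keegan's course), the sketch below builds one API query per year and defines a simple word counter. The helper names `yearly_queries` and `word_count` are made up for this example, and asking for January 1 of each year is an arbitrary choice.

```python
from urllib.parse import urlencode

API_ENDPOINT = 'http://archive.org/wayback/available'

def yearly_queries(page_url, years):
    """Build one availability-API URL per year, each asking for January 1."""
    queries = {}
    for year in years:
        params = {'url': page_url, 'timestamp': f'{year}0101'}
        # safe='/' keeps the slash in the page address readable
        queries[year] = API_ENDPOINT + '?' + urlencode(params, safe='/')
    return queries

def word_count(text):
    """Count whitespace-separated words in a block of text."""
    return len(text.split())

queries = yearly_queries('facebook.com/policy.php', range(2006, 2010))
for year, url in queries.items():
    print(year, url)
```

From here you would request each URL, follow the snapshot link, pull the text out of the page with BeautifulSoup, and pass that text to `word_count`.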

---

__Activity:__ Brainstorm with a partner for a few minutes about how you might use the Wayback Machine API to do a data science project.

### Using the TVMaze API

Ok, let's try out a different API, this time from [TVMaze](https://www.tvmaze.com/api). This is an API that has information about tons and tons of TV shows.

First let's make a basic request for information about a show.

In [None]:
# This is the basic URL -- we are going to build our query requests from this
base_url = "https://api.tvmaze.com"

# We can get information about specific shows by appending /shows/ and then an ID number to the URL
# The code below requests the info for show 321
showInfo=requests.get(base_url +"/shows/321").json()
print(showInfo)

# What is the show name?

# What information is included?

# How is it structured?

In [None]:
# How would you print out the summary?
# YOUR CODE HERE
print(showInfo['summary'])

# How would you print out the average rating?
# YOUR CODE HERE
print(showInfo['rating']['average'])


In [None]:
# summary
print(showInfo['summary'])

In [None]:
# average rating
# note that we have a nested dictionary here!
print(showInfo['rating']['average'])

Ok, this is cool! But how do we know what shows are in the TVMaze database and what their IDs are?

For this, we can do a [show search](https://www.tvmaze.com/api#show-search).

In [None]:
# Search by a string:
showSearch = '/search/shows?q='
queryString = 'bachelor'

searchResults=requests.get(base_url + showSearch + queryString).json()
print(searchResults)

In [None]:
# YOUR TASK
# What is the format of the results?
# How many shows are in my results?
# Print out all the names of the shows in the results
# Print out all the IDs of the show results

# YOUR CODE HERE

In [None]:
# How many shows are in the results?
len(searchResults)

In [None]:
# Print out all the names and IDs
for item in searchResults:
    name = item['show']['name']
    showID = item['show']['id']
    print("Name: " + name + ", ID: " + str(showID))

In [None]:
# You could now use the show IDs to get the show info!
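
One way to keep track of the IDs is a name-to-ID lookup table. The sketch below uses hypothetical search results (made-up names and IDs, trimmed to the two fields used in the loop above) so that it runs without a network connection.

```python
# Hypothetical search results, trimmed to the fields used in the loop above
sample_results = [
    {'show': {'id': 101, 'name': 'Example Show'}},
    {'show': {'id': 202, 'name': 'Another Example'}},
]

# Build a name -> ID lookup table with a dictionary comprehension
name_to_id = {item['show']['name']: item['show']['id'] for item in sample_results}

print(name_to_id['Another Example'])  # 202

# The ID slots straight into the show-info URL pattern used earlier
print('https://api.tvmaze.com/shows/' + str(name_to_id['Another Example']))
```

Note that if two shows share a name, the later one overwrites the earlier one in the dictionary, so this works best for small result sets you have already looked over.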

In [None]:
# YOUR TASK: Search for a different show
# Based on the ID (or IDs) you get back, make an API request for the info TVMaze has about that show
showquery = '/search/shows?q='
# quote() escapes the space in the show name so it is safe to put in a URL
queryString = quote('Modern Family')
searchResults = requests.get(base_url + showquery + queryString).json()
print(searchResults)
# YOUR CODE HERE

These are just a few of the kinds of requests you can make using the TVMaze API.

What other things can you do?

**Activity:** Spend a few minutes looking at the API's documentation. Try out a different kind of query. What did you find?

### Suggestions for working with APIs

1. Spend some time figuring out how they work...read the docs!
2. The docs often have example queries. Use these to your advantage!
3. Make some sample requests. Ask yourself: How is the data structured? Do I have any nested data structures?
4. Often, you need to request preliminary information (like the show IDs above) in order to get the information you really want (the show facts above).
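
The two-step pattern in suggestion 4 can be sketched with the TVMaze endpoints used earlier. The function names `fetch_json` and `first_show_info` are hypothetical, and taking the first search result is a simplifying assumption; in practice you would inspect the results before choosing one.

```python
from urllib.parse import quote

BASE_URL = 'https://api.tvmaze.com'

def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    import requests  # imported here so first_show_info() can also run with a stand-in fetcher
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return response.json()

def first_show_info(query, fetch=fetch_json):
    """Search for a show by name, then request the full record for the top hit."""
    # Step 1: the search gives us candidate shows and their IDs
    results = fetch(BASE_URL + '/search/shows?q=' + quote(query))
    if not results:
        return None
    show_id = results[0]['show']['id']
    # Step 2: use the ID to ask for the information we really want
    return fetch(f'{BASE_URL}/shows/{show_id}')

# Example (needs an internet connection):
# info = first_show_info('bachelor')
# print(info['name'])
```

Passing the fetching function in as an argument is a design choice that lets you try the logic with fake data before hitting the real API.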

Let's explore! Pick an API to explore. You can use any that you want, but here are some you might consider:
* [Pokemon API](https://pokeapi.co/)
* [Dog API](https://dog.ceo/dog-api/) (pictures of dogs)
* [SpaceX API](https://github.com/r-spacex/SpaceX-API/)
* [COVID19 API](https://covid19api.com/)
* [NASA APIs](https://api.nasa.gov/)
* [EPA's Air Quality Index API](https://aqs.epa.gov/aqsweb/documents/data_api.html)
* [Superhero API](https://superheroapi.com/?ref=apilist.fun)
* [Open Movie Database](https://www.omdbapi.com/)
* [New York Times](https://developer.nytimes.com/?ref=apilist.fun)
* [Spoonacular Food API](https://spoonacular.com/food-api)
* [Open Library Books API](https://openlibrary.org/developers/api)


Then, answer the following questions:

1. Find the API's documentation. Spend some time reading about it -- what information does it have? How are queries structured? What kinds of different queries can you make?
2. Try to make some interesting queries. What can you do?
3. Think about how you might use this API to do a data science project. What questions might you be able to answer?
4. How would you go about storing the data that you are getting back from the API?
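
For question 4, one possible approach (among several, such as CSV files or a database) is to save the parsed response as a JSON file and load it back later. The records below are hypothetical stand-ins for API data, and the file name `api_results.json` is arbitrary.

```python
import json
import os
import tempfile

# Stand-in for data returned by an API (hypothetical values)
records = [
    {'id': 1, 'name': 'Example Show A'},
    {'id': 2, 'name': 'Example Show B'},
]

# Write to the system's temporary folder so the example leaves your project untouched
path = os.path.join(tempfile.gettempdir(), 'api_results.json')

# json.dump() writes Python objects to a file as JSON text
with open(path, 'w', encoding='utf-8') as f:
    json.dump(records, f, indent=2)

# json.load() reads the file back into Python objects
with open(path, encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == records)
```

Saving responses means you only request the data once, which is kinder to the API and lets you keep working offline.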

_Note: Some APIs require you to first request an API key. This is so you don't overload them with too many requests._