<h1><center> PPOL 5203 Data Science I: Foundations <br><br> 
<font color='grey'> Collecting Digital Data - API<br><br>
Tiago Ventura </font></center></h1> 

---

## Learning Goals

In the class today, we will learn how to collect digital data through APIs. We will focus on: 

- Building a solid understanding of APIs
- Working with three types of APIs:
    - APIs with no credentials and no wrappers
    - APIs with credentials and no wrappers
    - APIs with wrappers. 

In [3]:
# setup
import requests
import os
import pandas as pd

## APIs 101

The famous acronym API stands for “Application Programming Interface”. An API is an online server that allows different applications to interact. Most often for our purposes, an API will facilitate information exchange between data users and the holders of certain data. Many companies build these repositories for various functions, including sharing data, receiving data, joint database management, and providing artificial intelligence functions or machines for public use.

Let's think of an example capable of motivating the creation of an API. Imagine you own Twitter. You would have zillions of hackers every day trying to scrape your data, which would make your website more unstable and insecure. What is a possible solution? You create an API, and you control who accesses the information, when they access it, and what type of information you make available. Another option is to close your API and restrict data access to researchers. But, if you do this, you are likely to pay a reputational cost for not being transparent, and users might leave your platform.

Have you ever watched The Matrix? APIs are just like that! In the movies, Neo and others would physically connect their minds to a super developed server and ask to learn a certain skill (kung-fu, programming, a language, etc.). This is exactly what an API does. You connect to the website, request data, and receive it in return. It's like sending an email, but doing everything via a programming language.

### API Use-Cases

There are two main ways in which we academics commonly use APIs.

1. Access data shared by Companies and NGOs.

2. Process our data in Algorithms developed by third parties.

Our focus will be on the first. Later, we will see how to use the ChatGPT API for text classification tasks. 


### API Components

An API is just a URL. See the example below:

`http://mywebsite.com/endpoint?key&param_1&param_2`

Main Components: 

- **http://mywebsite.com/**: API root. The domain of your API.
- **endpoint**: An endpoint is a server route for retrieving specific data from an API.
- **key**: credentials that some websites ask you to create before you can query the API. 
- **?param_1&param_2**: parameters. Those are filters that you can pass in API requests. 
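
As a rough illustration of these components, Python's standard library can split a URL into its parts. The URL in this sketch is made up: the host `mywebsite.com`, the endpoint, and the parameter names and values are placeholders, not a real API.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: host, endpoint, and parameters are placeholders
url = "http://mywebsite.com/endpoint?key=my_key&param_1=a&param_2=b"

parts = urlsplit(url)

# API root: scheme + domain
api_root = f"{parts.scheme}://{parts.netloc}/"
# endpoint: the path that follows the domain
endpoint = parts.path.lstrip("/")
# parameters: everything after the question mark, separated by "&"
parameters = parse_qs(parts.query)

print(api_root)
print(endpoint)
print(parameters)
```

Note that `parse_qs` returns each value inside a list, because a parameter may legally appear more than once in a URL.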


### Requests to APIs

In order to work with APIs, we need tools to access the web. In Python, the most common library for making requests and working with APIs is the `requests` library. There are two main types of requests: 

- `get()`: to receive information from the API -- which we will use the most for web data collection

- `post()`: to send information to the API -- think about the use of ChatGPT for classification of text. 


## Example 1: Open Trivia API

### Querying an API: Step-by-Step

Let's start querying our first API. We will start with the simple [Open Trivia API](https://opentdb.com/api_config.php). This is a very simple API, and serves the purpose of learning all the basic steps of querying APIs. The Open Trivia API gives you ideas for your trivia games!

The Trivia API: 

- **Does not** require us to create credentials.
- And **does not** have a Python wrapper.

When querying an API, our work will often involve the following steps: 

- **Step 1:** Look at the API documentation and endpoints, and construct a query of interest
- **Step 2:** Use requests.get(querystring) to call the API
- **Step 3:** Examine the response
- **Step 4:** Extract your data and save it. 

### Step 1: Documentation, Endpoints and Query. 

Before we start querying an API, we always need to read through the [documentation/reference](https://opentdb.com/api_config.php). The documentation often reveals: 

- The base url for the API: `https://opentdb.com/api.php`
- The different endpoints: 
    - This api has only one endpoint
- The API parameters:
    - `amount`
    - `category`
    - And some others we will learn.
    
Notice one thing here. The Trivia API requires you to pass the `amount` filter in your call. Not all APIs are like this. Some have a random endpoint for you to play around with.

In [1]:
# build query
query = "https://opentdb.com/api.php?amount=1"

### **Step 2:** Use `requests.get(querystring)` to call the API

To interact with the API, we will use the `requests` package. The `requests` package allows us to send an HTTP request to the API. Because we are interested in retrieving data, we will mostly be working with the `.get()` method, which requires one argument: the URL we want to make the request to. 

The call returns a `Response` object, which stores the response code and the content sent back by the API.

In [4]:
# Make a get request to the Open Trivia API
response = requests.get(query)
type(response)

requests.models.Response

### **Step 3:** Examine the response

When we make a request, the response from the API comes with a response code which tells us whether our request was successful. Response codes are important because they immediately tell us if something went wrong. Here is a list of response codes you can get:

200 — Everything went okay, and the server returned a result (if any).

301 — The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or when an endpoint's name has changed.

401 — The server thinks you're not authenticated. This happens when you don't send the right credentials to access an API.

400 — The server thinks you made a bad request. This can happen when you don't send the information that the API requires to process your request (among other things).

403 — The resource you're trying to access is forbidden, and you don't have the right permissions to see it.

404 — The server didn't find the resource you tried to access.
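
If you forget what a code means, the standard library already knows the official names. The sketch below uses `http.HTTPStatus` to label the codes listed above; `describe_status` is a helper name made up for this example.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short, human-readable label for an HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # the code is not a registered HTTP status
        return f"{code} (unknown status code)"

# the codes discussed in the list above
for code in [200, 301, 400, 401, 403, 404]:
    print(describe_status(code))
```

With a real response, you could call `describe_status(response.status_code)`.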


In [5]:
# check status code
status_code = response.status_code

# print status code
status_code

200

### **Step 4:** Extract your data.

With a 200 code, we can access the content of the get request. The data returned by the API is stored in the `content` attribute of the response object.

In [6]:
print(response.content)

b'{"response_code":0,"results":[{"type":"multiple","difficulty":"hard","category":"History","question":"Which English king was married to Eleanor of Aquitaine?","correct_answer":"Henry II","incorrect_answers":["Richard I","Henry I","Henry V"]}]}'


#### Processing JSONs

The default format of the data we receive from APIs is JSON. This format encodes data structures like lists and dictionaries as strings to ensure that machines can read them easily. 

For that kind of content, the `requests` library includes a specific `.json()` method that you can use to immediately convert the API bytes response into a Python data structure, in general a nested dictionary. 

In [7]:
# convert the get output to a dictionary
response_dict = response.json()
print(response_dict)

{'response_code': 0, 'results': [{'type': 'multiple', 'difficulty': 'hard', 'category': 'History', 'question': 'Which English king was married to Eleanor of Aquitaine?', 'correct_answer': 'Henry II', 'incorrect_answers': ['Richard I', 'Henry I', 'Henry V']}]}
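
To get a feeling for what `.json()` does, the sketch below parses the same bytes with the standard `json` module. This is only an approximation for illustration: it assumes the bytes are UTF-8 encoded, while the real method handles encoding for you.

```python
import json

# Raw bytes, copied from the `response.content` output shown earlier
raw = b'{"response_code":0,"results":[{"type":"multiple","difficulty":"hard","category":"History","question":"Which English king was married to Eleanor of Aquitaine?","correct_answer":"Henry II","incorrect_answers":["Richard I","Henry I","Henry V"]}]}'

# decode the bytes to text, then parse the JSON text into Python objects
parsed = json.loads(raw.decode("utf-8"))

print(type(parsed))
print(parsed["results"][0]["question"])
```

JSON objects become Python dictionaries and JSON arrays become Python lists, which is why the result can be indexed with keys and positions.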


In [8]:
# index just like a dict
response_dict["results"][0]["question"]

'Which English king was married to Eleanor of Aquitaine?'

In [9]:
# convert to a dataframe
import pandas as pd

# wrap the dict in a list so pandas treats it as a single row
pd.DataFrame([response_dict["results"][0]])

Unnamed: 0,type,difficulty,category,question,correct_answer,incorrect_answers
0,multiple,hard,History,Which English king was married to Eleanor of A...,Henry II,"[Richard I, Henry I, Henry V]"


Let's see the full code:

In [10]:
# full code
import requests
import pandas as pd

# build query
query = "https://opentdb.com/api.php?amount=1"

# make the get request
response = requests.get(query)

# check status code
status_code = response.status_code

# move forward with code
if status_code==200:
    # convert the get output to a dictionary
    response_dict = response.json()
    # convert to a dataframe
    res = pd.DataFrame([response_dict["results"][0]])
else:
    # no data to show, so report the status code instead
    res = None
    print(status_code)
    
# print the trivia question
res

Unnamed: 0,type,difficulty,category,question,correct_answer,incorrect_answers
0,multiple,medium,Sports,What is the name of the AHL affiliate of the T...,Toronto Marlies,"[Toronto Rock, Toronto Argonauts, Toronto Wolf..."


### Exploring API Filters

If you look at the documentation, you will see that the API provides filters (query parameters) that allow you to refine your search. 

For example, when you send a `get` request to the YouTube API, you are not interested in the entire YouTube data. You want data associated with certain videos or profiles, or for a certain period of time. These filters are often embedded as query parameters in the API call. 

To add a query parameter to a given URL, you have to add a question mark (?) before the first query parameter. If you want to have multiple query parameters in your request, then you can separate them with an ampersand (&).

We can add filters by: 

- Constructing the full API call

- Using dictionaries


### Filter with the full API call

In [11]:
## get ten questions
# build query
query = "https://opentdb.com/api.php"

# add filter
activity = "?amount=10"

# full request
url = query + activity

# Make a get request 
response = requests.get(url)

# see json
response.json()

{'response_code': 0,
 'results': [{'type': 'multiple',
   'difficulty': 'hard',
   'category': 'Entertainment: Video Games',
   'question': 'In the 2014 Pokemon VGC Finals, which Pokemon was famous for bringing the winner to victory?',
   'correct_answer': 'Pachirisu',
   'incorrect_answers': ['Garchomp', 'Lapras', 'Primal Groudon']},
  {'type': 'multiple',
   'difficulty': 'medium',
   'category': 'History',
   'question': 'Who was the first wife of King Henry VIII?',
   'correct_answer': 'Catherine of Aragon',
   'incorrect_answers': ['Jane Seymour', 'Anne Boleyn', 'Anne of Cleves']},
  {'type': 'multiple',
   'difficulty': 'medium',
   'category': 'Entertainment: Video Games',
   'question': 'Who is the main character in the video game &quot;Just Cause 3&quot;?',
   'correct_answer': 'Rico Rodriguez',
   'incorrect_answers': ['Tom Sheldon', 'Marcus Holloway', 'Mario Frigo']},
  {'type': 'multiple',
   'difficulty': 'easy',
   'category': 'Geography',
   'question': 'What is the 15th

### Or using dictionaries

In [16]:
## get ten questions from category 11 (Entertainment: Film)
# build query
query = "https://opentdb.com/api.php"

# add filter
parameters = {"amount": "10", 
             "category":"11"}

# Make a get request with the parameters
response = requests.get(query, 
                        params=parameters)

# see json
print(response.status_code)
response.json()

200


{'response_code': 0,
 'results': [{'type': 'multiple',
   'difficulty': 'medium',
   'category': 'Entertainment: Film',
   'question': 'What mutated animals act as monsters in the movie &#039;Night of the Lepus&#039;?',
   'correct_answer': 'Rabbits',
   'incorrect_answers': ['Dogs', 'Rats', 'Bats']},
  {'type': 'multiple',
   'difficulty': 'medium',
   'category': 'Entertainment: Film',
   'question': 'What is the name of the robot in the 1951 science fiction film classic &#039;The Day the Earth Stood Still&#039;?',
   'correct_answer': 'Gort',
   'incorrect_answers': ['Robby', 'Colossus', 'Box']},
  {'type': 'multiple',
   'difficulty': 'medium',
   'category': 'Entertainment: Film',
   'question': 'Who played Sgt. Gordon Elias in &#039;Platoon&#039; (1986)?',
   'correct_answer': 'Willem Dafoe',
   'incorrect_answers': ['Charlie Sheen', 'Matt Damon', 'Johnny Depp']},
  {'type': 'multiple',
   'difficulty': 'easy',
   'category': 'Entertainment: Film',
   'question': 'In the movie &q

See, `requests` builds the same kind of URL we wrote by hand, with the parameters appended after the question mark:

In [17]:
response.url

'https://opentdb.com/api.php?amount=10&category=11'
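
The same URL can also be assembled by hand with the standard library. The sketch below uses `urllib.parse.urlencode`; it illustrates the idea and is not necessarily how `requests` builds the URL internally.

```python
from urllib.parse import urlencode

base_url = "https://opentdb.com/api.php"
parameters = {"amount": "10", "category": "11"}

# urlencode joins the key=value pairs with "&"; we add the "?" ourselves
full_url = base_url + "?" + urlencode(parameters)
print(full_url)
```

Passing a dictionary is usually the safer choice, because special characters in the values (spaces, commas) are escaped for you.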

In [18]:
pd.DataFrame(response.json()["results"])

Unnamed: 0,type,difficulty,category,question,correct_answer,incorrect_answers
0,multiple,medium,Entertainment: Film,What mutated animals act as monsters in the mo...,Rabbits,"[Dogs, Rats, Bats]"
1,multiple,medium,Entertainment: Film,What is the name of the robot in the 1951 scie...,Gort,"[Robby, Colossus, Box]"
2,multiple,medium,Entertainment: Film,Who played Sgt. Gordon Elias in &#039;Platoon&...,Willem Dafoe,"[Charlie Sheen, Matt Damon, Johnny Depp]"
3,multiple,easy,Entertainment: Film,"In the movie &quot;Spaceballs&quot;, what are ...",Air,"[The Schwartz, Princess Lonestar, Meatballs]"
4,multiple,medium,Entertainment: Film,"This movie contains the quote, &quot;I love th...",Apocalypse Now,"[Platoon, The Deer Hunter, Full Metal Jacket]"
5,multiple,medium,Entertainment: Film,What year did the James Cameron film &quot;Tit...,1997,"[1996, 1998, 1999]"
6,multiple,easy,Entertainment: Film,Which of the following movies was not based on...,The Thing,"[Carrie, Misery, The Green Mile]"
7,multiple,medium,Entertainment: Film,"In the 1984 movie &quot;The Terminator&quot;, ...",T-800,"[I-950, T-888, T-1000]"
8,multiple,medium,Entertainment: Film,In the 1971 film &quot;Willy Wonka &amp; the C...,Gene Wilder,"[Shia LeBouf, Peter Ostrum, Johnny Depp]"
9,multiple,hard,Entertainment: Film,"In the film &quot;Interstellar&quot;, how long...","23 years, 4 months, and 8 days","[15 years, 2 months, and 15 days, 10 months an..."
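
Notice that the question text contains HTML entities such as `&quot;` and `&#039;`. One possible way to clean them, sketched here with the standard `html` module, is to unescape each string after collecting the data. The two strings are copied from the outputs above.

```python
import html

# Two question strings copied from the API outputs above
questions = [
    "What mutated animals act as monsters in the movie &#039;Night of the Lepus&#039;?",
    "Who is the main character in the video game &quot;Just Cause 3&quot;?",
]

# replace HTML entities with the characters they stand for
cleaned = [html.unescape(q) for q in questions]

for q in cleaned:
    print(q)
```

For a dataframe, the same function could be applied to a whole column, for example with `df["question"].map(html.unescape)`, where `df` is whatever name you gave to the dataframe.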


## Example 2: Yelp API. 

Let's now transition to a more complex API, with more interesting data. We will work with the Yelp API.

This API: 
-  Requires us to get credentials
-  But does not have a wrapper to query the data (that I know of). 

See the documentation for the API [here](https://docs.developer.yelp.com/docs/fusion-intro). The API has some interesting endpoints, for example:

- `/businesses/search` - Search for businesses by keyword, category, location, price level, etc.
- `/businesses/{id}` - Get rich business data, such as name, address, phone number, photos, Yelp rating, price levels and hours of operation.
- `/businesses/{business_id_or_alias}/reviews` - Get up to three review excerpts for a business.
- Among many other endpoints

## Authentication with an API

Most often, the provider of an API will require you to authenticate before you can get some data. Authentication usually occurs through an access token you can generate directly from the API. Depending on the type of authentication each API has in place, it can be a simple token (string) or multiple different ids (Client ID, Access Token, Client Token, etc.).

Keep in mind that using a token is better than using a username and password for a few reasons:

- Typically, you'll be accessing an API from a script. If you put your username and password in the script and someone finds it, they can take over your account. 

- Access tokens can have scopes and specific permissions. 

To authorize your access, you need to add the token to your API call. Often, you do this by passing the token through an authorization header. We can use Python's requests library to make a dictionary of headers, and then pass it into our request.
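
A minimal sketch of building such a header is shown here. It assumes the API expects a `Bearer` token in the `Authorization` header, which is the scheme used for Yelp later in this notebook; other APIs may expect a different header. The function name `build_auth_header` and the token value are made up for illustration.

```python
def build_auth_header(token):
    """Build a Bearer authorization header from an API token."""
    if not token:
        # fail early with a clear message instead of sending an empty token
        raise ValueError("API token is missing; check your environment variables")
    return {"Authorization": f"Bearer {token}"}

# "my_placeholder_token" is a made-up value, not a real credential
header = build_auth_header("my_placeholder_token")
print(header)
```

The resulting dictionary is what you would pass to the `headers` argument of `requests.get()`. Checking for an empty token first helps when the environment variable was not loaded and the value is `None`.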


### Acquiring credentials with Yelp Fusion API

Information about acquiring your credentials to make API calls is often displayed in the API documentation. 

[Here is Yelp's information](https://docs.developer.yelp.com/docs/fusion-authentication)

Every API has a slightly different process. In general, APIs require you to create an app to access the API. This terminology is a bit odd. The assumption here is that you are creating an app (think about the Botometer at Twitter) that will query the API many times. 

For the Yelp API, after you create the app, you will get a `Client ID` and an `API Key`.

### How to save the API keys/token?

API keys are personal information. Keep yours safe, and do not paste them into your code.

Don't do this:

`api_key = "my_key"`

Do this:

- Create a file with your keys and save it as `.env`
- Add your keys there
- Load them in your environment when running the APIs
- And never upload your `.env` file to a public server (like GitHub)

I will show you in class what a .env file looks like. 
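
For reference, a `.env` file is plain text with one `KEY=value` pair per line. The sketch below shows example contents and a simplified parser, only to make the idea concrete. The key names match the ones read in Step 0, the values are placeholders, and `parse_env` is a made-up helper; in practice, `load_dotenv()` from the `python-dotenv` package does this job more robustly.

```python
import os

# Example contents of a .env file (placeholder values, not real credentials)
env_text = """
# Yelp credentials
yelp_client_id=my_client_id
yelp_api_key=my_api_key
"""

def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and comments."""
    pairs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # drop surrounding whitespace and optional quotes
        pairs[key.strip()] = value.strip().strip('"').strip("'")
    return pairs

pairs = parse_env(env_text)

# add the pairs to the environment of the current process,
# without overwriting variables that are already set
for key, value in pairs.items():
    os.environ.setdefault(key, value)

print(list(pairs))
```

Because the keys live in a separate file, you can share your notebook without sharing your credentials.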

### Querying the API

We repeat the same steps as before, but adding an authentication step. 

- **Step 0:** Load your API Keys
- **Step 1:** Look at the API documentation and endpoints, and construct a query of interest
- **Step 2:** Use requests.get(querystring) to call the API
- **Step 3:** Examine the response
- **Step 4:** Extract your data and save it. 


### Step 0: Load your API Keys

In [23]:
# load library to get environmental files
import os
from dotenv import load_dotenv


# load keys from  environment variables
load_dotenv() # .env file in cwd

# Print all environment variables
#for key, value in os.environ.items():
#    print(key, "=", value)

yelp_client = os.environ.get("yelp_client_id") 
yelp_key = os.environ.get("yelp_api_key")

# OR JUST HARD CODE YOUR API KEY HERE. NOT A GREAT PRACTICE!!!
#yelp_key = ""

# save your token in the header of the call
header = {'Authorization': f'Bearer {yelp_key}'}

In [24]:
# see here
header["Authorization"]

'Bearer <YOUR_API_KEY>'

### Step 1: Look at the API documentation and endpoints, and construct a query of interest

We will query the `/businesses/search` endpoint. Let's check together the documentation here: https://docs.developer.yelp.com/reference/v3_business_search


We will use two parameters: 

- location: This string indicates the geographic area to be used when searching for businesses
- term: Search term, e.g. "food" or "restaurants".

In [25]:
# endpoint
endpoint = "https://api.yelp.com/v3/businesses/search"

# Add as parameters
params ={"location":" Washington, DC 20017",
        "term":"best noodles restaurant"}

### **Step 2:** Use requests.get(endpoint) to call the API


In [26]:
# Make a get request with header + parameters
response = requests.get(endpoint, 
                        headers=header,
                        params=params)

### **Step 3:** Examine the response

Let's check the response code

In [27]:
# looking for a 200
response.status_code

200

### **Step 4:** Extract your data and save it. 



In [29]:
# What does the response look like?
yelp_json = response.json()

# print
print(yelp_json)

{'businesses': [{'id': 'Zvo_vQ-6gX5A-8OA-jutkA', 'alias': 'menya-hosaki-washington', 'name': 'Menya Hosaki', 'image_url': 'https://s3-media0.fl.yelpcdn.com/bphoto/A9UrKgBwyR1-1nIhSs2yZQ/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/menya-hosaki-washington?adjust_creative=GJK5eaHUqVE8eGMl0w0Pfg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=GJK5eaHUqVE8eGMl0w0Pfg', 'review_count': 458, 'categories': [{'alias': 'ramen', 'title': 'Ramen'}, {'alias': 'noodles', 'title': 'Noodles'}], 'rating': 4.7, 'coordinates': {'latitude': 38.942141, 'longitude': -77.024679}, 'transactions': [], 'price': '$$', 'location': {'address1': '845 Upshur St NW', 'address2': 'Fl 1', 'address3': None, 'city': 'Washington, DC', 'zip_code': '20011', 'country': 'US', 'state': 'DC', 'display_address': ['845 Upshur St NW', 'Fl 1', 'Washington, DC 20011']}, 'phone': '+13018187650', 'display_phone': '(301) 818-7650', 'distance': 2741.8108301595653}, {'id': 'cHvf3L4NAklSE4j9agOkHQ', '

In [30]:
yelp_json.keys()

dict_keys(['businesses', 'total', 'region'])

In [31]:
yelp_json["businesses"]

[{'id': 'Zvo_vQ-6gX5A-8OA-jutkA',
  'alias': 'menya-hosaki-washington',
  'name': 'Menya Hosaki',
  'image_url': 'https://s3-media0.fl.yelpcdn.com/bphoto/A9UrKgBwyR1-1nIhSs2yZQ/o.jpg',
  'is_closed': False,
  'url': 'https://www.yelp.com/biz/menya-hosaki-washington?adjust_creative=GJK5eaHUqVE8eGMl0w0Pfg&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=GJK5eaHUqVE8eGMl0w0Pfg',
  'review_count': 458,
  'categories': [{'alias': 'ramen', 'title': 'Ramen'},
   {'alias': 'noodles', 'title': 'Noodles'}],
  'rating': 4.7,
  'coordinates': {'latitude': 38.942141, 'longitude': -77.024679},
  'transactions': [],
  'price': '$$',
  'location': {'address1': '845 Upshur St NW',
   'address2': 'Fl 1',
   'address3': None,
   'city': 'Washington, DC',
   'zip_code': '20011',
   'country': 'US',
   'state': 'DC',
   'display_address': ['845 Upshur St NW', 'Fl 1', 'Washington, DC 20011']},
  'phone': '+13018187650',
  'display_phone': '(301) 818-7650',
  'distance': 2741.81083015956

The response is a long dictionary. The key "businesses" holds a list of dictionaries, one for each business.

**How to deal with this data?**

### Approach 1: Convert all to dataframe and clean it later

In [32]:
# convert to pd
df_yelp = pd.DataFrame(yelp_json["businesses"])

# see
df_yelp

# not looking really bad. 

Unnamed: 0,id,alias,name,image_url,is_closed,url,review_count,categories,rating,coordinates,transactions,price,location,phone,display_phone,distance
0,Zvo_vQ-6gX5A-8OA-jutkA,menya-hosaki-washington,Menya Hosaki,https://s3-media0.fl.yelpcdn.com/bphoto/A9UrKg...,False,https://www.yelp.com/biz/menya-hosaki-washingt...,458,"[{'alias': 'ramen', 'title': 'Ramen'}, {'alias...",4.7,"{'latitude': 38.942141, 'longitude': -77.024679}",[],$$,"{'address1': '845 Upshur St NW', 'address2': '...",13018187650.0,(301) 818-7650,2741.81083
1,cHvf3L4NAklSE4j9agOkHQ,astoria-dc-washington,Astoria DC,https://s3-media0.fl.yelpcdn.com/bphoto/MvSvbh...,False,https://www.yelp.com/biz/astoria-dc-washington...,567,"[{'alias': 'szechuan', 'title': 'Szechuan'}, {...",4.3,"{'latitude': 38.91052, 'longitude': -77.03813}",[],$$,"{'address1': '1521 17th St NW', 'address2': No...",,,4947.158258
2,3HglMxYPJ9UKtA86xq8-7w,thai-station-washington-2,Thai Station,https://s3-media0.fl.yelpcdn.com/bphoto/Lxxe6L...,False,https://www.yelp.com/biz/thai-station-washingt...,2,"[{'alias': 'thai', 'title': 'Thai'}]",5.0,"{'latitude': 38.920666999457104, 'longitude': ...","[pickup, delivery]",,"{'address1': '2300 Washington Pl NE', 'address...",12026356999.0,(202) 635-6999,1951.487888
3,Z0v0NOziKoF6g_NzO68Z6g,reren-washington,Reren,https://s3-media0.fl.yelpcdn.com/bphoto/WVbvqe...,False,https://www.yelp.com/biz/reren-washington?adju...,2463,"[{'alias': 'ramen', 'title': 'Ramen'}, {'alias...",4.1,"{'latitude': 38.900318145752, 'longitude': -77...","[pickup, delivery]",$$,"{'address1': '817 7th St NW', 'address2': None...",12022903677.0,(202) 290-3677,4870.563908
4,gLCGsMhmAOJ0ak8qawkUhg,da-hong-pao-washington,Da Hong Pao,https://s3-media0.fl.yelpcdn.com/bphoto/XdrwYq...,False,https://www.yelp.com/biz/da-hong-pao-washingto...,1111,"[{'alias': 'cantonese', 'title': 'Cantonese'},...",3.8,"{'latitude': 38.9093783217825, 'longitude': -7...",[delivery],$$,"{'address1': '1409 14th St NW', 'address2': ''...",12028467229.0,(202) 846-7229,4611.151418
5,HQ404fe3ndkAprE00uxWLA,toki-underground-washington,Toki Underground,https://s3-media0.fl.yelpcdn.com/bphoto/LWJnlZ...,False,https://www.yelp.com/biz/toki-underground-wash...,2599,"[{'alias': 'ramen', 'title': 'Ramen'}, {'alias...",4.1,"{'latitude': 38.90041, 'longitude': -76.98903}","[pickup, delivery, restaurant_reservation]",$$,"{'address1': '1234 H St NE', 'address2': '', '...",12023883086.0,(202) 388-3086,4219.05153
6,050Muo-FLJuGve6UWsBdWQ,the-noodle-lady-washington,The Noodle Lady,https://s3-media0.fl.yelpcdn.com/bphoto/LwmTfe...,False,https://www.yelp.com/biz/the-noodle-lady-washi...,2,"[{'alias': 'noodles', 'title': 'Noodles'}]",5.0,"{'latitude': 38.91717374446151, 'longitude': -...","[pickup, delivery]",,"{'address1': '2000 5th St NE', 'address2': Non...",12028081762.0,(202) 808-1762,2398.450819
7,3LC28t_YHclGWDmHul6PpA,pantry-thai-bistro-and-sushi-washington,Pantry - Thai Bistro and Sushi,https://s3-media0.fl.yelpcdn.com/bphoto/-2WaC_...,False,https://www.yelp.com/biz/pantry-thai-bistro-an...,159,"[{'alias': 'sushi', 'title': 'Sushi Bars'}, {'...",4.4,"{'latitude': 38.9365798787057, 'longitude': -7...","[pickup, delivery]",$$,"{'address1': '3716 Georgia Ave NW', 'address2'...",12026291643.0,(202) 629-1643,2711.356804
8,GMI7ecCz0Ylfw7hGRi56KA,ramen-by-uzu-washington-4,Ramen By Uzu,https://s3-media0.fl.yelpcdn.com/bphoto/9ce9XL...,False,https://www.yelp.com/biz/ramen-by-uzu-washingt...,179,"[{'alias': 'ramen', 'title': 'Ramen'}]",4.4,"{'latitude': 38.90872, 'longitude': -76.9976499}","[pickup, delivery]",$$,"{'address1': '1309 5th St NE', 'address2': '',...",,,3308.389953
9,iORMLvXAbFbe9xLDlT_C6g,baan-siam-washington,Baan Siam,https://s3-media0.fl.yelpcdn.com/bphoto/GgiBqZ...,False,https://www.yelp.com/biz/baan-siam-washington?...,688,"[{'alias': 'thai', 'title': 'Thai'}, {'alias':...",4.5,"{'latitude': 38.90147950746981, 'longitude': -...",[delivery],$$,"{'address1': '425 I St NW', 'address2': '', 'a...",12025885889.0,(202) 588-5889,4538.289964
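
Some columns in this dataframe, such as `coordinates` and `location`, still hold dictionaries. One possible way to clean them is to flatten each entry before building the dataframe. The sketch below is a small hand-written helper (`flatten` is a name made up for this example) applied to a trimmed-down entry copied from the output above.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries into a single-level dictionary."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested dictionaries
            flat.update(flatten(value, new_key, sep))
        else:
            # keep scalars and lists as they are
            flat[new_key] = value
    return flat

# A trimmed-down entry, copied from the Yelp output above
business = {
    "name": "Menya Hosaki",
    "rating": 4.7,
    "coordinates": {"latitude": 38.942141, "longitude": -77.024679},
    "location": {"city": "Washington, DC", "zip_code": "20011"},
}

flat_business = flatten(business)
print(flat_business)
```

With the full response, you could build the dataframe from `[flatten(b) for b in yelp_json["businesses"]]`. pandas also offers `pd.json_normalize()` for the same purpose.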


### Approach 2: write a function to collect the information you need

Assume you are interested in the id, name, url, latitude and longitude, and rating.

In [34]:
# function to clean and extract information from yelp
def clean_yelp(yelp_json):
    '''
    function to extract columns of interest from yelp json
    '''
    # create a temporary dictionary to store the information
    temp_yelp = {}
    
    # collect information
    temp_yelp["id"]= yelp_json["id"]
    temp_yelp["name"]= yelp_json["name"]
    temp_yelp["url"]= yelp_json["url"]
    temp_yelp["latitude"] = yelp_json["coordinates"]["latitude"]
    temp_yelp["longitude"] = yelp_json["coordinates"]["longitude"]
    temp_yelp["rating"]= yelp_json["rating"]
    
    # return the dictionary with the selected fields
    return temp_yelp
    

In [35]:
# apply to the dictionary
results_yelp = [clean_yelp(entry) for entry in yelp_json["businesses"]]

# Convert results to dataframe
yelp_df = pd.DataFrame(results_yelp)   
yelp_df

Unnamed: 0,id,name,url,latitude,longitude,rating
0,Zvo_vQ-6gX5A-8OA-jutkA,Menya Hosaki,https://www.yelp.com/biz/menya-hosaki-washingt...,38.942141,-77.024679,4.7
1,cHvf3L4NAklSE4j9agOkHQ,Astoria DC,https://www.yelp.com/biz/astoria-dc-washington...,38.91052,-77.03813,4.3
2,3HglMxYPJ9UKtA86xq8-7w,Thai Station,https://www.yelp.com/biz/thai-station-washingt...,38.920667,-76.994875,5.0
3,Z0v0NOziKoF6g_NzO68Z6g,Reren,https://www.yelp.com/biz/reren-washington?adju...,38.900318,-77.021591,4.1
4,gLCGsMhmAOJ0ak8qawkUhg,Da Hong Pao,https://www.yelp.com/biz/da-hong-pao-washingto...,38.909378,-77.03174,3.8
5,HQ404fe3ndkAprE00uxWLA,Toki Underground,https://www.yelp.com/biz/toki-underground-wash...,38.90041,-76.98903,4.1
6,050Muo-FLJuGve6UWsBdWQ,The Noodle Lady,https://www.yelp.com/biz/the-noodle-lady-washi...,38.917174,-76.999691,5.0
7,3LC28t_YHclGWDmHul6PpA,Pantry - Thai Bistro and Sushi,https://www.yelp.com/biz/pantry-thai-bistro-an...,38.93658,-77.02467,4.4
8,GMI7ecCz0Ylfw7hGRi56KA,Ramen By Uzu,https://www.yelp.com/biz/ramen-by-uzu-washingt...,38.90872,-76.99765,4.4
9,iORMLvXAbFbe9xLDlT_C6g,Baan Siam,https://www.yelp.com/biz/baan-siam-washington?...,38.90148,-77.016342,4.5
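
Not every business comes with every field. In the full dataframe above, for example, `price` appears to be missing for some entries, and indexing a missing key with square brackets raises a `KeyError`. A more tolerant variant, sketched below under that assumption, uses `.get()` so that missing fields become `None`. The function name `clean_yelp_safe` and the sample entries are made up for illustration; the second entry is deliberately stripped down.

```python
def clean_yelp_safe(business):
    """Extract selected fields, using None when a field is absent."""
    coordinates = business.get("coordinates") or {}
    return {
        "id": business.get("id"),
        "name": business.get("name"),
        "price": business.get("price"),  # not present for every business
        "latitude": coordinates.get("latitude"),
        "longitude": coordinates.get("longitude"),
        "rating": business.get("rating"),
    }

# Illustrative entries: the second one lacks "price" and "coordinates"
sample = [
    {"id": "Zvo_vQ-6gX5A-8OA-jutkA", "name": "Menya Hosaki", "price": "$$",
     "rating": 4.7,
     "coordinates": {"latitude": 38.942141, "longitude": -77.024679}},
    {"id": "3HglMxYPJ9UKtA86xq8-7w", "name": "Thai Station", "rating": 5.0},
]

rows = [clean_yelp_safe(entry) for entry in sample]
print(rows)
```

The list of dictionaries can be passed to `pd.DataFrame()` exactly as before; the `None` values show up as missing data.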


#### Save the json

Remember to always save your response from the API call. You don't want to be querying the API all the time to grab the same data. 

In [118]:
import json

with open("yelp_results.json", 'w') as f:
    # write the dictionary to the file as JSON
    json.dump(response.json(), f, indent=4)
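
Once the file is saved, you can load it back instead of calling the API again. The sketch below shows the round trip with a small stand-in dictionary and a temporary folder, so it does not overwrite any real file; with your own data, you would open `yelp_results.json` directly.

```python
import json
import os
import tempfile

# Small stand-in for the API response (the real one is much larger)
data = {"businesses": [{"name": "Menya Hosaki", "rating": 4.7}], "total": 1}

# write to a temporary folder so this sketch does not overwrite real files
path = os.path.join(tempfile.mkdtemp(), "yelp_results.json")
with open(path, "w") as f:
    json.dump(data, f, indent=4)

# later: load the saved file instead of querying the API again
with open(path) as f:
    loaded = json.load(f)

print(loaded == data)
```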

## Practice

Make a successful query to the Yelp API using your favorite type of food. You only need to repeat what we did before, changing the search term a bit. 

In [None]:
# code here

## Example 3 : YouTube API

Now let's move to our last example. 

We will be working with the YouTube API. This is a complex API, but luckily for us, other programmers have already created a Python wrapper to access the API. We will use the [youtube-data-api](https://youtube-data-api.readthedocs.io/en/latest/youtube_api.html) library, which contains a set of functions to facilitate access to the API. 

### What kind of data can you get from the YouTube API?

YouTube has a very extensive API. There is a lot of data you can get access to. See a comprehensive list [here](https://developers.google.com/youtube/v3/docs/).

What is included in the package:

- video metadata
- channel metadata
- playlist metadata
- subscription metadata
- featured channel metadata
- comment metadata
- search results

### How to Install

The software is on PyPI, so you can install it via `pip`.
   

In [60]:
#!pip install youtube-data-api

### How to get an API key

#### A quick guide: [https://developers.google.com/youtube/v3/getting-started](https://developers.google.com/youtube/v3/getting-started)

1. You need a Google Account to access the Google API Console, request an API key, and register your application. You can use your Gmail account for this if you have one.

2. Create a project in the <a href="https://console.developers.google.com/apis/">Google Developers Console</a> and <a href="https://developers.google.com/youtube/registering_an_application">obtain authorization credentials</a> so your application can submit API requests.

3. After creating your project, make sure the YouTube Data API is one of the services that your application is registered to use.

    a. Go to the <a href="https://console.developers.google.com/apis/">API Console</a> and select the project that you just registered.

    b. Visit the <a href="https://console.developers.google.com/apis/enabled">Enabled APIs page</a>. In the list of APIs, make sure the status is ON for the YouTube Data API v3. You do not need to enable OAuth 2.0 since there are no methods in the package that require it.
        

In [37]:
# call some libraries
import os
import datetime
import pandas as pd

In [38]:
#Import YouTubeDataAPI
from youtube_api import YouTubeDataAPI
from youtube_api.youtube_api_utils import *
from dotenv import load_dotenv

In [39]:
# load keys from  environmental var
load_dotenv() # .env file in cwd
api_key = os.environ.get("YT_KEY")

In [40]:
# create a client 
# this is what we call: instantiate the class
yt = YouTubeDataAPI(api_key)
print(yt)

<youtube_api.youtube_api.YouTubeDataAPI object at 0x169164c90>


#### Starting with a channel name and getting some basic metadata

Let's start with the `LastWeekTonight` channel

[https://www.youtube.com/user/LastWeekTonight](https://www.youtube.com/user/LastWeekTonight)

First, we need to get the channel id.

In [41]:
channel_id = yt.get_channel_id_from_user('LastWeekTonight')
print(channel_id)

UC3XTzVzaHQEd30rQbuvCtTQ


#### Channel metadata

In [42]:
# collect metadata
yt.get_channel_metadata(channel_id)

{'channel_id': 'UC3XTzVzaHQEd30rQbuvCtTQ',
 'title': 'LastWeekTonight',
 'account_creation_date': 1395178899.0,
 'keywords': None,
 'description': 'Breaking news on a weekly basis. Sundays at 11PM - only on HBO.\nSubscribe to the Last Week Tonight channel for the latest videos from John Oliver and the LWT team.',
 'view_count': '4518066054',
 'video_count': '899',
 'subscription_count': '10000000',
 'playlist_id_likes': '',
 'playlist_id_uploads': 'UU3XTzVzaHQEd30rQbuvCtTQ',
 'topic_ids': 'https://en.wikipedia.org/wiki/Entertainment|https://en.wikipedia.org/wiki/Politics|https://en.wikipedia.org/wiki/Television_program|https://en.wikipedia.org/wiki/Society',
 'country': None,
 'collection_date': datetime.datetime(2025, 10, 28, 11, 51, 14, 699080)}
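
Note that the counts come back as strings and `account_creation_date` is a Unix timestamp (seconds since 1970-01-01 UTC). A small sketch of how one might convert them before analysis, using values copied from the output above; the helper name `tidy_channel_metadata` is ours, not part of the package:

```python
from datetime import datetime, timezone


def tidy_channel_metadata(meta):
    """Return a copy with counts as integers and the creation date as a UTC datetime."""
    tidy = dict(meta)
    for key in ("view_count", "video_count", "subscription_count"):
        tidy[key] = int(meta[key])
    tidy["account_creation_date"] = datetime.fromtimestamp(
        meta["account_creation_date"], tz=timezone.utc
    )
    return tidy


# values copied from the output above
example = {
    "account_creation_date": 1395178899.0,
    "view_count": "4518066054",
    "video_count": "899",
    "subscription_count": "10000000",
}
tidy = tidy_channel_metadata(example)
print(tidy["account_creation_date"].date())  # 2014-03-18
print(tidy["video_count"] + 1)               # 900
```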

#### Subscriptions of the channel

In [43]:
pd.DataFrame(yt.get_subscriptions(channel_id))

Unnamed: 0,subscription_title,subscription_channel_id,subscription_kind,subscription_publish_date,collection_date
0,trueblood,UCPnlBOg4_NU9wdhRN-vzECQ,youtube#channel,1395357000.0,2025-10-28 11:51:47.869956
1,GameofThrones,UCQzdMyuz0Lf4zo4uGcEujFw,youtube#channel,1395357000.0,2025-10-28 11:51:47.869984
2,HBO,UCVTQuK2CaWaTgSsoNkn5AiQ,youtube#channel,1395357000.0,2025-10-28 11:51:47.870002
3,HBOBoxing,UCWPQB43yGKEum3eW0P9N_nQ,youtube#channel,1395357000.0,2025-10-28 11:51:47.870020
4,Cinemax,UCYbinjMxWwjRpp4WqgDqEDA,youtube#channel,1424812000.0,2025-10-28 11:51:47.870046
5,HBODocs,UCbKo3HsaBOPhdRpgzqtRnqA,youtube#channel,1395357000.0,2025-10-28 11:51:47.870066
6,HBOLatino,UCeKum6mhlVAjUFIW15mVBPg,youtube#channel,1395357000.0,2025-10-28 11:51:47.870085
7,OfficialAmySedaris,UCicerXLHzJaKYHm1IwvTn8A,youtube#channel,1461561000.0,2025-10-28 11:51:47.870108
8,Real Time with Bill Maher,UCy6kyFxaMqGtpE3pQTflK8A,youtube#channel,1418342000.0,2025-10-28 11:51:47.870126


#### List of videos of the channel
To get every video a channel has ever posted, first convert the `channel_id` into the id of its uploads playlist, using a helper function from the package's `youtube_api_utils` module. With the playlist id, you can then retrieve the video ids and, from those, collect metadata, comments, and more.

In [44]:
# import only the helper we need from the utilities module
from youtube_api.youtube_api_utils import get_upload_playlist_id
playlist_id = get_upload_playlist_id(channel_id)
print(playlist_id)

UU3XTzVzaHQEd30rQbuvCtTQ
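
Comparing the two ids shows the pattern: the uploads playlist id is the channel id with the `UC` prefix replaced by `UU`. The same value also appears as `playlist_id_uploads` in the channel metadata above. The sketch below reproduces that convention by hand, purely for illustration; the function name `upload_playlist_from_channel` is hypothetical, and in practice you should keep using the package's helper:

```python
def upload_playlist_from_channel(channel_id):
    """Swap the 'UC' channel prefix for the 'UU' uploads-playlist prefix."""
    if not channel_id.startswith("UC"):
        raise ValueError(f"expected a channel id starting with 'UC', got {channel_id!r}")
    return "UU" + channel_id[2:]


print(upload_playlist_from_channel("UC3XTzVzaHQEd30rQbuvCtTQ"))
# UU3XTzVzaHQEd30rQbuvCtTQ
```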


In [45]:
## Get video ids
videos = yt.get_videos_from_playlist_id(playlist_id)
df = pd.DataFrame(videos)
print(df)

        video_id                channel_id  publish_date  \
0    Mwc21oNdnaA  UC3XTzVzaHQEd30rQbuvCtTQ  1.761601e+09   
1    Ejoi9yfLVCc  UC3XTzVzaHQEd30rQbuvCtTQ  1.761561e+09   
2    73iQpsIE0i8  UC3XTzVzaHQEd30rQbuvCtTQ  1.761548e+09   
3    m9ExweRMFAA  UC3XTzVzaHQEd30rQbuvCtTQ  1.761000e+09   
4    s9FsxWK0f1A  UC3XTzVzaHQEd30rQbuvCtTQ  1.760944e+09   
..           ...                       ...           ...   
894  Dh9munYYoqQ  UC3XTzVzaHQEd30rQbuvCtTQ  1.398670e+09   
895  k8lJ85pfb_E  UC3XTzVzaHQEd30rQbuvCtTQ  1.398669e+09   
896  WHCQndalv94  UC3XTzVzaHQEd30rQbuvCtTQ  1.398663e+09   
897  8q7esuODnQI  UC3XTzVzaHQEd30rQbuvCtTQ  1.395379e+09   
898  gdQCtWlhx90  UC3XTzVzaHQEd30rQbuvCtTQ  1.395379e+09   

               collection_date  
0   2025-10-28 11:52:02.294295  
1   2025-10-28 11:52:02.294330  
2   2025-10-28 11:52:02.294353  
3   2025-10-28 11:52:02.294375  
4   2025-10-28 11:52:02.294397  
..                         ...  
894 2025-10-28 11:52:04.807395  
895 2025-10-28 

#### Collect video metadata

In [46]:
# id for videos as a list
df.video_id.tolist()

['Mwc21oNdnaA',
 'Ejoi9yfLVCc',
 '73iQpsIE0i8',
 'm9ExweRMFAA',
 's9FsxWK0f1A',
 '-xIQxzlXN-0',
 'cicEeYtFd1M',
 'a277Pg23Dao',
 '-6r-lAxhtv4',
 'bl6Ww92bb0o',
 'gieTx_P6INQ',
 '9iZK_DurYOo',
 'xQwGv4UYvbk',
 '88YixeXbRMo',
 'eHJwoYjTyyE',
 'SCv0hlq5iQY',
 'JOoHELC8w8M',
 'dB1-lg-xZWc',
 'NtHZ7IR88dg',
 'S9EVYaSa1Ws',
 'Wg8OcJopuBE',
 'z016SEN7HzE',
 'wn46weQY0DY',
 '0_Bwix9IjOE',
 'G97ew5P68vI',
 '8ek4E7vkiQw',
 'KU5eY7wBkNI',
 'IhmjgncyBiI',
 'k315NvOdHvc',
 'hSJtzwy6E_M',
 'ohPToBog_-g',
 'gZlw0gKXQAI',
 'I9yskcTgalA',
 'osh1v6FqUhc',
 'xk94il8L820',
 '4T-CHSNHx6U',
 'dijMKwZMU2Q',
 'eP5vqvGLC1I',
 'W3V1oswZM-o',
 '7wdGf-48OKo',
 'MEtfL3fBaRU',
 'kSZ9DNj5_Ts',
 'lcbnhXKvi-c',
 '3lzfH86avIc',
 'E02gokpa7Ug',
 'o6g62JCgWxc',
 'VPtAWwRBHUc',
 'VJAIChXYQVk',
 'ZSTM7-rq4eE',
 'DfTBhrkae74',
 '2psRpQvFSno',
 'lrZlBPJYH-Y',
 'yQ_xxbVtcYo',
 'eYjuA0Yc-7Y',
 '_R3uMbUyyaI',
 'xNo8Ve-Ej6U',
 'TMk28nV9nkI',
 'otn03GFKpgE',
 '7GwfM53wqak',
 'qjafPzoy4eI',
 'ZTgI0mTwUCs',
 'XBAtPdMej98',
 'R_VsQd

In [47]:
# grab metadata for the first five videos
video_meta = yt.get_video_metadata(df.video_id.tolist()[:5])

In [48]:
# view the metadata as a data frame
pd.DataFrame(video_meta)

Unnamed: 0,video_id,channel_title,channel_id,video_publish_date,video_title,video_description,video_category,video_view_count,video_comment_count,video_like_count,video_dislike_count,video_thumbnail,video_tags,collection_date
0,Mwc21oNdnaA,LastWeekTonight,UC3XTzVzaHQEd30rQbuvCtTQ,1761601000.0,Trump & Military Strikes #lastweektonight,Congress should be reviewing Trump’s plans for...,24,454001,1026,22559,,https://i.ytimg.com/vi/Mwc21oNdnaA/hqdefault.jpg,,2025-10-28 11:52:14.590915
1,Ejoi9yfLVCc,LastWeekTonight,UC3XTzVzaHQEd30rQbuvCtTQ,1761561000.0,Medicare Advantage: Last Week Tonight with Joh...,John Oliver details what Medicare Advantage is...,24,1943384,6072,69846,,https://i.ytimg.com/vi/Ejoi9yfLVCc/hqdefault.jpg,,2025-10-28 11:52:14.590952
2,73iQpsIE0i8,LastWeekTonight,UC3XTzVzaHQEd30rQbuvCtTQ,1761548000.0,S12 E27: Trump’s Week & Medicare Advantage: 10...,John Oliver discusses the many bold moves Dona...,24,290853,883,9744,,https://i.ytimg.com/vi/73iQpsIE0i8/hqdefault.jpg,,2025-10-28 11:52:14.590979
3,m9ExweRMFAA,LastWeekTonight,UC3XTzVzaHQEd30rQbuvCtTQ,1761000000.0,Air Bud Pt. II #lastweektonight,That should sum it up! Please watch our full s...,24,412053,128,12869,,https://i.ytimg.com/vi/m9ExweRMFAA/hqdefault.jpg,,2025-10-28 11:52:14.591003
4,s9FsxWK0f1A,LastWeekTonight,UC3XTzVzaHQEd30rQbuvCtTQ,1760944000.0,Air Bud Pt. II: Last Week Tonight with John Ol...,John Oliver discusses the Air Bud franchise. A...,24,2623488,4958,87196,,https://i.ytimg.com/vi/s9FsxWK0f1A/hqdefault.jpg,,2025-10-28 11:52:14.591030
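
Here we only request the first five videos. For the full list of ids, it can help to work in batches, so that a failed request does not lose everything collected so far. A sketch of a batching helper; the batch size of 50 is an assumption based on the limit commonly applied by the underlying `videos.list` endpoint, and the wrapper may already batch internally:

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


video_ids = [f"id_{i}" for i in range(120)]  # placeholder ids for illustration
batches = list(chunked(video_ids, size=50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

Each batch could then be passed to `yt.get_video_metadata(batch)` and the resulting data frames concatenated.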


In [49]:
## Collect Comments
ids = df.video_id.tolist()[:5]

In [50]:
ids

['Mwc21oNdnaA', 'Ejoi9yfLVCc', '73iQpsIE0i8', 'm9ExweRMFAA', 's9FsxWK0f1A']

In [51]:
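# loop over the video ids and collect comments for each one
list_comments = []

for video_id in ids:
    comments = yt.get_video_comments(video_id, max_results=10)
    list_comments.append(pd.DataFrame(comments))

# concatenate into a single data frame with a fresh index
df = pd.concat(list_comments, ignore_index=True)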
df.head()

Unnamed: 0,video_id,commenter_channel_url,commenter_channel_id,commenter_channel_display_name,comment_id,comment_like_count,comment_publish_date,text,commenter_rating,comment_parent_id,collection_date,reply_count
0,Mwc21oNdnaA,http://www.youtube.com/@2shabbs,UCpYMg-maUp0rQ-ys7UKk8Qw,@2shabbs,Ugw9C6hO0EsZTt4i2Yd4AaABAg,0,1761681000.0,Finally cleanin' up the country.,none,,2025-10-28 11:53:11.139617,0
1,Mwc21oNdnaA,http://www.youtube.com/@electronic_rat,UCaT8EG2XLzigv5jrOwxevdQ,@electronic_rat,UgzOD7VAyBIsamB8V-Z4AaABAg,0,1761681000.0,Genuinely at what point do other countries ste...,none,,2025-10-28 11:53:11.139656,0
2,Mwc21oNdnaA,http://www.youtube.com/@lizardbytheriver1567,UC5qGGBPmNlReBI36k2pT8LA,@lizardbytheriver1567,Ugx90I8vaVFHDu6jWFx4AaABAg,0,1761681000.0,We are killing people in Venezuela and Colombi...,none,,2025-10-28 11:53:11.139713,0
3,Mwc21oNdnaA,http://www.youtube.com/@allthingsawesome4589,UCqEsvN3XsJq1CSJO9lMOOIg,@allthingsawesome4589,Ugy46lUvm1prV4OyLHB4AaABAg,0,1761681000.0,If they’re not from our country then they have...,none,,2025-10-28 11:53:11.139756,0
4,Mwc21oNdnaA,http://www.youtube.com/@lizardbytheriver1567,UC5qGGBPmNlReBI36k2pT8LA,@lizardbytheriver1567,UgzVwOdtNaFlG9ao--N4AaABAg,0,1761681000.0,I love that Libertarianism and Anarcho-Capital...,none,,2025-10-28 11:53:11.139783,0
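
Requests for individual videos can fail, for example when comments are disabled or the quota runs out, and a single exception would stop a plain loop. A minimal sketch of a more forgiving loop; `collect_comments` and its `fetch` argument are hypothetical names, and the stand-in fetcher below only simulates the API so the example runs without a key:

```python
def collect_comments(video_ids, fetch):
    """Call `fetch(video_id)` for each id, keeping results and recording failures."""
    rows, failed = [], []
    for video_id in video_ids:
        try:
            rows.extend(fetch(video_id))
        except Exception as err:  # broad on purpose: keep going, inspect failures later
            failed.append((video_id, str(err)))
    return rows, failed


def fake_fetch(video_id):
    """Stand-in for the real API call; fails for one id to show the error path."""
    if video_id == "bad_id":
        raise RuntimeError("comments disabled")
    return [{"video_id": video_id, "text": "example comment"}]


rows, failed = collect_comments(["a", "bad_id", "b"], fake_fetch)
print(len(rows), failed)  # 2 [('bad_id', 'comments disabled')]
```

With the real client, `fetch` could be `lambda vid: yt.get_video_comments(vid, max_results=10)`, and `pd.DataFrame(rows)` would give the combined table.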


#### Search

The YouTube API also allows you to search for videos using queries. This is very cool!


In [57]:
df = pd.DataFrame(yt.search(q='government shutdown', max_results=50))
df

Unnamed: 0,video_id,channel_title,channel_id,video_publish_date,video_title,video_description,video_category,video_thumbnail,collection_date
0,-ez7XCdBDV4,NewsNation,UCCjG8NtOig0USdrT5D1FpxQ,1761629000.0,Politicians have a choice to end government sh...,Stephen A. Smith tells “CUOMO” the government ...,,https://i.ytimg.com/vi/-ez7XCdBDV4/hqdefault.jpg,2025-10-28 11:56:11.539261
1,RDTxsIq9dUE,Fox Business,UCCXoCcu9Rp7NPbTzIvogpZg,1761588000.0,&#39;ECONOMIC SHUTDOWN&#39;: Democrats are giv...,"Sen. Deb Fischer, R-Neb., discusses how long t...",,https://i.ytimg.com/vi/RDTxsIq9dUE/hqdefault.jpg,2025-10-28 11:56:11.539286
2,dOPjYqZ6slw,TODAY,UChDKyKQ59fYz3JO2fl0Z6sg,1761668000.0,Government Shutdown Enters Day 28 Prompting Hu...,"As the government shutdown enters Day 28, the ...",,https://i.ytimg.com/vi/dOPjYqZ6slw/hqdefault.jpg,2025-10-28 11:56:11.539306
3,QnkOUIYAu8E,ABC News,UCBi2mrWuNuyYy4gbM6fU18Q,1761587000.0,Government shutdown enters 4th week with no en...,The FAA was forced to slow air traffic on Sund...,,https://i.ytimg.com/vi/QnkOUIYAu8E/hqdefault.jpg,2025-10-28 11:56:11.539324
4,akzGTDtp3rk,Face the Nation,UC2x_qNZXKDxNP1j4jvHATRQ,1761674000.0,Watch Live: Senate votes on funding bill again...,The Senate is voting on a GOP-backed funding m...,,https://i.ytimg.com/vi/akzGTDtp3rk/hqdefault_l...,2025-10-28 11:56:11.539342
5,TFH9Ay86OWM,WREG News Channel 3,UCueJOS3ZjWyTtRU9bnx9XFw,1761621000.0,No SNAP benefits on Nov. 1 due to government s...,No SNAP benefits on Nov. 1 due to government s...,,https://i.ytimg.com/vi/TFH9Ay86OWM/hqdefault.jpg,2025-10-28 11:56:11.539360
6,oDzVD3oaSUI,MSNBC,UCaXkIU1QidjPwiAYu6GcHjg,1761605000.0,Democrat hits Speaker Johnson as government sh...,"New Jersey Congressman Robert Menendez, Jr. (D...",,https://i.ytimg.com/vi/oDzVD3oaSUI/hqdefault.jpg,2025-10-28 11:56:11.539378
7,zG4DzrfZxF4,WUSA9,UCcT6w3xUyVshyR2_2vrMp1w,1761674000.0,Day 28 of government shutdown,Three things to know on day 28 of the federal ...,,https://i.ytimg.com/vi/zG4DzrfZxF4/hqdefault.jpg,2025-10-28 11:56:11.539396
8,jCviEP0_Ob0,MSNBC,UCaXkIU1QidjPwiAYu6GcHjg,1761575000.0,Democrats and Republicans remain at impasse am...,Monday marks the 27th day of the government sh...,,https://i.ytimg.com/vi/jCviEP0_Ob0/hqdefault.jpg,2025-10-28 11:56:11.539413
9,815UPTeTfoM,FOX 4 Dallas-Fort Worth,UCruQg25yVBppUWjza8AlyZA,1761672000.0,🔴 LIVE: Government Shutdown 2025 Day 28 | Update,Transportation Secretary Duffy and NATCA Presi...,,https://i.ytimg.com/vi/815UPTeTfoM/hqdefault_l...,2025-10-28 11:56:11.539430
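
Some titles in the results contain HTML entities such as `&#39;` (an apostrophe). A short sketch of cleaning them with the standard library's `html.unescape`; the sample title is copied, in its truncated form, from the output above:

```python
import html

raw_title = "&#39;ECONOMIC SHUTDOWN&#39;: Democrats are giv..."
clean_title = html.unescape(raw_title)
print(clean_title)  # 'ECONOMIC SHUTDOWN': Democrats are giv...
```

On the data frame, the same function can be applied to a whole column with `df["video_title"].map(html.unescape)`.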


Some cool research using the YouTube API:
    
- [Lei et al, Estimating the Ideology of Political YouTube Videos](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4088828)

- [Brown et al, Echo Chambers, Rabbit Holes, and Algorithmic Bias: How YouTube Recommends Content to Real Users](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4114905)
    

