# Lecture 6

January 24, 2022

### Announcements

* Project teams are created on Piazza
* Project proposal

### Last time

* numpy and pandas
* numpy for statistical models

### Topics

* Web APIs
* The __requests__ package

### Data Sets

* [iTunes Search API](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/)


### References

* [__requests__ documentation](http://docs.python-requests.org/en/master/)
* Python for Data Analysis, Ch. 6


## Getting Data from the Web

Three ways you can get data from the web, from most to least convenient:

1. Direct download or "data dump"
2. API
3. Scraping

Always look for a direct download first!

### Web Scraping vs. APIs

_Web Scraping_ refers to the process of extracting data from a website or specific webpage.

An _application programming interface_ (API) is a collection of functions and data structures for communicating with other software. For instance, whenever you use a Python package, you're using the API created by the package's developers.

The goal of both web scraping and APIs is to access web data.

Web scraping lets you extract data from the pages of any website, using software that reads those pages. An API, on the other hand, gives you direct access to the data you want.

Websites sometimes provide an API so that programmers can access content without web scraping.


### Questions

How can we call a function in a web API?

### Hypertext Transfer Protocol

The hypertext transfer protocol (HTTP) is a set of rules for communicating over the internet.

For example, your web browser uses HTTP every time you visit a web page. The browser makes a _request_ to the server for the page, and if nothing goes wrong, the server _responds_ with the page. If you have Firefox or Chrome, you can inspect these requests with your browser's web developer tools (`Ctrl-Shift-I` or `F12`).

Several [different kinds of HTTP requests](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods) are possible. Think of these as the different "verbs" you can use when communicating in HTTP.
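To make the "verb" idea concrete, here is a minimal sketch that builds two request objects without sending them. It uses the standard-library `urllib.request` module rather than __requests__ only so that it runs without a network connection. The URL is the iTunes search endpoint used later in these notes, and the POST body is a made-up placeholder, not something that API expects.

```python
from urllib.request import Request

url = "https://itunes.apple.com/search"

# A GET request asks the server to send back a resource.
get_request = Request(url, method="GET")

# A POST request sends data to the server in the request body.
# The body below is a placeholder chosen for illustration.
post_request = Request(url, data=b"example=1", method="POST")

for request in (get_request, post_request):
    # Nothing is sent over the network; we only inspect the request objects.
    print(request.get_method(), request.full_url)
```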

Many protocols exist for communicating over the internet. For instance, you may have heard of _file transfer protocol_ (FTP) for transferring files, or _simple mail transfer protocol_ (SMTP) for sending/receiving email. However, web APIs almost always use HTTP.

### Representational State Transfer

The most popular kind of web API is a _representational state transfer_ (REST) API. In a REST API:
   
* Each function has a different URL, sometimes called an _endpoint_.
* The server handles separate function calls independently of each other.

We can use the [Star Wars API](https://swapi.co/) to answer our first example question. One of the endpoints in the Star Wars API is `https://swapi.co/api/`. This endpoint returns a list of all other endpoints in the API.

When you first use a web API, check the documentation to find out what the endpoints are and what kind of HTTP requests to use. If the documentation doesn't mention what kind of HTTP request to use, then GET is usually the right choice.

### Making Requests

Python's __requests__ package provides functions for making HTTP requests and is [well-documented](http://docs.python-requests.org/en/master/).

Let's use the endpoint we learned from the Star Wars API.

In [None]:
import requests

# The endpoint described above, which lists the other endpoints in the API.
response = requests.get("https://swapi.co/api/")

### Query Strings

Most of the functions we use have parameters, and you can pass arguments for those parameters when you call a function.

Endpoints in REST APIs work the same way, but the syntax is different. You can pass arguments by adding `?PARAMETER=ARGUMENT` to the end of the URL. Parameter and argument pairs are separated by `&`. This syntax is called a _query string_.
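As a rough illustration of the syntax, the sketch below assembles a query string by hand with the standard-library `urllib.parse` module and then takes it apart again. The endpoint and the parameter names `term` and `country` are the ones used with the iTunes API later in these notes. In practice, __requests__ can build the query string for you.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

endpoint = "https://itunes.apple.com/search"
arguments = {"term": "dynamite", "country": "US"}

# urlencode joins PARAMETER=ARGUMENT pairs with "&" and escapes special characters.
query_string = urlencode(arguments)
url = endpoint + "?" + query_string
print(url)

# Going the other way: split the URL and recover the arguments.
# parse_qs returns a list of values for each parameter.
recovered = parse_qs(urlsplit(url).query)
print(recovered)
```

Characters that are not allowed in a URL get escaped along the way. For example, a space in an argument becomes `+`.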

For instance, Apple provides a web API for the iTunes store, with [documentation](https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/). 

We can use this to try to answer the question: how many remixes of a song are there?

The search endpoint is `https://itunes.apple.com/search`, and the documentation lists several parameters. We can use __requests__ to build the query string automatically.

What was the most popular song of 2021? Apple Music's [Top Songs of 2021: Global](https://music.apple.com/us/playlist/top-songs-of-2021-global/pl.db803163f811479e9d00f921f74684fc) playlist is one place to look.

In [None]:
response = requests.get("https://itunes.apple.com/search", params = {
        "term": "dynamite",
        "country": "US"
    })

A response to an HTTP request always includes a status code that summarizes whether the request was successful. Wikipedia has a full [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Generally,

* 200-299: Your request succeeded.
* 300-399: You need to take further action to complete the request.
* 400-499: Your request wasn't valid (you made a mistake). You've probably seen 404 before!
* 500-599: Your request failed (the server made a mistake).
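The ranges above can be written as a small helper. This is only a sketch: the function name `status_class` and its labels are invented for illustration, while `http.HTTPStatus` is the standard-library table of status codes and their names.

```python
from http import HTTPStatus

def status_class(code):
    """Return a short description of the range an HTTP status code falls in."""
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "further action needed"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    return "other"

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard name for the code, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```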

You can have __requests__ check the status for you with the `.raise_for_status()` method:

In [None]:
response.raise_for_status()

Once you have the response, now what? Where's the data?

Different web APIs use different formats. Again, see the documentation. Two common formats are:

* _JavaScript Object Notation_ (JSON): JSON looks and works a lot like Python lists and dictionaries. Lists are surrounded with `[ ]`, and dictionaries are surrounded with `{ }`. There are many Python libraries for reading JSON into lists and dictionaries. A Jupyter notebook file (`.ipynb`) is an example of a file in JSON format.

* _eXtensible Markup Language_ (XML): XML uses "tags" denoted by `< >` to mark up sections of text. We'll learn more about XML when we learn about web scraping, since XML is very similar to hypertext markup language (HTML), the language used to build web pages.
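To compare the two formats side by side, here is a small sketch that reads the same record from JSON and from XML using only the standard library. The record itself (title and tags) is made up for illustration and does not come from any API.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record written in both formats.
json_text = '{"title": "Example Song", "tags": ["pop", "remix"]}'
xml_text = "<track><title>Example Song</title><tag>pop</tag><tag>remix</tag></track>"

# JSON becomes ordinary dictionaries and lists.
record = json.loads(json_text)
print(record["title"], record["tags"])

# XML becomes a tree of elements that we search by tag name.
root = ET.fromstring(xml_text)
print(root.findtext("title"), [tag.text for tag in root.findall("tag")])
```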

The [Star Wars API](https://swapi.co/documentation) returns data in JSON format, and so does the iTunes API we just queried.

We can inspect the raw content (bytes) of a response with the `.content` attribute. If we know the response is in a text format, we can use `.text` to see the content as an ordinary Python string.

In [None]:
response.text

Since the response we got is in JSON format, we'd like to convert the string to lists and dictionaries. The __requests__ package provides a method `.json()` to do this.

In [None]:
result = response.json()
result
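The relationship between `.content`, `.text`, and `.json()` can be sketched offline with the standard library. The bytes below are a made-up stand-in for a real response body, and the sketch assumes the body is UTF-8 encoded. A real response may use a different encoding, which __requests__ tries to work out for you.

```python
import json

# Stand-in for response.content: raw bytes, here UTF-8 encoded JSON (made-up data).
content = '{"resultCount": 1, "results": [{"trackName": "Café Song"}]}'.encode("utf-8")

# Roughly what .text gives: the bytes decoded into a Python string.
text = content.decode("utf-8")

# Roughly what .json() gives: the string parsed into dictionaries and lists.
data = json.loads(text)

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data["results"][0]["trackName"])
```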

In [None]:
result["results"]

### Being Polite

Making an HTTP request is not free! It has a real cost in CPU time and also cash. Server administrators will not appreciate it if you make too many requests or make requests too quickly. So:

* Use `time.sleep()` to slow down any requests you make in a loop. Aim for no more than 20-30 requests per second.
* Install and use the __requests_cache__ package to avoid downloading extra data when you make the same request twice.

Failing to be polite can get you banned from websites!
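Both habits can be sketched in a few lines. The helper `polite_fetch_all` and the stand-in `fake_fetch` below are hypothetical names written for this example so that it runs offline. In real code, `fetch` would wrap something like `requests.get`, and the __requests_cache__ package would take care of the caching.

```python
import time

def polite_fetch_all(urls, fetch, delay=0.05, cache=None):
    """Fetch each URL at most once, pausing between real requests.

    `fetch` is any function that takes a URL and returns its content.
    """
    cache = {} if cache is None else cache
    for url in urls:
        if url in cache:
            continue  # already downloaded: no request, no pause
        cache[url] = fetch(url)
        time.sleep(delay)  # at most 1 / delay requests per second
    return cache

# Stand-in for a real download so the sketch runs offline.
calls = []
def fake_fetch(url):
    calls.append(url)
    return "content of " + url

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
pages = polite_fetch_all(urls, fake_fetch)
print(len(pages), "pages from", len(calls), "requests")
```

With `delay=0.05` the loop makes at most 20 requests per second, and the repeated URL is served from the dictionary instead of triggering a second request.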

In [None]:
import requests_cache # conda install -c conda-forge requests-cache

requests_cache.install_cache("my_cache")

### Answering Example Question

In [None]:
response = requests.get("https://itunes.apple.com/search", params = {
        "term": "dynamite",
        "country": "US"
    })
response.raise_for_status()
result = response.json()
result

In [None]:
result.keys()

In [None]:
result["results"]

In [None]:
import pandas as pd

result_df = pd.DataFrame(result["results"])

In [None]:
result_df.columns

In [None]:
result_df = result_df[["artistName", "trackCount", "releaseDate"]]
result_df.head()

In [None]:
result_df = result_df.set_index("artistName")
result_df

In [None]:
result_df.apply(len)

In [None]:
response = requests.get("https://itunes.apple.com/search", params = {
        "term": "dynamite",
        "country": "US"
    })
response.raise_for_status()
result = response.json()

In [None]:
result

In [None]:
results = response.json()["results"]
results = pd.DataFrame(results)

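# Match the title regardless of capitalization; treat missing names as non-matches.
is_dynamite = results["trackName"].str.contains("dynamite", case=False, na=False)

# The first element of .shape is the number of matching tracks.
results[is_dynamite][["trackName", "artistName"]].shape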
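The same kind of count can be sketched without pandas, working directly on the list of dictionaries. The records below are made up and only shaped like entries of `result["results"]`. The helper `count_matching` is a hypothetical name, and treating any title that contains "remix" as a remix is an assumption rather than a rule from the iTunes documentation.

```python
# Made-up records shaped like entries of result["results"].
records = [
    {"trackName": "Dynamite", "artistName": "Artist A"},
    {"trackName": "Dynamite (Remix)", "artistName": "Artist A"},
    {"trackName": "DYNAMITE (Acoustic Remix)", "artistName": "Artist B"},
    {"trackName": "Another Song", "artistName": "Artist C"},
    {"artistName": "Artist D"},  # some records may lack a trackName
]

def count_matching(records, *words):
    """Count records whose trackName contains every word, ignoring case."""
    count = 0
    for record in records:
        name = record.get("trackName", "").lower()
        if all(word.lower() in name for word in words):
            count += 1
    return count

print(count_matching(records, "dynamite"))           # 3
print(count_matching(records, "dynamite", "remix"))  # 2
```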