## 2.1: APIs

The two most common ways to collect publicly available web data are (1) web scraping and (2) interacting with Application Programming Interfaces (APIs). In the previous module, we covered some web scraping basics. This lab turns to web APIs: it provides a quick introduction along with some example code for working with them.

## APIs

Many sites make structured data easily available to users via Application Programming Interfaces (APIs). APIs are sets of protocols and procedures that govern the interaction between a user and a site. Websites often maintain APIs to make it easier for users to interact with their data, which, in turn, facilitates third-party development while reducing the strain placed on host servers (structured API responses are a more efficient way to deliver data than full XML/HTML pages).

To navigate site protocols and interact with a site's API, users send HTTP requests to a web server and receive responses from it. It is therefore important to have a basic understanding of this process as well as some familiarity with how to formulate and send HTTP requests.

## URL components

As you work with APIs, you'll notice the requests look a lot like the URLs you are used to seeing in web browsers. That's because APIs follow the same protocol. With this in mind, it's useful to take a second to identify the different components of URLs. Typically, URLs consist of three or four components:

**Scheme**: The protocol to be used when accessing the resource. Can be either HTTP or HTTPS.

**Hostname**: The name of the domain that holds a web resource. These are sometimes followed by a port number, but because most sites use the standard ports (80 for HTTP, 443 for HTTPS), the port number is typically omitted.

**Path**: The path indicates a specific resource or access point. Paths often define what the underlying task is about.

**Query**: Queries provide additional information to be used by the web resource. Parameter names and their values are linked via equal signs `=`, and parameter/value pairs are joined using ampersands `&`. Queries are optional. When they are used, they are separated from the scheme/hostname/path by a question mark `?`.
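
To make these components concrete, here is a minimal sketch that splits a URL into its parts with `httr::parse_url()`. The URL is the httpbin example used later in this lab, and no request is actually sent.

```r
library(httr)

# Split a URL into its components (no HTTP request is made)
parts <- httr::parse_url("http://httpbin.org/get?key1=value1&key2=value2")

parts$scheme    # the protocol
parts$hostname  # the domain
parts$path      # the resource (httr stores it without the leading slash)
parts$query     # a named list of parameter/value pairs
```

Inspecting `parts$query` shows how the `key=value` pairs after the `?` map onto a named R list.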



## HTTP requests

To interact with a web server, users send HTTP requests to it and receive responses in return. To send requests and handle responses in R, I recommend the [{httr}](https://github.com/r-lib/httr) package.

The HTTP protocol defines different methods that can be used to send a request message to an HTTP server. Although there are several others, the two most common request methods are GET and POST.

**GET** requests are used to *retrieve* data from a web server. Here's a basic example:

In [1]:
## GET request
r <- httr::GET("http://httpbin.org/get")
r

Response [http://httpbin.org/get]
  Date: 2018-01-31 03:50
  Status: 200
  Content-Type: application/json
  Size: 327 B
{
  "args": {}, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "libcurl/7.29.0 r-curl/1.1 httr/1.2.1"
  }, 
  "origin": "128.206.116.250", 
...

For more complicated GET requests (e.g., requests that include multiple parameters), the parameters are specified in the *query string* of the URL rather than in the body of the request. In the {httr} package, this is done by passing a named list of values to the `query` argument. In the example below, a query is sent to http://httpbin.org/get with the parameters `key1 = "value1"` and `key2 = "value2"`. If this were compiled into a single URL, it would look something like http://httpbin.org/get?key1=value1&key2=value2, which, if opened in a web browser, should return essentially the same information as the request sent from R.

In [3]:
## send a GET request with query parameters
r <- httr::GET(
    "http://httpbin.org/get", 
    query = list(key1 = "value1", key2 = "value2")
)
r

Response [http://httpbin.org/get?key1=value1&key2=value2]
  Date: 2018-01-31 04:00
  Status: 200
  Content-Type: application/json
  Size: 398 B
{
  "args": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "application/json, text/xml, application/xml, */*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
...
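
As a rough illustration of how the `query` list ends up in the URL, the sketch below builds the same URL with `httr::modify_url()` without sending a request. The parameter names are the placeholder values from the example above.

```r
library(httr)

# Build the full request URL from a base URL and a named list of parameters
url <- httr::modify_url(
  "http://httpbin.org/get",
  query = list(key1 = "value1", key2 = "value2")
)
url
```

Printing the assembled URL before sending a request is a handy way to check that the parameters are being encoded as intended.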

**POST** requests are used to *send* data to a web server. Because the purpose of POST requests is to *send* information, the body of the request will almost always include the information to be sent to a web server.

In [4]:
r <- httr::POST("http://httpbin.org/post", body = list(a = 1, b = 2, c = 3))

The data sent via POST requests can be encoded in various ways. Use the `encode` argument in the `httr::POST()` function to specify a desired encoding method:

In [5]:
url <- "http://httpbin.org/post"
body <- list(a = 1, b = 2, c = 3)

# Form encoded
r <- httr::POST(url, body = body, encode = "form")
# Multipart encoded
r <- httr::POST(url, body = body, encode = "multipart")
# JSON encoded
r <- httr::POST(url, body = body, encode = "json")
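
To get a feel for what two of these encodings put in the request body, the sketch below approximates the form and JSON encodings locally. This is an illustration only, not {httr}'s internal code: real form encoding also percent-encodes special characters, and the variable names `form_body` and `json_body` are made up for this example.

```r
library(jsonlite)

body <- list(a = 1, b = 2, c = 3)

# Approximate form encoding: name=value pairs joined by "&"
form_body <- paste(names(body), unlist(body), sep = "=", collapse = "&")

# Approximate JSON encoding of the same list
json_body <- as.character(jsonlite::toJSON(body, auto_unbox = TRUE))

form_body
json_body
```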


To quickly summarize, for simple retrieval requests, use `httr::GET()`. For more verbose requests---e.g., when you need to upload more data than what's typically found in URLs---use `httr::POST()`.

### Response status

All response objects come with a response status, which indicates whether there were any issues retrieving the response. Status codes in the 200 range (most commonly 200) mean the request succeeded. Anything else, such as a 4xx client error or a 5xx server error, is a sign that something didn't go as expected. The status code provides the first piece of information for diagnosing the source of the problem.

In [6]:
## perform simple GET and POST request 
gr <- httr::GET("http://httpbin.org/get")
pr <- httr::POST("http://httpbin.org/post", body = list(a = 1, b = 2, c = 3))

## GET status
httr::http_status(gr)
gr$status_code

## POST status
httr::http_status(pr)
pr$status_code
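
Status codes can also be looked up without sending a request, since `httr::http_status()` accepts a plain numeric code. The sketch below, using three codes chosen for illustration, shows how the first digit of a code maps to a broad category.

```r
library(httr)

# Three illustrative codes: success, client error, server error
codes <- c(200L, 404L, 500L)

# Look up the category for each code
categories <- vapply(
  codes,
  function(code) httr::http_status(code)$category,
  character(1)
)
categories

# The full description of a single code
httr::http_status(404L)$message
```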

When writing functions designed to send and receive HTTP requests, there are some useful {httr} functions that check the HTTP status and return relevant information about it.

In [7]:
## Convert HTTP errors into R warnings or errors (silent on success):
r <- httr::GET("http://httpbin.org/get")
httr::warn_for_status(r)
httr::stop_for_status(r)
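
Inside your own functions, `httr::stop_for_status()` pairs naturally with `tryCatch()`. The sketch below uses a hypothetical helper, `check_status()`, and passes numeric status codes in place of real responses so that it runs without a network connection. In practice you would pass the response object returned by `httr::GET()` or `httr::POST()`.

```r
library(httr)

# Hypothetical helper: TRUE if the status indicates success,
# FALSE (after reporting the problem) otherwise
check_status <- function(x) {
  tryCatch(
    {
      httr::stop_for_status(x)
      TRUE
    },
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      FALSE
    }
  )
}

ok <- check_status(200L)      # stands in for a successful response
failed <- check_status(404L)  # stands in for a "Not Found" response
```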

### Parse response

Content is typically returned as XML, JSON, plain text, or raw bytes. If known, the type is stored as `content-type` in the response headers.

In [8]:
## check content type
r$headers$`content-type`

By default, the function `httr::content()` will parse the response object into a more usable form.

In [9]:
httr::content(r)

In most cases, though, it's best to be explicit about the parsing method. For example, with JSON responses, it's better to extract the content as text and then parse it with the [{jsonlite}](https://github.com/jeroen/jsonlite) package.

In [10]:
## specifying the encoding is optional (if omitted, httr guesses it, falling back to UTF-8)
txt <- httr::content(r, as = "text", encoding = "UTF-8")

## convert json text to R list object
jsonlite::fromJSON(txt)
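
The parsed result is an ordinary R list, so individual fields can be pulled out with `$`. The sketch below uses a hand-written JSON string, loosely modeled on the httpbin output shown earlier, so that it runs without a network connection.

```r
library(jsonlite)

# Hand-written JSON text standing in for httr::content(r, as = "text")
txt <- '{"args": {"key1": "value1", "key2": "value2"}, "origin": "0.0.0.0"}'

# Convert the JSON text to an R list
parsed <- jsonlite::fromJSON(txt)

# Access nested fields by name
parsed$args$key1
names(parsed$args)
```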

When XML or HTML markup is returned, it's best to convert the response object to an `xml_document` and then extract nodes using the [{rvest}](https://github.com/hadley/rvest) package.

In [11]:
## GET xml content
r <- httr::GET("http://httpbin.org/xml")

## convert to xml_document
xml2::read_html(r)

{xml_document}
<html>
[1] <body><slideshow title="Sample Slide Show" date="Date of publication" aut ...
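
Once the content is an `xml_document`, nodes and attributes can be extracted by name. The sketch below uses {xml2} functions directly, rather than {rvest}, to keep the dependencies small. It works on a hand-written XML fragment that only loosely resembles the slideshow above, so it runs without a network connection.

```r
library(xml2)

# Hand-written XML standing in for a response body
doc <- xml2::read_xml(
  '<slideshow title="Sample Slide Show">
     <slide><title>First slide</title></slide>
     <slide><title>Second slide</title></slide>
   </slideshow>'
)

# Read an attribute from the root node
show_title <- xml2::xml_attr(doc, "title")

# Find every <title> inside a <slide> and extract its text
slide_titles <- xml2::xml_text(xml2::xml_find_all(doc, "//slide/title"))

show_title
slide_titles
```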

## Authorization (OAuth) methods

To interact with many APIs, users must be authorized. This process includes providing a unique key or token. The method used by Twitter (OAuth 1.0) is explained in greater detail in [lab 2.2](oauth.ipynb).
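
For simpler key-based APIs (not the OAuth flow covered in lab 2.2), a token is often sent in a request header. The sketch below shows one common pattern, assuming a bearer-style token. The environment variable `EXAMPLE_API_TOKEN`, the fallback value, and the URL in the comment are all hypothetical, and the exact header an API expects will be described in its documentation.

```r
library(httr)

# Hypothetical token: read from an environment variable, with a dummy fallback
token <- Sys.getenv("EXAMPLE_API_TOKEN", unset = "demo-token")

# Build a header configuration that can be passed along with a request
auth <- httr::add_headers(Authorization = paste("Bearer", token))

# The configuration would then be supplied with the request, e.g.:
# r <- httr::GET("https://api.example.com/data", auth)
```

Keeping the token in an environment variable rather than in the script avoids accidentally sharing it along with your code.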