In [1]:
import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

# HTTP

HTTP (**H**yper**T**ext **T**ransfer **P**rotocol) is a general-purpose protocol for accessing resources on the web. A tremendous number of datasets are available on the Internet, and HTTP lets us request them. A browser uses HTTP to request a web page to display, a web form uses it to submit a request for data, and API conventions such as REST are built on top of it. HTTP is a simple *request-response* protocol that computers use to talk to one another over the Internet.

## Requests and Responses

The Internet allows computers to send text to one another, but does not impose any restrictions on what that text contains. HTTP defines a structure for the text exchanged between a client (an application such as a browser) and a server. In this protocol, the client submits a *request*, a specially formatted text message, to a server, and the server sends back a specially formatted text *response*.

A request consists of two parts: a header and an optional body. The header must follow a specific syntax. An example request to obtain the Wikipedia web page shown in {numref}`Figure %s <fig-wiki-1500>` looks like this:

```
GET /wiki/1500_metres_world_record_progression HTTP/1.1
Host: en.wikipedia.org
User-Agent: curl/7.65.2
Accept: */* 
{blank_line}
``` 

The first line of this request contains three pieces of information: it starts with the method of the request, which is GET; next comes the path of the web page that we want (the `Host` line supplies the host name that completes the URL); and it ends with the protocol and version. The three lines that follow form the HTTP header, auxiliary information that is sent to the server. Each header line has the format `{name}: {value}`. Finally, the blank line tells the server that the header has ended; since this request has no body, the message ends there too. Note that we've marked the blank line with `{blank_line}` in the snippet above; in the actual message it is an empty line.
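To make the format concrete, here is a minimal sketch that assembles the same request text in Python. The helper name `build_get_request` is our own invention for illustration and is not part of any library; the sketch only builds the string and does not send it anywhere.

```python
def build_get_request(host, path, extra_headers=None):
    """Assemble the text of an HTTP/1.1 GET request (illustrative helper)."""
    # Request line: method, path of the resource, protocol and version.
    lines = [f"GET {path} HTTP/1.1"]
    # Header lines follow the `{name}: {value}` format.
    headers = {"Host": host}
    headers.update(extra_headers or {})
    lines.extend(f"{name}: {value}" for name, value in headers.items())
    # Each line ends with a carriage return and line feed (CRLF);
    # an empty line marks the end of the header.
    return "\r\n".join(lines) + "\r\n\r\n"


request_text = build_get_request(
    "en.wikipedia.org",
    "/wiki/1500_metres_world_record_progression",
    {"User-Agent": "curl/7.65.2", "Accept": "*/*"},
)
print(request_text)
```

The printed text matches the snippet above, with the `{blank_line}` marker replaced by an actual empty line.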

```{figure} figures/Wikipedia1500mScreen23-02-24.png
---
name: fig-wiki-1500
---

Screenshot of the [Wikipedia page](https://en.wikipedia.org/wiki/1500_metres_world_record_progression) with data on the World Record for the 1500 meter race.
```

The client's computer uses the Internet to send this message to the Wikipedia web server. The server processes the request and sends a response, which also consists of a header and a body. The header of the response looks like the following:

```
< HTTP/1.1 200 OK
< date: Fri, 24 Feb 2023 00:11:49 GMT
< server: mw1369.eqiad.wmnet
< x-content-type-options: nosniff
< content-language: en
< vary: Accept-Encoding,Cookie,Authorization
< last-modified: Tue, 21 Feb 2023 15:00:46 GMT
< content-type: text/html; charset=UTF-8
...
< content-length: 153912
{blank_line}
```

The first line of the header states that the request completed successfully; the status code is 200. The lines that follow give additional information that the server sends back to the client. We shortened this response quite a bit to focus on a few pieces of information about the response body. We are informed that the content of the body is HTML encoded in UTF-8, and that the body is 153912 bytes long. Finally, the blank line at the end of the header tells the client that the server has finished sending its response headers and that the response body, which is the HTML page, follows.
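As a rough illustration of how a client might read such a header, the sketch below splits a shortened copy of the response header into its status line and its name/value pairs. The function `parse_response_header` is a hypothetical helper written for this example; real clients handle many more cases (repeated headers, malformed lines, and so on).

```python
# A shortened copy of the response header, without curl's "< " prefixes.
raw_header = (
    "HTTP/1.1 200 OK\r\n"
    "date: Fri, 24 Feb 2023 00:11:49 GMT\r\n"
    "content-type: text/html; charset=UTF-8\r\n"
    "content-length: 153912\r\n"
    "\r\n"
)


def parse_response_header(text):
    """Split a response header into status information and a dict of fields."""
    # Everything before the first blank line belongs to the header.
    head, _, _ = text.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line: protocol version, numeric status code, reason phrase.
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so we store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers


version, code, reason, headers = parse_response_header(raw_header)
print(code, reason)
print(headers["content-type"])
print(int(headers["content-length"]))
```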

HTTP is used in almost every application that interacts with the Internet. For example, if you visit the Wikipedia page in your web browser, the browser makes the same basic HTTP request as the one above. It then displays the body of the response in your browser's window so it looks like the screenshot in {numref}`Figure %s <fig-wiki-1500>`.

In practice, we do not write out full HTTP requests in text. Instead, we use tools like the `requests` Python library to construct requests for us. We simply pass the URL to `requests.get`, which builds and sends a request that uses the GET method.

In [35]:
import requests

url_1500 = 'https://en.wikipedia.org/wiki/1500_metres_world_record_progression'

In [44]:
resp_1500 = requests.get(url_1500)

Let's check the status of the request to make sure the server completed our request successfully.

In [45]:
resp_1500.status_code

200

We have many attributes available to us so we can thoroughly examine the request and response. Let's take a look at the header key/value pairs that were sent as part of our request. The header information is stored in a dictionary-like object.

In [47]:
for key in resp_1500.request.headers:
    print(f'{key}: {resp_1500.request.headers[key]}')

User-Agent: python-requests/2.25.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive


Although we did not specify any header information in our request, the `requests.get` function provided it for us. If the server expects additional pieces of information in a request, we can specify them. We demonstrate this in the example of the next section.

Now, let's examine the header of the response we received from the server.

In [48]:
len(resp_1500.headers)

20

With so much header information, we just display a few key/value pairs.

In [51]:
keys = ['date', 'content-type', 'content-length' ]
for key in keys:
    print(f'{key}: {resp_1500.headers[key]}')

date: Fri, 24 Feb 2023 00:11:49 GMT
content-type: text/html; charset=UTF-8
content-length: 22978


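The response header informs us that the body of the response is HTML and that its encoding is UTF-8. We are also told that the body has over 20,000 bytes. This is smaller than the 153912 bytes reported in the earlier example, most likely because our request included `Accept-Encoding: gzip, deflate` and the server sent the body in compressed form. Finally, we display the first 600 characters of the response body (the entire content is too long to display nicely here).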

In [53]:
resp_1500.text[:600]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-disabled vector-feature-page-tools-pinned-disabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>1500 metres world record progression - Wikipedia</title>\n<script>document.documentElement.className="client-js vector'

We see that the response is an HTML document and it contains a title `1500 metres world record progression - Wikipedia`. We have successfully retrieved the web page shown in {numref}`Figure %s <fig-wiki-1500>`.

## Types of Response Status Codes

The previous HTTP responses had the HTTP status code `200`. This status code indicates that the request completed successfully. There are many other HTTP status codes. Thankfully, they are grouped into categories to make them easier to remember:

:::{table} Response Status Codes
:name: response-codes

| Code   | Type    |  Description                                                                  |
|--------|---------|-------------------------------------------------------------------------------|
| 100s   | Informational | More input is expected from client or server (100 Continue, 102 Processing, etc.) |
| 200s   | Success  | The client's request was successful (200 OK, 202 Accepted, etc.) |
| 300s   | Redirection | Requested URL is located elsewhere; may need further action from the user (300 Multiple Choices, 301 Moved Permanently, etc.)  |
| 400s   | Client Error |  Client-side error (400 Bad Request, 403 Forbidden, 404 Not Found, etc.)         |
| 500s   | Server Error |  Server-side error or server is incapable of performing the request (500 Internal Server Error, 503 Service Unavailable, etc.) |

:::
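Because the category is determined by the first digit of the code, it can be looked up with integer division. The sketch below is one way to do this; the `CATEGORIES` dictionary and the `describe_status` function are names we made up for illustration, while `HTTPStatus` comes from Python's standard `http` module and supplies the standard phrase for each code it knows.

```python
from http import HTTPStatus

# Map the leading digit of a status code to its category in the table above.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a short description such as '404 Not Found: Client Error'."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        # HTTPStatus knows the standard phrase for registered codes.
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not in the standard library's list.
        phrase = "(no standard phrase)"
    return f"{code} {phrase}: {category}"


for code in [200, 301, 404, 503]:
    print(describe_status(code))
```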

As an example, the following request is for a resource that doesn't exist, so we get a `404 Not Found` error. When the data returned are not what you expect, a first step is to check the status code.

In [56]:
url = "https://www.youtube.com/404errorwow"
bad_loc = requests.get(url)
bad_loc.status_code

404

The request we made to retrieve the web page was a `GET` HTTP request. There are four main HTTP request types: `GET`, `POST`, `PUT`, and `DELETE`. The two most commonly used methods are `GET` and `POST`. We just used `GET` to retrieve the web page.

In [46]:
resp_1500.request.method

'GET'

The `POST` request is used to send specific information from the client to the server. In the next section we use `POST` to retrieve data from Spotify. To access these data, we first need to send information that identifies us as a legitimate client who has registered as a developer. 
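To see how a `POST` request differs from the `GET` request shown earlier, the following sketch assembles the raw text of a `POST` message that carries form data in its body. Everything specific here is hypothetical and chosen only for illustration: the host `api.example.com`, the path `/token`, the form fields, and the helper `build_post_request`. It is not the request that any particular service expects, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def build_post_request(host, path, form_fields):
    """Assemble the bytes of an HTTP/1.1 POST request with a form body."""
    # Encode the fields as name=value pairs joined by '&'.
    body = urlencode(form_fields).encode("utf-8")
    header_lines = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        # Content-Length counts the bytes in the body, not the characters.
        f"Content-Length: {len(body)}",
    ]
    # A blank line separates the header from the body.
    head = "\r\n".join(header_lines) + "\r\n\r\n"
    return head.encode("ascii") + body


# Hypothetical credentials; a real service defines its own field names.
message = build_post_request(
    "api.example.com",
    "/token",
    {"client_id": "my-id", "client_secret": "my-secret"},
)
print(message.decode("utf-8"))
```

Unlike the `GET` example, the information we send travels in the body of the message, after the blank line, rather than in the first line of the request.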