# Housekeeping

* Quiz today after the lecture
* Final project assigned

# Module 8: Data Formats, HTTP, and REST APIs

_Tonight we'll be learning about some of the various file formats that data can be stored in (CSV, Parquet, JSON, HTML), and then how to download that data programmatically over HTTP using REST APIs._

## 8.1 Data (Serialization) Formats

Today we're going to learn how to communicate from one computer to another. To start with, I'd like to introduce some common data formats that we see used when communicating with other computers.

---

We've learned that we can store data in Python as strings, integers, lists, dictionaries, etc. However, you cannot _send_ a Python object as-is from one place to another. 

Instead, there are various data formats (also known as serialization formats, among other related terms) that are used for storing data and later retrieving it in another place. We'll look at a few of these you will encounter, if you haven't already. **These formats are used to store data in files, or to transfer it directly over the internet.**

### Constraints
Before we go anywhere, we should think about the constraints or requirements placed on _any_ data format.

Any data format we'll be using will represent all kinds of values as _strings_.

This is because whilst Python as a language has lots of rich datatypes -- `list`s, `dict`s, `tuple`s, etc. -- ultimately a file on your computer essentially contains strings, and just strings. You cannot put a Python `list` in a file, only a string that somehow represents it. So any data format will need to somehow represent _all_ of the structure it supports as a string.

Imagine you need to store a single list of 100 integers. How might you store it?

### 8.1.1 Line-based

Let's start perhaps with a _wrong_ answer! What happens if you write out `repr(yourlist)` to a file? Can you see what goes wrong by doing this?

To answer this correctly, we start instead with perhaps the simplest format anyone could invent -- it's so simple it lacks a true name -- that being to simply write each of these integers out, delimited by a newline. For instance:

```
1
17
-32
24
37
```

which we could take and write to a file, and send to someone else.


#### Exercise
Create a list of 100 random integers (there can be repeats).

With Python, write these integers to a file somewhere on your computer where each integer is separated by a newline (`\n`). Then in a successive cell, read this file back in, and parse it back into a list of 100 integers. Compare the two lists to assert they are equal.

In [1]:
my_list = [1, 17, -32, 24, 37] * 20

output = "my_list.txt"
with open(output, "w") as f:
    for item in my_list:
        f.write(str(item))
        f.write("\n")
    
with open(output, "r") as f:
    lines = f.readlines()

my_list_cleaned = [int(i.strip()) for i in lines]
assert my_list == my_list_cleaned
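The solution above reuses a fixed list. A minimal sketch of the same round trip with random integers, as the exercise asks, might look like the following. The integer range and the use of `tempfile` (so no file is left behind) are assumptions made for illustration.

```python
import os
import random
import tempfile

# 100 random integers; the range -1000..1000 is an arbitrary choice
numbers = [random.randint(-1000, 1000) for _ in range(100)]

# Write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "numbers.txt")

    with open(path, "w") as f:
        for n in numbers:
            f.write(f"{n}\n")

    with open(path) as f:
        # Iterating over a file yields one line at a time, newline included;
        # int() ignores the surrounding whitespace
        parsed = [int(line) for line in f]

assert parsed == numbers
```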

#### Data, Syntax and Delimiters
We should differentiate for a moment between the parts of our format that are the _data_ and the parts of it that are _syntactical_, or part of the format itself.

In other words -- if we were to store the list above -- `[1, 17, -32, 24, 37]`, we have chosen to represent this list with a certain string, specifically, to write that string as a Python string:

`"1\n17\n-32\n24\n37\n"`

In this simple format, most of the characters of this string are the data we wish to store, and the only real character that is syntactically meaningful to our data format is the newline character (`\n`) which separates elements of our data. We'd call the newline character here a _delimiter_, as it delimits where one element ends and another begins. Keep this distinction in mind for data formats below, which will begin to introduce more syntactical elements, and thereby a more complicated data format, in exchange for upgrading the amount we can represent.

In short: Any data format will somewhere within it contain the data we want to serialize, and also somehow contain characters or syntactical elements which the data format uses to delimit, annotate, separate or locate structured parts of the data within itself.
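To see the split between data and syntax in code, here is a small sketch that takes the string shown above and separates the delimiter from the data. The variable names are made up for illustration.

```python
serialized = "1\n17\n-32\n24\n37\n"

# Splitting on the delimiter leaves only the data; the final newline
# produces one trailing empty string, which we skip.
fields = [piece for piece in serialized.split("\n") if piece]
values = [int(piece) for piece in fields]

print(fields)  # ['1', '17', '-32', '24', '37']
print(values)  # [1, 17, -32, 24, 37]

# Re-serializing puts the delimiter back after every element
assert "".join(f"{v}\n" for v in values) == serialized
```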


#### Pros & Cons

Pros:
* simple format
* easy to write out / read in
* human-readable

Cons:
* very limiting - only one-dimensional
* not standardized

The format above is, hopefully you'll agree, amazingly simple. It's easy to write out, easy to read back in ("parse"), reasonably noiseless in the way described above, and easily human-readable.

However -- say we wanted to store tabular data that was multidimensional. We'd need to somehow be able to store multiple _columns_ for each of our rows. And doing simply the above doesn't easily allow us to do so without modification. In other words, the richness of data structures we can represent is fairly weak -- we can represent only one-dimensional data, and even then, depending on the robustness of your parsing code, only integers.

Secondly, the format we've invented here is simple, but not self-describing or standardized -- to understand what this means, consider that someone who sees a number of lines from our file sees that they contain integers.

Can they be sure the entirety of our file contains only integers, or is the lack of non-integers simply a coincidence? Might it also some day contain strings, or floats? Can they even be sure that the lines are meant to be integers and not strings with integral digits?

To be truly sure they'd need to ask the file's author, or see the entirety of the dataset and be able to be sure they understood the intended meaning of it. This is in contrast to other formats we'll see below, where the data format itself is standardized -- meaning there is a formal description of the format itself, and someone can write a parser they can be sure parses all valid files of the given format!

Keep these concerns in mind -- they're often key things to consider for any data format:

* easily written out?
* easily parsed?
* compact?
* human-readable?
* precisely defined?
* supports the types and data structures needed?

### 8.1.2 CSV

If we want to upgrade our data format a bit to be able to store both multiple rows as well as multiple columns on each row, we'll have to teach it how to do so.

If we consider what we ultimately did to store multiple rows, we chose a _delimiter_ (a newline character, `\n`) -- if we do the same to separate each column or field, we can extend our format to support what we want.

To cut to the chase -- CSV, or comma-separated values -- is precisely a way of doing so where the column delimiter is a comma character. TSV, or tab-separated values, is a similar format in which columns are separated by tab characters (`\t`).
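To make the two-delimiter idea concrete, here is a naive sketch that builds and parses a comma-separated string by hand. The sample rows are made up, and this approach breaks as soon as a field itself contains a comma or a newline, which is one reason to prefer Python's `csv` module in practice.

```python
rows = [
    ["1", "apple", "red"],
    ["2", "banana", "yellow"],
]

# Join fields with the column delimiter, then rows with the row delimiter
text = "\n".join(",".join(row) for row in rows)
print(text)
# 1,apple,red
# 2,banana,yellow

# Parsing reverses the two steps: split into rows, then into fields
parsed = [line.split(",") for line in text.split("\n")]
assert parsed == rows
```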

Imagine we had data like the below:

In [17]:
data = [
    (123, "Hello", "Foo Bar"),
    (456, "World", "Baz Quux"),
    (789, "!", "Spam Eggs"),
]

We could write these data out to a file using the `csv` module:

In [13]:
import csv

# newline="" lets the csv module control line endings itself; without it,
# an extra "\r" is written on Windows and blank rows appear when reading back
with open("out.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(("ID", "First", "Second"))  # A header for our CSV (optional)
    for row in data:
        writer.writerow(row)

Have a look at the contents of the `out.csv` file that was just created, first in a text editor, where you'll see lines of data separated by commas, then in Excel, which interprets the file as a table.

You can similarly read back the data using `csv.reader`:

In [14]:
with open("out.csv", newline="") as out:
    reader = csv.reader(out)
    for row in reader:
        print(row)

['ID', 'First', 'Second']
['123', 'Hello', 'Foo Bar']
['456', 'World', 'Baz Quux']
['789', '!', 'Spam Eggs']


#### Pros & Cons

Pros:
* simple
* easy to write out/read in with `csv` module
* human readable
* works well with Excel and Google Sheets

Cons:
* reading: Everything is a string
* still limiting: not easy to have higher-dimensional data
* only partially standardized

CSV is still quite a simple format -- it's easy to write out using the `csv` module, easy to read back in, and easily human-readable. It interoperates well with other programs like Excel as well.

However -- you'll notice that whilst we originally started with integer IDs, when we read them back from our file, they were interpreted as strings. CSV only has one _kind_ of value it stores -- string values. You must separately know what kind of value to interpret the strings as, should you want to interpret them as integers. You also cannot easily store higher-dimensional or non-tabular data within it. In other words, the richness of data structures we can represent is still fairly weak -- we can represent only tabular data, and even then, just string values.
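Since CSV gives us back only strings, any type conversion is our job. A minimal sketch, assuming we know that the `ID` column holds integers, and using an in-memory string with the same shape as `out.csv` instead of a file:

```python
import csv
import io

# Same shape as the out.csv example, held in memory for brevity
csv_text = "ID,First,Second\n123,Hello,Foo Bar\n456,World,Baz Quux\n"

reader = csv.DictReader(io.StringIO(csv_text))
records = []
for row in reader:
    # Every value arrives as a str; we decide that ID should be an int
    row["ID"] = int(row["ID"])
    records.append(row)

print(records[0])
# {'ID': 123, 'First': 'Hello', 'Second': 'Foo Bar'}
```

The knowledge that `ID` is an integer lives in our code, not in the file, which is exactly the limitation described above.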

CSV is partially standardized -- you'll see that a parser for the data format exists in the modules we mentioned. But truthfully, there are in fact multiple dialects of CSV (even notwithstanding the question of tabs vs. commas alluded to above). There isn't a true "single" dialect of CSV. Some of the complexity of these dialects comes in when considering a question like -- how should a field that itself contains a comma within it be stored in a CSV?
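As an illustration of one answer to that question, the sketch below shows what Python's `csv` module does with its default dialect: it wraps such a field in double quotes and doubles any quotes inside it. Other tools may make different choices, and the sample values are made up.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)  # default "excel" dialect
writer.writerow([1, "Hello, World", 'She said "hi"'])

serialized = buffer.getvalue()
print(repr(serialized))
# '1,"Hello, World","She said ""hi"""\r\n'

# Reading it back recovers the original fields (as strings)
row = next(csv.reader(io.StringIO(serialized)))
assert row == ["1", "Hello, World", 'She said "hi"']
```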

Nevertheless, CSVs are extremely prevalent as a simple way to exchange tabular data. They're:

* easily written out
* easily parsed, as long as you know what dialect they're written in
* reasonably compact
* human-readable
* within a specific dialect, precisely defined
* support tabular, string values

### 8.1.3 JSON
If we want to escape the bounds of tabular data and be able to store a wider variety of data shapes and structures, JSON is an extremely prevalent choice of serialization format. JSON stands for _JavaScript Object Notation_ -- and was inspired by the syntax that the JavaScript language uses for objects, which confusingly is JavaScript's terminology for what Python calls `dict`s. You'll see that thankfully for us, JSON is quite close to Python syntax as well.

The number of types JSON can represent is a lot larger than CSV -- in fact, to learn JSON, we should learn how JSON represents each kind of value it supports, which include:

* booleans
* numbers (ints and floats)
* strings
* null (like None)
* arrays (like lists)
* objects (like dicts)

JSON is widely readable from every programming language -- so you can create JSON in Python, and someone else will be able to read and manipulate a corresponding data structure in some other programming language, like JavaScript.

Let's look at what JSON looks like by "dumping" a Python object to JSON, which is how we take some Python object and represent it as JSON (aka: _serializing_ to JSON).

In [11]:
import json
data = {
    "foo": 12,
    "bar": [1, 2, 3],
    "baz": {"quux": None},
    "spam": False,
}
json.dumps(data)

'{"foo": 12, "bar": [1, 2, 3], "baz": {"quux": null}, "spam": false}'

It is literally just a string in Python. 

It is a bit easier to compare the above to what we started with if we show it with some indentation:

In [12]:
print(json.dumps(data, indent=2))

{
  "foo": 12,
  "bar": [
    1,
    2,
    3
  ],
  "baz": {
    "quux": null
  },
  "spam": false
}


Here's another example, this time with a JSON array (Python `list`):

In [6]:
import json

array = [1, 2, [5, 5, 5], {"foo": "bar"}]
json_array = json.dumps(array)
print(json_array)
print(type(json_array))

[1, 2, [5, 5, 5], {"foo": "bar"}]
<class 'str'>


Notice that the above is indeed quite similar to the Python syntax itself!

We can load back some JSON using `json.loads`. Watch:

In [7]:
py_array = json.loads(json_array)
print(py_array)
print(type(py_array))

[1, 2, [5, 5, 5], {'foo': 'bar'}]
<class 'list'>


We can also write JSON to and read JSON from a file. A JSON file has a `.json` extension:

In [9]:
data = {
    "foo": 12,
    "bar": [1, 2, 3],
    "baz": {"quux": None},
    "spam": False,
}

# write JSON to file called 'out.json'
with open("out.json", "w") as out:
    json.dump(data, out)
    
    
# read the JSON file and load it as a python object
with open("out.json") as out:
    loaded = json.load(out)
    
print(loaded)
print(type(loaded))
print(loaded == data)

{'foo': 12, 'bar': [1, 2, 3], 'baz': {'quux': None}, 'spam': False}
<class 'dict'>
True


**Note:**

* `loads` and `dumps` (with an `s` at the end) loads from/dumps to a _string_.
* `load` and `dump` (without an `s`) loads from/dumps to a _file_.

We have a `dict` which is equal to what we started with, including all of the varied types we used!
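The round trip is exact here because every value we used has a direct JSON counterpart. As a caveat, the following sketch (with made-up data) shows two Python types that do not survive unchanged: tuples come back as lists, and non-string dictionary keys come back as strings.

```python
import json

original = {1: "one", "point": (3, 4)}
restored = json.loads(json.dumps(original))

print(restored)  # {'1': 'one', 'point': [3, 4]}
assert restored != original
```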

#### Pros & Cons

Pros:
* Easy to read in/write out
* Standardized, language support
* Basic data types
* Multi-dimensional

Cons:
* Often lacks structure, validation - not required to have a schema
* Kind of "heavy" - repeating key names
* Can't stream/append data to a file like CSV
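To get a feel for the "heavy" point, here is a rough sketch that serializes the same made-up records as JSON and as CSV and compares the lengths of the resulting strings. The exact numbers depend entirely on the data, so treat this as an illustration rather than a benchmark.

```python
import csv
import io
import json

# Made-up records: every dict repeats the same three keys
records = [
    {"id": i, "first": "Hello", "second": "World"} for i in range(1000)
]

json_text = json.dumps(records)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "first", "second"])
writer.writeheader()   # key names appear once, in the header
writer.writerows(records)
csv_text = buffer.getvalue()

print(len(json_text), len(csv_text))
assert len(json_text) > len(csv_text)
```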

### 8.1.4 Parquet [SKIPPED]

CSV files are prevalent across the data science and research domains. The format is simple and human-readable, and many programs can handle it.

However, CSV files are considered sub-optimal when reading and storing a lot of data. 

CSV is described as "record-oriented" or "row-oriented", whereas Parquet is described as a "columnar" or "column-based" format for data storage. A visual:

![](https://miro.medium.com/max/887/1*0yVtTZ6MSR-S_uF2YeeUqQ.png)

Say that we had a table of information with 3 columns and 4 rows. CSV is considered "row-oriented": the data is stored and retrieved by row. Remember: when we read CSV data in, we got a list of rows.

Parquet is considered column oriented - where data is stored & retrieved by columns. In some cases, it is more optimal to organize by column - like when you just want to tally all the sales.

In general, row-oriented storage is better for transactional processing, whereas column-oriented is better for analytical processing. Inserting & deleting is fast for row-oriented storage, but aggregation is slow. The inverse is true for column-oriented. And because all the data in a column is the same (like, all integers reflecting sales, or strings for just the products that are sold), column oriented storage can make better use of compression.
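The difference between the two layouts can be sketched with plain Python data structures. This only illustrates the idea, not how Parquet lays out bytes on disk, and the sales figures are made up.

```python
# Row-oriented: one record per sale, as csv.DictReader would give us
rows = [
    {"product": "apple", "units": 3, "sales": 1.50},
    {"product": "pear", "units": 1, "sales": 0.75},
    {"product": "apple", "units": 2, "sales": 1.00},
]

# Column-oriented: one list per column, all values of the same type
columns = {
    "product": [row["product"] for row in rows],
    "units": [row["units"] for row in rows],
    "sales": [row["sales"] for row in rows],
}

# Tallying sales touches every record in the row layout...
total_from_rows = sum(row["sales"] for row in rows)
# ...but only a single list in the column layout
total_from_columns = sum(columns["sales"])

assert total_from_rows == total_from_columns == 3.25

# Appending a record is one operation for rows, one per column otherwise
rows.append({"product": "fig", "units": 5, "sales": 4.00})
for name, value in [("product", "fig"), ("units", 5), ("sales", 4.00)]:
    columns[name].append(value)
```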

For the final project, the format of the taxi data is in Parquet. If we download one of the taxi files, and try to open it in Python (this will take a few seconds to download):

In [1]:
import requests

taxi_data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-06.parquet"

response = requests.get(taxi_data_url, stream=True)
with open("lecture_08_taxi_data.parquet", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024): 
        if chunk:
            f.write(chunk)

In [2]:
with open("lecture_08_taxi_data.parquet", "rb") as f:
    # only read 4096 bytes - otherwise Jupyter crashes!
    data = f.read(4096)

In [3]:
data

b'PAR1\x15\x04\x15@\x15>L\x15\x08\x15\x04\x12\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\ncd\x80\x00&(\xcd\x06\xa5Y\xa14\x00\x88\xfb\x9a* \x00\x00\x00\x15\x00\x15\xce\xb2d\x15\xf0\xda3,\x15\x80\xa0\xa2\x03\x15\x04\x15\x06\x15\x06\x1c\x18\x08\x06\x00\x00\x00\x00\x00\x00\x00\x18\x08\x01\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08\x06\x00\x00\x00\x00\x00\x00\x00\x18\x08\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\nL\xfd\xddoeI\x96\xe5\x89\x99\x89\xc7@\x10\x04\xc1s\xe9\xc7\xe8t:\x9d\xce`0=====\xbd<\xa3\xbc\xa3c\xa2\xa2\xcd\xa2\xee\xc9I%\x125vZf\x85BMO\xa3\xd1\x10\x06\xc2\x08\x1a\x08z\x12\xf4\xa2\x81\x1e\xf4\xa4g=\x0b\xfd\x97j\xff\xd6>\xf4ndUf\x84;y\xef9\xf6\xb1?\xd6^{\xed\x14B\xf8\xdf\xfe\x7f\xff\xe5$\xfe\xefN\xe7%\x87\x12nb\xee\xa3o\xe3\xb8\x96\x11GZz\xdb\xc6\x12\xfaU\xbc\xe89\xf4\x9ezZ\xd3\xe8w1\xd9O\x8d%^\x86c\xcf-\xf6\xad\xb64\xaf\xeb\x12_\xf5Q\xc6H\xab\xfd\x7f\x8f\xcb8\xb4\xb8\xf41z\x18W\xf1Mo\xbd\xaf\xa3\xd5\x96\xe72r_B\xce\xf6!y\xb6\xaf[\xaf\xe3\xdb

**I can't read this – where's the data?** The `data` we read is unintelligible right now. This is because Parquet is a _binary_ format rather than a text format, and the column data inside the file is usually stored with _compression_. Compression makes files smaller, but the contents are unreadable until we use a library that knows how to decode and _decompress_ them.
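To illustrate what compression does in general, here is a small sketch using the standard library's `gzip` module on made-up, repetitive data. This shows the general idea only, not Parquet's actual file layout. The bytes `\x1f\x8b` that appear in the raw output above are the signature that marks the start of gzip-compressed data.

```python
import gzip

# A repetitive column of values, similar in spirit to VendorID
raw = ("1\n2\n" * 10_000).encode("utf-8")

compressed = gzip.compress(raw)
print(len(raw), len(compressed))   # compressed is far smaller
print(compressed[:2])              # gzip data starts with b'\x1f\x8b'

# Nothing is lost: decompressing restores the original bytes exactly
assert gzip.decompress(compressed) == raw
```

A Parquet reader, such as the library installed next, does this kind of decoding for us.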

We'll need to install that library (the following cell must be run for the subsequent cells about Parquet to work):

In [5]:
!pip install pyarrow

Collecting pyarrow
  Downloading pyarrow-11.0.0-cp39-cp39-macosx_11_0_arm64.whl (22.4 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-11.0.0


To read the data we've downloaded, we will use the [`pyarrow`](https://arrow.apache.org/docs/python/index.html) 3rd-party library we've just installed above.

In [6]:
import pyarrow.parquet as pq

data = pq.read_table("lecture_08_taxi_data.parquet")

If we look at what `data` is now, we can see the data:

In [7]:
data

pyarrow.Table
VendorID: int64
tpep_pickup_datetime: timestamp[us]
tpep_dropoff_datetime: timestamp[us]
passenger_count: double
trip_distance: double
RatecodeID: double
store_and_fwd_flag: string
PULocationID: int64
DOLocationID: int64
payment_type: int64
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
improvement_surcharge: double
total_amount: double
congestion_surcharge: double
airport_fee: double
----
VendorID: [[1,1,2,1,1,...,1,1,1,1,1],[2,2,2,2,1,...,2,2,2,2,2],...,[1,1,2,2,2,...,2,2,2,2,2],[2,2,2,2,2,...,1,2,2,2,2]]
tpep_pickup_datetime: [[2022-06-01 00:25:41.000000,2022-06-01 00:44:40.000000,2022-06-01 00:23:07.000000,2022-06-01 00:25:53.000000,2022-06-01 00:23:58.000000,...,2022-06-02 08:36:21.000000,2022-06-02 08:04:01.000000,2022-06-02 08:27:10.000000,2022-06-02 08:38:21.000000,2022-06-02 08:53:12.000000],[2022-06-02 08:10:22.000000,2022-06-02 08:30:26.000000,2022-06-02 08:36:01.000000,2022-06-02 08:57:39.000000,2022-06-02 08:10:31.00

**How do I read this?** Above the `----` are the column names and the types: there's the `VendorID` that is an integer, `tpep_pickup_datetime` that's a timestamp, etc. 

_Note that these types are not guessed from text: Parquet stores each column's values in a typed binary encoding, and the file's "schema" (a form of metadata) records the type of every column (e.g. integer, timestamp). `pyarrow` reads the schema and decodes the bytes into the appropriate type, so unlike with CSV we never have to convert strings ourselves._

There are 19 columns:

In [8]:
data.num_columns

19

Below the `----` is the actual data, listed one column at a time. Each column's values are printed on a single line, so the first line holds the values of the first column, `VendorID`:

```
VendorID: [[1,1,2,1,1,...,1,1,1,1,1],[2,2,2,2,1,...,2,2,2,2,2],...,[1,1,2,2,2,...,2,2,2,2,2],[2,2,2,2,2,...,1,2,2,2,2]]
```

It doesn't show _all_ the data; it uses ellipses (`...`) to show only a sample of each column's data when printing to the Jupyter notebook. We actually have over 3.5 million rows:

In [9]:
data.num_rows

3558124

In [10]:
# Can make a list of dictionaries - may take a while with a lot of data
data.to_pylist()[:5]

[{'VendorID': 1,
  'tpep_pickup_datetime': datetime.datetime(2022, 6, 1, 0, 25, 41),
  'tpep_dropoff_datetime': datetime.datetime(2022, 6, 1, 0, 48, 22),
  'passenger_count': 1.0,
  'trip_distance': 11.0,
  'RatecodeID': 1.0,
  'store_and_fwd_flag': 'N',
  'PULocationID': 70,
  'DOLocationID': 48,
  'payment_type': 1,
  'fare_amount': 32.0,
  'extra': 3.0,
  'mta_tax': 0.5,
  'tip_amount': 2.0,
  'tolls_amount': 6.55,
  'improvement_surcharge': 0.3,
  'total_amount': 44.35,
  'congestion_surcharge': 2.5,
  'airport_fee': 0.0},
 {'VendorID': 1,
  'tpep_pickup_datetime': datetime.datetime(2022, 6, 1, 0, 44, 40),
  'tpep_dropoff_datetime': datetime.datetime(2022, 6, 1, 1, 1, 48),
  'passenger_count': 1.0,
  'trip_distance': 4.2,
  'RatecodeID': 1.0,
  'store_and_fwd_flag': 'N',
  'PULocationID': 170,
  'DOLocationID': 226,
  'payment_type': 1,
  'fare_amount': 14.0,
  'extra': 3.0,
  'mta_tax': 0.5,
  'tip_amount': 0.0,
  'tolls_amount': 0.0,
  'improvement_surcharge': 0.3,
  'total_amoun

Pros:

* Very efficient to store - 55MB vs 2.5GB - 2% of the size of CSV file
* Very efficient to search for data
* Supported by Python and many other languages

Cons:

* Not human-readable without 3rd-party libraries to read
* Mostly worthwhile when you have a _lot_ of data and only need to work with a subset of it

### 8.1.5 HTML

Sometimes, you won't have nice, cleanly formatted data like a Parquet file, a JSON file, or CSV file. Sometimes, the information you have to work with is just a webpage - which is written in HTML.

HTML stands for HyperText Markup Language - and it's used to structure the content of a webpage. It is _not_ a programming language though - it's a markup language (other markup languages include: XML (eXtensible Markup Language), Markdown, LaTeX). It "marks up" elements of a page through tags, which are structured with angle brackets:

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My Website</title>
  </head>
  <body>
    <h1>My Website</h1>
    <h2>Second header</h2>
    <h3>Third Header</h3>
    <p>A paragraph with an image inside: <img src="picture.jpg" alt="a picture"/></p>
    <p>Another paragraph with a <a href="https://google.com">link</a> inside!</p>
  </body>
</html>
```

A valid HTML page first declares that it's HTML with the `!DOCTYPE` - this is actually _not_ an HTML tag, but just information to the browser of what type of document to expect.

We then have `<html></html>` tags at the start and end of the document - this just represents the root of the HTML document. All other HTML elements need to be inside this tag.

We then have a `<head></head>` section - the `<head>` element is a container for metadata - data about the data on the page. This part of HTML is not displayed on a webpage, but the browser makes use of it. Inside our `<head>`, we have a `<meta>` element whose `charset` attribute (short for "character set") specifies the encoding of the HTML document itself. We also have `<title>` in the `<head>` - which defines the title shown in the browser's title bar or tab. It's also the title of the page when found in search results.

We then have the `<body></body>` tags - which define all the rendered content that you see in a webpage. Inside the `<body>`, we have a few headers (`<h1>`, `<h2>`, `<h3>`) of different sizes, and a paragraph (`<p>`) for our prose, which also renders an image (`<img>`). We have a second paragraph with a link inside (`<a>`) pointing to Google.
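To see this structure from Python's point of view, here is a minimal sketch that walks a shortened version of the page above and lists each opening tag with its attributes. It uses the standard library's `html.parser` module only to stay dependency-free, and the `TagLister` class is a made-up helper.

```python
from html.parser import HTMLParser

html = """<!DOCTYPE html>
<html>
  <head><title>My Website</title></head>
  <body>
    <h1>My Website</h1>
    <p>A paragraph with a <a href="https://google.com">link</a> inside!</p>
  </body>
</html>"""


class TagLister(HTMLParser):
    """Record every opening tag and its attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(html)
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Note that `<!DOCTYPE html>` does not show up in the list, in line with the point above that it is a declaration and not a tag.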

When writing HTML yourself, it's always good to ensure that it's valid. Use a validator like [this site](https://www.freeformatter.com/html-validator.html) to double-check you've written valid HTML.

**Aside**: You can get more of a friendly introduction to HTML [here](https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9).

We can make use of the [`requests`](https://docs.python-requests.org/en/latest/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) packages to help us parse HTML pages in Python.

First, using `requests`, let's grab the HTML of a website:

In [None]:
import requests

content = requests.get('https://twitter.com')

In [1]:
import requests

content = requests.get('https://python.org')

In [2]:
content.text



In [2]:
import bs4

In [4]:
soup = bs4.BeautifulSoup(content.text, 'html.parser')
soup

<!DOCTYPE html>

<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" dir="ltr" lang="en"> <!--<![endif]-->
<head>
<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-TF35YF9CVH"></script>
<script>
      window.dataLayer = window.dataLayer || [];
      function gtag(){dataLayer.push(arguments);}
      gtag('js', new Date());
      gtag('config', 'G-TF35YF9CVH');
    </script>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
<link href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js" rel="prefetch"/>
<meta content="Python.org" name="application-name"/>
<meta content="The official

In [5]:
# Get all the links from the page
links = soup.find_all("a")
len(links)

208

In [6]:
links

[<a href="#content" title="Skip to content">Skip to content</a>,
 <a aria-hidden="true" class="jump-link" href="#python-network" id="close-python-network">
 <span aria-hidden="true" class="icon-arrow-down"><span>▼</span></span> Close
                 </a>,
 <a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>,
 <a href="https://www.python.org/psf/" title="The Python Software Foundation">PSF</a>,
 <a href="https://docs.python.org" title="Python Documentation">Docs</a>,
 <a href="https://pypi.org/" title="Python Package Index">PyPI</a>,
 <a href="/jobs/" title="Python Job Board">Jobs</a>,
 <a href="/community-landing/">Community</a>,
 <a aria-hidden="true" class="jump-link" href="#top" id="python-network">
 <span aria-hidden="true" class="icon-arrow-up"><span>▲</span></span> The Python Network
                 </a>,
 <a href="/"><img alt="python™" class="python-logo" src="/static/img/python-logo.png"/></a>,
 <a class="donate-

In [7]:
links[-1].text

'Privacy Policy'

In [8]:
[a.get('href') for a in links]

['#content',
 '#python-network',
 '/',
 'https://www.python.org/psf/',
 'https://docs.python.org',
 'https://pypi.org/',
 '/jobs/',
 '/community-landing/',
 '#top',
 '/',
 'https://psfmember.org/civicrm/contribute/transact?reset=1&id=2',
 '#site-map',
 '#',
 'javascript:;',
 'javascript:;',
 'javascript:;',
 '#',
 'https://www.linkedin.com/company/python-software-foundation/',
 'https://fosstodon.org/@ThePSF',
 '/community/irc/',
 'https://twitter.com/ThePSF',
 '/about/',
 '/about/apps/',
 '/about/quotes/',
 '/about/gettingstarted/',
 '/about/help/',
 'http://brochure.getpython.info/',
 '/downloads/',
 '/downloads/',
 '/downloads/source/',
 '/downloads/windows/',
 '/downloads/macos/',
 '/download/other/',
 'https://docs.python.org/3/license.html',
 '/download/alternatives',
 '/doc/',
 '/doc/',
 '/doc/av',
 'https://wiki.python.org/moin/BeginnersGuide',
 'https://devguide.python.org/',
 'https://docs.python.org/faq/',
 'http://wiki.python.org/moin/Languages',
 'https://peps.python.org',
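Many of the `href` values above are relative (`/jobs/`) or point within the page (`#content`). A small sketch of turning them into absolute URLs with the standard library's `urljoin` follows. The base URL is an assumption about where the page was served from, and the short `hrefs` list is copied by hand from the output above.

```python
from urllib.parse import urljoin

base_url = "https://www.python.org/"  # assumed address of the fetched page
hrefs = ["#content", "/jobs/", "https://pypi.org/", "javascript:;", "/about/"]

absolute_links = [
    urljoin(base_url, href)
    for href in hrefs
    # skip in-page anchors and JavaScript pseudo-links
    if not href.startswith(("#", "javascript:"))
]
print(absolute_links)
# ['https://www.python.org/jobs/', 'https://pypi.org/', 'https://www.python.org/about/']
```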

In [3]:
import requests

# A timeout (in seconds) makes the request fail quickly instead of hanging
# if the site is slow to respond
response = requests.get('https://www.nytimes.com/', timeout=10)
soup = bs4.BeautifulSoup(response.content, 'html.parser')

In [47]:
soup

<!DOCTYPE html>

<html class="nytapp-vi-homepage" lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<meta charset="utf-8"/>
<title data-rh="true">The New York Times - Breaking News, US News, World News and Videos</title>
<meta content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. and international news, politics, business, technology, science, health, arts, sports and more." data-rh="true" name="description"/><meta content="https://www.nytimes.com" data-rh="true" property="og:url"/><meta content="website" data-rh="true" property="og:type"/><meta content="The New York Times - Breaking News, US News, World News and Videos" data-rh="true" property="og:title"/><meta content="Live news, investigations, opinion, photos and video by the journalists of The New York Times from more than 150 countries around the world. Subscribe for coverage of U.S. and i

In [48]:
# find all h1, h2, h3 headers
soup.find_all("h1")

[<h1 class="css-1dv1kvn">New York Times - Top Stories</h1>]

In [49]:
soup.find_all("h2")

[<h2 class="css-1dv1kvn" id="hp-live-band-container-heading">Live</h2>,
 <h2 class="css-1dv1kvn">Top Stories</h2>,
 <h2><a aria-hidden="false" class="css-9mylee" data-uri="" href="https://www.nytimes.com/section/opinion"><div class="css-v7w2uh"><span>Opinion</span></div></a></h2>,
 <h2><div class="css-v7w2uh"><span>In Case You Missed It</span></div></h2>,
 <h2><div class="css-v7w2uh"><span>More News</span></div></h2>,
 <h2><div class="css-v7w2uh"><span>Well</span></div></h2>,
 <h2><div class="css-v7w2uh"><span>Culture and Lifestyle</span></div></h2>,
 <h2><a aria-hidden="false" class="css-9mylee" data-uri="" href="https://theathletic.com/"><div class="css-1lel406"><p class="css-2z4gz0"><span>The Athletic</span></p><span class="css-t3ges8">Sports coverage</span></div></a></h2>,
 <h2><a aria-hidden="false" class="css-9mylee" data-uri="" href="https://cooking.nytimes.com/"><div class="css-1lel406"><p class="css-2z4gz0"><span>Cooking</span></p><span class="css-t3ges8">Recipes and guides</s

In [50]:
soup.find_all("h3")

[<h3 class="css-1nudrh3" id="U.S.-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header"></h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header">U.S. Politics</h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="U.S.-thumbnail-column-header">Newsletters</h3>,
 <h3 class="css-1nudrh3" id="U.S.-thumbnail-column-header">Podcasts</h3>,
 <h3 class="css-1nudrh3" id="World-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="World-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="World-thumbnail-column-header">Newsletters</h3>,
 <h3 class="css-1nudrh3" id="World-thumbnail-column-header"></h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header"></h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="Business-thumbnail-column-header"

In [51]:
# another way to find all headers (h1 - h6)
import re

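# raw string so "\"-style escapes reach the regex engine untouched;
# ^ and $ restrict matches to exactly h1 through h6
pattern = re.compile(r"^h[1-6]$")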
soup.find_all(pattern)

[<h3 class="css-1nudrh3" id="U.S.-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header"></h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header">U.S. Politics</h3>,
 <h3 class="css-1nudrh3" id="U.S.-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="U.S.-thumbnail-column-header">Newsletters</h3>,
 <h3 class="css-1nudrh3" id="U.S.-thumbnail-column-header">Podcasts</h3>,
 <h3 class="css-1nudrh3" id="World-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="World-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="World-thumbnail-column-header">Newsletters</h3>,
 <h3 class="css-1nudrh3" id="World-thumbnail-column-header"></h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header">Sections</h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header"></h3>,
 <h3 class="css-1nudrh3" id="Business-links-column-header">Top Stories</h3>,
 <h3 class="css-1nudrh3" id="Business-thumbnail-column-header"

HTML tags can also have IDs attached to them:

```html
<h1 id="intro">My Website</h1>
```

An `id` for an HTML tag must be unique, and can only be used once within a page.

There are also `class`es on HTML tags, which can be used on multiple HTML elements:

```html
<p class="red">A paragraph</p>
```

A typical reason to use IDs and classes in HTML is styling. In the `<head>` of an HTML page, we can add a `<style>` block that defines some CSS (cascading style sheets) - which is just a language that manipulates how a document in HTML is presented; like font color, background color, alignment, etc. You apply a style to an HTML element by ID or class:

```html
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My Website</title>
    <style>
    /* Style the element with the id "myHeader" */
    #myHeader {
      background-color: lightblue;
      color: black;
      padding: 40px;
      text-align: center;
    }

    /* Style all elements with the class name "city" */
    .city {
      background-color: tomato;
      color: white;
      padding: 10px;
    }
    </style>
  </head>
  <body>
    <!-- An element with a unique id -->
    <h1 id="myHeader">My Cities</h1>

    <!-- Multiple elements with same class -->
    <h2 class="city">London</h2>
    <p>London is the capital of England.</p>

    <h2 class="city">Paris</h2>
    <p>Paris is the capital of France.</p>

    <h2 class="city">Tokyo</h2>
    <p>Tokyo is the capital of Japan.</p>
  </body>
</html>
```
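The `id` and `class` attributes used for styling are also the hooks a scraper uses to locate elements. Below is a minimal sketch that uses only Python's standard-library `html.parser` module on a trimmed copy of the page above. The class name `IdAndClassCollector` is made up for illustration, and the sketch assumes flat elements with no nested tags. With BeautifulSoup, the equivalent lookups would be `soup.find(id="myHeader")` and `soup.find_all(class_="city")`.

```python
from html.parser import HTMLParser

# Trimmed copy of the example page above
HTML_DOC = """
<h1 id="myHeader">My Cities</h1>
<h2 class="city">London</h2>
<p>London is the capital of England.</p>
<h2 class="city">Paris</h2>
<h2 class="city">Tokyo</h2>
"""


class IdAndClassCollector(HTMLParser):
    """Collect the text of elements matching a given id or class (hypothetical helper)."""

    def __init__(self, target_id, target_class):
        super().__init__()
        self.target_id = target_id
        self.target_class = target_class
        self.id_text = None     # text of the one element with target_id
        self.class_texts = []   # text of every element with target_class
        self._capture = None    # "id", "class", or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if attrs.get("id") == self.target_id:
            self._capture = "id"
        elif self.target_class in classes:
            self._capture = "class"

    def handle_data(self, data):
        if self._capture == "id":
            self.id_text = data.strip()
        elif self._capture == "class":
            self.class_texts.append(data.strip())

    def handle_endtag(self, tag):
        # Stop capturing once the element closes
        self._capture = None


collector = IdAndClassCollector("myHeader", "city")
collector.feed(HTML_DOC)
print(collector.id_text)      # My Cities
print(collector.class_texts)  # ['London', 'Paris', 'Tokyo']
```

One element comes back for the ID lookup and a list comes back for the class lookup, which mirrors the rule that an ID is unique while a class can be shared.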

#### Exercise [SKIPPED]

Create a function `fetch_by_id` that fetches the contents of a specific URL and searches that content for an element with a given ID. The function should take the URL and element ID and should return the text of the element or `None`. Handle any attribute exceptions due to the element not having a `text` attribute.

```py
>>> fetch_by_id('https://example.com/testpage.html', 'test-elem')
'This is a test element'
```

In [28]:
import bs4
import requests

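def fetch_by_id(url, elem_id):
    """Return the text of the element with the given ID, or None if it is missing."""
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.content, 'html.parser')
    try:
        # soup.find returns None when nothing matches, so .text raises AttributeError
        return soup.find(id=elem_id).text
    except AttributeError:
        return None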

In [30]:
fetch_by_id('https://www.nytimes.com', 'app')



##### Note

In addition to some HTML parsing, the homework will ask you to parse data that is in [XML format](https://developer.mozilla.org/en-US/docs/Web/XML/XML_introduction). Have a quick read of what it is (it's quite similar to HTML). Then walk through [Python's XML tutorial](https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) using the `xml.etree.ElementTree` module.
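As a small preview, here is a hedged sketch of reading XML with `xml.etree.ElementTree`. The `<cities>` document is invented for illustration and is not the homework data.

```python
import xml.etree.ElementTree as ET

# Made-up XML document: one <city> element per city, with an "id" attribute
XML_DOC = """<cities>
  <city id="1"><name>Seattle</name><latitude>47.6062</latitude></city>
  <city id="2"><name>New York</name><latitude>40.7128</latitude></city>
</cities>"""

root = ET.fromstring(XML_DOC)  # parse a string into an Element tree

# Child elements are looked up by tag name; attributes are read with .get()
names = [city.find("name").text for city in root.findall("city")]
ids = [city.get("id") for city in root.findall("city")]

print(root.tag)  # cities
print(names)     # ['Seattle', 'New York']
print(ids)       # ['1', '2']
```

Note that everything comes back as a string (even the IDs), so numeric values need an explicit `int(...)` or `float(...)` conversion.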

## 8.2 Downloading Data Programmatically

You're actually already familiar with downloading data from the internet – every time you go to a website in your browser, the browser downloads data from a server – usually some HTML, some CSS for styling, often some JavaScript. The browser renders all that data for you.

But you can also download all that information (HTML, CSS, JavaScript, and other data like JSON, CSV, Parquet, etc) with Python. You can write a couple of lines of code to _programmatically_ download data over the internet.

The underlying procedure that both your browser and your Python code use to download data is called HTTP – or Hypertext Transfer Protocol.

### 8.2.1 What is HTTP?

When we navigate to a website like `spotify.com` in the browser, the browser sends what's called a `GET` request that asks for the actual webpage to render.

In [64]:
response = requests.get("https://spotify.com")

In [56]:
response.content

b'<!doctype html><html class="mobile-web-player" lang="en" dir="ltr"><head><meta charSet="utf-8"/><title>Spotify - Web Player: Music for everyone</title><meta property="og:site_name" content="Spotify"/><meta property="fb:app_id" content="174829003346"/><link rel="icon" sizes="32x32" type="image/png" href="https://open.spotifycdn.com/cdn/images/favicon32.b64ecc03.png"/><link rel="icon" sizes="16x16" type="image/png" href="https://open.spotifycdn.com/cdn/images/favicon16.1c487bff.png"/><link rel="icon" href="https://open.spotifycdn.com/cdn/images/favicon.0f31d2ea.ico"/><meta http-equiv="X-UA-Compatible" content="IE=9"/><meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"/><link rel="preload" href="https://encore.scdn.co/fonts/CircularSp-Book-4eaffdf96f4c6f984686e93d5d9cb325.woff2" as="font" type="font/woff2" crossorigin="anonymous"/><link rel="preload" href="https://encore.scdn.co/fonts/CircularSp-Bold-fe1cfc14b7498b187c78fa72fb72d148.woff2" as="font" type

In [57]:
# Use BeautifulSoup to "pretty"-print the HTML we downloaded
soup = bs4.BeautifulSoup(response.content, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<html class="mobile-web-player" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Spotify - Web Player: Music for everyone
  </title>
  <meta content="Spotify" property="og:site_name"/>
  <meta content="174829003346" property="fb:app_id"/>
  <link href="https://open.spotifycdn.com/cdn/images/favicon32.b64ecc03.png" rel="icon" sizes="32x32" type="image/png"/>
  <link href="https://open.spotifycdn.com/cdn/images/favicon16.1c487bff.png" rel="icon" sizes="16x16" type="image/png"/>
  <link href="https://open.spotifycdn.com/cdn/images/favicon.0f31d2ea.ico" rel="icon"/>
  <meta content="IE=9" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
  <link as="font" crossorigin="anonymous" href="https://encore.scdn.co/fonts/CircularSp-Book-4eaffdf96f4c6f984686e93d5d9cb325.woff2" rel="preload" type="font/woff2"/>
  <link as="font" crossorigin="anonymous" href="https://encore.scdn.co/fonts/CircularS

### 8.2.2 Anatomy of HTTP

The code `requests.get("https://spotify.com")` is really nice and simple, and it hides a lot of the details of what's happening with an HTTP request and response. 

Here's a diagram of the 5 different parts of an HTTP request, and the 4 different parts of a response: 

![](https://blog.kakaocdn.net/dn/BaB1D/btqV2a7KgVn/aORy6qJPrs2a1jJHdxBkO0/img.jpg)

When we make a request to a website, that HTTP request contains more than just the URL that you're asking for. It also contains a **verb**, the **URI** (uniform resource identifier – similar to a URL), the HTTP **version**, **headers**, and optionally a **body** (the request message).

Let's take a look at `http://httpbin.org/get` as an example. If I were to do this in Python:

In [70]:
response = requests.get("http://httpbin.org/get")

Behind the scenes, my computer sends the following information:

```http
GET /get HTTP/1.1
Host: httpbin.org
Accept: */*
User-Agent: python-requests/2.31.0
```

Looking back at the diagram above for "HTTP Request":

* the verb is `GET`
* the URI is `/get`
* the version of HTTP we're using is 1.1
* the headers are:
    * [`Host`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Host) – the server we're talking to. Combined with the URI, it gives the full URL requested.
    * [`Accept`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept) – this tells the server for httpbin.org what type of data we will accept (e.g. JSON, HTML, etc – in our case, "anything").
    * [`User-Agent`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) – what type of program is making the request (a Python program, or a browser, or something else). This is for the server's benefit.


_**FYI**: There can be many more headers attached to an HTTP request. For instance, any authorization needed to access the website (like a password), or how data should be delivered to us (can the server send compressed data or not), and many other things. If you're curious, you can find out more headers [here with Mozilla's really great developer docs](https://developer.mozilla.org/en-US/docs/Glossary/Request_header)._
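To make the pieces concrete, here is a minimal sketch that assembles the text of a request from a URL. The function name `build_request_text` is hypothetical, and real clients such as `requests` add more headers than this; the point is only to show where the verb, URI, version, and headers go.

```python
from urllib.parse import urlsplit


def build_request_text(method, url, headers=None):
    """Assemble the start line and headers of an HTTP/1.1 request (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path or "/"       # an empty path means the root, "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"{method} {path} HTTP/1.1",   # verb, URI, version
        f"Host: {parts.netloc}",       # Host + URI = the full URL
    ]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines end with CRLF; a blank line marks the end of the headers
    return "\r\n".join(lines) + "\r\n\r\n"


text = build_request_text("GET", "http://httpbin.org/get", {"Accept": "*/*"})
print(text)
```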

You can see this information in the Python code as well:

In [71]:
# request method
response.request.method

'GET'

In [72]:
# the URI
response.request.path_url

'/get'

In [73]:
# and the full URL
response.request.url

'http://httpbin.org/get'

In [74]:
# HTTP version
response.raw.version

11

_Note: `11` here means HTTP version 1.1. If it said `10`, that would be version `1.0`, `20` would be `2.0`._

In [75]:
# request headers
for key, value in response.request.headers.items():
    print(f"{key}: {value}")

User-Agent: python-requests/2.31.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive


In [76]:
# request body/message
response.request.body  # no body

We can also see the parts of the HTTP response too:

In [79]:
# response headers
for key, value in response.headers.items():
    print(f"{key}: {value}")

Date: Mon, 18 Mar 2024 22:44:20 GMT
Content-Type: application/json
Content-Length: 306
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true


In [80]:
# response message/body
response.text

'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.31.0", \n    "X-Amzn-Trace-Id": "Root=1-65f8c3c3-727f1ac4762330ed6e99b12e"\n  }, \n  "origin": "209.2.227.186", \n  "url": "http://httpbin.org/get"\n}\n'

In [81]:
# can `print` the body to make it easier to read
print(response.text)

{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.31.0", 
    "X-Amzn-Trace-Id": "Root=1-65f8c3c3-727f1ac4762330ed6e99b12e"
  }, 
  "origin": "209.2.227.186", 
  "url": "http://httpbin.org/get"
}



In [82]:
# status code
response.status_code

200

The status code tells us whether or not the server could complete the request we sent it. It's the _status_ of the response, and is quite important. There are 5 types of statuses:

* `1xx` indicates an informational message only
* `2xx` indicates success of some kind
* `3xx` redirects the client to another URL
* `4xx` indicates an error on the client’s part
* `5xx` indicates an error on the server’s part

So, if you encounter an error, you can look at the HTTP response to check which status code you received. [This site](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) lists all the HTTP response status codes and what they mean, and [another helpful site](https://http.cat/) can help you remember them.
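The five families of status codes can be told apart by the first digit. Here is a small sketch using the standard library's `http.HTTPStatus` to look up the official phrase; the helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus

# First digit of the status code -> family of the response
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    category = CATEGORIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not every number in a family is a registered status code
        phrase = "Unknown"
    return f"{code} {phrase} ({category})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
print(describe_status(503))  # 503 Service Unavailable (server error)
```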

All this information (the request and response headers, HTTP version, response status code) is more relevant to the underlying software. We usually only care about the verb (`GET` in our case above) and the full URL. But this is meant to show how clients (like a browser) and servers (where websites "live") speak to one another using the **Hypertext Transfer Protocol** (HTTP).

### 8.2.3 HTTP Methods

HTTP has what are called _methods_ (or "verbs"). We've been working with `GET` so far (`requests.get(URL)`), but there are more. The most common HTTP verbs are:

* `GET` - read a resource without modifying it
* `POST` - create a new resource
* `DELETE` - remove a resource
* `PUT` - update (replace) an existing resource

There are a total of 9 methods (read about the others [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods)), but we'll really only care about these four.

We would use `POST` when (for example) we're filling out a Google form, or creating a new account for a website. 

A `DELETE` request is used when we might want to delete our account from a website.

And a `PUT` request is used to update existing information – like updating our address with the DMV when we move.
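As a rough sketch of how the verb and body fit together, the standard library's `urllib.request.Request` can build a request object without sending anything. The `https://example.com/accounts` URL and the payload are hypothetical and never contacted.

```python
import json
from urllib.request import Request

URL = "https://example.com/accounts"  # hypothetical endpoint, never contacted
payload = json.dumps({"address": "123 Main St"}).encode("utf-8")
json_headers = {"Content-Type": "application/json"}

# Build (but do not send) one request per verb
built = [
    Request(f"{URL}/42", method="GET"),                                      # read
    Request(URL, data=payload, headers=json_headers, method="POST"),         # create
    Request(f"{URL}/42", data=payload, headers=json_headers, method="PUT"),  # update
    Request(f"{URL}/42", method="DELETE"),                                   # remove
]

for req in built:
    has_body = req.data is not None
    print(f"{req.get_method():6} {req.full_url}  body={has_body}")
```

`POST` and `PUT` carry a body because they send data to the server, while `GET` and `DELETE` here only name the resource they act on.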

These methods (verbs) for HTTP are actually the building blocks for what's called "REST".

### 8.2.4 REST

REST stands for REpresentational State Transfer.

Essentially, it's a set of conventions (an architectural style rather than a formal specification) built on top of HTTP. It revolves around resources: every component is a resource, and a resource is accessed through a common interface using the standard HTTP methods. We've already talked about a few of those methods: `GET`, `POST`, `DELETE`, and `PUT`.

We call something a "RESTful" service (website) if it follows the guidelines of the REST standard. To interact with a RESTful service, we need a REST client in order to access and modify the data (resources) that the service (like a website) hosts. 

Each resource (data) is identified by a URI ("Uniform Resource Identifier", or a global ID). And a resource can be represented in various data formats - like JSON, XML, or others. JSON is quite popular.
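To illustrate "one resource, several representations", here is a hedged sketch that serializes the same Python dictionary as JSON and as XML using only the standard library. The `city` record and the XML element names are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# One resource, held in memory as a Python dict
city = {"id": 1, "name": "Seattle", "latitude": 47.6062}

# Representation 1: JSON
as_json = json.dumps(city)

# Representation 2: XML, with one child element per field
root = ET.Element("city")
for key, value in city.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)  # {"id": 1, "name": "Seattle", "latitude": 47.6062}
print(as_xml)   # <city><id>1</id><name>Seattle</name><latitude>47.6062</latitude></city>
```

Both strings describe the same resource; a client says which representation it prefers through the `Accept` request header.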

#### Endpoints or URLs

RESTful services typically follow a standardized approach to URLs. Endpoints are named with _nouns_, which pair with the HTTP verbs (methods).

A RESTful service will also follow a bit of grammar. If an endpoint is a plural noun, we should expect a list of resources. If it's singular or a specific noun (maybe a proper noun, or a specific ID), we should expect a single resource.

Some examples should help convey what a resource is and how to interact with it.

#### Dummy REST service: `example.com/cities`

Let's say we have a website – `example.com/cities` – that has information about various cities around the world. We can look up information for cities that the website knows about, add new cities that it doesn't know about, make changes for any errors we find, and delete cities if we ever need to.

We should be able to send a `GET` request (the verb), to `/cities` (the endpoint, which is a plural noun), and get a response back that lists all the cities in the site's database.

HTTP Request:

```http
GET /cities HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```

_(Note: the_ `[...]` _just means that there could be more headers in the request, but they don't matter for our sake.)_

HTTP Response:

```http
HTTP/1.1 200
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```
```json
{
    "data": [
        {
            "Seattle": {
                "id": 1,
                "name": "Seattle",
                "latitude": 47.6062,
                "longitude": 122.3321
            }
        },
        {
            "New York": {
                "id": 2,
                "name": "New York",
                "latitude": 40.7128,
                "longitude": 74.0060
            }
        }
    ]
}
```

Notice that the endpoint we called, `/cities`, is a plural noun. And we got a list of cities. Following this convention is what's considered RESTful.

We should also be able to send a `GET` request (the verb) to the `/cities/Seattle` endpoint (the noun) to get specific information about the Seattle entry in the database.

Request:

```http
GET /cities/Seattle HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```

Response:

```http
HTTP/1.1 200
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```
```json
{
    "id": 1,
    "name": "Seattle",
    "latitude": 47.6062,
    "longitude": 122.3321
}
```

Notice that the `/cities/Seattle` endpoint is a specific (proper) noun – Seattle, and I received information about the specific Seattle resource, not a list that includes other cities.

If I send a `GET` request to a city (resource) that does not exist, the server should respond with a `404` (not found) or a `400` (bad request):


Request:

```http
GET /cities/Boston HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```

Response:

```http
HTTP/1.1 404
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```

We should be able to send a `POST` request to `/cities` to add a new city to the database by including some JSON data in the body of the request:

Request:

```http
POST /cities HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```
```json
{
    "name": "Boston",
    "latitude": 42.3601,
    "longitude": 71.0589
}
```

Response:

```http
HTTP/1.1 201
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
Location: /cities/Boston
[...]
```

The server should create the entry in its database, as well as generate an ID for me. The response should also have the location (URL) of the newly-created resource. I should be able to now send a `GET` request for `/cities/Boston`: 

Request:

```http
GET /cities/Boston HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```

Response:

```http
HTTP/1.1 200
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```
```json
{
    "id": 3,
    "name": "Boston",
    "latitude": 42.3601,
    "longitude": 71.0589
}
```

If I messed up, and need to update the coordinates, I can send a `PUT` request with a body of data in the request to change the coordinates: 

Request:

```http
PUT /cities/Boston HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```
```json
{
    "name": "Boston",
    "latitude": 42.3602,
    "longitude": 71.0588
}
```

Response:

```http
HTTP/1.1 200
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```
```json
{
    "id": 3,
    "name": "Boston",
    "latitude": 42.3602,
    "longitude": 71.0588
}
```

We should also be able to delete a city if we need to by sending a `DELETE` request:

Request:

```http
DELETE /cities/Boston HTTP/1.1
Host: example.com
Accept: */*
Connection: keep-alive
[...]
```

Response:

```http
HTTP/1.1 200
Date: Mon, 31 Oct 2022 18:17:59 GMT
Content-Type: application/json
[...]
```
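The whole walkthrough above can be mimicked with a toy, in-memory stand-in for the made-up `example.com/cities` service. This is only a sketch under stated assumptions: the class `CityService` and its `handle` method are hypothetical, the list response is simplified to a plain list of city records, longitude is left out for brevity, and names containing spaces would need URL decoding that is not handled here.

```python
class CityService:
    """Toy in-memory stand-in for the hypothetical example.com/cities service."""

    def __init__(self):
        self.cities = {
            "Seattle": {"id": 1, "name": "Seattle", "latitude": 47.6062},
            "New York": {"id": 2, "name": "New York", "latitude": 40.7128},
        }
        self.next_id = 3

    def handle(self, method, path, body=None):
        """Return (status_code, headers, body) for one request."""
        parts = [p for p in path.split("/") if p]
        if not parts or parts[0] != "cities" or len(parts) > 2:
            return 404, {}, None
        name = parts[1] if len(parts) == 2 else None

        if name is None:  # plural endpoint: the whole collection
            if method == "GET":
                return 200, {}, {"data": list(self.cities.values())}
            if method == "POST":
                city = {"id": self.next_id, **body}  # the server generates the ID
                self.next_id += 1
                self.cities[city["name"]] = city
                return 201, {"Location": f"/cities/{city['name']}"}, city
            return 405, {}, None  # method not allowed on the collection

        if name not in self.cities:  # specific endpoint: a single resource
            return 404, {}, None
        if method == "GET":
            return 200, {}, self.cities[name]
        if method == "PUT":
            self.cities[name].update(body)
            return 200, {}, self.cities[name]
        if method == "DELETE":
            del self.cities[name]
            return 200, {}, None
        return 405, {}, None


service = CityService()
print(service.handle("GET", "/cities/Boston")[0])   # 404
status, headers, city = service.handle(
    "POST", "/cities", {"name": "Boston", "latitude": 42.3601}
)
print(status, headers["Location"], city["id"])      # 201 /cities/Boston 3
print(service.handle("DELETE", "/cities/Boston")[0])  # 200
print(service.handle("GET", "/cities/Boston")[0])   # 404
```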

To recap: REST is just a set of mutually agreed-upon conventions for clients (browser, Python) and servers (websites) to communicate using HTTP.

#### Other RESTful Features [SKIPPED]

Some other basic features of a RESTful service:

* Stateless: The server keeps no memory of the client between requests; each request must carry everything the server needs to handle it.
* Client<->Server: There is a separation of concerns between the front-end (client) and the back-end (server). They operate independently of each other and both are replaceable.
* Cache: Data from the server can be cached on the client, which can improve performance.

You can read more about REST [here](https://en.wikipedia.org/wiki/Representational_state_transfer). It's also good to understand what [idempotency](https://developer.mozilla.org/en-US/docs/Glossary/Idempotent) means with respect to the different HTTP verbs on RESTful resources.
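A quick way to get a feel for idempotency: repeating an idempotent request leaves the server in the same state as making it once. The sketch below uses plain dictionaries in place of a server; `put_city` and `post_city` are made-up helpers that stand in for `PUT` and `POST`.

```python
def put_city(store, name, city):
    """PUT: create or replace the resource stored under a known name."""
    store[name] = city


def post_city(store, city):
    """POST: always create a new resource under a freshly generated ID."""
    new_id = max(store, default=0) + 1
    store[new_id] = city
    return new_id


boston = {"name": "Boston", "latitude": 42.3601}

# Sending the same PUT twice leaves exactly one Boston: idempotent
put_store = {}
put_city(put_store, "Boston", boston)
put_city(put_store, "Boston", boston)
print(len(put_store))   # 1

# Sending the same POST twice creates two entries: not idempotent
post_store = {}
post_city(post_store, boston)
post_city(post_store, boston)
print(len(post_store))  # 2
```

This is why retrying a failed `PUT` or `DELETE` is generally safe, while blindly retrying a `POST` can create duplicates.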

---


#### Why learn REST?

We will often use a REST API when interacting with other websites' data, particularly where we don't want to scrape a website's HTML to get data that we need.

It's also quite nice that we can download data directly to our Jupyter notebook (programmatically downloading data), rather than manually downloading a file, and reading that file.

We'll see in the homework (and elsewhere, like the final project) that we can use the `requests` package to help us interact with REST APIs within Python (these won't actually be successful since I'm using a made-up service as an example):

```python
import requests

# GET request: list all cities
response = requests.get("https://example.com/cities")

# POST request: create Boston (json=... sends the dict as a JSON body)
data = {"name": "Boston", "latitude": 42.3601, "longitude": 71.0589}
response = requests.post("https://example.com/cities", json=data)

# PUT request: correct Boston's coordinates
data = {"name": "Boston", "latitude": 42.3602, "longitude": 71.0588}
response = requests.put("https://example.com/cities/Boston", json=data)

# DELETE request: remove Boston
response = requests.delete("https://example.com/cities/Boston")
```
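When a dictionary is passed through the `json=` argument, `requests` serializes it to JSON text for the request body. The round trip from dictionary to text and back can be sketched with the standard `json` module alone, with no network involved:

```python
import json

data = {"name": "Boston", "latitude": 42.3601, "longitude": 71.0589}

body = json.dumps(data)      # dict -> JSON text, ready to travel over HTTP
restored = json.loads(body)  # JSON text -> dict, as the receiving side would do

print(type(body).__name__)   # str
print(restored == data)      # True
```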

#### Practice at home

You can check out a list of interesting public REST APIs [here](https://github.com/public-apis/public-apis) to play around with. Select an API (website) that interests you (although some may not be working), and play around with `requests.get(...)`. To get started, try out:

In [83]:
response = requests.get("https://coffee.alexflipnote.dev/random.json")

In [84]:
response

<Response [200]>

In [85]:
response.text

'{\n  "file": "https://coffee.alexflipnote.dev/6HScQuO8B_o_coffee.png"\n}'

In [86]:
# convert text of response (which is json formatted) to python dict
response.json()

{'file': 'https://coffee.alexflipnote.dev/6HScQuO8B_o_coffee.png'}

In [88]:
import json

In [89]:
# above cell is the same as using python's json library
json.loads(response.text)

{'file': 'https://coffee.alexflipnote.dev/6HScQuO8B_o_coffee.png'}

In [90]:
data = response.json()
data["file"]

'https://coffee.alexflipnote.dev/6HScQuO8B_o_coffee.png'

Viewing image from `https://coffee.alexflipnote.dev/-cYnglSh1uA_coffee.jpg`:

![](https://coffee.alexflipnote.dev/-cYnglSh1uA_coffee.jpg)

## 8.3 Appendix

Data (Serialization) Formats
* [CSV (wiki)](https://en.wikipedia.org/wiki/Comma-separated_values)
* [IETF's RFC 4180 on CSV](https://datatracker.ietf.org/doc/html/rfc4180#page-1) (good to familiarize yourself with reading RFCs from IETF, even if you don't understand all of it)(including many that we didn't review)
* [Introduction to working with CSV with Python](https://python-adv-web-apps.readthedocs.io/en/latest/csv.html)
* [What Is Apache Parquet?](https://www.dremio.com/resources/guides/intro-apache-parquet/)
* [Demystifying the Parquet File Format](https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705) (may need to open this in incognito if it wants you to sign in/sign up)
* [CSV Files for Storage? No Thanks. There's a Better Option](https://towardsdatascience.com/csv-files-for-storage-no-thanks-theres-a-better-option-72c78a414d1d) (CSV vs Parquet) (may need to open this in incognito if it wants you to sign in/sign up)
* [Understanding Apache Parquet](https://towardsdatascience.com/understanding-apache-parquet-7197ba6462a9) (may need to open this in incognito if it wants you to sign in/sign up)
* [JSON.org](https://www.json.org/json-en.html)
* [Introduction to JSON](https://www.digitalocean.com/community/tutorials/an-introduction-to-json)
* [Working with JSON](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON)
* [Working with JSON Data in Python](https://realpython.com/python-json/) (may need to open this in incognito if it wants you to sign in/sign up)

* [Comparison of data serialization formats](https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats)

* [Web scraping with Beautiful Soup](https://www.analyticsvidhya.com/blog/2021/08/a-simple-introduction-to-web-scraping-with-beautiful-soup/) (HTML parsing)
* [HTML validator](https://www.freeformatter.com/html-validator.html) 
* [Understanding HTML Basics for Web Scraping](https://betterprogramming.pub/understanding-html-basics-for-web-scraping-ae351ee0b3f9)

Formats we didn't go over but may be of interest
* [TSV](https://en.wikipedia.org/wiki/Tab-separated_values)
* [XML](https://en.wikipedia.org/wiki/XML)
* [XML Introduction](https://developer.mozilla.org/en-US/docs/Web/XML/XML_introduction)
* [Python's XML tutorial](https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree)
* [YAML](https://en.wikipedia.org/wiki/YAML)
* [Pickle (Python-only)](https://en.wikipedia.org/wiki/Serialization#Pickle)
* [Protobuf](https://en.wikipedia.org/wiki/Protocol_Buffers) (short for "protocol buffers")

How the Internet Works

* [Great video to watch](https://pyvideo.org/pycon-us-2013/how-the-internet-works.html) from PyCon 2013
* [Cartoon/graphic used above](https://blog.knowbe4.com/hubfs/How-The-Web-Works-1.jfif)
* [How does the Internet work?](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/How_does_the_Internet_work) (Mozilla)
* [How does the Internet Work](https://web.stanford.edu/class/msande91si/www-spr04/readings/week1/InternetWhitepaper.htm) (Stanford)
* [_Very_ in-depth walkthrough of "what happens when you type a URL into the browser"](https://github.com/alex/what-happens-when)
* [Internet protocol suite](https://en.wikipedia.org/wiki/Internet_protocol_suite)
* [OSI model](https://en.wikipedia.org/wiki/OSI_model) (open systems interconnection model)
* [What is the OSI model?](https://www.cloudflare.com/learning/ddos/glossary/open-systems-interconnection-model-osi/)
* [`dig` command](https://en.wikipedia.org/wiki/Dig_(command)) - CLI tool for DNS queries
* [HTTP Status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) from Mozilla
* [HTTP verbs/methods](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) from Mozilla
* [http.cat](https://http.cat/)

REST
* [What is REST – A simple explanation for beginners](https://medium.com/extend/what-is-rest-a-simple-explanation-for-beginners-part-1-introduction-b4a072f8740f)
* [What is REST](https://restfulapi.net/)
* [What is a REST API](https://blog.postman.com/rest-api-examples/)
* One alternative to a REST API is a [SOAP](https://en.wikipedia.org/wiki/SOAP) API, which is based on the XML data serialization format