# The Internet: Behind the Scenes

The Internet is a transport mechanism that lets any connected device communicate with any other connected device. Behind the scenes:

Each device has a globally distinct IP address, which is a 32-bit number. Usually an IP address is written as a sequence of four decimal numbers separated by dots, each number in the range 0 to 255. For example, when I checked the IP address for my laptop just now, it was 141.211.203.248. Any IP address beginning with 141.211 is for a device at the University of Michigan. When I take my laptop home and connect to a network there, my laptop gets a different IP address that it uses there.
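To make the "32-bit number" idea concrete, here is a minimal sketch using Python's standard `ipaddress` module. It converts the example address above into a single integer and back. The variable names are just for illustration.

```python
import ipaddress

# The example address from the text above.
addr = ipaddress.IPv4Address("141.211.203.248")

# The same address viewed as one integer.
as_int = int(addr)

# Rebuild that integer by hand: each of the four numbers fills 8 bits.
parts = [int(p) for p in "141.211.203.248".split(".")]
by_hand = (parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]

print(as_int == by_hand)               # the two computations agree
print(as_int.bit_length() <= 32)       # the value fits in 32 bits
print(ipaddress.IPv4Address(as_int))   # convert back to dotted form
```

Four numbers of 8 bits each account for the 32 bits, which is why each number can only range from 0 to 255.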

Data is chopped up into reasonably sized packets (up to 65,535 bytes, but usually much smaller).

Each data packet has a header that includes the destination IP address.

Each packet is routed independently, getting passed on from one computing device to another until it reaches its destination. The computing devices that do that packet forwarding are called routers. Each router keeps an address table that says, when it gets a packet for some destination address, which of its neighbors it should pass the packet on to. The routers are constantly talking to each other, passing information about how they should update their routing tables. The system was designed to be resistant to any local damage. If some of the routers stop working, the rest of the routers talk to each other and start routing packets around in a different way so that packets still reach their intended destination if there is some path to get there. It is this technical capability that has spawned metaphoric quotes like this one from John Gilmore: “The Net interprets censorship as damage and routes around it.”
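The forwarding idea can be sketched as a toy model. This is not how real routers are implemented: the router names (A, B, C, D, E), the table layout, and the `route` function are all made up for illustration. Each router's table simply says which neighbor gets the packet next for a given destination.

```python
# Hypothetical network: each router maps a destination to the neighbor
# that should receive the packet next (its "next hop").
routing_tables = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "D"},
}

def route(source, destination, tables, max_hops=10):
    """Follow next-hop entries from source until destination is reached."""
    path = [source]
    current = source
    while current != destination:
        if len(path) > max_hops:
            raise RuntimeError("too many hops; possible routing loop")
        next_hop = tables.get(current, {}).get(destination)
        if next_hop is None:
            raise RuntimeError(f"{current} has no route to {destination}")
        path.append(next_hop)
        current = next_hop
    return path

print(route("A", "D", routing_tables))   # ['A', 'B', 'C', 'D']

# Simulate damage: router B fails, and A's table is updated to use E instead.
routing_tables["A"]["D"] = "E"
routing_tables["E"] = {"D": "D"}
print(route("A", "D", routing_tables))   # ['A', 'E', 'D']
```

In this sketch the table update is done by hand. On the real Internet, routers exchange information with each other and update their own tables automatically.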

At the destination, the packets are reassembled into the original data message.
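The chop-up and reassemble steps can be sketched in a few lines. This is a simplified toy, not the real packet format: the dictionary fields, the tiny packet size, and the sequence numbers are assumptions made for illustration. In practice, putting data back in order is handled by protocols such as TCP that are layered on top of IP.

```python
import random

def to_packets(data, destination, size=8):
    """Split data into packets of at most `size` bytes, each with a header."""
    packets = []
    for seq, start in enumerate(range(0, len(data), size)):
        packets.append({
            "dest": destination,                  # header: where the packet is going
            "seq": seq,                           # header: position in the original data
            "payload": data[start:start + size],  # this packet's slice of the data
        })
    return packets

def reassemble(packets):
    """Sort packets by sequence number and join their payloads."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["payload"] for p in ordered)

message = b"Hello from one device to another!"
packets = to_packets(message, "141.211.203.248")

# Packets are routed independently, so they may arrive out of order.
random.shuffle(packets)

print(reassemble(packets) == message)   # True
```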

<img src = "./routers.png" />

# Anatomy of URLs

A URL is used by a browser or other program to specify what server to connect to and what page to ask for. Like other things that will be interpreted by computer programs, URLs have a very specific formal structure. If you put a colon in the wrong place, the URL won’t work correctly. The overall structure of a URL is:

scheme://host:port/path

Usually, the scheme will be http or https. The s in https stands for “secure”. When you use https, all of the communication between the two devices is encrypted. Any device that intercepts some of the packets along the way will be unable to decrypt the contents and figure out what the data was.

Other schemes that you will sometimes see include ftp (for file transfer) and mailto (for email addresses).

The host will usually be a domain name, like si.umich.edu or github.com or google.com. When the URL specifies a domain name, the first thing the computer program does is look up the domain name to find the 32-bit IP address. For example, right now the IP address for github.com is 192.30.252.130. This could change if, for example, GitHub moved its servers to a different location or contracted with a different Internet provider. Lookups use something called the Domain Name System, or DNS for short. Changes to the mapping from domain names to IP addresses can take a little while to propagate: if github.com announces a new IP address associated with its domain, it might take up to 24 hours for some computers to start translating github.com to the new IP address.
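A program can ask for this lookup itself. Here is a minimal sketch using `socket.gethostbyname` from Python's standard library. The `resolve` wrapper is a hypothetical helper added for illustration; it returns None instead of raising an error when the lookup fails.

```python
import socket

def resolve(hostname):
    """Return an IPv4 address string for hostname, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # Raised when the name cannot be resolved (e.g., no network access).
        return None

# A string that is already an IPv4 address comes back unchanged.
print(resolve("127.0.0.1"))
```

Calling `resolve("github.com")` on a connected computer should return a dotted address, but the specific value depends on when and where you run it, for the reasons described above.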

Alternatively, the host can be an IP address directly. This is less common, because IP addresses are harder to remember and because a URL containing a domain name will continue to work even if the remote server keeps its domain name but moves to a different IP address.

The :port is optional. If it is omitted, the default port depends on the scheme: 80 for http and 443 for https. The port number is used on the receiving end to decide which computer program should get the data that has been received. We probably will not encounter any URLs that include the : and a port number in this course.

The /path is also optional. It specifies something about which page, or more generally which contents, are being requested.

For example, consider the URL https://github.com/presnick/runestone:

https:// says to use the secure http protocol.

github.com says to connect to the server at github.com, which currently maps to the IP address 192.30.252.130. The connection will be made on the default port, which is 443 for https.

/presnick/runestone says to ask the remote server for the page presnick/runestone. It is up to the remote server to decide how to map that to the contents of a file it has access to, or to some content that it generates on the fly.

The URL http://blueserver.com/path?k=val is another example that we can consider. The path here is a bit different from https://github.com/presnick/runestone because it includes what are called “query parameters”, the information after the ?.
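Python's standard `urllib.parse` module can pull a URL apart into these pieces. The sketch below applies it to the two example URLs. The `describe` helper and the `DEFAULT_PORTS` dictionary are made up for illustration; `urlsplit` itself reports the port as None when the URL omits it.

```python
from urllib.parse import urlsplit, parse_qs

# Default ports for the two most common schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}

def describe(url):
    """Break a URL into scheme, host, port, path, and query parameters."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        # Fall back to the scheme's default port when none is given.
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parse_qs(parts.query),
    }

print(describe("https://github.com/presnick/runestone"))
print(describe("http://blueserver.com/path?k=val"))
```

Note that `urlsplit` reports the query parameters separately from the path, so for the second URL the path is /path and the query holds the key k with the value val.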

<img src="internet_requests.png"/>

# The HTTP protocol

A protocol specifies the order in which parties will speak and the format of what they say and the content of appropriate responses.

HTTP is the protocol that specifies how web browsers or other programs communicate with web servers. One version of the formal specification, before it was later split into multiple documents, was IETF RFC 2616. It is 176 pages long! Fortunately, the basics are pretty easy to understand.

Step 1: the client makes a request to the server.
If the request only involves fetching data, the client sends a message of the form GET <path>, where <path> is the path part of the URL.

If the request involves sending some data (e.g., a file upload, or some authentication information), the message starts with POST.

In either case, the client sends some HTTP headers. These include:
The type of client program. This allows the server to send back different things to small mobile devices than desktop browsers (a “responsive” website)

Any cookies that the server previously asked the client to hold onto. This allows the server to continue previous interactions, rather than treating every request as stand-alone. It also allows ad networks to place personalized ads.

After the HTTP headers, for a POST type communication, there is some data (the body of the request).

<img src="argumentstoserver.png" />
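To see roughly what the client's message looks like, here is a sketch that builds the text of a simple GET request. The `build_get_request` function, the client name `example-client/0.1`, and the cookie values are hypothetical. Real browsers send many more headers than this.

```python
def build_get_request(host, path, user_agent="example-client/0.1", cookies=None):
    """Build the text of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",        # request line: method, path, protocol version
        f"Host: {host}",               # which site on the server is wanted
        f"User-Agent: {user_agent}",   # the type of client program
    ]
    if cookies:
        # Cookies the server previously asked the client to hold onto.
        pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
        lines.append(f"Cookie: {pairs}")
    # Each line ends with \r\n, and a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"

request = build_get_request("github.com", "/presnick/runestone",
                            cookies={"session": "abc123"})
print(request)
```

A POST request would look similar, except that it starts with POST and the body data follows the blank line.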

Step 2: the server responds to the client.
The server first sends back some HTTP headers. These include:
a response code indicating whether the server thinks it has fulfilled the request or not.

a description of the type of content it is sending back (e.g., text/html when it is sending html-formatted text).

any cookies it would like the client to hold onto and send back the next time it communicates with the server.

After the headers come the contents. This is the stuff that you would see if you ask to “View Source” in a browser.

<img src="serverresponse.png" />
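The response can be taken apart in the same spirit. The sketch below parses a made-up response string into its response code, headers, and contents. The response text and cookie value are invented for illustration, and the parsing is deliberately simplified (for example, a repeated header would overwrite an earlier one).

```python
# A made-up server response, for illustration only.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Set-Cookie: session=abc123\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

# A blank line separates the headers from the contents.
head, _, body = raw_response.partition("\r\n\r\n")
status_line, *header_lines = head.split("\r\n")

# The first line carries the response code.
version, code, reason = status_line.split(" ", 2)

# The remaining lines are "Name: value" headers.
headers = {}
for line in header_lines:
    name, _, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

print(code, reason)              # 200 OK
print(headers["content-type"])   # text/html
print(headers["set-cookie"])     # session=abc123
print(body)
```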

# Using REST APIs

REST stands for REpresentational State Transfer. It originally had a more abstract meaning, but has come to be a shorthand name for web sites that act a bit like Python functions, taking as inputs values for certain parameters and producing outputs in the form of a long text string.

API stands for Application Programming Interface. An API specifies how an external program (an application program) can request that a program perform certain computations.

Putting the two together, a REST API specifies how external programs can make HTTP requests to a web site in order to request that some computation be carried out and data returned as output. When a website is designed to accept requests generated by other computer programs, and produce outputs to be consumed by other programs, it is sometimes called a web service, as opposed to a web site which produces output meant for humans to consume in a web browser.

Of course, a web browser is just a computer program, so all requests to web sites come from other computer programs. But usually a browser requests data from a web site in order to display it directly to a human user of the browser.

Prior to the development of REST APIs, there were other ways that computer programs made remote requests to other computers on a network, asking them to perform some computation or retrieve some data. Those other techniques are still in use. REST APIs are particularly convenient, however, both for students and for others, because it is easy to see what is going on in a request and a response between two computers, and thus it is easier to debug.

We will examine a common pattern used in REST APIs, where there is a base URL that defines an “endpoint”, and then additional information is appended to the URL as query parameters, and the response comes back not as HTML but as a format called JSON. Along the way, we will see some functions in Python modules that are helpful for constructing the URLs and for processing the JSON formatted response.
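As a small preview of the JSON side, here is a sketch using Python's standard `json` module. The response text and its keys (`resultCount`, `results`, `name`) are made up for illustration and are not taken from any particular API.

```python
import json

# A made-up JSON-formatted response string, for illustration only.
response_text = '{"resultCount": 2, "results": [{"name": "First"}, {"name": "Second"}]}'

# json.loads turns the text into ordinary Python dictionaries and lists.
data = json.loads(response_text)

print(type(data))            # <class 'dict'>
print(data["resultCount"])   # 2

names = [item["name"] for item in data["results"]]
print(names)                 # ['First', 'Second']
```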

# URL Structure for REST APIs

In a REST API, the client or application program (the kind of program you will be writing) makes an HTTP request that includes information about what kind of request it is making. Web sites are free to define whatever format they want for how the request should be formatted. This chapter covers a particularly common and particularly simple format, where the request information is encoded right in the URL. This is convenient, because if something goes wrong, we can debug by copying the URL into a browser and seeing what happens when it tries to visit that URL.

In this format, the URL has a standard structure:

<ul>
<li>the base URL</li>
<li>a ? character</li>
<li>one or more key-value pairs, formatted as key=value pairs and separated by the & character.</li>
</ul>


For example, consider the URL https://itunes.apple.com/search?term=Ann+Arbor&entity=podcast. Try copying that URL into a browser, or just clicking on it. It retrieves data about podcasts posted from Ann Arbor, MI. Depending on your browser, it may put the contents into a file attachment that you have to open up to see the contents, or it may just show the contents in a browser window.

Let’s pull apart that URL.

<ul>
<li>the base URL: https://itunes.apple.com/search</li>
<li>a ? character</li>
<li>key=value pairs. In this case, there are two pairs. The keys are term and entity. An & separates the two pairs.</li>
</ul>

<img src="urlstructure.png" />
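The same URL can be assembled from its parts in Python. This is a minimal sketch using `urlencode` from the standard `urllib.parse` module. The variable names are just for illustration.

```python
from urllib.parse import urlencode

base_url = "https://itunes.apple.com/search"

# The key-value pairs, written as an ordinary Python dictionary.
params = {"term": "Ann Arbor", "entity": "podcast"}

# urlencode joins the pairs with & and writes each one as key=value.
full_url = base_url + "?" + urlencode(params)

print(full_url)   # https://itunes.apple.com/search?term=Ann+Arbor&entity=podcast
```

Building the URL from a dictionary makes it easy to change one parameter without retyping the whole string.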

# Encoding URL Parameters

Here’s another URL that has a similar format: https://www.google.com/search?q=%22violins+and+guitars%22&tbm=isch. It’s a search on Google for images that match the string “violins and guitars”. It’s not actually based on a REST API, because the contents that come back are meant to be displayed in a browser. But the URL has the same structure we have been exploring above and introduces the idea of “encoding” URL parameters.

<ul>
<li>The base URL is https://www.google.com/search</li>
<li>?</li>
<li>Two key=value parameters, separated by &
<ul>
<li>q=%22violins+and+guitars%22 says that the query to search for is “violins and guitars”.</li>
<li>tbm=isch says to go to the tab for image search.</li>
</ul>
</li>
</ul>

<p>Now why is "violins and guitars" represented in the URL as %22violins+and+guitars%22? The answer is that some characters are not safe to include, as is, in URLs. For example, a URL path is not allowed to include the double-quote character. It also can’t include a : or / or a space. Whenever we want to include one of those characters in a URL, we have to encode them with other characters. A space is encoded as +. " is encoded as %22. : would be encoded as %3A. And so on.</p>
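You rarely need to do this encoding by hand. As a minimal sketch, the standard `urllib.parse` module provides `quote_plus` for encoding a single value, `unquote_plus` for reversing it, and `urlencode` for encoding a whole dictionary of parameters.

```python
from urllib.parse import quote_plus, unquote_plus, urlencode

# Encode one value: the quotes become %22 and the spaces become +.
encoded = quote_plus('"violins and guitars"')
print(encoded)                 # %22violins+and+guitars%22

# Other unsafe characters get their own codes.
print(quote_plus(":"))         # %3A
print(quote_plus("/"))         # %2F

# Decoding reverses the process.
print(unquote_plus(encoded))   # "violins and guitars"

# urlencode applies the same encoding to every value in a dictionary.
query = urlencode({"q": '"violins and guitars"', "tbm": "isch"})
print("https://www.google.com/search?" + query)
```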