## The Web

The Internet (often loosely called the Web) is a massive distributed client/server information system, as depicted in the following diagram:

![image.png](attachment:image.png)

Many applications run concurrently over the Web, such as web browsing, e-mail, file transfer, and audio and video streaming. For the client and the server to communicate properly, each application must agree on a specific application-level protocol such as HTTP, FTP, SMTP, or POP.

## HTTP over TCP/IP

HTTP is an application-level protocol. It typically runs over a TCP/IP connection, as illustrated below. (HTTP need not run on TCP/IP. It only presumes a reliable transport, so any transport protocol that provides such a guarantee can be used.)

![image.png](attachment:image.png)

**Figure 3** - HTTP over TCP/IP

In **Figure 3**, the client is on the left, the routers are in the middle, and the server is on the right.

In its idle state, an HTTP server does nothing but listen on the IP address(es) and port(s) specified in its configuration for incoming requests. When a request arrives, the server analyzes the message header, applies the rules specified in the configuration, and takes one of the following actions:

    (1) The server interprets the request received, maps the request into a file under the server's document 
    directory, and returns the file requested to the client.
    (2) The server interprets the request received, maps the request into a program kept in the server, executes 
    the program, and returns the output of the program to the client.
    (3) If the request cannot be satisfied, the server returns an error message.
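The three actions above can be sketched as a small dispatch function. This is only an illustration: the `documents` and `programs` dictionaries stand in for the server's document directory and its installed programs, and both names are assumptions made for this example.

```python
from typing import Callable, Dict, Tuple


def handle_request(
    path: str,
    documents: Dict[str, bytes],
    programs: Dict[str, Callable[[], bytes]],
) -> Tuple[int, bytes]:
    """Return (status code, body) for a request path."""
    if path in documents:
        # (1) The path maps to a file under the document directory.
        return 200, documents[path]
    if path in programs:
        # (2) The path maps to a program; run it and return its output.
        return 200, programs[path]()
    # (3) The request cannot be satisfied.
    return 404, b"Not Found"
```

A real server would also apply its configuration rules (access control, redirects, and so on) before choosing one of these branches.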

TCP/IP (Transmission Control Protocol/Internet Protocol) is a set of transport and network-layer protocols for machines to communicate with each other over the network.

IP (Internet Protocol) is a network-layer protocol that deals with network addressing and routing. In an IP network, each machine is assigned a unique IP address (e.g., 165.1.2.3), and the IP software is responsible for routing a message from the source IP to the destination IP. In IPv4 (IP version 4), the IP address consists of 4 bytes, each ranging from 0 to 255, separated by dots; this is called dotted-quad form. This numbering scheme supports up to 2^32 (about 4.3 billion) addresses on the network. The newer IPv6 (IP version 6) uses 128-bit addresses and so supports far more. Since memorizing numbers is difficult for most people, an English-like domain name such as www.example.com is used instead. The DNS (Domain Name System) translates the domain name into the IP address (via distributed lookup tables). The special IP address 127.0.0.1 always refers to your own machine. Its domain name is "localhost", and it can be used for local loopback testing.
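To make the dotted-quad form concrete, the sketch below converts an IPv4 address into the single 32-bit number it represents. The helper name `dotted_quad_to_int` is made up for this example; the standard `ipaddress` module is used only to cross-check the result.

```python
import ipaddress


def dotted_quad_to_int(address: str) -> int:
    """Convert a dotted-quad IPv4 address into its 32-bit integer value."""
    parts = [int(part) for part in address.split(".")]
    if len(parts) != 4 or any(not 0 <= part <= 255 for part in parts):
        raise ValueError(f"not a dotted-quad IPv4 address: {address!r}")
    value = 0
    for part in parts:
        # Each of the 4 bytes contributes 8 bits to the address.
        value = (value << 8) | part
    return value


# Four bytes give 2**32 possible addresses (about 4.3 billion).
IPV4_ADDRESS_COUNT = 2 ** 32

if __name__ == "__main__":
    print(dotted_quad_to_int("165.1.2.3"))
    print(ipaddress.ip_address("127.0.0.1").is_loopback)
```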

TCP (Transmission Control Protocol) is a transport-layer protocol, responsible for establishing a connection between two machines. The transport layer has two main protocols: TCP and UDP (User Datagram Protocol). TCP is reliable: each packet has a sequence number, and an acknowledgement is expected. A packet is retransmitted if the receiver does not acknowledge it. **Packet delivery is guaranteed in TCP**. UDP does not guarantee packet delivery, and is therefore not reliable. However, UDP has less network overhead and can be used for applications such as video and audio streaming, where reliability is not critical.

TCP multiplexes applications within an IP machine. For each IP machine, TCP supports up to 65536 ports, numbered 0 to 65535. An application, such as HTTP or FTP, listens at a particular port number for incoming requests. Ports 0 to 1023 are pre-assigned to popular protocols, e.g., HTTP at 80, FTP at 21, Telnet at 23, SMTP at 25, DNS at 53, and NNTP at 119. Ports 1024 and above are available to users.

Although TCP port 80 is pre-assigned to HTTP as the default HTTP port number, this does not prohibit you from running an HTTP server at another user-assigned port number (1024-65535) such as 8000 or 8080, especially for a test server. You could also run multiple HTTP servers on the same machine on different port numbers. When a client issues a URL without explicitly stating the port number, e.g., http://www.example.com/index.html, the browser connects to the default port number 80 of the host www.example.com. You need to specify the port number explicitly in the URL, e.g. http://www.example.com:8000/docs/index.html, if the server is listening at port 8000 and not the default port 80.

In brief, to communicate over TCP/IP, you need to know (a) the IP address or hostname and (b) the port number.
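As a minimal sketch of that idea, the example below starts a throwaway TCP server on the loopback address and connects to it using nothing but an IP address and a port number. The function name `echo_once` is invented for illustration, and port 0 is used so that the operating system picks any free port; a real service would listen on a fixed, known port.

```python
import socket
import threading


def echo_once(message: bytes) -> bytes:
    """Send a short message to a local TCP echo server and return the reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server.listen(1)
    host, port = server.getsockname()

    def serve() -> None:
        # Accept one connection and send back whatever was received.
        conn, _addr = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()

    # The client only needs (a) the IP address and (b) the port number.
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message)
        reply = client.recv(1024)

    thread.join(timeout=5)
    server.close()
    return reply
```

The sketch assumes the message is short enough to arrive in a single `recv` call, which is fine for a demonstration but not for real protocols.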

## HyperText Transfer Protocol (HTTP)

HTTP is a TCP/IP-based communication protocol that is used to deliver data (HTML files, image files, query results, etc.) on the Internet. It is the set of conventions that dictate how a client talks to a web server. HTTP is an asymmetric request-response client-server protocol, as illustrated below. An HTTP client sends a request message to an HTTP server. The server, in turn, returns a response message.

![image.png](attachment:image.png)

The default port is TCP 80, but other ports can be used as well. HTTP provides a standardized way for computers to communicate with each other. The HTTP specification defines how client requests are constructed and sent to the server, and how servers respond to these requests.

![image.png](attachment:image.png)

## Basic Features of HTTP

There are four basic features that make HTTP a simple but powerful protocol:  

### HTTP/1.0 is connectionless, HTTP/1.1 uses persistent connections

The HTTP client (for example, a browser) initiates an HTTP request and then waits for the response. The server processes the request and sends a response back, after which **the client closes the connection** [a] [b]. The client and server therefore know about each other only during the current request and response. Further requests are made on a new connection, as if the client and server were new to each other.

[a] HTTP/1.0 uses a new connection for each request/response exchange
[b] whereas in HTTP/1.1 the same connection may be used for one or more request/response exchanges.

### HTTP is media independent

This means that any type of data can be sent over HTTP as long as both the client and the server know how to handle the content. Both the client and the server are required to specify **the content type** using the appropriate **MIME type**. HTTP thus permits negotiation of data type and representation, which allows systems to be built independently of the data being transferred.
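As a small illustration of MIME types, the sketch below guesses the content type a server might announce for a file, using Python's standard `mimetypes` module. The helper name `content_type_for` and the fallback to `application/octet-stream` for unknown files are choices made for this example.

```python
import mimetypes


def content_type_for(filename: str) -> str:
    """Guess the MIME type to announce for a file, based on its extension."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    for name in ("index.html", "photo.png", "README"):
        print(name, "->", content_type_for(name))
```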

### HTTP is stateless but not sessionless

HTTP is **stateless**: there is no link between two requests carried out successively on the same connection. This can be problematic for users who need to interact with certain pages coherently, for example when using an e-commerce shopping basket. But while the core of **HTTP itself is stateless**, **HTTP cookies allow the use of stateful sessions**. Using header extensibility, HTTP cookies are added to the workflow, which allows a session to be created so that successive HTTP requests share the same context, or the same state.
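The following sketch shows one way a server could build a session on top of stateless requests by using a cookie. The cookie name `session_id`, the in-memory `SESSIONS` table, and the `handle` function are all assumptions made for illustration; real frameworks add expiry, signing, and persistent storage.

```python
import uuid
from http.cookies import SimpleCookie
from typing import Dict, Tuple

# Hypothetical in-memory session store: session id -> session data.
SESSIONS: Dict[str, Dict[str, int]] = {}


def handle(request_headers: Dict[str, str]) -> Tuple[Dict[str, str], int]:
    """Return (response headers, visit count) for one request."""
    cookie = SimpleCookie(request_headers.get("Cookie", ""))
    response_headers: Dict[str, str] = {}

    if "session_id" in cookie and cookie["session_id"].value in SESSIONS:
        # The client presented a known cookie: reuse its session.
        session_id = cookie["session_id"].value
    else:
        # First contact: create a session and ask the client to remember it.
        session_id = uuid.uuid4().hex
        SESSIONS[session_id] = {"visits": 0}
        response_headers["Set-Cookie"] = f"session_id={session_id}"

    SESSIONS[session_id]["visits"] += 1
    return response_headers, SESSIONS[session_id]["visits"]
```

Each request is still handled independently; the shared state lives entirely in the server-side table that the cookie value points to.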

### HTTP is extensible

Introduced in HTTP/1.0, HTTP headers make this protocol easy to extend and experiment with. New functionality can even be introduced by a simple agreement between a client and a server about a new header's semantics (e.g. the Cookie header).

## Basic Architecture where HTTP sits between client and server

![image.png](attachment:image.png)

**Figure 1** - Basic HTTP Architecture

HTTP is a request/response protocol based on a client/server architecture in which web browsers, robots, search engines, etc. act as HTTP clients, and the web server acts as the server.

The client makes a request and the server responds.

The **HTTP protocol** is also **a stateless protocol**, meaning that the server isn’t required to store **session information**, and **each request is independent of the others**. This means:

    All requests originate at the client (your browser).
    The server responds to a request.
    The requests (commands) and responses are in readable text.
    The requests are independent of each other and the server doesn’t need to track the requests.

### Web Client

### Server

As **Figure 1** shows, a request is made by the client and a response is provided by the server. Let's dig into the request/response structure of the HTTP protocol.

![image.png](attachment:image.png)

**Figure 2** - Request/Response flow in HTTP

### Request and Response Structure in HTTP

Request and response messages share the same structure, shown below:

![image.png](attachment:image.png)

A request consists of:

**A command or request + optional headers + optional body content.**

A response consists of:

**A status code + optional headers + optional body content.**

A CRLF (carriage return and line feed) combination is used to delimit the parts, and a single blank line (a CRLF on its own) indicates the end of the headers.

If the request or response contains a message body then this is indicated in the header. In other words,
if the request/response has a body, it should also have a header.
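The framing rules above can be sketched with two small helpers: one joins a start line, headers, and an optional body with CRLF, and the other splits a raw message at the blank line. The helper names are invented for this example, and using a `Content-Length` header to announce the body is one common convention, assumed here for illustration.

```python
from typing import Dict, Tuple

CRLF = "\r\n"


def build_message(start_line: str, headers: Dict[str, str], body: bytes = b"") -> bytes:
    """Assemble start line + headers + blank line + optional body."""
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    if body:
        # A body is announced in the headers, here via Content-Length.
        header_lines.append(f"Content-Length: {len(body)}")
    head = CRLF.join([start_line, *header_lines]) + CRLF + CRLF
    return head.encode("ascii") + body


def split_message(raw: bytes) -> Tuple[str, bytes]:
    """Split a raw message into its head (start line + headers) and body."""
    head, _separator, body = raw.partition(b"\r\n\r\n")
    return head.decode("ascii"), body
```

For example, `build_message("GET /index.html HTTP/1.1", {"Host": "www.example.com"})` produces a request with one header and no body.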

### HTTP Requests

We saw the general request/response format earlier. Now we will cover the request message in more detail.

The start line is mandatory and is structured as follows:

**Method/Command + Resource Path + protocol version**

For example, if we try to access index.html on www.example.com, then the request would be:

**GET /index.html HTTP/1.1**  where:

    GET is the method
    /index.html is the relative path to the resource. A relative path doesn’t include the domain name
    HTTP/1.1 is the protocol we are using
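The three parts of the start line can be pulled apart with a few lines of code. This is a minimal sketch; the `RequestLine` type and the `parse_request_line` function are names chosen for illustration, and the validation is deliberately simple.

```python
from typing import NamedTuple


class RequestLine(NamedTuple):
    method: str
    path: str
    version: str


def parse_request_line(line: str) -> RequestLine:
    """Split 'METHOD PATH VERSION' into its three parts."""
    parts = line.strip().split(" ")
    if len(parts) != 3:
        raise ValueError(f"malformed request line: {line!r}")
    method, path, version = parts
    if not version.startswith("HTTP/"):
        raise ValueError(f"unexpected protocol version: {version!r}")
    return RequestLine(method, path, version)


if __name__ == "__main__":
    print(parse_request_line("GET /index.html HTTP/1.1"))
```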

The content of the request is shown below:

![image.png](attachment:image.png)

Referring to the URL: http://www.example.com:80/index.html, the above request is made by the browser.

When this request message reaches the server, the server can take either one of these actions:

    (1) The server interprets the request received, maps the request into a file under the server's document 
    directory, and returns the file requested to the client.
    (2) The server interprets the request received, maps the request into a program kept in the server, executes 
    the program, and returns the output of the program to the client.
    (3) If the request cannot be satisfied, the server returns an error message.


## URL is used to form the HTTP Request

Note that the browser uses the **URL** (http://www.example.com/index.html) that we enter to create the relative URI of the resource (/index.html). A **URL** (Uniform Resource Locator) is used for web pages. It is an example of a **URI** (Uniform Resource Identifier).

Most people are familiar with entering a URL into a web browser. It usually looks like this:

![image.png](attachment:image.png)

The URL can also include the port, which is normally hidden by the browser, but you can include it manually as shown below:

![image.png](attachment:image.png)

This tells the web browser the address of the resource to locate and the protocol to use to retrieve that resource (http).

HTTP is the transfer protocol that transfers the resource (web page, image, video, etc.) from the server to the client.

In summary, a URL has the following syntax:

**protocol://hostname:port/relative-path/file-name**

There are 5 parts in a URL:

    1. Protocol: The application-level protocol used by the client and server, e.g., HTTP, FTP, and telnet.
    2. Hostname: The DNS domain name (e.g., www.example.com) or IP address (e.g., 93.184.216.34) of the server.
    3. Port: The TCP port number on which the server is listening for incoming requests from the clients 
    (e.g. 80 for HTTP)
    4. Relative-Path: The location of the requested resource under the server's base directory
    5. File-name: The name of the requested resource
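The five parts can be extracted with Python's standard `urllib.parse` module, as in the sketch below. The `url_parts` function and the `DEFAULT_PORTS` table are illustrative names; the table lists only the default ports mentioned earlier in this document.

```python
import posixpath
from typing import Dict, Optional, Union
from urllib.parse import urlsplit

# Default ports for the protocols mentioned in this document.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23}


def url_parts(url: str) -> Dict[str, Optional[Union[str, int]]]:
    """Break a URL into protocol, hostname, port, relative path and file name."""
    parts = urlsplit(url)
    # When the URL has no explicit port, fall back to the protocol's default.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    directory, file_name = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": port,
        "relative_path": directory,
        "file_name": file_name,
    }


if __name__ == "__main__":
    print(url_parts("http://www.example.com:8000/docs/index.html"))
    print(url_parts("http://www.example.com/index.html"))
```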


### HTTP Responses and Response Codes

Each request has a response. The Response consists of:

    Status code and description
    One or more optional headers
    Optional message body, which can span many lines and include binary data

The response to **GET /index.html HTTP/1.1** is shown below:

![image.png](attachment:image.png)

Response status codes are split into five groups. Each group has a meaning and a three-digit code.

**1xx – Informational**. Request received, server is continuing the process. (e.g. 100 Continue)

**2xx – Successful**. The request was successfully received, understood, accepted and serviced. (e.g. 200 OK)

**3xx – Redirection**. Further action must be taken in order to complete the request. (e.g. 301 Moved Permanently)

**4xx – Client Error**. The request contains bad syntax or cannot be understood. (e.g. 400 Bad Request)

**5xx – Server Error**. The server failed to fulfill an apparently valid request. (e.g. 500 Internal Server Error)
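Since the group is determined by the first digit, classifying a status code takes only a few lines. The sketch below is illustrative; `status_class` and `STATUS_CLASSES` are names chosen for this example.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Return the name of the group a three-digit status code belongs to."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # The first digit of the code selects the group.
    return STATUS_CLASSES[code // 100]


if __name__ == "__main__":
    for code in (100, 200, 301, 400, 500):
        print(code, status_class(code))
```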


You can find a complete list and their meaning in [3].

## Request Types/Methods in HTTP

The HTTP protocol now supports 8 request types, also called methods or verbs in the documentation. They are:

**GET** – Requesting resource from server

**HEAD** – Like GET, but returns only the headers and not the content. A client can use a HEAD request to get the response headers that a GET request would have obtained. Since the headers contain the last-modified date of the data, this can be used to check against the local cache copy.
    
**POST** – Submitting a resource to a server (e.g. file uploads)

**PUT** – Like POST, but replaces an existing resource

**DELETE** – Delete a resource from a server

**TRACE** – Ask the server to return a diagnostic trace of the actions it takes

**OPTIONS** – Find out which request types a server supports (for example, by issuing an OPTIONS request with curl)

**PATCH** – Apply modifications to a resource

**Other request types** can be added
    


Note that GET, HEAD, OPTIONS and TRACE are safe methods; that is, they do not cause a state change (e.g. file deletion) on the server.
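The difference between GET and HEAD described above can be observed with a throwaway local server. This is a sketch under stated assumptions: `DemoHandler`, `PAGE`, and `fetch` are names invented for the example, and the server runs on the loopback address with an OS-chosen port rather than on a real site.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    def _send_headers(self) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self) -> None:
        self._send_headers()
        self.wfile.write(PAGE)  # GET returns headers and content

    def do_HEAD(self) -> None:
        self._send_headers()    # HEAD returns the same headers, no content

    def log_message(self, format, *args) -> None:
        pass  # keep the demo quiet


def fetch(method: str):
    """Issue one request to a local demo server; return (status, length header, body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
        conn.request(method, "/index.html")
        response = conn.getresponse()
        result = (response.status, response.getheader("Content-Length"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch("GET")` and `fetch("HEAD")` should yield the same status and `Content-Length` header, with an empty body in the HEAD case.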

## HTTP Headers in HTTP requests/responses

HTTP headers are used to convey additional information between the client and the server.

Although they are optional, they make up most of an HTTP request and are almost always present.

When you request a web page using a web browser, the headers are inserted automatically by the browser, and you don’t see them.

Similarly, the response headers are inserted by the web server and are not seen by the user.

## Request and Response Header Structure

Request and response headers share a common structure:

![image.png](attachment:image.png)

The **Connection: keep-alive** header appears in the **GET /index.html HTTP/1.1** request:

![image.png](attachment:image.png)

In the HTTP request above, a header can have multiple values, in which case they are separated by commas. Examples are the **Accept-Language** and **Accept** headers.

**NOTE:** Field names are case insensitive, but field values should be treated as case sensitive.

**NOTE:** Only the necessary headers are sent. For all other headers, the web server and the client assume the default values.

For example, the **Connection** header is not normally sent, because the default behaviour is **keep-alive** and this is assumed by the server.
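The two header rules above (case-insensitive field names, comma-separated values) can be sketched as follows. The function names are invented for this example. Note that comma splitting is applied only on request, because some header values, such as dates, contain commas that are not separators.

```python
from typing import Dict, List


def parse_headers(lines: List[str]) -> Dict[str, str]:
    """Parse 'Name: value' lines into a dict keyed by lower-cased field name."""
    headers: Dict[str, str] = {}
    for line in lines:
        name, _colon, value = line.partition(":")
        # Field names are case insensitive, so normalise them to lower case.
        headers[name.strip().lower()] = value.strip()
    return headers


def list_values(headers: Dict[str, str], name: str) -> List[str]:
    """Return the comma-separated values of a multi-valued header."""
    raw = headers.get(name.lower(), "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    parsed = parse_headers(["HOST: www.example.com", "Accept-Language: en-US,en;q=0.5"])
    print(parsed["host"])
    print(list_values(parsed, "ACCEPT-LANGUAGE"))
```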

## Common Request Header #1 : Connection Header

The original HTTP 1.0 protocol used non persistent connections. This meant that the client:

    Made a request
    Got a Response
    Closed the Connection


Because it takes time and resources to establish the connection in the first place, it makes little sense to drop it so quickly.

Therefore, in HTTP 1.0 the client can tell the server that it wants to keep the connection open by using the **Connection: keep-alive** header.

In HTTP 1.1 the default behaviour was changed, and persistent connections became the default mode.

Now the client can tell the server that it will close the connection by using the header **Connection: close**.
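The rules for the two protocol versions can be condensed into one small function. This is a sketch; `is_persistent` is a name chosen for illustration, and it assumes the headers dictionary uses the key `Connection` exactly as written.

```python
from typing import Dict


def is_persistent(version: str, headers: Dict[str, str]) -> bool:
    """Decide whether the connection stays open after the response."""
    connection = headers.get("Connection", "").strip().lower()
    if version == "HTTP/1.1":
        # Persistent by default; closed only when explicitly requested.
        return connection != "close"
    # HTTP/1.0: non-persistent by default; kept open only on request.
    return connection == "keep-alive"


if __name__ == "__main__":
    print(is_persistent("HTTP/1.0", {}))
    print(is_persistent("HTTP/1.1", {}))
```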

## Common Request Header #2 : Host Header

Many websites, including example.com, use **shared hosting**. With shared hosting, each web server is configured as a **virtual host**, and all virtual hosts (i.e. all web servers on the shared host) are assigned to a single IP address.

The **host header** tells the web server which virtual host the request is for, e.g.

Host: www.example.com\r\n


Note that in our **GET /index.html HTTP/1.1** example, the shared host has the IP address 93.184.216.34. One of the web servers (i.e. example.com) is configured as a virtual host and shares the IP address with other web servers. Therefore, the GET request contains the host header (i.e. Host:www.example.com\r\n):

![image.png](attachment:image.png)
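Virtual hosting as described above can be sketched as a lookup from the Host header to a site's document root. The `VIRTUAL_HOSTS` table, its directory paths, and the second host name are hypothetical values made up for this example.

```python
from typing import Dict, Optional

# Hypothetical virtual hosts sharing one IP address.
VIRTUAL_HOSTS = {
    "www.example.com": "/var/www/example",
    "www.example.org": "/var/www/other",
}


def document_root(headers: Dict[str, str], vhosts: Dict[str, str]) -> Optional[str]:
    """Pick the document root of the virtual host named in the Host header."""
    host = headers.get("Host", "")
    # Drop an optional ':port' suffix; host names are case insensitive.
    hostname = host.split(":")[0].strip().lower()
    return vhosts.get(hostname)


if __name__ == "__main__":
    print(document_root({"Host": "www.example.com"}, VIRTUAL_HOSTS))
```

Without the Host header, the server would see only the shared IP address and could not tell which site the request was meant for.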

## Common Request Header #3 : User-Agent Header

This gives information about the client making the request as shown below:

User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0\r\n

## Common Request Header #4 : Accept Request Header

The **Accept request header** is used for content negotiation. It is sent by the client and tells the server what formats the client can understand.

The Accept header tells the server which media types the client prefers, e.g. text, audio, etc.

For normal web pages, common values are text/plain and text/html.

    Accept:text/plain,text/html

For JSON-encoded data, the header is:

    Accept:application/json
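A server could use the Accept header to pick a representation roughly as sketched below. This simplified version takes the first listed media type that the server can produce and ignores any parameters after a semicolon; `choose_media_type` is a name invented for the example, and real content negotiation is more elaborate.

```python
from typing import List, Optional


def choose_media_type(accept_header: str, available: List[str]) -> Optional[str]:
    """Return the first media type in the Accept header that the server offers."""
    accepted = [
        item.split(";")[0].strip().lower()  # drop parameters after ';'
        for item in accept_header.split(",")
        if item.strip()
    ]
    for media_type in accepted:
        if media_type in available:
            return media_type
        if media_type == "*/*" and available:
            # The client accepts anything: use the server's first choice.
            return available[0]
    return None


if __name__ == "__main__":
    print(choose_media_type("text/plain,text/html", ["text/html", "application/json"]))
```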

## HTTP 1.1 vs HTTP 1.0

![image.png](attachment:image.png)

Multiple host name support introduced **the host header** (i.e. Host:www.example.com\r\n )

Persistent connections ensure that the three-way TCP handshake occurs only once between the client and the server.
Persistent connections introduced the **Connection: keep-alive/close** header. As a result, HTTP 1.1 is faster, because it does not close the connection after each message but reuses the established connection.

A web server can sustain only a limited number of persistent connections. If a TCP connection is not used and stays idle, the **idle timeout** for the TCP connection kicks in. After the **idle timeout**, the current connection is closed, and a new TCP connection needs to be established for upcoming requests.

## HTTP Caching

![image.png](attachment:image.png)

To explain how web caching works, let's repeat the second HTTP request message, namely **GET /index.html HTTP/1.1**, by pressing Ctrl+R (Reload) in the browser. The browser already has a cached version of index.html, and the reload forces it to request the same content from the server again:

![image.png](attachment:image.png)

Notice that the request contains the **If-Modified-Since** header with the value "Thu, 17 Oct 2019 07:18:26 GMT".

**TODO**: More to be added to explain If-None-Match, Cache-Control headers.

And the response to the GET message is 304 without any body:

![image.png](attachment:image.png)
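The server-side decision behind this exchange can be sketched as a date comparison. This is an illustrative simplification: `conditional_status` is a made-up name, and it considers only the If-Modified-Since header, comparing it with the resource's last-modified date.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(if_modified_since: Optional[str], last_modified: str) -> int:
    """Return 304 if the client's cached copy is still fresh, else 200."""
    if if_modified_since is None:
        # Unconditional request: always send the full content.
        return 200
    since = parsedate_to_datetime(if_modified_since)
    modified = parsedate_to_datetime(last_modified)
    # Not modified after the client's date: reply 304 without a body.
    return 304 if modified <= since else 200


if __name__ == "__main__":
    stamp = "Thu, 17 Oct 2019 07:18:26 GMT"
    print(conditional_status(stamp, stamp))
```

When the function returns 304, the browser keeps using its cached copy of index.html, which is why the response carries no body.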

**TODO**: More to be added to explain ETag, Cache-Control, Expires HTTP response headers

## PIPELINING CONCEPT UTILIZING COOKIES

![image.png](attachment:image.png)

## REFERENCES:

[1] https://www.tutorialspoint.com/http/http_overview.htm

[2] http://www.steves-internet-guide.com/http-basics/

[3] https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

[4] http://www.steves-internet-guide.com/http-headers/

[5] HTTP [Hypetext Transport Protocol] tutorial in depth | HTTP Protocol Tutorial
    https://www.youtube.com/watch?v=JFZMyhRTVt0

[6] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since

[7] CS50 2017 - Lecture 6 - HTTP
  https://www.youtube.com/watch?v=PUPDGbnpSjw

[8] Main Reference on HTTP: https://www.ntu.edu.sg/home/ehchua/programming/webprogramming/HTTP_Basics.html