# <center>Big Data &ndash; Exercises &ndash; Solution</center>
## <center>Fall 2025 &ndash; Week 2 &ndash; ETH Zurich</center>

## Exercise 1: URIs

A URI consists of up to five components. Identify all components of the following URI:

`https://en.wikipedia.org/wiki/ETH_Zurich?view=wide#history`

- authority: `_`
- fragment: `_`
- path: `_`
- scheme: `_`
- query: `_`

### Solution

The generic URI syntax is as follows:

```
scheme ":" ["//" authority] path ["?" query] ["#" fragment]
```

With that, the components of the given URI are as follows:

- authority: `en.wikipedia.org`
- fragment: `history`
- path: `/wiki/ETH_Zurich`
- scheme: `https`
- query: `view=wide`
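
As a cross-check, the same decomposition can be reproduced with Python's standard library. This is only a small sketch: `urlsplit` names the authority component `netloc`, and the dictionary keys below are our own labels, chosen to mirror the list above.

```python
from urllib.parse import urlsplit

uri = "https://en.wikipedia.org/wiki/ETH_Zurich?view=wide#history"
parts = urlsplit(uri)

# Map the standard-library attribute names onto the five component names.
components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,  # urllib calls the authority "netloc"
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name}: {value}")
```

Note that the delimiters (`:`, `//`, `?`, `#`) are not part of the components themselves.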

## Exercise 2: REST APIs

Below is a sequence of HTTP (REST API) requests with their responses. 

Complete the missing fields (i.e., the `"_____"` and `_____` placeholders) by selecting the most appropriate answer.

Notes:

- The "method" field is a string. The available options are:
    `GET` &nbsp;&nbsp; `DELETE` &nbsp;&nbsp; `POST` &nbsp;&nbsp; `PUT` &nbsp;&nbsp; `FETCH` &nbsp;&nbsp; `SET` &nbsp;&nbsp; `UPDATE`

- The "status" field is an integer. The available options are:
    `404` &nbsp;&nbsp; `301` &nbsp;&nbsp; `204` &nbsp;&nbsp; `402` &nbsp;&nbsp; `500` &nbsp;&nbsp; `200` &nbsp;&nbsp; `201` &nbsp;&nbsp; `413`

Hints:
1. The JSON syntax we use here is a simplified version of the actual protocol. The fields are self-explanatory as they use the same terminology as in the course.
2. No external effects (requests) are to be considered.
3. The server respects the REST protocol.
4. Requests and responses happen exactly in the order they are listed.

```
{
  "traces":
  [
    {
      "request": {
        "method": "_____",
        "url": "https://api.school.com/students",
        "body": None
      },
      "response": {
        "status": _____,
        "body": [
          {
            "id": 1,
            "name": "John Doe",
            "age": 20,
            "major": "Computer Science"
          },
          {
            "id": 2,
            "name": "Jane Smith",
            "age": 22,
            "major": "Mathematics"
          }
        ]
      }
    },
    {
      "request": {
        "method": "_____",
        "url": "https://api.school.com/students",
        "body": {
          "name": "Alice Johnson",
          "age": 19,
          "major": "Physics"
        }
      },
      "response": {
        "status": _____,
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Physics"
        }
      }
    },
    {
      "request": {
        "method": "_____",
        "url": "https://api.school.com/students/3",
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Biology"
        }
      },
      "response": {
        "status": _____,
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Biology"
        }
      }
    },
    {
      "request": {
        "method": "_____",
        "url": "https://api.school.com/students/3",
        "body": None
      },
      "response": {
        "status": _____,
        "body": None
      }
    },
    {
      "request": {
        "method": "_____",
        "url": "http://api.school.com/students/3",
        "body": None
      },
      "response": {
        "status": _____,
        "body": {
          "error": "Student not found"
        }
      }
    }
  ]
}
```

### Solution

- First, we use the `GET` method to retrieve all existing students. The server responds with a status code `200`, indicating a successful operation. In the response body, we can see two students: John Doe and Jane Smith, along with their details.

- Next, a new student named Alice Johnson is created. In this case, we use the `POST` method, which is suitable when the server is responsible for generating the URI of the new resource. Notice that the URI specified in the `POST` request is `https://api.school.com/students`, without a specific ID. The server chooses the ID `3` for the newly created resource, as indicated in the response. The status code for this operation is `201 Created`, which confirms the successful creation of the resource. The full URI of the new resource will be `https://api.school.com/students/3`. If we had known the URI of the resource beforehand, we would have used the `PUT` method to create its representation, and the server would still return a `201 Created` status (see: [PUT](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods/PUT)).

- Once the new resource is created, we proceed to update it. Here, the `PUT` method is used to change Alice's major from "Physics" to "Biology". Since the resource already exists, the response has a status code of `200 OK`, and the updated representation of the student is returned in the response body. A `200` status code is appropriate when the update operation includes a response body, while `204` could be used if no body is returned.

- In the next step, the student with ID `3` (Alice Johnson) is deleted. The `DELETE` method is used, and the response has a status code of `204 No Content`, indicating the deletion was successful, with no further content returned.

- Finally, we try to retrieve the student with ID `3` again using the `GET` method. Since the student was deleted, the server responds with a `404 Not Found` status, indicating that the resource no longer exists.

```
{
  "traces":
  [
    {
      "request": {
        "method": "GET",
        "url": "https://api.school.com/students",
        "body": None
      },
      "response": {
        "status": 200,
        "body": [
          {
            "id": 1,
            "name": "John Doe",
            "age": 20,
            "major": "Computer Science"
          },
          {
            "id": 2,
            "name": "Jane Smith",
            "age": 22,
            "major": "Mathematics"
          }
        ]
      }
    },
    {
      "request": {
        "method": "POST",
        "url": "https://api.school.com/students",
        "body": {
          "name": "Alice Johnson",
          "age": 19,
          "major": "Physics"
        }
      },
      "response": {
        "status": 201,
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Physics"
        }
      }
    },
    {
      "request": {
        "method": "PUT",
        "url": "https://api.school.com/students/3",
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Biology"
        }
      },
      "response": {
        "status": 200,
        "body": {
          "id": 3,
          "name": "Alice Johnson",
          "age": 19,
          "major": "Biology"
        }
      }
    },
    {
      "request": {
        "method": "DELETE",
        "url": "https://api.school.com/students/3",
        "body": None
      },
      "response": {
        "status": 204,
        "body": None
      }
    },
    {
      "request": {
        "method": "GET",
        "url": "http://api.school.com/students/3",
        "body": None
      },
      "response": {
        "status": 404,
        "body": {
          "error": "Student not found"
        }
      }
    }
  ]
}
```
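
To see why these status codes follow from the server state, the trace can be replayed against a toy in-memory server. This is a minimal sketch under our own assumptions: `StudentServer` and `handle` are hypothetical names, not part of any real API, and the `405` fallback is an extra status code that does not appear among the exercise options.

```python
class StudentServer:
    """Toy in-memory stand-in for the REST server (hypothetical, for illustration)."""

    def __init__(self, students):
        self.students = {s["id"]: dict(s) for s in students}
        self.next_id = max(self.students, default=0) + 1

    def handle(self, method, path, body=None):
        # path is "/students" (the collection) or "/students/<id>" (one resource)
        segments = path.strip("/").split("/")
        if len(segments) == 1:
            if method == "GET":
                return 200, list(self.students.values())
            if method == "POST":
                # The server picks the ID, hence the URI of the new resource.
                new = {"id": self.next_id, **body}
                self.students[self.next_id] = new
                self.next_id += 1
                return 201, new
        else:
            sid = int(segments[1])
            if method == "PUT":
                # 200 when replacing an existing resource, 201 when creating it.
                status = 200 if sid in self.students else 201
                self.students[sid] = dict(body)
                return status, self.students[sid]
            if sid not in self.students:
                return 404, {"error": "Student not found"}
            if method == "GET":
                return 200, self.students[sid]
            if method == "DELETE":
                del self.students[sid]
                return 204, None
        return 405, {"error": "Method not allowed"}  # not among the exercise options


server = StudentServer([
    {"id": 1, "name": "John Doe", "age": 20, "major": "Computer Science"},
    {"id": 2, "name": "Jane Smith", "age": 22, "major": "Mathematics"},
])
alice = {"name": "Alice Johnson", "age": 19, "major": "Physics"}

statuses = [
    server.handle("GET", "/students")[0],
    server.handle("POST", "/students", alice)[0],
    server.handle("PUT", "/students/3", {"id": 3, **alice, "major": "Biology"})[0],
    server.handle("DELETE", "/students/3")[0],
    server.handle("GET", "/students/3")[0],
]
print(statuses)  # [200, 201, 200, 204, 404]
```

The final `404` only appears because the `DELETE` before it removed the resource; issuing the same `GET` right after the `PUT` would have returned `200`.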

## Exercise 3: Key-Value Vector Clocks

> Reference: "Dynamo: Amazon’s Highly Available Key-value Store". In SOSP ’07 (Vol. 41, p. 205). [DOI](https://dl.acm.org/citation.cfm?doid=1294261.1294281)

Multiple distributed hash tables use vector clocks for capturing causality among different versions of the same object. In Amazon's Dynamo, a vector clock is associated with every version of every object.

Let $VC$ be an $N$-element array of non-negative integers, initialized to 0, representing the $N$ logical clocks of the $N$ processes (nodes) of the system. The $j$-th element of $VC$ is incremented by one every time node $j$ performs a write operation on the object. <br>
Moreover, $VC(x)$ denotes the vector clock of a write event $x$, and $VC(x)_z$ denotes the element of that clock for node $z$.

The formal definition of partial ordering that we get from using vector clocks is the following:
$$VC(x) \leq VC(y) \iff \forall z[VC(x)_z \leq VC(y)_z]$$
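
The definition translates almost literally into code. The sketch below assumes that both clocks have the same length and are given as plain sequences of integers; the helper names `leq` and `compare` are our own.

```python
def leq(vc_x, vc_y):
    """VC(x) <= VC(y): every element of x is at most the matching element of y."""
    return all(a <= b for a, b in zip(vc_x, vc_y))


def compare(vc_x, vc_y):
    """Classify the relation between two vector clocks of equal length."""
    if list(vc_x) == list(vc_y):
        return "equal"
    if leq(vc_x, vc_y):
        return "before"      # x causally precedes y
    if leq(vc_y, vc_x):
        return "after"       # y causally precedes x
    return "concurrent"      # incomparable in the partial order


print(compare([0, 1, 1], [1, 1, 1]))  # before
print(compare([1, 1, 1], [0, 2, 1]))  # concurrent
```

The last case is what makes the ordering partial rather than total: neither clock dominates the other, so the two writes are causally unrelated.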


### Task 1

Consider $n$ servers in a cluster where $S_i$ denotes the node at position $i$.  

In this exercise, we adopt a slightly modified notation from the Dynamo paper:  
- The Dynamo paper indicates the writing server on the edge; we instead write it before the colon.  
- For brevity, we index servers by position and omit the server names in the vector clock.

For example **aa ([$S_0$,0],[$S_1$,4])** with $S_1$ as writing server becomes **$S_1$ : aa ([0,4])**

Given the following version-evolution DAG for a particular object, complete the vector clocks computed at the corresponding version.

<img src="https://polybox.ethz.ch/index.php/s/byzKBjZg3BzSQlB/download" width=1000/>

#### Solution

<img src="https://polybox.ethz.ch/index.php/s/830WqekKCjNwvt3/download" width=1000/>

### Task 2

Consider the scenario where a `get()` request for some object is received by an Amazon Dynamo node.

If the receiving node is not in the top $N$ nodes in the preference list for the given key, the request is forwarded to the first node in the list. Once the request reaches a node in the top $N$ nodes of the preference list, this node becomes the coordinator. The coordinator then requests all existing versions of data for the given key from the $N$ highest-ranked healthy nodes in the preference list.

Your task is now to take on the role of the coordinator node. Before you can return the object to answer the `get()` request, you must build a DAG from all returned versions. Using this DAG, you can then identify all causally unrelated versions (those versions that are not comparable in the resulting partial ordering).

In this exercise, the returned versions are modelled as `value (vector clock)` pairs.

#### Task 2a

- ` 1 ([0,0,1])`
- ` 1 ([0,1,1])`
- ` 2 ([1,1,1])` 
- ` 3 ([0,2,1])`
- `10 ([1,3,1])`

##### Solution

<img src="https://polybox.ethz.ch/index.php/s/8ePr58NIdIxfyJc/download" width=400/>
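
The edges of such a DAG can also be derived mechanically from the listed clocks. The following sketch (helper names are ours) draws an edge from $x$ to $y$ whenever $VC(x) < VC(y)$ and no other returned clock lies strictly between them. Versions are keyed by clock because two of them carry the same value `1`.

```python
from itertools import combinations


def leq(x, y):
    """Element-wise comparison of two vector clocks of equal length."""
    return all(a <= b for a, b in zip(x, y))


def lt(x, y):
    return x != y and leq(x, y)


def dag_edges(clocks):
    """Edges x -> y with x < y and no returned clock strictly in between."""
    return [
        (x, y)
        for x in clocks
        for y in clocks
        if lt(x, y) and not any(lt(x, z) and lt(z, y) for z in clocks)
    ]


# Task 2a, keyed by vector clock.
versions = {
    (0, 0, 1): 1,
    (0, 1, 1): 1,
    (1, 1, 1): 2,
    (0, 2, 1): 3,
    (1, 3, 1): 10,
}
clocks = list(versions)

edges = dag_edges(clocks)
concurrent = [
    (x, y) for x, y in combinations(clocks, 2)
    if not leq(x, y) and not leq(y, x)
]

for x, y in edges:
    print(f"{versions[x]} {list(x)} -> {versions[y]} {list(y)}")
print("concurrent:", concurrent)  # concurrent: [((1, 1, 1), (0, 2, 1))]
```

For this input, the only causally unrelated pair is `2 ([1,1,1])` and `3 ([0,2,1])`, and both are superseded by `10 ([1,3,1])`.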

#### Task 2b

- ` a ([1,0,0,0])`
- ` b ([0,0,0,1])`
- `aa ([0,0,1,0])`
- `bb ([0,1,0,0])`
- ` c ([1,2,0,1])`
- `cc ([0,1,1,2])`
- ` d ([1,3,0,1])`
- ` f ([1,2,1,3])`
- ` e ([2,1,1,2])`
- ` g ([2,2,2,3])`

##### Solution

<img src="https://polybox.ethz.ch/index.php/s/fzp9Dpxd1KcGUPW/download" width=700/>
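
For a larger set of versions it is convenient to compute only the "frontier" of the DAG, that is, the versions whose clock is not dominated by any other returned clock. A minimal sketch, with `frontier` being our own helper name:

```python
def leq(x, y):
    """Element-wise comparison of two vector clocks of equal length."""
    return all(a <= b for a, b in zip(x, y))


def frontier(versions):
    """Versions whose clock is not strictly dominated by another returned clock."""
    return [
        (value, vc)
        for value, vc in versions
        if not any(vc != other and leq(vc, other) for _, other in versions)
    ]


# Task 2b as (value, vector clock) pairs.
versions = [
    ("a", (1, 0, 0, 0)),
    ("b", (0, 0, 0, 1)),
    ("aa", (0, 0, 1, 0)),
    ("bb", (0, 1, 0, 0)),
    ("c", (1, 2, 0, 1)),
    ("cc", (0, 1, 1, 2)),
    ("d", (1, 3, 0, 1)),
    ("f", (1, 2, 1, 3)),
    ("e", (2, 1, 1, 2)),
    ("g", (2, 2, 2, 3)),
]

print(frontier(versions))  # [('d', (1, 3, 0, 1)), ('g', (2, 2, 2, 3))]
```

Here `d` and `g` are incomparable with each other, while every other version is an ancestor of at least one of them.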

## Exercise 4: Merkle Trees

A hash tree or Merkle tree is a binary tree in which every leaf node is labelled with the cryptographic hash of a data block and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. 

Some Key-Value stores use Merkle trees for efficiently detecting inconsistencies in data between replicas. 

This works by first exchanging the root hash: each replica compares the received root hash with its own. If the hashes match, the replicas are synchronized. If they do not match, the children of that node (in the Merkle tree) are retrieved and their hashes are compared. This process continues until the inconsistent leaves are identified.
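
The top-down comparison can be sketched in a few lines of Python. The sketch rests on several assumptions of our own: SHA-256 as the hash function, leaves holding the hashes of the data blocks, a block count that is a power of two, and both trees being available locally (a real system would exchange the hashes over the network). The names `build` and `diff` are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build(blocks):
    """Return the tree as a list of levels, root level first.

    Assumes len(blocks) is a power of two.
    """
    levels = [[h(b) for b in blocks]]
    while len(levels[0]) > 1:
        below = levels[0]
        parents = [h(below[i] + below[i + 1]) for i in range(0, len(below), 2)]
        levels.insert(0, parents)
    return levels


def diff(tree_a, tree_b, level=0, index=0):
    """Indices of differing blocks, descending only into mismatching nodes."""
    if tree_a[level][index] == tree_b[level][index]:
        return []                      # identical subtree: nothing below is compared
    if level == len(tree_a) - 1:
        return [index]                 # mismatching leaf
    return (diff(tree_a, tree_b, level + 1, 2 * index)
            + diff(tree_a, tree_b, level + 1, 2 * index + 1))


blocks_a = [f"block-{i}".encode() for i in range(8)]
blocks_b = list(blocks_a)
blocks_b[6] = b"stale-6"
blocks_b[7] = b"stale-7"

print(diff(build(blocks_a), build(blocks_b)))  # [6, 7]
```

In this example only the rightmost path of the tree is descended; the left half is skipped after a single hash comparison, which is the whole point of the structure.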

Each of the following tasks shows a pair of Merkle trees for some object. For each pair, specify the process for identifying the data blocks that do not match or explain why the trees are not valid.

### Task 1

<img src="https://polybox.ethz.ch/index.php/s/aex97NCb0JuPscJ/download" width=800/>

#### Solution

1. Node $A$ (root) is exchanged and compared. 
    - The hashes do not match.
2. Nodes $B$ and $C$ are exchanged and compared. 
    - Node $B$ matches, so its children are not exchanged. 
    - Node $C$ does not match.
3. Nodes $F$ and $G$ are exchanged and compared.
    - Node $F$ matches, so its children are not exchanged.
    - Node $G$ does not match.
4. Nodes $N$ and $O$ are exchanged and compared. 
    - Node $N$ does not match, indicating a difference in the associated data block.
    - Node $O$ also does not match, again indicating a difference in the associated data block.

### Task 2

<img src="https://polybox.ethz.ch/index.php/s/L6IDGcGesoL1nR0/download" width=800/>

#### Solution

This configuration is theoretically impossible with a sound hash function. The hashes of $B$ and $C$, respectively, are identical across the trees. Yet, the hash of node $A$ does not match. This would imply that our hash function can give two different results from the exact same input values.

A second anomaly can be seen at node $C$. Both trees have the same value for node $C$, yet the hash of one of its children (node $G$) differs across the trees. This is theoretically possible since hash functions generally have a non-zero collision probability (i.e., the same hash value can be obtained from two different inputs). In practice, this is extremely unlikely, but may lead to unnecessary exchanges/synchronization.

## Exercise 5: Virtual Nodes

Virtual nodes were introduced to avoid assigning data in an unbalanced manner and to cope with hardware heterogeneity by taking into consideration the physical capacity of each server.

Let us assume we have ten physical nodes ($i_0$ to $i_{9}$) with the following main memory configurations:

|          **Node** |     0 |      1 |      2 |     3 |      4 |       5 |     6 |      7 |      8 |        9 |
| ----------------: | ----: | -----: | -----: | ----: | -----: | ------: | ----: | -----: | -----: | -------: |
|        **Memory** | 8 GiB | 16 GiB | 32 GiB | 8 GiB | 16 GiB | 0.5 TiB | 1 TiB | 18 GiB | 30 GiB | 0.25 TiB |

Calculate the number of virtual nodes/tokens each server should get according to its main memory capacity if we want to have a total of **256** virtual nodes/tokens.

For the purpose of this exercise, always round up if you get a fractional number of virtual nodes, even if the total sum of nodes in the end exceeds 256.

### Solution

The total amount of main memory available is: 

$$8 + 16 + 32 + 8 + 16 + 512 + 1024 + 18 + 30 + 256 = 1920~\text{GiB} $$

So, if we want to have 256 virtual nodes/tokens, the memory per virtual node is as follows.

$$\frac{1920~\text{GiB}}{256~\text{nodes}} = 7.5~\text{GiB}/\text{node}$$

With this, the physical nodes should be assigned the following number of virtual nodes:

|          **Node** |     0 |      1 |      2 |     3 |      4 |       5 |     6 |      7 |      8 |        9 | Total |
| ----------------: | ----: | -----: | -----: | ----: | -----: | ------: | ----: | -----: | -----: | -------: | ----- |
|        **Memory** | 8 GiB | 16 GiB | 32 GiB | 8 GiB | 16 GiB | 0.5 TiB | 1 TiB | 18 GiB | 30 GiB | 0.25 TiB |       |
|  **Memory** (GiB) |     8 |     16 |     32 |     8 |     16 |     512 |  1024 |     18 |     30 |      256 | 1920  |
| **Virtual Nodes** |     2 |      3 |      5 |     2 |      3 |      69 |   137 |      3 |      4 |       35 | 263   |
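
The table can be reproduced with a short calculation. This is just a sketch of the arithmetic above; the variable names are ours, and `math.ceil` implements the "always round up" rule of this exercise.

```python
import math

# Main memory per physical node in GiB (0.5 TiB = 512, 1 TiB = 1024, 0.25 TiB = 256).
memory_gib = [8, 16, 32, 8, 16, 512, 1024, 18, 30, 256]
target_vnodes = 256

gib_per_vnode = sum(memory_gib) / target_vnodes          # 1920 / 256 = 7.5
vnodes = [math.ceil(m / gib_per_vnode) for m in memory_gib]

print(gib_per_vnode)  # 7.5
print(vnodes)         # [2, 3, 5, 2, 3, 69, 137, 3, 4, 35]
print(sum(vnodes))    # 263
```

The total exceeds 256 by 7 because nine of the ten quotients are fractional and rounded up; only node 8 (30 GiB) divides evenly.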

In general, whether we round up to the next integer or down to the previous one is not decisive, as virtual nodes are placed at random positions on the ring anyway. However, one thing to take into consideration in practice is that the server with the smallest capacity should have more than a single virtual node.