# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2023 &ndash; Week 2 &ndash; ETH Zurich</center>

## Exercise 1: Storage devices (Optional)

In this exercise, we want to understand the differences between [SSD](https://en.wikipedia.org/wiki/Solid-state_drive), [HDD](https://en.wikipedia.org/wiki/Hard_disk_drive), and [SDRAM](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory) in terms of __capacity__, __speed__ and __price__. 

### Task 1
Fill in the table below by visiting your local online hardware store and choosing, for each type, the storage device with the largest capacity available while optimizing for read/write speed.
For instance, you can visit Digitec.ch to explore the prices on [SSDs](https://www.digitec.ch/en/s1/producttype/ssd-545?tagIds=76), [HDDs](https://www.digitec.ch/en/s1/producttype/hard-drives-36?tagIds=76), and 
[SDRAMs](https://www.digitec.ch/en/s1/producttype/memory-2?tagIds=76). 
You are free to use any other website to fill in the table.


| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |                      |                |                  |                   |&nbsp;|
| SSD            |                      |                |                  |                   |&nbsp;|
| DRAM           |                      |                |                  |                   |&nbsp;|
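
Shops usually list a total price and a capacity rather than a price per GB, so the "Price, CHF/GB" column has to be computed. A minimal sketch of that arithmetic follows; the helper names and the numbers are hypothetical and do not describe a real product, and decimal units (1 TB = 1000 GB) are assumed.

```python
def tb_to_gb(capacity_tb: float) -> float:
    """Convert terabytes to gigabytes, assuming decimal units (1 TB = 1000 GB)."""
    return capacity_tb * 1000


def price_per_gb(price_chf: float, capacity_gb: float) -> float:
    """Return the price per gigabyte in CHF/GB."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return price_chf / capacity_gb


# Hypothetical listing: a 2 TB drive for 150 CHF (illustrative numbers only).
example = price_per_gb(150.0, tb_to_gb(2))
print(f"{example:.3f} CHF/GB")
```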


### Task 2
Answer the following questions:
1. Which of the storage device types above is the cheapest per GB?
2. Which of the storage device types above is the fastest in terms of read speed?

# Exercise 2. KeyValue Vector Clocks

As pointed out in the lecture, the concepts are clearly explained in the Dynamo paper: DeCandia, G., et al. (2007). "Dynamo: Amazon’s Highly Available Key-value Store". In SOSP ’07 (Vol. 41, p. 205). [DOI](https://dl.acm.org/citation.cfm?doid=1294261.1294281)

## Task 1
Several distributed hash tables use vector clocks to capture causality among different versions of the same object. In Amazon's Dynamo, a vector clock is associated with every version of every object.

Let $VC$ be an $N$-element array of non-negative integers, initialized to 0, representing the $N$ logical clocks of the $N$ processes (nodes) of the system. The $j$-th element of $VC$ is incremented by one every time node $j$ performs a write operation on the object. <br>
Moreover, $VC(x)$ denotes the vector clock of a write event $x$, and $VC(x)_z$ denotes the element of that clock for the node $z$.
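
The update rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `VectorClock` and the method `record_write` are hypothetical, nodes are indexed from 0, and the sketch covers the increment rule only, not the ordering asked for in this task.

```python
class VectorClock:
    """Toy vector clock for a system of n nodes (a sketch, not Dynamo's code)."""

    def __init__(self, n: int) -> None:
        # One logical clock per node, all initialized to 0.
        self.entries = [0] * n

    def record_write(self, j: int) -> None:
        """Node j performed a write: increment the j-th element by one."""
        self.entries[j] += 1


vc = VectorClock(3)
vc.record_write(1)
vc.record_write(1)
vc.record_write(2)
print(vc.entries)  # [0, 2, 1]
```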

Try to __formally define__ the partial ordering that we get from using vector clocks.

## Task 2

The antisymmetry property of vector clocks is defined as follows:

If $VC(x) \lt VC(y)$, then $\neg \ (VC(y) \lt VC(x))$

Prove this property.

## Task 3

Consider $N$ servers in a cluster, where $S_j$ denotes the $j$-th node.  
In this exercise, we adopt a slightly modified notation from the Dynamo paper:  
- The Dynamo paper indicates the writing server on the edge; we instead write it before the colon.  
- For brevity, we index servers by position and omit the server name in the vector clock.

For example, **aa ([$S_0$,0],[$S_1$,4])** with $S_1$ as the writing server becomes **$S_1$ : aa ([0,4])**.
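
The notation change described above is mechanical, as the following minimal sketch shows. The function name `to_compact` is hypothetical, and the sketch assumes the Dynamo-style clock lists exactly one (server, counter) pair per server, in server order.

```python
def to_compact(writer: str, value: str, clock: list[tuple[str, int]]) -> str:
    """Render a Dynamo-style clock in the compact notation of this exercise.

    Assumes `clock` holds one (server, counter) pair per server, in server order.
    """
    # Drop the server names and keep only the counters, by position.
    counters = ",".join(str(counter) for _, counter in clock)
    return f"{writer} : {value} ([{counters}])"


print(to_compact("S1", "aa", [("S0", 0), ("S1", 4)]))  # S1 : aa ([0,4])
```
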
Given the following version evolution DAG for a particular object, complete the vector clock of each version.

<img src="https://polybox.ethz.ch/index.php/s/iRONxqhpQkRdLeY/download" width=400/>

<img src="https://polybox.ethz.ch/index.php/s/WzJlMxIrA2RGcKh/download" width=400/>

<img src="https://polybox.ethz.ch/index.php/s/nZ83Jb7mrr0uhi8/download" width=400/>

## Task 4

When a get request for some key comes in to Amazon Dynamo:
  - The coordinator node (the top node on the preference list for this key) takes care of the request.
  - The coordinator node queries the relevant nodes (itself plus the next N-1 healthy ones on the preference list) and receives a set of versions of the value associated with the key, modelled as __value (vector clock)__ pairs such as a ([1, 3, 2]).

### Task 4.1
Given the following list of versions, draw the version DAG that the coordinator node will build for returning available versions.

1 ([0,0,1])  
1 ([0,1,1])  
2 ([1,1,1])  
3 ([0,2,1])  
10 ([1,3,1])

### Task 4.2
Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.


 a ([1,0,0])  
 b ([0,1,0])  
 c ([2,1,0])   
 d ([2,1,1])   
 e ([3,1,1])  
 f ([2,2,1])   
 g ([3,1,2])   
 h ([3,2,3])  
 i ([4,2,2])   
 j ([5,2,2])  
 k ([4,3,3])  
 l ([5,2,3])  
 m ([5,4,3])  
 n ([6,3,3])  
 o ([6,4,4])  

### Task 4.3
Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.

a ([1,0,0,0])  
b ([0,0,0,1])  
aa ([0,0,1,0])  
bb ([0,1,0,0])  
c ([1,2,0,1])  
cc ([0,1,1,2])  
d ([1,3,0,1])  
f ([1,2,1,3])  
e ([2,1,1,2])  
g ([2,2,2,3])  

## Task 5 (Optional)

Consider $N$ servers in a cluster, where $S_j$ denotes the $j$-th node. The following table shows the execution of a series of get/put operations. Each row of the table represents the events that happen at the same time. For example, at time 0 (`t0`), servers $S_1$ and $S_3$ perform operations. When reading an object we are provided with a context, and when writing an object we must provide one. The context itself is the vector clock; it helps the routines determine which version of the object they are dealing with and what the new, updated version of the context will be.

For the `get` and `put` routines, we have the following signatures:

* `get(key)` $\rightarrow$ `[val_1, val_2, ...]`, $C_{key}$ `(` $VC$ `(key))` 
  * Example: `get("foo")` $\rightarrow$ `[bar_1, bar_2]`, $C_{2}$ `([1, 0, 1, 0]) # We assume the existence of 4 nodes` 

* `put(key, context, val)` $\rightarrow$ `None`
  * Example: `put("foo",` $C_2$`, "bar")`  

Note that the $C_{key}$ elements are just notation, and are meant to highlight that the context gets passed around between the `get` and `put` routines in a real API. 
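
To make the hand-over of the context concrete, here is a toy, single-node sketch of the two signatures. It is not Dynamo's implementation. The class name `ToyNode` and its fields are hypothetical, there is no replication, and it assumes that `get` returns the element-wise maximum of the stored clocks as the context and that `put` increments the writing node's own entry.

```python
class ToyNode:
    """Single-node toy store; `index` is this node's position in the clock."""

    def __init__(self, index: int, n_nodes: int) -> None:
        self.index = index
        self.n_nodes = n_nodes
        self.versions = {}  # key -> list of (value, vector clock) pairs

    def get(self, key):
        """Return all stored values plus a context summarizing their clocks."""
        stored = self.versions.get(key, [])
        values = [value for value, _ in stored]
        context = [0] * self.n_nodes
        for _, clock in stored:
            # Assumption: the context is the element-wise maximum of all clocks.
            context = [max(a, b) for a, b in zip(context, clock)]
        return values, context

    def put(self, key, context, val) -> None:
        """Write `val` as a successor of the versions described by `context`."""
        clock = list(context)
        clock[self.index] += 1  # this node performed the write
        # Simplification: the context is assumed to come from this node's own
        # get(), so the new version supersedes everything stored locally.
        self.versions[key] = [(val, clock)]


node = ToyNode(index=0, n_nodes=4)
values, ctx = node.get("foo")   # nothing stored yet: ([], [0, 0, 0, 0])
node.put("foo", ctx, "bar")
values, ctx = node.get("foo")   # (['bar'], [1, 0, 0, 0])
print(values, ctx)
```

The point of the sketch is the calling pattern: a writer first calls `get`, then passes the returned context unchanged into `put`.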

Complete the missing `[list_values],` $C_{key}$ `([vector_clock])` tuples for the calls below.

<table>
  <tr><th></th><th>S0</th><th>S1</th><th>S2</th><th>S3</th><th>S4</th></tr>
  <tr>
    <td>t0</td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{1}$(_______________)<br>Put(1, _____, "a")</td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{2}$(_______________)<br>Put(1, _____, "bb")</td>
    <td></td>
  </tr>
  <tr>
    <td>t2</td>
    <td>Get(1)$\rightarrow$ _______________, $C_4$(_______________)<br>Put(1, _____, "rr")</td>
    <td>Get(1)$\rightarrow$ _______________, $C_5$(_______________)<br>Put(1, _____, "dd")</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>t4</td>
    <td></td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_9$(_______________) <br>Put(1, _____, "ccc")</td>
    <td>Get(1)$\rightarrow$ _______________, $C_{10}$(_______________) <br>Put(1, _____, "dd")</td>
    <td></td>
  </tr>
  <tr>
    <td>t5</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{11}$(_______________) <br>Put(1, _____, "fff")</td>
  </tr>
</table>

The DAG below shows the interaction among nodes when retrieving values. You can use it for determining the expected values.

<img src="https://polybox.ethz.ch/index.php/s/YoWi7QK2DcMHJFe/download" width=400/>

# Exercise 3. Merkle Trees
A hash tree or Merkle tree is a binary tree in which every leaf node gets as its label a data block and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. 

Some key-value stores use Merkle trees for efficiently detecting inconsistencies in data between replicas. 

This works by first exchanging the root hash: each replica compares the received root hash with its own. If the hashes match, the replicas are synchronised. If they do not match, the children of that node (in the Merkle tree) are retrieved and their hashes are compared. This process continues until the inconsistent leaf or leaves are identified. 
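
The top-down comparison described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular store. The function names are hypothetical, SHA-256 is an arbitrary choice of hash, the number of blocks is assumed to be a power of two and equal on both replicas, and leaves hold the hash of their data block (rather than the block itself) so that only hashes are compared.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Cryptographic hash used for all node labels in this sketch."""
    return hashlib.sha256(data).digest()


def build_levels(blocks: list[bytes]) -> list[list[bytes]]:
    """Return the tree as a list of levels: leaves first, root level last.

    Assumes len(blocks) is a power of two.
    """
    levels = [[h(block) for block in blocks]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        # Each parent is the hash of the concatenated labels of its two children.
        levels.append([h(below[i] + below[i + 1]) for i in range(0, len(below), 2)])
    return levels


def differing_leaves(mine, theirs) -> list[int]:
    """Walk down from the root, descending only where the hashes differ."""
    top = len(mine) - 1
    if mine[top][0] == theirs[top][0]:
        return []  # root hashes match: the replicas are in sync
    suspects = [0]  # indices of nodes at the current level whose hashes differ
    for level in range(top, 0, -1):
        next_suspects = []
        for node in suspects:
            for child in (2 * node, 2 * node + 1):
                if mine[level - 1][child] != theirs[level - 1][child]:
                    next_suspects.append(child)
        suspects = next_suspects
    return suspects


replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=v2", b"k3=STALE", b"k4=v4"]
print(differing_leaves(build_levels(replica_a), build_levels(replica_b)))  # [2]
```

Only the subtrees whose hashes differ are visited, which is why matching root hashes end the comparison after a single exchange.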

## Task 1
The two pictures below each depict two Merkle trees, each belonging to a different replica. Both trees should represent the same object.

For each of the two pairs of trees below, specify whether it is a possible configuration and, if applicable, which nodes have to be exchanged in order to sync the trees.  

<img src="https://polybox.ethz.ch/index.php/s/vqj7AOAozZKEO3N/download" width=800/>

## Task 2
Repeat the exercise for the following pair of Merkle Trees.

<img src="https://polybox.ethz.ch/index.php/s/TwFd3KDxTrqq2B1/download" width=800/>

# Exercise 4. Virtual nodes

Virtual nodes were introduced to avoid unbalanced data assignment and to cope with hardware heterogeneity by taking the physical capacity of each server into consideration.

Let us assume we have ten servers ($i_1$ to $i_{10}$), each with the following amount of main memory: `8GiB, 16GiB, 32GiB, 8GiB, 16GiB, 0.5TiB, 1TiB, 0.25TiB, 10GiB, 20GiB`. Calculate the number of virtual nodes/tokens each server should get according to its main memory capacity if we want to have a total of `256` virtual nodes/tokens.

For the purpose of this exercise, if you get a fractional number of virtual nodes, always round up, even if the total number of virtual nodes then exceeds `256`.
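
The proportional assignment with rounding up can be sketched as below. The function name and the three-server example are hypothetical and deliberately differ from the exercise data; all capacities are assumed to be given as integers in the same unit (convert TiB to GiB first, 1 TiB = 1024 GiB).

```python
def tokens_per_server(memory_gib: list[int], total_tokens: int) -> list[int]:
    """Give each server a share of tokens proportional to its memory, rounded up."""
    total_memory = sum(memory_gib)
    # Integer ceiling division avoids floating-point rounding surprises.
    return [
        (total_tokens * memory + total_memory - 1) // total_memory
        for memory in memory_gib
    ]


# Hypothetical cluster: 4, 8 and 20 GiB of memory, 10 tokens in total.
shares = tokens_per_server([4, 8, 20], 10)
print(shares, sum(shares))  # [2, 3, 7] 12
```

Because every fractional share is rounded up, the shares can add up to more than the requested total, as they do in this example.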