# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2020 &ndash; Week 2 &ndash; ETH Zurich</center>

## Exercise 1: Storage devices (Optional)

In this exercise, we want to understand the differences between [SSD](https://en.wikipedia.org/wiki/Solid-state_drive), [HDD](https://en.wikipedia.org/wiki/Hard_disk_drive), and [SDRAM](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory) in terms of __capacity__, __speed__ and __price__. 

### Task 1
Fill in the table below by visiting your local online hardware store. For each device type, choose the device with the largest available capacity; among those, prefer the one with the highest read/write speed.
For instance, you can visit Digitec.ch to explore the prices on [SSDs](https://www.digitec.ch/en/s1/producttype/ssd-545?tagIds=76), [HDDs](https://www.digitec.ch/en/s1/producttype/hard-drives-36?tagIds=76), and 
[SDRAMs](https://www.digitec.ch/en/s1/producttype/memory-2?tagIds=76). 
You are free to use any other website for filling the table. 


| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |                      |                |                  |                   |&nbsp;|
| SSD            |                      |                |                  |                   |&nbsp;|
| DRAM           |                      |                |                  |                   |&nbsp;|
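
The *Price, CHF/GB* column is the listed price divided by the capacity. A minimal sketch of that arithmetic follows; the function name `price_per_gb` and the numbers in the usage line are made up for illustration and are not real store prices.

```python
def price_per_gb(price_chf: float, capacity_gb: float) -> float:
    """Return the price per gigabyte in CHF/GB."""
    if capacity_gb <= 0:
        raise ValueError("capacity_gb must be positive")
    return price_chf / capacity_gb


# Hypothetical device (not a real listing): 200 CHF for 4000 GB (4 TB)
print(price_per_gb(200, 4000))
```

Remember to convert TB to GB before dividing so that all rows of the table use the same unit.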


### Task 2
Answer the following questions:
1. Which of the storage device types above is the cheapest per GB?
2. Which of the storage device types above is the fastest in terms of read speed?

## Exercise 2: Setting up an Azure storage account

In this section you'll learn how to set up a Locally Redundant Storage instance.


### Step 1: Create a Locally-redundant Storage

1. First, you need to create a storage account. In the Azure portal, click on the option "Storage accounts" in the left-hand side menu.

<img src="https://polybox.ethz.ch/index.php/s/O50x6Ip3wAfHZZt/download" width=800/>

2. Click on the "Add" button at the top of the page. 

<img src="https://polybox.ethz.ch/index.php/s/28ZnfQidRktKXNG/download" width=800/>

3. Fill in the form in the following way:
  * If not already present, create a new resource group called *exercise02*
  * Select *Locally-redundant storage (LRS)* as *Replication* mode
  * The *Storage Account Name* can be whatever you want 
  * Leave all other values unchanged 

<img src="https://polybox.ethz.ch/index.php/s/NxLQbGomz0tkPUd/download" width=800/>

4. Click *Review + create* then *Create* on the next page (deployment might take a few minutes).

<img src="https://polybox.ethz.ch/index.php/s/XthQkrrg9PrEOSq/download" width=800/>

5. Go to the resource page of the newly created LRS.

<img src="https://polybox.ethz.ch/index.php/s/A0rZoRbC7BJiPVA/download" width=800/>

6. In the left-hand menu, under the *Settings* group, select the *Access Keys* tab.

<img src="https://polybox.ethz.ch/index.php/s/xrbKc7AlqDUwbSq/download" width=800/>

7. Copy one of the access keys to the clipboard. 

8. Paste the *Storage Account Name* in `ACCOUNT_NAME`, the access key in `ACCOUNT_KEY`, and add an arbitrary string in `CONTAINER_NAME` (or leave it as default).

In [None]:
ACCOUNT_NAME   = '...'
ACCOUNT_KEY    = '...'
CONTAINER_NAME = 'exercise02'


###  Step 2: Installing and Importing the Azure Storage Library

In [None]:
!pip install azure-storage==0.33.0

In [None]:
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PageBlobService
from azure.storage.blob import AppendBlobService
from azure.storage.blob import PublicAccess
from azure.storage.models import LocationMode
from azure.storage.blob import ContentSettings
from azure.storage.blob import BlockListType
from azure.storage.blob import BlobBlock
from timeit import default_timer as timer
import uuid
import random

# Function for generating unique names for blobs
def get_blob_name():
    # uuid4().hex is the UUID as 32 hexadecimal characters, without dashes
    return 'blob' + uuid.uuid4().hex

## Exercise 3: Azure Blob Storage Features

### Step 1: Explore Concepts of Azure Blob Storage

1. A container provides a grouping of a set of blobs. All blobs must be in a container. An account can contain an unlimited number of containers, and a container can store an unlimited number of blobs. Note that the container name must be lowercase.

![Image of blob](https://docs.microsoft.com/en-us/azure/includes/media/storage-blob-concepts-include/blob1.png)

2. Let us look at the different types of blobs available in Azure Blob storage by reading the article at the following [link](https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs). After you are done, inspect the table below, and determine which type of blob is suitable for each of the use cases. 

|                | Block Blob | Append Blob  | Page Blob |
| --------------:| -----------| ------------:| ---------:|
| Static content delivery             |            |              |           |
| As a disk for a VirtualMachine       |            |              |           |
| Streaming video                      |            |              |           |
| Log Files                     |            |             |          |
| Social network events (e.g., uploading photos to Instagram)          |            |              |           |
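
Item 1 above notes that container names must be lowercase. As a rough sketch, the check below encodes the container naming rules as we understand them from the Azure documentation (3 to 63 characters; lowercase letters, digits and hyphens only; starts and ends with a letter or digit; no consecutive hyphens). The helper name `is_valid_container_name` is our own and is not part of the Azure SDK.

```python
import re

# Lowercase letters and digits, with single hyphens allowed only between them
_CONTAINER_NAME = re.compile(r"[a-z0-9](?:-?[a-z0-9])+")


def is_valid_container_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    return 3 <= len(name) <= 63 and _CONTAINER_NAME.fullmatch(name) is not None


print(is_valid_container_name("exercise02"))   # lowercase letters and digits
print(is_valid_container_name("Exercise02"))   # contains an uppercase letter
```

Running such a check before calling the service can give a clearer error message than a failed request would.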

### Step 2: Test Your First Container

1. Create a new container under the specified account. If the container with the same name already exists, the operation fails and returns `False`.

In [None]:
# Choose whether to have public access for this container
public_access = False
access_type = PublicAccess.Container if public_access else None

# Create the container
block_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
status = block_blob_service.create_container(CONTAINER_NAME, public_access=access_type)

# Write a console message indicating if the container was successfully created
if status:
  print(f"Container {CONTAINER_NAME} created")
else:
  print(f"Container {CONTAINER_NAME} already exists")

2. Download a file to the Colab's Virtual Machine (VM).

In [None]:
!wget https://www.vetbabble.com/wp-content/uploads/2016/11/hiding-cat.jpg -O cat.jpg

3. Upload the file to Azure Blob storage. Note that the name of the file on local machine (`local_file`) can differ from the name of the blob (`blob_name`).

In [None]:
# Define the local and remote file names
local_file = "cat.jpg"
blob_name = "picture"

# Create a blob which contains the downloaded image
try:
  block_blob_service.create_blob_from_path(
    CONTAINER_NAME,
    blob_name,
    local_file,
    content_settings=ContentSettings(content_type='image/jpeg')  # registered MIME type for JPEG
  )
  print("Blob URL:", block_blob_service.make_blob_url(CONTAINER_NAME, blob_name))
except Exception as e:
  # Report the cause instead of silently swallowing it
  print(f"Could not create the blob: {e}")

4. Try to open the link above.

By default, the new container is private, so you must specify your storage access key (as you did earlier) to download blobs from this container. If you want to make the blobs within the container available to everyone, you can set the container's public access level using the following code.

In [None]:
# Give your container public access
block_blob_service.set_container_acl(CONTAINER_NAME, public_access=PublicAccess.Container)

After this change, anyone on the Internet can see blobs in a public container, but only you can modify or delete them. 

Try to open the link again. Note that it may take a few seconds for the access permissions to change.

5. List all blobs in the container

In order to list the blobs in a container, use the `list_blobs` method. This method returns a generator which can be iterated over in a loop. The following code outputs the name, type, size and url of each blob in a container.

In [None]:
# List all blobs in the container
blobs = block_blob_service.list_blobs(CONTAINER_NAME)
for blob in blobs:
  try:
    print(f"Name: {blob.name}") 
    print(f" > Type: {blob.properties.blob_type}") 
    print(f" > Size: {blob.properties.content_length}") 
    print(f" > URL:  {block_blob_service.make_blob_url(CONTAINER_NAME,blob.name)}")
  except Exception as e:
    print(f"Something went wrong: {e}")

6. Download blobs

In order to download data from a blob, use `get_blob_to_path`, `get_blob_to_stream`, `get_blob_to_bytes`, or `get_blob_to_text`. They are high-level methods that perform the necessary chunking when the size of the data exceeds 64 MB.

Note: The name of the file after downloading can differ from the name of the blob.
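
To give an intuition for the chunking mentioned above, here is a minimal sketch of how a large download could be split into byte ranges that are fetched one at a time. This is not the library's actual implementation; the function name `chunk_ranges` and the 4 MB chunk size in the usage lines are assumptions chosen for illustration.

```python
from typing import Iterator, Tuple


def chunk_ranges(total_size: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) byte ranges that cover total_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_size, chunk_size):
        # The last chunk may be shorter than chunk_size
        end = min(start + chunk_size, total_size) - 1
        yield start, end


MB = 1024 * 1024
ranges = list(chunk_ranges(100 * MB, 4 * MB))  # hypothetical 100 MB blob
print(len(ranges), ranges[0], ranges[-1])
```

Each range could then be requested separately and written at the matching offset of the local file.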

The following example uses `get_blob_to_path` to download the content of your container and store it with names of the form `file_i`, where `i` is a sequential index.

In [None]:
# Specify the local path where the contents will be downloaded
LOCAL_PATH = "."   

# Iterate through the blobs and download them
blobs = block_blob_service.list_blobs(CONTAINER_NAME)
for i, blob in enumerate(blobs):
  local_file = f"{LOCAL_PATH}/file_{i}"
  try:
    block_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
    print(f"Successfully downloaded {local_file}")
  except Exception as e:
    print(f"Something went wrong while downloading blob {blob.name}: {e}")

Since the downloaded file does not have an extension, Colab will not know how to interpret it. The cell below adds the `.jpg` extension to the name. After this, try opening the file and see what you get.

In [None]:
!mv file_0 file_0.jpg

### Step 3: Using the REST API

REpresentational State Transfer (__REST__), or __RESTful__, web services provide interoperability between computer systems on the Internet. REST-compliant web services allow the requesting systems to access and manipulate textual representations of web resources by using a uniform and predefined set of **stateless** operations.

The most popular operations in REST are GET, POST, PUT, DELETE. A response may be in XML, HTML, JSON, or some other format. 

You can find the Azure Blob Service API description at the following [link](https://docs.microsoft.com/en-us/rest/api/storageservices/blob-service-rest-api), and the HTTP response codes defined by the World Wide Web Consortium (W3C) [here](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html).

We could use tools like [Postman](https://www.getpostman.com/), cURL, or others to make REST requests to Azure Storage. In this tutorial we will use [Reqbin](https://reqbin.com/), a simple and effective website for sending HTTP requests online.
<br>

### Tasks

Complete the tasks below:

1. Use any tool to list all blobs in your container. For this, use the following [REST request](https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs). To avoid setting up authentication, you can make your container public by changing its access policy to *Container*, as shown in the pictures below:

<img src="https://polybox.ethz.ch/index.php/s/rIw7dSK20PEH5um/download" width=600/>

<img src="https://polybox.ethz.ch/index.php/s/IpjOgJtOj2v13fT/download" width=600/>

<img src="https://polybox.ethz.ch/index.php/s/k5MVQE8jKQg6OOJ/download" width=600/>

Alternatively, you can also use the cell below to do this without accessing the Azure UI:

In [None]:
block_blob_service.set_container_acl(CONTAINER_NAME, public_access=PublicAccess.Container)

2. Explain why the request above does not include a **body** part.
3. What is the response format of the request? 
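
For task 1, the request URL can also be assembled programmatically. The sketch below builds the List Blobs URL from the account and container names, assuming the default public Azure endpoint `blob.core.windows.net` and a container with public access; `list_blobs_url` is a helper name of our own and `myaccount` is a placeholder.

```python
from urllib.parse import urlencode


def list_blobs_url(account_name: str, container_name: str) -> str:
    """Build the URL for the List Blobs operation on a container."""
    # Query parameters required by the List Blobs operation
    query = urlencode({"restype": "container", "comp": "list"})
    return f"https://{account_name}.blob.core.windows.net/{container_name}?{query}"


# "myaccount" is a placeholder; use your own ACCOUNT_NAME and CONTAINER_NAME
print(list_blobs_url("myaccount", "exercise02"))
```

The resulting URL can be pasted into Reqbin, Postman, or a browser and sent as a GET request.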

# Exercise 4. KeyValue Vector Clocks

As pointed out in the lecture, the concepts are clearly explained in the Dynamo paper: DeCandia, G., et al. (2007). "Dynamo: Amazon’s Highly Available Key-value Store". In SOSP ’07 (Vol. 41, p. 205). [DOI](https://dl.acm.org/citation.cfm?doid=1294261.1294281)

## Task 1
Multiple distributed hash tables use vector clocks for capturing causality among different versions of the same object. In Amazon's Dynamo, a vector clock is associated with every version of every object.

Let $VC$ be an $N$-element array of non-negative integers, initialized to 0, representing the $N$ logical clocks of the $N$ processes (nodes) of the system. The $j$-th element of $VC$ is incremented by one every time node $j$ performs a write operation on the object. <br>
Moreover, $VC(x)$ denotes the vector clock of a write event $x$, and $VC(x)_z$ denotes the element of that clock for node $z$.
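
The update rule above can be sketched in a few lines of Python. This is only an illustration of the definition, assuming nodes are numbered from 0; the names `new_clock` and `record_write` are our own.

```python
from typing import List


def new_clock(n: int) -> List[int]:
    """Return a vector clock for n nodes, with every entry initialized to 0."""
    return [0] * n


def record_write(vc: List[int], j: int) -> List[int]:
    """Return a copy of vc in which the entry of the writing node j is incremented by one."""
    updated = list(vc)
    updated[j] += 1
    return updated


vc = new_clock(3)
vc = record_write(vc, 1)  # node 1 writes
vc = record_write(vc, 1)  # node 1 writes again
vc = record_write(vc, 2)  # node 2 writes
print(vc)  # [0, 2, 1]
```

Returning a copy rather than mutating the input keeps the clock of the previous version available, since every version of an object carries its own clock.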

Try to __formally define__ the partial ordering that we get from using vector clocks.

## Task 2

The antisymmetry property of vector clocks is defined as follows:

If $ VC(x) \lt VC(y)$, then $ ¬ \ (VC(y) \lt VC(x)) $

Prove this property.

## Task 3

Consider a cluster of servers where $S_j$ denotes the $j$-th node.  
In this exercise, we adopt a slightly modified notation from the Dynamo paper:  
- The Dynamo paper indicates the writing server on the edge; we instead write it before the colon.  
- For brevity, we index servers by position and omit the server names in the vector clock.

For example, **aa ([$S_0$,0],[$S_1$,4])** with $S_1$ as the writing server becomes **$S_1$ : aa ([0,4])**.  
Given the following version evolution DAG for a particular object, complete the vector clocks computed at the corresponding versions.

<img src="https://polybox.ethz.ch/index.php/s/iRONxqhpQkRdLeY/download" width=400/>

<img src="https://polybox.ethz.ch/index.php/s/WzJlMxIrA2RGcKh/download" width=400/>

<img src="https://polybox.ethz.ch/index.php/s/nZ83Jb7mrr0uhi8/download" width=400/>
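
The notation change described in this task can be sketched as a small conversion function. It assumes the clock lists an entry for every server and that servers are identified by their index; the name `to_compact` is our own.

```python
from typing import List, Tuple


def to_compact(value: str, clock: List[Tuple[int, int]], writer: int) -> str:
    """Convert Dynamo-style (server index, counter) pairs to the compact notation.

    The writing server goes before the colon, and the clock keeps only the
    counters, ordered by server index.
    """
    counters = [counter for _, counter in sorted(clock)]
    return f"S_{writer} : {value} ([{','.join(map(str, counters))}])"


# The example from the text: aa ([S_0,0],[S_1,4]) written by S_1
print(to_compact("aa", [(0, 0), (1, 4)], writer=1))  # S_1 : aa ([0,4])
```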

## Task 4

When a get request with some key comes in to Amazon Dynamo:
  - The coordinator node (the top node on the preference list for this key) handles the request.
  - The coordinator node requests the versions of the value associated with the key from itself and the next N-1 healthy nodes on the preference list. It receives a set of versions modelled as __value (vector clock)__ pairs such as a ([1, 3, 2]).
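
If you want to experiment with the version lists in the tasks below, the __value (vector clock)__ notation can be turned into Python data first. The following is a minimal parsing sketch; `parse_version` is a name of our own, and the accepted format is assumed to be exactly the one used in this exercise.

```python
import re
from typing import Tuple

# A value, then "([", comma-separated integers, then "])"
_VERSION = re.compile(r"\s*(\S+)\s*\(\[\s*([\d,\s]+)\]\)\s*")


def parse_version(text: str) -> Tuple[str, Tuple[int, ...]]:
    """Parse a string like 'a ([1, 3, 2])' into ('a', (1, 3, 2))."""
    match = _VERSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a 'value ([clock])' pair: {text!r}")
    value, counters = match.groups()
    return value, tuple(int(part) for part in counters.split(","))


print(parse_version("a ([1, 3, 2])"))  # ('a', (1, 3, 2))
```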

### Task 4.1
Given the following list of versions, draw the version DAG that the coordinator node will build for returning available versions.

1 ([0,0,1])  
1 ([0,1,1])  
2 ([1,1,1])  
3 ([0,2,1])  
10 ([1,3,1])

### Task 4.2
Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.


 a ([1,0,0])  
 b ([0,1,0])  
 c ([2,1,0])   
 d ([2,1,1])   
 e ([3,1,1])  
 f ([2,2,1])   
 g ([3,1,2])   
 h ([3,2,3])  
 i ([4,2,2])   
 j ([5,2,2])  
 k ([4,3,3])  
 l ([5,2,3])  
 m ([5,4,3])  
 n ([6,3,3])  
 o ([6,4,4])  

### Task 4.3
Given the following list of versions, draw the version DAG that the coordinator node will build for returning the correct version.

a ([1,0,0,0])  
b ([0,0,0,1])  
aa ([0,0,1,0])  
bb ([0,1,0,0])  
c ([1,2,0,1])  
cc ([0,1,1,2])  
d ([1,3,0,1])  
f ([1,2,1,3])  
e ([2,1,1,2])  
g ([2,2,2,3])  

## Task 5 (Optional)

Consider a cluster of servers where $S_j$ denotes the $j$-th node. The following table shows the execution of a series of get/put operations. Each row of the table represents the events that happen at the same time. For example, at time 0 (`t0`), servers $S_1$ and $S_3$ perform operations. When reading an object we are provided with a context, and when writing an object we must provide one. The context is the vector clock; it tells the routines which version of the object they are dealing with and what the new, updated version of the context will be.

For the `get` and `put` routines, we have the following signatures:

* `get(key)` $\rightarrow$ `[val_1, val_2, ...]`, $C_{key}$ `(` $VC$ `(key))` 
  * Example: `get("foo")` $\rightarrow$ `[bar_1, bar_2]`, $C_{2}$ `([1, 0, 1, 0]) # We assume the existence of 4 nodes` 

* `put(key, context, val)` $\rightarrow$ `None`
  * Example: `put("foo",` $C_2$`, "bar")`  

Note that the $C_{key}$ elements are just notation, and are meant to highlight that the context gets passed around between the `get` and `put` routines in a real API. 

Complete the missing `[list_values],` $C_{key}$ `([vector_clock])` tuples for the calls below.

<table>
  <tr><th></th><th>S0</th><th>S1</th><th>S2</th><th>S3</th><th>S4</th></tr>
  <tr>
    <td>t0</td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{1}$(_______________)<br>Put(1, _____, "a")</td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{2}$(_______________)<br>Put(1, _____, "bb")</td>
    <td></td>
  </tr>
  <tr>
    <td>t2</td>
    <td>Get(1)$\rightarrow$ _______________, $C_4$(_______________)<br>Put(1, _____, "rr")</td>
    <td>Get(1)$\rightarrow$ _______________, $C_5$(_______________)<br>Put(1, _____, "dd")</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>t4</td>
    <td></td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_9$(_______________)<br>Put(1, _____, "ccc")</td>
    <td>Get(1)$\rightarrow$ _______________, $C_{10}$(_______________)<br>Put(1, _____, "dd")</td>
    <td></td>
  </tr>
  <tr>
    <td>t5</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td>Get(1)$\rightarrow$ _______________, $C_{11}$(_______________)<br>Put(1, _____, "fff")</td>
  </tr>
</table>

The DAG below shows the interaction among nodes when retrieving values. You can use it for determining the expected values.

<img src="https://polybox.ethz.ch/index.php/s/YoWi7QK2DcMHJFe/download" width=400/>

# Exercise 5. Merkle Trees
A hash tree or Merkle tree is a binary tree in which every leaf node gets as its label a data block and every non-leaf node is labelled with the cryptographic hash of the labels of its child nodes. 

Some key-value stores use Merkle trees for efficiently detecting inconsistencies in data between replicas. 

This works by first exchanging the root hash: each replica compares the hash it receives with its own. If the hashes match, the replicas are synchronised. If they do not match, the children of the node (in the Merkle tree) are retrieved and their hashes are compared. This process continues until the inconsistent leaf (or leaves) are identified. 
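
The comparison procedure described above can be sketched as follows. This is an illustration under several assumptions of our own: SHA-256 as the hash function, child labels combined by concatenation, both trees having the same shape, and a number of data blocks that is a power of two. The names `Node`, `build_tree` and `differing_leaves` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    label: bytes                    # data block (leaf) or hash of the children's labels
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(blocks: List[bytes]) -> Node:
    """Build a Merkle tree bottom-up over a power-of-two number of data blocks."""
    n = len(blocks)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("the number of blocks must be a power of two")
    level = [Node(block) for block in blocks]
    while len(level) > 1:
        # Pair up neighbours; the parent is labelled with the hash of both labels
        level = [
            Node(hashlib.sha256(l.label + r.label).digest(), l, r)
            for l, r in zip(level[0::2], level[1::2])
        ]
    return level[0]


def differing_leaves(a: Node, b: Node, lo: int, hi: int) -> List[int]:
    """Return the indices in [lo, hi) of the leaves that differ between a and b."""
    if a.label == b.label:
        return []                   # identical subtrees: nothing below needs comparing
    if a.left is None:
        return [lo]                 # reached an inconsistent leaf
    mid = (lo + hi) // 2
    return (differing_leaves(a.left, b.left, lo, mid)
            + differing_leaves(a.right, b.right, mid, hi))


replica_1 = build_tree([b"a", b"b", b"c", b"d"])
replica_2 = build_tree([b"a", b"x", b"c", b"d"])
print(differing_leaves(replica_1, replica_2, 0, 4))  # [1]
```

Because matching subtrees are skipped entirely, only the nodes along the paths to the inconsistent leaves have to be exchanged.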

## Task 1
Each of the two pictures below depicts a pair of Merkle trees, one per replica. Both trees in a pair should represent the same object.

For each pair, specify whether it is a possible configuration and, if applicable, which nodes have to be exchanged in order to sync the trees.  

<img src="https://polybox.ethz.ch/index.php/s/vqj7AOAozZKEO3N/download" width=800/>

Repeat the exercise for the following pair of Merkle Trees.

<img src="https://polybox.ethz.ch/index.php/s/TwFd3KDxTrqq2B1/download" width=800/>

# Exercise 6. Virtual nodes

Virtual nodes were introduced to avoid assigning data in an unbalanced manner and to cope with hardware heterogeneity by taking into consideration the physical capacity of each server.

Let us assume we have ten servers ($i_1$ to $i_{10}$) with the following amounts of main memory: `8GB, 16GB, 32GB, 8GB, 16GB, 0.5TB, 1TB, 0.25TB, 10GB, 20GB`. Calculate the number of virtual nodes/tokens each server should get according to its main memory capacity if we want to have a total of `256` virtual nodes/tokens.

For the purpose of this exercise, if you get a fractional number of virtual nodes, always round up, even if the total number of nodes then exceeds `256`.
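
The proportional allocation with rounding up can be sketched as below. To avoid giving away the solution, the usage lines use small made-up capacities rather than the ones from the exercise; `tokens_per_server` is a name of our own, and capacities are assumed to be integers in a common unit.

```python
from typing import List


def tokens_per_server(capacities: List[int], total_tokens: int) -> List[int]:
    """Give each server a share of total_tokens proportional to its capacity, rounded up."""
    total_capacity = sum(capacities)
    # -(-x // y) is integer ceiling division, which avoids floating-point rounding
    return [-(-total_tokens * c // total_capacity) for c in capacities]


# Hypothetical clusters, with capacities in GB
print(tokens_per_server([1, 1, 2], 8))  # [2, 2, 4]
print(tokens_per_server([1, 1, 1], 4))  # [2, 2, 2]
```

In the second call the shares are fractional (4/3 each), so rounding up yields 6 tokens in total, more than the 4 requested, which is the effect the rounding rule above allows for.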