# <center>Big Data for Engineers &ndash; Exercises &ndash; Solution</center>
## <center>Spring 2022 &ndash; Week 2 &ndash; ETH Zurich</center>

### Azure Lab Handout

For the Azure Blob Storage component of this exercise, you will need to navigate to the [Azure Education Portal](https://aka.ms/startedu) and accept the lab handout by clicking 'Setup Lab' as seen below. If you have not yet enrolled in the Azure classroom, you can find instructions to do so on the right sidebar of the BDFE Moodle homepage.

**We recommend that you do this as early as possible, as the lab handout can sometimes take a while to create!**

<img src="images/azure1.png" width=800/>

## Exercise 1: Storage devices

In this exercise, we want to understand the differences between [SSD](https://en.wikipedia.org/wiki/Solid-state_drive), [HDD](https://en.wikipedia.org/wiki/Hard_disk_drive), and [SDRAM](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory) in terms of __capacity__, __speed__ and __price__. 

### Task 1
Fill in the table below by visiting your local online hardware store and choosing, for each type, the storage device with the largest available capacity; among those, prefer the one with the best read/write speed.
For instance, you can visit Digitec.ch to explore the prices on [SSDs](https://www.digitec.ch/en/s1/producttype/ssd-545?tagIds=76), [HDDs](https://www.digitec.ch/en/s1/producttype/hard-drives-36?tagIds=76), and 
[SDRAMs](https://www.digitec.ch/en/s1/producttype/memory-2?tagIds=76). 
You are free to use any other website for filling the table. 

| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |                      |                |                  |                   |&nbsp;|
| SSD            |                      |                |                  |                   |&nbsp;|
| DRAM           |                      |                |                  |                   |&nbsp;|


### Task 2
Answer the following questions:
1. Which type of storage device above is the cheapest per GB?
2. Which type of storage device above is the fastest in terms of read speed?

### Solution
Looking at digitec.ch, we complete the table as follows:

| Storage Device | Maximum capacity, GB | Price, CHF/GB  | Read speed, GB/s | Write speed, GB/s | Link |
| --------------:| --------------------:| --------------:|-----------------:|------------------:|------|
| HDD            |        20000 (20 TB) |         0.0297 |             0.29 |              0.29 |[Link](https://www.digitec.ch/en/s1/product/seagate-ironwolf-pro-20-tb-35-cmr-hard-drives-17728311?supplier=406802)|
| SSD            |      30720 (30.7 TB) |          0.309 |              2.1 |               1.7 |[Link](https://www.digitec.ch/en/s1/product/samsung-enterprise-pm1643-30720-gb-25-ssd-10110860?supplier=406802)|
| DRAM           |                  128 |           8.57 |             ~60* |              ~48* |[Link](https://www.digitec.ch/en/s1/product/kingston-ddr4-3200mhz-lrdimm-quad-rank-module-1-x-128gb-lr-dimm-memory-17550606?supplier=406802)|

*RAM speeds are usually specified in MT/s (megatransfers per second) rather than GB/s. Actual transfer rates in GB/s also depend on the CPU and motherboard, so they are usually measured empirically rather than taken from a specification.

1. HDDs are the cheapest of the listed storage devices (lowest price per GB).
2. DRAM is the fastest of the listed storage devices in terms of read speed.
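
The comparison can also be done programmatically. The following is a minimal sketch that copies the figures from the solution table above (a snapshot of digitec.ch prices, so the numbers will drift over time). The helper names `total_price_chf` and `full_read_hours` are made up for this illustration.

```python
# Figures copied from the solution table above (price and speed snapshot).
devices = {
    "HDD":  {"capacity_gb": 20000, "chf_per_gb": 0.0297, "read_gb_s": 0.29},
    "SSD":  {"capacity_gb": 30720, "chf_per_gb": 0.309,  "read_gb_s": 2.1},
    "DRAM": {"capacity_gb": 128,   "chf_per_gb": 8.57,   "read_gb_s": 60.0},
}


def total_price_chf(device):
    """Price of the whole device, derived from capacity and price per GB."""
    return device["capacity_gb"] * device["chf_per_gb"]


def full_read_hours(device):
    """Time needed to read the full capacity once at the listed read speed."""
    return device["capacity_gb"] / device["read_gb_s"] / 3600


# Cheapest per GB and fastest in terms of read speed.
cheapest = min(devices, key=lambda name: devices[name]["chf_per_gb"])
fastest = max(devices, key=lambda name: devices[name]["read_gb_s"])

for name, device in devices.items():
    print(f"{name}: {total_price_chf(device):.0f} CHF in total, "
          f"{full_read_hours(device):.2f} h to read the full device")
print("Cheapest per GB:", cheapest)
print("Fastest read:", fastest)
```

Reading the full capacity once is a useful sanity check: a large but slow device can take many hours to scan, which is one reason why capacity and throughput need to be considered together.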

## Exercise 2: HTTP

HTTP is the underlying protocol used by the World Wide Web. It defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands.

The HTTP protocol is based on requests (usually made by a client, e.g., your web browser) and responses (usually made by a server hosting a particular website or application).

When we visit a website, our browser makes many HTTP requests to retrieve all the resources needed to render the webpage! If you're curious, you can try keeping the 'Network' tab of your browser's 'Developer Tools' open while visiting several websites.

#### Example Request:
> <font color="#990000">POST</font> <font color="blue">/path/index.html</font> <font color="#bf9000">HTTP/1.1</font>  
<font color="green">Host: www.example.com  
User-Agent: Mozilla/4.0</font><br>
BookId=3131&Author=Asimov

<font color="#990000">This fragment indicates the HTTP method used</font><br>
<font color="blue">This indicates the relative path of the resource</font><br>
<font color="#bf9000">This is the HTTP version</font><br>
<font color="green">These are the headers of the request (1 per line)</font><br>
<font>This is the body of the request</font>
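
To make the structure of a request concrete, here is a minimal sketch that splits the example request above into its parts using only the Python standard library. It assumes that lines are separated by `\r\n` and that an empty line separates the headers from the body, as HTTP/1.1 requires. The function name `parse_http_request` is hypothetical, and treating the body as form-encoded data is an assumption, since the example does not state a `Content-Type`.

```python
from urllib.parse import parse_qs


def parse_http_request(raw):
    """Split a raw HTTP/1.1 request into method, path, version, headers, body."""
    # An empty line (two consecutive CRLFs) separates the headers from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")

    # The first line is the request line: METHOD PATH VERSION
    method, path, version = lines[0].split(" ", 2)

    # Every following line is one header of the form "Name: value".
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    return {"method": method, "path": path, "version": version,
            "headers": headers, "body": body}


raw_request = (
    "POST /path/index.html HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "User-Agent: Mozilla/4.0\r\n"
    "\r\n"
    "BookId=3131&Author=Asimov"
)

parsed = parse_http_request(raw_request)
print(parsed["method"], parsed["path"], parsed["version"])
print(parsed["headers"])

# Assumption: the body is form-encoded (key=value pairs joined by '&').
form = parse_qs(parsed["body"])
print(form)
```

This is only meant to illustrate the message layout; a real server would rely on a proper HTTP library rather than manual string splitting.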


#### Example Response
> <font color="#bf9000">HTTP/1.1</font> <font color="#990000">200 OK</font>  
<font color="green">Date: Tue, 25 Sep 2018 09:48:34 GMT  
Content-Type: text/html; charset=UTF-8  
Content-Length: 138  
</font>
&lt;html> &lt;head> &lt;title>An Example Page&lt;/title>
&lt;/head> &lt;body> Hello World, this is a very simple
HTML document. &lt;/body> &lt;/html>

<font color="#990000">This fragment indicates the status code of the response</font><br>
<font color="#bf9000">This is the HTTP version</font><br>
<font color="green">These are the headers of the response (1 per line)</font><br>
<font>This is the body of the response</font>

### HTTP Methods

Consider a well-designed object storage service providing a REST API implemented over HTTP.

1. Which HTTP method allows the retrieval of objects on the server? What do we mean by saying that it should be side-effect free?  
2. Which HTTP methods allow the insertion and deletion of objects from the server, respectively?
3. Which other generic method allows for sending information and/or receiving results?

### Solution

1. GET. Side-effect free means that the request must not modify any state on the server.
2. PUT, DELETE
3. POST
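
The mapping between HTTP methods and storage operations can be sketched with a toy, in-memory object store. This is an illustration only, not how any real service is implemented: the class `ToyObjectStore` and its `handle` method are hypothetical, and the status codes are a plausible choice rather than the ones a particular service returns.

```python
class ToyObjectStore:
    """In-memory stand-in for an object storage service with a REST-like API."""

    def __init__(self):
        self._objects = {}

    def handle(self, method, key, body=None):
        """Dispatch on the HTTP method and return (status_code, response_body)."""
        if method == "GET":
            # Retrieval only reads state, so it is side-effect free.
            if key in self._objects:
                return 200, self._objects[key]
            return 404, None
        if method == "PUT":
            # Insert a new object or overwrite an existing one.
            created = key not in self._objects
            self._objects[key] = body
            return (201 if created else 200), None
        if method == "DELETE":
            if key not in self._objects:
                return 404, None
            del self._objects[key]
            return 204, None
        # Any other method is not supported by this toy store.
        return 405, None


store = ToyObjectStore()
print(store.handle("PUT", "picture", b"some bytes"))
print(store.handle("GET", "picture"))
print(store.handle("GET", "picture"))   # repeating a GET changes nothing
print(store.handle("DELETE", "picture"))
print(store.handle("GET", "picture"))
```

Note how repeating the GET request yields the same answer, while PUT and DELETE change what later requests observe.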

## Exercise 3: Storage

### Object Storage, Scalability

1. What are the four most important traits of Object Storage that allow scalability?
2. What are the two ways through which you can scale beyond one machine?

#### Solution
1. Black box objects, key-value model, flexible metadata, commodity hardware.
2. Horizontal scalability (more nodes); vertical scalability (more powerful nodes)
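
Horizontal scalability works well with the key-value model because each key can be assigned to a node independently of all other keys. The sketch below shows one simple way to do this, hashing the key and taking the result modulo the number of nodes. It is an illustration under stated assumptions: the node names are made up, and production systems typically use more elaborate schemes (for example consistent hashing) so that adding a node does not reassign most keys.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick the node responsible for a key by hashing the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer and map it to a node.
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical node names, for illustration only.
nodes = ["node-0", "node-1", "node-2"]

for key in ["picture", "logs/2022-03-01", "video.mp4"]:
    print(key, "->", node_for_key(key, nodes))
```

Because the placement depends only on the key, any client can compute where an object lives without consulting a central directory.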

### Azure Blob Storage vs Amazon S3

For each question, give the answer for both Azure Blob Storage and Amazon S3.

1. How are objects identified?
2. What kind of objects can you create?

#### Solution
* Azure
    1. Account ID + Container ID + Blob ID
    2. 3 types of blobs: BlockBlob, PageBlob, AppendBlob
* S3
    1. Bucket ID + Object ID
    2. Blackbox objects
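
These identifiers show up directly in the URL of an object. The following sketch builds such URLs; the Azure form matches the endpoint used later in this exercise, and the S3 form is the virtual-hosted style, which additionally includes the bucket's region in the host name. The function names and the example values are hypothetical.

```python
from urllib.parse import quote


def azure_blob_url(account, container, blob):
    """Account ID + Container ID + Blob ID."""
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob)}"


def s3_object_url(bucket, region, key):
    """Bucket ID + Object ID (virtual-hosted style, region in the host name)."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# Example values are placeholders, not real resources.
print(azure_blob_url("myaccount", "mycontainer", "picture"))
print(s3_object_url("mybucket", "eu-central-1", "photos/cat.jpg"))
```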

## Exercise 4: Setting up an Azure storage account

In this section you'll learn how to set up a Locally Redundant Storage instance.


### Step 1: Create a Locally-redundant Storage

1. First, you need to create a storage account. In the Azure portal, click on the option "Storage accounts" in the left hand side menu. 

<img src="images/azure2.png" width=800/>

2. Click on the "Add" button at the top of the page. 

<img src="images/azure3.png" width=800/>

3. At this point, you will arrive at a form titled 'Create a storage account'. Fill in the form in the following manner:

Under Project details:
- Select *Week 2: Object Storage* as *Subscription* 
- If not already present, create a new resource group called *exercise02*
      
<img src="images/azure4.png" width=500/>

Under Instance details:
- The *Storage Account Name* can be whatever you want, but it needs to be unique. Oftentimes, including your ETH username will suffice.
- Select *Locally-redundant storage (LRS)* as *Redundancy* mode
- For the *Region* you can select *Switzerland North*
- Leave all other values unchanged 
    
Afterwards, your form should look like this:

<img src="images/azure5.png" width=500/>

4. Click *Review + create* then *Create* on the next page (deployment might take a few minutes).

5. Go to the resource page of the newly created LRS by clicking 'Go to resource'

<img src="images/azure6.png" width=800/>

6. In the left-hand menu, under the *Settings* group, select the *Access Keys* tab.

<img src="images/azure7.png" width=800/>

7. Copy one of the access keys (it doesn't matter which key). You may have to click the 'Show keys' button first.

8. Paste the *Storage Account Name* in `ACCOUNT_NAME`, the access key in `ACCOUNT_KEY`, and add an arbitrary string in `CONTAINER_NAME` (or leave it as default).

In [None]:
ACCOUNT_NAME   = ''
ACCOUNT_KEY    = ''
CONTAINER_NAME = 'exercise02'
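
Azure restricts which names are accepted: storage account names must be 3 to 24 characters long and consist of lowercase letters and digits only, while container names must be 3 to 63 characters long and may contain lowercase letters, digits, and single hyphens between them. The sketch below checks these rules locally before any request is sent. The helper names are hypothetical, and the checks cover only the syntax, not whether the name is still available.

```python
import re

# 3-24 characters, lowercase letters and digits only.
ACCOUNT_NAME_RE = re.compile(r"[a-z0-9]{3,24}")
# 3-63 characters; lowercase letters and digits, separated by single hyphens.
CONTAINER_NAME_RE = re.compile(r"(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_account_name(name):
    return ACCOUNT_NAME_RE.fullmatch(name) is not None


def is_valid_container_name(name):
    return CONTAINER_NAME_RE.fullmatch(name) is not None


print(is_valid_account_name("exercise02dcutting"))
print(is_valid_account_name("My-Account"))
print(is_valid_container_name("exercise02"))
print(is_valid_container_name("Exercise02"))
```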


###  Step 2: Installing and Importing the Azure Storage Library

In [None]:
!pip install azure-storage==0.33.0

In [None]:
from azure.storage.blob import BlockBlobService
from azure.storage.blob import PageBlobService
from azure.storage.blob import AppendBlobService
from azure.storage.blob import PublicAccess
from azure.storage.models import LocationMode
from azure.storage.blob import ContentSettings
from azure.storage.blob import BlockListType
from azure.storage.blob import BlobBlock
from timeit import default_timer as timer
import uuid
import random

# Function for generating unique names for blobs
def get_blob_name():
    return '{}{}'.format('blob', str(uuid.uuid4()).replace('-', ''))

## Exercise 5: Azure Blob Storage Features

### Step 1: Explore Concepts of Azure Blob Storage

1. A container provides a grouping of a set of blobs. All blobs must be in a container. An account can contain an unlimited number of containers, and a container can store an unlimited number of blobs. Note that the container name must be lowercase.

![Image of blob](https://docs.microsoft.com/en-us/azure/includes/media/storage-blob-concepts-include/blob1.png)

2. Let us look at the different types of blobs available in Azure Blob storage by reading the article at the following [link](https://docs.microsoft.com/en-us/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs). After you are done, inspect the table below, and determine which type of blob is suitable for each of the use cases. 

|                | Block Blob | Append Blob  | Page Blob |
| --------------:| -----------| ------------:| ---------:|
| Static content delivery             |            |              |           |
| As a disk for a VirtualMachine       |            |              |           |
| Streaming video                      |            |              |           |
| Log Files                     |            |             |          |
| Social network events (e.g., uploading photos to Instagram)          |            |              |           |

__Solution:__

|                | Block Blob | Append Blob  | Page Blob |
| --------------:| -----------| ------------:| ---------:|
| Static content delivery              |     X      |             |           |
| As a disk for a VirtualMachine       |            |             |     X     |
| Streaming video                      |     X      |             |           |
| Log Files                     |            |       X      |           |
| Social network events (e.g., uploading photos to Instagram) |     X      |              |           |

__Block blobs__ let you upload large blobs efficiently. Block blobs are comprised of blocks, each of which is identified by a block ID. You create or modify a block blob by writing a set of blocks and committing them by their block IDs. Each block can have a different size, up to a maximum of 4000 MiB, and a block blob can include up to 50,000 blocks. The maximum size of a block blob is therefore slightly more than 190 TiB (4000 MiB X 50,000 blocks). 

__Append blobs__ are comprised of blocks and are optimized for append operations. When you modify an append blob, blocks are added only to the end of the blob, via the `append_block` operation. Updating or deleting existing blocks is not supported. Unlike a block blob, an append blob does not expose its block IDs.

Each block in an append blob can have a different size, up to a maximum of 4 MiB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GiB (4 MiB X 50,000 blocks).

__Page blobs__ are a collection of 512-byte pages optimized for random read and write operations. To create a page blob, you initialize the page blob and specify the maximum size the page blob will grow to. To add or update the contents of a page blob, you write a page or a set of pages by specifying an offset and a range that align to 512-byte page boundaries. A write to a page blob can overwrite just one page, some pages, or up to 4 MiB of the page blob. Writes to page blobs happen in-place and are immediately committed to the blob. The maximum size for a page blob is 8 TiB.
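
The size limits quoted above follow from simple arithmetic, which the sketch below reproduces. It also includes a small check for the 512-byte alignment that page blob writes require; the helper name `is_page_aligned` is made up for this illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

# Block blob: up to 50,000 blocks of at most 4000 MiB each.
max_block_blob_bytes = 4000 * MIB * 50_000
# Append blob: up to 50,000 blocks of at most 4 MiB each.
max_append_blob_bytes = 4 * MIB * 50_000

print(f"Max block blob size:  {max_block_blob_bytes / TIB:.2f} TiB")
print(f"Max append blob size: {max_append_blob_bytes / GIB:.2f} GiB")

PAGE_SIZE = 512


def is_page_aligned(offset, length):
    """A page blob write must start and end on 512-byte page boundaries."""
    return offset % PAGE_SIZE == 0 and length > 0 and length % PAGE_SIZE == 0


print(is_page_aligned(1024, 512))   # starts at page 2, covers one page
print(is_page_aligned(100, 512))    # offset is not on a page boundary
```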

### Step 2: Test Your First Container

1. Create a new container under the specified account. If a container with the same name already exists, the operation fails and returns `False`.

In [None]:
# Choose whether to have public access for this container
public_access = False
access_type = PublicAccess.Container if public_access else None

# Create the container
block_blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
status = block_blob_service.create_container(CONTAINER_NAME, public_access=access_type)

# Write a console message indicating if the container was successfully created
if status:
	print(f"Container {CONTAINER_NAME} created")
else:
	print(f"Container {CONTAINER_NAME} already exists")

2. Download a file to the local filesystem. Don't open it yet!

In [None]:
!wget https://cloud.inf.ethz.ch/s/skt6Ka4fYEJXcga/download/topsecret.jpg -O topsecret.jpg

3. Upload the file to Azure Blob storage. Note that the name of the file on local machine (`local_file`) can differ from the name of the blob (`blob_name`).

In [None]:
# Define the local and remote file names
local_file = "topsecret.jpg"
blob_name = "picture"

# Create a blob which contains the downloaded image
try:
  block_blob_service.create_blob_from_path(
    CONTAINER_NAME,
    blob_name,
    local_file,
    # 'image/jpeg' is the registered MIME type for JPEG images
    content_settings=ContentSettings(content_type='image/jpeg')
  )
  print("Blob URL:", block_blob_service.make_blob_url(CONTAINER_NAME, blob_name))
except Exception as e:
  # Report the underlying error instead of hiding it
  print("Could not create the blob:", e)
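
The content type stored with a blob tells browsers how to interpret the data. Instead of typing it by hand, it can be derived from the local file name. The sketch below uses Python's standard `mimetypes` module for this; the helper name `guess_content_type` and the fallback value are choices made for this illustration.

```python
import mimetypes


def guess_content_type(path, default="application/octet-stream"):
    """Derive a MIME type from the file extension, with a generic fallback."""
    content_type, _encoding = mimetypes.guess_type(path)
    return content_type or default


print(guess_content_type("topsecret.jpg"))
# A name without an extension (like our blob name) gives no hint,
# so the generic fallback is returned.
print(guess_content_type("picture"))
```

The result could then be passed as `content_type` when constructing `ContentSettings`.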

4. Try to open the link above (you should get an error).

By default, the new container is private, so you must specify your storage access key (as you did earlier) to download blobs from this container. To make the blobs in the container available to everyone, set the container's public access level using the following code.

In [None]:
# Give your container public access
block_blob_service.set_container_acl(CONTAINER_NAME, public_access=PublicAccess.Container)

After this change, anyone on the Internet can see blobs in a public container, but only you can modify or delete them. 

Try to open the link again. Note that it may take a few seconds for the access permissions to change.

5. List all blobs in the container

In order to list the blobs in a container, use the `list_blobs` method. This method returns a generator which can be iterated over in a loop. The following code outputs the name, type, size and url of each blob in a container.

In [None]:
# List all blobs in the container
blobs = block_blob_service.list_blobs(CONTAINER_NAME)
for blob in blobs:
  try:
    print(f"Name: {blob.name}") 
    print(f" > Type: {blob.properties.blob_type}") 
    print(f" > Size: {blob.properties.content_length}") 
    print(f" > URL:  {block_blob_service.make_blob_url(CONTAINER_NAME,blob.name)}")
  except Exception as e:
    print(f"Something went wrong: {e}")

6. Download blobs

In order to download data from a blob, use `get_blob_to_path`, `get_blob_to_stream`, `get_blob_to_bytes`, or `get_blob_to_text`. They are high-level methods that perform the necessary chunking when the size of the data exceeds 64 MB.

Note: The name of the file after downloading can differ from the name of the blob.

The following example uses `get_blob_to_path` to download the content of your container and store it with names of the form `file_i`, where `i` is a sequential index.

In [None]:
# Specify the local path where the contents will be downloaded
LOCAL_PATH = "."   

# Iterate through the blobs and download them
blobs = block_blob_service.list_blobs(CONTAINER_NAME)
for i, blob in enumerate(blobs):
  local_file = f"{LOCAL_PATH}/file_{i}"
  try:
    block_blob_service.get_blob_to_path(CONTAINER_NAME, blob.name, local_file)
    print(f"Successfully downloaded {local_file}")
  except Exception as e:
    print(f"Something went wrong while downloading blob {blob.name}: {e}")

Since the downloaded file does not have an extension, Jupyter will not know how to interpret it. The cell below adds the `.jpg` extension to the name. After this, try opening the file and see what you get.

In [None]:
!mv file_0 file_0.jpg

Now open the file explorer in your Jupyter sidebar and open `file_0.jpg`. You may need to refresh first.
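
Renaming the file to `.jpg` assumes that we already know what it contains. When that is not the case, the first few bytes of a file often reveal its format. The sketch below checks for the well-known JPEG and PNG signatures; the helper name `guess_image_extension` is hypothetical and only these two formats are covered.

```python
import os

JPEG_MAGIC = b"\xff\xd8\xff"          # every JPEG file starts with these bytes
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"      # the 8-byte PNG signature


def guess_image_extension(data):
    """Return a file extension based on the leading bytes, or None if unknown."""
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    if data.startswith(PNG_MAGIC):
        return ".png"
    return None


# Inspect a downloaded file if it is present in the working directory.
for candidate in ("file_0", "file_0.jpg"):
    if os.path.exists(candidate):
        with open(candidate, "rb") as f:
            print(candidate, "->", guess_image_extension(f.read(8)))
```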

### Step 3: Using the REST API

REpresentational State Transfer (__REST__), or __RESTful__, web services provide interoperability between computer systems on the Internet. REST-compliant web services allow the requesting systems to access and manipulate textual representations of web resources by using a uniform and predefined set of **stateless** operations.

The most popular operations in REST are GET, POST, PUT, and DELETE. A response may be in XML, HTML, JSON, or some other format.

You can find the Azure Blob Service API description at the following [link](https://docs.microsoft.com/en-us/rest/api/storageservices/blob-service-rest-api), and the HTTP status codes defined in RFC 2616 (hosted by the W3C) [here](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). Navigating the API documentation to find the GET request we need is part of the exercise.

We could use tools like [Postman](https://www.getpostman.com/), curl, or others to make REST requests to Azure Storage. In this tutorial we will use [Reqbin](https://reqbin.com/), a simple and effective website for sending HTTP requests online.
<br>

### Tasks

Complete the tasks below:

1. List all blobs in your container using any tool of your choice. For this, use the following [REST request](https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs). To avoid setting up authentication, you may need to make your container public by changing its access policy to *Container*. See the pictures below:

<img src="images/azure8.png" width=600/>

<img src="images/azure9.png" width=600/>

<img src="images/azure10.png" width=600/>

Alternatively, you can also use the cell below to do this without accessing the Azure UI:

In [None]:
block_blob_service.set_container_acl(CONTAINER_NAME, public_access=PublicAccess.Container)

2. Explain why the request above does not include a **body** part.
3. What is the response format of the request? 

__Solutions:__

1. Listing all blobs in a container:

__Solution 1:__ Using Reqbin.

Use the URL format given [here](https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs), replacing `myaccount` with your storage account name and `mycontainer` with the container we set up in this exercise. <br>

https://myaccount.blob.core.windows.net/mycontainer?restype=container&comp=list

__Solution 2:__ Using `wget`.

As specified above, make sure to replace `myaccount` and `mycontainer` with the values specific to your setup.

In [None]:
!wget -qO- "https://exercise02dcutting.blob.core.windows.net/exercise02?restype=container&comp=list"

2. We are performing a **GET** request, which does not have a body.
3. The format of the response is **XML**.
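
To tie the three answers together, the sketch below builds the List Blobs URL and parses an XML response with the Python standard library. The sample response is hand-written and heavily trimmed for illustration (a real response contains many more properties, and the size shown is made up); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def list_blobs_url(account, container):
    """Build the URL of the List Blobs request (a GET without a body)."""
    query = urlencode({"restype": "container", "comp": "list"})
    return f"https://{account}.blob.core.windows.net/{container}?{query}"


# Hand-written, trimmed sample of what a List Blobs response looks like.
SAMPLE_RESPONSE = """<EnumerationResults ContainerName="mycontainer">
  <Blobs>
    <Blob>
      <Name>picture</Name>
      <Properties>
        <Content-Length>12345</Content-Length>
        <BlobType>BlockBlob</BlobType>
      </Properties>
    </Blob>
  </Blobs>
  <NextMarker />
</EnumerationResults>"""


def parse_blob_list(xml_text):
    """Extract name, type, and size of every blob listed in the XML."""
    root = ET.fromstring(xml_text)
    blobs = []
    for blob in root.iter("Blob"):
        blobs.append({
            "name": blob.findtext("Name"),
            "type": blob.findtext("Properties/BlobType"),
            "size": int(blob.findtext("Properties/Content-Length")),
        })
    return blobs


print(list_blobs_url("myaccount", "mycontainer"))
for entry in parse_blob_list(SAMPLE_RESPONSE):
    print(entry)
```

All parameters of the request travel in the URL, which is why no body is needed, and the result comes back as an XML document that any XML parser can process.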

### Step 4: Delete your resources!
Having open storage accounts will gradually (albeit slowly) drain your assigned credits. When you are done with the exercise, please delete your storage account:

<img src="images/azure8.png" width=800/>