# Characteristics of Big Data

## Assignment 1

In this assignment, you will calculate the estimated sizes of big data sets and the latency involved in transmitting data. 

This notebook contains the skeleton necessary for you to complete the assignment.  Look for comments that include `# TODO:` for sections that you need to complete. This notebook also contains the functions `check_data_items` and `check_latency_items` that check that you completed the assignment correctly.  Before you submit the assignment, the notebook should run without any assertion errors. 

Warning: Do not change the names of the dataframes (i.e., `df1_1`, `df1_2`, `df1_3`) as the instructor uses these names when checking the assignments. 

In [2]:
# This code helps check assignment data

import pandas as pd
from collections import namedtuple
from dataclasses import dataclass

InformationUnit = namedtuple('InformationUnit', ['name', 'size'])
DataItem = namedtuple('DataItem', ['name', 'size', 'unit'])
LatencyItem = namedtuple('LatencyItem', ['name', 'time', 'unit', 'explanation'])

information_units = dict(
    B=InformationUnit("byte", 1),
    KB=InformationUnit("kilobyte", 1e3),
    MB=InformationUnit("megabyte", 1e6),
    GB=InformationUnit("gigabyte", 1e9),
    TB=InformationUnit("terabyte", 1e12),
    PB=InformationUnit("petabyte", 1e15),
    EB=InformationUnit("exabyte", 1e18),
    ZB=InformationUnit("zettabyte", 1e21),
    YB=InformationUnit("yottabyte", 1e24)
)

time_units = {
    "ms": "millisecond",
    "s": "second",
    "min": "minute"
}

def check_data_items(items):
    # Checks to see if data sizes and units are filled out correctly
    for item in items:
        assert item.size > 0, 'Size for "{}" should be greater than zero'.format(item.name)
        assert item.unit in information_units, 'Unit "{}" not in units dictionary'.format(item.unit)
        
def check_latency_items(items):
    # Checks to see if time sizes and units are filled out correctly
    for item in items:
        # assert item.time > 0, 'Time for "{}" should be greater than zero'.format(item.name)
        assert item.unit in time_units, 'Unit "{}" not in time units dictionary'.format(item.unit)
        assert item.explanation != "FILL IN THE EXPLANATION HERE", 'Fill in explanation for "{}"'.format(item.name)

### Assignment 1.1

Provide estimates for the size of various data items.  Please explain how you arrived at the estimates for the size of each item by citing references or providing calculations. 

* Assume all videos are 30 frames per second
* [HEVC](https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding) stands for High Efficiency Video Coding
* See the Wikipedia article on [display resolution](https://en.wikipedia.org/wiki/Display_resolution) for information on HD (1080p) and 4K UHD resolutions. 

| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| 128 character message                      | ? Bytes       |
| 1024x768 PNG image                         | ? MB          |
| 1024x768 RAW image                         | ? MB          | 
| HD (1080p) HEVC Video (15 minutes)         | ? MB          |
| HD (1080p) Uncompressed Video (15 minutes) | ? MB          |
| 4K UHD HEVC Video (15 minutes)             | ? MB          |
| 4k UHD Uncompressed Video (15 minutes)     | ? MB          |
| Human Genome (Uncompressed)                | ? GB          |
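
The uncompressed rows in the table reduce to simple arithmetic: pixels times bits per pixel for one frame, then times the number of frames for video. Encoded video is just bitrate times duration. The following is a minimal sketch under stated assumptions: decimal units (1 GB = 10^9 bytes, matching `information_units`), 24 bits per pixel for 8-bit video, and a 10 Mbps bitrate for encoded HD video. The helper names `raw_image_bytes`, `raw_video_bytes`, and `bitrate_bytes` are hypothetical and not part of the assignment skeleton.

```python
def raw_image_bytes(width, height, bits_per_pixel):
    """Size in bytes of one uncompressed frame."""
    return width * height * bits_per_pixel / 8


def raw_video_bytes(width, height, bits_per_pixel, fps, seconds):
    """Size in bytes of uncompressed video: one raw frame per displayed frame."""
    return raw_image_bytes(width, height, bits_per_pixel) * fps * seconds


def bitrate_bytes(megabits_per_second, seconds):
    """Size in bytes of a stream encoded at a constant bitrate."""
    return megabits_per_second * 1e6 / 8 * seconds


FIFTEEN_MINUTES = 15 * 60  # duration in seconds

# Assumption: 8-bit RGB, i.e. 24 bits per pixel, at 30 frames per second
hd_raw_gb = raw_video_bytes(1920, 1080, 24, 30, FIFTEEN_MINUTES) / 1e9

# Assumption: 10 Mbps is an illustrative bitrate, not a property of HEVC itself
hd_encoded_gb = bitrate_bytes(10, FIFTEEN_MINUTES) / 1e9

print(f"HD uncompressed: {hd_raw_gb:.2f} GB")
print(f"HD at 10 Mbps:   {hd_encoded_gb:.3f} GB")
```

Changing the resolution, bit depth, or bitrate arguments gives the corresponding 4K UHD and image estimates.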

In [3]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_1 = [
    DataItem('1 Byte', 1, 'B'),
    DataItem("128 character message", 512, "B"), 
    # Windows uses UTF-16 internally, so I use it as the basis for this estimate.
    # According to IBM (https://www.ibm.com/docs/en/db2-for-zos/11?topic=unicode-utfs), UTF-16 requires at most 4 bytes per character,
    # which gives an upper bound of 128 * 4 = 512 bytes per 128 character message.
    DataItem("1024x768 PNG image", 3, "MB"), 
    # A 1024x768 image has 786,432 pixels in total. Assuming 32 bits per pixel (https://en.wikipedia.org/wiki/Portable_Network_Graphics),
    # that is 25,165,824 bits (3,145,728 bytes), or roughly 3 MB per image.
    DataItem("1024x768 RAW image", 1.38, "MB"), 
    # Same pixel count as above, but RAW images tend to have 12-14 bits per pixel (https://en.wikipedia.org/wiki/Raw_image_format).
    # To err on the side of overestimating I use 14. This results in 11,010,048 bits (1,376,256 bytes), or ~1.38 MB per image.
    DataItem("HD (1080p) HEVC Video (15 minutes)", 1.13, "GB"), 
    # I don't know typical HEVC bitrates, so I use YouTube's recommended video bitrates where applicable (https://support.google.com/youtube/answer/1722171?hl=en#zippy=%2Cvideo-codec-h%2Cbitrate)
    # together with this calculator (https://toolstud.io/video/filesize.php).
    # For 1080p the recommended bitrate is 10 Mbps for a video at 30 fps: 10 Mbps * 900 s = 9,000 Mb, or ~1.13 GB.
    DataItem("HD (1080p) Uncompressed Video (15 minutes)", 167.96, "GB"),
    # Assuming 8-bit depth (24 bits per pixel): 1920 * 1080 pixels * 3 bytes * 30 fps * 900 s
    DataItem("4K UHD HEVC Video (15 minutes)", 5.85, "GB"),
    # For 4K the recommended bitrate is 44-56 Mbps for a video at 30 fps. I use 52 Mbps as an in-between value.
    DataItem("4k UHD Uncompressed Video (15 minutes)", 839.81, "GB"),
    # Assuming 10-bit depth (30 bits per pixel): 3840 * 2160 pixels * 30 bits / 8 * 30 fps * 900 s
    DataItem("Human Genome (Uncompressed)", 200, "GB"),
    # I wasn't able to find any recent concrete information, but according to this 2014 article (https://medium.com/precision-medicine/how-big-is-the-human-genome-e90caa3409b0)
    # it would be, on average, 200 GB right off of a genome sequencer.
]

# Checks if items properly updated
check_data_items(items1_1)

df1_1 = pd.DataFrame(items1_1)
df1_1.style.hide(axis="index")


name,size,unit
1 Byte,1.0,B
128 character message,512.0,B
1024x768 PNG image,3.0,MB
1024x768 RAW image,1.38,MB
HD (1080p) HEVC Video (15 minutes),1.13,GB
HD (1080p) Uncompressed Video (15 minutes),167.96,GB
4K UHD HEVC Video (15 minutes),5.85,GB
4k UHD Uncompressed Video (15 minutes),839.81,GB
Human Genome (Uncompressed),200.0,GB


### Assignment 1.2

Using the estimates for data sizes in the previous part, determine how much storage space you would need for the following items.

* [Twitter statistics](https://www.internetlivestats.com/twitter-statistics/) estimates 500 million tweets are sent each day. For simplicity, assume each tweet is 128 characters. 
* See the [Snappy Github repository](https://github.com/google/snappy) for estimates of Snappy's performance. 
* [Instagram statistics](https://www.omnicoreagency.com/instagram-statistics/) estimates over 100 million videos and photos are uploaded to Instagram every day.   Assume that 75% of those items are 1024x768 PNG photos.
* [YouTube statistics](https://www.omnicoreagency.com/youtube-statistics/) estimates 500 hours of video is uploaded to YouTube every minute.  For simplicity, assume all videos are HD quality encoded using HEVC at 30 frames per second. 


| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| Daily Twitter Tweets (Uncompressed)        | ??? TB        |                       
| Daily Twitter Tweets (Snappy Compressed)   | ??? PB        |                       
| Daily Instagram Photos                     | ??? GB        |                       
| Daily YouTube Videos                       | ??? TB        |                       
| Yearly Twitter Tweets (Uncompressed)       | ??? PB        |                       
| Yearly Twitter Tweets (Snappy Compressed)  | ??? PB        |                       
| Yearly Instagram Photos                    | ??? PB        |                       
| Yearly YouTube Videos                      | ??? PB        | 
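
Each row above is a count of items per day multiplied by a per-item size, with compressed rows divided by the compression ratio and yearly rows multiplied by 365. The sketch below illustrates that arithmetic. It assumes 512 bytes per tweet (an upper bound of 4 bytes per character), 3 MB per photo, and a 1.6x Snappy ratio for plain text. These inputs are assumptions for illustration, and `daily_bytes` and `yearly_bytes` are hypothetical helper names.

```python
DAYS_PER_YEAR = 365


def daily_bytes(items_per_day, bytes_per_item, compression_ratio=1.0):
    """Daily storage in bytes; a ratio of 1.6 means the data shrinks by 1.6x."""
    return items_per_day * bytes_per_item / compression_ratio


def yearly_bytes(daily):
    """Scale a daily storage figure to a 365-day year."""
    return daily * DAYS_PER_YEAR


# Assumption: 500 million tweets per day at 512 bytes each
tweets = daily_bytes(500e6, 512)
# Assumption: Snappy compresses plain text by about 1.6x
tweets_snappy = daily_bytes(500e6, 512, compression_ratio=1.6)
# Assumption: 75% of 100 million daily uploads are 3 MB photos
photos = daily_bytes(0.75 * 100e6, 3e6)

print(f"Daily tweets:          {tweets / 1e9:.1f} GB")
print(f"Daily tweets (Snappy): {tweets_snappy / 1e9:.1f} GB")
print(f"Daily photos:          {photos / 1e12:.1f} TB")
print(f"Yearly tweets:         {yearly_bytes(tweets) / 1e12:.2f} TB")
```

Note that compression divides the size, so the compressed figure is always smaller than the uncompressed one.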

In [4]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_2 = [
    DataItem("Daily Twitter Tweets (Uncompressed)", 256, "GB"), # 500,000,000 tweets * 512 bytes, converted to GB
    DataItem("Daily Twitter Tweets (Snappy Compressed)", 160, "GB"), # With an average compression ratio of 1.6x, divide the uncompressed size by 1.6
    DataItem("Daily Instagram Photos", 225, "TB"), # 75,000,000 photos * 3 MB, converted to TB
    DataItem("Daily YouTube Videos", 3.254, "PB"), # 500 hrs of video per minute * 1440 min/day is 720,000 hrs of footage per day, at ~4.52 GB/hr (the 15 minute estimate * 4), converted to PB
    DataItem("Yearly Twitter Tweets (Uncompressed)", 93.44, "TB"), # Daily Twitter tweets * 365
    DataItem("Yearly Twitter Tweets (Snappy Compressed)", 58.4, "TB"), # Daily Snappy tweets * 365
    DataItem("Yearly Instagram Photos", 82.12, "PB"), # Daily Instagram photos * 365
    DataItem("Yearly YouTube Videos", 1.188, "EB"), # Daily YouTube videos * 365
]

# Checks if items properly updated
check_data_items(items1_2)

df1_2 = pd.DataFrame(items1_2)
df1_2.style.hide(axis="index")


name,size,unit
Daily Twitter Tweets (Uncompressed),256.0,GB
Daily Twitter Tweets (Snappy Compressed),160.0,GB
Daily Instagram Photos,225.0,TB
Daily YouTube Videos,3.254,PB
Yearly Twitter Tweets (Uncompressed),93.44,TB
Yearly Twitter Tweets (Snappy Compressed),58.4,TB
Yearly Instagram Photos,82.12,PB
Yearly YouTube Videos,1.188,EB


### Assignment 1.3

Provide estimates of the one way latency for each of the following items.  Please explain how you arrived at the estimates for each item by citing references or providing calculations. 

|                           | One Way Latency      |
|:--------------------------|---------------------:|
| Los Angeles to Amsterdam  | ? ms                 |
| Low Earth Orbit Satellite | ? ms                 |
| Geostationary Satellite   | ? ms                 |
| Earth to the Moon         | ? ms                 |
| Earth to Mars             | ? min                | 
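
Ignoring routing and processing overhead, one way latency is propagation delay: distance divided by signal speed. The sketch below shows that calculation under stated assumptions: the speed of light rounded to 300,000 km/s, a velocity factor of 0.67 for optical fibre, and round illustrative distances. The helper name `one_way_latency_s` is hypothetical.

```python
C_KM_PER_S = 300_000  # speed of light in vacuum, rounded


def one_way_latency_s(distance_km, velocity_factor=1.0):
    """Propagation delay in seconds.

    velocity_factor is the fraction of c at which the signal travels:
    1.0 for radio in free space, roughly 0.67 for optical fibre (assumed).
    """
    return distance_km / (C_KM_PER_S * velocity_factor)


# Assumption: round distances chosen for illustration
moon_s = one_way_latency_s(400_000)
mars_min = one_way_latency_s(225_000_000) / 60
leo_ms = one_way_latency_s(2_000) * 1000
# A hypothetical 1,000 km fibre link, slowed by the fibre's velocity factor
fibre_ms = one_way_latency_s(1_000, velocity_factor=0.67) * 1000

print(f"Moon:  {moon_s:.2f} s")
print(f"Mars:  {mars_min:.1f} min")
print(f"LEO:   {leo_ms:.2f} ms")
print(f"Fibre: {fibre_ms:.2f} ms per 1,000 km")
```

The same function covers both fibre and radio links; only the velocity factor changes.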

In [5]:
# TODO: Provide explanations for how you arrived at each estimation

# Where I use the speed of light (c), I estimate it at 300,000km/s.
# For simplicity, I ignore added delay from routers, switches, processing, etc., so these figures are propagation delay only.
los_angeles_to_amsterdam_explanation = """
Most long-distance data now travels through fibre optic cables because they carry far more bandwidth than coaxial cables. As such I take the distance from LA to Amsterdam
and calculate the latency as distance/(c*0.67). The 0.67 accounts for the slower speed of light in the fibre optic medium
(https://www.commscope.com/globalassets/digizuite/2799-latency-in-optical-fiber-systems-wp-111432-en.pdf?r=1)

"""
low_earth_orbit_satellite_explanation = """
Low Earth Orbit is limited to altitudes of 2000km or less according to NASA (https://www.nasa.gov/leo-economy/faqs/). As such I use the maximum altitude for the latency calculation.
Satellite links use radio waves, which travel at the full speed of light. This results in a simple calculation of 2000km/c.
"""
geostationary_satellite_explanation = """
According to Wikipedia (https://en.wikipedia.org/wiki/Geostationary_orbit) the geostationary orbit altitude is 35,786km. Since we would use radio waves the calculation would be 35,786km/c.
"""
earth_to_the_moon_explanation = """
According to Wikipedia (https://en.wikipedia.org/wiki/Lunar_distance_(astronomy)) the Moon is approximately 400,000km away from the Earth. Since we are using radio waves the calculation
will be 400,000km/c.
"""
earth_to_mars_explanation = """
According to NASA (https://sservi.nasa.gov/articles/how-far-is-it-to-mars/), Mars is approximately 225,000,000km away from Earth on average. Since we will continue using radio waves, the
calculation will be 225,000,000km/c (converted to minutes).
"""

# TODO: Fill in the estimated times for each item

items1_3 = [
    LatencyItem(
        "Los Angeles to Amsterdam",
        44,
        "ms",
        los_angeles_to_amsterdam_explanation.strip()
    ),
    LatencyItem(
        "Low Earth Orbit Satellite",
        6.67,
        "ms",
        low_earth_orbit_satellite_explanation.strip()
    ),
    LatencyItem(
        "Geostationary Satellite",
        119,
        "ms",
        geostationary_satellite_explanation.strip()
    ),
    LatencyItem(
        "Earth to the Moon",
        1.33,
        "s",
        earth_to_the_moon_explanation.strip()
    ),
    LatencyItem(
        "Earth to Mars",
        12.5,
        "min",
        earth_to_mars_explanation.strip()
    ),
]

# Checks if items properly updated
check_latency_items(items1_3)

df1_3 = pd.DataFrame(items1_3)
df1_3.style.hide(axis="index")


name,time,unit,explanation
Los Angeles to Amsterdam,44.0,ms,"Most long-distance data now travels through fibre optic cables because they carry far more bandwidth than coaxial cables. As such I take the distance from LA to Amsterdam and calculate the latency as distance/(c*0.67). The 0.67 accounts for the slower speed of light in the fibre optic medium (https://www.commscope.com/globalassets/digizuite/2799-latency-in-optical-fiber-systems-wp-111432-en.pdf?r=1)"
Low Earth Orbit Satellite,6.67,ms,"Low Earth Orbit is limited to altitudes of 2000km or less according to NASA (https://www.nasa.gov/leo-economy/faqs/). As such I use the maximum altitude for the latency calculation. Satellite links use radio waves, which travel at the full speed of light. This results in a simple calculation of 2000km/c."
Geostationary Satellite,119.0,ms,"According to Wikipedia (https://en.wikipedia.org/wiki/Geostationary_orbit) the geostationary orbit altitude is 35,786km. Since we would use radio waves the calculation would be 35,786km/c."
Earth to the Moon,1.33,s,"According to Wikipedia (https://en.wikipedia.org/wiki/Lunar_distance_(astronomy)) the Moon is approximately 400,000km away from the Earth. Since we are using radio waves the calculation will be 400,000km/c."
Earth to Mars,12.5,min,"According to NASA (https://sservi.nasa.gov/articles/how-far-is-it-to-mars/), Mars is approximately 225,000,000km away from Earth on average. Since we will continue using radio waves, the calculation will be 225,000,000km/c (converted to minutes)."
