# **1.2 Exercise**
Michael J. Montana
College of Science and Technology, Bellevue University
DSC400: Big Data, Technology, and Algorithms
Professor Shawn Hermans
June 11, 2023

# Characteristics of Big Data

## Assignment 1

In this assignment, you will calculate the estimated sizes of big data sets and the latency involved in transmitting data. 

This notebook contains the skeleton necessary for you to complete the assignment.  Look for comments that include `# TODO:` for sections that you need to complete. This notebook also contains the functions `check_data_items` and `check_latency_items` that check that you completed the assignment correctly.  Before you submit the assignment, the notebook should run without any assertion errors. 

Warning: Do not change the names of the dataframes (i.e., `df1_1`, `df1_2`, `df1_3`), as the instructor uses these names when checking the assignments. 

In [70]:
# This code helps check assignment data

import pandas as pd
from collections import namedtuple
from dataclasses import dataclass

InformationUnit = namedtuple('InformationUnit', ['name', 'size'])
DataItem = namedtuple('DataItem', ['name', 'size', 'unit'])
LatencyItem = namedtuple('LatencyItem', ['name', 'time', 'unit', 'explanation'])

information_units = dict(
    B=InformationUnit("byte", 1),
    KB=InformationUnit("kilobyte", 1e3),
    MB=InformationUnit("megabyte", 1e6),
    GB=InformationUnit("gigabyte", 1e9),
    TB=InformationUnit("terabyte", 1e12),
    PB=InformationUnit("petabyte", 1e15),
    EB=InformationUnit("exabyte", 1e18),
    ZB=InformationUnit("zettabyte", 1e21),
    YB=InformationUnit("yottabyte", 1e24)
)

time_units = {
    "ms": "millisecond",
    "s": "second",
    "min": "minute"
}

def check_data_items(items):
    # Checks to see if data sizes and units are filled out correctly
    for item in items:
        assert item.size > 0, 'Size for "{}" should be greater than zero'.format(item.name)
        assert item.unit in information_units, 'Unit "{}" not in units dictionary'.format(item.unit)
        
def check_latency_items(items):
    # Checks to see if time sizes and units are filled out correctly
    for item in items:
        # assert item.time > 0, 'Time for "{}" should be greater than zero'.format(item.name)
        assert item.unit in time_units, 'Unit "{}" not in time units dictionary'.format(item.unit)
        assert item.explanation != "FILL IN THE EXPLANATION HERE", 'Fill in explanation for "{}"'.format(item.name)
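
The `information_units` dictionary above uses decimal (SI) multiples (1 MB = 10^6 bytes), while later cells in this notebook divide by powers of two (2**20 bytes per "MB"). As a minimal sketch of how the two conventions differ, the helper below converts a byte count under either one. The names `DECIMAL_UNITS`, `BINARY_UNITS`, and `convert_bytes` are hypothetical and not part of the assignment skeleton.

```python
# Hypothetical helper (not part of the assignment skeleton): convert a raw
# byte count into a unit under the decimal (SI) or binary convention.
DECIMAL_UNITS = {"B": 1, "KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12, "PB": 1e15}
BINARY_UNITS = {"B": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40, "PB": 2**50}


def convert_bytes(num_bytes, unit, binary=False):
    """Return num_bytes expressed in `unit` (for example "MB")."""
    table = BINARY_UNITS if binary else DECIMAL_UNITS
    if unit not in table:
        raise ValueError(f"Unknown unit: {unit!r}")
    return num_bytes / table[unit]


# A 1024x768 image at 3 bytes per pixel, before any compression.
image_bytes = 1024 * 768 * 3
print(f"decimal: {convert_bytes(image_bytes, 'MB'):.2f} MB")              # 2.36
print(f"binary:  {convert_bytes(image_bytes, 'MB', binary=True):.2f} MB")  # 2.25
```

The roughly 5% gap at the megabyte level grows to about 13% at the petabyte level, so it is worth stating which convention an estimate uses.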

### Assignment 1.1

Provide estimates for the size of various data items.  Please explain how you arrived at the estimates for the size of each item by citing references or providing calculations. 

* Assume all videos are 30 frames per second
* [HEVC](https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding) stands for High Efficiency Video Coding
* See the Wikipedia article on [display resolution](https://en.wikipedia.org/wiki/Display_resolution) for information on HD (1080p) and 4K UHD resolutions. 

| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| 128 character message                      | ? Bytes       |
| 1024x768 PNG image                         | ? MB          |
| 1024x768 RAW image                         | ? MB          | 
| HD (1080p) HEVC Video (15 minutes)         | ? MB          |
| HD (1080p) Uncompressed Video (15 minutes) | ? MB          |
| 4K UHD HEVC Video (15 minutes)             | ? MB          |
| 4k UHD Uncompressed Video (15 minutes)     | ? MB          |
| Human Genome (Uncompressed)                | ? GB          |
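
Most rows in the table reduce to the same arithmetic: count the units (characters, pixels, frames), multiply by the bytes per unit, and divide by a compression ratio where one applies. The sketch below illustrates that arithmetic. The function names and the defaults (24 bits per pixel, no compression) are illustrative assumptions, not values prescribed by the assignment.

```python
# Hypothetical helpers for back-of-the-envelope size estimates (in bytes).

def message_bytes(text):
    """Size of a text message when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def image_bytes(width, height, bits_per_pixel=24):
    """Size of an uncompressed bitmap (8 bits per byte)."""
    return width * height * bits_per_pixel / 8


def video_bytes(width, height, seconds, fps=30, bits_per_pixel=24,
                compression_ratio=1.0):
    """Size of a video; compression_ratio is original size / compressed size."""
    raw = image_bytes(width, height, bits_per_pixel) * fps * seconds
    return raw / compression_ratio


print(message_bytes("a" * 128))          # 128 ASCII characters -> 128 bytes
print(message_bytes("\u00e9" * 128))     # 128 two-byte characters -> 256 bytes
print(image_bytes(1024, 768))            # 2359296.0 bytes
print(video_bytes(1920, 1080, 15 * 60))  # uncompressed HD at 24 bits per pixel
```

Keeping every intermediate value in bytes and converting to MB or GB only when printing avoids mixing bits and bytes along the way.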

# **Assignment 1.1 Estimates and Sources**

In [101]:
#bit reference http://www.beesky.com/newsite/bit_byte.htm
meg = 2**20
gig = 2**30
tera = 2**40
peta =2**50
exa =2**60

# 128 character message
# Twitter uses UTF-8, which takes 1-4 bytes per character. 4-byte characters are typically emoji, math symbols, and other less common characters, so the expected range is 128-384 bytes (1-3 bytes per character) depending on the language; going with the midpoint of 256 bytes.
# reference: https://en.wikipedia.org/wiki/UTF-8

#1024x768 PNG;
# references: https://en.wikipedia.org/wiki/PNG#File_size_and_optimization_software; https://en.wikipedia.org/wiki/Megabyte#:~:text=The%20unit%20megabyte%20is%20commonly,bytes%20or%2010242%20bytes.
pix = 1024*768 # total number of pixels
Bpp = 24/8 # 24 bits per pixel, or 3 bytes (uncompressed; PNG's lossless compression usually yields a smaller file)
print('A 1024x768 PNG image has ', format(pix,",") , ' pixels. Each pixel has ', Bpp, ' Bytes and there are ', format(meg,","), ' bytes in each MegaByte (MB), so a 1024x768 PNG image has ',pix*Bpp/meg, 'MBs.', sep="")

#1024x768 Raw image
#reference: https://en.wikipedia.org/wiki/Raw_image_format
pix = 1024*768
Raw_Bpp = 13/8 # raw uses 12-14 bits per pixel, so averaging at 13 bits (1.625 bytes)
print('A 1024x768 Raw image has ', format(pix,",") , ' pixels. Each pixel has ', Raw_Bpp, ' Bytes and there are ', format(meg,","), ' bytes in each MegaByte (MB), so a 1024x768 Raw image has ','{:.2f}'.format(pix*Raw_Bpp/meg), 'MBs.', sep="")

# HD (1080p) HEVC Video (15 minutes) & HD (1080p) Uncompressed Video (15 minutes)
#reference: https://www.circlehd.com/blog/how-to-calculate-video-file-size
#File Size = Bitrate x duration x compression ratio
#Bitrate (bits per second) = pixel count * bits per pixel (8-12; going with 10 avg) * frame rate (30 fps); divide by 8 to convert bits to bytes
#Duration = 60 * 15 seconds; compression ratio: raw (1/1), HEVC (1/1000)
br = 1080 * 1920 * 10 * 30 / 8 # bytes per second
dur = (60*15) # seconds per minute * minutes
raw_cr = 1/1
HEVC_cr = 1/1000

print('HD (1080p) HEVC Video (15 minutes) = ','{:.2f}'.format(br * dur * HEVC_cr /meg),"MBs\n",
      'HD (1080p) Uncompressed Video (15 minutes) = ','{:.2f}'.format(br * dur * raw_cr/gig), "GBs", sep="")

# 4K UHD HEVC Video (15 minutes) & 4k UHD Uncompressed Video (15 minutes)
br = 3840 * 2160 * 10 * 30 / 8 # bytes per second

print('4K UHD HEVC Video (15 minutes) = ','{:.2f}'.format(br * dur * HEVC_cr /meg),"MBs\n",
      '4k UHD Uncompressed Video (15 minutes) = ','{:.2f}'.format(br * dur * raw_cr/gig), "GBs", sep="")


# Human Genome (Uncompressed)
#reference https://medium.com/precision-medicine/how-big-is-the-human-genome-e90caa3409b0
# ~200 GB


A 1024x768 PNG image has 786,432 pixels. Each pixel has 3.0 Bytes and there are 1,048,576 bytes in each MegaByte (MB), so a 1024x768 PNG image has 2.25MBs.
A 1024x768 Raw image has 786,432 pixels. Each pixel has 1.625 Bytes and there are 1,048,576 bytes in each MegaByte (MB), so a 1024x768 Raw image has 1.22MBs.
HD (1080p) HEVC Video (15 minutes) = 66.74MBs
HD (1080p) Uncompressed Video (15 minutes) = 65.18GBs
4K UHD HEVC Video (15 minutes) = 266.97MBs
4k UHD Uncompressed Video (15 minutes) = 260.71GBs


In [73]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_1 = [
    DataItem('1 Byte', 1, 'B'),
    DataItem("128 character message", 256, "B"),
    DataItem("1024x768 PNG image", 2.25, "MB"),
    DataItem("1024x768 RAW image", 1.22, "MB"),
    DataItem("HD (1080p) HEVC Video (15 minutes)", 66.74, "MB"),
    DataItem("HD (1080p) Uncompressed Video (15 minutes)", 65.18, "GB"),
    DataItem("4K UHD HEVC Video (15 minutes)", 266.97, "MB"),
    DataItem("4k UHD Uncompressed Video (15 minutes)", 260.71, "GB"),
    DataItem("Human Genome (Uncompressed)", 200, "GB"),
]

# Checks if items properly updated
check_data_items(items1_1)
    
df1_1 = pd.DataFrame(items1_1)
# Styler.hide_index() is deprecated; hide(axis="index") is its replacement (pandas 1.4+)
df1_1.style.hide(axis="index")


name,size,unit
1 Byte,1.0,B
128 character message,256.0,B
1024x768 PNG image,2.25,MB
1024x768 RAW image,1.22,MB
HD (1080p) HEVC Video (15 minutes),66.74,MB
HD (1080p) Uncompressed Video (15 minutes),65.18,GB
4K UHD HEVC Video (15 minutes),266.97,MB
4k UHD Uncompressed Video (15 minutes),260.71,GB
Human Genome (Uncompressed),200.0,GB


### Assignment 1.2

Using the estimates for data sizes in the previous part, determine how much storage space you would need for the following items.

* [Twitter statistics](https://www.internetlivestats.com/twitter-statistics/) estimates 500 million tweets are sent each day. For simplicity, assume each tweet is 128 characters. 
* See the [Snappy Github repository](https://github.com/google/snappy) for estimates of Snappy's performance. 
* [Instagram statistics](https://www.omnicoreagency.com/instagram-statistics/) estimates over 100 million videos and photos are uploaded to Instagram every day.   Assume that 75% of those items are 1024x768 PNG photos.
* [YouTube statistics](https://www.omnicoreagency.com/youtube-statistics/) estimates 500 hours of video is uploaded to YouTube every minute.  For simplicity, assume all videos are HD quality encoded using HEVC at 30 frames per second. 


| Data Item                                  | Size per Item | 
|:-------------------------------------------|--------------:|
| Daily Twitter Tweets (Uncompressed)        | ??? TB        |                       
| Daily Twitter Tweets (Snappy Compressed)   | ??? PB        |                       
| Daily Instagram Photos                     | ??? GB        |
| Daily YouTube Videos                       | ??? TB        |                       
| Yearly Twitter Tweets (Uncompressed)       | ??? PB        |                       
| Yearly Twitter Tweets (Snappy Compressed)  | ??? PB        |                       
| Yearly Instagram Photos                    | ??? PB        |                       
| Yearly YouTube Videos                      | ??? PB        |
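
Each row in this table is a per-item size multiplied by a daily item count, optionally divided by a compression ratio, and then multiplied by 365 for the yearly figure. The sketch below shows that scaling for the tweet rows. The helper name `storage_totals` is hypothetical, the 256-byte tweet size and 1.6x Snappy ratio are assumed inputs, and the printed values use decimal units.

```python
# Hypothetical helper: scale a per-item size up to daily and yearly storage.

def storage_totals(items_per_day, bytes_per_item, compression_ratio=1.0, days=365):
    """Return (daily_bytes, yearly_bytes).

    compression_ratio is original size / compressed size, so a ratio of
    1.6 shrinks the data to 1/1.6 of its original size.
    """
    daily = items_per_day * bytes_per_item / compression_ratio
    return daily, daily * days


# Assumed inputs: 500 million tweets per day at 256 bytes each, and a
# Snappy ratio of 1.6 (middle of the 1.5-1.7x range quoted for plain text).
raw_daily, raw_yearly = storage_totals(500_000_000, 256)
snappy_daily, snappy_yearly = storage_totals(500_000_000, 256, compression_ratio=1.6)

print(f"raw:    {raw_daily / 1e9:.1f} GB/day, {raw_yearly / 1e12:.2f} TB/year")
print(f"snappy: {snappy_daily / 1e9:.1f} GB/day, {snappy_yearly / 1e12:.2f} TB/year")
```

Note that a compression ratio divides the size: compressed data should never come out larger than the uncompressed estimate.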

In [105]:


# Daily Twitter Tweets (Uncompressed)
# 500 million tweets per day at 256 bytes each (estimate from Assignment 1.1)
daily_tweets = 500000000 * 256

# Daily Twitter Tweets (Snappy Compressed)
# Typical compression ratios (based on the benchmark suite) are about 1.5-1.7x for plain text, about 2-4x for HTML, and of course 1.0x for JPEGs, PNGs and other already-compressed data.
# Using 1.6x for plain text; compression divides the size by the ratio
tweets_snappy = daily_tweets / 1.6

# Daily Instagram Photos
# 75% of the 100 million daily uploads are 1024x768 PNG photos
insta = 100000000 * .75 * pix * Bpp

# Daily YouTube Videos
# 500 hours uploaded per minute = 500 * 60 * 24 hours per day, at 60 * 60 seconds per hour
# Bitrate is in bits per second, so divide by 8 to get bytes
dutube = ((1080 * 1920 * 10 * 30 / 8) * (500 * 60 * 24 * 60 * 60) * (HEVC_cr))

# Yearly Twitter Tweets (Uncompressed),Yearly Twitter Tweets (Snappy Compressed),Yearly Instagram Photos, Yearly YouTube Videos all x365
#bit reference http://www.beesky.com/newsite/bit_byte.htm
meg = 2**20
gig = 2**30
tera = 2**40
peta =2**50
exa =2**60

print('Daily:\n',
      '\tTwitter Tweets (Uncompressed): ', '{:.2f}'.format(daily_tweets/gig), 'GB',
      '\n\tDaily Twitter Tweets (Snappy Compressed): ', '{:.2f}'.format(tweets_snappy/gig), 'GB',
      '\n\tDaily Instagram Photos: ','{:.2f}'.format(insta/tera),'TB',
      '\n\tDaily YouTube Videos: ','{:.2f}'.format(dutube/tera), 'TB',
      '\nYearly:',
      '\n\tYearly Twitter Tweets (Uncompressed): ','{:.2f}'.format(daily_tweets*365/tera), 'TB',
      '\n\tYearly Twitter Tweets (Snappy Compressed): ', '{:.2f}'.format(tweets_snappy*365/tera),'TB',
      '\n\tYearly Instagram Photos: ','{:.2f}'.format(insta*365/peta),'PB',
      '\n\tYearly YouTube Videos: ', '{:.2f}'.format(dutube *365/peta),'PB',
      sep=''
      )

Daily:
	Twitter Tweets (Uncompressed): 119.21GB
	Daily Twitter Tweets (Snappy Compressed): 74.51GB
	Daily Instagram Photos: 160.93TB
	Daily YouTube Videos: 183.31TB
Yearly:
	Yearly Twitter Tweets (Uncompressed): 42.49TB
	Yearly Twitter Tweets (Snappy Compressed): 26.56TB
	Yearly Instagram Photos: 57.36PB
	Yearly YouTube Videos: 65.34PB


In [104]:
# TODO: Fill in the estimated sizes for each item
# You may need to adjust the units as well

items1_2 = [
    DataItem("Daily Twitter Tweets (Uncompressed)", 119.21, "GB"),
    DataItem("Daily Twitter Tweets (Snappy Compressed)", 74.51, "GB"),
    DataItem("Daily Instagram Photos", 160.93, "TB"),
    DataItem("Daily YouTube Videos", 183.31, "TB"),
    DataItem("Yearly Twitter Tweets (Uncompressed)", 42.49, "TB"),
    DataItem("Yearly Twitter Tweets (Snappy Compressed)", 26.56, "TB"),
    DataItem("Yearly Instagram Photos", 57.36, "PB"),
    DataItem("Yearly YouTube Videos", 65.34, "PB"),
]

# Checks if items properly updated
check_data_items(items1_2)

df1_2 = pd.DataFrame(items1_2)
# Styler.hide_index() is deprecated; hide(axis="index") is its replacement (pandas 1.4+)
df1_2.style.hide(axis="index")


name,size,unit
Daily Twitter Tweets (Uncompressed),119.21,GB
Daily Twitter Tweets (Snappy Compressed),74.51,GB
Daily Instagram Photos,160.93,TB
Daily YouTube Videos,183.31,TB
Yearly Twitter Tweets (Uncompressed),42.49,TB
Yearly Twitter Tweets (Snappy Compressed),26.56,TB
Yearly Instagram Photos,57.36,PB
Yearly YouTube Videos,65.34,PB


### Assignment 1.3

Provide estimates of the one way latency for each of the following items.  Please explain how you arrived at the estimates for each item by citing references or providing calculations. 

|                           | One Way Latency      |
|:--------------------------|---------------------:|
| Los Angeles to Amsterdam  | ? ms                 |
| Low Earth Orbit Satellite | ? ms                 |
| Geostationary Satellite   | ? ms                 |
| Earth to the Moon         | ? ms                 |
| Earth to Mars             | ? min                | 
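
For the space links, a lower bound on one-way latency follows directly from distance divided by signal speed. The sketch below shows that calculation. The helper name `one_way_latency_ms` and the `velocity_factor` parameter are hypothetical, the geostationary altitude of about 35,786 km is an assumed input, and real links add routing, queuing, and processing delay on top of this bound.

```python
# Hypothetical sketch: lower bound on one-way latency from distance alone.
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum, km/s


def one_way_latency_ms(distance_km, velocity_factor=1.0):
    """Propagation delay in milliseconds over distance_km.

    velocity_factor scales the signal speed relative to c (about 2/3 is a
    common rule of thumb for optical fiber; 1.0 for radio in vacuum).
    """
    return distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor) * 1000


# Assumed distances: geostationary altitude of about 35,786 km and an
# average Earth-Moon distance of 384,400 km.
print(f"Geostationary hop: {one_way_latency_ms(35_786):.1f} ms")
print(f"Earth to Moon:     {one_way_latency_ms(384_400) / 1000:.2f} s")
```

A measured round-trip time, such as a ping result, can be compared against this bound by halving it first.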

In [106]:
# TODO: Provide explanations for how you arrived at each estimation

los_angeles_to_amsterdam_explanation = """
I used global ping statistics. The average round-trip ping was 140.089 ms for the timestamp 2023-06-11 09:01:18. Halving it gives a one-way latency of about 70.04 ms. https://wondernetwork.com/pings/Los%20Angeles/Amsterdam
"""
low_earth_orbit_satellite_explanation = """
LEO satellites should deliver approximately 50 ms of latency (and this will improve with next-generation technology to <20 ms). https://www.analog.com/en/analog-dialogue/articles/internet-from-space-rfic-in-high-capacity-low-latency-leo-satellite-terminals.html#:~:text=LEO%20satellites%20should%20deliver%20approximately,due%20to%20the%20lower%20orbit..
"""
geostationary_satellite_explanation = """
Typically, during perfect conditions, the physics involved in satellite communications account for approximately 550 milliseconds of latency round-trip time, so the one-way latency is about 275 ms. https://en.wikipedia.org/wiki/Satellite_Internet_access#:~:text=Typically%2C%20during%20perfect%20conditions%2C%20the,a%20geostationary%20satellite%2Dbased%20network.
"""
earth_to_the_moon_explanation = """
Radio waves propagate in vacuum at the speed of light c, exactly 299,792,458 m/s. Propagation time to the Moon and back ranges from 2.4 to 2.7 seconds, with an average of 2.56 seconds (the average distance from Earth to the Moon is 384,400 km), so the one-way latency is about 1.28 seconds.
"""
earth_to_mars_explanation = """
It generally takes about 5 to 20 minutes for a radio signal to travel the distance between Mars and Earth, depending on planet positions.  Using the midpoint of 12.5 minutes.  https://mars.nasa.gov/mars2020/spacecraft/rover/communications/#:~:text=It%20generally%20takes%20about%205,Earth%2C%20depending%20on%20planet%20positions.
"""

# TODO: Fill in the estimated times for each item

items1_3 = [
    LatencyItem(
        "Los Angeles to Amsterdam",
        70.04,
        "ms",
        los_angeles_to_amsterdam_explanation.strip()
    ),
    LatencyItem(
        "Low Earth Orbit Satellite",
        50,
        "ms",
        low_earth_orbit_satellite_explanation.strip()
    ),
    LatencyItem(
        "Geostationary Satellite",
        275,
        "ms",
        geostationary_satellite_explanation.strip()
    ),
    LatencyItem(
        "Earth to the Moon",
        1.28,
        "s",
        earth_to_the_moon_explanation.strip()
    ),
    LatencyItem(
        "Earth to Mars",
        12.5,
        "min",
        earth_to_mars_explanation.strip()
    ),
]

# Checks if items properly updated
check_latency_items(items1_3)

df1_3 = pd.DataFrame(items1_3)
# Styler.hide_index() is deprecated; hide(axis="index") is its replacement (pandas 1.4+)
df1_3.style.hide(axis="index")


name,time,unit,explanation
Los Angeles to Amsterdam,70.04,ms,I used global ping statistics. The average round-trip ping was 140.089 ms for the timestamp 2023-06-11 09:01:18. Halving it gives a one-way latency of about 70.04 ms. https://wondernetwork.com/pings/Los%20Angeles/Amsterdam
Low Earth Orbit Satellite,50.0,ms,"LEO satellites should deliver approximately 50 ms of latency (and this will improve with next-generation technology to <20 ms). https://www.analog.com/en/analog-dialogue/articles/internet-from-space-rfic-in-high-capacity-low-latency-leo-satellite-terminals.html#:~:text=LEO%20satellites%20should%20deliver%20approximately,due%20to%20the%20lower%20orbit.."
Geostationary Satellite,275.0,ms,"Typically, during perfect conditions, the physics involved in satellite communications account for approximately 550 milliseconds of latency round-trip time, so the one-way latency is about 275 ms. https://en.wikipedia.org/wiki/Satellite_Internet_access#:~:text=Typically%2C%20during%20perfect%20conditions%2C%20the,a%20geostationary%20satellite%2Dbased%20network."
Earth to the Moon,1.28,s,"Radio waves propagate in vacuum at the speed of light c, exactly 299,792,458 m/s. Propagation time to the Moon and back ranges from 2.4 to 2.7 seconds, with an average of 2.56 seconds (the average distance from Earth to the Moon is 384,400 km), so the one-way latency is about 1.28 seconds."
Earth to Mars,12.5,min,"It generally takes about 5 to 20 minutes for a radio signal to travel the distance between Mars and Earth, depending on planet positions. Using the midpoint of 12.5 minutes. https://mars.nasa.gov/mars2020/spacecraft/rover/communications/#:~:text=It%20generally%20takes%20about%205,Earth%2C%20depending%20on%20planet%20positions."
