In [None]:
from scapy.all import *
import pandas as pd
import numpy as np
import binascii
import seaborn as sns
from util.pandas_util import pcap_to_df
sns.set(color_codes=True)
%matplotlib inline
pd.set_option('display.max_rows', None)

# Analyzing Ookla Speedtest

**Objective:** In this section we will examine an [Ookla speedtest](https://www.speedtest.net/) and empirically analyze how it works.

Before we can begin analyzing packet captures, we need to process the data into a Pandas DataFrame. Using Pandas DataFrames as our data structure is useful for several reasons:  

- Data Manipulation: Pandas offers a wide range of functionalities to filter, sort, group, and aggregate data.

- Visualization: Pandas has built-in visualization capabilities based on Matplotlib. Although specialized network visualization tools might be better suited for graph layouts, Pandas can be useful for preliminary data exploration and plotting network statistics.

- CSV & File I/O: Real-world network data often comes in the form of CSV files or other standard formats. Pandas provides powerful I/O capabilities, allowing you to easily read in data, manipulate it, and then use it for network analysis.

- Scalability: Pandas operations are vectorized, so DataFrames that fit in memory can be filtered and aggregated efficiently even when they hold millions of rows.

- Flexible Data Types: A DataFrame can contain a mix of different data types (integers, strings, floats, etc.).

### Task 0
Copy the file `speed_test.pcap`, located at `/mnt/md0/cs190n/pcaps/speed_test.pcap`, to your local working directory. If you have cloned the course repository in your home directory, you can copy this pcap using the following command:
```
cp /mnt/md0/cs190n/pcaps/speed_test.pcap ~/cs190n-fall-2023/assignments/assignment1/pcaps/
```

### Task 1
Using the `scapy` library, report the number of TCP, UDP, and total packets contained in each of the files `pcaps/sample_speed_test.pcap` and `pcaps/speed_test.pcap`. Also report the time it takes to read each pcap with `scapy`. You can measure execution time by recording the current time with the [datetime](https://www.programiz.com/python-programming/datetime/current-time) library immediately before and after the code being measured and printing the difference.
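
The timing approach described above can be sketched as follows. This is a minimal illustration, not a prescribed solution: the helper name `timed` is hypothetical, and the workload is a stand-in (`sum` over a range) so the block runs without a pcap.

```python
from datetime import datetime


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = datetime.now()
    result = fn(*args, **kwargs)
    elapsed = (datetime.now() - start).total_seconds()
    return result, elapsed


# Stand-in workload; `timed` is a hypothetical helper, not part of the course code.
result, elapsed = timed(sum, range(1_000_000))
print(f"elapsed: {elapsed:.3f} s")
```

With `scapy`, the same helper could wrap the read, for example `packets, seconds = timed(rdpcap, "pcaps/sample_speed_test.pcap")`, after which packets can be counted per protocol with checks such as `pkt.haslayer(TCP)`.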


### Task 2
Using the `pcap_to_df` function defined in `util/pandas_util.py`, measure the time it takes to convert the file `pcaps/sample_speed_test.pcap` to a Pandas DataFrame. Using this measurement, estimate the time it will take to convert the file `pcaps/speed_test.pcap` to a Pandas DataFrame. The DataFrame for `pcaps/speed_test.pcap` has been pre-generated for you and written to a CSV file. You can copy it to your local working directory. If you have cloned the course repository in your home directory, you can copy this CSV using the following command:
```
cp /mnt/md0/cs190n/csvs/speed_test.csv ~/cs190n-fall-2023/assignments/assignment1/pcaps/
```

For the remainder of this assignment, you should use the sample data to develop your code and build your queries, and then rerun your analysis on the full dataset once you have working code.
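
One possible way to form the estimate in Task 2 is to assume conversion time grows linearly with the number of packets. The sketch below makes that assumption explicit; the function name and all numbers are illustrative placeholders, not measurements from the course pcaps.

```python
def estimate_conversion_seconds(sample_seconds, sample_packets, full_packets):
    """Linearly extrapolate conversion time from the sample to the full capture."""
    seconds_per_packet = sample_seconds / sample_packets
    return seconds_per_packet * full_packets


# Placeholder values: replace them with your own measured time and packet counts.
estimate = estimate_conversion_seconds(
    sample_seconds=2.0, sample_packets=1_000, full_packets=250_000
)
print(f"estimated conversion time: {estimate:.0f} s")
```

The linear assumption ignores fixed start-up cost and memory pressure on large captures, so the result is best treated as a rough lower bound.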

### Task 3
The 5-tuple is a common concept in networking: it is the set of 5 attributes that together identify a network flow:  
   - Protocol  
   - Source IP  
   - Source Port  
   - Destination IP  
   - Destination Port  
   
Report the number of unique 5-tuples within `pcaps/speed_test.pcap`. Remember, the port numbers are either TCP or UDP port numbers depending on the protocol, and both port numbers can be `None` if `proto` is neither 6 (TCP) nor 17 (UDP). You should first develop your code using the sample packet capture before executing the code on the full packet capture.
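
Counting unique 5-tuples can be sketched on a small synthetic DataFrame. Only `proto` is named in this assignment; the other column names (`src`, `sport`, `dst`, `dport`) are assumptions and may differ from what `pcap_to_df` actually produces.

```python
import pandas as pd

# Synthetic packets; column names other than `proto` are assumed for illustration.
df = pd.DataFrame({
    "proto": [6, 6, 17, 1, 1],
    "src": ["203.0.113.5", "203.0.113.5", "192.168.0.203",
            "192.168.0.203", "192.168.0.203"],
    "sport": [443, 443, 5353, None, None],
    "dst": ["192.168.0.203", "192.168.0.203", "198.51.100.7",
            "198.51.100.7", "198.51.100.7"],
    "dport": [50000, 50000, 53, None, None],
})

five_tuple = ["proto", "src", "sport", "dst", "dport"]

# drop_duplicates treats missing ports as equal, so the two ICMP rows collapse.
unique_flows = df[five_tuple].drop_duplicates()
n_flows = len(unique_flows)
print(n_flows)
```

Here the two TCP rows share one 5-tuple and the two ICMP rows (with missing ports) share another, so the synthetic data contains three distinct flows.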

### Task 4
Next, we want to understand the most significant connections that are made during this speed test. Report the top 5 server IPs that send the most inbound data to the local client (192.168.0.203) and the top 5 server IPs that receive the most outbound data from the local client (192.168.0.203) in `pcaps/speed_test.pcap`. (Hint: use the [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function with `dropna=False` because each row may have some values that are null)
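
The `groupby` hint can be illustrated on synthetic data. The column names `src`, `dst`, and `len` (packet size in bytes) are assumptions, and the server addresses are documentation-range placeholders.

```python
import pandas as pd

CLIENT = "192.168.0.203"

# Synthetic packets; `src`, `dst`, and `len` are assumed column names.
df = pd.DataFrame({
    "src": ["203.0.113.5", "203.0.113.5", "198.51.100.7", CLIENT, CLIENT],
    "dst": [CLIENT, CLIENT, CLIENT, "203.0.113.5", "198.51.100.7"],
    "len": [1500, 1500, 400, 60, 900],
})

# Inbound: bytes received by the client, totalled per sending server.
inbound = (
    df[df["dst"] == CLIENT]
    .groupby("src", dropna=False)["len"].sum()
    .nlargest(5)
)

# Outbound: bytes sent by the client, totalled per receiving server.
outbound = (
    df[df["src"] == CLIENT]
    .groupby("dst", dropna=False)["len"].sum()
    .nlargest(5)
)

print(inbound)
print(outbound)
```

Passing `dropna=False` keeps groups whose key is null instead of silently discarding those rows, which matters more once ports are added to the grouping keys.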

### Task 5
Visualize the number of inbound and outbound bytes observed every second aggregated **over all flows** between the local IP (192.168.0.203) and the server IPs you identified in the previous task. You can use [seaborn](https://www.geeksforgeeks.org/creating-a-time-series-plot-with-seaborn-and-pandas/) to graph the data directly from a Pandas DataFrame.
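
A rough sketch of the per-second aggregation follows, again on synthetic data. The columns `time` (seconds), `src`, `dst`, and `len` are assumed names; plotting is left out of the block so that it runs without a display.

```python
import pandas as pd

CLIENT = "192.168.0.203"
SERVER = "203.0.113.5"  # placeholder server address

# Synthetic packets; column names are assumptions for illustration.
df = pd.DataFrame({
    "time": [0.1, 0.7, 1.2, 1.9, 2.5],
    "src": [SERVER, CLIENT, SERVER, SERVER, CLIENT],
    "dst": [CLIENT, SERVER, CLIENT, CLIENT, SERVER],
    "len": [1000, 100, 1500, 500, 200],
})

# Whole seconds elapsed since the first packet.
df["second"] = (df["time"] - df["time"].min()).astype(int)

# Label each packet by direction relative to the local client.
df["direction"] = df["dst"].eq(CLIENT).map({True: "inbound", False: "outbound"})

per_second = df.groupby(["second", "direction"], as_index=False)["len"].sum()
print(per_second)
```

The resulting long-format table can be passed to seaborn directly, for example `sns.lineplot(data=per_second, x="second", y="len", hue="direction")`.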

### Task 6
Identify the flows that are relevant to the speed test. (Hint: The number of packets sent in each flow and the protocol are useful indicators for which servers are used for the speed test)

### Task 7
Estimate the duration of the speed test using the flows you identified in the previous task. The duration of the test is the time elapsed from the earliest start time observed across all the flows to the latest end time observed across all the flows. Report the minimum time, maximum time, and duration.
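
The duration calculation can be sketched as below, assuming the relevant packets have already been selected and labeled. The `flow_id` and `time` columns and their values are hypothetical.

```python
import pandas as pd

# Synthetic packets from two flows; `flow_id` and `time` are assumed columns.
flows = pd.DataFrame({
    "flow_id": ["a", "a", "b", "b"],
    "time": [10.0, 25.0, 12.5, 27.5],
})

# First and last packet timestamp of each flow.
per_flow = flows.groupby("flow_id")["time"].agg(start="min", end="max")

t_min = per_flow["start"].min()  # earliest start over all flows
t_max = per_flow["end"].max()    # latest end over all flows
duration = t_max - t_min

print(t_min, t_max, duration)
```

Keeping the per-flow start and end times around is also useful for checking whether the download and upload phases overlap.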

### Task 8
For this task, you will use a new windowing approach over all the flows you identified as relevant to the speed test, separately for download and upload. Split the duration of the test into 20 equal windows, and aggregate the bytes over all relevant flows for each window. For both upload and download, report the total bytes measured in each of the 20 windows as well as the duration of each window.
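
One way to build the 20 windows is with `numpy.linspace` for the edges and `pandas.cut` for the assignment of packets to windows. The sketch below uses synthetic packets for a single direction, with assumed column names `time` and `len`.

```python
import numpy as np
import pandas as pd

N_WINDOWS = 20

# Synthetic packets: 101 packets of 100 bytes spread over 10 seconds.
df = pd.DataFrame({"time": np.linspace(0.0, 10.0, 101), "len": 100})

t_min, t_max = df["time"].min(), df["time"].max()
edges = np.linspace(t_min, t_max, N_WINDOWS + 1)
window_duration = (t_max - t_min) / N_WINDOWS

# Assign every packet a window index 0..19; include_lowest keeps the first packet.
df["window"] = pd.cut(df["time"], bins=edges, labels=False, include_lowest=True)

# Sum bytes per window; reindex so that empty windows show up as 0.
bytes_per_window = (
    df.groupby("window")["len"].sum().reindex(range(N_WINDOWS), fill_value=0)
)

print(window_duration)
print(bytes_per_window)
```

For the assignment, the same steps would be applied once to the download packets and once to the upload packets.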

### Task 9
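Discard the 5 windows with the lowest number of bytes and the 2 windows with the highest number of bytes. Estimate the download and upload throughput in Mbps from the remaining 13 windows, where throughput is the number of bits transferred (bytes × 8) divided by the duration in seconds, expressed in megabits per second (10^6 bits per second).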
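
The trimming and unit conversion can be sketched in plain Python. The function name and the input values are hypothetical, and the sketch assumes all windows have the same duration.

```python
def trimmed_throughput_mbps(window_bytes, window_seconds, drop_low=5, drop_high=2):
    """Drop the lowest and highest windows, then compute throughput in Mbps."""
    kept = sorted(window_bytes)[drop_low:len(window_bytes) - drop_high]
    total_bits = sum(kept) * 8                  # bytes -> bits
    total_seconds = len(kept) * window_seconds  # time covered by the kept windows
    return total_bits / total_seconds / 1e6     # bits/s -> megabits/s


# Placeholder input: 20 equal windows of 1,250,000 bytes, each 0.5 s long.
window_bytes = [1_250_000] * 20
mbps = trimmed_throughput_mbps(window_bytes, window_seconds=0.5)
print(f"{mbps:.1f} Mbps")
```

Because every window has the same duration, dividing the total kept bits by the total kept time gives the same result as averaging the per-window throughputs.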