# Assignment: Video Quality Inference

To this point in the class, you have learned various techniques for leading and analyzing packet captures of various types, generating features from those packet captures, and training and evaluating models using those features.

In this assignment, you will put all of this together, using a network traffic trace to train a model to automatically infer video quality of experience from a labeled traffic trace.

## Part 1: Warmup

The first part of this assignment builds directly on the hands-on activities but extends them slightly.

### Extract Features from the Network Traffic

Load the `netflix.pcap` file, which is a packet trace that includes network traffic. 


In [70]:
import logging
logging.getLogger("scapy.runtime").setLevel(logging.ERROR)

from netml.pparser.parser import PCAP
from netml.utils.tool import dump_data, load_data

import pandas as pd

In [71]:
pcap = PCAP('../notebooks/data/netflix.pcap', flow_ptks_thres=2, verbose=10)

In [77]:
pcap.pcap2pandas()

pdf = pcap.df

'_pcap2pandas()' starts at 2023-10-21 23:12:37
'_pcap2pandas()' ends at 2023-10-21 23:13:28 and takes 0.84 mins.


### Identifying the Service Type

Use the DNS traffic to filter the packet trace for Netflix traffic.

In [211]:
NF_DOMAINS = ["nflxvideo", 
              "netflix", 
              "nflxso", 
              "nflxext"]

In [213]:
nfre = '|'.join(NF_DOMAINS)

In [214]:
df = pdf[pdf['is_dns']]
df.head(4)


Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
0,2018-02-11 08:10:00,"(fonts.gstatic.com.,)",,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,77,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,55697.0,UDP,1518358200.534682,0.0
1,2018-02-11 08:10:00,"(fonts.gstatic.com.,)",,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,77,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,59884.0,UDP,1518358200.534832,0.00015
2,2018-02-11 08:10:00,"(googleads.g.doubleclick.net.,)",,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,87,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,61223.0,UDP,1518358200.539408,0.004726
3,2018-02-11 08:10:00,"(googleads.g.doubleclick.net.,)",,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,87,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,58785.0,UDP,1518358200.541204,0.006522


In [215]:
def get_first_value(t):
    return t[0]
df = df.applymap(lambda x: get_first_value(x) if isinstance(x, tuple) else x)

In [218]:
nfre = '|'.join(NF_DOMAINS)
onlynetflixdf = df[df['dns_query'].str.contains(nfre, regex=True, na=False) | df['dns_resp'].str.contains(nfre, regex=True, na=False)]
print(len(nf_queries))
nf_queries.head()

34


Unnamed: 0,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
0,2018-02-11 08:10:00,fonts.gstatic.com.,,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,77,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,55697.0,UDP,1518358200.534682,0.0
1,2018-02-11 08:10:00,fonts.gstatic.com.,,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,77,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,59884.0,UDP,1518358200.534832,0.00015
86,2018-02-11 08:10:02,www.netflix.com.,,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,75,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,43209.0,UDP,1518358202.362996,1.828314
87,2018-02-11 08:10:02,assets.nflxext.com.,,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,78,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,28162.0,UDP,1518358202.363168,1.828486
88,2018-02-11 08:10:02,codex.nflxext.com.,,128.93.77.234,2153598442.0,192.168.43.72,3232246600.0,True,77,a0:ce:c8:0d:2b:a7,176809980013479,e4:ce:8f:01:4c:54,251575813622868,53.0,48245.0,UDP,1518358202.363441,1.828759


### Generate Statistics

Generate statistics and features for the Netflix traffic flows. Use the `netml` library or any other technique that you choose to generate a set of features that you think would be good features for your model. 

In [219]:
pcap.pcap2flows()

'_pcap2flows()' starts at 2023-10-21 23:41:07
pcap_file: ../notebooks/data/netflix.pcap
ith_packets: 0
ith_packets: 10000
ith_packets: 20000
ith_packets: 30000
ith_packets: 40000
ith_packets: 50000
ith_packets: 60000
ith_packets: 70000
ith_packets: 80000
ith_packets: 90000
ith_packets: 100000
ith_packets: 110000
ith_packets: 120000
ith_packets: 130000
ith_packets: 140000
len(flows): 275
total number of flows: 275. Num of flows < 2 pkts: 91, and >=2 pkts: 184 without timeout splitting.
kept flows: 184. Each of them has at least 2 pkts after timeout splitting.
flow_durations.shape: (184, 1)
        col_0
count 184.000
mean   82.331
std   127.700
min     1.138
25%    14.428
50%    17.122
75%    71.087
max   486.705
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184 entries, 0 to 183
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col_0   184 non-null    float64
dtypes: float64(1)
memory usage: 1.6 KB
None
0th_flow: len(pkts

In [227]:
pcap.flow2features('STATS', fft=False, header=False)
stats = pd.DataFrame(pcap.features)
print(len(stats))
stats.head(4)

'_flow2features()' starts at 2023-10-21 23:45:15
True
'_flow2features()' ends at 2023-10-21 23:45:16 and takes 0.0222 mins.
184


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,11.462,1.047,93.53,89.333,49.601,66.0,66.0,69.0,66.0,200.0,12.0,1072.0
1,248.797,0.269,22.283,82.746,41.922,66.0,66.0,66.0,54.0,200.0,67.0,5544.0
2,19.35,0.258,17.675,68.4,4.8,66.0,66.0,66.0,66.0,78.0,5.0,342.0
3,19.349,0.258,17.675,68.4,4.8,66.0,66.0,66.0,66.0,78.0,5.0,342.0


**Write a brief justification for the features that you have chosen.**

<p>Candidates for features include IAT, STATS, SIZE, SAMP-NUM, and SAMP-SIZE. IAT and SIZE contained numerous "zero" points, and would likely prove difficult to use while training. Additionally, STATS has a large number of non-zero values as well as further statistics on packet size. This flexibility of nonzero data is what encouraged me to pick this feature.

### Inferring Segment downloads

In addition to the features that you could generate using the `netml` library or similar, add to your feature vector a "segment downloads rate" feature, which indicates the number of video segments downloaded for a given time window.

Note: If you are using the `netml` library, generating features with `SAMP` style options may be useful, as this option gives you time windows, and you can then simply add the segment download rate to that existing dataframe.

In [230]:
stats['segment downloads rate'] = 0
stats.head(4)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,segment downloads rate
0,11.462,1.047,93.53,89.333,49.601,66.0,66.0,69.0,66.0,200.0,12.0,1072.0,0
1,248.797,0.269,22.283,82.746,41.922,66.0,66.0,66.0,54.0,200.0,67.0,5544.0,0
2,19.35,0.258,17.675,68.4,4.8,66.0,66.0,66.0,66.0,78.0,5.0,342.0,0
3,19.349,0.258,17.675,68.4,4.8,66.0,66.0,66.0,66.0,78.0,5.0,342.0,0


## Part 2: Video Quality Inference

You will now load the complete video dataset from a previous study to train and test models based on these features to automatically infer the quality of a streaming video flow.

For this part of the assignment, you will need two pickle files, which we provide for you by running the code below:

```

!gdown 'https://drive.google.com/uc?id=1N-Cf4dJ3fpak_AWgO05Fopq_XPYLVqdS' -O netflix_session.pkl
!gdown 'https://drive.google.com/uc?id=1PHvEID7My6VZXZveCpQYy3lMo9RvMNTI' -O video_dataset.pkl

```

### Load the File

Load the video dataset pickle file.

### Clean the File

1. The dataset contains video resolutions that are not valid. Remove entries in the dataset that do not contain a valid video resolution. Valid resolutions are 280, 360, 480, 720, 1080.

2. The file also contains columns that are unnecessary (in fact, unhelpful!) for performing predictions. Identify those columns, and remove them.

**Briefly explain why you removed those columns.**

### Prepare Your Data

Prepare your data matrix, determine your features and labels, and perform a train-test split on your data.

### Train and Tune Your Model

1. Select a model of your choice.
2. Train the model using your training data.

### Tune Your Model

Perform hyperparameter tuning to find optimal parameters for your model.

### Evaluate Your Model

Evaluate your model accuracy according to the following metrics:

1. Accuracy
2. F1 Score
3. Confusion Matrix
4. ROC/AUC

## Part 3: Predict the Ongoing Resolution of a Real Netflix Session

Now that you have your model, it's time to put it in practice!

Use a preprocessed Netflix video session to infer **and plot** the resolution at 10-second time intervals.