# **Homework**

Complete the following tasks:

* Use a dataset of 21 Video Sessions
* Recognize the Video Server(s) IP and select video traffic (***if more than one Server is found, keep the dominant flow only***)
* Detect Video Client HTTP Requests (Uplink packets with size greater than or equal to 100 Bytes)
* Compute features to predict:
 1.   When the next UL Request is sent by the Video Client
 2.   How large is the response of the Server to the next UL Request

**N.B.**: Below, you can find a list of useful functions for the tasks at hand (introduced during class).

First, import the required libraries and mount the folder containing the data files.

In [None]:
import os

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import KFold

import plotly.graph_objects as go
from matplotlib import pyplot as plt

import warnings
warnings.filterwarnings('ignore')

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab Notebooks/Network measurements /Captures_HW3

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/Colab Notebooks/Network measurements /Captures_HW3


### Functions Ready-To-Use

Set of predefined functions, with the same purpose as the ones discussed during the class. They were not modified and are used as provided.

In [None]:
def filter_traffic(data, domain):
    # Look in DNS Responses for googlevideo domain
    dns_data = data[data['Protocol']=='DNS']
    dns = dns_data[dns_data['Info'].apply(lambda x: 'googlevideo' in x and 'response' in x)]
    ips = dns.Address.values
    server_names = dns.Name.values

    # Filtering on either "Source" or "Destination" IP, get the
    # rows of the dataset that contain at least one of the selected IPs
    downlink = data[data['Source'].apply(lambda x: x in ips)].dropna(subset=['Length'])
    uplink = data[data['Destination'].apply(lambda x: x in ips)].dropna(subset=['Length'])
    return ips, server_names, uplink, downlink

def find_dominant(uplink, downlink):
  # Expressed in MB
  # Order flows by cumulative DL Volume
  flows_DL = downlink.groupby(['Source','Destination'])['Length'].sum()/(10**6)

  # Get (Source,Destination) IPs of dominant flows
  dom_id = flows_DL[flows_DL==max(flows_DL)].index[0]

  # Filter traffic selecting the dominant flow
  dom_dl = downlink[downlink['Source']==dom_id[0]]
  dom_ul = uplink[(uplink['Source']==dom_id[1])]
  return dom_ul, dom_dl

def timebased_filter(data, length=None, min_time=None, max_time=None):
  '''
  :param data: pd dataframe to be filtered. Must contain columns: "Length" and "Time"
  :param length: all packets shorter than length [Bytes] will be discarded (default 0)
  :param min_time: all packets with timestamp smaller than min_time [s] will be discarded (default 0)
  :param max_time: all packets with timestamp larger than max_time [s] will be discarded (default 1000)
  '''

  if length is None:
    length=0
  if min_time is None:
    min_time = 0
  if max_time is None:
    max_time = 1000

  filtered_data = data.copy().reset_index()
  mask = (filtered_data['Length']>=length) & (filtered_data['Time']>=min_time) & (filtered_data['Time']<= max_time)
  filtered_data = filtered_data.loc[mask[mask ==True].index]
  return filtered_data

def find_next(array, value):
  '''
  :param array: np.array, array of floats
  :param value: float, reference value
  :return: position of the closest element of the array greater than "value"
  '''
  delta = np.asarray(array) - value
  idx = np.where(delta >= 0, delta, np.inf).argmin()
  return idx

def normalize_dataset(training_set, test_set):
  mean_train = training_set.mean()
  std_train = training_set.std()
  norm_train = (training_set - mean_train)/std_train
  norm_test = (test_set - mean_train)/std_train
  return norm_train, norm_test
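
As an illustration of the "keep the dominant flow only" step, the sketch below selects the dominant flow on a toy capture. The IP addresses and packet lengths are made up for the example, and `idxmax` is used here as a compact alternative to the `max`-based lookup in `find_dominant`; it is not the code provided in class.

```python
import pandas as pd

# Toy downlink capture: two hypothetical servers sending to one client.
toy_downlink = pd.DataFrame({
    "Source": ["10.0.0.2", "10.0.0.2", "10.0.0.3", "10.0.0.2"],
    "Destination": ["192.168.1.5"] * 4,
    "Length": [1500, 1500, 400, 1200],
})

# Cumulative downlink volume per (Source, Destination) pair, in bytes.
volume_per_flow = toy_downlink.groupby(["Source", "Destination"])["Length"].sum()

# idxmax returns the index label (a tuple here) of the largest volume.
dominant_server, client = volume_per_flow.idxmax()

# Keep only the packets sent by the dominant server.
dominant_dl = toy_downlink[toy_downlink["Source"] == dominant_server]
print(dominant_server, client, len(dominant_dl))
```

In this toy capture the first server carries 4200 of the 4600 bytes, so only its three packets are kept.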

This is a new function. Since the arrival time of the next HTTP Request is the time between the last downlink packet and the next uplink packet, I use this function to find the index of the largest downlink timestamp that is smaller than the timestamp of a given uplink packet.

In [None]:
# Useful function
def find_previous(array, value): # similarly to "find_next", this function returns the position of the closest smaller value to the given one
  '''
  :param array: np.array, array of floats
  :param value: float, reference value
  :return: position of the closest element of the array smaller than or equal to "value"
  '''
  delta = np.asarray(array) - value
  idx = np.where(delta <= 0, delta, -np.inf).argmax()
  return idx
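
A small check of how `find_previous` behaves may help, in particular when no earlier element exists. The timestamps below are hypothetical; the function body is copied from the cell above so that the snippet runs on its own.

```python
import numpy as np

def find_previous(array, value):
    # Same logic as the function defined above.
    delta = np.asarray(array) - value
    return np.where(delta <= 0, delta, -np.inf).argmax()

dl_times = np.array([1.0, 2.0, 3.0])  # hypothetical downlink timestamps [s]

# Last downlink packet at or before t = 2.5 s, and the idle gap that follows it.
idx = find_previous(dl_times, 2.5)
gap = 2.5 - dl_times[idx]

# No downlink packet precedes t = 0.5 s: every candidate is masked to -inf,
# argmax falls back to position 0, and the resulting gap is negative.
idx_none = find_previous(dl_times, 0.5)
gap_none = 0.5 - dl_times[idx_none]
print(idx, gap, idx_none, gap_none)
```

The negative gap in the second case is why labels that are not strictly positive have to be discarded afterwards.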

### Functions to be completed


The groundtruth values 'Next_Request_Time' and 'Next_Response_Vol' must be computed. 'Next_Response_Vol' is obtained from the 'DL_Vol' feature by assigning each value to the previous sample: values 1:last of 'DL_Vol' become values 0:last-1 of 'Next_Response_Vol'. For 'Next_Request_Time', we take the uplink requests (discarding the first one) and, for each of them, find the last preceding downlink packet using the function 'find_previous'. The difference between the two timestamps is the corresponding label.

In [None]:
def features_extraction(uplink, downlink):
  '''
  Complete this function to extract both features and groundtruth.

  NB: The features extraction process is the same as the one introduced during
  the lecture.
  '''
  dataset = pd.DataFrame(columns=['Request_Size','Inter_RR_Time','DL_Time','DL_Vol','DL_Size','PB_Time'])
  # ****************************************************************************
  # Feature 1: Client Request Size
  dataset['Request_Size'] = list(uplink.Length.values)

  # ****************************************************************************
  # Feature 2: Inter Request-Response Time
  rr_time = []
  response_time = []
  for t in uplink.Time:
    response_time.append(find_next(downlink.Time, t)) #index of next DL packet timestamp
    rr_time.append(downlink.Time.iloc[response_time[-1]] - t)

  dataset['Inter_RR_Time'] = rr_time

  # ****************************************************************************
  # Feature 3-4-5: Download Time, Download Volume, Download Size (# Packets)
  dt = []
  dv = []
  ds = []

  for rt1, rt2 in zip(response_time[:-1], response_time[1:]):
    #Download Time
    dt.append(downlink.Time.iloc[rt2-1] - downlink.Time.iloc[rt1])
    temp = timebased_filter(downlink, 0, downlink.Time.iloc[rt1], downlink.Time.iloc[rt2-1])

    #Download Volume
    dv.append(temp.Length.sum())
    
    #Download Size (# Packets)
    ds.append(temp.shape[0])

  # Last Iteration data might be corrupted due to drastic interruption of capture
  # process. If it is so, an error would occur during the features extraction.
  # To avoid this, we skip last HTTP iteration data when an error is raised
  # using the try-except logic below.
  try:
    # Consider also last HTTP iteration
    # Download Time
    dt.append(downlink.Time.iloc[-1] - downlink.Time.iloc[rt2])

    temp = timebased_filter(downlink, 0, downlink.Time.iloc[rt2], downlink.Time.iloc[-1])
    # Download Volume
    dv.append(temp.Length.sum())

    # Download Size (# Packets)
    ds.append(temp.shape[0])
  except Exception:
    # Last iteration is incomplete (e.g. capture interrupted): skip it
    pass

  dataset['DL_Time'] = dt
  dataset['DL_Vol'] = dv
  dataset['DL_Size'] = ds

  # ****************************************************************************
  # Feature 6: Playback Time
  pbt = list(uplink.Time.values)
  dataset['PB_Time'] = pbt
  # ****************************************************************************


  # Check Features Consistency
  dataset = dataset[(dataset > 0).all(axis=1)]
  dataset = dataset[dataset['DL_Time']<20]

  ###############################################################
  # TO BE COMPLETED

  ### EXTRACT GROUNDTRUTH HERE
  groundtruth = pd.DataFrame(columns=['Next_Request_Time','Next_Response_Vol'])
  # ****************************************************************************
  # GT 1: Next Request Time

  Next_Request_Time = []

  for t in uplink.Time:
    # difference between the large uplink packet and the closest previous downlink
    Next_Request_Time.append(t - downlink.Time.iloc[find_previous(downlink.Time, t)])

  groundtruth['Next_Request_Time'] = Next_Request_Time

  # the first uplink doesn't have downlink before -> time will be negative -> discard
  groundtruth = groundtruth[(groundtruth['Next_Request_Time'] > 0)]
  # ****************************************************************************
  # GT 2: Next Response Volume
  
  # basically, the feature of the next sample (starting from the 2nd)
  next_resp_vol = dataset['DL_Vol'].iloc[1:]
  # but for the assignment, we will start from the first index, ignore the last row (can't extract the label for it)
  ind_list = list(dataset['DL_Vol'].iloc[:-1].index)
  # swap indexes
  next_resp_vol.index = ind_list 
  groundtruth['Next_Response_Vol'] =  next_resp_vol

  ###############################################################

  # Check Ground Truth Consistency
  groundtruth.dropna(inplace=True)

  # Align Dataset and Groundtruth
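  # .loc does not accept a set as indexer in recent pandas versions, so a sorted list is used
  intersection = sorted(set(dataset.index).intersection(set(groundtruth.index)))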
  dataset = dataset.loc[intersection,:]
  groundtruth = groundtruth.loc[intersection,:]
  return dataset, groundtruth
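
The index swap used for 'Next_Response_Vol' appears to be equivalent to a one-row upward shift. The sketch below shows that idea with `shift(-1)` on a toy table; the volumes are made up, and this is an alternative formulation rather than the code used in `features_extraction`.

```python
import pandas as pd

# Toy feature table with hypothetical download volumes [bytes].
toy = pd.DataFrame({"DL_Vol": [1500, 500, 1000, 700]}, index=[0, 1, 2, 3])

# shift(-1) moves every value one row up, so row i receives the volume of row i+1.
labels = pd.DataFrame({"Next_Response_Vol": toy["DL_Vol"].shift(-1)})

# The last row has no successor: its label is NaN and is dropped.
labels = labels.dropna()

# Keep only the rows that have both features and a label.
common = toy.index.intersection(labels.index)
toy, labels = toy.loc[common], labels.loc[common]
print(labels["Next_Response_Vol"].tolist())
```

Because `shift` introduces a NaN, the label column becomes floating point, which matches the `.0` values visible in the groundtruth preview.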

### Write your code here



Now we loop over each file in the folder, read it, extract the uplink and downlink traffic, and filter the uplink to keep only the 'large' packets (nothing is specified for the downlink size or the playback start, so these are not filtered). We then extract the predefined features and the groundtruth from the downlink and the filtered uplink, and append them to the global dataset. Finally, we visualize several rows of the datasets to check their consistency.

In [None]:
# initialize the dataframes
data = pd.DataFrame()
ground = pd.DataFrame()

# Filter UL/DL Data

playback_start = 0 #sec
playback_end = 180 #sec (after -> discard)
min_ul_size = 100 #bytes
min_dl_size = 0 #bytes

for file_name in os.listdir():# for each file in the folder
    df = pd.read_csv(file_name) # read the file

    _, _, uplink, downlink = filter_traffic(df, 'googlevideo')# extract the uplink and the downlink

    dom_ul, dom_dl = find_dominant(uplink, downlink)# find the dominant up- and downlinks

    # filter the uplink to get packets with the length at least 'min_ul_size' bytes
    dom_ul = timebased_filter(dom_ul, min_ul_size, playback_start, playback_end)
    dom_dl = timebased_filter(dom_dl, min_dl_size, playback_start, playback_end)# NOT FILTERED, SINCE NOT REQUIRED BY THE TASK

    dataset, groundtruth = features_extraction(dom_ul, dom_dl)# extract the features and the ground truth

    # append everything to the global dataframe
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    data = pd.concat([data, dataset], ignore_index=True)
    ground = pd.concat([ground, groundtruth], ignore_index=True)
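
The request-detection rule applied inside the loop (uplink packets of at least 100 bytes within the playback window) can be sketched on a toy uplink trace with a plain boolean mask. The timestamps and lengths are hypothetical, and unlike `timebased_filter` this sketch does not reset the index.

```python
import pandas as pd

# Toy uplink trace: two requests, one small ACK-sized packet, one late packet.
toy_uplink = pd.DataFrame({
    "Time": [0.5, 0.6, 5.5, 181.0],   # seconds
    "Length": [661, 66, 661, 661],    # bytes
})

min_ul_size = 100    # bytes
playback_start = 0   # seconds
playback_end = 180   # seconds

# between() is inclusive on both ends by default, like the >= / <= tests above.
mask = (
    (toy_uplink["Length"] >= min_ul_size)
    & toy_uplink["Time"].between(playback_start, playback_end)
)
requests = toy_uplink[mask]
print(requests)
```

Only the packets at 0.5 s and 5.5 s survive: the 66-byte packet is too small and the last one falls after the 180 s limit.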

Short consistency analysis of the data (features).

In [None]:
data.head(5)

| | Request_Size | Inter_RR_Time | DL_Time | DL_Vol | DL_Size | PB_Time |
|---|---|---|---|---|---|---|
| 0 | 661 | 0.003477 | 0.035957 | 1493386 | 989 | 0.542976 |
| 1 | 661 | 0.010791 | 0.013774 | 509484 | 338 | 5.540538 |
| 2 | 661 | 0.009317 | 0.02651 | 1018662 | 675 | 11.552147 |
| 3 | 661 | 0.010225 | 0.027004 | 1008818 | 669 | 16.561117 |
| 4 | 661 | 0.009692 | 0.014666 | 510478 | 339 | 21.575435 |


Short consistency analysis of the groundtruth (labels). We see that the 'Next_Response_Vol' of the ith sample corresponds to the 'DL_Vol' of the next row in 'data'.

In [None]:
ground.head(5)

| | Next_Request_Time | Next_Response_Vol |
|---|---|---|
| 0 | 0.004185 | 509484.0 |
| 1 | 4.958128 | 1018662.0 |
| 2 | 5.987044 | 1008818.0 |
| 3 | 4.973143 | 510478.0 |
| 4 | 4.977089 | 521404.0 |


Now we train two random forest regressors, one per label, to predict 'Next_Response_Vol' and 'Next_Request_Time' independently. We also use 10-fold CV to obtain a performance value with larger support, as well as a standard deviation for each RMSE.

In [None]:
kf = KFold(n_splits=10)# 10-fold CV

# lists to keep track of the performance (RMSE)
RMSE_time = []
RMSE_volume = []

for train_index, test_index in kf.split(data):
    data_train, data_test = data.iloc[train_index,:], data.iloc[test_index,:] # get the train and test data

    labels_train_time, labels_test_time = ground['Next_Request_Time'].iloc[train_index], ground['Next_Request_Time'].iloc[test_index]# get the time labels
    labels_train_volume, labels_test_volume = ground['Next_Response_Vol'].iloc[train_index], ground['Next_Response_Vol'].iloc[test_index]# get the volume labels

    norm_train, norm_test = normalize_dataset(data_train, data_test)

    # regressor for the arrival time of next HTTP Request
    rf_reg_time = RandomForestRegressor()
    rf_reg_time.fit(norm_train, labels_train_time)# fit the regressor on the data-time labels
    prediction_rf_time = rf_reg_time.predict(norm_test)

    # regressor for the size of next burst from the server
    rf_reg_volume = RandomForestRegressor()
    rf_reg_volume.fit(norm_train, labels_train_volume)# fit the regressor on the data-volume labels
    prediction_rf_volume = rf_reg_volume.predict(norm_test)

    # compute the RMSEs from the available MSE metric
    RMSE_time.append(np.sqrt(metrics.mean_squared_error(labels_test_time, prediction_rf_time)))
    RMSE_volume.append(np.sqrt(metrics.mean_squared_error(labels_test_volume, prediction_rf_volume)))

print("RMSE of the arrival time of next HTTP Request = {:.2f} [s] (std = {:.2f})\n".format(np.mean(RMSE_time), np.std(RMSE_time)))
print("RMSE of the size of next burst from the server = {:.2f} [KB] (std = {:.2f})\n".format(np.mean(RMSE_volume)/10**3, np.std(RMSE_volume)/10**3))

RMSE of the arrival time of next HTTP Request = 5.44 [s] (std = 2.37)

RMSE of the size of next burst from the server = 350.45 [KB] (std = 123.22)
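
For reference, the reported figures are the mean and standard deviation of the per-fold RMSE. The sketch below computes the same quantities directly from the definition with numpy, on made-up predictions for two folds; the `rmse` helper is introduced only for this illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the average squared residual.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical (true, predicted) values for two folds.
folds = [
    ([5.0, 5.0, 6.0], [5.0, 4.0, 6.0]),
    ([5.0, 7.0], [5.0, 5.0]),
]

fold_rmse = [rmse(t, p) for t, p in folds]

# np.std uses the population formula (ddof=0) by default.
mean_rmse = np.mean(fold_rmse)
std_rmse = np.std(fold_rmse)
print(fold_rmse, mean_rmse, std_rmse)
```

A single large residual dominates the second fold, which illustrates why the RMSE can vary noticeably from one fold to another.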



The results are consistent with the reference values suggested in the presentation. The exact figures depend on how the data are partitioned and on which samples fall into the training and test sets, but overall the results are adequate.