# HOMEWORK 3 - Network Measurement and Data Analysis Lab

*Stefano Maxenti, 10526141, 970133*

# **Homework**

Complete the following tasks:

* Use a dataset of 21 Video Sessions
* Recognize the Video Server(s) IP and select video traffic (***if more than one Server is found, keep the dominant flow only***)
* Detect Video Client HTTP Requests (Uplink packets with size larger than or equal to 100 Bytes)
* Compute features to predict:
 1.   When the next UL Request is sent by the Video Client 
 2.   How large is the response of the Server to the next UL Request

**N.B.**: Below, you can find a list of useful functions for the tasks at hand (introduced during class).

### Index

[LIBRARIES AND FUNCTIONS](#libraries_and_functions)

[CLASS APPROACH](#class)

[CACHED APPROACH](#cached)

## Libraries and functions
<a id='libraries_and_functions'></a>

In [1]:
import json
import urllib
from urllib.request import urlopen

import numpy as np
import pandas as pd
import statistics
from matplotlib import pyplot as plt
import plotly.graph_objects as go

from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from scipy import stats

import random
import math
import os
import re
from os import listdir
from os.path import isfile, join, splitext
import warnings
warnings.filterwarnings('ignore')

In [2]:
# https://stackoverflow.com/a/5967539
# Very basic implementation of human ordering, useful to read all files in order.
# It will prove useful when we preprocess data because order might have an effect.
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [ atoi(c) for c in re.split(r'(\d+)', text) ]

### Functions Ready-To-Use (although with some modifications)

In [3]:
def filter_traffic(data, domain, cached_ips, opt=False): # the opt parameter will be explained later on
   
    # Look in DNS responses for the requested domain (e.g. 'googlevideo')
    dns_data = data[data['Protocol']=='DNS']
    dns = dns_data[dns_data['Info'].apply(lambda x: domain in x and 'response' in x)]
    ips = dns.Address.values
    if (opt):
        for i in ips:
            cached_ips.append(i)
        ips = cached_ips

    server_names = dns.Name.values

    # Filtering on either "Source" or "Destination" IP, get the 
    # rows of the dataset that contain at least one of the selected IPs
    downlink = data[data['Source'].apply(lambda x: x in ips)].dropna(subset=['Length']) 

    uplink = data[data['Destination'].apply(lambda x: x in ips)].dropna(subset=['Length'])
    
    return uplink, downlink, ips


In [4]:
def find_dominant(uplink, downlink, verbose=False):

    # Expressed in MB

    # Order flows by cumulative DL Volume
    flows_DL = downlink.groupby(['Source','Destination'])['Length'].sum()/(10**6)
    # Get (Source,Destination) IPs of dominant flows
    dom_id = flows_DL[flows_DL==max(flows_DL)].index[0]
    if (verbose):
        print(flows_DL)
        print("DOM_ID:", dom_id)
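    # Filter traffic selecting the dominant flow
    dom_dl = downlink[downlink['Source'] == dom_id[0]]
    # NOTE: the uplink is selected by client IP only (dom_id[1]), so requests sent
    # by the same client to other video servers are kept as well.
    dom_ul = uplink[(uplink['Source'] == dom_id[1])]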
    return dom_ul, dom_dl


def timebased_filter(data, length=None, min_time=None, max_time=None):
    '''
    :param data: pd dataframe to be filtered. Must contain columns: "Length" and "Time"
    :param length: all packets shorter than length [Bytes] will be discarded (default 0)
    :param min_time: all packets with timestamp smaller than min_time [s] will be discarded (default 0)
    :param max_time: all packets with timestamp larger than max_time [s] will be discarded (default 1000)
    '''

    if length is None:
        length=0
    if min_time is None:
        min_time = 0
    if max_time is None:
        max_time = 1000
  
    filtered_data = data.copy().reset_index()
    mask = (filtered_data['Length']>=length) & (filtered_data['Time']>=min_time) & (filtered_data['Time']<= max_time)
    filtered_data = filtered_data.loc[mask[mask ==True].index]

    return filtered_data


def find_next(array, value):
    '''
    :param array: np.array, array of floats
    :param value: float, reference value
    :return: position of the closest element of the array greater than or equal to "value"
             (0 if no such element exists)
    '''
    delta = np.asarray(array) - value
    idx = np.where(delta >= 0, delta, np.inf).argmin()

    return idx
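
As a quick sanity check, the snippet below re-declares `find_next` and runs it on a few hand-made timestamps (hypothetical values, not taken from the captures). It also shows a corner case worth keeping in mind: when no element is greater than or equal to the reference value, `argmin` over an all-`inf` array returns position 0.

```python
import numpy as np

def find_next(array, value):
    # Same logic as the function above, repeated so the snippet is self-contained
    delta = np.asarray(array) - value
    return np.where(delta >= 0, delta, np.inf).argmin()

dl_times = [1.0, 2.5, 4.0]              # hypothetical downlink timestamps [s]
idx_regular = find_next(dl_times, 2.0)  # first packet at or after t = 2.0 s
idx_exact = find_next(dl_times, 4.0)    # an exact match counts as "next"
idx_missing = find_next(dl_times, 5.0)  # nothing at or after t = 5.0 s: falls back to 0
print(idx_regular, idx_exact, idx_missing)
```

Because of the fallback to 0, a request sent after the last downlink packet would be silently paired with the first packet of the capture, which is one reason the consistency checks on the extracted features are useful.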

In [5]:
# Very basic standard normalization
def normalize_dataset(training_set, test_set):
    
    mean_train = training_set.mean()
    std_train = training_set.std()
    norm_train = (training_set - mean_train)/std_train
    norm_test = (test_set - mean_train)/std_train  

    return norm_train, norm_test, mean_train, std_train

### Functions to be completed


In [6]:
def features_extraction(uplink, downlink, playback_start=2, playback_end=180, min_ul_size=100, min_dl_size=50):
    '''
    Complete this function to extract both features and groundtruth.

    NB: The features extraction process is the same as the one introduced during
    the lecture. 
    '''
    
    uplink = timebased_filter(uplink, min_ul_size, playback_start, playback_end)
    downlink = timebased_filter(downlink, min_dl_size, playback_start, playback_end)

    dataset = pd.DataFrame(columns=['Request_Size','Inter_RR_Time','DL_Time','DL_Vol','DL_Size','PB_Time'])
    # ****************************************************************************
    # Feature 1: Client Request Size
    dataset['Request_Size'] = list(uplink.Length.values)

    # ****************************************************************************
    # Feature 2: Inter Request-Response Time
    rr_time = []
    response_time = []
    for t in uplink.Time:
        response_time.append(find_next(downlink.Time, t)) #index of next DL packet timestamp 
        rr_time.append(downlink.Time.iloc[response_time[-1]] - t)

    dataset['Inter_RR_Time'] = rr_time

    # ****************************************************************************
    # Feature 3-4-5: Download Time, Download Volume, Download Size (# Packets) 
    dt = []
    dv = []
    ds = []

    for rt1, rt2 in zip(response_time[:-1], response_time[1:]):
        
        #Download Time
        dt.append(downlink.Time.iloc[rt2-1] - downlink.Time.iloc[rt1])

        temp = timebased_filter(downlink, 0, downlink.Time.iloc[rt1], downlink.Time.iloc[rt2-1])

        #Download Volume
        dv.append(temp.Length.sum())

        #Download Size (# Packets) 
        ds.append(temp.shape[0])

    # Last Iteration data might be corrupted due to drastic interruption of capture 
    # process. If it is so, an error would occur during the features extraction.
    # To avoid this, we skip last HTTP iteration data when an error is raised 
    # using the try-except logic below.
    try:
        # Consider also last HTTP iteration
        #Download Time
        dt.append(downlink.Time.iloc[-1] - downlink.Time.iloc[rt2])

        temp = timebased_filter(downlink, 0, downlink.Time.iloc[rt2], downlink.Time.iloc[-1])
        #Download Volume
        dv.append(temp.Length.sum())

        #Download Size (# Packets) 
        ds.append(temp.shape[0])
    except Exception:
        pass

    dataset['DL_Time'] = dt
    dataset['DL_Vol'] = dv
    dataset['DL_Size'] = ds

    # ****************************************************************************
    # Feature 6: Playback Time
    pbt = list(uplink.Time.values)
    dataset['PB_Time'] = pbt
    # ****************************************************************************

    #print(dataset)
    # Check Features Consistency
    dataset = dataset[(dataset > 0).all(1)]
    dataset = dataset[dataset['DL_Time']<20]
    #print(len(dataset))

    
    ###############################################################
    # TO BE COMPLETED
    ### EXTRACT GROUNDTRUTH HERE
    groundtruth = pd.DataFrame(columns=['Next_Request_Time','Next_Response_Vol'])
    # ****************************************************************************
    # GT 1: Next Request Time
    # The service time of a request is its request-response delay plus its download time
    service_time = dataset['Inter_RR_Time'] + dataset['DL_Time']
    # The next request is sent after the gap between two consecutive requests
    # (difference of their playback times), minus the service time of the current request.
    groundtruth['Next_Request_Time'] = dataset['PB_Time'].diff().shift(periods=-1) - service_time
    # ****************************************************************************
    # GT 2: Next Response Volume
    #print(groundtruth)
  
    groundtruth['Next_Response_Vol'] = dataset['DL_Vol'].shift(periods=-1) # the next response volume is the DL volume of the following row
    ## For both targets, shift(-1) aligns the "next" value with the current row while preserving the indexes.
    ###############################################################

    groundtruth.dropna(inplace=True)
    
    # Keep only the rows present in both frames (recent pandas versions reject a set as indexer)
    intersection = dataset.index.intersection(groundtruth.index)
    dataset = dataset.loc[intersection,:]
    groundtruth = groundtruth.loc[intersection,:]

    return dataset, groundtruth
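
To make the ground-truth definition concrete, here is a minimal sketch on three made-up requests (the numbers are hypothetical, chosen only to keep the arithmetic easy to follow). It reproduces the two shifted columns computed at the end of `features_extraction`.

```python
import pandas as pd

# Hypothetical toy values, not taken from the captures
toy = pd.DataFrame({
    'Inter_RR_Time': [0.1, 0.2, 0.1],   # request-response delay [s]
    'DL_Time':       [1.0, 2.0, 1.5],   # download duration [s]
    'DL_Vol':        [1000, 2000, 1500],  # downloaded bytes
    'PB_Time':       [2.0, 6.0, 11.0],  # timestamp of each request [s]
})

service_time = toy['Inter_RR_Time'] + toy['DL_Time']
gap_to_next_request = toy['PB_Time'].diff().shift(-1)

toy_gt = pd.DataFrame({
    'Next_Request_Time': gap_to_next_request - service_time,
    'Next_Response_Vol': toy['DL_Vol'].shift(-1),
}).dropna()  # the last request has no "next" one, so its row is dropped
print(toy_gt)
```

For the first request the gap to the next one is 6.0 - 2.0 = 4.0 s and the service time is 1.1 s, so the target is 2.9 s of idle time before the next request, which will download 2000 bytes.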

The `train_test_model` function below is a generic harness to train and test various regressors, performing K-Fold cross-validation and normalizing both the training and the test folds with statistics computed on the training fold only.

I want to compare the performance of various regressors, taking advantage of the common estimator interface that sklearn provides.
I use:
* Random forest regressor
* Multi-Layer-Perceptron regressor
* Ridge regressor
* Lasso regressor
* ElasticNet regressor
* ExtraTrees regressor
* DecisionTree regressor
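
The normalization step can be looked at in isolation. The sketch below uses synthetic targets (an assumption, generated with NumPy rather than read from the captures) to show that the test fold is scaled with the mean and standard deviation of the training fold, and that values in normalized units are mapped back to the original units by the inverse transformation before any RMSE is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic two-column targets standing in for one train/test split
y_train = rng.normal(loc=50.0, scale=10.0, size=(80, 2))
y_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Statistics come from the training fold only, so nothing leaks from the test fold
mean_train = y_train.mean(axis=0)
std_train = y_train.std(axis=0, ddof=1)   # ddof=1 mirrors the default of pandas' .std()

y_train_norm = (y_train - mean_train) / std_train
y_test_norm = (y_test - mean_train) / std_train

# De-normalization: the inverse mapping applied to predictions in normalized units
recovered = y_test_norm * std_train + mean_train
print(np.allclose(recovered, y_test))
```

Only the training fold ends up with exactly zero mean and unit standard deviation; the test fold is merely close to that, which is expected.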

In [7]:
regressor_list = [RandomForestRegressor(), MLPRegressor(), Ridge(), Lasso(), ElasticNet(), ExtraTreesRegressor(), DecisionTreeRegressor()]

def train_test_model(model, kf, X, y):
    rmse_request_time = []
    rmse_response_vol = []
    for train, test in kf.split(X, y):
        X_train, X_test, mean0, std0  = normalize_dataset(X.iloc[train], X.iloc[test])
        y_train, y_test, mean1, std1  = normalize_dataset(y.iloc[train], y.iloc[test])
        #y_train, y_test = y.iloc[train], y.iloc[test]
        
        # Just a note on the following lines: unlike Keras, each call to fit()
        # retrains the model from scratch instead of resuming from its previous state.
        # See: https://scikit-learn.org/stable/tutorial/basic/tutorial.html#refitting-and-updating-parameters
        # So it is as if we were explicitly creating a new model for each fold,
        # which is the way to go when performing K-Fold cross-validation.
        # For readability, a fresh unfitted copy could be created explicitly at each fold, e.g.
        #     model = sklearn.base.clone(model)   # would also require `import sklearn.base`
        # which keeps the code independent of the specific regressor class.
        model.fit(X_train, y_train)
        p = model.predict(X_test) * std1.values + mean1.values # denormalization before computing RMSEs
        rmse_rt_ = math.sqrt(metrics.mean_squared_error((y.iloc[test])['Next_Request_Time'], pd.DataFrame(p)[0]))
        rmse_rv_ = math.sqrt(metrics.mean_squared_error((y.iloc[test])['Next_Response_Vol'], pd.DataFrame(p)[1]))/1000 # to KB
        rmse_request_time.append(rmse_rt_)
        rmse_response_vol.append(rmse_rv_)
    return rmse_request_time, rmse_response_vol

## Class approach
<a id='class'></a>

This function simply processes all CSV files in the given path and returns the dataset and the groundtruth as two separate DataFrames with aligned indexes.

The *opt* parameter will be explained later on.

In [8]:
def preprocess_data(path, opt=False, verbose=False, save=False, remove_outliers=True):
    tcpdumpfiles = [f for f in listdir(path) if (isfile(join(path, f)) and splitext(join(path,f))[-1] == '.csv')]
    tcpdumpfiles.sort(key=natural_keys)

    X = pd.DataFrame() # dataset
    y = pd.DataFrame() # groundtruth

    cached_ips = [] # explained later on
    for f in tcpdumpfiles:
        df = pd.read_csv(path+"/"+f)
        print(f, end = ": ")
        domain_name = 'googlevideo'
        uplink, downlink, cached_ips = filter_traffic(df, domain_name, cached_ips, opt=opt)
        dom_ul, dom_dl = find_dominant(uplink, downlink, verbose=verbose)
        
        if (opt):
            cached_ips = np.unique(cached_ips).tolist()
            if (verbose):
                print("cached ips: " ,cached_ips)
                print("\n\n")
            
        X_, y_ = (features_extraction(dom_ul, dom_dl))
        
        print("DS:", len(X_), "GT:", len(y_)) # these values must be the same

        # DataFrame.append() is deprecated (and removed in pandas 2.0), so pd.concat() is used instead
        X = pd.concat([X, X_], ignore_index=True)
        y = pd.concat([y, y_], ignore_index=True)
    if (remove_outliers): # we use Z-Score metric to filter out outliers
        df = pd.concat([X, y], axis=1) # we need to merge all features
        df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
        X = df[['Request_Size', 'Inter_RR_Time', 'DL_Time', 'DL_Vol', 'DL_Size','PB_Time']]
        y = df[['Next_Request_Time', 'Next_Response_Vol']]
        
    if (save): # save the resulting CSVs with different names according to the parameters
        if (opt):
            X.to_csv('dataset_opt.csv', index=False)
            y.to_csv('groundtruth_opt.csv', index=False)
        else:
            X.to_csv('dataset.csv', index=False)
            y.to_csv('groundtruth.csv', index=False)
    print("\n\n\nX:" ,len(X), "y: " ,len(y)) # these values must be the same
    return X, y
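
The `remove_outliers` branch of `preprocess_data` relies on a Z-score filter. As a minimal sketch of that idea, the snippet below applies the same rule to made-up numbers: a row is kept only if every column lies within 3 standard deviations of the column mean. The Z-score is computed by hand with NumPy, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Made-up samples: 20 ordinary rows plus one row with an extreme first column
rows = [[9.0, 1.0], [11.0, 2.0]] * 10 + [[100.0, 1.5]]
data = np.array(rows)

# Population Z-score per column (ddof=0)
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Keep only the rows whose every column lies within 3 standard deviations
kept = data[(np.abs(z) < 3).all(axis=1)]
print(data.shape, "->", kept.shape)
```

Note that the threshold of 3 can only be exceeded when there are enough samples: with very few rows, even an extreme value cannot reach a Z-score of 3.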

In [9]:
path = 'Captures'
X, y = preprocess_data(path, save=True)

Capture_v2_0.csv: DS: 46 GT: 46
Capture_v2_1.csv: DS: 42 GT: 42
Capture_v2_2.csv: DS: 38 GT: 38
Capture_v2_3.csv: DS: 0 GT: 0
Capture_v2_4.csv: DS: 24 GT: 24
Capture_v2_5.csv: DS: 9 GT: 9
Capture_v2_6.csv: DS: 6 GT: 6
Capture_v2_7.csv: DS: 76 GT: 76
Capture_v2_8.csv: DS: 18 GT: 18
Capture_v2_9.csv: DS: 6 GT: 6
Capture_v2_10.csv: DS: 4 GT: 4
Capture_v2_11.csv: DS: 4 GT: 4
Capture_v2_12.csv: DS: 8 GT: 8
Capture_v2_13.csv: DS: 5 GT: 5
Capture_v2_14.csv: DS: 12 GT: 12
Capture_v2_15.csv: DS: 1 GT: 1
Capture_v2_16.csv: DS: 15 GT: 15
Capture_v2_17.csv: DS: 6 GT: 6
Capture_v2_18.csv: DS: 7 GT: 7
Capture_v2_19.csv: DS: 1 GT: 1
Capture_v2_20.csv: DS: 10 GT: 10
Capture_v2_21.csv: DS: 6 GT: 6



X: 318 y:  318


Let's use cross-validation to get an idea of the performance.

In [10]:
## CLASS
X = pd.read_csv('dataset.csv')
y = pd.read_csv('groundtruth.csv')

In [11]:
kf = KFold(n_splits = 10, shuffle=True, random_state=42)
for reg in regressor_list:
    rmse_request_time, rmse_response_vol = train_test_model(reg, kf, X, y)
    print("\n", reg.__class__.__name__, "\n\t\t\t\t", statistics.mean(rmse_request_time), statistics.mean(rmse_response_vol))
    print("***********************************************************************************")


 RandomForestRegressor 
				 4.366923345995478 359.83011705101086
***********************************************************************************

 MLPRegressor 
				 6.778015256899018 506.0815759404145
***********************************************************************************

 Ridge 
				 4.731863523052492 444.2557409898737
***********************************************************************************

 Lasso 
				 4.7181510966379285 484.95480563411684
***********************************************************************************

 ElasticNet 
				 4.7181510966379285 445.23858414871285
***********************************************************************************

 ExtraTreesRegressor 
				 4.51564053636824 368.4599840725476
***********************************************************************************

 DecisionTreeRegressor 
				 5.874781336508752 481.90549377897185
***********************************************************************************

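With the exception of MLP and DecisionTree, all regressors predict *Next_Request_Time* with a similar error (mean RMSE between roughly 4.4 and 4.7 s), whereas the error on *Next_Response_Vol* varies more (mean RMSE between roughly 360 and 506 KB).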

The best performance for both features is obtained with a Random Forest regressor.

## Cached approach (a possible improvement)
<a id='cached'></a>

So far, I have followed the same approach used in class. I notice something strange though: a good number of captures yield zero (or very few) samples.

In addition to that, here I show the size of the dominant flow per file:

In [12]:
path = 'Captures'
tcpdumpfiles = [f for f in listdir(path) if (isfile(join(path, f)) and splitext(join(path,f))[-1] == '.csv')]
tcpdumpfiles.sort(key=natural_keys)
opt = False
verbose = True

dummy = [] # not needed here, just a placeholder

for f in tcpdumpfiles:
    df = pd.read_csv(path+"/"+f)
    print(f, os.path.getsize(path+'/'+f)/10**6, "MB")
    
    domain_name = 'googlevideo'
    uplink, downlink, dummy = filter_traffic(df,domain_name, dummy, opt=opt)
    dom_ul, dom_dl = find_dominant(uplink, downlink, verbose=verbose)
    print("\n")

Capture_v2_0.csv 7.199843 MB
Source         Destination
91.81.217.140  192.168.1.6    48.812378
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_1.csv 7.598801 MB
Source         Destination
91.81.217.141  192.168.1.6    9.993622
Name: Length, dtype: float64
DOM_ID: ('91.81.217.141', '192.168.1.6')


Capture_v2_2.csv 8.549215 MB
Source         Destination
91.81.217.140  192.168.1.6    44.867013
91.81.217.141  192.168.1.6    13.785031
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_3.csv 8.851757 MB
Source          Destination
74.125.163.138  192.168.1.6    0.008266
74.125.99.91    192.168.1.6    0.008056
Name: Length, dtype: float64
DOM_ID: ('74.125.163.138', '192.168.1.6')


Capture_v2_4.csv 8.607993 MB
Source          Destination
74.125.104.103  192.168.1.6     0.012088
74.125.111.106  192.168.1.6     0.015628
74.125.154.138  192.168.1.6     1.054021
91.81.217.140   192.168.1.6    17.478906
Name: Length, dtype: float

The dominant flow of some captures (for example, "Capture_v2_3.csv") is much smaller than 1 MB, whereas a 3-minute YouTube video is usually bigger. The capture file itself, however, is quite big.

My first hypothesis is that there is some extra traffic, maybe coming from other applications.

To confirm or deny it, I need some more information.

The *opt* parameter comes in handy: when it is set, the filter not only looks for the *googlevideo* domain in the DNS responses of the current capture, but also keeps in memory, in a list (*cached_ips*), all the IPs that were resolved for YouTube in previous captures. In this way, we still match all *googlevideo* domains, but we also check whether an IP used in a previous capture is providing YouTube content. Restricting the search to this list avoids picking up flows from applications other than the one required by the homework.
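
The accumulation of the cache across captures can be sketched without any capture file. In the snippet below, `update_cache` is a hypothetical helper (it does not exist in the notebook) and the addresses are placeholders from the documentation ranges, not real YouTube servers.

```python
def update_cache(cached_ips, dns_ips):
    """Return the sorted union of previously seen and newly resolved IPs."""
    return sorted(set(cached_ips) | set(dns_ips))

# Hypothetical DNS answers found in three consecutive captures
captures = [
    ['192.0.2.10'],      # first capture: the video server is resolved via DNS
    [],                  # second capture: no lookup, the answer is cached by the client
    ['198.51.100.7'],    # third capture: a new server is resolved
]

cached_ips = []
for dns_ips in captures:
    cached_ips = update_cache(cached_ips, dns_ips)
    print(cached_ips)
```

In the second capture the DNS answers are empty, yet traffic from `192.0.2.10` would still be recognized as video traffic because the address is remembered from the first capture.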

I then call the *find_dominant()* function with the verbose flag to see all flows in the capture.

Let's find out what happens.

In [13]:
path = 'Captures'
tcpdumpfiles = [f for f in listdir(path) if (isfile(join(path, f)) and splitext(join(path,f))[-1] == '.csv')]
tcpdumpfiles.sort(key=natural_keys)
opt = True ## <=====
verbose = True

cached_ips = []

for f in tcpdumpfiles:
    df = pd.read_csv(path+"/"+f)
    print(f, os.path.getsize(path+'/'+f)/10**6, "MB")
    
    domain_name = 'googlevideo'
    uplink, downlink, cached_ips = filter_traffic(df,domain_name, cached_ips, opt=opt)
    dom_ul, dom_dl = find_dominant(uplink, downlink, verbose=verbose)
    print("\n")

Capture_v2_0.csv 7.199843 MB
Source         Destination
91.81.217.140  192.168.1.6    48.812378
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_1.csv 7.598801 MB
Source         Destination
91.81.217.140  192.168.1.6    42.213374
91.81.217.141  192.168.1.6     9.993622
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_2.csv 8.549215 MB
Source         Destination
91.81.217.140  192.168.1.6    44.867013
91.81.217.141  192.168.1.6    13.785031
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_3.csv 8.851757 MB
Source          Destination
74.125.163.138  192.168.1.6     0.008266
74.125.99.91    192.168.1.6     0.008056
91.81.217.140   192.168.1.6    22.007896
91.81.217.141   192.168.1.6    26.819344
Name: Length, dtype: float64
DOM_ID: ('91.81.217.141', '192.168.1.6')


Capture_v2_4.csv 8.607993 MB
Source          Destination
74.125.104.103  192.168.1.6     0.012088
74.125.111.106  192.168.1

Capture_v2_18.csv 7.475286 MB
Source           Destination
173.194.160.200  192.168.1.6     1.708232
173.194.187.71   192.168.1.6     0.335498
173.194.188.136  192.168.1.6     0.187882
173.194.188.72   192.168.1.6     1.005700
74.125.111.102   192.168.1.6     2.673684
74.125.111.106   192.168.1.6     0.984788
74.125.154.138   192.168.1.6     1.947016
74.125.160.202   192.168.1.6     1.663054
74.125.99.137    192.168.1.6     1.441036
74.125.99.166    192.168.1.6     0.200582
74.125.99.168    192.168.1.6     0.493608
91.81.217.140    192.168.1.6    15.990186
91.81.217.141    192.168.1.6     0.004896
Name: Length, dtype: float64
DOM_ID: ('91.81.217.140', '192.168.1.6')


Capture_v2_19.csv 7.646476 MB
Source           Destination
173.194.160.200  192.168.1.6     0.659220
173.194.182.135  192.168.1.6     0.017631
173.194.187.71   192.168.1.6     0.329843
173.194.188.136  192.168.1.6     0.000264
173.194.188.72   192.168.1.6     1.513027
74.125.111.102   192.168.1.6     1.172326
74.125.111.1

Let's concentrate on the same capture as before (Capture_v2_3): we can see that there are two huge flows coming from IPs 91.81.217.140 and 91.81.217.141.

The same happens with all the captures.

Notice that these IPs were the YouTube server IPs of other captures in the previous experiment (for example, 91.81.217.140 in Capture_v2_0)!

If we assume that the capture files were created one after another, a plausible conclusion is that the browser does not repeat the DNS lookup because the answer is already cached, and gets content from the previously resolved IP.

This is why I read the files in order, so as to realistically reproduce the actual capture sequence, taking into account the flow of DNS requests.
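
The effect of the reading order can be checked on a few file names. The sketch below repeats the `atoi` and `natural_keys` helpers defined at the top of the notebook and compares them with a plain lexicographic sort, which would process capture 10 before capture 2 and therefore build the IP cache in the wrong order.

```python
import re

def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    # Split digits from non-digits so that numbers are compared as integers
    return [atoi(c) for c in re.split(r'(\d+)', text)]

files = ['Capture_v2_10.csv', 'Capture_v2_2.csv', 'Capture_v2_1.csv']

plain_order = sorted(files)                      # lexicographic order
natural_order = sorted(files, key=natural_keys)  # human (numeric) order
print(plain_order)
print(natural_order)
```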

Let's get some details on those IPs.

In [14]:
cached_ips = np.unique(cached_ips).tolist()

for ip in cached_ips:
    token = ''
    response = urlopen('http://ipinfo.io/'+ip+'/org'+token)
    html_content = response.read()
    encoding = response.headers.get_content_charset('utf-8')
    html_text = html_content.decode(encoding)
    print(ip, ":", html_text)
    response.close()

172.217.132.137 : AS15169 Google LLC

173.194.160.200 : AS15169 Google LLC

173.194.160.219 : AS15169 Google LLC

173.194.182.135 : AS15169 Google LLC

173.194.182.138 : AS15169 Google LLC

173.194.182.230 : AS15169 Google LLC

173.194.187.136 : AS15169 Google LLC

173.194.187.71 : AS15169 Google LLC

173.194.188.105 : AS15169 Google LLC

173.194.188.136 : AS15169 Google LLC

173.194.188.230 : AS15169 Google LLC

173.194.188.72 : AS15169 Google LLC

209.85.226.38 : AS15169 Google LLC

74.125.104.103 : AS15169 Google LLC

74.125.105.10 : AS15169 Google LLC

74.125.110.102 : AS15169 Google LLC

74.125.111.102 : AS15169 Google LLC

74.125.111.105 : AS15169 Google LLC

74.125.111.106 : AS15169 Google LLC

74.125.153.11 : AS15169 Google LLC

74.125.153.24 : AS15169 Google LLC

74.125.153.59 : AS15169 Google LLC

74.125.153.7 : AS15169 Google LLC

74.125.154.138 : AS15169 Google LLC

74.125.160.202 : AS15169 Google LLC

74.125.160.38 : AS15169 Google LLC

74.125.162.39 : AS15169 Google LLC



It is interesting to highlight that two of the IPs (the ones with much more traffic according to the output above) belong to the Vodafone network. This makes sense, since it is common for an OTT provider to place its servers inside the ISP network to reduce latency, improve performance and avoid the expensive use of transit links.

Of course, not all videos are always available in the YouTube cache inside the Vodafone network; for those, the browser reaches IPs belonging directly to the Google network.

I now recreate the dataset with the *opt* flag set to True.

Let's see what happens.

In [15]:
path = 'Captures'
X, y = preprocess_data(path, opt=True, verbose=False, save=True)

Capture_v2_0.csv: DS: 46 GT: 46
Capture_v2_1.csv: DS: 34 GT: 34
Capture_v2_2.csv: DS: 38 GT: 38
Capture_v2_3.csv: DS: 61 GT: 61
Capture_v2_4.csv: DS: 65 GT: 65
Capture_v2_5.csv: DS: 66 GT: 66
Capture_v2_6.csv: DS: 86 GT: 86
Capture_v2_7.csv: DS: 91 GT: 91
Capture_v2_8.csv: DS: 26 GT: 26
Capture_v2_9.csv: DS: 39 GT: 39
Capture_v2_10.csv: DS: 46 GT: 46
Capture_v2_11.csv: DS: 45 GT: 45
Capture_v2_12.csv: DS: 35 GT: 35
Capture_v2_13.csv: DS: 15 GT: 15
Capture_v2_14.csv: DS: 14 GT: 14
Capture_v2_15.csv: DS: 27 GT: 27
Capture_v2_16.csv: DS: 18 GT: 18
Capture_v2_17.csv: DS: 25 GT: 25
Capture_v2_18.csv: DS: 33 GT: 33
Capture_v2_19.csv: DS: 34 GT: 34
Capture_v2_20.csv: DS: 32 GT: 32
Capture_v2_21.csv: DS: 39 GT: 39



X: 826 y:  826


The dataset is definitely bigger. Let's train and test the same regressors as before.

In [16]:
## CACHED
X = pd.read_csv('dataset_opt.csv')
y = pd.read_csv('groundtruth_opt.csv')

In [17]:
kf = KFold(n_splits = 10, shuffle=True, random_state=42)
for reg in regressor_list:
    rmse_request_time, rmse_response_vol = train_test_model(reg, kf, X, y)
    print("\n", reg.__class__.__name__, "\n\t\t\t\t", statistics.mean(rmse_request_time), statistics.mean(rmse_response_vol))
    print("***********************************************************************************")


 RandomForestRegressor 
				 2.415683157315008 342.78866712048347
***********************************************************************************

 MLPRegressor 
				 2.4346561884279323 360.18022516716974
***********************************************************************************

 Ridge 
				 2.5000762609079596 368.41868049688895
***********************************************************************************

 Lasso 
				 2.6283018094399866 404.90921745410253
***********************************************************************************

 ElasticNet 
				 2.6283018094399866 404.90921745410253
***********************************************************************************

 ExtraTreesRegressor 
				 2.4031457802086678 365.9908501961011
***********************************************************************************

 DecisionTreeRegressor 
				 3.3820637160038527 462.34991722810815
****************************************************************************

We obtain better performance, most likely because the larger dataset reduces the effect of the outliers present in the smaller one.
They probably affected *Next_Request_Time* more because those flows, albeit from *googlevideo*, were likely not video but metrics or ads. The lowest RMSE for this target comes from the ExtraTrees regressor.

As in the previous case, overall, we obtain the best performance using a Random Forest regressor.