# Session 3: Hands-On Exercises - using netUnicorn in practice

In this session, we will implement an iterative approach to dataset collection using Trustee to analyze collected data and verify that our dataset doesn't have any obvious issues or shortcuts.

We will implement one of the tasks for our pipeline, combine the tasks into a pipeline, create an experiment, and deploy it. After collecting the data, we will use XAI tools to check it for shortcuts or other problems, then fix them and recollect the data to improve our dataset.

In [None]:
# Required imports

import os
import time
import pandas as pd

# netunicorn.client is responsible for your connection to the netunicorn instance
from netunicorn.client.remote import RemoteClient, RemoteClientException

# netunicorn.base contains all "building blocks" to create Tasks, Pipelines, Experiments, etc. 
from netunicorn.base import Experiment, ExperimentStatus, Pipeline, Task

## Problem definition

Problem: classification between YouTube and Vimeo traffic for the same video using raw PCAPs.  
Approach: watch YouTube and Vimeo, collect network traffic, mark flows as "youtube", "vimeo", "other", and try to build a classifier on top of this.

### Tasks implementation: ping
Let's implement a simple ping task that verifies connectivity to both YouTube and Vimeo before we start doing anything. During initialization, the task should do nothing except initialize the base class. When run, it should ping both "youtube.com" and "vimeo.com" with 3 packets each. If both pings finish successfully, it should return None.

In [None]:
class PingYouTubeAndVimeoTask(Task):
    """Pings YouTube and Vimeo and returns None if success"""

    # we need to ensure that inetutils-ping package is installed in our Debian-based image
    requirements = ["apt install -y inetutils-ping"]

    def __init__(self, *args, **kwargs):
        # nothing interesting here
        super().__init__(*args, **kwargs)

    def run(self):
        # implement actual ping logic here

        # remove the line below when the implementation is finished
        pass
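
One possible shape for the ping logic is sketched below, outside the netunicorn `Task` class so that it can be tried on its own. The helper names `ping_host` and `ping_all` and the `runner` parameter are hypothetical and exist only for illustration. Raising an exception on failure is also an assumption, since the exercise only specifies the success case.

```python
import subprocess


def ping_host(host, count=3, runner=subprocess.run):
    """Return True if `ping -c <count> <host>` exits with status 0."""
    completed = runner(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return completed.returncode == 0


def ping_all(hosts=("youtube.com", "vimeo.com"), count=3, runner=subprocess.run):
    """Ping every host; return None on success, raise if any host is unreachable."""
    unreachable = [host for host in hosts if not ping_host(host, count, runner)]
    if unreachable:
        # Assumption: raising signals failure; the exercise only defines
        # the success case (return None).
        raise RuntimeError(f"ping failed for: {', '.join(unreachable)}")
    return None
```

Inside `run`, a call such as `return ping_all()` could then replace the `pass` placeholder.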

## Pipeline and Experiment creation

Great! Now let's assemble our pipeline.

The pipeline should:
 - Ping YouTube and Vimeo
 - Start tcpdump capture
 - Watch YouTube several times
 - Stop tcpdump capture
 - Start tcpdump capture again
 - Watch Vimeo several times
 - Stop tcpdump capture and save all files for local analysis

In [None]:
# We will import all other tasks from the library instead of reimplementing them

# Tasks to start tcpdump and stop named tcpdump task
from netunicorn.library.tasks.capture.tcpdump import StartCaptureLinuxImplementation, StopNamedCaptureLinuxImplementation

# Tasks for watching the corresponding video platform
from netunicorn.library.tasks.video_watchers.youtube_watcher import WatchYouTubeVideoLinuxImplementation
from netunicorn.library.tasks.video_watchers.vimeo_watcher import WatchVimeoVideoLinuxImplementation

Now let's combine the pipeline

In [None]:
# create the pipeline and disable early stopping, so the pipeline keeps running even if a task fails
pipeline = Pipeline()
pipeline.early_stopping = False

# ping youtube and vimeo
pipeline.then(PingYouTubeAndVimeoTask())

# starting tcpdump for youtube
pipeline.then(StartCaptureLinuxImplementation(filepath="/tmp/capture_youtube.pcap", name="capture_youtube"))

# watching youtube several times
for _ in range(4):
    pipeline.then([
        WatchYouTubeVideoLinuxImplementation("https://www.youtube.com/watch?v=dQw4w9WgXcQ", 20),
        WatchYouTubeVideoLinuxImplementation("https://www.youtube.com/watch?v=dQw4w9WgXcQ", 20),
    ])

# stopping tcpdump for youtube
pipeline.then(StopNamedCaptureLinuxImplementation(capture_task_name="capture_youtube"))

# starting tcpdump for vimeo
pipeline.then(StartCaptureLinuxImplementation(filepath="/tmp/capture_vimeo.pcap", name="capture_vimeo"))

# watching vimeo
for _ in range(3):
    pipeline.then([
        WatchVimeoVideoLinuxImplementation("https://vimeo.com/375468729", 15),
        WatchVimeoVideoLinuxImplementation("https://vimeo.com/375468729", 15),
    ])

# stopping tcpdump for vimeo
pipeline.then(StopNamedCaptureLinuxImplementation(capture_task_name="capture_vimeo"))

# let's print the resulting pipeline
for element in pipeline.tasks:
    print(element)

In [None]:
# we have a netunicorn instance deployed locally, so let's use it
NETUNICORN_ENDPOINT = 'http://localhost:26611'
NETUNICORN_LOGIN = 'test'
NETUNICORN_PASSWORD = 'test'

# create a client and check that connection and instance are ok
client = RemoteClient(endpoint=NETUNICORN_ENDPOINT, login=NETUNICORN_LOGIN, password=NETUNICORN_PASSWORD)
client.healthcheck()

Now let's ask for all available nodes and just take the first one.

We're using a local netunicorn instance for these experiments, which can only deploy Docker containers locally, but with other connectors you can connect to external infrastructure such as Kubernetes, AWS, Azure, etc.

In [None]:
nodes = client.get_nodes()
working_nodes = nodes.take(1)
print(working_nodes)

In [None]:
# creating the experiment - mapping our pipeline to all nodes
experiment = Experiment().map(pipeline, working_nodes)
print(experiment)

By default, netunicorn installs all dependencies by itself, but let's use a prepared Docker image instead to speed things up.

In [None]:
from netunicorn.base import DockerImage
for deployment in experiment:
    deployment.environment_definition = DockerImage(image='pinot.cs.ucsb.edu/sigcommtutorial:latest')  # set the required image
    deployment.environment_definition.runtime_context.additional_arguments = ["/tmp:/tmp"]             # also mount the local folder to save files
    deployment.cleanup = False                                                                         # and do not delete image afterwards

Removing all potential previous results:

In [None]:
!rm -rf /tmp/capture*

## Experiment preparation and execution

Now we have a prepared experiment - pipeline mapped to some nodes. Let's prepare and start it.

Let's give our experiment a label, delete any previous execution with the same label, and ask netunicorn to prepare the experiment.

In [None]:
experiment_label = "session3-1"

try:
    client.delete_experiment(experiment_label)
except RemoteClientException:
    pass

client.prepare_experiment(experiment, experiment_label)
time.sleep(2)

We will track preparation by periodically asking for the status of the experiment.

In [None]:
while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status != ExperimentStatus.PREPARING:
        break
    time.sleep(10)
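
The loop above polls forever if the status never changes. A minimal sketch of a reusable polling helper with a timeout follows. The name `wait_while` and all of its parameters are hypothetical and not part of netunicorn. The `sleep` and `clock` arguments are there only so the helper can be exercised without real waiting.

```python
import time


def wait_while(get_status, busy_status, poll_seconds=10, timeout_seconds=600,
               sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it stops returning busy_status; return the final status."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status != busy_status:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {busy_status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

With the objects from this notebook, it could be called as `wait_while(lambda: client.get_experiment_status(experiment_label).status, ExperimentStatus.PREPARING)`.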

Let's check that all deployments are deployed correctly

In [None]:
for deployment in client.get_experiment_status(experiment_label).experiment:
    print(f"Prepared: {deployment.prepared}, error: {deployment.error}")

Let's start the execution and wait until the experiment finishes.

In [None]:
client.start_execution(experiment_label)

while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status != ExperimentStatus.RUNNING:
        break
    time.sleep(10)

Here's how we can get full information about the experiment:

In [None]:
from returns.pipeline import is_successful

for report in info.execution_result:
    print(f"Node name: {report.node.name}")    # execution node name
    print(f"Error: {report.error}")            # if any error happened

    result, log = report.result  # report stores results of execution and corresponding log
    
    # result is a returns.result.Result object, which can be Success or Failure
    print(f"Result is: {type(result)}")

    # let's unwrap the result (from the Success or Failure container to the actual result)
    data = result.unwrap() if is_successful(result) else result.failure()

    # and print all task names and corresponding execution results
    for key, value in data.items():
        print(f"{key}: {value}")

    # we also can explore logs of the executor in case there's anything there
    for line in log:
        print(line.strip())
    print()

## Data preprocessing

Now we have raw PCAPs with data and need to preprocess them into features we can work with.

For this tutorial, we selected the CICFlowMeter format, which creates flow-statistics feature vectors from raw PCAPs. E.g., if your PCAP contains three connections (5-tuple flows), it will return a CSV with three rows, one per flow, and columns that describe each flow (e.g., mean IAT, total length, number of packets, etc.).
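
To make the idea of per-flow statistics concrete, here is a simplified sketch that groups packets by their 5-tuple and computes a few statistics per flow. The packet list is made up, and the sketch ignores details that CICFlowMeter handles, such as merging both directions of a connection and flow timeouts.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical packets:
# (timestamp_s, src_ip, src_port, dst_ip, dst_port, protocol, length_bytes)
packets = [
    (0.00, "10.0.0.2", 50000, "203.0.113.5", 443, 6, 120),
    (0.10, "10.0.0.2", 50000, "203.0.113.5", 443, 6, 1400),
    (0.30, "10.0.0.2", 50000, "203.0.113.5", 443, 6, 1400),
    (0.05, "10.0.0.2", 50001, "203.0.113.9", 443, 17, 1200),
    (0.25, "10.0.0.2", 50001, "203.0.113.9", 443, 17, 1200),
    (0.40, "10.0.0.2", 50002, "198.51.100.7", 53, 17, 80),
]

# Group packets by 5-tuple: every distinct key is one flow
flows = defaultdict(list)
for ts, src, sport, dst, dport, proto, length in packets:
    flows[(src, sport, dst, dport, proto)].append((ts, length))

# One output row per flow, as in the resulting CSV
rows = []
for key, pkts in flows.items():
    times = [ts for ts, _ in pkts]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    rows.append({
        "flow": key,
        "packets": len(pkts),
        "total_length": sum(length for _, length in pkts),
        "mean_iat": mean(gaps) if gaps else 0.0,  # mean inter-arrival time
    })

for row in rows:
    print(row)
```

The six packets belong to three distinct 5-tuples, so the result has three rows.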

For preprocessing with CICFlowMeter, we will use the prepared docker container

In [None]:
# create a CSV for youtube traffic
!docker run -v /tmp/capture_youtube.pcap:/tmp/capture_youtube.pcap -v /tmp:/tmp/output --rm pinot.cs.ucsb.edu/cicflowmeter:latest /tmp/capture_youtube.pcap /tmp/output

In [None]:
# create a CSV for vimeo traffic
!docker run -v /tmp/capture_vimeo.pcap:/tmp/capture_vimeo.pcap -v /tmp:/tmp/output --rm pinot.cs.ucsb.edu/cicflowmeter:latest /tmp/capture_vimeo.pcap /tmp/output

For data analysis, we will take the resulting CSVs and create a random forest classifier.

Now let's preprocess the data:

In [None]:
df_youtube = pd.read_csv("/tmp/capture_youtube.pcap_Flow.csv")
df_vimeo = pd.read_csv("/tmp/capture_vimeo.pcap_Flow.csv")

print(df_youtube.columns)   # these are all columns that CICFlowMeter uses

To simplify the tutorial a bit and avoid dimensionality problems (when we have too many features for our dataset size) we will use a subset of features that represent a typical video streaming flow.

In [None]:
features = [
    "Label",
    "Protocol",
    "Flow Duration",
    "Flow Bytes/s",
    "Flow Packets/s",
    "Flow IAT Mean",
    "Bwd IAT Mean",
    "Down/Up Ratio",
    "Active Mean",
    "Idle Mean"
]

Let's clean YouTube and Vimeo traffic. We will mark all connections with more than 30 forward or backward packets as video stream connections, and will drop extra UDP traffic not related to streaming.

In [None]:
df_youtube['Label'] = 'other'
df_youtube.loc[(df_youtube['Total Fwd Packet'] > 30) | (df_youtube['Total Bwd packets'] > 30), 'Label'] = 'youtube'  
df_youtube = df_youtube.drop(df_youtube[(df_youtube['Protocol'] == 17) & (df_youtube['Label'] != 'youtube')].index)

In [None]:
df_vimeo['Label'] = 'other'
df_vimeo.loc[(df_vimeo['Total Fwd Packet'] > 30) | (df_vimeo['Total Bwd packets'] > 30), 'Label'] = 'vimeo'
df_vimeo = df_vimeo.drop(df_vimeo[(df_vimeo['Protocol'] == 17) & (df_vimeo['Label'] != 'vimeo')].index)
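
The two labeling cells differ only in the label string, so the rule can be captured in one function. The sketch below is one possible refactoring. The name `label_flows` and the `min_packets` parameter are hypothetical, and the sample frame is made up to stand in for a CICFlowMeter CSV.

```python
import pandas as pd

UDP = 17  # IANA protocol number for UDP


def label_flows(df, label, min_packets=30):
    """Label heavy flows with `label`, the rest as 'other', and drop unlabeled UDP flows."""
    df = df.copy()  # leave the caller's frame untouched
    heavy = (df["Total Fwd Packet"] > min_packets) | (df["Total Bwd packets"] > min_packets)
    df["Label"] = "other"
    df.loc[heavy, "Label"] = label
    keep = ~((df["Protocol"] == UDP) & (df["Label"] != label))
    return df[keep]


# Tiny made-up frame standing in for a CICFlowMeter CSV
sample = pd.DataFrame({
    "Protocol": [6, 17, 17, 6],
    "Total Fwd Packet": [100, 5, 40, 3],
    "Total Bwd packets": [250, 2, 10, 1],
})
labeled = label_flows(sample, "youtube")
print(labeled)
```

In the sample, the light UDP flow is dropped, the heavy UDP and TCP flows are labeled "youtube", and the light TCP flow stays as "other".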

Now we can concatenate these two dataframes and keep only the features that we need.

In [None]:
df = pd.concat([df_youtube, df_vimeo], ignore_index=True)
df = df[features]
df = df.dropna()  # remove rows with missing values (NaN)

In [None]:
# let's look at our dataset
df.head()

## Classifier training
Now let's train a random forest classifier based on features of our data frame

In [None]:
# required imports
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import plot_tree

In [None]:
# separate the data frame to features and answers
target_variable = 'Label'
train_features = list(set(df.columns) - {target_variable})
x_train = df[train_features]
y_train = df[target_variable]

In [None]:
# and start training a classifier
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

Great, we have our classifier trained! Let's explore the results:

In [None]:
y_pred = clf.predict(x_train)  # pass the DataFrame so feature names match those used in fit
print(metrics.classification_report(y_train, y_pred))

Looks suspicious :) Let's explain the actual model with Trustee and see the reasons for such performance.
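
Part of the reason a report can look too good is that it is computed on the same rows the forest was trained on. A minimal sketch of a held-out evaluation with `train_test_split` follows. It uses synthetic data from `make_classification` as a stand-in, not the tutorial's flows, and the variable names are illustrative. Note that a held-out split only separates memorization from signal. It would not reveal a shortcut that is present in both splits, which is why a model explanation is still needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in for the flow features; not the tutorial's data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Keep 30% of the rows aside; stratify preserves the class balance
X_fit, X_held_out, y_fit, y_held_out = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf_demo = RandomForestClassifier(random_state=0)
clf_demo.fit(X_fit, y_fit)

train_accuracy = metrics.accuracy_score(y_fit, clf_demo.predict(X_fit))
held_out_accuracy = metrics.accuracy_score(y_held_out, clf_demo.predict(X_held_out))
print(f"train accuracy: {train_accuracy:.2f}, held-out accuracy: {held_out_accuracy:.2f}")
```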

## Classifier exploration and analysis

In [None]:
from trustee import ClassificationTrustee
import matplotlib.pyplot as plt

# create and train a trustee tree
trustee = ClassificationTrustee(expert=clf)
trustee.fit(x_train, y_train, num_samples=len(x_train) // 2, num_iter=20, train_size=0.99)

# print trustee explanation results
_, dt, _, score = trustee.explain()
print(f"Training score of pruned DT: {score}")
dt_y_pred = dt.predict(x_train)

print("Model explanation global fidelity report:")
print(metrics.classification_report(clf.predict(x_train), dt_y_pred))
print("Model explanation score report:")
print(metrics.classification_report(y_train, dt_y_pred))

# plot a tree
fig = plt.figure(figsize=(25,20))
plot_tree(dt, feature_names=list(x_train.columns), class_names=sorted(df['Label'].unique()), filled=True, max_depth=3)

### Discussion on analysis results
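
If the top split of the explanation tree is on a single feature such as `Protocol`, a cross-tabulation of that feature against the label shows whether it separates the classes on its own. The sketch below uses made-up rows and assumes, for illustration, that YouTube flows arrived over QUIC (UDP, protocol 17) while Vimeo flows used TCP (protocol 6).

```python
import pandas as pd

# Made-up rows illustrating the pattern; not the tutorial's data
flows = pd.DataFrame({
    "Label":    ["youtube", "youtube", "vimeo", "vimeo", "other", "other"],
    "Protocol": [17, 17, 6, 6, 6, 6],
})

# Rows are labels, columns are protocol numbers, cells are flow counts
table = pd.crosstab(flows["Label"], flows["Protocol"])
print(table)
```

On the collected data, the same check would be `pd.crosstab(df["Label"], df["Protocol"])`. A table in which one label occupies a protocol column by itself indicates that the classifier can tell the classes apart by transport protocol alone rather than by traffic behavior.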

## Iteration #2 - fixing the dataset
Change the pipeline to fix the problem: prohibit the use of QUIC, effectively removing the protocol difference from the data.

In [None]:
# creating the pipeline again

pipeline = Pipeline()
pipeline.early_stopping = False

pipeline.then(PingYouTubeAndVimeoTask())

pipeline.then(StartCaptureLinuxImplementation(filepath="/tmp/capture_youtube.pcap", name="capture_youtube"))
for _ in range(2):
    pipeline.then([
        WatchYouTubeVideoLinuxImplementation("https://www.youtube.com/watch?v=dQw4w9WgXcQ", 15, webdriver_arguments=["disable-quic"]),    # notice "disable-quic"
        WatchYouTubeVideoLinuxImplementation("https://www.youtube.com/watch?v=dQw4w9WgXcQ", 15, webdriver_arguments=["disable-quic"]),
    ])
pipeline.then(StopNamedCaptureLinuxImplementation(capture_task_name="capture_youtube"))

pipeline.then(StartCaptureLinuxImplementation(filepath="/tmp/capture_vimeo.pcap", name="capture_vimeo"))
for _ in range(2):
    pipeline.then([
        WatchVimeoVideoLinuxImplementation("https://vimeo.com/375468729", 15),
        WatchVimeoVideoLinuxImplementation("https://vimeo.com/375468729", 15),
    ])
pipeline.then(StopNamedCaptureLinuxImplementation(capture_task_name="capture_vimeo"))

for element in pipeline.tasks:
    print(element)

Again, let's create the experiment, deploy it, and run it to completion.

In [None]:
experiment = Experiment().map(pipeline, working_nodes)

for deployment in experiment:
    deployment.environment_definition = DockerImage(image='pinot.cs.ucsb.edu/sigcommtutorial:latest')
    deployment.environment_definition.runtime_context.additional_arguments = ["/tmp:/tmp"]
    deployment.cleanup = False

!rm -rf /tmp/capture*

experiment_label = "session3-2"

try:
    client.delete_experiment(experiment_label)
except RemoteClientException:
    pass

client.prepare_experiment(experiment, experiment_label)
time.sleep(2)

while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status != ExperimentStatus.PREPARING:
        break
    time.sleep(10)

for deployment in client.get_experiment_status(experiment_label).experiment:
    print(f"Prepared: {deployment.prepared}, error: {deployment.error}")

client.start_execution(experiment_label)

while True:
    info = client.get_experiment_status(experiment_label)
    print(info.status)
    if info.status != ExperimentStatus.RUNNING:
        break
    time.sleep(10)


for report in info.execution_result:
    print(f"Node name: {report.node.name}")
    print(f"Error: {report.error}")

    result, log = report.result  # report stores results of execution and corresponding log
    
    # result is a returns.result.Result object, which can be Success or Failure
    print(f"Result is: {type(result)}")
    data = result.unwrap() if is_successful(result) else result.failure()
    for key, value in data.items():
        print(f"{key}: {value}")

    # we also can explore logs
    for line in log:
        print(line.strip())
    print()

Again, we use CICFlowMeter to generate the data...

In [None]:
!docker run -v /tmp/capture_youtube.pcap:/tmp/capture_youtube.pcap -v /tmp:/tmp/output --rm pinot.cs.ucsb.edu/cicflowmeter:latest /tmp/capture_youtube.pcap /tmp/output
!docker run -v /tmp/capture_vimeo.pcap:/tmp/capture_vimeo.pcap -v /tmp:/tmp/output --rm pinot.cs.ucsb.edu/cicflowmeter:latest /tmp/capture_vimeo.pcap /tmp/output

Again, exactly the same procedure for data preparation and cleaning.

In [None]:
df_youtube = pd.read_csv("/tmp/capture_youtube.pcap_Flow.csv")
df_vimeo = pd.read_csv("/tmp/capture_vimeo.pcap_Flow.csv")

df_youtube['Label'] = 'other'
df_youtube.loc[(df_youtube['Total Fwd Packet'] > 30) | (df_youtube['Total Bwd packets'] > 30), 'Label'] = 'youtube'
df_youtube = df_youtube.drop(df_youtube[(df_youtube['Protocol'] == 17) & (df_youtube['Label'] != 'youtube')].index)

df_vimeo['Label'] = 'other'
df_vimeo.loc[(df_vimeo['Total Fwd Packet'] > 30) | (df_vimeo['Total Bwd packets'] > 30), 'Label'] = 'vimeo'
df_vimeo = df_vimeo.drop(df_vimeo[(df_vimeo['Protocol'] == 17) & (df_vimeo['Label'] != 'vimeo')].index)

df = pd.concat([df_youtube, df_vimeo], ignore_index=True)
df = df[features]
df = df.dropna()

target_variable = 'Label'
# use a separate name so the `features` list defined earlier is not overwritten
train_features = list(set(df.columns) - {target_variable})
x_train = df[train_features]
y_train = df[target_variable]

And also training a classifier, explaining it with Trustee, and visualizing the results

In [None]:
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

y_pred = clf.predict(x_train)  # pass the DataFrame so feature names match those used in fit
print(metrics.classification_report(y_train, y_pred))

trustee = ClassificationTrustee(expert=clf)
trustee.fit(x_train, y_train, num_samples=len(x_train) // 2, num_iter=20, train_size=0.99)

_, dt, _, score = trustee.explain()
print(f"Training score of pruned DT: {score}")
dt_y_pred = dt.predict(x_train)

print("Model explanation global fidelity report:")
print(metrics.classification_report(clf.predict(x_train), dt_y_pred))
print("Model explanation score report:")
print(metrics.classification_report(y_train, dt_y_pred))

fig = plt.figure(figsize=(25,20))
plot_tree(dt, feature_names=list(x_train.columns), class_names=sorted(df['Label'].unique()), filled=True, max_depth=3)

## Iteration #2 - results discussion