# <b>Task01 - Traffic Application Classification</b>
___

In the first task, we solve a traffic classification problem: determining which mobile app generated a given traffic flow. 

We use the `MIRAGE` dataset, collected from volunteers over a period of two years, from 2017 to 2019. It consists of 44k flows of data, each characterized by several measurements, including inter-arrival times, packet lengths, TCP receiver window values, and application payloads. The goal is to classify each flow into one of 16 different classes representing different mobile apps, or as background traffic. The classes are well-balanced, with a coefficient of 0.94.
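
The text does not say which balance coefficient is meant. One common choice, assumed here purely for illustration, is the Shannon entropy of the class distribution divided by its maximum value, so that 1.0 means perfectly balanced classes. The function name `balance_coefficient` and the toy labels are hypothetical and not part of the notebook.

```python
import numpy as np

def balance_coefficient(labels):
    """Normalized Shannon entropy of the label distribution (assumed definition)."""
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) < 2:
        raise ValueError('At least two classes are required')
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log(p))
    # Divide by the entropy of a uniform distribution over the same classes
    return float(entropy / np.log(len(counts)))

balanced = ['a', 'b', 'c', 'd'] * 25         # 25 samples per class
skewed = ['a'] * 97 + ['b', 'c', 'd']        # one dominant class
print(balance_coefficient(balanced))
print(balance_coefficient(skewed))
```

Under this definition, a value close to 1 such as 0.94 would indicate that no class dominates the dataset.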

**Note** In this notebook we report only the validation of the models and the experiments, _without the training_. To inspect our training approach or retrain a model, see the README (**Training the models** section).

# Table of Contents
- Configuration
- Load features
- Validate the models
- k-nearest-neighborhood class probability
- Shallow learners
- Unsupervised clustering

## Configuration

Before we begin, we need to set up our environment and load the necessary libraries and modules. We also need to specify the paths to the data files and define some global variables that will be used throughout the notebook. 

The `DEMO` flag controls whether we are running the notebook in demonstration mode (`True`) or full mode (`False`). In demonstration mode, some experiments run with fewer samples and the output is not saved.

In [1]:
# Make mltoolbox and utils reachable from this folder
import sys
sys.path.append('../')

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 

import numpy as np  # used as `np` in the cells below
from utils import *

# Features and embeddings paths
FEATURES = '../data/task01/features'
EMBEDDINGS = '../data/task01/embeddings'
INTERIM = '../data/interim'

# Demonstrative flag
DEMO = True

## Load features

In this section, we will load the data files that contain the features for our machine learning models. We will use the `pandas` library to read in the CSV files and store the data in dataframes. 
- The `ipaddress` dataframe will contain the word2vec embeddings for the IP addresses
- The `payload` dataframe will contain the payload bytes
- The `statistics` dataframe will contain Tstat-style features
- The `sequences` dataframe will contain statistical features in sequence referred to each byte. 

The data in these dataframes will be used as input to our models.

In [2]:
import pandas as pd

# Load ip address word2vec embeddings - entity
ipaddress=pd.read_csv(f'{FEATURES}/ipaddress.csv', index_col=[0])
# Load payload bytes - quantity
payload=pd.read_csv(f'{FEATURES}/payload.csv', index_col=[0])
# Load statistics features - quantity
statistics=pd.read_csv(f'{FEATURES}/statistics.csv', index_col=[0])
# Load statistics sequences - quantity
sequences=pd.read_csv(f'{FEATURES}/sequences.csv', index_col=[0])

Then we merge the dataframes containing our features into a single dataframe called `concat`. We start by resetting the index of the `payload` dataframe and dropping its 'label' column. Then, we perform an inner join on the 'index' column with the `statistics` dataframe, also dropping its 'label' column, and repeat this for the `sequences` dataframe. The `ipaddress` dataframe is joined last and keeps its 'label' column, so the labels appear exactly once in the result. Finally, we set the 'index' column as the index of the resulting dataframe. This results in a single dataframe that contains all of the features for our models, with the 'index' column serving as the primary key.

In [3]:
# Merge the features as raw concatenation
concat = payload.reset_index().drop(columns=['label'])\
                .merge(statistics.reset_index().drop(columns=['label']), 
                       on='index', how='inner')\
                .merge(sequences.reset_index().drop(columns=['label']), 
                       on='index', how='inner')\
                .merge(ipaddress.reset_index(), on='index', how='inner')\
                .set_index('index')
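
The same merge pattern can be tried on two toy tables. This is a minimal sketch: the names `payload_toy` and `ipaddress_toy` and their contents are made up for illustration, and only the chain of `reset_index`, `drop`, `merge` and `set_index` calls mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for two feature tables; each one carries its own 'label' column
payload_toy = pd.DataFrame({'p0': [1, 2, 3], 'label': ['a', 'b', 'a']},
                           index=['f1', 'f2', 'f3'])
ipaddress_toy = pd.DataFrame({'e0': [0.1, 0.2, 0.3], 'label': ['a', 'b', 'a']},
                             index=['f1', 'f2', 'f4'])

# Drop the label on the left side, keep it on the right side, join on 'index'
merged = payload_toy.reset_index().drop(columns=['label'])\
            .merge(ipaddress_toy.reset_index(), on='index', how='inner')\
            .set_index('index')
print(merged)
```

Because the join is an inner join, only the flows present in both tables (`f1` and `f2` here) survive, and the label column appears once.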

Finally, we collect the feature sets in a dictionary, load the stratified k-fold order we provide, and retrieve the number of classes.

In [4]:
import joblib

# Collect the features in a dictionary
features = {'payload':payload, 'statistics':statistics,
            'sequences':sequences, 'ipaddress':ipaddress,
            'rawcat':concat, 'mae':None}

# Load stratified k folds
kfolds = joblib.load('../data/task01/skfolds/folds.save')

# Get the number of classes
n_classes = ipaddress.value_counts('label').shape[0]
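
The internal format of the provided `folds.save` file is not described here. As a rough illustration of what a stratified 5-fold order is, the sketch below builds one with scikit-learn's `StratifiedKFold` on toy labels. The names `X_toy`, `y_toy` and `folds_toy` are hypothetical, and this is not necessarily how the provided folds were generated.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: class 'a' is twice as frequent as class 'b'
y_toy = np.array(['a'] * 10 + ['b'] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Each fold is a (train indices, validation indices) pair
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds_toy = list(skf.split(X_toy, y_toy))

for K, (train_idx, val_idx) in enumerate(folds_toy):
    # Every validation fold keeps the 2:1 class proportion
    values, counts = np.unique(y_toy[val_idx], return_counts=True)
    print(K, dict(zip(values, counts)))
```

Stratification keeps the class proportions of the full dataset in every fold, which matters when per-class metrics are averaged across folds.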

## Validate the models

After having trained the models through the training scripts, we need to validate them.

The following function evaluates the pre-trained classifiers using cross-validation. Given a fold index `K`, it loads the classifier trained for that fold and predicts the labels of the corresponding validation set. It then summarizes the performance in a classification report, which includes metrics such as precision, recall, and f1-score. Calling the function once per value of `K` validates the model on all the folds of the dataset.

In [5]:
from tqdm.notebook import tqdm
from mltoolbox.classification import DeepClassifier
from sklearn.metrics import classification_report

def validate_single_run(feature, fname, K, pbar):
    # Retrieve the training and validation samples from the k-folds order
    X_train, X_val, y_train, y_val = get_datasets(kfolds, K, feature)
    
    # Load the classifier model from the specified file path
    mpath = f'../data/task01/classifiers/{fname}_k{K}'
    classifier = DeepClassifier(_load_model=True, model_path=mpath)
    
    # Use the classifier to predict labels for the validation set
    y_pred = classifier.predict(X_val, scale_data=True)
    report = classification_report(y_val, y_pred, labels=np.unique(y_val), 
                                   output_dict=True)
    
    # Extract the macro average f1-score from the report
    f1 = round(report['macro avg']['f1-score'], 2)
    
    # Update the progress bar object and set the postfix message
    pbar.update(1)
    pbar.set_postfix({'current fold':K, 'macro avg. f1': f1})
    
    return report
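
To see what the report dictionary contains without loading a trained model, the sketch below runs `classification_report` on a few toy labels. The lists `y_val_toy` and `y_pred_toy` are made up for illustration; only the `output_dict=True` call and the `'macro avg'` lookup mirror the function above.

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions: one 'b' sample is misclassified as 'c'
y_val_toy = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred_toy = ['a', 'a', 'b', 'c', 'c', 'c']

report_toy = classification_report(y_val_toy, y_pred_toy, output_dict=True)

# Per-class f1-scores are 1.0, 0.67 and 0.8; their unweighted mean is the macro average
macro_f1 = round(report_toy['macro avg']['f1-score'], 2)
print(macro_f1)  # 0.82
```

The macro average weighs every class equally regardless of its support, so a poorly classified minority app lowers the score as much as a poorly classified majority one.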

Now we can validate the models. Namely, we run a full stratified-k-folds cross validation over:
- Raw features independently
- Concatenation of the raw features
- Multi-modal embeddings

In [6]:
for fname, feature in features.items():
    # Initialize a progress bar with a total of 5 iterations (skf)
    pbar = tqdm(total=5)
    pbar.set_description(f'Validating {fname}')
    
    # Iterate over the stratified folds
    for K in range(5):
        if fname == 'mae':
            # Load the pre-trained multimodal embeddings
            feature=pd.read_csv(f'{EMBEDDINGS}/mae_embeddings_k{K}.csv', 
                               index_col=[0])
        # Validate the classifier getting the classification metrics
        report = validate_single_run(feature, fname, K, pbar)
        
        # Save the report to a CSV file if not demonstrative
        if not DEMO:
            pd.DataFrame(report).T.to_csv(f'{INTERIM}/{fname}_deep_k{K}.csv')
            
    # Close the progress bar       
    pbar.close()


## k-nearest-neighborhood class probability

We now evaluate the embedding space through the k-nearest-neighborhood class probability. It consists of fitting a k-nearest-neighbors classifier on the whole dataset. Then, for each sample whose label is different from `unknown` (if present), we compute the probability of finding samples with the same label in its neighborhood.
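
The idea can be sketched with scikit-learn alone. This is an illustrative approximation, not the `mltoolbox` implementation: the helper `k_class_probability` and the toy data are hypothetical. For each sample, it takes the fraction of its `k` nearest neighbors (the sample itself excluded) that share its label.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_class_probability(X, y, k, metric='cosine'):
    """Per-sample fraction of the k nearest neighbors sharing the sample's label."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # With no argument, kneighbors() queries the fitted points and
    # does not count a point as its own neighbor
    _, idx = nn.kneighbors()
    same_label = y[idx] == y[:, None]
    proba = same_label.mean(axis=1)
    # Report only the samples whose label is not 'unknown'
    return proba[y != 'unknown']

# Two groups pointing in different directions, plus one 'unknown' sample
X_toy = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0],
         [0.0, 1.0], [0.1, 1.0], [0.0, 0.9],
         [1.0, 1.0]]
y_toy = ['a', 'a', 'a', 'b', 'b', 'b', 'unknown']
pcs_toy = k_class_probability(X_toy, y_toy, k=2)
print(pcs_toy.mean())
```

A value close to 1 means that samples of the same class sit next to each other in the feature space, which is the property we want from a good embedding.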

In [7]:
from mltoolbox.classification import KnnClassifier
from mltoolbox.metrics import k_class_proba_report

def kpc_single_run(feature, fname, K, k, pbar):
    # Retrieve the training and validation datasets for the current fold
    X_train, X_val, y_train, y_val = get_datasets(kfolds, K, feature)
    X, y = np.vstack([X_train, X_val]), np.hstack([y_train, y_val])
    
    # If demonstrative, load fewer samples
    if DEMO: X, y = X[:3000], y[:3000]
    
    # Train a KNN classifier with the cosine metric and the specified number 
    # of neighbors
    knn = KnnClassifier(n_neighbors=k, metric='cosine')
    knn.fit(X, y, scale_data=True)
    
    # Select the indices of the samples whose label is not 'unknown' and
    # compute their neighborhood class probabilities
    to_keep = np.where(y!='unknown')[0].reshape(-1, 1)
    pcs = knn.predict_proba(to_keep)
    
    # Generate a report with the k-class probabilities
    y_true = y[np.ravel(to_keep)]  # Extract the true labels
    report = k_class_proba_report(y_true, pcs, output_dict=True)
    
    # Extract the macro average k-class probability from the report
    kpc = report['macro avg']['kpc']
    
    # Update the progress bar and set the postfix message
    pbar.update(1)
    pbar.set_postfix({'current fold':K, f'{k}Pc': kpc})

    return report

We evaluate the neighborhood for the Multi-modal embeddings and the concatenation of the raw features. We average the experiment on the 5 folds. 

Note that, since we want to evaluate the embeddings neighborhood, we do not need to distinguish between training/validation samples, thus we merge together the subsets.

In [8]:
ranges = range(1, 20) # Numbers of neighbors (k) to evaluate
# If demonstrative limit the runs
if DEMO: ranges = range(1, 5)

# Evaluate only the 'rawcat' and 'mae' features
for fname, feature in features.items():
    if fname in ['rawcat', 'mae']:
        # Initialize a progress bar
        pbar = tqdm(total=len(ranges)*5)
        pbar.set_description(f'Evaluating {fname} neighborhood')
        
        # Iterate over the stratified folds
        for K in range(5):
            if fname == 'mae':
                # Load the pre-trained multimodal embeddings once per fold
                feature=pd.read_csv(f'{EMBEDDINGS}/mae_embeddings_k{K}.csv', 
                                   index_col=[0])
            for k in ranges: # Try different numbers of neighbors
                # Compute the class probability
                report = kpc_single_run(feature, fname, K, k, pbar)
                
                # Save the report to a CSV file if not demonstrative
                if not DEMO:
                    pd.DataFrame(report).T.to_csv(f'{INTERIM}/{fname}_{k}pc_k{K}.csv')
        
        # Close the progress bar 
        pbar.close()


## Shallow learners

For the sake of completeness, we investigate whether shallow learners (instead of deep classifiers) can classify the samples well. 

First, we use a Random Forest classifier, repeating the stratified k-fold validation.

In [9]:
from sklearn.ensemble import RandomForestClassifier

for K in range(5):
    # Load the pre-trained multimodal embeddings
    embeddings=pd.read_csv(f'{EMBEDDINGS}/mae_embeddings_k{K}.csv', 
                       index_col=[0])
    # Retrieve the training and validation datasets for the current fold
    X_train, X_val, y_train, y_val = get_datasets(kfolds, K, embeddings)

    # Initialize a random forest classifier and fit it to the training data
    clf = RandomForestClassifier(random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)

    # Use the classifier to predict labels for the validation set
    y_pred = clf.predict(X_val)

    # Generate a classification report for the predictions
    report = classification_report(y_val, y_pred, labels=np.unique(y_val), 
                                   output_dict=True)

    # Extract the macro average f1-score from the report
    f1 = report['macro avg']['f1-score']

    # Print the macro average f1-score
    print(f'Fold {K}: Validating MAE Random Forest classifier:\n\tMacro avg f1:{f1}')

    # Save the report to a CSV file if not demonstrative
    if not DEMO:
        pd.DataFrame(report).T.to_csv(f'{INTERIM}/mae_rf_k{K}.csv')

Fold 0: Validating MAE Random Forest classifier:
	Macro avg f1:0.8426948417914268
Fold 1: Validating MAE Random Forest classifier:
	Macro avg f1:0.8490613856235416
Fold 2: Validating MAE Random Forest classifier:
	Macro avg f1:0.8505468380988368
Fold 3: Validating MAE Random Forest classifier:
	Macro avg f1:0.8680946209559376
Fold 4: Validating MAE Random Forest classifier:
	Macro avg f1:0.8642330777755712
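
The per-fold scores printed above can be summarized with their mean and standard deviation. The snippet below simply copies the five values from the output; the variable names are chosen for illustration.

```python
import numpy as np

# Macro avg f1 of the Random Forest on folds 0-4, copied from the output above
rf_macro_f1 = [0.8426948417914268, 0.8490613856235416, 0.8505468380988368,
               0.8680946209559376, 0.8642330777755712]

mean_f1 = float(np.mean(rf_macro_f1))
std_f1 = float(np.std(rf_macro_f1))
print(f'Random Forest macro avg f1: {mean_f1:.3f} +/- {std_f1:.3f}')
```

The mean is roughly 0.855 with a spread of about 0.01 across folds.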


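Then we use a distance-based 7-NN classifier with the cosine metric. Note that when `DEMO` is `True` the classifier is fitted on only the first 3000 training samples, so the scores may not reflect a full run.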

In [10]:
from mltoolbox.classification import KnnClassifier

for K in range(5):
    # Load the pre-trained multimodal embeddings
    embeddings=pd.read_csv(f'{EMBEDDINGS}/mae_embeddings_k{K}.csv', 
                       index_col=[0])
    
    # Retrieve the training and validation datasets for the current fold
    X_train, X_val, y_train, y_val = get_datasets(kfolds, K, embeddings)
    
    # Initialize a KNN classifier with the cosine metric and 7 neighbors
    clf = KnnClassifier(n_neighbors=7, metric='cosine')
    
    # If demonstrative, load fewer samples
    if DEMO: X_train, y_train = X_train[:3000], y_train[:3000]
    
    # Fit the classifier to the training data
    clf.fit(X_train, y_train)

    # Use the classifier to predict labels for the validation set
    y_pred = clf.predict(X_val)

    # Generate a classification report for the predictions
    report = classification_report(y_val, y_pred, labels=np.unique(y_val), 
                                   output_dict=True)

    # Extract the macro average f1-score from the report
    f1 = report['macro avg']['f1-score']

    # Print the macro average f1-score
    print(f'Fold {K}: Validating MAE 7-NN classifier:\n\tMacro avg f1:{f1}')

    # Save the report to a CSV file if not demonstrative
    if not DEMO:
        pd.DataFrame(report).T.to_csv(f'{INTERIM}/mae_7nn_k{K}.csv')

Fold 0: Validating MAE 7-NN classifier:
	Macro avg f1:0.10697046695664006
Fold 1: Validating MAE 7-NN classifier:
	Macro avg f1:0.11126296218833427
Fold 2: Validating MAE 7-NN classifier:
	Macro avg f1:0.101910778837631
Fold 3: Validating MAE 7-NN classifier:
	Macro avg f1:0.11079072741958787
Fold 4: Validating MAE 7-NN classifier:
	Macro avg f1:0.10907128829366836


## Unsupervised clustering

In our previous experiments, we tested deep and shallow learning models for a supervised learning task. We are now interested in evaluating whether it is possible to use the generated embeddings for an unsupervised learning task. To do this, we will use clustering algorithms to see if the generated embeddings can be used to group similar data points together. We will evaluate the performance of the clustering using various metrics to determine its effectiveness.

In [11]:
from mltoolbox.clustering import kMeans
from mltoolbox.metrics import silhouette_report
from sklearn.metrics import adjusted_rand_score

The following function performs the clustering and evaluates it with two metrics: 
- silhouette coefficient
- adjusted Rand index

We use a simple k-Means as the clustering algorithm.

In [12]:
def cluster_single_run(fname, X, y, k, pbar):
    # Initialize and fit a KMeans clustering model
    kmeans = kMeans(n_clusters=k)
    kmeans.fit(X, scale_data=False)
    
    # Use the model to predict cluster labels for the input data
    y_pred = kmeans.predict(X, scale_data=False)
    
    # Generate a silhouette report for the predicted labels
    report = silhouette_report(X, y_pred, output_dict=True)
    
    sh = report['macro avg']['sh'] # Get average silhouette
    ari = adjusted_rand_score(y, y_pred) # Get adjusted rand index
    
    # Update the progress bar and set the postfix message
    pbar.update(1)
    pbar.set_postfix({'current feature':fname, f'{k}-Means sh:': sh, 'ari:':ari})
    
    return sh, ari
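
For readers without `mltoolbox`, the two metrics can be reproduced on toy data with scikit-learn. This is a sketch under assumptions: it uses `sklearn.cluster.KMeans` and `silhouette_score` (the mean over samples) in place of the notebook's `kMeans` and `silhouette_report`, so the numbers are not directly comparable with the macro-averaged silhouette above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
# Two well-separated toy blobs with known labels
X_toy = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                   rng.normal(5.0, 0.1, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)

# Cluster without looking at the labels
y_pred_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)

# Silhouette uses only the geometry; the adjusted Rand index compares with the labels
sh_toy = silhouette_score(X_toy, y_pred_toy)
ari_toy = adjusted_rand_score(y_toy, y_pred_toy)
print(sh_toy, ari_toy)
```

The silhouette measures how compact and separated the clusters are, while the adjusted Rand index measures agreement with the true labels and is invariant to how the cluster ids are numbered.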

Since this is an unsupervised task, we do not know the number of clusters in advance, so we run k-Means over several values of $k$. Namely, we vary $k \in [\frac{c}{2}, 2c]$, where $c$ is the number of labels.

We cluster both the multi-modal embeddings and the concatenation of raw features and average the results over the stratified-k-folds.
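
As a quick check of the interval, the sketch below derives the values of $k$ from the number of labels. The helper `kmeans_k_range` is hypothetical; with $c = 16$ it yields the same values as `range(8, 33)` used in the next cell.

```python
def kmeans_k_range(n_labels):
    """Values of k from c/2 to 2c, both ends included."""
    return range(n_labels // 2, 2 * n_labels + 1)

ks = kmeans_k_range(16)
print(ks[0], ks[-1], len(ks))  # 8 32 25
```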

In [13]:
ranges = range(8,33) # Try different k of k-Means
# If demonstrative limit the runs
if DEMO: ranges = range(8, 15)

# Iterate over the folds
for K in range(5):
    # Initialize dictionaries to store the performance 
    # metrics for each feature
    shs = {'rawcat':[], 'mae':[]}
    aris = {'rawcat':[], 'mae':[]}
    
    # Initialize a progress bar
    pbar = tqdm(total=len(ranges)*2)
    pbar.set_description(f'Evaluating clusters. Fold {K}')
    
    # Evaluate only the 'rawcat' and 'mae' features
    for fname, feature in features.items():
        if fname in ['rawcat', 'mae']:
            # Load the pre-trained multimodal embeddings
            if fname == 'mae':
                feature=pd.read_csv(f'{EMBEDDINGS}/mae_embeddings_k{K}.csv', 
                                   index_col=[0])

            # Retrieve the training and validation datasets for the current fold
            X_train, X_val, y_train, y_val = get_datasets(kfolds, K, feature)
            
            # Combine the training and validation datasets
            X = np.vstack([X_train, X_val])
            y = np.ravel(np.hstack([y_train, y_val]))
            
            # If demonstrative, limit the number of samples
            if DEMO: X, y = X[:1000], y[:1000]
            
            # Vary k of k-Means
            for k in ranges:
                # Evaluate the performance of the K-means model for the current k
                sh, ari = cluster_single_run(fname, X, y, k, pbar)
                
                # Update dictionaries
                shs[fname].append(sh)
                aris[fname].append(ari)
    
    # If not demonstrative finalize the experiments and save the report
    if not DEMO:
        # Manage silhouette reports
        sh_df = pd.DataFrame(shs, index=ranges)\
                  .rename(columns={x:f'sh_{x}' for x in shs.keys()})
        # Manage adjusted rand index reports
        ar_df = pd.DataFrame(aris, index=ranges)\
                  .rename(columns={x:f'ar_{x}' for x in aris.keys()})
        df = sh_df.reset_index().merge(
            ar_df.reset_index(), on='index').set_index('index')
        df.to_csv(f'{INTERIM}/clustering_k{K}.csv')
    
    # Close the progress bar
    pbar.close()
