# <b>Grid Search</b>
___

In this notebook we report the code used to evaluate the grid search. Note that we run the grid search only on the first task (Traffic Application Classification, `MIRAGE` dataset).

**Note:** This notebook covers only the validation of the models and the experiments, _without the training_. To inspect our training approach or retrain a model, see the README (**Training the models** section).

# Table of Contents
- Configuration
- Load features
- Validate the models
- Retrieve number of trainables

## Configuration

Before we begin, we need to set up our environment and load the necessary libraries and modules. We also need to specify the paths to the data files and define some global variables that will be used throughout the notebook. 

The `DEMO` flag controls whether we run the notebook in demonstration mode (`True`) or full mode (`False`). In demonstration mode, some experiments run with fewer samples and the output is not saved.

In [1]:
# Silence TensorFlow logs; set before anything imports TensorFlow
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

# Make mltoolbox and utils reachable from this folder
import sys
sys.path.append('../')

from utils import *

# Features and embeddings paths
FEATURES = '../data/task01/features'
EMBEDDINGS = '../data/task01/embeddings'
INTERIM = '../data/interim'
MAE = '../data/task01/mae'

# Demonstrative flag
DEMO = True


## Load features

In this section, we will load the data files that contain the features for our machine learning models. We will use the `pandas` library to read in the CSV files and store the data in dataframes. 
- The `ipaddress` dataframe will contain the word2vec embeddings for the IP addresses
- The `payload` dataframe will contain the payload bytes
- The `statistics` dataframe will contain Tstat-style features
- The `sequences` dataframe will contain statistical features in sequence referred to each byte. 

The data in these dataframes will be used as input to our models.

In [2]:
import pandas as pd

# Load ip address word2vec embeddings - entity
ipaddress=pd.read_csv(f'{FEATURES}/ipaddress.csv', index_col=[0])
# Load payload bytes - quantity
payload=pd.read_csv(f'{FEATURES}/payload.csv', index_col=[0])
# Load statistics features - quantity
statistics=pd.read_csv(f'{FEATURES}/statistics.csv', index_col=[0])
# Load statistics sequences - quantity
sequences=pd.read_csv(f'{FEATURES}/sequences.csv', index_col=[0])

Then we merge the dataframes containing our features into a single dataframe called `concat`. We start by resetting the index of the `payload` dataframe and dropping its `label` column. We then inner-join it on the `index` column with the `statistics` and `sequences` dataframes, dropping `label` from each as well. The `ipaddress` dataframe is joined last with its `label` column kept, so `concat` carries exactly one `label` column. Finally, we set the `index` column as the index of the result. The outcome is a single dataframe that contains all the features for our models, keyed by `index`.

In [3]:
# Merge the features as raw concatenation
concat = payload.reset_index().drop(columns=['label'])\
                .merge(statistics.reset_index().drop(columns=['label']), 
                       on='index', how='inner')\
                .merge(sequences.reset_index().drop(columns=['label']), 
                       on='index', how='inner')\
                .merge(ipaddress.reset_index(), on='index', how='inner')\
                .set_index('index')
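To see why `label` survives exactly once, here is a minimal sketch of the same merge chain on toy dataframes. The feature columns (`byte_0`, `pkt_count`, `emb_0`), the flow names, and the values are made up for illustration; only the `label` column and the `index` key mirror the notebook.

```python
import pandas as pd

# Toy stand-ins for the feature tables (hypothetical columns and values)
idx = ['flow_a', 'flow_b', 'flow_c']
labels = ['x', 'y', 'x']
payload = pd.DataFrame({'byte_0': [1, 2, 3], 'label': labels}, index=idx)
statistics = pd.DataFrame({'pkt_count': [10, 20, 30], 'label': labels}, index=idx)
ipaddress = pd.DataFrame({'emb_0': [0.1, 0.2, 0.3], 'label': labels}, index=idx)

# Same pattern as the cell above: drop 'label' everywhere except the last table.
# reset_index() names the new column 'index' because the index is unnamed.
concat = (payload.reset_index().drop(columns=['label'])
          .merge(statistics.reset_index().drop(columns=['label']),
                 on='index', how='inner')
          .merge(ipaddress.reset_index(), on='index', how='inner')
          .set_index('index'))

print(list(concat.columns))
```

Had `label` been kept in more than one table, pandas would have produced suffixed duplicates such as `label_x` and `label_y` instead of a single column.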

Finally, we collect the feature sets in a dictionary, load the stratified k-fold split we provide, and retrieve the number of classes.

In [4]:
import joblib

# Collect the features in a dictionary
features = {'payload':payload, 'statistics':statistics,
            'sequences':sequences, 'ipaddress':ipaddress,
            'rawcat':concat, 'mae':None}

# Load stratified k folds
kfolds = joblib.load('../data/task01/skfolds/folds.save')

# Get the number of classes
n_classes = ipaddress.value_counts('label').shape[0]
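As a quick illustration of the class count, the sketch below applies the same `value_counts('label').shape[0]` expression to a toy dataframe and compares it with the more direct `nunique()`. The dataframe and its values are hypothetical.

```python
import pandas as pd

# Toy dataframe with three distinct labels (made-up values)
df = pd.DataFrame({'emb_0': [0.1, 0.2, 0.3, 0.4],
                   'label': ['a', 'b', 'a', 'c']})

# One row per distinct label, so the row count is the number of classes
n_by_counts = df.value_counts('label').shape[0]

# Equivalent and arguably more direct
n_by_nunique = df['label'].nunique()

print(n_by_counts, n_by_nunique)
```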

## Validate the models

After having trained the models through the training scripts, we need to validate them.

The following function evaluates a pre-trained classifier on one fold of the cross-validation. It retrieves the training and validation samples of fold `K`, loads the classifier from `mpath`, and predicts the labels of the validation samples. It then summarizes the performance on the validation set as a classification report, which includes metrics such as precision, recall, and f1-score. Calling the function once per value of `K` validates the model on all the folds of the dataset.

In [5]:
import numpy as np
from tqdm.notebook import tqdm
from mltoolbox.classification import DeepClassifier
from sklearn.metrics import classification_report

def validate_single_run(feature, mpath, K, pbar):
    # Retrieve the training and validation samples from the k-folds order
    X_train, X_val, y_train, y_val = get_datasets(kfolds, K, feature)
    
    # Load the classifier model from the specified file path
    classifier = DeepClassifier(_load_model=True, model_path=mpath)
    
    # Use the classifier to predict labels for the validation set
    y_pred = classifier.predict(X_val, scale_data=True)
    report = classification_report(y_val, y_pred, labels=np.unique(y_val), 
                                   output_dict=True)
    
    # Extract the macro average f1-score from the report
    f1 = round(report['macro avg']['f1-score'], 2)
    
    mname = (mpath.split('/')[-1]).replace('gridsearch_', '')
    # Update the progress bar object and set the postfix message
    pbar.update(1)
    pbar.set_postfix({'current model':mname, 
                      'macro avg. f1': f1})
    
    return report
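The report handling in `validate_single_run` can be tried in isolation. The sketch below runs `classification_report` with `output_dict=True` on a handful of made-up labels and extracts the macro-averaged f1-score the same way; the label arrays are hypothetical and no classifier is involved.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up ground truth and predictions for three classes
y_val = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 2, 2])

# output_dict=True returns nested dicts instead of a formatted string
report = classification_report(y_val, y_pred, labels=np.unique(y_val),
                               output_dict=True)

# Per-class entries are keyed by the label as a string ('0', '1', '2');
# the averages live under 'macro avg' and 'weighted avg'
f1 = round(report['macro avg']['f1-score'], 2)
print(f1)  # 0.82
```

The macro average weighs every class equally, which matters when the traffic classes are imbalanced: here the per-class f1-scores are 0.80, 0.67 and 1.00, and their unweighted mean is 0.82.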

Run the grid search validation.

In [6]:
# Iterate over the stratified folds
for K in range(5):
    # Initialize a progress bar with a total of 16 iterations (gs)
    pbar = tqdm(total=16)
    pbar.set_description(f'Validating Fold {K}')

    for l1 in [32, 64, 128, 256]:
        for l4 in [32, 64, 128, 256]:
            # Load the pre-trained multimodal embeddings
            feature=pd.read_csv(f'{EMBEDDINGS}/gridsearch_{l1}_{l4}_k{K}.csv', 
                               index_col=[0])
            mpath = f'../data/task01/classifiers/gridsearch_{l1}_{l4}_k{K}'
            # Validate the classifier getting the classification metrics
            report = validate_single_run(feature, mpath, K, pbar)

            # Save the report to a CSV file if not demonstrative
            if not DEMO:
                pd.DataFrame(report).T.to_csv(f'{INTERIM}/gridsearch_{l1}_{l4}_k{K}.csv')

    # Close the progress bar       
    pbar.close()
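The loop above covers 4 x 4 hyperparameter pairs on each of the 5 folds. As a small sketch, the same run names can be enumerated up front with `itertools.product`, which is handy for checking that every expected file exists before validation starts. `LAYER_SIZES` and `N_FOLDS` are names introduced here for illustration.

```python
from itertools import product

# Values taken from the loops above; the constant names are hypothetical
LAYER_SIZES = [32, 64, 128, 256]
N_FOLDS = 5

# One name per (fold, l1, l4) combination, in the same order as the nested loops
runs = [f'gridsearch_{l1}_{l4}_k{K}'
        for K in range(N_FOLDS)
        for l1, l4 in product(LAYER_SIZES, LAYER_SIZES)]

print(len(runs))          # 80
print(runs[0], runs[-1])  # gridsearch_32_32_k0 gridsearch_256_256_k4
```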


## Retrieve number of trainables

Finally, we evaluate the size of the multi-modal encoder. Namely, we iterate over the combinations of the hyperparameters `l1` and `l4` and, for each combination, we load a pre-trained MAE, extract the encoder portion of the model, and count the number of trainable parameters.

In [7]:
from keras import models
from mltoolbox.representation import MultimodalAE

report = []
# Initialize a progress bar with a total of 16 iterations (gs)
pbar = tqdm(total=16)
pbar.set_description('Counting trainables')

for l1 in [32, 64, 128, 256]:
    for l4 in [32, 64, 128, 256]:
        # Retrieve the multimodal AE
        mae = MultimodalAE(model_path=f'{MAE}/gridsearch_{l1}_{l4}_k0',
                           _load_model=True)
        
        # Extract only the encoder
        i,o = mae.extract_encoder()
        
        # Get the number of parameters, in units of 1e4
        params = models.Model(i,o).count_params()/1e4
        report.append((l1, l4, params))
        
        # Update the progress bar object and set the postfix message
        pbar.update(1)
        pbar.set_postfix({'l1':l1, 'l4':l4, 'params [x1e4]':params})
# Close the progressbar
pbar.close()

# If not demonstrative, save report to file
if not DEMO:
    df = pd.DataFrame(report, columns=['l1', 'l4', 'trainables'])
    df.to_csv(f'{INTERIM}/trainables.csv')
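To build intuition for how layer widths drive the reported sizes, here is a rough sketch that counts the parameters of a stack of fully connected layers by hand. The MAE architecture is not shown in this notebook, so the input size and layer widths below are purely hypothetical, and `dense_params` and `stack_params` are helper names introduced here.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


def stack_params(input_dim, layer_sizes):
    """Total parameters of consecutive fully connected layers."""
    total = 0
    prev = input_dim
    for units in layer_sizes:
        total += dense_params(prev, units)
        prev = units
    return total


# Hypothetical encoder: 100 inputs -> 64 units -> 32 units
total = stack_params(100, [64, 32])
print(total, total / 1e4)  # 8544 0.8544
```

Note that Keras' `count_params()` returns the total number of weights in the model, so the value saved above equals the trainable count only when the encoder has no non-trainable weights.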
