<a href="https://colab.research.google.com/github/AImSecure/Laboratory2/blob/main/lab/notebooks/Lab2_FFNN_RNN_GNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Laboratory 2 — Model Engineering


## Setup

In [None]:
# --- Check Python and pip versions ---
!python --version
!pip install --upgrade pip

Python 3.12.12
Collecting pip
  Downloading pip-25.3-py3-none-any.whl.metadata (4.7 kB)
Downloading pip-25.3-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 24.1.2
    Uninstalling pip-24.1.2:
      Successfully uninstalled pip-24.1.2
Successfully installed pip-25.3


In [None]:
# --- Install required libraries ---
!pip install torch
!pip install numpy pandas scikit-learn matplotlib seaborn
!pip install tqdm



In [None]:
# --- Import libraries ---
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, TensorDataset

from tqdm import tqdm

### Colab Pro

In [None]:
# --- Check GPU availability ---
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0 or gpu_info.find('not found') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found


In [None]:
# --- Check RAM availability ---
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 13.6 gigabytes of available RAM

Not using a high-RAM runtime


### Paths setup


In [None]:
# --- Mount Google Drive (for Google Colab users) ---
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# --- Define Paths ---
group = 'AImSecure'
laboratory = 'Laboratory2'

base_path = '/content/drive/MyDrive/'
project_path = base_path + f'Projects/{group}/{laboratory}/'
data_path = project_path + 'data/'
results_path = project_path + 'results/'

# Ensure directories exist
os.makedirs(project_path, exist_ok=True)
os.makedirs(data_path, exist_ok=True)
os.makedirs(results_path, exist_ok=True)

print(f"Project path: {project_path}")
print(f"Data path: {data_path}")
print(f"Results path: {results_path}")

Project path: /content/drive/MyDrive/Projects/AImSecure/Laboratory2/
Data path: /content/drive/MyDrive/Projects/AImSecure/Laboratory2/data/
Results path: /content/drive/MyDrive/Projects/AImSecure/Laboratory2/results/


In [None]:
# --- Set visual style ---
sns.set(style="whitegrid", palette="muted", font_scale=1.1)

def save_plot(fig: plt.Figure, filename: str, path: str = "./plots/", fmt: str = "png", dpi: int = 300, close_fig: bool = False) -> None:
    """
    Save a Matplotlib figure in a given format to a specified directory.

    Args:
        fig (plt.Figure): Matplotlib figure object to save.
        filename (str): Base name of the file, without extension (e.g., 'plot');
            the extension is appended from `fmt`.
        path (str, optional): Directory path to save the figure. Defaults to './plots/'.
        fmt (str, optional): File format for the saved figure. Defaults to 'png'.
        dpi (int, optional): Dots per inch for the saved figure. Defaults to 300.
        close_fig (bool, optional): If True, close the figure after saving. Defaults to False.

    Returns:
        None
    """
    # Ensure the directory exists
    os.makedirs(path, exist_ok=True)
    save_path = os.path.join(path, f"{filename}.{fmt}")

    # Save the figure
    fig.savefig(save_path, bbox_inches='tight', pad_inches=0.1, dpi=dpi, format=fmt)
    # plt.close(fig) # Removed to display plots in notebook

    if close_fig:
        plt.close(fig)

    print(f"Saved plot: {save_path}")

## Task 1 - Frequency-based baseline

We start with a simple frequency-based baseline:
- Transform each sequence into a feature vector that counts the occurrences of each API call.
- Use it to check whether a simple approach already performs well before moving to more complex models.


In [None]:
# Create directory for plots
save_dir = results_path + 'images/' + 'task1_plots/'
os.makedirs(save_dir, exist_ok=True)

### Frequency-Based Approach

- Extract vocabulary from train and test datasets.
- Use vocabulary to create feature vectors (frequency counts per API call).
- Output dataframe: one row per sequence, one column per API call.

### Extract the vocabularies

In [None]:
# --- Extract vocabulary from training and test sets ---
train_vocab = set(api_call for seq in train_sequences for api_call in seq)
test_vocab = set(api_call for seq in test_sequences for api_call in seq)

In [None]:
# --- Count unique API calls ---
print(f"Train: {len(train_vocab)}, Test: {len(test_vocab)}")

#### Q: How many unique API calls does the training set contain? How many in the test set?



In [None]:
# --- Identify test-only API calls ---
unique_test_only = test_vocab - train_vocab
print(f"Test-only: {len(unique_test_only)}, {unique_test_only}")

#### Q: Are there any API calls that appear only in the test set (but not in the training set)? If yes, how many? Which ones?

### New frequency-based dataframes

- Create dataframe using training vocabulary as features.
- Count occurrences of each API call per sequence.

In [None]:
# --- Build frequency-based dataframes ---
train_df = pd.DataFrame([{api: seq.count(api) for api in train_vocab} for seq in train_sequences])
test_df = pd.DataFrame([{api: seq.count(api) for api in train_vocab} for seq in test_sequences])

#### Q: Can you use the test vocabulary to build the new test dataframe? If not, how do you handle API calls in the test set that do not exist in the training vocabulary?
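
One way to approach this, sketched on toy data rather than the lab dataset: keep the training vocabulary as the fixed set of columns, count calls with `collections.Counter`, and ignore calls that were never seen in training (optionally tallying how many were dropped). The API names and the helper `to_count_vector` are hypothetical, and the sketch assumes each sequence is a list of strings.

```python
from collections import Counter

# Toy data with hypothetical API names, standing in for the real sequences.
toy_train = [["open", "read", "read", "close"], ["open", "write", "close"]]
toy_test = [["open", "mmap", "read", "mmap"]]  # "mmap" never occurs in training

# Fix the column order once, from the training vocabulary only.
columns = sorted({api for seq in toy_train for api in seq})
known = set(columns)

def to_count_vector(seq, columns):
    """Return (counts per training column, number of out-of-vocabulary calls)."""
    counts = Counter(seq)
    vector = [counts.get(api, 0) for api in columns]
    n_unknown = sum(c for api, c in counts.items() if api not in known)
    return vector, n_unknown

vector, n_unknown = to_count_vector(toy_test[0], columns)
print(columns, vector, n_unknown)
```

Fixing the columns from the training set keeps the feature space identical at training and test time, which is what a fitted classifier expects.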

In [None]:
# --- Compute sparsity ---
train_nonzeros = train_df.astype(bool).sum(axis=1)
test_nonzeros = test_df.astype(bool).sum(axis=1)

print(f"Train avg non-zero: {train_nonzeros.mean():.2f}, Test avg non-zero: {test_nonzeros.mean():.2f}")
print(f"Ratio train: {train_nonzeros.mean()/len(train_vocab):.2f}, Ratio test: {test_nonzeros.mean()/len(train_vocab):.2f}")

#### Q: How many non-zero elements per row do you have on average in the training set? How many in the test set? What is the ratio with respect to the number of elements per row?

#### Q: The original API sequences were ordered. Is it still the case now in the frequency-based dataframe? Why?
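
A small illustration on made-up sequences (hypothetical API names): two sequences containing the same calls in a different order collapse to identical count vectors, so the frequency representation cannot tell them apart.

```python
from collections import Counter

seq_a = ["open", "read", "write", "close"]
seq_b = ["close", "write", "read", "open"]  # same calls, reversed order

same_counts = Counter(seq_a) == Counter(seq_b)  # compares per-call frequencies
same_order = seq_a == seq_b                     # compares the ordered sequences
print(same_counts, same_order)
```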

### Feed the frequency-based datasets to a classifier

- Any classifier can be used (shallow/deep neural or non-neural).
- Goal: evaluate baseline performance on sparse vectors without sequence information.

In [None]:
# --- Example: RandomForest classifier ---
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(train_df, y_train)
y_pred = clf.predict(test_df)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")

#### Q: Report how you chose the hyperparameters of your classifier, and the final performance on the test set.
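
A possible way to choose hyperparameters is a cross-validated grid search on the training set only. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_df` and `y_train`; the grid values and the `f1_macro` scoring are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_df, y_train); this is NOT the lab data.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

# Illustrative grid; the values are assumptions chosen to keep the run short.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="f1_macro",   # assumption: macro-F1 as the selection metric
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Selecting on cross-validation folds keeps the test set untouched until the final evaluation.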

#### Q: Is the final performance good, even ignoring the order of API calls and handling very sparse vectors?

## Task 2 - Feed Forward Neural Network (FFNN)


In [None]:
# Create directory for plots
save_dir = results_path + 'images/' + 'task2_plots/'
os.makedirs(save_dir, exist_ok=True)

### API Call Statistics

#### Q: Do you have the same number of API calls per sequence? If not, is the distribution of API calls per sequence the same for training and test sets?
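
To compare the two partitions, summary statistics of the sequence lengths are a reasonable starting point. The helper `length_summary` below is hypothetical, uses a nearest-rank percentile, and runs on toy sequences rather than the lab data.

```python
import math

def length_summary(sequences):
    """Summarize sequence lengths with min, median, 90th percentile and max."""
    lengths = sorted(len(seq) for seq in sequences)
    n = len(lengths)

    def percentile(p):
        # Nearest-rank percentile: smallest length with at least p% of the data at or below it.
        rank = max(1, math.ceil(p / 100 * n))
        return lengths[rank - 1]

    return {
        "min": lengths[0],
        "median": percentile(50),
        "p90": percentile(90),
        "max": lengths[-1],
    }

# Toy sequences of varying length (contents are irrelevant here).
toy_train = [["a"] * n for n in (3, 5, 5, 8, 20)]
toy_test = [["a"] * n for n in (4, 6, 30)]

train_stats = length_summary(toy_train)
test_stats = length_summary(toy_test)
print(train_stats, test_stats)
```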

#### Q: Can an FFNN handle a variable number of elements? If not, why?

### Fixed-Size Sequences

#### Q: How do you estimate a candidate fixed size? Which partition do you use to estimate it?

#### Q: Given the estimate, what technique could you use to obtain the same number of API calls per sequence?

#### Q: If at test time a sequence has more API calls than the fixed size, what do you do with the excess API calls?
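
A minimal sketch of padding and truncation, assuming API calls have already been mapped to integer ids and that id 0 is reserved for padding (both are assumptions, as is the helper name `pad_or_truncate`). Here truncation keeps the first `max_len` calls; keeping the last ones would be an equally valid design choice.

```python
PAD_ID = 0  # assumption: index 0 is reserved for padding

def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Keep the first max_len ids and right-pad shorter sequences with pad_id."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

short = pad_or_truncate([5, 3, 9], max_len=5)
long = pad_or_truncate([5, 3, 9, 2, 7, 1], max_len=4)
print(short, long)
```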

### Handling Categorical Features

#### Q: Use an FFNN in both cases. Report how you selected the hyperparameters of your final model, and justify your choices.

#### Q: Can you obtain the same results for sequential identifiers and learnable embeddings? If not, why?
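
For reference, a learnable-embedding FFNN could look like the sketch below. The class name `EmbeddingFFNN` and all sizes are hypothetical; index 0 is assumed to be the padding id, and inputs are assumed to be already padded or truncated to `seq_len`.

```python
import torch
import torch.nn as nn

class EmbeddingFFNN(nn.Module):
    """Embed each API-call id, flatten the sequence, then classify with an MLP."""

    def __init__(self, vocab_size, embed_dim, seq_len, hidden_dim, num_classes):
        super().__init__()
        # padding_idx=0 keeps the padding vector at zero and excludes it from updates.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.net = nn.Sequential(
            nn.Flatten(),  # (batch, seq_len, embed_dim) -> (batch, seq_len * embed_dim)
            nn.Linear(seq_len * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        return self.net(self.embedding(token_ids))

# Hypothetical sizes, chosen only to make the shapes easy to follow.
model = EmbeddingFFNN(vocab_size=50, embed_dim=8, seq_len=10, hidden_dim=16, num_classes=2)
batch = torch.randint(0, 50, (4, 10))  # 4 sequences of 10 ids each
logits = model(batch)
print(logits.shape)
```

With sequential identifiers, the same ids would instead be fed to the first linear layer as plain numbers, which imposes an arbitrary numeric ordering on the API calls.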

## Task 3 - Recurrent Neural Network (RNN)


In [None]:
# Create directory for plots
save_dir = results_path + 'images/' + 'task3_plots/'
os.makedirs(save_dir, exist_ok=True)

### Sequence Modeling with RNNs

#### Q: With RNNs, do you still have to pad your data? If yes, how?
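
With PyTorch, variable-length sequences can be padded per batch and then packed so that the RNN skips the padded steps. The sketch below uses toy id tensors and hypothetical layer sizes; index 0 is assumed to be the padding id.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Toy batch of three id sequences with different lengths (ids are made up).
seqs = [torch.tensor([4, 7, 2]), torch.tensor([5, 1]), torch.tensor([9, 3, 8, 6])]
lengths = torch.tensor([len(s) for s in seqs])

# Pad only up to the longest sequence in this batch.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)  # shape (3, 4)

embedding = nn.Embedding(10, 8, padding_idx=0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)

# Packing tells the RNN the true lengths, so padded positions are not processed.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
_, h_n = rnn(packed)   # h_n: (num_layers, batch, hidden_size)
last_hidden = h_n[-1]  # final hidden state per sequence, in the original batch order
print(padded.shape, last_hidden.shape)
```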

#### Q: Do you have to truncate the testing sequences? Justify your answer.

#### Q: Is RNN padding more memory efficient than FFNN padding? Why?
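
The memory question can be made concrete by counting how many token slots are stored under each scheme. The sketch below is a back-of-the-envelope comparison on made-up lengths; the helper `padded_cells` is hypothetical, and it assumes the FFNN pads every sequence to one global length while the RNN pads each batch to its own longest sequence.

```python
def padded_cells(lengths, batch_size, global_len=None):
    """Count stored token slots when batches are padded to global_len or to their own max."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = global_len if global_len is not None else max(batch)
        total += target * len(batch)
    return total

# Made-up sequence lengths; sorting groups similar lengths into the same batch.
lengths = sorted([3, 4, 5, 6, 50, 60])

fixed = padded_cells(lengths, batch_size=2, global_len=max(lengths))
per_batch = padded_cells(lengths, batch_size=2)
print(fixed, per_batch)
```

The gap between the two counts depends on how skewed the length distribution is and on how the batches are formed.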

#### Q: Start with a simple one-directional RNN. Is your network as fast as the FFNN? If not, why?

### Network Variations

#### Q: Is RNN training as stable as FFNN training?

#### Q: How does your model's performance compare to the simple frequency baseline?

## Task 4 - Graph Neural Network (GNN)


In [None]:
# Create directory for plots
save_dir = results_path + 'images/' + 'task4_plots/'
os.makedirs(save_dir, exist_ok=True)

### Modeling API Sequences as Graphs

#### Q: Do you still have to pad your data? If yes, how?

#### Q: Do you have to truncate the testing sequences? Justify your answer.

#### Q: What is the advantage of modeling your problem with a GNN compared to an RNN? What do you lose?
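
One possible construction, among several, turns each sequence into a directed graph whose nodes are the distinct API calls and whose weighted edges count consecutive transitions. The helper `sequence_to_graph` and the API names are hypothetical.

```python
from collections import Counter

def sequence_to_graph(seq):
    """Return (sorted node names, {(src_idx, dst_idx): transition count})."""
    nodes = sorted(set(seq))
    index = {api: i for i, api in enumerate(nodes)}
    # Each pair of consecutive calls becomes one directed edge occurrence.
    edge_counts = Counter((index[a], index[b]) for a, b in zip(seq, seq[1:]))
    return nodes, dict(edge_counts)

nodes, edges = sequence_to_graph(["open", "read", "read", "write", "read", "close"])
print(nodes)
print(edges)
```

Each graph has as many nodes as the sequence has distinct calls, so this representation needs no padding; in general, the exact position of each call in the sequence is no longer recoverable from the transition counts alone.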

### GNN Variations

#### Q: How does each model perform compared to the previous architectures? Can you beat the baseline?