# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled True or False, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please proceed to the tasks below.

### Notes
PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
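
To make the relationship between the two primitives concrete, here is a minimal sketch on synthetic data. The class name `ToyDataset`, the tensor shapes, and the batch size are assumptions chosen for illustration; they are not part of the project files.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical in-memory dataset: random feature vectors with boolean labels."""

    def __init__(self, num_samples=10, num_features=4):
        generator = torch.Generator().manual_seed(0)
        self.samples = torch.randn(num_samples, num_features, generator=generator)
        # Alternate the labels so that both classes are present.
        self.labels = torch.arange(num_samples) % 2 == 0

    def __len__(self):
        # DataLoader uses this to know how many samples exist.
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (sample, label) pair; DataLoader stacks these into batches.
        return self.samples[idx], self.labels[idx]


toy_dataset = ToyDataset()
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

# Ten samples with a batch size of four give two full batches and one partial batch.
batch_shapes = [tuple(features.shape) for features, _ in toy_loader]
print(batch_shapes)
```

The same two methods, `__len__` and `__getitem__`, are all a custom dataset needs to implement, regardless of where the data is stored.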

### *Task 1*

Build a data loader for the validation dataset located at the following path: "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in the HDF5 binary data format, which is designed for storing large amounts of data. Make sure that you familiarise yourself with, and import, the right Python libraries for handling HDF5 files. 


In [3]:
# TODO: import the necessary libraries to load the data from the specified path.
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt
import h5py
import numpy as np
import os
import random
import pandas as pd

In [4]:
BASE_PATH = os.getcwd()
filename = os.path.join(BASE_PATH,"data_students", "student_dataset.hdf5" )


with h5py.File(filename, "r") as f:
    print("Keys: %s" % f.keys())
    



In [None]:
class HDF5CodeDataset(Dataset):
    def __init__(self, file_path):
        self.file_path = file_path
        self.h5_file = h5py.File(file_path, 'r')
        self.samples = self.h5_file['vectors']      
        self.labels = self.h5_file['labels']         

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        label = self.labels[idx]
        return sample, label

    def close(self):
        self.h5_file.close()

#Load the dataset and build a DataLoader
def get_validation_loader(hdf5_path, batch_size=32):
    dataset = HDF5CodeDataset(hdf5_path)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    return loader, dataset


In [None]:
val_loader, val_dataset = get_validation_loader(filename)

#Check a single batch
for batch_vectors, batch_labels in val_loader:
    print("Batch Vectors Shape:", batch_vectors.shape)
    print("Batch Labels:", batch_labels)
    break

print(val_dataset[0])  # Check the first sample and label

Batch Vectors Shape: torch.Size([32, 1, 768])
Batch Labels: tensor([False, False,  True,  True, False, False,  True, False, False, False,
        False, False, False, False,  True,  True, False,  True, False,  True,
        False, False, False, False, False, False, False, False,  True, False,
         True, False])
(array([[ 1.03547001e+00, -2.19101101e-01, -9.18475091e-01,
         1.74084693e-01,  3.97932082e-01, -2.96749353e-01,
         1.05609491e-01, -1.51622081e+00,  9.08545077e-01,
         2.22681236e+00, -3.17959100e-01,  7.20114648e-01,
         1.21622455e+00,  7.95506537e-01,  3.67359757e-01,
        -2.04543328e+00, -1.75873160e+00,  1.21686578e+00,
         5.05620539e-01, -4.81491238e-01,  9.69734862e-02,
        -3.43340904e-01, -3.18177231e-02, -6.40490651e-01,
         2.21731019e+00,  1.18113399e+00, -4.24810886e-01,
        -1.07952014e-01,  3.33056033e-01,  1.04620941e-01,
         2.20718455e+00, -1.18341935e+00,  5.23902714e-01,
        -2.19629502e+00,  3.17754

### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.

In [93]:
random_indices = random.sample(range(len(val_dataset)), 10)

# Retrieve the samples and labels
random_samples = [val_dataset[i] for i in random_indices]

# Create a DataFrame for better visualization
df_random_samples = pd.DataFrame({
    'Sample Index': random_indices,
    'Sample': [sample[0] for sample in random_samples],
    'Label': [sample[1] for sample in random_samples]
})

# Display the table
print(df_random_samples)

   Sample Index                                             Sample  Label
0           653  [[-0.25355354, 0.374036, -0.31967133, 1.416075...  False
1           217  [[1.4171106, 0.59021866, -0.01195909, 0.365434...  False
2           974  [[2.0117917, 0.38000637, -0.20979905, -0.11392...  False
3           519  [[1.0824207, 0.56207716, -3.0993059, -0.607007...  False
4             3  [[1.2847438, -0.025869051, -0.71129435, 1.0452...   True
5           406  [[-0.123753496, 1.6171626, -0.9375237, 0.42778...   True
6           283  [[0.13342273, -0.10581284, -1.0935313, -0.7948...   True
7           941  [[-0.2716333, 0.8326489, -0.11674489, -1.27786...  False
8           564  [[-0.151199, -1.2359986, 0.25981015, -0.026536...   True
9           866  [[2.2081115, -0.68636507, -2.8235893, -0.62246...  False


### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [94]:
num_samples = len(val_dataset)
print(f"Number of samples in the dataset: {num_samples}")
print(len(val_dataset[0][0][0]))  # Number of features in a single sample vector


Number of samples in the dataset: 1000
768


In [95]:
num_positive_examples = sum(val_dataset.labels[:])
num_negative_examples = num_samples - num_positive_examples
ratio = num_positive_examples / num_negative_examples if num_negative_examples > 0 else float('inf')
print(f"Number of negative examples in the dataset: {num_negative_examples}")
print(f"Number of positive examples in the dataset: {num_positive_examples}")
print(f"Ratio of positive to negative examples: {ratio:.2f}")

Number of negative examples in the dataset: 717
Number of positive examples in the dataset: 283
Ratio of positive to negative examples: 0.39


In [96]:
val_dataset.labels[:].sum()  # Cross-check the number of positive (True) labels

283

### *Task 4*

Load and run the following pre-trained neural network model, VulnPredictModel. 

In [98]:

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cpu device


In [106]:
from torch import nn

class VulnPredictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.linear_stack(x)
      

# TODO: initialize and load the model
model = VulnPredictModel()
model.load_state_dict(torch.load("model_2023-03-28_20-03.pth", map_location=torch.device('cpu')))
model.eval()


VulnPredictModel(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)

### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives
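
As a small worked illustration of how these four counts arise, the sketch below thresholds a handful of made-up scores and tallies each outcome. The function name `confusion_counts`, the scores, the labels, and the 0.5 threshold are assumptions for illustration only and are unrelated to the project dataset.

```python
def confusion_counts(predictions, labels):
    """Count (TP, TN, FP, FN) for two equally long sequences of booleans."""
    tp = tn = fp = fn = 0
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            tp += 1  # predicted vulnerable, actually vulnerable
        elif not predicted and not actual:
            tn += 1  # predicted safe, actually safe
        elif predicted and not actual:
            fp += 1  # predicted vulnerable, actually safe
        else:
            fn += 1  # predicted safe, actually vulnerable
    return tp, tn, fp, fn


# Hypothetical model scores and ground-truth labels for six samples.
toy_scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8]
toy_labels = [True, False, False, True, False, True]

# A score at or above the threshold counts as a positive prediction.
toy_predictions = [score >= 0.5 for score in toy_scores]

print(confusion_counts(toy_predictions, toy_labels))  # (2, 2, 1, 1)
```

The four counts always sum to the number of samples, which is a quick sanity check for any implementation.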

In [107]:
# TODO: make the prediction for all the samples in the validation set.

def evaluate_model(model, dataloader, threshold=0.5):
    model.eval()
    TP = TN = FP = FN = 0

    with torch.no_grad():
        for inputs, labels in dataloader:
            # Drop only the trailing singleton dimension: (batch, 1) -> (batch,)
            outputs = model(inputs).squeeze(-1)
            preds = (outputs >= threshold).int()
            labels = labels.int()

            TP += ((preds == 1) & (labels == 1)).sum().item()
            TN += ((preds == 0) & (labels == 0)).sum().item()
            FP += ((preds == 1) & (labels == 0)).sum().item()
            FN += ((preds == 0) & (labels == 1)).sum().item()

    return TP, TN, FP, FN

val_loader, _ = get_validation_loader(filename)
TP, TN, FP, FN = evaluate_model(model, val_loader)

print(f"True Positives: {TP}")
print(f"True Negatives: {TN}")
print(f"False Positives: {FP}")
print(f"False Negatives: {FN}")

True Positives: 20
True Negatives: 716
False Positives: 1
False Negatives: 263


### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1
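
For reference, these are the standard definitions of the four metrics in terms of the Task 5 counts, where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives. Each ratio is undefined when its denominator is zero, so a manual implementation may need to guard against that case.

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}
\end{aligned}
$$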

In [108]:
Accuracy  = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1_Score  = 2 * (Precision * Recall) / (Precision + Recall)

print(f"Accuracy: {Accuracy:.4f}")
print(f"Precision: {Precision:.4f}")
print(f"Recall: {Recall:.4f}") 
print(f"F1 Score: {F1_Score:.4f}")

Accuracy: 0.7360
Precision: 0.9524
Recall: 0.0707
F1 Score: 0.1316


### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain how accuracy and F1 score differ in what they tell you about the model.
- In this particular problem, which metric should one focus on more?
- Is there a better metric suitable for the use case of vulnerability prediction? Why?
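
One way to approach these questions is to compare the model against a trivial baseline. The sketch below uses the counts printed in Task 5 and a hypothetical classifier that always predicts "not vulnerable" on the same 717/283 split. The helper `summarize` and the choice of the Matthews correlation coefficient (MCC) as an additional metric are assumptions for illustration, not the expected answer; which metric is most suitable depends on how costly a missed vulnerability is compared with a false alarm.

```python
import math


def summarize(tp, tn, fp, fn):
    """Compute accuracy, recall, F1, and MCC, returning 0.0 where a ratio is undefined."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "mcc": mcc}


# Counts reported by the notebook in Task 5.
model_metrics = summarize(tp=20, tn=716, fp=1, fn=263)

# Hypothetical baseline: every sample is predicted as "not vulnerable".
baseline_metrics = summarize(tp=0, tn=717, fp=0, fn=283)

for name, metrics in [("model", model_metrics), ("baseline", baseline_metrics)]:
    formatted = ", ".join(f"{key}={value:.4f}" for key, value in metrics.items())
    print(f"{name}: {formatted}")
```

The baseline reaches an accuracy of 0.717 without detecting a single vulnerability, only slightly below the model's 0.736, whereas recall and F1 drop to zero for the baseline. This gap is what the comparison between accuracy and F1 is meant to expose on an imbalanced dataset.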
