# **CDS Project: Part 1**

*Institute of Software Security (E22)*  
*Hamburg University of Technology*  
*SoSe 2023*

## Learning objectives
---

- Use a basic Machine Learning (ML) pipeline with pre-trained models.
- Build your own data loader.
- Load and run a pre-trained ML model.
- Evaluate the performance of an ML model.
- Calculate and interpret performance metrics.

## Materials
---

- Lecture Slides 1, 2, and 3.
- PyTorch Documentation: [Datasets and Data Loaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) 


## Project Description
---

In this project, you are given an ML model that is pre-trained on a vulnerability dataset. The dataset consists of code samples labeled with True or False flags, depending on the presence or absence of a vulnerability. Your goal is to use the pre-trained model to predict whether the code samples in the validation set contain vulnerabilities and to analyse the results. Please proceed to the tasks below.

### *Task 1*

Build a data loader for the validation dataset located at the following path: "*data_students/student_dataset.hdf5*". You will use this dataset to validate the performance of the ML model. The dataset is stored in the HDF5 binary data format, which is designed for storing large amounts of data. Make sure that you import and familiarise yourself with the right Python libraries to handle HDF5 files.


In [1]:
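# TODO: import the necessary libraries to load the data from the specified path.
import h5py

# Load the dataset
with h5py.File('data_students/student_dataset.hdf5', 'r') as f:
    print("Keys in the HDF5 file:", list(f.keys()))

    # Note: the names are inverted relative to the usual convention.
    # X holds the labels and y holds the feature vectors; later cells rely on this.
    X = f['labels'][:]
    y = f['vectors'][:]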

 



Keys in the HDF5 file: ['labels', 'source', 'vectors']


In [2]:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Convert to tensors (X_tensor holds the labels, y_tensor holds the feature vectors)
X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32)

# Create dataset and loader; each item is a (label, vector) pair in this order
dataset = TensorDataset(X_tensor, y_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
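An alternative to `TensorDataset` is a small custom `Dataset` subclass, which keeps the feature/label order explicit. The following is a minimal sketch, not the reference solution: the class name `VulnDataset` and the `demo_*` arrays are hypothetical, and the shapes `(N, 1, 768)` for vectors and `(N,)` for labels are assumed from the shapes printed later in this notebook.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class VulnDataset(Dataset):
    """Hypothetical wrapper that pairs each feature vector with its label."""

    def __init__(self, vectors, labels):
        # Flatten vectors from (N, 1, 768) to (N, 768)
        vectors = np.asarray(vectors, dtype=np.float32)
        self.features = torch.from_numpy(vectors.reshape(len(vectors), -1))
        # Reshape labels from (N,) to (N, 1) to match a single-output model
        labels = np.asarray(labels, dtype=np.float32).reshape(-1, 1)
        self.labels = torch.from_numpy(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Always return (features, label) in this order
        return self.features[idx], self.labels[idx]


# Stand-in arrays (random data, NOT the student dataset)
demo_vectors = np.random.randn(10, 1, 768)
demo_labels = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

demo_loader = DataLoader(VulnDataset(demo_vectors, demo_labels), batch_size=4, shuffle=False)
first_features, first_labels = next(iter(demo_loader))
print(first_features.shape, first_labels.shape)
```

With the real data, the arrays read from the `vectors` and `labels` keys of the HDF5 file would take the place of the stand-in arrays.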


### *Task 2*

Generate a table with 10 random samples from the dataset and show their corresponding labels.



In [4]:
# TODO: display 10 random samples from the loaded dataset
import random
import pandas as pd
import torch

# Get 10 random samples
indices = random.sample(range(len(dataset)), 10)
samples = [dataset[i] for i in indices]

# Create a summary table
# Each dataset item is a (label, vector) pair, because the dataset was built
# as TensorDataset(X_tensor, y_tensor) with X = labels and y = vectors.
data = []
for i, (label, vector) in enumerate(samples, 1):
    sample_info = {
        'Sample #': i,
        'Vector Shape': tuple(vector.shape),
        'First 3 Vector Values': [round(x.item(), 4) for x in vector.flatten()[:3]]
    }
    data.append(sample_info)

# Create and display pandas DataFrame
df = pd.DataFrame(data)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print(df.to_string(index=False))


 Sample # Vector Shape       First 3 Vector Values
        1     (1, 768)      [0.4667, 1.1838, 0.32]
        2     (1, 768)    [1.4171, 0.5902, -0.012]
        3     (1, 768) [-0.9566, -2.0477, -3.4263]
        4     (1, 768)  [2.2081, -0.6864, -2.8236]
        5     (1, 768) [-0.3244, -0.0013, -1.4802]
        6     (1, 768)   [0.7624, 0.1513, -1.3426]
        7     (1, 768)  [2.5526, -0.5399, -0.4223]
        8     (1, 768)    [0.5805, 0.1009, 0.1584]
        9     (1, 768)   [-0.3426, 0.6134, 0.2494]
       10     (1, 768)    [1.0751, 3.053, -3.9202]
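The table above lists only feature-vector values, while the task also asks for the corresponding labels. The sketch below shows one way to collect both, assuming vectors of shape `(N, 1, D)` and labels of shape `(N,)`. The helper `sample_rows` and the `demo_*` arrays are hypothetical and use synthetic data, not the student dataset.

```python
import random

import numpy as np


def sample_rows(vectors, labels, k=10, seed=None):
    """Pick k distinct random samples and return one summary dict per sample."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(labels)), k)
    rows = []
    for idx in indices:
        flat = np.asarray(vectors[idx]).ravel()
        rows.append({
            "index": idx,
            "label": int(labels[idx]),
            "first_3_features": [round(float(v), 4) for v in flat[:3]],
        })
    return rows


# Synthetic stand-in data: 20 samples, 4 features each, alternating labels
demo_vectors = np.arange(20 * 4, dtype=np.float32).reshape(20, 1, 4)
demo_labels = np.array([i % 2 for i in range(20)])

rows = sample_rows(demo_vectors, demo_labels, k=5, seed=0)
for row in rows:
    print(row)
```

The returned list of dicts can be passed directly to `pd.DataFrame(rows)` to render it as a table.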


### *Task 3*

Inspect the dataset and answer the following questions:
1. How many samples are in the dataset?
2. How many positive examples (vulnerability-labeled instances) are in the dataset?
3. What is the vulnerable/non-vulnerable ratio?

In [14]:
# TODO: inspect and understand the loaded dataset

total_samples = len(dataset)
print(f"1. Total samples: {total_samples}")

1. Total samples: 1000


In [23]:
# 2. How many positive examples (vulnerability-labeled instances) are in the dataset?

vulnerable_count = (X_tensor == 1).sum().item()
print(f"Number of vulnerable samples: {vulnerable_count}")

Number of vulnerable samples: 283


In [24]:
#3. What is the vulnerable/non-vulnerable ratio?
vulnerable = (X_tensor == 1).sum().item()
non_vulnerable = (X_tensor == 0).sum().item()

ratio = vulnerable / non_vulnerable if non_vulnerable > 0 else float('inf')

print(f"Vulnerable/Non-vulnerable ratio: {ratio:.2f}:1")
print(f"({vulnerable} vulnerable vs {non_vulnerable} non-vulnerable)")

Vulnerable/Non-vulnerable ratio: 0.39:1
(283 vulnerable vs 717 non-vulnerable)
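The counts above also give the accuracy of a trivial classifier that always predicts the majority class, which is a useful reference point for the metrics computed later. A small sketch using the reported counts (283 vulnerable, 717 non-vulnerable); the helper name `class_balance` is hypothetical.

```python
def class_balance(n_positive, n_negative):
    """Summarise class balance from the two class counts."""
    total = n_positive + n_negative
    return {
        "positive_share": n_positive / total,
        "ratio": n_positive / n_negative if n_negative else float("inf"),
        # Accuracy of always predicting the larger class
        "majority_baseline_accuracy": max(n_positive, n_negative) / total,
    }


balance = class_balance(283, 717)
print(f"Positive share:             {balance['positive_share']:.3f}")
print(f"Vulnerable/non-vulnerable:  {balance['ratio']:.2f}:1")
print(f"Majority-baseline accuracy: {balance['majority_baseline_accuracy']:.3f}")
```

Any model evaluated on this dataset should be judged against that baseline rather than against 50%.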


### *Task 4*

Load and run the following pre-trained neural network model called VulnPredictionModel. 

``` python 
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

```

In [74]:

import torch
import numpy as np
from torch import nn

# 1. Define the model with proper forward pass
class VulnPredictModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_stack = nn.Sequential(
            nn.Linear(768, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        # Input shape: [batch_size, 768]
        return self.linear_stack(x)

# 2. Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# TODO: initialize and load the model
model = VulnPredictModel()
# map_location lets weights saved on a GPU load on a CPU-only machine
model.load_state_dict(torch.load('model_2023-03-28_20-03.pth', map_location=device))
model.to(device)  # keep the model on the same device as the batches in later cells
model.eval()
print("Model loaded successfully!")

# Verify model structure
print("\nModel architecture:")
print(model)


Model loaded successfully!

Model architecture:
VulnPredictModel(
  (linear_stack): Sequential(
    (0): Linear(in_features=768, out_features=64, bias=True)
    (1): ReLU()
    (2): Linear(in_features=64, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
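Before running the model on real data, a quick smoke test with a dummy batch can confirm the expected input and output shapes. The sketch below rebuilds the same layer stack with randomly initialised weights (it does not load the `.pth` file), so the output values themselves are meaningless. The names `sketch_model`, `n_params`, and `dummy_out` are hypothetical.

```python
import torch
from torch import nn

# Same layer sizes as printed above, but with untrained (random) weights
sketch_model = nn.Sequential(
    nn.Linear(768, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)
sketch_model.eval()

# Count trainable parameters: weights plus biases of the three linear layers
n_params = sum(p.numel() for p in sketch_model.parameters())

# Feed a dummy batch of 4 samples with 768 features each
with torch.no_grad():
    dummy_out = sketch_model(torch.randn(4, 768))

print("Parameters:", n_params)
print("Output shape:", tuple(dummy_out.shape))
```

Because the last layer is a sigmoid, every output lies between 0 and 1 and can be read as a vulnerability probability.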


### *Task 5*

Make a prediction on the provided dataset and compute the following values:
- True Positives
- True Negatives
- False Positives
- False Negatives

In [84]:
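print("Original X shape:", X.shape)
print("Original y shape:", y.shape)

# X holds the labels and y holds the vectors (see the loading cell),
# so swap them into conventionally named arrays here.
features = y.squeeze(axis=1)  # (N, 1, 768) -> (N, 768)
labels = X.reshape(-1, 1)     # (N,) -> (N, 1), matching the model output shape

print("Corrected features shape:", features.shape)
print("Corrected labels shape:", labels.shape)

# DataLoader (no shuffling needed for evaluation)
features_tensor = torch.tensor(features, dtype=torch.float32)
labels_tensor = torch.tensor(labels, dtype=torch.float32)
dataset = TensorDataset(features_tensor, labels_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=False)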




Original X shape: (1000,)
Original y shape: (1000, 1, 768)
Corrected features shape: (1000, 768)
Corrected labels shape: (1000, 1)


In [85]:
def evaluate(model, dataloader):
    model.eval()
    TP, FP, TN, FN = 0, 0, 0, 0
    
    with torch.no_grad():
        for batch_features, batch_labels in dataloader:
            batch_features = batch_features.to(device)
            batch_labels = batch_labels.to(device)
            
            outputs = model(batch_features)
            preds = (outputs > 0.5).float()
            
            TP += ((preds == 1) & (batch_labels == 1)).sum().item()
            FP += ((preds == 1) & (batch_labels == 0)).sum().item()
            TN += ((preds == 0) & (batch_labels == 0)).sum().item()
            FN += ((preds == 0) & (batch_labels == 1)).sum().item()
    
    return TP, FP, TN, FN

In [86]:
# TODO: make the prediction for all the samples in the validation set.
TP, FP, TN, FN = evaluate(model, dataloader)

print(f"""
Evaluation Metrics:
- True Positives (TP): {TP}
- False Positives (FP): {FP}
- True Negatives (TN): {TN}
- False Negatives (FN): {FN}

""")


Evaluation Metrics:
- True Positives (TP): 20
- False Positives (FP): 1
- True Negatives (TN): 716
- False Negatives (FN): 263
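As a consistency check, the four counts sum to 1000, TP + FN = 283 matches the number of vulnerable samples, and TN + FP = 717 matches the number of non-vulnerable ones. The counting logic itself can be sketched without PyTorch on a toy example. The function `confusion_counts` and the `toy_*` lists are hypothetical, and the threshold of 0.5 mirrors the one used in `evaluate`.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, FP, TN, FN for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob > threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Toy data (made up for illustration only)
toy_labels = [1, 0, 1, 1, 0, 0]
toy_probs = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1]

counts = confusion_counts(toy_labels, toy_probs)
print("TP, FP, TN, FN =", counts)
```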




### *Task 6*

Compute the corresponding performance metrics **manually** (do not use PyTorch's predefined metrics):
- Accuracy
- Precision
- Recall
- F1

In [96]:
# TODO: calculate accuracy


accuracy = (TP + TN) / (TP + FP + TN + FN)



# TODO: calculate precision
 
precision = TP / (TP + FP) if (TP + FP) > 0 else 0



# TODO: calculate recall

recall = TP / (TP + FN) if (TP + FN) > 0 else 0


# TODO: calculate F1-score


f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print(f"""
Manual Performance Metrics:
───────────────────────────────────
1. Accuracy  = (TP+TN)/Total = ({TP}+{TN})/({TP+FP+TN+FN}) = {accuracy:.4f}
  

2. Precision = TP/(TP+FP) = {TP}/({TP}+{FP}) = {precision:.4f}
   

3. Recall    = TP/(TP+FN) = {TP}/({TP}+{FN}) = {recall:.4f}
  

4. F1-Score  = 2*(Precision*Recall)/(Precision+Recall) = {f1:.4f}
  
───────────────────────────────────
""")


Manual Performance Metrics:
───────────────────────────────────
1. Accuracy  = (TP+TN)/Total = (20+716)/(1000) = 0.7360


2. Precision = TP/(TP+FP) = 20/(20+1) = 0.9524


3. Recall    = TP/(TP+FN) = 20/(20+263) = 0.0707


4. F1-Score  = 2*(Precision*Recall)/(Precision+Recall) = 0.1316

───────────────────────────────────



### *Task 7*

Based on your performance metrics, answer the following questions:

- Explain the impact of accuracy vs. F1 score.

Accuracy measures how often the model is correct overall. It is a general metric, but it can be misleading on imbalanced datasets.

In our case, the accuracy of 73.6% comes almost entirely from correctly predicting the many negative (non-vulnerable) samples. A model that always predicts "non-vulnerable" would already reach 71.7% on this dataset (717 of 1000 samples), so the accuracy says little about the class that matters: the vulnerable code.

F1 score is the harmonic mean of precision and recall, and it focuses on performance on the positive class (vulnerable code). It is more reliable when:

- we care about how well the model catches a specific class, and
- the dataset is imbalanced, which is true here.

In our case, the F1 score is very low (13.16%), which reveals that the model misses most of the actual vulnerable code even though its overall accuracy looks reasonable.
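The gap between the two metrics can be traced directly to the confusion-matrix counts from Task 5 (TP = 20, FP = 1, TN = 716, FN = 263). Writing $P$ for precision and $R$ for recall, a worked version of the calculation is:

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} = \frac{20 + 716}{1000} = 0.736
$$

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2 \cdot 20}{2 \cdot 20 + 1 + 263} = \frac{40}{304} \approx 0.1316
$$

The 716 true negatives dominate the accuracy but do not appear in the F1 formula at all, whereas the 263 false negatives weigh heavily on it.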






- In this particular problem, which metric should one focus on more?

We should focus more on recall and F1 score than on accuracy.

Why? In vulnerability prediction, false negatives (missed vulnerabilities) are dangerous because they can lead to undetected security risks.

High precision is useful (we are rarely wrong when we flag something as vulnerable: 20 of 21 flagged samples are truly vulnerable), but our recall is extremely low: we catch only about 7% of the real vulnerabilities (20 of 283).

So a model that is accurate overall but blind to most actual vulnerabilities is not useful for practical security scanning.
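Recall depends on the decision threshold, which is fixed at 0.5 in `evaluate`. The sketch below illustrates on made-up scores how lowering the threshold can trade precision for recall. The function `precision_recall_at` and the `toy_*` lists are hypothetical, and the numbers say nothing about how the actual model would behave.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for prob > threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if p > threshold and t == 1)
    fp = sum(1 for t, p in zip(y_true, y_prob) if p > threshold and t == 0)
    fn = sum(1 for t, p in zip(y_true, y_prob) if p <= threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): 4 vulnerable and 6 non-vulnerable samples
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_probs = [0.80, 0.45, 0.35, 0.20, 0.55, 0.30, 0.25, 0.15, 0.10, 0.05]

sweep = {t: precision_recall_at(toy_labels, toy_probs, t) for t in (0.5, 0.3, 0.1)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Whether such a trade-off is acceptable depends on how costly false alarms are compared with missed vulnerabilities.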









- Is there a better metric suitable for the use case of vulnerability prediction? Why?

Yes. Recall, F1 score, and Precision-Recall AUC are better suited to this use case. They emphasize the model's ability to detect actual vulnerabilities, which matters more than overall accuracy in a high-risk, imbalanced problem like security vulnerability prediction. Precision-Recall AUC in particular summarises performance across all decision thresholds rather than only at the 0.5 cut-off used above.
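One common way to summarise the precision-recall curve in a single number is average precision: the mean of the precision values at the rank of each true positive when samples are sorted by score. The following is a minimal pure-Python sketch on made-up scores. The function `average_precision` and the `toy_*` lists are hypothetical, and the sketch does not treat tied scores specially.

```python
def average_precision(y_true, y_score):
    """Mean of precision@rank taken at the rank of each positive sample."""
    # Sort samples from highest to lowest score
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(y_true)
    if n_positive == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, (_, truth) in enumerate(ranked, start=1):
        if truth == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / n_positive


# Toy data (made up): positives ranked 1st and 3rd
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]

ap = average_precision(toy_labels, toy_scores)
print(f"Average precision: {ap:.4f}")
```

Applying this to the model would require keeping the raw sigmoid outputs instead of the thresholded predictions.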

