<a href="https://colab.research.google.com/github/Boshra-01/Rust_VulDetect_ML_Model/blob/main/Rust_VulDetect_ML_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Setting up Environment

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Import Packages and Libraries

We install the essential Python libraries: PyTorch for deep learning operations, Transformers for access to pre-trained models (GraphCodeBERT), scikit-learn for machine learning algorithms, and tqdm for progress bars.

In [2]:
!pip install torch transformers scikit-learn tqdm

import os
import torch
import numpy as np
from tqdm import tqdm
from transformers import RobertaTokenizer, RobertaModel
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score



### Access Dataset

In [3]:
data_dir = "/content/drive/MyDrive/Rust5/"

## Graph Embedding

Graph embedding is a method to convert structured data (such as code represented as graphs or abstract syntax trees) into fixed-length numerical vectors. These vectors capture the relationships and properties of the original structure, allowing us to feed them into machine learning models.

### GraphCodeBERT Model

GraphCodeBERT is a pre-trained model designed for processing source code. It combines natural language processing techniques with the structure of code, which it captures through data flow during pre-training. In this example, GraphCodeBERT generates embeddings (vector representations) for LLVM IR (Intermediate Representation) code files, which are passed to it as plain text. The model is based on the RoBERTa architecture, was pre-trained on source code, and is used here as released, without further fine-tuning.

### Loading GraphCodeBERT tokenizer and model

In [4]:
tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaModel.from_pretrained("microsoft/graphcodebert-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of RobertaModel were not initialized from the model checkpoint at microsoft/graphcodebert-base and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### GPU

In [5]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print("Setup complete! Loaded GraphCodeBERT and configured the environment.")

Setup complete! Loaded GraphCodeBERT and configured the environment.


### Extracting Embeddings from .ll Files
We preprocess the LLVM IR by removing comment lines, then mean-pool the GraphCodeBERT token outputs to obtain one embedding per file.

### Purpose

In this step, we create two functions. The first function, preprocess_llvm_ir, cleans LLVM IR code by removing comments. The second function, get_graphcodebert_embedding, tokenizes the cleaned code, feeds it through GraphCodeBERT, and uses mean pooling on the model's output to obtain a fixed-length vector (embedding) that represents the code snippet.

Preprocessing Function:

In [6]:
def preprocess_llvm_ir(code):
    """Remove full-line comments (lines starting with ';') from LLVM IR."""
    lines = code.split("\n")
    clean_lines = [line for line in lines if not line.strip().startswith(";")]  # Remove comments
    return "\n".join(clean_lines)
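
To make the scope of this cleaning step concrete, the following sketch applies the same rule to a small, made-up IR snippet. It shows that only full-line comments are dropped, while trailing comments and module-level metadata lines remain. The second function, strip_comments_and_metadata, is a hypothetical stricter variant added for illustration only; it is not part of the notebook.

```python
def strip_comment_lines(code: str) -> str:
    """Same rule as preprocess_llvm_ir: drop lines whose first non-space char is ';'."""
    return "\n".join(
        line for line in code.split("\n") if not line.strip().startswith(";")
    )


def strip_comments_and_metadata(code: str) -> str:
    """Hypothetical stricter variant: also drop lines that start with '!'."""
    kept = []
    for line in code.split("\n"):
        stripped = line.strip()
        if stripped.startswith(";") or stripped.startswith("!"):
            continue
        kept.append(line)
    return "\n".join(kept)


# Made-up IR snippet, for illustration only.
sample = "\n".join([
    "; ModuleID = 'example'",
    "define i32 @add(i32 %a, i32 %b) {",
    "  %sum = add i32 %a, %b ; trailing comment stays",
    "  ret i32 %sum",
    "}",
    "!0 = !{i32 1}",
])

basic = strip_comment_lines(sample)
strict = strip_comments_and_metadata(sample)
print(basic)
print("---")
print(strict)
```

Neither variant touches metadata attached inside an instruction line (for example a trailing `!dbg` reference); handling that would need a real parser or a more careful pattern.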

Embedding Extraction Function:

In [7]:
def get_graphcodebert_embedding(code_snippet):

    """Tokenize the code and extract GraphCodeBERT embeddings with mean pooling."""
    tokens = tokenizer(code_snippet, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    tokens = {key: val.to(device) for key, val in tokens.items()}

    with torch.no_grad():
        outputs = model(**tokens)

    # Mean-pool the token embeddings over all 512 positions (padding included) into one vector
    embedding = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
    return embedding
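
Because the tokenizer pads every input to 512 tokens, a plain mean over the sequence dimension also averages the vectors at padding positions. A common alternative is masked mean pooling, which averages only the positions marked as real tokens. The sketch below illustrates the idea with NumPy on toy numbers; masked_mean_pool is a hypothetical helper, not the notebook's method. In the notebook's setting, the mask would presumably come from the attention_mask entry returned by the tokenizer.

```python
import numpy as np


def masked_mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors over positions where mask == 1.

    hidden: (batch, seq_len, dim) token embeddings
    mask:   (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    weights = mask[:, :, None].astype(hidden.dtype)   # (batch, seq_len, 1)
    summed = (hidden * weights).sum(axis=1)           # (batch, dim)
    counts = np.clip(weights.sum(axis=1), 1.0, None)  # (batch, 1), avoids division by zero
    return summed / counts


# Toy input: 1 sequence, 4 positions, 2 dimensions; the last 2 positions are padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0, 0]])

plain = hidden.mean(axis=1)               # unmasked mean: padding pulls the result up
masked = masked_mean_pool(hidden, mask)   # mean over the two real tokens only
print(plain)
print(masked)
```

Switching the pooling strategy changes the embeddings, so results obtained with one strategy are not comparable with a classifier trained on the other.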


### Dealing with Multiple Files

This step defines a function that iterates over every LLVM IR (.ll) file in the specified directory. For each file, it reads and preprocesses the code, extracts an embedding with GraphCodeBERT, and assigns a label from the file name: 1 (vulnerable) if the name contains "_v.ll", otherwise 0. Finally, all embeddings and labels are returned as *NumPy arrays* for further processing.

In [8]:
def extract_embeddings(data_dir):
    embeddings = []
    labels = []

    for file_name in tqdm(os.listdir(data_dir), desc="Extracting embeddings from .ll files"):
        if file_name.endswith('.ll'):
            file_path = os.path.join(data_dir, file_name)
            with open(file_path, 'r') as f:
                llvm_ir = f.read()

                # Preprocess the LLVM IR and get its embedding
                llvm_ir_cleaned = preprocess_llvm_ir(llvm_ir)
                embedding = get_graphcodebert_embedding(llvm_ir_cleaned)

                label = 1 if "_v.ll" in file_name else 0  # Assigning labels

                embeddings.append(embedding)
                labels.append(label)

    return np.array(embeddings), np.array(labels)

# Extracting embeddings
X, y = extract_embeddings(data_dir)
print(f"Extracted embeddings for {len(X)} files.")


Extracting embeddings from .ll files: 100%|██████████| 26/26 [01:12<00:00,  2.77s/it]

Extracted embeddings for 20 files.
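
The progress bar counts all 26 directory entries, while only the 20 entries ending in .ll are embedded. The labeling rule can be isolated in a small helper, sketched below. label_from_filename and the file names are hypothetical and used only for illustration. The sketch checks the suffix with endswith rather than a substring test, which differs only for unusual names such as the last one in the list.

```python
from typing import Optional


def label_from_filename(file_name: str) -> Optional[int]:
    """Return 1 for '*_v.ll', 0 for other '.ll' files, None for anything else."""
    if not file_name.endswith(".ll"):
        return None  # skipped, like the non-.ll entries in the directory
    return 1 if file_name.endswith("_v.ll") else 0


# Hypothetical file names, for illustration only.
names = ["park_v.ll", "park.ll", "notes.txt", "old_v.ll.copy.ll"]
labels = {name: label_from_filename(name) for name in names}
print(labels)

# A substring test, as in the notebook cell, would label the last name as 1.
substring_label = 1 if "_v.ll" in "old_v.ll.copy.ll" else 0
print(substring_label)
```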





### Splitting Data into Training and Testing Sets

We split the dataset into training and testing sets so that the model is evaluated on data it has not seen during training. The split uses scikit-learn's train_test_split function and holds out 20% of the samples (test_size=0.2), with the following options:
*   *stratify=y* ensures that both classes are represented proportionally in both splits.
*   *random_state=42* is used to make the split reproducible.
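
To illustrate what stratification does, here is a minimal pure-Python sketch that splits each class separately and then combines the parts. It is a simplified stand-in for the idea, not scikit-learn's implementation, and the balanced label vector is an assumption made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling the test share from each class separately."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for _, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical balanced labels: 10 non-vulnerable (0) and 10 vulnerable (1).
y_toy = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(y_toy)
print(len(train_idx), len(test_idx))
print([y_toy[i] for i in test_idx])
```

With 20 samples and a 20% test share, each class contributes two test samples, so neither class can be missing from the small test set.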

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print(f"Training samples: {len(X_train)}, Testing samples: {len(X_test)}")

Training samples: 16, Testing samples: 4


## Logistic Regression

Logistic Regression is a supervised machine learning algorithm for classification. It predicts the probability of a binary outcome (e.g. 0 or 1 / vulnerable vs. non-vulnerable code) by passing a linear combination of the input features through a sigmoid function, which maps it to a value between 0 and 1.
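
The mapping from features to a probability can be sketched in a few lines. The weights, bias, and three-dimensional feature vector below are made-up values for illustration; in the notebook the features are the much longer GraphCodeBERT embeddings and the weights are learned during training.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1: sigmoid of the linear combination of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters and input, for illustration only.
weights = [0.8, -0.5, 0.3]
bias = -0.1
features = [1.0, 2.0, 0.5]

p = predict_proba(weights, bias, features)
label = 1 if p >= 0.5 else 0  # threshold the probability at 0.5
print(round(p, 3), label)
```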


---


In this workshop, logistic regression classifies LLVM IR code from the embeddings generated by GraphCodeBERT. It serves as a simple baseline that suits a small dataset and adds little model complexity. We set max_iter=1000 to give the solver enough iterations to converge.


### Training the Logistic Regression Classifier

In [10]:
from sklearn.linear_model import LogisticRegression

# Training Logistic Regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)


## Model Evaluation

This step evaluates the trained classifier by computing the accuracy and generating a classification report. The report provides detailed metrics including precision, recall, and F1-score for each class, helping us understand the performance of the model on the test set.

### Results and Analysis

In [11]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Model Accuracy: {accuracy:.4f}")

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Model Accuracy: 0.7500
Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4
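
The per-class numbers in the report can be reproduced by hand. The label vectors below are reconstructed from the report's support and recall values (two samples per class, one vulnerable sample misclassified); they are not the notebook's actual y_test and y_pred arrays, whose order is unknown.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Reconstructed from the report above, not taken from the real test arrays.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for cls in (0, 1):
    prec, rec, f1 = precision_recall_f1(y_true, y_pred, positive=cls)
    print(f"class {cls}: precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

With only four test samples, a single misclassification moves accuracy by 25 percentage points, so these figures are a rough indication rather than a reliable estimate.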



### Checking Vulnerability Status on a New LLVM IR File

Finally, we demonstrate how to use the classifier trained above to predict whether a new LLVM IR file is vulnerable. We first preprocess the new code, extract its embedding with the same process as before, and then pass the embedding to the classifier.

In [13]:
from google.colab import files
uploaded = files.upload()

Saving fixed_park.ll to fixed_park (1).ll


In [14]:
filename = list(uploaded.keys())[0]

In [15]:
code = uploaded[filename].decode("utf-8")

# Preprocess the code with the function defined earlier
code_clean = preprocess_llvm_ir(code)

# Extract the embedding with GraphCodeBERT (mean pooling) and reshape it
# to (1, n_features), the 2-D shape scikit-learn expects for a single sample
embedding = get_graphcodebert_embedding(code_clean).reshape(1, -1)

# Use the classifier trained above (clf) to predict vulnerability
prediction = clf.predict(embedding)[0]
result = "Vulnerable." if prediction == 1 else "Non-Vulnerable."
print(f"The code sample '{filename}' is {result}")


The code sample 'fixed_park (1).ll' is Non-Vulnerable.


In [16]:
print(embedding)

[[ 1.07208997e-01 -2.08528906e-01  5.51155172e-02  5.63649274e-02
   4.42484081e-01 -2.77596533e-01 -2.30646394e-02  4.97877032e-01
   2.57303178e-01 -2.98079848e-01  1.23061955e-01 -1.97312042e-01
  -4.74991463e-02 -1.60743386e-01 -9.83830243e-02 -1.11717463e-01
  -2.54856288e-01  1.28773406e-01 -1.99194476e-02 -6.38989657e-02
  -2.50214756e-01  2.62920380e-01  6.85099736e-02  1.39488965e-01
  -1.91646926e-02 -2.56677985e-01  6.50677562e-01  4.40217316e-01
  -1.56531319e-01 -5.68266772e-02  6.07435852e-02  2.27192372e-01
   2.88043201e-01  3.23595433e-03  2.13428557e-01 -9.29993466e-02
   1.54067576e-01  4.05771472e-02  3.17376494e-01  1.19723849e-01
   1.02170832e-01 -4.44759816e-01 -2.96016932e-01  6.43637180e-02
  -1.32637229e-02  1.76111355e-01  4.30729240e-02  5.38794547e-02
   1.30540784e-03 -5.32410927e-02 -1.72169674e-02  1.24566287e-01
   2.66256243e-01  1.09321937e-01 -2.04648852e-01 -5.43771908e-02
  -9.66179222e-02 -1.16886377e-01 -2.50779539e-01 -2.28378884e-02
   2.84260