AutomatedTransformerMalwareAnalysis/Graphene


Graphene

Graph-based malware detection using machine learning.

Table of Contents

  • Installation
  • Usage

Installation

This program uses pip to manage all module dependencies. The easiest way to get started is to initialize a new virtual environment and install all packages using the requirements.txt file.

Windows

$ py -m venv venv
$ .\venv\Scripts\activate
$ pip install -r requirements.txt

Linux

$ python3 -m venv venv
$ source ./venv/bin/activate
$ pip3 install -r requirements.txt

Note that the exact commands may vary slightly depending on your system.

Usage

Graphene has two modes of operation: feature extraction and model training. Runs differ mainly in which features are extracted and which model architecture is trained.

Most capabilities of Graphene can be accessed through the Graphene.py Python script.

Feature Extraction

Generating a dataset from executables can be done by passing generate as the mode of operation.

$ py src/Graphene.py --mode generate

Configuration data for the generation process can be found in generate.json.

Graph Traversal

A total of four traversal algorithms are used:

  1. Breadth-First
  2. Depth-First
  3. Beam Traversal
  4. Node2Vec
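The first two traversals can be illustrated on a small graph. The following is a minimal sketch, not Graphene's own implementation: the adjacency-dictionary representation, the function names bfs_order and dfs_order, and the toy call graph are all assumptions made for the example.

```python
from collections import deque


def bfs_order(graph, start):
    """Return nodes in breadth-first visiting order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs_order(graph, start):
    """Return nodes in depth-first (pre-order) visiting order."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push successors in reverse so the first-listed one is visited first.
        for nxt in reversed(graph.get(node, [])):
            if nxt not in seen:
                stack.append(nxt)
    return order


# Hypothetical call graph: each key calls the functions in its list.
call_graph = {
    "main": ["parse", "run"],
    "parse": ["read"],
    "run": ["read", "exit"],
    "read": [],
    "exit": [],
}

print(bfs_order(call_graph, "main"))  # ['main', 'parse', 'run', 'read', 'exit']
print(dfs_order(call_graph, "main"))  # ['main', 'parse', 'read', 'run', 'exit']
```

The two orderings differ in when "read" is reached: breadth-first finishes all direct callees of main first, while depth-first follows parse down to read before returning to run.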

Beam traversal supports three heuristics for ranking nodes: out-degree, function size, and random weight assignment.
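As a rough illustration of how a beam traversal with interchangeable heuristics could look, here is a minimal sketch. It assumes a level-by-level expansion that keeps only the beam_width highest-scoring unvisited successors at each step; the function names, the sizes dictionary, and the choice to prefer higher scores are assumptions, not details taken from Graphene.

```python
import random


def out_degree_score(graph):
    """Score a node by its number of outgoing edges."""
    return lambda node: len(graph.get(node, []))


def function_size_score(sizes):
    """Score a node by a precomputed size (hypothetical lookup table)."""
    return lambda node: sizes.get(node, 0)


def random_score(graph, seed=0):
    """Score a node by a fixed random weight assigned once per node."""
    rng = random.Random(seed)
    weights = {node: rng.random() for node in graph}
    return lambda node: weights.get(node, 0.0)


def beam_traversal(graph, start, beam_width, score):
    """Visit the graph level by level, keeping the best nodes per level."""
    order = [start]
    seen = {start}
    frontier = [start]
    while frontier:
        candidates = []
        for node in frontier:
            for nxt in graph.get(node, []):
                if nxt not in seen and nxt not in candidates:
                    candidates.append(nxt)
        # Stable sort: ties keep their discovery order.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        seen.update(frontier)
        order.extend(frontier)
    return order


graph = {
    "main": ["a", "b", "c"],
    "a": ["d"],
    "b": ["d", "e"],
    "c": [],
    "d": [],
    "e": [],
}

print(beam_traversal(graph, "main", 2, out_degree_score(graph)))
# ['main', 'b', 'a', 'd', 'e']
```

With a beam width of 2, node "c" is pruned under the out-degree heuristic because it has no outgoing edges. Swapping in function_size_score or random_score changes which nodes survive each level without touching the traversal itself.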

Node2Vec generates its own embeddings at runtime. It is currently implemented only for the RNN and DNN, because RoBERTa uses its own tokenizer.
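For readers unfamiliar with Node2Vec, the sketch below shows only its sampling stage: a second-order random walk biased by the return parameter p and the in-out parameter q. The embedding stage (training a skip-gram model on the sampled walks) is omitted. The function name, the toy graph, and the parameter values are assumptions; how Graphene configures or implements Node2Vec is not shown here.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one biased random walk of at most `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = graph.get(cur, [])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is unbiased.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in graph.get(prev, []):
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move farther away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected graph stored as symmetric adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

walk = node2vec_walk(toy_graph, "a", 8, p=0.5, q=2.0, rng=random.Random(0))
print(walk)
```

A small p makes the walk more likely to revisit the node it just left, and a large q keeps it near its starting neighborhood. The resulting node sequences play the role that sentences play in word-embedding training.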

Machine Learning

Various model architectures are also supported.

| Architecture             | Attributes                               | Config File       |
|--------------------------|------------------------------------------|-------------------|
| Recurrent Neural Network | Multi-layered model with LSTM            | rnn.json          |
| RNN with Node2Vec        | RNN architecture with Node2Vec embeddings | rnn_node2vec.json |
| Deep Neural Network      | Six Linear layers with ReLU activation   | dnn.json          |
| DNN with Node2Vec        | DNN architecture with Node2Vec embeddings | dnn_node2vec.json |
| Large Language Model     | Utilizes RoBERTa as base model           | tformer.json      |

Training a model requires specifying the model architecture beforehand. As with feature extraction, this is done using the -m or --mode command-line argument. A list of available options can be obtained by running the following command.

$ py src/Graphene.py --help

Current options for model training are:

  • dnn: Train a DNN using standard traversal algorithms and embeddings.
  • rnn: Train an RNN using standard traversal algorithms and embeddings.
  • dnn_node2vec: Train a DNN using embeddings generated by Node2Vec.
  • rnn_node2vec: Train an RNN using embeddings generated by Node2Vec.
  • tformer_train: Train a RoBERTa-based binary classifier.
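The mode names listed in this section could be wired to a command-line parser along the following lines. This is a hedged sketch using the standard argparse module, not the actual parser in Graphene.py: the build_parser function, the help text, and making the argument required are assumptions.

```python
import argparse

# Mode names taken from this README: one for feature extraction,
# the rest for model training.
MODES = [
    "generate",
    "dnn",
    "rnn",
    "dnn_node2vec",
    "rnn_node2vec",
    "tformer_train",
]


def build_parser():
    """Build a parser that accepts -m/--mode with a fixed set of choices."""
    parser = argparse.ArgumentParser(
        prog="Graphene.py",
        description="Graph-based malware detection using machine learning.",
    )
    parser.add_argument(
        "-m",
        "--mode",
        choices=MODES,
        required=True,  # assumption: the real script may define a default
        help="mode of operation",
    )
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--mode", "rnn_node2vec"])
print(args.mode)  # rnn_node2vec
```

Restricting the argument with choices means an unknown mode is rejected with a usage message that lists the valid options, which matches what --help displays.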

Parameters used by the model during training are defined in the corresponding .json file. A link to any given model's configuration file can be found in the table above.
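To illustrate how a JSON configuration file might feed training parameters into a run, here is a small sketch. The keys (epochs, learning_rate, batch_size), the default values, and the load_config helper are hypothetical; consult the actual .json files for the parameters Graphene defines.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical defaults; the real keys in rnn.json, dnn.json, etc. may differ.
DEFAULTS = {"epochs": 10, "learning_rate": 0.001, "batch_size": 32}


def load_config(path, defaults=DEFAULTS):
    """Read a JSON config file and overlay it on the default values."""
    with open(path, encoding="utf-8") as handle:
        overrides = json.load(handle)
    return {**defaults, **overrides}


# Demonstration with a temporary file standing in for a model config.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "example.json"
    config_path.write_text(json.dumps({"epochs": 25}), encoding="utf-8")
    config = load_config(config_path)

print(config)  # {'epochs': 25, 'learning_rate': 0.001, 'batch_size': 32}
```

Keeping parameters in a file rather than in code makes it possible to rerun an experiment with different settings by editing only the .json file.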

Explainability

Explainability mechanisms for the RoBERTa model are implemented using the Captum module. This allows for obtaining explanations at the token level, which can then be aggregated to generate a word-level attribution. See Explainability.py for an example of how word attributions can be calculated.
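The token-to-word aggregation step can be sketched without Captum or a model. The example below assumes that attribution scores have already been computed per token, that word boundaries follow RoBERTa's byte-level BPE convention (a leading "Ġ" marks a token preceded by a space), and that word scores are the sum of their token scores. The tokens and scores are made up for illustration, and Explainability.py may aggregate differently.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def aggregate_word_attributions(tokens, scores):
    """Merge sub-word tokens into words and sum their attribution scores."""
    words = []
    for token, score in zip(tokens, scores):
        if token in SPECIAL_TOKENS:
            continue
        # A leading "Ġ" marks the start of a new word; so does the first token.
        starts_word = token.startswith("Ġ") or not words
        text = token.lstrip("Ġ")
        if starts_word:
            words.append([text, score])
        else:
            words[-1][0] += text
            words[-1][1] += score
    return [(text, score) for text, score in words]


# Illustrative tokens and scores, not real model output.
tokens = ["<s>", "Create", "Remote", "Thread", "ĠVirtual", "Alloc", "</s>"]
scores = [0.0, 0.25, 0.5, 0.25, -0.5, 0.25, 0.0]

print(aggregate_word_attributions(tokens, scores))
# [('CreateRemoteThread', 1.0), ('VirtualAlloc', -0.25)]
```

Summation preserves the total attribution mass across the input; averaging or taking the maximum per word are alternative choices that weight long words differently.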

Adversarial Attacks

The second portion of this repository allows for launching adversarial attacks on a RoBERTa model. In almost every case, the model is trained and the attack is then launched, one after the other. The attacks are white-box, meaning the attacker has full access to the target model. Adversarial attacks rely on the explainability mechanisms described in the previous section.
