Graph-based malware detection using machine learning.
| Table of Contents |
|---|
| Installation |
| Usage |
This program uses pip to manage all module dependencies. The easiest way to get started is to initialize a new virtual environment and install all packages using the requirements.txt file.
Windows:
$ py -m venv venv
$ .\venv\Scripts\activate
$ pip install -r requirements.txt
Linux/macOS:
$ python3 -m venv venv
$ source ./venv/bin/activate
$ pip3 install -r requirements.txt
Note that the installation process may vary slightly depending on your system.
Broadly, Graphene has two modes: feature extraction and model training. The main differences between runs come from which features are extracted and which model architecture is trained.
Most capabilities of Graphene can be accessed through the Graphene.py Python script.
Generating a dataset from executables can be done by passing generate as the mode of operation.
$ py src/Graphene.py --mode generate
Configuration data for the generation process can be found in generate.json.
A total of four traversal algorithms are used:
- Breadth-First
- Depth-First
- Beam Traversal
- Node2Vec
The beam traversal has capabilities for three different heuristic algorithms: out-degree, function size, and random weight assignment.
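The three deterministic traversals can be illustrated on a toy call graph. This is a minimal sketch, not Graphene's implementation: the function names, the adjacency-dict graph format, and the `beam_width` parameter are assumptions made for illustration. Beam traversal is interpreted here as a level-by-level walk that keeps only the highest-scoring successors at each level, using out-degree as the heuristic.

```python
from collections import deque


def bfs(graph, start):
    """Breadth-first: visit nodes level by level."""
    order, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs(graph, start):
    """Depth-first: follow one branch to its end before backtracking."""
    order, seen, stack = [], set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reversed so the first-listed successor is popped first.
        stack.extend(reversed(graph.get(node, [])))
    return order


def beam_traversal(graph, start, beam_width=2, score=None):
    """Level-by-level walk that keeps only the best `beam_width` successors."""
    if score is None:
        # Assumed default heuristic: out-degree of the candidate node.
        score = lambda n: len(graph.get(n, []))
    order, seen, frontier = [start], {start}, [start]
    while frontier:
        candidates = []
        for node in frontier:
            for nxt in graph.get(node, []):
                if nxt not in seen and nxt not in candidates:
                    candidates.append(nxt)
        # sorted() is stable, so ties keep their discovery order.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        seen.update(frontier)
        order.extend(frontier)
    return order


# Hypothetical call graph: function name -> called functions.
call_graph = {
    "main": ["init", "parse", "run"],
    "init": [],
    "parse": ["read", "validate"],
    "run": ["read", "decrypt", "send"],
    "read": [], "validate": [], "decrypt": [], "send": [],
}

print(bfs(call_graph, "main"))
print(dfs(call_graph, "main"))
print(beam_traversal(call_graph, "main", beam_width=2))
# beam -> ['main', 'run', 'parse', 'read', 'decrypt']
```

Passing a different `score` callable swaps the heuristic: a lookup into a function-size table, or `lambda n: random.random()` for random weights.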
Node2Vec generates its own embeddings at runtime. It is currently only implemented for the RNN and DNN due to RoBERTa utilizing its own tokenizer.
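For readers unfamiliar with Node2Vec, the sketch below shows the biased second-order random walk at its core. It is illustrative only and does not reflect how Graphene invokes Node2Vec: the function name, the undirected adjacency-dict format, and the parameter defaults are assumptions. In the standard algorithm, `p` controls how likely the walk is to return to the previous node and `q` controls how readily it moves away from it. The resulting walks are usually fed to a skip-gram model to learn the node embeddings.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one biased random walk of up to `length` nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = graph.get(cur, [])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in graph.get(prev, []):
                weights.append(1.0)      # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)  # move further away from prev
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical undirected graph stored as symmetric adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

walk = node2vec_walk(toy_graph, "a", length=10, p=0.5, q=2.0,
                     rng=random.Random(0))
print(walk)
```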
Various model architectures are also supported.
| Architecture | Attributes | Config File |
|---|---|---|
| Recurrent Neural Network | Multi-layered model with LSTM | rnn.json |
| RNN with Node2Vec | RNN architecture with Node2Vec embeddings | rnn_node2vec.json |
| Deep Neural Network | Six Linear layers with ReLU activation | dnn.json |
| DNN with Node2Vec | DNN architecture with Node2Vec embeddings | dnn_node2vec.json |
| Large Language Model | Utilizes RoBERTa as base model | tformer.json |
Training a model requires specifying the model architecture beforehand. As with feature extraction, this is done using the -m or --mode command line argument. A list of available options can be obtained by running the following command.
$ py src/Graphene.py --help
Current options for model training are:
- dnn: Train a DNN using standard traversal algorithms and embeddings.
- rnn: Train an RNN using standard traversal algorithms and embeddings.
- dnn_node2vec: Train a DNN using embeddings generated by Node2Vec.
- rnn_node2vec: Train an RNN using embeddings generated by Node2Vec.
- tformer_train: Train a RoBERTa-based binary classifier.
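The mode-to-configuration relationship described in this README can be sketched with argparse. This is not the contents of Graphene.py: `build_parser` and `CONFIG_FILES` are hypothetical names, and the mapping is inferred from the architecture table and the mode list above.

```python
import argparse

# Assumed mapping from each documented mode to its configuration file.
CONFIG_FILES = {
    "generate": "generate.json",
    "dnn": "dnn.json",
    "rnn": "rnn.json",
    "dnn_node2vec": "dnn_node2vec.json",
    "rnn_node2vec": "rnn_node2vec.json",
    "tformer_train": "tformer.json",
}


def build_parser():
    """Build a parser that accepts only the documented modes."""
    parser = argparse.ArgumentParser(prog="Graphene.py")
    parser.add_argument(
        "-m", "--mode",
        choices=sorted(CONFIG_FILES),
        required=True,
        help="mode of operation (feature extraction or model training)",
    )
    return parser


args = build_parser().parse_args(["--mode", "rnn_node2vec"])
print(args.mode, "->", CONFIG_FILES[args.mode])
# prints: rnn_node2vec -> rnn_node2vec.json
```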
Parameters used by the model during training are defined in the corresponding .json file. A link to any given model's configuration file can be found in the table above.
Explainability mechanisms for the RoBERTa model are implemented using the Captum module. This allows for obtaining explanations at the token level, which can then be aggregated to generate a word-level attribution. See Explainability.py for an example of how word attributions can be calculated.
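The token-to-word aggregation step can be sketched without Captum itself. The example below assumes the per-token attribution scores have already been computed, and it is not taken from Explainability.py. The function name, the sample tokens, and the scores are made up for illustration. It relies on RoBERTa's byte-level BPE convention of prefixing a token with `Ġ` when it begins a new word, and it sums sub-word scores, which is one of several reasonable aggregation choices.

```python
SPECIAL_TOKENS = {"<s>", "</s>", "<pad>"}


def aggregate_word_attributions(tokens, scores):
    """Sum token-level attributions into word-level attributions."""
    words = []  # list of [word_text, summed_score]
    for token, score in zip(tokens, scores):
        if token in SPECIAL_TOKENS:
            continue  # special tokens do not belong to any word
        starts_word = token.startswith("Ġ") or not words
        text = token.lstrip("Ġ")
        if starts_word:
            words.append([text, score])
        else:
            # Continuation piece: extend the current word.
            words[-1][0] += text
            words[-1][1] += score
    return [(word, total) for word, total in words]


# Made-up tokens and scores, for illustration only.
tokens = ["<s>", "Create", "Remote", "Thread", "ĠVirtual", "Alloc", "</s>"]
scores = [0.0, 0.25, 0.5, 0.25, 0.125, 0.125, 0.0]

print(aggregate_word_attributions(tokens, scores))
# -> [('CreateRemoteThread', 1.0), ('VirtualAlloc', 0.25)]
```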
The second portion of this repository allows for launching adversarial attacks on a RoBERTa model. In almost every case, the model is trained and the attack is launched in series. Attacks are launched in a white-box scenario, meaning the attacker has full access to the target model. Adversarial attacks utilize the explainability mechanisms described in the previous section.