Install the dependencies, e.g., with pip inside a conda environment named {ENV}:
conda create --name {ENV} python=3.10
conda activate {ENV}
pip install r2pipe lief pefile capstone tokenizers
python pipeline.py --file_or_dir="file.exe"
python pipeline.py --file_or_dir="directory"
python pipeline.py --help
python address.py --build --merge
python address.py --help
python example.py
The goal is to extract disassembled functions from PE executables. We'll want a pure function that takes a file as input and returns disassembled functions, along with their locations in the binary, e.g.,
from argparse import ArgumentParser
from dataclasses import dataclass
from multiprocessing import Pool
from pathlib import Path


@dataclass(frozen=True)
class Function:
    """
    Crucial information from each function.
    # TODO: include any information that may be useful.
    """
    instructions: list[str]
    offsets: list[int]
    upper_offsets: list[int]


def extract_functions_from_binary(file: Path) -> list[Function]:
    """
    Extract disassembled functions and their locations from a PE binary.

    Args:
        file: file to extract from

    Returns:
        information about each function in the file
    """
    ...


def main() -> None:
    parser = ArgumentParser()
    parser.add_argument("--file", required=False, type=Path, help="File to disassemble.")
    parser.add_argument("--files", required=False, type=Path,
                        help="Either a file containing a list of files or a directory containing files to disassemble.")
    parser.add_argument("--num_workers", default=8, type=int, help="Degree of parallelization.")
    args = parser.parse_args()

    # Exactly one of --file and --files must be given.
    if args.file is not None and args.files is None:
        files = [args.file]
    elif args.files is not None and args.file is None:
        if args.files.is_dir():
            files = list(args.files.iterdir())
        else:
            # One path per line; skip blank lines and convert each entry to a Path.
            lines = args.files.read_text().splitlines()
            files = [Path(line.strip()) for line in lines if line.strip()]
    else:
        raise ValueError("Specify exactly one of --file or --files.")

    with Pool(args.num_workers) as pool:
        pool.map(extract_functions_from_binary, files)


if __name__ == "__main__":
    main()
Eventually, we'll want to parallelize this, so a pure function is easiest to work with.
There are many aspects of this problem that we will need to consider, e.g., architectures, instruction sets, and obfuscations. For now, start with the simplest variation of the problem, and we'll refine the specifications as we go. Depending on the difficulty of the task, we may be able to write a similar function that extracts basic blocks, as well as dedicated functions for different architectures, e.g., x86, x64, and ARM, or file types, e.g., Android APK, Linux ELF, and macOS Mach-O.
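If dedicated functions per file type are eventually written, one possible way to route a file to the right one is to inspect its leading magic bytes. The following is only a minimal sketch under stated assumptions, not part of this project's code: `BinaryFormat`, `detect_format`, and `detect_file_format` are hypothetical names, and only the first four bytes are checked. In particular, an `MZ` prefix only indicates a DOS header, so a real PE check would also need to verify the `PE\0\0` signature, and an APK is reported as a generic ZIP archive.

```python
from enum import Enum
from pathlib import Path


class BinaryFormat(Enum):
    """Hypothetical labels for the file types mentioned above."""
    PE = "pe"
    ELF = "elf"
    MACHO = "macho"
    ZIP = "zip"          # APKs are ZIP archives; this does not prove a file is an APK.
    UNKNOWN = "unknown"


# Leading magic bytes mapped to a format label.
# Mach-O magic numbers appear in either byte order, for 32-bit and 64-bit files.
# Fat (universal) Mach-O files are omitted because their magic, 0xCAFEBABE,
# is shared with Java class files.
_MAGIC_PREFIXES: dict[bytes, BinaryFormat] = {
    b"MZ": BinaryFormat.PE,                  # DOS header; necessary but not sufficient for PE
    b"\x7fELF": BinaryFormat.ELF,
    b"\xfe\xed\xfa\xce": BinaryFormat.MACHO,  # 32-bit
    b"\xce\xfa\xed\xfe": BinaryFormat.MACHO,  # 32-bit, byte-swapped
    b"\xfe\xed\xfa\xcf": BinaryFormat.MACHO,  # 64-bit
    b"\xcf\xfa\xed\xfe": BinaryFormat.MACHO,  # 64-bit, byte-swapped
    b"PK\x03\x04": BinaryFormat.ZIP,
}


def detect_format(header: bytes) -> BinaryFormat:
    """Guess the container format from the first bytes of a file."""
    for magic, fmt in _MAGIC_PREFIXES.items():
        if header.startswith(magic):
            return fmt
    return BinaryFormat.UNKNOWN


def detect_file_format(file: Path) -> BinaryFormat:
    """Read the first four bytes of a file and guess its format."""
    with file.open("rb") as fp:
        return detect_format(fp.read(4))
```

Like `extract_functions_from_binary`, both helpers are pure functions of their input, so they could be mapped over a list of files in a worker pool before choosing which extractor to run.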
Survey papers are probably the best way to introduce yourself to a new research topic.
- From Hack to Elaborate Technique—A Survey on Binary Rewriting (2019)
- Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection (2021)
- A Survey of Binary Code Similarity (2021)
- Arms Race in Adversarial Malware Detection: A Survey (2021)
- Deep Learning for Android Malware Defenses: A Systematic Literature Review (2022)
- A Survey on Ransomware: Evolution, Taxonomy, and Defense Solutions (2022)
- File Packing from the Malware Perspective: Techniques, Analysis Approaches, and Directions for Enhancements (2022)
- Deep Learning for Zero-day Malware Detection and Classification: A Survey (2023)
- A Survey of Malware Analysis Using Community Detection Algorithms (2023)
These are some top-tier papers closely related to this project.
Malware classification from raw bytes
- Malware Detection by Eating a Whole EXE (2018)
- Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection (2021)
- Recasting Self-Attention with Holographic Reduced Representations (2023)
Malware classification from (not) raw bytes
- SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection (2020)
- DANdroid: A Multi-View Discriminative Adversarial Network for Obfuscated Android Malware Detection (2020)
- Maat: Automatically Analyzing VirusTotal for Accurate Labeling and Effective Malware Detection (2021)
Binary analysis with deep learning
- Neural Network-based Graph Embedding for Cross-Platform Binary Code Similarity Detection (2017)
- PalmTree: Learning an Assembly Language Model for Instruction Embedding (2021)
- DeepDi: Learning a Relational Graph Convolutional Network Model on Instructions for Fast and Accurate Disassembly (2022)
Sequence-to-sequence modeling
- Style Transfer Through Back-Translation (2018)
- Unsupervised Translation of Programming Languages (2020)
- Leveraging Automated Unit Tests for Unsupervised Code Translation (2022)
Adversarial malware generation
- Intriguing Properties of Adversarial ML Attacks in the Problem Space (2020)
- Malware Makeover: Breaking ML-based Static Analysis by Modifying Executable Bytes (2021)
- Structural Attack against Graph Based Android Malware Detection (2021)
Adversarial malware defense
- Adversarial Deep Ensemble: Evasion Attacks and Defenses for Malware Detection (2020)
- A Framework for Enhancing Deep Neural Networks Against Adversarial Malware (2021)
- Adversarial Training for Raw-Binary Malware Classifiers (2023)
- On The Empirical Effectiveness of Unrealistic Adversarial Hardening Against Realistic Adversarial Attacks (2023)