
CGrabda/AssemblyFunctionExtraction

AssemblyFunctionExtraction

Setup

Install the dependencies, e.g., with pip inside a conda environment named {ENV}:

conda create --name {ENV} python=3.10
conda activate {ENV}
pip install r2pipe lief pefile capstone tokenizers

Usage

Disassembly pipeline

python pipeline.py --file_or_dir="file.exe"
python pipeline.py --file_or_dir="directory"         
python pipeline.py --help         

Virtual-physical address map

python address.py --build --merge
python address.py --help
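The internals of address.py are not shown here, so the following is only a minimal sketch of one common way to relate virtual addresses to file offsets in a PE: walk the section table and translate a relative virtual address (RVA) through the section that contains it. The `Section` class and `rva_to_file_offset` function are hypothetical names introduced for illustration, not part of this repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Section:
    """Hypothetical subset of a PE section header (illustration only)."""
    name: str
    virtual_address: int  # RVA at which the section is mapped
    raw_offset: int       # PointerToRawData: where the section starts on disk
    raw_size: int         # SizeOfRawData: how many bytes exist on disk


def rva_to_file_offset(rva: int, sections: list[Section]) -> Optional[int]:
    """Translate an RVA to a file offset, or None if no file bytes back it."""
    for section in sections:
        delta = rva - section.virtual_address
        # Only the first raw_size bytes of a section are stored in the file;
        # anything past that is zero-filled at load time.
        if 0 <= delta < section.raw_size:
            return section.raw_offset + delta
    return None
```

Returning `None` instead of raising keeps the helper easy to use when many addresses are mapped in bulk and some fall in headers or uninitialized data.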

Example (currently not working)

python example.py        

Goals

The goal is to extract disassembled functions from PE executables. We'll want a pure function that takes a file as input and returns disassembled functions, along with their locations in the binary, e.g.,

from argparse import ArgumentParser
from dataclasses import dataclass
from multiprocessing import Pool
from pathlib import Path


@dataclass(frozen=True)
class Function:
    """
    Crucial information from each function.

    # TODO: include any information that may be useful.
    """
    instructions: list[str]
    offsets: list[int]
    upper_offsets: list[int]


def extract_functions_from_binary(file: Path) -> list[Function]:
    """
    Extract disassembled functions and their locations from a PE binary.

    Args:
        file: file to extract from

    Returns:
        information about each function in the file
    """
    ...


def main() -> None:
    parser = ArgumentParser()
    parser.add_argument("--file", required=False, type=Path, help="File to disassemble.")
    parser.add_argument("--files", required=False, type=Path,
        help="Either a file containing a list of files or a directory containing files to disassemble.")
    parser.add_argument("--num_workers", default=8, type=int, help="Degree of parallelization.")
    args = parser.parse_args()

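    if args.file is not None and args.files is None:
        files = [args.file]
    elif args.files is not None and args.file is None:
        if args.files.is_dir():
            # Skip subdirectories; only regular files can be disassembled.
            files = [p for p in args.files.iterdir() if p.is_file()]
        else:
            # One path per line; ignore blank lines such as a trailing newline.
            files = [Path(line.strip()) for line in args.files.read_text().splitlines() if line.strip()]
    else:
        raise ValueError("Specify exactly one of --file or --files.")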

    with Pool(args.num_workers) as pool:
        pool.map(extract_functions_from_binary, files)


if __name__ == "__main__":
    main()        

Eventually, we'll want to parallelize this, so a pure function is easiest to work with.
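To illustrate why purity helps, here is a small sketch of mapping a pure worker over files and collecting the results. The worker `count_bytes` is a hypothetical stand-in for `extract_functions_from_binary`, and `run_parallel` is likewise an assumed name; the serial fallback for `num_workers <= 1` is a convenience added for debugging, not something the project prescribes.

```python
from multiprocessing import Pool
from pathlib import Path


def count_bytes(file: Path) -> tuple[str, int]:
    """Hypothetical pure worker: depends only on its input, shares no state."""
    return (str(file), len(Path(file).read_bytes()))


def run_parallel(files: list[Path], num_workers: int = 8) -> dict[str, int]:
    """Apply the worker to every file and return results keyed by path."""
    if num_workers <= 1:
        # Serial path: same results, easier to step through in a debugger.
        return dict(map(count_bytes, files))
    with Pool(num_workers) as pool:
        # Results arrive in completion order, so each one carries its own
        # file path rather than relying on position.
        return dict(pool.imap_unordered(count_bytes, files))
```

Because the worker returns its own key, the caller does not need results in input order, which lets `imap_unordered` hand them back as soon as each file finishes.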

This problem has many dimensions to consider, e.g., architectures, instruction sets, and obfuscations. For now, start with the simplest variation of the problem; we'll refine the specifications as we go. Depending on the difficulty of the task, we may later write a similar function that extracts basic blocks, as well as dedicated functions for different architectures (e.g., x86, x64, ARM) or file types (e.g., Android APK, Linux ELF, macOS Mach-O).
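If architecture-specific functions are added later, the pipeline will need to know which architecture a binary targets. As a rough sketch using only the standard library, the Machine field of the PE/COFF header can be read directly. The function name `detect_pe_machine` and the label strings are assumptions made for illustration, and the table lists only a few common machine values.

```python
import struct

# A few Machine values from the PE/COFF header; the table is not exhaustive.
MACHINE_NAMES = {
    0x014C: "x86",
    0x8664: "x64",
    0x01C0: "ARM",
    0xAA64: "ARM64",
}


def detect_pe_machine(data: bytes) -> str:
    """Return an architecture label for a PE image; raise ValueError if not PE."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ header")
    # e_lfanew (offset 0x3C in the DOS header) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    if len(data) < pe_offset + 6:
        raise ValueError("truncated PE header")
    # The 2-byte Machine field immediately follows the 4-byte signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04X})")
```

A dispatcher could then use the returned label to pick the matching extraction function, e.g., via a dictionary from label to callable.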

Resources

Tools

Reference

Papers

Surveys

Survey papers are probably the best way to introduce yourself to a new research topic.

Papers

These are some top-tier papers closely related to this project.

Malware classification from raw bytes

Malware classification from (not) raw bytes

Binary analysis with deep learning

Sequence-to-sequence modeling

Adversarial malware generation

Adversarial malware defense
