UnderstandableBinary - ML binary demangler

What is this?

This project uses machine learning to convert raw decompiled code from binaries into a cleaner, more readable form.

We take a large dataset of C/C++ code, compile it, decompile the binaries, then train a model to translate the decompiled code back into the original source.

Afterward, we have a model which can convert ugly decompiled code into cleaner code.
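
As a rough illustration of the data side of this pipeline, the sketch below pairs each decompiled function with its original source so the pairs can serve as translation examples (decompiled text in, original text out). It is a minimal sketch and not this project's actual code: `TrainingPair`, `build_pairs`, and the idea of matching functions by name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingPair:
    """One translation example: decompiled text in, original source out."""
    name: str
    decompiled: str
    original: str


def build_pairs(decompiled: dict[str, str], originals: dict[str, str]) -> list[TrainingPair]:
    """Pair functions by name (hypothetical matching rule).

    Functions present on only one side are skipped, since they cannot
    form a complete (input, target) example.
    """
    shared = sorted(decompiled.keys() & originals.keys())
    return [TrainingPair(name, decompiled[name], originals[name]) for name in shared]


if __name__ == "__main__":
    # Toy data standing in for decompiler output and the original sources.
    decompiled = {
        "add": "int add(int param_1,int param_2){return param_1 + param_2;}",
        "orphan": "void orphan(void){return;}",
    }
    originals = {
        "add": "int add(int x, int y) { return x + y; }",
    }
    for pair in build_pairs(decompiled, originals):
        print(pair.name, "->", pair.original)
```

In practice, matching decompiled functions to their sources is harder than a name lookup (symbols may be stripped or mangled), so treat the name-based join here as a placeholder for whatever alignment the real dataset uses.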

How to install

Install dependencies using your platform's package manager (Homebrew is recommended on macOS), then clone the repository:

> git clone git@github.com:Jakobeha/UnderstandableBinary.git

How to use

> cd UnderstandableBinary
> run.sh [options]...

You can also open the project in IntelliJ, which includes sample run configurations. Note that you may need to change some global library locations (e.g. the path to Poetry).

Project layout

This project uses many different languages and frameworks. READMEs and run.sh scripts are in subdirectories. The root is an IntelliJ project, but the modules are in subdirectories.

../UnderstandableBinary-data/: The default location where the dataset is generated and stored. This cannot be inside UnderstandableBinary/ because the dataset is extremely large and contains code, which confuses many tools (including find) and makes everything a hassle. You can override the dataset directory, and you may want to put it on a separate volume with more storage.
python/: Python scripts, which use Poetry for dependency management. Mainly for training and running the model, since that part is written in Python.
get-data/: Generate the dataset
- apt/: Download and build code from the Debian APT repo
- vcpkg/: Download and build code from the vcpkg repo
- decompile/: Decompile binaries using Ghidra
local/: Local directory where you can store scratch data which isn't part of the dataset. Some log files are also stored here.
- ghidra_logs/: Ghidra script log files
docs/: Documentation

Related

N-Bref: A neural-based decompiler framework and binary analysis tool
G-3PO and GptHidra: generate an explanation for a decompiled function and suggest variable names (uses GPT-3 and Ghidra)
Gepetto: generate a doc-comment for a decompiled function, plus rename variables and add comments in the function body (uses GPT-3 and IDA Pro)

Contributing

Conventions:

File and directory names are usually kebab-case unless there's a reason to do otherwise (e.g. Java)
Follow PEP 8 and shellcheck (the IntelliJ defaults)

TODO: add more
