implements the superset disassembler #944

KennethAdamMiller · 2019-04-04T22:02:23Z

This PR implements the superset disassembler as described in the following paper. The top post serves as working documentation. We will update it as the work and discussions proceed.

Requirements Specification

Functional Requirements

1.1 The superset disassembler shall be a selectable and configurable option.
1.2 The superset disassembler shall expose a user interface that enables configuration of multiple options.
1.3 The superset disassembler shall expose a library interface.

The first requirement will enable seamless integration into the platform: I would like to be able to do bap /bin/ls --disassembler=superset. It shall be packed as a separate configuration feature, so that we can do ./configure --enable-superset-disassembler. The second requirement will allow me to choose my preferences, e.g., bap /bin/ls --disassembler=superset --superset-disassembler-features=loops-with-breaks. Finally, the disassembler should expose a stable API which could be used to build ad-hoc and fine-grained solutions.

So, if you agree with those goals, let's always keep them in mind. Right now, as far as I understand the code, only requirement 1.3 is partially fulfilled.

Administrative Requirements

Next are the non-functional requirements, the administrative issues, so to say. As an artifact, this code doesn't belong in bap_disasm. It should neither be internal to the bap.std library nor be part of the Bap.Std interface. (The same is true for the recursive descent disassembler, which we will move out into a separate library in BAP 2.0.) Therefore it should be packed into two (optionally three) components: a reusable library, which exposes the programmatic interface to the disassembler and depends on the Bap.Std interface and, if necessary, others; a plugin, which exposes some of the library interface on the command line with sane defaults; and a frontend, which provides utility functions (we may pack them into the plugin instead; we will see later whether we need a frontend, the main concern being dependencies). Therefore, we have the following tasks:

2.1 move the library code out of lib/bap_disasm.
2.2 implement the plugin which will set up and load the superset disassembler
2.3 (optional) implement the frontend
2.4 ensure that all compilation units are properly namespaced

Concerning requirement 2.1, it is not really necessary to keep the code in the bap repository. If you want, you can keep it in your own repository, move your repository to the BinaryAnalysisPlatform organization, or pick a place in the bap repository, e.g., lib/bap_superset_disassembler.

Whichever choice you make, you have to give proper names to all your compilation units, aka files. OCaml has a flat namespace for compilation units, so if you have a file named features.ml you will not be able to link any other plugin or library that has a file with the same name in its code base. Therefore, you need to prefix all your files, e.g., start all library files with bap_superset_disassembler_ and all your plugin files with super_disassembler_ prefixes.

Coding Standards Requirements

These are self-explanatory:

3.1 no dead code
3.2 no commented out code
3.3 all modules shall have interface files (exception: modules that define only types or module types)
3.4 no debugging output in the library code
3.5 no todos in the released code
3.6 no exceptions beyond failed invariants or preconditions
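
As an illustration of rule 3.6, here is a minimal OCaml sketch, assuming a made-up hex-decoding helper (byte_of_hex, nth_byte, and the error type are hypothetical and are not part of BAP or of this PR). Expected failures such as malformed input are returned in the result type, while an exception is reserved for a violated precondition:

```ocaml
(* Hypothetical example, not BAP code. *)
type error =
  | Bad_length of int     (* input was not exactly two characters *)
  | Bad_digit of string   (* input contained a non-hex character *)

let hex_digit c =
  match c with
  | '0'..'9' -> Some (Char.code c - Char.code '0')
  | 'a'..'f' -> Some (Char.code c - Char.code 'a' + 10)
  | 'A'..'F' -> Some (Char.code c - Char.code 'A' + 10)
  | _ -> None

(* Malformed input is an expected failure, so it is reported in the type. *)
let byte_of_hex s =
  if String.length s <> 2 then Error (Bad_length (String.length s))
  else match hex_digit s.[0], hex_digit s.[1] with
    | Some hi, Some lo -> Ok (hi * 16 + lo)
    | _ -> Error (Bad_digit s)

(* An out-of-range index is a caller bug, i.e., a failed precondition,
   so raising Invalid_argument is acceptable here. *)
let nth_byte bytes i =
  if i < 0 || i >= Array.length bytes
  then invalid_arg "nth_byte: index out of bounds";
  bytes.(i)
```

A caller then has to match on Ok and Error explicitly, and only a programming error can surface as an exception.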

Quality Requirements

4.1 provide unit tests that check the invariants of crucial components
4.2 provide a set of functional tests

The number of tests is to be decided, though I would like to have close to 100% coverage of the core components.

Documentation Requirements

5.1 provide an overall description of the algorithm
5.2 provide the detailed description of the disassembler architecture and implementation
5.3 document the public interface

The overall description should include a brief overview of the algorithm, its purposes, and its tradeoffs. It shall reference the paper. If there are any differences between the paper and the implementation, they should be highlighted. This documentation will end up in the plugin man page, so a user shall be able to understand without further ado why they need this plugin, how to enable it, and how to configure it.

The detailed documentation is needed for us to support and fix it. It could be spread around the github discussions, comments in the internal mli files, and ml files. It shall document the purposes and invariants (if any) of all modules, and some crucial functions.

Finally, all public (accessible via the public mli file) functions shall be thoroughly documented, so that a user can apply them without having to refer to the implementation.

ivg · 2019-04-05T12:12:47Z

Cool, glad to see this PR. Great work!

But we have lots of things to do. First of all, let's set up our end goals, the final requirements, so that we can always trace our decisions back to them.

UPDATE: moved the specification to the top post.

So, godspeed! And don't hesitate to ask questions and request further directions, we will be happy to help and assist you. Now that we have this PR as a good vantage point, it should be easy.

KennethAdamMiller · 2019-04-07T19:11:15Z

Thanks for your review. Yes I absolutely agree with those requirements and the code quality rules.

Ok, if you plan to move all the disassemblers out, then perhaps it mostly belongs under plugins? I can make a library for it.

I can certainly do the command line updates to Bap, and the documentation requests. However, the API changes require much more in-depth thought, so I want to ask about those.

There are certainly many improvements that need to be made. As far as the API goes, I would like to ask if the internals of the graph implementation should be kept, because I needed the ability to make the graph a reverse instruction sequence graph, where per-instruction flows could be recorded. When I started it, I don't think the graph interface that Bap now provides was available, and it was hard to refactor then. In addition, I think that the implemented features should be provided through a kind of plugin or registration capability, so that others see that they can provide new heuristics to the fixpoint convergence. For the most part, there is little design layering: the current features are implemented directly on the naked representation, so I think this represents the first refactoring that needs to be done. Thoughts?

As far as the library, plugin, and tests go, I can make the interface look ok pretty fast, and I have a binary with a cmdliner that provides the ability to call it up. I do already have some tests, but they need to be refactored after the design and interface are done, and they need to provide better coverage. But as for the plugin capability of superset_disasm, without an additional ability to intervene in the construction process of bap while it's lifting a binary, I don't see how I can provide a meaningful plugin, because there is not yet a way to provide a pass that runs at disassembly time. I asked about that once on the channel, and you advised that I should provide my own instance of the bap process chain, but I do know that it will get refactored and provided to users, so if I implement it, I will likely need to remove such a chain later.

As for further work that could provide substantial improvements, I have a number of suggestions regarding the workings of the disassembler, covering at least performance and code quality. Right now, the capability as implemented functions at a sufficient level, but there is room for it to become much sharper. Should I put those forward for review?

ivg · 2019-04-08T12:24:42Z

Ok, if you plan to move all the disassemblers out, then perhaps it mostly belongs under plugins? I can make a library for it.

Yep. I can't really predict how much code will end up in the plugins and how much will end up in the lib folder, but definitely a lot of the code here doesn't belong in the library.

There are certainly many improvements that need to be made. In so much as the API goes, I would like to ask if the internals of the graph implementation should be kept, because I had to have the ability to make the graph an reverse instruction sequence graph, where per-instruction flows could be recorded. When I started it, I don't think that graph interface that Bap now provides was available, and it was hard to refactor then. In addition, I think that the features implemented should be provided as a kind of plugin or registration capability so that others see that they can provide new heuristics to the fixpoint convergence. For the most part, there isn't so much usage of design where current features are kind of implemented directly on the naked representation, so I think this represents the first refactoring that needs to be done. Thoughts?

You can always expose some of the interface as experimental and some as stable. You can also play with library vs plugin tradeoff, where libraries stored in the plugin are private, vs libraries in the lib folder, which are public. See our x86 and llvm plugins/libraries.

Having it extensible is really nice. So yes, we shall add some injection points, where users can inject their functions.
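
To make the idea of an injection point concrete, here is a minimal sketch of a registration interface for heuristics. It is a toy under stated assumptions: the heuristic type, the register and score functions, and the two sample heuristics are all hypothetical, and none of them come from BAP or from the code in this PR.

```ocaml
(* Toy registry; every name here is hypothetical, not BAP's API. *)

(* A heuristic scores a candidate address; higher means more code-like. *)
type heuristic = int -> float

let registry : (string, heuristic) Hashtbl.t = Hashtbl.create 16

(* Register a heuristic under a unique name.
   Registering the same name twice is treated as a caller error. *)
let register name h =
  if Hashtbl.mem registry name
  then invalid_arg ("heuristic already registered: " ^ name);
  Hashtbl.add registry name h

(* Sum the scores of the enabled heuristics, ignoring unknown names. *)
let score ~enabled addr =
  List.fold_left (fun acc name ->
      match Hashtbl.find_opt registry name with
      | Some h -> acc +. h addr
      | None -> acc)
    0.0 enabled

(* Two made-up heuristics that a user plugin could inject. *)
let () =
  register "aligned" (fun addr -> if addr mod 4 = 0 then 1.0 else 0.0);
  register "low-address" (fun addr -> if addr < 0x1000 then 0.5 else 0.0)
```

With such a registry, a command line option could simply carry the list of enabled names, and third-party code could add heuristics without touching the core.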

But for the plugin capability of superset_disasm, without additional ability to intervene in the construction process of bap while it's lifting a binary, I don't see how I can provide a plugin to meaningful plugin because there is not a way yet to provide a pass that runs at disassembly time.

No problem, we will add an injection point in BAP, so that a user can specify their own disassembling algorithm. We didn't have one, just because we had only one option :) Now we have the superset disassembler, VSA, and other options, so it is time to create such a point.

Should I put those forward for review?

I think we shall focus on achieving our requirements and leave room for further improvements. The ideal course of action is to make it work with a minimal API exposed, then gradually improve it and expose more and more functionality via the library interface.

ivg · 2019-04-08T12:30:27Z

Besides, do not hesitate to edit the top post and update it with new information. Treat it as a project wiki.

KennethAdamMiller · 2019-04-08T15:11:04Z

Ok, all of that sounds good.

"top pos"?

ivg · 2019-04-08T16:00:05Z

"top pos"?

the top post

XVilka · 2019-08-12T07:21:20Z

Should it be updated on top of BAP 2.0 PR then? #960

KennethAdamMiller · 2019-08-12T13:49:26Z

I will update it to use BAP 2.0

ivg · 2019-08-15T17:50:02Z

OK, now that BAP 2.0 has become a reality, we should rethink this PR from scratch. It basically provides some facilities that we no longer need (we now have a speculative disassembler by default) and some facilities that we do need, and which are much easier to integrate now (the probabilistic part of the disassembler).

In BAP 2.0 we have a knowledge base, which is basically a global storage of information that different components can use for information exchange.

The new disassembler that drives program reconstruction in modern BAP is speculative by default. It consults the knowledge base when deciding whether to speculate. In particular, it consults the core-theory:is-valid property of the core-theory:program and attempts to disassemble all addresses for which the corresponding object is-valid. Therefore, what the speculative disassembler should do is provide, for each address, its assessment of whether the corresponding code is valid. In the default setup, our knowledge providers mark only function starts as valid, letting the disassembler find the rest of the code. This basically leads to the same results as the classical recursive descent disassembler (although it is still more conservative than the old implementation, since we fall through after each jump unless we have evidence that the instruction is a barrier).

So the only thing that we need from the probabilistic disassembler is to mark what it thinks to be code with is-valid (true) and then the default engine will disassemble it (if it is possible).

Also, the new disassembler is backtracking and it will automatically classify as data all instruction sequences which end up in data (i.e., if an execution from a certain point is guaranteed to hit an invalid instruction or otherwise non-executable code, then all instructions that constitute this chain will be marked as data).
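
The classification rule described above can be sketched on a toy model. This is not BAP's implementation: the insn type, the list-based program, and the step and classify functions are hypothetical, chosen only to show the rule "an address is data if execution from it is guaranteed to reach an invalid instruction" as a least fixed point.

```ocaml
(* Toy model; constructors and names are hypothetical, not BAP's types. *)
type insn =
  | Invalid              (* the bytes do not decode *)
  | Fall of int          (* falls through to the given address *)
  | Branch of int * int  (* may continue at either address *)
  | Ret                  (* leaves the analysed region *)

module Addrs = Set.Make (struct type t = int let compare = compare end)

(* Addresses outside the program are treated as Invalid. *)
let lookup prog a = try List.assoc a prog with Not_found -> Invalid

(* One pass: an address is data if it is invalid itself,
   or if every one of its successors is already known to be data. *)
let step prog data =
  List.fold_left (fun acc (a, i) ->
      let bad x = Addrs.mem x data || lookup prog x = Invalid in
      let is_data = match i with
        | Invalid -> true
        | Fall n -> bad n
        | Branch (t, f) -> bad t && bad f
        | Ret -> false in
      if is_data then Addrs.add a acc else acc)
    data prog

(* Repeat the pass until nothing changes. *)
let rec classify prog data =
  let data' = step prog data in
  if Addrs.equal data data' then data else classify prog data'

let prog = [
  0, Fall 1;          (* chain 0 -> 1 -> 2 ends in an invalid instruction *)
  1, Fall 2;
  2, Invalid;
  3, Branch (0, 4);   (* one arm reaches Ret, so 3 is not guaranteed to fail *)
  4, Ret;
]

let data = classify prog Addrs.empty
```

In this model addresses 0, 1, and 2 end up classified as data, while 3 and 4 stay code because a path from 3 avoids the invalid instruction.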

I hope this makes sense, for starters, so if you're interested (or anyone else is interested) in updating this PR and bringing the probabilistic disassembler to BAP, then feel free to ask questions and request explanations :)

KennethAdamMiller · 2019-08-16T04:21:37Z

I'm still very interested in finishing this out. I've been waiting until bap 2.0 lands to continue the plugin development, because I had started on it. I'm also super busy with other tasks, so it may be a while.

On another note, I do have some considerations that I'll have to examine carefully. The fact that the speculative disassembler that BAP now provides is backtracking makes me wonder exactly what your specification of backtracking is. You wrote "backtracking and it will automatically classify as data all instruction sequences which end up in data" - are those two different features or part of the same feature? If different, and if it works by eliminating sequences of instructions that fall through to a well-formed assembler violation, traveling from each instruction to its ancestor set, then that is the mechanism my disassembler had implemented itself. If you mean backtracking in another sense, then I'll need to be able to control it with some kind of lambda function or something. I haven't seen the new API, so I'm not certain yet. Also, on the speculation part, I would like to know whether I should first do all the processing on my side before providing the set of labels to the disassembler through the knowledge api, or whether my api is something that will get resolved by the BAP mechanism, so that your is-valid can be run iteratively. I ask because I have found that a fixpoint for disassembly confidence convergence is one mechanism, but there are also good results from iterating on the disassembly itself, because you can obtain increasingly better results.
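
The ancestor-set elimination mentioned above can be sketched on a reversed instruction graph. This is a toy with hypothetical names (reverse, ancestors) and a made-up edge list; it is not taken from the PR's code and it makes no claim about how BAP's engine backtracks.

```ocaml
(* Toy sketch; names are hypothetical and not taken from the PR's code. *)

(* Input: forward edges as (address, successor addresses).
   Output: a table mapping each address to its predecessors. *)
let reverse edges =
  let preds = Hashtbl.create 16 in
  List.iter (fun (src, dsts) ->
      List.iter (fun dst -> Hashtbl.add preds dst src) dsts)
    edges;
  preds

(* Walk the reversed edges from each bad address and collect
   everything that can reach it, the bad addresses included. *)
let ancestors preds bad =
  let seen = Hashtbl.create 16 in
  let rec visit a =
    if not (Hashtbl.mem seen a) then begin
      Hashtbl.replace seen a ();
      List.iter visit (Hashtbl.find_all preds a)
    end in
  List.iter visit bad;
  List.sort compare (Hashtbl.fold (fun a () acc -> a :: acc) seen [])

(* Overlapping candidate instructions, one starting at each byte offset. *)
let edges = [0, [2]; 1, [3]; 2, [5]; 3, [4]; 4, [6]]

(* Suppose offset 5 fails to decode: everything that flows into it goes. *)
let removed = ancestors (reverse edges) [5]
```

Here offsets 0 and 2 are removed together with 5, while the chain 1, 3, 4 survives. Unlike an "all paths" rule, this sweep removes an instruction as soon as any of its flows reaches a violation, which is one of the points the question above is asking about.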

KennethAdamMiller · 2019-08-16T04:21:52Z

It may take me a long time, but I won't give up on merging this.

XVilka · 2019-11-14T04:37:12Z

I guess the milestone should be moved then?

KennethAdamMiller · 2019-11-14T04:40:54Z

Yes, you could update that.

Sorry, I've been super busy pursuing a degree and doing research at the same time.

KennethAdamMiller · 2020-07-01T05:21:02Z

Ok, upon rethinking this, one of the first things I've encountered is that, because I cannot retrieve the original image from the project, I have to load the file by retrieving its name from somewhere in the pass. The loaded image will get garbage collected, so it's not that big of a deal, and I will then write the results to the bap knowledge base as retrieved from the superset.

ivg · 2020-07-01T12:38:39Z

Ok, upon rethinking this, one of the first things I've encountered is that because I cannot retrieve the original image from the project

If you mean the code section, you can get it with Project.Input.code.

that I have to load the file by retrieving the name from somewhere in the pass.

That is not a good idea. First of all, it is not guaranteed that there will be a file at all. Second, it will not work with the existing file loaders. Third, it is just suboptimal.

Initial superset disasm integration

b08a71d

ivg added this to the 1.7.0 milestone Apr 5, 2019

ivg changed the title ~~Initial superset disasm integration~~ implements the superset disassembler Apr 5, 2019

ivg modified the milestones: 1.7.0, 2.0.0 Aug 15, 2019

XVilka mentioned this pull request Jun 12, 2020

enables ARM Thumb support #1122

Closed

ivg marked this pull request as draft June 12, 2020 20:23
