New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final 1.3 update #688

Merged
merged 61 commits into from Aug 24, 2017

Conversation

Projects
None yet
4 participants
@ivg
Member

ivg commented Aug 11, 2017

This PR brings multiple new features from our development branches. In general this PR focuses on three main features:

  1. BIL correctness and normalization
  2. BIL readability and performance
  3. Primus interpreter

Note for reviewers and readers: I wouldn't expect you guys to check the code thoroughly, but you are welcome to ask questions, request clarifications, and discuss changes. Most likely just reading this PR message would be enough for the review.

Note: the Table of Contents doesn't work since Github doesn't emit anchors on the PR page. They will work (I hope so), when we will move this page to the release page.

Table of Contents

Detailed Description

BIL Typechecker

We added a type checker, and now force all BIL code to be well-typed. Lifters must produce only well-typed code. So far, we will issue a warning for each piece of code that is not, along the 1.4 we are planning to fix all issues if any. The Type module now have the infer function, that will infer type for an expression, and check that will type check a BIL program.

BIL Effect Analysis

A precise effect analysis was added to ensure correctness of program transformations. The effect analysis distinguishes between different effects and coeffects:

  • reads (coeffect)
  • loads (coeffect)
  • writes (effect)
  • raises (effect)

The raises effect includes the division by zero CPU exception and a conservative AI will infer a possible presence of the division by zero in a BIL expression.
A new module, called Eff is added, that provides functions for the effect analysis.

BIL Simplification and Constant Propagation

The old constant propagation and simplification algorithm was buggy, and didn't take into account effects when removing statements and subexpressions. It was rewritten from scratch with an algorithm that is sound and parametrized by the set of effects, that may be ignored during the analysis. The new analysis treats operations accurately with respect to their algebraic properties and takes into account operations associativity. It is also now a big step analysis and no longer requires to be wrapped into the fixpoint function. Basically, we are moving from the open and semi-open recursion to the closed recursion wherever it is possible, as it is more efficient and intuitive approach. We will probably remove the fixpoint function in BAP.2.0.

BIL normal form and normalization utilities

We introduce a notion of a BIL Normal Form, that is a sub-language of BIL that disallows certain constructions, such as let and if/then/else expressions, complex memory expressions, and complex conditionals. This doesn't impose, that those expressions are not allowed in general, and lifters can still emit code that is easier to write. All such expressions are removed, i.e., are rewritten during the normalization process. The normalization procedure is described in details in the documentation, as well as in comments. Here is a brief highlight:

  1. no let expressions
  2. no if/then/else expressions
  3. all memory operations are one-byte only (size and endianness can be ignored)
  4. no anonymous memory storages - load expressions may load only from a variable
  5. no loads and stores inside conditional expressions
  6. no ill-typed expressions

SSA transformation plugin

This is a simple plugin that just transfers all functions into SSA form. It was moved from the bap-plugins repository.

Dead code elimination plugin

A sound analysis that will remove all unused code. This plugin is very useful, as it will remove lots of unnecessary code that are emitted by lifters. In particular, on x86 it removes flag compuations that are not used anywhere. As a result, this reduces the size of the output to 50% on x86 and makes BIL much more readable. It also highly increases the speed of interpretation. Especially, since many flag computations are quite computationally hard.

New pretty printer for bitvectors

Printers from the Z library proved to be slow, buggy, and ugly. Thus we decided to provide our own pretty-printer that will suit all the needs. A generic printer is added, that allows a user to customize the look of the textual representation of a bitvector. We also provide 9 different instantiations of the generic printer, e.g., pp_hex, pp_dec, pp_oct and pp_bin that will print a bitvector in the corresponding format. We also remove suffixes from a bitvector to make output more readable and concise. The full version is available via pp_hex_full, pp_dec_full, ... printers. We also change a default textual representation (i.e., to_string and pp) functions. The new representation now includes the signedness suffix, and moves away from using true and false for the one-bit words.

New functions for bitvectors

We added the to_int[XX]_exn family of functions, that will efficiently cast a bitvector into a corresponding OCaml type without extra allocations whenever possible. For example, Word.to_int_exn is usually an identity function w.r.t. the GC (i.e., it still has a check that value fits, and if it is, the it is an identity).

We also added the Word.unsigned function that will cast away the signedness of a bitvector.

The Unsafe module was added to the bitvector interface that provides arithmetic instructions that do not check that operands have correct widths. The module can be used for interpreting a well-typed BIL code. Moreover, these operations are included by default in the Bitvector top-level interface, thus, a Word.(a + b) will result in the unsafe (+) operator.

New pretty printer for BIL

We implemented a new pretty printer that (i) doesn't emit unnecessary parentheses anymore, and uses more concise syntax whenever it is possible. We followed the C language for the operator precedence (not OCaml). Here is the precedence table:

   +-------------------------------------------+----+
   | x[y], <cast>:<N>[x], extract:<N>:<M>[x]   | 10 |
   +-------------------------------------------+----+
   | ~x, -x                                    |  9 |
   +-------------------------------------------+----+
   | *,/,/$, %, %$                             |  8 |
   +-------------------------------------------+----+
   | +,-                                       |  7 |
   +-------------------------------------------+----+
   | <<,>>,~>>                                 |  6 |
   +-------------------------------------------+----+
   | <,>,<=,>=                                 |  5 |
   +-------------------------------------------+----+
   | =, <>                                     |  4 |
   +-------------------------------------------+----+
   | &                                         |  3 |
   +-------------------------------------------+----+
   | ^                                         |  2 |
   +-------------------------------------------+----+
   | |                                         |  1 |
   +-------------------------------------------+----+
   | let.., x with y <- z, if.., . (dot)       |  0 |
   +-------------------------------------------+----+
  1. Higher precedence means that operators binds tighter.
  2. Any bil statement has precedence lower than an expression

The load and store expressions are also simplifed, now it is not necessary to put the endiannes and size if an expression is byte sized.

Identifiable values for the Primus interpreter

We used the word type for values in the Primus interpreter, however, it showed that it is very hard to write a data tracking analysis without being able to identify values. For example, in the taint propagation plugin, we implemented a stack machine for tracking the taint. It is not only hard to maintain and understand, but also is inefficient. This defeats the main reason why we decided to use the word type as a value representation in the interpreter -- the efficiency.

We now represent a computation result as a pair of a word and a unique identifier. Basically the same as with the Bil.Result.t except that we are not using a variant type to represent storages and bottom values. In the Primus model, there are no such values, and everything is represented with a machine word, coined with an identifier.

To make it easier to work with the new values, we lifted most of the bitvector interface into the value interface.

New Primus interpreter

The Primus Interpreter is reimplemented and doesn't use the eval or microx code anymore (though there is still lots of code sharing since we factorized a lot of code from microx to the BIL library). The interpreter now tracks data much more closely, and provides an interface that allows code summaries, that are implemented in Lisp or in OCaml or in any other language to describe all code effect precisely. We also revisited observations that are made by the interpreter, and moved memory and environment observations into the interpreter.

New Primus Mark Visitied plugin

This plugin will mark each term that is visited (evaluated) by the Primus interpreter with the visited attribute. It is useful for tracking the coverage.

Diagnostic messages with better backtraces

There is an impedance mismatch between OCaml exceptions and the error monad. Sometimes we switch from one to another and loose the backtracing information. Now the backtraces are always stored in Or_error.t values and are printed in case of failure.

Caveats, Limitations, and Regressions

  1. The normalized form is not represented in a type system, thus it maybe applied more than once. In future we are planning to add phantom types for that.

  2. The dead-code elimination plugin is computationally intensive but is not cached. Thus the startup time of each binary is actually increased. We will later and the caching layer to alleviate this.

Fixes and updates for the x86 lifter

The shift operation representation was slightly incorrect, as it was using x := x instruction to represent that the value of x is unchanged. From the analysis perspective this is not entirely true, as it is an update of a variable, an unnecessary usage (that keeps DCE from removing this statement) and, in many cases, it left the right hand side left variable upper exposed, that was introducing a source of undefined values, as well as prohibited certain kind of optimizations on the interprocedural values (we were forced to treat x as an implicit input parameter of a function, where it is upper exposed).

The shift lifter also introduced unnecessary temporal variables, as well as used lots of ite. The former is removed, the latter is translated into one if statement (basically all ite were guarded by the same condition, so we can gather them, that will lead to much better CFG).

Bug fixes in ARM lifter

The ARM lifter was emitting badly typed code. Now it is fixed.

Support for tracing and other minor fixes

We added a handler for SIGINT so that gprof and other utilites, that requires a correct exit call, can work properly. We also added BAP_DEBUG=true to the Travis environment to ease the debugging on it.

TODO

Note, this is a work in progress, there are still a few minor issues that should be resolved before this branch would be ready for the merge. Here comes a list

  • Mark all new functions with the @since tag
  • Document all new functions
  • update all lifters with the typechecking code
  • remove calls to Bil.fixpoint as it is no longer necessary (in most cases)
  • check that all new plugins have sufficient manpages and documentation
  • check that all usages of Word.string_of_value are consistent, so that we will not break any integrations with external tools
  • update README.md with nice examples of BIL. Ideally, use the well-known strcpy as an example.
  • rewrite observer notifications so that flambda can inline them
  • cleanup the observations, make sure that we don't have any dead observation handlers.

ivg added some commits Jul 5, 2017

makes main state global and few bug fixes
The main change is that now there is only one main state in the
machine (aka project), not a separate copy per machine.

This commit also fixes leave-{blk,sub} events and adds several new
observations, such as exn-raised, that occurs every time a machine
switches to an exceptional control flow.

The commit also adds few common method to the monad interface.
adds attributes to bitvectors
Now it is possible to attach arbitrary values to a bitvector.
ensures normalization and well-typedness
also fixes a bug in Eval.binop.
fixes a bug in arm lifter
This bug leads to a type error, as a result of a byte load and a half
word load is stored in a 32 bit variable.
fixes a bug in type checker and exp normalizer
a type checker didn't held correctly shifts
and the normalizer didn't properly recurse
do not force constant folding
allow a backend to choose whether to propagate consts or not.
removes exp normalization
BIR becomes unreadable. We will apply normalization when needed.

Later, we may add a memoization or stuff like that.
sets BAP_DEBUG to true
this enables better diagnostics of compilation and configuration errors
on travis.
adds primus mark visted plugin
the plugin will mark all terms that were visited by Primus with the
visited attribute.
a precedence sensitive exp pretty printer
Should do this long ago, but finally found some time. No more
unnecessary parentheses!

Note: operator precedences follow the C language, not OCaml.

   +-------------------------------------------+----+
   | x[y], <cast>:<N>[x], extract:<N>:<M>[x]   | 10 |
   +-------------------------------------------+----+
   | ~x, -x                                    |  9 |
   +-------------------------------------------+----+
   | *,/,/$, %, %$                             |  8 |
   +-------------------------------------------+----+
   | +,-                                       |  7 |
   +-------------------------------------------+----+
   | <<,>>,~>>                                 |  6 |
   +-------------------------------------------+----+
   | <,>,<=,>=                                 |  5 |
   +-------------------------------------------+----+
   | =, <>                                     |  4 |
   +-------------------------------------------+----+
   | &                                         |  3 |
   +-------------------------------------------+----+
   | ^                                         |  2 |
   +-------------------------------------------+----+
   | |                                         |  1 |
   +-------------------------------------------+----+
   | let.., x with y <- z, if.., . (dot)       |  0 |
   +-------------------------------------------+----+

1) Higher precedence means that operators binds tighter.
2) Any bil statement has precedence lower than an expression.
fixes simplifier, adds cast simplification
1. at some branches the simplified didn't recurce
2. if a cast casts to the same type it is removed
adds ssa and dead-code-elimination plugins
The ssa just translates a program into the SSA form
The dead-code-elimination is a conservative deadcode elimination
plugin, that helps alot with x86 binaries by removing tons of unused
flags calcualations.
makes string_of_value more consistent
The string_of_value function now emits the `0x` prefix for hexnumbers,
unless instructed explicitly with the `prefix:false` flag.

The reason for this change is to enable a consitent handling of signed
and unsinged, negative and positive values.
publishes Exp.substitute function
long time ago we forgot to make it public.
enhances dead-code elimination plugin
1. Now it will run until a fix point is reached
2. A simple constant propagation is added, so that more virtuals are
removed
renames mem32 and mem64 in x86 lifter to mem
there is no need to have two different names for memory.
@maurer

This comment has been minimized.

Show comment
Hide comment
@maurer

maurer Aug 15, 2017

Member

Yeah, what you wrote there was (almost) exactly what I was suggesting. The main difference is that I'm not suggesting that you put all basic block addresses there, but instead the list you assumed it was possible to jump to when you did your dead code analysis, e.g. only the immediate successor nodes for the indirect jump. Hopefully this is usually a short list... We could even package it as a separate pass that you do before any analysis that assumes a complete CFG. This would get us two wins:

  • If __cfi_abort or whatever we call it gets called, we can use what it got called with to revise/expand the CFG
  • When running code that has been deadcode-eliminated, there is no uncertainty over whether the execution is viable - any deadcode-eliminated trace that doesn't include a call to __cfi_abort is definitely equivalent to a trace of the original program.
Member

maurer commented Aug 15, 2017

Yeah, what you wrote there was (almost) exactly what I was suggesting. The main difference is that I'm not suggesting that you put all basic block addresses there, but instead the list you assumed it was possible to jump to when you did your dead code analysis, e.g. only the immediate successor nodes for the indirect jump. Hopefully this is usually a short list... We could even package it as a separate pass that you do before any analysis that assumes a complete CFG. This would get us two wins:

  • If __cfi_abort or whatever we call it gets called, we can use what it got called with to revise/expand the CFG
  • When running code that has been deadcode-eliminated, there is no uncertainty over whether the execution is viable - any deadcode-eliminated trace that doesn't include a call to __cfi_abort is definitely equivalent to a trace of the original program.
@ivg

This comment has been minimized.

Show comment
Hide comment
@ivg

ivg Aug 15, 2017

Member

Yeah, @maurer, I think it would be a good idea to have this pass. The main concern is that it will by default clobber the output. We can alleviate this by rewriting the test like this:


when r0 = a0 call r0
when r0 = a1 call r0
...
when r0 = aN call r0
call __cfi_abort with noreturn

It will be easier to analyse, and easier to eliminate clauses that are not feasible.

Yeah, looks nice, we will add this cfi-or-abort pass. But let's it do later, after the release.

Member

ivg commented Aug 15, 2017

Yeah, @maurer, I think it would be a good idea to have this pass. The main concern is that it will by default clobber the output. We can alleviate this by rewriting the test like this:


when r0 = a0 call r0
when r0 = a1 call r0
...
when r0 = aN call r0
call __cfi_abort with noreturn

It will be easier to analyse, and easier to eliminate clauses that are not feasible.

Yeah, looks nice, we will add this cfi-or-abort pass. But let's it do later, after the release.

@maurer

maurer approved these changes Aug 15, 2017

ivg added some commits Aug 16, 2017

docs update and few renamings
Updated bap.mli with a documentation for new functions.
Also, renamed Image.Scheme.reference to Image.Scheme.relocation.

ivg added some commits Aug 16, 2017

automates documentation generation
Now it is enough just to type `make doc` and everything will work out of
box, if a correct version of OCaml is used, i.e., 4.03. The Makefile
recipe will pull the latest version of argot (git is required) and use
it to build the documentation. BAP should be installed either from the
source tree, or from opam, doesn't matter.
updates Primus docs
also removes some unnecessary stuff from it.
cleans and documents the monads library
I've removed few unnecessary funciton, e.g., `State.modify` and also
fixed the type of the call/cc function.
adds type checker to backends
also fixes few bugs that were found as a result

1) a zero width word was created (should be a one bit width)
2) a bug in the typechecker that led to a stackoverflow

So far no more type errors
adds missing manpages
also added parameters to the primus loader.
gone crazy and rewrote the taint propagation plugin
because my last update to primus broke it(((
dispatches linker unresolved calls to a lisp stub
The __primus_linker_unresolved_call function is called in case if a
linker can resolved a call.

Also adds the `pc` method to the interpreter interface and publish it
as a primitive `get-current-program-counter`.
fixes a bug in random number generation
not actually in the random number generator, but rather in a casting it
to a word.
enhances the run plugin with the multi-entry mode
Now we can specify a list of entry points, or a magic `all-subroutines`
to start execution in parallel from all specified entry points. If no
specified, then all subroutine terms marked with the `entry_point`
attribute are entered.
adds a Primus component that ensures termination
The primus-limit plugin will terminate a machine after a certain amount
of computations has happened.

@gitoleg gitoleg merged commit 817eac8 into BinaryAnalysisPlatform:master Aug 24, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@ivg ivg deleted the ivg:final-1.3-update branch Mar 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment