Proposition for better code generation from execution trace

Current process

How it is implemented

Spice86 registers and dumps:
- RAM (with executable code)
- Functions (with their segmented address)
- Execution data:
  - jumps / calls / rets (from where to where)
  - unaligned rets (rets that did not return where you would expect due to stack manipulation)
  - self modifying code data (data about which instruction modified which executable area and with which value)
Then you load that into ghidra with the spice86-ghidra plugin and generate code. This proved to work extremely well with LOGO.exe and reasonably well with DUNE.

What we do well

Generated code is not too bad to read thanks to ghidra function segmentation heuristics and JumpHander to transform function calls done via JMP to actual C# function calls
We handle some cases of self modifying code while still producing readable C#
We handle unaligned returns while keeping a high level C# function structure

Limitations

Ghidra does not handle real mode code well, this means code discovery with segmented address does not work well and sometimes jumps destination is not correct (See https://github.com/NationalSecurityAgency/ghidra/issues/981). We fixed that by forcing destinations with emulator data and having some heuristics to tell what is CS value at a location based on the fuctions segmented addresses, but it does not handle all cases.
Self modifying code is handled by detecting which byte is modified and generating code from there. For example, in MOV AX, 1234, if the part 1234 is modified, we just translate that by using the value stored in ram at the address where the constant is instead. Since we only register byte values without context, we can't know in all cases what are the possible instructions. Programs that swap code in and out of memory, unpackers that erase their code with the code they unpack would be poorly supported.
Since we don't analyze the flow (ghidra does that for us), we can't do advanced analysis:
- We don't transform jumps to if / while (resharper does it for us though).
- We don't transform assembly tests to higher level constructs. Example if(ZF) instead of if(AX==0)
We don't handle instructions that jump in the middle of other instructions

New process

Since we have an emulator observing the program beeing executed, we could generate perfect code by recording more advanced data on execution flow and by changing generation procedure.

What we want to record

I propose to record execution flow as a Control-flow graph (CFG).

This graph would be built by the emulator as it observes execution.

It would be able to represent self modifying code as branches in the graph.

Each node would represent an observed variant of an instruction.

Data we want to record for an instruction:

Its segmented address so that we know that code is self modifying if there is more than one instruction at an address
Instruction bytes so that we can convert it to C#
List of its possible successors (link to instructions that are executed just after it) so that we can generate control flow
The function it belongs to as registered by Spice86

There would be several roots to the graphs, one for each independent entry point:

One for the actual program entry point
One for each external interrupt handler that is called (timer, keyboard and so on)
We would create a new Recorder each time we detect a new handler beeing executed by the CPU for those interrupts

How to record that without wrecking the performances

Data structure to help the recording

class RawInstruction {
  // Address of the instruction. In case code rewrites itself, there are more than 1 instruction per address
  SegmentedAddress address
  // index in case there are more than 1 instruction per address
  int indexForAddress
  // bytes that were executed by CPU
  byte[] bytes
  // Possible next instructions for this instruction
  Map<SegmentedAddress, List<RawInstruction>> successors
  // Optimization, stores a combination of segmented address and index to quickly know if an address + index has been registered as successor for this instruction
  Set<long> successorKeys
  // function this instruction was executed with
  FunctionInformation function 
}

// One instance of this for each possible root (one for program entry point, and one for each actual hardware interrupt handler)
class ControlFlowRecorder {
  // the CFG graph, what we are interrested in. Initial node would be a dummy instruction (either a NOP or an other type than RawInstruction with an interface)
  RawInstruction root
  // to know where to attach new instruction at runtime (optimisation)
  RawInstruction previousInstruction
  // global state for the instructions
  InstructionsInMemory instructionsInMemory
  // to know what is the current function
  FunctionHandler functionHandler
  // To register and unregister breakpoints
  MemoryBus memory
}

// Only one instance of this for the whole run
class InstructionsInMemory {
  // size is 1024x1024, to know what is currently at an address (optimisation)
  RawInstruction[] currentInstructionForAddress
  // all the instructions that were encountered at address, to know when code was modified if the variant has been encountered before
  Map<SegmentedAddress, List<RawInstruction>> encounteredInstructionsForAddress
}

Process

Overview

When the CPU finishes executing an instruction, it calls the appropriate ControlFlowRecorder to tell it:
- The start segmented address of the instruction
- The instruction length
The recorder job is to register the instruction in the graph with minimal overhead. For this it needs to determine:
- If the instruction has ever been executed and needs a new node in the graph or not
- If the instruction has been executed after the previous instruction. If not, recorder will need to attach it to the previous instruction successors.

Checking if the CPU executed something new or not

First the recorder checks in currentInstructionForAddress if there is something or not.

If there is an instruction, it takes it from here as currentInstruction
If there is nothing:
- it checks in encounteredInstructionsForAddress if it can find an instruction for the executed bytes, if one is found, it takes it from here as currentInstruction
- if not, it creates a new RawInstruction and registers it (process described later)
- it sets the memory breakpoints for the bytes of the instruction (process described later)

Checking if the path taken is new or not

Then the recorder needs to know whether or not to attach currentInstruction to previousInstruction

It calculates a quick key composed of currentInstruction index and currentInstruction address as a long, and checks in previousInstruction.successorKeys if there is something or not.
If no result was found, it adds the instruction to previousInstruction.successors at the correct address

Finally

Then the records is done and sets currentInstruction to previousInstruction

How to create a new RawInstruction

Recorder instanciates a new RawInstruction from:
address is the address given by CPU
indexForAddress is the size of the list at address in encounteredInstructionsForAddress
bytes is an array of bytes from instruction bounds given by CPU
function is the current function given by functionHandler
Recorder adds the instruction to encounteredInstructionsForAddress
Recorder sets the memory breakpoints for the bytes of the instruction

How to register instruction for self modifying code detection

Recorder sets Write breakpoints for each byte of the instruction in memory. Breakpoint will do the following things when activated:
Delete all the breakpoints for the space occupied by the instruction
Set currentInstructionForAddress[instruction.address] to null. Since instruction is modified we consider it as not encountered until CPU executes it.

Gaps

How to handle an instruction beeing accessed by different CS:IP combinations?

Do we consider each CS:IP as separate nodes in the graph? Potientially could lead to lot of duplication
Do we compute CS in generated code or at least set it at the beginning of each function?
Do we do like current generator where there are different CS variables to use and they are used appropriately?

Storing and loading the CFGs to disk

In order to support multiple sessions, we need to be able to dump and reload the CFGs to disk so that next time the user loads a program, the paths that have been explored do not have to be explored again.

Storing

Several choices possible here, SQL DB, Graph DB, human readable Json files

Loading

When we load from disk:

We fill encounteredInstructionsForAddress so that recording can work as before
We set root and currentInstruction as it is in disk
Nothing else is needed. If CPU modifies code it will be detected at execution time when trying to attach the new instruction to the graph.

Code generation from the CFG

Code generation would happen when data dump occurs. There are several steps to go from the CFG to actual C# files

Emulator outputs a very crude CFG that is not very usable as is for code generation. We need to transform it in several stages to be able to output usable C#:

Give meaning to the bytes and have a graph of nodes with a meaning instead of bytes nodes.
Make links between nodes bidirectional so we know what is the predecessor
Group together instructions from self modifying code that are at the same address and that can be expressed as one instead of n instructions. Think about MOV AX, 1234 with the 1234 part being modified
Insert new nodes in the graph that are not instructions to handle self modifying code
Group instructions in blocks
Split the graph into functions
Generate C# from each function graph

Convert the CFG emulator graph to a higher level CFG graph

The graph of RawInstruction is good for the emulation phase because it allows to register code flow with minimal performance impact.

But it is too crude to be able to use it for the next steps.

We need a better representation of the instructions and of the flow in order to be able to use the graph to do generate beautiful C# :)

Let's convert it to higher level structure:

class CfgNode {
  SegmentedAddress address;
  List<CfgNode> Predecessors {get;set;};
  List<CfgNode> Successors {get;set;};
  // Equals and HashCode to use the address
}

class Root: CfgNode {
}

class Instruction : CfgNode {
  List<Field<?>> FieldsInOrder {get;}
  Field<byte[]> prefixes
  // includes subopcode
  Field<byte[]> opcode
  // get what allows to uniquely identify the instruction among other at the same address. usually all the fields except in some cases when they are modified (example imm value or disp), in this case instead of bytes there will be nulls
  abstract Discriminator Discriminator {get;}
  // Equals and HashCode to use the discriminator and super methods
}

class InstructionWithModRM : Instruction {
  Field<byte> modRm
  Field<byte> sib
  Field<byte[]> disp
}

class InstructionWithModRMAndImmValue: InstructionWithModRM  {
  Field<byte[]> immValue
}

class AddReg8Imm8WithModRm: InstructionWithModRMAndImmValue{
}

class Field<T> {
  int indexInInstruction;
  int length;
  SegmentedAddress SegmentedAddress {get;};

  T value;
  // differs to value for fields which do not represent something that changes CPU logic.
  // opcode would have value and valueForDiscriminator with the same value.
  // 1234 in MOV AX, 1234 would be a field<ushort> with 0x1234 as value and [null, null] in valueForDiscriminator
  byte[] valueForDiscriminator;
  // When true value is to be used, when false get the value from memory instead
  bool UseValue {get;set;}
}

class Discriminator {
  byte?[] DiscriminatorValue {get;}
  // Implement Hascode and Equals with the DiscriminatorValue
}

Code will need to parse each RawInstruction, determine prefixes, opcode and other bytes, and convert it to its proper instance.

Conversion would happen as follow:

Graph of RawInstruction is browsed for each ControlFlowRecorder.
Each RawInstruction is converted to an Instruction without filling its successors.
The converted RawInstruction is put into a Dictionary<RawInstruction, Instruction>
The graph browsing is finite because we don't process instructions twice.
When all nodes of the raw graph has been processed, the Dictionary is read entry by entry and each successor of RawInstruction is retrieved as Instruction and added to the list of successor of the Instruction for the entry.
Instanciate a Root CfgNode for each ControlFlowRecorder and attach the successors of the RawInstruction root to those roots

Output: A list of Root nodes representing the entry point or various harware handlers

Make bidirectional links

Until now we have only been having forward links so it's not easy to know from where an instruction was called. We need to add this information now.

Algorithm is quite simple: browse each root and for each node get all the successors:

Add the node to the Set of nodes already encountered
Add the node to each successor predecessors

Group instructions from self modifying code

This step is what will allow code that stores values in instructions to get more performances or shorter listing to look not too bad.

In case the original code was doing so, it would replace 1234 in MOV AX, 1234 with 1235 for example.

We want to say that this is not to be used as an immediate value but as the value in the memory address, something the CPU cannot do but we can (we are in C# :) )

For each instruction with more than 1 successor:

Build a Dictionary<SegmentedAddress, Dictionary<Discriminator, List>> called successorsPerAddressAndDiscriminator from the successors. Each time there is more than one item in the value list, it means there was code modification at execution time.
The purpose is to process cases where the List in the dictionary is more than 1 in lenght, it means there are instructions that differ only by immediate values which is something we don't need in C#.
For each of such List we need to reduce it to size 1:
- Take the first Instruction, this is the one that is getting kept. Let's call it firstInstruction. Other should disappear from the graph
- Browse FieldsInOrder field of the Instruction. For each Field check whether all the instructions in the list have the same value. If so UseValue = true, otherwise UseValue = false
- Now the first instruction has been enriched with others fields. We also need to group successors and predecessors.
- Add all the successors of all the instructions in the list to a Set to deduplicate this. Convert the Set to a List, this is the list of successors for this instruction.
- Do the same for the predecessors.
- For each Instruction in the list, browse the predecessors, and replace the Instruction in the successors of the predecessor with firstInstruction

After this step, we should have a saner graph ready for next step.

Convert self modifying code selectors to actual nodes in the graph

We need to insert new nodes that represent the switch to do when a successor of an instruction is modified and we need to decide which path to take

// This is not an actual instruction, it's just a node to represent that we select between several instuctions
SelfModifiedCodeNode : CfgNode {
}

Here is what to do:

For each instruction (call it parent), build a Dictionary<SegmentedAddress, Dictionary<Discriminator, Instruction>> for the successors.
If a Dictionary<Discriminator, Instruction> has more than one value, it means those instructions need to be replaced with a SelfModifiedCodeNode:
Create the SelfModifiedCodeNode:
The address of the node is the common address for all those discriminators
The successors is the list of Instruction values for the Dictionary at this address
The predecessor is the Instruction beeing processed (parent)
add it to the parent successors list
Remove the Instructions that have been added to the SelfModifiedCodeNode as successors from parent successors list
Check each predecessor of the added Instructions. If they have all the instructions that are in the new SelfModifiedCodeNode successors, replace them with the new node and add the predecessor to the predecessors of this SelfModifiedCodeNode

Blocks

Blocks are groups of nodes that are linear (IE no more than one predecessor except for the entry and no more than one successor except for the exit)

Blocks are important because:

They will be directly translated to C# without any control flow
We can do optimization like removing intermediate values and flag calculations if they are erased lated in the block without beeing used
We can do good looking ifs for exit nodes if the conditionals depend on the flags and we know where in the node the flags have been modified (neat!)

class CfgBlock : CfgNode {
  // for goto target
  int id;
  List<CfgNode> Nodes {get;}
  // redefine Predecessors property to use the first item of Nodes predecessors and Successors the last item Successors
}

How to convert the graph of CfgNode to a graph of CfgBlock

Maintain a Dictionary<CfgNode, CfgBlock> (called encountered) to know what has been generated
Browse the graph of CfgBlock for each root
When finding a CfgNode that is not in encountered, create a CfgBlock that correspond to it and add it to encountered. the Id could just be the size of the encountered dictionary.
Add CfgBlock to the CfgNode, and do the same thing for its successors until a node with more than one predecessor / successor is encountered:
If a node with more than one predecessor is encountered, do not add it to the CfgBlock. The block is done.
If a node with more than one successors is encountered, add it. The block is done.
Do the same recursively for each exit node of the cfgblock that has been processed successors.

At the end we should have a grah of Root -> CfgBlock. Are we ready to generate some C#? Not yet.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly