In [None]:
# PANDA Tutorial
This notebook covers basic usage of PANDA: the Platform for Architecture-Neutral Dynamic Analysis.

In [1]:
'''
Install pandare and initialize a PANDA object.
The 'generic' argument will fetch (and cache) an i386 Debian (wheezy) virtual machine for our analysis
'''
!pip install pandare

from pandare import Panda, blocking
panda = Panda(generic='i386_wheezy')

print("Loaded PANDA")

using generic i386_wheezy
Loading libpanda from /home/andrew/git/panda/build
[PYPANDA] Panda args: [/home/andrew/git/panda/build/i386-softmmu/libpanda-i386.so -L /home/andrew/git/panda/build/pc-bios /home/andrew/.panda/wheezy_panda2.qcow2 -display none -m 128M -serial unix:/tmp/pypanda_s5m0397a9,server,nowait -monitor unix:/tmp/pypanda_m0m_41rib,server,nowait]
Loaded PANDA


## Controlling the Guest Virtual Machine

Before we do any sort of analysis, let's figure out how to drive
the execution of the guest. We do this in a "blocking" function
(named after the fact that it will block on guest behavior).
First we revert to a snapshot named "root" which we've created
in all the _generic_ virtual machines. This snapshot was taken
just after logging in to a system.

Then we'll execute a command by typing into a serial console,
and pressing enter. panda.run_serial_cmd will capture all the
output printed until the next bash prompt is printed.

In particular, we'll run `file` on `ls`.

Then we run the guest. The asynchronous `run_commands` function
will already be queued up to run (thanks to its decorator).

In [3]:
@panda.queue_blocking
def run_commands():
    print("Running commands...")
    panda.revert_sync("root")
    print("Results: ", panda.run_serial_cmd("file /bin/ls"))
    # Now we're done with our analysis so we want panda.run() to unblock - 
    # we have to tell panda that we're done to make that happen
    panda.end_analysis()

panda.run()

Running commands...
Results:  /bin/ls: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0xd3280633faaabf56a14a26693d2f810a32222e51, stripped
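
The same pattern extends to several commands. The following is a minimal sketch rather than part of the original notebook: `make_command_runner` and `results` are hypothetical names, and the helper relies only on the calls already shown above (`revert_sync`, `run_serial_cmd`, `end_analysis`).

```python
def make_command_runner(panda, commands, results):
    """Build a blocking function that runs each command from the "root" snapshot.

    Hypothetical helper for illustration; `results` is a dict filled in place.
    """
    def run_all():
        panda.revert_sync("root")              # start from the logged-in snapshot
        for cmd in commands:
            # run_serial_cmd returns the text printed before the next prompt
            results[cmd] = panda.run_serial_cmd(cmd)
        panda.end_analysis()                   # let panda.run() return
    return run_all
```

Assuming the `panda` object created earlier, passing the returned function to `panda.queue_blocking(...)` and then calling `panda.run()` should fill `results` with one entry per command.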


## Introspection
Now let's do some basic introspection. While we're typing the
`file` command and waiting for output, the guest system is running
a bunch of code. Let's capture the address of every unique basic block.

We'll do this by defining a "PANDA callback" which will be run
whenever a specific event occurs. In this case, we want our code
to run before each basic block executes.

Then we'll queue up `run_commands` to run again (since it already ran, it's no longer in the queue) and run the analysis again.

In [4]:
blocks = set()

@panda.cb_before_block_exec
def simple_before_block_callback(cpu, tb):
    pc = panda.current_pc(cpu)
    blocks.add(pc)

panda.queue_async(run_commands)
panda.run()
print("Total block count: ", len(blocks))


Running commands...
Results:  /bin/ls: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.26, BuildID[sha1]=0xd3280633faaabf56a14a26693d2f810a32222e51, stripped
Total block count:  15207


More than fifteen thousand unique blocks were run while we were running a simple command.
What's going on in this system?
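
One quick way to get a first impression is to split the collected addresses into user space and kernel space. The sketch below assumes the conventional 3 GB/1 GB split of 32-bit x86 Linux, where kernel code lives at or above `0xC0000000`. That boundary is an assumption about this guest, and `split_blocks` is a hypothetical helper name.

```python
KERNEL_BASE = 0xC0000000  # assumed: default user/kernel split on 32-bit x86 Linux

def split_blocks(block_addrs, kernel_base=KERNEL_BASE):
    """Partition basic-block addresses into (user, kernel) sets."""
    user = {pc for pc in block_addrs if pc < kernel_base}
    kernel = {pc for pc in block_addrs if pc >= kernel_base}
    return user, kernel

# Made-up addresses for illustration; in the notebook you would pass `blocks`.
user, kernel = split_blocks({0x08048000, 0xB7701234, 0xC1001000})
print(f"user: {len(user)}, kernel: {len(kernel)}")
```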


## Operating-System Introspection
Let's use PANDA's operating-system introspection (`OSI`)
plugin to learn about what the system's doing.

But first, we'll disable our previously-registered callback
since we no longer want it to run.

We'll also need a handle to the foreign-function
interface (`ffi`) used by PANDA here (this will change in
a future version of PANDA).

Then we'll set up a different callback to run whenever
the ASID (address-space identifier) changes, which typically means a different process has been scheduled. In this callback, we'll use the `OSI` plugin
to get the current process name.

In [5]:
from pandare import ffi

panda.disable_callback('simple_before_block_callback')

# Load operating-system introspection plugin
panda.load_plugin("osi")

procs = set()
@panda.cb_asid_changed
def asid_changed_procnames(cpu, old_asid, new_asid):
    # Try to get the current process
    proc = panda.plugins['osi'].get_current_process(cpu) 
    if proc != ffi.NULL:
        name = ffi.string(proc.name).decode()
        procs.add(name)
    # PANDA supports actively modifying the guest behavior.
    # For example, you can prevent an ASID change. But we don't
    # want to do that so we return 0
    return 0

# Again, we queue up our blocking run_commands function to run
panda.queue_async(run_commands)
panda.run()

print("Processes observed:")
for name in procs:
    print(name)


Clearly multiple processes are running - that's usually the case when working with
a whole-system analysis platform like PANDA.

But let's focus in on `file` and try learning more about it.

Again, we'll disable our prior callback since we're done with that.

Then we'll load the `syscalls2` plugin, which provides an additional set of
PANDA plugin-to-plugin (`ppp`) callbacks. With `syscalls2` we can register
a callback to run whenever a specific syscall is issued from userspace. For example, when working with the `read` system call,
a callback registered on `on_sys_read_enter` will run after userspace issues a `read` system call, but before the kernel processes it.
On the other hand, a callback registered on `on_sys_read_return` will run after the kernel has finished processing the system call, but before userspace has had a chance
to continue execution.
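
To make the enter/return distinction concrete, here is a hedged sketch that registers both callbacks for `read`. `register_read_hooks` and `events` are hypothetical names. The sketch assumes that `syscalls2` is already loaded and that the callback arguments mirror `read(2)` (`fd`, `buf`, `count`) after the usual `cpu` and `pc`; check the `syscalls2` documentation for your guest before relying on that.

```python
def register_read_hooks(panda, events):
    """Register enter and return callbacks for the read syscall.

    Hypothetical helper: appends ("enter" | "return", fd, count) to `events`.
    Assumes syscalls2 is loaded and the arguments mirror read(2).
    """
    @panda.ppp("syscalls2", "on_sys_read_enter")
    def read_enter(cpu, pc, fd, buf, count):
        # Runs after userspace issues read, before the kernel handles it
        events.append(("enter", fd, count))

    @panda.ppp("syscalls2", "on_sys_read_return")
    def read_return(cpu, pc, fd, buf, count):
        # Runs after the kernel finishes, before userspace resumes
        events.append(("return", fd, count))
```

For a single `read` call you would expect an `enter` event followed by a matching `return` event, although events from different processes can interleave in a whole-system trace.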

Using `syscalls2` we'll see what files are opened by the `file` process.

In [5]:
panda.disable_callback('asid_changed_procnames')

panda.load_plugin("syscalls2")

# Register a PANDA Plugin-to-Plugin callback
# Arguments are described in syscalls2 documentation
# generally cpu, pc, [syscall args, ...]
@panda.ppp("syscalls2", "on_sys_open_enter")
def syscalls_file_open(cpu, pc, fname_ptr, flags, mode):
    print("OPEN")
    proc = panda.plugins['osi'].get_current_process(cpu) 
    if proc == ffi.NULL:
        print("NULL")
        return
    name = ffi.string(proc.name).decode()
    print(name)
    if name == 'file':
        fname = panda.read_str(cpu, fname_ptr)
        print(fname)

panda.queue_async(run_commands)
panda.run()

Running commands...


## Taint Analysis - Setup

Let's figure out how the contents of the magic file affect `file`'s behavior. To do this, we'll use a dynamic taint analysis.

However, dynamic taint analyses are slow and we wouldn't want to slow down the guest so much that
it changes its behavior. To avoid this, we're going to capture a `PANDA recording` of the guest executing our commands
and then conduct our analysis on that recording.

In [3]:

@panda.queue_blocking
def take_recording():
    panda.record_cmd("file /bin/ls", recording_name="file_ls")
    panda.end_analysis()

panda.run()

Finished recording


## Taint Analysis

Our recording has placed two files on disk - `file_ls-rr-nondet.log` and `file_ls-rr-snp`. Together these files compactly and precisely
capture everything the guest was doing while we were running our command. Now we'll run that same replay with a taint analysis.
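
Before replaying, it can be worth confirming that both files are present. This is a small sketch using only the standard library. The file-name pattern comes from the two names mentioned above, while `recording_files` and `missing_recording_files` are hypothetical helper names, and the current directory is assumed to be where the recording was written.

```python
from pathlib import Path

def recording_files(name, directory="."):
    """Return the two paths that make up a recording called `name`."""
    base = Path(directory)
    return [base / f"{name}-rr-nondet.log", base / f"{name}-rr-snp"]

def missing_recording_files(name, directory="."):
    """List the expected recording files that do not exist."""
    return [p for p in recording_files(name, directory) if not p.exists()]

missing = missing_recording_files("file_ls")
if missing:
    print("Missing:", ", ".join(str(p) for p in missing))
else:
    print("Recording file_ls is complete")
```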

At a high level, a taint analysis is simply tracking how data flow from a source into the rest of a system. In this example, we will label
an input file as tainted and then use PANDA's taint system: `taint2` to track where those data go.

There are three types of plugins in PANDA's taint ecosystem:
1) The core taint system (the `taint2` plugin). This tracks how tainted data flow through a system. When tainted data is copied or computed on, the outputs are labeled as tainted.
2) Taint-labeling plugins (`file_taint`, `tainted_net`, `tainted_mmio`): these plugins apply taint labels to data which the taint system then tracks.
3) Taint-querying plugins (`tainted_branch`, `tainted_instr`): these plugins track when tainted data are used.

In general, if you're doing a taint analysis with PANDA, you'll want to use one plugin from each of these categories.
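
The three roles can be illustrated with a toy model in plain Python. This is only a conceptual sketch: `ToyTaint` is a made-up class that tracks labels on named locations, and it does not reflect how `taint2` is implemented or what its API looks like.

```python
class ToyTaint:
    """Toy model: maps a location name to the set of labels it carries."""

    def __init__(self):
        self.labels = {}

    def label(self, loc, tag):
        # Role of a taint-labeling plugin: mark a source as tainted
        self.labels.setdefault(loc, set()).add(tag)

    def propagate(self, dst, *srcs):
        # Role of the core taint system: dst = f(srcs), so dst carries
        # the union of the labels on its inputs
        merged = set()
        for src in srcs:
            merged |= self.labels.get(src, set())
        self.labels[dst] = merged

    def query(self, loc):
        # Role of a taint-querying plugin: ask which labels reach a location
        return self.labels.get(loc, set())

taint = ToyTaint()
taint.label("buf[0]", "magic.mgc")      # data read from the input file
taint.propagate("eax", "buf[0]")        # copy: eax = buf[0]
taint.propagate("ebx", "eax", "ecx")    # compute: ebx = eax + ecx
taint.propagate("edx", "ecx")           # only untainted inputs
print(taint.query("ebx"), taint.query("edx"))
```

Here `ebx` ends up tainted because one of its inputs was derived from the labeled byte, while `edx` stays clean.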

### Replaying a recording

So now that we have a recording, we need to get the system to re-execute the same behavior.

In [None]:
# Queue up the replay to run
panda.run_replay("file_ls")

# Actually run it
panda.run()

## Taint Analysis: Labeling and Tracking

Now let's label data in the magic file as tainted and identify where tainted branches in the `file` process are. In other words, we're going to find the addresses in `file` where there are branches that are controlled by data in this input file.

Because there will be a lot of data, we'll want to store the results efficiently. Fortunately, the `tainted_branch` plugin (among others) writes its results to a `pandalog`, a binary format used by PANDA plugins. In fact, this is the only output format it supports.

In [None]:
panda.load_plugin("taint2")
panda.load_plugin("file_taint", args={"filename": "/usr/share/misc/magic.mgc"})
panda.load_plugin("tainted_branch")

# Store results in output.plog
panda.set_pandalog("output.plog")

# Queue up the replay to run
panda.run_replay("file_ls")
# Actually run it
panda.run()

## Taint Analysis: Analyze Results

The pandalog (also called a plog) is an efficient way to store data, but we actually want to see the results. Let's use PANDA's `PLogReader` to see what's in it.

In [None]:
from pandare.plog_reader import PLogReader
from google.protobuf.json_format import MessageToJson

with PLogReader("output.plog") as plr:
    for i, m in enumerate(plr):
        if i > 0: print(',')
        print(MessageToJson(m), end='')
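
Once the entries are available as JSON, ordinary Python tooling can summarize them. The sketch below counts how often each top-level field appears. `count_entry_fields` is a hypothetical helper, and the sample entries are made up for illustration, since the real fields depend on which plugins wrote to the pandalog.

```python
import json
from collections import Counter

def count_entry_fields(json_entries):
    """Count how often each top-level field appears across pandalog entries."""
    counts = Counter()
    for text in json_entries:
        counts.update(json.loads(text).keys())
    return counts

# Made-up entries; in the notebook you would collect MessageToJson(m)
# for each message into a list and pass that list in instead.
sample = [
    '{"pc": "1", "instr": "10", "exampleField": {}}',
    '{"pc": "2", "instr": "20"}',
]
print(count_entry_fields(sample).most_common())
```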