|
| 1 | +DataFlowSanitizer Design Document |
| 2 | +================================= |
| 3 | + |
| 4 | +This document sets out the design for DataFlowSanitizer, a general |
| 5 | +dynamic data flow analysis. Unlike other Sanitizer tools, this tool is |
| 6 | +not designed to detect a specific class of bugs on its own. Instead, |
| 7 | +it provides a generic dynamic data flow analysis framework to be used |
| 8 | +by clients to help detect application-specific issues within their |
| 9 | +own code. |
| 10 | + |
| 11 | +DataFlowSanitizer is a program instrumentation which can associate |
| 12 | +a number of taint labels with any data stored in any memory region |
| 13 | +accessible by the program. The analysis is dynamic, which means that |
| 14 | +it operates on a running program, and tracks how the labels propagate |
| 15 | +through that program. The tool shall support a large (>100) number |
| 16 | +of labels, such that programs which operate on large numbers of data |
| 17 | +items may be analysed with each data item being tracked separately. |
| 18 | + |
| 19 | +Use Cases |
| 20 | +--------- |
| 21 | + |
| 22 | +This instrumentation can be used as a tool to help monitor how data |
| 23 | +flows from a program's inputs (sources) to its outputs (sinks). |
| 24 | +This has applications from a privacy/security perspective in that |
| 25 | +one can audit how a sensitive data item is used within a program and |
| 26 | +ensure it isn't exiting the program anywhere it shouldn't be. |
| 27 | + |
| 28 | +Interface |
| 29 | +--------- |
| 30 | + |
| 31 | +A number of functions are provided which will create taint labels, |
| 32 | +attach labels to memory regions and extract the set of labels |
| 33 | +associated with a specific memory region. These functions are declared |
| 34 | +in the header file ``sanitizer/dfsan_interface.h``. |
| 35 | + |
| 36 | +.. code-block:: c |
| 37 | +
|
| 38 | + /// Creates and returns a base label with the given description and user data. |
| 39 | + dfsan_label dfsan_create_label(const char *desc, void *userdata); |
| 40 | +
|
| 41 | + /// Sets the label for each address in [addr,addr+size) to \c label. |
| 42 | + void dfsan_set_label(dfsan_label label, void *addr, size_t size); |
| 43 | +
|
| 44 | + /// Sets the label for each address in [addr,addr+size) to the union of the |
| 45 | + /// current label for that address and \c label. |
| 46 | + void dfsan_add_label(dfsan_label label, void *addr, size_t size); |
| 47 | +
|
| 48 | + /// Retrieves the label associated with the given data. |
| 49 | + /// |
| 50 | + /// The type of 'data' is arbitrary. The function accepts a value of any type, |
| 51 | + /// which can be truncated or extended (implicitly or explicitly) as necessary. |
| 52 | + /// The truncation/extension operations will preserve the label of the original |
| 53 | + /// value. |
| 54 | + dfsan_label dfsan_get_label(long data); |
| 55 | +
|
| 56 | + /// Retrieves a pointer to the dfsan_label_info struct for the given label. |
| 57 | + const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label); |
| 58 | +
|
| 59 | + /// Returns whether the given label label contains the label elem. |
| 60 | + int dfsan_has_label(dfsan_label label, dfsan_label elem); |
| 61 | +
|
| 62 | + /// If the given label label contains a label with the description desc, returns |
| 63 | + /// that label, else returns 0. |
| 64 | + dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc); |
| 65 | +
|
| 66 | +Taint label representation |
| 67 | +-------------------------- |
| 68 | + |
| 69 | +As stated above, the tool must track a large number of taint |
| 70 | +labels. This poses an implementation challenge, as most multiple-label |
| 71 | +tainting systems assign one label per bit to shadow storage, and |
| 72 | +union taint labels using a bitwise or operation. This will not scale |
| 73 | +to clients which use hundreds or thousands of taint labels, as the |
| 74 | +label union operation becomes O(n) in the number of supported labels, |
| 75 | +and data associated with it will quickly dominate the live variable |
| 76 | +set, causing register spills and hampering performance. |
| 77 | + |
| 78 | +Instead, a low overhead approach is proposed which is best-case O(log\ |
| 79 | +:sub:`2` n) during execution. The underlying assumption is that |
| 80 | +the required space of label unions is sparse, which is a reasonable |
| 81 | +assumption to make given that we are optimizing for the case where |
| 82 | +applications mostly copy data from one place to another, without often |
| 83 | +invoking the need for an actual union operation. The representation |
| 84 | +of a taint label is a 16-bit integer, and new labels are allocated |
| 85 | +sequentially from a pool. The label identifier 0 is special, and means |
| 86 | +that the data item is unlabelled. |
| 87 | + |
| 88 | +When a label union operation is requested at a join point (any |
| 89 | +arithmetic or logical operation with two or more operands, such as |
| 90 | +addition), the code checks whether a union is required, whether the |
| 91 | +same union has been requested before, and whether one union label |
| 92 | +subsumes the other. If so, it returns the previously allocated union |
| 93 | +label. If not, it allocates a new union label from the same pool used |
| 94 | +for new labels. |
| 95 | + |
| 96 | +Specifically, the instrumentation pass will insert code like this |
| 97 | +to decide the union label ``lu`` for a pair of labels ``l1`` |
| 98 | +and ``l2``: |
| 99 | + |
| 100 | +.. code-block:: c |
| 101 | +
|
| 102 | + if (l1 == l2) |
| 103 | + lu = l1; |
| 104 | + else |
| 105 | + lu = __dfsan_union(l1, l2); |
| 106 | +
|
| 107 | +The equality comparison is outlined, to provide an early exit in |
| 108 | +the common cases where the program is processing unlabelled data, or |
| 109 | +where the two data items have the same label. ``__dfsan_union`` is |
| 110 | +a runtime library function which performs all other union computation. |
| 111 | + |
| 112 | +Further optimizations are possible, for example if ``l1`` is known |
| 113 | +at compile time to be zero (e.g. it is derived from a constant), |
| 114 | +``l2`` can be used for ``lu``, and vice versa. |
| 115 | + |
| 116 | +Memory layout and label management |
| 117 | +---------------------------------- |
| 118 | + |
| 119 | +The following is the current memory layout for Linux/x86\_64: |
| 120 | + |
| 121 | ++---------------+---------------+--------------------+ |
| 122 | +| Start | End | Use | |
| 123 | ++===============+===============+====================+ |
| 124 | +| 0x700000008000|0x800000000000 | application memory | |
| 125 | ++---------------+---------------+--------------------+ |
| 126 | +| 0x200200000000|0x700000008000 | unused | |
| 127 | ++---------------+---------------+--------------------+ |
| 128 | +| 0x200000000000|0x200200000000 | union table | |
| 129 | ++---------------+---------------+--------------------+ |
| 130 | +| 0x000000010000|0x200000000000 | shadow memory | |
| 131 | ++---------------+---------------+--------------------+ |
| 132 | +| 0x000000000000|0x000000010000 | reserved by kernel | |
| 133 | ++---------------+---------------+--------------------+ |
| 134 | + |
| 135 | +Each byte of application memory corresponds to two bytes of shadow |
| 136 | +memory, which are used to store its taint label. As for LLVM SSA |
| 137 | +registers, we have not found it necessary to associate a label with |
| 138 | +each byte or bit of data, as some other tools do. Instead, labels are |
| 139 | +associated directly with registers. Loads will result in a union of |
| 140 | +all shadow labels corresponding to bytes loaded (which most of the |
| 141 | +time will be short circuited by the initial comparison) and stores will |
| 142 | +result in a copy of the label to the shadow of all bytes stored to. |
0 commit comments