Provenance Tracking with Clam


Propagate tags (numerical identifiers) from many sources to many sinks.


The user defines sources and sinks (memory locations) via a configuration file (option --add-metadata-config):

read, 2, clam-prov-type:input
read, 2, clam-prov-size:3
write, 2, clam-prov-type:output
write, 2, clam-prov-size:3

This states that the second parameter of any call to read has clam-prov type input (i.e., it is a source) and the second parameter of any call to write has clam-prov type output (i.e., it is a sink). The clam-prov-size entries state that, for both read and write, the size of the second parameter is given by the third parameter. In general, a program has many sources and many sinks, since it can contain many calls to read and write.
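To make the structure of these entries concrete, the following Python sketch parses lines in the format shown above and groups the properties by function and parameter. It is not part of clam-prov: the name parse_metadata_config is hypothetical, and the assumptions that fields are comma-separated and that the third field has the form key:value are inferred from the example alone.

```python
from collections import defaultdict


def parse_metadata_config(text):
    """Parse 'function, param-index, key:value' lines (format inferred from the example)."""
    entries = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        func, index, prop = (field.strip() for field in line.split(",", 2))
        key, _, value = prop.partition(":")
        # Group properties by (function name, 1-based parameter index).
        entries[(func, int(index))][key] = value
    return dict(entries)


EXAMPLE = """\
read, 2, clam-prov-type:input
read, 2, clam-prov-size:3
write, 2, clam-prov-type:output
write, 2, clam-prov-size:3
"""

if __name__ == "__main__":
    for (func, index), props in parse_metadata_config(EXAMPLE).items():
        print(func, index, props)
```

Grouping by (function, parameter index) mirrors the reading given above: each pair carries a type (input or output) and the index of the parameter that holds its size.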

The analysis then assigns a unique numerical identifier (i.e., a tag) to each source and relies on the Clam static analyzer to propagate those tags across memory and function boundaries. The output of the analysis maps each sink to the set of all tags that may reach it, which connects each sink back to a subset of the sources. Currently, this output is encoded as metadata named clam-prov-tags. For instance:

%res = call i64 @write(i32 %param, i8* %param1, i64 %param2), !call-site-metadata !7, !clam-prov-tags !9
!7 = !{!"4", !8}
!8 = !{!"2", !"clam-prov-type:output", !"clam-prov-size:3"}
!9 = !{i64 2}

This says that the second parameter of the call to write (identifier 4) is tagged with the tag set {2}.
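The reading above can be reproduced mechanically from the textual IR. The Python sketch below is an illustration only, not a clam-prov tool: the function name sink_tags is hypothetical, and it assumes that the first string in the call-site-metadata node is the call-site identifier and that the clam-prov-tags node lists tags as i64 constants, as in the example.

```python
import re

IR = r'''%res = call i64 @write(i32 %param, i8* %param1, i64 %param2), !call-site-metadata !7, !clam-prov-tags !9
!7 = !{!"4", !8}
!8 = !{!"2", !"clam-prov-type:output", !"clam-prov-size:3"}
!9 = !{i64 2}
'''


def sink_tags(ir_text):
    """Map call-site identifier -> set of propagated tags, from textual LLVM IR."""
    # Collect metadata node definitions: node number -> text between the braces.
    nodes = {
        m.group(1): m.group(2)
        for m in re.finditer(r"^!(\d+) = !\{(.*)\}$", ir_text, re.MULTILINE)
    }
    result = {}
    for line in ir_text.splitlines():
        site = re.search(r"!call-site-metadata !(\d+)", line)
        tags = re.search(r"!clam-prov-tags !(\d+)", line)
        if not (site and tags):
            continue  # not an instruction carrying both attachments
        # Assumption: the first quoted number in the node is the call-site identifier.
        site_id = int(re.search(r'!"(\d+)"', nodes[site.group(1)]).group(1))
        result[site_id] = {int(t) for t in re.findall(r"i64 (\d+)", nodes[tags.group(1)])}
    return result


if __name__ == "__main__":
    print(sink_tags(IR))
```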


To run the tests, the following are required:

  • lit: sudo pip install lit
  • cmp utility

Compilation and Installation


Checking installation

To run some regression tests:

 cmake --build . --target test-all

Usage

 test.c -o test.prov.bc

By default, the file config/sources-sinks.config from the install directory is used to determine the sources and sinks. The output bitcode file test.prov.bc contains call-site-metadata metadata indicating the tags assigned to both sources (e.g., read calls) and sinks (e.g., write calls).

To use a different set of sources and sinks, pass a configuration file with --add-metadata-config:

 test.c --add-metadata-config=addMetadata.config -o test.prov.bc

Output Propagated Tags

Alternatively, the same information provided by the LLVM metadata call-site-metadata can be written to a file in DOT format. The tags propagated from sources to sinks are written using the argument dependency-map-file as follows:

 test.c --add-metadata-config=addMetadata.config

The following is an example output file:

digraph clam_prov_dependency_map{
"0" [label="function name:read\ncall site:0"];
"1" [label="function name:read\ncall site:1"];
"2" [label="function name:write\ncall site:2"];
"2" -> "0" [label="WasDependentOn"];
"2" -> "1" [label="WasDependentOn"];
}

The output above says the following:

  • First call-site read has the call site tag 0
  • Second call-site read has the call site tag 1
  • Third call-site write has the call site tag 2
  • The third call-site (write) has the propagated call site tags 0 and 1.
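The same reading can be obtained programmatically. The Python sketch below extracts the nodes and the WasDependentOn edges from a dependency map with regular expressions. It is a minimal illustration that assumes one statement per line, exactly as in the example above; the name parse_dependency_map is hypothetical, and a general DOT file would need a real DOT parser.

```python
import re
from collections import defaultdict

DOT = r'''digraph clam_prov_dependency_map{
"0" [label="function name:read\ncall site:0"];
"1" [label="function name:read\ncall site:1"];
"2" [label="function name:write\ncall site:2"];
"2" -> "0" [label="WasDependentOn"];
"2" -> "1" [label="WasDependentOn"];
}
'''


def parse_dependency_map(dot_text):
    """Return (call site -> function name, sink call site -> set of source call sites)."""
    functions = {}
    depends_on = defaultdict(set)
    for line in dot_text.splitlines():
        edge = re.match(r'\s*"(\d+)"\s*->\s*"(\d+)"', line)
        if edge:
            # Edge direction is sink -> source ("WasDependentOn").
            depends_on[int(edge.group(1))].add(int(edge.group(2)))
            continue
        node = re.match(r'\s*"(\d+)"\s*\[label="function name:(\w+)', line)
        if node:
            functions[int(node.group(1))] = node.group(2)
    return functions, dict(depends_on)


if __name__ == "__main__":
    functions, depends_on = parse_dependency_map(DOT)
    for sink, sources in depends_on.items():
        print(functions[sink], sink, "depends on", sorted(sources))
```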

Log call-sites (Linux)

Logging can be added to a Linux program to emit call-sites using the argument add-logging-config as follows:

 clang -Xclang -disable-O0-optnone -c -emit-llvm test.c -o test.bc
 clam-pp --crab-devirt test.bc -o test.pp.bc
 clam-prov test.pp.bc --add-metadata-config=addMetadata.config --add-logging-config=call-site-logging.config -o test.out.pp.bc

The above specifies the file call-site-logging.config, which configures how call-sites are logged when the program is executed. The configuration must have the following keys:

  • output_mode - Whether to write to a file (at ~/.clam-prov/audit.log) or to a pipe (at ~/.clam-prov/audit.pipe). Specify 0 to write to the file, or specify 1 to write to the pipe
  • max_records - The maximum call-site records to buffer before writing to the file or the pipe

The output is written as a series of records in binary format. Each record contains the following fields in the given order:

  • time in milliseconds expressed as an unsigned long (8 bytes)
  • process id expressed as an integer (4 bytes)
  • call site tag expressed as a signed long (8 bytes)
  • function return value expressed as a signed long (8 bytes)
  • name of the function expressed as a char array (256 bytes)

The source file CallSiteLogReader.c demonstrates how to read the call site log file.
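For a quick look at a log without compiling C code, the record layout listed above can also be decoded with Python's struct module. This is a sketch under stated assumptions, not a replacement for CallSiteLogReader.c: it assumes little-endian byte order and that the fields are packed back to back with no padding (8 + 4 + 8 + 8 + 256 = 284 bytes). If the logger writes a raw C struct, there may be 4 padding bytes after the process id; in that case a format such as "<Qi4xqq256s" would be needed instead. The names Record and read_records are hypothetical.

```python
import struct
from collections import namedtuple

# Assumption: little-endian, packed fields, no padding between them.
# Q = unsigned 8-byte time, i = 4-byte pid, q = signed 8-byte call site tag,
# q = signed 8-byte return value, 256s = fixed-size function name.
RECORD_FORMAT = "<Qiqq256s"

Record = namedtuple("Record", "time_ms pid call_site return_value function")


def read_records(data, fmt=RECORD_FORMAT):
    """Decode consecutive fixed-size records from a bytes object."""
    size = struct.calcsize(fmt)
    records = []
    # Step through whole records only; a trailing partial record is ignored.
    for offset in range(0, len(data) - size + 1, size):
        time_ms, pid, site, ret, name = struct.unpack_from(fmt, data, offset)
        # Keep only the bytes before the first NUL of the char array.
        name = name.split(b"\0", 1)[0].decode("ascii", "replace")
        records.append(Record(time_ms, pid, site, ret, name))
    return records


if __name__ == "__main__":
    # Synthetic record for demonstration; the values are made up.
    sample = struct.pack(RECORD_FORMAT, 1700000000000, 1234, 2, 3, b"write")
    for record in read_records(sample):
        print(record)
```

To decode a real log, read the file written in output mode 0 (~/.clam-prov/audit.log) in binary mode and pass its contents to read_records, after checking the layout assumptions against CallSiteLogReader.c.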

To generate an executable that logs call-sites from test.out.pp.bc (above), the clamprovlogger shared library must be linked as follows:

llc -relocation-model=pic test.out.pp.bc -o test.out.pp.s 
gcc -L./install/lib test.out.pp.s -o test.out.pp.native -lclamprovlogger

Finally, the generated executable can be executed as:

LD_LIBRARY_PATH=./install/lib ./test.out.pp.native