Trigram database written in C++, suited for malware indexing
Type Name Latest commit message Commit time
contrib/systemd initial public release May 21, 2018
docs enhance README.md May 26, 2018
lib initial public release May 21, 2018
test a little bit better tests Jan 7, 2019
.gitlab-ci.yml adjust continous delivery scripts Jan 10, 2019
Benchmark.cpp make use of both flat and bitmap indexing modes May 31, 2018
BitmapIndexBuilder.cpp proper(?) exceptions when there are write problems Jan 31, 2019
BitmapIndexBuilder.h
CMakeLists.txt properly split implementations of Bitmap/Flat builders May 31, 2018
Command.h yet another approach to fix default_index_types Jan 7, 2019
Core.h properly split implementations of Bitmap/Flat builders May 31, 2018
Daemon.cpp return client's connection_id in response to ping command Feb 1, 2019
Daemon.h initial public release May 21, 2018
Database.cpp proper(?) exceptions when there are write problems Jan 31, 2019
Database.h initial public release May 21, 2018
DatabaseHandle.cpp initial public release May 21, 2018
DatabaseHandle.h initial public release May 21, 2018
DatabaseSnapshot.cpp proper(?) exceptions when there are write problems Jan 31, 2019
DatabaseSnapshot.h proper(?) exceptions when there are write problems Jan 31, 2019
DatasetBuilder.cpp make use of both flat and bitmap indexing modes May 31, 2018
DatasetBuilder.h make use of both flat and bitmap indexing modes May 31, 2018
Dockerfile implement some basic end-to-end tests Jan 7, 2019
ExclusiveFile.cpp
ExclusiveFile.h initial public release May 21, 2018
FlatIndexBuilder.cpp proper(?) exceptions when there are write problems Jan 31, 2019
FlatIndexBuilder.h
IndexBuilder.h fix memory leaks during indexing (ooouch!) Jan 9, 2019
Indexer.cpp make use of both flat and bitmap indexing modes May 31, 2018
Indexer.h make use of both flat and bitmap indexing modes May 31, 2018
Json.h initial public release May 21, 2018
LICENSE initial public release May 21, 2018
MemMap.cpp initial public release May 21, 2018
MemMap.h initial public release May 21, 2018
NetworkService.cpp change log messages a little bit Jan 5, 2019
NetworkService.h
NewDatabase.cpp initial public release May 21, 2018
OnDiskDataset.cpp fix bug which caused internal_pick_common to segfault sometimes Jan 7, 2019
OnDiskDataset.h add tests for internal_pick_common Jan 7, 2019
OnDiskIndex.cpp report topology size in response to topology command Jan 31, 2019
OnDiskIndex.h report topology size in response to topology command Jan 31, 2019
Query.cpp implement 'min N of (...)' operator corresponding to yara's 'N of them' Jan 5, 2019
Query.h implement 'min N of (...)' operator corresponding to yara's 'N of them' Jan 5, 2019
QueryParser.cpp fix problem with default_index_types being partial deadcode Jan 7, 2019
QueryParser.h initial public release May 21, 2018
README.md Update README.md Jan 8, 2019
Responses.cpp
Responses.h return client's connection_id in response to ping command Feb 1, 2019
Task.cpp initial public release May 21, 2018
Task.h initial public release May 21, 2018
Tests.cpp
Utils.cpp proper(?) exceptions when there are write problems Jan 31, 2019
Utils.h proper(?) exceptions when there are write problems Jan 31, 2019
ZHelpers.cpp initial public release May 21, 2018
ZHelpers.h initial public release May 21, 2018
entrypoint.sh adjust Dockerfile to create empty database if it does not exist on th… Oct 1, 2018

README.md

UrsaDB

A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).

Created in CERT.PL. Originally by Jarosław Jedynak (tailcall.net), extended and improved by Michał Leszczyński.

How does it work?

gram3 index

UrsaDB uses a few slightly different methods of indexing files, with the gram3 index as the most basic concept.

When the database is about to create a gram3 index for a given file, it extracts all possible three-byte combinations from it. An index is a big map of: 3gram => list of files which contain it.

For instance, if we indexed a text file containing the ASCII string TEST MALWARE (ASCII: 54 45 53 54 20 4D 41 4C 57 41 52 45), the database would generate the following trigrams (_ denotes the space character):

# Substring Trigram
0 TES 544553
1 EST 455354
2 ST_ 535420
3 T_M 54204D
4 _MA 204D41
5 MAL 4D414C
6 ALW 414C57
7 LWA 4C5741
8 WAR 574152
9 ARE 415245

Since the index maps each trigram to a list of files, the new file is added to the list of every trigram shown above.
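The extraction step can be sketched in a few lines of Python. This is only an illustration of the idea, not UrsaDB's C++ implementation; the names trigrams and add_to_index, and the in-memory dict used as the index, are hypothetical:

```python
from collections import defaultdict


def trigrams(data: bytes) -> set:
    """Return every 3-byte window of `data`, packed into a 24-bit integer."""
    return {
        int.from_bytes(data[i:i + 3], "big")
        for i in range(len(data) - 2)
    }


def add_to_index(index, name: str, data: bytes) -> None:
    """Record `name` in the file list of every trigram found in `data`."""
    for gram in trigrams(data):
        index[gram].add(name)


# Hypothetical in-memory index: trigram => set of file names.
index = defaultdict(set)
add_to_index(index, "sample.txt", b"TEST MALWARE")

# Print the trigrams of the example string as 6-digit hex values.
for gram in sorted(trigrams(b"TEST MALWARE")):
    print(f"{gram:06X}")
```

A 12-byte input yields 10 windows, and inputs shorter than three bytes yield none.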

gram3 queries

When querying for the string TEST MALWARE, the database looks up the trigram index to determine which files contain the sequence 544553, then which files contain 455354, and so on up to 415245. These partial results are ANDed together, and the resulting set (a list of probably matching files) is returned.

The drawing presents how trigrams are mapped to file contents.

This search technique may sometimes yield false positives, but it never yields false negatives. It is therefore well suited to quick filtering (see the mquery project, where UrsaDB is used to accelerate malware searching).
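A toy model of the query step, assuming a plain Python dict as the index (all names and file contents below are made up for illustration). It shows why a candidate may be a false positive: decoy.bin contains both trigrams of "abcd" but not the string itself, so a verification pass over the candidates is still needed:

```python
def trigrams(data: bytes) -> set:
    """All 3-byte windows of `data`."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def build_index(files: dict) -> dict:
    """Build trigram => set of file names for an in-memory corpus."""
    index = {}
    for name, content in files.items():
        for gram in trigrams(content):
            index.setdefault(gram, set()).add(name)
    return index


def candidates(index: dict, needle: bytes) -> set:
    """AND together the file lists of every trigram in `needle`."""
    grams = trigrams(needle)
    if not grams:
        raise ValueError("needle must be at least 3 bytes long")
    result = None
    for gram in grams:
        posting = index.get(gram, set())
        result = set(posting) if result is None else result & posting
    return result


files = {
    "hit.bin": b"xx abcd xx",
    "decoy.bin": b"abc__bcd",      # both trigrams present, string absent
    "miss.bin": b"nothing here",
}
index = build_index(files)
found = candidates(index, b"abcd")
# Second pass: confirm candidates against the real file contents.
verified = {name for name in found if b"abcd" in files[name]}
print(sorted(found), sorted(verified))
```

Every file that really contains the needle necessarily contains all of its trigrams, which is why the filter cannot miss a true match.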

text4 index

String literals are very common in binaries. Thus, it's useful to have a specialized index for ASCII characters.

In the text4 index, ASCII characters are packed in a manner similar to the base64 algorithm, so that four characters fit into the three bytes of a single trigram.

Note that such an index cannot answer queries containing non-ASCII bytes, so it should be combined with at least a gram3 index.
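A minimal sketch of the packing idea, assuming a 64-symbol alphabet so that each character takes 6 bits and four characters fill 24 bits. The alphabet (borrowed from base64 purely for illustration), the bit order and the name text4_grams are assumptions; the character table UrsaDB actually uses may differ:

```python
import string

# Assumed 64-symbol alphabet (the base64 one); not necessarily UrsaDB's table.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
CODE = {ord(ch): i for i, ch in enumerate(ALPHABET)}


def text4_grams(data: bytes) -> set:
    """Pack each run of four alphabet characters into one 24-bit value."""
    grams = set()
    for i in range(len(data) - 3):
        window = data[i:i + 4]
        # Windows containing a byte outside the alphabet are not indexed.
        if all(b in CODE for b in window):
            value = 0
            for b in window:
                value = (value << 6) | CODE[b]   # 6 bits per character
            grams.add(value)
    return grams


print(len(text4_grams(b"HelloWorld")))
```

Windows that contain any byte outside the alphabet produce nothing here, which mirrors why a gram3 index is still needed alongside text4.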

wide8

Because searching for UTF-16 strings is also useful, there is a special index which works similarly to text4. In this case, ASCII characters interleaved with zero bytes are decoded.
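The same idea can be sketched for eight-byte windows, assuming the window holds four characters each followed by a zero byte (UTF-16LE). As with the text4 sketch, the alphabet, bit order and the name wide8_grams are illustrative assumptions rather than UrsaDB's actual encoding:

```python
import string

# Assumed 64-symbol alphabet (the base64 one); not necessarily UrsaDB's table.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
CODE = {ord(ch): i for i, ch in enumerate(ALPHABET)}


def wide8_grams(data: bytes) -> set:
    """Pack four zero-interleaved characters (8 bytes) into one 24-bit value."""
    grams = set()
    for i in range(len(data) - 7):
        window = data[i:i + 8]
        chars, zeros = window[0::2], window[1::2]
        # Every second byte must be zero and every character must be packable.
        if any(zeros) or not all(b in CODE for b in chars):
            continue
        value = 0
        for b in chars:
            value = (value << 6) | CODE[b]
        grams.add(value)
    return grams


print(len(wide8_grams("TESTS".encode("utf-16-le"))))
```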

hash4

Yet another type of index is hash4, which creates trigrams based on hashes of 4-byte sequences in the source file.
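A rough sketch of the hashing idea. The README does not say which hash function is used, so CRC32 truncated to 24 bits stands in here as an assumption, and hash4_grams is a hypothetical name:

```python
import zlib


def hash4_grams(data: bytes) -> set:
    """Hash every 4-byte window down to a 24-bit, trigram-sized value."""
    return {
        zlib.crc32(data[i:i + 4]) & 0xFFFFFF   # CRC32 is a stand-in hash
        for i in range(len(data) - 3)
    }


haystack = hash4_grams(b"TEST MALWARE")
needle = hash4_grams(b"LWAR")
print(needle <= haystack)
```

Because distinct 4-byte sequences can collide after truncation, such an index can add false positives but still cannot miss a real match.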

Full package installation

This repository contains only the UrsaDB project (3gram database). For instructions on how to set up the whole mquery system, see CERT-Polska/mquery.

Installation (Docker way)

A Docker image can be built by running docker build -t ursadb . in the source tree pulled from this repo.

Installation (standard way)

  1. Compile from sources:
$ mkdir build
$ cd build
$ cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
$ make
  2. Deploy the output binaries (ursadb, ursadb_new) to an appropriate place, e.g.:
# cp ursadb ursadb_new /usr/local/bin/
  3. Create a new database:
$ mkdir /opt/ursadb
$ ursadb_new /opt/ursadb/db.ursa
  4. Run the UrsaDB server:
$ ursadb /opt/ursadb/db.ursa
  5. (Optional) Consider registering UrsaDB as a systemd service:
cp contrib/systemd/ursadb.service /
systemctl enable ursadb

Usage

You can interact with the database using ursadb-cli (maintained in a separate repository).

Queries

Indexing

Directly provided paths

A filesystem path can be indexed using the index command:

index "/opt/something";

or multiple paths at once:

index "/opt/something" "/opt/foobar";

By default, the gram3 index is used. Index types may also be specified manually:

index "/opt/something" with [gram3, text4, hash4, wide8];

Paths in a list file

For convenience it's also possible to make UrsaDB read a file containing a list of targets to be indexed, each one separated by a newline.

index from list "/tmp/file-list.txt";

or

index from list "/tmp/file-list.txt" with [gram3, text4, hash4, wide8];

where example contents of /tmp/file-list.txt are:

/opt/something
/opt/foobar
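Such a list file can be generated with a short script. The sketch below is one possible approach, not part of UrsaDB: it writes one absolute file path per line and builds the matching command string. The helper names write_file_list and index_command are hypothetical:

```python
import os


def write_file_list(root: str, list_path: str) -> int:
    """Write the absolute path of every file under `root`, one per line."""
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                full = os.path.abspath(os.path.join(dirpath, filename))
                out.write(full + "\n")
                count += 1
    return count


def index_command(list_path: str, index_types=("gram3",)) -> str:
    """Build an 'index from list' command in the syntax shown above."""
    return f'index from list "{list_path}" with [{", ".join(index_types)}];'
```

For example, write_file_list("/opt/something", "/tmp/file-list.txt") followed by index_command("/tmp/file-list.txt", ["gram3", "text4"]) produces the list file and the command to send to the database. Keep the list file outside the directory being walked so that it does not list itself.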

Select

Strings ("primitives")

Select queries can use ordinary strings, hex strings and wide strings.

Query for ASCII bytes abc:

select "abc";

The same query with hex string notation:

select {616263};

Query for wide string abc (the same as {610062006300}):

select w"abc";
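The relationship between the three notations can be checked with a small helper. This is only an illustration (as_hex_string is a hypothetical name, not part of UrsaDB); it assumes wide strings are little-endian UTF-16, which matches the {610062006300} example above:

```python
def as_hex_string(text: str, wide: bool = False) -> str:
    """Render `text` in {hex} notation; wide=True mimics the w"..." form."""
    raw = text.encode("utf-16-le" if wide else "ascii")
    return "{" + raw.hex() + "}"


print(as_hex_string("abc"))             # {616263}
print(as_hex_string("abc", wide=True))  # {610062006300}
```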

Logical operators

Elements can be AND-ed:

select "abc" & "bcd";

and OR-ed:

select "abc" | "bcd";

Queries may also use parentheses:

select ("abc" | "bcd") & "cde";

Minimum operator

You may query for samples which contain at least N of M strings:

select min 2 of ("abcd", "bcdf", "cdef");

is equivalent to:

select ("abcd" & "bcdf") | ("abcd" & "cdef") | ("bcdf" & "cdef");

Note that min N of (...) is executed more efficiently than the latter "combinatorial" form. This syntax corresponds directly to yara's "sets of strings" feature.

This operator accepts arbitrary expressions as its arguments, e.g.:

select min 2 of ("abcd" & "bcdf", "lorem" & "ipsum", "hello" & "hi there");

In this case, inner expressions like "abcd" & "bcdf" will be evaluated first.

The minimum operator can also be nested inside a larger expression, e.g.:

select "abcd" | ("cdef" & min 2 of ("hello", "hi there", "good morning"));
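One way to see why min N of (...) can be cheaper than the expanded form is to count, per file, how many sub-results it appears in, instead of intersecting every combination. The sketch below illustrates that idea on plain Python sets; it is not a description of how UrsaDB evaluates the operator internally, and min_of and the sample file names are hypothetical:

```python
def min_of(n: int, result_sets: list) -> set:
    """Return the files that appear in at least `n` of the given result sets."""
    counts = {}
    for result in result_sets:
        for name in result:
            counts[name] = counts.get(name, 0) + 1
    return {name for name, hits in counts.items() if hits >= n}


# Made-up result sets for the three strings in the example query.
abcd = {"a.bin", "b.bin"}
bcdf = {"b.bin", "c.bin"}
cdef = {"b.bin", "c.bin", "d.bin"}

counted = min_of(2, [abcd, bcdf, cdef])
expanded = (abcd & bcdf) | (abcd & cdef) | (bcdf & cdef)
print(counted == expanded)
```

Counting visits each sub-result once, while the expanded form grows with the number of N-element combinations of the M arguments.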

Status

Query for the status of tasks running in the database:

status;

The output is a JSON object with the details of all tasks. The exact format of these sub-objects is defined in Responses.cpp. See ursadb-cli for a working implementation.

Topology

Check current database topology - what datasets are loaded and which index types they use.

topology;

Example output:

> topology;
OK
DATASET aa266884
INDEX aa266884 gram3
INDEX aa266884 text4

DATASET bc43a921
INDEX bc43a921 gram3
INDEX bc43a921 text4

This means that there are two datasets (partitions), both backed by indexes of type gram3 and text4.
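If this textual output needs to be processed by a script, it can be turned into a mapping from dataset ID to index types. The parser below is a sketch that assumes exactly the DATASET / INDEX line layout shown in the example; parse_topology is a hypothetical helper, and the raw response returned by the server itself may be structured differently:

```python
def parse_topology(output: str) -> dict:
    """Map each dataset ID to the list of index types reported for it."""
    datasets = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "DATASET":
            datasets.setdefault(parts[1], [])
        elif len(parts) == 3 and parts[0] == "INDEX":
            datasets.setdefault(parts[1], []).append(parts[2])
        # Any other line (such as "OK" or a blank line) is ignored.
    return datasets


sample = """OK
DATASET aa266884
INDEX aa266884 gram3
INDEX aa266884 text4

DATASET bc43a921
INDEX bc43a921 gram3
INDEX bc43a921 text4
"""
topology = parse_topology(sample)
print(topology)
```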

Reindex

Add a new index type to an existing dataset. Before reindexing, you need to determine the ID of the dataset to be reindexed (this can be done using the topology command).

Example:

reindex "bc43a921" with [hash4];

This reindexes the existing dataset bc43a921, adding an index of type hash4.

Compact

Force database compacting.

In order to force compacting of all datasets into a single one:

compact all;

In order to force smart compact (database will decide which datasets do need compacting, if any):

compact smart;