A 3gram search engine for querying terabytes of data in milliseconds. Optimized for working with binary files (for example, malware dumps).
How does it work?
UrsaDB uses several slightly different methods of indexing files, with the
gram3 index as the most basic one.
When the database is about to create a
gram3 index for a given file, it extracts all possible three-byte combinations from it. An index is a big map of:
trigram => list of files which contain it.
For instance, if we index a text file containing the ASCII string
TEST MALWARE (ASCII:
54 45 53 54 20 4D 41 4C 57 41 52 45), the database generates the following trigrams (
_ denotes the space character):
TES (544553), EST (455354), ST_ (535420), T_M (54204D), _MA (204D41), MAL (4D414C), ALW (414C57), LWA (4C5741), WAR (574152), ARE (415245)
An index maps a trigram to a list of files, so the new file is added to the list stored under each of these trigrams.
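The indexing step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not UrsaDB's actual implementation; the function names and the in-memory dictionary are assumptions made for the example.

```python
from collections import defaultdict


def extract_trigrams(data: bytes) -> set[bytes]:
    """Return every distinct 3-byte window found in data."""
    return {data[i:i + 3] for i in range(len(data) - 2)}


def build_index(files: dict[str, bytes]) -> dict[bytes, set[str]]:
    """Map each trigram to the set of file names that contain it."""
    index: dict[bytes, set[str]] = defaultdict(set)
    for name, content in files.items():
        for gram in extract_trigrams(content):
            index[gram].add(name)
    return index


# Hypothetical one-file corpus holding the string from the example above.
index = build_index({"sample.txt": b"TEST MALWARE"})
print(sorted(gram.hex().upper() for gram in index))
```

A 12-byte file yields at most 10 trigrams, one per sliding window position.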
When querying for the string
TEST MALWARE, the database queries the trigram index to determine which files contain the sequence
544553, then which files contain
455354, and so on up to
415245. These partial results are ANDed, and the resulting set (a list of probably matching files) is returned.
This search technique may sometimes yield false positives, but it never yields false negatives. Thus, it is appropriate for quick filtering (see the mquery project, where we use UrsaDB to accelerate malware searching).
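The following sketch shows why ANDing trigram lookups can return a file that does not contain the queried string, and why a verification pass afterwards removes it. The corpus, the file names and the helper functions are hypothetical; they are not part of UrsaDB.

```python
def trigrams(data: bytes) -> list[bytes]:
    """Return the 3-byte sliding windows of data, in order."""
    return [data[i:i + 3] for i in range(len(data) - 2)]


def query(index: dict[bytes, set[str]], needle: bytes) -> set[str]:
    """AND together the file sets of every trigram in needle."""
    grams = trigrams(needle)
    if not grams:
        raise ValueError("needle must be at least 3 bytes long")
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result


# b.bin contains both trigrams of "ABCD" (ABC and BCD),
# but never the contiguous string itself.
corpus = {"a.bin": b"xxABCDxx", "b.bin": b"ABCxxBCD"}
index: dict[bytes, set[str]] = {}
for name, content in corpus.items():
    for gram in trigrams(content):
        index.setdefault(gram, set()).add(name)

candidates = query(index, b"ABCD")  # may contain false positives
confirmed = {name for name in candidates if b"ABCD" in corpus[name]}
```

Here the index lookup returns both files, while the exact check keeps only a.bin. A file that really contains the string always has all of its trigrams, so it can never be dropped by the lookup.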
String literals are very common in binaries, so it is useful to have a specialized index for ASCII characters.
In the text4 index, ASCII characters are packed in a manner similar to the base64 algorithm. Because of that, it is possible to generate a trigram out of four characters (as in base64, four 6-bit symbols fit into three bytes).
Note that such an index cannot answer queries containing non-ASCII bytes, so it should be combined with at least a gram3 index.
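A rough sketch of the packing idea, assuming a 64-symbol alphabet so that each character fits into 6 bits and four characters fit into 24 bits. The alphabet below is hypothetical and chosen only for illustration; the character set actually used by text4 may differ.

```python
import string
from typing import Optional

# Hypothetical 64-symbol alphabet (an assumption, not UrsaDB's real table).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + " \n"
CODE = {ord(ch): i for i, ch in enumerate(ALPHABET)}


def pack4(chunk: bytes) -> Optional[int]:
    """Pack 4 characters into 24 bits (6 bits each); None if any is unsupported."""
    value = 0
    for byte in chunk:
        code = CODE.get(byte)
        if code is None:
            return None
        value = (value << 6) | code
    return value


def text4_grams(data: bytes) -> set[int]:
    """Collect the packed value of every 4-byte window made of supported characters."""
    grams = set()
    for i in range(len(data) - 3):
        packed = pack4(data[i:i + 4])
        if packed is not None:
            grams.add(packed)
    return grams


grams = text4_grams(b"TEST MALWARE")
```

Windows that contain a byte outside the alphabet produce no entry at all, which is why such an index cannot serve queries with arbitrary bytes on its own.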
Because searching for
UTF-16 strings is also useful, there is a special index (wide8) which works similarly to
text4. In this case, ASCII characters interleaved with zero bytes are decoded.
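As a sketch of the "interleaved with zeros" idea: an 8-byte window of the form c0 00 c1 00 c2 00 c3 00 can be reduced to four ASCII characters, which could then be packed the same way as in a text4-style index. The function name and the exact acceptance rule are assumptions made for this example.

```python
def wide8_windows(data: bytes) -> list[bytes]:
    """Decode 8-byte windows shaped like c0 00 c1 00 c2 00 c3 00 to 4 ASCII bytes."""
    decoded = []
    for i in range(len(data) - 7):
        window = data[i:i + 8]
        chars, zeros = window[0::2], window[1::2]
        # Keep the window only if every second byte is zero
        # and the remaining bytes are non-zero ASCII.
        if zeros == b"\x00\x00\x00\x00" and all(0 < c < 0x80 for c in chars):
            decoded.append(chars)
    return decoded


wide = "TESTS".encode("utf-16-le")
windows = wide8_windows(wide)
```

For the UTF-16LE encoding of TESTS, only the windows starting on a character boundary qualify, giving TEST and ESTS.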
Yet another type of index is
hash4, which creates trigrams based on hashes of 4-byte sequences in the source file.
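The idea of turning 4-byte sequences into trigram-sized keys can be sketched as follows. The hash function here (CRC32 truncated to 24 bits) is only a stand-in chosen for the example; the README does not state which hash UrsaDB uses.

```python
import zlib


def hash4_grams(data: bytes) -> set[int]:
    """Fold every 4-byte window into a 24-bit key (stand-in hash: truncated CRC32)."""
    return {zlib.crc32(data[i:i + 4]) & 0xFFFFFF for i in range(len(data) - 3)}


keys = hash4_grams(b"TEST MALWARE")
```

Because 2^32 possible inputs are folded into 2^24 keys, different 4-byte sequences can share a key, so results from such an index are again candidates rather than exact matches.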
Full package installation
This repository contains only the UrsaDB project (the 3gram database). For instructions on how to set up the whole mquery system, see CERT-Polska/mquery.
Installation (Docker way)
A Docker image may be built by executing
docker build -t ursadb . in the source code pulled from this repo.
Installation (standard way)
- Compile from sources:
$ mkdir build
$ cd build
$ cmake -D CMAKE_C_COMPILER=gcc-7 -D CMAKE_CXX_COMPILER=g++-7 -D CMAKE_BUILD_TYPE=Release ..
$ make
- Deploy the output binaries (ursadb,
ursadb_new) to an appropriate place, e.g.:
# cp ursadb ursadb_new /usr/local/bin/
- Create a new database:
$ mkdir /opt/ursadb
$ ursadb_new /opt/ursadb/db.ursa
- Run UrsaDB server:
$ ursadb /opt/ursadb/db.ursa
- (Optional) Consider registering UrsaDB as a systemd service:
cp contrib/systemd/ursadb.service /
systemctl enable ursadb
Interaction with the database can be done using
ursadb-cli (available in a separate repository).
Directly provided paths
A filesystem path can be indexed with the index command, either a single path
or multiple paths at once:
index "/opt/something" "/opt/foobar";
By default, paths are indexed using the
gram3 index. Index types may be specified manually:
index "/opt/something" with [gram3, text4, hash4, wide8];
Paths in a list file
For convenience it's also possible to make UrsaDB read a file containing a list of targets to be indexed, each one separated by a newline.
index from list "/tmp/file-list.txt";
index from list "/tmp/file-list.txt" with [gram3, text4, hash4, wide8];
where /tmp/file-list.txt contains one target path per line.
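Such a list file can be generated with a short script. The sketch below walks a directory and writes one absolute path per line; the function name is hypothetical, and the demo uses a temporary directory instead of real sample paths.

```python
import os
import tempfile


def write_file_list(root: str, list_path: str) -> int:
    """Write every file under root to list_path, one absolute path per line."""
    count = 0
    with open(list_path, "w", encoding="utf-8") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                out.write(os.path.abspath(os.path.join(dirpath, name)) + "\n")
                count += 1
    return count


# Demo in a throwaway directory: two sample files, then the generated list.
with tempfile.TemporaryDirectory() as workdir:
    samples = os.path.join(workdir, "samples")
    os.makedirs(samples)
    for name in ("a.bin", "b.bin"):
        with open(os.path.join(samples, name), "wb") as f:
            f.write(b"TEST MALWARE")
    list_path = os.path.join(workdir, "file-list.txt")
    written = write_file_list(samples, list_path)
    with open(list_path, encoding="utf-8") as f:
        listed = f.read().splitlines()
```

The list file is kept outside the scanned directory so that it does not end up listing itself.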
Select queries could use ordinary strings, hex strings and wide strings.
Query for the ASCII bytes abc:
select "abc";
The same query with hex string notation:
Query for the wide string
abc (the same as the bytes 61 00 62 00 63 00):
Elements can be AND-ed:
select "abc" & "bcd";
or OR-ed:
select "abc" | "bcd";
Queries may also use parentheses:
select ("abc" | "bcd") & "cde";
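One way to read these operators, assuming each string resolves to a set of candidate files: & behaves like set intersection and | like set union. The file names and per-string results below are made up for the illustration.

```python
# Hypothetical candidate sets, as index lookups for each string might return them.
hits = {
    "abc": {"f1", "f2"},
    "bcd": {"f2", "f3"},
    "cde": {"f2", "f3", "f4"},
}

# select "abc" & "bcd";
and_result = hits["abc"] & hits["bcd"]

# select "abc" | "bcd";
or_result = hits["abc"] | hits["bcd"]

# select ("abc" | "bcd") & "cde";
grouped = (hits["abc"] | hits["bcd"]) & hits["cde"]
```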
You may query for samples which contain at least N of M strings:
select min 2 of ("abcd", "bcdf", "cdef");
is equivalent to:
select ("abcd" & "bcdf") | ("abcd" & "cdef") | ("bcdf" & "cdef");
min N of (...) is executed more efficiently than the latter "combinatorial" example. This syntax corresponds directly to YARA's "sets of strings" feature.
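The equivalence between the two forms, and the reason a dedicated operator can be cheaper, can be illustrated on plain sets. This is a sketch under the assumption that each operand resolves to a set of candidate files; it does not describe how UrsaDB evaluates the operator internally.

```python
from collections import Counter
from itertools import combinations


def min_of(n: int, operands: list[set[str]]) -> set[str]:
    """Files present in at least n operand sets, found in one counting pass."""
    counts = Counter(name for operand in operands for name in operand)
    return {name for name, seen in counts.items() if seen >= n}


def min_of_combinatorial(n: int, operands: list[set[str]]) -> set[str]:
    """Same result, spelled out as an OR over every n-way AND."""
    result: set[str] = set()
    for combo in combinations(operands, n):
        result |= set.intersection(*combo)
    return result


# Hypothetical candidate sets for "abcd", "bcdf" and "cdef".
operands = [{"f1", "f2"}, {"f2", "f3"}, {"f3"}]
fast = min_of(2, operands)
slow = min_of_combinatorial(2, operands)
```

The counting version touches each operand once, while the expanded form has to evaluate every combination of N operands out of M.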
This operator accepts arbitrary expressions as its arguments, e.g.:
select min 2 of ("abcd" & "bcdf", "lorem" & "ipsum", "hello" & "hi there");
In this case, inner expressions like
"abcd" & "bcdf" will be evaluated first.
The minimum operator can also be nested inside another expression, e.g.:
select "abcd" | ("cdef" & min 2 of ("hello", "hi there", "good morning"));
Query for the status of tasks running in the database:
The output is a JSON object with the details of all tasks. The exact format of these sub-objects is defined in Responses.cpp. See
ursadb-cli for a working implementation.
Check the current database topology: which datasets are loaded and which index types they use.
> topology;
OK
DATASET aa266884
INDEX aa266884 gram3
INDEX aa266884 text4
DATASET bc43a921
INDEX bc43a921 gram3
INDEX bc43a921 text4
This means that there are two datasets (partitions), both backed by indexes of type gram3 and text4.
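A response in this shape is easy to post-process. The sketch below turns it into a mapping from dataset ID to index types, assuming one DATASET or INDEX record per line; the function name is hypothetical and the sample text simply repeats the response shown above.

```python
def parse_topology(text: str) -> dict[str, list[str]]:
    """Map each dataset ID to the list of index types reported for it."""
    topology: dict[str, list[str]] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "DATASET":
            topology.setdefault(fields[1], [])
        elif len(fields) == 3 and fields[0] == "INDEX":
            topology.setdefault(fields[1], []).append(fields[2])
    return topology


sample = """OK
DATASET aa266884
INDEX aa266884 gram3
INDEX aa266884 text4
DATASET bc43a921
INDEX bc43a921 gram3
INDEX bc43a921 text4
"""

topology = parse_topology(sample)
# Datasets that do not have a hash4 index yet.
without_hash4 = [ds for ds, types in topology.items() if "hash4" not in types]
```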
Add a new index type to an existing dataset. Before reindexing, you need to determine the ID of the dataset
which has to be indexed (this may be done using the topology command).
reindex "bc43a921" with [hash4];
will reindex the already existing dataset
bc43a921 with a hash4 type index.
Force database compacting.
In order to force compacting of all datasets into a single one:
In order to force a smart compact (the database will decide which datasets need compacting, if any):