Feature/datasets/v43 #4166
Conversation
Thread-safe hash table implementation based on the Flow hash, IP Pair hash and others. The hash is an array of buckets with per-bucket locking. Each bucket has a list of elements which also individually use locking.
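A minimal sketch of the per-bucket locking scheme described above. All names here are hypothetical stand-ins, not Suricata's actual thash API, and FNV-1a is used as a stand-in for the Flow hash:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define BUCKET_COUNT 256

typedef struct Elem {
    struct Elem *next;
    char key[64];
} Elem;

typedef struct Bucket {
    pthread_mutex_t lock;   /* protects only this bucket's chain */
    Elem *head;
} Bucket;

typedef struct Table {
    Bucket buckets[BUCKET_COUNT];
} Table;

static uint32_t HashKey(const char *key)
{
    uint32_t h = 2166136261u;   /* FNV-1a offset basis */
    for (; *key != '\0'; key++) {
        h ^= (uint8_t)*key;
        h *= 16777619u;         /* FNV prime */
    }
    return h % BUCKET_COUNT;
}

static void TableInit(Table *t)
{
    for (int i = 0; i < BUCKET_COUNT; i++) {
        pthread_mutex_init(&t->buckets[i].lock, NULL);
        t->buckets[i].head = NULL;
    }
}

/* Returns 1 if key was newly added, 0 if it already existed, -1 on error.
 * Only the target bucket is locked, so threads hashing to other buckets
 * proceed in parallel. */
static int TableAdd(Table *t, const char *key)
{
    Bucket *b = &t->buckets[HashKey(key)];
    pthread_mutex_lock(&b->lock);
    for (Elem *e = b->head; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            pthread_mutex_unlock(&b->lock);
            return 0;
        }
    }
    Elem *e = calloc(1, sizeof(*e));
    if (e == NULL) {
        pthread_mutex_unlock(&b->lock);
        return -1;
    }
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->next = b->head;
    b->head = e;
    pthread_mutex_unlock(&b->lock);
    return 1;
}
```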
Datasets are sets/lists of data that can be accessed or added from the rule language. This patch implements 3 data types:

1. string (or buffer)
2. md5
3. sha256

The patch also implements 2 new rule keywords:

1. dataset
2. datarep

The dataset keyword allows matching against a list of values to see if a value exists or not. It can also add the value to the set. The set can optionally be stored to disk on exit. The datarep keyword supports matching/lookups only. With each item in the set a reputation value is stored, and this value can be matched against. The reputation value is an unsigned 16 bit integer, so values can be between 0 and 65535.

Datasets can be registered in 2 ways:

1. through the yaml
2. through the rules

The goal of the rules-based approach is that rule writers can start using this without the need for config changes. A dataset is implemented using a thash hash table. Each dataset is its own separate thash.
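A hedged sketch of how the two keywords might look in rules; the set names and exact option syntax here are illustrative and may not match the finally merged version:

```
alert http any any -> any any (msg:"new user-agent seen"; \
    http.user_agent; dataset:set,ua-seen,type string,state ua-seen.lst; \
    sid:1; rev:1;)

alert dns any any -> any any (msg:"domain with bad reputation"; \
    dns.query; datarep:dns-rep,>,200,type string,load dns-rep.rep; \
    sid:2; rev:1;)
```

The first rule both checks and adds to the set (the rule-registered case needing no config change); the second matches only when the stored reputation value for the queried domain exceeds 200.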
type <type>
    the data type: string, md5, sha256
save <file name>
    file name for saving the in memory data when Suricata exits
`the in` is probably in reverse order
Maybe I should have written 'in-memory'.
load <file name>
    file name for loading the data when Suricata starts up
state
    sets both 'save' and 'load' to the same value
When would you want to use a different combination of save, load and state?
Is an initial load file something we could see a rule distributor releasing?
You can either use load, load and save, or state. Keyword parsing will block the rest.
Just a load would indeed be what I expect to be used by a rule vendor.
State would be used in case you use either rules to fill a set (dataset + set operator) or if you manage it over unix socket.
Save is something you can use to just extract data. E.g. get all unique UAs.
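The combinations described above could be sketched in the yaml roughly as follows (a hedged illustration only; the set names are invented and the final config layout may differ):

```
datasets:
  # rule-vendor case: a read-only list shipped alongside the rules
  bad-domains:
    type: string
    load: bad-domains.lst
  # locally managed case: loaded at startup and written back on exit,
  # equivalent to pointing both 'load' and 'save' at the same file
  ua-seen:
    type: string
    state: ua-seen.lst
```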
I wonder if the default directories should differ for a set vs. state. Think of the case where a ruleset tarball contains a rule and an associated dataset; I think the best place to put that dataset is in the rule directory itself, or in a sub-directory of the rule directory named `datasets`. In fact, that might be a good convention for rule publishers to follow as well.

The rule would then contain `dataset: load datasets/myset.lst`. The rule parser would then look for this relative to the directory the rule is being loaded from.

It's a little less clear what to do with save and state, and it's likely those won't be used by rule publishers.
Example adding 'google.com' to set 'myset'::

    dataset-add myset string Z29vZ2xlLmNvbQ==
Why as base64? Can we not do that conversion on the Suricata side of the socket?
Also for hex notation, is a leading 0x required?
We are working with binary data, so base64 seemed appropriate. It's in files too, not just the socket.
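For reference, the encoded value in the example above can be produced with the standard `base64` tool:

```shell
# Encode a set value before passing it to dataset-add over the socket.
# printf avoids the trailing newline that echo would fold into the encoding.
printf '%s' 'google.com' | base64
# -> Z29vZ2xlLmNvbQ==
```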
No leading 0x is required or supported. It's the output that `md5sum` and similar use.
    - dns-sha256-seen:
        type: sha256
        state: dns-sha256-seen.lst
I don't think we should use lists here. Instead:

    datasets:
      ua-seen:
        type: string
        state: ua-seen.lst
      dns-sha256-seen:
        type: sha256
        state: dns-sha256-seen.lst

If the list does make more sense, then it should look like:

    datasets:
      - name: ua-seen
        type: string
        state: ua-seen.lst
      - name: dns-sha256-seen
        type: sha256
        state: dns-sha256-seen.lst
    const char *data_dir = ConfigGetDataDirectory();
    if ((ret = stat(data_dir, &st)) != 0) {
        SCLogNotice("data-dir '%s': %s", data_dir, strerror(errno));
Probably an error here? Perhaps even a failure, as something I explicitly configured is not going to be active.
ConfigGetDataDirectory will return a default if not configured. Error will be handled later when the file is actually opened.
            list_pos++;
        }
    }
    SCLogNotice("datasets done: %p", datasets);
Debug?
Fixing this and others.
                set->load, strerror(errno));
        return -1;
    }
When the filename comes from state, I think it's very likely the file may not exist yet, and it should probably be created. I would expect an error on load, though.
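A minimal sketch of the behavior being suggested, with an invented helper name rather than Suricata's actual code: a file named via `load` must exist (hard error), while a `state` file is created empty on first run.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* is_state: nonzero when the path came from the 'state' option */
static FILE *DatasetOpen(const char *path, int is_state)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        if (is_state && errno == ENOENT) {
            /* first run: create an empty state file instead of failing */
            fp = fopen(path, "w+");
        }
        if (fp == NULL) {
            /* 'load' files, or any other error, are fatal for this set */
            fprintf(stderr, "dataset file '%s': %s\n", path, strerror(errno));
            return NULL;
        }
    }
    return fp;
}
```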
Yeah good catch. Fixing this and adding a SV test.
@@ -2532,6 +2532,7 @@ else
     EXPAND_VARIABLE(sysconfdir, e_sysconfrulesdir, "/suricata/rules")
     EXPAND_VARIABLE(localstatedir, e_localstatedir, "/run/suricata")
     EXPAND_VARIABLE(datadir, e_datarulesdir, "/suricata/rules")
+    EXPAND_VARIABLE(localstatedir, e_datadir, "/lib/suricata/data")
This directory appears to have 2 usages: one as a place to hold datasets that may be distributed as part of a ruleset, while the other is more like a cache, for example the state files.

I'd like to separate the 2, so I can `rm -rf` the cache-type files to reset my state, but not lose files that may be required to start up.
It would also be nice to set some guidelines for people who wish to distribute rules that make use of datasets.
Rebased version of #4047. For merge as experimental feature.
PRScript output (if applicable):