Feature/datasets/v43 #4166
Conversation
Thread-safe hash table implementation based on the Flow hash, IP Pair hash and others. The hash is an array of buckets with per-bucket locking. Each bucket has a list of elements which also individually use locking.
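A minimal sketch of the per-bucket locking scheme described above. All names here are hypothetical stand-ins, not Suricata's actual thash API, and FNV-1a is used as a stand-in for the Flow hash:

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define BUCKET_COUNT 256

typedef struct Elem {
    struct Elem *next;
    char key[64];
} Elem;

typedef struct Bucket {
    pthread_mutex_t lock;   /* protects only this bucket's chain */
    Elem *head;
} Bucket;

typedef struct Table {
    Bucket buckets[BUCKET_COUNT];
} Table;

static uint32_t HashKey(const char *key)
{
    uint32_t h = 2166136261u;   /* FNV-1a offset basis */
    for (; *key != '\0'; key++) {
        h ^= (uint8_t)*key;
        h *= 16777619u;         /* FNV prime */
    }
    return h % BUCKET_COUNT;
}

static void TableInit(Table *t)
{
    for (int i = 0; i < BUCKET_COUNT; i++) {
        pthread_mutex_init(&t->buckets[i].lock, NULL);
        t->buckets[i].head = NULL;
    }
}

/* Returns 1 if key was newly added, 0 if it already existed, -1 on error.
 * Only the target bucket is locked, so threads hashing to other buckets
 * proceed in parallel. */
static int TableAdd(Table *t, const char *key)
{
    Bucket *b = &t->buckets[HashKey(key)];
    pthread_mutex_lock(&b->lock);
    for (Elem *e = b->head; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            pthread_mutex_unlock(&b->lock);
            return 0;
        }
    }
    Elem *e = calloc(1, sizeof(*e));
    if (e == NULL) {
        pthread_mutex_unlock(&b->lock);
        return -1;
    }
    strncpy(e->key, key, sizeof(e->key) - 1);
    e->next = b->head;
    b->head = e;
    pthread_mutex_unlock(&b->lock);
    return 1;
}
```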
Datasets are sets/lists of data that can be accessed or added from the rule language. This patch implements 3 data types:

1. string (or buffer)
2. md5
3. sha256

The patch also implements 2 new rule keywords:

1. dataset
2. datarep

The dataset keyword allows matching against a list of values to see if a value exists or not. It can also add the value to the set. The set can optionally be stored to disk on exit. The datarep keyword supports matching/lookups only. With each item in the set a reputation value is stored, and this value can be matched against. The reputation value is an unsigned 16 bit integer, so values can be between 0 and 65535.

Datasets can be registered in 2 ways:

1. through the yaml
2. through the rules

The goal of the rules-based approach is that rule writers can start using this without the need for config changes. A dataset is implemented using a thash hash table. Each dataset is its own separate thash.
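A hedged sketch of how the two keywords might look in rules; the set names and exact option syntax here are illustrative and may not match the finally merged version:

```
alert http any any -> any any (msg:"new user-agent seen"; \
    http.user_agent; dataset:set,ua-seen,type string,state ua-seen.lst; \
    sid:1; rev:1;)

alert dns any any -> any any (msg:"domain with bad reputation"; \
    dns.query; datarep:dns-rep,>,200,type string,load dns-rep.rep; \
    sid:2; rev:1;)
```

The first rule both checks and adds to the set (the rule-registered case needing no config change); the second matches only when the stored reputation value for the queried domain exceeds 200.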
type <type>
    the data type: string, md5, sha256
save <file name>
    file name for saving the in memory data when Suricata exits
`the in` is probably in reverse order
Maybe I should have written 'in-memory'.
load <file name>
    file name for loading the data when Suricata starts up
state
    sets both 'save' and 'load' to the same value
When would you want to use a different combination of save, load and state?
Is an initial load file something we could see a rule distributor releasing?
You can either use load, load and save, or state. Keyword parsing will block the rest.
Just a load would indeed be what I expect to be used by a rule vendor.
State would be used in case you use either rules to fill a set (dataset + set operator) or if you manage it over unix socket.
Save is something you can use to just extract data. E.g. get all unique UAs.
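The combinations described above could be sketched in the yaml roughly as follows (a hedged illustration only; the set names are invented and the final config layout may differ):

```
datasets:
  # rule-vendor case: a read-only list shipped alongside the rules
  bad-domains:
    type: string
    load: bad-domains.lst
  # locally managed case: loaded at startup and written back on exit,
  # equivalent to pointing both 'load' and 'save' at the same file
  ua-seen:
    type: string
    state: ua-seen.lst
```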
I wonder if the default directories should differ for a set vs. state. Think of the case where a ruleset tarball contains a rule and an associated dataset; I think the best place to put that dataset is in the rule directory itself, or in a sub-directory of the rule directory named `datasets`. In fact, that might be a good convention for rule publishers to follow as well.

The rule would then contain `dataset: load datasets/myset.lst`. The rule parser would then look for this relative to the directory the rule is being loaded from.

It's a little less clear what to do with save and state, and it's likely those won't be used by rule publishers.
Example adding 'google.com' to set 'myset'::

    dataset-add myset string Z29vZ2xlLmNvbQ==
Why as base64? Can we not do that conversion on the Suricata side of the socket?
Also for hex notation, is a leading 0x required?
We are working with binary data, so base64 seemed appropriate. It's in files too, not just the socket.
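For reference, the encoded value in the example above can be produced with the standard `base64` tool:

```shell
# Encode a set value before passing it to dataset-add over the socket.
# printf avoids the trailing newline that echo would fold into the encoding.
printf '%s' 'google.com' | base64
# -> Z29vZ2xlLmNvbQ==
```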
No leading 0x is required or supported. It's the output that `md5sum` and similar use.
    - dns-sha256-seen:
        type: sha256
        state: dns-sha256-seen.lst
I don't think we should use lists here. Instead:

    datasets:
      ua-seen:
        type: string
        state: ua-seen.lst
      dns-sha256-seen:
        type: sha256
        state: dns-sha256-seen.lst

If the list does make more sense, then it should look like:

    datasets:
      - name: ua-seen
        type: string
        state: ua-seen.lst
      - name: dns-sha256-seen
        type: sha256
        state: dns-sha256-seen.lst
    const char *data_dir = ConfigGetDataDirectory();
    if ((ret = stat(data_dir, &st)) != 0) {
        SCLogNotice("data-dir '%s': %s", data_dir, strerror(errno));
Probably an error here? Perhaps even a failure, as something I explicitly configured is not going to be active.
ConfigGetDataDirectory will return a default if not configured. Error will be handled later when the file is actually opened.
            list_pos++;
        }
    }
    SCLogNotice("datasets done: %p", datasets);
Debug?
Fixing this and others.
                set->load, strerror(errno));
        return -1;
    }
When the filename comes from state, I think it's very likely the file may not exist yet, and it should probably be created. I would expect an error on load, though.
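A minimal sketch of the behavior being suggested, with an invented helper name rather than Suricata's actual code: a file named via `load` must exist (hard error), while a `state` file is created empty on first run.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* is_state: nonzero when the path came from the 'state' option */
static FILE *DatasetOpen(const char *path, int is_state)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL) {
        if (is_state && errno == ENOENT) {
            /* first run: create an empty state file instead of failing */
            fp = fopen(path, "w+");
        }
        if (fp == NULL) {
            /* 'load' files, or any other error, are fatal for this set */
            fprintf(stderr, "dataset file '%s': %s\n", path, strerror(errno));
            return NULL;
        }
    }
    return fp;
}
```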
Yeah good catch. Fixing this and adding a SV test.
@@ -2532,6 +2532,7 @@ else
     EXPAND_VARIABLE(sysconfdir, e_sysconfrulesdir, "/suricata/rules")
     EXPAND_VARIABLE(localstatedir, e_localstatedir, "/run/suricata")
     EXPAND_VARIABLE(datadir, e_datarulesdir, "/suricata/rules")
+    EXPAND_VARIABLE(localstatedir, e_datadir, "/lib/suricata/data")
This directory appears to have 2 usages: one as a place to hold datasets that may be distributed as part of a ruleset, while the other is more like a cache, for example the state files.

I'd like to separate the 2, so I can `rm -rf` the cache-type files to reset my state, but not lose files that may be required to start up.
It would also be nice to set some guidelines for people who wish to distribute rules that make use of datasets.
Rebased version of #4047. For merge as experimental feature.
PRScript output (if applicable):