
Feature/datasets/v43 #4166

Merged
merged 7 commits into OISF:master from feature/datasets/v43 on Sep 4, 2019

Conversation

victorjulien
Member

Rebased version of #4047. For merge as experimental feature.

PRScript output (if applicable):

Thread-safe hash table implementation based on the Flow hash, IP Pair
hash and others.

The hash is an array of buckets with per-bucket locking. Each bucket has a
list of elements which also individually use locking.
Datasets are sets/lists of data that can be accessed or added from
the rule language.
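The per-bucket locking scheme described above can be sketched as follows. This is a simplified illustration, not Suricata's actual thash code: element-level locks, reference counting and error handling are omitted, and all names (DataHash, DataHashAdd, DataHashLookup) are invented for the example.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct Elem {
    char *data;
    struct Elem *next;
} Elem;

typedef struct Bucket {
    pthread_mutex_t lock;   /* per-bucket lock */
    Elem *head;             /* list of elements in this bucket */
} Bucket;

typedef struct DataHash {
    Bucket *buckets;        /* array of buckets */
    uint32_t size;
} DataHash;

/* djb2-style string hash, for illustration only */
static uint32_t HashString(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

DataHash *DataHashNew(uint32_t size)
{
    DataHash *h = calloc(1, sizeof(*h));
    h->buckets = calloc(size, sizeof(Bucket));
    h->size = size;
    for (uint32_t i = 0; i < size; i++)
        pthread_mutex_init(&h->buckets[i].lock, NULL);
    return h;
}

/* Returns 1 if newly added, 0 if the value already existed.
   Only the one bucket is locked, so other threads can work on
   other buckets concurrently. */
int DataHashAdd(DataHash *h, const char *data)
{
    Bucket *b = &h->buckets[HashString(data) % h->size];
    pthread_mutex_lock(&b->lock);
    for (Elem *e = b->head; e != NULL; e = e->next) {
        if (strcmp(e->data, data) == 0) {
            pthread_mutex_unlock(&b->lock);
            return 0;
        }
    }
    Elem *e = calloc(1, sizeof(*e));
    size_t n = strlen(data) + 1;
    e->data = malloc(n);
    memcpy(e->data, data, n);
    e->next = b->head;
    b->head = e;
    pthread_mutex_unlock(&b->lock);
    return 1;
}

/* Returns 1 if the value is in the set, 0 otherwise. */
int DataHashLookup(DataHash *h, const char *data)
{
    Bucket *b = &h->buckets[HashString(data) % h->size];
    pthread_mutex_lock(&b->lock);
    for (Elem *e = b->head; e != NULL; e = e->next) {
        if (strcmp(e->data, data) == 0) {
            pthread_mutex_unlock(&b->lock);
            return 1;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return 0;
}
```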

This patch implements 3 data types:

1. string (or buffer)
2. md5
3. sha256

The patch also implements 2 new rule keywords:

1. dataset
2. datarep

The dataset keyword allows matching against a set of values to see if
a value exists in it or not. It can also add the value to the set. The set
can optionally be stored to disk on exit.
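For illustration, hedged examples of what dataset rules might look like (sticky buffer names and exact option order are assumptions based on the Suricata documentation for this feature, not taken verbatim from this PR):

```
# record every HTTP user agent seen ('set' adds the value if not present)
alert http any any -> any any (msg:"dataset add example"; \
    http.user_agent; dataset:set,ua-seen,type string,state ua-seen.lst; \
    sid:1000001; rev:1;)

# alert when a queried domain is in a list loaded from disk
alert dns any any -> any any (msg:"dataset isset example"; \
    dns.query; dataset:isset,bad-domains,type string,load bad-domains.lst; \
    sid:1000002; rev:1;)
```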

The datarep keyword supports matching/lookups only. With each item in the
set a reputation value is stored, and this value can be matched against. The
reputation value is an unsigned 16 bit integer, so values can be between 0 and 65535.
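A hedged sketch of datarep usage (again, option order and buffer names are approximations from the documentation, not verbatim from this PR):

```
# match when the queried domain's stored reputation is greater than 200
alert dns any any -> any any (msg:"datarep example"; \
    dns.query; datarep:dns-seen,>,200,load dns-seen.rep,type string; \
    sid:1000003; rev:1;)
```

Each line of the .rep file would pair a stored value with its reputation, e.g. for the string type a base64-encoded value followed by a comma and the reputation number.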

Datasets can be registered in 2 ways:

1. through the yaml
2. through the rules

The goal of this rules based approach is that rule writers can start using
this without the need for config changes.

A dataset is implemented using a thash hash table. Each dataset is its own
separate thash.

type <type>
    the data type: string, md5, sha256

save <file name>
    file name for saving the in memory data when Suricata exits
Member

the in is probably in reverse order

Member Author

Maybe I should have written 'in-memory'.

load <file name>
    file name for loading the data when Suricata starts up

state <file name>
    sets both 'save' and 'load' to the same value
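Putting these options together, an illustrative yaml entry might look like this (names and comments are made up for the example):

```yaml
datasets:
  - ua-seen:
      type: string         # string, md5 or sha256
      state: ua-seen.lst   # read at startup, written on exit
```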
Member

When would you want to use a different combination of save, load and state?

Member

Is an initial load file something we could see a rule distributor releasing?

Member Author

You can either use load alone, load and save together, or state. Keyword parsing will reject any other combination.

Just a load would indeed be what I expect to be used by a rule vendor.

State would be used when you either use rules to fill a set (dataset + set operator) or manage it over the unix socket.

Save is something you can use to just extract data, e.g. to get all unique user agents.

Member

I wonder if the default directories should differ for a set vs. state. Think of the case where a ruleset tarball contains a rule and an associated dataset: the best place to put that dataset is in the rule directory itself, or in a sub-directory of the rule directory named datasets. In fact, that might be a good convention for rule publishers to follow as well.

The rule would then contain 'dataset: load datasets/myset.lst'. The rule parser would then look for this relative to the directory the rule is being loaded from.

It's a little less clear what to do with save and state, but it's likely those won't be used by rule publishers.

Example adding 'google.com' to set 'myset':

dataset-add myset string Z29vZ2xlLmNvbQ==
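The value is passed base64 encoded. Assuming a standard `base64` tool is available, the encoded form used above can be produced like this:

```shell
# Encode "google.com" without a trailing newline for use with dataset-add.
printf '%s' 'google.com' | base64
# -> Z29vZ2xlLmNvbQ==
```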

Member

Why as base64? Can we not do that conversion on the Suricata side of the socket?

Also for hex notation, is a leading 0x required?

Member Author

We are working with binary data, so base64 seemed appropriate. It's in files too, not just the socket.

Member Author

No leading 0x is required or supported. It's the output that md5sum and similar use.

- dns-sha256-seen:
    type: sha256
    state: dns-sha256-seen.lst

Member

I don't think we should use lists here. Instead:

datasets:
  ua-seen:
    type: string
    state: ua-seen.lst
  dns-sha256-seen:
    type: sha256
    state: dns-sha256-seen.lst

If the list does make more sense, then it should look like:

datasets:
  - name: ua-seen
    type: string
    state: ua-seen.lst
  - name: dns-sha256-seen
    type: sha256
    state: dns-sha256-seen.lst



const char *data_dir = ConfigGetDataDirectory();
if ((ret = stat(data_dir, &st)) != 0) {
    SCLogNotice("data-dir '%s': %s", data_dir, strerror(errno));
Member

Probably an error here? Perhaps even a failure, as something I explicitly configured is not going to be active.

Member Author

ConfigGetDataDirectory will return a default if not configured. Error will be handled later when the file is actually opened.

list_pos++;
}
}
SCLogNotice("datasets done: %p", datasets);
Member

Debug?

Member Author

Fixing this and others.

set->load, strerror(errno));
return -1;
}

Member

When the filename comes from state, I think it's very likely the file may not exist yet, and should probably be created. I would expect an error on load, though.

Member Author

Yeah good catch. Fixing this and adding a SV test.

@victorjulien victorjulien merged commit 0107b9a into OISF:master Sep 4, 2019
@@ -2532,6 +2532,7 @@ else
EXPAND_VARIABLE(sysconfdir, e_sysconfrulesdir, "/suricata/rules")
EXPAND_VARIABLE(localstatedir, e_localstatedir, "/run/suricata")
EXPAND_VARIABLE(datadir, e_datarulesdir, "/suricata/rules")
EXPAND_VARIABLE(localstatedir, e_datadir, "/lib/suricata/data")
Member

This directory appears to have 2 usages: one as a place to hold datasets that may be distributed as part of a ruleset, the other more like a cache, for example for the state files.

I'd like to separate the 2, so I can 'rm -rf' the cache-type files to reset my state, but not lose files that may be required to start up.

It would also be nice to set some guidelines for people who wish to distribute rules that make use of datasets.

@victorjulien victorjulien mentioned this pull request Sep 6, 2019
@victorjulien victorjulien deleted the feature/datasets/v43 branch October 21, 2019 11:00