Extending data dictionaries? #95

Open
nicolasreich opened this issue Dec 1, 2020 · 3 comments

@nicolasreich
Contributor

There is an extension mechanism for entities, in order not to duplicate field definitions. It would be good to have such a mechanism for data dictionaries as well. For example, all Zeek network protocol events have fields for source and destination IP and port, which are duplicated across all the data dictionaries; instead, they all could extend a generic dictionary which defines these common fields. What do you think? Is that already part of your plans?

@hxnoyd
Collaborator

hxnoyd commented Jan 12, 2021

Hi @nicolasreich. First of all, sorry for the late reply.

So far we have developed data dictionaries as independent documents, as close as possible to the raw events produced by the sensor. The main goal is that you will always be able to drill down (i.e. from the data model) to the source of truth of an event and its fields. One of the tradeoffs is, as you suggest, duplicate information, which becomes apparent when you consume multiple events from the same sensor.

We are, however, planning to improve Data Dictionaries, in order to deal with situations where event fields can have different definitions depending on the event type, or where a field contains a nested JSON object, list, etc. that we could use to extend the fieldset of the event.

Regardless, I would be interested in further exploring your use case.

@nicolasreich
Contributor Author

Hi @hxnoyd. No worries, it was the holidays for everyone.

The rationale for this question was Suricata EVE JSON logs, where you have common fields, then nested fields for specific data. So for any alert, you get common fields, like source and destination IP addresses, as well as an alert section, and a different section depending on the protocol that triggered the alert.

So for an alert triggered by a DNS request, you would get something like:

src_ip: ...,
dest_ip: ...,
...
other common fields
...
alert: { ... alert fields ... },
dns: { ... dns fields ... }

While for an alert triggered by an HTTP request:

src_ip: ...,
dest_ip: ...,
...
other common fields
...
alert: { ... alert fields ... },
http: { ... http fields ... }

So the common fields are present in every event; the alert object is present in every alert; and then, depending on the type of the underlying traffic, there might be other objects.
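
To make that layering concrete, here is a minimal Python sketch that splits an event into its three parts. It is only an illustration: `split_alert`, the `COMMON_FIELDS` set, and the sample values (including the `signature` and `rrname` keys) are placeholders I made up for the example, and a real event has many more common fields than the two listed here.

```python
# Assumed subset of the fields shared by every event; the real list is longer.
COMMON_FIELDS = {"src_ip", "dest_ip"}


def split_alert(event):
    """Split an EVE-style alert into (common fields, alert object, protocol objects)."""
    common = {k: v for k, v in event.items() if k in COMMON_FIELDS}
    alert = event.get("alert", {})
    # Any remaining nested object is treated as a protocol-specific section.
    protocol = {
        k: v
        for k, v in event.items()
        if k not in COMMON_FIELDS and k != "alert" and isinstance(v, dict)
    }
    return common, alert, protocol


# Placeholder event shaped like the DNS example above.
dns_alert = {
    "src_ip": "192.0.2.1",
    "dest_ip": "192.0.2.2",
    "alert": {"signature": "example"},
    "dns": {"rrname": "example.org"},
}

common, alert, protocol = split_alert(dns_alert)
print(sorted(protocol))  # ['dns']
```

An HTTP alert would go through the same function and come back with an `http` section instead of `dns`, while the common and alert parts keep the same shape.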

It's obviously possible to have a data dictionary for each alert type, each containing the common fields and the alert fields; but it means a lot of duplication, causing a lot of potential mistakes, and what seems like unnecessary verbiage.

I think it would make sense to be able to extend a data dictionary, much like it's possible for entities. The rendered markdown version of the Data Dictionary would still be an independent document containing all the data.
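
As a rough sketch of what I have in mind, assuming a hypothetical `extends` key (this is not part of the current data dictionary schema, and the dictionary names and descriptions below are invented for the example), resolving a dictionary into its flattened, self-contained form could look like this:

```python
# Hypothetical dictionaries, shown as plain Python dicts instead of YAML.
# The "extends" key is an assumption, not an existing feature.
DICTIONARIES = {
    "suricata_common": {
        "fields": {
            "src_ip": "Source IP address",
            "dest_ip": "Destination IP address",
        },
    },
    "suricata_alert": {
        "extends": "suricata_common",
        "fields": {"alert": "Alert object"},
    },
    "suricata_alert_dns": {
        "extends": "suricata_alert",
        "fields": {"dns": "DNS object"},
    },
}


def resolve(name, dictionaries, _seen=()):
    """Return the full field set of a dictionary, inherited fields first."""
    if name in _seen:
        raise ValueError(f"circular 'extends' chain at {name!r}")
    entry = dictionaries[name]
    fields = {}
    parent = entry.get("extends")
    if parent is not None:
        fields.update(resolve(parent, dictionaries, _seen + (name,)))
    # Fields defined locally override inherited ones with the same name.
    fields.update(entry.get("fields", {}))
    return fields


flattened = resolve("suricata_alert_dns", DICTIONARIES)
print(list(flattened))  # ['src_ip', 'dest_ip', 'alert', 'dns']
```

The flattened result is what the markdown renderer would consume, so the published document stays complete on its own while the source files only define each field once.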

@hxnoyd
Collaborator

hxnoyd commented Feb 26, 2021

Hi @nicolasreich.

Thanks for the detailed explanation. It is now clearer what you mean by 'extending'. In a nutshell: deconstruct data dictionaries depending on field prevalence, to avoid duplicates and keep the data dictionary YAML as clean as possible.

I see the benefit of such an approach for events in the same log source (keep it simple, reduce duplication), but that would mean an increase in the number of data dictionaries, since we would need to create the 'common fields' data dictionaries (e.g. src_ip, dest_ip, etc.). On the one hand we would have a schema with few duplicate fields and, on the other hand, we would have more YAML data dictionaries to maintain.

The field name duplication has been raised multiple times in the past, but we always opted to keep the data dictionaries as close as possible to the original events, so that the community could customize them as needed. The main reason for this is to preserve the atomicity of the data dictionary: an absolutely independent object, or the source of truth in a single document if you like. By doing so we enable the community to model the data dictionaries as they like, to their own needs (e.g. Logstash pipelines).

Regardless, I think your suggestion is aligned with our vision for the improvement of data dictionaries, possibly with the creation of a separate dictionary that would provide a first layer of abstraction for data dictionaries, where the community would be able to better map events with entities, and/or the detection data model. This would allow us to keep the source of truth, at the expense of maintaining another dictionary with modeled/standardized events.

Unfortunately the last few months have been insanely busy, and we haven't had the time to work on a PoC for this... but it is on the roadmap :)
