add windows event data filter #24360
Conversation
Regression Detector Results
Run ID: 122bbe1c-7789-4343-95a9-39c9d88ef459
Performance changes are noted in the perf column of each table.
No significant changes in experiment optimization goals (confidence level: 90.00%). There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ❌ | file_to_blackhole | % cpu utilization | +12.20 | [+6.07, +18.34] |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI |
|---|---|---|---|---|
| ❌ | file_to_blackhole | % cpu utilization | +12.20 | [+6.07, +18.34] |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.63 | [-20.21, +21.48] |
| ➖ | process_agent_standard_check | memory utilization | +0.58 | [+0.52, +0.64] |
| ➖ | pycheck_1000_100byte_tags | % cpu utilization | +0.37 | [-4.22, +4.95] |
| ➖ | process_agent_real_time_mode | memory utilization | +0.06 | [+0.02, +0.11] |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.03 | [-0.02, +0.07] |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.19, +0.21] |
| ➖ | trace_agent_json | ingress throughput | +0.01 | [-0.02, +0.04] |
| ➖ | trace_agent_msgpack | ingress throughput | +0.01 | [-0.00, +0.02] |
| ➖ | otel_to_otel_logs | ingress throughput | -0.00 | [-0.35, +0.35] |
| ➖ | process_agent_standard_check_with_stats | memory utilization | -0.29 | [-0.34, -0.23] |
| ➖ | idle | memory utilization | -0.30 | [-0.34, -0.27] |
| ➖ | basic_py_check | % cpu utilization | -1.01 | [-3.47, +1.45] |
| ➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -2.16 | [-4.99, +0.67] |
Explanation
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
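The three criteria above can be sketched as a simple decision rule (illustrative Go only; the function and parameter names are mine, not the Regression Detector's actual implementation):

```go
package main

import "fmt"

// isRegression mirrors the three criteria described above: the estimated
// change must be large enough, its confidence interval must exclude zero,
// and the experiment must not be marked erratic in its configuration.
func isRegression(deltaMeanPct, ciLow, ciHigh float64, erratic bool) bool {
	const threshold = 5.0 // |Δ mean %| must be at least 5.00%
	if deltaMeanPct < 0 {
		deltaMeanPct = -deltaMeanPct
	}
	ciExcludesZero := ciLow > 0 || ciHigh < 0
	return deltaMeanPct >= threshold && ciExcludesZero && !erratic
}

func main() {
	// file_to_blackhole from the table above: +12.20 [+6.07, +18.34]
	fmt.Println(isRegression(12.20, 6.07, 18.34, false)) // true
	// tcp_syslog_to_blackhole: +0.63 [-20.21, +21.48]
	fmt.Println(isRegression(0.63, -20.21, 21.48, false)) // false
}
```

This is why only file_to_blackhole is flagged ❌ in the tables: it is the one row whose magnitude clears 5% with a CI entirely above zero.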
Force-pushed from 7e16973 to 65ad39c
Go Package Import Differences
Baseline: d060056

Test changes on VM
Use this command from test-infra-definitions to manually test this PR's changes on a VM: `inv create-vm --pipeline-id=32939685 --os-family=ubuntu`
Force-pushed from 65ad39c to 3c54de7
Force-pushed from 3c54de7 to a043a9f
LGTM for the BaRX owned code.
LGTM for agent-shared-components
Left a few comments, none of them very important. The one suggestion I think could be useful is adding ID ranges for sequences.
@@ -0,0 +1,373 @@
schema_version: 1.0
I wonder if you can use ID ranges for some sequences. It looks tedious.
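If the profile schema grew range support, the expansion could look something like this (a hypothetical helper: the profile format does not currently define ranges, and every name and the "4624-4634"-style syntax here are illustrative, not part of the PR):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// expandIDRanges expands entries like "4768-4771" into individual event
// IDs, passing plain entries like "4624" through unchanged. Illustrative
// only; the shipped profiles list each ID explicitly.
func expandIDRanges(entries []string) ([]int, error) {
	var ids []int
	for _, e := range entries {
		lo, hi, isRange := strings.Cut(e, "-")
		if !isRange {
			n, err := strconv.Atoi(e)
			if err != nil {
				return nil, err
			}
			ids = append(ids, n)
			continue
		}
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end, err := strconv.Atoi(hi)
		if err != nil {
			return nil, err
		}
		for n := start; n <= end; n++ {
			ids = append(ids, n)
		}
	}
	return ids, nil
}

func main() {
	ids, _ := expandIDRanges([]string{"4624", "4768-4771"})
	fmt.Println(ids) // [4624 4768 4769 4770 4771]
}
```

Expansion at filter-build time would keep the hot matching path unchanged while making the YAML much less tedious to maintain.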
comp/checks/windowseventlog/windowseventlogimpl/check/ddlog_submitter.go
@@ -226,8 +288,7 @@ func (c *Check) ddLogSubmitter(logsAgent logsAgent.Component, doneCh <-chan stru
 	inCh:          inCh,
 	bookmarkSaver: c.bookmarkSaver,
 	logSource: sources.NewLogSource("dd_security_events", &logsConfig.LogsConfig{
-		Source:  logsSource,
-		Service: "Windows",
+		Source: logsSource,
What is `service` set to now?
It is unset.
Our docs currently suggest configuring `service:Windows` in the logs config, but this never made much sense to me. `service` is also a broader-scoped tag, and I don't want to override the service tag that might be configured elsewhere. There's a common check-level configuration option, but we can't use it right now. See https://datadoghq.atlassian.net/browse/WINA-761
Interesting. What is `service` for other checks: the check's name? This is of course a "foundational" check. Maybe messages emitted for Cloud SIEM could be marked as `service:cloudsiem`? What do the message consumers (our Cloud SIEM team) rely on; which service are such logs normally marked with?
Right now `source:windows.events` is used to identify Windows event logs.
comp/checks/windowseventlog/windowseventlogimpl/check/eventdata_getter.go
for _, id := range f.eventIDs {
	if eventID == id {
		return true
	}
}
Is this part fast? I wonder about using something like https://pkg.go.dev/github.com/Workiva/go-datastructures/bitarray
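For reference, the bit-array idea could be sketched like this (an illustration only, not the check's actual filter type; it assumes event IDs fit in 16 bits, which holds for the Security log IDs the profiles list):

```go
package main

import "fmt"

// eventIDBitSet is a fixed 65536-bit set (8 KB): one bit per possible
// 16-bit event ID, giving O(1) membership tests with no scanning.
type eventIDBitSet [65536 / 64]uint64

func (s *eventIDBitSet) add(id int) {
	s[id/64] |= 1 << (id % 64)
}

func (s *eventIDBitSet) contains(id int) bool {
	if id < 0 || id >= 65536 {
		return false
	}
	return s[id/64]&(1<<(id%64)) != 0
}

func main() {
	var s eventIDBitSet
	for _, id := range []int{4624, 4625, 4672} {
		s.add(id)
	}
	fmt.Println(s.contains(4624), s.contains(4688)) // true false
}
```

The trade-off is a flat 8 KB regardless of list size, versus a slice that is smaller for short lists but scanned linearly.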
What would the difference be in your benchmark?
This part is what the benchmark is measuring
A 400-entry array would be 1600 or 3200 bytes in memory, which is not a tiny number of cache lines. If you use only a bit per ID, you reduce the size by a factor of 32 or 64. Probably not a big deal or a big saving, but I am curious.
	sizeMatchSet := []int{10, 100, 1000, 10000}
	for _, n := range sizeMatchSet {
		b.Run(fmt.Sprintf("MatchEventID_%d", n), func(b *testing.B) {
			// build a filter with n elements (IDs 0..n-1)
			f := &eventIDFilter{eventIDs: make([]int, n)}
			for i := 0; i < n; i++ {
				f.eventIDs[i] = i
			}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				// match against n itself, which is not in the set,
				// to imitate the worst case (full scan, no hit)
				f.isAllowedEventID(n)
			}
		})
	}
}
This is probably the most applicable:
BenchmarkIsAllowedEventID/MatchEventID_100-16     18950608    57.03 ns/op
BenchmarkIsAllowedEventID/MatchEventID_1000-16     2913006    389.3 ns/op
The numbers are not bad, but the question is whether we can do better, specifically on a machine where we enable this and there is a continuous high-rate stream of such events. And what is the overall CPU overhead at a high rate? Obviously, just enabling the feature alone will probably add more overhead for the system and the Agent than the filtering itself, but it would still be interesting to know (and perhaps even document something).
The primary bottleneck for all of the above is still `EvtFormatMessage`. See https://datadoghq.atlassian.net/browse/WINA-33
> The numbers are not bad, but the question is whether we can do better on a machine where we enable this and there is a continuous high-rate stream of such events

I'm not sure what you mean. It takes 389 nanoseconds to perform a (failing, worst-case) match, which means we could theoretically match ~3,000,000 events per second.
Without these changes, `EvtFormatMessage` bottlenecks us to ~3300 events/s, and even without `EvtFormatMessage` the best I've seen is ~66,000 events per second.
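Spelled out, that throughput estimate is just the reciprocal of the per-match cost from the benchmark numbers quoted above (a quick illustration; the ~3,000,000 figure is a loose rounding of the 1000-ID case):

```go
package main

import "fmt"

func main() {
	// ns/op figures quoted from the benchmark output earlier in the thread.
	benches := []struct {
		name    string
		nsPerOp float64
	}{
		{"MatchEventID_100", 57.03},
		{"MatchEventID_1000", 389.3},
	}
	for _, b := range benches {
		// 1e9 ns per second divided by the per-match cost gives matches/s.
		fmt.Printf("%s: ~%.2f million matches/s\n", b.name, 1e9/b.nsPerOp/1e6)
	}
}
```

Either way, the filter's ceiling is orders of magnitude above the ~3300 events/s that `EvtFormatMessage` allows, so the linear scan is not the bottleneck.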
Yeah, I understand. I meant a different, invisible overhead though. When we subscribe to a high-rate source, that alone adds significant overhead from implicitly moving data from the kernel (or a file, not sure) into the Agent process. Sure, with the numbers posted, the filtering is a visible and explicit overhead (part of it anyway), but by enabling the feature we are suddenly making the Agent drink from a fire hydrant. We have no control over that, and the customer may not realize it was their "fault", not ours.
To me, the overhead you measure and the throughput are only part of the picture. If customers start seeing an Agent CPU spike, that is the implicit overhead of drinking from the firehose. I am not sure what we can do about it beyond documentation, or how bad it could get in some cases. In theory we could minimize (maybe) this overhead by creating compound filtering (20-30 filters?) or even venture into ETW territory. It is pure speculation at this point, but I think it is worth being aware of.
for _, id := range f.eventIDs {
	if eventID == id {
		return true
	}
}
If we sort the event ID array, there is no need to scan the whole array, especially for IDs that are outside the range (you can quickly check against min/max).
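A sketch of that suggestion (illustrative names, not the check's actual code): sort once when the filter is built, reject out-of-range IDs with a min/max check, and binary-search the rest.

```go
package main

import (
	"fmt"
	"sort"
)

// sortedEventIDFilter keeps its IDs sorted so lookups are O(log n)
// instead of a full linear scan.
type sortedEventIDFilter struct {
	eventIDs []int // sorted ascending
}

func newSortedEventIDFilter(ids []int) *sortedEventIDFilter {
	s := append([]int(nil), ids...) // copy so the caller's slice is untouched
	sort.Ints(s)
	return &sortedEventIDFilter{eventIDs: s}
}

func (f *sortedEventIDFilter) isAllowedEventID(id int) bool {
	n := len(f.eventIDs)
	if n == 0 || id < f.eventIDs[0] || id > f.eventIDs[n-1] {
		return false // quick min/max rejection
	}
	i := sort.SearchInts(f.eventIDs, id)
	return i < n && f.eventIDs[i] == id
}

func main() {
	f := newSortedEventIDFilter([]int{4672, 4624, 4625})
	fmt.Println(f.isAllowedEventID(4624), f.isAllowedEventID(4688)) // true false
}
```

For the ~400-ID `high` profile this is roughly 9 comparisons worst case instead of 400, though as noted above neither is the bottleneck next to `EvtFormatMessage`.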
/merge
🚂 MergeQueue: pull request added to the queue. This build is going to start soon! (estimated merge in less than 25m)
/merge
🚂 MergeQueue: pull request added to the queue. This build is next! (estimated merge in less than 51m)
This reverts commit 896108a.
* add event id filter with hardcoded config
* add eventdata filter on eventids
* read filters from file
* read confd_path from config comp
* fix integration test invocation
* add profiles
* codeowners
* add missing close
* avoid formatting the message a second time
What does this PR do?
Adds two example security event profiles that define a filter for the Security event log based on Event IDs, and adds the necessary filtering logic to the Windows Event Log check.
Motivation
https://datadoghq.atlassian.net/browse/WINA-721
https://datadoghq.atlassian.net/browse/WINA-702
Additional Notes
EventID filtering is only used by the `dd_security_events` configuration; it is not available to the regular check configuration.
The EventID match is a simple comparison against a list, and it is plenty fast enough. Even with a list of 10,000 EventIDs the filter only takes .002ms. Our `high` profile has ~400 EventIDs listed.
Possible Drawbacks / Trade-offs
Describe how to test/QA your changes
Setup:
- enable logs in `datadog.yaml`
- configure the `win32_event_log` check

Verify:
- Security logs are visible in Logs search