add windows event data filter #24360

clarkb7 · 2024-04-03T19:05:43Z

What does this PR do?

Add the two example security event profiles that define a filter for Security Event Log based on Event IDs. Adds the necessary filtering logic to the Windows Event Log check.

Motivation

https://datadoghq.atlassian.net/browse/WINA-721
https://datadoghq.atlassian.net/browse/WINA-702

Additional Notes

EventID filtering is only used by the dd_security_events configuration. It is not available to the regular check configuration.

The EventID is a simple compare in a list, and it is plenty fast enough. Even with a list of 10,000 EventIDs the filter only takes .002ms. Our high profile has ~400 EventIDs listed.

goos: windows                                                                                                                                                                   
goarch: amd64                                                                                                                                                                    
pkg: http://github.com/DataDog/datadogagent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter                                                                       
cpu: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz                                                                                                                              
BenchmarkIsAllowedEventID                                                                                                                                                        
BenchmarkIsAllowedEventID/MatchEventID_10                                                                                                                                        
BenchmarkIsAllowedEventID/MatchEventID_10-16            208238829                4.907 ns/op                                                                                     
BenchmarkIsAllowedEventID/MatchEventID_100                                                                                                                                       
BenchmarkIsAllowedEventID/MatchEventID_100-16           18950608                57.03 ns/op                                                                                      
BenchmarkIsAllowedEventID/MatchEventID_1000                                                                                                                                      
BenchmarkIsAllowedEventID/MatchEventID_1000-16           2913006               389.3 ns/op                                                                                       
BenchmarkIsAllowedEventID/MatchEventID_10000                                                                                                                                     
BenchmarkIsAllowedEventID/MatchEventID_10000-16           881024              2375 ns/op

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Setup:
enabled logs in datadog.yaml

logs_enabled: true

configure the win32_event_log check

init_config:
  legacy_mode: false
instances:
  - dd_security_events: high

Verify:
Security logs are visible in Logs search

pr-commenter · 2024-04-03T20:41:01Z

Regression Detector

Regression Detector Results

Run ID: 122bbe1c-7789-4343-95a9-39c9d88ef459
Baseline: b7d3f32
Comparison: eddc7ef

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI
❌	file_to_blackhole	% cpu utilization	+12.20	[+6.07, +18.34]

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI
❌	file_to_blackhole	% cpu utilization	+12.20	[+6.07, +18.34]
➖	tcp_syslog_to_blackhole	ingress throughput	+0.63	[-20.21, +21.48]
➖	process_agent_standard_check	memory utilization	+0.58	[+0.52, +0.64]
➖	pycheck_1000_100byte_tags	% cpu utilization	+0.37	[-4.22, +4.95]
➖	process_agent_real_time_mode	memory utilization	+0.06	[+0.02, +0.11]
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.03	[-0.02, +0.07]
➖	uds_dogstatsd_to_api	ingress throughput	+0.01	[-0.19, +0.21]
➖	trace_agent_json	ingress throughput	+0.01	[-0.02, +0.04]
➖	trace_agent_msgpack	ingress throughput	+0.01	[-0.00, +0.02]
➖	otel_to_otel_logs	ingress throughput	-0.00	[-0.35, +0.35]
➖	process_agent_standard_check_with_stats	memory utilization	-0.29	[-0.34, -0.23]
➖	idle	memory utilization	-0.30	[-0.34, -0.27]
➖	basic_py_check	% cpu utilization	-1.01	[-3.47, +1.45]
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-2.16	[-4.99, +0.67]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

cit-pr-commenter · 2024-04-24T05:00:43Z

Go Package Import Differences

Baseline: d060056
Comparison: 9eed87d

binary

os

arch

change

agent

windows

amd64

+2, -0

+github.com/DataDog/datadog-agent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter
+github.com/Masterminds/semver/v3

agent

windows

386

+2, -0

+github.com/DataDog/datadog-agent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter
+github.com/Masterminds/semver/v3

pr-commenter · 2024-04-24T05:50:36Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=32939685 --os-family=ubuntu

spencergilbert

LGTM for the BaRX owned code.

dustmop

LGTM for agent-shared-components

iglendd

Left a few comments which are not that important. The only suggestion I think could be useful is adding ID ranges for sequences,

iglendd · 2024-04-26T17:26:01Z

cmd/agent/dist/conf.d/win32_event_log.d/profiles/dd_security_events_high.yaml.example

@@ -0,0 +1,373 @@
+schema_version: 1.0


I wonder if you can use ID ranges for some sequences. It looks tedious.

comp/checks/windowseventlog/windowseventlogimpl/check/check.go

comp/checks/windowseventlog/windowseventlogimpl/check/ddlog_submitter.go

iglendd · 2024-04-26T17:46:35Z

comp/checks/windowseventlog/windowseventlogimpl/check/subscription.go

@@ -226,8 +288,7 @@ func (c *Check) ddLogSubmitter(logsAgent logsAgent.Component, doneCh <-chan stru
 		inCh:          inCh,
 		bookmarkSaver: c.bookmarkSaver,
 		logSource: sources.NewLogSource("dd_security_events", &logsConfig.LogsConfig{
-			Source:  logsSource,
-			Service: "Windows",
+			Source: logsSource,


what is service now?

it is unset.
Our docs currently suggest configuring service:Windows in the logs config, but this never made much sense to me.

service is also a broader scoped tag, and I don't want to override the service tag that might be configured elsewhere. There's a common check level configuration option but we can't use it right now. See https://datadoghq.atlassian.net/browse/WINA-761

Interesting, what are service for some of the checks, check's name. This is of course a "foundational" check. Maybe messages emmited for cloud Siem could be marked as service:clouesiem? What the message consumers (our cloud siem team) using logs which normally marked by what service?

Right now source:windows.events is used to identify windows event logs

comp/checks/windowseventlog/windowseventlogimpl/check/subscription.go

comp/checks/windowseventlog/windowseventlogimpl/check/eventdata_getter.go

iglendd · 2024-04-26T18:14:55Z

comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter/filter.go

+	for _, id := range f.eventIDs {
+		if eventID == id {
+			return true
+		}
+	}


Is this part fast? I wonder of using something like this https://pkg.go.dev/github.com/Workiva/go-datastructures/bitarray

What would be a difference in your benchmark

This part is what the benchmark is measuring

400 entries array would be what 1600 or 3200 bytes in memory. which means not tiny number of cache lines. if you have only bit, you reduce size by 32 or 64. Probably not a big deal or saving but I am curious.

iglendd · 2024-04-26T18:28:03Z

comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter/filter_test.go

+	sizeMatchSet := []int{10, 100, 1000, 10000}
+	for _, n := range sizeMatchSet {
+		b.Run(fmt.Sprintf("MatchEventID_%d", n), func(b *testing.B) {
+			// build a filter with n elements
+			f := &eventIDFilter{eventIDs: make([]int, n)}
+			for i := 0; i < n; i++ {
+				f.eventIDs[i] = i
+			}
+			b.ResetTimer()
+			for n := 0; n < b.N; n++ {
+				// set eventID to the size to imitate worst case scenario
+				f.isAllowedEventID(n)
+			}
+		})
+	}
+}


This is probably most aplicable
BenchmarkIsAllowedEventID/MatchEventID_100
BenchmarkIsAllowedEventID/MatchEventID_100-16 18950608 57.03 ns/op
BenchmarkIsAllowedEventID/MatchEventID_1000
BenchmarkIsAllowedEventID/MatchEventID_1000-16 2913006 389.3 ns/op

The numbers are not bad, but the question is if we can do better if and only if on some machine where we enable it and there is a high stream of such events continuously. And what if overall CPU overhead on high rate. Obviously, just enabling the feature alone will probably add higher overhead for system AND agent that filtering, but still would be interesting to know (and perhaps even document something).

the primary bottleneck for all of the above is still EvtFormatMessage. See https://datadoghq.atlassian.net/browse/WINA-33

The numbers are not bad, but the question is if we can do better if and only if on some machine where we enable it and there is a high stream of such events continuously

I'm not sure what you mean. It's taking 389 nanoseconds to perform a (failing, worst case) match. That means we could theoretically match ~3,000,000 events per second.
Without these changes, EvtFormatMessage bottlenecks us to ~3300 events/s, but even without EvtFormatMessage the best I've seen is ~66,000 events per second.

Yeah, I understand. I did mean though a different overhead, invisible one. When we subscribe to high-rate source it alone will add a significant overhead of moving data from kernel or file (not sure) into Agent process implicitly. Sure, with humber posted for filtering it is visible and explicit overhead (part of that anyway) but by enabling the feature, all of a sudden we are making Agent to drink over a fire hydrant. We have no control over it though, but customer may not realize that it was their "fault" not hours.

To me overhead you a measure and throughput is only a part of the picture, another way. If customers start seeing Agent CPU spike, that what is that implicit overhead of drinking over firehose. Not sure what we can do about it beyond documentation. Or how bad it could be in some cases. In theory we can minimizing (maybe) this overhead by creating compound filtering (20-30 filters?) or even venture to ETW theretory. It is pure speculation at this point but it worth to be at least aware of that I think.

iglendd · 2024-04-26T20:02:30Z

comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter/filter.go

+	for _, id := range f.eventIDs {
+		if eventID == id {
+			return true
+		}
+	}


If we sort event ID array then there is no need to scan whole array. Especially for ID which are not in the range (you can quickly check for min/max).

clarkb7 · 2024-04-27T07:27:45Z

/merge

dd-devflow · 2024-04-27T07:27:51Z

🚂 MergeQueue

Pull request added to the queue.

This build is going to start soon! (estimated merge in less than 25m)

Use /merge -c to cancel this operation!

dd-devflow · 2024-04-27T07:48:22Z

❌ MergeQueue

PR can't be merged according to github policy

If you need support, contact us on Slack #devflow with those details!

clarkb7 · 2024-04-29T15:59:37Z

/merge

dd-devflow · 2024-04-29T15:59:43Z

🚂 MergeQueue

Pull request added to the queue.

This build is next! (estimated merge in less than 51m)

Use /merge -c to cancel this operation!

This reverts commit 896108a.

* add event id filter with hardcoded config * add eventdata filter on eventids * read filters from file * read confd_path from config comp * fix integration test invocation * add profiles * codeowners * add missing close * avoid formatting the message a second time

clarkb7 added changelog/no-changelog team/windows-agent qa/done Skip QA week as QA was done before merge and regressions are covered by tests labels Apr 3, 2024

clarkb7 changed the base branch from branden.clark/eventlog-check-logs to main April 3, 2024 19:06

clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 7e16973 to 65ad39c Compare April 24, 2024 04:54

add event id filter with hardcoded config

329770a

clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 65ad39c to 3c54de7 Compare April 24, 2024 18:54

clarkb7 added 6 commits April 24, 2024 16:50

add eventdata filter on eventids

d8b92cf

read filters from file

42f3a43

read confd_path from config comp

d705e1c

fix integration test invocation

60b9bf8

add profiles

3fa2504

codeowners

a043a9f

clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 3c54de7 to a043a9f Compare April 24, 2024 21:07

clarkb7 added 2 commits April 24, 2024 22:30

add missing close

55f927f

avoid formatting the message a second time

eddc7ef

clarkb7 marked this pull request as ready for review April 25, 2024 13:54

clarkb7 requested review from a team as code owners April 25, 2024 13:54

spencergilbert approved these changes Apr 25, 2024

View reviewed changes

Merge branch 'main' into branden.clark/eventlog-eventdata-filter

9eed87d

dustmop approved these changes Apr 26, 2024

View reviewed changes

iglendd approved these changes Apr 26, 2024

View reviewed changes

iglendd reviewed Apr 26, 2024

View reviewed changes

dd-mergequeue bot merged commit 896108a into main Apr 29, 2024
231 checks passed

dd-mergequeue bot deleted the branden.clark/eventlog-eventdata-filter branch April 29, 2024 16:28

github-actions bot added this to the 7.54.0 milestone Apr 29, 2024

julien-lebot added a commit that referenced this pull request Apr 30, 2024

Revert "add windows event data filter (#24360)"

bcb287e

This reverts commit 896108a.

This was referenced Apr 30, 2024

Revert "add windows event data filter" #25256

Closed

Clean up configuration on install failure #25257

Merged

clarkb7 mentioned this pull request Apr 30, 2024

Remove install-fail kitchen test #25273

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add windows event data filter #24360

add windows event data filter #24360

clarkb7 commented Apr 3, 2024 •

edited

Loading

pr-commenter bot commented Apr 3, 2024 •

edited

Loading

Experiments ignored for regressions

Fine details of change detection per experiment

Explanation

cit-pr-commenter bot commented Apr 24, 2024 •

edited

Loading

pr-commenter bot commented Apr 24, 2024 •

edited

Loading

spencergilbert left a comment

dustmop left a comment

iglendd left a comment

iglendd Apr 26, 2024

iglendd Apr 26, 2024

clarkb7 Apr 26, 2024 •

edited

Loading

iglendd Apr 26, 2024

clarkb7 Apr 26, 2024

iglendd Apr 26, 2024

iglendd Apr 26, 2024

clarkb7 Apr 26, 2024

iglendd Apr 26, 2024

iglendd Apr 26, 2024

clarkb7 Apr 26, 2024

clarkb7 Apr 26, 2024 •

edited

Loading

iglendd Apr 26, 2024

iglendd Apr 26, 2024

clarkb7 commented Apr 27, 2024

dd-devflow bot commented Apr 27, 2024

dd-devflow bot commented Apr 27, 2024

clarkb7 commented Apr 29, 2024

dd-devflow bot commented Apr 29, 2024

add windows event data filter #24360

add windows event data filter #24360

Conversation

clarkb7 commented Apr 3, 2024 • edited Loading

What does this PR do?

Motivation

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

pr-commenter bot commented Apr 3, 2024 • edited Loading

Regression Detector

Regression Detector Results

No significant changes in experiment optimization goals

Experiments ignored for regressions

Fine details of change detection per experiment

Explanation

cit-pr-commenter bot commented Apr 24, 2024 • edited Loading

Go Package Import Differences

pr-commenter bot commented Apr 24, 2024 • edited Loading

Test changes on VM

spencergilbert left a comment

Choose a reason for hiding this comment

dustmop left a comment

Choose a reason for hiding this comment

iglendd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clarkb7 Apr 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clarkb7 Apr 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

clarkb7 commented Apr 27, 2024

dd-devflow bot commented Apr 27, 2024

dd-devflow bot commented Apr 27, 2024

clarkb7 commented Apr 29, 2024

dd-devflow bot commented Apr 29, 2024

clarkb7 commented Apr 3, 2024 •

edited

Loading

pr-commenter bot commented Apr 3, 2024 •

edited

Loading

cit-pr-commenter bot commented Apr 24, 2024 •

edited

Loading

pr-commenter bot commented Apr 24, 2024 •

edited

Loading

clarkb7 Apr 26, 2024 •

edited

Loading

clarkb7 Apr 26, 2024 •

edited

Loading