Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add windows event data filter #24360

Merged
merged 10 commits into from
Apr 29, 2024
Merged

Conversation

clarkb7
Copy link
Contributor

@clarkb7 clarkb7 commented Apr 3, 2024

What does this PR do?

Add the two example security event profiles that define a filter for Security Event Log based on Event IDs. Adds the necessary filtering logic to the Windows Event Log check.

Motivation

https://datadoghq.atlassian.net/browse/WINA-721
https://datadoghq.atlassian.net/browse/WINA-702

Additional Notes

EventID filtering is only used by the dd_security_events configuration. It is not available to the regular check configuration.

The EventID is a simple compare in a list, and it is plenty fast enough. Even with a list of 10,000 EventIDs the filter only takes .002ms. Our high profile has ~400 EventIDs listed.

goos: windows                                                                                                                                                                   
goarch: amd64                                                                                                                                                                    
pkg: http://github.com/DataDog/datadogagent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter                                                                       
cpu: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz                                                                                                                              
BenchmarkIsAllowedEventID                                                                                                                                                        
BenchmarkIsAllowedEventID/MatchEventID_10                                                                                                                                        
BenchmarkIsAllowedEventID/MatchEventID_10-16            208238829                4.907 ns/op                                                                                     
BenchmarkIsAllowedEventID/MatchEventID_100                                                                                                                                       
BenchmarkIsAllowedEventID/MatchEventID_100-16           18950608                57.03 ns/op                                                                                      
BenchmarkIsAllowedEventID/MatchEventID_1000                                                                                                                                      
BenchmarkIsAllowedEventID/MatchEventID_1000-16           2913006               389.3 ns/op                                                                                       
BenchmarkIsAllowedEventID/MatchEventID_10000                                                                                                                                     
BenchmarkIsAllowedEventID/MatchEventID_10000-16           881024              2375 ns/op    

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Setup:
enabled logs in datadog.yaml

logs_enabled: true

configure the win32_event_log check

init_config:
  legacy_mode: false
instances:
  - dd_security_events: high

Verify:
Security logs are visible in Logs search

@clarkb7 clarkb7 added changelog/no-changelog team/windows-agent qa/done Skip QA week as QA was done before merge and regressions are covered by tests labels Apr 3, 2024
@clarkb7 clarkb7 changed the base branch from branden.clark/eventlog-check-logs to main April 3, 2024 19:06
@pr-commenter
Copy link

pr-commenter bot commented Apr 3, 2024

Regression Detector

Regression Detector Results

Run ID: 122bbe1c-7789-4343-95a9-39c9d88ef459
Baseline: b7d3f32
Comparison: eddc7ef

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI
file_to_blackhole % cpu utilization +12.20 [+6.07, +18.34]

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI
file_to_blackhole % cpu utilization +12.20 [+6.07, +18.34]
tcp_syslog_to_blackhole ingress throughput +0.63 [-20.21, +21.48]
process_agent_standard_check memory utilization +0.58 [+0.52, +0.64]
pycheck_1000_100byte_tags % cpu utilization +0.37 [-4.22, +4.95]
process_agent_real_time_mode memory utilization +0.06 [+0.02, +0.11]
tcp_dd_logs_filter_exclude ingress throughput +0.03 [-0.02, +0.07]
uds_dogstatsd_to_api ingress throughput +0.01 [-0.19, +0.21]
trace_agent_json ingress throughput +0.01 [-0.02, +0.04]
trace_agent_msgpack ingress throughput +0.01 [-0.00, +0.02]
otel_to_otel_logs ingress throughput -0.00 [-0.35, +0.35]
process_agent_standard_check_with_stats memory utilization -0.29 [-0.34, -0.23]
idle memory utilization -0.30 [-0.34, -0.27]
basic_py_check % cpu utilization -1.01 [-3.47, +1.45]
uds_dogstatsd_to_api_cpu % cpu utilization -2.16 [-4.99, +0.67]

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

@clarkb7 clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 7e16973 to 65ad39c Compare April 24, 2024 04:54
Copy link

cit-pr-commenter bot commented Apr 24, 2024

Go Package Import Differences

Baseline: d060056
Comparison: 9eed87d

binaryosarchchange
agentwindowsamd64
+2, -0
+github.com/DataDog/datadog-agent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter
+github.com/Masterminds/semver/v3
agentwindows386
+2, -0
+github.com/DataDog/datadog-agent/comp/checks/windowseventlog/windowseventlogimpl/check/eventdatafilter
+github.com/Masterminds/semver/v3

@pr-commenter
Copy link

pr-commenter bot commented Apr 24, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=32939685 --os-family=ubuntu

@clarkb7 clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 65ad39c to 3c54de7 Compare April 24, 2024 18:54
@clarkb7 clarkb7 force-pushed the branden.clark/eventlog-eventdata-filter branch from 3c54de7 to a043a9f Compare April 24, 2024 21:07
@clarkb7 clarkb7 marked this pull request as ready for review April 25, 2024 13:54
@clarkb7 clarkb7 requested review from a team as code owners April 25, 2024 13:54
Copy link
Contributor

@spencergilbert spencergilbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the BaRX owned code.

Copy link
Contributor

@dustmop dustmop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for agent-shared-components

Copy link
Contributor

@iglendd iglendd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments which are not that important. The only suggestion I think could be useful is adding ID ranges for sequences,

@@ -0,0 +1,373 @@
schema_version: 1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you can use ID ranges for some sequences. It looks tedious.

@@ -226,8 +288,7 @@ func (c *Check) ddLogSubmitter(logsAgent logsAgent.Component, doneCh <-chan stru
inCh: inCh,
bookmarkSaver: c.bookmarkSaver,
logSource: sources.NewLogSource("dd_security_events", &logsConfig.LogsConfig{
Source: logsSource,
Service: "Windows",
Source: logsSource,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is service now?

Copy link
Contributor Author

@clarkb7 clarkb7 Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is unset.
Our docs currently suggest configuring service:Windows in the logs config, but this never made much sense to me.

service is also a broader scoped tag, and I don't want to override the service tag that might be configured elsewhere. There's a common check level configuration option but we can't use it right now. See https://datadoghq.atlassian.net/browse/WINA-761

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, what are service for some of the checks, check's name. This is of course a "foundational" check. Maybe messages emmited for cloud Siem could be marked as service:clouesiem? What the message consumers (our cloud siem team) using logs which normally marked by what service?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now source:windows.events is used to identify windows event logs

Comment on lines +60 to +64
for _, id := range f.eventIDs {
if eventID == id {
return true
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this part fast? I wonder of using something like this https://pkg.go.dev/github.com/Workiva/go-datastructures/bitarray

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be a difference in your benchmark

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is what the benchmark is measuring

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

400 entries array would be what 1600 or 3200 bytes in memory. which means not tiny number of cache lines. if you have only bit, you reduce size by 32 or 64. Probably not a big deal or saving but I am curious.

Comment on lines +39 to +54
sizeMatchSet := []int{10, 100, 1000, 10000}
for _, n := range sizeMatchSet {
b.Run(fmt.Sprintf("MatchEventID_%d", n), func(b *testing.B) {
// build a filter with n elements
f := &eventIDFilter{eventIDs: make([]int, n)}
for i := 0; i < n; i++ {
f.eventIDs[i] = i
}
b.ResetTimer()
for n := 0; n < b.N; n++ {
// set eventID to the size to imitate worst case scenario
f.isAllowedEventID(n)
}
})
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably most aplicable
BenchmarkIsAllowedEventID/MatchEventID_100
BenchmarkIsAllowedEventID/MatchEventID_100-16 18950608 57.03 ns/op
BenchmarkIsAllowedEventID/MatchEventID_1000
BenchmarkIsAllowedEventID/MatchEventID_1000-16 2913006 389.3 ns/op

The numbers are not bad, but the question is if we can do better if and only if on some machine where we enable it and there is a high stream of such events continuously. And what if overall CPU overhead on high rate. Obviously, just enabling the feature alone will probably add higher overhead for system AND agent that filtering, but still would be interesting to know (and perhaps even document something).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the primary bottleneck for all of the above is still EvtFormatMessage. See https://datadoghq.atlassian.net/browse/WINA-33

Copy link
Contributor Author

@clarkb7 clarkb7 Apr 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numbers are not bad, but the question is if we can do better if and only if on some machine where we enable it and there is a high stream of such events continuously

I'm not sure what you mean. It's taking 389 nanoseconds to perform a (failing, worst case) match. That means we could theoretically match ~3,000,000 events per second.
Without these changes, EvtFormatMessage bottlenecks us to ~3300 events/s, but even without EvtFormatMessage the best I've seen is ~66,000 events per second.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I understand. I did mean though a different overhead, invisible one. When we subscribe to high-rate source it alone will add a significant overhead of moving data from kernel or file (not sure) into Agent process implicitly. Sure, with humber posted for filtering it is visible and explicit overhead (part of that anyway) but by enabling the feature, all of a sudden we are making Agent to drink over a fire hydrant. We have no control over it though, but customer may not realize that it was their "fault" not hours.

To me overhead you a measure and throughput is only a part of the picture, another way. If customers start seeing Agent CPU spike, that what is that implicit overhead of drinking over firehose. Not sure what we can do about it beyond documentation. Or how bad it could be in some cases. In theory we can minimizing (maybe) this overhead by creating compound filtering (20-30 filters?) or even venture to ETW theretory. It is pure speculation at this point but it worth to be at least aware of that I think.

Comment on lines +60 to +64
for _, id := range f.eventIDs {
if eventID == id {
return true
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we sort event ID array then there is no need to scan whole array. Especially for ID which are not in the range (you can quickly check for min/max).

@clarkb7
Copy link
Contributor Author

clarkb7 commented Apr 27, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented Apr 27, 2024

🚂 MergeQueue

Pull request added to the queue.

This build is going to start soon! (estimated merge in less than 25m)

Use /merge -c to cancel this operation!

@dd-devflow
Copy link

dd-devflow bot commented Apr 27, 2024

❌ MergeQueue

PR can't be merged according to github policy

If you need support, contact us on Slack #devflow with those details!

@clarkb7
Copy link
Contributor Author

clarkb7 commented Apr 29, 2024

/merge

@dd-devflow
Copy link

dd-devflow bot commented Apr 29, 2024

🚂 MergeQueue

Pull request added to the queue.

This build is next! (estimated merge in less than 51m)

Use /merge -c to cancel this operation!

@dd-mergequeue dd-mergequeue bot merged commit 896108a into main Apr 29, 2024
231 checks passed
@dd-mergequeue dd-mergequeue bot deleted the branden.clark/eventlog-eventdata-filter branch April 29, 2024 16:28
@github-actions github-actions bot added this to the 7.54.0 milestone Apr 29, 2024
julien-lebot added a commit that referenced this pull request Apr 30, 2024
alexgallotta pushed a commit that referenced this pull request May 9, 2024
* add event id filter with hardcoded config

* add eventdata filter on eventids

* read filters from file

* read confd_path from config comp

* fix integration test invocation

* add profiles

* codeowners

* add missing close

* avoid formatting the message a second time
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog qa/done Skip QA week as QA was done before merge and regressions are covered by tests team/windows-agent
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants