
Optimize state store usage #538

Merged
merged 4 commits into main from optimize-state-store-usage on May 9, 2023
Conversation

@nscuro (Member) commented May 7, 2023

This PR introduces some improvements to how Kafka Streams state stores are used in the vulnerability-analyzer.

Automatic cleanup of completed-scans-table state store

The completed-scans-table store is where we aggregate results from all applicable scanners, per ScanKey. The intention is to only forward a result event to dtrack.vuln-analysis.result once all scanners have completed their work. The store was prone to accumulating lots of data for the following reasons:

  1. Scanner results can be quite big (depending on how many vulnerabilities were identified). The data is compressed in Kafka topics, but it's uncompressed in the local state store.
  2. KTables do not support manual deletions or TTL policies, so entries never expire and stay around forever.
  3. The default segment.bytes for changelog topics is 256 MiB. Kafka can only delete records in "inactive" segments, meaning that partitions in the changelog topic must accumulate more than 256 MiB of data before compaction can kick in.

All of the above are amplified by the fact that ScanKeys are unbounded, as there is no finite set of keys.

We worked around (2) by introducing a processor that emits Tombstone records for ScanKeys for which no update to the completed-scans-table was observed for over an hour. This approach is recommended by Confluent.

While this does keep the table from growing indefinitely, we could still run into situations where lots of unique keys accumulate over the course of an hour. This is unfortunate because we don't need the aggregated scanner results anymore once they are "complete" and have been forwarded.

This PR adds a custom Processor that replaces the previously used KTable. It deletes completed aggregates from the store as soon as they have been forwarded, and it still performs TTL enforcement to regularly clean up stale records. This change is backward-compatible, because it reuses the existing state store.
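
For illustration, here is a minimal sketch of what such a processor could look like, using the Kafka Streams Processor API. ScanKey and ScanResult are the project's existing types; the ScanResultAggregate type with its of/merge/isComplete methods, and the punctuation interval, are placeholders and do not reflect the actual implementation in this PR:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;
import org.apache.kafka.streams.state.ValueAndTimestamp;

// Sketch only: ScanResultAggregate and its of/merge/isComplete methods are hypothetical.
public class ScanResultAggregationProcessor
        implements Processor<ScanKey, ScanResult, ScanKey, ScanResultAggregate> {

    private static final Duration TTL = Duration.ofHours(1);

    private ProcessorContext<ScanKey, ScanResultAggregate> context;
    private TimestampedKeyValueStore<ScanKey, ScanResultAggregate> store;

    @Override
    public void init(final ProcessorContext<ScanKey, ScanResultAggregate> context) {
        this.context = context;
        // Reuse the existing state store to stay backward-compatible.
        this.store = context.getStateStore("completed-scans-table");
        // Regularly evict aggregates that were not updated within the TTL.
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, this::evictStaleAggregates);
    }

    @Override
    public void process(final Record<ScanKey, ScanResult> record) {
        final ValueAndTimestamp<ScanResultAggregate> existing = store.get(record.key());
        final ScanResultAggregate aggregate = existing == null
                ? ScanResultAggregate.of(record.value())
                : existing.value().merge(record.value());

        if (aggregate.isComplete()) {
            // All applicable scanners reported their results: forward the complete
            // aggregate and delete it right away, instead of waiting for the TTL to expire.
            context.forward(record.withValue(aggregate));
            store.delete(record.key());
        } else {
            store.put(record.key(), ValueAndTimestamp.make(aggregate, record.timestamp()));
        }
    }

    private void evictStaleAggregates(final long now) {
        try (KeyValueIterator<ScanKey, ValueAndTimestamp<ScanResultAggregate>> iterator = store.all()) {
            while (iterator.hasNext()) {
                final KeyValue<ScanKey, ValueAndTimestamp<ScanResultAggregate>> entry = iterator.next();
                if (now - entry.value.timestamp() > TTL.toMillis()) {
                    store.delete(entry.key);
                }
            }
        }
    }
}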

Note
The topic hyades-vulnerability-analyzer-completed-scans-table-last-update-store-changelog is no longer necessary and should be removed after deploying the updated vulnerability-analyzer service.

Configurable state stores

It is now possible to configure whether in-memory or persistent state stores shall be used. Further, certain customizations to RocksDB are configurable (e.g. the compaction style and compression type). These settings are currently applied globally to all state stores. In the future we can scope them to individual state stores, too.

To switch all state stores to RocksDB and enable compression, the following properties may be used:

state-store.type=rocks_db
state-store.rocks-db.compression-type=lz4_compression
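
Under the hood, RocksDB options of this kind are typically applied through Kafka Streams' RocksDBConfigSetter hook. The following is a hedged sketch of what that could look like; the class name is made up, and the actual wiring of the state-store.* properties in this PR may differ:

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.CompactionStyle;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

// Sketch only: applies compaction style and compression type globally to all RocksDB stores.
public class RocksDbConfigSetterSketch implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        options.setCompactionStyle(CompactionStyle.LEVEL);
        options.setCompressionType(CompressionType.LZ4_COMPRESSION);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // No resources to release in this sketch.
    }
}

Such a setter is registered with Kafka Streams via the rocksdb.config.setter property (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).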

Smaller changelog topics

When using state stores, Kafka Streams will, by default, create changelog topics with segment.bytes set to 256 MiB, which is more data per segment than necessary in many cases and artificially increases restoration times.

We now have a centralized default configuration for changelog topics that applies a few stricter limits:

public static Map<String, String> defaultChangelogTopicConfig() {
    return Map.of(
            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
            TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(64 * 1024 * 1024), // 64 MiB
            TopicConfig.MAX_COMPACTION_LAG_MS_CONFIG, "0" // Perform compaction ASAP
    );
}

Log segments are limited to 64 MiB, and records are signaled to be eligible for compaction immediately after they've been written. This should help keep topics smaller and have compaction kick in sooner.

This config has also been added to TOPICS.md.
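
As a usage sketch, these defaults can be attached to a store's changelog via the StoreBuilder's logging configuration. The serde variables below (scanKeySerde, aggregateSerde) are placeholders for the project's actual serdes:

import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;

// Sketch only: attach the stricter changelog topic defaults to a persistent store.
final StoreBuilder<TimestampedKeyValueStore<ScanKey, ScanResultAggregate>> storeBuilder =
        Stores.timestampedKeyValueStoreBuilder(
                        Stores.persistentTimestampedKeyValueStore("completed-scans-table"),
                        scanKeySerde,
                        aggregateSerde)
                .withLoggingEnabled(defaultChangelogTopicConfig());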

For some topics it may even be practical to reduce segment.bytes further, but I am not yet sure what the actual performance impact on the broker side would be.

nscuro added 2 commits May 7, 2023 13:18
Instead of waiting for them to expire. There is no need to keep aggregated scanner results around for longer than necessary.

`KTable`s constructed via aggregation do not support deletion or expiration, so we have to use a custom processor to do it.

Signed-off-by: nscuro <nscuro@protonmail.com>
It is now possible to switch between `in_memory` and `rocks_db`. A small selection of RocksDB properties are also configurable.

Users who wish to sacrifice disk space for a lower memory footprint can choose to enable RocksDB. Additionally, compression may be enabled to sacrifice more CPU utilization for less disk space.

Also provide a more restrictive baseline config for changelog topics. The default segment size of 256 MiB can cause stores and changelog topics to grow very large before compaction kicks in.

Signed-off-by: nscuro <nscuro@protonmail.com>
@nscuro added the enhancement and domain/vuln-analysis labels May 7, 2023
nscuro added 2 commits May 7, 2023 18:25
Signed-off-by: nscuro <nscuro@protonmail.com>
Signed-off-by: nscuro <nscuro@protonmail.com>
@nscuro marked this pull request as ready for review May 7, 2023 17:05
@nscuro (Member, Author) commented May 7, 2023

A check is failing, but the change to the proto definition is in fact backward-compatible: https://protobuf.dev/programming-guides/proto3/#deleting

sonarcloud bot commented May 7, 2023

SonarCloud Quality Gate failed.

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 3 (rating A)

Coverage: 90.9%
Duplication: 5.0%

@sahibamittal (Collaborator) left a comment

LGTM

@VinodAnandan merged commit e11c433 into main May 9, 2023
10 of 12 checks passed
@VinodAnandan deleted the optimize-state-store-usage branch May 9, 2023 12:42