
Optimize state store usage #538

Merged
merged 4 commits into main from optimize-state-store-usage on May 9, 2023
Conversation

@nscuro (Member) commented May 7, 2023

This PR introduces some improvements to how Kafka Streams state stores are used in the vulnerability-analyzer.

Automatic cleanup of completed-scans-table state store

The completed-scans-table store is where we aggregate results from all applicable scanners, per ScanKey. The intention is to only forward a result event to dtrack.vuln-analysis.result once all scanners have completed their work. The store was prone to accumulating lots of data for the following reasons:

  1. Scanner results can be quite big (depending on how many vulnerabilities were identified). The data is compressed in Kafka topics, but it's uncompressed in the local state store.
  2. KTables do not support manual deletions or TTL policies, so entries never expire and stay around forever.
  3. The default segment.bytes for changelog topics is 256 MiB. Kafka can only delete records in "inactive" segments, meaning that partitions in the changelog topic must accumulate more than 256 MiB of data before compaction can kick in.

All of the above are amplified by the fact that ScanKeys are unbounded, as there is no finite set of keys.

We worked around (2) by introducing a processor that emits Tombstone records for ScanKeys for which no update to the completed-scans-table was observed for over an hour. This approach is recommended by Confluent.

While this does keep the table from growing indefinitely, we could still run into situations where lots of unique keys accumulate over the course of an hour. This is unfortunate because we don't need the aggregated scanner results anymore once they are "complete" and have been forwarded.

This PR adds a custom Processor that replaces the previously used KTable. It deletes completed aggregates from the store as soon as they have been forwarded, and it still performs TTL enforcement to regularly clean up stale records. This change is backward-compatible, because it reuses the existing state store.
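
For illustration, here is a minimal sketch of what such a processor could look like, using the Kafka Streams Processor API. ScanKey and ScanResult are the project's existing types; the ScanResultAggregate type with its of/merge/isComplete methods, and the punctuation interval, are placeholders and do not reflect the actual implementation in this PR:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;
import org.apache.kafka.streams.state.ValueAndTimestamp;

// Sketch only: ScanResultAggregate and its of/merge/isComplete methods are hypothetical.
public class ScanResultAggregationProcessor
        implements Processor<ScanKey, ScanResult, ScanKey, ScanResultAggregate> {

    private static final Duration TTL = Duration.ofHours(1);

    private ProcessorContext<ScanKey, ScanResultAggregate> context;
    private TimestampedKeyValueStore<ScanKey, ScanResultAggregate> store;

    @Override
    public void init(final ProcessorContext<ScanKey, ScanResultAggregate> context) {
        this.context = context;
        // Reuse the existing state store to stay backward-compatible.
        this.store = context.getStateStore("completed-scans-table");
        // Regularly evict aggregates that were not updated within the TTL.
        context.schedule(Duration.ofMinutes(5), PunctuationType.WALL_CLOCK_TIME, this::evictStaleAggregates);
    }

    @Override
    public void process(final Record<ScanKey, ScanResult> record) {
        final ValueAndTimestamp<ScanResultAggregate> existing = store.get(record.key());
        final ScanResultAggregate aggregate = existing == null
                ? ScanResultAggregate.of(record.value())
                : existing.value().merge(record.value());

        if (aggregate.isComplete()) {
            // All applicable scanners reported their results: forward the complete
            // aggregate and delete it right away, instead of waiting for the TTL to expire.
            context.forward(record.withValue(aggregate));
            store.delete(record.key());
        } else {
            store.put(record.key(), ValueAndTimestamp.make(aggregate, record.timestamp()));
        }
    }

    private void evictStaleAggregates(final long now) {
        try (KeyValueIterator<ScanKey, ValueAndTimestamp<ScanResultAggregate>> iterator = store.all()) {
            while (iterator.hasNext()) {
                final KeyValue<ScanKey, ValueAndTimestamp<ScanResultAggregate>> entry = iterator.next();
                if (now - entry.value.timestamp() > TTL.toMillis()) {
                    store.delete(entry.key);
                }
            }
        }
    }
}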

Note
The topic hyades-vulnerability-analyzer-completed-scans-table-last-update-store-changelog is no longer necessary and should be removed after deploying the updated vulnerability-analyzer service.

Configurable state stores

It is now possible to configure whether in-memory or persistent state stores shall be used. Further, certain customizations to RocksDB are configurable (e.g. the compaction style and compression type). These settings are currently applied globally to all state stores. In the future we can scope them to individual state stores, too.

To switch all state stores to RocksDB and enable compression, the following properties may be used:

state-store.type=rocks_db
state-store.rocks-db.compression-type=lz4_compression
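
Under the hood, RocksDB options of this kind are typically applied through Kafka Streams' RocksDBConfigSetter hook. The following is a hedged sketch of what that could look like; the class name is made up, and the actual wiring of the state-store.* properties in this PR may differ:

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.CompactionStyle;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;

// Sketch only: applies compaction style and compression type globally to all RocksDB stores.
public class RocksDbConfigSetterSketch implements RocksDBConfigSetter {

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        options.setCompactionStyle(CompactionStyle.LEVEL);
        options.setCompressionType(CompressionType.LZ4_COMPRESSION);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // No resources to release in this sketch.
    }
}

Such a setter is registered with Kafka Streams via the rocksdb.config.setter property (StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG).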

Smaller changelog topics

When using state stores, Kafka Streams will, by default, create changelog topics with segment.bytes set to 256 MiB, which is more data per segment than necessary in many cases and artificially increases restoration times.

We now have a centralized default configuration for changelog topics that applies a few stricter limits:

public static Map<String, String> defaultChangelogTopicConfig() {
    return Map.of(
            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
            TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(64 * 1024 * 1024), // 64 MiB
            TopicConfig.MAX_COMPACTION_LAG_MS_CONFIG, "0" // Perform compaction ASAP
    );
}

Log segments are limited to 64 MiB, and records are signaled to be eligible for compaction immediately after they've been written. This should help keep topics smaller and have compaction kick in sooner.

This config has also been added to TOPICS.md.
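
As a usage sketch, these defaults can be attached to a store's changelog via the StoreBuilder's logging configuration. The serde variables below (scanKeySerde, aggregateSerde) are placeholders for the project's actual serdes:

import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;
import org.apache.kafka.streams.state.TimestampedKeyValueStore;

// Sketch only: attach the stricter changelog topic defaults to a persistent store.
final StoreBuilder<TimestampedKeyValueStore<ScanKey, ScanResultAggregate>> storeBuilder =
        Stores.timestampedKeyValueStoreBuilder(
                        Stores.persistentTimestampedKeyValueStore("completed-scans-table"),
                        scanKeySerde,
                        aggregateSerde)
                .withLoggingEnabled(defaultChangelogTopicConfig());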

For some topics it may even be practical to reduce segment.bytes further, but I am not yet sure what the actual performance impact on the broker side would be.

nscuro added 2 commits May 7, 2023 13:18
Instead of waiting for them to expire. There is no need to keep aggregated scanner results around for longer than necessary.

`KTable`s constructed via aggregation do not support deletion or expiration, so we have to use a custom processor to do it.

Signed-off-by: nscuro <nscuro@protonmail.com>
It is now possible to switch between `in_memory` and `rocks_db`. A small selection of RocksDB properties are also configurable.

Users who wish to sacrifice disk space for a lower memory footprint can choose to enable RocksDB. Additionally, compression may be enabled to sacrifice more CPU utilization for less disk space.

Also provide a more restrictive baseline config for changelog topics. The default segment size of 256 MiB can cause stores and changelog topics to grow very large before compaction kicks in.

Signed-off-by: nscuro <nscuro@protonmail.com>
@nscuro added the enhancement and domain/vuln-analysis labels May 7, 2023
nscuro added 2 commits May 7, 2023 18:25
Signed-off-by: nscuro <nscuro@protonmail.com>
Signed-off-by: nscuro <nscuro@protonmail.com>
@nscuro marked this pull request as ready for review May 7, 2023 17:05
@nscuro (Member, Author) commented May 7, 2023

A check is failing, but the change to the proto definition is in fact backward-compatible: https://protobuf.dev/programming-guides/proto3/#deleting

sonarcloud bot commented May 7, 2023

SonarCloud Quality Gate failed.

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 3 (rating A)

Coverage: 90.9%
Duplication: 5.0%

@sahibamittal (Collaborator) left a comment

LGTM

@VinodAnandan merged commit e11c433 into main May 9, 2023
10 of 12 checks passed
@VinodAnandan deleted the optimize-state-store-usage branch May 9, 2023 12:42