Conversation

@monrax (Collaborator) commented Sep 3, 2025

This closes #31

All Graylog Datanode env vars should have the GRAYLOG_DATANODE_ prefix, as indicated in graylog-docker (except for JAVA_OPTS which is just JAVA_OPTS).

As pointed out by @williamtrelawny, all secret fields must be base64-encoded, as described in the Kubernetes API reference docs. This PR fixes that, in addition to separating the Data Node-only vars into their own Secret resource.
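
For illustration, a minimal sketch of what one entry of such a Data Node Secret could look like (the "graylog.fullname" helper and the GRAYLOG_DATANODE_* key names are assumptions for this example, not necessarily what the chart renders):

  # Sketch only -- the helper name and data keys below are assumed, not copied from the chart.
  apiVersion: v1
  kind: Secret
  metadata:
    name: {{ include "graylog.fullname" . }}-datanode
  type: Opaque
  data:
    # b64enc base64-encodes the value, as the Kubernetes API expects for Secret data fields
    GRAYLOG_DATANODE_S3_CLIENT_DEFAULT_ACCESS_KEY: {{ .Values.datanode.config.s3ClientDefaultAccessKey | b64enc | quote }}
    GRAYLOG_DATANODE_S3_CLIENT_DEFAULT_SECRET_KEY: {{ .Values.datanode.config.s3ClientDefaultSecretKey | b64enc | quote }}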

Finally, a minor validation is added: if any one or two of s3ClientDefaultAccessKey, s3ClientDefaultSecretKey, and s3ClientDefaultEndpoint are specified, but not all three, the templates fail to render. This prevents a CrashLoopBackOff in the datanode pods, since all three values are required for a successful start.
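
As an aside, a minimal sketch of how such an all-or-nothing check can be expressed at render time (illustrative only; the exact template in this PR may differ):

  {{- $ak := .Values.datanode.config.s3ClientDefaultAccessKey }}
  {{- $sk := .Values.datanode.config.s3ClientDefaultSecretKey }}
  {{- $ep := .Values.datanode.config.s3ClientDefaultEndpoint }}
  {{- /* Fail the render if the S3 client settings are only partially specified. */}}
  {{- if and (or $ak $sk $ep) (not (and $ak $sk $ep)) }}
    {{- fail "s3ClientDefaultAccessKey, s3ClientDefaultSecretKey, and s3ClientDefaultEndpoint must all be set together" }}
  {{- end }}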

Notes for Reviewers

  • The commit history must be preserved - please use the rebase-merge or standard merge option instead of squash-merge
  • Sync up with the author before merging

@monrax (Collaborator, Author) commented Sep 5, 2025

Validation steps:

  1. Check out this branch:

     git checkout fix/datanode-s3
     git pull

  2. Install the chart:

     helm install graylog ./graylog -n graylog --set provider="aws-managed-sc"

  3. Enable the S3 features:

     helm upgrade graylog ./graylog -n graylog --reuse-values \
       --set datanode.config.s3ClientDefaultAccessKey="<your access key>" \
       --set datanode.config.s3ClientDefaultSecretKey="<your secret key>" \
       --set datanode.config.s3ClientDefaultRegion="<your region>" \
       --set datanode.config.s3ClientDefaultEndpoint="<your endpoint>"

  4. Verify that all pods reach the Running status and that there are no unexpected logs.

  5. Verify that the S3-backed Warm Tier and Data Lake features work as expected.
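
Optionally, spot-check that the Data Node secret values really are base64-encoded (the secret name below is a placeholder; use whatever name the chart actually renders):

     kubectl get secrets -n graylog
     kubectl get secret <datanode-secret-name> -n graylog -o yaml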

@williamtrelawny (Collaborator) commented Sep 5, 2025

We should also have a check to verify that datanode.custom.persistence.data.size is >= datanode.config.nodeSearchCacheSize, else the datanode container fails, saying there's not enough disk space for the search cache. Perhaps even require persistence size >= cache size + 8Gi, if possible, to add a little buffer so that the original 8Gi default for log storage is maintained?

When setting all the appropriate values, this does work though!

@monrax (Collaborator, Author) commented Sep 5, 2025

We should also have a check to verify that datanode.custom.persistence.data.size is >= datanode.config.nodeSearchCacheSize, else the datanode container fails, saying there's not enough disk space for the search cache. Perhaps even require persistence size >= cache size + 8Gi, if possible, to add a little buffer so that the original 8Gi default for log storage is maintained?

Could you open a new issue for this, please?

A first iteration of a patch for this issue could look like this:

  {{- /* Keep the raw strings for the error message; pass the parsed floats only to the comparison. */}}
  {{- $cacheSizeStr := .Values.datanode.config.nodeSearchCacheSize }}
  {{- $diskSizeStr := .Values.datanode.custom.persistence.data.size | default "8Gi" }}
  {{- $cacheSize := $cacheSizeStr | regexFind "^[0-9]+\\.?[0-9]*" | float64 }}
  {{- $diskSize := $diskSizeStr | regexFind "^[0-9]+\\.?[0-9]*" | float64 }}
  {{- if ge $cacheSize $diskSize }}
    {{ printf "Node Search Cache Size (%s) must be less than the configured datanode.custom.persistence.data.size (%s)" $cacheSizeStr $diskSizeStr | fail }}
  {{- end }}

However, since the size check requires parsing the actual numbers to perform the comparison, and the units and formats could vary between nodeSearchCacheSize and persistence.data.size (e.g. "Gi" vs "gb"), this would be a bit more involved than a literal datanode.custom.persistence.data.size >= datanode.config.nodeSearchCacheSize comparison. This kind of fail-fast (re)validation at render time, as a complement to the validation the container already performs at runtime, is an interesting pattern that we should explore in its own issue.
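
To give an idea of what a unit-aware version might involve (a rough sketch only; the suffix-to-multiplier table is an assumption and only covers the units listed):

  {{- /* Sketch: normalize both sizes to bytes before comparing; unknown suffixes fall back to a multiplier of 1. */}}
  {{- $units := dict "gi" 1073741824.0 "gb" 1000000000.0 "g" 1000000000.0 "mi" 1048576.0 "mb" 1000000.0 "m" 1000000.0 }}
  {{- $cacheRaw := .Values.datanode.config.nodeSearchCacheSize | toString | lower }}
  {{- $diskRaw := .Values.datanode.custom.persistence.data.size | default "8Gi" | toString | lower }}
  {{- $cacheBytes := mulf (regexFind "^[0-9]+\\.?[0-9]*" $cacheRaw | float64) (get $units (regexFind "[a-z]+$" $cacheRaw) | default 1.0) }}
  {{- $diskBytes := mulf (regexFind "^[0-9]+\\.?[0-9]*" $diskRaw | float64) (get $units (regexFind "[a-z]+$" $diskRaw) | default 1.0) }}
  {{- if ge $cacheBytes $diskBytes }}
    {{- fail (printf "nodeSearchCacheSize (%s) must be smaller than persistence.data.size (%s)" $cacheRaw $diskRaw) }}
  {{- end }}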

@williamtrelawny williamtrelawny self-requested a review September 5, 2025 17:35

@williamtrelawny (Collaborator) left a comment

LGTM! 🚀 Awesome work!

@williamtrelawny williamtrelawny merged commit 3294853 into main Sep 5, 2025
1 check passed
@monrax monrax deleted the fix/datanode-s3 branch December 5, 2025 11:09

Development

Successfully merging this pull request may close these issues.

Warm Tier - datanode.config.s3ClientDefaultAccessKey value throws error