Improve fluentd liveness probe #343

Merged: 11 commits, Jan 6, 2020
5 changes: 5 additions & 0 deletions deploy/helm/sumologic/conf/common.conf
@@ -6,6 +6,11 @@
 <source>
   @type prometheus_output_monitor
 </source>
+<source>
+  @type http
+  port 9880
+  bind 0.0.0.0
+</source>
 {{- if .Values.sumologic.fluentdLogLevel }}
 <system>
   log_level {{ .Values.sumologic.fluentdLogLevel }}
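The new `http` source is the endpoint the probes in this PR call. As a rough way to exercise it by hand (for example after port-forwarding 9880 from a pod), the sketch below builds the same URL the probes use and issues a GET. The function names and the `localhost` host are assumptions for illustration and are not part of the PR; treating any 2xx or 3xx status as healthy mirrors how Kubernetes evaluates `httpGet` probes.

```python
import json
import urllib.parse
import urllib.request


def build_probe_url(host: str, port: int = 9880) -> str:
    """Build the health check URL used by the probes in this PR.

    The record {"log": "health check"} is passed as the `json` query parameter.
    """
    query = urllib.parse.urlencode({"json": json.dumps({"log": "health check"})})
    return f"http://{host}:{port}/fluentd.pod.healthcheck?{query}"


def is_healthy(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx/3xx status within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # Covers connection errors, timeouts, and HTTP error statuses.
        return False


if __name__ == "__main__":
    # Hypothetical host: assumes port 9880 has been forwarded to localhost.
    url = build_probe_url("localhost")
    print(url, "healthy" if is_healthy(url) else "unhealthy")
```

The 3 second default timeout matches the `timeoutSeconds: 3` that this PR sets on the liveness probe.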
19 changes: 8 additions & 11 deletions deploy/helm/sumologic/templates/deployment.yaml
@@ -47,19 +47,16 @@ spec:
   containerPort: 24321
   protocol: TCP
 livenessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 300
Contributor

This seems very generous. Do we need to wait 5 minutes before starting the liveness probe?

Contributor Author

We currently wait 5 minutes, so I kept that. A generous initial delay is recommended; otherwise the pod might end up in an infinite restart loop during startup.

Contributor

5 minutes does sound a little high, but the readiness probe will fail first and the pod will stop accepting data. If we need more replicas in the meantime, I think we'll be okay with the autoscaler enabled. So I'm okay with leaving it at 5 minutes.
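To make the timing discussed in this thread concrete, here is a rough back-of-the-envelope sketch. It assumes Kubernetes' default `failureThreshold` of 3 (the diff does not set it) and that every probe fails immediately, so it ignores `timeoutSeconds` and kubelet scheduling jitter. The function name is hypothetical.

```python
def seconds_until_threshold(initial_delay: int, period: int, failure_threshold: int = 3) -> int:
    """Approximate seconds from container start until `failure_threshold`
    consecutive probe failures have been observed.

    The first probe runs after `initial_delay`; each later one follows
    `period` seconds after the previous one.
    """
    return initial_delay + (failure_threshold - 1) * period


# Values from this PR's probes; failureThreshold=3 is an assumed default.
liveness = seconds_until_threshold(initial_delay=300, period=30)
readiness = seconds_until_threshold(initial_delay=30, period=5)

print(f"readiness marks the pod unready after about {readiness}s")
print(f"liveness restarts the container after about {liveness}s")
```

Under these assumptions a pod that never becomes healthy is taken out of service well before the liveness probe would restart it, which is the ordering the comment above relies on.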

-  periodSeconds: 20
+  periodSeconds: 30
+  timeoutSeconds: 3
 readinessProbe:
rvmiller89 marked this conversation as resolved.
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 30
   periodSeconds: 5
 volumeMounts:
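The percent-encoded probe path is hard to read at a glance. The sketch below decodes it with Python's standard library to show what the probe actually sends. The reading of the URL path as the event tag and the `json` parameter as the record reflects my understanding of Fluentd's `http` input and should be treated as an assumption; `decode_probe` is a hypothetical helper name.

```python
import json
from urllib.parse import parse_qs, urlsplit

# Path copied verbatim from the httpGet probes in this PR.
PROBE_PATH = "/fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D"


def decode_probe(path: str):
    """Split a probe path into (tag, record)."""
    parts = urlsplit(path)
    # Assumption: the http input uses the URL path (minus the leading slash) as the tag.
    tag = parts.path.lstrip("/")
    # parse_qs undoes the percent-encoding and turns '+' back into spaces.
    record = json.loads(parse_qs(parts.query)["json"][0])
    return tag, record


tag, record = decode_probe(PROBE_PATH)
print(tag)
print(record)
```

So each probe submits a small synthetic log event instead of only checking that a `ruby` process exists, which is what the replaced `pgrep` command did.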
19 changes: 8 additions & 11 deletions deploy/helm/sumologic/templates/events-deployment.yaml
@@ -45,19 +45,16 @@ spec:
 - name: pos-files
   mountPath: /mnt/pos/
 livenessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 300
-  periodSeconds: 20
+  periodSeconds: 30
+  timeoutSeconds: 3
 readinessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 30
   periodSeconds: 5
 env:
43 changes: 21 additions & 22 deletions deploy/kubernetes/fluentd-sumologic.yaml.tmpl
@@ -28,6 +28,11 @@ data:
   <source>
     @type prometheus_output_monitor
   </source>
+  <source>
+    @type http
+    port 9880
+    bind 0.0.0.0
+  </source>
 
 metrics.conf: |-
   <source>
@@ -469,19 +474,16 @@ spec:
   containerPort: 24321
   protocol: TCP
 livenessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 300
-  periodSeconds: 20
+  periodSeconds: 30
+  timeoutSeconds: 3
 readinessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 30
   periodSeconds: 5
 volumeMounts:
@@ -625,19 +627,16 @@ spec:
 - name: pos-files
   mountPath: /mnt/pos/
 livenessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 300
-  periodSeconds: 20
+  periodSeconds: 30
+  timeoutSeconds: 3
 readinessProbe:
-  exec:
-    command:
-    - "/bin/sh"
-    - "-c"
-    - "[ $(pgrep ruby | wc -l) -gt 0 ]"
+  httpGet:
+    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+    port: 9880
   initialDelaySeconds: 30
   periodSeconds: 5
 env: