This repository has been archived by the owner on May 16, 2023. It is now read-only.

fix cluster outage, add masterService template #41

Closed
wants to merge 5 commits

Conversation

@kimxogus (Contributor) commented Jan 21, 2019

@elasticmachine (Collaborator)

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@kimxogus (Contributor, Author)

Signed the CLA.

@@ -37,7 +37,7 @@ extraEnvs:
# A list of secrets and their paths to mount inside the pod
# This is useful for mounting certificates for security and for mounting
# the X-Pack license
secretMounts:
secretMounts:
Review comment (Contributor):

In order for this to pass validations, shouldn't it be set to []?
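For reference, a minimal sketch of the suggested default in values.yaml, with a commented example entry; the entry fields follow the surrounding template (name/secretName/path) and the certificate entry is purely illustrative:

secretMounts: []
# Illustrative example of an entry:
# secretMounts:
#   - name: elastic-certificates
#     secretName: elastic-certificates
#     path: /usr/share/elasticsearch/config/certs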

privileged: true
image: "{{ .Values.image }}:{{ .Values.imageTag }}"
command:
- /bin/bash
Review comment (Contributor):

Maybe add some comment here to state the purpose of the initContainer?
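As a sketch of what such a comment could look like (the container name and the stated purpose are inferred from the diff, not taken from the PR verbatim):

initContainers:
# Pre-creates /usr/share/elasticsearch/config/elasticsearch.yml in the shared
# "config" emptyDir volume before the Elasticsearch container starts
# (purpose inferred from the diff; confirm with the PR author).
- name: configure-elasticsearch  # hypothetical name
  securityContext:
    privileged: true
  image: "{{ .Values.image }}:{{ .Values.imageTag }}"
  command:
  - /bin/bash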

HOST="${!HOSTVAR}"

if [ ! -f /usr/share/elasticsearch/config/elasticsearch.yml ]; then
echo "" > /usr/share/elasticsearch/config/elasticsearch.yml
Review comment (Contributor):

Very nitpicky, but "touch" would be more elegant.
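Concretely, the file-creation branch of the init script would become (a minimal sketch; touch creates an empty file instead of writing a blank line):

if [ ! -f /usr/share/elasticsearch/config/elasticsearch.yml ]; then
  touch /usr/share/elasticsearch/config/elasticsearch.yml
fi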

@@ -78,6 +78,8 @@ spec:
secret:
secretName: {{ .name }}
{{- end }}
- name: config
emptyDir: {}
Review comment (Contributor):

Why should we store the configuration here instead of regenerating it at each start?

@Crazybus (Contributor)

@kimxogus

The issue you are linking to is referring to a different helm chart. The readinessProbes and service setup of this chart are quite different and have been designed and tested to not cause downtime during rolling upgrades and restarts.

Were you able to reproduce the same problem with this helm chart? There is a rolling upgrade script that I used when initially developing the chart to make sure it remained available. It isn't currently running as part of the automated testing (and is currently broken by the path changes to the proxy API), but this seems like a good time to re-enable it to ensure this behaviour remains.

I have just done some testing locally and wasn't able to cause any failed search requests during a rolling upgrade, killing the master pod, or doing a kill 1 in the master pod.

Here is how I'm testing it:

# Deploy the default example
cd elasticsearch/examples/default
make

# Start kubectl proxy
kubectl proxy

# Run constant search requests against the service
watch -n 0.5 'curl -s --fail http://localhost:8001/api/v1/namespaces/default/services/elasticsearch-master:9200/proxy/_search'

If you are able to cause a failure, could you please open an issue first with details on how you reproduced the problem?

@@ -0,0 +1,29 @@
{{ if eq .Values.roles.master "true" }}
{{- range $i := until (int .Values.replicas) }}

Review comment:

Do we really need this?
The headless service is used for service discovery and includes all members of the cluster, even the unready ones:
https://github.com/elastic/helm-charts/blob/master/elasticsearch/templates/service.yaml#L31
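For context, a headless discovery service of that kind looks roughly like the following (a sketch based on the linked template; exact names, labels, and fields are assumptions and may differ from the chart):

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-master-headless
spec:
  clusterIP: None                 # headless: DNS returns the pod IPs directly
  # publish endpoints for pods that are not yet ready, so restarting or
  # newly created nodes remain discoverable during rolling upgrades
  publishNotReadyAddresses: true
  selector:
    app: elasticsearch-master
  ports:
  - name: http
    port: 9200
  - name: transport
    port: 9300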

@kimxogus (Contributor, Author)

@Crazybus I wrote this PR because I had a cluster outage during a rolling upgrade, which led me to elastic/elasticsearch#36822. I thought the problem was not fixed here yet, because I couldn't find any changes related to the connection timeout problem (https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590/2 and elastic/elasticsearch#36822 (comment)), but it seems it is already fixed here without an announce service.

This PR doesn't seem necessary any more. Thanks.

@kimxogus kimxogus closed this Jan 29, 2019
@desaintmartin (Contributor) commented Jan 29, 2019

I am curious, where has it been fixed? I don't see any related change in the chart itself; is it from a new version of Elasticsearch?
I've seen PRs in the latest Kubernetes versions about solving such problems when deleting a service; maybe that also solves our problem?

@kimxogus (Contributor, Author)

I'm curious too. I also tested with Elasticsearch 6.5.3, which I was using when I had the cluster outage, and curl 127.0.0.1:9200 now responds immediately, which means the cluster outage no longer occurs.
I upgraded Kubernetes from 1.10 to 1.11 recently, so I think some change in Kubernetes affected this issue.

@Crazybus (Contributor)

> I am curious, where has it been fixed? I don't see any related change in the chart itself; is it from a new version of Elasticsearch?

> I'm curious too. I also tested with Elasticsearch 6.5.3, which I was using when I had the cluster outage

Did you reproduce the outage using this chart (elastic/helm-charts)? Or was it with the helm/charts/elasticsearch version which you linked to in the original comment? This chart is not based on the helm/charts version so it wouldn't make sense for it to share the same issues. Nothing has been changed from 6.5.3-alpha1 to the current version that could have affected anything related to this.

@desaintmartin (Contributor)

Which 1.11 version do you have?

@kimxogus (Contributor, Author)

@Crazybus I reproduced the outage using both charts (elastic/elasticsearch#36822 (comment))

@desaintmartin 1.11.7, with Canal (Calico 3.2.3 and Flannel 0.9.0) as the CNI.
