default.path.data included as a data path when path.data configured as yaml array #23981

Closed
tylerfontaine opened this issue Apr 7, 2017 · 18 comments

tylerfontaine commented Apr 7, 2017

Elasticsearch version: 5.3.0

OS version: Tried on CentOS 7 and OS X, though I suspect it doesn't matter.

Description of the problem including expected versus actual behavior:
When using multiple data paths, or even when specifying just a single data path as a YAML array, Elasticsearch also includes whatever is passed to it as default.path.data as a data path. This happens when installed via the RPM or DEB packages, since both the systemd unit and the init.d script set this when they execute the Elasticsearch binary:

-Edefault.path.data=${DATA_DIR}

Steps to reproduce:

Set this in elasticsearch.yml:

path.data:
- "/some/datapath1"

Run this:

bin/elasticsearch -Edefault.path.data="/some/datapath2"

Elasticsearch will start configured with both data paths.

Provide logs (if relevant):

[2017-04-07T13:30:49,987][INFO ][o.e.e.NodeEnvironment    ] [jCWIeyS] using [2] data paths, mounts [[/ (/dev/disk1)]], net usable_space [198.8gb], net total_space [464.7gb], spins? [unknown], types [hfs]

This also happens when you use this array format:

path.data: ["/some/datapath1"]

dakrone self-assigned this Apr 7, 2017
dakrone added the >bug label Apr 7, 2017
jasontedor assigned jasontedor and unassigned dakrone Apr 7, 2017
jasontedor (Member) commented Apr 7, 2017

I know what is happening here and I see a fix; I will open a PR soon. This is an incredibly unfortunate consequence of the fact that when a setting is specified as an array, we break it down into new keys path.data.0, path.data.1, ..., path.data.n, yet the default setting is applied as path.data, so none of the keys from the array setting override it the way they do for other default settings.
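
A minimal sketch of the merge described above, using a plain map instead of Elasticsearch's real Settings class (the flattened key names come from the comment above; everything else is a simplified illustration, not the actual implementation):

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SettingsMergeSketch {
    public static void main(String[] args) {
        // The YAML array form is flattened into numbered keys: path.data.0, path.data.1, ...
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("path.data.0", "/some/datapath1");

        // The packaging default arrives under the plain key "path.data".
        // "path.data" and "path.data.0" are different keys, so the array values
        // never override the default and both survive the merge.
        settings.putIfAbsent("path.data", "/var/lib/elasticsearch");

        // Collecting every value that looks like a data path (the plain key plus all
        // numbered keys) yields two paths, which is what the node log above reports.
        List<String> dataPaths = settings.entrySet().stream()
                .filter(e -> e.getKey().equals("path.data") || e.getKey().startsWith("path.data."))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());

        System.out.println(dataPaths); // [/some/datapath1, /var/lib/elasticsearch]
    }
}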

@jasontedor (Member)

This is a horrifically bad bug. It appears to have been introduced in 5.3.0, and it means that shards can end up in /var/lib/elasticsearch even if a user did not intend that, because they configured path.data as an array while the packaging specifies default.path.data. If /var/lib/elasticsearch is sitting on the root partition (why not, since the user wants the data elsewhere), the user can easily end up with:

  • shards that are not where they wanted them
  • a root partition that fills up

It gets worse. Let's say that we fix this bug in 5.3.1 and a user migrates from 5.3.0 not aware of the fiasco that is ensuing here. They start up Elasticsearch on 5.3.1, and now their shards are missing! If they're not careful, they are in data loss territory.

nik9000 (Member) commented Apr 7, 2017

😭

jasontedor added the critical, v5.3.0, and :Core/Infra/Settings (Settings infrastructure and APIs) labels Apr 7, 2017
nik9000 (Member) commented Apr 8, 2017

Let's say that we fix this bug in 5.3.1 and a user migrates from 5.3.0 not aware of the fiasco that is ensuing here. They start up Elasticsearch on 5.3.1, and now their shards are missing! If they're not careful, they are in data loss territory.

If we just fix the config issue and someone upgrades one node at a time, they shouldn't lose any data so long as they have replicas. The entire shard will have to recover from the primary, but that is a fairly normal thing. They'd have to manually clean up the shards in the accidental data directory, which isn't horrible, but it isn't great at all.

We could do more to try to copy the data, or build a tool, or something.

Minimally, anyone on 5.3 will want to be able to know whether they are affected by this issue before they upgrade, so they know what they have to do as part of the upgrade.
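
To make that last point concrete, here is a rough sketch of a pre-upgrade check an operator could run on each node; it assumes the packaged default of /var/lib/elasticsearch and the usual nodes/<ordinal>/indices on-disk layout, and it is only an illustration, not an official tool:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class DefaultPathDataCheck {
    public static void main(String[] args) throws IOException {
        // The data paths this node was meant to use, copied from elasticsearch.yml.
        List<Path> configured = List.of(Paths.get("/some/datapath1"));
        // The packaging default passed via -Edefault.path.data (DATA_DIR in the unit file).
        Path defaultPath = Paths.get("/var/lib/elasticsearch");

        if (configured.contains(defaultPath)) {
            // The default is also an explicit data path, so finding data there is expected.
            System.out.println("default.path.data is also a configured data path; nothing to do");
            return;
        }

        // Assuming the usual nodes/<ordinal>/indices on-disk layout.
        boolean hasIndices = false;
        Path nodes = defaultPath.resolve("nodes");
        if (Files.isDirectory(nodes)) {
            try (DirectoryStream<Path> ordinals = Files.newDirectoryStream(nodes)) {
                for (Path ordinal : ordinals) {
                    if (Files.isDirectory(ordinal.resolve("indices"))) {
                        hasIndices = true;
                        break;
                    }
                }
            }
        }

        System.out.println(hasIndices
                ? "WARNING: index data found under unconfigured default path " + defaultPath
                : "no index data under " + defaultPath + "; this node looks unaffected");
    }
}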

@jasontedor (Member)

If we just fix the config issue and someone upgrades one node at a time, they shouldn't lose any data so long as they have replicas.

That's the problem, we do not have complete control over this.

They'd have to manually clean up the shards in the accidental data directory, which isn't horrible, but it isn't great at all.

Yes, if they are aware of the problem and do not panic. Or if they do not do something like practice immutable infrastructure.

We could do more to try to copy the data, or build a tool, or something.

Yes, but it's quite delicate.

Minimally, anyone on 5.3 will want to be able to know whether they are affected by this issue before they upgrade, so they know what they have to do as part of the upgrade.

Indeed, and therein lies the rub.

nik9000 (Member) commented Apr 8, 2017

Yeah, I just wanted to start talking about this. I certainly don't claim to have answers. I guess I'm fairly sure that the person doing the upgrade is going to want to know if they are affected by this. Lots of people won't be, and 5.3.0 will just be a normal version for them. For the affected people, even if we handle the whole thing automatically, they are going to need to know.

jasontedor (Member) commented Apr 8, 2017

Yeah, I just wanted to start talking about this.

Indeed, we need more people thinking about this. I discussed some of the complexities with @dakrone earlier in the evening via another channel, and I've been thinking about it all night. Sorry that I haven't taken the time to write my thoughts down yet.

Another complexity that I thought of is that the fix for this has to live for the rest of the 5.x series lifecycle because someone can upgrade from 5.3.0 to any future release in the 5.x series.

And yet another complexity that I thought of is the situation where someone has their default.path.data included as one of the values in path.data (assuming multiple path.data paths).

For the affected people, even if we handle the whole thing automatically, they are going to need to know.

I completely agree.

tylerfontaine (Author) commented Apr 8, 2017 via email

jasontedor (Member) commented Apr 8, 2017

No, this is absolutely not allowed. You can only rolling-restart upgrade to 6.0.0 from the latest 5.x that is released when 6.0.0 goes GA.

nik9000 (Member) commented Apr 8, 2017

Do we have set limitations yet on rolling upgrades to 6.0? Would 5.3.0 -> 6.0 be allowed? If so, the fix has to carry over there as well.

We'll make sure there is a 5.4, so that doesn't have to happen.

And yet another complexity that I thought of is the situation where someone has their default.path.data included as one of the values in path.data (assuming multiple path.data paths).

So any detection we have will produce a false positive for them, right?

@tylerfontaine (Author)

In fact, full cluster shutdown major version upgrades are even MORE dangerous in this instance, because all of the shards disappear at the same time.

@jasontedor (Member)

So any detection we have will produce a false positive for them, right?

I think it has to be smart enough to not produce a false positive for them.

@tylerfontaine (Author)

Just because a 5.4 exists doesn't mean people will be on it when they upgrade.

jasontedor (Member) commented Apr 8, 2017

In fact, full cluster shutdown major version upgrades are even MORE dangerous in this instance, because all of the shards disappear at the same time.

Indeed, that means that we need a fix in 6.x as well, since someone could full cluster restart upgrade from 5.3.0.

nik9000 (Member) commented Apr 8, 2017

Just because a 5.4 exists doesn't mean people will be on it when they upgrade.

Right, sorry. They won't be able to do a rolling upgrade, just a full cluster restart upgrade.

s1monw (Contributor) commented Apr 10, 2017

I think we should look into removing the default option here. It's too much magic IMO, and we can give people a good error message in that case. We can even advise them how to copy their data over to the actual data directories, OR do it ourselves. We had code for this in 2.x, from when we moved away from rolling our own software RAID, that we can reuse.
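
Purely as an illustration of that idea (the class and method names below are made up, not Elasticsearch code), the guard could take roughly this shape: an explicitly configured path.data wins outright, the packaging default is never merged in, and startup fails with a clear message if data has already accumulated under the unused default:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class DefaultPathGuardSketch {
    /**
     * Resolve the effective data paths: an explicitly configured path.data wins
     * outright and the packaging default is ignored rather than appended.
     */
    static List<Path> effectiveDataPaths(List<Path> explicitPathData, Path defaultPathData) {
        if (explicitPathData.isEmpty()) {
            return List.of(defaultPathData);
        }
        // If the unused default directory already holds a node data layout, refuse to
        // start with an actionable message instead of silently ignoring that data.
        if (!explicitPathData.contains(defaultPathData)
                && Files.isDirectory(defaultPathData.resolve("nodes"))) {
            throw new IllegalStateException("path.data is set explicitly but " + defaultPathData
                    + " still contains node data; move or remove it before starting");
        }
        return explicitPathData;
    }
}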

@jasontedor (Member)

I think we should look into removing the default option here.

Today this is used to support the packaging. I think that we can remove it if we instead ship the elasticsearch.yml in the distribution packaging with explicit values for path.data, path.logs, and path.conf. Sadly, I think that as a result of the situation here, we cannot do this until 7.0.0.

@s1monw
Copy link
Contributor

s1monw commented Apr 11, 2017

Sadly, I think that as a result of the situation here, we cannot do this until 7.0.0

That is a critical issue. I think we have to find a way to either fix it reliably or drop stuff, and I would not hesitate to do that even in a minor release if the alternative means data loss or anything along those lines.

nik9000 added commits that referenced this issue Apr 17, 2017
Adds a packaging test that would have detected #23981. The test configures path.data and validates that nothing ends up in default.path.data.