
[STORE] Move to one datapath per shard #9498

Closed
s1monw opened this issue Jan 30, 2015 · 5 comments · Fixed by #10461
Labels
blocker :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >enhancement release highlight resiliency v2.0.0-beta1

Comments

@s1monw
Contributor

s1monw commented Jan 30, 2015

Today Elasticsearch has the ability to utilize multiple data paths,
which is useful to spread load across several disks, both throughput- and space-wise.
Yet the way they are utilized is problematic, since it resembles a striped RAID where
each shard gets a dedicated directory on each of the data paths / disks. This approach has several
problems:

  • if one data_path is lost, all shards on the node are lost, since their files are spread across disks
  • segment files that logically belong together are spread across disks, with unknown performance implications
  • some files historically were not write-once, so updates to those files always had to go to the same directory
  • corruption due to a lost disk might be detected very late if the affected files are already loaded into memory
  • listing the contents of a shard is not trivial since multiple filesystem directories are involved
  • the code is more complicated since we need to wrap Lucene's directory abstraction to make this work
  • if a user misconfigures a node, all shard data is lost, e.g. due to reordering of the paths

This issue aims to move away from this shard striping towards one data path per shard, such that all
files for an Elasticsearch shard (Lucene index) live in the same directory and no special code is needed.

This requires some changes to how we balance disk utilization. Today we pick a directory whenever a file
needs to be written; now we need to select the directory ahead of time. While this seems trivial,
since we can just pick the least-utilized disk, we need to take the expected size of the shard into account
and decide whether we can allocate the shard on a certain node at all. This can have implications for how we make that
decision, since today the space of all data paths is accumulated for it.
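
A minimal sketch of that kind of ahead-of-time path selection, assuming a hypothetical ShardPathSelector helper and an expectedShardSizeBytes estimate (neither is the actual Elasticsearch API):

```java
// Illustrative only: pick a data path for a new shard ahead of time, instead of
// per-file as the striped distributor does today. ShardPathSelector and the
// expectedShardSizeBytes estimate are hypothetical, not the Elasticsearch API.
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ShardPathSelector {

    /**
     * Returns the data path with the most usable space that can still hold the
     * expected shard size, or null if no configured path has enough room (in
     * which case the shard cannot be allocated on this node at all).
     */
    public static Path selectPath(List<Path> dataPaths, long expectedShardSizeBytes) throws IOException {
        Path best = null;
        long bestUsable = -1;
        for (Path dataPath : dataPaths) {
            FileStore store = Files.getFileStore(dataPath);
            long usable = store.getUsableSpace();
            if (usable >= expectedShardSizeBytes && usable > bestUsable) {
                bestUsable = usable;
                best = dataPath;
            }
        }
        return best;
    }
}
```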

Rather less trivial is the migration path. We need to move from multiple directories to a single directory,
likely in the background such that the upgrade process for users is smooth. There are several options:

  • upgrade on shard allocation: copy all the data into the new dedicated directory when the shard is allocated. This makes the selection of the dedicated directory simpler, since we handle one shard at a time, but it might be a pretty slow process for the upgrade.
  • upgrade over time: select a dedicated directory on shard allocation and redirect all writes to it. Merges or recoveries will then automatically use the dedicated directory, and an upgrade can be triggered via a merge or the upgrade API. The biggest issue is that we need to keep the distributor directory around; it is much simpler now, but it would be nice to drop it. Also, picking the dedicated shard directory might be harder since we don't really know how disks will fill up over time.
  • don't upgrade, and only move new indices to the per-shard path
  • use an upgrade tool that must run before node startup

However, I think the code changes can be done in two steps:

  • remove the old striped-RAID behavior for new indices
  • upgrade old indices

since the latter is much harder.

@s1monw s1monw added >enhancement v2.0.0-beta1 resiliency v1.5.0 :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. labels Jan 30, 2015
@kimchy
Member

kimchy commented Feb 5, 2015

+1 on this change. My suggestion for the upgrade is to upgrade on node startup automatically. We could scan the data directory, rebalance shard data paths during startup, and only then make the node available. If we can use a MOVE of the files it would be great, so the process stays lightweight, and if the MOVE fails we can resort to copying them over.
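
As an illustration only, a move-with-copy-fallback along these lines could look roughly like this in plain java.nio (helper names are hypothetical, not the actual implementation):

```java
// Illustrative only: move a shard file into its dedicated directory during node
// startup, resorting to copy-then-delete when a cheap move is not possible
// (e.g. source and target live on different filesystems). Not the actual ES code.
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ShardFileMigrator {

    static void moveOrCopy(Path source, Path target) throws IOException {
        try {
            // fast path: a rename on the same filesystem, no data is rewritten
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // slow path: copy the bytes over, then remove the original
            Files.copy(source, target, StandardCopyOption.COPY_ATTRIBUTES);
            Files.delete(source);
        }
    }
}
```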

@s1monw s1monw removed the v1.5.0 label Mar 2, 2015
@s1monw
Contributor Author

s1monw commented Mar 2, 2015

moved out to 2.0 for now

@s1monw s1monw added the blocker label Mar 2, 2015
@joealex

joealex commented Mar 18, 2015

This is a very common scenario when you have very large clusters and multiple disks on each node: some disk is bound to fail every few days. We had a bunch of disk failures over a few days, and it took some time to figure out that the cause was shards being striped across multiple disks. When it runs it is perfect, since the load is balanced and the spindles of the multiple disks are utilized. We had to shut down ES on the affected node, remove the disk from the ES config, and bring ES back up after some time. This forced the node to refresh its data from the other nodes.

With all shard data on one disk, balancing/utilizing all disks may have to be given a good look, i.e. whether this will cause some disks to be over-utilized because an index living there receives more data. Read performance is also a concern, since everything now comes from a single disk.

@joealex joealex mentioned this issue Mar 18, 2015
@nik9000
Member

nik9000 commented Mar 18, 2015

+1 on doing this.

> likely in the background such that the upgrade process for users is smooth

I'm not sure how this can be nicely automated... I think the optimal process involves slowly moving away from this by moving shards off of a node (a few nodes?), taking it out of the cluster, reformatting its multiple disks as proper RAID, adding it back to the cluster, and letting the cluster heal. I suppose in many respects it's similar to replacing the disks entirely. We did that about a year ago with great success, by the way.

@josephglanville

Overall +1 to this change for increasing durability of shards.

However, once this change goes through, one will need to resort either to RAID of some description or to more shards to make up for the loss of spindles.

In light of this change, having multiple shards allocated to the same node would actually be beneficial in terms of IO throughput, but could anyone comment on the drawbacks of this? It seems to be the most elegant way to restore the lost performance without resorting to external systems.

If the increase in shards is indeed feasible, it would also be nice, as a later optimisation, to attempt to allocate shards of the same index to different disks.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 7, 2015
This commit moves away from simulating striped RAID-0 across multiple
data paths towards using a single path per shard. Multiple data paths are still
supported, but a shard and its data are no longer striped across multiple paths / disks.
This will, for instance, prevent losing all shards if a single disk is corrupted.

Indices that already use this feature will automatically be upgraded to a single
data path based on a simple disk-space heuristic. In general there must be enough
disk space to move a single shard at any time, otherwise the upgrade will fail.

Closes elastic#9498
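
For illustration, the "enough disk space to move a single shard" check the commit message describes could be sketched roughly as follows (class and method names are hypothetical, not the code from the referenced commit):

```java
// Illustrative only: the "enough disk space to move a single shard" check the
// commit message describes. Class and method names are hypothetical.
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class MultiDataPathUpgradeCheck {

    /** Sums the size of all files in the shard's old striped directories. */
    static long shardSize(List<Path> stripedShardDirs) throws IOException {
        long total = 0;
        for (Path dir : stripedShardDirs) {
            try (Stream<Path> files = Files.walk(dir)) {
                total += files.filter(Files::isRegularFile).mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }).sum();
            }
        }
        return total;
    }

    /** Fails the upgrade unless the chosen target path can hold the whole shard. */
    static void ensureTargetHasRoom(Path targetDataPath, long shardSizeBytes) throws IOException {
        long usable = Files.getFileStore(targetDataPath).getUsableSpace();
        if (usable < shardSizeBytes) {
            throw new IOException("not enough space on " + targetDataPath
                    + " to consolidate shard: " + shardSizeBytes + " bytes needed, "
                    + usable + " available");
        }
    }
}
```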
s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 29, 2015
This commit removes the obsolete settings for distributors and updates
the documentation on multiple data.path. It also adds an explanation to the
migration guide.

Relates to elastic#9498
Closes elastic#10770
@clintongormley clintongormley added the :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. label Feb 13, 2018
@clintongormley clintongormley removed the :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. label Feb 13, 2018