
Balance new shard allocations more evenly on multiple path.data #11185

Closed
mikemccand wants to merge 5 commits

Conversation

mikemccand
Contributor

This change adds a simplistic heuristic to try to balance new shard allocations across multiple data paths on one node.

It very roughly predicts (guesses!) how much disk space a shard will eventually use, as the max of the current avg. size of shards across the cluster and 5% of the current free space across all path.data on the current node, and then reserves that space by counting how many shards are now assigned to each path.data.

Picking the best path.data for a new shard uses the same "most free space" logic, except it now deducts the reserved space.
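
In pseudocode, the idea is roughly the following (an illustrative sketch, not the actual patch; the class and method names are placeholders):

```java
import java.util.Map;

// Illustrative sketch only; names are placeholders, not the actual patch code.
final class NewShardPathHeuristicSketch {

    /** Rough guess at how much disk the new shard will eventually use. */
    static long estimateShardSizeInBytes(long avgShardSizeAcrossCluster, long totalFreeBytesOnNode) {
        // Max of the cluster-wide average shard size and 5% of the node's total free space.
        return Math.max(avgShardSizeAcrossCluster, (long) (0.05 * totalFreeBytesOnNode));
    }

    /** Picks the path with the most free space after deducting space "reserved" by assigned shards. */
    static String selectBestPath(Map<String, Long> freeBytesPerPath,
                                 Map<String, Integer> shardsAssignedPerPath,
                                 long estimatedShardSizeBytes) {
        String bestPath = null;
        long bestUsable = Long.MIN_VALUE;
        for (Map.Entry<String, Long> entry : freeBytesPerPath.entrySet()) {
            int assigned = shardsAssignedPerPath.getOrDefault(entry.getKey(), 0);
            long usable = entry.getValue() - (long) assigned * estimatedShardSizeBytes;
            if (usable > bestUsable) {
                bestUsable = usable;
                bestPath = entry.getKey();
            }
        }
        return bestPath;
    }
}
```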

I tested this on an EC2 instance with 2 SSDs with nearly the same amount of free space and confirmed we now put 2 shards on one SSD and 3 shards on the other, vs all 5 shards on a single path with master today, but I'm not sure how to make a standalone unit test ... maybe I can use a MockFS to fake up N path.datas with different free space?

This is just a heuristic, and it easily has adversarial cases that will fill up one path.data while other path.data on the same node still have plenty of space, and unfortunately ES can't recover from that today. E.g., DiskThresholdDecider won't even detect any problem (since it sums up total free space across all path.data) ... I think we should separately think about fixing that, but at least this change improves the current situation.

Closes #11122

mikemccand added the >bug, v2.0.0-beta1, blocker, and :Core/Infra/Core labels on May 15, 2015
mikemccand self-assigned this on May 15, 2015
for (Tuple<IndexShard, Injector> shardInjectorTuple : shards.values()) {
    IndexShard shard = shardInjectorTuple.v1();
    // Remove indices/<index>/<shardID> subdirs from the statePath to get back to the path.data:
    Path nodeDataPath = shard.shardPath().getShardStatePath().getParent().getParent().getParent();
Member

This seems error-prone if the structure could change in the future? I don't know a better way, though...

Contributor Author

I'll move it to a new NodePath method; that way the directory structure details remain hidden in the NodeEnv/NodePath abstraction...
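
For example, a tiny helper along these lines (the method name and placement here are hypothetical; the point is just that the layout knowledge stays inside the NodeEnv/NodePath abstraction):

```java
import java.nio.file.Path;

// Hypothetical NodePath-style helper; the actual name and location may differ.
final class NodePathHelperSketch {
    /** Walks back from indices/<index>/<shardId> to the owning path.data root. */
    static Path dataPathForShardStatePath(Path shardStatePath) {
        return shardStatePath.getParent().getParent().getParent();
    }
}
```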

@rjernst
Member

rjernst commented May 15, 2015

This looks OK to me, although I do hope we keep tweaking this heuristic. I've tried to think of a better one but can't for now ... but I wish we could do something based on weighted shard count instead of the hard-to-reason-about mixed/shifting logic of average/5%.

ShardPath path = ShardPath.loadShardPath(logger, nodeEnv, shardId, indexSettings);
if (path == null) {
    path = ShardPath.selectNewPathForShard(nodeEnv, shardId, indexSettings);

    long totFreeSpace = 0;
Contributor

any chance we can move this heuristic into a utils class or into ShardPath as a static method?

Contributor Author

It's a little tricky because I'm iterating over IndexService's shards map ... but maybe I could break that out into a loop that first produces a map of shard to data path, and then pass that map to a new static method in ShardPath. I'll try.
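
Roughly, that split could look like this (an illustrative sketch; the types and the helper signature are placeholders, not the actual patch):

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: IndexService builds a map of existing shards to their data paths,
// then hands it to a static helper in ShardPath. "ShardRef" and the helper below are placeholders.
final class PathMapSketch {

    interface ShardRef {
        String id();
        Path dataPath();
    }

    static Path pickPathForNewShard(Iterable<ShardRef> existingShards, Path[] nodePaths) {
        // Step 1: record which path.data each existing shard lives on.
        Map<String, Path> shardToPath = new HashMap<>();
        for (ShardRef shard : existingShards) {
            shardToPath.put(shard.id(), shard.dataPath());
        }
        // Step 2: pass the map to a static helper that applies the reservation heuristic.
        return selectNewPathForShard(shardToPath, nodePaths);
    }

    // Placeholder for the static ShardPath helper discussed above.
    static Path selectNewPathForShard(Map<String, Path> shardToPath, Path[] nodePaths) {
        return nodePaths[0]; // the real logic picks the path with the most usable space
    }
}
```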

@s1monw
Contributor

s1monw commented May 15, 2015

mike this looks good to me - can we somehow have a test for this?

@mikemccand
Contributor Author

can we somehow have a test for this?

I agree, but it's tricky: I think I need a new MockFS that fakes how much free space there is on each path. I can try ...

@mikemccand
Contributor Author

This looks OK to me, although I do hope we keep tweaking this heuristic. I've tried to think of a better one but can't for now ... but I wish we could do something based on weighted shard count instead of the hard-to-reason-about mixed/shifting logic of average/5%.

The challenge with only using the current shard count is that we then don't take into account how much space the already-allocated shards are using. E.g., maybe one path has only a few shards and is nearly full, while another path has quite a few shards but lots of free space.

No matter the heuristic here, there will be easy adversarial cases against it, so in the end this will be at best a "starting guess": we can't predict the future.

To fix this correctly we really need shard allocation to separately see / allocate across each path.data on a node, so we can move a shard off a path.data that is filling up even if other path.data on that same node have plenty of space.

@rmuir
Contributor

rmuir commented May 17, 2015

I agree, but it's tricky: I think I need a new MockFS that fakes how much free space there is on each path. I can try ...

this requires QUITE a few steps, and please note that ES's file management (especially tests) is simply not ready to juggle multiple nio filesystems at once (various copy/move routines are unsafe there).

Separately, an out-of-disk space mockfs would be great. But please don't add a complicated test, and please don't use multiple nio.2 filesystems when ES isn't ready for that yet.

Test what is reasonable to test, and then if we want to do better, some cleanups are needed first. I have been working on these things, but it is quite difficult without straight-up banning methods, because more tests are added all the time.

@mikemccand
Contributor Author

this requires QUITE a few steps, and please note that ES's file management (especially tests) is simply not ready to juggle multiple nio filesystems at once (various copy/move routines are unsafe there).

OK this sounds hard :) I'll try to make a more direct test that just tests the heuristic logic w/o needing MockFS ...

@mikemccand
Contributor Author

I pushed a new commit addressing feedback (thanks!).

However, I gave up on making a direct unit test for ShardPath.selectNewPath ... I tried to simplify the arguments passed to it (e.g. changing the iterator argument types and extracting the shard's paths "up above") so that it was more easily tested, but it just became too forced ...

I did retest on EC2 and confirmed the 5 shards are split 2 and 3 across the two SSDs.

I'll open a follow-on issue to allow shards to be relocated to a different path.data within a single node.

final Path dataPath;
final Path statePath = minUsed;

if (NodeEnvironment.hasCustomDataPath(indexSettings)) {
Member

I think we should move this above the calculation. If a custom data path is used, we don't need to do all the space calculations so they can be skipped.

Contributor Author

Ahh good point, I'll fix.

Contributor Author

Hmm, I'm confused: when a custom data path is used, we still pass minUsed as ShardPath.shardStatePath ... is this not used when there is a custom data path?

Contributor

if we have a custom data path it can only be one path, so there is nothing to select?

Contributor Author

if we have a custom data path it can only be one path, so there is nothing to select?

This is what I expected too, but go look at ShardPath.java right now in master: it still passes the loadedPath to the new ShardPath (as shardStatePath) even in the custom data path case. This confuses me :)

I think it means we store shard state (ShardStateMetaData) on the local node's path.data even if a custom data path is set?

@mikemccand
Contributor Author

One problem with this change is it's making a "blind guess" about how big a shard will be, yet if it's a relocation of an existing shard, the source node clearly "knows" how large it is, so it's crazy not to use this.

@dakrone is working on getting this information published via ClusterInfo or index meta data or something (thanks!), then I'll fix this change to use that, so we can make a more informed "guess".

Even so, this is just the current size of the shard, and we still separately need to fix #11271 so shards on one path.data that's filling up will still be relocated even if other path.data on the same node have plenty of space.

/**
 * Just records average size of shards in the cluster as a heuristic to help when allocating new shards to data paths.
 */
class ShardSizeListener implements ClusterInfoService.Listener {
Contributor

I think this causes a memory leak, since IndexService is created per index and if it's closed you need to remove this. You should add this listener in IndicesService.

Contributor

I also think clusterInfoService is only started on the master

Contributor

I think for now we should just iterate over all the shards on this node to get an idea...?

@s1monw
Contributor

s1monw commented Jun 12, 2015

I left some comments here again. I think we should move forward here without the infrastructure to make it perfect. It's a step in the right direction?

…th, and always store state on NodePaths[0] in that case
…de based on 'refreshed every 10 sec by default' store stats, to get a 'guess' at this new shard's expected size
@mikemccand
Contributor Author

OK I folded in feedback here.

I avoid this logic when a custom data path is set, and force the statePath to be NodePaths[0] in that case.

And I removed the ClusterInfo/ClusterInfoService usage and instead take the avg. size of all shards already on the node.
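
Sketched out, those two changes look roughly like this (illustrative names, not the committed ShardPath/NodeEnvironment code):

```java
import java.nio.file.Path;
import java.util.List;

// Sketch of the two changes described above; names are approximations.
final class FinalSelectionSketch {

    /** With a custom data path there is nothing to balance: state always lives on NodePaths[0]. */
    static Path statePathFor(boolean hasCustomDataPath, Path[] nodePaths) {
        if (hasCustomDataPath) {
            return nodePaths[0]; // skip all of the free-space math
        }
        return null; // otherwise fall through to the heuristic selection over all nodePaths
    }

    /** The expected-size guess now uses the average of shards already on this node. */
    static long estimateNewShardSize(List<Long> shardSizesOnNodeBytes, long totalFreeBytesOnNode) {
        long sum = 0;
        for (long size : shardSizesOnNodeBytes) {
            sum += size;
        }
        long avgOnNode = shardSizesOnNodeBytes.isEmpty() ? 0 : sum / shardSizesOnNodeBytes.size();
        return Math.max(avgOnNode, (long) (0.05 * totalFreeBytesOnNode));
    }
}
```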

dakrone added a commit to dakrone/elasticsearch that referenced this pull request Jun 12, 2015
This commit makes the `ClusterInfo` object available to all nodes via a
custom ClusterState object. The cluster info contains the following
information:

1. Disk usage information about each node in the cluster.

2. Shard sizes for all primary shards in the cluster.

3. Relative classifications for the size of each index.

Each index is classified from SMALLEST to LARGEST based on its size,
with SMALL, MEDIUM, and LARGE falling equidistantly between smallest and
largest.

This information can be used in the future for determining the best data
path for a particular shard, enhancing the `DiskThresholdDecider`, or
making routing decisions for a shard based on its relative size in the
cluster. It can be accessed by using something like:

```java
ClusterInfo info = clusterService.state().custom(ClusterInfo.TYPE);
Map<String, DiskUsage> nodeDiskUsage = info.getNodeDiskUsages();
Map<ShardId, Long> shardSizes = info.getShardSizes();
IndexClassification classification = info.getIndexClassification();
```

The `ClusterInfo` object is also available to inspect from the cluster
state HTTP endpoint, which returns a response like:

```json
{
  ... other cluster state ...
  "cluster_info" : {
    "node_disk_usage" : {
      "nFlT3nFzQRyvixcYv6Yfuw" : {
        "free_bytes" : 37556461568,
        "used_bytes" : 23225233408
      }
    },
    "shard_size" : {
      "[test][4]" : 156,
      "[test][2]" : 156,
      "[test][3]" : 3007,
      "[foo][0]" : 127,
      "[wiki][0]" : 12728802,
      "[test2][4]" : 156,
      "[test2][3]" : 2904,
      "[wiki2][1]" : 23338909,
      "[test2][2]" : 156,
      "[test2][1]" : 156,
      "[test2][0]" : 156,
      "[test][0]" : 156,
      "[test][1]" : 156
    },
    "classifications" : {
      "test2" : "SMALLEST",
      "test" : "SMALLEST",
      "wiki2" : "LARGEST",
      "foo" : "SMALLEST",
      "wiki" : "MEDIUM"
    }
  }
}
```

Relates to work on elastic#11185
@@ -270,6 +273,23 @@ public String indexUUID() {
return indexSettings.get(IndexMetaData.SETTING_UUID, IndexMetaData.INDEX_UUID_NA_VALUE);
}

// NOTE: O(numShards) cost, but numShards should be smallish?
private long getAvgShardSizeInBytes() throws IOException {
    Iterator<IndexShard> it = this.iterator();
Contributor

can we do `for (IndexShard shard : this) {` ... ? Not a big deal though.

Contributor Author

Oh yeah I'll fix that and push, thanks @s1monw
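
I.e., something like this (a sketch of the suggested for-each form inside IndexService; the store-size accessor is an assumption about the IndexShard API, not a verified call):

```java
// Sketch only; shard.store().stats().sizeInBytes() is assumed, not confirmed against the actual API.
private long getAvgShardSizeInBytes() throws IOException {
    long sum = 0;
    int count = 0;
    for (IndexShard shard : this) {
        sum += shard.store().stats().sizeInBytes();
        count++;
    }
    return count == 0 ? -1 : sum / count;
}
```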

@s1monw
Contributor

s1monw commented Jun 16, 2015

left a tiny comment otherwise LGTM

@s1monw
Copy link
Contributor

s1monw commented Jun 17, 2015

LGTM


Successfully merging this pull request may close these issues.

Shard allocation should work harder to balance new shards across multiple path.data paths
5 participants