
Force merge behaviour unreliable #102594

Open
EgbertW opened this issue Nov 24, 2023 · 4 comments
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >docs General docs changes Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team

Comments


EgbertW commented Nov 24, 2023

Elasticsearch Version

8.10.3

Installed Plugins

No response

Java Version

bundled

OS Version

Linux 5.15.107+ #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I maintain an index that relies on force merging to 1 segment for performance reasons. The process is:

  1. Create index, replicas set to 0
  2. Bulk insert millions of documents into it
  3. Refresh index
  4. Force merge with max_num_segments set to 1, wait_for_completion=false
  5. Use task API to read task until task is done
  6. Use stats API to get the number of primary segments
  7. If the number of primary segments is more than the number of shards, go back to step 4 and repeat
  8. Set number_of_replicas to 5, wait for replication to complete
  9. Start using index for querying

Multiple concurrent processes like this can run on the same cluster. We've noticed on multiple occasions that steps 3 & 4 are not enough: the task completes successfully, but more than 1 segment still exists for 1 or more shards. So we added step 6 to retry if that happens.

A new issue surfaced recently: even step 6 was not good enough. The stats API reported a number of primary segments equal to the number of shards for several minutes, so replication was initiated. During the allocation of the additional replicas, the number of segments jumped back up to 44. Essentially, the force merge on one shard was reversed.
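
For reference, here is a minimal sketch of the workflow above using the official Python client; the index name, shard count, replica count, and polling interval are hypothetical placeholders, not part of the original report.

```python
import time

from elasticsearch import Elasticsearch  # official Python client, assumed available

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
index = "my-index"                           # hypothetical index name
num_shards = 6

# Steps 1-2: create the index without replicas, then bulk-load documents (elided)
es.indices.create(
    index=index,
    settings={"number_of_shards": num_shards, "number_of_replicas": 0},
)
# ... bulk indexing of millions of documents happens here ...

# Step 3: refresh once indexing is done
es.indices.refresh(index=index)

while True:
    # Step 4: start the force merge in the background
    resp = es.indices.forcemerge(
        index=index, max_num_segments=1, wait_for_completion=False
    )
    task_id = resp["task"]

    # Step 5: poll the task API until the force-merge task reports completion
    while not es.tasks.get(task_id=task_id)["completed"]:
        time.sleep(60)

    # Steps 6-7: check the primary segment count via the stats API; retry if needed
    stats = es.indices.stats(index=index, metric="segments")
    if stats["_all"]["primaries"]["segments"]["count"] <= num_shards:
        break

# Step 8: enable replicas; querying starts once replication has completed
es.indices.put_settings(index=index, settings={"index.number_of_replicas": 5})
```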

This mostly seems to happen when multiple indices are force merged at the same time. We noticed log messages like:

now throttling indexing: numMergesInFlight=10, maxNumMerges=9
stop throttling indexing: numMergesInFlight=8, maxNumMerges=9

This message appears to be a red flag - it looks like not only is indexing throttled, but force merging is also partially cancelled.

This all feels like a bug somewhere in the force merge administration - the task is not monitored properly and is not guaranteed to complete, and even if it does appear to complete correctly, there appears to be a chance that the force merge is reversed.

Steps to Reproduce

  • Have an Elasticsearch cluster and set the force_merge thread pool size to X=8
  • Create 2 separate indices with 0 replicas, with S=6 shards each (where 2xS is more than X=8)
  • Bulk insert millions of documents into both of them, concurrently (enough to be sure the force merge takes a significant amount of time)
  • As soon as all documents are indexed into an index, trigger a refresh on it
  • Directly after the refresh, initiate a force merge with max_num_segments set to 1 (do this separately for both indices)
  • Use the task API to monitor the resulting tasks for completion
  • Initiate replication on the indices as soon as the tasks are done
  • Monitor the number of segments on the primary shards throughout the entire process

Rinse & repeat a couple of times if necessary. Observe that the number of primary segments is not guaranteed to be 12 (6 on each index, 1 on each shard) after this.

The only lead we currently have is this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-forcemerge.html

that states:

Shards that are relocating during a forcemerge will not be merged.

Theoretically, the replication of one index could cause a rebalancing for primary shards on the other, effectively nullifying the effect of the force merge that is still ongoing.

It would be great to have some reliable grip on the process and to be sure that it completed correctly.

Logs (if relevant)

now throttling indexing: numMergesInFlight=10, maxNumMerges=9
stop throttling indexing: numMergesInFlight=8, maxNumMerges=9

@EgbertW EgbertW added >bug needs:triage Requires assignment of a team area label labels Nov 24, 2023
@mayya-sharipova mayya-sharipova added :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. and removed needs:triage Requires assignment of a team area label labels Nov 24, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Nov 24, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed/Task Management Issues for anything around the Tasks API - both persistent and node level. labels Nov 25, 2023
@tlrx
Member

tlrx commented Nov 29, 2023

This all feels like a bug somewhere in the force merge administration - the task is not monitored properly and is not guaranteed to complete, and even if it does appear to complete correctly, there appears to be a chance that the force merge is reversed.

Nothing in Elasticsearch reverts a completed force merge, but there are various operations that may produce new segments or retain existing segments while the force merge is running.

Among the ones I can think of:

  • indexing is not really stopped
  • refreshes produce new segments (think of the default refresh interval of 1 second, or an external client refreshing indices)
  • shard relocations may retain segments until the relocation is done
  • snapshots retain segments until the snapshot is completed
  • the indexing memory controller can activate indexing throttling and flush the largest shards (which produces more segments)
  • point-in-time and scroll search queries retain segments

Your use case probably needs to be adapted to take all these operations into consideration (and there may be more; sorry if I forgot some of them).

A good start would be to create the index with a refresh interval of -1 and to index through an alias so that the alias can be dropped once bulk indexing is done (or even better, use Security and specific permissions to ensure your process is the only possible indexing client). Then flush the index and add a write block to it (see the Add Block API); this would guarantee no further indexing. The index can then be relocated to a node with less active indexing, so that indexing throttling won't be activated and shard relocation due to disk space is less likely to happen. Then force merge to 1 segment and use the Task API to see the results.
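
A rough sketch of that suggested hardening, again using the Python client, might look like the following; the index name, alias name, and shard count are placeholders, and polling of the task is elided.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
index = "my-index"           # hypothetical index name
alias = "my-ingest-alias"    # hypothetical write alias, dropped after indexing

# Create the index with automatic refreshes disabled and no replicas
es.indices.create(
    index=index,
    settings={
        "number_of_shards": 6,
        "number_of_replicas": 0,
        "refresh_interval": "-1",
    },
)
es.indices.put_alias(index=index, name=alias)

# ... bulk indexing through the alias happens here ...

# Once indexing is done: drop the alias, flush, and block writes so that
# no new segments can be created behind the force merge's back
es.indices.delete_alias(index=index, name=alias)
es.indices.flush(index=index)
es.indices.add_block(index=index, block="write")

# Force merge to a single segment in the background and follow the task
resp = es.indices.forcemerge(index=index, max_num_segments=1, wait_for_completion=False)
task = es.tasks.get(task_id=resp["task"])  # poll until task["completed"] is True
```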

I'm keeping this issue open since I think we can at least add a note to the documentation about the various reasons that could cause a shard to have more segments than expected after a force merge.

@tlrx tlrx added >docs General docs changes and removed >bug labels Nov 29, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Docs Meta label for docs team label Nov 29, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@EgbertW
Author

EgbertW commented Jan 9, 2024

@tlrx Thanks for your response. It took me a while to find the time to respond.

I understand there is no such thing as a reversal. Then there must be some misalignment; I'm not sure what's causing it. In the indexing task, all documents are inserted into Elasticsearch, the force merge API is called, and then periodically (every minute or so) the <indexname>/_segments endpoint is called to get the number of segments.

The behaviour I noticed was that at some point num_committed_segments in the reply was 1, and then a minute later it suddenly reverted back to its old value of 45. I definitely can't rule out that the force merge was aborted by any of the causes you list; shard relocations seem the most likely culprit. However, the possibility of a force merge not succeeding completely is taken into account in the code - that is why the _segments endpoint is called after the task is finished. Yet it turns out that even if _segments reports that there is one segment, there's still a possibility of it not being accurate. This feels like a bug to me.
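
For illustration, a small sketch of that per-shard check against the _segments endpoint could look like this (Python client, hypothetical index name):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
index = "my-index"                           # hypothetical index name

resp = es.indices.segments(index=index)
for shard_id, copies in resp["indices"][index]["shards"].items():
    for copy in copies:
        if copy["routing"]["primary"]:
            # A fully merged primary should report a single committed segment
            print(
                f"shard {shard_id}: "
                f"committed={copy['num_committed_segments']}, "
                f"search={copy['num_search_segments']}"
            )
```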

Preferably I'd like a force merge task that allows me to state that no matter what happens and no matter how long it takes, I want the task to continue until there really is just one segment left in every shard. If this is unfeasible, it would at least be nice to have some reliable way to find out afterwards whether it succeeded.

The indexing process already has refresh disabled. We're only calling the _refresh endpoint after we've got confirmation that all documents have been indexed by the _bulk endpoint.

Right now the solution we're going with is:

  • Do the indexing
  • Try to avoid, as much as possible, all the conditions that may cause the force merge to fail
  • This includes postponing any shard allocations triggered by our process
  • Call force merge and wait for the task to finish
  • Check _segments to see if it worked

Because _segments has proven to be not completely accurate, we're now considering waiting for a couple of minutes and then calling _segments again to see if the number of segments is still 1, but this feels quite quirky.
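
A sketch of that "check, wait, check again" workaround might look like the following; the delay and the expected segment count are placeholders:

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
index = "my-index"                           # hypothetical index name
expected = 6                                 # one segment per primary shard

def primary_segment_count() -> int:
    stats = es.indices.stats(index=index, metric="segments")
    return stats["_all"]["primaries"]["segments"]["count"]

first = primary_segment_count()
time.sleep(120)  # wait a couple of minutes before re-checking
second = primary_segment_count()

merged = first <= expected and second <= expected
print(f"force merge held: {merged} (first check={first}, second check={second})")
```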

So, to be honest, I don't think this would be solved by just updating the documentation; there's something more going on.
