Recovery files left behind when replica building fails #7315
Comments
I also ran into the same problem.
@suitingtseng which version of ES are you using? @avleen sorry for the late response. Is this issue still a problem? Which version are you currently on? I'm wondering if this is another manifestation of #7386 (comment).
I am using 1.3.1.
I found this happening on 1.3.1 also, with the compressed recovery bug.
@suitingtseng thx. Did you do a full cluster restart since being on 1.1.0, as in #7386 (comment)? Are there any other errors in your logs? If your cluster is all green now, you can safely delete the recovery.* files, though we should figure out why they don't go away on their own...
Just ran into what I believe is this situation as well. After 24 hours, some indices are moved to different nodes. On 2 of the "slow" nodes, 2 shards got into a state where they are growing uncontrollably (they should be around 30 GB but are 1.3 TB). Inside the shard directory it's littered with recovery.* files. Running ES 1.3.0, and new recovery files keep being created. The log on the "slow" node is complaining about "File corruption occured on recovery but checksums are ok". To expand a bit further, it seems they are caught in an endless recovery cycle, creating new recovery files over and over, unable to repair.
This commit rewrites the state controls in the RecoveryTarget family classes to make it easier to guarantee that:
- recovery resources are only cleared once there are no ongoing requests
- recovery is automatically canceled when the target shard is closed/removed
- canceled recoveries do not leave temp files behind

Highlights of the change:
1) All temporary files are cleared upon failure/cancel (see #7315)
2) All newly created files are always temporary
3) Doesn't list local files on the cluster state update thread (which throws unwanted exceptions)
4) Recoveries are canceled by a listener to IndicesLifecycle.beforeIndexShardClosed, so we don't need to call it explicitly
5) Simplifies RecoveryListener to only notify when a recovery is done or failed. Removed subtleties like ignore and retry (they are dealt with internally)

Closes #8092, Closes #7315
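For illustration only — this is not the Elasticsearch source, just a minimal Python sketch of the idea behind points 1 and 4 above: a ref-counted recovery target that refuses new work once canceled (e.g. from a shard-closed notification) and deletes its temporary recovery.* files exactly once, after the last in-flight request finishes. All names and the directory layout here are hypothetical.

```python
import glob
import os
import threading

class RecoveryTargetSketch:
    """Sketch only: temp files are removed exactly once, and only after
    the last in-flight request has finished."""

    def __init__(self, shard_dir):
        self.shard_dir = shard_dir   # hypothetical shard data directory
        self._refs = 1               # the recovery itself holds one reference
        self._closed = False         # set once the recovery is canceled
        self._lock = threading.Lock()

    def try_incref(self):
        """Called at the start of each incoming recovery request."""
        with self._lock:
            if self._closed:
                return False         # refuse new work once canceled
            self._refs += 1
            return True

    def decref(self):
        """Called when a request (or the recovery itself) completes."""
        with self._lock:
            self._refs -= 1
            last = (self._refs == 0)
        if last:
            self._cleanup_temp_files()   # cleared only when nothing is in flight

    def cancel(self):
        """Would be driven by a 'shard closed' notification: stop new work
        and drop the recovery's own reference."""
        with self._lock:
            if self._closed:
                return
            self._closed = True
        self.decref()

    def _cleanup_temp_files(self):
        # Delete temporary recovery.* files so a canceled or failed recovery
        # leaves nothing behind on disk.
        for path in glob.glob(os.path.join(self.shard_dir, "recovery.*")):
            os.remove(path)
```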
Has anyone solved this issue, or is there any workaround?
I manually rm'd them here, and it was OK.
@yangou the issue is solved, but the fix will be released with 1.5.0 (no ETA yet). You can safely remove the recovery.* files of all old recoveries. To check which recoveries are currently active, you can call
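The end of that comment appears to be cut off, so the exact call isn't shown. For reference, here is a minimal sketch of one way to check for active recoveries and then delete leftover files, assuming a node reachable at localhost:9200 and a hypothetical data path of /var/lib/elasticsearch, using the indices recovery API:

```python
import glob
import json
import os
from urllib.request import urlopen

ES = "http://localhost:9200"          # assumed Elasticsearch HTTP endpoint
DATA_DIR = "/var/lib/elasticsearch"   # assumed path.data; adjust for your install

# 1) Ask the cluster which recoveries are still running (indices recovery API).
recoveries = json.load(urlopen(ES + "/_recovery?active_only=true"))
for index, info in recoveries.items():
    for shard in info.get("shards", []):
        print(index, shard["id"], shard["stage"])

# 2) Only if nothing is recovering, leftover recovery.* files are safe to remove.
if not recoveries:
    for path in glob.glob(os.path.join(DATA_DIR, "**", "recovery.*"), recursive=True):
        print("removing", path)
        os.remove(path)
```

The same information can also be viewed with the _cat/recovery endpoint; whichever you use, make sure the cluster reports no active recoveries for the affected shards before deleting anything.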
We have a situation where several indices need replicas either relocated or rebuilt (we're not sure exactly which of the two caused this situation, but I think it was the initial replica build, rather than relocation).
In one situation, the nodes which we were trying to send the shards to went over their high disk threshold, and the recovery was aborted.
In another, we tickled the recently found bug on recovery and compression.
In both cases (afaict), the shard directory on disk was littered with files named recovery.*. Sometimes terabytes of files. Even when the replica build was cancelled, moved on to another host, etc., those files aren't being cleaned up.