Recovery files left behind when replica building fails #7315

Closed
avleen opened this issue Aug 18, 2014 · 9 comments

Comments

@avleen

avleen commented Aug 18, 2014

We have a situation where several indices need replicas either relocated or rebuilt (we're not sure which of the two caused this, but I think it was the initial replica build rather than relocation).

In one case, the nodes we were trying to send the shards to went over their high disk threshold, and the recovery was aborted.
In another, we hit the recently discovered bug involving recovery and compression.

In both cases (afaict), the shard directory on disk was littered with files named recovery.*, sometimes terabytes of them.
Even when the replica build was cancelled, moved on to another host, etc., those files weren't cleaned up.
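
For the disk-threshold case, a quick way to see how close nodes are to the high watermark is the cat allocation API. A minimal sketch, assuming a node reachable on localhost:9200 (the host/port is an assumption, not part of the original report):

```sh
# Per-node disk usage vs. total disk, plus shard counts (v = show headers)
curl -s 'localhost:9200/_cat/allocation?v'

# Explicitly-set cluster settings, including any disk watermark overrides
# (defaults are not listed in this output)
curl -s 'localhost:9200/_cluster/settings?pretty'
```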

@suitingtseng

I also ran into the same problem.
Does anyone have a solution?
Can I just delete the recovery.* files?

@bleskes
Contributor

bleskes commented Sep 14, 2014

@suitingtseng which version of ES are you using?

@avleen sorry for the late response. Is this issue still a problem? Which version are you currently on?

I'm wondering if this is another manifestation of #7386 (comment)

@suitingtseng

I am using 1.3.1.
Thanks for your reply.

@avleen
Author

avleen commented Sep 15, 2014

I found this happening on 1.3.1 also, with the compressed recovery bug. I haven't had recovery failures since then, so I have no more data to gauge this with :(


@bleskes
Contributor

bleskes commented Sep 15, 2014

@suitingtseng thx. Did you do a full cluster restart since being on 1.1.0, as in #7386 (comment)? Are there any other errors in your logs?

If your cluster is all green now, you can safely delete the recovery.* files, though we should figure out why they don't go away on their own...
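
A minimal sketch of that check, assuming a node on localhost:9200: confirm the cluster is green before deleting anything on disk.

```sh
# "green" means every primary and replica shard is assigned
curl -s 'localhost:9200/_cat/health?v'
```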

@jlintz

jlintz commented Sep 22, 2014

Just ran into what I believe is this situation as well. After 24 hours, some indices were moved to different nodes. On 2 of the "slow" nodes, 2 shards got into a state where they are growing uncontrollably (they should be around 30GB but are 1.3TB). The shard directory is littered with recovery.* files. We're running ES 1.3.0 and new recovery files keep being created. The log on the "slow" node is complaining about "File corruption occured on recovery but checksums are ok".

To expand a bit further, it seems these shards are caught in an endless recovery cycle, creating new recovery files over and over without ever completing.
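
To see how much space the leftover files are taking, a small sketch (the data path below is an assumption; substitute your own path.data, and GNU find/sort are assumed):

```sh
# List recovery temp files, largest first
find /var/lib/elasticsearch -type f -name 'recovery.*' -printf '%s\t%p\n' \
  | sort -rn | head -20
```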

bleskes added a commit that referenced this issue Oct 23, 2014
This commit rewrites the state controls in the RecoveryTarget family of classes to make it easier to guarantee that:
- recovery resources are only cleared once there are no ongoing requests
- recovery is automatically canceled when the target shard is closed/removed
- canceled recoveries do not leave temp files behind

Highlights of the change:
1) All temporary files are cleared upon failure/cancel (see #7315)
2) All newly created files are always temporary
3) Doesn't list local files on the cluster state update thread (which could throw unwanted exceptions)
4) Recoveries are canceled by a listener to IndicesLifecycle.beforeIndexShardClosed, so we don't need to explicitly call it
5) Simplifies RecoveryListener to only notify when a recovery is done or failed. Removed subtleties like ignore and retry (they are dealt with internally)

Closes #8092, closes #7315

@yangou

yangou commented Jan 23, 2015

Has anyone solved this issue, or is there a workaround?
Simple question: can I just manually rm those recovery files? They are eating up most of my disk space.
I have tons of recovery files too, and I'm upgrading from 1.1 to 1.4.2.

@avleen
Author

avleen commented Jan 24, 2015

I manually rm'd them here, and it was OK.


@bleskes
Contributor

bleskes commented Jan 24, 2015

@yangou the issue is solved, but the fix will be released with 1.5.0 (no ETA yet). You can safely remove the recovery.* files of all old recoveries. To check which recoveries are currently active, you can call GET _cat/recovery?active_only=true.
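
A minimal cleanup sketch based on the above, assuming a node on localhost:9200 and a path.data of /var/lib/elasticsearch (both assumptions; the -mtime filter is just an extra safety margin, not something required by the fix):

```sh
# 1) Shards listed here still have active recoveries; leave their files alone
curl -s 'localhost:9200/_cat/recovery?active_only=true&v'

# 2) Remove leftover recovery temp files that haven't been touched in over a day
find /var/lib/elasticsearch -type f -name 'recovery.*' -mtime +1 -delete
```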
