Recovery files left behind when replica building fails #7315

Closed
avleen opened this issue Aug 18, 2014 · 9 comments

Comments

@avleen

avleen commented Aug 18, 2014

We have a situation where several indices need replicas either relocated or rebuilt (we're not sure which of the two caused this, but I think it was the initial replica build rather than relocation).

In one case, the nodes we were trying to send the shards to went over their high disk threshold, and the recovery was aborted.
In another, we hit the recently discovered bug involving recovery and compression.

In both cases (afaict), the shard directory on disk was littered with files named recovery.*, sometimes terabytes of them.
Even when the replica build was cancelled, moved on to another host, etc., those files weren't cleaned up.
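
For the disk-threshold case, a quick way to see how close nodes are to the high watermark is the cat allocation API. A minimal sketch, assuming a node reachable on localhost:9200 (the host/port is an assumption, not part of the original report):

```sh
# Per-node disk usage vs. total disk, plus shard counts (v = show headers)
curl -s 'localhost:9200/_cat/allocation?v'

# Explicitly-set cluster settings, including any disk watermark overrides
# (defaults are not listed in this output)
curl -s 'localhost:9200/_cluster/settings?pretty'
```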

@suitingtseng

I also ran into the same problem.
Does anyone have a solution?
Can I just delete the recovery.* files?

@bleskes
Contributor

bleskes commented Sep 14, 2014

@suitingtseng which version of ES are you using?

@avleen sorry for the late response. Is this issue still a problem? Which version are you currently on?

I'm wondering if this is another manifestation of #7386 (comment)

@suitingtseng

I am using 1.3.1.
Thanks for your reply.

@avleen
Author

avleen commented Sep 15, 2014

I found this happening on 1.3.1 also, with the compressed recovery bug. I haven't had recovery failures since then, so I have no more data to gauge this with :(


@bleskes
Contributor

bleskes commented Sep 15, 2014

@suitingtseng thx. Did you do a full cluster restart since being on 1.1.0, as in #7386 (comment)? Are there any other errors in your logs?

If your cluster is all green now, you can safely delete the recovery.* files, though we should figure out why they don't go away on their own...
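
A minimal sketch of that check, assuming a node on localhost:9200: confirm the cluster is green before deleting anything on disk.

```sh
# "green" means every primary and replica shard is assigned
curl -s 'localhost:9200/_cat/health?v'
```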

@jlintz

jlintz commented Sep 22, 2014

Just ran into what I believe is this situation as well. After 24 hours, some indices were moved to different nodes. On 2 of the "slow" nodes, 2 shards got into a state where they are growing uncontrollably (they should be around 30GB but are 1.3TB). The shard directory is littered with recovery.* files. We're running ES 1.3.0 and new recovery files keep being created. The log on the "slow" node is complaining about "File corruption occured on recovery but checksums are ok".

To expand a bit further, it seems these shards are caught in an endless recovery cycle, creating new recovery files over and over without ever completing.
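
To see how much space the leftover files are taking, a small sketch (the data path below is an assumption; substitute your own path.data, and GNU find/sort are assumed):

```sh
# List recovery temp files, largest first
find /var/lib/elasticsearch -type f -name 'recovery.*' -printf '%s\t%p\n' \
  | sort -rn | head -20
```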

bleskes added a commit that referenced this issue Oct 23, 2014
This commit rewrites the state controls in the RecoveryTarget family of classes to make it easier to guarantee that:
- recovery resources are only cleared once there are no ongoing requests
- recovery is automatically canceled when the target shard is closed/removed
- canceled recoveries do not leave temp files behind

Highlights of the change:
1) All temporary files are cleared upon failure/cancel (see #7315)
2) All newly created files are always temporary
3) Doesn't list local files on the cluster state update thread (which could throw unwanted exceptions)
4) Recoveries are canceled by a listener to IndicesLifecycle.beforeIndexShardClosed, so we don't need to explicitly call it
5) Simplifies RecoveryListener to only notify when a recovery is done or failed. Removed subtleties like ignore and retry (they are dealt with internally)

Closes #8092, closes #7315

@yangou

yangou commented Jan 23, 2015

Has anyone solved this issue, or is there a workaround?
Simple question: can I just manually rm those recovery files? They are eating up most of my disk space.
I have tons of recovery files too, and I'm upgrading from 1.1 to 1.4.2.

@avleen
Author

avleen commented Jan 24, 2015

I manually rm'd them here, and it was OK.


@bleskes
Contributor

bleskes commented Jan 24, 2015

@yangou the issue is solved, but the fix will be released with 1.5.0 (no ETA yet). You can safely remove the recovery.* files of all old recoveries. To check which recoveries are currently active, you can call GET _cat/recovery?active_only=true.
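
A minimal cleanup sketch based on the above, assuming a node on localhost:9200 and a path.data of /var/lib/elasticsearch (both assumptions; the -mtime filter is just an extra safety margin, not something required by the fix):

```sh
# 1) Shards listed here still have active recoveries; leave their files alone
curl -s 'localhost:9200/_cat/recovery?active_only=true&v'

# 2) Remove leftover recovery temp files that haven't been touched in over a day
find /var/lib/elasticsearch -type f -name 'recovery.*' -mtime +1 -delete
```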
