OOM causes data loss on 0.20.6 #3135

Closed
MagmaRules opened this issue Jun 4, 2013 · 9 comments

Comments

@MagmaRules

Hi there,

Last week we had a major crash in our production cluster. Due to some faulty configuration the machines ran out of memory and the whole cluster crashed.

After a reconfiguration and a restart we discovered that shard 8 wasn't recovering. We checked the data on disk and found that the shard was half its expected size.
We ran `java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8/index/ -fix`, but no data was recovered. Both the primary and the replica of shard 8 had the exact same size.

We are running ES 0.20.6. Our index has 20 shards, each with one replica. We have 6 machines, and our cluster size is nearing 250GB (500GB with the replicas). Our configuration is here: http://pastie.org/8006818 .
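For context on the memory question that comes up later in the thread, a rough rule of thumb (my assumption, not something stated here) is to give the JVM about half of a node's physical RAM and leave the rest for the OS file cache. A minimal sketch of the arithmetic for this cluster:

```python
def suggested_heap_gb(ram_gb):
    """Rule-of-thumb heap size (assumption, not from this thread):
    roughly half of physical RAM, leaving the rest to the OS file cache."""
    return ram_gb / 2.0

# With ~500GB of data (including replicas) spread over 6 machines,
# each node holds roughly 83GB on disk -- far more than any reasonable
# heap, so the filesystem cache ends up doing most of the work.
per_node_gb = 500.0 / 6
print(round(per_node_gb, 1))   # -> 83.3
print(suggested_heap_gb(32))   # -> 16.0 for a hypothetical 32GB node
```

The 32GB figure is purely illustrative; the thread does not say how much RAM these nodes have.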

I published our logs on Dropbox: https://dl.dropboxusercontent.com/u/53354942/PR-logs.zip . Let me know if you need more logs.

@MagmaRules

Yesterday we had another crash, same reason. This time there was no data loss.

@clintongormley

@imotov @s1monw any ideas on this one?

@MagmaRules

The cluster crashed again and we lost a full shard. The shard disappeared; it is not even on disk. I will make the logs available ASAP.

@s1monw

s1monw commented Jun 18, 2013

Are you always getting OOM when your cluster crashes? I wonder if the shard disappears because it can't be recovered, since you hit OOM during recovery? Is it possible that your nodes are pretty much at the limits memory-wise? I'd be interested if you hit a "too many open files" exception at some point?
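A quick way to see the limit that a "too many open files" exception runs into is the per-process file-descriptor limit. A minimal sketch on a Unix host (for the real cluster you would inspect the Elasticsearch process itself, e.g. its entry under /proc, rather than a fresh Python process):

```python
import resource

# Per-process open-file limit. A "too many open files" exception means the
# process exhausted the soft limit; the hard limit is the ceiling the soft
# limit can be raised to without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)
```

Lucene-based nodes keep many segment files open, so a low soft limit can surface as exactly this kind of failure under load.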

@MagmaRules

"Are you always getting OOM when your cluster crashes?"
Yes.

"I'd be interested if you hit a too many open files exception at some point?"
We are not seeing "too many open files" in the logs. I know about issue #2812. Unfortunately it doesn't seem to be the same problem, although it looks similar.

"Is it possible that your nodes are pretty much at the limits memory-wise?"
Yes, it's possible. I'm working on that. The problem is just the data loss.

@MagmaRules

Just to add some more info: this time the shard disappeared from the filesystem. The only way we were able to recover was by copying an empty shard from a dev machine.

@spinscale

Do you still hit this with a current Elasticsearch version, or might it make sense to close this one?

@MagmaRules

I guess =). I haven't tried to replicate it lately.
The only environment where I had these problems was production, and a daily reboot was implemented to prevent the crashes.
We did upgrade to 0.90, but the reboot was not removed, so I can't say for sure the problem was fixed. On the other hand, I can't say it wasn't =).

@clintongormley

Please reopen if this is still an issue.
