OOM causes data loss on 0.20.6 #3135
Comments
Yesterday we had another crash, same reason. This time there was no data loss.
The cluster crashed again and we lost a full shard. The shard disappeared; it is not even on disk. I will make the logs available ASAP.
Are you always getting OOM when your cluster crashes? I wonder if the shard disappears because it can't be recovered, since you hit OOM during recovery? Is it possible that your nodes are pretty much at their limits memory-wise? I'd also be interested to know whether you hit a too-many-open-files exception at some point.
"Are you always getting OOM when your cluster crashes?" "I'd be interested if you hit a too many open files exception at some point?" "Is it possible that your nodes are pretty much at the limits memory wise" |
Just to add some more info: this time the shard disappeared from the filesystem. The only way we were able to recover was by copying an empty shard from a dev machine.
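(For reference, a rough sketch of that workaround, assuming the shard path from the original report; the dev host name, service name, and user in this sketch are assumptions, not details from the thread:)

    # Rough sketch of the workaround above: stop the node, copy an empty shard 8
    # directory from a dev machine into place, fix ownership, and restart.
    # "devbox", the source path, the service name, and the user are assumptions;
    # the destination path follows the layout in the original report.
    sudo service elasticsearch stop
    scp -r devbox:/path/to/empty/indices/spc/8 \
        /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/
    sudo chown -R elasticsearch:elasticsearch \
        /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8
    sudo service elasticsearch start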
Do you still hit this with a current Elasticsearch version, or might it make sense to close this one?
I guess =). I haven't tried to replicate it lately.
Please reopen if this is still an issue.
Hi there,
Last week we had a major crash in our production cluster. Due to some faulty configuration, the machines ran out of memory and the whole cluster crashed.
After a reconfiguration and a restart we discovered that shard 8 wasn't recovering. We checked the data on disk and found that the shard was half its expected size.
We ran the command "java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8/index/ -fix", but no data was recovered. Both the primary and the replica of shard 8 had the exact same size.
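(For anyone hitting the same situation: CheckIndex can first be run without -fix to get a report of which segments are damaged before anything is deleted, since -fix drops the corrupt segments, and the documents in them, rather than repairing them. A sketch using the same jar and shard path as above, run with the node stopped:)

    # Report-only pass: inspects the shard's Lucene index and lists broken
    # segments without modifying anything.
    java -ea:org.apache.lucene... -cp lucene-core-3.6.1.jar \
        org.apache.lucene.index.CheckIndex \
        /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8/index/

    # Destructive pass: only after reviewing the report, rerun with -fix,
    # which removes the corrupt segments along with the documents they contain.
    java -ea:org.apache.lucene... -cp lucene-core-3.6.1.jar \
        org.apache.lucene.index.CheckIndex \
        /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8/index/ -fix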
We are running ES 0.20.6. Our index has 20 shards, each with a replica. We have 6 machines, and our cluster size is nearing 250 GB (500 GB with replicas). Our configuration is here: http://pastie.org/8006818 .
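(For context, that shard layout corresponds to index settings along these lines; this is only an illustration, with host and port assumed, and the index name taken from the shard path above:)

    # Illustrative only: creating an index with the layout described above
    # (20 primary shards, 1 replica each). Host and port are assumptions.
    curl -XPUT 'http://localhost:9200/spc' -d '{
      "settings": {
        "number_of_shards": 20,
        "number_of_replicas": 1
      }
    }'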
I published our logs on Dropbox: https://dl.dropboxusercontent.com/u/53354942/PR-logs.zip . Let me know if you need more logs.