OOM causes data loss on 0.20.6 #3135

Closed
MagmaRules opened this issue Jun 4, 2013 · 9 comments

Comments

@MagmaRules

Hi there,

Last week we had a major crash in our production cluster. Due to some faulty configuration the machines ran out of memory and the whole cluster crashed.

After a reconfiguration and a restart we discovered that shard 8 wasn't recovering. We checked the data on disk and found that the shard was half its expected size.
We ran `java -cp lucene-core-3.6.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /chroot/pr/elasticsearch/data/PR/nodes/0/indices/spc/8/index/ -fix`, but no data was recovered. Both the primary and the replica of shard 8 had the exact same size.

We are running ES 0.20.6. Our index has 20 shards, each with one replica. We have 6 machines, and our cluster size is nearing 250GB (500GB with the replicas). Our configuration is here: http://pastie.org/8006818 .
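For context on the memory question that comes up later in the thread, a rough rule of thumb (my assumption, not something stated here) is to give the JVM about half of a node's physical RAM and leave the rest for the OS file cache. A minimal sketch of the arithmetic for this cluster:

```python
def suggested_heap_gb(ram_gb):
    """Rule-of-thumb heap size (assumption, not from this thread):
    roughly half of physical RAM, leaving the rest to the OS file cache."""
    return ram_gb / 2.0

# With ~500GB of data (including replicas) spread over 6 machines,
# each node holds roughly 83GB on disk -- far more than any reasonable
# heap, so the filesystem cache ends up doing most of the work.
per_node_gb = 500.0 / 6
print(round(per_node_gb, 1))   # -> 83.3
print(suggested_heap_gb(32))   # -> 16.0 for a hypothetical 32GB node
```

The 32GB figure is purely illustrative; the thread does not say how much RAM these nodes have.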

I published our logs on Dropbox: https://dl.dropboxusercontent.com/u/53354942/PR-logs.zip . Let me know if you need more logs.

@MagmaRules

Yesterday we had another crash, same reason. This time there was no data loss.

@clintongormley

@imotov @s1monw any ideas on this one?

@MagmaRules

The cluster crashed again and we lost a full shard. The shard disappeared; it is not even on disk. I will make the logs available ASAP.

@s1monw

s1monw commented Jun 18, 2013

Are you always getting OOM when your cluster crashes? I wonder if the shard disappears because it can't be recovered, since you hit OOM during recovery? Is it possible that your nodes are pretty much at the limits memory-wise? I'd be interested if you hit a "too many open files" exception at some point?
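A quick way to see the limit that a "too many open files" exception runs into is the per-process file-descriptor limit. A minimal sketch on a Unix host (for the real cluster you would inspect the Elasticsearch process itself, e.g. its entry under /proc, rather than a fresh Python process):

```python
import resource

# Per-process open-file limit. A "too many open files" exception means the
# process exhausted the soft limit; the hard limit is the ceiling the soft
# limit can be raised to without privileges.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft)
print("hard limit:", hard)
```

Lucene-based nodes keep many segment files open, so a low soft limit can surface as exactly this kind of failure under load.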

@MagmaRules

"Are you always getting OOM when your cluster crashes?"
Yes.

"I'd be interested if you hit a too many open files exception at some point?"
We are not seeing "too many open files" in the logs. I know about issue #2812. Unfortunately it doesn't seem to be the same problem, although it looks similar.

"Is it possible that your nodes are pretty much at the limits memory-wise?"
Yes, it's possible. I'm working on that. The problem is just the data loss.

@MagmaRules

Just to add some more info: this time the shard disappeared from the filesystem. The only way we were able to recover was by copying an empty shard from a dev machine.

@spinscale

Do you still hit this with a current Elasticsearch version, or might it make sense to close this one?

@MagmaRules

I guess =). I haven't tried to replicate it lately.
The only environment where I had these problems was production, and a daily reboot was implemented to prevent the crashes.
We did upgrade to 0.90, but the reboot was not removed, so I can't say for sure the problem was fixed. On the other hand, I can't say it wasn't =).

@clintongormley

Please reopen if this is still an issue.
