Scaling Eureka Horizontally #1273
Another thing we've noticed is that any time we attempt to scale beyond 3 nodes, at least one of the nodes always reports unavailable replicas. Over time, the node that reports the unavailable replicas changes. We also randomly get this error: We'll see it several times in a row, and then all of a sudden a significant number of registered instances will be evicted from the Eureka server.
Sean, thanks for the detailed write-up. I don't see anything out of the ordinary in your config; it should be OK, though you might want to fix
I can offer only a personal opinion: I don't think there's much benefit to scaling horizontally beyond 2-3 nodes for redundancy alone; frankly, running 1 would not be crazy. Services should be designed so that a temporary Name Service (Eureka) outage is not an issue. Worst case, if none of the servers are up, new deployments are slowed down and existing instances see potentially stale state (unless that state is naturally very volatile, a couple of minutes of delay is not a problem), and only until you restart or redeploy the name service nodes, after which everything goes back to normal. This is not a service you need many nines of availability for; there's little reason to go above 99.9% for a name service, though opinions vary.

An actual benefit of scaling horizontally may be amplifying NIC throughput when you have a lot of readers and are close to saturating the NICs on your Eureka nodes. The number of horizontal nodes here is proportional to the total throughput you want to handle, but I would not go crazy (maybe 5 tops), as this is full-mesh replication.
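The full-mesh point can be made concrete with a little arithmetic: every Eureka server forwards each registration/renewal to all of its peers, so both the replication messages per client event and the number of peer links grow with the node count. A back-of-the-envelope sketch (not from the thread, just illustration):

```python
# Back-of-the-envelope arithmetic for a full-mesh replication cluster.
# The server that receives a client event forwards it to all n-1 peers.

def replication_messages_per_event(nodes: int) -> int:
    """Peer-to-peer messages generated cluster-wide for one client event."""
    return nodes - 1

def mesh_links(nodes: int) -> int:
    """Bidirectional peer links in a full mesh of `nodes` servers."""
    return nodes * (nodes - 1) // 2

for n in (3, 5, 9):
    print(f"{n} nodes: {replication_messages_per_event(n)} msgs/event, "
          f"{mesh_links(n)} peer links")
```

Going from 3 to 9 nodes grows the peer links from 3 to 36, which is one plausible reason the registries diverge more as the cluster grows.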
Batching is probably the only important one. Try playing with these settings.
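The specific batching settings referenced above did not survive in this transcript. For illustration only, Spring Cloud exposes Eureka server replication knobs along these lines; the names and values here are assumptions to verify against your version's `EurekaServerConfigBean`, not settings confirmed in the thread:

```properties
# Illustrative only — verify names and defaults against your Spring Cloud version.
eureka.server.batch-replication=true
eureka.server.max-elements-in-peer-replication-pool=10000
eureka.server.max-threads-for-peer-replication=20
```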
I'm unable to share exact numbers, but I can say we run significantly more instances than a couple of thousand while keeping NIC saturation under 25% on the Eureka nodes. We do not run a massive number of Eureka nodes, just a couple (horizontally); most of the traffic is reads and replication is negligible in comparison, so reads are the main optimization opportunity. Based on your write-up, it seems you may have either networking issues or a sub-optimal Tomcat configuration, and I suggest starting by checking that. This is a bunch of blocking IO, so one thing to check is whether the thread pools are getting full and requests are being dropped as a result.
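One way to check whether a blocking-IO connector pool is saturated is to take a thread dump (e.g. with `jstack`) and count busy versus total worker threads. A minimal sketch of that counting, assuming Tomcat's default `http-nio` thread-name prefix; the embedded dump text is a fabricated sample, not real output:

```python
# Count busy vs. total Tomcat worker threads in a thread-dump text.
# `sample_dump` is a fabricated, abbreviated example for illustration.
sample_dump = """\
"http-nio-8080-exec-1" ... java.lang.Thread.State: RUNNABLE
"http-nio-8080-exec-2" ... java.lang.Thread.State: WAITING (parking)
"http-nio-8080-exec-3" ... java.lang.Thread.State: RUNNABLE
"""

def pool_usage(dump: str, prefix: str = "http-nio") -> tuple[int, int]:
    """Return (busy, total) worker threads for the given connector prefix."""
    workers = [l for l in dump.splitlines() if l.startswith(f'"{prefix}')]
    busy = sum(1 for l in workers if "RUNNABLE" in l)
    return busy, len(workers)

print(pool_usage(sample_dump))  # -> (2, 3)
```

If the busy count sits at the pool maximum while requests time out, the pool is the bottleneck rather than the network.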
@troshko111 For now, we have decided to stick with three Eureka servers. Regarding the consistency issue, I did notice that consistency seemed to improve when I changed

Thank you again. I think we can close this one out.
Hi,
We recently had some issues with our 3-node Eureka Server cluster. Some major scaling events brought the number of instances registered with Eureka to about 2,400. Upon reaching this number, file handles and CPU spiked. Eureka was unable to recover, and our only recourse was to scale back our services and kill off the 2 Eureka servers that were in distress (the 3rd was not). We were running an older version of Eureka (1.6.2) via Spring Cloud and have since upgraded to version 1.9.13. After the upgrade we could register about 3,500 instances before we would start seeing a significant number of socket timeouts and quite a bit of CPU jitter. We then found the following issue and supporting pull request:
In short, the guidance was to increase the peer node read timeout via the `eureka.server.peer-node-read-timeout-ms` property. We did this, and doing so got rid of the timeouts completely and settled CPU. Our target was to register 10,000 instances, which we have now hit.

Our next test was to figure out the impact of scaling the Eureka server horizontally from 3 instances to 9. When we scaled out with 10,000 registered instances, each Eureka server seemed to give a different picture of the world, and we never saw the same number of registered instances across our Eureka servers; most of the time each server was off by thousands. I tried scaling back down to 6 instances, but that did not work either. It wasn't until I scaled back to 3 that things started to settle out. So here are my questions:
For the `eureka.server.max-elements-in-status-replication-pool` property, what exactly defines an element? Does it relate to the number of registered instances?

To help with the questions above, here is what the configuration looks like on our Eureka servers (we use DNS resolution to find the Eureka servers, and I've omitted those properties from the config below). I realize some of these properties are the defaults, but we define them anyway in our `application.properties` file.
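The reporter's configuration itself did not survive in this transcript. As a hedged illustration of the kind of `application.properties` entries discussed above, with placeholder values that are assumptions rather than the reporter's actual settings:

```properties
# Illustrative placeholders only — not the reporter's actual configuration.
eureka.server.peer-node-read-timeout-ms=5000
eureka.server.max-elements-in-status-replication-pool=10000
eureka.server.eviction-interval-timer-in-ms=60000
```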