Dashboard + Turbine for multiple services on the same server #117

Closed
pparth opened this issue Feb 28, 2013 · 21 comments

@pparth

pparth commented Feb 28, 2013

Hello again Ben,

I used the Dashboard to monitor a single service successfully. I tried to use Turbine in order to aggregate Hystrix metrics from multiple services residing on the same server, but this seems to be impossible.
So, let me clarify things a little:
The architecture of the relevant Netflix implementation, which is reflected in the Dashboard images presented all over the Dashboard and Turbine Wiki, essentially implies that each service resides on its own server. In order to aggregate Hystrix metrics from all services (each defining its own set of Hystrix commands), all services should generate the metrics stream on the same port (e.g. 8080) which is configured in Turbine. The question is: can Turbine aggregate different sets of Hystrix commands metrics this way?
So, if i have 5 different services, each with different sets of Hystrix commands, located in 5 different servers, then can i have a Dashboard where i could see the union of all Hystrix commands, using Turbine to define a Default cluster of these instances?
What happens when the same Hystrix Command is defined in different hosts?

@benjchristensen
Contributor

Hi @pparth

Turbine supports monitoring multiple clusters of servers at the same time. You can see some information about this here: https://github.com/Netflix/Turbine/wiki/Configuration

(@opuneet Can each cluster be given a different port, or is that global?)

A Hystrix/Turbine dashboard represents the metrics for a "cluster" as defined in Turbine and Netflix generally defines that to be a cluster of servers with a single application on it.

For example, most of the screenshots you see are from the Netflix API, a single application deployed across hundreds of servers that uses 100+ backend services served by dozens of backend applications, each with its own cluster of servers. The Hystrix dashboard shows the HystrixCommand metrics of the Netflix API interacting with all of those services. Each of the other backend clusters of servers has its own Hystrix dashboard and Turbine monitor.

However, what you define as a "cluster" is completely up to you in Turbine. We just happen to do it by application, as each application is a logical place to monitor and separate metrics for us.

If you want all servers from different applications to all merge together and be presented on a single dashboard you can configure Turbine to do that for you. It's just what instances you tell Turbine to monitor.

If the same HystrixCommand is used in different applications, the metrics for them will be merged together by Turbine if you have a single 'cluster' configured in Turbine that pulls from multiple applications. This may make sense for your use case - it doesn't for us. For example, different applications have different configurations of HystrixCommands (timeouts, concurrency limits, etc) - even if the HystrixCommand itself is the same used by 2 applications, the config and metrics are different.

As for multiple JVM processes on the same machine - obviously they can't all stream over the same port, so you'll need to configure different ports for them and then configure Turbine to monitor the different ports. Perhaps each port represents a "cluster" for you in Turbine.

In short ... Turbine just aggregates whatever metrics it gets from the streams you point it at. It doesn't matter if they are different applications/services/machines etc, it's just data for Turbine. It is you who defines the logical boundaries by what you call a "cluster" in Turbine.

@opuneet
Contributor

opuneet commented Feb 28, 2013

Turbine can work with multiple clusters, and each cluster can specify its own configuration when connecting to a host within that cluster.

But I think that the "logical" cluster trick may not work here. Turbine maintains state of all these instances in order to maintain persistent connections to them and it does rely on the "hostname" and if the host name is the same then it won't instantiate a new connection to that same server (on a different port). This was done intentionally so as to ensure that Turbine does not open multiple connections to our servers in production and hence be less intrusive to our prod servers.
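The effect of keying connections on the hostname can be sketched roughly as follows. This is an illustration only, not Turbine's actual code: the `connect_all` function and its registry are hypothetical, and the ports are borrowed from the configuration discussed later in this thread.

```python
from typing import Dict, List, Tuple


def connect_all(targets: List[Tuple[str, int]], key_by_port: bool) -> Dict[str, Tuple[str, int]]:
    """Return the connections a registry holds after seeing every target.

    A target is skipped when its key is already registered, which mimics a
    "one connection per host" policy (hypothetical model, not Turbine code).
    """
    registry: Dict[str, Tuple[str, int]] = {}
    for host, port in targets:
        key = f"{host}:{port}" if key_by_port else host
        if key not in registry:
            registry[key] = (host, port)
    return registry


# Two services on the same machine, listening on different ports.
targets = [("localhost", 8052), ("localhost", 8050)]

by_host = connect_all(targets, key_by_port=False)      # hostname-only key
by_host_port = connect_all(targets, key_by_port=True)  # host:port key

print(by_host)
print(by_host_port)
```

With the hostname-only key, only the first service on the host gets a connection; keying on host and port keeps both.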

@opuneet
Contributor

opuneet commented Feb 28, 2013

Hi @pparth

There may be something else that we could do here, but we'd have to write code and plug in a new aggregator. All the aggregators use a global singleton called MonitorConsole to track these connections and there is one connection per host / instance.

The new aggregator implementation would inherit all the basic aggregation logic from the AggregateClusterMonitor class, but would maintain its own MonitorConsole object, so connections from one aggregator to individual hosts would not step on the connections to the same hosts maintained by another aggregator.

Each aggregator instance would map to a logical cluster, and you would have 5 clusters mapping to your 5 services. Make sense?

But the downside of this approach is that you would see the metrics for one cluster at a time, since that is how one connects to the Turbine aggregated output stream, and hence you would not be able to see all the metrics from all 5 logical services in one Hystrix dash. The Hystrix dash could still be used to connect to all 5 Turbine clusters separately.

@pparth
Author

pparth commented Mar 1, 2013

Hello guys,

Ben, here at Odesk, we have essentially the same architecture: we have a single application called "O2 API" that consists of a number of services, which may well number a few dozen in full deployment. These services call each other through Hystrix-wrapped connections.

We may end up, eventually, deploying each service to its own host, but i really think that the most logical approach for architects to take in the early stages of the implementation, is to deploy all the services of the application on the same host and have a cluster of 3+ hosts with a load balancer in front.

Problem is that Turbine does not seem to go well with this configuration. I already tried the logical trick but it has major problems:

  • First of all, as @opuneet stated, it does not really work. Turbine won't instantiate a new connection to that same server (on a different port) and i got results only from the first in line service.
  • Even with the new aggregator that fixes this problem, we would not be able to see all the metrics "of the API" on the same dashboard. This is a must for SiteOps as you very well know.
  • It may well be only 5 services for now, but this number will quickly rise to a few dozen. This is not maintainable.

So, i think that the major problem is that the Dashboard, alone or with Turbine, does not support the consolidation of Hystrix commands coming from 2+ services listening on different ports on the same host. This requirement is much more critical than the aggregation of data from different hosts, which is something to be checked on the next stage. I can't tell if this requirement is to be supported by the Dashboard itself or with the help of Turbine. But it's definitely a must for any small to medium sized deployment.

@pparth
Author

pparth commented Mar 7, 2013

Hello guys, @benjchristensen and @opuneet

Are you going to consider the aforementioned requirement for the Dashboard to consolidate commands from multiple services listening on different ports on the same host? Or do you think this is out of scope and we have to stick with the single service per host architecture? Is there a possible roadmap if you decide otherwise?

Thank you for your time!

@opuneet
Contributor

opuneet commented Mar 7, 2013

Hey @pparth, sorry for not getting back to you earlier.
@benjchristensen and I spoke about this and I can re-work Turbine to truly support logical clusters, where it would be able to connect to the same host on a different port for a different logical service. However, the term "cluster" is pretty ingrained into Turbine's design and Turbine's entire operation is isolated at the cluster level. It does not aggregate data across clusters, since the cluster is the grouping level for Turbine. This means that there will still be a single unique dash for each logical cluster you define.

Even if I make changes so that one is able to get the aggregated Turbine streams for multiple clusters in the same connection (a single connection is needed by the Hystrix dash), you could still have the same Hystrix commands defined in different clusters, and hence multiple data events for these commands will stomp on each other in the Hystrix dash.

There may be a way to solve this by indicating the cluster name along with each data event that is streamed out from Turbine, but this would also then involve major changes to the Hystrix dash to indicate what Hystrix command is for which cluster. We didn't have this kind of a use case when we were designing these 2 components, hence this may take some more thought.

I've opened an issue for Turbine here Netflix/Turbine#9 to track the logical cluster problem.
Let me know if you want this to be fixed soon given that I still can't fix the second problem for you in the near future.

Thanks,
-Puneet

@pparth
Author

pparth commented Mar 8, 2013

Hello @opuneet,
I understand the problem you face. This is why I kept mentioning in my previous posts that I'm not really sure whether the logical cluster problem should be solved in the Dashboard or in Turbine. As I see it, a logical cluster does not do any aggregation and therefore should be offered by the Dashboard itself, with no need to install yet another service such as Turbine. Turbine should be installed only when someone needs to expand to more than one "Turbine" cluster of "Dashboard" "logical" clusters. This architecture may need major refactoring, but it seems more elegant for supporting different aggregation features in the long run. An additional requirement that arises in the context of a logical cluster points the same way: when 2 or more instances of the same Hystrix command are running in the same logical cluster, they should also be presented as different instances on the Dashboard! I don't know how your proposed solution using Turbine will behave in this situation. My guess is that it will either try to aggregate or die trying!

Just my thoughts of course, you know better!

@opuneet
Contributor

opuneet commented Mar 8, 2013

Hi @pparth

I think I understand what you are saying, but I want to clarify the meaning of some terms used here just to ensure that we are all on the same page w.r.t your use case / problem definition. I'm restating a few facts here so that there is no confusion on this thread.

Apologies in advance for the long-winded email, but I think it's necessary since this thread is getting pretty confusing.
If you feel that I've mis-understood the facts here then please correct me, but please be mindful of the terms being used here.

How Turbine works and what do we mean by 'cluster'

Cluster is a Turbine concept and not a Hystrix dash concept. Turbine defines a 'cluster' as a group of logical hosts, and runs an individual aggregator for each defined cluster. All data emanating from hosts in the same cluster is aggregated together by the aggregator, which then produces a single aggregate output stream. Hence all Hystrix cmd metrics that have the same aggregation key (basically the name) within the same cluster get aggregated together into a single metric. This is the value add of using Turbine: you have multiple hosts, such as an AWS ASG, and you need to aggregate the same data from all your EC2 instances within the same ASG to get an ASG-level view of your Hystrix metrics. Your cluster here is an ASG, but you could also use a group of ASGs or some arbitrary list of instances within the same cluster. Hence the term logical cluster.

If 2 or more Hystrix cmd instances with the same name should not be aggregated together, then they cannot be in the same cluster, since that is essentially how Turbine works. Cluster is the scoping mechanism for aggregation.
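As a minimal sketch of cluster-scoped aggregation (a simplified model with made-up cluster names, command names and counters, not Turbine's implementation), events can be summed by command name, but only within the cluster they belong to:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (cluster, command name, counters) -- a simplified, hypothetical event shape.
Event = Tuple[str, str, Dict[str, int]]


def aggregate(events: Iterable[Event]) -> Dict[str, Dict[str, Dict[str, int]]]:
    """Sum counters per command name, scoped to each cluster."""
    totals = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for cluster, name, counters in events:
        for metric, value in counters.items():
            totals[cluster][name][metric] += value
    # Convert the nested defaultdicts to plain dicts for easy inspection.
    return {c: {n: dict(m) for n, m in cmds.items()} for c, cmds in totals.items()}


events = [
    ("api", "GetUser", {"success": 50, "timeout": 2}),     # host 1 in cluster "api"
    ("api", "GetUser", {"success": 30, "timeout": 1}),     # host 2 in cluster "api"
    ("billing", "GetUser", {"success": 7, "timeout": 0}),  # same name, other cluster
]

agg = aggregate(events)
print(agg)
```

The two "api" events collapse into one metric, while the identically named command in "billing" stays separate because it lives in a different cluster.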

What is the limitation in Turbine here w.r.t your use case.

Turbine makes a brittle assumption that each host / instance belongs to exactly one cluster, hence cannot connect to the same host on 5 different ports which represent 5 different services. This is a limitation and can be fixed in the near future, but Turbine will still assume that each service on a different port is logically separate and hence should be part of a different cluster. In any case, looks like you don't want the same Hystrix cmd metrics from different services to be aggregated together and hence the different services here map to different Turbine clusters.

So using cluster to map to your services gives you a natural isolation boundary for your metrics

How does Hystrix dash work

The Hystrix dash needs a connection URL and simply gets data / metrics from this connection and displays that data using its JavaScript code, which essentially reacts to the HTML5-spec-compliant events coming over the connection stream (the connection may be to Turbine or even to an individual host).

It treats each event as a self describing Hystrix command metrics instance.
e.g. { "name" : "my-command", "success" : 50, "timeout" : 2, "fallback" : 1 }
and has no concept of a cluster here.
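To illustrate why events with the same name stomp on each other, here is a rough sketch of a consumer that keeps the latest event per command name. It assumes the stream uses the server-sent events `data:` line format; the sample lines and the `latest_by_name` helper are made up for illustration and are not the dashboard's real code.

```python
import json
from typing import Dict, Iterable


def latest_by_name(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep only the most recent event per command name."""
    latest: Dict[str, dict] = {}
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separators and anything that is not a data line
        event = json.loads(line[len("data:"):].strip())
        latest[event["name"]] = event  # same name: the newer event replaces the older
    return latest


# Hypothetical stream: the same command name arrives from two different services.
stream = [
    'data: {"name": "my-command", "success": 50, "timeout": 2, "fallback": 1}',
    "",
    ": keep-alive comment",
    'data: {"name": "my-command", "success": 7, "timeout": 0, "fallback": 0}',
    "",
]

view = latest_by_name(stream)
print(view)
```

Because the name is the only key, the second event overwrites the first even though it came from a different service.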

So the overall summary here is ...

  • Hystrix dash shows all metrics for a single service / app using a single connection to some remote data source. It knows nothing about a cluster.
  • Turbine uses a cluster to give you a more global view in case you have more than one instance for the same service / app.

Your use case.

  • You are running 5 separate services on the same physical host.
  • You need a single comprehensive view of Hystrix data from all services using the same dash i.e you want to look at one page to view all 5 services.
  • You also want isolation between the same Hystrix metrics from different services.
  • You could have more than one physical host in the future for scaling your 5 services horizontally, and hence you could use Turbine in the future to get an aggregated view of the same metrics from different hosts, but you still want isolation boundaries between different services and you still want the union (NOT sum) of all metrics from all 5 services in the same dash.

Do you really need Turbine here?

Well if you have one physical host with 5 different services, then no you do not. But if you scale your 5 services horizontally in the future you will need something to combine metrics from different hosts together, regardless of the isolation boundaries. This is what Turbine was designed for.

What could we do to achieve your use case

Looks like no matter what we do for Turbine, we still need to do some work on the Hystrix dash to achieve the union with isolation boundaries feature that you want. The Hystrix dash will have to be aware of these isolation boundaries so that metrics do not stomp on each other.

I'm not going to make implementation suggestions this early and I think that doing this is non-trivial in the near future. @benjchristensen and I can discuss this and get back to you.

@pparth
Author

pparth commented Mar 9, 2013

Hello @opuneet and thank you for your effort!

Your post was very helpful. Your description about the way Turbine works was really enlightening.
Furthermore, you stated my use case exactly as it is, just let me enhance the isolation boundary issue a little bit:
When the same Hystrix command types are contained in a large number of services (e.g. out of 10 services, 8 use the same 3-4 Hystrix command types), it can be really confusing for someone who looks at the Dashboard to understand where each command belongs.
I can see 3 possible solutions to this problem:

1). If you don't change the signature of the data stream, then the only way for the Dashboard to isolate the commands is to add an arbitrary prefix or suffix to the command name, which essentially is not helpful at all.

2). Again, without changing the data stream signature, a useful option would be for us to somehow augment the command name with the name of the service where it is hosted. So, I could name the command "servicename_commandName" and then you could have the isolation boundary you request. A Dashboard user could then just sort by name and have a decent view of the commands. It's not perfect, but it will do. So, if you think that this solution will help you in the short run, and help you provide a version of an implementation quicker, please let me know. I am pretty sure that we can do it.

3). The best solution in the long run would be to change the data stream signature so that it contains the service (host app) name. The Dashboard should be aware of this new piece of data. Firstly, it can define the isolation boundary requested, and secondly it can provide the user with an opportunity to have a detail view per service. This is very helpful when a large number of services are deployed and solution 2) is getting out of hand. Now, how can the Hystrix world know about the service name? I think that a nice, Netflix-integrated solution would be to automatically get the name from the ApplicationId attribute of the DeploymentContext defined on the static ConfigurationManager of Archaius. This way, there is no need for other, custom implementations.
In fact, it would be useful if the whole DeploymentContext object were added to the stream. The Stack attribute could be another helpful piece of data for the Dashboard. It could be a comma-separated list of the stacks a service belongs to. Using this list, a user could have a vertical view of the services belonging to the same stack, and have a better understanding of where in the stack tree the problem exists.
P.S The other DeploymentContext attributes could be very helpful for Turbine to automatically define dimensions of clusters and for the Dashboard to present different views per cluster on the same page...but this is another discussion.
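Option 2 above could look roughly like the following sketch. The `qualify` and `group_by_service` helpers and the service and command names are hypothetical; the underscore separator follows the "servicename_commandName" convention mentioned above and assumes service names never contain an underscore.

```python
from itertools import groupby
from typing import Dict, List

SEPARATOR = "_"  # from the "servicename_commandName" convention in this thread


def qualify(service: str, command: str) -> str:
    """Build a command name that carries its service as a prefix."""
    return f"{service}{SEPARATOR}{command}"


def group_by_service(names: List[str]) -> Dict[str, List[str]]:
    """Sort qualified names, then group the commands under their service prefix."""
    def service_of(name: str) -> str:
        return name.split(SEPARATOR, 1)[0]

    ordered = sorted(names)
    return {
        service: [name.split(SEPARATOR, 1)[1] for name in group]
        for service, group in groupby(ordered, key=service_of)
    }


names = [
    qualify("billing", "GetUser"),
    qualify("auth", "GetUser"),
    qualify("auth", "Login"),
]
grouped = group_by_service(names)
print(grouped)
```

Sorting by the qualified name keeps each service's commands next to each other, which is the "decent view" option 2 aims for. A stream-level service field, as in option 3, would remove the need to encode the service in the name at all.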

Sorry, if i went too far. I'm just really excited about the use of these technologies. Hope i've been helpful.

@opuneet
Contributor

opuneet commented Mar 14, 2013

Hey @pparth

Yes, I see that 3. makes sense. We'll discuss and see if and when we can make these changes. Meanwhile 2. can work for you and I've just released changes to Turbine to enable connecting to the same host for a different service.

Please note that I've tested using my unit tests but haven't really had a chance to test this using a real server since we don't usually run services at Netflix using that mechanism.

@pparth
Author

pparth commented Mar 14, 2013

Nice!

I'll do my code changes in order to augment my command names with the service name, while waiting for the release to reach Maven Central.

So, to clarify, would a working configuration for, say, 2 services be as follows?

<turbine>
    <aggregator>
        <clusterConfig>connectionsCS,connectionsAS</clusterConfig>
    </aggregator>
    <instanceUrlSuffix>
        <connectionsCS>:8052/hystrix.stream</connectionsCS>
        <connectionsAS>:8050/hystrix.stream</connectionsAS>
    </instanceUrlSuffix>
    <ConfigPropertyBasedDiscovery>
        <connectionsCS>
            <instances>localhost</instances>
        </connectionsCS>
        <connectionsAS>
            <instances>localhost</instances>
        </connectionsAS>
    </ConfigPropertyBasedDiscovery>
</turbine>

@opuneet
Contributor

opuneet commented Mar 14, 2013

Yes I think that this will work.
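Since the number of services is expected to grow, a cluster-per-service configuration like the one above could be generated rather than written by hand. The sketch below is only an illustration: the `turbine_properties` helper is hypothetical, and it emits the flat property-file form of the same keys (the form used later in this thread) rather than the XML form.

```python
from typing import Dict, List


def turbine_properties(service_ports: Dict[str, int], hosts: List[str],
                       stream_path: str = "/hystrix.stream") -> str:
    """Emit one Turbine cluster per service, each on its own port (hypothetical helper)."""
    clusters = sorted(service_ports)
    lines = [f"turbine.aggregator.clusterConfig={','.join(clusters)}"]
    for cluster in clusters:
        port = service_ports[cluster]
        lines.append(f"turbine.instanceUrlSuffix.{cluster}=:{port}{stream_path}")
        lines.append(
            f"turbine.ConfigPropertyBasedDiscovery.{cluster}.instances={','.join(hosts)}"
        )
    return "\n".join(lines)


# Cluster names and ports taken from the example configuration above.
config = turbine_properties({"connectionsCS": 8052, "connectionsAS": 8050}, ["localhost"])
print(config)
```

Adding a service then means adding one entry to the mapping instead of editing three places in the configuration.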

@pparth
Author

pparth commented Mar 15, 2013

OK, I tested the new version and it seems to work. So, what do we have now? A Turbine cluster view that is exactly the same as the corresponding Dashboard view when connecting directly to the service port (service port == cluster port). So, in the aforementioned example, the direct Dashboard connection to http://localhost:8052/hystrix.stream is the same as the Dashboard connection through Turbine to http://turbine-hostname:port/turbine.stream?cluster=connectionsCS.
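Since each logical cluster gets its own Turbine stream, the Dashboard needs one URL per cluster. A small sketch of building them, following the URL shape quoted above (the `turbine_stream_url` helper and port 8080 are assumptions for illustration):

```python
from urllib.parse import urlencode


def turbine_stream_url(turbine_host: str, turbine_port: int, cluster: str) -> str:
    """Build the per-cluster Turbine stream URL in the shape quoted in this thread."""
    query = urlencode({"cluster": cluster})
    return f"http://{turbine_host}:{turbine_port}/turbine.stream?{query}"


# Port 8080 is an assumed value; the cluster names come from the config above.
urls = [
    turbine_stream_url("turbine-hostname", 8080, cluster)
    for cluster in ("connectionsCS", "connectionsAS")
]
for url in urls:
    print(url)
```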

I'm looking forward to the next steps...

@pparth
Author

pparth commented Apr 16, 2013

Hello,
Do we have anything new on that?

@benjchristensen
Contributor

Is this still worth keeping open?

@opuneet
Contributor

opuneet commented Jul 24, 2013

Hi @pparth

I'm going to close this issue because there's been no activity on it for a long time. Please reopen if necessary.

Note that there was a related issue here with Turbine Netflix/Turbine#9
which I'm going to close out as well.

To summarize this really long thread, we basically needed a way to combine metrics from multiple Turbine clusters and also represent multiple logical clusters on a single physical group of servers.

In the related Turbine issue, I've patched Turbine so that one can run multiple apps on the same physical h/w and then represent each distinct app within a distinct turbine cluster and hence get agg metrics for the entire cluster.

The Hystrix dash will be able to connect to each of these streams and give you the cluster level metrics for each app.

opuneet closed this as completed Jul 24, 2013

@codependent

@opuneet I am facing a similar problem and have looked through the docs, but I haven't found anything about this matter.

Our architecture consists of 2 apps, deployed in a cluster of 2 servers:

server1:9080/app1/hystrix.stream
server1:9080/app2/hystrix.stream
server2:9080/app1/hystrix.stream
server2:9080/app2/hystrix.stream

We would like to have a unified dashboard showing the info of these two services, but can't manage to configure it. So far we have made a wild guess with something like this:

turbine.aggregator.clusterConfig=myCluster
turbine.instanceUrlSuffix.myCluster=:9080/app1/hystrix.stream
turbine.instanceUrlSuffix.myCluster=:9080/app2/hystrix.stream

turbine.ConfigPropertyBasedDiscovery.myCluster.instances=server1,server2

Thanks in advance

EDIT: With the config above, we only get updates from app2.
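One plausible explanation for seeing only app2, assuming the configuration loader keeps a single value per key (as a plain key/value map would), is that the second `turbine.instanceUrlSuffix.myCluster` line overwrites the first. The sketch below uses a simplified, hypothetical properties parser to show that effect; it is not the loader Turbine actually uses.

```python
def load_properties(text: str) -> dict:
    """Parse simple key=value lines into a dict (simplified, hypothetical parser)."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()  # a repeated key overwrites the earlier value
    return props


config = """
turbine.aggregator.clusterConfig=myCluster
turbine.instanceUrlSuffix.myCluster=:9080/app1/hystrix.stream
turbine.instanceUrlSuffix.myCluster=:9080/app2/hystrix.stream
turbine.ConfigPropertyBasedDiscovery.myCluster.instances=server1,server2
"""

props = load_properties(config)
print(props["turbine.instanceUrlSuffix.myCluster"])
```

Under that assumption only the app2 suffix survives, which would match the behavior reported in the edit above.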

@seh13

seh13 commented Mar 25, 2015

@codependent Did you ever find a solution to your issue with 2 servers? I'm having pretty much the same issue and can't seem to figure out what the problem is.

Thanks!

@codependent

@seh13 Yes, it was pretty straightforward. I don't have access to the code right now, I'll post it here tomorrow.

@codependent

Hi @seh13, these are the relevant parts of my cfg:

#################################
# Turbine 
#################################
turbine.aggregator.clusterConfig=myCluster
turbine.instanceUrlSuffix.myCluster=/hystrix.stream

#################################
# ConfigPropertyBasedInstanceDiscovery 
# (config.properties-based impl only)
#################################
turbine.ConfigPropertyBasedDiscovery.myCluster.instances=domain1:9080/app1,domain1:9080/app2

You can keep adding domains+app to turbine.ConfigPropertyBasedDiscovery.myCluster.instances
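The reason this layout works can be sketched as follows, assuming Turbine forms each stream URL by appending the cluster's URL suffix to the instance string. That assumption is consistent with the config above working as reported, but it is not taken from Turbine's source, and the `stream_urls` helper is hypothetical.

```python
from typing import List


def stream_urls(instances: str, url_suffix: str) -> List[str]:
    """Compose one stream URL per instance entry (assumed concatenation model)."""
    return [
        f"http://{instance.strip()}{url_suffix}"
        for instance in instances.split(",")
        if instance.strip()
    ]


# Values copied from the configuration above.
urls = stream_urls("domain1:9080/app1,domain1:9080/app2", "/hystrix.stream")
for url in urls:
    print(url)
```

Moving the port and application path into the instance entry leaves a single shared suffix, so no property key has to be repeated.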

Hope it helps

@seh13
Copy link

seh13 commented Mar 26, 2015

@codependent Thanks a lot! I'll give that a try

Update: Worked perfectly, thanks again!
