[Feat]: Stream data from all tiers to a parent node (not just tier 0) #16849

Open
hugovalente-pm opened this issue Jan 25, 2024 · 9 comments
Labels
feature request New features needs triage Issues which need to be manually labelled

Comments

@hugovalente-pm
Contributor

hugovalente-pm commented Jan 25, 2024

Problem

If a user sets up a standalone agent using the 3 tiers and then wants to stream all the data from that agent to a parent agent, only tier 0 data is streamed.
This is an issue because, if the setup of Netdata Agents isn't carefully thought through from the start, it is easy for this situation to occur.

Discussion started on this discord thread

Description

When data is streamed from a child agent to a parent agent, the other tiers should be streamed as well, not only tier 0.

Importance

really want

Value proposition

  1. In cases where the setup/hierarchy of agents changes, it will be possible to ensure that all data from a child node is available on the parent node

Proposed implementation

No response

@hugovalente-pm hugovalente-pm added feature request New features needs triage Issues which need to be manually labelled labels Jan 25, 2024
@luisj1983
Contributor

+1
In the future it will be a very common scenario that a Netdata user starts off with the default standalone agent and then, for various reasons including increasing data retention, moves to a streaming parent-child setup. We should allow all data to be preserved.

@ilyam8
Member

ilyam8 commented Jan 26, 2024

But the new parent node will create tier 1 and 2 from tier 0, so what is the problem?

@luisj1983
Contributor

But the new parent node will create tier 1 and 2 from tier 0, so what is the problem?

I was describing the original issue with a little more context to help in assessing the priority of the feature.
How you guys achieve it is up to you :)

@hugovalente-pm
Contributor Author

@ilyam8 I was having a chat with @ktsaou about this and, due to the nature of how streaming works, this is indeed not a straightforward thing. In short, we build tier 1 and tier 2 as we collect data on tier 0.
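To illustrate the idea of higher tiers being built from tier 0 as data is collected, here is a minimal Python sketch. It is not Netdata's implementation. The group size of 60 points per aggregate and the fields kept per point (minimum, maximum, sum, count) are assumptions made for illustration, and `TierPoint`, `tier0_to_tier1`, and `merge_tier` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Assumption: each higher tier summarises 60 points of the tier below it.
POINTS_PER_AGGREGATE = 60


@dataclass
class TierPoint:
    """Summary of a group of samples from the tier below (hypothetical layout)."""
    minimum: float
    maximum: float
    total: float
    count: int

    @property
    def average(self) -> float:
        return self.total / self.count


def tier0_to_tier1(samples: List[float], group: int = POINTS_PER_AGGREGATE) -> List[TierPoint]:
    """Summarise raw tier 0 samples into tier 1 points."""
    points = []
    for start in range(0, len(samples), group):
        chunk = samples[start:start + group]
        points.append(TierPoint(min(chunk), max(chunk), sum(chunk), len(chunk)))
    return points


def merge_tier(points: List[TierPoint], group: int = POINTS_PER_AGGREGATE) -> List[TierPoint]:
    """Summarise the points of one tier into the next tier up."""
    merged = []
    for start in range(0, len(points), group):
        chunk = points[start:start + group]
        merged.append(TierPoint(
            min(p.minimum for p in chunk),
            max(p.maximum for p in chunk),
            sum(p.total for p in chunk),
            sum(p.count for p in chunk),
        ))
    return merged


# One hour of per-second samples (synthetic values, for illustration only).
tier0 = [float(i % 100) for i in range(3600)]
tier1 = tier0_to_tier1(tier0)   # one point per 60 samples
tier2 = merge_tier(tier1)       # one point per 60 tier 1 points
```

Under these assumptions, a parent can derive tier 1 and tier 2 only for the period covered by the tier 0 samples it receives. History that a child holds only in its higher tiers would have no tier 0 input to be rebuilt from.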

A suggestion that is probably much more efficient, both in cost of operations and in time to complete, is moving the files from the child directly to the parent.
Netdata could provide documentation and an auxiliary script that lets the user specify where the files should be copied to and transfers them to the destination over ssh.

@luisj1983 @stelfrag what do you think?
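As a rough sketch of what such an auxiliary script might look like, the Python below only builds and prints the copy commands; it does not run them. Several things here are assumptions rather than anything confirmed in this thread: the tier directory names under `/var/cache/netdata`, the use of `rsync` over ssh, and the hypothetical destination `netdata@parent.example` with staging directory `/srv/netdata-import`. Whether a parent handles copied files correctly is not settled by this sketch.

```python
import shlex
from pathlib import PurePosixPath
from typing import List

# Assumptions (not confirmed in this thread): tier data lives in these
# subdirectories of the Netdata cache directory.
CACHE_DIR = PurePosixPath("/var/cache/netdata")
TIER_DIRS = ["dbengine", "dbengine-tier1", "dbengine-tier2"]


def build_copy_commands(destination: str, dest_root: str) -> List[List[str]]:
    """Return one rsync-over-ssh command per tier directory.

    destination: hypothetical ssh target, e.g. "netdata@parent.example".
    dest_root:   hypothetical staging directory on the destination.
    """
    commands = []
    for name in TIER_DIRS:
        source = f"{CACHE_DIR / name}/"
        target = f"{destination}:{PurePosixPath(dest_root) / name}/"
        # -a preserves permissions and timestamps; -e ssh selects the transport.
        commands.append(["rsync", "-a", "-e", "ssh", source, target])
    return commands


for command in build_copy_commands("netdata@parent.example", "/srv/netdata-import"):
    # Print rather than execute, so the commands can be reviewed first.
    print(shlex.join(command))
```

The sketch copies into a staging directory instead of the parent's own cache directory, so nothing on the parent is overwritten until someone decides how the files should be merged.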

@luisj1983
Contributor

@hugovalente-pm

Thanks very much for picking this up!
I suppose it depends on whether the proposal is just something quick and dirty until something more substantial is delivered or not.

You're probably going to need to do some sort of 'offline export option' at some point anyway for the scenario where you have a parent with massive quantities of multi-tiered historical data and need to now replicate all those tiers off to another parent.

Now, as a quick and dirty workaround for me, your option isn't a terrible one, but from an IT Operations perspective it has a lot of challenges...

In the first place, I can't just ssh from a child node to a parent node (or vice versa) without reconfiguring a bunch of security, which would raise red flags from a SecOps team. For some context, I'm currently spending time removing the need to allow any inbound ssh on the local network by using Cloudflare tunnels; this would have me doing the opposite.

The second part is that remotely running a script like that on nodes at scale just isn't a nice thing to do. Sure, running stuff isn't hard, but handling any failures etc. takes work.

Another issue arises if, as an example, I want to use parents for data resilience; that is to say, have an exact replica of data and retention on the parent so that if a child (or the parent) gets nuked I can reconstitute that data easily (and don't worry, I'm well aware of the separate need for backups). In that scenario I've got to do all of the above, and if something breaks later on I have to do it all over again. If an organisation has any sort of change control process then that's just a nightmare. There are plenty of orgs where the time from change request to green-light is measured in weeks, and in that time data could easily be lost by falling out of the retention range.

@luisj1983
Contributor

Is there any plan around this? I still have a parent node sitting around useless because it is receiving streamed metrics but hasn't got the backfilled ones from the children.

@hugovalente-pm
Contributor Author

I think this is not one of the priorities for the @netdata/agent team atm. Doesn't the guide that you contributed help with getting that parent into the same "state" as the children? https://learn.netdata.cloud/docs/netdata-agent/backup-and-restore-an-agent

@luisj1983
Contributor

@hugovalente-pm
I'm not sure. I think the team was going to get back to me on which files to copy over.
Is it OK to just copy them into:

/var/cache/netdata/

How will the parent handle things like house-keeping for these files? Is that a concern?
Will the parent be able to handle the fact that it already has data for the same node for the same time period but in different files?

@hugovalente-pm
Contributor Author

From my understanding it should work; that's why we initially pointed you in that direction.
It is probably better to get someone from @netdata/agent to chime in on this.


3 participants