New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Display a warning in status when too many log tail #2265
Conversation
71550a3
to
c3fd1d2
Compare
@achntrl thanks for working on this! Someone from the team will review the code shortly, in the meantime, would you mind adding some screenshots of the agent status, to show how it displays? It helps the reviewer to picture how it would look like :) |
I found a small bug: If you have 2 folders with > 100 files, then you delete files and you go under 100 files then you add more than 100 files in the first folder and some files in the second folder, the counter for the second folder doesn't get updated It seems very specific but it's because of this loop It shouldn't be too hard to fix, I'll push a fix tomorrow (and hopefully a test!) |
This is great, adding the number of tailed files per wildcard tailer is more than I hoped we'd do, 👏 . |
pkg/logs/status/warnings.go
Outdated
messages = append( | ||
messages, | ||
fmt.Sprintf( | ||
"The default limit on the maximum number of files in use (%d) has been reached. Increase this limit (thanks to the attribute logs_config.open_files_limit in datadog.yaml) or decrease the number of tailed file.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can still reach the limit if we override the files limit -> it's not always the default limit
that has been reached
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the number displayed is the actual configured number, I will update the wording
pkg/logs/status/status.go
Outdated
if count == len(inputs) { | ||
return fmt.Sprintf("This path tails %d files", count) | ||
} | ||
return fmt.Sprintf("This path tries to tail %d files but only %d files are tailed", count, len(inputs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General nit: I'd avoid tries to
and but only
in agent statuses, as those tend to be blurry words.
What about something like %d files tailed out of %d files matching
that we could use for both cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path ..
-> %s ..." %{path}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this feature, the main logic is here but I'd like to raise any kind of warning from any kind of component, could you address this problem too please ?
pkg/logs/status/status.go
Outdated
@@ -111,3 +119,17 @@ func toDictionary(c *config.LogsConfig) map[string]interface{} { | |||
} | |||
return dictionary | |||
} | |||
|
|||
func generateOpenFilesLimitReachedWarning(c *config.LogsConfig, inputs []string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method should not be here, the status should be agnostic of the type of warning.
pkg/logs/status/status.go
Outdated
Configuration: toDictionary(source.Config), | ||
Status: status, | ||
Inputs: inputs, | ||
WarningMessage: generateOpenFilesLimitReachedWarning(source.Config, inputs), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect something like:
WarningMessage: source.Warning.Render(),
pkg/logs/status/status.go
Outdated
if count == len(inputs) { | ||
return fmt.Sprintf("This path tails %d files", count) | ||
} | ||
return fmt.Sprintf("This path tries to tail %d files but only %d files are tailed", count, len(inputs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This path ..
-> %s ..." %{path}
pkg/logs/input/file/file_provider.go
Outdated
@@ -34,6 +35,7 @@ func NewFile(path string, source *config.LogSource) *File { | |||
type Provider struct { | |||
filesLimit int | |||
shouldLogErrors bool | |||
reachedLimit bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not used
pkg/logs/input/file/file_provider.go
Outdated
@@ -71,9 +73,11 @@ func (p *Provider) FilesToTail(sources []*config.LogSource) []*File { | |||
|
|||
if len(filesToTail) == p.filesLimit { | |||
log.Warn("Reached the limit on the maximum number of files in use: ", p.filesLimit) | |||
status.Warnings.RaisedWarnings[status.OpenFilesLimitReachedType] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if the interface of Warnings
was status.Warnings.Raise(key string, message string)
, we could add any kind of message
pkg/logs/input/file/file_provider.go
Outdated
return filesToTail | ||
} | ||
|
||
status.Warnings.RaisedWarnings[status.OpenFilesLimitReachedType] = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There it would be status.Warnings.Solve(key string)
pkg/logs/status/status.go
Outdated
Configuration map[string]interface{} `json:"configuration"` | ||
Status string `json:"status"` | ||
Inputs []string `json:"inputs"` | ||
WarningMessage string `json:"warning"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should have a Warning
struct here that is renderable
.
pkg/logs/status/warnings.go
Outdated
|
||
// Warning types | ||
const ( | ||
OpenFilesLimitReachedType = "open_files_limit_reached" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think Warnings
should be aware of the kind of warning.
@@ -4,6 +4,11 @@ NOTE: Changes made to this template should be reflected on the following templat | |||
*/}}========== | |||
Logs Agent | |||
========== | |||
{{- if .warnings }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add this block after line 9
because can't raise warnings if the agent is not running
@@ -4,6 +4,11 @@ NOTE: Changes made to this template should be reflected on the following templat | |||
*/}}========== | |||
Logs Agent | |||
========== | |||
{{- if .warnings }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want {{ if .warnings }}
905e242
to
8cf832b
Compare
pkg/logs/warning/warning.go
Outdated
|
||
// Remove marks a RaisedWarning as solved | ||
func Remove(key string) { | ||
delete(w.raised, key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is not safe, you should check if the entry exists first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the entry does not exists, delete is a noop see
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
pkg/logs/warning/warning.go
Outdated
// This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
// Copyright 2018 Datadog, Inc. | ||
|
||
package warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This package needs to be unit-tested, especially the operations on the map since multiple clients on different threads could call Raise
and Get
roughly at the same time.
For more details, you can check https://stackoverflow.com/questions/36167200/how-safe-are-golang-maps-for-concurrent-read-write-operations
This looks great and fully reusable, I left two comments but we're heading to the right direction. |
pkg/logs/warning/warning.go
Outdated
// This product includes software developed at Datadog (https://www.datadoghq.com/). | ||
// Copyright 2018 Datadog, Inc. | ||
|
||
package warning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think we should move this package/file to ./pkg/logs/status
a1de789
to
ec760b4
Compare
pkg/logs/status/warning_test.go
Outdated
return "foo" | ||
} | ||
|
||
func TestRaise(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not test the inner map of the warnings
struct but just that Raise
followed by a Get
returns what I expect
pkg/logs/status/warning_test.go
Outdated
assert.Equal(t, []Warning{concreteWarning1, concreteWarning2}, GetWarnings()) | ||
} | ||
|
||
func TestRemove(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, I'd rather do something like Raise, Get, Remove, Get
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the code would be much simpler if we concentrated on having the possibility to display an arbitrary list of messages. It would be much more re-usable and we still can have the file provider use it to register a warning when we are tailing too many files.
pkg/logs/status/status.go
Outdated
Integrations []Integration `json:"integrations"` | ||
IsRunning bool `json:"is_running"` | ||
Integrations []Integration `json:"integrations"` | ||
WarningMessages []string `json:"warnings"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could have been simply generalized into Messages
.
If we need a severity we can always add that later.
pkg/logs/status/warning.go
Outdated
var w = newWarnings() | ||
|
||
// Warning is a generic interface that generate warning messages | ||
type Warning interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a string so I wonder if the abstraction is really necessary.
pkg/logs/config/source.go
Outdated
Config *LogsConfig | ||
Status *LogStatus | ||
inputs map[string]bool | ||
Overview string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Messages
would be more appropriate here as well as it doesn't convey additional semantics that aren't necessarily true here.
pkg/logs/status/status.go
Outdated
@@ -61,6 +61,7 @@ func Get() Status { | |||
} | |||
// Convert to json | |||
var integrations []Integration | |||
warningsDeduplicator := make(map[string]bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits: make(map[string]bool)
-> make(map[string]struct{})
pkg/logs/status/status.go
Outdated
}) | ||
|
||
for _, warning := range source.Messages.GetWarnings() { | ||
warningsDeduplicator[warning] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warningsDeduplicator[warning] = true
-> warningsDeduplicator[warning] = struct{}{}
pkg/logs/status/status.go
Outdated
warningMessages := make([]string, len(warnings)) | ||
for i, w := range warnings { | ||
warningMessages[i] = w.Render() | ||
warnings := make([]string, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnings := make([]string, 0)
-> var warnings []string
pkg/logs/status/status.go
Outdated
WarningMessages: warningMessages, | ||
IsRunning: true, | ||
Integrations: integrations, | ||
Messages: warnings, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit misleading, I am not sure the messages that we want to report at the very beginning of the logs status are always related to logs-sources.
We could for example report to the user that he is using a deprecated attribute and the he should use the new one.
pkg/logs/status/status.go
Outdated
@@ -61,6 +61,7 @@ func Get() Status { | |||
} | |||
// Convert to json | |||
var integrations []Integration | |||
warningsDeduplicator := make(map[string]bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a warning/message was not tight to a source, you'd need to do any deduplicate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the alternative would have been to compute the warning at the status create but this would require computer to identify LogSource for file an count the inputs
I found it messier than just deduplicating warning messages (which is quite a small operation)
When the user tails too many log files, some files are not tailed but the status doesn't report any problem (only in the logs of the agent)
c9ba6dd
to
2481ddc
Compare
|
||
// AddWarning create a warning | ||
func (m *Messages) AddWarning(key string, warning string) { | ||
m.AddMessage("warning_"+key, warning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you adding a prefix here ? If it's just for filtering out purpose, I'd rather avoid doing that because a string compare is not efficient.
You could create a structure:
type Message struct {
Value String
IsWarning bool
}
And use IsWarning for filtering out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to avoid collision between keys and to reuse the same methods so most warning methods are just calling the message one with the prefix warning_
pkg/logs/config/messages.go
Outdated
"sync" | ||
) | ||
|
||
// Messages holds a message message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "message" twice normal?
pkg/logs/config/source.go
Outdated
import "sync" | ||
import ( | ||
"sync" | ||
) | ||
|
||
// LogSource holds a reference to and integration name and a log configuration, and allows to track errors and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: to "an" integration
What does this PR do?
This PR displays a warning in the agent status when the user is trying to tail more files than the
logs_config.open_files_limit
is allowingMotivation
For now, this happens almost silently, the only way to find out that not all files are tailed is to read the logs of agent. This PR makes the problem more visible