Add Couch Stats Resource Tracker (CSRT) #5491
Conversation
src/couch_stats/src/csrt_logger.erl
end.

-spec tracker(PidRef :: pid_ref()) -> ok.
tracker({Pid, _Ref}=PidRef) ->
Nitpick.
It would be best to minimize the number of places where we assume a specific structure, since leaking implementation details and relying on them makes the code harder to update. How about introducing a helper in `csrt_util` or `csrt_server`?
tracker(PidRef) ->
    MonRef = csrt_server:monitor_pid_ref(PidRef),
    ...
I am thinking maybe `PidRef` is not the best name, since it implies a specific implementation. Maybe rename all instances of `PidRef` to `ResourceId` (this name makes sense since we have `csrt_server:get_resource(PidRef)`) and do not rely on the structure.
The idea with naming it `PidRef` is intentionally to not be totally opaque and to give a natural logic to what these ID values represent. For example, a coordinator process creates a new `PidRef` by way of `{self(), make_ref()}` and now we've got a logical handle to a particular process inducing work associated with a particular reference.

So I wanted this to be a unique ID value that makes sense as to what it represents, because we have all kinds of different process hierarchies within CouchDB with different requirements. For example, the `chttpd` processes that actually call `csrt:create_coordinator_context` are reused processes, so they need a new ref for every incoming http request and I wanted to be really sure we're not overlapping counting between requests.

So the idea is that a `PidRef` is very much a particular process with a `ref()` that represents a subset of the work that process performs, where it could be a full or partial subset. Then you can pull a `PidRef` out of a JSON blob returned from querying the `_active_resources` and go find all the associated workers, or look directly at the pid and see what all it's doing.

This becomes even more important for extending beyond the RPC worker process and tracking induced work in other areas of the system. For example, passing along a `PidRef` when making a request to `couch_file` is essential for tracking work induced within `couch_file` itself by the particular worker process, or compactor process, or indexer, etc.

I do agree on minimizing the defined structural definitions within the codebase, and to that effect I migrated nearly all actual instances of pattern matching on `PidRef` into `csrt_server` itself, but I missed this one and `csrt:destroy_context`:
(! 6187)-> grep -r '{[^,]\+,[^}]\+} \?= \?PidRef' src/couch_stats/src/*.erl
src/couch_stats/src/csrt.erl:destroy_context({_, _} = PidRef) ->
src/couch_stats/src/csrt_logger.erl:tracker({Pid, _Ref} = PidRef) ->
src/couch_stats/src/csrt_server.erl:destroy_resource({_, _} = PidRef) ->
src/couch_stats/src/csrt_server.erl:update_counter({_Pid, _Ref} = PidRef, Field, Count) when Count >= 0 ->
src/couch_stats/src/csrt_server.erl:inc({_Pid, _Ref} = PidRef, Field, N) when is_integer(N) andalso N > 0 ->
src/couch_stats/src/csrt_server.erl:update_element({_Pid, _Ref} = PidRef, Update) ->
I'll drop the structural pattern matching on `PidRef` out of `csrt.erl` and `csrt_logger.erl`, but I feel that the notion of a `PidRef` makes sense logically and helps mentally model the flow of this data through the system, at least for me. On an amusing note, I had a similar thought process as you when naming this, and one of the reasons I went with `PidRef` is that we already have precedent for referencing the notion of "PidRef"s here https://github.com/apache/couchdb/blob/couch-stats-resource-tracker-v3-rebase/src/mem3/src/mem3.erl#L457-L458 and in fact it's even utilized in the OTP codebase as well: https://github.com/search?q=repo%3Aerlang%2Fotp%20PidRef&type=code
Heh.. `csrt:destroy_context` was easy enough, but `csrt_logger:track` here actually destructures the `PidRef` to grab the `Pid` to monitor. I suppose I should add `csrt_util:get_{pid,ref}(PidRef :: maybe_pidref())`.
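A minimal sketch of what such helpers could look like (the `maybe_pidref()` type and the exact placement in `csrt_util` are just the proposal above, not what's in the PR):

```erlang
%% Hypothetical accessors sketched from the discussion above; the actual
%% implementation in the PR may differ.
-type maybe_pidref() :: {pid(), reference()} | undefined.

-spec get_pid(maybe_pidref()) -> pid() | undefined.
get_pid({Pid, _Ref}) when is_pid(Pid) ->
    Pid;
get_pid(undefined) ->
    undefined.

-spec get_ref(maybe_pidref()) -> reference() | undefined.
get_ref({_Pid, Ref}) when is_reference(Ref) ->
    Ref;
get_ref(undefined) ->
    undefined.
```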
%% are ignored.
%% Switch this to `permanent` once CSRT is out of experimental stage.
CSRTSup =
    {?CSRT_SUP, {?CSRT_SUP, start_link, []}, transient, infinity, supervisor, [?CSRT_SUP]},
I don't know if transient does what we want here? If a transient child crashes it will still restart?
EDIT: Asked Russell in a meeting about this and the idea is that the transient supervisor would still crash, but once it reaches its max restart limit it will shut down. `shutdown` is a normal stop, so then the transient child will stop but not propagate the crash further up.
I put some docs about it in the declaration there:
%% Use a dedicated CSRT supervisor with restart strategy set to transient
%% so that if a CSRT failure arises that triggers the sup rate limiter
%% thresholds, that shutdown signal will bubble up here and be ignored,
%% as the use of transient specifies that `normal` and `shutdown` signals
%% are ignored.
%% Switch this to `permanent` once CSRT is out of experimental stage.
But over in the Erlang docs https://www.erlang.org/doc/system/sup_princ.html#child-specification for Child Specs, we have for `restart=transient`:
A transient child process is restarted only if it terminates abnormally, that is, with an exit reason other than normal, shutdown, or {shutdown,Term}.
We create a dedicated `couch_stats_csrt_sup` supervisor marked as `transient` for the CSRT servers, so that those gen_server pids are restarted as you'd expect on error crashes, but in the event we hit the defined "Maximum Restart Intensity" (https://www.erlang.org/doc/system/sup_princ.html#maximum-restart-intensity) on the CSRT server pids, that triggers a shutdown on the `couch_stats_csrt_sup` pid, as declared in the link by:

If more than `MaxR` number of restarts occur in the last `MaxT` seconds, the supervisor terminates all the child processes and then itself. The termination reason for the supervisor itself in that case will be `shutdown`.

And because the `couch_stats_csrt_sup` supervisor is `transient`, we tell the primary `couch_stats_sup` to ignore any `shutdown` failures arising from restart intensity overload.
The key idea here is that in the past we've seen threshold based overload/error scenarios where the problem only manifests once certain data based conditions are met, but once those conditions have been met it can result in a crash loop disabling the system, which can be difficult to unwind from and often requires a hotpatch to restore proper service. Using a `transient` supervisor in this manner makes the core CouchDB database operations robust to isolated failures in experimental systems.

NOTE: I want to highlight that the use of the `transient` supervisor approach here is not targeted at this particular PR; rather, IMO, this is how we should be enabling all new experimental features until they've proven resilient in a wide variety of user workloads and deployment scenarios. IMO this `transient` approach on a dedicated supervisor should be the default approach used when adding new functionality, like CSRT, until the config toggles are switched to on by default, at which point the use of `transient` can be removed.
The core goal being to get bug reports of the form "so hey, I noticed this subsystem stopped generating data/logs/etc, here's a crash report from the logs", versus "after XYZ happened we started hitting a crash when querying $foo API operations, PLS FIX ASAP!!".
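For illustration, a minimal sketch of this supervision shape (the module name, children, and restart intensity values here are hypothetical; the real child spec is the one quoted above):

```erlang
%% Sketch of a dedicated supervisor for an experimental subsystem.
%% If its children exceed the restart intensity below, this supervisor
%% terminates with reason 'shutdown', which a transient child spec in the
%% parent supervisor ignores rather than propagating further up.
-module(csrt_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% Maximum Restart Intensity: more than 5 restarts within 10 seconds
    %% makes this supervisor itself exit with reason 'shutdown'.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [
        #{id => csrt_server, start => {csrt_server, start_link, []}},
        #{id => csrt_logger, start => {csrt_logger, start_link, []}}
    ],
    {ok, {SupFlags, Children}}.
```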
Couch Stats Resource Tracker (CSRT)
Couch Stats Resource Tracker (CSRT) is a framework for tracking the metrics induced in `couch_stats` at the process level to understand what requests and operations are utilizing the underlying system resources. The metrics in `couch_stats` are currently tracked at the node level, allowing you to see the quantitative aspects of node resource consumption, but lack the qualitative information to understand what induced those workloads. CSRT rectifies this by way of process local real time metrics collection exposed in a queryable interface, and also by way of generating process lifetime level reports documenting the total quantity and time of work induced by a given request.
This PR takes a different approach than `mango_stats` and other facilities that look to embed additional metrics that are not present in `couch_stats`; rather, the approach here is inverting this notion with the core idea that: if it's worth tracking the consumption of a specific resource then that resource is worth tracking by way of dedicated `couch_stats` metrics. So when the need for additional data collection arises, we should add those as node level stats in `couch_stats` and then track those at the process level by way of CSRT.

This is a rework of PR: #4812
This is a singular PR comprised of a few core components:
1) the underlying real time stats collection framework
2) the mechanisms for tracking the resource operations
3) the querying and logging facilities
The distinction is drawn between these components to highlight that the core
system here is the underlying process level real time stats collection framework
that allows tracking, querying, and logging of resource usage induced by
requests to and operations within CouchDB. The current requests and operations
tracked along with the mechanisms for querying and logging are a solid start but
by no means complete and the idea is to build upon the core framework of 1) to
introspect resource usage such that a user of CouchDB can see what requests and
operations are comprising their system usage.
Understanding what requests utilize the given resources is incredibly important for aggregate operations that require consuming far more resources than they manifest back in the client response, as we currently lack visibility into these workload discrepancies. For instance, if you have a `_find` query that falls back to the all docs index and is a query for which no rows will be found, that singular http request will induce a full database scan of all documents, painful enough on its own but magnified greatly when induced in parallel. These types of requests usually manifest at the client level as timeouts, and within the database as heavy IO reader processes that can create a thundering herd on database shards. Filtered changes feeds are similar, but even worse in that they funnel the full database scan through the JavaScript engine pipeline. At least with views the workload induced on the system is roughly similar to what's returned to the client, which provides at least some mechanism for the user to understand what's going on; e.g. if you query 70 million rows from a view in a singular http request, that's something that you'll be able to see and realize is obviously problematic. CSRT resolves these issues by logging heavy duty requests as reports, allowing for post facto analysis of what the database was doing over a given time period, and also by providing a real time querying system for finding out what the hot requests are right now.
1) the underlying real time stats collection framework
The previous PR details a few different approaches that failed to scale effectively during my testing. Instead, the approach in this PR builds upon the significant improvements made to ETS in Erlang around atomic updates and decentralized counters, combined with an approach that performs no concurrent writes to the same key, as all tracked processes directly update the central stats table by way of `ets:update_counter`, which performs atomic and isolated updates. This approach, combined with `read_concurrency` and `write_concurrency`, allows for a highly concurrent data collection mechanism that still allows real time querying. It's easy enough to track the process local stats and generate a report at the end of the process lifetime, but given that workloads induced from a request can potentially last for hours, it's critical to have a queryable real time mechanism that allows for introspection of these long running tasks, especially if we want to incorporate further information from long running background jobs like indexing, replication, and compaction.
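To make the shape of this concrete, here is a minimal sketch of the pattern (the module, table name, and row layout are illustrative only, not the actual CSRT schema):

```erlang
-module(csrt_ets_sketch).
-export([init_table/0, bump/2]).

%% Illustrative only: a central stats table where each tracked process
%% updates only its own row, so there are no concurrent writes to one key,
%% while other processes can still read the table for real time querying.
init_table() ->
    ets:new(csrt_example_stats, [
        named_table,
        public,
        {read_concurrency, true},
        {write_concurrency, true},
        {decentralized_counters, true}
    ]).

%% Bump a single per-process counter; position 2 is the counter field in
%% the {Key, Count} row. ets:update_counter/4 inserts the default row when
%% the key is missing and performs an atomic, isolated increment.
bump(PidRef, N) when is_integer(N), N > 0 ->
    ets:update_counter(csrt_example_stats, PidRef, {2, N}, {PidRef, 0}).
```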
2) the mechanisms for tracking the resource operations
CSRT hooks into `couch_stats:increment_counter` from within the caller process, and in the event the metric counter being incremented is one tracked by CSRT, we then update the ETS entry for the caller process for the given stat. This greatly simplifies the approach of stats collection, avoids having a centralized set of processes gathering and publishing stats, and avoids concurrent writes to the same key given each process tracks itself.
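A rough sketch of that hook shape (the module, table name, field positions, and the use of the process dictionary for the context handle are all illustrative assumptions, not the PR's actual integration):

```erlang
-module(csrt_hook_sketch).
-export([maybe_track/2]).

%% Hypothetical mapping from a tracked couch_stats metric name to the tuple
%% position of the matching per-process counter in the stats row.
tracked_field([couchdb, database_reads]) -> 2;
tracked_field(_) -> undefined.

%% Intended to be called from within couch_stats:increment_counter/2 in the
%% caller process: if the metric is tracked and this process has a CSRT
%% context, bump this process's own row atomically; otherwise do nothing.
maybe_track(Name, Value) ->
    case {tracked_field(Name), get(csrt_pid_ref)} of
        {undefined, _} ->
            ok;
        {_, undefined} ->
            ok;
        {FieldPos, PidRef} ->
            Default = {PidRef, 0},
            _ = ets:update_counter(csrt_example_stats, PidRef, {FieldPos, Value}, Default),
            ok
    end.
```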
As mentioned above, the key of this PR is the core framework for stats collection and funneling of data between nodes, with a limited subset of operations being tracked by way of CSRT. Currently we track coordinator processes as the http request flows into `chttpd`, and then we track RPC workers by way of `rexi_server:init_p`. As the RPC worker processes send messages back by way of `rexi`, they also embed a snapshot of the current workload delta since the last rexi message sent, allowing the coordinator process to accumulate and reflect the full workload induced, which is then queryable. This mechanism allows `rexi:ping` to keep deltas flowing, so even when RPC workers are not sending data back to the client, they will funnel along their CSRT stats so we can easily find the heavy processes. This PR intentionally keeps the focus to the coordinator processes and `rexi_server` based RPC workers, but the expectation is that this tracking system can be utilized for background job workers and the other RPC subsystems like in `dreyfus` and `mem3`.

We specifically track RPC workers spawned by way of `rexi_server:init_p`, but we're less specific about the underlying RPC operation called; rather, we track whether the RPC process induces metric increments for one of the metrics CSRT is interested in tracking. This goes back to the philosophy of CSRT, which is to track the important things we already track in `couch_stats`, so rather than specifically counting the number of documents opened in an `_all_docs` request, we instead track the stat `[couchdb, database_reads]` for all RPC workers induced by `rexi_server:init_p`, and then we can find out what the actual request that induced the workload was by way of the generated log report.
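A simplified sketch of the delta idea (the map-based shapes and function names are illustrative, not the PR's actual types):

```erlang
-module(csrt_delta_sketch).
-export([make_delta/2, accumulate/2]).

%% Illustrative only: an RPC worker keeps the stats snapshot from the last
%% message it sent and embeds only the difference in the next rexi message,
%% so the coordinator can incrementally accumulate the full induced workload.
make_delta(Current, LastSent) ->
    maps:map(
        fun(Key, Value) -> Value - maps:get(Key, LastSent, 0) end,
        Current
    ).

%% Coordinator side: fold a received delta into its aggregate view.
accumulate(Delta, Acc) ->
    maps:fold(
        fun(Key, Value, M) ->
            maps:update_with(Key, fun(V) -> V + Value end, Value, M)
        end,
        Acc,
        Delta
    ).
```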
3) the querying and logging facilities
There's a rudimentary http based querying system that allows for counting,
grouping, and sorting on different fields of the real time values of the running
processes. There's also a process lifetime logging facility that allows for
customizable filtering of what process lifetimes to log. Both of these are
capable but rough and could use help.
The CSRT logger utilizes `ets:fun2ms` to create powerful and performant filtering functions to determine whether to log the process lifetime report of a process once it has concluded its operations. These filter match specs are compiled and saved as persistent terms to be readily available to all processes in the system, which allows tracking processes to monitor the lifetime of coordinator or worker processes and, upon their exit, load the precompiled match spec and locally determine whether or not to generate a log. This distributed approach avoids centralized process trackers, a major overload failure mode of previous experiments, while utilizing the significant performance gains of `persistent_term`, and it even avoids incurring heavy copying of values into the local caller process, as the compiled match specs returned out of persistent term are actually just refs to an internal representation!
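A small sketch of that flow, assuming a simplified `#rctx{}` record and an illustrative matcher key (the real record, matcher names, and registration API have more to them):

```erlang
-module(csrt_matcher_sketch).
-include_lib("stdlib/include/ms_transform.hrl").
-export([register_docs_read_matcher/1, should_log/1]).

%% Simplified stand-in for the real resource context record.
-record(rctx, {pid_ref, dbname, docs_read = 0, ioq_calls = 0}).

%% Compile a "more than Threshold docs read" filter once and stash it in
%% persistent_term so any tracker process can fetch and run it cheaply.
register_docs_read_matcher(Threshold) ->
    MSpec = ets:fun2ms(fun(#rctx{docs_read = R}) when R >= Threshold -> true end),
    %% The compiled match spec behaves like a ref, so fetching it later does
    %% not copy a large term into the caller process.
    persistent_term:put({csrt_matcher, docs_read}, ets:match_spec_compile(MSpec)).

%% Run the precompiled matcher locally against a finished context to decide
%% whether to emit a process lifetime report.
should_log(#rctx{} = Rctx) ->
    CompiledMS = persistent_term:get({csrt_matcher, docs_read}),
    ets:match_spec_run([Rctx], CompiledMS) =:= [true].
```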
Currently there are a handful of builtin matchers that are configurable by way
of the ini files, for example, filters on requests that perform more than X
docs read, or IOQ calls, or docs written. There's also a direct matcher for
dbname, or a more specific dbname IO matcher that matches when a request to the
provided dbname does more than X ops to any of the core IO metrics.
These are some baseline matchers that I thought would be useful, but it would be great to get ideas from more folks on what they would find useful. These matchers are easy to add as functions or dynamically by way of `remsh`, but I haven't come up with a great way to declare the more complicated config chains in ini files. I'm hoping other folks might have some ideas on this front; for example, the following matcher is easy to write and dynamically register to perform filters, but I couldn't come up with a great approach to allow for specifying these types of complex queries in an ini file:
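As an illustrative stand-in for such a matcher (the record fields, dbname, and thresholds are hypothetical, and in a remsh session the `#rctx{}` record definition would also need to be loaded first, e.g. via `rr/1`):

```erlang
%% Illustrative only: a compound matcher combining a dbname with multiple
%% counter thresholds, easy to build and register dynamically from remsh
%% but awkward to express declaratively in an ini file.
Matcher = ets:fun2ms(
    fun(#rctx{pid_ref = PidRef, dbname = DbName, docs_read = DocsRead, ioq_calls = IoqCalls})
          when DbName =:= <<"foo/bar">> andalso
               (DocsRead > 1000 orelse IoqCalls > 10000) ->
        PidRef
    end
).
```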
The `csrt_logger` server will allow you to dynamically register match specs like that to then generate process lifetime reports for processes that exit and whose workloads matched those filter thresholds. This same matcher can also be run directly against the ETS table to allow for sophisticated real time filtering by way of match specs. Or you can even run the matcher against a list of `#rctx{}` values, which is basically what `csrt_logger:tracker` does.
does.I'm quite excited for the dynamic logging and introspection capabilities
afforded by the match spec filtering mechanisms, but I am scratching my head a
bit coming up with a good approach for the combinatorics of different filter
pairings of the various fields. I've added a handful of builtin matchers in this
PR, and I think we can provide a useful set of "slow query" type basic filters
from the get go. Hopefully someone can come up with an expressive way of
chaining the matcher specs, but might be challenging given it's doing parse
transforms and what not. That said, it's entirely possible to write some fairly
verbose matchspec funs for various more complex matchers we want, I don't think
that's a problem for useful filters we expect to stick around.
TODO
Add some concluding thoughts, a few examples of output, and some requested
review discussion points.