Merge pull request #1067 from MetPX/issue824_morestates_2ndTry
Issue824 morestates and issue966  and... other stuff
petersilva committed May 27, 2024
2 parents 2df0acf + 2fa43c7 commit 08a468a
Showing 9 changed files with 524 additions and 258 deletions.
80 changes: 43 additions & 37 deletions docs/source/Explanation/CommandLineGuide.rst
@@ -436,45 +436,45 @@ status

Sample OK status (sr is running)::

fractal% sr3 status
SSC-5CD2310S60% sr3 status

status:
Component/Config Processes Connection Lag Rates
State Run Retry msg data Queued LagMax LagAvg Last %rej pubsub messages RxData TxData
----- --- ----- --- ---- ------ ------ ------ ---- ---- ------ -------- ------ ------
cpost/veille_f34 run 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd1_f04 run 1/1 0 100% 0% 0 0.00s 0.00s n/a 31.3% 0 Bytes/s 4 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd2_f05 run 1/1 0 100% 0% 0 0.00s 0.00s n/a 31.3% 0 Bytes/s 4 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f14 run 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f15 run 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
poll/f62 run 1/1 0 100% 0% 0 0.08s 0.04s 1.4s 0.0% 2.0 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
post/shim_f63 stop 0/0 - - - - - - -
post/test2_f61 stop 0/0 0 100% 0% 0 0.02s 0.01s 0.4s 0.0% 8.1 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
sarra/download_f20 run 3/3 0 100% 10% 0 13.17s 5.63s 1.8s 0.0% 5.4 KiB/s 4 msgs/s 1.7 KiB/s 0 Bytes/s
sender/tsource2send_f50 run 10/10 0 100% 9% 0 1.37s 1.08s 1.9s 0.0% 8.1 KiB/s 5 msgs/s 0 Bytes/s 1.7 KiB/s
shovel/pclean_f90 run 3/3 136 100% 0% 0 0.00s 0.00s 0.6s 0.0% 4.0 KiB/s 5 msgs/s 0 Bytes/s 0 Bytes/s
shovel/pclean_f92 run 3/3 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/rabbitmqtt_f22 run 3/3 0 100% 0% 0 0.89s 0.67s 1.5s 0.0% 8.1 KiB/s 5 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd1_f00 run 3/3 0 100% 0% 124 23.15s 4.50s 0.1s 55.0% 3.9 KiB/s 9 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd2_f00 run 3/3 0 100% 0% 83 11.82s 3.50s 0.1s 49.2% 3.6 KiB/s 8 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/amqp_f30 run 3/3 0 100% 12% 0 18.79s 9.22s 0.1s 0.0% 3.3 KiB/s 4 msgs/s 1.9 KiB/s 0 Bytes/s
subscribe/cclean_f91 run 3/3 145 100% 0% 1 0.00s 0.00s 0.4s 0.0% 2.3 KiB/s 6 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cdnld_f21 run 3/3 0 100% 17% 12 7.20s 2.81s 0.7s 0.0% 2.3 KiB/s 3 msgs/s 1.7 KiB/s 0 Bytes/s
subscribe/cfile_f44 run 3/3 0 100% 6% 1 3.32s 0.32s 0.4s 0.0% 2.3 KiB/s 6 msgs/s 1.7 KiB/s 0 Bytes/s
subscribe/cp_f61 run 3/3 0 100% 3% 0 6.42s 3.49s 1.6s 0.0% 4.2 KiB/s 6 msgs/s 635 Bytes/s 0 Bytes/s
subscribe/ftp_f70 run 3/3 0 100% 8% 0 1.18s 0.83s 0.2s 0.0% 1.8 KiB/s 3 msgs/s 1.8 KiB/s 0 Bytes/s
subscribe/q_f71 run 3/3 0 100% 0% 0 1.62s 0.57s 0.0s 0.0% 1.2 KiB/s 3 msgs/s 1.2 KiB/s 0 Bytes/s
subscribe/rabbitmqtt_f31 run 3/3 0 100% 11% 0 4.27s 1.95s 1.2s 0.0% 4.2 KiB/s 6 msgs/s 637 Bytes/s 0 Bytes/s
subscribe/u_sftp_f60 run 3/3 0 100% 1% 0 2.69s 2.23s 1.3s 0.7% 4.2 KiB/s 6 msgs/s 644 Bytes/s 0 Bytes/s
watch/f40 run 1/1 0 100% 0% 0 0.10s 0.05s 1.9s 0.0% 4.2 KiB/s 0 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t00_f10 run 1/1 0 100% 0% 0 12.31s 4.33s 3.5s 50.0% 3.2 KiB/s 3 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t01_f10 run 1/1 0 100% 0% 0 11.59s 3.76s 0.1s 50.5% 4.2 KiB/s 4 msgs/s 0 Bytes/s 0 Bytes/s
Total Running Configs: 25 ( Processes: 64 missing: 0 stray: 0 )
Memory: uss:2.4 GiB rss:3.3 GiB vms:6.2 GiB
CPU Time: User:39.62s System:4.42s
Pub/Sub Received: 103 msgs/s (80.6 KiB/s), Sent: 63 msgs/s (32.8 KiB/s) Queued: 221 Retry: 281, Mean lag: 2.32s
Data Received: 32 Files/s (11.9 KiB/s), Sent: 5 Files/s (1.7 KiB/s)
cpost/veille_f34 idle 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd1_f04 stop 0/0 0 0% 0% 0 0.00s 0.00s 0 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/pelle_dd2_f05 stop 0/0 0 0% 0% 0 0.00s 0.00s 0 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f14 hung 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
cpump/xvan_f15 hung 1/1 0 100% 0% 0 0.00s 0.00s n/a 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
poll/f62 idle 1/1 0 100% 0% 0 0.00s 0.00s 221.3s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
post/shim_f63 stop 0/0 0 0% 0% 0 0.00s 0.00s 0 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
post/test2_f61 stop 0/0 0 100% 0% 0 0.00s 0.00s 224.6s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
sarra/download_f20 idle 3/3 0 100% 0% 0 0.00s 0.00s 324.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
sender/tsource2send_f50 idle 10/10 0 100% 0% 0 0.00s 0.00s 140.9s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/pclean_f90 idle 3/3 231 100% 0% 0 0.00s 0.00s 140.8s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/pclean_f92 idle 3/3 0 100% 0% 0 0.00s 0.00s 142.1s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/rabbitmqtt_f22 hung 3/3 0 100% 0% 0 0.00s 0.00s 139.9s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd1_f00 stop 0/0 0 0% 0% 0 0.00s 0.00s 0 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
shovel/t_dd2_f00 stop 0/0 0 0% 0% 0 0.00s 0.00s 0 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/amqp_f30 idle 3/3 0 100% 0% 0 0.00s 0.00s 323.9s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cclean_f91 idle 3/3 247 100% 0% 1 0.00s 0.00s 217.1s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cdnld_f21 idle 3/3 0 100% 0% 17 0.00s 0.00s 313.2s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cfile_f44 idle 3/3 0 100% 0% 1 0.00s 0.00s 217.1s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/cp_f61 idle 3/3 0 100% 0% 0 0.00s 0.00s 139.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/ftp_f70 part 1/3 8 100% 3% 0 586.95s 438.50s 224.4s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/q_f71 lag 3/3 15 100% 39% 0 573.83s 457.23s 221.0s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/rabbitmqtt_f31 idle 3/3 0 100% 0% 0 0.00s 0.00s 138.1s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
subscribe/u_sftp_f60 idle 3/3 0 100% 0% 0 0.00s 0.00s 140.8s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
watch/f40 idle 1/1 0 100% 0% 0 0.00s 0.00s 141.3s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t00_f10 idle 1/1 0 100% 0% 42 0.00s 0.00s 324.3s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
winnow/t01_f10 idle 1/1 0 100% 0% 0 0.00s 0.00s 325.4s 0.0% 0 Bytes/s 0 msgs/s 0 Bytes/s 0 Bytes/s
Total Running Configs: 21 ( Processes: 54 missing: 2 stray: 0 )
Memory: uss:2.2 GiB rss:3.0 GiB vms:6.6 GiB
CPU Time: User:562.61s System:47.39s
Pub/Sub Received: 0 msgs/s (0 Bytes/s), Sent: 0 msgs/s (0 Bytes/s) Queued: 97 Retry: 1019, Mean lag: 445.07s
Data Received: 0 Files/s (0 Bytes/s), Sent: 0 Files/s (0 Bytes/s)

Full sample::

@@ -542,10 +542,16 @@ The second row of output gives detailed headings within each category:
The configurations are listed on the left. For each configuration, the state
will be:

* stop: no processes are running.
* run: all processes are running.
* part: some processes are running.
* disa: disabled, configured not to run.
* hung: processes appear hung, not writing anything to logs.
* idle: all processes running, but no data or message transfers for too long (runStateThreshold_idle)
* lag: all processes running, but messages being processed are too old (runStateThreshold_lag)
* part: some processes are running, others are missing.
* reje: all processes running, but too high a percentage of messages being rejected (runStateThreshold_reject)
* rtry: all processes running, but too many transfers have failed and are being retried (runStateThreshold_retry)
* run: all processes are running (and transferring, not behind, and not slow: the normal state.)
* slow: transferring less than the minimum bytes/second (runStateThreshold_slow)
* stop: no processes are running.
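
The state list above can be read as a sequence of checks against the thresholds. The following is a minimal sketch of that reading, not the sr3 implementation: the names `FlowMetrics` and `classify` are hypothetical, the order in which the checks are applied is an assumption, and the default values are the ones given for the runStateThreshold_* options.

```python
from dataclasses import dataclass


@dataclass
class FlowMetrics:
    """Hypothetical container for the values shown by *sr3 status*."""
    running: int              # processes found
    expected: int             # processes expected
    disabled: bool = False    # configured not to run
    log_age: float = 0.0      # seconds since anything was written to the logs
    idle_time: float = 0.0    # seconds since the last transfer, post or message
    lag_avg: float = 0.0      # LagAvg column, in seconds
    reject_pct: float = 0.0   # %rej column
    retry_count: int = 0      # Retry column
    byte_rate: float = 0.0    # RxData + TxData, in bytes/second


# Defaults taken from the runStateThreshold_* option headings.
THRESHOLDS = {"hung": 450, "idle": 900, "lag": 30,
              "reject": 80, "retry": 1000, "slow": 0}


def classify(m, t=THRESHOLDS):
    """Return a state name; the precedence of the checks is an assumption."""
    if m.disabled:
        return "disa"
    if m.running == 0:
        return "stop"
    if m.running < m.expected:
        return "part"
    if m.log_age > t["hung"]:
        return "hung"
    if m.idle_time > t["idle"]:
        return "idle"
    if m.lag_avg > t["lag"]:
        return "lag"
    if m.reject_pct > t["reject"]:
        return "reje"
    if m.retry_count > t["retry"]:
        return "rtry"
    if m.byte_rate < t["slow"]:
        return "slow"
    return "run"
```

With these assumptions, a flow showing 1/3 processes is reported as *part* whatever its lag, and a flow with all processes present and a LagAvg of several hundred seconds is reported as *lag*.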

The next columns to the right give more information, detailing how many processes are Running, out of the number expected.
For example, 3/3 means 3 processes or instances found of the 3 expected to be found.
90 changes: 86 additions & 4 deletions docs/source/Reference/sr3_options.7.rst
@@ -895,6 +895,7 @@ housekeeping <interval> (default: 300 seconds)
The **housekeeping** option sets how often to execute periodic processing as determined by
the list of on_housekeeping plugins. By default, it prints a log message every housekeeping interval.


include config
--------------

@@ -1483,6 +1484,7 @@ are uptodate. If the queue already exists, These flags can be
set to False, so that no attempt is made to declare the queue or its bindings.
These options are useful on brokers that do not permit users to declare their queues.


randomize <flag>
----------------

@@ -1595,11 +1597,91 @@ The **retry_ttl** (retry time to live) option indicates how long to keep trying
a file before it is aged out of the queue. Default is two days. If a file has not
been transferred after two days of attempts, it is discarded.

sanity_log_dead <interval> (default: 1.5*housekeeping)
------------------------------------------------------

The **sanity_log_dead** option sets how long to consider too long before restarting
a component.
runStateThreshold_hung <interval> (default: 450)
------------------------------------------------

The runStateThreshold_hung (formerly: **sanity_log_dead**) option sets how long a component may go
without writing to its logs before it is considered hung and due for a restart.
When running *sr3 status*, the flow status will be shown as *hung*.

This may indicate a problem with a poll plugin not releasing the CPU, or some sort of network hiccup.
Running *sr3 sanity* periodically (as a cron job) will restart hung flows.
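
As an illustration of the idea, the sketch below treats a flow as hung when none of its log files has been written within the threshold. It is only a sketch under assumptions: the function names are hypothetical, the log paths are supplied by the caller, and using the file modification time as the "last write" is a simplification rather than a description of how sr3 measures it.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Age of a log file, using its modification time as the last write."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def is_hung(log_paths, threshold=450.0, now=None):
    """True when every given log is older than `threshold` seconds.

    Assumption: a flow counts as hung only if all of its instance logs
    are stale; an empty list of logs is not reported as hung.
    """
    ages = [seconds_since_last_write(path, now) for path in log_paths]
    return bool(ages) and min(ages) > threshold
```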


runStateThreshold_idle <interval> (default: 900)
------------------------------------------------

The runStateThreshold_idle option sets how long a flow may go without any transfers before it is declared idle.
The *sr3 status* command will show such flows as *idle*.

In a flow that posts data, the last activity will be based on when it posted last.
In a flow that transfers data, the last activity will be based on the last data transfer.
In a flow that does neither of the above, the last activity is based on the last message
received.

This isn't a problem in itself, unless one is expecting a continuous flow. If a continuous flow
of a certain rate is expected, set the *runStateThreshold_slow* for the flow so that *sr3 status* flags
it as a problem.
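
The choice of "last activity" described above can be sketched as follows. The function names and the use of `None` for "never happened" are assumptions made for illustration, as is the precedence given to posting when a flow both posts and transfers.

```python
def last_activity(last_post=None, last_transfer=None, last_message=None):
    """Pick the timestamp that counts as the flow's last activity.

    Posting flows use the last post, transferring flows the last data
    transfer, and other flows the last message received.
    """
    for stamp in (last_post, last_transfer, last_message):
        if stamp is not None:
            return stamp
    return None


def is_idle(now, last_post=None, last_transfer=None, last_message=None,
            threshold=900.0):
    """True when the last activity is more than `threshold` seconds old.

    Assumption: a flow with no recorded activity at all is not reported
    as idle.
    """
    stamp = last_activity(last_post, last_transfer, last_message)
    return stamp is not None and (now - stamp) > threshold
```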


runStateThreshold_lag <interval> (default: 30)
----------------------------------------------

The runStateThreshold_lag option sets how much delay in message processing to consider normal.
If the *LagAvg* shown by the *sr3 status* command exceeds this, then the flow will be listed
as *lag*.

When a flow is lagging, one should consider accelerating it:

* narrow the scope of the subscription (narrower *subtopics*)
* narrow the scope of the downloads (more *reject*'s in the accept/reject clauses)
* increase the download resources (more *instances*)
* split the flow into multiple independent flows.

If the amount of lag being seen is reasonable to accept for the application, then
raising the runStateThreshold_lag for that flow could also be a reasonable remedy.
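
To make the comparison concrete, here is a small sketch of how maximum and average lag could be derived and checked against the threshold. It assumes lag means the time between a message being published and being processed; `lag_stats` and `is_lagging` are hypothetical names, not sr3 functions.

```python
def lag_stats(published, processed):
    """Return (lag_max, lag_avg) in seconds for paired timestamps.

    `published` and `processed` are sequences of epoch seconds, one pair
    per message. Negative differences (clock skew) are clamped to zero.
    """
    lags = [max(0.0, done - pub) for pub, done in zip(published, processed)]
    if not lags:
        return 0.0, 0.0
    return max(lags), sum(lags) / len(lags)


def is_lagging(lag_avg, threshold=30.0):
    """True when the average lag exceeds runStateThreshold_lag."""
    return lag_avg > threshold
```

For example, two messages processed 5 and 10 seconds after publication give a maximum lag of 10 seconds and an average of 7.5 seconds, which is within the default threshold of 30.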


runStateThreshold_reject <count> (default: 80)
----------------------------------------------

The runStateThreshold_reject option sets the percentage of messages a flow may reject
and still be considered normal. If the *%rej* field in the *sr3 status* command
exceeds this, then the flow will be listed as *reje*.

To address this, examine the subtopics in the configuration, and narrow them so that fewer messages
are transferred from the broker if possible. If that is not possible, then raise the threshold
for the flow affected.


runStateThreshold_retry <count> (default: 1000)
-----------------------------------------------

The runStateThreshold_retry option sets how large a queue of transfers and posts to retry
is considered normal. If the *Retry* field in the *sr3 status* command
exceeds this, then the flow will be listed as *rtry*.

To address this, examine the logs to determine why so many transfers are failing, and address the cause.
If the cause cannot be addressed and this needs to be considered normal, then raise the threshold
for this configuration to match this expectation.


runStateThreshold_slow <byterate> (default: 0)
----------------------------------------------

The runStateThreshold_slow option sets how many bytes per second this flow is expected
to transfer normally.

If the *RxData* and *TxData* fields in the *sr3 status* command, combined, are lower than this,
then the flow will be listed as *slow*.

To address this, consider whether the download patterns (accept/reject) are downloading
too much data. Downloading unused data will slow down the transfer of the data of interest.

After considering the data in the flow, consider increasing instances in the configuration
or splitting up the load among several configurations, to improve parallelization.
If the speed observed is to be considered normal for that flow, then set the threshold
appropriately.
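
The comparison described above amounts to the following sketch. The function name and the idea of measuring byte counts over a sampling interval are assumptions for illustration; with the default threshold of 0, no flow is ever reported as slow.

```python
def is_slow(rx_bytes, tx_bytes, interval_seconds, threshold=0.0):
    """True when the combined receive and send rate is below the threshold.

    `rx_bytes` and `tx_bytes` are the bytes received and sent during a
    sampling interval of `interval_seconds`; `threshold` is in bytes/second.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rate = (rx_bytes + tx_bytes) / interval_seconds
    return rate < threshold
```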


scheduled_interval,scheduled_hour,scheduled_minute
--------------------------------------------------
13 changes: 9 additions & 4 deletions docs/source/fr/Explication/GuideLigneDeCommande.rst
@@ -544,10 +544,15 @@ La deuxième rangée donne des détails sur les en têtes de chacune des catégo
The configurations are listed on the left. For each configuration, the state
will be:

* stopped: no processes are running.
* running: all processes are running.
* partial: some processes are running.
* disabled: configured not to run.
* hung: processes appear hung and are not writing anything to the logs.
* idle: all processes running, but no transfers for too long (runStateThreshold_idle)
* lag: all processes running, but the messages being processed are too old (runStateThreshold_lag)
* part: some processes are running, others are missing.
* reje: all processes running, but too high a percentage of messages rejected (runStateThreshold_reject)
* rtry: all processes running, but a large number of transfers are failing, causing further attempts (runStateThreshold_retry)
* run: all processes are running (and transferring, not behind, and not slow: the normal state.)
* slow: transferring less than the minimum bytes/second (runStateThreshold_slow)
* stop: no processes are running.

The columns to the right give more information, detailing the number of processes running out of the number expected.
For example, 3/3 means 3 processes or instances were found out of the 3 expected.