Diego Metrics

A list of component-level metrics emitted by Diego. Contributors interested in adding new metrics should visit our contributor doc for a list of code conventions we follow.

Auctioneer

| Metric | Description | Unit |
| --- | --- | --- |
| AuctioneerFailedCellStateRequests | Cumulative number of cells the auctioneer failed to query for state. Emitted during each auction. | number |
| AuctioneerFetchStatesDuration | Time the auctioneer took to fetch state from all the cells when running its auction. Emitted during each auction. | ns |
| AuctioneerLRPAuctionsFailed | Cumulative number of LRP instances that the auctioneer failed to place on Diego cells. Emitted during each auction. | number |
| AuctioneerLRPAuctionsStarted | Cumulative number of LRP instances that the auctioneer successfully placed on Diego cells. Emitted during each auction. | number |
| AuctioneerTaskAuctionsFailed | Cumulative number of Tasks that the auctioneer failed to place on Diego cells. Emitted during each auction. | number |
| AuctioneerTaskAuctionsStarted | Cumulative number of Tasks that the auctioneer successfully placed on Diego cells. Emitted during each auction. | number |
| LockHeld | Whether an auctioneer holds the auctioneer lock (in Locket): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active auctioneer. | 0 or 1 (boolean) |
| RequestCount | Cumulative number of requests the auctioneer has handled through its API. Emitted periodically. | number |
| RequestLatency | Time the auctioneer took to handle requests to its API endpoints. Emitted when the auctioneer handles requests. | ns |
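
Several metrics in this document are described as cumulative, so a dashboard or alert usually needs the growth between two samples rather than the raw value. The following is a minimal sketch of that calculation, not part of Diego. The `counterDelta` helper and the sample values are hypothetical, and the reset handling assumes that a counter starts again from zero when the emitting process restarts, which the table above does not state.

```go
package main

import "fmt"

// counterDelta returns how much a cumulative counter grew between two
// consecutive samples. If the current value is lower than the previous one,
// this sketch ASSUMES the emitting process restarted and the counter began
// again from zero, so the current value is taken as the growth.
func counterDelta(prev, curr uint64) uint64 {
	if curr < prev {
		return curr
	}
	return curr - prev
}

func main() {
	// Hypothetical consecutive samples of AuctioneerLRPAuctionsFailed.
	samples := []uint64{120, 120, 127, 3}
	for i := 1; i < len(samples); i++ {
		fmt.Printf("sample %d: +%d failed placements\n", i, counterDelta(samples[i-1], samples[i]))
	}
}
```

Dividing each delta by the time between samples gives a rate, which tends to be easier to alert on than the ever-growing total.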

BBS

| Metric | Description | Unit |
| --- | --- | --- |
| BBSMasterElected | Emitted once when the BBS is elected as master. | number (always 1) |
| ConvergenceLRPDuration | Time the BBS took to run the entire LRP convergence pass. Emitted periodically. | ns |
| ConvergenceLRPRuns | Cumulative number of times the BBS has run its LRP convergence pass. Emitted periodically. | number |
| ConvergenceTaskDuration | Time the BBS took to run the entire Task convergence pass. Emitted periodically. | ns |
| ConvergenceTaskRuns | Cumulative number of times the BBS has run its Task convergence pass. Emitted periodically. | number |
| ConvergenceTasksKicked | Cumulative number of times the BBS has updated a Task during its Task convergence pass. Emitted periodically. | number |
| ConvergenceTasksPruned | Cumulative number of times the BBS has deleted a malformed Task Definition during its Task convergence pass. Emitted periodically. | number |
| CrashedActualLRPs | Total number of LRP instances that have crashed. Emitted periodically. | number |
| CrashingDesiredLRPs | Total number of DesiredLRPs that have at least one crashed instance. Emitted periodically. | number |
| DBOpenConnections | Number of open connections to the SQL database. Emitted every 60 seconds. | number |
| DBQueriesFailed | Cumulative number of SQL queries that failed. Emitted every 60 seconds. | number |
| DBQueriesInFlight | Maximum number of concurrent in-flight queries in the last 60 seconds. Emitted every 60 seconds. | number |
| DBQueriesTotal | Cumulative number of SQL queries executed, including BEGIN, COMMIT, and ROLLBACK statements. Emitted every 60 seconds. | number |
| DBQueriesSucceeded | Cumulative number of SQL queries that finished successfully. Emitted every 60 seconds. | number |
| DBQueryDurationMax | Maximum duration of all queries that have run in the last 60 seconds. Emitted every 60 seconds. | ns |
| DBWaitDuration | Total time spent blocked waiting for a new connection. Emitted every 60 seconds. | ns |
| DBWaitCount | Total number of connections waited for. Emitted every 60 seconds. | number |
| `Domain.<domain-name>` | Whether the `<domain-name>` domain is up-to-date, so that instances from that domain have been synchronized with DesiredLRPs for Diego to run. 1 means the domain is up-to-date; no data means it is not. Emitted periodically. | always 1 when present |
| EncryptionDuration | Time the BBS took to ensure all BBS records are encrypted with the current active encryption key. Emitted each time a BBS becomes the active master. | ns |
| LRPsClaimed | Total number of LRP instances that have been claimed by some cell. Emitted periodically. | number |
| LRPsDesired | Total number of LRP instances desired across all LRPs. Emitted periodically. | number |
| LRPsExtra | Total number of LRP instances that are no longer desired but still have a BBS record. Emitted periodically. | number |
| LRPsMissing | Total number of LRP instances that are desired but have no record in the BBS. Emitted periodically. | number |
| LRPsRunning | Total number of LRP instances that are running on cells. Emitted periodically. | number |
| LRPsUnclaimed | Total number of LRP instances that have not yet been claimed by a cell. Emitted periodically. | number |
| LockHeld | Whether a BBS holds the BBS lock (in Locket): 1 means the lock is held, and 0 means the lock was lost. Emitted periodically by the active BBS server. | 0 or 1 (boolean) |
| MigrationDuration | Time the BBS took to run migrations against its persistence store. Emitted each time a BBS becomes the active master. | ns |
| OpenFileDescriptors | Current (non-cumulative) number of open file descriptors held by the BBS. Emitted periodically. | number |
| PresentCells | Total number of cells that are maintaining presence with Locket. Emitted periodically. | number |
| RequestCount | Cumulative number of requests the BBS has handled through its API. Emitted periodically. | number |
| RequestLatency | Maximum amount of time the BBS took to handle a request to one of its API endpoints over a 60-second interval. Emitted every 60 seconds. | ns |
| SuspectCells | Total number of cells that are not maintaining their presence with Locket but for which the BBS has a record of at least one ActualLRP. Emitted periodically. | number |
| SuspectClaimedActualLRPs | Total number of Suspect LRP instances that have been claimed by some cell. Emitted periodically. | number |
| SuspectRunningActualLRPs | Total number of Suspect LRP instances that are running on cells. Emitted periodically. | number |
| TasksCompleted | Total number of Tasks that have completed. Emitted periodically. | number |
| TasksPending | Total number of Tasks that have not yet been placed on a cell. Emitted periodically. | number |
| TasksResolving | Total number of Tasks locked for deletion. Emitted periodically. | number |
| TasksRunning | Total number of Tasks running on cells. Emitted periodically. | number |
| TasksSucceeded | Cumulative number of Tasks that completed successfully. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |
| TasksFailed | Cumulative number of Tasks that failed. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |
| TasksStarted | Cumulative number of Tasks that have started so far. Note: this metric has a cell-id tag that can be used to get the per-cell metric. | number |

Locket

| Metric | Description | Unit |
| --- | --- | --- |
| ActiveLocks | Total number of active locks. Emitted periodically. | number |
| ActivePresences | Total number of active presences. Emitted periodically. | number |
| DBOpenConnections | Number of open connections to the SQL database. Emitted every 60 seconds. | number |
| DBQueriesFailed | Cumulative number of SQL queries that failed. Emitted every 60 seconds. | number |
| DBQueriesInFlight | Maximum number of concurrent in-flight queries in the last 60 seconds. Emitted every 60 seconds. | number |
| DBQueriesTotal | Cumulative number of SQL queries executed, including BEGIN, COMMIT, and ROLLBACK statements. Emitted every 60 seconds. | number |
| DBQueriesSucceeded | Cumulative number of SQL queries that finished successfully. Emitted every 60 seconds. | number |
| DBQueryDurationMax | Maximum duration of all queries that have run in the last 60 seconds. Emitted every 60 seconds. | ns |
| LocksExpired | Cumulative number of locks that have expired. Emitted every 60 seconds. | number |
| PresenceExpired | Cumulative number of presences that have expired. Emitted every 60 seconds. | number |
| RequestsCancelled | Cumulative number of requests of a particular type that have been cancelled by the client. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
| RequestsStarted | Cumulative number of requests of a particular type that have been made. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
| RequestsSucceeded | Cumulative number of requests of a particular type that have completed successfully. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
| RequestsFailed | Cumulative number of requests of a particular type that have failed for any reason. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
| RequestsInFlight | Number of requests of a particular type currently being handled by Locket. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |
| RequestLatencyMax | Maximum request latency emitted by a request of a particular type in the last 60 seconds. Currently tracking Lock, Release, Fetch, and FetchAll requests. Emitted every 60 seconds. | number |

Rep

| Metric | Description | Unit |
| --- | --- | --- |
| AppInstanceExceededLogRateLimitCount | Number of application instances that have exceeded the app log rate limit. Emitted once for each application instance that exceeds the log rate limit within the last 5-minute interval (only emitted if an app log rate limit has been set and an app instance has exceeded that limit). | number |
| CapacityAllocatedDisk | Amount of disk allocated to containers on this cell. Emitted periodically. | mebibytes |
| CapacityAllocatedMemory | Amount of memory allocated to containers on this cell. Emitted periodically. | mebibytes |
| CapacityRemainingContainers | Remaining number of containers this cell can host. Emitted periodically. | number |
| CapacityRemainingDisk | Remaining amount of disk available for this cell to allocate to containers. Emitted periodically. | mebibytes |
| CapacityRemainingMemory | Remaining amount of memory available for this cell to allocate to containers. Emitted periodically. | mebibytes |
| CapacityTotalContainers | Total number of containers this cell can host. Emitted periodically. | number |
| CapacityTotalDisk | Total amount of disk available for this cell to allocate to containers. Emitted periodically. | mebibytes |
| CapacityTotalMemory | Total amount of memory available for this cell to allocate to containers. Emitted periodically. | mebibytes |
| CellUnhealthy | Whether the cell has reached the healthcheck timeout against the Garden backend. 1 signifies unhealthy. Emitted once. | 1 |
| ContainerCompletedCount | Number of containers that exited on this cell. Emitted after a container exits. | number |
| ContainerCount | Number of containers hosted on the cell. Emitted periodically. | number |
| ContainerExitedOnTimeoutCount | Number of containers on this cell that exited after the graceful shutdown interval. Emitted after a container exits. | number |
| ContainerUsageDisk | Amount of disk used by containers on this cell. Emitted periodically. | mebibytes |
| ContainerUsageMemory | Amount of memory used by containers on this cell. Emitted periodically. | mebibytes |
| CredCreationFailedCount | Count of failed instance identity credential creations. Emitted after every failed credential creation. | number |
| CredCreationSucceededCount | Count of successful instance identity credential creations. Emitted after every successful credential creation. | number |
| CredCreationSucceededDuration | Time the rep took to create instance identity credentials. Emitted after every successful credential creation. | ns |
| C2CCredCreationFailedCount | Count of failed C2C credential creations. Emitted after every failed credential creation. | number |
| C2CCredCreationSucceededCount | Count of successful C2C credential creations. Emitted after every successful credential creation. | number |
| C2CCredCreationSucceededDuration | Time the rep took to create C2C credentials. Emitted after every successful credential creation. | ns |
| ContainerSetupSucceededDuration | Time the rep took to set up a container with the Garden backend. Emitted after every successful container setup. | ns |
| ContainerSetupFailedDuration | Time the rep took to set up a container with the Garden backend. Emitted after every failed container setup. | ns |
| GardenContainerCreationFailedDuration | Time the rep's Garden backend took to create a container. Emitted after every failed container creation. | ns |
| GardenContainerCreationSucceededDuration | Time the rep's Garden backend took to create a container. Emitted after every successful container creation. | ns |
| GardenContainerDestructionFailedDuration | Time the rep's Garden backend took to destroy a container. Emitted after every failed container destruction. | ns |
| GardenContainerDestructionSucceededDuration | Time the rep's Garden backend took to destroy a container. Emitted after every successful container destruction. | ns |
| GardenHealthCheckFailed | Whether the cell has failed to pass its healthcheck against the Garden backend. 0 signifies healthy, and 1 signifies unhealthy. Emitted periodically. | 0 or 1 (boolean) |
| RepBulkSyncDuration | Time the cell rep took to synchronize the ActualLRPs it has claimed with its actual Garden containers. Emitted periodically by each rep. | ns |
| RequestsStarted | Cumulative number of requests of a particular type that have been made. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
| RequestsSucceeded | Cumulative number of requests of a particular type that have completed successfully. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
| RequestsFailed | Cumulative number of requests of a particular type that have failed for any reason. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
| RequestsInFlight | Number of requests of a particular type currently being handled by the rep. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
| RequestLatencyMax | Maximum request latency emitted by a request of a particular type in the last 60 seconds. Currently tracking CancelTask, ContainerMetrics, Perform, Reset, State, and StopLRPInstance requests. Emitted every 60 seconds. | number |
| StalledGardenDuration | Time the rep spent waiting for its Garden backend to become healthy during startup. Emitted only if Garden is not responsive when the rep starts up. | ns |
| StartingContainerCount | Number of containers currently in a Reserved, Initializing, or Created state. Emitted periodically. | number |
| StrandedEvacuatingActualLRPs | Number of evacuating ActualLRPs that timed out during the evacuation process. Emitted when evacuation does not complete successfully. | number |
| VolmanMountDuration | Time volman took to mount a volume. Emitted by each rep when volumes are mounted. | ns |
| VolmanMountDurationFor | Time volman took to mount a volume with a specific volume driver. Emitted by each rep when volumes are mounted. | ns |
| VolmanMountErrors | Count of failed volume mounts. Emitted periodically by each rep. | number |
| VolmanUnmountDuration | Time volman took to unmount a volume. Emitted by each rep when volumes are mounted. | ns |
| VolmanUnmountDurationFor | Time volman took to unmount a volume with a specific volume driver. Emitted by each rep when volumes are mounted. | ns |
| VolmanUnmountErrors | Count of failed volume unmounts. Emitted periodically by each rep. | number |
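
The capacity metrics are reported per cell in mebibytes, so a common derived value is the fraction of a cell's memory (or disk) that is no longer available for allocation. The following sketch computes that fraction from CapacityTotalMemory and CapacityRemainingMemory. The `usedFraction` helper and the sample values are hypothetical, and the sketch deliberately uses only the total and remaining values rather than assuming any particular relationship with CapacityAllocatedMemory.

```go
package main

import "fmt"

// usedFraction returns the share of a cell's capacity that is no longer
// available for allocation, given the total and remaining values in the
// same unit (mebibytes for the Rep memory and disk metrics). The boolean is
// false when the total is not positive and no fraction can be computed.
func usedFraction(total, remaining float64) (float64, bool) {
	if total <= 0 {
		return 0, false
	}
	return (total - remaining) / total, true
}

func main() {
	// Hypothetical samples of CapacityTotalMemory and CapacityRemainingMemory.
	totalMiB, remainingMiB := 32768.0, 8192.0

	if frac, ok := usedFraction(totalMiB, remainingMiB); ok {
		fmt.Printf("memory no longer available: %.0f%%\n", frac*100)
	}
}
```

The same helper applies to CapacityTotalDisk with CapacityRemainingDisk and to CapacityTotalContainers with CapacityRemainingContainers, since each pair shares a unit.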

Route Emitter

| Metric | Description | Unit |
| --- | --- | --- |
| AddressCollisions | Number of detected conflicting routes. A conflicting route is a set of two distinct instances with the same IP address on the routing table. | number |
| HTTPRouteCount | Number of HTTP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically when emitter is in local mode. | number |
| HTTPRouteNATSMessagesEmitted | Cumulative number of HTTP routing messages the route-emitter sends over NATS to the gorouter. | number |
| InternalRouteNATSMessagesEmitted | Cumulative number of internal routing messages the route-emitter sends over NATS to the service discovery controller. | number |
| RouteEmitterSyncDuration | Time the route-emitter took to perform its synchronization pass. Emitted periodically. | ns |
| RoutesRegistered | Cumulative number of NATS route registrations emitted from the route-emitter as it reacts to changes to LRPs. | number |
| RoutesSynced | Cumulative number of route registrations emitted from the route-emitter during its periodic route-table emission. | number |
| RoutesTotal | Number of combined HTTP and TCP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically. | number |
| RoutesUnregistered | Cumulative number of NATS route unregistrations emitted from the route-emitter as it reacts to changes to LRPs. | number |
| TCPRouteCount | Number of TCP route associations (route-endpoint pairs) in the route-emitter's routing table. Emitted periodically when emitter is in local mode. | number |

SSH Proxy

| Metric | Description | Unit |
| --- | --- | --- |
| ssh-connections | Total number of SSH connections an SSH proxy has established. Emitted periodically by each SSH proxy. | number |

General Golang metrics

These metrics are automatically emitted on all the Diego components.

| Metric | Description | Unit |
| --- | --- | --- |
| memoryStats.lastGCPauseTimeNS | Amount of time the Golang process paused for garbage collection. | ns |
| memoryStats.numBytesAllocatedHeap | Number of bytes the Golang process has allocated on the heap. | bytes |
| memoryStats.numBytesAllocatedStack | Number of bytes the Golang process has allocated on the stack. | bytes |
| numGoRoutines | Number of goroutines the Golang process is running. | number |
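
For readers who want to see where values of this kind can come from, the sketch below reads comparable numbers from the Go standard library's `runtime` package. It is an illustration only: the mapping of each metric name to a specific `runtime.MemStats` field is an assumption made for this example, not a statement of how the Diego components collect these metrics, and the `goRuntimeSnapshot` type and `snapshot` function are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// goRuntimeSnapshot groups values comparable to the general Golang metrics.
// The field-to-metric mapping is an ASSUMPTION made for illustration.
type goRuntimeSnapshot struct {
	LastGCPauseNS   uint64 // compare: memoryStats.lastGCPauseTimeNS
	HeapAllocBytes  uint64 // compare: memoryStats.numBytesAllocatedHeap
	StackInuseBytes uint64 // compare: memoryStats.numBytesAllocatedStack
	NumGoroutines   int    // compare: numGoRoutines
}

// snapshot reads the current runtime statistics of this process.
func snapshot() goRuntimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return goRuntimeSnapshot{
		// PauseNs is a circular buffer of recent GC pause times; the most
		// recent pause is stored at index (NumGC+255)%256.
		LastGCPauseNS:   m.PauseNs[(m.NumGC+255)%256],
		HeapAllocBytes:  m.HeapAlloc,
		StackInuseBytes: m.StackInuse,
		NumGoroutines:   runtime.NumGoroutine(),
	}
}

func main() {
	s := snapshot()
	fmt.Printf("last GC pause: %d ns\n", s.LastGCPauseNS)
	fmt.Printf("heap allocated: %d bytes\n", s.HeapAllocBytes)
	fmt.Printf("stack in use: %d bytes\n", s.StackInuseBytes)
	fmt.Printf("goroutines: %d\n", s.NumGoroutines)
}
```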