Metric Design

Notes from designing metrics for Test Monitoring.

🔎 Metrics

💊 USE / RED

USE and RED are two metric design methodologies.
USE focuses on physical hardware (servers),
while RED focuses on the application level.

📒 USE

"For every resource, check utilization, saturation, and errors."

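As I understand it, this is a set of criteria used mainly for collecting metrics on physical resources.
Resource : every functional component of a physical server
Utilization : the average time the resource was busy servicing work
-> High utilization = work may be piling up unprocessed (e.g. a bottleneck).
Saturation : the amount of extra work the resource cannot service, i.e. work that exceeds what the resource can handle and is left waiting.
Errors : the count of error events
Together, these three indicators make low-level monitoring possible (a server's physical resources, the network).
Low CPU utilization but high saturation? -> One possible cause is that threads or processes are poorly distributed, so work piles up on a single CPU core. -> Possible remedy: use a thread pool to spread out the work that is concentrated on one thread.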
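
The "low utilization, high saturation" case can be sketched in a few lines of Python. This is only an illustration: the per-core fields (`busy_seconds`, `run_queue`), the thresholds, and the sample numbers are all assumptions made up for the example, not values taken from a real exporter.

```python
def core_utilization(busy_seconds, window_seconds):
    """Utilization: share of the observation window the core spent busy."""
    return busy_seconds / window_seconds


def diagnose(cores, window_seconds, util_threshold=0.5, queue_threshold=2):
    """Flag the pattern where the average looks idle but some core is backed up.

    cores: list of dicts with hypothetical keys 'busy_seconds' and 'run_queue'.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    utils = [core_utilization(c["busy_seconds"], window_seconds) for c in cores]
    average = sum(utils) / len(utils)
    # Saturation here: number of tasks waiting for a core at sample time.
    saturated = [i for i, c in enumerate(cores) if c["run_queue"] >= queue_threshold]
    return {
        "average_utilization": average,
        "saturated_cores": saturated,
        "imbalanced": average < util_threshold and bool(saturated),
    }


# Made-up 60-second window: core 0 is busy and queued, the others are idle.
sample_cores = [
    {"busy_seconds": 58, "run_queue": 6},
    {"busy_seconds": 3, "run_queue": 0},
    {"busy_seconds": 2, "run_queue": 0},
    {"busy_seconds": 1, "run_queue": 0},
]
report = diagnose(sample_cores, window_seconds=60)
```

Averaging utilization across cores hides the one busy core, which is why the sketch looks at saturation per core rather than in aggregate.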

Pros and cons

Pros :
Cons :

References

📕 RED

“The USE Method doesn’t really apply to services; it applies to hardware, network disks, things like this,”
“We really wanted a microservices-oriented monitoring philosophy, so we came up with the RED Method.”

USE and ...

References

📝 Metrics Design

Select metrics suited to the target system, using the USE/RED methodologies as a guide

Systems to monitor

Backend App (Spring Actuator)
Frontend App (Node Exporter)
Nginx (Nginx Exporter)
OpenSearch (OpenSearch Exporter)
MySQL (MySQL Exporter)
Redis (Redis Exporter)
Node: monitors the worker nodes that run Spring Boot, Vue.js, etc. (Node Exporter)

Default metrics per system

Spring Actuator (http://localhost:8080/metrics)

{
  "names": [
    "application.ready.time",
    "application.started.time",
    "disk.free",
    "disk.total",
    "executor.active",
    "executor.completed",
    "executor.pool.core",
    "executor.pool.max",
    "executor.pool.size",
    "executor.queue.remaining",
    "executor.queued",
    "http.server.requests",
    "http.server.requests.active",
    "jvm.buffer.count",
    "jvm.buffer.memory.used",
    "jvm.buffer.total.capacity",
    "jvm.classes.loaded",
    "jvm.classes.unloaded",
    "jvm.compilation.time",
    "jvm.gc.live.data.size",
    "jvm.gc.max.data.size",
    "jvm.gc.memory.allocated",
    "jvm.gc.memory.promoted",
    "jvm.gc.overhead",
    "jvm.gc.pause",
    "jvm.info",
    "jvm.memory.committed",
    "jvm.memory.max",
    "jvm.memory.usage.after.gc",
    "jvm.memory.used",
    "jvm.threads.daemon",
    "jvm.threads.live",
    "jvm.threads.peak",
    "jvm.threads.started",
    "jvm.threads.states",
    "logback.events",
    "process.cpu.time",
    "process.cpu.usage",
    "process.start.time",
    "process.uptime",
    "system.cpu.count",
    "system.cpu.usage",
    "tomcat.sessions.active.current",
    "tomcat.sessions.active.max",
    "tomcat.sessions.alive.max",
    "tomcat.sessions.created",
    "tomcat.sessions.expired",
    "tomcat.sessions.rejected"
  ]
}

Designed metrics

Round 1

1. Go through the overall list of metrics to check
2. Work out how to collect Error metrics for USE and RED respectively

Spring
USE
U
(disk.total - disk.free) / disk.total (disk utilization: (total capacity - free capacity) / total capacity)
system.cpu.usage (CPU usage of the whole system)
process.cpu.usage (CPU usage of the process)
jvm.memory.used / jvm.memory.max (JVM memory utilization)
executor.active / executor.pool.size (share of pool threads currently in use)

S
executor.queued / (executor.queued + executor.queue.remaining) (thread pool saturation: share of the task queue that is occupied)
executor.queued (number of tasks currently waiting in the thread pool's queue)
jvm.gc.overhead (approximate share of CPU time spent on GC)

E
?? Which of the default metrics could be used to count errors?
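
As a rough sketch of how the U and S ratios could be derived from one snapshot of gauge values, assuming utilization is taken as "used / capacity" and queue saturation as "queued / queue capacity". The input dict and its sample numbers are hypothetical; only the metric names come from the Actuator list above.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is unusable.

    Some gauges can report a non-positive maximum (for example when a limit
    is undefined), so the guard avoids a misleading or failing division.
    """
    if denominator is None or denominator <= 0:
        return None
    return numerator / denominator


def spring_use_snapshot(m):
    """Compute USE-style ratios from a dict of metric name -> current value."""
    queue_capacity = m["executor.queued"] + m["executor.queue.remaining"]
    return {
        "disk_utilization": ratio(m["disk.total"] - m["disk.free"], m["disk.total"]),
        "heap_utilization": ratio(m["jvm.memory.used"], m["jvm.memory.max"]),
        "pool_utilization": ratio(m["executor.active"], m["executor.pool.size"]),
        "queue_saturation": ratio(m["executor.queued"], queue_capacity),
    }


# Hypothetical snapshot values, chosen only to make the arithmetic easy to follow.
snapshot = {
    "disk.total": 100e9,
    "disk.free": 25e9,
    "jvm.memory.used": 256e6,
    "jvm.memory.max": 1024e6,
    "executor.active": 4,
    "executor.pool.size": 8,
    "executor.queued": 30,
    "executor.queue.remaining": 70,
}
use = spring_use_snapshot(snapshot)
```

With an unbounded task queue, `executor.queue.remaining` becomes very large and the saturation ratio stays near zero, so in that case `executor.queued` on its own is probably the more useful signal.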


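RED
R
http.server.requests (total number of HTTP requests)
http.server.requests.active (number of HTTP requests currently being processed)
-> Sampled in 1-minute windows, (total HTTP requests - active HTTP requests) / total HTTP requests
should give the share of requests already completed; the rest are still in flight.

E
??

D
??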
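
One way the open E and D items might be filled in is to derive all three RED signals from two successive readings of a request timer. The sketch below assumes each reading exposes a cumulative request count, a cumulative total time, and a cumulative error count (for example, requests that ended in a server error); the dict keys and the numbers are hypothetical.

```python
def red_between(previous, current, interval_seconds):
    """Derive Rate, Errors, Duration from two cumulative readings.

    previous/current: dicts with hypothetical keys
      'count'         - total requests seen so far
      'total_seconds' - total time spent handling those requests
      'errors'        - total requests that ended in an error
    """
    requests = current["count"] - previous["count"]
    if requests <= 0 or interval_seconds <= 0:
        # Nothing happened in the window (or the input is unusable).
        return {"rate": 0.0, "error_ratio": None, "mean_duration": None}
    errors = current["errors"] - previous["errors"]
    seconds = current["total_seconds"] - previous["total_seconds"]
    return {
        "rate": requests / interval_seconds,   # R: requests per second
        "error_ratio": errors / requests,      # E: share of failed requests
        "mean_duration": seconds / requests,   # D: average seconds per request
    }


# Made-up readings taken 60 seconds apart.
before = {"count": 1000, "total_seconds": 50.0, "errors": 10}
after = {"count": 1120, "total_seconds": 62.0, "errors": 16}
red = red_between(before, after, interval_seconds=60)
```

A mean hides slow outliers, so a percentile or the recorded maximum is usually worth tracking alongside it for D.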


OpenSearch
USE
U
CPU_Utilization (CPU utilization)
Disk_Utilization (disk utilization)
Heap_Used / Heap_Max (heap memory utilization)
IO_ReadThroughput (amount of data read from disk over the last 5 seconds)

S
(ThreadPool_TotalThreads - ThreadPool_ActiveThreads) / ThreadPool_TotalThreads (share of idle threads in the pool; the closer to 0, the closer the pool is to saturation)

E
Paging_MajfltRate (major page faults per second)
Paging_MinfltRate (minor page faults per second)

RED
R
HTTP_TotalRequests
HTTP_RequestDocs
Disk_ServiceRate

E
ThreadPool_RejectedReqs (number of rejected executions)

D
Disk_WaitTime (average disk read/write response time over the last 5 seconds)
GC_Collection_Time

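Round 2

1. Narrow the scope to Spring Boot metrics, excluding OpenSearch (add metric collection for other components if needed)
2. Need to verify that the custom metrics are being applied properly
3. Look into what additional metrics might be possible (e.g. adding Redis response time and connection error metrics once Redis is connected)

Custom metrics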

search.request.count(controller)

Counts every request that reaches this adapter (controller)
Gives a view of the core business logic
Sample output

# HELP search_request_count_total  
# TYPE search_request_count_total counter
search_request_count_total{application="webtoon-search",class="search-webtoon-controller",exception="IllegalArgumentException",method="searchWebtoon",result="failure"} 4.0
search_request_count_total{application="webtoon-search",class="search-webtoon-controller",exception="none",method="searchWebtoon",result="success"} 3.0
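
To show what the `result` label makes possible, here is a minimal Python sketch that parses the sample output above and computes the failure ratio. The regex-based parser is an assumption that covers only simple labeled lines like these; it is not a full implementation of the Prometheus text format.

```python
import re

SAMPLE = '''\
# HELP search_request_count_total
# TYPE search_request_count_total counter
search_request_count_total{application="webtoon-search",class="search-webtoon-controller",exception="IllegalArgumentException",method="searchWebtoon",result="failure"} 4.0
search_request_count_total{application="webtoon-search",class="search-webtoon-controller",exception="none",method="searchWebtoon",result="success"} 3.0
'''

# name{label="value",...} number
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) per labeled sample line; skip comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match:
            labels = dict(LABEL.findall(match["labels"]))
            yield match["name"], labels, float(match["value"])


def failure_ratio(text, metric="search_request_count_total"):
    """Share of requests whose 'result' label is 'failure'."""
    totals = {}
    for name, labels, value in parse_samples(text):
        if name == metric:
            result = labels.get("result", "unknown")
            totals[result] = totals.get(result, 0.0) + value
    total = sum(totals.values())
    return totals.get("failure", 0.0) / total if total else None


ratio_failed = failure_ratio(SAMPLE)  # 4 failures out of 7 requests
```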

search.request.duration(controller)

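Measures the time the webtoon search method takes to return its response
Captures the overall response time from the handler method through OpenSearch
If this looks fine but the end-to-end response time is growing, the problem may lie in the Tomcat ~ Nginx ~ LB ~ Client part of the architecture
Sample output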

# HELP search_request_duration_seconds duration until search webtoon list
# TYPE search_request_duration_seconds summary
search_request_duration_seconds_count{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.web.SearchWebtoonController",endpoint="/webtoons/search",exception="IllegalArgumentException",method="searchWebtoon"} 4
search_request_duration_seconds_sum{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.web.SearchWebtoonController",endpoint="/webtoons/search",exception="IllegalArgumentException",method="searchWebtoon"} 7.155E-4
search_request_duration_seconds_count{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.web.SearchWebtoonController",endpoint="/webtoons/search",exception="none",method="searchWebtoon"} 3
search_request_duration_seconds_sum{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.web.SearchWebtoonController",endpoint="/webtoons/search",exception="none",method="searchWebtoon"} 0.1211729

search.condition.null.count(service)

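Checks whether the search condition object received from the controller is null or empty
Returns an error and increments the counter for cases the frontend failed to filter out first
Possibly a meaningless metric (what would counting this actually buy us?)
Sample output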

# HELP search_condition_null_count_total title is null
# TYPE search_condition_null_count_total counter
search_condition_null_count_total{application="webtoon-search",class="search-webtoon-service",endpoint="/webtoons/search",method="search-webtoons"} 4.0

search.opensearch.reply.duration(adapter)

Measures the response time between the adapter and OpenSearch
Sample output

# HELP search_opensearch_reply_duration_seconds duration until opensearch reply
# TYPE search_opensearch_reply_duration_seconds summary
search_opensearch_reply_duration_seconds_count{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.searchengine.SearchEngineAdapter",endpoint="/webtoons/search",exception="none",method="loadWebtoons"} 3
search_opensearch_reply_duration_seconds_sum{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.searchengine.SearchEngineAdapter",endpoint="/webtoons/search",exception="none",method="loadWebtoons"} 0.1199463
# HELP search_opensearch_reply_duration_seconds_max duration until opensearch reply
# TYPE search_opensearch_reply_duration_seconds_max gauge
search_opensearch_reply_duration_seconds_max{application="webtoon-search",class="com.samsamohoh.webtoonsearch.adapter.searchengine.SearchEngineAdapter",endpoint="/webtoons/search",exception="none",method="loadWebtoons"} 0.0971084
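
Putting the two duration summaries side by side shows where the time goes. The sketch below uses the `_sum` and `_count` values for the `exception="none"` series from the sample outputs in this section (controller: 0.1211729 s over 3 calls, adapter: 0.1199463 s over 3 calls); the variable names are only illustrative.

```python
def mean_seconds(total_seconds, count):
    """Average duration from a summary's _sum and _count; None if no samples."""
    return total_seconds / count if count else None


# Values copied from the sample outputs (successful requests only).
controller_mean = mean_seconds(0.1211729, 3)  # search_request_duration_seconds
adapter_mean = mean_seconds(0.1199463, 3)     # search_opensearch_reply_duration_seconds

# Time spent inside the application itself, outside the OpenSearch call.
overhead_mean = controller_mean - adapter_mean

# Share of the controller's time that is spent waiting on OpenSearch.
opensearch_share = adapter_mean / controller_mean
```

In this sample almost all of the controller time is spent waiting on OpenSearch, so slow searches would point at OpenSearch or the network to it rather than at the application code. Three requests is far too few to conclude anything; the calculation is what matters here.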

opensearch.connection.fail.count(adapter)

Records when OpenSearch fails to respond
Counts how often OpenSearch response failures occurred
A high count points to either a network problem between Spring and OpenSearch or a problem in OpenSearch itself.
Sample output

# HELP opensearch_connection_fail_count_total metrics for opensearch connecting failure
# TYPE opensearch_connection_fail_count_total counter
opensearch_connection_fail_count_total{application="webtoon-search",class="search-engine-adapter",endpoint="/webtoons/search",method="load-webtoons"} 2.0
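
Because this counter only ever grows, "a high count" is easier to judge as an increase per scrape interval than as a raw total. A minimal sketch, assuming the counter restarts from zero when the application restarts; the scrape values and the threshold are made up for illustration.

```python
def counter_increase(previous, current):
    """Increase of a cumulative counter between two scrapes.

    If the value went down, assume the process restarted and the counter
    began again from zero, so the whole current value counts as the increase.
    """
    if current < previous:
        return current
    return current - previous


def increases(scrapes):
    """Per-interval increases for a series of successive counter readings."""
    return [counter_increase(a, b) for a, b in zip(scrapes, scrapes[1:])]


# Hypothetical readings of opensearch_connection_fail_count_total;
# the drop from 2.0 to 1.0 simulates an application restart.
scrapes = [0.0, 2.0, 2.0, 1.0, 5.0]
per_interval = increases(scrapes)

FAILURE_THRESHOLD = 3  # hypothetical alerting threshold per interval
suspicious = [i for i, inc in enumerate(per_interval) if inc >= FAILURE_THRESHOLD]
```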
