[OT-311] [CHORE]: Apply Grafana templates and add required configuration #171
Conversation
Walkthrough
Adds node-exporter SSM parameter storage and remote deployment steps to the CI/CD workflows, and adds node-exporter targets and … to the Prometheus scrape configuration.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CI as "CI/CD workflow"
    participant EC2 as "EC2 instance"
    participant SSM as "AWS SSM"
    participant Prom as "Prometheus"
    CI->>EC2: Select target instance and query PrivateIpAddress
    EC2-->>CI: PrivateIpAddress
    CI->>SSM: Store "<private_ip>:9100" in SSM (NODE_EXPORTER_*_SSM_PARAM)
    CI->>EC2: Remove existing node-exporter container, pull image, run node-exporter
    EC2-->>CI: Confirm node-exporter process is running
    Prom->>SSM: Look up NODE_EXPORTER_TARGET / MACHINE_NODE_EXPORTER_TARGET
    SSM-->>Prom: Return target address (IP:9100)
    Prom->>Prom: Collect node-exporter metrics via scrape job
```
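The SSM-to-Prometheus handoff in the diagram ultimately renders the `__NODE_EXPORTER_TARGET__` / `__MACHINE_NODE_EXPORTER_TARGET__` placeholders from `prometheus.prod.yml.tpl` into a concrete config. The sketch below illustrates that substitution step with hardcoded sample values standing in for the real `aws ssm get-parameter` calls; the `sed` invocation is an assumption for illustration, not the workflow's actual code.

```shell
#!/usr/bin/env sh
# Illustrative sketch: render prometheus.prod.yml from its template by
# substituting SSM-resolved targets. Values are hardcoded here in place of
# the `aws ssm get-parameter` lookups the workflow performs.
set -eu

NODE_EXPORTER_TARGET="10.0.1.23:9100"          # would come from SSM
MACHINE_NODE_EXPORTER_TARGET="10.0.2.45:9100"  # would come from SSM

# Minimal stand-in for apps/monitoring/prometheus/prometheus.prod.yml.tpl
cat > prometheus.prod.yml.tpl <<'EOF'
scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["__NODE_EXPORTER_TARGET__"]
  - job_name: "machine-node-exporter"
    static_configs:
      - targets: ["__MACHINE_NODE_EXPORTER_TARGET__"]
EOF

# Replace the placeholder tokens with the resolved ip:9100 addresses.
sed -e "s|__NODE_EXPORTER_TARGET__|${NODE_EXPORTER_TARGET}|" \
    -e "s|__MACHINE_NODE_EXPORTER_TARGET__|${MACHINE_NODE_EXPORTER_TARGET}|" \
    prometheus.prod.yml.tpl > prometheus.prod.yml

grep 'targets' prometheus.prod.yml
```

In the real workflow the rendered file is what the Prometheus container mounts, so a stale or empty SSM value propagates directly into the scrape targets.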
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
@CodeRabbit please review the code, check for convention issues as well, and post each item as a separate comment.
Please write each review item as a separate comment, covering both code quality and convention issues.
🧠 Learnings used
✅ Actions performed: Full review triggered.
@CodeRabbit are you still reviewing..?
🧠 Learnings used
✅ Actions performed: Full review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (1)
.github/workflows/deploy-monitoring.yml (1)
110-116: Include the missing keys in the target-validation failure message.
Lines 110-116 are correct as a validation step, but the error message is generic, so it is hard to tell at a glance which target is empty. Printing the missing keys alongside the error speeds up incident response.
🔧 Suggested fix

```diff
-if [ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ] || \
-   [ -z "$ADMIN_API_TARGET" ] || [ "$ADMIN_API_TARGET" = "None" ] || \
-   [ -z "$TRANSCODER_TARGET" ] || [ "$TRANSCODER_TARGET" = "None" ] || \
-   [ -z "$NODE_EXPORTER_TARGET" ] || [ "$NODE_EXPORTER_TARGET" = "None" ] || \
-   [ -z "$MACHINE_NODE_EXPORTER_TARGET" ] || [ "$MACHINE_NODE_EXPORTER_TARGET" = "None" ]; then
-  echo "One or more scrape targets are empty. Check SSM parameter values." >&2
+MISSING_KEYS=()
+[ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ] && MISSING_KEYS+=("USER_API_TARGET")
+[ -z "$ADMIN_API_TARGET" ] || [ "$ADMIN_API_TARGET" = "None" ] && MISSING_KEYS+=("ADMIN_API_TARGET")
+[ -z "$TRANSCODER_TARGET" ] || [ "$TRANSCODER_TARGET" = "None" ] && MISSING_KEYS+=("TRANSCODER_TARGET")
+[ -z "$NODE_EXPORTER_TARGET" ] || [ "$NODE_EXPORTER_TARGET" = "None" ] && MISSING_KEYS+=("NODE_EXPORTER_TARGET")
+[ -z "$MACHINE_NODE_EXPORTER_TARGET" ] || [ "$MACHINE_NODE_EXPORTER_TARGET" = "None" ] && MISSING_KEYS+=("MACHINE_NODE_EXPORTER_TARGET")
+if [ "${#MISSING_KEYS[@]}" -gt 0 ]; then
+  echo "Empty scrape targets: ${MISSING_KEYS[*]}. Check SSM parameter values." >&2
   exit 1
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-monitoring.yml around lines 110 - 116, The current validation block that checks USER_API_TARGET, ADMIN_API_TARGET, TRANSCODER_TARGET, NODE_EXPORTER_TARGET, and MACHINE_NODE_EXPORTER_TARGET should be changed to build a list of which variables are empty or "None" and include those keys in the error message; locate the if-check referencing those variables, compute a missing_targets string/array by testing each variable (e.g., append "USER_API_TARGET" when [ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ]), and replace the generic echo with an echo that prints "Missing or empty scrape targets: $missing_targets" (then exit 1) so the failing keys are clearly reported.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy-ai.yml:
- Around line 146-153: The deploy step currently swallows all errors with "||
true" when calling aws ec2 authorize-security-group-ingress (using
MONITORING_SG_ID and MACHINE_SG_ID), which hides real failures; remove the
blanket "|| true", run the aws command and capture its stderr/exit code, then
only ignore the call if the error indicates a duplicate rule (e.g. an error
string or code like "InvalidPermission.Duplicate" in
stderr), otherwise print the aws error and exit non-zero so permission/SG/quota
problems fail the workflow; implement this logic around the aws ec2
authorize-security-group-ingress invocation.
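The duplicate-tolerant handling this prompt describes boils down to: capture stderr, pattern-match the AWS error code, and fail on anything unexpected. A minimal runnable sketch of that pattern follows; `fake_authorize` is a stub standing in for the real `aws ec2 authorize-security-group-ingress` call so the snippet runs anywhere.

```shell
#!/usr/bin/env sh
# Sketch of the suggested error classification. `fake_authorize` stands in
# for `aws ec2 authorize-security-group-ingress`; swap in the real call.
set -eu

fake_authorize() {
  # Simulate AWS rejecting a rule that already exists.
  echo "An error occurred (InvalidPermission.Duplicate) when calling the AuthorizeSecurityGroupIngress operation" >&2
  return 1
}

if ! OUTPUT=$(fake_authorize 2>&1); then
  if echo "$OUTPUT" | grep -q "InvalidPermission.Duplicate"; then
    # Benign: the ingress rule is already present, keep going.
    echo "Ingress rule already exists; continuing."
  else
    # Real failure (permissions, bad SG ID, quota): surface it and stop.
    echo "::error::authorize-security-group-ingress failed: $OUTPUT" >&2
    exit 1
  fi
fi
echo "done"
```

With the real AWS CLI in place, any error other than the duplicate-rule code fails the workflow instead of being silently swallowed by `|| true`.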
In @.github/workflows/deploy-ec2-docker.yml:
- Around line 136-143: When promtail_role equals "user-api" ensure the script
fails fast if private_ip is missing or invalid instead of silently skipping SSM
update: inside the conditional around promtail_role/private_ip (the block that
currently sets NODE_EXPORTER_TARGET_SSM_PARAM) add a guard that checks
private_ip for emptiness or the literal "None" and, if invalid, emit an error
message and exit non‑zero so the workflow stops immediately rather than
proceeding without updating NODE_EXPORTER_TARGET_SSM_PARAM.
In `@apps/monitoring/docker-compose.yml`:
- Around line 53-56: The external network name is hardcoded as backend_default
(networks: backend_shared -> name: backend_default), which breaks when the
Compose project name differs; update the monitoring compose to derive the
network name from an env var instead of hardcoding (e.g., set
networks.backend_shared.name to something like
${COMPOSE_PROJECT_NAME:-backend}_default) or alternatively remove external:true
and document/expect the root docker-compose to declare an explicit shared
network; change the networks definition in apps/monitoring/docker-compose.yml to
reference the env-var-based name (or make it non-external) so Prometheus will
attach to the correct project network.
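A sketch of the env-var-based network name this comment suggests; the `backend` fallback is an assumption about the backend project's default Compose project name, not a value confirmed by the PR.

```yaml
# apps/monitoring/docker-compose.yml — illustrative sketch, not the committed file
networks:
  backend_shared:
    external: true
    # Derive the network name from the backend project's name instead of
    # hardcoding backend_default; falls back to "backend" when unset.
    name: ${COMPOSE_PROJECT_NAME:-backend}_default
```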
In `@apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json`:
- Around line 132-205: The Loki panel queries (see the target expressions
"sum(count_over_time({namespace=\"$namespace\", pod=~\"$pod\"} |~
\"$search\"[$__interval]))" and "{namespace=\"$namespace\", pod=~\"$pod\"} |~
\"$search\"") assume stream labels "namespace" and "pod" that your
Promtail/Prometheus stack does not emit; replace those label matchers with the
actual labels your stack provides (or remove them) or alternatively enable/emit
the missing Kubernetes labels in Promtail/Prometheus so the queries match real
streams; update the two target expressions and any templating variables to use
the correct label keys so the dashboard returns non-empty results.
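For context, a Promtail `static_configs` block that emits flat app-style labels (rather than Kubernetes `namespace`/`pod` labels) looks like the sketch below; the label names and log path are illustrative assumptions about this stack. If Promtail emits labels like these, the dashboard queries would need matchers such as `{app="$app"} |~ "$search"` instead of the namespace/pod selectors.

```yaml
# Promtail scrape config sketch — label names and paths are assumptions.
scrape_configs:
  - job_name: user-api-logs
    static_configs:
      - targets: [localhost]
        labels:
          app: user-api   # a label the Loki dashboard could filter on
          env: prod
          __path__: /var/log/user-api/*.log
```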
In `@apps/monitoring/prometheus/prometheus.prod.yml.tpl`:
- Around line 43-55: The prod Prometheus scrape jobs "node-exporter" and
"machine-node-exporter" only set the label app and lack the application label
expected by the Grafana dashboard (jvm-micrometer-4701.json) which uses
label_values(application) and selectors like {application="$application"};
update the template for those jobs (job_name "node-exporter" and
"machine-node-exporter") to include an application label alongside app (e.g.
application: "<appropriate value or template variable>") for both targets
(__NODE_EXPORTER_TARGET__ and __MACHINE_NODE_EXPORTER_TARGET__), then
regenerate/sync the resulting prometheus.prod.yml so prod matches the local
prometheus.yml label schema.
---
Nitpick comments:
In @.github/workflows/deploy-monitoring.yml:
- Around line 110-116: The current validation block that checks USER_API_TARGET,
ADMIN_API_TARGET, TRANSCODER_TARGET, NODE_EXPORTER_TARGET, and
MACHINE_NODE_EXPORTER_TARGET should be changed to build a list of which
variables are empty or "None" and include those keys in the error message;
locate the if-check referencing those variables, compute a missing_targets
string/array by testing each variable (e.g., append "USER_API_TARGET" when [ -z
"$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ]), and replace the
generic echo with an echo that prints "Missing or empty scrape targets:
$missing_targets" (then exit 1) so the failing keys are clearly reported.
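The missing-keys collection described in both versions of this comment can be exercised standalone. Below is a POSIX-sh variant of the pattern (a string instead of a bash array), with hardcoded sample values standing in for the SSM lookups.

```shell
#!/usr/bin/env sh
# Standalone demo: collect empty/"None" targets into one error message.
# Sample values simulate what the SSM-resolved variables might contain.
set -u

USER_API_TARGET="10.0.1.23:8080"
ADMIN_API_TARGET=""        # simulate a missing SSM value
TRANSCODER_TARGET="None"   # simulate SSM returning the literal "None"

MISSING_KEYS=""
[ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ] && MISSING_KEYS="$MISSING_KEYS USER_API_TARGET"
[ -z "$ADMIN_API_TARGET" ] || [ "$ADMIN_API_TARGET" = "None" ] && MISSING_KEYS="$MISSING_KEYS ADMIN_API_TARGET"
[ -z "$TRANSCODER_TARGET" ] || [ "$TRANSCODER_TARGET" = "None" ] && MISSING_KEYS="$MISSING_KEYS TRANSCODER_TARGET"

if [ -n "$MISSING_KEYS" ]; then
  # Name every failing key so the operator knows which SSM parameter to fix.
  echo "Empty scrape targets:${MISSING_KEYS}. Check SSM parameter values." >&2
fi
```

In the real step the `if` would end with `exit 1`; it is omitted here only so the demo script completes.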
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: dbea28c5-1bd9-4e63-a031-c54d5041155e
📒 Files selected for processing (14)
- .github/workflows/deploy-ai.yml
- .github/workflows/deploy-ec2-docker.yml
- .github/workflows/deploy-monitoring.yml
- apps/monitoring/docker-compose.yml
- apps/monitoring/grafana/provisioning/dashboards/json/New dashboard-1772584885701.json.disabled
- apps/monitoring/grafana/provisioning/dashboards/json/jvm-micrometer-4701.json
- apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
- apps/monitoring/grafana/provisioning/dashboards/json/node-exporter-full-1860.json
- apps/monitoring/grafana/provisioning/dashboards/json/rabbitmq-overview-10991.json
- apps/monitoring/grafana/provisioning/dashboards/json/spring-boot-3x-20727.json
- apps/monitoring/prometheus/prometheus.prod.yml
- apps/monitoring/prometheus/prometheus.prod.yml.tpl
- apps/monitoring/prometheus/prometheus.yml
- docker-compose.yml
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/deploy-ec2-docker.yml (1)
136-147: ⚠️ Potential issue | 🟡 Minor
Only a warning is printed when the user-api private IP is missing.
The current implementation emits only a ::warning:: when the private IP is absent and lets the workflow continue. However, the "Resolve scrape targets from SSM" step in deploy-monitoring.yml fails the deployment when NODE_EXPORTER_TARGET is empty or "None". If this behavior is intentional that is fine, but a private-IP resolution failure during user-api deployment can cascade into a downstream monitoring deployment failure.
🔧 To switch to a fail-fast approach

```diff
 if [ "$promtail_role" = "user-api" ]; then
-  if [ -n "$private_ip" ] && [ "$private_ip" != "None" ]; then
+  if [ -z "$private_ip" ] || [ "$private_ip" = "None" ]; then
+    echo "Failed to resolve private IP for ${target_tag}; cannot update ${NODE_EXPORTER_TARGET_SSM_PARAM}" >&2
+    exit 1
+  fi
   aws ssm put-parameter \
     --region "$AWS_REGION" \
     --name "$NODE_EXPORTER_TARGET_SSM_PARAM" \
     --type String \
     --overwrite \
     --value "${private_ip}:9100" >/dev/null
-  else
-    echo "::warning::user-api private IP is missing. NODE_EXPORTER_TARGET SSM parameter was not updated."
-  fi
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-ec2-docker.yml around lines 136 - 147: when promtail_role is "user-api" and private_ip is empty or "None", the script currently prints only a warning and continues, which can cause the "Resolve scrape targets from SSM" step in deploy-monitoring.yml to fail because NODE_EXPORTER_TARGET ends up empty; change this to fail fast by replacing the else-branch echo "::warning::user-api private IP is missing. NODE_EXPORTER_TARGET SSM parameter was not updated." with an error and an immediate exit (e.g. echo "::error::..." followed by exit 1) so the workflow never proceeds with NODE_EXPORTER_TARGET_SSM_PARAM unset. Symbols to check: promtail_role, private_ip, NODE_EXPORTER_TARGET_SSM_PARAM, and the related "Resolve scrape targets from SSM" step in deploy-monitoring.yml.
🧹 Nitpick comments (1)
apps/monitoring/prometheus/prometheus.prod.yml.tpl (1)
46-60: The node-exporter jobs have no relabel_configs.
The existing user-api, admin-api, and transcoder jobs set the instance label explicitly via relabel_configs. Applying the same pattern to node-exporter and machine-node-exporter keeps instance-label-based filtering consistent across the dashboards.
♻️ Suggested relabel_configs addition for consistency

```diff
   - job_name: "node-exporter"
     static_configs:
       - targets: ["__NODE_EXPORTER_TARGET__"]
         labels:
           app: "node-exporter"
           application: "node-exporter"
           env: "prod"
+    relabel_configs:
+      - target_label: instance
+        replacement: "node-exporter"

   - job_name: "machine-node-exporter"
     static_configs:
       - targets: ["__MACHINE_NODE_EXPORTER_TARGET__"]
         labels:
           app: "machine"
           application: "machine"
           env: "prod"
+    relabel_configs:
+      - target_label: instance
+        replacement: "machine"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/monitoring/prometheus/prometheus.prod.yml.tpl` around lines 46 - 60, Add a relabel_configs block to the node-exporter and machine-node-exporter scrape jobs so they explicitly set the instance label like the other jobs do; locate the job entries identified by job_name: "node-exporter" and job_name: "machine-node-exporter" in the Prometheus template and insert a relabel_configs section that maps __address__ (or the target) to the instance label (matching the pattern used by user-api/admin-api/transcoder jobs) so dashboards can consistently filter by instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/deploy-ai.yml:
- Around line 137-144: The script currently skips updating the SSM parameter
when MACHINE_PRIVATE_IP is empty or "None" (the if block checking
MACHINE_PRIVATE_IP) without warning, which later causes deploy-monitoring.yml to
fail; modify that branch to explicitly log a clear error or warning and exit
non-zero (or fail the workflow) when MACHINE_PRIVATE_IP is missing/equals "None"
so the pipeline fails fast, or alternatively set a deliberate fallback value and
emit a warning before continuing; update the section that references
MACHINE_PRIVATE_IP and SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM to implement this
explicit error handling.
---
Duplicate comments:
In @.github/workflows/deploy-ec2-docker.yml:
- Around line 136-147: When promtail_role is "user-api" and private_ip is empty
or "None", the script currently prints only a warning and continues, which can
cause the "Resolve scrape targets from SSM" step in deploy-monitoring.yml to
fail because NODE_EXPORTER_TARGET is empty; change this to fail fast by
replacing the else-branch echo "::warning::user-api private IP is missing.
NODE_EXPORTER_TARGET SSM parameter was not updated." with an error and an
immediate exit (e.g. echo "::error::..." then exit 1) so the workflow does not
proceed with NODE_EXPORTER_TARGET_SSM_PARAM unset; symbols to check:
promtail_role, private_ip, NODE_EXPORTER_TARGET_SSM_PARAM, and the related
"Resolve scrape targets from SSM" step in deploy-monitoring.yml.
---
Nitpick comments:
In `@apps/monitoring/prometheus/prometheus.prod.yml.tpl`:
- Around line 46-60: Add a relabel_configs block to the node-exporter and
machine-node-exporter scrape jobs so they explicitly set the instance label like
the other jobs do; locate the job entries identified by job_name:
"node-exporter" and job_name: "machine-node-exporter" in the Prometheus template
and insert a relabel_configs section that maps __address__ (or the target) to
the instance label (matching the pattern used by user-api/admin-api/transcoder
jobs) so dashboards can consistently filter by instance.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: c27546c9-6ffa-420a-b1e8-0d6aaf61202d
📒 Files selected for processing (6)
- .github/workflows/deploy-ai.yml
- .github/workflows/deploy-ec2-docker.yml
- apps/monitoring/docker-compose.yml
- apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
- apps/monitoring/prometheus/prometheus.prod.yml
- apps/monitoring/prometheus/prometheus.prod.yml.tpl
🚧 Files skipped from review as they are similar to previous changes (3)
- apps/monitoring/prometheus/prometheus.prod.yml
- apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
- apps/monitoring/docker-compose.yml
…ckend into OT-311-chore/grafana-template
♻️ Duplicate comments (2)
.github/workflows/deploy-ai.yml (2)
137-144: ⚠️ Potential issue | 🟡 Minor
Explicit error handling is needed when MACHINE_PRIVATE_IP is missing.
When MACHINE_PRIVATE_IP is empty or "None", the SSM parameter update is silently skipped. Deployment can then fail later when deploy-monitoring.yml uses that target.
🔧 Suggested explicit error handling

```diff
-if [ -n "$MACHINE_PRIVATE_IP" ] && [ "$MACHINE_PRIVATE_IP" != "None" ]; then
+if [ -z "$MACHINE_PRIVATE_IP" ] || [ "$MACHINE_PRIVATE_IP" = "None" ]; then
+  echo "::error::Failed to resolve private IP for ${TARGET_TAG}; cannot update ${SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM}" >&2
+  exit 1
+fi
+
 aws ssm put-parameter \
   --region "$AWS_REGION" \
   --name "$SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM" \
   --type String \
   --overwrite \
   --value "${MACHINE_PRIVATE_IP}:9100" >/dev/null
-fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-ai.yml around lines 137 - 144, The workflow silently skips setting SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM when MACHINE_PRIVATE_IP is empty or "None", which later breaks deploy-monitoring.yml; add an explicit failure path: check MACHINE_PRIVATE_IP (the same variable used in the if condition) and if it's empty or "None" print a clear error referencing SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM and exit non‑zero so the job fails early; update the conditional block around MACHINE_PRIVATE_IP to include an else branch that echoes the error and exits 1 (so downstream deploy-monitoring.yml won't run with a missing target).
146-161: ⚠️ Potential issue | 🟡 Minor
Security-group ingress error handling has improved, but consider aborting the workflow on failure.
Handling InvalidPermission.Duplicate is an improvement over the previous || true approach. However, when a real error occurs (insufficient permissions, a wrong SG ID, etc.) only a warning is printed and the workflow continues, so node-exporter may end up deployed in a state Prometheus cannot reach.
🔧 Suggested: abort the workflow on failure

```diff
 if ! INGRESS_OUTPUT=$(aws ec2 authorize-security-group-ingress \
     --region "$AWS_REGION" \
     --group-id "$MACHINE_SG_ID" \
     --protocol tcp \
     --port 9100 \
     --source-group "$MONITORING_SG_ID" 2>&1); then
   if echo "$INGRESS_OUTPUT" | grep -q "InvalidPermission.Duplicate"; then
     echo "Node exporter ingress rule already exists."
   else
-    echo "::warning::Failed to open node-exporter ingress (monitoring -> ai)."
+    echo "::error::Failed to open node-exporter ingress (monitoring -> ai)."
     echo "$INGRESS_OUTPUT"
+    exit 1
   fi
 fi
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/deploy-ai.yml around lines 146 - 161, The current authorize-security-group-ingress block captures errors into INGRESS_OUTPUT and only warns on failures other than InvalidPermission.Duplicate, but then continues; change the error handling so that if the aws ec2 authorize-security-group-ingress command fails and the output does NOT contain "InvalidPermission.Duplicate" (i.e., INGRESS_OUTPUT check), you print the error and then exit non‑zero (e.g., exit 1) to fail the workflow and prevent continuing with a potentially inaccessible node-exporter; reference the aws ec2 authorize-security-group-ingress invocation and variables MONITORING_SG_ID, MACHINE_SG_ID, and INGRESS_OUTPUT when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.github/workflows/deploy-ai.yml:
- Around line 137-144: The workflow silently skips setting
SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM when MACHINE_PRIVATE_IP is empty or
"None", which later breaks deploy-monitoring.yml; add an explicit failure path:
check MACHINE_PRIVATE_IP (the same variable used in the if condition) and if
it's empty or "None" print a clear error referencing
SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM and exit non‑zero so the job fails early;
update the conditional block around MACHINE_PRIVATE_IP to include an else branch
that echoes the error and exits 1 (so downstream deploy-monitoring.yml won't run
with a missing target).
- Around line 146-161: The current authorize-security-group-ingress block
captures errors into INGRESS_OUTPUT and only warns on failures other than
InvalidPermission.Duplicate, but then continues; change the error handling so
that if the aws ec2 authorize-security-group-ingress command fails and the
output does NOT contain "InvalidPermission.Duplicate" (i.e., INGRESS_OUTPUT
check), you print the error and then exit non‑zero (e.g., exit 1) to fail the
workflow and prevent continuing with a potentially inaccessible node-exporter;
reference the aws ec2 authorize-security-group-ingress invocation and variables
MONITORING_SG_ID, MACHINE_SG_ID, and INGRESS_OUTPUT when making this change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: f725fce7-04af-4013-a1c6-b1242b48aae0
📒 Files selected for processing (2)
- .github/workflows/deploy-ai.yml
- docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
- docker-compose.yml
📝 Work summary
☑️ Checklist
#️⃣ Related issues
Summary by CodeRabbit
Release notes
New features
Infrastructure updates