[OT-311] [CHORE]: Apply Grafana templates and add required settings #171

Merged
arlen02-01 merged 6 commits into develop from OT-311-chore/grafana-template
Mar 16, 2026


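arlen02-01 commented Mar 14, 2026

📝 Work done

Describe the work done in this PR

  • Apply Grafana templates and add required settings

☑️ Checklist

Please go through the checklist

  • Do the tests pass?
  • Have conflicts been resolved?
  • Has an issue been filed?
  • Has a label been added?

#️⃣ Related issues

e.g. # issue number
close #167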

Summary by CodeRabbit

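Release notes

  • New features

    • Automatic node-exporter deployment and scrape target registration (the instance IP is stored in SSM)
    • Added a JVM (Micrometer) monitoring dashboard
    • Added a Loki log visualization dashboard
  • Infrastructure updates

    • Changed the Prometheus port (host 9090 → 9091)
    • Added node-exporter and machine-node-exporter scrape jobs and an application label to Prometheus
    • Extended the SSM timeout and polling in the monitoring workflow, and added network and default dashboard path settings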

@arlen02-01 arlen02-01 self-assigned this Mar 14, 2026
@arlen02-01 arlen02-01 added the chore (changes to config files such as .gitignore, .yml) label Mar 14, 2026
coderabbitai Bot commented Mar 14, 2026

Walkthrough

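The CI/CD workflows now store the node-exporter target in an SSM parameter and include a remote deployment step. The Prometheus scrape configuration gains node-exporter targets and an application label, and the Grafana dashboards and the Docker Compose networks and ports are updated. The changes touch the workflow, Prometheus, Grafana, and Compose files.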

Changes

| Cohort / File(s) | Summary |
|---|---|
| Deployment workflows: .github/workflows/deploy-ai.yml, .github/workflows/deploy-ec2-docker.yml | Looks up the target instance's PrivateIp and stores <ip>:9100 in an SSM parameter (NODE_EXPORTER_TARGET_SSM_PARAM / SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM). On deploy, removes the existing node-exporter container, pulls the image, runs it, and handles the security group ingress rule. Extends the SSM timeout and polling interval. |
| Monitoring workflow: .github/workflows/deploy-monitoring.yml | Adds workflow inputs (node-exporter SSM parameters), reads and validates NODE_EXPORTER_TARGET / MACHINE_NODE_EXPORTER_TARGET from SSM, exports them to the GitHub environment, and adds the new target substitutions to the Prometheus template. |
| Prometheus configuration: apps/monitoring/prometheus/prometheus.prod.yml, apps/monitoring/prometheus/prometheus.prod.yml.tpl, apps/monitoring/prometheus/prometheus.yml | Adds an application label to the existing scrape_configs. Adds new node-exporter and machine-node-exporter scrape jobs (static targets whose values are substituted from SSM). |
| Grafana dashboards added: apps/monitoring/grafana/provisioning/dashboards/json/jvm-micrometer-4701.json, apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json | Adds provisioned Grafana dashboard JSON files for JVM and Loki logs (with template variables and many panels). |
| Compose / network and port adjustments: apps/monitoring/docker-compose.yml, docker-compose.yml | Adds the backend_shared network to the prometheus service and the GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH environment variable to grafana. Changes the local Prometheus host port (9090→9091). |
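
The target substitution mentioned for the monitoring workflow can be pictured with a small shell sketch. This is not the workflow's actual step: the function name `render_prometheus_template` and the use of `sed` are assumptions made for illustration. The placeholder names `__NODE_EXPORTER_TARGET__` and `__MACHINE_NODE_EXPORTER_TARGET__` are the ones quoted elsewhere in this review.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the PR): render a Prometheus template
# by replacing the two node-exporter placeholders with resolved targets.
# Usage: render_prometheus_template <template-file> <node-target> <machine-target>
render_prometheus_template() {
  local template="$1" node_target="$2" machine_target="$3"
  # "|" is used as the sed delimiter because targets look like "10.0.0.1:9100".
  # The first pattern needs two underscores before NODE, so it does not match
  # inside __MACHINE_NODE_EXPORTER_TARGET__ (only one underscore precedes NODE there).
  sed \
    -e "s|__NODE_EXPORTER_TARGET__|${node_target}|g" \
    -e "s|__MACHINE_NODE_EXPORTER_TARGET__|${machine_target}|g" \
    "$template"
}
```

A call might look like `render_prometheus_template prometheus.prod.yml.tpl "$NODE_EXPORTER_TARGET" "$MACHINE_NODE_EXPORTER_TARGET" > prometheus.prod.yml`, assuming both variables were already validated as non-empty.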

Sequence Diagram(s)

sequenceDiagram
    participant CI as "CI/CD workflow"
    participant EC2 as "EC2 instance"
    participant SSM as "AWS SSM"
    participant Prom as "Prometheus"

    CI->>EC2: Select target instance and look up PrivateIpAddress
    EC2-->>CI: PrivateIpAddress
    CI->>SSM: Store "<private_ip>:9100" in SSM (NODE_EXPORTER_*_SSM_PARAM)
    CI->>EC2: Remove existing node-exporter container, pull image, run node-exporter
    EC2-->>CI: Confirm node-exporter process is running
    Prom->>SSM: Look up NODE_EXPORTER_TARGET / MACHINE_NODE_EXPORTER_TARGET
    SSM-->>Prom: Return target address (IP:9100)
    Prom->>Prom: Collect node-exporter metrics via scrape job

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested labels

deploy

Suggested reviewers

  • phonil
  • marulog
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: applying Grafana templates and adding settings. |
| Linked Issues check | ✅ Passed | The PR's changes meet the goal of linked issue #167, "Apply Grafana templates and add required settings". |
| Out of Scope Changes check | ✅ Passed | The PR includes node-exporter deployment steps, Prometheus configuration, Grafana dashboards, and a docker-compose port change, but these all appear to be required components of the Grafana monitoring setup. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@arlen02-01 (Contributor, Author)

@CodeRabbit Please review the code and also check for convention issues, and post each item as a separate comment.

coderabbitai Bot commented Mar 14, 2026

@arlen02-01 I will run a code review and a convention check on PR #171 and leave the feedback as a separate comment for each item.

Please write each review item as a separate comment. Include both code quality and convention issues.


🧠 Learnings used
Learnt from: phonil
Repo: OpenTheTaste/backend PR: 0
File: :0-0
Timestamp: 2026-02-27T04:09:55.496Z
Learning: When reviewing PRs in the OpenTheTaste/backend repository, follow coderabbit/coderabbit-guidelines.md in the repo and leave feedback as "one comment per review item".
✅ Actions performed

Full review triggered.

@arlen02-01 (Contributor, Author)

@CodeRabbit Are you reviewing..?

coderabbitai Bot commented Mar 14, 2026

@arlen02-01 Yes, I'll start the review right away! Please wait a moment. I'll post the feedback as a separate comment for each item. 🐰


🧠 Learnings used
Learnt from: phonil
Repo: OpenTheTaste/backend PR: 0
File: :0-0
Timestamp: 2026-02-27T04:09:55.496Z
Learning: When reviewing PRs in the OpenTheTaste/backend repository, follow coderabbit/coderabbit-guidelines.md in the repo and leave feedback as "one comment per review item".
✅ Actions performed

Full review triggered.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (1)
.github/workflows/deploy-monitoring.yml (1)

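110-116: Please include the missing keys in the target validation failure message.

The validation in lines 110-116 is correct, but the current error message is so generic that it is hard to tell at a glance which target is empty. Printing the missing keys alongside it would speed up incident response.

🔧 Suggested fix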
-          if [ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ] || \
-             [ -z "$ADMIN_API_TARGET" ] || [ "$ADMIN_API_TARGET" = "None" ] || \
-             [ -z "$TRANSCODER_TARGET" ] || [ "$TRANSCODER_TARGET" = "None" ] || \
-             [ -z "$NODE_EXPORTER_TARGET" ] || [ "$NODE_EXPORTER_TARGET" = "None" ] || \
-             [ -z "$MACHINE_NODE_EXPORTER_TARGET" ] || [ "$MACHINE_NODE_EXPORTER_TARGET" = "None" ]; then
-            echo "One or more scrape targets are empty. Check SSM parameter values." >&2
+          MISSING_KEYS=()
+          [ -z "$USER_API_TARGET" ] || [ "$USER_API_TARGET" = "None" ] && MISSING_KEYS+=("USER_API_TARGET")
+          [ -z "$ADMIN_API_TARGET" ] || [ "$ADMIN_API_TARGET" = "None" ] && MISSING_KEYS+=("ADMIN_API_TARGET")
+          [ -z "$TRANSCODER_TARGET" ] || [ "$TRANSCODER_TARGET" = "None" ] && MISSING_KEYS+=("TRANSCODER_TARGET")
+          [ -z "$NODE_EXPORTER_TARGET" ] || [ "$NODE_EXPORTER_TARGET" = "None" ] && MISSING_KEYS+=("NODE_EXPORTER_TARGET")
+          [ -z "$MACHINE_NODE_EXPORTER_TARGET" ] || [ "$MACHINE_NODE_EXPORTER_TARGET" = "None" ] && MISSING_KEYS+=("MACHINE_NODE_EXPORTER_TARGET")
+          if [ "${#MISSING_KEYS[@]}" -gt 0 ]; then
+            echo "Empty scrape targets: ${MISSING_KEYS[*]}. Check SSM parameter values." >&2
             exit 1
           fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-monitoring.yml around lines 110 - 116, The current
validation block that checks USER_API_TARGET, ADMIN_API_TARGET,
TRANSCODER_TARGET, NODE_EXPORTER_TARGET, and MACHINE_NODE_EXPORTER_TARGET should
be changed to build a list of which variables are empty or "None" and include
those keys in the error message; locate the if-check referencing those
variables, compute a missing_targets string/array by testing each variable
(e.g., append "USER_API_TARGET" when [ -z "$USER_API_TARGET" ] || [
"$USER_API_TARGET" = "None" ]), and replace the generic echo with an echo that
prints "Missing or empty scrape targets: $missing_targets" (then exit 1) so the
failing keys are clearly reported.
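
The check described above can also be written as a reusable function. The following is a minimal sketch, assuming bash (it relies on indirect expansion and arrays); the function name `missing_targets` is hypothetical and does not appear in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the given variables that are
# unset, empty, or set to the literal string "None".
missing_targets() {
  local name value
  local -a missing=()
  for name in "$@"; do
    # ${!name:-} reads the variable whose name is stored in $name,
    # falling back to an empty string when it is unset.
    value="${!name:-}"
    if [ -z "$value" ] || [ "$value" = "None" ]; then
      missing+=("$name")
    fi
  done
  # Names are joined with spaces; the output is empty when nothing is missing.
  echo "${missing[*]:-}"
}
```

In a workflow step this could be used as `missing=$(missing_targets USER_API_TARGET ADMIN_API_TARGET TRANSCODER_TARGET NODE_EXPORTER_TARGET MACHINE_NODE_EXPORTER_TARGET)`, followed by an error message and `exit 1` when `$missing` is non-empty.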
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy-ai.yml:
- Around line 146-153: The deploy step currently swallows all errors with "||
true" when calling aws ec2 authorize-security-group-ingress (using
MONITORING_SG_ID and MACHINE_SG_ID), which hides real failures; remove the
blanket "|| true", run the aws command and capture its stderr/exit code, then
only ignore the call if the error indicates a duplicate rule (e.g. error string
or code like "InvalidPermission.Duplicate" in
stderr), otherwise print the aws error and exit non-zero so permission/SG/quota
problems fail the workflow; implement this logic around the aws ec2
authorize-security-group-ingress invocation.

In @.github/workflows/deploy-ec2-docker.yml:
- Around line 136-143: When promtail_role equals "user-api" ensure the script
fails fast if private_ip is missing or invalid instead of silently skipping SSM
update: inside the conditional around promtail_role/private_ip (the block that
currently sets NODE_EXPORTER_TARGET_SSM_PARAM) add a guard that checks
private_ip for emptiness or the literal "None" and, if invalid, emit an error
message and exit non‑zero so the workflow stops immediately rather than
proceeding without updating NODE_EXPORTER_TARGET_SSM_PARAM.

In `@apps/monitoring/docker-compose.yml`:
- Around line 53-56: The external network name is hardcoded as backend_default
(networks: backend_shared -> name: backend_default), which breaks when the
Compose project name differs; update the monitoring compose to derive the
network name from an env var instead of hardcoding (e.g., set
networks.backend_shared.name to something like
${COMPOSE_PROJECT_NAME:-backend}_default) or alternatively remove external:true
and document/expect the root docker-compose to declare an explicit shared
network; change the networks definition in apps/monitoring/docker-compose.yml to
reference the env-var-based name (or make it non-external) so Prometheus will
attach to the correct project network.

In `@apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json`:
- Around line 132-205: The Loki panel queries (see the target expressions
"sum(count_over_time({namespace=\"$namespace\", pod=~\"$pod\"} |~
\"$search\"[$__interval]))" and "{namespace=\"$namespace\", pod=~\"$pod\"} |~
\"$search\"") assume stream labels "namespace" and "pod" that your
Promtail/Prometheus stack does not emit; replace those label matchers with the
actual labels your stack provides (or remove them) or alternatively enable/emit
the missing Kubernetes labels in Promtail/Prometheus so the queries match real
streams; update the two target expressions and any templating variables to use
the correct label keys so the dashboard returns non-empty results.

In `@apps/monitoring/prometheus/prometheus.prod.yml.tpl`:
- Around line 43-55: The prod Prometheus scrape jobs "node-exporter" and
"machine-node-exporter" only set the label app and lack the application label
expected by the Grafana dashboard (jvm-micrometer-4701.json) which uses
label_values(application) and selectors like {application="$application"};
update the template for those jobs (job_name "node-exporter" and
"machine-node-exporter") to include an application label alongside app (e.g.
application: "<appropriate value or template variable>") for both targets
(__NODE_EXPORTER_TARGET__ and __MACHINE_NODE_EXPORTER_TARGET__), then
regenerate/sync the resulting prometheus.prod.yml so prod matches the local
prometheus.yml label schema.
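
The docker-compose comment above suggests deriving the network name from `COMPOSE_PROJECT_NAME` with a fallback. The shell sketch below shows how that default expansion resolves; the function name `compose_default_network` is hypothetical, and the `backend` fallback is the value proposed in the comment, not a confirmed project name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "<project>_default" network name,
# falling back to "backend" when COMPOSE_PROJECT_NAME is unset or empty.
compose_default_network() {
  echo "${COMPOSE_PROJECT_NAME:-backend}_default"
}
```

Compose files accept the same `${VAR:-default}` interpolation syntax, so the expression can be placed directly in the network's `name` field.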


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: dbea28c5-1bd9-4e63-a031-c54d5041155e

📥 Commits

Reviewing files that changed from the base of the PR and between 38399d5 and 873fbc3.

📒 Files selected for processing (14)
  • .github/workflows/deploy-ai.yml
  • .github/workflows/deploy-ec2-docker.yml
  • .github/workflows/deploy-monitoring.yml
  • apps/monitoring/docker-compose.yml
  • apps/monitoring/grafana/provisioning/dashboards/json/New dashboard-1772584885701.json.disabled
  • apps/monitoring/grafana/provisioning/dashboards/json/jvm-micrometer-4701.json
  • apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
  • apps/monitoring/grafana/provisioning/dashboards/json/node-exporter-full-1860.json
  • apps/monitoring/grafana/provisioning/dashboards/json/rabbitmq-overview-10991.json
  • apps/monitoring/grafana/provisioning/dashboards/json/spring-boot-3x-20727.json
  • apps/monitoring/prometheus/prometheus.prod.yml
  • apps/monitoring/prometheus/prometheus.prod.yml.tpl
  • apps/monitoring/prometheus/prometheus.yml
  • docker-compose.yml

Comment thread .github/workflows/deploy-ai.yml Outdated
Comment thread .github/workflows/deploy-ec2-docker.yml Outdated
Comment thread apps/monitoring/docker-compose.yml Outdated
Comment thread apps/monitoring/prometheus/prometheus.prod.yml.tpl
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/deploy-ec2-docker.yml (1)

136-147: ⚠️ Potential issue | 🟡 Minor

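Only a warning is printed when the user-api private IP is missing.

The current implementation prints only ::warning:: when the private IP is missing and lets the workflow continue. However, the "Resolve scrape targets from SSM" step in deploy-monitoring.yml fails the deployment when NODE_EXPORTER_TARGET is empty or "None".

If this approach is intentional it is fine, but a failure to resolve the private IP during a user-api deployment can lead to a downstream monitoring deployment failure.

🔧 To switch to a fail-fast approach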
             if [ "$promtail_role" = "user-api" ]; then
-              if [ -n "$private_ip" ] && [ "$private_ip" != "None" ]; then
+              if [ -z "$private_ip" ] || [ "$private_ip" = "None" ]; then
+                echo "Failed to resolve private IP for ${target_tag}; cannot update ${NODE_EXPORTER_TARGET_SSM_PARAM}" >&2
+                exit 1
+              fi
                 aws ssm put-parameter \
                   --region "$AWS_REGION" \
                   --name "$NODE_EXPORTER_TARGET_SSM_PARAM" \
                   --type String \
                   --overwrite \
                   --value "${private_ip}:9100" >/dev/null
-              else
-                echo "::warning::user-api private IP is missing. NODE_EXPORTER_TARGET SSM parameter was not updated."
-              fi
             fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-ec2-docker.yml around lines 136 - 147, when
promtail_role is "user-api" and private_ip is empty or "None", the script
currently only prints a warning and continues, which can leave
NODE_EXPORTER_TARGET empty and make the "Resolve scrape targets from SSM" step
in deploy-monitoring.yml fail, so change this to fail fast; in the else block of
that condition, turn echo "::warning::user-api private IP is missing.
NODE_EXPORTER_TARGET SSM parameter was not updated." into an error and exit
immediately (e.g. echo "::error::..." followed by exit 1) so the workflow does
not proceed with NODE_EXPORTER_TARGET_SSM_PARAM unset. Variables/symbols to
check: promtail_role, private_ip, NODE_EXPORTER_TARGET_SSM_PARAM, and the
related "Resolve scrape targets from SSM" step in deploy-monitoring.yml.
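
The fail-fast guard can be isolated into a function so the validation and the `<ip>:9100` formatting live in one place. This is a sketch under assumptions: the name `node_exporter_target` is hypothetical, and the error text is illustrative rather than the wording used in the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<ip>:9100" for a resolved private IP, or
# report an error and return non-zero when the IP is empty or "None".
node_exporter_target() {
  local private_ip="${1:-}"
  if [ -z "$private_ip" ] || [ "$private_ip" = "None" ]; then
    echo "::error::private IP is missing; refusing to update the SSM parameter" >&2
    return 1
  fi
  echo "${private_ip}:9100"
}
```

A caller could write `target=$(node_exporter_target "$private_ip") || exit 1` and then pass `"$target"` to `aws ssm put-parameter --value`, so the parameter is never written with an empty address.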
🧹 Nitpick comments (1)
apps/monitoring/prometheus/prometheus.prod.yml.tpl (1)

46-60: The node-exporter jobs have no relabel_configs.

The existing user-api, admin-api, and transcoder jobs explicitly set the instance label through relabel_configs. Applying the same pattern to node-exporter and machine-node-exporter makes instance-label filtering behave consistently in the dashboards.

♻️ Suggested relabel_configs for consistency
   - job_name: "node-exporter"
     static_configs:
       - targets: ["__NODE_EXPORTER_TARGET__"]
         labels:
           app: "node-exporter"
           application: "node-exporter"
           env: "prod"
+    relabel_configs:
+      - target_label: instance
+        replacement: "node-exporter"

   - job_name: "machine-node-exporter"
     static_configs:
       - targets: ["__MACHINE_NODE_EXPORTER_TARGET__"]
         labels:
           app: "machine"
           application: "machine"
           env: "prod"
+    relabel_configs:
+      - target_label: instance
+        replacement: "machine"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/monitoring/prometheus/prometheus.prod.yml.tpl` around lines 46 - 60, Add
a relabel_configs block to the node-exporter and machine-node-exporter scrape
jobs so they explicitly set the instance label like the other jobs do; locate
the job entries identified by job_name: "node-exporter" and job_name:
"machine-node-exporter" in the Prometheus template and insert a relabel_configs
section that maps __address__ (or the target) to the instance label (matching
the pattern used by user-api/admin-api/transcoder jobs) so dashboards can
consistently filter by instance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/deploy-ai.yml:
- Around line 137-144: The script currently skips updating the SSM parameter
when MACHINE_PRIVATE_IP is empty or "None" (the if block checking
MACHINE_PRIVATE_IP) without warning, which later causes deploy-monitoring.yml to
fail; modify that branch to explicitly log a clear error or warning and exit
non-zero (or fail the workflow) when MACHINE_PRIVATE_IP is missing/equals "None"
so the pipeline fails fast, or alternatively set a deliberate fallback value and
emit a warning before continuing; update the section that references
MACHINE_PRIVATE_IP and SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM to implement this
explicit error handling.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c27546c9-6ffa-420a-b1e8-0d6aaf61202d

📥 Commits

Reviewing files that changed from the base of the PR and between 873fbc3 and cceb191.

📒 Files selected for processing (6)
  • .github/workflows/deploy-ai.yml
  • .github/workflows/deploy-ec2-docker.yml
  • apps/monitoring/docker-compose.yml
  • apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
  • apps/monitoring/prometheus/prometheus.prod.yml
  • apps/monitoring/prometheus/prometheus.prod.yml.tpl
🚧 Files skipped from review as they are similar to previous changes (3)
  • apps/monitoring/prometheus/prometheus.prod.yml
  • apps/monitoring/grafana/provisioning/dashboards/json/loki-logs-15324.json
  • apps/monitoring/docker-compose.yml

Comment thread .github/workflows/deploy-ai.yml
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (2)
.github/workflows/deploy-ai.yml (2)

137-144: ⚠️ Potential issue | 🟡 Minor

Explicit error handling is needed when MACHINE_PRIVATE_IP is missing.

When MACHINE_PRIVATE_IP is empty or "None", the SSM parameter update is silently skipped. The deployment can then fail later when deploy-monitoring.yml uses that target.

🔧 Suggested explicit error handling
-          if [ -n "$MACHINE_PRIVATE_IP" ] && [ "$MACHINE_PRIVATE_IP" != "None" ]; then
+          if [ -z "$MACHINE_PRIVATE_IP" ] || [ "$MACHINE_PRIVATE_IP" = "None" ]; then
+            echo "::error::Failed to resolve private IP for ${TARGET_TAG}; cannot update ${SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM}" >&2
+            exit 1
+          fi
+
             aws ssm put-parameter \
               --region "$AWS_REGION" \
               --name "$SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM" \
               --type String \
               --overwrite \
               --value "${MACHINE_PRIVATE_IP}:9100" >/dev/null
-          fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-ai.yml around lines 137 - 144, The workflow
silently skips setting SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM when
MACHINE_PRIVATE_IP is empty or "None", which later breaks deploy-monitoring.yml;
add an explicit failure path: check MACHINE_PRIVATE_IP (the same variable used
in the if condition) and if it's empty or "None" print a clear error referencing
SSM_MACHINE_NODE_EXPORTER_TARGET_PARAM and exit non‑zero so the job fails early;
update the conditional block around MACHINE_PRIVATE_IP to include an else branch
that echoes the error and exits 1 (so downstream deploy-monitoring.yml won't run
with a missing target).

146-161: ⚠️ Potential issue | 🟡 Minor

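Error handling for the security group ingress has improved, but consider stopping the workflow on failure.

Handling InvalidPermission.Duplicate is an improvement over the previous || true approach. However, real errors such as insufficient permissions or a wrong SG ID only print a warning and execution continues, so node-exporter can be deployed in a state where Prometheus cannot reach it.

🔧 Suggested change to stop the workflow on failure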
           if ! INGRESS_OUTPUT=$(aws ec2 authorize-security-group-ingress \
             --region "$AWS_REGION" \
             --group-id "$MACHINE_SG_ID" \
             --protocol tcp \
             --port 9100 \
             --source-group "$MONITORING_SG_ID" 2>&1); then
             if echo "$INGRESS_OUTPUT" | grep -q "InvalidPermission.Duplicate"; then
               echo "Node exporter ingress rule already exists."
             else
-              echo "::warning::Failed to open node-exporter ingress (monitoring -> ai)."
+              echo "::error::Failed to open node-exporter ingress (monitoring -> ai)."
               echo "$INGRESS_OUTPUT"
+              exit 1
             fi
           fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/deploy-ai.yml around lines 146 - 161, The current
authorize-security-group-ingress block captures errors into INGRESS_OUTPUT and
only warns on failures other than InvalidPermission.Duplicate, but then
continues; change the error handling so that if the aws ec2
authorize-security-group-ingress command fails and the output does NOT contain
"InvalidPermission.Duplicate" (i.e., INGRESS_OUTPUT check), you print the error
and then exit non‑zero (e.g., exit 1) to fail the workflow and prevent
continuing with a potentially inaccessible node-exporter; reference the aws ec2
authorize-security-group-ingress invocation and variables MONITORING_SG_ID,
MACHINE_SG_ID, and INGRESS_OUTPUT when making this change.
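
The duplicate-tolerant pattern described above can be sketched as a generic wrapper that runs any command, treats `InvalidPermission.Duplicate` in its output as success, and fails on every other error. The wrapper name `run_tolerating_duplicate` is hypothetical; only the error code string comes from this review.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the given command, capture stdout and stderr,
# and treat an "InvalidPermission.Duplicate" failure as success.
run_tolerating_duplicate() {
  local output
  # The assignment is kept separate from "local" so that the exit status
  # tested by "if" is the command's own status.
  if output=$("$@" 2>&1); then
    return 0
  fi
  if grep -q "InvalidPermission.Duplicate" <<<"$output"; then
    echo "Ingress rule already exists."
    return 0
  fi
  echo "::error::Command failed: $output" >&2
  return 1
}
```

In the workflow this would wrap the existing call, for example `run_tolerating_duplicate aws ec2 authorize-security-group-ingress ... || exit 1`, so permission or security group ID problems stop the job while an already existing rule does not.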

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f725fce7-04af-4013-a1c6-b1242b48aae0

📥 Commits

Reviewing files that changed from the base of the PR and between cceb191 and 3c6e032.

📒 Files selected for processing (2)
  • .github/workflows/deploy-ai.yml
  • docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker-compose.yml

@arlen02-01 arlen02-01 merged commit bec88de into develop Mar 16, 2026
1 check passed
@phonil phonil deleted the OT-311-chore/grafana-template branch April 4, 2026 09:37