
Fix: grafana access issues #232

Merged
HardMax71 merged 4 commits into main from fix/grafana-access
Feb 24, 2026
Conversation


@HardMax71 HardMax71 commented Feb 24, 2026


Summary by cubic

Fixes Grafana login by disabling anonymous access and resetting the admin password on startup. Adds Loki-based log aggregation with OTLP logs from the backend and workers, enabling log-to-trace linking in Grafana.

  • Bug Fixes

    • Disabled Grafana anonymous auth; login is now required.
    • Resets the admin password on container start via grafana-cli, using GF_SECURITY_ADMIN_PASSWORD.
    • Grafana now waits for Loki to be healthy to avoid broken Explore/log panels.
  • New Features

    • Added Loki 3.x for centralized logs (filesystem, 30-day retention, structured metadata).
    • Added setup_log_exporter; backend and workers ship logs via OTLP to the collector and still log to stdout.
    • Updated OTel Collector to export logs to Loki via otlphttp with resource/attributes processors.
    • Provisioned Grafana Loki datasource with trace_id-to-Jaeger link; added Loki operations doc and updated Grafana docs.
    • Standardized environment handling via settings.ENVIRONMENT; unified deployment.environment label and removed TESTING.
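
The startup behavior described in these bullets can be sketched at the Compose level roughly as follows. This is an illustrative fragment, not the repository's actual docker-compose.yaml: the image tag, variable names, and entrypoint wiring are assumptions, though `grafana-cli admin reset-admin-password` and `depends_on` health conditions are standard Grafana/Compose features:

```yaml
grafana:
  image: grafana/grafana:latest          # assumed tag
  depends_on:
    loki:
      condition: service_healthy         # wait for Loki so Explore/log panels work
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}   # assumed host variable
  # /run.sh is the stock Grafana image entrypoint; $$ escapes $ for Compose.
  entrypoint: >
    sh -c 'grafana-cli admin reset-admin-password "$$GF_SECURITY_ADMIN_PASSWORD" || true;
           exec /run.sh'
```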

Written for commit 08edbbc. Summary will update on new commits.

Summary by CodeRabbit

  • New Features

    • Added Loki-based centralized log aggregation; logs can be exported for historical analysis.
    • Trace-ID linking between logs and distributed traces in Grafana.
  • Configuration

    • Grafana anonymous access disabled.
    • Loki added as a Grafana data source and integrated into startup/health orchestration.
  • Documentation

    • New operations guide and navigation entry for Log Aggregation (Loki).

Copilot AI review requested due to automatic review settings February 24, 2026 21:52

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough

Adds OpenTelemetry-based log exporting to the Python app and workers, introduces Loki service and configuration, wires the OTel Collector to export logs to Loki, updates Grafana to use Loki (with trace linking), and adds docs and Docker Compose integration for the observability stack.

Changes

Cohort / File(s) Summary
Python logging & startup
backend/app/core/logging.py, backend/app/main.py, backend/workers/run_dlq_processor.py, backend/workers/run_event_replay.py, backend/workers/run_k8s_worker.py, backend/workers/run_pod_monitor.py, backend/workers/run_result_processor.py, backend/workers/run_saga_orchestrator.py
Added setup_log_exporter(settings, logger) to configure OTLP log export via OpenTelemetry and invoked it during app and worker initialization.
OTel Collector config
backend/otel-collector-config.yaml
Added OTLP HTTP exporter targeting Loki and updated logs pipeline to process and send logs to both the new Loki exporter and existing logging exporter.
Loki config & service
backend/loki/loki-config.yaml, docker-compose.yaml
Added Loki configuration (TSDB filesystem storage, retention, compactor) and a loki service in Docker Compose with volume, healthcheck, and network wiring; updated service dependencies and volumes.
Grafana provisioning
backend/grafana/grafana.ini, backend/grafana/provisioning/datasources/datasources.yml
Disabled anonymous auth and added a Loki datasource with derivedFields for trace_id → Jaeger trace linking.
Metrics & tracing adjustments
backend/app/core/middlewares/metrics.py, backend/app/core/tracing/tracer.py
Changed resource attributes (service.environment → deployment.environment) and sourced environment values from settings.ENVIRONMENT; removed TESTING-based guards for OTLP metrics export.
Settings & environment usage
backend/app/settings.py, backend/config.test.toml, backend/app/core/dishka_lifespan.py, backend/app/dlq/manager.py, backend/app/services/notification_service.py
Removed TESTING field from Settings and replaced boolean testing checks with ENVIRONMENT == "test" where applicable.
Documentation & site nav
docs/operations/loki.md, docs/operations/grafana-integration.md, mkdocs.yml
Added Loki documentation, updated Grafana integration docs to include Loki, and added a nav entry for Log Aggregation (Loki).
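
The collector wiring summarized in the rows above typically has the following shape; the exporter key `otlphttp/loki`, the endpoint, and the processor list are inferred from this summary rather than copied from backend/otel-collector-config.yaml:

```yaml
exporters:
  otlphttp/loki:
    endpoint: http://loki:3100/otlp    # Loki 3.x native OTLP ingest (assumed host/port)

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, attributes, batch]
      exporters: [otlphttp/loki, logging]
```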

Sequence Diagram(s)

sequenceDiagram
    participant App as Python App
    participant SDK as OpenTelemetry SDK
    participant Collector as OTel Collector
    participant Loki as Loki
    participant Grafana as Grafana

    App->>SDK: setup_log_exporter(settings, logger)
    Note over SDK: Configure OTLP HTTP exporter (endpoint http://loki:3100/otlp)
    App->>App: Emit log records
    App->>SDK: Export log records (batched)
    SDK->>Collector: Send logs via OTLP HTTP
    Collector->>Collector: Process logs (memory_limiter, batch, resource, attributes)
    Collector->>Loki: Export logs via OTLP HTTP
    Loki->>Loki: Index & store logs (TSDB, filesystem)
    Grafana->>Loki: Query logs (LogQL / derivedFields trace link)
    Loki->>Grafana: Return logs & trace IDs
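
The App → SDK hop in the diagram relies on a stdlib `logging.Handler` bridge. The pattern can be illustrated with standard-library code alone; `ListExporter` and `ExportHandler` below are hypothetical stand-ins for the SDK's OTLP exporter and `LoggingHandler`, not real OpenTelemetry classes:

```python
import logging

class ListExporter:
    """Stand-in for an OTLP log exporter: collects 'exported' messages in memory."""
    def __init__(self) -> None:
        self.records: list[str] = []

    def export(self, record: logging.LogRecord) -> None:
        self.records.append(record.getMessage())

class ExportHandler(logging.Handler):
    """Stand-in for OTel's LoggingHandler: bridges stdlib records to the exporter."""
    def __init__(self, exporter: ListExporter) -> None:
        super().__init__(level=logging.DEBUG)
        self.exporter = exporter

    def emit(self, record: logging.LogRecord) -> None:
        self.exporter.export(record)

exporter = ListExporter()
demo_logger = logging.getLogger("demo")
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(ExportHandler(exporter))

demo_logger.info("execution started")
demo_logger.error("pod failed")
print(exporter.records)  # ['execution started', 'pod failed']
```

In the real wiring, `BatchLogRecordProcessor` buffers records and ships them to the collector over OTLP HTTP instead of appending to a list, and logs still go to stdout through the existing handlers.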

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • PR #175: Adds setup_logger and structlog integration in backend/app/core/logging.py, which this change extends by adding setup_log_exporter.
  • PR #131: Modifies backend/app/services/notification_service.py; overlaps with this PR's environment-to-ENVIRONMENT changes affecting the same service logic.
  • PR #227: Alters Grafana configuration and provisioning files, overlapping with the Grafana datasource and ini changes here.

Poem

🐰
I hopped in code to stitch a stream,
OpenTelemetry made logs gleam.
Loki caught each hopping line,
Grafana twined them into shine.
Cheers — a rabbit's logging dream! 🥕✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning: The PR title 'Fix: grafana access issues' is vague and does not reflect the full scope of the changes. The PR primarily adds Loki-based log aggregation with OpenTelemetry integration across the backend, but the title only mentions a Grafana access fix. Resolution: revise the title to reflect the main objective, such as 'Add Loki log aggregation with OpenTelemetry integration' or 'feat: integrate Loki for centralized logging with OTLP export', and optionally address the Grafana access fix in a separate commit or as a secondary point.
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 78.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.


@cubic-dev-ai cubic-dev-ai bot left a comment

No issues found across 16 files

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/workers/run_event_replay.py (1)

17-56: ⚠️ Potential issue | 🟡 Minor

Add Google‑style docstrings for all functions in this module.
main(), run(), init_replay(), and shutdown() are missing Google‑style docstrings with Args/Returns/Raises.

📌 Proposed docstring updates
 def main() -> None:
-    """Main entry point for event replay service"""
+    """Main entry point for event replay service.
+
+    Returns:
+        None
+    """
@@
-    async def run() -> None:
+    async def run() -> None:
+        """Start the event replay runtime.
+
+        Returns:
+            None
+        """
@@
-        async def init_replay() -> None:
+        async def init_replay() -> None:
+            """Initialize the replay cleanup scheduler.
+
+            Returns:
+                None
+            """
@@
-        async def shutdown() -> None:
+        async def shutdown() -> None:
+            """Shut down the scheduler and DI container.
+
+            Returns:
+                None
+            """

As per coding guidelines, “Use Google-style docstrings with Args/Returns/Raises sections for all functions and classes.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/workers/run_event_replay.py` around lines 17 - 56, Add Google-style
docstrings for each function in this module: main, the nested async run,
init_replay, and shutdown. For each function include a brief summary line and
Args/Returns/Raises sections as appropriate (e.g., main() returns None, run() is
async and returns None, init_replay() and shutdown() are async and return None;
document any injected variables or side effects like initializing Beanie,
starting the scheduler, or retrieving EventReplayService). Place the docstrings
immediately below each def line using the Google docstring format so linters and
docs extractors can pick them up.
🧹 Nitpick comments (1)
backend/app/core/logging.py (1)

136-137: addHandler is not idempotent — consider guarding against repeated calls.

logging.getLogger().addHandler(otel_handler) appends a new LoggingHandler every time setup_log_exporter is called in the same process. If create_app (or any worker main) is accidentally invoked more than once per process, log records will be double- (or triple-) exported to the OTLP collector.

🛡️ Proposed fix
+    # Guard against accidental double-initialization in the same process
+    root_logger = logging.getLogger()
+    if any(isinstance(h, LoggingHandler) for h in root_logger.handlers):
+        return
+
     otel_handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
-    logging.getLogger().addHandler(otel_handler)
+    root_logger.addHandler(otel_handler)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/logging.py` around lines 136 - 137, setup_log_exporter
currently creates an otel_handler = LoggingHandler(...) and unconditionally
calls logging.getLogger().addHandler(otel_handler), which will append duplicates
on repeated setup; change setup_log_exporter to be idempotent by checking for an
existing LoggingHandler instance (or a handler with a unique attribute) on the
root logger before adding otel_handler, or set and check a module-level flag
like _otel_handler_installed; specifically reference and guard the code that
creates otel_handler and calls logging.getLogger().addHandler so the handler is
only added once per process.
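
The non-idempotence called out here is easy to reproduce with stdlib logging alone. `OTelLikeHandler` below is a hypothetical stand-in for the SDK's `LoggingHandler`; the guarded variant shows the isinstance check suggested in the proposed fix:

```python
import logging

class OTelLikeHandler(logging.Handler):
    """Illustrative stand-in for opentelemetry.sdk._logs.LoggingHandler."""
    def emit(self, record: logging.LogRecord) -> None:
        pass  # a real handler would export the record via OTLP

def setup_naive() -> None:
    # Mirrors the current code: every call appends another handler.
    logging.getLogger().addHandler(OTelLikeHandler())

def setup_guarded() -> None:
    # Idempotent: skip if a handler of this type is already installed.
    root = logging.getLogger()
    if any(isinstance(h, OTelLikeHandler) for h in root.handlers):
        return
    root.addHandler(OTelLikeHandler())

root = logging.getLogger()
baseline = len(root.handlers)

setup_naive()
setup_naive()
print(len(root.handlers) - baseline)   # 2 handlers: every record exported twice

# Remove the duplicates, then show the guarded version installs exactly one.
root.handlers = [h for h in root.handlers if not isinstance(h, OTelLikeHandler)]
setup_guarded()
setup_guarded()
print(len(root.handlers) - baseline)   # 1
```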
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/logging.py`:
- Around line 115-123: The ternary for "deployment.environment" inside
Resource.create is dead code because settings.TESTING is already guarded by the
early return; update the Resource.create call in logging.py (the dict that
includes SERVICE_NAME, SERVICE_VERSION, "service.namespace",
"deployment.environment", "service.instance.id") to set a deterministic
environment value instead of the unreachable ternary — either hard-code
"production" or read a dedicated setting (e.g., settings.DEPLOYMENT_ENV or
settings.ENVIRONMENT) and use that value so the field is meaningful at runtime.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cd06a7 and 81fc79e.

📒 Files selected for processing (16)
  • backend/app/core/logging.py
  • backend/app/main.py
  • backend/grafana/grafana.ini
  • backend/grafana/provisioning/datasources/datasources.yml
  • backend/loki/loki-config.yaml
  • backend/otel-collector-config.yaml
  • backend/workers/run_dlq_processor.py
  • backend/workers/run_event_replay.py
  • backend/workers/run_k8s_worker.py
  • backend/workers/run_pod_monitor.py
  • backend/workers/run_result_processor.py
  • backend/workers/run_saga_orchestrator.py
  • docker-compose.yaml
  • docs/operations/grafana-integration.md
  • docs/operations/loki.md
  • mkdocs.yml

Copilot AI left a comment

Pull request overview

This PR extends the observability stack to include Loki-based log aggregation in Grafana, and adjusts Grafana auth/startup behavior to address access issues.

Changes:

  • Add Loki service + volume in docker-compose.yaml, and wire Grafana + OTel Collector to depend on Loki health.
  • Export Python application logs via OpenTelemetry OTLP to the collector, then to Loki; provision Loki as a Grafana datasource with trace linking to Jaeger.
  • Add/extend docs and MkDocs navigation for Loki log aggregation and updated Grafana integration.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
mkdocs.yml Adds Loki documentation page to Operations nav.
docs/operations/loki.md New end-to-end Loki/OTel/Grafana wiring + usage documentation.
docs/operations/grafana-integration.md Updates Grafana integration doc to include Loki.
docker-compose.yaml Adds Loki container, Grafana password reset entrypoint, and updates dependencies/volumes.
backend/otel-collector-config.yaml Adds otlphttp/loki exporter and routes logs pipeline to Loki.
backend/loki/loki-config.yaml New Loki single-instance config (filesystem + structured metadata).
backend/grafana/provisioning/datasources/datasources.yml Provisions Loki datasource with derived trace link to Jaeger.
backend/grafana/grafana.ini Disables anonymous Grafana auth.
backend/app/main.py Initializes setup_log_exporter during app creation.
backend/app/core/logging.py Introduces setup_log_exporter (OTLP log export via OTel LoggingHandler).
backend/workers/run_*.py (6 files) Initializes setup_log_exporter in each worker process.
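
A provisioned Loki datasource with a derived trace link, as described for datasources.yml above, generally looks like the fragment below; the regex, datasource UID, and field names are illustrative rather than the repository's exact values:

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'   # illustrative pattern for JSON logs
          url: '$${__value.raw}'               # $$ escapes $ in provisioning files
          datasourceUid: jaeger                # assumed UID of the Jaeger datasource
```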


SERVICE_NAME: settings.SERVICE_NAME,
SERVICE_VERSION: settings.SERVICE_VERSION,
"service.namespace": "integr8scode",
"deployment.environment": "test" if settings.TESTING else "production",
Copilot AI Feb 24, 2026

deployment.environment is computed with "test" if settings.TESTING else "production", but this function returns early when settings.TESTING is true, so the "test" branch is unreachable. Also hardcoding "production" ignores Settings.ENVIRONMENT (used elsewhere for metrics). Consider setting this attribute from settings.ENVIRONMENT (or remove it) to avoid incorrect resource metadata.

Suggested change
"deployment.environment": "test" if settings.TESTING else "production",
"deployment.environment": settings.ENVIRONMENT,

Copilot uses AI. Check for mistakes.
Comment on lines +118 to +119
resource = Resource.create({
SERVICE_NAME: settings.SERVICE_NAME,
Copilot AI Feb 24, 2026

service.name is set from settings.SERVICE_NAME, but the per-worker override TOMLs only override TRACING_SERVICE_NAME, not SERVICE_NAME. As a result, all workers will likely export logs with the same service.name (integr8scode-backend), which makes the Loki queries in the new docs (per-worker service_name) misleading and reduces log/trace correlation. Consider deriving the log SERVICE_NAME from settings.TRACING_SERVICE_NAME or updating worker override configs to set SERVICE_NAME too.

Suggested change
resource = Resource.create({
SERVICE_NAME: settings.SERVICE_NAME,
# Prefer per-worker tracing service name for logs when available, to keep
# log/trace correlation by service consistent across workers.
service_name = getattr(settings, "TRACING_SERVICE_NAME", None) or settings.SERVICE_NAME
resource = Resource.create({
SERVICE_NAME: service_name,

Comment on lines +132 to +138
logger_provider = LoggerProvider(resource=resource)
logger_provider.add_log_record_processor(BatchLogRecordProcessor(log_exporter))
set_logger_provider(logger_provider)

otel_handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
logging.getLogger().addHandler(otel_handler)

Copilot AI Feb 24, 2026

setup_log_exporter is not idempotent: each call adds a new LoggingHandler to the root logger and sets a new global LoggerProvider. If create_app() is invoked more than once in a process (e.g., reloads, certain test/CLI paths), this will duplicate log exports. Consider guarding with a module-level flag or checking existing handlers/providers before adding another handler.

Comment on lines +136 to +138
otel_handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
logging.getLogger().addHandler(otel_handler)

Copilot AI Feb 24, 2026

New setup_log_exporter behavior isn’t covered by unit tests, while this module already has tests for setup_logger and related processors. Adding tests that verify the early-return guard (TESTING/missing endpoint) and that an OTel LoggingHandler is attached exactly once would help prevent regressions.

Suggested change
otel_handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
logging.getLogger().addHandler(otel_handler)
# Ensure the OTel LoggingHandler is attached only once to avoid duplicate exports.
root_logger = logging.getLogger()
otel_handler = None
for handler in root_logger.handlers:
if isinstance(handler, LoggingHandler):
otel_handler = handler
break
if otel_handler is None:
otel_handler = LoggingHandler(level=logging.DEBUG, logger_provider=logger_provider)
root_logger.addHandler(otel_handler)

The log exporter is initialized once per process, right after `setup_logger`. The backend calls it in `create_app`, and each worker calls it in its `main()`:

```python
--8<-- "backend/app/core/logging.py:setup_logger"
```
Copilot AI Feb 24, 2026

This section says the backend/workers initialize setup_log_exporter after setup_logger, but the included snippet embeds only setup_logger. Consider including setup_log_exporter (or a small excerpt showing both calls) so the documentation matches the wiring described in the text.

Suggested change
--8<-- "backend/app/core/logging.py:setup_logger"
--8<-- "backend/app/core/logging.py:setup_logger"
--8<-- "backend/app/core/logging.py:setup_log_exporter"

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/core/logging.py`:
- Around line 108-116: The docstring for setup_log_exporter is stale and missing
Google-style sections: remove the misleading phrase "Skipped in tests or when no
OTLP endpoint is configured" (it only checks
settings.OTEL_EXPORTER_OTLP_ENDPOINT) and replace the entire docstring with a
Google-style docstring that clearly describes the function, its behavior, and
includes Args: (settings: Settings, logger: structlog.stdlib.BoundLogger) and
Returns: (None) entries; if you want to note test behavior, only do so if you
add an explicit ENVIRONMENT or TEST flag check in setup_log_exporter, otherwise
omit any mention of tests. Ensure the updated docstring references the function
name setup_log_exporter and the parameters so readers can locate the behavior.
- Around line 132-137: setup_log_exporter currently unconditionally creates and
attaches a new LoggingHandler and LoggerProvider each call, causing duplicate
handlers on repeated invocations; make it idempotent by tracking whether you
already set up the provider/handler (e.g., a module-level flag or stored
references) and only create/set/add when not already present, or check the root
logger's handlers for an existing instance of LoggingHandler (the otel_handler
type) before calling logging.getLogger().addHandler; ensure you also retain the
LoggerProvider reference returned/created by LoggerProvider/ set_logger_provider
so future shutdown/flush can operate on the single tracked instance.


📥 Commits

Reviewing files that changed from the base of the PR and between 81fc79e and b2e2992.

📒 Files selected for processing (8)
  • backend/app/core/dishka_lifespan.py
  • backend/app/core/logging.py
  • backend/app/core/middlewares/metrics.py
  • backend/app/core/tracing/tracer.py
  • backend/app/dlq/manager.py
  • backend/app/services/notification_service.py
  • backend/app/settings.py
  • backend/config.test.toml
💤 Files with no reviewable changes (1)
  • backend/app/settings.py

@sonarqubecloud

@HardMax71 HardMax71 merged commit cac65bc into main Feb 24, 2026
18 checks passed
@HardMax71 HardMax71 deleted the fix/grafana-access branch February 24, 2026 22:29