MathTrail Observability Stack

Observability infrastructure for the MathTrail platform — includes OpenTelemetry Collector, Grafana LGTM stack (Loki, Tempo, Mimir, Grafana), and Pyroscope for continuous profiling.

Architecture

Services → Zipkin (9411) → OTel Collector → k8sattributes → OTLP → Grafana Alloy → LGTM Stack
Services → OTLP (4317/4318) → OTel Collector → k8sattributes → OTLP → Grafana Alloy → LGTM Stack
Go Services → Pyroscope SDK → Pyroscope (4040) → Grafana

Components:

  • OpenTelemetry Collector: Smart gateway receiving Zipkin traces and OTLP from services
  • Grafana LGTM: Loki (logs), Tempo (traces), Mimir (metrics), Grafana (visualization)
  • Pyroscope: Continuous profiling for Go services
  • Namespace: monitoring
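The pipeline described above can be sketched as collector configuration. This is a minimal illustration, not this repo's actual `values/otel-collector-values.yaml`: the receiver, processor, and exporter type names are standard OpenTelemetry Collector components, but the Alloy endpoint shown is an assumption.

```yaml
receivers:
  zipkin:
    endpoint: 0.0.0.0:9411          # legacy Zipkin traffic from services
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  k8sattributes: {}                 # enrich spans with pod/namespace metadata

exporters:
  otlp:
    endpoint: lgtm-alloy-receiver.monitoring.svc.cluster.local:4317  # illustrative Alloy address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [zipkin, otlp]
      processors: [k8sattributes]
      exporters: [otlp]
```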

Quick Start

# Deploy observability stack
skaffold run

# Or use automation
just deploy

# Access Grafana
just grafana
# Open http://localhost:3000 (admin/mathtrail)

# Access Pyroscope
just pyroscope
# Open http://localhost:4040

# Check health
just health

Deployment from Root

cd d:\Projects\MathTrail\core

# Deploy only observability
skaffold run -p infra-observability

# Deploy all infrastructure (including observability)
skaffold run -p all-infra

# Deploy everything
skaffold run

Service Integration

OpenTelemetry Tracing

Services send traces via Zipkin (port 9411) or OTLP (ports 4317/4318) to the OTel Collector. Configure the collector's OTLP gRPC endpoint in your service:

# Example: service tracing configuration
env:
  - name: OTEL_ENDPOINT
    value: "otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4317"
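If a service uses the OpenTelemetry SDK's standard autoconfiguration rather than a custom `OTEL_ENDPOINT` variable, the equivalent setup would look like the following. The variable names come from the OpenTelemetry specification; the service name is illustrative.

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT   # standard OTel SDK variable
    value: "http://otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL   # grpc (4317) or http/protobuf (4318)
    value: "grpc"
  - name: OTEL_SERVICE_NAME
    value: "profile-api"                # illustrative service name
```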

Pyroscope Profiling (Go Services)

Add the Pyroscope SDK to Go services:

package main

import (
    "log"

    "github.com/grafana/pyroscope-go"
)

func main() {
    // Start continuous profiling; Start returns an error on invalid config
    // or an unreachable server address.
    _, err := pyroscope.Start(pyroscope.Config{
        ApplicationName: "profile-api",
        ServerAddress:   "http://pyroscope.monitoring.svc.cluster.local:4040",
        ProfileTypes: []pyroscope.ProfileType{
            pyroscope.ProfileCPU,
            pyroscope.ProfileAllocObjects,
            pyroscope.ProfileAllocSpace,
            pyroscope.ProfileInuseObjects,
            pyroscope.ProfileInuseSpace,
        },
    })
    if err != nil {
        log.Fatalf("pyroscope: %v", err)
    }
    // Application code...
}

DNS Service Names

| Service | DNS | Port | Usage |
|---|---|---|---|
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 9411 | Zipkin traces |
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 4317 | OTLP gRPC |
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 4318 | OTLP HTTP |
| Grafana | lgtm-grafana.monitoring.svc.cluster.local | 80 | Dashboard UI |
| Pyroscope | pyroscope.monitoring.svc.cluster.local | 4040 | Profile push |
| Loki | loki.monitoring.svc.cluster.local | 3100 | Log queries |
| Tempo | tempo.monitoring.svc.cluster.local | 3200 | Trace queries |
| Mimir | mimir.monitoring.svc.cluster.local | 9009 | Metric queries |

Verification

Check Pods

kubectl get pods -n monitoring

# Expected:
# - lgtm-alloy-receiver-*
# - lgtm-alloy-logs-* (DaemonSet)
# - lgtm-alloy-metrics-* (DaemonSet)
# - lgtm-grafana-*
# - loki-*
# - tempo-*
# - mimir-*
# - pyroscope-*
# - otel-collector-opentelemetry-collector-*

Test OTel Collector

# Check health endpoint
kubectl port-forward -n monitoring svc/otel-collector-opentelemetry-collector 13133:13133
curl http://localhost:13133/health

# Check metrics
kubectl port-forward -n monitoring svc/otel-collector-opentelemetry-collector 8888:8888
curl http://localhost:8888/metrics | grep otelcol_receiver

Verify in Grafana

Open http://localhost:3000 (after running just grafana), login with admin/mathtrail:

  1. Datasources: Configuration → Data Sources → Verify Loki, Tempo, Mimir, Pyroscope all green
  2. Logs: Explore → Loki → Query {namespace="mathtrail"}
  3. Traces: Explore → Tempo → Search for service traces
  4. Metrics: Explore → Mimir → Query up{job="otel-collector"}
  5. Profiling: Explore → Pyroscope → Query for service names

Troubleshooting

OTel Collector Issues

# Check logs
kubectl logs -n monitoring deployment/otel-collector-opentelemetry-collector

# Common issues:
# - LGTM Alloy not ready: kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy-receiver
# - RBAC missing: kubectl get clusterrole otel-collector
# - Config error: Review values/otel-collector-values.yaml

Services Not Sending Traces

# Test connectivity from mathtrail namespace
kubectl run -n mathtrail -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod:
nslookup otel-collector-opentelemetry-collector.monitoring.svc.cluster.local
wget -O- http://otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:9411

No Logs in Loki

# Check Alloy logs collector
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy-logs
kubectl logs -n monitoring daemonset/lgtm-alloy-logs

Production Considerations

  • Resources: Increase CPU/memory for OTel Collector (4 CPU, 8Gi), storage (100Gi)
  • Sampling: Reduce trace sampling to 10% (samplingRate: "0.1")
  • Retention: Configure Loki/Tempo/Mimir retention (7-30 days)
  • HA: Increase replicas for OTel Collector (3), Alloy receiver (3)
  • Storage: Use S3-compatible storage for Loki, Tempo, Pyroscope
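As a sketch, the bullets above might translate into Helm values overrides like the following. The key names depend on the charts in use and are illustrative; only the `samplingRate: "0.1"` value is taken from the list above.

```yaml
# Illustrative production overrides for the OTel Collector chart
replicaCount: 3          # HA: run three collector replicas
resources:
  limits:
    cpu: "4"
    memory: 8Gi
samplingRate: "0.1"      # keep ~10% of traces
```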

License

Apache 2.0
