Observability infrastructure for the MathTrail platform — includes OpenTelemetry Collector, Grafana LGTM stack (Loki, Tempo, Mimir, Grafana), and Pyroscope for continuous profiling.
Services → Zipkin (9411) → OTel Collector → k8sattributes → OTLP → Grafana Alloy → LGTM Stack
Services → OTLP (4317/4318) → OTel Collector → k8sattributes → OTLP → Grafana Alloy → LGTM Stack
Go Services → Pyroscope SDK → Pyroscope (4040) → Grafana
Components:
- OpenTelemetry Collector: Smart gateway receiving Zipkin traces and OTLP from services
- Grafana LGTM: Loki (logs), Tempo (traces), Mimir (metrics), Grafana (visualization)
- Pyroscope: Continuous profiling for Go services
- Namespace: monitoring
# Deploy observability stack
skaffold run
# Or use automation
just deploy
# Access Grafana
just grafana
# Open http://localhost:3000 (admin/mathtrail)
# Access Pyroscope
just pyroscope
# Open http://localhost:4040
# Check health
just health

cd d:\Projects\MathTrail\core
# Deploy only observability
skaffold run -p infra-observability
# Deploy all infrastructure (including observability)
skaffold run -p all-infra
# Deploy everything
skaffold run

Services send traces via Zipkin or OTLP to the OTel Collector. Configure the collector endpoint in your service (OTLP gRPC shown here; Zipkin uses port 9411):
# Example: service tracing configuration
env:
- name: OTEL_ENDPOINT
value: "otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:4317"Add the Pyroscope SDK to Go services:
import "github.com/grafana/pyroscope-go"
func main() {
pyroscope.Start(pyroscope.Config{
ApplicationName: "profile-api",
ServerAddress: "http://pyroscope.monitoring.svc.cluster.local:4040",
ProfileTypes: []pyroscope.ProfileType{
pyroscope.ProfileCPU,
pyroscope.ProfileAllocObjects,
pyroscope.ProfileAllocSpace,
pyroscope.ProfileInuseObjects,
pyroscope.ProfileInuseSpace,
},
})
// Application code...
}
```

| Service | DNS | Port | Usage |
|---|---|---|---|
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 9411 | Zipkin traces |
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 4317 | OTLP gRPC |
| OTel Collector | otel-collector-opentelemetry-collector.monitoring.svc.cluster.local | 4318 | OTLP HTTP |
| Grafana | lgtm-grafana.monitoring.svc.cluster.local | 80 | Dashboard UI |
| Pyroscope | pyroscope.monitoring.svc.cluster.local | 4040 | Profile push |
| Loki | loki.monitoring.svc.cluster.local | 3100 | Log queries |
| Tempo | tempo.monitoring.svc.cluster.local | 3200 | Trace queries |
| Mimir | mimir.monitoring.svc.cluster.local | 9009 | Metric queries |
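
The query endpoints above can also be hit directly for ad-hoc checks. A hypothetical sketch that calls Loki's standard /loki/api/v1/query API from inside the cluster, reusing the {namespace="mathtrail"} selector from the Grafana section further down; the surrounding program structure is illustrative only:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Loki instant-query endpoint; resolvable only inside the cluster
	// (otherwise: kubectl port-forward -n monitoring svc/loki 3100:3100).
	base := "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/query"
	params := url.Values{"query": {`count_over_time({namespace="mathtrail"}[5m])`}}

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		log.Fatalf("loki query failed: %v", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading loki response: %v", err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```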
kubectl get pods -n monitoring
# Expected:
# - lgtm-alloy-receiver-*
# - lgtm-alloy-logs-* (DaemonSet)
# - lgtm-alloy-metrics-* (DaemonSet)
# - lgtm-grafana-*
# - loki-*
# - tempo-*
# - mimir-*
# - pyroscope-*
# - otel-collector-opentelemetry-collector-*

# Check health endpoint
kubectl port-forward -n monitoring svc/otel-collector-opentelemetry-collector 13133:13133
curl http://localhost:13133/health
# Check metrics
kubectl port-forward -n monitoring svc/otel-collector-opentelemetry-collector 8888:8888
curl http://localhost:8888/metrics | grep otelcol_receiver

Open http://localhost:3000 (after running just grafana) and log in with admin/mathtrail:
- Datasources: Configuration → Data Sources → Verify Loki, Tempo, Mimir, Pyroscope all green
- Logs: Explore → Loki → Query `{namespace="mathtrail"}`
- Traces: Explore → Tempo → Search for service traces
- Metrics: Explore → Mimir → Query `up{job="otel-collector"}`
- Profiling: Explore → Pyroscope → Query for service names
# Check logs
kubectl logs -n monitoring deployment/otel-collector-opentelemetry-collector
# Common issues:
# - LGTM Alloy not ready: kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy-receiver
# - RBAC missing: kubectl get clusterrole otel-collector
# - Config error: Review values/otel-collector-values.yaml

# Test connectivity from mathtrail namespace
kubectl run -n mathtrail -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod:
nslookup otel-collector-opentelemetry-collector.monitoring.svc.cluster.local
wget -O- http://otel-collector-opentelemetry-collector.monitoring.svc.cluster.local:9411

# Check Alloy logs collector
kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy-logs
kubectl logs -n monitoring daemonset/lgtm-alloy-logs

- Resources: Increase CPU/memory for the OTel Collector (4 CPU, 8Gi) and storage (100Gi)
- Sampling: Reduce trace sampling to 10% (`samplingRate: "0.1"`); an SDK-side head-sampling sketch follows this list
- Retention: Configure Loki/Tempo/Mimir retention (7-30 days)
- HA: Increase replicas for OTel Collector (3), Alloy receiver (3)
- Storage: Use S3-compatible storage for Loki, Tempo, Pyroscope
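
The samplingRate value in the list above is assumed to live in the collector/Alloy Helm values; the same 10% target can also be applied as head sampling inside the Go services themselves. A minimal sketch, assuming the OpenTelemetry Go SDK and the tracer-provider setup from the OTLP example earlier:

```go
package main

import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// ParentBased + TraceIDRatioBased: sample ~10% of root spans and let child
	// spans follow the parent's decision, so sampled traces stay complete.
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
		// ...plus the WithBatcher/WithResource options from the OTLP example above.
	)
	otel.SetTracerProvider(tp)
}
```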
- Architecture: core/docs/architecture/observability.md
- Implementation Plan: C:\Users\Alexander.claude\plans\bubbly-toasting-kernighan.md
Apache 2.0