Log Analysis
Analyzing logs with Loki and troubleshooting in the DevStack Core environment.
- Overview
- Loki Overview
- LogQL Basics
- Common Queries
- Log Exploration
- Troubleshooting with Logs
- Service-Specific Logs
- Performance
- Alerting on Logs
- Best Practices
- Related Documentation
Loki provides centralized log aggregation for all DevStack Core services. Logs are collected by Vector, stored in Loki, and queried via Grafana.
Log Stack:
- Collection: Vector (unified observability pipeline)
- Storage: Loki (log aggregation system)
- Visualization: Grafana (query and explore interface)
- Access: http://localhost:3001 (Grafana)
Application Logs → Vector → Loki → Grafana
                      ↓
                 Prometheus (metrics)
Components:
- Vector: Collects logs from Docker containers
- Loki: Stores and indexes logs
- Grafana: Query interface with LogQL
# Via Grafana
open http://localhost:3001
# Navigate to Explore → Select Loki data source
# Via API
curl http://localhost:3100/loki/api/v1/labels
# Via LogCLI
brew install logcli
export LOKI_ADDR=http://localhost:3100
logcli labels
LogQL queries consist of:
- Log Stream Selector: {job="container", container_name="postgres"}
- Filter Expression: |= "error" or |~ "regex"
- Parser: | json or | logfmt
- Label Filter: | level="error"
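The same selector-and-pipeline structure can be submitted programmatically through Loki's /loki/api/v1/query_range endpoint. A minimal Python sketch, assuming the requests package is installed and Loki is exposed on the default port 3100:
# query_loki.py - run a LogQL query via Loki's HTTP API (sketch)
import time
import requests

LOKI_URL = "http://localhost:3100"

def query_range(logql, minutes=60, limit=100):
    """Return the log streams matching `logql` from the last `minutes`."""
    now_ns = int(time.time() * 1e9)
    params = {
        "query": logql,
        "start": now_ns - minutes * 60 * 10**9,  # nanosecond Unix timestamps
        "end": now_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Recent PostgreSQL errors, printed with their stream labels
for stream in query_range('{container_name="postgres"} |= "error"'):
    for ts, line in stream["values"]:
        print(stream["stream"].get("container_name"), line)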
# All logs from postgres
{container_name="postgres"}
# Logs containing "error"
{container_name="postgres"} |= "error"
# Case-insensitive search
{container_name="postgres"} |~ "(?i)error"
# Exclude pattern
{container_name="postgres"} != "health check"
# Multiple filters
{container_name="postgres"} |= "error" != "DEBUG"
# Regex filter
{container_name="postgres"} |~ "error|warning|fatal"
# Filter by job
{job="container"}
# Filter by container
{container_name="postgres"}
# Multiple labels
{job="container", container_name="postgres"}
# Label regex
{container_name=~"postgres|mysql"}
# Exclude label
{container_name!="vault"}
# Contains (case-sensitive)
{container_name="postgres"} |= "SELECT"
# Not contains
{container_name="postgres"} != "DEBUG"
# Regex match
{container_name="postgres"} |~ "SELECT .* FROM users"
# Regex not match
{container_name="postgres"} !~ "health.*check"
# JSON parser
{container_name="reference-api"} | json
# Extract specific fields
{container_name="reference-api"} | json | level="error"
# Logfmt parser
{container_name="vault"} | logfmt
# Pattern parser
{container_name="postgres"} | pattern `<date> <time> <level> <message>`
# Regexp parser
{container_name="postgres"} | regexp `(?P<level>\w+):\s+(?P<message>.*)`
# All errors
{job="container"} |= "error"
# Errors from specific service
{container_name="postgres"} |= "error"
# Multiple error patterns
{job="container"} |~ "error|ERROR|Error"
# Errors with context (run the query, then use Grafana's "Show context" on a matching line)
{container_name="postgres"} |= "error"
# JSON errors
{container_name="reference-api"} | json | level="error"
# PostgreSQL logs
{container_name="postgres"}
# MySQL logs
{container_name="mysql"}
# All database logs
{container_name=~"postgres|mysql|mongodb"}
# Application logs
{container_name=~".*-api"}
# Infrastructure logs
{container_name=~"vault|vector|loki"}
# Last 5 minutes (use Grafana time picker)
{container_name="postgres"}
# Specific time range
{container_name="postgres"} # Set range in Grafana
# Rate of logs
rate({container_name="postgres"}[5m])
# Count over time
count_over_time({container_name="postgres"}[1h])
# Bytes over time
bytes_over_time({container_name="postgres"}[1h])
# Count by level
sum by (level) (count_over_time({container_name="reference-api"} | json [5m]))
# Error rate
sum(rate({container_name="postgres"} |= "error" [5m]))
# Top errors
topk(10, sum by (message) (count_over_time({container_name="postgres"} |= "error" [1h])))
# Logs per container
sum by (container_name) (count_over_time({job="container"}[5m]))
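For scripted checks, metric queries can be evaluated at a single instant via /loki/api/v1/query instead of graphing them in Grafana. A minimal Python sketch, with the same requests and default-port assumptions as above:
# log_volume_check.py - evaluate a LogQL metric query at a single instant (sketch)
import requests

LOKI_URL = "http://localhost:3100"

# Log lines per container over the last 5 minutes, evaluated "now"
logql = 'sum by (container_name) (count_over_time({job="container"}[5m]))'

resp = requests.get(f"{LOKI_URL}/loki/api/v1/query", params={"query": logql})
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]        # e.g. {"container_name": "postgres"}
    _ts, value = sample["value"]     # [unix_timestamp, "count as string"]
    print(f'{labels.get("container_name", "?")}: {value} lines')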
- Open Grafana: http://localhost:3001
- Navigate to Explore: Left sidebar → Explore
- Select Loki: Data source dropdown → Loki
- Build Query: Use query builder or raw LogQL
Query Builder:
- Select labels (container_name, job)
- Add line filters (contains, regex)
- Add parsers (json, logfmt)
- Add label filters (level, message)
Example Workflow:
# 1. Start with container
{container_name="postgres"}
# 2. Add error filter
{container_name="postgres"} |= "error"
# 3. Add time range (last 1 hour)
# Use time picker in top-right
# 4. View results
# Click "Run query" or Shift+Enter
# 5. Expand log lines
# Click on log line to see full details
# 6. Add to dashboard
# Click "Add to dashboard"
# Using LogCLI
export LOKI_ADDR=http://localhost:3100
# Tail all logs
logcli query -t '{job="container"}'
# Tail specific service
logcli query -t '{container_name="postgres"}'
# Tail with filter
logcli query -t '{container_name="postgres"} |= "error"'
# Using Docker logs (alternative)
docker logs -f postgres
# In Grafana, click log line to expand
# Get surrounding logs
{container_name="postgres"} |= "error"
# Click timestamp to see full context
# Export logs
# Click "Download logs" in Grafana
# Find all unique error messages
{container_name="postgres"} |= "ERROR"
# Group by unique messages in Grafana
# Most common errors
topk(5, sum by (message) (count_over_time(
{container_name="postgres"} |= "ERROR" [1h]
)))
# Error frequency over time
sum by (container_name) (
count_over_time({job="container"} |= "error" [5m])
)
# Find stack traces
{container_name="reference-api"} |~ "Traceback|at .*\\(.*:\\d+\\)"
# Full exception context
{container_name="reference-api"} |~ "(?i)exception"
# Click to expand multi-line stack trace
# Find related logs by request ID
{container_name="reference-api"} | json | request_id="abc123"
# Trace request across services
{container_name=~".*-api"} | json | request_id="abc123"
# Time-based correlation
{job="container"} # Set time range around incident
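The request-ID correlation above can also be scripted: fetch every matching line across the API containers and merge the streams by timestamp to reconstruct the request timeline. A sketch assuming requests, the default Loki port, and the hypothetical request ID abc123:
# trace_request.py - collect all log lines for one request ID across services (sketch)
import time
import requests

LOKI_URL = "http://localhost:3100"
REQUEST_ID = "abc123"  # hypothetical request ID

now_ns = int(time.time() * 1e9)
params = {
    "query": f'{{container_name=~".*-api"}} | json | request_id="{REQUEST_ID}"',
    "start": now_ns - 3600 * 10**9,  # last hour
    "end": now_ns,
    "limit": 1000,
}
resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
resp.raise_for_status()

# Flatten all streams and sort by timestamp to see the request timeline
events = []
for stream in resp.json()["data"]["result"]:
    container = stream["stream"].get("container_name", "?")
    for ts, line in stream["values"]:
        events.append((int(ts), container, line))

for ts, container, line in sorted(events):
    print(ts, container, line)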
Step-by-step investigation:
# 1. Identify timeframe
{container_name="postgres"}
# Use time picker to narrow down incident
# 2. Find errors in timeframe
{container_name="postgres"} |= "ERROR"
# 3. Look for warnings before errors
{container_name="postgres"} |~ "WARN|WARNING"
# 4. Check all services during timeframe
{job="container"}
# 5. Correlate with other services
{container_name=~"postgres|vault|redis-1"}
# 6. Identify root cause
# Look for first error in sequence
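Step 6 (finding the first error in the sequence) is easy to script: query each suspect container with direction=forward and limit=1 so Loki returns the oldest matching line in the window. A sketch assuming requests and an illustrative incident window:
# first_error.py - find the earliest error per service in an incident window (sketch)
import requests

LOKI_URL = "http://localhost:3100"
# Hypothetical incident window (RFC3339 timestamps are accepted alongside nanosecond epochs)
START = "2025-11-20T10:00:00Z"
END = "2025-11-20T11:00:00Z"

for container in ["postgres", "vault", "redis-1"]:
    params = {
        "query": f'{{container_name="{container}"}} |~ "(?i)error"',
        "start": START,
        "end": END,
        "limit": 1,
        "direction": "forward",  # oldest match first
    }
    resp = requests.get(f"{LOKI_URL}/loki/api/v1/query_range", params=params)
    resp.raise_for_status()
    for stream in resp.json()["data"]["result"]:
        ts, line = stream["values"][0]
        print(container, ts, line)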
# All PostgreSQL logs
{container_name="postgres"}
# Connection logs
{container_name="postgres"} |~ "connection.*received|connection.*authorized"
# Query logs
{container_name="postgres"} |~ "statement:|duration:"
# Slow queries
{container_name="postgres"} |~ "duration:\\s+[0-9]{4,}" # >1000ms
# Errors
{container_name="postgres"} |= "ERROR"
# Deadlocks
{container_name="postgres"} |= "deadlock"
# Checkpoints
{container_name="postgres"} |= "checkpoint"
# All MySQL logs
{container_name="mysql"}
# Connection errors
{container_name="mysql"} |~ "Access denied|Too many connections"
# Slow queries
{container_name="mysql"} |= "Slow query"
# InnoDB errors
{container_name="mysql"} |= "InnoDB"
# Replication
{container_name="mysql"} |~ "Slave|Master"
# All MongoDB logs
{container_name="mongodb"}
# Slow queries
{container_name="mongodb"} |~ "Slow query"
# Connections
{container_name="mongodb"} |~ "connection.*accepted|connection.*ended"
# Errors
{container_name="mongodb"} |= "error"
# Index recommendations
{container_name="mongodb"} |= "Consider creating an index"
# All Vault logs
{container_name="vault"}
# Seal/unseal events
{container_name="vault"} |~ "seal|unseal"
# Authentication
{container_name="vault"} |~ "auth|login"
# Secret access
{container_name="vault"} |= "secret"
# Audit logs (if enabled)
{container_name="vault"} | json | type="response"
# FastAPI logs
{container_name="dev-reference-api"} | json
# HTTP requests
{container_name="dev-reference-api"} | json | path=~"/api/.*"
# Errors
{container_name="dev-reference-api"} | json | level="error"
# Slow requests
{container_name="dev-reference-api"} | json | duration_ms > 1000
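These queries assume the application writes one JSON object per log line with level, path, and duration_ms fields for Loki's | json parser to extract. A hedged sketch of how a FastAPI service could emit such lines; the field names are illustrative and must match whatever you filter on:
# request_logging.py - emit one JSON log line per HTTP request (sketch)
import json
import logging
import time

from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("reference-api")
logging.basicConfig(level=logging.INFO, format="%(message)s")

@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.monotonic()
    response = await call_next(request)
    duration_ms = int((time.monotonic() - start) * 1000)
    # One JSON object per line so Loki's `| json` parser can extract the fields
    logger.info(json.dumps({
        "level": "error" if response.status_code >= 500 else "info",
        "path": request.url.path,
        "status": response.status_code,
        "duration_ms": duration_ms,
    }))
    return response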
# Bad: No label filter
{} |= "error"
# Good: Start with labels
{container_name="postgres"} |= "error"
# Bad: Regex on everything
{job="container"} |~ ".*error.*"
# Good: Specific filter
{container_name="postgres"} |= "error"
# Use narrow time ranges
{container_name="postgres"} # Last 1 hour
# Limit results (set the line limit in Grafana, or use --limit with logcli)
logcli query --limit 100 '{container_name="postgres"}'
# Check label cardinality
curl http://localhost:3100/loki/api/v1/label/container_name/values
# Too many labels = poor performance
# Good: container_name, job
# Bad: request_id, user_id (high cardinality)
# loki-config.yaml
limits_config:
retention_period: 168h # 7 days
# Compact old logs
curl -X POST 'http://localhost:3100/loki/api/v1/delete?query={job="container"}&start=2024-01-01T00:00:00Z&end=2024-01-07T00:00:00Z'
Create alerts in Grafana:
# Alert: High error rate
alert: HighErrorRate
expr: |
sum(rate({container_name="postgres"} |= "ERROR" [5m])) > 10
for: 5m
annotations:
summary: High error rate in PostgreSQL
description: PostgreSQL error rate is {{ $value }} errors/sec
# Alert: Application errors
alert: ApplicationErrors
expr: |
sum(count_over_time({container_name="reference-api"} | json | level="error" [5m])) > 5
for: 2m
annotations:
summary: Application errors detected
description: "{{ $value }} errors in last 5 minutes"
- Open Grafana: http://localhost:3001
- Navigate to Alerting: Left sidebar → Alerting
- Create Alert Rule:
  - Data source: Loki
  - Query: LogQL expression
  - Condition: Threshold
  - Notification: Email, Slack, etc.
# Use structured logging (JSON)
import json
import logging
from datetime import datetime

logger = logging.getLogger(__name__)
# Log with context
logger.info(json.dumps({
"level": "info",
"message": "User logged in",
"user_id": user_id,
"request_id": request_id,
"timestamp": datetime.utcnow().isoformat()
}))
Use appropriate log levels:
- DEBUG: Detailed diagnostic info
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical errors
# Query by level
{container_name="reference-api"} | json | level="error"
{container_name="reference-api"} | json | level=~"error|warning"
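Calling json.dumps at every log site gets repetitive; a custom logging.Formatter can render every record as JSON so the level-based queries above work uniformly. A minimal standard-library sketch (field names are illustrative):
# json_logging.py - format every log record as a single JSON line (sketch)
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname.lower(),   # matches `| json | level="error"`
            "message": record.getMessage(),
            "logger": record.name,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).error("Database connection failed")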
# Check Loki storage
docker exec loki ls -lh /loki/chunks
# Monitor storage size
du -sh $(docker volume inspect loki-data -f '{{.Mountpoint}}')
# Configure retention
# Edit configs/loki/loki-config.yml
# 1. Always use label filters
{container_name="postgres"} # Good
{} |= "SELECT" # Bad
# 2. Narrow time ranges
# Use 1h or less for ad-hoc queries
# 3. Limit results (Grafana line limit, or logcli --limit)
logcli query --limit 1000 '{container_name="postgres"}'
# 4. Use aggregations
count_over_time({container_name="postgres"}[5m])
# 5. Avoid high-cardinality labels
# Don't index: request_id, user_id, session_id
- Observability Stack - Complete observability setup
- Grafana Dashboards - Creating dashboards
- Debugging Techniques - Debugging guide
- Health Monitoring - Service monitoring
- Performance Tuning - Optimization
- Troubleshooting - Common issues
Quick Reference Card:
# Basic Queries
{container_name="postgres"}
{container_name="postgres"} |= "error"
{container_name="postgres"} |~ "error|warning"
# Parsers
{container_name="reference-api"} | json
{container_name="reference-api"} | json | level="error"
# Aggregations
count_over_time({container_name="postgres"}[5m])
rate({container_name="postgres"} |= "error" [5m])
topk(10, sum by (container_name) (count_over_time({job="container"}[1h])))
# Time Ranges
# Use Grafana time picker
# Or: [5m], [1h], [24h]
# Access
# Grafana: http://localhost:3001/explore
# Loki API: http://localhost:3100