dehvCurtis
Contributor

Summary

This PR adds comprehensive documentation for the kube-prometheus-stack monitoring infrastructure, providing a complete reference for Prometheus, Grafana, and AlertManager.

Changes

📚 New Documentation: deployment/monitoring-stack.md

A complete monitoring guide covering:

Architecture & Components

  • Prometheus (port 9090)
    • Metrics collection and storage
    • ServiceMonitor auto-discovery
    • 10-day retention (configurable)
    • PromQL query language
  • Grafana (port 3000)
    • Pre-installed Kubernetes dashboards
    • Customizable visualizations
    • Default credentials: admin/admin
  • AlertManager (port 9093)
    • Alert grouping and deduplication
    • Routing to notification channels
    • Silence and inhibit rules

Installation & Access

  • Helm installation commands
  • Port-forward instructions
  • Quick start script reference (scripts/start-monitoring.sh)

ServiceMonitors

  • Explanation of ServiceMonitor CRD
  • Example ServiceMonitor for API service
  • How Prometheus discovers services
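As a rough illustration of what such a ServiceMonitor might contain, the sketch below builds the manifest as a Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The resource name, namespace, `app` label, port name, scrape interval, and the `release` label used for selection are all assumptions for illustration; none of them come from the documentation in this PR.

```python
import json


def build_service_monitor(name, namespace, app_label, port_name, release_label):
    """Return a ServiceMonitor manifest as a plain dict.

    All argument values are supplied by the caller; nothing here is taken
    from the actual cluster configuration.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: Prometheus selects ServiceMonitors by this label.
            "labels": {"release": release_label},
        },
        "spec": {
            # Match Services carrying this label.
            "selector": {"matchLabels": {"app": app_label}},
            # Scrape the named Service port at /metrics.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": "30s"}
            ],
        },
    }


# Hypothetical values for an API service.
manifest = build_service_monitor(
    name="api",
    namespace="default",
    app_label="api",
    port_name="http",
    release_label="monitoring",
)
print(json.dumps(manifest, indent=2))
```

Prometheus only discovers the Service if the ServiceMonitor's own labels match the selector configured on the Prometheus resource, so the `release` value has to be adjusted to the actual installation.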

Metrics

Application Metrics:

  • http_requests_total, http_request_duration_seconds
  • scanner_jobs_total, scanner_job_duration_seconds
  • Custom metric examples

Kubernetes Metrics:

  • Pod CPU/memory usage
  • Node resource utilization
  • Container restarts
  • PersistentVolume usage

Alerts

  • PrometheusRule examples
  • HighPodMemory alert with thresholds
  • Alert configuration file location
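For orientation, a PrometheusRule carrying a HighPodMemory alert could look roughly like the sketch below, again built as a Python dict and printed as JSON. The threshold, the `for` duration, the severity label, the resource name, and the `release` label are hypothetical; the actual thresholds live in the documented alert configuration file and are not reproduced here.

```python
import json


def build_high_pod_memory_rule(threshold_bytes, for_duration="5m",
                               release_label="monitoring"):
    """Return a PrometheusRule manifest with one HighPodMemory alert."""
    # Working-set memory summed per pod, compared to an absolute threshold.
    expr = (
        "sum by (namespace, pod) "
        '(container_memory_working_set_bytes{container!=""}) '
        f"> {int(threshold_bytes)}"
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "pod-memory-rules",  # hypothetical name
            # Assumption: Prometheus selects rules by this label.
            "labels": {"release": release_label},
        },
        "spec": {
            "groups": [
                {
                    "name": "pod-memory",
                    "rules": [
                        {
                            "alert": "HighPodMemory",
                            "expr": expr,
                            # Condition must hold this long before firing.
                            "for": for_duration,
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": "Pod {{ $labels.namespace }}/"
                                           "{{ $labels.pod }} memory is high",
                            },
                        }
                    ],
                }
            ]
        },
    }


# Hypothetical threshold of 1 GiB.
rule = build_high_pod_memory_rule(threshold_bytes=1024 ** 3)
print(json.dumps(rule, indent=2))
```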

Query Examples

# CPU usage by pod
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# P95 request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Scanner job success rate
sum(rate(scanner_jobs_total{status="completed"}[5m])) / sum(rate(scanner_jobs_total[5m]))
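
These queries can also be run outside the Prometheus UI through the HTTP API's `/api/v1/query` endpoint. The following is a minimal sketch using only the Python standard library; it assumes Prometheus has been port-forwarded to `http://localhost:9090`, and the helper names `build_query_url` and `instant_query` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here via kubectl port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def instant_query(promql, base_url=PROMETHEUS_URL, timeout=10):
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(promql, base_url),
                                timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return payload["data"]["result"]
```

With the port-forward running, `instant_query("up")` returns one entry per scrape target, each with a `metric` label set and a `value` pair of timestamp and sample.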

Adding Custom Metrics

Python/FastAPI example showing:

  • Defining metrics with prometheus_client
  • Instrumenting code
  • Exposing /metrics endpoint
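
The three steps above can be sketched with `prometheus_client` alone, without the web framework. This is not the example from the documentation itself; the label names, the `handle_request` function, and the `/scan` path are hypothetical, and a dedicated registry is used so the snippet stays self-contained.

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

# A private registry keeps this example isolated from the global default.
registry = CollectorRegistry()

# 1. Define metrics (label names are assumptions for illustration).
HTTP_REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests handled",
    ["method", "endpoint"],
    registry=registry,
)
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["endpoint"],
    registry=registry,
)


def handle_request(method, endpoint):
    """Hypothetical handler: count the request and time the work."""
    HTTP_REQUESTS.labels(method=method, endpoint=endpoint).inc()
    with REQUEST_DURATION.labels(endpoint=endpoint).time():
        time.sleep(0.01)  # stand-in for real work


# 2. Instrument code by calling the handler once.
handle_request("GET", "/scan")

# 3. Render the text a /metrics endpoint would serve.
metrics_text = generate_latest(registry).decode("utf-8")
print(metrics_text)
```

In a FastAPI application, one option is to mount the ASGI app returned by `prometheus_client.make_asgi_app()` at `/metrics` rather than rendering the text by hand.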

Troubleshooting

  • Check Prometheus targets
  • Verify ServiceMonitor discovery
  • View component logs
  • Common issues and solutions
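
As one way to check targets from a script, the sketch below filters the JSON returned by Prometheus's `/api/v1/targets` endpoint down to targets that are not healthy. The `unhealthy_targets` helper and the sample payload are invented for illustration; only the field names (`activeTargets`, `health`, `lastError`, `scrapeUrl`, `labels`) follow the Prometheus HTTP API.

```python
def unhealthy_targets(targets_payload):
    """Return a summary of active targets whose health is not 'up'."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job", ""),
            "scrapeUrl": target.get("scrapeUrl", ""),
            "health": target.get("health", "unknown"),
            "lastError": target.get("lastError", ""),
        }
        for target in active
        if target.get("health") != "up"
    ]


# Made-up payload in the shape of a /api/v1/targets response.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "api"},
                "scrapeUrl": "http://10.0.0.5:8000/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "labels": {"job": "kubelet"},
                "scrapeUrl": "https://10.0.0.1:10250/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

for target in unhealthy_targets(sample_payload):
    print(target["job"], target["scrapeUrl"], target["lastError"])
```

A service that is missing from the payload entirely, rather than listed as down, usually points at ServiceMonitor discovery instead of the scrape itself.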

Production Recommendations

  • Persistent storage for Prometheus and Grafana
  • Retention period adjustment
  • Resource limits
  • High availability (multiple replicas)
  • External storage (Thanos, Cortex)
  • Alert channel configuration
  • Authentication (OAuth, LDAP)
  • TLS enablement

Benefits

Complete Reference: Single source of truth for monitoring
Quick Start: Copy-paste commands to get started immediately
Production Ready: Best practices and recommendations included
Developer Friendly: Clear examples for adding custom metrics
Well Organized: Logical sections with clear headings

Integration

Integrates with the tool-integration repo:

  • References scripts/start-monitoring.sh for quick access
  • ServiceMonitor examples match application structure
  • Links to relevant configuration files

Testing

Documentation tested locally:

  • All commands verified against local Minikube cluster
  • Port-forward instructions work correctly
  • Query examples return valid results
  • Links to GitHub repositories are correct

🤖 Generated with Claude Code

This commit adds detailed documentation for the kube-prometheus-stack monitoring infrastructure.

## New Documentation

**deployment/monitoring-stack.md**: Comprehensive monitoring guide covering:

### Components
- **Prometheus**: Metrics collection and storage with ServiceMonitor auto-discovery
- **Grafana**: Visualization dashboards with pre-installed Kubernetes monitoring
- **AlertManager**: Alert routing and notification management

### Features Documented
- Architecture diagram showing component relationships
- Installation instructions with Helm
- Access instructions for all monitoring tools
- ServiceMonitor examples for application metrics
- Common application and Kubernetes metrics
- Alert rule examples (HighPodMemory)
- Prometheus query examples (CPU, latency, request rates)
- Adding custom metrics guide (Python/FastAPI example)
- Troubleshooting commands

### Operational Guides
- Quick start with port-forward commands
- Health check procedures
- Target verification
- Log viewing commands

### Production Recommendations
- Persistent storage configuration
- High availability setup
- Resource limits
- External storage (Thanos, Cortex)
- Alert channel configuration
- Authentication (OAuth, LDAP)
- TLS enablement

## Benefits

✅ **Complete Reference**: All monitoring components documented in one place
✅ **Quick Start**: Copy-paste commands to get started
✅ **Production Ready**: Recommendations for production deployment
✅ **Developer Friendly**: Examples for adding custom metrics
✅ **Troubleshooting**: Common issues and solutions

## Integration

Links to tool-integration repo monitoring scripts:
- `scripts/start-monitoring.sh` for quick monitoring access
- ServiceMonitor examples for application metrics

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>