Implement centralized error IDs system for tracking and monitoring #79

@neuromechanist

Type: Infrastructure / Monitoring
Priority: P2 (Nice to have)
Effort: 3-4 hours

Description

Implement a centralized error ID system to improve error tracking, monitoring, and debugging across the application. Error IDs enable better correlation of errors across logs, metrics, and monitoring dashboards.

Background

During the review of PR #78, we identified that error logs lack unique identifiers that monitoring systems such as Sentry, DataDog, or Grafana can track. We have good structured logging with context fields, but without a stable identifier per error type we can't easily:

  • Track how often specific errors occur across deployments
  • Set up alerts for specific error conditions
  • Correlate related errors across different services
  • Link user-reported issues to specific error types

Current State

Error logging uses descriptive messages but no standardized IDs:

logger.error(
    "Community %s configured to use %s but env var not set, falling back to platform key.",
    community_id,
    env_var,
    extra={
        "community_id": community_id,
        "env_var_missing": True,
    },
)

Proposed Solution

Create a centralized error IDs module with standardized error codes:

1. Create Error IDs Module

# src/constants/error_ids.py
"""Centralized error IDs for tracking and monitoring.

Error ID Format: OSA_[Category][Number], where Number is zero-padded to three digits (001-099 per category)
- OSA_E: General errors
- OSA_C: Configuration errors
- OSA_A: Authentication/Authorization errors
- OSA_S: Sync errors
- OSA_K: Knowledge base errors
"""

class ErrorIds:
    """Error IDs for tracking in monitoring systems."""

    # Configuration Errors (C001-C099)
    API_KEY_ENV_VAR_MISSING = "OSA_C001"
    API_KEY_NOT_CONFIGURED = "OSA_C002"
    COMMUNITY_CONFIG_INVALID = "OSA_C003"
    CORS_ORIGIN_INVALID = "OSA_C004"

    # Authentication Errors (A001-A099)
    API_KEY_INVALID = "OSA_A001"
    ORIGIN_NOT_AUTHORIZED = "OSA_A002"
    BYOK_REQUIRED = "OSA_A003"

    # Sync Errors (S001-S099)
    GITHUB_SYNC_FAILED = "OSA_S001"
    PAPERS_SYNC_FAILED = "OSA_S002"

    # Knowledge Base Errors (K001-K099)
    DOCUMENT_FETCH_FAILED = "OSA_K001"
    SEARCH_FAILED = "OSA_K002"

    # General Errors (E001-E099)
    REGISTRY_NOT_INITIALIZED = "OSA_E001"
    INTERNAL_SERVER_ERROR = "OSA_E002"
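
Because the IDs are plain string constants, nothing stops two constants from sharing a value or drifting from the documented format. The following is a minimal sketch of a consistency check, not part of the proposal itself: `collect_error_ids`, `validate_error_ids`, and `ERROR_ID_PATTERN` are hypothetical names, and the class is a trimmed copy of the one above so the example runs on its own.

```python
import re


class ErrorIds:
    """Trimmed copy of the proposed class, repeated so this sketch is self-contained."""

    API_KEY_ENV_VAR_MISSING = "OSA_C001"
    API_KEY_INVALID = "OSA_A001"
    GITHUB_SYNC_FAILED = "OSA_S001"
    DOCUMENT_FETCH_FAILED = "OSA_K001"
    REGISTRY_NOT_INITIALIZED = "OSA_E001"


# One category letter from the docstring, then a number in the range 001-099.
ERROR_ID_PATTERN = re.compile(r"^OSA_[ECASK]0(?:0[1-9]|[1-9][0-9])$")


def collect_error_ids(cls) -> dict[str, str]:
    """Return {constant_name: error_id} for every public string attribute."""
    return {
        name: value
        for name, value in vars(cls).items()
        if not name.startswith("_") and isinstance(value, str)
    }


def validate_error_ids(cls) -> list[str]:
    """Return a list of problems; an empty list means the class is consistent."""
    problems = []
    seen: dict[str, str] = {}
    for name, value in collect_error_ids(cls).items():
        if not ERROR_ID_PATTERN.match(value):
            problems.append(f"{name}: {value!r} does not match the ID format")
        if value in seen:
            problems.append(f"{name}: {value!r} duplicates {seen[value]}")
        seen.setdefault(value, name)
    return problems
```

A unit test could assert that `validate_error_ids(ErrorIds)` returns an empty list, which would catch a copy-pasted ID before it reaches production logs.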

2. Update Logging Calls

from src.constants.error_ids import ErrorIds

logger.error(
    "Community %s configured to use %s but env var not set, falling back to platform key.",
    community_id,
    env_var,
    extra={
        "error_id": ErrorIds.API_KEY_ENV_VAR_MISSING,
        "community_id": community_id,
        "env_var": env_var,
        "env_var_missing": True,
        "fallback_to_platform": True,
    },
)
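
Repeating the `extra` dictionary at every call site makes it easy to forget the `error_id` key. One possible way to make the ID mandatory is a thin wrapper around `logger.error()`; this is a sketch under assumptions, and `log_error`, `ListHandler`, the logger name, and the sample values `"hed"` and `"HED_API_KEY"` are all hypothetical.

```python
import logging


def log_error(logger: logging.Logger, error_id: str, message: str, *args, **context) -> None:
    """Log at ERROR level with error_id and any context fields attached via `extra`.

    Context keys must not clash with built-in LogRecord attributes such as
    "message" or "name"; the logging module raises KeyError in that case.
    """
    logger.error(message, *args, extra={"error_id": error_id, **context})


class ListHandler(logging.Handler):
    """Collects records in memory so the attached fields can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("osa.example")
logger.propagate = False  # keep the example output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

log_error(
    logger,
    "OSA_C001",
    "Community %s configured to use %s but env var not set, falling back to platform key.",
    "hed",
    "HED_API_KEY",
    community_id="hed",
    env_var="HED_API_KEY",
)

record = handler.records[0]
```

Fields passed through `extra` become attributes on the `LogRecord`, so `record.error_id` and `record.community_id` are available to any formatter or handler downstream.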

3. Include Error IDs in HTTPExceptions

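# Assumption: the project uses FastAPI's HTTPException.
from fastapi import HTTPException

from src.constants.error_ids import ErrorIds

raise HTTPException(
    status_code=403,
    detail={
        "error_id": ErrorIds.ORIGIN_NOT_AUTHORIZED,
        "message": "Origin not authorized. Please provide API key via X-OpenRouter-Key header.",
        "help_url": "https://docs.osa.osc.earth/errors/OSA_A002",
    },
)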
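
Hard-coding the ID inside `help_url` means the URL can drift from the `error_id` field. A small builder could derive one from the other. This is only a sketch: `error_detail` and `ERROR_DOCS_BASE_URL` are hypothetical names, and the base URL is taken from the example above, where the documentation site itself is still only a suggestion.

```python
# Base URL copied from the help_url in the example above; the docs site is not confirmed to exist.
ERROR_DOCS_BASE_URL = "https://docs.osa.osc.earth/errors"


def error_detail(error_id: str, message: str) -> dict[str, str]:
    """Build the `detail` payload so help_url always matches error_id."""
    return {
        "error_id": error_id,
        "message": message,
        "help_url": f"{ERROR_DOCS_BASE_URL}/{error_id}",
    }


detail = error_detail(
    "OSA_A002",
    "Origin not authorized. Please provide API key via X-OpenRouter-Key header.",
)
# The result can be passed as HTTPException(status_code=403, detail=detail).
```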

Acceptance Criteria

  • Create src/constants/error_ids.py with ErrorIds class
  • Define error IDs for all major error categories
  • Update all logger.error() calls to include error_id in extra
  • Update all HTTPException raises to include error_id in detail
  • Add error_id to JSON log formatter output
  • Document error ID format and categories in module docstring
  • Add tests verifying error IDs are present in logs
  • Create error reference documentation (optional)
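
Two of the criteria above, adding `error_id` to the JSON formatter output and testing that it appears in logs, can be sketched together. The project's actual formatter is not shown in this issue, so `JsonFormatter` and `capture_json_logs` below are hypothetical stand-ins built only on the standard library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per record, including error_id."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra` become record attributes; None when absent.
            "error_id": getattr(record, "error_id", None),
        }
        return json.dumps(payload)


def capture_json_logs(logger_name: str) -> tuple[logging.Logger, io.StringIO]:
    """Attach an in-memory JSON handler to a logger and return both."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    logger.propagate = False  # keep test output isolated from the root logger
    return logger, stream


logger, stream = capture_json_logs("osa.test")
logger.error("GitHub sync failed", extra={"error_id": "OSA_S001"})

# Each emitted line is a JSON document that a test can parse and inspect.
entry = json.loads(stream.getvalue())
```

A test can then assert on `entry["error_id"]` directly, which checks the formatter and the call site in one step.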

Benefits

  1. Monitoring: Easy to set up alerts for specific error IDs
  2. Debugging: Quickly find all instances of a specific error across logs
  3. Analytics: Track error frequency and trends over time
  4. User Support: Users can report error IDs for faster diagnosis
  5. Documentation: Error IDs can link to detailed error documentation

Example Monitoring Query

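With error_id available as a parsed JSON field, a query can filter on it directly:

# Grafana/Loki (LogQL): total occurrences of OSA_C001 over the last hour
sum(count_over_time({app="osa"} | json | error_id="OSA_C001" [1h]))

# Count API key missing errors per community over the last hour
sum by (community_id) (count_over_time({app="osa"} | json | error_id="OSA_C001" [1h]))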
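
For local debugging without a log backend, the per-community count can be approximated over raw JSON log lines. This is a rough sketch of what that aggregation computes, not a replacement for the monitoring query; `count_by_community` is a hypothetical helper and the sample lines and community names are made up.

```python
import json
from collections import Counter


def count_by_community(log_lines, error_id: str) -> Counter:
    """Count entries carrying `error_id`, grouped by their community_id field."""
    counts: Counter = Counter()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and entry.get("error_id") == error_id:
            counts[entry.get("community_id", "unknown")] += 1
    return counts


sample = [
    '{"error_id": "OSA_C001", "community_id": "hed"}',
    '{"error_id": "OSA_C001", "community_id": "hed"}',
    '{"error_id": "OSA_C001", "community_id": "bids"}',
    '{"error_id": "OSA_A002"}',
    "not json",
]
counts = count_by_community(sample, "OSA_C001")
```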

Implementation Notes

  • Follow the existing pattern from the HEDit project (referenced in CLAUDE.md)
  • Keep error IDs stable: never reuse an ID for a different error
  • If an error is removed, document its ID as deprecated rather than freeing it for reuse
  • Consider creating an error documentation site (e.g., docs.osa.osc.earth/errors/OSA_C001)
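
The stability rule can be enforced mechanically by keeping retired IDs in a registry and checking that no active constant reuses one. The sketch below is one possible shape for that check: `DEPRECATED_IDS`, `find_reused_ids`, and the retired ID `OSA_C005` are hypothetical and do not appear in the proposal.

```python
# Active constants, shown as a plain mapping so the sketch is self-contained.
ACTIVE_IDS = {
    "API_KEY_ENV_VAR_MISSING": "OSA_C001",
    "API_KEY_NOT_CONFIGURED": "OSA_C002",
}

# Hypothetical registry of retired IDs with a short note on why each was removed.
DEPRECATED_IDS = {
    "OSA_C005": "Example entry: error removed, ID must not be reassigned",
}


def find_reused_ids(active: dict[str, str], deprecated: dict[str, str]) -> set[str]:
    """Return every ID that is both in active use and marked as deprecated."""
    return set(active.values()) & set(deprecated)


reused = find_reused_ids(ACTIVE_IDS, DEPRECATED_IDS)
```

Running this in CI and failing when `reused` is non-empty would keep a retired ID from silently taking on a new meaning in dashboards and alerts.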

Related

Labels

enhancement, monitoring, P2, infrastructure

    Labels

    P1 (Priority 1: Critical, fix as soon as possible), enhancement (New feature or request)
