# 📊 Monitoring & Observability

**Phase 3: System Design - Production Operations**

**Master metrics collection, alerting, dashboards, and observability patterns for production systems**

---

In [None]:
// Observability Overview - Monitoring, Logging, Alerting
println("📊 MONITORING & OBSERVABILITY - PRODUCTION SYSTEMS INSIGHT")
println()

println("🎯 The Four Pillars of Production Readiness:")
println("🔍 Monitoring: Activity measurement and resource tracking")
println("📋 Logging: Detailed activity records with context")
println("🚨 Alerting: Proactive incident detection and response")
println("📊 Dashboards: Visual insights into system behavior")
println()

println("🏆 Quality Attributes for Production Monitoring:")
println("✓ Observability: Understanding system state from outputs")
println("✓ Reliability: Fault tolerance and recovery metrics")
println("✓ Performance: Throughput, latency, resource utilization")
println("✓ Availability: Uptime, service level objectives (SLOs)")
println("✓ Incident Response: Mean time to detection/recovery (MTTD/MTTR)")
println()
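
The availability and incident-response attributes listed above reduce to simple arithmetic once incidents are recorded. The following is a minimal sketch, not part of the notebook's own code: `Incident` and `SloMath` are hypothetical names, times are assumed to be epoch seconds, and MTTR is measured from incident start (some teams measure it from detection instead).

```scala
// Hypothetical types for illustration only; none of these names come from the notebook.
final case class Incident(startedAt: Long, detectedAt: Long, resolvedAt: Long) // epoch seconds

object SloMath {
  /** Fraction of the window during which the service was up, e.g. 0.999. */
  def availability(windowSeconds: Long, incidents: Seq[Incident]): Double = {
    val downtime = incidents.map(i => i.resolvedAt - i.startedAt).sum
    (windowSeconds - downtime).toDouble / windowSeconds
  }

  /** Downtime that an SLO target permits over the window, in seconds. */
  def errorBudgetSeconds(windowSeconds: Long, sloTarget: Double): Double =
    windowSeconds * (1.0 - sloTarget)

  /** Mean time to detection: average of (detected - started). */
  def mttdSeconds(incidents: Seq[Incident]): Double =
    if (incidents.isEmpty) 0.0
    else incidents.map(i => (i.detectedAt - i.startedAt).toDouble).sum / incidents.size

  /** Mean time to recovery: average of (resolved - started). */
  def mttrSeconds(incidents: Seq[Incident]): Double =
    if (incidents.isEmpty) 0.0
    else incidents.map(i => (i.resolvedAt - i.startedAt).toDouble).sum / incidents.size
}
```

Comparing the measured downtime against `errorBudgetSeconds` for the same window shows how much of the budget an SLO has left, which is one common input to SLO-based alerting.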


## 📈 Metrics Collection & Analysis

**Comprehensive metrics gathering with proper classification and aggregation patterns**

In [None]:
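// Enterprise Metrics System with Scala (Cats Effect 2 style: Concurrent + Timer)
import cats.effect.{Concurrent, Timer}
import cats.syntax.all._
import scala.concurrent.duration.FiniteDuration

// Assumed interfaces: this cell uses them without defining them,
// so the shapes below are inferred from the call sites.
trait MetricsBackend[F[_]] {
  def incrementCounter(name: String, tags: Map[String, String] = Map.empty): F[Unit]
  def recordHistogram(name: String, tags: Map[String, String], value: Double): F[Unit]
  def setGauge(name: String, tags: Map[String, String], value: Double): F[Unit]
  def getServiceMetrics(service: String): F[Map[String, Double]]
}

sealed trait HealthStatus
case object Healthy extends HealthStatus
case object Degraded extends HealthStatus
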
sealed trait MetricType
case object Counter extends MetricType        // Monotonically increasing values
case object Gauge extends MetricType         // Instantaneous measurements
case object Histogram extends MetricType     // Distributions (percentiles)
case object Summary extends MetricType       // Stats with quantiles

case class ServiceMetrics(
  requestCount: Long = 0,
  errorCount: Long = 0,
  activeConnections: Int = 0,
  responseTimeMs: List[Long] = Nil,
  lastError: Option[String] = None,
  uptimeSeconds: Long = 0
)

class MetricsCollector[F[_]: Concurrent: Timer](
  serviceName: String,
  backend: MetricsBackend[F]
) {
  
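  def recordRequest(service: String, responseTime: FiniteDuration): F[Unit] = {
    val tags = Map("service" -> service)
    for {
      _   <- backend.incrementCounter(s"$service.requests_total", tags)
      // Histogram values are in seconds, matching the metric name.
      // NOTE: the "method" tag is hard-coded to GET, as in the original.
      _   <- backend.recordHistogram(s"$service.request_duration_seconds",
               tags + ("method" -> "GET"),
               responseTime.toNanos.toDouble / 1e9)
      // Read the clock through Timer so the side effect stays inside F
      now <- Timer[F].clock.realTime(scala.concurrent.duration.MILLISECONDS)
      _   <- backend.setGauge(s"$service.last_request_timestamp", tags, now.toDouble)
    } yield ()
  }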
  
  def recordError(service: String, errorType: String): F[Unit] = {
    backend.incrementCounter(s"$service.errors_total",
      Map("service" -> service, "error_type" -> errorType))
  }
  
  def recordCircuitBreakerState(service: String, state: String): F[Unit] = {
    backend.setGauge(s"$service.circuit_breaker_state",
      Map("service" -> service, "state" -> state), state match {
        case "closed" => 0
        case "open" => 1
        case "half_open" => 2
        case _ => -1
      })
  }

  // Health check: Healthy while the cumulative error rate stays below 5%
  def healthStatus(): F[HealthStatus] = {
    for {
      metrics <- backend.getServiceMetrics(serviceName)
      totalRequests = metrics.getOrElse("requests_total", 0.0).toLong
      totalErrors = metrics.getOrElse("errors_total", 0.0).toLong
      errorRate = if (totalRequests > 0) (totalErrors.toDouble / totalRequests) else 0.0
    } yield if (errorRate < 0.05) Healthy else Degraded
  }
}

println("📈 Enterprise Metrics System Implemented")
println("• Counter metrics for events (requests, errors)")
println("• Histogram metrics for distributions (latencies)")
println("• Gauge metrics for current values (connections, state)")
println("• Health check integration with SLO tracking")
println("• Service-level tagging for multi-tenant metrics")
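
The difference between the counter, gauge, and histogram metric types is easiest to see in a tiny in-memory model. The sketch below uses only the standard library. `InMemoryMetrics` is a hypothetical stand-in rather than the notebook's `MetricsBackend`, it ignores tags, it is not thread-safe, and it derives percentiles from raw samples with the nearest-rank method, whereas production backends typically aggregate into buckets.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a metrics backend (illustration only).
final class InMemoryMetrics {
  private val counters   = mutable.Map.empty[String, Long]
  private val gauges     = mutable.Map.empty[String, Double]
  private val histograms = mutable.Map.empty[String, Vector[Double]]

  // Counter: only ever goes up
  def incrementCounter(name: String, by: Long = 1L): Unit = {
    require(by >= 0, "counters are monotonically increasing")
    counters.update(name, counters.getOrElse(name, 0L) + by)
  }

  // Gauge: overwritten with the latest instantaneous value
  def setGauge(name: String, value: Double): Unit =
    gauges.update(name, value)

  // Histogram: keeps every observation so percentiles can be derived later
  def observe(name: String, value: Double): Unit =
    histograms.update(name, histograms.getOrElse(name, Vector.empty) :+ value)

  def counter(name: String): Long = counters.getOrElse(name, 0L)
  def gauge(name: String): Option[Double] = gauges.get(name)

  // Nearest-rank percentile: the smallest sample such that at least p percent
  // of all samples are less than or equal to it. None if nothing was observed.
  def percentile(name: String, p: Double): Option[Double] = {
    val sorted = histograms.getOrElse(name, Vector.empty).sorted
    if (sorted.isEmpty) None
    else {
      val rank = math.ceil(p * sorted.size / 100.0).toInt
      val clamped = math.min(math.max(rank, 1), sorted.size)
      Some(sorted(clamped - 1))
    }
  }
}
```

A counter answers "how many so far", a gauge answers "what is it right now", and only the histogram can answer "how slow are the slowest 5% of requests", which is why latency is recorded as a distribution rather than as a single value.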


## 📋 Structured Logging Patterns

**Log management, correlation IDs, and observability-driven logging**

In [None]:
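// Enterprise Logging System
// JsonLogger and sendAlert are used below but are not defined in this notebook.
import cats.effect.Sync
import cats.syntax.all._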
sealed trait LogLevel
case object DEBUG extends LogLevel
case object INFO extends LogLevel
case object WARN extends LogLevel
case object ERROR extends LogLevel

case class LogEntry(
  timestamp: java.time.Instant,
  level: LogLevel,
  message: String,
  correlationId: Option[String] = None,
  userId: Option[String] = None,
  requestId: Option[String] = None,
  service: String,
  context: Map[String, String] = Map.empty,
  error: Option[Throwable] = None
)

class StructuredLogger[F[_]: Sync](
  serviceName: String,
  correlationIdGenerator: F[String]
) {
  
  private val levels = Map(
    DEBUG -> 10,
    INFO -> 20,
    WARN -> 30,
    ERROR -> 40
  )
  
  def info(msg: String, ctx: Map[String, String] = Map.empty): F[Unit] =
    log(INFO, msg, ctx)
    
  def error(msg: String, error: Throwable, ctx: Map[String, String] = Map.empty): F[Unit] =
    log(ERROR, msg, ctx, Some(error))
    
  def warn(msg: String, ctx: Map[String, String] = Map.empty): F[Unit] =
    log(WARN, msg, ctx)
    
  def debug(msg: String, ctx: Map[String, String] = Map.empty): F[Unit] =
    log(DEBUG, msg, ctx)

  private def log(
    level: LogLevel,
    message: String,
    context: Map[String, String],
    error: Option[Throwable] = None
  ): F[Unit] = {
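    for {
      // Reuse the caller's correlation ID when the context supplies one, so all
      // entries for a request share it; generate a new one only as a fallback.
      // (The "correlationId" context key is an assumed convention.)
      correlationId <- context.get("correlationId") match {
        case Some(id) => Sync[F].pure(id)
        case None     => correlationIdGenerator
      }
      now <- Sync[F].delay(java.time.Instant.now()) // suspend the clock read
      entry = LogEntry(
        timestamp = now,
        level = level,
        message = message,
        correlationId = Some(correlationId),
        service = serviceName,
        context = context,
        error = error
      )
      _ <- writeLogEntry(entry)
      _ <- conditionallyAlert(entry) // Alert on critical events
    } yield ()
  }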
  
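  private def writeLogEntry(entry: LogEntry): F[Unit] = {
    // JsonLogger.toJson is assumed to return F[String]
    JsonLogger.toJson(entry).flatMap { json =>
      // println returns Unit, so it is suspended to produce the F[Unit] that flatMap needs
      Sync[F].delay(println(s"[${entry.level}] ${entry.service}: ${json}"))
    }.handleErrorWith(_ => Sync[F].unit) // Logging errors don't crash the service
  }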
  
  private def conditionallyAlert(entry: LogEntry): F[Unit] = {
    val shouldAlert = entry.level == ERROR && 
      entry.error.exists(_.isInstanceOf[CriticalBusinessError])
    
    if (shouldAlert) {
      sendAlert(entry)
    } else {
      Sync[F].unit
    }
  }
}

case class CriticalBusinessError(msg: String) extends Exception(msg)

println("📋 Structured Logging System Implemented")
println("• Correlation IDs for request tracing")
println("• Structured JSON logging for analysis")
println("• Log levels with severity hierarchy")
println("• Error telemetry and alerting integration")
println("• Logging failures are swallowed so they never crash the service")
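
To make the structured-logging idea concrete, here is a minimal sketch of rendering one log event as a single-line JSON object using only the standard library. `JsonLogLine` is a hypothetical helper, not the `JsonLogger` referenced in the cell above, and its escaping covers only quotes, backslashes, and the common whitespace control characters. A real service would more likely use a JSON library.

```scala
import java.time.Instant

// Hypothetical helper for illustration; not a complete JSON encoder.
object JsonLogLine {
  // Escape quotes, backslashes, and common control characters inside a JSON string
  private def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '"'  => sb.append("\\\"")
      case '\\' => sb.append("\\\\")
      case '\n' => sb.append("\\n")
      case '\r' => sb.append("\\r")
      case '\t' => sb.append("\\t")
      case c    => sb.append(c)
    }
    sb.toString
  }

  private def field(key: String, value: String): String =
    "\"" + escape(key) + "\":\"" + escape(value) + "\""

  /** Renders one log event as a single-line JSON object with stable field order. */
  def render(
    timestamp: Instant,
    level: String,
    service: String,
    correlationId: String,
    message: String,
    context: Map[String, String] = Map.empty
  ): String = {
    val base = Seq(
      field("timestamp", timestamp.toString),
      field("level", level),
      field("service", service),
      field("correlationId", correlationId),
      field("message", message)
    )
    // Context keys are sorted so that output is deterministic
    val ctx = context.toSeq.sortBy(_._1).map { case (k, v) => field(k, v) }
    (base ++ ctx).mkString("{", ",", "}")
  }
}
```

The important design point is that the caller passes the same `correlationId` to every `render` call made while handling one request, so a log search on that ID returns the full story of the request across components.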


## 🔔 Alerting & Incident Management

**Smart alerting systems with escalations, de-duplication, and automated incident response**

In [None]:
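// Production Alerting System
// AlertThresholds, NotificationService, and AlertStore are used below but are not defined in this notebook.
import cats.effect.{Concurrent, Sync, Timer}
import cats.effect.concurrent.Ref
import cats.syntax.all._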
sealed trait AlertSeverity
case object Info extends AlertSeverity
case object Warning extends AlertSeverity
case object Critical extends AlertSeverity
case object Pager extends AlertSeverity

case class Alert(
  id: String,
  severity: AlertSeverity,
  service: String,
  title: String,
  description: String,
  timestamp: java.time.Instant,
  tags: Set[String] = Set.empty,
  resolved: Boolean = false
)

class AlertManager[F[_]: Concurrent: Timer](
  thresholds: AlertThresholds,
  notificationService: NotificationService[F],
  alertStore: AlertStore[F]
) {
  
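  // Ref.of returns F[Ref[...]], so calling .get on it would not compile;
  // Ref.unsafe allocates the shared state once when the manager is constructed
  private val activeAlerts =
    cats.effect.concurrent.Ref.unsafe[F, Map[String, Alert]](Map.empty)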
  
  // SLO-based alert rules
  def checkSLOMetrics(metrics: ServiceMetrics): F[Unit] = {
    val errorRate = if (metrics.requestCount > 0) {
      metrics.errorCount.toDouble / metrics.requestCount
    } else 0.0
    
    for {
      _ <- if (errorRate > thresholds.errorRatePercent / 100.0) {
        fireAlert(
          s"${metrics.serviceName} error rate: ${errorRate * 100}%.1f%%",
          s"Error rate exceeded ${thresholds.errorRatePercent}% threshold",
          if (errorRate > thresholds.errorRatePercent * 2 / 100.0) Pager else Critical
        )
      } else clearAlert("high_error_rate")
      
      _ <- if (metrics.responseTimeP95 > thresholds.p95LatencyMs) {
        fireAlert(
          s"${metrics.serviceName} P95 latency: ${metrics.responseTimeP95}ms",
          s"95th percentile latency exceeded ${thresholds.p95LatencyMs}ms threshold",
          Warning
        )
      } else clearAlert("high_latency")
      
    } yield ()
  }
  
  private def fireAlert(title: String, description: String, severity: AlertSeverity): F[Unit] = {
    val alertId = generateAlertId(title)
    
    for {
      alreadyActive <- activeAlerts.get.map(_.contains(alertId))
      _ <- if (!alreadyActive) {
        for {
          alert <- createAlert(alertId, title, description, severity)
          _ <- alertStore.save(alert)
          _ <- activeAlerts.update(_ + (alert.id -> alert))
          _ <- notificationService.notify(alert)
          _ <- if (severity == Pager) escalationManager.scheduleEscalation(alert)
        } yield ()
      } else Sync[F].unit
    } yield ()
  }
  
  def resolveAlert(alertId: String): F[Unit] = {
    for {
      alert <- activeAlerts.get.map(_.get(alertId))
      _ <- alert.map(a => 
        alertStore.update(a.copy(resolved = true)) *>
        activeAlerts.update(_ - alertId)
      ).getOrElse(Sync[F].unit)
    } yield ()
  }
  
  private def createAlert(id: String, title: String, desc: String, sev: AlertSeverity): F[Alert] = {
    // delay (not pure) so the timestamp is read when the effect runs
    Sync[F].delay(Alert(
      id = id,
      severity = sev,
      service = "system",
      title = title,
      description = desc,
      timestamp = java.time.Instant.now(),
      tags = Set("auto-generated", "monitoring")
    ))
  }

  def getActiveAlerts(): F[List[Alert]] = 
    activeAlerts.get.map(_.values.toList.filter(!_.resolved))
}

println("🔔 Enterprise Alerting System Implemented")
println("• SLO-based automated alerts")
println("• Severity levels with escalation paths")
println("• Alert de-duplication and noise reduction")
println("• Incident tracking with resolution workflow")
println("• Pager-severity escalation for critical alerts (e.g. via PagerDuty)")
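
Alert de-duplication can be reasoned about without any effect system by treating the set of active alerts as an immutable value. The sketch below is a simplified, hypothetical model: `ActiveAlert` and `AlertState` are not the notebook's types, severity is reduced to an integer rank, and notifications are returned rather than sent. Unlike the `AlertManager` above, it also re-notifies when an already active condition escalates to a higher severity, which is one possible policy rather than the only reasonable one.

```scala
// Hypothetical, simplified model for illustration only.
final case class ActiveAlert(key: String, severity: Int, title: String)

final case class AlertState(active: Map[String, ActiveAlert] = Map.empty) {
  /** Returns the new state and the alert to notify about, if any. */
  def fire(alert: ActiveAlert): (AlertState, Option[ActiveAlert]) =
    active.get(alert.key) match {
      // Same condition already active at the same or higher severity: suppress
      case Some(existing) if existing.severity >= alert.severity => (this, None)
      // New condition, or an escalation of an existing one: record and notify
      case _ => (copy(active = active + (alert.key -> alert)), Some(alert))
    }

  /** Clears the condition so that a later recurrence notifies again. */
  def resolve(key: String): AlertState = copy(active = active - key)
}
```

Keying on a stable identifier such as `"high_error_rate"` rather than on the rendered title matters here: the title usually embeds the current measurement, so it changes on every evaluation and would defeat de-duplication.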


## 📊 Dashboard & Visualization Patterns

**Creating production dashboards for system insights and business intelligence**

In [None]:
// Dashboard Configuration and Analytics
println("📊 PRODUCTION DASHBOARDS - SYSTEM VISUALIZATION")
println()

println("🔥 Essential Dashboard Components:")
println()

println("1. 🔄 Service Health Overview")
println("   ✓ Current status of all services")
println("   ✓ Uptime and availability percentage")
println("   ✓ Active alerts and critical issues")
println("   ✓ Circuit breaker states")
println()

println("2. 📈 Performance Metrics Dashboard")
println("   ✓ Request rate (RPS, RPM)")
println("   ✓ Response time percentiles (P50, P95, P99)")
println("   ✓ Error rates and success rates")
println("   ✓ Throughput and resource utilization")
println()

println("3. 🎯 Business Metrics Integration")
println("   ✓ User engagement metrics")
println("   ✓ Revenue and transaction KPIs")
println("   ✓ Feature adoption rates")
println("   ✓ Customer satisfaction scores")
println()

println("4. 🔍 Detailed Investigation Panels")
println("   ✓ Log correlation and trace views")
println("   ✓ Database query performance")
println("   ✓ Cache hit rates and miss penalties")
println("   ✓ End-to-end request flow visualization")
println()

println("🛠️ Dashboard Tools Integration:")
println("• Grafana: Real-time metrics dashboarding")
println("• Kibana: Log aggregation and search")
println("• Jaeger/Zipkin: Distributed tracing")
println("• Prometheus: Metrics collection and querying")

println("\n📊 MONITORING MATURITY LEVELS:")
println("Level 1: Manual monitoring (alerting on obvious issues)")
println("Level 2: Automated alerting (SLO-based, intelligent thresholds)")
println("Level 3: Predictive analytics (anomaly detection, forecasting)")
println("Level 4: Self-healing systems (auto-remediation, chaos engineering)")

println("\nDon't forget: Observing a system changes its behavior!")
println("Monitoring should add <5% overhead and never impact user experience.")
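
A request-rate panel (RPS or RPM) is normally derived from a monotonically increasing counter rather than stored directly. The sketch below shows one way to do that with the standard library only. `CounterSample` and `RateMath` are hypothetical names, timestamps are assumed to be epoch seconds, and a decrease between consecutive samples is assumed to mean the process restarted and the counter began again from zero.

```scala
// Hypothetical sample type: one reading of a counter at a point in time.
final case class CounterSample(epochSeconds: Long, value: Long)

object RateMath {
  /** Average per-second increase across the samples, tolerating counter resets. */
  def ratePerSecond(samples: Seq[CounterSample]): Option[Double] = {
    val sorted = samples.sortBy(_.epochSeconds)
    if (sorted.size < 2) None
    else {
      val elapsed = sorted.last.epochSeconds - sorted.head.epochSeconds
      if (elapsed <= 0) None
      else {
        // Sum the increase between consecutive samples; a drop is treated as a
        // reset, so the new value itself is counted as the increase.
        val increase = sorted.zip(sorted.tail).map { case (prev, next) =>
          if (next.value >= prev.value) next.value - prev.value else next.value
        }.sum
        Some(increase.toDouble / elapsed)
      }
    }
  }

  /** Requests per minute is the per-second rate scaled by 60. */
  def ratePerMinute(samples: Seq[CounterSample]): Option[Double] =
    ratePerSecond(samples).map(_ * 60.0)
}
```

Computing the rate at query time keeps the instrumented service cheap: it only increments a counter, and the dashboard decides which window to average over.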
