# Health Checks and Observability - Interactive Learning
# فحوصات الصحة والقابلية للملاحظة - تعلم تفاعلي

This notebook covers:
- Understanding health checks
- Types of health checks (liveness, readiness, deep)
- Implementing dependency health checks
- Kubernetes integration

يغطي هذا المفكرة:
- فهم فحوصات الصحة
- أنواع فحوصات الصحة
- تنفيذ فحوصات صحة التبعيات
- التكامل مع Kubernetes

## Part 1: Understanding Health Checks
## الجزء 1: فهم فحوصات الصحة

### What are health checks?
### ما هي فحوصات الصحة؟

Health checks are endpoints that external systems (load balancers, orchestrators such as Kubernetes, monitoring tools) call to verify that your application is functioning correctly.

In [None]:
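# Basic health check simulation
import random


class HealthChecker:
    def __init__(self):
        self._failures = 0      # consecutive failed checks
        self._max_failures = 3  # report "error" once this many fail in a row
    
    def check(self):
        """Basic health check with a simulated 20% failure rate."""
        # Once the failure limit is reached, the checker stays in the error state.
        if self._failures >= self._max_failures:
            return {"status": "error", "message": "Too many failures"}
        
        if random.random() < 0.2:  # 20% chance of failure
            self._failures += 1
            return {"status": "degraded", "failures": self._failures}
        
        # A successful check resets the consecutive-failure counter.
        self._failures = 0
        return {"status": "ok"}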

checker = HealthChecker()
print("Health check simulation:")
for i in range(10):
    result = checker.check()
    print(f"Check {i+1}: {result}")
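The simulation above returns plain dictionaries. To show how such a result might be exposed as an actual endpoint, here is a minimal sketch using only the standard library. The `/health` path, the port, the `health_response` helper, and the choice to answer 503 only for `"error"` are assumptions made for illustration, not something this notebook prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(status: str) -> tuple[int, bytes]:
    """Map a health status string to an HTTP status code and a JSON body."""
    # Assumption: "ok" and "degraded" still answer 200; only "error" becomes 503.
    code = 503 if status == "error" else 200
    body = json.dumps({"status": status}).encode("utf-8")
    return code, body


class HealthHandler(BaseHTTPRequestHandler):
    """Serves GET /health; every other path gets a 404."""

    current_status = "ok"  # hypothetical stand-in for real application state

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_response(self.current_status)
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocks forever; run it from a script or a background thread."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()


# The mapping can be inspected without starting a server.
print(health_response("ok"))
print(health_response("error"))
```

Keeping the status-to-response mapping in a plain function makes it easy to test without opening a socket.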

### Exercise 1: Implement Liveness Check
### تمرين 1: تنفيذ فحص النشاط

Implement a liveness check that:
- Returns quickly (< 100ms)
- Checks minimal state (is app running?)
- Returns simple OK/error

نفذ فحص نشاط:

In [None]:
import time

class ApplicationState:
    def __init__(self):
        self.is_running = True
        self._critical_error = None
    
    def set_error(self, error):
        self._critical_error = error
    
    def clear_error(self):
        self._critical_error = None

app_state = ApplicationState()

def liveness_check() -> dict:
    """
    TODO: Implement liveness check
    - Check if app is running
    - Check for critical errors
    - Return status and message
    """
    start = time.time()
    
    # TODO: Implement check
    # if not app_state.is_running:
    #     return {"status": "error", "message": "Application not running"}
    
    latency = (time.time() - start) * 1000
    return {"status": "ok", "latency_ms": latency}

# Test
print("Liveness check:")
print(liveness_check())

# Simulate error
app_state.set_error("Database connection lost")
print("\nWith error:")
print(liveness_check())

### Solution / الحل

In [None]:
# Solution
def liveness_check() -> dict:
    """Report whether the app is running and free of critical errors."""
    start = time.perf_counter()  # monotonic clock, suited to measuring short intervals
    
    if not app_state.is_running:
        return {"status": "error", "message": "Application not running"}
    
    if app_state._critical_error:
        return {
            "status": "error",
            "message": f"Critical error: {app_state._critical_error}"
        }
    
    latency = (time.perf_counter() - start) * 1000
    return {"status": "ok", "latency_ms": latency}

# Test
print("Liveness check (normal):")
app_state.clear_error()
print(liveness_check())

print("\nLiveness check (with error):")
app_state.set_error("Database connection lost")
print(liveness_check())

## Part 2: Types of Health Checks
## الجزء 2: أنواع فحوصات الصحة

In [None]:
# Simulated dependency checks
class DependencyHealth:
    def __init__(self, name):
        self.name = name
        self._latency = 10
        self._is_healthy = True
    
    def set_latency(self, ms):
        self._latency = ms
    
    def set_healthy(self, healthy):
        self._is_healthy = healthy
    
    def check(self):
        if not self._is_healthy:
            return {"status": "error", "latency_ms": None, "message": f"{self.name} unavailable"}
        
        status = "ok" if self._latency < 100 else "degraded"
        return {
            "status": status,
            "latency_ms": self._latency,
            "message": f"{self.name} connection successful"
        }

# Create dependencies
database = DependencyHealth("PostgreSQL")
redis = DependencyHealth("Redis")
qdrant = DependencyHealth("Qdrant")

def readiness_check() -> dict:
    """Readiness check with dependency checks."""
    checks = {
        "database": database.check(),
        "redis": redis.check(),
        "qdrant": qdrant.check(),
    }
    
    # Ready as long as no dependency is in the error state; "degraded" still counts as ready.
    # Every dependency is treated as critical here.
    ready = all(check["status"] != "error" for check in checks.values())
    
    return {"ready": ready, "checks": checks}

# Test readiness check
print("Readiness check (all healthy):")
result = readiness_check()
print(f"  Ready: {result['ready']}")
for name, check in result['checks'].items():
    print(f"  {name}: {check['status']} ({check['latency_ms']}ms)")

In [None]:
# Test with degraded dependency
print("\nReadiness check (Redis degraded):")
redis.set_latency(120)
result = readiness_check()
print(f"  Ready: {result['ready']}")
for name, check in result['checks'].items():
    print(f"  {name}: {check['status']} ({check['latency_ms']}ms)")

In [None]:
# Test with failed dependency
print("\nReadiness check (Database error):")
database.set_healthy(False)
result = readiness_check()
print(f"  Ready: {result['ready']}")
for name, check in result['checks'].items():
    print(f"  {name}: {check['status']}")
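The readiness check above treats every dependency as critical. A common refinement is to let optional dependencies (a cache, for example) fail without taking the instance out of rotation. The sketch below illustrates that idea; the `evaluate_readiness` helper, the `critical` set, and the sample data are hypothetical. The 200/503 mapping relies on the fact that Kubernetes HTTP probes count any status code from 200 to 399 as success.

```python
def evaluate_readiness(checks: dict, critical: set) -> dict:
    """Ready unless a dependency named in `critical` reports an error."""
    failed_critical = sorted(
        name
        for name, check in checks.items()
        if name in critical and check["status"] == "error"
    )
    ready = not failed_critical
    return {
        "ready": ready,
        "http_status": 200 if ready else 503,  # 503 tells the probe "not ready"
        "failed_critical": failed_critical,
    }


# Hypothetical check results: the cache is down, the vector store is slow.
sample_checks = {
    "database": {"status": "ok"},
    "redis": {"status": "error"},
    "qdrant": {"status": "degraded"},
}

# Redis treated as optional: the instance stays ready.
print(evaluate_readiness(sample_checks, critical={"database", "qdrant"}))

# Redis treated as critical: the instance is reported as not ready.
print(evaluate_readiness(sample_checks, critical={"database", "redis"}))
```

Which dependencies belong in the critical set is an application decision; the split shown here is only an example.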

### Exercise 2: Implement Deep Health Check
### تمرين 2: تنفيذ فحص صحة عميق

Implement a deep health check that:
- Checks all dependencies
- Performs actual operations (not just connection)
- Returns detailed status
- Computes overall status

نفذ فحص صحة عميق:

In [None]:
def deep_health_check() -> dict:
    """
    TODO: Implement deep health check
    - Check all dependencies with actual operations
    - Compute overall status (ok/degraded/error)
    - Return detailed results
    """
    checks = {
        "database": database.check(),
        "redis": redis.check(),
        "qdrant": qdrant.check(),
    }
    
    # TODO: Compute overall status
    # statuses = [check["status"] for check in checks.values()]
    # if "error" in statuses:
    #     overall_status = "error"
    # elif "degraded" in statuses:
    #     overall_status = "degraded"
    # else:
    #     overall_status = "ok"
    
    return {"status": "ok", "checks": checks, "timestamp": time.time()}

# Reset state
database.set_healthy(True)
database.set_latency(10)
redis.set_latency(50)

print("Deep health check:")
result = deep_health_check()
print(f"  Overall: {result['status']}")
for name, check in result['checks'].items():
    print(f"  {name}: {check['status']} ({check['latency_ms']}ms)")

### Solution / الحل

In [None]:
# Solution
def deep_health_check() -> dict:
    """Implement deep health check."""
    checks = {
        "database": database.check(),
        "redis": redis.check(),
        "qdrant": qdrant.check(),
    }
    
    statuses = [check["status"] for check in checks.values()]
    
    if "error" in statuses:
        overall_status = "error"
    elif "degraded" in statuses:
        overall_status = "degraded"
    else:
        overall_status = "ok"
    
    return {
        "status": overall_status,
        "checks": checks,
        "timestamp": time.time()
    }

# Test various states
states = [
    {"name": "All healthy", "db": True, "db_lat": 10, "redis": 50, "qdrant": 20},
    {"name": "One degraded", "db": True, "db_lat": 10, "redis": 150, "qdrant": 20},
    {"name": "One error", "db": False, "db_lat": 10, "redis": 50, "qdrant": 20},
]

for state in states:
    print(f"\n{state['name']}:")
    database.set_healthy(state["db"])
    database.set_latency(state["db_lat"])
    redis.set_latency(state["redis"])
    qdrant.set_latency(state["qdrant"])
    
    result = deep_health_check()
    print(f"  Overall: {result['status']}")
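    for name, check in result['checks'].items():
        # latency_ms is None when a dependency is unavailable
        latency = "n/a" if check['latency_ms'] is None else f"{check['latency_ms']}ms"
        print(f"    {name}: {check['status']} ({latency})")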

## Part 3: Dependency Health Checks
## الجزء 3: فحوصات صحة التبعيات

In [None]:
# Simulating PostgreSQL health check
import time  # used to simulate and measure query latency

class PostgreSQLChecker:
    def __init__(self):
        self._latency = 50
        self._is_connected = True
    
    def set_latency(self, ms):
        self._latency = ms
    
    def disconnect(self):
        self._is_connected = False
    
    def check(self):
        try:
            if not self._is_connected:
                raise ConnectionError("Connection refused")
            
            # Simulate running a SELECT 1 query and time the round trip
            start = time.time()
            time.sleep(self._latency / 1000)
            latency_ms = (time.time() - start) * 1000
            
            status = "ok" if latency_ms < 100 else "degraded"
            return {
                "status": status,
                "latency_ms": round(latency_ms, 2),
                "message": "Database connection successful"
            }
        except Exception as e:
            return {
                "status": "error",
                "latency_ms": None,
                "message": f"Database connection failed: {str(e)}"
            }

pg_checker = PostgreSQLChecker()

print("PostgreSQL health check:")
print(pg_checker.check())
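The checker above only sleeps to imitate a query. As a sketch of what performing a real operation can look like, the following runs an actual `SELECT 1` against an in-memory SQLite database, used here purely as a stand-in for PostgreSQL so the example stays self-contained. The `check_sql_database` name and the 100 ms threshold are assumptions; `conn.execute` is a `sqlite3` convenience, and other drivers typically go through a cursor instead.

```python
import sqlite3
import time


def check_sql_database(conn, degraded_ms: float = 100.0) -> dict:
    """Run SELECT 1 and classify the round trip by latency."""
    start = time.perf_counter()
    try:
        row = conn.execute("SELECT 1").fetchone()
        if row is None or row[0] != 1:
            raise RuntimeError("unexpected result from SELECT 1")
    except Exception as exc:
        return {
            "status": "error",
            "latency_ms": None,
            "message": f"Database check failed: {exc}",
        }
    latency_ms = (time.perf_counter() - start) * 1000
    status = "ok" if latency_ms < degraded_ms else "degraded"
    return {
        "status": status,
        "latency_ms": round(latency_ms, 2),
        "message": "Database query successful",
    }


conn = sqlite3.connect(":memory:")
healthy_result = check_sql_database(conn)
print(healthy_result)

# A closed connection makes the query raise, which the check reports as an error.
conn.close()
failed_result = check_sql_database(conn)
print(failed_result)
```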

In [None]:
# Simulating Redis health check
class RedisChecker:
    def __init__(self):
        self._latency = 20
        self._is_connected = True
    
    def check(self):
        try:
            if not self._is_connected:
                raise ConnectionError("Redis not available")
            
            start = time.time()
            # Simulate PING command
            time.sleep(self._latency / 1000)
            
            latency_ms = (time.time() - start) * 1000
            
            status = "ok" if latency_ms < 50 else "degraded"
            return {
                "status": status,
                "latency_ms": round(latency_ms, 2),
                "message": "Redis connection successful"
            }
        except Exception as e:
            return {
                "status": "error",
                "latency_ms": None,
                "message": f"Redis connection failed: {str(e)}"
            }

redis_checker = RedisChecker()
print("Redis health check:")
print(redis_checker.check())

In [None]:
# Simulating Qdrant health check
class QdrantChecker:
    def __init__(self):
        self._latency = 30
        self._is_connected = True
        self._has_collections = True
    
    def check(self):
        try:
            if not self._is_connected:
                raise ConnectionError("Qdrant not available")
            
            start = time.time()
            # Simulate get_collections()
            time.sleep(self._latency / 1000)
            
            if not self._has_collections:
                raise Exception("No collections available")
            
            latency_ms = (time.time() - start) * 1000
            
            status = "ok" if latency_ms < 100 else "degraded"
            return {
                "status": status,
                "latency_ms": round(latency_ms, 2),
                "message": "Qdrant connection successful"
            }
        except Exception as e:
            return {
                "status": "error",
                "latency_ms": None,
                "message": f"Qdrant connection failed: {str(e)}"
            }

qdrant_checker = QdrantChecker()
print("Qdrant health check:")
print(qdrant_checker.check())

### Exercise 3: Implement LLM Health Check
### تمرين 3: تنفيذ فحص صحة LLM

Implement an LLM health check that:
- Performs actual generation (not just connection)
- Uses short prompt and max_tokens
- Times out after 5 seconds
- Returns status, latency, and backend info

نفذ فحص صحة LLM:

In [None]:
class LLMChecker:
    def __init__(self, backend="openai"):
        self.backend = backend
        self._latency = 500  # LLMs are slower
        self._is_available = True
    
    def check(self):
        """
        TODO: Implement LLM health check
        - Perform simple generation test
        - Use short prompt and max_tokens
        - Set timeout
        - Return status with backend info
        """
        try:
            if not self._is_available:
                raise ConnectionError("LLM API unavailable")
            
            start = time.time()
            # TODO: Simulate generation
            # result = self._llm.generate("Test", max_tokens=5, timeout=5)
            
            latency_ms = (time.time() - start) * 1000
            
            status = "ok" if latency_ms < 2000 else "degraded"
            return {
                "status": status,
                "latency_ms": round(latency_ms, 2),
                "message": f"LLM connection successful ({self.backend})"
            }
        except Exception as e:
            return {
                "status": "error",
                "latency_ms": None,
                "message": f"LLM connection failed ({self.backend}): {str(e)}"
            }

llm_checker = LLMChecker(backend="openai")
print("LLM health check:")
print(llm_checker.check())
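One possible way to approach this exercise is sketched below. It does not call a real LLM API: `fake_generate` is a hypothetical stand-in that sleeps and returns a short string, and `check_llm` is an illustrative name. The timeout is enforced by waiting on a worker thread with `concurrent.futures`; note that this stops waiting for the call but does not cancel it.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def fake_generate(prompt: str, max_tokens: int, delay_s: float) -> str:
    """Hypothetical stand-in for an LLM client call: sleeps, then answers."""
    time.sleep(delay_s)
    return "pong"


def check_llm(backend: str, delay_s: float, timeout_s: float = 5.0,
              degraded_ms: float = 2000.0) -> dict:
    """Run a tiny generation and report status, latency, and backend."""
    executor = ThreadPoolExecutor(max_workers=1)
    start = time.perf_counter()
    try:
        # Short prompt and small max_tokens keep the check cheap.
        future = executor.submit(fake_generate, "ping", 5, delay_s)
        text = future.result(timeout=timeout_s)
        if not text:
            raise RuntimeError("empty response")
    except FutureTimeout:
        return {
            "status": "error",
            "latency_ms": None,
            "message": f"LLM check timed out after {timeout_s}s ({backend})",
        }
    except Exception as exc:
        return {
            "status": "error",
            "latency_ms": None,
            "message": f"LLM check failed ({backend}): {exc}",
        }
    finally:
        # Do not block on a call that is still running in the worker thread.
        executor.shutdown(wait=False)
    latency_ms = (time.perf_counter() - start) * 1000
    status = "ok" if latency_ms < degraded_ms else "degraded"
    return {
        "status": status,
        "latency_ms": round(latency_ms, 2),
        "message": f"LLM generation successful ({backend})",
    }


fast_result = check_llm("openai", delay_s=0.01)
print(fast_result)

# A short timeout is used here so the demonstration finishes quickly.
slow_result = check_llm("openai", delay_s=0.3, timeout_s=0.05)
print(slow_result)
```

With a real client, a timeout supported by the client itself is usually preferable, since it can abort the underlying request rather than just abandon it.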

## Part 4: Kubernetes Integration
## الجزء 4: التكامل مع Kubernetes

In [None]:
# Simulating Kubernetes probe behavior
class KubernetesProbe:
    def __init__(self, name, check_func):
        self.name = name
        self.check_func = check_func
        self._failures = 0
        self._successes = 0
        self._failure_threshold = 3
        self._success_threshold = 1
    
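    def execute(self):
        """Execute probe and track consecutive failures/successes."""
        result = self.check_func()
        
        # Liveness results carry a "status" key; readiness results carry "ready".
        failed = result.get("status") == "error" or result.get("ready") is False
        if failed:
            self._failures += 1
            self._successes = 0
        else:
            self._successes += 1
            self._failures = 0
        
        return result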
    
    def should_restart(self):
        """Check if container should be restarted."""
        return self._failures >= self._failure_threshold
    
    def is_ready(self):
        """Check if container is ready."""
        return self._successes >= self._success_threshold

# Create probes
liveness_probe = KubernetesProbe("liveness", lambda: liveness_check())
readiness_probe = KubernetesProbe("readiness", lambda: readiness_check())

print("Simulating Kubernetes probes:\n")

# Simulate probe execution
for i in range(10):
    live_result = liveness_probe.execute()
    ready_result = readiness_probe.execute()
    
    print(f"Cycle {i+1}:")
    print(f"  Liveness: {live_result['status']} (failures: {liveness_probe._failures})")
    print(f"  Readiness: {ready_result['ready']} (failures: {readiness_probe._failures})")
    
    if liveness_probe.should_restart():
        print("  ⚠️  Container would be restarted!")
        break
    
    if not readiness_probe.is_ready():
        print("  ⚠️  Container not ready for traffic!")
    
    time.sleep(0.1)
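The failure threshold in the simulation interacts with how often a probe runs. As a rough sketch of that arithmetic, the class below models a subset of the Kubernetes probe fields with their default values and estimates an upper bound on how long a fault can go unnoticed. The `ProbeSettings` class and its method are hypothetical helpers, and the estimate ignores kubelet scheduling jitter and the time the restart itself takes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    """Subset of Kubernetes probe fields, using the default values."""

    initial_delay_seconds: int = 0
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3
    success_threshold: int = 1

    def max_seconds_to_trip(self) -> int:
        """Rough upper bound from fault onset to the threshold being reached.

        Worst case: the fault begins right after a successful probe, so
        `failure_threshold` full periods pass, and the final probe may take
        up to `timeout_seconds` to be declared failed.
        """
        return self.failure_threshold * self.period_seconds + self.timeout_seconds


defaults = ProbeSettings()
print(f"defaults: up to ~{defaults.max_seconds_to_trip()}s before a restart is triggered")

aggressive = ProbeSettings(period_seconds=5, failure_threshold=2)
print(f"aggressive: up to ~{aggressive.max_seconds_to_trip()}s before a restart is triggered")
```

Shorter periods and lower thresholds detect faults sooner but also make restarts on transient slowness more likely.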

## Part 5: Best Practices Quiz
## الجزء 5: اختبار أفضل الممارسات

### Quiz Questions / أسئلة الاختبار

**Q1:** What's the difference between liveness and readiness probes?
a) They're the same
b) Liveness checks if container is running, readiness checks if it can handle traffic
c) Readiness is faster than liveness
d) Liveness is used for scaling

**Q2:** Why should health checks be fast?
a) To save money
b) Frequent checking shouldn't overwhelm dependencies
c) They're only called once
d) They don't need to be fast

**Q3:** What's a good latency threshold for a liveness check?
a) 10ms
b) 100ms
c) 1 second
d) 10 seconds

### أسئلة الاختبار

**س1:** ما الفرق بين فحوصات النشاط والجاهزية؟
أ) متطابقة
ب) فحص النشاط يتحقق من تشغيل الحاوية، فحص الجاهزية يتحقق من القدرة على التعامل مع الحركة
ج) فحص الجاهزية أسرع من فحص النشاط
د) فحص النشاط يستخدم للتوسيع

In [None]:
# Answer check
quiz_answers = {
    "Q1": "b",
    "Q2": "b",
    "Q3": "b"
}

for q, answer in quiz_answers.items():
    print(f"{q}: {answer}")
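The concern behind Q2, that frequent probing should not overwhelm dependencies, can be addressed by caching check results for a short time. The sketch below is one way to do that; the `CachedCheck` class, the 5-second TTL, and the injectable `clock` parameter (there so the behavior can be shown without waiting) are all assumptions for illustration.

```python
import time


class CachedCheck:
    """Wrap a check function and reuse its result for `ttl_seconds`."""

    def __init__(self, check_func, ttl_seconds: float = 5.0, clock=time.monotonic):
        self._check_func = check_func
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._expires_at = None

    def __call__(self) -> dict:
        now = self._clock()
        # Run the real check only on first use or after the cached result expires.
        if self._expires_at is None or now >= self._expires_at:
            self._cached = self._check_func()
            self._expires_at = now + self._ttl
        return self._cached


calls = {"count": 0}


def expensive_check() -> dict:
    """Hypothetical check that counts how often it really runs."""
    calls["count"] += 1
    return {"status": "ok", "run": calls["count"]}


fake_now = [0.0]  # manually advanced clock for the demonstration
cached = CachedCheck(expensive_check, ttl_seconds=5.0, clock=lambda: fake_now[0])

first = cached()
second = cached()   # served from the cache
fake_now[0] = 6.0   # move past the TTL
third = cached()    # triggers a fresh check

print(first, second, third)
print("underlying check ran", calls["count"], "times")
```

The trade-off is staleness: a cached "ok" can hide a failure for up to one TTL, so the TTL should stay small relative to the probe period.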

## Part 6: Health Check Visualization
## الجزء 6: تصور فحوصات الصحة

In [None]:
# Visualizing health check history
import matplotlib.pyplot as plt
import numpy as np

# Simulate health check metrics over time
time_points = np.arange(0, 60, 1)  # 60 seconds

# Simulate varying latencies
db_latency = 50 + np.random.normal(10, 5, len(time_points))
redis_latency = 20 + np.random.normal(5, 2, len(time_points))
qdrant_latency = 30 + np.random.normal(8, 3, len(time_points))

# Add some spikes (degraded periods)
db_latency[30:35] += 150
redis_latency[45:48] += 100

# Create plot
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(time_points, db_latency, label='PostgreSQL', color='blue')
ax.plot(time_points, redis_latency, label='Redis', color='green')
ax.plot(time_points, qdrant_latency, label='Qdrant', color='orange')

# Add threshold lines
ax.axhline(y=100, color='red', linestyle='--', label='Degraded threshold')

ax.set_xlabel('Time (seconds)')
ax.set_ylabel('Latency (ms)')
ax.set_title('Health Check Latency Over Time')
ax.legend()
ax.grid(True)

plt.tight_layout()
plt.show()

# Analysis
print("\nHealth Check Analysis:")
print(f"PostgreSQL avg latency: {np.mean(db_latency):.2f}ms")
print(f"Redis avg latency: {np.mean(redis_latency):.2f}ms")
print(f"Qdrant avg latency: {np.mean(qdrant_latency):.2f}ms")
print(f"\nDegraded periods:")
print(f"  PostgreSQL: {sum(db_latency > 100)} seconds")
print(f"  Redis: {sum(redis_latency > 100)} seconds")
print(f"  Qdrant: {sum(qdrant_latency > 100)} seconds")
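The analysis above counts how many samples exceed the threshold, but not when the degraded periods occurred. A small helper can recover the contiguous intervals; the `degraded_intervals` function and the sample data below are hypothetical, and the same function also accepts a NumPy array such as `db_latency`, since it only iterates over the values.

```python
def degraded_intervals(latencies, threshold_ms: float = 100.0) -> list:
    """Return (start, end) index pairs, end exclusive, for runs above the threshold."""
    intervals = []
    run_start = None
    for i, value in enumerate(latencies):
        if value > threshold_ms:
            if run_start is None:
                run_start = i  # a new degraded run begins here
        elif run_start is not None:
            intervals.append((run_start, i))
            run_start = None
    # Close a run that lasts until the final sample.
    if run_start is not None:
        intervals.append((run_start, len(latencies)))
    return intervals


# Hypothetical latency samples in milliseconds, one per second.
sample = [60, 62, 210, 215, 58, 61, 130, 59, 140]
intervals = degraded_intervals(sample)
print(intervals)
```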

## Summary / الملخص

**Key concepts covered / المفاهيم الرئيسية المشمولة:**

1. **Health checks** verify application and dependency status
2. **Types of checks**:
   - Liveness: Is container alive?
   - Readiness: Can it handle traffic?
   - Deep: Are all dependencies healthy?
3. **Dependency checks**: PostgreSQL, Redis, Qdrant, LLM
4. **Kubernetes integration**: Probes for restarts and traffic routing
5. **Best practices**: Fast responses, appropriate thresholds, graceful degradation

**النقاط الرئيسية المشمولة:**

1. **فحوصات الصحة** تتحقق من حالة التطبيق والتبعيات
2. **أنواع الفحوصات**:
   - النشاط: هل الحاوية حية؟
   - الجاهزية: هل يمكنها التعامل مع الحركة؟
   - عميق: هل جميع التبعيات صحية؟
3. **فحوصات التبعية**: PostgreSQL و Redis و Qdrant و LLM
4. **التكامل مع Kubernetes**: فحوصات لإعادة التشغيل وتوجيه الحركة
5. **أفضل الممارسات**: استجابات سريعة، عتبات مناسبة، تدهور مشرف