# End-to-End Application Troubleshooting via MCP Server

## Overview

This notebook demonstrates a complete application troubleshooting workflow using the MCP server as the unified interface. It simulates a real-world scenario where an application experiences performance degradation and shows how to:

1. **Detect** the issue using MCP cluster health tools
2. **Investigate** root causes using MCP pod analysis tools
3. **Analyze** anomalies using KServe ML models (via MCP)
4. **Trigger** automated remediation via Coordination Engine (via MCP)
5. **Verify** the fix using MCP monitoring tools

## Scenario: Debugging a Degraded Sample Application

**Problem**: We will deploy a sample application with intentional configuration issues (memory limits too low), then use the MCP server to detect, diagnose, and fix the problems.

**Sample App**: A simple Python Flask web application with deliberately misconfigured resource limits to demonstrate the troubleshooting workflow.

**Goal**: Use the MCP server to diagnose the issue, identify the root cause, and automatically remediate it through the Coordination Engine.

## Why This Notebook Matters

Unlike the notebooks in `05-end-to-end-scenarios/`, which use direct API calls, this notebook demonstrates how AI assistants (such as OpenShift Lightspeed) interact with the platform through the MCP server interface. This is the recommended pattern for:
- External AI assistants
- Natural language operations
- Unified observability and remediation

## Prerequisites

- Completed: All Phase 1-5 notebooks
- MCP server deployed at `cluster-health-mcp-openshift-cluster-health-mcp:8080`
- Coordination Engine running
- KServe models deployed
- OpenShift cluster with monitoring enabled

## Learning Objectives

- Use the MCP server as a single pane of glass for cluster operations
- Query cluster health and pod status via MCP tools
- Analyze anomalies using ML models through MCP
- Trigger automated remediation through Coordination Engine via MCP
- Verify remediation success using MCP resources
- Understand MCP tool integration patterns for AI assistants

## Key Concepts

- **MCP Protocol**: Model Context Protocol for tool integration
- **Unified Interface**: Single API for all platform interactions
- **Tool-Based Operations**: Discrete, composable operations
- **Resource Queries**: Cached cluster state information
- **AI Integration**: How assistants use MCP tools
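
To make the distinction between tool calls and resource queries concrete, here is a minimal sketch of how the two kinds of request map to HTTP endpoints. It assumes the REST-style paths that the client in this notebook uses (`/mcp/tools/{tool}/call` and `/mcp/resources/...`). The helper names `tool_call_url` and `resource_url` and the host `mcp:8080` are hypothetical and exist only for illustration.

```python
def tool_call_url(server_url: str, tool_name: str) -> str:
    """Build the endpoint for invoking a tool (sent as a POST with JSON arguments)."""
    return f"{server_url.rstrip('/')}/mcp/tools/{tool_name}/call"


def resource_url(server_url: str, resource_uri: str) -> str:
    """Build the endpoint for reading a resource (sent as a GET).

    A URI such as 'cluster://health' becomes the path 'cluster/health'.
    """
    resource_path = resource_uri.replace("://", "/")
    return f"{server_url.rstrip('/')}/mcp/resources/{resource_path}"


# Hypothetical server address, for illustration only
print(tool_call_url("http://mcp:8080", "get-cluster-health"))
print(resource_url("http://mcp:8080", "cluster://health"))
```

Tools take arguments and may have side effects (for example, triggering remediation), while resources are read-only views of cluster state.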


## Setup Section


In [None]:
import sys
import os
import json
import logging
import time
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import requests
from typing import Dict, List, Any, Optional
import matplotlib.pyplot as plt
import seaborn as sns

# Setup path for utils module - works from any directory
def find_utils_path():
    """Find utils path regardless of current working directory"""
    possible_paths = [
        Path(__file__).parent.parent / 'utils' if '__file__' in globals() else None,
        Path.cwd() / 'notebooks' / 'utils',
        Path.cwd().parent / 'utils',
        Path('/workspace/repo/notebooks/utils'),
        Path('/opt/app-root/src/notebooks/utils'),
        Path('/opt/app-root/src/openshift-aiops-platform/notebooks/utils'),
    ]
    for p in possible_paths:
        if p and p.exists() and (p / 'common_functions.py').exists():
            return str(p)
    current = Path.cwd()
    for _ in range(5):
        utils_path = current / 'notebooks' / 'utils'
        if utils_path.exists():
            return str(utils_path)
        current = current.parent
    return None

utils_path = find_utils_path()
if utils_path:
    sys.path.insert(0, utils_path)
    print(f"✅ Utils path found: {utils_path}")
else:
    print("⚠️ Utils path not found - will use fallback implementations")

# Try to import common functions, with fallback
try:
    from common_functions import setup_environment
    print("✅ Common functions imported")
except ImportError as e:
    print(f"⚠️ Common functions not available: {e}")
    def setup_environment():
        os.makedirs('/opt/app-root/src/data/processed', exist_ok=True)
        os.makedirs('/opt/app-root/src/models', exist_ok=True)
        os.makedirs('/opt/app-root/src/outputs', exist_ok=True)
        return {'data_dir': '/opt/app-root/src/data', 'models_dir': '/opt/app-root/src/models'}

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Setup environment
env_info = setup_environment()
logger.info(f"Environment ready: {env_info}")

# Define paths
DATA_DIR = Path('/opt/app-root/src/data')
PROCESSED_DIR = DATA_DIR / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
OUTPUTS_DIR = Path('/opt/app-root/src/outputs')
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

# Configuration
MCP_SERVER_URL = os.getenv('MCP_SERVER_URL', 'http://cluster-health-mcp-openshift-cluster-health-mcp:8080')
NAMESPACE = 'self-healing-platform'
REQUEST_TIMEOUT = 30

# Configure matplotlib for inline plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

logger.info("End-to-end troubleshooting workflow initialized")
logger.info(f"MCP Server URL: {MCP_SERVER_URL}")
logger.info(f"Target Namespace: {NAMESPACE}")

print("=" * 80)
print("🔧 END-TO-END TROUBLESHOOTING WORKFLOW")
print("=" * 80)
print(f"MCP Server: {MCP_SERVER_URL}")
print(f"Namespace: {NAMESPACE}")
print(f"Timeout: {REQUEST_TIMEOUT}s")
print("=" * 80)


### MCP Client Implementation

This client wraps all MCP server interactions with:
- Connection management and health checking
- Error handling (failures are logged and returned as error dictionaries)
- Response parsing and logging
- Support for all 6 MCP tools and 3 resources
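
The `MCPClient` defined in the next cell catches exceptions and returns error dictionaries, but it issues each request only once. Where retries on transient failures are wanted, a wrapper along the following lines is one option. This is a minimal sketch using only the standard library. The helper name `call_with_retries` and its parameters are hypothetical and not part of the MCP server or this repository.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying on any exception with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2*base_delay, ...
    The last exception is re-raised once all attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")

    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose: this is a generic sketch
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    assert last_error is not None
    raise last_error
```

Because the client methods in this notebook return an error dictionary instead of raising, a caller would wrap a function that raises on failure, such as a direct `requests` call followed by `raise_for_status()`. Retrying is safest for read-only operations. Repeating a remediation request could apply the same action twice.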


In [None]:
class MCPClient:
    """
    Client for MCP server communication with comprehensive error handling.
    
    This client provides a Python interface to all MCP server tools and resources.
    """
    
    def __init__(self, server_url: str, timeout: int = 30):
        self.server_url = server_url.rstrip('/')
        self.timeout = timeout
        self.session = requests.Session()
        self.connected = False
        logger.info(f"Initialized MCP client for {self.server_url}")
    
    def connect(self) -> bool:
        """Test connection to MCP server"""
        try:
            response = self.session.get(
                f"{self.server_url}/health",
                timeout=self.timeout
            )
            self.connected = response.status_code == 200
            if self.connected:
                logger.info("✅ Connected to MCP server")
            else:
                logger.error(f"❌ MCP server returned status {response.status_code}")
            return self.connected
        except Exception as e:
            logger.error(f"❌ Connection failed: {e}")
            self.connected = False
            return False
    
    def list_tools(self) -> Dict[str, Any]:
        """List available MCP tools"""
        try:
            response = self.session.get(
                f"{self.server_url}/mcp/tools",
                timeout=self.timeout
            )
            response.raise_for_status()
            result = response.json()
            logger.info(f"Retrieved {result.get('count', 0)} MCP tools")
            return result
        except Exception as e:
            logger.error(f"Failed to list tools: {e}")
            return {'error': str(e)}
    
    def call_tool(self, tool_name: str, arguments: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """
        Call an MCP tool with arguments
        
        Args:
            tool_name: Name of the tool (e.g., 'get-cluster-health')
            arguments: Tool-specific arguments
            
        Returns:
            Tool response as dictionary
        """
        if arguments is None:
            arguments = {}
            
        try:
            # MCP server uses /mcp/tools/{tool-name}/call endpoint
            url = f"{self.server_url}/mcp/tools/{tool_name}/call"
            response = self.session.post(
                url,
                json=arguments,
                timeout=self.timeout,
                headers={'Content-Type': 'application/json'}
            )
            response.raise_for_status()
            result = response.json()
            
            if result.get('success', True):
                logger.info(f"✅ Tool '{tool_name}' executed successfully")
            else:
                logger.warning(f"⚠️ Tool '{tool_name}' returned error")

            return result
        except Exception as e:
            logger.error(f"❌ Tool '{tool_name}' failed: {e}")
            return {'error': str(e), 'success': False}
    
    def get_resource(self, resource_uri: str) -> Dict[str, Any]:
        """
        Get an MCP resource
        
        Args:
            resource_uri: Resource URI (e.g., 'cluster://health')
            
        Returns:
            Resource data as dictionary
        """
        try:
            # Convert resource URI to endpoint path
            # cluster://health -> /mcp/resources/cluster/health
            resource_path = resource_uri.replace('://', '/')
            url = f"{self.server_url}/mcp/resources/{resource_path}"
            
            response = self.session.get(
                url,
                timeout=self.timeout
            )
            response.raise_for_status()
            result = response.json()
            logger.info(f"✅ Retrieved resource '{resource_uri}'")
            return result
        except Exception as e:
            logger.error(f"❌ Resource '{resource_uri}' failed: {e}")
            return {'error': str(e)}
    
    def get_cluster_health(self, include_details: bool = True) -> Dict[str, Any]:
        """Get cluster health summary"""
        return self.call_tool('get-cluster-health', {'include_details': include_details})
    
    def list_pods(self, namespace: Optional[str] = None, label_selector: str = "",
                  field_selector: str = "", limit: int = 100) -> Dict[str, Any]:
        """List pods with filtering options"""
        args = {
            'limit': limit,
            'label_selector': label_selector,
            'field_selector': field_selector
        }
        if namespace:
            args['namespace'] = namespace
        return self.call_tool('list-pods', args)
    
    def list_incidents(self, status: str = "all", severity: str = "all") -> Dict[str, Any]:
        """List incidents from Coordination Engine"""
        return self.call_tool('list-incidents', {
            'status': status,
            'severity': severity
        })
    
    def trigger_remediation(self, action: str, target: Dict[str, Any],
                           dry_run: bool = False, priority: str = "medium") -> Dict[str, Any]:
        """Trigger automated remediation action"""
        return self.call_tool('trigger-remediation', {
            'action': action,
            'target': target,
            'dry_run': dry_run,
            'priority': priority
        })
    
    def analyze_anomalies(self, metric: str, namespace: Optional[str] = None,
                          time_range: str = "1h", threshold: float = 0.7,
                          model_name: str = "predictive-analytics") -> Dict[str, Any]:
        """Analyze metrics for anomalies using KServe models"""
        args = {
            'metric': metric,
            'time_range': time_range,
            'threshold': threshold,
            'model_name': model_name
        }
        if namespace:
            args['namespace'] = namespace
        return self.call_tool('analyze-anomalies', args)
    
    def get_model_status(self, model_name: str, include_endpoints: bool = True) -> Dict[str, Any]:
        """Get KServe model status"""
        return self.call_tool('get-model-status', {
            'model_name': model_name,
            'include_endpoints': include_endpoints
        })

# Initialize MCP client
mcp_client = MCPClient(MCP_SERVER_URL, timeout=REQUEST_TIMEOUT)

# Test connection
print("\n🔌 Connecting to MCP server...")
if mcp_client.connect():
    print("✅ MCP server is healthy and ready")

    # List available tools
    tools_info = mcp_client.list_tools()
    if 'count' in tools_info:
        print(f"\n📋 Available MCP Tools: {tools_info['count']}")
        for tool in tools_info.get('tools', []):
            print(f"  • {tool['name']}: {tool['description'][:80]}...")
else:
    print("❌ Failed to connect to MCP server")
    print("⚠️ Please check that the MCP server is deployed and accessible")


## Step 0: Deploy Sample Problematic Application

Before we troubleshoot, let's deploy a sample application with intentional issues to demonstrate the MCP server's troubleshooting capabilities.

**Sample App**: A Python Flask web server with memory limits set too low (32Mi), which will cause OOMKilled errors and restart loops.

This creates a realistic scenario for the troubleshooting workflow.
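
As a rough check on why a 32Mi limit is too low, the arithmetic can be sketched as follows. The demo app's `/` handler appends a 10MB string on each request, so only a few requests fit under the limit even before counting the memory that the Python interpreter and Flask use themselves. The helpers `parse_memory_quantity` and `chunks_until_limit` are hypothetical and handle only the binary suffixes `Ki`, `Mi`, and `Gi`, not the full Kubernetes quantity syntax.

```python
# Binary suffixes used by Kubernetes memory quantities (subset, for illustration)
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '32Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def chunks_until_limit(limit: str, chunk_bytes: int) -> int:
    """Whole chunks that fit under the limit, ignoring all other memory use."""
    return parse_memory_quantity(limit) // chunk_bytes


chunk_bytes = 10 * 1024 * 1024  # size of one allocation in the demo app
print(parse_memory_quantity("32Mi"))            # 33554432
print(chunks_until_limit("32Mi", chunk_bytes))  # 3
```

This is an upper bound. In practice the container may be OOMKilled earlier, possibly during `pip install` or Flask startup, because the baseline process memory also counts against the limit.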


In [None]:
# Sample application YAML with intentionally problematic configuration
sample_app_yaml = """
apiVersion: apps/v1
kind: Deployment
metadata:
  name: troubleshoot-demo-app
  namespace: {namespace}
  labels:
    app: troubleshoot-demo
    demo: "true"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: troubleshoot-demo
  template:
    metadata:
      labels:
        app: troubleshoot-demo
        demo: "true"
    spec:
      containers:
      - name: web
        image: python:3.11-slim
        command:
          - "/bin/bash"
          - "-c"
          - |
            pip install --quiet flask && python -c '
            from flask import Flask
            import os
            app = Flask(__name__)
            # Allocate memory to trigger OOM
            data = []
            @app.route("/")
            def hello():
                # Intentionally consume memory
                data.append("x" * 1024 * 1024 * 10)  # 10MB chunks
                return f"Hello! Allocated {{len(data)}} chunks"
            @app.route("/health")
            def health():
                return "OK"
            app.run(host="0.0.0.0", port=8080)
            '
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            memory: "32Mi"   # Intentionally too low - will cause OOMKilled
            cpu: "100m"
          requests:
            memory: "16Mi"
            cpu: "50m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 3
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: troubleshoot-demo-app
  namespace: {namespace}
  labels:
    app: troubleshoot-demo
spec:
  selector:
    app: troubleshoot-demo
  ports:
  - port: 8080
    targetPort: 8080
    name: http
  type: ClusterIP
"""

print("\n" + "="*80)
print("🚀 STEP 0: DEPLOY - Sample Problematic Application")
print("="*80)

# Write the YAML to a temporary file
sample_app_file = OUTPUTS_DIR / 'troubleshoot-demo-app.yaml'
with open(sample_app_file, 'w') as f:
    f.write(sample_app_yaml.format(namespace=NAMESPACE))

print(f"\n📝 Created deployment manifest: {sample_app_file}")
print("\n🔧 Deploying sample application with intentional issues:")
print("   • Memory limit: 32Mi (too low for Flask app)")
print("   • Expected behavior: OOMKilled and restart loops")
print("   • Replicas: 2")

# Deploy using kubectl/oc
try:
    import subprocess
    result = subprocess.run(
        ['oc', 'apply', '-f', str(sample_app_file)],
        capture_output=True,
        text=True,
        timeout=30
    )
    
    if result.returncode == 0:
        print(f"\n✅ Application deployed successfully:")
        for line in result.stdout.strip().split('\n'):
            print(f"   {line}")

        print("\n⏳ Waiting 30 seconds for pods to start failing...")
        time.sleep(30)

        print("\n✅ Sample application is now running (and failing)")
        print("   Ready for troubleshooting workflow!")

        deployment_successful = True
    else:
        print(f"\n⚠️  Deployment command returned non-zero exit code:")
        print(f"   {result.stderr}")
        print("\n💡 This is expected if 'oc' is not available")
        print("   Continuing with simulation mode...")
        deployment_successful = False

except Exception as e:
    print(f"\n⚠️  Could not deploy via kubectl/oc: {e}")
    print("💡 This is expected when running outside the cluster")
    print("   The troubleshooting workflow will work with existing pods")
    deployment_successful = False

print("\n" + "="*80)
print(f"Deployment Status: {'✅ Real deployment' if deployment_successful else '⚠️  Simulated (using existing pods)'}")
print("="*80)


## Step 1: Detect Application Issues

Use MCP tools to detect cluster health problems and identify failing pods.
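
The detection rule used in this step can be expressed as a small predicate, which makes it easy to check against sample data before pointing it at a live cluster. The sketch below assumes the field names (`restarts`, `status`, `phase`) that this notebook reads from the `list-pods` response. The pod dictionaries and the function name `is_problematic` are hypothetical examples, not real cluster output.

```python
from typing import Any, Dict

PROBLEM_STATUSES = {"CrashLoopBackOff", "Error", "Failed", "OOMKilled"}
PROBLEM_PHASES = {"Failed", "Unknown"}


def is_problematic(pod: Dict[str, Any], max_restarts: int = 2) -> bool:
    """Flag a pod with many restarts, a failing status, or a bad phase."""
    return (
        pod.get("restarts", 0) > max_restarts
        or pod.get("status", "") in PROBLEM_STATUSES
        or pod.get("phase", "") in PROBLEM_PHASES
    )


# Hypothetical pods, shaped like the entries this notebook expects
sample_pods = [
    {"name": "demo-a", "restarts": 0, "status": "Running", "phase": "Running"},
    {"name": "demo-b", "restarts": 4, "status": "Running", "phase": "Running"},
    {"name": "demo-c", "restarts": 1, "status": "OOMKilled", "phase": "Running"},
]

flagged = [pod["name"] for pod in sample_pods if is_problematic(pod)]
print(flagged)  # ['demo-b', 'demo-c']
```

Using `.get()` with defaults keeps the check from raising `KeyError` if the server omits a field.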


In [None]:
# Step 1a: Get overall cluster health
print("\n" + "="*80)
print("🏥 STEP 1: DETECT - Cluster Health Check")
print("="*80)

cluster_health = mcp_client.get_cluster_health(include_details=True)

if 'error' not in cluster_health:
    result = cluster_health.get('result', {})
    print(f"\n📊 Cluster Status: {result.get('status', 'unknown').upper()}")
    print(f"⏰ Timestamp: {result.get('timestamp', 'N/A')}")

    # Display node statistics
    nodes = result.get('nodes', {})
    print(f"\n🖥️  Nodes:")
    print(f"  • Total: {nodes.get('total', 0)}")
    print(f"  • Ready: {nodes.get('ready', 0)}")
    print(f"  • Not Ready: {nodes.get('not_ready', 0)}")

    # Display pod statistics
    pods = result.get('pods', {})
    print(f"\n📦 Pods:")
    print(f"  • Total: {pods.get('total', 0)}")
    print(f"  • Running: {pods.get('running', 0)}")
    print(f"  • Pending: {pods.get('pending', 0)}")
    print(f"  • Failed: {pods.get('failed', 0)}")
    print(f"  • Succeeded: {pods.get('succeeded', 0)}")

    # Display warnings
    warnings = result.get('warnings', [])
    if warnings:
        print(f"\n⚠️  Warnings ({len(warnings)}):")
        for warning in warnings:
            print(f"  • {warning}")

    # Store for later comparison
    initial_health = result
else:
    print(f"❌ Error getting cluster health: {cluster_health.get('error')}")
    initial_health = {}

# Step 1b: List problematic pods (focus on our sample app)
print("\n" + "="*80)
print("🔍 Searching for Problematic Pods (troubleshoot-demo-app)")
print("="*80)

# Query pods for our sample application specifically
sample_app_pods = mcp_client.list_pods(
    namespace=NAMESPACE,
    label_selector="app=troubleshoot-demo",
    limit=10
)

print("\n🎯 Querying sample application pods:")
print("   Label Selector: app=troubleshoot-demo")
print("   Expected: Pods with OOMKilled status and high restarts")

if 'error' not in sample_app_pods:
    result = sample_app_pods.get('result', {})
    pods_list = result.get('pods', [])
    
    # Filter for pods with issues
    problem_pods = []
    for pod in pods_list:
        restarts = pod.get('restarts', 0)
        status = pod.get('status', '')
        phase = pod.get('phase', '')
        
        # Identify problematic conditions
        is_problematic = (
            restarts > 2 or  # High restart count
            status in ['CrashLoopBackOff', 'Error', 'Failed', 'OOMKilled'] or
            phase in ['Failed', 'Unknown']
        )
        
        if is_problematic:
            problem_pods.append(pod)
    
    print(f"\n🚨 Found {len(problem_pods)} problematic pods:")
    for pod in problem_pods[:5]:  # Show first 5
        print(f"\n  📦 {pod['name']}")
        print(f"     Status: {pod['status']} | Phase: {pod['phase']}")
        print(f"     Restarts: {pod['restarts']} | Age: {pod['age']}")
        print(f"     Node: {pod['node']}")

        # Show container details
        for container in pod.get('containers', []):
            print(f"     Container: {container['name']}")
            print(f"       • Image: {container['image']}")
            print(f"       • Ready: {container['ready']}")
            print(f"       • State: {container.get('state', 'N/A')}")
            if container.get('reason'):
                print(f"       • Reason: {container['reason']}")

    # Store for investigation
    detected_issues = problem_pods

    if len(problem_pods) == 0:
        print("\n⚠️  No problematic pods found yet")
        print("   This could mean:")
        print("   • Pods haven't started failing yet (wait a bit longer)")
        print("   • Sample app deployment didn't complete")
        print("   • Will continue with generic troubleshooting")
else:
    print(f"❌ Error listing pods: {sample_app_pods.get('error')}")
    detected_issues = []

print(f"\n✅ Detection Complete: {len(detected_issues)} issues identified")
print("   These are our troubleshoot-demo-app pods with intentional OOM issues")


## Step 2: Investigate Root Causes

Use MCP resources and tools to dig deeper into the identified issues.
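
The pattern checks in this step can be gathered into one function that also inspects each container's termination reason, which is where an out-of-memory kill typically shows up. This is a minimal sketch, assuming the pod dictionary shape this notebook reads (`restarts`, `status`, `phase`, and a `containers` list with `name` and `reason`). The function name `diagnose_pod` and the sample pod are hypothetical.

```python
from typing import Any, Dict, List


def diagnose_pod(pod: Dict[str, Any]) -> List[str]:
    """Return human-readable hints about likely root causes for one pod."""
    hints: List[str] = []
    if pod.get("restarts", 0) > 5:
        hints.append("High restart count suggests crash loop or OOM")
    if pod.get("status") == "CrashLoopBackOff":
        hints.append("Application is crashing on startup")
    if pod.get("phase") == "Pending":
        hints.append("Pod cannot be scheduled (resources/node selector)")
    # Container-level reasons are more specific than pod-level status
    for container in pod.get("containers", []):
        if container.get("reason") == "OOMKilled":
            name = container.get("name", "?")
            hints.append(
                f"Container '{name}' was OOMKilled; its memory limit is likely too low"
            )
    return hints


# Hypothetical pod resembling the demo app after several OOM restarts
example_pod = {
    "name": "troubleshoot-demo-app-example",
    "restarts": 7,
    "status": "CrashLoopBackOff",
    "phase": "Running",
    "containers": [{"name": "web", "reason": "OOMKilled"}],
}

for hint in diagnose_pod(example_pod):
    print(hint)
```

Returning a list keeps the diagnosis separate from printing, so the same hints can be logged or passed to a remediation step.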


In [None]:
print("\n" + "="*80)
print("🔬 STEP 2: INVESTIGATE - Root Cause Analysis")
print("="*80)

# Focus on the most problematic pod (if any found)
if detected_issues:
    target_pod = detected_issues[0]
    pod_name = target_pod['name']
    
    print(f"\n🎯 Investigating: {pod_name}")
    print(f"   Current Status: {target_pod['status']}")
    print(f"   Restart Count: {target_pod['restarts']}")
    
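    # Get detailed pod information via MCP list-pods, filtered to this pod by name
    # (assumes the tool passes field_selector through to the Kubernetes API)
    detailed_info = mcp_client.list_pods(
        namespace=NAMESPACE,
        field_selector=f"metadata.name={pod_name}",
        limit=1
    )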
    
    # Analyze the pod configuration
    print(f"\n📋 Pod Configuration:")
    for container in target_pod.get('containers', []):
        print(f"  Container: {container['name']}")
        print(f"    • Image: {container['image']}")
        print(f"    • Restarts: {container.get('restart_count', 0)}")
        print(f"    • State: {container.get('state', 'N/A')}")
        if container.get('reason'):
            print(f"    • Failure Reason: {container['reason']}")

    # Check for resource constraints
    print(f"\n💾 Resource Analysis:")
    print(f"  Node: {target_pod['node']}")
    print(f"  IP: {target_pod.get('ip', 'N/A')}")

    # Identify common issues
    print(f"\n🔍 Common Issue Patterns:")
    issues_found = []
    
    if target_pod['restarts'] > 5:
        issues_found.append("High restart count suggests crash loop or OOM")
    if target_pod['status'] == 'CrashLoopBackOff':
        issues_found.append("Application is crashing on startup")
    if target_pod['phase'] == 'Pending':
        issues_found.append("Pod cannot be scheduled (resources/node selector)")
    
    for issue in issues_found:
        print(f"  ⚠️  {issue}")

    investigation_complete = True
else:
    print("\n✅ No problematic pods found for investigation")
    target_pod = None
    pod_name = None
    investigation_complete = False
