

## **Chapter 13: Monitoring and Observability**

---

## **Learning Objectives**

By the end of this chapter, you will be able to:

- Implement structured logging in GraphQL resolvers for better debugging
- Trace resolver execution times to identify performance bottlenecks
- Integrate with Apollo Studio for schema monitoring and performance analytics
- Set up OpenTelemetry for distributed tracing across microservices
- Configure health checks and readiness probes for GraphQL servers
- Implement alerting mechanisms for error rates and latency thresholds
- Analyze query performance using field-level metrics

---

## **Prerequisites**

- Completed Chapter 7: Building a GraphQL Server
- Completed Chapter 11: Performance Optimization
- Understanding of logging levels (INFO, WARN, ERROR, DEBUG)
- Basic knowledge of monitoring concepts (metrics, traces, logs)
- Optional: Access to Apollo Studio account (free tier available)

---

## **13.1 Logging in a GraphQL World**

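Standard `console.log` statements are insufficient for production GraphQL APIs. Most operations arrive at a single endpoint and often return HTTP 200 even when the response contains errors, so plain access logs reveal little. You need structured logging that captures the context of each request, including query complexity, user identity, and execution time.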

### **Structured Logging with Winston**

**Setup:**

```javascript
const winston = require('winston');

// Create a structured logger
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json() // Structured JSON output for log aggregation
  ),
  defaultMeta: { service: 'graphql-api' },
  transports: [
    new winston.transports.Console(),
    // In production, add file or external transports (Datadog, Splunk)
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});
```

### **Contextual Request Logging**

Log every GraphQL request with relevant context:

```javascript
const { ApolloServer } = require('apollo-server');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  
  // Plugin for request lifecycle logging
  plugins: [
    {
      requestDidStart(initialRequestContext) {
        const startTime = Date.now();
        const { query, operationName } = initialRequestContext.request;
        
        // Log incoming request
        logger.info('GraphQL request started', {
          operationName,
          query: query?.substring(0, 200), // Truncate long queries
          userId: initialRequestContext.context.user?.id,
          ip: initialRequestContext.context.ip
        });

        return {
          willSendResponse(requestContext) {
            const duration = Date.now() - startTime;
            
            // Log completion with metrics
            logger.info('GraphQL request completed', {
              operationName,
              durationMs: duration,
              userId: requestContext.context.user?.id,
              status: requestContext.response.http?.status
            });
          },
          
          didEncounterErrors(requestContext) {
            const duration = Date.now() - startTime;
            
            // Log errors with full context
            logger.error('GraphQL request failed', {
              operationName,
              durationMs: duration,
              errors: requestContext.errors.map(err => ({
                message: err.message,
                path: err.path,
                code: err.extensions?.code
              })),
              userId: requestContext.context.user?.id,
              query: requestContext.request.query?.substring(0, 500)
            });
          }
        };
      }
    }
  ]
});
```

**Sample Log Output:**

```json
{
  "timestamp": "2026-02-13T10:30:00.123Z",
  "level": "error",
  "message": "GraphQL request failed",
  "service": "graphql-api",
  "operationName": "GetUserProfile",
  "durationMs": 450,
  "errors": [
    {
      "message": "User not found",
      "path": ["user"],
      "code": "NOT_FOUND"
    }
  ],
  "userId": "123",
  "query": "query GetUserProfile { user(id: \"999\") { name email } }"
}
```
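
The request logs above include raw query text and a user ID. One way to keep sensitive values out of log output is to pass the metadata through a redaction step before it reaches the logger. The sketch below is a minimal, dependency-free illustration for plain JSON-like metadata. The `SENSITIVE_KEYS` list and the `redact` and `hashId` helpers are hypothetical names introduced here; they are not part of Winston or Apollo Server, and the key list would need to match your own data.

```javascript
const crypto = require('crypto');

// Hypothetical list of metadata keys that must never appear in logs.
const SENSITIVE_KEYS = new Set(['password', 'token', 'authorization', 'cookie']);

// Replace an identifier with a short, stable hash so log entries can still be
// correlated without exposing the raw value.
function hashId(value) {
  return crypto.createHash('sha256').update(String(value)).digest('hex').slice(0, 12);
}

// Return a deep copy of `meta` with sensitive keys masked and userId hashed.
function redact(meta) {
  if (Array.isArray(meta)) return meta.map(redact);
  if (meta === null || typeof meta !== 'object') return meta;

  const clean = {};
  for (const [key, value] of Object.entries(meta)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      clean[key] = '[REDACTED]';
    } else if (key === 'userId' && value != null) {
      clean[key] = hashId(value);
    } else {
      clean[key] = redact(value);
    }
  }
  return clean;
}

// Example: metadata that would otherwise be passed straight to logger.info
const safeMeta = redact({
  operationName: 'Login',
  userId: '123',
  variables: { email: 'a@example.com', password: 'hunter2' }
});
console.log(safeMeta.variables.password); // [REDACTED]
```

A call such as `logger.info('GraphQL request started', redact(meta))` would then log the masked copy instead of the original object.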

---

## **13.2 Tracing Resolver Execution Time**

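To identify bottlenecks, you need field-level performance data. Apollo Server 2 exposes this through a built-in `tracing` option that appends timing data to each response. Apollo Server 3 removed that option, so on newer versions rely on usage reporting to Apollo Studio or on custom plugins and resolver wrappers instead.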

### **Enabling Tracing**

```javascript
const { ApolloServer } = require('apollo-server');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  
  // Enable tracing (Apollo Server 2 only; adds performance data to responses).
  // Remove this option on Apollo Server 3, where it is no longer supported.
  tracing: process.env.NODE_ENV !== 'production', // Dev only
  
  // For production, use Apollo Studio or custom plugins
  plugins: [
    {
      requestDidStart() {
        return {
          didResolveOperation({ request }) {
            if (process.env.NODE_ENV === 'production') return;
            
            // Log the operation name in development; per-resolver timing
            // requires instrumenting the resolvers themselves
            console.log(`Operation: ${request.operationName}`);
          },
          
          didEncounterErrors({ request, errors }) {
            logger.error('Resolver errors', {
              operation: request.operationName,
              errors: errors.map(e => e.message)
            });
          }
        };
      }
    }
  ]
});
```

### **Custom Field-Level Tracing**

For more granular control, instrument individual resolvers:

```javascript
// Utility wrapper for timing resolvers
const withTiming = (resolverName, resolverFn) => {
  return async (parent, args, context, info) => {
    const start = process.hrtime.bigint();
    
    try {
      const result = await resolverFn(parent, args, context, info);
      
      const end = process.hrtime.bigint();
      const durationMs = Number(end - start) / 1000000; // Convert nanoseconds to ms
      
      // Log if slow (over 100ms)
      if (durationMs > 100) {
        logger.warn('Slow resolver detected', {
          resolver: resolverName,
          field: info.fieldName,
          type: info.parentType.name,
          durationMs,
          args: JSON.stringify(args).substring(0, 200)
        });
      }
      
      // Send to metrics system (e.g., StatsD, Prometheus)
      if (context.metrics) {
        context.metrics.timing(`graphql.resolver.${resolverName}`, durationMs);
      }
      
      return result;
    } catch (error) {
      const end = process.hrtime.bigint();
      logger.error('Resolver error', {
        resolver: resolverName,
        durationMs: Number(end - start) / 1000000,
        error: error.message
      });
      throw error;
    }
  };
};

// Usage in resolvers
const resolvers = {
  Query: {
    user: withTiming('Query.user', async (_, { id }, { dataSources }) => {
      return dataSources.userAPI.getUserById(id);
    }),
    
    searchUsers: withTiming('Query.searchUsers', async (_, { query }, { db }) => {
      return db.search(query);
    })
  }
};
```
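
Wrapping each resolver by hand is easy to forget as the schema grows. As a sketch, the same timing idea can be applied to a whole resolver map at once, with durations aggregated per field so the slowest fields stand out. The `instrumentResolvers` and `createFieldStats` helpers are hypothetical names introduced for illustration. The wrapper is a simplified stand-in for `withTiming`, and it assumes a flat `{ Type: { field: fn } }` map of plain objects (custom scalar instances would need to be skipped).

```javascript
// In-memory aggregator for per-field timings (hypothetical helper).
function createFieldStats() {
  const stats = new Map();
  return {
    record(field, durationMs) {
      const entry = stats.get(field) || { count: 0, totalMs: 0, maxMs: 0 };
      entry.count += 1;
      entry.totalMs += durationMs;
      entry.maxMs = Math.max(entry.maxMs, durationMs);
      stats.set(field, entry);
    },
    // Fields sorted by average duration, slowest first.
    report() {
      return [...stats.entries()]
        .map(([field, s]) => ({
          field,
          count: s.count,
          avgMs: s.totalMs / s.count,
          maxMs: s.maxMs
        }))
        .sort((a, b) => b.avgMs - a.avgMs);
    }
  };
}

// Wrap every function in a { Type: { field: fn } } resolver map.
function instrumentResolvers(resolverMap, fieldStats) {
  const wrapped = {};
  for (const [typeName, fields] of Object.entries(resolverMap)) {
    wrapped[typeName] = {};
    for (const [fieldName, resolverFn] of Object.entries(fields)) {
      if (typeof resolverFn !== 'function') {
        wrapped[typeName][fieldName] = resolverFn; // leave non-function entries untouched
        continue;
      }
      wrapped[typeName][fieldName] = async (...resolverArgs) => {
        const start = process.hrtime.bigint();
        try {
          return await resolverFn(...resolverArgs);
        } finally {
          // Record the duration whether the resolver succeeded or threw
          const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
          fieldStats.record(`${typeName}.${fieldName}`, durationMs);
        }
      };
    }
  }
  return wrapped;
}

// Usage with a stand-in resolver map
const fieldStats = createFieldStats();
const timedResolvers = instrumentResolvers({ Query: { hello: () => 'world' } }, fieldStats);
```

Because every wrapped resolver becomes asynchronous, this approach adds a small overhead to trivial fields. In practice you may want to wrap only the types whose resolvers perform I/O.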

---

## **13.3 Monitoring Tools (Apollo Studio, OpenTelemetry)**

### **Apollo Studio Integration**

Apollo Studio (formerly Apollo Engine) is a widely used hosted platform for GraphQL monitoring. It provides:
- Schema versioning and change tracking
- Performance analytics by resolver
- Error tracking and alerting
- Query volume metrics

**Setup:**

```javascript
const { ApolloServer } = require('apollo-server');
const { ApolloServerPluginUsageReporting } = require('apollo-server-core');

const server = new ApolloServer({
  typeDefs,
  resolvers,
  
  // Graph credentials are configured on the server, not on the plugin.
  // Apollo Server also reads APOLLO_KEY and APOLLO_GRAPH_REF from the
  // environment on its own, so this block can usually be omitted.
  apollo: {
    key: process.env.APOLLO_KEY,
    graphRef: process.env.APOLLO_GRAPH_REF // graph@variant
  },
  
  plugins: [
    // Send metrics to Apollo Studio
    ApolloServerPluginUsageReporting({
      // Variable values may contain PII; send none unless you have reviewed them
      sendVariableValues: { none: true }, // or { all: true } if values are safe to share
      sendHeaders: { exceptNames: ['authorization', 'cookie'] }, // Exclude sensitive headers
      
      // Fraction of requests that record field-level timing
      fieldLevelInstrumentation: 1.0, // 100% of requests
      
      // Identify the calling client so metrics can be segmented by client
      generateClientInfo: ({ request }) => {
        const clientName = request.http?.headers.get('client-name');
        const clientVersion = request.http?.headers.get('client-version');
        
        return {
          clientName: clientName || 'unknown-client',
          clientVersion: clientVersion || 'unknown-version'
        };
      }
    })
  ]
});
```

**Apollo Studio Dashboard Benefits:**
- **Field Usage**: See which fields are queried most often
- **Performance Heatmap**: Identify slow resolvers (color-coded by latency)
- **Error Rates**: Track which operations fail most frequently
- **Schema Checks**: Prevent breaking changes in CI/CD

### **OpenTelemetry for Distributed Tracing**

In microservice architectures, a single GraphQL request may span multiple services. OpenTelemetry traces the entire journey.

**Setup:**

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { GraphQLInstrumentation } = require('@opentelemetry/instrumentation-graphql');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

// Initialize OpenTelemetry
const sdk = new NodeSDK({
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new GraphQLInstrumentation({
      // Merge resolver spans for cleaner traces
      mergeItems: true,
      // Allow adding custom attributes to spans; `data` is the execution result
      responseHook: (span, data) => {
        span.setAttribute('graphql.errors.count', data.errors?.length ?? 0);
      }
    })
  ]
});

sdk.start();

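// In your resolvers, add custom spans.
// Assumes `tracer` (for example, trace.getTracer('graphql-api') from
// @opentelemetry/api) is placed on the context and `db` is your data layer.
const { SpanStatusCode } = require('@opentelemetry/api');

const resolvers = {
  Query: {
    user: async (_, { id }, { tracer }) => {
      // Create custom span for database call
      const span = tracer.startSpan('db.fetchUser');
      span.setAttribute('db.userId', id);
      
      try {
        const user = await db.getUser(id);
        span.setStatus({ code: SpanStatusCode.OK });
        return user;
      } catch (error) {
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
        throw error;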
      } finally {
        span.end();
      }
    }
  }
};
```
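
Following a request across service boundaries depends on each service passing the trace context along with its outgoing calls. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header for this, and the HTTP instrumentation normally injects and extracts it for you. The sketch below only illustrates the shape of that header, which can help when debugging why spans from two services are not joined. `parseTraceparent` is a hypothetical helper, not an OpenTelemetry API, and it handles the version `00` layout only.

```javascript
// Parse a W3C Trace Context `traceparent` header (version 00 layout):
//   version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    String(header).trim()
  );
  if (!match) return null;

  const [, version, traceId, parentId, flags] = match;

  // Version ff and all-zero IDs are invalid according to the specification.
  if (version === 'ff') return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return {
    version,
    traceId,
    parentId,
    // The lowest bit of trace-flags marks the trace as sampled
    sampled: (parseInt(flags, 16) & 0x01) === 0x01
  };
}

const parsed = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(parsed.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
console.log(parsed.sampled); // true
```

Adding the parsed `traceId` to structured log entries is one way to jump from a log line to the matching trace in your tracing backend.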

---

## **13.4 Health Checks and Readiness Probes**

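In containerized environments (Kubernetes, Docker), the orchestrator needs endpoints it can poll: a liveness endpoint to decide when a container should be restarted, and a readiness endpoint to decide when it can receive traffic.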

**Implementation:**

```javascript
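const express = require('express');
const { ApolloServer } = require('apollo-server-express');
const prometheus = require('prom-client');
// `db` and `redis` below stand for your application's own client instances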

async function startServer() {
  const app = express();
  const server = new ApolloServer({ typeDefs, resolvers });
  
  await server.start();
  server.applyMiddleware({ app });
  
  // Health check endpoint
  app.get('/health', (req, res) => {
    res.status(200).json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      uptime: process.uptime()
    });
  });
  
  // Readiness probe (checks if server is ready to accept traffic)
  app.get('/ready', async (req, res) => {
    try {
      // Check database connectivity
      await db.ping();
      
      // Check external dependencies
      await redis.ping();
      
      res.status(200).json({
        status: 'ready',
        checks: {
          database: 'connected',
          cache: 'connected'
        }
      });
    } catch (error) {
      res.status(503).json({
        status: 'not ready',
        error: error.message
      });
    }
  });
  
  // Metrics endpoint for Prometheus
  app.get('/metrics', async (req, res) => {
    res.set('Content-Type', prometheus.register.contentType);
    res.send(await prometheus.register.metrics());
  });
  
  app.listen({ port: 4000 }, () => {
    console.log(`Server ready at http://localhost:4000${server.graphqlPath}`);
  });
}

startServer().catch((err) => {
  console.error('Failed to start server', err);
  process.exit(1);
});
```
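
A readiness handler that awaits each dependency in sequence can hang if one of them stops responding, and its error does not say which check failed. One possible alternative, sketched below, runs the checks in parallel with a per-check timeout and reports each result by name. `runChecks` and `withTimeout` are hypothetical helpers introduced for illustration, and the check functions stand in for calls such as `db.ping()`.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run named async checks in parallel and summarize the outcome.
async function runChecks(checks, timeoutMs = 1000) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    names.map((name) =>
      // Promise.resolve().then(...) also captures synchronous throws
      withTimeout(Promise.resolve().then(() => checks[name]()), timeoutMs, name)
    )
  );

  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] =
      outcome.status === 'fulfilled'
        ? 'ok'
        : `failed: ${outcome.reason?.message ?? outcome.reason}`;
  });

  const ready = settled.every((outcome) => outcome.status === 'fulfilled');
  return { ready, results };
}

// Possible use inside the /ready route (db and redis are assumed clients):
//
// app.get('/ready', async (req, res) => {
//   const { ready, results } = await runChecks({
//     database: () => db.ping(),
//     cache: () => redis.ping()
//   });
//   res.status(ready ? 200 : 503).json({
//     status: ready ? 'ready' : 'not ready',
//     checks: results
//   });
// });
```

Keeping the timeout shorter than the probe's own timeout means the orchestrator receives a clear 503 with details rather than a dropped connection.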

**Kubernetes Configuration:**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graphql-server
spec:
  containers:
    - name: graphql-server
      image: graphql-app:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 4000
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 4000
        initialDelaySeconds: 5
        periodSeconds: 5
```

---

## **13.5 Alerting Configuration**

Set up alerts for critical conditions:

**Using Apollo Studio Alerts:**
- Configure email/Slack notifications for error rate > 1%
- Alert on p95 latency > 500ms
- Notify on schema change proposals

**Custom Alerting with Prometheus/Grafana:**

```javascript
// Install prom-client
const client = require('prom-client');

// Define custom metrics
const graphqlRequestDuration = new client.Histogram({
  name: 'graphql_request_duration_seconds',
  help: 'Duration of GraphQL requests in seconds',
  labelNames: ['operation', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const graphqlErrors = new client.Counter({
  name: 'graphql_errors_total',
  help: 'Total number of GraphQL errors',
  labelNames: ['operation', 'error_code']
});

// Use in plugins
const server = new ApolloServer({
  typeDefs,
  resolvers,
  plugins: [
    {
      requestDidStart() {
        const start = Date.now();
        
        return {
          willSendResponse(requestContext) {
            const duration = (Date.now() - start) / 1000;
            const operation = requestContext.operationName || 'anonymous';
            // Label failed requests separately so latency can be split by outcome
            const status = requestContext.errors?.length ? 'error' : 'success';
            
            graphqlRequestDuration.observe({ operation, status }, duration);
          },
          
          didEncounterErrors(requestContext) {
            const operation = requestContext.operationName || 'anonymous';
            
            requestContext.errors.forEach(error => {
              graphqlErrors.inc({
                operation,
                error_code: error.extensions?.code || 'UNKNOWN'
              });
            });
          }
        };
      }
    }
  ]
});
```

**Alert Rules (Prometheus):**

```yaml
groups:
  - name: graphql_alerts
    rules:
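      - alert: HighErrorRate
        # Errors as a fraction of requests, not a raw per-second count
        expr: sum(rate(graphql_errors_total[5m])) / sum(rate(graphql_request_duration_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "GraphQL errors exceed 5% of request volume"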
          
      - alert: SlowQueries
        expr: histogram_quantile(0.95, sum by (le) (rate(graphql_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GraphQL p95 latency > 1s"
```
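
The `SlowQueries` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below mirrors the basic linear-interpolation idea in plain JavaScript so the effect of bucket boundaries is easier to see. It is a simplified illustration, not Prometheus's implementation, and `estimateQuantile` is a hypothetical helper. The sample counts are made up for the example.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// `buckets` must be sorted by upper bound and end with the +Inf bucket.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const index = buckets.findIndex((b) => b.count >= rank);

  // If the quantile falls in the +Inf bucket, the best available answer
  // is the highest finite upper bound.
  if (index === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lowerBound = index === 0 ? 0 : buckets[index - 1].le;
  const lowerCount = index === 0 ? 0 : buckets[index - 1].count;
  const upper = buckets[index];
  if (upper.count === lowerCount) return upper.le;

  // Assume observations are spread evenly inside the bucket.
  const fraction = (rank - lowerCount) / (upper.count - lowerCount);
  return lowerBound + (upper.le - lowerBound) * fraction;
}

// Illustrative cumulative counts for the buckets [0.1, 0.5, 1, 2, 5]
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1, count: 96 },
  { le: 2, count: 99 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 }
];

// The 95th request falls in the (0.5, 1] bucket, so the estimate is
// interpolated between 0.5s and 1s.
console.log(estimateQuantile(0.95, buckets).toFixed(3));
```

Because the estimate can never be more precise than the bucket it lands in, bucket boundaries are worth placing close to the thresholds you alert on.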

---

## **Chapter Summary**

Observability transforms debugging from guesswork into data-driven investigation. This chapter covered structured logging, resolver tracing, Apollo Studio and OpenTelemetry integration, health probes, and alerting.

### **Key Takeaways:**

1.  **Structured Logging**: Use JSON-formatted logs with context (user ID, operation name, duration) for aggregation and analysis.
2.  **Resolver Tracing**: Instrument individual resolvers to identify performance bottlenecks. Log slow resolvers (over 100 ms) so they can be optimized.
3.  **Apollo Studio**: Essential for production GraphQL monitoring. Provides field-level analytics, error tracking, and schema change validation.
4.  **OpenTelemetry**: Implement distributed tracing for microservice architectures to follow requests across service boundaries.
5.  **Health Checks**: Provide `/health` and `/ready` endpoints for container orchestration platforms to manage traffic routing.
6.  **Metrics**: Export Prometheus metrics for latency, error rates, and query volume. Set up alerts for anomaly detection.
7.  **Privacy**: Never log sensitive data (passwords, tokens, PII). Redact or hash identifiers in logs.

### **Monitoring Checklist:**

- [ ] Structured JSON logging implemented
- [ ] Request/response logging with duration tracking
- [ ] Error logging with full stack traces
- [ ] Apollo Studio integration (or equivalent)
- [ ] Field-level resolver timing
- [ ] Health and readiness endpoints
- [ ] Prometheus metrics export
- [ ] Alerting rules for error rates and latency
- [ ] Distributed tracing for microservices
- [ ] Log aggregation system (ELK, Datadog, Splunk)

---

### **🚀 Next Up: Chapter 14 - Testing GraphQL**

**Summary:** A monitored system is only as good as its reliability. In Chapter 14, we will ensure our GraphQL API is robust through comprehensive testing. You will learn how to write unit tests for resolvers, integration tests for the entire schema, and end-to-end tests. We will also cover mocking strategies and contract testing to ensure your API meets its specifications.

