# Performance Testing Tutorial

## Learning Objectives

By the end of this tutorial, you will:
1. Understand different types of performance testing
2. Set up Locust for load testing
3. Create realistic user scenarios
4. Interpret performance metrics
5. Identify and fix performance bottlenecks

## Prerequisites

- Python 3.8+
- RAG Engine Mini running locally
- Basic understanding of HTTP APIs

## Part 1: Introduction to Performance Testing

### What is Performance Testing?

Performance testing evaluates how a system performs under various workloads. It helps answer questions like:

- How many users can the system handle?
- What's the response time under load?
- Where are the bottlenecks?
- Will the system crash under stress?

### Types of Performance Testing

```
1. Load Testing      → Normal expected traffic
2. Stress Testing    → Beyond normal capacity
3. Spike Testing     → Sudden traffic increases
4. Endurance Testing → Long duration stability
5. Benchmark Testing → Baseline metrics
```

## Part 2: Setting Up Locust

### Installation

Let's install the required packages:

In [None]:
# Install required packages
%pip install locust pytest-benchmark aiohttp psutil numpy matplotlib

### Your First Locustfile

A Locustfile defines user behavior using Python code:

In [None]:
# Simple Locust example
locust_code = '''
from locust import HttpUser, task, between

class SimpleUser(HttpUser):
    """A simple user that makes requests."""
    
    # Wait 1-3 seconds between tasks
    wait_time = between(1, 3)
    
    @task
    def visit_homepage(self):
        """Simulate visiting the homepage."""
        self.client.get("/health")
    
    @task(3)  # Weight 3: picked 3x as often as a weight-1 task
    def ask_question(self):
        """Simulate asking a question."""
        self.client.post(
            "/api/v1/ask",
            json={"question": "What is RAG?", "k": 5}
        )
'''

print(locust_code)

### Key Concepts

1. **HttpUser**: Represents a user making HTTP requests
2. **@task**: Decorator marking a method as a task to perform
3. **wait_time**: Time between consecutive tasks
4. **self.client**: HTTP client for making requests

### Task Weights

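The number in `@task(n)` sets the task's relative weight. Each user picks its next task at random, in proportion to these weights:
- `@task` = weight 1
- `@task(3)` = weight 3, so it is picked three times as often as a weight-1 task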

Example with multiple tasks:

In [None]:
# Demonstrate task weights
import random

# Simulate task selection
tasks = [
    ("health_check", 2),    # Weight 2
    ("ask_question", 5),    # Weight 5
    ("search", 3),          # Weight 3
]

total_weight = sum(w for _, w in tasks)

print("Task Distribution (out of 1000 requests):")
for task_name, weight in tasks:
    percentage = (weight / total_weight) * 100
    count = int((weight / total_weight) * 1000)
    print(f"  {task_name}: {percentage:.1f}% (~{count} requests)")
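
The expected shares above can also be checked empirically. The sketch below draws 10,000 task picks with `random.choices`, which samples in proportion to the given weights. It mimics weighted selection in general and is not Locust's internal scheduler; the task names and weights repeat the illustrative values from the cell above.

```python
import random
from collections import Counter

# Same illustrative tasks and weights as in the cell above
tasks = [("health_check", 2), ("ask_question", 5), ("search", 3)]
names = [name for name, _ in tasks]
weights = [weight for _, weight in tasks]

rng = random.Random(42)  # fixed seed so the run is repeatable
picks = rng.choices(names, weights=weights, k=10_000)
counts = Counter(picks)

for name, weight in tasks:
    expected = weight / sum(weights)
    observed = counts[name] / len(picks)
    print(f"  {name}: expected {expected:.1%}, observed {observed:.1%}")
```

The observed shares should land close to 20%, 50%, and 30%, with small deviations that shrink as the number of picks grows.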

## Part 3: Creating Realistic User Scenarios

### Authenticated Users

Most API endpoints require authentication. Let's create an authenticated user:

In [None]:
# Authenticated user example
auth_user_code = '''
from locust import HttpUser, task, between

class AuthenticatedUser(HttpUser):
    wait_time = between(1, 5)
    
    def on_start(self):
        """Called when user starts - use for authentication."""
        response = self.client.post(
            "/api/v1/auth/login",
            json={
                "email": "test@example.com",
                "password": "TestPass123!"
            }
        )
        
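        # Default to None so a failed login does not cause AttributeError later
        self.token = None
        if response.status_code == 200:
            self.token = response.json()["access_token"]
        else:
            print(f"Failed to login: {response.status_code}")
    
    def get_headers(self):
        """Get request headers, with authentication when a token is available."""
        if self.token is None:
            return {}
        return {"Authorization": f"Bearer {self.token}"}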
    
    @task(3)
    def ask_question(self):
        self.client.post(
            "/api/v1/ask",
            headers=self.get_headers(),
            json={"question": "What is RAG?", "k": 5}
        )
'''

print(auth_user_code)

### Multiple User Types

Real systems have different user types:

In [None]:
# Multiple user types
print("""
Different User Types in Load Testing:

1. RegularUser (80% of traffic)
   - Browses documents
   - Asks occasional questions
   - Low resource usage

2. PowerUser (15% of traffic)
   - Uploads many documents
   - Complex queries
   - High resource usage

3. MonitoringBot (5% of traffic)
   - Only health checks
   - Very frequent
   - Minimal payload
""")

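One way to express this mix in Locust is the `weight` attribute on user classes, which controls how often each class is chosen when users are spawned. The sketch below follows this notebook's convention of holding the locustfile in a string. The class names mirror the list above; the `/api/v1/documents` endpoints and the request payloads are hypothetical placeholders, so swap in the routes your deployment actually exposes.

```python
# Locustfile sketch with three user types (document endpoints are hypothetical)
user_types_code = '''
from locust import HttpUser, task, between, constant

class RegularUser(HttpUser):
    weight = 80  # roughly 80% of spawned users
    wait_time = between(2, 8)

    @task(4)
    def browse_documents(self):
        # Hypothetical listing endpoint
        self.client.get("/api/v1/documents")

    @task(1)
    def ask_question(self):
        self.client.post("/api/v1/ask", json={"question": "What is RAG?", "k": 5})

class PowerUser(HttpUser):
    weight = 15  # roughly 15% of spawned users
    wait_time = between(1, 3)

    @task
    def upload_document(self):
        # Hypothetical upload endpoint and payload
        self.client.post(
            "/api/v1/documents",
            files={"file": ("notes.txt", b"sample text", "text/plain")},
        )

    @task(2)
    def complex_query(self):
        self.client.post(
            "/api/v1/ask",
            json={"question": "Compare RAG with fine-tuning.", "k": 10},
        )

class MonitoringBot(HttpUser):
    weight = 5  # roughly 5% of spawned users
    wait_time = constant(1)  # frequent, fixed-interval checks

    @task
    def health_check(self):
        self.client.get("/health")
'''

print(user_types_code)
```

Note that `weight` sets the share of simulated users, not the share of requests; the resulting traffic split also depends on each class's `wait_time`. If your endpoints require authentication, each class would also need the login step shown in `AuthenticatedUser`.
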
## Part 4: Running Load Tests

### Web Interface Mode

Best for development and debugging:

```bash
locust -f tests/performance/locustfile.py --host=http://localhost:8000
```

Then open http://localhost:8089 in your browser.

### Headless Mode

Best for CI/CD and automation:

```bash
locust -f tests/performance/locustfile.py \
    --host=http://localhost:8000 \
    --headless \
    -u 100 \
    -r 10 \
    --run-time 5m \
    --csv=results
```

Parameters explained:
- `-u 100`: 100 concurrent users
- `-r 10`: Spawn 10 users per second
- `--run-time 5m`: Run for 5 minutes
- `--csv=results`: Save results to CSV

## Part 5: Understanding Performance Metrics

### Key Metrics Explained

Let's simulate and visualize performance metrics:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Simulate response times (in milliseconds)
# Most requests are fast, some are slow (realistic distribution)
np.random.seed(42)
response_times = np.concatenate([
    np.random.normal(200, 50, 900),    # 90% around 200ms
    np.random.normal(1000, 200, 80),   # 8% around 1s
    np.random.normal(3000, 500, 20),   # 2% around 3s (slow)
])

# Calculate percentiles
p50 = np.percentile(response_times, 50)
p95 = np.percentile(response_times, 95)
p99 = np.percentile(response_times, 99)

print("Response Time Metrics:")
print(f"  P50 (Median): {p50:.0f}ms")
print(f"  P95: {p95:.0f}ms")
print(f"  P99: {p99:.0f}ms")
print(f"  Mean: {np.mean(response_times):.0f}ms")
print(f"  Max: {np.max(response_times):.0f}ms")

In [None]:
# Visualize distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax1.hist(response_times, bins=50, edgecolor='black', alpha=0.7)
ax1.axvline(p50, color='green', linestyle='--', label=f'P50: {p50:.0f}ms')
ax1.axvline(p95, color='orange', linestyle='--', label=f'P95: {p95:.0f}ms')
ax1.axvline(p99, color='red', linestyle='--', label=f'P99: {p99:.0f}ms')
ax1.set_xlabel('Response Time (ms)')
ax1.set_ylabel('Frequency')
ax1.set_title('Response Time Distribution')
ax1.legend()

# Box plot
ax2.boxplot(response_times, vert=True)
ax2.set_ylabel('Response Time (ms)')
ax2.set_title('Response Time Box Plot')

plt.tight_layout()
plt.show()

### Interpreting Percentiles

- **P50 (Median)**: Half of all requests complete faster than this
- **P95**: 95% of requests complete faster than this; the slowest 5% take longer
- **P99**: 99% of requests complete faster than this; the slowest 1% take longer

**Why percentiles matter more than averages:**

Averages hide outliers. A system with:
- Average: 500ms
- P95: 5 seconds

is worse than:
- Average: 700ms
- P95: 1 second

The second system has the higher average, but its users get a more consistent experience: in the first system, roughly 1 in 20 requests takes 5 seconds or longer.
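
To make the comparison concrete, the sketch below builds two made-up latency samples (not measurements): one with a lower average but a heavy tail, one with a higher average and a tight spread. It then computes the mean and a nearest-rank P95 for each. The helper `percentile_nearest_rank` is written here for illustration; `np.percentile` interpolates by default and may give slightly different values.

```python
import math
from statistics import mean

def percentile_nearest_rank(values, p):
    """Return the smallest value such that at least p% of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Illustrative data only: 100 requests per system, in milliseconds
system_a = [200] * 90 + [3200] * 10           # mostly fast, 10% very slow
system_b = [601 + 2 * i for i in range(100)]  # evenly spread from 601 to 799

for name, samples in [("A", system_a), ("B", system_b)]:
    print(f"System {name}: mean={mean(samples):.0f}ms, "
          f"P95={percentile_nearest_rank(samples, 95)}ms")
```

System A reports a mean of 500ms with a P95 of 3200ms, while System B reports a mean of 700ms with a P95 of 789ms. Judged by the mean alone, A looks better; judged by the tail, B is clearly the steadier system.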

## Part 6: Advanced Load Patterns

### Custom Load Shapes

Sometimes you need non-linear load patterns:

In [None]:
# Custom load shapes example
print("""
Custom Load Shape Examples:

1. Spike Test
   Normal: 50 users
   Spike (every 60s): 200 users for 10s
   Tests: Auto-scaling, circuit breakers

2. Ramp Up
   Start: 10 users
   Increase: +10 users every 30s
   Max: 300 users
   Tests: Capacity limits, gradual scaling

3. Daily Pattern
   Low: 20 users (night)
   Medium: 100 users (morning)
   High: 500 users (peak hours)
   Tests: Variable capacity planning
""")

In [None]:
# Visualize different load patterns
import numpy as np
import matplotlib.pyplot as plt

time_minutes = np.arange(0, 30, 0.5)

# Spike pattern
spike_users = []
for t in time_minutes:
    if int(t) % 6 == 0:  # 1-minute spike every 6 minutes
        spike_users.append(200)
    else:
        spike_users.append(50)

# Ramp up pattern: start at 10 users, +10 every 30s (20 per minute), capped at 300
ramp_users = np.minimum(10 + time_minutes * 20, 300)

# Steady pattern
steady_users = [100] * len(time_minutes)

plt.figure(figsize=(12, 6))
plt.plot(time_minutes, spike_users, label='Spike Pattern', linewidth=2)
plt.plot(time_minutes, ramp_users, label='Ramp Up', linewidth=2)
plt.plot(time_minutes, steady_users, label='Steady Load', linewidth=2)
plt.xlabel('Time (minutes)')
plt.ylabel('Number of Users')
plt.title('Different Load Patterns')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
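
In Locust, a custom shape is typically written as a `LoadTestShape` subclass whose `tick()` method returns a `(user_count, spawn_rate)` tuple, or `None` to stop the test. The sketch below keeps the scheduling logic in plain functions of elapsed seconds so it can be checked without running Locust. The function names and defaults are illustrative; the numbers come from the Spike Test and Ramp Up descriptions above.

```python
def ramp_up_users(elapsed_s, start=10, step=10, step_s=30, max_users=300):
    """Ramp Up: begin at `start` users, add `step` every `step_s` seconds, capped."""
    completed_steps = int(elapsed_s // step_s)
    return min(start + completed_steps * step, max_users)

def spike_users(elapsed_s, normal=50, spike=200, period_s=60, spike_s=10):
    """Spike Test: `spike` users for the first `spike_s` seconds of each period."""
    return spike if (elapsed_s % period_s) < spike_s else normal

# Inside a LoadTestShape subclass, tick() could return, for example:
#     (ramp_up_users(self.get_run_time()), 10)
for t in (0, 29, 30, 65, 900):
    print(f"t={t:>3}s  ramp={ramp_up_users(t):>3}  spike={spike_users(t):>3}")
```

Keeping the schedule as a pure function makes it easy to unit test and to plot before you spend time on a real run.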

## Part 7: Identifying Bottlenecks

### Common Performance Bottlenecks

1. **Database Connections**
   - Symptom: Increasing response times under load
   - Solution: Connection pooling, query optimization

2. **LLM API Rate Limits**
   - Symptom: Timeouts on /ask endpoint
   - Solution: Request queuing, caching

3. **Vector Search Latency**
   - Symptom: Slow RAG queries
   - Solution: Index optimization, approximate search

4. **Memory Leaks**
   - Symptom: Memory grows over time
   - Solution: Proper cleanup, monitoring
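
As one illustration of the caching remedy for LLM rate limits, the sketch below memoizes answers with `functools.lru_cache`, keyed on a normalized form of the question. `call_llm` is a hypothetical stand-in for the real, slow LLM request, and the normalization rule (lowercase, collapsed whitespace) is an assumption you may want to tighten or loosen.

```python
from functools import lru_cache

llm_calls = 0  # counts how often the stand-in LLM is actually invoked

def call_llm(question: str) -> str:
    """Hypothetical stand-in for a slow, rate-limited LLM request."""
    global llm_calls
    llm_calls += 1
    return f"answer to: {question}"

@lru_cache(maxsize=1024)
def _cached_answer(normalized_question: str) -> str:
    return call_llm(normalized_question)

def answer(question: str) -> str:
    """Normalize the question so trivially different spellings share a cache entry."""
    return _cached_answer(" ".join(question.lower().split()))

for q in ["What is RAG?", "what is  rag?", "What is RAG?", "How does retrieval work?"]:
    answer(q)

print(f"4 questions, {llm_calls} LLM calls")
```

If you add a cache like this, vary the questions in your locustfile as well; a load test that always sends the same question would mostly measure cache hits.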

In [None]:
# Simulate bottleneck detection
print("Bottleneck Detection Example:")
print("="*50)

# Simulate metrics
metrics = {
    "Response Time": {"current": "1.2s", "threshold": "< 2s", "status": "✓ OK"},
    "DB Connections": {"current": "45/50", "threshold": "< 80%", "status": "⚠ WARNING"},
    "Memory Usage": {"current": "512MB", "threshold": "< 1GB", "status": "✓ OK"},
    "Error Rate": {"current": "0.5%", "threshold": "< 1%", "status": "✓ OK"},
    "LLM Latency": {"current": "3.5s", "threshold": "< 2s", "status": "✗ CRITICAL"},
}

for metric, data in metrics.items():
    print(f"{metric:20} {data['current']:>10} (target: {data['threshold']}) {data['status']}")

print("\n" + "="*50)
print("ALERT: LLM latency is above threshold!")
print("ACTION: Consider implementing request queuing or caching.")
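
The cell above hard-codes each status. A small helper can derive it from numbers instead. The sketch below assumes a simple two-limit rule that is not taken from any particular monitoring tool: a value at or above its critical limit is CRITICAL, a value at or above its warning limit is WARNING, and anything else is OK. The current values echo the ones printed above; the limits are illustrative assumptions.

```python
def classify(value, warn_at, crit_at):
    """Return a status string for a metric where higher values are worse."""
    if value >= crit_at:
        return "CRITICAL"
    if value >= warn_at:
        return "WARNING"
    return "OK"

# (current value, warning limit, critical limit); limits are illustrative
checks = {
    "Response Time (s)": (1.2, 2.0, 4.0),
    "DB Pool Usage (%)": (90.0, 80.0, 100.0),
    "Memory Usage (MB)": (512, 1024, 2048),
    "Error Rate (%)": (0.5, 1.0, 5.0),
    "LLM Latency (s)": (3.5, 2.0, 3.0),
}

for name, (value, warn_at, crit_at) in checks.items():
    print(f"{name:20} {value:>8} {classify(value, warn_at, crit_at)}")
```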

## Part 8: Practical Exercise

### Exercise 1: Baseline Performance Test

1. Start your RAG Engine locally
2. Run a baseline load test:
   ```bash
   locust -f tests/performance/locustfile.py --host=http://localhost:8000
   ```
3. Configure:
   - Users: 50
   - Spawn rate: 5
   - Duration: 5 minutes
4. Record the results

### Exercise 2: Identify Breaking Point

1. Gradually increase users from 50 to 500
2. Monitor:
   - Response times
   - Error rates
   - Resource usage
3. Note when performance degrades
4. Document the breaking point
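
Once you have one result per load step, locating the breaking point can be automated. The sketch below assumes you have collected the P95 latency and error rate for each user count yourself, for example from Locust's CSV output. The numbers are made up for illustration, and the thresholds are placeholders for your own targets.

```python
def find_breaking_point(steps, max_p95_ms=2000, max_error_rate=0.01):
    """Return the first step (by user count) that violates a threshold, else None."""
    for step in sorted(steps, key=lambda s: s["users"]):
        if step["p95_ms"] > max_p95_ms or step["error_rate"] > max_error_rate:
            return step
    return None

# Illustrative results, one entry per load step (not real measurements)
steps = [
    {"users": 50, "p95_ms": 420, "error_rate": 0.000},
    {"users": 100, "p95_ms": 610, "error_rate": 0.001},
    {"users": 200, "p95_ms": 1350, "error_rate": 0.004},
    {"users": 300, "p95_ms": 2900, "error_rate": 0.020},
    {"users": 500, "p95_ms": 8100, "error_rate": 0.150},
]

broken = find_breaking_point(steps)
if broken is None:
    print("No breaking point found in the tested range")
else:
    print(f"Performance degrades at {broken['users']} users "
          f"(P95 {broken['p95_ms']}ms, errors {broken['error_rate']:.1%})")
```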

### Exercise 3: Optimize and Retest

1. Implement one optimization:
   - Add caching
   - Optimize database queries
   - Increase connection pool
2. Rerun the same test
3. Compare before/after results

## Part 9: CI/CD Integration

### Running Performance Tests in CI

```yaml
name: Performance Tests

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM

jobs:
  performance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start services
        run: docker compose up -d

      - name: Install Locust
        run: pip install locust

      - name: Run performance tests
        run: |
          locust -f tests/performance/locustfile.py \
            --headless \
            --host=http://localhost:8000 \
            -u 100 \
            -r 10 \
            --run-time 5m \
            --csv=results

      - name: Upload results
        if: always()  # keep the CSVs even when the test step fails
        uses: actions/upload-artifact@v4
        with:
          name: performance-results
          path: results_*.csv
```
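
A scheduled run is most useful when it fails on a regression. One option is a small gate script, run after Locust, that reads the aggregated row of `results_stats.csv` and reports any missed target. The sketch assumes the column names written by recent Locust releases (`Name`, `Request Count`, `Failure Count`, `95%`) and an `Aggregated` summary row; check them against your own CSV. The thresholds and the inline sample data are illustrative.

```python
import csv
import io

def check_stats(csv_text, max_p95_ms=2000, max_failure_ratio=0.01):
    """Return a list of threshold violations found in a Locust stats CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = next((r for r in rows if r["Name"] == "Aggregated"), None)
    if total is None:
        return ["no 'Aggregated' row found"]

    problems = []
    requests = int(total["Request Count"])
    failures = int(total["Failure Count"])
    p95 = float(total["95%"])
    if requests == 0:
        problems.append("no requests were made")
    elif failures / requests > max_failure_ratio:
        problems.append(
            f"failure ratio {failures / requests:.2%} exceeds {max_failure_ratio:.2%}"
        )
    if p95 > max_p95_ms:
        problems.append(f"P95 {p95:.0f}ms exceeds {max_p95_ms}ms")
    return problems

# Made-up sample with only the columns used here; a real file has more
sample = """Type,Name,Request Count,Failure Count,95%
POST,/api/v1/ask,900,27,3100
GET,/health,300,0,40
,Aggregated,1200,27,2600
"""

problems = check_stats(sample)
for problem in problems:
    print("FAIL:", problem)
# In CI: read results_stats.csv instead, and call sys.exit(1) if problems is non-empty
```

With the sample data, both checks fail: the failure ratio is 2.25% against a 1% limit, and the P95 is 2600ms against a 2000ms limit.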

## Summary

### Key Takeaways

1. **Performance testing is continuous** - Not a one-time activity
2. **Use realistic scenarios** - Match production usage patterns
3. **Monitor percentiles** - Not just averages
4. **Test different patterns** - Steady, spike, ramp up
5. **Integrate into CI/CD** - Catch regressions early

### Performance Checklist

- [ ] Define performance targets (RPS, latency)
- [ ] Create realistic user scenarios
- [ ] Set up automated load testing
- [ ] Monitor resource utilization
- [ ] Document bottlenecks and optimizations
- [ ] Track performance over time
- [ ] Set up alerts for regressions

### Next Steps

1. Run the practical exercises in this notebook
2. Set up daily automated performance tests
3. Create a performance dashboard
4. Define SLOs (Service Level Objectives)

## Additional Resources

- [Locust Documentation](https://docs.locust.io/)
- [Performance Testing Best Practices](https://www.guru99.com/performance-testing.html)
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
- [RAG Engine Performance Guide](../../docs/learning/testing/03-performance-testing.md)