# Server Optimization vs Scaling: Decision Framework

This notebook walks through a systematic way to decide when to optimize existing resources and when to scale infrastructure, using a Django application (WatchParty) as the running example.

## 1. Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import psutil  # For system monitoring
import requests
import json
from datetime import datetime, timedelta

# Set up visualization
plt.style.use('ggplot')
%matplotlib inline

## 2. Key Metrics to Monitor

Before making any decision about optimization vs. scaling, you need to collect and analyze key performance metrics:

In [None]:
# Sample function to collect system metrics
def collect_system_metrics():
    metrics = {
        'cpu_percent': psutil.cpu_percent(interval=1),
        'memory_percent': psutil.virtual_memory().percent,
        'swap_percent': psutil.swap_memory().percent,
        'disk_percent': psutil.disk_usage('/').percent,
        'load_avg': psutil.getloadavg()
    }
    return metrics

# Example: Collect metrics once
current_metrics = collect_system_metrics()
print("Current System Metrics:")
for key, value in current_metrics.items():
    print(f"- {key}: {value}")
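
Unlike the other metrics, `load_avg` is not a percentage, so it is easier to interpret once divided by the number of CPU cores. The sketch below shows one way to do that. `normalize_load` is a hypothetical helper (not part of psutil), and reading a per-core value above 1.0 as "more runnable work than cores" is a rule of thumb rather than a hard limit.

```python
def normalize_load(load_avg, cpu_count):
    """Return the 1-, 5-, and 15-minute load averages divided by core count.

    `normalize_load` is an illustrative helper, not a psutil function.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    one, five, fifteen = load_avg
    return {
        'load_1m_per_core': one / cpu_count,
        'load_5m_per_core': five / cpu_count,
        'load_15m_per_core': fifteen / cpu_count,
    }


# Example with hard-coded values: a load of 3.0 on a 2-core machine
per_core = normalize_load((3.0, 2.0, 1.0), cpu_count=2)
print(per_core)
```

In the notebook itself, the inputs could come from `current_metrics['load_avg']` and `psutil.cpu_count()`.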

## 3. Decision Framework: When to Optimize vs. Scale

Let's create a decision framework based on the metrics we collect:

In [None]:
# Sample decision framework
def analyze_resource_needs(metrics, thresholds=None):
    """Map each percentage metric to MONITOR, OPTIMIZE, or SCALE."""
    
    # Default thresholds if not provided
    default_thresholds = {
        'cpu_percent': {'optimize': 70, 'scale': 85},
        'memory_percent': {'optimize': 75, 'scale': 90},
        'swap_percent': {'optimize': 50, 'scale': 70},
        'disk_percent': {'optimize': 70, 'scale': 90},
    }
    
    # Use provided thresholds or defaults (None avoids a shared mutable default argument)
    thresholds = thresholds or {}
    t = {k: thresholds.get(k, v) for k, v in default_thresholds.items()}
    
    recommendations = {}
    
    # Analyze each metric
    for metric, value in metrics.items():
        if metric in t:
            if value >= t[metric]['scale']:
                recommendations[metric] = {'action': 'SCALE', 'value': value, 'threshold': t[metric]['scale']}
            elif value >= t[metric]['optimize']:
                recommendations[metric] = {'action': 'OPTIMIZE', 'value': value, 'threshold': t[metric]['optimize']}
            else:
                recommendations[metric] = {'action': 'MONITOR', 'value': value}
    
    return recommendations

# Example usage
recommendations = analyze_resource_needs(current_metrics)
for metric, rec in recommendations.items():
    print(f"{metric}: {rec['action']} (Current: {rec['value']}%)")

## 4. Optimization Techniques for Django Applications

Before scaling, consider these optimization techniques:

### 4.1 Database Optimization

- **Identify slow queries**: Use Django Debug Toolbar or query logging
- **Add indexes**: For frequently queried fields
- **Implement query caching**: With Redis or Memcached
- **Use select_related/prefetch_related**: To reduce query count
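
The query-count problem that `select_related` addresses can be illustrated without Django. The sketch below uses the standard-library `sqlite3` module and two made-up tables (`users`, `videos`) to compare one query per row against a single JOIN, which is roughly what `select_related` generates for a foreign key. The table names, data, and the `run` helper are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT, user_id INTEGER REFERENCES users(id));
    INSERT INTO users VALUES (1, 'ana'), (2, 'ben');
    INSERT INTO videos VALUES (1, 'intro', 1), (2, 'demo', 2), (3, 'recap', 1);
""")

query_count = 0

def run(sql, params=()):
    """Execute a query and count it, standing in for Django's query log."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# N+1 pattern: one query for the videos, then one more per video for its owner
query_count = 0
naive = []
for title, user_id in run("SELECT title, user_id FROM videos ORDER BY id"):
    owner = run("SELECT name FROM users WHERE id = ?", (user_id,))[0][0]
    naive.append((title, owner))
naive_queries = query_count

# JOIN pattern: the same data in a single query
query_count = 0
joined = run(
    "SELECT v.title, u.name FROM videos v JOIN users u ON u.id = v.user_id ORDER BY v.id"
)
joined_queries = query_count

print(f"N+1 pattern: {naive_queries} queries, JOIN: {joined_queries} query")
```

Both approaches return the same rows; only the number of round trips to the database differs, and that gap grows with the number of rows.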

Example implementation for analyzing Django queries:

In [None]:
# Sample function to parse Django query logs
def analyze_query_log(log_file_path):
    # This is a simplified example - in practice you'd parse actual log files
    sample_queries = [
        {'query': 'SELECT * FROM users_profile', 'time': 0.05, 'count': 250},
        {'query': 'SELECT * FROM videos WHERE user_id=?', 'time': 0.3, 'count': 120},
        {'query': 'SELECT * FROM parties LEFT JOIN users', 'time': 1.2, 'count': 45},
    ]
    
    # Convert to DataFrame for analysis
    query_df = pd.DataFrame(sample_queries)
    
    # Calculate total time per query type
    query_df['total_time'] = query_df['time'] * query_df['count']
    
    # Sort by most expensive queries
    return query_df.sort_values('total_time', ascending=False)

# Example usage
query_analysis = analyze_query_log('django_queries.log')
query_analysis

### 4.2 Memory Optimization

- **Gunicorn worker tuning**: Adjust worker count based on cores and available memory
- **Memory limits**: Set in systemd service files
- **Middleware evaluation**: Remove unnecessary middleware
- **Cache optimization**: Redis connection pooling and key expiration

In [None]:
# Function to calculate optimal Gunicorn workers
def calculate_gunicorn_workers(cpu_count=None, memory_gb=None):
    if cpu_count is None:
        cpu_count = psutil.cpu_count() or 1  # cpu_count() can return None
    
    if memory_gb is None:
        memory_gb = psutil.virtual_memory().total / (1024**3)
    
    # CPU-based calculation: start from (2 × cores) + 1, capped at 4 × cores
    cpu_based = min(2 * cpu_count + 1, 4 * cpu_count)
    
    # Memory-based calculation: use 75% of memory, assume 250MB per worker
    memory_based = int((memory_gb * 0.75) / 0.25)
    
    # Take the minimum of the two, but never fewer than one worker
    optimal_workers = max(1, min(cpu_based, memory_based))
    
    return {
        'cpu_based': cpu_based,
        'memory_based': memory_based,
        'optimal_workers': optimal_workers,
        'worker_memory_mb': (memory_gb * 1024 * 0.75) / optimal_workers
    }

# Example usage
worker_recommendation = calculate_gunicorn_workers()
print(f"Recommended Gunicorn Workers: {worker_recommendation['optimal_workers']}")
print(f"Estimated Memory per Worker: {worker_recommendation['worker_memory_mb']:.0f} MB")
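
A worker count like this one could be applied through a Gunicorn configuration file, which is plain Python. The following is a minimal sketch of a `gunicorn.conf.py`. The specific values (bind address, 1000 requests per worker, 30-second timeout) are illustrative assumptions, not recommendations from this notebook.

```python
# gunicorn.conf.py (illustrative values; tune for your own server)
import multiprocessing

# Start from the (2 × cores) + 1 rule of thumb
workers = multiprocessing.cpu_count() * 2 + 1

# Recycle each worker after a number of requests to limit the effect of slow memory leaks
max_requests = 1000
max_requests_jitter = 100  # Stagger restarts so workers do not all recycle at once

# Kill and restart workers that are silent for longer than this many seconds
timeout = 30

# Address to listen on (assumed; match your reverse proxy configuration)
bind = '127.0.0.1:8000'
```

Such a file would be passed to Gunicorn with the `-c` option, for example `gunicorn -c gunicorn.conf.py myproject.wsgi`, where `myproject.wsgi` is a placeholder module name.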

### 4.3 Cache Optimization

- **View caching**: Cache entire views or fragments
- **Template fragment caching**: Cache portions of templates
- **Low-level cache API**: Cache individual values, such as expensive query results or API responses, each with its own TTL
- **Use Redis for session storage**: Typically faster than database-backed sessions
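
For the Redis-backed cache and session items, Django's settings might look like the sketch below. It assumes Django 4.0 or later (which ships a built-in Redis cache backend) and a Redis server at `127.0.0.1:6379`. The database number and the 300-second default TTL are placeholder choices.

```python
# settings.py fragment (assumes Django >= 4.0 and a local Redis server)
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379/1',  # Assumed address and database number
        'TIMEOUT': 300,  # Default TTL in seconds for keys set without an explicit timeout
    }
}

# Store sessions in the cache configured above instead of the database
SESSION_ENGINE = 'django.contrib.sessions.backends.cache'
SESSION_CACHE_ALIAS = 'default'
```

With cache-only sessions, session data is lost if Redis restarts or evicts the keys. Django's `cached_db` session backend is the write-through alternative when that matters.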

In [None]:
# Function to analyze cache hit rates
def analyze_cache_performance(days=7):
    # Sample data - in practice, get this from monitoring tools
    dates = pd.date_range(end=datetime.now(), periods=days)
    
    # Generate sample data
    np.random.seed(42)  # For reproducible results
    hit_rate = np.random.uniform(0.6, 0.9, days)
    cache_data = {
        'date': dates,
        'hit_rate': hit_rate,
        'miss_rate': 1 - hit_rate,  # Misses are the complement of hits
        'request_count': np.random.randint(5000, 15000, days)
    }
    
    df = pd.DataFrame(cache_data)
    
    # Create visualization
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 10))
    
    # Plot hit/miss rates
    ax1.plot(df['date'], df['hit_rate'], 'g-', label='Hit Rate')
    ax1.plot(df['date'], df['miss_rate'], 'r-', label='Miss Rate')
    ax1.set_ylabel('Rate')
    ax1.set_title('Cache Hit vs Miss Rate')
    ax1.legend()
    ax1.grid(True)
    
    # Plot request count
    ax2.bar(df['date'], df['request_count'], color='skyblue')
    ax2.set_ylabel('Request Count')
    ax2.set_title('Daily Request Count')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Calculate average hit rate
    avg_hit_rate = df['hit_rate'].mean()
    print(f"Average cache hit rate: {avg_hit_rate:.2%}")
    
    if avg_hit_rate < 0.7:
        print("RECOMMENDATION: Improve cache strategy - hit rate below 70%")
    else:
        print("RECOMMENDATION: Cache performance acceptable")
        
    return df

# Example usage
cache_analysis = analyze_cache_performance()
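
With real data, the hit rate is derived from hit and miss counters rather than sampled. Redis, for example, reports `keyspace_hits` and `keyspace_misses` in its `INFO stats` output. The sketch below computes the rate from two such counters. The numbers are made up and `cache_hit_rate` is a hypothetical helper.

```python
def cache_hit_rate(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


# Made-up counter values, e.g. read from Redis INFO (keyspace_hits, keyspace_misses)
rate = cache_hit_rate(hits=8200, misses=1800)
print(f"Cache hit rate: {rate:.1%}")
```

Because these counters are cumulative since the server started, the rate for a specific window comes from the difference between two readings.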

## 5. When to Scale - Clear Indicators

Here are the key indicators that it's time to scale rather than optimize:

### 5.1 Persistent High Resource Usage Despite Optimization

If you've implemented optimizations but still see:

- **CPU**: Consistently >85% utilization during normal traffic
- **Memory**: Consistently >90% usage with swapping
- **Disk I/O**: Persistent high wait times
- **Network**: Bandwidth saturation
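
"Consistently" matters here: a single reading above the threshold may just be a spike. One way to encode that is to require that most samples in a window exceed the threshold. The sketch below does so. The 80% fraction and the helper name `is_sustained` are assumptions, not values from this notebook.

```python
def is_sustained(samples, threshold, min_fraction=0.8):
    """Return True if at least `min_fraction` of samples exceed `threshold`."""
    if not samples:
        return False
    above = sum(1 for s in samples if s > threshold)
    return above / len(samples) >= min_fraction


# Twelve CPU readings (percent), e.g. one every five minutes over an hour
spike = [40, 42, 95, 41, 39, 44, 43, 40, 38, 41, 42, 40]
busy = [88, 91, 87, 90, 86, 92, 89, 84, 93, 90, 88, 91]

print(is_sustained(spike, threshold=85))
print(is_sustained(busy, threshold=85))
```

The first series contains one spike and is not flagged; the second stays above 85% in eleven of twelve readings and is.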

### 5.2 Response Time Degradation

- **Increasing response times**: Even after optimization
- **Timeout errors**: Appearing during peak traffic
- **Error rates**: Increasing under load
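
Averages hide the slow tail, so response-time degradation is usually tracked with percentiles as well. The sketch below computes p95 and p99 from a list of latencies with the standard library's `statistics.quantiles`. The sample latencies are synthetic and `latency_percentiles` is a hypothetical helper.

```python
import statistics

def latency_percentiles(latencies_ms):
    """Return the mean, p95, and p99 of a list of response times in milliseconds."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {
        'avg_ms': statistics.mean(latencies_ms),
        'p95_ms': cuts[94],
        'p99_ms': cuts[98],
    }


# Synthetic sample: 95 fast requests and 5 slow ones
latencies = [100] * 95 + [1500] * 5
stats = latency_percentiles(latencies)
print(stats)
```

In this sample the mean stays low while p95 and p99 expose the slow requests, which is why the tail percentiles are the ones to compare against an SLA.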

### 5.3 Traffic Growth Patterns

- **Sustained growth**: Not just temporary spikes
- **Predictable peaks**: That exceed current capacity
- **New features**: That will increase per-user resource requirements
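
Sustained growth can be turned into a rough deadline by fitting a straight line to recent usage and projecting when it crosses current capacity. The sketch below uses an ordinary least-squares fit in plain Python. The capacity figure, the sample series, and the helper name `days_until_capacity` are hypothetical, and a linear trend is itself an assumption.

```python
def days_until_capacity(daily_users, capacity):
    """Project how many days remain until a linear trend reaches `capacity`.

    Returns None if usage is flat or falling, and 0 if capacity is already reached.
    """
    n = len(daily_users)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_users) / n
    # Least-squares slope: covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_users)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_today = intercept + slope * (n - 1)
    if fitted_today >= capacity:
        return 0
    return (capacity - fitted_today) / slope


# Usage grows by 20 users per day from 1000; assume the server copes with 2000
history = [1000 + 20 * day for day in range(30)]
print(days_until_capacity(history, capacity=2000))
```

A projection like this only indicates urgency; temporary spikes should be filtered out of the history before fitting.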

In [None]:
# Function to analyze response time trends
def analyze_response_times(days=30):
    # Sample data - in practice, get this from monitoring tools
    dates = pd.date_range(end=datetime.now(), periods=days)
    
    # Generate sample data with an upward trend
    np.random.seed(42)
    base = np.linspace(100, 350, days)  # Increasing baseline from 100ms to 350ms
    variation = np.random.normal(0, 50, days)  # Add some noise
    p95_factor = 2.5  # p95 is 2.5x the average
    p99_factor = 5  # p99 is 5x the average
    
    response_data = {
        'date': dates,
        'avg_response_ms': base + variation,
        'p95_response_ms': (base + variation) * p95_factor + np.random.normal(0, 100, days),
        'p99_response_ms': (base + variation) * p99_factor + np.random.normal(0, 200, days),
        'error_rate': np.clip(np.random.normal(0.01, 0.005, days) + np.linspace(0, 0.02, days), 0, 1),
        'daily_users': np.random.randint(1000, 2000, days) + np.linspace(0, 1000, days).astype(int)
    }
    
    df = pd.DataFrame(response_data)
    
    # Create visualization
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(14, 12))
    
    # Plot response times
    ax1.plot(df['date'], df['avg_response_ms'], 'b-', label='Avg Response Time')
    ax1.plot(df['date'], df['p95_response_ms'], 'orange', label='p95 Response Time')
    ax1.plot(df['date'], df['p99_response_ms'], 'r-', label='p99 Response Time')
    ax1.axhline(y=300, color='r', linestyle='--', alpha=0.7, label='Target SLA (300ms)')
    ax1.set_ylabel('Response Time (ms)')
    ax1.set_title('API Response Time Trends')
    ax1.legend()
    ax1.grid(True)
    
    # Plot error rate
    ax2.plot(df['date'], df['error_rate'] * 100, 'r-')
    ax2.set_ylabel('Error Rate (%)')
    ax2.set_title('API Error Rate')
    ax2.axhline(y=1, color='orange', linestyle='--', alpha=0.7, label='Target (1%)')
    ax2.legend()  # Without this call the target line's label is never shown
    ax2.grid(True)
    
    # Plot user count
    ax3.bar(df['date'], df['daily_users'], color='skyblue')
    ax3.set_ylabel('Daily Active Users')
    ax3.set_title('User Growth')
    ax3.grid(True)
    
    plt.tight_layout()
    plt.show()
    
    # Calculate trend analysis
    first_week_avg = df['avg_response_ms'].head(7).mean()
    last_week_avg = df['avg_response_ms'].tail(7).mean()
    percent_change = (last_week_avg - first_week_avg) / first_week_avg * 100
    
    print(f"Response time trend: {percent_change:.1f}% change over {days} days")
    
    user_growth = (df['daily_users'].tail(7).mean() - df['daily_users'].head(7).mean()) / df['daily_users'].head(7).mean() * 100
    print(f"User growth: {user_growth:.1f}% over {days} days")
    
    # Make recommendation
    if percent_change > 50 and last_week_avg > 300:
        print("RECOMMENDATION: SCALE - Response times increasing significantly and exceeding SLA")
    elif percent_change > 30:
        print("RECOMMENDATION: OPTIMIZE then consider scaling if no improvement")
    else:
        print("RECOMMENDATION: Continue monitoring, current performance acceptable")
        
    return df

# Example usage
response_analysis = analyze_response_times()

## 6. Cost-Benefit Analysis

Let's create a simplified cost-benefit calculator for optimization vs. scaling:

In [None]:
def cost_benefit_analysis(current_instance_cost, optimization_cost, scaling_cost, expected_optimization_gain=0.2):
    """Calculate ROI of optimization vs scaling
    
    Parameters:
    - current_instance_cost: Monthly cost of current server
    - optimization_cost: One-time cost of optimization effort (developer time)
    - scaling_cost: Monthly cost of scaled infrastructure
    - expected_optimization_gain: Expected performance improvement (0-1)
    """
    
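    # Convert the expected improvement into extra capacity: if each request needs 30% fewer resources, the same server handles 1 / (1 - 0.3) ≈ 1.43× the load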
    effective_capacity_gain = 1 / (1 - expected_optimization_gain) - 1
    print(f"Effective capacity gain from optimization: {effective_capacity_gain:.1%}")
    
    # Calculate break-even point for optimization
    if scaling_cost > current_instance_cost:
        scaling_cost_difference = scaling_cost - current_instance_cost
        optimization_breakeven_months = optimization_cost / scaling_cost_difference
        print(f"Optimization breaks even in {optimization_breakeven_months:.1f} months compared to scaling")
    else:
        print("Scaled infrastructure is not more expensive than the current instance - verify the inputs")
    
    # Calculate 1-year cost
    year_cost_current = current_instance_cost * 12
    year_cost_optimized = current_instance_cost * 12 + optimization_cost
    year_cost_scaled = scaling_cost * 12
    
    results = pd.DataFrame({
        'Scenario': ['Current', 'Optimize', 'Scale'],
        'Initial Cost': [0, optimization_cost, 0],
        'Monthly Cost': [current_instance_cost, current_instance_cost, scaling_cost],
        '1-Year Total Cost': [year_cost_current, year_cost_optimized, year_cost_scaled],
        'Relative Capacity': [1, 1 + effective_capacity_gain, scaling_cost/current_instance_cost]  # Assuming linear scaling with cost
    })
    
    return results

# Example: $10/month t2.micro vs $20/month t2.small vs optimization work
analysis = cost_benefit_analysis(
    current_instance_cost=10,  # Current server $10/month
    optimization_cost=100,     # 5 hours optimization work at $20/hr
    scaling_cost=20,           # Scaled server $20/month
    expected_optimization_gain=0.3  # Expect 30% improvement from optimization
)

analysis
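
The break-even figure printed by the function can be cross-checked by accumulating costs month by month. The sketch below does that for the same example numbers ($10 current, $100 one-time optimization, $20 scaled). `first_month_optimization_wins` is a hypothetical helper, and it assumes optimization fully removes the need to scale.

```python
def first_month_optimization_wins(current_cost, optimization_cost, scaling_cost, horizon=36):
    """Return the first month in which optimizing is no more expensive than scaling.

    Returns None if that never happens within `horizon` months.
    """
    for month in range(1, horizon + 1):
        optimized_total = optimization_cost + current_cost * month
        scaled_total = scaling_cost * month
        if optimized_total <= scaled_total:
            return month
    return None


month = first_month_optimization_wins(current_cost=10, optimization_cost=100, scaling_cost=20)
print(f"Optimization is no more expensive than scaling from month {month}")
```

Before that month, scaling is the cheaper option in cash terms, which is why the expected lifetime of the setup matters to the decision.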

## 7. Practical Decision Framework for Your WatchParty App

Based on everything we've covered, here's a practical decision framework:

### Optimize First When:

1. **Memory usage** is high but **CPU usage** is moderate (<70%)
   - Focus on memory leaks, caching strategies
   - Adjust worker counts and memory limits
   
2. **Database queries** are slow or numerous
   - Add indexes, optimize ORM usage
   - Implement query caching
   
3. **Static assets** are uncompressed or not cached
   - Implement CDN, compress assets
   - Use proper cache headers
   
4. **API endpoints** are slow but not CPU-bound
   - Implement API-level caching
   - Optimize serialization

5. **Cost is a primary concern**
   - Optimization adds a one-time cost but avoids a recurring increase, so it is usually cheaper once past the break-even point computed in section 6

### Scale When:

1. **CPU usage** consistently exceeds 85% despite optimizations
   - Add more CPU cores/instances
   
2. **Memory usage** consistently exceeds 90% despite optimizations
   - Add more RAM
   
3. **User growth** is consistent and predictable
   - Scale ahead of projected needs
   
4. **Response times** continue to increase despite optimizations
   - Indicates fundamental resource limitations
   
5. **New features** require additional resources
   - Sometimes optimization isn't enough for new workloads

6. **Time is more valuable than money**
   - When developer time spent optimizing exceeds cost of scaling
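
The two lists above can be condensed into a small rule function. The sketch below is one possible encoding. The function name `recommend`, the `already_optimized` flag, and the ordering of the rules are assumptions layered on the thresholds used elsewhere in this notebook (70% and 85% for CPU, 75% and 90% for memory).

```python
def recommend(cpu_percent, memory_percent, already_optimized=False):
    """Return 'SCALE', 'OPTIMIZE', or 'MONITOR' from sustained usage figures."""
    cpu_high = cpu_percent > 85
    memory_high = memory_percent > 90
    if (cpu_high or memory_high) and already_optimized:
        # High usage persists despite optimization: a resource limit has been reached
        return 'SCALE'
    if cpu_percent > 70 or memory_percent > 75:
        # Pressure exists and optimization has not yet been exhausted
        return 'OPTIMIZE'
    return 'MONITOR'


print(recommend(90, 50, already_optimized=True))
print(recommend(90, 50))
print(recommend(40, 40))
```

The inputs are meant to be sustained figures (for example, averages over a week), not single readings.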

## 8. Monitoring Setup for WatchParty

To make informed decisions, set up proper monitoring:

In [None]:
# Sample monitoring script for key metrics
# This is just a conceptual example - in production use proper monitoring tools like Prometheus, Grafana, etc.

def monitor_health_metrics():
    # System metrics
    cpu = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory().percent
    swap = psutil.swap_memory().percent
    disk = psutil.disk_usage('/').percent
    
    # Sample API metrics (in production, get these from actual endpoints)
    # The timeout keeps the monitor from hanging if the app is unresponsive
    api_status = requests.get('http://localhost:8000/health/', timeout=5)
    response_time = api_status.elapsed.total_seconds() * 1000  # Convert to ms
    
    metrics = {
        'timestamp': datetime.now().isoformat(),
        'cpu_percent': cpu,
        'memory_percent': memory,
        'swap_percent': swap,
        'disk_percent': disk,
        'api_response_time_ms': response_time,
        'api_status_code': api_status.status_code
    }
    
    # In production, send to logging/monitoring system
    print(json.dumps(metrics, indent=2))
    
    # Check against thresholds
    warnings = []
    if cpu > 85: warnings.append(f"HIGH CPU: {cpu}%")
    if memory > 90: warnings.append(f"HIGH MEMORY: {memory}%")
    if swap > 70: warnings.append(f"HIGH SWAP: {swap}%")
    if response_time > 500: warnings.append(f"SLOW API: {response_time:.1f}ms")
    
    if warnings:
        print("WARNING: " + ", ".join(warnings))
        # In production, trigger alerts here
    
    return metrics

# Example usage
# In production, this would run on a schedule
try:
    current_health = monitor_health_metrics()
except Exception as e:
    print(f"Error monitoring health: {e}")

## 9. Recommended Monitoring Script for WatchParty

Here's a simple bash script you can use to monitor your WatchParty server and make optimization vs. scaling decisions:

In [None]:
%%writefile /var/www/watchparty/scripts/monitor_resources.sh
#!/bin/bash

# Simple monitoring script for WatchParty server
# Save as /var/www/watchparty/scripts/monitor_resources.sh
# Usage: ./monitor_resources.sh [--email admin@example.com] [--threshold 90]

# Default settings
THRESHOLD=90
EMAIL=""
LOG_FILE="/var/log/watchparty/resource_usage.log"
ALERT_HISTORY="/var/log/watchparty/resource_alerts.log"

# Process command line arguments
while [[ $# -gt 0 ]]; do
  case $1 in
    --email)
      EMAIL="$2"
      shift 2
      ;;
    --threshold)
      THRESHOLD="$2"
      shift 2
      ;;
    *)
      echo "Unknown option: $1"
      exit 1
      ;;
  esac
done

# Ensure log directory exists
mkdir -p "$(dirname "$LOG_FILE")"
mkdir -p "$(dirname "$ALERT_HISTORY")"

# Get timestamp
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")

# Get resource usage
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | sed "s/.*, *\([0-9.]*\)%* id.*/\1/" | awk '{print 100 - $1}')
MEMORY_USAGE=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
SWAP_USAGE=$(free | grep Swap | awk '{if ($2 > 0) print $3/$2 * 100.0; else print 0}')
DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')

# Round to integers
CPU_USAGE=$(printf "%.0f" "$CPU_USAGE")
MEMORY_USAGE=$(printf "%.0f" "$MEMORY_USAGE")
SWAP_USAGE=$(printf "%.0f" "$SWAP_USAGE")

# Get load average
LOAD_AVG=$(uptime | awk -F'load average:' '{ print $2 }' | sed 's/,.*//')

# Log resource usage
echo "$TIMESTAMP,CPU:$CPU_USAGE%,MEM:$MEMORY_USAGE%,SWAP:$SWAP_USAGE%,DISK:$DISK_USAGE%,LOAD:$LOAD_AVG" >> "$LOG_FILE"

# Check for alerts
ALERTS=""
if [ "$CPU_USAGE" -gt "$THRESHOLD" ]; then
  ALERTS="$ALERTS CPU usage is high: $CPU_USAGE%\n"
fi

if [ "$MEMORY_USAGE" -gt "$THRESHOLD" ]; then
  ALERTS="$ALERTS Memory usage is high: $MEMORY_USAGE%\n"
fi

if [ "$SWAP_USAGE" -gt "70" ]; then
  ALERTS="$ALERTS Swap usage is high: $SWAP_USAGE%\n"
fi

if [ "$DISK_USAGE" -gt "$THRESHOLD" ]; then
  ALERTS="$ALERTS Disk usage is high: $DISK_USAGE%\n"
fi

# Calculate processes by user
PROCESS_COUNT=$(ps -eo user | sort | uniq -c | sort -nr | head -5 | tr '\n' ' ')

# Get top memory processes
TOP_MEM_PROCESSES=$(ps -eo pid,pmem,rss,command --sort=-%mem | head -6 | tail -5 | sed 's/^/  /')

# Get top CPU processes
TOP_CPU_PROCESSES=$(ps -eo pid,pcpu,command --sort=-%cpu | head -6 | tail -5 | sed 's/^/  /')

# Print summary
echo "===== WatchParty Resource Monitor ====="
echo "Time: $TIMESTAMP"
echo "CPU Usage: $CPU_USAGE%"
echo "Memory Usage: $MEMORY_USAGE%"
echo "Swap Usage: $SWAP_USAGE%"
echo "Disk Usage: $DISK_USAGE%"
echo "Load Average: $LOAD_AVG"
echo
echo "Top Memory Processes:"
echo "$TOP_MEM_PROCESSES"
echo
echo "Top CPU Processes:"
echo "$TOP_CPU_PROCESSES"
echo

# Display decision guidance
if [ "$CPU_USAGE" -gt 85 ] && [ "$MEMORY_USAGE" -gt 85 ]; then
  echo "RECOMMENDATION: Consider SCALING the server (both CPU and memory are high)"
  echo "$TIMESTAMP - SCALING recommended (CPU: $CPU_USAGE%, MEM: $MEMORY_USAGE%)" >> "$ALERT_HISTORY"
elif [ "$CPU_USAGE" -gt 85 ]; then
  echo "RECOMMENDATION: Consider CPU OPTIMIZATION or scaling to higher CPU instance"
  echo "$TIMESTAMP - CPU OPTIMIZATION recommended (CPU: $CPU_USAGE%)" >> "$ALERT_HISTORY"
elif [ "$MEMORY_USAGE" -gt 85 ]; then
  echo "RECOMMENDATION: Consider MEMORY OPTIMIZATION or scaling to higher memory instance"
  echo "$TIMESTAMP - MEMORY OPTIMIZATION recommended (MEM: $MEMORY_USAGE%)" >> "$ALERT_HISTORY"
elif [ "$CPU_USAGE" -gt 70 ] || [ "$MEMORY_USAGE" -gt 70 ]; then
  echo "RECOMMENDATION: Monitor closely, optimize if consistent pattern observed"
else
  echo "RECOMMENDATION: Resource usage acceptable, no action needed"
fi

# Send email alert if configured and threshold exceeded
if [ -n "$ALERTS" ] && [ -n "$EMAIL" ]; then
  echo -e "WatchParty Server Alert\n\n$ALERTS\nTime: $TIMESTAMP\n\nTop Memory Processes:\n$TOP_MEM_PROCESSES\n\nTop CPU Processes:\n$TOP_CPU_PROCESSES" | mail -s "WatchParty Server Alert" "$EMAIL"
  echo "Alerts sent to $EMAIL"
elif [ -n "$ALERTS" ]; then
  echo -e "ALERTS:\n$ALERTS"
fi

# Done
exit 0

In [None]:
# Make script executable
!chmod +x /var/www/watchparty/scripts/monitor_resources.sh

# Run the script once to test
!/var/www/watchparty/scripts/monitor_resources.sh
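
The log lines written by the script follow the format `timestamp,CPU:..%,MEM:..%,SWAP:..%,DISK:..%,LOAD:..`, so they can be loaded back into Python for trend analysis. The sketch below parses one such line. The sample line is made up and `parse_resource_line` is a hypothetical helper.

```python
from datetime import datetime

def parse_resource_line(line):
    """Parse one line of resource_usage.log into a dict of numeric values."""
    timestamp, *fields = line.strip().split(',')
    record = {'timestamp': datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S')}
    for field in fields:
        key, value = field.split(':', 1)
        record[key.lower()] = float(value.strip().rstrip('%'))
    return record


# Made-up sample line in the format the script writes
sample = "2024-05-01 12:00:00,CPU:45%,MEM:62%,SWAP:0%,DISK:55%,LOAD: 0.15"
record = parse_resource_line(sample)
print(record)
```

Applying this to every line of the log yields a list of records that could be passed to `pd.DataFrame` for the kind of trend plots shown earlier.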

## 10. Set up Cron Job for Regular Monitoring

To run the monitoring script regularly:

In [None]:
%%writefile setup_cron_job.sh
#!/bin/bash

# Set up cron job for regular monitoring
# This will run the monitoring script every 15 minutes

# Create crontab entry
(crontab -l 2>/dev/null; echo "*/15 * * * * /var/www/watchparty/scripts/monitor_resources.sh >> /var/log/watchparty/monitoring_output.log 2>&1") | crontab -

echo "Cron job set up to run every 15 minutes"
echo "To add email alerts, run: /var/www/watchparty/scripts/monitor_resources.sh --email admin@example.com"

In [None]:
# Make setup script executable
!chmod +x setup_cron_job.sh

# Run setup script
# Uncomment the next line to actually set up the cron job
# !./setup_cron_job.sh

## 11. Conclusion: The Decision Framework

To summarize the decision process:

1. **Monitor key metrics** consistently
   - CPU, memory, disk usage, response times
   
2. **Implement optimizations first** when:
   - Resource usage is moderate (<85%)
   - Specific bottlenecks are identified
   - Cost is a primary concern
   
3. **Scale when**:
   - Optimizations have been exhausted
   - Resource usage remains consistently high
   - Growth is sustained and predictable
   - Response times continue to degrade
   
4. **Consider hybrid approach**:
   - Optimize the current workload
   - Scale specific components as needed
   - Use auto-scaling for variable traffic, or scheduled scaling for predictable peaks

5. **Re-evaluate regularly**:
   - User growth patterns
   - New feature resource requirements
   - Cost vs. performance tradeoffs

By following this framework, you'll make informed decisions about when to optimize your WatchParty application versus when to scale its infrastructure.