[Epic] 💾 Document Backup & Restore - Data Protection Strategy #1355

@crivetimihai

💾 Backup & Restore - Comprehensive Data Protection Strategy

Goal

Implement comprehensive backup and restore strategy covering both database-level and application-level data protection:

  1. Database backups (PostgreSQL pg_dump, SQLite backups, WAL archiving)
  2. Point-in-time recovery (PITR) for PostgreSQL with WAL archiving
  3. Application-level export/import enhancement and automation
  4. Automated backup scheduling with retention policies
  5. Backup verification and testing procedures
  6. Restore procedures documentation and testing
  7. Disaster recovery (DR) runbooks and procedures
  8. Backup monitoring and alerting

This ensures data protection, business continuity, and ability to recover from data loss, corruption, or disasters.

Why Now?

Data protection is critical for production deployments:

  1. Data Loss Prevention: Protect against accidental deletion, corruption, or hardware failure
  2. Disaster Recovery: Enable recovery from catastrophic failures
  3. Compliance: Meet regulatory requirements for data retention and backup
  4. Point-in-Time Recovery: Restore to any point in time before data corruption
  5. Migration Support: Enable safe migrations between environments
  6. Business Continuity: Minimize downtime and data loss (RPO/RTO)

📖 User Stories

US-1: Operations Engineer - Automated Daily Backups

As an Operations Engineer
I want automated daily database backups with retention policies
So that I can recover data without manual intervention

Acceptance Criteria:

```gherkin
Given I have deployed MCP Gateway to production
When the backup cron job runs at 2 AM daily
Then a full database backup should be created automatically
And the backup should be compressed and encrypted
And old backups should be cleaned up per retention policy (keep 7 daily, 4 weekly, 12 monthly)
And backup success/failure should be logged and alerted

Given a backup completes successfully
When I check the backup directory
Then the backup file should exist with correct naming (mcpgateway-backup-20250128-020000.sql.gz)
And the backup should be verified for integrity
And backup metadata should be recorded (size, duration, entity counts)

Given a backup fails
When checking monitoring alerts
Then I should receive an alert notification
And the failure should be logged with error details
And the previous backup should still be available
```

Technical Requirements:

  • Cron job or systemd timer for automated backups
  • pg_dump wrapper script with error handling
  • Compression (gzip) and encryption (GPG) support
  • Retention policy enforcement (delete old backups)
  • Backup verification after creation
  • Prometheus metrics for backup monitoring
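
The requirements above might be sketched as a small wrapper script. This is a hedged sketch, not the final implementation: the script path, environment variable names, and the GPG invocation are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of scripts/backup_postgresql.sh -- paths, variable names,
# and the GPG call are assumptions, not the final implementation.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/var/backups/mcpgateway}"

# Build the timestamped filename from the naming convention.
backup_filename() {
    printf 'mcpgateway-db-backup-%s.sql.gz' "$(date +%Y%m%d-%H%M%S)"
}

run_backup() {
    local out="${BACKUP_DIR}/$(backup_filename)"
    mkdir -p "$BACKUP_DIR"
    # Plain-format dump piped through gzip; swap in `pg_dump -Fc`
    # if verification via `pg_restore --list` is preferred.
    pg_dump "$DATABASE_URL" | gzip > "$out"
    # Quick integrity check of the compressed archive.
    gzip -t "$out"
    # Optional encryption when a GPG recipient is configured.
    if [ -n "${BACKUP_ENCRYPTION_KEY:-}" ]; then
        gpg --batch --yes -r "$BACKUP_ENCRYPTION_KEY" -e "$out" && rm -f "$out"
    fi
    echo "backup written: $out"
}

# Guarded so the functions can be sourced and tested without a database.
if [ "${RUN_BACKUP:-0}" = "1" ]; then
    run_backup
fi
```

Retention cleanup and Prometheus metrics would hang off the same script; those pieces are covered in later phases.
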

US-2: Database Administrator - Point-in-Time Recovery

As a Database Administrator
I want PostgreSQL WAL archiving and point-in-time recovery
So that I can restore to any point in time before data corruption

Acceptance Criteria:

```gherkin
Given PostgreSQL is configured with WAL archiving
When transactions occur throughout the day
Then WAL segments should be archived continuously
And archived WAL files should be stored in backup location
And WAL archiving should not impact database performance

Given data corruption occurs at 2:30 PM
When I need to restore to 2:25 PM (before corruption)
Then I should be able to restore the latest base backup
And replay WAL segments up to 2:25 PM
And the database should be in consistent state at that point
And no data should be lost from before 2:25 PM

Given I perform point-in-time recovery
When the restore completes
Then the recovery process should be logged
And I should verify data consistency
And the database should be ready for production use
```

Technical Requirements:

  • PostgreSQL `wal_level = replica` configuration
  • `archive_mode = on` with `archive_command` configured
  • WAL archiving to backup storage (local, S3, NFS)
  • Base backup creation with pg_basebackup
  • Recovery configuration via recovery.signal plus restore settings in postgresql.conf (PostgreSQL 12+), or recovery.conf on PostgreSQL 11 and earlier
  • Documented PITR procedure with examples
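
The PITR flow on PostgreSQL 12+ can be sketched as follows; the directory paths and the `cp`-based `restore_command` are assumptions to be adapted to the actual archive location.

```shell
#!/usr/bin/env bash
# Hedged sketch of scripts/pitr_restore.sh for PostgreSQL 12+;
# paths and the cp-based restore_command are assumptions.
set -euo pipefail

WAL_ARCHIVE="${WAL_ARCHIVE:-/var/lib/postgresql/wal_archive}"
PGDATA="${PGDATA:-/var/lib/postgresql/data}"

# Emit the recovery settings appended to postgresql.conf for a PITR target.
recovery_target_conf() {
    local target_time="$1"
    printf "restore_command = 'cp %s/%%f %%p'\n" "$WAL_ARCHIVE"
    printf "recovery_target_time = '%s'\n" "$target_time"
    printf "recovery_target_action = 'promote'\n"
}

run_pitr() {
    # 1. Stop PostgreSQL and move the corrupted cluster aside.
    # 2. Restore the latest base backup (pg_basebackup output) into $PGDATA.
    # 3. Append recovery settings and create recovery.signal.
    recovery_target_conf "$1" >> "$PGDATA/postgresql.conf"
    touch "$PGDATA/recovery.signal"
    # 4. Start PostgreSQL; it replays WAL up to the target, then promotes.
}

# Guarded so the functions can be sourced and tested without a cluster.
if [ "${RUN_PITR:-0}" = "1" ]; then
    run_pitr "$1"
fi
```
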

US-3: DevOps Engineer - Application Export/Import

As a DevOps Engineer
I want to export and import gateway configuration between environments
So that I can migrate configurations safely

Acceptance Criteria:

```gherkin
Given I need to migrate configuration from staging to production
When I run `mcpgateway export --output staging-config.json`
Then all tools, servers, gateways, prompts, and resources should be exported
And authentication data should be encrypted
And dependencies should be included
And the export should be in portable JSON format

Given I have an export file from another environment
When I run `mcpgateway import staging-config.json --dry-run`
Then the import should be validated without making changes
And conflicts should be identified (duplicate names, etc.)
And import plan should be displayed

Given I proceed with actual import
When I run `mcpgateway import staging-config.json --strategy merge`
Then entities should be imported with conflict resolution
And existing entities should be preserved (merge strategy)
And import should be atomic (all or nothing)
And import results should be logged (created, updated, skipped)
```

Technical Requirements:

  • Export/import CLI commands (already implemented)
  • Support for filtering by type, tags, active/inactive
  • Conflict resolution strategies (skip, overwrite, merge)
  • Dry-run mode for validation
  • Encryption of sensitive data (API keys, passwords)
  • Import rollback on failure
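
The staging-to-production migration described above might be scripted roughly as follows. The CLI flags are the ones shown in the acceptance criteria; the file name and the availability guard are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of a staging-to-production migration using the export/import CLI.
# The two-step dry-run-then-merge flow is the point of this example.
set -euo pipefail

EXPORT_FILE="staging-config.json"

migrate() {
    # 1. Export everything from staging (auth data is encrypted in the file).
    mcpgateway export --output "$EXPORT_FILE"
    # 2. Validate against production without making changes.
    mcpgateway import "$EXPORT_FILE" --dry-run
    # 3. Apply with merge strategy: existing entities are preserved.
    mcpgateway import "$EXPORT_FILE" --strategy merge
}

if command -v mcpgateway >/dev/null 2>&1; then
    migrate
else
    echo "mcpgateway CLI not on PATH; showing the plan only"
fi
```
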

US-4: System Administrator - Disaster Recovery Testing

As a System Administrator
I want to test disaster recovery procedures regularly
So that I can ensure backups are valid and recovery works

Acceptance Criteria:

```gherkin
Given I have regular database backups
When I perform monthly DR testing
Then I should be able to restore to a separate test environment
And verify all data is present and correct
And test database queries and API functionality
And document any issues found during testing

Given a restore test completes
When reviewing the test results
Then restore time should be within RTO (Recovery Time Objective)
And data completeness should be verified
And application should be functional after restore
And test results should be documented

Given a restore test fails
When analyzing the failure
Then the root cause should be identified
And backup procedures should be improved
And DR documentation should be updated
And the next test should verify the fix
```

Technical Requirements:

  • DR test environment (isolated from production)
  • Automated restore testing scripts
  • Data verification queries
  • RTO/RPO measurement and tracking
  • DR test checklist and runbook
  • Quarterly DR testing schedule
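
The data-completeness check at the heart of the restore test can be as simple as diffing per-table row counts between production and the restored copy. A sketch, where the test database name and the count query are assumptions:

```shell
#!/usr/bin/env bash
# Sketch of scripts/test_restore.sh; the verification query and the
# test database name are assumptions.
set -euo pipefail

# Compare "table count" lines from production vs the restored test database;
# any mismatch means the restore is incomplete.
compare_counts() {
    local expected="$1" actual="$2"
    if diff -u "$expected" "$actual"; then
        echo "restore verification passed"
    else
        echo "restore verification FAILED" >&2
        return 1
    fi
}

run_restore_test() {
    createdb mcpgateway_restore_test
    gunzip -c "$1" | psql mcpgateway_restore_test
    psql -At -F' ' -c 'SELECT relname, n_live_tup FROM pg_stat_user_tables ORDER BY 1' \
        mcpgateway_restore_test > /tmp/actual.counts
    compare_counts /tmp/expected.counts /tmp/actual.counts
}

# Guarded so the comparison logic can be tested without a database.
if [ "${RUN_RESTORE_TEST:-0}" = "1" ]; then
    run_restore_test "$1"
fi
```
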

🏗 Architecture

Backup Strategy Overview

```mermaid
graph TD
subgraph "Database Backups"
    DB[(PostgreSQL<br/>Production)]
    PG_DUMP[pg_dump<br/>Full Backup]
    WAL[WAL Archiving<br/>Continuous]
    DB --> PG_DUMP
    DB --> WAL
end

subgraph "Application Backups"
    API[MCP Gateway API]
    EXPORT[Export Service<br/>JSON/YAML]
    API --> EXPORT
end

subgraph "Backup Storage"
    LOCAL[Local Storage<br/>/backups]
    S3[S3/Cloud Storage<br/>Offsite]
    PG_DUMP --> LOCAL
    WAL --> LOCAL
    EXPORT --> LOCAL
    LOCAL --> S3
end

subgraph "Restore Operations"
    RESTORE_DB[pg_restore<br/>Database]
    PITR[Point-in-Time<br/>Recovery]
    IMPORT[Import Service<br/>JSON/YAML]
    LOCAL --> RESTORE_DB
    LOCAL --> PITR
    LOCAL --> IMPORT
end

style DB fill:#336791
style PG_DUMP fill:#51cf66
style WAL fill:#51cf66
style EXPORT fill:#339af0
style S3 fill:#ff922b

```

Backup Types and Frequency

| Backup Type | Method | Frequency | Retention | Purpose |
|---|---|---|---|---|
| Full Database | pg_dump | Daily 2 AM | 7 daily, 4 weekly, 12 monthly | Complete database backup |
| WAL Archives | PostgreSQL WAL | Continuous | 30 days | Point-in-time recovery |
| Application Export | JSON export | Daily 3 AM | 30 days | Configuration backup |
| Before Migration | pg_dump + export | On-demand | Until verified | Safety net for changes |

Recovery Objectives

  • RPO (Recovery Point Objective): 1 hour (with WAL archiving)
  • RTO (Recovery Time Objective): 2 hours for full restore
  • PITR Window: 30 days (limited by WAL retention)

Backup File Naming Convention

```bash
# Database backups
mcpgateway-db-backup-{YYYYMMDD}-{HHMMSS}.sql.gz
mcpgateway-db-backup-20250128-020000.sql.gz

# Application exports
mcpgateway-export-{YYYYMMDD}-{HHMMSS}.json
mcpgateway-export-20250128-030000.json

# WAL archives
{pg_wal_directory}/0000000100000A1F000000C8.gz
```


📋 Implementation Tasks

Phase 1: Database Backup Scripts ✅

PostgreSQL Backup Script

  • Create `scripts/backup_postgresql.sh` script
  • Implement pg_dump with compression (`pg_dump | gzip`, or `pg_dump -Fc` for a custom-format dump)
  • Add backup verification (`gzip -t` for plain dumps; `pg_restore --list` for custom-format dumps)
  • Add error handling and logging
  • Add encryption support (optional GPG)
  • Add backup size and duration metrics
  • Test with sample database

SQLite Backup Script

  • Create `scripts/backup_sqlite.sh` script
  • Implement SQLite backup using `.backup` command or file copy
  • Handle WAL checkpoint before backup
  • Add compression and verification
  • Test backup integrity

Backup Configuration

  • Add `BACKUP_DIR` environment variable (default: `/var/backups/mcpgateway`)
  • Add `BACKUP_RETENTION_DAYS` environment variable (default: 7)
  • Add `BACKUP_COMPRESSION` setting (gzip, zstd, none)
  • Add `BACKUP_ENCRYPTION_KEY` for GPG encryption (optional)
  • Document all backup settings in .env.example
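
Collected in one place, the settings above might look like this in .env.example (defaults beyond those stated in the tasks are assumptions):

```shell
# --- Backup settings (sketch for .env.example) ---
BACKUP_DIR=/var/backups/mcpgateway   # where backups are written
BACKUP_RETENTION_DAYS=7              # daily backups to keep before cleanup
BACKUP_COMPRESSION=gzip              # gzip, zstd, or none
BACKUP_ENCRYPTION_KEY=               # GPG recipient/key id; empty disables encryption
```
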

Phase 2: WAL Archiving (PostgreSQL) ✅

Configure WAL Archiving

  • Update postgresql.conf with WAL settings
    ```conf
    wal_level = replica
    archive_mode = on
    archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
    archive_timeout = 300 # Force WAL switch every 5 minutes
    ```
  • Create WAL archive directory
  • Set proper permissions on WAL archive directory
  • Test WAL archiving with `SELECT pg_switch_wal();`

WAL Archive Script

  • Create `scripts/archive_wal.sh` wrapper script
  • Add compression for archived WAL files
  • Add error handling for archive failures
  • Monitor WAL archive directory size
  • Implement WAL cleanup for old archives (>30 days)

Point-in-Time Recovery Setup

  • Create `scripts/pitr_restore.sh` script
  • Implement pg_basebackup for base backups
  • Document PITR restore procedure
  • Create recovery.conf template
  • Test PITR on staging environment

Phase 3: Automated Backup Scheduling ✅

Cron Jobs (Linux)

  • Create `/etc/cron.d/mcpgateway-backup` file
  • Schedule daily database backup: `0 2 * * * backup_postgresql.sh`
  • Schedule daily application export: `0 3 * * * mcpgateway export`
  • Add email notifications on failure (MAILTO)
  • Test cron job execution
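
A minimal `/etc/cron.d/mcpgateway-backup` might look like this; the user name, installed script paths, and notification address are assumptions:

```
# /etc/cron.d/mcpgateway-backup -- sketch
MAILTO=ops@example.com
# Daily database backup at 02:00, application export at 03:00
0 2 * * * mcpgw /usr/local/bin/backup_postgresql.sh >> /var/log/mcpgateway/backup.log 2>&1
0 3 * * * mcpgw /usr/local/bin/backup_application.sh >> /var/log/mcpgateway/export.log 2>&1
```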

Systemd Timers (Alternative)

  • Create `mcpgateway-backup.service` systemd unit
  • Create `mcpgateway-backup.timer` systemd timer
  • Enable and start timer: `systemctl enable --now mcpgateway-backup.timer`
  • View timer status: `systemctl list-timers`
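
A sketch of the two units; the service user and script path are assumptions:

```ini
# /etc/systemd/system/mcpgateway-backup.service (sketch)
[Unit]
Description=MCP Gateway database backup

[Service]
Type=oneshot
User=mcpgw
ExecStart=/usr/local/bin/backup_postgresql.sh

# /etc/systemd/system/mcpgateway-backup.timer (sketch)
[Unit]
Description=Run MCP Gateway backup daily at 02:00

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
RandomizedDelaySec=10m

[Install]
WantedBy=timers.target
```

`Persistent=true` makes a missed run (host down at 02:00) fire on next boot, which plain cron does not do.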

Kubernetes CronJobs (for K8s deployments)

  • Create CronJob manifest for database backups
  • Mount backup volume (PVC or external storage)
  • Configure image with pg_dump and backup scripts
  • Add environment variables for database credentials
  • Test CronJob execution in staging

Phase 4: Backup Retention and Cleanup ✅

Retention Policy Implementation

  • Create `scripts/cleanup_backups.sh` script
  • Implement retention logic:
    • Keep all backups from last 7 days
    • Keep one backup per week for last 4 weeks
    • Keep one backup per month for last 12 months
  • Add dry-run mode for testing
  • Add logging for deleted backups
  • Schedule cleanup job (daily or weekly)
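
The core selection logic for `scripts/cleanup_backups.sh` can stay tiny because the timestamp is embedded in the filename, so lexical sort equals chronological sort. A sketch of the daily tier:

```shell
#!/usr/bin/env bash
# Sketch of the daily-tier retention selection for scripts/cleanup_backups.sh.
set -euo pipefail

# Print the backups that fall outside the newest-N window (deletion candidates).
expired_backups() {
    local keep="$1"; shift
    printf '%s\n' "$@" | sort -r | tail -n +"$((keep + 1))"
}

# The weekly and monthly tiers apply the same idea at coarser granularity:
# bucket filenames by ISO week / by month, keep the newest file per bucket.
```

A dry-run mode would print these candidates instead of deleting them.
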

WAL Archive Cleanup

  • Implement WAL archive cleanup based on age
  • Keep WAL archives for PITR window (30 days)
  • Use `pg_archivecleanup` utility for safe cleanup
  • Add monitoring for WAL archive disk usage

Phase 5: Application Export Enhancement ✅

Export CLI Improvements

  • Add `--format` option (json, yaml) (already supports JSON)
  • Add `--compress` option for gzip compression
  • Add `--encrypt` option for GPG encryption
  • Add `--verify` option to validate export
  • Add progress bar for large exports
  • Improve error messages and user feedback

Automated Application Exports

  • Create `scripts/backup_application.sh` wrapper
  • Export all entity types automatically
  • Include metadata (timestamp, version, entity counts)
  • Compress and optionally encrypt exports
  • Store exports in backup directory
  • Apply retention policy to exports

Export Verification

  • Implement JSON schema validation
  • Verify all required fields present
  • Check for data consistency
  • Validate encryption if enabled
  • Report export statistics

Phase 6: Restore Procedures ✅

Database Restore Script

  • Create `scripts/restore_postgresql.sh` script
  • Implement pg_restore with progress reporting
  • Add pre-restore validation (check backup integrity)
  • Add post-restore verification (row counts, checksums)
  • Add rollback on failure
  • Test restore on empty database

Application Import Enhancement

  • Improve import CLI with better progress reporting
  • Add `--verify` mode to check import before applying
  • Add `--backup-before-import` option
  • Add detailed logging of import operations
  • Add rollback mechanism for failed imports

Restore Testing Automation

  • Create `scripts/test_restore.sh` script
  • Automate restore to test environment
  • Verify data completeness (compare row counts)
  • Run smoke tests on restored database
  • Generate restore test report
  • Schedule monthly restore testing

Phase 7: Backup Verification ✅

Backup Integrity Checks

  • Implement post-backup verification
  • For PostgreSQL: `pg_restore --list` (custom-format dumps only; for plain `pg_dump | gzip` dumps use `gzip -t`)
  • For SQLite: open the database and run `PRAGMA integrity_check`
  • Verify file checksums (SHA256)
  • Store verification results in metadata file
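
The checksum half of this verification can be as small as two helpers; recording the hash next to the backup file is an assumption about the metadata layout.

```shell
#!/usr/bin/env bash
# Sketch: record a SHA256 at backup time, re-check it later.
set -euo pipefail

# Write <backup>.sha256 next to the backup file.
record_checksum() {
    sha256sum "$1" > "$1.sha256"
}

# Returns non-zero if the file no longer matches its recorded hash.
verify_checksum() {
    sha256sum -c "$1.sha256" >/dev/null 2>&1
}
```
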

Backup Metadata Tracking

  • Create metadata file for each backup (JSON)
  • Record backup details:
    • Timestamp, duration, size
    • Database version, row counts
    • Backup type (full, incremental)
    • Verification status and checksum
  • Store metadata alongside backup files
  • Use metadata for monitoring and reporting
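
A sketch of the per-backup metadata file; the exact field set is an assumption based on the details listed above.

```shell
#!/usr/bin/env bash
# Sketch: write <backup>.meta.json alongside each backup file.
set -euo pipefail

write_metadata() {
    local backup="$1" duration="$2"
    cat > "$backup.meta.json" <<EOF
{
  "file": "$(basename "$backup")",
  "size_bytes": $(wc -c < "$backup"),
  "duration_seconds": $duration,
  "sha256": "$(sha256sum "$backup" | cut -d' ' -f1)",
  "created_at": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}
EOF
}
```
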

Corruption Detection

  • Implement regular backup scanning
  • Verify file checksums against metadata
  • Alert on corrupted backups
  • Automatically re-run backup if corruption detected

Phase 8: Disaster Recovery Documentation ✅

DR Runbooks

  • Create `docs/disaster-recovery.md` guide
  • Document full database restore procedure
  • Document point-in-time recovery procedure
  • Document application import procedure
  • Document emergency contacts and escalation
  • Include troubleshooting section

Backup/Restore Playbooks

  • Create step-by-step restore checklist
  • Document backup verification procedures
  • Document WAL archiving troubleshooting
  • Document common failure scenarios
  • Include command examples and screenshots

RTO/RPO Documentation

  • Document recovery objectives (RTO: 2 hours, RPO: 1 hour)
  • Document backup schedule and retention
  • Document test schedule (quarterly DR tests)
  • Include SLA commitments if applicable

Phase 9: Backup Monitoring and Alerting ✅

Prometheus Metrics

  • Add backup success/failure counter
  • Add backup duration histogram
  • Add backup size gauge
  • Add last backup timestamp gauge
  • Add WAL archive lag metric
  • Export metrics via `/metrics` endpoint or separate exporter
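
One low-friction option for shell-based backup jobs is the node_exporter textfile collector; this sketch assumes that route, and the metric names are illustrative, not final.

```shell
#!/usr/bin/env bash
# Sketch: publish backup metrics via the node_exporter textfile collector.
# Metric names and the collector path are assumptions.
set -euo pipefail

write_backup_metrics() {
    local outfile="$1" status="$2" duration="$3" size="$4"
    cat > "$outfile" <<EOF
# HELP mcpgateway_backup_last_success_timestamp_seconds Unix time of last successful backup
# TYPE mcpgateway_backup_last_success_timestamp_seconds gauge
mcpgateway_backup_last_success_timestamp_seconds $(date +%s)
mcpgateway_backup_last_status $status
mcpgateway_backup_duration_seconds $duration
mcpgateway_backup_size_bytes $size
EOF
}

# Typical call at the end of a successful backup run:
# write_backup_metrics /var/lib/node_exporter/textfile/mcpgateway_backup.prom 1 42 5368709120
```
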

Alerting Rules

  • Alert: Backup failed or overdue (last successful backup >25 hours ago)
  • Alert: Backup size anomaly (>2x average or <50% average)
  • Alert: WAL archive lag (>100 segments behind)
  • Alert: Backup disk space low (<20% free)
  • Alert: Backup verification failed
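
Expressed as Prometheus alerting rules, the first and fourth alerts might look like this; the metric and label names are assumptions that must match whatever the backup jobs actually export:

```yaml
groups:
  - name: mcpgateway-backups
    rules:
      - alert: MCPGatewayBackupOverdue
        expr: time() - mcpgateway_backup_last_success_timestamp_seconds > 25 * 3600
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "No successful MCP Gateway backup in the last 25 hours"
      - alert: MCPGatewayBackupDiskLow
        expr: node_filesystem_avail_bytes{mountpoint="/var/backups"} / node_filesystem_size_bytes{mountpoint="/var/backups"} < 0.20
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Backup volume has less than 20% free space"
```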

Monitoring Dashboard

  • Create Grafana dashboard for backup monitoring
  • Show backup success rate over time
  • Show backup duration trend
  • Show disk usage for backups
  • Show WAL archive status

Phase 10: Container and Cloud Backups ✅

Docker Volume Backups

  • Document Docker volume backup procedure
  • Create script to backup Docker volumes
  • Support named volumes and bind mounts
  • Test backup/restore of Docker volumes

Kubernetes Backup Strategy

  • Document PVC (PersistentVolumeClaim) backups
  • Integrate with Velero for cluster backups
  • Configure snapshot storage class if supported
  • Test volume snapshot restore

Cloud Storage Integration

  • Add S3/Cloud Storage backend support
  • Implement `scripts/sync_to_s3.sh` script
  • Use AWS CLI or rclone for uploads
  • Configure lifecycle policies for offsite backups
  • Add encryption for cloud-stored backups
  • Test restore from cloud storage

Phase 11: Backup Security ✅

Encryption at Rest

  • Implement GPG encryption for backups
  • Generate encryption keypair for backups
  • Document key management procedures
  • Encrypt backups before upload to cloud
  • Store encryption keys securely (vault, KMS)

Access Control

  • Restrict backup file permissions (600 or 640)
  • Use dedicated backup user account
  • Limit database user permissions for pg_dump
  • Audit backup access logs

Compliance

  • Document backup encryption for compliance (GDPR, HIPAA)
  • Implement backup retention per regulations
  • Secure deletion of expired backups
  • Maintain audit trail of backup operations

Phase 12: Testing and Quality Assurance ✅

Unit Tests

  • Add tests for backup script functions
  • Test retention policy logic
  • Test backup verification functions
  • Mock pg_dump/pg_restore for testing

Integration Tests

  • Test full backup workflow (backup → verify → restore)
  • Test PITR workflow (base backup → WAL replay)
  • Test export/import workflow
  • Test backup cleanup and retention
  • Verify backups work across PostgreSQL versions

DR Testing

  • Perform quarterly disaster recovery tests
  • Document test procedures and results
  • Measure actual RTO and RPO
  • Update procedures based on test findings

✅ Success Criteria

  • Automated daily database backups with 99%+ success rate
  • WAL archiving configured for point-in-time recovery
  • Application export/import working with all entity types
  • Retention policy enforced (7 daily, 4 weekly, 12 monthly)
  • Backup verification passing 100% of backups
  • Restore procedures documented and tested
  • DR runbook complete with step-by-step instructions
  • Monitoring and alerting configured for backup health
  • Quarterly DR testing scheduled and passing
  • RTO ≤ 2 hours, RPO ≤ 1 hour achieved in testing

🏁 Definition of Done

  • PostgreSQL backup script created and tested
  • SQLite backup script created and tested
  • WAL archiving configured for PostgreSQL
  • PITR restore procedure documented and tested
  • Automated backup scheduling configured (cron/systemd)
  • Retention policy implemented and tested
  • Application export CLI enhanced (compression, encryption)
  • Database restore scripts created and tested
  • Backup verification implemented
  • Backup metadata tracking implemented
  • Disaster recovery runbook created
  • Backup monitoring and alerting configured
  • Cloud storage integration implemented (S3/Cloud)
  • Backup encryption implemented (GPG)
  • DR testing performed successfully
  • Documentation complete (backup guide, DR runbook)
  • Ready for production deployment

📝 Additional Notes

🔹 Backup Strategy Trade-offs:

  • Full backups: Simple, independent, but large and slow
  • Incremental backups: Smaller, faster, but complex restore
  • WAL archiving: Continuous, point-in-time recovery, but requires base backup
  • Recommendation: Full daily backups + WAL archiving for best balance

🔹 Retention Policy Calculation:
```
Total backups = 7 daily + 4 weekly + 12 monthly = 23 backups
Disk space = backup_size * 23 * compression_ratio
Example: 5GB backup * 23 * 0.1 (gzip) = 11.5GB storage needed
```

🔹 PostgreSQL Backup Methods Comparison:

  • pg_dump: Logical backup, portable, slow for large databases (>100GB)
  • pg_basebackup: Physical backup, fast, includes WAL, requires same PG version
  • WAL archiving: Continuous backup, enables PITR, requires base backup
  • Recommendation: pg_dump for daily backups, WAL archiving for PITR

🔹 Backup Encryption Best Practices:

  • Use GPG symmetric encryption for backups: `gpg -c backup.sql.gz`
  • Store encryption key in secure vault (HashiCorp Vault, AWS KMS)
  • Use different keys for different environments (dev, staging, prod)
  • Rotate encryption keys annually
  • Test restore with encrypted backups regularly

🔹 Cloud Storage Considerations:

  • Use lifecycle policies to move old backups to cheaper storage (S3 Glacier)
  • Enable versioning for accidental deletion protection
  • Use encryption at rest (SSE-S3, SSE-KMS)
  • Use encryption in transit (HTTPS)
  • Consider cross-region replication for DR

🔹 Common Backup Pitfalls to Avoid:

  • ❌ Not testing restores regularly (backup without restore = no backup!)
  • ❌ Backing up to same disk as database (single point of failure)
  • ❌ No offsite backups (vulnerable to site-wide disasters)
  • ❌ No backup verification (corrupted backups discovered during restore)
  • ❌ No monitoring (failed backups go unnoticed for days)
  • ❌ Forgetting to backup configuration files and secrets

🔹 Testing Checklist:

  • ✅ Backup creation succeeds and completes within time limit
  • ✅ Backup verification passes (pg_restore --list)
  • ✅ Restore completes successfully on clean database
  • ✅ Data integrity verified after restore (checksums, row counts)
  • ✅ Application functional after restore
  • ✅ PITR works for various timestamps
  • ✅ Backup cleanup removes old backups correctly
  • ✅ Alerts fire when backups fail
  • ✅ Offsite backups uploaded successfully

Labels: chore, devops, enhancement, python