OPS-08: Backup/restore automation and DR drill playbook #663
Conversation
- `scripts/backup.sh`: timestamped hot backup via `sqlite3 .backup` (safe with active writers); configurable retention (default 7); integrity check on output; `chmod 600` on backup files; `cp` fallback with warning
- `scripts/restore.sh`: magic-bytes + `integrity_check` verification before overwrite; automatic safety copy; post-restore integrity verification; interactive confirmation (`--yes` to skip)
- `scripts/backup.ps1` / `restore.ps1`: PowerShell equivalents for Windows with equivalent logic and ACL restriction instead of chmod
- `docs/ops/DISASTER_RECOVERY_RUNBOOK.md`: RTO < 30 min / RPO < 24 h targets; backup schedule and cron examples; step-by-step restore procedure; backup verification checks; access control requirements (chmod 600 / restricted ACL); DR drill schedule; evidence table template; escalation path
- `docs/ops/rehearsal-scenarios/backup-restore-drill.md`: three injection options (delete, destructive query, Docker volume); pass criteria; evidence checklist; registered in scenario library
sqlite3 dot-commands (.backup, .restore) pass file paths as single-quoted string literals. Reject paths containing single-quote characters with a clear error message to prevent truncation or misrouting of the dot-command. Both backup.sh and restore.sh now validate --db-path, --output-dir, and --backup-file before proceeding.
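As a quick illustration of the guard (the function name and error wording below are a sketch, not the exact script code):

```shell
# Sketch of the single-quote guard described above; the function name and
# error message are illustrative, not copied from the scripts.
validate_path() {
  local label="$1" path="$2"
  case "$path" in
    *"'"*)
      echo "ERROR: ${label} must not contain single-quote characters: ${path}" >&2
      return 1
      ;;
  esac
  return 0
}

validate_path "--db-path" "/tmp/o'brien.db" 2>/dev/null || echo "rejected"
validate_path "--db-path" "/tmp/taskdeck.db" && echo "accepted"
```

Rejecting the path outright is simpler and safer than trying to escape it for sqlite3's dot-command parser.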
Self-Review Findings

- Backup hot-copy safety (active writes): Safe.
- Restore integrity check: Fully verified before overwrite — (1) SQLite magic bytes checked, (2) `PRAGMA integrity_check` run on the backup file.
- Retention logic correctness: Correct.
- Path injection safety: Was a concern — fix pushed (see comment above).
- RTO/RPO realism: Realistic for a SQLite local-first app. RTO < 30 min is achievable: the restore script itself completes in seconds; the bulk of the window is API restart + WAL checkpoint. The runbook notes Docker/hosted instances may need up to 60 min. RPO < 24 h under daily cron; < 1 h with hourly scheduling — both documented.
- Drill template usability: Actionable — the evidence table has Date, Operator, Backup Age, Backup File, Restore Duration, integrity_check result, Row Count Match, Pass/Fail, and Notes columns.
- Critical issues: None remaining. Path injection fix pushed; all other checks passed initial review.
Code Review
This pull request establishes a disaster recovery framework for Taskdeck, including a comprehensive runbook, rehearsal scenarios, and automation scripts for SQLite backups and restores on both Linux/macOS and Windows. The review feedback identifies opportunities to harden connection string parsing against multi-parameter strings, ensure Bash compatibility for macOS users by avoiding Bash 4.0 features, and prevent potential memory issues when validating large database files in PowerShell.
scripts/backup.sh
Outdated
```bash
if [[ -z "$DB_PATH" ]]; then
  # Try to extract from the ConnectionStrings env var (e.g. "Data Source=/app/data/taskdeck.db")
  if [[ -n "${ConnectionStrings__DefaultConnection:-}" ]]; then
    DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
```
The connection string parsing is fragile. Using ${VAR#*=} assumes the path is the only parameter and follows the first =. If the connection string contains multiple parameters (e.g., Data Source=taskdeck.db;Version=3), this will result in an invalid file path.
```diff
-DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
+DB_PATH=$(echo "${ConnectionStrings__DefaultConnection}" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
```
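For illustration, here is how the suggested `sed` extraction behaves against a multi-parameter string (the `Version`/`Cache` parameters are made up for the demo):

```shell
# Demo of the suggested sed extraction; the Version/Cache parameters are
# synthetic, added only to exercise the multi-parameter case.
cs="Data Source=/app/data/taskdeck.db;Version=3;Cache=Shared"
DB_PATH=$(echo "$cs" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
echo "$DB_PATH"   # -> /app/data/taskdeck.db

# Single-parameter strings still extract cleanly:
cs2="Data Source=/app/data/taskdeck.db"
DB_PATH2=$(echo "$cs2" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
echo "$DB_PATH2"  # -> /app/data/taskdeck.db
```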
```bash
echo "ERROR: backup integrity check failed: $INTEGRITY" >&2
rm -f "$BACKUP_FILE"
exit 1
```
mapfile (or readarray) was introduced in Bash 4.0. Default installations of macOS still use Bash 3.2, where this command is unavailable. Using a while loop with read provides better compatibility for local-first deployments.
```diff
-echo "ERROR: backup integrity check failed: $INTEGRITY" >&2
-rm -f "$BACKUP_FILE"
-exit 1
+ALL_BACKUPS=()
+while IFS= read -r line; do
+  ALL_BACKUPS+=("$line")
+done < <(ls -1t "${OUTPUT_DIR}/taskdeck-backup-"*.db 2>/dev/null)
```
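A self-contained sketch of the Bash 3.2-compatible pattern (the temp directory and filenames are synthetic):

```shell
# Bash 3.2-compatible replacement for mapfile: populate an array from
# ls -1t output via a while-read loop. Directory and filenames are synthetic.
dir="$(mktemp -d)"
touch "$dir/taskdeck-backup-old.db"
sleep 1
touch "$dir/taskdeck-backup-new.db"

ALL_BACKUPS=()
while IFS= read -r line; do
  ALL_BACKUPS+=("$line")
done < <(ls -1t "$dir"/taskdeck-backup-*.db 2>/dev/null)

echo "${#ALL_BACKUPS[@]} backups, newest: ${ALL_BACKUPS[0]##*/}"
```

Process substitution (`< <(…)`) works on Bash 3.2, unlike `mapfile`/`readarray`, which require 4.0.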
scripts/restore.sh
Outdated
```bash
# ---------------------------------------------------------------------------
if [[ -z "$DB_PATH" ]]; then
  if [[ -n "${ConnectionStrings__DefaultConnection:-}" ]]; then
    DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
```
The connection string parsing is fragile. Using ${VAR#*=} assumes the path is the only parameter and follows the first =. If the connection string contains multiple parameters (e.g., Data Source=taskdeck.db;Version=3), this will result in an invalid file path.
```diff
-DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
+DB_PATH=$(echo "${ConnectionStrings__DefaultConnection}" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
```
scripts/backup.ps1
Outdated
```powershell
$csEnv = $env:ConnectionStrings__DefaultConnection
if (-not [string]::IsNullOrWhiteSpace($csEnv)) {
    # Parse "Data Source=/path/to/taskdeck.db"
    $DbPath = ($csEnv -replace '^(?i)data\s+source\s*=\s*', '').Trim()
```
The connection string parsing using -replace is fragile and will fail if the connection string contains multiple parameters. It's safer to split the string and extract the specific key.
```powershell
$DbPath = ($csEnv -split ';' | Where-Object { $_ -match 'Data Source=' } | ForEach-Object { $_ -replace '.*Data Source=', '' }).Trim()
```
scripts/restore.ps1
Outdated
```powershell
if ([string]::IsNullOrWhiteSpace($DbPath)) {
    $csEnv = $env:ConnectionStrings__DefaultConnection
    if (-not [string]::IsNullOrWhiteSpace($csEnv)) {
        $DbPath = ($csEnv -replace '^(?i)data\s+source\s*=\s*', '').Trim()
```
The connection string parsing using -replace is fragile and will fail if the connection string contains multiple parameters. It's safer to split the string and extract the specific key.
```powershell
$DbPath = ($csEnv -split ';' | Where-Object { $_ -match 'Data Source=' } | ForEach-Object { $_ -replace '.*Data Source=', '' }).Trim()
```
scripts/restore.ps1
Outdated
```powershell
# Check SQLite magic bytes (first 16 bytes = "SQLite format 3\0")
$SqliteMagic = [System.Text.Encoding]::ASCII.GetBytes("SQLite format 3")
$FileBytes = [System.IO.File]::ReadAllBytes($BackupFile)
```
[System.IO.File]::ReadAllBytes($BackupFile) reads the entire file into memory. For large database backups, this can lead to high memory consumption or OutOfMemoryException. Since only the first 16 bytes are needed for the magic check, it's more efficient to read only the required header.
```powershell
$FileBytes = Get-Content -Path $BackupFile -Encoding Byte -TotalCount 16
```

(Note: `-Encoding Byte` exists only in Windows PowerShell 5.1; on PowerShell 7+ the equivalent is `Get-Content -Path $BackupFile -AsByteStream -TotalCount 16`.)
Pull request overview
Adds cross-platform (bash + PowerShell) automation and documentation to support SQLite backup/restore workflows and to formalize a DR drill playbook for Taskdeck’s local-first database.
Changes:
- Introduces backup/restore scripts for Linux/macOS/WSL and Windows, including integrity checks and permission hardening.
- Adds a Disaster Recovery runbook documenting RTO/RPO targets, scheduling guidance, verification steps, and escalation.
- Adds a rehearsal scenario for running a backup/restore DR drill and registers it in the rehearsal cadence index.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| scripts/backup.sh | Hot-backup + retention automation with integrity check and chmod hardening. |
| scripts/restore.sh | Restore automation with pre/post integrity checks, schema sanity checks, and safety copy creation. |
| scripts/backup.ps1 | Windows backup equivalent with retention and owner-only ACL hardening. |
| scripts/restore.ps1 | Windows restore equivalent with validation, safety copy, and owner-only ACL hardening. |
| docs/ops/DISASTER_RECOVERY_RUNBOOK.md | DR runbook describing backup/restore procedure, scheduling, verification, and evidence requirements. |
| docs/ops/rehearsal-scenarios/backup-restore-drill.md | Step-by-step DR drill scenario including injections, recovery steps, and pass criteria. |
| docs/ops/INCIDENT_REHEARSAL_CADENCE.md | Registers the new backup/restore drill scenario in the scenario library list. |
scripts/backup.sh
Outdated
```bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

DEFAULT_DB_PATH="${HOME}/.taskdeck/taskdeck.db"
DEFAULT_OUTPUT_DIR="${HOME}/.taskdeck/backups"
DEFAULT_RETAIN=7
```
SCRIPT_DIR is assigned but never used, which adds dead code/noise. Remove it or use it (e.g., to resolve repo-relative defaults) so the script stays minimal and easier to maintain.
```bash
# Restrict backup directory permissions (owner read/write only)
chmod 700 "$OUTPUT_DIR" 2>/dev/null || true

# ---------------------------------------------------------------------------
# Build backup filename
# ---------------------------------------------------------------------------
```
The sqlite3 dot-command is built with single-quoted interpolation (.backup '${BACKUP_FILE}'). If the output path contains a single quote, the backup will fail; worse, a crafted path could inject additional sqlite3 commands. Prefer passing a safely escaped path (e.g., escape single quotes for sqlite3) and avoid constructing dot-commands that can be interpreted as multiple commands.
```bash
    exit 1
fi
if [[ "$DB_PATH" == *"'"* ]]; then
    echo "ERROR: --db-path must not contain single-quote characters: $DB_PATH" >&2
```
The header check comment says the first 16 bytes must match SQLite format 3\0, but the script reads 15 bytes and compares against the hex for SQLite format 3 (no null terminator). Either read/compare 16 bytes including the trailing 00, or update the comment/expected value so the check and documentation align.
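A sketch of the full 16-byte variant, including the trailing NUL (the sample file below stands in for a real backup):

```shell
# Validate the full 16-byte SQLite header including the trailing NUL,
# instead of only the 15 ASCII characters. The sample file is a stand-in
# for a real backup file.
BACKUP_FILE="$(mktemp)"
printf 'SQLite format 3\0...page data...' > "$BACKUP_FILE"

expected="53514c69746520666f726d6174203300"  # hex of "SQLite format 3\0"
actual="$(dd if="$BACKUP_FILE" bs=1 count=16 2>/dev/null | xxd -p)"

if [[ "$actual" == "$expected" ]]; then
  echo "header ok"
else
  echo "ERROR: not a SQLite database header" >&2
fi
rm -f "$BACKUP_FILE"
```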
scripts/restore.sh
Outdated
```bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# ---------------------------------------------------------------------------
# Defaults
# ---------------------------------------------------------------------------
DEFAULT_DB_PATH="${HOME}/.taskdeck/taskdeck.db"
```
SCRIPT_DIR is assigned but never used. Please remove it to reduce dead code, or use it to resolve repo-relative paths if that was the intent.
```bash
    [[ "$CONFIRM" =~ ^[Yy]$ ]] || { echo "Aborted."; exit 1; }
fi

# ---------------------------------------------------------------------------
# Step 3: Create safety copy of the current live database
# ---------------------------------------------------------------------------
mkdir -p "$SAFETY_DIR"
```
The sqlite3 dot-command string uses single-quoted interpolation (.backup '${SAFETY_FILE}'). A safety-copy path containing a single quote will break the command; additionally, constructing dot-commands from user-controlled paths can allow sqlite3 command injection. Escape single quotes for sqlite3 (or otherwise sanitize/encode the filename) before embedding it into a dot-command.
```powershell
# Create a daily backup task
$action = New-ScheduledTaskAction -Execute "pwsh.exe" `
    -Argument "-NonInteractive -File C:\taskdeck\scripts\backup.ps1 -Yes"
$trigger = New-ScheduledTaskTrigger -Daily -At "02:00"
Register-ScheduledTask -TaskName "Taskdeck-Daily-Backup" `
    -Action $action -Trigger $trigger -RunLevel Highest
```
The Task Scheduler example passes -Yes to backup.ps1, but scripts/backup.ps1 does not define a Yes parameter. This will cause scheduled backups to fail. Remove -Yes from the example (or add a corresponding switch parameter in the script if intended).
```bash
# 4. Restore via the restore script (exec into a temp container)
docker run --rm \
  -v taskdeck_taskdeck-db:/data \
  -v "$(pwd):/repo" \
  alpine:3 \
  sh -c "cp /data/backups/taskdeck-backup-<timestamp>.db /data/taskdeck.db"
```
In Option C, Step 4 says "Restore via the restore script" but the command shown does a raw cp into the volume. This bypasses the restore script’s validations (magic bytes/integrity check) and safety copy behavior. Either update the text to say this is a manual copy (and why it's acceptable), or change the instructions to use scripts/restore.sh inside a container that has sqlite3 available.
```diff
-# 4. Restore via the restore script (exec into a temp container)
-docker run --rm \
-  -v taskdeck_taskdeck-db:/data \
-  -v "$(pwd):/repo" \
-  alpine:3 \
-  sh -c "cp /data/backups/taskdeck-backup-<timestamp>.db /data/taskdeck.db"
+# 4. Restore via the restore script (exec into a temp container with sqlite3 available)
+docker run --rm \
+  -v taskdeck_taskdeck-db:/data \
+  -v "$(pwd):/repo" \
+  alpine:3 \
+  sh -c "apk add --no-cache bash sqlite && bash /repo/scripts/restore.sh --db-path /data/taskdeck.db --backup-path /data/backups/taskdeck-backup-<timestamp>.db"
```
```powershell
$Sqlite3 = Get-Command sqlite3.exe -ErrorAction SilentlyContinue

if ($Sqlite3) {
    # sqlite3 .backup is a hot backup: copies pages under a shared lock,
    # flushing WAL frames first. Safe with active readers and writers.
    & $Sqlite3.Source $DbPath ".backup '$BackupFile'"
    if ($LASTEXITCODE -ne 0) {
        Write-Error "sqlite3 .backup failed (exit code $LASTEXITCODE)."
    }
```
The sqlite3 dot-command is constructed with single-quoted interpolation (.backup '$BackupFile'). If the path contains a single quote, sqlite3 parsing will fail; additionally, embedding unsanitized input into dot-commands can allow command injection. Escape/sanitize the filename before constructing the .backup command.
scripts/restore.ps1
Outdated
```powershell
# Check SQLite magic bytes (first 16 bytes = "SQLite format 3\0")
$SqliteMagic = [System.Text.Encoding]::ASCII.GetBytes("SQLite format 3")
$FileBytes = [System.IO.File]::ReadAllBytes($BackupFile)

if ($FileBytes.Length -lt 16) {
    Write-Error "Backup file is too small to be a valid SQLite database."
}
```
Comment says the first 16 bytes are SQLite format 3\0, but the code compares only the 15-byte ASCII string SQLite format 3 (and never validates the trailing null byte). Either include the null terminator in the expected bytes, or update the comment/length check to match what’s being validated.
```bash
# Quick integrity check on the backup
# ---------------------------------------------------------------------------
if command -v sqlite3 &>/dev/null; then
    INTEGRITY="$(sqlite3 "$BACKUP_FILE" 'PRAGMA integrity_check;' 2>&1)"
    if [[ "$INTEGRITY" != "ok" ]]; then
        echo "ERROR: backup integrity check failed: $INTEGRITY" >&2
        rm -f "$BACKUP_FILE"
        exit 1
```
Retention comment says backups are sorted oldest-first, but ls -1t sorts newest-first (and the deletion loop relies on newest-first ordering). Update the comment to match the actual ordering to avoid confusion when maintaining the retention logic.
- Fix fragile connection string parsing: use sed/split to extract Data Source from multi-parameter connection strings in all 4 scripts
- Replace mapfile with while-read loop for macOS Bash 3.2 compatibility
- Remove dead SCRIPT_DIR variables from backup.sh and restore.sh
- Add sqlite3 single-quote escaping as defense-in-depth against command injection in all .backup/.restore dot-commands
- Fix header byte count comment: code reads 15 bytes (text only), not 16 (which includes the null terminator)
- Replace ReadAllBytes with stream read of first 16 bytes in restore.ps1 to avoid loading entire file into memory
- Fix retention comment: ls -t sorts newest-first, not oldest-first
- Remove nonexistent -Yes flag from Task Scheduler example in runbook
- Update Option C Docker restore to use restore.sh instead of raw cp
Chris0Jeky left a comment
Adversarial Review — PR #663 (backup/restore scripts + DR runbook)
The two fix-up commits addressed the bot findings well (connection string parsing, mapfile compatibility, ReadAllBytes, single-quote escaping, dead SCRIPT_DIR, comment corrections, Docker Option C). Good follow-through. The scripts are generally well-structured with solid defensive checks.
Below are the remaining findings I could confirm through code analysis and local shell testing.
1. [HIGH] Restore scripts do not clean up stale WAL/SHM files — risk of silent corruption
Files: scripts/restore.sh, scripts/restore.ps1
When the live database is in WAL mode (which EF Core enables by default), the database directory contains taskdeck.db-wal and taskdeck.db-shm companion files. The restore scripts replace the main .db file (via sqlite3 .restore or cp) but leave any existing -wal and -shm files untouched.
On the next application startup, SQLite will discover the old WAL file and attempt to replay it against the newly restored database. Since the WAL contains page references from the pre-restore database, this can silently corrupt the restored data or cause an immediate SQLITE_CORRUPT error.
Fix: Before overwriting DB_PATH, delete (or rename alongside the safety copy) the companion files:
```bash
# In restore.sh, before the restore step:
rm -f "${DB_PATH}-wal" "${DB_PATH}-shm"
```

```powershell
# In restore.ps1, before the restore step:
Remove-Item "${DbPath}-wal" -Force -ErrorAction SilentlyContinue
Remove-Item "${DbPath}-shm" -Force -ErrorAction SilentlyContinue
```

This is the most important finding. Without it, a restore of a WAL-mode database followed by an API restart can produce data corruption that defeats the purpose of the backup.
2. [MEDIUM] cp fallback in backup.sh does not copy WAL file — backup may be incomplete
File: scripts/backup.sh (and backup.ps1 Copy-Item fallback)
When sqlite3 is not available, the script falls back to cp "$DB_PATH" "$BACKUP_FILE". If the database is in WAL mode, the main .db file alone is an incomplete snapshot — committed transactions in the -wal file are not included, and the backup will appear to have rolled back those transactions.
The script already warns that cp is unsafe with active writers. Consider strengthening the warning to mention WAL explicitly, or copying the -wal file alongside the main database in the fallback path (with a caveat that it is still not crash-safe). This would at least give the operator a complete set of files to work with.
Not a blocker since the warning is already present, but worth a one-line clarification.
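One possible shape for the strengthened fallback (paths below are synthetic stand-ins for the script's `DB_PATH`/`BACKUP_FILE` variables):

```shell
# Sketch: in the cp fallback, also copy the -wal companion when present and
# warn explicitly. Paths are synthetic stand-ins for the script variables.
dir="$(mktemp -d)"
DB_PATH="$dir/taskdeck.db"
BACKUP_FILE="$dir/taskdeck-backup.db"
printf 'main' > "$DB_PATH"
printf 'wal'  > "${DB_PATH}-wal"

cp "$DB_PATH" "$BACKUP_FILE"
if [[ -f "${DB_PATH}-wal" ]]; then
  echo "WARNING: -wal file present; cp fallback is not crash-safe for WAL-mode databases" >&2
  cp "${DB_PATH}-wal" "${BACKUP_FILE}-wal"
fi
```

This at least hands the operator a complete set of files, though it remains non-atomic under active writes.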
3. [MEDIUM] restore.sh magic-byte validation logic: file is checked before using the computed xxd result
File: scripts/restore.sh, lines ~131-157
The script computes MAGIC_ACTUAL via dd | xxd unconditionally, then checks if command -v file first. If file is available, MAGIC_ACTUAL is computed but never used. More importantly, the file command is less precise — it checks heuristics, not exact bytes. A file that happens to contain "SQLite" in the right place but is not a valid SQLite database could pass the file check.
Consider reversing the priority: prefer the exact xxd magic-byte check when available, fall back to file only when xxd is not present. Or just skip the file fallback entirely since xxd is more common in production environments (it ships with vim which is nearly universal on Linux).
4. [LOW] No disk-space pre-check — partial backup files on full disk
Files: scripts/backup.sh, scripts/backup.ps1
If the disk fills during sqlite3 .backup, the command will fail and leave a partial .db file at the backup path. The subsequent PRAGMA integrity_check will catch it and delete the partial file, which is good. However, the partial file itself may consume the remaining scarce disk space, making recovery harder. A pre-flight check (df on Linux, Get-PSDrive on Windows) comparing available space to the source database size would give an early, actionable error message instead of a cryptic sqlite3 failure.
Not a blocker — the integrity check is a good safety net — but a nice-to-have for operational clarity.
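A minimal pre-flight sketch (the 10% margin and the 1 KiB stand-in database are assumptions for the demo):

```shell
# Pre-flight disk-space check sketch: compare free KB in the output dir to
# the source database size plus a ~10% margin (the margin is an assumption).
DB_PATH="$(mktemp)"
OUTPUT_DIR="$(dirname "$DB_PATH")"
head -c 1024 /dev/zero > "$DB_PATH"   # 1 KiB stand-in database

db_bytes=$(wc -c < "$DB_PATH")
free_kb=$(df -Pk "$OUTPUT_DIR" | awk 'NR==2 {print $4}')
need_kb=$(( db_bytes / 1024 * 11 / 10 + 1 ))

if [ "$free_kb" -lt "$need_kb" ]; then
  echo "ERROR: need ~${need_kb} KB free in ${OUTPUT_DIR}, have ${free_kb} KB" >&2
else
  echo "space ok (need ~${need_kb} KB, have ${free_kb} KB)"
fi
rm -f "$DB_PATH"
```

`df -P` keeps the output POSIX-formatted so the `awk` field index is stable across Linux and macOS.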
5. [LOW] Sub-second re-runs silently overwrite the previous backup
File: scripts/backup.sh, scripts/backup.ps1
The backup filename uses second-granularity timestamps (%Y-%m-%d-%H%M%S). If the script is invoked twice within the same second (e.g., automated retry on transient failure), the second run overwrites the first backup without warning. Consider appending a random suffix or checking for filename collision, or simply documenting this as a known limitation.
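One low-cost mitigation (the PID suffix is an assumption; a random suffix or an explicit collision check would work equally well):

```shell
# Sketch: suffix the second-granularity timestamp with the PID so two runs
# in the same second cannot collide. Demonstrated with two child shells,
# which have distinct PIDs even within the same second.
stamp_a="$(bash -c 'echo "taskdeck-backup-$(date +%Y-%m-%d-%H%M%S)-$$.db"')"
stamp_b="$(bash -c 'echo "taskdeck-backup-$(date +%Y-%m-%d-%H%M%S)-$$.db"')"
echo "$stamp_a"
echo "$stamp_b"
```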
Summary
The PR is in good shape after the fix-up commits. Finding #1 (WAL/SHM cleanup on restore) is the only one I would call a blocker — it is a real data corruption vector for the default EF Core SQLite configuration. The rest are improvements that could ship in a follow-up.
EF Core uses WAL mode by default. Both restore scripts replaced the
main .db file but left stale -wal and -shm companion files in place.
On next API startup the stale WAL would be replayed against the
restored database, risking silent data corruption.
Delete ${DB_PATH}-wal and ${DB_PATH}-shm (bash) and their PowerShell
equivalents immediately before the sqlite3 .restore / cp step.
The cp fallback (used when sqlite3 is unavailable) only copied the main .db file, producing an incomplete snapshot for WAL-mode databases. Now also copies the -wal file if it exists and warns that the backup may be incomplete for WAL-mode databases.
Closes #86
Summary
- `scripts/backup.sh` — timestamped SQLite hot backup via `sqlite3 .backup` (safe with active writers); configurable retention (default 7 days); `PRAGMA integrity_check` on output; `chmod 600` on backup files; `cp` fallback with explicit warning
- `scripts/restore.sh` — SQLite magic-bytes + `PRAGMA integrity_check` verification before overwrite; automatic safety copy of live DB; post-restore integrity verification; interactive confirmation (`--yes` to skip)
- `scripts/backup.ps1` / `restore.ps1` — PowerShell equivalents for Windows with equivalent logic and restricted ACL (owner-only) instead of chmod
- `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` — RTO/RPO targets, backup scheduling (cron + Task Scheduler + Docker), step-by-step restore procedure, backup verification checks, access control requirements, DR drill schedule, evidence table template, escalation path
- `docs/ops/rehearsal-scenarios/backup-restore-drill.md` — three injection options (delete, destructive query, Docker volume); pass criteria table; evidence checklist
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` — registered new scenario in scenario library

RTO/RPO
Usage
Test plan
- Run `bash scripts/backup.sh` on a live database and confirm `PRAGMA integrity_check` passes on the output
- Run `bash scripts/restore.sh` against the backup and confirm the post-restore integrity check passes
- Confirm the `--yes` flag skips the confirmation prompt
- Run the `backup-restore-drill` rehearsal scenario per the runbook