# Operations

## Starting & Stopping

### Full Startup Sequence

```bash
# Step 1: Start Docker infrastructure
docker compose -f ~/brainsait-unified/docker-compose.yml up -d

# Step 2: Wait for IRIS to be ready
sleep 30

# Step 3: Start Python microservices
bash ~/iris-ecosystem-integration/brainsait_startup.sh

# Step 4: Verify all services
for p in 58773 58080 58081 58082 58083 3000; do
  ss -tlnp | grep -q ":$p " && echo "✅ Port $p" || echo "❌ Port $p"
done
```

### Graceful Shutdown

```bash
# Step 1: Stop Python services
pkill -f "ecosystem_supervisor.py"
pkill -f "rest_server.py"
pkill -f "brainsait_dashboard.py"
pkill -f "brainsait_webhook.py"
pkill -f "brainsait_metrics.py"

# Step 2: Stop Docker containers
docker compose -f ~/brainsait-unified/docker-compose.yml down

# Step 3: Verify
ss -tlnp | grep -E "58773|58080|58081|58082|58083|3000" || echo "All services stopped"
```

### Restarting Individual Services

```bash
# Restart API gateway
pkill -f "rest_server.py"
nohup python3 -u ~/iris-ecosystem-integration/rest_server.py > /tmp/rest_server.log 2>&1 &

# Restart supervisor
pkill -f "ecosystem_supervisor.py"
nohup python3 -u ~/iris-ecosystem-integration/ecosystem_supervisor.py > /tmp/ecosystem_supervisor.log 2>&1 &

# Restart a Docker container
docker restart iris-unified

# Restart all unhealthy containers via supervisor
curl -s http://localhost:58773/recover/all | jq
```

---

## Monitoring Health

### Live Dashboard

Open `http://localhost:58081/` or `https://dashboard.brainsait.org/` for real-time visual status with auto-refresh every 5 seconds.
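
Where a browser is not available, a rough terminal stand-in for the dashboard is to re-run the port check on the same 5-second interval. The sketch below is not part of the BRAINSAIT scripts: `port_report` and `watch_ports` are hypothetical names, and the port list simply reuses the one from the startup verification step.

```bash
# Hypothetical helper (not part of the BRAINSAIT scripts).
# Reads `ss -tln`-style output on stdin and prints UP/DOWN for each port argument.
port_report() {
  local listing p
  listing=$(cat)
  for p in "$@"; do
    if printf '%s\n' "$listing" | grep -q ":$p "; then
      echo "UP   $p"
    else
      echo "DOWN $p"
    fi
  done
}

# Re-check every 5 seconds, mirroring the dashboard refresh interval (Ctrl-C to stop).
watch_ports() {
  while true; do
    clear
    date
    ss -tln | port_report 58773 58080 58081 58082 58083 3000
    sleep 5
  done
}
```

Calling `watch_ports` starts the loop. Because `port_report` only reads stdin, it can also be fed saved `ss` output when reviewing an incident after the fact.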

### Supervisor API

```bash
# Full health report
curl -s http://localhost:58773/health | jq

# Containers
curl -s http://localhost:58773/containers | jq

# Workers
curl -s http://localhost:58773/workers | jq

# Circuit breakers
curl -s http://localhost:58773/circuits | jq
```

### Prometheus Metrics

```bash
# Full metrics dump
curl http://localhost:58083/metrics

# Firing alerts
curl http://localhost:58083/alerts | jq
```

### Pulse Agents Health

```bash
# List all 19 agents
curl http://localhost:58080/linc/ | jq

# Test a specific agent
curl "http://localhost:58080/linc/summary?patient=P-5842" | jq .status

# Test all 19 agents
for agent in summary prior-auth gaps-in-care medication-safety care-plan \
    clinical-trials readmission-risk triage imaging-followup lab-explainer \
    nl-query sdoh-referral chat hf-models predict-readmission \
    predict-pa-denial predict-ed-util predict-interaction predict-no-show; do
  # On connection failure curl still prints 000, so overwrite STATUS rather than append to it
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:58080/linc/$agent?patientId=P001" 2>/dev/null) || STATUS="FAIL"
  echo "$agent → HTTP $STATUS"
done

# Test chat
curl "http://localhost:58080/hf/chat?q=hello" | jq
```

### Predictive Models

```bash
# Readmission risk (30-day)
curl "http://localhost:58080/linc/predict-readmission?patientId=P001" | jq .

# PA denial probability
curl "http://localhost:58080/linc/predict-pa-denial?patientId=P004&service=CT%20Abdomen&payer=Medicare" | jq .

# ED utilization forecast (7d + 30d)
curl "http://localhost:58080/linc/predict-ed-util?patientId=P002" | jq .

# Drug-drug interaction check (4 checks)
curl "http://localhost:58080/linc/predict-interaction?patientId=P005&medications=metformin,lisinopril,warfarin,atorvastatin" | jq .

# No-show probability
curl "http://localhost:58080/linc/predict-no-show?patientId=P003&appointmentDate=2026-06-15" | jq .
```

### Security Health

```bash
# Security module status
curl http://localhost:58080/security/health | jq .

# Sensitive path requires auth
curl http://localhost:58080/oracle/test # → 401
curl -H "X-API-Key: brainsait-iris-connector-2026" http://localhost:58080/oracle/test # → 200
```

### System Monitoring

```bash
# Resources
htop      # Interactive process viewer
free -h   # Memory
df -h /   # Disk
uptime    # Load

# Docker stats
docker stats --no-stream
docker ps --format "table {{.Names}}\t{{.Status}}"

# Network
ss -tlnp | grep -E "58773|58080|58081|58082|58083|52773|5432|6379|3000"
```

---

## Auto-Recovery

The supervisor runs automatic health checks every 30 seconds. If a container is down or unreachable for more than 60 seconds, it triggers auto-recovery:

- Max 3 containers per recovery cycle (prevents cascading restarts)
- Each restart logged to PostgreSQL `brainsait.health_audit`
- Alert sent to webhook on success and failure
- Circuit breaker reset after successful recovery

```bash
# Manual recovery
curl -s http://localhost:58773/recover/iris-unified | jq
curl -s http://localhost:58773/recover/all | jq
curl -s -X POST http://localhost:58773/probe # Verify after recovery
```

---

## Troubleshooting

### Service Unreachable

```bash
ss -tlnp | grep :{PORT}
tail -100 /tmp/ecosystem_supervisor.log
```

### Circuit Breaker Open

Worker API showing `state: "open"`. Causes: downstream worker unavailable, 5xx errors, network issues.

```bash
curl -s http://localhost:58773/circuits | jq
curl -s -X POST http://localhost:58773/probe
```

### High CPU / Memory

```bash
ps aux --sort=-%cpu | head -10
free -h
docker stats --no-stream
```

Common fix: Renice Coolify Laravel workers.

```bash
sudo renice -n 10 -p $(pgrep -f artisan | head -4)
```

### Disk Space Low

```bash
df -h /
docker system df
sudo journalctl --vacuum-time=7d
docker image prune -af
sudo rm -rf /var/lib/apt/lists/*
npm cache clean --force
```

### Docker Container Issues

```bash
docker ps -a --filter name={container_name}
docker logs {container_name} --tail 50
docker inspect {container_name} | jq '.[0].State'
docker restart {container_name}
```

### Tunnel Connectivity

Public URLs returning 502/522: check tunnel container and DNS.

```bash
docker ps --filter name=cloudflare-tunnel
docker logs cloudflare-tunnel --tail 20
docker restart cloudflare-tunnel
```

---

## Backup & Recovery

### What to Back Up

| Component | Location | Method |
|-----------|----------|--------|
| PostgreSQL data | Docker volume | `pg_dump` |
| IRIS data | Docker volume | IRIS backup |
| Credentials | `~/.brainsait/credentials.env` | File copy |
| Python services | `~/iris-ecosystem-integration/` | Git |
| Docker compose | `~/brainsait-unified/` | Git |
| Grafana dashboards | Grafana API | API call |

### PostgreSQL Backup

```bash
docker exec postgres-brainsait pg_dump -U coolify -d brainsait > backup_$(date +%Y%m%d).sql
```

### Restore

```bash
cat backup.sql | docker exec -i postgres-brainsait psql -U coolify -d brainsait
```

### Credential Recovery

If `~/.brainsait/credentials.env` is lost:

1. Cloudflare Tunnel token → `~/brainsait-unified/.env` as `CLOUDFLARE_TUNNEL_TOKEN`
2. Grafana password → defaults to `brainsait2026`
3. API keys → regenerate from Cloudflare dashboard

---

## Security

| Layer | Mechanism |
|-------|-----------|
| Edge | Cloudflare SSL/TLS, origin pinned to tunnel |
| Tunnel | Random 256-bit token, stored in `.env` |
| Gateway | Bearer token + X-API-Key on sensitive routes |
| Webhook | Authorization header validated against `BRAINSAIT_API_KEY` |
| Grafana | Basic auth (`admin` / `brainsait2026`) |
| Firewall | UFW (only essential ports open) |
| Secrets | `credentials.env` is `chmod 600` |
| Rate Limiting | 100 req/min per IP, 60s sliding window |
| Audit Log | All auth failures tracked in `/tmp/brainsait_audit.json` |

### Firewall Rules

Open ports: 22 (SSH), 80/443 (HTTP/S), 3000 (Grafana), 8000 (Coolify), 58080-58083/58773 (BRAINSAIT), 6001/6002 (Coolify realtime).

### Secrets Management

All secrets in `~/.brainsait/credentials.env`:

```bash
BRAINSAIT_API_KEY=brainsait-iris-connector-2026
CLOUDFLARE_API_TOKEN=cfut_xxxxx
CLOUDFLARE_TUNNEL_TOKEN=eyJxxxxx
PG_DSN=postgresql://coolify:xxxx@localhost:5432/brainsait
```

---

## Configuration Reference

### File Locations

| File | Purpose |
|------|---------|
| `~/brainsait-unified/docker-compose.yml` | Container definitions |
| `~/brainsait-unified/.env` | Docker env vars |
| `~/.brainsait/credentials.env` | API keys and secrets |
| `~/iris-ecosystem-integration/brainsait_config.py` | Config loader |
| `~/iris-ecosystem-integration/ecosystem_supervisor.py` | Health monitor |
| `~/iris-ecosystem-integration/rest_server.py` | API Gateway (28+ routes) |
| `~/iris-ecosystem-integration/linc_agents.py` | **19** Pulse AI agents (14 std + 5 predictive) |
| `~/iris-ecosystem-integration/chat_agent.py` | Chat AI agent |
| `~/iris-ecosystem-integration/linc_hf_bridge.py` | HF model bridge |
| `~/iris-ecosystem-integration/brainsait_security.py` | Auth + rate limiting + audit |
| `~/iris-ecosystem-integration/brainsait_dashboard.py` | Live dashboard |
| `~/iris-ecosystem-integration/brainsait_webhook.py` | Webhook receiver |
| `~/iris-ecosystem-integration/brainsait_metrics.py` | Metrics exporter |

### System Tuning

```ini
vm.swappiness=5
vm.dirty_ratio=20
vm.dirty_background_ratio=5
fs.inotify.max_user_watches=524288
net.ipv4.tcp_fastopen=3
```

Docker daemon (`/etc/docker/daemon.json`):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

---

## Cheat Sheet

```bash
# Health
curl localhost:58773/health | jq .summary
curl localhost:58080/health | jq .services

# Pulse agents (19)
curl localhost:58080/linc/ | jq .total
curl "localhost:58080/linc/summary?patient=P-5842" | jq

# Predictive models (5)
curl "localhost:58080/linc/predict-readmission?patientId=P001" | jq .
curl "localhost:58080/linc/predict-interaction?patientId=P005&medications=metformin,lisinopril" | jq .
curl "localhost:58080/linc/predict-no-show?patientId=P003&appointmentDate=2026-06-15" | jq .

# Chat
curl "localhost:58080/hf/chat?q=what+is+diabetes" | jq

# HF models
curl localhost:58080/hf/models | jq

# Security
curl localhost:58080/security/health | jq

# Recovery
curl -X POST localhost:58773/probe
curl localhost:58773/recover/all

# Workflows
curl localhost:58773/workflow/homecare/P-5842

# Docker
docker ps --format "table {{.Names}}\t{{.Status}}"
docker logs --tail 50 iris-unified
```
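
The individual health checks in this document can be bundled into a single pass/fail function for scripts or cron jobs. This is a minimal sketch, assuming each endpoint answers HTTP 200 when healthy (the document only confirms this for the authenticated `/oracle/test` call). `http_status`, `quick_health`, and the `CURL_BIN` override are hypothetical names introduced here, not part of the BRAINSAIT tooling.

```bash
# Hypothetical helpers; the names and the "200 means healthy" rule are assumptions.
# CURL_BIN lets another command stand in for curl (for example in tests).
http_status() {
  # -w prints only the HTTP status code; `|| true` keeps a refused
  # connection from aborting callers that run under `set -e`.
  "${CURL_BIN:-curl}" -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# Prints one OK/FAIL line per URL and returns non-zero if any check failed.
quick_health() {
  local url code rc=0
  for url in "$@"; do
    code=$(http_status "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $url"
    else
      echo "FAIL $url (HTTP ${code:-none})"
      rc=1
    fi
  done
  return "$rc"
}
```

A possible invocation is `quick_health localhost:58773/health localhost:58080/health localhost:58080/security/health || echo "attention needed"`. Returning a non-zero status on any failure lets the function gate later steps in a script.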