OPERATIONS

Fadil369 edited this page Jun 9, 2026 · 1 revision

Operations

Starting & Stopping

Full Startup Sequence

# Step 1: Start Docker infrastructure
docker compose -f ~/brainsait-unified/docker-compose.yml up -d

# Step 2: Wait for IRIS to be ready
sleep 30

# Step 3: Start Python microservices
bash ~/iris-ecosystem-integration/brainsait_startup.sh

# Step 4: Verify all services
for p in 58773 58080 58081 58082 58083 3000; do
  ss -tlnp | grep -q ":$p " && echo "✅ Port $p" || echo "❌ Port $p"
done
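
A fixed `sleep 30` can be too short on a slow host and wasteful on a fast one. As a minimal sketch, Step 2 could poll for readiness instead. The helper names `wait_until` and `port_listening` are hypothetical, and treating a listener on port 52773 (which appears in the System Monitoring port list) as the IRIS readiness signal is an assumption:

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds or TIMEOUT seconds pass.
# The body runs in a subshell, so its variables do not leak into the caller.
wait_until() (
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      exit 1   # timed out; this only leaves the subshell
    fi
    sleep "$interval"
  done
)

# port_listening PORT: succeeds if a TCP listener is bound to PORT.
port_listening() {
  ss -tln | grep -q ":$1 "
}
```

For example, `wait_until 120 2 port_listening 52773 || echo "IRIS not ready"` would wait up to two minutes, checking every two seconds.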

Graceful Shutdown

# Step 1: Stop Python services
pkill -f "ecosystem_supervisor.py"
pkill -f "rest_server.py"
pkill -f "brainsait_dashboard.py"
pkill -f "brainsait_webhook.py"
pkill -f "brainsait_metrics.py"

# Step 2: Stop Docker containers
docker compose -f ~/brainsait-unified/docker-compose.yml down

# Step 3: Verify
ss -tlnp | grep -E "58773|58080|58081|58082|58083|3000" || echo "All services stopped"

Restarting Individual Services

# Restart API gateway
pkill -f "rest_server.py"
nohup python3 -u ~/iris-ecosystem-integration/rest_server.py > /tmp/rest_server.log 2>&1 &

# Restart supervisor
pkill -f "ecosystem_supervisor.py"
nohup python3 -u ~/iris-ecosystem-integration/ecosystem_supervisor.py > /tmp/ecosystem_supervisor.log 2>&1 &

# Restart a Docker container
docker restart iris-unified

# Restart all unhealthy containers via supervisor
curl -s http://localhost:58773/recover/all | jq
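
The two Python restarts above follow the same pattern: kill by script name, then relaunch with output sent to /tmp/<name>.log. A sketch of a helper that wraps that pattern is shown here. `restart_service`, `log_path_for`, and `SERVICE_DIR` are hypothetical names, and applying the same log convention to the other three services is an assumption:

```shell
# SERVICE_DIR is taken from the paths used above; override it if your checkout differs.
SERVICE_DIR="${SERVICE_DIR:-$HOME/iris-ecosystem-integration}"

# log_path_for SCRIPT: print the log file for a script, following the
# /tmp/<name>.log convention used in the restart commands above.
log_path_for() {
  printf '/tmp/%s.log\n' "$(basename "$1" .py)"
}

# restart_service SCRIPT: stop any running copy, then start it in the background.
restart_service() {
  script=$1
  pkill -f "$script" || true   # "not running" is not an error here
  sleep 1                      # give the old process a moment to release its port
  nohup python3 -u "$SERVICE_DIR/$script" > "$(log_path_for "$script")" 2>&1 &
  echo "started $script (pid $!), log: $(log_path_for "$script")"
}
```

With this in place, `restart_service rest_server.py` would perform the same steps as the API gateway restart above.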

Monitoring Health

Live Dashboard

Open http://localhost:58081/ or https://dashboard.brainsait.org/ for real-time visual status with auto-refresh every 5 seconds.

Supervisor API

# Full health report
curl -s http://localhost:58773/health | jq

# Containers
curl -s http://localhost:58773/containers | jq

# Workers
curl -s http://localhost:58773/workers | jq

# Circuit breakers
curl -s http://localhost:58773/circuits | jq

Prometheus Metrics

# Full metrics dump
curl http://localhost:58083/metrics

# Firing alerts
curl http://localhost:58083/alerts | jq

Pulse Agents Health

# List all 19 agents
curl http://localhost:58080/linc/ | jq

# Test a specific agent
curl "http://localhost:58080/linc/summary?patient=P-5842" | jq .status

# Test all 19 agents
for agent in summary prior-auth gaps-in-care medication-safety care-plan clinical-trials readmission-risk triage imaging-followup lab-explainer nl-query sdoh-referral chat hf-models predict-readmission predict-pa-denial predict-ed-util predict-interaction predict-no-show; do
  STATUS=$(curl -s -o /dev/null -w "%{http_code}" "http://localhost:58080/linc/$agent?patientId=P001" 2>/dev/null) || STATUS="FAIL"
  echo "$agent → HTTP $STATUS"
done

# Test chat
curl "http://localhost:58080/hf/chat?q=hello" | jq

Predictive Models

# Readmission risk (30-day)
curl "http://localhost:58080/linc/predict-readmission?patientId=P001" | jq .

# PA denial probability
curl "http://localhost:58080/linc/predict-pa-denial?patientId=P004&service=CT%20Abdomen&payer=Medicare" | jq .

# ED utilization forecast (7d + 30d)
curl "http://localhost:58080/linc/predict-ed-util?patientId=P002" | jq .

# Drug-drug interaction check (4 checks)
curl "http://localhost:58080/linc/predict-interaction?patientId=P005&medications=metformin,lisinopril,warfarin,atorvastatin" | jq .

# No-show probability
curl "http://localhost:58080/linc/predict-no-show?patientId=P003&appointmentDate=2026-06-15" | jq .

Security Health

# Security module status
curl http://localhost:58080/security/health | jq .

# Sensitive path requires auth
curl http://localhost:58080/oracle/test  # → 401
curl -H "X-API-Key: brainsait-iris-connector-2026" http://localhost:58080/oracle/test  # → 200

System Monitoring

# Resources
htop                          # Interactive process viewer
free -h                       # Memory
df -h /                       # Disk
uptime                        # Load

# Docker stats
docker stats --no-stream
docker ps --format "table {{.Names}}\t{{.Status}}"

# Network
ss -tlnp | grep -E "58773|58080|58081|58082|58083|52773|5432|6379|3000"

Auto-Recovery

The supervisor runs automatic health checks every 30 seconds. If a container is down or unreachable for more than 60 seconds, it triggers auto-recovery:

  • Max 3 containers per recovery cycle (prevents cascading restarts)
  • Each restart logged to PostgreSQL brainsait.health_audit
  • Alert sent to webhook on success and failure
  • Circuit breaker reset after successful recovery

# Manual recovery
curl -s http://localhost:58773/recover/iris-unified | jq
curl -s http://localhost:58773/recover/all | jq
curl -s -X POST http://localhost:58773/probe   # Verify after recovery
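
The selection rule described in this section (down for more than 60 seconds, at most 3 containers per cycle) can be sketched as a filter. This illustrates the rule only and is not the supervisor's actual code. The input format (one `name seconds_down` pair per line) and the name `pick_recovery_targets` are assumptions:

```shell
# pick_recovery_targets [THRESHOLD_SECONDS] [MAX_PER_CYCLE]
# Reads "container_name seconds_down" lines on stdin and prints the names that
# have been down strictly longer than the threshold, capped at MAX_PER_CYCLE.
pick_recovery_targets() {
  awk -v threshold="${1:-60}" -v max="${2:-3}" '
    $2 + 0 > threshold && picked < max { print $1; picked++ }
  '
}
```

Capping the number of restarts per cycle means a widespread outage is recovered over several cycles rather than all at once, which matches the stated goal of preventing cascading restarts.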

Troubleshooting

Service Unreachable

ss -tlnp | grep :{PORT}
tail -100 /tmp/ecosystem_supervisor.log

Circuit Breaker Open

A worker's circuit shows state: "open" in the supervisor API. Likely causes: the downstream worker is unavailable, it is returning 5xx errors, or there are network issues.

curl -s http://localhost:58773/circuits | jq
curl -s -X POST http://localhost:58773/probe

High CPU / Memory

ps aux --sort=-%cpu | head -10
free -h
docker stats --no-stream

Common fix: Renice Coolify Laravel workers.

sudo renice -n 10 -p $(pgrep -f artisan | head -4)

Disk Space Low

df -h /
docker system df
sudo journalctl --vacuum-time=7d
docker image prune -af
sudo rm -rf /var/lib/apt/lists/*
npm cache clean --force
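
Several of these commands are destructive (`docker image prune -af` removes every unused image), so it can help to run them only when usage is actually high. A sketch of such a guard follows. The helper names and the 85% default threshold are hypothetical:

```shell
# disk_usage_pct [MOUNTPOINT]: print the used percentage of a filesystem as a bare integer.
disk_usage_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_low [MOUNTPOINT] [THRESHOLD]: succeed when usage is at or above THRESHOLD percent.
disk_low() {
  [ "$(disk_usage_pct "${1:-/}")" -ge "${2:-85}" ]
}
```

For example, `disk_low / 85 && docker image prune -af` would prune images only once the root filesystem reaches 85% usage.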

Docker Container Issues

docker ps -a --filter name={container_name}
docker logs {container_name} --tail 50
docker inspect {container_name} | jq '.[0].State'
docker restart {container_name}

Tunnel Connectivity

If public URLs return 502 or 522, check the tunnel container and DNS.

docker ps --filter name=cloudflare-tunnel
docker logs cloudflare-tunnel --tail 20
docker restart cloudflare-tunnel

Backup & Recovery

What to Back Up

| Component | Location | Method |
| --- | --- | --- |
| PostgreSQL data | Docker volume | pg_dump |
| IRIS data | Docker volume | IRIS backup |
| Credentials | ~/.brainsait/credentials.env | File copy |
| Python services | ~/iris-ecosystem-integration/ | Git |
| Docker compose | ~/brainsait-unified/ | Git |
| Grafana dashboards | Grafana API | API call |

PostgreSQL Backup

docker exec postgres-brainsait pg_dump -U coolify -d brainsait > backup_$(date +%Y%m%d).sql
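
Because the command above writes one dated file per run, old dumps accumulate. A sketch of a retention step is shown here, assuming all dumps sit in one directory and keep the `backup_YYYYMMDD.sql` naming. `prune_backups` and the 7-day default are hypothetical, and `-maxdepth` assumes GNU or BSD find:

```shell
# prune_backups DIR [DAYS]: delete backup_*.sql files directly inside DIR whose
# age exceeds DAYS full days (find's -mtime +N semantics; default 7).
# Prints each file it removes.
prune_backups() {
  find "$1" -maxdepth 1 -type f -name 'backup_*.sql' -mtime +"${2:-7}" -print -exec rm -f {} +
}
```

Files that do not match the `backup_*.sql` pattern are left alone, so the function can point at a shared directory.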

Restore

docker exec -i postgres-brainsait psql -U coolify -d brainsait < backup.sql

Credential Recovery

If ~/.brainsait/credentials.env is lost:

  1. Cloudflare Tunnel token: copy it from ~/brainsait-unified/.env, where it is stored as CLOUDFLARE_TUNNEL_TOKEN
  2. Grafana password: defaults to brainsait2026
  3. API keys: regenerate them from the Cloudflare dashboard

Security

| Layer | Mechanism |
| --- | --- |
| Edge | Cloudflare SSL/TLS, origin pinned to tunnel |
| Tunnel | Random 256-bit token, stored in .env |
| Gateway | Bearer token + X-API-Key on sensitive routes |
| Webhook | Authorization header validated against BRAINSAIT_API_KEY |
| Grafana | Basic auth (admin / brainsait2026) |
| Firewall | UFW, only essential ports open |
| Secrets | credentials.env is chmod 600 |
| Rate Limiting | 100 req/min per IP, 60s sliding window |
| Audit Log | All auth failures tracked in /tmp/brainsait_audit.json |

Firewall Rules

Open ports: 22 (SSH), 80/443 (HTTP/S), 3000 (Grafana), 8000 (Coolify), 58080-58083/58773 (BRAINSAIT), 6001/6002 (Coolify realtime).
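
The port list above can be turned into UFW commands. The sketch below only prints them so they can be reviewed before anything is applied. It assumes every listed port is TCP, which the page does not state, and `print_ufw_rules` is a hypothetical name:

```shell
# print_ufw_rules: emit one `ufw allow` command per port or range listed above.
# Nothing is applied; the function only writes the commands to stdout.
print_ufw_rules() {
  for spec in 22 80 443 3000 8000 58080:58083 58773 6001 6002; do
    echo "ufw allow $spec/tcp"
  done
}
```

UFW requires a protocol when a range such as `58080:58083` is given, which is why each rule carries the `/tcp` suffix.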

Secrets Management

All secrets live in ~/.brainsait/credentials.env:

BRAINSAIT_API_KEY=brainsait-iris-connector-2026
CLOUDFLARE_API_TOKEN=cfut_xxxxx
CLOUDFLARE_TUNNEL_TOKEN=eyJxxxxx
PG_DSN=postgresql://coolify:xxxx@localhost:5432/brainsait
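
Since the file is expected to be chmod 600, a script that reads it can verify the mode before loading anything. A sketch follows, assuming the file contains only plain `KEY=value` lines that are safe to source in a shell. `secrets_file_ok` and `load_secrets` are hypothetical helpers, and `stat -c` assumes GNU coreutils:

```shell
# secrets_file_ok FILE: succeed only if FILE exists and its mode is exactly 600.
secrets_file_ok() {
  [ -f "$1" ] && [ "$(stat -c '%a' "$1")" = "600" ]
}

# load_secrets FILE: export every KEY=value line from FILE into the current shell,
# refusing to read a file that is missing or has looser permissions.
load_secrets() {
  secrets_file_ok "$1" || {
    echo "refusing to load $1: missing or not chmod 600" >&2
    return 1
  }
  set -a    # auto-export variables assigned while sourcing
  . "$1"
  set +a
}
```

For example, `load_secrets ~/.brainsait/credentials.env` would make `BRAINSAIT_API_KEY` available to child processes started from the same shell.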

Configuration Reference

File Locations

| File | Purpose |
| --- | --- |
| ~/brainsait-unified/docker-compose.yml | Container definitions |
| ~/brainsait-unified/.env | Docker env vars |
| ~/.brainsait/credentials.env | API keys and secrets |
| ~/iris-ecosystem-integration/brainsait_config.py | Config loader |
| ~/iris-ecosystem-integration/ecosystem_supervisor.py | Health monitor |
| ~/iris-ecosystem-integration/rest_server.py | API Gateway (28+ routes) |
| ~/iris-ecosystem-integration/linc_agents.py | 19 Pulse AI agents (14 std + 5 predictive) |
| ~/iris-ecosystem-integration/chat_agent.py | Chat AI agent |
| ~/iris-ecosystem-integration/linc_hf_bridge.py | HF model bridge |
| ~/iris-ecosystem-integration/brainsait_security.py | Auth + rate limiting + audit |
| ~/iris-ecosystem-integration/brainsait_dashboard.py | Live dashboard |
| ~/iris-ecosystem-integration/brainsait_webhook.py | Webhook receiver |
| ~/iris-ecosystem-integration/brainsait_metrics.py | Metrics exporter |

System Tuning

vm.swappiness=5
vm.dirty_ratio=20
vm.dirty_background_ratio=5
fs.inotify.max_user_watches=524288
net.ipv4.tcp_fastopen=3
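
The values above are kernel parameters in sysctl notation. The sketch below compares the running kernel against this list by reading /proc/sys directly. The page does not say where the settings are persisted, so this only checks live values. `sysctl_path` and `check_sysctl` are hypothetical helpers:

```shell
# sysctl_path KEY: map a dotted sysctl key to its /proc/sys file.
sysctl_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# check_sysctl KEY EXPECTED: report whether the running kernel matches EXPECTED.
check_sysctl() {
  actual=$(cat "$(sysctl_path "$1")" 2>/dev/null) || actual="unreadable"
  if [ "$actual" = "$2" ]; then
    echo "ok       $1=$actual"
  else
    echo "mismatch $1=$actual (expected $2)"
  fi
}

# Check each documented setting; this reads values only and changes nothing.
while IFS='=' read -r key value; do
  check_sysctl "$key" "$value"
done <<'EOF'
vm.swappiness=5
vm.dirty_ratio=20
vm.dirty_background_ratio=5
fs.inotify.max_user_watches=524288
net.ipv4.tcp_fastopen=3
EOF
```

A `mismatch` line after a reboot suggests the setting was applied at runtime but never written to a persistent sysctl configuration file.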

Docker daemon (/etc/docker/daemon.json):

{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}

Cheat Sheet

# Health
curl localhost:58773/health | jq .summary
curl localhost:58080/health | jq .services

# Pulse agents (19)
curl localhost:58080/linc/ | jq .total
curl "localhost:58080/linc/summary?patient=P-5842" | jq

# Predictive models (5)
curl "localhost:58080/linc/predict-readmission?patientId=P001" | jq .
curl "localhost:58080/linc/predict-interaction?patientId=P005&medications=metformin,lisinopril" | jq .
curl "localhost:58080/linc/predict-no-show?patientId=P003&appointmentDate=2026-06-15" | jq .

# Chat
curl "localhost:58080/hf/chat?q=what+is+diabetes" | jq

# HF models
curl localhost:58080/hf/models | jq

# Security
curl localhost:58080/security/health | jq

# Recovery
curl -X POST localhost:58773/probe
curl localhost:58773/recover/all

# Workflows
curl localhost:58773/workflow/homecare/P-5842

# Docker
docker ps --format "table {{.Names}}\t{{.Status}}"
docker logs --tail 50 iris-unified
