Upload CSV or Excel files and ask natural language questions. Powered by OpenAI and PandasAI, with production-grade sandboxed code execution.
## Contents

- Features
- Architecture
- Security Model
- AI Accuracy
- Quick Start
- Environment Variables
- Project Structure
- API Reference
- Testing
- Production Deployment
- Troubleshooting
## Features

- **Natural language queries** — ask questions about your data in plain English
- **Multi-format support** — CSV, XLSX, XLS with automatic sheet detection
- **AI chart generation** — bar charts, line graphs, histograms via matplotlib
- **Multi-file analysis** — query across multiple uploaded files simultaneously
- **Query history** — reuse and iterate on previous queries
- **Sandboxed execution** — AI-generated code runs in an isolated Docker container, never in the backend process
- **Anonymous sessions** — no sign-up required; JWT-based session management
- **Hallucination reduction** — the full DataFrame contents are injected into every LLM prompt, so the model sees all rows before writing code
- **Robust serialization** — handles the real-world Excel/CSV types that break naive JSON encoding: datetime, timedelta, numpy scalars, Decimal, UUID, bytes, complex numbers, inf, NaN
## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                            User Browser                             │
│                       http://localhost:3000                         │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ HTTPS / REST
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Next.js Frontend                           │
│                                                                     │
│   React 18 · TypeScript · TanStack Query · Tailwind CSS             │
│   Axios (JWT interceptor) · react-dropzone                          │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ REST API (port 8000)
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          FastAPI Backend                            │
│                                                                     │
│   Async Python 3.11 · SQLAlchemy (asyncpg) · Pydantic v2            │
│   JWT auth · structlog · slowapi rate limiting                      │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────┐      │
│   │                   AI Query Pipeline                      │      │
│   │                                                          │      │
│   │   User prompt ──► PandasAI (code generation)             │      │
│   │                        │                                 │      │
│   │                        ▼                                 │      │
│   │        CodeExecution.execute_code                        │      │
│   │        [monkey-patched → SandboxRunner]                  │      │
│   └──────────────────────┬───────────────────────────────────┘      │
│                          │ HTTP (internal network)                  │
└──────────────────────────┼──────────────────────────────────────────┘
                           │
          ┌────────────────┴─────────────────────┐
          │                                      │
          ▼                                      ▼
┌──────────────────┐                   ┌──────────────────┐
│  Sandbox Sidecar │                   │  PostgreSQL 16   │
│   (port 9000)    │                   │                  │
│                  │                   │  sessions        │
│  FastAPI + auth  │                   │  uploaded_files  │
│  Docker socket   │                   │  query_history   │
│  (only service   │                   │  feedback        │
│   with socket)   │                   └──────────────────┘
└────────┬─────────┘
         │ docker run (locked-down)
         ▼
┌──────────────────────────────────────────────────────────┐
│                    Sandbox Container                     │
│                                                          │
│   python:3.11-slim · pandas · numpy · matplotlib         │
│                                                          │
│   --network=none         no network access               │
│   --read-only            immutable filesystem            │
│   --tmpfs /tmp           64 MB scratch (noexec)          │
│   --memory=256m          hard memory cap                 │
│   --cpus=0.5             CPU quota                       │
│   --pids-limit=32        prevents fork bombs             │
│   --cap-drop=ALL         no Linux capabilities           │
│   --user=65534:65534     runs as nobody                  │
│   restricted __builtins__   no import/open/exec/eval     │
└──────────────────────────────────────────────────────────┘
```
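The sidecar is the only service with Docker socket access, and it authenticates callers with a constant-time key comparison (`secrets.compare_digest`, per the security table below). A minimal stdlib sketch of that check — the function name and env handling are illustrative; the real sidecar wires this into FastAPI:

```python
import os
import secrets


def authorized(presented_key: str) -> bool:
    """Constant-time API-key check in the spirit of the sidecar's auth.

    Reads SANDBOX_API_KEY at call time; the real service's wiring
    (header extraction, FastAPI dependency) is omitted here.
    """
    expected = os.environ.get("SANDBOX_API_KEY", "")
    # Reject empty/unset keys outright; compare_digest avoids timing leaks.
    return bool(expected) and secrets.compare_digest(presented_key, expected)
```

Using `compare_digest` instead of `==` means the comparison takes the same time regardless of how many leading characters match, closing off timing side channels.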
### Query lifecycle

```
User prompt
    │
    ▼
PandasAI: build prompt with DataFrame schemas
    │
    ▼
OpenAI GPT-4o-mini: generate Python code
    │
    ▼
PandasAI: clean & validate generated code
    │
    ▼
CodeExecution.execute_code  [PATCHED]
    │
    ├─ serialize DataFrames as JSON
    │
    ▼
POST /execute ──► Sandbox Sidecar
    │
    ▼
docker run  [locked-down container]
    │
    ▼
worker.py: exec(code) with restricted __builtins__
    │     dfs[0], pd, np, plt available
    │     no network, no filesystem writes (except /tmp)
    │
    ▼
result serialised to JSON (scalar / dataframe / chart base64)
    │
    ▼
PandasAI: result parsing & validation
    │
    ▼
API response → Frontend
```
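The `[PATCHED]` step is a plain monkey-patch: PandasAI's in-process executor method is swapped for a call into the sandbox runner. A self-contained sketch of the technique with stand-in classes — the real code patches PandasAI's actual `CodeExecution.execute_code` and calls `SandboxRunner` in `runner.py`; the method signatures here are assumptions:

```python
class CodeExecution:
    """Stand-in for PandasAI's executor class."""

    def execute_code(self, code: str, context) -> dict:
        raise RuntimeError("unsafe: would run AI-generated code in-process")


class SandboxRunner:
    """Stand-in for app/sandbox/runner.py; the real run() POSTs the code
    and serialised DataFrames to the sidecar's /execute endpoint."""

    def run(self, code: str, context) -> dict:
        return {"type": "string", "value": f"ran {len(code)} chars in sandbox"}


_runner = SandboxRunner()


def _sandboxed_execute_code(self, code, context):
    # AI-generated code never executes in the backend process.
    return _runner.run(code, context)


# The monkey-patch: replace the method on the class at import time,
# so every PandasAI pipeline instance picks up the sandboxed version.
CodeExecution.execute_code = _sandboxed_execute_code

result = CodeExecution().execute_code("df.sum()", context=None)
# result["value"] == "ran 8 chars in sandbox"
```

Patching at the class level (rather than per instance) means code paths deep inside the library that construct their own `CodeExecution` objects are also covered.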
## Security Model

### Isolation backends

The backend automatically selects the safest available isolation backend at startup:

```
1. Sidecar (RECOMMENDED) ─── SANDBOX_API_URL + SANDBOX_API_KEY set.
                             Backend has no Docker socket access.
                             A compromised backend can only call POST /execute.
2. Docker (development)  ─── Docker daemon reachable + sandbox image built.
                             Direct socket access. Suitable for local dev.
3. Subprocess (fallback) ─── Python-level isolation only.
                             A CRITICAL log is emitted. Not safe for production.
```
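The safest-first selection above can be sketched as follows. The env var names match this README; the docker/daemon checks are simplified stand-ins for the real health probes in `runner.py`:

```python
import os
import shutil


def pick_sandbox_backend() -> str:
    """Illustrative safest-first backend selection.

    The real implementation also verifies that the Docker daemon is
    reachable and that the sandbox image exists, and logs CRITICAL
    when it falls back to subprocess isolation.
    """
    if os.environ.get("SANDBOX_API_URL") and os.environ.get("SANDBOX_API_KEY"):
        return "sidecar"        # backend never touches the Docker socket
    if shutil.which("docker"):  # simplified stand-in for a daemon health check
        return "docker"
    return "subprocess"         # Python-level isolation only
```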
### Defense in depth

| Layer | Control | Detail |
|---|---|---|
| Network isolation | `--network=none` | Sandbox container has zero network access |
| Filesystem | `--read-only` + `--tmpfs /tmp` | Root FS immutable; only 64 MB scratch in `/tmp` |
| Memory | `--memory=256m --memory-swap=256m` | Hard cap, no swap |
| CPU | `--cpus=0.5` | Prevents CPU exhaustion |
| Processes | `--pids-limit=32` | Prevents fork bombs |
| Capabilities | `--cap-drop=ALL` | All Linux capabilities removed |
| Privilege escalation | `--security-opt=no-new-privileges:true` | Locked |
| User | `--user=65534:65534` | Runs as nobody:nogroup |
| Builtins | Restricted `__builtins__` | No `import`, `open`, `exec`, `eval`, `compile` |
| Docker socket | Sidecar-only | Backend container has no Docker socket mount |
| API key | `secrets.compare_digest` | Constant-time comparison on sidecar auth |
| File upload | 4-layer validation | MIME type + extension + magic bytes + size |
| SQL | SQLAlchemy ORM only | No raw SQL queries |
| Secrets | No defaults in production | `SECRET_KEY` rejected if weak; no fallback |
| Error leakage | Sanitised 500 responses | Internal errors never sent to clients |
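The restricted-`__builtins__` layer works by running `exec` against a namespace whose builtins dictionary simply omits the dangerous names. A minimal sketch in the spirit of `worker.py` — the actual allow-list and result plumbing may differ:

```python
# Allow-list of builtins exposed to untrusted code.
# Deliberately absent: __import__, open, exec, eval, compile.
SAFE_BUILTINS = {
    "len": len, "range": range, "sum": sum,
    "min": min, "max": max, "abs": abs, "round": round,
}


def run_untrusted(code: str, namespace: dict):
    """Execute untrusted code with a restricted builtins table.

    Any reference to a blocked name (open, eval, ...) raises NameError,
    and `import` fails because __import__ is unavailable.
    """
    env = {"__builtins__": SAFE_BUILTINS, **namespace}
    exec(code, env)
    return env.get("result")


print(run_untrusted("result = sum(range(5))", {}))  # → 10
```

Note that restricted builtins are a containment layer, not a complete sandbox on their own, which is exactly why the table stacks it with network, filesystem, and capability controls.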
## AI Accuracy

Real-world Excel files are often form-style layouts with many `Unnamed` columns and data in unpredictable cell positions. To stop GPT from guessing wrong `iloc` indices, the backend applies three layers before every query:
| Layer | What it does |
|---|---|
| Full data context | For DataFrames ≤ 60 rows, the complete `df.to_string()` output is appended to the user prompt so the model can read every value before writing code |
| Increased `rows_to_read` | PandasAI's own prompt section shows up to 30 rows (the default is 5) |
| Custom instructions | Instructs the model to search for values by content, use `pd.to_numeric(errors='coerce')`, and handle `None`/`NaN` explicitly instead of blindly multiplying |
For larger DataFrames (> 60 rows), column info, dtypes, and first/last 10 rows are provided instead of the full dump.
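The full-data-context layer can be sketched like this. The threshold matches the README; the constant and function names are assumptions, not the actual identifiers in the backend:

```python
import pandas as pd

FULL_DUMP_MAX_ROWS = 60  # threshold from the README; name assumed


def build_data_context(df: pd.DataFrame) -> str:
    """Small frames are dumped whole so the model reads every value;
    large frames get columns, dtypes, and the first/last 10 rows."""
    if len(df) <= FULL_DUMP_MAX_ROWS:
        return df.to_string()
    head_tail = pd.concat([df.head(10), df.tail(10)])
    return (
        f"Columns: {list(df.columns)}\n"
        f"Dtypes:\n{df.dtypes.to_string()}\n"
        f"First/last 10 rows:\n{head_tail.to_string()}"
    )
```

Dumping the whole frame trades prompt tokens for accuracy, which is why the cutoff exists: beyond ~60 rows the summary keeps prompts bounded while still showing the model real values.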
### Robust serialization

The `_coerce_cell` function in `runner.py` normalises every DataFrame cell to a JSON-safe Python primitive before data is sent to the sandbox container. Supported types:
| Input type | Serialised as |
|---|---|
| `datetime`, `date`, `time`, `pd.Timestamp` | ISO 8601 string |
| `timedelta`, `pd.Timedelta` | Total seconds (float) |
| `numpy.integer` | `int` |
| `numpy.floating` | `float` (inf/NaN → `None`) |
| `numpy.bool_` | `bool` |
| `numpy.bytes_` | UTF-8 string |
| `decimal.Decimal` | `float` |
| `uuid.UUID` | string |
| `bytes` | UTF-8 string |
| `complex` | string (`"a+bj"`) |
| `pd.NaT`, `pd.NA`, `float('nan')`, `inf` | `None` |
| Everything else | `str(v)` |
`print()` calls inside sandbox-executed code are redirected to stderr so they never pollute the JSON result written to stdout.
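A sketch of `_coerce_cell`-style normalisation mirroring the table above — the real `runner.py` may order its checks differently and handle additional corner cases:

```python
import datetime as dt
import decimal
import math
import uuid

import numpy as np
import pandas as pd


def coerce_cell(v):
    """Normalise one DataFrame cell to a JSON-safe Python primitive."""
    if v is None or v is pd.NaT or v is pd.NA:
        return None
    if isinstance(v, (dt.datetime, dt.date, dt.time)):  # incl. pd.Timestamp
        return v.isoformat()
    if isinstance(v, dt.timedelta):                      # incl. pd.Timedelta
        return v.total_seconds()
    if isinstance(v, np.bool_):
        return bool(v)
    if isinstance(v, np.integer):
        return int(v)
    if isinstance(v, (float, np.floating)):
        f = float(v)
        return None if math.isnan(f) or math.isinf(f) else f
    if isinstance(v, decimal.Decimal):
        return float(v)
    if isinstance(v, uuid.UUID):
        return str(v)
    if isinstance(v, (bytes, np.bytes_)):
        return v.decode("utf-8", errors="replace")
    if isinstance(v, complex):
        return str(v)
    if isinstance(v, (bool, int, str)):
        return v
    return str(v)  # everything else falls back to its string form
```

Order matters: `pd.Timestamp` subclasses `datetime` and `pd.Timedelta` subclasses `timedelta`, so the stdlib checks cover the pandas types for free.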
## Quick Start

### Prerequisites

- Docker Desktop 4.x or later
- An OpenAI API key
### Setup

```bash
git clone <repo-url>
cd data-analysis-app
```

Generate secrets and create the root `.env` (read by docker-compose):

```bash
# Generate a strong sandbox API key
python3 -c "import secrets; print('SANDBOX_API_KEY=' + secrets.token_hex(32))" >> .env
```

Create `backend/.env`:

```bash
cp backend/.env.example backend/.env
# Edit backend/.env — set SECRET_KEY and OPENAI_API_KEY
python3 -c "import secrets; print(secrets.token_hex(32))"  # use output as SECRET_KEY
```

Build the sandbox image:

```bash
docker build -t data-analysis-sandbox:latest backend/app/sandbox/
```

Start the stack:

```bash
docker compose up --build
```

| Service | URL |
|---|---|
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |
- Open http://localhost:3000 — a session JWT is issued automatically
- Upload a CSV or Excel file via drag-and-drop
- Select the file and sheet in the left panel
- Type a question in the Query box, e.g. "What is the average salary by department?"
- Click Submit — results appear as text, table, or chart
- Previous queries appear in the history panel; click Reuse to populate the query box
## Environment Variables

`backend/.env`:

```env
# ── Required ──────────────────────────────────────────────
OPENAI_API_KEY=sk-...
# Minimum 32 characters. Generate: python3 -c "import secrets; print(secrets.token_hex(32))"
SECRET_KEY=

# ── Database ──────────────────────────────────────────────
DATABASE_URL=postgresql+asyncpg://postgres:postgres@postgres:5432/data_analysis

# ── Environment ───────────────────────────────────────────
ENVIRONMENT=development   # set to "production" to enable secret-key validation
LOG_LEVEL=INFO

# ── CORS ──────────────────────────────────────────────────
ALLOWED_ORIGINS=["http://localhost:3000"]

# ── File upload ───────────────────────────────────────────
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50
MAX_SESSION_SIZE_MB=200

# ── AI ────────────────────────────────────────────────────
AI_TIMEOUT_SECONDS=30
AI_MODEL=gpt-4o-mini
AI_TEMPERATURE=0

# ── Sandbox sidecar ───────────────────────────────────────
# Set both to use the sidecar backend (recommended for production).
# Leave empty to fall back to direct Docker isolation.
SANDBOX_API_URL=http://sandbox-sidecar:9000
SANDBOX_API_KEY=          # must match root .env SANDBOX_API_KEY

# ── Rate limiting ─────────────────────────────────────────
RATE_LIMIT_QUERIES_PER_MINUTE=20
RATE_LIMIT_UPLOADS_PER_MINUTE=10
```

Root `.env`:

```env
# Shared between docker-compose and backend/.env
SANDBOX_API_KEY=<generated-hex-string>
```

## Project Structure

```
data-analysis-app/
├── .env                        # root-level (docker-compose reads this)
├── .gitignore
├── docker-compose.yml
├── README.md
│
├── backend/
│   ├── Dockerfile
│   ├── .env                    # backend service env
│   ├── .env.example
│   ├── requirements.txt
│   │
│   ├── alembic/
│   │   ├── env.py
│   │   └── versions/
│   │       └── 0001_initial_schema.py
│   │
│   ├── tests/
│   │   └── test_sandbox.py     # 25 sandbox integration tests
│   │
│   └── app/
│       ├── main.py
│       ├── dependencies.py
│       ├── models.py
│       │
│       ├── core/
│       │   ├── config.py       # Pydantic BaseSettings + production validation
│       │   ├── database.py     # SQLAlchemy async engine
│       │   ├── logging.py      # structlog JSON logging
│       │   └── security.py     # JWT + 4-layer file validation
│       │
│       ├── routers/
│       │   ├── auth.py
│       │   ├── files.py
│       │   ├── preview.py
│       │   ├── query.py
│       │   ├── history.py
│       │   └── feedback.py
│       │
│       ├── schemas/
│       │   ├── query.py
│       │   └── files.py
│       │
│       ├── services/
│       │   ├── ai_engine.py    # PandasAI integration + sandbox monkey-patch
│       │   ├── file_manager.py
│       │   └── sheet_parser.py
│       │
│       └── sandbox/
│           ├── runner.py       # SandboxRunner (sidecar/docker/subprocess backends)
│           ├── worker.py       # Executed inside sandbox container
│           ├── Dockerfile      # python:3.11-slim + pandas/numpy/matplotlib only
│           ├── .dockerignore
│           └── sidecar/
│               ├── main.py     # FastAPI sidecar (POST /execute, GET /health)
│               ├── Dockerfile  # Multi-stage: docker CLI + python:3.11-slim
│               └── requirements.txt
│
└── frontend/
    ├── Dockerfile
    ├── package.json
    ├── tsconfig.json
    ├── next.config.js
    └── src/
        ├── app/
        │   ├── layout.tsx
        │   ├── page.tsx
        │   ├── dashboard/page.tsx
        │   └── globals.css
        ├── components/
        │   ├── FileUploadZone.tsx
        │   ├── FileList.tsx
        │   ├── SheetSelector.tsx
        │   ├── DataPreviewTable.tsx
        │   ├── QueryInput.tsx
        │   ├── AnswerDisplay.tsx
        │   ├── PromptHistoryPanel.tsx
        │   └── FeedbackButtons.tsx
        ├── hooks/
        │   ├── useSession.ts
        │   ├── useFiles.ts
        │   ├── usePreview.ts
        │   ├── useQuery.ts
        │   └── useHistory.ts
        └── lib/
            ├── api.ts
            ├── types.ts
            └── queryClient.ts
```
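`core/config.py` in the tree above couples settings loading to the production secret-key validation mentioned in the security table. A dependency-free sketch of that guard — the real file uses Pydantic `BaseSettings`, and the field/method names here are assumptions:

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Stdlib stand-in for the Pydantic BaseSettings in core/config.py,
    illustrating the production secret-key guard."""

    environment: str = "development"
    secret_key: str = ""

    @classmethod
    def from_env(cls) -> "Settings":
        s = cls(
            environment=os.environ.get("ENVIRONMENT", "development"),
            secret_key=os.environ.get("SECRET_KEY", ""),
        )
        # "No defaults in production": a weak or empty SECRET_KEY is rejected
        # at startup rather than silently falling back.
        if s.environment == "production" and len(s.secret_key) < 32:
            raise RuntimeError("SECRET_KEY must be at least 32 characters in production")
        return s
```

Failing fast at startup is the point: a misconfigured production deployment refuses to boot instead of running with a guessable signing key.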
## API Reference

### Auth

| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/auth/token` | Issue anonymous JWT (7-day expiry) |

### Files

| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/files` | Upload CSV or Excel file |
| GET | `/api/v1/files` | List files for current session |
| DELETE | `/api/v1/files/{file_id}` | Soft-delete a file |

### Preview

| Method | Path | Description |
|---|---|---|
| GET | `/api/v1/preview` | Preview top N rows of a sheet |

### Query & History

| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/query` | Execute natural language query |
| GET | `/api/v1/history` | Paginated query history |
| DELETE | `/api/v1/history/{query_id}` | Delete a history entry |

### Feedback

| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/feedback` | Submit thumbs up / thumbs down |

All endpoints except `/api/v1/auth/token` require `Authorization: Bearer <jwt>`.
Interactive API docs are available at http://localhost:8000/docs.
## Testing

Tests run against the real Docker sandbox image and verify:

- **Correct results** — scalar, string, DataFrame, numpy types, PandasAI result dicts
- **Security boundaries** — `os`, `subprocess`, `__import__`, `open`, `eval`, `exec` all blocked
- **Network** — `--network=none` confirmed blocking outbound connections
- **Filesystem** — `--read-only` confirmed blocking writes
- **Timeout** — infinite loops killed correctly
- **Edge cases** — syntax errors, division by zero, NaN serialisation, large DataFrames
- **Serialization robustness** — datetime, date, time, timedelta, numpy int/float/bool, inf/NaN → None, mixed Excel-like sheets

```bash
# From repo root
python -m pytest backend/tests/test_sandbox.py -v
```

Or inside the container:

```bash
docker compose exec backend pytest tests/ -v
```

## Production Deployment

- Generate a strong `SECRET_KEY` (≥ 32 chars, not the default)
- Generate a strong `SANDBOX_API_KEY`
- Set `ENVIRONMENT=production` to enable secret-key validation
- Build the sandbox image on the production host: `docker build -t data-analysis-sandbox:latest backend/app/sandbox/`
- Use a managed PostgreSQL service (AWS RDS, Cloud SQL, etc.)
- Place an nginx reverse proxy with TLS termination in front of the backend
- Set `ALLOWED_ORIGINS` to your actual frontend domain
- Switch JWT storage from `localStorage` to `httpOnly` cookies
- Set strict Content-Security-Policy headers
- Ship logs to a centralised log aggregator (structlog outputs JSON)
## Troubleshooting

**Docker socket permissions (sidecar)**

On Linux, the Docker socket is owned by the `docker` group. Update `group_add` in `docker-compose.yml`:

```bash
getent group docker | cut -d: -f3   # get the GID
```

Then set `group_add: ["<GID>"]` for the `sandbox-sidecar` service. On macOS with Docker Desktop, `"0"` (root group) is correct.

**Conflicting containers**

```bash
docker ps
docker stop <container-id>
```

**Sandbox image missing**

```bash
docker build -t data-analysis-sandbox:latest backend/app/sandbox/
```

**Sandbox API key mismatch**

```bash
# Root .env must contain:
SANDBOX_API_KEY=<your-key>
# backend/.env must contain the same value for SANDBOX_API_KEY
```

**Resetting the database**

```bash
docker compose down -v   # drops volumes — data will be lost
docker compose up --build
```

**AI queries failing**

- Verify `OPENAI_API_KEY` is valid and the account has credits
- Model `gpt-4o-mini` must be available on the account
- Check rate limits: default 20 queries/min, 10 uploads/min

**Verifying the sandbox is active**

Check backend startup logs for:

```
Sandbox using sidecar backend url=http://sandbox-sidecar:9000
PandasAI CodeExecution.execute_code patched → sandbox backend backend=sidecar
```

Check sidecar logs during a query — each execution produces:

```
POST /execute HTTP/1.1" 200 OK
```
## License

MIT