
AI-Powered Data Analysis Platform

Upload CSV or Excel files and ask natural language questions. Powered by OpenAI and PandasAI, with production-grade sandboxed code execution.


Table of Contents

  • Features
  • Architecture
  • AI Accuracy
  • Quick Start
  • Environment Variables
  • Project Structure
  • API Reference
  • Testing
  • Production Deployment
  • Troubleshooting
  • License

Features

  • Natural language queries — ask questions about your data in plain English
  • Multi-format support — CSV, XLSX, XLS with automatic sheet detection
  • AI chart generation — bar charts, line graphs, histograms via matplotlib
  • Multi-file analysis — query across multiple uploaded files simultaneously
  • Query history — reuse and iterate on previous queries
  • Sandboxed execution — AI-generated code runs in an isolated Docker container, never in the backend process
  • Anonymous sessions — no sign-up required; JWT-based session management
  • Hallucination reduction — for small sheets, the full DataFrame contents are injected into the LLM prompt so the model sees the actual values before writing code
  • Robust serialization — handles every real-world Excel/CSV type: datetime, timedelta, numpy scalars, Decimal, UUID, bytes, complex numbers, inf, NaN

Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          User Browser                               │
│                     http://localhost:3000                           │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ HTTPS / REST
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Next.js Frontend                                │
│                                                                     │
│   React 18 · TypeScript · TanStack Query · Tailwind CSS            │
│   Axios (JWT interceptor) · react-dropzone                         │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ REST API (port 8000)
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     FastAPI Backend                                 │
│                                                                     │
│   Async Python 3.11 · SQLAlchemy (asyncpg) · Pydantic v2          │
│   JWT auth · structlog · slowapi rate limiting                     │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────┐     │
│   │                  AI Query Pipeline                        │     │
│   │                                                          │     │
│   │  User prompt ──► PandasAI (code generation)             │     │
│   │                       │                                  │     │
│   │                       ▼                                  │     │
│   │             CodeExecution.execute_code                   │     │
│   │             [monkey-patched → SandboxRunner]             │     │
│   └──────────────────────┬───────────────────────────────────┘     │
│                          │ HTTP (internal network)                  │
└──────────────────────────┼──────────────────────────────────────────┘
                           │
          ┌────────────────┴─────────────────────┐
          │                                      │
          ▼                                      ▼
┌──────────────────┐                   ┌──────────────────┐
│  Sandbox Sidecar │                   │   PostgreSQL 16  │
│  (port 9000)     │                   │                  │
│                  │                   │  sessions        │
│  FastAPI + auth  │                   │  uploaded_files  │
│  Docker socket   │                   │  query_history   │
│  (only service   │                   │  feedback        │
│   with socket)   │                   └──────────────────┘
└────────┬─────────┘
         │ docker run (locked-down)
         ▼
┌──────────────────────────────────────────────────────────┐
│              Sandbox Container                           │
│                                                          │
│  python:3.11-slim · pandas · numpy · matplotlib         │
│                                                          │
│  --network=none          no network access               │
│  --read-only             immutable filesystem            │
│  --tmpfs /tmp            64 MB scratch (noexec)          │
│  --memory=256m           hard memory cap                 │
│  --cpus=0.5              CPU quota                       │
│  --pids-limit=32         prevents fork bombs             │
│  --cap-drop=ALL          no Linux capabilities           │
│  --user=65534:65534      runs as nobody                  │
│  restricted __builtins__ no import/open/exec/eval        │
└──────────────────────────────────────────────────────────┘
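
The container flags above map directly onto a docker run invocation. Below is a minimal Python sketch of how a runner could assemble that command; the function name, worker path, and stdin-based payload handling are illustrative assumptions, not copied from runner.py.

import subprocess

def run_in_sandbox(payload_json: str, timeout: int = 30) -> str:
    # Illustrative only: flags mirror the diagram above.
    cmd = [
        "docker", "run", "--rm", "-i",
        "--network=none",                             # no network access
        "--read-only",                                # immutable root filesystem
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",  # 64 MB scratch, noexec
        "--memory=256m", "--memory-swap=256m",        # hard memory cap, no swap
        "--cpus=0.5",                                 # CPU quota
        "--pids-limit=32",                            # prevent fork bombs
        "--cap-drop=ALL",                             # drop all Linux capabilities
        "--security-opt=no-new-privileges:true",
        "--user=65534:65534",                         # run as nobody
        "data-analysis-sandbox:latest",
        "python", "/app/worker.py",                   # worker path is an assumption
    ]
    proc = subprocess.run(cmd, input=payload_json, capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout  # worker writes the JSON result to stdout; prints go to stderr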

AI Query Flow

User prompt
    │
    ▼
PandasAI: build prompt with DataFrame schemas
    │
    ▼
OpenAI GPT-4o-mini: generate Python code
    │
    ▼
PandasAI: clean & validate generated code
    │
    ▼
CodeExecution.execute_code [PATCHED]
    │
    ├─ serialize DataFrames as JSON
    │
    ▼
POST /execute ──► Sandbox Sidecar
    │
    ▼
docker run [locked-down container]
    │
    ▼
worker.py: exec(code) with restricted __builtins__
    │    dfs[0], pd, np, plt available
    │    no network, no filesystem writes (except /tmp)
    │
    ▼
result serialised to JSON (scalar / dataframe / chart base64)
    │
    ▼
PandasAI: ResultParsing & validation
    │
    ▼
API response → Frontend
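
The [PATCHED] step is a monkey-patch applied in ai_engine.py. A conceptual sketch is shown below; the exact PandasAI import path and method signature vary between versions, so this only illustrates the idea, not the real patch.

def patch_code_execution(code_execution_cls, runner):
    """Replace PandasAI's CodeExecution.execute_code so generated code
    never runs inside the backend process (conceptual sketch only)."""
    def sandboxed_execute(self, code, context):
        # Serialize the session's DataFrames and forward the generated code
        # to the SandboxRunner (sidecar / docker / subprocess backend).
        return runner.run(code, dataframes=getattr(context, "dfs", []))
    code_execution_cls.execute_code = sandboxed_execute

# e.g. patch_code_execution(CodeExecution, SandboxRunner()) once at startup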

Sandbox Backend Selection

The backend automatically selects the safest available isolation backend at startup:

1. Sidecar  (RECOMMENDED) ─── SANDBOX_API_URL + SANDBOX_API_KEY set
                               Backend has no Docker socket access.
                               Compromised backend can only call POST /execute.

2. Docker   (development) ─── Docker daemon reachable + sandbox image built.
                               Direct socket access. Suitable for local dev.

3. Subprocess (fallback)  ─── Python-level isolation only.
                               CRITICAL log emitted. Not safe for production.
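
A hedged sketch of that selection order (the real logic in backend/app/sandbox/runner.py checks daemon reachability and the image more carefully):

import logging
import os
import shutil

logger = logging.getLogger("sandbox")

def select_backend() -> str:
    # 1. Sidecar: both env vars present, backend never touches the Docker socket.
    if os.getenv("SANDBOX_API_URL") and os.getenv("SANDBOX_API_KEY"):
        return "sidecar"
    # 2. Direct Docker: CLI available and sandbox image built (simplified check).
    if shutil.which("docker") and sandbox_image_exists():  # hypothetical helper
        return "docker"
    # 3. Last resort: Python-level isolation only.
    logger.critical("Sandbox falling back to subprocess backend; not safe for production")
    return "subprocess"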

Security Model

| Layer | Control | Detail |
| --- | --- | --- |
| Network isolation | --network=none | Sandbox container has zero network access |
| Filesystem | --read-only + --tmpfs /tmp | Root FS immutable; only 64 MB scratch in /tmp |
| Memory | --memory=256m --memory-swap=256m | Hard cap, no swap |
| CPU | --cpus=0.5 | Prevents CPU exhaustion |
| Processes | --pids-limit=32 | Prevents fork bombs |
| Capabilities | --cap-drop=ALL | All Linux capabilities removed |
| Privilege escalation | --security-opt=no-new-privileges:true | Locked |
| User | --user=65534:65534 | Runs as nobody:nogroup |
| Builtins | Restricted __builtins__ | No import, open, exec, eval, compile |
| Docker socket | Sidecar-only | Backend container has no Docker socket mount |
| API key | secrets.compare_digest | Constant-time comparison on sidecar auth |
| File upload | 4-layer validation | MIME type + extension + magic bytes + size |
| SQL | SQLAlchemy ORM only | No raw SQL queries |
| Secrets | No defaults in production | SECRET_KEY rejected if weak; no fallback |
| Error leakage | Sanitised 500 responses | Internal errors never sent to clients |
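
Two of the rows above can be illustrated together. The sketch below shows how worker.py might exec() generated code under a restricted __builtins__ and redirect print() to stderr; the allow-list and scope wiring are simplified, and the real worker also exposes plt (matplotlib) and serialises the result.

import builtins
import sys
import numpy as np
import pandas as pd

_ALLOWED = ("len", "range", "min", "max", "sum", "abs", "round", "sorted",
            "enumerate", "zip", "list", "dict", "set", "tuple",
            "str", "int", "float", "bool", "isinstance")
SAFE_BUILTINS = {name: getattr(builtins, name) for name in _ALLOWED}
# Deliberately absent: __import__, open, exec, eval, compile.

def run_user_code(code: str, dfs: list) -> dict:
    scope = {
        "__builtins__": SAFE_BUILTINS,
        "dfs": dfs, "pd": pd, "np": np,
        # print() goes to stderr so it cannot pollute the JSON result on stdout
        "print": lambda *a, **kw: builtins.print(*a, file=sys.stderr, **kw),
    }
    exec(code, scope)
    return scope.get("result", {})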

AI Accuracy

Hallucination Reduction

Real-world Excel files are often form-style layouts with many Unnamed columns and data in unpredictable cell positions. To prevent GPT from guessing wrong iloc indices, the backend applies three layers before every query:

| Layer | What it does |
| --- | --- |
| Full data context | For DataFrames ≤ 60 rows, the complete df.to_string() is appended to the user prompt so the model can read every value before writing code |
| Increased rows_to_read | PandasAI's own prompt section shows up to 30 rows (the default is 5) |
| Custom instructions | Instructs the model to search for values by content, use pd.to_numeric(errors='coerce'), and handle None/NaN explicitly instead of blindly multiplying |

For larger DataFrames (> 60 rows), column info, dtypes, and first/last 10 rows are provided instead of the full dump.
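
A minimal sketch of that context-injection rule (the function name and exact wording are illustrative; the thresholds follow the table above):

import pandas as pd

def build_data_context(df: pd.DataFrame, max_full_rows: int = 60) -> str:
    if len(df) <= max_full_rows:
        # Small sheets: show every cell so the model never guesses iloc positions.
        return "Full data:\n" + df.to_string()
    # Larger sheets: column info, dtypes, and the first/last 10 rows only.
    return (
        "Columns: " + ", ".join(map(str, df.columns)) + "\n"
        "Dtypes:\n" + df.dtypes.to_string() + "\n"
        "First 10 rows:\n" + df.head(10).to_string() + "\n"
        "Last 10 rows:\n" + df.tail(10).to_string()
    )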

DataFrame Serialization

The _coerce_cell function in runner.py normalises every DataFrame cell to a JSON-safe Python primitive before sending data to the sandbox container. Supported types:

| Input type | Serialised as |
| --- | --- |
| datetime, date, time, pd.Timestamp | ISO 8601 string |
| timedelta, pd.Timedelta | Total seconds (float) |
| numpy.integer | int |
| numpy.floating | float (inf/NaN → None) |
| numpy.bool_ | bool |
| numpy.bytes_ | UTF-8 string |
| decimal.Decimal | float |
| uuid.UUID | string |
| bytes | UTF-8 string |
| complex | string ("a+bj") |
| pd.NaT, pd.NA, float('nan'), inf | None |
| Everything else | str(v) |

print() calls inside sandbox-executed code are redirected to stderr so they never pollute the JSON result written to stdout.
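
A condensed sketch of those rules (the real _coerce_cell in runner.py covers more corner cases, but the mapping matches the table above):

import datetime as dt
import decimal
import math
import uuid
import numpy as np
import pandas as pd

def _coerce_cell(v):
    if v is None or v is pd.NaT or v is pd.NA:
        return None                                   # missing values
    if isinstance(v, (dt.datetime, dt.date, dt.time)):
        return v.isoformat()                          # covers pd.Timestamp too
    if isinstance(v, dt.timedelta):
        return v.total_seconds()                      # covers pd.Timedelta too
    if isinstance(v, np.integer):
        return int(v)
    if isinstance(v, (float, np.floating)):
        f = float(v)
        return None if math.isnan(f) or math.isinf(f) else f
    if isinstance(v, np.bool_):
        return bool(v)
    if isinstance(v, (bytes, np.bytes_)):
        return v.decode("utf-8", errors="replace")
    if isinstance(v, decimal.Decimal):
        return float(v)
    if isinstance(v, (uuid.UUID, complex)):
        return str(v)
    return v if isinstance(v, (str, int, bool)) else str(v)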


Quick Start

Prerequisites

  • Docker Desktop 4.x or later
  • An OpenAI API key

1. Clone & configure

git clone <repo-url>
cd data-analysis-app

Generate secrets and create the root .env (read by docker-compose):

# Generate a strong sandbox API key
python3 -c "import secrets; print('SANDBOX_API_KEY=' + secrets.token_hex(32))" >> .env

Create backend/.env:

cp backend/.env.example backend/.env
# Edit backend/.env — set SECRET_KEY and OPENAI_API_KEY
python3 -c "import secrets; print(secrets.token_hex(32))"  # use output as SECRET_KEY

2. Build the sandbox image

docker build -t data-analysis-sandbox:latest backend/app/sandbox/

3. Start all services

docker compose up --build
| Service | URL |
| --- | --- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |

4. First use

  1. Open http://localhost:3000 — a session JWT is issued automatically
  2. Upload a CSV or Excel file via drag-and-drop
  3. Select the file and sheet in the left panel
  4. Type a question in the Query box, e.g. "What is the average salary by department?"
  5. Click Submit — results appear as text, table, or chart
  6. Previous queries appear in the history panel; click Reuse to populate the query box

Environment Variables

backend/.env

# ── Required ──────────────────────────────────────────────
OPENAI_API_KEY=sk-...

# Minimum 32 characters. Generate: python3 -c "import secrets; print(secrets.token_hex(32))"
SECRET_KEY=

# ── Database ──────────────────────────────────────────────
DATABASE_URL=postgresql+asyncpg://postgres:postgres@postgres:5432/data_analysis

# ── Environment ───────────────────────────────────────────
ENVIRONMENT=development    # set to "production" to enable secret-key validation
LOG_LEVEL=INFO

# ── CORS ──────────────────────────────────────────────────
ALLOWED_ORIGINS=["http://localhost:3000"]

# ── File upload ───────────────────────────────────────────
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50
MAX_SESSION_SIZE_MB=200

# ── AI ────────────────────────────────────────────────────
AI_TIMEOUT_SECONDS=30
AI_MODEL=gpt-4o-mini
AI_TEMPERATURE=0

# ── Sandbox sidecar ───────────────────────────────────────
# Set both to use sidecar backend (recommended for production).
# Leave empty to fall back to direct Docker isolation.
SANDBOX_API_URL=http://sandbox-sidecar:9000
SANDBOX_API_KEY=   # must match root .env SANDBOX_API_KEY

# ── Rate limiting ─────────────────────────────────────────
RATE_LIMIT_QUERIES_PER_MINUTE=20
RATE_LIMIT_UPLOADS_PER_MINUTE=10

Root .env (docker-compose)

# Shared between docker-compose and backend/.env
SANDBOX_API_KEY=<generated-hex-string>
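
The secret-key validation enabled by ENVIRONMENT=production corresponds to a check in app/core/config.py. A hedged sketch of what such a Pydantic v2 settings validator can look like (field names match the variables above; the real config enforces more):

from pydantic import model_validator
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    ENVIRONMENT: str = "development"
    SECRET_KEY: str = ""

    @model_validator(mode="after")
    def _require_strong_secret_in_production(self):
        # Mirrors the checklist: at least 32 characters, no weak fallback.
        if self.ENVIRONMENT == "production" and len(self.SECRET_KEY) < 32:
            raise ValueError("SECRET_KEY must be a strong value (>= 32 chars) in production")
        return self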

Project Structure

data-analysis-app/
├── .env                          # root-level (docker-compose reads this)
├── .gitignore
├── docker-compose.yml
├── README.md
│
├── backend/
│   ├── Dockerfile
│   ├── .env                      # backend service env
│   ├── .env.example
│   ├── requirements.txt
│   │
│   ├── alembic/
│   │   ├── env.py
│   │   └── versions/
│   │       └── 0001_initial_schema.py
│   │
│   ├── tests/
│   │   └── test_sandbox.py       # sandbox integration tests
│   │
│   └── app/
│       ├── main.py
│       ├── dependencies.py
│       ├── models.py
│       │
│       ├── core/
│       │   ├── config.py         # Pydantic BaseSettings + production validation
│       │   ├── database.py       # SQLAlchemy async engine
│       │   ├── logging.py        # structlog JSON logging
│       │   └── security.py       # JWT + 4-layer file validation
│       │
│       ├── routers/
│       │   ├── auth.py
│       │   ├── files.py
│       │   ├── preview.py
│       │   ├── query.py
│       │   ├── history.py
│       │   └── feedback.py
│       │
│       ├── schemas/
│       │   ├── query.py
│       │   └── files.py
│       │
│       ├── services/
│       │   ├── ai_engine.py      # PandasAI integration + sandbox monkey-patch
│       │   ├── file_manager.py
│       │   └── sheet_parser.py
│       │
│       └── sandbox/
│           ├── runner.py         # SandboxRunner (sidecar/docker/subprocess backends)
│           ├── worker.py         # Executed inside sandbox container
│           ├── Dockerfile        # python:3.11-slim + pandas/numpy/matplotlib only
│           ├── .dockerignore
│           └── sidecar/
│               ├── main.py       # FastAPI sidecar (POST /execute, GET /health)
│               ├── Dockerfile    # Multi-stage: docker CLI + python:3.11-slim
│               └── requirements.txt
│
└── frontend/
    ├── Dockerfile
    ├── package.json
    ├── tsconfig.json
    ├── next.config.js
    └── src/
        ├── app/
        │   ├── layout.tsx
        │   ├── page.tsx
        │   ├── dashboard/page.tsx
        │   └── globals.css
        ├── components/
        │   ├── FileUploadZone.tsx
        │   ├── FileList.tsx
        │   ├── SheetSelector.tsx
        │   ├── DataPreviewTable.tsx
        │   ├── QueryInput.tsx
        │   ├── AnswerDisplay.tsx
        │   ├── PromptHistoryPanel.tsx
        │   └── FeedbackButtons.tsx
        ├── hooks/
        │   ├── useSession.ts
        │   ├── useFiles.ts
        │   ├── usePreview.ts
        │   ├── useQuery.ts
        │   └── useHistory.ts
        └── lib/
            ├── api.ts
            ├── types.ts
            └── queryClient.ts

API Reference

Authentication

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/auth/token | Issue anonymous JWT (7-day expiry) |

Files

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/files | Upload CSV or Excel file |
| GET | /api/v1/files | List files for current session |
| DELETE | /api/v1/files/{file_id} | Soft-delete a file |

Preview

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/preview | Preview top N rows of a sheet |

AI Query

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/query | Execute natural language query |
| GET | /api/v1/history | Paginated query history |
| DELETE | /api/v1/history/{query_id} | Delete a history entry |

Feedback

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/feedback | Submit thumbs up / thumbs down |

All endpoints except /api/v1/auth/token require Authorization: Bearer <jwt>.

Interactive API docs available at http://localhost:8000/docs.
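
A quick end-to-end example in Python (the field names access_token, id, prompt, and file_ids are illustrative assumptions; check the Swagger docs for the exact request and response schemas):

import requests

BASE = "http://localhost:8000/api/v1"

# 1. Obtain an anonymous session token
token = requests.post(f"{BASE}/auth/token").json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# 2. Upload a CSV file
with open("sales.csv", "rb") as f:
    uploaded = requests.post(f"{BASE}/files", headers=headers,
                             files={"file": ("sales.csv", f, "text/csv")}).json()

# 3. Ask a natural language question about it
answer = requests.post(f"{BASE}/query", headers=headers,
                       json={"prompt": "What is the average salary by department?",
                             "file_ids": [uploaded["id"]]}).json()
print(answer)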


Testing

Sandbox integration tests (32 tests)

Tests run against the real Docker sandbox image and verify:

  • Correct results: scalar, string, DataFrame, numpy types, PandasAI result dicts
  • Security boundaries: os, subprocess, __import__, open, eval, exec all blocked
  • Network: --network=none confirmed to block outbound connections
  • Filesystem: --read-only confirmed to block writes
  • Timeout: infinite loops are killed at the configured timeout
  • Edge cases: syntax errors, division by zero, NaN serialisation, large DataFrames
  • Serialization robustness: datetime, date, time, timedelta, numpy int/float/bool, inf/NaN→None, mixed Excel-like sheets

# From repo root
python -m pytest backend/tests/test_sandbox.py -v

Run all backend tests

docker compose exec backend pytest tests/ -v
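
For orientation, a single security-boundary test might look roughly like this (the SandboxRunner call signature and assertion are illustrative, not copied from test_sandbox.py):

import pytest

from app.sandbox.runner import SandboxRunner  # import path per the project layout above

def test_import_os_is_blocked():
    runner = SandboxRunner()
    # run() signature is an assumption; the restricted __builtins__ has no __import__,
    # so the generated code should fail rather than touch the host filesystem.
    result = runner.run("import os\nresult = os.listdir('/')", dataframes=[])
    assert result.get("error")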

Production Deployment

Checklist

  • Generate a strong SECRET_KEY (≥ 32 chars, not the default)
  • Generate a strong SANDBOX_API_KEY
  • Set ENVIRONMENT=production to enable secret-key validation
  • Build the sandbox image on the production host: docker build -t data-analysis-sandbox:latest backend/app/sandbox/
  • Use a managed PostgreSQL service (AWS RDS, Cloud SQL, etc.)
  • Place an nginx reverse proxy with TLS termination in front of the backend
  • Set ALLOWED_ORIGINS to your actual frontend domain
  • Switch JWT storage from localStorage to httpOnly cookies
  • Set strict Content-Security-Policy headers
  • Ship logs to a centralised log aggregator (structlog outputs JSON)

Docker group on Linux hosts

On Linux, the Docker socket is owned by the docker group. Update group_add in docker-compose.yml:

getent group docker | cut -d: -f3   # get the GID

Then set group_add: ["<GID>"] for the sandbox-sidecar service. On macOS with Docker Desktop, "0" (root group) is correct.


Troubleshooting

Port already in use

docker ps
docker stop <container-id>

Sandbox image missing

docker build -t data-analysis-sandbox:latest backend/app/sandbox/

SANDBOX_API_KEY not set

# Root .env must contain:
SANDBOX_API_KEY=<your-key>
# backend/.env must contain the same value for SANDBOX_API_KEY

Database migration errors

docker compose down -v   # drops volumes — data will be lost
docker compose up --build

OpenAI API errors

  • Verify OPENAI_API_KEY is valid and the account has credits
  • Model gpt-4o-mini must be available on the account
  • Check rate limits: default 20 queries/min, 10 uploads/min

Verifying sandbox is active

Check backend startup logs for:

Sandbox using sidecar backend   url=http://sandbox-sidecar:9000
PandasAI CodeExecution.execute_code patched → sandbox backend   backend=sidecar

Check sidecar logs during a query — each execution produces:

POST /execute HTTP/1.1" 200 OK

License

MIT License
