
AI-Powered Data Analysis Platform

Upload CSV or Excel files and ask natural language questions. Powered by OpenAI and PandasAI, with production-grade sandboxed code execution.


Table of Contents

  • Features
  • Architecture
  • AI Accuracy
  • Quick Start
  • Environment Variables
  • Project Structure
  • API Reference
  • Testing
  • Production Deployment
  • Troubleshooting
  • License

Features

  • Natural language queries — ask questions about your data in plain English
  • Multi-format support — CSV, XLSX, XLS with automatic sheet detection
  • AI chart generation — bar charts, line graphs, histograms via matplotlib
  • Multi-file analysis — query across multiple uploaded files simultaneously
  • Query history — reuse and iterate on previous queries
  • Sandboxed execution — AI-generated code runs in an isolated Docker container, never in the backend process
  • Anonymous sessions — no sign-up required; JWT-based session management
  • Hallucination reduction — for small sheets, the full DataFrame contents are injected into the LLM prompt so the model sees the actual values before writing code
  • Robust serialization — handles every real-world Excel/CSV type: datetime, timedelta, numpy scalars, Decimal, UUID, bytes, complex numbers, inf, NaN

Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          User Browser                               │
│                     http://localhost:3000                           │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ HTTPS / REST
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     Next.js Frontend                                │
│                                                                     │
│   React 18 · TypeScript · TanStack Query · Tailwind CSS            │
│   Axios (JWT interceptor) · react-dropzone                         │
└──────────────────────────┬──────────────────────────────────────────┘
                           │ REST API (port 8000)
                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     FastAPI Backend                                 │
│                                                                     │
│   Async Python 3.11 · SQLAlchemy (asyncpg) · Pydantic v2          │
│   JWT auth · structlog · slowapi rate limiting                     │
│                                                                     │
│   ┌──────────────────────────────────────────────────────────┐     │
│   │                  AI Query Pipeline                        │     │
│   │                                                          │     │
│   │  User prompt ──► PandasAI (code generation)             │     │
│   │                       │                                  │     │
│   │                       ▼                                  │     │
│   │             CodeExecution.execute_code                   │     │
│   │             [monkey-patched → SandboxRunner]             │     │
│   └──────────────────────┬───────────────────────────────────┘     │
│                          │ HTTP (internal network)                  │
└──────────────────────────┼──────────────────────────────────────────┘
                           │
          ┌────────────────┴─────────────────────┐
          │                                      │
          ▼                                      ▼
┌──────────────────┐                   ┌──────────────────┐
│  Sandbox Sidecar │                   │   PostgreSQL 16  │
│  (port 9000)     │                   │                  │
│                  │                   │  sessions        │
│  FastAPI + auth  │                   │  uploaded_files  │
│  Docker socket   │                   │  query_history   │
│  (only service   │                   │  feedback        │
│   with socket)   │                   └──────────────────┘
└────────┬─────────┘
         │ docker run (locked-down)
         ▼
┌──────────────────────────────────────────────────────────┐
│              Sandbox Container                           │
│                                                          │
│  python:3.11-slim · pandas · numpy · matplotlib         │
│                                                          │
│  --network=none          no network access               │
│  --read-only             immutable filesystem            │
│  --tmpfs /tmp            64 MB scratch (noexec)          │
│  --memory=256m           hard memory cap                 │
│  --cpus=0.5              CPU quota                       │
│  --pids-limit=32         prevents fork bombs             │
│  --cap-drop=ALL          no Linux capabilities           │
│  --user=65534:65534      runs as nobody                  │
│  restricted __builtins__ no import/open/exec/eval        │
└──────────────────────────────────────────────────────────┘
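
The container flags above map directly onto a docker run invocation. Below is a minimal Python sketch of how a runner could assemble that command; the function name, worker path, and stdin-based payload handling are illustrative assumptions, not copied from runner.py.

import subprocess

def run_in_sandbox(payload_json: str, timeout: int = 30) -> str:
    # Illustrative only: flags mirror the diagram above.
    cmd = [
        "docker", "run", "--rm", "-i",
        "--network=none",                             # no network access
        "--read-only",                                # immutable root filesystem
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",  # 64 MB scratch, noexec
        "--memory=256m", "--memory-swap=256m",        # hard memory cap, no swap
        "--cpus=0.5",                                 # CPU quota
        "--pids-limit=32",                            # prevent fork bombs
        "--cap-drop=ALL",                             # drop all Linux capabilities
        "--security-opt=no-new-privileges:true",
        "--user=65534:65534",                         # run as nobody
        "data-analysis-sandbox:latest",
        "python", "/app/worker.py",                   # worker path is an assumption
    ]
    proc = subprocess.run(cmd, input=payload_json, capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout  # worker writes the JSON result to stdout; prints go to stderr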

AI Query Flow

User prompt
    │
    ▼
PandasAI: build prompt with DataFrame schemas
    │
    ▼
OpenAI GPT-4o-mini: generate Python code
    │
    ▼
PandasAI: clean & validate generated code
    │
    ▼
CodeExecution.execute_code [PATCHED]
    │
    ├─ serialize DataFrames as JSON
    │
    ▼
POST /execute ──► Sandbox Sidecar
    │
    ▼
docker run [locked-down container]
    │
    ▼
worker.py: exec(code) with restricted __builtins__
    │    dfs[0], pd, np, plt available
    │    no network, no filesystem writes (except /tmp)
    │
    ▼
result serialised to JSON (scalar / dataframe / chart base64)
    │
    ▼
PandasAI: ResultParsing & validation
    │
    ▼
API response → Frontend
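
The [PATCHED] step is a monkey-patch applied in ai_engine.py. A conceptual sketch is shown below; the exact PandasAI import path and method signature vary between versions, so this only illustrates the idea, not the real patch.

def patch_code_execution(code_execution_cls, runner):
    """Replace PandasAI's CodeExecution.execute_code so generated code
    never runs inside the backend process (conceptual sketch only)."""
    def sandboxed_execute(self, code, context):
        # Serialize the session's DataFrames and forward the generated code
        # to the SandboxRunner (sidecar / docker / subprocess backend).
        return runner.run(code, dataframes=getattr(context, "dfs", []))
    code_execution_cls.execute_code = sandboxed_execute

# e.g. patch_code_execution(CodeExecution, SandboxRunner()) once at startup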

Sandbox Backend Selection

The backend automatically selects the safest available isolation backend at startup:

1. Sidecar  (RECOMMENDED) ─── SANDBOX_API_URL + SANDBOX_API_KEY set
                               Backend has no Docker socket access.
                               Compromised backend can only call POST /execute.

2. Docker   (development) ─── Docker daemon reachable + sandbox image built.
                               Direct socket access. Suitable for local dev.

3. Subprocess (fallback)  ─── Python-level isolation only.
                               CRITICAL log emitted. Not safe for production.
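
A hedged sketch of that selection order (the real logic in backend/app/sandbox/runner.py checks daemon reachability and the image more carefully):

import logging
import os
import shutil

logger = logging.getLogger("sandbox")

def select_backend() -> str:
    # 1. Sidecar: both env vars present, backend never touches the Docker socket.
    if os.getenv("SANDBOX_API_URL") and os.getenv("SANDBOX_API_KEY"):
        return "sidecar"
    # 2. Direct Docker: CLI available and sandbox image built (simplified check).
    if shutil.which("docker") and sandbox_image_exists():  # hypothetical helper
        return "docker"
    # 3. Last resort: Python-level isolation only.
    logger.critical("Sandbox falling back to subprocess backend; not safe for production")
    return "subprocess"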

Security Model

| Layer | Control | Detail |
| --- | --- | --- |
| Network isolation | --network=none | Sandbox container has zero network access |
| Filesystem | --read-only + --tmpfs /tmp | Root FS immutable; only 64 MB scratch in /tmp |
| Memory | --memory=256m --memory-swap=256m | Hard cap, no swap |
| CPU | --cpus=0.5 | Prevents CPU exhaustion |
| Processes | --pids-limit=32 | Prevents fork bombs |
| Capabilities | --cap-drop=ALL | All Linux capabilities removed |
| Privilege escalation | --security-opt=no-new-privileges:true | Locked |
| User | --user=65534:65534 | Runs as nobody:nogroup |
| Builtins | Restricted __builtins__ | No import, open, exec, eval, compile |
| Docker socket | Sidecar-only | Backend container has no Docker socket mount |
| API key | secrets.compare_digest | Constant-time comparison on sidecar auth |
| File upload | 4-layer validation | MIME type + extension + magic bytes + size |
| SQL | SQLAlchemy ORM only | No raw SQL queries |
| Secrets | No defaults in production | SECRET_KEY rejected if weak; no fallback |
| Error leakage | Sanitised 500 responses | Internal errors never sent to clients |
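
Two of the rows above can be illustrated together. The sketch below shows how worker.py might exec() generated code under a restricted __builtins__ and redirect print() to stderr; the allow-list and scope wiring are simplified, and the real worker also exposes plt (matplotlib) and serialises the result.

import builtins
import sys
import numpy as np
import pandas as pd

_ALLOWED = ("len", "range", "min", "max", "sum", "abs", "round", "sorted",
            "enumerate", "zip", "list", "dict", "set", "tuple",
            "str", "int", "float", "bool", "isinstance")
SAFE_BUILTINS = {name: getattr(builtins, name) for name in _ALLOWED}
# Deliberately absent: __import__, open, exec, eval, compile.

def run_user_code(code: str, dfs: list) -> dict:
    scope = {
        "__builtins__": SAFE_BUILTINS,
        "dfs": dfs, "pd": pd, "np": np,
        # print() goes to stderr so it cannot pollute the JSON result on stdout
        "print": lambda *a, **kw: builtins.print(*a, file=sys.stderr, **kw),
    }
    exec(code, scope)
    return scope.get("result", {})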

AI Accuracy

Hallucination Reduction

Real-world Excel files are often form-style layouts with many Unnamed columns and data in unpredictable cell positions. To prevent GPT from guessing wrong iloc indices, the backend applies three layers before every query:

| Layer | What it does |
| --- | --- |
| Full data context | For DataFrames ≤ 60 rows, the complete df.to_string() is appended to the user prompt so the model can read every value before writing code |
| Increased rows_to_read | PandasAI's own prompt section shows up to 30 rows (the default is 5) |
| Custom instructions | Instructs the model to search for values by content, use pd.to_numeric(errors='coerce'), and handle None/NaN explicitly instead of blindly multiplying |

For larger DataFrames (> 60 rows), column info, dtypes, and first/last 10 rows are provided instead of the full dump.
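
A minimal sketch of that context-injection rule (the function name and exact wording are illustrative; the thresholds follow the table above):

import pandas as pd

def build_data_context(df: pd.DataFrame, max_full_rows: int = 60) -> str:
    if len(df) <= max_full_rows:
        # Small sheets: show every cell so the model never guesses iloc positions.
        return "Full data:\n" + df.to_string()
    # Larger sheets: column info, dtypes, and the first/last 10 rows only.
    return (
        "Columns: " + ", ".join(map(str, df.columns)) + "\n"
        "Dtypes:\n" + df.dtypes.to_string() + "\n"
        "First 10 rows:\n" + df.head(10).to_string() + "\n"
        "Last 10 rows:\n" + df.tail(10).to_string()
    )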

DataFrame Serialization

The _coerce_cell function in runner.py normalises every DataFrame cell to a JSON-safe Python primitive before sending data to the sandbox container. Supported types:

| Input type | Serialised as |
| --- | --- |
| datetime, date, time, pd.Timestamp | ISO 8601 string |
| timedelta, pd.Timedelta | Total seconds (float) |
| numpy.integer | int |
| numpy.floating | float (inf/NaN → None) |
| numpy.bool_ | bool |
| numpy.bytes_ | UTF-8 string |
| decimal.Decimal | float |
| uuid.UUID | string |
| bytes | UTF-8 string |
| complex | string ("a+bj") |
| pd.NaT, pd.NA, float('nan'), inf | None |
| Everything else | str(v) |

print() calls inside sandbox-executed code are redirected to stderr so they never pollute the JSON result written to stdout.
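
A condensed sketch of those rules (the real _coerce_cell in runner.py covers more corner cases, but the mapping matches the table above):

import datetime as dt
import decimal
import math
import uuid
import numpy as np
import pandas as pd

def _coerce_cell(v):
    if v is None or v is pd.NaT or v is pd.NA:
        return None                                   # missing values
    if isinstance(v, (dt.datetime, dt.date, dt.time)):
        return v.isoformat()                          # covers pd.Timestamp too
    if isinstance(v, dt.timedelta):
        return v.total_seconds()                      # covers pd.Timedelta too
    if isinstance(v, np.integer):
        return int(v)
    if isinstance(v, (float, np.floating)):
        f = float(v)
        return None if math.isnan(f) or math.isinf(f) else f
    if isinstance(v, np.bool_):
        return bool(v)
    if isinstance(v, (bytes, np.bytes_)):
        return v.decode("utf-8", errors="replace")
    if isinstance(v, decimal.Decimal):
        return float(v)
    if isinstance(v, (uuid.UUID, complex)):
        return str(v)
    return v if isinstance(v, (str, int, bool)) else str(v)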


Quick Start

Prerequisites

  • Docker Desktop 4.x or later
  • An OpenAI API key

1. Clone & configure

git clone <repo-url>
cd data-analysis-app

Generate secrets and create the root .env (read by docker-compose):

# Generate a strong sandbox API key
python3 -c "import secrets; print('SANDBOX_API_KEY=' + secrets.token_hex(32))" >> .env

Create backend/.env:

cp backend/.env.example backend/.env
# Edit backend/.env — set SECRET_KEY and OPENAI_API_KEY
python3 -c "import secrets; print(secrets.token_hex(32))"  # use output as SECRET_KEY

2. Build the sandbox image

docker build -t data-analysis-sandbox:latest backend/app/sandbox/

3. Start all services

docker compose up --build
| Service | URL |
| --- | --- |
| Frontend | http://localhost:3000 |
| Backend API | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |

4. First use

  1. Open http://localhost:3000 — a session JWT is issued automatically
  2. Upload a CSV or Excel file via drag-and-drop
  3. Select the file and sheet in the left panel
  4. Type a question in the Query box, e.g. "What is the average salary by department?"
  5. Click Submit — results appear as text, table, or chart
  6. Previous queries appear in the history panel; click Reuse to populate the query box

Environment Variables

backend/.env

# ── Required ──────────────────────────────────────────────
OPENAI_API_KEY=sk-...

# Minimum 32 characters. Generate: python3 -c "import secrets; print(secrets.token_hex(32))"
SECRET_KEY=

# ── Database ──────────────────────────────────────────────
DATABASE_URL=postgresql+asyncpg://postgres:postgres@postgres:5432/data_analysis

# ── Environment ───────────────────────────────────────────
ENVIRONMENT=development    # set to "production" to enable secret-key validation
LOG_LEVEL=INFO

# ── CORS ──────────────────────────────────────────────────
ALLOWED_ORIGINS=["http://localhost:3000"]

# ── File upload ───────────────────────────────────────────
UPLOAD_DIR=./uploads
MAX_FILE_SIZE_MB=50
MAX_SESSION_SIZE_MB=200

# ── AI ────────────────────────────────────────────────────
AI_TIMEOUT_SECONDS=30
AI_MODEL=gpt-4o-mini
AI_TEMPERATURE=0

# ── Sandbox sidecar ───────────────────────────────────────
# Set both to use sidecar backend (recommended for production).
# Leave empty to fall back to direct Docker isolation.
SANDBOX_API_URL=http://sandbox-sidecar:9000
SANDBOX_API_KEY=   # must match root .env SANDBOX_API_KEY

# ── Rate limiting ─────────────────────────────────────────
RATE_LIMIT_QUERIES_PER_MINUTE=20
RATE_LIMIT_UPLOADS_PER_MINUTE=10

Root .env (docker-compose)

# Shared between docker-compose and backend/.env
SANDBOX_API_KEY=<generated-hex-string>
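
The secret-key validation enabled by ENVIRONMENT=production corresponds to a check in app/core/config.py. A hedged sketch of what such a Pydantic v2 settings validator can look like (field names match the variables above; the real config enforces more):

from pydantic import model_validator
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    ENVIRONMENT: str = "development"
    SECRET_KEY: str = ""

    @model_validator(mode="after")
    def _require_strong_secret_in_production(self):
        # Mirrors the checklist: at least 32 characters, no weak fallback.
        if self.ENVIRONMENT == "production" and len(self.SECRET_KEY) < 32:
            raise ValueError("SECRET_KEY must be a strong value (>= 32 chars) in production")
        return self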

Project Structure

data-analysis-app/
├── .env                          # root-level (docker-compose reads this)
├── .gitignore
├── docker-compose.yml
├── README.md
│
├── backend/
│   ├── Dockerfile
│   ├── .env                      # backend service env
│   ├── .env.example
│   ├── requirements.txt
│   │
│   ├── alembic/
│   │   ├── env.py
│   │   └── versions/
│   │       └── 0001_initial_schema.py
│   │
│   ├── tests/
│   │   └── test_sandbox.py       # sandbox integration tests
│   │
│   └── app/
│       ├── main.py
│       ├── dependencies.py
│       ├── models.py
│       │
│       ├── core/
│       │   ├── config.py         # Pydantic BaseSettings + production validation
│       │   ├── database.py       # SQLAlchemy async engine
│       │   ├── logging.py        # structlog JSON logging
│       │   └── security.py       # JWT + 4-layer file validation
│       │
│       ├── routers/
│       │   ├── auth.py
│       │   ├── files.py
│       │   ├── preview.py
│       │   ├── query.py
│       │   ├── history.py
│       │   └── feedback.py
│       │
│       ├── schemas/
│       │   ├── query.py
│       │   └── files.py
│       │
│       ├── services/
│       │   ├── ai_engine.py      # PandasAI integration + sandbox monkey-patch
│       │   ├── file_manager.py
│       │   └── sheet_parser.py
│       │
│       └── sandbox/
│           ├── runner.py         # SandboxRunner (sidecar/docker/subprocess backends)
│           ├── worker.py         # Executed inside sandbox container
│           ├── Dockerfile        # python:3.11-slim + pandas/numpy/matplotlib only
│           ├── .dockerignore
│           └── sidecar/
│               ├── main.py       # FastAPI sidecar (POST /execute, GET /health)
│               ├── Dockerfile    # Multi-stage: docker CLI + python:3.11-slim
│               └── requirements.txt
│
└── frontend/
    ├── Dockerfile
    ├── package.json
    ├── tsconfig.json
    ├── next.config.js
    └── src/
        ├── app/
        │   ├── layout.tsx
        │   ├── page.tsx
        │   ├── dashboard/page.tsx
        │   └── globals.css
        ├── components/
        │   ├── FileUploadZone.tsx
        │   ├── FileList.tsx
        │   ├── SheetSelector.tsx
        │   ├── DataPreviewTable.tsx
        │   ├── QueryInput.tsx
        │   ├── AnswerDisplay.tsx
        │   ├── PromptHistoryPanel.tsx
        │   └── FeedbackButtons.tsx
        ├── hooks/
        │   ├── useSession.ts
        │   ├── useFiles.ts
        │   ├── usePreview.ts
        │   ├── useQuery.ts
        │   └── useHistory.ts
        └── lib/
            ├── api.ts
            ├── types.ts
            └── queryClient.ts

API Reference

Authentication

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/auth/token | Issue anonymous JWT (7-day expiry) |

Files

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/files | Upload CSV or Excel file |
| GET | /api/v1/files | List files for current session |
| DELETE | /api/v1/files/{file_id} | Soft-delete a file |

Preview

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/preview | Preview top N rows of a sheet |

AI Query

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/query | Execute natural language query |
| GET | /api/v1/history | Paginated query history |
| DELETE | /api/v1/history/{query_id} | Delete a history entry |

Feedback

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/feedback | Submit thumbs up / thumbs down |

All endpoints except /api/v1/auth/token require Authorization: Bearer <jwt>.

Interactive API docs available at http://localhost:8000/docs.
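
A quick end-to-end example in Python (the field names access_token, id, prompt, and file_ids are illustrative assumptions; check the Swagger docs for the exact request and response schemas):

import requests

BASE = "http://localhost:8000/api/v1"

# 1. Obtain an anonymous session token
token = requests.post(f"{BASE}/auth/token").json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# 2. Upload a CSV file
with open("sales.csv", "rb") as f:
    uploaded = requests.post(f"{BASE}/files", headers=headers,
                             files={"file": ("sales.csv", f, "text/csv")}).json()

# 3. Ask a natural language question about it
answer = requests.post(f"{BASE}/query", headers=headers,
                       json={"prompt": "What is the average salary by department?",
                             "file_ids": [uploaded["id"]]}).json()
print(answer)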


Testing

Sandbox integration tests (32 tests)

Tests run against the real Docker sandbox image and verify:

  • Correct results: scalar, string, DataFrame, numpy types, PandasAI result dicts
  • Security boundaries: os, subprocess, __import__, open, eval, exec all blocked
  • Network: --network=none confirmed to block outbound connections
  • Filesystem: --read-only confirmed to block writes
  • Timeout: infinite loops are killed at the configured timeout
  • Edge cases: syntax errors, division by zero, NaN serialisation, large DataFrames
  • Serialization robustness: datetime, date, time, timedelta, numpy int/float/bool, inf/NaN→None, mixed Excel-like sheets

# From repo root
python -m pytest backend/tests/test_sandbox.py -v

Run all backend tests

docker compose exec backend pytest tests/ -v
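
For orientation, a single security-boundary test might look roughly like this (the SandboxRunner call signature and assertion are illustrative, not copied from test_sandbox.py):

import pytest

from app.sandbox.runner import SandboxRunner  # import path per the project layout above

def test_import_os_is_blocked():
    runner = SandboxRunner()
    # run() signature is an assumption; the restricted __builtins__ has no __import__,
    # so the generated code should fail rather than touch the host filesystem.
    result = runner.run("import os\nresult = os.listdir('/')", dataframes=[])
    assert result.get("error")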

Production Deployment

Checklist

  • Generate a strong SECRET_KEY (≥ 32 chars, not the default)
  • Generate a strong SANDBOX_API_KEY
  • Set ENVIRONMENT=production to enable secret-key validation
  • Build the sandbox image on the production host: docker build -t data-analysis-sandbox:latest backend/app/sandbox/
  • Use a managed PostgreSQL service (AWS RDS, Cloud SQL, etc.)
  • Place an nginx reverse proxy with TLS termination in front of the backend
  • Set ALLOWED_ORIGINS to your actual frontend domain
  • Switch JWT storage from localStorage to httpOnly cookies
  • Set strict Content-Security-Policy headers
  • Ship logs to a centralised log aggregator (structlog outputs JSON)

Docker group on Linux hosts

On Linux, the Docker socket is owned by the docker group. Update group_add in docker-compose.yml:

getent group docker | cut -d: -f3   # get the GID

Then set group_add: ["<GID>"] for the sandbox-sidecar service. On macOS with Docker Desktop, "0" (root group) is correct.


Troubleshooting

Port already in use

docker ps
docker stop <container-id>

Sandbox image missing

docker build -t data-analysis-sandbox:latest backend/app/sandbox/

SANDBOX_API_KEY not set

# Root .env must contain:
SANDBOX_API_KEY=<your-key>
# backend/.env must contain the same value for SANDBOX_API_KEY

Database migration errors

docker compose down -v   # drops volumes — data will be lost
docker compose up --build

OpenAI API errors

  • Verify OPENAI_API_KEY is valid and the account has credits
  • Model gpt-4o-mini must be available on the account
  • Check rate limits: default 20 queries/min, 10 uploads/min

Verifying sandbox is active

Check backend startup logs for:

Sandbox using sidecar backend   url=http://sandbox-sidecar:9000
PandasAI CodeExecution.execute_code patched → sandbox backend   backend=sidecar

Check sidecar logs during a query — each execution produces:

POST /execute HTTP/1.1" 200 OK

License

MIT License
