Natural Language Analytics Engine
NLytics is an AI-powered data analysis platform that transforms natural language questions into executable Python code. It is built for data analysts, researchers, and business users who need rapid insights without manual coding.
- Quick Start
- Features
- Architecture
- Example Queries
- Installation
- Usage
- REST API
- Security
- Configuration
- Testing
- Project Scope
- License
- Python 3.9+ (tested on 3.13.7)
- Groq API key (free tier available at console.groq.com)
# Clone repository
git clone https://github.com/Tech-Genkai/NLytics.git
cd NLytics
# Configure environment
echo "GROQ_API_KEY=your_api_key_here" > .env
# Install dependencies
pip install -r backend/requirements.txt
# Start server
python start.py
Access the interface at http://localhost:5000
- Groq Llama 3.3-70B for natural language understanding
- Full dataset context awareness (columns, types, samples, statistics)
- Conversation history tracking for contextual queries
- Needs few clarifying questions before answering
- Generates actual pandas/Python code, not API wrappers
- Shows generated code before execution
- Self-correcting with retry logic (3 attempts with structured feedback)
- Validates for syntax, security, and correctness
- AST parsing for syntax validation
- Blacklist filtering - blocks dangerous operations (eval, exec, os, sys, file I/O)
- Import whitelist - only pandas and numpy allowed
- Sandboxed execution with restricted builtins
- Timeout protection (30-second default)
- Column-aware validation
- Auto-detects best chart type (bar, scatter, pie, box, line)
- Interactive Plotly charts with Chart.js fallback
- Professional color palette
- Statistical insights and narrative summaries
- View generated code before execution
- Clear error messages with retry feedback
- Execution time tracking
- Export results (CSV, Excel, JSON)
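The validation features listed above (AST parsing, a blacklist of dangerous names, and an import whitelist) can be illustrated with a minimal sketch. This is not the code in backend/services/code_validator.py: the function name `validate_code` and the exact name sets are assumptions chosen for illustration.

```python
import ast

# Hypothetical name sets, loosely based on the features listed above.
BLOCKED_NAMES = {"eval", "exec", "compile", "__import__", "open", "input",
                 "getattr", "setattr", "globals", "locals", "vars", "dir"}
ALLOWED_IMPORTS = {"pandas", "numpy"}


def validate_code(source: str) -> list[str]:
    """Return a list of problems found in `source`; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b" is judged by its top-level package name.
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    problems.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Name) and node.id in BLOCKED_NAMES:
            problems.append(f"blocked name: {node.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder attribute access is a common route to introspection.
            problems.append(f"blocked attribute: {node.attr}")
    return problems
```

Returning a list of problems rather than a single boolean makes it straightforward to pass the reasons back to the code generator on a retry.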
NLytics employs a 9-phase AI pipeline that transforms conversational queries into validated, executable analytics:
1. Intent Detection → Natural language understanding (Groq Llama 3.3-70B)
2. Query Refinement → Semantic optimization for analytical depth
3. Multi-Step Planning → Decomposition into logical execution steps
4. Code Generation → Pandas/Python synthesis from natural language
5. Security Validation → AST parsing and blacklist verification
6. Sandboxed Execution → Isolated subprocess with restricted builtins
7. Insight Generation → Statistical analysis and visualization config
8. Answer Synthesis → Plain-language explanation generation
9. Result Presentation → Interactive charts with narrative context
User Query
↓
AI Intent Detection (Groq Llama 3.3)
↓
Query Refinement (Semantic optimization)
↓
Query Planning (Multi-step decomposition)
↓
Code Generation (Pandas/Python)
↓
Validation (Security, Syntax, Logic)
↓ [Retry loop with feedback - 3 attempts]
Safe Execution (Sandboxed, Timeout)
↓
Insights (Narrative, Plotly viz, Export)
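The retry loop in the diagram can be sketched as follows. The `generate` and `validate` callables stand in for the LLM call and the validator; both are hypothetical placeholders, and only the limit of three attempts comes from the description above.

```python
from typing import Callable, List, Optional

MAX_RETRIES = 3  # matches the "3 attempts" described above


def generate_with_retry(
    query: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], List[str]],
    max_retries: int = MAX_RETRIES,
) -> str:
    """Generate code, feeding validation errors back into the next attempt."""
    feedback: Optional[str] = None
    for attempt in range(1, max_retries + 1):
        code = generate(query, feedback)
        problems = validate(code)
        if not problems:
            return code
        # Structured feedback for the next attempt (the format is an assumption).
        feedback = f"Attempt {attempt} failed: " + "; ".join(problems)
    raise RuntimeError(f"No valid code after {max_retries} attempts. {feedback}")


# Stand-in generator: the first attempt is unsafe, the second one is fine.
def fake_generate(query: str, feedback: Optional[str]) -> str:
    return "result = df.describe()" if feedback else "import os"


def fake_validate(code: str) -> List[str]:
    return ["import not allowed: os"] if "import os" in code else []


print(generate_with_retry("show me a summary", fake_generate, fake_validate))
```

With the stand-ins above, the first attempt is rejected and the second is returned, so the call prints `result = df.describe()`.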
Stock market analysis (samples/stock_data_july_2025.csv):
"highest growing stock" → Top 10 comparison with volatility analysis
"average market cap by sector" → Grouped aggregation with visualizations
"stocks with PE ratio below 15" → Multi-condition filtering
"correlation between volume and price" → Statistical correlation with scatter plot
General analytics patterns:
"show me a summary" → Statistical overview
"average [column] by [category]" → Grouped aggregation
"distribution of [column]" → Frequency analysis
"top 10 [column] by [metric]" → Ranked comparisons
"outliers in [column]" → Statistical outlier detection
- Python 3.9 or higher
- 2GB RAM minimum
- Internet connection (for Groq API)
- Modern web browser
- Clone the repository:
  git clone https://github.com/Tech-Genkai/NLytics.git
  cd NLytics
- Get your Groq API key:
  - Visit console.groq.com/keys
  - Sign up for free (no credit card required)
  - Create and copy your API key (starts with gsk_...)
- Configure environment:
  # Create .env file
  echo "GROQ_API_KEY=your_actual_key_here" > .env
- Install dependencies:
  pip install -r backend/requirements.txt
- Start the server:
  python start.py
- Open your browser: Navigate to http://localhost:5000
- Upload Data
  - Click "📤 Upload Data" button
  - Select CSV or Excel file (max 50MB)
  - Wait for preprocessing confirmation
- Ask Questions
  - Type natural language queries
  - Press Enter to send
  - View generated code, results, and insights
- View Insights
  - Interactive Plotly charts
  - Statistical summaries
  - Key findings and recommendations
- CSV (.csv)
- Excel (.xlsx, .xls)
- Maximum file size: 50MB
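The format and size limits above amount to a simple pre-upload check. The helper below is a hypothetical sketch (the name `check_upload` and the "File too large" message are not from the NLytics codebase); only the extensions and the 50MB limit come from this document.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".csv", ".xlsx", ".xls"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50MB


def check_upload(filename: str, size_bytes: int) -> str:
    """Return "ok" or a short reason the file would be rejected."""
    # Compare the extension case-insensitively so "DATA.CSV" is accepted.
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "File type not supported"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File too large"
    return "ok"
```

Running a check like this on the client side before uploading avoids sending a large file only to have the server reject it.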
NLytics provides a comprehensive REST API for programmatic access.
- Production: https://nlytics.onrender.com/api/v1
- Local: http://localhost:5000/api/v1
# Complete analysis in one call
curl -X POST https://nlytics.onrender.com/api/v1/analyze \
-F "file=@data.csv" \
-F "query=highest stock by volume"

| Endpoint | Method | Description |
|---|---|---|
| /analyze | POST | Upload & analyze in one call |
| /query | POST | Query existing session |
| /status/<session_id> | GET | Get session status |
| /code/validate | POST | Validate code |
| /code/execute | POST | Execute code |
| /health | GET | Health check |
{
"success": true,
"status": "completed",
"query": {
"original": "highest stock",
"refined": "top 10 stocks comparison",
"intent": {...}
},
"code": {
"generated": "df.nlargest(10, 'Close')...",
"execution_time": 0.23
},
"result": {
"data": [...]
},
"visualization": {
"type": "bar",
"plotly": "{...}",
"config": {...}
},
"insights": {
"narrative": "...",
"key_findings": [...]
},
"answer": "Based on the data, AAPL is the highest..."
}
See DEV_GUIDE.md for complete API documentation with examples.
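A client will usually want only a few fields from this response. The sketch below reads them defensively with the standard library. The field names follow the example response, while the helper name `summarize_response` and the sample payload values are assumptions.

```python
import json
from typing import Any, Dict


def summarize_response(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the commonly used fields out of an /analyze-style response."""
    if not payload.get("success"):
        raise ValueError(f"analysis failed with status {payload.get('status')!r}")
    return {
        "answer": payload.get("answer", ""),
        "code": payload.get("code", {}).get("generated", ""),
        "chart_type": payload.get("visualization", {}).get("type"),
        "rows": len(payload.get("result", {}).get("data", [])),
    }


# Sample payload with placeholder values, shaped like the example response.
raw = json.dumps({
    "success": True,
    "status": "completed",
    "code": {"generated": "df.nlargest(10, 'Close')", "execution_time": 0.23},
    "result": {"data": [{"Ticker": "AAPL"}]},
    "visualization": {"type": "bar"},
    "answer": "Based on the data, AAPL is the highest.",
})
summary = summarize_response(json.loads(raw))
print(summary["chart_type"], summary["rows"])  # bar 1
```

Using `.get()` with defaults keeps the client working if an optional section such as `visualization` is missing from a particular response.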
- AST Parsing - Syntax validation before execution
- Blacklist Filtering - Blocks dangerous operations:
  - eval, exec, compile, __import__
  - open, input, getattr, setattr
  - os, sys, subprocess, socket
  - globals, locals, vars, dir
- Import Whitelist - Only pandas and numpy permitted
- Column Validation - Schema-aware code generation
- Timeout Enforcement - 30-second execution limit
- Restricted builtins - Only safe functions available
- No introspection - Blocked getattr, setattr, globals, locals
- No imports - Prevented import access
- No file system - No read/write operations
- No network - No external connections
- Isolated context - Each execution is independent
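The idea of restricted builtins and an isolated context can be shown with a small in-process sketch. It is illustrative only: the whitelist and the `run_restricted` name are assumptions, and restricting builtins this way is not a complete security boundary on its own. The pipeline overview describes the actual executor as an isolated subprocess with a timeout.

```python
import builtins

# Hypothetical whitelist of builtins considered safe for analysis code.
SAFE_BUILTIN_NAMES = ("abs", "len", "max", "min", "range", "round", "sorted", "sum")
SAFE_BUILTINS = {name: getattr(builtins, name) for name in SAFE_BUILTIN_NAMES}


def run_restricted(code: str, variables: dict) -> dict:
    """Execute `code` with only whitelisted builtins and a fresh namespace."""
    # A new globals dict per call keeps executions independent of each other.
    namespace = {**variables, "__builtins__": SAFE_BUILTINS}
    exec(compile(code, "<generated>", "exec"), namespace)
    namespace.pop("__builtins__", None)
    return namespace


out = run_restricted("result = sum(values) / len(values)", {"values": [2, 4, 6]})
print(out["result"])  # 4.0
```

Because `open` and `__import__` are absent from the supplied builtins, code that calls `open(...)` fails with `NameError` and an `import` statement fails with `ImportError`.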
Create a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
SECRET_KEY=your_secret_key_for_flask # Optional
FLASK_ENV=development # or production
PORT=5000 # Optional, default is 5000
Configurable in backend/config.py:
MAX_CONTENT_LENGTH = 50 * 1024 * 1024 # 50MB file upload limit
API_TIMEOUT = 30 # API request timeout (seconds)
MAX_RETRIES = 3 # Code generation retry attempts
EXECUTION_TIMEOUT = 300 # Code execution timeout (5 minutes)
MAX_ROWS = 1_000_000 # Maximum rows to process
MAX_COLUMNS = 500 # Maximum columns to process
# All unit tests
python -m pytest backend/tests/ -v
# With coverage report
python -m pytest backend/tests/ --cov=backend/services --cov-report=html
# Automated integration tests (60+ scenarios)
python backend/tests/automated_test.py
# Quick integration test
python backend/tests/automated_test.py --quick
- Code validation - Syntax, security, imports
- Safe execution - Error handling, result capture, isolation
- Insight generation - DataFrames, scalars, visualizations
- Schema inspection - Column types, statistics
- Integration tests - 60+ scenarios covering:
  - Basic aggregations
  - Growth & performance analysis
  - Sector comparisons
  - Complex queries
  - Statistical insights
  - Rankings
  - Investment screening
  - Natural language variations
  - Edge cases
NLytics is:
- Proof-of-concept validating AI-powered code generation
- Educational tool demonstrating LLM-powered analytics
- Personal analysis tool for local use and demos
- Architecture blueprint for conversational analytics
NLytics is not:
- Production-ready SaaS platform
- Enterprise-grade software
- Fully hardened security system
- Scalable multi-user service
NLytics prioritizes transparency and education over abstraction:
- Code Visibility - Generated pandas code is always visible
- Query Refinement - Automatically enhances queries for analytical depth
- Schema Awareness - Understands data types and relationships
- Self-Correction - Retry loops with structured feedback
- Educational Value - Learn data analysis by observing code generation
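Code visibility is easier to picture with an example of the kind of pandas code a user might be shown. The sketch below pairs a few query phrasings from the example patterns with plausible pandas equivalents. The tiny DataFrame and its column names are made up for illustration, and the code NLytics actually generates will differ.

```python
import pandas as pd

# Made-up data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "sector": ["Tech", "Tech", "Energy", "Energy"],
    "market_cap": [100.0, 300.0, 50.0, 150.0],
    "pe_ratio": [30.0, 12.0, 9.0, 20.0],
})

# "average market_cap by sector" -> grouped aggregation
avg_by_sector = df.groupby("sector")["market_cap"].mean()

# "stocks with PE ratio below 15" -> boolean filtering
low_pe = df[df["pe_ratio"] < 15]

# "top 2 stocks by market_cap" -> ranked comparison
top2 = df.nlargest(2, "market_cap")

# "outliers in market_cap" -> 1.5 * IQR rule (one common convention)
q1, q3 = df["market_cap"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["market_cap"] < q1 - 1.5 * iqr) | (df["market_cap"] > q3 + 1.5 * iqr)]

print(avg_by_sector)
print(low_pe["ticker"].tolist())
```

Each query maps to one or two lines of pandas, which is what makes reading the generated code a practical way to learn the library.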
Why no database?
- Users analyze sensitive data - no storage = no liability
- Sessions expire on restart by design
- This is an analysis tool, not a data warehouse
- Stateless architecture works for cloud deployment
Why no authentication?
- Single-user tool - control access via URL sharing
- For public deployment: use Render's basic auth
- Adding app-level auth is overkill for demos
Why no caching?
- Queries are rarely identical
- Groq API responses are fast (~5s average)
- Cache invalidation complexity outweighs benefits
Why no Docker?
- Local development tool for single users
- Python virtual environment is simpler
- Docker adds complexity with no benefit
- Single-user - Local or Render deployment
- Session-based - No persistence by design
- API-dependent - Requires Groq API key
- Dataset limits - Optimized for <100K rows
- Stateless - Works on cloud platforms
| Layer | Technology |
|---|---|
| AI Model | Groq Llama 3.3-70B (70B parameters) |
| Backend | Flask 3.1.0, Python 3.9+ |
| Data Processing | pandas 2.2.3, numpy 2.2.4 |
| Visualization | Plotly 5.24.1, Chart.js fallback |
| Frontend | Vanilla JavaScript, Marked.js |
| Testing | pytest, 60+ integration scenarios |
NLytics/
├── backend/
│ ├── main.py # Flask application core
│ ├── config.py # Configuration management
│ ├── requirements.txt # Python dependencies
│ ├── api/
│ │ └── analytics_api.py # REST API endpoints
│ ├── services/ # AI pipeline services
│ │ ├── ai_intent_detector.py # Natural language understanding
│ │ ├── query_refiner.py # Query optimization
│ │ ├── query_planner.py # Multi-step planning
│ │ ├── code_generator.py # LLM code synthesis
│ │ ├── code_validator.py # Security validation
│ │ ├── safe_executor.py # Sandboxed execution
│ │ ├── insight_generator.py # Statistical insights
│ │ ├── answer_synthesizer.py # Natural language answers
│ │ ├── preprocessor.py # Data cleaning
│ │ ├── schema_inspector.py # Column analysis
│ │ └── file_handler.py # File operations
│ ├── models/
│ │ └── chat_message.py # Message types
│ ├── tests/ # Test suite
│ │ ├── automated_test.py # Integration tests (60+)
│ │ ├── test_api.py
│ │ ├── test_code_validator.py
│ │ ├── test_insight_generator.py
│ │ ├── test_safe_executor.py
│ │ └── test_schema_inspector.py
│ └── utils/ # Utility modules
├── frontend/
│ ├── index.html # Chat interface
│ └── static/
│ ├── js/app.js # Client logic
│ └── css/style.css # Styling
├── data/
│ ├── uploads/ # User files
│ └── processed/ # Cleaned data
├── samples/ # Example datasets
├── .env # Environment config
├── .gitignore
├── README.md # This file
├── DEV_GUIDE.md # Developer guide
└── start.py # Application entry point
"GROQ_API_KEY not found"
- Ensure .env file exists in project root
- Verify API key is correct (starts with gsk_...)
- Restart server after creating .env
"Address already in use" (Port 5000)
# Windows: Kill process using port 5000
netstat -ano | findstr :5000
taskkill /PID <PID> /F
# Or change port in backend/main.py
"File type not supported"
- Only CSV (.csv) and Excel (.xlsx, .xls) supported
- File must be under 50MB
- Try re-saving as CSV (UTF-8)
"Column not found"
- Use exact column names from upload confirmation
- Preprocessing normalizes names (spaces → underscores)
- Check health report for actual column names
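As a rough illustration of the name normalization mentioned above, the sketch below replaces spaces with underscores. The real preprocessor may apply further rules (for example, changing case or stripping symbols); the function name `normalize_column` and the whitespace trimming are assumptions.

```python
import re


def normalize_column(name: str) -> str:
    """Trim the name and replace runs of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", name.strip())


columns = ["Market Cap", " PE Ratio ", "Volume"]
print([normalize_column(c) for c in columns])
# ['Market_Cap', 'PE_Ratio', 'Volume']
```

If a query fails with "Column not found", applying the same transformation to the header in the original file is a quick way to guess the normalized name.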
"No dataset loaded"
- Upload a file before asking questions
- Check if upload completed successfully
- Try refreshing the page
For more troubleshooting, see DEV_GUIDE.md.
This is a proof-of-concept project. Contributions are welcome for:
- Bug fixes
- Documentation improvements
- Test coverage expansion
- Additional visualization types
- Performance optimizations
Please open an issue before submitting large changes.
MIT License - See LICENSE file for details.
- Groq for fast LLM inference
- Plotly for interactive visualizations
- pandas and numpy for data processing
- Flask for web framework
Built with ❤️ by Tech-Genkai
Proof-of-concept demonstrating AI-powered natural language to code generation for data analytics.


