- Executive Overview
- System Architecture
- Installation Guide
- Configuration Guide
- User Guide
- Administrator Guide
- API Documentation
- Development Guide
- Deployment Guide
- Model Management
- Security Guide
- Performance Tuning
- Troubleshooting
- Maintenance & Monitoring
- FAQ
- Appendices
The Sovereign AI Application Suite is an enterprise-grade, fully self-hosted artificial intelligence platform that provides state-of-the-art AI capabilities while maintaining complete data sovereignty and privacy. Built on a foundation of open-source models and technologies, it delivers performance comparable to leading commercial AI services without sending data to, or depending on, any external service.
- Council of Experts Architecture: Multiple specialized AI models working in concert
- Gemma 3 Integration: Google's latest efficient models for ultra-fast responses
- Multimodal Processing: Support for text, images, audio, and documents
- RAG System: Retrieval-Augmented Generation with private knowledge bases
- Advanced Coding Assistant: Professional-grade code generation and debugging
- 100% Self-Hosted: Complete control over infrastructure and data
- No External APIs: All processing happens on-premises
- Role-Based Access Control: Granular permission management
- Audit Logging: Complete activity tracking for compliance
- High Availability: Built-in redundancy and failover capabilities
- Model Quantization: 4-bit and 8-bit precision for efficient GPU usage
- Smart Caching: Intelligent response caching for improved performance
- Auto-Scaling: Dynamic resource allocation based on demand
- API-First Design: RESTful API for easy integration
- Monitoring & Analytics: Comprehensive usage and performance metrics
| Use Case | Description | Key Models Used |
|---|---|---|
| Code Development | Generate, debug, and optimize code across multiple languages | Codestral, CodeLlama, DeepSeek Coder |
| Data Analysis | Analyze complex datasets and generate insights | Mixtral, Gemma 3, Llama 3 |
| Content Creation | Write documentation, articles, and creative content | Llama 3, Gemma 3, Mixtral |
| Customer Support | Automated response generation and ticket classification | Gemma 3, Mistral |
| Research Assistant | Literature review, summarization, and synthesis | Llama 3, Mixtral, Nemotron |
| Data Generation | Create test data, schemas, and configurations | Mixtral, Gemma 3 |
- CPU: 16 cores (AMD Ryzen 9 or Intel i9)
- RAM: 64GB DDR4
- GPU: NVIDIA RTX 4090 24GB or A100 40GB
- Storage: 2TB NVMe SSD
- OS: Ubuntu 22.04 LTS
- Network: 1Gbps connection
- CPU: 64+ cores (AMD EPYC or Intel Xeon)
- RAM: 256GB DDR4 ECC
- GPU: 4x NVIDIA A100 80GB or 2x H100 80GB
- Storage: 8TB NVMe SSD RAID 10
- OS: Ubuntu 22.04 LTS Server
- Network: 10Gbps connection
The Sovereign AI Suite follows a three-tier architecture pattern:
┌─────────────────────────────────────────────────────────┐
│ Presentation Tier │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Streamlit │ │ Web UI │ │ Mobile │ │
│ │ App │ │ (Future) │ │ (Future) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ ┌──────────────────────────────────────────────────┐ │
│ │ FastAPI Backend Core │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Router │ │ Council │ │ RAG │ │ │
│ │ │ Engine │ │ of │ │ System │ │ │
│ │ │ │ │ Experts │ │ │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Flask Admin Portal │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────┐
│ Data Tier │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ MongoDB │ │ChromaDB │ │ Redis │ │
│ │ │ │ Vectors │ │ Cache │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────┘
The heart of the system, responsible for:
- Request routing and orchestration
- Model loading and management
- Response synthesis
- API endpoint exposure
Key Modules:
- model_loader.py: Dynamic model loading with quantization
- council_of_experts.py: Multi-model orchestration
- task_router.py: Intelligent query routing
- main.py: FastAPI application and endpoints
The unique multi-model architecture that enables superior performance:
Council Process Flow:
1. Query Analysis → Determine complexity and type
2. Expert Selection → Choose optimal models
3. Parallel/Sequential Processing → Execute based on resources
4. Response Synthesis → Combine and refine outputs
5. Quality Validation → Ensure response meets standards
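The five steps above amount to a thin orchestration layer. The sketch below is purely illustrative: `classify`, `select_experts`, and `run_expert` are stand-ins for the real task_router and council_of_experts internals, and the "synthesis" step simply keeps the longest draft.

```python
# Illustrative sketch of the Council process flow; all helpers are stand-ins
# for the real task_router / council_of_experts modules.
import asyncio

def classify(query: str) -> tuple[float, str]:
    # 1. Query Analysis (toy heuristic)
    return 0.7, "coding" if "code" in query.lower() else "general"

def select_experts(task_type: str, complexity: float) -> list[str]:
    # 2. Expert Selection
    return ["gemma3-2b"] if complexity < 0.3 else ["gemma3-27b", "llama3-70b"]

async def run_expert(name: str, query: str) -> str:
    return f"[{name}] draft answer to: {query}"

async def council_answer(query: str) -> str:
    complexity, task_type = classify(query)
    experts = select_experts(task_type, complexity)
    # 3. Parallel Processing
    drafts = await asyncio.gather(*(run_expert(e, query) for e in experts))
    # 4/5. Response Synthesis + Quality Validation (stand-in: keep longest draft)
    return max(drafts, key=len)

print(asyncio.run(council_answer("Explain RAG")))
```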
Centralized model configuration and management:

| Model | Purpose | Memory | Speed | Quality |
|---|---|---|---|---|
| Gemma 3 2B | Ultra-fast responses | 3GB | 10/10 | 7/10 |
| Gemma 3 9B | Balanced performance | 6GB | 9/10 | 8/10 |
| Gemma 3 27B | High-quality general | 14GB | 7/10 | 9/10 |
| Llama 3.1 70B | Complex reasoning | 35GB | 6/10 | 9/10 |
| Codestral 22B | Code generation | 12GB | 8/10 | 9/10 |
| CodeLlama 70B | Complex algorithms | 35GB | 6/10 | 9/10 |
| Mixtral 8x22B | Multi-domain expert | 45GB | 5/10 | 9/10 |
graph LR
A[User Request] --> B[API Gateway]
B --> C{Authentication}
C -->|Valid| D[Task Router]
C -->|Invalid| E[Error Response]
D --> F[Query Classifier]
F --> G[Model Selector]
G --> H[Council of Experts]
H --> I[Response Synthesis]
I --> J[Quality Check]
J --> K[Cache Response]
K --> L[Return to User]
Multiple layers of security ensure data protection:
1. Network Security
   - SSL/TLS encryption
   - Firewall rules
   - Rate limiting (sketched after this list)
2. Application Security
   - API key authentication
   - JWT tokens for sessions
   - Input validation and sanitization
3. Data Security
   - Encryption at rest
   - Encrypted backups
   - Audit logging
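As a sketch of the rate-limiting layer, a minimal in-process limiter can live behind a FastAPI dependency. Production deployments would more likely enforce limits at Nginx or in Redis; the endpoint and numbers here are illustrative (100 requests/hour matches the "User" tier documented later).

```python
# Minimal in-process rate limiter sketch; endpoint and limits are illustrative.
import time
from collections import defaultdict
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()
WINDOW_SECONDS, MAX_REQUESTS = 3600, 100  # illustrative "User" tier limit
_hits: dict[str, list[float]] = defaultdict(list)

def rate_limit(x_api_key: str = Header(...)) -> None:
    """Reject callers who exceed MAX_REQUESTS per WINDOW_SECONDS."""
    now = time.time()
    recent = [t for t in _hits[x_api_key] if now - t < WINDOW_SECONDS]
    if len(recent) >= MAX_REQUESTS:
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    recent.append(now)
    _hits[x_api_key] = recent

@app.get("/api/v1/ping")
def ping(_: None = Depends(rate_limit)) -> dict:
    return {"ok": True}
```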
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential tools
sudo apt install -y \
build-essential \
git \
curl \
wget \
vim \
htop \
software-properties-common

# Add Python 3.10 repository
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
# Install Python 3.10
sudo apt install -y python3.10 python3.10-venv python3.10-dev
# Set as default
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
# Install CUDA and drivers
sudo apt update
sudo apt install -y cuda-12-3 nvidia-driver-545
# Verify installation
nvidia-smi

# Add to .bashrc
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Test CUDA
nvcc --version

# Import MongoDB public key
curl -fsSL https://pgp.mongodb.com/server-7.0.asc | \
sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
# Add repository
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] \
https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | \
sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
# Install MongoDB
sudo apt update
sudo apt install -y mongodb-org
# Start service
sudo systemctl start mongod
sudo systemctl enable mongod

# Create application directory
sudo mkdir -p /opt/sovereign-ai-suite
sudo chown $USER:$USER /opt/sovereign-ai-suite
# Clone repository
git clone https://github.com/your-org/sovereign-ai-suite.git /opt/sovereign-ai-suite
cd /opt/sovereign-ai-suite

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate
# Upgrade pip
pip install --upgrade pip setuptools wheel
# Install requirements
pip install -r setup/requirements.txt

# Download all models (2-4 hours)
python scripts/download_models.py
# Or download priority models only
python scripts/download_models.py --priority-only

# Copy environment template
cp config/.env.example .env
# Edit configuration
vim .env

Key configuration variables:
# Application
APP_NAME="Sovereign AI Suite"
DEBUG=false
# API
API_HOST=0.0.0.0
API_PORT=8000
# Security
SECRET_KEY=<generate-strong-key>
# MongoDB
MONGODB_URL=mongodb://localhost:27017
DATABASE_NAME=sovereign_ai
# Models
MAX_CACHED_MODELS=3
ENABLE_GEMMA3_FAST_MODE=true

# Run database initialization
python scripts/init_db.py
# Create admin user
python scripts/create_admin.py

# Copy service file
sudo cp config/sovereign-ai.service /etc/systemd/system/
# Reload systemd
sudo systemctl daemon-reload
# Enable service
sudo systemctl enable sovereign-ai
# Start service
sudo systemctl start sovereign-ai

# Install Nginx
sudo apt install -y nginx
# Copy configuration
sudo cp config/nginx.conf /etc/nginx/sites-available/sovereign-ai
sudo ln -s /etc/nginx/sites-available/sovereign-ai /etc/nginx/sites-enabled/
# Test configuration
sudo nginx -t
# Restart Nginx
sudo systemctl restart nginx

Complete list of environment variables:
| Variable | Description | Default | Required |
|---|---|---|---|
| `APP_NAME` | Application name | Sovereign AI Suite | No |
| `DEBUG` | Debug mode | false | No |
| `API_HOST` | API bind address | 0.0.0.0 | No |
| `API_PORT` | API port | 8000 | No |
| `API_WORKERS` | Number of API workers | 1 | No |
| `SECRET_KEY` | Secret key for encryption | None | Yes |
| `MONGODB_URL` | MongoDB connection string | mongodb://localhost:27017 | Yes |
| `DATABASE_NAME` | Database name | sovereign_ai | No |
| `MAX_CACHED_MODELS` | Maximum cached models | 3 | No |
| `DEFAULT_TEMPERATURE` | Default model temperature | 0.7 | No |
| `DEFAULT_MAX_TOKENS` | Default max tokens | 2048 | No |
| `ENABLE_CACHE` | Enable response caching | true | No |
| `CACHE_TTL` | Cache TTL in seconds | 3600 | No |
# Model-specific configurations
models:
  gemma3:
    2b:
      model_id: "google/gemma-2-2b-it"
      optimal_batch_size: 8
      max_sequence_length: 8192
      quantization: "none"
    9b:
      model_id: "google/gemma-2-9b-it"
      optimal_batch_size: 4
      max_sequence_length: 8192
      quantization: "4bit"
    27b:
      model_id: "google/gemma-2-27b-it"
      optimal_batch_size: 2
      max_sequence_length: 8192
      quantization: "4bit"
# Optimization settings
optimization:
  enable_flash_attention: true
  enable_quantization: true
  default_quantization: "4bit"
  gradient_checkpointing: false

# Performance settings
performance:
  max_concurrent_models: 3
  model_timeout_seconds: 300
  enable_caching: true
  cache_ttl_seconds: 3600
# Task routing
routing:
  complexity_thresholds:
    simple: 0.3
    moderate: 0.6
    complex: 0.8
  expert_counts:
    simple: 1
    moderate: 2
    complex: 3
    very_complex: 5
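These thresholds and counts combine into a simple lookup. A sketch of that mapping (the dict literals mirror the YAML above; the function name is illustrative, not the router's actual API):

```python
# Illustrative mapping from a complexity score to an expert count,
# mirroring the routing YAML above.
THRESHOLDS = {"simple": 0.3, "moderate": 0.6, "complex": 0.8}
EXPERT_COUNTS = {"simple": 1, "moderate": 2, "complex": 3, "very_complex": 5}

def experts_for(complexity: float) -> int:
    if complexity < THRESHOLDS["simple"]:
        return EXPERT_COUNTS["simple"]
    if complexity < THRESHOLDS["moderate"]:
        return EXPERT_COUNTS["moderate"]
    if complexity < THRESHOLDS["complex"]:
        return EXPERT_COUNTS["complex"]
    return EXPERT_COUNTS["very_complex"]

assert experts_for(0.2) == 1 and experts_for(0.95) == 5
```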
# Generate secure API key
import secrets
api_key = f"sk-{secrets.token_hex(32)}"

# Generate self-signed certificate (for testing)
openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365
# Or use Let's Encrypt (production)
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d your-domain.com

# /etc/mongod.conf
storage:
  dbPath: /var/lib/mongodb
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
      journalCompressor: snappy
    collectionConfig:
      blockCompressor: snappy
net:
  port: 27017
  bindIp: 127.0.0.1
security:
  authorization: enabled
replication:
  replSetName: rs0

1. Open your browser and navigate to: http://your-server:8501
2. Enter your API key in the authentication section
3. Click "Connect" to authenticate
4. Type your question in the chat input
5. Press Enter or click Send
6. Wait for the AI response
7. View response details by expanding the metadata section
The main interface for interacting with the AI:
- Query Input: Natural language input for questions
- Context Field: Additional context for better responses
- Response Display: Formatted AI responses with syntax highlighting
- Metadata View: Processing details and confidence scores
Upload and analyze various file types:
Supported Formats:
- Documents: PDF, DOCX, TXT, CSV, XLSX
- Images: PNG, JPG, JPEG, GIF
- Audio: MP3, WAV, M4A
Processing Options:
- Analyze: Extract insights from content
- Summarize: Create concise summaries
- Extract: Pull out specific information
- Add to Knowledge Base: Store for RAG retrieval
Choose between different processing modes:
| Mode | Description | Response Time | Use Case |
|---|---|---|---|
| Fast Mode | Uses Gemma 3 models | < 2 seconds | Quick questions, simple tasks |
| Standard Mode | 2-3 expert models | 3-5 seconds | Most queries |
| Deep Mode | 4-5 expert models | 5-10 seconds | Complex analysis |
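Fast Mode can also be requested per query through the `fast_mode` flag on the query endpoint documented in the API section. A short sketch using the `requests` library (host and key are placeholders):

```python
# Requesting Fast Mode via the query API; host and API key are placeholders.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"X-API-Key": "sk-your-key"},
    json={"query": "What is RAG?", "fast_mode": True},
    timeout=30,
)
print(resp.json()["answer"])
```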
Generate code with specific requirements:
# Example prompt
"Create a Python FastAPI endpoint for user authentication with JWT tokens,
including input validation and error handling"

Best Practices:
- Be specific about requirements
- Mention language and framework
- Include error handling needs
- Specify security requirements
Create structured data:
# Example prompt
"Generate a CSV with 100 rows of sample e-commerce data including:
- order_id (unique)
- customer_name
- product_name
- quantity
- price
- order_date (last 30 days)
- status (pending/shipped/delivered)"

Analyze uploaded documents:
- Upload document using file uploader
- Select "Analyze" as process type
- Add specific questions (optional)
- Click "Process File"
| Do | Don't |
|---|---|
| Be specific and detailed | Use vague language |
| Provide context when needed | Assume prior knowledge |
| Break complex tasks into steps | Ask multiple unrelated questions |
| Specify output format | Leave format ambiguous |
Effective context improves response quality:
# Good context example
Context: "I'm building a React application with TypeScript.
We use Material-UI for components and Redux for state management."
Query: "How do I implement infinite scrolling for a product list?"
# Poor context example
Context: "React app"
Query: "How to do scrolling?"- Navigate to:
http://your-server:5000 - Default credentials:
- Username:
admin - Password:
changeme123
- Username:
- Important: Change default password immediately
The admin dashboard provides:
- User statistics
- API key management
- System metrics
- Recent activity logs
- Quick actions
# Via Admin Portal
1. Click "Create User"
2. Fill in required fields:
- Username (unique)
- Email
- Organization (optional)
- Role (user/developer/admin)
3. Click "Create"
4. Copy generated API key
5. Share securely with user

| Role | Permissions | Use Case |
|---|---|---|
| User | Basic API access | Standard users |
| Developer | Extended API access, higher limits | Development teams |
| Admin | Full access, user management | System administrators |
# Regenerate API key
1. Navigate to Users page
2. Find user in list
3. Click "Regenerate Key"
4. Confirm action
5. Share new key with user
# Revoke API key
1. Navigate to Users page
2. Find user in list
3. Click "Revoke Key"
4. Key is immediately invalidated

# Check service status
sudo systemctl status sovereign-ai
# Restart services
sudo systemctl restart sovereign-ai
# View logs
sudo journalctl -u sovereign-ai -f
# Individual service control
sudo supervisorctl status
sudo supervisorctl restart backend
sudo supervisorctl restart frontend
sudo supervisorctl restart admin

# Backup MongoDB
mongodump --db sovereign_ai --out /backup/$(date +%Y%m%d)
# Restore MongoDB
mongorestore --db sovereign_ai /backup/20240101
# Database maintenance (compact heavily-used collections)
mongosh sovereign_ai --eval "db.runCommand({ compact: 'query_logs' })"
# Clear old logs (older than 30 days)
mongosh sovereign_ai --eval "
db.query_logs.deleteMany({
  timestamp: { \$lt: new Date(Date.now() - 30*24*60*60*1000) }
})
"

# List downloaded models
ls -la /opt/sovereign-ai-suite/models/
# Update specific model
python scripts/update_model.py --model gemma3-27b
# Clear model cache
python scripts/clear_cache.py --type models
# Verify model integrity
python scripts/verify_models.py

Monitor key metrics using built-in tools:
# GPU utilization
nvidia-smi -l 1
# Memory usage
free -h
# Disk usage
df -h
# Process monitoring
htop

Access metrics via API:
# Get system health
curl -X GET http://localhost:8000/health
# Get usage analytics
curl -X GET http://localhost:8000/api/v1/analytics \
-H "X-API-Key: your-admin-key"Important log locations:
# Application logs
/opt/sovereign-ai-suite/logs/backend.log
/opt/sovereign-ai-suite/logs/frontend.log
/opt/sovereign-ai-suite/logs/admin.log
# System logs
/var/log/nginx/access.log
/var/log/nginx/error.log
/var/log/mongodb/mongod.log
# Analyze logs
grep ERROR /opt/sovereign-ai-suite/logs/backend.log
tail -f /opt/sovereign-ai-suite/logs/backend.log

#!/bin/bash
# backup.sh - Daily backup script
BACKUP_DIR="/backup/sovereign-ai/$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
# Backup MongoDB
mongodump --db sovereign_ai --out $BACKUP_DIR/mongodb
# Backup configuration
cp -r /opt/sovereign-ai-suite/config $BACKUP_DIR/
# Backup uploaded files
cp -r /opt/sovereign-ai-suite/data/uploads $BACKUP_DIR/
# Compress backup
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
rm -rf $BACKUP_DIR
# Keep only last 30 days
find /backup/sovereign-ai -name "*.tar.gz" -mtime +30 -delete

# Full recovery
1. Stop all services
sudo systemctl stop sovereign-ai
2. Restore MongoDB
mongorestore --db sovereign_ai /backup/mongodb
3. Restore configuration
cp -r /backup/config/* /opt/sovereign-ai-suite/config/
4. Restore uploads
cp -r /backup/uploads/* /opt/sovereign-ai-suite/data/uploads/
5. Start services
sudo systemctl start sovereign-ai

Base URL: http://your-server:8000/api/v1
Authentication: All endpoints require API key in header:
X-API-Key: sk-your-api-key
POST /api/v1/query
Process a query through the Council of Experts.
Request Body:
{
"query": "string",
"context": "string (optional)",
"task_type": "general|coding|analysis|creative|data_generation",
"num_experts": 3,
"temperature": 0.7,
"max_tokens": 2048,
"use_rag": true,
"fast_mode": false
}

Response:
{
"query_id": "uuid",
"success": true,
"answer": "string",
"task_type": "general",
"experts_consulted": ["gemma3-27b", "llama3-70b"],
"confidence_score": 0.85,
"processing_time": 3.5,
"tokens_used": 1250,
"metadata": {},
"timestamp": "2024-01-01T00:00:00Z"
}

Example:
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-H "X-API-Key: sk-your-key" \
-d '{
"query": "Write a Python function to calculate fibonacci",
"task_type": "coding",
"num_experts": 2
}'

POST /api/v1/upload
Upload and process files.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Fields:
  - file: File data
  - query: Optional query about the file
  - process_type: analyze|extract|summarize|add_to_knowledge
Response:
{
"success": true,
"filename": "document.pdf",
"file_type": ".pdf",
"result": "string or object",
"process_type": "analyze"
}

Example:
curl -X POST http://localhost:8000/api/v1/upload \
-H "X-API-Key: sk-your-key" \
-F "file=@document.pdf" \
-F "query=Summarize this document" \
-F "process_type=summarize"GET /api/v1/models
Get available models and their status.
Response:
{
"models": [
{
"key": "gemma3-27b",
"name": "Gemma 3 27B",
"categories": ["general", "coding", "analysis"],
"memory_gb": 14,
"speed_rating": 7,
"quality_rating": 9,
"is_loaded": true,
"specialties": ["comprehensive", "multilingual"]
}
],
"loaded": ["gemma3-27b", "codestral-22b"]
}

POST /api/v1/users
Create a new user (admin only).
Request Body:
{
"username": "string",
"email": "email@example.com",
"role": "user|developer|admin",
"organization": "string (optional)"
}

Response:
{
"user": {
"_id": "user_id",
"username": "string",
"email": "email@example.com",
"role": "user",
"created_at": "2024-01-01T00:00:00Z"
},
"api_key": "sk-generated-key"
}

GET /api/v1/analytics
Get usage analytics (admin/developer only).
Query Parameters:
- days: Number of days to analyze (default: 7)
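For example, pulling a 30-day window with the `requests` library (host and key are placeholders):

```python
# Fetching 30 days of analytics via the days query parameter.
import requests

resp = requests.get(
    "http://localhost:8000/api/v1/analytics",
    params={"days": 30},
    headers={"X-API-Key": "sk-your-admin-key"},
    timeout=30,
)
print(resp.json()["analytics"])
```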
Response:
{
"period_days": 7,
"analytics": [
{
"_id": "coding",
"count": 150,
"avg_time": 4.2,
"avg_confidence": 0.87
}
]
}

All errors follow standard HTTP status codes:
| Status Code | Description | Example |
|---|---|---|
| 200 | Success | Request processed successfully |
| 400 | Bad Request | Invalid input parameters |
| 401 | Unauthorized | Invalid or missing API key |
| 403 | Forbidden | Insufficient permissions |
| 404 | Not Found | Resource not found |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Server error |
Error Response Format:
{
"error": {
"code": "INVALID_INPUT",
"message": "Detailed error message",
"details": {}
}
}

Default rate limits:
| User Type | Requests/Hour | Concurrent Requests |
|---|---|---|
| User | 100 | 2 |
| Developer | 500 | 5 |
| Admin | Unlimited | Unlimited |
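Clients should treat a 429 response as a signal to back off and retry rather than an error. A minimal sketch (the endpoint, payload, and delays are illustrative):

```python
# Minimal client-side exponential backoff for 429 responses.
import time
import requests

def query_with_backoff(payload: dict, api_key: str, retries: int = 5) -> dict:
    for attempt in range(retries):
        resp = requests.post(
            "http://localhost:8000/api/v1/query",
            headers={"X-API-Key": api_key},
            json=payload,
            timeout=60,
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("Rate limit still exceeded after retries")
```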
Configure webhooks for event notifications:
{
"url": "https://your-server.com/webhook",
"events": ["query.completed", "user.created", "error.critical"],
"secret": "webhook-secret"
}
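The documentation does not pin down the signature scheme, so the sketch below assumes payloads are signed with HMAC-SHA256 over the raw body using the shared secret and delivered in an X-Signature header; both the header name and the scheme are assumptions to verify against your deployment.

```python
# Webhook verification sketch; HMAC-SHA256 over the raw body with the shared
# secret is an assumption, not a documented guarantee.
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

body = b'{"event": "query.completed"}'
sig = hmac.new(b"webhook-secret", body, hashlib.sha256).hexdigest()
assert verify_webhook(body, sig, "webhook-secret")
```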
# Clone repository
git clone https://github.com/your-org/sovereign-ai-suite.git
cd sovereign-ai-suite
# Create development branch
git checkout -b feature/your-feature
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install

.env.development:
DEBUG=true
API_HOST=127.0.0.1
API_PORT=8000
MONGODB_URL=mongodb://localhost:27017
DATABASE_NAME=sovereign_ai_dev
MAX_CACHED_MODELS=1

sovereign-ai-suite/
├── backend/
│ ├── core/ # Core functionality
│ ├── services/ # Business logic
│ ├── api/ # API endpoints
│ ├── schemas/ # Data models
│ └── utils/ # Utilities
├── frontend/
│ └── streamlit_app.py
├── admin/
│ ├── app.py
│ └── templates/
├── tests/
│ ├── unit/
│ ├── integration/
│ └── e2e/
└── scripts/
└── development/
Follow PEP 8 with these additions:
# Good
from typing import Optional, List, Dict, Any
from dataclasses import dataclass

@dataclass
class ModelConfig:
    """Model configuration with clear documentation."""
    model_id: str
    max_tokens: int = 2048
    temperature: float = 0.7

    def validate(self) -> bool:
        """Validate configuration parameters."""
        return 0.0 <= self.temperature <= 2.0

# Bad
class model_config:
    def __init__(self, id, max_t=2048, temp=0.7):
        self.id = id
        self.max_t = max_t
        self.temp = temp

RESTful principles:
- Use appropriate HTTP methods
- Return consistent response formats
- Include proper status codes
- Version your API
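As a sketch of the versioning point, FastAPI routers make /api/v1 a prefix rather than a convention; the endpoint body here is illustrative:

```python
# Versioned routing sketch; the endpoint body is illustrative.
from fastapi import APIRouter, FastAPI

v1 = APIRouter(prefix="/api/v1")

@v1.get("/models")
def list_models() -> dict:
    return {"models": [], "loaded": []}

app = FastAPI()
app.include_router(v1)  # a future /api/v2 router can live alongside v1
```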
# tests/unit/test_model_loader.py
import pytest
from backend.core.model_loader import ModelLoader

@pytest.fixture
def model_loader():
    return ModelLoader(cache_dir="/tmp/test_models")

def test_load_model(model_loader):
    """Test model loading functionality."""
    model = model_loader.load_model("gemma3-2b")
    assert model is not None
    assert "gemma3-2b" in model_loader.loaded_models

def test_invalid_model(model_loader):
    """Test loading invalid model."""
    with pytest.raises(ValueError):
        model_loader.load_model("invalid-model")

# tests/integration/test_api.py
import pytest
from fastapi.testclient import TestClient
from backend.main import app

client = TestClient(app)

def test_query_endpoint():
    """Test query processing endpoint."""
    response = client.post(
        "/api/v1/query",
        json={"query": "Hello, world!"},
        headers={"X-API-Key": "test-key"}
    )
    assert response.status_code == 200
    assert "answer" in response.json()

# Run all tests
pytest
# Run with coverage
pytest --cov=backend --cov-report=html
# Run specific test file
pytest tests/unit/test_model_loader.py
# Run with verbose output
pytest -v
# Run only marked tests
pytest -m "not slow"

# backend/core/model_loader.py
ModelRegistry.MODELS["new-model"] = ModelProfile(
    model_id="org/model-name",
    name="New Model",
    categories=[TaskCategory.GENERAL],
    max_length=4096,
    optimal_temperature=0.7,
    memory_gb=20,
    speed_rating=7,
    quality_rating=8,
    specialties=["specific_task"],
    load_in_4bit=True
)

# backend/core/council_of_experts.py
EXPERT_STRATEGIES[TaskType.GENERAL]["primary"].append("new-model")

# scripts/download_models.py
LANGUAGE_MODELS.append(
    ("org/model-name", "new-model", True)  # priority flag
)

# Test model loading
python -c "
from backend.core.model_loader import ModelLoader
loader = ModelLoader()
model = loader.load_model('new-model')
print('Model loaded successfully')
"- Fork the repository
- Create a feature branch
- Make changes
- Write tests
- Run tests and linting
- Submit pull request
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests pass
- [ ] Integration tests pass
- [ ] Manual testing completed
## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new warnings

# Create dedicated user
sudo useradd -m -s /bin/bash sovereign-ai
sudo usermod -aG sudo sovereign-ai
# Set up directories
sudo mkdir -p /opt/sovereign-ai-suite
sudo chown sovereign-ai:sovereign-ai /opt/sovereign-ai-suite
# Configure firewall
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

# Switch to service user
sudo su - sovereign-ai
# Clone and setup
cd /opt/sovereign-ai-suite
git clone https://github.com/your-org/sovereign-ai-suite.git .
# Install dependencies
python3.10 -m venv venv
source venv/bin/activate
pip install -r setup/requirements.txt
# Download models
python scripts/download_models.py --production

# /etc/nginx/nginx.conf
upstream backend_servers {
    least_conn;
    server backend1.local:8000;
    server backend2.local:8000;
    server backend3.local:8000;
}
server {
    listen 443 ssl http2;
    location /api {
        proxy_pass http://backend_servers;
        proxy_next_upstream error timeout invalid_header http_500;
    }
}

// MongoDB replica set configuration
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1.local:27017", priority: 2 },
    { _id: 1, host: "mongo2.local:27017", priority: 1 },
    { _id: 2, host: "mongo3.local:27017", arbiterOnly: true }
  ]
})

# docker-compose.scale.yml
version: '3.8'
services:
  backend:
    image: sovereign-ai/backend
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: '4'
          memory: 32G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

# Distribute models across GPUs
GPU_ASSIGNMENTS = {
    "gpu0": ["gemma3-2b", "gemma3-9b"],
    "gpu1": ["codestral-22b", "codellama-70b"],
    "gpu2": ["llama3-70b"],
    "gpu3": ["mixtral-8x22b"]
}
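At load time these assignments could be honored by pinning each model to its device. A sketch under that assumption (the lookup helper is illustrative; `device_map={"": N}` places the whole model on GPU N and requires the accelerate package):

```python
# Sketch of honoring GPU_ASSIGNMENTS at load time; device_index_for() is an
# illustrative helper, and model keys map to Hugging Face IDs in the registry.
from transformers import AutoModelForCausalLM

def device_index_for(model_key: str) -> int:
    for gpu, models in GPU_ASSIGNMENTS.items():
        if model_key in models:
            return int(gpu.removeprefix("gpu"))
    raise KeyError(f"{model_key} has no GPU assignment")

def load_on_assigned_gpu(model_key: str, model_id: str):
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map={"": device_index_for(model_key)},  # pin model to one GPU
    )
```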
# .github/workflows/deploy.yml
name: Deploy to Production
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run tests
        run: |
          pip install -r requirements-dev.txt
          pytest

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to server
        uses: appleboy/ssh-action@master
        with:
          host: ${{ secrets.PROD_HOST }}
          username: ${{ secrets.PROD_USER }}
          key: ${{ secrets.PROD_SSH_KEY }}
          script: |
            cd /opt/sovereign-ai-suite
            git pull origin main
            source venv/bin/activate
            pip install -r setup/requirements.txt
            sudo systemctl restart sovereign-ai

# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'sovereign-ai'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

{
"dashboard": {
"title": "Sovereign AI Monitoring",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(http_requests_total[5m])"
}
]
},
{
"title": "Response Time",
"targets": [
{
"expr": "histogram_quantile(0.95, http_request_duration_seconds_bucket)"
}
]
},
{
"title": "GPU Utilization",
"targets": [
{
"expr": "nvidia_gpu_utilization"
}
]
}
]
}
}

| Model | Size | Use Case | Quantization | Memory |
|---|---|---|---|---|
| Gemma 3 2B | 2B | Ultra-fast responses | None | 3GB |
| Gemma 3 9B | 9B | Balanced performance | 4-bit | 6GB |
| Gemma 3 27B | 27B | High quality | 4-bit | 14GB |
| Llama 3.1 70B | 70B | Complex reasoning | 4-bit | 35GB |
| Codestral 22B | 22B | Code generation | 4-bit | 12GB |
| CodeLlama 70B | 70B | Complex coding | 4-bit | 35GB |
| Mixtral 8x22B | 141B | Multi-domain | 4-bit | 45GB |
| DeepSeek Coder 33B | 33B | Debugging | 4-bit | 17GB |
from typing import List

def select_models_for_task(task_type: str, complexity: float) -> List[str]:
    """Select optimal models based on task and complexity."""
    if complexity < 0.3:  # Simple
        return ["gemma3-2b"]
    elif complexity < 0.6:  # Moderate
        if task_type == "coding":
            return ["gemma3-9b", "codestral-22b"]
        else:
            return ["gemma3-27b", "llama3-70b"]
    else:  # Complex
        if task_type == "coding":
            return ["codestral-22b", "codellama-70b", "deepseek-coder-33b"]
        else:
            return ["llama3-70b", "mixtral-8x22b", "gemma3-27b"]

import torch
from transformers import BitsAndBytesConfig
# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)
# 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

def optimize_memory_usage():
"""Optimize GPU memory usage."""
# Clear cache
torch.cuda.empty_cache()
# Set memory fraction
torch.cuda.set_per_process_memory_fraction(0.9)
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
# Use flash attention
model.config.use_flash_attention_2 = True# Update specific model
python scripts/update_model.py --model gemma3-27b
# Update all models
python scripts/update_model.py --all
# Check for updates
python scripts/check_updates.py

# models/versions.yaml
models:
  gemma3-27b:
    current: "v1.0.0"
    available: ["v1.0.0", "v1.1.0"]
    changelog:
      v1.1.0: "Improved coding capabilities"

Internet
│
▼
Firewall (Layer 1)
│
▼
Nginx/SSL (Layer 2)
│
▼
API Gateway (Layer 3)
│
▼
Authentication (Layer 4)
│
▼
Application (Layer 5)
│
▼
Database (Layer 6)
import hashlib
import secrets

def generate_secure_api_key() -> tuple[str, str]:
    """Generate secure API key and hash."""
    # Generate random key
    raw_key = secrets.token_hex(32)
    api_key = f"sk-{raw_key}"
    # Hash for storage
    key_hash = hashlib.pbkdf2_hmac(
        'sha256',
        api_key.encode(),
        b'salt',  # use a unique random salt per key in production
        100000
    )
    return api_key, key_hash.hex()
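The verification side is the mirror image: re-derive the hash from the presented key and compare in constant time. A minimal sketch under the same illustrative fixed-salt assumption:

```python
# Verification counterpart to the generator above; the salt and iteration
# count must match, and in production the per-key salt is stored with the hash.
import hashlib
import hmac

def verify_api_key(presented_key: str, stored_hash_hex: str) -> bool:
    candidate = hashlib.pbkdf2_hmac(
        'sha256',
        presented_key.encode(),
        b'salt',   # must match the salt used when the key was generated
        100000
    )
    return hmac.compare_digest(candidate.hex(), stored_hash_hex)
```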
from jose import jwt
from datetime import datetime, timedelta

def create_access_token(data: dict) -> str:
    """Create JWT access token."""
    to_encode = data.copy()
    expire = datetime.utcnow() + timedelta(minutes=30)
    to_encode.update({"exp": expire})
    encoded_jwt = jwt.encode(
        to_encode,
        SECRET_KEY,
        algorithm="HS256"
    )
    return encoded_jwt

# Enable MongoDB encryption at rest (requires MongoDB Enterprise)
mongod --enableEncryption \
  --encryptionKeyFile /path/to/keyfile

# SSL/TLS configuration
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:10m;

from pydantic import BaseModel, validator
import re

class QueryRequest(BaseModel):
    query: str

    @validator('query')
    def validate_query(cls, v):
        # Check length
        if len(v) > 10000:
            raise ValueError("Query too long")
        # Check for injection attempts
        sql_patterns = [
            r"(\b(SELECT|INSERT|UPDATE|DELETE|DROP|CREATE)\b)",
            r"(--|#|\/\*|\*\/)",
            r"(\bOR\b.*=.*)",
        ]
        for pattern in sql_patterns:
            if re.search(pattern, v, re.IGNORECASE):
                raise ValueError("Invalid query content")
        return v

- Change default passwords
- Enable SSL/TLS
- Configure firewall rules
- Enable audit logging
- Set up intrusion detection
- Implement rate limiting
- Enable CORS properly
- Validate all inputs
- Sanitize outputs
- Regular security updates
- Backup encryption keys
- Monitor failed authentications
- Review access logs
- Conduct security audits
- Implement least privilege
import os

# Set CUDA environment variables (must be set before CUDA is initialized)
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
os.environ['CUDA_LAUNCH_BLOCKING'] = '0'
os.environ['TORCH_CUDA_ARCH_LIST'] = '8.0;8.6'

import torch

# Enable TF32 matmuls on Ampere and newer GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Gradient accumulation
def train_with_gradient_accumulation(model, optimizer, dataloader, accumulation_steps=4):
    """Accumulate gradients over several batches before stepping."""
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        outputs = model(batch)
        loss = outputs.loss / accumulation_steps
        loss.backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

def batch_inference(model, queries, batch_size=8):
    """Process queries in batches."""
    results = []
    for i in range(0, len(queries), batch_size):
        batch = queries[i:i+batch_size]
        with torch.no_grad():
            outputs = model(batch)
        results.extend(outputs)
    return results

import hashlib

# Cache inference results keyed by a hash of the query text
_response_cache: dict[str, str] = {}

def process_with_cache(query: str) -> str:
    """Return the cached response when the same query has been seen before."""
    key = hashlib.sha256(query.encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = model.generate(query)
    return _response_cache[key]

// Create indexes for performance
db.users.createIndex({ "username": 1 }, { unique: true })
db.users.createIndex({ "email": 1 }, { unique: true })
db.api_keys.createIndex({ "key": 1 })
db.query_logs.createIndex({ "timestamp": -1 })
db.query_logs.createIndex({ "user_id": 1, "timestamp": -1 })# Use projection to limit returned fields
db.query_logs.find(
  {"user_id": user_id},
  {"query": 1, "timestamp": 1, "response_time": 1}
).limit(100)

from fastapi import FastAPI
from fastapi.middleware.gzip import GZipMiddleware
app = FastAPI()
app.add_middleware(GZipMiddleware, minimum_size=1000)

from motor.motor_asyncio import AsyncIOMotorClient

class DatabasePool:
    def __init__(self, url: str, max_pool_size: int = 100):
        self.client = AsyncIOMotorClient(
            url,
            maxPoolSize=max_pool_size,
            minPoolSize=10
        )
| Operation | Target | Actual | Status |
|---|---|---|---|
| Simple query (Gemma 3 2B) | < 1s | 0.8s | ✅ |
| Standard query (3 experts) | < 5s | 4.2s | ✅ |
| Complex query (5 experts) | < 10s | 8.5s | ✅ |
| File upload (10MB) | < 3s | 2.1s | ✅ |
| RAG search | < 500ms | 320ms | ✅ |
| API latency | < 50ms | 35ms | ✅ |
Symptoms:
- Error: "CUDA out of memory"
- Model loading timeout
Solutions:
# Clear cached GPU memory in the current process (this cannot free memory
# held by the running service; restart the service for that)
python -c "import torch; torch.cuda.empty_cache()"
# Check GPU usage
nvidia-smi
# Reduce batch size
export BATCH_SIZE=1
# Use smaller model or quantization
python scripts/load_model.py --model gemma3-2b --quantize 8bit

Symptoms:
- Invalid API key error
- Authentication failed
Solutions:
# Verify API key in database
mongosh sovereign_ai --eval "db.api_keys.find()"
# Regenerate API key
python scripts/regenerate_key.py --user username
# Check headers
curl -H "X-API-Key: sk-your-key" http://localhost:8000/health

Symptoms:
- Queries taking > 10 seconds
- Timeout errors
Solutions:
# Enable fast mode
config.fast_mode = True
# Reduce number of experts
config.num_experts = 1
# Check model cache
print(model_loader.loaded_models)
# Monitor GPU utilization
watch -n 1 nvidia-smi

| Error | Cause | Solution |
|---|---|---|
| CUDA out of memory | Insufficient GPU memory | Use quantization or a smaller model |
| Connection refused | Service not running | Start service: systemctl start sovereign-ai |
| Model not found | Model not downloaded | Run: python scripts/download_models.py |
| Rate limit exceeded | Too many requests | Wait or upgrade user tier |
| Invalid query | Malformed input | Check query format and encoding |
# Check service status
systemctl status sovereign-ai
# View recent logs
journalctl -u sovereign-ai -n 100
# Test database connection
mongosh sovereign_ai --eval "db.stats()"
# Test API endpoint
curl http://localhost:8000/health
# Check disk space
df -h
# Monitor system resources
htop
# GPU diagnostics
nvidia-smi -l 1
# Network diagnostics
netstat -tulpn | grep LISTEN
# Process list
ps aux | grep python

/opt/sovereign-ai-suite/logs/
├── backend.log # API server logs
├── frontend.log # Streamlit logs
├── admin.log # Admin portal logs
├── model_loader.log # Model loading logs
└── error.log         # Error aggregation

# Find errors
grep ERROR /opt/sovereign-ai-suite/logs/*.log
# Count requests by type
grep "task_type" backend.log | cut -d'"' -f4 | sort | uniq -c
# Response time analysis
grep "processing_time" backend.log | awk '{print $NF}' | sort -n
# Failed authentications
grep "401" backend.log | wc -l#!/bin/bash
# recover.sh - Service recovery script
echo "Starting recovery procedure..."
# Stop all services
systemctl stop sovereign-ai
# Clear temporary files
rm -rf /tmp/model_cache/*
# Clear GPU memory
python -c "import torch; torch.cuda.empty_cache()"
# Compact the database
mongosh sovereign_ai --eval "db.runCommand({ compact: 'query_logs' })"
# Restart services
systemctl start sovereign-ai
# Verify health
sleep 10
curl http://localhost:8000/health

# Check service health
systemctl status sovereign-ai
# Review error logs
grep ERROR /opt/sovereign-ai-suite/logs/*.log
# Monitor disk usage
df -h /opt/sovereign-ai-suite

# Backup database
mongodump --db sovereign_ai --out /backup/weekly/
# Clean old logs
find /opt/sovereign-ai-suite/logs -name "*.log" -mtime +7 -delete
# Update models (if needed)
python scripts/check_updates.py

# Full system backup
tar -czf /backup/monthly/sovereign-ai-$(date +%Y%m).tar.gz /opt/sovereign-ai-suite
# Database optimization
mongosh sovereign_ai --eval "db.runCommand({compact:'query_logs'})"
# Security audit
python scripts/security_audit.py
# Performance analysis
python scripts/performance_report.py --month $(date +%Y-%m)

# backend/utils/metrics.py
from prometheus_client import Counter, Histogram, Gauge
# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests')
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration')
active_models = Gauge('active_models', 'Number of loaded models')
gpu_memory = Gauge('gpu_memory_usage_bytes', 'GPU memory usage')
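One way these metrics could be wired into the API is a small middleware plus a /metrics route; this is an illustrative sketch, not the suite's exact instrumentation:

```python
# Sketch of exposing the metrics defined above; middleware and route are
# illustrative, not the suite's actual instrumentation.
import time
from fastapi import FastAPI, Request, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    request_count.inc()
    request_duration.observe(time.perf_counter() - start)
    return response

@app.get("/metrics")
def metrics() -> Response:
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```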
# Health check endpoint
@app.get("/health")
async def health_check():
    checks = {
        "database": await check_database(),
        "models": check_models(),
        "gpu": check_gpu(),
        "disk": check_disk_space()
    }
    status = "healthy" if all(checks.values()) else "unhealthy"
    return {
        "status": status,
        "checks": checks,
        "timestamp": datetime.utcnow()
    }

# alerting/rules.yml
groups:
  - name: sovereign_ai_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
      - alert: LowGPUMemory
        expr: gpu_memory_available < 2000000000
        for: 10m
        annotations:
          summary: "GPU memory running low"
      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, http_request_duration_seconds_bucket) > 10
        for: 5m
        annotations:
          summary: "95th percentile response time > 10s"

#!/bin/bash
# backup.sh - Comprehensive backup script
BACKUP_ROOT="/backup/sovereign-ai"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_DIR="$BACKUP_ROOT/$DATE"
# Create backup directory
mkdir -p $BACKUP_DIR
# Backup database
echo "Backing up database..."
mongodump --db sovereign_ai --out $BACKUP_DIR/mongodb
# Backup configuration
echo "Backing up configuration..."
cp -r /opt/sovereign-ai-suite/config $BACKUP_DIR/
# Backup user uploads
echo "Backing up user data..."
cp -r /opt/sovereign-ai-suite/data/uploads $BACKUP_DIR/
# Backup logs (last 7 days)
echo "Backing up recent logs..."
mkdir -p $BACKUP_DIR/logs
find /opt/sovereign-ai-suite/logs -name "*.log" -mtime -7 -exec cp {} $BACKUP_DIR/logs/ \;
# Create archive
echo "Creating archive..."
tar -czf $BACKUP_DIR.tar.gz -C $BACKUP_ROOT $DATE
# Upload to remote storage (optional)
# aws s3 cp $BACKUP_DIR.tar.gz s3://backup-bucket/sovereign-ai/
# Cleanup old backups (keep last 30 days)
find $BACKUP_ROOT -name "*.tar.gz" -mtime +30 -delete
echo "Backup completed: $BACKUP_DIR.tar.gz"-
Immediate Response
- Assess damage scope
- Notify stakeholders
- Activate backup systems
-
Recovery Steps
# Step 1: Restore from backup tar -xzf /backup/sovereign-ai/latest.tar.gz -C / # Step 2: Restore database mongorestore --db sovereign_ai /backup/mongodb # Step 3: Verify configuration python scripts/verify_config.py # Step 4: Start services systemctl start sovereign-ai # Step 5: Run health checks python scripts/health_check.py --comprehensive
-
Validation
- Test all endpoints
- Verify model loading
- Check user access
- Monitor performance
Q: What makes Sovereign AI different from other AI solutions?
A: Sovereign AI is 100% self-hosted, uses only open-source models, provides complete data privacy, and achieves performance comparable to commercial solutions through its unique Council of Experts architecture.
Q: Can I use this without a GPU?
A: While technically possible with CPU-only inference, performance will be significantly degraded. We strongly recommend at least one NVIDIA GPU with 24GB+ VRAM.
Q: How many concurrent users can the system support?
A: With the recommended hardware setup (4x A100 GPUs), the system can handle 50-100 concurrent users, depending on query complexity.
Q: How do I add support for a new language model?
A: Update the ModelRegistry in model_loader.py, add to expert groups in council_of_experts.py, download the model, and test integration.
Q: Can I use this with Docker/Kubernetes?
A: While the current implementation is bare-metal focused, containerization is possible. See the development roadmap for container support.
Q: How do I optimize for faster responses?
A: Enable fast mode (Gemma 3), reduce the number of experts, use model caching, and ensure GPU memory is available.
Q: Why am I getting "CUDA out of memory" errors?
A: This typically occurs when trying to load too many models simultaneously. Solutions:
- Use quantization (4-bit or 8-bit)
- Reduce MAX_CACHED_MODELS
- Use smaller models
- Add more GPU memory
Q: The system is slow, how can I improve performance?
A: Check:
- GPU utilization (nvidia-smi)
- Model quantization settings
- Cache hit rates
- Network latency
Q: How do I reset a forgotten admin password?
A: Run: python scripts/reset_admin_password.py
Q: Is my data really private?
A: Yes. All processing happens on your servers, no external APIs are called, and all data remains within your infrastructure.
Q: How are API keys secured?
A: API keys are hashed using PBKDF2 with a salt before storage. Keys are never stored in plaintext.
Q: Can I integrate with my company's SSO?
A: Yes, the system supports SAML and OAuth integration. See the enterprise integration guide.
| Term | Definition |
|---|---|
| Council of Experts | Multi-model architecture where multiple AI models collaborate |
| RAG | Retrieval-Augmented Generation - enhancing AI with retrieved context |
| Quantization | Reducing model precision to decrease memory usage |
| LLM | Large Language Model |
| API Key | Authentication credential for API access |
| Fast Mode | Quick response mode using Gemma 3 models |
| Model Registry | Central configuration for all available models |
| Model | Strengths | Weaknesses | Best For |
|---|---|---|---|
| Gemma 3 2B | Ultra-fast, low memory | Lower quality | Quick responses, simple queries |
| Gemma 3 27B | Balanced, versatile | Moderate memory | General purpose |
| Llama 3.1 70B | Excellent reasoning | High memory | Complex analysis |
| Codestral 22B | Code generation | Limited to coding | Programming tasks |
| Mixtral 8x22B | Multi-domain expert | Very high memory | Comprehensive analysis |
| Code | Meaning | Example Response |
|---|---|---|
| 200 | Success | {"success": true, "data": {...}} |
| 400 | Bad Request | {"error": "Invalid input"} |
| 401 | Unauthorized | {"error": "Invalid API key"} |
| 403 | Forbidden | {"error": "Insufficient permissions"} |
| 404 | Not Found | {"error": "Resource not found"} |
| 429 | Rate Limited | {"error": "Rate limit exceeded"} |
| 500 | Server Error | {"error": "Internal server error"} |
APP_NAME="Sovereign AI Suite"
DEBUG=false
API_HOST=0.0.0.0
API_PORT=8000
SECRET_KEY=your-production-secret-key
MONGODB_URL=mongodb://username:password@localhost:27017
DATABASE_NAME=sovereign_ai
MAX_CACHED_MODELS=5
ENABLE_MONITORING=true

server {
listen 443 ssl http2;
server_name ai.company.com;
ssl_certificate /etc/ssl/certs/ai.company.com.crt;
ssl_certificate_key /etc/ssl/private/ai.company.com.key;
location / {
proxy_pass http://localhost:8501;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
location /api {
proxy_pass http://localhost:8000;
proxy_set_header X-Real-IP $remote_addr;
}
}

| Metric | Formula | Target |
|---|---|---|
| Response Time | P95 latency | < 5 seconds |
| Throughput | Requests/second | > 10 RPS |
| Error Rate | Errors/Total Requests | < 1% |
| GPU Utilization | Average GPU usage | 60-80% |
| Cache Hit Rate | Cache Hits/Total Requests | > 30% |
| Model Load Time | Time to load model | < 30 seconds |
- Documentation: https://docs.sovereign-ai.com
- GitHub: https://github.com/your-org/sovereign-ai-suite
- Issues: https://github.com/your-org/sovereign-ai-suite/issues
- Discord: https://discord.gg/sovereign-ai
- Forum: https://forum.sovereign-ai.com
- Stack Overflow: Tag sovereign-ai
- Email: enterprise@sovereign-ai.com
- Phone: +1-xxx-xxx-xxxx
- SLA: 24/7 for enterprise customers
Copyright © 2024 Your Organization. All rights reserved.
This software is proprietary and confidential. Unauthorized copying, distribution, or use is strictly prohibited.
Special thanks to:
- The Hugging Face team for the transformers library
- The LangChain community
- Google for Gemma models
- Meta for Llama models
- All open-source contributors
Document Version: 1.0.0
Last Updated: January 2024
Next Review: April 2024
End of Documentation