A self-hosted data lakehouse with integrated AI capabilities: local LLM inference, RAG-based data querying, and metadata governance via OpenMetadata.
Created by: Mustapha Fonsau | GitHub
Open Data Platform · github.com/Monsau/ArcaP · LinkedIn
📖 Main documentation in English. Translations available in 17 additional languages below.
🇬🇧 English (You are here) | 🇫🇷 Français | 🇪🇸 Español | 🇵🇹 Português | 🇨🇳 中文 | 🇯🇵 日本語 | 🇷🇺 Русский | 🇸🇦 العربية | 🇩🇪 Deutsch | 🇰🇷 한국어 | 🇮🇳 हिन्दी | 🇮🇩 Indonesia | 🇹🇷 Türkçe | 🇻🇳 Tiếng Việt | 🇮🇹 Italiano | 🇳🇱 Nederlands | 🇵🇱 Polski | 🇸🇪 Svenska
Self-hosted data lakehouse combining ingestion, transformation, BI, and a local AI stack under one Docker Compose setup. All computation stays on-premise — no cloud APIs, no usage costs.
graph TB
A[Data Sources] --> B[Airbyte]
B --> C[Dremio Lakehouse]
C --> D[dbt Transformations]
D --> E[Superset BI]
E --> F[Business Insights]
C --> G[OpenMetadata]
G --> H[Qdrant Vector DB]
D --> H
H --> I[RAG System]
J[Ollama LLM] --> I
I --> K[AI Chat UI]
style B fill:#615EFF,color:#fff,stroke:#333,stroke-width:2px
style C fill:#f5f5f5,stroke:#333,stroke-width:2px
style D fill:#e8e8e8,stroke:#333,stroke-width:2px
style E fill:#d8d8d8,stroke:#333,stroke-width:2px
style G fill:#00C4CC,color:#fff,stroke:#333,stroke-width:2px
style H fill:#6C63FF,color:#fff,stroke:#333,stroke-width:2px
style J fill:#4ECDC4,color:#fff,stroke:#333,stroke-width:2px
style I fill:#95E1D3,stroke:#333,stroke-width:2px
style K fill:#AA96DA,color:#fff,stroke:#333,stroke-width:2px
Data Platform:
- Airbyte 2.0.0 for data ingestion (300+ connectors)
- Dremio 26.0 data lakehouse with Apache Polaris Iceberg catalog
- dbt 1.10+ for SQL transformations and lineage
- Apache Superset 4.1.2 for dashboards and BI
- 21 automated data quality tests
- Arrow Flight for real-time Dremio ↔ PostgreSQL sync
- Documentation in 18 languages
AI and Governance:
- Local LLM inference via Ollama (Llama 3.1, Mistral, Phi3)
- Vector search with Qdrant v1.17.1 (cosine similarity, 384-dim embeddings)
- OpenMetadata 1.12.4 as governance Source of Truth
- Governance-first RAG: metadata catalog queried before operational data
- Document ingestion: PDF, Word, Excel, CSV, JSON, TXT, Markdown
- Documents archived to MinIO before vector processing
- Scheduled ingestion from PostgreSQL and Dremio
- Fully on-premise — no cloud API calls, no usage fees
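For reference, the cosine similarity Qdrant uses to rank embeddings can be sketched in a few lines of plain Python. This is an illustrative sketch only — the toy 3-dimensional vectors stand in for real 384-dimensional all-MiniLM-L6-v2 output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity as Qdrant computes it: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim vectors; real embeddings from all-MiniLM-L6-v2 have 384 dims.
query = [1.0, 0.0, 1.0]
doc = [0.5, 0.0, 0.5]
print(round(cosine_similarity(query, doc), 4))  # same direction → 1.0
```

Because cosine similarity compares direction rather than magnitude, a short column description and a long one can still match the same query equally well.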
- Docker 20.10+ and Docker Compose 2.0+
- Python 3.11 or higher
- Minimum 8 GB RAM (16 GB recommended for AI services)
- 30 GB available disk space (includes LLM models)
- Optional: NVIDIA GPU for faster LLM inference
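A quick way to sanity-check the Docker and Compose version prerequisites before deploying — this parsing helper is a minimal sketch, not a script shipped in the repository:

```python
import re
import subprocess

def parse_version(banner: str) -> tuple[int, ...]:
    """Extract the first dotted version number from a CLI banner string."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if not match:
        raise ValueError(f"no version found in: {banner!r}")
    return tuple(int(part) for part in match.groups() if part is not None)

def meets_minimum(cmd: list[str], minimum: tuple[int, ...]) -> bool:
    """Run a version command and compare its reported version to a minimum."""
    banner = subprocess.run(cmd, capture_output=True, text=True).stdout
    return parse_version(banner) >= minimum

if __name__ == "__main__":
    print("Docker >= 20.10:", meets_minimum(["docker", "--version"], (20, 10)))
    print("Compose >= 2.0:", meets_minimum(["docker", "compose", "version"], (2, 0)))
```

Tuple comparison handles the ordering correctly (for example, `(24, 0, 7) >= (20, 10)` is true), so no version-string arithmetic is needed.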
Use the orchestrate_platform.py script for automatic setup:
# Full deployment (Data Platform + AI Services)
python orchestrate_platform.py
# Windows PowerShell
$env:PYTHONIOENCODING="utf-8"
python -u orchestrate_platform.py
# Skip AI services if not needed
python orchestrate_platform.py --skip-ai
# Skip infrastructure (if already running)
python orchestrate_platform.py --skip-infrastructure
What it does:
- Validates prerequisites (Docker, Docker Compose, Python 3.11+)
- Starts all Docker services
- Deploys the AI stack (Ollama, Qdrant, RAG API, Embedding Service, Chat UI)
- Configures Airbyte, Dremio, dbt
- Runs dbt transformations and quality tests
- Creates Superset dashboards
- Prints a deployment summary with all service URLs
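To verify the result of a deployment by hand, you can poll the service endpoints yourself. The ports below are the ones listed in this README; the polling logic is an illustrative sketch, not the orchestrator's actual code:

```python
import urllib.error
import urllib.request

# Ports taken from the service tables in this README.
SERVICES = {
    "Airbyte": "http://localhost:8000",
    "Dremio": "http://localhost:9047",
    "Superset": "http://localhost:8088",
    "AI Chat UI": "http://localhost:8501",
    "RAG API": "http://localhost:8002",
    "Ollama": "http://localhost:11434",
    "OpenMetadata": "http://localhost:8585",
}

def is_up(url: str, timeout: float = 3.0) -> bool:
    """Return True if the service answers at all (any HTTP status counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True  # server responded, just not with 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{name:15} {url:30} {'UP' if is_up(url) else 'DOWN'}")
```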
# Clone repository
git clone https://github.com/Monsau/ArcaP.git
cd ArcaP
# Install dependencies
pip install -r requirements.txt
# Start infrastructure (Data Platform + AI Services)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml -f docker-compose-ai.yml up -d
# Or just data platform (no AI)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml up -d
# Or use make commands
make up
# Verify installation
make status
# Run quality tests
make dbt-test
Data Platform:
| Service | URL | Credentials |
|---|---|---|
| Airbyte | http://localhost:8000 | airbyte / password |
| Dremio | http://localhost:9047 | admin / admin123 |
| Superset | http://localhost:8088 | admin / admin |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin123 |
| PostgreSQL | localhost:5432 | postgres / postgres123 |
AI Services:
| Service | URL | Description |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Natural language interface for data queries |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Qdrant UI | http://localhost:6333/dashboard | Vector database UI |
| Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |
| OpenMetadata | http://localhost:8585 | Metadata governance catalog |
| Component | Version | Port | Description |
|---|---|---|---|
| Airbyte | 2.0.0 | 8000 | Data integration platform (300+ connectors) |
| Dremio | 26.0 | 9047, 32010 | Data lakehouse engine |
| dbt | 1.10+ | — | SQL transformations and data lineage |
| Superset | 4.1.2 | 8088 | Business intelligence and dashboards |
| PostgreSQL | 16 | 5432 | Transactional database |
| MinIO | latest | 9000, 9001 | S3-compatible object storage |
| Elasticsearch | 8.11.4 | 9200 | Search engine (used by OpenMetadata) |
| MySQL | 8.4 | 3307 | OpenMetadata backend database |
| Airflow | 3.0.0 | 8080 | Workflow orchestration |
| Component | Version | Port | Description |
|---|---|---|---|
| Ollama | latest | 11434 | Local LLM server (Llama 3.1, Mistral, Phi3) |
| Qdrant | 1.17.1 | 6333, 6334 | Vector database (REST + gRPC) |
| OpenMetadata | 1.12.4 | 8585 | Metadata governance catalog |
| RAG API | — | 8002 | FastAPI RAG orchestration service |
| Embedding Service | — | 8001 | all-MiniLM-L6-v2 text embeddings |
| AI Chat UI | — | 8501 | Streamlit natural language query interface |
OpenMetadata 1.12.4 is the governance backbone of ArcaP. It sits between the data lakehouse and the AI layer: it catalogues every asset produced in the platform, builds end-to-end lineage, and feeds the RAG system with structured knowledge.
Airbyte ──▶ Dremio ──▶ dbt ──▶ Superset
│ │
▼ ▼
OpenMetadata (catalog + lineage)
│
▼
Qdrant om_knowledge collection
│
▼
RAG System (governance-first)
The RAG pipeline queries om_knowledge before data_platform_knowledge. Every answer from the AI Chat UI is grounded in the governed metadata catalog — table descriptions, column definitions, data quality results, and ownership — not raw data alone.
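The governance-first ordering described above amounts to simple merge logic: retrieve from both collections, then place om_knowledge hits ahead of operational hits regardless of raw score. The collection names are the real ones; the function itself is an illustrative sketch, not the RAG API's actual implementation:

```python
def governance_first(om_hits: list[dict], data_hits: list[dict], top_k: int = 5) -> list[dict]:
    """Merge retrieval results, prioritizing governed metadata context.

    om_hits come from the om_knowledge collection, data_hits from
    data_platform_knowledge; each hit is {"text": ..., "score": ...}.
    """
    ranked = sorted(om_hits, key=lambda h: h["score"], reverse=True)
    ranked += sorted(data_hits, key=lambda h: h["score"], reverse=True)
    return ranked[:top_k]

om = [{"text": "orders.total_amount — gross order value in EUR", "score": 0.81}]
data = [{"text": "raw row sample from the orders table", "score": 0.93}]
context = governance_first(om, data, top_k=2)
print(context[0]["text"])  # the governed description leads despite its lower score
```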
| Connector | Ingestion | Capabilities |
|---|---|---|
| Dremio | Scheduled daily at 02:00 | Tables, views, VDS lineage, column profiling |
| dbt | After each dbt run | Model lineage, test results, exposures |
| PostgreSQL | Scheduled daily | Tables, column stats, data quality |
| Airflow | Pipeline events | DAG lineage, task-level traceability |
- Tables & Views — every Dremio space, PostgreSQL schema, and dbt model
- Column-level Lineage — traces a BI metric back to its raw source across Airbyte → Dremio → dbt → Superset
- Data Quality Results — dbt test outcomes surfaced as quality badges in the catalog
- Ownership & Tags — data stewards assigned per domain; PII / sensitive columns tagged automatically
- Business Glossary — shared term definitions linked to physical columns
- Descriptions — auto-populated from dbt description fields and enriched collaboratively
| Interface | URL | Credentials |
|---|---|---|
| Catalog UI | http://localhost:8585 | admin / admin |
| REST API | http://localhost:8585/api/v1 | JWT Bearer token |
| Swagger | http://localhost:8585/swagger-ui | — |
| Health | http://localhost:8585/api/v1/health | — |
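Programmatic access works like any JWT-protected REST API. A minimal stdlib-only sketch — the token value is a placeholder you must replace, and the table-listing call assumes the standard OpenMetadata `/tables` endpoint:

```python
import json
import urllib.request

BASE = "http://localhost:8585/api/v1"

def auth_headers(jwt_token: str) -> dict[str, str]:
    """Bearer-token headers expected by the OpenMetadata REST API."""
    return {"Authorization": f"Bearer {jwt_token}", "Content-Type": "application/json"}

def get(path: str, token: str) -> dict:
    """GET a catalog resource and decode the JSON response."""
    req = urllib.request.Request(BASE + path, headers=auth_headers(token))
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    token = "REPLACE_WITH_YOUR_JWT"  # issue a token for a bot user in the OpenMetadata UI
    tables = get("/tables?limit=5", token)
    for t in tables.get("data", []):
        print(t["fullyQualifiedName"])
```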
OpenMetadata requires two backing services that run alongside it:
| Service | Port | Purpose |
|---|---|---|
| MySQL 8.4 | 3307 | Metadata storage |
| Elasticsearch 8.11.4 | 9200 | Full-text search index |
The AI ingestion pipeline runs on a schedule and feeds OpenMetadata context into Qdrant:
# Trigger a manual metadata sync to Qdrant
python scripts/auto-sync-dremio-openmetadata.py
The om_knowledge Qdrant collection stores:
- Table and column descriptions from the catalog
- Data lineage summaries
- Data quality test results
- Business glossary definitions
- Dataset ownership and stewardship metadata
Answers produced by the AI Chat UI always cite which catalog entries contributed context.
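Before a catalog entry lands in Qdrant, the ingestion pipeline chunks and embeds it. The chunking step can be sketched as overlapping character windows — the window size and overlap values here are illustrative defaults, not the pipeline's actual settings:

```python
def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows for embedding.

    Overlap keeps a sentence that straddles a boundary retrievable
    from both neighboring chunks.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

description = "orders: one row per customer order, enriched by dbt. " * 30
chunks = chunk_text(description)
print(len(chunks), "chunks; first 40 chars:", chunks[0][:40])
```

Each chunk is then sent to the Embedding Service on port 8001 and stored in the collection alongside its source metadata.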
- OpenMetadata Setup Guide
- Integration Plan
- Deployment Summary
- Verification Checklist
- GenAI Integration
This project provides complete documentation in 18 languages, covering 5.2B+ people (70% of global population):
| Language | Documentation | Data Generation | Native Speakers |
|---|---|---|---|
| 🇬🇧 English | README.md | --language en | 1.5B |
| 🇫🇷 Français | docs/i18n/fr/ | --language fr | 280M |
| 🇪🇸 Español | docs/i18n/es/ | --language es | 559M |
| 🇵🇹 Português | docs/i18n/pt/ | --language pt | 264M |
| 🇸🇦 العربية | docs/i18n/ar/ | --language ar | 422M |
| 🇨🇳 中文 | docs/i18n/cn/ | --language cn | 1.3B |
| 🇯🇵 日本語 | docs/i18n/jp/ | --language jp | 125M |
| 🇷🇺 Русский | docs/i18n/ru/ | --language ru | 258M |
| 🇩🇪 Deutsch | docs/i18n/de/ | --language de | 134M |
| 🇰🇷 한국어 | docs/i18n/ko/ | --language ko | 81M |
| 🇮🇳 हिन्दी | docs/i18n/hi/ | --language hi | 602M |
| 🇮🇩 Indonesia | docs/i18n/id/ | --language id | 199M |
| 🇹🇷 Türkçe | docs/i18n/tr/ | --language tr | 88M |
| 🇻🇳 Tiếng Việt | docs/i18n/vi/ | --language vi | 85M |
| 🇮🇹 Italiano | docs/i18n/it/ | --language it | 85M |
| 🇳🇱 Nederlands | docs/i18n/nl/ | --language nl | 25M |
| 🇵🇱 Polski | docs/i18n/pl/ | --language pl | 45M |
| 🇸🇪 Svenska | docs/i18n/se/ | --language se | 13M |
# Generate French customer data (CSV format)
python config/i18n/data_generator.py --language fr --records 1000 --format csv
# Generate Spanish product data (JSON format)
python config/i18n/data_generator.py --language es --records 500 --format json
# Generate Chinese user data (Parquet format)
python config/i18n/data_generator.py --language cn --records 2000 --format parquet
Configuration: config/i18n/config.json
The platform includes a complete AI/LLM stack for natural language data querying and insights.
1. Deploy Platform (includes AI services):
   python orchestrate_platform.py
2. Access the AI Chat Interface:
   - Open http://localhost:8501
   - Use the sidebar to ingest data from your PostgreSQL or Dremio tables
3. Ingest Your Data (via the sidebar):
   - Option 1 — Upload Documents (NEW!): click "Choose files to upload", select PDF, Word, Excel, CSV, or other files, add optional tags/source, then click "🚀 Upload & Ingest Documents"
   - Option 2 — From a Database: set Table to customers, Text column to description, Metadata to customer_id,name,segment, then click "Ingest PostgreSQL"
4. Ask Questions (examples):
- "What are the key trends in our sales data?"
- "Show me customer segments with highest revenue"
- "Are there any data quality issues in the orders table?"
- "Generate a SQL query to find recent high-value customers"
- "Explain the ETL pipeline for product data"
User Question → Chat UI → RAG API → Query Embedding
↓
Vector Search (Qdrant)
↓
Retrieve Context Documents
↓
Build Prompt with Context
↓
Local LLM (Ollama/Llama 3.1)
↓
AI-Generated Answer + Sources
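The "Build Prompt with Context" step above can be sketched as a plain template function. The wording of the template is illustrative — the RAG API's actual prompt may differ:

```python
def build_prompt(question: str, context_docs: list[str]) -> str:
    """Assemble retrieved context and the user question into one LLM prompt."""
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(context_docs))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What are our top products?",
    ["products table: 1,204 rows, owned by the sales domain",
     "monthly_revenue dbt model aggregates order totals per product"],
)
print(prompt)
```

Numbering the context passages is what lets the final answer cite its sources, as described in the feature list below.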
| Service | URL | Purpose |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Interactive Q&A interface |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
| Qdrant Vector DB | localhost:6333 | Semantic search database |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |
Python Example:
import httpx
# Ask a question
response = httpx.post(
"http://localhost:8002/query",
json={
"question": "What are our top products?",
"top_k": 5,
"model": "llama3.1"
}
)
result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")
cURL Example:
curl -X POST http://localhost:8002/query \
-H "Content-Type: application/json" \
-d '{
"question": "What trends do you see in customer data?",
"top_k": 5,
"model": "llama3.1",
"temperature": 0.7
}'
# Mistral (faster, good for coding)
docker exec ollama ollama pull mistral
# Phi3 (lightweight, quick responses)
docker exec ollama ollama pull phi3
# CodeLlama (code generation)
docker exec ollama ollama pull codellama
# List available models
docker exec ollama ollama list
- All computation is local: no cloud API calls, no data leaves your infrastructure
- Qdrant vector DB with cosine similarity search across 384-dim embeddings
- Dual-collection RAG: om_knowledge (OpenMetadata governance) and data_platform_knowledge (operational)
- Governance-first mode: metadata context prioritized over raw data context
- Supported models: Llama 3.1, Mistral, Phi3, CodeLlama
- Scheduled ingestion from PostgreSQL tables and Dremio spaces
- Source attribution in every answer: shows which documents contributed to the response
For detailed AI services documentation, see:
- AI Services Guide - Complete guide with architecture, configuration, troubleshooting
- Quick Start Guide - Fast AI setup with examples
- Platform Status - All services including AI
Data Engineers
Data Analysts
Developers
DevOps
# Infrastructure Management
make up # Start all services
make down # Stop all services
make restart # Restart services
make status # Check service status
make logs # View service logs
# Data Transformation (dbt)
make dbt-run # Run transformations
make dbt-test # Run quality tests
make dbt-docs # Generate documentation
make dbt-clean # Clean artifacts
# Data Synchronization
make sync # Manual sync Dremio to PostgreSQL
make sync-auto # Auto sync every 5 minutes
# Testing & Quality
make test # Run all tests
make lint # Code quality checks
make format # Format code
# Deployment
make deploy # Complete deployment
make deploy-quick # Quick deployment
Services: 9/9 operational (includes Airbyte)
dbt Tests: 21/21 passing
Dashboards: 3 active
Languages: 18 supported (5.2B+ people coverage)
Documentation: Complete in 18 languages
Status: Production Ready - v1.0
data-platform-iso-opensource/
├── README.md # This file
├── AUTHORS.md # Project creators and contributors
├── CHANGELOG.md # Version history
├── CONTRIBUTING.md # Contribution guidelines
├── CODE_OF_CONDUCT.md # Community guidelines
├── SECURITY.md # Security policies
├── LICENSE # MIT License
│
├── docs/ # Documentation
│ ├── i18n/ # Multilingual docs (18 languages)
│ │ ├── fr/, es/, pt/, cn/, jp/, ru/, ar/
│ │ ├── de/, ko/, hi/, id/, tr/, vi/
│ │ └── it/, nl/, pl/, se/
│ └── diagrams/ # Mermaid diagrams (248+)
│
├── config/ # Configuration
│ └── i18n/ # Internationalization
│ ├── config.json
│ └── data_generator.py
│
├── dbt/ # Data transformations
│ ├── models/ # SQL models
│ ├── tests/ # Quality tests
│ └── dbt_project.yml
│
├── reports/ # Documentation reports
│ ├── phase1/ # Integration reports
│ ├── phase2/ # Data cleaning reports
│ ├── phase3/ # Quality testing reports
│ ├── superset/ # Dashboard guides
│ └── integration/ # Integration guides
│
├── scripts/ # Automation scripts
│ ├── orchestrate_platform.py
│ ├── sync_dremio_realtime.py
│ └── populate_superset.py
│
└── docker-compose.yml # Infrastructure definition
We welcome contributions from the community. Please see:
- Add language configuration to config/i18n/config.json
- Create documentation directory: docs/i18n/[language-code]/
- Translate README and guides
- Update main README language table
- Submit pull request
This project is licensed under the MIT License. See LICENSE file for details.
Supported by Talentys | LinkedIn - Data Engineering and Analytics Excellence
Built with enterprise-grade open-source technologies:
Data Platform:
- Airbyte - Data integration platform (300+ connectors)
- Dremio - Data lakehouse platform
- dbt - Data transformation tool
- Apache Superset - Business intelligence platform
- Apache Arrow - Columnar data format
- PostgreSQL - Relational database
- MinIO - Object storage
- Elasticsearch - Search and analytics
AI Services:
- Ollama - Local LLM server
- Llama 3.1 - Meta's open-source LLM (8B parameters)
- Qdrant - Vector database for semantic search
- sentence-transformers - Text embedding models
- FastAPI - Modern web framework for APIs
- Streamlit - App framework for ML/AI projects
Author: Mustapha Fonsau
- 🏢 Organization: Talentys | LinkedIn
- 💼 LinkedIn: linkedin.com/in/mustapha-fonsau
- 🐙 GitHub: github.com/Monsau
- 📧 Email: mfonsau@talentys.eu
For technical assistance:
- 📚 Documentation: docs/i18n/
- 🐛 Issue Tracker: GitHub Issues
- 💬 Discussions: GitHub Discussions
Made by Mustapha Fonsau | Supported by Talentys | LinkedIn