
ArcaP — Open Data Platform

A self-hosted data lakehouse with integrated AI capabilities: local LLM inference, RAG-based data querying, and metadata governance via OpenMetadata.


Created by: Mustapha Fonsau | GitHub

Open Data Platform · github.com/Monsau/ArcaP · LinkedIn

📖 Main documentation in English. Translations available in 17 additional languages below.


🌍 Available Languages

🇬🇧 English (You are here) | 🇫🇷 Français | 🇪🇸 Español | 🇵🇹 Português | 🇨🇳 中文 | 🇯🇵 日本語 | 🇷🇺 Русский | 🇸🇦 العربية | 🇩🇪 Deutsch | 🇰🇷 한국어 | 🇮🇳 हिन्दी | 🇮🇩 Indonesia | 🇹🇷 Türkçe | 🇻🇳 Tiếng Việt | 🇮🇹 Italiano | 🇳🇱 Nederlands | 🇵🇱 Polski | 🇸🇪 Svenska


Overview

Self-hosted data lakehouse combining ingestion, transformation, BI, and a local AI stack under one Docker Compose setup. All computation stays on-premise — no cloud APIs, no usage costs.

```mermaid
graph TB
    A[Data Sources] --> B[Airbyte]
    B --> C[Dremio Lakehouse]
    C --> D[dbt Transformations]
    D --> E[Superset BI]
    E --> F[Business Insights]

    C --> G[OpenMetadata]
    G --> H[Qdrant Vector DB]
    D --> H
    H --> I[RAG System]
    J[Ollama LLM] --> I
    I --> K[AI Chat UI]

    style B fill:#615EFF,color:#fff,stroke:#333,stroke-width:2px
    style C fill:#f5f5f5,stroke:#333,stroke-width:2px
    style D fill:#e8e8e8,stroke:#333,stroke-width:2px
    style E fill:#d8d8d8,stroke:#333,stroke-width:2px
    style G fill:#00C4CC,color:#fff,stroke:#333,stroke-width:2px
    style H fill:#6C63FF,color:#fff,stroke:#333,stroke-width:2px
    style J fill:#4ECDC4,color:#fff,stroke:#333,stroke-width:2px
    style I fill:#95E1D3,stroke:#333,stroke-width:2px
    style K fill:#AA96DA,color:#fff,stroke:#333,stroke-width:2px
```

Key Features

Data Platform:

  • Airbyte 2.0.0 for data ingestion (300+ connectors)
  • Dremio 26.0 data lakehouse with Apache Polaris Iceberg catalog
  • dbt 1.10+ for SQL transformations and lineage
  • Apache Superset 4.1.2 for dashboards and BI
  • 21 automated data quality tests
  • Arrow Flight for real-time Dremio ↔ PostgreSQL sync
  • Documentation in 18 languages

AI and Governance:

  • Local LLM inference via Ollama (Llama 3.1, Mistral, Phi3)
  • Vector search with Qdrant v1.17.1 (cosine similarity, 384-dim embeddings)
  • OpenMetadata 1.12.4 as governance Source of Truth
  • Governance-first RAG: metadata catalog queried before operational data
  • Document ingestion: PDF, Word, Excel, CSV, JSON, TXT, Markdown
  • Documents archived to MinIO before vector processing
  • Scheduled ingestion from PostgreSQL and Dremio
  • Fully on-premise — no cloud API calls, no usage fees

Quick Start

Prerequisites

  • Docker 20.10+ and Docker Compose 2.0+
  • Python 3.11 or higher
  • Minimum 8 GB RAM (16 GB recommended for AI services)
  • 30 GB available disk space (includes LLM models)
  • Optional: NVIDIA GPU for faster LLM inference

One-Command Deployment

Use the orchestrate_platform.py script for automatic setup:

# Full deployment (Data Platform + AI Services)
python orchestrate_platform.py

# Windows PowerShell
$env:PYTHONIOENCODING="utf-8"
python -u orchestrate_platform.py

# Skip AI services if not needed
python orchestrate_platform.py --skip-ai

# Skip infrastructure (if already running)
python orchestrate_platform.py --skip-infrastructure

What it does:

  • Validates prerequisites (Docker, Docker Compose, Python 3.11+)
  • Starts all Docker services
  • Deploys the AI stack (Ollama, Qdrant, RAG API, Embedding Service, Chat UI)
  • Configures Airbyte, Dremio, dbt
  • Runs dbt transformations and quality tests
  • Creates Superset dashboards
  • Prints a deployment summary with all service URLs

Manual Installation

# Clone repository
git clone https://github.com/Monsau/ArcaP.git
cd ArcaP

# Install dependencies
pip install -r requirements.txt

# Start infrastructure (Data Platform + AI Services)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml -f docker-compose-ai.yml up -d

# Or just data platform (no AI)
docker-compose -f docker-compose.yml -f docker-compose-airbyte-stable.yml up -d

# Or use make commands
make up

# Verify installation
make status

# Run quality tests
make dbt-test

Access Services

Data Platform:

| Service | URL | Credentials |
|---|---|---|
| Airbyte | http://localhost:8000 | airbyte / password |
| Dremio | http://localhost:9047 | admin / admin123 |
| Superset | http://localhost:8088 | admin / admin |
| MinIO Console | http://localhost:9001 | minioadmin / minioadmin123 |
| PostgreSQL | localhost:5432 | postgres / postgres123 |

AI Services:

| Service | URL | Description |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Natural language interface for data queries |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Qdrant UI | http://localhost:6333/dashboard | Vector database UI |
| Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |
| OpenMetadata | http://localhost:8585 | Metadata governance catalog |
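A quick way to confirm everything is reachable is to probe these URLs before diving in. The sketch below is a convenience script, not part of the repository; it only checks that each base URL answers at all:

```python
# Readiness probe for the service endpoints listed above (stdlib only).
import urllib.error
import urllib.request

SERVICES = {
    "Airbyte": "http://localhost:8000",
    "Dremio": "http://localhost:9047",
    "Superset": "http://localhost:8088",
    "AI Chat UI": "http://localhost:8501",
    "RAG API": "http://localhost:8002/docs",
    "Qdrant": "http://localhost:6333",
    "Ollama": "http://localhost:11434",
    "OpenMetadata": "http://localhost:8585",
}

def is_up(url: str, timeout: float = 5.0) -> bool:
    """True if the service answers at all -- any HTTP status counts."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, even if with 4xx/5xx
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    for name, url in SERVICES.items():
        print(f"{'UP  ' if is_up(url) else 'DOWN'} {name:<14} {url}")
```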

Architecture

System Components

Data Platform

| Component | Version | Port | Description |
|---|---|---|---|
| Airbyte | 2.0.0 | 8000 | Data integration platform (300+ connectors) |
| Dremio | 26.0 | 9047, 32010 | Data lakehouse engine |
| dbt | 1.10+ | n/a | SQL transformations and data lineage |
| Superset | 4.1.2 | 8088 | Business intelligence and dashboards |
| PostgreSQL | 16 | 5432 | Transactional database |
| MinIO | latest | 9000, 9001 | S3-compatible object storage |
| Elasticsearch | 8.11.4 | 9200 | Search engine (used by OpenMetadata) |
| MySQL | 8.4 | 3307 | OpenMetadata backend database |
| Airflow | 3.0.0 | 8080 | Workflow orchestration |

AI Services

| Component | Version | Port | Description |
|---|---|---|---|
| Ollama | latest | 11434 | Local LLM server (Llama 3.1, Mistral, Phi3) |
| Qdrant | 1.17.1 | 6333, 6334 | Vector database (REST + gRPC) |
| OpenMetadata | 1.12.4 | 8585 | Metadata governance catalog |
| RAG API | n/a | 8002 | FastAPI RAG orchestration service |
| Embedding Service | n/a | 8001 | all-MiniLM-L6-v2 text embeddings |
| AI Chat UI | n/a | 8501 | Streamlit natural language query interface |

Architecture Diagrams


Metadata Governance — OpenMetadata

OpenMetadata 1.12.4 is the governance backbone of ArcaP. It sits between the data lakehouse and the AI layer: it catalogues every asset produced in the platform, builds end-to-end lineage, and feeds the RAG system with structured knowledge.

Role in the Platform

Airbyte ──▶ Dremio ──▶ dbt ──▶ Superset
               │          │
               ▼          ▼
           OpenMetadata (catalog + lineage)
               │
               ▼
        Qdrant om_knowledge collection
               │
               ▼
         RAG System (governance-first)

The RAG pipeline queries om_knowledge before data_platform_knowledge. Every answer from the AI Chat UI is grounded in the governed metadata catalog — table descriptions, column definitions, data quality results, and ownership — not raw data alone.

Configured Connectors

| Connector | Ingestion | Capabilities |
|---|---|---|
| Dremio | Scheduled daily at 02:00 | Tables, views, VDS lineage, column profiling |
| dbt | After each dbt run | Model lineage, test results, exposures |
| PostgreSQL | Scheduled daily | Tables, column stats, data quality |
| Airflow | Pipeline events | DAG lineage, task-level traceability |

What Gets Catalogued

  • Tables & Views — every Dremio space, PostgreSQL schema, and dbt model
  • Column-level Lineage — traces a BI metric back to its raw source across Airbyte → Dremio → dbt → Superset
  • Data Quality Results — dbt test outcomes surfaced as quality badges in the catalog
  • Ownership & Tags — data stewards assigned per domain; PII / sensitive columns tagged automatically
  • Business Glossary — shared term definitions linked to physical columns
  • Descriptions — auto-populated from dbt description fields and enriched collaboratively
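The catalogue is also reachable programmatically. A minimal sketch against the OpenMetadata REST API (the JWT below is a placeholder you must generate from a bot account; the `fields` value shown is one common option, not the only one):

```python
# List catalogued tables through the OpenMetadata REST API.
# JWT is a placeholder -- create a real token under an OpenMetadata bot.
import json
import urllib.request

OM_API = "http://localhost:8585/api/v1"
JWT = "<your-openmetadata-jwt>"

def list_tables(limit: int = 10) -> list[dict]:
    """Fetch the first `limit` catalogued tables with descriptions."""
    req = urllib.request.Request(
        f"{OM_API}/tables?limit={limit}&fields=description",
        headers={"Authorization": f"Bearer {JWT}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["data"]

if __name__ == "__main__":
    for table in list_tables():
        desc = table.get("description") or "(no description yet)"
        print(f"{table['fullyQualifiedName']}: {desc}")
```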

Access

| Interface | URL | Credentials |
|---|---|---|
| Catalog UI | http://localhost:8585 | admin / admin |
| REST API | http://localhost:8585/api/v1 | JWT Bearer token |
| Swagger | http://localhost:8585/swagger-ui | |
| Health | http://localhost:8585/api/v1/health | |

Backend Services

OpenMetadata requires two backing services that run alongside it:

| Service | Port | Purpose |
|---|---|---|
| MySQL 8.4 | 3307 | Metadata storage |
| Elasticsearch 8.11.4 | 9200 | Full-text search index |

Governance-First RAG

The AI ingestion pipeline runs on a schedule and feeds OpenMetadata context into Qdrant:

# Trigger a manual metadata sync to Qdrant
python scripts/auto-sync-dremio-openmetadata.py

The om_knowledge Qdrant collection stores:

  • Table and column descriptions from the catalog
  • Data lineage summaries
  • Data quality test results
  • Business glossary definitions
  • Dataset ownership and stewardship metadata

Answers produced by the AI Chat UI always cite which catalog entries contributed context.
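To confirm a sync actually landed, Qdrant's collection-info endpoint reports the point count. A convenience sketch, not a repository script:

```python
# Ask Qdrant how many vectors om_knowledge holds
# (GET /collections/{name} is Qdrant's collection-info endpoint).
import json
import urllib.request

def info_url(collection: str) -> str:
    """Build the collection-info URL for the local Qdrant instance."""
    return f"http://localhost:6333/collections/{collection}"

def point_count(collection: str) -> int:
    with urllib.request.urlopen(info_url(collection), timeout=5) as resp:
        return json.load(resp)["result"]["points_count"]

if __name__ == "__main__":
    print(f"om_knowledge holds {point_count('om_knowledge')} vectors")
```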

Full Documentation


Multilingual Support

This project provides complete documentation in 18 languages, covering 5.2B+ speakers (roughly two-thirds of the global population):

| Language | Documentation | Data Generation | Native Speakers |
|---|---|---|---|
| 🇬🇧 English | README.md | --language en | 1.5B |
| 🇫🇷 Français | docs/i18n/fr/ | --language fr | 280M |
| 🇪🇸 Español | docs/i18n/es/ | --language es | 559M |
| 🇵🇹 Português | docs/i18n/pt/ | --language pt | 264M |
| 🇸🇦 العربية | docs/i18n/ar/ | --language ar | 422M |
| 🇨🇳 中文 | docs/i18n/cn/ | --language cn | 1.3B |
| 🇯🇵 日本語 | docs/i18n/jp/ | --language jp | 125M |
| 🇷🇺 Русский | docs/i18n/ru/ | --language ru | 258M |
| 🇩🇪 Deutsch | docs/i18n/de/ | --language de | 134M |
| 🇰🇷 한국어 | docs/i18n/ko/ | --language ko | 81M |
| 🇮🇳 हिन्दी | docs/i18n/hi/ | --language hi | 602M |
| 🇮🇩 Indonesia | docs/i18n/id/ | --language id | 199M |
| 🇹🇷 Türkçe | docs/i18n/tr/ | --language tr | 88M |
| 🇻🇳 Tiếng Việt | docs/i18n/vi/ | --language vi | 85M |
| 🇮🇹 Italiano | docs/i18n/it/ | --language it | 85M |
| 🇳🇱 Nederlands | docs/i18n/nl/ | --language nl | 25M |
| 🇵🇱 Polski | docs/i18n/pl/ | --language pl | 45M |
| 🇸🇪 Svenska | docs/i18n/se/ | --language se | 13M |

Generate Multilingual Test Data

# Generate French customer data (CSV format)
python config/i18n/data_generator.py --language fr --records 1000 --format csv

# Generate Spanish product data (JSON format)
python config/i18n/data_generator.py --language es --records 500 --format json

# Generate Chinese user data (Parquet format)
python config/i18n/data_generator.py --language cn --records 2000 --format parquet

Configuration: config/i18n/config.json


🤖 AI-Powered Data Insights

The platform includes a complete AI/LLM stack for natural language data querying and insights.

Quick Start with AI

  1. Deploy the platform (AI services included):

    python orchestrate_platform.py

  2. Access the AI Chat Interface at http://localhost:8501.

  3. Ingest your data (via the sidebar):

    Option 1: Upload documents
    - Click "Choose files to upload"
    - Select PDF, Word, Excel, CSV, or other files
    - Add optional tags/source
    - Click "🚀 Upload & Ingest Documents"

    Option 2: From a database
    - Table: customers
    - Text column: description
    - Metadata: customer_id,name,segment
    - Click "Ingest PostgreSQL"
    
  4. Ask Questions (examples):

    • "What are the key trends in our sales data?"
    • "Show me customer segments with highest revenue"
    • "Are there any data quality issues in the orders table?"
    • "Generate a SQL query to find recent high-value customers"
    • "Explain the ETL pipeline for product data"

AI Architecture

User Question → Chat UI → RAG API → Query Embedding
                                  ↓
                          Vector Search (Qdrant)
                                  ↓
                          Retrieve Context Documents
                                  ↓
                          Build Prompt with Context
                                  ↓
                          Local LLM (Ollama/Llama 3.1)
                                  ↓
                          AI-Generated Answer + Sources

AI Services Available

| Service | URL | Purpose |
|---|---|---|
| AI Chat UI | http://localhost:8501 | Interactive Q&A interface |
| RAG API | http://localhost:8002 | REST API for AI queries |
| RAG API Docs | http://localhost:8002/docs | Interactive API documentation |
| Ollama LLM | http://localhost:11434 | Local LLM server (Llama 3.1) |
| Qdrant Vector DB | localhost:6333 | Semantic search database |
| Embedding Service | http://localhost:8001 | Text-to-vector conversion |

Programmatic Access

Python Example:

import httpx

# Ask a question (LLM generation can take a while, so raise the
# default 5-second timeout)
response = httpx.post(
    "http://localhost:8002/query",
    json={
        "question": "What are our top products?",
        "top_k": 5,
        "model": "llama3.1"
    },
    timeout=60.0,
)

result = response.json()
print(f"Answer: {result['answer']}")
print(f"Sources: {len(result['sources'])} documents")

cURL Example:

curl -X POST http://localhost:8002/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What trends do you see in customer data?",
    "top_k": 5,
    "model": "llama3.1",
    "temperature": 0.7
  }'

Download Additional LLM Models

# Mistral (faster, good for coding)
docker exec ollama ollama pull mistral

# Phi3 (lightweight, quick responses)
docker exec ollama ollama pull phi3

# CodeLlama (code generation)
docker exec ollama ollama pull codellama

# List available models
docker exec ollama ollama list

AI Features

  • All computation is local: no cloud API calls, no data leaves your infrastructure
  • Qdrant vector DB with cosine similarity search across 384-dim embeddings
  • Dual-collection RAG: om_knowledge (OpenMetadata governance) and data_platform_knowledge (operational)
  • Governance-first mode: metadata context prioritized over raw data context
  • Supported models: Llama 3.1, Mistral, Phi3, CodeLlama
  • Scheduled ingestion from PostgreSQL tables and Dremio spaces
  • Source attribution in every answer: shows which documents contributed to the response

Comprehensive Guide

For detailed AI services documentation, see:


Documentation

For Different Roles

Data Engineers

Data Analysts

Developers

DevOps


Common Commands

# Infrastructure Management
make up              # Start all services
make down            # Stop all services
make restart         # Restart services
make status          # Check service status
make logs            # View service logs

# Data Transformation (dbt)
make dbt-run         # Run transformations
make dbt-test        # Run quality tests
make dbt-docs        # Generate documentation
make dbt-clean       # Clean artifacts

# Data Synchronization
make sync            # Manual sync Dremio to PostgreSQL
make sync-auto       # Auto sync every 5 minutes

# Testing & Quality
make test            # Run all tests
make lint            # Code quality checks
make format          # Format code

# Deployment
make deploy          # Complete deployment
make deploy-quick    # Quick deployment

Project Status

Services: 9/9 operational (includes Airbyte)
dbt Tests: 21/21 passing
Dashboards: 3 active
Languages: 18 supported (5.2B+ people coverage)
Documentation: Complete in 18 languages
Status: Production Ready - v1.0

Project Structure

data-platform-iso-opensource/
├── README.md                       # This file
├── AUTHORS.md                      # Project creators and contributors
├── CHANGELOG.md                    # Version history
├── CONTRIBUTING.md                 # Contribution guidelines
├── CODE_OF_CONDUCT.md              # Community guidelines
├── SECURITY.md                     # Security policies
├── LICENSE                         # MIT License
│
├── docs/                           # Documentation
│   ├── i18n/                       # Multilingual docs (18 languages)
│   │   ├── fr/, es/, pt/, cn/, jp/, ru/, ar/
│   │   ├── de/, ko/, hi/, id/, tr/, vi/
│   │   └── it/, nl/, pl/, se/
│   └── diagrams/                   # Mermaid diagrams (248+)
│
├── config/                         # Configuration
│   └── i18n/                       # Internationalization
│       ├── config.json
│       └── data_generator.py
│
├── dbt/                            # Data transformations
│   ├── models/                     # SQL models
│   ├── tests/                      # Quality tests
│   └── dbt_project.yml
│
├── reports/                        # Documentation reports
│   ├── phase1/                     # Integration reports
│   ├── phase2/                     # Data cleaning reports
│   ├── phase3/                     # Quality testing reports
│   ├── superset/                   # Dashboard guides
│   └── integration/                # Integration guides
│
├── scripts/                        # Automation scripts
│   ├── orchestrate_platform.py
│   ├── sync_dremio_realtime.py
│   └── populate_superset.py
│
└── docker-compose.yml              # Infrastructure definition

Contributing

We welcome contributions from the community. Please see:

Adding a New Language

  1. Add language configuration to config/i18n/config.json
  2. Create documentation directory: docs/i18n/[language-code]/
  3. Translate README and guides
  4. Update main README language table
  5. Submit pull request

License

This project is licensed under the MIT License. See LICENSE file for details.


Acknowledgments

Supported by Talentys | LinkedIn - Data Engineering and Analytics Excellence

Built with enterprise-grade open-source technologies:

Data Platform:

AI Services:


📧 Contact

Author: Mustapha Fonsau

Support

For technical assistance:


Made by Mustapha Fonsau | Supported by Talentys | LinkedIn