Fastest open-source tool for replicating Databases to Apache Iceberg or Data Lakehouse. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB and MySQL

datazip-inc/olake

OLake

The fastest open-source tool for replicating databases to Apache Iceberg or a data lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics, starting with MongoDB. Visit olake.io/docs for the full documentation and benchmarks.

The OLake connector ecosystem focuses on these key points:

  • Integrated writers that push records directly into destinations, so reading is not blocked
  • Connector Autonomy
  • Avoid operations that don't contribute to increasing record throughput
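
One way to picture the first point is a reader that hands records to a writer thread through a bounded queue, so reading continues while batches are flushed to the destination. This is only an illustrative sketch, not OLake's implementation; `run_pipeline`, `batch_size`, `max_pending`, and the in-memory `sink` list are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(records, sink, batch_size=2, max_pending=4):
    """Read `records` while a writer thread flushes batches into `sink`."""
    pending = queue.Queue(maxsize=max_pending)  # bounded, so it applies backpressure

    def writer():
        batch = []
        while True:
            item = pending.get()
            if item is _DONE:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                sink.append(list(batch))  # stand-in for a write to the destination
                batch.clear()
        if batch:
            sink.append(list(batch))  # flush the final partial batch

    thread = threading.Thread(target=writer)
    thread.start()
    for record in records:  # the reader blocks only when the queue is full
        pending.put(record)
    pending.put(_DONE)
    thread.join()
    return sum(len(batch) for batch in sink)
```

Calling `run_pipeline(range(5), sink)` with an empty list as `sink` writes the five records in batches while the loop keeps reading.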

Getting Started with OLake

Source / Connectors

  1. Getting started Postgres -> Writers | Postgres Docs
  2. Getting started MongoDB -> Writers | MongoDB Docs
  3. Getting started MySQL -> Writers | MySQL Docs

Writers / Destination

  1. Apache Iceberg Docs
  2. AWS S3 Docs
  3. Local FileSystem Docs

Source/Connector Functionalities

Functionality MongoDB Postgres MySQL
Full Refresh Sync Mode
Incremental Sync Mode
CDC Sync Mode
Full Parallel Processing
CDC Parallel Processing
Resumable Full Load
CDC Heart Beat

Additional planned sources: AWS S3 | Kafka

Writer Functionalities

Functionality Local Filesystem AWS S3 Apache Iceberg
Flattening & Normalization (L1)
Partitioning
Schema Changes
Schema Evolution

Supported Catalogs For Iceberg Writer

Catalog | Status
Glue Catalog | WIP
Hive Metastore | Upcoming
JDBC Catalog | Upcoming
REST Catalog - Nessie | Upcoming
REST Catalog - Polaris | Upcoming
REST Catalog - Unity | Upcoming
REST Catalog - Gravitino | Upcoming
Azure Purview | Not planned, submit a request
BigLake Metastore | Not planned, submit a request

Core

Core, or the framework, is the logic that has been abstracted out of the connectors to keep them DRY. It includes the base CLI commands, state logic, validation logic, type detection for unstructured data, handling of the Config, State, Catalog, and Writer config files, logging, and so on.
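
As an illustration of what type detection for unstructured data can involve, the sketch below infers one type per field across a set of records and widens the type when records disagree. It is a minimal sketch under stated assumptions, not OLake's actual logic; the function names, the type names, and the widening rules (int plus float becomes float, any other conflict becomes string) are all assumptions.

```python
def _type_name(value):
    """Map a Python value to a simple type name (hypothetical names)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # bool is a subclass of int, so check it first
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    return "json"  # nested dicts, lists, and anything else


def _merge(a, b):
    """Widen two observed type names into one (assumed rules)."""
    if a == b:
        return a
    if "null" in (a, b):
        return b if a == "null" else a
    if {a, b} == {"int", "float"}:
        return "float"
    return "string"  # fall back to string on any other conflict


def detect_types(records):
    """Infer one type name per field from a list of dict records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            seen = _type_name(value)
            current = schema.get(field)
            schema[field] = seen if current is None else _merge(current, seen)
    return schema
```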

Core includes an HTTP server that exposes live stats about a running sync, such as:

  • Estimated finish time
  • Concurrently running processes
  • Live record count
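
A rough idea of how such stats could be derived is sketched below: the remaining time is estimated from the average throughput so far. This is a sketch under assumptions, not the server's actual code; it assumes the total record count is known up front, and `estimate_finish`, `stats_snapshot`, and the payload field names are hypothetical.

```python
def estimate_finish(synced, total, elapsed_seconds):
    """Return the estimated seconds remaining, or None if no rate is known yet."""
    if synced <= 0 or elapsed_seconds <= 0:
        return None
    rate = synced / elapsed_seconds  # average records per second so far
    return max(total - synced, 0) / rate


def stats_snapshot(synced, total, elapsed_seconds, running_workers):
    """Build a payload that a stats endpoint could serialize to JSON."""
    return {
        "estimated_seconds_remaining": estimate_finish(synced, total, elapsed_seconds),
        "running_workers": running_workers,
        "synced_records": synced,
    }
```

An estimate based on average throughput is simple but drifts when the rate changes during a sync; a moving average over recent intervals is a common alternative.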

Core handles the following commands for interacting with a driver:

  • spec command: returns a renderable JSON Schema that rjsf libraries can consume in a frontend
  • check command: performs all necessary checks on the Config, Catalog, State, and Writer config
  • discover command: returns all streams and their schemas
  • sync command: extracts data from the source and writes it to the destinations
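
The four commands above can be pictured as subcommands of a single entry point. The sketch below shows that shape with Python's `argparse`; it is not OLake's CLI, and the program name `connector`, the absence of flags, and the `dispatch` helper are assumptions made for illustration.

```python
import argparse

# Command names come from the list above; the descriptions paraphrase it.
COMMANDS = {
    "spec": "return a JSON Schema describing the configuration",
    "check": "validate the Config, Catalog, State, and Writer config",
    "discover": "return all streams and their schemas",
    "sync": "extract data from the source and write it to the destinations",
}


def build_parser():
    """Create a parser with one subcommand per driver command."""
    parser = argparse.ArgumentParser(prog="connector")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)
    for name, help_text in COMMANDS.items():
        sub.add_parser(name, help=help_text)
    return parser


def dispatch(argv):
    """Parse `argv` and report which command would run."""
    args = build_parser().parse_args(argv)
    return f"{args.command}: {COMMANDS[args.command]}"
```

A real driver would attach a handler function to each subcommand rather than return a description string.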

Find more about how OLake works here.

Roadmap

Check out the GitHub Project Roadmap and the Upcoming OLake Roadmap to track and influence what we build. If you have ideas, questions, or feedback, please share them on our GitHub Discussions or raise an issue.

Contributing

We ❤️ contributions, big or small. Check out our Bounty Program. As always, thanks to our amazing contributors!
