This repository supports the Medium article I authored on this topic. The artifacts here are for syncing Snowflake tables to Kafka using Kafka Connect.
All steps are detailed [here](https://pandeysudhendu.medium.com/streaming-data-from-snowflake-to-kafka-ed76ce0400c2).
In one of my previous blogs, I covered step-by-step how to move data from Kafka to Snowflake.
But how about the reverse pattern: moving data from Snowflake to Kafka? There are a number of use cases for this, but the overarching one is propagating Snowflake table (change) data to downstream systems. By its nature, Apache Kafka is purpose-built for a performant pub-sub architecture. This means if you have data in Snowflake that needs to be *streamed* to multiple sinks, Kafka can do this routing for you.
Surprisingly, there are not many ‘how to’ guides for this pattern. The few I stumbled upon (1, 2) either involved a cloud storage hop or an EL tool. In fact, I found a StackOverflow question from 2020 with no answer yet!
We want to create a flow that streams changes in a Snowflake table to a Kafka Topic. This Kafka topic can then be consumed by downstream applications.
The guide below is a step-by-step walkthrough of how you can achieve this. It assumes working knowledge of Snowflake and Kafka, and concentrates on how to achieve the pattern. This means we will skip some concepts that are not core to this tutorial (Docker, WSL2, etc.), but you can easily google them. The objective is to get this setup up and running!
- Kafka Source: a system that pushes data to Kafka (or that Kafka pulls data from).
- Kafka Sink: the target system that Kafka feeds data into.
- Kafka Connect: a free, open-source component of Apache Kafka® that works as a centralized data hub for simple data integration between Kafka and external systems such as databases.
- Kafka Topic: a log of events. It can have multiple publishers and multiple subscribers.
- Create and set up a Snowflake instance.
- Create a Kafka environment (this involves setting up Kafka, ZooKeeper, Kafka Connect, etc.).
- Set up the Confluent Kafka Connect JDBC source connector.
- Install the Snowflake JDBC driver.
- Run the Kafka connector and do a sanity check.
- Final step: see if change data is getting captured!
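As a sketch of where the steps above lead, a JDBC source connector configuration for this setup could look like the following. The connector class and property names are from the Confluent JDBC source connector; the account URL, credentials, table, and column names are placeholders for illustration:

```json
{
  "name": "snowflake-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:snowflake://<account>.snowflakecomputing.com/?warehouse=<WAREHOUSE>&db=<DATABASE>&schema=<SCHEMA>",
    "connection.user": "<user>",
    "connection.password": "<password>",
    "table.whitelist": "MY_TABLE",
    "mode": "incrementing",
    "incrementing.column.name": "ID",
    "topic.prefix": "snowflake-",
    "poll.interval.ms": "5000"
  }
}
```

With the Snowflake JDBC driver jar on the Kafka Connect plugin classpath, a config like this can be registered via the Connect REST API (by default on port 8083), after which rows from `MY_TABLE` start landing on the `snowflake-MY_TABLE` topic.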
- The Kafka JDBC connector is very well-built, and its properties are worth fiddling with.
- In our example, the mode was set to incrementing, but you can set it to timestamp+incrementing. That way, a watermark column (for example updated_at) is also considered and you will receive updates as well.
- Since this is not log-based CDC, you suffer from all its drawbacks (missing deletes, missing intermediate updates, etc.). For those, we can check other options such as Kafka Connect with Debezium.
- You can write any SQL in the query property. For example, if you don’t want the connector to generate its own query (with id < and timestamp < predicates), you can just supply a complex SQL expression yourself.
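To illustrate the last two notes, here is a sketch of the relevant connector properties (the column names, table name, and topic prefix are hypothetical):

```json
{
  "mode": "timestamp+incrementing",
  "timestamp.column.name": "UPDATED_AT",
  "incrementing.column.name": "ID",
  "query": "SELECT ID, NAME, UPDATED_AT FROM MY_TABLE",
  "topic.prefix": "snowflake-my-table"
}
```

Note that `query` and `table.whitelist` are mutually exclusive, and in incremental modes the connector appends its own filtering clause on the incrementing/timestamp columns to the query you supply.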