Transducers are Coming #200

creese · 2019-09-23T02:44:00Z

This has been a long time coming but I think we’re finally here. This proposal is composable with the existing Jackdaw Streams DSL. Just define your transducers and use transduce-kstream:

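(defn transduce-kstream
  "Takes a kstream and xf and transduces the stream."
  [kstream xf]
  (-> kstream
      ;; "transducer" names the state store the transformer reads from.
      (j/transform (fn [] (transformer xf)) ["transducer"])
      ;; Drop the carried key and publish each record held in the value.
      (j/flat-map (fn [[_ v]] v))))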

It turns out that KStream::transform followed by KStream::flatMap is equivalent to transduce with concat. We can use the latter to test our business logic with pure Clojure (no Kafka Streams). This approach was pioneered by Matthias a while ago. The difference is that we're now adding state.
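As a rough illustration of the equivalence, here is a minimal pure-Clojure sketch. The `split-words` transducer and the sample records are hypothetical and are not part of this PR; the only thing taken from the proposal is the idea of transducing with `concat`.

```clojure
(require '[clojure.string :as str])

;; Hypothetical transducer (not from this PR): each [k line] record
;; becomes a collection of [word 1] records, much like the value that
;; a KStream::transform step would hand to KStream::flatMap.
(def split-words
  (map (fn [[_ line]]
         (mapv (fn [word] [word 1])
               (str/split line #"\s+")))))

;; `concat` as the reducing function flattens the per-record collections.
(def word-records
  (transduce split-words concat [[nil "all streams"] [nil "lead"]]))
;; word-records => (["all" 1] ["streams" 1] ["lead" 1])
```

Each input record may yield zero or more output records, which is the behaviour the flat-map step provides on the Kafka Streams side.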

Here is how to test your transducers:

(def coll
  [[nil {:debit-account "tech"
         :credit-account "cash"
         :amount 1000}]
   [nil {:debit-account "cash"
         :credit-account "sales"
         :amount 2000}]])

(->> coll
     (transduce (xf-split-entries nil nil) concat)
     (transduce (xf-running-balances (atom {}) swap!) concat))

The function xf-running-balances takes two arguments, a "store" and a function that "behaves like clojure.core/swap!", and returns a transducer. When developing your transducers, you can use an atom and swap!.
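To make the shape of such a function concrete, here is a sketch of a stand-in for xf-running-balances. It is not the implementation in this PR: the name `running-balances`, the record shape (`:account`, `:amount`), and the assumption that the swap-like function returns the updated map (as `swap!` does) are all for illustration only.

```clojure
;; Hypothetical stand-in for xf-running-balances. `state` is anything
;; the swap-like function understands; during development that is an atom.
(defn running-balances
  [state swap-fn]
  (map (fn [[k {:keys [account amount] :as txn}]]
         ;; Assumes swap-fn returns the updated map, as swap! does.
         (let [balances (swap-fn state (partial merge-with +) {account amount})]
           ;; Emit a collection of records so `concat` can flatten it.
           [[k (assoc txn :balance (get balances account))]]))))

(def balances
  (transduce (running-balances (atom {}) swap!)
             concat
             [[nil {:account "cash" :amount 1000}]
              [nil {:account "cash" :amount -250}]]))
;; (map (comp :balance second) balances) => (1000 750)
```

Because the store and the swap-like function are arguments, the same transducer can later be handed a different pair without changing its body.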

When using your transducers from Kafka Streams, the transducers themselves do not change; you only supply different arguments. The examples show how to provide a state store and a helper function defined in jackdaw.streams.xform. However, if this doesn't work for you, you can write your own.

Here is the topology:

(require '[jackdaw.streams :as j])
(require '[jackdaw.streams.xform :as jxf])

(defn topology-builder
  [{:keys [entry-requested transaction-pending transaction-added] :as topics} xforms]
  (fn [builder]
    (jxf/add-state-store! builder)
    (-> (j/kstream builder entry-requested)
        (jxf/transduce-kstream (::xf-split-entries xforms))
        (j/through transaction-pending)
        (jxf/transduce-kstream (::xf-running-balances xforms))
        (j/to transaction-added))
    builder))

This PR contains examples for Word Count and the Simple Ledger.

codecov · 2019-09-23T02:47:31Z

Codecov Report

❗ No coverage uploaded for pull request base (master@cee3aba).
The diff coverage is 15.62%.

@@            Coverage Diff            @@
##             master     #200   +/-   ##
=========================================
  Coverage          ?   78.18%           
=========================================
  Files             ?       43           
  Lines             ?     2530           
  Branches          ?      151           
=========================================
  Hits              ?     1978           
  Misses            ?      401           
  Partials          ?      151

Impacted Files	Coverage Δ
src/jackdaw/streams/xform.clj	`11.53% <11.53%> (ø)`
src/jackdaw/streams/xform/fakes.clj	`33.33% <33.33%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update cee3aba...9efe7b7.

kidpollo · 2019-09-25T07:52:40Z

Looking good! I would add examples of unit tests of the actual word count transducer. Also what happens to the simple ledger tests?

99-not-out · 2019-09-25T08:01:36Z

src/jackdaw/streams/xform.clj

+  "Takes a builder and adds a state store."
+  (doto ^StreamsBuilder (j/streams-builder* builder)
+    (.addStateStore (Stores/keyValueStoreBuilder
+                     (Stores/persistentKeyValueStore "transducer")


What if you want >1 stateful mapping? Won't all the mappers get the same state store? We should perhaps add an arity to pass in the backing store name.

I agree with this. I was planning to add this in a separate PR.

I think that being able to configure the name of the state store should be part of MVP.

Okay, let me think on that. Each transduce-kstream can have many stateful transducers. I need to generalize the interface.

yeah i think generalizing this to handle multiple transducers is good. i think that would handle the implicit coupling with add-state-store and transduce-kstream.

but i imagine there isn't a nice way around the user calling add-state-store for each transducer added. we could keep one store for all transducer state and key them by "transducer-id" like [transducer-id k] to allow us to only need 1 store
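The single-store idea can be sketched in pure Clojure with an atom standing in for the state store. `scoped-swap-fn` and the transducer ids below are hypothetical and do not exist in jackdaw; the swap-like function is assumed to take `[store f m]`, the same shape as the helper in this PR.

```clojure
;; Hypothetical helper: wraps a swap!-like function so that every key
;; it writes is namespaced as [transducer-id k].
(defn scoped-swap-fn
  [swap-fn transducer-id]
  (fn [store f m]
    (swap-fn store f (into {}
                           (map (fn [[k v]] [[transducer-id k] v]))
                           m))))

(def shared-store (atom {}))
(def counts-swap (scoped-swap-fn swap! :word-count))
(def balances-swap (scoped-swap-fn swap! :balances))

;; Two transducers write the same logical key without colliding.
(counts-swap shared-store (partial merge-with +) {"cash" 1})
(balances-swap shared-store (partial merge-with +) {"cash" 1000})
;; @shared-store => {[:word-count "cash"] 1, [:balances "cash"] 1000}
```

Whether composite keys like these serialize well in a real KeyValueStore depends on the serdes in use, which this sketch does not address.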

99-not-out · 2019-09-25T08:02:25Z

src/jackdaw/streams/xform.clj

+      (init [_ context]
+        (reset! ctx context))
+      (transform [_ k v]
+        (let [^KeyValueStore store (.getStateStore @ctx "transducer")


I would add an arity to pass the backing store name

I kind of think this function is the only thing that should really be in this PR (plus a test in xform_test.clj that proves it works). Everything else feels like arbitrary policy decisions that shouldn't be made at the library level.

I disagree. fake-kv-store and kv-store-swap-fn are what make transducers reusable. We can test our business logic without the TopologyTestDriver and Kafka Streams. The implementation of kv-store-swap-fn allows you to treat your state stores like Clojure atoms. You don't have to use it, but if you want to, it's there. The examples show how to use all of this to solve non-trivial problems. I think we need them. I could see adding a few tests for this namespace though.

99-not-out · 2019-09-25T08:02:52Z

src/jackdaw/streams/xform.clj

+  [kstream xf]
+  "Takes a kstream and xf and transduces the stream."
+  (-> kstream
+      (j/transform (fn [] (transformer xf)) ["transducer"])


Me again with the backing store name :)

99-not-out · 2019-09-25T08:11:35Z

src/jackdaw/streams/xform.clj

+      (transform [_ k v]
+        (let [^KeyValueStore store (.getStateStore @ctx "transducer")
+              v (first (into [] (xf store) [[k v]]))]
+          (KeyValue/pair k v)))


This transform does not mess with the key - for clarity we should consider providing a transformer (which can affect the key) and a value-transformer which won't (i.e. calls to the underlying transform and transformValues on the streams DSL).

Similarly for transduce-kstream vs transduce-kstream-values

This separation is useful when reading the high level code as it telegraphs to the reader whether a repartition may be occurring (à la map vs mapValues)

Each KStream::transform is always followed by a KStream::flatMap. The final key or keys are obtained from the resulting value of KStream::transform. We don't care about the resulting key.

Suppose the input to KStream::transform is [input-key input-value]. The output might be [input-key [[k1 v1] [k2 v2]]]. KStream::flatMap discards the key from the previous step and publishes two records with keys k1 and k2.
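The same step can be mimicked on plain Clojure data. This is only a model of the behaviour described above, with keywords standing in for real keys and values; it does not touch Kafka Streams.

```clojure
;; Output of the transform step for one input record: the input key is
;; carried along and the value holds the records to publish.
(def transformed [[:input-key [[:k1 :v1] [:k2 :v2]]]])

;; Same function shape as the flat-map step in transduce-kstream:
;; ignore the carried key and emit the nested records.
(def published (mapcat (fn [[_ v]] v) transformed))
;; published => ([:k1 :v1] [:k2 :v2])
```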

that makes sense. but the link between transformer and flatmap might not be clear to someone. if we decide not to replace it with .forward and .commit, i think this dependency should be documented. or transformer becomes a private function so it is clear transduce-kstream is the API to use

99-not-out · 2019-09-25T08:33:06Z

src/jackdaw/streams/xform.clj

+  "Takes an instance of KeyValueStore, a function f, and map m, and
+  updates the store in a manner similar to `clojure.core/swap!`."
+  [^KeyValueStore store f m]
+  (let [ks (keys (f {} m))


This (I think) means that the transform must create data when there is none, or else you will get no keys in this step for the subsequent reduction (i.e. it cannot be update-only). Something which should be documented so it's explicit - as the result of (f {} m) is important. (An update-only transform doesn't really make sense anyway, but worth a note on the doc string IMO)

Can you provide a working example?

could this just be (keys m)?
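A small sketch of the difference, assuming the key set is derived from `(f {} m)` as in the hunk above. Both `upsert` and `update-only` are hypothetical functions written for this comparison; what the rest of the helper does with the keys is not shown here.

```clojure
(def m {"cash" 100})

;; Creates entries when the old map is empty, like merge-with.
(defn upsert [old m]
  (merge-with + old m))

;; Update-only: touches a key only if it already exists in the old map.
(defn update-only [old m]
  (reduce-kv (fn [acc k v]
               (if (contains? acc k) (update acc k + v) acc))
             old
             m))

(def upsert-keys (keys (upsert {} m)))           ;; => ("cash")
(def update-only-keys (keys (update-only {} m))) ;; => nil
(def plain-keys (keys m))                        ;; => ("cash")
```

With an update-only f, `(keys (f {} m))` is empty while `(keys m)` is not, so the two expressions are interchangeable only when f creates an entry for every key in m.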

cddr · 2019-10-02T11:26:44Z

src/jackdaw/streams/xform.clj

+  "Takes a builder and adds a state store."
+  (doto ^StreamsBuilder (j/streams-builder* builder)
+    (.addStateStore (Stores/keyValueStoreBuilder
+                     (Stores/persistentKeyValueStore "transducer")


I think that being able to configure the name of the state store should be part of MVP.

cddr · 2019-10-02T12:10:16Z

src/jackdaw/streams/xform.clj

+  "Takes a kstream and xf and transduces the stream."
+  (-> kstream
+      (j/transform (fn [] (transformer xf)) ["transducer"])
+      (j/flat-map (fn [[_ v]] v))))


Why is the flat-map necessary here? What if someone wants to keep the keys?

Flat map corresponds to the reducing function in transduce. In this function, we always transduce with concat. This allows you to ingest a record and transform it into zero or more records. Without this step, you wouldn't be able to implement either of the examples.

I think you could achieve the same functionality using the forward and commit methods on ctx (which is a ProcessorContext). There's an example here. Not saying that way's better, just that there is an alternative way.

I was going to make the same point! I used ProcessorContext in the example @DaveWM linked because it felt closer to a transducible context than anything in the high-level KStreams API. ProcessorContext also gives a bit more fine-grained control over state stores.

cddr · 2019-10-02T12:14:22Z

src/jackdaw/streams/xform.clj

+      (init [_ context]
+        (reset! ctx context))
+      (transform [_ k v]
+        (let [^KeyValueStore store (.getStateStore @ctx "transducer")


I kind of think this function is the only thing that should really be in this PR (plus a test in xform_test.clj that proves it works). Everything else feels like arbitrary policy decisions that shouldn't be made at the library level.

DaveWM · 2019-10-07T16:55:46Z

src/jackdaw/streams/xform.clj

+
+(defn transduce-kstream
+  [kstream xf]
+  "Takes a kstream and xf and transduces the stream."


I'd add a bit more information around stateful transducers here. It needs to be clear to the user that this won't work with the stateful transducers from the Clojure core lib, and that if they need a stateful transducer they need to write their own.

also docstring and args are in the wrong order
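The point about stateful transducers from the core library can be seen without Kafka. The sketch below assumes the transformer runs a fresh transduction for each record, as the `(into [] (xf store) [[k v]])` hunk in this review suggests; `dedupe` is used only as a representative core stateful transducer.

```clojure
(def records [[nil 1] [nil 1] [nil 2]])

;; One transduction over the whole collection: dedupe sees every record.
(def whole (into [] (dedupe) records))
;; whole => [[nil 1] [nil 2]]

;; One transduction per record: dedupe's internal state is recreated
;; each time, so consecutive duplicates are never detected.
(def per-record
  (into []
        (mapcat (fn [record] (into [] (dedupe) [record])))
        records))
;; per-record => [[nil 1] [nil 1] [nil 2]]
```

Even if a single transduction were kept alive, the state would sit in process memory rather than in a state store, so it would not survive a restart or a rebalance.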

AndreaCrotti · 2019-10-10T07:57:19Z

src/jackdaw/streams/xform.clj

+           [org.apache.kafka.streams.state KeyValueStore Stores]
+           org.apache.kafka.streams.StreamsBuilder))
+
+(defn kv-store-swap-fn


from the -fn suffix I would have thought it would return a function, but it's actually doing the swapping from what I understand.
Maybe something like default-kv-store-swap would be more clear?
Do we need the -swap! as well if it's a write?

Maybe something like default-kv-store-swap would be more clear?

I like this!

Do we need the -swap! as well if it's a write?

According to the style guide:
"The names of functions/macros that are not safe in STM transactions should end with an exclamation mark (e.g. reset!)."

So yeah... I can get behind that.

DaveWM · 2019-11-12T18:09:53Z

I spoke to @blak3mill3r (the author of Noah) yesterday about how he's implemented stateful transducers in Noah. He came up with a broadly similar solution to what we have here. It seems like the use of volatile! within the transducer code is a real sticking point. The main difference between our solutions is that in Noah, all the transducers use a single state store rather than each transducer having its own. We weren't sure what the performance implications of that would be, but it's worth bearing in mind in case we run into perf issues in future.

We also discussed starting a shared library for all the core transducers re-written to support persisting their state, so that they can be used with Jackdaw and Noah. Blake's going to set this up, then I thought we could potentially pull this into Jackdaw. There are some open questions around this though, such as what we do about other popular transducer libraries like xforms.

blak3mill3r · 2019-11-15T00:55:37Z

Here is that shared library which reimplements (the transducer arity of) all of the functions in clojure.core that return a stateful transducer:

https://github.com/blak3mill3r/coddled-super-centaurs

That function is then bound twice by noah to instrument the transducer state and tie it into a StateStore:

https://github.com/blak3mill3r/noah/blob/5803dd5/src/noah/transduce.clj#L34-L35
https://github.com/blak3mill3r/noah/blob/5803dd5/src/noah/transduce.clj#L85-L86

Also, @DaveWM ... I checked, and as far as I can tell, there aren't any stateful transducers in xforms or kixi.stats. They have interesting higher-order transducers and reducing fns, and (I think...) these should work fine composed with these instrumented stateful transducers.

blak3mill3r · 2019-11-15T01:03:22Z

Also I want to clarify regarding: "all the transducers use a single state store rather than each transducer having its own"

Each time you transduce a KStream, if that transduction needs state, you must provide a store. That transducer can of course be a composition of several transducers, any of which can be stateful, and all of the states for these composed transducers will be stored together in a clojure vector as the record in the state store. To transduce multiple KStreams, you would use multiple state stores.

vijumathew · 2019-09-25T23:10:00Z

examples/simple-ledger/dev/user.clj

+  "Use this namespace for interactive development.
+
+  This namespace requires libs needed to reset the app and helpers
+  from `jackdaw.repl`. WARNING: Do no use `clj-refactor` (or


do not* typo

vijumathew · 2019-12-02T22:47:41Z

src/jackdaw/streams/xform.clj

+
+(defn transduce-kstream
+  [kstream xf]
+  "Takes a kstream and xf and transduces the stream."


also docstring and args are in the wrong order

vijumathew · 2019-12-02T23:02:02Z

src/jackdaw/streams/xform.clj

+        (reset! ctx context))
+      (transform [_ k v]
+        (let [^KeyValueStore store (.getStateStore @ctx "transducer")
+              v (first (into [] (xf store) [[k v]]))]


if the goal of this line is to get an updated value for v, could instead do (first (sequence xf [[k v]])).

i could be missing what (xf store) is needed for
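For a transducer that needs no store, the two forms do give the same first result. The `xf` below is a hypothetical stateless transducer; in the PR, xf is first applied to the store, which this sketch leaves out.

```clojure
;; Hypothetical stateless transducer: increments the value of a record.
(def xf (map (fn [[k v]] [k (inc v)])))

(def via-into (first (into [] xf [[:k 1]])))
(def via-sequence (first (sequence xf [[:k 1]])))
;; both => [:k 2]
```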

vijumathew · 2019-12-02T23:14:55Z

src/jackdaw/streams/xform.clj

+  "Takes a builder and adds a state store."
+  (doto ^StreamsBuilder (j/streams-builder* builder)
+    (.addStateStore (Stores/keyValueStoreBuilder
+                     (Stores/persistentKeyValueStore "transducer")


yeah i think generalizing this to handle multiple transducers is good. i think that would handle the implicit coupling with add-state-store and transduce-kstream.

but i imagine there isn't a nice way around the user calling add-state-store for each transducer added. we could keep one store for all transducer state and key them by "transducer-id" like [transducer-id k] to allow us to only need 1 store

vijumathew · 2019-12-02T23:19:54Z

src/jackdaw/streams/xform.clj

+      (transform [_ k v]
+        (let [^KeyValueStore store (.getStateStore @ctx "transducer")
+              v (first (into [] (xf store) [[k v]]))]
+          (KeyValue/pair k v)))


that makes sense. but the link between transformer and flatmap might not be clear to someone. if we decide not to replace it with .forward and .commit, i think this dependency should be documented. or transformer becomes a private function so it is clear transduce-kstream is the API to use

vijumathew · 2019-12-02T23:27:29Z

src/jackdaw/streams/xform.clj

+  "Takes an instance of KeyValueStore, a function f, and map m, and
+  updates the store in a manner similar to `clojure.core/swap!`."
+  [^KeyValueStore store f m]
+  (let [ks (keys (f {} m))


could this just be (keys m)?

vijumathew · 2019-12-02T23:28:10Z

src/jackdaw/streams/xform.clj

+
+(defn kv-store-swap-fn
+  "Takes an instance of KeyValueStore, a function f, and map m, and
+  updates the store in a manner similar to `clojure.core/swap!`."


could we add something that explains the shape of f? maybe like:

f is a function of 2 args that takes the old map (store) and the provided map and combines them together.

kidpollo · 2020-01-15T00:04:25Z

Plz merge this already 😛 !!

kidpollo · 2024-05-09T23:48:51Z

Bump!

creese requested a review from a team as a code owner September 23, 2019 02:44

creese requested review from matthias-margush, 99-not-out, bren-do, cddr, DaveWM, gphilipp and kidpollo September 23, 2019 02:50

creese force-pushed the transducers branch from 354823e to 3430581 Compare September 23, 2019 03:01

Adds helper functions for working with transducers and two examples.

4dffa12

creese force-pushed the transducers branch from de98739 to 4dffa12 Compare September 25, 2019 05:22

creese changed the title ~~Transducers~~ Transducers are Coming Sep 25, 2019

99-not-out reviewed Sep 25, 2019

Charles Reese added 3 commits September 26, 2019 09:29

Replace topology xf-word-count topology with classic Word Count to ru…

08ae80f

…le out testing bug

Fix test-xf-word-count

05cfd61

Remove nils from keys in examples and change simple ledger map keys

2136e1c

cddr reviewed Oct 2, 2019

Remove xf- prefix from transducer naming, move fake-kv-store to diffe…

4ee5d7f

…rent ns

DaveWM reviewed Oct 7, 2019

When publishing test data, use "entry-pending"

fe1a74b

AndreaCrotti reviewed Oct 10, 2019

vijumathew reviewed Dec 2, 2019

Charles Reese and others added 2 commits December 12, 2019 20:40

Use forward method on the processor context

01317c2

Rename var from entry to txn

9efe7b7
