Releases: bytewax/bytewax
v0.19.1
Overview
- Fixes a bug where using a system clock on certain architectures causes items to be dropped from windows.
What's Changed
- Little docs cleanups by @davidselassie in #424
- Add builds for Python 3.12 by @whoahbot in #425
- Adds async connector guide by @davidselassie in #428
- refactor orderbook example to remove ordereddict by @lfunderburk in #429
- Fix an issue with session windows by @whoahbot in #430
- Prepare 0.19.1 by @whoahbot in #431
New Contributors
- @lfunderburk made their first contribution in #429
Full Changelog: v0.19.0...v0.19.1
v0.19.0
Overview
- Multiple operators have been reworked to avoid taking and releasing Python's global interpreter lock while iterating over multiple items. Windowing operators, stateful operators and operators like `branch` will see significant performance improvements. Thanks to @damiondoesthings for helping us track this down!
- Breaking change: `FixedPartitionedSource.build_part`, `DynamicSource.build`, `FixedPartitionedSink.build_part` and `DynamicSink.build` now take an additional `step_id` argument. This argument can be used when labeling custom Python metrics.
- Custom Python metrics can now be collected using the `prometheus-client` library.
- Breaking change: The schema registry interface has been removed. You can still use schema registries, but you need to instantiate the (de)serializers on your own. This allows for more flexibility. See the `confluent_serde` and `redpanda_serde` examples for how to use the new interface.
- Fixes bug where items would be incorrectly marked as late in sliding and tumbling windows in cases where the timestamps are very far from the `align_to` parameter of the windower.
- Adds `stateful_flat_map` operator.
- Breaking change: Removes `builder` argument from `stateful_map`. Instead, the initial state value is always `None` and you can call your previous builder by hand in the `mapper`.
- Breaking change: Improves performance by removing the `now: datetime` argument from `FixedPartitionedSource.build_part`, `DynamicSource.build`, and `UnaryLogic.on_item`. If you need the current time, use:
    from datetime import datetime, timezone
    now = datetime.now(timezone.utc)
- Breaking change: Improves performance by removing the `sched: datetime` argument from `StatefulSourcePartition.next_batch`, `StatelessSourcePartition.next_batch`, `UnaryLogic.on_notify`. You should already have the scheduled next awake time in whatever instance variable you returned in `{Stateful,Stateless}SourcePartition.next_awake` or `UnaryLogic.notify_at`.
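With the `builder` argument gone from `stateful_map`, the mapper itself has to cope with a `None` initial state. The following is a minimal sketch of that pattern, assuming the mapper receives `(state, value)` and returns `(updated_state, emitted_value)`. The names `build_state` and `running_mean` are hypothetical, and the loop at the bottom only stands in for the runtime calling the mapper once per item of a single key.

```python
from typing import Optional, Tuple


def build_state() -> Tuple[float, int]:
    """Hypothetical former `builder`: the starting (total, count)."""
    return (0.0, 0)


def running_mean(
    state: Optional[Tuple[float, int]], value: float
) -> Tuple[Tuple[float, int], float]:
    # The first item for a key arrives with state None,
    # so the old builder is called by hand here.
    if state is None:
        state = build_state()
    total, count = state
    total += value
    count += 1
    return (total, count), total / count


# Stand-in for the runtime: feed one key's values through the mapper in order.
state = None
means = []
for value in [2.0, 4.0, 6.0]:
    state, mean = running_mean(state, value)
    means.append(mean)

print(means)
```

In a dataflow, a function like this would be passed as the `mapper` of `stateful_map`. The operator call is left out here because its exact signature should be checked against the API docs for the version in use.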
What's Changed
- Add Kafka concept section to metadata.json by @whoahbot in #373
- Fixed split_demo example by @Psykopear in #371
- Prevent dataflow hang if next_awake is far in future by @davidselassie in #374
- Adds a basic stub file generator by @davidselassie in #369
- Update metrics and observability guides by @Psykopear in #372
- Fixes pyright errors by @davidselassie in #378
- Shuffles around Kafka objects and updates docstrings by @davidselassie in #382
- Fixes `collect` and `branch` operator test file names by @davidselassie in #385
- Using MyST + Sphinx for API docs by @davidselassie in #383
- Update README.md by @jonasbest in #388
- All docs to Sphinx and RTD by @davidselassie in #386
- Update logo in README.md by @konradsienkowski in #389
- Removes `stateful_map` `builder` function and adds `stateful_flat_map` by @davidselassie in #387
- Re-enables doctests via Sybil by @davidselassie in #390
- Removes `now` and `sched` arguments in input partitions and unary logic by @davidselassie in #391
- fix backup interval of zero raising an exception by @damiondoesthings in #393
- Remove GLIBC 2.27 builder by @whoahbot in #395
- Fix recovery store garbage collection by @davidselassie in #394
- Customize Sphinx docs theme by @konradsienkowski in #400
- Cleanup docs by @whoahbot in #401
- [Docs]: Fix path to Slack icon by @konradsienkowski in #402
- Updates release instructions with Read the Docs stuff by @davidselassie in #403
- Additional recovery tests by @davidselassie in #407
- Adds warnings about using session windows by @davidselassie in #408
- Update max window documentation by @awmatheson in #410
- Windowing concept doc by @davidselassie in #412
- Don't call `time_for` twice by @whoahbot in #414
- Refactors window boundary calculation to avoid overflow by @davidselassie in #415
- Unified Redpanda and Confluent schema registries by @Psykopear in #399
- Fix inconsistent window boundaries panic due to microseconds in timestamps by @davidselassie in #416
- Makes stubgen deterministic by @davidselassie in #417
- Don't take the GIL when iterating over items by @whoahbot in #418
- Start working on custom metrics for Kafka by @whoahbot in #404
- Take 2 on inconsistent boundaries by @davidselassie in #419
- Add benchmarks for core operators by @whoahbot in #420
- Refactor operators by @whoahbot in #422
- Prepare for v0.19.0 release by @whoahbot in #423
New Contributors
- @jonasbest made their first contribution in #388
- @damiondoesthings made their first contribution in #393
Full Changelog: v0.18.2...v0.19.0
v0.18.2
Overview
- Fixes a bug that prevented the deletion of old state in recovery stores.
- Better error messages on invalid epoch and backup interval parameters.
- Fixes bug where dataflow will hang if a source's `next_awake` is set far in the future.
What's Changed
Full Changelog: v0.18.1...v0.18.2
v0.18.1
Overview
- Changes the default batch size for `KafkaSource` from 1 to 1000 to match the Kafka input operator.
- Fixes an issue with the `count_window` operator: #364.
What's Changed
- Fix some links in docs by @Psykopear in #358
- Updates dataflow programming concept docs by @davidselassie in #357
- Update the readme with changes from 0.18 by @awmatheson in #356
- Make KafkaSinkMessage types covariant by @Psykopear in #359
- Add a pre-reduce step to `reduce_final` by @davidselassie in #361
- Adds 1 Billion Row Challenge example by @davidselassie in #362
- Update DynamicSource docstring by @awmatheson in #360
- Fixed count_window so it passes the whole item to the clock by @Psykopear in #365
- Fix metrics demo setup by @jhanninen in #367
- Add Kafka concepts section by @whoahbot in #366
- Prepare for 0.18.1 by @whoahbot in #368
New Contributors
- @jhanninen made their first contribution in #367
Full Changelog: v0.18.0...v0.18.1
v0.18.0
Overview
- Support for schema registries, through `bytewax.connectors.kafka.registry.RedpandaSchemaRegistry` and `bytewax.connectors.kafka.registry.ConfluentSchemaRegistry`.
- Custom Kafka operators in `bytewax.connectors.kafka.operators`: `input`, `output`, `deserialize_key`, `deserialize_value`, `deserialize`, `serialize_key`, `serialize_value` and `serialize`.
- Breaking change: `KafkaSource` now emits a special `KafkaSourceMessage` to allow access to all data on consumed messages. `KafkaSink` now consumes `KafkaSinkMessage` to allow setting additional fields on produced messages.
- Non-linear dataflows are now possible. Each operator method returns a handle to the `Stream`s it produces; add further steps by calling operator functions on those returned handles, not the root `Dataflow`. See the migration guide for more info.
- Auto-complete and type hinting on operators, inputs, outputs, streams, and logic functions now work.
- A ton of new operators: `collect_final`, `count_final`, `count_window`, `flatten`, `inspect_debug`, `join`, `join_named`, `max_final`, `max_window`, `merge`, `min_final`, `min_window`, `key_on`, `key_assert`, `key_split`, `unary`. Documentation for all operators is in `bytewax.operators` now.
- New operators can be added in Python, made by grouping existing operators. See the `bytewax.dataflow` module docstring for more info.
- Breaking change: Operators are now stand-alone functions; `import bytewax.operators as op` and use e.g. `op.map("step_id", upstream, lambda x: x + 1)`.
- Breaking change: All operators must take a `step_id` argument now.
- Breaking change: `fold` and `reduce` operators have been renamed to `fold_final` and `reduce_final`. They now only emit on EOF and are only for use in batch contexts.
- Breaking change: `batch` operator renamed to `collect`, so as to not be confused with runtime batching. Behavior is unchanged.
- Breaking change: `output` operator does not forward its items downstream. Add operators on the upstream handle instead.
- `next_batch` on input partitions can now return any `Iterable`, not just a `List`.
- `inspect` operator now has a default inspector that prints out items with the step ID.
- `collect_window` operator now can collect into `set`s and `dict`s.
- Adds a `get_fs_id` argument to `{Dir,File}Source` to allow handling non-identical files per worker.
- Adds `TestingSource.EOF` and `TestingSource.ABORT` sentinel values you can use to test recovery.
- Breaking change: Adds a `datetime` argument to `FixedPartitionedSource.build_part`, `DynamicSource.build`, `StatefulSourcePartition.next_batch`, and `StatelessSourcePartition.next_batch`. You can now use this to update your `next_awake` time easily.
- Breaking change: Window operators now emit `WindowMetadata` objects downstream. These objects can be used to introspect the `open_time` and `close_time` of windows. This changes the output type of windowing operators from `(key, values)` to `(key, (metadata, values))`.
- Breaking change: IO classes and connectors have been renamed to better reflect their semantics and match up with documentation.
- Moves the ability to start multiple Python processes with the `-p` or `--processes` flag to the `bytewax.testing` module.
- Breaking change: `SimplePollingSource` moved from `bytewax.connectors.periodic` to `bytewax.inputs` since it is an input helper.
- `SimplePollingSource`'s `align_to` argument now works.
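The change in window output shape from `(key, values)` to `(key, (metadata, values))` mostly affects how downstream steps unpack items. Here is a rough sketch of the new unpacking, using only the standard library. `FakeWindowMetadata` is a hypothetical stand-in that carries just the `open_time` and `close_time` fields named above; it is not the real `WindowMetadata` class, and `describe` is an illustrative helper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass(frozen=True)
class FakeWindowMetadata:
    """Hypothetical stand-in with the two fields named in the notes."""

    open_time: datetime
    close_time: datetime


def describe(item: Tuple[str, Tuple[FakeWindowMetadata, List[int]]]) -> str:
    # New shape: (key, (metadata, values)) instead of (key, values).
    key, (metadata, values) = item
    length = metadata.close_time - metadata.open_time
    return f"{key}: {len(values)} items in a {length.total_seconds():.0f}s window"


open_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
meta = FakeWindowMetadata(open_time, open_time + timedelta(minutes=1))
summary = describe(("sensor-1", (meta, [3, 5, 8])))
print(summary)
```

Code written against the old shape would typically only need the extra level of tuple unpacking shown in `describe`.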
What's Changed
- Error cleanups by @davidselassie in #302
- More error fixing by @davidselassie in #303
- Add initial metrics implementation. by @whoahbot in #296
- Adds batching getter input helpers by @davidselassie in #304
- Move Python multiprocessing execution mode to the testing namespace by @whoahbot in #305
- We don't actually depend on bincode ourselves anymore by @davidselassie in #308
- Rename IO classes by @davidselassie in #307
- Move SimplePollingSource and fix align_to argument by @davidselassie in #309
- Add window metadata object to the output of windowing operators by @whoahbot in #311
- Flushes StdOutSink by @davidselassie in #314
- Deterministically awaken keys by @davidselassie in #315
- Passes a `now` argument to `build_part` and `next_batch` by @davidselassie in #316
- Adds `TestingSource.{ABORT, EOF}` by @davidselassie in #317
- Adds `get_fs_id` argument to `{Dir,File}Source` by @davidselassie in #320
- Adds vermin pre-commit by @davidselassie in #323
- Adds a simple retry functionality to `SimplePollingSource` by @davidselassie in #324
- Fix all examples by @Psykopear in #322
- Non-linear dataflows and Python operators by @davidselassie in #321
- New docs structure by @cra in #325
- Move module docstring sections into markdown docs by @davidselassie in #330
- Removes `key_split` operator; fixes type annotations by @davidselassie in #331
- Runtime typecheck `Stream` arguments to operators by @davidselassie in #334
- Changes snapshot and backup interval to saner defaults by @davidselassie in #335
- Performance work by @davidselassie in #333
- Handle `Optional` and more complex types in operator signatures by @davidselassie in #338
- Tuples are faster than lists by @davidselassie in #339
- Update getting started by @whoahbot in #336
- `flat_map_batch` operator by @davidselassie in #341
- Add an overload to `branch` operator so if it gets a `TypeGuard`, the output streams are typed correctly by @davidselassie in #342
- Add documentation for joins and wordcount to getting started by @whoahbot in #340
- Bumps `rusqlite` deps by @davidselassie in #345
- Uses `ruff format` instead of `black` by @davidselassie in #346
- Kafka connector revamp, schema registry support, py-operators by @Psykopear in #332
- Renames `batch` operator to `collect` by @davidselassie in #351
- Joins concepts documentation and `join_window` fixes by @davidselassie in #344
- Properly handle resume from EOF by @davidselassie in #352
- Add getting started guides for execution and snapshot by @whoahbot in #350
- Start working on 0.18 migration guide by @whoahbot in #337
- Changelog updated for kafka connector by @Psykopear in #348
- Update container guide by @Psykopear in #349
- Long Format Documentation Update by @awmatheson in #347
- Prepare 0.18 by @whoahbot in #354
New Contributors
Full Changelog: v0.17.1...v0.18.0
v0.17.2
Overview
- Fixes error message creation, and updates error messages when creating recovery partitions.
Full Changelog: v0.17.1...v0.17.2
v0.17.1
Overview
- Adds the `batch` operator to Dataflows. Calling `Dataflow.batch` will batch incoming items until either a batch size has been reached or a timeout has passed.
- Adds the `SimplePollingInput` source. Subclass this input source to periodically source new input for a dataflow.
- Re-adds GLIBC 2.27 builds to support older Linux distributions.
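The "batch size or timeout" rule of the `batch` operator can be illustrated in plain Python. This sketch is not Bytewax's implementation. `SizeOrTimeoutBatcher` is a hypothetical class, and it assumes the timeout clock starts when the first item of a batch arrives. Timestamps are passed in explicitly so the behaviour is deterministic.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


class SizeOrTimeoutBatcher:
    """Illustrative only; emits a batch on max_size or after timeout."""

    def __init__(self, max_size: int, timeout: timedelta) -> None:
        self.max_size = max_size
        self.timeout = timeout
        self._items: List[int] = []
        self._deadline: Optional[datetime] = None

    def push(self, item: int, now: datetime) -> Optional[List[int]]:
        """Add an item; return a full batch once max_size is reached."""
        if not self._items:
            # Assumption: the timeout starts with the first item of a batch.
            self._deadline = now + self.timeout
        self._items.append(item)
        if len(self._items) >= self.max_size:
            return self._flush()
        return None

    def poll(self, now: datetime) -> Optional[List[int]]:
        """Return a partial batch if its timeout has passed."""
        if self._items and self._deadline is not None and now >= self._deadline:
            return self._flush()
        return None

    def _flush(self) -> List[int]:
        batch, self._items, self._deadline = self._items, [], None
        return batch


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
batcher = SizeOrTimeoutBatcher(max_size=3, timeout=timedelta(seconds=10))

emitted = []
for offset, item in enumerate([1, 2, 3, 4]):
    batch = batcher.push(item, start + timedelta(seconds=offset))
    if batch is not None:
        emitted.append(batch)

# Item 4 is still waiting; it is only emitted once its timeout elapses.
early = batcher.poll(start + timedelta(seconds=5))
late = batcher.poll(start + timedelta(seconds=20))
```

The first three items leave as one full batch, while the fourth is held back until a later `poll` finds its deadline has passed.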
What's Changed
- Batch operator by @Psykopear in #287
- Re-add glibc 2.27 builder by @whoahbot in #297
- PeriodicInput source by @Psykopear in #295
- Prepare for v0.17.1 release by @whoahbot in #299
Full Changelog: v0.17.0...v0.17.1
v0.17.0
Changed
- Breaking change: Recovery system re-worked. Kafka-based recovery removed. SQLite recovery file format changed; existing recovery DB files cannot be used. See the module docstring for `bytewax.recovery` for how to use the new recovery system.
- Dataflow execution supports rescaling over resumes. You can now change the number of workers and still get proper execution and recovery.
- `epoch-interval` has been renamed to `snapshot-interval`.
- The `list_parts` method of `PartitionedInput` has been changed to return a `List[str]` and should only reflect the available inputs that a given worker has access to. You no longer need to return the complete set of partitions for all workers.
- The `next` method of `StatefulSource` and `StatelessSource` has been changed to `next_batch` and should return a `List` of elements, or the empty list if there are no elements to return.
Added
- Added new CLI parameter `backup-interval`, to configure the length of time to wait before "garbage collecting" older recovery snapshots.
- Added `next_awake` to input classes, which can be used to schedule when the next call to `next_batch` should occur. Use `next_awake` instead of `time.sleep`.
- Added `bytewax.inputs.batcher_async` to bridge async Python libraries in Bytewax input sources.
- Added support for linux/aarch64 and linux/armv7 platforms.
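The advice to use `next_awake` instead of `time.sleep` comes down to storing the next wake-up time on the source and returning it when asked. Below is a minimal sketch of that idea, assuming `next_batch` takes no arguments and `next_awake` returns an optional `datetime`. `PollingPartition` is a hypothetical plain class that does not subclass anything from Bytewax, and the calls at the bottom only imitate what the runtime would do.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


class PollingPartition:
    """Plain-Python stand-in for an input source; not a Bytewax class."""

    def __init__(self, interval: timedelta, start: datetime) -> None:
        self._interval = interval
        self._next_awake = start
        self._polls = 0

    def next_batch(self) -> List[str]:
        # Do the work, then schedule the next wake-up instead of sleeping.
        self._polls += 1
        self._next_awake += self._interval
        return [f"poll-{self._polls}"]

    def next_awake(self) -> Optional[datetime]:
        # The runtime is expected to wait until this time
        # before calling next_batch again.
        return self._next_awake


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
part = PollingPartition(timedelta(seconds=30), start)
first = part.next_batch()
wake = part.next_awake()
print(first, wake.isoformat())
```

Because the scheduled time lives in an instance variable, the worker thread is never blocked the way a `time.sleep` call inside `next_batch` would block it.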
Removed
- `KafkaRecoveryConfig` has been removed as a recovery store.
What's Changed
- Spread operator by @Psykopear in #246
- Start updating CI by @whoahbot in #253
- Correct number of available inputs by @RCdeWit in #255
- Build exception messages only on error by @davidselassie in #254
- Fix links in docs leading to 404 error by @konradsienkowski in #256
- Mark KafkaRecoveryConfig as deprecated by @davidselassie in #258
- Fixes `ymd` deprecation warnings. by @whoahbot in #262
- Batch input take 2 by @Psykopear in #261
- Properly handle when resume_state is falsy by @davidselassie in #267
- AsyncBatcher by @davidselassie in #269
- Next awake by @Psykopear in #266
- Run clippy on pre-commit by @davidselassie in #272
- For some reason test_execution has Windows line endings by @davidselassie in #273
- Various text pre-commit hooks by @davidselassie in #274
- SQLite-only recovery with dynamic partitions to support rescaling by @davidselassie in #271
- Bumps PyO3 to 0.19.2 by @davidselassie in #276
- Calculate primaries only on worker 0 by @davidselassie in #281
- Uses ruff to lint by @davidselassie in #278
- Update CI by @whoahbot in #277
- Update orderbook.py to use level2_batch by @awmatheson in #279
- Removes now-unused pickling infra by @davidselassie in #280
- Fix tiny typo in comment by @rabernat in #283
- File input fixups by @davidselassie in #282
- Fix kafka input error message by @Psykopear in #289
- Error when inconsistent recovery partitions by @davidselassie in #290
- Adding recovery entrypoint script by @miccioest in #291
- Add v0.17 upgrading guide. by @whoahbot in #288
- Prepare for v0.17.0 release. by @whoahbot in #292
New Contributors
Full Changelog: v0.16.2...v0.17.0
v0.16.2
Overview
- Add support for Windows builds - thanks @zzl221000!
- Adds a CSVInput subclass of FileInput
What's Changed
- Update PyO3 by @whoahbot in #244
- Add a _CSVSource and CSVInput subclass of FileInput by @awmatheson in #245
- Fix encoder if a class is passed by @Psykopear in #247
- Add windows build support by @zzl221000 in #249
- Prepare for 0.16.2 release by @whoahbot in #251
New Contributors
- @zzl221000 made their first contribution in #249
Full Changelog: v0.16.1...v0.16.2
v0.16.1
Overview
- Add a cooldown for activating workers to reduce CPU consumption.
- Add support for Python 3.11.
What's Changed
- Docs fixes by @Psykopear in #235
- Fixes some issues with API docs by @davidselassie in #236
- Bump timely to most recent version. by @whoahbot in #238
- Fixed examples by @Psykopear in #237
- Python 3.11 package by @Psykopear in #191
- Write out Dataflow JSON to a file by @whoahbot in #240
- 1ms cooldown before activating workers by @Psykopear in #241
- Fix readme by @Psykopear in #239
- Prepare for v0.16.1 release by @whoahbot in #243
Full Changelog: v0.16.0...v0.16.1