Releases: bytewax/bytewax

v0.19.1

01 Apr 18:30

Overview

  • Fixes a bug where using a system clock on certain architectures causes items to be dropped from windows.

Full Changelog: v0.19.0...v0.19.1

v0.19.0

20 Mar 17:22
73d8e2f

Overview

  • Multiple operators have been reworked to avoid taking and releasing
    Python's global interpreter lock while iterating over multiple items.
    Windowing operators, stateful operators and operators like branch
    will see significant performance improvements.

    Thanks to @damiondoesthings for helping us track this down!

  • Breaking change FixedPartitionedSource.build_part,
    DynamicSource.build, FixedPartitionedSink.build_part and DynamicSink.build
    now take an additional step_id argument. This argument can be used when
    labeling custom Python metrics.

  • Custom Python metrics can now be collected using the prometheus-client
    library.

  • Breaking change The schema registry interface has been removed.
    You can still use schema registries, but you need to instantiate
    the (de)serializers on your own. This allows for more flexibility.
    See the confluent_serde and redpanda_serde examples for how
    to use the new interface.

  • Fixes a bug where items would be incorrectly marked as late in sliding
    and tumbling windows in cases where the timestamps are very far from
    the align_to parameter of the windower.

  • Adds stateful_flat_map operator.

  • Breaking change Removes builder argument from stateful_map.
    Instead, the initial state value is always None and you can call
    your previous builder by hand in the mapper.

  • Breaking change Improves performance by removing the now: datetime argument from FixedPartitionedSource.build_part,
    DynamicSource.build, and UnaryLogic.on_item. If you need the
    current time, use:

    from datetime import datetime, timezone

    now = datetime.now(timezone.utc)

  • Breaking change Improves performance by removing the sched: datetime argument from StatefulSourcePartition.next_batch,
    StatelessSourcePartition.next_batch, and UnaryLogic.on_notify. You
    should already have the scheduled next awake time in whatever
    instance variable you returned in
    {Stateful,Stateless}SourcePartition.next_awake or
    UnaryLogic.notify_at.
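
For the stateful_map change above, a minimal sketch of calling a former builder by hand inside the mapper might look like the following. The names build_state and running_total are hypothetical, and the sketch is plain Python so it can be checked without a running dataflow. It assumes the mapper receives (state, value) and returns (updated_state, emit_value).

```python
def build_state():
    """Hypothetical former builder: the starting value for a new key."""
    return 0


def running_total(state, value):
    """Mapper sketch: state is None the first time a key is seen."""
    if state is None:
        # Call the old builder by hand now that the operator no longer does.
        state = build_state()
    state += value
    # Assumed return shape: (updated_state, value_to_emit).
    return state, state


# Simulate three items arriving for a single key.
state = None
emitted = []
for item in [3, 4, 5]:
    state, out = running_total(state, item)
    emitted.append(out)
```

A mapper like this would presumably be passed as the last argument, e.g. op.stateful_map("total", keyed_stream, running_total), where keyed_stream is a placeholder for an upstream keyed stream.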

Full Changelog: v0.18.2...v0.19.0

v0.18.2

08 Feb 18:00

Overview

  • Fixes a bug that prevented the deletion of old state in recovery stores.
  • Better error messages on invalid epoch and backup interval parameters.
  • Fixes a bug where a dataflow would hang if a source's next_awake was set far in the future.

Full Changelog: v0.18.1...v0.18.2

v0.18.1

12 Jan 22:09
a8243f0

Overview

  • Changes the default batch size for KafkaSource from 1 to 1000 to match the Kafka input operator.
  • Fixes an issue with the count_window operator: #364.

Full Changelog: v0.18.0...v0.18.1

v0.18.0

20 Dec 22:32
6478893

Overview

  • Support for schema registries, through bytewax.connectors.kafka.registry.RedpandaSchemaRegistry and bytewax.connectors.kafka.registry.ConfluentSchemaRegistry.

  • Custom Kafka operators in bytewax.connectors.kafka.operators:
    input, output, deserialize_key, deserialize_value, deserialize,
    serialize_key, serialize_value and serialize.

  • Breaking change KafkaSource now emits a special KafkaSourceMessage to allow access to all data on consumed messages. KafkaSink now consumes KafkaSinkMessage to allow setting additional fields on produced messages.

  • Non-linear dataflows are now possible. Each operator method returns
    a handle to the Streams it produces; add further steps by calling
    operator functions on those returned handles, not the root
    Dataflow. See the migration guide for more info.

  • Auto-complete and type hinting on operators, inputs, outputs,
    streams, and logic functions now work.

  • A ton of new operators: collect_final, count_final,
    count_window, flatten, inspect_debug, join, join_named,
    max_final, max_window, merge, min_final, min_window,
    key_on, key_assert, key_split and unary. Documentation
    for all operators is in bytewax.operators now.

  • New operators can be added in Python, made by grouping existing
    operators. See bytewax.dataflow module docstring for more info.

  • Breaking change Operators are now stand-alone functions; import bytewax.operators as op and use e.g. op.map("step_id", upstream, lambda x: x + 1).

  • Breaking change All operators must take a step_id argument now.

  • Breaking change fold and reduce operators have been renamed to
    fold_final and reduce_final. They now only emit on EOF and are
    only for use in batch contexts.

  • Breaking change batch operator renamed to collect, so as to
    not be confused with runtime batching. Behavior is unchanged.

  • Breaking change output operator no longer forwards its items
    downstream. Add operators on the upstream handle instead.

  • next_batch on input partitions can now return any Iterable, not
    just a List.

  • inspect operator now has a default inspector that prints out items
    with the step ID.

  • collect_window operator now can collect into sets and dicts.

  • Adds a get_fs_id argument to {Dir,File}Source to allow handling
    non-identical files per worker.

  • Adds TestingSource.EOF and TestingSource.ABORT sentinel values
    you can use to test recovery.

  • Breaking change Adds a datetime argument to
    FixedPartitionedSource.build_part, DynamicSource.build,
    StatefulSourcePartition.next_batch, and
    StatelessSourcePartition.next_batch. You can now use this to
    update your next_awake time easily.

  • Breaking change Window operators now emit WindowMetadata objects
    downstream. These objects can be used to introspect the open_time
    and close_time of windows. This changes the output type of windowing
    operators from: (key, values) to (key, (metadata, values)).

  • Breaking change IO classes and connectors have been renamed to
    better reflect their semantics and match up with documentation.

  • Moves the ability to start multiple Python processes with the
    -p or --processes flag to the bytewax.testing module.

  • Breaking change SimplePollingSource moved from
    bytewax.connectors.periodic to bytewax.inputs since it is an
    input helper.

  • SimplePollingSource's align_to argument now works.
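
The windowing output change noted above, from (key, values) to (key, (metadata, values)), means downstream steps need to unpack one more level. The sketch below illustrates that unpacking in plain Python. FakeWindowMetadata and summarize_window are hypothetical names; the stand-in class only mirrors the open_time and close_time attributes mentioned in the notes and is not the library's own class.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FakeWindowMetadata:
    """Hypothetical stand-in exposing only open_time and close_time."""

    open_time: datetime
    close_time: datetime


def summarize_window(item):
    """Unpack the new (key, (metadata, values)) shape."""
    key, (meta, values) = item
    # Window length and item count, derived from the metadata and values.
    return key, meta.close_time - meta.open_time, len(values)


open_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
meta = FakeWindowMetadata(open_time, open_time + timedelta(minutes=1))
summary = summarize_window(("user-1", (meta, [10, 20, 30])))
```

Code written against the old (key, values) shape would fail at the tuple unpacking, which makes the migration point easy to find.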

New Contributors

  • @cra made their first contribution in #325

Full Changelog: v0.17.1...v0.18.0

v0.17.2

05 Oct 18:55
10486e6

Overview

  • Fixes error message creation, and updates error messages when creating recovery partitions.

Full Changelog: v0.17.1...v0.17.2

v0.17.1

12 Sep 18:29
795b306

Overview

  • Adds the batch operator to Dataflows. Calling Dataflow.batch
    will batch incoming items until either a batch size has been reached
    or a timeout has passed.

  • Adds the SimplePollingInput source. Subclass this input source to
    periodically source new input for a dataflow.

  • Re-adds GLIBC 2.27 builds to support older Linux distributions.

Full Changelog: v0.17.0...v0.17.1

v0.17.0

28 Aug 15:53
e2df62c

Changed

  • Breaking change Recovery system re-worked. Kafka-based recovery
    removed. SQLite recovery file format changed; existing recovery DB
    files cannot be used. See the module docstring for
    bytewax.recovery for how to use the new recovery system.

  • Dataflow execution supports rescaling over resumes. You can now
    change the number of workers and still get proper execution and
    recovery.

  • epoch-interval has been renamed to snapshot-interval.

  • The list_parts method of PartitionedInput has been changed to
    return a List[str] and should only reflect the available
    inputs that a given worker has access to. You no longer need
    to return the complete set of partitions for all workers.

  • The next method of StatefulSource and StatelessSource has
    been changed to next_batch and should return a List of elements,
    or the empty list if there are no elements to return.

Added

  • Added new CLI parameter backup-interval, to configure the length of
    time to wait before "garbage collecting" older recovery snapshots.

  • Added next_awake to input classes, which can be used to schedule
    when the next call to next_batch should occur. Use next_awake
    instead of time.sleep.

  • Added bytewax.inputs.batcher_async to bridge async Python libraries
    in Bytewax input sources.

  • Added support for linux/aarch64 and linux/armv7 platforms.
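
The next_awake item above can be sketched roughly as follows: instead of blocking in time.sleep, a partition records when it next wants to run and reports that time back. TickPartition is a hypothetical, duck-typed class written in plain Python; a real partition would subclass one of the Bytewax source classes, and the start time is passed in here only so the sketch is deterministic.

```python
from datetime import datetime, timedelta, timezone


class TickPartition:
    """Hypothetical partition; a real one would subclass a Bytewax source class."""

    def __init__(self, start, interval):
        self._interval = interval
        self._next_awake = start

    def next_batch(self):
        # Schedule the following call rather than blocking in time.sleep.
        emitted_at = self._next_awake
        self._next_awake = emitted_at + self._interval
        # Return a list of items, as the 0.17.0 notes above require.
        return [emitted_at]

    def next_awake(self):
        # Report when next_batch should be called again.
        return self._next_awake


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
part = TickPartition(start, timedelta(seconds=10))
first = part.next_batch()
wake = part.next_awake()
```

Keeping the scheduled time in an instance variable also means the partition can read it back later without needing it passed in as an argument.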

Removed

  • KafkaRecoveryConfig has been removed as a recovery store.

Full Changelog: v0.16.2...v0.17.0

v0.16.2

05 Jun 19:48
1a3338b

Overview

  • Adds support for Windows builds. Thanks @zzl221000!
  • Adds a CSVInput subclass of FileInput.

Full Changelog: v0.16.1...v0.16.2

v0.16.1

12 May 18:04
96aa989

Overview

  • Add a cooldown for activating workers to reduce CPU consumption.
  • Add support for Python 3.11.

Full Changelog: v0.16.0...v0.16.1