
Add Parquet codec to v0.51-exaforce #35

Merged
sundaresanr merged 1 commit into v0.51-exaforce from v0.51-exaforce-parquet
Apr 18, 2026

Conversation

@sundaresanr

Summary

Cleanly ports the parquet codec from v0.45-exaforce onto upstream v0.51, dropping the azure_blob queue source and GCP impersonation: both are confirmed unused in production Vector configs, and the azure_blob patches don't compile against v0.51's azure-core 0.25 anyway. Replaces the now-reverted PR #31.

Single commit, 12 files, +3844/-2159 (Cargo.lock is the bulk).

What's in the patch

  • `lib/codecs/Cargo.toml`: parquet 39.0.0 dep
  • `lib/codecs/src/encoding/format/parquet.rs`: new `ParquetSerializer` (1317 lines). The `BoolColumnWriter` match arm uses a bare binding (not `ref mut`), per Rust 2024 match ergonomics.
  • `lib/codecs/src/encoding/format/mod.rs`: expose `ParquetSerializer*`
  • `lib/codecs/src/encoding/mod.rs`: `pub use serializer::BatchSerializer`
  • `lib/codecs/src/encoding/serializer.rs`: `SerializerConfig::Parquet` variant, `BatchSerializer` enum, `build_batched()` with cfg-guarded `Otlp` arm. Imports `Parquet*` from `super::format`; `Gelf` match arm uses tuple syntax to match the existing enum.
  • `src/codecs/encoding/config.rs`: `EncodingConfigWithFraming::build_batched()`; preserves `vector_lib::configurable::configurable_component` import that the previous rebase attempt accidentally dropped.
  • `src/components/validation/resources/mod.rs`: `Parquet` arm in serializer→deserializer match
  • `src/sinks/util/encoding.rs`: `Encoder<Vec>` trait for batched codec usage
  • `src/sinks/aws_s3/{config,sink}.rs`: use the batched encoder so `aws_s3` sink emits parquet files
  • `Cargo.{toml,lock}`: transitive deps for parquet (arrow, etc.). `pulsar` stays at 6.3.1 (avoiding the 6.5.0 drift the previous rebase introduced which broke the pulsar sink).
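
The `ref mut` note above refers to the Rust 2024 edition's match-ergonomics rules: when a pattern already matches through a `&mut` reference, bindings are by-reference by default, and writing an explicit `ref mut` is rejected. A minimal sketch with a hypothetical `ColumnWriter` stand-in (not the real `BoolColumnWriter`):

```rust
// Hypothetical stand-in for the codec's per-type column writers.
enum ColumnWriter {
    Bool(Vec<bool>),
    Int64(Vec<i64>),
}

fn push_bool(writer: &mut ColumnWriter, value: bool) {
    // `writer` is a `&mut ColumnWriter`, so the match operates in a
    // by-reference binding mode and `values` is already `&mut Vec<bool>`.
    // Under the Rust 2024 edition, an explicit `ref mut values` here is
    // an error ("binding modifiers may only be written when the default
    // binding mode is `move`"); the bare binding is the fix.
    match writer {
        ColumnWriter::Bool(values) => values.push(value),
        _ => unreachable!("column type mismatch"),
    }
}
```

The same code compiles on earlier editions too, since bare bindings have been the idiomatic form since match ergonomics landed in Rust 2018.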

Local verification

  • `cargo check -p vector` (default features — what `make package-aarch64-unknown-linux-gnu-all` uses) — green
  • `cargo check --tests -p vector --no-default-features --features azure-integration-tests` (what `make test-integration-azure` uses) — green

Downstream

Once merged, the `exaforce-publish.yaml` push trigger will publish `vector-base:v0.51-exaforce-` to ECR. A separate follow-up PR in the `operations` repo then bumps `goservices/vector_exec_sources/Dockerfile:3`.

@sundaresanr force-pushed the v0.51-exaforce-parquet branch 2 times, most recently from cf47090 to 86fa5b0 on April 17, 2026 at 23:16
Ports the parquet codec from v0.45-exaforce onto upstream v0.51 so
the aws_s3 sink can emit parquet files. The azure_blob queue source
and GCP impersonation pieces that also lived on v0.45-exaforce were
dropped — neither is used in any production Vector config.
@sundaresanr force-pushed the v0.51-exaforce-parquet branch from 86fa5b0 to 3e5fc44 on April 18, 2026 at 00:17
@sundaresanr merged commit 46d4a56 into v0.51-exaforce on Apr 18, 2026
1 check passed
sundaresanr added a commit that referenced this pull request Apr 20, 2026
Brings the parquet codec from v0.51-exaforce-parquet (PR #35) onto
upstream v0.54.0. Enables the aws_s3 sink to emit parquet files via
`encoding.codec: parquet`, preserving the existing user-facing config
that production vector configs already use.

Design differences vs PR #35, to fit v0.54's already-reshaped
batched encoding infrastructure:

- Upstream v0.54 gained a `BatchSerializer` enum (in
  lib/codecs/src/encoding/encoder.rs) with an `Arrow` variant behind
  the `arrow` feature, plus `BatchEncoder` and `EncoderKind::Batch`.
  We add `BatchSerializer::Parquet(ParquetSerializer)` alongside the
  `Arrow` variant rather than defining a new `BatchSerializer` enum
  in serializer.rs. `EncoderKind::Batch` is now ungated.
- aws_s3 sink's `S3RequestOptions::encoder` switches from
  `(Transformer, Encoder<Framer>)` to `(Transformer, EncoderKind)`.
  The existing `Encoder<Vec<Event>> for (Transformer, EncoderKind)`
  impl in src/sinks/util/encoding.rs dispatches between framed and
  batched paths, so there is no need for the `Arc<dyn Encoder>`
  polymorphism PR #35 used on v0.51.
- `(Transformer, BatchEncoder)` impl is no longer gated behind
  `codecs-arrow`, since parquet is now always-on.

Parquet support itself is unchanged: `lib/codecs/src/encoding/format/parquet.rs`
is copied verbatim from PR #35 (parquet 39.0.0, 1317 lines), and
`SerializerConfig::Parquet { parquet: ParquetSerializerOptions }` plus
`SerializerConfig::build_batched()` and
`EncodingConfigWithFraming::build_batched()` are preserved.

Verified with `cargo check -p vector` (default features).
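
The framed-vs-batched dispatch the commit message describes can be sketched with simplified stand-in types. These are illustrative stubs under assumed shapes, not vector's real definitions (the real `EncoderKind`/`BatchSerializer` live in lib/codecs and src/sinks/util/encoding.rs):

```rust
// Simplified stand-ins mirroring the names in the commit message.
struct FramedEncoder {
    delimiter: u8,
}

enum BatchSerializer {
    Arrow,   // upstream variant, behind the `arrow` feature
    Parquet, // variant added by this port
}

enum EncoderKind {
    Framed(FramedEncoder),
    Batch(BatchSerializer),
}

// Sketch of the `Encoder<Vec<Event>>` dispatch: framed codecs encode
// each event and join with a delimiter; batched codecs consume the
// whole batch in one call (a real Parquet serializer would build a
// columnar file here, stubbed out below).
fn encode_batch(kind: &EncoderKind, events: &[&str]) -> Vec<u8> {
    match kind {
        EncoderKind::Framed(f) => {
            let mut out = Vec::new();
            for (i, e) in events.iter().enumerate() {
                if i > 0 {
                    out.push(f.delimiter);
                }
                out.extend_from_slice(e.as_bytes());
            }
            out
        }
        EncoderKind::Batch(_) => format!("batch[{}]", events.len()).into_bytes(),
    }
}
```

This is why the `Arc<dyn Encoder>` polymorphism from the v0.51 port is unnecessary on v0.54: a single enum dispatch covers both paths.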