
enhancement(aws_s3 sink): Add ability to configure request errors to retry #23206

Merged: 56 commits into vectordotdev:master on Jul 14, 2025

Conversation

jchap-pnnl
Contributor

@jchap-pnnl jchap-pnnl commented Jun 13, 2025

Summary

Adds a new field, retry_strategy (a RetryStrategy enum), to S3SinkConfig that lets users choose which error responses are retried. Users can either list the specific status codes of failed requests that should be retried automatically, or have all failed requests retried automatically.
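
As a rough illustration of the behavior described above, the sketch below models the option in plain Rust. It is not the code from this PR: the variant names, the signature of configured_to_retry, and the should_retry helper are assumptions made for the example, and the actual definition in src/sinks/s3_common/config.rs may look different. The "extend, not override" rule follows the documentation note in a later commit.

```rust
/// Hypothetical stand-in for the `retry_strategy` option (assumed shape,
/// not copied from src/sinks/s3_common/config.rs).
#[derive(Debug, Clone, PartialEq, Default)]
pub enum RetryStrategy {
    /// `retry_strategy` omitted: only the sink's default retry rules apply.
    #[default]
    None,
    /// `type: all`: retry every failed request.
    All,
    /// `type: custom`: retry only the listed HTTP status codes.
    Custom { status_codes: Vec<u16> },
}

impl RetryStrategy {
    /// Returns true when the user's configuration asks for this status
    /// code to be retried. The signature is an assumption for this sketch.
    pub fn configured_to_retry(&self, status: u16) -> bool {
        match self {
            RetryStrategy::None => false,
            RetryStrategy::All => true,
            RetryStrategy::Custom { status_codes } => status_codes.contains(&status),
        }
    }
}

/// Hypothetical helper: the strategy extends the default behavior, so a
/// request is retried if either the built-in rules or the user's
/// strategy say so.
pub fn should_retry(default_retriable: bool, strategy: &RetryStrategy, status: u16) -> bool {
    default_retriable || strategy.configured_to_retry(status)
}
```

With this model, a Custom strategy listing 403 retries a 403 response but leaves a 404 to the default rules, and an omitted strategy changes nothing.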

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

  1. Ran existing integration tests and verified they passed.
  2. Manually tested the new functionality. Generated 403 errors by changing access_key_id to an invalid value, then verified that the failed request was retried when the configuration called for it and was not retried otherwise. Ran tests with these combinations of retry_strategy, type, and status_codes:
    • retry_strategy: omitted | present, with:
      • type: all | custom
        • status_codes: [] | [403] | [404] | [403, 404]

Used this configuration:

# vector.yaml
sources:
  generate_syslog:
    type: "demo_logs"
    format: "syslog"
    count: 10

transforms:
  remap_syslog:
    inputs:
      - "generate_syslog"
    type: "remap"
    source: |
      structured = parse_syslog!(.message)
      . = merge(., structured)

sinks:
  s3:
    type: aws_s3
    encoding:
      codec: "json"
    region: <actual region>
    inputs:
      - remap_syslog
    bucket: <bucket name>
    auth:
      access_key_id: <key id value>
      secret_access_key: <key value>

    retry_strategy:
      type: all
      # type: custom
      # status_codes: [403]
  3. Considered adding a new test to src/sinks/aws_s3/integration_tests.rs, but didn't see a good option for detecting service retries. Considered a test using a timer that could measure how long a successful request takes and determine if retries are happening if the failed request takes longer. We rejected this option because the results could vary depending on the load of system resources or other factors. But please let us know if you recommend a testing approach that would work.
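
One timing-free possibility is to check the retry decision in isolation rather than observing retries against the service. The sketch below walks the manual test matrix listed above for a 403 response. It is only an illustration: the RetryStrategy model, the retries function, and the helper names are hypothetical and do not come from this PR, and it assumes (as the manual test observed) that a 403 is not retried by default.

```rust
/// Hypothetical model of the option; not Vector's actual types.
#[derive(Debug)]
enum RetryStrategy {
    Omitted,
    All,
    Custom(Vec<u16>),
}

/// Hypothetical decision function: does the configuration ask for a retry
/// of this status code? Assumes the status is not retried by default.
fn retries(strategy: &RetryStrategy, status: u16) -> bool {
    match strategy {
        RetryStrategy::Omitted => false,
        RetryStrategy::All => true,
        RetryStrategy::Custom(codes) => codes.contains(&status),
    }
}

/// The manual test matrix, paired with the expected outcome for a 403.
fn matrix_403() -> Vec<(RetryStrategy, bool)> {
    vec![
        (RetryStrategy::Omitted, false),
        (RetryStrategy::All, true),
        (RetryStrategy::Custom(vec![]), false),
        (RetryStrategy::Custom(vec![403]), true),
        (RetryStrategy::Custom(vec![404]), false),
        (RetryStrategy::Custom(vec![403, 404]), true),
    ]
}

/// Runs every row and reports the first mismatch, if any.
fn check_matrix() -> Result<(), String> {
    for (strategy, expected) in matrix_403() {
        let got = retries(&strategy, 403);
        if got != expected {
            return Err(format!("{:?}: expected {}, got {}", strategy, expected, got));
        }
    }
    Ok(())
}
```

A check like this covers only the decision logic, not whether the sink actually reissues the request, so it would complement rather than replace the manual test.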

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • The CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • cargo fmt --all
      • cargo clippy --workspace --all-targets -- -D warnings
      • cargo nextest run --workspace (alternatively, you can run cargo test --all)
      • ./scripts/check_changelog_fragments.sh
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

jchap-pnnl added 18 commits May 15, 2025 15:13
…l errors. Added retry_all_errors to S3SinkConfig and S3RetryLogic structs. Setting retry_all_errors to the default value in the generate_config function. Added self.retry_all_errors to the condition in the is_retriable_error function. (vectordotdev#10870)
… Added configured_to_retry and check_response functions to s3_common/config.rs. Added configured_to_retry call to is_retriable_error result in RetryLogic. (vectordotdev#10870)
@jchap-pnnl jchap-pnnl requested review from a team as code owners June 13, 2025 22:07
@bits-bot

bits-bot commented Jun 13, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added domain: sinks Anything related to the Vector's sinks domain: external docs Anything related to Vector's external, public documentation labels Jun 13, 2025
@thomasqueirozb thomasqueirozb added the meta: awaiting author Pull requests that are awaiting their author. label Jun 17, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jul 10, 2025
@pront pront added the meta: awaiting author Pull requests that are awaiting their author. label Jul 10, 2025
@github-actions github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jul 10, 2025
@jchap-pnnl
Contributor Author

I'm trying to figure out the two failing checks -- Test Suite / Checks and Test Suite / Test Suite.

In Test Suite / Checks, the failure is happening when cargo vdev check docs is run. Looks like the auto-generated documentation for retry_strategy in website/cue/reference/components/sinks/generated/aws_s3.cue isn't matching the required schema defined in website/cue/reference.cue. Seems to be complaining that a description isn't getting documented for status_codes. That's curious because there is a description on the StatusCodes variant of the RetryStrategy enum defined in src/sinks/s3_common/config.rs, but it's not getting pulled into the auto-generated documentation. Furthermore, status_codes isn't put in with the other enum variants of retry_strategy in the auto-generated documentation, I guess because they're strings but status_codes is a list. I wonder if the variants being different types is contributing to the problem.

Looking at .github/workflows/test.yml, maybe the Test Suite / Test Suite failure is simply because Test Suite / Checks failed.

Appreciate any help.

@pront
Member

pront commented Jul 11, 2025

I will take a look @jchap-pnnl

@jchap-pnnl
Contributor Author

Great, thanks!

@pront
Member

pront commented Jul 11, 2025

[image not captured]

@pront
Member

pront commented Jul 11, 2025

Please take another look at the current state of this PR and let us know if this is ready for the final review.

jchap-pnnl and others added 2 commits July 11, 2025 12:39
…Changed documentation comments explain that retry_strategy settings extend, not override, default retry behavior for the sink. (vectordotdev#10870)
@jchap-pnnl
Copy link
Contributor Author

I reran the tests, verified the functionality still works, and updated the PR description. I've addressed everything I'm aware of. Please resume reviewing. Thanks!

@pront pront enabled auto-merge July 14, 2025 18:54
@pront pront added this pull request to the merge queue Jul 14, 2025
Merged via the queue into vectordotdev:master with commit 8c199ac Jul 14, 2025
42 checks passed
@jchap-pnnl jchap-pnnl deleted the feature/failed-response-retry branch July 17, 2025 15:08
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sinks Anything related to the Vector's sinks
Development

Successfully merging this pull request may close these issues.

Consider expanding the cases where Vector retries requests
5 participants