Proxy release 2024-05-16 #7776

vipvap · 2024-05-16T06:01:29Z

Proxy release 2024-05-16

Please merge this Pull Request using 'Create a merge commit' button

Adds ordering asserts to the output of the delta key iterator `MergeDeltaKeys` that implements a k-merge. Part of #7296 : the asserts added by this PR get hit in the reproducers of #7296 as well, but they are earlier in the pipeline.

Improves the tiered compaction tests: * Adds a new test that is a simpler version of the ignored `test_many_updates_for_single_key` test. * Reduces the amount of data that `test_many_updates_for_single_key` processes to make it execute more quickly. * Adds logging support.

## Problem The main point of this PR is to get rid of `python-jose` and `ecdsa` packages as transitive dependencies through `moto`. They have a bunch of open vulnerabilities[1][2][3] (which don't affect us directly), but it's nice not to have them at all. - [1] GHSA-wj6h-64fc-37mp - [2] GHSA-6c5p-j8vq-pqhj - [3] GHSA-cjwg-qfpm-7377 ## Summary of changes - Update `moto` from 4.1.2 to 5.0.6 - Update code to accommodate breaking changes in `moto_server`

…ServerConf` (#7642) Before this PR, `neon_local` would store a copy of a subset of the initial `pageserver.toml` in its `.neon/config`, e.g, `listen_pg_addr`. That copy is represented as `struct PageServerConf`. This copy was used to inform e.g., `neon_local endpoint` and other commands that depend on Pageserver about which port to connect to. The problem with that scheme is that the duplicated information in `.neon/config` can get stale if `pageserver.toml` is changed. This PR fixes that by eliminating populating `struct PageServerConf` from the `pageserver.toml`s. The `[[pageservers]]` TOML table in the `.neon/config` is obsolete. As of this PR, `neon_local` will fail to start and print an error informing about this change. Code-level changes: - Remove the `--pg-version` flag, it was only used for some checks during `neon_local init` - Remove the warn-but-continue behavior for when auth key creation fails but auth keys are not required. It's just complexity that is unjustified for a tool like `neon_local`. - Introduce a type-system-level distinction between the runtime state and the two (!) toml formats that are almost the same but not quite. - runtime state: `struct PageServerConf`, now without `serde` derives - toml format 1: the state in `.neon/config` => `struct OnDiskState` - toml format 2: the `neon_local init --config TMPFILE` that, unlike `struct OnDiskState`, allows specifying `pageservers` - Remove `[[pageservers]]` from the `struct OnDiskState` and load the data from the individual `pageserver.toml`s instead.

Fixes flaky test `test_gc_of_remote_layers`, which was failing because of the `Nothing to GC` pageserver log. I looked into the fails, it seems that backround `gc_loop` sometimes started GC for initial tenant, which wasn't configured to disable GC. The fix is to not create initial tenant with enabled gc at all. Fixes #7538

## Problem We currently have no way to see what the current LSN of a compute its, and in case of read replicas, we don't know what the difference in LSNs is. ## Summary of changes Adds these metrics

The test utils should only be used during tests. Users should not be able to create this extension on their own. Signed-off-by: Alex Chi Z <chi@neon.tech>

- Rename "filename" types which no longer map directly to a filename (LayerFileName -> LayerName) - Add a -v1- part to local layer paths to smooth the path to future updates (we anticipate a -v2- that uses checksums later) - Rename methods that refer to the string-ized version of a LayerName to no longer be called "filename" - Refactor reconcile() function to use a LocalLayerFileMetadata type that includes the local path, rather than carrying local path separately in a tuple and unwrap()'ing it later.

## Problem See #6714, #6967 ## Summary of changes Completely ignore page header when comparing VM pages. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

For aux file keys (v1 or v2) the vectored read path does not return an error when they're missing. Instead they are omitted from the resulting btree (this is a requirement, not a bug). Skip updating the metric in these cases to avoid infinite results.

…(updates AWS SDKs) (#7664) Before this PR, using the AWS SDK profile feature for running against minio didn't work because * our SDK versions were too old and didn't include awslabs/aws-sdk-rust#1060 and * we didn't massage the s3 client config builder correctly. This PR * udpates all the AWS SDKs we use to, respectively, the latest version I could find on crates.io (Is there a better process?) * changes the way remote_storage constructs the S3 client, and * documents how to run the test suite against real S3 & local minio. Regarding the changes to `remote_storage`: if one reads the SDK docs, it is clear that the recommended way is to use `aws_config::from_env`, then customize. What we were doing instead is to use the `aws_sdk_s3` builder directly. To get the `local-minio` in the added docs working, I needed to update both the SDKs and make the changes to the `remote_storage`. See the commit history in this PR for details. Refs: * byproduct: smithy-lang/smithy-rs#3633 * follow-up on deprecation: #7665 * follow-up for scrubber S3 setup: #7667

## Problem Various performance test cases were destabilized by the recent upgrade of `reqwest`, because it changes an error string. Examples: - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/3f984e471a9029a5/ - https://neon-github-public-dev.s3.amazonaws.com/reports/main/9005532594/index.html#testresult/8bd0f095fe0402b7/ The performance tests suffer from this more than most tests, because they churn enough data that the pageserver is still trying to contact the storage controller while it is shut down at the end of tests. ## Summary of changes s/Connection refused/error sending request/

…cheduling optimization (#7673) ## Problem Storage controller was using a zero layer count in SecondaryProgress as a proxy for "not initialized". However, in tenants with zero timelines (a legitimate state), the layer count remains zero forever. This caused #7583 to destabilize the storage controller scale test, which creates lots of tenants, some of which don't get any timelines. ## Summary of changes - Use a None mtime instead of zero layer count to determine if a SecondaryProgress should be ignored. - Adjust the test to use a shorter heatmap upload period to let it proceed faster while waiting for scheduling optimizations to complete.

This PR does two things: First, it fixes a bug with tiered compaction's k-merge implementation. It ignored the lsn of a key during ordering, so multiple updates of the same key could be read in arbitrary order, say from different layers. For example there is layers `[(a, 2),(b, 3)]` and `[(a, 1),(c, 2)]` in the heap, they might return `(a,2)` and `(a,1)`. Ultimately, this change wasn't enough to fix the ordering issues in #7296, in other words there is likely still bugs in the k-merge. So as the second thing, we switch away from the k-merge to an in-memory based one, similar to #4839, but leave the code around to be improved and maybe switched to later on. Part of #7296

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

…ts/csharp/npgsql (#7680)

## Problem #7637 breaks forward compat test. On commit ea531d4. https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988324349/index.html ``` test_create_snapshot 2024-05-07T16:03:11.331883Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:03:11.316131763 UTC build_tag: build_tag-env:5159 test_forward_compatibility 2024-05-07T16:07:02.310769Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:07:02.294676183 UTC build_tag: build_tag-env:5159 ``` The forward compatibility test is actually using the same tag as the current build. The commit before that, https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988126011/index.html ``` test_create_snapshot 2024-05-07T15:47:21.900796Z INFO version: git-env:2dbd1c1ed5cd0458933e8ffd40a9c0a5f4d610b8 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:47:21.882784185 UTC build_tag: build_tag-env:5158 test_forward_compatibility 2024-05-07T15:50:48.828733Z INFO version: git-env:c4d7d5982553d2cf66634d1fbf85d95ef44a6524 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:50:48.816635176 UTC build_tag: build_tag-env:release-5434 ``` This pull request patches the bin path so that the new neon_local will use the old binary. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

## Problem This caused a variation of the stats bug fixed by #7662. That PR also fixed this case, but we still shouldn't make redundant get calls. ## Summary of changes - Only call get in the create image layers loop at the end of a range if some keys have been accumulated

## Problem Move from aws based arm64 runners to bare-metal based ## Summary of changes Changes in GitHub action workflows where `runs-on: arm64` used. More parallelism added, build time for `neon with extra platform builds` workflow reduced from 45m to 25m

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.

in addition to layer names, expand the input vocabulary to recognize lines in the form of: ${kind}:${lsn} where: - kind in `gc_cutoff` or `branch` - lsn is accepted in Lsn display format (x/y) or hex (as used in layer names) gc_cutoff and branch have different colors.

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

@hlinnaka

The old test based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only run into this if there is a bug, as the practical lsn range usually ends much earlier. Testing on `target_file_size` makes less sense, it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. Part of #7554

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994

In timeline detach ancestor tests there is no way to really be sure that there were no subtle off-by one bugs. One such bug is demoed and reverted. Add verifying fullbackup is equal before and after detaching ancestor. Fullbackup is expected to be equal apart from `zenith.signal`, which is known to be good because endpoint can be started without the detached branch receiving writes.

The test has been flaky since 2024-04-11 for unknown reason, and the logging was off. Fix the logging and raise the limit a bit. The problematic ratio reproduces with pg14 and added sleep (not included) but not on pg15. The new ratio abs diff limit works for all inspected examples. Cc: #7536

Per [evidence] the timeline ancestor detach tests can panic while shutting down on vectored get validation. Allow the error because tenant is restarted twice in the test. [evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7708/9058185709/index.html#suites/a1c2be32556270764423c495fad75d47/d444f7e5c0a18ce9

## Problem `report-benchmarks-failures` job is triggered for any failure in the CI pipeline, but we need it to be triggered only for failed `benchmarks` job ## Summary of changes - replace `failure()` with `needs.benchmarks.result == 'failure'` in the condition

A test for #7684. This pull request checks if the pageserver version we specified is the one actually running by comparing the git hash in forward compatibility tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem Shards with number >0 could hang waiting for `await_initial_logical_size`, as we don't calculate logical size on these shards. This causes them to hold onto semaphore units and starve other tenants out from proceeding with warmup activation. That doesn't hurt availability (we still have on-demand activation), but it does mean that some background tasks like consumption metrics would omit some tenants. ## Summary of changes - Skip waiting for logical size calculation on shards >0 - Upgrade unexpected code paths to use debug_assert!(), which acts as an implicit regression test for this issue, and make the info() one into a warn()

The [2.31.0 release](https://github.com/rui314/mold/releases/tag/v2.31.0) of mold includes a 10% speed improvement for binaries with a lot of debug info. As we have such, it might be useful to update mold to the latest release. The jump is from 2.4.0 to 2.31.0, but it's not been many releases in between as the version number was raised by the mold maintainers to 2.30.0 after 2.4.1 [to avoid confusion for some tools](https://github.com/rui314/mold/releases/tag/v2.30.0).

…#7747) This PR allows setting the `PAGESERVER_DEFAULT_TENANT_CONFIG_COMPACTION_ALGORITHM` env var to override the `tenant_config.compaction_algorithm` field in the initial `pageserver.toml` for all tests. I tested manually that this works by halting a test using pdb and inspecting the `effective_config` in the tenant status managment API. If the env var is set, the tests are parametrized by the `kind` tag field, allowing to do a matrix build in CI and let Allure summarize everything in a nice report. If the env var is not set, the tests are not parametrized. So, merging this PR doesn't cause problems for flaky test detection. In fact, it doesn't cause any runtime change if the env var is not set. There are some tests in the test suite that set used to override the entire tenant_config using `NeonEnvBuilder.pageserver_config_override`. Since config overrides are merged non-recursively, such overrides that don't specify `kind = ` cause a fallback to pageserver's built-in `DEFAULT_COMPACTION_ALGORITHM`. Such cases can be found using ``` ["']tenant_config\s*[='"] ``` We'll deal with these tests in a future PR. closes #7555

…#7740) ## Problem There are not enough arm runners and jobs in `neon-extra-builds` workflow take about the same amount of time on a small-arm runner as on large-arm. ## Summary of changes - Switch `neon-extra-builds` workflow from `large-arm64` to `small-arm64` runners

…file refs (#7752) ## Problem This is historical baggage from when the pageserver could be run with local disk only: we had a bunch of places where we had to treat remote storage as optional. Closes: #6890 ## Changes - Remove Option<> around remote storage (in #7722 we made remote storage clearly mandatory) - Remove code for deleting old metadata files: they're all gone now. - Remove other references to metadata files when loading directories, as none exist. I checked last 14 days of logs for "found legacy metadata", there are no instances.

They have merged our PR nical/rust_debug#4 but they haven't released a new crate version yet. refs #7763

refs #7753 This PR is step (1) of removing sync walredo from Pageserver. Changes: * Remove the sync impl * If sync is configured, warn! and use async instead * Remove the metric that exposes `kind` * Remove the tenant status API that exposes `kind` Future Work ----------- After we've released this change to prod and are sure we won't roll back, we will 1. update the prod Ansible to remove the config flag from the prod pageserver.toml. 2. remove the remaining `kind` code in pageserver These two changes need no release inbetween. See #7753 for details.

Adds a test that is a reproducer for many tiered compaction bugs, both ones that have since been fixed as well as still unfxied ones: * (now fixed) #7296 * #7707 * #7759 * Likely also #7244 but I haven't tried that. The key ordering bug can be reproduced by switching to `merge_delta_keys` instead of `merge_delta_keys_buffered`, so reverting a big part of #7661, although it only sometimes reproduces (30-50% of cases). part of #7554

The `= None` makes it not use the fixture. This slipped due to last-minute changes.

Add performance regress test for on-demand download throughput. Closes #7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>

FNV hash is simple, portable, and stable. This pull request vendors the FNV hash implementation from servo and modified it to use the u128 variant. replaces #7644 ref #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

…_policy (#7769) Signed-off-by: Alex Chi Z <chi@neon.tech>

github-actions · 2024-05-16T06:49:19Z

3060 tests run: 2933 passed, 0 failed, 127 skipped (full report)

Flaky tests (1)

Postgres 15

test_delete_timeline_client_hangup: debug

Code coverage* (full report)

functions: 31.5% (6350 of 20178 functions)
lines: 47.5% (47944 of 101008 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
c6d5ff9 at 2024-05-16T06:49:18.333Z :recycle:}

skyzh

LGTM as I don't think proxy release should be blocked on / affected by storage changes.

conradludgate · 2024-05-16T18:14:32Z

LGTM as I don't think proxy release should be blocked on / affected by storage changes.

There's no codeowner checks for release PRs ;)
We're not releasing proxy yet this week as we are blocked on a bug fix

danieltprice · 2024-05-23T22:25:47Z

Reviewed for changelog.

arpad-m and others added 30 commits May 8, 2024 13:22

Expose LSN and replication delay as metrics (#7610)

21e1a49

## Problem We currently have no way to see what the current LSN of a compute its, and in case of read replicas, we don't know what the difference in LSNs is. ## Summary of changes Adds these metrics

chore(neon_test_utils): restrict installation to superuser (#7624)

1173ee6

The test utils should only be used during tests. Users should not be able to create this extension on their own. Signed-off-by: Alex Chi Z <chi@neon.tech>

Revert "chore(neon_test_utils): restrict installation to superuser" (#…

2682e02

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

build(deps): bump Npgsql from 8.0.2 to 8.0.3 in /test_runner/pg_clien…

5ea117c

…ts/csharp/npgsql (#7680)

Proxy added per ep rate limiter (#7636)

be1a88e

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

Fix permissions for safekeeper failpoints (#7669)

0b02043

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

build: run doctests (#7697)

6206f76

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

Fix checkpoint metric (#7701)

95098c3

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

bayandin and others added 15 commits May 14, 2024 13:39

chore(test): add version check for forward compat test (#7685)

30d15ad

A test for #7684. This pull request checks if the pageserver version we specified is the one actually running by comparing the git hash in forward compatibility tests. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

Add restart_lsn metric for logical slots.

1a2a3cb

chore(deps): use upstream svg_fmt after they merged our PR (#7764)

bc78b0e

They have merged our PR nical/rust_debug#4 but they haven't released a new crate version yet. refs #7763

fixup #7747: actually use the fixture for neon_env_builder (#7767)

3ef6e21

The `= None` makes it not use the fixture. This slipped due to last-minute changes.

Add performance regress test_ondemand_download_churn.py (#7242)

affc18f

Add performance regress test for on-demand download throughput. Closes #7146 Co-authored-by: Christian Schwarz <christian@neon.tech> Co-authored-by: Alexander Bayandin <alexander@neon.tech>

feat(pageserver): use fnv hash for aux file encoding (#7742)

4b97683

FNV hash is simple, portable, and stable. This pull request vendors the FNV hash implementation from servo and modified it to use the u128 variant. replaces #7644 ref #7462 --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

fix(test): ensure fixtures are correctly used for pageserver_aux_file…

c6d5ff9

…_policy (#7769) Signed-off-by: Alex Chi Z <chi@neon.tech>

vipvap requested review from a team as code owners May 16, 2024 06:01

vipvap requested review from jcsp, conradludgate and skyzh and removed request for a team May 16, 2024 06:01

skyzh approved these changes May 16, 2024

View reviewed changes

conradludgate closed this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxy release 2024-05-16 #7776

Proxy release 2024-05-16 #7776

vipvap commented May 16, 2024

github-actions bot commented May 16, 2024

Postgres 15

skyzh left a comment

conradludgate commented May 16, 2024

danieltprice commented May 23, 2024

Proxy release 2024-05-16 #7776

Proxy release 2024-05-16 #7776

Conversation

vipvap commented May 16, 2024