Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid slowdown when using multiple masks #14617

Merged
merged 20 commits into from
Dec 7, 2023

Conversation

georgeee
Copy link
Member

Problem: testing with test-apply tool revealed that using multiple masks leads to 7x-8x slowdown of block application (for 128 txs). This happens because instead of a single logarithmic lookup 290 of such lookups are performed instead.

Solution: merge maps state from masks and keep it in the topmost mask. Use them for lookups, reducing 290 map lookups to a single one.

Explain how you tested your changes:

  • Measured performance with test-apply tool
  • Deployed the change on whale-1-1, confirmed performance improvement and normal operation and block production

Checklist:

  • Dependency versions are unchanged
    • Notify Velocity team if dependencies must change in CI
  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them

@georgeee georgeee requested a review from a team as a code owner November 27, 2023 21:22
@georgeee georgeee force-pushed the georgeee/avoid-slowdown-of-multiple-masks branch from 95430a9 to d39a07c Compare November 28, 2023 18:38
@georgeee georgeee force-pushed the georgeee/avoid-slowdown-of-multiple-masks branch from d39a07c to 12a8428 Compare November 28, 2023 20:37
Copy link
Member

@nholland94 nholland94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I get the high level concepts, but I'm having a hard time following the logic here. Rather than explaining it here on the PR, could you add comments to the mask code to make it clearer? For example, a comment at the top which describes the overall accumulator concept, and some comments on the accumulator fields to explain the current and next maps records? The logic around the detachment signals would be useful to have documented as well.

On another note, this entire changeset is unsafe in the event that we update a mask that has children. While I'm pretty sure we don't do this under normal operating conditions within the frontier, we do use masks throughout the codebase, so we would need to carefully audit all usages to make sure we never do this. I'd be more comfortable taking this sort of change if we were to actually refactor the mask interface to be immutable instead of mutable. This gives me some pause in this direction, but I understand the performance gains of this are really good. So I have a few questions:

  1. Are the performance gains this brings necessary to unblock the issue we are seeing in Berkeley, or are the other performance gains enough?
  2. Is there another approach we can take here which will optimize the mask chaining without having to introducing a new invariant into the masking interface?

let self_find_hash t address =
assert_is_attached t ;
Map.find !(t.hashes) address
let update_maps ~f t =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: since update_maps is always taking a function that is mutating fields on the record, it might make more sense to just make the fields storing a maps_t to be mutable instead, and then each function can just be implemented as an immutable update on that record of maps, allowing update_maps to hide the mutability away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be implemented this way, I agree

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nholland94 could you implement this change?

@georgeee georgeee force-pushed the georgeee/avoid-slowdown-of-multiple-masks branch from 12a8428 to 8914406 Compare November 29, 2023 17:52
}
[@@deriving sexp]

let maps_copy { accounts; token_owners; hashes; locations } =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this just the identity function? Should we remove this function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not. Underlying type contains mutable variables, hence we're recreating variables here

Problem: 290 layers of masking trees makes processing of transactions to
be O(290*logn) where n is the size of a mask. This 290x factor is a
significant slowdown

Solution: have a mechanism to replace O(290*logn) to something around
O(30*logn).

This PR is prototype, without logic integrated into frontier handling
@georgeee georgeee force-pushed the georgeee/avoid-slowdown-of-multiple-masks branch from ac44c5c to d05242c Compare December 1, 2023 21:09
@georgeee
Copy link
Member Author

georgeee commented Dec 2, 2023

!ci-nightly-me

@georgeee
Copy link
Member Author

georgeee commented Dec 2, 2023

!ci-nightly-me

@ghost-not-in-the-shell
Copy link
Contributor

ghost-not-in-the-shell commented Dec 4, 2023

I can basically understand the ideas behind this optimization. You are caching the updates to accounts in accumulated. But I am not very clear with the changes you did in this commit: 7fbea17 Why we now have a next and current and a detached_next_signal?

Also, before this optimization, we are using remove_account_and_update_hashes to update the current map. So basically when the mask gets commited, the base ledger would notify that mask to remove the accounts in the mask. But it seems like you are using a different ways to manage the accumulated maps (namely the detached_next_signal).

It seems like the accumulated maps only gets updated here:

    let maps_and_ancestor t =
      Option.iter t.accumulated
        ~f:(fun { detached_next_signal; next; base; current = _ } ->
          if Async.Ivar.is_full detached_next_signal then
            t.accumulated <-
              Some
                { next = t.maps
                ; current = next
                ; detached_next_signal = t.detached_parent_signal
                ; base
                } ) ;
      match t.accumulated with
      | Some { current; base; _ } ->
          (current, base)
      | None ->
          (t.maps, get_parent t)

But this would only gets triggered when there's a read hits this mask. So it seems to me that the accumulated maps would be updated every k masks. My understanding is that there would be k masks with accumulated maps and then the root shifted and at this point the top mask would receive this detached_next_signal and change its detched_next_signal to be detached_parent_signal, and then we have another round k masks.

I guess without the detached_next_signal and the current/next rotation, the accumulated map would keep growing as the chain moves on, because when we unset_parent, we just reset the accumulated maps that being unparented and it won't affect its children.

This is just my understanding of what's happening here (could be wrong). Can you explain a little bit on how you intend to make it work? @georgeee

@ghost-not-in-the-shell ghost-not-in-the-shell requested a review from a team as a code owner December 4, 2023 19:16
@ghost-not-in-the-shell
Copy link
Contributor

!ci-build-me

; accumulated =
Option.map t.accumulated ~f:(fun acc ->
{ base = acc.base
; detached_next_signal = detached_parent_signal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be detached_next_signal instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems that the logic here is meant to be that we carry the detached_next_signal through until we hit a layer where we trigger a rotation. Depending on how we use copy in the code, this could lead to the accumulator never being rotated, causing a memory leak.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bug indeed
Initial implementation got it right, then in 3dba63a I changed it to what it is now, but this was a wrong change

Problem: accumulated solution is heavily relied upon assumption of
immutability of a parent ledger/mask. If this assumption is accidentally
broken, this might be problematic.

Solution: when a conflicting update to parent is detected, log it and
reset the accumulated state all down the ancestry.
Reverts a faulty change (original version was correct).

This reverts commit 3dba63a.
Base automatically changed from feature/ledger-mask-maps to feature/sparse-ledger-wide-merkle-paths December 5, 2023 06:32
Base automatically changed from feature/sparse-ledger-wide-merkle-paths to feature/sparse-ledger-of-subset-no-mutability December 5, 2023 06:32
Base automatically changed from feature/sparse-ledger-of-subset-no-mutability to feature/merkle-mask-empty-account-preloading December 5, 2023 06:32
Base automatically changed from feature/merkle-mask-empty-account-preloading to feature/merkle-mask-preloading December 5, 2023 06:32
Base automatically changed from feature/merkle-mask-preloading to feature/improved-merkle-masks-for-staged-ledger December 5, 2023 06:33
Base automatically changed from feature/improved-merkle-masks-for-staged-ledger to feature/batch-account-lookups December 5, 2023 06:33
Base automatically changed from feature/batch-account-lookups to rampup December 5, 2023 06:34
@deepthiskumar
Copy link
Member

!ci-nightly-me

1 similar comment
@ghost-not-in-the-shell
Copy link
Contributor

!ci-nightly-me

@@ -493,7 +495,6 @@ module Make (Inputs : Inputs_intf.S) = struct
Base.merkle_root ancestor

let remove_account_and_update_hashes t location =
t.accumulated <- None ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My original thought was that you could just remove this line here. No other changes should be required.

Among all the children of the root breadcrumb. If it's not on the canonical chain, then they would be treated as garbage and those would be collected, so I think there's no need to manually set accumulated to be None. For the mask on the canonical chain, we don't need to do anything. So remove this would be enough.

What you did here adds an extra bit of complexity to the code and also do the cleanup manually. I am not sure whether this is the best choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After george explained to me, now I understand that this actually does safeguard against possible bug. This does safeguard against possible errors. The newly added logs and cleanup are exactly some "safeguard" for this PR.

@@ -25,6 +25,8 @@ module Make (Inputs : Inputs_intf) = struct
open Inputs
include Base

let logger = Logger.create ()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we creating a logger here instead of passing one into the functions that need it? It's best to pass the logger around in the code and not create ad-hoc loggers, as we propagate metadata through loggers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it now and it seems like I'd need to change 30-50 lines of code (I didn't finish). Do you think this is completely necessary?

This change is a safeguard and we don't expect this log to ever occur in real life. If it does, we probably will have means to deduce context of it occurring even with truncated logging context

@@ -144,8 +146,8 @@ module Make (Inputs : Inputs_intf) = struct
let unsafe_preload_accounts_from_parent =
Mask.Attached.unsafe_preload_accounts_from_parent

let register_mask t mask =
let attached_mask = Mask.set_parent mask t in
let register_mask ?accumulated t mask =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use an explicit named argument here instead of an optional argument. It helps to avoid bugs where we fail to thread a value through, and I'm not sure if we call this function frequently enough directly to make it worth avoid the explicit ~accumulated:None argument.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 32 usages of Maskable.register_mask (most in tests) and only one usage where a value is actually passed (one in Ledger module, the only place we actually want this change).

So I'd say avoiding explicit call is worth it.

if not (Mask.Attached.is_committing mask) then (
Mask.Attached.parent_set_notify mask account ;
let child_uuid = Mask.Attached.get_uuid mask in
Mask.Attached.drop_accumulated mask;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need a reformat.

( match accumulated_opt with
| Some { current; next; base; detached_next_signal }
when Option.is_none t.accumulated ->
maps_merge current t.maps ;
Copy link
Member

@nholland94 nholland94 Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mutably merges the contents of t.maps into the values passed in from accumulated_opt. I don't think that's what we want, and will lead to bugs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm addressing this one in my update_maps refactor. I left the other issue I pointed out here till later, as it requires more changes to implement.

Copy link
Member Author

@georgeee georgeee Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is actually by design. When we construct a value that we pass in accumulated_opt, we make a copy, so that I saw no need to make another copy here and merging seemed natural.

I'm not against rewriting it, that is: now it's not buggy, but could be easily misused in future.

maps_merge current t.maps ;
maps_merge next t.maps ;
t.accumulated <- Some { current; next; base; detached_next_signal }
| _ ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the None when Option.is_some t.accumulated case, shouldn't we pull the accumulator from our new parent, and merge our local maps into it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how I designed this feature.
Now you have to explicitly use the accumulator in set_parent, otherwise old semantics of doing a local lookup in t.maps and then making a call to parent would be used.

I was cautious not to use this accumulator semantics in contexts where I probably don't need it (e.g. in tx processing). I'd prefer to do a proper refactoring before making this mechanism fully universal.

Now it functions only for 290 masks of frontier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. when re-parenting happens, there is no need to merge maps, we can continue using our own maps without invariant being altered.

@nholland94
Copy link
Member

!ci-build-me

@nholland94
Copy link
Member

!ci-nightly-me

{ t.maps with
accounts = Location_binable.Map.empty
; hashes = Addr.Map.empty
; token_owners = Token_id.Map.empty
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locations is missing

maps_merge current t.maps ;
maps_merge next t.maps ;
t.accumulated <- Some { current; next; base; detached_next_signal }
| _ ->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how I designed this feature.
Now you have to explicitly use the accumulator in set_parent, otherwise old semantics of doing a local lookup in t.maps and then making a call to parent would be used.

I was cautious not to use this accumulator semantics in contexts where I probably don't need it (e.g. in tx processing). I'd prefer to do a proper refactoring before making this mechanism fully universal.

Now it functions only for 290 masks of frontier.

maps_merge current t.maps ;
maps_merge next t.maps ;
t.accumulated <- Some { current; next; base; detached_next_signal }
| _ ->
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P.S. when re-parenting happens, there is no need to merge maps, we can continue using our own maps without invariant being altered.

@georgeee
Copy link
Member Author

georgeee commented Dec 7, 2023

!ci-build-me

@deepthiskumar
Copy link
Member

!approved-for-mainnet

@deepthiskumar
Copy link
Member

!ci-nightly-me

@deepthiskumar deepthiskumar merged commit 5c0eb5b into rampup Dec 7, 2023
32 of 35 checks passed
@deepthiskumar deepthiskumar deleted the georgeee/avoid-slowdown-of-multiple-masks branch December 7, 2023 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants