Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Updating validator committee every two rounds will cause leader manipulation and permanent chain state split #2986

Closed
feezybabee opened this issue Jan 10, 2024 · 21 comments
Labels
bug Incorrect or unexpected behavior

Comments

@feezybabee
Copy link

feezybabee commented Jan 10, 2024

https://hackerone.com/reports/2289066

Summary:

AleoBFT updates the committee (stake, status, etc) with every block. At each even round, elect the leader from the committee with latest committed block. This will introduce two issues:

** Issue a. Leader Manipulation **:

The leader is selected from the committee by the seed which is composed of self.starting_round, current_round, total_stake. However, the current leader can manipulate total_stake by adding bond_public transactions in its batch. The leader can easily find the target total_stake which make itself the leader for the next round. Thus, a leader can keep itself as a leader forever.

** Issue b. Permanent Chain State Split **

In current AleoBFT, the leader election directly depends on the latest committed block. However, for asynchronous network, some validators may have committed a block but some others not, making validators elect different leaders. This will cause permanent chain state split.

I will provide Steps To Reproduce for Issue a and Proof-of-Concept for Issue b.

Steps To Reproduce:

For Issue a.

  1. Clone https://github.com/Gooong/snarkOS/tree/manipulate_consensus
  2. Run ./devnet.sh with default setting
  3. Wait for 100 rounds
  4. Check validator-0.log and find that the validator 0 has been leader for most even rounds. (Just count the line by searching elected a leader - aleo1rhgdu77hgyqd3xjj8ucu3jj9r2krwz6mnzyd80gncr5fxcwlh5rsvzp9px and compare that by count the line elected a leader)

Also see the example validator-0.log at the attachment.

Note that, in our code, Validator 0 only send the transaction with best effort. In practice, the the leader can directly simulate and add the transaction to the batch. This will 100% make itself the leader of future rounds and other validator can not do anything to prevent this.

For issue b.

Take the picture in the bullshark blog as example.

image

Suppose we are in Round 3, while deciding the leader for current round, Validator 1 will find that Round 1 hasn't been committed so it will use committee in the round before Round 1. However, in Validator 4's view, Round 1 has been committed so it just decide the leader from Round 1's committee. This will cause permanent chain state split.

Additional considerations

The ambiguous of latest committee will also introduce other tricky problems:

If last block haven't been committed, we can't tell if the current round BatchCertificate (either from other peers or from myself) has reached quorum. Because the committe will change once the last block is committed. This function in LedgerService makes the committee ambiguous.

    /// Returns the committee for the given round.
    /// If the given round is in the future, then the current committee is returned.
    fn get_committee_for_round(&self, round: u64) -> Result<Committee<N>> {
        match self.ledger.get_committee_for_round(round)? {
            // Return the committee if it exists.
            Some(committee) => Ok(committee),
            // Return the current committee if the round is in the future.
            None => {
                // Retrieve the current committee.
                let current_committee = self.current_committee()?;
                // Return the current committee if the round is in the future.
                match current_committee.starting_round() <= round {
                    true => Ok(current_committee),
                    false => bail!("No committee found for round {round} in the ledger"),
                }
            }
        }
    }

I think we can fix this by:

  1. In current_round, use the committe from current_round - COMMITTEE_LAG_ROUND round. (COMMITTEE_LAG_ROUND could be large enough like 100 or 1000) The function will be like:
    /// Returns the committee for the given round.
    /// If the given round is in the future, then the current committee is returned.
    fn get_committee_for_round(&self, round: u64) -> Result<Committee<N>> {
        match self.ledger.get_committee_for_round(round.saturating_sub(COMMITTEE_LAG_ROUND) )? {
            // Return the committee if it exists.
            Some(committee) => Ok(committee),
            // Return the current committee if the round is in the future.
            None => {
                bail!("No committee found for round {round} in the ledger"),
            }
        }
    }

In this way, the concensue will stall if the latest COMMITTEE_LAG_ROUND round hasn't been committed.

  1. Or keep the committee the same for the whole Epoch. But we need to be cautious at the Epoch boundary.

Supporting Material/References:

See the attachment

Fix Suggestions:

The problem here is that AleoBFT's committee directly depends on the latest committed state. However, the latest state is easy to manipulate and can temporarily mismatch.

It's suggested to fix the committee for some fixed rounds. Also, while choosing leader, only use random seed that is hard to manipulate.

In SUI, the validator committee is fixed for the whole epoh (~24h). In the codebase of bullshark paper, it directly use round-robin: https://github.com/asonnino/narwhal/blob/667595d9dc82a472bade5afd59993d8479047a48/consensus/src/lib.rs#L205

Impact

Current AleoBFT implementation will cause: (a) leader manipulation and (b) permanent chain state split

@feezybabee feezybabee added the bug Incorrect or unexpected behavior label Jan 10, 2024
@randomsleep
Copy link
Contributor

It will also make it hard to validate BatchPropose and Certificate. See this thread: #2956 (comment)

This issue has an important impact on the Committee update and Leader election. So I would suggest fixing this issue first.

@raychu86
Copy link
Contributor

raychu86 commented Jan 21, 2024

This is a valid/known problem, however is of lower impact.

In theory an attack like this could occur, however this doesn't take into account possible changes to committee stakes from external bond_public calls. The assumption here is that nobody else on the network is calling bond_public or unbond_public when the attack is being performed.

In addition, the proposed fix using an older committee (COMMITTEE_LAG_ROUND) is not sound. Using old committee sets may include validators that are not in the committee, thereby electing leaders that do not exist.

@randomsleep
Copy link
Contributor

randomsleep commented Jan 21, 2024

@raychu86 The code provided is just a proof of concept. The vulnerability does exist in practice:

Suppose in the current round, I am the leader of the committee. While proposing the BatchPropose, I can get and simulate all the transactions in the causal certificate. Then I can find the target BatchPropose that makes me a leader in future rounds. The problem is that the leader is the block producer. It can choose specific parent certificates, add transactions in its batch as it likes, and simulate the block in advance. Therefore, no matter what transaction other validators have included, the malicious leader can always make itself a leader in future rounds.

An example of the harm of leader manipulation:

The malicious validators can easily enumerate blocks to make it a leader for many consecutive even rounds. Assuming 1/4 of validators are malicious, they only need to enumerate O(4**k) times to make them leaders in k consecutive even rounds. Then in the first k-1 consecutive even rounds, the malicious can halt the chain by not broadcasting certificates. This means they can easily delay block production as they like. This may cause more problems if you consider the PoW part.

@randomsleep
Copy link
Contributor

About using an older committee (COMMITTEE_LAG_ROUND):

Yes, we should not use old committee sets. I mean we can define the current round committee as generated from current_round - COMMITTEE_LAG_ROUND.

Nonetheless, to resolve the ambiguity of the latest committee, we may need to decide on the committee in the earlier round (COMMITTEE_LAG_ROUND). This means the validator has to wait for specific rounds after they request to quit the committee. To ease leader manipulation, we may need to avoid using latest_state as a random seed.

@randomsleep
Copy link
Contributor

randomsleep commented Jan 23, 2024

More details about using COMMITTEE_LAG_ROUND:

Update Committee at Every Block

Let's define COMMITTEE_LAG_ROUND=1000. The idea is that at round(r), try to use the committee generated at round(r-1000). The committee is deterministic if the chain has committed a block with block.round >= round(r-1000). This resolves the ambiguity of the latest committee.

It's possible that round(r-1000) didn't produce a block. In this case, we use the committee generated at the first block whose round >= round(r-1000).

Refer to this graph for more operations:

image

@randomsleep
Copy link
Contributor

As AleoNet/snarkVM#2213 shows atomic_post_ratify is slow with thousands of delegators, we may consider keeping the same committee during the whole epoch. Only update committee and stake reward at epoch boundary.

Keep the Same Committee during Epoch

Let's define PoS Epoch = 1000 rounds. Divide round into Epoch. The Epoch of Round(r) is Epoch(r/1000). At the first block of Epoch(i+1), generate the committee for Epoch(i+2). Panic if there is no block in the whole Epoch. In this way, the committee is deterministic while entering a new epoch.

image

Once commit a new block at around r, the validator will try to update the committee for the next epoch:

let current_epoch = r / N
let last_epoch = last_block.epoch
if current_epoch == last_epoch:
    return
// every epoch must have committed at least one block
assert(current_epoch == last_epoch+1)
finalize the reward for Epoch(last_epoch)
compute the committee for Epoch(current_epoch + 1)

@howardwu
Copy link
Contributor

Suppose in the current round, I am the leader of the committee. While proposing the BatchPropose, I can get and simulate all the transactions in the causal certificate. Then I can find the target BatchPropose that makes me a leader in future rounds. The problem is that the leader is the block producer. It can choose specific parent certificates, add transactions in its batch as it likes, and simulate the block in advance. Therefore, no matter what transaction other validators have included, the malicious leader can always make itself a leader in future rounds.

The intuition here is not correct.

BlockDAGs behave differently from traditional PBFT leaders. The leader only has batch_timeout time to propose a valid batch and have it be certified by at least 2f+1 validators. In addition, leaders do NOT know in advance that they will be the leader. Bullshark only elects leaders in the following round, meaning they look backwards in time to determine if a leader was elected, based on f+1 votes from validators in the future round.

@randomsleep
Copy link
Contributor

In addition, leaders do NOT know in advance that they will be the leader. Bullshark only elects leaders in the following round, meaning they look backwards in time to determine if a leader was elected, based on f+1 votes from validators in the future round.

In the current implementation, the leader is deterministic, based on the previous committee:

        // Retrieve the previous committee of the current round.
        let previous_committee = match self.ledger().get_previous_committee_for_round(current_round) {
            Ok(committee) => committee,
            Err(e) => {
                error!("BFT failed to retrieve the previous committee for the even round {current_round} - {e}");
                return false;
            }
        };
        // Determine the leader of the current round.
        let leader = match previous_committee.get_leader(current_round) {
            Ok(leader) => leader,
            Err(e) => {
                error!("BFT failed to compute the leader for the even round {current_round} - {e}");
                return false;
            }
        };

I have created another example to demonstrate perfect leader manipulation: randomsleep@2e8cbe4

In the example, Validator 0 is malicious. Validator 1, 2, and 3 are honest and they randomly send bond_public transactions. While proposing batch, Validator 0 will speculate all the transactions and then insert bond_public transactions with specific amount. This makes Validator 0 always leader in future rounds.

Step To Reporduce:

  1. Clone the repo: git clone https://github.com/randomsleep/snarkOS.git
  2. Checkout branch: git checkout perfect_leader_manipulation
  3. Run ./devnet with the default setting.
  4. Wait 100 rounds and check log. Find that Validator 0 is always the leader after some rounds (just search elected a leader - aleo1rhgdu77hgyqd3xjj8ucu3jj9r2krwz6mnzyd80gncr5fxcwlh5rsvzp9px)

image

Here is the log sample:
logs-20240129114403.zip

@ghostant-1017
Copy link
Contributor

@randomsleep Great job!
I found the leader is deterministic as well, but I thought it is safe at that time.

@raychu86
Copy link
Contributor

Closing this issue as it is resolved as a byproduct of - #3065.

The reason we needed the change is because we need consistent and predictable leader selection in the block-per-anchor model, but just so happens to address the current concerns above.

@randomsleep
Copy link
Contributor

randomsleep commented Feb 12, 2024

@raychu86 Great work! It resolves the ambiguity of the committee.

This issue hasn't been fully fixed though. Not very even round will commit a block. The round leader may fail to create a Certificate in time. Therefore, it's possible that some rounds don't have a corresponding committee in the map.

https://github.com/AleoHQ/snarkOS/blob/26ae0a98f0b94adca2dfd1c863e730343a1fbe5b/node/bft/ledger-service/src/ledger.rs#L129

In this case, the committee should refer to the next committed block (as described here: #2986 (comment))

@randomsleep
Copy link
Contributor

It appears that the issue of the leader manipulation issue is still unresolved. Please correct me if I am missing something.

@raychu86
Copy link
Contributor

raychu86 commented Feb 14, 2024

@randomsleep If you look at the code for get_committee_for_round in snarkVM, you'll see that even if there is no committee/block being created at that round, there is still a committee selected, which is the committee of the previous committed block.

If round 10 has a block and round 16 has a block. The committee at round 11-15 will be the committee created at round 10.

I believe this should be sufficient for addressing the issue. Please let me know if you still have concerns.

@randomsleep
Copy link
Contributor

@raychu86 Got it, thanks!

If round 10 has a block and round 16 has a block. The committee at round 11-15 will be the committee created at round 10.

The insert(&self, next_height: u32, committee: Committee<N>) function will set the height of round 11-15 to the same height at round 10. I am curious why not set the height of round 11-15 to the same height at round 16? Logically round 11-15 is committed by the block at round 16.

@randomsleep
Copy link
Contributor

I don't think this resolves issue of the leader manipulation. The leader is elected with a random seed total_stake and it manipulatable in the same way: #2986 (comment)

@howardwu
Copy link
Contributor

With the introduction of the committee lookback design (which now makes the leader election predictable up to GC depth), the manipulation concern is indeed more prevalent than before. To clarify:

  • the leader cannot choose what transmissions to include in what order, HOWEVER
  • the leader can choose which certificates to include and the order of the certificates for the subDAG they commit

To mitigate this concern according to the Bullshark paper, the solution isn't to change the get_leader deterministic algorithm, instead it is to introduce a new topological sort over the transmissions that are committed into the block.

For instance, we can sort all of the transmissions which go into the block based on:

  1. Bucket transmissions into the rounds that they are committed in
  2. For each bucket, sort them based on the solution IDs & transaction IDs (i.e. order the field element representing the ID)

@randomsleep
Copy link
Contributor

For instance, we can sort all of the transmissions which go into the block based on:

  1. Bucket transmissions into the rounds that they are committed in
  2. For each bucket, sort them based on the solution IDs & transaction IDs (i.e. order the field element representing the ID)

@howardwu This is still manipulatable because the leader can always include specific transactions in its own BatchPropose and simulate the block in advance.

The root cause is that get_leader includes total_stake as a random seed. I guess the initial purpose is to introduce more randomness in leader election, which, turns out to be manipulatable. Full randomness is not a must in the leader election with Bullshark. Instead, fairness should be kept. I would suggest:

  1. Remove total_stake and self.starting_round in get_leader. Only use current_round as seed.
  2. sorted_members sort validators with the index they join the committee.

This will make get_leader hard to manipulate the chance is very low that a validator be the leader in many consecutive rounds.

@raychu86
Copy link
Contributor

raychu86 commented Mar 3, 2024

@randomsleep Would sorting the member based on the x-coordinate of the validator address be sufficient to replace the sorted_members? We currently do not have an easy way to track the order that validators joined/left the committee.

We have a proposed change here - AleoNet/snarkVM#2378

@randomsleep
Copy link
Contributor

@raychu86 I think this is sufficient to address the issue.

In theory, the malicious validators can quit and rejoin the committee to manipulate x-coordinate of the address. However, due to the unbonding period, it is impractical to do that frequently.

@randomsleep
Copy link
Contributor

Hi @howardwu @raychu86, this issue has been resolved but the report was closed with a downgraded severity level referencing the outdated comment: #2986 (comment). This report shows two issues that are both critical. This was submitted on December 17, 2023, before the public announcement of audits. Please rejudge the severity level based on the new progress.

@howardwu
Copy link
Contributor

I'll share a message with the bug bounty team at the Foundation regarding the severity level. Just a heads up that the bug bounty process is unaffiliated to Provable (where Ray and I are).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect or unexpected behavior
Projects
None yet
Development

No branches or pull requests

5 participants