feat: sanction/reward peer for connect #105

Open (1 commit, base: master)

Conversation

dan-da
Collaborator

@dan-da dan-da commented Feb 11, 2024

The big picture

At a high-level this PR enables the following scenario:

  1. We successfully connect to new peer A and A's standing score improves
  2. Later A goes offline and our connect attempts fail. Each failed attempt worsens A's standing score until it reaches the minimum score, 0 - cli.peer_tolerance, at which point A is banned.
  3. Every 5 minutes (TBD?) we attempt to connect again. Eventually a successful connect occurs, and A's standing improves, resulting in A becoming unbanned, but still with a low score. A single failed connect at this point would get A banned again.
  4. Each time we connect to A again in the future, A's standing score improves. Eventually A's score reaches the maximum 0 + cli.peer_tolerance and is capped.

To achieve this, some changes were made to the existing Peer Standing/Sanction system:

  1. cli.peer_tolerance is now used to define both a min and a max for the peer standing score, which was previously unbounded. Note that these limits are configurable by the node operator, though fixed for the lifetime of each neptune-core process.
  2. A peer is considered banned when its standing score is exactly the minimum score. Any improvement in the score unbans the peer. A peer just above the minimum can easily be banned again by a single sanction, whereas a peer with a higher score can tolerate more sanctions.
  3. There is now a mechanism to unsanction (reward) a peer. For now, it is only used for ConnectSuccess, but we can extend it for other things as well.
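A minimal sketch of the bounded score described in the list above, not the code in this PR. The type name `StandingScore`, the `u16` tolerance type, and the method signatures are assumptions for illustration; only the rules (clamp to ±peer_tolerance, banned exactly at the minimum) come from the description.

```rust
/// Hypothetical stand-in for PeerStanding; only the score is modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StandingScore {
    pub score: i32,
}

impl StandingScore {
    /// Lowest reachable score: 0 - peer_tolerance.
    pub fn min_score(peer_tolerance: u16) -> i32 {
        -i32::from(peer_tolerance)
    }

    /// Highest reachable score: 0 + peer_tolerance.
    pub fn max_score(peer_tolerance: u16) -> i32 {
        i32::from(peer_tolerance)
    }

    /// Worsen the score by `severity`, never dropping below the minimum.
    pub fn sanction(&mut self, severity: u16, peer_tolerance: u16) {
        let lowered = self.score.saturating_sub(i32::from(severity));
        self.score = lowered.max(Self::min_score(peer_tolerance));
    }

    /// Improve the score by `reward`, never rising above the maximum.
    pub fn unsanction(&mut self, reward: u16, peer_tolerance: u16) {
        let raised = self.score.saturating_add(i32::from(reward));
        self.score = raised.min(Self::max_score(peer_tolerance));
    }

    /// A peer is banned exactly when its score sits at the minimum.
    pub fn is_banned(&self, peer_tolerance: u16) -> bool {
        self.score == Self::min_score(peer_tolerance)
    }
}
```

With a tolerance of 100 and steps of 25, five sanctions starting from 25 reach -100 (banned), a further sanction leaves the score at -100, and one reward moves it to -75 (unbanned).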

Uniquely Identifying Peers

An issue arose early in dev/testing with multiple peers on the same machine, invoked with the run-multiple-instances.sh script. Sanctions for all 3 peers were being stored in the same DB record, identified only by IP, which in this case is 127.0.0.1. The Database type was defined as:

pub struct PeerDatabases {
    pub peer_standings: NeptuneLevelDb<IpAddr, PeerStanding>,
}

This equates an IP with a peer. That doesn't reflect reality: it is clearly possible to run nodes on different ports of the same IP, and those nodes could be running different code, different network configs, etc.

As such, I changed the field to:

    pub peer_standings: NeptuneLevelDb<SocketAddr, PeerStanding>,

and adapted the APIs to match. So a PeerStanding is now identified by SocketAddr, i.e. IP + port.

I added a test case can_track_peer_standing_by_port that verifies PeerStanding can be independently tracked for Peers on the same IP. This test would fail on the older code, if ever adapted for it.

note: the node operator can still use the cli --ban arg to ban entire IP(s) without specifying ports.
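To illustrate the difference in keying, here is a small in-memory sketch. It uses a `HashMap` instead of `NeptuneLevelDb`, and the type `StandingsBySocket` and its methods are hypothetical names, not APIs from this PR. It shows per-port records coexisting with an operator-level ban of a whole IP.

```rust
use std::collections::{HashMap, HashSet};
use std::net::{IpAddr, SocketAddr};

/// Hypothetical in-memory stand-in for the peer_standings database.
#[derive(Default)]
pub struct StandingsBySocket {
    /// One score per IP + port pair.
    scores: HashMap<SocketAddr, i32>,
    /// IPs banned wholesale by the operator, regardless of port.
    banned_ips: HashSet<IpAddr>,
}

impl StandingsBySocket {
    /// Add `delta` to the peer's score (starting from 0) and return the new score.
    pub fn adjust(&mut self, peer: SocketAddr, delta: i32) -> i32 {
        let score = self.scores.entry(peer).or_insert(0);
        *score += delta;
        *score
    }

    /// Look up the score stored for exactly this IP + port.
    pub fn score(&self, peer: &SocketAddr) -> Option<i32> {
        self.scores.get(peer).copied()
    }

    /// Ban every port on the given IP.
    pub fn ban_ip(&mut self, ip: IpAddr) {
        self.banned_ips.insert(ip);
    }

    /// True if the peer's IP was banned by the operator.
    pub fn ip_is_banned(&self, peer: &SocketAddr) -> bool {
        self.banned_ips.contains(&peer.ip())
    }
}
```

Two peers at 127.0.0.1:29790 and 127.0.0.1:29791 get independent scores here, whereas a map keyed by `IpAddr` would merge them into one record.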

I realize this change might be controversial. It can be reverted if necessary, but I feel it is "more correct" and has great utility in enabling PeerStanding to be correctly tracked for multiple nodes running on localhost with different ports.


Open Qs:

  1. How long should we wait after a peer is banned before attempting to connect again? It is presently set to 5 minutes, which may be too short given that our check-for-peers interval is 2 minutes. Perhaps 1 hour? On the other hand, a node that goes offline will shortly become banned for ConnectFailed, and when it comes back online we'd like to reconnect without too much delay. The two goals are in tension, so maybe 10 minutes is a reasonable compromise?

  2. If a peer is banned and the LatestSanction is NOT a ConnectFailed, then we never try to connect again. This node is effectively perma-banned with no chance to improve. I did this because it seems to mirror existing behavior. However, maybe we should give every banned node an opportunity to redeem itself once in a while.

We don't have to resolve these q's now. Both could easily be addressed in a future PR.
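The retry rule behind both questions can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `StandingRecord`, `SanctionReason::Other`, and `permits_connect` are hypothetical names, and the timeout is passed around as a `Duration` constant rather than the real CONNECT_FAILED_TIMEOUT_SECS.

```rust
use std::time::{Duration, SystemTime};

/// Assumed to match the 5 minute value currently used in the PR.
pub const CONNECT_FAILED_TIMEOUT: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SanctionReason {
    ConnectFailed,
    /// Placeholder for every other sanction reason.
    Other,
}

/// Hypothetical stand-in for PeerStanding.
pub struct StandingRecord {
    pub score: i32,
    pub latest_sanction: Option<SanctionReason>,
    pub timestamp_of_latest_sanction: Option<SystemTime>,
}

/// Decide whether an outgoing connection attempt is allowed.
pub fn permits_connect(
    standing: Option<&StandingRecord>,
    min_score: i32,
    now: SystemTime,
) -> bool {
    let s = match standing {
        // Unknown peer: nothing held against it.
        None => return true,
        Some(s) => s,
    };
    // Any score above the minimum means the peer is not banned.
    if s.score > min_score {
        return true;
    }
    // Banned: only a ConnectFailed ban can be retried, and only after the timeout.
    match (s.latest_sanction, s.timestamp_of_latest_sanction) {
        (Some(SanctionReason::ConnectFailed), Some(at)) => now
            .duration_since(at)
            .map(|elapsed| elapsed >= CONNECT_FAILED_TIMEOUT)
            .unwrap_or(false),
        _ => false,
    }
}
```

In this sketch, changing the answer to question 1 means changing one constant, and addressing question 2 would mean replacing the final `_ => false` arm with some longer retry interval.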


Commit Msg

closes #35

High Level Changes:

  • add doc-comments describing how sanctioning system works
  • Identify PeerStanding by SocketAddr instead of IpAddr because peers can share an IP (it is not a unique ID)
  • Add Unsanction (reward) capability
  • Sanction peer for connect failure
  • Unsanction peer for connect success
  • Establish min peer_standing score equal to 0 - cli.peer_tolerance. A peer is banned when the min score is reached.
  • Establish max peer_standing score equal to cli.peer_tolerance
  • A peer can become banned for ConnectFailed, but after a 5 minute period we are eligible to attempt a connect again, and the peer becomes unbanned if it succeeds. See: networking_state::CONNECT_FAILED_TIMEOUT_SECS

Cleanups/Tweaks:

  • rename punish_peer to sanction_peer
  • rename PeerStanding::standing to score
  • impl strum::Display for PeerSanctionReason
  • move some existing methods into GlobalState, NetworkingState
  • wrap db iterator with block_in_place() to make it async-friendly

Dependencies:

  • adds direct dep on thiserror; it already existed in our dep-tree

New Tests:

  • can_track_peer_standing_by_port
  • ban_peer_connect_fail_and_unban_connect_success

Testing Performed

  1. Implemented test ban_peer_connect_fail_and_unban_connect_success which simulates a full sequence of sanctioning, banning, unsanctioning, unbanning and verifies details at each point.

  2. Ran 3 localhost nodes. Procedure:

    • Temporarily adjusted CONNECT_FAILED_TIMEOUT_SECS from 5 mins to 40 secs.
    • Temporarily adjusted PEER_DISCOVERY_INTERVAL_IN_SECONDS from 2 mins to 20 secs.
    • Ran 3 regtest nodes on 3 localhost ports using the run-multiple-instances.sh script.
    • Waited until each node had 2 peers, then shut down the 3rd node.
    • Verified in node 1 and 2's logs that node 3 gets repeatedly sanctioned until banned.
    • Verified that nodes 1 and 2 attempt to connect to node 3 after CONNECT_FAILED_TIMEOUT_SECS and that the peer remains banned.
    • Restarted node 3.
    • Verified that nodes 1 and 2 attempt to connect to node 3 after CONNECT_FAILED_TIMEOUT_SECS and that the peer is unsanctioned and unbanned after a successful connect. Each node again shows 2 peers in the dashboard.

Here is an edited-for-readability log from node 0 that demonstrates the detailed logging:

2024-02-09T23:05:19 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer standing not found. allow: true
2024-02-09T23:05:19  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:05:19  INFO rewarding peer 127.0.0.1:29792 for ConnectSuccess
2024-02-09T23:05:19 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: 0, latest_sanction: None, timestamp_of_latest_sanction: None }
2024-02-09T23:05:19 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: 25, latest_sanction: None, timestamp_of_latest_sanction: None }

2024-02-09T23:05:31 DEBUG Removing max block height from sync data structure for peer [127.0.0.1]:29792

2024-02-09T23:05:39 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer NOT banned. score: 25. allow: true
2024-02-09T23:05:39  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:05:39  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:05:39 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: 25, latest_sanction: None, timestamp_of_latest_sanction: None }
2024-02-09T23:05:39 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: 0, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519939, tv_nsec: 883171674 }) }

2024-02-09T23:05:59 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer NOT banned. score: 0. allow: true
2024-02-09T23:05:59  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:05:59  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:05:59 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: 0, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519939, tv_nsec: 883171674 }) }
2024-02-09T23:05:59 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -25, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519959, tv_nsec: 887529758 }) }

2024-02-09T23:06:19 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer NOT banned. score: -25. allow: true
2024-02-09T23:06:19  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:06:19  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:06:19 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: -25, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519959, tv_nsec: 887529758 }) }
2024-02-09T23:06:19 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -50, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519979, tv_nsec: 889638190 }) }

2024-02-09T23:06:39 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer NOT banned. score: -50. allow: true
2024-02-09T23:06:39  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:06:39  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:06:39 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: -50, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519979, tv_nsec: 889638190 }) }
2024-02-09T23:06:39 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -75, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519999, tv_nsec: 891635518 }) }

2024-02-09T23:06:59 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792.  peer NOT banned. score: -75. allow: true
2024-02-09T23:06:59  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:06:59  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:06:59 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: -75, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707519999, tv_nsec: 891635518 }) }
2024-02-09T23:06:59 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -100, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707520019, tv_nsec: 895418539 }) }
2024-02-09T23:06:59  WARN Banning peer 127.0.0.1:29792

2024-02-09T23:07:19 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792. peer remains BANNED since last ConnectFailed.  Can try again in 20 seconds.  score: -100. allow: false

2024-02-09T23:07:39 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792. peer is BANNED but timeout expired since ConnectFailed sanction. score: -100. allow: true
2024-02-09T23:07:39  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:07:39  WARN Sanctioning peer 127.0.0.1:29792 for ConnectFailed
2024-02-09T23:07:39 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: -100, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707520019, tv_nsec: 895418539 }) }
2024-02-09T23:07:39 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -100, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707520059, tv_nsec: 902585255 }) }

2024-02-09T23:07:59 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792. peer remains BANNED since last ConnectFailed.  Can try again in 20 seconds.  score: -100. allow: false

2024-02-09T23:08:19 DEBUG standing_permits_connect_to_peer: [127.0.0.1]:29792. peer is BANNED but timeout expired since ConnectFailed sanction. score: -100. allow: true
2024-02-09T23:08:19  INFO Connecting to peer [127.0.0.1]:29792 with distance 2
2024-02-09T23:08:19  INFO rewarding peer 127.0.0.1:29792 for ConnectSuccess
2024-02-09T23:08:19 DEBUG Old Standing for Peer 127.0.0.1:29792 was PeerStanding { score: -100, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707520059, tv_nsec: 902585255 }) }
2024-02-09T23:08:19 DEBUG New Standing for Peer 127.0.0.1:29792 is PeerStanding { score: -75, latest_sanction: Some(ConnectFailed), timestamp_of_latest_sanction: Some(SystemTime { tv_sec: 1707520059, tv_nsec: 902585255 }) }
2024-02-09T23:08:19  INFO Unbanning peer 127.0.0.1:29792

peer2-readable.log


Problems after rebase to master

The above testing was performed on my branch created from master at 0d414d9 on Jan 25. Today I rebased it onto master @ 9fb265c (Feb 5). I again attempted to run 3 nodes with run-multiple-instances.sh, but now the peers are unable to pass each other blocks and they disconnect from each other. I get a log error "In order to be transferred, a Block must have a non-None proof field". Here's an example:

2024-02-11T02:03:11.68229879Z  INFO ThreadId(02) neptune_core::connect_to_peers: Connection accepted from [::ffff:127.0.0.1]:47096
2024-02-11T02:03:11.682693047Z DEBUG ThreadId(01) neptune_core::main_loop: Received message sent to main thread.
2024-02-11T02:03:11.682784953Z DEBUG ThreadId(01) neptune_core::main_loop: Received add peer max block height from a peer thread
2024-02-11T02:03:11.683502005Z DEBUG ThreadId(05) neptune_core::peer_loop: Received block notification request from peer [::ffff:127.0.0.1]:47096
2024-02-11T02:03:11.683632119Z DEBUG ThreadId(05) neptune_core::peer_loop: Got BlockNotificationRequest
2024-02-11T02:03:11.684171193Z DEBUG ThreadId(05) neptune_core::peer_loop: Received peer list req from peer [::ffff:127.0.0.1]:47096
2024-02-11T02:03:11.684250284Z DEBUG ThreadId(05) neptune_core::peer_loop: Responding with: [([::ffff:127.0.0.1]:29791, 125719324242919459118982459613409448902)]
2024-02-11T02:03:11.68462132Z DEBUG ThreadId(05) neptune_core::peer_loop: Received block req by height from peer [::ffff:127.0.0.1]:47096
2024-02-11T02:03:11.684657636Z DEBUG ThreadId(05) neptune_core::peer_loop: Got BlockRequestByHeight of height 14
2024-02-11T02:03:11.685041049Z DEBUG ThreadId(05) neptune_core::peer_loop: Found 1 blocks
2024-02-11T02:03:11.685521208Z ERROR ThreadId(05) neptune_core::models::blockchain::block: In order to be transferred, a Block must have a non-None proof field.
thread 'tokio-runtime-worker' panicked at src/models/blockchain/block/mod.rs:80:17:
explicit panic

This appears to be caused by the intervening changes to master, unrelated to this PR.

@Sword-Smith
Member

Sword-Smith commented Feb 12, 2024

About the banning of IPs/ports: it was actually intentional that I put the bans on IP. I thought that it was better to ban a whole IP address in case of shenanigans. I would rather use a big hammer than open us up to repeated, costly attacks, like sharing invalid blocks and transactions.

In the absence of further info, I lean towards banning whole IPs as I think protecting the network against malicious actors is more important than convenience in integration test environments.

What do other blockchains do?

The main thing I want to protect against is the sharing of invalid blocks and transactions, consensus-related shenanigans.

@dan-da
Collaborator Author

dan-da commented Feb 12, 2024

yeah, I figured it was intentional, which is why I said it might be a controversial change. ;-) I wrestled with it quite a bit myself.

You did not comment on how it impacts running/testing multiple nodes on localhost.

If we go back to storing PeerStanding by IP, then about the only way to test standing behavior on localhost would be to use 127.0.0.2, 127.0.0.3, etc., which people are not used to. So it would need very clear documentation.
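For reference, a small sketch of what such a test setup could look like on the address side. The helper `distinct_loopback_peers` and the port are made up for illustration. Every address in 127.0.0.0/8 is a loopback address, though depending on the OS, addresses other than 127.0.0.1 may first need to be configured on the loopback interface before a node can bind to them.

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Build `count` loopback socket addresses that share a port but differ by IP:
/// 127.0.0.1, 127.0.0.2, ... (hypothetical test helper).
pub fn distinct_loopback_peers(count: u8, port: u16) -> Vec<SocketAddr> {
    (1..=count)
        .map(|i| SocketAddr::new(IpAddr::V4(Ipv4Addr::new(127, 0, 0, i)), port))
        .collect()
}
```

Each returned address has a different `ip()`, so even an IP-keyed standings table would track them separately.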

It's an interesting question what other cryptocurrencies do. I will check into bitcoin and monero's behavior.

@aszepieniec
Contributor

Here is something probably tangential but also probably worth thinking about: using proof-of-work and/or verifiable delay functions as an alternative to or as a soft version of banning.

A node could have a policy of accepting connection requests from peers but only if they can prove to have devoted more than x amount of energy or time to that connection request. Suspicious IP range? High requirement. Benign connection failure with an otherwise impeccable standing? Low requirement.
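One way such a policy might map standing to a work requirement is sketched below. Everything here is hypothetical: the function name, the linear interpolation, and the idea of expressing the requirement as a number of leading zero bits are assumptions for illustration, and no proof-of-work or delay function is actually implemented.

```rust
/// Map a standing score in [-tolerance, +tolerance] to a required difficulty,
/// expressed as a number of leading zero bits (hypothetical policy).
/// The best score maps to `min_bits`, the worst score to `max_bits`.
pub fn required_difficulty_bits(
    score: i32,
    tolerance: u16,
    min_bits: u32,
    max_bits: u32,
) -> u32 {
    assert!(tolerance > 0, "tolerance must be positive");
    assert!(min_bits <= max_bits, "min_bits must not exceed max_bits");
    let tol = i64::from(tolerance);
    // Scores outside the valid range are treated as the nearest bound.
    let clamped = i64::from(score).clamp(-tol, tol);
    // 0 for the best standing, 2 * tol for the worst.
    let badness = (tol - clamped) as u64;
    let span = u64::from(max_bits - min_bits);
    // Linear interpolation between min_bits and max_bits.
    min_bits + (badness * span / (2 * tol as u64)) as u32
}
```

With a tolerance of 100 and a range of 8 to 24 bits, a score of 100 requires 8 bits, a score of 0 requires 16, and a score of -100 requires 24.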

@dan-da
Collaborator Author

dan-da commented Feb 12, 2024

I looked into bitcoin's behavior a bit. Here's a summary of my understanding:

  1. Bitcoin keeps a peer score, similar to ours.
  2. When a peer is banned, the entire IP is banned.
  3. Banning an entire IP has been noted to be problematic for, e.g., Tor/onion nodes connected via localhost ports.
  4. When a peer is banned, an unban timestamp is stored. This is displayed on Peers screen in bitcoin-qt. So there is always a known time for peer to become unbanned.
  5. There is a bantime setting to configure ban duration. Default value is 86400 secs (1 day).
  6. There is also a setting for the ban threshold/score. Similar to our peer_tolerance.

I like that they store a timestamp for unbanning and have a bantime setting. I think we should do the same. But that can be in a future PR.
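A rough sketch of what storing an unban timestamp could look like on our side. The `Ban` type and its methods are hypothetical, and only the 86400 second default is taken from the Bitcoin behavior summarized above.

```rust
use std::time::{Duration, SystemTime};

/// Mirrors Bitcoin's default ban duration of 86400 seconds (1 day).
pub const DEFAULT_BANTIME: Duration = Duration::from_secs(86_400);

/// Hypothetical ban record that stores when the ban ends.
#[derive(Debug, Clone, Copy)]
pub struct Ban {
    pub unban_at: SystemTime,
}

impl Ban {
    /// Create a ban starting at `banned_at` and lasting `bantime`.
    pub fn new(banned_at: SystemTime, bantime: Duration) -> Self {
        Ban { unban_at: banned_at + bantime }
    }

    /// The ban is active until the unban timestamp is reached.
    pub fn is_active(&self, now: SystemTime) -> bool {
        now < self.unban_at
    }

    /// Time left until the peer is unbanned; zero once the ban has expired.
    pub fn remaining(&self, now: SystemTime) -> Duration {
        self.unban_at.duration_since(now).unwrap_or(Duration::ZERO)
    }
}
```

Because the expiry is stored rather than derived, a dashboard could display `remaining()` directly, similar to the Peers screen in bitcoin-qt.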

I haven't checked into monero's behavior yet. It might be interesting to check Eth's behavior also.
