reduce synchronize timeouts and increase block batch size #2922
This PR tries to find a happy middle between accommodating slower peers and having a performant initial block download (IBD). I believe every peer should be able to respond within the timeout limits specified, and we won't spend so much time waiting for a peer to time out if it's not responding.
This PR is meant to improve the sync time while we wait for an overhaul of the consensus module.
The speedup is impressive, and agreed that it's worth pursuing, but as discussed in Discord, we need to make sure that we aren't going to be causing problems for people using e.g. Tor or who are behind restrictive internets (North Korea, Iran, China). We need to make sure that Sia works out of the box for everyone.
If we had some sort of parallel downloads, or perhaps a smart timeout that would double in length each time that the timeout was hit, then we could get the speedup without breaking Sia for certain disadvantaged users.
Doing 25 blocks at a time is also potentially a problem, because on the Sia network that could be up to 50 MB of data. From what I understand, users need to be able to fetch that much data before the timeout expires.
I like the direction of this pull request but we need to make sure we're supporting disadvantaged users at the same time that we introduce speedups for our typical users.
I think we can probably get most of the speedup by starting with a low timeout and steadily increasing it if we are unable to get any nodes to work at all.
For the consecutive blocks, 25 just seems like a really high number to me. It works well earlier in the chain when blocks are small, but in the long term that is a full 50MB per pass, which makes me really uncomfortable.
The true solution here is to implement headers-first block retrieval, and parallel block downloads from nodes. I know that's a lot more work than some simple tweaks to these constants though.
Parallel downloads, smart timeouts, and header-first block retrieval are all excellent ideas that should be considered and implemented in the consensus overhaul. This PR is meant to improve our existing code, not rewrite any of it. Those features are way outside the scope of this PR.
I believe the timeouts I propose in this PR are a good balance of performance improvements without causing problems for the disadvantaged users you mentioned. According to https://metrics.torproject.org the average Tor user can download 5MB in about 12 seconds, achieving 3.33Mb/s throughput. Sia's worst case scenario of a 50MB batch size would take exactly 120 seconds to download at 3.33Mb/s. Tor users will still be able to download batches of 25 blocks even if every block is completely full.
According to http://www.speedtest.net/global-index/iraq Iraq has an average download speed of 7.22 Mb/s, so users there should have no problem downloading a worst case 50MB batch of blocks in 120 seconds. Iran, China, and every other country I've looked at is faster than Iraq.
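The arithmetic behind these figures can be checked with a short sketch. `transferSeconds` is a hypothetical helper, and it assumes the link sustains the quoted throughput for the whole transfer, ignoring latency and protocol overhead.

```go
package main

import "fmt"

// transferSeconds returns how long it takes to download sizeMB megabytes
// at rateMbps megabits per second (1 byte = 8 bits). Latency and protocol
// overhead are deliberately ignored.
func transferSeconds(sizeMB, rateMbps float64) float64 {
	return sizeMB * 8 / rateMbps
}

func main() {
	const batchMB = 50.0 // worst case from the thread: 25 full blocks

	// Throughput figures quoted in the comments above.
	links := []struct {
		name string
		mbps float64
	}{
		{"Tor (5 MB in 12 s)", 5.0 * 8 / 12},
		{"Iraq average", 7.22},
	}
	for _, l := range links {
		fmt.Printf("%s at %.2f Mb/s: %.1f s for %.0f MB\n",
			l.name, l.mbps, transferSeconds(batchMB, l.mbps), batchMB)
	}
}
```

At the quoted Tor rate the worst-case batch takes the full 120 seconds, so under these assumptions there is no headroom left for latency or a slower-than-average connection.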
I stand by my numbers because I believe they allow us to continue to serve the disadvantaged users while offering a noticeable reduction in the time it takes to download the blocks.
For some reason, I was thinking you changed the timeout to 10 seconds, not 2 minutes. I'm not sure why I was thinking that, but my original comment was based on a change to 10 seconds.
We care about medians more than averages, but more specifically we care about the 95th percentile, not the 50th percentile. This is the best resource I could find for that for now: https://www.fastmetrics.com/internet-connection-speed-by-country.php
According to those metrics, even the top 10 countries all have >5% of users at speeds under 4 Mbps. Some countries have their average speeds (on this particular graph, anyway) under 2 Mbps (including countries outside of Africa).
In the early days of Sia, a lot of Chinese users were complaining continuously that it was very difficult to sync a node in China. These complaints subsided substantially when we started bumping the timeouts to 2 minutes+. That's the biggest reason I'm being stubborn about this - it'd be really bad from my point of view to change some constants and then suddenly a huge user segment starts having trouble syncing. We largely stopped receiving these complaints after we rolled out network changes to substantially boost the timeout and keepalive constants that we were using.
I am definitely not comfortable doing 25 blocks at a time. Especially because we will have trouble testing this until it's fully rolled out anyway.
The timeouts aren't so bad. RelayHeader and SendBlk are both single-round-trip RPCs. RelayHeader has a tiny payload, and SendBlk's payload only ever goes up to about 2 MB. sendBlocks is a little heavier: the caller both writes a small payload and reads a large one. Assuming 1 Mbps and a high-latency handshake, which I think is fair given the statistics I linked above, you'd need at least 3 minutes to complete a download of 10 full-size blocks.
In this matter, I am strongly inclined to be highly conservative. I want Sia to be a very robust platform, and I want it to be known for being robust. That's a reputation we don't currently have, and I'm cautious to adjust constants like this especially when it's been a user-reported problem in the past.
I've updated the constants to numbers you are comfortable with.
I did want to make one last petition for you to reconsider a lower timeout. The fastmetrics.com study you refer to is using data from 2015 (except where "Update" is specified). Internet speeds have increased in the last 3 years.
Also, it's important to consider the Sia target market. I would argue that users looking for cloud storage are aware of their bandwidth limitations, and businesses are likely to have more reliable and higher bandwidth connections than residential users. In other words, I think users at the 95th percentile are the wrong target audience for your product. Furthermore, those Internet speed metrics are dragged down by slow mobile connections (which are not Sia's current target market).
In summary, I think you're being overly conservative in these numbers and causing unnecessarily slow sync times to be able to serve a very small population of users who are not your target market.
edit: changed "audience" to "market" as it's a better term for what I was trying to describe.
I appreciate your comments and understand where you are coming from. You are correct that Sia is not as useful to people with slow internet speeds, and if your internet is in the bottom 5% of your country, you are probably not the Sia target market.
But I think the vision for decentralized infrastructure goes quite a bit deeper than this, and I also think that we can incrementally implement some coding solutions which will speed up Sia substantially for the high-end users without barricading low-end users.