Syncing from scratch randomly slows down #352

ghost · 2016-12-11T18:50:04Z

This is an issue I have been reporting since early versions of Lisk. Syncing from scratch works better than before, but it is still far from perfect.

It's understandable that syncing is slow at the beginning, given the many transactions made and the CPU time needed to verify them. After a while, syncing speeds up to a reasonable rate. Then suddenly, without any additional logging, the process becomes very slow again (on the disk usage chart I have marked the periods of good speed in green and the slow ones in red). Restarting the Lisk process fixes the issue temporarily, but even so, with the current version of Lisk it took 24h to sync from the genesis block to 3209245310885481431. I've described in #351 why only up to this block. #351 is a different issue from this one: here there were no additional errors or logging, as mentioned above.

CPU usage seems to be the same whether it is syncing at a reasonable speed or very slowly. By slow I mean abnormally slow: sometimes getting a new block takes longer than the network's block interval, which technically makes it impossible to sync.
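The point about the block interval can be made concrete with a little arithmetic. The following is a minimal sketch in Python (used only for illustration; Lisk itself is a Node.js application), not code from Lisk. It assumes a block interval of 10 seconds and a constant per-block processing time, and the function name catch_up_seconds is hypothetical.

```python
import math


def catch_up_seconds(blocks_behind: int, seconds_per_block: float,
                     block_interval: float = 10.0) -> float:
    """Time needed to reach the chain tip while the network keeps forging.

    blocks_behind     -- how many blocks the node still has to process
    seconds_per_block -- average time the node needs per block (assumed constant)
    block_interval    -- seconds between new blocks on the network (assumed 10 s)
    """
    if seconds_per_block >= block_interval:
        # The network adds blocks at least as fast as the node processes them,
        # so the backlog never shrinks.
        return math.inf
    # While the node works through its backlog, one new block appears every
    # block_interval seconds, so the total time T satisfies
    #   T = seconds_per_block * (blocks_behind + T / block_interval)
    return blocks_behind * seconds_per_block * block_interval / (
        block_interval - seconds_per_block)


print(catch_up_seconds(1_000, 5.0))   # 10000.0
print(catch_up_seconds(1_000, 12.0))  # inf
```

Under these assumptions, a node that needs 5 seconds per block takes twice as long to catch up as the backlog alone would suggest, and a node that needs 12 seconds per block never catches up at all.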

Additionally, CPU usage hovers at roughly 25%, and the load average tells the same story. This suggests that syncing could be up to 4x faster than it is now with the same implementation of the cryptographic functions and the same block & transaction verification logic: 3/4 of the CPU power is left idle.
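As a rough sanity check on the 4x estimate, Amdahl's law gives the upper bound on the speedup when only part of the work can be spread across cores. This is a sketch in Python (Lisk itself is Node.js); the parallel fractions below are made-up illustration values, not measurements of Lisk.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `parallel_fraction` of the work scales across `cores`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# One busy core out of four shows up as 25% total CPU usage.
print(1 / 4)  # 0.25

# 4x is only reached if all of the work can run in parallel.
for fraction in (1.0, 0.9, 0.5):  # hypothetical parallel fractions
    print(fraction, round(amdahl_speedup(fraction, cores=4), 2))
```

With these assumed fractions the bound on four cores is 4.0, about 3.08 and 1.6 respectively, so the idle 75% translates into a 4x gain only in the ideal case.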

Possible solutions:

Improve the current logic to fix the random slowdowns during syncing
Improve the current logic to use the additional 75% of CPU power that sits idle
Implement something better than the official, commonly used JS cryptographic library, possibly written in C++/C or another fast low-level language. Possibly rewrite the transaction/block verification code as a C++/C library for JS - this is a very necessary step to achieve reasonable scalability.
Syncing speed could possibly also be improved by moving communication between nodes to WebSockets, as proposed in Network connectivity channels #347 - this could be a big step forward, as it would positively affect block propagation times across the network.

Another problem is that starting Lisk to sync without a snapshot from lisk.io is tricky and confusing enough that most users will ignore this option and go with the centralised snapshot. I reported this in LiskArchive/lisk-build#57 but it has been ignored; moreover, the option in installLisk.sh to sync from scratch does not work. I managed to do it in a tricky way, by creating a fake file.db.gz and running bash lisk.sh rebuild -f file.db.gz. I think this should be as easy as deciding on the install location or choosing between the main network and the test network, so that more people are encouraged to sync from other people, find syncing issues, etc. It's a good approach for a decentralised project in general.

Screenshot 1

Screenshot 2

Additional information about the hardware on which I ran the tests:

Hardware Class: cpu
Arch: X86-64
Vendor: "GenuineIntel"
Model: 6.63.2 "Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz"
RAM: 8 GB
CPU cores: 4

mrv777 · 2016-12-11T19:29:51Z

It's only using 25% because you have 4 cores. Node is a single-threaded process, so it can only use one core. Multithreading would be great, but isn't easy. You could start tackling it if you wanted though :)

ghost · 2016-12-11T19:33:57Z

I'm not familiar with Node, as I'm not a big fan of JS at all; it doesn't matter how many cores I have.
It needs to be fully multithreaded. I haven't worked with Node.js, but it must be possible; reasonable multithreading can even be implemented in PHP, I did that and it worked flawlessly. Even if multithreading Node.js is that much trouble, some of the syncing logic can be rewritten in a low-level language, as an additional module that takes care of the things which do not work well in a one-thread-per-request model.

mrv777 · 2016-12-11T19:43:34Z

Sorry if there is some confusion. Node.js can be used with multithreading. It's just that Lisk's Node code is not currently written that way. I'm sure it will be rewritten at some point, but for now that is why you see it at 25% on a 4-core system.

ghost · 2016-12-11T19:45:40Z

I have checked; it seems some versions of Node.js support reasonable multithreading and some do not. Let's wait for @karmacoma to take a position.

Isabello · 2016-12-12T00:54:13Z

There are plans to clusterize the process at some point.

maxkordek · 2016-12-12T09:17:18Z

There are also plans to re-write time/performance critical functionalities into a low level language. This will probably be done later in the second part of the Ascent phase.

karmacoma · 2016-12-12T11:21:05Z

@karek314 Regarding your possible solutions:

Improve the current logic to fix the random slowdowns during syncing

I don't see a possible solution here.

Improve the current logic to use the additional 75% of CPU power that sits idle

I assume you mean allowing other CPU cores to be utilized. At the persistence layer, PostgreSQL is already utilizing multiple cores, and that is where much of the heavy lifting is done. As already mentioned by @Isabello, we plan to clusterize the node.js application itself into several distinct processes. There is also ongoing work in #302 that will improve the efficiency with which "work" is actually delegated to the persistence layer.

Implement something better than the official, commonly used JS cryptographic library, possibly written in C++/C or another fast low-level language. Possibly rewrite the transaction/block verification code as a C++/C library for JS - this is a very necessary step to achieve reasonable scalability.

@4miners recently introduced a change from js-nacl to libsodium which has improved the speed of cryptographic operations by approx. 3 times.

At this point, imo the bottleneck is not the language or level at which it is written. The inefficiencies are largely related to the way db connections / queries are being conducted. Once again #302 should address this, especially in the area of block / transaction processing. We are also looking at ways we can improve the apply and undo transaction operations, which are the most costly.

Syncing speed could possibly also be improved by moving communication between nodes to WebSockets, as proposed in #347 - this could be a big step forward, as it would positively affect block propagation times across the network.

Yes, we are already in agreement on this. We can further discuss your proposal in #347.

Another problem is that starting Lisk to sync without a snapshot from lisk.io is tricky and confusing enough that most users will ignore this option and go with the centralised snapshot. I reported this in LiskArchive/lisk-build#57 but it has been ignored; moreover, the option in installLisk.sh to sync from scratch does not work.

@Isabello has reopened the issue on lisk-build where installLisk.sh is maintained. We are not ignoring the issue.

ghost · 2016-12-12T12:03:50Z

I don't see a possible solution here.
There is a solution: something can be improved if a simple restart of Lisk fixes this issue and syncing then stays at a reasonable speed for some random length of time. If it's related to connectivity, rewriting node-to-node communication and moving it to WebSockets should improve that. I haven't looked at the code; I don't have time to do this for free. I believe there is always a solution to every problem.

At this point, imo the bottleneck is not the language or level at which it is written. The inefficiencies are largely related to the way db connections / queries are being conducted. Once again #302 should address this, especially in the area of block / transaction processing. We are also looking at ways we can improve the apply and undo transaction operations, which are the most costly.

Good to know about the db inefficiencies caused by how queries are made, but I think the language will be the next bottleneck sooner or later.

@Isabello has reopened the issue on lisk-build where installLisk.sh is maintained. We are not ignoring the issue.

Yes and no. I've been discussing this with her, but couldn't get her to agree with me. Lisk is a decentralised project. We should encourage every user to build their database on top of data collected from peers found in the network, instead of letting them go with a snapshot. A snapshot is a great way to sync a node very quickly, much like syncing Ethereum in non-archive mode, which is fast but less secure.
There should be a clear question when installing / rebuilding a node, in installLisk.sh as well as in lisk.sh:
a clear question asking whether the user would like to do a full sync from the network or sync from centralised snapshots. Currently the only way to do that is hacky and buried in the help, hardly noticeable, and it's buggy anyway. This should be loud and clear.

Let me bring up some possible attacks when every user is encouraged to choose syncing from a snapshot. I'm saying encouraged because no question is asked; going with the snapshot is the default option.

When Lisk was running on 101 delegates solely owned by LiskHQ, it was effectively a centralised LiskHQ network (no surprise the price was falling from the ICO). At any time the Lisk blockchain could easily have been hijacked by forcing all users to upgrade at once. I'm not saying you ever did or ever wanted to do that, just that it's a vulnerability. One entity owning all delegates is a big threat to a decentralised project anyway; I believe there's no need to elaborate on this.
What if someone takes over the lisk.io DNS records and distributes a fake blockchain copy with some funds stolen from others, while LiskHQ publishes a new version of the network without backwards support, forcing everyone to hard fork? This is possible, and there are a few possible scenarios for performing such attacks.

In summary, I believe that in every decentralised project based on a distributed ledger or blockchain, syncing the blockchain from the genesis block should be the primary option, with fast sync (snapshots, in the case of Lisk) as opt-in, since it's obviously less secure. Moreover, forging delegates should be explicitly encouraged to sync from the genesis block, as they are the ones writing data to the blockchain.

I know snapshots were mandatory in the first stages of Lisk, when a node couldn't possibly sync from the beginning, but now? It works stably enough.

4miners · 2017-05-12T00:10:55Z

This issue is still valid; the slowdown during sync can be noticed on 0.9. Will investigate.

4miners · 2017-05-14T05:05:12Z

After some investigation I found that the sync slowdown is probably caused by transactions received by the node during sync. Each received transaction needs to be processed before and after processing a block (undo/redo to the unconfirmed balance). Over time they stack up, and block processing becomes slower and slower.

Solution:
Don't allow the node to receive transactions while it is in the syncing state.
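
The effect described above can be illustrated with a toy cost model. This is a sketch in Python (Lisk itself is Node.js), not Lisk's actual processing code: it assumes a fixed number of transactions arrives per processed block, that each held transaction costs one undo and one redo per block, and that held transactions are never confirmed or expired during sync. The name undo_redo_operations is hypothetical.

```python
def undo_redo_operations(blocks: int, tx_arrivals_per_block: int,
                         accept_tx_while_syncing: bool) -> int:
    """Count undo/redo operations performed while processing `blocks` blocks."""
    pending = 0      # unconfirmed transactions currently held by the node
    operations = 0
    for _ in range(blocks):
        if accept_tx_while_syncing:
            pending += tx_arrivals_per_block
        # Every held transaction is undone before the block is applied and
        # redone afterwards, so it costs two operations per block.
        operations += 2 * pending
    return operations


print(undo_redo_operations(1_000, 2, accept_tx_while_syncing=True))   # 2002000
print(undo_redo_operations(1_000, 2, accept_tx_while_syncing=False))  # 0
```

In this model the extra work per block grows linearly with the number of blocks already processed, so the total grows quadratically, which matches the observation that syncing gets slower and slower until a restart clears the held transactions.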

Isabello · 2017-05-14T10:04:09Z

This may also be a side effect of moving received-block rejection inside the sequence. Previously we always rejected blocks during sync. Now we add the received block to a sequence and do the check when its turn in the sequence comes up.
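
The difference between the two behaviours can be sketched as follows. This is a simplified Python model (Lisk itself is Node.js) with hypothetical names such as SyncingNode and reject_before_queueing; it does not reproduce Lisk's actual sequence implementation and only shows how deferring the check lets work pile up.

```python
from collections import deque


class SyncingNode:
    """Toy node that is syncing and receives broadcast blocks from peers."""

    def __init__(self, reject_before_queueing: bool) -> None:
        self.syncing = True
        self.reject_before_queueing = reject_before_queueing
        self.sequence = deque()  # work items waiting for their turn

    def on_receive_block(self, block_id: str) -> None:
        if self.reject_before_queueing and self.syncing:
            return                      # earlier behaviour: dropped immediately
        self.sequence.append(block_id)  # later behaviour: the check is deferred

    def run_next(self) -> bool:
        """Run the next queued item; return True if the block was accepted."""
        self.sequence.popleft()
        return not self.syncing         # still rejected, but only at this point


early, late = SyncingNode(True), SyncingNode(False)
for block_id in ("b1", "b2", "b3"):
    early.on_receive_block(block_id)
    late.on_receive_block(block_id)
print(len(early.sequence), len(late.sequence))  # 0 3
```

In the deferred variant every received block still ends up rejected, but only after it has occupied a slot in the sequence that sync work also has to wait behind.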

ghost · 2017-05-14T10:46:53Z

It may be important to add that this issue has been occurring since very early versions of Lisk (the first testnet release), up to and including the latest release.

diego-G · 2018-09-10T08:47:59Z

Superseded by #2384

karmacoma added the performance label Dec 12, 2016

4miners mentioned this issue Feb 27, 2017 in Improve blocks processing efficiency #449 (closed, 4 tasks)

MaciejBaj added chain labels Jun 19, 2018

diego-G closed this as completed Sep 10, 2018
