Parallel Block Validation to prevent DDOS big block attack #93

Closed

@ptschip
ptschip commented Sep 13, 2016

This is almost finished. It works well enough, but a little more polishing is needed.

I also need a good python regression test for this and some documentation on the design.

@ptschip ptschip changed the title from [WIP] Parallel Block Validation to prevent DDOS big block attack to Parallel Block Validation to prevent DDOS big block attack Oct 8, 2016
@ptschip
ptschip commented Oct 8, 2016

I'm taking this out of WIP. It's ready for code review. All python scripts are working (including pruning.py) with the exception of maxuploadtarget.py, which has been broken for some time. The unit tests all pass as well. There is just one feature update I need to make regarding how the script threads get interrupted, and I also need to add documentation and maybe expand the "parallel.py" python script. I should have the final additions done by early next week.

@ptschip
ptschip commented Oct 11, 2016

Added documentation found in /doc/bu-parallel-validation.md

Just need to add one more feature for quitting the script queue threads and then extend the parallel.py test script.

@ptschip
ptschip commented Oct 12, 2016

Added quitting of the script threads, but I still need to be able to do a control.Wait() after forcing the QUIT on the script threads; that should hopefully take care of the conditional wait error I'm seeing at the end of the python script.

Then it's just a matter of getting a more extended python script to work in the case where we have multiple attack blocks to contend with.

@ptschip
ptschip commented Oct 13, 2016

OK, I believe this is fully functional and debugged now.

Just need to add an extra python test or two.

@deadalnix

While this is certainly something very good to have, I would argue against merging it into 0.12bu. It is too large a change to get in without increasing the risk for a key release of BU.

@ptschip ptschip changed the title from Parallel Block Validation to prevent DDOS big block attack to [WIP] Parallel Block Validation to prevent DDOS big block attack Oct 20, 2016
qa/rpc-tests/excessive.py
self.nodes[random.randint(0,3)].generate(1)
- time.sleep(1)
+ time.sleep(10)
@ftrader
ftrader Oct 23, 2016

Why is this timeout increased?

@ptschip
ptschip Oct 25, 2016

Not sure why I changed that, I think I was just testing something out. I'll change it back.

qa/rpc-tests/p2p-acceptblock.py
+ while self.nodes[0].getblockcount() != 290:
+ time.sleep(1)
+ j = j + 1
+ if (j > 60):
@ftrader
ftrader Oct 23, 2016

Can you please explain (in code comments) where the 60s timeout comes from?
E.g. whether it is related to some constant or specific timeouts in the application code, or whether it is based on some calculation like the number of blocks to be transmitted under some bandwidth constraint, ...

Is this kind of construct (waiting for a node's block count to reach a certain value) something that will be needed more often in tests due to changes made in BU?
If so, it seems the test code would be clearer if this were factored out into a method.

@ptschip
ptschip Oct 25, 2016

I added a comment there that we have to wait for the reorg to complete, but the 60 seconds is just a number I picked... it has to break at some point to throw an assert if the reorg isn't happening. The reorg should only take a second or two to complete, but on my system it's a bit slow and always fails if I don't wait a second or two... rather than put in a hard limit, I thought the loop would be better.

@@ -91,6 +91,7 @@ def send_blocks_with_version(self, peer, numblocks, nVersionToUse):
block_time += 1
height += 1
tip = block.sha256
+ #tip = block.hash
@ftrader
ftrader Oct 23, 2016

Dead code that can be removed?

qa/rpc-tests/p2p-versionbits-warning.py
# Fill rest of period with regular version blocks
self.nodes[0].generate(VB_PERIOD - VB_THRESHOLD + 1)
# Check that we're not getting any versionbit-related errors in
# getinfo()
+ time.sleep(2)
@ftrader
ftrader Oct 23, 2016

Why is an additional sleep needed here?

@ptschip
ptschip Oct 25, 2016

Took it out, can't remember what I was doing there... it works fine without it.

qa/rpc-tests/p2p-versionbits-warning.py
self.nodes[0].generate(VB_PERIOD - VB_THRESHOLD)
# Might not get a versionbits-related alert yet, as we should
# have gotten a different alert due to more than 51/100 blocks
# being of unexpected version.
# Check that getinfo() shows some kind of error.
+ time.sleep(2)
@ftrader
ftrader Oct 23, 2016

Why is an additional sleep needed here?

@ptschip
ptschip Oct 25, 2016

Same thing, took it out.

qa/rpc-tests/p2p-versionbits-warning.py
@@ -148,7 +163,9 @@ def run_test(self):
# Connecting one block should be enough to generate an error.
self.nodes[0].generate(1)
+ time.sleep(2)
@ftrader
ftrader Oct 23, 2016

Why is an additional sleep needed here?

@ptschip
ptschip Oct 25, 2016

Same, took it out.

+ self.nodes[0].keypoolrefill(100)
+ for i in xrange(200):
+ send_to[self.nodes[0].getnewaddress()] = Decimal("0.01")
+ self.nodes[0].sendmany("", send_to)
@ftrader
ftrader Oct 23, 2016

To me it looks like the 5 lines from this one upward are repeated identically a few times in the following sections.

Am I missing a reason why you're not using a loop over them?

qa/rpc-tests/parallel.py
+ connect_nodes(self.nodes[1],3)
+ sync_blocks(self.nodes)
+
+ time.sleep(15) # wait here to make sure a re-org does not happen on node0 so we want to give it some time.
@ftrader
ftrader Oct 23, 2016

Is the 15 explicable in terms of other timeouts that happen?

Will it be enough for other test systems?

And if possible, is there a better way of determining whether a re-org has been triggered on node0 that does not involve waiting for some time, as this just slows the test down?

@ptschip
ptschip Oct 25, 2016

I think 15 seconds is more than enough time.

I don't know how to determine from the python scripts whether a re-org is underway... it's a good question. I would like to know the answer :) ... I'll put a TODO comment in there and also add it to my own personal task list.

src/init.cpp
@@ -1124,6 +1124,22 @@ bool AppInit2(boost::thread_group& threadGroup, CScheduler& scheduler)
for (int i=0; i<nScriptCheckThreads-1; i++)
threadGroup.create_thread(&ThreadScriptCheck);
}
+ // BU: parallel block validation - begin
@ftrader
ftrader Oct 23, 2016

Re: the thread setup starting above and continuing below, it would be nice if the number of threads were configurable rather than implemented by code duplication in various places.

12 to 16 core systems are routine even for lower-end server hardware; it would be helpful for people who have powerful nodes to be able to adjust the number of threads to suit their systems without recompiling.

@ptschip
ptschip Oct 25, 2016

This is already being done; if you look in init.cpp it checks the number of cores:

// -par=0 means autodetect, but nScriptCheckThreads==0 means no concurrency
nScriptCheckThreads = GetArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
if (nScriptCheckThreads <= 0)
    nScriptCheckThreads += GetNumCores();
if (nScriptCheckThreads <= 1)
    nScriptCheckThreads = 0;
else if (nScriptCheckThreads > MAX_SCRIPTCHECK_THREADS)
    nScriptCheckThreads = MAX_SCRIPTCHECK_THREADS;
@deadalnix

On a more general note, we should really not have any sleeps in there...

@deadalnix

It looks like what you need is a thread pool and a finish flag.

@ptschip
ptschip commented Oct 25, 2016

@ftrader I commented and/or fixed the issues you highlighted.

@ptschip
ptschip commented Oct 25, 2016

@deadalnix I think what you're getting at is: why not put everything in a threadgroup? However, these threads are detached threads. We can't join them; we want them to run until they complete or are otherwise interrupted, without affecting the main processing thread (and then we use a semaphore to make sure we don't launch any more until at least one has returned). And since we can't .join them, there is no use in having a threadgroup at shutdown to make sure the threads have finished (and we can't add detached threads to a group anyway). So instead, I use mapBlockValidationThreads to track all currently running thread information, which is essentially a custom threadgroup for our detached threads.

But you do bring up an issue: there needs to be code in there on shutdown to make sure that all the PV threads, if any, have finished their work. I'll work on that next and update here when ready.
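A minimal sketch of the pattern described above, with hypothetical names throughout (CBlock and ValidateBlock stand in for the real types and validation entry point): detached threads gated by a counting semaphore so that no more than four run at once, and tracked in a map that acts as a hand-rolled thread group.

#include <map>
#include <memory>
#include <mutex>
#include <thread>
#include <condition_variable>

struct CBlock;                          // stand-in for the real block type
bool ValidateBlock(const CBlock&);      // stand-in for the real validation entry point

class CCountingSemaphore {
    std::mutex m;
    std::condition_variable cv;
    int count;
public:
    explicit CCountingSemaphore(int n) : count(n) {}
    void wait() { std::unique_lock<std::mutex> l(m); cv.wait(l, [&] { return count > 0; }); --count; }
    void post() { std::lock_guard<std::mutex> l(m); ++count; cv.notify_one(); }
};

CCountingSemaphore semPV(4);            // at most 4 PV threads at a time
std::mutex cs_blockvalidationthread;
std::map<std::thread::id, bool> mapBlockValidationThreads; // thread id -> fQuit

void LaunchParallelValidation(std::shared_ptr<CBlock> pblock) {
    semPV.wait();                       // block until a validation slot frees up
    std::thread([pblock] {              // shared_ptr keeps the block alive in the detached thread
        { std::lock_guard<std::mutex> l(cs_blockvalidationthread);
          mapBlockValidationThreads[std::this_thread::get_id()] = false; }
        ValidateBlock(*pblock);         // runs to completion or until told to quit
        { std::lock_guard<std::mutex> l(cs_blockvalidationthread);
          mapBlockValidationThreads.erase(std::this_thread::get_id()); }
        semPV.post();                   // free the slot for the next block
    }).detach();                        // detached: cannot be joined, only signalled
}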

@deadalnix

You have a set of threads that have a set of tasks to do. This is a textbook use case for a thread pool.

These threads each have their own block to process, plus a shared object so they can coordinate. When one of the threads successfully validates a block, it can set a flag in that object. On a regular basis (for instance, in between each transaction; the actual policy does not matter much here) a worker checks the flag and abandons its work if it notices the flag is set. (Setting the flag needs to be done with a release atomic, reading with an acquire atomic.)

Interleaving the synchronization logic with the code doing the work is not only hard to understand and fragile, it is also almost impossible to get right.

You can set the maximum level of parallelism in the thread pool (currently 4, but I see no point setting this in stone) and provide blocks to the thread pool to process.
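For illustration, a sketch of the coordination object and worker loop being described, with placeholder transaction types (tie-breaking between simultaneous finishers and actually connecting the winning block are omitted):

#include <atomic>
#include <vector>

struct CValidationRace {
    std::atomic<bool> fWon{false};
    void SetWinner() { fWon.store(true, std::memory_order_release); }        // release to store
    bool SomeoneWon() const { return fWon.load(std::memory_order_acquire); } // acquire to read
};

struct TxPlaceholder {};                        // placeholder transaction type
bool CheckTransaction(const TxPlaceholder&);    // placeholder per-transaction validation

bool ValidationWorker(const std::vector<TxPlaceholder>& vtx, CValidationRace& race) {
    for (const auto& tx : vtx) {
        if (race.SomeoneWon())
            return false;               // another thread won the race; abandon this block
        if (!CheckTransaction(tx))
            return false;               // the block is invalid
    }
    race.SetWinner();                   // we finished first; signal the other workers
    return true;
}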

@ptschip
ptschip commented Nov 4, 2016 edited

@deadalnix Sorry for the long post, but with all the chatter going on in Slack I thought I should respond to your concerns here.

I understand what you're asking for as far as wanting something more classical in the design of the threadgroup, and the way things are now doesn't mean we can't move in that direction. But this is how I had to build up the code. It was extremely, and I mean extremely, difficult to figure out how the parallelism should work with the locking, the UTXO and scriptcheckqueues, on top of the learning curve of a new part of the codebase. So rather than start from a perfect design, I started by spinning up a few threads and working with IBD. Not having any way to generate parallel blocks in large amounts, I needed to work that way so that I could actually see whether PV was even going to work, and whether there would be any serious performance issues that would preclude us from going down this path and wasting a lot of time. So it took many weeks of incremental coding and testing to make it work, and it does work. I'm a very good and thorough tester; of course no one should trust a developer on his own code in that regard, but just to say I'm very confident in the working of the code as it is.

But yes, you are right, we can move to a better design and better encapsulation of the code to make it more readable and maintainable, but the functionality of the code will not change. It is complicated, as you say, particularly in the area of locking and locking orders, but that fact will not change by changing the thread model, IMO.

As for the hard-coded 4 scriptcheckqueue thread groups: I agree with you, it would be good to have that configurable, but it requires some work and it's not critical that we make it configurable right now IMO, and it is not as straightforward a task as it sounds. I think it is work that can be done at a later date.

As for the concerns about consensus: certainly we are touching the part of the code that determines which chain we follow. Tom Harding has had a look there and found a problem which I've corrected, so we've had some experienced eyes on that. Really the issue there is around pindexMostWork and how we handle it during parallelism. You can easily find that part in ConnectBlock() in main.cpp. In the end, while there are many commits to the code on github, there really isn't that much code that has changed. Most of the significant changes, and the changes related to consensus, are found in ConnectBlock(). Really the difficult part was ensuring that we have the correct pindexMostWork. You can read more about that in bu-parallel-validation.md, found in the docs folder.

As for bugs: well, yes, I would like to know if there are any bugs that need fixing. We need more people testing and reviewing the code, but our group is still small and everybody is working on their own projects. So far Tom Harding and @ftrader have done some good review, which has been very helpful. I would suggest, if you have time, that you at least compile and run the code; there is a good python test now, parallel.py, which you can run and which can be extended (I'm still working on the 4-block scenarios at the end of the script). Feedback on any specific bugs would be much appreciated.

@dgenr8

This is not a complete review.

src/main.cpp
+ // short while after it has been interrupted and meanwhile the view of the UTXO could have changed after another thread has updated the tip.
+ // ** Additionally and more importantly in the event of a big block DDOS attack, by checking the chain height here, we ensure that
+ // the thread that is validating the big block will immediately exit and finish up (while still keeping the block on disk in the
+ // case of a reorg) as soon as the first block makes it through and wins the validation race.
@dgenr8
dgenr8 Nov 6, 2016

Why is this general exit check done inside the conditionals !tx.IsCoinBase() and !viewTempCache.HaveInputs(tx)?

@ptschip
ptschip Nov 7, 2016

This is needed here because the competing thread, if it has lost the race, may find that it now has invalid inputs, because the winning thread has already updated the UTXO. If that happens we enter this section; however, we don't want to return with a DoS 100 error message at the bottom of this section. Instead we just want to return false and print the corresponding and correct log message for why the thread exited.

@dgenr8
dgenr8 Nov 7, 2016

If this thread has lost the race, shouldn't it exit regardless of whether an invalid input is detected?

@ptschip
ptschip Nov 9, 2016

@dgenr8 Yes, you are correct that it needs to exit, but the code still has to be here anyway. The reason is, again, that we cannot allow the "return DoS 100" to happen if this thread has merely lost a validation race or a reorg is underway; and if it has lost the race, or a reorg is happening, then the next call to viewTempCache.HaveInputs(tx) will fail and we will execute this section of code anyway.

And of course, if somehow the thread is done checking inputs, it will be terminated further downstream.

@dgenr8
dgenr8 Nov 9, 2016

Why would the next call to haveInputs necessarily fail?

@ptschip
ptschip Nov 10, 2016

@dgenr8 It will fail because the outputs will no longer be available: the competing block will have updated the UTXO and spent the coins. That's assuming the same txs, or many of them, are in both blocks. Of course it may not happen 100% of the time, so in that event the thread still gets a message from the competing thread to terminate at the bottom of the loop.

Hmmm, thinking about it more though, an attack block would likely be using txns that were not propagated, so any competing blocks wouldn't have those transactions in them. In that case we'd be relying on the message from the block that finished validation to terminate. I know it works fine, but maybe we should have a backup way of terminating. It's overkill in my view, but in this part of the code I think a little overkill may be warranted. What do you think?

@ptschip
ptschip Nov 10, 2016

@dgenr8 I must say though, I don't like the idea of putting duplicate code in that loop when it's not necessary. I think it's fine the way it is; it's not vital that the thread self-terminate at that point. If it does, fine, but otherwise it will get terminated either by the thread that wins the race or further downstream, at the last check just before it would update the UTXO; if it makes it that far it will finally self-terminate.

src/main.cpp
+ // case of a reorg) as soon as the first block makes it through and wins the validation race.
+ if (fParallel)
+ {
+ if (chainActive.Height() > nStartingHeight) {
@dgenr8
dgenr8 Nov 6, 2016

This should be a check against chainwork. Not every reorg results in a higher tip.

@ptschip
ptschip Nov 7, 2016

This is not for a re-org. Imagine we have two competing threads; each is validating against the current chainActive tip to get its block through. Now one of them has won the race and advanced the tip... the chain height will now be greater by 1 than when this thread began its validation. We could use chain work here also, but it's the same thing at this point.

@ptschip
ptschip Nov 7, 2016

@dgenr8 Perhaps you're right that I should at least change this to POW for the sake of consistency and understandability... also, I think we should be checking here not just for (pow > startingPOW) but also for (pow < startingPOW), as would happen if a re-org were underway while validating this block (even though we will terminate the thread anyway when the tip is disconnected).
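In concrete terms, the check being discussed might look something like the sketch below (names and types follow the usual Bitcoin pattern but are not the exact code; nStartingChainWork would be captured when the PV thread starts):

arith_uint256 nStartingChainWork = chainActive.Tip()->nChainWork; // captured at thread start

// ... later, between units of validation work:
if (fParallel && chainActive.Tip()->nChainWork != nStartingChainWork) {
    // Work increased: another block won the race and advanced the tip.
    // Work decreased: a re-org is underway and our base view is stale.
    // Either way, exit quietly rather than returning a DoS(100) error.
    return false;
}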

+ // Once the first block has been disconnected and a re-org has begun then we need to terminate any
+ // currently running PV threads that are validating. They will likley have self terminated
+ // at this point anyway because the chain tip and UTXO base view will have changed but just
+ // to be sure we are not waiting on script threads to finish we can issue the termination here.
@dgenr8
dgenr8 Nov 6, 2016

This new code could be moved before fBlocksDisconnected = true and conditioned on !fBlocksDisconnected instead of a new boolean.

@ptschip
ptschip Nov 7, 2016

I put it after the first DisconnectTip() because if the disconnect fails it will return false, and therefore we don't want to kill the validation threads if any are running. (Although it seems rather unlikely that a disconnect would fail, and if it did I think there would be a more serious problem; I'm not sure it's even necessary for the code to return there, perhaps an assert would be more appropriate.)

Also, I put it in that section because I thought it would be best to kill the PV threads as soon as the first block was disconnected... there may be other blocks to disconnect in the event of a larger re-org, so rather than waiting until all the blocks are disconnected, the PV threads are terminated immediately after the chain tip they are working on has been undone.

@dgenr8
dgenr8 Nov 7, 2016

Sorry, my only point here was that a new boolean isn't needed.

@ptschip
ptschip Nov 9, 2016 edited

@dgenr8 Oh, ok...thanks, that makes sense. I'll update that.

src/main.cpp
@@ -2945,8 +2945,7 @@ bool ConnectBlock(const CBlock& block, CValidationState& state, CBlockIndex* pin
LogPrint("parallel", "Sending Quit() to scriptcheckqueue\n");
(*mi).second.pScriptQueue->Quit();
}
- //(*mi).second.tRef->interrupt(); // interrupt the thread
- (*mi).second.fQuit = true; // interrupt the thread
+ (*mi).second.fQuit = true; // quit the thread
@deadalnix
deadalnix Nov 10, 2016

Can you explain why you use both this mechanism and the interrupt from boost?

Also, please use an atomic to read/write this flag. Writing requires a release and reading an acquire.

@ptschip
ptschip Dec 10, 2016

Updated and changed to an atomic.

Two mechanisms are used to exit the PV thread: one is to interrupt the scriptcheck thread pool (used for checking sigs), the other is to send a quit to the PV thread itself.

src/main.cpp
static CCheckQueue<CScriptCheck> scriptcheckqueue(128);
+static CCheckQueue<CScriptCheck> scriptcheckqueue2(128);
+static CCheckQueue<CScriptCheck> scriptcheckqueue3(128);
+static CCheckQueue<CScriptCheck> scriptcheckqueue4(128);
@deadalnix
deadalnix Nov 10, 2016

Making this an array looks like a more appropriate choice.
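One possible array form, as a sketch (since CCheckQueue holds synchronization primitives and is non-copyable, the queues are held through pointers and built once at startup; InitScriptCheckQueues is a hypothetical helper):

#include <array>
#include <memory>

static std::array<std::unique_ptr<CCheckQueue<CScriptCheck>>, 4> scriptcheckqueues;

void InitScriptCheckQueues() {
    for (auto& q : scriptcheckqueues)
        q.reset(new CCheckQueue<CScriptCheck>(128)); // same batch size as the originals
}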

src/main.cpp
+void ThreadScriptCheck4() {
+ RenameThread("bitcoin-scriptch4");
+ scriptcheckqueue4.Thread();
+}
@@ -2408,8 +2447,12 @@ static int64_t nTimeIndex = 0;
static int64_t nTimeCallbacks = 0;
static int64_t nTimeTotal = 0;
-bool ConnectBlock(const CBlock& block, CValidationState& state, CBlockIndex* pindex, CCoinsViewCache& view, bool fJustCheck)
+bool ConnectBlock(const CBlock& block, CValidationState& state, CBlockIndex* pindex, CCoinsViewCache& view, bool fJustCheck, bool fParallel)
@deadalnix
deadalnix Nov 10, 2016

A boolean parameter is usually a code smell. I know they are all over the codebase, but I don't think that's a good reason to pile on more.

@ptschip
ptschip Dec 10, 2016

It is needed here; we have to support both parallel and non-parallel modes of operation.

+ if (coins && !coins->IsPruned())
+ return state.DoS(100, error("ConnectBlock(): tried to overwrite transaction"),
+ REJECT_INVALID, "bad-txns-BIP30");
+ }
@deadalnix
deadalnix Nov 10, 2016

The whole thing should go in its own diff.

src/main.cpp
+ // scriptcheckqueue threads have all returned.
+ boost::shared_ptr<boost::mutex> scriptcheck_mutex = allScriptCheckQueues.GetScriptCheckMutex();
+ if (fParallel) cs_main.unlock(); // unlock cs_main, we may be waiting here for a while before aquiring the scoped lock below
+ boost::mutex::scoped_lock lock(*scriptcheck_mutex);
@deadalnix
deadalnix Nov 10, 2016

This violates almost all known good practices when it comes to multithreading. I mean, even if you got it right this time, this is the definition of brittle and WILL break at some point because someone changes something and it all falls apart.

@ptschip
ptschip Dec 10, 2016

It works fine... take a look at the newer code. Much has changed and been streamlined; almost everything is now in parallel.h and parallel.cpp. Any code is "brittle" and requires the coder to understand the code he's dealing with. The alternative here would be to support two sets of code for ConnectBlock(), one for parallel validation and one without, but that will end up being much worse IMO for potential code breakage, if/when developers make changes to one without making them to the other. Take a look at the newer code and then comment again.

return false;
- }
@deadalnix
deadalnix Nov 10, 2016

Also,

if (!foo) return false;
return true;
@deadalnix

Alright. There are all kinds of things wrong here. I think you have a good prototype, but I don't think this meets the standard of quality we should expect from code that runs a $10B network. I know the code is fubared to begin with, but that's not a good reason to add more.

I think it is fair to say that this code grew organically rather than from principled software engineering. That's ok to get something up and running quickly, but that's it.

I'd like to paraphrase Romero here (it is about a guideline he put in place with Carmack and others when growing id Software): "When you find a problem with existing code, fix it. Do not continue what you are doing, do not work around it, and do not open a task for it. If you don't, your code will be built on faulty foundations."

That being said, here we go.

  • The code mixes synchronization with business logic. You need to come up with an architecture that allows a separation of concerns here. See the reasoning below in a).
  • The code uses 3 different synchronization techniques (mutexes, flagging, and boost thread interruption). At least 2 of them are redundant with each other (flags and boost thread facilities).
  • The flags are not used properly. First, they need to be set using a release atomic and read using an acquire, and second, it is unclear what prevents the UTXO from being changed between the flag read and the UTXO use. It looks like the right thing to do here is to check the flag while holding a read lock on the UTXO.
  • There are also 2 different fQuit flags, for maximum confusion.
  • There seems to be no use of RWLock. Things like the UTXO would greatly benefit from one.
  • There is a lot of manual locking and unlocking. This isn't exception safe. It is also an indicator of poor code structure. Note that boost thread termination uses exceptions. Scope guards are needed, and code using several mutexes needs to be physically separated.
  • There is a TON of fParallel checks. This is a code smell; once again, it shows poor code structure, same as the IsWitnessProgram checks scattered all over the codebase. If you need to specialize a bunch of behavior based on some check, there are various superior techniques. One is to go OOP and use virtual methods: the check is done once and virtual dispatch takes care of the rest. Another alternative is to use compile-time policies ( http://www.intopalo.com/blog/2014-03-28-policy-based-design/ ).

a) Synchronization logic cannot be tested the way regular code is. It is also notoriously error prone. It is important that it be reviewable independently from the application logic.

Last but not least, large changes that contain both refactoring and logic changes are the riskiest. They also tend not to be a very efficient way to work for the contributor: in effect, nothing can be merged until EVERYTHING is in a good enough state to be merged. This PR contains code that is perfectly fine today which may become not fine in the future because of others' work. I would suggest you proceed as follows:
1/ Do small restructurings of the existing code to move toward getting the hooks you need in the right places. These changes need to be purely structural and must not change any application logic. This is often a good opportunity to add extra unit tests as the structure of the code improves.
2/ In parallel, start introducing the code for parallel validation. It does not need to be wired into the existing logic outside of tests at first.

When 1/ and 2/ are sufficiently advanced, you should find yourself in a position where you can add the fParallel flag, check it early on in the validation logic, and choose the right policy accordingly. This should look like (pseudocode):

ValidateBlock<SingleThreadValidationPolicy>(chain, block);

changed into

// Various code to checks if we want parallel validation
fParallel = ...;

if (fParallel) {
    ValidateBlock(chain, block, MultipleThreadValidationPolicy(4));
} else {
    ValidateBlock(chain, block, SingleThreadValidationPolicy());
}

And voilà!

I hope that helps.
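For concreteness, a sketch of the compile-time policy variant (every name here is hypothetical; OtherThreadWon(), CheckTransaction() and ConnectTip() stand in for whatever the real implementation exposes):

struct SingleThreadValidationPolicy {
    static bool ShouldAbandon() { return false; }               // no race to lose
};

struct MultipleThreadValidationPolicy {
    static bool ShouldAbandon() { return OtherThreadWon(); }    // poll the shared race flag
};

template <typename Policy>
bool ValidateBlock(CChain& chain, const CBlock& block) {
    for (const auto& tx : block.vtx) {
        if (Policy::ShouldAbandon())
            return false;               // lost the race; bail out early
        if (!CheckTransaction(tx))
            return false;               // the block is invalid
    }
    return ConnectTip(chain, block);    // won the race; advance the tip
}

// ValidateBlock<SingleThreadValidationPolicy>(chain, block);
// ValidateBlock<MultipleThreadValidationPolicy>(chain, block);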

src/checkqueue.h
+ {
+ boost::mutex::scoped_lock lock(mutex_fQuit);
+ fQuit = true;
+ LogPrint("parallel_2", "setting fQuit to: %d\n", fQuit);
@deadalnix
deadalnix Nov 11, 2016

It doesn't make sense to log while holding that mutex. Critical sections need to be reduced to the minimum.

@ptschip
ptschip Dec 10, 2016

These were just intended to be temporary, and I've taken them out. But keep in mind that when you use LogPrint, it doesn't actually print anything unless debug=parallel_2 is set in your bitcoin.conf. In any case, these are now removed, since the code has been fully stable for a while now.

src/main.cpp
+ boost::mutex::scoped_lock lock(*scriptcheck_mutex);
+
+ // Get the next available scriptcheckqueue.
+ CCheckQueue<CScriptCheck>* pScriptQueue = allScriptCheckQueues.GetScriptCheckQueue(scriptcheck_mutex);
@deadalnix
deadalnix Nov 14, 2016

Alright, it seems that this whole thing ends up being the major design problem. I believe this needs to be fixed before PV moves forward. You are juggling locks here and mixing that with block validation. This is brittle.

This should be refactored. Here is the API I would aim for:

auto token = signatureChecker.check(transactions_to_check);

// Do various stuff while signature are being validated...

auto result = token.get(); // Wait for transactions to be validated and return the result of the validation.
token.cancel(); // Cancel processing.

The token is a value type. The destructor cancels the work if it hasn't finished already.

Behind this API you can hide pretty much whatever is wanted. The important thing is that it is abstracted away.
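As a sketch of what could sit behind that API (hypothetical, built from std::future plus an atomic cancellation flag that the worker threads would poll between checks):

#include <atomic>
#include <future>
#include <memory>

class CCheckToken {
    std::future<bool> result;
    std::shared_ptr<std::atomic<bool>> fCancel;   // workers poll this between checks
public:
    CCheckToken(std::future<bool>&& r, std::shared_ptr<std::atomic<bool>> c)
        : result(std::move(r)), fCancel(std::move(c)) {}
    bool get() { return result.get(); }           // wait for the validation verdict
    void cancel() { fCancel->store(true, std::memory_order_release); }
    ~CCheckToken() { cancel(); }                  // a destroyed token cancels unfinished work
};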

@ptschip
ptschip Dec 10, 2016

Again, take a look at the newer code, and keep in mind we are not just validating signatures here... but that said, the newer code completely gets rid of cs_main during the time we're looping through the inputs and checking sigs and txns.

src/main.cpp
+ if (fParallel) {
+ // We need to place a Quit here because we do not want to assign a script queue to a thread of activity
+ // if another thread has just won the race and has sent an fQuit.
+ if (mapBlockValidationThreads[this_id].fQuit) {
@deadalnix
deadalnix Nov 14, 2016

Once more, no repeated lookups.

src/main.cpp
+ // if another thread has just won the race and has sent an fQuit.
+ if (mapBlockValidationThreads[this_id].fQuit) {
+ LogPrint("parallel", "fQuit 0 called - Stopping validation of this block and returning\n");
+ cs_main.lock(); // must lock before returning.
@deadalnix
deadalnix Nov 14, 2016

Why does the caller need that lock? Can we refactor?

@ptschip
ptschip Dec 10, 2016

cs_main must be locked when we return, as it is a recursive mutex and must be unlocked by the caller.
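The unlock-then-relock contract described here can itself be wrapped in a scope guard, so the "must be locked on return" invariant holds on every exit path. A sketch (CReverseLock is a hypothetical helper, not existing code):

class CReverseLock {
    CCriticalSection& lock;
public:
    explicit CReverseLock(CCriticalSection& lockIn) : lock(lockIn) { lock.unlock(); }
    ~CReverseLock() { lock.lock(); } // cs_main re-acquired even on early return or exception
};

// usage inside ConnectBlock():
// {
//     CReverseLock reverse(cs_main); // release cs_main while waiting on script threads
//     control.Wait();
// }                                  // cs_main held again from here on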

src/main.cpp
+ mapBlockValidationThreads[this_id].pScriptQueue = pScriptQueue;
+
+ // Assigne the nSequenceId for the block being validated in this thread.
+ mapBlockValidationThreads[this_id].nSequenceId = pindex->nSequenceId;
@deadalnix
deadalnix Nov 14, 2016

You shouldn't be fetching these again and again. Just pass a pointer to the object around.

src/main.cpp
+ }
+
+ // Re-aquire cs_main if necessary
+ if (fParallel) cs_main.lock();
@deadalnix
deadalnix Nov 14, 2016

There are various places where you lock and unlock this guy. The way you do it means that the lock could be in any state when this code returns. Does the caller expect a specific lock state?

Also, this is obviously error prone, and even if you get it right, someone is bound to edit that code and get it wrong.

@ptschip
ptschip Dec 10, 2016

cs_main must be locked when it returns... as mentioned before.

@@ -2541,16 +2632,35 @@ bool ConnectBlock(const CBlock& block, CValidationState& state, CBlockIndex* pin
if (!tx.IsCoinBase())
{
- if (!view.HaveInputs(tx))
+ if (!viewTempCache.HaveInputs(tx)) {
@deadalnix
deadalnix Nov 14, 2016

Can you explain why view cannot be used?

src/main.cpp
+ return false;
+ }
+ LOCK(cs_blockvalidationthread);
+ if (mapBlockValidationThreads[this_id].fQuit) {
@deadalnix
deadalnix Nov 14, 2016

You should avoid doing a map lookup every time you want this object. Just keep a pointer to it.

Also, as I mentioned, you need to use an atomic to check that flag.

src/main.cpp
return state.DoS(100, error("ConnectBlock(): inputs missing/spent"),
- REJECT_INVALID, "bad-txns-inputs-missingorspent");
+ REJECT_INVALID, "bad-txns-inputs-missingorspent");
@deadalnix
deadalnix Nov 14, 2016

Please avoid pure formatting changes, or do them in their own PR. They make the log confusing for other devs.

@ptschip
ptschip Dec 10, 2016

Just a mistake, fixed.

src/main.cpp
- setPreVerifiedTxHash.erase(hash);
- setUnVerifiedOrphanTxHash.erase(hash);
+ {
+ cs_xval.lock();
@deadalnix
deadalnix Nov 14, 2016

Use a scope guard. The way you are doing these locks and unlocks is absolutely not safe.

@ptschip
ptschip Dec 10, 2016

Switched to a recursive lock.
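For reference, the scope-guard form being asked for, using the codebase's own LOCK macro (a sketch of the shape, not the exact final code):

{
    LOCK(cs_xval); // RAII: acquired here...
    setPreVerifiedTxHash.erase(hash);
    setUnVerifiedOrphanTxHash.erase(hash);
} // ...and released here on every path out, including boost's interruption exceptions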

src/main.cpp
+ }
+ control.Add(vChecks);
+ nChecked++;
+ if (fParallel) cs_main.lock();
@deadalnix
deadalnix Nov 14, 2016

I don't think checking signatures should have anything to do with cs_main. Please keep separation of concerns.

Also, I'm curious why you only lock/unlock cs_main when the PV flag is set. It doesn't look to me like it is any safer to do or not do this depending on PV. Could you make a PR to tweak the cs_main lock independently?

src/main.cpp
+ // Return if Quit thread received.
+ if (fParallel)
+ {
+ LOCK(cs_blockvalidationthread);
@deadalnix
deadalnix Nov 14, 2016

If you were passing a pointer, you wouldn't have to set/unset this.

src/checkqueue.h
@@ -60,6 +60,9 @@ class CCheckQueue
*/
unsigned int nTodo;
+ //! Mutex to protect fQuit
+ boost::mutex mutex_fQuit;
@deadalnix
deadalnix Nov 14, 2016

I think you can remove this mutex and use atomics instead.

Release to store, acquire to read.
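What that might look like inside CCheckQueue, as a sketch (the branch's later commit "Use Atomic for fQuit flag in checkqueue.h" did something along these lines; QuitRequested is a hypothetical accessor name):

std::atomic<bool> fQuit;                              // replaces fQuit + mutex_fQuit

void Quit() {
    fQuit.store(true, std::memory_order_release);     // release to store
}
bool QuitRequested() {
    return fQuit.load(std::memory_order_acquire);     // acquire to read
}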

Peter Tschipper and others added some commits Sep 6, 2016
@ptschip Peter Tschipper Parallel Block Validation to mitigate DDOS big block attack
- add 3 more script check queues
- use a wrapper function to call HandleBlockMessageThread
- Create semaphores to control and handle queue selection and throughput
- Update locking - we do not need to hold cs_main when checking inputs
- Flush the temporary view to the base view rather than re-running every txn with UpdateCoins
- unlock cs_main before waiting for script check threads to complete
- Python test for Parallel Validation

- Disable SENDHEADERS requests and always revert to INV
It seems that SENDHEADERS functionality really causes problems,
even more so with parallel validation. Turning it off here
completely, but we will still service nodes that want it.

- Only allow parallel block validation during IBD or receiving new blocks

- When mining new blocks, re-indexing, or during any other activity, blocks
are prevented from entering parallel validation: the cs_main locks are
not relinquished in those cases.

- Turn off Parallel Validation when doing IBD

- Ability to Quit scriptcheck threads

When 4 blocks are running in parallel and a
new, smaller block shows up, we need to be able to interrupt the
script threads that are currently validating one of the blocks
so that we can free up a script check queue for the new smaller block.

Documentation for Parallel Validation

Change some logprint messages from parallel to parallel_2

- Update the nBlockSequenceId after the block has advanced the tip

This is important for Parallel Validation. By decreasing the sequence id
we indicate that this block represents the current pindexMostWork. This
prevents the losing block from having pindexMostWork point to it rather
than to the winning block.

- Continuously check for new blocks to connect

With Parallel Validation new blocks can be accepted while we are connecting at
the tip, therefore we need to check after we connect each block whether a potentially
longer, more-work chain now exists. If we don't do this then we can at times end
up with a temporarily unconnected block until the next block gets mined and
another attempt is made to connect the most-work chain.

- Terminate any PV threads during a re-org

PV can only operate on blocks that will advance the current chain tip (fork 1). Therefore,
if we need to re-org to another chain (fork 2), we have to kill any currently
running PV threads associated with the current chain tip on fork 1. This is the
solution to the problem of having two forks being mined while there is an attack
block processing on fork 1. If fork 2 continues to be mined and eventually pulls
ahead of fork 1 in proof of work, then a re-org to fork 2 will be initiated, causing
the PV threads on fork 1 to be terminated, the fork 2 blocks to be connected, and fork 2
to become the active chain tip.

- Use Chain Work instead of nHeight to self-terminate a PV thread

If the Chain Work has changed, either positively or negatively, from where it
was when we started the PV thread, then we exit the thread. Previously
we were using Chain Height, which worked fine, but this is more understandable
from a coding perspective, and we also added the ability to check for when
Chain Work has decreased from the starting point, which would indicate that
a re-org is underway.
f2387d1
@ptschip ptschip Move ZMQ notifications to ActivateBestChainStep
We must notify ZMQ after each tip is connected rather
than after ActivateBestChain, otherwise we can miss a block
when there are several to connect together at once.
ca480ce
@ptschip ptschip Don't relay block inventory unless the chain is nearly sync'd
This particularly seems to help sync issues on regtest where
we're mining many blocks all at the same time.
134ba1c
@ptschip ptschip Create parallel.cpp and parallel.h and move PV code into it 5e03ea2
@ptschip ptschip Simplify the selection of the scriptcheck_mutex and pScriptQueue
Abstract the code and make it more readable by returning both
values at the same time.

Small Pointer Optimization

Reorder where the control is acquired
deeec3c
@ptschip ptschip Remove parallel_2 logprints from checkqueue.h 067b2cf
@ptschip ptschip Small pointer optimization for HandleBlockMessage 4c60cfe
@ptschip ptschip move log messages to parallel.cpp 920b065
@ptschip ptschip streamline ConnectBlock a little more da5cecd
@ptschip ptschip Cleanup CAllScriptCheckQueues in parallel.h 1244c35
@ptschip ptschip remove Class CscriptCheck from unlimited.h fa88cd3
@ptschip ptschip Small edit 1ecaf65
@ptschip ptschip Use Atomic for fQuit flag in checkqueue.h 453e379
@ptschip ptschip Completely remove cs_main locks from checkinputs and sigs during PV
During the loop where we check inputs and signatures we can completely
remove the cs_main locks by making sure to add an internal
cs_main lock to ChainWorkHasChanged(). This function will rarely
be invoked, and only if the thread is about to exit, so there is no
problem here with adding any additional overhead.

Removing the cs_main locks here is the most important step in
completely removing any significant inter-dependency between
parallel validation threads.
0b86ad4
@ptschip ptschip Must maintain locking order between scoped lock and cs_main
If we are in PV mode then we must make sure the scoped lock is
unlocked before re-locking cs_main, otherwise there is a potential
for a deadlock. The locking order should be cs_main -> scoped lock.
However, during PV we unlock cs_main before acquiring the scoped lock,
and therefore we must unlock the scoped lock before re-acquiring
cs_main near the end of ConnectBlock().
ab5cb06
@ptschip ptschip Make sure locks are correct before returning after SequenceLocks check.
The scriptlock must be unlocked and cs_main must be locked before
returning after the call to SequenceLocks() fails.
5586b83
@ptschip ptschip Create SetLocks() used to set the locking order when returning from PV
If there is an error during PV then we have to make sure the locks
are set in the right order before returning. cs_main must always
be locked when we return from ConnectBlock(), as cs_main is recursive,
but we must also ensure that the scriptlock is unlocked before
re-locking cs_main to avoid a potential deadlock.

By creating SetLocks() we can abstract away the setting of the locking
order and prevent any developer confusion or possible introduction of errors
into the code if future changes are made in the ConnectBlock() function.
ee2b08f
@ptschip ptschip Consolidate and simplify the script check thread pool creation
In the past the code for creating the script check threads and pools
was scattered across several files, but it is now consolidated in
parallel.cpp and parallel.h. It is also much easier to change the
number of scriptcheckqueues by just editing two lines in parallel.cpp.
0393332
@ptschip ptschip Move cs_blocksemaphore into globals.
Using it defined as a static within HandleBlockMessageThread is not
thread-safe until we get to C++11.
7055ee9
@ptschip ptschip Move the acquisition of the script control until after the scoped lock 37ab4b6
@ptschip ptschip Change back to using a mutex for fQuit flag
It seems overkill to use an atomic here for code that will
very rarely get executed.
af5e49c
@ptschip ptschip Use only one semaphore for PV instead of two
This simplifies the code and makes it easier to understand. Also, there
is very little improvement, if any, to IBD from using a separate semaphore.
6b97060
@gmaxwell

This would significantly incentivize the creation of empty blocks, as they would always win races even if they arrived significantly later. Is this your intention?

@gandrewstone gandrewstone changed the title from [WIP] Parallel Block Validation to prevent DDOS big block attack to Parallel Block Validation to prevent DDOS big block attack Jan 8, 2017
@gandrewstone
Collaborator

Closing; ptschip has a new branch and will open a new PR soon...
