
feat: concurrent recheckTx #163

Merged: 14 commits into Finschia:feat/perf on Jan 21, 2021
Conversation

@jinsan-line commented Jan 14, 2021

Related to: https://github.com/line/link/issues/1152

Description

To optimize performance, we need to increase concurrency. After implementing concurrent checkTx (#160), I'd like to implement concurrent recheckTx.

The key change is decomposing application.CheckTx() into CheckTxSync() and CheckTxAsync(). abci.CheckTxSync() and abci.CheckTxAsync() currently use the same application interface, application.CheckTx(), so both abci.CheckTxSync() and abci.CheckTxAsync() are blocking functions. With application.CheckTxSync() and application.CheckTxAsync(), I intend to implement these as sync/blocking and async/non-blocking in cosmos-sdk.

Please note why application.CheckTxSync() is still needed: rpc.broadcastTxSync() (and Commit()) already has its own goroutine from the HTTP server, so in that case calling synchronously is better because it avoids creating an unnecessary extra goroutine.
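For illustration, here is a minimal sketch of the decomposed interface. The method names follow the description above, but the request/response types and exact signatures are stand-ins, not the actual code in this PR:

```go
package abcisketch

// Stand-in request/response types for the real ABCI messages.
type RequestCheckTx struct{ Tx []byte }
type ResponseCheckTx struct{ Code uint32 }

// Application sketches the decomposition: the former single CheckTx()
// is split into a sync (blocking) variant and an async variant that
// reports its result through a callback.
type Application interface {
	// CheckTxSync blocks until the check result is available.
	CheckTxSync(req RequestCheckTx) ResponseCheckTx

	// CheckTxAsync may return before the check completes; the result
	// is delivered to cb, possibly from another goroutine.
	CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx))
}
```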


For contributor use:

  • Wrote tests
  • Updated CHANGELOG_PENDING.md
  • Linked to GitHub issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments
  • Re-reviewed Files changed in the GitHub PR explorer

@jinsan-line jinsan-line marked this pull request as ready for review January 18, 2021 08:20
@jinsan-line (Author) commented Jan 19, 2021

@torao

> In general, there are many important points that need to be considered when making processing asynchronous: for instance, the scope of threads (creation and termination) and who manages them, whether to use a thread pool, and what the impact will be if the processing consumes a lot of time. So I would like to ask:

First of all, I'd like to mention that we're dealing with Go, not Java. In Go there is no explicit thread-pool concept; there are only goroutines, which are multiplexed onto a pool of up to GOMAXPROCS OS threads. But as Go programmers we don't need to think about that thread pool when using goroutines, and we can conceptually treat a goroutine as a free resource.

In "Concurrency in Go: Tools and Techniques for Developers, Katherine Cox-Buday, O'Reilly Media," the author is saying as follows:

| Go’s philosophy on concurrency can be summed up like this: aim for simplicity, use channels when possible, and treat goroutines like a free resource.

> • In this PR, what thread does the asynchronous processing take place in? Will a new thread be started each time the callback is executed? Or does it use a dedicated thread pool for CheckTx asynchronous processing?

Tendermint doesn't and shouldn't care which goroutine executes the callback, but as the owner of ABCI it should provide an asynchronous interface that gives the ABCI server a chance to execute ABCI calls concurrently.

> • If a new thread is started for each asynchronous process, won't a large number of threads and system resources be consumed, and the OS become unresponsive, if the `CheckTx` process is delayed for some reason?
> • If it uses a thread pool with a fixed number of threads, or if the thread pool is shared by processes other than CheckTx, what will the impact be if the CheckTx process is delayed and no more threads are available?

I hope the answer to both of these is that we can treat goroutines like a free resource.

@torao (Member) commented Jan 19, 2021

I had thought goroutines were similar to (OS-native) threads, but now I understand that they more closely resemble coroutines (and the name suggests as much). I think this amounts to a parallel-processing model that uses GOMAXPROCS OS-native threads to process queued jobs, and it's a more programmer-friendly abstraction.

As computer resources are finite, however, I think "free resource" is a bit idealistic. Assuming such a job-queue/workers parallel-processing model, if more than GOMAXPROCS CheckTx calls are (not blocked but) delayed at the same time, will all the other goroutines in Tendermint be affected?

The term "thread pool" in the previous comment is a Java expression, but in this case, GOMAXPROCS OS-native threads are equivalent meaning to "the thread pool provided by system default". What I was considering about in my previous comment was that in any language or framework, to avoid exhausting such the system default thread pool, it should be designed carefully and is sometimes considered having a separate thread pool.

@jinsan-line (Author) commented Jan 19, 2021

@torao

I think the same way about thread pools and resource management when dealing with parallel execution, but it doesn't affect this PR at all.

> I had thought goroutines were similar to (OS-native) threads, but now I understand that they more closely resemble coroutines (and the name suggests as much). I think this amounts to a parallel-processing model that uses GOMAXPROCS OS-native threads to process queued jobs, and it's a more programmer-friendly abstraction.

I think a goroutine is a kind of coroutine as well.

> As computer resources are finite, however, I think "free resource" is a bit idealistic. Assuming such a job-queue/workers parallel-processing model, if more than GOMAXPROCS CheckTx calls are (not blocked but) delayed at the same time, will all the other goroutines in Tendermint be affected?

The term "thread pool" in the previous comment is a Java expression, but in this case, GOMAXPROCS OS-native threads are equivalent meaning to "the thread pool provided by system default". What I was considering about in my previous comment was that in any language or framework, to avoid exhausting such the system default thread pool, it should be designed carefully and is sometimes considered having a separate thread pool.

To my understanding, every function in Go runs on a goroutine. We already have the question of what happens if checkTx takes a long time: it already affects the block interval, so that risk is not newly introduced here. The only new thing we should focus on for this PR is scheduling cost.

(Another thing I'd like to note is that an asynchronous interface doesn't compel non-blocking behavior or the use of another goroutine. For example, the xxxAsync() functions in the Tendermint example apps are still blocking; see the sketch after the list below.)

  • Basically, I think we should accept that goroutine scheduling is fairly cheap, because it's a well-known property and one of Go's killer features. If we couldn't trust this assumption, we couldn't use Go.
  • Deadlock could be a newly introduced concern, but it's not Tendermint's business. (Of course, it's the app team's responsibility in cosmos-sdk. HaHa~) Tendermint can only give the application a chance to be non-blocking. Tendermint is exposed to deadlock risk if the application implements checkTx using goroutines, but only by accepting this risk do we get a chance to optimize performance with parallel execution.
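To make the parenthetical note above concrete: reusing the stub types from the earlier sketch, both implementations below satisfy the same hypothetical async signature, yet the first is blocking and the second is not. These are illustrations, not the PR's actual client code:

```go
// localClient runs the check inline: "async" in shape, blocking in fact.
type localClient struct{ app Application }

func (c *localClient) CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx)) {
	cb(c.app.CheckTxSync(req)) // executes on the caller's goroutine
}

// asyncClient hands the work to a new goroutine and returns immediately.
type asyncClient struct{ app Application }

func (c *asyncClient) CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx)) {
	go func() { cb(c.app.CheckTxSync(req)) }() // non-blocking for the caller
}
```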

@@ -266,5 +273,6 @@ func newLocalReqRes(req *types.Request, res *types.Response) *ReqRes {
 	reqRes := NewReqRes(req)
 	reqRes.Response = res
 	reqRes.SetDone()
+	reqRes.Done()
Contributor:

I'm just asking out of curiosity: did you add this code because there was a problem with its absence before, or is it just a formality?

Author (@jinsan-line):

There was no buggy behavior before, but as you can see, it's not good in terms of consistency.
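For context, a rough sketch of the consistency argument, modeled loosely on tendermint's ReqRes (field names illustrative, under the assumption that ReqRes embeds a sync.WaitGroup that is Add(1)'d at construction): the done flag and the WaitGroup are two separate completion signals, so a locally-completed request should complete both, as the other completion paths do.

```go
import "sync"

// ReqRes sketch: completion is signaled two ways, via a flag and a
// WaitGroup. Callers may rely on either, so both must be completed.
type ReqRes struct {
	*sync.WaitGroup // Done() releases anyone blocked in Wait()

	mtx  sync.Mutex
	done bool // SetDone() marks the flag; readers check it under mtx
}

func NewReqRes() *ReqRes {
	wg := &sync.WaitGroup{}
	wg.Add(1) // balanced by exactly one Done()
	return &ReqRes{WaitGroup: wg}
}

func (r *ReqRes) SetDone() {
	r.mtx.Lock()
	r.done = true
	r.mtx.Unlock()
}
```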

	Tx:   memTx.tx,
	Type: abci.CheckTxType_Recheck,
})
reqRes.SetCallback(func(res *abci.Response) {
Contributor:

Where will this reqRes.cb be called?

Author (@jinsan-line):

I'd like to refactor this code because it's not intuitive and carries many assumptions. Refactoring it within this PR would make the PR more complicated, so I didn't. I'd like to refactor it if I get the chance.

https://github.com/line/tendermint/blob/8316c03342d55179c70f0e32df8b861e58af5044/abci/client/client.go#L99-L110

Author (@jinsan-line):

Oh, when I reverted the refactored code I missed calling the callback for the async, non-blocking abci case. I'll revise it, thanks.

Author (@jinsan-line):

Thanks, fixed in 38df3b3.

Author (@jinsan-line):

As you can see in client.go, it has a kind of bug: cb can be called multiple times if SetCallback() is called after the ReqRes is already done. I'd still like to refactor it if I get the chance.
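A simplified reproduction of the hazard being described, with illustrative names rather than the actual client.go code: if SetCallback fires the callback itself whenever the request is already done, then a callback stored earlier is fired by the completion path, and any late SetCallback fires it again.

```go
import "sync"

type Response struct{ Code uint32 }

type reqRes struct {
	mtx  sync.Mutex
	done bool
	cb   func(*Response)
	res  *Response
}

// SetCallback stores cb, or fires it inline if already done. Each call
// made after completion fires the callback again.
func (r *reqRes) SetCallback(cb func(*Response)) {
	r.mtx.Lock()
	if r.done {
		r.mtx.Unlock()
		cb(r.res) // a late SetCallback fires cb here...
		return
	}
	r.cb = cb
	r.mtx.Unlock()
}

// invokeCallback runs on the completion path and fires the stored cb,
// so the same callback machinery can end up invoked more than once.
func (r *reqRes) invokeCallback() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cb != nil {
		r.cb(r.res) // ...in addition to the firing here
	}
}
```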

@egonspace (Contributor) reviewed:

I fully agree with the improvement direction.

(Two review threads on abci/client/local_client.go: outdated, resolved)
@wetcod (Contributor) reviewed:

LGTM

@kukugi reviewed:

LGTM 👍

@egonspace (Contributor) commented Jan 19, 2021

The mempool reactor_test failed. I think this failure should be fixed.

    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0x23, 0xac, 0x5c, 0xbc, 0xf7, 0xca, 0x73, 0xbd, 0xe2, 0xbf, 0x15, 0x58, 0x72, 0xcb, 0xfd, 0x25, 0x55, 0xea, 0x33, 0xc7}
        	            	actual  : types.Tx{0x7, 0xc, 0x90, 0x43, 0x27, 0x47, 0xe5, 0xe6, 0xed, 0xbf, 0x5b, 0x3d, 0x58, 0xc6, 0x8c, 0x93, 0xe8, 0x47, 0xb2, 0x8e}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}
        	            	+(types.Tx) (len=20) Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 98 on reactor 3 don't match: Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7} vs Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0x7, 0xc, 0x90, 0x43, 0x27, 0x47, 0xe5, 0xe6, 0xed, 0xbf, 0x5b, 0x3d, 0x58, 0xc6, 0x8c, 0x93, 0xe8, 0x47, 0xb2, 0x8e}
        	            	actual  : types.Tx{0xe4, 0x83, 0xec, 0x83, 0x3, 0x7f, 0xa6, 0xad, 0x8c, 0x20, 0x92, 0x31, 0xe9, 0x74, 0x72, 0xd3, 0x59, 0x79, 0x67, 0x65}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
        	            	+(types.Tx) (len=20) Tx{E483EC83037FA6AD8C209231E97472D359796765}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 99 on reactor 3 don't match: Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E} vs Tx{E483EC83037FA6AD8C209231E97472D359796765}
    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0xe4, 0x83, 0xec, 0x83, 0x3, 0x7f, 0xa6, 0xad, 0x8c, 0x20, 0x92, 0x31, 0xe9, 0x74, 0x72, 0xd3, 0x59, 0x79, 0x67, 0x65}
        	            	actual  : types.Tx{0x23, 0xac, 0x5c, 0xbc, 0xf7, 0xca, 0x73, 0xbd, 0xe2, 0xbf, 0x15, 0x58, 0x72, 0xcb, 0xfd, 0x25, 0x55, 0xea, 0x33, 0xc7}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{E483EC83037FA6AD8C209231E97472D359796765}
        	            	+(types.Tx) (len=20) Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 100 on reactor 3 don't match: Tx{E483EC83037FA6AD8C209231E97472D359796765} vs Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}

@@ -174,7 +174,7 @@ func (memR *Reactor) Receive(chID byte, src p2p.Peer, msgBytes []byte) {
 	if src != nil {
 		txInfo.SenderP2PID = src.ID()
 	}
-	err := memR.mempool.CheckTx(msg.Tx, nil, txInfo)
+	err := memR.mempool.CheckTxAsync(msg.Tx, txInfo, nil)
Contributor:

I think the reactor_test failed because of this change. If the reactor does CheckTxAsync, the order in which txs enter the mempool may change. If this is intended, the test case needs some kind of defense, for example one like the sketch below.
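One hypothetical defense of this kind (not from this PR or #167) is to make the assertion order-insensitive, e.g. comparing sorted copies of the tx slices instead of index-by-index:

```go
import (
	"bytes"
	"sort"
	"testing"

	"github.com/stretchr/testify/require"
	"github.com/tendermint/tendermint/types"
)

// assertSameTxs checks that two tx slices hold the same transactions,
// ignoring the (now nondeterministic) order of mempool insertion.
func assertSameTxs(t *testing.T, expected, actual []types.Tx) {
	sorted := func(txs []types.Tx) []types.Tx {
		s := append([]types.Tx(nil), txs...) // copy; don't mutate input
		sort.Slice(s, func(i, j int) bool {
			return bytes.Compare(s[i], s[j]) < 0
		})
		return s
	}
	require.Equal(t, sorted(expected), sorted(actual))
}
```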

Author (@jinsan-line):

I'm already investigating it, and I think we do need CheckTxAsync here. Thanks.

Author (@jinsan-line):

After investigating, I think this is a legacy issue that could already occur, just with low probability. Because this PR removes abci.localClient.mtx from abci.localClient.CheckTxXXX(), the probability of it occurring has simply increased.

I'll deal with it in #167 because it's not directly related to this PR and could also be a bit complicated.

@jinsan-line jinsan-line mentioned this pull request Jan 21, 2021
@jinsan-line jinsan-line merged commit 65d890b into Finschia:feat/perf Jan 21, 2021
@jinsan-line jinsan-line deleted the concurrent-rechecktx branch January 21, 2021 11:05
jinsan-line added a commit that referenced this pull request Apr 26, 2021
jinsan-line added a commit that referenced this pull request Apr 27, 2021
* feat: concurrent recheckTx (#163)

* fix: lint errors

* fix: lint errors

* fix: lint errors
egonspace pushed a commit to egonspace/ostracon that referenced this pull request Jul 8, 2021
egonspace pushed a commit to egonspace/ostracon that referenced this pull request Jul 8, 2021
egonspace pushed a commit that referenced this pull request Aug 26, 2021
* feat: more prometheus metrics for monitoring performance (#146) (#175)

* chore: config timeout and connection pool (#150) (#171)

* fix: use line/tm-db instead of tendermint/tm-db

* bump up tm-db, iavl; re-apply #201

* chore: use default db backend among the available ones (#212)

* chore: use default db backend among the available ones

* chore: bump up iavl, tm-db

* feat: concurrent checkTx #213; fix tm-db call

* fix: rename TM to OC

* fix: modify key name; tendermint -> ostracon

* chore: rename tendermint to ostracon

* chore: remove mempool.postCheck (#158) (#217)

* fix: error handling after check tx

* fix: typo

* chore: (mempool) remove postCheck and impl reserve

* chore: fix tests

* chore: revise log (remove checkTx.Code)

* chore: add `CONTRACT` for `mem.proxyAppConn.CheckTxAsync()`

* chore: revise numTxs, txsBytes for `ErrMempoolIsFull` in reserve()

* chore: revise to remove redundant `isFull()`

* fix: remove tx from cache when `app errors` or `failed to reserve`

* Revert "chore: revise to remove redundant `isFull()`"

This reverts commit 55990ec.

* fix: revise to call Begin/EndRecheck even though mem.Size() is 0 (#219)

* fix: revise to call Begin/EndRecheck even though `mem.Size()` is 0

* chore: revise local_client.go

* fix: lint error

* chore: recheckTxs() just return if mem.Size() == 0

* feat: concurrent recheckTx (#163) (#221)

* chore: increase the value of maxPerPage (#223)

* chore: fix the type of consensus_block_interval_seconds from histogram to gauge (#224)

* feat: impl checkTxAsyncReactor() (#168) (#225)

* feat: impl checkTxAsyncReactor() (#168)

* fix: tests

* fix: lint errors

* chore: revise abci.Client, Async() interfaces (#169) (#226)

* chore: revise abci.Client, Async() interfaces

* chore: regen mock w/ mockery 2.7.4

* fix: lint error

* fix: test_race

* mempool.Flush() flushes all txs from mempool so it should get `Lock()` instead of `RLock()`

* chore: remove iavl dependency (#228)

* chore: remove iavl dependency

* chore: fix lint error

* fix: add more fixing for abci.Client, Async()

* feat: revise metric for measuring performance

* build: remove needless build tag `!libsecp256k1` (#246)

The build tag disables the Go implementation of secp256k1. Since there is no C implementation, a build error occurs when building with the tag `libsecp256k1`.

* feat: add duration metrics of gauge type (#256)

* perf: optimize checking the txs size (#264)

* perf: optimize checking the txs size

* ci: add GOPRIVATE to workflows

* test: add a unit test

* fix: fix lint errors

* perf: do not flush wal when receive consensus msgs (#273)

* perf: do not flush wal when receive consensus msgs

* fix: ci failure

* fix: lint failure

* fix: ci-e2e build failure

* fix: bump up tm-db

* fix: missing abci api

* fix: bump up tm-db; use memdb

* test: add test case to raise test coverage

* fix: race error

* fix: race error

* fix: race error

* fix: increase e2e test timeout

* fix: add test case for coverage

* fix: e2e docker file

* fix: apply comments

* fix: a Ostracon to an Ostracon