
feat: concurrent recheckTx #163

Merged: 14 commits into Finschia:feat/perf on Jan 21, 2021
Conversation

@jinsan-line commented Jan 14, 2021

Related to: https://github.com/line/link/issues/1152

Description

To optimize performance, we need to increase concurrency. After implementing concurrent checkTx (#160), I'd like to implement concurrent recheckTx.

The key change is decomposing application.CheckTx() into CheckTxSync() and CheckTxAsync(). abci.CheckTxSync() and abci.CheckTxAsync() currently use the same application interface, application.CheckTx(), so both abci.CheckTxSync() and abci.CheckTxAsync() are blocking functions. With application.CheckTxSync() and application.CheckTxAsync(), I intend to implement these as sync/blocking and async/non-blocking in cosmos-sdk.

Please note why application.CheckTxSync() is still needed: rpc.broadcastTxSync() (and Commit()) already has its own goroutine from the HTTP server, so in that case calling synchronously is better because it avoids creating an unnecessary extra goroutine.
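For illustration, here is a minimal sketch of the decomposed interface. The method names follow the description above, but the request/response types and exact signatures are stand-ins, not the actual code in this PR:

```go
package abcisketch

// Stand-in request/response types for the real ABCI messages.
type RequestCheckTx struct{ Tx []byte }
type ResponseCheckTx struct{ Code uint32 }

// Application sketches the decomposition: the former single CheckTx()
// is split into a sync (blocking) variant and an async variant that
// reports its result through a callback.
type Application interface {
	// CheckTxSync blocks until the check result is available.
	CheckTxSync(req RequestCheckTx) ResponseCheckTx

	// CheckTxAsync may return before the check completes; the result
	// is delivered to cb, possibly from another goroutine.
	CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx))
}
```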


For contributor use:

  • Wrote tests
  • Updated CHANGELOG_PENDING.md
  • Linked to GitHub issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments
  • Re-reviewed Files changed in the GitHub PR explorer

@jinsan-line jinsan-line marked this pull request as ready for review January 18, 2021 08:20
@jinsan-line (Author) commented Jan 19, 2021

@torao

> In general, there are many important points that need to be considered when making processing asynchronous: for instance, the scope of threads (creation and termination) and who manages them, whether to use a thread pool, and what the impact will be if the processing consumes a lot of time. So I would like to ask:

First of all, I'd like to mention that we're dealing with Go, not Java. In Go there is no explicit thread-pool concept; there are only goroutines, which are multiplexed onto a pool of up to GOMAXPROCS OS threads. But as Go programmers we don't need to think about that thread pool when using goroutines, and we can conceptually treat a goroutine as a free resource.

In "Concurrency in Go: Tools and Techniques for Developers, Katherine Cox-Buday, O'Reilly Media," the author is saying as follows:

| Go’s philosophy on concurrency can be summed up like this: aim for simplicity, use channels when possible, and treat goroutines like a free resource.

> • In this PR, what thread does the asynchronous processing take place in? Will a new thread be started each time the callback is executed? Or does it use a dedicated thread pool for CheckTx asynchronous processing?

Tendermint doesn't and shouldn't care which goroutine executes the callback, but as the owner of ABCI it should provide an asynchronous interface that gives the ABCI server a chance to execute ABCI calls concurrently.

> • If a new thread is started for each asynchronous process, won't a large number of threads and system resources be consumed, and the OS become unresponsive, if the `CheckTx` process is delayed for some reason?
> • If it uses a thread pool with a fixed number of threads, or if the thread pool is shared by processes other than CheckTx, what will the impact be if the CheckTx process is delayed and no more threads are available?

I hope the answer to both of these is that we can treat goroutines like a free resource.

@torao (Member) commented Jan 19, 2021

I had thought goroutines were similar to (OS-native) threads, but now I understand that they more closely resemble coroutines (and the name suggests as much). I think this amounts to a parallel-processing model that uses GOMAXPROCS OS-native threads to process queued jobs, and it's a more programmer-friendly abstraction.

As computer resources are finite, however, I think "free resource" is a bit idealistic. Assuming such a job-queue/workers parallel-processing model, if more than GOMAXPROCS CheckTx calls are (not blocked but) delayed at the same time, will all the other goroutines in Tendermint be affected?

The term "thread pool" in the previous comment is a Java expression, but in this case, GOMAXPROCS OS-native threads are equivalent meaning to "the thread pool provided by system default". What I was considering about in my previous comment was that in any language or framework, to avoid exhausting such the system default thread pool, it should be designed carefully and is sometimes considered having a separate thread pool.

@jinsan-line (Author) commented Jan 19, 2021

@torao

I think the same way about thread pools and resource management when dealing with parallel execution, but it doesn't affect this PR at all.

> I had thought goroutines were similar to (OS-native) threads, but now I understand that they more closely resemble coroutines (and the name suggests as much). I think this amounts to a parallel-processing model that uses GOMAXPROCS OS-native threads to process queued jobs, and it's a more programmer-friendly abstraction.

I think a goroutine is a kind of coroutine as well.

> As computer resources are finite, however, I think "free resource" is a bit idealistic. Assuming such a job-queue/workers parallel-processing model, if more than GOMAXPROCS CheckTx calls are (not blocked but) delayed at the same time, will all the other goroutines in Tendermint be affected?

The term "thread pool" in the previous comment is a Java expression, but in this case, GOMAXPROCS OS-native threads are equivalent meaning to "the thread pool provided by system default". What I was considering about in my previous comment was that in any language or framework, to avoid exhausting such the system default thread pool, it should be designed carefully and is sometimes considered having a separate thread pool.

To my understanding, every function in Go runs on a goroutine. We already have the question of what happens if checkTx takes a long time: it already affects the block interval, so that risk is not newly introduced here. The only new thing we should focus on for this PR is scheduling cost.

(Another thing I'd like to note is that an asynchronous interface doesn't compel non-blocking behavior or the use of another goroutine. For example, the xxxAsync() functions in the Tendermint example apps are still blocking; see the sketch after the list below.)

  • Basically, I think we should accept that goroutine scheduling is fairly cheap, because it's a well-known property and one of Go's killer features. If we couldn't trust this assumption, we couldn't use Go.
  • Deadlock could be a newly introduced concern, but it's not Tendermint's business. (Of course, it's the app team's responsibility in cosmos-sdk. HaHa~) Tendermint can only give the application a chance to be non-blocking. Tendermint is exposed to deadlock risk if the application implements checkTx using goroutines, but only by accepting this risk do we get a chance to optimize performance with parallel execution.
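To make the parenthetical note above concrete: reusing the stub types from the earlier sketch, both implementations below satisfy the same hypothetical async signature, yet the first is blocking and the second is not. These are illustrations, not the PR's actual client code:

```go
// localClient runs the check inline: "async" in shape, blocking in fact.
type localClient struct{ app Application }

func (c *localClient) CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx)) {
	cb(c.app.CheckTxSync(req)) // executes on the caller's goroutine
}

// asyncClient hands the work to a new goroutine and returns immediately.
type asyncClient struct{ app Application }

func (c *asyncClient) CheckTxAsync(req RequestCheckTx, cb func(ResponseCheckTx)) {
	go func() { cb(c.app.CheckTxSync(req)) }() // non-blocking for the caller
}
```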

@@ -266,5 +273,6 @@ func newLocalReqRes(req *types.Request, res *types.Response) *ReqRes {
 	reqRes := NewReqRes(req)
 	reqRes.Response = res
 	reqRes.SetDone()
+	reqRes.Done()
Contributor:

I'm just asking out of curiosity: did you add this code because there was a problem with its absence before, or is it just a formality?

Author (@jinsan-line):

There was no buggy behavior before, but as you can see, it's not good in terms of consistency.
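For context, a rough sketch of the consistency argument, modeled loosely on tendermint's ReqRes (field names illustrative, under the assumption that ReqRes embeds a sync.WaitGroup that is Add(1)'d at construction): the done flag and the WaitGroup are two separate completion signals, so a locally-completed request should complete both, as the other completion paths do.

```go
import "sync"

// ReqRes sketch: completion is signaled two ways, via a flag and a
// WaitGroup. Callers may rely on either, so both must be completed.
type ReqRes struct {
	*sync.WaitGroup // Done() releases anyone blocked in Wait()

	mtx  sync.Mutex
	done bool // SetDone() marks the flag; readers check it under mtx
}

func NewReqRes() *ReqRes {
	wg := &sync.WaitGroup{}
	wg.Add(1) // balanced by exactly one Done()
	return &ReqRes{WaitGroup: wg}
}

func (r *ReqRes) SetDone() {
	r.mtx.Lock()
	r.done = true
	r.mtx.Unlock()
}
```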

	Tx:   memTx.tx,
	Type: abci.CheckTxType_Recheck,
})
reqRes.SetCallback(func(res *abci.Response) {
Contributor:

Where will this reqRes.cb be called?

Author (@jinsan-line):

I'd like to refactor this code because it's not intuitive and carries many assumptions. Refactoring it within this PR would make the PR more complicated, so I didn't. I'd like to refactor it if I get the chance.

https://github.com/line/tendermint/blob/8316c03342d55179c70f0e32df8b861e58af5044/abci/client/client.go#L99-L110

Author (@jinsan-line):

Oh, when I reverted the refactored code I missed calling the callback for the async, non-blocking abci case. I'll revise it, thanks.

Author (@jinsan-line):

Thanks, fixed in 38df3b3.

Author (@jinsan-line):

As you can see in client.go, it has a kind of bug: cb can be called multiple times if SetCallback() is called after the ReqRes is already done. I'd still like to refactor it if I get the chance.
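A simplified reproduction of the hazard being described, with illustrative names rather than the actual client.go code: if SetCallback fires the callback itself whenever the request is already done, then a callback stored earlier is fired by the completion path, and any late SetCallback fires it again.

```go
import "sync"

type Response struct{ Code uint32 }

type reqRes struct {
	mtx  sync.Mutex
	done bool
	cb   func(*Response)
	res  *Response
}

// SetCallback stores cb, or fires it inline if already done. Each call
// made after completion fires the callback again.
func (r *reqRes) SetCallback(cb func(*Response)) {
	r.mtx.Lock()
	if r.done {
		r.mtx.Unlock()
		cb(r.res) // a late SetCallback fires cb here...
		return
	}
	r.cb = cb
	r.mtx.Unlock()
}

// invokeCallback runs on the completion path and fires the stored cb,
// so the same callback machinery can end up invoked more than once.
func (r *reqRes) invokeCallback() {
	r.mtx.Lock()
	defer r.mtx.Unlock()
	if r.cb != nil {
		r.cb(r.res) // ...in addition to the firing here
	}
}
```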

@egonspace (Contributor) reviewed:

I fully agree with the improvement direction.

(Two review threads on abci/client/local_client.go: outdated, resolved)
@wetcod (Contributor) reviewed:

LGTM

@kukugi reviewed:

LGTM 👍

@egonspace (Contributor) commented Jan 19, 2021

The mempool reactor_test failed. I think this failure should be fixed.

    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0x23, 0xac, 0x5c, 0xbc, 0xf7, 0xca, 0x73, 0xbd, 0xe2, 0xbf, 0x15, 0x58, 0x72, 0xcb, 0xfd, 0x25, 0x55, 0xea, 0x33, 0xc7}
        	            	actual  : types.Tx{0x7, 0xc, 0x90, 0x43, 0x27, 0x47, 0xe5, 0xe6, 0xed, 0xbf, 0x5b, 0x3d, 0x58, 0xc6, 0x8c, 0x93, 0xe8, 0x47, 0xb2, 0x8e}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}
        	            	+(types.Tx) (len=20) Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 98 on reactor 3 don't match: Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7} vs Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0x7, 0xc, 0x90, 0x43, 0x27, 0x47, 0xe5, 0xe6, 0xed, 0xbf, 0x5b, 0x3d, 0x58, 0xc6, 0x8c, 0x93, 0xe8, 0x47, 0xb2, 0x8e}
        	            	actual  : types.Tx{0xe4, 0x83, 0xec, 0x83, 0x3, 0x7f, 0xa6, 0xad, 0x8c, 0x20, 0x92, 0x31, 0xe9, 0x74, 0x72, 0xd3, 0x59, 0x79, 0x67, 0x65}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E}
        	            	+(types.Tx) (len=20) Tx{E483EC83037FA6AD8C209231E97472D359796765}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 99 on reactor 3 don't match: Tx{070C90432747E5E6EDBF5B3D58C68C93E847B28E} vs Tx{E483EC83037FA6AD8C209231E97472D359796765}
    reactor_test.go:99: 
        	Error Trace:	reactor_test.go:99
        	            				reactor_test.go:73
        	            				asm_amd64.s:1374
        	Error:      	Not equal: 
        	            	expected: types.Tx{0xe4, 0x83, 0xec, 0x83, 0x3, 0x7f, 0xa6, 0xad, 0x8c, 0x20, 0x92, 0x31, 0xe9, 0x74, 0x72, 0xd3, 0x59, 0x79, 0x67, 0x65}
        	            	actual  : types.Tx{0x23, 0xac, 0x5c, 0xbc, 0xf7, 0xca, 0x73, 0xbd, 0xe2, 0xbf, 0x15, 0x58, 0x72, 0xcb, 0xfd, 0x25, 0x55, 0xea, 0x33, 0xc7}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,2 +1,2 @@
        	            	-(types.Tx) (len=20) Tx{E483EC83037FA6AD8C209231E97472D359796765}
        	            	+(types.Tx) (len=20) Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}
        	            	 
        	Test:       	TestReactorBroadcastTxMessage
        	Messages:   	txs at index 100 on reactor 3 don't match: Tx{E483EC83037FA6AD8C209231E97472D359796765} vs Tx{23AC5CBCF7CA73BDE2BF155872CBFD2555EA33C7}

@@ -174,7 +174,7 @@ func (memR *Reactor) Receive(chID byte, src p2p.Peer, msgBytes []byte) {
 	if src != nil {
 		txInfo.SenderP2PID = src.ID()
 	}
-	err := memR.mempool.CheckTx(msg.Tx, nil, txInfo)
+	err := memR.mempool.CheckTxAsync(msg.Tx, txInfo, nil)
Contributor:

I think the reactor_test failed because of this change. If the reactor does CheckTxAsync, the order in which txs enter the mempool may change. If this is intended, the test case needs some kind of defense, for example one like the sketch below.
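One hypothetical defense of this kind (not from this PR or #167) is to make the assertion order-insensitive, e.g. comparing sorted copies of the tx slices instead of index-by-index:

```go
import (
	"bytes"
	"sort"
	"testing"

	"github.com/stretchr/testify/require"
	"github.com/tendermint/tendermint/types"
)

// assertSameTxs checks that two tx slices hold the same transactions,
// ignoring the (now nondeterministic) order of mempool insertion.
func assertSameTxs(t *testing.T, expected, actual []types.Tx) {
	sorted := func(txs []types.Tx) []types.Tx {
		s := append([]types.Tx(nil), txs...) // copy; don't mutate input
		sort.Slice(s, func(i, j int) bool {
			return bytes.Compare(s[i], s[j]) < 0
		})
		return s
	}
	require.Equal(t, sorted(expected), sorted(actual))
}
```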

Author (@jinsan-line):

I'm already investigating it, and I think we do need CheckTxAsync here. Thanks.

Author (@jinsan-line):

After investigating, I think this is a legacy issue that could already occur, just with low probability. Because this PR removes abci.localClient.mtx from abci.localClient.CheckTxXXX(), the probability of it occurring has simply increased.

I'll deal with it in #167 because it's not directly related to this PR and could also be a bit complicated.

@jinsan-line jinsan-line mentioned this pull request Jan 21, 2021
@jinsan-line jinsan-line merged commit 65d890b into Finschia:feat/perf Jan 21, 2021
@jinsan-line jinsan-line deleted the concurrent-rechecktx branch January 21, 2021 11:05
jinsan-line added a commit that referenced this pull request Apr 26, 2021
jinsan-line added a commit that referenced this pull request Apr 27, 2021
* feat: concurrent recheckTx (#163)

* fix: lint errors

* fix: lint errors

* fix: lint errors
egonspace pushed a commit to egonspace/ostracon that referenced this pull request Jul 8, 2021
egonspace pushed a commit to egonspace/ostracon that referenced this pull request Jul 8, 2021
egonspace pushed a commit that referenced this pull request Aug 26, 2021
* feat: more prometheus metrics for monitoring performance (#146) (#175)

* chore: config timeout and connection pool (#150) (#171)

* fix: use line/tm-db instead of tendermint/tm-db

* bump up tm-db, iavl; re-apply #201

* chore: use default db backend among the available ones (#212)

* chore: use default db backend among the available ones

* chore: bump up iavl, tm-db

* feat: concurrent checkTx #213; fix tm-db call

* fix: rename TM to OC

* fix: modify key name; tendermint -> ostracon

* chore: rename tendermint to ostracon

* chore: remove mempool.postCheck (#158) (#217)

* fix: error handling after check tx

* fix: typo

* chore: (mempool) remove postCheck and impl reserve

* chore: fix tests

* chore: revise log (remove checkTx.Code)

* chore: add `CONTRACT` for `mem.proxyAppConn.CheckTxAsync()`

* chore: revise numTxs, txsBytes for `ErrMempoolIsFull` in reserve()

* chore: revise to remove redundant `isFull()`

* fix: remove tx from cache when `app errors` or `failed to reserve`

* Revert "chore: revise to remove redundant `isFull()`"

This reverts commit 55990ec.

* fix: revise to call Begin/EndRecheck even though mem.Size() is 0 (#219)

* fix: revise to call Begin/EndRecheck even though `mem.Size()` is 0

* chore: revise local_client.go

* fix: lint error

* chore: recheckTxs() just return if mem.Size() == 0

* feat: concurrent recheckTx (#163) (#221)

* chore: increase the value of maxPerPage (#223)

* chore: fix the type of consensus_block_interval_seconds from histogram to gauge (#224)

* feat: impl checkTxAsyncReactor() (#168) (#225)

* feat: impl checkTxAsyncReactor() (#168)

* fix: tests

* fix: lint errors

* chore: revise abci.Client, Async() interfaces (#169) (#226)

* chore: revise abci.Client, Async() interfaces

* chore: regen mock w/ mockery 2.7.4

* fix: lint error

* fix: test_race

* mempool.Flush() flushes all txs from mempool so it should get `Lock()` instead of `RLock()`

* chore: remove iavl dependency (#228)

* chore: remove iavl dependency

* chore: fix lint error

* fix: add more fixing for abci.Client, Async()

* feat: revise metric for measuring performance

* build: remove needless build tag `!libsecp256k1` (#246)

The build tag disables the Go implementation of secp256k1. Since there is no C implementation, a build error occurs when building with the tag `libsecp256k1`.

* feat: add duration metrics of gauge type (#256)

* perf: optimize checking the txs size (#264)

* perf: optimize checking the txs size

* ci: add GOPRIVATE to workflows

* test: add a unit test

* fix: fix lint errors

* perf: do not flush wal when receive consensus msgs (#273)

* perf: do not flush wal when receive consensus msgs

* fix: ci failure

* fix: lint failure

* fix: ci-e2e build failure

* fix: bump up tm-db

* fix: missing abci api

* fix: bump up tm-db; use memdb

* test: add test case to raise test coverage

* fix: race error

* fix: race error

* fix: race error

* fix: increase e2e test timeout

* fix: add test case for coverage

* fix: e2e docker file

* fix: apply comments

* fix: a Ostracon to an Ostracon