
WIP, DO NOT MERGE: *: make Range() fine grained #9199

Closed
wants to merge 2 commits

Conversation

mitake
Contributor

@mitake mitake commented Jan 23, 2018

Currently, Range() reads the entire requested range of keys in a
single read transaction. This can cause long waits for writer
transactions, which show up as high latency spikes. To solve this
problem, this commit lets serializable Range() split its read
transaction in a fine grained manner. Between the read transactions,
concurrent write RPCs (e.g. Put(), DeleteRange()) get a chance to
start their own transactions. Serializable read only Txn() is also
split into these smaller transactions.

This commit also adds a new option --range-max-keys-once to
etcd. With this option, users can specify the maximum number of keys
read in a single transaction during Range().
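
Roughly, the idea is to replace one unbounded read transaction with a loop of bounded ones. A minimal sketch of that loop (not the actual patch; rangeOnce and kvPair are hypothetical stand-ins for one short backend read txn and mvccpb.KeyValue):

```go
// kvPair is a placeholder for mvccpb.KeyValue in this sketch.
type kvPair struct {
	Key, Value []byte
}

// fineGrainedRange reads the requested range in chunks of at most
// maxRangeKeysOnce keys. Each rangeOnce call is assumed to open and close
// its own short read transaction, so writers waiting on the backend lock
// can run between chunks.
func fineGrainedRange(
	rangeOnce func(key, end []byte, limit int64) ([]kvPair, error),
	startKey, rangeEnd []byte,
	maxRangeKeysOnce int64,
) ([]kvPair, error) {
	var out []kvPair
	key := startKey
	for {
		kvs, err := rangeOnce(key, rangeEnd, maxRangeKeysOnce) // one short read txn
		if err != nil {
			return nil, err
		}
		out = append(out, kvs...)
		if int64(len(kvs)) < maxRangeKeysOnce {
			return out, nil // the requested range is exhausted
		}
		// Resume just after the last returned key; concurrent writes may
		// commit before the next chunk is read.
		last := kvs[len(kvs)-1].Key
		key = append(append([]byte{}, last...), 0)
	}
}
```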

@mitake
Contributor Author

mitake commented Jan 23, 2018

related issue: #7719

I'm still working on benchmarking this feature with a realistic data set. I'd be glad to hear opinions about the design. @xiang90 @gyuho @hexfusion

Currently, this PR only covers serializable reads. But making linearizable reads fine grained would be straightforward, because recent read requests are processed in the RPC layer, not in the apply loop.

Also, having a similar mechanism on the client side would be valuable. It could solve the blocking issue and also reduce the peak memory usage of the server. A server-side mechanism is still useful for enforcing a policy of small txns, though.

@mitake mitake changed the title WIP, DO NOT MERGE: etcdserver: make serializable Range() fine grained WIP, DO NOT MERGE: *: make serializable Range() fine grained Jan 23, 2018
resp.Kvs = make([]*mvccpb.KeyValue, len(res.KVs))
for i := range res.KVs {
	if r.KeysOnly {
		res.KVs[i].Value = nil
Contributor

These lines seem redundant with the later lines for the !r.Serializable case. Could it be shared between the two code paths?

Contributor Author

I'm considering unifying the range implementation of the changed serializable path and the existing linearizable path, because it doesn't break semantics. What do you think about this idea?

Contributor

Yeah, sounds good, as long as we don't copy the same code around.

@xiang90
Contributor

xiang90 commented Jan 24, 2018

@mitake

There are two things we want to solve:

  1. split large read into smaller ones and assemble the result without locking

In this way, the large read request will not block other smaller requests. However, this will not increase the overall throughput, since the locked critical section remains unchanged and there will still be contention among cores.

  2. reduce critical sections to improve parallelism

If we can reduce the critical sections or make them smaller, we can achieve better throughput and utilize multi-cores better.

@heyitsanthony already did some work on 2) in a previous release by caching, but it can be further improved. I suggest you read through the related issues before getting started on 2).

@mitake
Contributor Author

mitake commented Jan 25, 2018

@xiang90 yes, increasing parallelism for large ranges will be effective, so I'll work on it. But the main problem in #7719 is reducing the pause time of write transactions. I'll share the benchmarking results of this PR, probably by tomorrow, which show that the change is effective for reducing the pause time (although the throughput of read txns is degraded).

@xiang90
Contributor

xiang90 commented Jan 25, 2018

@mitake agreed. thanks!

@mitake
Contributor Author

mitake commented Jan 25, 2018

@xiang90 I did a rough benchmark on the latest version of this PR.

environment

All nodes are on GCP and include 1 client node and 3 server nodes. The instance type is n1-standard-4.

how to initialize etcd

To make Range() heavy, I put 1M keys with the benchmark command:

./benchmark put --clients=100 --conns=100  --endpoints=10.140.0.4:2379  --total=1000000 --val-size=1000 --key-space-size=1000000 --sequential-keys --target-leader

benchmarking

I ran benchmark put (benchmark put --endpoints=10.140.0.4:2379 --conns=100 --clients=100 --total=100000 --val-size=1000 --target-leader) for the write workload and etcdctl get for the heavy range workload concurrently, simply executing them from different terminals.

default (equal to current etcd)

  • benchmark put
Summary:
  Total:        16.2448 secs.
  Slowest:      4.6134 secs.
  Fastest:      0.0026 secs.
  Average:      0.0157 secs.
  Stddev:       0.1452 secs.
  Requests/sec: 6155.8099

Response time histogram:
  0.0026 [1]    |
  0.4637 [99899]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.9248 [0]    |
  1.3858 [0]    |
  1.8469 [0]    |
  2.3080 [0]    |
  2.7691 [0]    |
  3.2302 [0]    |
  3.6912 [0]    |
  4.1523 [0]    |
  4.6134 [100]  |

Latency distribution:
  10% in 0.0077 secs.
  25% in 0.0087 secs.
  50% in 0.0099 secs.
  75% in 0.0117 secs.
  90% in 0.0143 secs.
  95% in 0.0165 secs.
  99% in 0.0258 secs.
  99.9% in 4.5800 secs.

etcdctl get

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null
real    0m9.978s
user    0m3.368s
sys     0m2.036s

--max-range-keys-once 100000

  • benchmark put
Summary:
  Total:        20.1981 secs.
  Slowest:      2.1657 secs.
  Fastest:      0.0033 secs.
  Average:      0.0197 secs.
  Stddev:       0.0952 secs.
  Requests/sec: 4950.9569

Response time histogram:
  0.0033 [1]    |
  0.2195 [99194]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.4358 [3]    |
  0.6520 [302]  |
  0.8682 [102]  |
  1.0845 [199]  |
  1.3007 [101]  |
  1.5169 [0]    |
  1.7332 [0]    |
  1.9494 [0]    |
  2.1657 [98]   |

Latency distribution:
  10% in 0.0077 secs.
  25% in 0.0087 secs.
  50% in 0.0101 secs.
  75% in 0.0126 secs.
  90% in 0.0197 secs.
  95% in 0.0269 secs.
  99% in 0.0633 secs.
  99.9% in 1.2026 secs.
  • etcdctl get
$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.4:2379 get --prefix "" > /dev/null

real    0m13.232s
user    0m3.572s
sys     0m2.360s

Although it isn't ideal yet, it seems that this PR successfully breaks Range()s into smaller txns, according to the latency results of benchmark put (the 99.9 percentile drops from almost 4.5 sec to 1.2 sec).

Note that we need to be a little careful when reproducing the results. We must specify the leader node as the endpoint of etcdctl get. If the get is processed on a follower node, the high latency of benchmark put won't be observed.

lim = r.Limit
}
startKey := r.Key
noEnd := bytes.Compare(rangeEnd, []byte{0}) != 0
Contributor

@gyuho gyuho Jan 25, 2018

Can we add simple bytes.Compare documentation?

e.g. // rangeEnd == byte-0

Also for bytes.Compare(startKey, rangeEnd) == -1

Contributor Author

What does byte-0 mean?

Contributor

I was saying we should document what this bytes.Compare does. bytes.Compare(rangeEnd, []byte{0}) != 0 is clear since we define it as noEnd. Can we godoc bytes.Compare(startKey, rangeEnd) == -1?
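
For example, something along these lines (illustrative only; the first comment just restates the etcd convention that a range end of "\0" means all keys >= the start key, and hasMore is a hypothetical name for the second check):

```go
// noEnd: bytes.Compare returns 0 only when the two slices are equal, so
// this is true unless rangeEnd is the single zero byte []byte{0}. In the
// etcd RangeRequest API, a range end of "\0" means "all keys >= key".
noEnd := bytes.Compare(rangeEnd, []byte{0}) != 0

// bytes.Compare(startKey, rangeEnd) == -1 holds while startKey still sorts
// strictly before rangeEnd, i.e. the scan has not reached the end of the
// requested range yet.
hasMore := bytes.Compare(startKey, rangeEnd) == -1
```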

@mitake
Contributor Author

mitake commented Jan 25, 2018

I also executed the benchmark with --max-range-keys-once 10000:

  • benchmark put
Summary:
  Total:        46.6339 secs.
  Slowest:      3.9975 secs.
  Fastest:      0.0030 secs.
  Average:      0.0462 secs.
  Stddev:       0.2815 secs.
  Requests/sec: 2144.3638

Response time histogram:
  0.0030 [1]    |
  0.4024 [98251]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.8019 [231]  |
  1.2013 [223]  |
  1.6008 [206]  |
  2.0002 [308]  |
  2.3997 [299]  |
  2.7991 [190]  |
  3.1986 [98]   |
  3.5980 [102]  |
  3.9975 [91]   |

Latency distribution:
  10% in 0.0078 secs.
  25% in 0.0089 secs.
  50% in 0.0105 secs.
  75% in 0.0132 secs.
  90% in 0.0189 secs.
  95% in 0.0283 secs.
  99% in 1.6357 secs.
  99.9% in 3.5371 secs.
  • etcdctl get
$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m40.143s
user    0m3.120s
sys     0m2.144s

The latencies of benchmark put are slightly better than with the default configuration, but not as good as with --max-range-keys-once 100000, and the throughput is terrible. So smaller txns are not always good for latency scores, and, not surprisingly, they are not good for throughput.

@xiang90
Contributor

xiang90 commented Jan 26, 2018 via email

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 yes, #9199 (comment) shows the results for 10000. Both the latency and throughput scores are worse than those for 100000. It means shorter read txns don't guarantee lower latencies for write txns.

I'll share the baseline range performance later. But it seems that Range() performance degrades with a limited read txn size.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake

What I do not understand is why the p99 of 10,000 is worse than the p99 of 100,000. Can you try to explicitly yield the goroutine after each txn call? I expect the latency of 10,000 to be lower than that of 100,000.

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90

Can you try to explicitly yield the goroutine after each txn call?

You mean the goroutine for gathering the results?
Currently I'm thinking about the possibility that contention in the memory allocator or GC introduces the high latency in the case of smaller read txns. Probably inserting a small sleep between read txns would help. I'll share the results next week.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

You mean the goroutine for gathering the results?

I mean call yield here: https://github.com/coreos/etcd/pull/9199/files#diff-f141df18bc5c4d2e821b7d9dfb484f95R318.

I am not sure how the Go mutex works internally. It might not be a fair lock, which means that a goroutine might re-enter the critical section right after leaving it if it does not yield to other goroutines. But I am not sure.

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 I see what you mean. Probably I should call runtime.Gosched()? I'll try it.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake oh, yes, it is called Gosched in Go :P

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 it seems so :) https://golang.org/pkg/runtime/#Gosched
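
The yield would go right after each chunk's read txn finishes. A minimal sketch of the pattern (illustrative only; processChunk is a stand-in for one bounded read transaction):

```go
package sketch

import "runtime"

// rangeInChunks runs one bounded read transaction per chunk and then
// explicitly yields, so goroutines waiting on the backend lock (e.g. Put
// handlers) get a chance to be scheduled before the next chunk is read.
func rangeInChunks(nChunks int, processChunk func(i int) error) error {
	for i := 0; i < nChunks; i++ {
		if err := processChunk(i); err != nil { // one short read txn
			return err
		}
		runtime.Gosched() // yield the processor between read txns
	}
	return nil
}
```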

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 I updated the PR based on your explicit yielding idea. The results (including baseline range performance) are below. The latency score of --max-range-keys-once 10000 is much better than with the older version, but it is still worse than the case of 100,000. Probably I need much more analysis with pprof.

baseline range performance

default

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null

real    0m9.225s
user    0m3.312s
sys     0m2.616s

10,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null        

real    0m29.761s
user    0m3.376s
sys     0m2.816s

100,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.4:2379 get --prefix "" > /dev/null

real    0m11.826s
user    0m3.460s
sys     0m2.468s

concurrent range and put

default

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m11.896s
user    0m3.692s
sys     0m2.696s

Summary:
  Total:        16.5136 secs.
  Slowest:      4.5160 secs.
  Fastest:      0.0020 secs.
  Average:      0.0158 secs.
  Stddev:       0.1424 secs.
  Requests/sec: 6055.5975

Response time histogram:
  0.0020 [1]    |
  0.4534 [99899]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.9048 [0]    |
  1.3562 [0]    |
  1.8076 [0]    |
  2.2590 [0]    |
  2.7104 [0]    |
  3.1618 [0]    |
  3.6132 [0]    |
  4.0646 [0]    |
  4.5160 [100]  |

Latency distribution:
  10% in 0.0067 secs.
  25% in 0.0076 secs.
  50% in 0.0089 secs.
  75% in 0.0115 secs.
  90% in 0.0204 secs.
  95% in 0.0268 secs.
  99% in 0.0412 secs.
  99.9% in 4.5047 secs.

10,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m29.739s
user    0m3.620s
sys     0m2.736s

Summary:
  Total:        34.6274 secs.
  Slowest:      3.3571 secs.
  Fastest:      0.0020 secs.
  Average:      0.0339 secs.
  Stddev:       0.2094 secs.
  Requests/sec: 2887.8850

Response time histogram:
  0.0020 [1]    |
  0.3375 [98533]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.6731 [226]  |
  1.0086 [133]  |
  1.3441 [170]  |
  1.6796 [214]  |
  2.0151 [304]  |
  2.3506 [184]  |
  2.6861 [207]  |
  3.0216 [4]    |
  3.3571 [24]   |

Latency distribution:
  10% in 0.0065 secs.
  25% in 0.0074 secs.
  50% in 0.0085 secs.
  75% in 0.0107 secs.
  90% in 0.0175 secs.
  95% in 0.0251 secs.
  99% in 1.2282 secs.
  99.9% in 2.6472 secs.

100,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null

real    0m13.050s
user    0m3.564s
sys     0m2.720s

Summary:
  Total:        18.9839 secs.
  Slowest:      1.6877 secs.
  Fastest:      0.0028 secs.
  Average:      0.0184 secs.
  Stddev:       0.0801 secs.
  Requests/sec: 5267.6333

Response time histogram:
  0.0028 [1]    |
  0.1713 [99191]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.3398 [6]    |
  0.5083 [103]  |
  0.6768 [300]  |
  0.8452 [201]  |
  1.0137 [1]    |
  1.1822 [98]   |
  1.3507 [0]    |
  1.5192 [0]    |
  1.6877 [99]   |

Latency distribution:
  10% in 0.0076 secs.
  25% in 0.0085 secs.
  50% in 0.0099 secs.
  75% in 0.0124 secs.
  90% in 0.0189 secs.
  95% in 0.0258 secs.
  99% in 0.0606 secs.
  99.9% in 1.1644 secs.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake

Probably I need much more analysis with pprof.

Yes. I still do not fully understand where the improvements come from, and why exactly the Puts are blocked. Let us dig into this a little bit more.

@xiang90
Contributor

xiang90 commented Jan 30, 2018

/cc @jpbetz

@mitake
Contributor Author

mitake commented Jan 31, 2018

I did some profiling to understand the behaviour. To keep the experiment simple, I set up a single-node cluster on my machine. The clients (etcdctl and benchmark) were executed on the same machine. It seems that smaller read txns spend a lot of time in the commit process of the backend.
(I'm still not 100% sure because pprof's output format seems to have changed since the last time I used it.)

default

Showing nodes accounting for 70ms, 100% of 70ms total
Showing top 20 nodes out of 23
      flat  flat%   sum%        cum   cum%
      20ms 28.57% 28.57%       20ms 28.57%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 14.29% 42.86%       30ms 42.86%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
      10ms 14.29% 57.14%       20ms 28.57%  runtime.notesleep /usr/local/go/src/runtime/lock_futex.go
      10ms 14.29% 71.43%       20ms 28.57%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
      10ms 14.29% 85.71%       10ms 14.29%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
      10ms 14.29%   100%       10ms 14.29%  runtime.usleep /usr/local/go/src/runtime/sys_linux_amd64.s
         0     0%   100%       10ms 14.29%  github.com/coreos/etcd/internal/lease.(*lessor).runLoop /home/mitake/gopath/src/github.com/coreos/etcd/internal/lease/lessor.go
         0     0%   100%       10ms 14.29%  runtime.addtimer /usr/local/go/src/runtime/time.go
         0     0%   100%       10ms 14.29%  runtime.addtimerLocked /usr/local/go/src/runtime/time.go
         0     0%   100%       20ms 28.57%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       30ms 42.86%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       20ms 28.57%  runtime.mstart /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.mstart1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 14.29%  runtime.notetsleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 14.29%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       30ms 42.86%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 42.86%  runtime.schedule /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.stopm /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.sysmon /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 14.29%  runtime.timerproc /usr/local/go/src/runtime/time.go

--max-range-keys-once 10000

Showing nodes accounting for 50ms, 100% of 50ms total
Showing top 20 nodes out of 24
      flat  flat%   sum%        cum   cum%
      40ms 80.00% 80.00%       40ms 80.00%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 20.00%   100%       10ms 20.00%  runtime.makemap /usr/local/go/src/runtime/hashmap.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*backend).run /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/backend.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).Commit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).commit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).unsafeCommit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*readTx).reset /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/read_tx.go
         0     0%   100%       10ms 20.00%  runtime.entersyscallblock /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.entersyscallblock_handoff /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 60.00%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 20.00%  runtime.futexwakeup /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 20.00%  runtime.handoffp /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       10ms 20.00%  runtime.notesleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       20ms 40.00%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       30ms 60.00%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 20.00%  runtime.notewakeup /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 20.00%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.schedule /usr/local/go/src/runtime/proc.go

--max-range-keys-once 100000

Showing nodes accounting for 80ms, 100% of 80ms total
Showing top 20 nodes out of 23
      flat  flat%   sum%        cum   cum%
      40ms 50.00% 50.00%       40ms 50.00%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 12.50% 62.50%       10ms 12.50%  runtime.acquirep /usr/local/go/src/runtime/proc.go
      10ms 12.50% 75.00%       30ms 37.50%  runtime.exitsyscall /usr/local/go/src/runtime/proc.go
      10ms 12.50% 87.50%       10ms 12.50%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
      10ms 12.50%   100%       40ms 50.00%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast.func1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast_pidle /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 12.50%  runtime.futexwakeup /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 12.50%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       30ms 37.50%  runtime.mstart /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.mstart1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.notetsleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       40ms 50.00%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 12.50%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.schedule /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.stopm /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.sysmon /usr/local/go/src/runtime/proc.go

@mitake
Contributor Author

mitake commented Jan 31, 2018

I also implemented a client-side mechanism for the same purpose in the second commit.

The major TODO on the client side is implementing the sorting feature. The Range() calls can now be split, so the sorting has to be performed on the client side too (see the sketch below).
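
A rough idea of what the client-side pagination could look like (a sketch, not the actual commit; it assumes the clientv3 Get options WithRange, WithLimit, WithSort, and WithRev, and pins the first page's revision so the assembled result stays a consistent snapshot):

```go
package sketch

import (
	"context"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

// pagedGet fetches a large range in pages of at most pageSize keys, so the
// server never holds a long read txn for this request.
func pagedGet(ctx context.Context, c *clientv3.Client, key, end string, pageSize int64) ([]*mvccpb.KeyValue, error) {
	var out []*mvccpb.KeyValue
	var rev int64
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev)) // read every page at the same revision
		}
		resp, err := c.Get(ctx, key, opts...)
		if err != nil {
			return nil, err
		}
		if rev == 0 {
			rev = resp.Header.Revision // pin the snapshot after the first page
		}
		out = append(out, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return out, nil
		}
		// The next page starts just after the last key returned.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```

Only key-order sorting falls out of this pagination naturally; any other sort order would have to be applied client-side after all pages are collected, which is the sorting TODO mentioned above.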

Currently, Range() reads the entire requested range of keys in a
single read transaction. This can cause long waits for writer
transactions, which show up as high latency spikes. To solve this
problem, this commit lets Range() split its read transaction in a
fine grained manner. Between the read transactions, concurrent write
RPCs (e.g. Put(), DeleteRange()) get a chance to start their own
transactions.

This commit also adds a new option `--range-max-keys-once` to
etcd. With this option, users can specify the maximum number of keys
read in a single transaction during Range().
…ingle range txn

This commit adds a similar mechanism for making read txns smaller on
the client side. This is good for reducing the peak memory usage of
the server.

This commit also adds a new flag --max-range-keys-once to `etcdctl
get`. By specifying a value for the option, users can use this
feature during the execution of the command.
@mitake mitake changed the title WIP, DO NOT MERGE: *: make serializable Range() fine grained WIP, DO NOT MERGE: *: make Range() fine grained Jan 31, 2018
@xiang90
Contributor

xiang90 commented Feb 7, 2018

@mitake it is strange that setting the range max size to 10000 increases the number of commits of the batched tx. Read-only requests should have nothing to do with backend commits.

@mitake
Contributor Author

mitake commented Oct 20, 2019

I'm closing this because the problem of large range performance is solved by #10523.

@mitake mitake closed this Oct 20, 2019