
WIP, DO NOT MERGE: *: make Range() fine grained #9199

Closed
wants to merge 2 commits

Conversation

mitake
Contributor

@mitake mitake commented Jan 23, 2018

Currently, Range() reads the entire requested range of keys in a
single read transaction. This can cause long waits for writer
transactions, which show up as high latency spikes. To solve this
problem, this commit lets serializable Range() split its read
transaction in a fine grained manner. Between the read transactions,
concurrent write RPCs (e.g. Put(), DeleteRange()) get a chance to
start their own transactions. Serializable read only Txn() is also
split into these smaller transactions.

This commit also adds a new option --range-max-keys-once to
etcd. With this option, users can specify the maximum number of keys
read in a single transaction during Range().
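
Roughly, the idea is to replace one unbounded read transaction with a loop of bounded ones. A minimal sketch of that loop (not the actual patch; rangeOnce and kvPair are hypothetical stand-ins for one short backend read txn and mvccpb.KeyValue):

```go
// kvPair is a placeholder for mvccpb.KeyValue in this sketch.
type kvPair struct {
	Key, Value []byte
}

// fineGrainedRange reads the requested range in chunks of at most
// maxRangeKeysOnce keys. Each rangeOnce call is assumed to open and close
// its own short read transaction, so writers waiting on the backend lock
// can run between chunks.
func fineGrainedRange(
	rangeOnce func(key, end []byte, limit int64) ([]kvPair, error),
	startKey, rangeEnd []byte,
	maxRangeKeysOnce int64,
) ([]kvPair, error) {
	var out []kvPair
	key := startKey
	for {
		kvs, err := rangeOnce(key, rangeEnd, maxRangeKeysOnce) // one short read txn
		if err != nil {
			return nil, err
		}
		out = append(out, kvs...)
		if int64(len(kvs)) < maxRangeKeysOnce {
			return out, nil // the requested range is exhausted
		}
		// Resume just after the last returned key; concurrent writes may
		// commit before the next chunk is read.
		last := kvs[len(kvs)-1].Key
		key = append(append([]byte{}, last...), 0)
	}
}
```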

@mitake
Contributor Author

mitake commented Jan 23, 2018

related issue: #7719

I'm still working on benchmarking this feature with a realistic data set. I'd be glad to hear opinions about the design. @xiang90 @gyuho @hexfusion

Currently, this PR only covers serializable reads. But making linearizable reads fine grained would be straightforward, because recent read requests are processed in the RPC layer, not in the apply loop.

Also, having a similar mechanism on the client side would be valuable. It could solve the blocking issue and also reduce the peak memory usage of the server. A server-side mechanism is still useful for enforcing a policy of small txns, though.

@mitake mitake changed the title WIP, DO NOT MERGE: etcdserver: make serializable Range() fine grained WIP, DO NOT MERGE: *: make serializable Range() fine grained Jan 23, 2018
resp.Kvs = make([]*mvccpb.KeyValue, len(res.KVs))
for i := range res.KVs {
	if r.KeysOnly {
		res.KVs[i].Value = nil
Contributor

These lines seem redundant with the later lines for the !r.Serializable case. Could it be shared between the two code paths?

Contributor Author

I'm considering unifying the range implementation of the changed serializable path and the existing linearizable path, because it doesn't break semantics. What do you think about this idea?

Contributor

Yeah, sounds good, as long as we don't copy the same code around.

@xiang90
Contributor

xiang90 commented Jan 24, 2018

@mitake

There are two things we want to solve:

  1. split large read into smaller ones and assemble the result without locking

In this way, the large read request will not block other smaller requests. However, this will not increase the overall throughput, since the locked critical section remains unchanged and there will still be contention among cores.

  2. reduce critical sections to improve parallelism

If we can reduce the critical sections or make them smaller, we can achieve better throughput and utilize multi-cores better.

@heyitsanthony already did some work on 2) in a previous release by caching, but it can be further improved. I suggest you read through the related issues before getting started on 2).

@mitake
Contributor Author

mitake commented Jan 25, 2018

@xiang90 yes, increasing parallelism for large ranges will be effective, so I'll work on it. But the main problem in #7719 is reducing the pause time of write transactions. I'll share the benchmarking results of this PR, probably by tomorrow, which show that the change is effective for reducing the pause time (although the throughput of read txns is degraded).

@xiang90
Contributor

xiang90 commented Jan 25, 2018

@mitake agreed. thanks!

@mitake
Contributor Author

mitake commented Jan 25, 2018

@xiang90 I did a rough benchmark on the latest version of this PR.

environment

All nodes are on GCP and include 1 client node and 3 server nodes. The instance type is n1-standard-4.

how to initialize etcd

To make Range() heavy, I put 1M keys with the benchmark command:

./benchmark put --clients=100 --conns=100  --endpoints=10.140.0.4:2379  --total=1000000 --val-size=1000 --key-space-size=1000000 --sequential-keys --target-leader

benchmarking

I ran benchmark put (benchmark put --endpoints=10.140.0.4:2379 --conns=100 --clients=100 --total=100000 --val-size=1000 --target-leader) for the write workload and etcdctl get for the heavy range workload concurrently, simply executing them from different terminals.

default (equal to current etcd)

  • benchmark put
Summary:
  Total:        16.2448 secs.
  Slowest:      4.6134 secs.
  Fastest:      0.0026 secs.
  Average:      0.0157 secs.
  Stddev:       0.1452 secs.
  Requests/sec: 6155.8099

Response time histogram:
  0.0026 [1]    |
  0.4637 [99899]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.9248 [0]    |
  1.3858 [0]    |
  1.8469 [0]    |
  2.3080 [0]    |
  2.7691 [0]    |
  3.2302 [0]    |
  3.6912 [0]    |
  4.1523 [0]    |
  4.6134 [100]  |

Latency distribution:
  10% in 0.0077 secs.
  25% in 0.0087 secs.
  50% in 0.0099 secs.
  75% in 0.0117 secs.
  90% in 0.0143 secs.
  95% in 0.0165 secs.
  99% in 0.0258 secs.
  99.9% in 4.5800 secs.

etcdctl get

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null
real    0m9.978s
user    0m3.368s
sys     0m2.036s

--max-range-keys-once 100000

  • benchmark put
Summary:
  Total:        20.1981 secs.
  Slowest:      2.1657 secs.
  Fastest:      0.0033 secs.
  Average:      0.0197 secs.
  Stddev:       0.0952 secs.
  Requests/sec: 4950.9569

Response time histogram:
  0.0033 [1]    |
  0.2195 [99194]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.4358 [3]    |
  0.6520 [302]  |
  0.8682 [102]  |
  1.0845 [199]  |
  1.3007 [101]  |
  1.5169 [0]    |
  1.7332 [0]    |
  1.9494 [0]    |
  2.1657 [98]   |

Latency distribution:
  10% in 0.0077 secs.
  25% in 0.0087 secs.
  50% in 0.0101 secs.
  75% in 0.0126 secs.
  90% in 0.0197 secs.
  95% in 0.0269 secs.
  99% in 0.0633 secs.
  99.9% in 1.2026 secs.
  • etcdctl get
$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.4:2379 get --prefix "" > /dev/null

real    0m13.232s
user    0m3.572s
sys     0m2.360s

Although it isn't ideal yet, it seems that this PR successfully breaks Range()s into smaller txns, according to the latency results of benchmark put (the 99.9 percentile drops from almost 4.5 sec to 1.2 sec).

Note that we need to be a little careful when reproducing the results. We must specify the leader node as the endpoint of etcdctl get. If the get is processed on a follower node, the high latency of benchmark put won't be observed.

lim = r.Limit
}
startKey := r.Key
noEnd := bytes.Compare(rangeEnd, []byte{0}) != 0
Contributor

@gyuho gyuho Jan 25, 2018

Can we add simple bytes.Compare documentation?

e.g. // rangeEnd == byte-0

Also for bytes.Compare(startKey, rangeEnd) == -1

Contributor Author

What does byte-0 mean?

Contributor

I was saying we should document what this bytes.Compare does. bytes.Compare(rangeEnd, []byte{0}) != 0 is clear since we define it as noEnd. Can we godoc bytes.Compare(startKey, rangeEnd) == -1?
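
For example, something along these lines (illustrative only; the first comment just restates the etcd convention that a range end of "\0" means all keys >= the start key, and hasMore is a hypothetical name for the second check):

```go
// noEnd: bytes.Compare returns 0 only when the two slices are equal, so
// this is true unless rangeEnd is the single zero byte []byte{0}. In the
// etcd RangeRequest API, a range end of "\0" means "all keys >= key".
noEnd := bytes.Compare(rangeEnd, []byte{0}) != 0

// bytes.Compare(startKey, rangeEnd) == -1 holds while startKey still sorts
// strictly before rangeEnd, i.e. the scan has not reached the end of the
// requested range yet.
hasMore := bytes.Compare(startKey, rangeEnd) == -1
```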

@mitake
Contributor Author

mitake commented Jan 25, 2018

I also executed the benchmark with --max-range-keys-once 10000:

  • benchmark put
Summary:
  Total:        46.6339 secs.
  Slowest:      3.9975 secs.
  Fastest:      0.0030 secs.
  Average:      0.0462 secs.
  Stddev:       0.2815 secs.
  Requests/sec: 2144.3638

Response time histogram:
  0.0030 [1]    |
  0.4024 [98251]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.8019 [231]  |
  1.2013 [223]  |
  1.6008 [206]  |
  2.0002 [308]  |
  2.3997 [299]  |
  2.7991 [190]  |
  3.1986 [98]   |
  3.5980 [102]  |
  3.9975 [91]   |

Latency distribution:
  10% in 0.0078 secs.
  25% in 0.0089 secs.
  50% in 0.0105 secs.
  75% in 0.0132 secs.
  90% in 0.0189 secs.
  95% in 0.0283 secs.
  99% in 1.6357 secs.
  99.9% in 3.5371 secs.
  • etcdctl get
$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m40.143s
user    0m3.120s
sys     0m2.144s

The latencies of benchmark put are slightly better than with the default configuration, but not as good as with --max-range-keys-once 100000, and the throughput is terrible. So smaller txns are not always good for latency scores, and, not surprisingly, they are not good for throughput.

@xiang90
Contributor

xiang90 commented Jan 26, 2018 via email

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 yes, #9199 (comment) shows the results for 10000. Both the latency and throughput scores are worse than those for 100000. It means shorter read txns don't guarantee lower latencies for write txns.

I'll share the baseline range performance later. But it seems that Range() performance degrades with a limited read txn size.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake

What I do not understand is why the p99 of 10,000 is worse than the p99 of 100,000. Can you try to explicitly yield the goroutine after each txn call? I expect the latency of 10,000 to be lower than that of 100,000.

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90

Can you try to explicitly yield the goroutine after each txn call?

You mean the goroutine for gathering the results?
Currently I'm thinking about the possibility that contention in the memory allocator or GC introduces the high latency in the case of smaller read txns. Probably inserting a small sleep between read txns would help. I'll share the results next week.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

You mean the goroutine for gathering the results?

I mean call yield here: https://github.com/coreos/etcd/pull/9199/files#diff-f141df18bc5c4d2e821b7d9dfb484f95R318.

I am not sure how the Go mutex works internally. It might not be a fair lock, which means that a goroutine might re-enter the critical section right after leaving it if it does not yield to other goroutines. But I am not sure.

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 I see what you mean. Probably I should call runtime.Gosched()? I'll try it.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake oh, yes, it is called Gosched in Go :P

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 it seems so :) https://golang.org/pkg/runtime/#Gosched
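
The yield would go right after each chunk's read txn finishes. A minimal sketch of the pattern (illustrative only; processChunk is a stand-in for one bounded read transaction):

```go
package sketch

import "runtime"

// rangeInChunks runs one bounded read transaction per chunk and then
// explicitly yields, so goroutines waiting on the backend lock (e.g. Put
// handlers) get a chance to be scheduled before the next chunk is read.
func rangeInChunks(nChunks int, processChunk func(i int) error) error {
	for i := 0; i < nChunks; i++ {
		if err := processChunk(i); err != nil { // one short read txn
			return err
		}
		runtime.Gosched() // yield the processor between read txns
	}
	return nil
}
```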

@mitake
Contributor Author

mitake commented Jan 26, 2018

@xiang90 I updated the PR based on your explicit yielding idea. The results (including baseline range performance) are below. The latency score of --max-range-keys-once 10000 is much better than with the older version, but it is still worse than the case of 100,000. Probably I need much more analysis with pprof.

baseline range performance

default

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null

real    0m9.225s
user    0m3.312s
sys     0m2.616s

10,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null        

real    0m29.761s
user    0m3.376s
sys     0m2.816s

100,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.4:2379 get --prefix "" > /dev/null

real    0m11.826s
user    0m3.460s
sys     0m2.468s

concurrent range and put

default

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m11.896s
user    0m3.692s
sys     0m2.696s

Summary:
  Total:        16.5136 secs.
  Slowest:      4.5160 secs.
  Fastest:      0.0020 secs.
  Average:      0.0158 secs.
  Stddev:       0.1424 secs.
  Requests/sec: 6055.5975

Response time histogram:
  0.0020 [1]    |
  0.4534 [99899]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.9048 [0]    |
  1.3562 [0]    |
  1.8076 [0]    |
  2.2590 [0]    |
  2.7104 [0]    |
  3.1618 [0]    |
  3.6132 [0]    |
  4.0646 [0]    |
  4.5160 [100]  |

Latency distribution:
  10% in 0.0067 secs.
  25% in 0.0076 secs.
  50% in 0.0089 secs.
  75% in 0.0115 secs.
  90% in 0.0204 secs.
  95% in 0.0268 secs.
  99% in 0.0412 secs.
  99.9% in 4.5047 secs.

10,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.3:2379 get --prefix "" > /dev/null

real    0m29.739s
user    0m3.620s
sys     0m2.736s

Summary:
  Total:        34.6274 secs.
  Slowest:      3.3571 secs.
  Fastest:      0.0020 secs.
  Average:      0.0339 secs.
  Stddev:       0.2094 secs.
  Requests/sec: 2887.8850

Response time histogram:
  0.0020 [1]    |
  0.3375 [98533]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.6731 [226]  |
  1.0086 [133]  |
  1.3441 [170]  |
  1.6796 [214]  |
  2.0151 [304]  |
  2.3506 [184]  |
  2.6861 [207]  |
  3.0216 [4]    |
  3.3571 [24]   |

Latency distribution:
  10% in 0.0065 secs.
  25% in 0.0074 secs.
  50% in 0.0085 secs.
  75% in 0.0107 secs.
  90% in 0.0175 secs.
  95% in 0.0251 secs.
  99% in 1.2282 secs.
  99.9% in 2.6472 secs.

100,000

$ time ETCDCTL_API=3 bin/etcdctl --command-timeout=60s --endpoints=10.140.0.5:2379 get --prefix "" > /dev/null

real    0m13.050s
user    0m3.564s
sys     0m2.720s

Summary:
  Total:        18.9839 secs.
  Slowest:      1.6877 secs.
  Fastest:      0.0028 secs.
  Average:      0.0184 secs.
  Stddev:       0.0801 secs.
  Requests/sec: 5267.6333

Response time histogram:
  0.0028 [1]    |
  0.1713 [99191]        |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.3398 [6]    |
  0.5083 [103]  |
  0.6768 [300]  |
  0.8452 [201]  |
  1.0137 [1]    |
  1.1822 [98]   |
  1.3507 [0]    |
  1.5192 [0]    |
  1.6877 [99]   |

Latency distribution:
  10% in 0.0076 secs.
  25% in 0.0085 secs.
  50% in 0.0099 secs.
  75% in 0.0124 secs.
  90% in 0.0189 secs.
  95% in 0.0258 secs.
  99% in 0.0606 secs.
  99.9% in 1.1644 secs.

@xiang90
Contributor

xiang90 commented Jan 26, 2018

@mitake

Probably I need much more analysis with pprof.

Yes. I still do not fully understand where the improvements come from, and why exactly the Puts are blocked. Let us dig into this a little bit more.

@xiang90
Contributor

xiang90 commented Jan 30, 2018

/cc @jpbetz

@mitake
Contributor Author

mitake commented Jan 31, 2018

I did some profiling to understand the behaviour. To keep the experiment simple, I set up a single-node cluster on my machine. The clients (etcdctl and benchmark) were executed on the same machine. It seems that smaller read txns spend a lot of time in the commit process of the backend.
(I'm still not 100% sure because pprof's output format seems to have changed since the last time I used it.)

default

Showing nodes accounting for 70ms, 100% of 70ms total
Showing top 20 nodes out of 23
      flat  flat%   sum%        cum   cum%
      20ms 28.57% 28.57%       20ms 28.57%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 14.29% 42.86%       30ms 42.86%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
      10ms 14.29% 57.14%       20ms 28.57%  runtime.notesleep /usr/local/go/src/runtime/lock_futex.go
      10ms 14.29% 71.43%       20ms 28.57%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
      10ms 14.29% 85.71%       10ms 14.29%  runtime.siftupTimer /usr/local/go/src/runtime/time.go
      10ms 14.29%   100%       10ms 14.29%  runtime.usleep /usr/local/go/src/runtime/sys_linux_amd64.s
         0     0%   100%       10ms 14.29%  github.com/coreos/etcd/internal/lease.(*lessor).runLoop /home/mitake/gopath/src/github.com/coreos/etcd/internal/lease/lessor.go
         0     0%   100%       10ms 14.29%  runtime.addtimer /usr/local/go/src/runtime/time.go
         0     0%   100%       10ms 14.29%  runtime.addtimerLocked /usr/local/go/src/runtime/time.go
         0     0%   100%       20ms 28.57%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       30ms 42.86%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       20ms 28.57%  runtime.mstart /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.mstart1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 14.29%  runtime.notetsleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 14.29%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       30ms 42.86%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 42.86%  runtime.schedule /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.stopm /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 28.57%  runtime.sysmon /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 14.29%  runtime.timerproc /usr/local/go/src/runtime/time.go

--max-range-keys-once 10000

Showing nodes accounting for 50ms, 100% of 50ms total
Showing top 20 nodes out of 24
      flat  flat%   sum%        cum   cum%
      40ms 80.00% 80.00%       40ms 80.00%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 20.00%   100%       10ms 20.00%  runtime.makemap /usr/local/go/src/runtime/hashmap.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*backend).run /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/backend.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).Commit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).commit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*batchTxBuffered).unsafeCommit /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/batch_tx.go
         0     0%   100%       10ms 20.00%  github.com/coreos/etcd/internal/mvcc/backend.(*readTx).reset /home/mitake/gopath/src/github.com/coreos/etcd/internal/mvcc/backend/read_tx.go
         0     0%   100%       10ms 20.00%  runtime.entersyscallblock /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.entersyscallblock_handoff /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 60.00%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 20.00%  runtime.futexwakeup /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 20.00%  runtime.handoffp /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       10ms 20.00%  runtime.notesleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       20ms 40.00%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       30ms 60.00%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 20.00%  runtime.notewakeup /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 20.00%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 20.00%  runtime.schedule /usr/local/go/src/runtime/proc.go

--max-range-keys-once 100000

Showing nodes accounting for 80ms, 100% of 80ms total
Showing top 20 nodes out of 23
      flat  flat%   sum%        cum   cum%
      40ms 50.00% 50.00%       40ms 50.00%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s
      10ms 12.50% 62.50%       10ms 12.50%  runtime.acquirep /usr/local/go/src/runtime/proc.go
      10ms 12.50% 75.00%       30ms 37.50%  runtime.exitsyscall /usr/local/go/src/runtime/proc.go
      10ms 12.50% 87.50%       10ms 12.50%  runtime.lock /usr/local/go/src/runtime/lock_futex.go
      10ms 12.50%   100%       40ms 50.00%  runtime.notetsleep_internal /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast.func1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       20ms 25.00%  runtime.exitsyscallfast_pidle /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.findrunnable /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.futexsleep /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 12.50%  runtime.futexwakeup /usr/local/go/src/runtime/os_linux.go
         0     0%   100%       10ms 12.50%  runtime.mcall /usr/local/go/src/runtime/asm_amd64.s
         0     0%   100%       30ms 37.50%  runtime.mstart /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.mstart1 /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.notetsleep /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       40ms 50.00%  runtime.notetsleepg /usr/local/go/src/runtime/lock_futex.go
         0     0%   100%       10ms 12.50%  runtime.park_m /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.schedule /usr/local/go/src/runtime/proc.go
         0     0%   100%       10ms 12.50%  runtime.stopm /usr/local/go/src/runtime/proc.go
         0     0%   100%       30ms 37.50%  runtime.sysmon /usr/local/go/src/runtime/proc.go

@mitake
Contributor Author

mitake commented Jan 31, 2018

I also implemented a client-side mechanism for the same purpose in the second commit.

The major TODO on the client side is implementing the sorting feature. The Range() calls can now be split, so the sorting has to be performed on the client side too (see the sketch below).
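
A rough idea of what the client-side pagination could look like (a sketch, not the actual commit; it assumes the clientv3 Get options WithRange, WithLimit, WithSort, and WithRev, and pins the first page's revision so the assembled result stays a consistent snapshot):

```go
package sketch

import (
	"context"

	"github.com/coreos/etcd/clientv3"
	"github.com/coreos/etcd/mvcc/mvccpb"
)

// pagedGet fetches a large range in pages of at most pageSize keys, so the
// server never holds a long read txn for this request.
func pagedGet(ctx context.Context, c *clientv3.Client, key, end string, pageSize int64) ([]*mvccpb.KeyValue, error) {
	var out []*mvccpb.KeyValue
	var rev int64
	for {
		opts := []clientv3.OpOption{
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
			clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
		}
		if rev != 0 {
			opts = append(opts, clientv3.WithRev(rev)) // read every page at the same revision
		}
		resp, err := c.Get(ctx, key, opts...)
		if err != nil {
			return nil, err
		}
		if rev == 0 {
			rev = resp.Header.Revision // pin the snapshot after the first page
		}
		out = append(out, resp.Kvs...)
		if !resp.More || len(resp.Kvs) == 0 {
			return out, nil
		}
		// The next page starts just after the last key returned.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```

Only key-order sorting falls out of this pagination naturally; any other sort order would have to be applied client-side after all pages are collected, which is the sorting TODO mentioned above.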

Currently, Range() reads the entire requested range of keys in a
single read transaction. This can cause long waits for writer
transactions, which show up as high latency spikes. To solve this
problem, this commit lets Range() split its read transaction in a
fine grained manner. Between the read transactions, concurrent write
RPCs (e.g. Put(), DeleteRange()) get a chance to start their own
transactions.

This commit also adds a new option `--range-max-keys-once` to
etcd. With this option, users can specify the maximum number of keys
read in a single transaction during Range().
…ingle range txn

This commit adds a similar mechanism for making read txns smaller on
the client side. This is good for reducing the peak memory usage of
the server.

This commit also adds a new flag --max-range-keys-once to `etcdctl
get`. By specifying a value for the option, users can use this
feature during the execution of the command.
@mitake mitake changed the title WIP, DO NOT MERGE: *: make serializable Range() fine grained WIP, DO NOT MERGE: *: make Range() fine grained Jan 31, 2018
@xiang90
Contributor

xiang90 commented Feb 7, 2018

@mitake it is strange that setting the range max size to 10000 increases the number of commits of the batched tx. Read-only requests should have nothing to do with backend commits.

@mitake
Contributor Author

mitake commented Oct 20, 2019

I'm closing this because the problem of large range performance is solved by #10523.

@mitake mitake closed this Oct 20, 2019