DNM : Race condition in rados bench #5812
Conversation
@Abhishekvrshny the link in the description must be an issue in the "Backport" tracker
@dachary Yes my bad :( . It should have been 12949. Also the name of the branch should have been wip-12949-hammer. Can I change that somehow?
@Abhishekvrshny I see you figured it out ;-)
Force-pushed from 5538727 to be2aab5
You can include fa72ecb and dba8b5b from #5810 and see if it helps. I would do:
and see if the bot compiles ok with that. It's perfectly ok to include commits from other pull requests if needed. They will be silently trimmed by git when merging.
Force-pushed from be2aab5 to 5fb896a
@dachary Very nice introduction, and not just for bug-fixing... Sorry, it's not quite about this PR.
Tab replaced with spaces Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit 069d95e)
Generalize vec_stddev by making it a template, so that we can use different kinds of vectors to calculate STDDEV. Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit e360bfd)
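The generalization described in that commit message can be sketched roughly as follows. This is a minimal illustration, not Ceph's actual vec_stddev code; the name container_stddev is made up:

```cpp
#include <cassert>
#include <cmath>
#include <deque>
#include <vector>

// Hypothetical sketch: instead of accepting only std::vector<double>, the
// standard-deviation helper is templated on the container type, so a
// std::deque (or any other container of doubles) works too.
template <typename Container>
double container_stddev(const Container& v) {
  if (v.size() < 2)
    return 0.0;
  double mean = 0.0;
  for (double x : v)
    mean += x;
  mean /= v.size();
  double sum_sq = 0.0;
  for (double x : v)
    sum_sq += (x - mean) * (x - mean);
  // sample standard deviation (divide by n - 1)
  return std::sqrt(sum_sq / (v.size() - 1));
}
```

Any container offering size() and iteration over doubles can be passed, which is the point of the template.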
Fix indents in output; change `snprintf' to `setprecision'. Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit ddb422f)
Add IOPS metric in result output for write_bench, rand_read_bench and seq_read_bench routines Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit 3944264)
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com> (cherry picked from commit e3d62a9)
'trans_size' description from the header: "size of the write/read to perform". But really 'object_size' is used in the write/read operations; 'trans_size' is used only in ObjBencher::status_printer to calculate the current and average bandwidth. As a result, the statistics are wrong when 'trans_size' and 'object_size' are different. Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit 31d16e9)
Remove the excess branch 'else {object_size = op_size;}' from 'if(operation != OP_WRITE){...}else{object_size = op_size;}' because the 'object_size' variable is already initialized with 'op_size'. Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit 1d1d0aa)
Rename aio_bench argument 'op_size' to 'object_size' to reflect the reality of how it is used. Signed-off-by: Dmitry Yatsushkevich <dyatsushkevich@mirantis.com> (cherry picked from commit a87ac4d)
We never want to accumulate a bandwidth that is zero. A bandwidth of zero means that the object could not be written in one second (or that it's the first time in the loop). Even if writing an object takes a very long time, the bandwidth cannot be zero, it will just be very small. Fix suggested by @dachary Fixes: ceph#7401 Signed-off-by: Dmitry Yatsushkevich <dmitry.yatsushkevich@gmail.com> (cherry picked from commit cd0f2b7)
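The rule this commit message describes (a per-second bandwidth sample of zero means no object completed in that interval, or it is the first loop iteration, so the sample must not drag the average down) can be sketched like this. The function and variable names are hypothetical, not the actual ObjBencher members:

```cpp
#include <cassert>
#include <vector>

// Hedged sketch of the fix: zero bandwidth samples are never accumulated,
// so the running average only reflects intervals in which data actually
// completed. A genuinely slow write yields a small, nonzero sample instead.
void accumulate_bandwidth(std::vector<double>& samples, double bandwidth) {
  if (bandwidth > 0)  // skip zero samples entirely
    samples.push_back(bandwidth);
}
```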
Signed-off-by: Loic Dachary <ldachary@redhat.com> (cherry picked from commit 1ac7279)
ObjBencher::aio_bench sets data.finished to num_objs but it should be set to zero since it reflects the number of objects read / written. That's what the read bench does and the write bench should do the same. It does not cause a problem; it is just confusing. Signed-off-by: Loic Dachary <ldachary@redhat.com> (cherry picked from commit 5c6190d)
Fixes the high CPU usage and corrects rados bench scores on fast SSDs and ramdisks/memstore. For bench run on SSD, on Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz before this patch, times are: write: real 5m0.169s, user 2m33.565s, sys 4m39.791s seq: real 4m28.642s, user 1m35.250s, sys 6m42.948s rand: real 5m0.258s, user 1m19.656s, sys 6m47.145s After this patch: write: real 5m1.162s, user 0m27.788s, sys 3m11.707s seq: real 5m1.149s, user 2m23.278s, sys 4m14.427s rand: real 5m1.021s, user 2m30.514s, sys 4m20.347s Bench run: rados -p ssd bench 300 write|seq|read --no-cleanup Note the increase in user time cpu on seq/read tests, along with decreased sys cpu time; this is because there's additional memcmp() that compares read objects with expected contents. With less time spent memory juggling, more time is spent performing more reads per second, increasing memcmp call count and increasing amount of user cpu time used. Signed-off-by: Piotr Dałek <piotr.dalek@ts.fujitsu.com> (cherry picked from commit b894fc7)
When doing seq and rand read benchmarks using rados bench, a quite large portion of cpu time is consumed by doing object verification. This patch adds an option to disable this verification when it's not needed, in turn giving better cluster utilization. rados -p storage bench 600 rand scores without --no-verification: Total time run: 600.228901 Total reads made: 144982 Read size: 4194304 Bandwidth (MB/sec): 966 Average IOPS: 241 Stddev IOPS: 38 Max IOPS: 909522486 Min IOPS: 0 Average Latency: 0.0662 Max latency: 1.51 Min latency: 0.004 real 10m1.173s user 5m41.162s sys 11m42.961s Same command, but with --no-verify: Total time run: 600.161379 Total reads made: 174142 Read size: 4194304 Bandwidth (MB/sec): 1.16e+03 Average IOPS: 290 Stddev IOPS: 20 Max IOPS: 909522486 Min IOPS: 0 Average Latency: 0.0551 Max latency: 1.12 Min latency: 0.00343 real 10m1.172s user 4m13.792s sys 13m38.556s Note the decreased latencies, increased bandwidth and more reads performed. Signed-off-by: Piotr Dałek <piotr.dalek@ts.fujitsu.com> (cherry picked from commit ca6abca)
This function iterates over all bufferlist internal buffers and calls their invalidate_crc() method. Required for rados bench to work correctly, because it modifies buffers outside the buffer API, invalidating the internal CRC cache in the process. This method clears that cache, so the next call to buffer::get_crc() returns the correct checksum. Signed-off-by: Piotr Dałek <piotr.dalek@ts.fujitsu.com> (cherry picked from commit 55a6f9e)
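A toy model of the idea in that commit message, assuming nothing about Ceph's real bufferlist internals — the types and the stand-in checksum below are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Each buffer lazily caches a checksum of its contents. Writing to the raw
// memory behind the API's back (as rados bench does) makes the cache stale.
struct Buffer {
  std::vector<uint8_t> data;
  bool crc_valid = false;
  uint32_t cached_crc = 0;
  uint32_t crc() {
    if (!crc_valid) {
      cached_crc = 0;
      for (uint8_t x : data)
        cached_crc = cached_crc * 31u + x;  // stand-in checksum, not crc32c
      crc_valid = true;
    }
    return cached_crc;
  }
};

struct BufferList {
  std::vector<Buffer> buffers;
  // Walk every buffer and drop its cached checksum, forcing the next crc()
  // query to recompute from the (possibly modified) bytes.
  void invalidate_crc() {
    for (auto& b : buffers)
      b.crc_valid = false;
  }
};
```

Without invalidate_crc(), a crc() call after an out-of-band write would keep returning the stale cached value.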
Under certain conditions (like bench seq/rand -b 1024 -t 128) it is possible that aio_read reads data into destination buffers before or during memcmp execution, resulting in "[..] is not correct!" errors even if the actual objects are perfectly fine. Also, moved the latency calculation around, so it is no longer affected by memcmp. Signed-off-by: Piotr Dałek <piotr.dalek@ts.fujitsu.com> (cherry picked from commit 9bcf5f0)
Force-pushed from 5fb896a to e298f84
Force-pushed from e298f84 to 8f8d488
@@ -153,8 +153,12 @@ void *ObjBencher::status_printer(void *_bencher) {

int ObjBencher::aio_bench(
  int operation, int secondsToRun,
<<<<<<< HEAD
These lines shouldn't be present in the commits.
@theanalyst my bad. A side-effect of manually merging conflicts.
@Abhishekvrshny no problem... this is a non-trivial merge... and a good learning experience
Force-pushed from 8f8d488 to a71071a
@Abhishekvrshny I suggest you try to get all modifications made to obj_bencher since hammer with:
git cherry-pick -x 069d95eaf49cadaa9a8fa1fa186455944a50ec7d e360bfd7058af01fd03aa9260eacbdc5d631e34b ddb422f40394ec2f4fc4a6a6cff49e43538121a9 39442641fc61c6ba7bde2cde5ceafd9276647f85 e3d62a9d4c6da7bd48f7aefce91db050bdc9a637 31d16e90108a741e92f897e550a8bf1e26392d2d 1d1d0aa01b944e8f9dad96b7f13bce2a581ba758 a87ac4d36b86aa826afb990769831bfcf08ed3fa cd0f2b754dc3cdf1e4f1647dc1cd91d891ce896c 1ac727982213bf676beefe9340d6822193dbb700 5c6190d012785222ba1eebee734aa34f79c5bf27 fe79babc9867a4cdf81a5b83019b9e1bbc86f9ae b894fc7acf7dee7f7ec8c5c280e7a6be41133328 ca6abca 55a6f9e b894fc7 9bcf5f0
It applies cleanly and it's legitimate IMHO because:
Force-pushed from a71071a to d37c990
@dachary This would put a bulk of commits in one backport PR. Is that fine, or should there be separate backport PRs for each of them? Also, if we take all changes since hammer, how do we know there isn't something which is NOT meant to be in hammer (and is probably a feature only for the next release)?
Very good question :-) The only way is to review each commit individually, which I did. I guess we're lucky that this is a simple situation. It could have been a lot more hairy if there were new features.
This is generally not ok and we do that very rarely. IMHO it's ok in this specific case because it's about a tool and not the core, and because the series of changes is quite simple.
Let's first see how it goes with the tests. If the tests are happy and no one objects to this rather extensive backport, then we can worry about cross-referencing this commit in the right places. You're lucky: you got to deal with one of the most complicated cases of backport :-)
@dachary :) . Now that the tests have passed, shall I make a new branch with an appropriate name and send a separate PR, or will this be fine?
I didn't expect it to be so complicated. Sorry, guys!
@branch-predictor it is a good experience (not just for @Abhishekvrshny ;-).
@Abhishekvrshny I think it's worth asking if #5428 should be backported (on the ceph-devel mailing list).
For the record, the mail thread is at: http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/27302
abhishekvrshny: why we need #5428? #5812 relies on it? ah, right there's a change to rest_bench as well, but it got lost once rest_bench was eradicated
abhishekvrshny: this: http://ceph.predictor.org.pl/diff-wip-12947.txt should help
BranchPredictor: Can you raise a hammer-specific PR for 5812 and I would close 5812 itself. I can then schedule an integration test with that.
abhishekvrshny: well, I can... company will kill me for all those git clones. :P
BranchPredictor: it looks like a non-trivial backport to me. I don't want to mess it up :P
abhishekvrshny: no worries, I'll do it.
BranchPredictor: Thanks :)
Replaced by #6791
http://tracker.ceph.com/issues/12949