Investigate faf results on Plaintext #7402
I think the date is empty as the variable is never initialized: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Rust/faf/src/main.rs#L43 The initialization seems to have been removed in this PR: https://github.com/TechEmpower/FrameworkBenchmarks/pull/6523/files#diff-e828d6a53c980b3107862cfa2530ba397b76f56d2a0bb3d555465a6f6217f4feL52-L53 That would explain the network bandwidth difference. |
/cc @errantmind |
Please look at the source for the framework here, the date value is passed along from the main event loop, also the date header validation does exist in the tfb toolset. |
Thanks for confirming. Do you know how it's getting over the "physical" limits then? If not then I will try to check the full payload it's sending so I can understand. |
could process priority affect throughput? |
@fafhrd91 historically all plaintext results have been capped at 7M rps because when you look at the minimum request size and the network packets, over a 10Gb/s network, it can't be faster than that. This was confirmed by:
I am not excluding any magic trick though, even at the network level, which could be a totally fine explanation. |
@sebastienros @fafhrd91 @sumeetchhetri @errantmind i've done a little investigation here and it seems each faf request from wrk is 87 bytes and each response back is 126 bytes. so, if we assume for argument's sake an average ethernet packet size of 8192 bytes, that would allow 64 responses per packet plus 66 bytes of protocol overhead for TCP/IP/Ethernet, giving us an overhead of ~1 byte per response, for an overall average including protocol overhead of 127 bytes per response. if we divide 10,000,000,000 (10Gb) bits by (127 * 8 = 1016) bits we get a theoretical max throughput of 9,842,519 responses per second on a 10Gb ethernet interface. so, i think it is possible that the faf solution, due to its extremely low-level design and focus on shaving every possible microsecond off the server-side processing overhead, could be significantly faster than all the other frameworks, no? it does seem to skew the results on the TE charts though, so maybe it should be tagged as stripped instead of realistic? i can also confirm that the responses are adhering to the rules and the date field does change at least every second. |
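The estimate in the comment above can be reproduced with a few lines of Python. This is only a sketch of that arithmetic under the comment's own assumptions (64 responses of 126 bytes per packet, 66 bytes of TCP/IP/Ethernet overhead, a fully usable 10Gb link); the function and constant names are illustrative and not taken from any tool in the thread.

```python
# Sketch of the back-of-the-envelope estimate from the comment above.
# All figures are the comment's assumptions, not measurements.
LINK_BITS_PER_SEC = 10_000_000_000  # 10Gb link, assumed fully usable
RESPONSE_BYTES = 126                # size of one faf response
RESPONSES_PER_PACKET = 64           # assumed packing into one large packet
OVERHEAD_BYTES = 66                 # TCP/IP/Ethernet overhead per packet


def max_responses_per_sec(link_bps, response_bytes, per_packet, overhead_bytes):
    """Responses per second if every packet is full and nothing is lost."""
    packet_bits = (per_packet * response_bytes + overhead_bytes) * 8
    return link_bps / packet_bits * per_packet


# Exact per-packet accounting.
exact = max_responses_per_sec(
    LINK_BITS_PER_SEC, RESPONSE_BYTES, RESPONSES_PER_PACKET, OVERHEAD_BYTES
)

# The comment's shortcut: round the overhead up to 1 byte per response.
rounded = LINK_BITS_PER_SEC / ((RESPONSE_BYTES + 1) * 8)

print(f"exact:   {exact:,.0f} responses/s")
print(f"rounded: {rounded:,.0f} responses/s")
```

Both variants land at roughly 9.84M responses per second, so the rounding of the overhead to 1 byte per response barely matters; the packet size assumption is what drives the result.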
@sebastienros are my calculations above correct? as far as i understand, they would assume full duplex communication between web server and benchmark server so that would mean theoretical throughput of 10Gb up and 10Gb down simultaneously yes? if not, then i can't understand how we could even reach 7m RPS. 🤷♂️ |
I didn't see this right away, so sorry for the delay. There is no single secret that accounts for the difference in performance between faf and other projects. Each component in faf has been hand optimized to some degree. If anyone is extra curious they can test each component individually against what they have. The source code is a little dense but has decent separation-of-concerns. |
@errantmind i will take a look. i find rust syntax so horrific to look at though. i'll see if i can put together something in plain old C that would achieve the same throughput. @sebastienros @nbrady-techempower can anyone confirm re. my questions/calculations above? i had thought, as sebastien asserted above, that 7m RPS was some kind of hard limit based on the hardware but it seems not. 🤷♂️ |
Sorry, haven't had time to look into it, but I had done the same calculation 3 years ago (we recorded it I think) and we confirmed the 7M. I had looked at wireshark traces too because we didn't understand why all results capped at 7M. Then we ordered new NICs (40gb/s) and magically the benchmarks became much faster. All these benchmarks that are currently at 7M reach 10M+ when using a faster card, with the same hardware (we own the same physical machines as TE), so I don't think faf is just faster because it's "more" optimized than other fast frameworks. But until proven otherwise this is what it is, and I totally understand that there are some tests to validate the results. Hope to find some time to investigate more soon. |
MTU is 1500 on these NICs, including the ones we use. |
@sebastienros ok, so if my quick calculation is correct that would make it a theoretical max RPS of ~9.4m. each packet = (11 * 126) + 66 = 1452 bytes = 11616 bits assuming full duplex and that downstream link can use the full 10Gb. does that sound correct? |
I took some tcp dumps and all looks fine. Compared to ULib (c++) also which used to be the fastest on our faster network. The important things I noted are
But something that you should re-calculate, is that the requests are 170B, so these should be the network bottleneck (if it is one): From the client each packet is 2720B (16 requests of 170B) plus the preamble (66B) which makes 2786B to send 16 requests. And this is assuming there is no lost/empty packets (and there are) |
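To compare the two directions side by side, here is a small sketch that applies the same per-packet accounting to the response-side figures (11 responses of 126 bytes plus 66 bytes of overhead) and to the request-side figures above (16 requests of 170 bytes plus 66 bytes). It copies the simplification used in the comments, including charging the 66-byte overhead once per batch of 16 requests even though 2720 bytes would not fit in a single 1500-byte frame. The helper name is hypothetical.

```python
# Sketch: per-direction throughput ceiling on a 10Gb link, using the
# figures quoted in the thread. Assumes full duplex and no lost packets.
LINK_BPS = 10_000_000_000
OVERHEAD_BYTES = 66  # TCP/IP/Ethernet overhead charged once per batch


def per_second_limit(payload_bytes, per_batch,
                     link_bps=LINK_BPS, overhead=OVERHEAD_BYTES):
    """Max messages per second when batches of `per_batch` share one overhead."""
    wire_bits = (payload_bytes * per_batch + overhead) * 8
    return link_bps * per_batch / wire_bits


# Server -> client: 11 responses of 126 bytes, 1452 bytes on the wire.
response_limit = per_second_limit(126, 11)

# Client -> server: 16 requests of 170 bytes, 2786 bytes on the wire.
request_limit = per_second_limit(170, 16)

print(f"response side: {response_limit:,.0f} per second")
print(f"request side:  {request_limit:,.0f} per second")
```

Under these assumptions the request direction is the tighter bound, at a little under 7.2M requests per second, while the response direction would allow roughly 9.5M.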
I am talking about the request payload, not the response. |
@sebastienros the numbers above are a little off. it works out as max ~7m requests per second as we had initially assumed. so, we are no closer to having an answer then? 🤔 it must be 🪄 😄 |
if you wanted to, could improve the baseline significantly by removing the redundant 'Connection: keep-alive' header in the request - in HTTP/1.1 keep-alive is assumed in the absence of this header. you could also drop the 'Accept' header altogether. removing these two headers would knock 120 bytes off each request - they are most likely ignored anyway by most if not all the frameworks in the test. 🤷♂️ |
Did we ever get an up-to-date hard confirmation that each (pipelined) request, as received, is 2720 bytes? It has been a while since I've done any testing, but I have a recollection the requests were a lot smaller than that. Either way, can you provide the exact bytes you are seeing in each pipelined request? |
The same initial conclusion we got to a few years ago. (sorry I forgot about bits/bytes) But from what I have seen so far there is nothing abnormal with FAF, this is getting very interesting. |
@billywhizz Interesting, thank you for posting the info. I'm going to have to think about this for a while. I've gone through the code multiple times today and tested it in various ways, and I don't see any issues there, so I'm at a bit of a loss. Although not totally relevant, I'm also using FaF in a couple production environments that serve up a decent chunk of traffic and have also not noticed any issues over the months it has been in use. |
yes - i think both myself and @sebastienros are in agreement that nothing seems to be wrong with the code or what we are seeing on the wire. am wondering could it be some kind of issue/bug with wrk and how it's counting pipelined responses? i'm kinda stumped tbh. 😕 |
There is one thing that is different for faf. I assume most frameworks use port 8080, but faf is using 8089. @errantmind would you like to submit a PR that makes it use 8080 instead? I believe it's only in 3 locations (main.rs, the Dockerfile expose, and benchmark_config.json). There is no reason it would change anything, right? But wrk ... who knows |
one thing i am thinking on here.
so, i think it might be possible that faf, due to a bug, is writing more responses than requests it receives and that wrk terminates early when it receives the expected number of responses to requests it has sent (some of which could actually still be in flight, or possibly not even sent yet depending on how wrk "works" internally). this seems to me the only thing that could explain the numbers we are seeing. there's no system level "trick" afaict that would allow sending more than ~7.2m requests per second if the numbers we discussed above are correct. does that make sense? i'll see if i can reproduce this and/or have a dig into the source code of wrk to see if it's possible/likely. |
@billywhizz let me know what you find. If there is a bug, then I'll submit a PR to fix it. If not, I'll change the port. |
i did some testing here with a proxy between wrk and faf and always see same amount of responses sent out of faf as requests that were received so doesn't look like my theory is correct. 🤷♂️ |
I'm happy there doesn't appear to be a bug from your tests, but I'd still like an explanation why it exceeds the calculated limits described earlier. It seems like it is either an even harder to replicate bug, or something is off with the calculation, e.g., the official tfb benchmark isn't sending as many bytes in each request as appears in the repo.. or something along those lines. Of course, it could be something else entirely as well. I don't have time at the moment to mount a full-scale investigation of my own, but if anyone eventually provides a full explanation, it will be appreciated, whatever the reason is. |
I decide to implement this after this fascinating read. TechEmpower#7402
I did look at wrk, and it seems the per-thread counting of requests could be a problem. For example From what I understand, if that were the case, then RPS would increase (which is strange). I can have fix for |
As I posted on Twitter in reply to Egor Bogatov, my first thought is that there may be an error either in the composition of responses or in the way wrk counts responses, as @kant2002 discusses above. I'd recommend reviewing the total received bytes reported by wrk versus the expected total received size of N responses multiplied by M bytes per response. My hunch is that these two values won't align. |
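The check suggested above can be written down as a short sketch: multiply the number of responses wrk counted by the expected bytes per response and compare that with the total bytes wrk reports as received. The function name and the sample inputs are hypothetical; the 126-byte response size is the figure quoted earlier in the thread.

```python
# Sketch of the consistency check: do the bytes wrk received match
# N responses of M bytes each? Inputs below are illustrative only.
def transfer_mismatch(responses_counted, bytes_per_response, bytes_received):
    """Return received bytes minus expected bytes (0 means they align)."""
    expected = responses_counted * bytes_per_response
    return bytes_received - expected


# Hypothetical run: 1,000,000 responses of 126 bytes each.
aligned = transfer_mismatch(1_000_000, 126, 126_000_000)
short = transfer_mismatch(1_000_000, 126, 120_000_000)

print(aligned)  # zero when counts and bytes agree
print(short)    # negative when fewer bytes arrived than the count implies
```

A non-zero result would point at malformed or truncated responses, or at a counting problem in the client; a zero result only shows that the counted responses were complete, not that each one answered a distinct request.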
I'm looking at numbers which wrk displays for local testing on my very old laptop. Admittedly not best target for testing.
I cannot connect 3 numbers in any meaningful way,
I may attribute the number differences to https://github.com/wg/wrk/blob/a211dd5a7050b1f9e8a9870b95513060e72ac4a0/src/wrk.c#L277-L278 but I would like somebody to find a flaw in my reasoning, or point out an error in my interpretation of the wrk results. |
@kant2002 the |
In reality no framework reaches 7M req/s; only faf does, with 8.6M. If we check these numbers with such precision, we must know where they come from.
TechEmpower toolset numbers
Historically the numbers in the benchmark graphs are a bit higher than in reality, but this affects all frameworks similarly.
Why?
The req/s figure from wrk also includes non-2xx/3xx responses and socket errors (in plaintext with pipelining). So the toolset takes the total number of requests, subtracts the errors, and divides by 15 seconds to get the chart numbers. That seems correct, but wrk never runs for exactly 15 seconds (roughly 15.00 to 15.14 s).
Diff
When there are no errors, it is easy to see the difference.
Wrk
@kant2002 About the numbers returned by wrk: 10375504 / 687757.44 = 15.08599310826794, which rounds to 15.09s. |
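To make the effect of the hard-coded 15 seconds concrete, here is a sketch using only the two wrk figures quoted above (10375504 total requests and 687757.44 req/s). It assumes a run with no errors, so nothing is subtracted before dividing; the variable names are illustrative.

```python
# Sketch: chart number (total / nominal 15 s) versus the rate wrk itself
# reports (total / actual elapsed time), using the figures quoted above.
TOTAL_REQUESTS = 10_375_504
WRK_REPORTED_RPS = 687_757.44
NOMINAL_SECONDS = 15  # the hard-coded divisor discussed in the thread

# Elapsed time implied by wrk's own output.
actual_seconds = TOTAL_REQUESTS / WRK_REPORTED_RPS

# Rate obtained when dividing by the nominal duration instead.
chart_rps = TOTAL_REQUESTS / NOMINAL_SECONDS

# How much higher the chart number is than wrk's figure, in percent.
inflation_pct = (chart_rps / WRK_REPORTED_RPS - 1) * 100

print(f"actual duration: {actual_seconds:.2f} s")
print(f"chart req/s:     {chart_rps:,.2f}")
print(f"inflation:       {inflation_pct:.2f} %")
```

For this run the nominal-duration figure comes out a bit over half a percent higher than the rate wrk reports, which is small and applies to every framework alike.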
@joanhey I did find yesterday that the bottom numbers really are slightly different because of display precision, but I was too tired to get back to it. Still, the difference between total RPS and per-thread RPS is not clear to me. The total test duration is the time between the first thread starting and the last thread processing its last request. Also if follow links above Okay, let's take a look at the reply of @billywhizz #7402 (comment)
|
Seems I should not do anything too early in the morning. Ignore the last message, except the part about per-thread RPS. I do not understand why the numbers are bigger than total RPS / thread count. |
@kant2002 |
General FYI: I am actively looking into this today as it is annoying not to know what is going on here. I have a few suspicions and will post an update here today/tomorrow with whatever I find. In the meantime I ask that FaF is excluded from this round of the official results until we have an answer here. @nbrady-techempower |
Sorry, I had some other work come up I had to prioritize. It appears there was a fairly niche bug after all. I patched the code and will write an explanation after the next continuous run completes. |
I broke my build in the process of upgrading between Rust versions so I'll have to wait before seeing the updated results, but I'll go ahead and explain the bug. The bug affected HTTP 1.1 pipelined requests outside of local networks, in situations where the pipelined requests exceeded the network interface's MTU. In short, FaF was sending more responses than there were requests. On Linux, by default, loopback interfaces often use a very high MTU. Because of this, the bug did not manifest over loopback, which made it harder to pinpoint initially; only after I set my loopback MTU to a more standard 1500 could I reproduce it. As the benchmark was sending nearly double the MTU bytes as pipelined requests, the socket reads usually contained a partial request at the end of the segment (totalling 1500 bytes). FaF was not updating a pointer to a read buffer where it needed to and was responding to the 9 complete pipelined requests, then responding again to the full 16 requests after reading the remaining data from the socket in a subsequent loop. Wrk is counting all these responses as a part of the results. I wrote some simple tests to ensure FaF is now behaving as expected so I'm pretty confident the issue is resolved. |
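The class of bug described above can be illustrated with a small simulation. This is not FaF's code (FaF is written in Rust and its parser is not shown in this thread); it is a minimal sketch of a server loop that forgets to advance its "already processed" offset into the read buffer. The 160-byte request size is a hypothetical value chosen so that a 1500-byte first read holds 9 complete requests, matching the description; `make_request` and `count_responses` are illustrative names.

```python
# Sketch: a pipelined-request counter with and without the offset update.
# Hypothetical request size and read split; not FaF's actual code.
TERMINATOR = b"\r\n\r\n"


def make_request(size):
    """Build a padded HTTP/1.1 request of exactly `size` bytes."""
    head = b"GET /plaintext HTTP/1.1\r\nHost: server\r\nX-Pad: "
    return head + b"x" * (size - len(head) - len(TERMINATOR)) + TERMINATOR


def count_responses(reads, buggy):
    """Count responses a server would write for the given socket reads."""
    buf = b""
    processed = 0  # offset of the first byte not yet answered
    responses = 0
    for chunk in reads:
        buf += chunk
        # The buggy variant rescans from the start of the buffer each time.
        pos = 0 if buggy else processed
        while True:
            end = buf.find(TERMINATOR, pos)
            if end == -1:
                break  # only a partial request remains
            responses += 1
            pos = end + len(TERMINATOR)
        processed = pos
    return responses


stream = make_request(160) * 16           # 16 pipelined requests
reads = [stream[:1500], stream[1500:]]    # split as by a 1500-byte segment

buggy_count = count_responses(reads, buggy=True)
fixed_count = count_responses(reads, buggy=False)
print(buggy_count, fixed_count)
```

With this split the buggy variant answers the 9 complete requests from the first read and then all 16 again after the second read, while the fixed variant answers each request once. A client that counts every response it receives would therefore report more completed requests than it actually sent.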
We need to add more tests. |
All the frameworks can check only the first char, and not the exact route. |
I am checking the full route, and have been for a very long time now. You appear to be looking at an older commit. It is pretty strange to me that you are fixated on setting process priority as this keeps popping up. It is neither unusual nor unrealistic as scheduling has an effect on performance, and this is a benchmark. Even if it wasn't a benchmark it wouldn't be unusual for any latency-sensitive application (e.g. audio, games, etc). Please refer to the following for the current commit and diff: https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/Rust/faf https://github.com/TechEmpower/FrameworkBenchmarks/pull/7701/files |
@sebastienros We now have 3 runs in which |
Absolutely, we should close the issue and put faf back on the board. Another great TechEmpower story. |
Server: HP Z6 G4 Dual Xeon Gold 5120 (TurboBoost off), Net link 10Gbps, Debian 11, Python 3.9 Client: Intel 12700K (20 threads), Net link 10Gbps, Debian 11
On server 28 cores loaded ~ 25%...30%
122 + 13 = 135 bytes Bonus: Net 25Gbps
|
And why do the TFB scripts themselves calculate the speed by dividing by a hardcoded constant? For example, what's the difference if we take the number of requests and time from my test: |
It's currently over 8M RPS, which is above the theoretical limit of the network.