
[Requirements] Add that Response cache is not permitted in Fortunes test #6529

joanhey opened this issue Apr 10, 2021 · 19 comments

@joanhey
Contributor

joanhey commented Apr 10, 2021

Fortunes Requirements
It's obvious that the test is meant to exercise server-side templates, XSS countermeasures, etc., and that these need to be executed on every request rather than served from a template cache.

And:

> xxi. Use of an in-memory cache of Fortune objects or rows by the application is not permitted.

But it does not say anything about the template cache.
Almost all template systems have a cache, and it should not be allowed in these tests.

One example is the Ubiquity framework in async mode: https://github.com/phpMv/ubiquity/blob/9090fe990f49b475d68f416d9e64f6d1ae6f2580/src/Ubiquity/controllers/SimpleViewAsyncController.php#L18-L27
Other frameworks may be doing the same.

@NateBrady23
Member

I'll bring this up with everybody on Monday, but I agree with you.

Unfortunately, I can't think of a way to test for this during verification. (Always makes me think how cool it would be if we had some static code analysis.) So, it will require persistence from the community and might be hard to find.

@joanhey
Contributor Author

joanhey commented Apr 12, 2021

It isn't easy to automate this verification, but neither is verifying the rule about the cache of Fortune objects or rows.

Perhaps it would be good to add a random number or timestamp to the "Additional fortune added at request time." message, so it becomes more difficult to cache the template output.

But visually checking the results, I can't make sense of some numbers.
Can someone explain to me how it is possible that some frameworks are similar or even faster in the Fortunes test than in the single query test? 😖

Compare a single-row query with two simple numbers converted to JSON against a 12-row query plus sort, escape, and template rendering.
The DB result and the server response alone are large enough to make Fortunes slower.

Examples from last run:

| Framework | Single query | Fortunes |
| --- | --- | --- |
| actix [platform] | 637,185 | 650,925 |
| drogon-core | 646,199 | 664,945 |
| h2o | 394,903 | 406,306 |
| asp.net core [platform, pg] | 394,844 | 396,589 |
| fasthttp | 397,250 | 394,079 |
| atreugo | 397,492 | 393,905 |
| ben [kestrel] | 406,426 | 378,093 |

vs

| Framework | Single query | Fortunes |
| --- | --- | --- |
| lithium-postgres-batch | 835,290 | 650,083 |
| just-js | 675,528 | 541,889 |
| vertx-postgres | 573,312 | 342,819 |
| greenlightning | 434,648 | 320,221 |
| gemini | 322,997 | 242,373 |

@NateBrady23
Member

> Perhaps it would be good to add a random number or timestamp to the "Additional fortune added at request time." message, so it becomes more difficult to cache the template output.

I like this idea.

@joanhey
Contributor Author

joanhey commented Apr 13, 2021

If we add a random character plus a random number at the beginning, it can also break any sort cache.
E.g.: "B54321 Additional fortune added at request time."

@joanhey
Contributor Author

joanhey commented Apr 19, 2021

All of Round 20 is affected by this.

@jcheron
Contributor

jcheron commented Apr 20, 2021

I didn't feel I was violating the xxi rule in this implementation.

> xxi. Use of an in-memory cache of Fortune objects or rows by the application is not permitted.

  • The list of Fortune instances is not cached.
  • These instances are loaded from the database on every request and used.
  • Once the data is loaded, only the template engine caches the result, keyed to the specific dataset being displayed.

No current rule prohibits this practice.

And I don't know if it should be forbidden. This usage is an integral part of optimising implementations.

I'm afraid that by adding new rules, we'll forbid any kind of optimization, which would make it impossible to differentiate between frameworks using the same technologies.

Adding random data to the Fortune, why not, but this would change the context of the test in order to enforce a rule that is not currently stated.

Obviously, if this new rule was added, I would make sure that the Ubiquity implementation respected it.

PS: It would be good practice to ping the authors of the implementations when mentioning them in a new issue.
@nbrady-techempower @joanhey

@joanhey
Contributor Author

joanhey commented Apr 27, 2021

I didn't say that any framework is violating the requirements.
I showed that some frameworks are using a template cache, with Ubiquity as one example of many (it's easier to see in PHP than in C). But from the numbers it is clear which frameworks are using the cache.

So we need to clarify the requirements: is a template cache permitted or not?
Then all developers will know, and we will have a fair benchmark.

If it's permitted:

  • the numbers will be similar to the single query test
  • we are not testing the template system, only the template cache
  • we need to regulate what is permitted in the cache: HTML escaping, sorting, ...

In the Fortunes test it's easy to cache: the template always receives the same data and sends the same HTML.
But in a realistic app (backend, admin, control panel, ...) the data changes continuously.

@joanhey
Contributor Author

joanhey commented May 13, 2021

@nbrady-techempower any news?
At a minimum we need to start a discussion.

@NateBrady23
Member

@joanhey I've shared again with the team to see if I can get more thoughts on it, thanks.

@bhauer
Contributor

bhauer commented May 13, 2021

This is an interesting one. It reminds me in some ways of the conversations we had about allowing pipelined queries in database connectivity providers. After some back and forth, we ultimately decided to allow query pipelining because, as @jcheron points out above, some optimization strategies benefit real-world applications and not just results in this benchmark.

In other words, some optimizations that appear at first to violate the spirit of our tests by optimizing "to the benchmark," are in fact something of the opposite: providing a new benefit to real-world applications. Precluding such optimizations would be curiously antithetical to the project's goals, which include motivating frameworks and platforms to improve performance for the benefit of application developers.

That said, I am not personally convinced that caching the result of a template composition is providing unique and innovative value to real-world applications.

I think "template caching" means two different things:

  1. Caching the template that is used to ultimately compose an HTML response by merging with a context object. For example, a Mustache template library might cache the fortune.mustache template which contains template code such as <tr><td>{{id}}</td><td>{{message}}</td></tr>. This is permitted and in fact expected to be employed by most template libraries, as templates are typically immutable within applications.
  2. Caching the fully composed output of a composition, the result of merging a template and a context object. Presumably such a cache would be keyed by some attribute of the context object, though I am not sure what attribute that is.

It is my understanding that the latter is what we're discussing in this thread. Perhaps this would more accurately be called "Template composition caching" or even "Response caching."
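
To make the two meanings concrete, here is a rough Python sketch with hypothetical names (not any particular engine's API): the first cache holds the template itself, while the second holds the fully composed HTML keyed by the context data, which is what the thread is questioning.

```python
import html

# 1. Template caching: the parsed template itself is reused across requests;
#    the merge with the context object still runs every time. Generally
#    uncontroversial, since templates are immutable within the application.
_template_cache: dict[str, str] = {}


def get_template(path: str) -> str:
    if path not in _template_cache:
        with open(path) as f:
            _template_cache[path] = f.read()  # or a compiled form
    return _template_cache[path]


# 2. Template composition (response) caching: the fully rendered HTML is
#    stored, keyed by the context data, so sorting/escaping/merging is skipped
#    whenever the same dataset comes back.
_composition_cache: dict[tuple, str] = {}


def render_fortunes(fortunes: list[tuple[int, str]]) -> str:
    key = tuple(fortunes)
    if key in _composition_cache:
        return _composition_cache[key]
    rows = "".join(
        f"<tr><td>{fid}</td><td>{html.escape(msg)}</td></tr>" for fid, msg in fortunes
    )
    rendered = f"<table>{rows}</table>"
    _composition_cache[key] = rendered
    return rendered
```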

It's the similarity to actual "response caching" that leads me to doubt the unique and innovative value provided here. I see caching the resulting composition as nearly equivalent to caching the bulk of the response. If your application is serving static content, as implied by a scenario where "template composition caching" provides value, it would likely have a reverse proxy cache or CDN between the user and your server anyway. In our Fortunes test, the intent is for the context object, though always the same, to be a stand-in for an actually dynamic context object (hence the dynamic addition of an item and sorting).

There are subtle differences, of course. The template composition caching approach, if I understand it correctly, would require no cache invalidation if the context object ever changed. The cache lookup would be a miss, and a new composition would be generated and cached seamlessly. By comparison, caching a response at a reverse proxy will typically have some timeout after which a request will be sent to the back-end server for refreshing the cache.

When we discussed query pipelining, we also discussed the possibility of adding a results filter on this attribute since the approach was sufficiently different that it seemed reasonable people might want to be able to slice the data accordingly (e.g., show me only results from implementations that use pipelining or vice-versa). Although that didn't get implemented, it might still be useful. If such finer-grained attributes were added, template composition caching could be added as another attribute.

While I am currently leaning toward asking that we do not use template composition caching, I'd like to hear more thoughts and opinions.

@jcheron
Contributor

jcheron commented May 14, 2021

I fully agree with this point @bhauer:

> some optimizations that appear at first to violate the spirit of our tests by optimizing "to the benchmark," are in fact something of the opposite: providing a new benefit to real-world applications. Precluding such optimizations would be curiously antithetical to the project's goals, which include motivating frameworks and platforms to improve performance for the benefit of application developers.

Now, let's look at the value of this type of cache (template caching) in a real context:

  • If the data are always unchanged and the application serves static, albeit dynamically generated, content, caching at the template level does not make sense, and it is better to cache the full response, thus avoiding unnecessary querying of the database.

  • If the data changes systematically, or there is little chance of the same query being made to get the same response dataset, then again there is no point in caching, and it would be an unnecessary extra burden.

  • Now, if we are in an intermediate context: the data "may" change; note that the difference here is that it is a mere possibility, not a systematic change. There may be tens, hundreds, or thousands of requests between each change. This type of cache then becomes an interesting alternative, especially in the case where I cannot use another cache system at another level, or do not have control over the invalidation of that other cache system.

This last case, which is possible in production, is quite close to the context described in the Fortunes specification:
Fortunes data can change, which justifies systematically querying the data without using a response-level or DB-level cache; but when the data has not changed, a template-level cache can take over.

It could be argued that in this case, we are no longer testing the efficiency of the template engine rendering, but of its caching system.
To which I would reply:

  • This caching system is an integral part of the template engine
  • Some frameworks or platforms don't use a template engine and render directly => banning any caching system at the template level would give the false impression that using a template engine is always a handicap in terms of performance.

@dantti
Contributor

dantti commented Aug 9, 2021

> This caching system is an integral part of the template engine

It doesn't have to be; it can be done at the application level, and as you said, we are no longer testing the templating engine, which to me is what matters in this test.

> Some frameworks or platforms don't use a template engine and render directly => banning any caching system at the template level would give the false impression that using a template engine is always a handicap in terms of performance.

Well, the cache could still be done at the application level where rendering happens, and yes, most if not all cached cases are going to be faster than templated ones for obvious reasons. That doesn't mean templates are bad, but it might be an interesting thing to compare.

As opposed to the pipelined tests, which still test DB performance, this stops testing the templating engine or HTML rendering and instead tests a caching system, so IMO it should be forbidden.

@joanhey
Contributor Author

joanhey commented Mar 16, 2022

When will we fix this issue?
Some frameworks are caching the response, and they have a big advantage over the others.

It's only necessary to change the requirements.

And the change is easy:
have a cookie start with one random number and the "Additional fortune added at request time." message start with another.
That way we break any cache of the sort and the escaped strings.

Any framework can cache the response. But we are not testing the cache; we are testing the template, the sort, and the string escaping.
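
If the randomized fortune were adopted, a verifier could, at least in principle, detect a cached response by fetching the page twice and comparing the random token. A rough sketch, assuming the "B54321 ..." format suggested earlier and a hypothetical /fortunes URL:

```python
import re
import urllib.request

# Matches the hypothetical randomized prefix, e.g.
# "B54321 Additional fortune added at request time."
PATTERN = re.compile(r"([A-Z]\d{5}) Additional fortune added at request time\.")


def looks_like_cached_response(url: str = "http://tfb-server:8080/fortunes") -> bool:
    """Fetch the page twice; identical random prefixes suggest a cached response."""
    tokens = []
    for _ in range(2):
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8")
        match = PATTERN.search(body)
        if match is None:
            return False  # randomization not implemented; this check cannot tell
        tokens.append(match.group(1))
    return tokens[0] == tokens[1]
```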

@joanhey joanhey changed the title [Requirements] Add that template cache is not permitted in Fortunes test [Requirements] Add that Response cache is not permitted in Fortunes test Apr 17, 2022
@volyrique
Contributor

I am quite late to the party, but as the author of the h2o implementation (one of the problematic cases called out above) I would like to state categorically that while the code does perform an equivalent to point 1 from @bhauer's message (which is apparently both permitted and expected), it does not cache any responses (point 2). However, I agree that having higher throughput in the fortunes test than in the single query one is counterintuitive, and I am as surprised by the data as any of you are (the results are pretty much the same in the last run that has completed as of this writing, c768c2cb-b67a-41a5-987b-7f35ebf64368).

So, I did a quick investigation and discovered the following: The last run that contained "reasonable" results was 92383925-3ba7-40fd-88cf-19f55751f01c (single query - 470930 requests per second, fortunes - 421247). The next run, 51274292-fa20-4316-bc29-4138d6ac607e, was problematic (the values being 403863 and 415689 respectively; note that both of them regressed, though the single query test was affected significantly worse by 14.24%, while the effect on the fortunes value was moderate at 1.32%). However, if you check the history of the frameworks/C/h2o directory for each of the commit IDs associated with the runs, then you will notice that there have been no changes between the runs - the code is absolutely the same in both cases, and yes, it keeps its dependencies locked down as much as possible; refer to the Dockerfile.

While I am unable to pinpoint the cause, I can offer a theory (only for h2o, of course - I am not qualified to discuss any other framework) - there was a change in the environment (hardware and/or software update, etc.) that resulted in the regression; note that there was a gap of more than a week between the runs. I think at the time I just guessed that it had something to do with the speculative execution vulnerability mitigations that had started making their way through the software ecosystem, which was why I was pushing for an OS upgrade in #5709 (and I am glad that is still on the agenda in #7321). Now apparently both tests have the same bottleneck, but the single query one runs before any other database test, thus warming up the environment and providing a small (0.93% in the latest run) improvement to the test that executes subsequently.

@joanhey
Contributor Author

joanhey commented May 17, 2022

Thanks @volyrique
Very good investigation and explanation.

After checking rounds 18 and 19 (with the vulnerability mitigations), that's when this strange behaviour started.
Also, for some frameworks in round 19 and later, Fortunes is faster with an ORM than with a raw DB connection. Very odd.

So it is becoming more and more important to upgrade the kernel. #7321

Still, some frameworks use a response cache.

So we need to clarify the rules.

@billywhizz
Contributor

@joanhey @volyrique @nbrady-techempower @dantti @jcheron
i provided a likely answer here as to why fortunes RPS can sometimes be higher than single query. let me know what you think. it makes sense to me, especially when you look at the cpu load on the web server and the fact that the database is the likely bottleneck.

on the general point here, i totally agree template output should absolutely not be cached. the whole point of the fortunes test is to compare the performance of executing the template, with html escaping, against the database results on every request. @joanhey's suggestion of randomizing the additional row seems like a good solution to ensure this isn't allowed.

@volyrique
Contributor

@billywhizz Sorry, I don't see it, at least for h2o - prepared statements are used for both the fortunes and the single query tests, and in the latter case the only parameter is sent in binary form. Honestly, I view the results as essentially the same because the 0.93% difference I have mentioned before is not that far from a run-to-run variation in the fortunes test, for instance - 0.41% between c768c2cb-b67a-41a5-987b-7f35ebf64368 and 8366f6fe-9203-4f5a-8636-03d6aacd994c as an example. If I really wanted to dig into this, I would start by running the fortunes test first, followed by the single query one, and see what would happen, but I could not because I do not have the necessary access to the test environment (that is, run experiments as I please). Anyway, I am not bothered by such a small difference, especially since I know that with respect to database performance I have way bigger fish to fry, i.e. implement pipelining.

@billywhizz
Contributor

hi @volyrique

sorry, my explanation was poor and actually incorrect, but the issue is normal and can be explained.

if you look here: https://ajdust.github.io/tfbvis/?testrun=Citrine_started2022-01-12_d9300976-46e5-4dcd-a72b-4f14c01ef5d8&testtype=db&round=false

you can see that h2o scores 393 krps for single query and 401 krps for fortunes. but if you look at the cpu usage on the web server for the single query test it is 63% utilized - indicating that the bottleneck in this case is the database which is likely at 100% cpu load while the web server sits idle 37% of the time.
[Screenshot: web server CPU utilization during the single query test]

for fortunes, cpu usage is 88% on the web server. so, the database is still the bottleneck and is also 100% loaded for this test but web server is idle only 12% of the time.
[Screenshot: web server CPU utilization during the fortunes test]

so if you calculate the requests per second if cpu on web server could be fully utilized and db was not a bottleneck then you would get:

db = 393 / 63 * 100 = 623 krps if web server cpu could be fully utilized
fortunes = 401 / 88 * 100 = 455 krps if web server cpu could be fully utilized

so, the web server can handle more /db rps than /fortunes but isn't able to because of the load on the database.

does that make sense now?

it's actually a shortcoming in the techempower tests - it would make more sense, if we are comparing web server performance, to rank the frameworks on the number of requests per second per cpu rather than just the raw rps number.
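
A tiny sketch of that normalization in Python, using the h2o numbers quoted above (this assumes the database, not the web server, is the bottleneck in both tests):

```python
def projected_krps(measured_krps: float, web_cpu_pct: float) -> float:
    """Extrapolate throughput to 100% web-server CPU utilization."""
    return measured_krps / web_cpu_pct * 100


print(projected_krps(393, 63))  # ~623 krps for /db
print(projected_krps(401, 88))  # ~455 krps for /fortunes
```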

@volyrique
Contributor

@billywhizz I agree with your analysis - it presents a quite useful framework-independent technique to demonstrate that the results are consistent with the expectation that the single query test has higher throughput than the fortunes one, i.e. there is a bottleneck that limits them both. However, it doesn't answer the really interesting question, which is why the fortunes test is affected just a tiny bit less than the other one - TBH I haven't bothered doing a detailed statistical analysis across many runs or anything like that, but it seems that the effect is consistent and reproducible (that is, in the h2o case).
