Additional test types #133

bhauer opened this Issue Apr 11, 2013 · 50 comments

@bhauer
Contributor
bhauer commented Apr 11, 2013 edited

We plan to add new test types over time. The following is a summary of tests we have presently and those we plan to specify and implement in the future.

  1. Present: JSON Serialization, in which a trivial newly-instantiated object is serialized to JSON.
  2. Present: Single database query, in which a single random row is fetched (via the framework's ORM) from a simple database table containing 10,000 rows and then serialized to JSON.
  3. Present: Multiple database queries, which is similar to the previous test but allowing the number of random rows to be specified as a URL parameter with the results rendered as a JSON list.
  4. Present: Server-side template and collections test. This involves retrieving a small number of rows from the database, sorting within the application code (not within the database), and rendering to HTML via server-side templates. No external assets will be referenced by the templates. This is detailed in issue #134.
  5. Present: Database update test. This is a variation of test 2. A single row will be fetched via the ORM, some trivial math will be applied to the random number field of the row, and then the object will be persisted using the ORM. This is intended to exercise the ORM's ability to persist rows, so the trivial math isn't applied directly to the row using SQL (a minimal sketch follows this list). This is now detailed in issue #263.
  6. Present: Small plaintext responses. This is detailed in issue #290.
  7. Future: Caching test. Testing caching might begin with a variation of test 2 using the framework's caching capability, but we will also want to test caching results of more complex query operations. See #374. This is likely to be the next test type.
  8. Future: Server-side templates with assets. This will extend test 4 and add to-be-determined assets, at least composed of a style-sheet (CSS), but possibly also including JavaScript. Performance-wise, this likely won't differ much from test 4. However, it will be an opportunity for readers to dig into the code and observe the frameworks' variety of approaches for handling assets.
  9. Future: Compression tests. Add gzip or deflate compression to one or more tests.
  10. Future: SSL tests. Add SSL to one or more tests.
  11. Future: WebSocket enabled tests. (High concurrency is desirable here.)
  12. Future: Tests that exercise requests made to external services and therefore must go idle until the external service provides a response. (High concurrency is desirable here.)
  13. Future: JSON responses with larger workloads (complex data structure serialization).
  14. Future: Transactional update test. See #326.
  15. Future: Large plaintext responses.
  16. Future: Complex routing map test. Require a given number of routes to be present to exercise the overhead of a larger routing map/table/tree.
  17. Future: Heavy model test, involving a larger number of entity objects and classes, as suggested by @methane in comments below.
  18. Future: CSRF protection and form processing test as suggested by @michaelhixson below. Note that @wg of Wrk fame has made a special version that selects from a list of requests and might allow us to run this test with Wrk.
  19. Future: Large static response test as suggested by @weltermann17 below.
  20. Future: Static file serving, to exercise the performance of the web-server. To be clear, this test would be expected to bypass the framework where applicable and be served directly by the web server or application server, whichever is available and best suited.
  21. Future: Penetration test(s). This would require additional client-side testing tools (beyond the load generator we use today), but would validate the security of the platform and framework combination.
  22. Future: TCP-heavy test. This test would mirror the Plaintext test but eliminate both pipelining and keep-alive—each request would need to be connected via a TCP socket and disconnected. It may be the case that such a test runs into networking-layer limits in the Linux TCP stack, so we'll need to be prepared to do some tuning there.
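For illustration, here is a minimal sketch of test 5 (the database update test), assuming a Flask + SQLAlchemy stack; the route, table, and column names are placeholders, not the official requirements.

import random
from flask import Flask, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql://user:pass@localhost/hello_world'
db = SQLAlchemy(app)

class World(db.Model):
    __tablename__ = 'world'
    id = db.Column(db.Integer, primary_key=True)
    randomnumber = db.Column(db.Integer)

@app.route('/update')
def update():
    # Fetch one random row via the ORM, apply the trivial math in
    # application code (not in SQL), then persist it via the ORM.
    world = World.query.get(random.randint(1, 10000))
    world.randomnumber = (world.randomnumber % 10000) + 1
    db.session.commit()
    return jsonify(id=world.id, randomNumber=world.randomnumber)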

For the time being, we're still interested in relatively simple tests that exercise various components of the frameworks. But we're also interested in hearing your thoughts on more tests for the long term. If you have any ideas, please post them here.

@robertmeta

Any chance of testing more significant levels of concurrency? At least 1,000+, and ideally 10,000+. Concurrency seems to be exceptionally under-represented in these tests.

@bhauer
Contributor
bhauer commented Apr 12, 2013

Hi @robertmeta. Thanks for the input! Would you mind reviewing the thread on issue #49 regarding concurrency levels and perhaps adding to that conversation? It's my opinion that concurrency levels higher than what we have provided here would be useful if we were ready to benchmark high-connection, low-utilization WebSockets. But presently we are testing high-traffic traditional HTTP, where responding to requests as quickly as possible is the paramount objective.

As with anything though, I'm prepared to be proven wrong. :)

@bitemyapp
Contributor

I'm with bhauer; this isn't "how many users can we serve per server on our chat service."

@drewcrawford

Some things I would like to see in the future:

  • msgpack tests. Msgpack is rapidly becoming an alternative to JSON, particularly with non-browser clients.
  • Multi-row reads/writes, perhaps computing a mathematical function from a table or updating a hundred rows. Almost all of the requests I serve are multi-row reads or writes, and frameworks usually have some per-row overhead.
  • Making an "onward" request to another server. This tests the outbound HTTP stack.
  • A relationship (join) test, where you are using the ORM to relate two or more entities in a parent/child configuration. Frameworks take different approaches for eager vs. lazy loading; the results may be interesting (a minimal sketch follows this list). Maybe construct a loop of entities and then check it for cycles.
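For the relationship (join) idea, a minimal sketch using SQLAlchemy-style declarations; the Parent/Child entities and the eager-loading choice are purely illustrative, not a proposed spec.

from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship, sessionmaker

Base = declarative_base()

class Parent(Base):
    __tablename__ = 'parent'
    id = Column(Integer, primary_key=True)
    # One-to-many collection: eager vs. lazy loading of this is where ORMs differ the most.
    children = relationship('Child', back_populates='parent')

class Child(Base):
    __tablename__ = 'child'
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey('parent.id'))
    parent = relationship('Parent', back_populates='children')

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Eager load children in one JOINed query, versus the default lazy loading
# which issues one query per parent (the classic N+1 pattern).
parents = session.query(Parent).options(joinedload(Parent.children)).all()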
@bhauer
Contributor
bhauer commented Apr 12, 2013

Hi @drewcrawford. Thanks for the ideas!

No rush, since this is just long-term planning, but I am curious about your second idea concerning multi-row reads and writes. In my head I conceive of that as executing a single UPDATE but I am probably misunderstanding you. Would you be able to draft up some quick pseudo-code to allow me to visualize what you mean?

Your third idea was echoed by another reader, so that's got "high demand" from my perspective. :)

A test of relationships is a great idea too.

@bhauer
Contributor
bhauer commented Apr 12, 2013

A commenter on HN named Terretta suggested the following. I'm just copying this here for easy future reference.

  1. Exercising a randomized mix of reading and writing. I think you already said you were planning a CRUD test. Consider a tunable ratio here, something like 10000 R to 100 U to 10 C to 1 D (a minimal sketch of such a mix follows this list).
  2. Exercising synchronous web service (JSONP) calls in two modes: (a) to some web service that is consistently fast and low latency, say, the initial JSON example from this test suite running in servlet mode, and (b) to a web service written in the same framework as the one being tested, again using the initial JSON example. (The idea here is that many frameworks fall on their faces when confronted with latency. This is why synthetic tests are usually so poorly predictive of real world behavior -- people forget that latency causes backlogs and backlogs cause all parts of the stack to misbehave in interesting ways.)
  3. Test async ability if the framework has it, with a system call (sleep?) that takes a randomized 0 - 60 seconds to return. Would help understand when a framework is likely to blow up calling out to a credit card processor, doing server side image processing, etc.
  4. Exercising authentication (standardize on bcrypt, but only create passwords on 1 in 10K requests), authorization, and session state, if offered.
  5. Exercising any built-in support for caching, where 1 in rand(X) requests invalidates the DB query cache, 1 in rand(X) requests invalidates the WS call cache, 1 in rand(X) requests invalidates the long term async system call cache, and 1 in rand(Y) requests blows away the whole cache.
  6. For the enterprise legacy integrators, it would also be interesting to test XML (in particular, SOAP) anywhere we're testing JSON.
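For illustration, a minimal sketch of how the tunable read/update/create/delete ratio in item 1 might be driven; the weights and operation names are placeholders.

import random
from collections import Counter

# Hypothetical weights: 10000 reads : 100 updates : 10 creates : 1 delete
OPERATIONS = [('read', 10000), ('update', 100), ('create', 10), ('delete', 1)]
ops, weights = zip(*OPERATIONS)

def pick_operation():
    # Weighted random selection (random.choices requires Python 3.6+)
    return random.choices(ops, weights=weights, k=1)[0]

# Sanity check: tally the mix over one million simulated requests
print(Counter(pick_operation() for _ in range(1000000)))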
@bhauer
Contributor
bhauer commented Apr 12, 2013

Another HN commenter named kbenson suggested a goal of defining a simple blog-style application. This is ambitious but we've already passed the threshold at which we require community contributions in order to move forward (even adding one more simple test will require several pull requests from the community to see that test implemented in more than a small sampling of the frameworks).

With that in mind, I think it's a great item to have on the long-term plan. If we keep the requirements simple, it could be done.

@drewcrawford

Would you be able to draft up some quick pseudo-code to allow me to visualize what you mean?

# pseudo-code: insert 100 rows, one per loop iteration
# (insert_into_table stands in for the ORM's single-row insert)
a, b = 0, 1
for i in range(100):
    insert_into_table(a + b)
    a, b = b, a + b

or

# pseudo-code: walk every row, doing a small per-row computation in the application
not_quite_sum = 0
for i, row in enumerate(table):
    if i % 2 == 0:
        not_quite_sum += row.field
    else:
        not_quite_sum -= row.field

The key insight being

  • there's a for loop
  • each pass of the for loop operates on one row
  • the overall operation is simple, but not so simple that it's natural to do in a SQL one-liner

The interesting thing about this test is that it does reads/writes in the same connection. Whereas in the single row access case the dominating factor might be setting up the connection or acquiring it from a shared pool, here the test is about how quick the ORM bindings are once they're in place and how fast you can move memory between the DB process and the application process.

@bhauer
Contributor
bhauer commented Apr 12, 2013

@drewcrawford, thanks! I understand what you have in mind now. For some reason, I read your original statement to imply something much more complicated.

But you simply mean a test that operates over multiple rows in a single result set (in the case of reading), with per-row work occurring within the application rather than within SQL functions. Your pseudo-code illustrates the idea well.

@bhauer
Contributor
bhauer commented May 9, 2013

Note that requirements for each test type are now posted at the results web site: http://www.techempower.com/benchmarks/#section=code

@bhauer
Contributor
bhauer commented May 18, 2013

I just edited this issue to indicate the updates test is "present," and to add quick notes about the need to implement plaintext tests (both small and large payloads) and a larger work-load JSON test (something involving a complex and large data structure).

@michaelhixson
Contributor

A test that exercises form rendering, validation, and CSRF protection could be interesting. I'm pretty sure most of the full stack frameworks have utilities for those. Maybe the test would have three parts? One server-side implementation, but three sets of wrk parameters: (a) GET the form, (b) POST the form with errors, (c) POST the form successfully.
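For illustration, a minimal sketch of the single server-side implementation this could exercise, assuming Flask + Flask-WTF; the route and form fields are placeholders, not a proposed spec.

from flask import Flask, render_template_string
from flask_wtf import FlaskForm
from wtforms import StringField
from wtforms.validators import DataRequired

app = Flask(__name__)
app.config['SECRET_KEY'] = 'benchmark-only-secret'  # needed to sign CSRF tokens

class SignupForm(FlaskForm):
    name = StringField('name', validators=[DataRequired()])
    email = StringField('email', validators=[DataRequired()])

FORM_TEMPLATE = """
<form method="post">
  {{ form.csrf_token }}
  {{ form.name() }} {{ form.email() }}
  <input type="submit">
</form>
"""

@app.route('/form', methods=['GET', 'POST'])
def form_view():
    form = SignupForm()
    if form.validate_on_submit():
        # (c) a POST lands here only with a valid CSRF token and valid fields
        return 'ok'
    # (a) GET renders the form with a fresh CSRF token;
    # (b) an invalid POST falls through here with form.errors populated
    return render_template_string(FORM_TEMPLATE, form=form)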

@bhauer
Contributor
bhauer commented Jul 26, 2013

I've added test type 16, a more complex routing table/map/tree based on Christopher Lord's comment on the Google Group:

https://groups.google.com/d/msg/framework-benchmarks/r0B3tPaCMPs/_PG1_p1McbwJ

@methane
Contributor
methane commented Jul 28, 2013

Heavy model test:

  • Make 10 ActiveRecord or RowGateway classes. Instantiate one from each table.
  • Make an additional 100 classes, each with a single method. Instantiate and call them during each request.

This test may reveal the cost of class loading [1], method calls, and GC.

[1] Some languages, like PHP, load classes on each request.

@weltermann17
Contributor

I think a pretty simple additional test would be serving static content of different sizes (100 KB, 1 MB, 10 MB, 100 MB?). Frameworks that perform similarly to (maybe even better than) Apache httpd in this domain could make life a lot easier for full-blown web applications than those that serve small content extremely well but degrade significantly when content sizes get large. With our framework PLAIN, for instance, we generate dynamic content of 3D data (JT, 3DXML, CATIA) that quickly reaches sizes >100 MB. Streaming those to files and then serving them with an httpd or the like would be a big drawback in terms of performance and complexity.

@bhauer
Contributor
bhauer commented Aug 2, 2013

A note for any future SSL test: we should aim to ensure that the cipher suite uses ECDHE so that we're testing a proper production configuration with perfect forward secrecy. SSL tests will not be easy. SSL configuration on a single platform can be complicated; getting it right on several will take some effort.
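For illustration only (not a proposed requirement), a minimal sketch of pinning ECDHE cipher suites with Python's ssl module; the certificate and key paths are placeholders.

import ssl

# Server-side TLS context for a hypothetical benchmark endpoint.
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
ctx.load_cert_chain(certfile='server.crt', keyfile='server.key')
# Restrict key exchange to ECDHE so every handshake has forward secrecy.
ctx.set_ciphers('ECDHE+AESGCM')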

@bhauer
Contributor
bhauer commented Aug 2, 2013

@methane I like your "heavy model" test. Do I have it summarized correctly below?

  • Create 10 ORM-wired entity classes. During the scope of each request fetch one row from each of the 10 tables (this isn't really a database-centric test, though, so perhaps we simplify things a bit and always fetch the same row?)
  • Create another 100 classes that are not wired to the ORM but have a method that must be called. Do we require an instance of each be instantiated during each request? What sort of operation would the method run? Something trivial, yes?
@bhauer
Contributor
bhauer commented Aug 2, 2013

@weltermann17 Could you give me a little more detail about what you have in mind? I'm worried that a test of large static asset delivery would be fairly uninteresting because we'll saturate our gigabit Ethernet connectivity between the servers almost immediately (even lower down the performance ranks than we presently do with our intentionally small-payload plaintext and JSON tests).

But you mention very large dynamic responses, which could be interesting; assuming those large responses need to be computed on the fly in some manner.

Maybe I'm misunderstanding something?

@methane
Contributor
methane commented Aug 2, 2013

@bhauer

(this isn't really a database-centric test, though, so perhaps we simplify things a bit and always fetch the same row?)

Yes. It may reveal the cost of the data mapper. The "Queries" test only uses one table; the "Complex Model" test should use more tables and columns.

Do we require an instance of each be instantiated during each request?

Yes. Both instance creation and method calls should be measured.

What sort of operation would the method run? Something trivial, yes?

It should not be something the compiler can optimize away.

Sample code:

@app.route('/complex-model')
def complex_model():
    entities = []
    entities.append(Entity1.query.get(1))
    entities.append(Entity2.query.get(1))
    # ...
    entities.append(Entity10.query.get(1))

    msgs = []
    ModelClass1().method(msgs)
    ModelClass2().method(msgs)
    ModelClass3().method(msgs)
    # ...
    ModelClass10().method(msgs)
    return render_template('complex_model.tpl', entities=entities, messages=msgs)

class Entity1(Model):
    __tablename__ = 'entity1'
    id = Column(Integer, primary_key=True)
    col1 = Column(Integer)
    col2 = Column(Integer)
    col3 = Column(Integer)
    # ...
    col10 = Column(Integer)

# ... Entity10

class ModelClass1(object):
    def method(self, msgs):
        ModelClass1_1().method(msgs)
        ModelClass1_2().method(msgs)
        # ...
        ModelClass1_10().method(msgs)

# ... ModelClass10

class ModelClass1_1(object):
    def method(self, msgs):
        msgs.append("hello 1-1")

class ModelClass1_2(object):
    def method(self, msgs):
        msgs.append("hello 1-2")

# ... ModelClass10_10
@bhauer
Contributor
bhauer commented Aug 2, 2013

@methane Thanks! I like that. Implementations will involve quite a bit of copying and pasting, but that's easy enough.

I'll get this and the others mentioned in the comments added to the list above.

@weltermann17
Contributor

@bhauer
Thanks for your comment. I tested some frameworks locally serving a 50 MB and a 250 MB file. The frameworks perform quite differently: from 3.5 down to 0.3 GB/sec (on an 8-core i7 under OS X), a factor of 10. But you are absolutely right: with a 1 Gb cable between client and server, the throughput drops to 112 MB/sec for each of them. Still, the frameworks with a throughput of > 3 GB/s do a better job. If you want, I can provide more details.
Creating huge dynamic responses is, in my opinion, very domain-specific and should not be part of your test suite. Testing static content would eliminate the cost of creating it and concentrate on how well frameworks can deliver it. This kind of test would show whether a framework utilizes the available network capacity and gains from scaling up, or whether it is itself the bottleneck in the end.

@bhauer
Contributor
bhauer commented Aug 3, 2013

@weltermann17 Thanks for the follow-up. I've revised the list above to say static instead of dynamic.

I am surprised to hear there is much in the way of differentiation between the frameworks when transferring large static files on gigabit Ethernet. You say your tests saturated the network, but that the frameworks with a higher network-unlimited throughput do a better job. In what way is that better job measurable? Network-limited, aren't the rps numbers roughly equivalent?


@luan-cestari

Hi guys! =)

I was thinking of some web framework performance tests that could be added. For example, displaying a form and submitting some data through it. We could create such a test using tools like WebDriver/Selenium to browse through the HTML and submit some information, and we could measure the latency of those actions and other metrics. I think people might like this new category for frameworks like JSF. =]

Regards!

@hrj
hrj commented Oct 3, 2013
Test async ability if the framework has it, with a system call (sleep?) that takes a randomized 0 - 60 seconds to return. Would help understand when a framework is likely to blow up calling out to a credit card processor, doing server side image processing, etc.

I suggest that instead of a random sleep interval, we have a URL parameter that indicates how long to wait before returning. This way, the randomisation can be controlled in the test harness. Also, the test harness can check whether the implementation is correct (response time should be greater than the requested wait time).
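For illustration, a minimal sketch of that idea as an async handler, assuming aiohttp and a 'delay' query parameter; both are placeholders for whatever the harness would actually specify.

import asyncio
from aiohttp import web

async def delayed(request):
    # The harness controls the wait via ?delay=<seconds> and can verify
    # that the measured response time is >= the requested delay.
    delay = float(request.query.get('delay', '0'))
    await asyncio.sleep(delay)
    return web.json_response({'delayed': delay})

app = web.Application()
app.add_routes([web.get('/delayed', delayed)])

if __name__ == '__main__':
    web.run_app(app, port=8080)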

@bhauer
Contributor
bhauer commented Oct 3, 2013

@hrj Agreed. Where possible, I'd like to move control of that sort of variable to the test harness.

@bhauer
Contributor
bhauer commented Oct 9, 2013

Added tests 20 (static file serving) and 21 (penetration tests) based on feedback from Harshad RJ on the Google Group.

@kpacha
Contributor
kpacha commented Jan 22, 2014

Future: Penetration test(s). This would require additional client-side testing tools (beyond the load generator we use today), but would validate the security of the platform and framework combination.

Did you hear about Gauntlt? It's based on cucumber and the 'attacks' are just Gherkin feature descriptions (easy to read and improve).

@tgkprog
tgkprog commented Oct 17, 2014

Can you include a new test type for graph databases? It would be good with a fairly complex set of nodes and relationships. Test data could be taken from an existing small pod/social website (after scrubbing), so it's random but still kind of real-world.

@bitemyapp
Contributor

@tgkprog that's more about the databases than the languages accessing them. Lotta work too. How different could/would the numbers be from one language runtime to the next?

@hamiltont
Contributor

For what it's worth, I'd love to see a test that requires utilizing a ton of memory on the host system.

AFAIK, there is no test bottlenecking on available RAM or disk IO. Every test we run will, assuming the framework utilizes the hardware we provide, hit peak performance due to running out of CPU or network IO. I'd love to hear community thoughts on what value (if any) would come from tests that stress these other hardware components - are there common real-world examples that map to stressing these hardware metrics? For example, I think @methane's suggestion of a large model might be a test that stresses available RAM at high numbers of requests.

Future: Heavy model test, involving a larger number of entity objects and classes, as suggested by @methane in comments below.

@kbrock
Contributor
kbrock commented Nov 3, 2014

@hamiltont Not sure if it is related and/or desired, but for many platforms, memory intensive tests will potentially also stress the garbage collector.

Then the winners could be dependent on the tuning of the garbage collector (which requires a lot of skill for something like the JVM).
It would be nice to have a stance on how much you can do to tweak garbage collection.

@hamiltont
Contributor

@kbrock good point. If others are also interested in a test that stresses memory, I'd say the way to go is to ask everyone to share their thoughts on memory and/or GC optimizations that are possible (and any opinions they have on which ones should be dis/allowed). Before TFB could make a reasonable set of rules, we would need others to help us gather this knowledge.

My only substantial thought was that I would like to pass something like args.mem_in_bytes to the setup.py files - there are a few other issues that would benefit from frameworks intelligently launching with a fixed amount of memory. At the moment a number of frameworks try to allocate something like 2GB up front, which is kind of weird given that none of the tests we run use anything close to that - something like 500MB pre-allocated would already be plenty

@tgkprog
tgkprog commented Nov 9, 2014

Interested in comparing different databases with each other: graph DBs, SQL, etc. Yes, it would be a sub-project and a lot of work.


@kbrock
Contributor
kbrock commented Nov 10, 2014

@hamiltont Java is a pig. The VM (for JRuby) is so much bigger than CRuby's.
But JRuby is so mature and tunable. And if you know what you are doing, you can probably get a much more performant and memory-efficient implementation than CRuby.

Memory usage is a necessary part of understanding the equation.

This also has the side effect of showing off how to properly tune the various environments for garbage collection, or at least the levers that exist.

@kbrock
Contributor
kbrock commented Nov 10, 2014

Ruby servers tend to have metrics collection (e.g., New Relic) to monitor slow processes.
I'm not sure if other frameworks do much with monitoring performance, page counts, or other metrics so that developers can tune slow queries, find missing indexes, and the like.

Would it make sense to move some of the metrics collection into the app? Both to show how it is done, and also to reflect how you write code in the real world?

I'm not suggesting these numbers be the ones used by the tests. I'm also curious whether other frameworks tend to implement this or whether it is handled/reported mostly by Apache logs or external services.

@hamiltont
Contributor

@kbrock Few questions:

Java is a pig. The VM (for JRuby) is so much bigger than CRuby's.

I have some ability to test these (and similar) statements. Using my docker branch I can limit the real memory and swap a framework can use. Are there 2-3 frameworks you can point me to where you would expect to see vastly different performance for something like 128 vs 2048 vs 16384 MB of real RAM? Every framework I have tested at different RAM levels has either (a) launched and performed as normal or (b) refused to launch. I've not found a framework that behaves differently based on total RAM allowed, although I strongly suspect that's because most tests barely touch memory limits like 128 MB - going even crazier (8 MB, 16 MB, 32 MB) I can detect some minor differences in latency values, not in throughput values. (Seriously, openresty pulls over 150k/sec on my test boxes with 16 MB RAM. I had to double-check my setup to be sure I wasn't messing up somewhere.)

Would it make sense to move some of the metrics collection into the app?

I think that makes a lot of sense, for the reasons you listed, and to let framework teams optimize for the test machine based on feedback from the preview round. My main thought is that it should definitely be opt-in, with some new flag like --store-optimization-metrics passed to run-tests (and then down to the different framework's setup.py files via the args parameter), as presumably collecting this data would 1) cause the results folder for a full round to gain a few extra GB and 2) hurt performance in most frameworks

@kbrock
Contributor
kbrock commented Nov 15, 2014

@hamiltont Are there thoughts on changing the configuration scripts for these to move over to docker?

If it is changing (and getting easier), it would be nice to know before trying to implement a few new benchmarks

/cc @bhauer

@hamiltont
Contributor

@kbrock Keeping in mind that I can't speak for TechEmpower, I seriously doubt there will be any movement towards docker (or similar container-based mechanisms) without some substantial data proving that the overhead is both negligible and consistent across the ~120 frameworks. There's a lot of chatter about containerization because 1) @msmith-techempower has been having some tricky issues putting together the R10 preview, and containers/namespaces were one proposed solution, and 2) I'm passionate about docker and have a branch that successfully combines docker with TFB. My docker+TFB work is not destined for merge into master, though; it's just for my own experimentation.

TLDR - nope, docker's not coming to TFB anytime soon (potentially never)

@msmith-techempower
Contributor

@kbrock I would like to avoid relying on docker, but I am okay with there being another option for testing using docker (similar to how we have a vagrant development setup).

Basically, we want to keep the tests (cliche incoming) as close to the metal as possible. As @hamiltont mentioned, I am having a really hard time getting round 10 rolling because of all sorts of issues, and the approach I have opted to take keeps things that way.

What I would like to know is what docker would bring to the project that is worthwhile. It sounds (at least naively, to me) like it would bring a simple way to spin up an environment suitable for running tests directly, but we have been improving quite a bit on that front and I think that is just a minor convenience. What am I missing?

@kbrock
Contributor
kbrock commented Nov 19, 2014

@msmith-techempower Since I know docker is often touted as a solution for standardizing installations, I wanted to make sure I didn't dive in here just to find out that the installation process has changed.

Not pushing docker at all.
Thanks.

@msmith-techempower
Contributor

Again, I'm fine with docker as an option for unifying our, admittedly difficult, installation and setup (we already have a vagrant script that does this... though, I'm going to break it with the next merge).

The only thing I require is that we can install/setup/run the suite without the overhead of a virtualized or contained environment (read: bare metal).

@edsiper
Contributor
edsiper commented Apr 22, 2015

Request: when performing the plaintext test, do it in two modes: with pipelined requests and without. This will help measure the internal architecture when switching between requests and processing outgoing data.
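For illustration, a rough sketch of the difference between the two modes at the socket level; the host, port, and naive response reader are placeholders (a real load generator such as wrk handles this properly).

import socket

REQUEST = (b"GET /plaintext HTTP/1.1\r\n"
           b"Host: benchmark-server\r\n"
           b"Connection: keep-alive\r\n\r\n")

def read_responses(sock, count):
    # Naive reader for the plaintext test: count header terminators.
    data = b''
    while data.count(b'\r\n\r\n') < count:
        chunk = sock.recv(65536)
        if not chunk:
            break
        data += chunk
    return data

def without_pipelining(host, port, n):
    # One request in flight at a time: send, wait for the response, repeat.
    sock = socket.create_connection((host, port))
    for _ in range(n):
        sock.sendall(REQUEST)
        read_responses(sock, 1)
    sock.close()

def with_pipelining(host, port, n):
    # All n requests written back-to-back before reading any responses.
    sock = socket.create_connection((host, port))
    sock.sendall(REQUEST * n)
    read_responses(sock, n)
    sock.close()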

@circlespainter
Contributor

I'm with @bhauer and @hrj on the parameterized "sleep" test: it could be an easy way to test the specific strengths of async frameworks, as well as Quasar/Pulsar/Comsat ones using fibers (see #1719 and #1720), which offer greater benefits when there are more outstanding requests than OS threads a box can handle.

@Drawaes
Drawaes commented Oct 26, 2016

On SSL/TLS: considering its importance and prevalence these days, has a test or test pack been devised for this yet?

@Drawaes
Drawaes commented Nov 14, 2016

Is anyone looking at SSL/TLS? I am happy to help as much as I can, but I think with most companies really starting to push this, it is a very important test now and will be even more so in the future.

@bhauer
Contributor
bhauer commented Nov 28, 2016

I have added a "TCP heavy" test type number 22 in the list above. This is based on feedback from Daniel Nicoletti here: https://groups.google.com/d/msg/framework-benchmarks/2LRga8pkm6E/YOqNI58lAgAJ

@weltermann17
Contributor
@cjnething
Contributor

Hi @weltermann17 it looks like Scala (plain) was removed in this PR because it wasn't working and no one could find documentation or evidence of maintenance. Pinging @knewmanTE for further help/next steps.

@knewmanTE
Contributor

@weltermann17 in an effort to clean up the suite, we removed Scala/plain because it was failing our suite's tests and appeared to be unmaintained. I apologize if this wasn't the case! If you are able to get the Scala/plain tests working again, please open up a pull request to get it back in! You can find the code for our last known implementation here, and you can check it out with that commit (beb9b3e). Though, to get it merged back into the main code base, you should be testing the framework on a branch that also contains the latest changes from master.
