
Conversation

@burner (Member) commented Feb 16, 2015

Basically, Haskell QuickCheck with built-in benchmarking for performance regression testing.

IMO phobos is missing something like Quickfix (randomized test data generation). Additionally, it would be nice to use the tests as a way to measure the performance of the functions and the code generation over time.

The benchmark results are written to a file as CSV. The program benchmarkplotter.d generates .dat and .gp files for all uniquely named benchmarks. gnuplot then generates a plot of the benchmark values over time. This could then be displayed on dlang.org to show our progress and be transparent about our performance.
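For illustration, here is a minimal sketch of the kind of record such a pipeline could append per run (this is not the PR's actual CSV layout, and appendResult is a hypothetical helper):

    // Hypothetical helper, not from this PR: append one timestamped
    // benchmark result per run so a plotter can graph it over time.
    import std.datetime.systime : Clock;
    import std.stdio : File;

    void appendResult(string csvPath, string benchmarkName, long medianNsecs)
    {
        auto f = File(csvPath, "a");
        f.writefln("%s,%s,%s", Clock.currTime.toISOExtString(),
            benchmarkName, medianNsecs);
    }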

make BUILD=benchmark enables all benchmarks. For more info, see the example in std/experimental/randomized_unittest_benchmark.d

[Figure: gnuplot chart of benchmark timings tracked over time, generated by benchmarkplotter.d]

/// The following examples show an overview of the given functionalities.
unittest
{
    void theFunctionToTest(int a, float b, string c)
    {
        // super expensive operation
        auto rslt = (a + b) * c.length;

        /* Pass the result to doNotOptimizeAway so the compiler
        cannot remove the expensive operation and thereby falsify the
        benchmark.
        */
        doNotOptimizeAway(rslt);

        debug
        {
            /* As the parameters to the function assume random values,
            $(D benchmark) allows quickly testing the function with various
            input values. As verifying the computed value or state adds to
            the runtime of the function being benchmarked, it makes sense to
            execute these verifications only in debug mode.
            */
            assert(c.length ? true : true);
        }
    }

    /* $(D benchmark) will run the function $(D theFunctionToTest) as often as
    possible in 1 second. The function will be called with randomly selected
    values for its parameters.
    */
    benchmark!theFunctionToTest();
}

dub: https://github.com/burner/std.benchmark

@burner changed the title from "Parameter unittest and benchmark" to "Parameterized unittests and benchmarks" on Feb 16, 2015
@nordlow (Contributor) commented Feb 16, 2015

I've thought about implementing this. Good thing you've started it. Things I've been dreaming about:

  • Benchmarker keeps track of total time spent and stops iteration (progression of test data sizes) after a caller-specified total clock time. To be exact, this will require predicting the execution time of the next data iteration (see the next point).
  • Guess time complexity by fitting samples to a list of models, say O(n), O(n^^2), ..., O(n*log(n)). This can be used in the previous point to predict execution time for a given data size (a sketch of this idea follows after this list).
  • Multi-Dimensional Size Iterations: Multi-dimensional data structures (containers, ranges) in, for instance, linear algebra packages would need some way to tell the benchmarker about their dimensionality and how to instantiate them given a sample of these size dimensions. It would be really cool to have a benchmarker that randomly samples this space in a clever way and visualizes the result in some clever (spatial) way. I believe that would require extending the existing hierarchy of ranges in Phobos to multi-dimensional variants. That might become complicated, though. Nevertheless, https://github.com/kyllingstad/scid/tree/master/source/scid might give some inspiration. Do you see any applications of your own in this regard?
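To make the complexity-guessing idea concrete, here is a minimal sketch (the samples and names are made up, and this is not part of the PR): fit the timing samples to t ≈ c·f(n) for a few candidate models by least squares and pick the model with the smallest residual.

    // Hypothetical sketch of the complexity-guessing idea; not part of this PR.
    import std.math : log;
    import std.stdio : writefln;

    // Least-squares fit of t ≈ c * f(n); returns the sum of squared residuals.
    double residual(double function(double) f, const double[] ns, const double[] ts)
    {
        double num = 0, den = 0;
        foreach (i, n; ns) { num += ts[i] * f(n); den += f(n) * f(n); }
        immutable c = num / den;
        double r = 0;
        foreach (i, n; ns) { immutable d = ts[i] - c * f(n); r += d * d; }
        return r;
    }

    void main()
    {
        // made-up timing samples: (input size, seconds)
        double[] ns = [100, 200, 400, 800];
        double[] ts = [0.010, 0.021, 0.039, 0.082];

        static double fLin(double n)   { return n; }
        static double fNLogN(double n) { return n * log(n); }
        static double fQuad(double n)  { return n * n; }

        string[] names = ["O(n)", "O(n log n)", "O(n^^2)"];
        double function(double)[] fs = [&fLin, &fNLogN, &fQuad];

        size_t best = 0;
        foreach (i, f; fs)
            if (residual(f, ns, ts) < residual(fs[best], ns, ts))
                best = i;
        writefln("best fit: %s", names[best]);
    }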

@burner (Member, Author) commented Feb 16, 2015

  • I'm not sure about the max run time stop part (but I accept PRs :-) )
  • The time complexity stuff would be nice if we had a way to feed it back into the source (i.e., auto-generated docs).
  • What do you mean by multi-dimensional size iterators?
  • I'm planning to have another unittest target that runs all the benchmark unittests and converts the results into a gnuplot graph that we can show on the webpage.
  • Next, I will create a Unicode string generator.

@nordlow (Contributor) commented Feb 16, 2015

I updated my previous comment. I hope that makes my idea more clear.

@burner force-pushed the parameter_unittest_and_benchmark branch from 7be5538 to 6176b16 on March 16, 2015
@burner force-pushed the parameter_unittest_and_benchmark branch from b73d2b2 to 4b92fd1 on March 23, 2015
@burner force-pushed the parameter_unittest_and_benchmark branch from 216d56d to 44c435f on March 31, 2015
@burner (Member, Author) commented Apr 8, 2015

anyone?

@burner force-pushed the parameter_unittest_and_benchmark branch from 44c435f to 391a3d1 on April 9, 2015
@MartinNowak (Member) commented

I have a very complete library for this kind of random testing, built after QuickCheck.
http://code.dlang.org/packages/qcheck
Andrei recently asked me to move the arbitrary part of it to std.random or so. Now you're adding something similar and duplicating existing work (std.benchmark).
While I think this is a good addition to Phobos, we should probably go the long route and integrate the existing stuff with your code and Phobos.
I could really use some help getting arbitrary ready for Phobos.

@burner (Member, Author) commented Apr 10, 2015

IMO the long road is wrong. qcheck, as far as I can see, was not built for benchmarking, and std.benchmark was not built for testing. Neither was ever meant to work with the other, let alone in a consistent way. Therefore, I believe combining both will not fly and will be a total mess. Making the combination work would probably yield this PR. Just because these two projects exist does not mean they have to be used.

On the other hand, this PR shows a clear vision. It presents a clear track for integration all the way from Phobos to marketing D and Phobos on dlang.org. It works seamlessly with TypeTuple foreach loops. The implementation and usage are trivial and were built to fit together. Extending the random value generation to new types is easy and flexible. benchmarkplotter.d is a nice showcase for D and especially Phobos, IMO. It is probably trivial to integrate the benchmarking and record-keeping into the autotester (the BUILD target already exists). I could go on about how this PR will make all the bikeshedding about benchmarking superfluous, and how it puts pressure on PRs to not worsen performance, which leads to good marketing, and so forth.

But most importantly, this PR is ready now. Not in a week or a year. Now!

P.S. Sorry for the rant; I'm sometimes just really annoyed by the scared, half-cocked decision making that seems to govern D development, instead of bold, risk-taking, vision-driven decision making.

P.P.S. Even if this toolchain is a bust, there is no risk. It is internal and will not be exposed to the public, apart from the gnuplot graphs.

template RefType(T)
{
    alias RefType = ref T;
}
Review comment (Member):

Adding a public symbol without documentation is a no-go.

Reply (Member, Author):

Moved that, as ref is strange anywhere.

@MartinNowak (Member) commented

qcheck, as far as I can see, was not built for benchmarking, and std.benchmark was not built for testing

That's a pretty made-up argument: benchmark was made for benchmarking and qcheck was made for generating random values. Now you want to benchmark with random values, well...
Anyhow, I was just trying to take the opportunity to push an important Phobos addition (std.benchmark), and also wanted to take up the existing idea of moving arbitrary to Phobos.

@MartinNowak (Member) commented

I think you also mean QuickCheck right?

{
    immutable theChar = cast(dchar) charsToSearch[toSearchFor];
    auto idx = toSearchIn.indexOf(theChar);
    ben.stop();
Review comment (Member):

What's that ben.stop() doing here? Who is reactivating the stopwatch?

Reply (Member, Author):

The generator is reactivating the stopwatch.

@MartinNowak (Member) commented

That's a lot of ugly boilerplate code for a single benchmark.

    auto ben = Benchmark(format("string.lastIndexOf(%s,%s)", 
        S.stringof, R.stringof));

    auto generator = RndValueGen!(
        GenUnicodeString!(S, 10, 500), 
        GenUnicodeString!(R, 1, 10))
            (rnd, ben, rounds);

    foreach(S toSearchIn, R toSearchFor; generator)
    {
        auto idx = toSearchIn.lastIndexOf(toSearchFor);
        ben.stop();

        if (idx != -1) {
            assert(equal(
                toSearchIn[idx .. idx + to!S(toSearchFor).length],
                toSearchFor));
        }
    }

I'd suggest redesigning this around functions.

void testee(S, R)(S toSearchIn, R toSearchFor)
{
    auto idx = toSearchIn.lastIndexOf(toSearchFor);
    if (idx != -1) {
        assert(equal(
            toSearchIn[idx .. idx + to!S(toSearchFor).length],
            toSearchFor));
    }
}

runBench!(testee, GenUnicodeString!(S, 10, 500), GenUnicodeString!(R, 1, 10))();

You could even infer the arguments from the function parameters (runBench!(testee!(S, R))) and go with a generic config for things like low and high, though runBench should still take generators.
https://github.com/MartinNowak/qcheck/blob/master/src/qcheck/config.d

@MartinNowak (Member) commented

If you're benchmarking functions like indexOf I'd be worried about the overhead of clock_gettime called by StopWatch.stop/start().
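One common way to amortise that timer-call overhead (a sketch only, not code from this PR or Phobos; perCallNsecs is a hypothetical name, and the current std.datetime.stopwatch module is assumed) is to read the clock once per batch of calls and divide:

    // Sketch only: amortise the clock-read cost over a batch of calls.
    import std.datetime.stopwatch : AutoStart, StopWatch;

    long perCallNsecs(alias fun)(long batch = 1000)
    {
        auto sw = StopWatch(AutoStart.yes);
        foreach (_; 0 .. batch)
            fun();
        sw.stop();
        return sw.peek().total!"nsecs" / batch;
    }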

the $(D theFunctionToTest). The member function $(D stop) of
$(D Benchmark) stops the stopwatch. The stopwatch is automatically
resumed when the loop is continued. */
ben.stop();
Review comment (Member):

If you're benchmarking, then that's a release build with assertions disabled.

@MartinNowak (Member) commented

At least you need to do something to handle noisy benchmark results; the simplest effective approach is to take the best of 10 runs or so.
http://forum.dlang.org/post/jlsi14$1pi5$1@digitalmars.com
Please have a look at Andrei's benchmark module. It contains some very good concepts: https://github.com/andralex/phobos/blob/benchmark/std/benchmark.d
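For reference, a minimal best-of-N sketch along those lines (an illustration only, not part of this PR; bestOf is a hypothetical name, and the current std.datetime.stopwatch module is assumed):

    // Sketch only: run the callable several times and keep the fastest
    // measurement, which filters out most scheduler and cache noise.
    import core.time : Duration;
    import std.datetime.stopwatch : AutoStart, StopWatch;

    Duration bestOf(alias fun)(size_t runs = 10)
    {
        auto best = Duration.max;
        foreach (_; 0 .. runs)
        {
            auto sw = StopWatch(AutoStart.yes);
            fun();
            sw.stop();
            if (sw.peek() < best)
                best = sw.peek();
        }
        return best;
    }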

@burner (Member, Author) commented Apr 11, 2015

I'd be worried about the overhead of clock_gettime

I do worry about this. I have not found a good solution yet. Maybe generate data for multiple runs and then just serve it.

the forum link didn't lead me anywhere

@burner (Member, Author) commented Apr 11, 2015

runBench!(testee, GenUnicodeString!(S, 10, 500), GenUnicodeString!(R, 1, 10))();

I'm not too sure that will reduce the code size in the long run. All of that, plus testee, needs to be included in a unittest; it is missing the name of the benchmark, etc.; and you can't stop the stopwatch for the assertion.

I could collect some info with CTFE, but I rather like it obvious. I will think about this some more.

@burner (Member, Author) commented Apr 11, 2015

thank you for reviewing

@MartinNowak (Member) commented

you can't stop the stopwatch for the assertion.

No need to do that; assertions are disabled for release benchmarks anyhow.
That also solves part of the clock_gettime cost problem, if generating test data is cheap enough (or is pregenerated).

@MartinNowak (Member) commented

missing the name of the benchmark

Just take the name of the function then.
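(For reference, a sketch of how that could look, assuming the benchmark entry point takes an alias; benchmarkNamed is a hypothetical name, not part of this PR:)

    // Sketch: derive the benchmark name from the benchmarked symbol itself.
    import std.traits : fullyQualifiedName;

    void benchmarkNamed(alias fun)()
    {
        enum name = fullyQualifiedName!fun; // e.g. "std.string.lastIndexOf"
        // ... construct Benchmark(name) and run `fun` as before ...
    }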

@@ -0,0 +1,29 @@
#include <stdio.h>
Review comment (Contributor):

Is there no way we can do this in D?
Currently there is a huge movement to get rid of etc/c ...

@wilzbach (Contributor) commented

@wilzbach awesome. I will rebase to make it merge again

I am happy to push people!
Feel free to make a newsgroup announcement - I will collect the feedback then ;-)
Looking at the last announcement, there are probably a few things to bear in mind:

  • title should have [phobos-experimental] tag
  • short & concise description
  • people want to preview an addition with dub

@burner force-pushed the parameter_unittest_and_benchmark branch from 2792fa9 to cb91641 on June 18, 2016
@burner (Member, Author) commented Jun 18, 2016

@wilzbach still feel like being review manager?

@burner force-pushed the parameter_unittest_and_benchmark branch from cb91641 to 7ea67d8 on June 18, 2016
@wilzbach (Contributor) commented

@wilzbach still feel like being review manager?

Sure - as said I am more than happy to ensure that feedback will be collected ;-)
Should I make the NG announcement or do you want to describe your project yourself?

@burner (Member, Author) commented Jun 18, 2016

If you could please start the NG thread. That would be awesome.

@wilzbach (Contributor) commented

If you could please start the NG thread. That would be awesome.

I hope this announcement describes your PR correctly.

@JackStouffer (Contributor) commented

@nomad-software (Contributor) commented Jun 19, 2016

IMHO the namespace needs a little work. std.experimental.randomized_unittest_benchmark is not very scalable and ruins the namespace for future work. I would suggest std.experimental.unittest.benchmark.randomised or something close to that. Then other work could be added around those specific modules.

@JackStouffer (Contributor) commented

You should remove all global selective imports. While DMD now warns about them, LDC users who import the library manually will still experience issue 314 with no warning.
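To illustrate (a sketch with made-up names, not code from this PR): a module-scope selective import leaks the imported symbol through the importing module (issue 314), whereas scoping the import to the function that needs it avoids the problem.

    // Sketch: prefer function-local imports over module-scope selective imports.

    // problematic at module scope -- leaks `format` via this module (issue 314):
    // import std.format : format;

    string benchmarkLabel(int run) // hypothetical helper
    {
        import std.format : format; // local import, nothing leaks
        return format("run-%s", run);
    }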

else static if (is(T : GenASCIIString!(S), S...))
    alias ParameterToGen = T;
else
    static assert(false);
Review comment (Contributor):

Need an error message here

Reply (Member, Author):

I will fix that

@burner force-pushed the parameter_unittest_and_benchmark branch from 7ea67d8 to b937c27 on June 30, 2016
@andralex (Member) commented

Plopping this here to make sure: http://forum.dlang.org/post/npk8rb$13pk$1@digitalmars.com

@burner (Member, Author) commented Aug 24, 2016

@andralex IMO the current approach is superior to your suggestion.

  1. You can access the surrounding non-global scope to reach additional non-random data.
  2. This approach also lets me create a patch that makes it possible to seamlessly pass additional data as function arguments.
  3. My approach does not require a different test runner. Your approach requires inspecting all functions in all modules to see whether they are benchmarks. Not hard, but unnecessarily complex, especially when we already have the unittest runner.
  4. Aggregate.median already exists. The additional tool I created even allows you to track the performance over time (see the figure at the top).
  5. If you can have functions, you can have template functions. That allows you to use all their magic. Sure, you can do that with your approach as well, but that would require creating another function that calls the benchmarked function. And I would argue that using unittests as that function is more consistent.
  6. The binding between a parameter's value range and the parameter is clearer in my approach. (Is Aggregate.median the value of float b?)
  7. IMO your example is overusing UDAs. Sure, UDAs are awesome, but that does not mean we have to invent yet another DSL.

@andralex (Member) commented

@andralex IMO the current approach is superior to your suggestion.

A productive thing to do would be to look at other benchmark frameworks and see how yours stacks up against them. But in the end, I agree it may come down to you and me having different design sensibilities.

  1. You can access the surrounding non-global scope to reach additional non-random data.

What I wrote was but a sketch and shouldn't be taken as a complete design. Clearly more features may be added (such as fixtures, before/after code and state, etc.).

  2. This approach also lets me create a patch that makes it possible to seamlessly pass additional data as function arguments.

The same can be trivially achieved with the benchmark; you just fix Param!3(3.14). What am I missing?

  3. My approach does not require a different test runner. Your approach requires inspecting all functions in all modules to see whether they are benchmarks. Not hard, but unnecessarily complex, especially when we already have the unittest runner.

I agree that there's more work involved in introspecting the attributes. But that work will be reused across all benchmarks, making it easy to write many individual benchmarks. It's scalability nicely applied.

  4. Aggregate.median already exists. The additional tool I created even allows you to track the performance over time (see the figure at the top).

Sure, but that's not a competitive trait; I assume such a feature would be present in any framework along with a few others (min, max, average, p90 come to mind).

  5. If you can have functions, you can have template functions. That allows you to use all their magic. Sure, you can do that with your approach as well, but that would require creating another function that calls the benchmarked function. And I would argue that using unittests as that function is more consistent.

I'm not sure how you mean that. Surely attributes can be applied to template functions?

  6. The binding between a parameter's value range and the parameter is clearer in my approach. (Is Aggregate.median the value of float b?)

It is clearer because the function is a benchmark, not a function being benchmarked.

  7. IMO your example is overusing UDAs. Sure, UDAs are awesome, but that does not mean we have to invent yet another DSL.

Now this is just a fallacy: https://en.wikipedia.org/wiki/Begging_the_question

@andralex (Member) commented Aug 24, 2016

The main thing here is that an attribute-based framework benchmarks existing functions that are not necessarily written to implement a benchmark, whereas this PR defines a framework in which the user must define benchmarks. It follows that an attribute-based framework needs less code from the user.

  • This PR: "You write the benchmarks and I call them"
  • Attribute-based: "You write your usual application code and mention what values it should be benchmarked with, and I'll take care of the rest".

There is no contest here.

@burner (Member, Author) commented Aug 24, 2016

The main thing here is that an attribute-based framework benchmarks existing functions that are not
necessarily written to implement a benchmark, whereas this PR defines a framework in which the user
must define benchmarks. It follows that an attribute-based framework needs less code from the user.

I think I do not understand what you mean. I mean

int someFunctionYouWantToBenchmark(int a, int b);

unittest {
    void fun(int a, int b) {
        int r = someFunctionYouWantToBenchmark(a, b);
        // ...
    }

    benchmark!fun();

    // update
    benchmark!someFunctionYouWantToBenchmark();
    // also works ;)
}

I looked at the list on Stack Overflow, and more closely at those that looked good. This PR has all their features (some are given implicitly by D itself). Additionally, I couldn't find any library that allows user-defined types; this PR allows that.

Param!3(3.14) What am I missing?

You have two places to keep in sync, which is worse than one place. Being DRY and all. And 3 is not really clear IMO ;-)

I agree that there's more work involved with introspecting the attributes.

That is not really a problem; if it were, no library would ever be created. IMO UDAs are harder to use in the long run (I have no data, just my opinion on UDAs from what I have seen and used in D libraries).

It is clearer because the function is a benchmark, not a function being benchmarked.

I do not understand what you mean.

There is no contest here.

Well, sort of: not between us, but between our ideas. I hope you would agree that providing both approaches is not a good idea; that means we either have to choose one or invent something better.

Commits:

  • forgot a file
  • some more
  • made it compile again
  • some more
  • rebase
  • something makes some trouble
  • translate also works
  • some tutorial
  • makefile update for windows
  • started to create the program that will create the graphs
  • the .dat gets generated
  • some plotting
  • no more release and RefType docu
  • some more
  • another update
  • updates
  • just some tickering
  • rebase
  • RefType broken
  • more tinkering
  • more tinkering
  • works again
  • some work on the plotter
  • some plotter work
  • candlesticks
  • another update
  • whitespace fix
  • whitespace part 2
  • move and donotoptimizeaway
  • a restart
  • made it compile again
  • win attempt
  • andrei round 1
  • whitespace
  • win32 try
  • updates
  • whitespace
  • another makefile try
  • some update
  • small fix
  • more fixing
  • another build bug
  • a litte hack
  • dmc do not optimize away
  • fix
  • started moving string benchmark to seperate file (to remove cyclic includes)
  • some reworking
  • makefile fix
  • whitespace
  • makefile fix
  • veelo corrections
  • veelo strikes again
  • improved donotoptimize away
  • another getpid test
  • sort of builds again
  • some formatting nitpicks
  • more nipicks
  • whitespace
  • whitespace
  • redoing do not optimize away
  • doNotOptimizeAway
  • whitespace
  • format

@burner force-pushed the parameter_unittest_and_benchmark branch from b937c27 to 8841a53 on August 24, 2016
@dlang-bot (Contributor) commented

@burner, thanks for your PR! By analyzing the annotation information on this pull request, we identified @andralex, @JesseKPhillips and @9rnsr to be potential reviewers. @andralex: The PR was automatically assigned to you, please reassign it if you were identified mistakenly.

(The DLang Bot is under development. If you experience any issues, please open an issue at its repo.)

@codecov-io commented Aug 24, 2016

Codecov Report

Merging #2995 into master will increase coverage by 0.02%.
The diff coverage is 97.66%.


@@            Coverage Diff             @@
##           master    #2995      +/-   ##
==========================================
+ Coverage   88.78%   88.81%   +0.02%     
==========================================
  Files         121      122       +1     
  Lines       74161    74375     +214     
==========================================
+ Hits        65847    66057     +210     
- Misses       8314     8318       +4
Impacted Files Coverage Δ
std/string.d 99.71% <ø> (ø) ⬆️
std/experimental/randomized_unittest_benchmark.d 97.66% <97.66%> (ø)
std/concurrency.d 83.42% <0%> (+0.17%) ⬆️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update beaba6e...8841a53.

@burner (Member, Author) commented May 17, 2017

Superseded by https://github.com/burner/benchmark
