Benchmarks: Dsl.scala vs Monix vs Cats Effect vs Scalaz Concurrent vs Scala Async vs Scala Continuation

杨博 (Yang Bo) edited this page Jul 20, 2018 · 26 revisions

We created some benchmarks to evaluate the computational performance of code generated by our compiler plug-in, especially, we are interesting how our !-notation and other direct style DSL affect the performance in an effect system that support both asynchronous and synchronous effects.

In spite of keywords of adapters to monads or other effect systems (see domain.cats and domain.scalaz), the preferred effect system for Dsl.scala is Task, the type alias of vanilla continuation-passing style function, defined as:

type Continuation[Domain, Value] = (Value => Domain) => Domain
type !![Domain, Value] = Continuation[Domain, Value]

type TaskDomain = TailRec[Unit] !! Throwable

// Use this `TaskDomain` as the answer type of `Task`,
// for the ability of tail call optimization and exception handling
type Task[Value] = TaskDomain !! Value

Our benchmarks measured the performance of Task in Dsl.scala, along with other combination of effect system with direct style DSL, listed in the following table:

The combination of effect system and direct style DSL being benchmarked
Effect System direct style DSL
Dsl.scala Task, an alias of vanilla continuation-passing style functions !-notation provided by Dsl.scala
Scala Future Scala Async
Scala Continuation library Scala Continuation compiler plug-in
Monix tasks for comprehension
Cats effects for comprehension
Scalaz Concurrent for comprehension

The performance of recursive tasks in each effect systems

The purpose of the first benchmark is to determine the performance of recursive functions in various effect system, especially when a direct style DSL is used.

The performance baseline

In order to measure the performance impact due to direct style DSLs, we have to measure the performance baseline of different effect systems at first. We created some benchmarks for the most efficient implementation of a sum function in each effect system. These benchmarks perform the following computation:

  • Creating a List[X[Int]] of 1000 tasks, where X is the data type of task in the effect system.
  • Performing recursive right-associated "binds" on each element to add the Int to an accumulator, and finally produce a X[Int] as a task of the sum result.
  • Running the task and blocking awaiting the result.

Note that the tasks in the list is executed in the current thread or in a thread pool. We keep each task returning a simple pure value, because we want to measure the overhead of effect systems, not the task itself.

The “bind” operation means the primitive operation of each effect system. For Monix tasks, Cats effects, Scalaz Concurrent and Scala Continuations library, the “bind” operation is flatMap; for Dsl.scala, the “bind” operation is cpsApply, which may or may not be equivalent to flatMap according to the type of the current domain.

We use the !-notation to perform the cpsApply in Dsl.scala. The !-notation results the exact same Java bytecode to manually passing a callback function to cpsApply:

// The most efficient implementation of sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0)(callback: Int => TaskDomain): TaskDomain = {
  tasks match {
    case head :: tail =>
      // The same performance as: `head.cpsApply(i => loop(tail, i + accumulator)(callback))`
      loop(tail, !head + accumulator)(callback)
    case Nil =>
      callback(accumulator)
  }
}

However, direct style DSLs for other effect systems are not used in favor of raw flatMap calls, in case of decay of the performance. The following code block shows the benchmark code for Scala Futures. The code for all the other effect systems are similar to it.

// The most efficient implementation of sum in Scala Futures
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = {
  tasks match {
    case head :: tail =>
      head.flatMap { i =>
        loop(tail, i + accumulator)
      }
    case Nil =>
      Future.successful(accumulator)
  }
}

The benchmark result is shown below (larger score is better):

executedIn size Score, ops/s
RawSum.cats thread-pool 1000 799.072 ±3.094
RawSum.cats current-thread 1000 26932.907 ±845.715
RawSum.dsl thread-pool 1000 729.947 ±4.359
RawSum.dsl current-thread 1000 31161.171 ±589.935
RawSum.future thread-pool 1000 575.403 ±3.567
RawSum.future current-thread 1000 876.377 ±8.525
RawSum.monix thread-pool 1000 743.340 ±11.314
RawSum.monix current-thread 1000 55421.452 ±251.530
RawSum.scalaContinuation thread-pool 1000 808.671 ±3.917
RawSum.scalaContinuation current-thread 1000 17391.684 ±385.138
RawSum.scalaz thread-pool 1000 722.743 ±11.234
RawSum.scalaz current-thread 1000 15895.606 ±235.992

The benchmark result of sum for performance baseline

The Task alias of continuation-passing style function used with Dsl.scala is quite fast. Dsl.scala, Monix and Cats Effects score on top 3 positions for either tasks running in the current thread or in a thread pool.

The performance impact of direct style DSLs

In this section, we will present the performance impact when different syntax notations are introduced. For vanilla CPS functions, we added one more !-notation to avoid manually passing the callback in the previous benchmark. For other effect systems, we refactored the previous sum benchmarks to use Scala Async, Scala Continuation’s @cps annotations, and for comprehension, respectively:

// Left associated sum in Dsl.scala
def loop(tasks: List[Task[Int]]): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !head + !loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Dsl.scala
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = _ {
  tasks match {
    case head :: tail =>
      !loop(tail, !head + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Future
def loop(tasks: List[Future[Int]]): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(head) + await(loop(tail))
    case Nil =>
      0
  }
}

// Right associated sum in Scala Future
def loop(tasks: List[Future[Int]], accumulator: Int = 0): Future[Int] = async {
  tasks match {
    case head :: tail =>
      await(loop(tail, await(head) + accumulator))
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable]): Int @suspendable = {
  tasks match {
    case head :: tail =>
      head() + loop(tail)
    case Nil =>
      0
  }
}

// Right associated sum in Scala Continuation
def loop(tasks: List[() => Int @suspendable], accumulator: Int = 0): Int @suspendable = {
  tasks match {
    case head :: tail =>
      loop(tail, head() + accumulator)
    case Nil =>
      accumulator
  }
}

// Left associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]]): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        accumulator <- loop(tail)
      } yield i + accumulator
    case Nil =>
      Task(0)
  }
}

// Right associated sum in Scalaz Concurrent
def loop(tasks: List[Task[Int]], accumulator: Int = 0): Task[Int] = {
  tasks match {
    case head :: tail =>
      for {
        i <- head
        r <- loop(tail, i + accumulator)
      } yield r
    case Nil =>
      Task.now(accumulator)
  }
}

Note that reduced sum can be implemented in either left-associated recursion or right-associated recursion. The above code contains benchmark for both cases. The benchmark result is shown below:

executedIn size Score, ops/s
LeftAssociatedSum.cats thread-pool 1000 707.940 ±10.497
LeftAssociatedSum.cats current-thread 1000 16165.442 ±298.072
LeftAssociatedSum.dsl thread-pool 1000 729.122 ±7.492
LeftAssociatedSum.dsl current-thread 1000 19856.493 ±386.225
LeftAssociatedSum.future thread-pool 1000 339.415 ±1.486
LeftAssociatedSum.future current-thread 1000 410.785 ±1.535
LeftAssociatedSum.monix thread-pool 1000 742.836 ±9.904
LeftAssociatedSum.monix current-thread 1000 19976.847 ±84.222
LeftAssociatedSum.scalaContinuation thread-pool 1000 657.721 ±9.453
LeftAssociatedSum.scalaContinuation current-thread 1000 15103.883 ±255.780
LeftAssociatedSum.scalaz thread-pool 1000 670.725 ±8.957
LeftAssociatedSum.scalaz current-thread 1000 5113.980 ±110.272

The benchmark result of left-associated sum in direct style DSLs

executedIn size Score, ops/s
RightAssociatedSum.cats thread-pool 1000 708.441 ±9.201
RightAssociatedSum.cats current-thread 1000 15971.331 ±315.063
RightAssociatedSum.dsl thread-pool 1000 758.152 ±4.600
RightAssociatedSum.dsl current-thread 1000 22393.280 ±677.752
RightAssociatedSum.future thread-pool 1000 338.471 ±2.188
RightAssociatedSum.future current-thread 1000 405.866 ±2.843
RightAssociatedSum.monix thread-pool 1000 736.533 ±10.856
RightAssociatedSum.monix current-thread 1000 21687.351 ±107.249
RightAssociatedSum.scalaContinuation thread-pool 1000 654.749 ±7.983
RightAssociatedSum.scalaContinuation current-thread 1000 12080.619 ±274.878
RightAssociatedSum.scalaz thread-pool 1000 676.180 ±7.705
RightAssociatedSum.scalaz current-thread 1000 7911.779 ±79.296

The benchmark result of right-associated sum in direct style DSLs

The result demonstrates that the !-notation provided by Dsl.scala is faster than all other direct style DSLs in the right-associated sum benchmark. The Dsl.scala version sum consumes a constant number of memory during the loop, because we implemented a tail-call detection in our CPS-transform compiler plug-in, and the Dsl interpreter for Task use a trampoline technique . On the other hand, the benchmark result of Monix Tasks, Cats Effects and Scalaz Concurrent posed a significant performance decay, because they costs O(n) memory due to the map call generated by for comprehension, although those effect systems also built in trampolines. In general, the performance of recursive monadic binds in a for comprehension is always underoptimized due to the inefficient map.

The performance of collection manipulation in effect systems

The previous sum benchmarks measured the performance of manually written loops, but usually we may want to use higher-ordered functions to manipulate collections. We want to know how those higher-ordered functions can be expressed in direct style DSLs, and how would the performance be affected by direct style DSLs.

In this section, we will present the benchmark result for computing the Cartesian product of lists.

The performance baseline

As we did in sum benchmarks, we created some benchmarks to maximize the performance for Cartesian product. Our benchmarks create the Cartesian product from traverseM for Scala Future, Cats Effect, Scalaz Concurrent and Monix Tasks. The following code block shows the benchmark code for Scala Future.

// Cartesian product in Scala Future, separated to two functions

import scala.concurrent.Future
import scalaz.std.list._
import scalaz.std.scalaFuture._
import scalaz.syntax.all._

def cellTask(taskX: Future[Int], taskY: Future[Int]): Future[List[Int]] = async {
  List(await(taskX), await(taskY))
}

def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  rows.traverseM { taskX =>
    columns.traverseM { taskY =>
      cellTask(taskX, taskY)
    }
  }
}

Scala Async or for comprehension is used in element-wise task cellTask, but the collection manipulation listTask is kept as manually written higher order function calls, because neither Scala Async nor for comprehension supports traverseM.

The benchmark for Dsl.scala is entirely written in !-notation:

// Cartesian product in Dsl.scala, separated to two functions

def cellTask(taskX: Task[Int], taskY: Task[Int]): Task[List[Int]] = _ {
  List(!taskX, !taskY)
}

def listTask(rows: List[Task[Int]], columns: List[Task[Int]]): Task[List[Int]] = {
  cellTask(!Each(rows), !Each(columns))
}

The Each keyword is available here because it is adaptive. Each keyword can be used in not only List[_] domain, but also (_ !! Coll[_]) domain as long as Coll is a Scala collection type that supports CanBuildFrom type class.

We didn’t benchmark Scala Continuation here because all higher ordered functions for List do not work with Scala Continuation.

The benchmark result is shown below:

executedIn size Score, ops/s
RawCartesianProduct.cats thread-pool 50 136.415 ±1.939
RawCartesianProduct.cats current-thread 50 1346.874 ±7.475
RawCartesianProduct.dsl thread-pool 50 140.098 ±2.062
RawCartesianProduct.dsl current-thread 50 1580.876 ±27.513
RawCartesianProduct.future thread-pool 50 100.340 ±1.894
RawCartesianProduct.future current-thread 50 93.678 ±1.829
RawCartesianProduct.monix thread-pool 50 142.071 ±1.299
RawCartesianProduct.monix current-thread 50 1750.869 ±18.365
RawCartesianProduct.scalaz thread-pool 50 78.588 ±0.623
RawCartesianProduct.scalaz current-thread 50 357.357 ±2.102

The benchmark result of Cartesian product for performance baseline

Monix tasks, Cats Effects and vanilla CPS functions created from Dsl.scala are still the top 3 scored effect systems.

The performance of collection manipulation in direct style DSLs

We then refactored the benchmarks to direct style DSLs. The following code blocks are DSLs for Scala Future, written in ListT monad transformer provided by Scalaz. The benchmarks for Monix tasks, Scalaz Concurrent are also rewritten in the similar style.

import _root_.scalaz.syntax.all._
import _root_.scalaz.ListT
import _root_.scalaz.std.scalaFuture._

// Cartesian product for based on `ListT` transformer, in one function
def listTask(rows: List[Future[Int]], columns: List[Future[Int]]): Future[List[Int]] = {
  for {
    taskX <- ListT(Future.successful(rows))
    taskY <- ListT(Future.successful(columns))
    x <- taskX.liftM[ListT]
    y <- taskY.liftM[ListT]
    r <- ListT(Future.successful(List(x, y)))
  } yield r
}.run

With the help of ListT monad transformer, we are able to merge cellTask and listTask into one function in a direct style for comprehension, avoiding any manual written callback functions.

We also merged cellTask and listTask in the Dsl.scala version of benchmark:

// Cartesian product in Dsl.scala, in one function
def listTask: Task[List[Int]] = reset {
  List(!(!Each(inputDslTasks)), !(!Each(inputDslTasks)))
}

This time, Cats Effects are not benchmarked due to lack of ListT in Cats. The benchmark result are shown in Table.

executedIn size Score, ops/s
CartesianProduct.dsl thread-pool 50 283.450 ±3.042
CartesianProduct.dsl current-thread 50 1884.514 ±47.792
CartesianProduct.future thread-pool 50 91.233 ±1.333
CartesianProduct.future current-thread 50 150.234 ±20.396
CartesianProduct.monix thread-pool 50 28.597 ±0.265
CartesianProduct.monix current-thread 50 120.068 ±17.676
CartesianProduct.scalaz thread-pool 50 31.110 ±0.662
CartesianProduct.scalaz current-thread 50 87.404 ±1.734

The benchmark result of Cartesian product in direct style DSLs

Despite the trivial manual lift calls in for comprehension, the monad transformer approach causes terrible computational performance in comparison to manually called traverseM. In contrast, the performance of Dsl.scala even got improved when cellTask is inlined into listTask.

Conclusion

Task in Dsl.scala is a simple vanilla continuation-passing style function, whose performance is comparable to Monix and Cats Effect.

When using a direct style DSL, our !-bang notation is the fastest implementation among for-comprehension, Scala Async, and Scala Continuation. Especially, when performing a complex task to manipulate collections, our !-notation can be 12.5 times faster than for comprehension when running in current thread, and more than 3.1 times faster than for comprehension when running in a thread pool, despite conciser syntax that we provided.

In brief, Dsl.scala provides a concise, efficient and extensible approach to perform different types of effects at once, as a replacement of monad transformers, Eff monad and AutoLifts.

(The source code of these benchmarks can be found at https://github.com/ThoughtWorksInc/Dsl.scala/blob/8e7ab03/benchmarks/src/main/scala/com/thoughtworks/dsl/benchmarks.scala)


Update: Effect system benchmarks on GraalVM

Recently, I reran those benchmarks on GraalVM. And this is the result (larger score is better):

Benchmark                               (executedIn)  (size)   Mode  Cnt       Score      Error  Units
CartesianProduct.dsl                     thread-pool      50  thrpt   25     381.073 ?    2.799  ops/s
CartesianProduct.dsl                  current-thread      50  thrpt   25    3804.440 ?   85.645  ops/s
CartesianProduct.future                  thread-pool      50  thrpt   25     429.282 ?    3.807  ops/s
CartesianProduct.future               current-thread      50  thrpt   25     485.513 ?   10.014  ops/s
CartesianProduct.monix                   thread-pool      50  thrpt   25      88.741 ?    2.505  ops/s
CartesianProduct.monix                current-thread      50  thrpt   25     485.425 ?   10.327  ops/s
CartesianProduct.scalaz                  thread-pool      50  thrpt   25      82.574 ?    2.669  ops/s
CartesianProduct.scalaz               current-thread      50  thrpt   25     405.065 ?   14.329  ops/s
LeftAssociatedSum.cats                   thread-pool    1000  thrpt   25    1202.644 ?   23.662  ops/s
LeftAssociatedSum.cats                current-thread    1000  thrpt   25   27250.554 ?  948.168  ops/s
LeftAssociatedSum.dsl                    thread-pool    1000  thrpt   25    1547.312 ?   15.828  ops/s
LeftAssociatedSum.dsl                 current-thread    1000  thrpt   25   42525.837 ? 1739.524  ops/s
LeftAssociatedSum.future                 thread-pool    1000  thrpt   25     642.385 ?   11.138  ops/s
LeftAssociatedSum.future              current-thread    1000  thrpt   25     766.983 ?   11.002  ops/s
LeftAssociatedSum.monix                  thread-pool    1000  thrpt   25     883.240 ?   15.565  ops/s
LeftAssociatedSum.monix               current-thread    1000  thrpt   25   21703.792 ?  451.721  ops/s
LeftAssociatedSum.scalaContinuation      thread-pool    1000  thrpt   25    1338.400 ?    7.799  ops/s
LeftAssociatedSum.scalaContinuation   current-thread    1000  thrpt   25   53880.878 ? 1116.539  ops/s
LeftAssociatedSum.scalaz                 thread-pool    1000  thrpt   25     847.057 ?   10.291  ops/s
LeftAssociatedSum.scalaz              current-thread    1000  thrpt   25    6222.248 ?  625.228  ops/s
RawCartesianProduct.cats                 thread-pool      50  thrpt   25     210.906 ?    2.668  ops/s
RawCartesianProduct.cats              current-thread      50  thrpt   25    2326.546 ?   32.516  ops/s
RawCartesianProduct.dsl                  thread-pool      50  thrpt   25     249.796 ?    6.692  ops/s
RawCartesianProduct.dsl               current-thread      50  thrpt   25    4505.705 ?   71.748  ops/s
RawCartesianProduct.future               thread-pool      50  thrpt   25      73.815 ?    2.765  ops/s
RawCartesianProduct.future            current-thread      50  thrpt   25      84.541 ?    2.742  ops/s
RawCartesianProduct.monix                thread-pool      50  thrpt   25     159.687 ?    3.617  ops/s
RawCartesianProduct.monix             current-thread      50  thrpt   25    2834.124 ?   64.134  ops/s
RawCartesianProduct.scalaz               thread-pool      50  thrpt   25     111.042 ?    3.088  ops/s
RawCartesianProduct.scalaz            current-thread      50  thrpt   25     686.909 ?   18.852  ops/s
RawSum.cats                              thread-pool    1000  thrpt   25    1425.520 ?   24.974  ops/s
RawSum.cats                           current-thread    1000  thrpt   25   62939.245 ?  896.369  ops/s
RawSum.dsl                               thread-pool    1000  thrpt   25    1728.368 ?   15.419  ops/s
RawSum.dsl                            current-thread    1000  thrpt   25   90190.665 ? 1347.352  ops/s
RawSum.future                            thread-pool    1000  thrpt   25    1105.385 ?   10.313  ops/s
RawSum.future                         current-thread    1000  thrpt   25    1665.301 ?   20.220  ops/s
RawSum.monix                             thread-pool    1000  thrpt   25    1054.701 ?   12.500  ops/s
RawSum.monix                          current-thread    1000  thrpt   25   64029.761 ? 1010.878  ops/s
RawSum.scalaContinuation                 thread-pool    1000  thrpt   25    1603.568 ?   25.196  ops/s
RawSum.scalaContinuation              current-thread    1000  thrpt   25  118392.309 ? 2401.650  ops/s
RawSum.scalaz                            thread-pool    1000  thrpt   25    1162.216 ?   24.588  ops/s
RawSum.scalaz                         current-thread    1000  thrpt   25   36355.420 ?  967.373  ops/s
RightAssociatedSum.cats                  thread-pool    1000  thrpt   25    1222.784 ?    7.624  ops/s
RightAssociatedSum.cats               current-thread    1000  thrpt   25   28149.466 ?  421.281  ops/s
RightAssociatedSum.dsl                   thread-pool    1000  thrpt   25    1581.496 ?   10.579  ops/s
RightAssociatedSum.dsl                current-thread    1000  thrpt   25   52651.555 ? 2381.584  ops/s
RightAssociatedSum.future                thread-pool    1000  thrpt   25     641.212 ?    4.255  ops/s
RightAssociatedSum.future             current-thread    1000  thrpt   25     786.968 ?   18.144  ops/s
RightAssociatedSum.monix                 thread-pool    1000  thrpt   25     915.910 ?   13.983  ops/s
RightAssociatedSum.monix              current-thread    1000  thrpt   25   25382.029 ?  455.675  ops/s
RightAssociatedSum.scalaContinuation     thread-pool    1000  thrpt   25    1317.374 ?   10.588  ops/s
RightAssociatedSum.scalaContinuation  current-thread    1000  thrpt   25   53417.944 ?  592.979  ops/s
RightAssociatedSum.scalaz                thread-pool    1000  thrpt   25     871.240 ?   10.063  ops/s
RightAssociatedSum.scalaz             current-thread    1000  thrpt   25   11928.049 ?  281.798  ops/s

All benchmarks got improved on GraalVM in comparison to OpenJDK. However, for those case class based effect systems, including Monix, Cats Effects and Scalaz Concurrent, the performance improvement is relatively small. While, for those CPS function based effect systems, including Scala Continuations and Dsl.scala, the performance improvement is huge.

The difference between Scala Continuations and Dsl.scala is tail call optimization. Dsl.scala internally uses the trampoline technology on !-notation; Scala Continuations performs stack-unsafe calls.

Those case classes in Monix/Cats Effects/Scalaz can improve performance for OpenJDK by avoiding megamorphic calls. However, this benchmark suggests that any case classes other than Async are unnecessary in the future, since the pattern matching on case classes actually prevents GraalVM from further optimization.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.