HeinrichApfelmus/optimize-monad-trans

This is a toy problem to investigate how the Glasgow Haskell Compiler can optimize a monad transformer stack.

Discussion on reddit.

Overview

Consider the generic problem of traversing some data structure using a monad. Think Data.Traversable. The traversal is usually coded as a recursive loop, like this:

go :: Structure -> Monad A 
go x = ... go x' ...

The Monad is usually a function type, for instance a state monad:

type Monad a = S -> (a, S)

The go function looks like a recursive definition of a closure / function, but that's not what we want. Instead, we want the state monad to be inlined into the definition of go, so that it becomes a tight inner loop. In other words, we are not interested in the result Monad A as a first-class value; we just want to use the monad to organize our traversal.
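
As a concrete illustration (a hypothetical example using the mtl package, not code from this repository): summing a list in the state monad, and the tight loop we hope ghc -O derives from it.

import Control.Monad.State (State, modify)

-- Monadic version: the state monad organizes the traversal.
sumM :: [Int] -> State Int ()
sumM []     = return ()
sumM (x:xs) = modify (+ x) >> sumM xs

-- The tight inner loop we want after inlining: the state becomes a
-- plain accumulator argument and no closures are allocated.
sumLoop :: [Int] -> Int -> ((), Int)
sumLoop []     s = ((), s)
sumLoop (x:xs) s = sumLoop xs (s + x)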

Example & Analysis

The module OptimizeMonadTrans implements a traversal

traverseTree :: Monad m => (a -> m [a]) -> a -> m ()

which is then applied to a tree

evaluate :: Tree -> Eval [Tree]

The monad Eval is a simple ReaderT monad transformer on top of IO. The reader monad should be the simplest case to optimize for the compiler. The goal is to have ghc -O inline the monad to the point that the traversal becomes a tight inner loop.
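
A plausible definition of such a traversal looks like this (a sketch for orientation; the repository's actual code may differ):

-- f expands a node into its children; go recurses over all of them,
-- discarding the results. mapM_ comes from the Prelude.
traverseTree :: Monad m => (a -> m [a]) -> a -> m ()
traverseTree f = go
  where
    go x = f x >>= mapM_ go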

Different implementations of the Eval monad yield different results (a sketch of both variants follows the list):

  • The module Eval1 implements a two-level ReaderT transformer. The resulting core OptimizeMonadTrans.core1.hs is not good: it allocates closures. It looks roughly like this:

      traverseExample :: [Tree] -> Eval1.Value -> ReaderT Eval1.Value IO ()
      traverseExample = \tree valueOuter ->
          case tree of _ {
              []     -> return';
              (x:xs) -> 
                  let m_Xim :: ReaderT Eval1.Value IO [Tree]
                      m_Xim = case x of _ {
                              OptimizeMonadTrans.Branch action children -> ... 
                          }
                  in
                      \(valueInner :: Eval1.Value)
                       (s :: GHC.Prim.State# GHC.Prim.RealWorld) ->
                          case m_Xim valueInner s of _
                              { ... traverseExample ... };
          }
    
      
      return' :: Eval1.Value -> GHC.Prim.State# GHC.Prim.RealWorld
              -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
      return' = \_ s -> (# s, () #)
    

    The program performs a case analysis on the list of trees after binding the outer reader value to valueOuter, but then it allocates an action m_Xim for the inner reader and builds a closure that essentially just calls m_Xim.

  • The module Eval2 uses a single-level ReaderT. The resulting core OptimizeMonadTrans.core2.hs is very good: the traversal is inlined into a tight inner loop. It looks roughly like this:

      traverseExample :: [Tree] -> (Eval2.Value,Eval2.Value)
          -> GHC.Prim.State# GHC.Prim.RealWorld
          -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
      traverseExample = \tree value s ->
          case tree of _ {
            []     -> (# s, () #);
            (x:xs) -> case x of _ {
                OptimizeMonadTrans.Branch action_aAv children_aAw -> ... ;
                ...
            }
    

    After taking both the reader value and GHC's internal state token for the IO monad as arguments, the program proceeds directly with the case analysis on the list of trees. No other closures are built, and no monadic action is shared via a let binding.
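
For reference, the two variants plausibly look like this (a sketch based on the descriptions above and the core output; Value here is a placeholder, and the actual modules may differ in detail):

import Control.Monad.Trans.Reader (ReaderT)

type Value = Int  -- placeholder; the repository defines its own Value type

-- Eval1: a two-level ReaderT stack over IO.
type Eval1 a = ReaderT Value (ReaderT Value IO) a

-- Eval2: a single ReaderT carrying both reader values as a pair.
type Eval2 a = ReaderT (Value, Value) IO a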

The first variant, Eval1, is more modular, as it uses a stack of monad transformers, while the second variant, Eval2, yields much faster code. Is it possible to add INLINE or other compiler annotations to make the first variant behave like the second?

Unfortunately, the answer is "no". The variant Eval3 reveals why. Defining a reader monad with two arguments,

newtype Eval a = E { run :: Value -> Value -> IO a }

the stack of monad transformers from variant Eval1 is equivalent to the monad bind

m >>= k  = E $ \x -> let b = run m x in \y -> b y >>= \a -> run (k a) x y

while the single argument from variant Eval2 is equivalent to

m >>= k  = E $ \x y -> run m x y >>= \a -> run (k a) x y

The difference is slight but important: in the first variant, the computation run m x is shared over several values of y, while in the second variant, the computation run m x is recomputed for every invocation with a value y. Since this computation could be very expensive, GHC rightfully refuses to inline the first variant.
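
A small self-contained sketch (with hypothetical names; none of this is code from the repository) makes the lost sharing observable with Debug.Trace, at least when compiled without optimizations:

import Debug.Trace (trace)

type Value = Int  -- placeholder; the repository defines its own Value type

newtype Eval a = E { run :: Value -> Value -> IO a }

-- A computation that performs visible work before it receives y.
m :: Eval Value
m = E $ \x -> trace "expensive work depending only on x" (\y -> return (x + y))

demo :: IO ()
demo = do
    let b = run m 1   -- the shared binding from the first bind variant
    _ <- b 10         -- the trace message is printed once here ...
    _ <- b 20         -- ... but not again here: the work was shared
    _ <- run m 1 30   -- a fresh application repeats the work
    return ()

The trace message is printed twice in total: sharing b corresponds to the first bind variant, while re-applying run m 1 corresponds to the second. Inlining the first variant would silently turn the former into the latter.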

Note that the same reasoning could be applied to the state token for the IO monad. Why doesn't GHC build closures for that? The answer is that the compiler treats the State# type constructor as a special case and doesn't refrain from recomputing IO actions. This is GHC's infamous state hack.
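
For comparison, here is a simplified sketch of IO's bind (the real definition lives in GHC.Base). The lambda over the state token s plays the same role as the lambda over y above, but thanks to the state hack GHC assumes it is called at most once and therefore prefers recomputation over sharing:

{-# LANGUAGE UnboxedTuples #-}

import GHC.Types (IO(..))

-- Simplified sketch of IO's bind. Without the state hack, GHC would
-- have to preserve sharing across the \s lambda, just as it does for
-- the \y lambda in the reader case above.
bindIO :: IO a -> (a -> IO b) -> IO b
bindIO (IO io) k = IO $ \s -> case io s of
    (# s', a #) -> case k a of IO f -> f s'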

Building

A Makefile helps with compiling. The commands

$ make core1
$ make core2
$ make core3
$ make core3s

will compile the main module using the different variants and output the corresponding GHC core to the results folder.

There are two more exotic variants as well. Building with

$ make coreImplicit

will use GHC's ImplicitParams extension to implement the reader monad transformer, while

$ make coreRefl

uses type class reflection to implement the monad transformer, both in the hope of circumventing the sharing issue by moving the reader arguments into the type system. However, as the core demonstrates, neither implementation succeeds.
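
For illustration, a reader built on implicit parameters might look like this (a minimal sketch assuming a placeholder Value type; the repository's actual code may differ):

{-# LANGUAGE ImplicitParams, RankNTypes #-}

type Value = Int  -- placeholder; the repository defines its own Value type

-- The reader environment lives in a constraint rather than in an
-- explicit function argument.
type EvalI a = (?value :: Value) => IO a

askValue :: EvalI Value
askValue = return ?value

runEvalI :: Value -> EvalI a -> IO a
runEvalI v action = let ?value = v in action

Under the hood, GHC passes an implicit parameter as a dictionary argument, so essentially the same sharing question arises as with an explicit reader argument.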

Resources

GHC documentation

  • How to read GHC Core. Reading core takes a little getting used to, but it's less painful than it looks at first glance. The main clutter is type signatures, which can be ignored for the most part.
  • INLINE and INLINEABLE pragmas. Note that INLINE inlines the original right-hand side, not an optimized one.
  • SPECIALIZE pragma. Specializes a polymorphic function to concrete types. Polymorphism doesn't seem to be the problem in our case, though.
  • RULES pragma. Rewrite rules, the big gun. It is unclear whether they apply in this situation, because everything should be solvable by inlining. However, the CONLIKE pragma seems interesting for reducing let bindings, though it only applies to rules.

General information on improving running times of Haskell programs
