
Reduce allocations for event dispatch #243

Draft · wants to merge 10 commits into `master`
Conversation

ocharles (Collaborator)

This commit continues my work on low-level optimizations for event dispatch. My plan of attack is to compile all modules with `-ddump-stg`, then methodically work from the outside in, finding all unnecessary let bindings (let bindings that I know we will always force) and trying to eliminate them. The commits here show the reduction in heap allocations for `cabal run benchmarks -- -p Boring --stdev Infinity +RTS -s`.
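As a toy illustration of the kind of rewrite this involves (not code from this PR): a lazy let binding whose result is always forced shows up in the STG output as a thunk allocation, which a bang pattern removes. With `-O`, GHC's strictness analysis often catches these itself; reading the STG is how you find the cases where it doesn't.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical example: `total` is unconditionally forced by `print`,
-- so allocating it lazily only builds a thunk that is entered
-- immediately. The bang pattern evaluates it eagerly instead, and no
-- `let`-bound thunk appears at the STG level.
reportSum :: Int -> Int -> IO Int
reportSum x y = do
    let !total = x + y
    print total
    pure total
```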

This is still WIP, but I thought I'd throw this up for now so others can see what I'm doing!

```haskell
{-----------------------------------------------------------------------------
    Type and class instances
------------------------------------------------------------------------------}
newtype ReaderWriterIOT r w m a = ReaderWriterIOT { run :: r -> IORef w -> m a }
```
ocharles (Author)

This actually just incurs more boxing than using a CPS'ed `WriterT`, because whenever we call `run` we need to box up an `IORef`.
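For contrast, a minimal sketch of the CPS'ed shape referred to above (this type is an illustration, not the library's actual definition): the writer output is threaded through a continuation, so running the computation never has to pass a boxed `IORef`.

```haskell
{-# LANGUAGE RankNTypes #-}

-- Illustrative CPS'ed reader/writer: the accumulated `w` flows through
-- the continuation rather than living behind an `IORef` argument.
newtype ReaderWriterCPST r w m a = ReaderWriterCPST
    { runCPS :: forall b. r -> w -> (a -> w -> m b) -> m b }

instance Functor (ReaderWriterCPST r w m) where
    fmap f (ReaderWriterCPST g) =
        ReaderWriterCPST $ \r w k -> g r w (\a w' -> k (f a) w')
```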

```diff
@@ -107,14 +108,14 @@ insertNodes :: RWS.Tuple BuildR (EvalPW, BuildW) Lazy.Vault -> [SomeNode] -> Que
 insertNodes (RWS.Tuple (time,_) _ _) = go
     where
     go :: [SomeNode] -> Queue SomeNode -> IO (Queue SomeNode)
-    go [] q = return q
+    go [] !q = return q
     go (node@(P p):xs) q = do
```
ocharles (Author)

This is a big reduction in allocations, perhaps the biggest in this branch. I think we could replace `insertNodes` with `foldl'` or `foldr`, which might make this a bit more obvious.
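A hedged sketch of that suggestion (the types and the `insertOne` action are stand-ins, not the PR's code): since the loop runs in `IO`, `foldl'` becomes `foldM`, and the bang pattern on the accumulator plays the same role as the `!q` added in the diff above.

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad (foldM)

-- Illustrative rewrite of a strict queue-building loop as a monadic
-- left fold; `insertOne` is a hypothetical per-element insertion.
insertAllVia :: (q -> node -> IO q) -> [node] -> q -> IO q
insertAllVia insertOne nodes q0 = foldM step q0 nodes
  where
    step !q n = insertOne q n
```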

```diff
@@ -39,7 +39,7 @@ newInput = mdo
         }
     -- Also add the alwaysP pulse to the inputs.
     let run :: a -> Step
-        run a = step ([P pulse, P always], Lazy.insert key (Just a) Lazy.empty)
+        run a n = step ([P pulse, P always], Lazy.insert key (Just a) Lazy.empty) n
```
ocharles (Author)

This eta expansion allows `run` to be a single two-parameter closure, rather than two nested closures.
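An illustrative reduction of the same shape (all names here are stand-ins for the real ones): without eta expansion, applying `runNested` to one argument yields the partial application `step inputs`, a freshly allocated closure; the eta-expanded version has arity 2, so GHC calls it with both arguments at once and skips that intermediate allocation.

```haskell
type Step = Int -> IO Int

step :: [String] -> Step
step inputs n = do
    mapM_ putStrLn inputs
    pure (n + 1)

runNested :: String -> Step
runNested a = step [a, "always"]     -- arity 1: applying it allocates a closure

runEta :: String -> Step
runEta a n = step [a, "always"] n    -- arity 2: both arguments consumed directly
```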

Comment on lines +139 to +147
```haskell
-- Recursively execute the buildLater calls.
unfold :: BuildR -> BuildW -> BuildIO a -> IO (a, BuildW)
unfold !i w m = do
    (a, BuildW (w1, w2, w3, later)) <- RW.runReaderWriterIOT m i
    let !w' = w <> BuildW (w1, w2, w3, mempty)
    w'' <- case later of
        Just m  -> snd <$> unfold i w' m
        Nothing -> return w'
    return (a, w'')
```
ocharles (Author)

I don't necessarily need to make `unfold` a top-level binding; the important thing is that it doesn't mention any variables bound by `runBuildIO`. This allows us to inline a bit of `runBuildIO`, and to float the recursion loop out to the top level. Otherwise, each call to `runBuildIO` allocates a new recursion closure.
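A contrived example of the principle (nothing here is from the PR): a local loop that mentions an enclosing variable must be allocated as a fresh closure on every outer call, whereas threading that variable as a parameter lets the loop live at the top level, shared by all callers.

```haskell
-- `goLocal` closes over `k`, so every call of `sumToLocal` allocates a
-- new recursive closure for it.
sumToLocal :: Int -> Int -> Int
sumToLocal k n = goLocal n
  where
    goLocal 0 = 0
    goLocal i = k + goLocal (i - 1)

-- `goTop` takes `k` as an argument instead, so it is one static
-- top-level function with no per-call allocation.
goTop :: Int -> Int -> Int
goTop _ 0 = 0
goTop k i = k + goTop k (i - 1)

sumToTop :: Int -> Int -> Int
sumToTop = goTop
```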

```diff
@@ -23,7 +23,7 @@ import Reactive.Banana.Prim.Util
 -- | A 'Network' represents the state of a pulse/latch network,
 data Network = Network
     { nTime :: !Time -- Current time.
-    , nOutputs :: !(OrderedBag Output) -- Remember outputs to prevent garbage collection.
+    , nOutputs :: {-# unpack #-} !(OrderedBag Output) -- Remember outputs to prevent garbage collection.
```
ocharles (Author)

Avoids actually allocating an `OrderedBag` when creating an updated `Network` after stepping; instead we can just store the underlying pointers that we have access to from some worker/wrapper transformations.
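A sketch of what the pragma buys here, with simplified stand-in types (the real `OrderedBag` representation may differ): `UNPACK` on a strict single-constructor field stores that constructor's fields inline in `Network`, so rebuilding a `Network` after a step copies the underlying pointers rather than allocating a fresh `OrderedBag` box.

```haskell
newtype Time = Time Int

-- Hypothetical representation, for illustration only.
data OrderedBag a = OrderedBag ![a] !Int

data Network = Network
    { nTime    :: !Time
    , nOutputs :: {-# UNPACK #-} !(OrderedBag String)
      -- ^ with -O, the list pointer and the Int are stored directly
      -- in the Network constructor; no separate OrderedBag heap object.
    }
```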

HeinrichApfelmus added a commit that referenced this pull request on Jan 2, 2023:
…c-evaluation

Perform evaluation steps for `Network` using `GraphGC`

### Overview

This pull request completely changes the way that dependencies between `Pulse` are tracked and used in order to perform an `Evaluation.step` for a `Network`.

We use the machinery provided by `GraphGC` to

* track dependencies between `Pulse`, using `insertEdge` and `clearPredecessors`.
* traverse the `Pulse` in dependency order with early exit, using `walkSuccessors_` (a simplified sketch follows this list).
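A generic sketch of the traversal pattern named above, under loud assumptions: the signature is invented for illustration, and where the real `GraphGC` visits vertices in dependency (topological) order, this simplified version walks successors breadth-first, visiting each vertex once; returning `False` from the visit action stops propagation past that vertex (e.g. a `Pulse` that did not fire).

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

walkSuccessorsSketch ::
    Ord v =>
    Map.Map v [v]       -- successor edges
    -> (v -> IO Bool)   -- visit action; False = do not propagate further
    -> [v]              -- roots
    -> IO ()
walkSuccessorsSketch succs visit = go Set.empty
  where
    go _ [] = pure ()
    go seen (v : vs)
        | v `Set.member` seen = go seen vs
        | otherwise = do
            continue <- visit v
            let next = if continue then Map.findWithDefault [] v succs else []
            go (Set.insert v seen) (vs ++ next)
```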

### Comments

* This should fix many remaining issues with garbage collection for `Pulse`, specifically #261 
* I think that in order to fix *all* remaining issues for `Pulse`, we may have to look at garbage collection and `Vault`.
* This pull request doesn't do anything for `Latch`.

Still, 🥳!

### Obsoletes

* #182
* sadly, #243
@mitchellwrosen (Collaborator)

No pressure of course, but it'd be great to get these memory optimizations in somehow :)

As mentioned in #268, this PR was made stale. Maybe smaller chunks would be appropriate? Or, well -- I doubt it -- it was probably just bad timing. Big rewrites of internals don't happen often.

@mitchellwrosen (Collaborator)

Huh, if the benchmarks haven't changed, things seem to have gotten substantially worse. I'm seeing 11,163,946,400 total bytes allocated by `cabal run reactive-banana:benchmark -- -p Boring --stdev Infinity +RTS -s`, compared to where Ollie started at 4,941,300,968. Is that right?

@mitchellwrosen (Collaborator)

I ran a quick `eventlog2html` and I see that the majority of allocations are attributed to STACK (the thread's stack), `[]`, and ARR_WORDS (`ByteArray#` / `MutableByteArray#`).

[Screenshot: eventlog2html heap profile by closure type, taken 2023-02-24]

@HeinrichApfelmus (Owner) commented on Feb 25, 2023

> I'm seeing 11,163,946,400 total bytes allocated by

Fixing the space leaks was worth it, I would say. 😅

Also note that this is "the total number of bytes allocated by the program over the whole run", which is not quite the same as peak memory use ("the maximum space actually used by your program is the 'bytes maximum residency' figure"). The total number of bytes allocated is highly correlated with CPU time, not so much with memory use.

@mitchellwrosen, if you want to play around with such lower-level optimizations, switching `Data.HashMap` to `Data.IntMap` or something like that may be worth investigating. I'd like to keep `Graph` polymorphic in `v`, but replacing the `Hashable v` constraint with

```haskell
class IsInteger v where
    -- | `toInteger x = toInteger y` implies `x = y`.
    toInteger :: v -> Integer
```

or something like that would be fine with me.

Hm, or maybe

```haskell
class HasUnique v where
    toUnique :: v -> Unique
```

@mitchellwrosen (Collaborator)

> @mitchellwrosen, if you want to play around with such lower-level optimizations, switching `Data.HashMap` to `Data.IntMap` or something like that may be worth investigating. I'd like to keep `Graph` polymorphic in `v`, but replacing the `Hashable v` constraint with

I think to get down to `IntMap` performance we'd need a monotonically increasing `Int` supply of unique numbers, not `Integer`. That wouldn't really work on 32-bit systems, but on 64-bit systems I think treating `Int` as effectively infinitely-sized makes good sense.
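A sketch of that idea (all names are hypothetical): a shared counter hands out monotonically increasing `Int` keys, which can then index an `IntMap` of vertices directly instead of going through `Hashable`.

```haskell
import Data.IORef (atomicModifyIORef', newIORef)

-- Create an action that returns a fresh Int on each call. On a 64-bit
-- system the counter will not realistically overflow.
newUniqueSupply :: IO (IO Int)
newUniqueSupply = do
    ref <- newIORef 0
    pure (atomicModifyIORef' ref (\n -> (n + 1, n)))
```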

@dpwiz commented on Nov 23, 2023

Can we put this under CPP or a flag and use `IntMap` on x64?
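A sketch of the CPP dispatch being suggested (the `VertexMap` alias is illustrative): the word size can be tested via `MachDeps.h`. Note the two map APIs differ in their key constraint (`Int` vs `Hashable v`), so in practice the alternatives would sit behind a small wrapper module rather than a bare import swap.

```haskell
{-# LANGUAGE CPP #-}
#include "MachDeps.h"

#if WORD_SIZE_IN_BITS >= 64
import qualified Data.IntMap.Strict as VertexMap   -- Int keys on 64-bit
#else
import qualified Data.HashMap.Strict as VertexMap  -- Hashable keys elsewhere
#endif
```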
