
Internal Cache blows up with many constrain's #460

Closed

doyougnu opened this issue Feb 28, 2019 · 36 comments

@doyougnu
Contributor

Hey Levent,

I have a computation that exists entirely in the Query monad. I recently did some profiling, and it looks like the internal cache is blowing up in size, leading to about 95% of the run time being spent in GC. So I'm looking for help understanding why this behavior occurs and how to avoid it.

The basic computation is a fold over a custom abstract syntax tree where I accumulate on SBools. I've been able to reproduce this with the following tiny program that folds a list of SBools in the query monad:

import           Control.Monad           (foldM, replicateM)
import           Data.Map                (Map, size)
import qualified Data.SBV                as S
import qualified Data.SBV.Control        as SC
import qualified Data.SBV.Internals      as SI

-- | generate an infinite list of unique strings and take n of them dropping the
-- empty string
stringList :: Int -> [String]
stringList n = tail . take (n+1) $ concatMap (flip replicateM "abc") [0..]
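-- e.g. stringList 4 == ["a","b","c","aa"]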


-- | the test runner; takes a computation that runs in the query monad and an
-- int that dictates the size of the list of SBools
test :: ([S.SBool] -> SC.Query (Map String Bool)) -> Int -> IO (Map String Bool)
test f n = S.runSMT $
           do
             prop' <- S.sBools $! stringList n
             SC.query $ f prop'


-- | I fold over the list of SBools here, constraining at each accumulation;
-- this seems to blow up the internal cache severely, leading to about 95% GC time
bad :: [S.SBool] -> SC.Query (Map String Bool)
bad prop' = do b <- foldM (helper) S.true prop'
               S.constrain b
               fmap (fmap SI.cwToBool) $ S.getModelDictionary <$> SC.getSMTResult
  -- | combine the current sbool with the accumulated sbool, constrain the
  -- two and then return the accumulated result
  where helper x acc = do let b = x S.&&& acc
                          S.constrain b
                          return b


-- | identical to the bad version but I do not constrain for each accumulation
good :: [S.SBool] -> SC.Query (Map String Bool)
good prop' = do b <- foldM (helper) S.true prop'
                S.constrain b
                fmap (fmap SI.cwToBool) $ S.getModelDictionary <$> SC.getSMTResult
  -- | this helper is equivalent to just foldr' (S.&&&)
  where helper x acc = do let b = x S.&&& acc
                          return b

main = do
  putStrLn "Running Good:\n"
  goodRes <- test good 1000

  putStrLn "Running Bad:\n"
  badRes <- test bad  1000

  -- just ensuring evaluation
  print (size goodRes)
  print (size badRes)

I just commented out the lines for each test and ran with stack bench --profile cache-test --benchmark-arguments='+RTS -hc -s -RTS'. I get the following results:

for good:

Running Good:

     278,823,496 bytes allocated in the heap
1000
      15,291,392 bytes copied during GC
Benchmark auto: FINISH
       2,165,352 bytes maximum residency (7 sample(s))
         110,480 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       165 colls,   165 par    0.064s   0.017s     0.0001s    0.0010s
  Gen  1         7 colls,     6 par    0.051s   0.014s     0.0021s    0.0042s

  Parallel GC work balance: 30.47% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time    0.332s  (  0.316s elapsed)
  GC      time    0.102s  (  0.027s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.013s  (  0.004s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time    0.452s  (  0.350s elapsed)

  Alloc rate    840,844,689 bytes per MUT second

  Productivity  73.5% of total user, 90.3% of total elapsed

gc_alloc_block_sync: 3526
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 35
Completed 2 action(s).

with heap profile:

[heap profile image: goodtest]

and for bad:

Running Bad:

1000
   2,131,924,376 bytes allocated in the heap
   2,973,614,912 bytes copied during GC
     239,523,080 bytes maximum residency (27 sample(s))
         920,312 bytes maximum slop
             472 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1497 colls,  1497 par   48.666s  12.479s     0.0083s    0.0225s
  Gen  1        27 colls,    26 par   12.980s   3.371s     0.1248s    0.3105s

  Parallel GC work balance: 82.55% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time    2.904s  (  2.630s elapsed)
  GC      time   53.022s  ( 13.638s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    8.624s  (  2.212s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   64.553s  ( 18.483s elapsed)

  Alloc rate    734,259,598 bytes per MUT second

  Productivity   4.5% of total user, 14.2% of total elapsed

gc_alloc_block_sync: 498638
whitehole_spin: 0
gen[0].sync: 2
gen[1].sync: 689287
Benchmark auto: FINISH
Completed 2 action(s).

with heap profile:

[heap profile image: badtest]

Notice the discrepancy in GC time and calculated Productivity, and the difference in the y-axis of the heap profiles. So I assume that constrain (a &&& b &&& c) is not equivalent to constrain (a &&& b) followed by constrain (a &&& b &&& c). Any suggestions for what's happening here? I've already reduced the number of constrain calls in my computation as much as possible, but I'm dealing with about 16000 variables, which easily blows up the cache.

Let me know if I can further assist in any way and thanks for the help.

@LeventErkok
Owner

Thanks for the detailed write-up! I've never looked into performance characteristics, and there might very well be a memory leak in there. Let me take some time to investigate.

@LeventErkok
Owner

Is there a reason why you can't collect all the "must-be-constrained" SBools in a Haskell data structure and then assert their conjunction at the end? I wonder if that would address the issue.

Additionally, I can imagine adding a variant of constrain that takes a whole collection of booleans and asserts all of them in one go, instead of anding them together (which costs you the construction of the tree of binary ands). That should also give some savings. But I'd first explore the idea above to see if it handles the issue.

What do you think?
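A minimal sketch of the first idea, collecting the booleans in an ordinary Haskell list and asserting their conjunction once (the helper name here is hypothetical):

import Data.SBV

-- Hypothetical helper: assert the conjunction of a whole batch of
-- booleans with a single constrain call, instead of one per step.
-- (sAnd is the SBV 8 name; older versions call it bAnd.)
constrainAll :: [SBool] -> Symbolic ()
constrainAll = constrain . sAnd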

@doyougnu
Contributor Author

doyougnu commented Mar 1, 2019

In the test example all the SBools are and'd, but in my application that is not strictly true. If it were, then I think a separate data structure that stores and then conjoins the SBools would work. Essentially I have an AST like:

data Lang a
   = Lit Bool
   | Ref a
   | Neg (Lang a)
   | Or  (Lang a) (Lang a)
   | And (Lang a) (Lang a)
   | Foo !String (Lang a) (Lang a)

The semantics of which are identical to normal propositional logic except for the Foo constructor, which describes an interaction with the assertion stack of the underlying solver. For example if I have something like this (pretty printing for readability):

myFormula :: Lang String
myFormula = a /\ (~b => (a \/ c)) /\ (Foo a e) \/ ((a => b) \/ (c => a))

Then I do a left fold (or prefix traversal) of the AST, and I end up with accumulatedSBool /\ (Foo a e) \/ ((a => b) \/ (c => a)). When I see a Foo I push onto the assertion stack, combine the accumulatedSBool with Foo's left child, and recur to get a model representing the left child; I then pop and repeat for the right side. For example, with myFormula the recursion into the left side would become accumulatedSBool /\ a \/ ((a => b) \/ (c => a)), which would then be converted into a single SBool via recursion. At the end of the recursive call it would be constrain'd and used to get a model, and then the stack would be popped to reset the assertion stack to its state before the evaluation of the left side of Foo. So a Foo describes a point where I manipulate the assertion stack directly to solve the two related formulas represented by Foo's children. This is essentially why I need to keep a handle on the accumulated SBool: so that I can continue the recursion into a child of Foo under any binary logical connective, such as implication, and/or, etc.

This leads to 2^|Foo| constraints, because I only constrain when I evaluate a Foo; other than that I just accumulate the SBools. I was under the impression that constraining an SBool adds the assertions represented by that SBool onto the assertion stack. If that is the case, then I am already accumulating as much as possible and constraining as minimally as possible. Furthermore, if that is true, then I would want to constrain as much as possible to keep the assertions on the assertion stack small and fine-grained, so that when I get to a Foo constructor I can reuse the state of the assertion stack as much as possible.

Hopefully that makes sense. I can mock up a more worked example if necessary.
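For concreteness, the push/pop pattern just described might be sketched as follows (a rough sketch under the assumptions above; the helper and its arguments are hypothetical):

import qualified Data.SBV         as S
import qualified Data.SBV.Control as SC

-- Hypothetical sketch: at a Foo node, solve each child against the
-- accumulated SBool in its own solver frame, popping in between to
-- restore the assertion stack.
onFoo :: S.SBool -> S.SBool -> S.SBool -> SC.Query ()
onFoo acc leftChild rightChild = do
  SC.push 1
  S.constrain (acc S.&&& leftChild)
  _ <- SC.checkSat              -- get a model for the left side
  SC.pop 1                      -- reset the assertion stack
  SC.push 1
  S.constrain (acc S.&&& rightChild)
  _ <- SC.checkSat              -- and one for the right side
  SC.pop 1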

@LeventErkok
Owner

Yes; something that I can run an experiment with would be awesome!

Do you know if this is also an issue when you don't use query features at all? Looks like your use case does call for queries, but I'm curious if the culprit is in the query-layer or if this sort of bad behavior can be triggered from regular Symbolic code.

@doyougnu
Contributor Author

doyougnu commented Mar 1, 2019

Interesting: I observe the same behavior with only Symbolic code:

-- | the test runner; this time f returns a predicate and we have no query
-- operations whatsoever
testS :: ([S.SBool] -> S.Predicate) -> Int -> IO S.SatResult
testS f n = S.sat $ (S.sBools $! stringList n) >>= f


-- | I fold over the string of SBools here constraining at each accumulation,
-- this seems to blow up the internal cache severely leading to about 95% GC
-- time
badS :: [S.SBool] -> S.Predicate
badS prop' = do b <- foldM (helper) S.true prop'
                S.constrain b
                return b
  -- | combine the current sbool with the accumulated sbool, constrain the
  -- two and then return the accumulated result
  where helper x acc = do let b = x S.&&& acc
                          S.constrain b
                          return b

-- | identical to the bad version but I do not constrain for each accumulation
goodS :: [S.SBool] -> S.Predicate
goodS prop' = do b <- foldM (helper) S.true prop'
                 S.constrain b
                 return b
  -- | this helper is equivalent to just foldr' (S.&&&)
  where helper x acc = do let b = x S.&&& acc
                          return b

main = do
  -- putStrLn "Running Good:\n"
  -- goodRes <- testS goodS 1000

  putStrLn "Running Bad:\n"
  badRes <- testS badS 1000

  -- just ensuring evaluation
  -- print (length $ show goodRes)
  print (length $ show badRes)

Good test:

Running Good:

15473
     280,648,024 bytes allocated in the heap
      15,749,208 bytes copied during GC
       2,972,112 bytes maximum residency (6 sample(s))
         173,616 bytes maximum slop
              11 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       167 colls,   167 par    0.088s   0.025s     0.0002s    0.0027s
  Gen  1         6 colls,     5 par    0.083s   0.027s     0.0046s    0.0096s

  Parallel GC work balance: 18.75% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.005s  (  0.009s elapsed)
  MUT     time    0.396s  (  0.427s elapsed)
  GC      time    0.151s  (  0.046s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.020s  (  0.006s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time    0.573s  (  0.489s elapsed)

  Alloc rate    708,712,104 bytes per MUT second

  Productivity  69.3% of total user, 87.5% of total elapsed

gc_alloc_block_sync: 2217
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 28
Benchmark auto: FINISH
Completed 2 action(s).

with heap profile:

[heap profile image: goodsymbolic]

Bad Test:

Running Bad:

24019
   2,151,236,224 bytes allocated in the heap
   3,536,806,008 bytes copied during GC
     245,950,352 bytes maximum residency (28 sample(s))
       1,026,160 bytes maximum slop
             488 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1890 colls,  1890 par   61.774s  17.736s     0.0094s    0.0435s
  Gen  1        28 colls,    27 par   15.184s   4.362s     0.1558s    0.3335s

  Parallel GC work balance: 77.98% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.004s elapsed)
  MUT     time    3.664s  (  2.996s elapsed)
  GC      time   67.154s  ( 19.401s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    9.804s  (  2.698s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   80.626s  ( 25.099s elapsed)

  Alloc rate    587,177,866 bytes per MUT second

  Productivity   4.5% of total user, 11.9% of total elapsed

gc_alloc_block_sync: 1812203
whitehole_spin: 0
gen[0].sync: 13
gen[1].sync: 863036
Benchmark auto: FINISH
Completed 2 action(s).

with heap profile:

[heap profile image: badsymbolic]

@LeventErkok
Owner

It's actually good to know that 'Symbolic' has the same issue.

Currently we store constraints in an IORef that looks like this:

 rConstraints :: IORef [(Bool, [(String, String)], SV)]

That's a single list that gets updated all the time. We really need to replace it with a much better data structure. It may not help solve the issue you're facing, but replacing it with something like a sequence would definitely help.

I'll see if I can code this up sometime tomorrow; shouldn't be that hard. Hopefully it'll fix the issue, but if not, it's an improvement that won't be a wasted effort. I should have something you can try out sometime tomorrow.
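For reference, the kind of change being suggested looks roughly like this (a minimal sketch; the type and helper names are hypothetical, not SBV's actual internals):

import           Data.IORef    (IORef, modifyIORef')
import           Data.Sequence (Seq, (|>))

-- Hypothetical stand-in for the stored constraint payload.
type Constraint = (Bool, [(String, String)], Int)

-- Appending to a Seq is O(1); keeping insertion order with a plain list
-- forces either an O(n) append or a reverse at the end.
addConstraint :: IORef (Seq Constraint) -> Constraint -> IO ()
addConstraint ref c = modifyIORef' ref (|> c)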

LeventErkok added a commit that referenced this issue Mar 3, 2019
@LeventErkok
Owner

@doyougnu

I just pushed in some changes to improve how constraints are stored internally. I'm not sure if it'll fix the issue, but hopefully, it won't make it worse! Can you build from master and give it a shot? I'm curious if it'll make any difference. We can then investigate further.

@doyougnu
Contributor Author

doyougnu commented Mar 3, 2019

> @doyougnu
>
> I just pushed in some changes to improve how constraints are stored internally. I'm not sure if it'll fix the issue, but hopefully, it won't make it worse! Can you build from master and give it a shot? I'm curious if it'll make any difference. We can then investigate further.

Will do!

@LeventErkok
Owner

Thanks! I very much doubt this'll solve the problem, as the issue is really with constructing those constraints in the first place, which isn't addressed by this change.

But performance is hard to predict, maybe we'll be pleasantly surprised!

@doyougnu
Contributor Author

doyougnu commented Mar 3, 2019

Unfortunately your suspicions were correct: this had no impact on the issue whatsoever. Note that I upgraded to SBV 8.0.5 from 7.13 to build from master:

-- | the test runner; this time f returns a predicate and we have no query
-- operations whatsoever
testS :: ([S.SBool] -> S.Predicate) -> Int -> IO S.SatResult
testS f n = S.sat $ (S.sBools $! stringList n) >>= f


-- | I fold over the string of SBools here constraining at each accumulation,
-- this seems to blow up the internal cache severely leading to about 95% GC
-- time
badS :: [S.SBool] -> S.Predicate
badS prop' = do b <- foldM (helper) S.sTrue prop'
                S.constrain b
                return b
  -- | combine the current sbool with the accumulated sbool, constrain the
  -- two and then return the accumulated result
  where helper x acc = do let b = x S..&& acc
                          S.constrain b
                          return b

main = do
  putStrLn "Running Bad:\n"
  badRes <- testS badS 1000

  -- just ensuring evaluation
  writeFile "badres" (show badRes)

and running:

Running Bad:

   2,139,086,472 bytes allocated in the heap
   3,984,221,968 bytes copied during GC
     244,111,848 bytes maximum residency (30 sample(s))
         969,408 bytes maximum slop
             484 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1871 colls,  1871 par   66.052s  19.048s     0.0102s    0.0496s
  Gen  1        30 colls,    29 par   18.475s   5.342s     0.1781s    0.3477s

  Parallel GC work balance: 82.21% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.002s elapsed)
  MUT     time    3.986s  (  3.311s elapsed)
  GC      time   72.316s  ( 20.981s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time   12.210s  (  3.409s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time   88.516s  ( 27.704s elapsed)

  Alloc rate    536,708,535 bytes per MUT second

  Productivity   4.5% of total user, 12.0% of total elapsed

gc_alloc_block_sync: 1478284
whitehole_spin: 0
gen[0].sync: 13
gen[1].sync: 1315257
Benchmark auto: FINISH

Here are some extra heap profiles.

By type:

[heap profile image: badtestbytype]

By data constructor:

[heap profile image: badtestbydc]

By cost center, with longer descriptions:

[heap profile image: badtestlongdesc]

What exactly goes on under the hood when constrain is called? I may have some time to hack on this tonight. This doesn't seem to me to be a space leak, but rather a combinatorial explosion from storing combinations of the SBool accumulator. I'm unsure if this is really a problem or a bug in sbv, as I seem to be using it outside of its design space.

@LeventErkok
Owner

In the end, it may not be something we can fix easily; but I'd like to understand it better. Right now I don't see any smoking guns.

Would it be possible for you to create a github repo and scripts etc? I'd like to clone it so I can run the profiler myself to experiment. I'm hoping that's not a lot of work for you!

@doyougnu
Contributor Author

doyougnu commented Mar 3, 2019

Sure, I can set this up tonight. Do you prefer cabal over stack? As for scripts, I'll make a few to automatically build and execute with the profiling I've shown here. Is that sufficient?

@LeventErkok
Owner

stack is just fine; cabal is OK too; whichever you already have. A "doit" script that spits out these graphs would be fantastic!

@LeventErkok
Owner

Hi! I got some local stuff set up so I can get the profiling data myself; no set-up necessary. I'll dig into this and hopefully at least understand where the memory is going. Thanks!

@doyougnu
Contributor Author

doyougnu commented Mar 4, 2019

The repo is up here; let me know if you have any issues or questions, and thanks for the help.

@LeventErkok
Owner

Thanks! These'll definitely come in handy for future performance studies as well.

@LeventErkok
Owner

@doyougnu

I spent some time looking at this; unfortunately, the constrain call induces a lot of extra work that is all but wasted in this particular case, as it ends up doing nothing. (It checks if the constraint is true, reduces all those booleans, checks if it was already in the cache, etc.) With many such constraints it adds up! I'm a bit surprised that it's this bad, though, and I want to explore a bit more to see if we can make it at least a little better.
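In pseudo-Haskell, the per-call work being described is roughly this (a sketch only; all names below are hypothetical stand-ins, not SBV's actual internals):

import Control.Monad (unless)

-- Toy model of what each constrain call pays for: reduce the boolean
-- expression (consulting the cache), then record it unless it turned
-- out to be trivially true.
data Node = TrueNode | Node Int deriving Eq

constrainSketch :: (expr -> IO Node)  -- reduce, consulting the cache
                -> (Node -> IO ())    -- record the constraint
                -> expr
                -> IO ()
constrainSketch reduce record e = do
  n <- reduce e
  unless (n == TrueNode) (record n)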

@LeventErkok
Owner

@doyougnu

I've managed to confuse myself. I decided to simplify the program a bit, and got this:

import Control.Monad
import Data.SBV

bad = sat $ do bs <- mapM (const free_) [1..1000]
               foldM go sTrue bs
  where go x acc = let s = x .&& acc
                   in do constrain s
                         return s

good = sat $ do bs <- mapM (const free_) [1..1000]
                foldM go sTrue bs
  where go x acc = return $ x .&& acc

main = do print =<< bad
          print =<< good

So far as I can tell, this is essentially the same as your program. But when I compile and run this, both bad and good run quickly. Can you spot what the difference is between this and your original?

As a matter of fact, I also compiled it, and got the same result. Something really fishy is going on here!

@doyougnu
Contributor Author

doyougnu commented Mar 5, 2019

@LeventErkok

I was unable to reproduce this. I've set up a benchmark, poorly named newBad, that just runs the code you posted above. Could you please pull from the test repo and run the newBadByCostCenter.sh script? Here is my output:

scripts git:(master) ✗ ./newBadByCostCenter.sh
sbv460-0.1.0.0: unregistering (local file changes: bench/LeventTest/Main.hs package.yaml)
sbv460-0.1.0.0: build (lib + bench)
Preprocessing benchmark 'newBad' for sbv460-0.1.0.0..
Building benchmark 'newBad' for sbv460-0.1.0.0..
[3 of 3] Compiling Main             ( bench/LeventTest/Main.hs, .stack-work/dist/x86_64-linux-nix/Cabal-2.4.0.1/build/newBad/newBad-tmp/Main.p_o )
Linking .stack-work/dist/x86_64-linux-nix/Cabal-2.4.0.1/build/newBad/newBad ...
Preprocessing library for sbv460-0.1.0.0..
Building library for sbv460-0.1.0.0..
sbv460-0.1.0.0: copy/register
Installing library in /home/doyougnu/Programming/sbv460/.stack-work/install/x86_64-linux-nix/lts-13.10/8.6.3/lib/x86_64-linux-ghc-8.6.3/sbv460-0.1.0.0-3vKHhoTMprbLVCKH2e5y8G
Installing executable sbv460-profile in /home/doyougnu/Programming/sbv460/.stack-work/install/x86_64-linux-nix/lts-13.10/8.6.3/bin
Registering library for sbv460-0.1.0.0..
sbv460-0.1.0.0: benchmarks
Running 1 benchmarks...
Benchmark newBad: RUNNING...
22019
22019
   2,385,670,880 bytes allocated in the heap
   3,626,356,240 bytes copied during GC
     244,700,152 bytes maximum residency (31 sample(s))
         949,256 bytes maximum slop
             233 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      2231 colls,     0 par   16.019s  15.826s     0.0071s    0.0137s
  Gen  1        31 colls,     0 par    3.579s   3.550s     0.1145s    0.2612s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.938s  (  1.004s elapsed)
  GC      time   18.098s  ( 17.884s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    1.500s  (  1.493s elapsed)
  EXIT    time    1.500s  (  1.494s elapsed)
  Total   time   22.038s  ( 21.875s elapsed)

  Alloc rate    2,542,367,658 bytes per MUT second

  Productivity   4.3% of total user, 4.6% of total elapsed

Benchmark newBad: FINISH
Completed 2 action(s).
running hp2ps to generate plot...
all done, open plot with your favorite image program!

The only alteration I made to the program you posted above was printing the string length of the SatResults. This makes it easier to see the GC statistics printed to stdout; however, I observe the same behavior printing the SatResult directly. I'm unsure if it'll matter, but here are the versions in my environment:

➜  scripts git:(master) z3 --version
Z3 version 4.7.1 - 64 bit
➜  scripts git:(master) stack --version
Version 1.7.1 x86_64

The test repo should pull down lts-13.10 from Stackage, which is GHC 8.6.3.

@LeventErkok
Owner

LeventErkok commented Mar 6, 2019

This is so bizarre. I did the following in a brand new directory; no stack, no nothing:

$ uname -a
Darwin istanbul 17.7.0 Darwin Kernel Version 17.7.0: Thu Dec 20 21:47:19 PST 2018; root:xnu-4570.71.22~1/RELEASE_X86_64 x86_64
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.3
$ ls -l
total 8
-rw-r--r--  1 LeventErkok  staff  744 Mar  5 16:57 Main.hs
$ cat Main.hs
import Control.Monad
import Data.SBV

-- program taken from
-- https://github.com/LeventErkok/sbv/issues/460#issuecomment-469590832

bad = sat $ do bs <- mapM (const free_) [1..1000]
               foldM go sTrue bs
  where go x acc = let s = x .&& acc
                   in do constrain s
                         return s

good = sat $ do bs <- mapM (const free_) [1..1000]
                foldM go sTrue bs
  where go x acc = return $ x .&& acc

-- my only modification is printing length of the result instead of the result
-- itself, this makes it easier to read the GC output generated by passing "-s"
-- to the benchmark args. See scripts for examples
main = do print =<< length . show <$> bad
          print =<< length . show <$> good
$ ghc -o main Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking main ...
$ time ./main
22019
22019
./main  0.42s user 0.13s system 109% cpu 0.505 total

The main runs so quickly and prints those numbers in succession with no wait.

Are you saying it doesn't do that for you, if you repeat this in a brand new directory?

@LeventErkok
Owner

The above was on a Mac. I also just tried the exact same thing on a Linux box, running GHC 8.0.1 (yeah, very old!), and got the exact same speedy output.

@LeventErkok
Owner

NB. Please run this without compiling for profiling or any RTS options; just pure GHC compilation with no fancy flags. I'm now concerned that the profiling annotations that GHC inserts are causing blow-up, though I don't understand why that should be the case. Would be good to have you confirm or refute that independently!
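Concretely, the two build modes being compared amount to something like the following (assuming profiling builds of sbv and its dependencies are available for the second variant):

$ ghc -O2 Main.hs -o main                      # plain build: runs fast
$ ghc -O2 -prof -fprof-auto Main.hs -o main-p  # profiled build
$ ./main-p +RTS -hc -s -RTS                    # triggers the blow-up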

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

Extremely strange! I was able to reproduce your findings:

➜  sbvTest uname -a
Linux 7thChamber 4.14.102 #1-NixOS SMP Wed Feb 20 09:20:56 UTC 2019 x86_64 GNU/Linux
➜  sbvTest ls -l
total 4
-rw-r--r-- 1 doyougnu nogroup 744 Mar  5 20:21 Main.hs
➜  sbvTest cat Main.hs
import Control.Monad
import Data.SBV

-- program taken from
-- https://github.com/LeventErkok/sbv/issues/460#issuecomment-469590832

bad = sat $ do bs <- mapM (const free_) [1..1000]
               foldM go sTrue bs
  where go x acc = let s = x .&& acc
                   in do constrain s
                         return s

good = sat $ do bs <- mapM (const free_) [1..1000]
                foldM go sTrue bs
  where go x acc = return $ x .&& acc

-- my only modification is printing length of the result instead of the result
-- itself, this makes it easier to read the GC output generated by passing "-s"
-- to the benchmark args. See scripts for examples
main = do print =<< length . show <$> bad
          print =<< length . show <$> good
➜  sbvTest ghc -o main Main.hs
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking main ...
➜  sbvTest time ./main
22019
22019
./main  0.38s user 0.07s system 109% cpu 0.412 total

@LeventErkok
Owner

LeventErkok commented Mar 6, 2019

Great, glad you can replicate!

So, I have to admit I'm not entirely sure what this means:

stack bench --profile cache-test --benchmark-arguments='+RTS -hc -s -RTS';

Does it mean run it like a million times or something? There's a minor difference obviously, and maybe that's getting amplified? (Although I don't see how that command can manage to run it a million times, but I've never been a big user of stack, so I don't know the details.)

I think there's still a problem in there somewhere; but without understanding the real root cause, it'll be tough to fix!

@LeventErkok
Owner

From the user manual here: https://docs.haskellstack.org/en/stable/GUIDE/#components-test-and-bench

It seems it merely builds it in this particular case; and just runs it once. So, this suggests the bad behavior is triggered when SBV is profiled? That doesn't make much sense to me.

You obviously didn't run into this problem while benchmarking something else, I presume? Does the bad behavior exhibit itself in your larger context with no profile arguments? I'm really curious now!

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

Confirmed with the test repo: if I set profiling to false in stack.yaml, I do not observe the GC issue:

➜  sbv460 git:(master) ✗ cat stack.yaml
# This file was automatically generated by 'stack init'
#
# Some commonly used options have been documented as comments in this file.
# For advanced use and comprehensive documentation of the format, please see:
# https://docs.haskellstack.org/en/stable/yaml_configuration/

# Resolver to choose a 'specific' stackage snapshot or a compiler version.
# A snapshot resolver dictates the compiler version and the set of packages
# to be used for project dependencies. For example:
#
# resolver: lts-3.5
# resolver: nightly-2015-09-21
# resolver: ghc-7.10.2
# resolver: ghcjs-0.1.0_ghc-7.10.2
#
# The location of a snapshot can be provided as a file or url. Stack assumes
# a snapshot provided as a file might change, whereas a url resource does not.
#
# resolver: ./custom-snapshot.yaml
# resolver: https://example.com/snapshots/2018-01-01.yaml
resolver: lts-13.10

# User packages to be built.
# Various formats can be used as shown in the example below.
#
# packages:
# - some-directory
# - https://example.com/foo/bar/baz-0.0.2.tar.gz
# - location:
#    git: https://github.com/commercialhaskell/stack.git
#    commit: e7b331f14bcffb8367cd58fbfc8b40ec7642100a
# - location: https://github.com/commercialhaskell/stack/commit/e7b331f14bcffb8367cd58fbfc8b40ec7642100a
#  subdirs:
#  - auto-update
#  - wai
packages:
- .
- lib/sbv

# Dependency packages to be pulled from upstream that are not in the resolver
# using the same syntax as the packages field.
# (e.g., acme-missiles-0.3)
# extra-deps: []

# Override default flag values for local packages and extra-deps
# flags: {}

# Extra package databases containing global packages
# extra-package-dbs: []

# Control whether we use the GHC we find on the path
# system-ghc: true
#
# Require a specific version of stack, using version ranges
# require-stack-version: -any # Default
# require-stack-version: ">=1.7"
#
# Override the architecture used by stack, especially useful on Windows
# arch: i386
# arch: x86_64
#
# Extra directories used by stack for building
# extra-include-dirs: [/path/to/dir]
extra-lib-dirs:
  - lib/

build:
  library-profiling: false
  executable-profiling: false

#
# Allow a newer minor version of GHC than the snapshot specifies
# compiler-check: newer-minor

## uncomment the following lines to tell stack you are in a nix environment
nix:
  enable: true
  pure: true
  packages: [ z3, pkgconfig, zlib ]
➜  sbv460 git:(master) ✗ ./scripts/newBadByCostCenter.sh
sbv460-0.1.0.0: benchmarks
Running 1 benchmarks...
Benchmark newBad: RUNNING...
22019
22019
     213,414,184 bytes allocated in the heap
      21,221,192 bytes copied during GC
       2,643,032 bytes maximum residency (11 sample(s))
          76,712 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       197 colls,     0 par    0.020s   0.022s     0.0001s    0.0036s
  Gen  1        11 colls,     0 par    0.026s   0.034s     0.0031s    0.0123s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.003s elapsed)
  MUT     time    0.377s  (  0.514s elapsed)
  GC      time    0.046s  (  0.056s elapsed)
  EXIT    time    0.001s  (  0.007s elapsed)
  Total   time    0.425s  (  0.581s elapsed)

  Alloc rate    566,114,845 bytes per MUT second

  Productivity  88.7% of total user, 88.5% of total elapsed

Benchmark newBad: FINISH
running hp2ps to generate plot...
hp2ps: cannot open bad.hp
all done, open plot with your favorite image program!

@LeventErkok
Owner

Great sleuthing! Thanks for the analysis.

I'm at a loss, unfortunately. I'm not sure why profiling is causing issues here. I'm open to ideas on how to further debug.

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

Regarding stack bench --profile cache-test --benchmark-arguments='+RTS -hc -s -RTS': stack bench is just a synonym for stack build; in this case, stack bench --profile cache-test tells stack to build the cache-test target listed in the benchmarks section of the package.yaml file. For example, the package.yaml for the test repo looks like:

➜  sbv460 git:(master) ✗ cat package.yaml
name:                sbv460
version:             0.1.0.0
github:              "githubuser/sbv460"
license:             BSD3
author:              "Author name here"
maintainer:          "example@example.com"
copyright:           "2019 Author name here"

extra-source-files:
- README.md
- ChangeLog.md

# Metadata used when publishing your package
# synopsis:            Short description of your package
# category:            Web

# To avoid duplicated efforts in documentation and dealing with the
# complications of embedding Haddock markup inside cabal files, it is
# common to point users to the README.md file.
description:         Please see the README on GitHub at <https://github.com/githubuser/sbv460#readme>

dependencies:
- base >= 4.7 && < 5
- sbv >= 8.0.5


library:
  source-dirs: src

executables:
  sbv460-profile:
    main:                Main.hs
    source-dirs:         app
    ghc-options:
    - -threaded
    - -rtsopts
    - -with-rtsopts=-N
    dependencies:
    - sbv460

benchmarks:
  # benchmark that tests the simplified program levent posted
  # https://github.com/LeventErkok/sbv/issues/460#issuecomment-469590832
  newBad:
    main: Main.hs
    source-dirs:
      - bench/LeventTest
      - src
    ghc-options:
      - -O2
      - -threaded
      - -rtsopts
      - -with-rtsopts=-s
  bad:
    main: Main.hs
    source-dirs:
      - bench/Bad
      - src
    ghc-options:
      - -O2
      - -threaded
      - -rtsopts
      - -with-rtsopts=-s

  good:
    main: Main.hs
    source-dirs:
      - bench/Good
      - src
    ghc-options:
      - -O2
      - -threaded
      - -rtsopts
      - -with-rtsopts=-s

tests:
  sbv460-test:
    main:                Spec.hs
    source-dirs:         test
    ghc-options:
    - -threaded
    - -rtsopts
    - -with-rtsopts=-N
    dependencies:
    - sbv460

Notice the benchmarks section has three sub-sections: bad, newBad, and good. This means I can choose to benchmark any of these via stack bench sbv460:good, or in general stack <command> <package-name:target>; so in the prior example sbv460 is the package name, good the target name, and bench the command. This would also work with exec if we had more than one executable. Each binary is compiled with the ghc-options flags listed; in the case of bad that is -O2, -threaded, -rtsopts, and -with-rtsopts=-s. Because bench is a build synonym that builds and then executes the compiled binary, when we bench we pass the profiling options as command-line options to the binary.

So, in summary, stack bench --profile cache-test --benchmark-arguments='+RTS -hc -s -RTS' means: build the cache-test target with the ghc-options specified in package.yaml, then pass the options +RTS -hc -s -RTS to the binary when executing it. -hc means profile by cost center, and -s collects memory and garbage-collector statistics, which are printed to stdout.

@LeventErkok
Owner

Thanks for the explanation; good to know.

So, do we concur that profiling SBV causes the issue here, or do you suspect something else might be going on?

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

I've gotta say this is also above my pay grade, but I wonder if enabling profiling like this prevents values in the cache from being garbage collected until the computation ends? I agree that profiling triggers this behavior in SBV; I think we've both demonstrated that thoroughly.

@LeventErkok
Owner

Does that explain why "good" behaves ok? If profiling hangs on to state, then I'd expect both to exhibit the same behavior. I don't understand why having a call to constrain makes any difference.

If we can replicate this on a toy program (i.e., without SBV), I think it would be well worth asking the GHC folks; or at least on stack-overflow or something. I'm afraid with SBV in the loop, however, it'll be hard for anyone else to dig into it. Do you think you might be able to replicate with something simpler? (I know this is a lot to ask!)

@LeventErkok
Owner

Also: you clearly must've hit some other issue before you decided to profile, and we still need to address the original issue, I suppose. What caused you to go down the profiling path?

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

I can try to replicate without sbv. I think the hypothesis would be that a foldM with some sort of effect interacts poorly with profiling and the GC; from that, the candidates would simply be IO, ST, and State, I think. I went down the profiling path because I wanted to verify the absence of memory leaks, and because I hadn't done any strictness analysis on an application whose major contribution is performance. I also noticed with criterion benchmarks that my initial tests ran extremely slowly, so I suspected a big leak somewhere.
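A minimal non-SBV reproducer along those lines might start from something like this (hypothetical; it mimics the constrain-per-accumulation pattern by snapshotting the growing accumulator into an IORef at every fold step):

import Control.Monad (foldM)
import Data.IORef    (modifyIORef', newIORef, readIORef)

-- Toy stand-in for a symbolic expression tree that grows as we fold.
data Expr = Var Int | And Expr Expr

main :: IO ()
main = do
  ref <- newIORef []                           -- plays the role of the constraint store
  let go acc x = do let acc' = And acc (Var x)
                    modifyIORef' ref (acc' :)  -- "constrain" at every step
                    pure acc'
  _ <- foldM go (Var 0) [1 .. 1000 :: Int]
  stored <- readIORef ref
  print (length stored)                        -- keep the snapshots alive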

@LeventErkok
Owner

Great; please do keep me in the loop; if SBV is contributing to this in any way I'd like to fix it. (Pull requests most welcome!)

There'll be a release of SBV sometime this weekend. Should I hold off until you get some further insight, or should we go ahead with the release? (The next-next release, I'm afraid, won't be for quite a while after this one.)

@doyougnu
Contributor Author

doyougnu commented Mar 6, 2019

Nah, go ahead with the release. I think we have good evidence that this is isolated to profiling, and that SBV is performant when profiling is turned off. I'll pull the new release, verify the bug, and experiment this weekend. Thanks for the quick responses and excellent work, Levent!

@LeventErkok
Owner

LeventErkok commented Mar 6, 2019

Sounds good. I'm going to close this ticket then; but please re-open it if you spot anything funky going on in SBV proper. Even if it only shows up during profiling, I'd like to at least understand it better. Much appreciated!
