Encoding lists without space leaks #6

Closed
Gabriel439 opened this Issue Aug 20, 2012 · 6 comments

@Gabriel439

cereal triggers a space leak when encoding lists. The source of the problem is encodeListOf, which computes the length of the list up front, forcing the entire list into memory before any elements are written:

encodeListOf :: (a -> Builder) -> [a] -> Builder
encodeListOf f = -- allow inlining with just a single argument
    \xs ->  execPut (putWord64be (fromIntegral $ length xs)) `mappend`
            foldMap f xs
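The reason this leaks: `length xs` must traverse the entire list before `foldMap f xs` can emit a single byte, so the whole spine stays live between the two traversals. The same pattern can be illustrated without cereal (a hypothetical `lengthPrefixed`, not part of any library):

```haskell
-- Hypothetical, cereal-free illustration of the leaky pattern: any
-- length-prefixed encoder must traverse its input twice, keeping the
-- whole list alive in between.
lengthPrefixed :: Show a => [a] -> String
lengthPrefixed xs =
    show (length xs)       -- first traversal: forces the full spine
    ++ concatMap show xs   -- second traversal: the list is still live
```

Replacing the length prefix with self-delimiting chunks removes the first traversal, which is what the chunked encoding below does.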

The way I get around this in my own code is a chunked list representation. The following code writes to disk directly just because I'm not as familiar with cereal's internals, but I'm reasonably sure you could translate it into a pure equivalent:

{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad
import qualified Data.ByteString.Char8 as B
import Data.List.Split
import Data.Serialize
import Foreign           -- Foreign.Safe is deprecated; Foreign is safe by default now
import System.IO

list = [1..10000000] :: [Int]

main = do
--  B.writeFile "test.dat" $ encode list
    withFile "test.dat" WriteMode $ \h -> encodeList 100 encodeS list h
--  xs <- withFile "test.dat" ReadMode  $ \h -> decodeList decodeS h
--  print $ length (xs :: [Int])

encodeS :: forall a . (Storable a) => a -> Handle -> IO ()
encodeS x hdl = with x $ \ptr -> hPutBuf hdl ptr (sizeOf (undefined :: a))

decodeS :: forall a . (Storable a) => Handle -> IO a
decodeS hdl = alloca $ \ptr -> do
    hGetBuf hdl ptr (sizeOf (undefined :: a))
    peek ptr

-- Tag each chunk: Nothing = more chunks follow, Just len = final chunk
-- (with its exact length). An empty input yields a single empty final chunk.
markLast :: [[a]] -> [(Maybe Int, [a])]
markLast    []  = (Just          0, []):[]
markLast (l:[]) = (Just $ length l,  l):[]
markLast (l:ls) = (Nothing        ,  l):markLast ls

encodeList :: Int -> (a -> Handle -> IO ()) -> [a] -> Handle -> IO ()
encodeList n e as hdl = do
    encodeS n hdl
    forM_ (markLast $ chunksOf n as) $ \(m, as') -> do
        case m of
            -- Bool has no Storable instance, so use a Word8 flag
            Nothing  -> encodeS (0 :: Word8) hdl
            Just len -> do
                encodeS (1 :: Word8) hdl
                encodeS len hdl
        forM_ as' $ \a -> e a hdl

decodeList :: (Handle -> IO a) -> Handle -> IO [a]
decodeList d hdl = do
    n <- decodeS hdl
    let loop = do
            flag <- decodeS hdl :: IO Word8
            if flag /= 0
            then do
                len <- decodeS hdl
                replicateM len (d hdl)
            else do
                as <- replicateM n (d hdl)
                fmap (as ++) loop
    loop

The above implementation runs in constant space (and much faster because of direct IO, but I'll create a separate ticket for that).

I wanted to ask if you could add something like the above chunked implementation that runs in constant space in Data.Serialize.Get and Data.Serialize.Put (i.e. perhaps "getChunkedListOf" or "putChunkedListOf").

Optionally, you could also pick a fast default value of n (roughly 100 worked best on my machine) and replace the default list-encoding instance with the chunked one. If you choose not to (e.g. to preserve binary compatibility with previous cereal versions), that's fine with me; I'd just use the named versions instead.

@elliottt
Galois, Inc. member

I like the idea of doing the chunked representation. I don't really see any problem with exposing both a chunked version of putListOf/getListOf, in addition to one that picks a sane default value. If you're interested in writing the patch, I'd be happy to take the contribution :)

@Gabriel439

I'd be happy to write up the patch. Just give me a while because I need to familiarize myself with the cereal internals some more first.

@elliottt
Galois, Inc. member

You should be able to write it as something that makes iterated calls to putListOf/getListOf as a primitive, looping over its input and splitting at the chunk boundary using splitAt.

putChunkedListOf :: Int -> Putter a -> Putter [a]
putChunkedListOf clen f = loop
  where
  loop xs = do
    let (as,bs) = splitAt clen xs
    putListOf f as
    -- loop while the chunk was full: the final short (possibly empty)
    -- chunk tells the decoder where the list ends
    unless (length as < clen) (loop bs)
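On the decoding side, a matching getChunkedListOf (name assumed from the proposal above) could keep reading chunks until it sees one shorter than clen. For that to be unambiguous, the encoder must emit a trailing short (possibly empty) chunk even when the input length is an exact multiple of clen. The framing itself can be sketched without cereal (toChunks/fromChunks are illustrative names, not cereal API):

```haskell
-- Sketch of the chunk framing for a hypothetical
-- putChunkedListOf/getChunkedListOf pair. A final chunk shorter than
-- clen (possibly empty) marks the end of the list.
toChunks :: Int -> [a] -> [[a]]
toChunks clen xs =
  case splitAt clen xs of
    (as, bs)
      | length as < clen -> [as]                 -- short chunk terminates
      | otherwise        -> as : toChunks clen bs

-- Concatenate chunks until the short terminator chunk.
fromChunks :: Int -> [[a]] -> [a]
fromChunks clen (c:cs)
  | length c < clen = c
  | otherwise       = c ++ fromChunks clen cs
fromChunks _ [] = []
```

For example, toChunks 3 [1..7] yields [[1,2,3],[4,5,6],[7]], while toChunks 3 [1..6] yields [[1,2,3],[4,5,6],[]] so the decoder still sees a terminator.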
elliottt closed this Dec 10, 2012
@elliottt
Galois, Inc. member

I'm closing this as it seems like you'll submit a pull request when the feature is ready :)

@Gabriel439

Yeah, I will submit a pull request when it is ready.
