Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Encoding lists without space leaks #6

Closed
Gabriel439 opened this Issue August 20, 2012 · 6 comments

2 participants

Gabriel Gonzalez Trevor Elliott
Gabriel Gonzalez

cereal triggers a space leak when encoding lists. The source of the problem is encodeListOf, which computes the length of the list, which brings the entire list into memory.

encodeListOf :: (a -> Builder) -> [a] -> Builder
encodeListOf f = -- allow inlining with just a single argument
    \xs ->  execPut (putWord64be (fromIntegral $ length xs)) `mappend`
            foldMap f xs

The way I get around this in my own code is a chunked list representation. The following code writes to disk directly just because I'm not as familiar with cereal's internals, but I'm reasonably sure you could translate it into a pure equivalent:

{-# LANGUAGE ScopedTypeVariables #-}

import Control.Monad
import qualified Data.ByteString.Char8 as B
import Data.List.Split
import Data.Serialize
import Foreign.Safe
import System.IO

list = [1..10000000] :: [Int]

main = do
--  B.writeFile "test.dat" $ encode list
    withFile "test.dat" WriteMode $ \h -> encodeList 100 encodeS list h
--  xs <- withFile "test.dat" ReadMode  $ \h -> decodeList decodeS h
--  print $ length (xs :: [Int])

encodeS :: forall a . (Storable a) => a -> Handle -> IO ()
encodeS x hdl = with x $ \ptr -> hPutBuf hdl ptr (sizeOf (undefined :: a))

decodeS :: forall a . (Storable a) => Handle -> IO a
decodeS hdl = alloca $ \ptr -> do
    hGetBuf hdl ptr (sizeOf (undefined :: a))
    peek ptr

markLast    []  = (Just          0, []):[]
markLast (l:[]) = (Just $ length l,  l):[]
markLast (l:ls) = (Nothing        ,  l):markLast ls

encodeList :: Int -> (a -> Handle -> IO ()) -> [a] -> Handle -> IO ()
encodeList n e as hdl = do
    encodeS n hdl
    forM_ (markLast $ chunk n as) $ \(m, as') -> do
        case m of
            Nothing  -> encodeS False hdl
            Just len -> do
                encodeS True hdl
                encodeS len hdl
        forM_ as' $ \a -> e a hdl

decodeList :: (Handle -> IO a) -> Handle -> IO [a]
decodeList d hdl = do
    n <- decodeS hdl
    let loop = do
            last <- decodeS hdl
            if last
            then do
                len <- decodeS hdl
                replicateM len (d hdl)
            else do
                as <- replicateM n (d hdl)
                fmap (as ++) loop
    loop

The above implementation runs in constant space (and much faster because of direct IO, but I'll create a separate ticket for that).

I wanted to ask if you could add something like the above chunked implementation that runs in constant space in Data.Serialize.Get and Data.Serialize.Put (i.e. perhaps "getChunkedListOf" or "putChunkedListOf").

Also, optionally, you could presumably pick a fast default value of n (I found roughly 100 worked best on my machine) and replace the default instance for encoding lists with that one, although if you choose not to (i.e. for binary compatibility with previous cereal versions), that would be just fine with me and I would be happy to just use the named versions instead.

Trevor Elliott
Collaborator

I like the idea of doing the chunked representation. I don't really see any problem with exposing both a chunked version of putListOf/getListOf, in addition to one that picks a sane default value. If you're interested in writing the patch, I'd be happy to take the contribution :)

Gabriel Gonzalez

I'd be happy to write up the patch. Just give me a while because I need to familiarize myself with the cereal internals some more first.

Trevor Elliott
Collaborator

You should be able to write it as something that makes iterated calls to putListOf/getListOf as a primitive, looping over its input and splitting at the chunk boundary using splitAt.

putChunkedListOf :: Int -> Putter a -> Putter [a]
putChunkedListOf clen f = loop
  where
  loop xs = do
    let (as,bs) = splitAt clen xs
    putListOf f as
    unless (null bs) (loop bs)
Gabriel Gonzalez
Trevor Elliott elliottt closed this December 10, 2012
Trevor Elliott
Collaborator

(I'm closing this as it seems like you'll submit a pull request when the feature is ready :)

Gabriel Gonzalez

Yeah, I will submit a pull request when it is ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.