Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using result of :readByteArray on large file is slow #1428

Open
qsctr opened this issue Sep 6, 2022 · 1 comment
Open

Using result of :readByteArray on large file is slow #1428

qsctr opened this issue Sep 6, 2022 · 1 comment
Labels
performance General performance related issues.

Comments

@qsctr
Copy link
Contributor

qsctr commented Sep 6, 2022

In #346 the implementation of :readByteArray was changed to turn the input file into a huge number and bind it to split applied to the huge number, to reduce memory usage compared to building up a huge sequence literal.

readFileCmd :: FilePath -> REPL ()
readFileCmd fp = do
bytes <- replReadFile fp (\err -> rPutStrLn (show err) >> return Nothing)
case bytes of
Nothing -> return ()
Just bs ->
do pm <- getPrimMap
let val = byteStringToInteger bs
let len = BS.length bs
let split = T.ePrim pm (M.prelPrim "split")
let number = T.ePrim pm (M.prelPrim "number")
let f = T.EProofApp (foldl T.ETApp split [T.tNum len, T.tNum (8::Integer), T.tBit])
let t = T.tWord (T.tNum (toInteger len * 8))
let x = T.EProofApp (T.ETApp (T.ETApp number (T.tNum val)) t)
let expr = T.EApp f x
void $ bindItVariable (E.TVSeq (toInteger len) (E.TVSeq 8 E.TVBit)) expr

However, using the resulting sequence turns out to be very slow when the file is big, because the result of split is not a whole new sequence in memory but rather a kind of lazy sequence that goes back to the original sequence and looks up the appropriate value whenever you index into it. And since the original sequence in this case is a huge bitvector represented as an Integer, the lookup of a byte within it results in doing a shift as pointed out by @yav, which is linear in the size of the integer.

splitV :: Backend sym =>
sym ->
Nat' ->
Integer ->
TValue ->
SEval sym (GenValue sym) ->
SEval sym (GenValue sym)
splitV sym parts each a val =
case (parts, each) of
(Nat p, e) | isTBit a -> do
val' <- sDelay sym (fromWordVal "splitV" <$> val)
return $ VSeq p $ indexSeqMap $ \i ->
VWord e <$> (extractWordVal sym e ((p-i-1)*e) =<< val')

Therefore processing the whole result of :readByteArray takes quadratic time. Note that fully forcing the result of split does not help, because the result of the lookup is not cached, so it will be slow every time you use the sequence.

It seems to me that the best way of solving this would be to just construct the sequence value directly in memory in a form where lookup is constant time and add a way to bind it to values instead of just expressions. Alternatively we could optimize the lookup of a byte within the huge bitvector to be constant time, perhaps by using a series of testBit calls as suggested by @RyanGlScott. This could improve performance more generally any time we are looking up a small number of bits in a very big bitvector, although I'm not sure how often that occurs outside of :readByteArray.

@qsctr qsctr added the performance General performance related issues. label Sep 6, 2022
@RyanGlScott
Copy link
Contributor

The hgmp library provides Haskell bindings to some low-level Integer manipulation primitives from gmp, which may offer an alternative way to search for a particular byte in an Integer. In particular, mpz_import and/or mpz_export might be useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance General performance related issues.
Projects
None yet
Development

No branches or pull requests

2 participants