Using result of `:readByteArray` on large file is slow #1428

qsctr · 2022-09-06T07:39:13Z

In #346 the implementation of :readByteArray was changed to turn the input file into a huge number and bind it to split applied to the huge number, to reduce memory usage compared to building up a huge sequence literal.

cryptol/src/Cryptol/REPL/Command.hs

Lines 1046 to 1061 in 39dea16

    
           readFileCmd :: FilePath -> REPL () 
        
           readFileCmd fp = do 
        
             bytes <- replReadFile fp (\err -> rPutStrLn (show err) >> return Nothing) 
        
             case bytes of 
        
               Nothing -> return () 
        
               Just bs -> 
        
                 do pm <- getPrimMap 
        
                    let val = byteStringToInteger bs 
        
                    let len = BS.length bs 
        
                    let split = T.ePrim pm (M.prelPrim "split") 
        
                    let number = T.ePrim pm (M.prelPrim "number") 
        
                    let f = T.EProofApp (foldl T.ETApp split [T.tNum len, T.tNum (8::Integer), T.tBit]) 
        
                    let t = T.tWord (T.tNum (toInteger len * 8)) 
        
                    let x = T.EProofApp (T.ETApp (T.ETApp number (T.tNum val)) t) 
        
                    let expr = T.EApp f x 
        
                    void $ bindItVariable (E.TVSeq (toInteger len) (E.TVSeq 8 E.TVBit)) expr

However, using the resulting sequence turns out to be very slow when the file is big, because the result of split is not a whole new sequence in memory but rather a kind of lazy sequence that goes back to the original sequence and looks up the appropriate value whenever you index into it. And since the original sequence in this case is a huge bitvector represented as an Integer, the lookup of a byte within it results in doing a shift as pointed out by @yav, which is linear in the size of the integer.

cryptol/src/Cryptol/Eval/Generic.hs

Lines 1072 to 1084 in 39dea16

    
           splitV :: Backend sym => 
        
             sym -> 
        
             Nat' -> 
        
             Integer -> 
        
             TValue -> 
        
             SEval sym (GenValue sym) -> 
        
             SEval sym (GenValue sym) 
        
           splitV sym parts each a val = 
        
               case (parts, each) of 
        
                  (Nat p, e) | isTBit a -> do 
        
                     val' <- sDelay sym (fromWordVal "splitV" <$> val) 
        
                     return $ VSeq p $ indexSeqMap $ \i -> 
        
                       VWord e <$> (extractWordVal sym e ((p-i-1)*e) =<< val')

Therefore processing the whole result of :readByteArray takes quadratic time. Note that fully forcing the result of split does not help, because the result of the lookup is not cached, so it will be slow every time you use the sequence.

It seems to me that the best way of solving this would be to just construct the sequence value directly in memory in a form where lookup is constant time and add a way to bind it to values instead of just expressions. Alternatively we could optimize the lookup of a byte within the huge bitvector to be constant time, perhaps by using a series of testBit calls as suggested by @RyanGlScott. This could improve performance more generally any time we are looking up a small number of bits in a very big bitvector, although I'm not sure how often that occurs outside of :readByteArray.

The text was updated successfully, but these errors were encountered:

RyanGlScott · 2022-09-19T18:49:04Z

The hgmp library provides Haskell bindings to some low-level Integer manipulation primitives from gmp, which may offer an alternative way to search for a particular byte in an Integer. In particular, mpz_import and/or mpz_export might be useful.

qsctr added the performance General performance related issues. label Sep 6, 2022

qsctr mentioned this issue Sep 6, 2022

FFI version of SuiteB AES functions #1398

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using result of `:readByteArray` on large file is slow #1428

Using result of `:readByteArray` on large file is slow #1428

qsctr commented Sep 6, 2022

RyanGlScott commented Sep 19, 2022

Using result of :readByteArray on large file is slow #1428

Using result of :readByteArray on large file is slow #1428

Comments

qsctr commented Sep 6, 2022

RyanGlScott commented Sep 19, 2022

Using result of `:readByteArray` on large file is slow #1428

Using result of `:readByteArray` on large file is slow #1428