Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ByteString implement Iterable #294

Open
mgroth0 opened this issue Apr 11, 2024 · 8 comments
Open

Make ByteString implement Iterable #294

mgroth0 opened this issue Apr 11, 2024 · 8 comments

Comments

@mgroth0
Copy link

mgroth0 commented Apr 11, 2024

Why does ByteString not implement Iterable? Was it a design decision, or was there just never a reason to?

It seems to me like it would be incredibly useful and 100% safe.

@JakeWharton
Copy link
Contributor

On the JVM this would incur boxing and a ton of virtual dispatching making the performance pretty awful. What are you trying to do?

@mgroth0
Copy link
Author

mgroth0 commented Apr 13, 2024

Thanks for explaining. If I understand correctly, implementing Iterable would only incur boxing when the iterator method is called.

I don't have one specific use case, but it seems all around practical because we immediately can use entire suites of extensions methods on Iterable such as Kotlin Collections from the standard library. Also, I like to make classes that take Iterable constructor parameters. Then I build a list or set to get a safe immutable copy inside the class.

I think the particular thing I noticed was that there is no existing forEach extension for ByteString. I could quickly write it myself, but it just gave me this idea because implementing Iterable would make it work out of the box in convenient ways.

I also feel convinced because since ByteString is immutable, it is particularly more safe to iterate than bytearray (even concurrently).

On the topic of performance, I try to first write code that is safe, easy to read, and easy to maintain and try not to let performance preocupy me too much since in many real life scenarios these performance differences will be negligable. Also what's the point of computer hardware getting better if we don't let it shave off a few lines of code for us?

If the consensus to implementing Iterable is not to, I would consider an alternative solution maybe to be extension methods to match what we have in the standard library for ByteArray or String.

@JakeWharton
Copy link
Contributor

If I understand correctly, implementing Iterable would only incur boxing when the iterator method is called.

Every call to Iterator#next() would box, and then immediately unbox because it has to travel through the interface.

Kotlin does feature iterable specializations that avoid the boxing (ByteIterable), but I don't recall the for/in syntax using them. And certainly none of the transforming operators use them as you mention below.

I don't have one specific use case, but it seems all around practical because we immediately can use entire suites of extensions methods on Iterable such as Kotlin Collections from the standard library. Also, I like to make classes that take Iterable constructor parameters. Then I build a list or set to get a safe immutable copy inside the class.

For actual collections of objects this makes sense. Unfortunately for bytes you are much better off treating them as large units. ByteString is already immutable and represents a chunk which as a whole semantically represents something. Additionally, bytes are typically not processed using the operators one would normally use on other iterable things. Sometimes you do, sure, but it's very rare, and in those cases you need to operate directly on the bytes rather than through a (comparatively) heavy abstraction like Iterable.

In the 10 years of Okio we have not needed to iterate a ByteString, at least not in a way that couldn't be done by shallow-copying its backing segments to a Buffer and operating on that directly or its backing ByteArrays via UnsafeCursor.

I think the particular thing I noticed was that there is no existing forEach extension for ByteString. I could quickly write it myself, but it just gave me this idea because implementing Iterable would make it work out of the box in convenient ways.

The need to iterate bytes like this is not entirely uncommon, but if you're operating at the level of bytes you generally want to ingest with as little overhead as possible because you're interpreting those bytes as gzip data, png data, mp3 data, or something else. You aren't looking at each byte one-by-one and doing something with it individually.

On the topic of performance, I try to first write code that is safe, easy to read, and easy to maintain and try not to let performance preocupy me too much since in many real life scenarios these performance differences will be negligible.

This is a good strategy for high-level code which deals with objects and things. Correctness first, performance second.

I'm not sure it makes sense at this layer of abstraction, however. Concrete use-cases of what you are trying to accomplish would go a long way in helping us understand if this is the right strategy or if there's a better one. Speculatively adding the feature, though, is unlikely to yield great results.

Also what's the point of computer hardware getting better if we don't let it shave off a few lines of code for us?

I'm not sure when you started using computers, but computers today are perceptively slower than older ones! Using programs on my 333Mhz PC twenty-five years ago was a much more responsive and snappy experience than my M3 Mac.

The pace at which we've introduced indirection and abstraction at every layer has outpaced Moore's law (which itself has mostly died). The laws of thermodynamics and the speed of light remain fixed, and we're smack up against them now.

The ability to process things efficiently, especially those represented in low-level constructs like a series of bytes, has actually become a competitive advantage for apps.

@qwwdfsad
Copy link
Collaborator

qwwdfsad commented Apr 13, 2024

Nice discussion!

but I don't recall the for/in syntax using them.

They should, but for now it is a hidden knowledge for no reason. Filed https://youtrack.jetbrains.com/issue/KT-67379/

Example
class ByteString : Iterable<Byte> {  
  
    class BIter(private val arr: ByteArray) : ByteIterator() {  
        private var idx = -1  
        override fun nextByte(): Byte = arr[++idx]  
  
        override fun hasNext(): Boolean = idx < arr.size - 1  
    }  
  
    // Experiment: replace ByteIterator with Iterator<Byte> and see the bytecode difference  
    override fun iterator(): ByteIterator = BIter(byteArrayOf(1, 2, 3))  
}  
 
@Test
fun sample() {
    var sum: Int = 0
    for (b in ByteString()) {
        sum += b
    }
}

I'm not sure it makes sense at this layer of abstraction, however. Concrete use-cases of what you are trying to accomplish would go a long way in helping us understand if this is the right strategy or if there's a better one. Speculatively adding the feature, though, is unlikely to yield great results.

This is quite a subtle topic, especially if/when we are going to move ByteString into the standard library and have kotlin.String and kotlin.ByteString along each other, where it is reasonable to expect them to be more or less aligned.

On the one hand, for passer-by usages (and if we are successful, we expect plenty of them!), the impedance mismatch would be quite noticeable -- why for String a user can iterate over characters, while for ByteString they cannot iterate over their bytes/ascii characters? (if there will be such a need at all) It's hard to explain either -- boxing considerations is not a lingua we expect everyone to understand and reason with. Having non-included batteries in the standard library is always questionable.

On the other hand, at the very core, the API is quite specialized, and exposing inherently suboptimal operations is a questionable thing to do. Even if boxing is optimizeable via ByteIterator, the iterator allocation is not (and, more importantly, byte-by-byte processing becomes more straightforward than batch processing that we encourage). This is a perfect use case for https://youtrack.jetbrains.com/issue/KT-33851/iterator-less-for-statement-operator-convention though

@qwwdfsad
Copy link
Collaborator

The ability to process things efficiently, especially those represented in low-level constructs like a series of bytes, has actually become a competitive advantage for apps.

I share this sentiment: for apps, for tools, for IDEs, for everything, to be honest.

This is the topic that puzzles me regularly -- from one perspective, we can only provide quite tailored APIs that are hard to misuse ([lack of] byte iteration is a nice example!); it might help, but comes with its own price. From the other perspective, being efficient cannot really be an afterthought for an app, and no amount of well-thought API can help it if it's not a consideration in the first place. All the convenience extensions will be written anyway if they are not here.

The place on the spectrum from "efficient low-level API" and "convenient yet probably slower API" is really elusive, esp. for IO layer.

@mgroth0
Copy link
Author

mgroth0 commented Aug 26, 2024

Thank you both for the thoughtful answers! I have a lot to learn here.

Every call to Iterator#next() would box, and then immediately unbox because it has to travel through the interface.

Boxing is a topic I don't understand very well. Can someone point me to something I can read on this? Or is it something that can be explained here? I understand that the Iterator object will have to be instantiated, but I don't understand why the ByteString object will have to be repeatedly boxed and unboxed as opposed to boxed once and then have a reference of the box held by the iterator.

In the 10 years of Okio we have not needed to iterate a ByteString

I see my request might be going against an important principle. That principle being that binary objects should be handled with bulk operations.

I think I can respect this principle but still make a case for supporting iteration. One reason is debugging. If a bulk operation on a binary object has an issue, it might be useful to iterate one byte at a time and see what is going wrong, maybe with a breakpoint in there.

Another reason is being able to maintain more of a flow state in your head when writing an operation on bytes in which you can immidiately think of how to do it byte-by-byte, but figuring out the appropriate bulk operation might take some more thought.

I'm not sure when you started using computers, but computers today are perceptively slower than older ones!
The pace at which we've introduced indirection and abstraction at every layer has outpaced Moore's law

I started around 2005. I think this is a fascinating perspective but I personally don't see it. I mean, unless we are talking about gradle builds, that's about the only thing that is slow for me. But even there once the configuration cache was introduced it got super fast (at least when the configuration cache works correctly!). Then again, I take what you are saying as the truth given you have more experience.

On the one hand, for passer-by usages (and if we are successful, we expect plenty of them!), the impedance mismatch would be quite noticeable -- why for String a user can iterate over characters, while for ByteString they cannot iterate over their bytes/ascii characters?

For the record that's definitely what brought me here! It just stuck me as strange, even through I can see now there are valid reasons for the abscence of an iteration API on ByteString.

This is indeed a nice topic. Maybe documenting the reasoning behind not including forEach could be just as valuable as the community (if not more) than including it at all. Though I still think some of my reasons in this comment are worth considering. I'll definitely respect whatever decision you all make.

@fzhinkin
Copy link
Collaborator

Boxing is a topic I don't understand very well. Can someone point me to something I can read on this? Or is it something that can be explained here? I understand that the Iterator object will have to be instantiated, but I don't understand why the ByteString object will have to be repeatedly boxed and unboxed as opposed to boxed once and then have a reference of the box held by the iterator.

Iterator over ByteString would implement Iterator<Byte>.
When targeting JVM, Kotlin's Iterator interface is mapped to Java java.util.Iterator (javadoc).
Java prohibits generic type instantiation using primitive types.
As a result, iterator's fun next(): Byte would need to return Byte-object, not a primitive type. But values stored inside ByteString's backing array are primitive bytes, not objects. So every byte needs to boxed into java.lang.Byte before returning from next and then, most likely, it needs to be unboxed at a use-site.

Here's an example of an iterator over ByteArray: https://godbolt.org/z/WE8nonbeK
As you can see, a value retrieved from the array is wrapper into j.l.Byte via java/lang/Byte.valueOf call.

@JakeWharton
Copy link
Contributor

If you return a ByteIterator it doesn't box! https://kotlin.godbolt.org/z/P5rqb8vx7

I forgot Kotlin has these iterator specializations. We of course never had access to them in Okio historially since it spent most of its life written in Java.

That said, as soon as someone calls an extension on it like map the game is over and we're back to boxing.

On Java all big-B Byte instances are cached as long as you go through Byte.valueOf. I have no idea what happens on other Kotlin targets though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants