-
Notifications
You must be signed in to change notification settings - Fork 50
Proposed PooledDataArray changes #50
Comments
What this proposal doesn't mention, because I want to debate it in a separate issue, is the idea that we should define |
Sounds like a plan, though here are a few points I'd like to discuss:
Sounds like too restrictive to me. When you have large data consisting of strings, storing it as a vector of strings takes much more memory than pooling them, so it should be possible to load it as a Anyway, I don't understand why you say pooled data should be stable. What's the technical constrain which imposes that?
I'm not sure this adds much value. The one-way frequency table does not help you in any way to make a multidimensional table. And it imposes a non-negligible cost everytime you take a subset of the vector (and this would be worse with Being able to automatically drop empty levels is an important feature to me, but it could be done only when taking a subset, and as an option. |
You're right: the framing of "don't mutate PDA's" was too strong. What I want is to get everyone to agree that PDA's offer a performance trade-off similar to that offered by indexed columns in an RDBMS. You can do lots of cool things faster when you have an index for a database, but you pay a price in the speed of insertions and deletions. PDA's would offer such a trade-off: they perform insertions and deletions a little slower, but give you lots of useful cached information in exchange. I'm not sure if the subsetting concern is going to be a problem for long, since the long-term plan has always been to make subsetting operations on DataArray's and DataFrame's return SubDataArray and SubDataFrame structures once Base switches to producing array views by default. Since the Sub* structure is just a mapping from indices in the subset to indices in the original, I don't see any need to copy anything about the counts. The costly copy only happens when you want an independent copy of the subset, which is something you generally don't need. |
Yeah, but if people mostly work with |
You mean when they call |
Sorry, I don't get what you mean. ;-) |
Wow, I like the picture 👍 |
Now I'm totally lost! |
What specifically are you lost about? Because I'm lost about what you're lost about. |
Just joking, but I simply did not understand your question in #50 (comment). |
I just wanted to ask if there were cases in which we would use the proposed |
I see. Well, precisely, I don't think it could be used in any other case -- that's why I think it's not worth keeping around. |
But it is worth keeping around: it lets you maintain referential integrity. |
Well, once you've removed empty levels, you no longer need it, do you? |
Yes, but this is the proposed mechanism by which you know when a level is empty. This is basically the reference counting approach used in many garbage collectors. |
There's a little uncertainty about how Will people still use If so, it would be nice if neither use required remembering to convert (/ functions taking extra agruments to know) each time the array is used. As nominal and ordinal pooled data are living under the same type, perhaps adding If not, maybe treat them as categorical at least until more compact storage comes back after core functionality and benchmarking are in place. (Fits with my preference for the default -- that using the same clean syntax to pool a string and to pool an integer works, and that it's seems more useful to be explicit when pooling for performance.) |
I've just had an epiphany about PDA's. And now I'm resolutely opposed to their existence and think we need something totally different from our current implementation. My proposed changes are also not going to save us from the problem I've discovered. The problem is that PDA's, as we've thought about them up till now, are subtly corrupted by our familiarity with R's type system. But R's type system is fundamentally different from Julia's type system, because R has no scalar types at all. And the implications of that statement are actually a disaster waiting to happen for us if we ever implement ordered categorical data. The problem is that we can never hope to define an ordering on PDA's that works in a reliable way in Julia. The reason for that is that R's factors are always vectors and, importantly, a single element of R factor is itself a length-1 R factor vector. As such, simple things like As such, every single time you select a scalar element from a PDA, you lose access to all information about the ordering defined by the PDA. This will be a nightmare for us at some point. To see why, imagine that you have a For example, if you defined the ordering over the two levels, So what should we do? I now think that we should abandon PDA's completely. The way to provide both minimum memory usage, representation of categorical data with unobserved levels and custom orderings of scalars is to use vectors whose elements are instances of a new kind of Enum type. The Enum type could have all the properties we'd like to define and would not be subtly broken the way ordered PDA's would be. That said, I think we could still make PDA's useful. But I think they'd be most useful if we used them as a tool for providing O(1) calculations of But, if we ever want to get ordered factors right, we need to start baking the levels of ordered categorical data into the Julia type system, which means using something like Enum's and not using something like PDA's. Just like I've come to realize that emulation of delayed evaluation is hard to get right in Julia, I think working with ordered categorical data in Julia requires new abstractions that are further removed from R's precedent than our current implementation. |
I did not actively contribute to the DataArray/DataFrames package, but I often keep an eye on the discussions here. John, I agree with your point that an efficient & versatile Enum type is the correct way forward. |
Thanks, Dahua! |
Very interesting considerations. ;-) You're right that the concept of an ordered/unordered set of levels is an enum in the language, so it makes sense to consider it as such. Using strings to represent these values is a habit we took from R (but also from most stats languages I know of). So how would this work? I guess an enum type would be created on the fly?. And then how could you access the individual values of enum, e.g. how would you do |
I'm not sure what the status of Conditional on how the Base implementation works out, we would have to make new Re |
OK. So the only controversial point is actually whether the term |
I don't have strong feelings about the best name. But |
Yeah, let's find a better name. I don't know where the S authors found this term... Going back to enums: do you still want to support storing integers in PDAs? Julia enums may not support integer labels. This may also be a problem for any type which is not a string, e.g. |
I'm pretty sure enums are going to let you use any type, but let me know if that's not right. |
Well, apparently the idea with enums is to associate symbols to integers. Then it's up to you to make anything with them. While you can easily convert symbols to strings, you cannot automatically convert them to any type: you'd need to store the objects in a pool and retrieve them using their integer index. |
That's a bummer. We could always say that you can only use categorical data with string labels, then do a string -> symbol conversion in the background. Arguably there's no real need to use integers as categorical variables since the categoricalness means that arithmetic on the categories doesn't make any sense. So we can just replace integers with the symbols that represent them. |
My thinking on PooledDataArray's is slowly changing, especially with regard to the absence of anything like an official
factor
type like. Here are some proposals for how we might use them going forward.compact
. Instead, we should revert our earlier stance and ensure, on every operation, that PDA's are represented in maximally compact form.ordered::Bool
andorder::Vector{Uint64}
, which define an ordering over the pool by mapping each item of the pool to an integer that matches the rank of the associated level in the ordering over categories.reorder
so that we only have asetorder!
function, which takes in a new ordering of the pool as a vector. This would let you take something likepda = @pdata(["a", "b", "c"])
and reorder it assetorder!(pda, ["b", "c", "a"])
.get_indices
and similar functions should not exist.The text was updated successfully, but these errors were encountered: