Proposed PooledDataArray changes #50

johnmyleswhite · 2014-01-05T21:35:42Z

My thinking on PooledDataArray's is slowly changing, especially with regard to the absence of anything like an official factor type like. Here are some proposals for how we might use them going forward.

PDA's shouldn't be used as a workhorse data structure for data that is still likely to be mutated. One should, under most circumstances, only construct a PDA when you know that the data is effectively static.
The value of PDA's is that they cache information about categorical data. For now, PDA's cache only information about the unique values found in the full array. But I'd like to propose adding more information.
In addition to caching the unique values, I would argue for maintaing a count of the number of times that each value occurs. This would make it easier to repeatedly use PDA's in cross tabulations. In addition, we can use a number of counts of each unique value to ensure that the PDA's pool never contains entries that are no longer present in the PDA's refs.
Right now the ordering of levels in the pool is something one has to guess at. Although we thankfully construct PDA's such that the pool is ordered, we don't maintain this invariant when values are inserted or deleted from the PDA. I would argue that we should guarantee that all operations on PDA's leave the pool ordered.
We shouldn't require that users ever call compact. Instead, we should revert our earlier stance and ensure, on every operation, that PDA's are represented in maximally compact form.
We should implement an ordering option for PDA's by adding two new fields to the data structure: ordered::Bool and order::Vector{Uint64}, which define an ordering over the pool by mapping each item of the pool to an integer that matches the rank of the associated level in the ordering over categories.
We should maybe remove/rename reorder so that we only have a setorder! function, which takes in a new ordering of the pool as a vector. This would let you take something like pda = @pdata(["a", "b", "c"]) and reorder it as setorder!(pda, ["b", "c", "a"]).
We should never leak the numeric representation of the refs to the outside world. This means that get_indices and similar functions should not exist.

The text was updated successfully, but these errors were encountered:

johnmyleswhite · 2014-01-05T21:38:32Z

What this proposal doesn't mention, because I want to debate it in a separate issue, is the idea that we should define factor(x) in a model formula in terms of the output of unique(x). For PDA's, this is cached. For DA's, this is calculated at model matrix construction time. For streaming data, the unique values must be specified upfront.

nalimilan · 2014-01-06T09:28:21Z

Sounds like a plan, though here are a few points I'd like to discuss:

PDA's shouldn't be used as a workhorse data structure for data that is still likely to be mutated. One should, under most circumstances, only construct a PDA when you know that the data is effectively static.

Sounds like too restrictive to me. When you have large data consisting of strings, storing it as a vector of strings takes much more memory than pooling them, so it should be possible to load it as a PooledDataVector even if you still need to recode things. Moreover, I don't think the distinction between "in flux" and "stable" is very practical: this would mean that once your data is pooled, you have to reload it in another form to make a few changes when you detect problems/find new ideas?

Anyway, I don't understand why you say pooled data should be stable. What's the technical constrain which imposes that?

The value of PDA's is that they cache information about categorical data. For now, PDA's cache only information about the unique values found in the full array. But I'd like to propose adding more information.
In addition to caching the unique values, I would argue for maintaing a count of the number of times that each value occurs. This would make it easier to repeatedly use PDA's in cross tabulations. In addition, we can use a number of counts of each unique value to ensure that the PDA's pool never contains entries that are no longer present in the PDA's refs.

I'm not sure this adds much value. The one-way frequency table does not help you in any way to make a multidimensional table. And it imposes a non-negligible cost everytime you take a subset of the vector (and this would be worse with DataFrames with many variables), which may make Julia feel slow. This behavior would also be a problem if DataFrames were backed by mmapped files or by a database: subsetting would impose reloading all of the data into memory.

Being able to automatically drop empty levels is an important feature to me, but it could be done only when taking a subset, and as an option.

johnmyleswhite · 2014-01-06T19:51:40Z

You're right: the framing of "don't mutate PDA's" was too strong. What I want is to get everyone to agree that PDA's offer a performance trade-off similar to that offered by indexed columns in an RDBMS. You can do lots of cool things faster when you have an index for a database, but you pay a price in the speed of insertions and deletions. PDA's would offer such a trade-off: they perform insertions and deletions a little slower, but give you lots of useful cached information in exchange.

I'm not sure if the subsetting concern is going to be a problem for long, since the long-term plan has always been to make subsetting operations on DataArray's and DataFrame's return SubDataArray and SubDataFrame structures once Base switches to producing array views by default. Since the Sub* structure is just a mapping from indices in the subset to indices in the original, I don't see any need to copy anything about the counts. The costly copy only happens when you want an independent copy of the subset, which is something you generally don't need.

nalimilan · 2014-01-06T20:12:45Z

Yeah, but if people mostly work with SubDataArrays, and counts are not updated in that case, they are of a very limited use. ;-)

johnmyleswhite · 2014-01-06T20:16:00Z

You mean when they call countmap? That's the only case I'm seeing right now in which we'd use the counts to do anything other than prune the pool. Seems like it would be no slower than using countmap on DataArray.

nalimilan · 2014-01-06T21:01:59Z

Sorry, I don't get what you mean. ;-)

johnmyleswhite · 2014-01-06T21:04:23Z

lindahua · 2014-01-06T21:12:35Z

Wow, I like the picture 👍

nalimilan · 2014-01-06T21:26:52Z

Now I'm totally lost!

johnmyleswhite · 2014-01-07T04:43:30Z

What specifically are you lost about? Because I'm lost about what you're lost about.

nalimilan · 2014-01-07T08:43:40Z

Just joking, but I simply did not understand your question in #50 (comment).

johnmyleswhite · 2014-01-07T16:06:47Z

I just wanted to ask if there were cases in which we would use the proposed counts field other than maintaining referential integrity of the pool field -- except for the case of someone asking exactly for the counts in a tabulation.

nalimilan · 2014-01-07T16:46:50Z

I see. Well, precisely, I don't think it could be used in any other case -- that's why I think it's not worth keeping around.

johnmyleswhite · 2014-01-07T16:47:34Z

But it is worth keeping around: it lets you maintain referential integrity.

nalimilan · 2014-01-07T16:53:46Z

Well, once you've removed empty levels, you no longer need it, do you?

johnmyleswhite · 2014-01-07T16:54:42Z

Yes, but this is the proposed mechanism by which you know when a level is empty. This is basically the reference counting approach used in many garbage collectors.

garborg · 2014-01-25T19:36:04Z

There's a little uncertainty about how PooledDataArrays with numeric eltypes should be interpreted, here and #46 -- should they be treated like a DataArray or like categorical data?

Will people still use PooledDataArrays to store numbers more compactly if they're Uint32 under the hood (#57)?

If so, it would be nice if neither use required remembering to convert (/ functions taking extra agruments to know) each time the array is used. As nominal and ordinal pooled data are living under the same type, perhaps adding categorical::Bool or replacing ordered with scale.

If not, maybe treat them as categorical at least until more compact storage comes back after core functionality and benchmarking are in place. (Fits with my preference for the default -- that using the same clean syntax to pool a string and to pool an integer works, and that it's seems more useful to be explicit when pooling for performance.)

johnmyleswhite · 2014-01-26T02:13:04Z

I've just had an epiphany about PDA's. And now I'm resolutely opposed to their existence and think we need something totally different from our current implementation. My proposed changes are also not going to save us from the problem I've discovered.

The problem is that PDA's, as we've thought about them up till now, are subtly corrupted by our familiarity with R's type system. But R's type system is fundamentally different from Julia's type system, because R has no scalar types at all.

And the implications of that statement are actually a disaster waiting to happen for us if we ever implement ordered categorical data.

The problem is that we can never hope to define an ordering on PDA's that works in a reliable way in Julia.

The reason for that is that R's factors are always vectors and, importantly, a single element of R factor is itself a length-1 R factor vector.

As such, simple things like factor[index1] > factor[index2] always have access to all of the properties of the factor in R. In contrast, single-element indexing in Julia will always produce a single scalar element that loses all access to the vector from which it was extracted.

As such, every single time you select a scalar element from a PDA, you lose access to all information about the ordering defined by the PDA. This will be a nightmare for us at some point.

To see why, imagine that you have a PooledDataArray{UTF8String} and that you then define an ordering over its levels. As one would hope, the ordering we define can affect computations on the vector as a whole, but will never apply to the single elements of the PDA when they're considered in isolation.

For example, if you defined the ordering over the two levels, ["A", "B"] so that B < A held in the vector, you would get totally broken behavior for comparison operations every time you pulled elements out of this PDA, because "A" < "B" is a property of strings that we can't just override at will.

So what should we do? I now think that we should abandon PDA's completely. The way to provide both minimum memory usage, representation of categorical data with unobserved levels and custom orderings of scalars is to use vectors whose elements are instances of a new kind of Enum type. The Enum type could have all the properties we'd like to define and would not be subtly broken the way ordered PDA's would be.

That said, I think we could still make PDA's useful. But I think they'd be most useful if we used them as a tool for providing O(1) calculations of unique and levels. That might make me them of such rare and obscure use that we shouldn't spend our time maintaining them, but it's at least one possible use case for that kind of data structure.

But, if we ever want to get ordered factors right, we need to start baking the levels of ordered categorical data into the Julia type system, which means using something like Enum's and not using something like PDA's.

Just like I've come to realize that emulation of delayed evaluation is hard to get right in Julia, I think working with ordered categorical data in Julia requires new abstractions that are further removed from R's precedent than our current implementation.

lindahua · 2014-01-26T03:38:32Z

I did not actively contribute to the DataArray/DataFrames package, but I often keep an eye on the discussions here.

John, I agree with your point that an efficient & versatile Enum type is the correct way forward.

johnmyleswhite · 2014-01-26T03:39:34Z

Thanks, Dahua!

nalimilan · 2014-01-26T16:28:34Z

Very interesting considerations. ;-) You're right that the concept of an ordered/unordered set of levels is an enum in the language, so it makes sense to consider it as such. Using strings to represent these values is a habit we took from R (but also from most stats languages I know of).

So how would this work? I guess an enum type would be created on the fly?. And then how could you access the individual values of enum, e.g. how would you do pda .== "A"? with pda .== :A? or pda .== MyEnum.A? I reckon the enum shouldn't be named nor accessible independently of the PDA.

johnmyleswhite · 2014-01-26T16:37:23Z

I'm not sure what the status of Enum types is in Base, but there's been some recent discussion about making them first-class citizens. I think we need to wait to see how they get handled there before we could make a clear plan for moving forward.

Conditional on how the Base implementation works out, we would have to make new Enum types on the fly to represent categorical data. We'd need to pick an interface, but I think something like makefactor(DataArray, levels = [...], order = [...]) would let us convert a DataArray{T} into a DataArray{Enum{T}} pretty easily. That function would then have to do a whole bunch of stuff: define the Enum type (and, arguably, do so in the proper scope), define an ordering on the Enum or define the absence of an ordering, define conversions between types, etc.

Re pda .== "A", I think this shouldn't be too hard if the Enum is derived from UTF8String. You'd just provide a mechanism for going back and forth between UTF8String and the Enum internal encoding, which is going to be something like Uint8.

nalimilan · 2014-01-26T16:44:40Z

OK. So the only controversial point is actually whether the term factor should be retained or not. ;-)

johnmyleswhite · 2014-01-26T16:46:57Z

I don't have strong feelings about the best name. But factor already exists in Base Julia and we shouldn't pun against it, so I'd say that's a no-go.

nalimilan · 2014-01-26T18:21:41Z

Yeah, let's find a better name. I don't know where the S authors found this term...

Going back to enums: do you still want to support storing integers in PDAs? Julia enums may not support integer labels. This may also be a problem for any type which is not a string, e.g. Datetime objects. We would need either very flexible enums, or a way to map enum values to a pool of levels.

johnmyleswhite · 2014-01-26T18:26:48Z

I'm pretty sure enums are going to let you use any type, but let me know if that's not right.

nalimilan · 2014-01-26T20:26:31Z

Well, apparently the idea with enums is to associate symbols to integers. Then it's up to you to make anything with them. While you can easily convert symbols to strings, you cannot automatically convert them to any type: you'd need to store the objects in a pool and retrieve them using their integer index.

johnmyleswhite · 2014-01-26T20:31:03Z

That's a bummer. We could always say that you can only use categorical data with string labels, then do a string -> symbol conversion in the background. Arguably there's no real need to use integers as categorical variables since the categoricalness means that arithmetic on the categories doesn't make any sense. So we can just replace integers with the symbols that represent them.

nalimilan mentioned this issue Jan 26, 2014

enum support in base JuliaLang/julia#3080

Closed

johnmyleswhite mentioned this issue Jan 29, 2014

Metadata for columns and/or DataFrames JuliaData/DataFrames.jl#35

Closed

nalimilan mentioned this issue Feb 8, 2014

Replace PooledDataArray with NominalVariable/CategoricalVariable #73

Closed

nalimilan mentioned this issue May 14, 2017

Move CategoricalValue and CategoricalPool into separate package JuliaData/CategoricalArrays.jl#64

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposed PooledDataArray changes #50

Proposed PooledDataArray changes #50

johnmyleswhite commented Jan 5, 2014

johnmyleswhite commented Jan 5, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

lindahua commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

garborg commented Jan 25, 2014

johnmyleswhite commented Jan 26, 2014

lindahua commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

Proposed PooledDataArray changes #50

Proposed PooledDataArray changes #50

Comments

johnmyleswhite commented Jan 5, 2014

johnmyleswhite commented Jan 5, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 6, 2014

lindahua commented Jan 6, 2014

nalimilan commented Jan 6, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

nalimilan commented Jan 7, 2014

johnmyleswhite commented Jan 7, 2014

garborg commented Jan 25, 2014

johnmyleswhite commented Jan 26, 2014

lindahua commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014

nalimilan commented Jan 26, 2014

johnmyleswhite commented Jan 26, 2014