This repository has been archived by the owner on May 4, 2019. It is now read-only.

Operators for PooledDataArray? #67

Closed
simonster opened this issue Jan 29, 2014 · 15 comments

@simonster
Member

For unary operators, binary operators with scalar arguments, and some others (e.g. transpose, which I'm working on now), we could make specialized versions that operate substantially faster on PooledDataArray than the current implementations for AbstractDataArray. My questions are:

  1. Is it worth doing this? I suspect it is, since it's mostly just a change to a couple of macros and such operators could be useful in practice.
  2. If we have implementations for both DataArray and PooledDataArray, should we remove the implementation for AbstractDataArray? AFAICT we don't provide any other subtypes of AbstractDataArray in this package, and if someone wants to implement their own AbstractDataArray, it is highly likely that there would be a more efficient way of performing these operations than the generic implementation.
@johnmyleswhite
Member

I'm on board with using custom implementations of every function for both DA's and PDA's since I don't see any new AbstractDataArray types coming up soon.

@simonster
Member Author

Should e.g. +(PooledDataArray, AbstractArray) and +(PooledDataArray, DataArray) return PooledDataArrays or DataArrays?

@johnmyleswhite
Member

That's a really good question. I'd say those things should raise errors: once you assert your data is categorical, we turn arithmetic off.

@simonster
Member Author

I can see a legitimate use case for +(PooledDataArray, PooledDataArray) if you have multiple categorical variables but you want to do something with their sum. Not sure about mixed operations like those above, although I suppose they could arise if you have data in different forms.

@johnmyleswhite
Member

What's the legitimate use case for + on categorical variables? I'm pretty uncomfortable with the idea. I'd really like our approach to be consistent with the classical theory of measurement: http://en.wikipedia.org/wiki/Level_of_measurement#Nominal_scale

@nalimilan
Member

Well, for ordinal data in some cases this can make sense (think of a Likert scale). But I think it would be safer to require people to convert to integers explicitly. If you start implementing +, you'll have to support all operations that apply to reals.

@simonster
Member Author

Yes, I think this boils down to a question of what PooledDataArrays represent. Here are four possible answers:

  • PooledDataArrays are behaviorally identical to DataArrays, but with different storage characteristics. The subtype of AbstractDataArray has no relationship to the interpretation of the data it contains. In this case they should implement the same operators. (Currently they do, although the existing implementations are slow for PooledDataArrays.)
  • PooledDataArrays are behaviorally identical to DataArrays, but not intended for numeric values, since those can be stored more efficiently in DataArrays. (Generally true, but there are cases where PooledDataArrays could be more efficient: if you can use Uint8s in your reference array but your values span a larger range, or if you plan to apply scalar operators to your data, which only need to operate on the pool in the PooledDataArray case.)
  • PooledDataArrays represent variables on a nominal scale. In this case, no arithmetic or comparison operators should be defined.
  • PooledDataArrays represent variables on nominal or ordinal scales. In this case, comparison operators should be defined. The Wikipedia article @johnmyleswhite links to above suggests that no arithmetic operators should be defined. I have only superficial knowledge of the issues involved in classical theory of measurement, but an apparent counterpoint is Spearman's rho, which involves mathematical operations on ranks.

We should not worry about the cost of supporting operators for PooledDataArrays, since we have to support all operators for DataArrays anyway, and metaprogramming makes supporting all operators for PooledDataArrays almost as easy as supporting one.
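
The pool-only shortcut for scalar operators described in the list above can be sketched as follows. This is a minimal Python illustration, not the DataArrays.jl implementation: the `PooledArray` class, its `map_scalar` method, and the convention that reference 0 marks a missing value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class PooledArray:
    """Hypothetical pooled array: refs[i] is a 1-based index into pool, 0 means missing."""
    refs: List[int]
    pool: List[Any]

    @classmethod
    def from_values(cls, values: List[Optional[Any]]) -> "PooledArray":
        pool: List[Any] = []
        index = {}  # value -> 1-based position in pool
        refs: List[int] = []
        for v in values:
            if v is None:
                refs.append(0)  # assumed convention for a missing value
                continue
            if v not in index:
                pool.append(v)
                index[v] = len(pool)
            refs.append(index[v])
        return cls(refs, pool)

    def map_scalar(self, f: Callable[[Any], Any]) -> "PooledArray":
        # f is applied once per distinct value; the references are copied unchanged.
        return PooledArray(list(self.refs), [f(v) for v in self.pool])

    def to_list(self) -> List[Optional[Any]]:
        return [None if r == 0 else self.pool[r - 1] for r in self.refs]


pda = PooledArray.from_values([10, 20, 10, None, 20])
shifted = pda.map_scalar(lambda x: x + 1)  # two additions instead of four
```

If `f` is not injective, the new pool can contain duplicate values, so a real implementation would have to decide whether to deduplicate it afterwards.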

@nalimilan
Member

Yeah, we really need to decide what PDAs are.

Regarding ordinal scales, computing Spearman's rank correlation coefficient does not imply you are able to attribute a precise numeric value to a level: just that you know their order. So + does not apply in that case.

Anyway, I was not suggesting the problem was the cost of implementing operators, rather that you need to draw all the logical implications of adding +.

@johnmyleswhite
Member

As I said in another issue, I now quite strongly think that we should use an Enum-like type for processing categorical and ordinal data, then store those values in DataArray's. PDA's are an interesting idea, but not very valuable in the long run because of the existence of true scalars in Julia. The analogies with R that inspired PDA's were helpful, but inexact and shouldn't be part of our long-term strategy.

@nalimilan
Member

So that means PDAs would become an enum for ordinal/nominal variables, and so + wouldn't make sense? And then maybe another type will have to be added one day if it's deemed useful to store numeric values from a small set?

@simonster
Member Author

Would this mean that what are now PooledDataArrays could become DataArrays of Enums, and we could kill AbstractDataArrays entirely? I'd be pretty happy with that, and I think it could simplify many things. Another question this discussion brings up is whether we want a separate wrapper for ordinal types.

My point about Spearman's rho was that regardless of whether the original data was ordinal or interval scaled, the ranks are ordinal, but we would need to be able to perform + and - on them to calculate rho.

@johnmyleswhite
Member

PDA's will just go away in my ideal world. For any specific categorical variable, there would be a custom type, which could be stored in Array's or DataArray's.

I would oppose implementing + for these variables.

If you want to use integers, but only a small number of them, use Int8. No need for us to reinvent the wheel.

@simonster
Member Author

Sounds good to me.

@johnmyleswhite
Member

Yes, we'd only have DataArrays of Enums.

I think the Enum's we'd want would be either descendants of NominalVariable or OrdinalVariable.
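
As a rough illustration of that split, here is a Python sketch using the standard `enum` module. It is not a proposed Julia design: the base-class names are borrowed from the comment above, while `Color`, `Agreement`, and the rule that ordinal levels compare by their declared integer value are assumptions made for the example.

```python
from enum import Enum
from functools import total_ordering


class NominalVariable(Enum):
    """Hypothetical base: levels support equality only."""


@total_ordering
class OrdinalVariable(Enum):
    """Hypothetical base: levels are ordered, but no arithmetic is defined."""

    def __lt__(self, other):
        # Only levels of the same variable are comparable.
        if self.__class__ is other.__class__:
            return self.value < other.value
        return NotImplemented


class Color(NominalVariable):
    RED = 1
    GREEN = 2


class Agreement(OrdinalVariable):
    DISAGREE = 1
    NEUTRAL = 2
    AGREE = 3


# Ordering works for ordinal levels.
ordered = sorted([Agreement.AGREE, Agreement.DISAGREE, Agreement.NEUTRAL])

# Neither kind supports +; Python raises TypeError because no __add__ exists.
try:
    Agreement.AGREE + Agreement.NEUTRAL
except TypeError:
    pass
```

Values of either kind can then be stored in an ordinary container, which mirrors the idea of keeping them in plain Arrays or DataArrays.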

For calculating Spearman's rho, you map the ordered elements to the integers and then do arithmetic on the integers. So we don't need to implement anything more than the map from elements to integers.
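
A minimal Python sketch of that procedure, under stated assumptions: the helper names (`level_codes`, `average_ranks`, `spearman_rho`) are hypothetical, the level order is passed in explicitly as a list, and tied observations receive the average of their ranks before the Pearson correlation of the ranks is taken.

```python
from math import sqrt
from typing import Hashable, List, Sequence


def level_codes(values: Sequence[Hashable], levels: Sequence[Hashable]) -> List[int]:
    """Map each ordered level to an integer: the first level becomes 1, and so on."""
    code = {lvl: i for i, lvl in enumerate(levels, start=1)}
    return [code[v] for v in values]


def average_ranks(codes: Sequence[int]) -> List[float]:
    """Rank the codes from 1..n, giving tied codes the mean of their ranks."""
    order = sorted(range(len(codes)), key=lambda i: codes[i])
    ranks = [0.0] * len(codes)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and codes[order[j + 1]] == codes[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(xs, x_levels, ys, y_levels) -> float:
    # All arithmetic happens on integer-derived ranks, never on the levels.
    rx = average_ranks(level_codes(xs, x_levels))
    ry = average_ranks(level_codes(ys, y_levels))
    return pearson(rx, ry)


levels = ["low", "medium", "high"]
rho = spearman_rho(["low", "medium", "high"], levels,
                   ["high", "medium", "low"], levels)
```

The ordinal type itself only has to expose the level-to-integer map; `+` and `-` are used on the resulting numbers, not on the categorical values.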

@simonster
Member Author

If we're getting rid of PooledDataArrays, we obviously don't need operators for them, so I'm going to close this and open a new issue.


3 participants