isequal(@data([2, NA]), @pdata([2, NA])) ? #46
Comments
Hmm. This is a tough one.
I guess I see two issues:
Maybe we should follow the example of
I think either The only time
(Sorry, iPad fail.) The proper behavior for isequal is kind of tricky. I do feel pretty strongly that it shouldn't return NA. The docs say that isequal roughly means that the objects print the same, and that is true even with NAs. I think if it returned NA, it would break DataArrays with NAs as Dict keys, and probably other things. That DataArrays and PooledDataArrays yield different models is a good point: even though they print the same, they are functionally different. I guess we should make PooledDataArrays not equal (in either the isequal or == sense) to anything that isn't also a PooledDataArray? Since DataArrays without NA should be functionally identical to other AbstractArrays, we can leave the other definitions as is.
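To make the Dict-key concern concrete, here is a small sketch in current Julia. It assumes that `missing` is an acceptable stand-in for `NA` and uses plain `Vector`s rather than DataArrays types, so it only illustrates the general `==` versus `isequal` contract, not the behavior of this package.

```julia
# Assumption: modern Julia's `missing` stands in for DataArrays' `NA`.
a = [2, missing]
b = [2, missing]

# `==` uses three-valued logic, so an unknown element makes the result unknown.
eq_result = (a == b)

# `isequal` always returns a Bool and treats `missing` as equal to `missing`.
iseq_result = isequal(a, b)

# Dict lookup relies on `hash` and `isequal`, so it needs a definite Bool answer.
d = Dict(a => "found")
lookup = get(d, b, "not found")

println(eq_result)    # missing
println(iseq_result)  # true
println(lookup)       # found
```

If `isequal` propagated the unknown value the way `==` does, the lookup in `d` would have no definite answer.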
That sounds good to me, but I agree that it's tricky.
Mixing the issues of
That seems in line with Simon's outline, and makes sense to me.
I agree with @nalimilan in thinking that the definitions should be equivalent to the output of something like the pseudocode
That's correct;
@JeffBezanson Could you add this description to
Is there agreement now that @nalimilan seemed to be leaning the other way in a comment on a DataFrames issue, but I feel pretty strongly that
Why should it return
If we stipulate that the only difference between DataArray and PooledDataArray is the way values are stored, e.g. If
That's a good point, @simonster. I would propose that automatic treatment as a factor in a model formula should depend only on the type of the contents of a container, not on the type of the container. So Personally, I don't want people to treat
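One way to read this proposal is as dispatch on the element type rather than on the container type. The sketch below assumes a hypothetical helper named `is_factor` and treats only string-like contents as categorical; neither the name nor the rule comes from DataArrays or DataFrames.

```julia
# Hypothetical helper: decide factor treatment from the element type only.
is_factor(x::AbstractVector{<:AbstractString}) = true   # strings are categorical
is_factor(x::AbstractVector) = false                    # anything else is not

strings = ["F", "M", "F"]
numbers = [0, 1, 0]

println(is_factor(strings))  # true
println(is_factor(numbers))  # false
```

Under such a rule, any container holding the same kind of values would be treated the same way, whatever its storage layout.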
That seems reasonable to me, and in that case I think the current behavior of
What do you think @garborg? This is a big design decision, so let's make sure we're all comfortable with it.
I'm strongly in favor of Taking the definition of @JeffBezanson above, a typical piece of code does not behave exactly the same when passed a DA and a PDA; the results are often the same, but not always (again, because of the ordering of the levels), and the implementations are often different (see frequency tables).
I think we all agree about I am having trouble wrapping my head around what function PDAs with non-string content will serve if PDAs aren't factors, especially because adding PDAs with ordered levels is on the roadmap. It seems that PDAs are meant as the type to represent ordinal data, and as an alternative to strings for representing categorical data, so I'm just missing in what sense having packages assume they are cardinal if not strings is safer/convenient/etc. EDIT: I've seen PDAs with non-string content put to good use in the DataFrames
I'm not sure the reference to The ordering of levels is an important distinction. That seems sufficient to me to justify making them not equal.
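The point about level ordering can be sketched with a toy pooled container. `ToyPooled` and `values_of` are hypothetical names used only for illustration, and this is not how PooledDataArray is implemented; it only shows that two pooled arrays can hold the same values while carrying different level orders.

```julia
# Hypothetical toy type, not the DataArrays implementation.
struct ToyPooled{T}
    refs::Vector{Int}   # for each element, an index into `pool`
    pool::Vector{T}     # the levels, in their stored order
end

# Expand a pooled array back into its plain values.
values_of(p::ToyPooled) = p.pool[p.refs]

x = ToyPooled([1, 2, 1], ["F", "M"])
y = ToyPooled([2, 1, 2], ["M", "F"])

same_values = values_of(x) == values_of(y)
same_levels = x.pool == y.pool

println(same_values)  # true
println(same_levels)  # false
```

Code that depends on the first level, such as choosing a reference category in a model, could therefore give different results for `x` and `y` even though they print the same values.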
I hadn't thought about order, but I agree that's a legitimate reason for
I'm a little queasy about the idea that the type of the levels of a PooledDataArray determines whether it is treated as a factor in linear predictor expressions. I do spend a lot of time trying to convince people coming from an SPSS/SAS background that it makes more sense to code a variable like gender with levels "F" and "M" rather than 0/1, so in that sense I encourage the use of strings as levels in a categorical variable. However, it is also common to have numbers, such as subject identifiers, as levels of a categorical variable, and there are good reasons (privacy) for using a neutral representation. It is, of course, possible to use the string representations of the subject identifiers as the levels of the categorical variable, but that level of subtlety will not make sense to many users. As indicated in the discussion, one can always use
I agree that considering PDAs as categorical data in models makes things simpler. What's the point of storing integers or floats in a PDA, if you don't consider them as categorical? Anyway, they'll have to be converted back to integers or floats when performing any analysis. OTC, we need a simple way of treating integer codes as categorical (think about mixed models with individual random or fixed effects), without calling But we're starting a different debate, probably better to move this elsewhere.
I think John mentioned the possibility of I'm okay with a system in which it is clear, say from the equivalent of R's
I was expecting it to be false (same for @data(1:3) == @pdata(1:3)), but I guess either way, you lose the ability to evaluate something easily.