Stop treating DataFrames like matrices #484

johnmyleswhite · 2014-01-20T23:02:51Z

A few weeks ago I had a very long conversation with Hadley Wickham where we debated the merits of allowing arithmetic operations on DataFrames and came to agree that it's not helpful to offer functions that work on some DataFrames (if they're all numeric), but fail on others.

This pull request attempts to push that logic to its final conclusion: DataFrames aren't matrices. They shouldn't be the objects of numeric computation and we shouldn't define operators on DataFrames that only really work for DataMatrix{T <: Number}.

Removing this functionality goes a long way towards simplifying our codebase and removing quirky features that we probably never use and would never encourage.

NB: This depends upon my previous pull request.

johnmyleswhite · 2014-01-21T01:14:43Z

Thoughts on this, @kmsquire, @simonster, @garborg?

kmsquire · 2014-01-21T06:14:48Z

So, I applaud the effort, and I think the implications aren't exactly clear to me yet.

On one hand, operations on individual columns are not affected, since columns are just DataArrays, and the relevant operations are still defined on DataArrays.

On the other hand, while removing quirky, unused features is good, I'm wondering if any nice, convenient features are being removed by this PR?

Anyway, since applying some operations to whole DataFrames (or even subsets of columns) simultaneously can still be useful, I think that we should provide map and map! for DataFrames.

johnmyleswhite · 2014-01-21T06:22:01Z

Well, one argument that nothing worth keeping changed is that not a single line of any test file had to be changed except the test file filled with made up calls to these functions. It's a little hard to imagine anyone was writing *df and excepting it to be treated as the identity operator. Even the more plausible cases like df + df are arguably bad habits we can nip in the bud now. Here's how R handles that case, which I think isn't ideal, although it's certainly reasonable:

> df <- data.frame(A = c(1, 2))
> df + df
  A
1 2
2 4
> df <- data.frame(A = c("a", "b"))
> df + df
   A
1 NA
2 NA
Warning message:
In Ops.factor(left, right) : + not meaningful for factors
> ?data.frame
> df <- data.frame(A = c("a", "b"), stringsAsFactors = FALSE)
> df + df
Error in FUN(left, right) : non-numeric argument to binary operator

Implementing map and map! seems like a better approach. How would map differ from colwise? If not at all, maybe we should only offer map and map!? And what does map do if the function being passed in can't be successfully applied to one of the columns?

kmsquire · 2014-01-21T15:18:16Z

I don't think map and map! would differ in behavior from colwise.

Thinking further, one operation I use frequently in pandas is df.apply(fn, axis=1), which applies a function to each column (for axis=1) or row (for axis=0). These currently seem to correspond to colwise and rowwise in DataFrames.jl, and would seem to correspond to the map idiom.

So perhaps colwise and rowwise could become map with a key work argument, or (less Julian) mapcols and maprows, with mutating ! variants in either case. Thoughts?

kmsquire · 2014-01-21T15:22:00Z

And what does map do if the function being passed in can't be successfully applied to one of the columns?

Well, an error, of course! :-)

kmsquire · 2014-01-21T15:24:47Z

Actually, for map!, that is a little more challenging.

I guess I would punt on map! for now.

johnmyleswhite · 2014-01-21T15:56:30Z

Thinking about this more, I think map operating column-wise and row-wise may be difficult to get right. I may be missing a strategy for working around it, but I think row-wise map is permanently type-unstable, whereas column-wise map is always type-stable. Maybe that won't matter since the important work will happen elsewhere, but I'm learning to be cautious about this stuff as @simonster has found so many function with type-instability in the past.

As you note, map! seems to be likely to raise a lot of errors. It would be easier to get right in a future world in which Julia had annotated return types.

kmsquire · 2014-01-21T17:44:18Z

One thought I had was to parametrize DataFrames by column type, so that when the column types are all the same, you have type stability. This makes it harder to add non-type conformant columns to a typed DataFrame though.

johnmyleswhite · 2014-01-21T17:46:42Z

That's very similar to attempts by @simonster and @tshort to make DataFrames have a tuple-like type. It seems to be a hard problem to get right from my reading of Simon's efforts along this line.

kmsquire · 2014-01-21T17:53:57Z

I agree that the generalization to tuple types is hard. I was proposing handling the easier case of parametrizing by one type which is the supertype of all column types. Generally, this would be Any, but could be easily specialized when working with a subset of columns.

johnmyleswhite · 2014-01-21T17:56:08Z

Ah. You're proposing that row operations generate results using the typejoin of the column types. Seems like a reasonable start for sure, although I bet some people will be worried that the resulting DataFrame offers worse performance than the input DataFrame.

johnmyleswhite · 2014-01-22T02:56:26Z

I'd like to put the colwise / map discussion in a new issue.

Setting aside how we handle that issue, I still really want to make this change. @simonster, are you cool with that? I really don't think we need to support taking the square root of a DataFrame.

garborg · 2014-01-22T03:31:37Z

This functionality seems convenient even on a heterogenous DataFrame (e.g.

#to_round = ....
rounded = [string(s, "Rdd") for s in to_round]
d[rounded] = round(d[to_round])

#or
d[to_round] = round(d[to_round], 3)

).

But regardless of how mapping will be handled, I'm for this pull request -- I'd rather loose the functionality than handle these functions differently than custom / other functions.

johnmyleswhite · 2014-01-22T05:38:35Z

I'm a little confused about your example, @garborg. Isn't the code you've written just an instance of standard column indexing rules?

simonster · 2014-01-22T06:12:08Z

src/operators.jl

@@ -211,67 +78,9 @@ function isequal(df1::AbstractDataFrame, df2::AbstractDataFrame)
    return true
 end

-for sf in scalar_comparison_operators
+for sf in [:(==), :(!=), :(>), :(>=), :(<), :(<=)]
    vf = symbol(".$sf")


This might be clearer as:

for vf in [:(.==), :(.!=), :(.>), :(.>=), :(.<), :(.<=)]

simonster · 2014-01-22T06:28:18Z

I would be happy to see these go. Revisiting this code makes me realize that our binary operators aren't actually checking that column names were identical. I'll open a new issue for this.

johnmyleswhite · 2014-01-22T06:29:22Z

Ok. I'll make the changes you suggest and then merge this.

Stop treating DataFrames like matrices

garborg · 2014-01-22T13:06:52Z

@johnmyleswhite The example that confused you -- it was a response to the comment that we don't need to support taking the square root of a DataFrame. The right hand sides are rounding an entire DataFrame even if the purpose of the full line of code is to update a subset of columns -- the code worked before this PR and doesn't now.

I was just acknowledging that it enabled a clean, concise syntax (I like clean and concise) -- just not the right one.

johnmyleswhite · 2014-01-22T16:52:04Z

Ah. Sorry for not understanding your example.

Is there something you'd us to offer to make up for the change?

For me the trouble with the operators we've now removed is that they're conceptually murky: what really is the square root of a DataFrame? And, assuming you buy the old definition, why would sqrt work, but not a new user-created function defined on DataArrays?

garborg · 2014-01-22T18:05:29Z

No worries. And I agree with both of the troubles you raise, which is why I supported the PR.

map / colwise / rowwise may have to wear many hats, and I'm just hoping using them to assign to DataFrames ends up expressive.

johnmyleswhite · 2014-01-22T18:07:17Z

I think we can make map style functions work well. Let's move discussion to #485.

nalimilan · 2019-06-21T19:03:47Z

For reference, thanks to broadcasting, we can now support element-wise operations and interaction with matrix-like objects without having to reimplement an arbitrary subset of functions: #1840.

simonster reviewed Jan 22, 2014
View reviewed changes

Stop treating DataFrames like matrices

e46286a

johnmyleswhite added a commit that referenced this pull request Jan 22, 2014

Merge pull request #484 from JuliaStats/jmw/operators

64b18c0

Stop treating DataFrames like matrices

johnmyleswhite merged commit 64b18c0 into master Jan 22, 2014

johnmyleswhite deleted the jmw/operators branch January 22, 2014 06:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop treating DataFrames like matrices #484

Stop treating DataFrames like matrices #484

johnmyleswhite commented Jan 20, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

kmsquire commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

simonster Jan 22, 2014

johnmyleswhite Jan 22, 2014

simonster commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

nalimilan commented Jun 21, 2019

Stop treating DataFrames like matrices #484

Stop treating DataFrames like matrices #484

Conversation

johnmyleswhite commented Jan 20, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

kmsquire commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

kmsquire commented Jan 21, 2014

johnmyleswhite commented Jan 21, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

simonster Jan 22, 2014

Choose a reason for hiding this comment

johnmyleswhite Jan 22, 2014

Choose a reason for hiding this comment

simonster commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

garborg commented Jan 22, 2014

johnmyleswhite commented Jan 22, 2014

nalimilan commented Jun 21, 2019