Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop treating DataFrames like matrices #484

Merged
merged 1 commit into from
Jan 22, 2014
Merged

Conversation

johnmyleswhite
Copy link
Contributor

A few weeks ago I had a very long conversation with Hadley Wickham where we debated the merits of allowing arithmetic operations on DataFrames and came to agree that it's not helpful to offer functions that work on some DataFrames (if they're all numeric), but fail on others.

This pull request attempts to push that logic to its final conclusion: DataFrames aren't matrices. They shouldn't be the objects of numeric computation and we shouldn't define operators on DataFrames that only really work for DataMatrix{T <: Number}.

Removing this functionality goes a long way towards simplifying our codebase and removing quirky features that we probably never use and would never encourage.

NB: This depends upon my previous pull request.

@johnmyleswhite
Copy link
Contributor Author

Thoughts on this, @kmsquire, @simonster, @garborg?

@kmsquire
Copy link
Contributor

So, I applaud the effort, and I think the implications aren't exactly clear to me yet.

On one hand, operations on individual columns are not affected, since columns are just DataArrays, and the relevant operations are still defined on DataArrays.

On the other hand, while removing quirky, unused features is good, I'm wondering if any nice, convenient features are being removed by this PR?

Anyway, since applying some operations to whole DataFrames (or even subsets of columns) simultaneously can still be useful, I think that we should provide map and map! for DataFrames.

@johnmyleswhite
Copy link
Contributor Author

Well, one argument that nothing worth keeping changed is that not a single line of any test file had to be changed except the test file filled with made up calls to these functions. It's a little hard to imagine anyone was writing *df and excepting it to be treated as the identity operator. Even the more plausible cases like df + df are arguably bad habits we can nip in the bud now. Here's how R handles that case, which I think isn't ideal, although it's certainly reasonable:

> df <- data.frame(A = c(1, 2))
> df + df
  A
1 2
2 4
> df <- data.frame(A = c("a", "b"))
> df + df
   A
1 NA
2 NA
Warning message:
In Ops.factor(left, right) : + not meaningful for factors
> ?data.frame
> df <- data.frame(A = c("a", "b"), stringsAsFactors = FALSE)
> df + df
Error in FUN(left, right) : non-numeric argument to binary operator

Implementing map and map! seems like a better approach. How would map differ from colwise? If not at all, maybe we should only offer map and map!? And what does map do if the function being passed in can't be successfully applied to one of the columns?

@kmsquire
Copy link
Contributor

I don't think map and map! would differ in behavior from colwise.

Thinking further, one operation I use frequently in pandas is df.apply(fn, axis=1), which applies a function to each column (for axis=1) or row (for axis=0). These currently seem to correspond to colwise and rowwise in DataFrames.jl, and would seem to correspond to the map idiom.

So perhaps colwise and rowwise could become map with a key work argument, or (less Julian) mapcols and maprows, with mutating ! variants in either case. Thoughts?

@kmsquire
Copy link
Contributor

And what does map do if the function being passed in can't be successfully applied to one of the columns?

Well, an error, of course! :-)

@kmsquire
Copy link
Contributor

Actually, for map!, that is a little more challenging.

I guess I would punt on map! for now.

@johnmyleswhite
Copy link
Contributor Author

Thinking about this more, I think map operating column-wise and row-wise may be difficult to get right. I may be missing a strategy for working around it, but I think row-wise map is permanently type-unstable, whereas column-wise map is always type-stable. Maybe that won't matter since the important work will happen elsewhere, but I'm learning to be cautious about this stuff as @simonster has found so many function with type-instability in the past.

As you note, map! seems to be likely to raise a lot of errors. It would be easier to get right in a future world in which Julia had annotated return types.

@kmsquire
Copy link
Contributor

One thought I had was to parametrize DataFrames by column type, so that when the column types are all the same, you have type stability. This makes it harder to add non-type conformant columns to a typed DataFrame though.

@johnmyleswhite
Copy link
Contributor Author

That's very similar to attempts by @simonster and @tshort to make DataFrames have a tuple-like type. It seems to be a hard problem to get right from my reading of Simon's efforts along this line.

@kmsquire
Copy link
Contributor

I agree that the generalization to tuple types is hard. I was proposing handling the easier case of parametrizing by one type which is the supertype of all column types. Generally, this would be Any, but could be easily specialized when working with a subset of columns.

@johnmyleswhite
Copy link
Contributor Author

Ah. You're proposing that row operations generate results using the typejoin of the column types. Seems like a reasonable start for sure, although I bet some people will be worried that the resulting DataFrame offers worse performance than the input DataFrame.

@johnmyleswhite
Copy link
Contributor Author

I'd like to put the colwise / map discussion in a new issue.

Setting aside how we handle that issue, I still really want to make this change. @simonster, are you cool with that? I really don't think we need to support taking the square root of a DataFrame.

@garborg
Copy link
Contributor

garborg commented Jan 22, 2014

This functionality seems convenient even on a heterogenous DataFrame (e.g.

#to_round = ....
rounded = [string(s, "Rdd") for s in to_round]
d[rounded] = round(d[to_round])

#or
d[to_round] = round(d[to_round], 3)

).

But regardless of how mapping will be handled, I'm for this pull request -- I'd rather loose the functionality than handle these functions differently than custom / other functions.

@johnmyleswhite
Copy link
Contributor Author

I'm a little confused about your example, @garborg. Isn't the code you've written just an instance of standard column indexing rules?

@@ -211,67 +78,9 @@ function isequal(df1::AbstractDataFrame, df2::AbstractDataFrame)
return true
end

for sf in scalar_comparison_operators
for sf in [:(==), :(!=), :(>), :(>=), :(<), :(<=)]
vf = symbol(".$sf")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be clearer as:

for vf in [:(.==), :(.!=), :(.>), :(.>=), :(.<), :(.<=)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@simonster
Copy link
Contributor

I would be happy to see these go. Revisiting this code makes me realize that our binary operators aren't actually checking that column names were identical. I'll open a new issue for this.

@johnmyleswhite
Copy link
Contributor Author

Ok. I'll make the changes you suggest and then merge this.

johnmyleswhite added a commit that referenced this pull request Jan 22, 2014
Stop treating DataFrames like matrices
@johnmyleswhite johnmyleswhite merged commit 64b18c0 into master Jan 22, 2014
@johnmyleswhite johnmyleswhite deleted the jmw/operators branch January 22, 2014 06:41
@garborg
Copy link
Contributor

garborg commented Jan 22, 2014

@johnmyleswhite The example that confused you -- it was a response to the comment that we don't need to support taking the square root of a DataFrame. The right hand sides are rounding an entire DataFrame even if the purpose of the full line of code is to update a subset of columns -- the code worked before this PR and doesn't now.

I was just acknowledging that it enabled a clean, concise syntax (I like clean and concise) -- just not the right one.

@johnmyleswhite
Copy link
Contributor Author

Ah. Sorry for not understanding your example.

Is there something you'd us to offer to make up for the change?

For me the trouble with the operators we've now removed is that they're conceptually murky: what really is the square root of a DataFrame? And, assuming you buy the old definition, why would sqrt work, but not a new user-created function defined on DataArrays?

@garborg
Copy link
Contributor

garborg commented Jan 22, 2014

No worries. And I agree with both of the troubles you raise, which is why I supported the PR.

map / colwise / rowwise may have to wear many hats, and I'm just hoping using them to assign to DataFrames ends up expressive.

@johnmyleswhite
Copy link
Contributor Author

I think we can make map style functions work well. Let's move discussion to #485.

@nalimilan
Copy link
Member

For reference, thanks to broadcasting, we can now support element-wise operations and interaction with matrix-like objects without having to reimplement an arbitrary subset of functions: #1840.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants