Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate length, nrow, and ncol on DataFrames in favor of size. Fixe… #1224

Closed
wants to merge 3 commits into from

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Sep 8, 2017

…s #1200

@nalimilan
Copy link
Member

+1 for deprecating nrow and ncol, but I'm not so sure about length. As noted in #1200, it's consistent with df[i], so as long as we support doing that I'd rather keep length.

@garborg
Copy link
Contributor

garborg commented Sep 8, 2017

I'd have thought deprecating size would have been more likely at this point.

I thought the move to treat DataFrames as matrix-like had fizzled out. length and ncols, for example, would be more of a fit for DataFrames that are conceptually a collection of namedtuples.

@quinnj
Copy link
Member Author

quinnj commented Sep 8, 2017

I think there's been more consensus recently to go with size as a more "Julian" idiom rather than as a reason to treat DataFrames as matrix-like, though there's also JuliaLang/julia#22665, where it was debated to change size to shape. We'd probably want to switch to shape if that ends up happening in Base (though it's unclear if anything will get changed there).

I don't see what the big fuss is about deprecating length(df); at this point, it's just the more confusing thing between length(df) and df[1], mainly, IMO, because df[:col] has been such a staple in DataFrames for a long time. It can also be the "step 1" in transitioning DataFrame to be more of a "bag of namedtuples" representation rather than the bag-of-columns type of definition it's always had.

@nalimilan
Copy link
Member

I don't see what the big fuss is about deprecating length(df); at this point, it's just the more confusing thing between length(df) and df[1], mainly, IMO, because df[:col] has been such a staple in DataFrames for a long time. It can also be the "step 1" in transitioning DataFrame to be more of a "bag of namedtuples" representation rather than the bag-of-columns type of definition it's always had.

I'm not sure what you mean here. Can you elaborate?

ncol(df::DataFrame) = length(index(df))
Base.size(df::DataFrame) = ((length(df.columns) > 0 ? length(df.columns[1]) : 0),
length(index(df)))
Base.size(df::DataFrame, i::Int) = i == 1 ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be cleaner with an if... else block (with a long-form function definition), the indentation a bit weird here. BTW, could print "$i" in the error message, that's always useful to understand what went wrong.

@garborg
Copy link
Contributor

garborg commented Sep 8, 2017

@quinnj I agree with the issues you brought up about length being problematic as a replacement for nrows. Take back my suggestion of using ncols and length, and replace it with using ncols and nrows (or whatever else).

size would be more julian if it wasn't a pun, but I don't see size(::Collection{NamedTuple}, 2) returning the number of fields the elements have for other collections of records.

Leaning solely on a couple (maybe gross) non-base names (like ncols), would make the next step easier if it turns out that this change is is not the end state.

Six in one hand, half dozen in the other -- just wanted to point out that size seems really punny to me, as I was MIA during #1200 and missed pointing out then that the punniness is why #406 never happened in the couple years after the issue was opened.

@rofinn
Copy link
Member

rofinn commented Sep 8, 2017

FWIW, I like just having size cause it's simple and intuitive. I've found nrow and eachrow hard to remember in the julia ecosystem (e.g., length(rows(df)) and length(columns(df)) always seemed more intuitive to me). As was discussed in #1200, it is unclear what length should do.

  • Columns: This is consistent with df[i], but I'd argue that we should drop df[i] since it isn't obvious whether that returns a column or row.
  • Rows: This is consistent if you view as dataframe as just a collection of rows (or named tuples), but that seems better handled by length(rows(df)).
  • Elements (e.g., length(df) = *(size(df)...)): Consistent with the array interface in base, but might not make sense for the table concept.

As a result, I'm pretty okay with not supporting any of the options for now.

@garborg
Copy link
Contributor

garborg commented Sep 8, 2017

size is consistent with how indexing into a cell with row & col happens, too, to argue against my previous comments.

@Nectarineimp
Copy link
Contributor

I haven't done much but lurk and submit one PR, but on this issue I will say this - nrow/ncol are nice, but separate operations. A data frame is not a matrix or an array. It has area, so to speak. size() gets you everything you need in an easy to parse result. From my experience, simpler and complete usually wins in design. Compatibility with other packages and aspects of Julia are, of course, major considerations. I am not sure there is enough concern here about compatibility to disqualify size, no?

@quinnj quinnj closed this Sep 11, 2017
@nalimilan
Copy link
Member

@quinnj How about merging the deprecation of nrow and ncol now? Then we can discuss whether to keep length or not.

@nalimilan nalimilan reopened this Sep 11, 2017
@quinnj
Copy link
Member Author

quinnj commented Sep 11, 2017

Meh, if JuliaLang/julia#22665 settles on size vs. shape then I think it's worth deprecating, but until that's settled, I'm not sure it's worth another potential deprecation later.

@nalimilan
Copy link
Member

Doesn't seem size is going to be deprecated. The only option that appears to be seriously considered is to replace indices with shape.

@coveralls
Copy link

coveralls commented Sep 11, 2017

Coverage Status

Changes Unknown when pulling ba396e6 on jq/len into ** on master**.

@randyzwitch
Copy link
Contributor

randyzwitch commented Sep 21, 2017

Might be late to the party on this, but I'm a fan of nrow, one because R does it, but more conceptually, almost every day I sample a DataFrame (or limit in select query), so I quite frequently think of "n(umber of) rows". Whereas I've never cared the number of columns, because they are named. Which is to say, I don't sample across columns or features (unless random forest is doing it for me), but limiting the number of rows is an obvious use case.

size to me feels much more of memory usage than the (# rows, #cols) tuple.

@bkamins
Copy link
Member

bkamins commented Feb 6, 2019

Now we have length deprecated.

So the decision is do we:

  1. retain nrow and ncol
  2. deprecate them in favor of size?

In both cases this PR should probably be closed, but if we want to deprecate them I would open a new PR only with this deprecation.

@nalimilan
Copy link
Member

I'd rather keep them. Since we've decided data frames are collections of rows, their dimensions are not really symmetric like matrices, so it would be weird to have to use size.

@bkamins
Copy link
Member

bkamins commented Feb 6, 2019

Fine with me.

@bkamins
Copy link
Member

bkamins commented Feb 6, 2019

So close this?

@nalimilan nalimilan closed this Feb 6, 2019
@nalimilan nalimilan deleted the jq/len branch February 6, 2019 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants