Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove nrow/ncol #406

Closed
johnmyleswhite opened this issue Nov 11, 2013 · 29 comments
Closed

Remove nrow/ncol #406

johnmyleswhite opened this issue Nov 11, 2013 · 29 comments
Labels
Milestone

Comments

@johnmyleswhite
Copy link
Contributor

We should remove nrow and ncol. Everything should be done with size(df).

@milktrader
Copy link

So how will loop code look like?

for i in 1:nrow(df) => for i in 1:size(df)[1] ?

@milktrader
Copy link

btw, the vi code to replace nrow occurrences would be:

:%s/nrow(\(\w\+\))/size(\1)[1]

So I suppose I'll start migrating over sooner rather than later.

@johnmyleswhite
Copy link
Contributor Author

You should use for i in 1:size(df, 1).

@milktrader
Copy link

Okay, that's better.

:%s/nrow(\(\w\+\))/size(\1, 1)

@ararslan
Copy link
Member

I'm surprised this hasn't been done in the past 3 years since this issue was brought up. I definitely agree that we should be using size everywhere rather than nrow/ncol.

@andreasnoack
Copy link
Member

andreasnoack commented Aug 17, 2016

I'm not sure about this. I don't think it is a good idea to make DataFrames too similar to arrays. It's better to think of them as a table and therefore we might not want to follow the AbstractArray interface.

I should explain what I mean by better here. If a DataFrame is too similar to an array then users might start to use in the same way and that would give a bad experience with slow execution and uninferred return values.

@ararslan
Copy link
Member

Hm, that's a very good point.

@nalimilan
Copy link
Member

OTOH, nrow and ncol clearly use the matrix vocabulary. I think we should either move to size, or use a database vocabulary like nobs and nvar.

@phaverty
Copy link

nrow and ncol will be very familiar to people coming from R, who are likely
to make up a large proportion of the DataFrames users. I would recommend
leaving nrow an ncol in.

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty@gene.com

On Thu, Aug 18, 2016 at 8:11 AM, Milan Bouchet-Valat <
notifications@github.com> wrote:

OTOH, nrow and ncol clearly use the matrix vocabulary. I think we should
either move to size, or use a database vocabulary like nobs and nvar.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#406 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AH02KwsFj4d_ebDWSY0p-ArvnyJ2Pj3Dks5qhHYsgaJpZM4BMPWe
.

@nalimilan
Copy link
Member

@phaverty While we certainly take it into account, the fact that something exists in a language is never a sufficient argument by itself for including something in Julia. The most important question is whether it makes for a good and consistent design in Julia.

@ararslan
Copy link
Member

nobs/nvar seems like a good compromise to me between avoiding array-specific terminology and allowing clear, familiar names for accessing these dimensions.

@quinnj
Copy link
Member

quinnj commented Aug 18, 2016

I'm not a fan of nobs personally. I personally like size because of it's julian heritage at this point, I think there are enough differentiators from AbstractArray that DataFrames won't be confusing. I like that size also encourages getting the # of rows/columns in a single call as well. I vote we deprecate/remove nrow/ncol, and just keep size.

@johnmyleswhite
Copy link
Contributor Author

+1 for Jacob's proposal. Adding new names seems superfluous.

@ararslan
Copy link
Member

I'd be fine with that as well.

@milktrader
Copy link

Also not in favor af new names at this point. Just because ncol/nrow are used in R doesn't mean we have to be different. I prefer it slightly to size because it is more natural to a table vs array interface.

@andreasnoack
Copy link
Member

I think @nalimilan is right that nrow and ncol are as matrix like as size but I don't like the symmetry in size(DataFrame, 1) and size(DataFrame, 2) (not so speak about size(DataFrame, 7)==1?) . Pretending that the two dimensions are similar is not doing anybody a favor. If size is julian heritage so is mean(DataFrame, 2) which would be a really bad idea. My point is that we should try to communicate as much as possible, including the choice of function names, that DataFrames are not arrays. Finally, deprecating any or all of nrow, ncol and size shouldn't be disruptive. It would be a simple search/replace change and easy to add deprecation warnings for.

@johnmyleswhite
Copy link
Contributor Author

If this change is part of focusing on tables, should this really be part of the API at all? Computing nrow involves SELECT COUNT(1) FROM tbl, which isn't generally a near zero-cost operation.

@quinnj
Copy link
Member

quinnj commented Aug 18, 2016

^ that's a good point John. Currently in DataStreams for a Data.Schema, a Source or Sink is allowed to return -1 rows, which indicates an unknown number of rows (and Data.stream! methods need to be prepared to handle these cases).

@andreasnoack
Copy link
Member

which isn't generally a near zero-cost operation

Is that a requirement?

@johnmyleswhite
Copy link
Contributor Author

I don't know. It might be surprising to people if, say, the REPL blocked for a long time when you run nrow(tbl). If we end up supporting something like Hive, that operation might block for many minutes.

@andreasnoack
Copy link
Member

I'm also not sure. My gut feeling is that it's okay that the costs are higher for this operation but I think it's useful to include such large examples in the API discussion.

@quinnj
Copy link
Member

quinnj commented Sep 7, 2017

We should still deprecate nrow, ncol in favor of size.

@nalimilan nalimilan added this to the 0.11 milestone Sep 7, 2017
@nalimilan
Copy link
Member

Adding to milestone before we forget since that's easy to do.

@nalimilan
Copy link
Member

I just bumped into a previous discussion from... 2013:
https://groups.google.com/d/msg/julia-stats/2EzVzAtrP9Y/NY6dkX4gbHUJ

@Nosferican
Copy link
Contributor

+1 in favor of size.

@HarlanH
Copy link
Contributor

HarlanH commented Nov 16, 2017

from that 2013 discussion, reminder for this issue that size(df, :nobs) or size(df, :nrow) were also suggested, either instead of or in addition to size(df, 1).

@nalimilan nalimilan removed this from the 0.11 milestone Nov 23, 2017
@nalimilan nalimilan added this to the 0.12 milestone Nov 23, 2017
@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@nalimilan nalimilan modified the milestones: 0.12, 1.0 Jan 25, 2019
@bkamins
Copy link
Member

bkamins commented Jul 25, 2019

I am closing this as my feeling is that after the discussions nrow and ncol will stay (at least for 1.0).
The major reasons are:

  • row and column are natural names (even Base now uses eachrow and eachcol as do we + we have DataFrameRow type),
  • it is easy to learn to use nrow and ncol (many people know it from R - although it is a minor reason, it is shorter to type, anyone, even beginner, immediately will know what it means)
  • and it really does not make much harm that we have it IMO (it will not confuse anyone I think nor conflict with anything; even if we removed it we would keep a long-term deprecation).

Feel free to reopen it if you disagree (I just want to clean up pending issues where I feel a conclusion was reached).

@bkamins bkamins closed this as completed Jul 25, 2019
@quinnj
Copy link
Member

quinnj commented Jul 25, 2019

I support @bkamins here.

@ararslan
Copy link
Member

#i'mwithbogamil

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

10 participants