Consistent column naming scheme #458
My answer would be "a Table with schema
I would avoid 0 for the first. I would also want to be consistent, so that id_count would be referenced by a similar mechanism as t.amount. Perhaps
I'm not sure if I understand the syntax correctly.
In SQL, the statement "Count how many ids have amount greater than 0" would return a table with schema {id_count: int} and would have just one row.
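A quick way to check the claim above: a COUNT over a condition in SQL yields a one-row table whose column can be named explicitly. This is a hypothetical sketch using the standard-library sqlite3 module; the table and column names (t, id, amount, id_count) are taken from the thread's examples.

```python
import sqlite3

# in-memory table mirroring the thread's t = {id, name, amount} examples
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, -100), (2, 100), (3, 250)])

cur = conn.execute("SELECT count(id) AS id_count FROM t WHERE amount > 0")
rows = cur.fetchall()
print(rows)                              # a single row: [(2,)]
print([d[0] for d in cur.description])   # the schema: ['id_count']
```

The result is still a table (here, a list with one row), not a bare scalar, which is the SQL behavior being contrasted with pandas later in the thread.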
Actually, pandas uses some heuristics to name things.
The example is more like the following SQL: SELECT count(id). So there is a boolean column for
@cpcloud Thanks for the clarification. If there is one thing I hate about Pandas, it's these heuristics. I've certainly heard from more than one person how hard it is to know what exactly you are indexing into. While it is very well thought out, and if you understand Pandas fully you should be able to know, it's also nice to have a simple, consistent heuristic that always works.
@cpcloud thanks for the explanation. Has this worked out well for Pandas? Any desire to change things? What about for reductions like
So, the output table is always going to have two rows, True/False? How are we going to manage missing data, where you can't evaluate that statement? Would you just not count those rows, and then have the output table's count of True + False not equal to the original total count? Or would you add another row for "unknown", so the total count matches the total count in the original table? The only general name that I can think of for that column is
Do we need to add unnamed fields in datashape to support the None cases? Allow things like
@chdoig Here is Pandas' behavior; not sure I agree with it:

In [10]: L = [(1, 'Alice', -100),
   ....:      (2, 'Bob', 100),
   ....:      (3, 'Charlie', None)]

In [11]: df = DataFrame(L, columns=['id', 'name', 'amount'])

In [12]: df.groupby(df.amount > 0)['name'].count()
Out[12]:
amount
False    2
True     1
Name: name, dtype: int64
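For what it's worth, the example above folds Charlie's None into the False group, because NaN > 0 evaluates to False. A sketch of how a later pandas (1.1+, which added groupby's dropna= parameter, well after this thread) can keep the "unknown" group separate; the use of .mask to re-introduce NA into the key is my own workaround, not anything proposed in the thread:

```python
import pandas as pd

L = [(1, 'Alice', -100), (2, 'Bob', 100), (3, 'Charlie', None)]
df = pd.DataFrame(L, columns=['id', 'name', 'amount'])

# df.amount > 0 maps NaN to False; mask it back to NA where amount is unknown
key = (df.amount > 0).mask(df.amount.isna())

# dropna=False keeps the NA group as its own row
counts = df.groupby(key, dropna=False)['name'].count()
print(counts)   # one row each for False, True, and NaN
```

This gives the three-row output (True / False / unknown) that the question above asks about, with the group counts summing to the original row count.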
@mwiebe we are essentially dealing with unnamed fields. I guess the current approach is to find some default naming scheme. Perhaps that's missing a different option though. |
I think supporting unnamed fields would be better than a default naming scheme like numpy does. You can get some annoying edge cases, in addition to losing the information that a field was unnamed:
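To illustrate the numpy scheme being argued against: when struct fields are left unnamed, numpy invents default names, and the fact that they were unnamed is lost.

```python
import numpy as np

# two unnamed fields in a struct dtype
dt = np.dtype('i4,f8')

# numpy fills in default names f0, f1, ...; nothing records that
# the fields were originally unnamed
print(dt.names)   # ('f0', 'f1')
```

A field literally named 'f0' and a field that was auto-named 'f0' are indistinguishable afterwards, which is one of the edge cases the comment alludes to.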
definitely have heard things along this line too. in fact, when first learning pandas a while back, indexing semantics were the biggest hurdle to get over before getting to be productive with pandas on a regular basis. add multiindex slicing (new in 0.14.0 i think) and you have a recipe for mind-bending, parenthesis-laden deliciousness. i like that particular feature, but it's definitely a power-user-don't-try-this-at-home kind of feature. that said, not totally sure i understand how the operator works. @aterrel would love to get some clarification from you and others on where this has been a pain point (an example may help); i think i'm just misunderstanding exactly what you mean
in general, yes. as a user i've never had any major issues with how things are named. however there are certain things that are minor annoyances, like this:
but we haven't given a lot of thought to considering a different way of doing this, probably because there's usually only one more method call to get to something more reasonable. i honestly don't have a general sense of exactly where naming heuristics happen. in this case it's df.rename({'level_0': 'awe_sum'}). finally, i do think there's room for improvement for groupby ops, e.g. pandas-dev/pandas#7929, esp wrt lambdas. i'm not a huge fan of the
since this is a single column, it makes sense to just return a scalar. this is different from sql, in which everything is a table. the approach of "scalars are not tables" is a practical choice, most likely for compatibility with other scientific libs (numpy, scipy, etc.) where similar computations on 1d arrays yield scalars. in the case of blaze it could be nice to have a scalar-like object that's totally transparent to the user but may have additional properties that make it easier to work with in expressions (e.g., for optimizations)
@mrocklin I don't agree with that behavior either... The only valuable output is for the
For me, a GROUP BY CONDITION should be one of the two alternatives I mentioned:
In R, you get this output:

df <- data.frame(y = c(TRUE, FALSE, NA, TRUE, TRUE))
> summary(df)
     y
 Mode :logical
 FALSE:1
 TRUE :3
 NA's :1

I think that if we are going for Blaze doing abstract computation that is independent of the backends, we should aim for something general enough, and not just the particular behavior of a backend whose output we don't feel comfortable with. I think we can get the two alternatives I presented from Pandas quite easily.
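To back up the "quite easily" claim: pandas can produce an R summary()-like tally, including the NA count, with value_counts(dropna=False). This is a sketch using the same five values as the R example; dropna= here is the value_counts parameter, which has existed in pandas for a long time.

```python
import pandas as pd

y = pd.Series([True, False, None, True, True])

# dropna=False keeps the missing values as their own row, like R's NA's
tally = y.value_counts(dropna=False)
print(tally)   # True: 3, False: 1, NaN: 1
```

That is essentially the open-world output: three rows whose counts sum to the length of the original column.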
It's a philosophical question: an open-world assumption with three-valued logic, or a closed-world assumption with binary logic?
Some information on 3VL in SQL:
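For readers unfamiliar with three-valued logic in SQL: comparisons involving NULL evaluate to NULL (unknown), and a WHERE clause keeps only rows where the condition is TRUE, so a row can fall into neither a condition nor its negation. A minimal sketch via the standard-library sqlite3 module (the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (amount INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(-100,), (100,), (None,)])

# NULL = NULL is not TRUE -- it is unknown (NULL)
unknown = conn.execute("SELECT NULL = NULL").fetchone()[0]

pos = conn.execute(
    "SELECT count(*) FROM t WHERE amount > 0").fetchone()[0]
not_pos = conn.execute(
    "SELECT count(*) FROM t WHERE NOT (amount > 0)").fetchone()[0]

print(unknown)            # None -- the comparison itself is unknown
print(pos, not_pos)       # 1 1 -- the NULL row is in neither branch
```

Note that pos + not_pos is 2, not 3: the NULL row silently disappears from both branches, which is exactly the True + False != total situation raised earlier in the thread.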
I think we should move the 3VL discussion to a separate e-mail thread. I think that it is orthogonal to the column naming issue. |
I've merged this. Current behavior:

In [1]: from blaze import *

In [2]: t = TableSymbol('t', schema='{x: int, y: int}')

In [3]: t.x.name
Out[3]: 'x'

In [4]: (t.x + 1).name
Out[4]: 'x'
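The merged blaze behavior mirrors pandas' own name propagation, which may be easier to try today since blaze is no longer maintained. A pandas sketch of the same consistent heuristic (the Series here stands in for the t.x column above):

```python
import pandas as pd

x = pd.Series([1, 2, 3], name='x')

print(x.name)        # 'x'
print((x + 1).name)  # 'x' -- arithmetic on a named column keeps the name
```

This is the "simple consistent heuristic that always works" asked for earlier: an expression built from one named column inherits that column's name.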
This seems relatively stable for now. Closing. |
What should the following return?

t.amount > 0
t.id.count()

So two questions:

Currently t.amount > 0 has no name; this is an issue. Pandas would name it as 0, 1, ....
Currently t.id.count() has the name id_count.
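For reference, a hedged check of what current pandas does with the two expressions in question; behavior may well differ from the 2014-era pandas discussed in this thread, and the DataFrame contents are invented for illustration:

```python
import pandas as pd

t = pd.DataFrame({'id': [1, 2, 3], 'amount': [-100, 100, 250]})

# the comparison result is a Series that keeps the source column's name
print((t.amount > 0).name)   # 'amount'

# the reduction collapses to a bare scalar with no name at all
print(t.id.count())          # 3
```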