Consistent column naming scheme #458

Closed
mrocklin opened this issue Aug 12, 2014 · 20 comments

@mrocklin (Member)

What should the following return?

by(t, t.amount > 0, t.id.count()).columns

So two questions:

  1. What is the name of t.amount > 0
  2. What is the name of t.id.count()

Currently t.amount > 0 has no name, which is a problem. Pandas would name it 0, 1, ....

Currently t.id.count() has the name id_count.

@talumbau (Contributor)

My answer would be "a Table with schema {0: bool, id_count: int}"

@aterrel (Contributor) commented Aug 12, 2014

I would avoid 0 for the first. I would also want to be consistent, so that id_count is referenced by the same mechanism as t.amount.

Perhaps "_0" could work?

@chdoig (Member) commented Aug 12, 2014

I'm not sure if I understand the syntax correctly.
Is the example equivalent to the SQL statement:

SELECT COUNT(id) AS id_count FROM t
WHERE amount > 0;

="Count how many ids have amount greater than 0"

In SQL that statement, would return a table with schema {id_count: int} and will just have one row.
My question is, in which case you'll have more than a row? And how would you write that in Blaze?

@cpcloud (Member) commented Aug 12, 2014

Currently t.amount > 0 has no name, which is a problem. Pandas would name it 0, 1, ....

Actually, pandas uses some heuristics to name things (see the sketch after this list):

  • (Series a, Scalar s) -> Series: the result's name is a.name
  • (Series a, Series b) -> Series: the result's name is None if a.name != b.name, otherwise a.name
  • (Series a).method() -> Scalar: the result has no name
  • Constructing a Series with no explicit name does not number anything; the name is None
  • Constructing a DataFrame without an explicit columns argument names each column from np.arange(len(df.columns))
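
For orientation, a small sketch of those rules in plain pandas (the series names 'amount' and 'other' are just placeholders):

    import pandas as pd

    a = pd.Series([1, 2, 3], name='amount')
    b = pd.Series([4, 5, 6], name='other')

    (a + 1).name            # 'amount' -- (Series, Scalar) keeps the Series' name
    (a + b).name            # None     -- operand names differ
    (a + a).name            # 'amount' -- operand names agree
    a.count()               # 3        -- a plain scalar, no name attached
    pd.Series([1, 2]).name  # None     -- an unnamed Series stays unnamed
    pd.DataFrame([[1, 2], [3, 4]]).columns.tolist()  # [0, 1] -- default integer column labels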

@mrocklin (Member, Author)

The example is more like the following SQL:

SELECT count(id)
FROM table
GROUP BY amount > 0

So there is a boolean column for amount > 0 and an integer column for count(id).

@aterrel (Contributor) commented Aug 12, 2014

@cpcloud Thanks for the clarification.

If there is one thing I hate about Pandas, it's these heuristics. I've certainly heard from more than one person how hard it is to know exactly what you are indexing into. While the rules are well thought out, and if you understand Pandas fully you can work them out, it's also nice to have a simple, consistent rule that always works.

@mrocklin (Member, Author)

@cpcloud thanks for the explanation. Has this worked out well for Pandas? Any desire to change things?

What about for reductions like id.count()?

@chdoig (Member) commented Aug 12, 2014

So, the output table is always going to have two rows, True/False?

How are we going to handle missing data, where the condition can't be evaluated? Would you just not count those rows, so that the output table's True + False counts don't add up to the original total? Or would you add another row for "unknown", so that the total matches the total count in the original table?

The only general name that I can think of for that column is condition, e.g. {condition: bool, count: int}, but I'm not totally convinced... In simple cases we could use a mnemonic equivalent like amount_gt_0, but conditions can get complicated, and long column names are not desirable.

@mwiebe (Contributor) commented Aug 12, 2014

Do we need to add unnamed fields in datashape to support the None cases? Allow things like {named0: int, real, named1: string}, where the real field is unnamed?

@mrocklin (Member, Author)

@chdoig Here is the Pandas behavior; I'm not sure I agree with it:

In [10]: L = [(1, 'Alice', -100),
    ...:      (2, 'Bob', 100),
    ...:      (3, 'Charlie', None)]

In [11]: df = DataFrame(L, columns=['id', 'name', 'amount'])

In [12]: df.groupby(df.amount > 0)['name'].count()
Out[12]: 
amount
False     2
True      1
Name: name, dtype: int64

@mrocklin (Member, Author)

@mwiebe we are essentially dealing with unnamed fields. I guess the current approach is to find some default naming scheme. Perhaps that's missing a different option though.

@mrocklin (Member, Author)

Also @mwiebe see #380

@mwiebe (Contributor) commented Aug 12, 2014

I think supporting unnamed fields would be better than a default naming scheme like the one numpy uses. Besides losing the information that a field was unnamed, you can get some annoying edge cases:

In [9]: np.dtype([('f2', int), ('', float)])
Out[9]: dtype([('f2', '<i4'), ('f1', '<f8')])

In [10]: np.dtype([('f1', int), ('', float)])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-e206f1861865> in <module>()
----> 1 np.dtype([('f1', int), ('', float)])

ValueError: two fields with the same name

@cpcloud (Member) commented Aug 12, 2014

@aterrel

I've certainly heard from more than one person how hard it is to know exactly what you are indexing into.

Definitely have heard things along these lines too. In fact, when first learning pandas a while back, indexing semantics were the biggest hurdle to get over before becoming productive with pandas on a regular basis. Add multi-index slicing (new in 0.14.0, I think) and you have a recipe for mind-bending, parenthesis-laden deliciousness. I like that particular feature, but it's definitely a power-user, don't-try-this-at-home kind of feature.

That said, I'm not totally sure I understand how the operator name heuristics affect indexing semantics at large (obviously, "naming things" in general has a huge impact on indexing semantics). There are probably cases that aren't coming to mind at the moment.

@aterrel I would love to get some clarification from you and others on where this has been a pain point (an example may help); I think I'm just misunderstanding exactly what you mean.

@mrocklin

Has this worked out well for Pandas?

In general, yes. As a user, I've never had any major issues with how things are named.

However, there are certain minor annoyances, like this:

In [27]: df = pd.DataFrame(randn(5, 2),
    ...:                   index=pd.MultiIndex.from_arrays(([1, 2, 3, 4, 5], [1, 1, 2, 2, 3])))

In [28]: df
Out[28]:
          0       1
1 1 -0.3739  1.5099
2 1  0.5093 -0.9382
3 2 -0.3830  1.2816
4 2  1.6075 -0.8923
5 3 -0.0417 -1.4439

In [29]: df.reset_index()
Out[29]:
   level_0  level_1       0       1
0        1        1 -0.3739  1.5099
1        2        1  0.5093 -0.9382
2        3        2 -0.3830  1.2816
3        4        2  1.6075 -0.8923
4        5        3 -0.0417 -1.4439

But we haven't given a lot of thought to doing this differently, probably because there's usually only one more method call to get to something more reasonable. I honestly don't have a general sense of exactly where the naming heuristics happen. In this case it's

df.rename(columns={'level_0': 'awe_sum'})

Finally, I do think there's room for improvement for groupby ops, e.g. pandas-dev/pandas#7929, especially with respect to lambdas. I'm not a huge fan of the <type>_<op> syntax, but I think code inspection via the ast module is probably a bit overkill.

What about for reductions like id.count()?

Since this is a single column, it makes sense to just return a scalar. This is different from SQL, in which everything is a table. The "scalars are not tables" approach is a practical choice, most likely for compatibility with other scientific libraries (numpy, scipy, etc.) where similar computations on 1-d arrays yield scalars.

In the case of Blaze, it could be nice to have a scalar-like object that's totally transparent to the user but carries additional properties that make it easier to work with in expressions (e.g., for optimizations).
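
Purely as an illustration of that idea (nothing like this exists in Blaze; the class name and fields are invented):

    class NamedScalar(object):
        """A hypothetical scalar-like wrapper: behaves like the underlying value
        but carries expression metadata (a name, the originating expression)
        that downstream code could use for naming or optimization."""
        def __init__(self, value, name=None, expr=None):
            self.value = value
            self.name = name    # e.g. 'id_count'
            self.expr = expr    # the expression that produced the value

        def __repr__(self):
            return repr(self.value)      # prints like an ordinary scalar

        def __eq__(self, other):
            return self.value == other   # compares like an ordinary scalar

        def __int__(self):
            return int(self.value)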

@chdoig (Member) commented Aug 13, 2014

@mrocklin I don't agree with that behavior either... The only valuable output is the True count, which is then the same as doing a WHERE. I don't think that behavior is desirable for a GROUP BY. The False count is actually False + unknown, which can be confusing or misleading in any analysis.

For me, a GROUP BY on a condition should give one of the two alternatives I mentioned:

  • Just True and False, without taking the unknowns into account.
  • True, False, and NA (a separate group for the unknowns).

In R, you get this output:

    df <- data.frame(y = c(TRUE, FALSE, NA, TRUE, TRUE))

    > summary(df)
          y
     Mode :logical
     FALSE:1
     TRUE :3
     NA's :1

I think that if Blaze is aiming for abstract computation that is independent of the backends, we should aim for something general enough, rather than a particular backend's behavior whose output we aren't comfortable with. I think we can get the two alternatives I presented from Pandas quite easily (see the sketch below).
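
A minimal pandas sketch of those two alternatives, using the example data from earlier in the thread (the 'unknown' label is just illustrative):

    import pandas as pd

    df = pd.DataFrame([(1, 'Alice', -100), (2, 'Bob', 100), (3, 'Charlie', None)],
                      columns=['id', 'name', 'amount'])

    # Alternative 1: only rows where the condition can actually be evaluated
    valid = df[df.amount.notnull()]
    valid.groupby(valid.amount > 0)['name'].count()   # False: 1, True: 1

    # Alternative 2: a separate group for the unknowns
    # (keys as strings so the three groups sort cleanly)
    key = df.amount.apply(lambda x: 'unknown' if pd.isnull(x) else str(x > 0))
    df.groupby(key)['name'].count()                   # False: 1, True: 1, unknown: 1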

@chdoig (Member) commented Aug 13, 2014

It's a philosophical question: an open-world assumption with three-valued logic, or a closed-world assumption with binary logic?

In formal logic, the open-world assumption is the assumption that the truth value of a statement is independent of whether or not it is known by any single observer or agent to be true. It is the opposite of the closed-world assumption, which holds that any statement that is not known to be true is considered false.

Under the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system are considered unknown, rather than wrong or false.

@chdoig (Member) commented Aug 13, 2014

Some information on 3VL in SQL:
http://en.wikipedia.org/wiki/SQL#Null_and_three-valued_logic_.283VL.29

SQL:1999 also introduced BOOLEAN type variables, which according to the standard can also hold Unknown values. In practice, a number of systems (e.g. PostgreSQL) implement the BOOLEAN Unknown as a BOOLEAN NULL.

@mrocklin (Member, Author)

I think we should move the 3VL discussion to a separate e-mail thread; it is orthogonal to the column naming issue.

@mrocklin (Member, Author)

I've merged the (Series, Scalar) -> Series rule from Pandas. This cleans up a lot of the recent troubling cases.

In [1]: from blaze import *

In [2]: t = TableSymbol('t', schema='{x: int, y: int}')

In [3]: t.x.name
Out[3]: 'x'

In [4]: (t.x + 1).name
Out[4]: 'x'

@mrocklin (Member, Author)

This seems relatively stable for now. Closing.
