Consistent column naming scheme #458

Closed
mrocklin opened this issue Aug 12, 2014 · 20 comments

@mrocklin (Member)

What should the following return?

by(t, t.amount > 0, t.id.count()).columns

So two questions:

  1. What is the name of t.amount > 0
  2. What is the name of t.id.count()

Currently t.amount > 0 has no name, which is a problem. Pandas would name it 0, 1, ....

Currently t.id.count() has the name id_count.

@talumbau (Contributor)

My answer would be "a Table with schema {0: bool, id_count: int}"

@aterrel (Contributor) commented Aug 12, 2014

I would avoid 0 for the first. I would also want to be consistent, so that id_count is referenced by the same mechanism as t.amount.

Perhaps "_0" could work?

@chdoig (Member) commented Aug 12, 2014

I'm not sure if I understand the syntax correctly.
Is the example equivalent to the SQL statement:

SELECT COUNT(id) AS id_count FROM t
WHERE amount > 0;

="Count how many ids have amount greater than 0"

In SQL that statement, would return a table with schema {id_count: int} and will just have one row.
My question is, in which case you'll have more than a row? And how would you write that in Blaze?

@cpcloud (Member) commented Aug 12, 2014

Currently t.amount > 0 has no name, which is a problem. Pandas would name it 0, 1, ....

Actually, pandas uses some heuristics to name things (see the sketch after this list):

  • (Series a, Scalar s) -> Series: the result's name is a.name
  • (Series a, Series b) -> Series: the result's name is None if a.name != b.name, otherwise a.name
  • (Series a).method() -> Scalar: the result has no name
  • Constructing a Series with no explicit name does not number anything; the name is None
  • Constructing a DataFrame without an explicit columns argument names each column from np.arange(len(df.columns))
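
For orientation, a small sketch of those rules in plain pandas (the series names 'amount' and 'other' are just placeholders):

    import pandas as pd

    a = pd.Series([1, 2, 3], name='amount')
    b = pd.Series([4, 5, 6], name='other')

    (a + 1).name            # 'amount' -- (Series, Scalar) keeps the Series' name
    (a + b).name            # None     -- operand names differ
    (a + a).name            # 'amount' -- operand names agree
    a.count()               # 3        -- a plain scalar, no name attached
    pd.Series([1, 2]).name  # None     -- an unnamed Series stays unnamed
    pd.DataFrame([[1, 2], [3, 4]]).columns.tolist()  # [0, 1] -- default integer column labels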

@mrocklin (Member, Author)

The example is more like the following SQL:

SELECT count(id)
FROM table
GROUP BY amount > 0

So there is a boolean column for amount > 0 and an integer column for count(id).

@aterrel (Contributor) commented Aug 12, 2014

@cpcloud Thanks for the clarification.

If there is one thing I hate about Pandas, it's these heuristics. I've certainly heard from more than one person how hard it is to know exactly what you are indexing into. While the rules are well thought out, and if you understand Pandas fully you can work them out, it's also nice to have a simple, consistent rule that always works.

@mrocklin (Member, Author)

@cpcloud thanks for the explanation. Has this worked out well for Pandas? Any desire to change things?

What about for reductions like id.count()?

@chdoig (Member) commented Aug 12, 2014

So, the output table is always going to have two rows, True/False?

How are we going to handle missing data, where the condition can't be evaluated? Would you just not count those rows, so that the output table's True + False counts don't add up to the original total? Or would you add another row for "unknown", so that the total matches the total count in the original table?

The only general name that I can think of for that column is condition, e.g. {condition: bool, count: int}, but I'm not totally convinced... In simple cases we could use a mnemonic equivalent like amount_gt_0, but conditions can get complicated, and long column names are not desirable.

@mwiebe (Contributor) commented Aug 12, 2014

Do we need to add unnamed fields in datashape to support the None cases? Allow things like {named0: int, real, named1: string}, where the real field is unnamed?

@mrocklin (Member, Author)

@chdoig Here is the Pandas behavior; I'm not sure I agree with it:

In [10]: L = [(1, 'Alice', -100),
    ...:      (2, 'Bob', 100),
    ...:      (3, 'Charlie', None)]

In [11]: df = DataFrame(L, columns=['id', 'name', 'amount'])

In [12]: df.groupby(df.amount > 0)['name'].count()
Out[12]: 
amount
False     2
True      1
Name: name, dtype: int64

@mrocklin (Member, Author)

@mwiebe we are essentially dealing with unnamed fields. I guess the current approach is to find some default naming scheme. Perhaps that's missing a different option though.

@mrocklin (Member, Author)

Also @mwiebe see #380

@mwiebe (Contributor) commented Aug 12, 2014

I think supporting unnamed fields would be better than a default naming scheme like the one numpy uses. Besides losing the information that a field was unnamed, you can get some annoying edge cases:

In [9]: np.dtype([('f2', int), ('', float)])
Out[9]: dtype([('f2', '<i4'), ('f1', '<f8')])

In [10]: np.dtype([('f1', int), ('', float)])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-e206f1861865> in <module>()
----> 1 np.dtype([('f1', int), ('', float)])

ValueError: two fields with the same name

@cpcloud (Member) commented Aug 12, 2014

@aterrel

I've certainly heard from more than one person how hard it is to know exactly what you are indexing into.

Definitely have heard things along these lines too. In fact, when first learning pandas a while back, indexing semantics were the biggest hurdle to get over before becoming productive with pandas on a regular basis. Add multi-index slicing (new in 0.14.0, I think) and you have a recipe for mind-bending, parenthesis-laden deliciousness. I like that particular feature, but it's definitely a power-user, don't-try-this-at-home kind of feature.

That said, I'm not totally sure I understand how the operator name heuristics affect indexing semantics at large (obviously, "naming things" in general has a huge impact on indexing semantics). There are probably cases that aren't coming to mind at the moment.

@aterrel I would love to get some clarification from you and others on where this has been a pain point (an example may help); I think I'm just misunderstanding exactly what you mean.

@mrocklin

Has this worked out well for Pandas?

In general, yes. As a user, I've never had any major issues with how things are named.

However, there are certain minor annoyances, like this:

In [27]: df = pd.DataFrame(randn(5, 2),
    ...:                   index=pd.MultiIndex.from_arrays(([1, 2, 3, 4, 5], [1, 1, 2, 2, 3])))

In [28]: df
Out[28]:
          0       1
1 1 -0.3739  1.5099
2 1  0.5093 -0.9382
3 2 -0.3830  1.2816
4 2  1.6075 -0.8923
5 3 -0.0417 -1.4439

In [29]: df.reset_index()
Out[29]:
   level_0  level_1       0       1
0        1        1 -0.3739  1.5099
1        2        1  0.5093 -0.9382
2        3        2 -0.3830  1.2816
3        4        2  1.6075 -0.8923
4        5        3 -0.0417 -1.4439

But we haven't given a lot of thought to doing this differently, probably because there's usually only one more method call to get to something more reasonable. I honestly don't have a general sense of exactly where the naming heuristics happen. In this case it's

df.rename(columns={'level_0': 'awe_sum'})

Finally, I do think there's room for improvement for groupby ops, e.g. pandas-dev/pandas#7929, especially with respect to lambdas. I'm not a huge fan of the <type>_<op> syntax, but I think code inspection via the ast module is probably a bit overkill.

What about for reductions like id.count()?

Since this is a single column, it makes sense to just return a scalar. This is different from SQL, in which everything is a table. The "scalars are not tables" approach is a practical choice, most likely for compatibility with other scientific libraries (numpy, scipy, etc.) where similar computations on 1-d arrays yield scalars.

In the case of Blaze, it could be nice to have a scalar-like object that's totally transparent to the user but carries additional properties that make it easier to work with in expressions (e.g., for optimizations).
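
Purely as an illustration of that idea (nothing like this exists in Blaze; the class name and fields are invented):

    class NamedScalar(object):
        """A hypothetical scalar-like wrapper: behaves like the underlying value
        but carries expression metadata (a name, the originating expression)
        that downstream code could use for naming or optimization."""
        def __init__(self, value, name=None, expr=None):
            self.value = value
            self.name = name    # e.g. 'id_count'
            self.expr = expr    # the expression that produced the value

        def __repr__(self):
            return repr(self.value)      # prints like an ordinary scalar

        def __eq__(self, other):
            return self.value == other   # compares like an ordinary scalar

        def __int__(self):
            return int(self.value)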

@chdoig (Member) commented Aug 13, 2014

@mrocklin I don't agree with that behavior either... The only valuable output is the True count, which is then the same as doing a WHERE. I don't think that behavior is desirable for a GROUP BY. The False count is actually False + unknown, which can be confusing or misleading in any analysis.

For me, a GROUP BY on a condition should give one of the two alternatives I mentioned:

  • Just True and False, without taking the unknowns into account.
  • True, False, and NA (a separate group for the unknowns).

In R, you get this output:

    df <- data.frame(y = c(TRUE, FALSE, NA, TRUE, TRUE))

    > summary(df)
          y
     Mode :logical
     FALSE:1
     TRUE :3
     NA's :1

I think that if Blaze is aiming for abstract computation that is independent of the backends, we should aim for something general enough, rather than a particular backend's behavior whose output we aren't comfortable with. I think we can get the two alternatives I presented from Pandas quite easily (see the sketch below).
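
A minimal pandas sketch of those two alternatives, using the example data from earlier in the thread (the 'unknown' label is just illustrative):

    import pandas as pd

    df = pd.DataFrame([(1, 'Alice', -100), (2, 'Bob', 100), (3, 'Charlie', None)],
                      columns=['id', 'name', 'amount'])

    # Alternative 1: only rows where the condition can actually be evaluated
    valid = df[df.amount.notnull()]
    valid.groupby(valid.amount > 0)['name'].count()   # False: 1, True: 1

    # Alternative 2: a separate group for the unknowns
    # (keys as strings so the three groups sort cleanly)
    key = df.amount.apply(lambda x: 'unknown' if pd.isnull(x) else str(x > 0))
    df.groupby(key)['name'].count()                   # False: 1, True: 1, unknown: 1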

@chdoig (Member) commented Aug 13, 2014

It's a philosophical question: an open-world assumption with three-valued logic, or a closed-world assumption with binary logic?

In formal logic, the open-world assumption is the assumption that the truth value of a statement is independent of whether or not it is known by any single observer or agent to be true. It is the opposite of the closed-world assumption, which holds that any statement that is not known to be true is considered false.

Under the OWA, statements about knowledge that are not included in or inferred from the knowledge explicitly recorded in the system are considered unknown, rather than wrong or false.

@chdoig (Member) commented Aug 13, 2014

Some information on 3VL in SQL:
http://en.wikipedia.org/wiki/SQL#Null_and_three-valued_logic_.283VL.29

SQL:1999 also introduced BOOLEAN type variables, which according to the standard can also hold Unknown values. In practice, a number of systems (e.g. PostgreSQL) implement the BOOLEAN Unknown as a BOOLEAN NULL.

@mrocklin (Member, Author)

I think we should move the 3VL discussion to a separate e-mail thread; it is orthogonal to the column naming issue.

@mrocklin (Member, Author)

I've merged the (Series, Scalar) -> Series rule from Pandas. This cleans up a lot of the recent troubling cases.

In [1]: from blaze import *

In [2]: t = TableSymbol('t', schema='{x: int, y: int}')

In [3]: t.x.name
Out[3]: 'x'

In [4]: (t.x + 1).name
Out[4]: 'x'

@mrocklin (Member, Author)

This seems relatively stable for now. Closing.
