Make row lookup easier #3051

bkamins · 2022-05-16T08:12:02Z

This is a speculative idea. Maybe we could define GroupedDataFrame to be callable like this:

(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]

In this way instead of writing:

gdf[("val",)]

users could write gdf("val").

@nalimilan, @pdeffebach - what do you think?

Ref: https://discourse.julialang.org/t/any-plan-for-functionality-like-pandas-loc/81134

The text was updated successfully, but these errors were encountered:

nalimilan · 2022-05-16T08:21:26Z

I'm not a fan of this. You only save 3 characters and the syntax is harder to understand.

It wouldn't really improve the workflow discussed in the Discourse thread, would it?

bkamins · 2022-05-16T08:29:51Z

It would not - that is why we originally have not implemented it. It is just non-conflicting and would make life of newcomers slightly easier. I wanted to discuss it to make sure we are OK with the current design.

Lincoln-Hannah · 2022-05-16T12:19:57Z

I was thinking of the case of a unique index.
I could convert the DataFrame to a dictionary of DataFrame rows:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

d = Dict( df.name .=> eachrow(df) )

This then allows d["Sally"].age

But loses all the other DataFrame functionality.

There are so many times I want to do several lookups on different columns but using the same index column.

Example
I have a table of FX Forward rates I'm using to bootstrap discount factors. Columns are :Currency :Tenor :Rate.
The bootstrap is done separately per currency so first I'd like to @groupby :Currency.
Then @setindex :Tenor.
Possible Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y
Using IndexValue:Column notation, the formulas are

:Fwd      = SPOT:Rate  +  :Rate
SPOT:Fwd   = :SPOT:Rate
TN:Fwd      = SPOT:Rate - TN:Rate
ON:Fwd     = TN:Rate - ON:Rate
TodayRate  = ON:Fwd
:Discount    =  :Fwd / TodayRate * USDiscount(:Tenor)

Currently I'm calling a function for each group. The function then converts each sub-DataFrame to named arrays.
This works fine. it would just be nice to do it all in a DataFrame.

bkamins · 2022-06-07T07:30:46Z

I am moving it to 1.5 release. Essentially the requirement is for a Dict-like object allowing an easy row lookup.
Essentially it should be a wrapper around GroupedDataFrame allowing easy indexing on group keys.

So this would be something like (in general probably this could be made more efficient, but this shows the idea - please look at the API not at the implementation):

struct RowLookup where T<:GroupedDataFrame
    gdf::T
end

function rowlookup(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    length(gdf) == nrow(df) || throw(ArgumentError("rows are not unique"))
    return RowLookup(gdf)
end

Base.getindex(rl::RowLookup, idxs...) = only(rl.gdf[idxs])
Base.parent(rl) = parent(rl.gdf)

This would allow for easy indexing like (referring to the example above):

d1 = rowlookup(df, :name)
d1["Sally"].age
d2 = rowlookup(df, :name, :age)
d2["Sally", 49].children

Do we think it is worth to add something like this?

Lincoln-Hannah · 2022-06-07T10:42:29Z

If this is for a unique index. Would it be better to return a DataFrameRow rather than a 1-row DataFrame.
So replace [1:1, :] with [1,:]

Then d1["Sally].age returns a number instead of a 1-element vector. I think generally the number would be preferred.

It would be nice to have the rowlookup functionality but keep the object as a DataFrame, so you can add columns and do other stuff to it. With this implementation, you can't for example do d1.ageBy2 = d1.age because d1 is a GroupedDataFrame. I guess this would make the implementation much more difficult as you couldn't use the pre-existing GroupedDataFrame functionality.

bkamins · 2022-06-07T10:56:11Z

So replace [1:1, :] with [1,:]

Yes - it was a typo - I meant [1, :]. Otherwise it would be needed. Actually it should be only. I will fix :).

It would be nice to have the rowlookup functionality but keep the object as a DataFrame

This cannot be done as indexing DataFrame has a different semantics.

What we could consider doing is to add syntax like:

df[:name => "Sally", :age => 49].children

with the requirement that the condition must identify a unique row.
The only drawback would be that it would be a O(n) operation, but maybe this is not the end of the world as it would be a convenience function?

@nalimilan - what do you think?

nalimilan · 2022-06-07T11:29:37Z

It seems problematic to assume that df[:name => "Sally", :age => 49].children should reference a single row, as nothing in the syntax indicates that. I would rather expect it to be equivalent to subset(df, :name => ==("Sally"), :age => ==(49)).children.

I feel like we're not targeting the real problem here. Maybe the problem is just that when using @rtransform, one needs to repeat the data frame name. Using @transform, this works and is not too ugly IMO:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @transform :Age_relative_to_Sally = :age .- only(:age[:name .== "Sally"])
end

Or even without only (broadcasting will fail if more than one row is selected):

@chain df begin
    @transform :Age_relative_to_Sally = :age .- :age[:name .== "Sally"]
end

Maybe what's missing is a way to do the same using @rtransform? For example, wrapping :age[:name .== "Sally"] within a special marker to indicate that it should be evaluated on whole vectors rather than row-by-row? It could also be useful for things like :x - mean(x).

bkamins · 2022-06-07T11:56:06Z

CC @pdeffebach

pdeffebach · 2022-06-07T14:27:33Z

I can't think of an easy way to mark just some columns as "whole column" references and others as not. The current implementation just takes the anonymous function created by the parsing and wraps it in ByRow.

We could have something that expands to :x - mean(df.x) but that results in type instabilities and would slow down performance.

Aren't there array types that allow for

@rtransform df :Age_relative_to_Sally = :age .- :age."Sally"

? That seems pretty clean to me.

Lincoln-Hannah · 2022-06-08T01:52:35Z

That's a really nice syntax if its possible.

With NamedArrays.jl you can do

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

@chain df begin

    @aside age = NamedArray( _.age, _.name )

    @rtransform :Age_relative_to_Sally = :age - age["Sally"]    

end

But of course changing the NamedArray won't change the column.

pdeffebach · 2022-06-08T11:18:52Z

@chain df begin

    @aside age = NamedArray( _.age, _.name )

    @rtransform :Age_relative_to_Sally = :age - age["Sally"]    

end

This still has some performance issues, since age["Sally"] is called every time. Better to use @transform instead of @rtransform and broadcast.

Lincoln-Hannah · 2022-06-09T05:48:12Z

Ok. @transform :Age_relative_to_Sallly = :age .- age["Sally"]

suppose there are a few columns you would like to attach the index to.

@aside  age     = NamedArray( _.age,    _.name )
@aside  height  = NamedArray( _.height, _.name )
@aside  weight  = NamedArray(_.weight, _.name )

could this be condensed to something like?

addindex( df, :name, to = [:age, :height, :weight] )

and could the NamedArray functionality be merged with the DataFrame column such that you can alter cells as well as referencing them. e.g.

@transform :age["Sally"] = 30

Lincoln-Hannah · 2022-06-15T06:04:36Z

Something I just realised. DataFrame columns can be defined as NamedArrays, and cell values referenced within a @transform statement. A cell value can't be altered within a @transform block but can altered as below.

df = DataFrame( T = ["1M","2M","3M","6M"],  x = [1,2,3,4],  y = [10,11,12,13] )

@chain df begin


    @DataFramesMeta.transform begin
         :x = NamedArray(:x,:T)
         :y = NamedArray(:y,:T)
    end

    @DataFramesMeta.transform :W = :x["1M"]


    (
        _.x["1M"] = 100;
        _
    )
    
end

fdekerme · 2023-08-25T09:15:58Z

https://discourse.julialang.org/t/why-is-it-so-complicated-to-access-a-row-in-a-dataframe/103162/10

I find this solution very elegant, although I agree that it is problematic in the case of multiple rows.

What we could consider doing is to add syntax like:
df[:name => "Sally", :age => 49].children
with the requirement that the condition must identify a unique row. The only drawback would be that it would be a O(n) operation, but maybe this is not the end of the world as it would be a convenience function?

As indicated in the post above, "The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple)".

Although I don't think this is really in line with the philosophy of DataFrame.jl or its internal implementation, one feature I really like in Pandas is MultiIndex (https://pandas.pydata.org/docs/user_guide/advanced.html). Even if they're not obvious to use, they make the DataFrame much more readable and could help solve this kind of problem.

xgdgsc · 2023-08-25T09:42:56Z

I had some bad experiences with pandas.Multiindex,. Would something as flexible as boost.MultiIndex be more desirable?

bkamins · 2023-08-25T15:05:07Z

Having thought about it:

If one wants to select multiple rows then GroupedDataFrames gives all the functionality one might want.
So essentially we are looking for a solution when user expects exactly one row. We just need to settle on the syntax.

The syntax (note a slight change, but this is a minor thing):

df[(:name => "Sally", :age => 49), :]

is tempting, but in the past with @nalimilan we wanted to avoid adding too much functionality to indexing of df. We would need to think about it. The second downside, as was commented is that the operation would be O(n).

The alternative would be to define something like:

l = lookup(df, :name, :age)

which would be a wrapper around df (like GroupedDataFrame), call it RowLookup tentatively that would assume a unique key situation.

Then one could write l["Sally", 49] to get the desired row. The additional benefit of this approach is that l would probably do some pre-processing, so that the lookup operation later would be O(1).

If a single lookup were required then one could of course write lookup(df, :name, :age)["Sally", 49] (but the whole design would be around ensuring fast repeated lookup).

alfaromartino · 2023-08-25T22:12:17Z

Is the problem here the syntax or the speed? If it's the former, I'm not sure if there's any benefit from notation like this

df[(:name => "Sally", :age => 49), :]
lookup(df, :name, :age)["Sally", 49]

relative to

df[df.name .== "Sally" .&& df.age .== 49,:]

We'd avoid writing df. repeatedly. But, apart from that, it'd be quite limited since these cases can only handle the case with ".&&". It'd also add the problem that new users could expect the functionality to behave as ".||".

I'm usually against adding new notation, especially if it creates additional ways of doing the same, while having approximately the same readibility/number of characters.

Maybe it's a problem more related to documentation? In indexing syntax, I noticed that the explanations for column selectors and row indexing are intertwined. Moreover, there's only one example with the broadcasting notation, and it's at the end of the section.

If the problem is speed, then I agree that having an optimized function like "lookup" could be beneficial.

bkamins · 2023-08-29T18:06:18Z

My proposal with lookup wrapper is related to speed and correctness (and indirectly a bit syntax). The point is that df[df.name .== "Sally" .&& df.age .== 49,:] is slow as was noted. Given this we have two calls for action:

Improve documentation (this is clearly doable, and I can add more examples)
Decide if we want the lookup variant for speed and correctness. Note the "correctness" aspect. The point is that lookup would make sure that the row is unique and error otherwise (so users would avoid a common error of having data with duplicates but thinking that they generate a unique row).

So the question is if we want the addition of lookup or not.

alfaromartino · 2023-08-29T19:10:22Z

If it's for speed and correctness, I like the idea. Maybe we could think more about the API? the API and name of lookup doesn't convince me.

I'm thinking of two possibilities. If it emulates filter (in the sense of a function returning a dataframe), something along the lines:

rowsubset(df, :id => 10, :name => "John")

the name makes it clear that it should return a row and that the API is similar to subset. Regardless of the internals, it makes it clear that it emulates the subset API.

If the intention is to be used for indexing, maybe the name row would suffice

df[row(:id => 10, :name => "John"), cols]

BTW, is the use of grouped dfs for selecting multiple rows documented?

nathanrboyer · 2023-08-30T16:23:01Z

Call the function indexby then (rather than lookup) and maintain similar API to groupby.
groupby will assume multiple rows per group. indexby will assume one row per index.

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age

(I first considered just overloading groupby with groupby(df, :name; unique=true), but then I think the function name doesn't make much sense.)

nathanrboyer · 2023-08-30T19:31:37Z

In fact, I didn't know about this possibility or the fact this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediate and easy to discasrd as it seems a more cumbersome way to subset rows.

I agree. I have avoided grouped data frames until today. After today, I still find them confusing. This is a lot of parenthesis/brackets to index two keys: (link)

iris_gdf = groupby(iris, :Species)
iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]]

Why does this error?

julia> iris_gdf["Iris-virginica"]
ERROR: ArgumentError: invalid index: "Iris-virginica" of type String

It is the API proposed for lookup/indexby:

Then one could write l["Sally", 49] to get the desired row.

It should also work for GroupedDataFrame (just returning a different object), so the two functions operate consistently. Requiring Tuples or Dicts is pretty cumbersome; even Pairs don't work for indexing unless wrapped in Dict: iris_gdf[Dict(:Species => "Iris-virginica")].

alfaromartino · 2023-08-30T20:29:12Z

Why does this error?

See the first post of this discussion. That was the original proposal. The idea is that you need a tuple, such that
iris_gdf[("Iris-virginica",)]

I misinterpreted the whole discussion I think. Given the benchmarks I showed, I thought the discussion was about performing fast indexing, but it's more related to subsetting a dataframe assuming that you'll perform multiple operations on that.

Maybe there should be a clarification that if you want to index, the approach with groupby is not ideal.

I'm still unsure about the benefits of adding this functionality, even more now that it's not necessarily more efficient.
In any case, I think the name proposed by @nathanrboyer makes it clear its purpose and that the API follows groupby.

bkamins · 2023-08-30T21:15:11Z

Why does this error? julia> iris_gdf["Iris-virginica"]

This is an issue with legacy groupby API that was inherited from its initial design. The point is that gdf[1] picks the first group of gdf (not the group whose key value is 1). This is unfortunate as picking a group by its number is rarely useful. It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.

Because of this I proposed iris_gdf("Iris-virginica") syntax as an alternative that was not taken and could work exactly like you would want it to. However, this proposal did not get much support. But maybe we could reconsider it and then we would have:

gdf = groupby(df, :x)
idf = indexby(df, :x) # this will error if grouping by :x creates duplicates
gdf("v") # get a data frame that corresponds to group with key value "v"
idf("v") # get a unique row that corresponds to group with key value "v"

This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.

nathanrboyer · 2023-08-31T02:56:25Z

It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.

DataFrames 2.0?
Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?

This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.

Square brackets would be best, but I like gdf("v") better than gdf[("v",)].

bkamins · 2023-08-31T13:08:53Z

DataFrames 2.0?

Could be, but realistically DataFrames 2.0 is like Julia 2.0 - a very distant future.

Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?

Yes, iteration protocol is separate from indexing protocol. You would be able to index gdf by group number, but it would involve a custom wrapper something like gdf[Index(1)], which would make it clear that Index(1) is group number (while gdf[1] would do lookup by group value 1. See https://github.com/davidavdav/NamedArrays.jl#integer-indices for a similar (not identical) idea with a Name wrapper (which allows to distinguish index from value)

@nalimilan - having said that, maybe, even though it could create a minor confusion we could start allowing:

gdf["ABC"]

for a by-value lookup? The rule would be that everything that is not an Integer can be passed as is, and like in NamedArrays.jl we say:

if you write gdf[int] it is a group number;
if you have a single grouping variable that happens to be an integer, then you have to write gdf[(int,)] to get lookup by group value, but gdf[int] still does lookup by group number?

As I have said - this creates a minor confusion/api inconsistency, but maybe it is better than requiring users to always use the tuple wrapper?

If we went for this then we even could drop the idea of indexby and just recommend users the following syntax for single row lookup (again - this is not the efficient code, but the code that gives everything in one line):

only(groupby(df, :x)["ABC"])

(and the only addition serves as a verification if indeed we have a single row in lookup)

So, to put concrete proposals on a table I tend to like to add either:

only(gdf("ABC")) # the () indexing, which is unambiguous, but is less consistent with Base in terms of syntax

or

only(gdf["ABC"]) # use [] for indexing and clearly document the exception for Integer indexing, where tuple or similar has to be used

The point is that current syntax gdf[("ABC",)] is OK for me or e.g. for @nalimilan, but probably newcomers to DataFrames.jl find it confusing and hard to understand and read.

What do you think?

nathanrboyer · 2023-08-31T14:27:20Z

I prefer the rip-the-band-aid DataFrames 2.0 solution, but I understand.

maybe, even though it could create a minor confusion we could start allowing gdf["ABC"]
...
then you have to write gdf[(int,)] to get lookup by group value, but gdf[int] still does lookup by group number?

I am unfortunately against this since Integer cannot be included. It seems like an easy way for silent "correctness" errors to slip in. I guess misunderstanding the integer indexing is already possible, but allowing indexing to work one way for groupby(df, :name) and a different way for groupby(df, :children) seems bad.

julia> df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
3×3 DataFrame
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3
   2 │ Sally      42.0         5
   3 │ Kirk       59.0         2

julia> gdf = groupby(df, :children)
GroupedDataFrame with 3 groups based on key: children
First Group (1 row): children = 2
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Kirk       59.0         2
⋮
Last Group (1 row): children = 5
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Sally      42.0         5

julia> gdf[2]
1×3 SubDataFrame
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3

then we even could drop the idea of indexby ... only(gdf("ABC"))

Syntax is okay, but would this be performant? I imagine idf("ABC") would have already called only during the creation of idf and thus not need that check for every subsequent indexing into idf.

bkamins · 2023-08-31T14:31:01Z

I imagine idf("ABC") would have already called only during the creation of idf and thus not need that check for every subsequent indexing into idf.

Yes, using idf would be slightly faster, but not by a big deal (as groupby is already really fast). I am OK to add indexby.
Let us first decide if we like gdf("ABC") (the original idea I posted on top). Probably it is @nalimilan whom we need to convince. @nalimilan - could you comment what is your current thinking? Thank you!

nathanrboyer · 2023-08-31T17:52:35Z

I tried to create a summary of the current and proposed methods for looking up a value. Please add or correct as necessary.

using DataFrames, DataFramesMeta

# Single Category

## Definition
df = DataFrame(
    x = 1.0:8.0,
    id = 1001:1008,
)

## Group Method
gdf = groupby(df, :id)
xval = only(gdf[(1007,)]).x # Existing
xval = only(gdf(1007)).x    # Proposed
idf = indexby(df, :id)
xval = idf(1007).x          # Proposed

## Subset Method
xval = only(subset(df, :id => ByRow(==(1007)))).x # Existing
xval = only(@rsubset(df, :id == 1007)).x          # Meta

## Index Method
xval = only(df[df.id .== 1007, :x])       # Existing
xval = @with(df, df[:id .== 1007, ^(:x)]) # Meta
xval = only(df[:id => 1007, :x])          # Proposed

## Filter Method
xval = only(filter(:id => ==(1007), df)).x # Existing

# Multiple Categories

## Definition
df = DataFrame(
    x = 1.0:8.0,
    cat1 = [1,2,3,4,1,2,3,4],
    cat2 = ["A","A","A","A","B","B","B","B"],
)

## Group Method
gdf = groupby(df, [:cat1, :cat2])
xval = only(gdf[(3, "B")]).x      # Existing
xval = only(gdf(3, "B")).x        # Proposed
idf = indexby(df, [:cat1, :cat2])
xval = idf(3, "B").x              # Proposed

## Subset Method
xval = only(subset(df, :cat1 => ByRow(==(3)), :cat2 => ByRow(==("B")))).x # Existing
xval = only(@rsubset(df, :cat1 == 3, :cat2 == "B")).x                     # Meta

## Index Method
xval = only(df[df.cat1 .== 3 .&& df.cat2 .== "B", :x])    # Existing
xval = @with(df, df[:cat1 .== 3 .&& :cat2 .=="B", ^(:x)]) # Meta
xval = only(df[(:cat1 => 3, :cat2 => "B"), :x])           # Proposed

## Filter Method
xval = only(filter([:cat1, :cat2] => ((x, y) -> x .== 3 .&& y .== "B"), df)).x # Existing

The objective to me is to simplify the syntax for obtaining xval in both the single- and multi-category cases (performance goals are secondary). In both cases, my favorite syntax is the proposed index method followed by the @rsubset method.

gdf indexing makes sense to me now as is for the multiple categories case. It is just strange for the single category case. Maybe we can just throw a better error here to instruct users to input a Tuple (probably easier to implement for non-integers).

julia> gdf[1007]
ERROR: BoundsError: attempt to access 8-element Vector{Int64} at index [1007]

julia> gdf["v"]
ERROR: ArgumentError: invalid index: "v" of type String

I'm not sure now if callable gdf is enough of a simplification to be the solution. groupby in both forms is still pretty difficult to understand:

groupby(df, :id)[(1007,)].x |> only
groupby(df, :id)(1007).x |> only

For some reason, I find indexby to be fairly understandable with square brackets, so maybe we should make the index work on individual values (for simplicity) and tuples (for consistency with groupby) and call that consistent enough with groupby?

indexby(df, :id)[1007].x
indexby(df, :id)[(1007,)].x
indexby(df, [:cat1, :cat2])[3, "B"].x
indexby(df, [:cat1, :cat2])[(3, "B")].x

I think my favorite solution though is the Pair indexing that was dismissed for performance reasons.

bkamins · 2023-08-31T21:00:50Z

If I have not missed anything you have captured all correctly.

A small comment is that in general the @rsubset(df, :cat1 .== 3, :cat2 .== "B") style is in corner cases different from the groupby style. The reason is that groupby uses isequal and not ==, but this is minor.

nathanrboyer · 2024-02-12T20:52:53Z

I find myself wanting this again. Any chance of adding indexby and/or indexing with Pairs? @nalimilan
I prefer both indexby(df, :id)[1007].x and only(df[:id => 1007, :x]) over the originally proposed only(groupby(df, :id)(1007).x).

Lincoln-Hannah · 2024-02-14T02:21:21Z

For the case of a unique index column.
I define a Dictionary over the same dataset. df and di are linked to the same underlying data so changes flow through and new columns can be accessed via di

df  = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
di  = Dictionary( df.name, eachrow(df))

di["Sally"].age = 50
df[2,:age]

@rtransform! df   :age_relative_to_Sally = :age - di["Sally"].age

di["John"].age_relative_to_Sally
df[1,:age_relative_to_Sally]

Maybe you could you make it so that df[i] = di[i] for some specified index column. To give the above functionality without explicitly creating the Dictionary object.

bkamins · 2024-04-05T17:20:06Z

OK, let us try to finalize this.

There are two cases: a) single lookup, b) many lookups

Case a. single lookup

Here the current syntax is:

only(filter(:id => ==(1007), df))

or

only(df[df.id .== 1007, :])

The question is if we find it too cumbersome and if yes what is the proposed alternative.

Case b. multiple lookups

gdf = groubpy(df, :id)

and then efficiently

only(gdf((1007,)))

I think this pattern is OK as if someone wants efficiency this is relatively short (note that you need two separate statements to ensure efficiency). But again - please comment how you see this.

nathanrboyer · 2024-04-05T20:19:35Z

Keep in mind that row lookup is something a very early beginner will want to do: "How old is Sally?", but some of the concepts above are relatively advanced:

==(3) - create an anonymous boolean function the short way
df.id .== 1007 - broadcast a comparison operator across a column to create a BitVector to index with
gdf((1007,)) - construct a single-element tuple and pass that as a single argument

By comparison, I think these require less Julia knowledge to use:

only(df[:id => 1007, :]) - single lookup
idx = indexby(df, :id); idx[1007] - multiple lookups

bkamins · 2024-04-05T21:13:39Z

Keep in mind that row lookup is something a very early beginner will want to do

Yes - that is why I am keeping this discussion 😄. I want to find a solution for a beginner.

We could also define for single lookup:

lookup(df, :id => 1007)

and also a kwarg variant

lookup(df, id = 1007)

for multiple lookups the question is if it is needed to be added (i.e. the question is if a beginner would really need this fast variant?)

nathanrboyer · 2024-04-06T21:52:41Z

Would it be possible to index a grouped data frame with a Pair and avoid needing to wrap it in a collection: gdf[:id=>1007] |> only? That might be sufficiently simple for the multiple lookup case.

pdeffebach · 2024-04-07T15:43:46Z

@nathanrboyer Is this feature for indexing easier in general? Or is it for indexing where you know you are only going to return one row? Or maybe 0 rows.

If it's for indexing with one row, do you want this check to happen when a new object is created? Like indexby(df, :id), or when the lookup happens, like gdf[(id = 54)] |> only?

bkamins · 2024-04-07T18:55:32Z

I understand that it is indexing when exactly one row is returned and that @nathanrboyer looks for a simpler syntax than we have now.

Side note, you need (id = 54,) and not (id = 54), which I think is one of @nathanrboyer gripes.

nalimilan · 2024-04-16T13:18:06Z

I would find it weird that we would offer a convenience syntax to select a single row without an equivalent to select multiple rows. We require filter(:id => ==(1007), df) for the latter case, so why allow lookup(df, :id => 1007) and lookup(df, id = 1007) for the former? The :id => 1007 syntax would have that meaning in a single place in the API. For consistency we would also have to add filter(:id => 1007, df) and subset(df, :id => 1007). And unfortunately that would interfere with the select DSL, which is already quite complex.

So far I don't see a good solution that would be significantly more convenient than what we have now without increasing the complexity of the API with new ad-hoc concepts or syntaxes...

nathanrboyer · 2024-04-16T15:04:10Z

The filter method for single lookup is not too bad once you know the trick.

However, filter is generally discouraged in the documentation, so it is not easy to find. My code for row lookup based on reading the documentation used to be only(subset(df, :id => ByRow(x -> x == 1007)).x). Then I would sometimes write my own lookup when that got too annoying:

lookup(df, col_in, value, col_out) = only(subset(df, col_in => ByRow(x -> x == value))[:, col_out])

Maybe good enough documentation can close this knowledge gap without new syntax, and point people to filter.

On the other hand though, the filter method when multiple columns are required is probably the worst syntax of the bunch:

only(filter([:cat1, :cat2] => ((x, y) -> x .== 3 .&& y .== "B"), df)).x

In this case, I would probably point people to groupby for the best existing syntax, but again, the documentation only really refers to groupby in the Split-Apply-Combine context, not as a potential filtering method.

It is just hard to know what to reach for unless you are already familiar with everything DataFrames has to offer, and I know from experience that it is off-putting to need an hour of research and to go over your line character limit just to look up one value in your table. I think these types of operations are common enough to warrant a more simple and cohesive syntax, but documentation improvements could help too.

bkamins · 2024-04-16T18:44:29Z

@pdeffebach - would you accept adding to DataFramesMeta.jl something like:

@lookup(df, :x == 1, :y == 3)

that would return a single row or error if no row or multiple rows are found?

pdeffebach · 2024-04-16T21:02:42Z

Yes, I would!

bkamins · 2024-04-16T21:52:09Z

Great. @nathanrboyer - I will close this issue once the functionality lands in DataFramesMeta.jl.
I think that it is important to have this feature, but as @nalimilan says - the design is tricky, so it is safer to have it on DataFramesMeta.jl to start with.

nalimilan · 2024-04-17T14:41:05Z

Maybe it should be called @rlookup as it evaluates the predicate over each row, like @rsubset?

nathanrboyer · 2024-04-17T16:02:13Z

I hate to say this, but if the functionality is going to require DataFramesMeta.jl, then @rsubset is probably already good enough.

@lookup(df, :x == 1, :y == 3) == only(@rsubset(df, :x == 1, :y == 3))

(I edited my above post in which I previously thought .== was required. You can just use ==.)

If @lookup is going to be a distinct function, then I think it would nice if it would automatically extract a column value. I created an issue to discuss the design: JuliaData/DataFramesMeta.jl#394 (comment).

The issue then is convincing DataFrames.jl users to use DataFramesMeta.jl:

When I was new to DataFrames.jl, I avoided anything to do with metaprogramming as I found it scary and for power users.
Now that I have spent a good amount of time learning DataFrames.jl, I can usually figure out syntax I need, and I don't desire to spend more time learning yet another set of methods for data frames operations.

I've still yet to dive deeply into DataFramesMeta.jl thinking it is probably overkill for what I usually need to do. However, it seems to be more simple and helpful than I gave it credit for.

pdeffebach · 2024-04-17T20:03:32Z

I've still yet to dive deeply into DataFramesMeta.jl thinking it is probably overkill for what I usually need to do. However, it seems to be more simple and helpful than I gave it credit for.

I may be biased, but users should probably go to DataFramesMeta.jl first rather than after they learn DataFrames.jl src => fun => dest syntax. The latter is, in my view, a lot more complicated than DataFramesMeta.jl syntax. So I kind of disagree with your framing here. If users see @rlookup first and then never need to do anything with the base DataFrames.jl syntax, that's fine, and probably the goal.

nathanrboyer · 2024-04-17T20:14:58Z

I think you are right. I also think my misunderstanding, that DataFramesMeta.jl is an extension of DataFrames.jl for advanced power users, is a common one.

bkamins · 2024-04-18T09:24:32Z

that DataFramesMeta.jl is an extension of DataFrames.jl for advanced power users, is a common one.

I will add a blog post about it :), and also update the documentation of DataFrames.jl. @pdeffebach - probably one could also write something like this in DataFramesMeta.jl documentation to be more explicit (as per JuliaData/DataFramesMeta.jl#393).

I am closing this issue - if we add some.

bkamins added the decision label May 16, 2022

bkamins modified the milestones: 1.5, 1.4 May 16, 2022

bkamins changed the title ~~Make GroupedDataFrame callable~~ Make row lookup easier Jun 7, 2022

bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022

bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023

bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023

nathanrboyer mentioned this issue Apr 17, 2024

Add convenience function to look up a single value in a DataFrame JuliaData/DataFramesMeta.jl#394

Open

bkamins closed this as completed Apr 18, 2024

Make row lookup easier #3051

Make row lookup easier #3051

Comments

bkamins commented May 16, 2022

nalimilan commented May 16, 2022

bkamins commented May 16, 2022

Lincoln-Hannah commented May 16, 2022 • edited

bkamins commented Jun 7, 2022 • edited

Lincoln-Hannah commented Jun 7, 2022

bkamins commented Jun 7, 2022

nalimilan commented Jun 7, 2022

bkamins commented Jun 7, 2022

pdeffebach commented Jun 7, 2022

Lincoln-Hannah commented Jun 8, 2022

pdeffebach commented Jun 8, 2022

Lincoln-Hannah commented Jun 9, 2022

Lincoln-Hannah commented Jun 15, 2022 • edited

fdekerme commented Aug 25, 2023

xgdgsc commented Aug 25, 2023

bkamins commented Aug 25, 2023

alfaromartino commented Aug 25, 2023

bkamins commented Aug 29, 2023

alfaromartino commented Aug 29, 2023 • edited

nathanrboyer commented Aug 30, 2023 • edited

nathanrboyer commented Aug 30, 2023 • edited

alfaromartino commented Aug 30, 2023

bkamins commented Aug 30, 2023

nathanrboyer commented Aug 31, 2023 • edited

bkamins commented Aug 31, 2023

nathanrboyer commented Aug 31, 2023 • edited

bkamins commented Aug 31, 2023

nathanrboyer commented Aug 31, 2023 • edited

bkamins commented Aug 31, 2023

nathanrboyer commented Feb 12, 2024

Lincoln-Hannah commented Feb 14, 2024 • edited

bkamins commented Apr 5, 2024

Case a. single lookup

Case b. multiple lookups

nathanrboyer commented Apr 5, 2024 • edited

bkamins commented Apr 5, 2024

nathanrboyer commented Apr 6, 2024

pdeffebach commented Apr 7, 2024

bkamins commented Apr 7, 2024

nalimilan commented Apr 16, 2024

nathanrboyer commented Apr 16, 2024

bkamins commented Apr 16, 2024

pdeffebach commented Apr 16, 2024

bkamins commented Apr 16, 2024

nalimilan commented Apr 17, 2024

nathanrboyer commented Apr 17, 2024 • edited

pdeffebach commented Apr 17, 2024

nathanrboyer commented Apr 17, 2024

bkamins commented Apr 18, 2024

Lincoln-Hannah commented May 16, 2022 •

edited

bkamins commented Jun 7, 2022 •

edited

Lincoln-Hannah commented Jun 15, 2022 •

edited

alfaromartino commented Aug 29, 2023 •

edited

nathanrboyer commented Aug 30, 2023 •

edited

nathanrboyer commented Aug 30, 2023 •

edited

nathanrboyer commented Aug 31, 2023 •

edited

nathanrboyer commented Aug 31, 2023 •

edited

nathanrboyer commented Aug 31, 2023 •

edited

Lincoln-Hannah commented Feb 14, 2024 •

edited

nathanrboyer commented Apr 5, 2024 •

edited

nathanrboyer commented Apr 17, 2024 •

edited