Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make row lookup easier #3051

Closed
bkamins opened this issue May 16, 2022 · 53 comments
Closed

Make row lookup easier #3051

bkamins opened this issue May 16, 2022 · 53 comments
Labels
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented May 16, 2022

This is a speculative idea. Maybe we could define GroupedDataFrame to be callable like this:

(gdf::GroupedDataFrame)(idxs...) = gdf[idxs]

In this way instead of writing:

gdf[("val",)]

users could write gdf("val").

@nalimilan, @pdeffebach - what do you think?

Ref: https://discourse.julialang.org/t/any-plan-for-functionality-like-pandas-loc/81134

@bkamins bkamins modified the milestones: 1.5, 1.4 May 16, 2022
@nalimilan
Copy link
Member

I'm not a fan of this. You only save 3 characters and the syntax is harder to understand.

It wouldn't really improve the workflow discussed in the Discourse thread, would it?

@bkamins
Copy link
Member Author

bkamins commented May 16, 2022

It would not - that is why we originally have not implemented it. It is just non-conflicting and would make life of newcomers slightly easier. I wanted to discuss it to make sure we are OK with the current design.

@Lincoln-Hannah
Copy link

Lincoln-Hannah commented May 16, 2022

I was thinking of the case of a unique index.
I could convert the DataFrame to a dictionary of DataFrame rows:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

d = Dict( df.name .=> eachrow(df) )

This then allows d["Sally"].age

But loses all the other DataFrame functionality.

There are so many times I want to do several lookups on different columns but using the same index column.

Example
I have a table of FX Forward rates I'm using to bootstrap discount factors. Columns are :Currency :Tenor :Rate.
The bootstrap is done separately per currency so first I'd like to @groupby :Currency.
Then @setindex :Tenor.
Possible Tenor values are: ON TN SPOT 1W 2W 1M 3M 6M 1Y 2Y
Using IndexValue:Column notation, the formulas are

:Fwd      = SPOT:Rate  +  :Rate
SPOT:Fwd   = :SPOT:Rate
TN:Fwd      = SPOT:Rate - TN:Rate
ON:Fwd     = TN:Rate - ON:Rate
TodayRate  = ON:Fwd
:Discount    =  :Fwd / TodayRate * USDiscount(:Tenor)

Currently I'm calling a function for each group. The function then converts each sub-DataFrame to named arrays.
This works fine. it would just be nice to do it all in a DataFrame.

@bkamins bkamins changed the title Make GroupedDataFrame callable Make row lookup easier Jun 7, 2022
@bkamins bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022
@bkamins
Copy link
Member Author

bkamins commented Jun 7, 2022

I am moving it to 1.5 release. Essentially the requirement is for a Dict-like object allowing an easy row lookup.
Essentially it should be a wrapper around GroupedDataFrame allowing easy indexing on group keys.

So this would be something like (in general probably this could be made more efficient, but this shows the idea - please look at the API not at the implementation):

struct RowLookup where T<:GroupedDataFrame
    gdf::T
end

function rowlookup(df::AbstractDataFrame, cols)
    gdf = groupby(df, cols)
    length(gdf) == nrow(df) || throw(ArgumentError("rows are not unique"))
    return RowLookup(gdf)
end

Base.getindex(rl::RowLookup, idxs...) = only(rl.gdf[idxs])
Base.parent(rl) = parent(rl.gdf)

This would allow for easy indexing like (referring to the example above):

d1 = rowlookup(df, :name)
d1["Sally"].age
d2 = rowlookup(df, :name, :age)
d2["Sally", 49].children

Do we think it is worth to add something like this?

@Lincoln-Hannah
Copy link

If this is for a unique index. Would it be better to return a DataFrameRow rather than a 1-row DataFrame.
So replace [1:1, :] with [1,:]

Then d1["Sally].age returns a number instead of a 1-element vector. I think generally the number would be preferred.

It would be nice to have the rowlookup functionality but keep the object as a DataFrame, so you can add columns and do other stuff to it. With this implementation, you can't for example do d1.ageBy2 = d1.age because d1 is a GroupedDataFrame. I guess this would make the implementation much more difficult as you couldn't use the pre-existing GroupedDataFrame functionality.

@bkamins
Copy link
Member Author

bkamins commented Jun 7, 2022

So replace [1:1, :] with [1,:]

Yes - it was a typo - I meant [1, :]. Otherwise it would be needed. Actually it should be only. I will fix :).

It would be nice to have the rowlookup functionality but keep the object as a DataFrame

This cannot be done as indexing DataFrame has a different semantics.


What we could consider doing is to add syntax like:

df[:name => "Sally", :age => 49].children

with the requirement that the condition must identify a unique row.
The only drawback would be that it would be a O(n) operation, but maybe this is not the end of the world as it would be a convenience function?

@nalimilan - what do you think?

@nalimilan
Copy link
Member

It seems problematic to assume that df[:name => "Sally", :age => 49].children should reference a single row, as nothing in the syntax indicates that. I would rather expect it to be equivalent to subset(df, :name => ==("Sally"), :age => ==(49)).children.

I feel like we're not targeting the real problem here. Maybe the problem is just that when using @rtransform, one needs to repeat the data frame name. Using @transform, this works and is not too ugly IMO:

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
@chain df begin
    @transform :Age_relative_to_Sally = :age .- only(:age[:name .== "Sally"])
end

Or even without only (broadcasting will fail if more than one row is selected):

@chain df begin
    @transform :Age_relative_to_Sally = :age .- :age[:name .== "Sally"]
end

Maybe what's missing is a way to do the same using @rtransform? For example, wrapping :age[:name .== "Sally"] within a special marker to indicate that it should be evaluated on whole vectors rather than row-by-row? It could also be useful for things like :x - mean(x).

@bkamins
Copy link
Member Author

bkamins commented Jun 7, 2022

CC @pdeffebach

@pdeffebach
Copy link
Contributor

I can't think of an easy way to mark just some columns as "whole column" references and others as not. The current implementation just takes the anonymous function created by the parsing and wraps it in ByRow.

We could have something that expands to :x - mean(df.x) but that results in type instabilities and would slow down performance.

Aren't there array types that allow for

@rtransform df :Age_relative_to_Sally = :age .- :age."Sally"

? That seems pretty clean to me.

@Lincoln-Hannah
Copy link

That's a really nice syntax if its possible.

With NamedArrays.jl you can do

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

@chain df begin

    @aside age = NamedArray( _.age, _.name )

    @rtransform :Age_relative_to_Sally = :age - age["Sally"]    

end

But of course changing the NamedArray won't change the column.

@pdeffebach
Copy link
Contributor

@chain df begin

    @aside age = NamedArray( _.age, _.name )

    @rtransform :Age_relative_to_Sally = :age - age["Sally"]    

end

This still has some performance issues, since age["Sally"] is called every time. Better to use @transform instead of @rtransform and broadcast.

@Lincoln-Hannah
Copy link

Ok. @transform :Age_relative_to_Sallly = :age .- age["Sally"]

suppose there are a few columns you would like to attach the index to.

@aside  age     = NamedArray( _.age,    _.name )
@aside  height  = NamedArray( _.height, _.name )
@aside  weight  = NamedArray(_.weight, _.name )

could this be condensed to something like?

addindex( df, :name, to = [:age, :height, :weight] )

and could the NamedArray functionality be merged with the DataFrame column such that you can alter cells as well as referencing them. e.g.

@transform :age["Sally"] = 30

@Lincoln-Hannah
Copy link

Lincoln-Hannah commented Jun 15, 2022

Something I just realised. DataFrame columns can be defined as NamedArrays, and cell values referenced within a @transform statement. A cell value can't be altered within a @transform block but can altered as below.

df = DataFrame( T = ["1M","2M","3M","6M"],  x = [1,2,3,4],  y = [10,11,12,13] )

@chain df begin


    @DataFramesMeta.transform begin
         :x = NamedArray(:x,:T)
         :y = NamedArray(:y,:T)
    end

    @DataFramesMeta.transform :W = :x["1M"]


    (
        _.x["1M"] = 100;
        _
    )
    
end

@bkamins bkamins modified the milestones: 1.5, 1.6 Feb 5, 2023
@bkamins bkamins modified the milestones: 1.6, 1.7 Jul 10, 2023
@fdekerme
Copy link

https://discourse.julialang.org/t/why-is-it-so-complicated-to-access-a-row-in-a-dataframe/103162/10

  1. I find this solution very elegant, although I agree that it is problematic in the case of multiple rows.

What we could consider doing is to add syntax like:

df[:name => "Sally", :age => 49].children

with the requirement that the condition must identify a unique row. The only drawback would be that it would be a O(n) operation, but maybe this is not the end of the world as it would be a convenience function?

As indicated in the post above, "The biggest issue is that the condition you might want to use could return exactly one row, or multiple rows (where 0 rows is a special case of multiple)".

Although I don't think this is really in line with the philosophy of DataFrame.jl or its internal implementation, one feature I really like in Pandas is MultiIndex (https://pandas.pydata.org/docs/user_guide/advanced.html). Even if they're not obvious to use, they make the DataFrame much more readable and could help solve this kind of problem.

@xgdgsc
Copy link
Contributor

xgdgsc commented Aug 25, 2023

I had some bad experiences with pandas.Multiindex,. Would something as flexible as boost.MultiIndex be more desirable?

@bkamins
Copy link
Member Author

bkamins commented Aug 25, 2023

Having thought about it:

  1. If one wants to select multiple rows then GroupedDataFrames gives all the functionality one might want.
  2. So essentially we are looking for a solution when user expects exactly one row. We just need to settle on the syntax.

The syntax (note a slight change, but this is a minor thing):

df[(:name => "Sally", :age => 49), :]

is tempting, but in the past with @nalimilan we wanted to avoid adding too much functionality to indexing of df. We would need to think about it. The second downside, as was commented is that the operation would be O(n).

The alternative would be to define something like:

l = lookup(df, :name, :age)

which would be a wrapper around df (like GroupedDataFrame), call it RowLookup tentatively that would assume a unique key situation.

Then one could write l["Sally", 49] to get the desired row. The additional benefit of this approach is that l would probably do some pre-processing, so that the lookup operation later would be O(1).

If a single lookup were required then one could of course write lookup(df, :name, :age)["Sally", 49] (but the whole design would be around ensuring fast repeated lookup).

@alfaromartino
Copy link
Contributor

Is the problem here the syntax or the speed? If it's the former, I'm not sure if there's any benefit from notation like this

df[(:name => "Sally", :age => 49), :]
lookup(df, :name, :age)["Sally", 49]

relative to

df[df.name .== "Sally" .&& df.age .== 49,:]

We'd avoid writing df. repeatedly. But, apart from that, it'd be quite limited since these cases can only handle the case with ".&&". It'd also add the problem that new users could expect the functionality to behave as ".||".

I'm usually against adding new notation, especially if it creates additional ways of doing the same, while having approximately the same readibility/number of characters.

Maybe it's a problem more related to documentation? In indexing syntax, I noticed that the explanations for column selectors and row indexing are intertwined. Moreover, there's only one example with the broadcasting notation, and it's at the end of the section.

If the problem is speed, then I agree that having an optimized function like "lookup" could be beneficial.

@bkamins
Copy link
Member Author

bkamins commented Aug 29, 2023

My proposal with lookup wrapper is related to speed and correctness (and indirectly a bit syntax). The point is that df[df.name .== "Sally" .&& df.age .== 49,:] is slow as was noted. Given this we have two calls for action:

  1. Improve documentation (this is clearly doable, and I can add more examples)
  2. Decide if we want the lookup variant for speed and correctness. Note the "correctness" aspect. The point is that lookup would make sure that the row is unique and error otherwise (so users would avoid a common error of having data with duplicates but thinking that they generate a unique row).

So the question is if we want the addition of lookup or not.

@alfaromartino
Copy link
Contributor

alfaromartino commented Aug 29, 2023

If it's for speed and correctness, I like the idea. Maybe we could think more about the API? the API and name of lookup doesn't convince me.

I'm thinking of two possibilities. If it emulates filter (in the sense of a function returning a dataframe), something along the lines:

rowsubset(df, :id => 10, :name => "John")

the name makes it clear that it should return a row and that the API is similar to subset. Regardless of the internals, it makes it clear that it emulates the subset API.

If the intention is to be used for indexing, maybe the name row would suffice

df[row(:id => 10, :name => "John"), cols]

BTW, is the use of grouped dfs for selecting multiple rows documented?

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Aug 30, 2023

Call the function indexby then (rather than lookup) and maintain similar API to groupby.
groupby will assume multiple rows per group. indexby will assume one row per index.

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
idf = indexby(df, :name)
sally_age = idf["Sally"].age

(I first considered just overloading groupby with groupby(df, :name; unique=true), but then I think the function name doesn't make much sense.)

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Aug 30, 2023

In fact, I didn't know about this possibility or the fact this option is faster (although I rarely think about performance when I'm working with dataframes). And I've been using dataframes for a while. In hindsight, it can be obvious that groupby could be used for subsetting, but this is not immediate and easy to discasrd as it seems a more cumbersome way to subset rows.

I agree. I have avoided grouped data frames until today. After today, I still find them confusing. This is a lot of parenthesis/brackets to index two keys: (link)

iris_gdf = groupby(iris, :Species)
iris_gdf[[("Iris-virginica",), ("Iris-setosa",)]]

Why does this error?

julia> iris_gdf["Iris-virginica"]
ERROR: ArgumentError: invalid index: "Iris-virginica" of type String

It is the API proposed for lookup/indexby:

Then one could write l["Sally", 49] to get the desired row.

It should also work for GroupedDataFrame (just returning a different object), so the two functions operate consistently. Requiring Tuples or Dicts is pretty cumbersome; even Pairs don't work for indexing unless wrapped in Dict: iris_gdf[Dict(:Species => "Iris-virginica")].

@alfaromartino
Copy link
Contributor

Why does this error?

See the first post of this discussion. That was the original proposal. The idea is that you need a tuple, such that
iris_gdf[("Iris-virginica",)]

I misinterpreted the whole discussion I think. Given the benchmarks I showed, I thought the discussion was about performing fast indexing, but it's more related to subsetting a dataframe assuming that you'll perform multiple operations on that.

Maybe there should be a clarification that if you want to index, the approach with groupby is not ideal.

I'm still unsure about the benefits of adding this functionality, even more now that it's not necessarily more efficient.
In any case, I think the name proposed by @nathanrboyer makes it clear its purpose and that the API follows groupby.

@bkamins
Copy link
Member Author

bkamins commented Aug 30, 2023

Why does this error? julia> iris_gdf["Iris-virginica"]

This is an issue with legacy groupby API that was inherited from its initial design. The point is that gdf[1] picks the first group of gdf (not the group whose key value is 1). This is unfortunate as picking a group by its number is rarely useful. It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.

Because of this I proposed iris_gdf("Iris-virginica") syntax as an alternative that was not taken and could work exactly like you would want it to. However, this proposal did not get much support. But maybe we could reconsider it and then we would have:

gdf = groupby(df, :x)
idf = indexby(df, :x) # this will error if grouping by :x creates duplicates
gdf("v") # get a data frame that corresponds to group with key value "v"
idf("v") # get a unique row that corresponds to group with key value "v"

This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Aug 31, 2023

It would be much better if gdf did not allow indexing by a group number. Then we could allow what you ask for.

DataFrames 2.0?
Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?

This would be consistent. The only drawback is that we use ( ) brackets and not [ ] brackets.

Square brackets would be best, but I like gdf("v") better than gdf[("v",)].

@bkamins
Copy link
Member Author

bkamins commented Aug 31, 2023

DataFrames 2.0?

Could be, but realistically DataFrames 2.0 is like Julia 2.0 - a very distant future.

Could you still iterate over a GroupedDataFrame if you disallow indexing into it by Integer?

Yes, iteration protocol is separate from indexing protocol. You would be able to index gdf by group number, but it would involve a custom wrapper something like gdf[Index(1)], which would make it clear that Index(1) is group number (while gdf[1] would do lookup by group value 1. See https://github.com/davidavdav/NamedArrays.jl#integer-indices for a similar (not identical) idea with a Name wrapper (which allows to distinguish index from value)


@nalimilan - having said that, maybe, even though it could create a minor confusion we could start allowing:

gdf["ABC"]

for a by-value lookup? The rule would be that everything that is not an Integer can be passed as is, and like in NamedArrays.jl we say:

  • if you write gdf[int] it is a group number;
  • if you have a single grouping variable that happens to be an integer, then you have to write gdf[(int,)] to get lookup by group value, but gdf[int] still does lookup by group number?

As I have said - this creates a minor confusion/api inconsistency, but maybe it is better than requiring users to always use the tuple wrapper?

If we went for this then we even could drop the idea of indexby and just recommend users the following syntax for single row lookup (again - this is not the efficient code, but the code that gives everything in one line):

only(groupby(df, :x)["ABC"])

(and the only addition serves as a verification if indeed we have a single row in lookup)


So, to put concrete proposals on a table I tend to like to add either:

only(gdf("ABC")) # the () indexing, which is unambiguous, but is less consistent with Base in terms of syntax

or

only(gdf["ABC"]) # use [] for indexing and clearly document the exception for Integer indexing, where tuple or similar has to be used

The point is that current syntax gdf[("ABC",)] is OK for me or e.g. for @nalimilan, but probably newcomers to DataFrames.jl find it confusing and hard to understand and read.

What do you think?

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Aug 31, 2023

I prefer the rip-the-band-aid DataFrames 2.0 solution, but I understand.

maybe, even though it could create a minor confusion we could start allowing gdf["ABC"]
...
then you have to write gdf[(int,)] to get lookup by group value, but gdf[int] still does lookup by group number?

I am unfortunately against this since Integer cannot be included. It seems like an easy way for silent "correctness" errors to slip in. I guess misunderstanding the integer indexing is already possible, but allowing indexing to work one way for groupby(df, :name) and a different way for groupby(df, :children) seems bad.

julia> df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
3×3 DataFrame
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3
   2 │ Sally      42.0         5
   3 │ Kirk       59.0         2

julia> gdf = groupby(df, :children)
GroupedDataFrame with 3 groups based on key: children
First Group (1 row): children = 2
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Kirk       59.0         2

Last Group (1 row): children = 5
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ Sally      42.0         5

julia> gdf[2]
1×3 SubDataFrame
 Row │ name    age      children 
     │ String  Float64  Int64
─────┼───────────────────────────
   1 │ John       23.0         3

then we even could drop the idea of indexby ... only(gdf("ABC"))

Syntax is okay, but would this be performant? I imagine idf("ABC") would have already called only during the creation of idf and thus not need that check for every subsequent indexing into idf.

@bkamins
Copy link
Member Author

bkamins commented Aug 31, 2023

I imagine idf("ABC") would have already called only during the creation of idf and thus not need that check for every subsequent indexing into idf.

Yes, using idf would be slightly faster, but not by a big deal (as groupby is already really fast). I am OK to add indexby.
Let us first decide if we like gdf("ABC") (the original idea I posted on top). Probably it is @nalimilan whom we need to convince. @nalimilan - could you comment what is your current thinking? Thank you!

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Aug 31, 2023

I tried to create a summary of the current and proposed methods for looking up a value. Please add or correct as necessary.

using DataFrames, DataFramesMeta

# Single Category

## Definition
df = DataFrame(
    x = 1.0:8.0,
    id = 1001:1008,
)

## Group Method
gdf = groupby(df, :id)
xval = only(gdf[(1007,)]).x # Existing
xval = only(gdf(1007)).x    # Proposed
idf = indexby(df, :id)
xval = idf(1007).x          # Proposed

## Subset Method
xval = only(subset(df, :id => ByRow(==(1007)))).x # Existing
xval = only(@rsubset(df, :id == 1007)).x          # Meta

## Index Method
xval = only(df[df.id .== 1007, :x])       # Existing
xval = @with(df, df[:id .== 1007, ^(:x)]) # Meta
xval = only(df[:id => 1007, :x])          # Proposed

## Filter Method
xval = only(filter(:id => ==(1007), df)).x # Existing

# Multiple Categories

## Definition
df = DataFrame(
    x = 1.0:8.0,
    cat1 = [1,2,3,4,1,2,3,4],
    cat2 = ["A","A","A","A","B","B","B","B"],
)

## Group Method
gdf = groupby(df, [:cat1, :cat2])
xval = only(gdf[(3, "B")]).x      # Existing
xval = only(gdf(3, "B")).x        # Proposed
idf = indexby(df, [:cat1, :cat2])
xval = idf(3, "B").x              # Proposed

## Subset Method
xval = only(subset(df, :cat1 => ByRow(==(3)), :cat2 => ByRow(==("B")))).x # Existing
xval = only(@rsubset(df, :cat1 == 3, :cat2 == "B")).x                     # Meta

## Index Method
xval = only(df[df.cat1 .== 3 .&& df.cat2 .== "B", :x])    # Existing
xval = @with(df, df[:cat1 .== 3 .&& :cat2 .=="B", ^(:x)]) # Meta
xval = only(df[(:cat1 => 3, :cat2 => "B"), :x])           # Proposed

## Filter Method
xval = only(filter([:cat1, :cat2] => ((x, y) -> x .== 3 .&& y .== "B"), df)).x # Existing

The objective to me is to simplify the syntax for obtaining xval in both the single- and multi-category cases (performance goals are secondary). In both cases, my favorite syntax is the proposed index method followed by the @rsubset method.

gdf indexing makes sense to me now as is for the multiple categories case. It is just strange for the single category case. Maybe we can just throw a better error here to instruct users to input a Tuple (probably easier to implement for non-integers).

julia> gdf[1007]
ERROR: BoundsError: attempt to access 8-element Vector{Int64} at index [1007]

julia> gdf["v"]
ERROR: ArgumentError: invalid index: "v" of type String

I'm not sure now if callable gdf is enough of a simplification to be the solution. groupby in both forms is still pretty difficult to understand:

  • groupby(df, :id)[(1007,)].x |> only
  • groupby(df, :id)(1007).x |> only

For some reason, I find indexby to be fairly understandable with square brackets, so maybe we should make the index work on individual values (for simplicity) and tuples (for consistency with groupby) and call that consistent enough with groupby?

  • indexby(df, :id)[1007].x
  • indexby(df, :id)[(1007,)].x
  • indexby(df, [:cat1, :cat2])[3, "B"].x
  • indexby(df, [:cat1, :cat2])[(3, "B")].x

I think my favorite solution though is the Pair indexing that was dismissed for performance reasons.

@bkamins
Copy link
Member Author

bkamins commented Aug 31, 2023

If I have not missed anything you have captured all correctly.

A small comment is that in general the @rsubset(df, :cat1 .== 3, :cat2 .== "B") style is in corner cases different from the groupby style. The reason is that groupby uses isequal and not ==, but this is minor.

@nathanrboyer
Copy link
Contributor

I find myself wanting this again. Any chance of adding indexby and/or indexing with Pairs? @nalimilan
I prefer both indexby(df, :id)[1007].x and only(df[:id => 1007, :x]) over the originally proposed only(groupby(df, :id)(1007).x).

@Lincoln-Hannah
Copy link

Lincoln-Hannah commented Feb 14, 2024

For the case of a unique index column.
I define a Dictionary over the same dataset. df and di are linked to the same underlying data so changes flow through and new columns can be accessed via di

df  = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])
di  = Dictionary( df.name, eachrow(df))

di["Sally"].age = 50
df[2,:age]

@rtransform! df   :age_relative_to_Sally = :age - di["Sally"].age

di["John"].age_relative_to_Sally
df[1,:age_relative_to_Sally]

Maybe you could you make it so that df[i] = di[i] for some specified index column. To give the above functionality without explicitly creating the Dictionary object.

@bkamins
Copy link
Member Author

bkamins commented Apr 5, 2024

OK, let us try to finalize this.

There are two cases: a) single lookup, b) many lookups

Case a. single lookup

Here the current syntax is:

only(filter(:id => ==(1007), df))

or

only(df[df.id .== 1007, :])

The question is if we find it too cumbersome and if yes what is the proposed alternative.

Case b. multiple lookups

gdf = groubpy(df, :id)

and then efficiently

only(gdf((1007,)))

I think this pattern is OK as if someone wants efficiency this is relatively short (note that you need two separate statements to ensure efficiency). But again - please comment how you see this.

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Apr 5, 2024

Keep in mind that row lookup is something a very early beginner will want to do: "How old is Sally?", but some of the concepts above are relatively advanced:

  • ==(3) - create an anonymous boolean function the short way
  • df.id .== 1007 - broadcast a comparison operator across a column to create a BitVector to index with
  • gdf((1007,)) - construct a single-element tuple and pass that as a single argument

By comparison, I think these require less Julia knowledge to use:

  • only(df[:id => 1007, :]) - single lookup
  • idx = indexby(df, :id); idx[1007] - multiple lookups

@bkamins
Copy link
Member Author

bkamins commented Apr 5, 2024

Keep in mind that row lookup is something a very early beginner will want to do

Yes - that is why I am keeping this discussion 😄. I want to find a solution for a beginner.

We could also define for single lookup:

lookup(df, :id => 1007)

and also a kwarg variant

lookup(df, id = 1007)

for multiple lookups the question is if it is needed to be added (i.e. the question is if a beginner would really need this fast variant?)

@nathanrboyer
Copy link
Contributor

Would it be possible to index a grouped data frame with a Pair and avoid needing to wrap it in a collection: gdf[:id=>1007] |> only? That might be sufficiently simple for the multiple lookup case.

@pdeffebach
Copy link
Contributor

@nathanrboyer Is this feature for indexing easier in general? Or is it for indexing where you know you are only going to return one row? Or maybe 0 rows.

If it's for indexing with one row, do you want this check to happen when a new object is created? Like indexby(df, :id), or when the lookup happens, like gdf[(id = 54)] |> only?

@bkamins
Copy link
Member Author

bkamins commented Apr 7, 2024

I understand that it is indexing when exactly one row is returned and that @nathanrboyer looks for a simpler syntax than we have now.

Side note, you need (id = 54,) and not (id = 54), which I think is one of @nathanrboyer gripes.

@nalimilan
Copy link
Member

I would find it weird that we would offer a convenience syntax to select a single row without an equivalent to select multiple rows. We require filter(:id => ==(1007), df) for the latter case, so why allow lookup(df, :id => 1007) and lookup(df, id = 1007) for the former? The :id => 1007 syntax would have that meaning in a single place in the API. For consistency we would also have to add filter(:id => 1007, df) and subset(df, :id => 1007). And unfortunately that would interfere with the select DSL, which is already quite complex.

So far I don't see a good solution that would be significantly more convenient than what we have now without increasing the complexity of the API with new ad-hoc concepts or syntaxes...

@nathanrboyer
Copy link
Contributor

The filter method for single lookup is not too bad once you know the trick.

However, filter is generally discouraged in the documentation, so it is not easy to find. My code for row lookup based on reading the documentation used to be only(subset(df, :id => ByRow(x -> x == 1007)).x). Then I would sometimes write my own lookup when that got too annoying:

lookup(df, col_in, value, col_out) = only(subset(df, col_in => ByRow(x -> x == value))[:, col_out])

Maybe good enough documentation can close this knowledge gap without new syntax, and point people to filter.


On the other hand though, the filter method when multiple columns are required is probably the worst syntax of the bunch:

only(filter([:cat1, :cat2] => ((x, y) -> x .== 3 .&& y .== "B"), df)).x

In this case, I would probably point people to groupby for the best existing syntax, but again, the documentation only really refers to groupby in the Split-Apply-Combine context, not as a potential filtering method.


It is just hard to know what to reach for unless you are already familiar with everything DataFrames has to offer, and I know from experience that it is off-putting to need an hour of research and to go over your line character limit just to look up one value in your table. I think these types of operations are common enough to warrant a more simple and cohesive syntax, but documentation improvements could help too.

@bkamins
Copy link
Member Author

bkamins commented Apr 16, 2024

@pdeffebach - would you accept adding to DataFramesMeta.jl something like:

@lookup(df, :x == 1, :y == 3)

that would return a single row or error if no row or multiple rows are found?

@pdeffebach
Copy link
Contributor

Yes, I would!

@bkamins
Copy link
Member Author

bkamins commented Apr 16, 2024

Great. @nathanrboyer - I will close this issue once the functionality lands in DataFramesMeta.jl.
I think that it is important to have this feature, but as @nalimilan says - the design is tricky, so it is safer to have it on DataFramesMeta.jl to start with.

@nalimilan
Copy link
Member

Maybe it should be called @rlookup as it evaluates the predicate over each row, like @rsubset?

@nathanrboyer
Copy link
Contributor

nathanrboyer commented Apr 17, 2024

I hate to say this, but if the functionality is going to require DataFramesMeta.jl, then @rsubset is probably already good enough.

@lookup(df, :x == 1, :y == 3) == only(@rsubset(df, :x == 1, :y == 3))

(I edited my above post in which I previously thought .== was required. You can just use ==.)

If @lookup is going to be a distinct function, then I think it would nice if it would automatically extract a column value. I created an issue to discuss the design: JuliaData/DataFramesMeta.jl#394 (comment).

The issue then is convincing DataFrames.jl users to use DataFramesMeta.jl:

  • When I was new to DataFrames.jl, I avoided anything to do with metaprogramming as I found it scary and for power users.
  • Now that I have spent a good amount of time learning DataFrames.jl, I can usually figure out syntax I need, and I don't desire to spend more time learning yet another set of methods for data frames operations.

I've still yet to dive deeply into DataFramesMeta.jl thinking it is probably overkill for what I usually need to do. However, it seems to be more simple and helpful than I gave it credit for.

@pdeffebach
Copy link
Contributor

I've still yet to dive deeply into DataFramesMeta.jl thinking it is probably overkill for what I usually need to do. However, it seems to be more simple and helpful than I gave it credit for.

I may be biased, but users should probably go to DataFramesMeta.jl first rather than after they learn DataFrames.jl src => fun => dest syntax. The latter is, in my view, a lot more complicated than DataFramesMeta.jl syntax. So I kind of disagree with your framing here. If users see @rlookup first and then never need to do anything with the base DataFrames.jl syntax, that's fine, and probably the goal.

@nathanrboyer
Copy link
Contributor

I think you are right. I also think my misunderstanding, that DataFramesMeta.jl is an extension of DataFrames.jl for advanced power users, is a common one.

@bkamins
Copy link
Member Author

bkamins commented Apr 18, 2024

that DataFramesMeta.jl is an extension of DataFrames.jl for advanced power users, is a common one.

I will add a blog post about it :), and also update the documentation of DataFrames.jl. @pdeffebach - probably one could also write something like this in DataFramesMeta.jl documentation to be more explicit (as per JuliaData/DataFramesMeta.jl#393).

I am closing this issue - if we add some.

@bkamins bkamins closed this as completed Apr 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants