Feature Request: Allow naming function in rename operation pairs. #3361

Closed
nathanrboyer opened this issue Jul 19, 2023 · 15 comments · Fixed by #3380

@nathanrboyer
Contributor

The current options for renaming column(s) with a function are below, but these all require care in dealing with the other columns.

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> transform(df, :b => identity => (s -> s * "_new"))
3×4 DataFrame
 Row │ a      b      c      b_new 
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      4      7      4
   2 │     2      5      8      5
   3 │     3      6      9      6

julia> select(df, :a, :b => identity => (s -> s * "_new"), :c)
3×3 DataFrame
 Row │ a      b_new  c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

I would like the method below added to the rename and rename! functions, so columns can be renamed in-place with a function just like they can with explicit new name(s).

julia> rename(df, :b => (s -> s * "_new")) # currently errors
3×3 DataFrame
 Row │ a      b_new  c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> rename(df, :b => :b_new)
3×3 DataFrame
 Row │ a      b_new  c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

It would also work for multiple columns.

julia> rename(df, 1:2 => (s -> s * "_new"))
3×3 DataFrame
 Row │ a_new  b_new  c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

There is already a method for applying a function to the entire data frame rename((s -> s * "_new"), df), but I don't think there is currently a way to apply it to only some columns.
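
For reference, a minimal sketch of the existing whole-frame method mentioned above. The variable name `df_all` is only illustrative; the function receives each column name as a `String`.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9)

# The function-first method applies the function to every column name.
df_all = rename(s -> s * "_new", df)

names(df_all)  # ["a_new", "b_new", "c_new"]
```

Every column is renamed here, which is why a way to restrict the function to a subset of columns is being requested.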

@bkamins
Member

bkamins commented Sep 15, 2023

The question is whether this functionality is needed. Currently, instead of:

rename(df, 1:2 => (s -> s * "_new"))

you should write:

rename(df, [n => n * "_new" for n in ["a", "b"]])

Do you think it is that much less convenient?

And for a common case of:

rename(df, :b => (s -> s * "_new"))

It is easier to just write:

rename(df, :b => "b_new")

@nalimilan - what do you think?

@nathanrboyer
Contributor Author

nathanrboyer commented Sep 15, 2023

you should write:
rename(df, [n => n * "_new" for n in ["a", "b"]])
Do you think it is that much less convenient?

That syntax makes sense to me now that you've written it, but I probably wouldn't have come up with it on my own. I see now I could also write rename(df, [n => f(n) for n in ["a", "b"]]) for named functions, but I think the proposed method would be easier: rename(df, ["a", "b"] => f).

Also rename(df, [n => f(n) for n in Not("c")]) doesn't work, but rename(df, Not("c") => f) would.
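
One possible workaround with the existing API, sketched under the assumption that `f` accepts a `String` (the `f` below is a stand-in, not the function defined later in this comment): `names(df, Not("c"))` resolves the selector against the data frame and returns the matching names, which the comprehension can then iterate over.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9)
f(s) = uppercase(s)   # hypothetical renaming function, for illustration only

# `Not("c")` is not iterable on its own; `names(df, Not("c"))` resolves it
# against `df` and returns the matching column names as strings.
df_sel = rename(df, [n => f(n) for n in names(df, Not("c"))])

names(df_sel)  # ["A", "B", "c"]
```

This works today, but it is noticeably longer than the proposed `rename(df, Not("c") => f)`.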

The reason for this issue is that I assumed the below would work based on syntax similarity between rename and the manipulation functions, but it doesn't.

julia> df = DataFrame(a=1:3, b=4:6, c=7:9)
3×3 DataFrame
 Row │ a      b      c     
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> function f(s::Union{String, Symbol})
           s = lpad(s, 3, '_')
           s = rpad(s, 5, '_')
           if contains(s, 'b')
               return uppercase(s)
           else
               return s
           end
       end
f (generic function with 2 methods)

julia> f(:a)
"__a__"

julia> f(:b)
"__B__"

julia> f(:c)
"__c__"

julia> select(df, :a => :d)
3×1 DataFrame
 Row │ d     
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> rename(df, :a => :d)
3×3 DataFrame
 Row │ d      b      c     
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      4      7
   2 │     2      5      8
   3 │     3      6      9

julia> select(df, :a => identity => f)
3×1 DataFrame
 Row │ __a__ 
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> rename(df, :a => identity => f)
ERROR: MethodError: no method matching rename!(::DataFrame, ::Vector{Pair{Symbol, Pair{typeof(identity), typeof(f)}}})

julia> rename(df, :a => f)  # this method makes more sense than the previous one for `rename`
ERROR: MethodError: no method matching rename!(::DataFrame, ::Vector{Pair{Symbol, typeof(f)}})

This feature would make in-place column name changes with functions easier.

@nalimilan
Member

rename(df, :cols => f) makes sense to me as a syntax. I'm not sure how useful it is but it sounds in line with the other syntaxes we already support.

@bkamins
Member

bkamins commented Sep 17, 2023

I was thinking about it. What @nathanrboyer wants is rename(df, ["a", "b"] => f), so multiple columns would be allowed on LHS (and possibly, I assume other selectors). We currently do not allow it. The only syntax we support is :col1 => :col2 (i.e. single column to a single column).

Now, apart from a comprehension another way to express what @nathanrboyer wants is:

rename(n -> (n in ["a", "b"] ? f(n) : n), df)

or equivalently:

rename(df) do n
    n in ["a", "b"] ? f(n) : n
end

I know they are longer. But I am asking myself whether adding these examples to the manual, along with the comprehension rename(df, [n => n * "_new" for n in ["a", "b"]]), would be enough.

This boils down to the question of whether the pattern rename(df, ["a", "b"] => f) is needed frequently enough to warrant adding it. If it is a rare use case, then I would tend to improve the documentation and skip adding this option.

@nalimilan - indeed ["a", "b"] => f is supported in select etc., but it means something completely different, as it means:

  1. take columns :a and :b from df.
  2. pass them as positional arguments to f
  3. store the result in the :a_b_f column
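
A short sketch of the three steps above; `add` is a hypothetical two-argument function introduced only for this illustration.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9)
add(x, y) = x .+ y   # hypothetical function, not part of DataFrames.jl

# Both columns are passed to `add` as positional arguments;
# the result is stored in a single automatically named column.
out = select(df, ["a", "b"] => add)

names(out)   # ["a_b_add"]
out.a_b_add  # [5, 7, 9]
```

So in `select`, a function on the right-hand side transforms the column data, whereas in the proposed `rename` syntax it would transform the column names.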

@nalimilan
Member

Right, ["a", "b"] .=> f would be more appropriate (if we add it at all).
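
To make the broadcasted form concrete, here is a small sketch of what `.=>` produces in plain Julia and how it can be fed to the existing `rename`. The names `ops` and `df_up` are illustrative; `rename` does not accept `ops` directly at the time of this discussion, so the function is applied by hand.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9)

# Broadcasting `=>` pairs each column name with the function itself.
ops = ["a", "b"] .=> uppercase   # ["a" => uppercase, "b" => uppercase]

# `rename` expects old => new name pairs, so apply each function first.
df_up = rename(df, [old => fn(old) for (old, fn) in ops])

names(df_up)  # ["A", "B", "c"]
```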

@bkamins
Member

bkamins commented Sep 17, 2023

["a", "b"] .=> f

Yes, but this is exactly why I am hesitating to add it. A user (@nathanrboyer, for example) might have a different intuition, and ["a", "b"] .=> f is, in my opinion, far from obvious (even if we added it).

Simply put, => here and => in the operation specification syntax have completely different meanings, and I would not want to give users the impression that they are the same.

@nathanrboyer
Contributor Author

nathanrboyer commented Sep 18, 2023

Broadcasting when multiple columns are selected makes sense to me. My intuition is that the => in rename is the second one in select, so positional arguments to the operation_function would not be relevant for rename.

This is an excerpt of what I have written in the documentation PR:


Note that a renaming function will not work in the
source_column_selector => new_column_names operation form
because a function in the second element of the operation pair is assumed to take
the source_column_selector => operation_function operation form.
To work around this limitation, use the
source_column_selector => operation_function => new_column_names operation form
with identity as the operation_function.

julia> transform(df, :a => add_prefix)
ERROR: MethodError: no method matching *(::String, ::Vector{Int64})

julia> transform(df, :a => identity => add_prefix)
4×3 DataFrame
 Row │ a      b      new_a
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     1      5      1
   2 │     2      6      2
   3 │     3      7      3
   4 │     4      8      4

In this case though,
it is probably again more useful to use the rename or rename! function
rather than one of the manipulation functions
in order to rename in-place and avoid the intermediate operation_function.

julia> rename(df, :a => add_prefix) # rename one column
4×2 DataFrame
 Row │ new_a  b
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     3      7
   4 │     4      8

julia> rename(add_prefix, df) # rename all columns
4×2 DataFrame
 Row │ new_a  new_b
     │ Int64  Int64
─────┼──────────────
   1 │     1      5
   2 │     2      6
   3 │     3      7
   4 │     4      8

# Broadcasting syntax can be used to rename only some columns.
# See the Broadcasting Operation Pairs section below.

I do also have this example later in my documentation PR, but I don't think it needs to be supported by rename:

Renaming functions also work for multi-column transformations,
but they must operate on a vector of strings.

julia> df = DataFrame(data = [(1,2), (3,4)])
2×1 DataFrame
 Row │ data
     │ Tuple…
─────┼────────
   1 │ (1, 2)
   2 │ (3, 4)

julia> new_names(v) = ["primary ", "secondary "] .* v
new_names (generic function with 1 method)

julia> transform(df, :data => identity => new_names)
2×3 DataFrame
 Row │ data    primary data  secondary data
     │ Tuple…  Int64         Int64
─────┼──────────────────────────────────────
   1 │ (1, 2)             1               2
   2 │ (3, 4)             3               4

I do want rename(df, Not(:col) .=> f), rename(df, Cols(r"expression") .=> f), etc. to work. Quick idea of a use case:

df = DataFrame(Time = 0.0:0.1:0.3, Temp1 = rand(4), Temp2 = rand(4), Temp3 = rand(4))
function longname(x)
    x = string(x)
    n = last(x)
    if chop(x) == "Temp"
        return "Temperature " * n * " (°F)"
    else
        throw(ArgumentError("unsupported input string"))
    end
end
rename!(df, Not(:Time) .=> longname)

nathanrboyer added a commit to nathanrboyer/DataFrames.jl that referenced this issue Sep 18, 2023
@adienes
Contributor

adienes commented Sep 18, 2023

the syntax does kinda make sense to me but I think it should definitely have to be broadcasted (and maybe set a custom error message for when people inevitably attempt to chain a second => in)

@jariji
Contributor

jariji commented Sep 18, 2023

There is potential ambiguity between callable and AbstractString.

julia> struct StrFunc <: AbstractString
       s
       end

julia> ((;s)::StrFunc)(x) = string(s,x)

julia> StrFunc("hello ")("world")
"hello world"

julia> rename(df, :a => StrFunc("hello "))

@david-macmahon

Not realizing that there already is a rename method that takes a function as its first argument, I suggested just that on Slack. It turns out that method, rename(f::Function, df::AbstractDataFrame), renames all columns but other methods following that same pattern could have an additional argument of a symbol to rename one column or a Vector of symbols to rename multiple columns:

rename(uppercase, df, :col)

rename(ab->uppercase.(ab), df, [:a, :b])

rename(df, [:a, :b]) do ab
    uppercase.(ab)
end

@bkamins
Member

bkamins commented Sep 19, 2023

I like the proposal of @david-macmahon (if it is OK with the other discussants). Then the signature would be:

rename(f::Function, df::AbstractDataFrame; cols=All())

so the example call would be:

rename(uppercase, df, cols=Not(:col))

Note that I would make the last argument a keyword. There are two reasons for this:

  1. Then it is clearer that this is a non-standard syntax and what the parameter means (at least I feel like this).
  2. It is consistent with the describe method that follows this pattern e.g. you write describe(df, cols=[:a, :b]) to describe only columns :a and :b.

What do you think?

@david-macmahon

I like the consistency with describe. (BTW, your example seems to be missing the cols= part)

@nathanrboyer
Contributor Author

I like the keyword argument method; it is probably more functional (by enabling do) than mine. But I still like my method better for consistency. (We could add both methods.) I don't think there should be any difference between pairing a column to a new name or pairing a column to a function which generates a new name.

These should all work the same (except which columns are kept):

add_prefix(x) = "new_" * x
df = DataFrame(col = 1:3)

transform(df, :col => identity => :new_col)
transform(df, :col => identity => add_prefix)
rename(df, :col => :new_col)
rename(df, :col => add_prefix) # Error

That's my whole argument. If the top three methods work, then, to be consistent, the bottom method should work too. I'll leave it now to others to determine what is best to implement.

@bkamins
Member

bkamins commented Sep 21, 2023

In #3380 I have added the rename(f::Function, df::AbstractDataFrame; cols=All()) variant.

I have thought a lot about the other proposal, and I do not think we should add it. I think it is better to keep the rename API simple. Otherwise we would open a series of discussions about whether things like:

rename(df, Not(:a, :b) .=> uppercase, Cols(startswith("something")) .=> lowercase)

should work and how they would work (such things work in select et al., but they are very complex and would most likely never be used in rename). I think it is better to limit the API of rename. At the same time, the rename(f::Function, df::AbstractDataFrame; cols=All()) variant gives users an easy way to achieve what they would want in a typical case.
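
A minimal sketch of the `cols` keyword variant described above, assuming a DataFrames.jl release that includes #3380. The data frame and the name `df_caps` are illustrative only.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6, c=7:9)

# Only the columns selected by `cols` are passed to the function;
# all other columns keep their names.
df_caps = rename(uppercase, df; cols=Not(:a))

names(df_caps)  # ["a", "B", "C"]
```

Because the function is the first argument, the same call can also be written with a `do` block, for example `rename(df; cols=Not(:a)) do n ... end`.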

@nathanrboyer
Contributor Author

Okay, thanks for considering.
