Add replace!(::AbstractDataFrame, cols, ...) method #2257

Open
CameronBieganek opened this issue May 14, 2020 · 6 comments

@CameronBieganek

The manual shows how to replace values in multiple columns, e.g.

df2 = ifelse.(df .== 999, missing, df)

That's a neat trick, but it would be convenient if we had replace and replace! methods for data frames. Something like the following:

df2 = replace(df, :, 999 => missing)

df3 = replace(df, Between(:a, :c), 999 => missing)

Of course the eltype conversion behavior would mirror the behavior for Base.replace:

julia> y = [2, 5, 999, 7];

julia> replace(y, 999 => missing)
4-element Array{Union{Missing, Int64},1}:
 2
 5
  missing
 7

julia> replace!(y, 999 => missing)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Int64
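
The in-place error above comes from the element type of y, which has no room for missing. A minimal sketch of one workaround in plain Base Julia, assuming a widened copy is acceptable (the name z is only illustrative):

```julia
y = [2, 5, 999, 7]

# Widen the element type so the vector can hold `missing`.
# `convert` allocates a new vector here; `y` itself is left untouched.
z = convert(Vector{Union{Missing, Int}}, y)

# In-place replacement now succeeds because eltype(z) admits `missing`.
replace!(z, 999 => missing)
```

A data frame method for replace! would presumably face the same constraint for each column it touches.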
@bkamins bkamins added decision non-breaking The proposed change is not breaking labels May 14, 2020
@bkamins bkamins added this to the 1.x milestone May 14, 2020
@bkamins
Member

bkamins commented May 14, 2020

Why is

replace!.(eachcol(df[!, cols]), 999 => missing)

not enough for you? (This would be in-place.)

The issue with replace and replace! is that they would treat a data frame as a matrix, and we tend to define functions that treat it as a collection of rows. It would not be the end of the world, but still ...

So let us wait and see what others think.

@CameronBieganek
Author

Hmm, well I think my suggested API for the replace data frame methods is intuitive at least. And maybe I haven't been following closely enough, but parts of the current API feel more column oriented to me. For example:

julia> df = DataFrame(a = 2:3);

julia> transform(df, :a => (x -> log.(x)) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │

julia> transform(df, :a => ByRow(log) => :log_a)
2×2 DataFrame
│ Row │ a     │ log_a    │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 2     │ 0.693147 │
│ 2   │ 3     │ 1.09861  │

That feels to me like transform is treating :a as a column. You have to explicitly use ByRow() if you want to get by-row behavior.

But it could be my bias coming from R/dplyr where table manipulation is usually column based.

@CameronBieganek
Author

CameronBieganek commented May 14, 2020

A nice side benefit would be that replace!(::DataFrame, ...) would probably return the modified data frame, rather than the modified columns. I currently have this in one of my functions:

function foo(df)
    replace!(df.x, 9 => missing)  # only returns the array :x
    df
end

which would reduce to this under the new syntax:

function foo(df)
    replace!(df, :x, 9 => missing)
end

@bkamins
Member

bkamins commented May 14, 2020

It is true that select/transform/combine work differently, and we could add replace to this group. I was just noting a general trend we want to follow (but I still agree that what is intuitive and useful should be taken into consideration). Let us see what other people think and then decide.

@anandijain
Contributor

I think that replace!.(eachcol(df[!, cols]), nothing => missing) is "sufficient".

But I'm in favor of replace!(::AbstractDataFrame, cols, ...). It's a logical function call to those that don't necessarily know/want to know how a DataFrame is implemented.

The ifelse syntax is not all that memorable/intuitive, and replace! already exists.

It'd remove a small pain point for new users, I think.

@bkamins
Member

bkamins commented Oct 11, 2020

Another option is just:

select!(df, cols .=> x -> replace!(x, nothing => missing), renamecols=false)
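
For readers who want the behavior requested in this issue today, here is a rough sketch of a user-level helper. The name replace_cols! is hypothetical and not part of DataFrames.jl; it loops over the selected columns and returns the data frame itself, as the issue suggests. It assumes the selected columns already have an element type that can hold the replacement value.

```julia
using DataFrames

# Hypothetical helper, NOT part of DataFrames.jl: apply Base.replace! to each
# selected column and return the data frame rather than the columns.
function replace_cols!(df::AbstractDataFrame, cols, old_new::Pair...)
    # copycols=false makes the selection share columns with `df`,
    # so mutating them mutates `df` in place.
    for col in eachcol(select(df, cols; copycols=false))
        replace!(col, old_new...)
    end
    return df
end

# Columns are declared as Union{Missing, Int} so that `missing` fits.
df = DataFrame(a = Union{Missing, Int}[1, 999], b = Union{Missing, Int}[999, 2])
result = replace_cols!(df, :, 999 => missing)
```

Because the helper delegates to Base.replace!, it raises the same MethodError shown earlier in the thread if a column cannot store the new value.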


3 participants