Easier syntax for mixed computations by row or by column #186

Closed
jkrumbiegel opened this issue Oct 7, 2020 · 8 comments

@jkrumbiegel
Contributor

jkrumbiegel commented Oct 7, 2020

I came up with a syntax that I like quite a lot in my own little macro package I was trying out, and I wanted to see if there is interest to add something like it here.

The problem

Some computations in @transform, @combine, etc. are easier to express when thinking about the inputs as whole column vectors. Others are easier to write in an elementwise fashion. Currently, computations are done on whole vectors. There is a @byrow macro, but I think it iterates over all rows, making it slower than what I have in mind. In any case, it cannot be mixed with the vector style.

The solution

We can pun on broadcasting assignment syntax to solve this issue. Here is an example:

df = DataFrame(val = rand(1:4, 100), tup = rand([(1, 2, 3), (4, 5, 6)], 100))

# vector syntax
@transform(df, z = :val .* getindex.(:tup, 2))

# proposed row wise syntax
@transform(df, z .= :val * :tup[2])

How it works

While the default syntax creates this function:

(val, tup) -> val .* getindex.(tup, 2)

The rowwise version creates this:

(val, tup) -> map((v, t) -> v * t[2], val, tup)

This should still be as fast as possible for a row wise computation, compared to iterating all rows unnecessarily.
Another benefit is that the two syntaxes can be easily mixed, depending on which way of thinking is more appropriate for the current computation.
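
As a rough illustration of the two generated functions described above, here is a minimal sketch in plain Julia, without DataFramesMeta. The names `vector_style` and `rowwise_style` and the small sample columns are assumptions made for this example only; they are not what the macro actually emits.

```julia
# Stand-ins for the two columns of the example data frame.
val = [1, 2, 3, 4]
tup = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (4, 5, 6)]

# Vector style: the body sees whole columns and broadcasts explicitly.
vector_style = (val, tup) -> val .* getindex.(tup, 2)

# Row-wise style: the body is written for a single row and mapped over the columns.
rowwise_style = (val, tup) -> map((v, t) -> v * t[2], val, tup)

# Both produce the same column: [2, 10, 6, 20]
same_result = vector_style(val, tup) == rowwise_style(val, tup)
```

The row-wise body can use plain indexing such as `t[2]` because it only ever sees one element of each column at a time.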

@pdeffebach
Collaborator

Thanks for this! I appreciate the proposal.

I think this idea conflicts with other appearances of .= syntax, though.

y .= f(x)

means that y is updated in-place by the values returned by f(x), i.e.

t = f(x)
for i in eachindex(y)
    y[i] = t[i]
end

This is not what's going on with the .= operator in your proposal. In particular, this operation is always going to allocate, and the target column might not even exist to be assigned into.
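
The in-place meaning of `.=` in ordinary Julia can be sketched as follows. This is plain Base Julia; `f`, `y_alias`, and `undefined_target` are hypothetical names chosen for the illustration.

```julia
x = [1, 2, 3]
f(v) = 2 .* v

y = zeros(Int, 3)
y_alias = y              # second reference to the same array
y .= f(x)                # overwrites the contents of the existing array
still_same_array = y_alias === y   # true: y was mutated, not rebound

# `.=` needs an existing target; with an undefined name it throws.
err = try
    undefined_target .= f(x)
catch e
    e
end
target_missing = err isa UndefVarError
```

This is the sense in which a new column `z` in `z .= ...` differs from standard usage: there is nothing to write into yet.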

Note that in DataFramesMeta, on both the release branch and master, you can use @.:

@transform(df, y = @. :val * getindex(:tup, 2))

You are right that :tup[2] doesn't work, though. iirc this is something that might be allowed in Julia Base in the future.

@jkrumbiegel
Contributor Author

I do know that the broadcasted assignment syntax is usually understood as mutation of an array. I just felt that for a macro which defines a nonstandard DSL for DataFrame manipulation, such nonstandard functionality is ok.

I actually have the situation quite often that I have an expression that is not really suitable for broadcasting syntax. For example, keywords don't broadcast, so if you want to pass values of one column as keyword arguments to a function, that wouldn't work unless you manually created the map that I proposed here. Do you think that's too uncommon to make it easy with this dot assignment syntax?
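
The keyword case can be sketched in plain Julia. The function `scale` and the sample vectors are hypothetical and exist only for this illustration; the point is that a dot call broadcasts positional arguments but passes keyword arguments through unchanged to every call.

```julia
# Hypothetical function taking a keyword argument.
scale(v; by = 1) = v * by

vals = [1, 2, 3]
factors = [10, 20, 30]

# Dot call: `vals` is broadcast, but the whole `factors` vector is handed
# to each call as `by`, so every element becomes a vector.
dotted = scale.(vals; by = factors)

# Explicit map: each row gets its own keyword value.
rowwise = map((v, f) -> scale(v; by = f), vals, factors)   # [10, 40, 90]
```

Here `dotted[1]` is `[10, 20, 30]` rather than `10`, which is why the manual `map` is needed today.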

@pdeffebach
Collaborator

a macro which defines a nonstandard DSL for DataFrame manipulation, such a nonstandard functionality is ok

I think that this is a slightly different mental model of DataFramesMeta than the one I imagine people having. I would like people to view DataFramesMeta as a way to construct an expression for inputting into DataFrames.transform. I don't think that it's very clear from y .= f(:x) that a user should read :x => ByRow(f) => :y. imo a simple flag @byrow y = f(:x) is more explicit (assuming we change the name of the existing @byrow macro).

@jkrumbiegel
Contributor Author

jkrumbiegel commented Oct 8, 2020

a simple flag @byrow y = f(:x) is more explicit

Maybe you're right. I didn't even know about the ByRow(f) wrapper, and I probably would not use it in its standard form anyway because things get very verbose. But in this macro package I think it could be good. Something like this?

@transform(df,
	@byrow y = f(:x),
	z = g.(:x))

# or

@transform(df,
	y = @byrow f(:x),
	z = g.(:x))

this is a slightly different mental model

I understand where you're coming from. Personally, I value non-redundant syntax a lot, especially for things that I have to write over and over. So to me, a package like DataFramesMeta really makes DataFrames comfortable to use, because DataFrames' defaults rely on a lot of redundant typing. Which is fine, because it's understandable that the base package doesn't want to use macros. But I'd also say that this frees macro packages which try to implement the smoothest DSL workflow possible from staying too close to the original syntax.

In my mind it's not a problem to do a .= keyword syntax in a macro, people can easily learn what it means and move on. I don't know if you have used R's data.table before, which in my opinion is a good example of having a slightly steeper learning curve, but then offering a very concise syntax once you get the hang of it. It takes so much redundancy out of the split apply combine workflow. You could argue too that it doesn't do multidimensional indexing even though that's what it looks like for an outsider. They just determined that this is a really good way to express the typical transformations they need.

@matthieugomez

matthieugomez commented Oct 14, 2020

Interesting. Another limitation of your proposal is that it cannot be extended to expressions that do not create a new variable (which may happen in @where, @orderby, etc.). But I like the idea: maybe a related one would be to use the syntax @transform., @select., etc.

@nalimilan
Member

Using .= is appealing, but in Julia one needs to use dots elsewhere in the expression to enable broadcasting, e.g. y .= x .+ 1. So that would essentially be a syntax pun.

But I like the idea: maybe a related one would be to use the syntax @transform., @select., etc.

Unfortunately @select. isn't a valid identifier name so it can't work.

@matthieugomez

matthieugomez commented Oct 15, 2020 via email

pdeffebach added this to the 1.X milestone Mar 7, 2021
@pdeffebach
Collaborator

Closed with the addition of @rtransform etc. in #267
