Refactor the DataFrame#transform method to be more elegant #6
Try a closure:

```python
def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something):
    def partial(df):
        return df.withColumn("something", lit(something))
    return partial

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)            # no lambda required
    .transform(with_something("crazy")))
```

In JS this looks like:

```js
const myFunc = (first_set_of_args) => (second_set_of_args) => {
    ...function body
}
```
|
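The closure pattern above doesn't depend on Spark at all. As a minimal runnable sketch of the same mechanics, here is a pure-Python version that uses a plain dict in place of a DataFrame (the `Frame`-free names and the standalone `transform` helper are illustrative, not PySpark API):

```python
def with_greeting(df):
    # a transformation that takes only the "frame"
    return {**df, "greeting": "hi"}

def with_something(something):
    # the outer function captures `something`; the inner function
    # closes over it and has the one-argument shape that a
    # transform-style method expects
    def inner(df):
        return {**df, "something": something}
    return inner

def transform(df, fn):
    # stand-in for DataFrame.transform: just applies fn to df
    return fn(df)

df = {"name": "jose", "age": 1}
df = transform(df, with_greeting)
df = transform(df, with_something("crazy"))
print(df)  # {'name': 'jose', 'age': 1, 'greeting': 'hi', 'something': 'crazy'}
```

The key point is that `with_something("crazy")` returns a one-argument function, so no lambda is needed at the call site.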
Or with `functools.partial`:

```python
from functools import partial

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something, df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_something, "crazy")))
```
|
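`functools.partial` achieves the same one-argument shape by pre-binding the leading arguments. A minimal pure-Python sketch, again with a plain dict standing in for a DataFrame (names are illustrative):

```python
from functools import partial

def with_something(something, df):
    # two-argument function; partial pre-binds `something`,
    # leaving a one-argument callable that takes just the "frame"
    return {**df, "something": something}

add_crazy = partial(with_something, "crazy")
result = add_crazy({"name": "li", "age": 2})
print(result)  # {'name': 'li', 'age': 2, 'something': 'crazy'}
```

The closure and the `partial` versions are interchangeable here; `partial` keeps the transformation as a flat two-argument function, while the closure nests it.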
Thanks @pirate. I updated the test suite (commit 283629a) to demonstrate how functools.partial can be used. I also changed the string "luisa" to "liz" based on a code review from @lizparody 😉 I also updated the blog post to include a functools.partial example. Thanks! |
AAHAHHAHAHAHAHAHAHAHAHAHAHAH 😂 it was a joke!!! |
Fyi @MrPowers, you don't need partial on the first transform func for the same reason that you don't need a lambda there:

```python
actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_jacket, "warm")))
```

I also recommend using the word "closure" or "higher order function" somewhere in your blog post, as those are the "standard" names instead of "nested function". Great blog post though, nice work! |
Thanks @pirate - I updated the code and blog post accordingly. Thanks for all the help here - I really appreciate the feedback. Feel free to rip up my code or blog posts anytime!!! |
@pirate - @capdevc showed me how to use cytoolz to run multiple custom DataFrame transformations with function composition. Take a look at this commit. Thanks @capdevc!!! |
@MrPowers I really like cytoolz and use it a lot, but it's a pretty heavy dependency to pull in for just the curry decorator. Curry and compose are also available in toolz, which is the same as cytoolz minus the cython bits, which shouldn't matter in this application. You could also just add your own curry decorator, since it's essentially a wrapper around `functools.partial`. |
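As a rough illustration of the "write your own curry decorator" suggestion, here is a simplified sketch built on `functools.partial`. This is not the toolz implementation (which handles many more edge cases); it only covers plain positional/keyword arguments:

```python
from functools import partial
from inspect import signature

def curry(fn):
    # Simplified curry: if called with fewer arguments than fn
    # declares, return a partial awaiting the rest; otherwise call fn.
    n_params = len(signature(fn).parameters)
    def curried(*args, **kwargs):
        if len(args) + len(kwargs) >= n_params:
            return fn(*args, **kwargs)
        return partial(curried, *args, **kwargs)
    return curried

@curry
def with_something(something, df):
    return {**df, "something": something}

step = with_something("crazy")    # one arg supplied -> returns a partial awaiting df
print(step({"name": "luisa"}))    # {'name': 'luisa', 'something': 'crazy'}
```

With a decorator like this, `df.transform(with_something("crazy"))` works without importing toolz, at the cost of maintaining the edge cases yourself.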
Closing this now that DataFrame#transform has been included in PySpark. Really appreciate everyone's help. |
This library defines a DataFrame.transform method to chain DataFrame transformations. The Spark Scala API has a built-in transform method that lets users chain DataFrame transformations more elegantly, as described in this blog post.

Here's an interface I'd prefer (this is what we do in Scala, and I know this will need to be changed around for Python, but I'd like something like this). Here is the code that needs to be changed.

If we can figure out a better interface, we should consider making a pull request to the Spark source code. I use the transform method every day when writing Spark/Scala code and think this is a major omission in the PySpark API. If my ideal interface isn't possible, is there anything that's better?! I really don't like my current solution that requires lambda.

@pirate - help!
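For illustration, the kind of chainable transform interface being asked for can be sketched on a toy class. This is a mock, not the PySpark implementation; `Frame` and its dict-based rows are stand-ins invented for this sketch:

```python
class Frame:
    # Toy stand-in for a DataFrame, just to show the chaining interface.
    def __init__(self, rows):
        self.rows = rows

    def withColumn(self, name, value):
        return Frame([{**r, name: value} for r in self.rows])

    def transform(self, fn):
        # the requested method: apply fn(self) and return the result,
        # so custom transformations chain without lambdas
        return fn(self)

def with_greeting(df):
    return df.withColumn("greeting", "hi")

result = Frame([{"name": "jose"}]).transform(with_greeting)
print(result.rows)  # [{'name': 'jose', 'greeting': 'hi'}]
```

The whole feature is a one-line method: `transform` just inverts the call so the DataFrame stays on the left of the chain.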