
Refactor the DataFrame#transform method to be more elegant #6

Closed
MrPowers opened this issue Oct 31, 2017 · 9 comments

MrPowers commented Oct 31, 2017

This library defines a DataFrame.transform method to chain DataFrame transformations as follows:

from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = source_df\
    .transform(lambda df: with_greeting(df))\
    .transform(lambda df: with_something(df, "crazy"))
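For context, the `transform` method this library adds can be sketched roughly as follows. A minimal stand-in `DataFrame` class is used here so the sketch runs without Spark; the library attaches the equivalent method to `pyspark.sql.DataFrame`:

```python
# Hypothetical stand-in for pyspark.sql.DataFrame, used only so this
# sketch is runnable without a Spark session.
class DataFrame:
    def __init__(self, rows):
        self.rows = rows

def transform(self, f):
    # Apply the transformation function to self and return its result,
    # which enables fluent chaining: df.transform(f).transform(g)
    return f(self)

DataFrame.transform = transform

def with_greeting(df):
    return DataFrame([dict(row, greeting="hi") for row in df.rows])

result = DataFrame([{"name": "jose"}]).transform(with_greeting)
print(result.rows)  # [{'name': 'jose', 'greeting': 'hi'}]
```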

The Spark Scala API has a built-in transform method that lets users chain DataFrame transformations more elegantly, as described in this blog post.

Here's the interface I'd prefer (it mirrors what we do in Scala; I know Python doesn't support multiple parameter lists directly, so this would need to be adapted, but I'd like something along these lines):

def with_greeting()(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something)(df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = source_df\
    .transform(with_greeting())\
    .transform(with_something("crazy"))
# the transform method would magically know that self should be passed into the second parameter list

Here is the code that needs to be changed.

If we can figure out a better interface, we should consider making a pull request to the Spark source code. I use the transform method every day when writing Spark/Scala code and think this is a major omission in the PySpark API.

If my ideal interface isn't possible, is there anything better? I really don't like that my current solution requires a lambda.

@pirate - help!


pirate commented Oct 31, 2017

Try a closure:

from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something):
    def partial(df):
        return df.withColumn("something", lit(something))
    return partial

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)   # no lambda required
    .transform(with_something("crazy")))

In JS this looks like:

const myFunc = (first_set_of_args) => (second_set_of_args) => {
    ...function body
}


pirate commented Nov 1, 2017

functools.partial actually works for this too, although I think the closure method is cleaner/easier to understand:

from functools import partial
from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something, df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_something, "crazy")))


MrPowers commented Nov 1, 2017

Thanks @pirate.

I updated the test suite to demonstrate how functools.partial can be used. I also changed the string "luisa" to "liz" based on a code review from @lizparody 😉

I also updated the blog post to include a functools.partial example.

Thanks!


LizzParody commented Nov 1, 2017 via email


pirate commented Nov 1, 2017

FYI @MrPowers, you don't need partial on the first transform function, for the same reason you don't need a lambda there: lambda x: func(x), partial(func), and plain func all behave identically as one-argument callables.

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_jacket, "warm")))
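That equivalence can be checked with a quick runnable example (a plain Python function is used here, since no Spark is needed to demonstrate it):

```python
from functools import partial

def shout(text):
    return text.upper()

# All three spellings produce the same one-argument callable behavior:
assert (lambda x: shout(x))("hi") == partial(shout)("hi") == shout("hi") == "HI"
```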

I also recommend using the word "closure" or "higher order function" somewhere in your blog post, as those are the "standard" names instead of "nested function".

Great blog post though, nice work!


MrPowers commented Nov 1, 2017

Thanks @pirate - I updated the code and blog post accordingly.

Thanks for all the help here - I really appreciate the feedback. Feel free to rip up my code or blog posts anytime!!!


MrPowers commented Nov 6, 2017

@pirate - @capdevc showed me how to use cytoolz to run multiple custom DataFrame transformations with function composition. Take a look at this commit.

Thanks @capdevc!!!


capdevc commented Nov 6, 2017

@MrPowers I really like cytoolz and use it a lot, but it's a pretty heavy dependency to pull in for just the curry decorator. Curry and compose are also available in toolz, which is the same as cytoolz minus the Cython bits; that shouldn't matter in this application. You could also just add your own curry decorator, since it's essentially a wrapper around functools.partial.
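A hand-rolled curry decorator along those lines might look like this. This is a rough sketch built on functools.partial, not toolz's actual implementation, and `with_something` below operates on a plain list as a stand-in for a DataFrame:

```python
import inspect
from functools import partial, wraps

def curry(func):
    # If called with fewer arguments than func declares, return a partial
    # waiting for the rest; otherwise call func directly.
    @wraps(func)
    def curried(*args, **kwargs):
        if len(args) + len(kwargs) >= len(inspect.signature(func).parameters):
            return func(*args, **kwargs)
        return partial(curried, *args, **kwargs)
    return curried

@curry
def with_something(something, df):
    # Stand-in transformation: appends a column-like pair to a plain list.
    return df + [("something", something)]

add_crazy = with_something("crazy")   # partially applied; still waiting for df
print(add_crazy([]))  # [('something', 'crazy')]
```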

@MrPowers

Closing this now that DataFrame#transform has been included in PySpark. Really appreciate everyone's help.
