
Refactor the DataFrame#transform method to be more elegant #6

Closed
MrPowers opened this issue Oct 31, 2017 · 9 comments

MrPowers commented Oct 31, 2017

This library defines a DataFrame.transform method to chain DataFrame transformations as follows:

from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(df, something):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = source_df\
    .transform(lambda df: with_greeting(df))\
    .transform(lambda df: with_something(df, "crazy"))
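For context, the `transform` method this library adds can be sketched roughly as follows. A minimal stand-in `DataFrame` class is used here so the sketch runs without Spark; the library attaches the equivalent method to `pyspark.sql.DataFrame`:

```python
# Hypothetical stand-in for pyspark.sql.DataFrame, used only so this
# sketch is runnable without a Spark session.
class DataFrame:
    def __init__(self, rows):
        self.rows = rows

def transform(self, f):
    # Apply the transformation function to self and return its result,
    # which enables fluent chaining: df.transform(f).transform(g)
    return f(self)

DataFrame.transform = transform

def with_greeting(df):
    return DataFrame([dict(row, greeting="hi") for row in df.rows])

result = DataFrame([{"name": "jose"}]).transform(with_greeting)
print(result.rows)  # [{'name': 'jose', 'greeting': 'hi'}]
```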

The Spark Scala API has a built-in transform method that lets users chain DataFrame transformations more elegantly, as described in this blog post.

Here's the interface I'd prefer (it mirrors what we do in Scala; I know Python doesn't support multiple parameter lists directly, so this would need to be adapted, but I'd like something along these lines):

def with_greeting()(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something)(df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = source_df\
    .transform(with_greeting())\
    .transform(with_something("crazy"))
# the transform method would magically know that self should be passed into the second parameter list

Here is the code that needs to be changed.

If we can figure out a better interface, we should consider making a pull request to the Spark source code. I use the transform method every day when writing Spark/Scala code and think this is a major omission in the PySpark API.

If my ideal interface isn't possible, is there anything better? I really don't like that my current solution requires a lambda.

@pirate - help!


pirate commented Oct 31, 2017

Try a closure:

from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something):
    def partial(df):
        return df.withColumn("something", lit(something))
    return partial

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)   # no lambda required
    .transform(with_something("crazy")))

In JS this looks like:

const myFunc = (first_set_of_args) => (second_set_of_args) => {
    ...function body
}


pirate commented Nov 1, 2017

functools.partial actually works for this too, although I think the closure method is cleaner/easier to understand:

from functools import partial
from pyspark.sql.functions import lit

def with_greeting(df):
    return df.withColumn("greeting", lit("hi"))

def with_something(something, df):
    return df.withColumn("something", lit(something))

data = [("jose", 1), ("li", 2), ("luisa", 3)]
source_df = spark.createDataFrame(data, ["name", "age"])

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_something, "crazy")))


MrPowers commented Nov 1, 2017

Thanks @pirate.

I updated the test suite to demonstrate how functools.partial can be used. I also changed the string "luisa" to "liz" based on a code review from @lizparody 😉

I also updated the blog post to include a functools.partial example.

Thanks!


LizzParody commented Nov 1, 2017 via email


pirate commented Nov 1, 2017

FYI @MrPowers, you don't need partial on the first transform function, for the same reason you don't need a lambda there: lambda x: func(x), partial(func), and plain func all behave identically as one-argument callables.

actual_df = (source_df
    .transform(with_greeting)
    .transform(partial(with_jacket, "warm")))
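That equivalence can be checked with a quick runnable example (a plain Python function is used here, since no Spark is needed to demonstrate it):

```python
from functools import partial

def shout(text):
    return text.upper()

# All three spellings produce the same one-argument callable behavior:
assert (lambda x: shout(x))("hi") == partial(shout)("hi") == shout("hi") == "HI"
```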

I also recommend using the word "closure" or "higher order function" somewhere in your blog post, as those are the "standard" names instead of "nested function".

Great blog post though, nice work!


MrPowers commented Nov 1, 2017

Thanks @pirate - I updated the code and blog post accordingly.

Thanks for all the help here - I really appreciate the feedback. Feel free to rip up my code or blog posts anytime!!!


MrPowers commented Nov 6, 2017

@pirate - @capdevc showed me how to use cytoolz to run multiple custom DataFrame transformations with function composition. Take a look at this commit.

Thanks @capdevc!!!


capdevc commented Nov 6, 2017

@MrPowers I really like cytoolz and use it a lot, but it's a pretty heavy dependency to pull in for just the curry decorator. Curry and compose are also available in toolz, which is the same as cytoolz minus the Cython bits; that shouldn't matter in this application. You could also just add your own curry decorator, since it's essentially a wrapper around functools.partial.
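A hand-rolled curry decorator along those lines might look like this. This is a rough sketch built on functools.partial, not toolz's actual implementation, and `with_something` below operates on a plain list as a stand-in for a DataFrame:

```python
import inspect
from functools import partial, wraps

def curry(func):
    # If called with fewer arguments than func declares, return a partial
    # waiting for the rest; otherwise call func directly.
    @wraps(func)
    def curried(*args, **kwargs):
        if len(args) + len(kwargs) >= len(inspect.signature(func).parameters):
            return func(*args, **kwargs)
        return partial(curried, *args, **kwargs)
    return curried

@curry
def with_something(something, df):
    # Stand-in transformation: appends a column-like pair to a plain list.
    return df + [("something", something)]

add_crazy = with_something("crazy")   # partially applied; still waiting for df
print(add_crazy([]))  # [('something', 'crazy')]
```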

@MrPowers

Closing this now that DataFrame#transform has been included in PySpark. Really appreciate everyone's help.
