Use numpy and math functions inside verbs? #80

Open
derekpowell opened this issue Jan 26, 2019 · 9 comments

Comments

@derekpowell

derekpowell commented Jan 26, 2019

I'm running into errors when I try to use numpy or math functions (e.g., sqrt, log, etc.) inside dfply verbs. Here's a minimal example:

import pandas as pd
from dfply import *
import numpy as np

df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
df >> mutate(y = np.log(X.x))

This gives the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-f8d61ebf2e20> in <module>()
      3 df = pd.DataFrame({'x': np.linspace(1, 10, 500)})
      4 
----> 5 df >> mutate(y = np.log(X.x))

ValueError: invalid __array_struct__

Is this functionality not implemented? Maybe there's a workaround I'm not seeing?

(I'm on python 3.6.3)

@sharpe5

sharpe5 commented Jan 27, 2019 via email

@derekpowell
Author

Thanks for the quick response.

type(X.x) returns dfply.base.Intention
type(df.x) returns pandas.core.series.Series (as expected).

In the example I gave, df.assign(y = np.log(df.x)) works fine. So I'm pretty sure it's not a problem with the array in the dataframe.

@sharpe5

sharpe5 commented Jan 27, 2019 via email

@derekpowell
Author

I'm not an expert, but I'm very confident it's not a problem with the original dataframe. I'd be curious whether the example I gave reproduces for you. You can also try an even simpler example:

df = pd.DataFrame({"x":[.1,.2,.3,.4,.5,1,2,3]})
df >> mutate(y = np.log(X.x))

That gives the same error for me. Hopefully @kieferk can solve this.

@jankatins

The problem here is that Python isn't R. R evaluates function arguments lazily, which means that the call to the log function is delayed until the function receives the dataframe as a context. Python evaluates arguments eagerly, so np.log is applied to the X.x object first, and only then is the result passed to the mutate call. The X object usually simulates delayed evaluation by recording your intent (mutate(z=X.x*X.y): "multiply the x column of the passed-in dataframe by the y column"). mutate receives this recording and executes it in the context of the real dataframe.

The problem arises when a function doesn't know about this recording mechanism, as is the case here with np.log. It expects an array (which is why df.x works) and gets the "recorder" object instead.
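
As a rough illustration of the recording idea described above, here is a minimal sketch in plain Python. The Recorder class and the toy mutate function are hypothetical names invented for this example; they are not dfply's actual Intention class or verb, and a dict of lists stands in for the dataframe. math.log plays the role of an eager function that knows nothing about the recorder.

```python
import math


class Recorder:
    """Toy stand-in for a deferred object like X (NOT dfply's real Intention)."""

    def __init__(self, evaluate=lambda context: context):
        # evaluate: function that turns a context (a dict of columns) into a value
        self._evaluate = evaluate

    def __getattr__(self, name):
        # Record "look up column `name`" instead of doing it right away.
        if name.startswith("__"):
            raise AttributeError(name)
        return Recorder(lambda context: self._evaluate(context)[name])

    def __mul__(self, other):
        # Record an element-wise multiplication of two deferred columns.
        return Recorder(
            lambda context: [
                a * b
                for a, b in zip(self._evaluate(context), other._evaluate(context))
            ]
        )

    def evaluate(self, context):
        # Replay the recording against real data.
        return self._evaluate(context)


X = Recorder()


def mutate(columns, **new_columns):
    """Toy verb: evaluate each recorded expression against the real columns."""
    result = dict(columns)
    for name, expr in new_columns.items():
        result[name] = expr.evaluate(columns)
    return result


columns = {"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
out = mutate(columns, z=X.x * X.y)  # out["z"] is [10.0, 40.0, 90.0]

# An eager function that knows nothing about Recorder runs (and fails)
# before mutate is ever called:
try:
    math.log(X.x)
    eager_call_failed = False
except TypeError:
    eager_call_failed = True
```

The multiplication works because Recorder defines what `*` means for recorded objects, while math.log receives the recorder itself and has no way to delay its own execution.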

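What might work is a method call on the recorded object, such as X.x.log(), provided the underlying Series actually has such a method (a plain pandas Series has no log() method, so something like X.x.apply(np.log) may be needed instead).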

@derekpowell
Author

Aha, this is what I feared. That's unfortunate, definitely limits the utility of the mutate() functions in dfply.

I've also been playing with the plydata package, which can handle these kinds of operations. In plydata, computations are passed as strings, e.g. mutate(y = "np.log(x)"). This isn't necessarily more elegant, but it seems to have allowed them to make these kinds of operations work properly. Unfortunately, it's currently a bit less complete with respect to the tidyverse verbs it offers (e.g., it's currently missing gather() and spread()), which dfply has covered very well.
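
To illustrate why passing the computation as a string sidesteps the eager-evaluation problem, here is a small sketch in plain Python. The mutate_str function is a hypothetical name for this example and is not plydata's implementation; it evaluates the string row by row over a dict of lists, whereas a real library would presumably work on whole columns.

```python
import math


def mutate_str(columns, **expressions):
    """Toy verb taking expressions as strings (NOT plydata's implementation)."""
    n_rows = len(next(iter(columns.values())))
    result = dict(columns)
    for new_name, expr in expressions.items():
        code = compile(expr, "<expr>", "eval")
        values = []
        for i in range(n_rows):
            # Column names become local variables holding this row's values.
            row = {name: col[i] for name, col in columns.items()}
            # The string is only evaluated here, once real values exist.
            # Note: eval is unsafe on untrusted input.
            values.append(eval(code, {"math": math}, row))
        result[new_name] = values
    return result


columns = {"x": [1.0, 10.0, 100.0]}
out = mutate_str(columns, y="math.log(x)")
```

Because the expression stays an inert string until the verb has the data in hand, any function can appear inside it, at the cost of losing editor support and syntax checking for the expression.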

@germayneng

I think a workaround could be:

df >> mutate(y_log = np.log(df['y']))

@omrihar

omrihar commented Apr 23, 2020

I came across this issue because I was searching for the exact same problem. After reading the documentation I noticed this is actually addressed, and the correct way to solve this would be:

@make_symbolic
def log(series):
    return np.log(series)

df >> mutate(y_log = log(X.y))

I can verify this works without a problem!
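
For readers wondering what a decorator like this has to do, here is a minimal sketch of the general idea in plain Python. Deferred, column, and make_symbolic_sketch are hypothetical names for this example and are not dfply's code; the assumption is only that the decorator delays the call whenever it sees a deferred argument and otherwise calls the function directly.

```python
import functools
import math


class Deferred:
    """Toy deferred value wrapping a function of the context (NOT dfply's Intention)."""

    def __init__(self, evaluate):
        self.evaluate = evaluate


def column(name):
    # Hypothetical helper playing the role of X.<name>.
    return Deferred(lambda context: context[name])


def make_symbolic_sketch(func):
    """If any argument is Deferred, return a Deferred call instead of calling func now."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        all_args = list(args) + list(kwargs.values())
        if not any(isinstance(a, Deferred) for a in all_args):
            # Plain values: behave exactly like the original function.
            return func(*args, **kwargs)

        def resolve(value, context):
            return value.evaluate(context) if isinstance(value, Deferred) else value

        # Deferred arguments: postpone the call until a context is supplied.
        return Deferred(
            lambda context: func(
                *(resolve(a, context) for a in args),
                **{k: resolve(v, context) for k, v in kwargs.items()},
            )
        )

    return wrapper


@make_symbolic_sketch
def log(values):
    return [math.log(v) for v in values]


delayed = log(column("y"))                       # nothing is computed yet
result = delayed.evaluate({"y": [1.0, math.e]})  # computed once real data arrives
eager = log([1.0])                               # plain list: computed immediately
```

The wrapped function never sees the deferred object itself, which is why an ordinary function body such as np.log(series) can be reused unchanged.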

@tonyduan

I've run into this issue as well and it'd be awesome if we could add @omrihar's solution into the codebase!
