More examples in docstrings #321

Open
mahiki opened this issue Jan 11, 2022 · 15 comments

@mahiki
Contributor

mahiki commented Jan 11, 2022

It would be very helpful if each macro/command had an example in the docstring.

For example, I've been having a lot of trouble using @rtransform to make a new column based on conditional aspects of other columns at the row level. The docstring in the REPL is:

help?> @rtransform
  @rtransform(x, args...)

  Row-wise version of @transform, i.e. all operations use @byrow by default. See @transform for details.

I didn't find the help I needed here; a good example would keep me moving along with my work.

I'm not a computer scientist, and it's time-intensive to unravel how these excellent tools are implemented. I've invested time to apply Julia in my work vs. the Python ecosystem because of its great qualities; however, the most consistent hurdle for me is the lack of examples.

Perhaps the usage is obvious to package developers, but for more pedestrian types like me nothing could be more illuminating than a good example.

@bkamins
Member

bkamins commented Jan 11, 2022

We should improve docstrings, but in the meantime maybe this https://bkamins.github.io/julialang/2021/11/19/dfm.html would help you?

@mahiki
Contributor Author

mahiki commented Jan 11, 2022

Ah yes, these are good, thank you @bkamins. To be fair, the referenced @transform and @byrow docstrings are helpful; it just requires more synthesis at an inconvenient time.

I guess the docstring for @rtransform could be like the following. Should I submit a PR?

"""
    @rtransform(x, args...)

Row-wise version of @transform, i.e. all operations use @byrow by default. See @transform for details.

### Example
```jldoctest
julia> df = DataFrame(x=1:5, y=11:15)
5×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1     11
   2 │     2     12
   3 │     3     13
   4 │     4     14
   5 │     5     15

julia> @rtransform(df, :a = 2 * :x, :b = :x * :y ^ 2)
5×4 DataFrame
 Row │ x      y      a      b
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11      2    121
   2 │     2     12      4    288
   3 │     3     13      6    507
   4 │     4     14      8    784
   5 │     5     15     10   1125
```
"""

@mahiki
Contributor Author

mahiki commented Jan 11, 2022

I should definitely submit a PR. This is something I run into all the time, and it would be fantastic to be in the habit of making contributions that people in the data engineering (DE) community would enjoy.

@bkamins
Member

bkamins commented Jan 11, 2022

Looks good. Thank you!

@pdeffebach
Collaborator

Yes, this would be appreciated. But to clarify the problem some more, did you do ? @transform to look at the docstring for @transform, after being pointed there by the docstring for @rtransform?

@bkamins
Member

bkamins commented Jan 11, 2022

Even if the @transform docstring is OK, I think it is worth improving @rtransform (and others).

@mahiki
Contributor Author

mahiki commented Jan 12, 2022

@pdeffebach I did ? @rtransform, and then saw the reference. I followed with @byrow and realized I had a lot of study to do before using this operation. Unfortunately I was working and didn't have an extra half hour, so I solved my problem inelegantly.

So, yes, I couldn't be bothered, but at least it was for work reasons.

A little more about my workflow: I come from a SQL and Scala Spark background in my work, and a couple of years ago I decided to 1) expunge all usage of Excel and 2) incorporate Julia into my work. This is swimming against the tide in a big way, since colleagues and the industry are almost completely locked into the Python ecosystem.

I've had success in 2021 incorporating Julia at my job, developing workflows in production with containerized environments. It's really a pleasure, especially the DataFrames syntax but additionally the package management. It comes with a lot of upfront cost, like this process of learning DataFramesMeta commands, but it's totally worth it in my view.

It would have been so convenient to see an example right in the REPL docs, so I'll contribute by filling in those gaps where I hit them.

@pdeffebach
Collaborator

Thanks for the background.

Yes, makes sense. No use making people play Zork for docs. Please submit a PR adding examples!

mahiki added a commit to mahiki/DataFramesMeta.jl that referenced this issue Jan 19, 2022
@mahiki
Contributor Author

mahiki commented Jan 19, 2022

PR created!

@mahiki
Contributor Author

mahiki commented Jan 19, 2022

I don't understand the cause of the doctest failure; I'll read up on this. Probably a bit of missing syntax?

mahiki added a commit to mahiki/DataFramesMeta.jl that referenced this issue Jan 19, 2022
@mahiki
Contributor Author

mahiki commented Jan 20, 2022

I'm very happy to see the REPL examples in there. I think they are more effective than googling or reading doc pages, which divert attention from the task at hand.

Now that I've spent some time figuring things out, I made personal notes of the following equivalent dataframe transformations. I needed column value assignments conditional on other columns in the same row. This shows how convenient and readable DataFramesMeta can be:

using DataFramesMeta, Dates  # DataFramesMeta re-exports DataFrames; Dates provides Date

df = DataFrame(flag = [0, 1, 0, 1, 0, 1]
    , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
    , qty = [1, 4, 1, 3, 21, 109]
    , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"]
    , day = Date.(["2021-01-01", "2021-01-01", "2112-12-12", "2020-10-20", "2021-05-04", "1984-07-04"])
    )

6×5 DataFrame
 Row │ flag   amt      qty    item    day        
     │ Int64  Float64  Int64  String  Date       
─────┼───────────────────────────────────────────
   1 │     0    19.0       1  B001    2021-01-01
   2 │     1    11.0       4  B001    2021-01-01
   3 │     0    35.5       1  B020    2112-12-12
   4 │     1    32.5       3  B020    2020-10-20
   5 │     0     5.99     21  BX00    2021-05-04
   6 │     1     5.99    109  BX00    1984-07-04


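# 1. DataFramesMeta: row-wise macro, columns referenced as symbols
@rtransform(df
    , :Tax = :flag * 0.11 * :amt
    , :Discount = :item == "B020" ? -0.25 * :amt : 0
    )

# 2. DataFrames: the same row-wise logic with ByRow and anonymous functions
transform(df
    , [:flag, :amt] => ByRow((x,y) -> x * 0.11 * y) => :Tax
    , [:item, :amt] => ByRow((x,y) -> x == "B020" ? -0.25 * y :  0) => :Discount
    )

# 3. DataFrames: whole-column functions with explicit broadcasting
transform(df
    , [:flag, :amt] => ((x,y) -> x * 0.11 .* y) => :Tax
    , [:item, :amt] => ((x,y) -> (x .== "B020") * -0.25 .* y ) => :Discount
    )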

6×7 DataFrame
 Row │ flag   amt      qty    item    day         Tax      Discount 
     │ Int64  Float64  Int64  String  Date        Float64  Float64  
─────┼──────────────────────────────────────────────────────────────
   1 │     0    19.0       1  B001    2021-01-01   0.0       -0.0
   2 │     1    11.0       4  B001    2021-01-01   1.21      -0.0
   3 │     0    35.5       1  B020    2112-12-12   0.0       -8.875
   4 │     1    32.5       3  B020    2020-10-20   3.575     -8.125
   5 │     0     5.99     21  BX00    2021-05-04   0.0       -0.0
   6 │     1     5.99    109  BX00    1984-07-04   0.6589    -0.0

# OK I haven't figured out the broadcast operation with ternary operator, however the dfs pass `==` test.

I wonder if this example of comparative constructions would be useful in the DataFramesMeta documentation page? I really struggled to figure this out, but it looks so obvious now.
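
On the open point about broadcasting a ternary: the `? :` operator itself cannot be dotted, but two common workarounds exist. The sketch below uses plain vectors standing in for the :item and :amt columns so it runs without any packages; the helper name `discount_for` is hypothetical and only for illustration.

```julia
# Plain vectors standing in for the :item and :amt columns of the df above.
item = ["B001", "B001", "B020", "B020", "BX00", "BX00"]
amt  = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]

# Option 1: ifelse is an ordinary function, so it broadcasts with a dot.
# Unlike the ternary, it evaluates both branches for every element.
discount_ifelse = ifelse.(item .== "B020", -0.25 .* amt, 0.0)

# Option 2: keep the ternary inside a scalar helper (hypothetical name),
# then broadcast the helper over both vectors.
discount_for(i, a) = i == "B020" ? -0.25 * a : 0.0
discount_ternary = discount_for.(item, amt)
```

In the whole-column `transform` form, the body of the anonymous function could then be `ifelse.(x .== "B020", -0.25 .* y, 0.0)`.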

@pdeffebach
Collaborator

This is mentioned in certain places. Check out the first code block here.

A PR on this section would be welcomed. I don't want to make the translations too prominent at the beginning because I don't want new users to get too intimidated. My ideal user is probably a first year masters student in the social sciences who is programming for the first time. It would be great to work on a PR for this in detail, but with those constraints in mind.

Additionally, remember MacroTools.@macroexpand, which is super useful for understanding DataFramesMeta.jl, albeit only for advanced users.

julia> using DataFramesMeta, Dates;

julia> df = DataFrame(flag = [0, 1, 0, 1, 0, 1]
           , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
           , qty = [1, 4, 1, 3, 21, 109]
           , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"]
           , day = Date.(["2021-01-01", "2021-01-01", "2112-12-12", "2020-10-20", "2021-05-04", "1984-07-04"])
           );

julia> @rtransform(df
           , :Tax = :flag * 0.11 * :amt
           , :Discount = :item == "B020" ? -0.25 * :amt : 0
           );
julia> using MacroTools

julia> MacroTools.@macroexpand(@rtransform(df
           , :Tax = :flag * 0.11 * :amt
           , :Discount = :item == "B020" ? -0.25 * :amt : 0
           )) |> MacroTools.prettify
:((DataFrames).transform(df, DataFramesMeta.make_source_concrete([:flag, :amt]) => (ByRow(((waterbuffalo, gaur)->waterbuffalo * 0.11 * gaur)) => :Tax), DataFramesMeta.make_source_concrete([:item, :amt]) => (ByRow(((cod, fish)->if cod == "B020"
                          -0.25 * fish
                      else
                          0
                      end)) => :Discount)))

@mahiki
Contributor Author

mahiki commented Jan 23, 2022

This is great. The more time I spend in Julia the better I like it.

I think what the DataFramesMeta docs pages are missing is a simple front page that shows clear examples of how easy the syntax is to formulate for common tasks. Also a clear message about the mission of the package.

The difficulty from the new user's perspective:

  • I am trying to write a series of transformations to shape some data.
  • I know about DataFrames.jl, it's the obvious package to use.
  • I've never heard of DataFramesMeta, and if I have I don't know what it does. It sounds like 'next level' DataFrames, probably not something I'm ready for since I'm struggling with formulations in regular DataFrames.

Here's the first sentence of the Introduction on the repo README:

Metaprogramming tools for DataFrames.jl objects to provide more convenient syntax.

As a non-expert user, especially not knowing much about meta-programming, this already looks too advanced for me.

I recommend something more immediately obvious by saying something like:

Simplifies column and row transformations with natural syntax in column and row value assignments. For example, compare these two equivalent formulations:

df = DataFrame(x=1:5, y=11:15)

# DataFramesMeta syntax via assignment
@rtransform(df, :y = :x == 1 ? true : false)

# DataFrames typical pairs selector syntax with the ByRow() helper and anonymous function
transform(df, :x => ByRow(x -> x == 1 ? true : false) => :y)

This would make it easy to see what the purpose of this package is, I think.

@mahiki
Contributor Author

mahiki commented Feb 15, 2022

@pdeffebach
Collaborator

Thanks! I'll take a look later today.

pdeffebach pushed a commit that referenced this issue Mar 17, 2022
* add ready examples to docstring for rselect, rtransform (#321)

* jldoctest needs to load DataFramesMeta to run the tests (#321)

* jldoctest examples for @rorderby @rsubset

* minor df output alignment fix
3 participants