NullableArrays + New API Release Roadmap #1092

Closed
johnmyleswhite opened this Issue Oct 4, 2016 · 38 comments

@johnmyleswhite
Contributor

johnmyleswhite commented Oct 4, 2016

I'd like to have a single issue that identifies every issue that's blocking a release of DataFrames that replaces DataArrays.

At a high level, my understanding is the following:

  • Ensure that StructuredQueries.jl is in a place where we're ready to tell everyone to use it. (Alternatively, tell everyone to use Query.jl.)
  • Prepare people by publishing blog posts and plenty of documentation about how things will work. Make sure people see why we're deprecating the old API.
  • Tag and release the code merged in #1008.
  • Publish a release of DataFrames that uses deprecations or some other mechanism to push people towards our new APIs.

@ararslan ararslan added this to the 0.9.0 milestone Oct 4, 2016

@davidagold

davidagold commented Oct 4, 2016

NullableArrays.jl could use some love before the new release -- either JuliaStats/NullableArrays.jl#148 or some subset of it.

EDIT: Though I suppose this isn't strictly blocking, it would be good to settle where lifted operators should live before making public any APIs that depend on them.

@amellnik

Contributor

amellnik commented Oct 5, 2016

Would this be a good place to link to issues with other packages that rely on DataFrames or should that be handled in a separate issue?

@amellnik

Contributor

amellnik commented Oct 5, 2016

@IainNZ -- I saw you wrote a tool to look at package dependencies. Would you mind running it on DataFrames and DataArrays so we can look for actively maintained packages that might be affected? (If there's a way I can do this myself, let me know.) Thanks -A

@ararslan

Member

ararslan commented Oct 5, 2016

Thanks for your help getting the word out on the changes, @amellnik!

If I'm not mistaken, Pkg.dependents("DataFrames") can be used to get a list of packages that currently rely on DataFrames. And I think Jacob Quinn had requested that Tony Kelman run the package evaluator on the ecosystem but with DataFrames on master rather than release to see what catches fire. That will be immensely helpful.
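For reference, a minimal sketch of that lookup, assuming the Julia 0.5-era Pkg API (Pkg.dependents reads your local METADATA checkout, so results are only as fresh as your last update):

```julia
# Sketch (old Pkg API): list registered packages whose REQUIRE
# declares DataFrames. Refresh the local METADATA copy first.
Pkg.update()
for pkg in sort(Pkg.dependents("DataFrames"))
    println(pkg)
end
```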

@amellnik

Contributor

amellnik commented Oct 5, 2016

Ah thanks Alex -- I didn't know about that Pkg command. I'll start looking through the list and trying to submit PRs for the low-hanging fruit now.

@pkofod

pkofod commented Oct 6, 2016

If I'm not mistaken, Pkg.dependents("DataFrames") can be used to get a list of packages that currently rely on DataFrames. And I think Jacob Quinn had requested that Tony Kelman run the package evaluator on the ecosystem but with DataFrames on master rather than release to see what catches fire. That will be immensely helpful.

Only for packages that you have installed though, right?

@tlnagy

tlnagy commented Oct 6, 2016

I think it scans your local copy of METADATA so it should find all dependents if METADATA is reasonably up to date.

@ararslan

Member

ararslan commented Oct 6, 2016

You can also use JuliaPkgDb and do select * from versions where dependency = 'DataFrames';

@tkelman

Contributor

tkelman commented Oct 12, 2016

It won't look at test dependencies though. Those are worth checking as well. I don't recall being asked explicitly to run pkgeval, I was asked how one would go about doing it which I answered.

@ararslan

Member

ararslan commented Oct 12, 2016

Hey @tkelman, would you be willing to explicitly run PkgEval? 🙂

@tkelman

Contributor

tkelman commented Oct 12, 2016

latest master here then? anyone have a particular metadata fork+branch with a provisional test tag? skip the git tag and pr part of it for now.

@ararslan

Member

ararslan commented Oct 12, 2016

latest master here then?

Yep.

anyone have a particular metadata fork+branch with a provisional test tag?

I don't but I can make one (much) later today if no one has beaten me to it.

@quinnj

Member

quinnj commented Oct 12, 2016

Does this look right, @tkelman? JuliaLang/METADATA.jl#6735

@tkelman

Contributor

tkelman commented Oct 14, 2016

May want to rebase that right before I fire off the run. Been sick the last few days, bug me again if I haven't run this by Monday.

@tkelman

Contributor

tkelman commented Oct 17, 2016

Looks like you had "enable edits" checked, so I did the rebase and fired off PkgEval. Hopefully it'll work, check https://github.com/JuliaCI/pkg.julialang.org/branches in 10-12ish hours to see if it creates a new "dataframes" branch with the results.

@tkelman

Contributor

tkelman commented Oct 18, 2016

I should have run that differently since this version doesn't support Julia 0.4 and there's a bunch of unrelated SSL certificate failures on 0.6-dev right now, but focus on the 0.5 pulse changes here: https://htmlpreview.github.io/?https://raw.githubusercontent.com/JuliaCI/pkg.julialang.org/dataframes/pulse.html

  1. BayesNets
  2. Bootstrap
  3. COMTRADE
  4. CrossfilterCharts
  5. DataFramesMeta
  6. EEG
  7. FreqTables
  8. GLM
  9. Gadfly
  10. Graft
  11. ImageCore unrelated
  12. JDBC
  13. KernelDensityEstimator
  14. KernelEstimator
  15. LifeTable
  16. Mamba
  17. Query
  18. RData
  19. RDatasets
  20. ROCAnalysis
  21. ScikitLearn
  22. Stan
  23. StatsBase
  24. ValueOrientedRiskManagementInsurance
  25. Vega
  26. VegaLite
  27. WorldBankData

Many of these, but not all, look like packages just making the now-wrong assumption that depending on DataFrames is enough to also ensure DataArrays will be installed. The easiest short term fix for those cases would be adding DataArrays to each of these packages' REQUIRE files (could even be done directly in METADATA without having to wait for a new tag), though there could easily still be remaining failures after doing that.
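As an illustration of that short-term fix (contents are a sketch; the actual Julia and package version bounds would vary per package), a dependent package's REQUIRE file would gain an explicit DataArrays entry alongside DataFrames:

```
julia 0.5
DataFrames
DataArrays
```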

@davidanthoff

Contributor

davidanthoff commented Oct 18, 2016

Publish a release of DataFrames that uses deprecations or some other mechanism to push people towards our new APIs

What exactly is the plan here? We are not going to say that some query framework is the only way to interact with a DataFrame, right? I had always assumed that Query.jl and StructuredQueries will add another API to interact with DataFrames, but not replace existing APIs. Is that not the plan?

@nalimilan

Contributor

nalimilan commented Oct 18, 2016

I don't think we're going to remove the current API (at least not immediately, and not all of it), but even with lifted operators it's significantly more cumbersome to use after the port to Nullable. So I'm not sure anybody will want to use it anymore. (Yes, evil plan... :-)

@davidanthoff

Contributor

davidanthoff commented Oct 19, 2016

Just as an aside and follow up to @ararslan's post on how you can query the package database with his JuliaPkgDb repo:

I took his code and turned it into a data source that can be queried with Query.jl. For now you need the code here, but I'll probably move that into Query.jl at some point. His example could then be written like this in pure Julia, without saving things into a SQL database first:

using DataFrames, Query
include("packagedb.jl")

db = pkgdb()

@from i in db begin
    @from j in i.versions
    @from k in j.requires
    @where k.name=="DataFrames"
    @select {i.name, j.version}
    @collect DataFrame
end

Should be fairly simple to also read information about test requirements and add them to the mix.

@ararslan Let me know if you have any thoughts about this code, but maybe over at queryverse/Query.jl#51. I just posted this info in the current issue in case it might be helpful for going through packages; any follow-up discussion about the code should probably happen somewhere else.

@ararslan

Member

ararslan commented Oct 19, 2016

@davidanthoff That's neat, thanks for sharing. 🙂

@amellnik

Contributor

amellnik commented Oct 20, 2016

I may have been premature in opening issues related to this change in other packages -- except in trivial cases like Mustache.jl, I don't anticipate any progress on those until we have a clear strategy for queries and lifting/conversion.

If we do go with either StructuredQueries or Query, I think we also need to make sure there's still a way to do very simple operations inline.

@ExpandingMan

Contributor

ExpandingMan commented Oct 24, 2016

I'd just like to provide some feedback here in regard to the query methods.

I personally rather dislike Query.jl (no offense to the developers of course); though it's cleverly done, it involves a rather elaborate interface which gets too far away from the normal structure of the language. To me, the great thing about pandas is that it is NOT like SQL, but is instead normal python code, although it can certainly be simpler in a number of cases, as it naturally would be in Julia.

By contrast, I really quite like DataFramesMeta.jl. It's beautifully elegant and (aesthetically) efficient, and fits very comfortably and uniformly into the existing style and structure of the Julia language.

To me, using macros to splice SQL into Julia doesn't make sense. Julia is a wonderful, incredibly well-thought-out language (something I would never say about SQL); any query API for dataframes should take advantage of this and embrace all the things that make Julia so much better than SQL. Furthermore, a query package is much more suitable for general, custom use if it doesn't hide the underlying structure of the dataframes or the queries themselves.

This is just my opinion of course, and hopefully a few others will agree with me. I'm not trying to discourage work on Query.jl; those who prefer that style should go with it, and certainly there will be some cases where it is much easier to use than DataFramesMeta.jl. I'm just hoping it won't become a paradigm. It's likely that at some point I'll start working on DataFramesMeta.jl, or something like it.

@tshort

Contributor

tshort commented Oct 24, 2016

Both Query.jl and StructuredQueries.jl are new. As such, this is a good time to request features and describe use cases. For example, David has started coding some shortcuts in Query.jl that make it look more like DataFramesMeta and pandas. Feedback on that might be helpful. Unfortunately, I don't have a link handy where he showed an example.

@davidanthoff

Contributor

davidanthoff commented Oct 24, 2016

@ExpandingMan I totally agree that something like Query.jl shouldn't become the only way to interact with DataFrames. It is really powerful if you want to do something complicated, but for small, simple manipulations it is overkill and too verbose. I've started to add another alternative user-facing API that is a little less heavyweight, and the underlying architecture is flexible enough to potentially allow additions of more alternative user-facing APIs (e.g. something that looks more like dplyr could be added as another option). But that is all future talk.

The more I think about it, the more convinced I am that the current transition plan for DataFrames to Nullable is a major mistake. Completely giving up on trying to support things like df[:a] = log.(df[:b]) seems like a major step back to me. That is actually a really good API for doing such a simple manipulation from the REPL. This is not an "I told you so" comment; I was actually strongly in favor of merging the NullableArrays PR into DataFrames. But I've changed my mind lately: I think completely breaking the existing API is not a good strategy, and we should have come up with a way to preserve the old API.

@davidanthoff

Contributor

davidanthoff commented Oct 24, 2016

@ExpandingMan and @tshort Here is an issue discussing this new API queryverse/Query.jl#72. Feedback welcome!

@ExpandingMan

Contributor

ExpandingMan commented Oct 24, 2016

Was there ever any real choice about transitioning to Nullable? Type stability is crucial for performance (and probably also stability) reasons, period. I don't see what the alternative would be. I don't know, I think with a good enough API most users will be happy with it, but it may take a lot of work to get there.

For example, df[:a] = log(df[:b]) should work out-of-the-box, because all basic math functions should be overloaded for NullableArray and eventually even Nullable (so that df[:a] = log.(df[:b]) would work). There should also be a macro like df[:a] = @lift userfunc.(df[:b]).

I think implementations of basic math functions for Nullable are one of the things missing right now; having them would make this process seem much less painful. If there were enough functionality already built into the language for Nullable, this wouldn't seem like such a huge change.

I know it'll be a lot of work, and I haven't been involved in this, so I don't want to sound like a back-seat-driver, but it seems to me like you guys made the right decision.

@davidanthoff

Contributor

davidanthoff commented Oct 24, 2016

I think there is zero chance that we will be allowed to add a lifted version of things like log to Base. Adding it in a package is type piracy, and we are also not allowed to do that (rightfully so!). If I'm right, then there is no chance that log.(df[:b]) will ever work, short of some major language-level lifting support (which I also don't see happening any time soon). See also my general skepticism re a general lifting strategy here: queryverse/Query.jl#71.

If we don't want to give up on this kind of API, I think the only way to get there is to use a different type for missing values, see JuliaData/Roadmap.jl#3. I've actually migrated Query.jl to that kind of design, and everything works just fine: I get type stability, I can provide lifted versions of all the functions I like etc.

@ExpandingMan

Contributor

ExpandingMan commented Oct 24, 2016

I've now read many of these issues, and I am starting to see the point about why lifted methods should not be defined for Nullable.

First, let me just say that I still think type-stability is absolutely crucial. Even if performance were the only issue (and it isn't), there seems to be reason enough not to use something like NA. If you are going to write code specifically to handle NAs, you might as well write code specifically to handle Nullable.

Here's how I see it: there are two versions of null values currently in widespread use which should serve as prototypes: C++'s nullptr and IEEE floating-point NaN. The default behavior for nullptr is not to propagate, but instead to throw errors. The default behavior of NaN is to propagate. For data purposes, we want a nullable type which behaves like NaN.

I think I agree with those who are saying that we need another type which resembles Nullable, but is specifically designed for cases which resemble NaN.

There are a few other things I'd like to point out about lifting strategies:

  1. Even if the default behavior for f(df[:a]) is to return Nullable values, it doesn't necessarily follow that, for example, Nullable(a) == Nullable(b) should return a Nullable{Bool} and not a Bool. In this case one doesn't necessarily expect a return value in the same space as a and b, so it seems perfectly acceptable for it to return a completely different type. I don't really think that choosing a convention for this sort of thing is a huge obstacle for lifting schemes. Again, following the example of NaN seems like a reasonable choice: NaN == NaN has evaluated to false since its inception, and it works fine.
  2. Having @where(df, x .> a) behave somewhat differently than [x for x in df[:x] if x > a] doesn't seem like a big problem to me. The whole idea of having special query functions is for them to behave somewhat differently, and I think that it is entirely transparent to the user that this is the case. If it weren't, the user would probably have avoided using the query methods in the first place.
  3. This seems like an important use-case for metaprogramming. I don't think having to do df[:a] = @lift f.(df[:b]) is such a bad thing.

Also, let's try not to lose sight of something: at the end of the day, when what's being stored in a dataframe is ultimately fed into a machine learning method, integrator, whatever, the null values must be dealt with. I still haven't entirely given up on the idea of having a new type down-graph from AbstractDataFrame which has no null values, and uses ordinary Arrays for storage, because this will always be useful. The mathematical heavy-lifting, so to speak, is going to get done on good ol' floats, and I don't see that ever changing. I think that's reason for optimism.

@davidagold

davidagold commented Oct 24, 2016

I vote to rein this thread back to its original purpose, which is to plan for the release of the new DataFrames.jl master. The foregoing comments suggest that we might want to offer at least a minimal native API that doesn't rely on a query framework. I think we can accommodate as much. Higher-order functions such as map, reduce and broadcast are easy enough to treat fairly generally, e.g.

julia> using NullableArrays

julia> immutable Lifted{F}
           f::F
       end

julia> lift(f) = Lifted(f)
lift (generic function with 1 method)

julia> @inline function (_f::Lifted{F}){F}(x)
           f = _f.f
           if Base.null_safe_op(f, typeof(x))
               return Nullable(f(x.value), !isnull(x))
           else
               U = Core.Inference.return_type(f, Tuple{eltype(typeof(x))})
               if isnull(x)
                   return Nullable{U}()
               else
                   return Nullable(f(unsafe_get(x)))
               end
           end
       end

julia> Base.broadcast(f::Function, X::NullableArray) = broadcast(Lifted(f), X)

julia> digamma.(NullableArray(rand(5), rand(Bool, 5)))
5-element Array{Nullable{Float64},1}:
 #NULL  
 #NULL  
 #NULL  
 -1.6254
 #NULL  

So perhaps we may as well offer those, even if I would encourage a query framework over their use in general. In any case, we should make very clear which APIs will work, which will not, and why we're moving away from the ones that we're breaking. I can see a compromise where we deprecate the old API by including a minimal set of lifted methods, to ease the transition.

@quinnj

Member

quinnj commented Oct 24, 2016

I think that's a great plan/idea.

It also occurred to me that, due to the potential reach of such a change, we could provide a limited-time "upgrade message" that is printed in the DataFrames __init__() function when the package is first loaded. It could spell out that the package has gone through a big internals change, point out a few key APIs that may not work like before, and offer alternatives. We could also point out that the user can do Pkg.checkout("DataFrames", "0.8.3") (or whatever the current version is) with a Pkg.pin("DataFrames") to ensure they stay on the legacy DataFrames code that will still work with any scripts/code they have written against it. As long as we make sure existing DataFrame dependency packages also have proper upper bounds, this also ensures their package environment should be stable.
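A sketch of that opt-out path, assuming the old (pre-Pkg3) package manager API; the version number here stands in for whatever the last pre-change tag turns out to be:

```julia
# Hypothetical sketch: stay on the legacy DataFrames release.
# Pkg.pin with a version argument both checks out and pins that tag.
Pkg.pin("DataFrames", v"0.8.3")

# Later, to rejoin normal upgrades:
Pkg.free("DataFrames")
```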

@davidanthoff

Contributor

davidanthoff commented Oct 25, 2016

Just to be clear, @davidagold's suggestion would have the following consequences:

isnull.(df[:a])

would not work in the way anyone would expect it to work. That case could probably be fixed by something more elaborate, but I believe nothing could be done about the following case, which would also not work as one would expect:

foo(x) = isnull(x) ? 0 : x
foo.(df[:a])

I seem to be in the minority on this, but I think both cases are really bad. The dot syntax is conceptually really simple, except that with this it all of a sudden gets really complicated. I actually would much prefer if the difficulties with the DataFrame API didn't end up making the dot syntax more unpredictable...

@johnmyleswhite

Contributor

johnmyleswhite commented Oct 25, 2016

I'm pretty sure that David's solution allows you to provide a custom definition for how isnull works as an argument to higher-order functions.

@davidagold

davidagold commented Oct 25, 2016

Yes, isnull.(X) could be made to work easily. But David Anthoff is right that

f(x) = isnull(x) ? foo : bar
f.(X)

cannot. Even broadcast(x -> isnull(x) ? foo : bar, X) cannot, though @select isnull(a) ? foo : bar can.

I agree that such inconsistencies with the behavior for arrays are not ideal. Neither lifting approach is ideal, by any means. What would be ideal (from my limited perspective) would be to have two versions of Nullable{T} -- one that propagates (A) and one that doesn't (B) -- and then have any function f called on an argument signature xs that contains any A arguments automatically lowered to lift(f, xs...). I have no idea how difficult this would be to implement, but I also can't think of a problem that it doesn't solve. I'd like to hear @JeffBezanson's take on whether or not this is possible/feasible. If it is, then we should view whatever lifting strategy we adopt presently as a stop-gap until somebody with the know-how has the time to work on it. Even if that takes a year, or two years, I think it would be worth it. In the meantime, things won't be perfect.
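As a rough sketch of what such an automatic lowering might call into -- names hypothetical, unary case only, reusing the return_type trick from the broadcast example earlier in the thread (Julia 0.5-era syntax):

```julia
# Hypothetical lift: propagate nulls for the "A"-style Nullable.
# Non-Nullable arguments pass through untouched.
lift(f, x) = f(x)
function lift{T}(f, x::Nullable{T})
    # Infer the result type so the null branch stays type-stable.
    U = Core.Inference.return_type(f, Tuple{T})
    isnull(x) ? Nullable{U}() : Nullable{U}(f(get(x)))
end
```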

@quinnj

Member

quinnj commented Oct 28, 2016

In terms of timeline, I think we should be explicit about what we're thinking here and commit to something publicly. It gives the best chance for others to be aware and make changes and it also helps this package keep moving forward without being in this "limbo" state for too long (i.e. the difference between the latest release tag and master).

Of course, I'd say the bulk of what needs to happen before a new release depends on non-DataFrames things: NullableArrays, Nullable operations in general, StructuredQueries.jl/Query.jl, etc. I'd like to propose (and include in John's original post at the top) that we aim to have the bulk of the non-DataFrames package work done by the first of January. That gives us over two months to help update the ecosystem and iron out any remaining design decisions, as well as allowing StructuredQueries/Query to mature. We can think of the first of January as the moment we want to aim for a "release candidate" of sorts, allowing a few weeks to sort out final bugs before the final public release and announcement. We should obviously try to come up with some kind of announcement that can be released soonish letting everyone know about the plan.

@ExpandingMan

Contributor

ExpandingMan commented Oct 31, 2016

Another point in favor of @davidagold 's comment above:

Metaprogramming tools are unbelievably complicated unless we have a null type that propagates. That seems to me like the only good solution to that particular problem. I've now spent some time looking at DataFramesMeta with the intention of making it work for NullableArrays. The task of making it work at the same level of generality with any lifting strategy is quite daunting, even if it would be very simple for a few specific use cases.

@davidagold

davidagold commented Oct 31, 2016

@ExpandingMan so you're aware, I've been working on a query representation framework that automatically transforms user-specified calls f(xs...) to lift(f, xs...), for generic null propagation, with lift defined here. It's still under heavy development. But maybe the general strategy will be useful for whatever it is you want to do.

EDIT: That's by way of saying, the metaprogramming involved in a generic, macro-based lifting approach is not intractable. That's good, because I highly doubt that the solution I proposed above will happen any time soon. (It certainly won't happen in the form I suggested above, in which the transformation is conditional upon the types of the arguments.)

@TheOnlySilverClaw

TheOnlySilverClaw commented Dec 22, 2016

If I may add something to the API discussion, as someone who wasn't involved in its development: I'm currently getting a DataFrame as a result of an SQL query. I would rather not be forced to use another Query API to get my values out of that DataFrame a second time. And I totally agree that imitating SQL in Julia would be a step backwards.

@ararslan

Member

ararslan commented Mar 14, 2017

The Nullable-based backend has been moved to the DataTables package. I'll close this issue here; further discussion should take place in an issue in DataTables. If you continue this particular discussion, please include a link to this issue for the sake of continuity.

Thanks, everyone!

@ararslan ararslan closed this Mar 14, 2017

@JuliaData JuliaData locked and limited conversation to collaborators Mar 14, 2017
