Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backport DataTables using merge commit #1220

Merged
merged 97 commits into from
Sep 3, 2017
Merged

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Sep 2, 2017

This is equivalent to #1214 but using git merge rather than git rebase. The merge commit is quite small (it can be seen locally using git show from the head of the branch). DataTables history has been preserved, except for the DataFrames ->DataTables rename which has been erased using this script.

Gord Stephen and others added 30 commits September 14, 2016 10:13
Add compatibility with pre-contrasts ModelFrame constructor
* Only sort duplicated columns once

* Added comments

* moved check for identical arrays inside of for loop

* Don't mention PR under /src, only mention them under /test
Resolve "WARNING: [a] concatenation is deprecated; use collect(a) instead"
Use vcat() instead of collect() in colwise(), and identity() instead of abs(),
since the latter do not work with Nullable.
Also switch from mkdocs output to Documenter's HTML output.
make GroupApplied immutable by adding subframe type parameter
* avoid [:], use reshape()
* avoid unnecessary Symbol<->String conversions
* handle -1 and add tests

* replace `import Base.==` with `Base.:(==)`

* typo and error test
Also return a NullableCategoricalArray from sharepools() since
the code currently doesn't check that no null values are present.
anyway this function is internal and the change imposes no overhead.
* Better display of Nullables

* Don't write trailing space in Latex output

Also fix missing newline in show test
* limit attribute of IOContext is used for html generation

* fixup
I apparently missed these occurrences when removing these functions.
cjprybol and others added 9 commits July 8, 2017 19:04
The subdatatable/views code did not have a clear function heirarchy.
Sometimes view called subdatatable, sometimes subdatatable called back
up to view, and other times view would call view again before later
calling subdatatable. The code was also relying on a custom Index that
was used nowhere else, and seemed unneccessary. Fixes issue that users
could only specify single columns to subset on (rather than arrays of
columns), and adds tests for datatables and subdatatables to assert
view works as expected.
io.md does not exist since the readtable() has been removed. Pooling
should be called categorical, so rename the file and sections.
CategoricalValue entries should always be printed via showcompact() in order
to get a short representation. This uses ourshowcompact() to do that when
printing DataTables to REPL via show, as well as when printing them to HTML,
LaTeX and CSV via printtable().

Also fix a failure due to duplicated keyword arguments on Julia 0.7.
Fixes a failure on nightlies.
@nalimilan
Copy link
Member Author

Of course this cannot be merged via the GitHub interface since that's not fast-forward. It will have to be done manually.

@nalimilan
Copy link
Member Author

(Even if the colors don't indicate it, tests are green on Travis with Julia 0.6.)

@rofinn
Copy link
Member

rofinn commented Sep 2, 2017

Diffing rf/datatables and nl/datatables-backport wasn't very helpful for reviewing, but a few things I did notice were:

  1. A few remaining incorrect names (e.g., dt -> df, JuliaData -> JuliaStats).
  2. Documentation that needs to be removed or fixed (e.g., the formulas documentation should be removed)

I suppose those can probably be fixed after this is merged.

@nalimilan
Copy link
Member Author

Good catch! New version should fix these issues.

Copy link
Member

@ararslan ararslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks alright to me. A lot of the things I noted can be addressed by another FemtoCleaner run once this is merged.

```@docs
eltypes
head
completecases
completecases!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be removing completecases here in addition to completecases!?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm looks like they do different things; the former identifies complete cases as a boolean vector and the latter is now dropnull!.

iris = dataset("datasets", "iris")
using DataFrames
using CSV
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"), DataFrame)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slashes in the path here kind of defeat the purpose of using joinpath 😉

colwise,
combine,
completecases,
completecases!,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment regarding completecases


##############################################################################
##
## Equality
##
##############################################################################

# Imported in DataFrames.jl for compatibility across Julia 0.4 and 0.5
Base.:(==)(df1::AbstractDataFrame, df2::AbstractDataFrame) = isequal(df1, df2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we change this recently in DataTables? Having == call isequal defeats the purpose of having them be separate functions. If this is only for 0.4 and 0.5 compatibility, we don't support those anymore, so this could be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought this looked fishy too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's also related to issues around how to define == for Nullable. We should fix this in later PRs.

@@ -405,14 +378,43 @@ function StatsBase.describe(io, df::AbstractDataFrame)
end
end

function StatsBase.describe{T}(io::IO, X::AbstractVector{Union{T, Null}})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax is deprecated, should use where

@@ -689,7 +663,7 @@ function deleterows!(df::DataFrame, ind::AbstractVector{Int})
idf = 1
iind = 1
ikeep = 1
keep = Vector{Int}(n - length(ind2))
keep = Array{Int}(n-length(ind2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was Array before the fork, and has been changed to Vector in DataFrames since then. I've changed the merge commit to use Vector since that's slightly better.

@@ -96,95 +95,54 @@ end
#'
#' DataFrames.gennames(10)
function gennames(n::Integer)
res = Vector{Symbol}(n)
res = Array{Symbol}(n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another seemingly unnecessary change from Vector to Array

#' DataFrames.countna(@data([1, 2, 3]))
countna(da::DataArray) = sum(da.na)
#' DataFrames.countnull([1, 2, 3])
function countnull(a::AbstractArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this should either be defined in Nulls or deprecated in favor of count(isnull, a)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's countnull to allow, e.g. DataArray/NullableArray/CategoricalArray to define optimized versions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You still can quite easily:

Base.count(::typeof(isnull), a::SweetArray) = # awesome implementation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's remove it in a subsequent PR.

@@ -5,6 +5,42 @@
##
##############################################################################

if VERSION >= v"0.6.0-dev.2643"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could simplify this to remove the conditional

@@ -42,18 +34,24 @@ module TestData
# lots more to do

#test_group("assign")
df6[3] = @data(["un", "deux", "troix", "quatre"])
df6[3] = ["un", "deux", "trois", "quatre"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was really a shame for all JuliaStats! ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C'est dommage!

rofinn and others added 3 commits September 3, 2017 15:54
- Changed julia requirement to 0.6 minimum
- Stopped testing on 0.5
- Removed Compat dependency
- Fixed a few 0.6 warnings.
This requires Nulls, as well as new versions of CategoricalArrays,
DataStreams and WeakRefStrings.
…fork)

Resolve all conflicts in favor of DataTables, except for two cases where
DataFrames contained improvements not in DataTables in areas touched since the fork:
- Add <thead> and <tbody> tags in HTML output.
- Use Vector instead of Array in one case where appropriate
(ncol(r1.df) == ncol(r2.df)) ||
throw(ArgumentError("Rows of the data tables that have different number of columns cannot be compared ($(ncol(df1)) and $(ncol(df2)))"))
@inbounds for i in 1:ncol(r1.df)
isless(r1.df[i][r1.row], r2.df[i][r2.row]) && return true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're doing a general review, we should have a look at the semantics of this function at some point. It was added in JuliaData/DataTables.jl#17 and I'm not sure what's its purpose. Its behavior looks quite counter-intuitive to me.

For now I've kept it, but simplified it a bit since null uses the right isless definition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1222. Actually, the way the code was written looked weird to me, but the behavior was OK.

@nalimilan nalimilan merged commit b293dd7 into master Sep 3, 2017
@nalimilan
Copy link
Member Author

I've fixed the issues which are related to the merge. Let's handle the others later (using Femtocleaner where possible). I've also noticed a few remaining uses of unsafe_get, I've removed them from @quinnj's commit to keep history clean.

I've just pushed it to master since I'm afraid of what GitHub is going to do in this complex situation. We should have a look at the failures on Julia 0.7, they don't seem too hard to fix.

@nalimilan nalimilan deleted the nl/datatables-backport branch September 3, 2017 14:49
@quinnj
Copy link
Member

quinnj commented Sep 3, 2017

Note that one of the 0.7 failures is a bug in current isbits Union array copying. A fix will be incoming to Base in the next few days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.