-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Backport DataTables using merge commit #1220
Conversation
Add compatibility with pre-contrasts ModelFrame constructor
…ise for speed improvement (#1070)
* Only sort duplicated columns once * Added comments * moved check for identical arrays inside of for loop * Don't mention PR under /src, only mention them under /test
Resolve "WARNING: [a] concatenation is deprecated; use collect(a) instead"
Use vcat() instead of collect() in colwise(), and identity() instead of abs(), since the latter do not work with Nullable.
Also switch from mkdocs output to Documenter's HTML output.
make GroupApplied immutable by adding subframe type parameter
* avoid [:], use reshape() * avoid unnecessary Symbol<->String conversions
Misc minor enhancements
Also return a NullableCategoricalArray from sharepools() since the code currently doesn't check that no null values are present. anyway this function is internal and the change imposes no overhead.
* Better display of Nullables * Don't write trailing space in Latex output Also fix missing newline in show test
* limit attribute of IOContext is used for html generation * fixup
I apparently missed these occurrences when removing these functions.
The subdatatable/views code did not have a clear function heirarchy. Sometimes view called subdatatable, sometimes subdatatable called back up to view, and other times view would call view again before later calling subdatatable. The code was also relying on a custom Index that was used nowhere else, and seemed unneccessary. Fixes issue that users could only specify single columns to subset on (rather than arrays of columns), and adds tests for datatables and subdatatables to assert view works as expected.
io.md does not exist since the readtable() has been removed. Pooling should be called categorical, so rename the file and sections.
CategoricalValue entries should always be printed via showcompact() in order to get a short representation. This uses ourshowcompact() to do that when printing DataTables to REPL via show, as well as when printing them to HTML, LaTeX and CSV via printtable(). Also fix a failure due to duplicated keyword arguments on Julia 0.7.
Fixes a failure on nightlies.
Of course this cannot be merged via the GitHub interface since that's not fast-forward. It will have to be done manually. |
3da60ca
to
51c6f65
Compare
51c6f65
to
31005d1
Compare
(Even if the colors don't indicate it, tests are green on Travis with Julia 0.6.) |
Diffing
I suppose those can probably be fixed after this is merged. |
31005d1
to
d006adb
Compare
Good catch! New version should fix these issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks alright to me. A lot of the things I noted can be addressed by another FemtoCleaner run once this is merged.
```@docs | ||
eltypes | ||
head | ||
completecases | ||
completecases! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be removing completecases
here in addition to completecases!
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nvm looks like they do different things; the former identifies complete cases as a boolean vector and the latter is now dropnull!
.
iris = dataset("datasets", "iris") | ||
using DataFrames | ||
using CSV | ||
iris = CSV.read(joinpath(Pkg.dir("DataFrames"), "test/data/iris.csv"), DataFrame) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slashes in the path here kind of defeat the purpose of using joinpath
😉
colwise, | ||
combine, | ||
completecases, | ||
completecases!, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment regarding completecases
|
||
############################################################################## | ||
## | ||
## Equality | ||
## | ||
############################################################################## | ||
|
||
# Imported in DataFrames.jl for compatibility across Julia 0.4 and 0.5 | ||
Base.:(==)(df1::AbstractDataFrame, df2::AbstractDataFrame) = isequal(df1, df2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we change this recently in DataTables? Having ==
call isequal
defeats the purpose of having them be separate functions. If this is only for 0.4 and 0.5 compatibility, we don't support those anymore, so this could be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I thought this looked fishy too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's also related to issues around how to define ==
for Nullable
. We should fix this in later PRs.
@@ -405,14 +378,43 @@ function StatsBase.describe(io, df::AbstractDataFrame) | |||
end | |||
end | |||
|
|||
function StatsBase.describe{T}(io::IO, X::AbstractVector{Union{T, Null}}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This syntax is deprecated, should use where
src/dataframe/dataframe.jl
Outdated
@@ -689,7 +663,7 @@ function deleterows!(df::DataFrame, ind::AbstractVector{Int}) | |||
idf = 1 | |||
iind = 1 | |||
ikeep = 1 | |||
keep = Vector{Int}(n - length(ind2)) | |||
keep = Array{Int}(n-length(ind2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was Array
before the fork, and has been changed to Vector
in DataFrames since then. I've changed the merge commit to use Vector
since that's slightly better.
@@ -96,95 +95,54 @@ end | |||
#' | |||
#' DataFrames.gennames(10) | |||
function gennames(n::Integer) | |||
res = Vector{Symbol}(n) | |||
res = Array{Symbol}(n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another seemingly unnecessary change from Vector
to Array
#' DataFrames.countna(@data([1, 2, 3])) | ||
countna(da::DataArray) = sum(da.na) | ||
#' DataFrames.countnull([1, 2, 3]) | ||
function countnull(a::AbstractArray) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like this should either be defined in Nulls or deprecated in favor of count(isnull, a)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's countnull
to allow, e.g. DataArray
/NullableArray
/CategoricalArray
to define optimized versions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You still can quite easily:
Base.count(::typeof(isnull), a::SweetArray) = # awesome implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, let's remove it in a subsequent PR.
@@ -5,6 +5,42 @@ | |||
## | |||
############################################################################## | |||
|
|||
if VERSION >= v"0.6.0-dev.2643" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could simplify this to remove the conditional
@@ -42,18 +34,24 @@ module TestData | |||
# lots more to do | |||
|
|||
#test_group("assign") | |||
df6[3] = @data(["un", "deux", "troix", "quatre"]) | |||
df6[3] = ["un", "deux", "trois", "quatre"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was really a shame for all JuliaStats! ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C'est dommage!
- Changed julia requirement to 0.6 minimum - Stopped testing on 0.5 - Removed Compat dependency - Fixed a few 0.6 warnings.
This requires Nulls, as well as new versions of CategoricalArrays, DataStreams and WeakRefStrings.
…fork) Resolve all conflicts in favor of DataTables, except for two cases where DataFrames contained improvements not in DataTables in areas touched since the fork: - Add <thead> and <tbody> tags in HTML output. - Use Vector instead of Array in one case where appropriate
d006adb
to
b293dd7
Compare
(ncol(r1.df) == ncol(r2.df)) || | ||
throw(ArgumentError("Rows of the data tables that have different number of columns cannot be compared ($(ncol(df1)) and $(ncol(df2)))")) | ||
@inbounds for i in 1:ncol(r1.df) | ||
isless(r1.df[i][r1.row], r2.df[i][r2.row]) && return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're doing a general review, we should have a look at the semantics of this function at some point. It was added in JuliaData/DataTables.jl#17 and I'm not sure what's its purpose. Its behavior looks quite counter-intuitive to me.
For now I've kept it, but simplified it a bit since null
uses the right isless
definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1222. Actually, the way the code was written looked weird to me, but the behavior was OK.
I've fixed the issues which are related to the merge. Let's handle the others later (using Femtocleaner where possible). I've also noticed a few remaining uses of I've just pushed it to master since I'm afraid of what GitHub is going to do in this complex situation. We should have a look at the failures on Julia 0.7, they don't seem too hard to fix. |
Note that one of the 0.7 failures is a bug in current isbits Union array copying. A fix will be incoming to Base in the next few days. |
This is equivalent to #1214 but using
git merge
rather thangit rebase
. The merge commit is quite small (it can be seen locally usinggit show
from the head of the branch). DataTables history has been preserved, except for the DataFrames ->DataTables rename which has been erased using this script.