Merge branch 'master' into nl/refgrouping

JuliaData · Oct 11, 2020 · f3ce3ed
2 parents 9d05965 + 4ec8009
commit f3ce3ed
Showing 21 changed files with 960 additions and 270 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -16,26 +16,34 @@ Thanks for taking the plunge!
 
 ## Contributing
 
+* DataFrames.jl is a relatively complex package that also has many external dependencies.
+  Therefore, if you want to propose new functionality (which is encouraged), it is
+  strongly recommended to open an issue first and reach a decision on the final design.
+  A pull request then serves as an implementation of the agreed design.
 * Feel free to open, or comment on, an issue and solicit feedback early on,
   especially if you're unsure about aligning with design goals and direction,
-  or if relevant historical comments are ambiguous
+  or if relevant historical comments are ambiguous.
 * Pair new functionality with tests, and bug fixes with tests that fail pre-fix.
-  Increasing test coverage as you go is always nice
+  Increasing test coverage as you go is always nice.
 * Aim for atomic commits, if possible, e.g. `change 'foo' behavior like so` &
   `'bar' handles such and such corner case`,
-  rather than `update 'foo' and 'bar'` & `fix typo` & `fix 'bar' better`
+  rather than `update 'foo' and 'bar'` & `fix typo` & `fix 'bar' better`.
 * Pull requests are tested against release and development branches of Julia,
-  so using `Pkg.test("DataFrames")` as you develop can be helpful
+  so using `Pkg.test("DataFrames")` as you develop can be helpful.
 * The style guidelines outlined below are not the personal style of most contributors,
-  but for consistency throughout the project, we've adopted them
-* It is recommended to disable GitHub Actions on your fork; check Settings > Actions
+  but for consistency throughout the project, we've adopted them.
+* It is recommended to disable GitHub Actions on your fork; check Settings > Actions.
 * If a PR adds a new exported name then make sure to add a docstring for it and
-  add a reference to it in the documentation
-* A PR with breaking changes should have `[BREAKING]` as a first part of its name
+  add a reference to it in the documentation.
+* A PR with breaking changes should have `[BREAKING]` as a first part of its name.
 * If a PR changes or adds functionality please update NEWS.md file accordingly as
   a part of the PR (along with the link to the PR); please do not add entries
   to NEWS.md for changes that are bug fixes or are not user visible, such as
-  adding tests, updating documentation or improving code layout
+  adding tests, updating documentation or improving code layout.
+* If you make a PR please try to avoid pushing many small commits to GitHub in
+  a sequence as each such commit triggers a separate CI job, which takes over
+  an hour. This has a consequence of making other PRs in packages from the JuliaData
+  ecosystem wait for such CI jobs to finish as they share a common pool of CI resources.
 
 ## Style Guidelines
 

diff --git a/NEWS.md b/NEWS.md
@@ -2,6 +2,10 @@
 
 ## Breaking changes
 
+* the rules for transformations passed to `select`/`select!`, `transform`/`transform!`,
+  and `combine` have been made more flexible; in particular now it is allowed to
+  return multiple columns from a transformation function
+  [#2461](https://github.com/JuliaData/DataFrames.jl/pull/2461)
 * CategoricalArrays.jl is no longer reexported: call `using CategoricalArrays`
   to use it [#2404](https://github.com/JuliaData/DataFrames.jl/pull/2404).
   In the same vein, the `categorical` and `categorical!` functions
@@ -32,6 +36,8 @@
   choose the fast path only when it is safe; this resolves inconsistencies
   with what the same functions not using fast path produce
   ([#2357](https://github.com/JuliaData/DataFrames.jl/pull/2357))
+* `GroupKeys` now supports `in` for `GroupKey`, `Tuple`, `NamedTuple` and dictionaries
+  ([#2392](https://github.com/JuliaData/DataFrames.jl/pull/2392))
 * in `describe` the specification of custom aggregation is now `function => name`;
   old `name => function` order is now deprecated
   ([#2401](https://github.com/JuliaData/DataFrames.jl/pull/2401))
@@ -67,6 +73,7 @@
 * `filter`, `sort`, `dropmissing`, and `unique` now support a `view` keyword argument
   which if set to `true` makes them return a `SubDataFrame` view into the passed
   data frame.
+* add `only` method for `AbstractDataFrame` ([#2449](https://github.com/JuliaData/DataFrames.jl/pull/2449))
 
 ## Deprecated
 

diff --git a/Project.toml b/Project.toml
@@ -36,7 +36,7 @@ test = ["DataStructures", "DataValues", "Dates", "Logging", "Random", "Test"]
 [compat]
 julia = "1"
 CategoricalArrays = "0.8.3"
-Compat = "2.2, 3"
+Compat = "3.17"
 DataAPI = "1.2"
 InvertedIndices = "1"
 IteratorInterfaceExtensions = "0.1.1, 1"

diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@ DataFrames.jl
 =============
 
 [![Coverage Status](https://coveralls.io/repos/JuliaData/DataFrames.jl/badge.svg?branch=master&service=github)](https://coveralls.io/github/JuliaData/DataFrames.jl?branch=master)
-[![Travis Build Status](https://travis-ci.org/JuliaData/DataFrames.jl.svg?branch=master)](https://travis-ci.org/JuliaData/DataFrames.jl)
+[![Travis Build Status](https://travis-ci.com/JuliaData/DataFrames.jl.svg?branch=master)](https://travis-ci.com/JuliaData/DataFrames.jl)
 
 Tools for working with tabular data in Julia.
 

diff --git a/docs/src/lib/functions.md b/docs/src/lib/functions.md
@@ -99,6 +99,7 @@ filter
 filter!
 first
 last
+only
 nonunique
 unique
 unique!

diff --git a/docs/src/lib/types.md b/docs/src/lib/types.md
@@ -55,7 +55,8 @@ The `ByRow` type is a special type used for selection operations to signal that
 to each element (row) of the selection.
 
 The `AsTable` type is a special type used for selection operations to signal that the columns selected by a wrapped
-selector should be passed as a `NamedTuple` to the function.
+selector should be passed as a `NamedTuple` to the function or to signal that it is requested
+to expand the return value of a transformation into multiple columns.
 
 ## [The design of handling of columns of a `DataFrame`](@id man-columnhandling)
 

diff --git a/docs/src/man/comparisons.md b/docs/src/man/comparisons.md
@@ -69,13 +69,13 @@ rows having the index value of `'c'`.
 | Reduce multiple values   | `df['z'].mean(skipna = False)`                                 | `mean(df.z)`                                |
 |                          | `df['z'].mean()`                                               | `mean(skipmissing(df.z))`                   |
 |                          | `df[['z']].agg(['mean'])`                                      | `combine(df, :z => mean ∘ skipmissing)`     |
-| Add new columns          | `df.assign(z1 = df['z'] + 1)`                                  | `df.z1 = df.z .+ 1`                         |
-|                          |                                                                | `insertcols!(df, :z1 => df.z .+ 1)`         |
-|                          |                                                                | `transform(df, :z => (v -> v .+ 1) => :z1)` |
+| Add new columns          | `df.assign(z1 = df['z'] + 1)`                                  | `transform(df, :z => (v -> v .+ 1) => :z1)` |
 | Rename columns           | `df.rename(columns = {'x': 'x_new'})`                          | `rename(df, :x => :x_new)`                  |
 | Pick & transform columns | `df.assign(x_mean = df['x'].mean())[['x_mean', 'y']]`          | `select(df, :x => mean, :y)`                |
 | Sort rows                | `df.sort_values(by = 'x')`                                     | `sort(df, :x)`                              |
 |                          | `df.sort_values(by = ['grp', 'x'], ascending = [True, False])` | `sort(df, [:grp, order(:x, rev = true)])`   |
+| Drop missing rows        | `df.dropna()`                                                  | `dropmissing(df)`                           |
+| Select unique rows       | `df.drop_duplicates()`                                         | `unique(df)`                                |
 
 Note that pandas skips `NaN` values in its analytic functions by default. By contrast,
 Julia functions do not skip `NaN`'s. If necessary, you can filter out
@@ -93,6 +93,21 @@ examples above do not synchronize the column names between pandas and DataFrames
 (you can pass `renamecols=false` keyword argument to `select`, `transform` and
 `combine` functions to retain old column names).
 
+### Mutating operations
+
+| Operation          | pandas                                                | DataFrames.jl                                |
+| :----------------- | :---------------------------------------------------- | :------------------------------------------- |
+| Add new columns    | `df['z1'] = df['z'] + 1`                              | `df.z1 = df.z .+ 1`                          |
+|                    |                                                       | `transform!(df, :z => (x -> x .+ 1) => :z1)` |
+|                    | `df.insert(1, 'const', 10)`                           | `insertcols!(df, 2, :const => 10)`           |
+| Rename columns     | `df.rename(columns = {'x': 'x_new'}, inplace = True)` | `rename!(df, :x => :x_new)`                  |
+| Sort rows          | `df.sort_values(by = 'x', inplace = True)`            | `sort!(df, :x)`                              |
+| Drop missing rows  | `df.dropna(inplace = True)`                           | `dropmissing!(df)`                           |
+| Select unique rows | `df.drop_duplicates(inplace = True)`                  | `unique!(df)`                                |
+
+Generally speaking, DataFrames.jl follows the Julia convention of using `!` in the
+function name to indicate mutation behavior.
+
 ### Grouping data and aggregation
 
 DataFrames.jl provides a `groupby` function to apply operations
@@ -178,11 +193,8 @@ In DataFrames.jl, it just works normally with an array of join keys specified in
 The following table compares the main functions of DataFrames.jl with the R package dplyr (version 1):
 
 ```R
-df <- tibble(id = c('a','b','c','d','e','f'),
-             grp = c(1, 2, 1, 2, 1, 2),
-             x = c(6, 5, 4, 3, 2, 1),
-             y = c(4, 5, 6, 7, 8, 9),
-             z = c(3, 4, 5, 6, 7, 8))
+df <- tibble(grp = rep(1:2, 3), x = 6:1, y = 4:9,
+             z = c(3:7, NA), id = letters[1:6])
 ```
 
 | Operation                | dplyr                          | DataFrames.jl                          |

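The `!` convention described in the new "Mutating operations" table can be illustrated with a minimal sketch. The data frame and its column names (`x`, `z`, `z1`, `const`) are made up for illustration and are not part of the diff; only the functions listed in the table are used.

```julia
using DataFrames

# Hypothetical example data; not taken from the documentation.
df = DataFrame(x = [3, 1, 2], z = [10, 20, 30])

# Non-mutating form: returns a new data frame and leaves `df` untouched.
sorted = sort(df, :x)

# Mutating forms (note the `!`): each call modifies `df` in place.
sort!(df, :x)
transform!(df, :z => (v -> v .+ 1) => :z1)
insertcols!(df, 2, :const => 10)   # scalar is broadcast to every row
```

After these calls `sorted` is an independent copy sorted by `x`, while `df` itself is sorted and has gained the `const` and `z1` columns.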
diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -355,7 +355,12 @@ we can observe that:
 
 #### Indexing syntax
 
-Specific subsets of a data frame can be extracted using the indexing syntax, similar to matrices. The colon `:` indicates that all items (rows or columns depending on its position) should be retained:
+Specific subsets of a data frame can be extracted using the indexing syntax,
+similar to matrices. In the [Indexing](@ref) section of the manual you can find
+all the details about the available options. Here we highlight the basic options.
+
+The colon `:` indicates that all items (rows or columns
+depending on its position) should be retained:
 
 ```jldoctest dataframe
 julia> df[1:3, :]
@@ -481,7 +486,7 @@ julia> df[!, Not(:x1)]
 │ 1   │ 2     │ 3     │
 ```
 
-Finally, you can use `Not` and `All` selectors in more complex column selection scenarios.
+Finally, you can use `Not`, `Between`, and `All` selectors in more complex column selection scenarios.
 The following examples move all columns whose names match `r"x"` regular expression respectively to the front and to the end of a data frame:
 ```
 julia> df = DataFrame(r=1, x1=2, x2=3, y=4)
@@ -571,7 +576,7 @@ a function object that tests whether each value belongs to the subset
     - when `view` or `@view` is used (e.g. `@view df[1:3, :A]`).
 
     More details on copies, views, and references can be found
-    [here.](https://juliadata.github.io/DataFrames.jl/stable/lib/indexing/#getindex-and-view-1)
+    in the [`getindex` and `view`](@ref) section.
 
 #### Column selection using `select` and `select!`, `transform` and `transform!`
 
@@ -627,6 +632,14 @@ julia> select(df, :x2, :x2 => ByRow(sqrt)) # transform columns by row
 ├─────┼───────┼─────────┤
 │ 1   │ 3     │ 1.73205 │
 │ 2   │ 4     │ 2.0     │
+
+julia> select(df, AsTable(:) => ByRow(extrema) => [:lo, :hi]) # return multiple columns
+2×2 DataFrame
+│ Row │ lo    │ hi    │
+│     │ Int64 │ Int64 │
+├─────┼───────┼───────┤
+│ 1   │ 1     │ 5     │
+│ 2   │ 2     │ 6     │
 ```
 
 It is important to note that `select` always returns a data frame,

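As a self-contained sketch of the multiple-column return shown in the doctest above: the data frame below is an assumption (the doctest's `df` is defined outside this hunk), chosen so that the row-wise minimum and maximum are easy to check by hand.

```julia
using DataFrames

# Assumed input; the doctest's actual `df` is not visible in this hunk.
df = DataFrame(x1 = [1, 2], x2 = [3, 4], y = [5, 6])

# AsTable(:) passes each row as a NamedTuple; ByRow(extrema) returns a
# (min, max) tuple per row, which is spread over the two target columns.
res = select(df, AsTable(:) => ByRow(extrema) => [:lo, :hi])
```

Here `res` has two columns, `lo` and `hi`, holding the per-row minimum and maximum across all source columns.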
diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -80,6 +80,13 @@ if VERSION < v"1.2"
     export hasproperty
 end
 
+if isdefined(Base, :only)  # Introduced in 1.4.0
+    import Base.only
+else
+    import Compat.only
+    export only
+end
+
 include("other/utils.jl")
 include("other/index.jl")
 

diff --git a/src/abstractdataframe/abstractdataframe.jl b/src/abstractdataframe/abstractdataframe.jl
@@ -434,6 +434,16 @@ end
 ##
 ##############################################################################
 
+"""
+    only(df::AbstractDataFrame)
+
+If `df` has a single row return it as a `DataFrameRow`; otherwise throw `ArgumentError`.
+"""
+function only(df::AbstractDataFrame)
+    nrow(df) != 1 && throw(ArgumentError("data frame must contain exactly 1 row"))
+    return df[1, :]
+end
+
 """
     first(df::AbstractDataFrame)
 

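A short usage sketch of the `only` method added above, assuming a small made-up data frame (the columns `a` and `b` are hypothetical). It shows the one-row case and the error path described in the docstring.

```julia
using DataFrames

# Hypothetical example data.
df = DataFrame(a = [1, 2], b = ["x", "y"])

# Filter down to a single row, then extract it as a DataFrameRow.
row = only(df[df.a .== 2, :])

# With two rows the call throws ArgumentError, per the docstring.
threw = try
    only(df)
    false
catch err
    err isa ArgumentError
end
```

This mirrors `Base.only` for collections: it is useful after a filter that is expected to match exactly one row, since any other row count fails loudly rather than silently picking a row.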
diff --git a/src/abstractdataframe/join.jl b/src/abstractdataframe/join.jl
@@ -812,9 +812,9 @@ function rightjoin(df1::AbstractDataFrame, df2::AbstractDataFrame;
 end
 
 """
-    outerjoin(df1, df2; on, kind = :inner, makeunique = false, indicator = nothing,
+    outerjoin(df1, df2; on, makeunique = false, indicator = nothing,
               validate = (false, false), renamecols = identity => identity)
-    outerjoin(df1, df2, dfs...; on, kind = :inner, makeunique = false,
+    outerjoin(df1, df2, dfs...; on, makeunique = false,
               validate = (false, false))
 
 Perform an outer join of two or more data frame objects and return a `DataFrame`