update the manual entry

JuliaData · Oct 15, 2020 · 360aee3 · 360aee3
1 parent 98bd976
commit 360aee3
Showing 1 changed file with 105 additions and 44 deletions.
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -13,6 +13,15 @@ In order to perform operations by groups you first need to create a `GroupedData
 object from your data frame using the `groupby` function that takes two arguments:
 (1) a data frame to be grouped, and (2) a set of columns to group by.
 
+!!! note
+
+    All operations described for `GroupedDataFrame` in this section of the manual
+    are also supported for `AbstractDataFrame` in which case it is considered as
+    being grouped by no columns (typically meaning that it has one group except
+    when the data frame has zero rows in which case it is treated as having zero groups).
+    The only difference is that in this case the `keepkeys` and `ungroup` keyword
+    arguments are not supported and always a data frame is returned.
+
 Operations can then be applied on each group using one of the following functions:
 * `combine`: does not put restrictions on number of rows returned, the order of rows
   is specified by the order of groups in `GroupedDataFrame`; it is typically used
@@ -26,59 +35,103 @@ Operations can then be applied on each group using one of the following function
 
 All these functions take a specification of one or more functions to apply to
 each subset of the `DataFrame`. This specification can be of the following forms:
-1. standard column selectors (integers, symbols, vectors of integers, vectors of symbols,
+1. standard column selectors (integers, `Symbol`s, vectors of integers, vectors of symbols,
    `All`, `:`, `Between`, `Not` and regular expressions)
-2. a `cols => function` pair indicating that `function` should be called with
-   positional arguments holding columns `cols`, which can be a any valid column selector
-3. a `cols => function => target_col` form additionally
-   specifying the name of the target column (this assumes that `function` returns a single
-   value or a vector)
-4. a `col => target_col` pair, which renames the column `col` to `target_col`
-5. a `nrow` or `nrow => target_col` form which efficiently computes the number of rows
-   in a group (without `target_col` the new column is called `:nrow`)
-6. several arguments of the forms given above, or vectors thereof
-7. a function which will be called with a `SubDataFrame` corresponding to each group;
+2. a `cols => function => target_cols` form additionally specifying the target column or columns
+3. a `cols => function` pair indicating that `function` should be called with
+   positional arguments holding columns `cols`, which can be a any valid column selector;
+   in this case target column name is automatically generated and it is assumed that
+   `function` returns a single value or a vector; the generated name is created by concatenating
+   source column name and `function` name where possible (see examples below).
+4. a `col => target_cols` pair, which renames the column `col` to `target_cols`
+5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows
+   in a group (without `target_cols` the new column is called `:nrow`)
+6. vectors or matrices transformations specified by `Pair` syntax described in points 2 to 5
+8. a function which will be called with a `SubDataFrame` corresponding to each group;
    this form should be avoided due to its poor performance unless a very large
    number of columns are processed (in which case `SubDataFrame` avoids excessive
    compilation)
 
-As a special rule that applies to `cols => function` syntax, if `cols` is wrapped
-in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is
-passed to `function`.
-
-In all of these cases, `function` can return either a single row or multiple rows.
-`function` can always generate a single column by returning a single value or a vector.
-Additionally, if `combine` is passed exactly one `function`, `cols => function`,
-or `cols => function => outcol` as a first argument
-and `target_col` is not specified,
-`function` can return multiple columns in the form of an `AbstractDataFrame`,
-`AbstractMatrix`, `NamedTuple` or `DataFrameRow`.
+All functions have two types of signatures. One of them takes a `GroupedDataFrame`
+as a first argument and an arbitrary number of transfomations described above
+as following arguments. The second type of signature is when `Function` or `Type`
+is passed as a first argument and `GroupedDataFrame` is a second argument (in a
+similar fashion like it is passed in e.g. `map` function).
+
+As a special rule that applies to `cols => function` and `cols => function =>
+target_cols` syntaxes is the following. If `cols` is wrapped in an `AsTable`
+object then a `NamedTuple` containing columns selected by `cols` is passed to
+`function`.
+
+What is allowed for `function` to return is determined by the `target_cols` value
+in the following way:
+1. If just a `function` is passed as an argument then returning a data frame,
+   a matrix, a `NamedTuple`, or a `DataFrameRow` will produce multiple columns in the
+   result. Returning any other value produces a single column.
+2. If `target_cols` is a `Symbol` or a string then the function is assumed to return
+   a single column. In this case returning a data frame, a matrix, a `NamedTuple`,
+   or a `DataFrameRow` raises an error.
+3. If `target_cols` is a vector of `Symbol`s or strings or `AsTable` it is assumed
+   that `function` returns multiple columns.
+   If `function` returns one of `AbstractDataFrame`, `NamedTuple`, `DataFrameRow`,
+   `AbstractMatrix` then rules described in point 1 above apply.
+   If `function` returns an `AbstractVector` then each element of this vector must
+   support the `keys` function, which must return a collection of `Symbol`s, strings
+   or integers; the return value of `keys` must be identical for all elements.
+   Then as many columns are created as there are elements in the return value
+   of the `keys` function. If `target_cols` is `AsTable` then their names
+   are set to be equal to the key names except if `keys` returns integers, in
+   which case they are prefixed by `x` (so the column names are e.g. `x1`,
+   `x2`, ...). If `target_cols` is a vector of `Symbol`s or strings then
+   column names produced using the rules above are ignored and replaced by
+   `target_cols` (the number of columns must be the same as the length of
+   `target_cols` in this case).
+   If `fun` returns a value of any other type then it is assumed that it is a
+   table conforming to the Tables.jl API and the `Tables.columntable` function
+   is called on it to get the resulting columns and their names. The names are
+   retained when `target_cols` is `AsTable` and are replaced if
+   `target_cols` is a vector of `Symbol`s or strings.
+
+In all of these cases, `function` can return either a single row or multiple
+rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional
+`AbstractArray` are unwrapped and then treated as a single row.
 
 `select`/`select!` and `transform`/`transform!` always return a `DataFrame`
-with the same number of rows as the source.
-For `combine`, the shape of the resulting `DataFrame` is determined
-according to the following rules:
-- a single value produces a single row and column per group
-- a named tuple or `DataFrameRow` produces a single row and one column per field
-- a vector produces a single column with one row per entry
-- a named tuple of vectors produces one column per field with one row per entry in the vectors
-- a `DataFrame` or a matrix produces as many rows and columns as it contains;
-  note that this option should be avoided due to its poor performance when the number
-  of groups is large
+with the same number and order of rows as the source (even if `GroupedDataFrame`
+had its groups reordered).
 
-The kind of return value and the number and names of columns must be the same for all groups.
+For `combine` return value is ordered by the order of groups in `GroupedDataFrame`
+and for each group the functions can return an arbibrary number of rows (provided
+that these numbers are consistent).
 
 It is allowed to mix single values and vectors if multiple transformations
-are requested. In this case single value will be broadcasted to match the length
+are requested. In this case single value will be repeated to match the length
 of columns specified by returned vectors.
-As a particular rule, values wrapped in a `Ref` or a `0`-dimensional `AbstractArray`
-are unwrapped and then broadcasted.
 
-If a single value or a vector is returned by the `function` and `target_col` is not
-provided, it is generated automatically, by concatenating source column name and
-`function` name where possible (see examples below).
+To apply `function` to each row instead of whole columns, it can be wrapped in a
+`ByRow` struct. In this case if `cols` is a `Symbol`, a string, or an
+integer then `function` is applied to each element (row) of `cols` using
+broadcasting. Otherwise `cols` can be any column indexing syntax, in
+which case `function` will be passed one argument for each of the columns
+specified by `cols`. If `ByRow` is used it is allowed for
+`cols` to select an empty set of columns, in which case `function`
+ is called for each row without any arguments.
 
-We show several examples of the `by` function applied to the `iris` dataset below:
+The kind of return value and the number and names of columns must be the same for all groups.
+
+There the following keyword arguments are supported by the transformation functions
+(not all keyword arguments are supported in all cases; in general they are allowed
+in situations when they are meaningful, see the documentation of the specific functions
+for details):
+- `keepkeys` : if grouping columns should be kept in the returned data frame.
+- `ungroup` : if the retun value of the operation should be a data frame or a
+  `GroupedDataFrame`.
+- `copycols` : if columns of the source data frame should be copied if no transformation
+  is applied to them.
+- `renamecols` : if in `cols => funcion` form the automatically generated column name
+  should include the name of transformation function or not.
+
+We show several examples of these functions applied to the `iris` dataset below:
 
 ```jldoctest sac
 julia> using DataFrames, CSV, Statistics
@@ -176,8 +229,8 @@ julia> combine(gdf, nrow, :PetalLength => mean => :mean)
 │ 2   │ Iris-versicolor │ 50    │ 4.26    │
 │ 3   │ Iris-virginica  │ 50    │ 5.552   │
 
-julia> combine([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)),
-               gdf) # multiple columns are passed as arguments
+julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) =>
+       AsTable) # multiple columns are passed as arguments
 3×3 DataFrame
 │ Row │ Species         │ a        │ b       │
 │     │ String          │ Float64  │ Float64 │
@@ -215,6 +268,14 @@ julia> combine(gdf, 1:2 => cor, nrow)
 │ 2   │ Iris-versicolor │ 0.525911                   │ 50    │
 │ 3   │ Iris-virginica  │ 0.457228                   │ 50    │
 
+julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max])
+3×3 DataFrame
+│ Row │ Species         │ min     │ max     │
+│     │ String          │ Float64 │ Float64 │
+├─────┼─────────────────┼─────────┼─────────┤
+│ 1   │ Iris-setosa     │ 1.0     │ 1.9     │
+│ 2   │ Iris-versicolor │ 3.0     │ 5.1     │
+│ 3   │ Iris-virginica  │ 4.5     │ 6.9     │
 ```
 
 Contrary to `combine`, the `select` and `transform` functions always return
@@ -268,7 +329,7 @@ julia> transform(gdf, :Species => x -> chop.(x, head=5, tail=0))
 │ 150 │ Iris-virginica │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ virginica        │
 ```
 
-The `combine` function also supports the `do` block form. However, as noted above,
+All functions also support the `do` block form. However, as noted above,
 this form is slow and should therefore be avoided when performance matters.
 
 ```jldoctest sac
@@ -385,7 +446,7 @@ julia> combine(gd, valuecols(gd) .=> mean)
 │ 2   │ Iris-versicolor │ 5.936            │ 2.77            │ 4.26             │ 1.326           │
 │ 3   │ Iris-virginica  │ 6.588            │ 2.974           │ 5.552            │ 2.026           │
 
-julia> combine(gd, valuecols(gd) .=> (x -> (x .- mean(x)) ./ std(x)) .=> valuecols(gd))
+julia> combine(gd, valuecols(gd) .=> (x -> (x .- mean(x)) ./ std(x)), renamecols=false)
 150×5 DataFrame
 │ Row │ Species        │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │
 │     │ String         │ Float64     │ Float64    │ Float64     │ Float64    │