[BREAKING] Deprecate categorical and categorical! (#2394)

These methods are the only remaining traces of CategoricalArrays in the public API. Removing them will allow us to drop the dependency later if we want to. `transform` and `transform!` now provide reasonable replacements for some uses. What is still missing is a way to select columns based on their types, which would be a useful addition in general. It could also make sense to add a keyword argument that keeps the original column names, which would avoid having to repeat the selector.
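
As a rough sketch of what the replacements might look like in user code (assuming a DataFrames version in which `transform` accepts `source => function => target` pairs, and with CategoricalArrays loaded explicitly; the variable names `cols` and `strcols` are only illustrative):

```julia
using DataFrames, CategoricalArrays

df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
               B = ["X", "X", "X", "Y", "Y", "Y"],
               C = 1:6)

# Was: categorical(df, :A). Returns a new data frame; `df` is not modified.
df1 = transform(df, :A => categorical => :A)

# Was: categorical!(df, [:A, :B], compress=true). Modifies `df` in place.
# The selector has to be repeated as the target to keep the original names.
cols = [:A, :B]
transform!(df, cols .=> (x -> categorical(x, compress=true)) .=> cols)

# Was: categorical!(df2) with the default type selector.
# There is no type-based selector here, so the columns are picked by hand.
df2 = DataFrame(X = ["a", "b"], Y = [1, 2], Z = ["p", "q"])
strcols = [n for n in names(df2)
           if eltype(df2[!, n]) <: Union{AbstractString, Missing}]
transform!(df2, strcols .=> categorical .=> strcols)
```

The last case shows the gap mentioned above: selecting columns by element type currently requires a manual comprehension over `names(df2)`.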
JuliaData · Aug 31, 2020 · f10ee2e
1 parent d156320
Showing 11 changed files with 285 additions and 419 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -22,6 +22,9 @@
   choose the fast path only when it is safe; this resolves inconsistencies
   with what the same functions not using fast path produce
   ([#2357](https://github.com/JuliaData/DataFrames.jl/pull/2357))
+* the `categorical` and `categorical!` functions have been deprecated in favor of
+  `transform(df, cols .=> categorical .=> cols)` and similar syntaxes
+  ([#2394](https://github.com/JuliaData/DataFrames.jl/pull/2394))
 
 ## New functionalities
 

diff --git a/docs/src/lib/functions.md b/docs/src/lib/functions.md
@@ -92,10 +92,7 @@ valuecols
 
 ## Filtering rows
 ```@docs
-completecases
 delete!
-dropmissing
-dropmissing!
 empty
 empty!
 filter
@@ -107,14 +104,15 @@ unique
 unique!
 ```
 
-## Changing column types
+## Working with missing values
 ```@docs
 allowmissing
 allowmissing!
-categorical
-categorical!
+completecases
 disallowmissing
 disallowmissing!
+dropmissing
+dropmissing!
 ```
 
 ## Iteration

diff --git a/docs/src/man/categorical.md b/docs/src/man/categorical.md
@@ -139,63 +139,6 @@ julia> cv1[1] < cv1[2]
 true
 ```
 
-Often, you will have factors encoded inside a `DataFrame` with `Vector` columns instead
-of `CategoricalVector` columns. You can convert one or more columns of the `DataFrame`
-using the `categorical!` function, which modifies the input `DataFrame` in-place.
-Compression can be applied by setting the `compress` keyword argument to `true`.
-
-```jldoctest categorical
-julia> using DataFrames
-
-julia> df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
-                      B = ["X", "X", "X", "Y", "Y", "Y"])
-6×2 DataFrame
-│ Row │ A      │ B      │
-│     │ String │ String │
-├─────┼────────┼────────┤
-│ 1   │ A      │ X      │
-│ 2   │ B      │ X      │
-│ 3   │ C      │ X      │
-│ 4   │ D      │ Y      │
-│ 5   │ D      │ Y      │
-│ 6   │ A      │ Y      │
-
-julia> categorical!(df, :A) # change the column `:A` to be categorical
-6×2 DataFrame
-│ Row │ A    │ B      │
-│     │ Cat… │ String │
-├─────┼──────┼────────┤
-│ 1   │ A    │ X      │
-│ 2   │ B    │ X      │
-│ 3   │ C    │ X      │
-│ 4   │ D    │ Y      │
-│ 5   │ D    │ Y      │
-│ 6   │ A    │ Y      │
-```
-
-If columns are not specified, all columns with an `AbstractString` element type
-are converted to be categorical. In the example below we also enable compression:
-
-```jldoctest categorical
-julia> categorical!(df, compress=true)
-6×2 DataFrame
-│ Row │ A    │ B    │
-│     │ Cat… │ Cat… │
-├─────┼──────┼──────┤
-│ 1   │ A    │ X    │
-│ 2   │ B    │ X    │
-│ 3   │ C    │ X    │
-│ 4   │ D    │ Y    │
-│ 5   │ D    │ Y    │
-│ 6   │ A    │ Y    │
-
-julia> eltype.(eachcol(df))
-2-element Array{DataType,1}:
- CategoricalValue{String,UInt8}
- CategoricalValue{String,UInt8}
-
-```
-
 Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl).
 When fitting regression models, `CategoricalVector` columns in the input are translated
 into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of

diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -28,7 +28,6 @@ export AbstractDataFrame,
        allowmissing!,
        antijoin,
        by,
-       categorical!,
        columnindex,
        combine,
        completecases,

diff --git a/src/abstractdataframe/abstractdataframe.jl b/src/abstractdataframe/abstractdataframe.jl
@@ -33,8 +33,6 @@ The following are normally implemented for AbstractDataFrames:
 * [`disallowmissing!`](@ref) : drop support for missing values in columns in-place
 * [`allowmissing`](@ref) : add support for missing values in columns
 * [`allowmissing!`](@ref) : add support for missing values in columns in-place
-* [`categorical`](@ref) : change column types to categorical
-* [`categorical!`](@ref) : change column types to categorical in-place
 * `similar` : a DataFrame with similar columns as `d`
 * `filter` : remove rows
 * `filter!` : remove rows in-place
@@ -1708,85 +1706,6 @@ function Missings.allowmissing(df::AbstractDataFrame,
     return DataFrame(newcols, _names(df), copycols=false)
 end
 
-"""
-    categorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
-                compress::Bool=false)
-
-Return a copy of data frame `df` with columns `cols` converted to `CategoricalVector`.
-
-`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR)
-or a `Type`.
-
-If `categorical` is called with the `cols` argument being a `Type`, then
-all columns whose element type is a subtype of this type
-(by default `Union{AbstractString, Missing}`) will be converted to categorical.
-
-If the `compress` keyword argument is set to `true` then the created
-`CategoricalVector`s will be compressed.
-
-All created `CategoricalVector`s are unordered.
-
-**Examples**
-
-```jldoctest
-julia> df = DataFrame(a=[1,2], b=["a","b"])
-2×2 DataFrame
-│ Row │ a     │ b      │
-│     │ Int64 │ String │
-├─────┼───────┼────────┤
-│ 1   │ 1     │ a      │
-│ 2   │ 2     │ b      │
-
-julia> categorical(df)
-2×2 DataFrame
-│ Row │ a     │ b    │
-│     │ Int64 │ Cat… │
-├─────┼───────┼──────┤
-│ 1   │ 1     │ a    │
-│ 2   │ 2     │ b    │
-
-julia> categorical(df, :)
-2×2 DataFrame
-│ Row │ a    │ b    │
-│     │ Cat… │ Cat… │
-├─────┼──────┼──────┤
-│ 1   │ 1    │ a    │
-│ 2   │ 2    │ b    │
-```
-"""
-function CategoricalArrays.categorical(df::AbstractDataFrame,
-                                       cols::Union{ColumnIndex, MultiColumnIndex};
-                                       compress::Bool=false)
-    idxcols = Set(index(df)[cols])
-    newcols = AbstractVector[]
-    for i in axes(df, 2)
-        x = df[!, i]
-        if i in idxcols
-            # categorical always copies
-            push!(newcols, categorical(x, compress=compress))
-        else
-            push!(newcols, copy(x))
-        end
-    end
-    DataFrame(newcols, _names(df), copycols=false)
-end
-
-function CategoricalArrays.categorical(df::AbstractDataFrame,
-                                       cols::Type=Union{AbstractString, Missing};
-                                       compress::Bool=false)
-    newcols = AbstractVector[]
-    for i in axes(df, 2)
-        x = df[!, i]
-        if eltype(x) <: cols
-            # categorical always copies
-            push!(newcols, categorical(x, compress=compress))
-        else
-            push!(newcols, copy(x))
-        end
-    end
-    DataFrame(newcols, _names(df), copycols=false)
-end
-
 """
     flatten(df::AbstractDataFrame, cols)
 

diff --git a/src/dataframe/dataframe.jl b/src/dataframe/dataframe.jl
@@ -990,106 +990,6 @@ disallowmissing!(df::DataFrame, cols::MultiColumnIndex; error::Bool=true) =
 disallowmissing!(df::DataFrame, cols::Colon=:; error::Bool=true) =
     disallowmissing!(df, axes(df, 2), error=error)
 
-##############################################################################
-##
-## Pooling
-##
-##############################################################################
-
-"""
-    categorical!(df::DataFrame, cols=Union{AbstractString, Missing};
-                 compress::Bool=false)
-
-Change columns selected by `cols` in data frame `df` to `CategoricalVector`.
-
-`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR) or a `Type`.
-
-If `categorical!` is called with the `cols` argument being a `Type`, then
-all columns whose element type is a subtype of this type
-(by default `Union{AbstractString, Missing}`) will be converted to categorical.
-
-If the `compress` keyword argument is set to `true` then the created
-`CategoricalVector`s will be compressed.
-
-All created `CategoricalVector`s are unordered.
-
-# Examples
-```julia
-julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
-2×3 DataFrame
-│ Row │ X      │ Y     │ Z      │
-│     │ String │ Int64 │ String │
-├─────┼────────┼───────┼────────┤
-│ 1   │ a      │ 1     │ p      │
-│ 2   │ b      │ 2     │ q      │
-
-julia> categorical!(df)
-2×3 DataFrame
-│ Row │ X    │ Y     │ Z    │
-│     │ Cat… │ Int64 │ Cat… │
-├─────┼──────┼───────┼──────┤
-│ 1   │ a    │ 1     │ p    │
-│ 2   │ b    │ 2     │ q    │
-
-julia> eltype.(eachcol(df))
-3-element Array{DataType,1}:
- CategoricalValue{String,UInt32}
- Int64
- CategoricalValue{String,UInt32}
-
-julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
-2×3 DataFrame
-│ Row │ X      │ Y     │ Z      │
-│     │ String │ Int64 │ String │
-├─────┼────────┼───────┼────────┤
-│ 1   │ a      │ 1     │ p      │
-│ 2   │ b      │ 2     │ q      │
-
-julia> categorical!(df, :Y, compress=true)
-2×3 DataFrame
-│ Row │ X      │ Y    │ Z      │
-│     │ String │ Cat… │ String │
-├─────┼────────┼──────┼────────┤
-│ 1   │ a      │ 1    │ p      │
-│ 2   │ b      │ 2    │ q      │
-
-julia> eltype.(eachcol(df))
-3-element Array{DataType,1}:
- String
- CategoricalValue{Int64,UInt8}
- String
-```
-"""
-function categorical! end
-
-function categorical!(df::DataFrame, cols::ColumnIndex;
-                      compress::Bool=false)
-    df[!, cols] = categorical(df[!, cols], compress=compress)
-    return df
-end
-
-function categorical!(df::DataFrame, cols::AbstractVector{<:ColumnIndex};
-                      compress::Bool=false)
-    for cname in cols
-        df[!, cname] = categorical(df[!, cname], compress=compress)
-    end
-    return df
-end
-
-categorical!(df::DataFrame, cols::MultiColumnIndex;
-             compress::Bool=false) =
-    categorical!(df, index(df)[cols], compress=compress)
-
-function categorical!(df::DataFrame, cols::Type=Union{AbstractString, Missing};
-                      compress::Bool=false)
-    for i in 1:size(df, 2)
-        if eltype(df[!, i]) <: cols
-            df[!, i] = categorical(df[!, i], compress=compress)
-        end
-    end
-    return df
-end
-
 """
     append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
             promote::Bool=(cols in [:union, :subset]))