[BREAKING] Deprecate categorical and categorical! (#2394)
These methods are the only remaining traces of CategoricalArrays in the public API.
Getting rid of them will allow dropping the dependency if we want.

`transform` and `transform!` now provide quite reasonable replacements for some uses.
What is lacking though is a way to select columns based on their types, which would
be a useful addition in general. Also it could make sense to add a keyword argument
to keep original names, which would be convenient to avoid repeating the selector.
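As a rough sketch of the migration described above (not taken from this commit), the calls below show how `transform` and `transform!` might stand in for the deprecated functions. The sample data frame and the variable names `df2`, `df3`, and `strcols` are illustrative assumptions. Because no type-based selector is available, the string columns are collected by hand.

```julia
using DataFrames, CategoricalArrays

df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
               B = ["X", "X", "X", "Y", "Y", "Y"],
               C = 1:6)

# In place of `categorical(df, :A)`: returns a new data frame, `df` is untouched.
df2 = transform(df, :A => categorical => :A)

# In place of `categorical(df, :A, compress=true)`: pass the keyword
# through an anonymous function.
df3 = transform(df, :A => (x -> categorical(x, compress=true)) => :A)

# In place of `categorical!(df)`: select string columns manually
# (`strcols` is a hypothetical helper, not part of the DataFrames API).
strcols = [n for n in names(df) if eltype(df[!, n]) <: Union{AbstractString, Missing}]
transform!(df, strcols .=> categorical .=> strcols)
```

Repeating the selector on both sides of `.=>` keeps the original column names, which is the repetition that the keyword argument suggested above would avoid.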
nalimilan committed Aug 31, 2020
1 parent d156320 commit f10ee2e
Showing 11 changed files with 285 additions and 419 deletions.
3 changes: 3 additions & 0 deletions NEWS.md
@@ -22,6 +22,9 @@
choose the fast path only when it is safe; this resolves inconsistencies
with what the same functions not using fast path produce
([#2357](https://github.com/JuliaData/DataFrames.jl/pull/2357))
* the `categorical` and `categorical!` functions have been deprecated in favor of
`transform(df, cols .=> categorical .=> cols)` and similar syntaxes
([#2394](https://github.com/JuliaData/DataFrames.jl/pull/2394))

## New functionalities

10 changes: 4 additions & 6 deletions docs/src/lib/functions.md
@@ -92,10 +92,7 @@ valuecols

## Filtering rows
```@docs
completecases
delete!
dropmissing
dropmissing!
empty
empty!
filter
@@ -107,14 +104,15 @@ unique
unique!
```

## Changing column types
## Working with missing values
```@docs
allowmissing
allowmissing!
categorical
categorical!
completecases
disallowmissing
disallowmissing!
dropmissing
dropmissing!
```

## Iteration
57 changes: 0 additions & 57 deletions docs/src/man/categorical.md
@@ -139,63 +139,6 @@ julia> cv1[1] < cv1[2]
true
```

Often, you will have factors encoded inside a `DataFrame` with `Vector` columns instead
of `CategoricalVector` columns. You can convert one or more columns of the `DataFrame`
using the `categorical!` function, which modifies the input `DataFrame` in-place.
Compression can be applied by setting the `compress` keyword argument to `true`.

```jldoctest categorical
julia> using DataFrames
julia> df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
B = ["X", "X", "X", "Y", "Y", "Y"])
6×2 DataFrame
│ Row │ A │ B │
│ │ String │ String │
├─────┼────────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> categorical!(df, :A) # change the column `:A` to be categorical
6×2 DataFrame
│ Row │ A │ B │
│ │ Cat… │ String │
├─────┼──────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
```

If columns are not specified, all columns with an `AbstractString` element type
are converted to be categorical. In the example below we also enable compression:

```jldoctest categorical
julia> categorical!(df, compress=true)
6×2 DataFrame
│ Row │ A │ B │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> eltype.(eachcol(df))
2-element Array{DataType,1}:
CategoricalValue{String,UInt8}
CategoricalValue{String,UInt8}
```

Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl).
When fitting regression models, `CategoricalVector` columns in the input are translated
into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of
1 change: 0 additions & 1 deletion src/DataFrames.jl
@@ -28,7 +28,6 @@ export AbstractDataFrame,
allowmissing!,
antijoin,
by,
categorical!,
columnindex,
combine,
completecases,
81 changes: 0 additions & 81 deletions src/abstractdataframe/abstractdataframe.jl
@@ -33,8 +33,6 @@ The following are normally implemented for AbstractDataFrames:
* [`disallowmissing!`](@ref) : drop support for missing values in columns in-place
* [`allowmissing`](@ref) : add support for missing values in columns
* [`allowmissing!`](@ref) : add support for missing values in columns in-place
* [`categorical`](@ref) : change column types to categorical
* [`categorical!`](@ref) : change column types to categorical in-place
* `similar` : a DataFrame with similar columns as `d`
* `filter` : remove rows
* `filter!` : remove rows in-place
@@ -1708,85 +1706,6 @@ function Missings.allowmissing(df::AbstractDataFrame,
return DataFrame(newcols, _names(df), copycols=false)
end

"""
categorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Return a copy of data frame `df` with columns `cols` converted to `CategoricalVector`.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR)
or a `Type`.
If `categorical` is called with the `cols` argument being a `Type`, then
all columns whose element type is a subtype of this type
(by default `Union{AbstractString, Missing}`) will be converted to categorical.
If the `compress` keyword argument is set to `true` then the created
`CategoricalVector`s will be compressed.
All created `CategoricalVector`s are unordered.
**Examples**
```jldoctest
julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Cat… │
├─────┼───────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df, :)
2×2 DataFrame
│ Row │ a │ b │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
```
"""
function CategoricalArrays.categorical(df::AbstractDataFrame,
                                       cols::Union{ColumnIndex, MultiColumnIndex};
                                       compress::Bool=false)
    idxcols = Set(index(df)[cols])
    newcols = AbstractVector[]
    for i in axes(df, 2)
        x = df[!, i]
        if i in idxcols
            # categorical always copies
            push!(newcols, categorical(x, compress=compress))
        else
            push!(newcols, copy(x))
        end
    end
    DataFrame(newcols, _names(df), copycols=false)
end

function CategoricalArrays.categorical(df::AbstractDataFrame,
                                       cols::Type=Union{AbstractString, Missing};
                                       compress::Bool=false)
    newcols = AbstractVector[]
    for i in axes(df, 2)
        x = df[!, i]
        if eltype(x) <: cols
            # categorical always copies
            push!(newcols, categorical(x, compress=compress))
        else
            push!(newcols, copy(x))
        end
    end
    DataFrame(newcols, _names(df), copycols=false)
end

"""
flatten(df::AbstractDataFrame, cols)
100 changes: 0 additions & 100 deletions src/dataframe/dataframe.jl
@@ -990,106 +990,6 @@ disallowmissing!(df::DataFrame, cols::MultiColumnIndex; error::Bool=true) =
disallowmissing!(df::DataFrame, cols::Colon=:; error::Bool=true) =
disallowmissing!(df, axes(df, 2), error=error)

##############################################################################
##
## Pooling
##
##############################################################################

"""
categorical!(df::DataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Change columns selected by `cols` in data frame `df` to `CategoricalVector`.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR) or a `Type`.
If `categorical!` is called with the `cols` argument being a `Type`, then
all columns whose element type is a subtype of this type
(by default `Union{AbstractString, Missing}`) will be converted to categorical.
If the `compress` keyword argument is set to `true` then the created
`CategoricalVector`s will be compressed.
All created `CategoricalVector`s are unordered.
# Examples
```julia
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ Cat… │ Int64 │ Cat… │
├─────┼──────┼───────┼──────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
CategoricalValue{String,UInt32}
Int64
CategoricalValue{String,UInt32}
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Cat… │ String │
├─────┼────────┼──────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
String
CategoricalValue{Int64,UInt8}
String
```
"""
function categorical! end

function categorical!(df::DataFrame, cols::ColumnIndex;
                      compress::Bool=false)
    df[!, cols] = categorical(df[!, cols], compress=compress)
    return df
end

function categorical!(df::DataFrame, cols::AbstractVector{<:ColumnIndex};
                      compress::Bool=false)
    for cname in cols
        df[!, cname] = categorical(df[!, cname], compress=compress)
    end
    return df
end

categorical!(df::DataFrame, cols::MultiColumnIndex;
             compress::Bool=false) =
    categorical!(df, index(df)[cols], compress=compress)

function categorical!(df::DataFrame, cols::Type=Union{AbstractString, Missing};
                      compress::Bool=false)
    for i in 1:size(df, 2)
        if eltype(df[!, i]) <: cols
            df[!, i] = categorical(df[!, i], compress=compress)
        end
    end
    return df
end

"""
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))