Merge 7921b45 into d156320
nalimilan committed Aug 30, 2020
2 parents d156320 + 7921b45 commit 1553c79
Showing 11 changed files with 285 additions and 419 deletions.
3 changes: 3 additions & 0 deletions NEWS.md
@@ -22,6 +22,9 @@
choose the fast path only when it is safe; this resolves inconsistencies
with what the same functions not using fast path produce
([#2357](https://github.com/JuliaData/DataFrames.jl/pull/2357))
* the `categorical` and `categorical!` functions have been deprecated in favor of
`transform(df, cols .=> categorical .=> cols)` and similar syntaxes
([#2394](https://github.com/JuliaData/DataFrames.jl/pull/2394))
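
A minimal migration sketch for the deprecation entry above, assuming a DataFrames.jl version that provides `transform`/`transform!` and that CategoricalArrays.jl is installed. The variable names (`df_a`, `df_all`, `strcols`) are illustrative only and are not part of this commit; the element-type filter is one possible way to imitate the old default column selection, not necessarily the replacement the maintainers recommend.

```julia
using DataFrames
using CategoricalArrays  # provides `categorical`; loaded explicitly so the sketch does not rely on re-exports

df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
               B = ["X", "X", "X", "Y", "Y", "Y"])

# Out-of-place conversion of a single column (roughly what `categorical(df, :A)` did).
df_a = transform(df, :A => categorical => :A)

# Select columns by element type, mirroring the old default of
# `Union{AbstractString, Missing}`, then broadcast the conversion over them.
strcols = [n for n in names(df) if eltype(df[!, n]) <: Union{AbstractString, Missing}]
df_all = transform(df, strcols .=> categorical .=> strcols)

# In-place conversion with compression (roughly `categorical!(df, :A, compress=true)`).
transform!(df, :A => (x -> categorical(x, compress=true)) => :A)
```

The `compress` keyword is passed through an anonymous function because the `source => function => destination` form calls the function with the column as its only argument.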

## New functionalities

10 changes: 4 additions & 6 deletions docs/src/lib/functions.md
@@ -92,10 +92,7 @@ valuecols

## Filtering rows
```@docs
-completecases
delete!
-dropmissing
-dropmissing!
empty
empty!
filter
@@ -107,14 +104,15 @@ unique
unique!
```

-## Changing column types
+## Working with missing values
```@docs
allowmissing
allowmissing!
-categorical
-categorical!
+completecases
disallowmissing
disallowmissing!
+dropmissing
+dropmissing!
```

## Iteration
57 changes: 0 additions & 57 deletions docs/src/man/categorical.md
@@ -139,63 +139,6 @@ julia> cv1[1] < cv1[2]
true
```

Often, you will have factors encoded inside a `DataFrame` with `Vector` columns instead
of `CategoricalVector` columns. You can convert one or more columns of the `DataFrame`
using the `categorical!` function, which modifies the input `DataFrame` in-place.
Compression can be applied by setting the `compress` keyword argument to `true`.

```jldoctest categorical
julia> using DataFrames
julia> df = DataFrame(A = ["A", "B", "C", "D", "D", "A"],
B = ["X", "X", "X", "Y", "Y", "Y"])
6×2 DataFrame
│ Row │ A │ B │
│ │ String │ String │
├─────┼────────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> categorical!(df, :A) # change the column `:A` to be categorical
6×2 DataFrame
│ Row │ A │ B │
│ │ Cat… │ String │
├─────┼──────┼────────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
```

If columns are not specified, all columns with an `AbstractString` element type
are converted to be categorical. In the example below we also enable compression:

```jldoctest categorical
julia> categorical!(df, compress=true)
6×2 DataFrame
│ Row │ A │ B │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ A │ X │
│ 2 │ B │ X │
│ 3 │ C │ X │
│ 4 │ D │ Y │
│ 5 │ D │ Y │
│ 6 │ A │ Y │
julia> eltype.(eachcol(df))
2-element Array{DataType,1}:
CategoricalValue{String,UInt8}
CategoricalValue{String,UInt8}
```

Using categorical arrays is important for working with the [GLM package](https://github.com/JuliaStats/GLM.jl).
When fitting regression models, `CategoricalVector` columns in the input are translated
into 0/1 indicator columns in the `ModelMatrix` with one column for each of the levels of
1 change: 0 additions & 1 deletion src/DataFrames.jl
@@ -28,7 +28,6 @@ export AbstractDataFrame,
allowmissing!,
antijoin,
by,
categorical!,
columnindex,
combine,
completecases,
81 changes: 0 additions & 81 deletions src/abstractdataframe/abstractdataframe.jl
@@ -33,8 +33,6 @@ The following are normally implemented for AbstractDataFrames:
* [`disallowmissing!`](@ref) : drop support for missing values in columns in-place
* [`allowmissing`](@ref) : add support for missing values in columns
* [`allowmissing!`](@ref) : add support for missing values in columns in-place
* [`categorical`](@ref) : change column types to categorical
* [`categorical!`](@ref) : change column types to categorical in-place
* `similar` : a DataFrame with similar columns as `d`
* `filter` : remove rows
* `filter!` : remove rows in-place
@@ -1708,85 +1706,6 @@ function Missings.allowmissing(df::AbstractDataFrame,
return DataFrame(newcols, _names(df), copycols=false)
end

"""
categorical(df::AbstractDataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Return a copy of data frame `df` with columns `cols` converted to `CategoricalVector`.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR)
or a `Type`.
If `categorical` is called with the `cols` argument being a `Type`, then
all columns whose element type is a subtype of this type
(by default `Union{AbstractString, Missing}`) will be converted to categorical.
If the `compress` keyword argument is set to `true` then the created
`CategoricalVector`s will be compressed.
All created `CategoricalVector`s are unordered.
**Examples**
```jldoctest
julia> df = DataFrame(a=[1,2], b=["a","b"])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df)
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Cat… │
├─────┼───────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
julia> categorical(df, :)
2×2 DataFrame
│ Row │ a │ b │
│ │ Cat… │ Cat… │
├─────┼──────┼──────┤
│ 1 │ 1 │ a │
│ 2 │ 2 │ b │
```
"""
function CategoricalArrays.categorical(df::AbstractDataFrame,
cols::Union{ColumnIndex, MultiColumnIndex};
compress::Bool=false)
idxcols = Set(index(df)[cols])
newcols = AbstractVector[]
for i in axes(df, 2)
x = df[!, i]
if i in idxcols
# categorical always copies
push!(newcols, categorical(x, compress=compress))
else
push!(newcols, copy(x))
end
end
DataFrame(newcols, _names(df), copycols=false)
end

function CategoricalArrays.categorical(df::AbstractDataFrame,
cols::Type=Union{AbstractString, Missing};
compress::Bool=false)
newcols = AbstractVector[]
for i in axes(df, 2)
x = df[!, i]
if eltype(x) <: cols
# categorical always copies
push!(newcols, categorical(x, compress=compress))
else
push!(newcols, copy(x))
end
end
DataFrame(newcols, _names(df), copycols=false)
end

"""
flatten(df::AbstractDataFrame, cols)
100 changes: 0 additions & 100 deletions src/dataframe/dataframe.jl
@@ -990,106 +990,6 @@ disallowmissing!(df::DataFrame, cols::MultiColumnIndex; error::Bool=true) =
disallowmissing!(df::DataFrame, cols::Colon=:; error::Bool=true) =
disallowmissing!(df, axes(df, 2), error=error)

##############################################################################
##
## Pooling
##
##############################################################################

"""
categorical!(df::DataFrame, cols=Union{AbstractString, Missing};
compress::Bool=false)
Change columns selected by `cols` in data frame `df` to `CategoricalVector`.
`cols` can be any column selector ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR) or a `Type`.
If `categorical!` is called with the `cols` argument being a `Type`, then
all columns whose element type is a subtype of this type
(by default `Union{AbstractString, Missing}`) will be converted to categorical.
If the `compress` keyword argument is set to `true` then the created
`CategoricalVector`s will be compressed.
All created `CategoricalVector`s are unordered.
# Examples
```julia
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ Cat… │ Int64 │ Cat… │
├─────┼──────┼───────┼──────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
CategoricalValue{String,UInt32}
Int64
CategoricalValue{String,UInt32}
julia> df = DataFrame(X=["a", "b"], Y=[1, 2], Z=["p", "q"])
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Int64 │ String │
├─────┼────────┼───────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> categorical!(df, :Y, compress=true)
2×3 DataFrame
│ Row │ X │ Y │ Z │
│ │ String │ Cat… │ String │
├─────┼────────┼──────┼────────┤
│ 1 │ a │ 1 │ p │
│ 2 │ b │ 2 │ q │
julia> eltype.(eachcol(df))
3-element Array{DataType,1}:
String
CategoricalValue{Int64,UInt8}
String
```
"""
function categorical! end

function categorical!(df::DataFrame, cols::ColumnIndex;
compress::Bool=false)
df[!, cols] = categorical(df[!, cols], compress=compress)
return df
end

function categorical!(df::DataFrame, cols::AbstractVector{<:ColumnIndex};
compress::Bool=false)
for cname in cols
df[!, cname] = categorical(df[!, cname], compress=compress)
end
return df
end

categorical!(df::DataFrame, cols::MultiColumnIndex;
compress::Bool=false) =
categorical!(df, index(df)[cols], compress=compress)

function categorical!(df::DataFrame, cols::Type=Union{AbstractString, Missing};
compress::Bool=false)
for i in 1:size(df, 2)
if eltype(df[!, i]) <: cols
df[!, i] = categorical(df[!, i], compress=compress)
end
end
return df
end

"""
append!(df::DataFrame, df2::AbstractDataFrame; cols::Symbol=:setequal,
promote::Bool=(cols in [:union, :subset]))