# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames # load package

## Working with CategoricalArrays

### Constructor

In [2]:
x = categorical(["A", "B", "B", "C"]) # unordered

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

In [3]:
y = categorical(["A", "B", "B", "C"], ordered=true) # ordered; by default the level order is the sort order

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

In [4]:
z = categorical(["A","B","B","C", missing]) # unordered with missings

5-element CategoricalArrays.CategoricalArray{Union{Missings.Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing

In [5]:
c = cut(1:10, 5) # ordered; splits into 5 groups of equal counts; labels can be renamed and custom breaks given

10-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "[1.0, 2.8)" 
 "[1.0, 2.8)" 
 "[2.8, 4.6)" 
 "[2.8, 4.6)" 
 "[4.6, 6.4)" 
 "[4.6, 6.4)" 
 "[6.4, 8.2)" 
 "[6.4, 8.2)" 
 "[8.2, 10.0]"
 "[8.2, 10.0]"
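The comment above mentions custom breaks and renamed labels. A minimal sketch of that usage, assuming the `cut(x, breaks; labels=...)` method of CategoricalArrays; the variable names and label strings are purely illustrative:

```julia
using CategoricalArrays

ages = [3, 17, 45, 70]   # illustrative data, not from the notebook

# Breaks at 0, 18, 65 and 100 define three intervals, each closed on the left;
# the labels replace the automatically generated interval names.
groups = cut(ages, [0, 18, 65, 100], labels=["minor", "adult", "senior"])

levels(groups)           # levels follow the order of the breaks
isordered(groups)        # cut returns an ordered array
groups[1] < groups[4]    # so elements can be compared with <
```

Values outside the given breaks are expected to raise an error unless `extend=true` is passed, so it is probably safest to choose breaks that cover the whole data range.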

In [6]:
by(DataFrame(x=cut(randn(100000), 10)), :x, d -> DataFrame(n=nrow(d)), sort=true) # just to make sure it works right

| Row | x | n |
|---|---|---|
| 1 | [-4.25932, -1.2841) | 10000 |
| 2 | [-1.2841, -0.845596) | 10000 |
| 3 | [-0.845596, -0.525619) | 10000 |
| 4 | [-0.525619, -0.251247) | 10000 |
| 5 | [-0.251247, 0.00327741) | 10000 |
| 6 | [0.00327741, 0.256183) | 10000 |
| 7 | [0.256183, 0.530553) | 10000 |
| 8 | [0.530553, 0.848815) | 10000 |
| 9 | [0.848815, 1.28808) | 10000 |
| 10 | [1.28808, 4.27432] | 10000 |


In [7]:
v = categorical([1,2,2,3,3]) # contains integers not strings

5-element CategoricalArrays.CategoricalArray{Int64,1,UInt32}:
 1
 2
 2
 3
 3

In [8]:
Vector{Union{String, Missing}}(z) # sometimes you need to convert back to a standard vector

5-element Array{Union{Missings.Missing, String},1}:
 "A"    
 "B"    
 "B"    
 "C"    
 missing

### Managing levels

In [9]:
arr = [x,y,z,c,v]

5-element Array{CategoricalArrays.CategoricalArray{T,1,UInt32,V,C,U} where U where C where V where T,1}:
 CategoricalArrays.CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 CategoricalArrays.CategoricalString{UInt32}["A", "B", "B", "C"]                                                                                                                          
 Union{CategoricalArrays.CategoricalString{UInt32}, Missings.Missing}["A", "B", "B", "C", missing]                                                                                        
 CategoricalArrays.CategoricalString{UInt32}["[1.0, 2.8)", "[1.0, 2.8)", "[2.8, 4.6)", "[2.8, 4.6)", "[4.6, 6.4)", "[4.6, 6.4)", "[6.4, 8.2)", "[6.4, 8.2)", "[8.2, 10.0]", "[8.2, 10.0]"]
 CategoricalArrays.CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3]                                                                                   

In [10]:
isordered.(arr) # check if each categorical array is ordered

5-element BitArray{1}:
 false
  true
 false
  true
 false

In [11]:
ordered!(x, true), isordered(x) # make x ordered

(CategoricalArrays.CategoricalString{UInt32}["A", "B", "B", "C"], true)

In [12]:
ordered!(x, false), isordered(x) # and unordered again

(CategoricalArrays.CategoricalString{UInt32}["A", "B", "B", "C"], false)

In [13]:
levels.(arr) # list levels

5-element Array{Array{T,1} where T,1}:
 String["A", "B", "C"]                                                        
 String["A", "B", "C"]                                                        
 String["A", "B", "C"]                                                        
 String["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]                                                                    

In [14]:
unique.(arr) # missing will be included

5-element Array{Array{T,1} where T,1}:
 String["A", "B", "C"]                                                        
 String["A", "B", "C"]                                                        
 Union{Missings.Missing, String}["A", "B", "C", missing]                      
 String["[1.0, 2.8)", "[2.8, 4.6)", "[4.6, 6.4)", "[6.4, 8.2)", "[8.2, 10.0]"]
 [1, 2, 3]                                                                    

In [15]:
y[1] < y[2] # can compare as y is ordered

true

In [16]:
v[1] < v[2] # not comparable, v is unordered although it contains integers

LoadError: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

In [17]:
levels!(y, ["C", "B", "A"]) # you can reorder levels, mostly useful for ordered CategoricalArrays

4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "A"
 "B"
 "B"
 "C"

In [18]:
y[1] < y[2] # observe that the order is changed

false
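Reordering levels matters most when the alphabetical default does not match the intended ranking. A small sketch under that assumption, using only `categorical`, `levels`, and `levels!` as shown above; the data and variable names are hypothetical:

```julia
using CategoricalArrays

sizes = categorical(["low", "high", "medium"], ordered=true)
levels(sizes)                               # default order is the sort order, so "high" comes first
before = sizes[2] < sizes[1]                # "high" < "low" under the default order

levels!(sizes, ["low", "medium", "high"])   # impose the intended ranking
after = sizes[1] < sizes[3] < sizes[2]      # "low" < "medium" < "high"
```

Both `before` and `after` should evaluate to `true`: the stored values never change, only the ordering that `<` consults.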

In [19]:
levels!(z, ["A", "B"]) # you have to specify all levels that are present

LoadError: ArgumentError: cannot remove level "C" as it is used at position 4 and allow_missing=false.

In [20]:
levels!(z, ["A", "B"], allow_missing=true) # unless the underlying array allows missings and you force removal of levels with allow_missing=true

5-element CategoricalArrays.CategoricalArray{Union{Missings.Missing, String},1,UInt32}:
 "A"    
 "B"    
 "B"    
 missing
 missing

In [21]:
z[1] = "B"
z # now the only non-missing entries in z are "B"

5-element CategoricalArrays.CategoricalArray{Union{Missings.Missing, String},1,UInt32}:
 "B"    
 "B"    
 "B"    
 missing
 missing

In [22]:
levels(z) # but it remembers the levels it had (the reason is mostly performance)

2-element Array{String,1}:
 "A"
 "B"

In [23]:
droplevels!(z) # this way we can clean it up
levels(z)

1-element Array{String,1}:
 "B"

### Data manipulation

In [24]:
x, levels(x)

(CategoricalArrays.CategoricalString{UInt32}["A", "B", "B", "C"], String["A", "B", "C"])

In [25]:
x[2] = "0"
x, levels(x) # new level added at the end (works only for unordered)

(CategoricalArrays.CategoricalString{UInt32}["A", "0", "B", "C"], String["A", "B", "C", "0"])

In [26]:
v, levels(v)

(CategoricalArrays.CategoricalValue{Int64,UInt32}[1, 2, 2, 3, 3], [1, 2, 3])

In [27]:
v[1] + v[2] # even though underlying data is Int, we cannot operate on it

LoadError: MethodError: no method matching +(::CategoricalArrays.CategoricalValue{Int64,UInt32}, ::CategoricalArrays.CategoricalValue{Int64,UInt32})
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:424

In [28]:
Vector{Int}(v) # you either have to retrieve the data by conversion (may be expensive)

5-element Array{Int64,1}:
 1
 2
 2
 3
 3

In [29]:
get(v[1]) + get(v[2]) # or get a single value

3

In [30]:
get.(v) # this will work for arrays without missings

5-element Array{Int64,1}:
 1
 2
 2
 3
 3

In [31]:
get.(z) # but will fail on missing values

LoadError: MethodError: no method matching get(::Missings.Missing)
Closest candidates are:
  get(::ObjectIdDict, ::ANY, ::ANY) at associative.jl:434
  get(::Base.EnvHash, ::AbstractString, ::Any) at env.jl:79
  get(::ZMQ.Context, ::Integer) at D:\Software\JULIA_PKG\v0.6\ZMQ\src\ZMQ.jl:136
  ...

In [32]:
Vector{Union{String, Missing}}(z) # you have to do the conversion

5-element Array{Union{Missings.Missing, String},1}:
 "B"    
 "B"    
 "B"    
 missing
 missing

In [33]:
z[1]*z[2], z.^2 # the only exception is CategoricalArrays based on String: you can operate on them normally

("BB", Any["BB", "BB", "BB", missing, missing])

In [34]:
recode([1,2,3,4,5,missing], 1=>10) # recode some values in an array; there is also an in-place equivalent, recode!

6-element Array{Union{Int64, Missings.Missing},1}:
 10       
  2       
  3       
  4       
  5       
   missing

In [35]:
recode([1,2,3,4,5,missing], "a", 1=>10, 2=>20) # here we provide a default value for entries without an explicit mapping

6-element Array{Any,1}:
 10       
 20       
   "a"    
   "a"    
   "a"    
   missing

In [36]:
recode([1,2,3,4,5,missing], 1=>10, missing=>"missing") # missing is not covered by the default; it must be recoded explicitly

6-element Array{Any,1}:
 10         
  2         
  3         
  4         
  5         
   "missing"

In [37]:
t = categorical([1:5; missing])
t, levels(t)

(Union{CategoricalArrays.CategoricalValue{Int64,UInt32}, Missings.Missing}[1, 2, 3, 4, 5, missing], [1, 2, 3, 4, 5])

In [38]:
recode!(t, [1,3]=>2)
t, levels(t) # note that levels no longer in use are dropped after recode

(Union{CategoricalArrays.CategoricalValue{Int64,UInt32}, Missings.Missing}[2, 2, 2, 4, 5, missing], [2, 4, 5])

In [39]:
t = categorical([1,2,3], ordered=true)
levels(recode(t, 2=>0, 1=>-1)) # and if you introduce new levels, they are added at the end in order of appearance

3-element Array{Int64,1}:
  3
  0
 -1

In [40]:
t = categorical([1,2,3,4,5], ordered=true) # when using default it becomes the last level
levels(recode(t, 300, [1,2]=>100, 3=>200))

3-element Array{Int64,1}:
 100
 200
 300
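The recode rules used in this section (a vector of source values on the left of `=>`, plus an optional default as the second argument) also apply to plain vectors. A minimal sketch, assuming only the `recode` method forms shown above; the data and names are hypothetical:

```julia
using CategoricalArrays

grades = ["a", "b", "c", "d"]   # illustrative input

# "a" and "b" both map to "pass"; every value without an explicit rule
# falls back to the default "other".
result = recode(grades, "other", ["a", "b"] => "pass")
```

Here `result` should contain `"pass"` for the first two entries and `"other"` for the rest.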

### Comparisons

In [41]:
x = categorical([1,2,3])
xs = [x, categorical(x), categorical(x, ordered=true), categorical(x, ordered=true)]
levels!(xs[2], [3,2,1])
levels!(xs[4], [2,3,1])
[a == b for a in xs, b in xs] # all are equal - comparison only by contents

4×4 Array{Bool,2}:
 true  true  true  true
 true  true  true  true
 true  true  true  true
 true  true  true  true

In [42]:
signature(x::CategoricalArray) = (x, levels(x), isordered(x)) # this is actually the full signature of CategoricalArray
# all are different; notice that xs[1] and xs[2] are both unordered but have a different order of levels
[signature(a) == signature(b) for a in xs, b in xs]

4×4 Array{Bool,2}:
  true  false  false  false
 false   true  false  false
 false  false   true  false
 false  false  false   true

In [43]:
x[1] < x[2] # you cannot compare elements of unordered CategoricalArray

LoadError: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this

In [44]:
t[1] < t[2] # but you can do it for an ordered one

true

In [45]:
isless(x[1], x[2]) # isless works within the same CategoricalArray even if it is not ordered

true

In [46]:
y = deepcopy(x) # but not across categorical arrays
isless(x[1], y[2])

LoadError: ArgumentError: CategoricalValue objects with different pools cannot be tested for order

In [47]:
isless(get(x[1]), get(y[2])) # you can use get to make a comparison of the contents of CategoricalArray

true

In [48]:
x[1] == y[2] # equality tests work across CategoricalArrays

false

### Categorical columns in a DataFrame

In [49]:
df = DataFrame(x = 1:3, y = 'a':'c', z = ["a","b","c"])

| Row | x | y | z |
|---|---|---|---|
| 1 | 1 | 'a' | a |
| 2 | 2 | 'b' | b |
| 3 | 3 | 'c' | c |


In [50]:
categorical!(df) # converts all columns whose eltype is an AbstractString to categorical

| Row | x | y | z |
|---|---|---|---|
| 1 | 1 | 'a' | a |
| 2 | 2 | 'b' | b |
| 3 | 3 | 'c' | c |


In [51]:
showcols(df)

3×3 DataFrames.DataFrame
│ Col # │ Name │ Eltype                                      │ Missing │
├───────┼──────┼─────────────────────────────────────────────┼─────────┤
│ 1     │ x    │ Int64                                       │ 0       │
│ 2     │ y    │ Char                                        │ 0       │
│ 3     │ z    │ CategoricalArrays.CategoricalString{UInt32} │ 0       │

│ Col # │ Values      │
├───────┼─────────────┤
│ 1     │ 1  …  3     │
│ 2     │ 'a'  …  'c' │
│ 3     │ a  …  c     │

In [52]:
categorical!(df, :x) # manually convert column :x to categorical

| Row | x | y | z |
|---|---|---|---|
| 1 | 1 | 'a' | a |
| 2 | 2 | 'b' | b |
| 3 | 3 | 'c' | c |


In [53]:
showcols(df)

3×3 DataFrames.DataFrame
│ Col # │ Name │ Eltype                                           │ Missing │
├───────┼──────┼──────────────────────────────────────────────────┼─────────┤
│ 1     │ x    │ CategoricalArrays.CategoricalValue{Int64,UInt32} │ 0       │
│ 2     │ y    │ Char                                             │ 0       │
│ 3     │ z    │ CategoricalArrays.CategoricalString{UInt32}      │ 0       │

│ Col # │ Values      │
├───────┼─────────────┤
│ 1     │ 1  …  3     │
│ 2     │ 'a'  …  'c' │
│ 3     │ a  …  c     │