# Importing R data frames into Julia

Many [R](http://www.R-project.org) packages provide data used to illustrate and to test the functions and methods in the package.  The [RCall package](https://github.com/JuliaStats/RCall.jl) for [Julia](http://julialang.org), in conjunction with the [DataFrames package](https://github.com/JuliaStats/DataFrames.jl), provides for importing an R `data.frame` object as a Julia `DataFrame`, preserving much of the metadata.  In particular, columns that are `factor` or `ordered` objects in R become `PooledDataArray` objects in Julia.

Because Julia has a richer type system than R does, it is at times worthwhile to modify the types of the columns in Julia before saving the `DataFrame`.  Thus we explicitly import the [DataArrays package](https://github.com/JuliaStats/DataArrays.jl) in addition to [DataFrames](https://github.com/JuliaStats/DataFrames.jl).

The Julia `DataFrame`s are saved in the `JLD` format provided by the [HDF5 package](https://github.com/timholy/HDF5.jl). 

In [1]:
using DataArrays, DataFrames, HDF5, JLD, RCall

## Importing and examining an R data frame

Attaching the [RCall package](https://github.com/JuliaStats/RCall.jl) initializes an embedded R process and provides methods for the `DataFrame` generic with `ASCIIString` or `Symbol` arguments.

There are two approaches to accessing data from a particular package:
   - use the fully qualified name (as a string)
   - execute a call to the R function `library` and use the unqualified name.

In [2]:
inst = DataFrame("lme4::InstEval"); # fully qualified name
dump(inst)

DataFrame  73421 observations of 7 variables
  s: PooledDataArray{ASCIIString,Uint16,1}(73421) ASCIIString["1","1","1","1"]
  d: PooledDataArray{ASCIIString,Uint16,1}(73421) ASCIIString["1002","1050","1582","2050"]
  studage: PooledDataArray{ASCIIString,Uint8,1}(73421) ASCIIString["2","2","2","2"]
  lectage: PooledDataArray{ASCIIString,Uint8,1}(73421) ASCIIString["2","1","2","2"]
  service: PooledDataArray{ASCIIString,Uint8,1}(73421) ASCIIString["0","1","0","1"]
  dept: PooledDataArray{ASCIIString,Uint8,1}(73421) ASCIIString["2","6","2","3"]
  y: DataArray{Int32,1}(73421) Int32[5,2,5,3]


The `dump` generic is similar to (but less fully featured than) R's `str` function.  The data from the R `InstEval` object have been copied to Julia storage, so there will not be a conflict between R's memory management and Julia's memory management.

The alternative approach is

In [3]:
reval("library(lme4)");
inst = DataFrame(:InstEval)

Loading required package: Matrix

Attaching package: ‘Matrix’

The following objects are masked from ‘package:base’:

    crossprod, tcrossprod

Loading required package: Rcpp


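| Row | s | d | studage | lectage | service | dept | y |
|-----|---|------|---------|---------|---------|------|---|
| 1 | 1 | 1002 | 2 | 2 | 0 | 2 | 5 |
| 2 | 1 | 1050 | 2 | 1 | 1 | 6 | 2 |
| 3 | 1 | 1582 | 2 | 2 | 0 | 2 | 5 |
| 4 | 1 | 2050 | 2 | 2 | 1 | 3 | 3 |
| 5 | 2 | 115 | 2 | 1 | 0 | 5 | 2 |
| 6 | 2 | 756 | 2 | 1 | 0 | 5 | 4 |
| 7 | 3 | 7 | 2 | 1 | 1 | 11 | 4 |
| 8 | 3 | 13 | 2 | 1 | 0 | 10 | 5 |
| 9 | 3 | 36 | 2 | 1 | 0 | 10 | 5 |
| 10 | 3 | 140 | 2 | 1 | 0 | 10 | 4 |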


## Convert numeric levels to numbers

Many of the columns of `inst` have a type such as `PooledDataArray{ASCIIString,Uint16,1}`.  This is like a `factor` object in R in that the values being represented are from a finite collection, called the `levels` in R and the `pool` in Julia.  Here the type of the `pool` is `ASCIIString`.  The integer indices into the pool are stored here as `Uint16`, an unsigned 16-bit integer.  These are called the `refs` in Julia because they are the references to objects in the pool.  In R the only integer type is a 32-bit signed integer.

The Julia `PooledDataArray` specifies the type of the pool, the type of the refs and the number of dimensions - 1 in this case.

In R the levels are almost always converted to strings.  In Julia there is no need to use strings when short integers or single characters can be used instead.  Strings are powerful but there is a lot going on behind the scenes to make the implementation effective.  All the indirection can be avoided if the pool is a simple "bitstype" vector.
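The pool/refs layout can be illustrated without DataArrays at all. The following is a minimal sketch in plain Julia, assuming a current Julia 1.x release (where the types are spelled `UInt16` and `String`, and `parse` replaces the `parseint` used in this notebook); the names `pool`, `refs`, `column`, `intpool` and `intcolumn` are illustrative only.

```julia
# Plain-Julia sketch of pooled storage (assumes Julia 1.x, no packages).
pool = ["1002", "1050", "1582", "2050"]   # the distinct levels, stored once
refs = UInt16[1, 2, 3, 4, 1, 1, 2]        # one small index per observation

# Indexing the pool by the refs reconstructs the full column.
column = pool[refs]

# With an integer pool every element is a plain bits value, not a string.
intpool = parse.(Int16, pool)
intcolumn = intpool[refs]

println(column)
println(intcolumn)
println(isbitstype(eltype(intpool)), " ", isbitstype(eltype(pool)))
```

Only the short `pool` vector changes type; the `refs`, which carry one entry per observation, are untouched.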

First we convert the strings in the pool to integers then check the maximum to see what size of integer would be best to use.

In [4]:
dn = map(parseint, inst[:d].pool);
dump(dn)

Array(Int64,(1128,)) [1,6,7,8,12,13,14,15,17,18  …  2143,2145,2146,2147,2149,2152,2153,2156,2157,2160]


The value of `dn` is computed by mapping the `parseint` function over the pool of the `d` column of the `inst` data frame.  Note that individual columns in a `DataFrame` are indexed by name as a `Symbol` and that fields of a compound type like `PooledDataArray` are extracted with the `.` infix operator (similar to C, C++, ...).

Because the pool vector is not very long we could leave it as 64-bit integers but, for storage, it is better to use 32-bit or 16-bit integers in this case.
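The check on the maximum can be automated. Below is a small sketch, assuming Julia 1.x; `narrowest_int` is a hypothetical helper written for this illustration, not a function from Base or DataArrays.

```julia
# Hypothetical helper: the narrowest signed integer type that holds every value of v.
function narrowest_int(v::AbstractVector{<:Integer})
    lo, hi = extrema(v)
    for T in (Int8, Int16, Int32, Int64)
        typemin(T) <= lo && hi <= typemax(T) && return T
    end
    error("values do not fit in Int64")
end

levels = [1, 6, 7, 2157, 2160]        # a few of the pool values shown above
T = narrowest_int(levels)             # 2160 exceeds typemax(Int8) but fits in Int16
narrow = convert(Vector{T}, levels)
println(T)
```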

One way to create the 16-bit integers is

In [5]:
dump(convert(Vector{Int16}, dn))

Array(Int16,(1128,)) Int16[1,6,7,8,12,13,14,15,17,18  …  2143,2145,2146,2147,2149,2152,2153,2156,2157,2160]


An alternative is to parse the strings directly as 16-bit integers when creating `dn`.  Because this uses a non-default integer type, the call to `map` gets more complicated.  A simpler route is a _comprehension_, which generates an array by evaluating an expression for each element of an _iterator_.  It looks like

In [6]:
dn = [parseint(Int16,x) for x in inst[:d].pool];
dump(dn)

Array(Int16,(1128,)) Int16[1,6,7,8,12,13,14,15,17,18  …  2143,2145,2146,2147,2149,2152,2153,2156,2157,2160]
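For comparison, here is how the `map` form and the comprehension line up in current Julia. This is a sketch assuming Julia 1.x, where `parse(Int16, x)` replaces `parseint(Int16, x)`; the short `pool` vector is a stand-in for a column's string pool.

```julia
pool = ["1", "6", "7", "2160"]             # stand-in for a column's string pool

viamap  = map(x -> parse(Int16, x), pool)  # an anonymous function supplies the type
viacomp = [parse(Int16, x) for x in pool]  # the comprehension form

println(viamap == viacomp)
```

Both produce a `Vector{Int16}`; the comprehension simply avoids writing the anonymous function.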


We construct the new `PooledDataArray` with `setlevels`. It happens that the methods for `setlevels` don't allow changing the type of the pool, but we can easily construct such a method.

In [7]:
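function DataArrays.setlevels(v::PooledDataArray,l::AbstractVector)
    # the new pool must supply exactly one value per existing level
    length(l) == length(v.pool) ||
        throw(DimensionMismatch("new levels must match the length of the existing pool"))
    # keep the refs as they are; only the pool (and possibly its type) is replaced
    PooledDataArray(DataArrays.RefArray(v.refs),l)
end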

setlevels (generic function with 3 methods)
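The idea behind this method, keeping the refs and swapping the pool, can be sketched with a stand-in type in plain Julia. This assumes Julia 1.x; `Pooled`, `relevel` and `expand` are hypothetical names for this illustration and are not part of DataArrays.

```julia
# Hypothetical stand-in for a pooled column: refs index into pool.
struct Pooled{T,R<:Integer}
    refs::Vector{R}
    pool::Vector{T}
end

# Swap in a new pool of the same length; the refs are reused unchanged.
function relevel(p::Pooled, newpool::AbstractVector)
    length(newpool) == length(p.pool) ||
        throw(DimensionMismatch("new pool must have $(length(p.pool)) levels"))
    Pooled(p.refs, collect(newpool))
end

# Reconstruct the full column of values.
expand(p::Pooled) = p.pool[p.refs]

service = Pooled(UInt8[1, 2, 1, 2], ["0", "1"])
recoded = relevel(service, ['N', 'Y'])     # pool type changes from String to Char
println(expand(recoded))
```

Because only the pool is rebuilt, the cost of changing the level type is proportional to the number of levels, not the number of observations.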

In [8]:
inst[:d] = setlevels(inst[:d], dn);
inst[:s] = setlevels(inst[:s],
    [parseint(Int16,x) for x in inst[:s].pool]);
inst[:studage] = setlevels(inst[:studage],
    [parseint(Int8,x) for x in inst[:studage].pool]);
inst[:lectage] = setlevels(inst[:lectage],
    [parseint(Int8,x) for x in inst[:lectage].pool]);
inst[:service] = setlevels(inst[:service], ['N','Y']);
inst[:dept] = setlevels(inst[:dept],
    [parseint(Int8,x) for x in inst[:dept].pool]);
inst[:y] = convert(Vector{Int8},inst[:y]);
dump(inst)

DataFrame  73421 observations of 7 variables
  s: PooledDataArray{Int16,Uint16,1}(73421) Int16[1,1,1,1]
  d: PooledDataArray{Int16,Uint16,1}(73421) Int16[1002,1050,1582,2050]
  studage: PooledDataArray{Int8,Uint8,1}(73421) Int8[2,2,2,2]
  lectage: PooledDataArray{Int8,Uint8,1}(73421) Int8[2,1,2,2]
  service: PooledDataArray{Char,Uint8,1}(73421) Char['N','Y','N','Y']
  dept: PooledDataArray{Int8,Uint8,1}(73421) Int8[2,6,2,3]
  y: DataArray{Int8,1}(73421) Int8[5,2,5,3]


In [9]:
head(inst,10)

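| Row | s | d | studage | lectage | service | dept | y |
|-----|---|------|---------|---------|---------|------|---|
| 1 | 1 | 1002 | 2 | 2 | N | 2 | 5 |
| 2 | 1 | 1050 | 2 | 1 | Y | 6 | 2 |
| 3 | 1 | 1582 | 2 | 2 | N | 2 | 5 |
| 4 | 1 | 2050 | 2 | 2 | Y | 3 | 3 |
| 5 | 2 | 115 | 2 | 1 | N | 5 | 2 |
| 6 | 2 | 756 | 2 | 1 | N | 5 | 4 |
| 7 | 3 | 7 | 2 | 1 | Y | 11 | 4 |
| 8 | 3 | 13 | 2 | 1 | N | 10 | 5 |
| 9 | 3 | 36 | 2 | 1 | N | 10 | 5 |
| 10 | 3 | 140 | 2 | 1 | N | 10 | 4 |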


## Saving a DataFrame in JLD format

The JLD format is based on the [HDF5 format](http://www.hdfgroup.org/HDF5/), which is accessible from many languages and command line tools.  The implementation in Julia allows for compression of the data using [Blosc](http://www.blosc.org) but there is a trade-off between smaller data files using Blosc and the ability to _memory-map_ the data when reading it.

Saving the data without compression is straightforward.  There is a `@save` macro to do this.  There is also a `save` function.  The macro obtains both the name and the value of the object to be saved from a single argument, whereas the call to the function must be given both the name, as a quoted string, and the value.
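How a macro can recover both the name and the value from one argument can be sketched in plain Julia. This is an illustration only, assuming Julia 1.x; `@namedvalue` is a hypothetical macro and says nothing about how JLD implements `@save`.

```julia
# Hypothetical macro: expands `@namedvalue x` into the pair "x" => x.
macro namedvalue(var)
    name = string(var)                 # the variable's name, known at expansion time
    return :($name => $(esc(var)))     # esc() resolves the variable in the caller's scope
end

inst = [5, 2, 5, 3]
pair = @namedvalue inst                # same as writing "inst" => inst by hand
println(pair)
```

A function only ever sees the value `[5, 2, 5, 3]`, which is why the name has to be passed to it separately as a string.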

In [10]:
@save "inst.jld" inst

We can run the shell programs `h5ls` and `h5stat` to examine the file just created.

In [11]:
; h5ls inst.jld

_refs                    Group
_require                 Dataset {NULL}
_types                   Group
inst                     Dataset {SCALAR}


In [12]:
; h5stat -f -d inst.jld

Filename: inst.jld
File information
	# of unique groups: 3
	# of unique datasets: 30
	# of unique named datatypes: 10
	# of unique links: 0
	# of unique other: 0
	Max. # of links to object: 4
	Max. # of objects in group: 28
Dataset dimension information:
	Max. rank of datasets: 1
	Dataset ranks:
		# of dataset with rank 0: 12
		# of dataset with rank 1: 18
1-D Dataset information:
	Max. dimension size of 1-D datasets: 73421
	Small 1-D datasets (with dimension sizes 0 to 9):
		# of datasets with dimension sizes 2: 1
		# of datasets with dimension sizes 4: 1
		# of datasets with dimension sizes 6: 1
		# of datasets with dimension sizes 7: 4
		Total # of small datasets: 7
	1-D Dataset dimension bins:
		# of datasets with dimension size 1 - 9: 7
		# of datasets with dimension size 10 - 99: 1
		# of datasets with dimension size 1000 - 9999: 3
		# of datasets with dimension size 10000 - 99999: 7
		Total # of datasets: 18
Dataset storage information:
	Total raw data size: 678725
	Total extern

The large number of HDF5 datasets being stored in the file reflects the fact that each column and its refs and pool members are considered separate datasets in the HDF5 terminology.

The alternative approach, compressing the data, requires using the function `save`.

In [13]:
save("inst1.jld", "inst", inst; compress=true)

In [14]:
; ls -l *.jld

-rw-rw-r-- 1 bates bates  24832 Apr 28 12:48 cake.jld
-rw-rw-r-- 1 bates bates 352807 Apr 28 13:17 inst1.jld
-rw-rw-r-- 1 bates bates 700469 Apr 28 13:17 inst.jld
-rw-rw-r-- 1 bates bates  19120 Apr 28 13:14 pastes.jld


In [15]:
; h5stat -d -f inst1.jld

Filename: inst1.jld
File information
	# of unique groups: 3
	# of unique datasets: 30
	# of unique named datatypes: 10
	# of unique links: 0
	# of unique other: 0
	Max. # of links to object: 4
	Max. # of objects in group: 28
Dataset dimension information:
	Max. rank of datasets: 1
	Dataset ranks:
		# of dataset with rank 0: 12
		# of dataset with rank 1: 18
1-D Dataset information:
	Max. dimension size of 1-D datasets: 73421
	Small 1-D datasets (with dimension sizes 0 to 9):
		# of datasets with dimension sizes 2: 1
		# of datasets with dimension sizes 4: 1
		# of datasets with dimension sizes 6: 1
		# of datasets with dimension sizes 7: 4
		Total # of small datasets: 7
	1-D Dataset dimension bins:
		# of datasets with dimension size 1 - 9: 7
		# of datasets with dimension size 10 - 99: 1
		# of datasets with dimension size 1000 - 9999: 3
		# of datasets with dimension size 10000 - 99999: 7
		Total # of datasets: 18
Dataset storage information:
	Total raw data size: 314059
	Total exter

In this case the file with compressed columns is about half the size of the file without compression.

## Single-character levels as characters

Often the levels of the factors in a dataset are single characters.  These can be converted to `Char` values.

In [16]:
cake = DataFrame("lme4::cake");
dump(cake)

DataFrame  270 observations of 5 variables
  replicate: PooledDataArray{ASCIIString,Uint8,1}(270) ASCIIString["1","1","1","1"]
  recipe: PooledDataArray{ASCIIString,Uint8,1}(270) ASCIIString["A","A","A","A"]
  temperature: PooledDataArray{ASCIIString,Uint8,1}(270) ASCIIString["175","185","195","205"]
  angle: DataArray{Int32,1}(270) Int32[42,46,47,39]
  temp: DataArray{Float64,1}(270) [175.0,185.0,195.0,205.0]


In [17]:
show(cake[:replicate].pool)

ASCIIString["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15"]

In [18]:
cake[:replicate] = setlevels(cake[:replicate],
    [parseint(Int8,x) for x in cake[:replicate].pool]);

show(cake[:recipe].pool)

ASCIIString["A","B","C"]

In [19]:
cake[:recipe] = setlevels(cake[:recipe],
    [x[1]::Char for x in cake[:recipe].pool]);
cake[:temperature] = setlevels(cake[:temperature],
    [parseint(Int16,x) for x in cake[:temperature].pool]);
dump(cake)

DataFrame  270 observations of 5 variables
  replicate: PooledDataArray{Int8,Uint8,1}(270) Int8[1,1,1,1]
  recipe: PooledDataArray{Char,Uint8,1}(270) Char['A','A','A','A']
  temperature: PooledDataArray{Int16,Uint8,1}(270) Int16[175,185,195,205]
  angle: DataArray{Int32,1}(270) Int32[42,46,47,39]
  temp: DataArray{Float64,1}(270) [175.0,185.0,195.0,205.0]
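
Taking the first character with `x[1]` silently discards anything after it. As a sketch of a stricter variant in current Julia (assuming Julia 1.4 or later, where `only` is available in Base), the conversion can insist that each level really is a single character; the small vectors here are stand-ins for the pools above.

```julia
pool = ["A", "B", "C"]                  # single-character levels, as in the recipe column

chars = [only(x) for x in pool]         # only() errors unless the string has exactly one character
firsts = [first(x) for x in ["A:a", "A:b"]]   # first() keeps just the leading character

println(chars)
println(firsts)
```

Using `only` would reject a level such as `"A:a"`, whereas `first` (like `x[1]`) would quietly reduce it to `'A'`.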


In [20]:
save("cake.jld", "cake", cake; compress=true)

In [21]:
; h5stat -d -f cake.jld

Filename: cake.jld
File information
	# of unique groups: 3
	# of unique datasets: 25
	# of unique named datatypes: 11
	# of unique links: 0
	# of unique other: 0
	Max. # of links to object: 3
	Max. # of objects in group: 23
Dataset dimension information:
	Max. rank of datasets: 1
	Dataset ranks:
		# of dataset with rank 0: 11
		# of dataset with rank 1: 14
1-D Dataset information:
	Max. dimension size of 1-D datasets: 270
	Small 1-D datasets (with dimension sizes 0 to 9):
		# of datasets with dimension sizes 3: 1
		# of datasets with dimension sizes 5: 6
		# of datasets with dimension sizes 6: 1
		Total # of small datasets: 8
	1-D Dataset dimension bins:
		# of datasets with dimension size 1 - 9: 8
		# of datasets with dimension size 10 - 99: 1
		# of datasets with dimension size 100 - 999: 5
		Total # of datasets: 14
Dataset storage information:
	Total raw data size: 4585
	Total external raw data size: 0
Dataset layout information:
	Dataset layout counts[COMPACT]: 25
	Dataset layout c

Notice that there has been no compression of any of the elements.  This is because the vector lengths must be above a threshold before it becomes worthwhile compressing the contents.

I just noticed that the same information is being stored in different forms in the `temperature` and `temp` columns.  That should be fixed but it is a minor fix.

## A more general method for setting new levels

The `setlevels` method given above does save some typing but it is still rather tedious to use.  We create a method that takes the `DataFrame`, the column name and a function to apply to get the new levels.  In this case, the appropriate generic function name is `setlevels!` because the method will be _mutating_.  That is, it changes one or more of its arguments - here it will be the data frame.

It is a convention that the names of mutating functions end in `!`.  That character has no semantic meaning; it is just part of the name.  The convention is to alert the user that extra caution is needed when using such a function.
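
The convention is visible in Base itself. As a short illustration (plain Julia 1.x, no packages), compare `sort` with `sort!`:

```julia
v = [3, 1, 2]

w = sort(v)          # non-mutating: returns a sorted copy, v is unchanged
before = copy(v)     # snapshot of v taken before the mutating call
sort!(v)             # mutating: v itself is reordered in place

println(before, " ", w, " ", v)
```

After `sort!(v)` the original ordering of `v` is gone, which is exactly the kind of side effect the trailing `!` is meant to flag.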

In [22]:
function DataArrays.setlevels!(fr::DataFrame, s::Symbol, func::Function)
    s ∈ names(fr) && isa(fr[s], PooledDataArray) || 
        error("The symbol s must name a PooledDataArray column of fr")
    fr[s] = setlevels(fr[s], map(func, fr[s].pool))
    fr
end

setlevels! (generic function with 4 methods)

In [23]:
pastes = DataFrame("lme4::Pastes");
dump(pastes)

DataFrame  60 observations of 4 variables
  strength: DataArray{Float64,1}(60) [62.8,62.6,60.1,62.3]
  batch: PooledDataArray{ASCIIString,Uint8,1}(60) ASCIIString["A","A","A","A"]
  cask: PooledDataArray{ASCIIString,Uint8,1}(60) ASCIIString["a","a","b","b"]
  sample: PooledDataArray{ASCIIString,Uint8,1}(60) ASCIIString["A:a","A:a","A:b","A:b"]


In [24]:
char1(s::ASCIIString) = s[1]   # extract first character of a string

char1 (generic function with 1 method)

In [25]:
setlevels!(pastes, :batch, char1);
setlevels!(pastes, :cask, char1);
dump(pastes)

DataFrame  60 observations of 4 variables
  strength: DataArray{Float64,1}(60) [62.8,62.6,60.1,62.3]
  batch: PooledDataArray{Char,Uint8,1}(60) Char['A','A','A','A']
  cask: PooledDataArray{Char,Uint8,1}(60) Char['a','a','b','b']
  sample: PooledDataArray{ASCIIString,Uint8,1}(60) ASCIIString["A:a","A:a","A:b","A:b"]


In [26]:
@save "pastes.jld" pastes

In [27]:
; h5stat -d -f pastes.jld

Filename: pastes.jld
File information
	# of unique groups: 3
	# of unique datasets: 21
	# of unique named datatypes: 9
	# of unique links: 0
	# of unique other: 0
	Max. # of links to object: 3
	Max. # of objects in group: 19
Dataset dimension information:
	Max. rank of datasets: 1
	Dataset ranks:
		# of dataset with rank 0: 9
		# of dataset with rank 1: 12
1-D Dataset information:
	Max. dimension size of 1-D datasets: 60
	Small 1-D datasets (with dimension sizes 0 to 9):
		# of datasets with dimension sizes 1: 1
		# of datasets with dimension sizes 3: 1
		# of datasets with dimension sizes 4: 4
		Total # of small datasets: 6
	1-D Dataset dimension bins:
		# of datasets with dimension size 1 - 9: 6
		# of datasets with dimension size 10 - 99: 6
		Total # of datasets: 12
Dataset storage information:
	Total raw data size: 1528
	Total external raw data size: 0
Dataset layout information:
	Dataset layout counts[COMPACT]: 21
	Dataset layout counts[CONTIG]: 0
	Dataset layout counts[CHUNKED]: 