## Load and process data

In [1]:
using DataFrames
using Dates

Load the comma-separated data from disk:

In [2]:
@time optData = readtable("../data/intmed_data/optData.csv")

elapsed time: 12.680159701 seconds (3621159908 bytes allocated, 10.43% gc time)


Unnamed: 0,Date,Option_Price,Bid,Ask,Volume,Open_Interest,Strike,Expiry,DAX,EONIA_matched,Time_to_Maturity,IsCall
1,732495,3931.1,,,1,104,1800,732660,5712.69,0.031667592146348,0.466666666666667,true
2,732495,0.1,,,0,5515,1800,732660,5712.69,0.0316675921463482,0.466666666666667,false
3,732495,3734.0,,,0,2152,2000,732660,5712.69,0.0316675921463482,0.466666666666667,true
4,732495,0.1,,,0,20941,2000,732660,5712.69,0.0316675921463482,0.466666666666667,false
5,732495,3536.9,,,0,2,2200,732660,5712.69,0.0316675921463482,0.466666666666667,true
6,732495,0.1,,,0,4626,2200,732660,5712.69,0.0316675921463482,0.466666666666667,false
7,732495,3339.8,,,0,2009,2400,732660,5712.69,0.0316675921463482,0.466666666666667,true
8,732495,0.1,,,0,13367,2400,732660,5712.69,0.0316675921463482,0.466666666666667,false
9,732495,0.2,,,0,2297,2600,732660,5712.69,0.0316675921463482,0.466666666666667,false
10,732495,2945.9,,,0,624,2800,732660,5712.69,0.0316675921463482,0.466666666666667,true


## Choosing the data format

When dealing with data, we always face a tradeoff between data formats that are intuitive and user-friendly and formats that primarily target speed. In this case, for example, we need to decide on the data type used for date information. The fastest solution would be to keep dates as unintuitive `Int64` numbers; alternatively, we could transform them to type `Date`. Before making a decision, we first perform some speed comparisons. As an example, we look at different implementations of a function that lists all unique options in a given dataset.

#### `getAllOptions` with `Int64` dates

The first option is to treat all underlying data as `Int64` (the `Bool` column `IsCall` is promoted to `Int64` when the columns are concatenated), which provides type stability and hence fast performance:

In [3]:
function getAllOptions1(df::DataFrame)
    arrData = [df[:Strike].data df[:Expiry].data df[:IsCall].data]
    return unique(arrData, 1)
end

getAllOptions1 (generic function with 1 method)

In [4]:
@time allOpts = getAllOptions1(optData)
size(allOpts)

elapsed time: 0.56204353 seconds (104677380 bytes allocated, 10.57% gc time)


(12917,3)
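
As a small illustration of the row-wise deduplication used above, here is a minimal sketch on toy data. It assumes Julia 1.x, where the dimension is passed as the keyword `dims` rather than positionally as in the 0.3.5 call `unique(arrData, 1)`. The matrix `arr` is made up for illustration; only the column layout (strike, expiry as day count, call flag as 0/1) follows the notebook.

```julia
# Toy data (hypothetical): each row is (Strike, Expiry as day count, IsCall as 0/1).
arr = Int64[1800 732660 1;
            1800 732660 0;
            1800 732660 1;   # duplicate of the first row
            2000 732660 1]

# Keep the first occurrence of every distinct row.
opts = unique(arr, dims=1)
```

Because every entry is an `Int64`, comparing two rows never has to deal with mixed element types.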

Taking the `Int64` version as a benchmark, we now look for implementations of the same functionality with the more intuitive encoding of dates as type `Date`. To this end, we first transform our data table.

#### Transform date columns to `Date` type

Since we know that there are no `NA` values in either date column, conversion to type `Date` is quite fast.

In [5]:
function num2date(numb::Int64)
    # interpret the integer as a day count and wrap it into a Date
    return Date(Dates.UTD(numb))
end
function num2date(numb::Array{Int64})
    nDats = size(numb, 1)
    dats = Array(Date, nDats)
    for ii=1:nDats
        dats[ii] = num2date(numb[ii])
    end
    return dats
end

num2date (generic function with 2 methods)
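
The conversion can be checked on the two raw values shown in the table above. The following is a minimal sketch that assumes Julia 1.x (where `Dates` is a standard library and broadcasting with `.` replaces the explicit loop), not the 0.3.5 setup of this notebook. It redefines `num2date` for that purpose and assumes the integers are Rata Die day numbers, as the `Dates.UTD` call implies.

```julia
using Dates

# Day count -> Date. Assumption: the integers are Rata Die day numbers.
num2date(n::Integer) = Date(Dates.UTD(n))

raw = [732495, 732660]        # Date and Expiry values from the first table row
dates = num2date.(raw)        # element-wise conversion via broadcasting
back = Dates.value.(dates)    # Date -> day count, the inverse direction
```

Since `Dates.value` recovers the original integer, switching between the two encodings loses no information.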

In [6]:
@time begin
    optData[:Date] = num2date(optData[:Date].data)
    optData[:Expiry] = num2date(optData[:Expiry].data)
end
optData

elapsed time: 0.03352016 seconds (33614096 bytes allocated)


Unnamed: 0,Date,Option_Price,Bid,Ask,Volume,Open_Interest,Strike,Expiry,DAX,EONIA_matched,Time_to_Maturity,IsCall
1,2006-07-03,3931.1,,,1,104,1800,2006-12-15,5712.69,0.031667592146348,0.466666666666667,true
2,2006-07-03,0.1,,,0,5515,1800,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
3,2006-07-03,3734.0,,,0,2152,2000,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
4,2006-07-03,0.1,,,0,20941,2000,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
5,2006-07-03,3536.9,,,0,2,2200,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
6,2006-07-03,0.1,,,0,4626,2200,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
7,2006-07-03,3339.8,,,0,2009,2400,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
8,2006-07-03,0.1,,,0,13367,2400,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
9,2006-07-03,0.2,,,0,2297,2600,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
10,2006-07-03,2945.9,,,0,624,2800,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true


#### `getAllOptions` - straightforward implementations for type `Date`

First, let's try a straightforward implementation using built-in functions for the new dataset with dates of type `Date`.

In [7]:
function getAllOptions2(df::DataFrame)
    return unique(df[:, [:Strike, :Expiry, :IsCall]])
end

@time allOpts = getAllOptions2(optData)
size(allOpts)

elapsed time: 20.881455899 seconds (6935176712 bytes allocated, 20.25% gc time)


(12917,3)

This is slower than the `Int64` version by a factor of roughly 37 (20.9 s versus 0.56 s), which strongly reduces the usefulness of this approach. A slightly more sophisticated version makes use of the fact that column `IsCall` is binary, with only two possible values:

In [8]:
function getAllOptions3(df::DataFrame)
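    # unique (Strike, Expiry) pairs among calls (IsCall == true)
    callOpts = unique(df[df[:IsCall].data, [:Strike, :Expiry]])
    # unique (Strike, Expiry) pairs among puts (IsCall == false)
    putOpts = unique(df[!(df[:IsCall].data), [:Strike, :Expiry]])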
    return (callOpts, putOpts)
end

@time opts1, opts2 = getAllOptions3(optData)
size(opts1, 1) + size(opts2, 1)

elapsed time: 15.658642067 seconds (5591546736 bytes allocated, 21.87% gc time)


12917

This is somewhat faster (15.7 s versus 20.9 s), but the performance is still not good enough.

#### `getAllOptions` - using parallel computing

Let's now try a parallel search for unique options. First, we start additional workers and load all relevant packages on each of them.

In [9]:
addprocs(3)
@everywhere using DataFrames
@everywhere using Dates

Using `pmap` to parallelize the search for unique call options and unique put options:

In [10]:
function getAllOptionsPar1(dat::DataFrame)
    splitted = {dat[dat[:IsCall].data, [:Strike, :Expiry]], dat[!(dat[:IsCall].data), [:Strike, :Expiry]]}
    callOpts, putOpts = pmap(x -> unique(x), splitted)
    callOpts[:IsCall] = true
    putOpts[:IsCall] = false
    return callOpts, putOpts
end

@time kk = getAllOptionsPar1(optData)
size(kk[1], 1) + size(kk[2], 1)

elapsed time: 11.635335523 seconds (178035548 bytes allocated, 0.77% gc time)


12917

Again, this provides some additional speed-up, but it is still not sufficient. Let's try to split the data into more than two parts, distributing the search across more workers.

In [11]:
@everywhere begin
    function reduceOpts(df1::DataFrame, df2::DataFrame)
        df = [df1; df2]
        return unique(df)
    end
end

function getAllOptionsPar2(df::DataFrame)
    nObs = size(df, 1)
    dfSmall = df[:, [:Strike, :Expiry, :IsCall]]
    stepSize = 550000
    inds = [[1:stepSize:nObs], (nObs+1)]
    nParts = length(inds)-1
    opts = @parallel (reduceOpts) for ii=1:nParts
        unique(dfSmall[inds[ii]:(inds[ii+1]-1), :])
    end
    return opts
end

@time kk = getAllOptionsPar2(optData)
size(kk, 1)

elapsed time: 15.013048183 seconds (582861252 bytes allocated, 0.19% gc time)


12917
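
The split-then-merge structure of `getAllOptionsPar2` can be sketched serially, without any workers. This is a minimal sketch that assumes Julia 1.x and represents each option as a plain tuple instead of a `DataFrame` row. The helper name `chunked_unique` and the toy data are hypothetical.

```julia
# Hypothetical helper: deduplicate chunk by chunk, then merge the partial results.
function chunked_unique(rows::Vector{NTuple{3,Int}}, step::Int)
    # consecutive chunks of at most `step` rows
    parts = Iterators.partition(rows, step)
    # `unique` removes duplicates within a chunk,
    # `union` removes duplicates across chunks while merging
    return mapreduce(unique, union, parts)
end

# Toy data (hypothetical): (Strike, Expiry as day count, IsCall as 0/1).
rows = [(1800, 732660, 1), (1800, 732660, 1), (2000, 732660, 0),
        (1800, 732660, 1), (2200, 732660, 1)]
opts = chunked_unique(rows, 2)
```

The merge step has to deduplicate again, because the same option can appear in several chunks. In the parallel version this merge works on data that has been copied between processes.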

Splitting the data into more parts was even a step backwards (15.0 s versus 11.6 s). It seems that parallelization involves too much data copying between the workers. One could try to circumvent this problem by storing the original dataset in some shared form, so that each worker owns a separate part of the dataset. However, this would not really serve our original intention: intuitive and user-friendly data handling. Hence, we give up on parallel implementations and try to gain additional speed elsewhere.

In [12]:
rmprocs([procs()][2:end])

:ok

In [13]:
nprocs()

4

#### `getAllOptions` - type stable version

So what exactly makes the original solution with dates stored as `Int64` so fast? The problem with all other solutions is that comparing two options involves comparing three fields of different types: the strike price as `Int64`, the expiry as `Date`, and the option type as `Bool`. This makes comparisons quite costly. So let's now try to conduct comparisons on `Array{Int64, 1}`. Also, judging from the implementation of `unique()` in `Base`, comparisons seem to be faster within a `Set` than in an `Array{Int64, 2}`. Modifying `unique()` for the current situation, we get:

In [14]:
function getAllOptions(df::DataFrame)
    vals1, vals2, vals3 = Int64[], Date[], Bool[]
    valsSet = Set{Array{Int64, 1}}()
    nObs = size(df, 1)
    for ii=1:nObs
        currStr, currExp, currCall = df[ii, :Strike], df[ii, :Expiry], df[ii, :IsCall]
        currVals = Int64[currStr, Dates.value(currExp), currCall]
        if !in(currVals, valsSet)
            push!(valsSet, currVals)
            push!(vals1, currStr)
            push!(vals2, currExp)
            push!(vals3, currCall)
        end
    end
    return DataFrame(Strike = vals1, Expiry = vals2, IsCall = vals3)
end

@time kk = getAllOptions(optData)
size(kk, 1)

elapsed time: 1.914682596 seconds (613310788 bytes allocated, 13.37% gc time)


12917
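
The same idea can be sketched without `DataFrames`. This is a minimal sketch that assumes Julia 1.x and takes three plain vectors as a stand-in for the `DataFrame` columns. It also swaps the `Array{Int64, 1}` keys used above for tuples, which is a variation and not the notebook's implementation. The function name `unique_options` is hypothetical.

```julia
using Dates

# Hypothetical stand-in for the DataFrame columns: three plain vectors.
function unique_options(strikes::Vector{Int}, expiries::Vector{Date}, iscall::Vector{Bool})
    seen = Set{NTuple{3,Int}}()   # keys already encountered, all fields encoded as Int
    keep = Int[]                  # indices of first occurrences
    for i in eachindex(strikes, expiries, iscall)
        # encode all three fields as integers so that keys have a single type
        key = (strikes[i], Dates.value(expiries[i]), Int(iscall[i]))
        if !(key in seen)
            push!(seen, key)
            push!(keep, i)
        end
    end
    return strikes[keep], expiries[keep], iscall[keep]
end

# Toy data (hypothetical), with one duplicated option.
d = Date(2006, 12, 15)
s, e, c = unique_options([1800, 1800, 2000, 1800], [d, d, d, d], [true, false, true, true])
```

The returned expiries remain of type `Date`, so the integer encoding is only used internally for the membership test.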

Although `getAllOptions` still takes about 3.4 times as long as the original `Int64` version (1.91 s versus 0.56 s), it takes only 2-3 times as long as the same operation in MATLAB on my machine. Whether this loss in speed is crucial depends on the application, and is something that we will have to see.

## Session info

In [15]:
versioninfo()

Julia Version 0.3.5
Commit a05f87b* (2015-01-08 22:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libblas.so.3
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3


In [16]:
Pkg.status()

18 required packages:
 - DataArrays                    0.2.9
 - DataFrames                    0.6.0
 - Dates                         0.3.2
 - Debug                         0.0.4
 - Distributions                 0.6.3
 - EconDatasets                  0.0.2
 - GLM                           0.4.2
 - Gadfly                        0.3.10
 - IJulia                        0.1.16
 - JuMP                          0.7.3
 - MAT                           0.2.9
 - NLopt                         0.2.0
 - Quandl                        0.4.0
 - RDatasets                     0.1.1
 - Taro                          0.1.2
 - TimeData                      0.5.1
 - TimeSeries                    0.4.6
 - Winston                       0.11.7
56 additional packages:
 - ArrayViews                    0.4.8
 - BinDeps                       0.3.7
 - Blosc                         0.1.1
 - Cairo                         0.2.22
 - Calculus                      0.1.5
 - Codecs                        0.1.3
 - Color      