## Load and process data

In [1]:
using DataFrames
using Dates

Loading the comma separated data from disk:

In [80]:
@time optData = readtable("optData.csv")

elapsed time: 11.1443119 seconds (3564920440 bytes allocated, 12.48% gc time)


Unnamed: 0,Date,Option_Price,Bid,Ask,Volume,Open_Interest,Strike,Expiry,DAX,EONIA_matched,Time_to_Maturity,IsCall
1,732495,3931.1,,,1,104,1800,732660,5712.69,0.031667592146348,0.466666666666667,true
2,732495,0.1,,,0,5515,1800,732660,5712.69,0.0316675921463482,0.466666666666667,false
3,732495,3734.0,,,0,2152,2000,732660,5712.69,0.0316675921463482,0.466666666666667,true
4,732495,0.1,,,0,20941,2000,732660,5712.69,0.0316675921463482,0.466666666666667,false
5,732495,3536.9,,,0,2,2200,732660,5712.69,0.0316675921463482,0.466666666666667,true
6,732495,0.1,,,0,4626,2200,732660,5712.69,0.0316675921463482,0.466666666666667,false
7,732495,3339.8,,,0,2009,2400,732660,5712.69,0.0316675921463482,0.466666666666667,true
8,732495,0.1,,,0,13367,2400,732660,5712.69,0.0316675921463482,0.466666666666667,false
9,732495,0.2,,,0,2297,2600,732660,5712.69,0.0316675921463482,0.466666666666667,false
10,732495,2945.9,,,0,624,2800,732660,5712.69,0.0316675921463482,0.466666666666667,true


## Choosing the data format

When dealing with data, we always face a tradeoff between intuitive, user-friendly data formats and formats that primarily target speed. In this case, for example, we need to decide on the data type used for date information. The fastest solution would be to keep dates as unintuitive `Int64` numbers; alternatively, we could transform them to `Date` format. Before making a decision, we first perform some speed comparisons. As an example, we look at different implementations of a function that lists all unique options in a given dataset.

#### `getAllOptions` with `Int64` dates

The first option is to basically treat all underlying data as `Int64`, providing type stability and hence fast performance:

In [81]:
function getAllOptions1(df::DataFrame)
    arrData = [df[:Strike].data df[:Expiry].data df[:IsCall].data]
    return unique(arrData, 1)
end

getAllOptions1 (generic function with 1 method)

In [82]:
@time allOpts = getAllOptions1(optData)
size(allOpts)

elapsed time: 0.18486432 seconds (83223272 bytes allocated, 20.50% gc time)


(12917,3)

Taking this as a benchmark, we now look for implementations of the same functionality with more intuitive encoding of dates as `Date` type. Therefore, we first transform our data table.

#### Transform date columns to `Date` type

Since we know that neither date column contains `NA` values, conversion to `Date` type is quite fast.

In [83]:
function num2date(numb::Int64)
    return Date(Dates.UTD(numb))
end
function num2date(numb::Array{Int64})
    nDats = size(numb, 1)
    dats = Array(Date, nDats)
    for ii=1:nDats
        dats[ii] = num2date(numb[ii])
    end
    return dats
end

num2date (generic function with 2 methods)
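
For reference, the same conversion can be sketched on a current Julia release (1.x), where `Array(Date, n)` is no longer available and broadcasting can replace the explicit loop. The name `rata2date` is made up for this illustration; `Dates.UTD` is the same unexported helper used above, and the two integers are taken from the `Date` and `Expiry` columns of the table.

```julia
using Dates

# Hypothetical helper: interpret an integer as a day count and wrap it in a Date.
rata2date(n::Integer) = Date(Dates.UTD(n))

nums = [732495, 732660]      # values from the Date and Expiry columns above
dates = rata2date.(nums)     # broadcast over the vector instead of looping

# Dates.value goes the other way, from a Date back to its integer day count.
back = Dates.value.(dates)
```

Because `Dates.value` inverts the conversion, switching between the fast integer encoding and the readable `Date` encoding loses no information.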

In [84]:
@time begin
    optData[:Date] = num2date(optData[:Date].data)
    optData[:Expiry] = num2date(optData[:Expiry].data)
end
optData

elapsed time: 0.040917825 seconds (33020196 bytes allocated, 72.59% gc time)


Unnamed: 0,Date,Option_Price,Bid,Ask,Volume,Open_Interest,Strike,Expiry,DAX,EONIA_matched,Time_to_Maturity,IsCall
1,2006-07-03,3931.1,,,1,104,1800,2006-12-15,5712.69,0.031667592146348,0.466666666666667,true
2,2006-07-03,0.1,,,0,5515,1800,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
3,2006-07-03,3734.0,,,0,2152,2000,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
4,2006-07-03,0.1,,,0,20941,2000,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
5,2006-07-03,3536.9,,,0,2,2200,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
6,2006-07-03,0.1,,,0,4626,2200,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
7,2006-07-03,3339.8,,,0,2009,2400,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true
8,2006-07-03,0.1,,,0,13367,2400,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
9,2006-07-03,0.2,,,0,2297,2600,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,false
10,2006-07-03,2945.9,,,0,624,2800,2006-12-15,5712.69,0.0316675921463482,0.466666666666667,true


#### `getAllOptions` - straightforward implementations for type `Date`

First, let's try a straightforward implementation using built-in functions for the new dataset with dates of type `Date`.

In [85]:
function getAllOptions2(df::DataFrame)
    return unique(df[:, [:Strike, :Expiry, :IsCall]])
end

@time allOpts = getAllOptions2(optData)
size(allOpts)

elapsed time: 19.765476128 seconds (6908768340 bytes allocated, 21.50% gc time)


(12917,3)

This is slower by a factor of more than 100 (roughly 19.8 versus 0.18 seconds), which of course strongly reduces the usefulness of this approach. A slightly more sophisticated version makes use of the fact that column `IsCall` is binary, with only two possible values:

In [86]:
function getAllOptions3(df::DataFrame)
    # deduplicate call and put options separately
    callOpts = unique(df[df[:IsCall].data, [:Strike, :Expiry]])
    putOpts = unique(df[!(df[:IsCall].data), [:Strike, :Expiry]])
    return (callOpts, putOpts)
end

@time opts1, opts2 = getAllOptions3(optData)
size(opts1, 1) + size(opts2, 1)

elapsed time: 15.526298626 seconds (5590316656 bytes allocated, 23.37% gc time)


12917

This already provides some additional speed. Still, the performance is not good enough.

#### `getAllOptions` - using parallel computing

Let's now try to use parallel search for unique options. First, let's start additional workers and load all relevant packages on all machines.

In [91]:
addprocs(3)
@everywhere using DataFrames
@everywhere using Dates

Using `pmap` to parallelize the search for unique call options and unique put options:

In [92]:
function getAllOptionsPar1(dat::DataFrame)
    splitted = {dat[dat[:IsCall].data, [:Strike, :Expiry]], dat[!(dat[:IsCall].data), [:Strike, :Expiry]]}
    callOpts, putOpts = pmap(x -> unique(x), splitted)
    callOpts[:IsCall] = true
    putOpts[:IsCall] = false
    return callOpts, putOpts
end

@time kk = getAllOptionsPar1(optData)
size(kk[1], 1) + size(kk[2], 1)

elapsed time: 10.456553924 seconds (151892928 bytes allocated, 0.30% gc time)


12917

Again, this provides some additional speed-up, but it is still not sufficient. Let's try to split the data into more than two parts, spreading the search over more workers.

In [93]:
@everywhere begin
    function reduceOpts(df1::DataFrame, df2::DataFrame)
        df = [df1; df2]
        return unique(df)
    end
end

function getAllOptionsPar2(df::DataFrame)
    nObs = size(df, 1)
    dfSmall = df[:, [:Strike, :Expiry, :IsCall]]
    stepSize = 550000
    inds = [[1:stepSize:nObs], (nObs+1)]
    nParts = length(inds)-1
    opts = @parallel (reduceOpts) for ii=1:nParts
        unique(dfSmall[inds[ii]:(inds[ii+1]-1), :])
    end
    return opts
end

@time kk = getAllOptionsPar2(optData)
size(kk, 1)

elapsed time: 15.226210755 seconds (559941556 bytes allocated, 0.46% gc time)


12917

This was even a step back. It seems that parallelization involves too much data copying between the machines. One could perhaps circumvent this problem by storing the original dataset in some shared form, so that each machine owns a separate part of it. However, this would not really serve our original intention: intuitive and user-friendly data handling. Hence, we give up on parallel implementations and try to gain speed somewhere else.
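
The split-and-merge structure of `getAllOptionsPar2` can be illustrated without any workers. The following is only a serial sketch in current Julia (1.x) on plain tuples rather than a `DataFrame`; the helper name `chunkedUnique` and the toy rows are made up for the illustration. On current releases the `@parallel` macro is available as `@distributed` in the `Distributed` standard library.

```julia
# Hypothetical helper: deduplicate each chunk, then merge the partial results.
function chunkedUnique(rows::Vector, stepSize::Int)
    chunks = Iterators.partition(rows, stepSize)
    # map step: unique rows per chunk
    # reduce step: concatenate two partial results and deduplicate again
    return mapreduce(unique, (a, b) -> unique(vcat(a, b)), chunks)
end

# Toy rows of (strike, expiry as day count, iscall); made up for the illustration.
rows = [(1800, 732660, true), (1800, 732660, false), (1800, 732660, true),
        (2000, 732660, true), (1800, 732660, false), (2000, 732660, true)]

opts = chunkedUnique(rows, 2)
```

Every merge step has to deduplicate the concatenated partial results again, so the reduction adds work of its own on top of any data transfer between processes.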

In [100]:
rmprocs([procs()][2:end])

:ok

In [101]:
nprocs()

1

#### `getAllOptions` - type stable version

So what exactly is it that makes the original solution with dates stored as `Int64` so fast? The problem with all other solutions is that comparison of different options involves comparison of three different fields with different types: strike price given as `Int64`, expiry given as `Date` and option type denoted as `Bool`. This makes comparisons quite costly. So let's now try to conduct comparisons on `Array{Int64, 1}`. Also, looking at the implementation of `unique()` in `Base`, comparisons seem to be faster within a `Set` than in an `Array{Int64, 2}`. Modifying `unique()` for the current situation, we get:

In [107]:
function getAllOptions(df::DataFrame)
    vals1, vals2, vals3 = Int64[], Date[], Bool[]
    valsSet = Set{Array{Int64, 1}}()
    nObs = size(df, 1)
    for ii=1:nObs
        currStr, currExp, currCall = df[ii, :Strike], df[ii, :Expiry], df[ii, :IsCall]
        currVals = Int64[currStr, Dates.value(currExp), currCall]
        if !in(currVals, valsSet)
            push!(valsSet, currVals)
            push!(vals1, currStr)
            push!(vals2, currExp)
            push!(vals3, currCall)
        end
    end
    return DataFrame(Strike = vals1, Expiry = vals2, IsCall = vals3)
end

@time kk = getAllOptions(optData)
size(kk, 1)

elapsed time: 1.858124155 seconds (603714500 bytes allocated, 20.86% gc time)


12917

Although this still takes approximately 10 times as long as the original `Int64` version (1.86 versus 0.18 seconds), it is only 2-3 times the time that the same operation takes with MATLAB on my machine. Whether this loss in speed is crucial depends on the application and is something that we will have to see...
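
The idea of tracking already-seen combinations in a `Set` can be sketched on plain vectors in current Julia (1.x). This is an illustration under assumptions, not the notebook's implementation: it uses a tuple as the key instead of an `Array{Int64, 1}`, and the helper name `uniqueOptions` and the toy data are made up.

```julia
using Dates

# Hypothetical helper: keep the first occurrence of each (strike, expiry, iscall) combination.
function uniqueOptions(strikes::Vector{Int64}, expiries::Vector{Date}, iscall::Vector{Bool})
    seen = Set{Tuple{Int64,Int64,Bool}}()
    keep = Int[]                      # row indices of first occurrences
    for ii in eachindex(strikes)
        # encode the expiry as its integer day count, as in the loop above
        key = (strikes[ii], Dates.value(expiries[ii]), iscall[ii])
        if !(key in seen)
            push!(seen, key)
            push!(keep, ii)
        end
    end
    return strikes[keep], expiries[keep], iscall[keep]
end

# Toy data, made up for the illustration.
strikes  = [1800, 1800, 2000, 1800]
expiries = fill(Date(2006, 12, 15), 4)
iscall   = [true, false, true, true]

s, e, c = uniqueOptions(strikes, expiries, iscall)
```

The returned columns still carry `Date` values, while all membership tests run on integer-encoded keys.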

## Data characteristics

To get a feeling for the size of the dataset, let's take a look at the number of rows of the table.

In [5]:
nObs = size(dat, 1)

2025129

Hence, expect each operation to take quite some time.

Let's now take a look at missing values:

In [6]:
missVals = [any(isna(dat[:, ii])) for ii=1:size(dat, 2)]

12-element Array{Any,1}:
 false
 false
  true
  true
  true
 false
 false
 false
 false
 false
 false
 false

## Define option type

An option is determined by a unique combination of *strike*, *expiry* and *call/put*.

In [7]:
type Option
    strike::Int64
    expiry::Date
    iscall::Bool
end

Convenience constructors: the option type defaults to a call option, and an option can also be built from the first row of a `DataFrame`.

In [8]:
function Option(strike::Int64, expiry::Date)
    return Option(strike, expiry, true)
end

function Option(df::DataFrame)
    return Option(df[1, :Strike], df[1, :Expiry], df[1, :IsCall])
end

Option (constructor with 4 methods)

Define `writemime` methods for customized display of option objects:

In [9]:
import Base.writemime
function writemime(io::IO, ::MIME"text/html", opt::Option)
    typ = opt.iscall ? "Call" : "Put"
    write(io, "<p><strong>$(typ)</strong> option:</p>")
    write(io, "<ul><li>strike:&nbsp;&nbsp;&nbsp; $(opt.strike)</li>")
    write(io, "<li>expiry:&nbsp;&nbsp; $(opt.expiry)</li></ul>")
end
function writemime(io::IO, ::MIME"text/html", opts::Array{Option,1})
    nOpts = size(opts, 1)
    write(io, "<p><strong>Array</strong> of $nOpts <strong>options</strong>:</p>")
    # show only the first few options
    nToShow = 4
    for ii=1:min(nToShow, nOpts)
        typ = opts[ii].iscall ? "Call" : "Put"
        write(io, "<p><strong>$(typ)</strong> option:</p>")
        write(io, "<ul><li>strike:&nbsp;&nbsp;&nbsp; $(opts[ii].strike)</li>")
        write(io, "<li>expiry:&nbsp;&nbsp; $(opts[ii].expiry)</li></ul>")
    end
    if nOpts > nToShow
        write(io, "<p><strong>...</strong></p>")
    end
end

writemime (generic function with 19 methods)
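
On current Julia releases (1.x), `type` is spelled `struct` (or `mutable struct`), and `writemime` has been replaced by `Base.show` methods that take a `MIME` argument. A minimal sketch of the same idea follows; the name `OptionSketch` is made up here so that it does not clash with `Option`.

```julia
using Dates

# Hypothetical stand-in for the Option type above, in current syntax.
struct OptionSketch
    strike::Int64
    expiry::Date
    iscall::Bool
end

# Convenience constructor: default to a call option.
OptionSketch(strike::Int64, expiry::Date) = OptionSketch(strike, expiry, true)

# HTML display, picked up by front ends that request "text/html".
function Base.show(io::IO, ::MIME"text/html", opt::OptionSketch)
    typ = opt.iscall ? "Call" : "Put"
    write(io, "<p><strong>$(typ)</strong> option:</p>")
    write(io, "<ul><li>strike: $(opt.strike)</li>")
    write(io, "<li>expiry: $(opt.expiry)</li></ul>")
end

opt = OptionSketch(1800, Date(2006, 12, 15))

# Render to a string to inspect the HTML without a notebook front end.
html = sprint(show, MIME("text/html"), opt)
```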

In [10]:
opts = Option[Option(dat[ii, :]) for ii=1:30]

## Get unique options / expiries / days

In [11]:
function getObs(df::DataFrame, opt::Option)
    # get all observations for given option
    datExp = df[:Expiry]
    datStr = df[:Strike]
    datCall = df[:IsCall]
    inds = Int64[]
    
    for ii=1:size(datExp, 1)
        #if datExp[ii]::Date == opt[1, 2]::Date
         #   if datStr[ii]::Int64 == opt[1, 1]::Int64
          #      if datCall[ii]::Bool == opt[1, 3]::Bool
        if datExp[ii]::Date == opt.expiry
            if datStr[ii]::Int64 == opt.strike
                if datCall[ii]::Bool == opt.iscall
                    push!(inds, ii)
                end
            end
        end
    end
    return df[inds, :]
end

function getObs(df::DataFrame, dat::Date, col::Symbol)
    # find observations with dat in col
    dats = df[col]
    inds = Int64[]
    
    for ii=1:size(dats, 1)
        if dats[ii] == dat
            push!(inds, ii)
        end
    end
    return df[inds, :]
end


getObs (generic function with 2 methods)

In [12]:
expDate = Date(2011,12,16)
@time expData = getObs(dat, expDate, :Date)
size(expData)

elapsed time: 0.546490682 seconds (195458848 bytes allocated, 22.84% gc time)


(966,12)

In [13]:
function getAllExpiry(df::DataFrame)
    return unique(df[:Expiry])
end
@time expDates = getAllExpiry(dat)
size(expDates)

elapsed time: 0.233406412 seconds (36045232 bytes allocated, 12.78% gc time)


(97,)

In [14]:
function getAllDays(df::DataFrame)
    return unique(df[:Date])
end
@time tradeDays = getAllDays(dat)
size(tradeDays)

elapsed time: 0.129417371 seconds (32552232 bytes allocated, 26.39% gc time)


(1908,)

Benchmark result:

## Find unique options and expiry dates

In [None]:
function getOptionData(opt::Option, data::DataFrame)
    nObs = size(data, 1)
    validInds = falses(nObs)
    for ii=1:nObs
        if data[ii, :Strike] == opt.strike
            if data[ii, :Expiry] == opt.expiry
                if data[ii, :IsCall] == int(opt.iscall)
                    validInds[ii] = true
                end
            end
        end
    end
    # use the function argument `data`, not a global table
    return Timedata(data[validInds, [:Option_Price, :Bid,
            :Ask, :Volume, :Open_Interest, :DAX, :EONIA_matched, :Time_to_Maturity]],
    array(data[validInds, :Date]))
end
    

Get some helper look-up tables: in which sections do we have to search for individual options, and in which for all options of a given expiration date?

In [9]:
nPreAlloc = 20000
expDates = Array(Date, nPreAlloc)
strikes = Array(Int64, nPreAlloc)
optTypes = Array(Bool, nPreAlloc)
firstListings = Array(Date, nPreAlloc)

nOptsFound = 0

@time begin
    for ii=1:nObs
        currExpDate, currStrike = dat[ii, :Expiry], dat[ii, :Strike]
        currDate, currType = dat[ii, :Date], dat[ii, :IsCall]
        # does (expDate, strike, type) combination already occur?
    
        optPresent = false
        for kk=nOptsFound:-1:1
            if (expDates[kk] == currExpDate) && (strikes[kk] == currStrike) && (optTypes[kk] == currType)
                # go to next observation
                optPresent = true
                break
            end
        end
    
        if !optPresent
            nOptsFound = nOptsFound + 1
            expDates[nOptsFound] = currExpDate
            strikes[nOptsFound] = currStrike
            optTypes[nOptsFound] = currType
            firstListings[nOptsFound] = currDate
        end
    end
end

allOpts = DataFrame(expDates = expDates[1:nOptsFound], 
                    strikes = strikes[1:nOptsFound], 
                    firstListings = firstListings[1:nOptsFound])

elapsed time: 639.105780513 seconds (224997735064 bytes allocated, 23.64% gc time)


Unnamed: 0,expDates,strikes,firstListings
1,2006-12-15,1800,2006-07-03
2,2006-12-15,1800,2006-07-03
3,2006-12-15,2000,2006-07-03
4,2006-12-15,2000,2006-07-03
5,2006-12-15,2200,2006-07-03
6,2006-12-15,2200,2006-07-03
7,2006-12-15,2400,2006-07-03
8,2006-12-15,2400,2006-07-03
9,2006-12-15,2600,2006-07-03
10,2006-12-15,2800,2006-07-03


In [8]:
size(allOpts)

(12917,3)
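
The nested loop above compares every row against all options found so far, so its cost grows with the product of the number of rows and the number of distinct options. One alternative, not used in this notebook, is a hash-based lookup. The following is a minimal sketch in current Julia (1.x); the helper name `firstListingDates` and the toy data are made up, and it assumes that rows are ordered by trading date.

```julia
using Dates

# Hypothetical helper: map each (expiry, strike, iscall) key to the first date it appears.
# Assumes the rows are ordered by trading date.
function firstListingDates(dates::Vector{Date}, expiries::Vector{Date},
                           strikes::Vector{Int64}, iscall::Vector{Bool})
    firstSeen = Dict{Tuple{Date,Int64,Bool},Date}()
    for ii in eachindex(dates)
        key = (expiries[ii], strikes[ii], iscall[ii])
        # get! stores the date only if the key is not present yet
        get!(firstSeen, key, dates[ii])
    end
    return firstSeen
end

# Toy data, made up for the illustration.
dates    = [Date(2006, 7, 3), Date(2006, 7, 3), Date(2006, 7, 4), Date(2006, 7, 4)]
expiries = fill(Date(2006, 12, 15), 4)
strikes  = [1800, 2000, 1800, 2200]
iscall   = [true, true, true, false]

listing = firstListingDates(dates, expiries, strikes, iscall)
```

Each row then needs a single dictionary lookup instead of a scan over all previously found options.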

In [31]:
function findprev(A, start)
    for i = start:-1:1
        A[i] != 0 && return i
    end
    0
end
findlast(A) = findprev(A, length(A))

findlast (generic function with 1 method)

In [38]:
startDay = allDays[550]
endDay = allDays[1005]

@time begin
    startInd = findfirst(dat[:Date] == startDay)
    endInd = findlast(dat[:Date] == endDay)
end


elapsed time: 9.087e-6 seconds (80 bytes allocated)


0
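
The result `0` and the microsecond timing are consistent with `dat[:Date] == startDay` comparing the whole column with a single date and yielding one `false`, rather than comparing element by element. A minimal sketch of an element-wise search in current Julia (1.x), using the built-in `findfirst` and `findlast` with a predicate; the toy date column is made up for the illustration.

```julia
using Dates

# Toy date column, made up for the illustration; sorted like the dataset.
dates = [Date(2006, 7, 3), Date(2006, 7, 3), Date(2006, 7, 4),
         Date(2006, 7, 4), Date(2006, 7, 5)]

startDay = Date(2006, 7, 3)
endDay   = Date(2006, 7, 4)

# Comparing the whole vector with one date gives a single Bool, not a mask.
wholeComparison = (dates == startDay)

# Predicate forms test each element and return an index (or nothing if there is no match).
startInd = findfirst(d -> d == startDay, dates)
endInd   = findlast(d -> d == endDay, dates)
```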

In [40]:
size(uniqueOpts)

(12917,3)

In [41]:
13000 * 0.00009

1.1700000000000002

## Bid ask prices

Do bid-ask prices make sense, or are they observed too infrequently? Fraction of **missing bid-ask prices**:

In [6]:
[sum(isna(dat[:Bid]))/nObs sum(isna(dat[:Ask]))/nObs]

1x2 Array{Float64,2}:
 0.789053  0.777314
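
The fractions above are the number of missing entries divided by the number of rows. A minimal sketch of the same computation in current Julia (1.x), where `missing` plays the role of `NA`; the toy column is made up for the illustration.

```julia
# Toy bid column, made up for the illustration; `missing` marks absent quotes.
bids = [missing, 0.1, missing, missing, 3.2]

# Share of missing entries: number of missing values divided by the column length.
fracMissing = count(ismissing, bids) / length(bids)
```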

## Get smaller subset

## Get list of options

In [51]:
function getAllOptions(dat1::DataFrame)
    optsUnique = unique(dat1[:, [:Strike, :Expiry, :IsCall]])
    nOpts = size(optsUnique, 1)
    return Option[Option(optsUnique[ii, :]) for ii=1:nOpts]
end

getAllOptions (generic function with 1 method)

In [52]:
allOpts = getAllOptions(dat1)

Group options by common expiry date: get a `Timematr` with DAX prices and option prices for all options of a given expiry date.

In [60]:
function getOptionsWithExpiry(optList::Array{Option, 1}, expiry::Date)
    # find options with given expiry
    nOpts = size(optList, 1)
    isValid = falses(nOpts)
    for ii=1:nOpts
        if optList[ii].expiry == expiry
            isValid[ii] = true
        end
    end
    return optList[isValid]
end

getOptionsWithExpiry (generic function with 1 method)
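
The same selection can be expressed with `filter` and a predicate. A minimal sketch in current Julia (1.x), using named tuples in place of `Option` objects; the toy option list is made up for the illustration.

```julia
using Dates

# Toy option list as named tuples, made up for the illustration.
optList = [(strike = 1800, expiry = Date(2006, 12, 15), iscall = true),
           (strike = 1800, expiry = Date(2007, 3, 16),  iscall = true),
           (strike = 2000, expiry = Date(2006, 12, 15), iscall = false)]

expiry = Date(2006, 12, 15)

# filter keeps the elements for which the predicate returns true.
cohort = filter(opt -> opt.expiry == expiry, optList)
```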

In [None]:
function getCohortPrices(dat::DataFrame, optList::Array{Option, 1}, expiry::Date)
    validOpts = getOptionsWithExpiry(optList, expiry)
    
    
    
    
    return Timematr() 
end

Group all options that are listed at a given date (all strikes, all maturities)