In [1]:
using DataFrames

INFO: Precompiling module DataFrames...


In [9]:
df = readtable("data.csv");

## DataFrames

### DataFrame Methods

Several simple functions let you inspect a `DataFrame`:

In [10]:
size(df)

(250000,20)

In [11]:
names(df)

20-element Array{Symbol,1}:
 :timestamp             
 :page_group            
 :geo_cc                
 :geo_rg                
 :geo_org               
 :geo_netspeed          
 :user_agent_family     
 :user_agent_major      
 :user_agent_minor      
 :user_agent_os         
 :user_agent_osversion  
 :user_agent_device_type
 :user_agent_model      
 :params_dom_sz         
 :params_dom_ln         
 :params_dom_script     
 :params_dom_img        
 :timers_t_done         
 :timers_t_resp         
 :timers_t_page         

### Each column of a DataFrame is a DataArray

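You can reference a column using the column name as a `Symbol` subscript.  A `DataArray` is a regular array whose elements may also be `NA`, which is Juliaspeak for `NULL`.  Adding a row range as the first subscript selects a slice of rows, and passing a vector of `Symbol`s selects several columns at once.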

In [12]:
df[:timers_t_done]

250000-element DataArrays.DataArray{Int64,1}:
  6257
  5955
 14750
 12266
 10773
  6604
  3502
  6073
  6554
  6546
 12472
 10238
 10995
     ⋮
  7827
  6673
  6624
  6189
  4699
  2968
  7348
  7265
  8626
  3756
  3836
  4439

In [13]:
df[30:40, :timers_t_done]

11-element DataArrays.DataArray{Int64,1}:
  2857
  3056
  5124
  3188
  4841
  4680
  4879
  6106
  4516
  5557
 12049

In [14]:
df[30:40, [:timestamp, :geo_cc, :geo_netspeed, :user_agent_family, :timers_t_done]]

| Row | timestamp | geo_cc | geo_netspeed | user_agent_family | timers_t_done |
|---|---|---|---|---|---|
| 1 | 1455611221592 | PL | | IE | 2857 |
| 2 | 1455611283782 | PL | | IE | 3056 |
| 3 | 1455611355679 | PL | | IE | 5124 |
| 4 | 1455612940770 | GB | Cable/DSL | Safari | 3188 |
| 5 | 1455613685100 | IE | Dialup | IE | 4841 |
| 6 | 1455613730994 | IE | Dialup | IE | 4680 |
| 7 | 1455614657272 | UA | | Firefox | 4879 |
| 8 | 1455614335263 | UA | | Firefox | 6106 |
| 9 | 1455614452250 | UA | | Firefox | 4516 |
| 10 | 1455612605862 | GB | Cable/DSL | Safari | 5557 |


## Stats on DataFrames

Most Julia stats functions run on `AbstractArray`, which is the base type for `Array` as well as `DataArray`, so you can run them on any column of a `DataFrame` that contains numbers. You will probably need to remove `NA`s first using the `dropna` function.

Our test dataset doesn't contain any `NA` values for the `timers_t_done` column, so we're safe.

In [15]:
summarystats(df[:timers_t_done])

Summary Stats:
Mean:         5858.974556
Minimum:      8.000000
1st Quartile: 2357.000000
Median:       3973.000000
3rd Quartile: 6688.000000
Maximum:      2536087.000000
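
For readers on Julia 1.x, where `missing` takes the place of `NA` and `skipmissing` plays a role similar to `dropna`, the same clean-then-summarize step might look like the sketch below. It assumes Julia 1.x and the `Statistics` standard library, and the values are made up for illustration.

```julia
using Statistics   # standard library in Julia 1.x; provides mean and quantile

# Made-up timing column with one missing entry.
timings = [6257, 5955, missing, 14750, 12266]

# Drop the missing entry and keep the remaining values as a plain vector.
clean = collect(skipmissing(timings))

avg = mean(clean)
q25, q50, q75 = quantile(clean, [0.25, 0.5, 0.75])
```

Collecting the result first matters because `quantile` needs a concrete array to sort, whereas `mean` would also accept the `skipmissing` iterator directly.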


## Histograms

By default, the `hist` function splits the dataset into equal-width buckets based on the data's range.  This may not always be what you want, so you can pass a list of thresholds (the bucket edges) as the second parameter.

The `hist` function returns a tuple.  The first element is the thresholds used, which might be a `Range` object or an `Array`.  The second element is the list of bucket frequencies.

In [16]:
hist(df[:timers_t_done])

(0.0:200000.0:2.6e6,[249866,84,44,2,1,1,0,1,0,0,0,0,1])
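
To make the bucket semantics concrete, here is a minimal base-Julia sketch of counting values against explicit thresholds. The helper name `bucket_counts` is hypothetical, and the right-closed bucket convention `(edges[i], edges[i+1]]` is an assumption made for illustration, not a statement about how `hist` is implemented.

```julia
# Hypothetical helper (not part of Base or DataFrames): count how many values
# fall into each bucket defined by a sorted vector of edges.
# Assumption: buckets are right-closed, (edges[i], edges[i+1]].
function bucket_counts(values, edges)
    counts = zeros(Int, length(edges) - 1)
    for v in values
        # Index of the first edge >= v, minus one, is the bucket number.
        b = searchsortedfirst(edges, v) - 1
        if 1 <= b <= length(counts)
            counts[b] += 1      # values outside the edges are ignored
        end
    end
    return counts
end

edges  = [0, 2, 4, 6]
counts = bucket_counts([1, 2, 3, 4, 5, 10], edges)   # 10 lies beyond the last edge
```

A vector of n edges always describes n - 1 buckets, which is why a list of 27 thresholds produces 26 counts.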

### Creating thresholds based on the data

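We could use static thresholds, but they would not adapt to different datasets.  Instead, we write a Julia function that derives the thresholds from the data.

Rather than divide the entire range into a fixed number of buckets, we divide the range between two fences placed 1.5 times the Inter-Quartile Range (IQR) below the first quartile and above the third quartile, clamped to the observed minimum and maximum.  This keeps outliers from stretching the evenly spaced buckets.  Outliers above the upper fence go into one extra bucket that runs up to the maximum.  The function below does not add a matching bucket below the lower fence; for this dataset the lower fence is clamped to the minimum, so none is needed.

This is very similar to how a box and whiskers plot treats outliers.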

In [17]:
# Compute histogram thresholds from the data, using 1.5 × IQR fences
# to keep outliers out of the evenly spaced buckets
function getSymmetricThresholds(results::DataFrame; timer::Symbol=:timers_t_done)
    summary = summarystats(results[timer])
    fw = (summary.q75 - summary.q25) * 1.5    # fence width: 1.5 × IQR

    # Fences, clamped to the observed minimum and maximum
    low  = round(Int64, max(summary.min, summary.q25 - fw))
    high = round(Int64, min(summary.max, summary.q75 + fw)) + 1

    thresholds = Int64[]
    nthresholds = 25
    span = high - low    # named `span` to avoid shadowing Base.range

    # Evenly spaced lower edges between low and high
    for i in 0:nthresholds-1
        push!(thresholds, round(Int64, low + i * span / nthresholds))
    end

    push!(thresholds, high)

    # One extra bucket for everything above the upper fence
    if high < round(Int64, summary.max)
        push!(thresholds, round(Int64, summary.max))
    end

    return thresholds
end

getSymmetricThresholds (generic function with 1 method)

#### Julia Functions

Notice that Julia functions are declared using the `function` keyword.  Function parameters may carry type annotations; these are optional and mainly useful when you define several methods under the same function name.

Functions may also take keyword arguments.  A `;` in the signature separates the positional parameters from the keyword ones, and a keyword argument that has a default value, such as `timer` above, can be omitted by the caller.

Keyword arguments must be passed by name, and their order doesn't matter.

A function returns a single value, though that value may be a tuple of multiple objects.  The caller can keep the tuple as one value or unpack it into several variables, for example `(a, b) = f()`; the parentheses on the left are optional.
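
A small sketch of these rules, using a hypothetical function `padded_bounds` that is not part of this notebook's analysis:

```julia
# Hypothetical function: `values` is a required positional argument;
# `pad` and `scale` after the `;` are keyword arguments with defaults.
function padded_bounds(values; pad = 0, scale = 1)
    lo = (minimum(values) - pad) * scale
    hi = (maximum(values) + pad) * scale
    return lo, hi    # one return value, which happens to be a 2-tuple
end

bounds = padded_bounds([3, 1, 2])                        # keep the tuple whole
lo, hi = padded_bounds([3, 1, 2]; scale = 10, pad = 1)   # unpack; keyword order is free
```

In the first call both keywords fall back to their defaults; in the second they are passed by name in the opposite order from the signature.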

In [18]:
thresholds = getSymmetricThresholds(df)

27-element Array{Int64,1}:
       8
     535
    1062
    1589
    2116
    2643
    3170
    3698
    4225
    4752
    5279
    5806
    6333
       ⋮
    7914
    8441
    8968
    9495
   10023
   10550
   11077
   11604
   12131
   12658
   13185
 2536087
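
As a cross-check, these thresholds can be re-derived from the four summary statistics printed earlier (minimum 8, first quartile 2357, third quartile 6688, maximum 2536087). The sketch below assumes Julia 1.x, where `round` sends ties to the nearest even integer; `thresholds_from_summary` is a hypothetical name and only mirrors the arithmetic of `getSymmetricThresholds`.

```julia
# Hypothetical re-derivation from the four summary statistics alone.
function thresholds_from_summary(minval, q25, q75, maxval; nthresholds = 25)
    fence = (q75 - q25) * 1.5                        # 1.5 × IQR
    low   = round(Int, max(minval, q25 - fence))
    high  = round(Int, min(maxval, q75 + fence)) + 1
    edges = [round(Int, low + i * (high - low) / nthresholds) for i in 0:nthresholds-1]
    push!(edges, high)
    if high < round(Int, maxval)
        push!(edges, round(Int, maxval))             # catch-all bucket for high outliers
    end
    return edges
end

edges = thresholds_from_summary(8.0, 2357.0, 6688.0, 2536087.0)
```

Here the IQR is 4331 and the fence width is 6496.5. The lower fence of -4139.5 is clamped to the minimum of 8, and the upper fence of 13184.5 rounds to 13184 and becomes 13185 after the `+ 1`.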

Running the `hist` function with our new thresholds gives a much finer-grained view of the data.

In [19]:
hist_global = hist(df[:timers_t_done], thresholds)[2]

26-element Array{Int64,1}:
   252
  6337
 19357
 25199
 24620
 21891
 18662
 16302
 14786
 12989
 11284
  9803
  8566
  7237
  6349
  5424
  4757
  4204
  3728
  3096
  2669
  2429
  2098
  1729
  1529
 14702

## Filtering DataFrames

We can also filter a `DataFrame` on the value of one or more fields.  In the following example, we keep all rows whose `:geo_cc` is not `NA` and equals `"US"`.

In [22]:
results_US = df[!isna(df[:geo_cc]) & (df[:geo_cc] .== "US"), :];
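
The same masking idea can be sketched on plain vectors. This sketch assumes Julia 1.x, where `missing` stands in for `NA`, and the data is made up. A guard is still needed there because comparing a missing value with `==` produces `missing` rather than `false`; `isequal` is one way to get a plain true/false mask.

```julia
# Made-up country codes (one missing) and matching timings.
geo_cc = ["US", "PL", missing, "US", "GB"]
t_done = [6257, 2857, 3056, 5955, 3188]

# isequal treats missing as "not equal", so every mask entry is true or false.
is_us = isequal.(geo_cc, "US")

# Index with the boolean mask to keep only the matching rows.
t_done_us = t_done[is_us]
```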

In [23]:
hist_US = hist(results_US[:timers_t_done], thresholds)[2]

26-element Array{Int64,1}:
   249
  6243
 19010
 24561
 23830
 21014
 17859
 15578
 14051
 12363
 10758
  9309
  8103
  6823
  6006
  5101
  4462
  3903
  3483
  2896
  2473
  2265
  1926
  1584
  1419
 13249

### Statistical Correlation

The `cor` function computes the correlation between the two histograms we now have:

In [24]:
cor(hist_global, hist_US)

0.9995864880638793

We could also run `cumsum` to turn each histogram into cumulative counts (an unnormalized `CDF`) and correlate those values.

In [25]:
cor(cumsum(hist_global), cumsum(hist_US))

0.9999733633808021
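
To show what these two steps compute, here is a base-Julia sketch with made-up bucket counts. The helpers `to_cdf` and `pearson` are hypothetical names that spell out the arithmetic; the sketch assumes `cor` on two vectors returns the Pearson correlation coefficient.

```julia
# Hypothetical helpers, written out to show the arithmetic.
to_cdf(counts) = cumsum(counts) ./ sum(counts)   # cumulative share, ends at 1.0

function pearson(x, y)
    mx, my = sum(x) / length(x), sum(y) / length(y)
    dx, dy = x .- mx, y .- my
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

hist_a = [1, 2, 3, 4]      # made-up bucket counts
hist_b = [2, 5, 6, 7]

cdf_a    = to_cdf(hist_a)
r_counts = pearson(hist_a, hist_b)
r_cdf    = pearson(to_cdf(hist_a), to_cdf(hist_b))
```

Dividing by the total only rescales each series, and the Pearson coefficient is unaffected by rescaling, so correlating the raw `cumsum` output gives the same result as correlating the normalized CDFs.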

## Splitting/Grouping a DataFrame

Use the `by` function to run an aggregation on a `DataFrame` grouped by one or more columns.

In [26]:
by(df, :user_agent_family, rows -> median(rows[:timers_t_done]))

| Row | user_agent_family | x1 |
|---|---|---|
| 1 | (Unknown) | 3740.0 |
| 2 | AOL | 4857.0 |
| 3 | Amazon Silk | 7599.0 |
| 4 | Android Browser | 11886.0 |
| 5 | BlackBerry WebKit | 8684.0 |
| 6 | Chrome | 3129.0 |
| 7 | Chrome Frame | 4067.0 |
| 8 | Chrome Mobile | 6776.0 |
| 9 | Chrome Mobile iOS | 4257.0 |
| 10 | Chromium | 2772.0 |


### Problems if the aggregation function returns an array

If the aggregation function returns an array, as the `hist` function does, we end up with one row per array element rather than one row per group.  To keep one row per group, we need to serialize the array to a string or create a custom data type that encapsulates the array.  The string method is easier, albeit a little slower, and if we're going to export our data to JavaScript, we may need to do this anyway.

In [27]:
using JSON    # provides JSON.json, used to serialize each histogram

by(
    df,
    :user_agent_family,
    rows -> DataFrame(
        count = size(rows, 1),
        median = median(rows[:timers_t_done]),
        hist = JSON.json(hist(rows[:timers_t_done], thresholds)[2])
    )
)

| Row | user_agent_family | count | median | hist |
|---|---|---|---|---|
| 1 | (Unknown) | 73 | 3740.0 | "[0,1,8,9,6,5,7,5,0,2,1,2,4,0,0,3,1,5,3,1,2,2,0,1,0,5]" |
| 2 | AOL | 15 | 4857.0 | "[0,0,1,1,1,1,1,0,2,2,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,2]" |
| 3 | Amazon Silk | 2323 | 7599.0 | "[0,0,0,1,15,39,84,117,133,147,145,165,155,113,141,113,108,90,78,70,67,76,53,49,49,315]" |
| 4 | Android Browser | 1752 | 11886.0 | "[0,1,0,0,0,3,17,29,43,50,66,63,76,65,80,46,64,55,52,48,44,41,52,46,32,779]" |
| 5 | BlackBerry WebKit | 43 | 8684.0 | "[0,0,0,0,1,1,1,0,2,1,2,1,3,3,4,1,3,0,2,1,0,2,2,0,1,12]" |
| 6 | Chrome | 53086 | 3129.0 | "[65,2116,6133,7217,6356,5033,3894,3216,2735,2225,1788,1558,1398,1164,993,872,711,606,563,464,437,342,322,275,246,2357]" |
| 7 | Chrome Frame | 37 | 4067.0 | "[0,0,1,4,4,5,3,4,4,2,1,0,1,0,2,0,1,1,0,1,0,1,0,0,0,2]" |
| 8 | Chrome Mobile | 31477 | 6776.0 | "[0,1,23,60,230,594,1106,1698,2232,2663,2709,2494,2268,2082,1793,1607,1356,1140,997,835,671,612,504,402,369,3031]" |
| 9 | Chrome Mobile iOS | 1987 | 4257.0 | "[0,22,86,184,182,198,179,132,107,101,82,69,60,48,42,36,37,39,39,37,26,21,18,31,17,194]" |
| 10 | Chromium | 5 | 2772.0 | "[0,0,0,0,2,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0]" |
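
The split, apply, and combine steps behind this result can be sketched in base Julia on made-up rows. This is an illustration of the pattern, not of how `by` is implemented. It assumes Julia 1.x with the `Statistics` standard library, and it builds the JSON-style string with `join` instead of the JSON package.

```julia
using Statistics   # for median (standard library in Julia 1.x)

# Made-up rows: (user_agent_family, timers_t_done)
rows  = [("Chrome", 3000), ("Chrome", 3200), ("AOL", 4800), ("Chrome", 9000), ("AOL", 5000)]
edges = [0, 4000, 8000, 12000]

# Split: collect the timings for each family.
groups = Dict{String, Vector{Int}}()
for (family, t) in rows
    push!(get!(groups, family, Int[]), t)
end

# Apply and combine: one summary tuple per family, with the histogram
# flattened to a string so that it occupies a single cell.
summary_rows = map(sort(collect(keys(groups)))) do family
    ts = groups[family]
    counts = [count(t -> lo < t <= hi, ts) for (lo, hi) in zip(edges[1:end-1], edges[2:end])]
    (family, length(ts), median(ts), "[" * join(counts, ",") * "]")
end
```

Because the histogram is a single string, each family stays on one row however many buckets there are.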


### Copy the JSON to a JavaScript file when first testing D3 code

It's easier to start your D3 experimentation with a standalone file rather than within the IJulia interface.  A simpler dev setup is easier to debug.

In [28]:
println("Histogram:\n", JSON.json(hist_global))
println()
println("Thresholds:\n", JSON.json(thresholds))

Histogram:
[252,6337,19357,25199,24620,21891,18662,16302,14786,12989,11284,9803,8566,7237,6349,5424,4757,4204,3728,3096,2669,2429,2098,1729,1529,14702]

Thresholds:
[8,535,1062,1589,2116,2643,3170,3698,4225,4752,5279,5806,6333,6860,7387,7914,8441,8968,9495,10023,10550,11077,11604,12131,12658,13185,2536087]
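
Instead of copying the printed output by hand, the arrays could be written straight to a JavaScript file. The sketch below assumes Julia 1.x; the file name `histogram_data.js`, the JavaScript variable names, and the helper `to_js_array` are all hypothetical, and the short arrays stand in for `hist_global` and `thresholds`.

```julia
# Made-up values standing in for hist_global and thresholds.
histogram  = [252, 6337, 19357]
thresholds = [8, 535, 1062, 1589]

# Hypothetical helper; sufficient for integer arrays, which need no escaping.
to_js_array(xs) = "[" * join(xs, ",") * "]"

# Hypothetical output path; a temporary directory keeps the example free of side effects.
path = joinpath(mktempdir(), "histogram_data.js")

open(path, "w") do io
    println(io, "var histogram = ", to_js_array(histogram), ";")
    println(io, "var thresholds = ", to_js_array(thresholds), ";")
end

contents = read(path, String)   # read back to confirm what was written
```

A standalone HTML page can then load this file with a `<script>` tag ahead of the D3 code that reads the two variables.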
