# Comprehensions

- load required packages

In [1]:
using DataArrays
using DataFrames
using TimeData
using Dates

- **comprehensions**: easy way to build `Arrays`

In [2]:
[ii for ii=1:4]

4-element Array{Int64,1}:
 1
 2
 3
 4

- **forcing type** of individual entries by prepending a type declaration

In [3]:
Float64[ii for ii=1:4]

4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
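A minimal sketch of how the type prefix changes the element type. It is written for a current Julia 1.x session (an assumption; the session in this notebook is 0.3) and the variable names are illustrative only.

```julia
# Without a prefix, the element type is inferred from the generated values.
ints = [ii for ii in 1:4]

# With a prefix, every entry is converted to the requested type.
floats = Float64[ii for ii in 1:4]

# An abstract prefix keeps the values as they are but widens the container.
anys = Any[ii for ii in 1:4]

println(eltype(ints), " ", eltype(floats), " ", eltype(anys))
```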

- similar logic: collect elements of **any type** in `Array` of type `Any`

In [4]:
Any["Hello" 3 4.0 NA]

1x4 Array{Any,2}:
 "Hello"  3  4.0  NA

- comprehensions can also be used to capture more complex iterated output
- for example: **iteration over** **sample size** or **parameter values**

In [5]:
[Timematr(rand(nObs, 2)) for nObs in [2, 10]]

2-element Array{Timematr{Int64},1}:
 Timematr{Int64}(2x2 DataFrame
| Row | x1       | x2       |
|-----|----------|----------|
| 1   | 0.143767 | 0.252105 |
| 2   | 0.954473 | 0.396473 |,[1,2])
 Timematr{Int64}(10x2 DataFrame
| Row | x1        | x2       |
|-----|-----------|----------|
| 1   | 0.129805  | 0.141181 |
| 2   | 0.277486  | 0.106432 |
| 3   | 0.69658   | 0.846779 |
| 4   | 0.660322  | 0.564138 |
| 5   | 0.927377  | 0.196321 |
| 6   | 0.15726   | 0.928393 |
| 7   | 0.0555973 | 0.29965  |
| 8   | 0.994351  | 0.476713 |
| 9   | 0.544934  | 0.877026 |
| 10  | 0.286077  | 0.143029 |,[1,2,3,4,5,6,7,8,9,10])

- using **single index** _ii_, it is **not** directly possible to get **`Array{T, 2}`** through comprehension

In [6]:
[[1 2] for ii=1:4]

4-element Array{Array{Int64,2},1}:
 1x2 Array{Int64,2}:
 1  2
 1x2 Array{Int64,2}:
 1  2
 1x2 Array{Int64,2}:
 1  2
 1x2 Array{Int64,2}:
 1  2

## Splicing

- **successively** returning components of a collection through the `...` operator (the Julia manual calls this *splatting*)
- could be used to **paste** elements of a **collection into function** arguments
- allows easy **creation** of **`Arrays`**

- applying **`[ ]`** to collection only captures whole collection as **single entry of** an **`Array`**

In [7]:
kk = (1, 2, 3, 4)
[kk]

1-element Array{(Int64,Int64,Int64,Int64),1}:
 (1,2,3,4)

- with **splicing**: **each element** of the collection gets its **own entry** within an **`Array`**

In [8]:
[kk...]

4-element Array{Int64,1}:
 1
 2
 3
 4
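Pasting the elements of a collection into function arguments, as mentioned in the list above, can be sketched as follows. The function `add3` is a hypothetical helper defined only for this illustration, and the syntax assumes a current Julia 1.x session.

```julia
kk = (1, 2, 3, 4)

# Each element of the tuple becomes a separate argument,
# so max(kk...) is the same call as max(1, 2, 3, 4).
largest = max(kk...)

# Hypothetical three-argument function, defined only for this example.
add3(a, b, c) = a + b + c
total = add3((10, 20, 30)...)

println(largest, " ", total)
```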

- works out of the box: even **for new types**

In [9]:
type foo
    value
end

In [10]:
fooObj = foo(3)

foo(3)

In [11]:
kk = (fooObj, fooObj, fooObj)
[kk...]

3-element Array{foo,1}:
 foo(3)
 foo(3)
 foo(3)

- for some types, there might be **more meaningful ways** to vertically **store successive values** than inside of an `Array`
- `vcat`: allows **combination** of objects in specified **structure**

In [12]:
vcat([1 2], [1 2])

2x2 Array{Int64,2}:
 1  2
 1  2

- `vcat` also works for **variable** number of **input arguments**:

In [13]:
kk = ([1 2], [3 4], [5 6])

(
1x2 Array{Int64,2}:
 1  2,

1x2 Array{Int64,2}:
 3  4,

1x2 Array{Int64,2}:
 5  6)

In [14]:
vcat(kk[1], kk[2], kk[3])

3x2 Array{Int64,2}:
 1  2
 3  4
 5  6

- together with splicing, `vcat` conveniently **transforms tuple** of values **into** concise **`Array`**:

In [15]:
vcat(kk...)

3x2 Array{Int64,2}:
 1  2
 3  4
 5  6

In [16]:
vcat([[1 2] for ii=1:4]...)

4x2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2

- **`[ ]`** applied to **spliced elements** implicitly **calls `vcat`**
- applied to `Array{Int,2}`, this results in **two-dimensional Array**

In [17]:
kk = [[1 2] for ii=1:4]
[kk...]

4x2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2

- both steps performed **simultaneously**:

In [18]:
[[[1 2] for ii=1:4]...]

4x2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2

- **alternatively**, same result could be achieved **without splicing** through usage of **two index variables**

In [19]:
[jj for ii=1:4, jj=1:2]

4x2 Array{Int64,2}:
 1  2
 1  2
 1  2
 1  2
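A sketch of the same three routes to a 4x2 result, assuming a current Julia 1.x session. One difference to the 0.3 session shown here: in 1.x, square brackets around spliced elements no longer call `vcat`, so `vcat` has to be called explicitly (or through `reduce`).

```julia
rows = [[1 2] for ii in 1:4]            # vector of four 1x2 matrices

stacked_splat  = vcat(rows...)          # splice each matrix into vcat
stacked_reduce = reduce(vcat, rows)     # same result without splicing
two_index      = [jj for ii in 1:4, jj in 1:2]   # two index variables

# In Julia 1.x this keeps four separate entries instead of concatenating.
kept_apart = [rows...]

println(size(stacked_splat), " ", size(kept_apart))
```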

- in general, **splicing and comprehension** also work for **data structures other than `Array`**
- for **example**, application of splicing and comprehension to **`DataFrames`** will return a `DataFrame` again

In [20]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

kk = (df, df)
xx = [kk...]

| Row | a   | b   |
|-----|-----|-----|
| 1   | 5.0 | 8.0 |
| 2   | 6.0 | NA  |
| 3   | NA  | NA  |
| 4   | 5.0 | 8.0 |
| 5   | 6.0 | NA  |
| 6   | NA  | NA  |


In [21]:
typeof(xx)

DataFrame (constructor with 11 methods)

# Iterators

under the hood, comprehensions make use of iterators:

- iterators **successively return** values from a **collection**
- iterators can be **specified for each type**

- for example: **column iterator of `DataFrames`**, which returns a tuple with column name and values given as `DataArray` for each column

In [22]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

[col for col in eachcol(df)]

2-element Array{(Any,Any),1}:
 (:a,[5,6,NA]) 
 (:b,[8,NA,NA])

- although a **comprehension** is quite a convenient way to **show the values of an iterator**, keep in mind that it also **automatically** captures the result in an **`Array`**
- this can be **unnecessarily costly** (more time, more memory allocated) when the collected result is not needed

In [23]:
@time kk = [display(col) for col in eachcol(df)]

(:a,[5,6,NA])

(:b,[8,NA,NA])

elapsed time: 0.075189836 seconds (1745540 bytes allocated)


2-element Array{Any,1}:
 nothing
 nothing

- for comparison, only displaying iterated elements:

In [24]:
@time for col in eachcol(df)
    display(col)
end

(:a,[5,6,NA])

(:b,[8,NA,NA])

elapsed time: 0.001346607 seconds (32144 bytes allocated)
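The difference between the two timings above comes from the result array that the comprehension builds. A minimal sketch, assuming a current Julia 1.x session; `record` is a hypothetical helper that only has a side effect.

```julia
seen = Int[]

# Hypothetical helper: stores its argument and returns nothing.
record(x) = (push!(seen, x); nothing)

# The comprehension runs the side effect AND collects every return value.
collected = [record(x) for x in 1:3]

# foreach only runs the side effect; no result array is built.
returned = foreach(record, 4:6)

println(length(collected), " ", length(seen))
```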


- **`DataFrame` iterator** returns tuple, so that **values only** (without column name) are obtained **through indexing**

In [25]:
[col[2] for col in eachcol(df)]

2-element Array{Any,1}:
 [5,6,NA] 
 [8,NA,NA]
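That iterators can be specified for each type can be sketched with a small custom type. This assumes a current Julia 1.x session, where the protocol is a single `iterate` function (Julia 0.3 used `start`, `next` and `done` instead). The type `Countdown` is hypothetical and exists only for this illustration.

```julia
# Hypothetical type used only for this example.
struct Countdown
    start::Int
end

# Julia 1.x protocol: return (value, next_state), or nothing when finished.
Base.iterate(c::Countdown, state=c.start) =
    state < 1 ? nothing : (state, state - 1)

# Lets comprehensions and collect size the result array up front.
Base.length(c::Countdown) = max(c.start, 0)
Base.eltype(::Type{Countdown}) = Int

counted = [x for x in Countdown(3)]
println(counted)
```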

two **applications of iterators** come to mind immediately:

- iteratively manipulating entries of a type
- building a new object by iteratively using an existing object

## Iteratively manipulating entries

example: **iteratively manipulating columns of `DataFrame`**

- as seen above, values need to be referenced within column iterator tuple with subindex 2
- **setting first entry of each column to 10**:

In [26]:
for col in eachcol(df)
    col[2][1] = 10
end
df

| Row | a    | b    |
|-----|------|------|
| 1   | 10.0 | 10.0 |
| 2   | 6.0  | NA   |
| 3   | NA   | NA   |


- let's try **multiplying each column by 10**:

In [27]:
try
    for col in eachcol(df)
        col[2] = col[2].*10
    end
catch e
    show(e)
end

MethodError(setindex!,((:a,[10,6,NA]),[100,60,NA],2))

- the **correct way** is:

In [28]:
for col in eachcol(df)
    col[2][:] = col[2].*10
end
df

| Row | a     | b     |
|-----|-------|-------|
| 1   | 100.0 | 100.0 |
| 2   | 60.0  | NA    |
| 3   | NA    | NA    |


- applying similar logic to the **manipulation of** entries of an **`Array{Int, 1}` fails**: each `entry` is a plain `Int`, which cannot be indexed into and modified

In [29]:
kk = [1, 2, 3, 4]
try
    for entry in kk
        entry[1] = entry*5
    end
catch e
    show(e)
end
kk

MethodError(setindex!,(1,5,1))

4-element Array{Int64,1}:
 1
 2
 3
 4
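A sketch of what does change the entries of a plain array, assuming a current Julia 1.x session: assigning to the loop variable only rebinds it, while writing through an index modifies the array itself.

```julia
kk = [1, 2, 3, 4]

# Assigning to the loop variable only rebinds it; the array is untouched.
for entry in kk
    entry = entry * 5
end
untouched = copy(kk)

# Writing through an index changes the array in place.
for i in eachindex(kk)
    kk[i] = kk[i] * 5
end

println(untouched, " ", kk)
```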

- trying **similar transformation of** columns of a **`Timenum`** object

In [30]:
dats = [Date(2014,1,1):Date(2014,1,3)]
tn = Timenum(df, dats)

| Row | idx        | a     | b     |
|-----|------------|-------|-------|
| 1   | 2014-01-01 | 100.0 | 100.0 |
| 2   | 2014-01-02 | 60.0  | NA    |
| 3   | 2014-01-03 | NA    | NA    |


- **iterative multiplication** of columns: **has no effect** (without error!), as `col = col.*2` only rebinds the loop variable `col` to a new object and never writes into `tn`

In [31]:
for col in eachvar(tn)
    col = col.*2
end
tn

| Row | idx        | a     | b     |
|-----|------------|-------|-------|
| 1   | 2014-01-01 | 100.0 | 100.0 |
| 2   | 2014-01-02 | 60.0  | NA    |
| 3   | 2014-01-03 | NA    | NA    |


- we need **detour to** the underlying **`DataFrame`**
- **be careful** to index the entries correctly: the following code does **not work**

In [32]:
try
    for col in eachcol(tn)
        col[2] = col[2].*2
        display(col.vals.columns[1])
    end
catch e
    show(e)
end

MethodError(setindex!,((:a,[100,60,NA]),[200,120,NA],2))

- we need an **additional colon indexing** to write into an existing column

In [33]:
for col in eachcol(tn)
    col[2][:] = col[2].*2
end
tn

| Row | idx        | a     | b     |
|-----|------------|-------|-------|
| 1   | 2014-01-01 | 200.0 | 200.0 |
| 2   | 2014-01-02 | 120.0 | NA    |
| 3   | 2014-01-03 | NA    | NA    |


## Creating new objects by iteratively manipulating existing objects

- iterator protocols make **iterative data manipulation** easy
- combined **with comprehension**, this allows for **easy creation of new objects**
- for **example**: creating `Array` of squared entries

In [34]:
kk = [1 2 3 4]
kk2 = [ii.^2 for ii in kk]

4-element Array{Any,1}:
  1
  4
  9
 16

- again, there might be **more meaningful ways** to combine the individual parts than an **`Array`** as we get it from comprehension

In [35]:
df = DataFrame()
df[:a] = @data([5, 6, NA])
df[:b] = @data([8, NA, NA])

dats = [Date(2014,1,1):Date(2014,1,3)]
tn = Timenum(df, dats)

| Row | idx        | a   | b   |
|-----|------------|-----|-----|
| 1   | 2014-01-01 | 5.0 | 8.0 |
| 2   | 2014-01-02 | 6.0 | NA  |
| 3   | 2014-01-03 | NA  | NA  |


In [36]:
kk = [col.*2 for col in eachvar(tn)]

2-element Array{Any,1}:
 Timenum{Date}(3x1 DataFrame
| Row | a  |
|-----|----|
| 1   | 10 |
| 2   | 12 |
| 3   | NA |,[2014-01-01,2014-01-02,2014-01-03])
 Timenum{Date}(3x1 DataFrame
| Row | b  |
|-----|----|
| 1   | 16 |
| 2   | NA |
| 3   | NA |,[2014-01-01,2014-01-02,2014-01-03])

- as we **iterate over columns**, results should be **combined horizontally**
- simple **splicing** is not sufficient here, as it **uses `vcat`**:

In [37]:
try
    [[col.*2 for col in eachvar(tn)]...]
catch e
    show(e)
end

ErrorException("variable names must coincide for vcat")

- we need **`hcat` instead**:

In [38]:
hcat([col.*2 for col in eachvar(tn)]...)

| Row | idx        | a    | b    |
|-----|------------|------|------|
| 1   | 2014-01-01 | 10.0 | 16.0 |
| 2   | 2014-01-02 | 12.0 | NA   |
| 3   | 2014-01-03 | NA   | NA   |
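The difference between combining horizontally and vertically can be sketched with plain vectors standing in for the columns. The data in `cols` is made up for this illustration, and the syntax assumes a current Julia 1.x session.

```julia
cols = [[5, 6, 7], [8, 9, 10]]           # hypothetical column data

doubled = [col .* 2 for col in cols]     # one result per column

side_by_side = hcat(doubled...)          # columns placed next to each other
stacked      = vcat(doubled...)          # columns appended below each other

println(size(side_by_side), " ", size(stacked))
```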


In [39]:
DataArray[tn.vals[1] tn.vals[2]]

1x2 Array{DataArray{T,N},2}:
 [5,6,NA]  [8,NA,NA]

- instead of manually combining manipulated values from an iterator each time, we could also define a **default data structure** that is **returned** by the function **`map`**

## Map

- through **multiple dispatch**, the **output** of `map` can be **customized to the iterator type** used
- for example: **multiplication of each `DataFrame` column** could be done **in two different ways**
- **first way**: **iterating over** entries of an **`Array`** (which contains the column names) will return an `Array`

In [40]:
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])
map(nam -> df[nam].*2, names(df))

2-element Array{DataArray{Int64,1},1}:
 [2,4,6]  
 [8,10,12]

- **second way**: using method `map` for **`DataFrame` column iterator**

In [41]:
df2 = map(col -> col.*2, eachcol(df))
df2

| Row | a | b  |
|-----|---|----|
| 1   | 2 | 8  |
| 2   | 4 | 10 |
| 3   | 6 | 12 |


- similarly: **map** for **TimeData iterators**

In [42]:
@time tn2 = map(x -> x.*2, eachvar(tn))

elapsed time: 0.055906843 seconds (1649364 bytes allocated)


| Row | idx        | a    | b    |
|-----|------------|------|------|
| 1   | 2014-01-01 | 10.0 | 16.0 |
| 2   | 2014-01-02 | 12.0 | NA   |
| 3   | 2014-01-03 | NA   | NA   |


- `map` for `TimeData` iterators `eachvar`, `eachdate` and `eachobs` always **preserves both indices and column names**

- `map` can also be applied to **two collections**, passing one element of each to the function at a time

In [43]:
vals1 = [10 20]
vals2 = [40 1]
map(.+, vals1, vals2)

1x2 Array{Int64,2}:
 50  21
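A sketch of `map` over two collections, assuming a current Julia 1.x session. Since `map` already calls the function once per pair of elements, plain `+` is sufficient here. The second call uses made-up names and values purely for illustration.

```julia
vals1 = [10 20]
vals2 = [40 1]

# The function receives one element from each collection per call.
sums = map(+, vals1, vals2)

# Any two-argument function works, here an anonymous one.
labels = map((name, v) -> "$name=$v", ["a", "b"], [1, 2])

println(sums, " ", labels)
```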

## Reduce

- using the function `reduce`, **individual components** of a collection can be **aggregated**
- through multiple dispatch, `reduce` can have **different implementations** for each **type**
- **using `map` and `reduce`** together, individual entries of iterable collections can be **manipulated and aggregated** to a single result 

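**example**: calculating **row means** (the reducer below is only correct for exactly **two columns**, as it divides inside the single reduction step)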

In [44]:
df

| Row | a | b |
|-----|---|---|
| 1   | 1 | 4 |
| 2   | 2 | 5 |
| 3   | 3 | 6 |


In [45]:
meanDf = reduce((x,y) -> (x[2].+y[2])./size(df, 2), eachcol(df))

3-element DataArray{Float64,1}:
 2.5
 3.5
 4.5
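For any number of columns, one option is to sum with `reduce` first and divide once afterwards. A minimal sketch with plain vectors standing in for the columns; the data is made up, and the syntax assumes a current Julia 1.x session.

```julia
cols = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]     # hypothetical columns

# Aggregate first: elementwise sum over all columns.
row_sums = reduce((x, y) -> x .+ y, cols)

# Divide once, after the reduction has finished.
row_means = row_sums ./ length(cols)

println(row_means)
```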

**example**: calculating **row sum with weighted columns**

- using **`map`** to calculate **weighted columns**
- using **`reduce`** to **sum up** individual weighted columns

In [46]:
df = DataFrame(a = [1, 2, 3, 4], b = [4, 5, 6, 7], c = [2, 4, 8, 10])

| Row | a | b | c  |
|-----|---|---|----|
| 1   | 1 | 4 | 2  |
| 2   | 2 | 5 | 4  |
| 3   | 3 | 6 | 8  |
| 4   | 4 | 7 | 10 |


- getting weighted columns:

In [47]:
wgts = [0.4 0.2 0.4]
kk = map((x, y) -> x.*y[2], wgts, eachcol(df))

3-element Array{Any,1}:
 [0.4,0.8,1.2,1.6]
 [0.8,1.0,1.2,1.4]
 [0.8,1.6,3.2,4.0]

In [48]:
wgts[1] * [1, 2, 3, 4]

4-element Array{Float64,1}:
 0.4
 0.8
 1.2
 1.6

- aggregation with `reduce`

In [49]:
reduce((x, y) -> (x .+ y), map((x, y) -> x.*y[2], wgts, eachcol(df)))

4-element DataArray{Float64,1}:
 2.0
 3.4
 5.6
 7.0
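The same map-then-reduce pattern can be sketched without `DataFrames`, using plain vectors that hold the column values from above. This assumes a current Julia 1.x session; the generator form at the end is an alternative spelling, not the approach used in this notebook.

```julia
cols = [[1, 2, 3, 4], [4, 5, 6, 7], [2, 4, 8, 10]]   # columns a, b, c
wgts = [0.4, 0.2, 0.4]

# map: scale each column by its weight.
weighted = map((w, c) -> w .* c, wgts, cols)

# reduce: add the weighted columns elementwise.
row_sum = reduce((x, y) -> x .+ y, weighted)

# Alternative: sum over a generator of weighted columns.
row_sum_gen = sum(w .* c for (w, c) in zip(wgts, cols))

println(row_sum)
```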

## Session info

In [50]:
versioninfo()

Julia Version 0.3.5
Commit a05f87b* (2015-01-08 22:33 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i3-3240 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libblas.so.3
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3


In [51]:
Pkg.status()

18 required packages:
 - DataArrays                    0.2.9
 - DataFrames                    0.6.0
 - Dates                         0.3.2
 - Debug                         0.0.4
 - Distributions                 0.6.3
 - EconDatasets                  0.0.2
 - GLM                           0.4.2
 - Gadfly                        0.3.10
 - IJulia                        0.1.16
 - JuMP                          0.7.3
 - MAT                           0.2.9
 - NLopt                         0.2.0
 - Quandl                        0.4.0
 - RDatasets                     0.1.1
 - Taro                          0.1.2
 - TimeData                      0.5.1
 - TimeSeries                    0.4.6
 - Winston                       0.11.7
56 additional packages:
 - ArrayViews                    0.4.8
 - BinDeps                       0.3.7
 - Blosc                         0.1.1
 - Cairo                         0.2.22
 - Calculus                      0.1.5
 - Codecs                        0.1.3
 - Color      