# Apply structures

## apply vs. for

The purpose of this lesson is to familiarize you with apply structures in R.

It is well known in the R programming community that for-loops can be slow in R. This slowdown is often caused by assignments inside the loop that trigger memory duplication.

As an example, let's sum each row of a random `data.frame` with four columns:

In [56]:
nrows = 100
data <- data.frame(
    var1=rnorm(nrows),
    var2=rnorm(nrows),
    var3=rnorm(nrows),
    var4=rnorm(nrows)
    )

Let's create a simple function `rowSums_forloop` that initializes a result vector `rowsum`, loops over the rows of the input `data.frame`, and stores the sum of row `i` in `rowsum[i]`:

In [57]:
library(pryr)

rowSums_forloop <- function (dataframe) {
    # Sums up each row with inefficient for-loop
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    
    # Generate the vector rowsum and initialize its size
    rowsum <- vector('double',nrow(dataframe))
    
    # Not like this:
    # Generate the vector rowsum, but do not initialize its size
    # rowsum <- c()
    for (i in seq_len(nrow(dataframe))) {

        # Check memory address of rowsum to verify that it does not change
        # print(address(rowsum))
        
        # For each row store the sum of row i to rowsum[i]
        rowsum[i] <- sum(dataframe[i,])
    }
    return(rowsum)
}

rowsum_forloop <- rowSums_forloop(data)
head(rowsum_forloop)
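
For comparison, the commented-out alternative in the function body (starting from `rowsum <- c()`) might look roughly like the sketch below. The function name `rowSums_forloop_grow` and the small `toy` data frame are hypothetical and only serve as illustration.

```r
rowSums_forloop_grow <- function(dataframe) {
    # Start from an empty vector; R has to extend it on every iteration
    rowsum <- c()
    for (i in seq_len(nrow(dataframe))) {
        # Assigning past the current end grows the vector by one element
        rowsum[i] <- sum(dataframe[i, ])
    }
    return(rowsum)
}

# Hypothetical toy input with known row sums
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
grown <- rowSums_forloop_grow(toy)
print(grown)
```

The result is the same as with preallocation; the difference is only in how the result vector is allocated while the loop runs.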

The for-loop version is not optimal because it does not use the vectorized nature of R.

Vectorized in this context means that, because each column of `data` is a vector, `data[i,]` and `data[i+1,]` do not differ in structure. The operation carried out on these rows (here the `sum` function) can therefore often be applied to many rows at the same time instead of one row per call. Memory management is also much simpler.
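
To make the idea concrete, here is a minimal sketch on a small data frame. The name `toy`, its size and the seed are assumptions chosen for illustration; the column names mirror `data` above.

```r
set.seed(1)
toy <- data.frame(
    var1 = rnorm(5),
    var2 = rnorm(5),
    var3 = rnorm(5),
    var4 = rnorm(5)
)

# Row by row: one sum() call per row
rowsum_loop <- vector('double', nrow(toy))
for (i in seq_len(nrow(toy))) {
    rowsum_loop[i] <- sum(toy[i, ])
}

# Vectorized: add whole columns, so each `+` handles all rows in one call
rowsum_vectorized <- toy$var1 + toy$var2 + toy$var3 + toy$var4

print(all.equal(rowsum_loop, rowsum_vectorized))
```

Both versions compute the same numbers, but the vectorized one issues three vector additions in total instead of one function call per row.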

To utilize the vectorized nature of R objects one usually uses `apply`-style functions.

The basic `apply`-function has a simple syntax (see: [[1]](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/apply)):

```r
apply(array, margin, func)
```

Here `array` is the data (a matrix or higher-dimensional array; a `data.frame` is converted to a matrix first); `margin` is an integer or integer vector that determines whether to apply the function over rows (`1`), columns (`2`) or both (`c(1,2)`, i.e. every element); and `func` is the function to apply. In the R documentation these arguments are named `X`, `MARGIN` and `FUN`.

Dimensions of the result depend on the dimensions of the original data, the direction of application and the output shape of `func`. In this case we want to apply the `sum` function to each row. `sum` returns a single number, so the result is a vector of length `nrows`.
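
The effect of `margin` and of the output shape of `func` can be sketched on a small matrix. The matrix `m` and the variable names are made up for this illustration.

```r
# 2 rows, 3 columns; filled column by column, so the rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

# One number per row: vector of length 2
row_sums <- apply(m, 1, sum)

# One number per column: vector of length 3
col_sums <- apply(m, 2, sum)

# range() returns two numbers per row, so the result is a matrix
# in which column j holds the (min, max) of row j
row_ranges <- apply(m, 1, range)

# Applied to every element: result has the same shape as m
doubled <- apply(m, c(1, 2), function(x) x * 2)

print(row_sums)
print(col_sums)
print(row_ranges)
print(doubled)
```

Note that when `func` returns a vector, the per-row results end up in the columns of the output, so the result can look transposed relative to the input.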

In [58]:
rowSums_apply <- function(dataframe) {
    # Sums up each row with an apply-function call
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    # 
    
    # Apply sum-function to each row of dataframe and return the result
    rowsum <- apply(dataframe,1,sum)
    return(rowsum)
}
rowsum_apply <- rowSums_apply(data)

# Check that results match
head(rowsum_apply)
all.equal(rowsum_forloop,rowsum_apply)

Of course a language like R has a built-in function for summing each row. This function is `rowSums`.

In [59]:
rowsum_baser <- rowSums(data)

# Check that results match
head(rowsum_baser)
all.equal(rowsum_apply, rowsum_baser)


Now that we have three implementations of the same computation, we can use the `microbenchmark` library to compare them. The `microbenchmark` function runs each expression repeatedly and reports statistics of its runtime. By default it runs each expression 100 times.

In [60]:
library(microbenchmark)

print(microbenchmark(
    rowSums_forloop(data),
    rowSums_apply(data),
    rowSums(data)
))

Unit: microseconds
                      expr       min         lq        mean     median         uq       max neval
 rowSums_forloop_bad(data) 12238.480 12712.6060 13502.50409 13051.2680 13673.7445 20095.167   100
     rowSums_forloop(data) 12127.308 12666.6335 13722.15232 13220.5875 14217.5325 21906.828   100
       rowSums_apply(data)   260.644   287.7195   313.21951   298.7675   317.4025   599.367   100
             rowSums(data)    53.362    61.2440    90.07444    63.0640    65.6810  2508.140   100


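As the runtimes show, the for-loop implementation loses by a huge margin: its median runtime is about 13 ms, compared with about 0.3 ms for `apply` and about 0.06 ms for `rowSums`. The `rowSums_forloop_bad` row does not correspond to any expression in the call above; it presumably comes from a variant that uses the commented-out `rowsum <- c()` initialization, and it performs about the same as the preallocated loop.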

In some cases for-loops are as efficient as `apply` functions, but in general it is good to avoid them unless the iterations must run in a sequential order (e.g. an iterative algorithm where each step depends on the previous one) or you have verified that replacing the loop brings no improvement.
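
As a minimal sketch of a loop that genuinely needs sequential order, consider a recurrence in which each value depends on the previous one. The recurrence and its constants are hypothetical, chosen only for illustration.

```r
# Hypothetical recurrence: x[i] = 0.5 * x[i - 1] + 1, starting from x[1] = 0
n <- 10

# Preallocate the result; vector('double', n) is filled with zeros
x <- vector('double', n)

for (i in 2:n) {
    # Step i cannot be computed before step i - 1 is known
    x[i] <- 0.5 * x[i - 1] + 1
}

print(x)
```

Because every iteration reads the result of the previous one, the rows cannot be processed independently, and an `apply`-style call does not fit naturally here.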

## Other *apply-functions

The following functions apply a function to each element of a list or vector. Because a `data.frame` is a list of columns, they operate on it column by column:

### lapply

`lapply` is similar to `apply`, but it always returns a list. Its call signature is

`lapply(X, FUN, ...)`

### sapply

`sapply` tries to simplify the result to the simplest data structure available (a vector or a matrix) and returns a list if that is not possible. The call is similar to `lapply`.
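
How the simplification depends on what `FUN` returns can be sketched as follows. The inputs and variable names are made up for this illustration.

```r
# FUN returns one number per element: simplified to a numeric vector
squares <- sapply(1:3, function(i) i^2)

# FUN returns two numbers per element: simplified to a 2 x 3 matrix,
# with one column per input element
pairs_mat <- sapply(1:3, function(i) c(i, i^2))

# FUN returns vectors of different lengths: no simplification, a list is returned
ragged <- sapply(1:3, function(i) seq_len(i))

# lapply with the same FUN as the first call: always a list
as_list <- lapply(1:3, function(i) i^2)

str(squares)
str(pairs_mat)
str(ragged)
str(as_list)
```

Since the return type of `sapply` depends on the data, it is convenient interactively but can surprise you inside functions.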

### vapply

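`vapply` verifies that the result of each call of `FUN` has the type and length of the template given in `FUN.VALUE` and raises an error otherwise. Types may be promoted (logical < integer < double < complex) but not demoted. Its call signature is

`vapply(X, FUN, FUN.VALUE, ...)`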
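
The checking done via `FUN.VALUE` can be sketched as follows. The input list and the messages returned by the error handlers are made up for this illustration.

```r
inputs <- list(a = 1:3, b = 1:5)

# length() returns one integer per element, matching integer(1)
lengths_ok <- vapply(inputs, length, integer(1))

# range() returns two values per element, which does not match numeric(1),
# so vapply raises an error that we catch here
length_check <- tryCatch(
    vapply(inputs, range, numeric(1)),
    error = function(e) "error: wrong length"
)

# A double result cannot be demoted to the integer template
type_check <- tryCatch(
    vapply(1:3, function(i) i + 0.5, integer(1)),
    error = function(e) "error: wrong type"
)

print(lengths_ok)
print(length_check)
print(type_check)
```

This makes `vapply` a safer choice than `sapply` inside functions, because an unexpected result shape fails loudly instead of silently changing the return type.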

### mapply

`mapply`, or multivariate apply, takes the first argument from the first vector, the second argument from the second vector, and so on, and calls the function `FUN` with these arguments. Its call signature is

`mapply(FUN, ...)`

where `...` are the vectors or lists to iterate over, given as positional arguments.
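
A minimal sketch of how `mapply` pairs up its arguments; the inputs and variable names are made up for this illustration.

```r
# Element-wise pairing: calls FUN(1, 4), FUN(2, 5), FUN(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# Calls rep(1, 3), rep(2, 2), rep(3, 1); the results differ in length,
# so a list is returned instead of a vector
reps <- mapply(rep, 1:3, 3:1)

# Arguments that stay the same for every call go into MoreArgs
powers <- mapply(function(base, exp) base^exp, 2:4, MoreArgs = list(exp = 2))

print(sums)
str(reps)
print(powers)
```

Like `sapply`, `mapply` simplifies its result when it can, so the return type depends on what `FUN` produces.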

In [116]:
# Calculate mean of each column

apply_result <- apply(data,2,mean)
print('apply:')
str(apply_result)

lapply_result <- lapply(data,mean)
print('lapply:')
str(lapply_result)

sapply_result <- sapply(data,mean)
print('sapply:')
str(sapply_result)

vapply_result <- vapply(data,mean,complex(1))
print('vapply:')
str(vapply_result)

# Calculate sum for each row like in the previous example

mapply_result <- mapply(sum,data$var1,data$var2,data$var3,data$var4)
print('mapply:')
str(mapply_result)
all.equal(mapply_result,rowsum_baser)

[1] "apply:"
 Named num [1:4] 0.1275 0.0797 -0.0508 -0.0492
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "lapply:"
List of 4
 $ var1: num 0.128
 $ var2: num 0.0797
 $ var3: num -0.0508
 $ var4: num -0.0492
[1] "sapply:"
 Named num [1:4] 0.1275 0.0797 -0.0508 -0.0492
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "vapply:"
 Named cplx [1:4] 0.1275+0i 0.0797+0i -0.0508+0i ...
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "mapply:"
 num [1:100] -2.037 3.253 -1.286 0.408 -1.113 ...
