# Apply structures

The purpose of this lesson is to familiarize you with coding with apply structures in R.

It is well known in the R programming community that for-loops are slow in R. As an example, let's compute the sum of each row of a random `data.frame` with four columns:

In [149]:
nrows <- 100
data <- data.frame(
    var1=rnorm(nrows),
    var2=rnorm(nrows),
    var3=rnorm(nrows),
    var4=rnorm(nrows)
    )

Let's create a simple function `rowSums_forloop` that initializes a result vector `rowsum`, goes through the rows of the input `data.frame` in a for-loop, and stores the sum of row `i` in `rowsum[i]`:

In [160]:
rowSums_forloop <- function (dataframe) {
    # Sums up each row with inefficient for-loop
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    
    # Start with an empty result vector; it grows by one element per iteration
    rowsum <- c()
    for (i in seq_len(nrow(dataframe))) {
        # For each row store the sum of row i to rowsum[i]
        rowsum[i] <- sum(dataframe[i,])
    }
    return(rowsum)
}

rowsum_forloop <- rowSums_forloop(data)
head(rowsum_forloop)

The previous way is not optimal as it does not utilize the vectorized nature of R.

Vectorized in this context means that, because each column of `data` is a vector, `data[i,]` and `data[i+1,]` do not differ in structure. The operation carried out on these rows (here the `sum` function) can therefore often be applied to many rows together in a single call, instead of one row at a time in an interpreted loop.
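
As a small illustration of what vectorization means in practice, the sketch below adds two numeric vectors element by element, first with an explicit loop and then with a single vectorized `+`. The names `x`, `y`, `loop_sum` and `vec_sum` are made up for this example and are not used elsewhere in the lesson.

```r
# Two example vectors (hypothetical data for illustration only)
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Element-by-element loop: one interpreted iteration per element
loop_sum <- numeric(length(x))
for (i in seq_along(x)) {
    loop_sum[i] <- x[i] + y[i]
}

# Vectorized: a single call that operates on the whole vectors
vec_sum <- x + y

print(vec_sum)                       # 11 22 33 44
print(identical(loop_sum, vec_sum))  # TRUE
```

Both versions produce the same numbers; the vectorized form simply hands the whole operation to R in one call.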

To utilize the vectorized nature of R objects one usually uses `apply`-style functions.

The basic `apply`-function has a simple syntax (see: [[1]](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/apply)):

```r
apply(array, margin, func)
```

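Here `array` is the data to operate on (a matrix, a higher-dimensional array, or a `data.frame`, which `apply` first converts to a matrix); `margin` is an integer or integer vector that determines whether the function is applied over rows (`1`), columns (`2`), or both, that is, to each individual element (`c(1,2)`); and `func` is the function to apply.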
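
The effect of `margin` is easiest to see on a tiny matrix. The following sketch uses a made-up 2 x 3 matrix `m` (not part of the lesson data) and applies a function over rows, over columns, and over both margins.

```r
# Small example matrix with 2 rows and 3 columns (hypothetical data)
m <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_totals <- apply(m, 1, sum)   # one value per row:    9 12
col_totals <- apply(m, 2, sum)   # one value per column: 3 7 11

# Margin c(1, 2) calls the function once per cell, keeping the 2 x 3 shape
times_ten <- apply(m, c(1, 2), function(x) x * 10)

print(row_totals)
print(col_totals)
print(times_ten)
```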

The dimensions of the result depend on the dimensions of the original data, the direction of application and the output shape of `func`. In this case we want to apply the `sum` function to each row. `sum` returns a single number, so the result is a vector of length `nrows`.
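
To illustrate how the output shape of `func` changes the result, the sketch below compares `sum` (one value per row) with `range` (two values per row) on a made-up 3 x 2 matrix `m`. When the function returns more than one value, `apply` stores each call's result as one column of the output.

```r
# Example matrix with 3 rows and 2 columns (hypothetical data)
m <- matrix(1:6, nrow = 3)

# sum returns one number per row, so the result is a plain vector
sums <- apply(m, 1, sum)
print(length(sums))   # 3

# range returns two numbers (min, max) per row, so the result is a matrix
# with one column per row of m
ranges <- apply(m, 1, range)
print(dim(ranges))    # 2 3
print(ranges)
```

Note that the row-wise results end up in columns, so `t(ranges)` is needed if one row of output per row of input is preferred.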

In [161]:
rowSums_apply <- function(dataframe) {
    # Sums up each row with an apply-function call
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    # 
    
    # Apply sum-function to each row of dataframe and return the result
    rowsum <- apply(dataframe,1,sum)
    return(rowsum)
}
rowsum_apply <- rowSums_apply(data)

# Check that results match
head(rowsum_apply)
all(rowsum_forloop == rowsum_apply)

Of course, a language like R has a built-in function for calculating the sum across the columns of each row. This function is `rowSums`.

In [162]:
rowsum_baser <- rowSums(data)

# Check that results match
head(rowsum_baser)
all(rowsum_apply == rowsum_baser)
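
The checks above compare floating-point sums with `==`, which requires exact equality. A slightly more forgiving sketch, using two made-up vectors `a` and `b` as stand-ins for any pair of numeric results, compares within a small tolerance using base R's `all.equal`.

```r
# Stand-in vectors (hypothetical): equal mathematically, but not bit for bit
a <- c(0.1 + 0.2, 1)
b <- c(0.3, 1)

exact_match <- all(a == b)              # FALSE because of rounding error
close_match <- isTRUE(all.equal(a, b))  # TRUE within the default tolerance

print(exact_match)
print(close_match)
```

`all.equal` returns a description of the difference rather than `FALSE` when the values differ, which is why it is wrapped in `isTRUE`.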


Now that we have three implementations of the same function, we can use the `microbenchmark` package to compare them. The `microbenchmark` function runs each expression repeatedly and reports statistics on its runtime. By default it evaluates each expression 100 times.

In [165]:
library(microbenchmark)

print(microbenchmark(
    rowSums_forloop(data),
    rowSums_apply(data),
    rowSums(data)
))

Unit: microseconds
                  expr       min        lq       mean     median         uq       max neval
 rowSums_forloop(data) 11957.489 12667.716 13769.8453 13247.0690 14312.6990 43767.887   100
   rowSums_apply(data)   257.033   280.026   331.8649   291.9725   307.8665  1783.261   100
         rowSums(data)    51.508    57.499    63.3132    62.2505    65.4340    94.473   100


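As the runtimes show, the for-loop implementation loses by a huge margin: its median runtime is roughly 45 times that of the `apply` version and more than 200 times that of `rowSums`. In general it is good to avoid for-loops unless the iterations must run in sequence, e.g. in an iterative algorithm where each step depends on the result of the previous one.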
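
As an example of a loop that is inherently sequential, the sketch below approximates the square root of 2 with Newton's method. The algorithm, the starting guess and the variable names are chosen purely for illustration; the point is only that each estimate needs the previous one, so the steps cannot be handed to an `apply`-style call as independent pieces.

```r
# Newton's method for sqrt(2): each step needs the previous estimate,
# so the iterations cannot be computed independently of each other.
n_steps <- 6
estimates <- numeric(n_steps)   # preallocate instead of growing the vector
estimates[1] <- 1               # starting guess (arbitrary choice)

for (i in 2:n_steps) {
    prev <- estimates[i - 1]
    estimates[i] <- (prev + 2 / prev) / 2
}

print(estimates)
print(abs(estimates[n_steps] - sqrt(2)) < 1e-12)  # TRUE
```

Preallocating `estimates` with `numeric(n_steps)` keeps the loop itself cheap, even though it still has to run in order.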