# Iteration in R

## Apply structures

### apply vs. for

There is a popular belief that for-loops are slow in R. This is not true as such, but a carelessly written loop can be slow, and loops tend to produce complicated code with a higher risk of a bad implementation. `apply`-functions are usually recommended as they fit better into R's functional programming paradigm.
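One common way to end up with a slow loop is to grow the result vector on every iteration instead of allocating it once. The following is a minimal sketch of the difference; the function names `squares_growing` and `squares_preallocated` are made up for this illustration, and the timings will vary from machine to machine.

```r
# Hypothetical helper names, used only for this illustration

squares_growing <- function(n) {
    # Anti-pattern: the result starts empty and is extended on every iteration,
    # so R has to copy the whole vector again and again
    result <- c()
    for (i in seq_len(n)) {
        result <- c(result, i^2)
    }
    result
}

squares_preallocated <- function(n) {
    # Allocate the full result vector once, then fill it in place
    result <- numeric(n)
    for (i in seq_len(n)) {
        result[i] <- i^2
    }
    result
}

# Both versions return the same values
all.equal(squares_growing(1000), squares_preallocated(1000))

# Rough timing comparison; the actual numbers depend on the machine
system.time(squares_growing(10000))
system.time(squares_preallocated(10000))
```

Both loops do the same arithmetic; only the handling of the result vector differs.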

As an example, let's compute the sum of each row of a random `data.frame` with four columns:

In [7]:
nrows <- 100
data <- data.frame(
    var1=rnorm(nrows),
    var2=rnorm(nrows),
    var3=rnorm(nrows),
    var4=rnorm(nrows)
    )

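Let's create three simple functions that each initialize a result vector `rowsum` and fill it in a for-loop. `rowSums_forloop_rows` and `rowSums_forloop_cols` go through the input `data.frame` one element at a time (in row-major and column-major order, respectively) and add each element of row `i` to `rowsum[i]`, while `rowSums_forloop_colnames` adds one whole column at a time to `rowsum`: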

In [8]:
library(pryr)

rowSums_forloop_rows <- function (dataframe) {
    # Sums up each row in a for-loop with row-major order
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    
    # Generate the vector rowsum and initialize its size
    rowsum <- vector('double',nrow(dataframe))
    
    # Not like this:
    # Generate the vector rowsum, but do not initialize its size
    # rowsum <- c()
    # seq_len() also behaves correctly for an input with zero rows or columns
    for (i in seq_len(nrow(dataframe))) {
        for (j in seq_len(ncol(dataframe))) {

            # Check memory address of rowsum to verify that it does not change
            # print(address(rowsum))

            # For each row and column add the value to rowsum[i]
            rowsum[i] <- rowsum[i] + dataframe[i,j]
        }
    }
    return(rowsum)
}

rowSums_forloop_cols <- function (dataframe) {
    # Sums up each row in a for-loop with column-major order
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    
    # Generate the vector rowsum and initialize its size
    rowsum <- vector('double',nrow(dataframe))
    
    for (j in seq_len(ncol(dataframe))) {
        for (i in seq_len(nrow(dataframe))) {
        
            # For each column and row add the value to rowsum
            rowsum[i] <- rowsum[i] + dataframe[i,j]
        }
    }
    return(rowsum)
}

rowSums_forloop_colnames <- function (dataframe) {
    # Sums up each row with one column at the time in a for-loop
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    
    # Generate the vector rowsum and initialize its size
    rowsum <- vector('double',nrow(dataframe))
    
    for (col in colnames(dataframe)) {
        # For each column add the values to rowsum
        rowsum <- rowsum + dataframe[,col]
    }
    return(rowsum)
}


rowsum_forloop_rows <- rowSums_forloop_rows(data)
rowsum_forloop_cols <- rowSums_forloop_cols(data)
rowsum_forloop_colnames <- rowSums_forloop_colnames(data)

head(rowsum_forloop_rows)
# Check that results match
all.equal(rowsum_forloop_rows,rowsum_forloop_cols)
all.equal(rowsum_forloop_rows,rowsum_forloop_colnames)

One of the previous examples utilizes R's vectorized nature properly, while the others are really bad implementations. Can you spot the good one?

Vectorized in this context means that if we're going to use the same function `FUN` on each row (column), we can hand `FUN` a whole row (column) at a time instead of calling it separately for every element. Time is saved because the overhead of calling `FUN` is paid once per row (column) rather than once per element.
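As a small illustration of the idea, the sketch below adds two short vectors first element by element and then with a single vectorized call. The vectors `x` and `y` are arbitrary values chosen for this example.

```r
x <- c(1, 2, 3, 4)
y <- c(10, 20, 30, 40)

# Element-by-element loop: `+` is called once for every element
sum_loop <- numeric(length(x))
for (i in seq_along(x)) {
    sum_loop[i] <- x[i] + y[i]
}

# Vectorized: a single call to `+` handles all elements at once
sum_vectorized <- x + y

# Both approaches give the same result
identical(sum_loop, sum_vectorized)
```

This is the same difference as between the element-wise loops and the column-wise loop in the functions above.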

To utilize the vectorized nature of R objects one usually uses `apply`-style functions.

The basic `apply`-function has a simple syntax (see: [[1]](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/apply)):

```r
apply(X, MARGIN, FUN)
```

Here `X` is an array such as a matrix (a `data.frame` is converted to a matrix first); `MARGIN` is an integer or a vector of integers that determines whether to apply the function on rows (`1`), columns (`2`) or both (`c(1,2)`); and `FUN` is the function to apply.

The dimensions of the result depend on the dimensions of the original data, the direction of application and the output shape of `FUN`. In this case we want to apply the `sum` function to each row. `sum` returns a single number, so the result is a vector of length `nrows`.
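To make the effect of `MARGIN` and of the output shape of `FUN` more concrete, here is a short sketch on a small matrix. The matrix `m` and the variable names are chosen only for this illustration.

```r
# Small 2 x 3 matrix used only for illustration:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

# MARGIN = 1: FUN is called once per row, sum returns one value per call
row_totals <- apply(m, 1, sum)    # 9 12

# MARGIN = 2: FUN is called once per column
col_totals <- apply(m, 2, sum)    # 3 7 11

# range returns two values per call, so the result becomes a matrix
# with one column for each row of m
row_ranges <- apply(m, 1, range)
dim(row_ranges)                   # 2 2

# MARGIN = c(1, 2): FUN is called for every single element, the shape is kept
doubled <- apply(m, c(1, 2), function(v) v * 2)
dim(doubled)                      # 2 3
```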

In [9]:
rowSums_apply <- function(dataframe) {
    # Sums up each row with an apply-function call
    # 
    # Args:
    #   dataframe: data.frame to sum over
    #
    # Returns:
    #   rowsum: Vector containing sum of each row
    # 
    
    # Apply sum-function to each row of dataframe and return the result
    rowsum <- apply(dataframe,1,sum)
    return(rowsum)
}
rowsum_apply <- rowSums_apply(data)

# Check that results match
all.equal(rowsum_forloop_cols,rowsum_apply)

Of course, a language like R has a built-in function for calculating the sum of each row. This function is `rowSums`.

In [10]:
rowsum_baser <- rowSums(data)

# Check that results match
all.equal(rowsum_apply, rowsum_baser)


Now that we have five implementations of the same function, we can use the `microbenchmark` package to make them compete against each other. The `microbenchmark` function runs each function call repeatedly and generates statistics of its runtime. By default it runs each expression 100 times.

In [11]:
library(microbenchmark)

print(microbenchmark(
    rowSums_forloop_rows(data),
    rowSums_forloop_cols(data),
    rowSums_forloop_colnames(data),
    rowSums_apply(data),
    rowSums(data)
))

Unit: microseconds
                           expr    min      lq      mean   median       uq     max neval
     rowSums_forloop_rows(data) 4375.1 9525.45 10878.433 10428.15 12590.00 18564.1   100
     rowSums_forloop_cols(data) 4081.6 8936.55 10880.931  9810.20 11602.10 82727.0   100
 rowSums_forloop_colnames(data)   35.3   48.75    70.483    56.25    65.40   949.3   100
            rowSums_apply(data)  164.0  228.05   761.330   256.00   327.65  5867.7   100
                  rowSums(data)   34.5   50.75   152.847    85.95   103.25  4746.7   100


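As is apparent from the runtimes, the bad for-loop implementations were the ones that went through the data one element at a time: their median runtimes were around 10 milliseconds. The loop that added one column at a time to the sum had a median of about 56 microseconds, which was faster than the `apply`-function (256 microseconds) and on par with the built-in `rowSums` in this run.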

As a conclusion: the main reason to use `apply`-functions is *convenience* and efficiency through *minimizing risks*. With `apply`-functions you save code as you don't need to keep track of indices.

However, do remember that `apply`-functions are meant to be used in cases where you can split the input into independent chunks.
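As a counterexample, the sketch below shows a calculation where each step needs the result of the previous step, so the input cannot be split into independent chunks. The update rule and the variable names are hypothetical and chosen only for illustration. A for-loop is a natural fit here; base R's `Reduce` with `accumulate = TRUE` expresses the same dependency in a functional style.

```r
# Hypothetical example: each value depends on the previous result
increments <- c(1, 2, 3, 4)

running <- numeric(length(increments))
running[1] <- increments[1]
for (i in seq_along(increments)[-1]) {
    # Step i needs the result of step i - 1, so the iterations are not independent
    running[i] <- 0.5 * running[i - 1] + increments[i]
}
running    # 1.000 2.500 4.250 6.125

# The same dependency written with Reduce, which carries the previous result along
running_reduce <- Reduce(
    function(previous, increment) 0.5 * previous + increment,
    increments,
    accumulate = TRUE
)
all.equal(running, running_reduce)
```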

## Other *apply-functions

### lapply

`lapply` is similar to `apply`, but it operates on the elements of a list or vector (for a `data.frame` these are the columns) and it always returns a list as its output. The call for this function is

`lapply(X, FUN)`
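A minimal sketch of `lapply` on a plain list and on a `data.frame`; the objects `measurements` and `df` are made up for this illustration.

```r
# lapply works on any list or vector, not only on data.frames
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# One call of length() per list element, the result is again a list
lengths_list <- lapply(measurements, length)
str(lengths_list)

# A data.frame is a list of columns, so lapply visits one column at a time
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
column_means <- lapply(df, mean)
str(column_means)
```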

### sapply

`sapply` works like `lapply`, but it tries to simplify the result to the simplest data structure available (a vector or a matrix), falling back to a list when that is not possible. The call for this function is

`sapply(X, FUN)`
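The shape of the `sapply` result depends on what `FUN` returns. The sketch below shows three cases on a small made-up list called `values`.

```r
values <- list(a = 1:3, b = 4:6)

# Every call returns one number, so the result is a named vector
totals <- sapply(values, sum)
str(totals)

# Every call returns two numbers, so the result is a matrix
# with one column per list element
ranges <- sapply(values, range)
dim(ranges)

# The calls return results of different lengths, so no simplification
# is possible and the result stays a list, as with lapply
large_values <- sapply(values, function(v) v[v > 2])
str(large_values)
```

Because the output type can change with the data, `sapply` is convenient for interactive work but less predictable inside functions.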

### vapply

`vapply` works like `sapply`, but it verifies that the result of each call of `FUN` has the type and length of the template vector given in `FUN.VALUE`. The call for this function is

`vapply(X, FUN, FUN.VALUE)`
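The following sketch shows the check in action on a small made-up list called `values`. The error is caught with `tryCatch` only so that the example runs to the end; the replacement message is an arbitrary string.

```r
values <- list(a = 1:3, b = 4:6)

# Each result must be a single double; mean() satisfies this
means <- vapply(values, mean, FUN.VALUE = numeric(1))
means

# range() returns two values, which violates FUN.VALUE = numeric(1),
# so vapply stops with an error instead of silently changing the output shape
result <- tryCatch(
    vapply(values, range, FUN.VALUE = numeric(1)),
    error = function(e) "vapply raised an error"
)
result
```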

### mapply

`mapply`, or multivariate apply, takes the first argument from one vector or list, the second from another, and so on, and calls a function `FUN` with these arguments. The call for this function is

`mapply(FUN, ...)`

where `...` are the vectors or lists to iterate over in parallel, given as positional arguments.
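A short sketch of how `mapply` pairs up its arguments; the vectors `weights` and `heights` are made-up values for this illustration.

```r
# Calls rep(1, 3), rep(2, 2) and rep(3, 1); the results have different
# lengths, so mapply returns them as a list
repeated <- mapply(rep, 1:3, 3:1)
str(repeated)

# Element-wise function of two vectors; every call returns one number,
# so the result is simplified to a vector
weights <- c(0.5, 1.0, 2.0)
heights <- c(2, 4, 6)
products <- mapply(function(w, h) w * h, weights, heights)
products    # 1 4 12
```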

In [12]:
# Calculate mean of each column

apply_result <- apply(data,2,mean)
print('apply:')
str(apply_result)

lapply_result <- lapply(data,mean)
print('lapply:')
str(lapply_result)

sapply_result <- sapply(data,mean)
print('sapply:')
str(sapply_result)

vapply_result <- vapply(data,mean,complex(1))
print('vapply:')
str(vapply_result)

# Calculate sum for each row like in the previous example

mapply_result <- mapply(sum,data$var1,data$var2,data$var3,data$var4)
print('mapply:')
str(mapply_result)
all.equal(mapply_result,rowsum_baser)

[1] "apply:"
 Named num [1:4] 0.0985 -0.0667 0.0337 -0.152
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "lapply:"
List of 4
 $ var1: num 0.0985
 $ var2: num -0.0667
 $ var3: num 0.0337
 $ var4: num -0.152
[1] "sapply:"
 Named num [1:4] 0.0985 -0.0667 0.0337 -0.152
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "vapply:"
 Named cplx [1:4] 0.0985+0i -0.0667+0i 0.0337+0i ...
 - attr(*, "names")= chr [1:4] "var1" "var2" "var3" "var4"
[1] "mapply:"
 num [1:100] 4.358 -0.229 -3.684 -0.887 -0.929 ...
