# MiCM Workshop Series - R Programming Beyond the Basics
## Efficient Programming and Computing
### - Yi Lian
### - August 13, 2019

## Outline
**_Morning_**
1. __An overview of efficiency__
    - General rules
    - R-specific rules
    - Time your program in R
2. __Efficient programming__
    - Powerful functions in R
        - ifelse(), cut() and split()
        - aggregate(), by(), apply() family
    - Write our own functions in R
        - function()
    - Examples and exercises
        - Categorization, conditional operations, etc.
        
**_Afternoon_**
3. __Efficient computing__
    - Parallel computing
        - Package 'parallel'
    - Integration with C++
        - Package 'Rcpp'
    - Integration with Fortran
    - Examples and exercises
        - Implement our own gradient descent function written in R, Rcpp or Fortran!

## 1. An overview of efficiency
#### Why?
- Era of big data/machine learning/AI
- Large sample size and/or high dimension

### 1.1 General rules
- Reading/writing data and other operations take time (CPU)
- Objects and operations take memory
    - http://adv-r.had.co.nz/memory.html
    - R will do "garbage collection" automatically when it needs more memory, which takes time
- Setups take time (overhead)
- Programming languages are different and are fast/slow at different things
- Efficient programming $\neq$ efficient computing
    - Shorter code does not necessarily lead to a shorter run time.
- __Avoid duplicated operations, especially expensive operations__
    - Matrix multiplications, inversions, etc.
- __Test your program__
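
To make the rule about duplicated expensive operations concrete, here is a minimal sketch. The matrix `A`, the list `b_list` and the sizes used are assumptions made up for illustration; they are not part of the workshop material. The first version recomputes the inverse for every right-hand side, while the second computes it once and reuses it.

```r
set.seed(1)
n <- 200
# Hypothetical test matrix: symmetric positive definite, hence invertible
A <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)
# Hypothetical list of 20 right-hand-side vectors
b_list <- replicate(20, rnorm(n), simplify = FALSE)

# Wasteful: solve(A) is recomputed for every element of b_list
t_slow <- system.time(
    x_slow <- lapply(b_list, function(b) solve(A) %*% b)
)

# Better: invert once, store the result, then reuse it
t_fast <- system.time({
    A_inv <- solve(A)
    x_fast <- lapply(b_list, function(b) A_inv %*% b)
})

# Both versions should give the same answers
all.equal(x_slow, x_fast)
t_slow; t_fast
```

The exact timings depend on the machine, but the second version performs a single inversion instead of twenty.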

### 1.2 R-specific rules
- R emphasizes flexibility but not speed
    - Very good for research
- R is designed to be better with vectorized operations than loops
- Without specific setups, R only uses 1 CPU core/thread
    - Setting up parallel (multicore) computing takes time (overhead)
- Use well-developed R functions and packages
    - Some of them have core computations written in other languages, e.g. C, C++, Fortran
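
As a small illustration of the last rule, the sketch below compares a hand-written loop over the rows of a matrix with the built-in `rowSums()`, whose core is implemented in compiled code. The matrix `m` and its dimensions are assumptions chosen only for this example.

```r
set.seed(1)
# Hypothetical data: 1000 rows, 200 columns
m <- matrix(rnorm(1000 * 200), nrow = 1000)

# Hand-written loop over rows, with the result vector preallocated
row_sums_loop <- numeric(nrow(m))
t_loop <- system.time(
    for (i in seq_len(nrow(m))) {
        row_sums_loop[i] <- sum(m[i, ])
    }
)

# Built-in function doing the same job
t_builtin <- system.time( row_sums_builtin <- rowSums(m) )

# The two results should agree up to floating-point tolerance
all.equal(row_sums_loop, row_sums_builtin)
t_loop; t_builtin
```

Timings will vary from run to run, so it is worth repeating the comparison on your own machine.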

In [1]:
# On macOS, CPU usage can go up to 100% when 1 core is used.
# Let's try to invert a 500 x 500 matrix.
A <- diag(500)
A.inv <- solve(A)

# optim() in R calls C code; type optim (without parentheses) to print its R source.
# optim

### 1.3 Time your program in R
- proc.time(), system.time()
- microbenchmark()

In [2]:
# Calculate the square root of integers 1 to 1,000,000 using two different operations:
# Vectorized operation
t <- system.time( x1 <- sqrt(1:1000000) )

# For loop
x2 <- rep(NA, 1000000)
t0 <- proc.time()
for (i in 1:1000000) {
    x2[i] <- sqrt(i)
}
t1 <- proc.time()

identical(x1, x2)

In [3]:
# As we can see, R is not very good with loops: in this run the loop takes about 6 times as long as the vectorized call.
t; t1 - t0

   user  system elapsed 
  0.006   0.004   0.011 

   user  system elapsed 
  0.067   0.002   0.069 
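
For comparison, the same computation can also be written with `sapply()` from the apply() family. This variant is not timed in the original notebook; it is added here only as a sketch, and the names `n`, `x3` and `t2` are assumptions introduced for the example.

```r
n <- 1000000

# Element-by-element call through sapply(): one R-level function call per integer
t2 <- system.time( x3 <- sapply(1:n, sqrt) )

# Check against the vectorized result
identical(x3, sqrt(1:n))
t2
```

Like the for loop, `sapply()` calls `sqrt()` once per element, so it should not be expected to match the vectorized call; time it on your own machine to see where it falls.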

In [4]:
library(microbenchmark)
result <- microbenchmark(sqrt(1:1000000),
                         for (i in 1:1000000) {x2[i] <- sqrt(i)},
                         unit = "s", times = 20
                        )
summary(result)
# Result in seconds

| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| `sqrt(1:1e+06)` | 0.004201712 | 0.004397159 | 0.006559282 | 0.00519702 | 0.008317069 | 0.01094512 | 20 |
| `for (i in 1:1e+06) { x2[i] <- sqrt(i) }` | 0.059184854 | 0.060054009 | 0.063860896 | 0.06349835 | 0.06613172 | 0.07619752 | 20 |


In [5]:
# Use well-developed R functions
result <- microbenchmark(sqrt(500),
                         500^0.5,
                         unit = "ns", times = 1000
                        )
summary(result)
# Result in nanoseconds

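| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| `sqrt(500)` | 77 | 88.5 | 101.775 | 93 | 99 | 2489 | 1000 |
| `500^0.5` | 153 | 163.0 | 180.67 | 168 | 174 | 3911 | 1000 |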


##### In summary, keep the rules in mind, test your program, time your program.
## 2. Efficient programming
R has many powerful and useful functions that we can use to achieve efficient programming and computing.