# MiCM Workshop Series - R Programming Beyond the Basics
## Efficient Programming and Computing
### - Yi Lian
### - August 13, 2019

## Outline
**_Morning_**
1. __An overview of efficiency__
    - General rules
    - R-specific rules
    - Time your program in R
2. __Efficient programming__
    - Powerful functions in R
        - ifelse(), cut() and split()
        - aggregate(), by(), apply() family
    - Write our own functions in R
        - function()
    - Examples and exercises
        - Categorization, conditional operations, etc.
        
**_Afternoon_**
3. __Efficient computing__
    - Parallel computing
        - Package 'parallel'
    - Integration with C++
        - Package 'Rcpp'
    - Integration with Fortran
    - Examples and exercises
        - Implement our own gradient descent function written in R, Rcpp or Fortran!
        
##### Important note! There are many advanced and powerful packages that do different things. However, they will not be covered in this tutorial.
##### Here is a list of some awesome packages for data manipulation.
https://awesome-r.com/#awesome-r-data-manipulation

## 1. An overview of efficiency
#### Why?
- Era of big data/machine learning/AI
- Large sample size and/or high dimension

### 1.1 General rules
- Reading/writing data and other operations take time (CPU)
- Objects and operations take memory
    - http://adv-r.had.co.nz/memory.html
    - R will do "garbage collection" automatically when it needs more memory, which takes time
- Setups take time (overhead)
- Programming languages are different and are fast/slow at different things
- Efficient programming $\neq$ efficient computing
    - Shorter code does not necessarily lead to shorter run time.
- __Avoid duplicated operations, especially expensive operations__
    - Matrix multiplication, inversion, etc.
- __Test your program__
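
As a small illustration of avoiding duplicated expensive operations, the sketch below fits least-squares coefficients for several response columns. The matrices `X` and `Y` are made-up toy data, not part of the workshop material. The first version rebuilds and inverts the cross product for every column; the second computes it once and reuses it.

```r
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200)   # toy design matrix (hypothetical data)
Y <- matrix(rnorm(200 * 3), nrow = 200)   # three toy response columns

# Wasteful: t(X) %*% X is rebuilt and inverted for every response column
beta_slow <- sapply(1:ncol(Y), function(j) {
    solve(t(X) %*% X) %*% t(X) %*% Y[, j]
})

# Better: compute the cross product once and solve for all columns together
XtX <- crossprod(X)                       # same as t(X) %*% X
beta_fast <- solve(XtX, crossprod(X, Y))

# Both approaches give the same coefficients (up to numerical tolerance)
all.equal(beta_slow, beta_fast)
```

Wrapping each version in `system.time()` is an easy way to check whether the saving matters for a given problem size.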

### 1.2 R-specific rules
- R emphasizes flexibility over speed
    - Very good for research
- R is designed to be better with vectorized operations than loops
- Without specific setups, R only uses 1 CPU core/thread
    - Setting up parallel (multicore) computing takes time (overhead)
- Use well-developed R functions and packages
    - Some of them have core computations written in other languages, e.g. C, C++, Fortran
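
To see how many cores are available before setting up any parallel computing, one option is `detectCores()` from the 'parallel' package that ships with R. This is only a quick check; the value reported depends on the platform.

```r
library(parallel)

# Number of logical cores reported by the operating system
# (can be NA on platforms where it cannot be determined)
n_cores <- detectCores()
n_cores
```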

In [1]:
# On macOS, CPU usage can go up to 100% when 1 core is used.
# Let's try to invert a large matrix.
# A <- diag(5000)
# A.inv <- solve(A)

# optim() in R calls compiled C code. Type optim (without parentheses) to see its source.
# optim

### 1.3 Time your program in R
- proc.time(), system.time()
- microbenchmark()

In [2]:
# Calculate the square root of integers 1 to 1,000,000 using two different operations:
# Vectorized operation
t <- system.time( x1 <- sqrt(1:1000000) )

# For loop
x2 <- rep(NA, 1000000)
t0 <- proc.time()
for (i in 1:1000000) {
    x2[i] <- sqrt(i)
}
t1 <- proc.time()

identical(x1, x2)

In [3]:
# As we can see, the for loop is several times slower than the vectorized call.
t; t1 - t0

   user  system elapsed 
  0.006   0.004   0.011 

   user  system elapsed 
  0.067   0.002   0.069 

In [4]:
library(microbenchmark)
result <- microbenchmark(sqrt(1:1000000),
                         for (i in 1:1000000) {x2[i] <- sqrt(i)},
                         unit = "s", times = 20
                        )
summary(result)
# Result in seconds

expr,min,lq,mean,median,uq,max,neval
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
sqrt(1:1e+06),0.004201712,0.004397159,0.006559282,0.00519702,0.008317069,0.01094512,20
for (i in 1:1e+06) { x2[i] <- sqrt(i) },0.059184854,0.060054009,0.063860896,0.06349835,0.06613172,0.07619752,20


In [5]:
# Use well-developed R functions
result <- microbenchmark(sqrt(500),
                         500^0.5,
                         unit = "ns", times = 1000
                        )
summary(result)
# Result in nanoseconds

expr,min,lq,mean,median,uq,max,neval
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
sqrt(500),77,88.5,101.775,93,99,2489,1000
500^0.5,153,163.0,180.67,168,174,3911,1000


##### In summary, keep the rules in mind, test your program, time your program.
## 2. Efficient programming
R has many powerful and useful functions that we can use to achieve efficient programming and computing.
##### Let's play with some data.

In [40]:
data <- read.csv("https://raw.githubusercontent.com/ly129/MiCM/master/sample.csv", header = TRUE)
head(data, 10)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>
1,Male,21.4,21.0,63.0,Never,180.0,19.0
2,Male,19.5,19.4,79.0,Never,165.0,18.083
3,Female,16.3,16.2,44.0,Regul,152.4,23.5
4,Female,15.9,16.5,99.0,Never,167.64,17.333
5,Male,19.3,19.4,55.0,Never,180.34,19.833
6,Male,18.5,18.5,48.0,Never,167.0,22.333
7,Female,17.5,17.0,85.0,Heavy,163.0,17.667
8,Male,19.8,20.0,,Never,180.0,17.417
9,Female,13.0,12.5,77.0,Never,165.0,18.167
10,Female,18.5,18.0,75.0,Never,173.0,18.25


In [41]:
summary(data)

       X              Sex         Wr.Hnd          NW.Hnd          Pulse       
 Min.   :  1.00   Female:47   Min.   :13.00   Min.   :12.50   Min.   : 40.00  
 1st Qu.: 25.75   Male  :53   1st Qu.:17.50   1st Qu.:17.45   1st Qu.: 50.25  
 Median : 50.50               Median :18.50   Median :18.50   Median : 71.50  
 Mean   : 50.50               Mean   :18.43   Mean   :18.39   Mean   : 69.90  
 3rd Qu.: 75.25               3rd Qu.:19.50   3rd Qu.:19.52   3rd Qu.: 84.75  
 Max.   :100.00               Max.   :23.20   Max.   :23.30   Max.   :104.00  
                                                              NA's   :6       
   Smoke        Height           Age       
 Heavy: 6   Min.   :152.0   Min.   :16.92  
 Never:79   1st Qu.:166.4   1st Qu.:17.58  
 Occas: 5   Median :170.2   Median :18.46  
 Regul:10   Mean   :171.8   Mean   :20.97  
            3rd Qu.:179.1   3rd Qu.:20.21  
            Max.   :200.0   Max.   :73.00  
            NA's   :13                     

#### a. Calculate the mean writing hand span of all individuals
    mean(x, trim = 0, na.rm = FALSE, ...)

#### b. Calculate the mean height of all individuals, exclude the missing values
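
One possible approach to exercises a and b, sketched on a tiny made-up data frame `toy` that only borrows the column names `Wr.Hnd` and `Height` from the workshop data. The point is the effect of `na.rm`.

```r
# Toy data with the same column names as the workshop data (values are made up)
toy <- data.frame(Wr.Hnd = c(18.5, 17.0, 20.0, 19.5),
                  Height = c(170, NA, 180, 165))

# a. No missing values, so the default na.rm = FALSE is fine
mean_span <- mean(toy$Wr.Hnd)

# b. With a missing value the default returns NA ...
mean_height_na <- mean(toy$Height)

# ... so drop the missing values before averaging
mean_height <- mean(toy$Height, na.rm = TRUE)
```

The same calls should work on the workshop data by replacing `toy` with `data`.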

#### c. Calculate the count/proportion of males and females
    table(...,
      exclude = if (useNA == "no") c(NA, NaN),
      useNA = c("no", "ifany", "always"),
      dnn = list.names(...), deparse.level = 1)

    prop.table()
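
A minimal sketch of how `table()` and `prop.table()` fit together, using a made-up vector `sex` rather than the workshop data:

```r
# Toy vector (made-up values)
sex <- c("Male", "Female", "Female", "Male", "Female")

# Counts per category
sex_counts <- table(sex)

# Proportions: prop.table() divides each count by the total
sex_props <- prop.table(sex_counts)

sex_counts
sex_props
```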

#### d. Calculate the count of each smoking status group

#### e. Calculate the count of males and females in each smoking status group
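
For exercises d and e, passing two factors to `table()` gives a two-way contingency table. The sketch below uses a made-up data frame `toy` with the column names `Sex` and `Smoke` taken from the workshop data.

```r
# Toy data (made-up values)
toy <- data.frame(Sex   = c("Male", "Female", "Female", "Male", "Female"),
                  Smoke = c("Never", "Never", "Regul", "Heavy", "Never"))

# d. Count of each smoking status group
smoke_counts <- table(toy$Smoke)

# e. Count of males and females within each smoking status group
sex_smoke <- table(toy$Sex, toy$Smoke)

# Column-wise proportions: share of each sex within a smoking group
sex_smoke_props <- prop.table(sex_smoke, margin = 2)

sex_smoke
```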

#### f. Calculate the standard deviation of writing hand span of females

#### g. Calculate the standard deviation of writing hand span of all different gender-smoking groups
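
Exercises f and g can be approached with logical subsetting and with grouped summaries. The sketch below uses `tapply()` and `aggregate()` on a made-up data frame `toy`; it is one option among several, not necessarily the solution intended by the workshop.

```r
# Toy data (made-up values) with the workshop column names
toy <- data.frame(
    Sex    = c("Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"),
    Smoke  = c("Never", "Never", "Regul", "Regul", "Never", "Never", "Regul", "Regul"),
    Wr.Hnd = c(17, 18, 16, 18, 20, 21, 19, 22)
)

# f. Standard deviation for females only, via logical subsetting
sd_female <- sd(toy$Wr.Hnd[toy$Sex == "Female"])

# g. Standard deviation in every Sex-by-Smoke group
# tapply() returns a matrix with one cell per group
sd_matrix <- tapply(toy$Wr.Hnd, list(toy$Sex, toy$Smoke), sd)

# aggregate() returns a data frame with one row per group
sd_groups <- aggregate(Wr.Hnd ~ Sex + Smoke, data = toy, FUN = sd)

sd_matrix
sd_groups
```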

In [50]:
by