# MiCM Workshop Series - R Programming Beyond the Basics
## Efficient Coding and Computing
### - Yi Lian
### - August 13, 2019

## Outline
**_Morning_**
1. __An overview of efficiency__
    - General rules
    - R-specific rules
    - Time your program in R
2. __Efficient coding__
    - Powerful functions in R
        - aggregate(), by(), apply() family
        - ifelse(), cut() and split()
    - Write our own functions in R
        - function()
    - Examples and exercises
        - Categorization, conditional operations, etc.
        
**_Afternoon_**
3. __Efficient computing__
    - Parallel computing
        - Package 'parallel'
    - Integration with C++
        - Package 'Rcpp'
    - Integration with Fortran
    - Examples and exercises
        - Implement our own gradient descent function written in R, Rcpp or Fortran!
        
##### Important note! There are (too) MANY advanced and powerful packages that do different things. However, they will not be covered in this tutorial.
##### Here is a list of some awesome packages for data manipulation.
https://awesome-r.com/#awesome-r-data-manipulation

## 1. An overview of efficiency
#### Why?
- Era of big data/machine learning/AI
- Large sample size and/or high dimension

### 1.1 General rules
- Reading/writing data takes time (CPU)
    - Memory allocation and re-allocation
    - A loose (not entirely accurate) illustration
        - https://www.amazon.ca/Sandisk-Memory-Standard-Packaging-SDSDUNC-128G-GN6IN/dp/B0143IISD0/ref=sr_1_3?crid=36QB2Y3GKN3P6&keywords=sd+card+for+camera&qid=1561902726&s=electronics&sprefix=sd+card+for+camera%2Caps%2C128&sr=1-3
- All operations take time (CPU)
- Objects and operations take memory
    - http://adv-r.had.co.nz/memory.html
    - e.g. R will do "garbage collection" automatically when it needs more memory, which takes time
- Setups take time (overhead)
- Programming languages are different and are fast/slow at different things
- Efficient coding $\neq$ efficient computing
    - Shorter code does not necessarily mean shorter run time.
- __Avoid duplicated operations, especially expensive operations__
    - Matrix multiplication, inversion, etc.
    - Store the results that will be used later as objects.
- __Test your program__
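
As a minimal sketch of the "store results" rule above, the following compares recomputing a matrix inverse for every use with computing it once and reusing the stored object. The matrix `A` and the vectors `b1` and `b2` are made-up illustration data, not part of the workshop material.

```r
set.seed(1)
n <- 200
# Hypothetical symmetric positive definite matrix and right-hand sides
A  <- crossprod(matrix(rnorm(n * n), n, n)) + diag(n)
b1 <- rnorm(n)
b2 <- rnorm(n)

# Wasteful: the expensive inverse is recomputed for each right-hand side
x1_slow <- solve(A) %*% b1
x2_slow <- solve(A) %*% b2

# Better: compute the inverse once, store it, and reuse it
A_inv   <- solve(A)
x1_fast <- A_inv %*% b1
x2_fast <- A_inv %*% b2

# Both approaches give the same answers
all.equal(x1_slow, x1_fast)
all.equal(x2_slow, x2_fast)
```

Wrapping each variant in system.time() shows the difference in run time, which grows with the size of `A`.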

### 1.2 R-specific rules
- R emphasizes flexibility but not speed
    - Very good for research
- R is designed to be better with vectorized operations than loops
- By default, R uses only 1 CPU core/thread
    - Setting up parallel (multicore) computing takes time (overhead)
- Use well-developed R functions and packages
    - Some of them have core computations written in other languages, e.g. C, C++, Fortran

In [1]:
# On macOS, CPU usage can go up to 100% when 1 core is used.
# Let's try to invert a large matrix.
# A <- diag(5000)
# A.inv <- solve(A)

# optim() in R calls compiled C code; type optim (without parentheses) to print its R source.
# optim

### 1.3 Time your program in R
- proc.time(), system.time()
- microbenchmark()

##### Example
Calculate the square roots of the integers 1 to 1,000,000 in three different ways:

In [10]:
# Vectorized operation
t <- system.time( x1 <- sqrt(1:1000000) )

In [11]:
# We can do worse
# For loop with memory pre-allocation
x2 <- rep(NA, 1000000)
t0 <- proc.time()
for (i in 1:1000000) {
    x2[i] <- sqrt(i)
}
t1 <- proc.time()

identical(x1, x2)

In [12]:
# Even worse
# For loop without memory pre-allocation
x3 <- NULL
t2 <- proc.time()
for (i in 1:1000000) {
    x3[i] <- sqrt(i)
}
t3 <- proc.time()

identical(x2, x3)

In [5]:
# As we can see, R is not very good with loops.
t; t1 - t0; t3 - t2

   user  system elapsed 
  0.006   0.004   0.010 

   user  system elapsed 
  0.068   0.002   0.071 

   user  system elapsed 
  0.297   0.072   0.376 

In [6]:
library(microbenchmark)
result <- microbenchmark(sqrt(1:1000000),
                         for (i in 1:1000000) {x2[i] <- sqrt(i)},
                         unit = "s", times = 20
                        )
summary(result)
# Result in seconds

expr,min,lq,mean,median,uq,max,neval
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
sqrt(1:1e+06),0.004028415,0.004173729,0.009392475,0.007156471,0.009549327,0.04945029,20
for (i in 1:1e+06) { x2[i] <- sqrt(i) },0.058979275,0.06019696,0.069452704,0.062624584,0.065839076,0.15816191,20


In [7]:
# Use well-developed R functions
result <- microbenchmark(sqrt(500),
                         500^0.5,
                         unit = "ns", times = 1000
                        )
summary(result)
# Result in nanoseconds

expr,min,lq,mean,median,uq,max,neval
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
sqrt(500),78,88,94.092,92,97,755,1000
500^0.5,153,161,176.746,167,173,4191,1000


##### In summary, keep the rules in mind, test your program, time your program.
## 2. Efficient coding
R has many powerful and useful functions that we can use to achieve efficient coding and computing.
### Let's play with some data.

In [6]:
data <- read.csv("https://raw.githubusercontent.com/ly129/MiCM/master/sample.csv", header = TRUE)
head(data, 10)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>
1,Male,21.4,21.0,63.0,Never,180.0,19.0
2,Male,19.5,19.4,79.0,Never,165.0,18.083
3,Female,16.3,16.2,44.0,Regul,152.4,23.5
4,Female,15.9,16.5,99.0,Never,167.64,17.333
5,Male,19.3,19.4,55.0,Never,180.34,19.833
6,Male,18.5,18.5,48.0,Never,167.0,22.333
7,Female,17.5,17.0,85.0,Heavy,163.0,17.667
8,Male,19.8,20.0,NA,Never,180.0,17.417
9,Female,13.0,12.5,77.0,Never,165.0,18.167
10,Female,18.5,18.0,75.0,Never,173.0,18.25


In [7]:
summary(data)

       X              Sex         Wr.Hnd          NW.Hnd          Pulse       
 Min.   :  1.00   Female:47   Min.   :13.00   Min.   :12.50   Min.   : 40.00  
 1st Qu.: 25.75   Male  :53   1st Qu.:17.50   1st Qu.:17.45   1st Qu.: 50.25  
 Median : 50.50               Median :18.50   Median :18.50   Median : 71.50  
 Mean   : 50.50               Mean   :18.43   Mean   :18.39   Mean   : 69.90  
 3rd Qu.: 75.25               3rd Qu.:19.50   3rd Qu.:19.52   3rd Qu.: 84.75  
 Max.   :100.00               Max.   :23.20   Max.   :23.30   Max.   :104.00  
                                                              NA's   :6       
   Smoke        Height           Age       
 Heavy: 6   Min.   :152.0   Min.   :16.92  
 Never:79   1st Qu.:166.4   1st Qu.:17.58  
 Occas: 5   Median :170.2   Median :18.46  
 Regul:10   Mean   :171.8   Mean   :20.97  
            3rd Qu.:179.1   3rd Qu.:20.21  
            Max.   :200.0   Max.   :73.00  
            NA's   :13                     

#### a1. Calculate the mean writing hand span of all individuals
    mean(x, trim = 0, na.rm = FALSE, ...)

#### a2. Calculate the mean height of all individuals, exclude the missing values
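
A minimal sketch of how the `na.rm` argument in the signature of mean() behaves. The vector `height` is a hypothetical toy example, not the workshop data.

```r
# Hypothetical toy heights in cm; NA marks a missing value
height <- c(180, 165, NA, 167.64, 180.34)

mean_with_na    <- mean(height)                # a single NA makes the result NA
mean_without_na <- mean(height, na.rm = TRUE)  # NAs are dropped before averaging

mean_with_na
mean_without_na
```

With the real data, the same call would be applied to `data$Height`.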

#### a3. Calculate the mean of all continuous variables
    apply(X, MARGIN, FUN, ...)

In [8]:
cts <- data[ , c("Wr.Hnd", "NW.Hnd", "Pulse", "Height", "Age")]
apply(X = cts, MARGIN = 2, FUN = mean, na.rm = TRUE)
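
Note that apply() converts a data frame to a matrix before it loops over the columns. For column means, sapply() over the columns or colMeans() should give the same numbers. The sketch below checks this on a hypothetical toy data frame (`toy` is made up; only the column names are borrowed from the data set).

```r
# Hypothetical toy data with one missing value
toy <- data.frame(
  Wr.Hnd = c(21.4, 19.5, 16.3),
  Pulse  = c(63, 79, NA),
  Height = c(180, 165, 152.4)
)

m_apply  <- apply(X = toy, MARGIN = 2, FUN = mean, na.rm = TRUE)  # via matrix
m_sapply <- sapply(toy, mean, na.rm = TRUE)                       # column by column
m_col    <- colMeans(toy, na.rm = TRUE)                           # dedicated function

all.equal(m_apply, m_sapply)
all.equal(m_apply, m_col)
```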

#### b1. Calculate the count/proportion of males and females
    table(...,
      exclude = if (useNA == "no") c(NA, NaN),
      useNA = c("no", "ifany", "always"),
      dnn = list.names(...), deparse.level = 1)

    prop.table()
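
A small sketch of how table() and prop.table() fit together, including the effect of `useNA`. The vector `sex` is a hypothetical toy example, not the workshop data.

```r
# Hypothetical toy vector with one missing value
sex <- c("Male", "Male", "Female", "Female", "Male", NA)

counts <- table(sex)                     # NA is excluded by default (useNA = "no")
counts_na <- table(sex, useNA = "ifany") # adds a separate count for NA
props <- prop.table(counts)              # counts divided by their total

counts
counts_na
props
```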

#### b2. Calculate the count of each smoking status group

#### b3. Calculate the count of males and females in each smoking status group

In [9]:
table(data[, c("Sex", "Smoke")])

        Smoke
Sex      Heavy Never Occas Regul
  Female     3    40     3     1
  Male       3    39     2     9

In [10]:
table(data$Sex, data$Smoke)

        
         Heavy Never Occas Regul
  Female     3    40     3     1
  Male       3    39     2     9
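
For a two-way table, prop.table() takes a `margin` argument: `margin = 1` gives proportions within each row. The sketch below rebuilds the table printed above by hand (so it runs without the data file) and computes the smoking-status proportions within each sex.

```r
# Counts copied from the Sex-by-Smoke table shown above
tab <- as.table(matrix(c(3, 40, 3, 1,
                         3, 39, 2, 9),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Sex   = c("Female", "Male"),
                                       Smoke = c("Heavy", "Never", "Occas", "Regul"))))

row_prop <- prop.table(tab, margin = 1)  # each row sums to 1
round(row_prop, 3)

addmargins(tab)                          # adds row and column totals
```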

#### c1. Calculate the standard deviation of writing hand span of females

In [11]:
aggregate(Wr.Hnd~Sex, data = data, FUN = sd)

Sex,Wr.Hnd
<fct>,<dbl>
Female,1.519908
Male,1.712066


In [12]:
aggregate(data$Wr.Hnd, by = list(data$Sex), FUN = sd)

Group.1,x
<fct>,<dbl>
Female,1.519908
Male,1.712066


In [13]:
by(data = data$Wr.Hnd, INDICES = list(data$Sex), FUN = sd)

: Female
[1] 1.519908
------------------------------------------------------------ 
: Male
[1] 1.712066

In [14]:
tapply(X = data$Wr.Hnd, INDEX = list(data$Sex), FUN = sd)

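##### aggregate( ), by( ) and tapply( ) are all connected: each applies a function to the subsets of the data defined by one or more grouping factors, and they differ mainly in the format of their input and output.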
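
To make the connection concrete, here is a small sketch showing that the three functions return the same numbers in different containers. The data frame `toy` is a hypothetical example, not the workshop data.

```r
# Hypothetical toy data
toy <- data.frame(
  Sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Wr.Hnd = c(16.3, 15.9, 17.5, 21.4, 19.5, 19.3)
)

res_tapply <- tapply(toy$Wr.Hnd, toy$Sex, sd)               # named array
res_agg    <- aggregate(Wr.Hnd ~ Sex, data = toy, FUN = sd) # data frame
res_by     <- by(toy$Wr.Hnd, toy$Sex, sd)                   # object of class "by"

# Same values once the containers are stripped away
all.equal(as.numeric(res_tapply), res_agg$Wr.Hnd)
all.equal(as.numeric(res_by), res_agg$Wr.Hnd)
```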

#### c2. Calculate the standard deviation of writing hand span of all different gender-smoking groups

In [15]:
# FUN = length counts the observations in each group (sd() returns NA for a group of size 1)
aggregate(Wr.Hnd~Sex + Smoke, data = data, FUN = length)

Sex,Smoke,Wr.Hnd
<fct>,<fct>,<int>
Female,Heavy,3
Male,Heavy,3
Female,Never,40
Male,Never,39
Female,Occas,3
Male,Occas,2
Female,Regul,1
Male,Regul,9
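
A minimal sketch of the standard deviation by two grouping variables, using tapply() with a list of factors. The data frame `toy` is hypothetical. It is built so that one group has a single observation and one combination is empty, which both show up as NA.

```r
# Hypothetical toy data: "Male/Regul" has 1 observation, "Female/Regul" has none
toy <- data.frame(
  Sex    = c("Female", "Female", "Male", "Male", "Male"),
  Smoke  = c("Never", "Never", "Never", "Never", "Regul"),
  Wr.Hnd = c(16, 18, 19, 21, 20)
)

# Rows are Sex, columns are Smoke
sd_tab <- tapply(toy$Wr.Hnd, list(toy$Sex, toy$Smoke), sd)
n_tab  <- tapply(toy$Wr.Hnd, list(toy$Sex, toy$Smoke), length)

sd_tab
n_tab
```

Checking the group sizes alongside the standard deviations helps explain any NA in the result.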


#### d1. Categorize 'Age' - make a new binary variable 'Young'
    ifelse(test, yes, no)

In [16]:
old.age <- 20
data$Young <- ifelse(data$Age >= old.age, "No", "Yes")
head(data)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age,Young
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>,<chr>
1,Male,21.4,21.0,63,Never,180.0,19.0,Yes
2,Male,19.5,19.4,79,Never,165.0,18.083,Yes
3,Female,16.3,16.2,44,Regul,152.4,23.5,No
4,Female,15.9,16.5,99,Never,167.64,17.333,Yes
5,Male,19.3,19.4,55,Never,180.34,19.833,Yes
6,Male,18.5,18.5,48,Never,167.0,22.333,No


##### R has if (test) {opt1} else {opt2}. What is the advantage of ifelse( )?

In [17]:
if (data$Age >= old.age) {
    data$Young2 = "No"
} else {
    data$Young2 = "Yes"
}

Warning message: “the condition has length > 1 and only the first element will be used” (in R 4.2.0 and later, this is an error instead of a warning)

In [18]:
# Delete Young2 (removing by name is safer than by column position)
data$Young2 <- NULL

##### ifelse( ) is vectorized!!!
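
A short sketch of what "vectorized" means here: ifelse() evaluates the test for every element and returns one result per element, and a missing value in the test gives a missing result. The first three ages are taken from the data preview above; the trailing NA is a hypothetical missing value added for illustration.

```r
age <- c(19, 18.083, 23.5, NA)
old.age <- 20

# One test and one result per element
young <- ifelse(age >= old.age, "No", "Yes")
young
```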
#### d2. Categorize 'Wr.Hnd' into 5 groups - make a new categorical variable with 5 levels
    1. <= 16: Stephen Curry
    2. 16~18: Drake
    3. 18~20: Fred VanVleet
    4. 20~22: Jeremy Lin
    5. >  22: Kawhi Leonard

In [19]:
cut.points <- c(0, 16, 18, 20, 22, Inf)
data$Hnd.group <- cut(data$Wr.Hnd, breaks = cut.points, right = TRUE)
head(data)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age,Young,Hnd.group
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>,<chr>,<fct>
1,Male,21.4,21.0,63,Never,180.0,19.0,Yes,"(20,22]"
2,Male,19.5,19.4,79,Never,165.0,18.083,Yes,"(18,20]"
3,Female,16.3,16.2,44,Regul,152.4,23.5,No,"(16,18]"
4,Female,15.9,16.5,99,Never,167.64,17.333,Yes,"(0,16]"
5,Male,19.3,19.4,55,Never,180.34,19.833,Yes,"(18,20]"
6,Male,18.5,18.5,48,Never,167.0,22.333,No,"(18,20]"


In [20]:
data$Hnd.group <- cut(data$Wr.Hnd, breaks = cut.points, labels = FALSE, right = TRUE)
head(data)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age,Young,Hnd.group
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>,<chr>,<int>
1,Male,21.4,21.0,63,Never,180.0,19.0,Yes,4
2,Male,19.5,19.4,79,Never,165.0,18.083,Yes,3
3,Female,16.3,16.2,44,Regul,152.4,23.5,No,2
4,Female,15.9,16.5,99,Never,167.64,17.333,Yes,1
5,Male,19.3,19.4,55,Never,180.34,19.833,Yes,3
6,Male,18.5,18.5,48,Never,167.0,22.333,No,3


In [21]:
groups <- c("Curry", "Drake", "VanVleet", "Lin", "Leonard")
data$Hnd.group <- cut(data$Wr.Hnd, breaks = cut.points, labels = groups, right = TRUE)
head(data)

X,Sex,Wr.Hnd,NW.Hnd,Pulse,Smoke,Height,Age,Young,Hnd.group
<int>,<fct>,<dbl>,<dbl>,<int>,<fct>,<dbl>,<dbl>,<chr>,<fct>
1,Male,21.4,21.0,63,Never,180.0,19.0,Yes,Lin
2,Male,19.5,19.4,79,Never,165.0,18.083,Yes,VanVleet
3,Female,16.3,16.2,44,Regul,152.4,23.5,No,Drake
4,Female,15.9,16.5,99,Never,167.64,17.333,Yes,Curry
5,Male,19.3,19.4,55,Never,180.34,19.833,Yes,VanVleet
6,Male,18.5,18.5,48,Never,167.0,22.333,No,VanVleet
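
The `right` argument decides which end of each interval is closed, which matters for values that fall exactly on a cut point. A small sketch with the same `cut.points` and a hypothetical vector `x` of boundary values:

```r
cut.points <- c(0, 16, 18, 20, 22, Inf)
x <- c(16, 18, 22.5)  # hypothetical values, two of them exactly on a cut point

# right = TRUE: intervals are (a, b], so 16 belongs to (0,16]
g_right <- cut(x, breaks = cut.points, right = TRUE)
as.character(g_right)  # "(0,16]" "(16,18]" "(22,Inf]"

# right = FALSE: intervals are [a, b), so 16 belongs to [16,18)
g_left <- cut(x, breaks = cut.points, right = FALSE)
as.character(g_left)   # "[16,18)" "[18,20)" "[22,Inf)"
```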


- cut() -> split() -> list -> lapply()
- sapply()
- function()
- aggregate() and the *apply() family with our own function
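
One possible way to chain these pieces together, as a sketch only: cut() creates the groups, split() turns the values into a list with one element per group, and lapply() or sapply() applies a function to each element. The data frame `toy` and the function `spread` are hypothetical names introduced for this illustration.

```r
# Hypothetical toy data
toy <- data.frame(Wr.Hnd = c(15.9, 16.3, 19.5, 19.3, 18.5, 21.4))
cut.points <- c(0, 16, 18, 20, 22, Inf)
toy$Hnd.group <- cut(toy$Wr.Hnd, breaks = cut.points, right = TRUE)

# Our own function: the width of the range of a numeric vector
spread <- function(x) {
  max(x) - min(x)
}

# split() returns a list; drop = TRUE removes groups with no observations
pieces <- split(toy$Wr.Hnd, toy$Hnd.group, drop = TRUE)

res_list <- lapply(pieces, spread)  # list, one element per group
res_vec  <- sapply(pieces, spread)  # simplified to a named numeric vector

# aggregate() accepts the same user-defined function
res_agg <- aggregate(Wr.Hnd ~ Hnd.group, data = toy, FUN = spread)

res_vec
res_agg
```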