
# BIOSTAT615 Lecture 1

## Introduction
This [Google Colab](https://research.google.com/colaboratory/) notebook contains example code and relevant notes for the [BIOSTAT615](https://sph.umich.edu/admissions/courses/course.php?courseID=BIOSTAT615) class from the [University of Michigan School of Public Health](https://sph.umich.edu/). The instructor recommends 

## Access Privilege and Copyrights
You can view this notebook only through [University of Michigan Google accounts](https://its.umich.edu/communication/collaboration/google). The instructor, Hyun Min Kang, retains all rights provided by copyright law (i.e. [“All rights reserved”](https://en.wikipedia.org/wiki/All_rights_reserved)). You may not reproduce, resell, distribute, publicly perform, create derivative works, translate, transmit, post, republish, exploit, copy, or otherwise use this content or any part of the work without the instructor's permission.

## 1. Efficiency matters - A simple example
Below is an example illustrating that slight differences in implementation details can make a big difference in computational speed.

In [None]:
## sample size is 20 million
n = 2e7

## system.time() evaluates computational time
## method 1 - using for loop 
system.time({
  x=seq(n) 
  for (i in 1:length(x)) {
    x[i]=x[i]^2
  }
})

## method 2 - using vector operation
system.time({
  x=seq(n)^2
})

## the above two routines produce the same output.
## can you guess which one would be faster? by how much?

   user  system elapsed 
  1.961   0.119   2.119 

   user  system elapsed 
  0.127   0.081   0.212 
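
To check that the two methods really compute the same values, a minimal sketch like the one below can help. It uses a much smaller `n` than the lecture example so that it finishes quickly, and the names `x_loop`, `x_vec`, and `same_output` are introduced here for illustration only.

```r
## smaller sample size so the check runs quickly (an assumption, not the lecture's n)
n <- 1e5

## method 1 - for loop, squaring one element at a time
x_loop <- seq(n)
for (i in seq_along(x_loop)) {
  x_loop[i] <- x_loop[i]^2
}

## method 2 - vectorized squaring
x_vec <- seq(n)^2

## compare the two results element by element (up to numerical tolerance)
same_output <- isTRUE(all.equal(x_loop, x_vec))
print(same_output)
```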

## 2. R storage - A puzzling example

In [1]:
## Before running this code, try to predict what will happen 
## s is 2 billion
s=2000000000

## first, print the value of s
print(s)

## second, print the value of s+s
print(s+s)

## what do you expect to be printed?

[1] 2e+09
[1] 4e+09


In [2]:
## Before running this code, try to predict what will happen 
## the only difference from above is the L suffix (what does it do?)
s=2000000000L

## run the same sequence of code as before.
print(s)
print(s+s)
## what do you expect to be printed?

[1] 2000000000


“NAs produced by integer overflow”


[1] NA


In [3]:
s=1000000000L

## run the same sequence of code as before.
print(s)
print(s+s)
## what do you expect to be printed?

[1] 1000000000
[1] 2000000000


In [4]:
## Let's figure out why the difference happens
## by comparing the storage mode of two assigned values

## In the first example, s was assigned a number without a suffix
s=2000000000

## Find out how the variable is stored
storage.mode(s)

In [5]:
## In the second example, s was assigned with suffix L
s=2000000000L

## Find out how the variable is stored.
storage.mode(s)

**Question:** What difference do you see between the two? How is the storage mode related to the warning messages?

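The storage modes are `'double'` and `'integer'`, respectively. The warning appears only in the integer case, because `s+s` exceeds the largest value that an integer can store.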
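
One way to sidestep the overflow, sketched here as an illustration rather than as a prescribed fix, is to convert the integer to double before adding. The names `overflowed` and `s_sum` are introduced for this example.

```r
s <- 2000000000L

## integer + integer stays integer, so the sum overflows to NA
## (the warning is suppressed here only to keep the output clean)
overflowed <- suppressWarnings(s + s)
print(is.na(overflowed))

## converting to double first avoids the overflow
s_sum <- as.double(s) + as.double(s)
print(storage.mode(s_sum))
print(s_sum == 4e9)
```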

## 3. R storage - limits and precisions

In [7]:
## This code simply displays various types of machine precision
noquote(format(.Machine))

               double.eps            double.neg.eps               double.xmin 
             2.220446e-16              1.110223e-16             2.225074e-308 
              double.xmax               double.base             double.digits 
            1.797693e+308                         2                        53 
          double.rounding              double.guard         double.ulp.digits 
                        5                         0                       -52 
    double.neg.ulp.digits           double.exponent            double.min.exp 
                      -53                        11                     -1022 
           double.max.exp               integer.max               sizeof.long 
                     1024                2147483647                         8 
          sizeof.longlong         sizeof.longdouble            sizeof.pointer 
                        8                        16                         8 
           longdouble.eps        longdouble.neg.eps 

**Q1:** Can you tell which value is relevant to the warning message before?

`integer.max`: the largest value that an integer can store (2147483647, i.e. $2^{31}-1$).

**Q2:** Can you explain what other variables represent?
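
As a partial hint for Q2, the sketch below probes two of the entries empirically. It uses only `.Machine` and base arithmetic; the variable name `eps` is introduced here, and the comments describe what the checks show rather than a full definition of each entry.

```r
eps <- .Machine$double.eps

## double.eps equals 2^-52, matching double.ulp.digits = -52
print(eps == 2^-52)

## adding eps to 1 gives a number different from 1 ...
print(1 + eps != 1)
## ... but adding half of eps is rounded back to exactly 1
print(1 + eps / 2 == 1)

## integer.max is 2^31 - 1, the largest 32-bit signed integer
print(.Machine$integer.max == 2^31 - 1)
```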

## 4. Useful facts on default storage mode

In [8]:
## what is the default storage mode for vectors?
x = c(1,2,3,4,5)
storage.mode(x)

In [9]:
## what is the default storage mode for integer sequences created with `:`?
x = 1:5
print(x)
storage.mode(x)

[1] 1 2 3 4 5


In [10]:
## what about the results of comparison operator?
x = 1:5 > 3
print(x) ## do you know what is contained in x?
storage.mode(x) 

[1] FALSE FALSE FALSE  TRUE  TRUE


In [11]:
## what about strings?
x = paste0(1:5,collapse=",")
print(x) ## do you know what is contained in x?
storage.mode(x)

[1] "1,2,3,4,5"


In [12]:
## what about list?
x = list(name="Hyun",score=95,grade="A")
print(x)
storage.mode(x)

$name
[1] "Hyun"

$score
[1] 95

$grade
[1] "A"



In [13]:
## what about data frame?
x = data.frame(
  names=c("Hyun","Mike","Bhramar"),
  scores=c(77,99,95),
  grades=c("B+","A+","A")
)
print(x)        ## do you know what is stored in x?
storage.mode(x) ## do you know what the storage mode of a data frame is?

    names scores grades
1    Hyun     77     B+
2    Mike     99     A+
3 Bhramar     95      A
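
The cells in this section can be condensed into a single lookup. The sketch below is an illustration (the list `examples` and its element names are made up here); it applies `storage.mode()` to one object of each kind so the answers can be compared side by side.

```r
## one small object of each kind discussed in this section
examples <- list(
  numeric_vector = c(1, 2, 3),
  integer_range  = 1:3,
  comparison     = 1:3 > 2,
  string         = paste0(1:3, collapse = ","),
  a_list         = list(name = "Hyun", score = 95),
  data_frame     = data.frame(score = c(77, 99))
)

## storage.mode() returns a single string for each object
modes <- vapply(examples, storage.mode, character(1))
print(modes)
```

Note that a data frame reports the same storage mode as a list, since it is stored as a list of columns.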


## 5. Floating point - Another puzzling example

In [14]:
## The value of x is one quadrillion
x = 1e15

## what do you expect to be printed? why?
print(x+1-x)

[1] 1


In [15]:
## The value of x is ten quadrillion
x = 1e16

## what do you expect to be printed? why?
print(x+1-x)

[1] 0


**Question:** Can you explain why you see a difference between the two examples?
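
A hint, offered as a sketch: `.Machine` reports `double.digits` as 53, so a double carries 53 significant binary digits, and consecutive integers are exactly representable only up to $2^{53}$ (about $9 \times 10^{15}$). The code below checks on which side of that limit each `x` falls; the name `limit` is introduced here.

```r
limit <- 2^.Machine$double.digits   # 2^53

## 1e15 is below the limit, 1e16 is above it
print(1e15 < limit)
print(1e16 > limit)

## just below the limit, adding 1 is still exact
print((limit - 1) + 1 == limit)
## at the limit, adding 1 is rounded away
print(limit + 1 == limit)
```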

## 6. Floating point - More puzzling examples

`pnorm(z, mu, s)` evaluates $\Pr(Z < z)$ when $Z \sim \mathcal{N}(\mu,s^2)$. Note that the third argument is the standard deviation $s$, not the variance.

Let's evaluate $\Pr(Z < -9)$ when $Z \sim \mathcal{N}(0,1)$

In [16]:
## pnorm() is the CDF of the normal distribution
## Calculate Pr(Z < -9) when Z ~ N(0,1)
pnorm(-9,0,1)

Now, let's evaluate $\Pr(Z > 9)$ when $Z \sim \mathcal{N}(0,1)$ instead.

We know that $\Pr(Z > 9) = 1 - \Pr(Z < 9)$.  

We also know that $\Pr(Z > 9) = \Pr(Z < -9)$ because the pdf is symmetric about zero.

In [19]:
pnorm(9,0,1)

In [17]:
## To calculate Pr(Z > 9) = 1 - Pr(Z < 9) 
1-pnorm(9,0,1)
## This value should be the same as Pr(Z < -9). Is it?

In [18]:
## Alternatively, we can calculate Pr(Z > 9) directly 
pnorm(9,0,1,lower.tail=FALSE)
## Is this value the same as Pr(Z < -9)?

Finally, let's evaluate $\Pr(Z < -40)$ when $Z \sim \mathcal{N}(0,1)$.

In [20]:
## To calculate Pr(Z < -40)
pnorm(-40,0,1)
## what did you expect to be printed?
## why do you think that you see the outcome below?

In [21]:
## Alternatively, you can calculate Pr(Z < -40) more precisely
pnorm(-40,0,1,log.p=TRUE)
## how do you interpret the result?
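
The comparisons in this section can be collected in one place. The sketch below is illustrative: the variable names are introduced here, and converting to base 10 is just one convenient way to read a log-probability.

```r
p_lower    <- pnorm(-9, 0, 1)                      # Pr(Z < -9)
p_subtract <- 1 - pnorm(9, 0, 1)                   # 1 - Pr(Z < 9)
p_upper    <- pnorm(9, 0, 1, lower.tail = FALSE)   # Pr(Z > 9) computed directly

## subtraction loses the tail entirely, because pnorm(9) rounds to exactly 1
print(p_subtract == 0)
## the direct upper tail agrees with the lower tail at -9 (relative comparison)
print(abs(p_upper - p_lower) <= 1e-12 * p_lower)

## Pr(Z < -40) underflows to 0 on the probability scale ...
print(pnorm(-40, 0, 1) == 0)
## ... but its natural logarithm is an ordinary finite number
log_p <- pnorm(-40, 0, 1, log.p = TRUE)
## dividing by log(10) gives the base-10 exponent of the probability
print(log_p / log(10))
```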

## 7. Memory allocation for different data types

In [22]:
## function print_type_size(): 
##   takes a variable and prints the storage mode and memory allocated.
print_type_size <- function(obj) {
  cat(paste0("R memory allocated for ", storage.mode(obj), 
             " of length ", length(obj), " is ", object.size(obj), " bytes.\n"))
}

In [23]:
## in each sample, we will allocate 1K elements
vec_size = 1024L

In [24]:
## what do you expect to be printed?
numeric_vec = rep(1.0,length=vec_size)
print_type_size(numeric_vec)

R memory allocated for double of length 1024 is 8240 bytes.


In [25]:
## what do you expect to be printed?
integer_vec = rep(1L,length=vec_size)
print_type_size(integer_vec)

R memory allocated for integer of length 1024 is 4144 bytes.


In [26]:
## what do you expect to be printed?
logical_vec = rep(TRUE,length=vec_size)
print_type_size(logical_vec)

R memory allocated for logical of length 1024 is 4144 bytes.


In [27]:
## what do you expect to be printed?
A_list = as.list(numeric_vec)
print_type_size(A_list)

R memory allocated for list of length 1024 is 65584 bytes.
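
The outputs above suggest a fixed per-object overhead plus a per-element cost. As a rough sketch (the helper `bytes_per_element()` is hypothetical, written for this note), the overhead can be cancelled by differencing two vectors of the same type but different lengths:

```r
## make_vec is a constructor such as numeric, integer, or logical
bytes_per_element <- function(make_vec, n = 1024L) {
  big   <- as.numeric(object.size(make_vec(2L * n)))
  small <- as.numeric(object.size(make_vec(n)))
  ## the fixed header cancels, leaving only the cost of n extra elements
  (big - small) / n
}

print(bytes_per_element(numeric))   # double
print(bytes_per_element(integer))
print(bytes_per_element(logical))
```

Lists behave differently: each element of `A_list` is itself an R object with its own header, which is consistent with the much larger total reported above.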


## 8. Binary representation of non-negative integers

In [28]:
## function print_binary_int16():
##   takes an integer value (0 to 65535) and prints it as a 16-bit binary string
## NOTE: You do not need to understand the details of this function.
print_binary_int16 <- function(strval) {
  val = as.integer(strval) 
  if ( is.na(val) || val < 0 || val > 65535L ){
    cat(paste0(strval, " is not a valid integer between 0 and 65535!\n"))
  } else {
    ## create a sequence of 2^15, ..., 2, 1
    bases = bitwShiftL(1L,seq(15L,0L,by=-1L))
    ## compute the binary value
    binary_val = ifelse(bitwAnd(val,bases)>0,1L,0L)
    ## print out the result
    cat(binary_val,"\n",sep="")
  }
}

In [31]:
# Obtain an integer from keyboard, and try the method
x <- readline("Type an integer: ")
print_binary_int16(x)

Type an integer: 35394
1000101001000010
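
As a sanity check, the binary string can be converted back to an integer. The sketch below uses base R's `strtoi()` and, for comparison, a manual sum of powers of two; the variable names are introduced here.

```r
binary_string <- "1000101001000010"

## strtoi() with base = 2 parses a string of 0s and 1s as a binary number
recovered <- strtoi(binary_string, base = 2L)
print(recovered)

## equivalent manual computation: sum of bit * 2^position
bits <- as.integer(strsplit(binary_string, "")[[1]])
manual <- sum(bits * 2^seq(length(bits) - 1, 0))
print(manual == recovered)
```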


In [32]:
## function print_binary_int32():
##   takes a 32-bit integer value (0 to 2147483647) and prints it as a binary string
## NOTE: You do not need to understand the details of this function.
print_binary_int32 <- function(strval) {
  val = as.integer(strval)
  if ( is.na(val) || val < 0){
    cat(paste0(strval, " is not a valid integer between 0 and 2147483647!\n"))
  } else {  
    ## use intToBits() function to convert integer to 32 bitwise arrays 
    cat(paste0(as.integer(rev(intToBits(val))),collapse=""))
    cat("\n")
  }
}

In [33]:
# Obtain an integer from keyboard, and try the method
x <- readline("Type an integer: ")
print_binary_int32(x)

Type an integer: 46494655
00000010110001010111001110111111


In [34]:
## function print_binary_float64():
##   takes a 64-bit numeric value and print as binary string
## NOTE: You do not need to understand the details of this function.
## See https://en.wikipedia.org/wiki/Double-precision_floating-point_format to understand more details
print_binary_float64 <- function(strval) {
  val = as.double(strval)
  if ( is.na(val) ){
    cat(paste0(strval, " is not a valid numeric value\n"))
  } else {  
    ## use numToBits() function to convert the double to a vector of 64 bits 
    arr = as.integer(rev(numToBits(val)))
    cat(paste0(arr[1],collapse=""))
    cat(" ")
    cat(paste0(arr[2:12],collapse=""))
    cat(" ")
    cat(paste0(arr[13:64],collapse=""))
    cat("\n")
  }
}

In [35]:
# Obtain a numeric value from the keyboard, and try the method
x <- readline("Type a numeric value: ")
print_binary_float64(x)

Type a numeric value: 5787677558798
0 10000101001 0101000011100011000111011100001000000011100000000000
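
The three groups printed above are the sign bit, the 11-bit exponent, and the 52-bit fraction. As a sketch of how they combine, the hypothetical helper `decode_float64()` below rebuilds the value as $(-1)^{\text{sign}} \times (1 + \text{fraction}) \times 2^{\text{exponent} - 1023}$. It assumes a normal number, so zeros, subnormals, `Inf`, and `NaN` are not handled.

```r
## decode_float64(): hypothetical helper, written for this note
##   takes the three bit strings and returns the double they encode
decode_float64 <- function(sign_bit, exponent_bits, fraction_bits) {
  ## the stored exponent is biased by 1023
  exponent <- strtoi(exponent_bits, base = 2L) - 1023L
  ## fraction bits have weights 2^-1, 2^-2, ..., 2^-52
  frac_digits <- as.integer(strsplit(fraction_bits, "")[[1]])
  fraction <- sum(frac_digits * 2^(-seq_along(frac_digits)))
  sign <- if (sign_bit == "1") -1 else 1
  ## normal numbers carry an implicit leading 1
  sign * (1 + fraction) * 2^exponent
}

value <- decode_float64(
  "0",
  "10000101001",
  "0101000011100011000111011100001000000011100000000000"
)
print(format(value, scientific = FALSE))
```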
