# Appendix B Programming Basics

## B.1 Basic objects in `R`

### Vectors

The most basic element of statistics is a vector of numeric data. To create a vector in `R`, we use the following syntax.



In [13]:
v <- c(1, 2, 3, 4, 5)
y = c(1,2,3,4,5)
print(v)
print(y)

[1] 1 2 3 4 5
[1] 1 2 3 4 5


This creates a variable named `v` and stores in it the vector of numbers `(1, 2, 3, 4, 5)`. Some things to note:

-  The left arrow `<-` is an operator used to assign values to objects. The equal sign `=` also works in `R`.  

- `c()` is a function that "combines values into a vector or list" (see [this help file](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/c) or type `?c` in the code cell). 

We can create a longer vector by combining two vectors. 

In [14]:
w <- 6:10

# seq(from=6,to=10,by=...)

x <- c(v,w)

print(x)

# Add more elements...
x <- c(x,100)
print(x)

 [1]  1  2  3  4  5  6  7  8  9 10
 [1]   1   2   3   4   5   6   7   8   9  10 100


Almost all operations in `R` are "vectorized", meaning that they operate on a whole vector, entry by entry. As an example, to multiply each element of `x` by the number 2, we simply type the following. 

In [3]:
y <- 2*x
print(y)
# To add up all elements in `y`, we write 
sum(y)
# To calculate the mean of y
mean(y)
# or 
sum(y)/length(y)

 [1]   2   4   6   8  10  12  14  16  18  20 200


Numeric vectors hold integers or doubles. By default, if we enter a number in `R` it is stored as a double. To explicitly store integers, we can use the `as.integer()` function.

In [4]:
typeof( 1 )

typeof(as.integer(1))
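
As a small illustration of the integer/double distinction, the sketch below uses the `L` suffix, which is another way to write an integer in `R`. The variable names `a` and `b` are made up for this example.

```r
# A bare number is stored as a double; the suffix L marks an integer.
a <- 1      # double
b <- 1L     # integer
typeof(a)
typeof(b)

# Arithmetic that mixes the two types gives a double.
typeof(a + b)

# The two values still compare as equal.
a == b
```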

It is possible to assign names to each entry of a vector. 

In [15]:
(v = c(a=1, b=2, c=3))
 (z = c(1,2,3))

In [6]:
names(v)


In [16]:
(v = c(a=1, b=2, c=3))
# c(a=1, b=2, stats=3) == c(1, 2, 3)
names(v)
names(v) <- c("stats", 'ds', 'cs')

v["stats"]
v[1]
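
Names can also be used to pick out several entries at once. The following is a minimal sketch that reuses the names assigned above; `unname()` is a base `R` function that drops the names.

```r
v <- c(a = 1, b = 2, c = 3)
names(v) <- c("stats", "ds", "cs")

v[c("stats", "cs")]   # several entries selected by name
v[["ds"]]             # double brackets return the value without its name
unname(v)             # the values with all names removed
```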

### Logical values

In `R`, the *boolean* (true / false) values are represented by the special values `TRUE` and `FALSE`, commonly abbreviated `T` and `F`.

In [8]:
u <- c(T, F, TRUE, TRUE, FALSE)
print(u)

[1]  TRUE FALSE  TRUE  TRUE FALSE


Logical values are results of logical statements (questions), for instance, comparisons of numbers. 

In [9]:
v <- c(1, 2, 3)
print(v > 1)
print(v[v > 1])

[1] FALSE  TRUE  TRUE
[1] 2 3


Often you will need to combine multiple logical conditions. To do this we have the **logical operators** (`&&` and `||`), which take the logical `and` and `or`, respectively, of several logical conditions.

In [10]:
# Weather
rain <- TRUE
temp <- 41
can_jog <- (rain == FALSE) && (temp > 62)
can_jog

There is a subtle but important difference between the single and double versions of these operators. The single `&` performs entrywise `AND` over logical vectors:

In [11]:
# today's weather and Tuesday's weather
rain <- c(TRUE, FALSE)
temp <- c(41, 65)
can_jog1 <- (rain == FALSE) & (temp > 60)
can_jog1

In [12]:
can_jog2 <- (rain == FALSE) && (temp > 60)


ERROR: Error in (rain == FALSE) && (temp > 60): 'length = 2' in coercion to 'logical(1)'
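
The double operators also "short-circuit": they stop as soon as the answer is known. The sketch below illustrates this under the assumption that `x_val` is a hypothetical value that may be missing entirely (`NULL`).

```r
# & compares entry by entry and returns a vector of the same length.
c(TRUE, FALSE) & c(TRUE, TRUE)

# && works on single values and evaluates the right-hand side only if needed.
# Here the left-hand side is FALSE, so `x_val > 0` is never evaluated.
x_val <- NULL
ok <- !is.null(x_val) && x_val > 0
ok
```

This makes `&&` and `||` the natural choice inside `if` conditions, while `&` and `|` are meant for working with whole vectors.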


Be careful when testing for equality in conditionals. The `==` operator will return a *vector* of logicals. If you want to make sure that any/all entries of a vector are `TRUE`, use the `any()` or `all()` functions:

In [None]:
v1 = c(1, 2, 3)
v2 = c(1, 1, 2)
v1 == v2

#if (v1 == v2) { print("Wrong!") }
#if (all(v1 == v2)) { print("All!") }
#if (any(v1 == v2)) { print("Any!") }
# ?identical

In [None]:

all(v1 == v2)

any(v1 == v2)
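
Putting these pieces together, the sketch below reduces the entrywise comparison to a single logical value before it is used in a conditional. The variable `msg` is introduced only for this illustration.

```r
v1 <- c(1, 2, 3)
v2 <- c(1, 1, 2)

# Reduce the comparison to one TRUE or FALSE before using it in if().
if (all(v1 == v2)) {
  msg <- "all entries match"
} else if (any(v1 == v2)) {
  msg <- "some entries match"
} else {
  msg <- "no entries match"
}
msg

# identical() compares two objects as a whole and returns a single logical.
identical(v1, v2)
identical(v1, c(1, 2, 3))
```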

### Missing values

A very statistical feature of `R` that sets it apart from other languages is the built-in ability to handle missing data via the special value `NA` (not available). Think of `NA` as saying that `R` doesn't know the value of something. Is `NA` greater than 5? `R` doesn't know, because it doesn't know what the unobserved value is supposed to be. Yet an `NA` still counts as one entry of the data; it just has not been observed. 

In [17]:
x <- NA

is.na(x)


print(x > 5)
length(c(1, 2, NA))


[1] NA
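
Because `NA` stands for an unknown value, it propagates through calculations. The short sketch below uses a made-up vector `obs` to show this, together with the `na.rm` argument that many summary functions accept.

```r
obs <- c(4, NA, 7)

# Any sum or mean that involves an unknown value is itself unknown.
sum(obs)
mean(obs)

# na.rm = TRUE drops the missing values before computing.
mean(obs, na.rm = TRUE)

# is.na() works entrywise, so summing it counts the missing values.
sum(is.na(obs))
```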


### Matrix and array 

We can create vectors with more than one dimension: a matrix has two dimensions, and an array can have more than two. 

In [None]:
A <- matrix(1:9, nrow = 3, ncol = 3, byrow=T)

B <- array(1:20, dim=c(2,5,2))
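
To see what `byrow` does and how entries are addressed, here is a brief sketch. `A_col` is an extra matrix introduced only for comparison.

```r
A <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row
A_col <- matrix(1:9, nrow = 3, ncol = 3)            # default: column by column

dim(A)        # number of rows and columns
A[1, ]        # first row, returned as a vector
A[2, 3]       # row 2, column 3 of the row-filled matrix
A_col[2, 3]   # same position in the column-filled matrix

B <- array(1:20, dim = c(2, 5, 2))
dim(B)
B[2, 5, 2]    # last entry of the array
```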

### String 

`R` can also store characters in a string.

In [22]:
wrd = c("a","new","sentence")
u = c("_", "!", "?")
wrd
typeof(wrd)
print("this is a sentence")
length("this is a sentence")

[1] "this is a sentence"


In [23]:
length(wrd)
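
Note that `length()` counts the entries of a vector, not the characters in a string. A minimal sketch of the difference, using `nchar()` to count characters (the name `s` is introduced just for this example):

```r
s <- "this is a sentence"

length(s)   # one string, so the vector has length 1
nchar(s)    # number of characters in that string

# nchar() is vectorized: one count per string.
nchar(c("a", "new", "sentence"))
```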

In [None]:
wrd
u

In [None]:
### Use paste() function to manipulate strings
paste(wrd,u,sep="/")

# Data/session

In [None]:
(split.str=strsplit("this is a sentence", split=" "))


In [None]:
### Use strsplit() to split strings
split.str=strsplit("this is a sentence", split=" ")

typeof(split.str)
length(split.str)
split.str[[1]]

In [27]:
wrd = c("a","new","sentence",10,11)
print(wrd)
test<-c(10,11)
print(test)

[1] "a"        "new"      "sentence" "10"       "11"      
[1] 10 11
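
The output above shows that the numbers were converted to strings. An atomic vector holds a single type, so `c()` converts its arguments to a common one. The sketch below shows a few such conversions and how to convert back explicitly.

```r
typeof(c(1, "a"))     # numbers become strings
typeof(c(1, TRUE))    # logicals become numbers
typeof(c(1L, 2.5))    # integers become doubles

# Converting strings back to numbers has to be done explicitly.
as.numeric(c("10", "11"))
```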


In [30]:
wrd[3]

## Lists
Lists are another type of sequence data type found in `R`. Unlike atomic vectors, lists can hold objects of multiple types.

In [29]:
# list(1, 2, 3, "a") %>% str
x = list('a', 1L, FALSE, pi, list(1:3))
print(x)

[[1]]
[1] "a"

[[2]]
[1] 1

[[3]]
[1] FALSE

[[4]]
[1] 3.141593

[[5]]
[[5]][[1]]
[1] 1 2 3




In [31]:
x[[1]]

In [None]:
typeof(x[4])

typeof(x[[4]])


As the printout suggests, we can think of a list as a _vector of vectors_. For this reason, they are sometimes referred to as _recursive vectors_.

The `str` command will print out the **str**ucture of an object. 

In [None]:
str(x) 

Just like atomic vectors, we can name each individual entry of a list:    

In [None]:
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
names(x_named)

### Sub-setting lists
Subsetting lists is a little more complex than subsetting atomic vectors. We will use the following example list.

In [None]:
str(example_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5)))

In [None]:
print(example_list)

The `[]` operator extracts a sub-list. That is, the return type will always be a list:

In [None]:
example_v <- 1:3
str(example_list)
example_v[1]
str(example_list[1])
str(example_list[[1]])

As with atomic vectors, the single brackets accept integer, logical and character vectors.

In [None]:
str(example_list)
example_list[c(1,2,4)]
str(example_list[c('a', 'd')])
str(example_list[c(TRUE, TRUE, FALSE, TRUE)])  # what happened here?
str(example_list[-3])

The double-brackets `[[]]`  will extract a single component from the list.

In [None]:
str(example_list)
example_list[["d"]]

We can also pass an integer vector to `[[]]`. This will index into successive levels of the list:

In [None]:
str(example_list)
example_list[[c(4,2)]]
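
The vector form is shorthand for chaining double brackets, as the sketch below illustrates. It also uses `$`, which extracts a named component much like `[[ ]]` with a character name.

```r
example_list <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# These three expressions all reach the second element of component d.
example_list[[c(4, 2)]]
example_list[[4]][[2]]
example_list$d[[2]]
```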

## B.2 Data frame

We use the `load()` function to read an `RData` file. The file `flint.RData` contains the `flint` data set. 

In [None]:
load('../Data/flint.RData')

This has loaded a data frame into an object called `flint` containing a real data set on an **ongoing** public health crisis (see the [Flint water crisis](https://en.wikipedia.org/wiki/Flint_water_crisis)). We will use this data set to explore the `data.frame` in `R`. 

In [None]:
names(flint)

In [None]:
summary(flint)

In [None]:
sd(flint$`Lead (ppb)`)

In [None]:
sd(flint$`Lead (ppb)`, na.rm = TRUE)

In [None]:
testdata<-flint$`Lead (ppb)`
testdata<-testdata[!is.na(testdata)]
# testdata

In [None]:
summary(testdata)

In [None]:
hist(testdata)

In [None]:

v <- flint$`Lead (ppb)`  # assumed: v holds the lead measurements
hist(v[v < 10000 & v > 500])
abline(v=15)

In [None]:
table(flint$`Zip Code`)

In [None]:
z <- flint$`Zip Code`
v <- flint$`Lead (ppb)`  # assumed: v holds the lead measurements
mean(v[z == "48529"], na.rm=T)

In [None]:
plot(flint$`Lead (ppb)`~flint$`Copper (ppb)`,xlab='Copper (ppb)', ylab='Lead (ppb)',pch=16, col='blue')

Many object types in `R` are actually lists with additional attributes. For example,  data frames are lists. The `names()` of a data frame correspond to columns. This means we can use the list indexing methods shown above to access columns.

In [None]:
typeof(flint)
names(flint)
head(flint[[2]])
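
The same behavior can be checked on a small data frame built by hand. The data frame `df` below is made up for illustration and is unrelated to the `flint` data.

```r
# A small made-up data frame with two columns.
df <- data.frame(id = 1:3, lead = c(0.3, 10.2, NA))

typeof(df)        # a data frame is stored as a list
length(df)        # the list has one component per column
names(df)         # the component names are the column names

df[["lead"]]      # double brackets extract a column as a vector
df$lead           # $ does the same for a named column
str(df["lead"])   # single brackets keep a data frame with one column
```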

The `tidyverse` package uses another structure known as the `tibble`, which is a modern reimagining of the `data.frame` (see [here](https://tibble.tidyverse.org/)). This note will not touch on the differences between `tibble` and `data.frame`. It suffices to remember to use `tibble` when using `tidyverse`.



## B.3 Function

Often when programming, we find ourselves repeating the same block of code with minor modifications. It is a good idea to wrap up such blocks so that they can be reused. Most programming languages allow us to create _functions_ for exactly this purpose.  

Above we saw that typing `c(1,2,3,4,5)` creates a vector of 1 to 5. `c()` is an example of a *function*. The general form of a function in `R` (and most other programming languages) is:

    <function name>(<function arguments>)
    
In the above example, the function is named `c`, and the arguments were `1, 2, 3, 4, 5`. Another example of a function is `print`, which prints its arguments to the screen:

In [None]:
print("I am a function named print")
?print

We can create a simple function that requires no argument. 


In [None]:
greet <- function( ) {
    return("Nice to meet you!")
    }
greet
greet()

`greet` is the **name** of our function and  the code between the curly brackets `{` and `}` is the **body** of the function.

We can modify `greet()` to take an argument. In what follows, `x` is the **argument** to the function.

In [None]:
greet <- function( x ) {
    paste("Nice to meet you, ", x, "!",sep='')
}

We can supply a **default** value for the argument, as the code below shows.

In [None]:
greet <- function(x = "friend") {
    paste("Nice to meet you, ", x, "!",sep='')
}

In [None]:
# If we supply the argument, the function works as before.
greet('stranger')
# If we don't, it uses the default argument.
greet()

Let's see what happens when we pass along the empty string "" as an argument to `greet`.

In [None]:
greet("")

Perhaps we don't like the stray comma and space between "you" and "!" in this case. We can add a check to see if the argument is an empty string.

In [None]:
greet <- function(x = "friend") {
    if (nchar(x) == 0) {
        "Nice to meet you!"
    } else {
        paste("Nice to meet you, ", x, "!", sep='')
    }
}

In [None]:
greet("")

We just saw an instance of **conditional execution** of code using the `if` statement.

Now let's write some functions that are more statistical. Suppose that we want to standardize a vector `x`.

In [None]:
x <- c(1,5,-11,20)
print(x)

In [None]:
x_centered <- x-mean(x)
print(x_centered)
var(x_centered)

In [None]:
x_std <- x_centered / sd(x_centered) # divide by std dev to have variance 1
print(x_std)
var(x_std)

Now let's say we have to perform this task again for another vector.  We can simply repeat the above calculations.  

In [None]:
y <- c(-12, 3, 14, 56)
y_centered <- y - mean(y)
y_std <- y_centered / sd(y_centered)
print(y_std)
var(y_std) 

Or, we could write a function in `R` to help us achieve what we want with one call of this function!

In [None]:
standardize <- function(x) { # name of the function and arguments
    x_centered <- x - mean(x)    # body of the function
    x_centered / sd(x_centered)  # body of the function
}

In [None]:
xstd <- standardize(x)
print(xstd)
var(xstd)

In [None]:
ystd <- standardize(y)
print(ystd)
var(ystd)

Now we have a `standardize()` function that standardizes any vector easily. But what if we need to standardize hundreds of vectors in a data frame? Soon we will learn about iteration and ways to cut down further on repetition.

Often when writing functions we need to do different things depending on what data is passed in. This is known as *conditional execution*, and is accomplished using the `if/else` construct:
```{r}
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
```

`if/else` and `ifelse()` are very different. `ifelse()` is a *function* that takes three vector arguments and returns a new vector. `if/else` tells R to conditionally execute code. 
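
A minimal side-by-side sketch of the two, using a made-up vector `temps`:

```r
temps <- c(41, 65, 58)

# ifelse() is vectorized: it returns one result per entry of the test.
label <- ifelse(temps > 60, "warm", "cool")
label

# if/else takes a single TRUE or FALSE and decides which code to run.
if (mean(temps) > 60) {
  overall <- "warm"
} else {
  overall <- "cool"
}
overall
```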

The `condition` part of the `if` statement must evaluate to a single `TRUE` or `FALSE`. If it has length greater than one, you will get an error (versions of `R` before 4.2.0 issued only a warning and used the first element):

In [None]:
if (c(T, F)) { 1 } else { }

ifelse(
    1:10 > 5,
    "A",
    "B"
)

Note that a condition of `NA` will generate an error. This is one of the most common issues when writing conditions in `R`. 

In [None]:
if (NA) { 1 }

By default, the last expression evaluated inside a function block is the value returned. However, we can use an explicit **return statement** to return early.


In [None]:
standardize2 <- function(x) {
    x_centered <- x - mean(x)
    stdev <- sd(x_centered)
    if (stdev == 0) {
        return(x)
    }
    x_centered / sd(x_centered)  
}

In [None]:
standardize2(c(-2, -1, 1, 2))

In [None]:
standardize2(c(1,1,1,1))

In [None]:
standardize(c(1,1,1,1))

## B.4 Iteration

**Iteration** is an important concept in programming. Iteration means, roughly, running the same piece of code repeatedly. Let us consider the famous *Fibonacci sequence*, whose terms are defined, for $n\geq 2$, by
$$ F(n+1) = F(n) + F(n-1), $$
where $F(1) = 0$ and $F(2) = 1$.

  There are many ways to perform iteration in `R`. The one you have probably heard of is the **for-loop**:
```
for (<index> in <vector>) {
    [do something for each value of <index>]
}
```
The code below computes the first 10 Fibonacci numbers using a **for-loop**.

In [None]:
previous = 0
current = 1
for (i in 1:10) {
    print(previous)
    new = current + previous
    previous = current
    current = new
}

The **for-loop** should have three components:
1. The *output*, in this case the printed numbers. 
2. The *sequence* of values along which we will iterate. Here we use 1 to 10. 
3. The *body*, which is the piece of code that gets executed in each iteration of the loop. In the example above, the body first runs `print(previous)`, then `new = current + previous`, etc.
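
The three components are easier to see when the output is stored rather than printed. The following is one possible sketch, not the only way to write it; the names `n` and `fib` are introduced for this example.

```r
n <- 10
fib <- numeric(n)   # output: a vector allocated before the loop starts
fib[1] <- 0         # F(1) = 0
fib[2] <- 1         # F(2) = 1

for (i in 3:n) {                      # sequence: the positions still to fill
  fib[i] <- fib[i - 1] + fib[i - 2]   # body: apply the recurrence
}
fib
```

Allocating the output first avoids growing the vector inside the loop, which can be slow for long sequences.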

Many operations in statistics are applied to all observations. However, we don't use for-loops that often in `R`, since _vectorization_ provides better alternatives. For example, the following two approaches produce identical outputs. The second is more concise because it uses vectorization, and it can be faster in some cases. 

In [None]:
# sum the numbers 1 to 100
output <- 0
v <- 1:100
for (i in v) {
    output <- output + i
}

output = sum(1:100)

Sometimes we cannot specify _a priori_ the sequence to iterate over. In that case we can use a while-loop for iteration. 
```
while (<condition>) {
    <body>
}
```
The `while` loop will continue running until `<condition>` returns `FALSE`.

Here's an example of how we would use a `while` loop. The following code counts the number of heads and tails encountered in tosses of a fair coin until the third head is encountered.

In [None]:
heads <- 0; tails<-0;
while(heads<=2){
    toss<- sample(c(0,1),1);
    heads <- heads + toss;
    tails <- tails + (1-toss);
}
print(heads)
print(tails)
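
Because the tosses are random, the number of tails changes from run to run. If a reproducible result is wanted, one option is to fix the random seed first, as in the sketch below. The seed value is arbitrary and chosen only for this illustration.

```r
set.seed(1)   # arbitrary seed, so that reruns give the same tosses

heads <- 0
tails <- 0
while (heads < 3) {
  toss <- sample(c(0, 1), 1)   # 1 = head, 0 = tail
  heads <- heads + toss
  tails <- tails + (1 - toss)
}

heads           # always 3 when the loop ends
heads + tails   # total number of tosses, which depends on the seed
```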

As another example, say we want to print all Fibonacci numbers less than 1000. We do not know how many terms that sequence will have. We can resort to the while-loop.

In [None]:
previous = 0
current = 1
while (previous < 1000) {
    print(previous)
    new = current + previous
    previous = current
    current = new
}