### Reading Data
Data are stored in so many formats that no single function could read them all while retaining the strengths of each format. Base R does a great job reading tabular data and simple text files, but for specialized file types you'll need to load a package containing functions someone else has written to deserialize the data and translate it into a standard R object type.

Before you read in your data, spend a moment thinking about the name you will use to refer to it once it loads; then assign (`<-`) the output of the read function to that name. Here are some common file formats you might read using the Champions Workspace.

**CSV**

In [None]:
dat <- read.csv("../your_file.csv")

**JSON**

In [None]:
library(jsonlite)
dat <- fromJSON("../your_file.json")

**Excel**

In [None]:
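library(readxl)
# read_excel() handles both the legacy .xls and the newer .xlsx formats.
# Run whichever line matches your file; the second assignment overwrites the first.
dat <- read_excel("../your_file.xls")
dat <- read_excel("../your_file.xlsx")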

**Big Data**

In [None]:
library(data.table)
dat <- fread("../your_file.csv")

If you read in a CSV or Excel sheet, you will probably recognize the resulting table, which is of class `data.frame`. If you read in a JSON file, the result will look like a table if the data were a table before being serialized as JSON. If they were anything other than a table, what you see in the `dat` object will typically be a `list`.
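
To check what kind of object a read function gave you, `class()` and `str()` are handy. This is a minimal sketch that uses small inline strings in place of real files; the data and the names `csv_dat`, `json_table`, and `json_other` are made up for illustration, and the JSON part only runs if `jsonlite` is installed:

```r
# Inline text stands in for a file on disk (hypothetical data)
csv_dat <- read.csv(text = "id,score\n1,90\n2,85")
class(csv_dat)   # a data.frame
str(csv_dat)     # compact summary of the columns and their types

if (requireNamespace("jsonlite", quietly = TRUE)) {
  # An array of records is simplified back into a table
  json_table <- jsonlite::fromJSON('[{"id":1,"score":90},{"id":2,"score":85}]')
  print(class(json_table))

  # Anything that is not table-shaped comes back as a list
  json_other <- jsonlite::fromJSON('{"name":"example","tags":["a","b"]}')
  print(class(json_other))
}
```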

## Whitespace is Meaningless in R
One of the joys of R is that it doesn't derive meaning from indentation or spacing. Your code can be laid out as tidily as you want without generating errors due to a misplaced tab. Personally, I like to wrap function calls that have lots of arguments across several lines to make them easier to read. While Python uses indentation to group lines of code into a block, R uses curly braces for the same purpose.

In [None]:
suppressStartupMessages({
  library(data.table)
  library(jsonlite)
  library(readxl)
})
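
As a small illustration of wrapping a call with several arguments, the sketch below builds the same table twice, once on a single line and once with one argument per line. The object names `one_line` and `wrapped` and the column values are made up for the example; `identical()` should confirm that R treats both layouts the same way.

```r
# One long line...
one_line <- data.frame(strings = LETTERS[1:3], numbers = 1:3, flag = c(TRUE, FALSE, TRUE))

# ...or the same call wrapped, one argument per line
wrapped <- data.frame(
  strings = LETTERS[1:3],
  numbers = 1:3,
  flag    = c(TRUE, FALSE, TRUE)
)

# The layout differs, the resulting objects do not
identical(one_line, wrapped)
```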

## Comments
The R comment character is the hash (as in Python). The fun thing about a Jupyter Notebook is that you don't need to comment your code inside a code block; you can use markdown and expand your comments into a story.

In [4]:
# This hash symbol tells R not to evaluate this line, or the next one
# getwd()
list.files()

## Object types
The most important thing to know about R is the object types for holding data, and the second most important thing to know is how to retrieve that data.

In [16]:
# text must be quoted
a <- "test"

In [15]:
# if it is not, R will search for an object with that name instead
b <- take2

ERROR: Error in eval(expr, envir, enclos): object 'take2' not found


The colon operator generates an inclusive sequence, increasing by 1. This is useful for indexing.

In [5]:
2:14

Vectors are one-dimensional and must contain a single data type.

In [19]:
c(1, 2, 3, 4, 5)
c("a", "b", "c")
# mix them and R will coerce every element to the most flexible type that can hold them all
c(1, 2, "c")
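
To see the coercion at work, you can ask for the `class()` of a few mixed vectors. This is a small sketch; the name `mixed` is only for illustration.

```r
class(c(TRUE, 2L))     # logical mixed with integer gives "integer"
class(c(2L, 3.5))      # integer mixed with double gives "numeric"

mixed <- c(1, 2, "c")  # numbers mixed with text
class(mixed)           # "character"
mixed                  # the numbers are now the strings "1" and "2"
```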

Data frames (`data.frame`) store tabular data in rows and columns. Columns can be different data types, but every value within a column must be the same type.

In [20]:
data.frame(strings = LETTERS[1:6], numbers = 6:11)


strings,numbers
<chr>,<int>
A,6
B,7
C,8
D,9
E,10
F,11


Matrices are similar, but the entire table must be one data type. A matrix is a vector with a dimension attribute.

In [21]:
matrix(1:9, nrow = 3, ncol = 3, dimnames = list(1:3, LETTERS[1:3]))


,A,B,C
1,1,4,7
2,2,5,8
3,3,6,9
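
The idea that a matrix is a vector with a dimension attribute can be checked directly. In this sketch (the name `v` is arbitrary), a plain vector becomes a matrix as soon as it is given dimensions:

```r
v <- 1:6
is.matrix(v)        # FALSE: a plain vector has no dim attribute

dim(v) <- c(2, 3)   # assign dimensions: 2 rows, 3 columns
is.matrix(v)        # TRUE: same values, filled column by column
attributes(v)       # the only thing that changed is the dim attribute

v[5]                # plain vector indexing still works
v[1, 3]             # and so does [row, col] indexing; both return 5
```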


Arrays generalize matrices to more than 2 dimensions. Structures with n dimensions are difficult to present on screen, so the example prints one 2-dimensional slice at a time.

In [31]:
myarr <- array(1:27, dim = c(3,3,3))
dim(myarr)
myarr[,,1]
myarr[,,2]
myarr[,,3]

0,1,2
1,4,7
2,5,8
3,6,9


0,1,2
10,13,16
11,14,17
12,15,18


0,1,2
19,22,25
20,23,26
21,24,27


Lists can store any data type, and the elements of a list do not have to share a type.

In [33]:
list(
    c(1,2), 
    data.frame(a = 1:3, b = 4:6)
)

a,b
<int>,<int>
1,4
2,5
3,6


Elements of vectors and lists can be named.

In [34]:
c(a = 1, b = 2, c = 3)
list(myvector = c(1,2), mydf = data.frame(a = 1:3, b = 4:6))

a,b
<int>,<int>
1,4
2,5
3,6


## Extracting data
R indexes start at 1, unlike most programming languages, which start at 0.  
The richness of indexing 1-, 2-, and n-dimensional structures is one of R's greatest strengths.  
Paradoxically, it is also the part of R most people love to hate, and entire packages are built around simplifying it.

In [None]:
nums <- c(5, 2, 3, 4, 1)
which(nums == 5)

Use square brackets to select a value from an object.  
Elements in a vector have defined positions:

In [None]:
nums[5]
nums[4:5]

If the elements are named, you can index by name:

In [None]:
nameval <- c(a=1, b=2, c=3)
nameval["b"]

If the object has dimensions, you can use matrix notation and specify `[rows, cols]`:

In [None]:
dat <- data.frame(strings=LETTERS[1:6], numbers=6:11)
dat[2, 2]

Leave the row index out to select all rows:

In [None]:
dat[, 2] <- dat[, 2] + 10
dat

If the object is a list you can use list notation, which is a double square bracket to index by a single position or name:

In [None]:
myx <- list(
    myvector = c(1,2), 
    mydf = data.frame(a = 1:3, b = 4:6), 
    mymatrix = matrix(1:9, nrow = 3)
)
myx[[2]]
myx[["mydf"]]

Or use a dollar sign to index by a single name:

In [None]:
myx$myvector

If you need more than one element from a list, use vector notation (single square brackets with a vector of positions or names), which returns another list:

In [None]:
myx[c(1,2)]
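
The difference between single and double square brackets is easy to miss, so here is a short sketch that compares what each one returns. It rebuilds a smaller version of `myx`; the names `kept_as_list`, `element`, and `by_name` are only for illustration.

```r
myx <- list(
  myvector = c(1, 2),
  mydf = data.frame(a = 1:3, b = 4:6)
)

kept_as_list <- myx[1]     # single brackets keep the list wrapper
class(kept_as_list)        # "list"

element <- myx[[1]]        # double brackets return the element itself
class(element)             # "numeric"

by_name <- myx[c("myvector", "mydf")]   # vector notation also works with names
length(by_name)                         # a list of 2 elements
```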


## Custom Functions
Many operations involve applying a function to each element of a set. Custom functions are created with the `function` keyword: the arguments go inside the parentheses and the body goes inside curly braces. The function returns the value of the last expression evaluated in its body, or you can return a value explicitly with `return()`. Custom functions let you write your own analyses, and they can also be passed to built-in functions (such as `lapply`) that iteratively apply a function to each element of a set.

In [8]:
addition <- function(a, b) {
  a + b
}

addition(11, 17)
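
As a rough sketch of how a custom function can be handed to `lapply`, the example below reuses `addition` and applies it to each element of `1:3`. The helper `safe_divide` and the result names `as_list` and `as_vector` are hypothetical, added only to show an explicit `return()` and the related base function `sapply`.

```r
addition <- function(a, b) {
  a + b
}

# lapply() calls the function once per element and returns a list;
# extra arguments (here b = 10) are passed through to every call
as_list <- lapply(1:3, addition, b = 10)

# sapply() does the same but simplifies the result to a vector when it can
as_vector <- sapply(1:3, addition, b = 10)
as_vector

# An explicit return() exits the function early (hypothetical helper)
safe_divide <- function(a, b) {
  if (b == 0) {
    return(NA)
  }
  a / b
}
safe_divide(1, 0)
safe_divide(6, 3)
```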

## For loops
The `for` loop is used extensively in Python and is sometimes useful in R as well. The `in` keyword is reserved for the `for` loop, so you can call a routine on each element *in* a set.
A `for` loop advances to the next element automatically after each iteration.

In [11]:
loopcontrolvector <- 1:5
for(i in loopcontrolvector) {
  print( sum( loopcontrolvector[1:i] ) )
}

[1] 1
[1] 3
[1] 6
[1] 10
[1] 15


In practice, many `for` loops are replaced by vectorized operations that natively act on whole sets and loop implicitly, rather than requiring an explicit loop.
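
For instance, the running total computed by the loop above can be produced by the base function `cumsum()`. This sketch compares the two approaches; the names `running` and `vectorized` are made up for the example.

```r
loopcontrolvector <- 1:5

# Explicit loop: collect the running total one element at a time
running <- numeric(length(loopcontrolvector))
for (i in loopcontrolvector) {
  running[i] <- sum(loopcontrolvector[1:i])
}

# Vectorized equivalent: cumsum() does the looping internally
vectorized <- cumsum(loopcontrolvector)
vectorized

# Both approaches give the same values
all(running == vectorized)
```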

## Logicals
Logicals consist of the keywords `TRUE` and `FALSE`. When a number is coerced to a logical (for example, in an `if()` condition), any non-zero value evaluates to `TRUE` and 0 evaluates to `FALSE`.

In [12]:
for(i in loopcontrolvector) {
  print(i < 3.5)
}

[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
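
The coercion between numbers and logicals can be seen directly with `as.logical()`. This short sketch also shows that a comparison works on a whole vector at once, so the loop above is not strictly needed; the names `coerced` and `flags` are only for illustration.

```r
# Zero becomes FALSE, any other number becomes TRUE
coerced <- as.logical(c(0, 1, -2.5))
coerced

# A non-zero number is accepted as a TRUE condition
if (3) print("non-zero counts as TRUE")

# Comparisons are vectorized: one expression tests every element
flags <- 1:5 < 3.5
flags

# Going the other way, TRUE counts as 1 and FALSE as 0
sum(flags)
```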


## While loops
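The `while` loop runs until its control expression evaluates to FALSE. Unlike a `for` loop, it does not advance anything automatically, so the variable in the control expression typically must be initialized before the loop and updated inside the loop body.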

In [13]:
x <- 0
while(x < 5) {
  print(10^x)
  x <- x+1
}

[1] 1
[1] 10
[1] 100
[1] 1000
[1] 10000


## If... Else control
These are pretty straightforward: use the format `if() {...} else if() {...} else {...}`. Chain as many `else if()` blocks as necessary.

In [14]:
x <- 17
if(x < 5) {
  print("it's a small number")
} else if(x > 100) {
  print("it's a big number")
} else {
  print("it's between 5 and 100")
}

[1] "it's between 5 and 100"
