## Note taking
### Header and Numbering
- For course name, use header 1 with numbering.
- For week names in courses, use header 2 with numbering.
- For main topics of each course, use header 3 with numbering.
- Cluster related elements of each topic together; you don't need to follow the order of the videos.

Set up the headings and numbering before you actually start. Note other materials at the same time if necessary.

### Tips
- Change the goal - don't aim to finish by a fixed time, and don't judge how it'll impact your destiny.
- Don't look at the number and duration of tasks that need to be completed. If your first impression makes you want to stop, stop only after trying for 30 seconds.
- Find a solitary, quiet, dark, cool environment with soothing sound.
- Try to achieve flow - make sure the task is interesting, and attainable but challenging.
- Take regular breaks and eat regularly. You'll have less trouble going through the material after a long sleep.
- If all opportunities were taken from you, you'd be instantly motivated.

# Course 1: Data Science Introduction

### Types of Data Science Questions
- Descriptive - statistical summary without inference.
- Exploratory - correlation, not causation.
- Inferential - use a small amount of data to generalize about a larger group, with a stated confidence level.
- Predictive analysis - use current and historic data to make predictions.
- Causal analysis - what happens to one variable when we manipulate another.
- Mechanistic analysis - the exact change in one variable caused by a change in another.

### Experimental design
- Formulate question
- Design experiment
- Identify problems
- Collect data.

### Big data
- Volume
- Velocity
- Variety


### Linking RStudio and GitHub
- RStudio > Tools > Global Options > Git/SVN (ensure the git directory path is correct) > click Create RSA Key > click View public key and copy it > log in to GitHub > Settings > SSH and GPG keys > New key > paste the public key in the key box.

- Create a new repository > copy the URL > create a new project in RStudio > Version Control > Git > paste the repository URL, then set the name and local location.

- Create a script > Environment > Git > select the script to stage your change > click Commit > the new window shows the changes in the lower quadrant and the commit message in the upper left > commit and close > push your changes > check on GitHub.

In [None]:
#Installing Packages
install.packages("package name")

#What packages are installed
installed.packages()
library()

#Updating packages
update.packages()

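#Unloading packages
detach("package:ggplot2", unload = TRUE) #detach() needs to be told what to unload; ggplot2 is only an example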

#Uninstalling packages
remove.packages("package name")

In [None]:
#Getting help

help(package = "ggplot2")
browseVignettes("ggplot2")

# Course 2: R Programming

## R Nuts and Bolts

### 5 types of objects

- Character
- Numeric
- Integer
- Complex
- Logical (boolean)

### Data Types
- Vectors can only contain objects of the same class - just integers, or just text, and so on. If we mix objects of different classes, R coerces them all to a single class.

- Lists can contain objects of different classes.
- A matrix in R is filled column-wise. Its dimensions are given with the number of rows first, then the number of columns.
- Factors are used for categorical data stored as integer codes - 1 for Female, 2 for Male etc.
- Data frames store tabular data.
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
- Objects have attributes - class, dimensions, length.
- An undefined mathematical result (e.g. 0/0) is NaN; an ordinary missing value is NA.
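
The points above can be checked with a short sketch. The variable names (v, m, f) are made up for illustration and are not from the course.

```r
# Mixed classes in c() are coerced to the most general class
v <- c(1.7, "a", TRUE)
class(v)                 # "character"

# A matrix is filled column by column: nrow first, then ncol
m <- matrix(1:6, nrow = 2, ncol = 3)
m[1, ]                   # first row: 1 3 5
attributes(m)            # the dim attribute holds 2 and 3

# A factor stores integer codes plus labels
f <- factor(c("male", "female", "female"), levels = c("female", "male"))
unclass(f)               # codes 2 1 1, with the levels attached

# NaN counts as NA, but NA is not NaN
is.na(c(1, NA, NaN))     # FALSE TRUE TRUE
is.nan(c(1, NA, NaN))    # FALSE FALSE TRUE
```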



### Reading data

read.table() and read.csv() are good for moderately sized data. They can work out column data types automatically, though specifying them makes reading faster.

For reading large data:
- Read the help page of read.table, and make a rough calculation of the memory needed to load the data.
- Set comment.char = "" if there are no comment lines in your data.
- If you don't specify the column classes, R spends time and resources working them out, so specify them for big data.
- colClasses = "numeric" treats every column as numeric.
- nrows = 100 loads only the first 100 rows, from which you can check the column classes.
- sapply(dataname, class) reports those classes, which you can then pass to colClasses.
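
A minimal sketch of that workflow, assuming a small demo file written to a temporary path so the example is self-contained. The file, its columns and the row counts in the memory estimate are made up.

```r
# Hypothetical demo file, created here only so the example runs on its own
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500, score = runif(500), grp = rep(c("a", "b"), 250)),
          path, row.names = FALSE)

# Step 1: read a small sample and let R infer the classes
initial <- read.csv(path, nrows = 100)
classes <- sapply(initial, class)

# Step 2: reuse the inferred classes for the full read
full <- read.csv(path, colClasses = classes, comment.char = "")
dim(full)

# Rough memory estimate for an all-numeric table: rows * columns * 8 bytes
1500000 * 120 * 8 / 2^20   # size in MB for a made-up 1.5 million x 120 table
```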

In [None]:
#Connection

#r is for reading, w is for writing, a is for appending.
con <- file("temp.txt", "r") # creates connection to the text file temp.txt.
data <- read.csv(con) # reads data from the connection
close(con) # closes the connection.

#The code above does the same as
read.csv("temp.txt") #but explicit connections help at times, e.g. to read a file in parts.

### Subsetting

"[" returns an object of the same class as the original - subsetting a vector returns a vector, subsetting a list returns a list, and so on. It can select more than one element.

"[[" extracts a single element from a list or data frame. The class of the returned object depends on the element extracted.

"$" extracts an element of a list or data frame by name. For example, it can return a specific column.

For subsetting from a matrix x:

x[1,2] returns the value in the first row, second column.
x[1,] returns the whole first row, x[,1] returns the whole first column.
x[1,2, drop = FALSE] returns a 1*1 matrix instead of a vector.

### Subsetting from lists

In [None]:
x <- c("a", "b", "c", "c", "d", "a")

#use of numeric index
x[1]
x[2]
x[1:4]


#logical index
x[x > "a"] #keeps the elements that sort after "a" (the lowest value here)
u <- x > "a"
u #logical vector: TRUE where the element is greater than "a"


#subsetting lists
x <- list(foo = 1:4, bar = .6) #first element is a sequence, second element is a number
x[1] #by using single bracket, we get back a list from the list
x[[1]] #by using double bracket, we get back just a sequence, not a list


x$bar #subsetting by name, returns a single number
x[["bar"]] #searching by string, returns a single number
x["bar"] #searching by string, returns a list with 1 value


#subsetting multiple elements by using single bracket "[]"
x <- list(foo = 1:4, bar = .6, baz = "hello") #list from which we want to subset
x[c(1,3)] #passing a vector of indices we want to subset, which returns foo and baz
#we can't use "[[]]" or "$" if we want to extract multiple elements of list


#"[[]]" can be used to index a list with a computed index
x <- list(foo = 1:4, bar = .6, baz = "hello") #list from which we want to subset
name <- "foo"
x[[name]] #computed index for 'foo', returns value of foo
x$name #returns NULL: "$" looks for a field literally called "name", which doesn't exist in the list
x$foo #returns value of "foo", as the "foo" field exists


#"[[]]" can take an integer sequence
x <- list(a = list(10, 12, 14), b = c(3.14, 2.81)) #making a nested list
#now, let's assume we want to extract 14. This is the 3rd element of the first element a
x[[c(1,3)]]
#or
x[[1]][[3]] #returns 14

#similarly
x[[c(2,1)]] #returns 3.14

### Subsetting from matrices

In [10]:
x <- matrix(1:6, 2, 3) #matrix is written by nrow first, and it fills columns first
x[1, 2] #subsetting from matrix: row index, column index

#subsetting multiple values
x[1,] #all from first row
x[,2] #all from second column

#Usually "[]" returns an object of the same class, but if we subset one element from a matrix, we get a vector of length 1 rather than a 1*1 matrix. This behavior can be turned off by setting drop = FALSE.
x <- matrix(1:6, 2, 3)
x[1, 2] #returns vector
x[1, 2, drop = FALSE] #returns a 1*1 matrix
x[1, , drop = FALSE] #returns a 1*3 matrix

#x is laid out as:
#     [,1] [,2] [,3]
#[1,]    1    3    5
#[2,]    2    4    6


### Partial Matching

If an element is named aabbcc, x$a will return it through partial matching. x[["a"]] won't work, but x[["a", exact = FALSE]] will.
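
A short sketch of the three cases, using a made-up list with a single element:

```r
x <- list(aabbcc = 1:5)

x$a                       # "$" matches the name partially: returns 1:5
x[["a"]]                  # NULL, because "[[" needs an exact match by default
x[["a", exact = FALSE]]   # partial matching switched on: returns 1:5
```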

### Removing NA

In [None]:
x <- c(1,2,NA,4,NA)
bad <- is.na(x) #TRUE where the value is missing
x[!bad]


# Removing rows of df with NA values
airquality[1:6,] # Shows first 6 rows and all columns of airquality data frame.
good <- complete.cases(airquality) #TRUE for rows without any NA
airquality[good, ][1:6, ]

### Vectorized operations
If we add 2 vectors, the corresponding elements are added.
The same goes for multiplying or dividing by a number, and for matrices.

x * y returns element-wise multiplication.
x %*% y returns true matrix multiplication.
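
As an illustration with made-up values (the names x, y, a and b are arbitrary):

```r
x <- 1:4
y <- 6:9
x + y        # 7 9 11 13
x * 2        # 2 4 6 8
x >= 2       # FALSE TRUE TRUE TRUE

a <- matrix(1:4, 2, 2)
b <- matrix(rep(10, 4), 2, 2)
a * b        # element-wise: rows are (10, 30) and (20, 40)
a %*% b      # matrix product: rows are (40, 40) and (60, 60)
```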

## Control Structures

In [None]:
#Control structure
#If, else if, else; for; while (loop while a condition is true); repeat (infinite loop); break; next (skip an iteration of a loop), return (exit a function).

for(i in 1:10){
    print(i)}

x <- matrix(1:6, 2, 3)

for(i in seq_len(nrow(x))){
    for(j in seq_len(ncol(x))){
        print(x[i, j])}}

count <- 0
while(count <10){
    print(count)
    count <- count+1}

z <- 5
while(z >= 3 && z <= 10){
    print(z)
    coin <- rbinom(1,1,.5)
    if(coin == 1){
        z <- z+1}
    else{
        z <- z-1}}

## Functions

Functions can be passed as arguments in other functions. They can be nested inside other functions.

#### Function Arguments  
*Positional and named arguments*: If we pass arguments by name (e.g. data = myData), we can put them in any order. If we don't name the arguments, we have to supply them in the order of the function definition.  
Naming also lets us skip arguments that have defaults and set only the ones we want to change.

#### Lazy Evaluation
Arguments are evaluated only when they are needed. If the function never uses an argument, leaving it out causes no error.  

#### The "..." Argument
... can be used to pass arguments through to another function when extending it. It's also used when the number of arguments can't be known in advance.
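
The three ideas can be sketched with two small functions. Both scale_by and my_mean are hypothetical helpers written for this note, not functions from the course.

```r
# Hypothetical helper: 'offset' is declared but never used in the body
scale_by <- function(x, times = 2, offset) {
  x * times
}

scale_by(1:3)                    # positional, default times: 2 4 6
scale_by(times = 10, x = 1:3)    # named arguments may come in any order: 10 20 30
scale_by(1:3, offset = stop("never evaluated"))  # lazy evaluation: no error is raised

# "..." forwards extra arguments to another function
my_mean <- function(x, ...) {
  mean(x, ...)
}
my_mean(c(1, NA, 3), na.rm = TRUE)   # 2
```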

In [None]:
# This function adds two user-specified numbers

add2 <- function(x,y){
  x+y
}

# The function checks which values of input are greater than user specified value n
# Here, 10 is taken as default value if user doesn't specify a value.

aboven <- function(x,n = 10){
  results <- x > n
  x[results]
}

columnmean <- function(y, removeNA= TRUE){  #removeNA = TRUE skips NA values in the calculation
  nc <- ncol(y)         #number of columns
  means <- numeric(nc)  #empty vector that will hold mean values of columns
  for(i in seq_len(nc)){ #seq_len() also copes with zero columns
    means[i] <- mean(y[,i], na.rm = removeNA) #take all rows of ith column, calculate
    # mean, and assign the value in the ith position of means vector
  }
  means
}

### Date and Time  
Date is represented by **Date** class.  
Time is represented by the POSIXct class (stores time as the number of seconds since a base date, appropriate for use in a data frame), or the POSIXlt class (stores much more info as a list - day of week, month, year etc.).
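
A small sketch of the difference, with arbitrary example dates and the time zone fixed to UTC so the results don't depend on the machine:

```r
d <- as.Date("1970-01-02")
unclass(d)                  # 1: a Date is stored as days since 1970-01-01

ct <- as.POSIXct("2012-01-10 10:40:00", tz = "UTC")
unclass(ct)                 # a single number: seconds since 1970-01-01 00:00:00 UTC

lt <- as.POSIXlt(ct)
names(unclass(lt))          # list components such as sec, min, hour, mday, mon, year, wday
lt$hour                     # 10
lt$wday                     # 2: day of week, counted from Sunday = 0 (a Tuesday)
```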


In [None]:
x <- Sys.time()
p <- as.POSIXct(x)
unclass(p)

#### strptime
The **strptime** function parses date-time information from character strings (and not from other classes like Date).

In [None]:
datestring <- c("January 10, 2012 10:40")
strptime(datestring, "%B %d, %Y %H:%M") #what each symbol means can be found in the help page of strptime

### Loop Function  

lapply: Loop over a list and evaluate a function on each element  
sapply: same as lapply, but tries to simplify the result  
mapply: multivariate version of lapply.  

Split is often used with lapply.  

#### lapply  

    
lapply returns a list.

In [None]:
# lapply(X, FUN, ...) is implemented roughly as:
# FUN <- match.fun(FUN)
# if (!is.vector(X) || is.object(X))
#     X <- as.list(X)
# .Internal(lapply(X, FUN))


x <- list(a= 1:5, b=rnorm(10))
lapply(x, mean)   #lapply applies mean to a and to b separately, and returns a list with one mean per element

x <- 1:4
lapply(x,runif) #loops over 1:4, giving 1 random uniform number for 1, 2 for 2 and so on.

#we can pass extra arguments of the function in lapply
x <- 1:4
lapply(x, runif, min = 0, max = 10)

### Anonymous function with lapply  
We can define functions inside the lapply call; they exist only for that call.


In [None]:
x <- list(a = matrix(1:4,2,2), b = matrix(1:6, 3, 2)) # creating list of 2 matrix
lapply(x, function(elt) elt[,1]) #function that extracts first column, only lives inside this lapply


### sapply  
sapply works like lapply, but it doesn't always return a list. If every result is a single value, it returns a vector. If every result is a vector of the same length, it returns a matrix. Otherwise it returns a list.  
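
The three outcomes can be seen on a made-up list:

```r
x <- list(a = 1:5, b = c(2, 4, 6))

lapply(x, mean)    # always a list: one mean per element
sapply(x, mean)    # single values, so a named vector: a = 3, b = 4
sapply(x, range)   # two values each, so a 2 x 2 matrix with columns a and b

# Results of different lengths can't be simplified, so a list comes back
sapply(list(1:2, 1:3), function(v) v * 2)
```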

### apply  
apply is used for multidimensional array (e.g. 2 dimensional matrix). 

In [None]:
a <- matrix(rnorm(50),10, 5)
b <- apply(a, 2, mean) # It'll return a vector of length 5 that has the mean of each column. 2 is the second dimension (columns), which is kept, and it collapses the first dimension.

c <- apply(a, 1, sum) # this preserves rows, and collapses columns.

#But it's better to use optimised functions for sum and mean operations

# rowSums = apply(x,1, sum)
# rowMeans = apply(x, 1, mean)
# colSums = apply(x, 2, sum)
# colMeans = apply(x, 2, mean)

d <- matrix(rnorm(50), 5, 10)
e <- apply(d, 1, quantile, probs = c(.25, .75)) #we need to specify 25th and 75th percentile for quantile function, it'll calculate the percentiles over each row. It'll collapse the columns, and return 2 row, 5 column (for the two results against each row) matrix.

f <- array(rnorm(2 * 2 * 5), c(2, 2, 5))
g <- apply(f, c(1,2), mean) #Here, we're preserving 1st and 2nd dimensions, but collapsing the 3rd. This'll return a mean matrix.

h <- rowMeans(f, dims = 2) #same result as apply(f, c(1,2), mean), but faster

### mapply  
If we need to apply a function over several lists at once, lapply isn't enough, and we'd have to write a for loop. Multivariate apply, or mapply, applies a function to the elements of multiple lists in parallel. The function needs to accept at least as many arguments as the number of lists passed.


In [None]:
i <- list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)) #written out by hand
j <- mapply(rep, 1:4, 4:1) #same result: rep(1,4), rep(2,3), rep(3,2), rep(4,1)

### tapply  
tapply(vector, index, FUN, ..., simplify = TRUE)  
Applies a function over subsets of a vector. For example, we can calculate summary statistics (mean, standard deviation etc.) for each group within a vector.  
It takes a vector; a second vector (or factor) of the same length that identifies which group each element belongs to; the function to apply; further arguments for that function ("..."); and lastly simplify, which controls whether the result is simplified like sapply.

In [None]:
k <- c(rnorm(10), runif(10), rnorm(10,1)) # 10 normal random variables, 10 uniform random variables, 10 normal random variables with mean 1.

l <- gl(3,10) #factor vector with 3 levels that's gonna repeat 10 times.

m <- tapply(k, l, mean, simplify = FALSE) #if we keep simplify as FALSE, we'll get list of 3 elements, where each element is the mean of the subgroup.

n <- tapply(k, l, range) # range returns two observation for each group - the min and the max


### Split  
split takes a vector and factor vector like tapply, but instead of applying a function, it splits the initial vector among the groups. We can then apply lapply or sapply on those groups.

In [None]:
k <- c(rnorm(10), runif(10), rnorm(10,1)) # 10 normal random variables, 10 uniform random variables, 10 normal random variables with mean 1.

l <- gl(3,10) #factor vector with 3 levels that's gonna repeat 10 times.

o <- split(k,l)

p <- lapply(split(k,l ), mean)

library(datasets) #library for data sets

q <- head(airquality) #loading first few rows of air quality data

#Now, we want to split the data into months (each month has about 30 observations), and then calculate the mean for each month.

r <- split(airquality, airquality$Month) #split the data frame into months. Since the Month column holds a few specific values, we can use it to group the data.

s <- lapply(r, function(x) colMeans(x[c("Ozone", "Solar.R", "Wind")])) #The anonymous function calculates the mean of the Ozone, Solar radiation and Wind columns. Columns with missing values give NA here, as na.rm isn't set.

t <- sapply(r, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)) #If we use sapply instead of lapply, we get a matrix back, as the results are all of the same length

#Multi-level split
u <- rnorm(10)
f1 <- gl(2, 5)
f2 <- gl(5, 2)

v <- interaction(f1, f2) #combining the two factors gives 10 levels

w1 <- str(split(u, list(f1, f2))) #this splits the data into the 10 combinations of levels. We don't need to call interaction ourselves, as split does it for us. Multiple factors can describe groups such as teen-male, middle-aged-female etc.

w2 <- str(split(u, list(f1, f2), drop = TRUE)) #drop = TRUE drops empty levels

## Debugging  
**Message**: notification; execution continues.  
**Warning**: indication that something is wrong but not fatal; execution continues.  
**Error**: fatal problem; execution stops.  
**Condition**: the generic concept for something unexpected occurring. Messages, warnings and errors are all conditions, and programmers can create their own.
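
A minimal sketch of the three kinds of condition. check_value is a hypothetical function written only to trigger them.

```r
# Hypothetical function used only to trigger each kind of condition
check_value <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")   # error: execution stops here
  if (x < 0) warning("x is negative")             # warning: execution continues
  message("checked value ", x)                    # message: execution continues
  invisible(x)
}

check_value(5)    # prints the message and returns 5 invisibly

# tryCatch() handles a condition instead of letting it stop the program
result <- tryCatch(check_value("a"),
                   error = function(e) conditionMessage(e))
result            # "x must be numeric"
```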

### Questions to ask if something's wrong  
- What was your actual input? How did you call the function?  
- What were you expecting? Output, messages, other results?  
- What did you get?  
- How does what you get differ from what you were expecting?  
- Can you reproduce the problem exactly?  


### Debugging tools in R
**traceback**: prints the function call stack after an error occurs.  
**debug**: flags a function for "debug" mode, which executes it one line at a time.  
**browser**: suspends the execution of a function wherever it is called and puts the function in debug mode.  
**trace**: allows you to insert debugging code into a function at specific places. Usually done on someone else's code.  
**recover**: allows you to modify the error behavior so that you can browse the function call stack.  


In [None]:
mean(x) #gives an error if x doesn't exist
traceback() #shows where the error occurred; it needs to be called immediately after the error

lm(y ~ x) #gives an error if x and y don't exist
traceback()

debug(lm)
lm(y ~ x) #it'll open the browser in an environment with only lm's objects and show the entire code of lm; by pressing n + enter you can execute one line at a time.

options(error=recover)
read.csv("nosuchfile")

## str function
Short for structure. It's a diagnostic function, and a compact one-line alternative to summary.

In [None]:
str(lm)

x <- rnorm(100, 2, 4)
summary(x)
str(x)

library(datasets)
head(airquality)
str(airquality)

m <- matrix(rnorm(100), 10, 10)
m
str(m)

s <- split(airquality, airquality$Month)
str(s)

## Generating Random Numbers

rnorm generates random normal variates given mean and std. dev.

dnorm: evaluate normal probability density (given mean and std. dev.) at a point.

pnorm: evaluate the cumulative distribution function for normal distribution.

rpois: generate random Poisson variates with given rate.

Every distribution has 4 functions:
- d for density
- r for random number generation
- p for cumulative distribution
- q for quantile function
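
A short sketch of the four prefixes, using the normal distribution plus one Poisson call. The seed and parameter values are arbitrary.

```r
set.seed(42)                        # arbitrary seed, only for reproducibility
x <- rnorm(5, mean = 10, sd = 2)    # r: five random draws

dnorm(0)             # d: density of N(0, 1) at 0, i.e. 1 / sqrt(2 * pi)
pnorm(1.96)          # p: P(X <= 1.96), roughly 0.975
qnorm(0.975)         # q: inverse of pnorm, roughly 1.96
qnorm(pnorm(1.5))    # gives back 1.5, up to floating point error

dpois(0, lambda = 2) # same naming pattern for Poisson: P(X = 0) = exp(-2)
```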

In [None]:
#set.seed fixes the starting point of the random number generator, so the same random numbers are generated again
set.seed(1)
a <- rnorm(5)

b <- rnorm(5)

set.seed(1)
c <- rnorm(5)

a
b
c

d <- rpois(10,1) #generate 10 Poisson counts with rate (mean) 1
e <- rpois(10,10) #generate 10 Poisson counts with rate (mean) 10

f <- ppois(2,2) #probability of getting 2 or less if the rate is 2
g <- ppois(4, 2) #probability of getting 4 or less if the rate is 2

d
e
f
g


## Simulating Linear Model

We can simulate value from a model (e.g. linear model).

For linear model, we can consider
y = mx + c + epsilon
where epsilon is noise value with given mean and std. dev.

Now, suppose we need a count (integer) outcome instead of a continuous one. The error distribution is then Poisson rather than normal.

## For Poisson model
Y ~ Poisson(mu)
log mu = mx + c



In [None]:
set.seed(20) #for reproducibility
a <- rnorm(100)
b <- rnorm(100, 0, 2) #mean of epsilon is 0, std. dev. is 2
c <- .5 + 2 * a + b #intercept = .5, slope m = 2

summary(c)
plot(a,c)

#considering x as binary variable (data against male-female)
set.seed(10) #for reproducibility
d <- rbinom(100, 1, .5)
e <- rnorm(100, 0, 2) #mean of epsilon is 0, std. dev. is 2
f <- .5 + 2 * d + e #intercept = .5, slope m = 2

summary(f)
plot(d,f)

#simulating from Poisson model
#with intercept .5 and slope .3

set.seed(1)
i <- rnorm(100) #independent variable
log.mu <- .5 + .3 * i #linear predictor, on the log scale
j <- rpois(100, exp(log.mu)) #exponentiate the log to get the mean

summary (j)
plot(i, j)

## Random Sampling

In [None]:
set.seed(1)
a <- sample(1:10, 4) #sample 4 values from 1 to 10 without replacement
a

b <- sample(1:10, 4)
b

c <- sample(letters, 5)
c

d <- sample(1:10) #1 to 10 in random order
d

e <- sample (1:10)
e

f <- sample(1:10, replace = TRUE) #sample with replacement
f

## R Profiler

The R profiler shows where a program spends its time, and so why it is slow. Useful for large programs.

Make it work first, make it readable, and only after that optimize.

system.time() tells us the time required for a computation. User time is the CPU time used; elapsed time is the wall-clock time you experience.

Rprof() starts the profiler.
summaryRprof() summarises the output from Rprof().
Don't use system.time() and Rprof() together.

by.total gives the total time spent in a function, including the functions it calls.
by.self gives the time spent in a function after subtracting the time spent in the lower-level helper functions it calls.
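
A minimal sketch of both tools, used one after the other rather than together. The busy function is a made-up workload that keeps the CPU occupied for about half a second, and the profile is written to a temporary file. It assumes an R build with profiling support.

```r
# Hypothetical workload: sorts random numbers until 'seconds' have passed
busy <- function(seconds = 0.5) {
  start <- proc.time()[["elapsed"]]
  while (proc.time()[["elapsed"]] - start < seconds) {
    sort(runif(1e4))
  }
  invisible(NULL)
}

timing <- system.time(busy())
timing[["user.self"]]    # CPU time charged to this R process
timing[["elapsed"]]      # wall-clock time

prof_file <- tempfile()
Rprof(prof_file)         # start the profiler
busy()
Rprof(NULL)              # stop the profiler
prof <- summaryRprof(prof_file)
head(prof$by.self)       # time spent in each function itself
head(prof$by.total)      # time including the functions it calls
```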

In [None]:
#elapsed time > user time
system.time(readLines("http://www.jhsph.edu")) #as network time adds to elapsed time

#elapsed time < user time. This can happen when the computation runs on several cores, as user time adds up the CPU time of all of them.
hilbert <- function(n){
    i <- 1:n
    1/outer(i-1,i,"+")
}
x <- hilbert(1000)
system.time(svd(x))



# Course 3: Getting and Cleaning Data

## Reading Data

Raw data > Processing script > Tidy processed data > Analysis > Communication

### Components of tidy data

Deliverables
1. Raw data
2. Tidy data
3. Metadata: Code book describing each variable and its values in tidy data set. E.g. unit of a column of tidy data.
4. Explicit and exact recipe you used to go from 1 to 2 and 3.

Components of tidy data:
1. Each variable in 1 column.
2. Each observation of that variable in different row.
3. One table for each kind of variable.
4. Linking column of different tables (keys).
5. Having a row with human readable variable names - AgeAtDiagnosis instead of AgeDx.
6. In general, data should be saved one file per table.

The code book:
1. Code book: Each variable and their units
2. Summary
3. Info about data collection experiment.

Instruction list:
1. A script that takes the raw data as input and outputs the tidy data, with no manual modification needed.
2. Comments on the steps.

### Get/set working directory

getwd() for getting current directory, setwd() for setting new directory.

On Windows machines, paths use backslashes, which have to be doubled inside R strings (e.g. "C:\\directory").

file.exists("directoryName") checks whether a file or directory named directoryName exists in the current directory.
dir.create("directoryName") creates a directory named directoryName.

In [None]:
if(!file.exists("data")){dir.create("data")} #searches for subdirectory "data" in current directory, and creates the subdirectory if it doesn't exist.

download.file() downloads a file from the internet.

In [None]:
#fileUrl <- "https://"
#download.file(fileUrl, destfile = "./destDirectory/data.csv", method = "curl")
#list.files("./destDirectory")

#dateDownloaded <- date(), as data changes with date
#dateDownloaded

### Loading local flat file

#### Reading CSV files
read.table() is the most common function, though it's not the best tool for loading big data. Important parameters include file, header, sep, row.names etc.
- quote = "" means no quotes.
- na.strings - sets the character that represents missing values.
- nrows - how many rows to read.
- skip - number of lines to skip before starting to read.
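
A small sketch of these parameters working together. The file is a made-up semicolon-separated export with a title line and "?" as the missing value marker, written to a temporary path so the example is self-contained.

```r
# Hypothetical file: one title line, then a header and three records
path <- tempfile(fileext = ".txt")
writeLines(c("survey export",
             "name;age;city",
             "Ann;31;Oslo",
             "Bob;?;Rome",
             "Cy;45;Lima"), path)

dat <- read.table(path, header = TRUE, sep = ";",
                  skip = 1,            # jump over the title line
                  na.strings = "?",    # "?" marks a missing value
                  quote = "",          # treat quote characters as ordinary text
                  nrows = 2,           # read only the first two records
                  stringsAsFactors = FALSE)
dat   # two rows; Bob's age is NA
```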

#### Reading Excel files
The "xlsx" library can be used for opening Excel files.
Parameters of read.xlsx include sheetIndex and header = TRUE, plus colIndex (e.g. 2:3) and rowIndex (e.g. 1:4) for loading a specific portion of the file.

write.xlsx can be used to write to an Excel file.
read.xlsx2 can be faster than read.xlsx, but it's a bit unstable when reading subsets.

XLConnect is a good library if someone works with loads of excel files.
XLConnect vignette: https://cran.r-project.org/web/packages/XLConnect/vignettes/XLConnect.pdf

Usually, csv is faster to work with.

#### Reading XML data
Usually received from structured web data. Two components are Markup and Content.

Tags
Start tags <section>
End tags </section>

Attributes
<img src="jeff.jpg" alt="instructor"/>


## Reading Data from Web

### Reading XML data
We need XML library.

Xpath language is used for extracting xml data. Tutorial - stat.berkeley.edu/~statcur/Workshop2/Presentations/XML.pdf


In [None]:
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternal=TRUE) #this loads the page in memory
rootNode <- xmlRoot(doc) #wrapper of entire doc
xmlName(rootNode) #get the name out

names(rootNode) #tells all the nested elements of doc

rootNode[[1]] #first element of rootNode
rootNode[[1]][[1]] #first sub component of first element

xmlSApply(rootNode, xmlValue) #loop through all values of rootNode, and return xml values of all tags

xpathSApply(rootNode,"//name",xmlValue) #returns "name" node values
xpathSApply(rootNode,"//price",xmlValue) #returns "price" node values

#if we extract information from html file instead of xml file, we'll use html tree parse instead of xml
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl,useInternal=TRUE) #useInternal = TRUE returns all nodes

#extracting specific data. For this, we need take a look at source code of page, and decide what node we're interested in.
scores <- xpathSApply(doc,"//li[@class='score']",xmlValue)
teams <- xpathSApply(doc,"//li[@class='team-name']",xmlValue)

#show the data
scores
teams

### Reading JSON data

Organized data structure that is widely used.

Good tutorial: r-bloggers.com/new-package-jsonlite-a-smarter-json-encoderdecoder

In [None]:
#Reading JSON data
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData) #names of objects and attributes
names(jsonData$owner) #attribute info of a specific object
jsonData$owner$login #entries of attribute

#writing data frames in JSON
myjson <- toJSON(iris, pretty=TRUE) #iris is a popular dataset, which'll be converted to JSON. Pretty = TRUE indents the file.
cat(myjson) #prints the file

iris2 <- fromJSON(myjson)
head(iris2)

### Data.table package

Faster and more memory-efficient than a data frame.

In [None]:
library(data.table)
DF = data.frame(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
dfh <- head(DF,3)
dfh

DT = data.table(x=rnorm(9),y=rep(c("a","b","c"),each=3),z=rnorm(9))
dth <- head(DT,3)
dth

#summary of data
tables()

#subsetting rows
DT[2,] # select all columns of 2nd row
DT[DT$y=="a",] #select all rows where y is "a", like a filter on that column

DT[c(2,3)] #with a single index, data tables subset rows: this returns the 2nd and 3rd rows
DT[,c(2,3)] #the second index is evaluated as an expression; older data.table versions returned just c(2,3) here, while recent versions return the 2nd and 3rd columns

#An expression in R is a collection of statements enclosed in curly brackets
DT[,list(mean(x),sum(z))] #will return mean of x, and sum of z values
DT[,table(y)] #returns a table of counts of the y values

#adding new columns
DT[,w:=z^2] #creates a new column with name w, which includes square of z values - like mutate

#Multiple operations
DT[,m:= {tmp <- (x+z); log2(tmp+5)}] #store x+z in a temporary variable, add 5 and take the base-2 log; the last value of the expression is stored in new column m - like an advanced mutate

#Grouping
DT[,a:=x>0] #this adds a boolean column a
DT[,b:=mean(x+w),by=a] #this'll calculate mean of x+w over the rows where a is TRUE, and store it in all rows with a = TRUE. It does the same for the rows where a is FALSE.

#Special character (e.g. .N)
set.seed(123);
DT <- data.table(x=sample(letters[1:3], 1E5, TRUE)) #10^5 a, b and cs
DT[, .N, by=x] #.N counts the number of times a group (here by x) appears.

#Keys: if we define keys, subsetting can be done easily

DT <- data.table(x=rep(c("a", "b", "c"), each=100), y=rnorm(300)) #a table of two variable
setkey(DT,x) #set variable x as key
DT['a'] #subsets the rows where the key x is "a"

#Keys to join multiple tables
DT1 <- data.table(x=c('a', 'a', 'b', 'dt1'), y=1:4)
DT2 <- data.table(x=c('a', 'b', 'dt2'),z=5:7)
setkey(DT1, x); setkey(DT2, x)
merge(DT1, DT2)

#Reading from external source
#fread

### mySQL
Data is structured in databases: each database has many tables, each table has many fields, and each row of a table is a record.

Different tables are connected with keys.

The first step is to install MySQL from its website, and then install the RMySQL package for R.

Tutorial for windows install:
http://www.ahschulz.de/2013/07/23/installing-rmysql-under-windows/

mySQL commands collection:
http://www.pantz.org/software/mysql/mysqlcommands.html

R commands for mySQL:
https://www.r-bloggers.com/2011/08/mysql-and-r/

In [None]:
#loading entire database
library(RMySQL) #provides MySQL(), dbConnect() etc.
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu") #this will create a connection
result <- dbGetQuery(ucscDb,"show databases;"); dbDisconnect(ucscDb) #run the query, get the data, and disconnect, which should return "TRUE"
result #shows available databases

#Loading specific database
hg19 <- dbConnect(MySQL(),user="genome",db="hg19",host="genome-mysql.cse.ucsc.edu") #one particular database "hg19"
allTables <- dbListTables(hg19) #there are multiple tables under this database
length(allTables) #number of tables
allTables[1:5] #first 5 tables

#loading specific table
dbListFields(hg19,"affyU133Plus2") #loading one table, returns fields list. hg19 was the connection to the table in database.

dbGetQuery(hg19,"select count(*) from affyU133Plus2") #count the rows of the table with a MySQL query

#read from table
affyData <- dbReadTable(hg19,"affyU133Plus2") #read table
head(affyData) #first 6 rows

#reading specific amount of data
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3") #sends the query for a subset (here specific values of a specific field) without fetching the result yet, as the entire data can be too big
affyMis <- fetch(query); quantile(affyMis$misMatches) #fetch the result and summarise the misMatches values we selected above
affyMisSmall <- fetch(query,n=10); dbClearResult(query) #loading a small amount of data. We need to clear the query after fetching data, which'll return "TRUE".
dim(affyMisSmall) #returns dimension of the data

#Remember to close the connection
dbDisconnect(hg19) #should return "TRUE"

### Reading from HDF5
Used to store large hierarchical data.

The R package is installed through bioconductor.

Tutorial:
https://www.bioconductor.org/packages/release/bioc/manuals/rhdf5/man/rhdf5.pdf

In [None]:
#installing hdf5 R library
#biocLite() has been retired; current Bioconductor releases install packages through BiocManager
install.packages("BiocManager")
BiocManager::install("rhdf5")

library(rhdf5) #loading library
created = h5createFile("example.h5") #creating hdf5 file

#creating hierarchical group
created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
created = h5createGroup("example.h5", "foo/foobaa")
h5ls("example.h5")

#writing to hdf5
A = matrix(1:10,nr=5,nc=2) #creating data
h5write(A,"example.h5","foo/A") #example.h5 is the file, foo/A is the group
B = array(seq(0.1,2.0,by=0.1),dim=c(5,2,2)) #we can also write multi dimensional array
attr(B,"scale") <- "liter" #we can add attribute
h5write(B,"example.h5","foo/foobaa/B") #we can add the array to a particular sub group
h5ls("example.h5") #what are the data

#writing a data set
df = data.frame(1L:5L,seq(0,1,length.out=5),
               c("ab","cde","fghi","a","s"),stringsAsFactors = FALSE)
h5write(df,"example.h5","df") #writing data frame directly in top level group
h5ls("example.h5") #data summary

#reading data
readA = h5read("example.h5","foo/A") #reading data
readB = h5read("example.h5","foo/foobaa/B") #reading sub dataset
readdf = h5read("example.h5","df") #reading from top level group
readA

#writing and reading in chunks
h5write(c(12, 13, 14),"example.h5","foo/A",index=list(1:3,1)) #in dataset foo/A of example.h5, overwrite rows 1 to 3 of column 1 with the values 12, 13, 14
h5read("example.h5","foo/A") #h5read also accepts an index argument, like the one above, to read only part of a dataset

### Reading data from the web

**Webscraping**
Programmatically extracting data from the HTML code of websites.

Interesting story:
https://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/

Search for "web scraping" on R-bloggers for more examples.
The httr help file is helpful.

In [None]:
connec = url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en") #connect with site
htmlCode = readLines(connec) #Read data
close(connec) #close connection after usage
htmlCode #the data collected

#Getting readable data from site using XML library
library(XML)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes=T)

xpathSApply(html,"//title", xmlValue)
xpathSApply(html,"//td[@id='col-citedby']", xmlValue)

#Using GET from HTTR package
library(httr); html2 = GET(url)
content2 = content(html2,as="text")
parseHtml = htmlParse(content2, asText = TRUE)
xpathSApply(parseHtml, "//title", xmlValue)

#Accessing websites with passwords
pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
pg1 #will ask for password with status 401

pg2 = GET("http://httpbin.org/basic-auth/user/passwd",
         authenticate("user","passwd")) #test website, username is "user" and password is "passwd"
pg2 #will return status 200

#Using handles, to keep cookies and authentication across multiple requests to the same site
google = handle("http://google.com")
pg1 = GET(handle=google,path="/")
pg2 = GET(handle=google, path="search")

### Reading from APIs
For accessing Twitter data, we need to create an application.
In general, APIs require reading the documentation.

HTTR works well with FB, Google, Twitter, Github etc.
HTTR demos in Github can be helpful.

(You had already done this homework in CS50!)

In [None]:
library(httr) #provides oauth_app, sign_oauth1.0, GET and content
myapp = oauth_app("twitter", key="yourConsumerKeyHere", secret="yourConsumerSecretHere") #pass your credential
sig = sign_oauth1.0(myapp, token="yourTokenHere", token_secret="yourTokenSecretHere") #generate authentication
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json",sig) #what data I want to get by passing the authentication of previous line.
                                                                            #The URL is found at Twitter documentation > GET statuses/home_timeline > resource URL

json1 = content(homeTL) #returns structured R object (which is hard to read)
json2 = jsonlite::fromJSON(jsonlite::toJSON(json1)) #toJSON turns the R object back into JSON text, fromJSON then parses it into a data frame
json2[1,1:4] #look at first row of first 4 columns of the data frame

### R Package for almost all needs

file - open a connection to local text file.
url - open a connection to URL
gzfile - open a connection to a .gz file
bzfile - open a connection to a .bz2 file
?connections for more info

**It's important to close connection after use**
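A minimal sketch of the connection pattern above, using only base R. The temporary file and its contents are made up for this example; the same open, read, close sequence applies to `url`, `gzfile` and `bzfile`.

```r
# Create a small temporary text file (hypothetical data, only for this example).
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

con <- file(tmp, open = "r")   # open a connection to a local text file
lines_read <- readLines(con)   # read every line through the connection
close(con)                     # close the connection after use

lines_read
```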

Foreign package
read.foo: read.dta(stata), read.spss(SPSS)

It's possible to read image data, GIS data (rgdal, rgeos, raster), music (tuneR, seewave).

You can find packages just by googling.

## Raw to Tidy Data
### Subsetting data
Lecture notes: www.biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf

In [None]:
#Subsetting
set.seed(13435) #for reproducible sampling
x <- data.frame("var1"=sample(1:5), "var2"=sample(6:10), "var3"=sample(11:15)) #has 3 variables, each variable is a random ordering of 5 values
x <- x[sample(1:5),];x$var2[c(1,3)]=NA #now shuffle the 5 rows (all 3 columns are kept), and make some values NA
x

#Row and column wise subsetting. Remember "[]" is used to subset multiple values.
x[,1] #Select first column
x[,"var1"] #select all rows of var1 column
x[1:2, "var2"] #select first 2 rows of var2 column

#Logicals, ands and ors
x[(x$var1 <= 3 & x$var3 > 11),] #select the rows of all columns, where value of var1 is less than-equal to 3 and value of var3 is greater than 11
x[(x$var1 <= 3 | x$var3 > 15),] #select the rows of all columns, where value of var1 is less than-equal to 3 or value of var3 is greater than 15

#Dealing with missing values
x[which(x$var2 >= 6),] #select all rows where var2 is greater than or equal to 6; which() drops the NA values of var2. Since NA only exists in var2, the resulting data frame is NA free.

#sorting
sort(x$var1) #sort var1 values in ascending order
sort(x$var1, decreasing = TRUE) #sort var1 in descending order
sort(x$var2, na.last=TRUE) #put NA at last
x[order(x$var1),] #order by a column value - apply ordering on var1 first, then use the ordering in the data frame
x[order(x$var1,x$var3),] #it'll order the data by var1, and then if there are multiple same values of var1, it'll order in those by var3 values

##Ordering with plyr
#library(plyr)
#arrange(x,var1) #pass data frame, variable - and it'll sort the data frame by that variables

#arrange(x,desc(var1)) #data frame, variable wrapped in desc() for descending order

#Adding new column
x$var4 <- rnorm(5)

#Adding rows and columns
y <- cbind(x,rnorm(5)) #column bind, add a new column of 5 normal variable at right-most side
y <- cbind(rnorm(5), x) #column bind in leftmost side of x
y <- rbind(x,rnorm(ncol(x))) #bind a row at the end, with one value per column of x

### Summarizing data - Very Useful
In order to clean data, we need to look at summaries and spot anything weird in the data.

Example, restaurant data of Baltimore city government.

In [None]:
if(!file.exists("./data")){dir.create("./data")} #if directory doesn't exist, create a directory
fileUrl <- "file URL.csv?accessType=DOWNLOAD" #URL of the file to download
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl") #where to download data
restData <- read.csv("./data/restaurants.csv") #loading the downloaded data

head(restData, n=3) #load first 3 rows of the data
tail(restData, n=3) #load last 3 rows of the data

#overall summary of data
summary(restData)
str(restData)
quantile(restData$councilDistrict,na.rm=TRUE) #quantile of values
quantile(restData$councilDistrict,probs=c(0.5, 0.75, 0.9)) #50th is median, 75th is 3rd quarter etc.
table(restData$zipCode,useNA="ifany") #useNA="ifany" = if there are missing values, count them separately
table(restData$councilDistrict,restData$zipCode) #two dimensional matrix of two variables

#check for missing values
sum(is.na(restData$councilDistrict )) #total missing values
any(is.na(restData$councilDistrict)) #if there are any missing values
all(restData$zipCode > 0) #whether every single value satisfies the condition or not

#row and column sums
colSums(is.na(restData)) #colSums and rowSums, gives back sum of NA values for each column
all(colSums(is.na(restData))==0) #if all colSums for NA values are 0 or not, returns boolean

#values with specific characteristics
table(restData$zipCode %in% c("what to look for")) #are there any "searched" values in zipCode column
table(restData$zipCode %in% c("21212", "21213")) #search if there are one or the other value in the column
restData[restData$zipCode %in% c("21212", "21213"),]#returns subsetted dataset with the rows where value
                                                    #in zipCode column is "21212" or "21213", for all columns

#cross tabs
data(UCBAdmissions) #load R dataset
DF = as.data.frame(UCBAdmissions) #create data frame
summary(DF) #summary of the data

xt <- xtabs(Freq ~ Gender + Admit, data=DF) #Freq is the data you want to show in table, you can break the shown data in multiple variables
                                            #+ that'll be shown in a multiple dimensional matrix (gender and admission status here)

#Flat tables - cross tabs with multiple variables
warpbreaks$replicate <- rep(1:9, len=54) #warpbreaks is another standard R dataset, a replicate variable is added as a third grouping variable
xt = xtabs(breaks ~., data=warpbreaks) #the value that'll appear in table is breaks, this will be broken by all variables through "."

#output above will be hard to understand because of multiple tables, Flat Tables can help us understand it
ftable(xt) #will summarize the data in a compact form

#Size of a data set
fakeData = rnorm(1e5)
object.size(fakeData)
print(object.size(fakeData),units="Mb") #size in Mb

### Creating New Variables
Often the raw data won't have a variable we need, so a transformation is required to create it. The new variables are usually added to the data frame.
Common variables include:
- Missingness indicator
- Cutting up quantitative variables into factor values
- Applying transforms

Categorical variable: all values belong to fixed groups.

Factor variable: numeric or string categorical value.
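The three common kinds of new variables listed above can be sketched in base R as follows. The `ages` vector and the age bands are hypothetical values chosen for this example, not part of the restaurant data.

```r
# Hypothetical measurements, made up for this example.
ages <- c(5, 17, 23, 41, 68, NA)

# Missingness indicator: TRUE where the value is missing
missing_age <- is.na(ages)

# Cutting a quantitative variable into a factor with one level per interval
age_group <- cut(ages, breaks = c(0, 18, 65, Inf),
                 labels = c("child", "adult", "senior"))

# Applying a transform
log_age <- log(ages)

table(age_group, useNA = "ifany")  # counts per category, NA counted separately
class(age_group)                   # cut() produces a factor variable
```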

In [None]:
if(!file.exists("./data")){dir.create("./data")} #if directory doesn't exist, create a directory
fileUrl <- "file URL.csv?accessType=DOWNLOAD" #URL of the file to download
download.file(fileUrl,destfile="./data/restaurants.csv",method="curl") #where to download data
restData <- read.csv("./data/restaurants.csv") #loading the downloaded data

s1 <- seq(1,10,by=2); s1 #sequence is used to index different operations on data. We specify the min value, max value.
                        #There are two ways to specify how many values to generate.
                        #One is by, which starts with min value and increments by 2.
s2 <- seq(1,10,length=3) #Another is length, which'll start at min, end at max, and create 3 equally spaced values.

x <- c(1,3,8,25,100); seq(along = x) #Yet another way is to use seq(along="data") that creates indices of same length.

#Subsetting variables
restData$nearMe = restData$neighborhood %in% c("Roland Park","Homeland") #add a new variable to the data frame that is TRUE for rows whose neighborhood is one of the specified values
table(restData$nearMe)

#creating binary variables
restData$zipWrong = ifelse(restData$zipCode <0, TRUE, FALSE) #assign TRUE if zipCode is less than 0, else FALSE
table(restData$zipWrong,restData$zipCode < 0) #show a table where whether zipcode is wrong and whether the zip code is less than zero (Less than Zero will be TRUE-TRUE)


#Creating categorical variables
restData$zipGroups = cut(restData$zipCode,breaks=quantile(restData$zipCode)) #cut (value to cut, cutting parameter) the zipcode data into quantiles
table(restData$zipGroups) #categorical factor variables - 0-25th, 25-50th, 50-75th, 75-100th

table(restData$zipGroups, restData$zipCode) #which values lands in which clusters

#Easier cutting with Hmisc library
library(Hmisc) #provides cut2
restData$zipGroups = cut2(restData$zipCode,g=4) #just mention to cut the data in 4 pieces - it'll find the quantiles and break
table(restData$zipGroups)

#Creating factor variables
restData$zcf = factor(restData$zipCode)
restData$zcf[1:10]
class(restData$zcf)

#Levels of factor variable
yesno <- sample(c("yes", "no"), size = 10, replace = TRUE) #create a vector of 10 randomly repeated "yes" and "no"
yesnofac <- factor(yesno, levels=c("yes","no")) #turn the vector into factor variable. By default levels are in alphabetical order ("no" first), but levels= defines the order.
relevel(yesnofac,ref="yes") #relevel is another way to choose the first (reference) level
as.numeric(yesnofac) #to change the factor back to numeric codes - it'll assign 1 to the first level, then 2, 3...

#Cutting produces factor variables
library(Hmisc)
restData$zipGroups = cut2(restData$zipCode,g=4) #cut the variables into 4 groups
table(restData$zipGroups) #the class will be factor variable

#Using mutate function
library(Hmisc); library(plyr) #cut2 is in Hmisc library, Mutate is in plyr library
restData2 = mutate(restData, zipGroups = cut2(zipCode,g=4)) #create a new dataframe restData2,
                                                            #and simultaneously create a new variable zipGroups from old dataframe restData,
                                                            #and add the new variable with the old dataframe restData
table(restData2$zipGroups) #view the table

### Common transformations
- abs(x) #absolute value
- sqrt(x) #square root
- ceiling(x) #ceiling(3.475) is 4
- floor(x) #floor(3.475) is 3
- round(x,digits=n) #round(3.475,digits=2) is 3.48
- signif(x,digits=n) #signif(3.475,digits=2) is 3.5
- cos(x), sin(x) etc.
- log(x) #natural logarithm
- log2(x), log10(x) #other common logs
- exp(x) #exponentiating x


### Notes and further reading
A tutorial from the developers of plyr http://plyr.had.co.nz/09-user/
Andrew Jaffe's R notes on transformation biostat.jhsph.edu/~ajaffe/lec_winterR/Lecture%202.pdf

### Reshaping data

Nice tutorial on reshaping
slideshare.net/jeffreybreen/reshaping-data-in-r
Good plyr primer - r-bloggers.com/a-quick-primer-on-split-apply-combine-problems/

Other important functions:
- acast - (like dcast) for casting multi-dimensional arrays
- arrange - for faster reordering without using order() commands
- mutate - adding new variables

In [None]:
library(reshape2) #loading library, provides melt and dcast
head(mtcars) #standard dataset of R


#Melting data frames
mtcars$carname <- rownames(mtcars) #make a vector of car names
carMelt <- melt(mtcars, id=c("carname","gear","cyl"),measure.vars=c("mpg","hp")) #define which variables are ID, and which are measure variables with "melt" function.
head(carMelt,n=3) #this'll create rows against 3 ID groups - carname, gear, cyl; and create a variable column for mpg or hp
tail(carMelt,n=3)


#casting data frames
cylData <- dcast(carMelt, cyl ~ variable) #This'll create a matrix with different cylinder values in row
                                        #and variables (mpg and hp in this case) in column values
                                        #by default it summarises the variables in count (how many data are there)
cylData <- dcast(carMelt, cyl ~ variable, mean) #instead of count, we can get mean values of variables 

#Averaging values
head(InsectSprays) #loading standard dataset of R
tapply(InsectSprays$count,InsectSprays$spray,sum) #tapply - apply a function (sum) to a vector (InsectSprays$count), within each group of an identifier (InsectSprays$spray)

#Another way to average values
spIns = split(InsectSprays$count, InsectSprays$spray) #take InsectSprays$count, and split them by each of the different InsectSprays$spray
spIns

#1 - lapply
sprCount = lapply(spIns, sum) #take the sum for each of the separated spray values - returns list
unlist(sprCount) #make a vector from the list

#2 - sapply
sapply(spIns,sum) #return vector of sums of sprays

#3 - with plyr package
library(plyr) #provides ddply
ddply(InsectSprays,.(spray),summarize,sum=sum(count)) #ddply(data, .(variables to group by), summarize within each group,
                                                    #the summary we want is the sum of counts)
#This can also be used to calculate a value and attach it to every row.
#Let's assume we want to subtract the total count from the actual count
spraySums <- ddply(InsectSprays,.(spray),summarize,sum=ave(count,FUN=sum)) #ave applies sum to count within each spray group and repeats the result,
                                    #so spraySums has the same number of rows as the original dataset:
                                    #every time spray A is in the dataset, it'll repeat the sum of all values of A


### Continued: dplyr package for managing data frames
The dplyr package is specific to data frames. It's an updated version of the plyr package.
Loading this package can generate warnings (e.g. about masked functions), but it's no issue.

Assumptions of dplyr package:
- There is one observation per row
- Each column represents a field
- Primary implementation will be R
- Other implementations, e.g. the data.table package and relational database systems, can be used

#### dplyr verbs

**Select**
Returns a subset of columns of a dataframe.  

**Filter**
Extract a subset of rows from a data frame based on logical conditions.  

**Arrange**
Reorder rows of a dataframe.  

**Rename**
Rename variables in a dataframe.  

**Mutate**
Add new variables/columns or transform existing variables.  

**Summarize**
Generate summary statistics of different variables in data frame.

#### dplyr properties
- First argument is a data frame.
- Subsequent arguments describe what to do with it.
- Columns can be referred without "$" operator.
- Result is a new dataframe.
- Data frames must be properly formatted and annotated (variable names, factor names etc.) for this all to be useful.

Can be used widely (e.g. with data.table, SQL)
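A minimal runnable sketch of the verbs chained together, assuming the dplyr package is installed. It uses the built-in `mtcars` dataset rather than the course data; the `kpl` column, the conversion factor and the `hp > 100` threshold are choices made only for this example.

```r
library(dplyr)

cars <- mtcars
cars$carname <- rownames(mtcars)  # keep the car names as a regular column

result <- cars %>%
  select(carname, mpg, cyl, hp) %>%         # select: keep a subset of columns
  filter(hp > 100) %>%                      # filter: keep rows meeting a condition
  mutate(kpl = mpg * 0.425144) %>%          # mutate: add a new column (km per liter)
  group_by(cyl) %>%                         # group rows by number of cylinders
  summarize(mean_kpl = mean(kpl), n = n()) %>%  # summarize: one row per group
  arrange(desc(mean_kpl))                   # arrange: reorder rows

result
```

Every verb takes a data frame first and returns a new data frame, which is what makes the chaining work.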

#### dplyr

In [None]:
##Select

library(dplyr)
dataName <- readRDS("name.rds") #can be other file formats
dim(dataName)
str(dataName)

#Getting field names for subsetting columns
names(dataName) #returns field names

#Selecting multiple columns
head(select(dataName,col1:col3)) #returns columns col1 through col3 by name, which base R can't do directly
head(select(dataName, -(col1:col3))) #returns all columns except col1 to col3

#Doing the same operation without dplyr
i <- match("colName1", names(dataName))
j <- match("colName2", names(dataName))
head(dataName[,-(i:j)]) #which requires extra code


##Filter
subsetData1 <- filter(dataName, colN > 10) #creating filter to take rows with colN values greater than 10
subsetData2 <- filter(dataName, col1 > 10 & col2 < 20) #filter with multiple arguments


##Arrange
dataName <- arrange(dataName, colDate) #arrange the data based on date (stored in colDate) with lowest to highest date
head(dataName)
tail(dataName)

dataName <- arrange(dataName, desc(colDate)) #arrange against dates in descending order (greatest to lowest)


##Rename
dataName <- rename(dataName, newCol1Name = oldCol1Name, newCol2Name = oldCol2Name)
head(dataName)


##Mutate - to create new variable and adding them to data frame
dataName <- mutate(dataName, newCol = col1-mean(col1, na.rm=TRUE)) #create a new column, which is old values of col1 - mean value of col1
head(select(dataName, col1, newCol))

dataName <- mutate(dataName, newCol2 = factor(1*(col2 > 100), labels = c("Yes", "No"))) #create a new field and append to dataframe
                                                                                        #"No" if value of a row-cell of col2 is greater than 100, else "Yes"
dataGroup1 <- group_by(dataName, newCol2)
dataGroup1

summarize(dataGroup1, col1 = mean(col1, na.rm = TRUE), col2 = max(col2), col3 = median(col3)) #I want to know the mean of col1 for "Yes" and "No", similarly the max of col2 and the median of col3.

dataName <- mutate(dataName, year=as.POSIXlt(colDate)$year + 1900) #similarly, we can take summary over years
years <- group_by(dataName, year)
summarize(years, col1 = mean(col1, na.rm = TRUE), col2 = max(col2), col3 = median(col3))

# %>% (Pipeline) Operator
dataName %>% mutate(month = as.POSIXlt(colDate)$mon + 1) %>% group_by(month) %>% summarize(col1 = mean(col1, na.rm = TRUE), col2 = max(col2), col3 = median(col3))
#here, output of one operation is fed to next operation via pipeline operator. It allows you to not use many temporary variables, and it's possible to chain related operations.

### Merging data
- R data merging page: statmethods.net/management/merging.html
- plyr information
- Types of joins: wikipedia Join_(SQL)
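To make the join types concrete, here is a small sketch with base R's `merge()`. The two data frames `left` and `right` are hypothetical and exist only for this example.

```r
# Two small hypothetical tables sharing an "id" column.
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

inner     <- merge(left, right, by = "id")                # inner join: ids present in both
outer     <- merge(left, right, by = "id", all = TRUE)    # full outer join: all ids, NA where missing
left_only <- merge(left, right, by = "id", all.x = TRUE)  # left join: every row of left is kept

inner
outer
left_only
```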

In [None]:
if(!file.exists("./data")){dir.create("./data")} #create directory if doesn't exist
fileUrl1 = "URL.csv" #URL of file 1
fileUrl2 = "URL.csv" #URL of file 2
download.file(fileUrl1,destfile="./data/file1.csv",method="curl") #download file 1
download.file(fileUrl2,destfile="./data/file2.csv",method="curl") #download file 2
file1 = read.csv("./data/file1.csv"); file2 = read.csv("./data/file2.csv") #read the files
head(file1,2) #show first 2 rows of file 1
head(file2,2) #show first 2 rows of file 2

#Merging two datasets that share a common ID with merge()
#important parameters: x, y, by, by.x, by.y, all
#(by default, by is all columns with common names, and all=FALSE keeps only the rows that match in both)

mergedData = merge(file1, file2, by.x="file1CommonID",by.y="file2CommonID",all=TRUE)
#all=TRUE means if there are missing values in one against other, it should include them with NA values

head(mergedData)

intersect(names(file1), names(file2)) #check which field names of file1 and 2 are similar, since they can be used for merging

mergedData2 = merge(file1, file2, all=TRUE) #it'll try to merge against all columns with same name
#since many data won't match directly, it can end up adding all rows of both files in the merged file

#Joining with plyr package
library(plyr) #provides join, join_all and arrange
df1 = data.frame(id=sample(1:10), x=rnorm(10)) #data frame of two columns, first one is sampled from 1 to 10, second one is 10 random number
df2 = data.frame(id=sample(1:10), y=rnorm(10)) #same as previous
arrange(join(df1, df2), id) #plyr can only merge identically named columns automatically. It arranges the data by increasing order of 'id'.

df3 = data.frame(id=sample(1:10),z=rnorm(10))
dfList = list(df1, df2, df3)
join_all(dfList) #it's simpler to join multiple dataframes with plyr's "join_all" command, that merges based on common variable

## Editing Data

### Editing text variables
How to edit text variables to make them tidy.

Names of variables should be:
- All lower case when possible
- Descriptive (diagnosis versus Dx)
- Non duplicated
- Not have underscores, dots or white spaces

Variables with character values
- Should usually be made factor variables
- Should be descriptive (should be TRUE/FALSE instead of 1/0 and Male/Female instead of M/F)

In [None]:
tolower(names(fileName)) #make all field names lower case
toupper(names(fileName)) #make all field names upper case

splitNames = strsplit(names(fileName),"\\.") #split the column names where there is a ".". As "." is a reserved character, we had to use "\\".

#revise - lists
mylist <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:25, ncol =5))
head(mylist)
mylist[1] #will return the first element - letters
mylist$letters #will return the element named "letters"
mylist[[1]] #will return the sequence of first element

#in the data, if the field with "." in name is 6th element
splitNames[[6]][1] #will return the first element (before ".") of 6th element

firstElement <- function(x){x[1]} #select first element
sapply(splitNames,firstElement) #split names, select first elements

#Substitute - sub()
sub("_","",names(fileName)) #substitute the first "_" in each field name with (blank) ""
gsub("_","",stringName) #gsub substitutes every occurrence of "_" in a string

#Search strings
grep("searchString",fileName$colName) #will return element positions
grep("searchString",fileName$colName, value=TRUE) #will return elements

length(grep("searchString",fileName$colName)) #if the length is 0, nothing matched

table(grepl("searchString",fileName$colName)) #grepl returns a logical vector with TRUE (matches) and FALSE (non-matches) values, table counts them

dataName2 <- dataName[!grepl("searchString",dataName$colName),] #subset the rows where "searchString" doesn't appear

#stringr library
library(stringr) #only str_trim below needs stringr, the other functions are base R

nchar("Sunzid Hassan") #returns the number of characters in provided string

substr("Sunzid Hassan", 1, 6) #will return 1st to 6th letter

paste("Sunzid", "Hassan") #will return 1 string separated by space, the separator can be set with sep

paste0("Sunzid", "Hassan") #will return 1 string without space

str_trim("Sunzid    ") #will trim off excess spaces at the beginning or end of string

### Regular Expressions

Can be thought of as a combination of literals (e.g. actual words) and metacharacters (symbols that express the search pattern).

**^ - start of line, $ - end of line**  
^I think #will match lines that start with "I think", but not lines where it appears in the middle or at the end.
morning$ #will match lines that end with "morning", but not lines where it appears at the start or in the middle.

**[ ]**  
[Bb] #will return match for either upper or lower case
^[Ii] am #will return both "I am" and "i am" at start.

**Search for range**  
^[0-9][a-zA-Z] #returns if line starts with these. Here, it'll search for numbers, followed by any letter.
[^?.]$ #"^" at the start of a character class negates it, so this matches lines whose last character is neither "?" nor ".".

**"."**  
can be used to represent any character
9.11 will match 9, then any single character, then 11 (e.g. 9-11, 9/11, 9:11).

**|** or metacharacter  
flood|fire will match lines containing either flood or fire.
flood|earthquake|hurricane will return many matches.

^[Gg]ood|[Bb]ad searches for Good/good at first, or Bad/bad anywhere in the line.

^([Gg]ood|[Bb]ad) Good/good/Bad/bad at beginning of the line.

[Gg]eorge( [Ww]\.)? [Bb]ush  
Question mark after parenthesis indicates the part is optional. So there can or can't be W/w. in the middle of G/george B/bush.
Adding \ before . means it's literal ".", and not a metacharacter.
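The patterns so far can be tried out with base R's `grepl()`. This is a small sketch; the `lines` vector is made up for the example, and note that a literal backslash has to be doubled inside an R string.

```r
# Hypothetical example lines.
lines <- c("I think it will rain",
           "So I think",
           "good morning",
           "morning walk",
           "George W. Bush spoke",
           "george bush spoke",
           "this is bad")

start_match <- grepl("^I think", lines)           # "I think" only at the start of a line
end_match   <- grepl("morning$", lines)           # "morning" only at the end of a line
either      <- grepl("^[Gg]ood|[Bb]ad", lines)    # good at the start, or bad anywhere
anchored    <- grepl("^([Gg]ood|[Bb]ad)", lines)  # good or bad, both only at the start
optional_w  <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", lines)  # the middle initial is optional

which(either)      # lines 3 and 7
which(anchored)    # line 3 only
which(optional_w)  # lines 5 and 6
```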

**\* and +**  
Indicate repetition. "*" means any number of the item, including none, and "+" means at least one of the item.

(.*) will match any character repeated any number of times, including nothing. To match literal parentheses with anything (or nothing) inside them, escape the parentheses: \(.*\)

[0-9]+(.*)[0-9]+ will look for at least one number [0-9]+, then anything or nothing in between (.*), and then at least one number again [0-9]+.

**{}** interval quantifier  
Min and max number of matches of an expression.
[Bb]ush( +[^ ]+ +){1,5} debate  
B/bush at first, then -  
"( +" indicates at least one space after that, followed by "[^ ]+" which indicates something that's not a space, followed by " +)" which is at least 1 space  -
{1,5} this entire space-word-space will be repeated 1 to 5 times.

{m,n} means at least m times, but not more than n times.
{m} means exactly m matches.
{m,} means at least m matches.
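A small sketch of the three interval forms with `grepl()`. The test strings are made up. The last pattern is a simplified variant of the Bush/debate expression above (one run of spaces before each word, instead of spaces on both sides), which is an assumption of this example rather than the pattern from the notes.

```r
x <- c("a", "aa", "aaa", "aaaa")

two_to_three <- grepl("^a{2,3}$", x)  # at least 2 but not more than 3
exactly_two  <- grepl("^a{2}$", x)    # exactly 2
at_least_two <- grepl("^a{2,}$", x)   # at least 2

# Simplified variant: "bush", then 1 to 5 words, then "debate"
phrases <- c("Bush won the debate",
             "bush debate",
             "Bush one two three four five six debate")
words_between <- grepl("[Bb]ush( +[^ ]+){1,5} +debate", phrases)

two_to_three
exactly_two
at_least_two
words_between
```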

In addition to limiting the scope of alternatives divided by "|", parentheses remember the text matched by the enclosed subexpression.
We refer to the matched text with \1, \2 etc.

 +([a-zA-Z]+) +\1 +  
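It'll search for at least one space " +", then at least one upper or lower case letter (remembered by the parentheses), then at least one space, then an exact repeat of the text matched within the parentheses "\1", then at least one space. In other words, it finds a repeated word such as " night night ".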
 
The * is greedy, so it matches the longest possible string that satisfies original expression.
 
^s(.*)s will match from the s at the start of the line to the last s in the line, which can span multiple words instead of one word.

^s(.*?)s$ makes the * lazy with "?", so .*? takes as few characters as possible; here the s$ still has to match at the end of the line.
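Backreferences and greedy versus lazy matching can be checked in R as sketched below. The strings are made up for the example, and `perl = TRUE` is used here so that the lazy `*?` follows Perl-style regex rules.

```r
# Backreference: find a word that is immediately repeated
sentences <- c("time for bed, night night twitter!",
               "nothing repeated here now")
repeated <- grepl(" +([a-zA-Z]+) +\\1 +", sentences)

# Greedy vs lazy: extract what each pattern actually matches
x <- "sitting at starbucks"
greedy <- regmatches(x, regexpr("^s(.*)s", x, perl = TRUE))   # runs to the last s
lazy   <- regmatches(x, regexpr("^s(.*?)s", x, perl = TRUE))  # stops at the next s

repeated
greedy
lazy
```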



### Working with dates

Abbreviation:
%d = day as number (01-31)
%a = abbreviated weekday
%A = unabbreviated weekday
%m = month (01-12)
%b = abbreviated month
%B = unabbreviated month
%y = 2 digit year
%Y = four digit year
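A quick sketch of these codes with base R's `format()` and `as.Date()`. The date is arbitrary. The name-based codes (%a, %A, %b, %B) depend on the system locale, so no output is shown for them.

```r
d <- as.Date("1960-03-31")

numeric_form <- format(d, "%d/%m/%Y")   # day, month, four digit year as numbers
short_year   <- format(d, "%y")         # two digit year
format(d, "%A %d %B %Y")                # weekday and month names, locale-dependent

# The same codes describe the input layout when parsing a string
parsed <- as.Date("31/03/1960", format = "%d/%m/%Y")

numeric_form
short_year
parsed
```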

Tutorial:
r-statistics.com/2012/do-more-with-dates-and-times-in-r-with-lubridate-1-1-0/

Class POSIXlt and POSIXct with ?POSIXlt or ?POSIXct

In [None]:
d1 = date() #returns current date and time of class "character"
d1

d2 = Sys.Date() #returns just the date, of class "Date". However, working with date formats can be tricky.
d2

format(d2, "%a %b %d") #change date formating

#as.Date()
x = c("1jan1960", "2jan1960", "31mar1960", "30jul1960"); z = as.Date(x,"%d%b%Y") #format string as date
z
z[1] - z[2] #difference of two days
as.numeric(z[1]-z[2]) #numeric difference of 2 dates

#Converting to Julian
weekdays(d2) #returns weekday of the provided date
months(d2)
julian(d2) #number of days since origin (1970-01-01)

#Lubridate package
library(lubridate); ymd("20130108") #year-month-date ymd command will format the input as date without any extra argument
mdy("08/04/2013")
dmy("03-04-2013")

#Dealing with times
ymd_hms("2011-08-03 10:15:03") #date and time formatting

Sys.timezone(location = TRUE)
OlsonNames(tzdir = NULL) #this'll return the names of all known time zones

ymd_hms("2011-08-03 10:15:03", tz="Asia/Dhaka")

#Some function have slightly different syntax
x = c("1jan1960", "2jan1960", "31mar1960", "30jul1960"); z = as.Date(x,"%d%b%Y") #format string as date
wday(z[1]) #returns day of the week as a number (Sunday is 1 by default), instead of the name that weekdays() gives in base R
wday(z[1], label=TRUE) #returns the abbreviated day name

### Data resources
Open government sites that contain data:
- UN - data.un.org
- US - data.gov
- UK - data.gov.uk
- France - data.gouv.fr
- Ghana - data.gov.gh
- Australia - data.gov.au
- Germany - govdata.de
- Hong Kong - gov.hk/en/theme/psi/datasets
- Japan - data.go.jp
- Many more - data.gov/opendatasites

Data on human health - gapminder.org
- US Survey data - asdfree.com
- Infochimps marketplace - infochimps.com/marketplace
- Kaggle data - kaggle.com

Collections by data scientists
- Hilary Mason http://bitly.com/bundles/hmason/1
- Peter Skomoroch https://delicious.com/pskomoroch/dataset
- Jeff Hammerbacher http://www.quora.com/Jeff-Hammerbacher/Introduction-to-Data-Science-Data-Sets
- Gregory Piatetsky-Shapiro http://www.kdnuggets.com/gps.html

- http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists

More specialized collections
- Stanford large network data
- UCI machine learning
- KDnuggets datasets
- CMU statlib
- Gene expression omnibus
- ArXiv data
- Public data sets on amazon web services

Some APIs with R interfaces
- Twitter and twitteR package
- figshare and rfigshare
- PLoS and rplos
- rOpenSci
- Facebook and Rfacebook
- Google maps and RgoogleMaps

# Statistical Inference

## Probability and Expected Values

### Statistical Inference
Generating conclusion about a population from a noisy sample.

### Probability
Given a random experiment, a probability measure is a population quantity that summarizes the randomness. Probability is a characteristic of the population, which we estimate from a sample.

#### Probability mass functions
A random variable can be discrete (e.g. dice roll, coin toss, web site traffic on a given day) or continuous (probabilities are expressed over ranges of values).
A probability mass function takes a discrete outcome value as input and outputs its probability. Each probability must be at least 0, and the probabilities of all possible outcomes sum to 1.
For a fair coin with X = 1 for heads and X = 0 for tails, the mass function is p(x) = (.5)^x * (.5)^(1-x), so P(H) = p(1) = (.5)^1 * (.5)^0 = 0.5
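A minimal sketch of the coin-flip mass function in R. The function name `bernoulli_pmf` and the unfair-coin probability 0.3 are made up for the example; `dbinom()` with size 1 is the built-in equivalent.

```r
# Mass function of a coin flip: x = 1 for heads, x = 0 for tails, p = P(heads)
bernoulli_pmf <- function(x, p) p^x * (1 - p)^(1 - x)

p_head_fair <- bernoulli_pmf(1, 0.5)        # fair coin
p_unfair    <- bernoulli_pmf(c(0, 1), 0.3)  # P(tails), P(heads) for an unfair coin

p_head_fair
sum(p_unfair)                    # probabilities of all outcomes sum to 1
dbinom(c(0, 1), size = 1, prob = 0.3)  # same values from the built-in binomial pmf
```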

#### Probability density function
Associated with continuous random variables. The rules are similar - total area under it must be 1, value must be greater than or equal to 0 everywhere.  
The function is bell shaped for normally distributed data.
The area under the density over a range gives the probability of that range (e.g. probability of IQ between 100 and 110, when mean is 100, and standard deviation is 10). Again, probability is an attribute of the population; we use data to evaluate our estimate of it.

We can take simpler density function (e.g., a right triangle), but the area needs to be 1.
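As a sketch, take the right-triangle density f(x) = 2x on [0, 1] (base 1, height 2) and check the two rules numerically with `integrate()`. The cutoff 0.75 is an arbitrary choice for this example. This density is the same as the Beta(2, 1) distribution, so `pbeta()` should agree.

```r
# Right-triangle density: 2x between 0 and 1, and 0 elsewhere
tri_density <- function(x) ifelse(x >= 0 & x <= 1, 2 * x, 0)

total_area <- integrate(tri_density, 0, 1)$value     # must be 1 for a valid density
p_below_75 <- integrate(tri_density, 0, 0.75)$value  # P(X <= 0.75) = 0.75^2

total_area
p_below_75
pbeta(0.75, 2, 1)  # same probability from the built-in Beta(2, 1) distribution
```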

##### Cumulative distribution function and survival function
The cumulative distribution function gives the probability up to a point (the area under the density to the left of, say, an IQ of 110). The survival function is the area to the right of the point, or 1 minus the area to the left.
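For the IQ example (mean 100, standard deviation 10), both functions come from `pnorm()`. A minimal sketch:

```r
cdf_110      <- pnorm(110, mean = 100, sd = 10)                      # P(IQ <= 110)
survival_110 <- pnorm(110, mean = 100, sd = 10, lower.tail = FALSE)  # P(IQ > 110)

# Probability of an IQ between 100 and 110 is a difference of two CDF values
between <- pnorm(110, 100, 10) - pnorm(100, 100, 10)

cdf_110
survival_110
between
```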

##### Quantiles
95th percentile - arrange all observations in ascending order, and find the point that is above 95% of the observations. A percentile is a quantile, except it's expressed as a percentage instead of a proportion. The median is the 50th percentile.
If we pick a point from a distribution, the probability that the value is less than the 95th percentile is 95%, and the probability that it is greater is 5%.
Population median is estimand, sample median is estimator. Statistical inference links sample to population.
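A sketch of the estimand/estimator idea with the IQ distribution (mean 100, standard deviation 10). The sample size of 1000 and the seed are arbitrary choices for this example.

```r
# Population quantities (estimands) from the assumed normal distribution
q95        <- qnorm(0.95, mean = 100, sd = 10)  # 95th percentile
pop_median <- qnorm(0.5, mean = 100, sd = 10)   # 50th percentile

pnorm(q95, mean = 100, sd = 10)  # back to 0.95: 95% of the distribution lies below q95

# Sample quantities (estimators) from simulated data
set.seed(42)
sample_iq     <- rnorm(1000, mean = 100, sd = 10)
sample_median <- median(sample_iq)
sample_q95    <- quantile(sample_iq, 0.95)

c(pop_median = pop_median, sample_median = sample_median)
```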

### Conditional Probability
P(A|B) = P(A AND B)/P(B)
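A worked example of the formula with a fair die, sketched in R. The events are chosen for illustration: A is "the roll is 1" and B is "the roll is odd".

```r
outcomes <- 1:6
p <- rep(1/6, 6)          # fair die: every face has probability 1/6

A <- outcomes == 1        # event A: roll is 1
B <- outcomes %% 2 == 1   # event B: roll is odd (1, 3 or 5)

p_B         <- sum(p[B])        # 1/2
p_A_and_B   <- sum(p[A & B])    # 1/6
p_A_given_B <- p_A_and_B / p_B  # 1/3: knowing the roll is odd raises P(roll is 1)

p_A_given_B
```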

#### Bayes' rule
Determining P(A|B) from P(B|A), with some extra information.
"+" the test says the person has the disease.
"-" the test says the person doesn't have the disease.
D the person actually has the disease.
Dc the person doesn't have the disease.

Sensitivity = P(+|D); probability that the test will be positive given the person actually has the disease.
Specificity = P(-|Dc); probability that the test will be negative given the person doesn't have the disease.

In absence of a test, probability of having disease is the Prevalence of a disease = P(D)
Positive predictive value = P(D|+)
Negative predictive value = P(Dc|-)

Example:
A test has -
Sensitivity of 99.7%
Specificity of 98.5%
Prevalence of disease is 0.1% of population.
What is the positive predictive value P(D|+)?

Using Bayes' formula:

$P(D|+) = {P(+|D)P(D) \over P(+|D)P(D)+P(+|Dc)P(Dc)}$

$P(D|+) = {P(+|D)P(D) \over P(+|D)P(D)+(1-P(-|Dc))(1-P(D))}$

$P(D|+) = {Sensitivity \times Prevalence \over Sensitivity \times Prevalence +(1-Specificity) \times (1-Prevalence)}$

$= {.997 \times .001 \over .997 \times .001 + .015 \times .999} = .062$

Even after testing positive, there is only about a 6% chance that the person actually has the disease, because the disease is so rare!
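The calculation can be reproduced in R with the numbers from the example. The variable names are made up for this sketch; the negative predictive value is added using the same rule.

```r
sensitivity <- 0.997   # P(+|D)
specificity <- 0.985   # P(-|Dc)
prevalence  <- 0.001   # P(D)

# Positive predictive value P(D|+)
ppv <- sensitivity * prevalence /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

# Negative predictive value P(Dc|-), by the same rule
npv <- specificity * (1 - prevalence) /
  (specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)

round(ppv, 3)  # 0.062
npv            # very close to 1: a negative result is very reassuring
```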

Similarly,

$P(Dc|+) = {P(+|Dc)P(Dc) \over P(+|D)P(D)+P(+|Dc)P(Dc)}$

#### Likelihood ratios

${P(D|+) \over P(Dc|+)} = {P(+|D) \over P(+|Dc)} \times {P(D) \over P(Dc)}$

${P(D|+) \over P(Dc|+)}$
Odds of disease given a positive test result.

${P(D) \over P(Dc)}$
Odds of disease in the absence of a test result.

${P(+|D) \over P(+|Dc)}$
Diagnostic likelihood ratio for a positive test result.

$\text{post-test odds of disease} = \text{diagnostic likelihood ratio} \times \text{pre-test odds of disease}$

For the diagnostic test above
$DLR_+ = .997/(1-.985) \approx 66$
In other words, whatever the pre-test odds of disease are, a positive result multiplies them by about 66: the data support the hypothesis of disease 66 times more than the hypothesis of no disease.

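Given a negative test result.

Specificity 98.5%, sensitivity 99.7%.
$DLR_- = (1-.997)/.985 \approx .003$
In other words, given a negative test result, the post-test odds of disease are about .003 times the pre-test odds: the data support the hypothesis of disease only .003 times as much as the hypothesis of no disease.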
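A sketch of both likelihood ratios in R, using the same test figures. It also converts the post-test odds after a positive result back into a probability, which should agree with the positive predictive value of about 6%. Variable names are made up for the example.

```r
sensitivity <- 0.997
specificity <- 0.985
prevalence  <- 0.001

dlr_pos <- sensitivity / (1 - specificity)   # P(+|D) / P(+|Dc), about 66
dlr_neg <- (1 - sensitivity) / specificity   # P(-|D) / P(-|Dc), about .003

pre_odds      <- prevalence / (1 - prevalence)  # odds of disease before the test
post_odds_pos <- dlr_pos * pre_odds             # odds of disease after a positive test

# Odds back to probability: odds / (1 + odds)
post_prob_pos <- post_odds_pos / (1 + post_odds_pos)

dlr_pos
dlr_neg
post_prob_pos
```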

#### Independence
Event A is independent of event B if,  
P(A|B) = P(A) where P(B) > 0.  
P(A$\cap$B) = P(A)P(B)

We can't just multiply probabilities to calculate joint probabilities, because they might not be independent.

IID - Independent and Identically Distributed (all drawn from same population distribution) random variables.
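A small sketch that checks the product rule by listing all outcomes of two fair coin flips. The events A, B and C are chosen for illustration: A and B are independent, while A and C are not, so multiplying their probabilities gives the wrong joint probability.

```r
# All four equally likely outcomes of two fair coin flips
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"))
p <- rep(1/4, 4)

A <- flips$first == "H"                         # first flip is heads
B <- flips$second == "H"                        # second flip is heads
C <- flips$first == "H" | flips$second == "H"   # at least one head

# Independent events: the joint probability equals the product
p_AB      <- sum(p[A & B])
product_AB <- sum(p[A]) * sum(p[B])

# Dependent events: the joint probability differs from the product
p_AC      <- sum(p[A & C])
product_AC <- sum(p[A]) * sum(p[C])

c(p_AB = p_AB, product_AB = product_AB, p_AC = p_AC, product_AC = product_AC)
```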

### Expected values
Expected values of the sample (sample mean and variance) are used to estimate the population expected values (population mean and variance).
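A sketch with a fair die: the population mean and variance follow from the mass function, and the sample mean and variance of simulated rolls estimate them. The seed and the 10000 rolls are arbitrary choices for this example.

```r
faces <- 1:6
p <- rep(1/6, 6)

# Population expected values, computed from the mass function
pop_mean <- sum(faces * p)                   # 3.5
pop_var  <- sum((faces - pop_mean)^2 * p)    # 35/12

# Sample estimates from simulated rolls
set.seed(1)
rolls <- sample(faces, size = 10000, replace = TRUE)
sample_mean <- mean(rolls)
sample_var  <- var(rolls)

c(pop_mean = pop_mean, sample_mean = sample_mean)
c(pop_var = pop_var, sample_var = sample_var)
```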

In [None]:
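pbeta(c(.4, .5), 2, 1) #Beta(2, 1) has density 2x on [0, 1]: a right triangle with base 1 and height 2, so its area is 1. 2 and 1 are the shape parameters.
#The result is the cumulative probability P(X <= .4) and P(X <= .5), i.e. the area of the triangle to the left of .4 and .5.

pnorm(110, mean=100, sd=10) #same idea for the normal distribution: P(IQ <= 110)

#q means quantile
qbeta(0.5, 2, 1) #median (50th percentile) of the Beta(2, 1) distribution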

## Variability, Distribution and Asymptotics

## Intervals, Testing and Pvalues

## Power, Bootstrapping and Permutation Tests