## Day 1

# [Signal Curriculum](https://github.com/signaldatascience/R-curriculum/blob/master/schedule.md)

Working directory: getwd() shows the current one, setwd() changes it.

Note: for (x in 1:10) loops over 1, ..., 10, both ends included (c(1:10) is the same vector as 1:10). Also, vector indexing begins at 1 in R.

[Time Functions in R (StackOverflow)](http://stackoverflow.com/questions/6262203/measuring-function-execution-time-in-r/33375008#33375008)

system.time(replicate(1000, function_tracking(arguments)))
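A minimal sketch of the timing idiom above. `slow_sum()` is a hypothetical stand-in for `function_tracking`, not something from the curriculum:

```r
# Hypothetical function to time: sums 1..n with an explicit loop
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

# replicate() reruns the call 1000 times;
# system.time() reports user, system and elapsed seconds for the whole batch
timing <- system.time(replicate(1000, slow_sum(1000)))
print(timing["elapsed"])
```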

[Defining an R function](https://www.r-bloggers.com/how-to-write-and-debug-an-r-function/)

name.of.function <- function(argument1, argument2) {
    statements
    return(something)
}

**Thing I keep screwing up:** numeric(n) makes a BLANK (zero-filled) numeric vector of length n. as.numeric(value) converts value to numeric.
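Quick side-by-side of the two calls (variable names are just for illustration):

```r
blank <- numeric(3)            # zero-filled double vector of length 3
converted <- as.numeric("3")   # the string "3" converted to the number 3

print(blank)
print(converted)
```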

## [Memoization](https://en.wikipedia.org/wiki/Memoization)

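Storing the results of expensive function calls so that repeated calls with the same arguments return the cached result instead of recomputing it.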
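A small sketch of memoization in R, assuming a cache kept in a closure (the names `memo_fib`, `fib` and `cache` are made up for this example):

```r
memo_fib <- local({
  cache <- c(1, 1)                      # F_1 and F_2 are known up front
  fib <- function(n) {
    if (n <= length(cache)) return(cache[n])   # cache hit: no recomputation
    value <- fib(n - 1) + fib(n - 2)
    cache[n] <<- value                  # store the new result in the enclosing environment
    value
  }
  fib
})

print(memo_fib(30))
```

Each Fibonacci number is computed once, so the naive exponential recursion becomes linear in n.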

### Neat way to do Fibonacci Numbers

$$\begin{bmatrix} 1&1\\1&0 \end{bmatrix} \begin{bmatrix} F_{n+1}\\F_n\end{bmatrix} = \begin{bmatrix} F_{n+2}\\F_{n+1}\end{bmatrix}$$

$$\begin{bmatrix} 1&1\\1&0 \end{bmatrix}^n \begin{bmatrix} 1\\1\end{bmatrix} = \begin{bmatrix} F_{n+2}\\F_{n+1}\end{bmatrix}$$

Computing $F_n$ this way takes $O(n)$ matrix multiplications.

You can subsequently improve the time by repeated squaring (building the $n$-th power from squares and squares-of-squares of the matrix), getting $F_n$ down to $O(\log n)$ multiplications.
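A sketch of the repeated-squaring idea, assuming $F_1 = F_2 = 1$ as in the matrix identity above. `mat_pow()` and `fib_matrix()` are names invented here:

```r
# Raise a square matrix to a non-negative integer power with O(log n) multiplications
mat_pow <- function(m, n) {
  result <- diag(nrow(m))               # identity matrix
  while (n > 0) {
    if (n %% 2 == 1) result <- result %*% m
    m <- m %*% m                        # square the base
    n <- n %/% 2
  }
  result
}

fib_matrix <- function(n) {
  step <- matrix(c(1, 1, 1, 0), nrow = 2)
  # step^(n-1) %*% c(1, 1) = c(F_{n+1}, F_n), so take the second entry
  (mat_pow(step, n - 1) %*% c(1, 1))[2]
}

print(fib_matrix(30))
```

Doubles stay exact only up to roughly the 78th Fibonacci number, so larger n would need a big-integer type.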

P.S. Don't ask RC R questions.



## Day 2

Oh weird! When building or reading a data frame, R (before version 4.0.0) converts every string column to a factor (categorical variable) by default. You can turn this off with stringsAsFactors = FALSE; since R 4.0.0 that is already the default.

To see if something is a data.frame, use class(df) or is.data.frame(df); typeof(df) shows that its underlying type is list.

Combine dataframes using cbind() columnwise and rbind() rowwise.

To combine dataframes with different sets of columns, use plyr::rbind.fill() (missing columns are filled with NA).

Please limit yourself to combining data.frame type objects with cbind() or you may see unintended behaviors.
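A quick sketch of the data frame notes above, using two made-up frames `a` and `b` and base R only:

```r
a <- data.frame(id = 1:2, name = c("x", "y"), stringsAsFactors = FALSE)
b <- data.frame(id = 3:4, name = c("z", "w"), stringsAsFactors = FALSE)

print(class(a))    # the class attribute
print(typeof(a))   # the underlying storage type

stacked <- rbind(a, b)                     # rowwise: same columns, more rows
widened <- cbind(a, score = c(0.5, 0.9))   # columnwise: same rows, one more column
print(stacked)
print(widened)
```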

Lexical Scoping

Dynamic Scoping


**x["A"] returns a list with 1 item.**

**x\$A returns the value enclosed in "A"**


str(A) = structure of list A

\$ is a shorthand for [[ ]] in R: x\$A is equivalent to x[["A"]] (except that \$ allows partial name matching)
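The three access styles side by side, on a made-up list `x`:

```r
x <- list(A = 1:3, B = "text")

single_bracket <- x["A"]    # still a list, of length 1
double_bracket <- x[["A"]]  # the element itself
dollar <- x$A               # same element as x[["A"]]

str(x)                      # compact structure summary
```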

# Description of a Normal Distribution

$$y = ax + \epsilon$$
$$(\sigma_\epsilon = b)$$
$$a^2\sigma_{x}^2 + \sigma_\epsilon^2  = \sigma_y^2$$

$$a^2Var(x) + Var(\epsilon) = Var(y)$$



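a = slope. When x and y are both standardized (mean 0, variance 1), a = correlation(x, y) and $R^2 = a^2$.

b = standard deviation of the error term $\epsilon$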
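A simulation sketch of the variance decomposition, assuming x is standard normal and choosing a and b so that $a^2 + b^2 = 1$ (so y also has unit variance and the slope matches the correlation). The specific numbers are arbitrary:

```r
set.seed(1)
n <- 100000
a <- 0.6                      # slope
b <- 0.8                      # sd of the error term

x <- rnorm(n)
eps <- rnorm(n, sd = b)
y <- a * x + eps

print(var(y))                          # close to 1
print(a^2 * var(x) + var(eps))         # close to var(y)
print(cor(x, y))                       # close to a
```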


Assume a purely numeric dataframe that you want to traverse in an angular spiral starting at the upper-left corner and ending up at the center item.

Grab entire row or column, sometimes reverse it, remove when grabbed.

Avoid index manipulation if you can?

R is good for this problem: get the leftmost column w/ df[1] and remove it via df[-1]; remove the last row via df[-nrow(df),].

Reverse with rev()

R has automatic coercion in many circumstances (ex: returning a vector when you slice a data frame that is down to its last column). Watch out!

Avoided by: df[-1,,drop=FALSE]
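One way the grab-and-remove approach could look, going clockwise. This is a sketch, not the assignment's reference solution; `spiral()` and `demo` are names made up here:

```r
spiral <- function(df) {
  out <- c()
  while (nrow(df) > 0 && ncol(df) > 0) {
    # top row, left to right
    out <- c(out, unlist(df[1, ], use.names = FALSE))
    df <- df[-1, , drop = FALSE]
    if (nrow(df) == 0) break
    # rightmost column, top to bottom
    out <- c(out, df[[ncol(df)]])
    df <- df[-ncol(df)]
    if (ncol(df) == 0) break
    # bottom row, right to left
    out <- c(out, rev(unlist(df[nrow(df), ], use.names = FALSE)))
    df <- df[-nrow(df), , drop = FALSE]
    if (nrow(df) == 0) break
    # leftmost column, bottom to top
    out <- c(out, rev(df[[1]]))
    df <- df[-1]
  }
  out
}

demo <- data.frame(matrix(1:9, nrow = 3, byrow = TRUE))
print(spiral(demo))
```

Row removal uses drop = FALSE so a one-column data frame stays a data frame instead of collapsing to a vector.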

Standard Linear Regression Notation
$$y = mx + b + \epsilon$$

$$P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$$
 
**An intuitive explanation worth noting**

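$$Var(X) = \dfrac{1}{n}\sum_{x \in X}{(x - \bar{x})^2}$$

$$\sigma_x = \sqrt{Var(X)}$$

($\sigma_x$ is roughly the typical size of $|x - \bar{x}|$, but it is not equal to the mean absolute deviation $\dfrac{1}{n}\sum_{x \in X}{|x - \bar{x}|}$.)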
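A tiny worked example with made-up numbers, to keep the quantities apart. Note that R's built-in var() and sd() divide by n - 1 rather than n:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

pop_var <- mean((x - mean(x))^2)        # population variance (divide by n)
pop_sd <- sqrt(pop_var)                 # standard deviation
mean_abs_dev <- mean(abs(x - mean(x)))  # mean absolute deviation, a different number

print(c(pop_var, pop_sd, mean_abs_dev))
print(var(x))                           # sample variance (divide by n - 1)
```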


# SQL Notes

SELECT continent, name, area FROM world x
  WHERE area >= ALL
    (SELECT area FROM world y
        WHERE y.continent=x.continent
          AND area>0)
          
SELECT continent, name FROM world x
   WHERE name = 
      (SELECT MIN(name) FROM world y WHERE x.continent = y.continent)

          
/* MIN(name) OR SELECT TOP 1 name WHERE ___ ORDER BY name*/

scrap, not useful: /*So... for each continent check all name in country for population<=25M */
/*NOT, ANY, ALL ? */

Thing that worked...
SELECT name, continent, population FROM world x
   WHERE continent NOT IN
      (SELECT continent FROM world y
      WHERE x.continent = y.continent
      AND population >25000000)


      

   AND x.name != y.name
   
 Use max in some way?
 
these did not work
SELECT TOP 1, TOP 2 FROM (spit out ordered database here for each continent)

SELECT name, continent, population FROM world x
   WHERE name IN
(SELECT name FROM world y
   WHERE x.continent = y.continent
   AND x.name != y.name
   AND x.population > 3*y.population)
   
this did work
SELECT name, continent FROM world x
   WHERE population >  
      ALL (SELECT 3*population FROM world y 
      WHERE x.name != y.name 
      AND x.continent = y.continent)
      **Note: they prefer y.name != x.name notation, and I agree it's probably better.**
      


# Vectorise Your Code
aka Advanced R Day 5

Why?
- Useful as an abstraction
- Loops in vectorised functions are written in C instead of R (much faster)

Examples:
- rowSums
- colSums
- rowMeans
- colMeans
- cumsum
- diff
- vapply
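The functions in the list above on a toy matrix (values chosen arbitrarily):

```r
m <- matrix(1:6, nrow = 2)        # row 1: 1 3 5, row 2: 2 4 6

row_totals <- rowSums(m)
col_means <- colMeans(m)
running <- cumsum(1:4)            # running total
steps <- diff(c(1, 4, 9, 16))     # differences between neighbours
# vapply is like sapply, but the third argument fixes the type and length of each result
list_means <- vapply(list(1:3, 4:6), mean, numeric(1))

print(row_totals)
print(col_means)
print(running)
print(steps)
print(list_means)
```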


?lookup tables? (seems similar to dictionary, and it's apparently a fast operation)

match() and integer subsetting are faster than rownames and character subsetting?
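A sketch of a lookup table done both ways: a named vector with character subsetting (dictionary-like), and match() with integer subsetting. The codes and labels are made up:

```r
codes <- c("m", "f", "u", "f", "m")

# Named vector as a lookup table: names are keys, elements are values
lookup <- c(m = "Male", f = "Female", u = NA)
by_name <- unname(lookup[codes])

# Same lookup with match(): find each code's position, then subset by integer
keys <- c("m", "f", "u")
values <- c("Male", "Female", NA)
by_match <- values[match(codes, keys)]

print(by_name)
print(by_match)
```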

extracting or replacing values in scattered locations on matrix or df? subset w/integer matrix.
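Subsetting with an integer matrix, sketched on a toy matrix: each row of `idx` is one (row, column) pair.

```r
m <- matrix(1:12, nrow = 3)            # filled column by column

idx <- cbind(c(1, 3, 2), c(1, 2, 4))   # cells (1,1), (3,2), (2,4)
picked <- m[idx]                       # extract the scattered cells in one step
print(picked)

m[idx] <- 0L                           # replace the same cells in one step
print(m)
```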

Matrix algebra executed highly efficiently via external library BLAS.

Downside of vectorization: harder to predict time when things scale up. Could be in a favorable direction, though.

Can write own vectorised function in C++ w/ Rcpp

side-note: %in% is a function

apply always turns its data into a matrix. This makes inserting dfs into it inadvisable from a speed standpoint.

You can save a function some work by telling it things you already know, like...
- colClasses for read.csv()
- levels for factor()
- labels=FALSE for cut() when you don't need the labels (or use findInterval() instead)
- unlist(x, use.names=FALSE) is faster than just unlist(x)
- interaction(drop=TRUE) runs faster if you only need the combinations that occur in the data
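A small sketch of two of the shortcuts above, with arbitrary data. cut() and findInterval() agree here, but they treat values that sit exactly on a break differently by default (cut() uses right-closed intervals, findInterval() left-closed):

```r
x <- c(1, 12, 25, 37, 49)
breaks <- c(0, 10, 20, 30, 40, 50)

bins_cut <- cut(x, breaks, labels = FALSE)   # integer bin codes, no label strings built
bins_fi <- findInterval(x, breaks)           # same idea without the factor machinery
print(bins_cut)
print(bins_fi)

nested <- list(a = 1:2, b = 3:4)
flat <- unlist(nested, use.names = FALSE)    # skips building names like a1, a2, b1, b2
print(flat)
```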

?method dispatch? (apparently computationally expensive) ?related to dynamism of R as a language?

Method dispatch: Determining which variant of a method to call for the input provided 

side-note: S4 is a newer object system in R. Access slots in S4 using object@name, while S3 uses object$name and relies heavily on lists. Not terrifically likely to come up anytime soon, but interesting side-note to maybe pursue later.

Speeding up functions by bypassing method dispatch entirely:
- S3: call generic.class() instead of generic()
- S4: use getMethod() (or selectMethod()) to pick the specific method, save it to a variable, and call that variable

mean.default(numeric_vector): faster than just mean(), but **less failsafe** if you feed it a non-numeric.
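A sketch of the S3 shortcut: both calls give the same answer on a plain numeric vector, the second just skips the dispatch step.

```r
set.seed(10)
x <- runif(1000)

via_generic <- mean(x)          # generic: looks up the method for class(x) first
via_method <- mean.default(x)   # calls the default S3 method directly

print(identical(via_generic, via_method))
```

The saving per call is tiny, so this only matters inside a hot loop.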

Faster way to turn a list of equal-length vectors into a dataframe (as.data.frame is safe vs. data types, but slow and relies on rbind()).
The catch: if you hand quickdf() the wrong kind of input, you get a corrupted dataframe and a series of warnings.

quickdf = function(n){
  # n: a named list of equal-length vectors
  class(n) = "data.frame"
  #Yes! Apparently you can do this!
  attr(n, "row.names") = .set_row_names(length(n[[1]]))
  n  # return the converted object
}


*Person writes about how he rewrote the source code of as.data.frame.list() to yield a variant optimized for speed. Recommends other people look up source code when they need to take something extant and make something differently-optimized, or to make something that makes different expectations of the data than the norm.*



# Solutions for Self-Assessment 1

Sept 19, 2016

(Huh, okay, noted: Will finished well before me, Michelle and Micah took another 2 hours to finish.)

X = runif(n_trials)
Y = runif(n_trials, max=X)

qplot(X,Y)

NOTE THIS!
df = select(df, Extraversion, Neuroticism, active:scornful)

lm(Neuroticism ~ . - Extraversion, df)

### SQL section

SELECT Salary
FROM (
    SELECT Salary
    FROM Employees
    ORDER BY Salary DESC
    LIMIT 2
)
ORDER BY 


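CROSS JOIN: Cartesian product, pairs every row of the left table with every row of the right table (no join condition, so no NULLS are generated). The maximally inclusive join with potentially lots of NULLS is FULL OUTER JOIN.
INNER JOIN: requires some key = some key; only matching rows are kept, so the join generates no NULLS.

LEFT JOIN: every row in the left table sticks; unmatched rows get NULLS in the columns from the right table.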
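The same join types can be tried out in R with base merge(), which may make the row counts easier to see. The two tiny tables are made up; NA plays the role of NULL:

```r
left <- data.frame(id = c(1, 2, 3), l = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), r = c("x", "y", "z"))

inner <- merge(left, right, by = "id")                 # only ids present in both
left_join <- merge(left, right, by = "id", all.x = TRUE)  # every left row kept
full <- merge(left, right, by = "id", all = TRUE)      # every row from both sides kept
cross <- merge(left["l"], right["r"], by = NULL)       # every pairing, no key at all

print(inner)
print(left_join)
print(full)
print(nrow(cross))
```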

(Not following, look at the [wikipedia article](https://en.wikipedia.org/wiki/Join_(SQL)))

Solutions in same folder as self-assessment.





(SELECT name, SUM (leadrole),
CASE WHEN ord !=1 THEN 1 ELSE 0 END
#make sure to fuse on person
 FROM
(SELECT yr,COUNT(title) AS c FROM
   movie JOIN casting ON movie.id=movieid JOIN actor ON actorid=actor.id
 WHERE name='John Travolta'
 GROUP BY yr) AS t
)

ORDER BY actor.name
#What I want to do is sort by name and then apply count(movieid) 
#applied over a list of unique 

SELECT name FROM
  movie JOIN casting ON movie.id=movieid INNER JOIN actor ON actorid=actor.id
WHERE ord = 1
GROUP BY name
HAVING COUNT(movie.id) > 30
COUNT(

## Correlation Coefficients

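In a bid to understand Pearson correlation coefficients and what distinguishes them from polychoric correlation coefficients.

(Some words I'm trying to understand: Pearson's measures linear association between two observed continuous variables, and its usual significance tests assume an underlying joint normal distribution. Polychoric is for ordinal variables: it assumes each observed category comes from thresholding an underlying continuous variable, with the underlying pair jointly normal, and it estimates the correlation of those latent variables.)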

$$PearCorCoef(X,Y) = \dfrac{cov(X,Y)}{\sigma_X \sigma_Y}$$
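Checking the formula against R's built-in cor() on simulated data (the data-generating numbers are arbitrary):

```r
set.seed(42)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

manual <- cov(x, y) / (sd(x) * sd(y))   # the formula above, term by term
builtin <- cor(x, y)                    # method = "pearson" is the default

print(c(manual, builtin))
```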

![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/400px-Correlation_examples2.svg.png)

<center>Covariance Function (aka Kernel?)</center>
$$Cov(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

### Quick notes on factor analysis (from Wikipedia)

Exploratory Factor Analysis (EFA) can fairly be considered a more complicated variant of PCA, where factor analysis also uses error terms (and would consider the eigenvalues of PCA to be inflated component loadings contaminated with error variance).

FA assumes there are a small number of independent underlying (unseen) variables that determine joint responses in sets of the observed variables. This independence assumption makes it unsuitable for biology. It's practically not used in physics, bio, or chem. It does get used in psychometrics, personality theories, marketing, product management, and operations research.

$$x-\mu  = LF + \epsilon$$

With $F$ as the unobserved random variables (the factors), $\mu_i = mean(x_i)$, $L$ = matrix of constant loadings $l_{ij}$ (the weight of factor $j$ on $x_i$), and $\epsilon_i$ as stochastic error terms for each item.

The number of factors $k$ is picked beforehand, smaller than the number of observed variables, and $F_1, \dots, F_k$ aim to explain as much of the *shared* variance as they can in that many linear combinations (unlike PCA, it likely will not capture all of the variance.)

EFA is a generative model.
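Since the model is generative, you can simulate from it. A sketch assuming $\mu = 0$, two independent standard-normal factors, and a made-up loading matrix; under those assumptions the covariance of x should be close to $LL^T + \Psi$, where $\Psi$ is the diagonal matrix of error variances:

```r
set.seed(7)
n <- 50000

# Hypothetical 6 x 2 loading matrix: items 1-3 load on factor 1, items 4-6 on factor 2
L <- matrix(c(0.9, 0.8, 0.7, 0,   0,   0,
              0,   0,   0,   0.8, 0.7, 0.6), ncol = 2)
psi <- 1 - rowSums(L^2)                  # error variances chosen so each item has variance 1

Fm <- matrix(rnorm(n * 2), ncol = 2)     # unobserved factor scores, one row per observation
E <- matrix(rnorm(n * 6), ncol = 6) %*% diag(sqrt(psi))  # item-specific noise
X <- Fm %*% t(L) + E                     # x = L F + epsilon, one row per observation

implied <- L %*% t(L) + diag(psi)        # covariance implied by the model
print(round(max(abs(cov(X) - implied)), 3))
```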



## Words from R's optim function
(parse and look up later)

"General-purpose optimization based on Nelder–Mead, quasi-Newton and conjugate-gradient algorithms. It includes an option for box-constrained optimization and simulated annealing."

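To make those words concrete, a sketch running optim() with three of the methods it names on the Rosenbrock test function (minimum at (1, 1)). The start point and box bounds are arbitrary choices:

```r
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
start <- c(-1.2, 1)

nm <- optim(start, rosenbrock)                      # Nelder-Mead is the default method
bfgs <- optim(start, rosenbrock, method = "BFGS")   # quasi-Newton, gradient approximated numerically
boxed <- optim(start, rosenbrock, method = "L-BFGS-B",
               lower = c(-2, -2), upper = c(0.5, 2))  # box-constrained: first coordinate capped at 0.5

print(nm$par)
print(bfgs$par)
print(boxed$par)
```

Each result is a list; par holds the best parameters found, value the function value there, and convergence is 0 on success.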


# Clustering

## Clustering covered by assignment
- Hierarchical clustering
- K-means clustering
- Mixture Models
    - Univariate Mixture Model
    - Parametric
    - Semiparametric
    - Non-Parametric
- Kohonen Net (aka self-organizing map; a neural net whose units behave like k-means centroids arranged on a grid)
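A sketch of the first two methods in the list on two well-separated made-up blobs, using base R's kmeans() and hclust():

```r
set.seed(3)
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),   # 20 points near (0, 0)
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))   # 20 points near (5, 5)

km <- kmeans(pts, centers = 2, nstart = 10)   # k-means with 10 random restarts
hc <- cutree(hclust(dist(pts)), k = 2)        # hierarchical tree cut into 2 clusters

print(table(kmeans = km$cluster, hclust = hc))
```

With blobs this far apart both methods should recover the same two groups, though the cluster labels themselves are arbitrary.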

## New Clustering Methods

- Genetic algorithms
- Simulated annealing
- Tabu search
- Randomized branch-and-bound
- Hybrid search