# Rough Outline

## Core
- Vectors, Boolean vectors, indexing
- Matrices & indexing
- Negative indices
- Defining functions
  - Function composition
  - Function scoping
  - Pass-by-value
- Dataframes (mtcars)
- Missing values
- Factors

## Data transformation
- Wide-format vs. long-format

## Visualization

## reshape2, dplyr

## Misc
- `help(...)` or `? ...`
- Tab-completion in Jupyter
- SparkR (if possible)

## To-do
- dplyr examples
- ggplot2 examples

# Brief Introduction to R & Feature Transformation
## Chris Hodapp <hodapp87@gmail.com>

## CincyFP, 2016 December 13

## Front matter

This is all done in Jupyter (formerly IPython) and IRkernel.
- https://jupyter.org/
- https://irkernel.github.io/

Visit http://... to use this same notebook in your browser.

(...unless you're reading this later, of course.  Go fire up your own docker container with `"docker run -d -p 8888:8888 jupyter/r-notebook"` or something.)

![](r-matey2.png)

(thanks Creighton)

## What is R?

- An interpreted, dynamically-typed language based on S and made mainly for interactive use in statistics and visualization

- Sort of like MATLAB, except statistics-flavored and open source

- A train-wreck that is sometimes confused with a real programming language.
  - *"R is a dynamic language for statistical computing that combines lazy functional features and object-oriented programming. This rather unlikely linguistic cocktail would probably never have been prepared by computer scientists, yet the language has become surprisingly popular."* (Morandat et al., *Evaluating the Design of the R Language*)
  - The R Inferno (Patrick Burns), http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

## So... why use it at all?

- Stable and documented extensively!
- Excellent for exploratory use interactively!
- Epic visualization!
- Magical, fast, and elegant for arrays, tables, vectors, and linear algebra!
- Huge standard library!
- Packages for everything else on CRAN!
- Still sort of FP!
- Excellent tooling! (Sweave, Emacs & ESS mode, RStudio, Jupyter...)

## How do I use R?

*Do you need plotting or visualization?*
Use [ggplot2](http://ggplot2.org/). Completely ignore built-in plotting.

*Do you need to transform tabular/vector/list/array/matrix/DataFrame data somehow?*
Just use [dplyr](https://cran.r-project.org/package=dplyr) or [reshape2](http://seananderson.ca/2013/10/19/reshape.html). Completely ignore built-in `*apply` functions.

*Do you need something else?* Search [CRAN](https://cran.r-project.org/).

*Does no CRAN package solve your problems? Do you need to write "real"(tm) software for production?* Strongly consider giving up.

## Obligatory R notebook demonstration...

## dplyr

- See: Introduction to dplyr, https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
- `filter`, `slice` - Select rows (filter is by predicate, slice is by position)
- `arrange` - Reorder rows
- `select`, `rename` - Select or rename columns
- `distinct` - Choose only *distinct* rows
- `mutate`, `transmute` - Make new columns from existing ones
- `summarise` - Collapse frame to single row with aggregate functions
- `sample_n`, `sample_frac` - Randomly sample rows (by count or by fraction)
- `group_by` - Group observations (most of the above also work on grouped observations)
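
A few of the verbs above chained together, as a rough sketch. It assumes the `dplyr` package is installed and uses the built-in `mtcars` data frame; the particular columns, the `mpg > 20` threshold, and the `wt_kg` column are arbitrary choices for illustration.

```r
library(dplyr)

# mtcars ships with R: one row per car model, numeric columns of specs.
efficient <- mtcars %>%
  filter(mpg > 20) %>%                     # keep rows by predicate
  select(mpg, cyl, wt) %>%                 # keep three columns
  mutate(wt_kg = wt * 1000 * 0.4536) %>%   # wt is in units of 1000 lbs
  arrange(desc(mpg))                       # highest mpg first

# Grouped summary: one row per distinct value of cyl.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg))

head(efficient)
by_cyl
```

Each verb takes a data frame and returns a data frame, which is why they compose with `%>%`.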

## Motivating Example

- Example data set from: https://archive.ics.uci.edu/ml/datasets/Letter+Recognition
- 20,000 samples, 16 dimensions

```r
# I'm not going to remember this code
# (note: this shadows R's built-in `letters` constant for the rest of the session)
letters <- read.table("letter-recognition.data", sep=",", header=FALSE)
colnames(letters) <- c("Letter", "Xbox", "Ybox", "Width", "Height",
                       "OnPix", "Xbar", "Ybar", "X2bar", "Y2bar",
                       "XYbar", "X2Ybar", "XY2bar", "Xedge",
                       "XedgeXY", "Yedge", "YedgeYX")
```

## *Curse of Dimensionality* (Bellman)

![](https://upload.wikimedia.org/wikipedia/commons/c/cc/Data3classes.png)

![](https://upload.wikimedia.org/wikipedia/commons/5/52/Map1NN.png)

- Intuition from k-nearest neighbor: If each sample occupies a certain amount of 'space' in the input space, the number of samples required to still 'cover' that space increases exponentially with the number of dimensions.
- If possible: Don't add more dimensions. Either reduce dimensions, or increase samples.
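
To put rough numbers on the 'covering' intuition, here is a small sketch. It assumes a crude model that is not from the data set itself: the input space is a unit hypercube chopped into 10 cells per axis, and we want at least one sample per cell. `cells_needed` is a made-up helper name.

```r
# Number of equal-sized cells needed to tile the unit hypercube [0, 1]^d
# when each axis is split into `cells_per_axis` intervals.
cells_needed <- function(d, cells_per_axis = 10) {
  cells_per_axis^d
}

dims <- c(1, 2, 3, 8, 16)
data.frame(dimensions = dims, cells = cells_needed(dims))
```

Under this model, 16 dimensions would need on the order of $10^{16}$ cells, which is why 20,000 samples cannot come close to covering a 16-dimensional space at that resolution.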

## Feature Transformation

### General form

$$x : \mathcal{F}^N \to \mathcal{F}^M, M < N$$

*(though actually `M >> N` is useful too and is the basis for [kernel methods](https://en.wikipedia.org/wiki/Kernel_method) such as SVMs)*

### Subsets
- *Feature Selection*: Loosely, throw away dimensions/features.
- Information gain, Gini index, entropy, variance, statistical independence...

- *Filtering*: Reduce features first, and then perform learning. Learning can't feed information 'back' to filtering.

- *Wrapping*: Reduce features based on how learning performs.
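
A minimal sketch of the *filtering* flavor, under loud assumptions: the score is plain variance after rescaling each column to $[0, 1]$ (one arbitrary pick from the criteria listed above), the data is the built-in `mtcars`, and `filter_by_variance` is a hypothetical helper, not a library function.

```r
# Filter-style selection: score each feature without consulting any learner,
# then keep the k highest-scoring ones.
filter_by_variance <- function(df, k) {
  rescaled <- as.data.frame(lapply(df, function(col) {
    (col - min(col)) / (max(col) - min(col))   # map each column to [0, 1]
  }))
  scores <- sort(sapply(rescaled, var), decreasing = TRUE)
  names(scores)[seq_len(k)]
}

# Treat mpg as the would-be target and drop it before scoring.
kept <- filter_by_variance(mtcars[, -1], k = 4)
reduced <- mtcars[, kept]
kept
```

A wrapper method would instead train the learner on candidate subsets and keep whichever subset performs best, which is slower but lets learning feed back into the selection.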

### Transformations

#### Linear

- Transformation $x : \mathcal{F}^N \to \mathcal{F}^M, M < N$ is defined by a matrix $\mathcal{P}_x$: $M\times N$ if samples are column vectors, $N\times M$ if samples are rows

- e.g. 4-dimensional feature space mapped to 2 dimensions, $(x_1, x_2, x_3, x_4) \mapsto (2x_1-x_2, x_3 + x_4)$

Then for samples as column vectors...
$$
\mathcal{P}_x=
  \begin{bmatrix}
    2 & -1 & 0 & 0 \\
    0 & 0 & 1 & 1
  \end{bmatrix}
$$
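
The same 4-to-2 mapping in R, as a quick sketch. The sample values are made up for illustration; the second half shows the equivalent form when samples are stored as rows.

```r
# The 2 x 4 projection matrix from the example above.
P <- matrix(c(2, -1, 0, 0,
              0,  0, 1, 1), nrow = 2, byrow = TRUE)

# One sample as a column vector: y = P x
x <- c(1, 2, 3, 4)
y <- P %*% x          # (2*1 - 2, 3 + 4) = (0, 7)

# Several samples as rows: Y = X t(P)
X <- rbind(c(1, 2, 3, 4),
           c(0, 1, 0, 1))
Y <- X %*% t(P)       # rows are (0, 7) and (-1, 1)
```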

Consider data expressed as an $n\times m$ matrix with each column representing one *feature* (of $m$) and each row one *sample* (of $n$):

$$
X=
  \begin{bmatrix}
    a_1 & b_1 & c_1 & \cdots\\
    a_2 & b_2 & c_2 & \cdots\\
    a_3 & b_3 & c_3 & \cdots\\
    \cdots & \cdots & \cdots & \cdots\\
    a_n & b_n & c_n & \cdots\\
  \end{bmatrix}
$$

- Focus on first feature $A=\left\{a_1, a_2, \dots\right\}$
- Mean = $\left\langle a_i \right\rangle_i = \frac{1}{n} \sum_{i=1}^n a_i=\mu_A$
  - $\left\langle \dots \right\rangle$ = expectation operator

- Variance:
$$\sigma_A^2=\left\langle \left(a_i-\left\langle a_j \right\rangle _j\right)^2 \right\rangle_i=\left\langle \left(a_i-\mu_A\right)^2 \right\rangle_i = \frac{1}{n-1}\sum_{i=1}^n \left(a_i-\mu_A\right)^2$$

*(if you want to know why it is $\frac{1}{n-1}$ and not $\frac{1}{n}$, ask a statistics PhD or something)*

- Consider another feature $B=\left\{b_1, b_2, \dots\right\}$, and assume that $\mu_A=\mu_B=0$ for sanity
- Covariance of $A$ and $B$:
$$\sigma_{AB}^2=\left\langle a_i b_i \right\rangle_i=\frac{1}{n-1}\sum_{i=1}^n a_i b_i$$

Treating $A$ and $B$ as vectors:

$$\sigma_{AB}^2=\frac{A\cdot B}{n-1}$$
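
A quick numerical check of the dot-product form, as a sketch on made-up random data. The seed, `n`, and the way `B` is generated are arbitrary; the centering step is what makes the $\mu_A=\mu_B=0$ assumption hold.

```r
set.seed(1)
n <- 50
A <- rnorm(n)
B <- 0.5 * A + rnorm(n)

# Center both features so that their means are zero.
A <- A - mean(A)
B <- B - mean(B)

dot_form <- sum(A * B) / (n - 1)   # A . B / (n - 1)
all.equal(dot_form, cov(A, B))     # agrees with R's built-in covariance
all.equal(sum(A * A) / (n - 1), var(A))
```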

Recalling our data matrix:

$$
X=
  \begin{bmatrix}
    a_1 & b_1 & c_1 & \cdots\\
    a_2 & b_2 & c_2 & \cdots\\
    a_3 & b_3 & c_3 & \cdots\\
    \cdots & \cdots & \cdots & \cdots\\
    a_n & b_n & c_n & \cdots\\
  \end{bmatrix}
$$

It can be rewritten as column vectors:

$$
X=
  \begin{bmatrix}
    a_1 & b_1 & c_1 & \cdots\\
    a_2 & b_2 & c_2 & \cdots\\
    a_3 & b_3 & c_3 & \cdots\\
    \cdots & \cdots & \cdots & \cdots\\
    a_n & b_n & c_n & \cdots\\
  \end{bmatrix}
  =\begin{bmatrix}
  A & B & C & \cdots
  \end{bmatrix}
$$

Then the *covariance matrix* (still assuming every feature has zero mean) is:

$$\mathbf{S}_X=\frac{X^\top X}{n-1}=
  \begin{bmatrix}
    \sigma_{A}^2 & \sigma_{AB}^2 & \sigma_{AC}^2 & \sigma_{AD}^2 & \cdots \\
    \sigma_{AB}^2 & \sigma_{B}^2 & \sigma_{BC}^2 & \sigma_{BD}^2 & \cdots \\
    \sigma_{AC}^2 & \sigma_{BC}^2 & \sigma_{C}^2 & \sigma_{CD}^2 & \cdots \\
    \sigma_{AD}^2 & \sigma_{BD}^2 & \sigma_{CD}^2 & \sigma_{D}^2 & \cdots \\
    \vdots & \vdots & \vdots & \vdots & \ddots
  \end{bmatrix}
$$

- Square ($m \times m$), symmetric, variances on the diagonal, covariances off the diagonal
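
The matrix form can be checked against R's built-in `cov`, as a sketch. The four `mtcars` columns are an arbitrary pick, and the columns are mean-centered first because the formula assumes zero-mean features.

```r
# Four numeric features from mtcars, mean-centered (but not rescaled).
X <- scale(as.matrix(mtcars[, c("mpg", "hp", "wt", "qsec")]),
           center = TRUE, scale = FALSE)
n <- nrow(X)

S <- crossprod(X) / (n - 1)    # crossprod(X) is t(X) %*% X

all.equal(S, cov(X), check.attributes = FALSE)   # same as built-in cov()
isSymmetric(S)                                   # symmetric, m x m
all.equal(unname(diag(S)), unname(apply(X, 2, var)))  # variances on the diagonal
```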

- If all features are completely independent of each other, then all covariances are 0.
- That is: The covariance matrix is a *diagonal matrix* (all zeros, except on its diagonal).
- So... What is this matrix $P$ such that for $Y=XP$, covariance matrix $\mathbf{S}_Y$ is diagonal?

- Like basically every other question in linear algebra, the answers are:
  - Eigendecomposition
  - SVD

- That magical transform matrix $P$ is a matrix whose columns are the eigenvectors of $X^\top X$. (Left as an exercise for the reader.)  Since $X^\top X$ (the covariance matrix, up to the factor $\frac{1}{n-1}$) is symmetric and positive semidefinite, its eigenvectors can be chosen to form an orthonormal basis, and its eigenvalues are non-negative (obviously).
- The eigenvectors are the *principal components* of $X$ (in order, if sorted by decreasing eigenvalue).
- The corresponding eigenvalues of $\mathbf{S}_X$ are the variance of $X$ 'along' each component (also equal to the diagonal of $\mathbf{S}_Y$), or the 'variance explained' by each component.
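
One way to see why this works (a sketch, assuming $X$ is mean-centered): for $Y=XP$,

$$\mathbf{S}_Y=\frac{(XP)^\top (XP)}{n-1}=P^\top \mathbf{S}_X P$$

and if the columns of $P$ are orthonormal eigenvectors of $\mathbf{S}_X$, so that $\mathbf{S}_X = P \Lambda P^\top$ with $\Lambda$ diagonal, then $\mathbf{S}_Y = P^\top P \Lambda P^\top P = \Lambda$. The snippet below checks this numerically; the choice of `mtcars` columns is arbitrary.

```r
# Mean-centered data: rows are samples, columns are features.
X <- scale(as.matrix(mtcars[, c("mpg", "hp", "wt", "qsec")]),
           center = TRUE, scale = FALSE)

S_X <- cov(X)
eig <- eigen(S_X, symmetric = TRUE)  # eigenvalues come back in decreasing order
P <- eig$vectors                     # columns are the eigenvectors

Y <- X %*% P                         # data expressed in the new basis
S_Y <- cov(Y)

off_diag <- S_Y - diag(diag(S_Y))
max(abs(off_diag))                   # numerically zero: S_Y is diagonal
all.equal(diag(S_Y), eig$values)     # diagonal of S_Y = eigenvalues

# Built-in PCA gives the same variances (prcomp reports standard deviations).
pca <- prcomp(X)
all.equal(pca$sdev^2, eig$values)

# Fraction of total variance explained by each component.
eig$values / sum(eig$values)
```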

## Principal Component Analysis

- We have thus just derived (in abbreviated fashion) a ridiculously useful tool called PCA (Principal Component Analysis).
  - It is a linear algebra method that tries to find uncorrelated Gaussians.  Uncorrelated implies statistically independent only in special cases (e.g. when the data really is jointly Gaussian).
  - *ICA (Independent Component Analysis)* derives independent features using probability and information theory.

## Random Projections / RCA

- This is a stupid, stupid algorithm that shouldn't work:
  1. Pick $M$ random directions in the $N$-dimensional feature space, $M < N$.
  2. Project the $N$-dimensional data onto them.
  3. Is the projection good enough (e.g. low reprojection error)?
     - Yes: You're done.
     - No: Repeat step 1.
- It does work - very quickly, and irritatingly well.
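
The loop above can be sketched in a few lines of R. Several things here are assumptions rather than part of the algorithm as stated: the directions are drawn from a Gaussian and orthonormalized with QR, reprojection error is measured as mean squared error, and `random_projection`, `max_error`, and `max_tries` are hypothetical names. $M$ and $N$ are the reduced and original dimensions, as in the general form.

```r
random_projection <- function(X, M, max_error, max_tries = 100) {
  N <- ncol(X)
  for (attempt in seq_len(max_tries)) {
    # Step 1: M random directions in N-dimensional space (orthonormal columns).
    R <- qr.Q(qr(matrix(rnorm(N * M), nrow = N)))
    # Step 2: project the data onto them.
    Y <- X %*% R
    # Step 3: reprojection error after mapping back to N dimensions.
    err <- mean((X - Y %*% t(R))^2)
    if (err <= max_error) break
  }
  # If no try was good enough, this is simply the last attempt.
  list(R = R, Y = Y, error = err, attempts = attempt)
}

set.seed(42)
X <- scale(as.matrix(mtcars))   # 32 samples x 11 standardized features
res <- random_projection(X, M = 5, max_error = 0.5)
dim(res$Y)
res$error
```

Nothing here looks at the data when choosing directions, which is what makes it so cheap compared to PCA.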

# Other References

- Official R intro: https://cran.r-project.org/doc/manuals/R-intro.html
- Evaluating the Design of the R Language (Morandat, Hill, Osvald, Vitek): http://r.cs.purdue.edu/pub/ecoop12.pdf
- Impatient R, http://www.burns-stat.com/documents/tutorials/impatient-r/
- R: The Good Parts, http://blog.datascienceretreat.com/post/69789735503/r-the-good-parts
- ISLR (An Introduction to Statistical Learning with Applications in R): http://www-bcf.usc.edu/~gareth/ISL/
- ESL (Elements of Statistical Learning): http://statweb.stanford.edu/~tibs/ElemStatLearn/

- For PCA: http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf