Goodfellow ~Ch 5.1-5.4
- Learning algorithms
- Capacity, overfitting, underfitting
- Hyperparameters & validation sets
- Estimators, bias, variance

### What is learning?
- A system learns if its performance (P) on tasks in some class (T) improves with experience (E)
- Learning is not the task, but the process that enables performance on the task (limitations)
- Definitions: Programs, algorithms, and processes (stochasticity?)
- ML systems considered as processing _examples_ composed of _features_

### Tasks


#### Classification
- Which of $k$ categories does an input belong to?
- $f:\mathbb{R}^n\rightarrow\{1,\dots,k\}$
- May also output distribution over classes
- Missing values (learn *set* of functions): common in medical applications
- Examples: 
  - MNIST (figure)
  - Anomaly detection (e.g. spam, credit card fraud)
  - Neuroscience:
    - Classify neuron type from electrophysiology data
    - Classify disease state from neuroimaging data (e.g. tumour in structural MRI)
    - Predict behavioural class (e.g. "currently reaching" vs. "resting") from brain measurements
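A minimal sketch of a classifier $f:\mathbb{R}^n\rightarrow\{1,\dots,k\}$ that also outputs a distribution over classes, assuming one linear score per class followed by a softmax. The names `softmax`, `classify`, `WEIGHTS` and `BIASES` and all numeric values are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(x, weights, biases):
    """Return (predicted class in 1..k, distribution over the k classes)."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(scores)
    label = probs.index(max(probs)) + 1  # classes are numbered from 1
    return label, probs

# Hypothetical parameters: k = 3 classes, n = 2 features.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIASES = [0.0, 0.0, 0.0]

label, probs = classify([2.0, 0.5], WEIGHTS, BIASES)
print(label, probs)
```

The hard label is just the most probable class, so the distribution carries strictly more information than the label alone.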

#### Regression
- What value of a quantity is associated with a given input?
- $f:\mathbb{R}^n\rightarrow\mathbb{R}$
- Examples:
  - Neuroscience:
    - Predict *level*/*quantity*: firing rate, survival time, gene expression
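As a sketch of the simplest regression case, the following fits $f:\mathbb{R}\rightarrow\mathbb{R}$ by ordinary least squares with a single feature. The data are made up (they could stand for, say, stimulus intensity and firing rate), and `fit_line` is an illustrative helper, not a reference implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_line(xs, ys)

def predict(x):
    """Predicted value of the quantity for a new input x."""
    return w * x + b

print(w, b, predict(4.0))
```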

#### Structured output
- Output is a vector or some other structure with relationships between members.
- Subsumes all other mapping tasks, though typically not applied to the well-known cases given above.
- Examples:
  - Partition of input (e.g. superpixels)
  - Image captioning (sentence describing image)
  - Parsing sentences into a tree describing grammatical structure

#### Transcription, translation
- Transcription: Observe a relatively unstructured representation and transcribe into discrete, textual form. 
  - e.g. OCR, speech recognition
- Translation: Sequence of symbols in one language to sequence of symbols in another language.

#### Synthesis and sampling
- Output is a newly generated example that is *similar* to the training data.
- Like structured output, but without a single correct output for each input (the model implicitly defines a distribution over outputs).
- Examples:
  - Generate audio for given sentence

#### Missing value imputation
- $\mathbf{x}\in\mathbb{R}^n$ with some $x_i$ missing
- Compare to sampling (partial).

#### Denoising
- Predict clean example $\mathbf{x}\in\mathbb{R}^n$ given a corrupted example $\tilde{\mathbf{x}}\in\mathbb{R}^n$.
- Unknown corruption process; i.e. learn $p(\mathbf{x}|\tilde{\mathbf{x}})$

#### Probability mass/density estimation
- Implicitly subsumes other tasks, and once we have explicitly obtained $p(\mathbf{x})$ we can perform the other tasks as well (e.g. missing value imputation).
- $p_\mathrm{model}:\mathbb{R}^n\rightarrow\mathbb{R}$
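To illustrate how an explicit $p(\mathbf{x})$ supports other tasks, here is a small sketch that estimates a probability mass function from counts and then uses it for missing value imputation. The dataset is made up, and `p_model` and `impute_x2` are hypothetical names for this example.

```python
from collections import Counter

# Made-up binary examples x = (x1, x2).
data = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 1), (1, 0), (0, 0)]

# Empirical probability mass function: relative frequency of each example.
counts = Counter(data)
p_model = {x: c / len(data) for x, c in counts.items()}

def impute_x2(x1):
    """Fill in a missing x2 with its most probable value under p_model, given x1."""
    return max((0, 1), key=lambda x2: p_model.get((x1, x2), 0.0))

print(p_model)
print(impute_x2(0), impute_x2(1))
```

Counting only works for small discrete spaces; for real-valued or high-dimensional $\mathbf{x}$ the estimate of $p(\mathbf{x})$ has to come from a model with far fewer parameters than there are possible inputs.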

### The performance measure, $P$
- Quantitative
- Usually task-specific (clarify)
- Choice not obvious: penalize frequent small mistakes or infrequent large mistakes? Global vs. local errors?
- Accuracy and error rate (expected 0-1 loss)
- Test set
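Accuracy and error rate can be sketched as the average 0-1 loss over a held-out test set. The labels and predictions below are made up for illustration.

```python
def error_rate(y_true, y_pred):
    """Average 0-1 loss: fraction of examples where prediction != target."""
    losses = [0 if t == p else 1 for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

def accuracy(y_true, y_pred):
    """Fraction of examples predicted correctly."""
    return 1.0 - error_rate(y_true, y_pred)

# Made-up test-set targets and model predictions.
y_test = [1, 0, 2, 2, 1]
y_hat = [1, 0, 1, 2, 0]

print(error_rate(y_test, y_hat), accuracy(y_test, y_hat))
```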

### The experience, $E$
- Datasets: collections of many examples/data points
- Design matrix
  - e.g. $\mathit{\mathbf{X}}\in\mathbb{R}^{150\times4}$ for irises (150 examples, 4 features). $X_{i,1}$ is the sepal length of plant $i$.
  - Not always possible; some data (e.g. images of different sizes) are heterogeneous and are described as sets instead of matrices: $\{\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(m)}\}$
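A small sketch of the two representations, using plain Python lists. The measurements are made up (they are not the real iris values), and only three plants are shown instead of 150.

```python
# Design matrix: one row per example, one column per feature.
# Columns: sepal length, sepal width, petal length, petal width.
X = [
    [5.0, 3.5, 1.5, 0.2],
    [6.0, 3.0, 4.5, 1.5],
    [6.5, 3.0, 5.5, 2.0],
]
m, n = len(X), len(X[0])         # 3 examples, 4 features
sepal_length_plant_2 = X[1][0]   # X_{2,1} in the 1-based notation of the text

# Heterogeneous data: examples of different lengths cannot form a matrix,
# so they are kept as a collection {x^(1), ..., x^(m)} instead.
examples = [[0.1, 0.9], [0.4, 0.4, 0.7], [1.0]]
lengths = [len(x) for x in examples]

print(m, n, sepal_length_plant_2, lengths)
```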


#### Unsupervised learning
- Experience a dataset with many features and learn useful structural properties
- Typically want to learn the entire probability distribution that generated the dataset (explicitly or not)
- Learn $p(\mathbf{x})$ from $\mathbf{x}$ examples.
- e.g. clustering

#### Supervised learning
- Each example experienced is associated with a label or target.
  - Labels may be simple numbers (e.g. class numbers) or more complex (e.g. correctly transcribed sentence).
- Learn $p(\mathbf{y}|\mathbf{x})$ from $(\mathbf{x},\mathbf{y})$ examples.

#### Supervised vs. unsupervised
- Given the chain rule, an unsupervised problem may be decomposed into $n$ supervised problems: 
$$p(\mathbf{x})=\prod_{i=1}^{n}p(\mathrm{x}_i|\mathrm{x}_1,\dots,\mathrm{x}_{i-1})$$
- By the definition of the conditional density, a supervised problem may be solved by unsupervised learning of the joint distribution:
$$p(y|\mathbf{x})=\frac{p(\mathbf{x},y)}{\sum_{y^\prime}p(\mathbf{x},y^\prime)}$$
- In any case, these terms help to roughly categorize problems. Traditionally, regression, classification, and structured output are considered supervised; density estimation is considered unsupervised.
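The second identity can be checked on a small discrete example: given a table for the joint $p(\mathbf{x},y)$, the conditional $p(y|\mathbf{x})$ follows by normalizing over $y^\prime$. The joint table below is made up, and `conditional` is an illustrative helper.

```python
# Made-up joint distribution p(x, y) over x in {0, 1} and y in {"a", "b"}.
joint = {
    (0, "a"): 0.1, (0, "b"): 0.3,
    (1, "a"): 0.4, (1, "b"): 0.2,
}

def conditional(joint, x):
    """p(y | x) = p(x, y) / sum over y' of p(x, y')."""
    marginal = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / marginal for (xi, y), p in joint.items() if xi == x}

p_y_given_x1 = conditional(joint, 1)
print(p_y_given_x1)
```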

#### Other paradigms
- Semi-supervised (only some examples labelled)
- Multi-instance (entire collections of examples labelled)
- Reinforcement learning (environment; feedback between learning system and experiences)

### Optimization
- Connect to "the experience"
- Practical concerns. Stochasticity and local minima. 

### Generalization
- How does a model perform on previously unseen inputs?
  - Example: Different coloured cat than in training examples.
- Training error vs. test/generalization error
- Difficulty: only get to observe training set (?).
- Data generating process: Assumption that training and test examples are identically distributed, and individual examples are independent of each other --> allows the generating process to be modeled as a distribution over a single example.
  - Refer to shared underlying distribution as *data generating distribution* or $p_\mathrm{data}$

#### Underfitting and overfitting
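Two goals determine how well a learning algorithm performs:
1. Make the training error small. Failing at this is *underfitting*.
2. Make the gap between training and test error small. Failing at this is *overfitting*.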
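The two quantities in this list can be made concrete with a toy experiment. The sketch below assumes a hypothetical data generating process ($y=2x$ plus Gaussian noise) and compares a very low-capacity model (always predict the training mean) with one that simply memorises the training set. Both models and all names are illustrative assumptions.

```python
import random

def mse(model, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(m, rng):
    """Toy data generating process (an assumption): y = 2x + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(20, rng)
x_test, y_test = make_data(200, rng)

mean_y = sum(y_train) / len(y_train)

# Low capacity: always predict the training mean, ignoring x.
constant_model = lambda x: mean_y

# High capacity: memorise every training pair, fall back to the mean elsewhere.
table = dict(zip(x_train, y_train))
memoriser = lambda x: table.get(x, mean_y)

for name, model in [("constant", constant_model), ("memoriser", memoriser)]:
    train_err = mse(model, x_train, y_train)
    test_err = mse(model, x_test, y_test)
    print(f"{name}: train={train_err:.3f} test={test_err:.3f} "
          f"gap={test_err - train_err:.3f}")
```

The constant model cannot even bring the training error down, while the memoriser reaches zero training error without learning anything that transfers to unseen inputs, so its entire test error shows up as the gap.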

#### Capacity
- Ability of a model to fit a wide variety of functions. 
- Often controlled by choosing the *hypothesis space* of functions the learning algorithm can select as solutions.
- Representational capacity (the family of functions the model can choose from) vs. effective capacity (what the learning algorithm can actually reach given additional limitations, such as an imperfect optimization process; bounded above by the representational capacity).
- Often considered in terms of number of parameters... but not all parameters are equal (VC dimension: "the largest possible value of $m$ for which there exists a training set of $m$ different examples that the classifier can label arbitrarily").
- Too high: overfitting. Too low: underfitting. Figure 5.2.
- Statistical learning theory: Gap between training and generalization error is bounded above by a quantity that grows with capacity, but shrinks with number of training examples.
  - Simpler functions more likely to generalize, but must still choose a sufficiently complex hypothesis to achieve low training error.
- Performance is typically best when model capacity is appropriate for the complexity of the task and the number of available examples.
- Example:
  - Quadratic has higher capacity than linear.
  - Pathological example (single-parameter universal approximator).
- Ideal model: Oracle that knows the true distribution.
  - May still make errors; e.g. due to noise inherent to generating distribution, or due to excluded variables involved in the deterministic relationship between $\mathbf{x}$ and $y$.
  - *Bayes error*: error incurred by an oracle. That is, the lower bound on the error.
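The linear-versus-quadratic example can be written out explicitly, assuming a single scalar input $x$ (the symbols $b$, $w$, $w_1$, $w_2$ are introduced here for illustration):

$$\hat{y}_\mathrm{linear}=b+wx,\qquad \hat{y}_\mathrm{quadratic}=b+w_1x+w_2x^2$$

Setting $w_2=0$ recovers the linear model, so the quadratic hypothesis space contains the linear one and can represent every function the linear model can, plus more. Both models are still linear in their parameters, so both can be fit in closed form.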

#### Non-parametric models
- Limit of infinite capacity; no parametrized function fixed prior to learning.
- Example: 
  - Nearest neighbour regression.
  - Wrap parametric learning algorithm inside another algorithm that optimizes no. of parameters as needed.
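A minimal sketch of nearest neighbour regression for scalar inputs: there is no fixed set of parameters, the "model" is the stored training set itself. The data and the name `nn_regress` are made up for illustration.

```python
def nn_regress(x, x_train, y_train):
    """1-nearest-neighbour regression for scalar inputs:
    return the target of the closest training input."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

# Made-up training set.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.0, 0.8, 0.9, 0.1]

pred_seen = nn_regress(1.0, x_train, y_train)  # a training input: returns its own target
pred_new = nn_regress(2.4, x_train, y_train)   # closest training input is 2.0

print(pred_seen, pred_new)
```

Because every training input is its own nearest neighbour, the training error is zero whenever the training inputs are distinct, however many examples are added.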

#### No Free Lunch theorem
"averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points"
i.e. we need to make assumptions about which data generating distributions are relevant

#### Regularization
- Any modification to a learning algorithm that is intended to reduce its generalization error but not its training error.
- Additional preferences/penalties about the hypothesis space, above simple inclusion/exclusion.
- e.g. weight decay
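Weight decay can be sketched for the simplest possible case: a single weight $w$, no intercept, and the objective $J(w)=\frac{1}{m}\sum_i (wx_i-y_i)^2+\lambda w^2$. Setting $dJ/dw=0$ gives $w=\sum_i x_iy_i/(\sum_i x_i^2+m\lambda)$. The data, the function name `fit_weight_decay` and the values of $\lambda$ are assumptions for illustration.

```python
def fit_weight_decay(xs, ys, lam):
    """Minimise (1/m) * sum((w*x - y)^2) + lam * w^2 for a single weight w
    (no intercept), using the closed-form solution."""
    m = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + m * lam)

# Made-up data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_plain = fit_weight_decay(xs, ys, lam=0.0)  # ordinary least squares
w_decay = fit_weight_decay(xs, ys, lam=1.0)  # the penalty shrinks w towards 0

print(w_plain, w_decay)
```

The penalised solution fits the training data slightly worse, which is the intended trade: a preference for smaller weights is expressed without removing any function from the hypothesis space.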

### Representation, causality
- Disentangling factors of variation (e.g. PCA limitations)
- Separability vs. representation (e.g. polar vs cartesian)
- Another example: brain disentangling object state from illumination, perspective
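The polar-versus-Cartesian point can be sketched with two made-up classes lying on concentric circles: a single threshold on the radius separates them, while no single threshold on the $x$ coordinate does. The radii, the threshold of 2.0 and the helper `to_polar` are illustrative assumptions.

```python
import math

# Two made-up classes: points on a circle of radius 1 and on a circle of radius 3.
angles = [k * math.pi / 4 for k in range(8)]
inner = [(1.0 * math.cos(a), 1.0 * math.sin(a)) for a in angles]
outer = [(3.0 * math.cos(a), 3.0 * math.sin(a)) for a in angles]

def to_polar(point):
    """Convert (x, y) to (radius, angle)."""
    x, y = point
    return math.hypot(x, y), math.atan2(y, x)

# Polar representation: one threshold on the radius separates the classes.
inner_is_inside = all(to_polar(p)[0] < 2.0 for p in inner)
outer_is_outside = all(to_polar(p)[0] > 2.0 for p in outer)

# Cartesian representation: the x ranges of the two classes overlap,
# so no single threshold on x can separate them.
x_overlap = (min(p[0] for p in outer) < max(p[0] for p in inner)
             and max(p[0] for p in outer) > min(p[0] for p in inner))

print(inner_is_inside, outer_is_outside, x_overlap)
```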

### Classes of models/learning
- Venn diagram? (Fig 1.4)
- Deep learning vs. classic/rule-based systems (Fig 1.5)

VC dimension and capacity: single-parameter counterexample