# 3. Classification

 * Assign **labels** to **data points**
 * Training examples are generalized into one or more **decision rules**

## Linear classifiers
 * Model expressed as a normal $w$ to a hyperplane separating two classes (and, optionally, a bias (offset) term $b$).
 * Linear classifiers are easy to understand, quick to train, often work very well (especially in high dimensions) and make extremely efficient predictions.
 * In all future examples, use the *homogeneous representation* and only define a model as $w$, with no bias term.
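
A minimal sketch of what the homogeneous representation amounts to, assuming the usual trick of appending a constant feature of 1 to every data point so the bias becomes the last entry of $w$. The helper names `to_homogeneous` and `predict`, and the toy numbers, are made up for illustration.

```python
import numpy as np

def to_homogeneous(X):
    """Append a constant 1 to every row so the bias b folds into w."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict(w, X_h):
    """Label each homogeneous point by the side of the hyperplane it lies on."""
    return np.where(X_h @ w >= 0, 1, -1)

# Hypothetical toy model: normal (1, -1) and bias 0.5, stored as one vector.
w = np.array([1.0, -1.0, 0.5])
X = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = predict(w, to_homogeneous(X))
```

With this convention, $w^Tx + b$ for the original point is just $w^Tx$ for the extended point, which is why the bias can be dropped from the formulas.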

## Support Vector Machines
 * Max-margin classifiers
 * Find $w$ so that the margin to the closest representatives of each class is maximized. It can be shown (but not here) that this is the right thing to do in order to improve generalization.
 * Margin size is $\mathit{margin} = \frac{1}{||w||_2}$
 * There's always noise $\implies$ model as slack variables.
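
A small numeric check of the margin formula, assuming the canonical scaling in which the closest points satisfy $|w^Tx| = 1$. The vectors below are invented for illustration.

```python
import numpy as np

def margin(w):
    """Margin of a max-margin classifier under the canonical scaling."""
    return 1.0 / np.linalg.norm(w)

def distances_to_hyperplane(w, X):
    """Euclidean distance of each row of X to the hyperplane w^T x = 0."""
    return np.abs(X @ w) / np.linalg.norm(w)

w = np.array([3.0, 4.0])                    # ||w||_2 = 5
X = np.array([[0.12, 0.16], [1.0, 1.0]])    # first point has w^T x = 1
dists = distances_to_hyperplane(w, X)
```

The first point sits exactly on the margin boundary, so its distance matches `margin(w)`; the second one lies further away.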

### Convex optimization
 * This is the way we solve the SVM constrained minimization problem
 * Minimization because we want to "shrink" our model's magnitude $||w||_2$ (a smaller norm means a larger margin, since the margin is $\frac{1}{||w||_2}$) and, as much as possible, the impact of the slack variables (data points that violate the margin or that we cannot avoid misclassifying).
 * Constrained because we also want the model to classify most points correctly: each point must satisfy $y_iw^Tx_i \ge 1 - \xi_i$ with $\xi_i \ge 0$. For most points we expect $\xi_i = 0$, so $y_iw^Tx_i\ge1$. A positive value means the point is on the right side of the separating hyperplane; a value of at least one means it is also beyond the separation margin.
 * **Problem:** current approach has a run-time complexity of $\Omega(n^2)$, and simply doesn't work if the data don't fit in memory.

### Convex optimization, reformulated
 * Because linear algebra is awesome, we can reformulate the (primal) SVM definition into something a little more useful.
 * At the optimum each slack is as small as the constraints allow, so $\xi_i = \max(0, 1 - y_iw^Tx_i)$; this is called the hinge loss ($l_{\operatorname{hinge}}$)
 * Shove this into the minimization formula.
 * We get: $\min_w{\left (w^Tw + C\sum\limits_{i}\max(0, 1 - y_iw^Tx_i) \right )}$
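
The unconstrained objective above can be evaluated directly. This is a sketch assuming labels in $\{-1, +1\}$ and the homogeneous representation; the function names and the toy data are illustrative only.

```python
import numpy as np

def hinge_loss(w, X, y):
    """Per-example hinge loss max(0, 1 - y_i * w^T x_i)."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def svm_objective(w, X, y, C=1.0):
    """w^T w + C * sum_i hinge_i, matching the formula in the notes."""
    return w @ w + C * hinge_loss(w, X, y).sum()

X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, 1])
w = np.array([1.0, 0.0])
losses = hinge_loss(w, X, y)
```

Here the first point is beyond the margin (loss 0), the second is inside the margin but correctly classified (loss 0.5), and the third is misclassified (loss 2).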

### Convex optimization, reformulated, again
 * We can formulate a multicriteria optimization problem in two ways:
     * Minimize everything at once
     * Minimize one objective, while keeping the other under a certain bound
 * Our above (re)formulation belongs to the first category (one big-ass $\min$)
 * Let's write it in the second way.
 * Minimize the second part, as long as the first part ($w^Tw$) stays below some threshold $B$. Setting $B = \frac{1}{\lambda}$ bounds the norm by $\frac{1}{\sqrt{\lambda}}$.
 * We write that as:
     * $\min_w\sum\limits_i\max(0, 1 - y_i w^T x_i)$ s.t. $||w||_2 \le \frac{1}{\sqrt{\lambda}}$
 * Interestingly enough, all 3 of our formulations produce the **same solutions**, provided $C$ and $\lambda$ are chosen to match. TODO: ask algebra guru for more details about why.
 * However, the complexity varies. A lot. The last version is the best in that regard.

## Convex optimization, in general
 * Many supervised learning problems consist of two components: a loss function and a regularizer.
 * From a (handwavily explained) Bayesian standpoint, the former represents the evidence, while the latter, our prior knowledge (i.e. we *know* our model shouldn't be *too* complex) [citation needed]
 * As we saw before, there are two ways to state these problems mathematically.
     * Using a single "large" minimization (l + r)
     * Using a little minimization (l) plus a constraint (r)
 * We will focus on the second technique, as it makes it possible to perform online learning.
 
 
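 * Online convex programming (OCP) processes one example at a time. To train a single SVM on a fixed data set, run OCP over the data and apply online-to-batch conversion: average the iterates $w_t$ to get the final model.
 * If we also pick the examples at random, the procedure is stochastic gradient descent (SGD).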
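
A rough sketch of how these pieces could fit together for the constrained SVM formulation: sample a point at random, take a hinge-loss subgradient step, project back onto the ball $||w||_2 \le \frac{1}{\sqrt{\lambda}}$, and average the iterates at the end. The projection step, the step size $\eta_t = 1/\sqrt{t}$, and all names and defaults below are assumptions for illustration, not something these notes prescribe.

```python
import numpy as np

def project_onto_ball(w, radius):
    """Scale w back onto the L2 ball of the given radius if it lies outside."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def sgd_svm(X, y, lam=0.01, n_steps=2000, seed=0):
    """Projected stochastic subgradient descent with iterate averaging."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    radius = 1.0 / np.sqrt(lam)
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, n_steps + 1):
        i = rng.integers(n)               # pick one example at random
        eta = 1.0 / np.sqrt(t)            # assumed step-size schedule
        if y[i] * (X[i] @ w) < 1:         # hinge loss is active for this point
            w = w + eta * y[i] * X[i]     # step against the subgradient -y_i x_i
        w = project_onto_ball(w, radius)  # enforce the norm constraint
        w_sum += w
    return w_sum / n_steps                # online-to-batch: average the iterates

# Hypothetical linearly separable toy data (homogeneous, no bias needed).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w_avg = sgd_svm(X, y)
```

Because every iterate lies inside the ball and the ball is convex, the averaged model also satisfies the constraint. Each step touches a single example, so the data never has to fit in memory at once.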

* TODO: strong convexity and its benefits 
    * Using geometric examples can really help!
* TODO: adapting to geometry
    * ADAGRAD explained really well in tutorial(s)
* TODO: parallelism

## Open questions
 * When transitioning from the first SVM formulation (with slack variables) to the second one, aren't we loosening a constraint by fixing $\xi$?
     * (tentative) It seems we're not, since we're taking multiple cases into consideration and merging them together into a single formulation using max.
 * Slide 04:18: Is the first (primal) SVM formulation a (ii)-type one (since it has a minimization and its constraint as a separate equation), or is it not eligible for this categorization?