#### KNN __Distance Metric__ to compare images:
- L1 Norm: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$
- L2 Norm: $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$

L1 depends on the choice of coordinate system, whereas L2 is unaffected by rotating the axes, since its unit ball is a circle and looks the same in every direction. So if the individual coordinates carry some special meaning, L1 may be a more natural fit; otherwise, L2 may be more natural.
![](images/db.png)
It's shown that the decision boundary of L1 tends to follow the coordinate axes.
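
As a small illustration of the two distances and of the coordinate-system claim, here is a minimal NumPy sketch. The two 2-D points and the 45-degree rotation are made-up values chosen only to make the effect visible; real inputs would be flattened images.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute per-coordinate differences."""
    return np.sum(np.abs(a - b))

def l2_distance(a, b):
    """Square root of the sum of squared per-coordinate differences."""
    return np.sqrt(np.sum((a - b) ** 2))

# Two toy 2-D points standing in for flattened images (made-up values).
a = np.array([1.0, 0.0])
b = np.array([0.0, 0.0])

# Rotate the coordinate system by 45 degrees.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
a_rot, b_rot = R @ a, R @ b

print(l1_distance(a, b), l1_distance(a_rot, b_rot))  # L1 changes under rotation
print(l2_distance(a, b), l2_distance(a_rot, b_rot))  # L2 stays the same
```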

KNN on raw pixels is essentially never used in practice, since it's hard to cover such a high-dimensional pixel space densely with training examples.

@FLANN

@[Recognizing and Learning Object Categories](http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html)

@[A Few Useful Things to Know About Machine Learning](https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

#### Hyper parameters
- Very Problem-dependent
- Must try them all out and see which is best
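
One way to "try them all out" is to sweep candidate values on a held-out validation split. The sketch below does this for the `k` of a KNN classifier; the synthetic two-class data, the split sizes, and the helper name `knn_predict` are assumptions made for illustration, not part of the notes.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points (L2)."""
    # Pairwise squared L2 distances, shape (num_query, num_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
# Synthetic two-class data (an assumption, just to have something to run on).
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]

# Hold out a validation split for choosing the hyperparameter.
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

val_acc = {}
for k in [1, 3, 5, 7, 11]:
    preds = knn_predict(X_train, y_train, X_val, k)
    val_acc[k] = np.mean(preds == y_val)

best_k = max(val_acc, key=val_acc.get)
print(val_acc, "best k:", best_k)
```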

#### Linear Classifier
A linear classifier learns only one template for each class
![](images/fails.png)
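
The "one template per class" view can be made concrete: each row of the weight matrix acts as a template, and the class score is its inner product with the input. This is a minimal sketch with toy sizes and random values, all of which are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_pixels = 3, 4   # toy sizes, chosen only for illustration

W = rng.normal(size=(num_classes, num_pixels))  # one row (template) per class
b = np.zeros(num_classes)
x = rng.normal(size=num_pixels)                 # a flattened "image"

scores = W @ x + b                              # one score per class

# Each score is just the inner product of x with that class's single template.
for c in range(num_classes):
    assert np.isclose(scores[c], W[c] @ x + b[c])

predicted_class = int(np.argmax(scores))
print(scores, predicted_class)
```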

### Multiclass SVM
![](images/svmloss.png)

- Note that the choice of margin **1** doesn't actually matter, since its effect is canceled by the scale of __W__
- The minimum possible value of the SVM loss is **0** and the maximum is $\infty$
- At the beginning, since the values of __W__ are small (so all scores are close to 0), the SVM loss is expected to be **#classes - 1**, which is a helpful sanity check when debugging
- The loss above omits the term for the correct class so that the minimum loss is zero; if that term is included in the sum, the new loss is the loss above plus 1
- Optionally, we may square the hinge loss (this adjusts the trade-off between goodness and badness)
- The loss function tells the algorithm what kinds of errors it should care about and what it should trade off against
- Suppose a dataset is perfectly separable and W gives L = 0; then 2W also gives L = 0. So to avoid overfitting, regularization is a must to "simplify" the parameters
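
Two of the points above (the initial loss of #classes - 1, and W and 2W both giving L = 0) can be checked numerically. This is a sketch under assumptions: the function name `svm_loss`, the random data, and the tiny separable example are all made up for illustration.

```python
import numpy as np

def svm_loss(W, X, y, delta=1.0):
    """Mean multiclass hinge loss. W: (C, D), X: (N, D), y: (N,) integer labels."""
    scores = X @ W.T                                   # (N, C)
    correct = scores[np.arange(len(y)), y][:, None]    # (N, 1)
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(len(y)), y] = 0.0                # skip the j == y_i term
    return margins.sum(axis=1).mean()

rng = np.random.default_rng(0)
N, D, C = 50, 10, 5
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)

# Sanity check: with tiny weights all scores are ~0, so the loss is ~C - 1.
W_small = 1e-5 * rng.normal(size=(C, D))
print(svm_loss(W_small, X, y))   # close to 4 for C = 5

# Scale ambiguity at zero loss: if W gives L = 0, so does 2W.
X_sep = np.eye(3)                # three trivially separable points
y_sep = np.array([0, 1, 2])
W_sep = 2.0 * np.eye(3)          # correct score 2, others 0, so every margin is met
print(svm_loss(W_sep, X_sep, y_sep), svm_loss(2 * W_sep, X_sep, y_sep))
```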

Aside: Optimization in primal. If you’re coming to this class with previous knowledge of SVMs, you may have also heard of kernels, duals, the SMO algorithm, etc. In this class (as is the case with Neural Networks in general) we will always work with the optimization objectives in their unconstrained primal form. Many of these objectives are technically not differentiable (e.g. the max(x,y) function isn’t because it has a kink when x=y), but in practice this is not a problem and it is common to use a subgradient.

@ [Deep Learning using Linear Support Vector Machines](https://arxiv.org/abs/1306.0239)

### Softmax
![](images/softmax.png)
- The minimum is **0** and the maximum is $\infty$
- The initial L should be $\log$(__#nClasses__), since all scores start near 0 and every class gets probability 1/__#nClasses__

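When you’re writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is common to rescale the numerator and denominator by a constant $C$: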
![](images/sft.png)
where $\log C = -\max_j f_j$
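
The shift by the maximum score can be sketched as follows. The function name `softmax_loss` and the example scores are assumptions for illustration; the code also checks the $\log$(#classes) value expected at initialization.

```python
import numpy as np

def softmax_loss(scores, y):
    """Mean cross-entropy loss. scores: (N, C), y: (N,) integer labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # log C = -max_j f_j
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

# Large scores overflow a naive implementation but not the shifted one.
f = np.array([[123.0, 456.0, 789.0]])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(f) / np.exp(f).sum()      # contains nan because exp overflows
print(naive, softmax_loss(f, np.array([2])))

# Sanity check: all-zero scores give a loss of log(num_classes).
C = 10
print(softmax_loss(np.zeros((4, C)), np.zeros(4, dtype=int)), np.log(C))
```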

#### Differences:
The SVM interprets the classifier outputs as class scores, and its loss function encourages the correct class to have a score higher by a margin than the other class scores. The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high.

The only thing the SVM cares about is whether the correct class score exceeds the incorrect class scores by at least the margin. Softmax, in contrast, always wants to drive the probability of the correct class toward 1. This property of the SVM can intuitively be thought of as a feature: for example, a car classifier which is likely spending most of its “effort” on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.

But in practice, the two tend not to make a huge difference.

![](images/reg.png)

- L2 prefers to spread the influence across all of X, so that the decision depends on the entire X vector.
    - L2 regularization corresponds to MAP inference with a Gaussian prior on W
- L1 has the opposite interpretation: it prefers sparse weights.
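
A tiny numeric example may help here. The input and the two weight vectors below are made-up toy values: both give the same score, but the L2 penalty favors the spread-out weights, while the L1 penalty does not favor them over the sparse ones.

```python
import numpy as np

x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])       # sparse: relies on one input
w2 = np.array([0.25, 0.25, 0.25, 0.25])   # diffuse: uses every input

# Both weight vectors produce the same score, so the data loss cannot tell them apart.
print(w1 @ x, w2 @ x)                      # 1.0 1.0

def l2_penalty(w):
    return np.sum(w ** 2)

def l1_penalty(w):
    return np.sum(np.abs(w))

print(l2_penalty(w1), l2_penalty(w2))   # 1.0 0.25: L2 prefers the spread-out w2
print(l1_penalty(w1), l1_penalty(w2))   # 1.0 1.0: L1 does not penalize the sparse w1 more
```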

__Bias regularization__. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. However, in practical applications (and with proper data preprocessing) regularizing the bias rarely leads to significantly worse performance. This is likely because there are very few bias terms compared to all the weights, so the classifier can “afford to” use the biases if it needs them to obtain a better data loss.

__Per-layer regularization__. It is not very common to regularize different layers to different amounts (except perhaps the output layer). Relatively few results regarding this idea have been published in the literature.

__In practice__: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of p=0.5 is a reasonable default, but this can be tuned on validation data.
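
For reference, dropout with p = 0.5 might look like the sketch below. It assumes the "inverted dropout" variant and reads `p` as the probability of keeping a unit; both are assumptions, and the function name `dropout_forward` is hypothetical.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, rng=None):
    """Inverted dropout; p is taken here to be the probability of KEEPING a unit."""
    if not train:
        return h                      # nothing to do at test time, thanks to the rescaling below
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(h.shape) < p) / p   # zero out units, rescale the survivors by 1/p
    return h * mask

h = np.ones((1000, 100))
out = dropout_forward(h, p=0.5, rng=np.random.default_rng(0))
print(out.mean())   # close to 1.0: the expected activation is preserved
```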

Word of caution: It is important to note that the L2 loss is much harder to optimize than a more stable loss such as Softmax. Intuitively, it requires a very fragile and specific property from the network to output exactly one correct value for each input (and its augmentations). Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients. When faced with a regression problem, first consider if it is absolutely inadequate to quantize the output into bins. For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss. Classification has the additional benefit that it can give you a distribution over the regression outputs, not just a single output with no indication of its confidence. If you’re certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

#### Gradient:
Always use the analytical gradient in training, but the numerical gradient is useful for checking it when debugging
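
A gradient check compares the two. The sketch below uses a centered finite difference on a toy objective $f(w) = \sum w^2$ whose analytic gradient is $2w$; the objective, the step size `h`, and the helper name are assumptions chosen for illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h
        f_plus = f(x)
        x.flat[i] = old - h
        f_minus = f(x)
        x.flat[i] = old                      # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)
    return grad

# Toy objective: f(w) = sum(w^2), with analytic gradient 2w.
f = lambda w: np.sum(w ** 2)
w = np.random.default_rng(0).normal(size=(3, 4))

analytic = 2 * w
numeric = numerical_gradient(f, w)
rel_error = np.abs(analytic - numeric).max() / np.abs(analytic).max()
print(rel_error)   # should be tiny
```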

#### Minibatch:
Common minibatch sizes are 32/64/128
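
A minibatch update loop can be sketched as follows. To keep it short, it swaps in a mean-squared-error objective on synthetic linear data rather than a classification loss; the data, learning rate, and step count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic linear-regression data (an assumption for illustration only).
N, D = 1000, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w

w = np.zeros(D)
batch_size, learning_rate = 64, 0.1

for step in range(500):
    idx = rng.choice(N, size=batch_size, replace=False)   # sample a minibatch
    X_b, y_b = X[idx], y[idx]
    grad = 2 * X_b.T @ (X_b @ w - y_b) / batch_size        # gradient of the mean squared error
    w -= learning_rate * grad                              # vanilla SGD update

print(np.abs(w - true_w).max())
```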

#### Image Features
![](images/bow.png)

And the training process is 
![](images/tp.png)
The difference between this pipeline and a ConvNet is that in a CNN we don't need to write the feature extraction rules ourselves; we learn them from the data.

#### Neural Network
Also called “fully-connected networks” or sometimes “multi-layer perceptrons” (MLP)
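
As a rough sketch of what such a network computes, here is the forward pass of a two-layer fully-connected net. The layer sizes, the small random initialization, and the ReLU nonlinearity are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 8, 16, 3            # input, hidden, and class counts (toy sizes)

W1, b1 = 0.01 * rng.normal(size=(D, H)), np.zeros(H)
W2, b2 = 0.01 * rng.normal(size=(H, C)), np.zeros(C)

def forward(X):
    """Two-layer fully-connected net: scores = max(0, X W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU nonlinearity (an assumption)
    return hidden @ W2 + b2

X = rng.normal(size=(5, D))
scores = forward(X)
print(scores.shape)   # (5, 3)
```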

![](images/pa.png?)

![](images/mi.png?)

![](images/af.png)