# Linear Classification
We are now going to develop a more powerful approach to image classification that we will eventually extend naturally to entire Neural Networks and Convolutional Neural Networks. The approach has two major components:

* a **score function**: mapping the raw data to class scores
* a **loss function**: quantifying the agreement between the predicted scores and the ground truth labels

We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.

## Linear Classifier

Each image \\(x_i\\) in a training dataset of N samples and K classes is associated with a label \\(y_i\\). The image \\(x_i\\) can be flattened into a single column vector of shape [D x 1]. We then build a linear score function \\(f: R^D \mapsto R^K\\) that produces a score for every class given the image.

$$f(x_i, W, b) =  W x_i + b$$
where
$$x_i \in R^D  (i = 1 \dots N)$$
$$y_i \in \lbrace 1 \dots K \rbrace$$

The matrix **W** (of size [K x D]) and the vector **b** (of size [K x 1]) are the parameters of the function. The parameters in **W** are often called the weights, and **b** is called the bias vector because it influences the output scores without interacting with the actual data \\(x_i\\). However, you will often hear people use the terms weights and parameters interchangeably.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10. Some notes here:

* \\(W x_i\\) is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W
* The input data \\((x_i, y_i)\\) are given and fixed, but we have control over the setting of **W** and **b**. Intuitively, we want the correct class to have a score that is higher than the scores of the incorrect classes
* Once the learning is complete, we can discard the entire training set and only keep the learned parameters
* Classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images
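
The score function can be sketched with CIFAR-10 shapes as follows. This is a minimal sketch, assuming randomly initialized parameters and a random stand-in image; the names `linear_scores` and `rng` and the initialization scale are illustrative and not part of the notes above.

```python
import numpy as np

def linear_scores(x, W, b):
    """Return the K class scores f(x, W, b) = W x + b for one flattened image."""
    return W.dot(x) + b

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility
D, K = 32 * 32 * 3, 10           # CIFAR-10: 3072 pixels, 10 classes

W = 0.001 * rng.standard_normal((K, D))   # weights, one row (classifier) per class
b = np.zeros((K, 1))                      # bias vector
x = rng.integers(0, 256, size=(D, 1)).astype(np.float64)  # stand-in image as a column

scores = linear_scores(x, W, b)           # shape (10, 1)
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
```

Each entry of `scores` is the dot product of one row of `W` with the image plus that class's bias, which is the "one classifier per row" reading from the list above.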

**Bias trick** (homogeneous coordinates)

The linear score function \\(f(x_i, W, b) =  W x_i + b\\) can be simplified to a single matrix multiply by appending a constant 1 to \\(x_i\\) and appending **b** to **W** as an extra column.

$$f(x_i, W) =  W x_i$$
because
$$
\left[\begin{array}{cc}W & b \end{array}\right]
\left[\begin{array}{c}x_i \\ 1 \end{array}\right]
= W x_i + b
$$

With our CIFAR-10 example, \\(x_i\\) is now [3073 x 1] instead of [3072 x 1], and **W** is now [10 x 3073] instead of [10 x 3072].
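
The equivalence can be checked numerically. A minimal sketch, assuming random stand-in values for the parameters and the image (the names `W_ext` and `x_ext` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 3072, 10

W = rng.standard_normal((K, D))
b = rng.standard_normal((K, 1))
x = rng.standard_normal((D, 1))

W_ext = np.hstack([W, b])                 # [10 x 3073]: b becomes the last column
x_ext = np.vstack([x, np.ones((1, 1))])   # [3073 x 1]: a constant 1 is appended

scores_separate = W.dot(x) + b            # original form with an explicit bias
scores_combined = W_ext.dot(x_ext)        # single matrix multiply
```

Both `scores_separate` and `scores_combined` hold the same ten scores up to floating-point rounding.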

## Loss Function
We do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

We are going to measure our unhappiness with poor outcomes using a loss function (sometimes also referred to as the **cost** function or the **objective**). Intuitively, the loss will be high if we’re doing a poor job of classifying the training data, and it will be low if we’re doing well.

### Multiclass Support Vector Machine loss
The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin \\(\Delta\\). The Multiclass SVM loss for the i-th example is then formalized as follows:

$$L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$
where \\(s_j\\), the score for the j-th class, is the j-th element of the score vector (**s** is short for scores):
$$s_j = f(x_i, W)_j$$
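
As a worked example (the numbers are made up for illustration), suppose three classes receive the scores \\(s = [13, -7, 11]\\), the first class is the correct one, and \\(\Delta = 10\\). The loss sums over the two incorrect classes:

$$L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) = 0 + 8 = 8$$

The first term is zero because the score -7 is already lower than the correct score by more than the margin. The second term is positive because 11 is lower than 13 by only 2, which falls short of the required margin of 10 by 8.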


In [None]:
import numpy as np

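def L_i(x, y, W):
    """
    unvectorized version: multiclass SVM loss for a single example (x, y)
    - x is a column vector representing an image (with the bias dimension appended)
    - y is an integer giving the index of the correct class
    - W is the weight matrix, one row per class
    """
    delta = 1.0
    scores = W.dot(x)  # one score per class
    correct_class_score = scores[y]
    num_classes = W.shape[0]  # rows of W correspond to classes
    loss_i = 0.0
    for j in range(num_classes):  # iterate over all wrong classes
        if j == y:
            # skip the true class, it contributes no loss
            continue
        # accumulate the margin violation of the j-th class
        loss_i += max(0, scores[j] - correct_class_score + delta)
    return loss_i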

In [None]:
import numpy as np

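def L_i_vectorized(x, y, W):
    """half-vectorized version: no loop over classes for a single example"""
    delta = 1.0
    scores = W.dot(x)
    # compute the margins for all classes in one vector operation
    margins = np.maximum(0, scores - scores[y] + delta)
    # the y-th entry equals delta at this point, so zero it out:
    # the true class must not contribute to the loss
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i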

In [None]:
D_IMG = 33
N_CLASS = 10

W = np.random.rand(N_CLASS, D_IMG)
x = np.random.randint(0, 256, D_IMG)  # pixel values in [0, 255]; the upper bound is exclusive
y = 4 # 0 <= y < N_CLASS

L_i(x, y, W), L_i_vectorized(x, y, W)

In [None]:
import numpy as np

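def L(X, Y, W):
    """
    fully-vectorized implementation :
    - X holds all the training examples as columns (e.g. 3073 x 50,000 in CIFAR-10)
    - Y is array of integers specifying correct class (e.g. 50,000-D array)
    - W are weights (e.g. 10 x 3073)

    Returns the data loss averaged over all examples.
    """
    # evaluate loss over all examples in X without using any for loops
    # (one possible completion; integer indexing with Y stands in for an
    # explicit 10 x 50,000 one-hot matrix)
    delta = 1.0
    num_train = X.shape[1]
    cols = np.arange(num_train)
    scores = W.dot(X)                       # [K x N] class scores
    correct_class_scores = scores[Y, cols]  # score of the true class, per example
    margins = np.maximum(0, scores - correct_class_scores + delta)
    margins[Y, cols] = 0                    # the true class contributes no loss
    return np.sum(margins) / num_train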

## Softmax Classifier
The Softmax classifier is the generalization of the binary Logistic Regression classifier to multiple classes. It gives a slightly more intuitive output (normalized class probabilities) and has a probabilistic interpretation.

**softmax function** $$f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$
Here \\(z_j\\) is the j-th element of **z**. The function takes a vector **z** of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.

The **cross-entropy** between a "true" distribution **p** and an estimated distribution **q** is defined in information theory as follows.

$$H(p,q) = - \sum_x p(x) \log q(x)$$

The cross-entropy can be written in terms of the entropy and the Kullback-Leibler divergence as follows.

$$H(p,q) = H(p) + D_{KL}(p||q)$$
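
This decomposition can be verified numerically. A minimal sketch, assuming two small made-up distributions with strictly positive entries so that every logarithm is finite:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative values)
q = np.array([0.5, 0.3, 0.2])   # estimated distribution (illustrative values)

cross_entropy = -np.sum(p * np.log(q))      # H(p, q)
entropy = -np.sum(p * np.log(p))            # H(p)
kl_divergence = np.sum(p * np.log(p / q))   # D_KL(p || q)

# The identity H(p, q) = H(p) + D_KL(p || q) holds up to rounding.
identity_holds = np.isclose(cross_entropy, entropy + kl_divergence)
```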

**cross-entropy loss**
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j}$$

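Here **q** is the vector of softmax class probabilities and **p** is the "true" distribution that puts all of its mass on the correct class. Because \\(H(p) = 0\\) for such a **p**, the cross-entropy loss:

* is equivalent to minimizing the KL divergence between the two distributions (a measure of distance)
* wants the predicted distribution to have all of its mass on the correct answer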

**Probabilistic interpretation** Looking at the expression, we see that
$$P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j} }$$

We are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE).
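
The two forms of the cross-entropy loss and the probabilistic reading can be checked numerically. A minimal sketch, assuming small random scores so that `np.exp` does not overflow; the function name `cross_entropy_loss_i` and the data are illustrative.

```python
import numpy as np

def cross_entropy_loss_i(x, y, W):
    """Cross-entropy loss for one example with scores f = W x.

    Assumes the scores are small enough that np.exp does not overflow.
    Returns the loss and the softmax probabilities.
    """
    f = W.dot(x)
    probs = np.exp(f) / np.sum(np.exp(f))   # softmax: P(class j | x; W)
    return -np.log(probs[y]), probs

rng = np.random.default_rng(0)              # fixed seed, only for reproducibility
W = 0.01 * rng.standard_normal((10, 33))    # 10 classes, 33-dimensional input
x = rng.standard_normal(33)
y = 4                                       # index of the correct class

loss_i, probs = cross_entropy_loss_i(x, y, W)

# The equivalent second form: L_i = -f_y + log(sum_j exp(f_j))
f = W.dot(x)
loss_i_alt = -f[y] + np.log(np.sum(np.exp(f)))
```

Exponentiating the negated loss recovers the probability assigned to the correct class, `np.exp(-loss_i) == probs[y]` up to rounding, which is the negative log likelihood reading.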

### Numeric stability
When you write code to compute the Softmax function in practice, the intermediate terms \\(e^{f_{y_i}}\\) and \\(\sum_j e^{f_j}\\) may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use the normalization trick shown below.
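
The trick relies on the fact that multiplying the numerator and the denominator by the same constant \\(C\\) leaves the fraction unchanged:

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$

A common choice is \\(\log C = -\max_j f_j\\), which shifts the scores so that the largest one is zero and every exponential is at most one.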

In [None]:
import numpy as np

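f = np.array([123, 456, 789])  # example with 3 classes, each having a large score
p = np.exp(f) / np.sum(np.exp(f))  # bad: np.exp(789) overflows to inf, so p contains nan

f, p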

In [None]:
import numpy as np

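# reuses f from the previous cell and shifts it so that the highest score is 0
f -= np.max(f)  # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f))  # safe to compute, gives the correct answer

f, p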

## Possibly confusing naming conventions
To be precise, the SVM classifier uses the **hinge loss**, also sometimes called the **max-margin loss**. The Softmax classifier uses the **cross-entropy loss**. The Softmax classifier gets its name from the **softmax function**, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.

## Regularization
Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores meet the margins, and \\(L_i = 0\\) for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples, then any multiple of these parameters \\(\lambda W\\) where \\(\lambda > 1\\) will also give zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences.
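
The scaling argument can be checked numerically. A minimal sketch, assuming a tiny hand-made two-class problem; the helper `svm_loss_i` and all numbers are illustrative.

```python
import numpy as np

def svm_loss_i(x, y, W, delta=1.0):
    """Multiclass SVM loss for a single example (x, y)."""
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + delta)
    margins[y] = 0   # the true class contributes no loss
    return np.sum(margins)

W = np.array([[2.0, 0.0],
              [0.0, 2.0]])
X = np.array([[1.0, 0.0],    # columns are examples
              [0.0, 1.0]])
Y = np.array([0, 1])         # correct class of each column

# Loss of every example under W and under the scaled weights 3 * W.
losses = [svm_loss_i(X[:, i], Y[i], W) for i in range(2)]
losses_scaled = [svm_loss_i(X[:, i], Y[i], 3.0 * W) for i in range(2)]
```

Both lists contain only zeros: the data loss alone cannot distinguish `W` from `3.0 * W`.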

In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty \\(R(W)\\). The most common regularization penalty is the squared L2 norm, which discourages large weights through an elementwise quadratic penalty over all parameters:

$$R(W) = \sum_k\sum_l W_{k,l}^2$$

That is, the full Multiclass SVM loss becomes:

$$L =  \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss}$$

The L2 penalty prefers smaller and more diffuse weight vectors. This effect can improve the generalization performance of the classifier on test images and lead to less overfitting.
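
The preference for diffuse weights and the full loss can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `l2_penalty` and `full_loss`, the per-example losses, and the regularization strength are all made up for illustration.

```python
import numpy as np

def l2_penalty(W):
    """R(W): the sum of all squared entries of W."""
    return np.sum(W * W)

def full_loss(data_losses, W, reg):
    """Mean of the per-example losses L_i plus reg * R(W); reg plays the role of lambda."""
    return np.mean(data_losses) + reg * l2_penalty(W)

# Two weight vectors that give the same score on x but different penalties.
x = np.ones(4)
w_peaked = np.array([1.0, 0.0, 0.0, 0.0])
w_diffuse = np.array([0.25, 0.25, 0.25, 0.25])

score_peaked = w_peaked.dot(x)          # 1.0
score_diffuse = w_diffuse.dot(x)        # 1.0
penalty_peaked = l2_penalty(w_peaked)   # 1.0
penalty_diffuse = l2_penalty(w_diffuse) # 0.25

# Full loss for made-up per-example losses and regularization strength.
W = np.vstack([w_peaked, w_diffuse])
total = full_loss(np.array([0.0, 2.0, 1.0]), W, reg=0.1)
```

Both weight vectors produce the same score, but the diffuse one has a quarter of the penalty, so the regularized loss prefers it.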
