Linear classification has two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.

The first component is a score function that maps the pixel values of an image to confidence scores for each class. Let's assume a training dataset of images $x_i \in R^D$, each associated with a label $y_i$. Here $i \in \{1 \dots N\}$ and $y_i \in \{1 \dots K\}$. That is, we have N examples (each with a dimensionality of D) and K distinct categories.

For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc.). We will now define the score function $f: R^D \mapsto R^K$ that maps the raw image pixels to class scores.

We start with the simplest possible function, a linear mapping: $f(x_i,W,b)=Wx_i+b$

In the above equation, we are assuming that the image $x_i$ has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]) and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, $x_i$ contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector.
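
As a rough illustration, the score function can be sketched in NumPy using the CIFAR-10 shapes described above. The values of `W`, `b`, and `x_i` below are random placeholders chosen for illustration (an assumption, not learned parameters or real pixels); only the shapes follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K = 3072, 10  # CIFAR-10: 32*32*3 input values, 10 classes

# Placeholder data: a fake flattened image and untrained parameters.
x_i = rng.integers(0, 256, size=(D, 1)).astype(np.float64)  # [D x 1]
W = 0.001 * rng.standard_normal((K, D))                     # [K x D]
b = np.zeros((K, 1))                                        # [K x 1]

def f(x, W, b):
    """Linear score function: maps a [D x 1] column to [K x 1] class scores."""
    return W @ x + b

scores = f(x_i, W, b)
print(scores.shape)  # (10, 1)
```

Each entry of `scores` is the score of one class; row `k` of `W` together with `b[k]` produces entry `k`.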

There are a few things to note:

- First, note that the single matrix multiplication $Wx_i$ is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W.
- We think of the input data $(x_i, y_i)$ as given and fixed, but we have control over the setting of the parameters W, b. Our goal will be to set these in such a way that the computed scores match the ground truth labels across the whole training set.
- The training data is used to learn the parameters W, b, but once the learning is complete we can discard the entire training set and only keep the learned parameters.
- Classifying a test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.

## Interpreting Linear Classifiers

A linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on the sign of each weight, the function has the capacity to like or dislike certain colors at certain positions in the image. For instance, you can imagine that the “ship” class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water).

### Point in High-Dimensional Space
Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space.
Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing.

![linearClassifierInterpritation](/img/programming/linearClassifierInterpritation.png)

Every row of W is a classifier for one of the classes. The geometric interpretation of these numbers is that as we change one of the rows of W, the corresponding line in the pixel space will rotate in different directions. The biases b, on the other hand, allow our classifiers to translate the lines.

### Template Matching

Another interpretation for the weights W is that each row of W corresponds to a template (sometimes also called a prototype) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an inner product (or dot product) one by one to find the one that “fits” best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class.

![linearClassifierTemplates](/img/programming/linearClassifierTemplates.png)
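
The template-matching view can be sketched with a toy example. The 3 x 4 matrix `W` and the 4-entry "image" `x` below are made-up values for illustration only, and the bias is left out to keep the focus on the inner products.

```python
import numpy as np

# Hypothetical toy setup: 3 classes, 4 "pixels" (values invented for illustration).
W = np.array([[ 1.0, 0.0, -1.0,  0.5],
              [ 0.0, 2.0,  0.0, -1.0],
              [-1.0, 0.5,  1.0,  0.0]])
x = np.array([2.0, 1.0, 0.0, 1.0])

# Compare the image against each template (row of W), one by one.
scores_by_template = np.array([np.dot(template, x) for template in W])

# The same scores come out of a single matrix-vector product.
scores = W @ x

# The template that "fits" best is the one with the highest score.
best_class = int(np.argmax(scores))
print(scores, best_class)  # [ 2.5  1.  -1.5] 0
```

The loop and the matrix product give identical scores; the matrix form is simply the batched version of the per-template comparison.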

## Bias Simplification

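A common simplifying trick is to represent the two parameters W, b as one. We combine the two sets of parameters into a single matrix that holds both of them by extending the vector $x_i$ with one additional dimension that always holds the constant 1. With the extra dimension, the new score function simplifies to a single matrix multiply: $f(x_i,W)=Wx_i$. In CIFAR-10, $x_i$ is now [3073 x 1] instead of [3072 x 1], and W is now [10 x 3073] instead of [10 x 3072], with the extra column of W corresponding to the bias b.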

 ![biasSimplification](/img/programming/biasSimplification.png)
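
A minimal sketch of this trick in NumPy, assuming random placeholder values for `W`, `b`, and `x_i` (the names `W_ext` and `x_ext` are introduced here for illustration). It checks that the extended single-matrix form produces the same scores as the original two-parameter form.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 10, 3072

# Placeholder parameters and input (random values, for illustration only).
W = rng.standard_normal((K, D))
b = rng.standard_normal((K, 1))
x_i = rng.standard_normal((D, 1))

# Append b as an extra column of W, and a constant 1 as an extra entry of x_i.
W_ext = np.hstack([W, b])                  # shape [K x (D + 1)]
x_ext = np.vstack([x_i, np.ones((1, 1))])  # shape [(D + 1) x 1]

scores_two_params = W @ x_i + b
scores_one_matrix = W_ext @ x_ext

print(W_ext.shape, x_ext.shape)                           # (10, 3073) (3073, 1)
print(np.allclose(scores_two_params, scores_one_matrix))  # True
```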

## Common Preprocessing

It is very common practice to normalize your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to center your data by subtracting the mean from every feature. In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image to get images where the pixels range approximately over [-127 … 127]. A further common preprocessing step is to scale each input feature so that its values range over [-1, 1].
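
The two steps might look like the following sketch. The array `X_train` is a random stand-in for flattened training images, and dividing each feature by its largest absolute centered value is just one possible way (an assumption here) to reach the [-1, 1] range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for training images: one flattened image per row, values in [0, 255].
X_train = rng.integers(0, 256, size=(500, 3072)).astype(np.float64)

# Center: compute the mean image across the training set and subtract it.
mean_image = X_train.mean(axis=0)        # shape (3072,)
X_centered = X_train - mean_image        # broadcast over all images

# Scale: divide each feature by its largest absolute value so it spans [-1, 1].
# (Assumes no feature is constant, which would make the divisor zero.)
max_abs = np.abs(X_centered).max(axis=0)
X_scaled = X_centered / max_abs

print(X_scaled.min() >= -1.0, X_scaled.max() <= 1.0)  # True True
```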

## Loss Function

We are going to measure our unhappiness with the classifier's predictions using a loss function (sometimes also referred to as the cost function or the objective). Intuitively, the loss will be high if we’re doing a poor job of classifying the training data, and it will be low if we’re doing well.

### Multiclass Support Vector Machine loss

A commonly used loss is the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$.

The score function takes the pixels and computes the vector $f(x_i,W)$ of class scores, which we will abbreviate to $s$ (short for scores). For example, the score for the $j$-th class is the $j$-th element: $s_j=f(x_i,W)_j$. The Multiclass SVM loss for the $i$-th example is then formalized as follows:

$$L_i=\sum_{j \neq y_i}{\max(0,s_j - s_{y_i} + \Delta)}$$

Let's work through an example to see how it works. Suppose that we have three classes that receive the scores $s=[13,-7,11]$, and that the first class is the true class (i.e. $y_i=0$). Also assume that $\Delta=10$ (a hyperparameter we will go into more detail about soon). The expression above sums over all incorrect classes $j \neq y_i$, so we get two terms:

$$L_i=\max(0, -7-13+10) + \max(0, 11-13+10)$$

You can see that the first term gives zero since [-7 - 13 + 10] gives a negative number, which is then thresholded to zero with the $\max(0,-)$ function. We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero with the max operation. The second term computes [11 - 13 + 10], which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta). If this is not the case, we will accumulate loss.
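
The worked example can be checked with a few lines of NumPy. The function name `svm_loss_i` is chosen here for illustration; it is a direct, unvectorized-over-examples sketch of the per-example loss, using the same scores, true class, and margin as in the text.

```python
import numpy as np

def svm_loss_i(s, y_i, delta):
    """Multiclass SVM loss for one example, given scores s and true class index y_i."""
    margins = np.maximum(0.0, s - s[y_i] + delta)
    margins[y_i] = 0.0  # the sum runs only over incorrect classes j != y_i
    return margins.sum()

s = np.array([13.0, -7.0, 11.0])  # class scores from the example
loss = svm_loss_i(s, y_i=0, delta=10.0)
print(loss)  # 8.0
```

The two incorrect classes contribute 0 and 8 respectively, matching the hand calculation.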

A last piece of terminology we’ll mention before we finish with this section is that the threshold-at-zero function $\max(0,-)$ is often called the hinge loss. You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0,-)^2$ that penalizes violated margins more strongly (quadratically instead of linearly).
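
To see the difference on the example scores used earlier in this section ($s=[13,-7,11]$, true class 0, $\Delta=10$), here is a small sketch of the squared variant. The function name `l2_svm_loss_i` is an illustrative choice, not a standard API.

```python
import numpy as np

def l2_svm_loss_i(s, y_i, delta):
    """Squared hinge (L2-SVM) loss for one example: each violated margin is squared."""
    margins = np.maximum(0.0, s - s[y_i] + delta)
    margins[y_i] = 0.0  # skip the correct class
    return np.sum(margins ** 2)

s = np.array([13.0, -7.0, 11.0])
squared_loss = l2_svm_loss_i(s, y_i=0, delta=10.0)
print(squared_loss)  # 64.0
```

The single violated margin of 8 contributes 64 instead of 8, which shows how the squared form penalizes violations more heavily.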

### Softmax
