# Terminology
The following cells describe important terms that I need to learn or become more familiar with.

Sources:
- [Wiki](https://en.wikipedia.org)
- [Machine-learning course by Google](https://developers.google.com/machine-learning/crash-course/)

## Key ML terminology

In **supervised machine learning**, ML systems learn how to combine inputs to produce useful predictions on never-before-seen data.

### Labels
A **label** is the thing we are predicting - the `y` variable in simple linear regression. The label could be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip, or just about anything.

### Features
A **feature** is an input variable - the `x` variable in simple linear regression. A simple machine learning project might use a single feature, while a more sophisticated machine learning project could use millions of features, specified as:

$$ x_1, x_2, \ldots, x_N $$

In an e-mail spam detector example, the features could include the following:
- Words in the e-mail text
- Sender's address
- Time of day the e-mail was sent
- Whether the e-mail contains the phrase "one weird trick"
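
As a rough illustration of the list above, the sketch below turns one raw e-mail into a dictionary of features. The function name `extract_features`, the dictionary keys, and the sample message are all hypothetical; a real spam detector would likely encode these values numerically.

```python
from datetime import datetime


def extract_features(text, sender, sent_at):
    """Turn one raw e-mail into a dictionary of features (hypothetical layout)."""
    lowered = text.lower()
    return {
        "words": lowered.split(),                        # words in the e-mail text
        "sender": sender.lower(),                        # sender's address
        "hour_sent": sent_at.hour,                       # time of day the e-mail was sent
        "has_weird_trick": "one weird trick" in lowered,  # phrase present?
    }


# Made-up example message, used only to show the shape of the result.
example = extract_features(
    "Doctors hate this ONE WEIRD TRICK",
    "Promo@Example.com",
    datetime(2020, 1, 1, 3, 30),
)
print(example)
```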

### Models
A **model** defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". The two phases of a model's life are:

- **Training** means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
- **Inference** means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (`y'`).
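
The two phases can be sketched with a one-feature linear model. This is only an illustration: the closed-form least-squares fit used in `train` is my assumption (the notes do not prescribe a training method), and the function names and data are made up.

```python
def train(xs, ys):
    """Training: learn a slope w and intercept b from labeled examples.

    Uses the closed-form least-squares solution (an assumption for this sketch).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def infer(model, x):
    """Inference: apply the trained model to an unlabeled example."""
    w, b = model
    return w * x + b


# Labeled examples generated from y = 2x + 1 (made-up data).
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(model)               # learned (w, b)
print(infer(model, 10.0))  # prediction y' for an unseen x
```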


### Regression vs. Classification

A **regression** model predicts continuous values. For example, regression models make predictions that answer questions like the following:
- What is the value of a house?
- What is the probability that a user will click on an ad?

A **classification** model predicts discrete values. For example, classification models make predictions that answer questions like the following:
- Is a given email message spam or not spam?
- Is this an image of a dog, a cat, or a hamster?

## Linear regression

In linear regression, we define the model as a linear function:

$$ y = mx+b$$

where:

- $y$ is the value we are trying to predict
- $m$ is the slope of the line
- $x$ is the value of our input feature
- $b$ is the y-intercept, i.e. the point at which the line crosses the y-axis

By convention in machine learning, the equation is slightly different:

$$ y' = b+w_1x_1 $$

where:

- $y'$ is the predicted label (the desired output)
- $b$ is the bias (the y-intercept), sometimes referred to as $w_0$
- $w_1$ is the weight of feature 1. Weight is the same concept as the "slope" $m$ in the traditional equation of a line.
- $x_1$ is a feature (a known input)


This example shows a model that depends on one feature. Models can rely on multiple features, each having a separate weight:

$$ y' = b+w_1x_1+w_2x_2+w_3x_3$$
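
The multi-feature equation can be written as a short function. This is a minimal sketch; the name `predict` and the numbers are illustrative only.

```python
def predict(bias, weights, features):
    """Compute y' = b + w_1*x_1 + ... + w_N*x_N."""
    if len(weights) != len(features):
        raise ValueError("each feature needs exactly one weight")
    return bias + sum(w * x for w, x in zip(weights, features))


# Three features, as in the equation above (made-up values).
y_pred = predict(bias=0.5, weights=[2.0, -1.0, 0.25], features=[1.0, 3.0, 4.0])
print(y_pred)  # 0.5 + 2.0 - 3.0 + 1.0 = 0.5
```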


## Training and loss

**Training** a model simply means learning (determining) good values for all the weights and the bias from labeled examples. In supervised learning, a machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss; this process is called **empirical risk minimization**.

**Loss** is the penalty for a bad prediction. That is, loss is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples. For example, the figure below shows a high-loss model on the left and a low-loss model on the right:
- the arrows represent loss
- the blue lines represent predictions

![Figure](https://developers.google.com/machine-learning/crash-course/images/LossSideBySide.png)

Notice that the arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the line in the right plot is a much better predictive model than the line in the left plot.

### Squared loss

The linear regression models in the examples discussed next use a loss function called **squared loss** (also known as **L<sub>2</sub> loss**). The squared loss for a single example is as follows:

```
  = the square of the difference between the label and the prediction
  = (observation - prediction(x))^2
  = (y - y')^2
```
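
In code, the squared loss for a single example is a one-liner. The function name `squared_loss` and the numbers are only for illustration.

```python
def squared_loss(label, prediction):
    """Squared (L2) loss for a single example: (y - y')^2."""
    return (label - prediction) ** 2


print(squared_loss(3.0, 2.5))  # 0.25
print(squared_loss(3.0, 3.0))  # 0.0, a perfect prediction has zero loss
```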

**Mean squared error (MSE)** is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$$ MSE = \frac{1}{N} \sum_{(x,y) \in D} (y-prediction(x))^2 $$

where:
- $(x, y)$ is an example in which
  - $x$ is the set of features
  - $y$ is the example's label
- $prediction(x)$ is a function of the weights and bias in combination with the set of features $x$
- $D$ is a data set containing many labeled examples, which are $(x, y)$ pairs
- $N$ is the number of examples in $D$
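
The MSE formula maps directly onto a small function. In this sketch, the data set `D` and the two candidate models are made up to show how a lower MSE indicates a better fit.

```python
def mse(dataset, prediction):
    """Average squared loss over a data set D of (x, y) pairs."""
    return sum((y - prediction(x)) ** 2 for x, y in dataset) / len(dataset)


# Hypothetical labeled examples that follow y = 2x exactly.
D = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def good_model(x):
    return 2.0 * x      # matches the data


def poor_model(x):
    return x + 3.0      # wrong slope and intercept


print(mse(D, good_model))  # 0.0
print(mse(D, poor_model))  # larger, so this model fits worse
```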

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.


## Backpropagation

In machine learning, backpropagation (BP) is a widely used algorithm for training feedforward neural networks. In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input-output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent (SGD), are commonly used. The BP algorithm computes the gradient of the loss function with respect to each weight using the chain rule, one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule. This makes it an example of dynamic programming.

The term *backpropagation* refers strictly to the algorithm for computing the gradient, not to how the gradient is used. However, the term is often used loosely to refer to the entire learning algorithm.
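
To make the chain-rule computation described above concrete, here is a minimal sketch for a tiny network with one input, one sigmoid hidden node, one linear output, and squared loss. The network shape, the names, and the numbers are all assumptions chosen for illustration; the result is compared against the naive approach of perturbing each weight individually.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w1, w2, x, y):
    """Forward pass of the tiny network y' = w2 * sigmoid(w1 * x), squared loss."""
    return (y - w2 * sigmoid(w1 * x)) ** 2


def backprop(w1, w2, x, y):
    """Return (dC/dw1, dC/dw2), working backward from the output layer."""
    # Forward pass, keeping intermediate values for reuse.
    h = sigmoid(w1 * x)
    y_pred = w2 * h
    # Backward pass: each step reuses the gradient computed just before it.
    d_y_pred = 2.0 * (y_pred - y)   # dC/dy'
    d_w2 = d_y_pred * h             # weight of the last layer
    d_h = d_y_pred * w2             # propagate to the hidden activation
    d_z = d_h * h * (1.0 - h)       # through the sigmoid
    d_w1 = d_z * x                  # weight of the first layer
    return d_w1, d_w2


def numeric_gradient(w1, w2, x, y, eps=1e-6):
    """Naive check: perturb each weight separately (central differences)."""
    d_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    d_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return d_w1, d_w2


analytic = backprop(0.5, -1.5, 2.0, 1.0)
numeric = numeric_gradient(0.5, -1.5, 2.0, 1.0)
print(analytic)
print(numeric)  # should agree with the analytic gradient to several decimals
```

Note that `backprop` returns only the gradient; using it to update the weights (for example with gradient descent) is a separate step.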

Computing:
In general, BP computes the gradient in *weight space* of a feedforward neural network with respect to a loss function.

We can denote:

- $x$ : input (vector of features)
- $y$ : target output
    - for classification, the output will be a vector of class probabilities, e.g. $(0.1, 0.7, 0.2)$, and the target output is a specific class, encoded as a one-hot/dummy variable, e.g. $(0, 1, 0)$
- $C$ : loss function or "cost function"
    - for classification this is usually *cross entropy* (XC, log loss), while for regression it is usually squared error loss (SEL)
- $L$ : the number of layers
- $W^l = (w^l_{jk})$ : the weights between layer $l-1$ and layer $l$, where $w^l_{jk}$ is the weight between the $k$-th node in layer $l-1$ and the $j$-th node in layer $l$
- $f^l$ : activation function at layer $l$
    - for classification the last layer is usually the logistic function for binary classification and softmax (softargmax) for multi-class classification, while for the hidden layers this was traditionally a sigmoid function (the logistic function or others) on each node (coordinate), but today it is more varied, with the rectifier (ramp, ReLU) being common
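
As a small illustration of the classification notation above, the sketch below builds a one-hot target and evaluates the cross-entropy loss for the example probabilities $(0.1, 0.7, 0.2)$. The function names are hypothetical, and the formula used is the standard $C = -\sum_i y_i \log p_i$.

```python
import math


def one_hot(class_index, num_classes):
    """Encode a class as a one-hot vector, e.g. class 1 of 3 -> [0, 1, 0]."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def cross_entropy(target, probabilities):
    """C = -sum_i target_i * log(probabilities_i); zero-target terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, probabilities) if t > 0)


target = one_hot(1, 3)        # (0, 1, 0)
output = [0.1, 0.7, 0.2]      # predicted class probabilities
print(cross_entropy(target, output))  # equals -log(0.7)
```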

The overall network is a combination of function composition and matrix multiplication:
$$ g(x) := f^L (W^L f^{L-1}(W^{L-1}\cdots f^1(W^1x)\cdots))$$
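
The composition $g(x)$ can be sketched as a loop over layers in plain Python. The layer sizes, the weight values, and the choice of ReLU followed by softmax are assumptions for illustration; like the formula, the sketch has no bias terms.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v; row j holds w_jk for node j."""
    return [sum(w_jk * v_k for w_jk, v_k in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    shift = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, weights, activations):
    """g(x) = f^L(W^L f^{L-1}(... f^1(W^1 x) ...))."""
    a = x
    for W, f in zip(weights, activations):
        a = f(matvec(W, a))
    return a


# Made-up weights: 2 inputs -> 3 hidden nodes -> 2 output classes.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
probs = forward([1.0, 2.0], [W1, W2], [relu, softmax])
print(probs)  # class probabilities that sum to 1
```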