# Bootcamp Preamble

## Datapoint

Here, a single data point $x^{(1)}$ is represented as a column vector in which each row is a different **feature**.

For example, if each training example is a house, its feature vector might include elements for its price, number of rooms, number of windows, and so on.

### $x^{(1)} =  \begin{bmatrix} x_1\\ x_2\\ \vdots \\ x_{n-1}\\ x_n \end{bmatrix}$

## Design Matrix
The **design matrix**, **X**, contains all of our training data. Each column represents one training example, and there are $m$ examples. Each row represents a different feature, and there are $n$ features. Hence the design matrix has dimensions $n$ by $m$.

### $\text{Design matrix},\ X = \begin{bmatrix} \vdots & & \vdots \\ x^{(1)} & \dots &  x^{(m)} \\ \vdots & & \vdots \end{bmatrix} = \begin{bmatrix} x_{11} & \dots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nm} \end{bmatrix} \in \mathbb{R}^{n \times m}$
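
As a small illustration of this layout, the sketch below builds a design matrix in plain Python from four made-up houses. The feature values and the names `houses` and `X` are assumptions for the example only.

```python
# Each inner list is one training example x^(i) with n = 3 features:
# price, number of rooms, number of windows (made-up values).
houses = [
    [250_000, 3, 8],   # x^(1)
    [400_000, 5, 12],  # x^(2)
    [180_000, 2, 5],   # x^(3)
    [320_000, 4, 10],  # x^(4)
]

# Transpose so that each COLUMN is an example and each ROW is a feature,
# matching the n-by-m convention described above.
X = [list(feature_row) for feature_row in zip(*houses)]

n = len(X)     # number of features (rows)
m = len(X[0])  # number of training examples (columns)

print(n, m)    # 3 4
print(X[1])    # number of rooms for every house: [3, 5, 2, 4]
```

Note that many libraries store examples as rows instead (an $m$ by $n$ layout), so it is worth checking which convention a given tool expects.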

## Hypothesis
The hypothesis, $h$, is the output of your model. It represents your current estimate of the mapping from input to output.
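
To make this concrete, here is a minimal sketch that assumes a linear model, $h = w \cdot x + b$. The linear form, the function name `hypothesis`, and the weights `w` and bias `b` are illustrative assumptions; the text above does not commit to any particular model.

```python
def hypothesis(x, w, b):
    """Linear hypothesis h = w . x + b for a single feature vector x (assumed model)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

x = [3.0, 8.0]    # features of one example: rooms, windows (made-up)
w = [50.0, 10.0]  # hypothetical weights, one per feature
b = 20.0          # hypothetical bias

h = hypothesis(x, w, b)  # 50*3 + 10*8 + 20
print(h)                 # 250.0
```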

## Loss/cost function

Your loss function measures how badly your model performs: the larger the loss, the worse the predictions. We will represent the loss of our models with the symbol $J$.

#### Mean Squared Error Loss
MSE loss is the average over all training examples of the squared error between your hypothesis and the label. A factor of $\frac{1}{2}$ is often included so that it cancels the 2 brought down from the exponent when differentiating, leaving no constant factor in the gradient.

### $ J = \frac{1}{2m} \sum_{i=1}^{m} (h^{(i)} - y^{(i)})^2$
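
A minimal sketch of this loss in plain Python, assuming the hypotheses and labels are given as two equal-length lists. The function name `mse_loss` and the sample values are made up for illustration.

```python
def mse_loss(h, y):
    """Halved mean squared error: (1 / 2m) * sum over i of (h_i - y_i)^2."""
    m = len(y)
    return sum((h_i - y_i) ** 2 for h_i, y_i in zip(h, y)) / (2 * m)

h = [2.0, 4.0, 6.0]  # model outputs (made-up)
y = [1.0, 4.0, 8.0]  # labels (made-up)

J = mse_loss(h, y)   # (1 + 0 + 4) / (2 * 3) = 5/6, roughly 0.833
print(J)
```

A perfect model, where every hypothesis equals its label, gives a loss of exactly zero.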

#### Binary Cross entropy loss
BCE loss is used to calculate error for classification tasks. 

### $ J = - \sum_{i=1}^{m} \left[ y^{(i)} \cdot \log(h^{(i)}) + (1-y^{(i)}) \cdot \log(1-h^{(i)}) \right]$

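In classification tasks, the label of a datapoint for each class can only take a binary value of 0 or 1, i.e. it *is* a member of that class or it *is not* a member of that class, and the output is usually a *confidence* value $\in [0, 1]$.
When $y = 0$, the first term is 'turned off' and only the second term, $\log(1-h)$, contributes, so the loss grows as $h$ approaches 1. When $y = 1$, the second term is turned off and only $\log(h)$ contributes, so the loss grows as $h$ approaches 0.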
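
The sketch below shows one way to compute BCE in plain Python, summed over examples and with the minus sign applied to both log terms. It assumes natural logarithms, and the `eps` clipping is an added safeguard against $\log(0)$ that is not part of the formula itself. The name `bce_loss` and the sample values are illustrative.

```python
import math

def bce_loss(h, y, eps=1e-12):
    """Binary cross entropy summed over all examples.

    h: predicted confidences in [0, 1]; y: labels, each 0 or 1.
    eps clips h away from 0 and 1 to avoid log(0) (an assumption of this sketch).
    """
    total = 0.0
    for h_i, y_i in zip(h, y):
        h_i = min(max(h_i, eps), 1.0 - eps)
        # Only one of the two terms is non-zero, depending on the label.
        total += -(y_i * math.log(h_i) + (1 - y_i) * math.log(1.0 - h_i))
    return total

y = [1, 0]
good = bce_loss([0.9, 0.1], y)  # confident and correct: roughly 0.211
bad = bce_loss([0.1, 0.9], y)   # confident and wrong: roughly 4.605
print(good, bad)
```

Confident wrong predictions are penalised far more heavily than confident correct ones are rewarded, which is the behaviour the two 'switchable' terms are designed to produce.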

#### Kullback-Leibler Divergence
The KL divergence is a measure that quantifies the difference between two probability distributions, $p$ and $q$. Strictly speaking it is not a metric, because it is not symmetric in $p$ and $q$. It is used frequently in machine learning to measure the information lost when we try to represent a probability distribution in a different way (e.g. after reconstructing it from an encoding).

### $D_{KL}(p||q) = \sum_{i=1}^{m} p(x_i)\cdot (\text{log }p(x_i) - \text{log }q(x_i)) = \sum_{i=1}^{m} p(x_i)\cdot \text{log } \frac{p(x_i)}{q(x_i)}$

For a single value $x$, the KL divergence compares the log probabilities that $p$ and $q$ assign to that value and weights the difference by $p(x)$, the probability of sampling that $x$ from $p$. Because the log difference is weighted by $p(x)$ and not $q(x)$, the KL divergence depends on the order in which you compare the distributions: in general, $D_{KL}(p||q) \neq D_{KL}(q||p)$.

Consider:
- A term contributes a large value when the two distributions assign very different probabilities to the same value and the weighting probability $p(x)$ is large.
- A term contributes zero wherever the weighting probability $p(x)$ is zero.
- Where the 
- The aim is often to minimise the KL divergence (the information difference between two probability distributions).
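
These properties can be checked with a small sketch in plain Python. It assumes discrete distributions given as lists of probabilities, natural logarithms, and $q(x_i) > 0$ wherever $p(x_i) > 0$. The name `kl_divergence` and the sample distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions over the same outcomes.

    Assumes q_i > 0 wherever p_i > 0. Outcomes with p_i == 0 are skipped,
    since their weighted contribution is taken to be zero.
    """
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

print(kl_divergence(p, q))  # roughly 0.335
print(kl_divergence(q, p))  # roughly 0.382, so the order of arguments matters
print(kl_divergence(p, p))  # 0.0 for identical distributions
```

Swapping the arguments changes the result because the weighting distribution changes, and comparing a distribution with itself gives zero, the minimum possible value.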