# Week 2

## Linear Regression
- Regression is one of the basic machine learning tasks. 
- We wish to predict a numeric value for a sample by using the values of its features. E.g. we might use different features of houses (such as living area, age, number of bedrooms etc.) for predicting the price of houses on the market. 
- The regression model is a mapping function $f: \mathbb{R}^M \rightarrow \mathbb{R}$, where $M$ is the dimensionality of the sample, i.e. the number of features. 
- Function $f$ is used to _approximate_ the real data generation process, in this case a market that sets the prices for the houses according to various criteria.

### Example
- In general, we assume we have a dataset of samples, where each sample $i$ has features $\mathbf{x}^{(i)}$ and a true value of the predicted variable $y^{(i)}$. 
- Our goal is to create the function $f$ such that $f(\mathbf{x}^{(i)}) \approx y^{(i)}$, even for samples the model has not seen before.

### Definition of Linear Regression
- In linear regression, each feature $x_i \in \mathbf{x}$ is assigned a weight $w_i$. 
- The prediction of the linear regression model is then a weighted sum:

\begin{equation}
\hat{y} = \sum_{i=1}^{M}{x_i w_i} + b = \mathbf{x}\cdot\mathbf{w} + b
\end{equation}

$\hat{y}$ is the predicted value (we use $y$ for the true value). 
- Apart from feature vector $\mathbf{x}$, we also introduced weight vector $\mathbf{w}$ in the equation. 
- Feature vector $\mathbf{x}^{(i)}$ has different values for each sample $i$, while the weight vector $\mathbf{w}$ is the same for all the samples.
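
The weighted sum above maps directly to a dot product. The following is a minimal sketch with NumPy; the function name `predict` and the example numbers are assumptions made for illustration, not part of the lab code.

```python
import numpy as np

def predict(x, w, b):
    """Weighted sum of features plus bias.

    Works for a single sample of shape (M,) or a batch of shape (K, M).
    """
    return np.dot(x, w) + b

# Illustrative (made-up) parameters for M = 3 features.
w = np.array([2.0, -1.0, 0.5])
b = 3.0

x_single = np.array([1.0, 2.0, 4.0])
x_batch = np.array([[1.0, 2.0, 4.0],
                    [0.0, 0.0, 0.0]])

y_single = predict(x_single, w, b)  # 2 - 2 + 2 + 3
y_batch = predict(x_batch, w, b)    # same weights for every sample

print(y_single, y_batch)
```

Note that the second sample consists of zeros only, so its prediction is just the bias $b$.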

#### Bias:
- The bias $b$ is essentially an additional weight that does not depend on the values of the input vector; instead, the same value is added to the prediction for every input. 
- In this model $\mathbf{w}$ and $b$ are considered _parameters_. While $\mathbf{x}$ and $y$ are given from our dataset, parameters $\theta = \{\mathbf{w}, b\}$ are variables whose values we do not know.

#### 2D Input
- This is the same model as before, only now it works in higher dimensional space. The model is defined by: $f(\mathbf{x}) = w_1x_1 + w_2x_2 + b$.

### Loss function
- The metric used for the optimization of models is called a loss function (also _cost function_). 
- It is a function that maps the current solution (current parameter values) to a real valued score: $L: \mathbb{R}^{dim(\theta)} \rightarrow \mathbb{R}$. 
- Often we first define the loss function for one sample, e.g. by using the square of the error:

\begin{equation}
L^{(i)}(\theta) = (\hat{y}^{(i)} - y^{(i)})^2
\end{equation}

$L^{(i)}$ is a loss function for $i$-th sample. 
- Perfect prediction makes the loss function equal to zero. 
- The worse the prediction, the higher the loss function value is. 
- We are rarely interested in only one sample; usually we want to measure the performance on the whole dataset.
- Most often we use the mean of the individual losses over $K$ samples. This particular loss function is also called mean squared error (MSE):

\begin{equation}
L(\theta) = \frac{\sum_{i=1}^K{L^{(i)}(\theta)}}{K} = \frac{\sum_{i=1}^K{(\hat{y}^{(i)} - y^{(i)})^2}}{K}
\end{equation}
- We minimize the loss function by searching for good values of model parameters $\theta$.
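
As a small illustration of the MSE formula, here is a sketch that evaluates it for a perfect and an imperfect prediction. The helper name `mse` and the values are assumptions chosen only for this example.

```python
import numpy as np

def mse(y_hat, y):
    """Mean of squared errors over all samples."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

y_true = [3.0, 5.0, 7.0]

perfect = mse([3.0, 5.0, 7.0], y_true)  # every error is zero
off = mse([4.0, 5.0, 5.0], y_true)      # errors 1, 0, -2 -> (1 + 0 + 4) / 3

print(perfect, off)
```

Because the errors are squared, the one prediction that is off by 2 contributes four times as much as the prediction that is off by 1.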

## Gradient Descent
- Now we need to find out how to minimize the loss function by changing the values of the parameters $\theta = \{\textbf{w}, b\}$. 
- Gradient descent (GD) is a technique based on calculating the gradient of the loss function $\triangledown L$. 
- The gradient tells us the direction of the steepest ascent at any given point, so moving in the opposite direction decreases the loss.

The general idea behind GD is that we can find local minima by following the "arrows", i.e. gradients. Outline of the GD is:

1. __Initialize the parameters to starting values $\theta_0$.__ Many initialization strategies exist. We will usually start with zero parameters in this lab.
2. __Calculate the gradient $\triangledown L$.__
3. __Update the parameters with the gradient $\theta_{i+1} = \theta_i - \alpha \triangledown L$.__ In this equation $\alpha$ is called _learning rate_ - it is a constant that is set before the algorithm by us.
4. __If the stopping criterion is not fulfilled, jump to step 2.__ E.g. we can stop when a certain number of steps has been done, when a sufficiently good solution has been found, or when we detect that further training no longer improves the performance.

Note that for each parameter $p \in \theta = \{ \mathbf{w}, b\}$ the learning rule from Step 3 can be written as $p_{i+1} = p_i - \alpha \frac{dL}{dp}$, i.e. we subtract the derivative of $L$ w.r.t. $p$, scaled by the learning rate.
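
The outline above can be sketched on a toy problem. The example below is an illustration under assumed names and values: it minimizes $L(p) = (p - 3)^2$, whose derivative is $2(p - 3)$, with a fixed number of steps as the stopping criterion.

```python
def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Plain gradient descent with a fixed number of steps.

    grad   -- function returning the derivative at the current parameter
    theta0 -- starting value of the parameter
    alpha  -- learning rate
    """
    theta = theta0
    for _ in range(steps):
        # Step against the gradient, scaled by the learning rate.
        theta = theta - alpha * grad(theta)
    return theta

def toy_grad(p):
    """Derivative of the toy loss L(p) = (p - 3)^2."""
    return 2.0 * (p - 3.0)

p_final = gradient_descent(toy_grad, theta0=0.0, alpha=0.1, steps=100)
print(p_final)  # very close to the minimum at p = 3
```

With $\alpha = 0.1$ the distance to the minimum shrinks by a factor of $0.8$ in every step. A learning rate above $1$ would make this particular toy problem diverge, which is why the choice of $\alpha$ matters.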

### Gradient Descent for Linear Regression

We can apply the same algorithm to minimize the loss function of the linear regression model. To do so, we must be able to compute the derivatives w.r.t. the parameters. This is a rather math-heavy section and it is not necessary to understand it completely. However, you should have an intuitive understanding that calculating the derivatives like this is possible.

\begin{equation}
\frac{d L}{d \theta} = \frac{d}{d \theta}\frac{\sum_{i=1}^K{L^{(i)}(\theta)}}{K}
=^A \frac{1}{K}\frac{d}{d \theta}\sum_{i=1}^K{L^{(i)}}
=^B \frac{1}{K}\sum_{i=1}^K{\frac{d L^{(i)}}{d \theta}}
\end{equation}

<sup>$^A$ We can do this because $(c \cdot f(x))' = c \cdot f(x)'$</sup>  
<sup>$^B$  $(f(x) + g(x))' = f(x)' + g(x)'$</sup>
    
We found out that the derivative of the loss function w.r.t. the parameters is the mean of the derivatives of the losses for individual samples $L^{(i)}$. Next we need to find the derivative of this sample loss:

\begin{equation}
\frac{d L^{(i)}}{d \theta} = \frac{d}{d \theta}(y^{(i)} - \hat{y}^{(i)})^2
=^C 2(y^{(i)} - \hat{y}^{(i)})\frac{d}{d \theta} (y^{(i)} - \hat{y}^{(i)})
=^D -2(y^{(i)} - \hat{y}^{(i)})\frac{d}{d \theta} \left(\mathbf{x}^{(i)} \cdot \mathbf{w} + b\right)
\end{equation}

<sup>$^C$ $f(g(x))' = f'(g(x))\cdot g(x)'$ and $(x^2)' = 2x$</sup>   
<sup>$^D$ We removed $y$ because it is a constant ($c' = 0$) and took the minus from $-\hat{y}$ outside.</sup> 

It is easy to see that $\frac{d}{d \mathbf{w}}(\mathbf{x} \cdot \mathbf{w} + b) = \mathbf{x}$ because:

\begin{equation}
\frac{d}{dw_j} (x_1w_1 + x_2w_2 + \dots + x_Mw_M + b) = x_j
\end{equation}

From this we can see that:

\begin{equation}
\frac{d L^{(i)}}{dw_j} = -2(y^{(i)} - \hat{y}^{(i)})x^{(i)}_j
\end{equation}
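
One way to build confidence in the derivation is to compare the analytic gradient with a numerical estimate. The sketch below does this on random data; the function names, the random data and the step size `eps` are assumptions for illustration. It also includes the derivative w.r.t. $b$, which follows from the same steps with $\frac{d}{db}(\mathbf{x} \cdot \mathbf{w} + b) = 1$.

```python
import numpy as np

def mse_loss(X, y, w, b):
    """Mean squared error of the linear model on the whole dataset."""
    return np.mean((X @ w + b - y) ** 2)

def mse_gradients(X, y, w, b):
    """Analytic gradients: mean over samples of -2 (y - y_hat) x_j, and of -2 (y - y_hat)."""
    residual = y - (X @ w + b)                          # shape (K,)
    dw = np.mean(-2.0 * residual[:, None] * X, axis=0)  # shape (M,)
    db = np.mean(-2.0 * residual)
    return dw, db

# Made-up data: K = 20 samples with M = 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.5

dw, db = mse_gradients(X, y, w, b)

# Central finite differences as an independent check.
eps = 1e-6
numeric_dw = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = eps
    numeric_dw[j] = (mse_loss(X, y, w + step, b) - mse_loss(X, y, w - step, b)) / (2 * eps)
numeric_db = (mse_loss(X, y, w, b + eps) - mse_loss(X, y, w, b - eps)) / (2 * eps)

print(np.max(np.abs(dw - numeric_dw)), abs(db - numeric_db))
```

Both differences should be tiny, which suggests that the formula above and its counterpart for $b$ are consistent with the loss function.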

### Stochastic Gradient Descent
- Analyzing the time complexity of the gradient descent algorithm reveals that it consists of two nested loops. 
- For each training step, we need to calculate the gradient for each sample. 
- The other parts of the training step are negligible.
- We can say that each step has $O(K)$ time complexity.
- On top of that, the step is performed $J$ times before the stopping criterion is fulfilled. The total time complexity of GD is therefore $O(JK)$.

The number of steps $J$ needed is usually on the order of $10^4$ to $10^6$ for non-toy experiments. AlphaGo Zero (2017) [did more than 3 million steps](https://www.nature.com/articles/nature24270) [1] during its training, which took 40 days. The number of samples $K$ for extensive datasets can also be on the order of $10^6$. Calculating the gradient for each sample in every step, with $O(JK)$ total complexity, would simply be too expensive, even for modern hardware.

In practice, *__stochastic__ gradient descent* or SGD is used instead. The main idea is to use only a subset $S$ of samples to calculate the gradient and use this as an approximation for the loss function:

\begin{equation}
\frac{\sum_{i \in S}{L^{(i)}(\theta)}}{|S|} \approx L(\theta) = \frac{\sum_{i=1}^K{L^{(i)}(\theta)}}{K}
\end{equation}

This subset is called a minibatch, or _a batch_ for short. The size of the batch $|S|$ is set before the training and is called the _batch size_. In practice, the batch size tends to be much smaller than $K$, usually on the order of $10^0$ to $10^3$ for smaller experiments.
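
To get a feeling for this approximation, the sketch below uses made-up per-sample losses (random numbers standing in for $L^{(i)}$) and compares batch means with the mean over the full dataset. All names and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, batch_size = 1000, 10

# Stand-ins for the per-sample losses L^(i); real losses would come from a model.
sample_losses = rng.uniform(0.0, 4.0, size=K)
full_loss = sample_losses.mean()

# Shuffle, then cut the dataset into K / batch_size batches.
order = rng.permutation(K)
batch_losses = sample_losses[order].reshape(-1, batch_size).mean(axis=1)

# A single batch gives only a noisy estimate of the full loss...
single_batch_error = abs(batch_losses[0] - full_loss)
# ...but averaged over all equally sized batches, the estimates match it.
epoch_average = batch_losses.mean()

print(full_loss, batch_losses[0], epoch_average)
```

Each batch is cheap to evaluate but noisy. Since every sample appears in exactly one batch, the average over all batches recovers the full loss.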

The gradients calculated with these batches are only approximations of the true gradients. Theoretically, we need to do __more__ steps with SGD because of this. However, the cost of calculating the gradients is much smaller and SGD is significantly faster to use. Therefore, in practice, vanilla GD is almost never used for non-trivially big datasets.

__Optional Note:__ Even though GD seems to be more "correct", as it computes the true gradient of loss function, in practice it does not seem to have better results than SGD when applied to real data [2].

Samples for each step are picked randomly, using _sampling without replacement_. This means that we treat the dataset like a bag of balls, where each ball is a sample. We randomly select $|S|$ balls/samples and use them for training. Then we do __not__ return them to the bag. We continue sampling until the bag is empty, and then we return all the balls/samples back. One iteration of this process is called _an epoch_.

In practice, this is almost always implemented with shuffling. At the start of the epoch we shuffle the order of the samples. Then we simply iterate over the $|S|$-tuples. E.g. with batch size 5, we select the first 5 samples for the first training step. Then for the second step we select the second 5 samples, i.e. samples 6-10. We continue until we have selected all the 5-tuples; the last batch may be smaller if $K$ is not divisible by the batch size. The table below illustrates which batch each sample $s_i$ belongs to for different batch sizes over a dataset with 7 data points.

| Batch size | $s_1$ | $s_2$ | $s_3$ | $s_4$ | $s_5$ | $s_6$ | $s_7$ |
| :--------- | - | - | - | - | - | - | - |
| 1          | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| 2          | 1 | 1 | 2 | 2 | 3 | 3 | 4 |
| 3          | 1 | 1 | 1 | 2 | 2 | 2 | 3 |
| 5          | 1 | 1 | 1 | 1 | 1 | 2 | 2 |
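
The batch numbers in the table follow from integer division of the sample position by the batch size. A minimal sketch, where `batch_index` is a hypothetical helper name introduced only for this example:

```python
def batch_index(sample_position, batch_size):
    """1-based batch number for the sample at a 1-based position after shuffling."""
    return (sample_position - 1) // batch_size + 1

# Rebuild the rows of the table for a dataset with 7 samples.
rows = {}
for batch_size in (1, 2, 3, 5):
    rows[batch_size] = [batch_index(i, batch_size) for i in range(1, 8)]
    print(batch_size, rows[batch_size])
```

With batch sizes 2, 3 and 5 the last batch is incomplete, because 7 is not divisible by any of them.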

With this in mind, the general outline of SGD is as follows (compare it to the outline of GD):

1. __Initialize the parameters to starting values $\theta_0$.__ 
2. __Run an epoch.__  
  2.1. Shuffle data.  
  2.2. Pick the next $|S|$-tuple.  
  2.3. Calculate the gradient.  
  2.4. Update parameters.    
  2.5. If there is another $|S|$-tuple left in the bag, jump to 2.2.  
3. __If the stopping criterion is not fulfilled, jump to step 2.__
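
The SGD outline can be sketched for linear regression as follows. This is an illustration under stated assumptions, not the lab's reference implementation: the function name, the hyperparameter values, the synthetic noise-free data and the fixed number of epochs as the stopping criterion are all chosen for this example.

```python
import numpy as np

def sgd_linear_regression(X, y, batch_size=5, alpha=0.05, epochs=200, seed=0):
    """Minibatch SGD for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    K, M = X.shape
    w, b = np.zeros(M), 0.0                  # 1. start with zero parameters
    for _ in range(epochs):                  # 2. run an epoch
        order = rng.permutation(K)           # 2.1 shuffle data
        for start in range(0, K, batch_size):
            idx = order[start:start + batch_size]  # 2.2 next |S|-tuple
            Xb, yb = X[idx], y[idx]
            residual = yb - (Xb @ w + b)
            # 2.3 gradient of the batch loss
            dw = np.mean(-2.0 * residual[:, None] * Xb, axis=0)
            db = np.mean(-2.0 * residual)
            # 2.4 update parameters
            w = w - alpha * dw
            b = b - alpha * db
    return w, b                              # 3. stop after a fixed number of epochs

# Synthetic data generated from known parameters, without noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
true_w, true_b = np.array([1.5, -2.0]), 0.7
y = X @ true_w + true_b

w, b = sgd_linear_regression(X, y)
print(w, b)
```

Because the synthetic targets contain no noise, the learned parameters should end up very close to `true_w` and `true_b`. On real data the batches disagree with each other, so the parameters keep fluctuating around the optimum.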

## Case Study: House Sales Prices

We can apply our linear regression model to real data. Kaggle has a nice dataset with house sales prices for King County, USA. We prepared these data in `data/houses.csv` file. It contains five columns:

1. Number of bedrooms
2. Number of bathrooms
3. Living area square footage
4. Year built
5. Market price

You should see that the final loss is `nan`. Similarly, the final parameter values are `nan`. `nan` ("not a number") is a special floating-point value, available in `numpy` as `np.nan`, and it tells us that something went wrong during the computation.

__Exercise 2.16:__ Can you diagnose what went wrong? _Hint:_ Try printing the values for various quantities, such as loss, gradient or weights after each step of training.

We can fix this by _normalizing_ the data. This is a process that "squishes" the data into a more reasonable scale, e.g. we can rescale the house prices so that they have zero mean and unit standard deviation, with most values falling within a few units of zero, instead of the current $\langle 75{,}000; 77{,}000{,}000 \rangle$ interval. To do so, we calculate the mean $\mu$ and standard deviation $\sigma$ of the prices. We then transform any current price $x$ with the following formula:

\begin{equation}
z = \frac{x - \mu}{\sigma}
\end{equation}

This new value $z$ is then used in the computation. We do the same for all features as well, with a separate $\mu$ and $\sigma$ for each feature. We have prepared this functionality in our `load_data` function from the `backstage/load_data.py` file. You can use it by adding the argument `normalize=True`. Try running the code cell above with this change and see what happens.
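
The transformation itself takes only a few lines. The sketch below is an illustration with made-up prices and hypothetical helper names; it is not the code of the `load_data` function.

```python
import numpy as np

def normalize(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu, sigma = np.mean(values), np.std(values)
    return (values - mu) / sigma, mu, sigma

def denormalize(z, mu, sigma):
    """Invert the transformation to get back to the original scale."""
    return z * sigma + mu

# Made-up prices, used only to demonstrate the formula.
prices = np.array([200_000.0, 350_000.0, 500_000.0, 950_000.0])

z, mu, sigma = normalize(prices)
restored = denormalize(z, mu, sigma)

print(z, restored)
```

The normalized values have mean $0$ and standard deviation $1$, and the transformation can be inverted exactly as long as $\mu$ and $\sigma$ are kept.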

Normalization should make the learning work and the final loss should be approximately $0.45$. However, with normalized prices the predictions are also made in the normalized scale. If we want to interpret a prediction in the original scale, we need to rescale it back: $x = z\sigma + \mu$. You can see the original prices and the prices your model predicts for some samples below:


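```python
import numpy as np
# Only needed if not already imported in an earlier cell.
from backstage.load_data import load_data

# Load the unnormalized data to recover the original price scale.
true_data = load_data('houses.csv')
mu, sigma = np.mean(true_data.y), np.std(true_data.y)

true_prices = true_data.y[:10]
# `model` and the normalized `data` come from the training cell above.
# Rescale the normalized predictions back: x = z * sigma + mu.
predicted_prices = model.predict(data.x[:10]) * sigma + mu

print('True price\t| Predicted price')
for true, predicted in zip(true_prices, predicted_prices):
    print(f'{true:7.0f}\t\t| {predicted:7.0f}')
```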

## Key Concepts from This Week

- Linear regression
- Bias term
- Loss function
- Mean squared error
- Gradient descent
- Learning rate
- Stochastic gradient descent
- Batch and batch size
- Epoch
- Normalization

## Further Reading

- Chapter 8.1 from the _Deep Learning_ book [3] covers some additional topics related to SGD.

## Sources
[1] Silver, D., Schrittwieser, J., Simonyan, K., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359.  
[2] Wilson, D. R., & Martinez, T. R. (2003). The general inefficiency of batch training for gradient descent learning. Neural networks, 16(10), 1429-1451.  
[3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.  

## Correct Answers

__E 2.1:__ Artificial neuron is usually defined as $\sigma(\textbf{w}\cdot\textbf{x} + b)$. Compared to linear regression there is an additional activation function $\sigma$. We can say that linear regression is a special case of artificial neuron with identity function used for activation $\sigma(x) = x$.

__E 2.2:__ The family for linear regression consists of all linear functions of one variable. This is not surprising, as the definition is the same as the definition of a general linear function $f(x) = ax + b$. This includes all non-vertical lines in the 2D space. With no bias term, the family is limited to linear functions that pass through the origin $\textbf{O} = [0, 0]$.

__E 2.4:__ All linear functions of two variables are the family here. This includes all the non-vertical planes in the 3D space. Again, with no bias term, all the planes would have to cross the origin point. With bias term we can shift the function surface up or down along the $\hat{y}$ axis.


__E 2.7:__ The derivatives are $\frac{d f}{d x} = 2x$ and $\frac{d f}{d y} = 2.4y$.

The update rule is:

\begin{equation}
 \begin{bmatrix}
  x_i \\ y_i
 \end{bmatrix} = \begin{bmatrix}
  x_{i-1} \\ y_{i-1}
 \end{bmatrix}- \alpha \begin{bmatrix}
  2x_{i-1} \\ 2.4y_{i-1}
 \end{bmatrix}
\end{equation}

__E 2.8:__ $-2(y^{(i)} - \hat{y}^{(i)})$

__P 2.9.1:__ Return the prediction as: $\hat{y} = \mathbf{x} \cdot \mathbf{w} + b$

__P 2.9.2:__ Return tuple with derivative for $\mathbf{w}$ and $b$. Do not forget that $\mathbf{w}$ is a vector, not a scalar:

$\frac{dL}{dw_j} = \frac{1}{K}\sum_{i=1}^K-2(y^{(i)} - \hat{y}^{(i)})x^{(i)}_j$  
$\frac{dL}{db} = \frac{1}{K}\sum_{i=1}^K-2(y^{(i)} - \hat{y}^{(i)})$

__P 2.9.3:__ Apply the derivatives on current parameters, i.e. update the current parameters of the model: $\mathbf{w} = \mathbf{w} - \alpha\frac{dL}{d\mathbf{w}}$ and $b = b - \alpha\frac{dL}{db}$

__P 2.9.4:__ Return the loss function: $L(\theta) = \frac{1}{K} \sum_{i=1}^K{(\hat{y}^{(i)} - y^{(i)})^2}$