# Derivations related to the Linear Regression (and polynomial regression) models

### What is Linear Regression?

Linear regression is a machine learning model that represents a linear relationship between the input space and the output space.

Given an input vector (feature vector) $x \in \mathbb{R}^{n}$, we want to map $x$ to a real number that closely predicts the output we are trying to model. We therefore define our function as follows:

$$ h_{\theta}(x) = \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} + b = \theta^{T}x + b$$
$$ x, \theta \in \mathbb{R}^{n}, b \in \mathbb{R} $$

$\theta$ is known as the weights and $b$ as the bias. Together they are the parameters of the model.

In other words, our hypothesis for a given input $x$ is the value $h_{\theta}(x)$.
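As a minimal sketch of the hypothesis function for a single input, assuming NumPy arrays for $x$ and $\theta$ (the function name `hypothesis` and the numeric values are made up for illustration and do not come from the notebook's code):

```python
import numpy as np

def hypothesis(x, theta, b):
    """Return theta^T x + b for one feature vector x of shape (n,)."""
    return np.dot(theta, x) + b

# Illustrative values only (n = 3).
theta = np.array([2.0, -1.0, 0.5])
b = 3.0
x = np.array([1.0, 4.0, 2.0])

prediction = hypothesis(x, theta, b)
print(prediction)  # 2*1 - 1*4 + 0.5*2 + 3 = 2.0
```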

Since this function is supposed to model some phenomenon (otherwise why use machine learning), we want our hypothesis to be accurate. To achieve this, we need to pick the parameters $\theta$ and $b$ that give the best hypothesis for any given $x$, and for that we require data. In particular, we need data that pairs each input $x$ with the actual output $y$ that we wish to model.
$$\text{Let } X = (x^{(1)}, x^{(2)}, ..., x^{(m)})^{T} \in \mathbb{R}^{m \times n}, x^{(i)} \in \mathbb{R}^{n} $$
$$ y = (y^{(1)}, y^{(2)}, ..., y^{(m)}) \in \mathbb{R}^{m}, y^{(i)} \in \mathbb{R}$$

In other words: $X$ is a matrix where each column represents a feature and each row represents a training example, and $y$ is a vector containing the output of each training example.
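The shapes can be made concrete with a small sketch. The three examples and the parameter values below are hypothetical; the point is only that stacking the $x^{(i)}$ as rows lets us compute every hypothesis at once as `X @ theta + b`:

```python
import numpy as np

# Three hypothetical training examples with two features each (m = 3, n = 2).
examples = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
X = np.stack(examples)          # shape (m, n): one row per example
y = np.array([1.0, 2.0, 3.0])   # shape (m,): one output per example

# Illustrative parameters.
theta = np.array([0.5, 0.25])
b = 1.0

# One prediction per row of X.
h = X @ theta + b
print(X.shape, y.shape, h.shape)  # (3, 2) (3,) (3,)
```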

To find the optimal $\theta$ for our data, we need some function (of our parameters $\theta$, and that uses our training examples) to optimize. A common choice is Least Squares Error (LSE) due to the inherent assumptions of the linear regression model. <br> *For more detail on why we choose LSE for Linear Regression you can view the LSE derivations notebook which covers this in detail*

We will define our loss function as $J(\theta, b) = LSE$, since we want it to be a function of our parameters (rather than a function of the input features)

$$ J(\theta, b) = \sum_{i = 1}^{m}(y^{(i)} - h_{\theta}(x^{(i)}))^{2} = \sum_{i = 1}^{m}(y^{(i)} - (\theta^{T}x^{(i)} + b))^{2}$$

At a high level, we are taking the actual outputs $y^{(i)}$ and subtracting our predicted output $h_{\theta}(x^{(i)})$. Then we are squaring it so that we are only considering magnitude (as some differences may be negative but after squared all will be positive). Finally we sum over every example to get the total squared error.
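The loss can be computed directly from its definition. This is a plain NumPy sketch, not the `LSELoss` class used later in the notebook; the helper name `lse_loss` and the data are assumptions for illustration:

```python
import numpy as np

def lse_loss(X, y, theta, b):
    """Sum of squared differences between actual and predicted outputs."""
    residuals = y - (X @ theta + b)
    return np.sum(residuals ** 2)

# Data generated exactly by y = x + 2, so theta = [1], b = 2 fits perfectly.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([2.0, 3.0, 4.0])

perfect = lse_loss(X, y, np.array([1.0]), 2.0)
off = lse_loss(X, y, np.array([1.0]), 1.0)  # bias off by 1, so every residual is 1
print(perfect, off)  # 0.0 3.0
```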

To optimize our parameters using this loss function, we attempt to minimize the loss function. So we are trying to find:

$$ \underset{\theta, b}{\mathrm{argmin}} [J(\theta, b)]$$

To do so, we can either solve the problem directly or, as with most machine learning algorithms, approximate *(or solve)* it iteratively.
One of the simplest iterative algorithms is gradient descent, which is what I will be using in the code for this notebook.
<br> Iterative algorithms of this kind rely on the gradient of the loss function: $\nabla_{J}(\theta, b)$.
The gradient is the vector of all partial derivatives of the function. So in the case of the loss function the gradient is defined as: $\nabla_{J}(\theta, b) = (\frac{\partial J}{\partial \theta_{1}}, ..., \frac{\partial J}{\partial \theta_{n}}, \frac{\partial J}{\partial b})$
<br> We must therefore take the derivative of the loss function w.r.t. each parameter, including the bias.
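To illustrate the idea of gradient descent before applying it to the loss, here is a minimal sketch on a toy function whose minimum is known. The function $f(p) = (p_{1} - 3)^{2} + (p_{2} + 1)^{2}$, the learning rate, and the step count are arbitrary choices for this example:

```python
import numpy as np

def gradient(p):
    # Vector of partial derivatives of f(p) = (p[0] - 3)^2 + (p[1] + 1)^2
    return np.array([2.0 * (p[0] - 3.0), 2.0 * (p[1] + 1.0)])

p = np.zeros(2)          # starting point
learning_rate = 0.1      # step size (assumed value)

for _ in range(100):
    # Move against the gradient, i.e. in the direction of steepest descent.
    p = p - learning_rate * gradient(p)

print(p)  # close to the minimizer [3, -1]
```

Each step shrinks the distance to the minimizer by a constant factor here, which is why a fixed number of iterations is enough for this toy problem.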

In the case of LSE, the gradient is defined as follows (the minus sign comes from applying the chain rule to $y - h$):

$$ \nabla_{J}(\theta, b) = -2 \cdot \left(\frac{\partial h}{\partial \theta}\right)^{T}(y - h) $$
$$ h = (h_{\theta}(x^{(1)}), h_{\theta}(x^{(2)}), ..., h_{\theta}(x^{(m)})) \in \mathbb{R}^{m} $$
$$ \frac{\partial h}{\partial \theta} = (\frac{\partial h}{\partial \theta_{1}}, ..., \frac{\partial h}{\partial \theta_{n}}, \frac{\partial h}{\partial b}) \in \mathbb{R}^{m \times (n + 1)}$$
$$ \frac{\partial h}{\partial \theta_{j}} = (\frac{\partial h_{\theta}(x^{(1)})}{\partial \theta_{j}}, ..., \frac{\partial h_{\theta}(x^{(m)})}{\partial \theta_{j}}) \in \mathbb{R}^{m}$$
$$ \frac{\partial h}{\partial b} = (\frac{\partial h_{\theta}(x^{(1)})}{\partial b}, ..., \frac{\partial h_{\theta}(x^{(m)})}{\partial b}) \in \mathbb{R}^{m}$$

*If you are interested in how this gradient was derived, please check out the LSE.ipynb notebook which covers the gradient of this specific loss function*

So now all we need to find is $\frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{j}}$ and $\frac{\partial h_{\theta}(x^{(i)})}{\partial b}$ to then fill the matrix $\frac{\partial h}{\partial \theta}$

$$ h_{\theta}(x^{(i)}) = \theta^T x^{(i)} + b = \theta_{1}x_{1}^{(i)} + \theta_{2}x_{2}^{(i)} + ... + \theta_{n}x_{n}^{(i)} + b$$
$$ \implies \frac{\partial h_{\theta}(x^{(i)})}{\partial \theta_{j}} = x_{j}^{(i)} $$

Since $h$ is linear in the parameters and we are taking the partial derivative w.r.t. a single parameter $\theta_{j}$, every term that does not contain $\theta_{j}$ is treated as a constant and vanishes. Applying the power rule to the remaining term $\theta_{j}x_{j}^{(i)}$ leaves its coefficient $x_{j}^{(i)}$.

$$ h_{\theta}(x^{(i)}) = \theta^T x^{(i)} + b = \theta_{1}x_{1}^{(i)} + \theta_{2}x_{2}^{(i)} + ... + \theta_{n}x_{n}^{(i)} + b$$
$$ \implies \frac{\partial h_{\theta}(x^{(i)})}{\partial b} = 1 $$

Since $b$ stands alone, taking the partial derivative w.r.t. $b$ treats all other terms as constants, which vanish. Applying the power rule to $b$ then leaves simply $1$.

Thus our matrix $\frac{\partial h}{\partial \theta}$ is simply:

$$\frac{\partial h}{\partial \theta} = 
\begin{bmatrix} 
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \dots & x_{n}^{(1)} & 1 \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \dots & x_{n}^{(2)} & 1 \\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
    x_{1}^{(m)} & x_{2}^{(m)} & x_{3}^{(m)} & \dots & x_{n}^{(m)} & 1
\end{bmatrix} = [X \quad \mathbf{1}], \quad \mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \in \mathbb{R}^{m}
$$

Finally we have an easy formula we can code to compute the gradient at a given parameter vector $\theta$ and bias $b$:

$$\nabla_{J}(\theta, b) = -2 \cdot [X \quad \mathbf{1}]^T(y - h)$$
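One way to sanity-check a gradient formula like this is to compare it against central finite differences of the loss. The sketch below does that in plain NumPy on random data. It appends a column of ones to $X$ and uses the sign that follows from differentiating $(y - h)^{2}$; the helper names `lse_loss` and `lse_gradient` and the random data are assumptions for illustration:

```python
import numpy as np

def lse_loss(X, y, theta, b):
    return np.sum((y - (X @ theta + b)) ** 2)

def lse_gradient(X, y, theta, b):
    """Analytic gradient w.r.t. (theta_1, ..., theta_n, b)."""
    h = X @ theta + b
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # X with a column of ones
    return -2.0 * X_aug.T @ (y - h)

# Random illustrative problem: m = 5 examples, n = 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
theta = rng.normal(size=2)
b = 0.3

analytic = lse_gradient(X, y, theta, b)

# Central differences: perturb one parameter at a time.
eps = 1e-6
params = np.append(theta, b)
numeric = np.zeros_like(params)
for j in range(params.size):
    step = np.zeros_like(params)
    step[j] = eps
    plus, minus = params + step, params - step
    numeric[j] = (lse_loss(X, y, plus[:-1], plus[-1])
                  - lse_loss(X, y, minus[:-1], minus[-1])) / (2 * eps)

matches = np.allclose(analytic, numeric, atol=1e-5)
print(matches)
```

Because the loss is quadratic in the parameters, the central difference agrees with the analytic gradient up to rounding error, so a mismatch would point to a sign or indexing mistake.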


### Coding our model

In [2]:
import numpy as np
import sys
from pathlib import Path

sys.path.append(str(Path().resolve().parent.parent))

from loss.LSELoss import LSELoss
from optimizers.SGDOptimizer import SGDOptimizer

# CREATE SOME DUMMY DATA
x = np.linspace(0, 10, 10)
y = x + 2 + np.random.rand(x.shape[0]) - 0.5 # y is a linear function of x with independent noise per sample

# reshape the data to place into the model
X = np.reshape(x, (x.shape[0], 1))

# initialize our loss (I will use my LSELoss class here) and our optimizer (I will use my SGD Optimizer here)
loss = LSELoss()
optimizer = SGDOptimizer(learning_rate=0.01)

First we need to initialize the weights of the model, which requires specifying the dimension of the feature space. In this example I will use a dimension of 1, but the same code scales to $n$ dimensions.

In [121]:
weights = np.random.rand(1) # shape is (1,)
bias = np.random.rand(1) # shape is (1,)
print(f"Weights: {weights}, Bias: {bias}")

Weights: [0.81673522], Bias: [0.69851521]


Let's define the forward pass of our model, i.e. the hypothesis function, and run it on all training examples to get a vector of predictions.

In [7]:
def forward(X):
    return np.dot(X, weights) + bias

y_pred = forward(X)
y_pred

array([0.98955928, 1.09353263, 1.19750599, 1.30147934, 1.40545269,
       1.50942605, 1.6133994 , 1.71737275, 1.82134611, 1.92531946])

Now let's specify our gradient function. It takes the gradient of the loss function w.r.t. the predictions and multiplies it by the derivatives of our model w.r.t. the parameters (using the formula we derived above).

In [8]:
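def grads(X, loss_grad):
    # X^T @ loss_grad gives the weight gradients; the sum of loss_grad
    # gives the bias gradient (the column of ones in the derivative matrix)
    return np.concatenate((np.dot(X.T, loss_grad), [np.sum(loss_grad)]))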

loss_grad = loss.grads(y_pred, y)
model_grads = grads(X, loss_grad)
model_grads

array([-714.19685344, -105.91098624])

Finally we apply the gradients using the optimizer to get the updated parameters.

In [122]:
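params = np.concatenate((weights, bias)) # pack weights and bias into one vector
optimizer.step(params, model_grads) # params is updated in place
weights = np.array(params[:weights.size]) # unpack the updated weights
bias = np.array(params[weights.size:]) # unpack the updated bias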
print(f"Weights: {weights}, Bias: {bias}")

Weights: [7.95870376], Bias: [1.75762507]


You would then keep repeating these three steps (forward pass, gradient computation, optimizer step) until the loss function converges.
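Putting the steps together, a full training loop might look like the sketch below. It uses plain NumPy updates rather than the `LSELoss` and `SGDOptimizer` classes, and the learning rate, tolerance, iteration cap, and random seed are assumed values. The learning rate is kept small because the summed (not averaged) loss produces large gradients on this data:

```python
import numpy as np

# Dummy data: y = x + 2 plus independent noise per sample.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 10)
y = x + 2 + rng.random(x.shape[0]) - 0.5
X = x.reshape(-1, 1)

weights = np.zeros(1)
bias = 0.0
learning_rate = 0.001   # assumed value
tolerance = 1e-12       # stop when the loss changes by less than this
previous_loss = np.inf

for iteration in range(20000):
    # Forward pass and loss.
    residuals = y - (X @ weights + bias)
    loss = np.sum(residuals ** 2)
    if abs(previous_loss - loss) < tolerance:
        break
    previous_loss = loss

    # Gradient of the summed squared error w.r.t. weights and bias.
    grad_w = -2.0 * X.T @ residuals
    grad_b = -2.0 * np.sum(residuals)

    # Gradient descent update.
    weights = weights - learning_rate * grad_w
    bias = bias - learning_rate * grad_b

print(weights, bias, iteration)
```

The learned weight should land near 1 and the bias near 2, since the data was generated from $y = x + 2$ with noise centered on zero.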

Note: Polynomial Regression follows the same format, except that you now have extra terms depending on the degree of the polynomial. I have implemented polynomial regression within my Linear Regression model. You can specify the degree when initializing the model via the degrees parameter.

Your hypothesis function for a degree 2 polynomial (with no interaction terms) would look like $$h_{\theta}(x) = \theta_{1}x_{1} + \theta_{2}x_{2} + ... + \theta_{n}x_{n} + \theta_{n+1}x_{1}^{2} + ... + \theta_{2n}x_{n}^{2} + b$$

This can be scaled to higher degree polynomials as well.
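One simple way to get these extra terms is to expand the feature matrix before fitting, so that the model stays linear in its parameters. The helper below is a hypothetical sketch and not the implementation inside the Linear Regression model mentioned above. It stacks element-wise powers of $X$ column-wise, matching the ordering of the $\theta$ indices in the hypothesis:

```python
import numpy as np

def polynomial_features(X, degree):
    """Stack X, X**2, ..., X**degree column-wise (no interaction terms)."""
    return np.hstack([X ** d for d in range(1, degree + 1)])

# Two illustrative examples with two features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

X_poly = polynomial_features(X, 2)
print(X_poly.shape)  # (2, 4): n original columns followed by n squared columns
print(X_poly)        # rows: [1, 2, 1, 4] and [3, 4, 9, 16]
```

The expanded matrix can then be passed through the same forward pass and gradient code, with a weight vector of length $n \cdot \text{degree}$.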