<a href="https://colab.research.google.com/github/MinghanChu/DeepLearning-ZerosToGans/blob/main/LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Supervised (machine) learning**
+ input to output mappings
+ most economic value created through supervised learning
+ **regression model** to predict **numbers** from **infinitely** many possible outputs
+ **classification model** to predict categories from **several** possible outputs
+ **Supervised** means we know the "right answers" (ground truth). For example, we have a dataset that contains house sizes and the corresponding house prices. A supervised regression model is a continuous function that can produce infinitely many possible predictions, of which we usually need only one or a few for a particular application.


Input  | Output | Application
-------------------|--------------------|------------------
email              | spam?              | spam filtering
audio              |text transcripts    | speech recognition
English            | Chinese            | machine translation

##**Linear Regression and Gradient Descent**

+ understand what **linear regression** and **gradient descent** are
+ implement a **linear regression model** using `PyTorch`
+ train your linear regression model using the **gradient descent algorithm**
+ implement gradient descent using `PyTorch` built-ins

**Linear regression** is one of the fundamental algorithms in machine learning (ML), and most ML courses begin with it. The notation and data table below help us understand what a linear regression model requires:

+ **training set** is the dataset used to train your model
+ **input variable (feature)** is denoted $x$
+ **output variable (target)** is the quantity to predict, denoted $y$
+ $m$ is the number of training examples ($m = 47$ in the following example); a single training example is one row of the table, e.g., $(x,y) = (2104,400)$
+ $(x^{(i)}_{j}, y^{(i)})$ is the $i^{th}$ training example, where $i$ refers to a specific row and $j$ to a specific feature column

\\

*superscript for the $i$th row (training example) and subscript for the $j$th column (feature)*

\\

size ($x$)         | price ($y$)|
-------------------|--------------------
(1) 2104              | 400              
(2) 1416              | 232   
(3) 1534              | 315            
...                   | ...
(47) 3210             | 870

In a training set, you have features ($x^{(i)}_{j}$) and targets ($y^{(i)}$). A regression model, on the other hand, takes features and gives predictions.

+ a regression model denoted $f$
+ the model takes input features $x^{(i)}_{j}$
+ the model gives prediction $\hat{y}^{(i)}$. (**Note that $y^{(i)}$ refers to the target or actual true value in a training set, while $\hat{y}^{(i)}$ refers to the prediction**)
+ if we have only **one input feature variable**, we call $f$ a **univariate linear regression** (in practice, we usually have multiple input features)

\\

###**Linear regression**

A linear regression model can be defined as follows:

$$
\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b,
$$

where $w,b$ are called **parameters** or **weights.** The **purpose** is to find $w,b$ that can give $\hat{y}^{(i)}$ that is close to $y^{(i)}$.

###**Cost function (squared error cost function)**
Depending on the values of weights, you get different $f$ functions. We need a **cost function** to measure how well $f$ is fitted to the training data. The cost function basically measures the **difference between the prediction and the actual true value:**

$$J(w, b)=\frac{1}{2 m} \sum_{i=1}^m\left(f_{w, b}\left(x^{(i)}\right)-y^{(i)}\right)^2. $$

The purpose of machine learning is to find the values of $w,b$ that can **minimize** the cost function. In math, it reads as

$$
\operatorname{minimize}_{w, b} J(w, b).
$$

The above equation indicates that we want to minimize $J$ as a function of $w,b$. For different values of $w,b$, you can trace out what the cost function $J$ looks like. Later we can use **gradient descent technique** to find the minimum value of $J$, and hence, determine the corresponding values of $w,b$.
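As a minimal sketch of the squared error cost, the snippet below evaluates $J(w,b)$ on the first three rows of the table above. The function name `cost` and the two trial values of `w` are illustrative assumptions, not part of the original notebook.

```python
# First three training examples from the table: size (x) and price (y).
xs = [2104.0, 1416.0, 1534.0]
ys = [400.0, 232.0, 315.0]

def cost(w, b, xs, ys):
    """Squared error cost J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = w * x + b          # f_{w,b}(x)
        total += (prediction - y) ** 2  # squared error for this example
    return total / (2 * m)

# Two hypothetical parameter choices: different (w, b) give different costs.
cost_zero = cost(0.0, 0.0, xs, ys)
cost_trial = cost(0.2, 0.0, xs, ys)
print(cost_zero, cost_trial)
```

The second choice gives a much smaller cost, which is exactly the kind of comparison gradient descent automates.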

###**Gradient descent**

We start off with some initial guesses for $w,b$, e.g., $w=0,b=0$, and keep changing the values of $w,b$ to reduce $J(w,b)$ until it reaches or gets close to a minimum.

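$$w = w - \alpha\frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha\frac{\partial J(w,b)}{\partial b}.$$

$\alpha$ is the **learning rate** that controls how big the step is. A small value of $\alpha$ gives a small change in $w$ and $b$ at each step. **Very importantly, we always update $w$ and $b$ simultaneously**, i.e., both partial derivatives are evaluated with the old values of $w$ and $b$ before either parameter is overwritten.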

1. $w$ is the **weight** applied to the **input variable** ($x^{(i)}_j$, where $(i)$ represents training examples and $j$ represents features)
2. $y^{(i)}$ is the **target (output) variable**
3. note that $y^{(i)}$ has only a superscript to indicate training examples (no subscript is needed because it is not a feature)
4. $b$ stands for an offset by some constant (bias)

+ gradient descent reduces $w$ when $\frac{\partial J}{\partial w} > 0$, which shifts $J$ toward the minimum
+ gradient descent increases $w$ when $\frac{\partial J}{\partial w} < 0$, which also shifts $J$ toward the minimum
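The update rule can be sketched in plain Python for the univariate case. The toy data (generated from $y = 2x + 1$), the learning rate, and the iteration count are assumptions chosen only so that the loop converges quickly; they do not come from the house price table.

```python
# Hypothetical toy data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

def gradients(w, b, xs, ys):
    """Partial derivatives of J(w, b) = 1/(2m) * sum((w*x + b - y)^2)."""
    m = len(xs)
    dj_dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
    dj_db = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
    return dj_dw, dj_db

w, b = 0.0, 0.0   # initial guesses
alpha = 0.1       # learning rate (assumed value)
for _ in range(2000):
    # Both derivatives are computed from the OLD w and b ...
    dj_dw, dj_db = gradients(w, b, xs, ys)
    # ... and only then are w and b overwritten together (simultaneous update).
    w, b = w - alpha * dj_dw, b - alpha * dj_db

print(w, b)  # approaches w = 2, b = 1
```

Updating `w` first and then computing `dj_db` with the new `w` would be a different (non-simultaneous) scheme, which is why the gradients are computed before either assignment.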

###**Multiple features**
So far we have only dealt with one feature variable, i.e., input $x^{(i)}$ = size. In practice, we usually have multiple features: $x^{(i)}_{j}$. Then the regression model can be expressed as

$$
f_{\overrightarrow{\mathrm{w}}, b}(\overrightarrow{\mathrm{x}})=\overrightarrow{\mathrm{w}} \cdot \overrightarrow{\mathrm{x}}+b,
$$

where $\vec{x} = [x_1 \ x_2 \ x_3 \ ... \ x_n]$ and $\vec{w} = [w_1 \ w_2 \ w_3 \ ... \ w_n]$.  
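A short NumPy sketch of the dot-product form follows. The feature values, weights, and bias are made-up numbers used only to show that the explicit sum over $j$ and `np.dot` agree.

```python
import numpy as np

# Hypothetical single example with n = 3 features, plus assumed parameters.
x = np.array([73.0, 67.0, 43.0])
w = np.array([0.5, -0.2, 0.1])
b = 4.0

# Explicit sum over the features: w_1*x_1 + w_2*x_2 + w_3*x_3 + b
f_loop = sum(w[j] * x[j] for j in range(len(x))) + b

# Same model written as a dot product: w . x + b
f_dot = np.dot(w, x) + b

print(f_loop, f_dot)
```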

\\

###**Gradient descent for multiple linear regression**

The cost function for multiple linear regression can be expressed as follows:

$$J(\vec{w},b),$$

and the gradient descent update can be written as follows:

$$
w_j=w_j-\alpha \frac{\partial}{\partial w_j} J\left(w_1, \cdots, w_n, b\right),
$$

$$
b=b-\alpha \frac{\partial}{\partial b} J\left(w_1, \cdots, w_n, b\right).
$$

+ note that the index $j$ of $w_{j}$ runs over the input features ($j = 1, \cdots, n$)

Now let's take the derivative with respect to $w_1$, i.e., $j=1$ or feature one:

$$
w_1=w_1-\alpha \frac{\partial}{\partial w_1} J(\overrightarrow{\mathrm{w}}, b) = w_1-\alpha \frac{1}{m} \sum_{i=1}^m\left(f_{\overrightarrow{\mathrm{w}}, b}\left(\overrightarrow{\mathrm{x}}^{(i)}\right)-y^{(i)}\right) x_1^{(i)}.
$$

Similarly, we can take the derivative with respect to the weight of any other feature, up to $j = n$, and with respect to $b$:

$$
w_n=w_n-\alpha \frac{1}{m} \sum_{i=1}^m\left(f_{\overrightarrow{\mathrm{w}}, b}\left(\overrightarrow{\mathrm{x}}^{(i)}\right)-y^{(i)}\right) x_n^{(i)},
$$

$$
b=b-\alpha \frac{1}{m} \sum_{i=1}^m\left(f_{\overrightarrow{\mathrm{w}}, b}\left(\overrightarrow{\mathrm{x}}^{(i)}\right)-y^{(i)}\right).
$$
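The update formulas above can be checked numerically. The sketch below computes the gradients in vectorized NumPy form and compares the $w_1$ component with a central finite difference of the cost. The data `X`, `y` and the parameter values are arbitrary assumptions for illustration.

```python
import numpy as np

def cost(X, y, w, b):
    """J(w, b) = 1/(2m) * sum((X @ w + b - y)^2)."""
    errors = X @ w + b - y
    return np.sum(errors ** 2) / (2 * len(y))

def gradients(X, y, w, b):
    """Analytic gradients: dJ/dw_j = 1/m * sum(error_i * x_j^(i)), dJ/db = 1/m * sum(error_i)."""
    m = len(y)
    errors = X @ w + b - y      # shape (m,)
    dj_dw = X.T @ errors / m    # shape (n,), one entry per feature
    dj_db = errors.sum() / m
    return dj_dw, dj_db

# Hypothetical data: m = 4 examples, n = 2 features.
X = np.array([[1.0, 2.0], [3.0, 0.5], [2.0, 1.0], [0.0, 4.0]])
y = np.array([5.0, 4.0, 4.5, 8.0])
w = np.array([0.3, -0.1])
b = 0.2

dj_dw, dj_db = gradients(X, y, w, b)

# Central finite difference with respect to w_1 only.
eps = 1e-6
step = np.array([eps, 0.0])
numeric = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * eps)

print(dj_dw[0], numeric)  # the two values should agree closely
```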

\\

In addition, you can have multiple targets to predict. In that case, you will have multiple sets of weights (each set has as many weights as there are features). For example, a regression model used to predict two targets from three features reads:


$$\hat{y}_1^{(i)}  = w_{11} \times x_{1}^{(i)} + w_{12} \times x_{2}^{(i)} + w_{13} \times x_{3}^{(i)} + b_1 $$
$$\hat{y}_2^{(i)}  = w_{21} \times x_{1}^{(i)} + w_{22} \times x_{2}^{(i)} + w_{23} \times x_{3}^{(i)} + b_2$$

Note that the above equations are the result of a matrix-vector multiplication plus a bias vector ($\hat{y} = Wx + b$). Training the regression model means finding the **two sets of weights**, one for each target. Further, `b1` and `b2` are called **biases**; they act as constant offsets, e.g., when all features are zero we still get predictions for our targets (equal to the biases).
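The two-target equations can be written as one matrix product. In this NumPy sketch the weight matrix `W`, the bias vector `b`, and the feature values are assumed numbers; the point is only the shapes: `W` has one row per target and one column per feature.

```python
import numpy as np

# Assumed parameters: 2 targets x 3 features, and one bias per target.
W = np.array([[0.4, 0.2, -0.1],
              [0.1, 0.5,  0.3]])
b = np.array([1.0, -2.0])

# One training example with three features.
x = np.array([73.0, 67.0, 43.0])
y_hat = W @ x + b              # shape (2,): [y_hat_1, y_hat_2]

# Row by row, this is exactly the pair of equations written out above.
y_hat_1 = W[0, 0] * x[0] + W[0, 1] * x[1] + W[0, 2] * x[2] + b[0]
y_hat_2 = W[1, 0] * x[0] + W[1, 1] * x[1] + W[1, 2] * x[2] + b[1]

# For a batch of examples stored as rows, transpose W instead.
X = np.array([[73.0, 67.0, 43.0],
              [91.0, 88.0, 64.0]])
Y_hat = X @ W.T + b            # shape (2 examples, 2 targets)

print(y_hat, Y_hat.shape)
```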

\\

##**Goal**
**Learning** is the process of finding the best set of weights `w11, w12, ..., b1 and b2` using **training data**, so that the model's predictions are close to the targets and remain accurate on new data. One of the foundational learning techniques is called **gradient descent**. It can be implemented with a combination of `numpy` and `PyTorch`.


In [None]:
import numpy as np
import torch

##Training data

+ two matrices: `inputs` and `targets`

First we use `numpy` arrays to store our data because `numpy` is a powerful Python library for working with matrices. Then we convert them to `PyTorch` tensors for training. We store the data as **floating point numbers** (`float32`), which matches the dtype of the random weights created with `torch.randn`.

It is convenient to keep the input and output variables in separate matrices.

In [None]:
# Input variables (multiple features: temp (col1), rainfall (col2), humidity (col3))
inputs = np.array([[73, 67, 43],
                   [91, 88, 64],
                   [87, 134, 58],
                   [102, 43, 37],
                   [69, 96, 70]], dtype='float32')  # either write the values as floats (e.g., 73.) or pass dtype='float32'

__Target variable__ is like the estimated _posterior state_ in `Kalman filter`. __Input variable__ is like the _prior state_ from the prediction model. The __bias__ is like the noise/uncertainty added to the prior state.

Next, we convert them to `PyTorch` tensors via `torch.from_numpy(variable_name)`.
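One detail worth knowing about `torch.from_numpy` is that the resulting tensor shares memory with the NumPy array and keeps its dtype. The small array below is a made-up example used only to show this behaviour.

```python
import numpy as np
import torch

# Hypothetical array, stored as float32 like the training data.
arr = np.array([[1, 2], [3, 4]], dtype='float32')
t = torch.from_numpy(arr)

print(t.dtype)    # the NumPy dtype is preserved

# The tensor is a view of the same memory: changing the array changes the tensor.
arr[0, 0] = 10.0
print(t[0, 0])
```

If an independent copy is needed, `torch.tensor(arr)` copies the data instead of sharing it.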

In [None]:
# Targets (apples, oranges)
targets = np.array([[56, 70],
                    [81, 101],
                    [119, 133],
                    [22, 37],
                    [103, 119]], dtype='float32')

In [None]:
# Convert inputs and targets to tensors
inputs = torch.from_numpy(inputs)
targets = torch.from_numpy(targets)
print(inputs)
print(targets)

tensor([[ 73.,  67.,  43.],
        [ 91.,  88.,  64.],
        [ 87., 134.,  58.],
        [102.,  43.,  37.],
        [ 69.,  96.,  70.]])
tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


##Create a linear regression model

Now we have everything to create a linear regression model. We can begin with random **weights** `w` and **biases** `b` drawn from a **normal distribution**.

In [None]:
# Weights and biases
w = torch.randn(2 , 3, requires_grad=True) #two rows and three columns
b = torch.randn(2, requires_grad=True) #1-D tensor with two elements (one bias per target)
print(w)
print(b)

tensor([[ 1.3349, -1.3759, -0.0413],
        [ 0.7914,  0.9177, -0.6042]], requires_grad=True)
tensor([1.9568, 1.1380], requires_grad=True)


Now we have

+ inputs in pytorch tensor form
+ targets in pytorch tensor form
+ weights in pytorch tensor form
+ biases in pytorch tensor form

the model can be created:

```
model = inputs x weights(transposed) + b
```

To get the transpose of a matrix, use `.t()`; the `@` operator represents **matrix multiplication.**




In [None]:
#Test output here
Output = inputs @ w.t() + b
print(Output)

tensor([[  5.4423,  94.4197],
        [ -0.2909, 115.2500],
        [-68.6733, 157.9256],
        [ 77.4226,  98.9692],
        [-40.9130, 101.5564]], grad_fn=<AddBackward0>)


To avoid repetition, we want to create a function to pass the inputs and predict the targets.

In [None]:
def model(x):
  return x @ w.t()+b

Now let's pass the inputs to the created **linear regression model** and predict the target variables.

In [None]:
#predict target variables
preds = model(inputs)
print(preds)

tensor([[  5.4423,  94.4197],
        [ -0.2909, 115.2500],
        [-68.6733, 157.9256],
        [ 77.4226,  98.9692],
        [-40.9130, 101.5564]], grad_fn=<AddBackward0>)


To know how accurate the model is, we need to compare the predictions with the __actual targets__. Here we have 10 target values (5 training examples × 2 targets).

In [None]:
print(targets)

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])


Comparison shows that the predictions are quite far from the targets, because the model is based on **random weights and biases**. Therefore, we need to improve the model.

+ to evaluate the mean squared error (MSE) See Eqn. 11

```
MSE = Σ[(preds - actual)^2] / (number of elements)
```
+ calculate the difference between the two matrices (`preds` and `targets`). (This step is like calculating the residual in _Kalman filter_.)

+ square all elements of the difference matrix to remove negative values.

+ calculate the average of the elements in the resulting matrix.

Note that **MSE** is a **single value**.

In [None]:
diff = preds - targets
# diff may contain negative values; squaring below removes the sign

torch.sum(diff * diff) / diff.numel() #add all entry elements up and divide by the number of elements

tensor(7373.3345, grad_fn=<DivBackward0>)

##**The cost function**
To avoid repetition, we create the MSE function.

+ `torch.sum` returns the sum of all elements in a matrix
+ `.numel()` returns the number of elements in a matrix

In [None]:
# MSE loss
def mse(P,T):
  diff = P - T
  return torch.sum(diff*diff)/diff.numel()

Now compute MSE using the created model above

In [None]:
# Compute loss
loss = mse(preds, targets)
print(loss)

tensor(7373.3345, grad_fn=<DivBackward0>)


From the **MSE** value, it is clear that the error is quite large: the **square root of the loss** is about 86, which is comparable in size to the targets themselves. It indicates that the model is bad at predicting the target variables. **The lower the loss the better the model.**
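As a sanity check, the hand-written `mse` can be compared with PyTorch's built-in `torch.nn.functional.mse_loss`, which averages over all elements by default. The tensors `P` and `T` below are small made-up values, not the notebook's predictions.

```python
import torch
import torch.nn.functional as F

def mse(P, T):
    """Mean squared error over all elements, as defined in the notebook."""
    diff = P - T
    return torch.sum(diff * diff) / diff.numel()

# Hypothetical predictions and targets (2 examples x 2 targets).
P = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
T = torch.tensor([[0.0, 2.0], [5.0, 1.0]])

manual = mse(P, T)             # (1 + 0 + 4 + 9) / 4
builtin = F.mse_loss(P, T)     # default reduction is the mean over all elements
rmse = torch.sqrt(manual)      # same units as the targets

print(manual, builtin, rmse)
```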

##**To reduce the loss using gradient descent technique**

+  recall we set `requires_grad=True` to `w` and `b`
+ loss is a function of `w` and `b`
+ compute gradients using `.backward()`
+ calculated gradients are stored in `.grad`
+ note `grad` can be implicitly created only for **scalar** outputs (loss function outputs a single scalar value)
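The bullet points above can be tried on a tiny scalar example. This is a sketch with assumed numbers (one weight, one bias, one data point), chosen so the gradients are easy to verify by hand.

```python
import torch

# One weight and one bias, both tracked by autograd.
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# A single hypothetical data point.
x, y = 2.0, 4.0

# Scalar loss: (w*x + b - y)^2 = (6 + 1 - 4)^2 = 9
loss = (w * x + b - y) ** 2

# backward() fills w.grad and b.grad with d(loss)/dw and d(loss)/db.
loss.backward()

# By hand: d(loss)/dw = 2 * (w*x + b - y) * x = 12, d(loss)/db = 2 * (w*x + b - y) = 6
print(w.grad, b.grad)
```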

In this particular example, not only are there multiple features (temp (col1), rainfall (col2), humidity (col3)), but also multiple targets (apples and oranges). This indicates that we need two sets of weights $w_j$. Recall that $j$ indexes the features. Three features are required to describe one target, i.e., $j = 1, 2, 3$. Therefore, we need six weights for two targets. Further, each target needs a cost function to train the weights; in the code, the single MSE value averages the squared errors of both targets, and each target's weights only receive gradients from that target's errors.

## Compute gradients
`PyTorch` can automatically compute the gradient or derivative of the loss w.r.t. the weights and biases because they have `requires_grad` set to `True`.

In [None]:
# Compute gradients
loss.backward()

In [None]:
print(w.grad)
print(b.grad)

tensor([[-6338.5322, -9424.3359, -5256.9863],
        [ 2073.0339,  1444.0497,   895.9088]])
tensor([-81.6025,  21.6242])


## **Adjust weights and biases to reduce the loss vs Kalman gain in Kalman filter**

Our objective is to find the set of weights where the loss is the lowest.

In a _Kalman filter_, the Kalman gain plays a role similar to the weights in machine learning: it is also used to minimize the discrepancy between prediction and measurement. However, the _Kalman filter_ is not a gradient-based method. Instead, it's a **recursive algorithm** that estimates the state of a dynamic system from a series of noisy measurements.

The _Kalman filter_ operates by iteratively updating its estimates of the state based on new measurements and the system's dynamics model.

Therefore, instead of adjusting the Kalman gain through taking the derivative of a loss function w.r.t. Kalman gain, we obtain optimal results with the Kalman filter by selecting an __appropriate initial covariance matrix__ for the state estimate!! (This is the key difference between linear regression and Kalman filter.)

The _covariance matrix_ represents the uncertainty or error in the initial state estimate. By choosing an initial covariance matrix that accurately reflects the uncertainty in the initial state, the Kalman filter can effectively balance between trusting the initial estimate and incorporating new measurements.

If the initial covariance matrix underestimates the uncertainty in the initial state, the filter may overly trust the initial estimate and __be slow to adjust to new measurements__, leading to __suboptimal results__. On the other hand, if the initial covariance matrix overestimates the uncertainty, the filter may be __overly cautious__ and __converge more slowly__ to the true state, also leading to __suboptimal performance__.


If a gradient element is positive:

+ **increasing the weight** element's value slightly will **increase the loss**
+ **decreasing the weight** element's value slightly will **decrease the loss**

If a gradient element is negative:

+ **increasing the weight** element's value slightly will **decrease the loss**
+ **decreasing the weight** element's value slightly will **increase the loss**
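These four statements can be checked on a one-parameter example. The loss $(w-2)^2$ below is an assumption chosen for illustration (its minimum is at $w = 2$); it is not the notebook's loss.

```python
def loss(w):
    """Hypothetical one-parameter loss with its minimum at w = 2."""
    return (w - 2.0) ** 2

def grad(w):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w - 2.0)

step = 0.01
for w in (3.0, 1.0):   # one point with positive gradient, one with negative
    change_up = loss(w + step) - loss(w)     # effect of increasing w slightly
    change_down = loss(w - step) - loss(w)   # effect of decreasing w slightly
    print(f"w={w}, grad={grad(w):+.1f}, "
          f"increase w: {change_up:+.4f}, decrease w: {change_down:+.4f}")
```

At `w = 3.0` the gradient is positive and increasing `w` raises the loss; at `w = 1.0` the gradient is negative and increasing `w` lowers it.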

`PyTorch` will make this judgment for us. Now we need to choose an appropriate **learning rate ($\alpha = 10^{-5}$)** to train the weights.

+ learning rate is used to help us avoid modifying the weights by a very large amount.  (It is kind of like choosing an appropriate initial covariance matrix: excessive trust can lead to slow convergence, whereas insufficient trust can cause erratic fluctuations and delay the convergence rate.)

`with torch.no_grad()` tells PyTorch not to track the operations inside the block. The gradients have already been obtained via `loss.backward()`, and the weight update itself should not become part of the computation graph.

It should be noted that `w.grad` holds the derivative of the cost function, i.e., $\frac{\partial J(\vec{w},b)}{\partial w_{n}}$ where $J$ is the cost function. The following code implements:

$$w_{n} = w_{n}-α\frac{\partial J(\vec{w},b)}{\partial w_{n}}$$

The gradient measures the rate of change of the loss. We perform a simple subtraction of the gradient from the weight. The use of the __'minus' sign__ aligns with the following principles: when the gradient is negative, we increase the weight value to decrease the loss; conversely, when the gradient is positive, we decrease the weight to reduce the loss. The negative sign ensures that this desired outcome is achieved!
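The need for `torch.no_grad()` during the update can be seen in a small sketch: PyTorch refuses an in-place update of a leaf tensor that requires gradients unless tracking is switched off. The tensor values and the learning rate here are assumptions for illustration.

```python
import torch

# Hypothetical weights and a simple scalar loss: sum(w^2), so d(loss)/dw = 2*w.
w = torch.tensor([1.0, 2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()            # w.grad is now [2., 4.]

# Without no_grad, the in-place update is rejected by autograd.
try:
    w -= 0.1 * w.grad
    raised = False
except RuntimeError:
    raised = True
print("in-place update without no_grad raised an error:", raised)

# Inside no_grad, the same update is allowed and is not tracked.
with torch.no_grad():
    w -= 0.1 * w.grad      # w = w - alpha * dJ/dw

print(w)                   # still has requires_grad=True for the next iteration
```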

In [None]:
with torch.no_grad():
  w -= w.grad * 1e-5
  b -= b.grad * 1e-5

In [None]:
print(w)
w.grad * 1e-5

tensor([[ 1.3983, -1.2816,  0.0113],
        [ 0.7707,  0.9033, -0.6131]], requires_grad=True)


tensor([[-0.0634, -0.0942, -0.0526],
        [ 0.0207,  0.0144,  0.0090]])

With gradient descent technique (updated weights and biases), let's test if the model can give more accurate predictions.

Remember to reset the gradients to zero by invoking the `.zero_()` method. We need to do this because PyTorch accumulates gradients. Otherwise, the next time we invoke `.backward` on the loss, the new gradient values are added to the existing gradients, which may lead to unexpected results.
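The accumulation behaviour is easy to reproduce. In this sketch (a single assumed weight and the made-up loss $(3w)^2$), calling `.backward()` twice without resetting doubles the stored gradient, while calling `.zero_()` first restores the expected value.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d/dw (3w)^2 = 18w = 18 at w = 1.
loss = (3.0 * w) ** 2
loss.backward()
first = w.grad.clone()

# Second backward pass WITHOUT resetting: the new gradient is added to the old one.
loss = (3.0 * w) ** 2
loss.backward()
accumulated = w.grad.clone()

# Reset, then compute again: the stored gradient is the fresh value only.
w.grad.zero_()
loss = (3.0 * w) ** 2
loss.backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```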

In [None]:
preds = model(inputs)
loss = mse(preds, targets)#use the new weights
w.grad.zero_()
b.grad.zero_()
print(w.grad)
print(b.grad)
print(loss)
print(torch.sqrt(loss))

tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0., 0.])
tensor(5878.7480, grad_fn=<DivBackward0>)
tensor(76.6730, grad_fn=<SqrtBackward0>)


## **Train the model using gradient descent**

In machine learning, linear regression is commonly trained with a gradient descent optimization algorithm. It adjusts the weights by computing the gradients of the loss (derivatives with respect to the weights) in order to minimize the loss (residual). We can train the model using the following steps:

1. Generate predictions
2. Calculate the loss
3. Compute gradients w.r.t. the weights and biases
4. Adjust the weights by subtracting a small quantity proportional to the gradient (the small quantity results from multiplying the gradient by a learning rate; it's worth noting that the gradient value itself is usually quite large)

In [None]:
# Generate predictions
preds = model(inputs)
print(preds)

tensor([[ 18.6450,  91.5534],
        [ 17.1358, 111.5192],
        [-47.4803, 153.6673],
        [ 89.8863,  95.9020],
        [-23.8114,  98.1124]], grad_fn=<AddBackward0>)


In [None]:
# Calculate the loss
loss = mse(preds, targets)
print(loss)

tensor(5878.7480, grad_fn=<DivBackward0>)


In [None]:
# Compute gradients
loss.backward()
print(w.grad)
print(b.grad)

tensor([[-14908.7520, -23811.5840, -13028.6582],
        [  5337.2754,   3400.0527,   2109.5806]])
tensor([-195.9747,   54.4526])


Update the weights and biases using the gradients computed above.

The `torch.no_grad()` context manager ensures that operations inside the block are not tracked by autograd. Here it is used so that the in-place updates of `w` and `b` and the gradient reset are not recorded in the computation graph. The same context manager is also commonly used during inference or evaluation, where it saves memory and speeds up computation because no graph needs to be stored for gradient calculation.

In [None]:
with torch.no_grad():
    w -= w.grad * 1e-5
    b -= b.grad * 1e-5
    w.grad.zero_()
    b.grad.zero_()

Check the updated weights and biases:

In [None]:
print(w)
print(b)

tensor([[ 1.5473, -1.0435,  0.1416],
        [ 0.7173,  0.8693, -0.6342]], requires_grad=True)
tensor([1.9596, 1.1373], requires_grad=True)


We expect a significant reduction in the loss with the new weights:

In [None]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)

tensor(3405.1355, grad_fn=<DivBackward0>)


We have seen a reduction in the loss by adjusting the weights using gradient descent.

##**Reduce the loss further by training for multiple epochs**

Let's train the model for $100$ epochs.

In [None]:
# Train for 100 epochs
for i in range(100):
    preds = model(inputs)
    loss = mse(preds, targets)
    loss.backward()
    with torch.no_grad():
        w -= w.grad * 1e-5
        b -= b.grad * 1e-5
        w.grad.zero_()
        b.grad.zero_()

Let's verify if the loss gets lower:

Please note that the loss is computed as the sum of squared differences divided by the number of elements, so it is expressed in squared units of the targets. Taking the square root of the loss gives the root mean squared error, which is in the same units as the targets and is therefore easier to interpret when assessing our linear regression model.

In [None]:
# Calculate loss
preds = model(inputs)
loss = mse(preds, targets)
print(loss)
print(torch.sqrt(loss))

tensor(845.6121, grad_fn=<DivBackward0>)
tensor(29.0794, grad_fn=<SqrtBackward0>)


Let's compare predictions to targets:

In [None]:
preds

tensor([[ 69.1629,  75.2402],
        [ 89.4368,  93.0396],
        [ 82.9678, 142.3888],
        [ 88.8255,  63.4678],
        [ 75.3508,  90.6502]], grad_fn=<AddBackward0>)

In [None]:
targets

tensor([[ 56.,  70.],
        [ 81., 101.],
        [119., 133.],
        [ 22.,  37.],
        [103., 119.]])

Comparison shows a significant improvement in predicting the target variables.

## **Linear regression using PyTorch built-ins**

Linear regression with gradient descent uses only basic tensor operations, which is a common pattern in deep learning. PyTorch therefore provides **built-in** functions and classes that make it easy to create and train models with just a few lines of code.

For example:

```
torch.nn

```

In [1]:
import torch.nn as nn
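As a rough sketch of where the built-ins lead, the manual model, loss, and update step from the previous sections map onto `nn.Linear`, `torch.nn.functional.mse_loss`, and `torch.optim.SGD`. This is one possible arrangement under stated assumptions (same data, same learning rate of `1e-5`, 100 epochs, and a fixed seed for reproducibility), not necessarily the way the rest of the notebook proceeds.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # assumed seed, only to make the random initial weights repeatable

# Same training data as above: 5 examples, 3 features, 2 targets.
inputs = torch.tensor([[73., 67., 43.], [91., 88., 64.], [87., 134., 58.],
                       [102., 43., 37.], [69., 96., 70.]])
targets = torch.tensor([[56., 70.], [81., 101.], [119., 133.],
                        [22., 37.], [103., 119.]])

# nn.Linear(3, 2) holds a (2, 3) weight matrix and a bias of length 2,
# and computes x @ weight.t() + bias, like the hand-written model.
model = nn.Linear(3, 2)

# Stochastic gradient descent optimizer; step() applies p -= lr * p.grad.
opt = torch.optim.SGD(model.parameters(), lr=1e-5)

loss_before = F.mse_loss(model(inputs), targets).item()

for epoch in range(100):
    loss = F.mse_loss(model(inputs), targets)  # 1-2. predictions and loss
    opt.zero_grad()                            # reset accumulated gradients
    loss.backward()                            # 3. compute gradients
    opt.step()                                 # 4. update weights and biases

loss_after = F.mse_loss(model(inputs), targets).item()
print(loss_before, loss_after)
```

Here `opt.zero_grad()` and `opt.step()` take over the roles of the manual `.grad.zero_()` calls and the `with torch.no_grad():` update block.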