In [1]:
import torch
torch.__version__

'1.8.0'

## Linear Regression

Regression refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. When that relationship is linear, the method is called Linear Regression.

This relationship is expressed as a linear equation, $y = xw + b$, where $x$ is the set of independent variables, $w$ is the set of weights they are multiplied by, $b$ is a bias and $y$ is the dependent variable.

## Linear Model

When the inputs consist of $d$ features (also called attributes), our prediction $\hat{y}$ is
$$ \hat{y} = w_1x_1 + \dots + w_dx_d + b $$

The above equation can also be expressed as 
$$ \hat{y} = w^Tx+b $$
where $x \in \mathbb{R}^{d}$ is the feature vector and $w \in \mathbb{R}^{d}$ is the weight vector. Their dot product $w^Tx$ is the weighted sum of the features.

Normally, the dataset consists of more than a single example, so instead of a feature vector there is a feature matrix $X$, in which each row represents one observation (sample).

For the whole dataset the equation becomes $\hat{y} = Xw + b$, where $X \in \mathbb{R}^{n \times d}$ and $\hat{y} \in \mathbb{R}^{n}$.

> Note: Broadcasting is applied during the summation, so the single bias $b$ is added to every entry of $Xw$.

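The broadcasting in $\hat{y} = Xw + b$ can be seen on a tiny example. This is a minimal sketch; the values of `X`, `w` and `b` are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import torch

# Hypothetical toy values, chosen only to illustrate the shapes.
X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])       # n = 3 examples, d = 2 features
w = torch.tensor([[2.0], [-1.0]])    # weights, shape (d, 1)
b = torch.tensor([0.5])              # bias, shape (1,)

# torch.mm(X, w) has shape (3, 1); b has shape (1,) and is broadcast to every row.
y_hat = torch.mm(X, w) + b
print(y_hat.shape)
print(y_hat)
```

Each row of `y_hat` is the weighted sum of that row's features plus the same bias, here 0.5, 2.5 and 4.5.
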
## Training Dataset

To find the weights and bias of a regression model, we need a training dataset consisting of both the independent variables and the dependent variable.

Given the features of a training dataset $X$ and the corresponding labels $y$, the goal is to find $w$ and $b$ such that the model makes predictions with the least error. We need two things to accomplish this task.
- A quality measure to find the accuracy of the model (hint: norms).
- A procedure for updating the model to improve its quality.

## Loss Function

A loss function is a non-negative function that compares the ground-truth values with the predicted values. Norms are commonly used as loss functions: they are zero only when the predictions match the targets exactly, and the lower the value they produce, the better the predictions.

The most popular choice is the squared error, which is based on the squared $L_2$ norm:
$$l^{(i)}(w,b) = \frac{1}{2}(\hat{y}^{(i)} - y^{(i)})^2$$

To measure the quality of the model on $n$ training examples, we average the per-example squared errors, so the loss function can be expressed as:
$$L(w,b) = \frac{1}{n} \sum_{i=1}^n l^{(i)}(w,b) = \frac{1}{n}\sum_{i=1}^n\frac{1}{2} (w^Tx^{(i)} + b - y^{(i)})^2 $$
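
The averaged squared loss can be worked through by hand on a few numbers. The sketch below uses plain Python so every term is visible; the helper names `squared_loss` and `mean_loss` and the toy data are hypothetical.

```python
def squared_loss(y_hat, y):
    """Per-example loss l = 0.5 * (y_hat - y)^2."""
    return 0.5 * (y_hat - y) ** 2

def mean_loss(w, b, X, y):
    """Average of the per-example losses for the linear model w^T x + b."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        y_hat_i = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        total += squared_loss(y_hat_i, y_i)
    return total / len(X)

# Hypothetical toy data: 3 examples with 2 features each.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 2.0, 5.0]
w, b = [2.0, -1.0], 0.5

print(mean_loss(w, b, X, y))  # predictions 0.5, 2.5, 4.5 -> each loss is 0.125
```

Every prediction is off by 0.5, so each per-example loss is $\frac{1}{2} \cdot 0.25 = 0.125$ and the average is 0.125 as well.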

As training the model is an optimisation problem, we find parameters that minimise the loss function across all training examples: 
$$w^*,b^* = \operatorname*{argmin}_{w,b} L(w, b) $$

## Gradient Descent (GD)

Linear regression with squared loss has an analytic (closed-form) solution: an equation that computes the weights and bias directly from the data. However, this approach is rigid. Each model needs its own derivation, the derivation is often difficult, and most models have no such solution at all, so it does not generalise.
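
For reference, the analytic solution for linear regression with squared loss can be sketched as follows. This is the standard normal-equation form and rests on two assumptions not stated above: the bias is absorbed into $w$ by appending a column of ones to $X$, and $X^TX$ is invertible.

$$ w^* = (X^TX)^{-1}X^Ty $$

The derivation sets the gradient of $L$ with respect to $w$ to zero, which is only tractable because the model is linear and the loss is quadratic.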

Gradient Descent is a key technique for optimising nearly any deep learning model. It is an algorithm that iteratively reduces the loss by updating the parameters (weights and bias) in the direction of the negative gradient, i.e. the direction in which the loss function decreases fastest.

We use partial derivatives in this algorithm to calculate the change in the loss function w.r.t. the parameters:

$$w \leftarrow w - \frac{\eta}{n} \sum_{i=1}^n \nabla_w l^{(i)}(w, b)$$
$$b \leftarrow b - \frac{\eta}{n} \sum_{i=1}^n \frac{\partial l^{(i)}(w, b)}{\partial b}$$

where $\eta$ is a positive scalar called the __Learning Rate__. We randomly initialise the values of the model parameters, and use Gradient Descent to iteratively move them towards the optimal set of parameters.
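
The two update rules can be written out by hand for a model with a single feature. This is a minimal sketch in plain Python rather than PyTorch, so the partial derivatives are explicit; the data, the learning rate `eta` and the number of steps are assumed values, and the parameters start at zero instead of at random values to keep the run reproducible.

```python
# Hypothetical noiseless toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0   # fixed start instead of random initialisation
eta = 0.05        # learning rate (assumed value)
n = len(xs)

for step in range(2000):
    # For l = 0.5 * (w*x + b - y)^2: dl/dw = (w*x + b - y) * x and dl/db = (w*x + b - y).
    # Both are averaged over all n examples, i.e. full-batch gradient descent.
    grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= eta * grad_w
    b -= eta * grad_b

print(round(w, 3), round(b, 3))  # close to the generating values 2 and 1
```

A larger `eta` speeds this up only to a point; if it is too large, the updates overshoot and the loss grows instead of shrinking.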

## Minibatch Stochastic Gradient Descent
It is a type of gradient descent algorithm that uses only a random minibatch of samples, $\mathcal{B}$, each time it updates the parameters. Full-batch Gradient Descent is slow because every update runs over the whole dataset; minibatch SGD avoids this by using only a small subset at a time, so each update is much cheaper. The update equations become:
$$w \leftarrow w - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_w l^{(i)}(w, b)$$
$$b \leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \frac{\partial l^{(i)}(w, b)}{\partial b}$$

This method requires at least two hyperparameters: the batch size and the learning rate. Hyperparameter tuning is the process of choosing hyperparameters based on results on a separate __Validation__ set.
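
The minibatch variant differs from full-batch gradient descent only in which examples enter each gradient average. The sketch below, again in plain Python, reshuffles the example indices every epoch and walks through them in chunks; the data, `eta`, `batch_size` and the epoch count are assumed values for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical noiseless toy data generated from y = 2x + 1.
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0
eta = 0.05        # learning rate (assumed value)
batch_size = 10   # minibatch size (assumed value)

for epoch in range(300):
    indices = list(range(len(xs)))
    rng.shuffle(indices)  # new random order every epoch
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Average the gradients over the minibatch only, not the whole dataset.
        grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
        w -= eta * grad_w
        b -= eta * grad_b

print(round(w, 3), round(b, 3))  # close to the generating values 2 and 1
```

Shuffling once per epoch and slicing is one common way to draw minibatches; sampling each batch independently at random is another.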

## Deep Learning Pipeline

1. Create and read the dataset
2. Define the model
3. Initialise the parameters (Weights and Bias)
4. Define the loss function (MSE)
5. Choose the training algorithm (SGD or Adam)
6. Run the training method

In [2]:
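# Generating the dataset

def prepare_data(w, b, num_examples):
    """Generate y = Xw + b + noise."""
    # Features drawn from a standard normal distribution, shape (num_examples, d)
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.mm(X, w) + b
    # Add Gaussian noise with standard deviation 0.01 to the labels
    y += torch.normal(0, 0.01, y.size())
    return X, y.reshape(-1, 1)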

true_w = torch.tensor([[2], [-3.3]])
true_b = torch.tensor([4.5])
features, labels = prepare_data(true_w, true_b, 1000)

In [3]:
# Reading the dataset

dataset = torch.utils.data.TensorDataset(features, labels)
batch_size = 10
data_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
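
To see what the `DataLoader` yields, the shapes of the minibatches in one pass can be printed. This is a self-contained sketch with hypothetical stand-in data (25 examples rather than the 1000 generated above), so that the smaller final batch is visible.

```python
import torch

# Hypothetical stand-in data: 25 examples, 2 features and 1 label each.
features = torch.arange(50, dtype=torch.float32).reshape(25, 2)
labels = torch.arange(25, dtype=torch.float32).reshape(25, 1)

dataset = torch.utils.data.TensorDataset(features, labels)
data_iter = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)

# One pass over the data; the last batch holds the 5 leftover examples.
batch_shapes = [(tuple(X.shape), tuple(y.shape)) for X, y in data_iter]
print(batch_shapes)
```

With `shuffle=True` the examples are reordered at the start of every pass, so each epoch sees different minibatches.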

In [4]:
# Define the model

num_of_inp, num_of_out = 2, 1
net = torch.nn.Linear(num_of_inp, num_of_out)

In [5]:
# Initialise the parameters
net.weight.data.normal_(0, 0.01)
net.bias.data.fill_(0)

tensor([0.])

In [6]:
# Choose the loss function
loss = torch.nn.MSELoss()

In [7]:
# Choose the training algorithm (SGD or Adam)
optimiser = torch.optim.SGD(net.parameters(), lr=0.001)

In [8]:
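# Run the training method
num_epochs = 30
for epoch in range(num_epochs):
    for X, y in data_iter:
        y_hat = net(X)           # forward pass on one minibatch
        l = loss(y_hat, y)       # minibatch loss
        optimiser.zero_grad()    # clear gradients from the previous step
        l.backward()             # backpropagate
        optimiser.step()         # update weights and bias
    # Loss on the full dataset after each epoch
    l = loss(net(features), labels)
    print(f"epoch {epoch+1}, loss {l:f}")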

...
epoch 1232, loss 0.000101
epoch 1233, loss 0.000101
epoch 1234, loss 0.000101
epoch 1235, loss 0.000101
epoch 1236, loss 0.000101
epoch 1237, loss 0.000101
epoch 1238, loss 0.000101
epoch 1239, loss 0.000101
epoch 1240, loss 0.000101
epoch 1241, loss 0.000101
epoch 1242, loss 0.000101
epoch 1243, loss 0.000101
epoch 1244, loss 0.000101
epoch 1245, loss 0.000101
epoch 1246, loss 0.000101
epoch 1247, loss 0.000101
epoch 1248, loss 0.000101
epoch 1249, loss 0.000101
epoch 1250, loss 0.000101
epoch 1251, loss 0.000101
epoch 1252, loss 0.000101
epoch 1253, loss 0.000101
epoch 1254, loss 0.000101
epoch 1255, loss 0.000101
epoch 1256, loss 0.000101
epoch 1257, loss 0.000101
epoch 1258, loss 0.000101
epoch 1259, loss 0.000101
epoch 1260, loss 0.000101
epoch 1261, loss 0.000101
epoch 1262, loss 0.000101
epoch 1263, loss 0.000101
epoch 1264, loss 0.000101
epoch 1265, loss 0.000101
epoch 1266, loss 0.000101
epoch 1267, loss 0.000101
epoch 1268, loss 0.000101
epoch 1269, loss 0.000101
...

In [11]:
new_w = net.weight.data
new_b = net.bias.data
print("True weights ({}), Predicted weights ({}), error in estimating w: {}".format(true_w, new_w, true_w - new_w.reshape(true_w.shape)))
print("True bias ({}), Predicted bias ({}), error in estimating b: {}".format(true_b, new_b, true_b - new_b.reshape(true_b.shape)))

True weights (tensor([[ 2.0000],
        [-3.3000]])), Predicted weights (tensor([[ 2.0001, -3.3003]])), error in estimating w: tensor([[-8.9884e-05],
        [ 2.6298e-04]])
True bias (tensor([4.5000])), Predicted bias (tensor([4.5005])), error in estimating b: tensor([-0.0005])
