# 3.2 Linear Regression Implementation from Scratch

## Exercises

### Q1. What would happen if we were to initialize the weights to zero? Would the algorithm still work?

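Yes, the algorithm would still work. With `w = 0`, `linreg(X, w, b)` initially returns the same value (= `b`) for every `X`, but the loss still depends on `w`: the gradient of the squared loss with respect to `w` is `X.T @ (y_hat - y)`, which is nonzero whenever the predictions differ from the labels. The first update therefore moves `w` away from zero, and training proceeds as it would from a small random initialization. Because the squared loss of linear regression is convex, the starting point does not change the solution that gradient descent heads toward. Zero initialization becomes a real problem only in multi-layer networks, where it makes all hidden units in a layer compute identical values and receive identical gradients.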
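
As a sanity check on the question above, the following minimal sketch trains a linear regression model starting from all-zero parameters. The data-generating values (`true_w`, `true_b`), the learning rate, and the use of full-batch gradient descent are assumptions chosen for illustration, not settings taken from the text.

```python
import torch

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Synthetic data; these true parameters are illustrative assumptions.
true_w = torch.tensor([[2.0], [-3.4]])
true_b = 4.2
X = torch.normal(0, 1, (1000, 2))
y = X @ true_w + true_b + torch.normal(0, 0.01, (1000, 1))

# Both parameters start at exactly zero.
w = torch.zeros((2, 1), requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(200):
    loss = ((X @ w + b - y) ** 2 / 2).mean()
    loss.backward()
    if step == 0:
        first_grad_w = w.grad.clone()  # gradient evaluated at w = 0
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()

print('gradient at w = 0:', first_grad_w.flatten())
print('learned w:', w.detach().flatten(), 'learned b:', b.item())
```

The gradient at `w = 0` is far from zero, and the learned parameters end up close to the values used to generate the data.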

### Q2. Assume that you are Georg Simon Ohm trying to come up with a model between voltage and current. Can you use auto differentiation to learn the parameters of your model?

Yes. Assume that a linear relationship exists between voltage V and current I: `V = wI + b`. Given a set of observations of V and I, our task is to find the values of `(w, b)` that best fit those observations. This can be achieved using auto-diff exactly as illustrated in the text for linear regression. If Ohm's law holds, the learned `w` estimates the resistance and `b` should come out close to zero.
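
A minimal sketch of this idea follows. The measurements are synthetic: the 5-ohm resistance, the noise level, and the current range are all hypothetical values invented for the example, and plain full-batch gradient descent stands in for the minibatch SGD used in the text.

```python
import torch

torch.manual_seed(0)

# Hypothetical measurements: a 5-ohm resistor with small measurement noise.
true_resistance = 5.0
current = torch.linspace(0.1, 2.0, 50).reshape(-1, 1)  # amperes
voltage = true_resistance * current + torch.normal(0, 0.01, current.shape)  # volts

w = torch.zeros(1, requires_grad=True)  # learned resistance
b = torch.zeros(1, requires_grad=True)  # learned offset, ideally close to 0

lr = 0.1
for _ in range(2000):
    loss = ((current * w + b - voltage) ** 2 / 2).mean()
    loss.backward()  # autograd computes d(loss)/dw and d(loss)/db
    with torch.no_grad():
        for p in (w, b):
            p -= lr * p.grad
            p.grad.zero_()

print(f'w = {w.item():.3f}, b = {b.item():.3f}')
```

With these assumed measurements, `w` should land near 5 and `b` near 0.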

### Q3. Can you use Planckʼs Law to determine the temperature of an object using spectral energy density?

TODO

### Q4. What are the problems you might encounter if you wanted to compute the second derivatives? How would you fix them?

The second derivative is much more expensive to compute than the first derivative, for the reason mentioned in 2.2 Q1.

How to fix them? TODO
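
One possible direction, sketched here rather than offered as a definitive answer: by default PyTorch does not record the gradient computation itself, so the first derivative cannot be differentiated again. Passing `create_graph=True` to `torch.autograd.grad` keeps that computation in the graph. The scalar function `y = x ** 3` is an assumption chosen so the result is easy to check by hand.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3  # dy/dx = 3x^2, d2y/dx2 = 6x

# create_graph=True records the gradient computation so that it can be
# differentiated a second time.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item(), d2y_dx2.item())
```

This addresses the mechanics only. For a parameter vector with `n` entries the full second derivative is an `n x n` matrix, so the cost concern raised above remains.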

### Q5. Why is the `reshape` function needed in the `squared_loss` function?

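- `linreg` returns `y_hat` with shape `(n, 1)`, while the labels `y` may arrive as a 1D tensor of shape `(n,)`. Reshaping `y` to `y_hat.shape` keeps the subtraction elementwise; without it, broadcasting would silently produce an `(n, n)` matrix of pairwise differences instead of `n` per-example losses.
- The reshape is applied to `y` rather than to `y_hat` because `y_hat` is the model output, and the returned loss is expected to have that same shape.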
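
The shape mismatch described in the bullets above can be reproduced with a small sketch. The tensors are made-up values in which the prediction equals the label for every example, so the correct per-example loss is zero.

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1), like the output of linreg
y = torch.tensor([1.0, 2.0, 3.0])            # shape (3,), a flat label vector

# Without reshape, broadcasting pairs every prediction with every label.
without_reshape = (y_hat - y) ** 2 / 2
# With reshape, each prediction is compared only with its own label.
with_reshape = (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

print(without_reshape.shape, without_reshape.sum().item())
print(with_reshape.shape, with_reshape.sum().item())
```

Without the reshape, the result has shape `(3, 3)` and a nonzero sum even though every prediction is exact, and no error is raised to flag the problem.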

### Q6. Experiment using different learning rates to find out how fast the loss function value drops.

In [1]:
import torch

In [2]:
true_w = torch.tensor([4, 3.4, -3])
true_b = torch.tensor([5.6])

In [3]:
def synthetic_data(w, b, num_examples):
    X = torch.normal(0, 1, (num_examples, len(w)))
    y = torch.matmul(X, w) + b
    y += torch.normal(0, 0.01, y.shape)
    return X, y.reshape((-1, 1))

num_examples = 1000

features, labels = synthetic_data(true_w, true_b, num_examples)

In [4]:
import random

def data_iter(batch_size, features, labels):
    num_examples = len(labels)
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # Slice the shuffled index list so that each minibatch is actually random
        batch_indices = torch.tensor(indices[i: min(i + batch_size, num_examples)])
        yield features[batch_indices], labels[batch_indices]

In [5]:
def linreg(X, w, b):
    return torch.matmul(X, w) + b

In [6]:
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

In [7]:
def sgd(params, lr, batch_size):
    with torch.no_grad():
        for param in params:
            param -= lr * param.grad / batch_size
            param.grad.zero_()

In [8]:
num_epochs = 10
batch_size = 10

In [9]:
# Parameter initialization is done inside the function so that we start from scratch every time train() is called
def train(lr):
    w = torch.normal(0, 0.01, size=(len(true_w), 1), requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    for epoch in range(num_epochs):
        for X, y in data_iter(batch_size, features, labels):
            l = squared_loss(linreg(X, w, b), y)
            l.sum().backward()
            sgd([w, b], lr, batch_size)
        with torch.no_grad():
            train_loss = squared_loss(linreg(features, w, b), labels)
            if (epoch + 1) % 2 == 0:
                print(f'epoch {epoch + 1}, loss = {float(train_loss.mean()):.4f}')

In [10]:
print("learning rate = 0.0003")
train(0.0003)
print("\nlearning rate = 0.003")
train(0.003)
print("\nlearning rate = 0.03")
train(0.03)

learning rate = 0.0003
epoch 2, loss = 30.4014
epoch 4, loss = 26.9095
epoch 6, loss = 23.8204
epoch 8, loss = 21.0873
epoch 10, loss = 18.6692

learning rate = 0.003
epoch 2, loss = 10.1153
epoch 4, loss = 3.0057
epoch 6, loss = 0.8988
epoch 8, loss = 0.2704
epoch 10, loss = 0.0818

learning rate = 0.03
epoch 2, loss = 0.0002
epoch 4, loss = 0.0000
epoch 6, loss = 0.0000
epoch 8, loss = 0.0000
epoch 10, loss = 0.0000

Over this range, the loss drops faster as the learning rate grows: with a learning rate of 0.0003 the loss is still 18.67 after 10 epochs, with 0.003 it reaches 0.08, and with 0.03 it is already down to 0.0002 after 2 epochs.


### Q7. If the number of examples cannot be divided by the batch size, what happens to the `data_iter` functionʼs behavior?

If the number of examples is not divisible by the batch size, the last batch contains only the examples left over after all full batches have been drawn, so it is smaller than `batch_size`. This works because of the `min(i + batch_size, num_examples)` expression in the `data_iter` function, which caps the end index of the final batch at the number of examples.
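
The index arithmetic can be checked in isolation with a short sketch. `batch_bounds` is a hypothetical helper written for this illustration; it mirrors the `range` and `min` logic of `data_iter` without touching any tensors.

```python
def batch_bounds(num_examples, batch_size):
    """Return the (start, end) index pair of every batch, using the same
    range/min logic as data_iter (hypothetical helper, not from the text)."""
    return [(i, min(i + batch_size, num_examples))
            for i in range(0, num_examples, batch_size)]

bounds = batch_bounds(num_examples=10, batch_size=3)
sizes = [end - start for start, end in bounds]
print(bounds)  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(sizes)   # [3, 3, 3, 1]
```

Note that `sgd` in Q6 divides the summed gradient by the fixed `batch_size`, so a smaller final batch would receive a proportionally smaller update. With 1000 examples and a batch size of 10 this case never occurs there.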