# Linear Regression Implementation from Scratch


In [2]:
%matplotlib inline
import random
import tensorflow as tf
from dl import tensorflow as dl


## Generating the Dataset

We generate a synthetic dataset from a linear model with additive noise,
and then visualize it.

The generated dataset contains 1000 examples, each consisting of 2 features
sampled from a standard normal distribution. The dataset is a matrix
$\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.

The true parameters that generate the synthetic dataset are:
$\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$.

The corresponding labels will be assigned according
to the following linear model with the additive noise $\epsilon$:

$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$

You could think of $\epsilon$ as capturing potential
measurement errors on the features and labels.

We will assume that the standard assumptions hold and thus
that $\epsilon$ obeys a normal distribution with mean of 0.
To make our problem easy, we will set its standard deviation to 0.01.

The following code generates our synthetic dataset.


In [None]:
def synthetic_data(w, b, num_examples):  #@save
    """Generate y = Xw + b + noise."""
    X = tf.zeros((num_examples, w.shape[0]))
    X += tf.random.normal(shape=X.shape)
    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
    y += tf.random.normal(shape=y.shape, stddev=0.01)
    y = tf.reshape(y, (-1, 1))
    return X, y

In [None]:
true_w = tf.constant([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Each row in `features` consists of a 2-dimensional data example.

Each row in `labels` consists of a 1-dimensional label value (a scalar).


In [None]:
print('features:', features[0], '\nlabel:', labels[0])

By using a scatter plot to visualize the second feature `features[:, 1]` and `labels`, we observe the linear correlation between the two.


In [None]:
dl.set_figsize()
# The semicolon is for displaying the plot only
dl.plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);

## Reading the Dataset

Training a machine learning model requires
multiple passes over the dataset to progressively update the parameters. In each pass, we process the data one minibatch of examples at a time.

Since this process is so fundamental
to training machine learning algorithms, we define a utility function, `data_iter`, that shuffles the dataset and accesses it in minibatches.

The `data_iter` function takes a batch size, a matrix of features,
and a vector of labels, and yields minibatches of size `batch_size`.

Each minibatch consists of a tuple of features and labels.


In [None]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # The examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = tf.constant(indices[i:min(i + batch_size, num_examples)])
        yield tf.gather(features, j), tf.gather(labels, j)


Let us read and print the first minibatch of data examples.

The shape of the features in each minibatch shows the minibatch size and the number of input features.

The minibatch of labels contains `batch_size` labels, one per example. Because `labels` is stored as a column, its shape is `(batch_size, 1)`.


In [None]:
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

As we run the iteration, we obtain distinct minibatches
successively until the entire dataset has been exhausted. For the sake of this example, we added the `break` to only print one batch instead of all generated minibatches.

**Note:**

The iteration implemented in the `data_iter` function is good for didactic purposes, but it is inefficient on large datasets. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in deep learning frameworks are considerably more efficient, and they can deal
with both data stored in files and data fed via data streams.
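
As an illustration of such a built-in iterator, here is a minimal sketch based on `tf.data`. The helper name `load_array` and the `demo_*` tensors are assumptions made for this example; they are not defined elsewhere in this notebook.

```python
import tensorflow as tf

def load_array(data_arrays, batch_size, is_train=True):
    """Hypothetical helper: wrap in-memory tensors in a tf.data pipeline."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        # Shuffle with a buffer as large as the dataset for a full shuffle
        dataset = dataset.shuffle(buffer_size=1000)
    return dataset.batch(batch_size)

# Stand-in data with the same shapes as `features` and `labels`
demo_features = tf.random.normal((1000, 2))
demo_labels = tf.random.normal((1000, 1))

demo_iter = load_array((demo_features, demo_labels), batch_size=10)
X_batch, y_batch = next(iter(demo_iter))
num_batches = sum(1 for _ in demo_iter)
print(X_batch.shape, y_batch.shape, num_batches)
```

The pipeline yields the same kind of `(features, labels)` tuples as `data_iter`, so it could be used in its place.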

## Initializing Model Parameters

Before we can begin optimizing our model's parameters (weights and bias) by minibatch stochastic gradient descent, we have to initialize them.

One possible way is to initialize the weights by sampling
random numbers from a normal distribution with mean 0
and a standard deviation of 0.01, and to set the bias to 0.

In [None]:
w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01),
                trainable=True)
b = tf.Variable(tf.zeros(1), trainable=True)

After initializing the model's parameters,
the next task is to update them until they fit our data sufficiently well. Each update requires the gradient
of our loss function with respect to the parameters. Given this gradient at each minibatch, we update each parameter in the direction that may reduce the loss.

Since computing gradients by hand is tedious and error-prone,
we use automatic differentiation, via `tf.GradientTape`, to compute the gradient.
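
To make the idea concrete, here is a minimal sketch of automatic differentiation with `tf.GradientTape`, the same mechanism that the training loop in this notebook uses. The function $y = x^2 + 2x$ and the variable names are chosen only for illustration.

```python
import tensorflow as tf

x = tf.Variable(3.0)

# Operations on `x` inside the context are recorded on the tape
with tf.GradientTape() as tape:
    y = x ** 2 + 2.0 * x

# dy/dx = 2x + 2, evaluated at x = 3
dy_dx = tape.gradient(y, x)
print(float(dy_dx))
```

By hand, the derivative is $2x + 2$, which equals 8 at $x = 3$, matching the value that the tape returns.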

## Defining the Model

By defining our model, we relate its inputs and parameters to its outputs.
To compute the output of the linear model, we simply take the matrix-vector product of the input features $\mathbf{X}$ and the model weights $\mathbf{w}$, and then add the offset $b$ to each example.


In the following code, $\mathbf{Xw}$ is a vector and $b$ is a scalar.

When we add a vector and a scalar, the broadcasting mechanism is applied implicitly: the scalar is added to each component of the vector.

In [None]:
def linreg(X, w, b):  #@save
    """The linear regression model."""
    return tf.matmul(X, w) + b
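
The broadcasting step in `linreg` can be checked on a tiny example. The sketch below uses NumPy, which follows the same broadcasting rule for adding a scalar to an array; the `*_demo` names and values are made up for illustration.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # 3 examples, 2 features
w_demo = np.array([[2.0], [-3.4]])                         # column of weights
b_demo = 4.2                                               # scalar bias

# The scalar bias is broadcast across all rows of the (3, 1) product
y_hat_demo = X_demo @ w_demo + b_demo

# The same computation, written out one example at a time
manual = np.array([[row @ w_demo[:, 0] + b_demo] for row in X_demo])
print(y_hat_demo.shape, np.allclose(y_hat_demo, manual))
```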

## Defining the Loss Function

Since updating our model requires taking the gradient of our loss function, we need to define the loss function first.


Here we use the squared loss function described in the `linear_regression` lesson, which measures the difference between the true value `y` and the predicted value `y_hat`. Note that we need to reshape `y` to the shape of `y_hat` before computing the difference.

The result returned by the following function
will also have the same shape as `y_hat`.


In [None]:
def squared_loss(y_hat, y):  #@save
    """Squared loss."""
    return (y_hat - tf.reshape(y, y_hat.shape))**2 / 2
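
The reshape guards against a silent broadcasting mistake. The sketch below, written in NumPy with made-up values, shows what could happen if the labels arrived as a flat vector while the predictions were a column.

```python
import numpy as np

y_hat_demo = np.array([[1.0], [2.0], [3.0]])  # predictions, shape (3, 1)
y_flat = np.array([1.0, 2.0, 3.0])            # labels as a flat vector, shape (3,)

# Without the reshape, broadcasting pairs every prediction with every label
wrong = (y_hat_demo - y_flat) ** 2 / 2

# With the reshape, each prediction is compared only with its own label
right = (y_hat_demo - y_flat.reshape(y_hat_demo.shape)) ** 2 / 2
print(wrong.shape, right.shape)
```

In this notebook, `synthetic_data` already returns the labels as a column, so the reshape inside `squared_loss` acts as a safeguard rather than changing anything.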

## Defining the Optimization Algorithm

As we discussed in the `linear_regression` lesson, linear regression has a closed-form solution and can be solved analytically.
However, none of the other machine learning models that we study in this course
can be solved analytically, so we take this opportunity to introduce your first working example of minibatch stochastic gradient descent. This approach works for any model whose loss we can differentiate with respect to its parameters.



At each step:
- we use one minibatch randomly drawn from our dataset,
- we estimate the gradient of the loss with respect to our parameters,
- we update our parameters in the direction that may reduce the loss.
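
For comparison with the iterative procedure above, the closed-form solution mentioned at the start of this section can be sketched with a least-squares solver. This is a NumPy illustration on freshly generated data with the same true parameters; the `*_np` names are assumptions and the sketch is not part of the training pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w_np = np.array([2.0, -3.4])
true_b_np = 4.2

# Synthetic data: y = Xw + b + noise, with noise standard deviation 0.01
X_np = rng.normal(size=(1000, 2))
y_np = X_np @ true_w_np + true_b_np + rng.normal(scale=0.01, size=1000)

# Append a column of ones so that the bias is learned as an extra weight
X_aug = np.hstack([X_np, np.ones((1000, 1))])

# Solve the least-squares problem min ||X_aug @ theta - y||^2
theta, *_ = np.linalg.lstsq(X_aug, y_np, rcond=None)
w_hat, b_hat = theta[:2], theta[2]
print(w_hat, b_hat)
```

The estimates should land very close to the true parameters, which gives a useful reference point for what the gradient-based training below should approach.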

In [None]:
def sgd(params, grads, lr, batch_size):  #@save
    """Minibatch stochastic gradient descent."""
    for param, grad in zip(params, grads):
        param.assign_sub(lr * grad / batch_size)

Note that this code applies the minibatch stochastic gradient descent update,
given a set of parameters, a learning rate, and a batch size.
The size of the update step is determined by the learning rate `lr`.

Because our loss is calculated as a sum over the minibatch of examples,
we normalize our step size by the batch size (`batch_size`),
so that the magnitude of a typical step size
does not depend heavily on our choice of the batch size.
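
The effect of this normalization can be seen in a small NumPy sketch. The helper `summed_grad_w` and the data are assumptions for illustration: it computes the gradient of the summed squared loss with respect to the weights, starting from zero parameters.

```python
import numpy as np

def summed_grad_w(X, y, w, b):
    """Gradient w.r.t. w of sum_i (x_i . w + b - y_i)^2 / 2 over the batch."""
    residual = X @ w + b - y
    return X.T @ residual

rng = np.random.default_rng(0)
X_all = rng.normal(size=(1000, 2))
y_all = X_all @ np.array([2.0, -3.4]) + 4.2
w0, b0 = np.zeros(2), 0.0

sum_norms, mean_norms = [], []
for n in (10, 100, 1000):
    g_sum = summed_grad_w(X_all[:n], y_all[:n], w0, b0)
    sum_norms.append(np.linalg.norm(g_sum))
    mean_norms.append(np.linalg.norm(g_sum / n))
    print(f'batch size {n:4d}: |sum grad| = {sum_norms[-1]:9.2f}, '
          f'|sum grad| / n = {mean_norms[-1]:.2f}')
```

The norm of the summed gradient grows roughly in proportion to the batch size, while the normalized gradient stays on a similar scale, so the same learning rate remains reasonable across batch sizes.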

## Training

To train our model, we implement the main training loop.

It is important that you understand this snippet of code.
You will see nearly identical training loops
over and over again in all your deep learning projects.

1. In each iteration, we read a minibatch of training examples
and pass them through our model to obtain a set of predictions.

2. After applying the loss function to calculate the loss, we initiate the backward pass through the network, computing the gradient with respect to each parameter.

3. Finally, we invoke the optimization algorithm `sgd` to update the model parameters.

In summary, we will execute the following loop:

* Initialize parameters $(\mathbf{w}, b)$
* Repeat until done
    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
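
To see exactly what the gradient $\mathbf{g}$ in this loop contains, the sketch below writes the same procedure in NumPy with the gradient of the squared loss derived by hand instead of obtained by automatic differentiation. All names (`X_np`, `w_np`, `eta`, and so on) are assumptions for this illustration; the TensorFlow version used in this notebook follows later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)
X_np = rng.normal(size=(1000, 2))
y_np = X_np @ np.array([2.0, -3.4]) + 4.2 + rng.normal(scale=0.01, size=1000)

# Initialize parameters (w, b)
w_np = rng.normal(scale=0.01, size=2)
b_np = 0.0
eta, batch = 0.03, 10

for epoch in range(3):
    order = rng.permutation(len(X_np))
    for start in range(0, len(X_np), batch):
        idx = order[start:start + batch]
        # For l = (x . w + b - y)^2 / 2, the gradient is residual * x (and residual for b)
        residual = X_np[idx] @ w_np + b_np - y_np[idx]
        g_w = X_np[idx].T @ residual / len(idx)
        g_b = residual.mean()
        # Update parameters: (w, b) <- (w, b) - eta * g
        w_np -= eta * g_w
        b_np -= eta * g_b
    epoch_loss = np.mean((X_np @ w_np + b_np - y_np) ** 2 / 2)
    print(f'epoch {epoch + 1}, loss {epoch_loss:f}')
```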

In each *epoch*,
we iterate through the entire dataset (using the `data_iter` function) once,
passing through every example in the training dataset
(assuming that the number of examples is divisible by the batch size).
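
The divisibility assumption can be examined with a small pure-Python sketch that mirrors the index slicing in `data_iter`. The helper `minibatch_indices` is hypothetical and yields only the indices, not the data.

```python
import random

def minibatch_indices(num_examples, batch_size, seed=0):
    """Yield shuffled index lists, mirroring the slicing used in `data_iter`."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for i in range(0, num_examples, batch_size):
        yield indices[i:min(i + batch_size, num_examples)]

# 1000 examples split evenly into batches of 10
sizes_even = [len(b) for b in minibatch_indices(1000, 10)]

# 1000 examples are not divisible by 32, so the last batch is smaller
ragged = list(minibatch_indices(1000, 32))
sizes_ragged = [len(b) for b in ragged]
seen = sorted(i for b in ragged for i in b)

print(len(sizes_even), len(sizes_ragged), sizes_ragged[-1])
```

Every example is still visited once per epoch, but the final minibatch is smaller. Since `sgd` divides the summed gradient by the nominal `batch_size`, the update from that last minibatch is scaled down accordingly.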

**Hyperparameters:**
- The number of epochs `num_epochs` and the learning rate `lr` are both hyperparameters, which we set here to 3 and 0.03, respectively. Try changing their values and rerunning the training loop. What do you observe? Is it possible to improve the model's performance?

- You may conclude that setting hyperparameters is tricky
and requires some adjustment by trial and error.
We skip these details for now but revisit them
later in the optimization lesson.

In [None]:
lr = 0.03
num_epochs = 3
net = linreg
loss = squared_loss

In [None]:
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with tf.GradientTape() as g:
            l = loss(net(X, w, b), y)  # Minibatch loss in `X` and `y`
        # Compute gradient on l with respect to [`w`, `b`]
        dw, db = g.gradient(l, [w, b])
        # Update parameters using their gradient
        sgd([w, b], [dw, db], lr, batch_size)
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')

In this case study, we synthesized the dataset ourselves, and thus we know in advance what the true parameters are.

Consequently, we can evaluate our success in training
by comparing the true parameters
with the parameters that are learned through our training loop.


It turns out that they are very close to each other!


In [None]:
print(f'error in estimating w: {true_w - tf.reshape(w, true_w.shape)}')
print(f'error in estimating b: {true_b - b}')

Note that we should not take it for granted
that we are able to recover the parameters perfectly.


In machine learning, we are typically less concerned
with recovering true underlying parameters,
and more concerned with parameters that make our model generalize to unseen examples.

Even on difficult optimization problems, stochastic gradient descent often finds good solutions. One reason is that, for deep networks, many configurations of the parameters lead to highly accurate predictions.


## Summary

* Deep networks can be implemented and optimized from scratch, using just tensors and automatic differentiation, without any need for defining layers or fancy optimizers.

* This case-study only scratches the surface of what is possible with deep neural networks. In the following sections, we will describe additional models based on the concepts that we have just introduced and implement them more concisely with high-level TensorFlow API/libraries.





## Exercises

1. What would happen if we were to initialize the weights to zero? Would the algorithm still work?
1. Assume that you are
   [Georg Simon Ohm](https://en.wikipedia.org/wiki/Georg_Ohm) trying to come up
   with a model between voltage and current. Can you use automatic differentiation to learn the parameters of your model?
1. Can you use [Planck's Law](https://en.wikipedia.org/wiki/Planck%27s_law) to determine the temperature of an object using spectral energy density?
1. What are the problems you might encounter if you wanted to compute the second derivatives? How would you fix them?
1. Why is the `reshape` function needed in the `squared_loss` function?
1. Experiment using different learning rates to find out how fast the loss function value drops.
1. If the number of examples cannot be divided by the batch size, what happens to the `data_iter` function's behavior?

In this notebook, we implement the entire linear regression from scratch. The implementation covers the data preprocessing, the model building, the loss function, and the minibatch stochastic gradient descent optimizer.

We only use tensors and auto differentiation to implement the regression from scratch 
to make sure that you understand how it works.

We will provide an alternative and more concise implementation using the TensorFlow Keras library.