# Linear regression model

This notebook is organised in two parts:

* the *first part* implements a **linear regression model from scratch**, including the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer.

* the *second part* makes use of **TensorFlow's high-level APIs** (`data`, `keras`, `initializers`, etc.) for a concise implementation of a linear regression model.


## Note for ST456

The following command is necessary for downloading some helper functions in TensorFlow used by the reference book.

If you get a message saying **you need to restart the runtime**, please **do so** before running the rest of the code.

In [None]:
!mkdir -p /content/d2l
!wget https://raw.githubusercontent.com/d2l-ai/d2l-en/master/d2l/tensorflow.py
!mv tensorflow.py /content/d2l

#!pip install d2l==0.17.1 2>/dev/null

In [None]:
# importing necessary modules
%matplotlib inline
import random
import tensorflow as tf
from d2l import tensorflow as d2l  # helper module from the textbook


## Generating the dataset

To keep things simple, we will **construct an artificial dataset
according to a linear model with additive noise.**
Our task will be to recover this model's parameters
using the finite set of examples contained in our dataset.
We will keep the data low-dimensional so we can visualize it easily.

In the following code snippet, we generate a dataset
containing 1000 examples, each consisting of 2 features
sampled from a standard normal distribution.
Thus our synthetic dataset will be a matrix
$\mathbf{X}\in \mathbb{R}^{1000 \times 2}$.

**The true parameters generating our dataset will be
$\mathbf{w} = [2, -3.4]^\top$ and $b = 4.2$**,
and our synthetic labels will be assigned according
to the following linear model with the noise term $\epsilon$:

**$$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \mathbf\epsilon.$$**

You could think of $\epsilon$ as capturing potential
measurement errors on the features and labels.
We will assume that the standard assumptions hold and thus
that $\epsilon$ obeys a normal distribution with mean of 0.
To make our problem easy, we will set its standard deviation to 0.01.

The following code generates our synthetic dataset.


In [None]:
def synthetic_data(w, b, num_examples):  
    """Generate y = Xw + b + noise."""
    # sample the features directly from a standard normal distribution
    X = tf.random.normal(shape=(num_examples, w.shape[0]))
    y = tf.matmul(X, tf.reshape(w, (-1, 1))) + b
    y += tf.random.normal(shape=y.shape, stddev=0.01)
    y = tf.reshape(y, (-1, 1))
    return X, y

In [None]:
true_w = tf.constant([2, -3.4])
true_b = 4.2
features, labels = synthetic_data(true_w, true_b, 1000)

Note that **each row in `features` consists of a 2-dimensional data example
and that each row in `labels` consists of a 1-dimensional label value (a scalar).**


In [None]:
print('features:', features[0], '\nlabel:', labels[0])

By generating a scatter plot using the second feature `features[:, 1]` and `labels`,
we can clearly observe the linear correlation between the two.


In [None]:
d2l.set_figsize()
# the trailing semicolon suppresses the printed return value, so only the plot is shown
d2l.plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);

## First approach: linear regression model from scratch


`[based on the D2L reference book]`

While modern deep learning frameworks can automate nearly all of this work, implementing things from scratch is the only way
to make sure that you really know what you are doing.
Moreover, when it comes time to customize models,
defining our own layers or loss functions, understanding how things work under the hood will prove handy.

In this section, we will rely only on **(i) tensors for data storage and linear algebra,
and (ii) auto differentiation for calculating gradients**. Afterwards, in the second part, we will introduce a more concise implementation, taking advantage of bells and whistles of deep learning frameworks.


### Reading the dataset

Recall that training models consists of
making multiple passes over the dataset,
grabbing one minibatch of examples at a time,
and using them to update our model.
Since this process is so fundamental
to training machine learning algorithms,
it is worth defining a **utility function
to shuffle the dataset and access it in minibatches**.

In the following code, we **define the `data_iter` function** to demonstrate one possible implementation of this functionality. The function **takes a batch size, a matrix of features,
and a vector of labels, yielding minibatches of the size `batch_size`.** Each minibatch consists of a tuple of features and labels.


In [None]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # the examples are read at random, in no particular order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = tf.constant(indices[i:min(i + batch_size, num_examples)])
        yield tf.gather(features, j), tf.gather(labels, j)

In general, note that we want to use reasonably sized minibatches
to take advantage of the GPU hardware,
which excels at parallelizing operations.
Because each example can be fed through our models in parallel
and the gradient of the loss function for each example can also be taken in parallel,
GPUs allow us to process hundreds of examples in scarcely more time
than it might take to process just a single example.

To build some intuition, let us read and print
the first small batch of data examples.
The shape of the features in each minibatch tells us
both the minibatch size and the number of input features.
Likewise, our minibatch of labels will have shape `(batch_size, 1)`: one label per example.


In [None]:
# minibatch size
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break

As we run the iteration, we obtain distinct minibatches
successively until the entire dataset has been exhausted (try this).
While the iteration implemented above is good for didactic purposes,
it is inefficient in ways that might get us in trouble on real problems.
For example, it requires that we load all the data in memory
and that we perform lots of random memory access.
The built-in iterators implemented in a deep learning framework
are considerably more efficient and they can deal
with both data stored in files and data fed via data streams.
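
As a quick check of the claim that one pass visits every example exactly once, the following plain-Python sketch mirrors the shuffling and slicing logic of `data_iter` on bare indices. The helper name `batch_indices`, the fixed seed, and the toy sizes (25 examples, batches of 10) are assumptions made for illustration; they are not part of the notebook's code.

```python
import random


def batch_indices(num_examples, batch_size, seed=0):
    """Yield shuffled lists of indices, one list per minibatch."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # slicing past the end is safe, so the last batch may be shorter
        yield indices[i:i + batch_size]


batches = list(batch_indices(num_examples=25, batch_size=10))
batch_sizes = [len(batch) for batch in batches]
seen = sorted(index for batch in batches for index in batch)

print(batch_sizes)              # [10, 10, 5]
print(seen == list(range(25)))  # True
```

Because 25 is not divisible by 10, the last minibatch holds only 5 indices, which is the case that the `min(i + batch_size, num_examples)` bound in `data_iter` guards against.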

---

### Initializing model parameters

**Before we can begin optimizing our model's parameters** by minibatch stochastic gradient descent,
**we need to have some parameters in the first place.**

In the following code, we initialize the weights by sampling
random numbers from a normal distribution with mean 0
and a standard deviation of 0.01, and we set the bias to 0. Passing `trainable=True` marks both as trainable variables, so a gradient tape tracks them automatically and we can update them during training.


In [None]:
w = tf.Variable(tf.random.normal(shape=(2, 1), mean=0, stddev=0.01), trainable=True)
b = tf.Variable(tf.zeros(1), trainable=True)

After initializing our parameters,
our next task is to update them until
they fit our data sufficiently well.
Each update requires taking the gradient
of our loss function with respect to the parameters.
Given this gradient, we can update each parameter
in the direction that may reduce the loss.

Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use [automatic differentiation](https://www.tensorflow.org/guide/autodiff).

---

### Defining the model

Next, we must **define our model,
relating its inputs and parameters to its outputs.**

Recall that to calculate the output of the linear model,
we simply take the matrix-vector dot product
of the input features $\mathbf{X}$ and the model weights $\mathbf{w}$,
and add the offset $b$ to each example.

Note that below $\mathbf{Xw}$  is a vector and $b$ is a scalar, and recall that when we add a vector and a scalar,
the scalar is added to each component of the vector (through the [broadcasting](https://www.tensorflow.org/xla/broadcasting) mechanism).
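
The following sketch illustrates the broadcasting rule with NumPy, whose convention TensorFlow follows for this case. The array values are made up for illustration, and `Xw_demo` and `bias_demo` are hypothetical names that merely stand in for the product and the offset computed inside the model.

```python
import numpy as np

# stand-in for the (3, 1) product of features and weights
Xw_demo = np.array([[1.0], [2.0], [3.0]])
# an offset of shape (1,), like `b` in this notebook
bias_demo = np.array([4.2])

# the offset is broadcast along the first axis, so it is added to every row
out = Xw_demo + bias_demo
print(out.shape)
print(out)
```

The result keeps the shape `(3, 1)` of the product, with the same offset added to each entry.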



In [None]:
def linreg(X, w, b):  
    """The linear regression model."""
    return tf.matmul(X, w) + b

### Defining the loss function

Since **updating our model requires taking
the gradient of our loss function**,
we ought to **define the loss function first**.

Here we will use the *squared loss function*.
In the implementation, we need to reshape the true values `y`
to the shape of the predictions `y_hat`.
The result returned by the following function
will also have the same shape as `y_hat`.
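
For reference, the per-example squared loss implemented below can be written as follows, using the notation of this notebook and writing $\hat{y}^{(i)}$ for the prediction on example $i$:

$$l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2, \qquad \hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b.$$

For instance, a prediction of 3 for a true label of 1 gives a loss of $\frac{1}{2}(3 - 1)^2 = 2$.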


In [None]:
def squared_loss(y_hat, y):  
    """Squared loss."""
    return (y_hat - tf.reshape(y, y_hat.shape))**2 / 2

### Defining the optimization algorithm

Although linear regression has a closed-form (analytic) solution, most other deep learning models do not. We will therefore use the **minibatch stochastic gradient descent** algorithm.
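
To make the closed-form remark concrete, here is a hedged sketch that solves the least-squares problem directly with NumPy's `np.linalg.lstsq`, on freshly generated data that follows the same recipe as `synthetic_data`. The NumPy re-implementation of the data generation, the seed, and the names `X_np`, `y_np`, `X_aug`, `theta`, `w_hat` and `b_hat` are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true_np, b_true_np = np.array([2.0, -3.4]), 4.2

# 1000 examples with 2 standard-normal features and small Gaussian noise
X_np = rng.normal(size=(1000, 2))
y_np = X_np @ w_true_np + b_true_np + rng.normal(scale=0.01, size=1000)

# append a column of ones so that the bias is estimated as an extra weight
X_aug = np.hstack([X_np, np.ones((len(X_np), 1))])
theta, *_ = np.linalg.lstsq(X_aug, y_np, rcond=None)
w_hat, b_hat = theta[:2], theta[2]

print('estimated w:', w_hat)
print('estimated b:', b_hat)
```

The estimates should land very close to the generating parameters, which gives a useful reference point for judging the gradient-based solution.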

At each step, using one minibatch randomly drawn from our dataset,
we will estimate the gradient of the loss with respect to our parameters.
Next, we will update our parameters
in the direction that may reduce the loss.

The following code applies the minibatch stochastic gradient descent update,
given a list of parameters (`params`), their gradients (`grads`), a learning rate (`lr`), and a batch size (`batch_size`).
The size of the update step is determined by the learning rate `lr`.
Because our loss is calculated as a sum over the minibatch of examples,
we normalize our step size by the batch size (`batch_size`),
so that the magnitude of a typical step size
does not depend heavily on our choice of the batch size.
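
The effect of this normalization can be seen in a small plain-Python sketch. It relies on the fact that the gradient of a sum of per-example losses is the sum of the per-example gradients, so dividing by the batch size yields their average. (When `tf.GradientTape.gradient` is given a non-scalar target, such as a vector of per-example losses, it differentiates the sum of its elements.) The function `sgd_step` and all numbers are hypothetical and chosen only for illustration.

```python
def sgd_step(param, summed_grad, step_lr, num_in_batch):
    """One update of a scalar parameter, mirroring the rule used in `sgd`."""
    return param - step_lr * summed_grad / num_in_batch


step_lr, param0 = 0.03, 1.0
small_batch = [0.5] * 10    # ten per-example gradients
large_batch = [0.5] * 100   # one hundred per-example gradients

# size of the step actually taken for each batch size
step_small = param0 - sgd_step(param0, sum(small_batch), step_lr, len(small_batch))
step_large = param0 - sgd_step(param0, sum(large_batch), step_lr, len(large_batch))
# step size if the summed gradient were used without normalization
unnormalized = step_lr * sum(large_batch)

print(step_small, step_large, unnormalized)
```

With normalization both batches move the parameter by about 0.015, whereas the unnormalized step for the larger batch would be 1.5, a hundred times larger.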


In [None]:
def sgd(params, grads, lr, batch_size):  
    """Minibatch stochastic gradient descent."""
    for param, grad in zip(params, grads):
        param.assign_sub(lr * grad / batch_size)

### Training

Now that we have all of the parts in place,
we are ready to **implement the main training loop.**

It is crucial that you understand this code
because you will see nearly identical training loops
over and over again throughout most deep learning courses and materials.

In each iteration, we will grab a minibatch of training examples,
and pass them through our model to obtain a set of predictions.
After calculating the loss, we initiate the backwards pass through the network,
storing the gradients with respect to each parameter.
Finally, we will call the optimization algorithm `sgd`
to update the model parameters.

In summary, we will execute the following loop:

* Initialize parameters $(\mathbf{w}, b)$
* Repeat until done
    * Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
    * Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
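
For the squared loss, the gradient in the loop above can also be derived by hand, which gives a useful sanity check on what automatic differentiation computes for us. The plain-Python sketch below evaluates that hand-derived gradient on a made-up minibatch of three examples, compares one component against a central finite difference, and applies a single update. All names (`minibatch_loss`, `analytic_grad`, `X_demo`, `w0`, `eta`, and so on) and all numbers are assumptions for illustration.

```python
def predict(w, b, x):
    """Linear model output for a single example."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def minibatch_loss(w, b, X, y):
    """Mean of the per-example squared losses 0.5 * (y_hat - y)^2."""
    return sum(0.5 * (predict(w, b, x_i) - y_i) ** 2
               for x_i, y_i in zip(X, y)) / len(X)


def analytic_grad(w, b, X, y):
    """Hand-derived gradient: mean of err * x for w, mean of err for b."""
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x_i, y_i in zip(X, y):
        err = predict(w, b, x_i) - y_i
        grad_w = [g + err * x_j / len(X) for g, x_j in zip(grad_w, x_i)]
        grad_b += err / len(X)
    return grad_w, grad_b


X_demo = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
y_demo = [0.2, 1.0, -0.7]
w0, b0, eta = [0.1, -0.2], 0.0, 0.03

grad_w, grad_b = analytic_grad(w0, b0, X_demo, y_demo)

# central finite difference with respect to the first weight
eps = 1e-6
numeric = (minibatch_loss([w0[0] + eps, w0[1]], b0, X_demo, y_demo)
           - minibatch_loss([w0[0] - eps, w0[1]], b0, X_demo, y_demo)) / (2 * eps)

# one gradient descent update with learning rate eta
w1 = [w_j - eta * g_j for w_j, g_j in zip(w0, grad_w)]
b1 = b0 - eta * grad_b

loss_before = minibatch_loss(w0, b0, X_demo, y_demo)
loss_after = minibatch_loss(w1, b1, X_demo, y_demo)
print(grad_w[0], numeric)
print(loss_before, loss_after)
```

The two gradient values should agree up to rounding, and the loss after the update should be smaller than before it.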

In each *epoch*,
we will iterate through the entire dataset
(using the `data_iter` function) once
passing through every example in the training dataset
(assuming that the number of examples is divisible by the batch size).

The number of epochs `num_epochs` and the learning rate `lr` are both hyperparameters,
which we set here to 3 and 0.03, respectively.
Unfortunately, setting hyperparameters is tricky
and requires some adjustment by trial and error.


In [None]:
lr = 0.03      # learning rate
num_epochs = 3 # number of epochs (full passes over the dataset)
net = linreg   # model
loss = squared_loss # loss function

In [None]:
# main training loop
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        # record only the forward pass on the tape
        with tf.GradientTape() as g:
            # minibatch loss in X and y
            l = loss(net(X, w, b), y)
        # compute the gradient of l with respect to [w, b]
        dw, db = g.gradient(l, [w, b])
        # update the parameters using their gradients
        sgd([w, b], [dw, db], lr, batch_size)
    train_l = loss(net(features, w, b), labels)
    print(f'epoch {epoch + 1}, loss {float(tf.reduce_mean(train_l)):f}')

In this case, because we synthesized the dataset ourselves,
we know precisely what the true parameters are.
Thus, we can **evaluate our success in training
by comparing the true parameters
with those that we learned** through our training loop.
Indeed they turn out to be very close to each other.


In [None]:
print(f'Error in estimating w: {true_w - tf.reshape(w, true_w.shape)}')
print(f'Error in estimating b: {true_b - b}')

Note that we should not take it for granted
that we are able to recover the parameters perfectly.
However, in machine learning, we are typically less concerned
with recovering true underlying parameters,
and more concerned with parameters that lead to highly accurate prediction.
Fortunately, even on difficult optimization problems,
stochastic gradient descent can often find remarkably good solutions,
owing partly to the fact that, for deep networks,
there exist many configurations of the parameters
that lead to highly accurate prediction.

---

### Summary of the first approach

* We saw how a deep network can be implemented and optimized from scratch, using just tensors and auto differentiation, without any need for defining layers or fancy optimizers.
* This section only scratches the surface of what is possible. In the following sections, we will describe additional models based on the concepts that we have just introduced and learn how to implement them more concisely.


---

## Second approach: linear regression model using high-level APIs

`[from the D2L reference book]`

Broad and intense interest in deep learning for the past several years
has inspired companies, academics, and hobbyists
to develop a variety of mature open source frameworks
for automating the repetitive work of implementing
gradient-based learning algorithms.

In the first approach, we relied only on
tensors and auto differentiation for calculating gradients. In practice, because data iterators, loss functions, optimizers, and neural network layers are so common, modern libraries implement these components for us as well.

In this section, we will show you how to implement
the linear regression model concisely by using **high-level APIs from TensorFlow**.

### Reading the dataset

Rather than rolling our own iterator,
we can call upon the existing [`data` API](https://www.tensorflow.org/guide/data) to read data.

We pass in `features` and `labels` as arguments and specify `batch_size`
when instantiating a data iterator object.
In addition, the boolean argument `is_train`
indicates whether or not
we want the data iterator object to shuffle the data
on each epoch (pass through the dataset).

In [None]:
# loading the dataset through the 'data' high-level API
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset

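Now we can use `load_array` to build `data_iter`, which we then use in much the same way as we called the `data_iter` function in the first part. Note that the assignment below rebinds the name `data_iter`: from here on it refers to a TensorFlow dataset object rather than to our generator function.

To verify that it is working, we can read and print
the first minibatch of examples. In contrast to our previous approach, here we use `iter` to construct a Python iterator and `next` to obtain the first item from it.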

In [None]:
batch_size = 10
data_iter = load_array((features, labels), batch_size)

In [None]:
next(iter(data_iter))

### Defining the model

When we implemented linear regression from scratch,
we defined our model parameters explicitly
and coded up the calculations to produce output
using basic linear algebra operations. But once your models get more complex,
and once you have to do this nearly every day,
you will be glad for the assistance.

For standard operations, we can use predefined layers from the [Keras Sequential model](https://keras.io/guides/sequential_model/),
which allow us to focus especially
on the layers used to construct the model
rather than having to focus on the implementation.

We will first define a model variable `net`,
which will refer to an instance of the `Sequential` class, which defines a container
for several layers that will be chained together.
Given input data, a `Sequential` instance passes it through
the first layer, in turn passing the output
as the second layer's input and so forth.
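
The chaining can be pictured with a toy container written in plain Python. This is only a sketch of the idea: the class `TinySequential` is hypothetical and is not how Keras implements `Sequential`, and the two lambda functions are placeholders for real layers.

```python
class TinySequential:
    """Toy container that applies its layers one after another."""

    def __init__(self, layers=None):
        self.layers = list(layers or [])

    def add(self, layer):
        self.layers.append(layer)

    def __call__(self, x):
        # the output of each layer becomes the input of the next one
        for layer in self.layers:
            x = layer(x)
        return x


toy_net = TinySequential()
toy_net.add(lambda x: 2 * x)   # placeholder "first layer"
toy_net.add(lambda x: x + 1)   # placeholder "second layer"

result = toy_net(3)
print(result)  # 7
```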

In the following example, our model consists of only one layer,
so we do not really need `Sequential`.
But since nearly all of our future models
will involve multiple layers,
we will use it anyway just to familiarize you
with the most standard workflow.

Recall the architecture of a single-layer network.
The layer is said to be *fully-connected*
because each of its inputs is connected to each of its outputs by means of a matrix-vector multiplication.

In [None]:
# To see this figure, download it from the GitHub repo 'fig' folder, upload it into Colab, and then uncomment the following lines.
#from IPython.display import Image
#Image(filename='./w01_singleneuron.png', width='500') 


In Keras, the fully-connected layer is defined in the `Dense` class. Since we only want to generate a single scalar output, we set its number of units to 1.

It is worth noting that, for convenience,
Keras does not require us to specify
the input shape for each layer.
So here, we do not need to tell Keras
how many inputs go into this linear layer.
When we first try to pass data through our model,
e.g., when we execute `net(X)` later,
Keras will automatically infer the number of inputs to each layer.
We will describe how this works in more detail later.


In [None]:
# `keras` is the high-level API for TensorFlow
net = tf.keras.Sequential()
net.add(tf.keras.layers.Dense(1))

### Initializing model parameters

Before using `net`, we need to **initialize the model parameters**,
such as the weights and bias in the linear regression model.

Deep learning frameworks have a predefined way to initialize the parameters. Here we specify that each weight parameter
should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01. The bias parameter will be initialized to zero.

The [initializers module in TensorFlow](https://www.tensorflow.org/api_docs/python/tf/keras/initializers) provides various methods for model parameter initialization. The easiest way to specify the initialization method in Keras is when creating the layer by specifying `kernel_initializer`. Here we recreate `net`, adding the initializer.

In [None]:
initializer = tf.initializers.RandomNormal(stddev=0.01)
# model definition
net = tf.keras.Sequential()
# model's first layer
net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))

There is an important aspect regarding the above code: we are initializing parameters for a network
even though Keras does not yet know
how many dimensions the input will have!
It might be 2 as in our example or it might be 2000.

Keras lets us get away with this because, behind the scenes,
the initialization is actually *deferred*.
The real initialization will take place only
when we, for the first time, attempt to pass data through the network.
Just be careful to remember that since the parameters
have not been initialized yet,
we cannot access or manipulate them.
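
The idea of deferred initialization can be illustrated with a toy layer in plain Python. This is a sketch under stated assumptions, not Keras code: the class `LazyDense`, its attributes, and the seed are hypothetical, and the real mechanism inside Keras is more involved.

```python
import random


class LazyDense:
    """Toy layer that creates its weights on the first call."""

    def __init__(self, units, stddev=0.01, seed=0):
        self.units = units
        self.stddev = stddev
        self.rng = random.Random(seed)
        self.kernel = None  # unknown until the input width has been seen
        self.bias = None

    def __call__(self, x):
        if self.kernel is None:
            # the first input reveals how many weights are needed
            self.kernel = [[self.rng.gauss(0.0, self.stddev)
                            for _ in range(self.units)]
                           for _ in range(len(x))]
            self.bias = [0.0] * self.units
        return [sum(x_i * row[j] for x_i, row in zip(x, self.kernel)) + self.bias[j]
                for j in range(self.units)]


layer = LazyDense(units=1)
kernel_before = layer.kernel
print(kernel_before)           # None: there is nothing to inspect yet

output = layer([1.0, 2.0])     # the first call fixes the input width at 2
print(len(layer.kernel), len(layer.kernel[0]))  # 2 1
```

The same layer object would have created a kernel with 2000 rows had the first input contained 2000 features, which is why the shape cannot be known at construction time.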

### Defining the loss function

The `MeanSquaredError` class computes the mean squared error, also known as squared $L_2$ norm. By default it returns the average loss over examples.

Check other options in the [tf.keras.losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses) module.
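
When comparing loss values between the two parts of this notebook, keep in mind that `squared_loss` from the first part includes a factor of $\frac{1}{2}$, whereas the mean squared error is the plain average of the squared differences. The plain-Python sketch below reproduces both definitions on made-up numbers; it does not call Keras, and the variable names are chosen for illustration only.

```python
y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.5, 2.0, 2.0]

# per-example losses as in `squared_loss`, then averaged
half_squared = [(p - t) ** 2 / 2 for p, t in zip(y_pred_demo, y_true_demo)]
mean_half_squared = sum(half_squared) / len(half_squared)

# mean squared error: average of squared differences, no factor 1/2
mse = sum((p - t) ** 2 for p, t in zip(y_pred_demo, y_true_demo)) / len(y_true_demo)

print(mean_half_squared, mse)
```

On the same predictions, the mean squared error is therefore twice the averaged `squared_loss`, so the losses printed by the two training loops are not directly comparable.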

In [None]:
loss = tf.keras.losses.MeanSquaredError()

### Defining the optimization algorithm

Minibatch stochastic gradient descent is a standard tool
for optimizing neural networks
and thus Keras supports it alongside a number of
variations on this algorithm in the [optimizers module](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers).
Minibatch stochastic gradient descent just requires that
we set the value `learning_rate`, which is set to 0.03 here.

In [None]:
trainer = tf.keras.optimizers.SGD(learning_rate=0.03)

### Training the model

You might have noticed that expressing our model through
high-level APIs of a deep learning framework
requires comparatively few lines of code.
We did not have to individually allocate parameters,
define our loss function, or implement minibatch stochastic gradient descent.
Once we start working with much more complex models,
advantages of high-level APIs will grow considerably.

However, once we have all the basic pieces in place,
**the training loop itself is strikingly similar
to what we did when implementing everything from scratch**: for some number of epochs,
we will make a complete pass over the dataset (`data_iter`),
iteratively grabbing one minibatch of inputs
and the corresponding ground-truth labels.
For each minibatch, we go through the following steps:

* Generate predictions by calling `net(X)` and calculate the loss `l` (the forward propagation).
* Calculate gradients by running the backpropagation.
* Update the model parameters by invoking our optimizer.

For good measure, we compute the loss after each epoch and print it to monitor progress.


In [None]:
# number of epochs (full passes over the dataset)
num_epochs = 3
# iterate over epochs
for epoch in range(num_epochs):
    for X, y in data_iter:
        with tf.GradientTape() as tape:
            # generate predictions and calculate the loss (forward propagation);
            # Keras losses take their arguments in the order (y_true, y_pred)
            l = loss(y, net(X, training=True))
        # calculate gradients (backpropagation)
        grads = tape.gradient(l, net.trainable_variables)
        # update model parameters
        trainer.apply_gradients(zip(grads, net.trainable_variables))
    # loss on the full dataset after this epoch
    l = loss(labels, net(features))
    print(f'epoch {epoch + 1}, loss {l:f}')

Below, we **compare the model parameters learned by training on finite data
and the actual parameters** that generated our dataset.
To access the parameters,
we call `net.get_weights()`, which returns the weights and the bias
of the model's single `Dense` layer as NumPy arrays.
As in our from-scratch implementation,
note that our estimated parameters are
close to their ground-truth counterparts.


In [None]:
w = net.get_weights()[0]
print('Error in estimating w:', true_w - tf.reshape(w, true_w.shape))
b = net.get_weights()[1]
print('Error in estimating b:', true_b - b)

### Summary of the second approach

* Using TensorFlow's high-level APIs, we can implement models much more concisely.
* In TensorFlow, the `data` module provides tools for data processing, and the `keras` module defines a large number of neural network layers and common loss functions.
* TensorFlow's module `initializers` provides various methods for model parameter initialization.
* Dimensionality and storage are automatically inferred (but be careful not to attempt to access parameters before they have been initialized).
