# <font color = 'pickle'>**Lecture Goal**
In this lecture, we will understand PyTorch's `nn.Module`. All the modules in PyTorch are implemented as subclasses of the `torch.nn.Module` class. PyTorch uses these modules to perform operations on tensors. We will first understand some important modules and then use them to implement linear regression.

We will first discuss the following modules:

- nn.Linear()
- nn.Sequential()
- nn.init
- nn.MSELoss()
- torch.optim
- torch.utils.data.DataLoader
- torch.utils.data.TensorDataset

We will then use these modules to refactor linear regression (HW1).

# <font color = 'pickle'>**Install Libraries**

In [None]:
# Install the torchsummary library (only needed on Google Colab)
if 'google.colab' in str(get_ipython()):
    !pip install torchsummary -qq

# <font color = 'pickle'>**Import Libraries**

In [None]:
# Importing PyTorch Library
import torch
import torch.nn as nn
from torch.utils import data
import torch.nn.functional as F
import torchsummary

# Importing random library to generate random dataset
import random
import math

# <font color = 'pickle'>**nn.Module**

`nn.Module` is the base class for all neural network modules in PyTorch. Your models should also subclass this class. Doing so gives us a class that holds our weights and biases. `nn.Module` has a number of attributes and methods (such as `.parameters()` and `.zero_grad()`), which we will be using.

In [None]:
class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weights = nn.Parameter(torch.randn(self.output_dim, self.input_dim) / math.sqrt(2))
        self.biases = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x@self.weights.T + self.biases

Note that `nn.Module` objects are used as if they were functions (i.e., they are callable); behind the scenes, PyTorch calls the `forward` method automatically.

In [None]:
x = torch.arange(6).view(3, 2).float()

# Input Dimension
input_dim = 2

# Output Dimension
output_dim = 1

# Since we're now using an object instead of just using a function, we
# first have to instantiate our model

model = LinearRegression(input_dim, output_dim)

# Get the output of linear layer after transformation
output = model(x)

print('input_tensor shape :', x.shape)
print('output_tensor shape: ', output.shape)

input_tensor shape : torch.Size([3, 2])
output_tensor shape:  torch.Size([3, 1])


In [None]:
model.weights

Parameter containing:
tensor([[-1.0733, -0.5606]], requires_grad=True)

In [None]:
model.biases

Parameter containing:
tensor([0.], requires_grad=True)
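
The `.parameters()` and `.zero_grad()` methods mentioned above, and the fact that calling a module runs `forward`, can be illustrated with a minimal sketch. The class name `TinyLinear` and the toy input are hypothetical and chosen only for this illustration.

```python
import torch
import torch.nn as nn

class TinyLinear(nn.Module):
    """Hypothetical minimal module with one weight matrix and one bias vector."""
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(output_dim, input_dim))
        self.biases = nn.Parameter(torch.zeros(output_dim))

    def forward(self, x):
        return x @ self.weights.T + self.biases

model = TinyLinear(2, 1)
x = torch.arange(6).view(3, 2).float()

# Calling the module runs forward() (plus any registered hooks)
same = torch.equal(model(x), model.forward(x))

# Attributes wrapped in nn.Parameter are registered automatically
param_names = [name for name, _ in model.named_parameters()]

# backward() populates .grad; zero_grad() resets it
model(x).sum().backward()
grad_before = model.weights.grad.clone()
model.zero_grad()
grad_after = model.weights.grad  # None or all zeros, depending on the PyTorch version

print(same, param_names, grad_before, grad_after)
```

Registering tensors through `nn.Parameter` is what lets optimizers find them later via `model.parameters()`.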

# <font color = 'pickle'>**Linear Module (nn.Linear)**


Instead of manually defining and initializing the parameters (weights and biases) and calculating `x @ self.weights.T + self.biases`, we can use the PyTorch class `nn.Linear` for a linear layer, which does all that for us.

This layer takes in dimensions of input and output features and applies the following transformation to the input tensor $x$

$y = x w^T + b$ , 
$w$ and $b$ are the parameters.

The syntax for Linear Module is  :
`torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)`

- in_features – size of each input sample
- out_features – size of each output sample

Shapes :

Input: $(N, *, H_{in})$ <br>

Here, $H_{in} = in\_features$, ∗ means any number of additional dimensions, and N is the batch size (number of observations). <br><br>

Output: $(N, *, H_{out})$,
where all but the last dimension have the same shape as the input and $H_{out} = out\_features$.


Example:
  - if the input has shape (3, 2) (batch size is 3 and there are two features)
  and the layer is `nn.Linear(in_features=2, out_features=1)`,
  - then the output will have shape (3, 1) (3 observations and 1 feature).
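
The shape rule above, including the `*` for additional dimensions, can be checked with a short sketch. The tensor names `x2d` and `x3d` are made up for this illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(in_features=2, out_features=1)

# 2-D input: (N, H_in) = (3, 2)
x2d = torch.randn(3, 2)
# 3-D input: (N, *, H_in) = (3, 5, 2); the middle dimension is the "extra" one
x3d = torch.randn(3, 5, 2)

print(layer(x2d).shape)  # torch.Size([3, 1])
print(layer(x3d).shape)  # torch.Size([3, 5, 1])
```

Only the last dimension is transformed; all leading dimensions pass through unchanged.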


In [None]:
class LinearRegression(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.linear_layer = nn.Linear(input_dim, output_dim)


    def forward(self, x):
        return self.linear_layer(x)

In [None]:

x = torch.arange(6).view(3, 2).float()

# Input Dimension
input_dim = 2

# Output Dimension
output_dim = 1

# Initialize first linear layer
model = nn.Linear(input_dim, output_dim)
# model = LinearRegression(input_dim, output_dim)

# Get the output of linear layer after transformation
output = model(x)

print('input_tensor shape :', x.shape)
print('output_tensor shape: ', output.shape)

input_tensor shape : torch.Size([3, 2])
output_tensor shape:  torch.Size([3, 1])


We have not specified any initial weights or bias values. The Linear module automatically initializes them randomly as follows:

1. `Weights`: The learnable weights are drawn from the uniform distribution $U(-\sqrt{k}, \sqrt{k})$, where k = 1 / in_features (here, input_dim).

2. `Bias`: The learnable bias values are drawn from the same uniform distribution $U(-\sqrt{k}, \sqrt{k})$, where again k = 1 / in_features (not 1 / out_features).

- This is the default initialization for the Linear layer. It is sometimes loosely referred to as LeCun-style initialization.
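
As a quick sanity check of the default initialization, the sketch below verifies that both the weights and the biases of a fresh layer fall within $\pm 1/\sqrt{in\_features}$. This assumes the default behaviour of recent PyTorch versions; the layer sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn

in_features, out_features = 100, 50
layer = nn.Linear(in_features, out_features)

# sqrt(k) with k = 1 / in_features
bound = 1 / math.sqrt(in_features)
tol = 1e-6  # small tolerance for float32 rounding

# Both weight and bias are expected to lie inside [-bound, bound]
weight_ok = bool(layer.weight.abs().max() <= bound + tol)
bias_ok = bool(layer.bias.abs().max() <= bound + tol)
print(weight_ok, bias_ok)
```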

In [None]:
# We can get all the parameters associated with model(linear layer) as follows
for name, param in  model.named_parameters():
  print(name, param)

weight Parameter containing:
tensor([[ 0.0217, -0.6451]], requires_grad=True)
bias Parameter containing:
tensor([-0.6815], requires_grad=True)


In [None]:
print('We can see that PyTorch initializes  weights  in the background\n')
print('W:', model.weight)
print('b:', model.bias)
print('Shape of W :', model.weight.data.shape)
print('Shape of b:', model.bias.data.shape)

We can see that PyTorch initializes  weights  in the background

W: Parameter containing:
tensor([[ 0.0217, -0.6451]], requires_grad=True)
b: Parameter containing:
tensor([-0.6815], requires_grad=True)
Shape of W : torch.Size([1, 2])
Shape of b: torch.Size([1])


## <font color = 'pickle'>**Summary Linear Layer:**

- When we initialize the layer (`layer = nn.Linear(input_dim, output_dim)`), the Linear module takes the input and output dimensions as arguments and automatically initializes the weights and biases randomly.

  - PyTorch sets the attribute requires_grad = True for weights and biases.
  - Shape of weights is [out_features, in_features]
  - Shape of bias is [out_features]

- We can then apply this layer to inputs to get our output (`output = layer(input)`).
  - It uses the randomly initialized weights and biases to transform the inputs. 

  - Shape of input = [batch_size, in_features]
  - output = input (W.T) + b
  - shape of output = [batch_size, out_features]

<img src ="https://drive.google.com/uc?export=view&id=1ewECT6hqC1sXd-TqXG3K1WZHAKXhY7g7" width =700 >

In the example above, the **output layer** would be `nn.Linear(2, 1)`. In the figure above, we have assumed a batch size of 1.
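
The relation `output = input (W.T) + b` from the summary can be verified directly against a layer's own parameters. This is a minimal sketch; the variable name `manual` is made up here.

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 1)
x = torch.arange(6).view(3, 2).float()  # shape [batch_size, in_features]

# Reproduce the layer by hand: input (W.T) + b
manual = x @ layer.weight.T + layer.bias
matches = torch.allclose(layer(x), manual, atol=1e-6)
print(manual.shape, matches)
```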


# <font color = 'pickle'>**Sequential Module (nn.Sequential)**

Many times, we want to compose Modules together. `torch.nn.Sequential` provides a good interface to combine modules sequentially where the output of a module (layer) is sequentially fed as an input to the next layer. Consider the following network:

<img src ="https://drive.google.com/uc?export=view&id=1rymZGH-Xrp_1ywGAcRJraiuuAd-verg7" width =700 >


In the example above, the **hidden layer** would be `nn.Linear(3, 4)` and the **output layer** would be `nn.Linear(4, 1)`. In the figure above, we have assumed a batch size of 1.

## <font color = 'pickle'>**Shallow NN with Custom Class**

In [None]:
class LinearRegression(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim
        self.linear_layer1 = nn.Linear(input_dim, hidden_dim)
        self.linear_layer2 = nn.Linear(hidden_dim, output_dim)


    def forward(self, x):
        out1 = self.linear_layer1(x)
        out2 = self.linear_layer2(out1)
        return out2

In [None]:
# The code below illustrates the above example with a batch size of 5
input_ = torch.arange(15).view(5, 3).float()
model = LinearRegression(3, 4, 1)
output = model(input_)

print('input_tensor shape :', input_.shape)
print('output_tensor shape: ', output.shape)

input_tensor shape : torch.Size([5, 3])
output_tensor shape:  torch.Size([5, 1])


## <font color = 'pickle'>**Shallow NN with nn.Sequential**

In [None]:
# The code below illustrates the above example with a batch size of 5
input_ =   torch.arange(15).view(5, 3).float()
hidden_layer = nn.Linear(3, 4)
output_layer = nn.Linear(4, 1)
model = nn.Sequential(hidden_layer, output_layer)
output = model(input_)

print('input_tensor shape :', input_.shape)
print('output_tensor shape: ', output.shape)


input_tensor shape : torch.Size([5, 3])
output_tensor shape:  torch.Size([5, 1])


In [None]:
# print the model
print(model)

Sequential(
  (0): Linear(in_features=3, out_features=4, bias=True)
  (1): Linear(in_features=4, out_features=1, bias=True)
)
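
Two further conveniences of `nn.Sequential` are sketched below: layers can be retrieved by position, and passing an `OrderedDict` gives them readable names. The names `hidden` and `output` are arbitrary labels chosen for this illustration.

```python
from collections import OrderedDict
import torch
import torch.nn as nn

# Layers in a Sequential can be retrieved by position
model = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 1))
first_layer = model[0]

# Passing an OrderedDict names the layers instead of numbering them 0, 1, ...
named_model = nn.Sequential(OrderedDict([
    ('hidden', nn.Linear(3, 4)),
    ('output', nn.Linear(4, 1)),
]))

x = torch.arange(15).view(5, 3).float()

# Sequential chains the layers: layer 1 is applied to the output of layer 0
chained = model[1](model[0](x))
is_chain = torch.equal(model(x), chained)

names = [name for name, _ in named_model.named_parameters()]
print(is_chain, names)
```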


In [None]:
# model summary
from torchsummary import summary
summary(model, input_size=(5,3) )

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                 [-1, 5, 4]              16
            Linear-2                 [-1, 5, 1]               5
Total params: 21
Trainable params: 21
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
----------------------------------------------------------------


In [None]:
# We can get all the parameters associated with model(linear layer) as follows
for name, param in  model.named_parameters():
  print(name, param)

0.weight Parameter containing:
tensor([[ 0.0908,  0.5047,  0.2549],
        [ 0.1145, -0.1412,  0.1246],
        [-0.1499, -0.4533,  0.3863],
        [-0.5293, -0.0734,  0.2802]], requires_grad=True)
0.bias Parameter containing:
tensor([-0.2281, -0.1056,  0.1738, -0.3594], requires_grad=True)
1.weight Parameter containing:
tensor([[-0.0161,  0.3384, -0.4132, -0.2445]], requires_grad=True)
1.bias Parameter containing:
tensor([-0.4676], requires_grad=True)


# <font color = 'pickle'>**Custom initialization nn.init()**
Each layer in PyTorch has a default initialization. We can change it using the functions in the `nn.init` module.

In [None]:
# The code below illustrates how we can specify a custom initialization for each layer
# (the tensor is named input_ to avoid shadowing Python's built-in input())
input_ = torch.arange(15).view(5, 3).float()
hidden_layer = nn.Linear(3, 4)

torch.nn.init.normal_(hidden_layer.weight, mean = 0, std=0.01)
torch.nn.init.zeros_(hidden_layer.bias)

output_layer = nn.Linear(4, 1)
torch.nn.init.normal_(output_layer.weight, mean = 0, std=0.01)
torch.nn.init.zeros_(output_layer.bias)

model = nn.Sequential(hidden_layer, output_layer)
output = model(input_)

print('input_tensor shape :', input_.shape)
print('output_tensor shape: ', output.shape)

input_tensor shape : torch.Size([5, 3])
output_tensor shape:  torch.Size([5, 1])


In [None]:
# We can get all the parameters associated with model as follows
for name, param in  model.named_parameters():
  print(name, param)

0.weight Parameter containing:
tensor([[ 0.0062, -0.0051,  0.0024],
        [ 0.0231,  0.0032, -0.0079],
        [ 0.0030, -0.0010, -0.0090],
        [ 0.0061,  0.0146,  0.0084]], requires_grad=True)
0.bias Parameter containing:
tensor([0., 0., 0., 0.], requires_grad=True)
1.weight Parameter containing:
tensor([[-0.0101, -0.0068, -0.0169, -0.0020]], requires_grad=True)
1.bias Parameter containing:
tensor([0.], requires_grad=True)


In [None]:
for layer in model.children():
  print(layer)

Linear(in_features=3, out_features=4, bias=True)
Linear(in_features=4, out_features=1, bias=True)


In [None]:
# If we want to apply same initialization for all the layers we can do that 
# using for loop
for layer in model:
  if isinstance(layer, nn.Linear):
    torch.nn.init.constant_(layer.weight, 5)
    torch.nn.init.zeros_(layer.bias)

In [None]:
# Check the parameter values
for name, param in  model.named_parameters():
  print(name, param.data)

0.weight tensor([[5., 5., 5.],
        [5., 5., 5.],
        [5., 5., 5.],
        [5., 5., 5.]])
0.bias tensor([0., 0., 0., 0.])
1.weight tensor([[5., 5., 5., 5.]])
1.bias tensor([0.])


## <font color = 'pickle'>**Custom initialization using apply function** </font>

The `apply()` function applies the initialization recursively. In complex models, layers can have sublayers; `apply()` makes sure that the initialization is applied to those sublayers as well.

In [None]:
# Preferred Method
def init_weights(layer):
  if isinstance(layer, nn.Linear):
    # normal_ is the in-place initializer; nn.init.normal is deprecated
    torch.nn.init.normal_(layer.weight, mean = 0, std = 0.05)
    torch.nn.init.zeros_(layer.bias)

In [None]:
model.apply(init_weights)


Sequential(
  (0): Linear(in_features=3, out_features=4, bias=True)
  (1): Linear(in_features=4, out_features=1, bias=True)
)

In [None]:
# Check the parameter values
for name, param in  model.named_parameters():
  print(name, param.data)

0.weight tensor([[-0.0206, -0.0087, -0.0083],
        [-0.0452, -0.0667, -0.0673],
        [-0.0507, -0.0312, -0.0406],
        [-0.0515,  0.0198,  0.0876]])
0.bias tensor([0., 0., 0., 0.])
1.weight tensor([[ 0.0029,  0.0317, -0.0181, -0.0537]])
1.bias tensor([0.])
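
To see why the recursion matters, consider a model with a `Sequential` nested inside another `Sequential`. The sketch below uses a hypothetical initializer `init_constant` with constant weights so the effect is easy to verify.

```python
import torch
import torch.nn as nn

def init_constant(layer):
    # Hypothetical initializer: constant weights make the effect easy to see
    if isinstance(layer, nn.Linear):
        nn.init.constant_(layer.weight, 0.5)
        nn.init.zeros_(layer.bias)

# A Sequential nested inside another Sequential
block = nn.Sequential(nn.Linear(3, 4), nn.Linear(4, 4))
nested_model = nn.Sequential(block, nn.Linear(4, 1))

# A plain loop over the model only reaches `block` and the last Linear,
# so an isinstance check in that loop would skip the layers inside `block`
top_level = [type(m).__name__ for m in nested_model]

# apply() visits every submodule, including the ones inside `block`
nested_model.apply(init_constant)
all_set = all(
    bool(torch.all(p == 0.5))
    for name, p in nested_model.named_parameters()
    if name.endswith('weight')
)
print(top_level, all_set)
```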


# <font color = 'pickle'>**Mean Squared Error Loss (nn.MSELoss())**

PyTorch implements many common loss functions including `MSELoss` and `CrossEntropyLoss`. We will discuss `MSELoss()` in this lecture. We will explore `CrossEntropyLoss` in coming lectures.

Suppose our inputs and targets are as follows:

`x = [0, 1, 2, 3, 4]`

`y = [1, 3, 5, 7, 9]`

Suppose our model predicts with the equation `ypred = 2 * x`, so every prediction is off by one:

`ypred = [0, 2, 4, 6, 8] `

Mean Squared Error (MSE) = $\frac{\sum_{i=1}^{n} (ypred_i  - y_i)^2} {n}$. Here, n = number of elements.

For the above example, loss = 1.0

Earlier, we wrote a function to implement MSE. We can instead use the `nn.MSELoss()` module from PyTorch to calculate the loss.



In [None]:
# Manual implementation of mean squared error (as written earlier)
def mse_loss(ypred, y):
  """
  Mean squared error loss function.
  Input: predicted labels and actual labels
  Output: mean squared error loss
  """
  error = ypred - y.view(ypred.shape)
  mean_squared_error = (error ** 2).mean()
  return mean_squared_error

# Instantiate the module version and keep a handle to the functional version
loss_nn = nn.MSELoss(reduction='mean')
loss_fn_functional = F.mse_loss

# reduction = 'mean' gives the mean squared loss (this is the default)
# reduction = 'sum' gives the total squared loss

# inputs and targets
x = torch.tensor([0., 1., 2., 3., 4.])
y = torch.tensor([1., 3., 5., 7., 9.])

# predictions
ypred = 2 * x

# Calculating loss
# Each loss function takes 2 inputs: predicted labels first, then actual labels.
loss_manual = mse_loss(ypred, y)
loss_nn_module = loss_nn(ypred, y)
loss_functional = loss_fn_functional(ypred, y)
print(loss_manual, loss_nn_module, loss_functional)

tensor(1.) tensor(1.) tensor(1.)
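
The `reduction` argument can be explored with the same numbers. A short sketch comparing `'none'`, `'sum'`, and `'mean'`:

```python
import torch
import torch.nn as nn

y = torch.tensor([1., 3., 5., 7., 9.])
ypred = torch.tensor([0., 2., 4., 6., 8.])

per_element = nn.MSELoss(reduction='none')(ypred, y)  # one squared error per element
total = nn.MSELoss(reduction='sum')(ypred, y)         # sum of squared errors
mean = nn.MSELoss(reduction='mean')(ypred, y)         # sum divided by n

print(per_element)  # tensor([1., 1., 1., 1., 1.])
print(total, mean)  # tensor(5.) tensor(1.)
```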


# <font color = 'pickle'>**Cross Entropy Loss**

**Let us summarize Softmax Regression Model:**

- **Input**: Features X, shape: (n x d)
  - n : number of examples.
  - d : number of features in each example.
- **Output**: Labels y = {1, 2....K}, shape: (n x K) 

- **Parameters**: Weights w, shape: (K x d) and bias b, dimension: (K)

- **Forward pass**

$$o_k^{(i)}  = \mathbf{x^{(i)}}\mathbf{w_k} ^T+b_k$$

$$\hat{p_k}^{(i)} = softmax(o_k^{(i)}) = \frac{e^{o_k^{(i)}}}{\sum_{j=1}^{K} e^{o_j^{(i)}}}$$




<img src = "https://drive.google.com/uc?export=view&id=1JZ2cNVX2Cs3v-MhKNcE2jvLjzny9QHW1" width =600 >

- **Cross Entropy Loss Function** (for a batch of $m$ examples; the example assumes a batch size of two): 
\begin{equation}
\mathcal{L} = -\frac{1}{m} \sum_{k=1}^{K} \sum_{i=1}^{m} \bigg[y_k^{(i)}\log(\hat{p_k}^{(i)}) \bigg]
\end{equation}

<img src = "https://drive.google.com/uc?export=view&id=1YFkNGN_x1lETidVDNLaSZ4ndwZv9Wc2T" width =600 >

## <font color = 'pickle'>**Create a function for Cross Entropy**

In [None]:
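def cross_entropy(outcome, y):
  # outcome: raw scores (logits), shape (n, K); y: class indices, shape (n,)
  numerator = torch.exp(outcome)
  denominator = numerator.sum(dim=1, keepdim=True)
  p_hat = numerator / denominator  # softmax probabilities
  # pick the predicted probability of the true class for each example
  return -torch.log(p_hat[range(len(p_hat)), y]).mean()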

In [None]:
Output = torch.Tensor([[1.2, -0.8, 0.7, -2.4], [0.1, 0.3, -0.3, 2.4]])
y = torch.tensor([0, 1])
print(cross_entropy(Output, y))

tensor(1.4626)


## <font color = 'pickle'>**nn.CrossEntropyLoss()**

To compute cross entropy in two steps, we take the log of the softmax and then pass it to the negative log likelihood loss. `nn.CrossEntropyLoss` combines these two steps into one.

In [None]:
logsoftmax = nn.LogSoftmax(dim =1)(Output)
negative_log_likelihood_loss = nn.NLLLoss()(logsoftmax, y)
print(negative_log_likelihood_loss)

tensor(1.4626)


In [None]:
loss = nn.CrossEntropyLoss()

In [None]:
print(loss(Output, y))

tensor(1.4626)
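
The loss formula with one-hot labels $y_k^{(i)}$ can also be evaluated literally and compared with the built-in loss. This is a sketch under the assumption that labels are given as class indices and converted with `F.one_hot`; the names `logits`, `y_onehot`, and `loss_formula` are made up here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1.2, -0.8, 0.7, -2.4], [0.1, 0.3, -0.3, 2.4]])
y = torch.tensor([0, 1])

# One-hot labels y_k^{(i)}, shape (m, K)
y_onehot = F.one_hot(y, num_classes=logits.shape[1]).float()

# log p_hat via log_softmax (numerically safer than log(softmax(...)))
log_p_hat = F.log_softmax(logits, dim=1)

# L = -(1/m) * sum_k sum_i y_k^{(i)} * log(p_hat_k^{(i)})
m = logits.shape[0]
loss_formula = -(y_onehot * log_p_hat).sum() / m
loss_builtin = nn.CrossEntropyLoss()(logits, y)

print(loss_formula, torch.allclose(loss_formula, loss_builtin))
```

Because each one-hot row has a single 1, the double sum simply picks out the log-probability of the true class for each example.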


# <font color = 'pickle'>**torch.optim**
We can use a number of gradient-based optimization methods through `torch.optim`. **SGD (Stochastic Gradient Descent)** is the most basic of them and **Adam** is one of the most popular. We will use SGD in this notebook and cover other optimizers in a later lecture.

An optimizer takes the **model parameters** we want to update (learnable parameters), and the **learning rate**  (and some other hyper-parameters as well).

Optimizers do not compute the gradients on their own; we need to call **backward()** on the loss first.

We can then use the optimizer's **step()** method to update the model parameters.

Further, we do not need to zero the gradients one by one. We can invoke the optimizer’s **zero_grad()** method.

This resets the gradients of all the learnable parameters passed to the optimizer (by zeroing them or, by default in recent PyTorch versions, by setting them to `None`).

In [None]:
# create a simple model
model = nn.Linear(3, 1)

# create a simple dataset
X = torch.tensor([[1., 3., 4.]])
y = torch.tensor([[2.]])

# create our optimizer
optim = torch.optim.SGD(model.parameters(), lr=1e-2)

# loss function 
criterion = nn.MSELoss()

y_hat = model(X)

print('model params before weight update:', model.weight.data, model.bias.data)

# calculate loss
loss = criterion(y_hat, y)

# reset gradients to zero
optim.zero_grad()

# calculate gradients
loss.backward()

# update weights
optim.step()


print('model params after weight update:', model.weight.data, model.bias.data)


model params before weight update: tensor([[ 0.2046,  0.0729, -0.3445]]) tensor([-0.2338])
model params after weight update: tensor([[ 0.2684,  0.2642, -0.0895]]) tensor([-0.1700])
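
For plain SGD (no momentum or weight decay), `step()` is expected to perform the update `param = param - lr * grad`. The sketch below checks this by computing the update by hand before calling `step()`; the seed and variable names are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)
X = torch.tensor([[1., 3., 4.]])
y = torch.tensor([[2.]])
lr = 1e-2

optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss = nn.MSELoss()(model(X), y)

optimizer.zero_grad()
loss.backward()

# What plain SGD is expected to compute: param - lr * grad
expected_weight = model.weight.detach() - lr * model.weight.grad
expected_bias = model.bias.detach() - lr * model.bias.grad

optimizer.step()
weight_matches = torch.allclose(model.weight, expected_weight)
bias_matches = torch.allclose(model.bias, expected_bias)
print(weight_matches, bias_matches)
```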


# <font color = 'pickle'>**Dataset and Dataloader**

When we train our model, we typically

  - want to process the data in batches 
  - reshuffle the data at every epoch to reduce model overfitting, 
  - and use Python’s multiprocessing to speed up data retrieval.

Earlier we wrote a function to create an iterator that shuffles the data and yields batches. However, we can do this much more efficiently using **torch.utils.data.DataLoader**, which is an iterable that provides all the above features.

The most important argument of the DataLoader constructor is `dataset`, which is a PyTorch Dataset. A PyTorch **Dataset** is a regular **Python class** that inherits from the [**Dataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class. 

If a dataset consists of tensors of labels and features, we can use PyTorch’s [**TensorDataset**](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) class to wrap the tensors in a Dataset class.

If the **dataset is big** (tens of thousands of image/text files, for instance), loading it all at once would not be memory efficient. In that case we will need to create a custom dataset class that loads the files/examples on demand. We will demonstrate how to create a CustomDataset that inherits from PyTorch's Dataset class later. 
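
As a brief preview, a map-style dataset only needs `__len__` and `__getitem__`. The sketch below is a minimal illustration, not the CustomDataset covered later; the class name `PairDataset` and the toy tensors are hypothetical, and the data is kept in memory for simplicity.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PairDataset(Dataset):
    """Hypothetical map-style dataset returning one (features, label) pair per index."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        # A dataset backed by files would load and preprocess item `idx` here
        return self.features[idx], self.labels[idx]

features = torch.arange(10).view(5, 2).float()
labels = torch.arange(5).view(5, 1).float()
dataset = PairDataset(features, labels)

# DataLoader batches the individual items returned by __getitem__
loader = DataLoader(dataset, batch_size=2, shuffle=False)
batch_shapes = [(xb.shape, yb.shape) for xb, yb in loader]
print(len(dataset), batch_shapes)
```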

In [None]:
# Generate Dataset
x = torch.arange(10).view(5, 2)
x = x.type(dtype = torch.float)
w = torch.Tensor([2, 3]).view(-1, 1)
y = x.mm(w) + 1
print(f'x:{x}' )
print(f'\ny: {y}')

x:tensor([[0., 1.],
        [2., 3.],
        [4., 5.],
        [6., 7.],
        [8., 9.]])

y: tensor([[ 4.],
        [14.],
        [24.],
        [34.],
        [44.]])


In [None]:
# Create Dataset 
dataset = data.TensorDataset(x, y)

In [None]:
# Create DataLoader
data_iter = data.DataLoader(dataset, batch_size= 2, shuffle= True)

In [None]:
# We can loop over the DataLoader object to get batch of observations

for epoch in range(3):
  print(f'\nEpoch {epoch + 1}\n')
  for i, (x, y) in enumerate(data_iter):
    print(f'Batch Number {i+1}')
    print(f'x:{x}' )
    print(f'y: {y}\n')


Epoch 1

Batch Number 1
x:tensor([[6., 7.],
        [0., 1.]])
y: tensor([[34.],
        [ 4.]])

Batch Number 2
x:tensor([[4., 5.],
        [2., 3.]])
y: tensor([[24.],
        [14.]])

Batch Number 3
x:tensor([[8., 9.]])
y: tensor([[44.]])


Epoch 2

Batch Number 1
x:tensor([[2., 3.],
        [8., 9.]])
y: tensor([[14.],
        [44.]])

Batch Number 2
x:tensor([[4., 5.],
        [0., 1.]])
y: tensor([[24.],
        [ 4.]])

Batch Number 3
x:tensor([[6., 7.]])
y: tensor([[34.]])


Epoch 3

Batch Number 1
x:tensor([[0., 1.],
        [2., 3.]])
y: tensor([[ 4.],
        [14.]])

Batch Number 2
x:tensor([[8., 9.],
        [6., 7.]])
y: tensor([[44.],
        [34.]])

Batch Number 3
x:tensor([[4., 5.]])
y: tensor([[24.]])



We can observe that in every epoch, an observation can end up in a different batch. This happens because DataLoader reshuffles the dataset at the start of each epoch before creating batches.
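
Putting the pieces together, a training loop for linear regression might look like the sketch below. It is an illustration under stated assumptions, not the HW1 solution: the learning rate, number of epochs, and seed are picked by hand for this toy data.

```python
import torch
import torch.nn as nn
from torch.utils import data

torch.manual_seed(0)

# Same toy data as above: y = 2*x1 + 3*x2 + 1
x = torch.arange(10).view(5, 2).float()
y = x @ torch.tensor([[2.], [3.]]) + 1

loader = data.DataLoader(data.TensorDataset(x, y), batch_size=2, shuffle=True)
model = nn.Linear(2, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # lr chosen by hand

initial_loss = criterion(model(x), y).item()
for epoch in range(100):
    for xb, yb in loader:
        optimizer.zero_grad()            # reset gradients
        loss = criterion(model(xb), yb)  # forward pass and loss
        loss.backward()                  # compute gradients
        optimizer.step()                 # update parameters
final_loss = criterion(model(x), y).item()

print(initial_loss, final_loss)
```

Each inner iteration follows the same four steps shown in the `torch.optim` section, applied to one mini-batch at a time.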