# PyTorch Modules and Built-in Functions

## Torch's `nn.Parameter`

In PyTorch, any tensor that needs to be trained can be wrapped with the `nn.Parameter` class, which makes it a trainable parameter. `nn.Parameter` is a subclass of `torch.Tensor`, so it has all the properties and methods of a tensor. In addition, it has `requires_grad=True` by default.

When an `nn.Parameter` is assigned as an attribute of an `nn.Module`, it is automatically added to the list returned by `model.parameters()`, which is what you hand to an optimizer such as stochastic gradient descent (SGD). When you call `backward()` on your loss tensor, gradients are computed for every tensor with `requires_grad=True` that contributed to the loss, including those wrapped with `nn.Parameter`.
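
A minimal sketch of what the wrapper changes on the tensor itself. The toy loss (the sum of all entries) and the variable names are assumptions chosen for illustration, not part of the original example:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Wrapping a tensor in nn.Parameter turns on gradient tracking by default.
W = nn.Parameter(torch.rand(3, 3))
is_tensor = isinstance(W, torch.Tensor)   # Parameter subclasses Tensor
tracks_grad = W.requires_grad             # True without setting it by hand

# A toy scalar loss: the sum of all entries of W.
loss = W.sum()
loss.backward()

# d(loss)/dW is 1 for every entry, so W.grad is a matrix of ones.
grad_is_ones = torch.equal(W.grad, torch.ones(3, 3))

print(is_tensor, tracks_grad, grad_is_ones)  # True True True
```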

In [55]:
import torch
import torch.nn.functional as F
from torch import nn

# Option 1: a plain tensor with gradient tracking switched on by hand
torch.manual_seed(0)
W = torch.rand(10, 10)
W.requires_grad = True
X = F.one_hot(torch.tensor([1, 2, 3, 4, 5]), num_classes=10).float()

print(X @ W)

# Option 2: the same weights wrapped in nn.Parameter
torch.manual_seed(0)
W = nn.Parameter(torch.rand(10, 10))
X = F.one_hot(torch.tensor([1, 2, 3, 4, 5]), num_classes=10).float()

print(X @ W)

tensor([[0.3489, 0.4017, 0.0223, 0.1689, 0.2939, 0.5185, 0.6977, 0.8000, 0.1610,
         0.2823],
        [0.6816, 0.9152, 0.3971, 0.8742, 0.4194, 0.5529, 0.9527, 0.0362, 0.1852,
         0.3734],
        [0.3051, 0.9320, 0.1759, 0.2698, 0.1507, 0.0317, 0.2081, 0.9298, 0.7231,
         0.7423],
        [0.5263, 0.2437, 0.5846, 0.0332, 0.1387, 0.2422, 0.8155, 0.7932, 0.2783,
         0.4820],
        [0.8198, 0.9971, 0.6984, 0.5675, 0.8352, 0.2056, 0.5932, 0.1123, 0.1535,
         0.2417]], grad_fn=<MmBackward0>)
tensor([[0.3489, 0.4017, 0.0223, 0.1689, 0.2939, 0.5185, 0.6977, 0.8000, 0.1610,
         0.2823],
        [0.6816, 0.9152, 0.3971, 0.8742, 0.4194, 0.5529, 0.9527, 0.0362, 0.1852,
         0.3734],
        [0.3051, 0.9320, 0.1759, 0.2698, 0.1507, 0.0317, 0.2081, 0.9298, 0.7231,
         0.7423],
        [0.5263, 0.2437, 0.5846, 0.0332, 0.1387, 0.2422, 0.8155, 0.7932, 0.2783,
         0.4820],
        [0.8198, 0.9971, 0.6984, 0.5675, 0.8352, 0.2056, 0.5932, 0.1123, 0.1535,
    

## Torch's `nn.Module`

The `nn.Module` class is a fundamental building block for creating complex deep learning models. It provides a convenient way to organize and encapsulate all the trainable parameters and operations of a deep learning model.

`nn.Module` makes it easy to build complex neural networks by letting you define layers, activations, loss functions, and other components as modules. You implement the forward pass in a `forward()` method; backward propagation is handled by autograd, and parameter updates are handled by an optimizer from `torch.optim`. Any `nn.Parameter` or sub-module assigned as an attribute is registered automatically and is returned by `model.parameters()`.

Here is an example of how to create an `nn.Module`:
    

In [112]:
import torch
import torch.nn as nn

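class MatrixFactorization(nn.Module):
    """Approximate A (m x n) by the product C @ B."""

    def __init__(self, A, B_rows, C_cols):
        super().__init__()
        # B_rows and C_cols form the shared inner dimension, so they must be equal.
        self.A = A  # kept for reference; a plain tensor is not a trainable parameter
        self.B = nn.Parameter(torch.randn(B_rows, A.size(1)))  # (B_rows, n)
        self.C = nn.Parameter(torch.randn(A.size(0), C_cols))  # (m, C_cols)

    def forward(self):
        # (m, C_cols) @ (B_rows, n) -> (m, n)
        return torch.matmul(self.C, self.B)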
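
To see which attributes a module actually registers, one can inspect `named_parameters()`. The sketch below uses `TinyFactorization`, a hypothetical smaller stand-in for the `MatrixFactorization` class above, with a single `rank` argument assumed in place of `B_rows` and `C_cols`:

```python
import torch
import torch.nn as nn


class TinyFactorization(nn.Module):
    """Hypothetical, smaller stand-in for the MatrixFactorization class."""

    def __init__(self, A, rank):
        super().__init__()
        self.A = A                                            # plain tensor: not registered
        self.B = nn.Parameter(torch.randn(rank, A.size(1)))   # registered
        self.C = nn.Parameter(torch.randn(A.size(0), rank))   # registered

    def forward(self):
        return self.C @ self.B


A = torch.zeros(4, 5)
model = TinyFactorization(A, rank=2)

# named_parameters() yields only the attributes wrapped in nn.Parameter.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(param_shapes)  # {'B': (2, 5), 'C': (4, 2)}

# Calling the module runs forward(); the product has the same shape as A.
out_shape = tuple(model().shape)
print(out_shape)  # (4, 5)
```

Because `A` is stored as a plain tensor, it does not appear among the parameters and an optimizer built from `model.parameters()` will never modify it.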

## Torch's `torch.optim`

A PyTorch optimizer is an object from the `torch.optim` package that updates the parameters of a neural network during training. The package provides several optimization algorithms, such as Stochastic Gradient Descent (SGD), Adam, Adagrad, and RMSprop, which update the weights and biases of the network to minimize the loss function.


`optimizer.step()` is called after `loss.backward()` has computed the gradients of the loss with respect to the model parameters. For plain SGD on a single parameter `W` with learning rate `lr`, the step together with the gradient reset done by `optimizer.zero_grad()` behaves similarly to:

-------------------------
```python
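# What optimizer.step() does for plain SGD: move against the gradient
with torch.no_grad():
    W -= lr * W.grad
# What optimizer.zero_grad() does: clear the gradient before the next backward pass
W.grad = None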
```
-------------------------

When defining an optimizer, we pass it all the parameters over which we would like to perform gradient descent, together with a learning rate, for example `optim.SGD(model.parameters(), lr=1)`.
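
The correspondence between the manual update and the optimizer can be checked directly. This is a sketch under assumed values (a 3x3 weight, `lr = 0.1`, and a toy squared-sum loss), none of which come from the original example:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.1

# Two copies of the same starting weights.
W_manual = nn.Parameter(torch.rand(3, 3))
W_optim = nn.Parameter(W_manual.detach().clone())
optimizer = optim.SGD([W_optim], lr=lr)

# The same toy loss for both copies.
(W_manual ** 2).sum().backward()
(W_optim ** 2).sum().backward()

# Manual SGD update followed by a gradient reset.
with torch.no_grad():
    W_manual -= lr * W_manual.grad
W_manual.grad = None

# The same update through the optimizer.
optimizer.step()
optimizer.zero_grad()

same = torch.allclose(W_manual, W_optim)
print(same)  # True
```

This equivalence holds for plain SGD only; options such as momentum or weight decay, and other optimizers such as Adam, keep extra state and apply a different update rule.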

In [116]:
import torch
import torch.nn as nn
import torch.optim as optim


A = torch.randint(0,3,(100,100)).float()
B_rows = 60
C_cols = 60

model = MatrixFactorization(A, B_rows, C_cols)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=1)

epochs = 10000

for epoch in range(epochs):
    optimizer.zero_grad()
    A_approx = model()
    loss = criterion(A_approx, A)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 100 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

Epoch [100/10000], Loss: 1.0932
Epoch [200/10000], Loss: 0.4378
Epoch [300/10000], Loss: 0.3379
Epoch [400/10000], Loss: 0.3049
Epoch [500/10000], Loss: 0.2875
Epoch [600/10000], Loss: 0.2751
Epoch [700/10000], Loss: 0.2648
Epoch [800/10000], Loss: 0.2556
Epoch [900/10000], Loss: 0.2470
Epoch [1000/10000], Loss: 0.2390
Epoch [1100/10000], Loss: 0.2314
Epoch [1200/10000], Loss: 0.2242
Epoch [1300/10000], Loss: 0.2175
Epoch [1400/10000], Loss: 0.2111
Epoch [1500/10000], Loss: 0.2051
Epoch [1600/10000], Loss: 0.1994
Epoch [1700/10000], Loss: 0.1940
Epoch [1800/10000], Loss: 0.1889
Epoch [1900/10000], Loss: 0.1840
Epoch [2000/10000], Loss: 0.1794
Epoch [2100/10000], Loss: 0.1751
Epoch [2200/10000], Loss: 0.1709
Epoch [2300/10000], Loss: 0.1669
Epoch [2400/10000], Loss: 0.1632
Epoch [2500/10000], Loss: 0.1596
Epoch [2600/10000], Loss: 0.1562
Epoch [2700/10000], Loss: 0.1529
Epoch [2800/10000], Loss: 0.1498
Epoch [2900/10000], Loss: 0.1469
Epoch [3000/10000], Loss: 0.1440
Epoch [3100/10000],

## Torch's `nn.Embedding`

`nn.Embedding` is a PyTorch module that maps discrete tokens (e.g., words, characters, or subwords) to vectors of fixed size in a continuous space. These embeddings can be considered as a lookup table that converts an index (corresponding to a specific word) into a dense vector. 

In [197]:
torch.manual_seed(0)
# Initialize the weight matrix using the normal distribution
W = torch.randn((10,10))
x = torch.tensor([3, 4])

X = F.one_hot(x, num_classes=10).float()
X @ W

tensor([[ 0.9463, -0.8437, -0.6136,  0.0316, -0.4927,  0.2484,  0.4397,  0.1124,
          0.6408,  0.4412],
        [-0.1023,  0.7924, -0.2897,  0.0525,  0.5229,  2.3022, -1.4689, -1.5867,
         -0.6731,  0.8728]])

In [199]:
torch.manual_seed(0)
E = nn.Embedding(10,10)
x = torch.tensor([3, 4])
E(x)


tensor([[ 0.9463, -0.8437, -0.6136,  0.0316, -0.4927,  0.2484,  0.4397,  0.1124,
          0.6408,  0.4412],
        [-0.1023,  0.7924, -0.2897,  0.0525,  0.5229,  2.3022, -1.4689, -1.5867,
         -0.6731,  0.8728]], grad_fn=<EmbeddingBackward0>)
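
The two cells above give the same values because an embedding lookup selects rows of the weight matrix, which is also what multiplying by a one-hot matrix does. A short sketch that checks this on a single `nn.Embedding` (the variable names are chosen for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
E = nn.Embedding(num_embeddings=10, embedding_dim=10)
x = torch.tensor([3, 4])

looked_up = E(x)                                             # module call
indexed = E.weight[x]                                        # plain row indexing
one_hot_product = F.one_hot(x, num_classes=10).float() @ E.weight

same_as_indexing = torch.equal(looked_up, indexed)
same_as_one_hot = torch.allclose(looked_up, one_hot_product)
print(same_as_indexing, same_as_one_hot)  # True True
```

The lookup avoids building the one-hot matrix and performing the full matrix product, which matters when the vocabulary is large.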

## Torch's `nn.Linear`

`nn.Linear` performs an affine transformation on the input data. Given an input tensor `x` whose last dimension has size `in_features`, it computes the output `y` as follows:

$$
y = x\mathbf{W}^\top + b
$$

Here, `W` represents the weight matrix of shape `(out_features, in_features)`, `x` is the input tensor, `b` is the bias vector of size `out_features`, and `y` is the output tensor. Both the weight matrix and the bias vector are learnable parameters of the layer.

In [200]:
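import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Initialize the weight matrix the way nn.Linear does by default:
# uniformly in [-1/sqrt(in_features), 1/sqrt(in_features)]
bound = 1 / math.sqrt(10)
W = nn.Parameter(torch.empty(10, 10).uniform_(-bound, bound))
x = torch.tensor([3, 4])

X = F.one_hot(x, num_classes=10).float()
# nn.Linear stores its weight as (out_features, in_features), hence the transpose
X @ W.T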

tensor([[-0.2327, -0.2094,  0.2366, -0.1456, -0.2953,  0.0427,  0.0955,  0.2620,
         -0.3154,  0.1182],
        [-0.1218, -0.1304, -0.0510, -0.2209, -0.2285,  0.2120,  0.1736,  0.2752,
          0.0592, -0.3130]], grad_fn=<MmBackward0>)

In [201]:
torch.manual_seed(0)
linear_layer = nn.Linear(in_features=10, out_features=10, bias=False)
linear_layer(X)

tensor([[-0.2327, -0.2094,  0.2366, -0.1456, -0.2953,  0.0427,  0.0955,  0.2620,
         -0.3154,  0.1182],
        [-0.1218, -0.1304, -0.0510, -0.2209, -0.2285,  0.2120,  0.1736,  0.2752,
          0.0592, -0.3130]], grad_fn=<MmBackward0>)
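
With the bias enabled, the layer adds `b` after the matrix product. PyTorch stores the weight with shape `(out_features, in_features)`, so for a batch of row vectors the layer computes `X @ W.T + b`. A minimal check, assuming an arbitrary random input and `out_features=4` chosen only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=10, out_features=4)  # bias=True by default
X = torch.rand(2, 10)

# Reproduce the layer by hand from its two learnable parameters.
manual = X @ layer.weight.T + layer.bias
matches = torch.allclose(layer(X), manual, atol=1e-6)

weight_shape = tuple(layer.weight.shape)
bias_shape = tuple(layer.bias.shape)
print(weight_shape, bias_shape, matches)  # (4, 10) (4,) True
```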