<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/12_5_minibatch_stochastic_gradient_descent_notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install d2l==1.0.0-alpha1.post0 --quiet

## 12.5 Minibatch Stochastic Gradient Descent

### 12.5.1 Vectorization and Caches

* `torch.dot(input, other)`: Computes the dot product of two 1-D tensors, which must have the same number of elements.

In [2]:
import torch

A = torch.tensor([2, 3])
B = torch.tensor([2, 1])
C = torch.dot(A, B)
C

tensor(7)

* `torch.mv(input, vec)`: Performs a matrix-vector product of the matrix `input` and the vector `vec`.
* If `input` is an $(n \times m)$ tensor and `vec` is a 1-D tensor of size $m$, `out` will be a 1-D tensor of size $n$.

In [3]:
A = torch.randn(2, 3) # n x m
B = torch.randn(3)    # m
C = torch.mv(A, B)
C

tensor([ 3.4966, -6.4120])

* `torch.mm(input, mat)`: Performs a matrix multiplication of the matrices `input` and `mat`.
* If `input` is an $(n \times m)$ tensor and `mat` is an $(m \times p)$ tensor, `out` will be an $(n \times p)$ tensor.

In [4]:
A = torch.randn(2, 3) # n x m
B = torch.randn(3, 3) # m x p (here p = 3)
C = torch.mm(A, B)
C

tensor([[-0.4760, -0.8428, -1.0985],
        [-1.9362, -1.4545, -1.4131]])
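
The three functions above compute the same kind of product at different granularities, which is the idea behind vectorization. The following is a minimal sketch, assuming small random matrices; the names `n`, `m`, `p`, `C_dot`, `C_mv`, and `C_mm` are illustrative and do not come from the original notes. It builds one matrix product three ways and checks that the results agree.

```python
import torch

torch.manual_seed(0)
n, m, p = 4, 3, 5          # illustrative sizes
A = torch.randn(n, m)
B = torch.randn(m, p)

# Element by element: one dot product per output entry.
C_dot = torch.zeros(n, p)
for i in range(n):
    for j in range(p):
        C_dot[i, j] = torch.dot(A[i, :], B[:, j])

# Column by column: one matrix-vector product per output column.
C_mv = torch.zeros(n, p)
for j in range(p):
    C_mv[:, j] = torch.mv(A, B[:, j])

# All at once: a single matrix-matrix product.
C_mm = torch.mm(A, B)

# All three approaches should agree up to floating-point error.
print(torch.allclose(C_dot, C_mm, atol=1e-5))
print(torch.allclose(C_mv, C_mm, atol=1e-5))
```

The single `torch.mm` call is usually the fastest of the three because it avoids Python-level loops, although the actual speedup depends on hardware and matrix size.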

### 12.5.3 Reading the Dataset

* `torch.from_numpy(ndarray)`: Creates a tensor from a NumPy ndarray. The returned tensor shares memory with the ndarray, so modifying one also modifies the other.

In [5]:
import numpy as np

A = np.array([1, 2, 3])
B = torch.from_numpy(A)
B, type(B)

(tensor([1, 2, 3]), torch.Tensor)
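
A tensor created with `torch.from_numpy` does not copy the array; both objects view the same memory. The sketch below illustrates this and contrasts it with `torch.tensor`, which copies the data. The variable names `a`, `t`, and `c` are illustrative.

```python
import numpy as np
import torch

a = np.array([1, 2, 3])
t = torch.from_numpy(a)   # shares memory with `a`
c = torch.tensor(a)       # makes an independent copy

a[0] = 10                 # modify the NumPy array in place

print(t)  # the change is visible through the shared-memory tensor
print(c)  # the copy still holds the original values
```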

### 12.5.4 Implementation from Scratch

* `torch.Tensor.sub_()`: In-place version of `sub()`.
* `torch.sub(input, other, alpha=1)`: Subtracts `other`, scaled by `alpha`, from `input`.
$$
\text{out}_i = \text{input}_i - \text{alpha} \times \text{other}_i
$$

In [6]:
A = torch.tensor([1, 2])
B = torch.tensor([0, 1])
torch.sub(A, B, alpha=2)

tensor([1, 0])
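
The in-place variant has the same form as a gradient descent update, $w \leftarrow w - \eta g$. Below is a minimal sketch, assuming made-up values for the parameter `w`, the gradient `grad`, and the learning rate `lr` (all hypothetical, chosen only for illustration).

```python
import torch

lr = 0.1                               # hypothetical learning rate
w = torch.tensor([1.0, 2.0])           # hypothetical parameter values
grad = torch.tensor([0.5, -1.0])       # hypothetical gradient

w_before = w.clone()
w.sub_(grad, alpha=lr)                 # w <- w - lr * grad, modifying w in place

print(w)
```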

* Gradients should be zeroed before each new backward pass during training. By default, PyTorch accumulates gradients in the `.grad` buffers (it adds to them rather than overwriting them) whenever `backward()` is called.
* `p.grad.data.zero_()`: Zeros out the gradient of the parameter `p` in place after each iteration.
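
The accumulation behavior can be observed directly. This is a small sketch, assuming a toy objective $\sum_i w_i^2$ (not from the original notes), whose gradient is $2w$. It calls `zero_()` on `w.grad` rather than on `w.grad.data`; both reset the same buffer.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()         # gradient of sum(w^2) is 2w

(w ** 2).sum().backward()      # a second call adds to the stored gradient
accumulated = w.grad.clone()

w.grad.zero_()                 # reset before the next iteration
cleared = w.grad.clone()

print(first, accumulated, cleared)
```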

### 12.5.5 Concise Implementation

* Class `torch.nn.MSELoss(reduction='none')`: Creates a criterion that measures the squared error (squared L2 norm) between each element in the input $x$ and target $y$. The default is `reduction='mean'`, which averages these element-wise errors into a scalar.
* The unreduced loss (i.e., with `reduction` set to `'none'`) can be described as:
$$
l(x, y) = L = \{l_1, \ldots, l_N\}^\top, \quad l_n = (x_n - y_n)^2
$$

In [7]:
from d2l import torch as d2l
import torch
from torch import nn

In [8]:
loss = nn.MSELoss()
input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5)
output = loss(input, target)
output.backward()
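
To relate the unreduced loss to the scalar that `nn.MSELoss()` returns by default, the following sketch compares the three `reduction` settings on random data. The names `x`, `y`, `per_element`, `mean_loss`, and `sum_loss` are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(3, 5)
y = torch.randn(3, 5)

per_element = nn.MSELoss(reduction='none')(x, y)  # same shape as the input
mean_loss = nn.MSELoss()(x, y)                    # default: reduction='mean'
sum_loss = nn.MSELoss(reduction='sum')(x, y)

print(per_element.shape)
print(torch.allclose(per_element.mean(), mean_loss))
print(torch.allclose(per_element.sum(), sum_loss))
```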

* `torch.optim.Optimizer.step()`: Performs a single optimization step (parameter update).

* `torch.optim`: A package implementing various optimization algorithms.

* Class `torch.optim.SGD`: Implements stochastic gradient descent (optionally with momentum).
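
These pieces combine into a minibatch SGD training loop. The sketch below is one possible arrangement, not the book's implementation. It assumes a hypothetical toy regression problem ($y = 2x - 1$ plus noise), a batch size of 32, and a learning rate of 0.1, all chosen only for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical toy data: y = 2x - 1 plus a little noise.
X = torch.randn(256, 1)
y = 2 * X - 1 + 0.01 * torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

net = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

with torch.no_grad():
    initial_loss = loss_fn(net(X), y).item()

for epoch in range(5):
    for Xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(net(Xb), yb)    # loss on the current minibatch
        loss.backward()                # compute gradients
        optimizer.step()               # update the parameters

with torch.no_grad():
    final_loss = loss_fn(net(X), y).item()

print(f'initial loss {initial_loss:.4f}, final loss {final_loss:.4f}')
```

Calling `optimizer.zero_grad()` at the start of each iteration serves the same purpose as zeroing each parameter's gradient by hand.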