
In [None]:
!pip install d2l==1.0.0-alpha1.post0 --quiet

## 10.7 Encoder-Decoder Seq2Seq for Machine Translation

### 10.7.2 Encoder

In [None]:
import torch
from torch import nn

input_size, hidden_size, num_layers = 10, 20, 2
rnn = nn.GRU(input_size, hidden_size, num_layers)

rnn._flat_weights_names

['weight_ih_l0',
 'weight_hh_l0',
 'bias_ih_l0',
 'bias_hh_l0',
 'weight_ih_l1',
 'weight_hh_l1',
 'bias_ih_l1',
 'bias_hh_l1']

In [None]:
rnn._parameters

#### Embedding
* An embedding layer is mathematically equivalent to a linear layer without a bias term applied to one-hot encoded inputs. In practice, it performs a row lookup instead of a matrix-vector multiplication.

* This makes an embedding an efficient alternative to a single linear layer when the number of input features is large, as is common with text data in natural language processing (NLP).

* Class `torch.nn.Embedding(num_embeddings, embedding_dim)`: A simple lookup table that stores embeddings of a fixed dictionary and size.

* This module is often used to store word embeddings and retrieve them by index. The input to the module is a tensor of indices, and the output is the corresponding word embeddings.

**Parameters:**

* *num_embeddings(int)*: Size of the dictionary of embeddings.
* *embedding_dim(int)*: The size of each embedding vector.

**Shape:**

* `input: (*)`: IntTensor or LongTensor of arbitrary shape containing the indices to extract.
* `output: (*, H)`: Where `*` is the input shape and `H` is `embedding_dim`.
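The lookup-versus-multiplication point can be checked numerically. The sketch below is a minimal illustration, not part of the original notebook: the index values are arbitrary, and `looked_up`, `multiplied`, and `same` are names chosen here. It compares a direct lookup with a one-hot encoding multiplied by the same weight matrix.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_size = 10, 8
embedding = nn.Embedding(vocab_size, embed_size)

# Arbitrary token indices chosen for illustration
indices = torch.tensor([1, 4, 4, 7])

# Path 1: direct row lookup in the embedding table
looked_up = embedding(indices)

# Path 2: one-hot encode, then multiply by the same weight matrix
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight

same = torch.allclose(looked_up, multiplied)
print(looked_up.shape, same)
```

Both paths give the same vectors, but the lookup never materializes the `(4, vocab_size)` one-hot matrix, which is where the saving comes from when the vocabulary is large.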

In [None]:
vocab_size, embed_size = 10, 8

embedding = nn.Embedding(vocab_size, embed_size)

In [None]:
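batch_size, num_steps = 4, 9
X = torch.zeros((batch_size, num_steps))  # dummy token indices, all equal to 0

# Transpose to (num_steps, batch_size) and cast to an integer dtype before the lookup
embs = embedding(X.t().type(torch.int32))
embs[0,:,:]  # embeddings at the first time step, shape (batch_size, embed_size)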

tensor([[-0.1544,  0.7103,  0.2287,  1.2542,  0.0098,  0.2911, -0.1725, -1.0152],
        [-0.1544,  0.7103,  0.2287,  1.2542,  0.0098,  0.2911, -0.1725, -1.0152],
        [-0.1544,  0.7103,  0.2287,  1.2542,  0.0098,  0.2911, -0.1725, -1.0152],
        [-0.1544,  0.7103,  0.2287,  1.2542,  0.0098,  0.2911, -0.1725, -1.0152]],
       grad_fn=<SliceBackward0>)

#### In-depth Look at `nn.Embedding()`
* The `nn.Embedding()` layer is a simple lookup table that maps each index value to a row of a weight matrix, i.e. a vector of a fixed dimension.

* This simple operation is the foundation of many advanced NLP architectures, allowing discrete input symbols to be processed in a continuous space.

* The `nn.Embedding` layer takes at least two arguments, **the vocabulary size** and **the size of the encoded representation for each word**.

* Let's say we have a vocabulary of 1000 words; then the first argument would be 1000. Each word in the vocabulary is represented by a vector of fixed size, so the second argument is the size of the learned embedding for each word.

In [None]:
vocab_size, embed_size = 10, 50

embedding = nn.Embedding(vocab_size, embed_size)

In the above example, PyTorch creates a lookup table named `embedding` that has 10 rows and 50 columns. Each row represents a single word embedding. By default, the weights are initialized with the `nn.init.normal_()` function from the `torch.nn.init` module, that is, drawn from a standard normal distribution (mean 0, standard deviation 1), so individual values can fall outside the range -1 to 1.
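To inspect the freshly created table empirically, the following sketch prints its shape and a few summary statistics of the initial weights. The seed and the variable names `weight` and `shape` are choices made here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, embed_size = 10, 50
embedding = nn.Embedding(vocab_size, embed_size)

# The lookup table itself is stored in `embedding.weight`
weight = embedding.weight.detach()
shape = tuple(weight.shape)
print(shape)

# Summary statistics of the initial values
print(f"min={weight.min().item():.3f} max={weight.max().item():.3f}")
print(f"mean={weight.mean().item():.3f} std={weight.std().item():.3f}")
```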

In [None]:
embedding(torch.LongTensor([0]))

tensor([[-0.8511,  0.1879,  0.7696,  0.6570, -1.3819,  1.5425,  1.2204,  1.3877,
         -0.1510, -1.0491,  0.0389, -0.0055, -0.2214,  1.0159,  0.8809,  0.5047,
         -1.0305, -0.2009,  1.4088, -1.9277, -1.5470, -0.8555,  0.1258,  0.8923,
          1.4227,  0.9337, -0.6947, -0.0893,  0.0046,  1.4121,  0.6984, -1.4294,
          0.0077,  0.1171,  1.0447, -1.2901,  0.9510,  1.2457,  1.4315,  0.9397,
          1.4418, -0.4111,  1.3055, -0.7426, -0.7994, -0.3499,  0.5569, -0.8989,
         -0.5761,  1.3471]], grad_fn=<EmbeddingBackward0>)

The above numbers get tuned and optimized during the training process to convey the meaning of a certain word.
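As a rough illustration of how training touches these numbers, the sketch below runs one plain SGD step on a toy objective. The objective and the indices 2 and 5 are assumptions made purely for this example. Only the rows that were actually looked up receive a nonzero gradient, so only those rows change.

```python
import torch
from torch import nn

torch.manual_seed(0)
embedding = nn.Embedding(10, 50)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

# Snapshot of the table before the update
before = embedding.weight.detach().clone()

# Toy objective (illustrative only): shrink the vectors of tokens 2 and 5
indices = torch.tensor([2, 5])
loss = embedding(indices).pow(2).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Find which rows of the table differ from the snapshot
changed = (embedding.weight.detach() != before).any(dim=1)
changed_rows = changed.nonzero().flatten().tolist()
print(changed_rows)
```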

* `torch.Tensor.repeat()`: Repeats the tensor along the specified dimensions; each argument gives the number of copies along the corresponding dimension.

In [3]:
import torch

X = torch.tensor([1, 2])

X.repeat(2, 2), X.repeat(2, 2).shape

(tensor([[1, 2, 1, 2],
         [1, 2, 1, 2]]), torch.Size([2, 4]))
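In an encoder-decoder model, one plausible use of `repeat` is to copy a single context vector across every decoding time step so it can be concatenated with the decoder's input embeddings. The sketch below assumes time-major tensors; the sizes and the names `context`, `embs`, and `decoder_input` are illustrative and not taken from this notebook.

```python
import torch

num_steps, batch_size, embed_size, num_hiddens = 9, 4, 8, 16

# Stand-ins for the decoder's embedded inputs and the encoder's final hidden state
embs = torch.randn(num_steps, batch_size, embed_size)
context = torch.randn(batch_size, num_hiddens)

# Prepend a time dimension and copy the context once per time step
context_rep = context.repeat(num_steps, 1, 1)

# Concatenate along the feature dimension
decoder_input = torch.cat((embs, context_rep), dim=-1)
print(context_rep.shape, decoder_input.shape)
```

Because `repeat` is given more arguments than the tensor has dimensions, the extra leading dimension is added first and then tiled.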

### `torch.optim` package
This is a package implementing various optimization algorithms.
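A typical usage pattern is to construct an optimizer over a model's parameters and then repeat three calls each iteration: `zero_grad()`, `backward()`, and `step()`. The sketch below is a minimal example on synthetic data; the linear model, the target function `3 * x + 1`, and the hyperparameters are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(1, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

# Synthetic regression data (illustrative only)
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 3 * x + 1

losses = []
for _ in range(200):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # compute gradients
    optimizer.step()               # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Swapping in another algorithm from the package only changes the line that constructs `optimizer`.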

### Adam Optimizer
* The Adam optimization algorithm is an extension of stochastic gradient descent.
* Stochastic gradient descent maintains a single learning rate for all weight updates, and that learning rate does not change during training.
* In Adam, a learning rate is maintained for each network weight (parameter) and separately adapted as learning unfolds.
* Adam combines the advantages of the **Adaptive Gradient Algorithm** (AdaGrad) and **Root Mean Square Propagation** (RMSProp), which are two other extensions of stochastic gradient descent.
* **Adaptive Gradient Algorithm**: maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language processing and computer vision problems).
* **Root Mean Square Propagation**: also maintains per-parameter learning rates, which are adapted based on a moving average of the recent magnitudes of the gradients for each weight (i.e. how quickly it is changing).
* Rather than adapting the per-parameter learning rates using only the average of the squared gradients (the second moment, or uncentered variance) as in **Root Mean Square Propagation**, Adam also makes use of the average of the gradients themselves (the first moment, or mean).
* The Adam algorithm calculates an exponential moving average of the gradient and of the squared gradient, and the parameters `beta1` and `beta2` control the decay rates of these moving averages.