Q1. Suppose that we design a deep architecture to represent a sequence by stacking self-attention layers with positional encoding. What could be issues? (paragraph format)

Ans1:

The issues with this design fall into three areas:

1. Depth
	- If the dataset is small or highly variable, deep models tend to memorize the training data and overfit.
	- Without skip connections between layers, gradients can vanish during backpropagation before they reach the earliest layers.
	- Similarly, without proper regularization, normalization, and activation functions, the loss can grow large enough to overflow, which leads to NaN gradients.
2. Self-attention
	- Self-attention, though revolutionary, typically needs large amounts of data (usually through pre-training) before it clearly outperforms RNNs.
	- Self-attention can in principle handle sequences of any length. However, its time and memory requirements grow quadratically with sequence length.
	- Because every position attends to every other position, self-attention imposes little structure of its own and can settle on trivial solutions that exploit incidental regularities in the data.
3. Positional encoding
	- If the relationships between time steps do not depend simply on their relative positions, the classic sine/cosine encoding cannot capture them, which leads to a sub-optimal model.
	- Positional encoding can produce an overfit model that depends entirely on positions and fails to learn from the data embeddings.
	- A learnable encoding could mitigate this issue, but it requires large amounts of varied data to learn a good encoding.
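To make the point about relative positions concrete, here is a minimal sketch of the classic sine/cosine encoding in plain Python. The function name `sinusoidal_encoding`, the base of 10000, and the small table size are assumptions chosen for illustration. The sketch checks that the dot product between two encodings depends only on the offset between the positions, which is why this encoding suits relationships driven by relative position and little else.

```python
import math

def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Return a num_positions x dim table of fixed sine/cosine encodings (dim must be even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):
            # Each (sin, cos) pair shares one frequency; frequencies shrink as i grows.
            angle = pos / (base ** (i / dim))
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

pe = sinusoidal_encoding(num_positions=16, dim=8)

# Both pairs are 3 steps apart, so their similarities should match.
sim_a = dot(pe[2], pe[5])
sim_b = dot(pe[7], pe[10])
print(sim_a, sim_b)
```

Nothing in this table is learned, so any positional relationship that is not a function of the offset has to be picked up by the layers above it.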

Self-attention and positional encodings are revolutionary inventions, but architectures built on them are much more computationally expensive than RNN- and CNN-based architectures, and they are highly data-hungry. Even so, transformers can achieve close to optimal performance given suitable normalization, regularization, well-processed datasets, and custom positional information that fully captures the relevant relationships.
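As a rough illustration of the quadratic cost, the sketch below counts the entries in the attention score matrix for a single sequence and a single layer. The helper `attention_score_bytes` is hypothetical, and it assumes 4-byte (float32) scores and one head by default. It ignores every other source of memory use.

```python
def attention_score_bytes(seq_len, num_heads=1, bytes_per_value=4):
    """Bytes needed for the seq_len x seq_len score matrix of one layer and one sequence."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (512, 1024, 2048, 4096):
    mib = attention_score_bytes(n) / 2**20
    print(f"seq_len={n:5d}  scores={n * n:>10,d}  ~{mib:.0f} MiB per head")
```

Doubling the sequence length quadruples the size of the score matrix, and the cost is paid again in every stacked layer.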


In [3]:
# Q2: Can you design a learnable positional encoding method using PyTorch? (Create a dummy dataset)


import torch
from torch import nn


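class customEmbedding(nn.Module):
    """Token embedding plus a learnable positional embedding."""

    def __init__(self, hidden_dim, sequence_len, vocab_size):
        super().__init__()

        # One learnable vector per token id and one per position index.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(sequence_len, hidden_dim)

    def forward(self, x):
        # x holds integer token ids, shaped (seq_len,) or (batch, seq_len),
        # with seq_len <= sequence_len. Look up only the positions present,
        # so batched inputs and shorter sequences also work.
        positions = torch.arange(x.size(-1), device=x.device)
        return self.embed(x) + self.pos_emb(positions)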
        
vocab_size=1000
sequence_length=10
hidden_dim=256
cstmEmbed= customEmbedding(hidden_dim,sequence_length, vocab_size)
inputSeq = torch.randperm(vocab_size)[:sequence_length]
print("input is a sequence of integers of sequence_length(10):\n",inputSeq,"\n")

print("input shape:",inputSeq.shape)
print("positional_embedding shape:",cstmEmbed.pos_emb.weight.shape)
print("output shape:",cstmEmbed(inputSeq).shape)


input is a sequence of integers of sequence_length(10):
 tensor([829, 317, 761, 762, 535, 937, 935, 482, 541, 938]) 

input shape: torch.Size([10])
positional_embedding shape: torch.Size([10, 256])
output shape: torch.Size([10, 256])
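The run above passes a single sequence through the module. As a hedged sketch of the dummy dataset the question asks for, the block below builds a random batch and trains on a toy task: predict each token's own position, which can only be solved through the positional embedding. The class name `TokenAndPositionEmbedding`, the toy task, the linear head, and all hyperparameters are assumptions for illustration, not part of the original answer.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TokenAndPositionEmbedding(nn.Module):  # hypothetical name for this sketch
    def __init__(self, hidden_dim, max_len, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.pos_emb = nn.Embedding(max_len, hidden_dim)  # learnable positions

    def forward(self, x):
        positions = torch.arange(x.size(-1), device=x.device)
        return self.embed(x) + self.pos_emb(positions)

vocab_size, max_len, hidden_dim, batch_size = 1000, 10, 32, 8

# Dummy dataset: random token ids; the target for each token is its position.
tokens = torch.randint(0, vocab_size, (batch_size, max_len))
targets = torch.arange(max_len).expand(batch_size, max_len)

model = nn.Sequential(
    TokenAndPositionEmbedding(hidden_dim, max_len, vocab_size),
    nn.Linear(hidden_dim, max_len),  # scores one class per position
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = model(tokens)  # shape: (batch, seq_len, max_len)
    loss = loss_fn(logits.reshape(-1, max_len), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print("first loss:", losses[0], "last loss:", losses[-1])
```

Because the token ids are random, a falling loss here indicates that gradients are reaching `pos_emb` and that the positional table is being learned rather than fixed.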
