In [1]:
# Peter Wu
# qw262@corenll.edu
# 1/22/2024

### Suppose that we design a deep architecture to represent a sequence by stacking self-attention layers with positional encoding. What could be issues? (paragraph format)

#### Sequence model (transformers) and self-attention layers explained, using a natural language (text) example

1. Sequence model: e.g., a transformer
1. Positional encoding: e.g., position of words in a sentence
1. **Self-attention layers (KQV attention)**: key, query, and value are all learned projections of the same input; each query is compared with every key, and the resulting weights mix the values -> together they form an attention head
    1. Encoder: what we have (e.g., a sentence)
        1. Word-2-vec + positional encoding: an R<sup>128</sup> vector to represent a word x<sub>i</sub>
        1. Calculate attention of each pair: with key and query, use softmax(Q K<sup>T</sup> / sqrt(d<sub>k</sub>)) to get a distribution over each pair of words; the model now knows the relation between each pair
        1. Multiply attention with value: we get x<sup>'</sup><sub>i</sub>
        1. Repeat the steps: running several heads in parallel gives multi-head self-attention, and repeating the whole block gives **stacked self-attention layers**; the results can be combined by, for example, a fully connected neural network: x = FC(x<sup>''</sup><sub>i</sub> , x<sup>'</sup><sub>i</sub> , x<sub>i</sub> , ...)
    1. Decoder: what we want to predict/match (e.g., the next word)
        1. Masked self-attention, then cross-attention: look through the past (relation between the current word and the words that have already occurred), then attend to the encoder output <- same steps as self-attention
        2. **Find the next word: from the English vocabulary, which are the most likely words, given cross-attention?** 
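
The attention steps in the list above can be sketched in PyTorch. This is a minimal single-head version, assuming the scaled dot-product form from the referenced paper; the class name `SingleHeadSelfAttention` and the sizes (2 sentences, 5 words) are illustrative choices, not part of the original notes.

```python
import math
import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    """One self-attention head: softmax(Q K^T / sqrt(d_k)) V (illustrative name)."""

    def __init__(self, dim):
        super().__init__()
        # query, key, and value are all learned projections of the same input
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # pairwise scores between all positions: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        # softmax over the last axis: each row is a distribution over words
        weights = torch.softmax(scores, dim=-1)
        # weighted sum of values gives the updated representation x'
        return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 5, 128)  # assumed sizes: 2 sentences, 5 words, R^128 vectors
attn = SingleHeadSelfAttention(dim=128)
out, weights = attn(x)
print(out.shape, weights.shape)
```

The output keeps the input shape, which is what allows several such layers to be stacked.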

#### Issues

**Interpretability**: as with other NNs, although stacking layers makes linguistic sense (there are many relationships between words in a sentence), it is hard to explain why each layer is necessary and how the layers together form a decision-making process.

**Overfitting**: depending on the training data, the model can have high variance if it overfits, e.g., by stacking more self-attention layers than necessary. A limited amount of training data, together with too many layers, can lead to over-explaining the data: the model finds a relationship that does not exist or cannot be generalized.

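**Time/Space Complexity**: large language models often have an enormous number of parameters, and training on a large corpus takes time. Self-attention also compares every pair of positions, so its cost grows quadratically with sequence length. Training, testing, and storing the model demand a lot of resources.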
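
Part of this cost comes from the attention scores themselves: each head in each layer scores every position against every other position. The rough count below is only a sketch; the helper name `attention_score_entries` is hypothetical, and the 8 heads and 6 layers are illustrative values (they match the base configuration in the referenced paper).

```python
def attention_score_entries(seq_len, num_heads=1, num_layers=1):
    """Number of entries in the attention score matrices for one sequence."""
    return seq_len * seq_len * num_heads * num_layers


for n in (128, 1024, 8192):
    entries = attention_score_entries(n, num_heads=8, num_layers=6)
    # 4 bytes per entry assumes float32 scores
    print(n, entries, f"{entries * 4 / 2**20:.1f} MiB at float32")
```

Doubling the sequence length quadruples the count, independent of the number of parameters.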

**Positional Encoding**: one inherent limitation is that for longer sequences (sentences), the positions of words become difficult to interpret, and simply stacking layers may not help. The input may need further transformation before being sent to the self-attention layers.
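
One concrete way longer sequences cause trouble, assuming the positions are stored in a learned lookup table of fixed size: positions beyond the table simply have no encoding. The sizes below (`max_seq = 8`, `dim = 4`) are hypothetical and only meant to make the failure visible.

```python
import torch
import torch.nn as nn

max_seq, dim = 8, 4  # hypothetical table size
table = nn.Parameter(torch.zeros(max_seq, dim))  # one learnable row per position

# a sequence shorter than the table works
short = torch.randn(1, 6, dim)
short_out = short + table[torch.arange(short.size(1))]
print(short_out.shape)

# a sequence longer than the table asks for rows that do not exist
long = torch.randn(1, 12, dim)
try:
    long + table[torch.arange(long.size(1))]
    overflowed = False
except (IndexError, RuntimeError):
    overflowed = True
print("position out of range:", overflowed)
```

Stacking more attention layers on top does not change this, since the missing rows were never learned.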

Referenced Sources:
- Cornell CS 5740 Intro to Machine Learning (taken last year)
- https://arxiv.org/pdf/1706.03762.pdf
- https://medium.com/analytics-vidhya/understanding-q-k-v-in-transformer-self-attention-9a5eddaa5960

### Can you design a learnable positional encoding method using PyTorch? (Create dummy dataset)

In [2]:
import torch
import torch.nn as nn
import torch.utils.data as data

In [3]:
class DummyDataset(data.Dataset):
    """
    Create dummy dataset, inherit pytorch dataset
    """
    
    def __init__(self, dim=10, seq_len=10, num_seq=3000):
        """Constructor to instantiate
        
        Parameters
        ----------
        dim : dimension of the positional encoding
        seq_len : length of a sequence (# of words)
        num_seq : number of sequences
        """
        
        self.dim = dim
        self.seq_len = seq_len
        self.num_seq = num_seq
        
        # generate data from a standard normal distribution (mean 0, var 1)
        self.data = torch.randn(num_seq, seq_len, dim) 
        
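    def __len__(self):
        """number of sequences, needed by samplers and data.DataLoader"""
        return self.num_seq

    def __getitem__(self, idx):
        """get a sequence in the dataset by index
            overrides data.Dataset.__getitem__
        
        Parameters
        ----------
        idx : index of the sequence, as in dataset[idx]
        """
        
        # the label is a constant placeholder (0) for this dummy dataset
        return self.data[idx], torch.tensor(0)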

In [4]:
class PositionalEncoding(nn.Module):
    """
    Create a learnable pos-encoding, inherit nn.Module
    """
    
    def __init__(self, dim, max_seq=500):
        """Constructor to instantiate
        
        Parameters
        ----------
        dim : dimension of the positional encoding, same as DummyDataset.dim
        max_seq: max length of a sequence
        """
        
        super(PositionalEncoding, self).__init__() 
        
        # create the pos-encoding tensors <- learnable
        self.pos_encode = nn.Parameter(torch.zeros(max_seq, dim))

    def forward(self, x):
        """forward pass
        
        Parameters
        ----------
        x : batch of sequences, shape (batch, seq_len, dim)
            The encoding here is a simple learned lookup table;
            a fixed function such as sin(pos) could be used instead
        """
        seq_len = x.size(dim=1) # obtain seq length
        
        # get the positions of the sequence, on the same device as x
        pos = torch.arange(seq_len, device=x.device)
        
        # get the pos-encoding with dim, from the learnable self.pos_encode
        pos_encode = self.pos_encode[pos, :]
                
        # return word-2-vec + positional encoding
        return x + pos_encode
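
The docstring above mentions that a sine-based encoding could be used instead of a learned table. A minimal sketch of that alternative follows, assuming the sinusoidal form from the referenced paper and an even `dim`; the class name `SinusoidalPositionalEncoding` is illustrative and this is not the learnable method used in this notebook.

```python
import math
import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Fixed (non-learnable) sinusoidal encoding; assumes dim is even."""

    def __init__(self, dim, max_seq=500):
        super().__init__()
        pos = torch.arange(max_seq, dtype=torch.float32).unsqueeze(1)  # (max_seq, 1)
        # 1 / 10000^(2i / dim) for each even index 2i
        div = torch.exp(
            torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
        )
        table = torch.zeros(max_seq, dim)
        table[:, 0::2] = torch.sin(pos * div)  # even dimensions
        table[:, 1::2] = torch.cos(pos * div)  # odd dimensions
        # a buffer follows the module across devices but is not trained
        self.register_buffer("table", table)

    def forward(self, x):
        # x: (batch, seq_len, dim); add the first seq_len rows of the table
        return x + self.table[: x.size(1)]


enc = SinusoidalPositionalEncoding(dim=10)
out = enc(torch.zeros(1, 10, 10))
print(out.shape, len(list(enc.parameters())))
```

Unlike the learnable version, this module has no parameters, so nothing about the positions is fitted to the data.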

In [5]:
class LinearModel(nn.Module):
    """
    Create a linear model, inherit nn.Module
    Predict the mean of a sequence's encoding
    """
        
    def __init__(self, dim=10, seq_len=10):
        """Constructor to instantiate
        
        Parameters
        ----------
        dim : dimension of the positional encoding, same as DummyDataset.dim
        seq_len : length of a sequence, used as the max length of the positional encoding
        """
            
        super(LinearModel, self).__init__()
        
        # use pos-encoding in the linear model <- learnable
        self.pos_encode = PositionalEncoding(dim, seq_len)
        
        # a 1 layer model <- learnable
        self.layer = nn.Linear(dim, 1)

    def forward(self, x):
        """forward pass
        
        Parameters
        ----------
        x : batch of sequences, shape (batch, seq_len, dim)
        """
        
        x = self.pos_encode(x) # add learnable pos-encoding
        output = self.layer(x.mean(dim=1)) # average over positions, then apply the learnable layer
        
        return output

In [6]:
torch.manual_seed(0) # fix seed
dataset = DummyDataset()
model = LinearModel(dim=10, seq_len=10)

In [7]:
dummy_input = dataset[0][0].unsqueeze(0)  
output = model(dummy_input)
print(output)
print(dataset[0][0].unsqueeze(0).mean())

tensor([[0.0671]], grad_fn=<AddmmBackward0>)
tensor(0.0880)


In [8]:
dummy_input = dataset[1][0].unsqueeze(0) 
output = model(dummy_input)
print(output)
print(dataset[1][0].unsqueeze(0).mean())

tensor([[0.0436]], grad_fn=<AddmmBackward0>)
tensor(-0.0395)


In [9]:
dummy_input = dataset[2][0].unsqueeze(0)
output = model(dummy_input)
print(output)
print(dataset[2][0].unsqueeze(0).mean())

tensor([[-0.2207]], grad_fn=<AddmmBackward0>)
tensor(0.0030)
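
The outputs above come from an untrained model, so they are not expected to match the sequence means. The sketch below shows one way the parameters could be trained. It rests on assumptions that are not in the notebook: the target is taken to be the mean of each sequence (the dataset itself returns a constant label of 0), `TinyModel` is a compact stand-in for `LinearModel`, and the optimizer, learning rate, and number of steps are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, seq_len, num_seq = 10, 10, 3000


class TinyModel(nn.Module):
    """Compact stand-in for LinearModel: learnable positions + one linear layer."""

    def __init__(self, dim, seq_len):
        super().__init__()
        self.pos_encode = nn.Parameter(torch.zeros(seq_len, dim))  # learnable
        self.layer = nn.Linear(dim, 1)

    def forward(self, x):
        x = x + self.pos_encode[: x.size(1)]  # add positional encoding
        return self.layer(x.mean(dim=1))  # average over positions, then project


x = torch.randn(num_seq, seq_len, dim)
y = x.mean(dim=(1, 2)).unsqueeze(1)  # assumed target: mean of each sequence

model = TinyModel(dim, seq_len)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):  # full-batch gradient steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()  # gradients reach both the linear layer and pos_encode
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Because the model averages over positions before the linear layer, the learned encoding can only shift that average in this setup; a task where word order matters would be needed to see it learn position-specific values.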
