# Part 7

## Lecture 7

**Word embeddings**: vector representations of words in which distances between vectors reflect semantic similarity.
**Word2vec**: maps a discrete word to a d-dimensional real-valued feature vector.
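- Trained by predicting context
    - Skip-gram
        - predict the k surrounding words from the center word
    - Continuous bag-of-words (CBOW)
        - predict the center word from the k surrounding words
- **Cannot** use one-hot vectors to measure word similarities: any two distinct one-hot vectors are orthogonal, so every pair of different words looks equally dissimilar.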
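
The point about one-hot vectors can be checked with a few lines of plain Python. This is a minimal sketch; the three-word vocabulary and the helper names `one_hot` and `cosine_similarity` are made up for illustration and are not part of the lecture material.

```python
import math


def one_hot(index, vocab_size):
    """Return a one-hot vector of length vocab_size with a 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "car"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

# "cat" is equally (dis)similar to "dog" and to "car": both similarities are 0.
sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_car = cosine_similarity(vectors["cat"], vectors["car"])
print(sim_cat_dog, sim_cat_car)  # 0.0 0.0
```

A learned embedding avoids this because its vectors are dense, so related words can end up close together.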

<img src="./image/Predict_context.png" height="150" />
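
The difference between skip-gram and CBOW is easiest to see in the training pairs each one produces. The sketch below is illustrative only: the function names are hypothetical, and `window` is assumed to mean the number of words taken on each side of the center word.

```python
def skip_gram_pairs(tokens, window):
    """Return (center, context) pairs: the center word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """Return (context_words, center) pairs: the neighbours predict the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


tokens = ["the", "cat", "sat", "down"]  # toy sentence for illustration
sg = skip_gram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
print(cb[1])   # (['the', 'sat'], 'cat')
```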


## Assignment 7: Self-attention and Transformers
Given a sequence of input vectors, each output vector of self-attention is a weighted sum over these vectors.

Self-attention is **permutation equivariant**: permuting the input sequence yields exactly the same output vectors, permuted in the same way. For a Transformer this can be detrimental, because word order is ignored. Ex. *sequence classification*, where the whole sequence is represented as a single feature vector: placing a *pooling layer* before the classification layer makes the Transformer **permutation invariant**, meaning it gives the same prediction regardless of the order of the sequence.
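
Both properties can be verified numerically. The following is a minimal sketch, assuming a single attention head with random (untrained) weight matrices and mean pooling as the pooling layer; `simple_self_attention` is a hypothetical helper, not the class used later in this notebook.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)


def simple_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention on x of shape [t, k], without positional information."""
    q, keys, v = x @ w_q, x @ w_k, x @ w_v
    weights = F.softmax(q @ keys.T / x.size(1) ** 0.5, dim=1)
    return weights @ v


t, k = 5, 6
x = torch.randn(t, k)
# Random weights, scaled down to keep the attention logits moderate.
w_q, w_k, w_v = (torch.randn(k, k) / k ** 0.5 for _ in range(3))

perm = torch.randperm(t)
y = simple_self_attention(x, w_q, w_k, w_v)
y_perm = simple_self_attention(x[perm], w_q, w_k, w_v)

# Equivariance: permuting the input permutes the output rows in the same way.
equivariant = torch.allclose(y[perm], y_perm, atol=1e-4)

# Invariance: after mean pooling over the sequence, the order no longer matters.
invariant = torch.allclose(y.mean(dim=0), y_perm.mean(dim=0), atol=1e-4)
print(equivariant, invariant)
```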

This can be alleviated by learning **positional embeddings**, which add information about each token's position to its input vector.
- Makes the term "Deep Learning" differ from "Learning Deep".
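
As a rough illustration, the sketch below adds a learnable position vector to each token embedding. The toy vocabulary, the `embed` helper and the use of `nn.Embedding` for positions are assumptions made for this example; the embeddings are randomly initialised and untrained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = {"deep": 0, "learning": 1}  # hypothetical toy vocabulary
k, max_len = 6, 2

token_emb = nn.Embedding(len(vocab), k)
pos_emb = nn.Embedding(max_len, k)  # one learnable vector per position


def embed(words, use_positions):
    """Look up token embeddings and optionally add positional embeddings."""
    ids = torch.tensor([vocab[w] for w in words])
    x = token_emb(ids)
    if use_positions:
        x = x + pos_emb(torch.arange(len(words)))
    return x


# Without positions, "learning deep" is just "deep learning" reordered.
a = embed(["deep", "learning"], use_positions=False)
b = embed(["learning", "deep"], use_positions=False)
same_without_pos = torch.allclose(a, b.flip(0))

# With positions, the two inputs are no longer reorderings of each other.
a_pos = embed(["deep", "learning"], use_positions=True)
b_pos = embed(["learning", "deep"], use_positions=True)
same_with_pos = torch.allclose(a_pos, b_pos.flip(0))

print(same_without_pos, same_with_pos)
```

Because the input vectors now depend on position, a permutation-equivariant layer applied to them can still tell the two word orders apart.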

<img src="./image/Three_operations.png" height="300" />

**Self-attention**:
- Learns how to re-weight each word with respect to all other words.
- No more recurrence; each vector can be computed in parallel.
- Measures the similarity between two input vectors (dot product of query and key).
- The similarities are normalized with softmax so that they lie in \[0,1] and sum to 1, acting as probabilities.
- input tensor: \[batch_size, sequence_length, embedding_dimension (dimension of input vector)]
- Learn relationships between input vectors via three operations (implemented as linear layers *without bias*):
    - **Query-mapping**: creates the query; weight matrix $$\mathbf{W_q}$$ is (embedding_dimension x embedding_dimension)
        - Calc: $$\mathbf{q_i} = \mathbf{W_q} \mathbf{x_i}$$
        - Ex: $$q_i$$ asks $$x_j$$ a question -> "Are you a noun?"
    - **Key-mapping**: creates the key; weight matrix $$\mathbf{W_k}$$ is (embedding_dimension x embedding_dimension)
        - Calc: 
            - $$\mathbf{k_i} = \mathbf{W_k} \mathbf{x_i}$$
            - $$w_{ij}' = \frac{\mathbf{q_i}^T \mathbf{k_j}}{\sqrt{k}}$$, where $$k$$ is the embedding dimension
        - Ex: the dot product of $$k_j$$ with $$q_i$$ answers the question -> "Noun-ness: high number if true"
    - **Value-mapping**: creates the value; weight matrix $$\mathbf{W_v}$$ is (embedding_dimension x embedding_dimension)
        - Calc: 
            - $$\mathbf{v_i} = \mathbf{W_v} \mathbf{x_i}$$
            - $$w_{ij} = \text{softmax}(w_{ij}')$$ (softmax taken over $$j$$)
            - $$\mathbf{y_i} = \sum_j w_{ij}\mathbf{v_j}$$
        - Ex: if $$q_i^T k_j$$ is high, the value $$v_j$$ is mostly passed on to the output $$y_i$$
- output tensor: \[batch_size, sequence_length, embedding_dimension (dimension of input vector)]
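
The three mappings can be followed literally, one vector at a time, before looking at a batched implementation. The sketch below uses random (untrained) weight matrices and a single sequence, and compares the loop form of the equations with the equivalent matrix form; all variable names are chosen for this example only.

```python
import torch

torch.manual_seed(0)

t, k = 3, 4  # sequence length and embedding dimension
x = torch.randn(t, k)
W_q, W_k, W_v = (torch.randn(k, k) for _ in range(3))

# q_i = W_q x_i, k_i = W_k x_i, v_i = W_v x_i
q = [W_q @ x[i] for i in range(t)]
keys = [W_k @ x[i] for i in range(t)]
v = [W_v @ x[i] for i in range(t)]

# w'_ij = q_i^T k_j / sqrt(k)
w_prime = torch.stack(
    [torch.stack([q[i] @ keys[j] / k ** 0.5 for j in range(t)]) for i in range(t)]
)

# w_ij = softmax over j, so each row sums to 1
w = torch.softmax(w_prime, dim=1)

# y_i = sum_j w_ij v_j
y = torch.stack([sum(w[i, j] * v[j] for j in range(t)) for i in range(t)])

# The same computation written with matrix products.
Q, K, V = x @ W_q.T, x @ W_k.T, x @ W_v.T
y_matrix = torch.softmax(Q @ K.T / k ** 0.5, dim=1) @ V

print(w.sum(dim=1))
print(torch.allclose(y, y_matrix, atol=1e-5))
```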

In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
from torchinfo import summary


class SelfAttention(nn.Module):
    """
    Self-attention operation with learnable key, query and value embeddings.
    
    Args:
        k: embedding dimension
        
    """
    def __init__(self, k):
        super(SelfAttention, self).__init__()

        # These compute the queries, keys and values
        self.tokeys    = nn.Linear(k, k, bias=False)
        self.toqueries = nn.Linear(k, k, bias=False)
        self.tovalues  = nn.Linear(k, k, bias=False)

    def forward(self, x):

        # Get tensor dimensions: batch size, sequence length and embedding dimension
        b, t, k = x.size()

        # Transform input tensor to queries, keys and values
        queries = self.toqueries(x)
        keys = self.tokeys(x)
        values = self.tovalues(x)
        
        # Compute weights
        w_prime = torch.bmm(queries, keys.transpose(1, 2))
        # Scale weights
        w_prime = w_prime / (k ** 0.5)
        # Apply softmax
        w = F.softmax(w_prime, dim=2)

        # Compute output tensor
        y = torch.bmm(w, values)

        return y


# Learnable parameters: 3 * embedding_dimension * embedding_dimension (no bias terms)
batch_size, sequence_length, embedding_dimension = 4, 5, 6

model_output = summary(
    SelfAttention(embedding_dimension),
    (batch_size, sequence_length, embedding_dimension),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
├─Linear: 1-1                            [4, 5, 6]        [4, 5, 6]        [6, 6]           36               144
├─Linear: 1-2                            [4, 5, 6]        [4, 5, 6]        [6, 6]           36               144
├─Linear: 1-3                            [4, 5, 6]        [4, 5, 6]        [6, 6]           36               144
Total params: 108
Trainable params: 108
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
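
The parameter total in the summary can be reproduced without `torchinfo`. This small sketch rebuilds the three bias-free linear layers and counts their weights; the `count_parameters` helper is hypothetical, and the bias variant is included only for comparison.

```python
import torch.nn as nn


def count_parameters(layers):
    """Total number of trainable parameters over a list of modules."""
    return sum(p.numel() for layer in layers for p in layer.parameters() if p.requires_grad)


k = 6  # embedding dimension used in the summary above

# Query, key and value mappings without bias: 3 * k * k weights.
no_bias = [nn.Linear(k, k, bias=False) for _ in range(3)]
# For comparison: with bias, each layer gains k extra parameters.
with_bias = [nn.Linear(k, k, bias=True) for _ in range(3)]

n_no_bias = count_parameters(no_bias)
n_with_bias = count_parameters(with_bias)
print(n_no_bias, n_with_bias)  # 108 126
```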


**Multi-head (self-)attention**:
- Using multiple heads, each with its own query, key and value vectors, allows learning multiple relations ("noun-ness", "verb-ness", ...)
- Two types:
    - Narrow:
        - Ex: map an embedding of dimension 256, with 8 attention heads, to 8 different 32-dimensional query, key and value vectors. The head outputs are then concatenated back to embedding dimension 256.
        - **More computationally efficient**.
        - Query, Key and values tensor: (embedding_dimension, embedding_dimension / heads)
    - Wide:
        - Ex: map an embedding of dimension 256, with 8 attention heads, to 8 different 256-dimensional query, key and value vectors. The head outputs are then concatenated and mapped back to embedding dimension 256.
        - **More expressive power**.
        - Query, Key and values tensor: (embedding_dimension, embedding_dimension * heads)
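- input tensor: \[batch_size, sequence_length, embedding_dimension (dimension of input vector)]; internally the heads are folded into the batch dimension, giving \[batch_size * heads, sequence_length, embedding_dimension] for wide attention

- output tensor: \[batch_size, sequence_length, embedding_dimension (dimension of input vector)]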
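
For comparison with the wide variant, here is a minimal sketch of one common way to write narrow multi-head attention. It is an assumption-laden illustration, not the course's reference implementation: the class name `NarrowMultiHeadAttention` is hypothetical, all heads share one `k x k` projection that is split into `heads` chunks of size `k // heads`, and the scaling uses the per-head dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NarrowMultiHeadAttention(nn.Module):
    """Hypothetical narrow multi-head self-attention: each head sees k // heads dimensions."""

    def __init__(self, k, heads=8):
        super().__init__()
        assert k % heads == 0, "k must be divisible by heads for the narrow variant"
        self.heads = heads
        self.tokeys = nn.Linear(k, k, bias=False)
        self.toqueries = nn.Linear(k, k, bias=False)
        self.tovalues = nn.Linear(k, k, bias=False)
        self.unifyheads = nn.Linear(k, k)

    def forward(self, x):
        b, t, k = x.size()
        h, s = self.heads, k // self.heads  # s: per-head dimension

        def split(z):
            # [b, t, k] -> [b * h, t, s]: fold the heads into the batch dimension
            return z.view(b, t, h, s).transpose(1, 2).reshape(b * h, t, s)

        queries = split(self.toqueries(x))
        keys = split(self.tokeys(x))
        values = split(self.tovalues(x))

        # Scaled dot-product attention per head (scaling by sqrt(s) is an assumption)
        w = F.softmax(torch.bmm(queries, keys.transpose(1, 2)) / s ** 0.5, dim=2)
        y = torch.bmm(w, values).view(b, h, t, s)

        # Concatenate the heads back to dimension k and mix them
        y = y.transpose(1, 2).reshape(b, t, k)
        return self.unifyheads(y)


model = NarrowMultiHeadAttention(k=256, heads=8)
x = torch.randn(4, 5, 256)
y = model(x)
print(y.shape)  # torch.Size([4, 5, 256])
```

With 256 dimensions and 8 heads, each head attends over 32-dimensional queries, keys and values, matching the narrow example in the list above.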


In [None]:
import torch
import torch.nn.functional as F
import torch.nn as nn
from torchinfo import summary


class MultiHeadAttention(nn.Module):
    """
    Wide multi-head self-attention layer.

    Args:
        k: embedding dimension
        heads: number of heads (the wide variant does not require k mod heads == 0)

    """
    def __init__(self, k, heads=8):
        super(MultiHeadAttention, self).__init__()

        self.heads = heads

        # These compute the queries, keys and values for all 
        # heads (as a single concatenated vector)
        self.tokeys    = nn.Linear(k, k * heads, bias=False)
        self.toqueries = nn.Linear(k, k * heads, bias=False)
        self.tovalues  = nn.Linear(k, k * heads, bias=False)

        # This unifies the outputs of the different heads into 
        # a single k-vector
        self.unifyheads = nn.Linear(k * heads, k, bias=True)
        
    def forward(self, x):

        b, t, k = x.size()
        h = self.heads

        # Project input to queries, keys and values
        queries = self.toqueries(x).view(b, t, h, k)
        keys    = self.tokeys(x).view(b, t, h, k)
        values  = self.tovalues(x).view(b, t, h, k)

        # Fold heads into the batch dimension
        keys = keys.transpose(1, 2).reshape(b * h, t, k)
        queries = queries.transpose(1, 2).reshape(b * h, t, k)
        values = values.transpose(1, 2).reshape(b * h, t, k)
        
        # Compute attention weights
        w_prime = torch.bmm(queries, keys.transpose(1, 2))
        w_prime = w_prime / (k ** (1 / 2))
        w = F.softmax(w_prime, dim=2)

        # Apply the self-attention to the values
        y = torch.bmm(w, values).view(b, h, t, k)

        # Swap h, t back, unify heads
        y = y.transpose(1, 2).reshape(b, t, h * k)

        y = self.unifyheads(y)

        return y

## -- Multihead -- ##
# -- Narrow -- Learnable parameters: 
#   ( 3 * embedding_dimension * ( embedding_dimension / heads ) )
# + ( embedding_dimension / heads ) * embedding_dimension ( + embedding_dimension <- bias )

# -- Wide -- Learnable parameters: 
#   ( 3 * embedding_dimension * ( embedding_dimension * heads ) )
# + ( embedding_dimension * heads ) * embedding_dimension ( + embedding_dimension <- bias )
batch_size, sequence_length, embedding_dimension, heads = 4, 5, 6, 8

model_output = summary(
    MultiHeadAttention(embedding_dimension, heads),
    (batch_size, sequence_length, embedding_dimension),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
├─Linear: 1-1                            [4, 5, 6]        [4, 5, 48]       [6, 48]          288              1,152
├─Linear: 1-2                            [4, 5, 6]        [4, 5, 48]       [6, 48]          288              1,152
├─Linear: 1-3                            [4, 5, 6]        [4, 5, 48]       [6, 48]          288              1,152
├─Linear: 1-4                            [4, 5, 48]       [4, 5, 6]        [48, 6]          294              1,152
Total params: 1,158
Trainable params: 1,158
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.02
Params size (MB): 0.00
Estimated Total Size (MB): 0.03
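
The total of 1,158 parameters follows directly from the wide formula in the code comment. The sketch below evaluates that formula and cross-checks it against freshly built layers of the same shapes; `wide_attention_param_count` is a hypothetical helper name.

```python
import torch.nn as nn


def wide_attention_param_count(k, heads):
    """Parameters of wide multi-head attention: three bias-free projections plus unify layer."""
    projections = 3 * k * (k * heads)  # queries, keys, values
    unify = (k * heads) * k + k        # weights plus bias of the unifying layer
    return projections + unify


k, heads = 6, 8
formula_count = wide_attention_param_count(k, heads)

# Cross-check against layers with the same shapes as in MultiHeadAttention.
layers = [nn.Linear(k, k * heads, bias=False) for _ in range(3)]
layers.append(nn.Linear(k * heads, k, bias=True))
layer_count = sum(p.numel() for layer in layers for p in layer.parameters())

print(formula_count, layer_count)  # 1158 1158
```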


**Transformer**:
- Consists of a self-attention layer, layer normalization, a multilayer perceptron (MLP) and residual connections.


<img src="./image/Transformer.png" height="300" />
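
A rough sketch of how these components might be wired together is shown below. Several choices are assumptions rather than something stated in these notes: the class name `TransformerBlock`, the placement of layer normalization after each residual connection, the ReLU MLP with a 4x hidden size, and the use of PyTorch's built-in `nn.MultiheadAttention` (a narrow variant with bias terms) in place of the classes defined above.

```python
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Hypothetical Transformer block: attention and MLP, each with a residual and layer norm."""

    def __init__(self, k, heads=4, mlp_ratio=4):
        super().__init__()
        # Built-in multi-head attention; batch_first=True expects [batch, seq, emb]
        self.attention = nn.MultiheadAttention(k, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(k)
        self.norm2 = nn.LayerNorm(k)
        self.mlp = nn.Sequential(
            nn.Linear(k, mlp_ratio * k),
            nn.ReLU(),
            nn.Linear(mlp_ratio * k, k),
        )

    def forward(self, x):
        # Self-attention: queries, keys and values all come from x
        attended, _ = self.attention(x, x, x, need_weights=False)
        x = self.norm1(x + attended)     # residual connection + layer norm
        x = self.norm2(x + self.mlp(x))  # residual connection + layer norm
        return x


block = TransformerBlock(k=16, heads=4)
x = torch.randn(4, 5, 16)
out = block(x)
print(out.shape)  # torch.Size([4, 5, 16])
```

Because the block maps a `[batch_size, sequence_length, embedding_dimension]` tensor to a tensor of the same shape, several blocks can be stacked one after another.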
