## **Scaled Dot-Product Attention**

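Scaled dot-product attention computes attention scores as the dot product of each query with each key, divided by $\sqrt{d_k}$, where $d_k$ is the dimension of the key. Without this scaling, the magnitude of the dot products grows with $d_k$, which pushes the softmax into regions where its gradients are very small; dividing by $\sqrt{d_k}$ keeps the scores in a range that stabilizes gradients during training. The scaled scores are passed through a softmax to obtain the attention weights, and the output is the weighted sum of the values: $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$.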
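The effect of the scaling factor can be illustrated with a small experiment. The sketch below is not part of the original notebook: it uses only the Python standard library, assumes query and key entries drawn independently from a standard normal distribution, and picks `d_k = 64` and 2000 samples arbitrarily. The helper names `random_dot` and `softmax` are hypothetical.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers


def random_dot(d):
    """Dot product of two random d-dimensional vectors with N(0, 1) entries."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64  # assumed key dimension for this illustration
raw = [random_dot(d_k) for _ in range(2000)]
scaled = [s / math.sqrt(d_k) for s in raw]

# The variance of the raw scores grows roughly like d_k; scaling brings it near 1.
raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(f"variance of raw scores:    {raw_var:.1f}")
print(f"variance of scaled scores: {scaled_var:.2f}")

# Softmax over the same 10 scores, before and after scaling.
raw_weights = softmax(raw[:10])
scaled_weights = softmax(scaled[:10])
print(f"largest weight, unscaled: {max(raw_weights):.3f}")
print(f"largest weight, scaled:   {max(scaled_weights):.3f}")
```

Under these assumptions the unscaled softmax tends to put almost all of its mass on a single key, while the scaled version spreads the weights more evenly, which is the regime where gradients remain useful.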

**Imports**

In [3]:
import torch
import torch.nn as nn
import numpy as np

**Sample Data**

In [None]:
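# Random tensors stand in for real data: 1 query, 10 key/value pairs, feature dimension 20
query = torch.randn(1, 20)   # (num_queries, d_k)
key = torch.randn(10, 20)    # (num_keys, d_k)
value = torch.randn(10, 20)  # (num_keys, d_v)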

**Scaled Dot-Product Attention Model**

In [None]:
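class ScaledDotProductAttention(nn.Module):
    """Attention weights from scaled query-key dot products, applied to the values."""

    def __init__(self, input_dim):
        super().__init__()
        # input_dim is the key dimension d_k; it is used only for the scaling factor
        self.input_dim = input_dim

    def forward(self, query, key, value):
        # Raw scores: dot product of every query with every key, divided by sqrt(d_k)
        scores = torch.matmul(query, key.transpose(-2, -1)) / np.sqrt(self.input_dim)
        # Normalize over the key axis so each query's weights sum to 1
        attention_weights = torch.softmax(scores, dim=-1)
        # Output: weighted sum of the value vectors
        output = torch.matmul(attention_weights, value)
        return output, attention_weights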

**Instantiate and Apply Attention**

In [None]:
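# input_dim must match the last dimension of the keys (20 here)
scaled_dot_product_attention = ScaledDotProductAttention(input_dim=20)
# output has shape (1, 20); attention_weights has shape (1, 10)
output, attention_weights = scaled_dot_product_attention(query, key, value)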

**Display Results**

In [None]:
print("Scaled Dot-Product Attention Output:", output)
print("Attention Weights:", attention_weights)
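
**Sanity Check (sketch)**

Printed tensors are hard to verify by eye, so a few quick checks can help. The cell below is a minimal sketch rather than part of the original notebook: it recomputes the same attention in functional form on freshly generated tensors with the same shapes, checks the shapes and that the weights sum to 1, and, assuming PyTorch 2.0 or later is installed, compares the result with the built-in `torch.nn.functional.scaled_dot_product_attention`. The seed and tolerances are arbitrary choices.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability
query = torch.randn(1, 20)
key = torch.randn(10, 20)
value = torch.randn(10, 20)

# Same computation as the module's forward pass, written functionally
scores = query @ key.transpose(-2, -1) / math.sqrt(key.size(-1))
weights = torch.softmax(scores, dim=-1)
output = weights @ value

print(weights.shape)  # torch.Size([1, 10])
print(output.shape)   # torch.Size([1, 20])

# Each query's attention weights should form a probability distribution
rows_sum_to_one = torch.allclose(weights.sum(dim=-1), torch.ones(1), atol=1e-6)
print("weights sum to 1:", rows_sum_to_one)

# Optional cross-check against the built-in (available in PyTorch >= 2.0).
# The built-in is given explicit batch and head dimensions here.
builtin_matches = None
if hasattr(F, "scaled_dot_product_attention"):
    builtin = F.scaled_dot_product_attention(
        query[None, None], key[None, None], value[None, None]
    )
    builtin_matches = torch.allclose(builtin[0, 0], output, atol=1e-5)
    print("matches built-in:", builtin_matches)
```

The built-in scales by the square root of the query's last dimension by default, which equals the key dimension in this setup, so the two results should agree up to floating-point error.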