# Implementing Self-Attention from Scratch in PyTorch with Example

In this notebook, we'll build a self-attention mechanism from scratch and demonstrate it with a short example.

## Import Libraries

First, let's import the necessary libraries.


In [5]:
import torch
import torch.nn.functional as f

## Step-by-Step Implementation

### Step 1: Initialize Parameters

We'll start by defining the necessary parameters and initializing the weight matrices for queries, keys, and values.


In [6]:
embed_size = 256
heads = 8
head_dim = embed_size // heads

# Initialize weight matrices for queries, keys, and values
W_Q = torch.rand((embed_size, embed_size))
W_K = torch.rand((embed_size, embed_size))
W_V = torch.rand((embed_size, embed_size))

# Initialize the final output weight matrix
W_O = torch.rand((embed_size, embed_size))
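
As a quick sanity check on these parameters, the sketch below confirms that `embed_size` splits evenly across `heads` and inspects the resulting shapes. The `torch.manual_seed` call and the assertion are additions for illustration and are not part of the original cell.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed so reruns produce the same weights

embed_size = 256
heads = 8

# Each head works on an equal slice of the embedding, so the split must be exact.
assert embed_size % heads == 0, "embed_size must be divisible by heads"
head_dim = embed_size // heads

# torch.rand draws values uniformly from [0, 1)
W_Q = torch.rand((embed_size, embed_size))

print(head_dim)    # 32
print(W_Q.shape)   # torch.Size([256, 256])
```

With 8 heads and a 256-dimensional embedding, each head would operate on a 32-dimensional slice.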

### Step 2: Define Input Tensors

Let's create some dummy input tensors for values, keys, and queries. We'll use a simple example with a batch size of 1 and a sequence length of 3.


In [8]:
batch_size = 1
seq_length = 3

# One example sequence of 3 token embeddings per role, represented as random tensors
values = torch.rand((batch_size, seq_length, embed_size))
keys = torch.rand((batch_size, seq_length, embed_size))
queries = torch.rand((batch_size, seq_length, embed_size))
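
In self-attention proper, the queries, keys, and values are all derived from the same input sequence. A minimal sketch of that setup, assuming a single input tensor named `x` (a hypothetical name that does not appear in the original notebook):

```python
import torch

batch_size, seq_length, embed_size = 1, 3, 256

# Hypothetical single input: one batch holding 3 token embeddings
x = torch.rand((batch_size, seq_length, embed_size))

# Self-attention feeds the same tensor into all three roles;
# the separate projections W_Q, W_K, W_V are what make Q, K, and V differ.
queries = keys = values = x

print(queries.shape)  # torch.Size([1, 3, 256])
```

Using three independent random tensors, as in the cell above, still exercises the same code path, so either setup works for checking shapes.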

### Step 3: Linear Transformations

Project the input tensors into query, key, and value spaces by multiplying each one by its weight matrix along the embedding dimension.


In [10]:
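# Contract the input embedding axis (e) with the rows of each weight matrix to get
# a new embedding axis (f). Repeating one label ('ee') would pick out only the
# diagonal of the weight matrix rather than perform a matrix product.
Q = torch.einsum('bse,ef->bsf', queries, W_Q)
K = torch.einsum('bse,ef->bsf', keys, W_K)
V = torch.einsum('bse,ef->bsf', values, W_V)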
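
A linear transformation contracts the embedding axis of the input with a weight matrix, which in einsum notation needs distinct labels for the input and output embedding axes (written here as `e` and `f`). The sketch below, using freshly generated stand-in tensors and tolerances chosen for illustration, checks that this einsum agrees with ordinary batched matrix multiplication:

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed for reproducibility

# Stand-in tensors with the same shapes as in the notebook
queries = torch.rand((1, 3, 256))
W_Q = torch.rand((256, 256))

# 'bse,ef->bsf': sum over e, keeping batch (b), sequence (s), and output (f) axes
Q_einsum = torch.einsum('bse,ef->bsf', queries, W_Q)

# The same projection written as a batched matrix product
Q_matmul = queries @ W_Q

print(Q_einsum.shape)  # torch.Size([1, 3, 256])

# Loose tolerances allow for float32 rounding differences between the two paths
print(torch.allclose(Q_einsum, Q_matmul, rtol=1e-4, atol=1e-4))
```

The two forms are interchangeable here; einsum mainly pays off later, when more axes (such as heads) need to be named explicitly.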