---- 

In this tutorial, we will code **Self-Attention** in **[PyTorch](https://pytorch.org/)**. **Attention** is an essential component of neural network **Transformers**, which are driving the current excitement in **Large Language Models** and **AI**. Specifically, an **Encoder-Only Transformer**, illustrated below, is the foundation for the popular model **BERT**.

<img src="./images-nb/encoder_only_1.png" alt="an encoder-only transformer neural network" style="width: 800px;">

At the heart of **BERT** is **Self-Attention**, which allows it to establish relationships among the words, characters, and symbols that are used for input and are collectively called **Tokens**. For example, in the illustration below, where the word **it** could potentially refer to either **pizza** or **oven**, **Attention** could help a **Transformer** establish the correct relationship between the word **it** and **pizza**.

<img src="./images-nb/attention_ex_1.png" alt="an illustration of how attention works" style="width: 800px;"/>

In this tutorial, you will...

- **[Code a Basic Self-Attention Class!!!](#selfAttention)** The basic self-attention class allows the transformer to establish relationships among words and tokens.

- **[Calculate Self-Attention Values!!!](#calculate)** We'll then use the class that we created, SelfAttention, to calculate self-attention values for some sample data.
 
- **[Verify The Calculations!!!](#validate)** Lastly, we'll validate the calculations made by the SelfAttention class.

In [0]:
import torch
import torch.nn as nn ## torch.nn gives us nn.Module and nn.Linear
import torch.nn.functional as F ## This gives us the softmax()

# Code Self-Attention
<a id="selfAttention"></a>


In [0]:
# nn.Module is the base class for all neural network modules in PyTorch
class SelfAttention(nn.Module):
  def __init__(self, d_model=2,
               row_dim=0,
               col_dim=1):
    ## d_model = the number of embedding values per token.
    ##           Because we want to be able to do the math by hand, we've
    ##           set the default value to d_model=2.
    ##           However, in "Attention Is All You Need" d_model=512
    ##
    ## row_dim, col_dim = the indices we should use to access rows or columns
    super().__init__()
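
The class above stops right after `super().__init__()`. As a minimal sketch of how it might be completed, the version below assumes the standard scaled dot-product attention from "Attention Is All You Need": queries, keys, and values come from three bias-free `nn.Linear` layers, and the scaled similarities go through a softmax. The attribute names `W_q`, `W_k`, and `W_v`, the `forward()` argument name `token_encodings`, and the choice of `bias=False` are assumptions made for illustration, not something the text above specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    def __init__(self, d_model=2, row_dim=0, col_dim=1):
        super().__init__()
        # Assumed layer names: one weight matrix each for queries, keys, values.
        # bias=False keeps the math easy to reproduce by hand.
        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)

        # Indices used to access rows (tokens) and columns (embedding values).
        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, token_encodings):
        # Project each token's encoding into query, key, and value vectors.
        q = self.W_q(token_encodings)
        k = self.W_k(token_encodings)
        v = self.W_v(token_encodings)

        # Similarity of every query with every key (tokens x tokens).
        sims = torch.matmul(q, k.transpose(self.row_dim, self.col_dim))

        # Scale by the square root of the key dimension.
        scaled_sims = sims / (k.size(self.col_dim) ** 0.5)

        # Turn each row of similarities into weights that sum to 1.
        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

        # Each output row is a weighted sum of the value vectors.
        attention_scores = torch.matmul(attention_percents, v)
        return attention_scores
```

With the defaults `row_dim=0` and `col_dim=1`, this sketch expects an unbatched input of shape (number of tokens, `d_model`) and returns a tensor of the same shape.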

        

# Calculate Self-Attention
<a id="calculate"></a>
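
As a hedged sketch of this step, the snippet below passes a small matrix of token encodings through a `SelfAttention` module. The class is re-declared here in compact form so the snippet runs on its own; its layer names (`W_q`, `W_k`, `W_v`) and its scaled dot-product `forward()` are assumptions. The encoding values and the seed are illustrative only: three tokens with two embedding values each, matching `d_model=2`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Compact stand-in for the tutorial's class (layer names are assumptions)."""

    def __init__(self, d_model=2, row_dim=0, col_dim=1):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, token_encodings):
        q = self.W_q(token_encodings)
        k = self.W_k(token_encodings)
        v = self.W_v(token_encodings)
        sims = q @ k.transpose(self.row_dim, self.col_dim)
        scaled_sims = sims / (k.size(self.col_dim) ** 0.5)
        return F.softmax(scaled_sims, dim=self.col_dim) @ v


# Illustrative encodings: 3 tokens (rows), 2 embedding values each (columns).
encodings_matrix = torch.tensor([[1.16, 0.23],
                                 [0.57, 1.36],
                                 [4.41, -2.16]])

# Seed the random number generator so the randomly initialized
# weights, and therefore the results, are reproducible.
torch.manual_seed(42)

self_attention = SelfAttention(d_model=2)

# Calling the module runs forward() and returns one row per token.
attention_values = self_attention(encodings_matrix)
print(attention_values)
print(attention_values.shape)  # torch.Size([3, 2])
```

The exact numbers depend on the random initialization, so only the shape is noted here.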

# Print Out Weights and Verify Calculations
<a id="validate"></a>
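
One way to check the math, sketched below under stated assumptions, is to print the weights and redo each step with plain matrix operations. To stay self-contained, the snippet uses three stand-alone bias-free `nn.Linear` layers (hypothetical names `W_q`, `W_k`, `W_v`) in place of the class's own layers, along with illustrative input values. The same steps would apply to a module's layers if it stores them under those names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)
d_model = 2

# Stand-alone layers playing the role of the module's weights (assumed names).
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

# Illustrative encodings: 3 tokens, 2 embedding values each.
x = torch.tensor([[1.16, 0.23],
                  [0.57, 1.36],
                  [4.41, -2.16]])

# nn.Linear stores its weight as (out_features, in_features) and computes
# x @ weight.T, so the printed matrix must be transposed when working by hand.
print(W_q.weight)
print(W_k.weight)
print(W_v.weight)

q_manual = x @ W_q.weight.transpose(0, 1)
k_manual = x @ W_k.weight.transpose(0, 1)
v_manual = x @ W_v.weight.transpose(0, 1)

# The by-hand projections should agree with calling the layers directly.
layers_match = (torch.allclose(q_manual, W_q(x), atol=1e-6)
                and torch.allclose(k_manual, W_k(x), atol=1e-6)
                and torch.allclose(v_manual, W_v(x), atol=1e-6))

# Step-by-step attention: similarities, scaling, softmax, weighted sum.
sims = q_manual @ k_manual.transpose(0, 1)
scaled_sims = sims / (d_model ** 0.5)
attention_percents = F.softmax(scaled_sims, dim=1)
attention_manual = attention_percents @ v_manual

# Each row of attention weights should sum to 1.
rows_sum_to_one = torch.allclose(attention_percents.sum(dim=1), torch.ones(3))

print(layers_match, rows_sum_to_one)
print(attention_manual)
```

Printing each intermediate tensor (`sims`, `scaled_sims`, `attention_percents`) makes it possible to compare every step against a calculation done by hand.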