# Transformer Visual Explainer: Hands-on Tutorial

Welcome to this beginner-friendly notebook where we'll build a simple visualization of how transformers process input data.

We'll cover tokenization, embeddings, positional encoding, and attention weights, all demonstrated with mock data. Multi-head attention is left as an extension at the end.

## Step 1: Tokenize the Input Sentence
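Let's start by splitting an input sentence into tokens (here, lowercase whitespace-separated words) and mapping each one to an integer ID. In a real transformer, the IDs come from a fixed vocabulary built ahead of time, and tokens are usually subword units rather than whole words.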

In [None]:
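def tokenize(sentence):
    '''Convert a sentence to a list of token IDs and return the vocabulary used.'''
    tokens = sentence.lower().split()
    # For simplicity, assign each unique word an ID in order of first appearance.
    # dict.fromkeys() deduplicates while preserving order, so IDs are reproducible
    # (iterating a set of strings can give a different order on every run).
    unique_words = dict.fromkeys(tokens)
    vocab = {word: idx for idx, word in enumerate(unique_words, start=1)}
    token_ids = [vocab[word] for word in tokens]
    return token_ids, vocab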
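
To make the point about a fixed vocabulary concrete, here is a minimal sketch of a lookup against a predefined vocabulary. The names `FIXED_VOCAB`, `UNK_ID`, and `encode_with_fixed_vocab` are hypothetical and chosen only for illustration; the vocabulary itself is made up for this demo sentence.

```python
# Hypothetical fixed vocabulary for illustration; ID 0 is reserved for unknown words.
FIXED_VOCAB = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
UNK_ID = 0

def encode_with_fixed_vocab(sentence, vocab=FIXED_VOCAB, unk_id=UNK_ID):
    '''Map each lowercase, whitespace-separated word to its ID, or unk_id if unseen.'''
    return [vocab.get(word, unk_id) for word in sentence.lower().split()]

print(encode_with_fixed_vocab("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
print(encode_with_fixed_vocab("The dog sat"))             # [1, 0, 3]
```

Unlike the per-sentence vocabulary above, a fixed vocabulary gives the same word the same ID in every sentence, and words outside it fall back to a single "unknown" ID.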


## Step 2: Create Embeddings and Add Positional Encoding
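Next, we generate a random embedding vector for each token and add a sinusoidal positional encoding. The encoding is needed because the embeddings alone carry no information about where a token sits in the sentence.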

In [None]:
import numpy as np

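def get_embeddings(token_ids, embed_dim=128):
    '''Look up a random embedding vector for each token ID.'''
    ids = np.asarray(token_ids, dtype=int)
    # One random row per vocabulary ID (row 0 is unused because IDs start at 1),
    # so repeated tokens such as "the" share the same embedding.
    embedding_table = np.random.rand(max(token_ids, default=0) + 1, embed_dim)
    embeddings = embedding_table[ids]
    return embeddings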

def add_positional_encoding(embeddings):
    '''Add simple positional encoding to embeddings.'''
    seq_len, embed_dim = embeddings.shape
    position_enc = np.zeros((seq_len, embed_dim))
    for pos in range(seq_len):
        for i in range(embed_dim):
            if i % 2 == 0:
                position_enc[pos, i] = np.sin(pos / (10000 ** (i / embed_dim)))
            else:
                position_enc[pos, i] = np.cos(pos / (10000 ** ((i - 1) / embed_dim)))
    return embeddings + position_enc
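
The nested loops above can also be written with NumPy broadcasting. The following is a sketch of the same sine/cosine scheme, not a replacement for the function above; `positional_encoding_vectorized` is a hypothetical name introduced here, and it returns only the encoding matrix rather than adding it to the embeddings.

```python
import numpy as np

def positional_encoding_vectorized(seq_len, embed_dim):
    '''Sinusoidal positional encoding computed with NumPy broadcasting.'''
    positions = np.arange(seq_len)[:, np.newaxis]   # shape (seq_len, 1)
    dims = np.arange(embed_dim)[np.newaxis, :]      # shape (1, embed_dim)
    # Each even/odd pair of dimensions shares one frequency, so the exponent
    # uses the even index of the pair (i for even i, i - 1 for odd i).
    pair_index = dims - (dims % 2)
    angles = positions / (10000 ** (pair_index / embed_dim))  # (seq_len, embed_dim)
    encoding = np.zeros((seq_len, embed_dim))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

pe = positional_encoding_vectorized(seq_len=6, embed_dim=8)
print(pe.shape)  # (6, 8)
print(pe[0])     # position 0: sin(0) = 0 in even slots, cos(0) = 1 in odd slots
```

For the short sentences in this notebook the loop version is fast enough; the broadcast form mainly helps with longer sequences or larger embedding sizes.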


## Step 3: Compute Mock Attention Weights
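We'll simulate attention weights with a random matrix whose rows each sum to 1, as softmax-normalized attention rows do. Row i shows how much query token i attends to each key token. Because the values are random, the heatmap illustrates the shape of the output rather than any learned pattern.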

In [None]:
def compute_attention_weights(seq_len):
    '''Generate mock attention weights for visualization.'''
    attention = np.random.rand(seq_len, seq_len)
    # Normalize to sum to 1 for each query
    attention /= attention.sum(axis=1, keepdims=True)
    return attention
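
For contrast with the random weights above, here is a minimal sketch of how attention weights are usually computed in a transformer: a softmax over scaled query-key dot products. The function name `scaled_dot_product_weights` and the random, untrained projection matrices `w_q` and `w_k` are assumptions made for illustration, so the resulting weights are still not meaningful.

```python
import numpy as np

def scaled_dot_product_weights(queries, keys):
    '''Return softmax(Q K^T / sqrt(d_k)), one row of weights per query.'''
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.random((6, 16))               # stand-in for 6 token embeddings of size 16
w_q = rng.standard_normal((16, 16))   # random, untrained projection matrices
w_k = rng.standard_normal((16, 16))
weights = scaled_dot_product_weights(x @ w_q, x @ w_k)
print(weights.shape)        # (6, 6)
print(weights.sum(axis=1))  # each row sums to 1 (up to floating-point error)
```

The output has the same shape and row normalization as the mock matrix, so it can be passed to the same heatmap code.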


## Step 4: Visualize Attention Heatmap
Using Seaborn, we can draw the attention matrix as a heatmap: each row is a query token, each column is a key token, and brighter cells mean larger weights.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def visualize_attention(attention, tokens):
    '''Plot attention weights as a heatmap (rows are queries, columns are keys).'''
    plt.figure(figsize=(8, 6))
    sns.heatmap(attention, xticklabels=tokens, yticklabels=tokens, cmap='viridis')
    plt.xlabel('Keys')
    plt.ylabel('Queries')
    plt.title('Mock Attention Weights')
    plt.show()


## Step 5: Putting It All Together
Let's define a class to encapsulate the full process and run a demo with the sentence "The cat sat on the mat".

In [None]:
class TransformerVisualizer:
    def __init__(self, embed_dim=128, num_heads=8):
        self.embed_dim = embed_dim
        # Stored for a future multi-head extension; not used by the demo below.
        self.num_heads = num_heads

    def tokenize(self, sentence):
        # Return token IDs and the token sequence for labeling
        tokens = sentence.lower().split()
        # dict.fromkeys() keeps first-appearance order, so IDs are reproducible
        vocab = {word: idx for idx, word in enumerate(dict.fromkeys(tokens), start=1)}
        token_ids = [vocab[word] for word in tokens]
        return token_ids, tokens

    def embed_tokens(self, token_ids):
        embeddings = get_embeddings(token_ids, self.embed_dim)
        return add_positional_encoding(embeddings)

    def compute_attention(self, seq_len):
        return compute_attention_weights(seq_len)

    def visualize_attention(self, attention, tokens):
        visualize_attention(attention, tokens)

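    def run_full_demo(self, sentence):
        token_ids, tokens = self.tokenize(sentence)
        # Embeddings are computed to show the pipeline step; the mock attention
        # below is random and does not depend on them.
        embeddings = self.embed_tokens(token_ids)
        attention = self.compute_attention(len(tokens))
        self.visualize_attention(attention, tokens)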

# Usage
visualizer = TransformerVisualizer()
visualizer.run_full_demo("The cat sat on the mat")

## Summary and Next Steps
In this notebook, we built a simplified visualization pipeline for transformer attention using mock data. You learned how to tokenize input, generate embeddings, add positional encoding, and plot attention weights as a heatmap.

Next, you can extend this by implementing multi-head attention, layering multiple transformer blocks, or making the visualization interactive with Plotly.
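
As a starting point for the multi-head extension, the mock weights can be generated once per head. This is only a sketch under the same mock-data assumption as the rest of the notebook, not real multi-head attention (there are no learned per-head projections); `compute_multi_head_mock_attention` is a hypothetical helper name.

```python
import numpy as np

def compute_multi_head_mock_attention(seq_len, num_heads=8, seed=None):
    '''Generate one mock attention matrix per head, shape (num_heads, seq_len, seq_len).'''
    rng = np.random.default_rng(seed)
    attention = rng.random((num_heads, seq_len, seq_len))
    # Normalize over the key axis so every query row in every head sums to 1.
    attention /= attention.sum(axis=-1, keepdims=True)
    return attention

heads = compute_multi_head_mock_attention(seq_len=6, num_heads=8, seed=0)
print(heads.shape)  # (8, 6, 6)
```

Each slice `heads[h]` is a single `(seq_len, seq_len)` matrix, so it can be passed to `visualize_attention` to plot one head at a time.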

Keep experimenting and exploring how transformers work under the hood!