# MY475 Seminar 5: Language models fundamentals and further applications


# 1) Review of vector and matrix computations

With pen and paper:

a) Compute the dot product between

$$
\begin{pmatrix}
1 \\
4 \\
2
\end{pmatrix}

\quad

\text{and}

\quad

\begin{pmatrix}
0.4 \\
0.8 \\
0.2
\end{pmatrix}

$$
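Once the pen-and-paper answer for (a) is written down, it can be checked in NumPy. This is only a verification sketch; the names `u`, `v` and `dot_uv` are ours and not part of the exercise.

```python
import numpy as np

u = np.array([1, 4, 2])
v = np.array([0.4, 0.8, 0.2])

# Dot product = sum of elementwise products: 1*0.4 + 4*0.8 + 2*0.2
dot_uv = np.dot(u, v)
print(round(float(dot_uv), 6))  # 4.0
```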

b) Multiply the matrices

$$
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
5 & 6
\end{pmatrix}

\quad

\text{and}

\quad

\begin{pmatrix}
4 & 3 \\
2 & 1
\end{pmatrix}

$$
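The product in (b) can be checked the same way. The sketch below assumes the matrices are multiplied in the order given, a $3 \times 2$ matrix times a $2 \times 2$ matrix; the names `A`, `B` and `C` are illustrative.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])   # shape (3, 2)
B = np.array([[4, 3],
              [2, 1]])   # shape (2, 2)

# Entry (i, j) of A @ B is the dot product of row i of A with column j of B
C = A @ B                # shape (3, 2)
print(C)
# [[ 8  5]
#  [20 13]
#  [32 21]]
```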

c) Compute the softmax for the vector

$$

\begin{pmatrix}
4.2 \\
-3 \\
0.2
\end{pmatrix}

$$

In [34]:
import numpy as np

def softmax_with_temperature(x, T=1.0):
    x = x - np.max(x)  # numerical stability: np.exp overflows for large inputs (e.g. np.exp(1000) returns inf)
    x = x / T          # scale by temperature
    exps = np.exp(x)
    return np.round(exps / np.sum(exps),3)

x = np.array([4.2, -3, 0.2])

print("T = 1:", softmax_with_temperature(x, T=1))
print("T = 0.5:", softmax_with_temperature(x, T=0.5))
print("T = 2:", softmax_with_temperature(x, T=2))
print("T = 10:", softmax_with_temperature(x, T=10))

T = 1: [0.981 0.001 0.018]
T = 0.5: [1. 0. 0.]
T = 2: [0.86  0.024 0.116]
T = 10: [0.464 0.226 0.311]


# 2) Understanding the Transformer Architecture (with a Focus on Attention)

This document explains the transformer architecture and how attention works, using a concrete example with three tokens: `x0`, `x1`, and `x2`. We specifically follow how `x2` flows through a single transformer block.

---

## 🧠 The Big Idea: What is a Transformer?

A **transformer** is a deep learning architecture designed to process sequences, like sentences, by allowing each element to **interact with all others**—regardless of position.

Each transformer block consists of two main components:

1. **Multi-head self-attention** – lets each token "look at" other tokens and decide what is relevant.
2. **Feedforward neural network** – a small MLP applied independently to each token.

Other essential components:
- **Positional encodings** (to encode the order of tokens)
- **Residual connections** and **Layer Normalisation** (to stabilize and accelerate training)

---

## 🎯 What is Attention?

**Self-attention** is the key mechanism in transformers. Each token is mapped to three vectors:
- **Query ($Q$)** – what the token is looking for
- **Key ($K$)** – what the token offers
- **Value ($V$)** – what the token sends if it's attended to

Imagine a meeting:
- Query = a person's question
- Key = their expertise
- Value = what they say if asked

Each token compares its query with all keys to compute **attention weights**, which it then uses to take a weighted average over all value vectors.

---

## 📦 Setup: Input Sequence

We consider a sequence with three tokens:
- $x_0 = \text{"Language"}$
- $x_1 = \text{"models"}$
- $x_2 = \text{"work"}$

We will now trace how $x_2$ flows through a transformer block.

---

## 🛠 Step-by-Step: What Happens to $x_2$?

### Step 1: Input Embedding

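We look up the embedding vector $e_2$ for the token at position 2 and add its positional encoding $p_2$. From here on, $x_2$ denotes the resulting input vector rather than the token itself: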

$$
x_2 = e_2 + p_2
$$

---

### Step 2: Compute Query, Key, and Value Vectors

We apply linear transformations to compute:

- Query: $q_2 = W_Q x_2$
- Key: $k_2 = W_K x_2$
- Value: $v_2 = W_V x_2$

We do the same for $x_0$ and $x_1$ to get $q_0$, $k_0$, $v_0$ and $q_1$, $k_1$, $v_1$.
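The formulas above treat each token as a column vector, so the projection is written $W_Q x_2$. When the tokens are stacked as rows of a matrix $X$, the same computation becomes $X W_Q^\top$. The sketch below checks this equivalence on small random matrices; the dimension `d = 4` and the random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                          # toy embedding dimension (assumed for illustration)
X = rng.normal(size=(3, d))    # rows are x_0, x_1, x_2
W_Q = rng.normal(size=(d, d))  # query projection matrix

# Column-vector convention: one token at a time
q2_column = W_Q @ X[2]

# Row-stacked convention: all tokens at once, then pick row 2
Q = X @ W_Q.T
q2_row = Q[2]

print(np.allclose(q2_column, q2_row))  # True
```

The NumPy walk-through later in this seminar multiplies `X @ WQ` directly, which amounts to storing the transposed matrix. Since the weights there are random, either convention gives a valid projection.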

---

### Step 3: Compute Attention Weights

We calculate dot products between $q_2$ and each key to score how much $x_2$ should attend to each token:

- $q_2 \cdot k_0$
- $q_2 \cdot k_1$
- $q_2 \cdot k_2$

Then apply the softmax with scaling:

$$
\alpha_{2j} = \frac{\exp\left(q_2 \cdot k_j / \sqrt{d}\right)}{\sum_{t=0}^{2} \exp\left(q_2 \cdot k_t / \sqrt{d}\right)}
$$

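Here $d$ is the dimension of the query and key vectors. Dividing by $\sqrt{d}$ keeps the scores from growing with the dimension, which would otherwise push the softmax towards a near one-hot distribution.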
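A small numeric case makes the formula concrete. The query and key values below are made up so that the dot products are easy to verify by hand; they do not come from a trained model.

```python
import numpy as np

d = 2
q2 = np.array([1.0, 0.0])     # hypothetical query for token 2
K = np.array([[1.0, 0.0],     # k_0
              [0.0, 1.0],     # k_1
              [1.0, 1.0]])    # k_2

scores = K @ q2               # raw dot products q_2 . k_j
scaled = scores / np.sqrt(d)  # divide by sqrt(d)

# Softmax over the three scaled scores (subtract the max for stability)
exps = np.exp(scaled - scaled.max())
alpha2 = exps / exps.sum()

print(scores)                 # [1. 0. 1.]
print(np.round(alpha2, 3))    # attention weights of token 2 over tokens 0, 1, 2
```

With these numbers, tokens 0 and 2 receive equal weight (about 0.40 each) and token 1 receives about 0.20, because $k_1$ is orthogonal to $q_2$.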

---

### Step 4: Aggregate Values

Now $x_2$ builds its new representation by combining value vectors:

$$
x'_2 = \alpha_{2,0} v_0 + \alpha_{2,1} v_1 + \alpha_{2,2} v_2
$$

This $x'_2$ now encodes context from the full sequence.
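The weighted sum can also be written as a single vector-matrix product. In the sketch below, the attention weights and value vectors are hypothetical numbers chosen for readability.

```python
import numpy as np

alpha2 = np.array([0.2, 0.3, 0.5])  # assumed attention weights for token 2 (sum to 1)
V = np.array([[1.0, 0.0],           # v_0
              [0.0, 1.0],           # v_1
              [2.0, 2.0]])          # v_2

# Explicit sum, term by term as in the formula
x2_new_sum = alpha2[0] * V[0] + alpha2[1] * V[1] + alpha2[2] * V[2]

# Same result as one vector-matrix product
x2_new = alpha2 @ V

print(np.round(x2_new, 3))                # approximately [1.2 1.3]
print(np.allclose(x2_new, x2_new_sum))    # True
```

Because the weights are non-negative and sum to one, the result is a weighted average of the value vectors.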

---

### Step 5: Feedforward Network

We pass $x'_2$ through a feedforward network:

1. First layer: apply GELU activation  
   $$h = \text{GELU}(W_1 x'_2 + b_1)$$

2. Output layer:  
   $$x''_2 = W_2 h + b_2$$
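As a rough sketch of this feedforward step, the code below uses the common tanh approximation of GELU. The dimensions, the random weights and the function names `gelu` and `ffn` are assumptions for illustration only.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    h = gelu(W1 @ x + b1)   # hidden layer, usually wider than the embedding
    return W2 @ h + b2      # project back to the embedding dimension

rng = np.random.default_rng(0)
d_model, d_hidden = 4, 8    # toy sizes
W1, b1 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_model)

x2_prime = rng.normal(size=d_model)            # stand-in for the attention output
x2_double_prime = ffn(x2_prime, W1, b1, W2, b2)
print(x2_double_prime.shape)  # (4,)
```

The NumPy walk-through later in this seminar uses ReLU in place of GELU; the structure of the two-layer network is the same.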

---

### Step 6: Residual Connections & Normalisation

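Steps 4 and 5 left out two ingredients of the full block. Each sub-layer's output is added back to its input (a residual connection) and the sum is layer-normalised, which stabilizes training. Writing $\text{SelfAttention}(x_2)$ for the weighted sum from Step 4 and $\text{FFN}(\cdot)$ for the network from Step 5, the quantities $x'_2$ and $x''_2$ in the full block are: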

1. After attention:  
   $$
   x'_2 = \text{LayerNorm}(x_2 + \text{SelfAttention}(x_2))
   $$

2. After feedforward block:  
   $$
   x''_2 = \text{LayerNorm}(x'_2 + \text{FFN}(x'_2))
   $$
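Layer normalisation rescales each token vector to zero mean and unit variance across its features, usually followed by a learned scale and shift. Below is a minimal sketch, assuming the post-norm arrangement written above and a hypothetical helper named `layer_norm`. The learned parameters `gamma` and `beta` are set to their usual initial values of one and zero.

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Normalise over the feature (last) axis of a token vector or a batch of them
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    if gamma is None:
        gamma = np.ones(x.shape[-1])
    if beta is None:
        beta = np.zeros(x.shape[-1])
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x2 = rng.normal(size=8)            # input of token 2 to the block (toy values)
attn_out = rng.normal(size=8)      # stand-in for SelfAttention(x_2)

# Residual connection first, then normalisation
x2_prime = layer_norm(x2 + attn_out)
print(abs(x2_prime.mean()) < 1e-9, abs(x2_prime.std() - 1.0) < 1e-3)
```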

---

## 📣 Final Output for $x_2$

After passing through the block:
- $x''_2$ is the updated, contextualised embedding for token 2.
- It's informed by other tokens and ready for the next transformer block.

---

## 🧩 Summary Visual (Simplified)

```plaintext
Input:      x0        x1        x2
             ↓         ↓         ↓
           Embeddings + Positional Encoding
             ↓         ↓         ↓
          Self-Attention (x2 attends to x0, x1, x2)
             ↓         ↓         ↓
           New x2' = weighted sum of v0, v1, v2
             ↓
         Feedforward + Residual + Norm
             ↓
         Output x2''
```


![Transformer Architecture Diagram](Transformer-neural-network-12.png)

In [35]:
import numpy as np

np.random.seed(0)

# Parameters
vocab_size = 1000        # number of unique tokens
embed_dim = 128          # size of token and position embeddings
context_length = 100     # number of positions

# Embedding matrices
Winputtokemb = np.random.randn(vocab_size, embed_dim)       # (1000, 128)
Winputposemb = np.random.randn(context_length, embed_dim)   # (100, 128)

# Attention weight matrices
WQ = np.random.randn(embed_dim, embed_dim) # (128, 128)
WK = np.random.randn(embed_dim, embed_dim) # (128, 128)
WV = np.random.randn(embed_dim, embed_dim) # (128, 128)

# Feedforward weights
W1ff = np.random.randn(256, embed_dim)
W2ff = np.random.randn(embed_dim, 256)
b1ff = np.random.randn(256)
b2ff = np.random.randn(embed_dim)

# Final projection (logits) layer
Wlinear = np.random.randn(vocab_size, embed_dim)

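# Activation functions (ReLU is used here for simplicity; the text above uses GELU)
def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Row-wise softmax for a 2D array; subtract each row's max for numerical stability
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / exp_x.sum(axis=1, keepdims=True)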

In [36]:
# ==== Input tokens (indices into vocabulary) ====
X = Winputtokemb[[17, 500, 4],:] + Winputposemb[[0,1,2],:]
X.shape

(3, 128)

In [None]:
# === Self-Attention ===
# Compute queries for all tokens (row 2 is the query of x2)
Q = X @ WQ

# Compute keys and values for all tokens
K = X @ WK
V = X @ WV

In [38]:
scores = Q @ K.T
scores.shape

(3, 3)

In [39]:
scores = scores / np.sqrt(embed_dim)  # scale by sqrt(d)
alpha = softmax(scores)

In [40]:
res = alpha @ V
res.shape

(3, 128)

In [41]:
x2_prime = res[2,:]                  # shape: (128,) — new representation for x2
x2_mid = X[2, :] + x2_prime

In [42]:
# === Feedforward Network ===
h = relu(W1ff @ x2_mid + b1ff)     # shape: (256,)
x_ff_out = W2ff @ h + b2ff           # shape: (128,)

# === Residual Connection ===
x2_output = x2_mid + x_ff_out       # shape: (128,)
# Note: a full transformer block would also apply layer normalisation
# after adding each residual connection (omitted here)

In [43]:
# Step 1: Project into vocabulary space
logits = Wlinear @ x2_output  # shape: (1000,)

# Step 2: Convert to probabilities
probs = softmax(logits.reshape(1, -1)).flatten()  # Reshape to 2D, apply softmax, and flatten back to 1D

# Step 3: Predict next token
next_token_id = np.argmax(probs)  # or use sampling


next_token_id

117
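The `argmax` above always picks the most likely token (greedy decoding). The code comment mentions sampling as an alternative; a sketch of temperature sampling follows. The function name `sample_next_token` and the toy logits are assumptions and not part of the seminar code.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    # Draw a token id from softmax(logits / temperature)
    if rng is None:
        rng = np.random.default_rng()
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                      # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = np.array([4.2, -3.0, 0.2])  # same vector as in exercise 1c
rng = np.random.default_rng(0)
draws = [sample_next_token(toy_logits, temperature=2.0, rng=rng) for _ in range(1000)]

# Empirical frequencies; these should be close to the T = 2 softmax from section 1
freqs = np.bincount(draws, minlength=3) / 1000
print(freqs)
```

Lower temperatures concentrate the draws on the top token and approach greedy decoding, while higher temperatures spread them more evenly.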

# Optional: Retrieval-augmented generation (RAG)

Reconsider the AER abstract dataset from the last seminar. Using the Sentence Transformers library (https://sbert.net/), can you build a simple retrieval component for a RAG system? The retrieved texts could then be added to the context of a language model, e.g. via an API or a local model.

- Encode all abstracts or titles

- Write a function that takes a user question, encodes it, computes its cosine similarity to all embeddings in the dataset, and returns the K most similar texts

- If you want to additionally refine your ranking of only the most similar abstracts with a slower model, have a look at https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html and

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd

df = pd.read_csv(
    "aer_sample.csv",
    index_col="date",
    parse_dates=True,
)
column = "title"

# Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Calculate embeddings (pass a plain list of strings rather than a date-indexed Series)
data_embeddings = model.encode(df[column].tolist())

In [None]:
# embeddings dim = 384
data_embeddings.shape

In [None]:
# NumPy for sorting, scikit-learn for cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_similar(query, data_embeddings, df, top_k=5, model=model):

    query_embed = model.encode(query)

    # Calculate cosine similarity
    similarities = cosine_similarity(
        [query_embed], data_embeddings
    ).flatten()
    # Sort INDEXES by similarity
    sorted_indexes = np.argsort(similarities)[::-1]
    # Get top K most similar
    top_k_indexes = sorted_indexes[:top_k]
    # Get the most similar titles
    most_similar = df[column].iloc[top_k_indexes]
    # Get the most similar scores
    most_similar_scores = similarities[top_k_indexes]
    # Create a DataFrame for better visualization
    result_df = pd.DataFrame(
        {
            "title": most_similar,
            "similarity": most_similar_scores,
        }
    )
    result_df.index = df.index[top_k_indexes]
    result_df.index.name = "date"


    return result_df

In [None]:
query = "Does an increase in immigration necessarily lead to a decrease in wages?"
res = find_similar(query, data_embeddings, df)
res
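Once the most similar texts are retrieved, they can be placed in the prompt sent to a language model. The sketch below only builds the prompt string. The function name `build_rag_prompt`, the prompt wording and the example titles are hypothetical, and the call to an actual model (API or local) is left out.

```python
def build_rag_prompt(question, retrieved_texts):
    # Number the retrieved texts so the model can refer to them
    context = "\n".join(f"[{i + 1}] {text}" for i, text in enumerate(retrieved_texts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder titles for illustration; in practice one could pass res["title"].tolist()
example_texts = ["Placeholder title A", "Placeholder title B"]
prompt = build_rag_prompt("Does immigration lower wages?", example_texts)
print(prompt)
```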