# MY475 Seminar 5: Language model fundamentals and further applications


## Review of vector and matrix computations

With pen and paper:

a) Compute the dot product between

$$
\begin{pmatrix}
1 \\
4 \\
2
\end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix}
0.4 \\
0.8 \\
0.2
\end{pmatrix}
$$

b) Multiply the matrices

$$
\begin{pmatrix}
1 & 2 \\
3 & 4 \\
5 & 6
\end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix}
4 & 3 \\
2 & 1
\end{pmatrix}
$$

c) Compute the softmax for the vector

$$
\begin{pmatrix}
4.2 \\
-3 \\
0.2
\end{pmatrix}
$$
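Once you have worked through a) to c) by hand, a few lines of NumPy can be used to check the results. This is only a sketch for verification; the variable names (`u`, `v`, `A`, `B`, `z`) are chosen here for illustration and do not appear elsewhere in the seminar.

```python
import numpy as np

# a) Dot product: sum of element-wise products
u = np.array([1.0, 4.0, 2.0])
v = np.array([0.4, 0.8, 0.2])
dot = u @ v

# b) Matrix product: a (3 x 2) matrix times a (2 x 2) matrix gives a (3 x 2) matrix
A = np.array([[1, 2], [3, 4], [5, 6]])
B = np.array([[4, 3], [2, 1]])
AB = A @ B

# c) Softmax: exponentiate, then normalise so the entries sum to one
# (subtracting the maximum first does not change the result but avoids overflow)
z = np.array([4.2, -3.0, 0.2])
exp_z = np.exp(z - z.max())
p = exp_z / exp_z.sum()

print(dot)
print(AB)
print(p.round(4))
```

Note that the softmax output is dominated by the largest entry of the input vector, because the exponential magnifies differences between the entries.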

## Inference in a language model

The following parameters and basic functions define a simple decoder-only transformer language model with one causal attention layer and one feed-forward layer.

Assume no normalisation layers and a single attention head.

a) What is the context length of this model?

b) How many unique tokens does its vocabulary have?

c) What is the embedding dimension?

d) The user enters the input sequence [17, 500, 4]. Which token has the highest predicted probability of being next?

e) Briefly explain in words how you would predict the token that follows your first prediction (no computations are needed for this part).

To answer part d), write code (using only NumPy) to:

- Obtain the embeddings of the input tokens

- Add their positional embeddings

- Compute the relevant attention weights and updated embeddings

- Transform with the feed-forward neural network component

- Map into vocabulary

- Apply softmax

In [None]:
import numpy as np

np.random.seed(0)

Winputtokemb = np.random.randn(128, 1000)
Winputposemb = np.random.randn(128, 100)

WQ = np.random.randn(128, 128)
WK = np.random.randn(128, 128)
WV = np.random.randn(128, 128)

W1ff = np.random.randn(256, 128)
W2ff = np.random.randn(128, 256)
b1ff = np.random.randn(256)
b2ff = np.random.randn(128)

Wlinear = np.random.randn(1000, 128)


def relu(x):
    return np.maximum(0, x)


def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum(axis=0, keepdims=True)

In [None]:
# Your code here
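One possible way to carry out the steps listed above is sketched below. It is not the official solution and rests on several assumptions that the exercise does not state: embeddings are stored as columns (so `Winputtokemb[:, t]` is the embedding of token `t`), positions are 0-indexed, attention scores are scaled by the square root of the embedding dimension, and residual connections wrap both sublayers (set `residual=False` to drop them). The helper `next_token_probs` is a hypothetical name. The parameters are redrawn in the same order as in the cell above so that the block runs on its own.

```python
import numpy as np

np.random.seed(0)

# Same parameters, drawn in the same order as in the seminar cell
Winputtokemb = np.random.randn(128, 1000)
Winputposemb = np.random.randn(128, 100)
WQ = np.random.randn(128, 128)
WK = np.random.randn(128, 128)
WV = np.random.randn(128, 128)
W1ff = np.random.randn(256, 128)
W2ff = np.random.randn(128, 256)
b1ff = np.random.randn(256)
b2ff = np.random.randn(128)
Wlinear = np.random.randn(1000, 128)


def relu(x):
    return np.maximum(0, x)


def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum(axis=0, keepdims=True)


def next_token_probs(tokens, residual=True):
    """Hypothetical helper: next-token distribution for a list of token ids."""
    n = len(tokens)
    d = Winputtokemb.shape[0]

    # Token plus positional embeddings, one column per position (assumed 0-indexed)
    X = Winputtokemb[:, tokens] + Winputposemb[:, :n]  # shape (128, n)

    # Only the last position is needed to predict the next token.
    # Under causal masking it may attend to all n positions, itself included.
    q = WQ @ X[:, -1]  # query of the last position, shape (128,)
    K = WK @ X         # keys, shape (128, n)
    V = WV @ X         # values, shape (128, n)
    weights = softmax(K.T @ q / np.sqrt(d))  # attention weights, shape (n,)
    attended = V @ weights                   # weighted sum of values, shape (128,)
    h = X[:, -1] + attended if residual else attended

    # Feed-forward network: 128 -> 256 -> 128 with a ReLU in between
    ff = W2ff @ relu(W1ff @ h + b1ff) + b2ff
    h = h + ff if residual else ff

    # Map into the vocabulary and normalise to probabilities
    logits = Wlinear @ h  # shape (1000,)
    return softmax(logits), weights


probs, weights = next_token_probs([17, 500, 4])
next_token = int(np.argmax(probs))
print(next_token, probs[next_token])
```

For part e), the same function could be called again with the predicted token appended to the input, for example `next_token_probs([17, 500, 4, next_token])`, which is how autoregressive generation proceeds one token at a time.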

## Optional: Retrieval-augmented generation (RAG)

Reconsider the AER abstract dataset from the last seminar. Using the Sentence Transformers library (https://sbert.net/), can you build a simple retrieval component for a RAG system? The retrieved text chunks could then be added to the context of a language model, e.g. via an API or a local model.

- Encode all abstracts or titles

- Write a function that takes a user question as input, encodes it, computes its cosine similarity to all embeddings in the dataset, and returns the K most similar texts

- If you want to additionally refine your ranking of only the most similar abstracts with a slower model, have a look at https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html and

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd

df = pd.read_csv(
    "data/aer_sample.csv",
    index_col="date",
    parse_dates=True,
)
column = "title"

# Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Calculate embeddings
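# (pass a plain list of strings; a Series with a date index may not be indexed as encode expects)
embeddings = model.encode(df[column].tolist())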

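def find_similar(query, embeddings, df, top_k=5):
    """Return the top_k texts in df that are most similar to query (cosine similarity)."""
    # Your code here
    raise NotImplementedError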
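The similarity-and-ranking step of such a function can be sketched with NumPy alone. The helper `top_k_cosine` below is a hypothetical name, and the two-dimensional toy vectors merely stand in for real sentence embeddings so that the block runs without downloading a model.

```python
import numpy as np


def top_k_cosine(query_vec, embeddings, top_k=5):
    """Hypothetical helper: indices and scores of the top_k rows most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Cosine similarity is the dot product of L2-normalised vectors
    q = query_vec / np.linalg.norm(query_vec)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = E @ q

    # Sort by descending similarity and keep the first top_k indices
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Toy 2-D "embeddings" standing in for the output of model.encode(...)
toy_texts = ["trade", "labour", "inflation"]
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

idx, sims = top_k_cosine(np.array([2.0, 0.1]), toy_embeddings, top_k=2)
print([toy_texts[i] for i in idx], sims)
```

Inside `find_similar`, the query vector would presumably come from `model.encode(query)`, which returns a single vector when given one string, and the returned indices could be used to look up the matching rows with `df[column].iloc[idx]`.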

## Optional: AI-led interviews

If you have previously worked with the OpenAI or Claude APIs and have a key (there is no need to register for one for this course!), you can set up the interview platform from https://github.com/friedrichgeiecke/interviews and run it on your computer.