In [None]:
import numpy as np
from scipy import special
from matplotlib import pyplot as plt

# A toy model for attention

## General instructions

Each question is provided in a _Markdown_ cell and should be answered in the cell(s) below. You may add new cells if needed. All figures must be generated and shown directly in this notebook. If a question asks for a written answer, use a _Markdown_ cell, which can include LaTeX between \\$ symbols. As an example,
\\$\vec{F}=m\vec{a}\\$
gives $\vec{F}=m\vec{a}$.

Your code should run properly if you do the following: 1) restart the kernel 2) execute all cells in order from top to bottom. Running all cells should take a reasonable time on a standard computer (<10 min.).

Avoid using `for` loops whenever possible. Instead, use vectorized operations or numpy functions.

All external sources you consult must be explicitly cited, except for the official NumPy, SciPy and Matplotlib documentation, the lecture notes, and previous exercises. You are encouraged to use external sources, since not every function needed in this exercise has been covered in the previous exercises. Please also cite every person you discussed this exercise with.

Overly long or unnecessarily complicated answers will be penalized.

## Setting

In this exercise, we illustrate how an attention mechanism can focus on a few relevant tokens (*fragments*). We consider a simple model of attention trained with gradient descent on a small dataset.

We consider $N$ training samples. Each sample, indexed by $\mu = 1, \ldots, N$, is a sequence of $L$ tokens. Each token, indexed by $\ell = 1, \ldots, L$, is a $D$-dimensional vector in $\mathbb{R}^D$. We also have $N'$ test samples, indexed by $\mu = N+1, \ldots, N+N'$. We pack all these samples into the tensors $X^\mathrm{train}\in\mathbb{R}^{N\times L\times D}$, $X^\mathrm{test}\in\mathbb{R}^{N'\times L\times D}$ and $X\in\mathbb{R}^{(N+N')\times L\times D}$. $X$ is a short notation for the concatenation of $X^\mathrm{train}$ and $X^\mathrm{test}$.

For each sample $\mu=1,\ldots,N+N'$ there is a relevant token $\epsilon_\mu\in\{1,\ldots,L\}$ whose first component contains all the information we want to extract. More precisely, for each sample $\mu$ its label $y_\mu\in\mathbb R$ is
$$
y_\mu=X_{\mu,\epsilon_\mu,1}+\Xi_{\mu}
$$
where $\Xi_{\mu}$ is some small noise, independent across samples. We define the train and test labels as $y^\mathrm{train}=(y_\mu)_{\mu=1,\ldots,N}$ and $y^\mathrm{test}=(y_\mu)_{\mu=N+1,\ldots,N+N'}$. We want to predict the test labels given $X$ and the train labels. The main difficulty is that we do not know $\epsilon_\mu$.
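
To make the data model concrete, here is a minimal sketch that generates toy data of the same form. The sizes, the noise level and all names ending in `_toy` are assumptions made for illustration; the actual data are loaded from the provided files. Note that the first component of a token is index `0` in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative only

# Hypothetical toy sizes; the real values come from the provided files.
N_toy, L_toy, D_toy = 5, 4, 3
noise_std = 0.01  # assumed noise level, not taken from the exercise

X_toy = rng.normal(size=(N_toy, L_toy, D_toy))   # tokens, shape (N, L, D)
eps_toy = rng.integers(0, L_toy, size=N_toy)     # relevant token of each sample (0-based)
Xi_toy = noise_std * rng.normal(size=N_toy)      # small independent noise

# y_mu = X[mu, eps_mu, first component] + Xi_mu, written without a loop
y_toy = X_toy[np.arange(N_toy), eps_toy, 0] + Xi_toy
print(y_toy.shape)  # (5,)
```

The pair of index arrays `np.arange(N_toy), eps_toy` picks one token per sample, which is how the unknown $\epsilon_\mu$ enters the labels.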

We consider an estimator that tends to focus on the most relevant tokens by estimating $\epsilon_\mu$. For this reason, we call it attention. It is defined by
$$
\hat y_\mu(k)=\sum_{\ell=1}^L\sigma(\chi_\mu)_\ell X_{\mu,\ell,1}\ ,\qquad \mathrm{where}\qquad \chi_\mu=\frac{1}{\sqrt D}X_{\mu}k \in\mathbb R^L \qquad \mathrm{and}\qquad k\in\mathbb R^D\ .
$$
Here $\sigma : \mathbb{R}^L \to \mathbb{R}^L$ is a fixed function, which we initially take to be the identity, and $k$ is the vector of attention parameters to be trained. Intuitively, $\chi_{\mu,\ell}$ indicates how likely it is that the token $\ell$ is the relevant token. We transform these scores via $\sigma$ and then take a linear combination of the first components of the tokens, weighted by the transformed scores. If the attention perfectly identifies the relevant token, i.e. $\sigma(\chi_\mu)_\ell = 1$ when $\ell = \epsilon_\mu$ and $0$ otherwise, then $\hat y_\mu$ will match $y_\mu$, up to the noise term $\Xi_\mu$.
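
The last remark can be checked on toy data. The sketch below builds ideal one-hot attention weights by hand (which is only possible here because the toy $\epsilon_\mu$ are known) and verifies that the weighted sum returns the first component of the relevant token. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N_toy, L_toy, D_toy = 6, 4, 3                     # hypothetical sizes
X_toy = rng.normal(size=(N_toy, L_toy, D_toy))
eps_toy = rng.integers(0, L_toy, size=N_toy)      # 0-based relevant positions

# Ideal weights: one-hot rows of shape (N, L), equal to 1 at the relevant token.
ideal_weights = np.zeros((N_toy, L_toy))
ideal_weights[np.arange(N_toy), eps_toy] = 1.0

# Weighted sum over the tokens of their first component.
y_hat_ideal = np.sum(ideal_weights * X_toy[:, :, 0], axis=1)

# Scores for an arbitrary k: one entry per token and per sample.
k_toy = rng.normal(size=D_toy)
chi_toy = X_toy @ k_toy / np.sqrt(D_toy)
print(chi_toy.shape)  # (6, 4)
```

In the exercise the weights are of course not given; they are produced by $\sigma(\chi_\mu)$, and training $k$ is what should bring them close to this ideal.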

We train this estimator with the following quadratic loss:
$$
\mathcal L(k) = \frac{1}{2}\sum_{\mu=1}^N(y_\mu-\hat y_\mu(k))^2
$$
The parameter vector $k$ is optimized on the training data using gradient descent to minimize $\mathcal{L}(k)$.
Once the optimized parameter $\hat k$ is found, we evaluate the performance on the test set using the mean squared test error $E$:
$$
E = \frac{1}{N'}\sum_{\mu=N+1}^{N+N'}(y_\mu-\hat y_\mu(\hat k))^2
$$

## 1. Preprocessing the data

The following cell loads the training and test data: $X^\mathrm{train}$, $X^\mathrm{test}$, $y^\mathrm{train}$, and $y^\mathrm{test}$.

In [None]:
X_train = np.load("dataX.npy")
X_test = np.load("dataX_test.npy")
y_train = np.load("dataY.npy")
y_test = np.load("dataY_test.npy")

**1.1** Instantiate the variables `N`, `L` and `D` with their values obtained from `X_train`. Print their values. Also print `N'`.

In [None]:
# Your code here

## 2. Loss and gradient

Unless stated otherwise, we take $\sigma(\chi) = \chi$ (the identity function). We will consider other cases later.

**2.1** Write a function `attentionLin` that takes $X$ and $k$ as input and outputs $\hat y$. The function must work with an arbitrary number of samples. The output $\hat y$ must be a one-dimensional numpy array whose length is the number of samples.

In [None]:
# Your code here

**2.2** Write a function `loss` that computes $\mathcal L$. It takes as input the NumPy arrays $y^\mathrm{train}$ and $\hat y^\mathrm{train}$.

In [None]:
# Your code here

**2.3** We now compute the gradient of the loss with respect to $k$,
$\nabla_k \mathcal{L}(k) \in \mathbb{R}^D$. We define the gradient with respect to $k$ of a function $f$ to be
$$
\nabla_kf = \left(\frac{\partial}{\partial k_i}f\right)_{i=1,\ldots,D}\ .
$$
We proceed step by step. First compute $\nabla_k\chi_\mu\in\mathbb R^{L\times D}$. Write your result in one line in the following cell.

Your answer here :

**2.4** Recall that $\sigma$ is the identity, i.e. $\hat y_\mu=\sum_{\ell=1}^L\chi_{\mu,\ell} X_{\mu,\ell,1}$. Compute $\nabla_k\hat y_\mu\in\mathbb R^D$. Write your result in one or two lines in the following cell.

Your answer here:

**2.5** Compute $\nabla_k\mathcal L$ using the chain rule. Write your result in one or two lines in the following cell.

Your answer here:

**2.6** Write a function `gradLossLin` that computes $\nabla_k\mathcal L(k)$. It takes as input the NumPy arrays $X^\mathrm{train}$, $y^\mathrm{train}$ and $k$. The output must be a one-dimensional numpy array. 

_Hint_ : to perform a summation over various indices you can use `np.einsum`.
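
As a reminder of how `np.einsum` handles a sum over several indices, here is a small sketch on arbitrary toy arrays (unrelated to the data of the exercise). Indices that do not appear after `->` are summed over.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))      # indices (n, l)
B = rng.normal(size=(4, 3, 2))   # indices (n, l, i)

# Sum over n and l, keep i: out[i] = sum_{n,l} A[n,l] * B[n,l,i]
out_einsum = np.einsum("nl,nli->i", A, B)

# Same result with broadcasting and an explicit sum over the first two axes.
out_sum = (A[:, :, np.newaxis] * B).sum(axis=(0, 1))

print(np.allclose(out_einsum, out_sum))  # True
```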

In [None]:
# Your code here

**2.7** Check that your gradient makes sense by comparing it to finite differences: for any $k$ and any small perturbation $\delta\in\mathbb R^D$ you should have
$$
2\nabla_k\mathcal L(k)^\top\delta \approx \mathcal L(k+\delta)-\mathcal L(k-\delta).
$$
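
To illustrate the check itself, the sketch below applies it to a toy least-squares loss whose gradient is known in closed form. The functions `toy_loss` and `toy_grad` are hypothetical stand-ins, not the loss of this exercise.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(8, 5))
b = rng.normal(size=8)

def toy_loss(k):
    """Toy quadratic loss 0.5 * ||A k - b||^2."""
    r = A @ k - b
    return 0.5 * np.sum(r**2)

def toy_grad(k):
    """Closed-form gradient of toy_loss."""
    return A.T @ (A @ k - b)

k0 = rng.normal(size=5)
delta = 1e-5 * rng.normal(size=5)   # small random perturbation

lhs = 2 * toy_grad(k0) @ delta
rhs = toy_loss(k0 + delta) - toy_loss(k0 - delta)
rel_err = abs(lhs - rhs) / max(abs(lhs), abs(rhs))
print(rel_err)  # should be tiny if the gradient is correct
```

A relative error of order one (or a wrong sign) would point to a mistake in the gradient, whereas a small value only shows agreement along the tested direction, so trying a few different $\delta$ is a reasonable precaution.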

In [None]:
# Your code here

## 3. Gradient descent

We can start implementing gradient descent to estimate $\hat k$ that minimizes $\mathcal L$. We initialize the descent at a random $k^{(0)}$.

**3.1** Introduce the learning rate and formally write, in one line, a single iteration of gradient descent updating $k$ from time step $t$ to time step $t+1$.

Your answer here:

**3.2** Implement the gradient descent. Initialize the descent at a random $k^{(0)}$ and perform enough iterations for the descent to converge. We consider the descent converged once $|\mathcal L(k^{(t+1)})-\mathcal L(k^{(t)})|<10^{-2}$. You may need to adjust the learning rate to ensure convergence. In any case, you should not need more than $5\times 10^3$ time steps. You can use a `for` or `while` loop. At each iteration, compute and store the loss $\mathcal L$ and the error $E$ in two separate lists. To help with debugging, you can print the value of the loss at each iteration.
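
The structure of such a loop, with the stopping criterion above, can be sketched on a toy least-squares problem. The loss, the gradient and the learning rate `eta` below are assumptions chosen for this toy problem only; the value of `eta` that works for the exercise will likely differ.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

def toy_loss(k):
    return 0.5 * np.sum((A @ k - b) ** 2)

def toy_grad(k):
    return A.T @ (A @ k - b)

eta = 1e-2          # assumed learning rate, tuned for this toy problem only
tol = 1e-2          # stopping threshold on the change in loss
max_steps = 5000    # upper bound on the number of iterations

k = rng.normal(size=3)          # random initialization
losses = [toy_loss(k)]
for t in range(max_steps):
    k = k - eta * toy_grad(k)   # one gradient descent step
    losses.append(toy_loss(k))
    if abs(losses[-1] - losses[-2]) < tol:
        break                   # change in loss below the threshold

print(len(losses) - 1, losses[-1])  # number of steps taken and final loss
```

If the loss grows or oscillates, the learning rate is too large for the problem at hand; if the loop runs into `max_steps`, it is probably too small.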

In [None]:
# Your code here

**3.3** Plot the evolution of $\mathcal L$ across iterations. Label the axes, and use a logarithmic scale for the x-axis (number of iterations).

In [None]:
# Your code here

**3.4** Plot $E$ at each iteration. Add labels to the axes. For the x-axis (number of iterations), use a logarithmic scale. Should we do early stopping ? Explain briefly. Print the achieved error.


In [None]:
# Your code here

Your answer here:

## 4. Softmax attention

In this part we consider another $\sigma$ : the softmax attention. It is defined for $\chi_\mu\in\mathbb R^L$ by
$$
\sigma(\chi_\mu)_\ell=\frac{e^{\chi_{\mu,\ell}}}{\sum_{\ell'=1}^Le^{\chi_{\mu,\ell'}}}\ , \qquad\mathrm{for\ }\ell=1,\ldots,L\ .
$$
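
Two properties follow directly from this definition and may help when checking an implementation. The entries of $\sigma(\chi_\mu)$ are positive and sum to one, and adding the same constant $c\in\mathbb R$ to all the scores leaves the output unchanged:

$$
\sigma(\chi_\mu + c\,\mathbf 1)_\ell=\frac{e^{c}\,e^{\chi_{\mu,\ell}}}{e^{c}\sum_{\ell'=1}^L e^{\chi_{\mu,\ell'}}}=\sigma(\chi_\mu)_\ell\ ,
$$

where $\mathbf 1=(1,\ldots,1)\in\mathbb R^L$. In particular, only the differences between the scores matter, not their overall level.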

**4.1** Implement the softmax attention $\sigma$. It takes as input a two-dimensional numpy array $\chi\in\mathbb R^{N\times L}$ and outputs a numpy array of the same shape. It performs the softmax operation independently for each $\chi_\mu\in\mathbb R^L$, for $\mu=1,\ldots,N$. Do not use a built-in function.

In [None]:
# Your code here

**4.2** Take $N=1, L=3$ and compute $\sigma((1,1,1))$, $\sigma((3,1,1))$ and $\sigma((10,1,1))$. Why is this function called a softmax ? Explain briefly.

In [None]:
# Your code here

Your answer here:

**4.3** Would it be possible to run gradient descent with the following $\sigma$ ?
$$
\sigma(\chi_\mu)_\ell=\left \{\begin{array}{cl}
1 & \mathrm{if\ }\chi_{\mu,\ell}=\mathrm{max}(\chi_\mu) \\
0 & \mathrm{else}
\end{array} \right .
$$

Your answer here:

**4.4** Implement a function `attentionSoftmax` that computes $\hat y$ given $X$ and $k$. The function must work with an arbitrary number of samples. The output $\hat y$ must be a one-dimensional numpy array whose length is the number of samples.

In [None]:
# Your code here

In the following cell we give a function that computes $\nabla_k\mathcal L$ for the softmax attention. Note that we use your implementation of `attentionSoftmax` that you defined above.

In [None]:
def gradLossSoftmax(X, y, k):
    """
    X : (N, L, D) numpy array
    y : (N,) numpy array
    k : (D,) numpy array
    """
    yC = attentionSoftmax(X, k) # here we use your implementation of the attention
    chis = X@k/np.sqrt(D)
    dsoftmax = special.softmax(chis, axis=-1)[:,np.newaxis,:]*(np.identity(L)[np.newaxis,:,:]-special.softmax(chis, axis=-1)[:,:,np.newaxis])
    return np.einsum("n,nl,nlk,nki->i", yC-y, X[:,:,0], dsoftmax, X)/np.sqrt(D)
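
The `dsoftmax` term in the function above is the Jacobian of the softmax, $\partial\sigma(\chi)_\ell/\partial\chi_{\ell'}=\sigma(\chi)_\ell\,(\delta_{\ell\ell'}-\sigma(\chi)_{\ell'})$. As a sanity check, the following sketch compares this formula with central finite differences on a single toy score vector. It relies on `scipy.special.softmax`, like the provided function; the size `L_toy` and the step `h` are arbitrary choices, and the explicit loop is kept only for readability.

```python
import numpy as np
from scipy import special

rng = np.random.default_rng(5)
L_toy = 4                         # hypothetical sequence length
chi = rng.normal(size=L_toy)
s = special.softmax(chi)

# Analytic Jacobian: J[l, k] = d sigma_l / d chi_k = s_l * (delta_lk - s_k)
J_analytic = np.diag(s) - np.outer(s, s)

# Central finite differences, one column per perturbed coordinate.
h = 1e-6
J_numeric = np.empty((L_toy, L_toy))
for j in range(L_toy):
    e = np.zeros(L_toy)
    e[j] = h
    J_numeric[:, j] = (special.softmax(chi + e) - special.softmax(chi - e)) / (2 * h)

print(np.allclose(J_analytic, J_numeric, atol=1e-7))  # True
```

Each column of the Jacobian sums to zero, which reflects the fact that the softmax outputs always sum to one.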

**4.5** Implement gradient descent for the softmax attention in the same way as in the previous section.

In [None]:
# Your code here

**4.6** Plot $\mathcal L$ as a function of the iteration number. Label both axes, and use a logarithmic scale for the x-axis (number of iterations).

In [None]:
# Your code here

**4.7** Plot the evolution of $E$ over iterations. Add labels to the axes. For the x-axis (number of iterations), use a logarithmic scale. Should we do early stopping ? Explain briefly. Print the minimal achieved error.

In [None]:
# Your code here

Your answer here:

**4.8** Does the softmax attention or linear attention have better performance when comparing $E$ ?  In a few lines give an intuition why.

Your answer here:

## 5. Application : attention for sentiment analysis

We are not going to build ChatGPT, but we will at least see some of the important principles behind it. One of the main components of a Transformer model such as ChatGPT is the attention module. It compares the tokens in the sequence (think of them as the words of a sentence) with each other, so that each position in the sequence only "attends" to the relevant tokens.

Let's take a look at an example: an attention module that classifies positive vs negative tweets. This task is called sentiment analysis. In the following cell we load the tweets together with their labels.

In [None]:
dataTxt = np.loadtxt('tweets_clean.csv', delimiter=',', skiprows=1, usecols=0, dtype=str)
dataY = np.loadtxt('tweets_clean.csv', delimiter=',', skiprows=1, usecols=1)

- Print the first ten samples and their labels.

In [None]:
# Your code here

### Embeddings

Now, we need to tokenize (*fragmenter*) the text, that is, split sentences into tokens. In common LLMs one token is a piece of a word ; in our case we consider that one token is exactly one word, separated by white spaces.

Then each token should be embedded (*plonger*) to a vector in $\mathbb{R}^D$, so that we can train a model on this data. Each sample $\mu$ will consist of a sequence of $L$ embeddings: $X^\mu = (X^\mu_1, \ldots, X^\mu_L)\in\mathbb{R}^{L\times D}$.

To embed the words we will use the GloVe model https://nlp.stanford.edu/projects/glove/. We provide a file ``embeddings.csv``, which contains only the subset of the GloVe embeddings needed to encode the given tweets. The following cell loads it.

In [None]:
def get_glove_embeddings():
    embeddings_dict = {}
    with open("embeddings.csv", 'r', encoding="utf-8") as f:
        for line in f:
            values = line.split(",")
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            embeddings_dict[word] = vector
    
    return embeddings_dict

embeddings_dict = get_glove_embeddings()

**5.1** Print the embedding of the token `apple`.

In [None]:
# Your code here

The embeddings encode the words so that the semantic structure is preserved. We give a few examples.

**5.2** Write a function `find_closest_embeddings` that, given an embedding $X_{\mu,\ell}\in\mathbb R^D$, sorts all the words in the dictionary `embeddings_dict` by how close their embeddings are to $X_{\mu,\ell}$, closest first. To measure "closeness", use the Euclidean distance.

In [None]:
# Your code here

Your code should work on the following examples of semantic structure :

In [None]:
find_closest_embeddings(embeddings_dict["king"])[:5]

In [None]:
find_closest_embeddings(embeddings_dict["smarter"] - embeddings_dict["smart"] + embeddings_dict["strong"])[0]

In [None]:
find_closest_embeddings(embeddings_dict["king"] - embeddings_dict["queen"] + embeddings_dict["woman"])[:2]

In the following cell we process the tweets: we tokenize them, embed them and split them into train and test sets.

In [None]:
def embed_tweet(tweet):
    embedding = [embeddings_dict[word] for word in tweet.split()]
    while len(embedding) < 16:
        embedding.append(np.zeros_like(embeddings_dict['0']))
    return embedding[:16]

X = np.stack([embed_tweet(tweet) for tweet in dataTxt])

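N_total = X.shape[0]  # number of tweets; do not reuse N from the earlier sections
N, Np = int(N_total*2/3), N_total - int(N_total*2/3)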
X_train, X_test = X[:N,:,:], X[N:,:,:]
y_train, y_test = dataY[:N], dataY[N:]

N, L, D = X_train.shape

### Attention estimator, gradient descent

We now consider the following estimator :
$$
\hat{y}_\mu(k, v) = \frac{1}{\sqrt D}v^\top\sum_{\ell=1}^L\sigma(\chi_\mu)_\ell X_{\mu, \ell} ,\qquad \mathrm{where}\qquad \chi_\mu=\frac{1}{\sqrt D}X_{\mu}k \in\mathbb R^L \qquad \mathrm{and}\qquad k, v\in\mathbb R^D.
$$
Here $\sigma : \mathbb{R}^L \to \mathbb{R}^L$ is the attention function, as before. The vectors $k$ and $v$ are the parameters of the model that have to be trained. Unlike in the previous sections, we do not know how the embeddings and the labels are related. We assume that the sentiment of the token $\ell$ can be expressed as $v^\top X_{\mu,\ell}$ for a good $v$ that has to be learnt. The sentiments of all the tokens are then combined in a sum weighted by the attention. Because of the $1/\sqrt D$ prefactor, the previous sections correspond to the special case $v=(\sqrt D,0,\ldots,0)$.

To keep things simple we still train the estimator using the quadratic loss (sum of squared errors):
$$
\mathcal L(k,v) = \frac{1}{2}\sum_{\mu=1}^N(y_\mu-\hat y_\mu(k,v))^2
$$
We minimise $\mathcal L$ over $k$ and $v$ using gradient descent. Once the optimized parameters $\hat k, \hat v$ are found, the predicted class of the tweet $\mu$ is $\mathrm{sign}(\hat y_\mu(\hat k, \hat v))$, and we evaluate the performance on the test set using the test accuracy $\mathrm{Acc}$ :
$$
\mathrm{Acc} = \frac{1}{2}+\frac{1}{2N'}\sum_{\mu=N+1}^{N+N'}y_\mu\mathrm{sign}(\hat y_\mu(\hat k, \hat v))\ .
$$
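
The accuracy formula is the fraction of correctly classified samples, written in a form that suits labels in $\{-1,+1\}$ (as the formula suggests): each correct prediction contributes $+1$ to the sum and each wrong one $-1$. The sketch below checks this on made-up labels and predictions; both arrays are hypothetical.

```python
import numpy as np

y_true = np.array([1.0, -1.0, 1.0, 1.0, -1.0])   # hypothetical labels in {-1, +1}
y_hat = np.array([0.7, 0.2, -0.1, 2.3, -0.5])    # hypothetical raw predictions

# Accuracy as in the formula: 1/2 + 1/(2N') * sum of y * sign(y_hat)
acc_formula = 0.5 + 0.5 * np.mean(y_true * np.sign(y_hat))

# Direct count of the predictions whose sign matches the label
acc_direct = np.mean(np.sign(y_hat) == y_true)

print(acc_formula, acc_direct)  # 0.6 0.6
```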

**5.3** Introduce the learning rate and formally write, in two lines, a single iteration of gradient descent updating $k$ and $v$ from time step $t$ to time step $t+1$.

Your answer here:

**5.4** In one line compute $\nabla_v \hat y_\mu$. In one line compute $\nabla_v \mathcal L$.

Your answer here:

**5.5** We assume that the attention is $\sigma(\chi)=\chi$. Note that $X_{\mu,\ell,1}$ is now replaced by $D^{-1/2}v^\top X_{\mu,\ell}$. Using your previous results, compute $\nabla_k\mathcal L$ in one line.

Your answer here:

We give an implementation of the linear and softmax attentions and of their gradients.

In [None]:
def attentionLin(X, k, v):
    chi = (X * k).sum(2) / np.sqrt(D)
    return (X * chi.reshape(-1, L, 1)).sum(1) @ v / np.sqrt(D)

def attentionSoftmax(X, k, v):
    chis = (X * k).sum(2) / np.sqrt(D)
    XTX = (X * special.softmax(chis, axis=-1).reshape(-1, L, 1)).sum(1)
    return XTX @ v / np.sqrt(D)

def gradLossLin(X, y, k, v):
    # Returns the pair (gradient w.r.t. k, gradient w.r.t. v), each of shape (D,)
    yC = attentionLin(X, k, v)
    chi = (X * k).sum(2) / np.sqrt(D)
    XTX = np.transpose(X,(0, 2, 1))@X / D
    return  ((XTX @ v) * (yC-y).reshape(-1, 1)).sum(0), (((X * chi.reshape(-1, L, 1)).sum(1) / np.sqrt(D)) * (yC-y).reshape(-1, 1)).sum(0)

def gradLossSoftmax(X, y, k, v):
    # Returns the pair (gradient w.r.t. k, gradient w.r.t. v), each of shape (D,)
    yC = attentionSoftmax(X, k, v)
    chis = (X * k).sum(2) / np.sqrt(D)
    activation = special.softmax(chis, axis=-1).reshape(-1, L, 1)
    XTX = (X * activation).sum(1) / np.sqrt(D)
    Xv_dif = ((X@v/ np.sqrt(D)) - yC.reshape(-1, 1)).reshape(-1, L, 1)
    grad_y_k = (X * (Xv_dif * activation)).sum(1)/np.sqrt(D)
    return  (grad_y_k * (yC-y).reshape(-1, 1)).sum(0), (XTX * (yC-y).reshape(-1, 1)).sum(0)

**5.6** Perform the gradient descent for both the linear attention and the softmax attention. Plot the test accuracy $\mathrm{Acc}$ versus the iterations of the gradient descent. For the x-axis (number of iterations), use a logarithmic scale. Print the best accuracy you obtain for both models.

In [None]:
# Your code here

### Interpreting the results

Now, let's take a look at the produced attention scores for both linear and softmax attention.

**5.7** Write a function `getScores` that takes a tweet string, the trained vector $k$ and a boolean flag `apply_softmax` as inputs and returns the attention scores of the tweet. If `apply_softmax` is true, it should return the softmax of the scores $\chi$; otherwise it should return the absolute values of the linear attention scores $\chi$.

In [None]:
# Your code here

We provide a function to plot linear and softmax attention scores $\chi$.

In [None]:
def plotAttention(tweet, linear_scores, softmax_scores, ax):
    L = min(len(tweet.split()), 16)
    words = tweet.split()[:L][::-1]
    # Keep only the scores of the actual words (drops padding, if any), then reverse
    linear_scores = linear_scores[:L][::-1]
    softmax_scores = softmax_scores[:L][::-1]
    
    bar_height = 0.4

    ax.barh(np.arange(L)-bar_height/2, linear_scores, bar_height, color="lightblue", label="Linear attention")
    ax.set_xlabel("Linear attention scores")
    ax.set_yticks(np.arange(L))
    ax.set_yticklabels(words)
    leg = ax.legend()

    ax2 = ax.twiny()

    ax2.barh(np.arange(L)+bar_height/2, softmax_scores, bar_height, color="salmon", label="Softmax attention")
    ax2.set_xlabel("Softmax attention scores")

    leg2 = ax2.legend()
    ax2.legend(leg.get_patches()+leg2.get_patches(),
            [text.get_text() for text in leg.get_texts()+leg2.get_texts()])
    leg.remove()

Now we can plot the attention scores for positive and negative examples with given indices.

**5.8** Complete the code.

In [None]:
negative_ids = [1954, 1521, 1263]
positive_ids = [1744, 176, 924]

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i in range(3):
    tweet = dataTxt[negative_ids[i]]
    linear_scores = # to be filled
    softmax_scores = # to be filled
    plotAttention(tweet, linear_scores, softmax_scores, axes[i])

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i in range(3):
    tweet = dataTxt[positive_ids[i]]
    linear_scores = # to be filled
    softmax_scores = # to be filled
    plotAttention(tweet, linear_scores, softmax_scores, axes[i])

**5.9** What do you observe ?

Your answer here: