<a id="transformers"></a>
## Transformers - High Level Motivation:
How does the **ChatGPT** model predict the next word, given an input text?
    <p align="center">
        <div style="display: flex; justify-content: space-around;">
            <img src="../images/next-token-prediction.png" alt="Image 1" style="width:40%; margin-top:25px; margin-right:10px; ">
            <img src="../images/probability-distribution.png" alt="Image 2" style="width:40%; margin-top:25px;">
        </div>
    </p>

**ChatGPT** is based on the transformer architecture.

## Transformers - A High Level Motivation
 
- The transformer architecture was introduced in the paper [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) (Vaswani et al., 2017).
  <p align="center">
    <div style="display: flex; justify-content: space-around;">
      <img src="../images/attention-is-all-you-need.png" alt="Attention Is All You Need" style="width:35%; margin-top:10px;">
      <img src="../images/transformer-architecture.png" alt="Transformer architecture" style="width:60%; margin-top:10px;">
  </div>
  </p>
- It uses a key mechanism called **self-attention**, which lets each word in a sequence focus on all others to capture their relationships.

## Text Preprocessing 
<div style="display: flex; align-items: left;">
    <div style="flex: 1;">
        <ul>
            <li>Input text is tokenized.</li>
            <li>Tokens are converted to embeddings.</li>
            <li>Positional encodings are added.</li>
            <li>Resulting vectors are fed into the self-attention layer.</li>
        </ul>
    </div>
    <div style="flex: 1;">
        <p align="left">
          <img src="../images/transformer_positional_encoding_vectors.png" alt="Preprocessing" style="width:70%; margin-left: 10px;">
        </p>
    </div>
</div>

<p style="text-align: left; margin-top: 20px;">
    <ul>
        <li>We will now concentrate on the <b>self-attention layer</b> and implement it.</li>
    </ul>
</p>
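
The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the whitespace tokenizer and three-word vocabulary are hypothetical (real models use subword tokenizers), the embedding table is randomly initialized rather than learned, and the sinusoidal positional encoding is the variant described in the original transformer paper, which may differ from what a given model uses. Note that this sketch stores one input per row, shape `(N, D)`, whereas the rest of this notebook stores inputs as columns.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical toy vocabulary and whitespace tokenizer (assumption for illustration)
toy_vocab = {"thinking": 0, "machines": 1, "learn": 2}
tokens = "thinking machines learn".split()
token_ids = torch.tensor([toy_vocab[t] for t in tokens])  # shape (N,)

embed_dim = 4  # size of each embedding vector

# Randomly initialized embedding table standing in for a learned one
embedding = torch.nn.Embedding(num_embeddings=len(toy_vocab), embedding_dim=embed_dim)
token_embeddings = embedding(token_ids)  # shape (N, embed_dim)


def sinusoidal_positional_encoding(num_positions, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (N, 1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim)
    )  # (dim / 2,)
    encoding = torch.zeros(num_positions, dim)
    encoding[:, 0::2] = torch.sin(positions * div_term)
    encoding[:, 1::2] = torch.cos(positions * div_term)
    return encoding


positional_encoding = sinusoidal_positional_encoding(len(tokens), embed_dim)

# Vectors that would be fed into the self-attention layer
attention_inputs = token_embeddings + positional_encoding
print(attention_inputs.shape)  # torch.Size([3, 4])
```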

<a id="self-attention"></a>
## Self-Attention

The _self-attention_ block in the transformer architecture takes $N$ inputs: ${\boldsymbol x_{1}},\, {\boldsymbol x_{2}}, \dots, {\boldsymbol x_{N}}$, each of size $D \times 1$, and returns $N$ output vectors. The process is as follows:


<a id="self-attention-step-01"></a>
### Self-Attention: Step 1
For each input word ${\boldsymbol x}$, generate:
- A _query_ vector ${\boldsymbol q}$, a _key_ vector ${\boldsymbol k}$, and a _value_ vector ${\boldsymbol v}$.
- To compute these, use:
  - Weight matrix ${\boldsymbol W_{q}}$ and bias vector ${\boldsymbol b_{q}}$ for queries.
  - Weight matrix ${\boldsymbol W_{k}}$ and bias vector ${\boldsymbol b_{k}}$ for keys.
  - Weight matrix ${\boldsymbol W_{v}}$ and bias vector ${\boldsymbol b_{v}}$ for values.

- Implement the computation of _queries_, _keys_, and _values_ using the [torch](https://pytorch.org/) Python Library.
- Generate input vectors ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, and ${\boldsymbol x_{3}}$ with dimensions $D \times 1$, with $D=4$.
- Define weight matrices and bias vectors: 
  - ${\boldsymbol W_{q}}, {\boldsymbol b_{q}}$ for queries
  - ${\boldsymbol W_{k}}, {\boldsymbol b_{k}}$ for keys
  - ${\boldsymbol W_{v}}, {\boldsymbol b_{v}}$ for values
- Compute the _queries_, _keys_, and _values_ using these weights and biases.


In [None]:
import torch

# Set seed so we get the same random numbers
torch.manual_seed(3)

# Number of inputs
N = 3

# Number of dimensions of each input
D = 4

# Create an empty list
all_x = []
# Create elements x_n and append to list
for n in range(N):
    # Creates a tensor of shape (D, 1) with values drawn from the standard normal distribution, 
    # with a mean of 0 and a standard deviation of 1 
    x = torch.randn(size=(D, 1))  # D x 1 tensor
    print(f"The vector x_{n+1} is: \n {x} \n")
    all_x.append(x)  # Append x to list

After executing the cell above, the list object, `all_x` now has $N$ vectors, each of dimension $D \times 1$.

Let's define the weight matrices and bias vectors.

In [2]:
# Set seed so we get the same random numbers
torch.manual_seed(0)

# Choose random values for the parameters

# weight matrices
W_q = torch.randn(size=(D, D))
W_k = torch.randn(size=(D, D))
W_v = torch.randn(size=(D, D))

# bias terms
b_q = torch.randn(size=(D, 1))
b_k = torch.randn(size=(D, 1))
b_v = torch.randn(size=(D, 1))

<a id="exercise-01"></a>
#### Exercise 1
Given vectors ${\boldsymbol x}_{1}$, ${\boldsymbol x}_{2}$, and ${\boldsymbol x}_{3}$ in **`all_x`**, and weight matrices ${\boldsymbol W_{q}}$, ${\boldsymbol W_{k}}$, ${\boldsymbol W_{v}}$, and bias vectors ${\boldsymbol b_{q}}$, ${\boldsymbol b_{k}}$, ${\boldsymbol b_{v}}$, compute:
  - Query vectors: ${\boldsymbol q_{1}}$, ${\boldsymbol q_{2}}$, ${\boldsymbol q_{3}}$
  - Key vectors: ${\boldsymbol k_{1}}$, ${\boldsymbol k_{2}}$, ${\boldsymbol k_{3}}$
  - Value vectors: ${\boldsymbol v_{1}}$, ${\boldsymbol v_{2}}$, ${\boldsymbol v_{3}}$

- Formulas:
  - ${\boldsymbol q} = {\boldsymbol b_{q}} + {\boldsymbol W_{q}}\, {\boldsymbol x}$
  - ${\boldsymbol k} = {\boldsymbol b_{k}} + {\boldsymbol W_{k}}\, {\boldsymbol x}$
  - ${\boldsymbol v} = {\boldsymbol b_{v}} + {\boldsymbol W_{v}}\, {\boldsymbol x}$

- Use [torch.matmul](https://pytorch.org/docs/stable/generated/torch.matmul.html).
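
As a shape check before attempting the exercise, here is a minimal sketch of the affine map ${\boldsymbol b} + {\boldsymbol W}\,{\boldsymbol x}$ applied to a single vector. The tensors `x_demo`, `W_demo`, and `b_demo` are hypothetical stand-ins, not the notebook's `all_x`, `W_q`, or `b_q`.

```python
import torch

torch.manual_seed(42)

D_demo = 4
x_demo = torch.randn(D_demo, 1)       # one input vector, shape (D, 1)
W_demo = torch.randn(D_demo, D_demo)  # weight matrix, shape (D, D)
b_demo = torch.randn(D_demo, 1)       # bias vector, shape (D, 1)

# Affine map: matrix-vector product followed by adding the bias
q_demo = b_demo + torch.matmul(W_demo, x_demo)
print(q_demo.shape)  # torch.Size([4, 1])
```

The same pattern, with the appropriate weight matrix and bias, gives the key and value vectors.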

<a href="#exercise-01">Exercise 1</a> code snippet for you to complete.

In [3]:
# Make three lists to store queries, keys, and values
all_queries = []
all_keys = []
all_values = []

# For every input
for x in all_x:
    # TODO
    # Compute query, key, and value
    query = None 
    key = None
    value = None

    # Append the results to the lists
    all_queries.append(query)
    all_keys.append(key)
    all_values.append(value)

A visual illustration of <a href="#self-attention-step-01">Self-Attention: Step 1</a>


<p align="center">
  <img src="../images/self_attention_step_01.png" alt="self attention step 1" style="width:50%; margin-top:25px;">
</p>

### Self-Attention: Step 2

- **Objective**: Calculate a score for self-attention.
- **Example**: For the word _"Thinking"_ in the sentence _"Thinking Machine"_.
- **Purpose**: Score each word against _"Thinking"_ to determine how much focus to place on other parts of the sentence during encoding.


- **Attention Score Calculation**:
  - Use the **dot product** of the _query_ vector of the word being processed with the _key_ vectors of all words.
  - For word 1 (_"Thinking"_):
    - First score: **dot product** of ${\boldsymbol q_{1}}$ and ${\boldsymbol k_{1}}$.
    - Second score: **dot product** of ${\boldsymbol q_{1}}$ and ${\boldsymbol k_{2}}$.
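
A small worked example of the score calculation, using hand-picked hypothetical vectors (not the notebook's queries and keys) so the arithmetic can be checked by eye:

```python
import torch

# Hypothetical D x 1 query and key vectors chosen for easy arithmetic
q_1_demo = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
k_1_demo = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
k_2_demo = torch.tensor([[0.0], [1.0], [0.0], [1.0]])

# torch.dot expects 1-D tensors, so flatten the column vectors first
score_11 = torch.dot(q_1_demo.flatten(), k_1_demo.flatten())  # 1*1 + 3*1 = 4
score_12 = torch.dot(q_1_demo.flatten(), k_2_demo.flatten())  # 2*1 + 4*1 = 6

print(score_11, score_12)  # tensor(4.) tensor(6.)
```

In this toy case the query aligns more strongly with the second key, so the second word would receive the larger score.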


- **Visual illustration**

<p align="center">
  <img src="../images/self_attention_step_02.png" alt="self attention step 2" style="width:50%; margin-top:25px;">
</p>

<a id="exercise-02"> </a>
#### Exercise 2

Compute the _self-attention scores_ of the input vectors ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, ${\boldsymbol x_{3}}$, using their query vectors ${\boldsymbol q_{1}}$, ${\boldsymbol q_{2}}$, ${\boldsymbol q_{3}}$ and key vectors: ${\boldsymbol k_{1}}$, ${\boldsymbol k_{2}}$, ${\boldsymbol k_{3}}$.

In [None]:
import math

all_attention_scores = []
for query in all_queries:
    query_keys_attention_scores = []
    for key in all_keys:
        # TODO
        # Compute the dot product of the query and the key
        dot_product = None
        # TODO
        # Compute the scaled dot product by dividing the dot product by the square root of D
        # Use the math.sqrt() function to get the square root of D
        # Scaling keeps the scores from growing with D, which would otherwise
        # saturate the softmax and produce very small gradients
        scaled_dot_product = None

        # Append the result to the list
        query_keys_attention_scores.append(scaled_dot_product)
    all_attention_scores.append(query_keys_attention_scores)
    
all_attention_scores

Your output should be
```python
[[tensor(-2.8869), tensor(3.6531), tensor(-1.1920)],
 [tensor(0.0760), tensor(4.2115), tensor(4.1211)],
 [tensor(0.8152), tensor(-6.1015), tensor(1.6231)]]
```
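
The division by $\sqrt{D}$ in Exercise 2 can be motivated empirically. The sketch below assumes query and key entries are independent standard normal values (an idealization, not a property of trained models) and compares the spread of raw and scaled dot products. Under that assumption the raw variance grows roughly like $D$, while the scaled variance stays near $1$.

```python
import math
import torch

torch.manual_seed(0)

num_samples, dim_demo = 10_000, 64

# Hypothetical random queries and keys, one pair per row
queries_demo = torch.randn(num_samples, dim_demo)
keys_demo = torch.randn(num_samples, dim_demo)

# Row-wise dot products, then the scaled version
dot_products = (queries_demo * keys_demo).sum(dim=1)
scaled_dot_products = dot_products / math.sqrt(dim_demo)

raw_variance = dot_products.var().item()
scaled_variance = scaled_dot_products.var().item()

print(f"variance of raw dot products:    {raw_variance:.2f}")    # roughly dim_demo
print(f"variance of scaled dot products: {scaled_variance:.2f}")  # roughly 1
```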

<a id="self-attention-step-03"></a>
### Self-Attention: Step 3

- Normalize each attention score so that they are positive and sum to $1$. 
- This is done using the [softmax function](https://en.wikipedia.org/wiki/Softmax_function), $\sigma: \mathbb{R}^{m} \to (0, 1)^{m}$, defined as:

Given the vector ${\boldsymbol z} = (z_{1},\, z_{2},\,\dots,\, z_{m}) \in \mathbb{R}^{m}$, 

$$\sigma({\boldsymbol z})_{i} = \dfrac{e^{z_{i}}}{\sum_{j=1}^{m} e^{z_{j}}}$$
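
A worked numeric example of the formula may help. The sketch below uses a common variant that subtracts the maximum entry before exponentiating. This leaves the result unchanged mathematically but avoids overflow for large inputs. The function name `stable_softmax` and the test vectors are assumptions made for illustration.

```python
import torch


def stable_softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    z = torch.as_tensor(z, dtype=torch.float32)
    shifted = z - z.max()       # largest entry becomes 0
    exp_z = torch.exp(shifted)  # all entries now lie in (0, 1]
    return exp_z / exp_z.sum()


small_demo = stable_softmax([1.0, 2.0, 3.0])
print(small_demo)  # approximately tensor([0.0900, 0.2447, 0.6652])

# A naive exp(1000.) would overflow to inf; the shifted version stays finite
large_demo = stable_softmax([1000.0, 1001.0, 1002.0])
print(large_demo)
```

Because the entries of the second vector differ by the same amounts as those of the first, both calls produce the same weights.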


<a id="exercise-03"></a>
#### Exercise 3
Implement the softmax function.

In [5]:
def softmax(items_in):
    if not isinstance(items_in, torch.Tensor):
        items_in = torch.Tensor(items_in)
    # TODO
    # Compute the exponential of the input
    exp_items = None
    # TODO
    # Compute the softmax by dividing by the sum of exponentials
    items_out = None

    return items_out


- Use your `softmax` implementation to compute the _self-attention weights_ by applying it to the _self-attention scores_ calculated in <a href="#exercise-02">Exercise 2</a>. 

- The _self-attention scores_ for each input vector: ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, and ${\boldsymbol x_{3}}$ are stored in the list object `all_attention_scores`.



<a id="exercise-04"></a>
#### Exercise 4
Compute the self-attention weights of the input vectors: ${\boldsymbol x_{1}}$, ${\boldsymbol x_{2}}$, and ${\boldsymbol x_{3}}$ using their self-attention scores.

In [None]:
all_attention_weights = []
for idx, attention_scores in enumerate(all_attention_scores):
    # TODO
    # Compute the self-attention weights using the softmax function
    attention_weights = None 
    print(f"The attention weights for input vector x_{idx+1} are: {attention_weights}")
    all_attention_weights.append(attention_weights)

Your output should be:
```python
The attention weights for input vector x_1 are: tensor([0.0014, 0.9908, 0.0078])
The attention weights for input vector x_2 are: tensor([0.0083, 0.5183, 0.4735])
The attention weights for input vector x_3 are: tensor([3.0824e-01, 3.0549e-04, 6.9145e-01])
```

<a id="self-attention-step-04"></a>
### Self-Attention: Step 4

- Multiply each value vector in **`all_values`** by its corresponding attention weight from **`all_attention_weights`**.
- This retains important values while reducing the influence of less relevant words using small attention weights.
- After multiplication, sum the weighted value vectors.
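
A tiny worked example of the weighted sum, with hypothetical weights and two-dimensional value vectors chosen so the result can be verified by hand:

```python
import torch

# Hypothetical attention weights for one query (they sum to 1)
weights_demo = torch.tensor([0.5, 0.3, 0.2])

# Hypothetical value vectors, each of shape (2, 1)
values_demo = [
    torch.tensor([[1.0], [0.0]]),
    torch.tensor([[0.0], [1.0]]),
    torch.tensor([[1.0], [1.0]]),
]

# Scale each value vector by its weight, then sum the scaled vectors
weighted_demo = [w * v for w, v in zip(weights_demo, values_demo)]
output_demo = torch.stack(weighted_demo).sum(dim=0)  # shape (2, 1)

# 0.5*[1, 0] + 0.3*[0, 1] + 0.2*[1, 1] = [0.7, 0.5]
print(output_demo)
```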



<a id="exercise-05"></a>
#### Exercise 5
Use the attention weights from <a href="#exercise-04">Exercise 4</a> (stored in **`all_attention_weights`**) and the value vectors in **`all_values`** to compute the weighted sum of ${\boldsymbol v_{1}}, {\boldsymbol v_{2}}, {\boldsymbol v_{3}}$.


In [None]:
all_attention_weighted_values = []

# Loop over each set of attention weights
for attention_weights in all_attention_weights:
    attention_weighted_values = []
    
    # For each attention weight and corresponding value
    for attention_weight, value in zip(attention_weights, all_values):
        # TODO
        # Compute the scalar vector multiplication
        weighted_value = None
        
        attention_weighted_values.append(weighted_value)
    # TODO
    # Sum the weighted values across all values
    # Use torch.stack() to stack the weighted values
    # Use torch.sum() to sum the values
    attention_weight_across_all_values = None
    
    # Append the final result to the list
    all_attention_weighted_values.append(attention_weight_across_all_values)

all_attention_weighted_values

Your output should be: 

```python
[tensor([[ 0.2117],
         [ 1.0697],
         [-3.3355],
         [-4.9260]]),
 tensor([[ 0.6486],
         [ 0.9883],
         [-2.4109],
         [-3.0185]]),
 tensor([[ 0.6463],
         [ 0.8405],
         [-1.6421],
         [-0.0805]])]
```

<a id="self-attention-all-steps"></a>
### Self-Attention: All Steps
A visual illustration of all **self-attention** mechanism steps.
<p align="center">
  <img src="../images/self-attention-output.png" alt="Complete Attention Steps" style="width:50%; margin-top:25px;">
</p>

<a id="self-attention-matrix-computations"></a>
### Self-Attention: Matrix Computation
All of these computations can be done with matrices. Step 1 is as follows:

<p align="center">
  <img src="../images/self-attention-matrix-calculation.png" alt="Self Attention Matrix Calculation Step 1" style="width:40%; margin-top:20px">
</p>

### Self-Attention: Matrix Computation
The remaining steps are illustrated in the figure below:

<p align="center">
  <img src="../images/self-attention-matrix-calculation-2.png" alt="Self Attention Matrix Calculation Step 2" style="width:50%;">
</p>

### Self-Attention: Matrix Computation: Step 1

Compute the **attention scores** by multiplying the set of queries packed in matrix $Q$ with the keys in the matrix $K$. If the matrix $Q$ is of size $m \times d_k$, and the matrix $K$ is of size $n \times d_k$, then the resulting matrix will be of size $m \times n$:

$$
QK^\top = \begin{bmatrix}
e_{11} & e_{12} & \dots & e_{1n} \\
e_{21} & e_{22} & \dots & e_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
e_{m1} & e_{m2} & \dots & e_{mn}
\end{bmatrix}
$$

### Self-Attention: Matrix Computation: Step 2

Scale each of the **attention scores** by $\frac{1}{\sqrt{d_k}}$:

$$
\frac{QK^\top}{\sqrt{d_k}} = \begin{bmatrix}
\frac{e_{11}}{\sqrt{d_k}} & \frac{e_{12}}{\sqrt{d_k}} & \dots & \frac{e_{1n}}{\sqrt{d_k}} \\
\frac{e_{21}}{\sqrt{d_k}} & \frac{e_{22}}{\sqrt{d_k}} & \dots & \frac{e_{2n}}{\sqrt{d_k}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{e_{m1}}{\sqrt{d_k}} & \frac{e_{m2}}{\sqrt{d_k}} & \dots & \frac{e_{mn}}{\sqrt{d_k}}
\end{bmatrix}
$$

### Self-Attention: Matrix Computation: Step 3

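Apply the softmax function, $\sigma: \mathbb{R}^{n} \to (0, 1)^{n}$, to each row of the scaled score matrix to obtain a set of **attention weights**. Each row of the result sums to $1$, and the subscript on $\sigma(\cdot)_{ij}$ denotes the $j$-th component of the softmax of row $i$: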

$$
\sigma\left( \frac{QK^\top}{\sqrt{d_k}} \right) = \begin{bmatrix}
\sigma\left( \frac{e_{11}}{\sqrt{d_k}} \right)_{11} & \sigma\left( \frac{e_{12}}{\sqrt{d_k}} \right)_{12} & \dots & \sigma\left( \frac{e_{1n}}{\sqrt{d_k}} \right)_{1n} \\
\sigma\left( \frac{e_{21}}{\sqrt{d_k}} \right)_{21} & \sigma\left( \frac{e_{22}}{\sqrt{d_k}} \right)_{22} & \dots & \sigma\left( \frac{e_{2n}}{\sqrt{d_k}} \right)_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma\left( \frac{e_{m1}}{\sqrt{d_k}} \right)_{m1} & \sigma\left( \frac{e_{m2}}{\sqrt{d_k}} \right)_{m2} & \dots & \sigma\left( \frac{e_{mn}}{\sqrt{d_k}} \right)_{mn}
\end{bmatrix}
$$

### Self-Attention: Matrix Computation: Step 4

Finally, apply the resulting attention weights to the values in matrix $V$, of size $n \times d_v$:

$$
\sigma\left( \frac{QK^\top}{\sqrt{d_k}} \right) \cdot V = 
\begin{bmatrix}
\sigma\left( \frac{e_{11}}{\sqrt{d_k}} \right)_{11} & \dots & \sigma\left( \frac{e_{1n}}{\sqrt{d_k}} \right)_{1n} \\
\sigma\left( \frac{e_{21}}{\sqrt{d_k}} \right)_{21} & \dots & \sigma\left( \frac{e_{2n}}{\sqrt{d_k}} \right)_{2n} \\
\vdots & \ddots & \vdots \\
\sigma\left( \frac{e_{m1}}{\sqrt{d_k}} \right)_{m1} & \dots & \sigma\left( \frac{e_{mn}}{\sqrt{d_k}} \right)_{mn}
\end{bmatrix}
\cdot
\begin{bmatrix}
v_{11} & \dots & v_{1d_v} \\
v_{21} & \dots & v_{2d_v} \\
\vdots & \ddots & \vdots \\
v_{n1} & \dots & v_{nd_v}
\end{bmatrix}
$$
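
The four matrix steps above can be sketched in a few lines. This is a minimal illustration in the row convention of the formulas (one query, key, or value per row). The sizes and the random matrices `Q_demo`, `K_demo`, and `V_demo` are hypothetical.

```python
import math
import torch

torch.manual_seed(0)

num_queries, num_keys, d_k_demo, d_v_demo = 3, 5, 4, 2

Q_demo = torch.randn(num_queries, d_k_demo)  # m x d_k
K_demo = torch.randn(num_keys, d_k_demo)     # n x d_k
V_demo = torch.randn(num_keys, d_v_demo)     # n x d_v

# Step 1: attention scores, shape (m, n)
scores_demo = Q_demo @ K_demo.T

# Step 2: scale by 1 / sqrt(d_k)
scaled_scores_demo = scores_demo / math.sqrt(d_k_demo)

# Step 3: softmax over the last dimension, so each row sums to 1
attention_demo = torch.softmax(scaled_scores_demo, dim=-1)

# Step 4: weight the values, shape (m, d_v)
result_demo = attention_demo @ V_demo

print(result_demo.shape)  # torch.Size([3, 2])
```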


- In [Exercise 6](#exercise-06), you will implement self-attention using matrices.
- The input matrix ${\boldsymbol X}$ consists of the columns from **`all_x`**.
- Apply the _softmax_ function using [torch.softmax](https://pytorch.org/docs/stable/generated/torch.softmax.html).

In [None]:
# Copy data into a matrix
X = torch.zeros((D, N))  # Create a tensor of shape (D, N) filled with zeros

# Copy the data from all_x into the matrix
X[:, 0] = all_x[0].squeeze()
X[:, 1] = all_x[1].squeeze()
X[:, 2] = all_x[2].squeeze()

X

<a id="exercise-06"></a>
#### Exercise 6
Given the input matrix: ${\boldsymbol X}$, the weight matrices: ${\boldsymbol W_{v}}$, ${\boldsymbol W_{q}}$, ${\boldsymbol W_{k}}$, and the bias vectors: ${\boldsymbol b_{v}}$, ${\boldsymbol b_{q}}$, ${\boldsymbol b_{k}}$, implement the matrix form of the self-attention mechanism:

$$\sigma\left( \frac{QK^\top}{\sqrt{d_k}} \right) \cdot V$$

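For the matrix transpose, use the `.T` attribute. Note that ${\boldsymbol X}$ stores the inputs as columns (size $D \times N$), so the code below works with the transposed but equivalent form $V \cdot \sigma\left( \frac{K^\top Q}{\sqrt{d_k}} \right)$, where the softmax is applied to each column.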




In [10]:
# Define the scaled dot product self-attention function
def scaled_dot_product_self_attention(X, W_v, W_q, W_k, b_v, b_q, b_k):
    ## Step 1: Compute queries, keys, and values
    # 1. Compute queries, keys, and values using matrix multiplication and addition
    # TODO
    queries = None
    keys = None
    values = None

    # 2. Compute dot products of keys and queries (keys.T @ queries)
    # TODO
    attention_scores = None

    ## Step 2: Perform the scaling operation
    # 3. Scale the dot products by the square root of the dimensionality of the keys
    d_k = torch.tensor(keys.shape[0], dtype=torch.float32) 
    # TODO
    # Use torch.sqrt() to get the square root of d_k
    scaled_attention_scores = None

    ## Step 3: Perform the softmax operation to calculate attention weights
    # 4. Apply softmax to each column of scaled_attention_scores (dim=0),
    # so that the attention weights for each query sum to 1
    # TODO
    attention_weights = None

    ## Step 4: Calculate the output
    # 5. Weight the values by attention weights
    # TODO
    X_prime = None

    return X_prime


In [None]:
X_prime = scaled_dot_product_self_attention(X, W_v, W_q, W_k, b_v, b_q, b_k)
X_prime

Your output should be:

```python
tensor([[ 0.2117,  0.6486,  0.6463],
        [ 1.0697,  0.9883,  0.8405],
        [-3.3355, -2.4109, -1.6421],
        [-4.9260, -3.0185, -0.0805]])
```
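
The matrix formula is written for inputs stored as rows, while `X` in this notebook stores them as columns. The following sketch checks numerically that the two conventions give the same result up to a transpose. The random matrices are hypothetical stand-ins for already-computed queries, keys, and values.

```python
import math
import torch

torch.manual_seed(0)

D_demo, N_demo = 4, 3

# Hypothetical queries, keys, and values stored as columns, shape (D, N)
Q_col = torch.randn(D_demo, N_demo)
K_col = torch.randn(D_demo, N_demo)
V_col = torch.randn(D_demo, N_demo)

# Column convention: softmax down each column (dim=0), values on the left
weights_col = torch.softmax(K_col.T @ Q_col / math.sqrt(D_demo), dim=0)
out_col = V_col @ weights_col  # shape (D, N)

# Row convention: transpose everything, softmax along each row (dim=-1)
Q_row, K_row, V_row = Q_col.T, K_col.T, V_col.T
weights_row = torch.softmax(Q_row @ K_row.T / math.sqrt(D_demo), dim=-1)
out_row = weights_row @ V_row  # shape (N, D)

# The two outputs are transposes of each other
print(torch.allclose(out_col, out_row.T, atol=1e-5))
```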

<a id="multihead-self-attention"></a>
## Multihead Self-Attention 

- Multihead self-attention maps $N$ input vectors ${\boldsymbol x_{i}} \in \mathbb{R}^{D}$ to $N$ output vectors ${\boldsymbol x^{'}_{i}}\in \mathbb{R}^{D}$ with $1 \leq i \le N$.
- It repeats the self-attention mechanism a fixed number of times in parallel.
- Self-attention occurs across multiple "heads", each with its own queries, keys, and values.
- Outputs are vertically concatenated and recombined using a linear transformation layer, ${\boldsymbol \Omega_{c}}$ (named `W_c` in the code below).
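
To make the shape bookkeeping concrete, here is a minimal sketch of the concatenate-and-recombine step only. The head outputs are random placeholders rather than real attention results, and all `_demo` names are hypothetical.

```python
import torch

torch.manual_seed(0)

D_demo, N_demo, H_demo = 8, 6, 2
head_dim_demo = D_demo // H_demo  # each head works with D / H dimensions

# Placeholder head outputs, each of shape (D / H, N)
head_outputs_demo = [torch.randn(head_dim_demo, N_demo) for _ in range(H_demo)]

# Vertical concatenation stacks the heads along the rows: (D, N)
concatenated_demo = torch.cat(head_outputs_demo, dim=0)

# Final linear transformation mixes information across heads: (D, N)
W_c_demo = torch.randn(D_demo, D_demo)
recombined_demo = W_c_demo @ concatenated_demo

print(concatenated_demo.shape, recombined_demo.shape)
```

Splitting $D$ across the heads keeps the concatenated output the same size as the input, so the block maps $N$ vectors of size $D$ to $N$ vectors of size $D$.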

## Multihead Self-Attention
Visual illustration of two-head self-attention.

<p align="center">
  <img src="../images/two-head-self-attention.png" alt="Two Head Attention" style="width:50%;">
</p>

### Input Requirements for Implementing the 2-Head Self-Attention Mechanism

- In [Exercise 7](#exercise-07), you will implement two-head self-attention using matrices.
- The input matrix ${\boldsymbol X}$ has dimension $D \times N$, with $N=6$ and $D=8$.
- Apply the _softmax_ function using [torch.softmax](https://pytorch.org/docs/stable/generated/torch.softmax.html).

In [None]:
# Set seed so we get the same random numbers
torch.manual_seed(3)

# Number of inputs
N = 6

# Number of dimensions of each input
D = 8

# Create a tensor with random normal values (mean=0, std=1)
X = torch.randn(D, N)

# Print X
print(X)


### Input Requirements for Implementing the 2-Head Self-Attention Mechanism

- Weight matrices and bias vectors are needed for keys, queries, and values.
- Queries, keys, and values will have dimensions $\frac{D}{H} \times N$, with $H=2$.

In [13]:
H = 2
# QKV dimension
H_D = int(D / H)

# Set seed so we get the same random numbers
torch.manual_seed(0)

# Choose random values for the parameters for the first head
W_q1 = torch.randn(size=(H_D, D))
W_k1 = torch.randn(size=(H_D, D))
W_v1 = torch.randn(size=(H_D, D))
b_q1 = torch.randn(size=(H_D, 1))
b_k1 = torch.randn(size=(H_D, 1))
b_v1 = torch.randn(size=(H_D, 1))

# Choose random values for the parameters for the second head
W_q2 = torch.randn(size=(H_D, D))
W_k2 = torch.randn(size=(H_D, D))
W_v2 = torch.randn(size=(H_D, D))
b_q2 = torch.randn(size=(H_D, 1))
b_k2 = torch.randn(size=(H_D, 1))
b_v2 = torch.randn(size=(H_D, 1))

# Choose random values for the parameters
W_c = torch.randn(size=(D, D)) # Linear transformation used to combine the vertically concatenated attention heads.

<a id="exercise-07"></a>
#### Exercise 7

Implement the multihead self-attention mechanism by completing the code snippet below.

In [14]:
# Define the multi-head scaled self-attention mechanism
def multihead_scaled_self_attention(
        X, W_v1, W_q1, W_k1, b_v1, b_q1, b_k1, W_v2, W_q2, W_k2, b_v2, b_q2, b_k2, W_c
    ):
    
    # 1. Compute queries, keys, and values for Head 1
    Q_1 = torch.matmul(W_q1, X) + b_q1
    # TODO
    K_1 = None
    V_1 = None

    # 2. Compute queries, keys, and values for Head 2
    # TODO
    Q_2 = None
    K_2 = None
    V_2 = None

    # 3. Compute dot products (Remember to transpose the keys)
    # TODO
    attention_scores_1 = None
    attention_scores_2 = None
    
    # dimensionality of the keys (same for both heads)
    d_k = torch.tensor(K_1.shape[0], dtype=torch.float32)  

    # 4. Scale dot products
    # TODO
    scaled_attention_scores_1 = None
    scaled_attention_scores_2 = None

    # 5. Apply softmax to calculate attention weights
    # TODO
    attention_weights_1 = None
    attention_weights_2 = None

    # 6. Weight values by attention weights
    # TODO
    head_1_output = None
    head_2_output = None

    # 7. Concatenate the outputs of the two heads using torch.cat
    # Note the dimension you want to concatenate along. 
    # You want to concatenate along the rows
    # TODO
    concatenated_output = None

    # 8. Apply the final linear transformation
    # TODO
    X_prime = None

    return X_prime


In [None]:
X_prime = multihead_scaled_self_attention(
    X=X,
    W_v1=W_v1, W_q1=W_q1, W_k1=W_k1, b_v1=b_v1, b_q1=b_q1, b_k1=b_k1,
    W_v2=W_v2, W_q2=W_q2, W_k2=W_k2, b_v2=b_v2, b_q2=b_q2, b_k2=b_k2,
    W_c=W_c
)

# Set precision for printing
# torch.set_printoptions(precision=3)

# Print out the results
print("Your answer:")
print(X_prime)

print("\nTrue values:")
true_values = torch.tensor([[  7.5007,  15.3861,  12.1210,  23.4584,   5.5463,  -7.4986],
                            [  4.2205,   4.8746,  -2.2050,   4.0501,  -4.5253,   5.1547],
                            [  1.8911,   3.0347,   3.3987,   2.7332,   2.9584,  -0.8235],
                            [  2.6210,   2.1765,  -4.9742,  -0.9246,  -1.9279,   3.7258],
                            [ -0.1304,  -0.2504,   3.7003,   0.9485,   9.3840,   0.6969],
                            [  2.5238,   1.5546,  -0.7890,   2.6671,  -0.4587,   4.4277],
                            [  0.0560,  -1.6877,  -1.5374,  -1.7000,   0.3914,   4.6478],
                            [ -1.3516,  -4.1357,  -8.8783,  -1.0026, -12.8571,  -4.9448]])

print(true_values)

# Check if the result is correct
assert torch.allclose(X_prime, true_values, atol=1e-4)

### Text Generation with a Pretrained Transformer-Based Model

In [9]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
# model_name = "Qwen/Qwen2.5-0.5B"
# model_name = "Qwen/Qwen2.5-1.5B-Instruct"
# model_name = "Qwen/Qwen2.5-1.5B"

device = "cpu" # the device to load the model onto
device_map = "cpu"  # "auto"

# Load the model and tokenizer
# transformers is a Python library from Hugging Face that provides pretrained
# models for natural language understanding and natural language generation.
# Here it is used with the PyTorch backend.
# Each model is associated with a tokenizer that is used to preprocess the input text.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Qwen2.5-0.5B-Instruct pretrained model 
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map=device_map,
)

# Load the tokenizer for the Qwen2.5-0.5B-Instruct pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [15]:
## Input text

prompt = "Give me a short introduction to large language model."

## Maximum number of tokens to generate
max_new_tokens=100

## Apply the chat template to the input text
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user", 
        "content": prompt
    }
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)


## Tokenization
model_inputs = tokenizer([text], return_tensors="pt").to(device)

In [None]:
## Generate at most max_new_tokens tokens
generated_ids = model.generate(
    model_inputs.input_ids,
    attention_mask=model_inputs.attention_mask,
    max_new_tokens=max_new_tokens
)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]


## Decode the generated token ids
response = tokenizer.batch_decode(sequences=generated_ids, skip_special_tokens=True)[0]

response
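
Conceptually, each generation step turns the model's output scores (logits) into a probability distribution over the vocabulary and then chooses a next token from it. The toy sketch below illustrates that single step with a hypothetical five-word vocabulary and made-up logits. It does not call the Qwen model, and the actual decoding strategy used by `generate` depends on the model's generation configuration.

```python
import torch

torch.manual_seed(0)

# Hypothetical vocabulary and logits for one generation step (made up for illustration)
demo_vocab = ["cat", "dog", "sat", "mat", "the"]
demo_logits = torch.tensor([2.0, 0.5, 1.0, -1.0, 0.0])

# Softmax converts the logits into probabilities that sum to 1
next_token_probs = torch.softmax(demo_logits, dim=-1)

# Greedy decoding: always pick the most probable token
greedy_id = torch.argmax(next_token_probs).item()

# Sampling: draw a token at random according to the probabilities
sampled_id = torch.multinomial(next_token_probs, num_samples=1).item()

print("greedy choice: ", demo_vocab[greedy_id])  # "cat", the largest logit
print("sampled choice:", demo_vocab[sampled_id])
```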