# Developing an NLP Model with QKV Attention Mechanism

This notebook focuses on developing a Natural Language Processing (NLP) model from scratch using Python and NumPy. It walks through implementing the QKV (Query, Key, Value) attention mechanism, a cornerstone of modern NLP models.

## Overview

The attention mechanism has revolutionized NLP by allowing models to focus on different parts of the input sequence, thereby enhancing their predictive accuracy. In this guide, we will:

- **Understand and Implement the QKV Attention Mechanism**: Dive deep into the Query, Key, and Values (QKV) attention mechanism, and learn how to implement it from scratch using Python and NumPy.
- **Build the Model Architecture**: Construct the architecture of an NLP model incorporating the QKV attention mechanism.
- **Train and Evaluate the Model**: Train the model with appropriate datasets and assess its performance.

## Objectives

- **Gain Insight into Attention Mechanisms**: Develop a thorough understanding of how attention mechanisms work and their significance in NLP tasks.
- **Hands-On Implementation**: Build and experiment with a QKV attention-based NLP model from scratch.
- **Evaluate Performance**: Learn how to evaluate the model's performance and make improvements based on the results.

This project aims to provide a clear and practical guide for implementing and experimenting with attention mechanisms in NLP, offering valuable hands-on experience with one of the most influential techniques in modern machine learning.



# Input Data

In this section, we are preparing the input data for our NLP model by defining arrays representing word embeddings and combining them into a structured format.
As a starting point, let's consider 3 phrases of 4 words each, where each word has an embedding of size 6:
 


In [None]:
import sys 
import numpy as np
 

# Phrase 1
word1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5,0.3])
word2 = np.array([0.5, 0.4, 0.7,0.3, 0.2,0.3])
word3 = np.array([0.2,0.7, 0.3, 0.5, 0.4,0.4])
word4 = np.array([0.4, 0.1,0.7, 0.2, 0.5,0.7])

# Phrase 2
word5 = np.array([0.1, 0.9, 0.3, 0.4, 0.5,0.2])
word6 = np.array([0.4, 0.4, 0.7,0.3, 0.4,0.6])
word7 = np.array([0.2,0.7, 0.4, 0.5, 0.4,0.2])
word8 = np.array([0.4, 0.5,0.7, 0.7, 0.5,0.1])

# Phrase 3
word9 = np.array([0.1, 0.2, 0.3, 0.8, 0.5,0.2])
word10 = np.array([0.4, 0.5, 0.7,0.3, 0.8,0.4])
word11 = np.array([0.9,0.7, 0.3, 0.5, 0.4,0.6])
word12 = np.array([0.4, 0.5,0.1, 0.7, 0.4,0.4])
 

We combine all these word embeddings into a single matrix. This matrix, `inputs`, has the shape `(3, 4, 6)`, where:
- `3` represents the number of phrases (batch size),
- `4` is the number of words in each phrase (sequence length),
- `6` is the dimensionality of each word embedding.


In [None]:
inputs = np.stack([[word1, word2, word3, word4],[word5, word6, word7, word8],[word9, word10, word11, word12]])
inputs, inputs.shape

Implementing the classifier model class we can start by adding these parameters:

In [None]:
class QKVAttentionClassifier:
    def __init__(self,word_len,batch_size):
        self.word_len=word_len
        self.batch_size = batch_size 

# Attention Head

An attention head in the attention mechanism is a crucial component of the model that computes the weighted sum of the values based on the similarity between the queries and keys.
The primary goal of the attention mechanism is to derive better and richer representations of word embeddings. By focusing on different parts of the input sequence, the attention mechanism helps the model capture intricate relationships and dependencies between words. This enhanced representation improves the model’s ability to understand context and perform various NLP tasks more effectively.

The output of an attention head can be obtained mathematically as follows:

1. **Calculate the Scores**: The attention scores are computed as the dot product of the query matrix \\( Q \\) with the transpose of the key matrix \\( K \\). To ensure that the gradients are well-behaved and to prevent excessively large values in the softmax step, the dot product is scaled by \\( \sqrt{d_k} \\):
  $$
  \text{Scores} = \frac{QK^T}{\sqrt{d_k}}
  $$
   where \\( d_k \\) refers to the dimensionality of the key vectors. Specifically, \\( d_k \\) is equal to the number of neurons in the key linear layer \\( K \\). This dimension is crucial for scaling the attention scores, which helps in stabilizing the gradients during training.

2. **Apply Softmax**: Apply the softmax function to the scores to get the attention weights:
   $$
   \text{Attention Weights} = \text{softmax}(\text{Scores})
   $$

3. **Compute the Weighted Sum**: Multiply the attention weights by the value matrix \\( V \\) to get the weighted sum:
   $$
   \text{Output} = \text{Attention Weights} \times V = \sigma( \frac{QK^T}{\sqrt{d_k}}) \times V
   $$
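
To illustrate why the scores are divided by \\( \sqrt{d_k} \\), the sketch below compares the spread of raw and scaled dot products for random vectors as \\( d_k \\) grows. It is only an illustration under the assumption of standard-normal queries and keys; the names `rng`, `n_samples`, `score_std`, and `results` are introduced here and are not part of the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_samples = 10_000

def score_std(d_k):
    """Return the standard deviation of raw and scaled dot products."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)              # one dot product per row
    return raw.std(), (raw / np.sqrt(d_k)).std()

results = {d_k: score_std(d_k) for d_k in (4, 64, 256)}
for d_k, (raw_std, scaled_std) in results.items():
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

Under these assumptions the raw spread grows roughly like \\( \sqrt{d_k} \\), while the scaled scores stay near unit spread, which keeps the softmax away from its saturated region.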

The meaning of the Q, K, V matrices is as follows:

- **Query Matrix \\( Q \\)**: Represents the queries for which we are computing attention.
- **Key Matrix \\( K \\)**: Represents the keys that are used to compute the similarity with the queries.
- **Value Matrix \\( V \\)**: Contains the values that will be weighted by the attention weights to produce the final output.

While the query and key matrices have dimensions \\(\text{EmbeddingSize} \times d_k\\), the value matrix \\( V \\) has a second dimension \\( d_v \\) that specifies how the words will be represented after attention. This dimension \\( d_v \\) influences the output representation of the words, allowing the model to create a more meaningful and rich representation based on the attention mechanism.
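
The role of \\( d_k \\) and \\( d_v \\) can be made concrete with a small shape check. The sketch below uses a single phrase and deliberately picks \\( d_k \neq d_v \\); the sizes and the names `X`, `W_q`, `W_k`, `W_v` are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, embedding_size, d_k, d_v = 4, 6, 3, 2   # illustrative sizes

X = rng.random((seq_len, embedding_size))        # one phrase of 4 words
W_q = rng.random((embedding_size, d_k))          # query weights
W_k = rng.random((embedding_size, d_k))          # key weights
W_v = rng.random((embedding_size, d_v))          # value weights

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_k)  # (seq_len, seq_len)
weights = softmax(scores)                        # each row sums to 1
output = weights @ (X @ W_v)                     # (seq_len, d_v)

print(scores.shape, weights.shape, output.shape)
```

The score matrix is always `seq_len x seq_len` regardless of \\( d_k \\), while the width of the output is set by \\( d_v \\) alone.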
 

Here, for example, is the attention value calculation for the phrase "Bank of the river", with words represented by embeddings of length 5, \\( d_v \\) = 3 and \\( d_k \\) = 3, that is, 3 neurons in each of the Q, K, V layers:

![NotebookAttention1.png](attachment:b403002f-bdc0-4b8e-b172-7ab071c25a66.png)

Considering our numerical example we have the following:

In [None]:
word2vec_len=6
dk=3
dv=3
Q = np.random.rand(word2vec_len, dk)/ np.sqrt(word2vec_len)
K = np.random.rand(word2vec_len, dk)/ np.sqrt(word2vec_len)
V = np.random.rand(word2vec_len, dv)/ np.sqrt(word2vec_len)

Adding Q,K,V to the classifier we have:


In [None]:
class QKVAttentionClassifier:
    def __init__(self,word2vec_len,batch_size,dk,dv):
        self.word2vec_len=word2vec_len
        self.batch_size = batch_size
        self.dk=dk
        self.dv = dv
        # Scale by the embedding size stored above (self.word_len is not defined in this class)
        self.Q = np.random.rand(self.word2vec_len, self.dk) / np.sqrt(self.word2vec_len)
        self.K = np.random.rand(self.word2vec_len, self.dk) / np.sqrt(self.word2vec_len)
        self.V = np.random.rand(self.word2vec_len, self.dv) / np.sqrt(self.word2vec_len)

## Forward Pass in Attention Mechanism

The forward pass of the attention mechanism starts by computing the query, key, and value vectors from the input embeddings. Let’s break it down step by step:


In [None]:
Qval=np.matmul(inputs, Q)
Qval,Qval.shape

In [None]:
Kval=np.dot(inputs, K)
Kval

In [None]:
Vval=np.dot(inputs, V)
Vval

Having the values of the \\(Q\\), \\(K\\), \\(V\\) matrices and the \\(dk\\) values we can calculate the scores:
$$
  \text{QKscaled} = \frac{QK^T}{\sqrt{d_k}}
  $$

In [None]:
QKscaled=np.matmul(Qval, np.transpose(Kval, (0, 2, 1)))/np.sqrt(K.shape[1])
QKscaled

The \\(attention\\) \\(weights\\) as:
$$
  \text{Attention weights} =\sigma( \frac{QK^T}{\sqrt{d_k}})
  $$

In [None]:
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max value for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

Attention_weights=softmax(QKscaled)
Attention_weights
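
The max-subtraction inside `softmax` matters for large scores. The sketch below contrasts it with a naive version; `naive_softmax`, `stable_softmax`, and the input `big_scores` are hypothetical names and values used only for this comparison.

```python
import numpy as np

def naive_softmax(x, axis=-1):
    e_x = np.exp(x)                      # overflows to inf for large inputs
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def stable_softmax(x, axis=-1):
    # Shifting by the row maximum leaves the result unchanged mathematically
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

big_scores = np.array([[1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(big_scores)    # inf / inf produces nan
stable = stable_softmax(big_scores)

print(naive)
print(stable, stable.sum(axis=-1))
```

The stable version returns the same weights as for the scores `[0, 1, 2]`, since softmax depends only on differences between scores.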

and finally the \\(Attention\\) value as:
$$
  \text{Attention} =\sigma( \frac{QK^T}{\sqrt{d_k}})V
  $$

In [None]:
Attention=np.matmul(Attention_weights, Vval)
Attention 

Consider now the change in the dimensionality of each word:

- **Initial Embedding Size**: The word embeddings begin with a dimensionality of 6.
- **Attention Output Size**: After applying the attention mechanism, the output of the attention layer typically has a reduced dimensionality; in our case it ends up with a size of 3, since \\( dv \\) = 3.

This reduction occurs because:
  - The attention mechanism often projects the embeddings into a lower-dimensional space to capture the most relevant information while reducing computational complexity.
  - The output dimension of the attention mechanism is determined by the number of neurons in the linear layer of V matrix of the attention mechanism.


## Computing Phrase Representation from Attention Scores

In the attention mechanism, once the attention output has been computed, we often need to aggregate it over the words to obtain a single representation of the entire phrase. Here’s how we compute the `phrase_representation`:

### Context

Given an `Attention` matrix that holds the attention output for each word in a phrase, the goal is to aggregate these per-word vectors into a single representation for the phrase.

### Computing Phrase Representation

1. **Attention Matrix**:
   - **Shape**: (batch_size, sequence_length, dv)
   - **Purpose**: Contains the attention output for each word in each phrase. Each row is a weighted combination of the value vectors of all the words in the phrase.

2. **Phrase Representation Calculation**:
   - To obtain a single representation for each phrase, we compute the average of the attention output along the sequence length dimension.

   ```python
   phrase_representation = np.mean(Attention, axis=1)
   ```

![NotebookAttention2.png](attachment:b455c92e-0a83-42a8-a3d2-c8f2883a0344.png)

In [1]:
phrase_representation = np.mean(Attention, axis=1)
print("Phrase Representation:")
print(phrase_representation)


At this point, we have computed the following:

- **Attention Mechanism**:
  - `Qval`: The result of multiplying the input with the query weight matrix \\( Q \\).
  - `Kval`: The result of multiplying the input with the key weight matrix \\( K \\).
  - `Vval`: The result of multiplying the input with the value weight matrix \\( V \\).
  - `QKscaled`: The scaled dot product of `Qval` and `Kval`, normalized by the square root of the dimensionality of the key vectors \\( d_k \\).
  - `Attention_weights`: The result of applying the softmax function to `QKscaled` to get attention weights.
  - `Attention`: The result of multiplying `Attention_weights` with `Vval`.

We can now add these parts to the classifier model:

In [None]:
class QKVAttentionClassifier:
    def __init__(self,word2vec_len,batch_size,dk,dv):
        self.word2vec_len=word2vec_len
        self.batch_size = batch_size
        self.dk=dk
        self.dv = dv
        # Initialize weights: uniform random values scaled by 1/sqrt(embedding size)
        self.Q = np.random.rand(self.word2vec_len, self.dk) / np.sqrt(self.word2vec_len)
        self.K = np.random.rand(self.word2vec_len, self.dk) / np.sqrt(self.word2vec_len)
        self.V = np.random.rand(self.word2vec_len, self.dv) / np.sqrt(self.word2vec_len)

    
    def AttentionHead(self, Inputs):
        self.Qval = np.dot(Inputs, self.Q)
        self.Kval = np.dot(Inputs, self.K)
        self.Vval = np.dot(Inputs, self.V) 
        QKscaled = np.matmul(self.Qval, np.transpose(self.Kval, (0, 2, 1))) / np.sqrt(self.K.shape[1]) 
        self.Attention_weights = self.softmax(QKscaled) 
        return np.matmul(self.Attention_weights, self.Vval)
    
    def forward(self, Inputs):
        Attention = self.AttentionHead(Inputs)
        self.phrase_representation = np.mean(Attention, axis=1)

With these computations complete, we are now ready to feed the `phrase_representation` into the linear layer for the classification task.

In [None]:
num_classes = 2  # Example number of classes (binary classification)
linearlayer= np.random.rand(dv, num_classes)   
linear_bias = np.random.rand(num_classes)

In [None]:
Sigma_Zout=softmax(np.matmul(phrase_representation, linearlayer) + linear_bias)
Sigma_Zout

We add the linearlayer and Sigma_Zout calculation to the classifier:


In [None]:
class QKVAttentionClassifier:
    def __init__(self, word_len, words_per_phrase, batch_size, dk, dv, num_classes):

        self.word_len = word_len
        self.batch_size = batch_size
        self.dk = dk
        self.dv = dv
        self.num_classes = num_classes
        self.words_per_phrase = words_per_phrase

        # Initialize weights with Xavier/Glorot-style scaling by 1/sqrt(fan_in)
        self.Q = np.random.randn(self.word_len, self.dk) / np.sqrt(self.word_len)
        self.K = np.random.randn(self.word_len, self.dk) / np.sqrt(self.word_len)
        self.V = np.random.randn(self.word_len, self.dv) / np.sqrt(self.word_len)  # values project to dv, not dk

        # Initialize linear layer weights (input size dv matches the phrase representation)
        self.linearlayer = np.random.randn(self.dv, self.num_classes) / np.sqrt(self.dv)
        self.linear_bias = np.zeros(self.num_classes) 
        
    def LinearLayer(self):
        output = np.matmul(self.phrase_representation, self.linearlayer) + self.linear_bias
        return output

    def forward(self, Inputs):
        Attention = self.AttentionHead(Inputs)
        self.phrase_representation = np.mean(Attention, axis=1)

        Zout = self.LinearLayer()
        Sigma_Zout = self.softmax(Zout)

        return Sigma_Zout

### Cross-Entropy Loss Calculation

At this stage, we calculate the cross-entropy loss between our predictions and the true target values.

 

1. **True Target**:
   - The `target` variable represents the true class labels for each example in the batch. It is a list of one-hot encoded vectors. For instance:
     - `np.array([0, 1])` represents the true class for the first example.
     - `np.array([1, 0])` represents the true class for the second example.
     - `np.array([1, 0])` represents the true class for the third example.

2. **Cross-Entropy Loss Calculation**:
   - The cross-entropy loss is computed using the formula:
     \\[
     \text{batch\_loss} = -\sum (\text{target} \cdot \log(\text{predictions} + 1e-8))
     \\]
   - This loss function measures the difference between the predicted probabilities and the true class labels. The `1e-8` term is added to avoid taking the logarithm of zero, which could result in undefined values.

3. **Result**:
   - The computed loss, `loss`, is an average of the individual losses across the batch. It quantifies how well the predicted probabilities match the true labels. A lower loss indicates better performance.

In summary, this step provides a measure of how well our model's predictions align with the actual class labels in our batch. The output of this calculation will be used to guide the training process through backpropagation.
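
A small worked example may help make the formula concrete. The probabilities in `predictions` below are made-up values, not outputs of the model; the targets follow the one-hot pattern described above.

```python
import numpy as np

# Hypothetical predicted probabilities for a batch of 3 examples, 2 classes
predictions = np.array([[0.25, 0.75],
                        [0.50, 0.50],
                        [0.90, 0.10]])
target = np.array([[0, 1],
                   [1, 0],
                   [1, 0]])

# Per-example loss, then the batch average
per_example = -np.sum(target * np.log(predictions + 1e-8), axis=1)
loss = per_example.mean()

# With one-hot targets only the probability of the true class contributes
manual = -np.log(np.array([0.75, 0.50, 0.90]) + 1e-8)

print(per_example, loss)
```

The confident, correct third prediction (0.90) contributes the smallest loss, while the undecided second prediction (0.50) contributes the largest.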


In [None]:
import numpy as np

 
target = [np.array([0, 1]),np.array([1, 0]),np.array([1, 0])]


def cross_entropy_loss(predictions, target): 
    batch_loss = -np.sum(target * np.log(predictions + 1e-8), axis=1)
    return np.mean(batch_loss) 
 

loss = cross_entropy_loss(Sigma_Zout, target)
print("Cross-Entropy Loss: ",loss)
 


We can now add both softmax and cross entropy loss functions to the classifier model:

In [None]:
class QKVAttentionClassifier:
     
    def softmax(self, x, axis=-1):
        x = np.clip(x, -1e4, 1e4)  # Clip for numerical stability
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / np.sum(e_x, axis=axis, keepdims=True)

    def cross_entropy_loss(self, predictions, target):
        # Cross-entropy loss for a batch of predictions and targets
        batch_loss = -np.sum(target * np.log(predictions + 1e-9), axis=1)
        return np.mean(batch_loss)


## Backpropagation
 
### Overview

Backpropagation is a fundamental algorithm used for training artificial neural networks. It is a supervised learning technique that adjusts the weights of the network to minimize the error between the predicted output and the actual target values. The goal of backpropagation is to optimize the network's performance by systematically reducing the loss function through gradient descent.

In our example, we therefore need to calculate the following quantities using the chain rule:

\\[
\frac{\partial \text{Loss}}{\partial Z^{out}} = \frac{\partial \text{Loss}}{\partial \sigma(Z^{out})} \frac{\partial \sigma(Z^{out})}{\partial Z^{out}}
\\]

\\[
\frac{\partial \text{Loss}}{\partial Q} = \frac{\partial \text{Loss}}{\partial Z^{out}}  \frac{\partial Z^{out}}{\partial A} \frac{\partial A}{\partial Q}
\\]
\\[
\frac{\partial \text{Loss}}{\partial K} = \frac{\partial \text{Loss}}{\partial Z^{out}}  \frac{\partial Z^{out}}{\partial A} \frac{\partial A}{\partial K}
\\]
\\[
\frac{\partial \text{Loss}}{\partial V} = \frac{\partial \text{Loss}}{\partial Z^{out}}  \frac{\partial Z^{out}}{\partial A} \frac{\partial A}{\partial V}
\\]
\\[
\frac{\partial \text{Loss}}{\partial W} = \frac{\partial \text{Loss}}{\partial Z^{out}}  \frac{\partial Z^{out}}{\partial W}
\\]
\\[
\frac{\partial \text{Loss}}{\partial Bias} =\frac{\partial \text{Loss}}{\partial Z^{out}}  \frac{\partial Z^{out}}{\partial Bias}
\\]

### Gradient of the loss with respect to the logits:


For a single example, the cross-entropy loss written in terms of the logits \\( Z^{out} \\) is:
\\[
L=-\sum_i y_i \log(\sigma_i(Z^{out}))
\\]

The softmax derivative is \\( \frac{\partial \sigma_i(Z^{out})}{\partial Z^{out}_j} = \sigma_i(Z^{out})[\delta_{ij}-\sigma_j(Z^{out})] \\), where \\( \delta_{ij} \\) is 1 when \\( i=j \\) and 0 otherwise, so:

\\[
\frac{\partial \text{Loss}}{\partial Z^{out}_j} = -\sum_i y_i \frac{1}{\sigma_i(Z^{out})}  \frac{\partial \sigma_i(Z^{out})}{\partial Z^{out}_j}= -\sum_i y_i [\delta_{ij}-\sigma_j(Z^{out})] = -y_j + \sigma_j(Z^{out})\sum_i y_i
\\]

as \\(y\\) is a one-hot encoded vector:

\\[ \sum_i y_i = 1 \\]

so:
\\[
\frac{\partial \text{Loss}}{\partial Z^{out}_j} = \sigma_j(Z^{out})-y_j
\\]

In [None]:
# Gradient of the loss with respect to the logits Zout (softmax output minus one-hot target)
dLoss_dSigma_Zout =Sigma_Zout - np.stack(target)
dLoss_dSigma_Zout
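
The result \\( \sigma(Z^{out}) - y \\) can be sanity-checked numerically. The sketch below compares it with central finite differences for one example; the logits `z`, the target `y`, and the helper `loss_fn` are hypothetical and introduced only for this check.

```python
import numpy as np

def softmax(z):
    e_z = np.exp(z - np.max(z))
    return e_z / e_z.sum()

def loss_fn(z, y):
    # Cross-entropy of softmax(z) against a one-hot target y
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8])     # hypothetical logits
y = np.array([0.0, 0.0, 1.0])      # one-hot target

analytic = softmax(z) - y

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    numeric[j] = (loss_fn(z + step, y) - loss_fn(z - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The two gradients should agree up to finite-difference error, which supports using the closed form directly instead of differentiating the softmax and the logarithm separately.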



### Gradient of the loss with respect to linear layer and bias:

With \\( Y_i \\) denoting the components of the phrase representation and \\( Z^{out}_j = \sum_i Y_i w_{ij} + B_j \\), the gradient of the loss with respect to the linear layer weights and bias can be expressed as:
\\[
\frac{\partial Loss}{\partial W}=\begin{cases}
\frac{\partial Loss}{\partial w_{11}}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial w_{11}}= [\sigma(Z^{out}_1)-y_{1}]\cdot Y_1\\
\frac{\partial Loss}{\partial w_{21}}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial w_{21}}= [\sigma(Z^{out}_1)-y_{1}]\cdot Y_2\\
\frac{\partial Loss}{\partial w_{31}}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial w_{31}}= [\sigma(Z^{out}_1)-y_{1}]\cdot Y_3\\
\frac{\partial Loss}{\partial w_{12}}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial w_{12}}= [\sigma(Z^{out}_2)-y_{2}]\cdot Y_1\\
\frac{\partial Loss}{\partial w_{22}}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial w_{22}}= [\sigma(Z^{out}_2)-y_{2}]\cdot Y_2\\
\frac{\partial Loss}{\partial w_{32}}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial w_{32}}= [\sigma(Z^{out}_2)-y_{2}]\cdot Y_3
\end{cases}
\\]
\\[
\frac{\partial Loss}{\partial B}=\begin{cases}
\frac{\partial Loss}{\partial B_{1}}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial B_{1}}= [\sigma(Z^{out}_1)-y_{1}]\cdot 1\\
\frac{\partial Loss}{\partial B_{2}}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial B_{2}}= [\sigma(Z^{out}_2)-y_{2}]\cdot 1
\end{cases}
\\]

These values will be used to update the weights of the linear layer, scaled by the learning rate.

In [None]:
# Gradient for linear layer and bias
d_linear = np.dot(dLoss_dSigma_Zout.T, phrase_representation).T
d_bias =  np.sum(dLoss_dSigma_Zout, axis=0)
d_linear,d_bias

In [None]:
class QKVAttentionClassifier: 
    def BackPropagation(self, dLoss_dSigma_Zout, inputs):
        
        # Gradient for linear layer
        dlinear_dW = np.dot(dLoss_dSigma_Zout.T, self.phrase_representation).T

        # Gradient for bias
        d_bias = np.sum(dLoss_dSigma_Zout, axis=0)

### Gradient of the loss with respect to phrase representation:
 

Since every component \\( Y_i \\) of the phrase representation feeds both logits, its gradient is the sum of the contributions through \\( Z^{out}_1 \\) and \\( Z^{out}_2 \\). The individual contributions are:
\\[
\begin{cases}
\frac{\partial Loss}{\partial Y^{z_1}_1}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial Y_1}=[\sigma(Z_1^{out})-y_1]\cdot w_{11}\\
\frac{\partial Loss}{\partial Y^{z_2}_1}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial Y_1}=[\sigma(Z_2^{out})-y_2]\cdot w_{12}\\
\frac{\partial Loss}{\partial Y^{z_1}_2}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial Y_2}=[\sigma(Z_1^{out})-y_1]\cdot w_{21}\\
\frac{\partial Loss}{\partial Y^{z_2}_2}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial Y_2}=[\sigma(Z_2^{out})-y_2]\cdot w_{22}\\
\frac{\partial Loss}{\partial Y^{z_1}_3}=\frac{\partial Loss}{\partial Z^{out}_1}\frac{\partial Z^{out}_1}{\partial Y_3}=[\sigma(Z_1^{out})-y_1]\cdot w_{31}\\
\frac{\partial Loss}{\partial Y^{z_2}_3}=\frac{\partial Loss}{\partial Z^{out}_2}\frac{\partial Z^{out}_2}{\partial Y_3}=[\sigma(Z_2^{out})-y_2]\cdot w_{32}
\end{cases}
\\]
and the gradient of the loss with respect to the phrase representation is:
\\[
\frac{\partial Loss}{\partial Y}=\begin{cases}
\frac{\partial Loss}{\partial Y_1}=\frac{\partial Loss}{\partial Y^{z_1}_1}+\frac{\partial Loss}{\partial Y^{z_2}_1}\\
\frac{\partial Loss}{\partial Y_2}=\frac{\partial Loss}{\partial Y^{z_1}_2}+\frac{\partial Loss}{\partial Y^{z_2}_2}\\
\frac{\partial Loss}{\partial Y_3}=\frac{\partial Loss}{\partial Y^{z_1}_3}+\frac{\partial Loss}{\partial Y^{z_2}_3}
\end{cases}
\\]
 


In [None]:
# Gradient for phrase representation
d_phrase_rep = np.dot(dLoss_dSigma_Zout, linearlayer.T)
d_phrase_rep
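
The three linear-layer gradients can be checked against finite differences. The sketch below assumes the loss is the *sum* of the per-example cross-entropies, which is what the unnormalized expression \\( \sigma(Z^{out}) - y \\) corresponds to; for the batch mean the gradients would carry an extra factor `1 / batch_size`. The random stand-ins `Y`, `W`, `b` and the helper `numerical_grad` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. array x (perturbed in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = f()
        x[idx] = old - eps
        minus = f()
        x[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
Y = rng.random((3, 3))                 # stand-in phrase representations (batch, dv)
W = rng.random((3, 2))                 # stand-in linear layer (dv, num_classes)
b = rng.random(2)
target = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

def summed_loss():
    probs = softmax(Y @ W + b)
    return -np.sum(target * np.log(probs))

dZ = softmax(Y @ W + b) - target       # gradient w.r.t. the logits
dW = Y.T @ dZ                          # linear layer weights
db = dZ.sum(axis=0)                    # bias
dY = dZ @ W.T                          # phrase representation

errors = {
    "W": np.max(np.abs(dW - numerical_grad(summed_loss, W))),
    "b": np.max(np.abs(db - numerical_grad(summed_loss, b))),
    "Y": np.max(np.abs(dY - numerical_grad(summed_loss, Y))),
}
print(errors)
```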

In [None]:
class QKVAttentionClassifier: 
    def BackPropagation(self, dLoss_dSigma_Zout, inputs):
        
        # Gradient for linear layer
        dlinear_dW = np.dot(self.phrase_representation.T, dLoss_dSigma_Zout)

        # Gradient for bias
        d_bias = np.sum(dLoss_dSigma_Zout, axis=0)
        
         # Gradient for phrase representation
        d_phrase_rep = np.dot(dLoss_dSigma_Zout, self.linearlayer.T)



### Gradient of the Loss with Respect to Attention Output

Given the attention output matrix:

\\[
\text{Attention} = 
\begin{bmatrix}
Y_a^1 & Y_a^2 & Y_a^3 \\
Y_b^1 & Y_b^2 & Y_b^3 \\
Y_c^1 & Y_c^2 & Y_c^3 \\
Y_d^1 & Y_d^2 & Y_d^3 \\
\end{bmatrix}
\\]

Each row \\(Y_a, Y_b, Y_c, Y_d\\) represents the attention for each input token.

#### Phrase Representation

We compute the phrase representation by averaging over the rows for each component:

\\[
\text{Phrase Representation} = Y =
\begin{bmatrix}
\frac{Y^1_a + Y^1_b + Y^1_c + Y^1_d}{4} \\
\frac{Y^2_a + Y^2_b + Y^2_c + Y^2_d}{4} \\
\frac{Y^3_a + Y^3_b + Y^3_c + Y^3_d}{4} \\
\end{bmatrix} = \begin{bmatrix}
{Y_1}\\
{Y_2}\\
{Y_3}\\
\end{bmatrix}
\\]

#### Loss Function and Gradient

The loss is computed based on the phrase representation. To compute the gradient of the loss with respect to each attention component \\(Y_x^i\\), we apply the chain rule:

\\[
\frac{\partial Loss}{\partial Y_x^i} = \frac{\partial Loss}{\partial \text{Y}_i} \cdot \frac{\partial \text{Y}_i}{\partial Y_x^i}
\\]

So we end up with the following set of equations:
\\[
\begin{align*}
\frac{\partial Loss}{\partial Y_x^1} = \begin{cases}
\frac{\partial Loss}{\partial Y^1_a} = \frac{\partial Loss}{\partial \text{Y}_1} \cdot \frac{\partial \text{Y}_1}{\partial Y^1_a}\\
\frac{\partial Loss}{\partial Y^1_b} = \frac{\partial Loss}{\partial \text{Y}_1} \cdot \frac{\partial \text{Y}_1}{\partial Y^1_b}\\
\frac{\partial Loss}{\partial Y^1_c} = \frac{\partial Loss}{\partial \text{Y}_1} \cdot \frac{\partial \text{Y}_1}{\partial Y^1_c}\\
\frac{\partial Loss}{\partial Y^1_d} = \frac{\partial Loss}{\partial \text{Y}_1} \cdot \frac{\partial \text{Y}_1}{\partial Y^1_d}\\
\end{cases}  & 
\frac{\partial Loss}{\partial Y_x^2} = \begin{cases}
\frac{\partial Loss}{\partial Y^2_a} = \frac{\partial Loss}{\partial \text{Y}_2} \cdot \frac{\partial \text{Y}_2}{\partial Y^2_a}\\
\frac{\partial Loss}{\partial Y^2_b} = \frac{\partial Loss}{\partial \text{Y}_2} \cdot \frac{\partial \text{Y}_2}{\partial Y^2_b}\\
\frac{\partial Loss}{\partial Y^2_c} = \frac{\partial Loss}{\partial \text{Y}_2} \cdot \frac{\partial \text{Y}_2}{\partial Y^2_c}\\
\frac{\partial Loss}{\partial Y^2_d} = \frac{\partial Loss}{\partial \text{Y}_2} \cdot \frac{\partial \text{Y}_2}{\partial Y^2_d}\\
\end{cases}  & 
\frac{\partial Loss}{\partial Y_x^3} = \begin{cases}
\frac{\partial Loss}{\partial Y^3_a} = \frac{\partial Loss}{\partial \text{Y}_3} \cdot \frac{\partial \text{Y}_3}{\partial Y^3_a}\\
\frac{\partial Loss}{\partial Y^3_b} = \frac{\partial Loss}{\partial \text{Y}_3} \cdot \frac{\partial \text{Y}_3}{\partial Y^3_b}\\
\frac{\partial Loss}{\partial Y^3_c} = \frac{\partial Loss}{\partial \text{Y}_3} \cdot \frac{\partial \text{Y}_3}{\partial Y^3_c}\\
\frac{\partial Loss}{\partial Y^3_d} = \frac{\partial Loss}{\partial \text{Y}_3} \cdot \frac{\partial \text{Y}_3}{\partial Y^3_d}\\
\end{cases}
\end{align*}
\\]



Since the phrase representation we used is the mean of the attention outputs, we can simplify the derivative with respect to each attention component as:

\\[
\frac{\partial \text{Y}_i}{\partial Y_x^i} = \frac{1}{4}
\\]

Therefore, the gradient of the loss with respect to each attention component is:

\\[ 
\frac{\partial Loss}{\partial A} = \begin{cases}
\frac{\partial Loss}{\partial Y_a^i} = \frac{1}{4} \frac{\partial Loss}{\partial \text{Y}_i}; \\
\frac{\partial Loss}{\partial Y_b^i} = \frac{1}{4} \frac{\partial Loss}{\partial \text{Y}_i}; \\
\frac{\partial Loss}{\partial Y_c^i} = \frac{1}{4} \frac{\partial Loss}{\partial \text{Y}_i}; \\
\frac{\partial Loss}{\partial Y_d^i} = \frac{1}{4} \frac{\partial Loss}{\partial \text{Y}_i};  
\end{cases} 
\\]
This shows how the gradient propagates through the mean operation during backpropagation.



In [None]:
# Gradient for attention
dL_dA = np.array([np.outer(np.ones(inputs.shape[1]), d_phrase_rep[i, :]) for i in range(d_phrase_rep.shape[0])])  / inputs.shape[1]
dL_dA
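
The outer-product construction used in this notebook can also be written with broadcasting. The sketch below shows that both forms give the same gradient for the mean operation; `d_phrase_rep` here is a random stand-in for the upstream gradient, and the sizes are assumptions matching the running example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, d_v = 3, 4, 3
d_phrase_rep = rng.random((batch, d_v))      # stand-in upstream gradient

# Outer-product construction: copy each phrase gradient to every word
via_outer = np.array([np.outer(np.ones(seq_len), d_phrase_rep[i])
                      for i in range(batch)]) / seq_len

# Equivalent broadcasting form: add a word axis, then expand it
via_broadcast = np.broadcast_to(d_phrase_rep[:, None, :] / seq_len,
                                (batch, seq_len, d_v))

print(via_outer.shape, np.allclose(via_outer, via_broadcast))
```

Every word in a phrase receives the same gradient, equal to the phrase gradient divided by the number of words, which is the factor \\( \frac{1}{4} \\) derived above.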

In [None]:
class QKVAttentionClassifier: 
    def BackPropagation(self, dLoss_dSigma_Zout, inputs):
        
        # Gradient for linear layer
        dlinear_dW = np.dot(self.phrase_representation.T, dLoss_dSigma_Zout)

        # Gradient for bias
        d_bias = np.sum(dLoss_dSigma_Zout, axis=0)
        
        # Gradient for phrase representation
        d_phrase_rep = np.dot(dLoss_dSigma_Zout, self.linearlayer.T)

        # Gradient for attention (the mean over words contributes a factor 1/words_per_phrase)
        dL_dA = np.array([np.outer(np.ones(self.words_per_phrase), d_phrase_rep[i, :]) for i in range(d_phrase_rep.shape[0])]) / self.words_per_phrase

### Gradient of the loss with respect to values matrix V:
 

Since \\( V_{val} = Inputs \cdot V \\), the attention output is linear in \\( V \\), and the gradient of the loss with respect to the values matrix can be expressed as:
\\[
\text{A} =\sigma( \frac{Q_{val}K_{val}^T}{\sqrt{d_k}})V_{val} = \sigma( \frac{Q_{val}K_{val}^T}{\sqrt{d_k}})\cdot Inputs \cdot V
\\]
\\[ 
\frac{\partial Loss}{\partial \text{V}} = Inputs^T \cdot \sigma( \frac{Q_{val}K_{val}^T}{\sqrt{d_k}})^T \cdot \frac{\partial Loss}{\partial A}
\\] 
The transposes make the shapes line up: the result has the same shape as \\( V \\), one such matrix per phrase, and the per-phrase results are then averaged over the batch.

In [None]:
# Gradient for V
d_Vval=np.matmul(np.transpose(dL_dA,(0,2,1)), Attention_weights) 
dLoss_dV = np.mean(np.matmul(d_Vval,inputs),axis=0).T
dLoss_dV

### Gradient of the loss with respect to queries matrix Q:
 
 
Writing \\( S = \frac{Q_{val}K_{val}^T}{\sqrt{d_k}} \\) for the scaled scores, with \\( Q_{val} = Inputs \cdot Q \\), the gradient of the loss with respect to the queries matrix is obtained by applying the chain rule through \\( A \\), \\( \sigma(S) \\), \\( S \\) and \\( Q_{val} \\):
\\[
\text{A} =\sigma( \frac{Q_{val}K_{val}^T}{\sqrt{d_k}})V_{val}  
\\]
\\[ 
\frac{\partial Loss}{\partial \sigma(S)} =\frac{\partial Loss}{\partial A} \cdot V_{val}^T
\\] 
Softmax acts on each row of \\( S \\), and every weight in a row depends on every score in that row, so the derivative uses the full row-wise Jacobian:
\\[ 
\frac{\partial Loss}{\partial S_{ij}} = \sigma_{ij}(S) \left[ \frac{\partial Loss}{\partial \sigma_{ij}(S)} - \sum_{k} \frac{\partial Loss}{\partial \sigma_{ik}(S)} \sigma_{ik}(S) \right]
\\] 
\\[ 
\frac{\partial Loss}{\partial Q_{val}} = \frac{1}{\sqrt{d_k}} \frac{\partial Loss}{\partial S} \cdot K_{val}
\\]  
\\[ 
\frac{\partial Loss}{\partial \text{Q}} = Inputs^T \cdot \frac{\partial Loss}{\partial Q_{val}}
\\] 

In [None]:
#Gradient of Q
# Gradient w.r.t. the attention weights, shape (batch, words, words)
dL_dWeights = np.matmul(dL_dA, np.transpose(Vval, (0, 2, 1)))
# Softmax backward, applied row by row with the full Jacobian
dL_dScores = Attention_weights * (dL_dWeights - np.sum(dL_dWeights * Attention_weights, axis=-1, keepdims=True))
dL_dQval = np.matmul(dL_dScores, Kval) / np.sqrt(dk)
# Back through Qval = inputs @ Q, averaged over the batch
dLoss_dQ = np.mean(np.matmul(np.transpose(inputs, (0, 2, 1)), dL_dQval), axis=0)
dLoss_dQ

### Gradient of the loss with respect to keys matrix K:
 
 
Writing \\( S = \frac{Q_{val}K_{val}^T}{\sqrt{d_k}} \\) for the scaled scores, with \\( K_{val} = Inputs \cdot K \\), the gradient of the loss with respect to the keys matrix is obtained by applying the chain rule through \\( A \\), \\( \sigma(S) \\), \\( S \\) and \\( K_{val} \\):
\\[
\text{A} =\sigma( \frac{Q_{val}K_{val}^T}{\sqrt{d_k}})V_{val}  
\\]
\\[ 
\frac{\partial Loss}{\partial \sigma(S)} =\frac{\partial Loss}{\partial A} \cdot V_{val}^T
\\] 
Softmax acts on each row of \\( S \\), so the derivative uses the full row-wise Jacobian:
\\[ 
\frac{\partial Loss}{\partial S_{ij}} = \sigma_{ij}(S) \left[ \frac{\partial Loss}{\partial \sigma_{ij}(S)} - \sum_{k} \frac{\partial Loss}{\partial \sigma_{ik}(S)} \sigma_{ik}(S) \right]
\\] 
Because \\( K_{val} \\) enters \\( S \\) transposed, the score gradient is transposed as well:
\\[ 
\frac{\partial Loss}{\partial K_{val}} = \frac{1}{\sqrt{d_k}} \left(\frac{\partial Loss}{\partial S}\right)^T \cdot Q_{val}
\\]  
\\[ 
\frac{\partial Loss}{\partial \text{K}} = Inputs^T \cdot \frac{\partial Loss}{\partial K_{val}}
\\] 

In [None]:
#Gradient of K
# Gradient w.r.t. the attention weights, shape (batch, words, words)
dL_dWeights = np.matmul(dL_dA, np.transpose(Vval, (0, 2, 1)))
# Softmax backward, applied row by row with the full Jacobian
dL_dScores = Attention_weights * (dL_dWeights - np.sum(dL_dWeights * Attention_weights, axis=-1, keepdims=True))
dL_dKval = np.matmul(np.transpose(dL_dScores, (0, 2, 1)), Qval) / np.sqrt(dk)
# Back through Kval = inputs @ K, averaged over the batch
dLoss_dK = np.mean(np.matmul(np.transpose(inputs, (0, 2, 1)), dL_dKval), axis=0)
dLoss_dK
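
A finite-difference comparison is a common way to validate analytic attention gradients. The sketch below is an independent illustration rather than the notebook's own code: it backpropagates through the softmax with its full row-wise Jacobian and checks the gradients for `Q`, `K`, and `V` numerically. The scalar loss `sum(A * R)` with random `R`, and the helpers `forward`, `backward`, and `numerical_grad`, are assumptions made for the check. Gradients are summed over the batch here; averaging instead, as the cells above do with `np.mean`, only changes them by a factor of the batch size.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def forward(X, Q, K, V):
    Qv, Kv, Vv = X @ Q, X @ K, X @ V
    W = softmax(Qv @ np.transpose(Kv, (0, 2, 1)) / np.sqrt(Q.shape[1]))
    return Qv, Kv, Vv, W, W @ Vv

def backward(X, Q, K, V, dA):
    Qv, Kv, Vv, W, _ = forward(X, Q, K, V)
    Xt = np.transpose(X, (0, 2, 1))
    dW = dA @ np.transpose(Vv, (0, 2, 1))
    # Row-wise softmax Jacobian: dS_ij = W_ij * (dW_ij - sum_k dW_ik W_ik)
    dS = W * (dW - np.sum(dW * W, axis=-1, keepdims=True)) / np.sqrt(Q.shape[1])
    dQ = np.sum(Xt @ (dS @ Kv), axis=0)
    dK = np.sum(Xt @ (np.transpose(dS, (0, 2, 1)) @ Qv), axis=0)
    dV = np.sum(Xt @ (np.transpose(W, (0, 2, 1)) @ dA), axis=0)
    return dQ, dK, dV

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences of the scalar f() w.r.t. array x (perturbed in place)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        plus = f()
        x[idx] = old - eps
        minus = f()
        x[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.random((3, 4, 6))                    # (batch, words, embedding size)
Q, K, V = (rng.random((6, 3)) for _ in range(3))
R = rng.random((3, 4, 3))                    # arbitrary weights, so dLoss/dA = R

def scalar_loss():
    return np.sum(forward(X, Q, K, V)[-1] * R)

dQ, dK, dV = backward(X, Q, K, V, R)
errors = [np.max(np.abs(g - numerical_grad(scalar_loss, p)))
          for g, p in ((dQ, Q), (dK, K), (dV, V))]
print(errors)
```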

Adding all the gradients in the backpropagation of our classifier we get:

In [None]:
class QKVAttentionClassifier: 
    def BackPropagation(self, dLoss_dSigma_Zout, inputs):
        
        # Gradient for linear layer
        dlinear_dW = np.dot(self.phrase_representation.T, dLoss_dSigma_Zout)

        # Gradient for bias
        d_bias = np.sum(dLoss_dSigma_Zout, axis=0)
        
         # Gradient for phrase representation
        d_phrase_rep = np.dot(dLoss_dSigma_Zout, self.linearlayer.T)

        # Gradient for attention
        dL_dA = np.array([np.outer(np.ones(self.words_per_phrase), d_phrase_rep[i, :]) for i in range(d_phrase_rep.shape[0])])

        # Gradient for V
        d_Vval=np.matmul(np.transpose(dL_dA,(0,2,1)), self.Attention_weights) 
        dLoss_dV = np.mean(np.matmul(d_Vval,inputs),axis=0).T
        
        #Gradient of softmax
        dAttention_dSoftmax=np.matmul(self.Attention_weights,(1-self.Attention_weights))
        
        #Gradient of Q
        dQKscaled_dQ=np.matmul(np.transpose(inputs,(0,2,1)),self.Kval)/np.sqrt(self.dk)
        dLoss_dQ=np.mean(np.transpose(np.transpose(np.transpose(dL_dA,(0,2,1))@dAttention_dSoftmax,(0,2,1))@np.transpose( dQKscaled_dQ,(0,2,1)),(0,2,1))@self.Vval,axis=0)
        
        #Gradient of K
        dQKscaled_dK=np.transpose(self.Qval,(0,2,1))@inputs/np.sqrt(self.dk) 
        dLoss_dK=np.mean(np.transpose(np.transpose(np.transpose(dL_dA,(0,2,1))@dAttention_dSoftmax,(0,2,1))@dQKscaled_dK,(0,2,1))@self.Vval,axis=0)

### Parameters update:
 
 
Fixing a value for the learning rate \\( \eta \\), we can now update the model parameters:
\\[
\text{Q} = \text{Q} - \eta \cdot \frac{\partial Loss}{\partial Q}
\\]
\\[
\text{V} = \text{V} - \eta \cdot \frac{\partial Loss}{\partial V}
\\]
\\[
\text{K} = \text{K} - \eta \cdot \frac{\partial Loss}{\partial K}
\\]
\\[
\text{W} = \text{W} - \eta \cdot \frac{\partial Loss}{\partial W}
\\]
\\[
\text{b} = \text{b} - \eta \cdot \frac{\partial Loss}{\partial b}
\\]

In [None]:
# Update weights
learning_rate=0.001
Q -= learning_rate * dLoss_dQ
K -= learning_rate * dLoss_dK
V -= learning_rate * dLoss_dV
linearlayer -= learning_rate * d_linear
linear_bias -= learning_rate * d_bias
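
The same update rule can be applied to every parameter in a single loop. The helper below is a minimal sketch with toy numbers; `sgd_step` and the dictionary layout are hypothetical and not part of the notebook.

```python
import numpy as np

def sgd_step(params, grads, learning_rate):
    """Apply param <- param - learning_rate * grad in place for every entry."""
    for name, grad in grads.items():
        params[name] -= learning_rate * grad

# Toy values chosen only to make the arithmetic easy to follow
params = {"W": np.array([[1.0, 2.0], [3.0, 4.0]]), "b": np.array([0.5, -0.5])}
grads = {"W": np.ones((2, 2)), "b": np.array([1.0, -1.0])}

sgd_step(params, grads, learning_rate=0.1)
# W is now [[0.9, 1.9], [2.9, 3.9]] and b is [0.4, -0.4]
print(params["W"], params["b"])
```

Updating in place with `-=` keeps the arrays' identity, which matters when other objects hold references to the same weights.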

Putting all the steps together, we finally have the classification model:

In [None]:
import numpy as np
from tqdm import tqdm  # progress bar used in train()


class QKVAttentionClassifier:
    def __init__(self, word_len, words_per_phrase, batch_size, dk, dv, num_classes):

        self.word_len = word_len
        self.batch_size = batch_size
        self.dk = dk
        self.dv = dv
        self.num_classes = num_classes
        self.words_per_phrase = words_per_phrase
        
        # Initialize projection weights with random normals scaled by 1/sqrt(fan_in)
        self.Q = np.random.randn(self.word_len, self.dk) / np.sqrt(self.word_len)
        self.K = np.random.randn(self.word_len, self.dk) / np.sqrt(self.word_len)
        # V also projects to dk columns so the attention output matches the linear layer input (dv is stored but unused)
        self.V = np.random.randn(self.word_len, self.dk) / np.sqrt(self.word_len)

        # Initialize linear layer weights
        self.linearlayer = np.random.randn(self.dk, self.num_classes) / np.sqrt(self.dk)
        self.linear_bias = np.zeros(self.num_classes)

    def softmax(self, x, axis=-1):
        x = np.clip(x, -1e4, 1e4)  # Clip for numerical stability
        e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return e_x / np.sum(e_x, axis=axis, keepdims=True)

    def cross_entropy_loss(self, predictions, target):
        # Cross-entropy loss for a batch of predictions and targets
        batch_loss = -np.sum(target * np.log(predictions + 1e-9), axis=1)
        return np.mean(batch_loss)

    def AttentionHead(self, Inputs):
        self.Qval = np.dot(Inputs, self.Q)
        self.Kval = np.dot(Inputs, self.K)
        self.Vval = np.dot(Inputs, self.V)

        QKscaled = np.matmul(self.Qval, np.transpose(self.Kval, (0, 2, 1))) / np.sqrt(self.K.shape[1])
        # QKscaled = np.clip(QKscaled, -1e2, 1e2)
        self.Attention_weights = self.softmax(QKscaled)

        return np.matmul(self.Attention_weights, self.Vval)

    def LinearLayer(self):
        output = np.matmul(self.phrase_representation, self.linearlayer) + self.linear_bias
        return output

    def forward(self, Inputs):

        Attention = self.AttentionHead(Inputs)

        self.phrase_representation = np.mean(Attention, axis=1)

        Zout = self.LinearLayer()

        Sigma_Zout = self.softmax(Zout)

        return Sigma_Zout
        
    def BackPropagation(self, dLoss_dSigma_Zout, inputs):
        
        # Gradient for linear layer
        dlinear_dW = np.dot(self.phrase_representation.T, dLoss_dSigma_Zout)

        # Gradient for bias
        d_bias = np.sum(dLoss_dSigma_Zout, axis=0)
        
        # Gradient for phrase representation
        d_phrase_rep = np.dot(dLoss_dSigma_Zout, self.linearlayer.T)

        # Gradient for attention
        dL_dA = np.array([np.outer(np.ones(self.words_per_phrase), d_phrase_rep[i, :]) for i in range(d_phrase_rep.shape[0])])

        # Gradient for V
        d_Vval=np.matmul(np.transpose(dL_dA,(0,2,1)), self.Attention_weights) 
        dLoss_dV = np.mean(np.matmul(d_Vval,inputs),axis=0).T
        
        #Gradient of softmax
        dAttention_dSoftmax=np.matmul(self.Attention_weights,(1-self.Attention_weights))
        
        #Gradient of Q
        dQKscaled_dQ=np.matmul(np.transpose(inputs,(0,2,1)),self.Kval)/np.sqrt(self.dk)
        dLoss_dQ=np.mean(np.transpose(np.transpose(np.transpose(dL_dA,(0,2,1))@dAttention_dSoftmax,(0,2,1))@np.transpose( dQKscaled_dQ,(0,2,1)),(0,2,1))@self.Vval,axis=0)
        
        #Gradient of K
        dQKscaled_dK=np.transpose(self.Qval,(0,2,1))@inputs/np.sqrt(self.dk) 
        dLoss_dK=np.mean(np.transpose(np.transpose(np.transpose(dL_dA,(0,2,1))@dAttention_dSoftmax,(0,2,1))@dQKscaled_dK,(0,2,1))@self.Vval,axis=0)
        
        #Gradient clipping
        clip_value = 10.0
        dLoss_dQ = np.clip(dLoss_dQ, -clip_value, clip_value)
        dLoss_dK = np.clip(dLoss_dK, -clip_value, clip_value)
        dLoss_dV = np.clip(dLoss_dV, -clip_value, clip_value)
        dlinear_dW = np.clip(dlinear_dW, -clip_value, clip_value)
        d_bias = np.clip(d_bias, -clip_value, clip_value)

        self.UpdateParams(dLoss_dQ,dLoss_dK,dLoss_dV,dlinear_dW,d_bias)

    def UpdateParams(self, dLoss_dQ, dLoss_dK, dLoss_dV, dlinear_dW, d_bias):
        self.Q -= self.learning_rate * dLoss_dQ
        self.K -= self.learning_rate * dLoss_dK
        self.V -= self.learning_rate * dLoss_dV
        self.linearlayer -= self.learning_rate * dlinear_dW
        self.linear_bias -= self.learning_rate * d_bias

    def train(self, X_train, y_train, num_epochs, learning_rate=0.01):

        self.learning_rate = learning_rate

        for epoch in range(num_epochs):

            total_loss = 0

            num_batches_per_epoch = len(X_train) // self.batch_size

            for i in tqdm(range(num_batches_per_epoch), desc=f"Epoch {epoch + 1}/{num_epochs}"):
                start = i * self.batch_size
                end = start + self.batch_size
                X_batch = X_train[start:end]
                y_batch = y_train[start:end]

                yi = self.forward(X_batch)

                Loss = self.cross_entropy_loss(yi, y_batch)
                total_loss += Loss

                dLoss_dSigma_Zout = yi - y_batch

                self.BackPropagation(dLoss_dSigma_Zout, X_batch)

            print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {(total_loss / num_batches_per_epoch):.4f}")

    def predict(self, X):
        return self.forward(X)
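
The training loop passes `yi - y_batch` to `BackPropagation` as the gradient at the output. For a softmax output combined with cross-entropy summed over the batch, that difference is the gradient with respect to the pre-softmax scores `Zout`. The sketch below checks this numerically with central finite differences; the helper names and the random data are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def summed_cross_entropy(logits, target):
    """Cross-entropy summed over the batch, computed from raw scores."""
    return -np.sum(target * np.log(softmax(logits)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))                 # 4 samples, 5 classes
target = np.eye(5)[rng.integers(0, 5, size=4)]   # random one-hot labels

analytic = softmax(logits) - target              # same form as yi - y_batch

# Central finite differences, one logit at a time
numeric = np.zeros_like(logits)
eps = 1e-6
for idx in np.ndindex(*logits.shape):
    bumped_up, bumped_down = logits.copy(), logits.copy()
    bumped_up[idx] += eps
    bumped_down[idx] -= eps
    numeric[idx] = (summed_cross_entropy(bumped_up, target)
                    - summed_cross_entropy(bumped_down, target)) / (2 * eps)

max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

Because `cross_entropy_loss` in the class averages over the batch, the reported loss and this gradient differ by a factor of the batch size, which in effect rescales the learning rate.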

# Model Training for Text Classification

To test the classifier, we will train the model on the BBC Text Classification dataset. As a starting point, we need embedded word representations, which we generate with spaCy's `en_core_web_lg` model. This model provides pre-trained word vectors that help capture the semantic meaning of the words in our text data.
The model consists of a single attention head followed by a linear layer with 5 output neurons, one per class:

![classifier.png](attachment:d4913556-b72a-4480-a138-52ac618abbb1.png)

### Dataset Overview

The BBC Text Classification dataset consists of news articles, each labeled with one of five categories:

1. Business
2. Entertainment
3. Politics
4. Sport
5. Technology

### Preprocessing
We will use the `spaCy` library for preprocessing the text data. The preprocessing steps will include:
- Removing stop words
- Tokenizing the text
- Lemmatizing the words
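
As a rough illustration of the tokenizing and stop-word steps without loading spaCy, the sketch below splits on whitespace and filters against a tiny hand-written stop list. The set `STOP_WORDS` and the function `simple_preprocess` are hypothetical; spaCy's own stop list is much larger, and lemmatization is not shown here.

```python
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and"}   # illustrative only

def simple_preprocess(text):
    """Whitespace-tokenize and drop stop words (case-insensitive match)."""
    tokens = text.split()
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]

tokens = simple_preprocess("The price of oil is rising in the market")
print(tokens)   # ['price', 'oil', 'rising', 'market']
```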

### Input Preparation
The input data will be prepared as follows:
- **Fixed Phrase Length**: Texts will be padded or truncated to a fixed phrase length to ensure uniform input size.
- **Embedding Size**: Each word in the text will be represented by a 300-dimensional embedding vector taken from `en_core_web_lg`.
- **Batch Size**: Data will be processed in batches to optimize training efficiency and performance.

By preparing the data in this manner, we aim to create a robust model that can effectively classify text into one of the five categories.
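
The fixed-length and embedding steps can be sketched with a toy lookup table in place of spaCy's vocabulary. The names `pad_or_truncate` and `embed`, the 3-dimensional table, and its numbers are all made up for illustration; unknown tokens and padding are mapped to zero vectors here by assumption.

```python
import numpy as np

PAD = "<PAD>"

def pad_or_truncate(tokens, max_words):
    """Return exactly max_words tokens, cutting long lists and padding short ones."""
    return tokens[:max_words] + [PAD] * (max_words - len(tokens))

def embed(tokens, table, dim):
    """Look up each token; unknown tokens and padding map to a zero vector."""
    return np.array([table.get(tok, np.zeros(dim)) for tok in tokens])

# Toy 3-dimensional embedding table (made-up numbers)
table = {"oil": np.array([0.1, 0.2, 0.3]), "price": np.array([0.4, 0.5, 0.6])}

short = pad_or_truncate(["oil", "price"], max_words=4)
long = pad_or_truncate(["oil", "price", "oil", "price", "oil"], max_words=4)
batch = np.stack([embed(short, table, 3), embed(long, table, 3)])
print(batch.shape)   # (2, 4, 3)
```

Every phrase ends up with the same `(max_words, dim)` shape, so the phrases stack into a single 3-D array that can be sliced into batches.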


In [None]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import Doc
from functools import partial

# Load the data
df = pd.read_csv("/kaggle/input/bbc-fulltext-and-category/bbc-text.csv")

# Load spaCy model
nlp = spacy.load('en_core_web_lg')

# Disable unnecessary pipeline components
nlp.disable_pipes(["parser", "ner"])

# Custom tokenizer to speed up processing
def custom_tokenizer(text):
    return Doc(nlp.vocab, words=text.split())

nlp.tokenizer = custom_tokenizer

# Preprocess function: takes an already-processed spaCy Doc
def preprocess_text(doc, max_words=70):
    # Drop stop words and empty tokens, then truncate or pad to max_words
    tokens = [token.text for token in doc if not token.is_stop and token.text.strip()]
    return tokens[:max_words] + ['<PAD>'] * (max_words - len(tokens))

# Vectorize function
def vectorize_text(tokens, nlp):
    return np.array([nlp.vocab[token].vector for token in tokens])

# Apply preprocessing in batches
batch_size = 1000
preprocessed_texts = []

for i in tqdm(range(0, len(df), batch_size)):
    batch = df['text'].iloc[i:i+batch_size]
    # nlp.pipe already yields processed Doc objects, so they are not passed through nlp again
    preprocessed_batch = [preprocess_text(doc) for doc in nlp.pipe(batch, batch_size=64)]
    preprocessed_texts.extend(preprocessed_batch)

df['processed_text'] = preprocessed_texts

# Vectorize in batches
inputs = []
for tokens in tqdm(df['processed_text']):
    inputs.append(vectorize_text(tokens, nlp))

# Convert to numpy array for further processing
inputs = np.array(inputs)

In [None]:
df

## Train-Test Split

For training and evaluating the model, we will split the dataset as follows:

- **Training Set**: 75% of the dataset will be used for training the model.
- **Test Set**: 25% of the dataset will be reserved for testing the model's performance.

This split ensures that we have a substantial amount of data for training while also retaining a significant portion for evaluation to assess the model's generalization capabilities.
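
To make the 75/25 split concrete, the sketch below does the same thing by hand with a shuffled index array. It is an illustration only: `split_75_25` is a hypothetical helper, the data are made up, and scikit-learn's `train_test_split` will generally produce a different shuffle for the same seed.

```python
import numpy as np

def split_75_25(X, y, seed=42):
    """Shuffle sample indices once, then cut them at the 75% mark."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    cut = int(0.75 * len(X))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 8 samples, 2 words each, embedding size 3, 5 classes
X = np.arange(8 * 2 * 3, dtype=float).reshape(8, 2, 3)
y = np.eye(5, dtype=int)[np.arange(8) % 5]     # one-hot labels

X_tr, X_te, y_tr, y_te = split_75_25(X, y)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
# (6, 2, 3) (2, 2, 3) (6, 5) (2, 5)
```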


In [None]:
from sklearn.model_selection import train_test_split
y = np.array(pd.get_dummies(df["category"], dtype=int))
X_train, X_test, y_train, y_test = train_test_split(inputs, y, test_size=0.25, random_state=42)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

In [None]:
def pad_sequences(sequences, max_len):
    padded_sequences = np.zeros((len(sequences), max_len, sequences[0].shape[1]))
    for i, seq in enumerate(sequences):
        length = min(seq.shape[0], max_len)
        padded_sequences[i, :length] = seq[:length]
    return padded_sequences

word_len = 300
max_words_per_phrase=50
dk = 32
dv = 64
batch_size = 64
num_classes = 5
 

# Pad or truncate phrases to the maximum phrase length
X_train_padded = pad_sequences(X_train, max_words_per_phrase)
 

model = QKVAttentionClassifier(word_len, max_words_per_phrase, batch_size, dk, dv, num_classes)
model.train(X_train_padded, y_train, num_epochs=30, learning_rate=0.001)




## Model Evaluation

To evaluate the performance of the text classification model, we will use the following metrics:

- **Confusion Matrix**: This will help us understand how well the model is performing across the different categories by showing the counts of true positive, true negative, false positive, and false negative predictions.

- **F1 Score**: This metric provides a balance between precision and recall. The F1 score is especially useful when dealing with imbalanced datasets, as it considers both false positives and false negatives, providing a single metric that summarizes the model's accuracy.

These evaluation metrics will give us insight into the effectiveness of the model in classifying text into the correct categories and help us identify areas for improvement.
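
To show what these two metrics compute, here is a small from-scratch sketch in NumPy. The functions `confusion_counts` and `per_class_f1` and the six toy labels are assumptions for illustration; the notebook itself relies on scikit-learn for the actual evaluation.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """cm[i, j] counts samples whose true class is i and predicted class is j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_f1(cm):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for every class."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp      # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp      # belonging to the class but predicted elsewhere
    denom = 2 * tp + fp + fn
    # Classes that never appear in either vector get an F1 of 0
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_counts(y_true, y_pred, num_classes=3)
f1 = per_class_f1(cm)
print(cm)
print(f1)
```

Rows of the matrix correspond to true labels and columns to predicted labels, so off-diagonal entries show which categories get confused with each other.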


In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix,f1_score
import seaborn as sns
import matplotlib.pyplot as plt

X_test= pad_sequences(X_test, max_words_per_phrase)

predictions = model.predict(X_test)
y_pred = np.argmax(predictions, axis=1)

# Convert one-hot encoded true labels to class labels
y_true = np.argmax(y_test, axis=1) 
# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=np.arange(5))

# Visualize the confusion matrix
plt.figure(figsize=(5, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=np.arange(5), yticklabels=np.arange(5))
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()


In [None]:
# scikit-learn expects the true labels first, then the predictions
f1_score(y_true, y_pred, average=None)

## Conclusion

In this project, we implemented from scratch a Natural Language Processing (NLP) model based on the QKV (Queries, Keys, Values) attention mechanism. After building and training the model, the per-class **F1** scores are consistently above 0.90, indicating that the model performs well across all five categories.


