 # NLP Model Implementation Using the QVC Attention Mechanism

This project develops a Natural Language Processing (NLP) model built around the QVC (Query, Value, Context) attention mechanism, implemented from scratch in Python and NumPy. Attention is a central component of modern NLP models: it lets the model weight different parts of the input sequence when making a prediction.

## Key Components:

- **QVC Attention Mechanism**: Understanding and implementing the Query, Value, and Context (QVC) attention mechanism from scratch.
- **Model Architecture**: Building the architecture of the NLP model utilizing QVC attention.
- **Training and Evaluation**: Training the model with appropriate datasets and evaluating its performance.

This project aims to provide a comprehensive guide to implementing and experimenting with attention mechanisms in NLP.


# Preparing Input Data for NLP Model

In this section, we prepare the input data for our NLP model by defining arrays that represent word embeddings and combining them into a single structured array.

As a starting point, consider 3 phrases of 4 words each, where each word has an embedding of size 5:

In [1]:
import sys
sys.path.append('c:\\python312\\lib\\site-packages')
import numpy as np
 

# Define twelve word embeddings of size 5 (three phrases of four words each)
word1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
word2 = np.array([0.5, 0.4, 0.7,0.3, 0.2])
word3 = np.array([0.2,0.7, 0.3, 0.5, 0.4])
word4 = np.array([0.4, 0.1,0.7, 0.2, 0.5])

word5 = np.array([0.1, 0.9, 0.3, 0.4, 0.5])
word6 = np.array([0.4, 0.4, 0.7,0.3, 0.8])
word7 = np.array([0.2,0.7, 0.4, 0.5, 0.4])
word8 = np.array([0.4, 0.5,0.7, 0.7, 0.8])

word9 = np.array([0.1, 0.2, 0.3, 0.8, 0.5])
word10 = np.array([0.4, 0.5, 0.7,0.3, 0.8])
word11 = np.array([0.9,0.7, 0.3, 0.5, 0.4])
word12 = np.array([0.4, 0.5,0.1, 0.7, 0.4])


Finally, we combine all these word embeddings into a single 3-D array. This array, `inputs`, has the shape `(3, 4, 5)`, where:
- `3` represents the number of phrases (batch size),
- `4` is the number of words in each phrase (sequence length),
- `5` is the dimensionality of each word embedding.


In [2]:
inputs = np.stack([[word1, word2, word3, word4],[word5, word6, word7, word8],[word9, word10, word11, word12]])
inputs

array([[[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.7, 0.3, 0.2],
        [0.2, 0.7, 0.3, 0.5, 0.4],
        [0.4, 0.1, 0.7, 0.2, 0.5]],

       [[0.1, 0.9, 0.3, 0.4, 0.5],
        [0.4, 0.4, 0.7, 0.3, 0.8],
        [0.2, 0.7, 0.4, 0.5, 0.4],
        [0.4, 0.5, 0.7, 0.7, 0.8]],

       [[0.1, 0.2, 0.3, 0.8, 0.5],
        [0.4, 0.5, 0.7, 0.3, 0.8],
        [0.9, 0.7, 0.3, 0.5, 0.4],
        [0.4, 0.5, 0.1, 0.7, 0.4]]])

In [162]:
Q = np.random.rand(5, 3)/ np.sqrt(5)
K = np.random.rand(5, 3)/ np.sqrt(5)
V = np.random.rand(5, 3)/ np.sqrt(5)

In [186]:
Qval=np.dot(inputs, Q)
Qval,Qval.shape

(array([[[0.425201  , 0.40500248, 0.49409738],
         [0.48972278, 0.60121395, 0.58693049],
         [0.55304515, 0.61197165, 0.71089506],
         [0.48882271, 0.54546272, 0.51623665]],
 
        [[0.61746062, 0.71337737, 0.74854206],
         [0.7022129 , 0.77874207, 0.77742381],
         [0.58464292, 0.65247304, 0.72424078],
         [0.82157489, 0.84860538, 0.98189856]],
 
        [[0.51709733, 0.43081224, 0.66222289],
         [0.72967856, 0.82279562, 0.81377305],
         [0.57788218, 0.69460224, 0.91895465],
         [0.48796275, 0.47937538, 0.7550135 ]]]),
 (3, 4, 3))

In [187]:
Kval=np.dot(inputs, K)
Kval

array([[[0.2439634 , 0.42241932, 0.36283375],
        [0.26863076, 0.60383225, 0.53556176],
        [0.2943189 , 0.5631973 , 0.5252599 ],
        [0.23434532, 0.5958629 , 0.43157899]],

       [[0.24525167, 0.62492683, 0.50190133],
        [0.30037985, 0.80506284, 0.56375154],
        [0.30140748, 0.60107104, 0.54871196],
        [0.46828168, 0.87394052, 0.76164825]],

       [[0.41168119, 0.46236734, 0.54086365],
        [0.30056389, 0.83399249, 0.58361834],
        [0.40181679, 0.72851921, 0.72083694],
        [0.39434624, 0.49679936, 0.58351628]]])

In [188]:
Vval=np.dot(inputs, V)
Vval

array([[[0.21777027, 0.4314999 , 0.42197821],
        [0.27344964, 0.4831857 , 0.51122975],
        [0.24391528, 0.56372772, 0.5131798 ],
        [0.30615055, 0.44759465, 0.49360769]],

       [[0.24340562, 0.58048731, 0.48836437],
        [0.40250886, 0.63577634, 0.64760763],
        [0.25184159, 0.58747352, 0.53467971],
        [0.43852881, 0.82996639, 0.83161395]],

       [[0.25012802, 0.60440604, 0.59650079],
        [0.40617106, 0.65706025, 0.65709137],
        [0.42245289, 0.63104091, 0.70538781],
        [0.2879279 , 0.57935369, 0.59339037]]])

In [166]:
QKscaled=np.matmul(Qval, np.transpose(Kval))/np.sqrt(K.shape[1])
QKscaled

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 4 is different from 3)
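
The error above arises because `np.transpose` with no `axes` argument reverses all axes, so a `(3, 4, 3)` array stays `(3, 4, 3)` instead of becoming `(3, 3, 4)`. A minimal sketch with placeholder arrays (the names `q`, `k` and `k_t` are illustrative, not from the notebook) shows that only the last two axes should be swapped for a batched product:

```python
import numpy as np

# Placeholder batch: 3 phrases, 4 words, 3 projected features
q = np.ones((3, 4, 3))
k = np.ones((3, 4, 3))

# Default transpose reverses every axis: (3, 4, 3) -> (3, 4, 3)
print(np.transpose(k).shape)

# Swapping only the last two axes gives (3, 3, 4): one K^T per phrase
k_t = np.transpose(k, (0, 2, 1))
print(k_t.shape)

# Batched matmul now yields one (4, 4) score matrix per phrase
scores = np.matmul(q, k_t)
print(scores.shape)
```

`np.swapaxes(k, -1, -2)` is an equivalent way to write the same transpose.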

In [189]:
QKscaled=np.matmul(Qval, np.transpose(Kval, (0, 2, 1)))/np.sqrt(K.shape[1])
QKscaled

array([[[0.26216873, 0.35991744, 0.35378323, 0.31997437],
        [0.33855647, 0.46703278, 0.45670011, 0.41933654],
        [0.37606757, 0.51893476, 0.50855153, 0.46249343],
        [0.31002377, 0.42559803, 0.41698043, 0.38242019]],

       [[0.56172495, 0.68229989, 0.59214874, 0.85604915],
        [0.60567839, 0.73678022, 0.63873026, 0.92464445],
        [0.52806186, 0.64039035, 0.55760373, 0.8057598 ],
        [0.70703762, 0.85650607, 0.74852332, 1.08208166]],

       [[0.44470118, 0.52030828, 0.57676605, 0.46439773],
        [0.64719177, 0.79700507, 0.85402644, 0.6762851 ],
        [0.60973603, 0.74437906, 0.80866583, 0.64039034],
        [0.4797157 , 0.56990197, 0.6290505 , 0.50295425]]])

In [190]:
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max value for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

Attention_weights=softmax(QKscaled)
Attention_weights

array([[[0.23484449, 0.25895965, 0.257376  , 0.24881986],
        [0.23006315, 0.26160354, 0.25891439, 0.24941892],
        [0.22802376, 0.26304287, 0.26032577, 0.24860759],
        [0.23199244, 0.26041566, 0.25818114, 0.24941077]],

       [[0.22216024, 0.25062902, 0.22902306, 0.29818768],
        [0.21980956, 0.25060134, 0.22719607, 0.30239303],
        [0.2237662 , 0.25036759, 0.23047528, 0.29539093],
        [0.21465535, 0.24926141, 0.22374778, 0.31233545]],

       [[0.23587059, 0.25439557, 0.26917135, 0.24056248],
        [0.2261974 , 0.26275483, 0.27817287, 0.2328749 ],
        [0.22751298, 0.26030404, 0.27758775, 0.23459523],
        [0.23370217, 0.25575854, 0.27134263, 0.23919666]]])
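
A quick sanity check on the softmax is that every row of attention weights sums to 1 and that large scores do not overflow. The sketch below repeats the `softmax` definition so that it runs on its own; the `scores` array is a random placeholder with the same layout as `QKscaled`, not the notebook's values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max value for numerical stability
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

# Placeholder scores: (batch, seq_len, seq_len)
rng = np.random.default_rng(0)
scores = rng.random((3, 4, 4))
weights = softmax(scores)

# Each row of attention weights should sum to 1
row_sums = weights.sum(axis=-1)
print(np.allclose(row_sums, 1.0))

# Large scores stay finite thanks to the max subtraction
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(np.isfinite(big).all())
```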

In [191]:
Attention=np.matmul(Attention_weights, Vval)
Attention

array([[[0.26090885, 0.4829214 , 0.48638669],
        [0.26114922, 0.48327111, 0.48680588],
        [0.26119456, 0.48351907, 0.48700495],
        [0.26106319, 0.48311262, 0.4866324 ]],

       [[0.34339687, 0.67033594, 0.64123543],
        [0.34419762, 0.67137079, 0.64258989],
        [0.34282182, 0.6696339 , 0.64030108],
        [0.34589534, 0.67375304, 0.64562954]],

       [[0.34530283, 0.61894373, 0.64047578],
        [0.34786783, 0.61981622, 0.64198633],
        [0.3474496 , 0.6196285 , 0.64176878],
        [0.34583805, 0.61910754, 0.64079904]]])

In [170]:
phrase_representation = np.mean(Attention, axis=0)
print("Phrase Representation:")
print(phrase_representation)

Phrase Representation:
[[0.32162485 0.58945342 0.59956795]
 [0.32284648 0.59020857 0.60069388]
 [0.32225555 0.5896471  0.59991074]
 [0.32270605 0.59071878 0.60125639]]


In [192]:
phrase_representation = np.mean(Attention, axis=1)
print("Phrase Representation:")
print(phrase_representation)

Phrase Representation:
[[0.26107895 0.48320605 0.48670748]
 [0.34407791 0.67127342 0.64243898]
 [0.34661458 0.619374   0.64125748]]
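
The two cells above differ only in the axis being averaged. A minimal sketch with a placeholder array (not the notebook's values) suggests why `axis=1` is the choice that yields one vector per phrase:

```python
import numpy as np

# Placeholder attention output: (batch=3 phrases, seq_len=4 words, hidden=3)
attention = np.arange(36, dtype=float).reshape(3, 4, 3)

# axis=0 averages across phrases: one row per word position, phrases mixed together
across_phrases = attention.mean(axis=0)
print(across_phrases.shape)

# axis=1 averages across the words of each phrase: one vector per phrase
per_phrase = attention.mean(axis=1)
print(per_phrase.shape)
```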


In [36]:
phrase_representation.shape[1]

3

In [193]:
num_classes = 2  # Example number of classes
linearlayer= np.random.rand(phrase_representation.shape[1], num_classes)   
linear_bias = np.random.rand(num_classes)


In [66]:
num_classes = 2  # Example number of classes
linearlayer= np.random.rand(phrase_representation.shape[0], num_classes)   
linear_bias = np.random.rand(num_classes)


In [194]:
probabilities=softmax(np.matmul(phrase_representation, linearlayer)+linear_bias)
probabilities

array([[0.48288356, 0.51711644],
       [0.46798522, 0.53201478],
       [0.46533757, 0.53466243]])

In [195]:
# Cross-Entropy Loss
target =np.array([0, 1])
def cross_entropy_loss(predictions, target):
    return -np.sum(target * np.log(predictions + 1e-8))  # Adding a small constant to avoid log(0)
loss = cross_entropy_loss(probabilities, target)
print("Cross-Entropy Loss:")
print(loss)

Cross-Entropy Loss:
1.9166908691853788


In [196]:
import numpy as np

# True targets (one-hot encoded), one row per phrase
target = np.array([[0, 1], [1, 0], [1, 0]])


def cross_entropy_loss(predictions, target):
    # Cross-entropy loss for a batch of predictions and targets
    batch_loss = -np.sum(target * np.log(predictions + 1e-8), axis=1)
    return np.mean(batch_loss) 
# Calculate loss
loss = cross_entropy_loss(probabilities, target)
print("Cross-Entropy Loss:")
print(loss)


Cross-Entropy Loss:
0.7279326240583354


In [175]:
# Gradient of loss with respect to output probabilities
d_probabilities = probabilities - target 
d_probabilities



array([[ 0.39263926, -0.39263926],
       [-0.63592837,  0.63592837],
       [-0.63681096,  0.63681096]])
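
The expression `probabilities - target` is the gradient of the summed cross-entropy with respect to the pre-softmax logits (for a loss averaged over the batch it would additionally be divided by the batch size). A minimal sketch, using random placeholder logits and a hypothetical helper `loss_from_logits`, compares it against central finite differences:

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def loss_from_logits(logits, target):
    # Summed cross-entropy of softmax(logits) against one-hot targets
    return -np.sum(target * np.log(softmax(logits)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2))  # placeholder: 3 phrases, 2 classes
target = np.array([[0, 1], [1, 0], [1, 0]], dtype=float)

# Claimed gradient with respect to the logits
analytic = softmax(logits) - target

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(logits)
for idx in np.ndindex(*logits.shape):
    up = logits.copy()
    up[idx] += eps
    down = logits.copy()
    down[idx] -= eps
    numeric[idx] = (loss_from_logits(up, target) - loss_from_logits(down, target)) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)
```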

In [72]:
# Gradient for linear layer and bias
d_linear = np.outer(phrase_representation, d_probabilities)
d_bias = d_probabilities
d_linear,d_bias

(array([[ 0.14645888, -0.14645888],
        [ 0.13668903, -0.13668903],
        [ 0.17016618, -0.17016618]]),
 array([ 0.29630688, -0.29630688]))

In [176]:
# Gradient for linear layer and bias
d_linear = np.dot(phrase_representation.T, d_probabilities) 
d_bias =  np.sum(d_probabilities, axis=0)
d_linear,d_bias

(array([[-0.34254126,  0.34254126],
        [-0.63022567,  0.63022567],
        [-0.63682398,  0.63682398]]),
 array([-0.88010008,  0.88010008]))

In [73]:
# Gradient for phrase representation
d_phrase_rep = np.dot(d_probabilities, linearlayer.T)
d_phrase_rep

array([-0.25311242, -0.26720907,  0.01037652])

In [177]:
d_phrase_rep = np.dot(d_probabilities, linearlayer.T)
d_phrase_rep

array([[-0.13413652,  0.03503775, -0.26959773],
       [ 0.21725086, -0.05674802,  0.43664723],
       [ 0.21755238, -0.05682678,  0.43725324]])

In [75]:
np.ones(inputs.shape[0])

array([1., 1., 1., 1.])

In [76]:
d_phrase_rep

array([-0.25311242, -0.26720907,  0.01037652])

In [79]:
# Gradient for attention
d_attention = np.outer(np.ones(inputs.shape[0]), d_phrase_rep)# / inputs.shape[0]
d_attention

array([[-0.25311242, -0.26720907,  0.01037652],
       [-0.25311242, -0.26720907,  0.01037652],
       [-0.25311242, -0.26720907,  0.01037652],
       [-0.25311242, -0.26720907,  0.01037652]])

In [96]:
np.ones(inputs.shape[1])

array([1., 1., 1., 1.])

In [97]:
d_phrase_rep

array([[ 0.0070818 , -0.18506487, -0.20596266],
       [-0.01861819,  0.48653924,  0.54147994]])

In [178]:
d_attention = np.array([np.outer(np.ones(inputs.shape[1]), d_phrase_rep[i, :]) for i in range(d_phrase_rep.shape[0])])
d_attention

array([[[-0.13413652,  0.03503775, -0.26959773],
        [-0.13413652,  0.03503775, -0.26959773],
        [-0.13413652,  0.03503775, -0.26959773],
        [-0.13413652,  0.03503775, -0.26959773]],

       [[ 0.21725086, -0.05674802,  0.43664723],
        [ 0.21725086, -0.05674802,  0.43664723],
        [ 0.21725086, -0.05674802,  0.43664723],
        [ 0.21725086, -0.05674802,  0.43664723]],

       [[ 0.21755238, -0.05682678,  0.43725324],
        [ 0.21755238, -0.05682678,  0.43725324],
        [ 0.21755238, -0.05682678,  0.43725324],
        [ 0.21755238, -0.05682678,  0.43725324]]])

In [120]:
inputs

array([[[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.5, 0.4, 0.7, 0.3, 0.2],
        [0.2, 0.7, 0.3, 0.5, 0.4],
        [0.4, 0.1, 0.7, 0.2, 0.5]],

       [[0.1, 0.9, 0.3, 0.4, 0.5],
        [0.4, 0.4, 0.7, 0.3, 0.8],
        [0.2, 0.7, 0.4, 0.5, 0.4],
        [0.4, 0.5, 0.7, 0.7, 0.8]]])

In [121]:
np.transpose(inputs,(0,2,1))

array([[[0.1, 0.5, 0.2, 0.4],
        [0.2, 0.4, 0.7, 0.1],
        [0.3, 0.7, 0.3, 0.7],
        [0.4, 0.3, 0.5, 0.2],
        [0.5, 0.2, 0.4, 0.5]],

       [[0.1, 0.4, 0.2, 0.4],
        [0.9, 0.4, 0.7, 0.5],
        [0.3, 0.7, 0.4, 0.7],
        [0.4, 0.3, 0.5, 0.7],
        [0.5, 0.8, 0.4, 0.8]]])

In [122]:
Attention_weights

array([[[0.24019827, 0.25219573, 0.25849304, 0.24911296],
        [0.23715059, 0.25280595, 0.26098206, 0.24906141],
        [0.23799408, 0.25216139, 0.25932752, 0.25051701],
        [0.23686983, 0.25351364, 0.26210051, 0.24751602]],

       [[0.238775  , 0.25485377, 0.23422579, 0.27214544],
        [0.23865045, 0.25306951, 0.23228578, 0.27599426],
        [0.23895129, 0.25474077, 0.23408144, 0.27222649],
        [0.23549621, 0.25485798, 0.22796979, 0.28167601]]])

In [123]:
d_attention

array([[[ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266]],

       [[-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994]]])

In [124]:
np.matmul(Attention_weights, d_attention)

array([[[ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266],
        [ 0.0070818 , -0.18506487, -0.20596266]],

       [[-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994],
        [-0.01861819,  0.48653924,  0.54147994]]])

In [125]:
np.matmul(np.transpose(inputs,(0,2,1)),np.matmul(Attention_weights, d_attention))

array([[[ 0.00849816, -0.22207784, -0.24715519],
        [ 0.00991452, -0.25909082, -0.28834772],
        [ 0.0141636 , -0.37012974, -0.41192531],
        [ 0.00991452, -0.25909082, -0.28834772],
        [ 0.01133088, -0.29610379, -0.32954025]],

       [[-0.02048001,  0.53519316,  0.59562793],
        [-0.04654548,  1.2163481 ,  1.35369985],
        [-0.0390982 ,  1.0217324 ,  1.13710787],
        [-0.03537456,  0.92442456,  1.02881188],
        [-0.04654548,  1.2163481 ,  1.35369985]]])

In [110]:
# Gradient for V
d_Vval = np.dot(Attention_weights, d_attention)
d_V = np.dot(inputs.T, d_Vval)
d_V

array([[[[[-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173]],

         [[-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173],
          [-0.00115364,  0.03014744,  0.03355173]]],


        [[[-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065]],

         [[-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065],
          [-0.00390638,  0.10208326,  0.11361065]]],


        [[[-0.00230728,  0.06029487,  0.06710346],
          [-0.00230728,  0.06029487,  0.06710346],
          [-0.00230728,  0.06029487,  0.06710346],
          [-0.00230

In [179]:
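# Gradient for V
# Note: Attention = Attention_weights @ Vval, so the exact gradient w.r.t. Vval uses
# the transposed weights: np.matmul(np.transpose(Attention_weights, (0, 2, 1)), d_attention).
# The line below omits the transpose; since every row of d_attention is identical and
# each row of Attention_weights sums to 1, it simply returns d_attention.
d_Vval = np.matmul(Attention_weights, d_attention)
d_V = np.mean(np.matmul(np.transpose(inputs,(0,2,1)),d_Vval),axis=0)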
d_V

array([[ 0.15653547, -0.04088858,  0.31461684],
       [ 0.25622851, -0.06692936,  0.51498747],
       [ 0.1641757 , -0.04288428,  0.32997275],
       [ 0.24178533, -0.06315666,  0.48595846],
       [ 0.26178957, -0.06838196,  0.5261645 ]])

In [180]:
 # Gradient for attention weights
Vval_T = np.transpose(Vval, (0, 2, 1))  # (batch_size, hidden_dim, seq_len)

# Compute gradient w.r.t. attention weights
# Shape: (batch_size, seq_len, seq_len)
d_attention_weights = np.matmul(d_attention, Vval_T)
d_attention_weights

array([[[-0.13021179, -0.16045778, -0.15464488, -0.16107885],
        [-0.13021179, -0.16045778, -0.15464488, -0.16107885],
        [-0.13021179, -0.16045778, -0.15464488, -0.16107885],
        [-0.13021179, -0.16045778, -0.15464488, -0.16107885]],

       [[ 0.23898649,  0.34037878,  0.26041125,  0.41888764],
        [ 0.23898649,  0.34037878,  0.26041125,  0.41888764],
        [ 0.23898649,  0.34037878,  0.26041125,  0.41888764],
        [ 0.23898649,  0.34037878,  0.26041125,  0.41888764]],

       [[ 0.28578605,  0.34486993,  0.37109162,  0.29452467],
        [ 0.28578605,  0.34486993,  0.37109162,  0.29452467],
        [ 0.28578605,  0.34486993,  0.37109162,  0.29452467],
        [ 0.28578605,  0.34486993,  0.37109162,  0.29452467]]])

In [181]:
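# Gradient for QK scaled
# Note: A * (1 - A) is only the diagonal of the softmax Jacobian. The full row-wise
# gradient is A * (dA - np.sum(dA * A, axis=-1, keepdims=True)).
d_QKscaled = d_attention_weights * Attention_weights * (1 - Attention_weights)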
d_QKscaled

array([[[-0.02338979, -0.03079541, -0.02956496, -0.03010524],
        [-0.0230545 , -0.03099962, -0.0296818 , -0.0301531 ],
        [-0.02290924, -0.03110991, -0.02978778, -0.03008716],
        [-0.02319052, -0.03090817, -0.02962635, -0.03015267]],

       [[ 0.04127577,  0.06392592,  0.04595515,  0.08772291],
        [ 0.04096013,  0.06392066,  0.04569394,  0.08843089],
        [ 0.04148957,  0.06388187,  0.04616095,  0.08724443],
        [ 0.04025815,  0.06369105,  0.04519472,  0.09004549]],

       [[ 0.05148291,  0.06543583,  0.07302731,  0.05379028],
        [ 0.0499843 ,  0.06683555,  0.07454869,  0.05259008],
        [ 0.0501912 ,  0.06643143,  0.07445131,  0.05286093],
        [ 0.051152  ,  0.06566757,  0.07339946,  0.05357948]]])
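
The factor `A * (1 - A)` covers only the diagonal of the softmax Jacobian; each output of a softmax row also depends on the other scores in that row. A minimal sketch of the full row-wise backward step, checked against finite differences, might look like this (the helper name `softmax_backward` and the random `scores` and `upstream` arrays are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def softmax_backward(weights, d_weights):
    # Row-wise Jacobian-vector product:
    # dS_ij = A_ij * (dA_ij - sum_k dA_ik * A_ik)
    inner = np.sum(d_weights * weights, axis=-1, keepdims=True)
    return weights * (d_weights - inner)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))    # placeholder scaled scores for one phrase
upstream = rng.normal(size=(4, 4))  # placeholder gradient w.r.t. the weights

weights = softmax(scores)
analytic = softmax_backward(weights, upstream)

# Finite differences of f(scores) = sum(upstream * softmax(scores))
eps = 1e-6
numeric = np.zeros_like(scores)
for idx in np.ndindex(*scores.shape):
    up = scores.copy()
    up[idx] += eps
    down = scores.copy()
    down[idx] -= eps
    f_up = np.sum(upstream * softmax(up))
    f_down = np.sum(upstream * softmax(down))
    numeric[idx] = (f_up - f_down) / (2 * eps)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)

# The diagonal-only shortcut gives a different result
diag_only = upstream * weights * (1 - weights)
print(np.allclose(diag_only, analytic))
```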

In [182]:
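# Gradient for Q and K
# Note: QKscaled = Qval @ Kval^T / sqrt(d), so the exact gradient w.r.t. Kval uses the
# transposed score gradient: np.matmul(np.transpose(d_QKscaled, (0, 2, 1)), Qval).
d_Qval = np.matmul(d_QKscaled, Kval) / np.sqrt(K.shape[1])
d_Kval = np.matmul(d_QKscaled, Qval) / np.sqrt(K.shape[1])
d_Q = np.mean(np.matmul(np.transpose(inputs,(0,2,1)), d_Qval),axis=0)
d_K =  np.mean(np.matmul(np.transpose(inputs,(0,2,1)), d_Kval),axis=0)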

In [183]:
d_Qval

array([[[-0.01729639, -0.03654878, -0.03105323],
        [-0.01730729, -0.03659279, -0.03109365],
        [-0.01731308, -0.03660766, -0.03111312],
        [-0.01730273, -0.03657585, -0.03107688]],

       [[ 0.04902884,  0.10522794,  0.08639036],
        [ 0.04912971,  0.10537874,  0.08652642],
        [ 0.04895755,  0.10511419,  0.08629228],
        [ 0.04934113,  0.10566123,  0.08680138]],

       [[ 0.05313395,  0.09177562,  0.08709111],
        [ 0.05310171,  0.09234668,  0.08732535],
        [ 0.05311962,  0.09224382,  0.08730424],
        [ 0.05313416,  0.09189526,  0.08715012]]])

In [184]:
learning_rate=0.01

In [185]:
# Update weights
Q -= learning_rate * d_Q
K -= learning_rate * d_K
V -= learning_rate * d_V
linearlayer -= learning_rate * d_linear
linear_bias -= learning_rate * d_bias

array([[0.41492789, 0.03156887, 0.15722164],
       [0.09636694, 0.38189275, 0.08675516],
       [0.22720729, 0.1759972 , 0.39875149],
       [0.35855727, 0.29835989, 0.07211897],
       [0.35133922, 0.0524147 , 0.40287874]])
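
Before applying an update like the one above, it can help to confirm the hand-derived gradients numerically. The following is a minimal sketch for a single phrase, not the notebook's batched code: the names `forward`, `backward`, `loss_fn` and the `params` dictionary are assumptions made for illustration. It uses transposed attention weights for the `V` path, the full softmax Jacobian, and a transposed score gradient for the `K` path, then compares every entry against central finite differences.

```python
import numpy as np

def softmax(x, axis=-1):
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)

def forward(x, p):
    # x: (seq_len, emb_dim); p: dict of weight arrays
    q, k, v = x @ p["Q"], x @ p["K"], x @ p["V"]
    a = softmax(q @ k.T / np.sqrt(k.shape[1]))
    rep = (a @ v).mean(axis=0)
    probs = softmax(rep @ p["W"] + p["b"])
    return probs, (q, k, v, a, rep)

def loss_fn(x, y, p):
    probs, _ = forward(x, p)
    return -np.sum(y * np.log(probs))

def backward(x, y, p):
    probs, (q, k, v, a, rep) = forward(x, p)
    n, scale = x.shape[0], np.sqrt(k.shape[1])
    d_logits = probs - y                      # softmax + cross-entropy
    d_rep = p["W"] @ d_logits                 # through the linear layer
    d_att = np.tile(d_rep / n, (n, 1))        # through the mean over words
    d_v = a.T @ d_att                         # attention = a @ v
    d_a = d_att @ v.T
    d_s = a * (d_a - np.sum(d_a * a, axis=-1, keepdims=True))  # full softmax Jacobian
    d_q = d_s @ k / scale
    d_k = d_s.T @ q / scale
    return {"Q": x.T @ d_q, "K": x.T @ d_k, "V": x.T @ d_v,
            "W": np.outer(rep, d_logits), "b": d_logits}

rng = np.random.default_rng(0)
x = rng.random((4, 5))                        # placeholder phrase: 4 words, 5 features
y = np.array([0.0, 1.0])                      # placeholder one-hot target
params = {"Q": rng.random((5, 3)) / np.sqrt(5),
          "K": rng.random((5, 3)) / np.sqrt(5),
          "V": rng.random((5, 3)) / np.sqrt(5),
          "W": rng.random((3, 2)),
          "b": rng.random(2)}

grads = backward(x, y, params)

# Central finite differences over every weight entry
eps = 1e-6
max_err = 0.0
for name, w in params.items():
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = loss_fn(x, y, params)
        w[idx] = old - eps
        down = loss_fn(x, y, params)
        w[idx] = old
        max_err = max(max_err, abs((up - down) / (2 * eps) - grads[name][idx]))
print(max_err)
```

A small `max_err` indicates that the analytic gradients agree with the numerical ones for this single-phrase setup.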

In [1]:
import spacy 
import numpy as np 
import pandas as pd
df=pd.read_csv("data/bbc-text.csv")
nlp = spacy.load('en_core_web_lg')
 
def preprocess_text(text, max_words=70):
    # Process the text using SpaCy
    doc = nlp(text)
    
    # Filter out stopwords, punctuation, and spaces
    tokens = [token.text for token in doc if not token.is_stop and not token.is_punct and not token.is_space]
    
    # Keep at most 'max_words' tokens and pad shorter texts
    if len(tokens) > max_words:
        tokens = tokens[:max_words]  # Keep the first max_words words
    else:
        tokens += ['<PAD>'] * (max_words - len(tokens))  # Pad the list with '<PAD>' token
    
    # Return the tokens as a list
    return tokens
df['processed_text'] = df['text'].apply(preprocess_text)
df['processed_text']

ModuleNotFoundError: No module named 'pandas'

In [72]:
df.to_csv("Processed_dataframe.csv",index=False)

In [None]:
import spacy
import pandas as pd
from spacy.tokens import Doc

In [79]:
inputs = [] 
# Process each token list in the 'processed_text' column
for phrase in list(df['processed_text']):
    doc = nlp(" ".join(phrase))  # Process the phrase with SpaCy
    # Extract word vectors
    matrix = np.array([token.vector for token in doc])
    inputs.append(matrix)


In [82]:
import pickle
# with open('InputProcessed.pkl', 'wb') as f:
#     pickle.dump(np.array(inputs, dtype=object), f)

# Load from pickle file
with open('InputProcessed.pkl', 'rb') as f:
    X = pickle.load(f)

y = np.array(pd.get_dummies(df["category"], dtype=int))
tts=0.85
X_train,X_test=X[0:round(tts*len(X))],X[round(tts*len(X)):]
y_train,y_test=y[0:round(tts*len(X))],y[round(tts*len(X)):]

In [88]:
len(X_train)

1891

In [67]:
X

array([array([[-1.0279  ,  5.8798  , -9.6866  , ..., -5.2057  , -6.7824  ,
               -0.97569 ],
              [ 1.7561  ,  3.5755  , -2.56    , ..., -3.9903  , -4.6985  ,
                6.7232  ],
              [-3.7766  ,  0.69426 , -3.3805  , ..., -7.698   ,  0.60605 ,
               -5.3688  ],
              ...,
              [-3.2771  ,  5.7918  , -6.2498  , ...,  0.3389  , -4.6256  ,
                3.8239  ],
              [ 4.3092  ,  6.3754  , -6.2211  , ...,  2.7577  , -5.5191  ,
               -0.42839 ],
              [-0.076454, -4.6896  , -4.0431  , ...,  1.304   , -0.52699 ,
               -1.3622  ]], dtype=float32)                                ,
       array([[  0.      ,   0.      ,   0.      , ...,   0.      ,   0.      ,
                 0.      ],
              [  1.2616  ,  -0.74531 ,  -0.17045 , ...,   4.4011  ,   1.5951  ,
                 1.6841  ],
              [  0.      ,   0.      ,   0.      , ...,   0.      ,   0.      ,
                 0.     

In [5]:
def get_train_test_data(data_dir):
    # Get the train data
    train_data = pd.read_json(f"{data_dir}/train.json")
    train_data.drop(['id'], axis=1, inplace=True)

    # Get the test data
    test_data = pd.read_json(f"{data_dir}/test.json")
    test_data.drop(['id'], axis=1, inplace=True)
    
    return train_data, test_data

data_dir = "corpus"

train_data, test_data = get_train_test_data(data_dir)

# Take one example from the dataset and print it
example_summary, example_dialogue = train_data.iloc[10]
print(f"Dialogue:\n{example_dialogue}")
print(f"\nSummary:\n{example_summary}")

Dialogue:
Lucas: Hey! How was your day?
Demi: Hey there! 
Demi: It was pretty fine, actually, thank you!
Demi: I just got promoted! :D
Lucas: Whoa! Great news!
Lucas: Congratulations!
Lucas: Such a success has to be celebrated.
Demi: I agree! :D
Demi: Tonight at Death & Co.?
Lucas: Sure!
Lucas: See you there at 10pm?
Demi: Yeah! See you there! :D

Summary:
Demi got promoted. She will celebrate that with Lucas at Death & Co at 10 pm.


In [2]:
import sys
sys.path.append('c:\\python312\\lib\\site-packages')
import pickle
import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm

def softmax(x, axis=-1):
    # Clip extreme values, then subtract the max for numerical stability
    x = np.clip(x, -1500, 1500)
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)


def cross_entropy_loss(predictions, target):
    return -np.sum(target * np.log(predictions + 1e-9))  # Adding a small constant to avoid log(0)


df=pd.read_csv("data/bbc-text.csv")
nlp = spacy.load('en_core_web_lg')
with open('InputProcessed.pkl', 'rb') as f:
    X = pickle.load(f)

y = np.array(pd.get_dummies(df["category"], dtype=int))
tts=0.85
X_train,X_test=X[0:round(tts*len(X))],X[round(tts*len(X)):]
y_train,y_test=y[0:round(tts*len(X))],y[round(tts*len(X)):]
X_train.shape,X_test.shape,y_train.shape,y_test.shape


 

class AttentionClassifier:
    def __init__(self, input_dim, hidden_dim, num_classes):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.num_classes = num_classes
        
        # Initialize weights
        self.Q = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
        self.K = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
        self.V = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
        self.linear = np.random.randn(hidden_dim, num_classes) / np.sqrt(hidden_dim)
        self.bias = np.zeros(num_classes)

    def forward(self, inputs):
        self.inputs = inputs  # Store for backprop
        
        # Attention mechanism
        self.Qval = np.dot(inputs, self.Q)
        self.Kval = np.dot(inputs, self.K)
        self.Vval = np.dot(inputs, self.V)
        
        self.QKscaled = np.dot(self.Qval, self.Kval.T) / np.sqrt(self.K.shape[1])
        self.attention_weights = softmax(self.QKscaled)
        self.attention = np.dot(self.attention_weights, self.Vval)
        
        # Phrase representation
        self.phrase_rep = np.mean(self.attention, axis=0)
        
        # Classification
        self.logits = np.dot(self.phrase_rep, self.linear) + self.bias
        probabilities = softmax(self.logits)
        
        return probabilities
 
    def backward(self, target, learning_rate):
        m = self.inputs.shape[0]  # Number of words in the input
        
        # Gradient of loss with respect to output probabilities
        d_probabilities = self.forward(self.inputs) - target
        
        # Gradient for linear layer and bias
        d_linear = np.outer(self.phrase_rep, d_probabilities)
        d_bias = d_probabilities
        
        # Gradient for phrase representation
        d_phrase_rep = np.dot(d_probabilities, self.linear.T)

      
        # Gradient for attention
        d_attention = np.outer(np.ones(m), d_phrase_rep) / m
         
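        # Gradient for V (attention = attention_weights @ Vval, so the weights are transposed)
        d_Vval = np.dot(self.attention_weights.T, d_attention)
        d_V = np.dot(self.inputs.T, d_Vval)
        
        
        # Gradient for attention weights
        d_attention_weights = np.dot(d_attention, self.Vval.T)
        
        # Gradient for QK scaled (full row-wise softmax Jacobian, not only its diagonal)
        row_dot = np.sum(d_attention_weights * self.attention_weights, axis=-1, keepdims=True)
        d_QKscaled = self.attention_weights * (d_attention_weights - row_dot)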
        
        # Gradient for Q and K
        d_Qval = np.dot(d_QKscaled, self.Kval) / np.sqrt(self.K.shape[1])
        d_Kval = np.dot(d_QKscaled.T, self.Qval) / np.sqrt(self.K.shape[1])
        
        d_Q = np.dot(self.inputs.T, d_Qval)
        d_K = np.dot(self.inputs.T, d_Kval)
        
        # Update weights
        self.Q -= learning_rate * d_Q
        self.K -= learning_rate * d_K
        self.V -= learning_rate * d_V
        self.linear -= learning_rate * d_linear
        self.bias -= learning_rate * d_bias
 

def train(model, X, y, epochs, learning_rate):
    for epoch in range(epochs):
        total_loss = 0
        for i in tqdm(range(len(X)), desc="Processing matrices"):
            inputs = X[i]
            target = y[i]
            
            # Forward pass
            probabilities = model.forward(inputs)
            
            # Compute loss
            loss = cross_entropy_loss(probabilities, target)
            total_loss += loss
            
            # Backward pass and update weights
            model.backward(target, learning_rate)
        
        # Print average loss for the epoch
        avg_loss = total_loss / len(X)
        print(f"Epoch {epoch+1}/{epochs}, Average Loss: {avg_loss:.4f}")



# Example usage
input_dim = 300 # Word embedding size
hidden_dim = 32  # Adjust based on your needs
num_classes = 5  # Adjust based on your classification task
model = AttentionClassifier(input_dim, hidden_dim, num_classes)

# X_train and y_train (one-hot encoded targets) come from the data-loading code above
 
# Train the model
train(model, X_train, y_train, epochs=20, learning_rate=0.001)

Processing matrices:  57%|███████████████████████████████▋                        | 1071/1891 [00:01<00:01, 681.16it/s]


KeyboardInterrupt: 

In [96]:
n=6
pd.get_dummies(df["category"], dtype=int).iloc[n+1891]

business         0
entertainment    0
politics         0
sport            1
tech             0
Name: 1897, dtype: int32

In [97]:
df["text"].iloc[n+1891]

'henry tipped for fifa award fifa president sepp blatter hopes arsenal s thierry henry will be named world player of the year on monday.  henry is on the fifa shortlist with barcelona s ronaldinho and newly-crowned european footballer of the year  ac milan s andriy shevchenko. blatter said:  henry  for me  is the personality on the field. he is the man who can run and organise the game.  the winner of the accolade will be named at a glittering ceremony at zurich s opera house. the three shortlisted candidates for the women s award are mia hamm of the united states  germany s birgit prinz and brazilian youngster marta.  hamm  who recently retired - is looking to regain the women s award  which she lost last year to striker prinz. fifa has changed the panel of voters for this year s awards. male and female captains of every national team will be able to vote  as well as their coaches and fipro - the global organisation for professional players.'

In [98]:
np.set_printoptions(formatter={'float': lambda x: "{0:0.3f}".format(x)})
print(y_test[n],model.forward(X_test[n])) 

[0 0 0 1 0] [0.001 0.117 0.018 0.864 0.000]


In [46]:
model.forward(X_test[0])

array([9.99738797e-01, 4.12926975e-05, 3.19123457e-06, 4.18725669e-07,
       2.16300504e-04])
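
Spot checks on single examples can be complemented with an accuracy over the whole test split. The helper below is a hypothetical sketch (the names `accuracy`, `toy_predict`, `X_toy` and `y_toy` are not part of the notebook); it assumes a callable that maps one input matrix to a vector of class probabilities, and compares the arg-max prediction with the one-hot target.

```python
import numpy as np

def accuracy(predict, X, y):
    # predict: callable mapping one input matrix to class probabilities
    # X: sequence of input matrices; y: one-hot targets, shape (n_samples, n_classes)
    hits = 0
    for inputs, target in zip(X, y):
        hits += int(np.argmax(predict(inputs)) == np.argmax(target))
    return hits / len(y)

# Toy stand-in for a trained model: scores each class by its column mean
def toy_predict(inputs):
    return inputs.mean(axis=0)

X_toy = [np.array([[0.9, 0.1], [0.8, 0.2]]),
         np.array([[0.3, 0.7], [0.1, 0.9]])]
y_toy = np.array([[1, 0], [1, 0]])

toy_accuracy = accuracy(toy_predict, X_toy, y_toy)
print(toy_accuracy)
```

With the notebook's objects this would presumably be called as `accuracy(model.forward, X_test, y_test)`.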

In [116]:
import sys
sys.path.append('c:\\python312\\lib\\site-packages')
import pickle
import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm

def softmax(x, axis=-1):
    # Clip extreme values, then subtract the max for numerical stability
    x = np.clip(x, -1500, 1500)
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / np.sum(e_x, axis=axis, keepdims=True)


def cross_entropy_loss(predictions, target):
    return -np.sum(target * np.log(predictions + 1e-3))  # Adding a small constant to avoid log(0)


df=pd.read_csv("data/bbc-text.csv")
nlp = spacy.load('en_core_web_lg')
with open('InputProcessed.pkl', 'rb') as f:
    X = pickle.load(f)

y = np.array(pd.get_dummies(df["category"], dtype=int))
tts=0.85
X_train,X_test=X[0:round(tts*len(X))],X[round(tts*len(X)):]
y_train,y_test=y[0:round(tts*len(X))],y[round(tts*len(X)):]
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((1891,), (334,), (1891, 5), (334, 5))

In [123]:

Attention_Embedding_Dimension=64
word_embedding_dim=X_train[0].shape[1]

Q = np.random.rand(word_embedding_dim, Attention_Embedding_Dimension)/ np.sqrt(word_embedding_dim)
K = np.random.rand(word_embedding_dim, Attention_Embedding_Dimension)/ np.sqrt(word_embedding_dim)
V = np.random.rand(word_embedding_dim, Attention_Embedding_Dimension)/ np.sqrt(word_embedding_dim)

num_classes = 5  # Example number of classes

        
linearlayer = np.random.rand(Attention_Embedding_Dimension, num_classes)
linear_bias = np.random.rand(num_classes)
learning_rate=0.002
epochs=10
words_in_phrase=70

def forward(Xi):
    Qval = np.dot(Xi, Q)
    Kval = np.dot(Xi, K)
    Vval = np.dot(Xi, V)

    QKscaled = np.dot(Qval, Kval.T) / np.sqrt(K.shape[1])

    Attention_weights = softmax(QKscaled)

    Attention = np.dot(Attention_weights, Vval)

    phrase_representation = np.mean(Attention, axis=0)

    yi = softmax(np.dot(phrase_representation, linearlayer) + linear_bias)

    return yi,phrase_representation,linearlayer
    
def backpropagation(yi,yt,phrase_representation,linearlayer):
    # Gradient of loss with respect to output probabilities
    dL_yi = yi - yt

    # Gradient for linear layer and bias
    dL_dw = np.outer(phrase_representation, dL_yi)
    dL_bias = dL_yi

    # Gradient for phrase representation
    d_phrase_rep = np.dot(dL_yi, linearlayer.T)

    # Gradient for attention
    d_attention = np.outer(np.ones(Xi.shape[0]), d_phrase_rep) / Xi.shape[0]

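    # Gradient for V (Attention = Attention_weights @ Vval, so the weights are transposed)
    d_Vval = np.dot(Attention_weights.T, d_attention)
    d_V = np.dot(Xi.T, d_Vval)

    # Gradient for attention weights
    d_sigma_attention = np.dot(d_attention, Vval.T)

    # Gradient for QK scaled (full row-wise softmax Jacobian, not only its diagonal)
    row_dot = np.sum(d_sigma_attention * Attention_weights, axis=-1, keepdims=True)
    d_QKscaled = Attention_weights * (d_sigma_attention - row_dot)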

    # Gradient for Q and K
    d_Qval = np.dot(d_QKscaled, Kval) / np.sqrt(K.shape[1])
    d_Kval = np.dot(d_QKscaled.T, Qval) / np.sqrt(K.shape[1])
    d_Q = np.dot(Xi.T, d_Qval)
    d_K = np.dot(Xi.T, d_Kval)

    # Update weights
    Q -= learning_rate * d_Q
    K -= learning_rate * d_K
    V -= learning_rate * d_V
    linearlayer -= learning_rate * dL_dw
    linear_bias -= learning_rate * dL_bias
    
for epoch in range(epochs):
    total_loss = 0
    for i in tqdm(range(len(y_train)), desc="Processing matrices"):
        Xi = X_train[i]
        yt = y_train[i]
       
       
        yi,phrase_representation,linearlayer=forward(Xi)
        
        Loss = cross_entropy_loss(yi, yt)
        
        total_loss += Loss
         
        backpropagation(yi,yt,phrase_representation,linearlayer)
        
    avg_loss = total_loss / len(y_train)
    print(f"Epoch {epoch + 1}/{epochs}, Average Loss: {avg_loss:.5f}")



Processing matrices:   0%|                                                                                                        | 0/1891 [00:00<?, ?it/s]


UnboundLocalError: cannot access local variable 'K' where it is not associated with a value
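
The `UnboundLocalError` above follows from Python's scoping rules: an augmented assignment such as `K -= ...` inside a function makes `K` a local name for the whole function body, so the earlier read `K.shape[1]` fails. The same function also reads `Attention_weights`, `Vval`, `Kval` and `Qval`, which are local to `forward` and would need to be returned from it rather than picked up from earlier cells. A minimal sketch of the scoping issue and one possible way around it, assuming the weights are kept in a dictionary (the names `weights`, `broken_update`, `params` and `sgd_step` are illustrative):

```python
import numpy as np

weights = np.ones((2, 2))

def broken_update(grad, learning_rate=0.1):
    # `weights -= ...` makes `weights` local to this function, so reading it fails
    weights -= learning_rate * grad

try:
    broken_update(np.ones((2, 2)))
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__
print(error_name)

# Alternative: keep the parameters in a dict and update its entries
params = {"K": np.ones((2, 2))}

def sgd_step(params, grads, learning_rate=0.1):
    # Item assignment updates the dict that was passed in; no outer name is rebound
    for name in params:
        params[name] -= learning_rate * grads[name]

sgd_step(params, {"K": np.ones((2, 2))})
print(params["K"])
```

Declaring the names with `global` inside the function would also work; passing the parameters explicitly keeps the data flow easier to follow.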

In [None]:
np.outer(np.ones(5), np.array([[0.1,0.1],[0.1,0.1],[0.1,0.1]])) / 3

In [198]:
np.ones(5)

array([1., 1., 1., 1., 1.])