# Why are Attention Mechanisms Important?
## Attention mechanisms have become indispensable in various deep-learning applications due to their ability to address some critical challenges:

* Long Sequences: Traditional neural networks struggle with processing long sequences, such as translating a paragraph from one language to another. Attention mechanisms allow models to focus on the relevant parts of the input, making them more effective at handling lengthy data.
* Contextual Understanding: In tasks like language translation, understanding the context of a word is crucial for accurate translation. Attention mechanisms enable models to consider the context by assigning different attention weights to each word in the input sequence.
* Improved Performance: Models equipped with attention mechanisms often outperform their non-attention counterparts. They achieve state-of-the-art results in tasks like machine translation, image classification, and speech recognition.

# The name "transformer" for models using attention layers comes from the influential paper "Attention Is All You Need" by Vaswani et al. (2017). The term "transformer" reflects the model's architecture and its ability to "transform" sequences of data through a series of attention-based layers.

## Here are key reasons why the name "transformer" was chosen:

* Attention Mechanism: The core innovation of the transformer model is the self-attention mechanism. This allows the model to weigh the importance of different words in a sentence, regardless of their position, enabling it to capture long-range dependencies and relationships more effectively than previous models.

* Sequence Transformation: The model transforms input sequences into output sequences. In the context of natural language processing (NLP), this means transforming an input sentence into an output sentence, such as translating from one language to another.

* Layered Architecture: Transformers are composed of multiple layers of self-attention and feed-forward neural networks. Each layer processes the input and transforms it into a more abstract representation, progressively refining the output through successive transformations.

* Scalability: The architecture is highly scalable, allowing for the use of large amounts of data and computational resources to improve performance. This transformation of data at scale is a key feature of the model.

* Shift from RNNs/CNNs: Prior to transformers, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) were commonly used for sequence modeling. The transformer model represents a significant shift in approach, using attention mechanisms without recurrence or convolution, thus transforming the field of deep learning.

In summary, the name "transformer" encapsulates the model's ability to apply attention mechanisms to transform input sequences into more useful representations, enabling powerful and flexible modeling of sequential data.

# Example in Machine Translation
* https://medium.com/@zhonghong9998/attention-mechanisms-in-deep-learning-enhancing-model-performance-32a91006092a
* https://www.scaler.com/topics/deep-learning/attention-mechanism-deep-learning/

## Context Size
### Context size is a huge bottleneck for language models because its o(n^2), K(n)*Q(n) so these are some approches for tackling this limit
* Sparse Attention Mechanism
* Blockwise Attention
* Linformer
* Reformer
* Ring attention
* Longformer
* Adaptive Attention Spam

# Attention(Q, K, V ) = softmax(Q*KT/√dk)*V
## Multi-Head computes multiple of this table, so each head goes for finding a pattern in context

# Attention from scratch

In [1]:
from numpy import array
from numpy import random
from numpy import dot
from scipy.special import softmax

In [6]:
# encoder representations of four different words
word_1 = array([1, 0, 0])
word_2 = array([0, 1, 0])
word_3 = array([1, 1, 0])
word_4 = array([0, 0, 1])

In [7]:
# generating the weight matrices
random.seed(42) # to allow us to reproduce the same attention values
W_Q = random.randint(3, size=(3, 3))
W_K = random.randint(3, size=(3, 3))
W_V = random.randint(3, size=(3, 3))

In [18]:
words = array([word_1, word_2, word_3, word_4])

In [19]:
Q = words @ W_Q
K = words @ W_K
V = words @ W_V
 
# scoring the query vectors against all key vectors
scores = Q @ K.transpose()

In [20]:
scores

array([[ 8,  2, 10,  2],
       [ 4,  0,  4,  0],
       [12,  2, 14,  2],
       [10,  4, 14,  3]])

In [22]:
weights = softmax(scores / K.shape[1] ** 0.5, axis=1)

In [23]:
attention = weights @ V

In [24]:
attention

array([[0.98522025, 1.74174051, 0.75652026],
       [0.90965265, 1.40965265, 0.5       ],
       [0.99851226, 1.75849334, 0.75998108],
       [0.99560386, 1.90407309, 0.90846923]])

In [25]:
newEmbbeding = attention + words

array([[1.98522025, 1.74174051, 0.75652026],
       [0.90965265, 2.40965265, 0.5       ],
       [1.99851226, 2.75849334, 0.75998108],
       [0.99560386, 1.90407309, 1.90846923]])

# Attention with Tensorflow

In [28]:
import tensorflow as tf
import keras

In [44]:

# Custom Vars
TOP_WORDS = 5000
EMBEDDING_LEN = 32
ADD_ATTENTION = True
MAX_INPUT_LEN = 200

# Preprocess the data
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.imdb.load_data(num_words=TOP_WORDS)
train_x = tf.keras.preprocessing.sequence.pad_sequences(train_x, maxlen=MAX_INPUT_LEN)
test_x = tf.keras.preprocessing.sequence.pad_sequences(test_x, maxlen=MAX_INPUT_LEN)

In [52]:
# Define the model
inputs = keras.layers.Input(shape=(MAX_INPUT_LEN,))
embedding = keras.layers.Embedding(TOP_WORDS, EMBEDDING_LEN, input_length=MAX_INPUT_LEN)(inputs)
x = keras.layers.Dropout(0.5)(embedding)

if ADD_ATTENTION:
    # lstm_out = keras.layers.LSTM(100, return_sequences=True)(x)
    query_value_attention_seq = keras.layers.Attention()([x, x]) # using same key as query casue we are using self-attention
    x = keras.layers.GlobalAveragePooling1D()(query_value_attention_seq)
else:
    x = keras.layers.LSTM(100)(x)
    x = keras.layers.Dense(350, activation='relu')(x)

x = keras.layers.Dropout(0.5)(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(inputs, outputs)

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(
    train_x, train_y,
    verbose=2,
    validation_data=(test_x, test_y),
    epochs=10,
    batch_size=64
)

Epoch 1/10
391/391 - 20s - 50ms/step - accuracy: 0.6713 - loss: 0.6404 - val_accuracy: 0.7602 - val_loss: 0.5315
Epoch 2/10
391/391 - 5s - 14ms/step - accuracy: 0.8074 - loss: 0.4582 - val_accuracy: 0.8425 - val_loss: 0.3943
Epoch 3/10
391/391 - 5s - 14ms/step - accuracy: 0.8465 - loss: 0.3761 - val_accuracy: 0.8551 - val_loss: 0.3474
Epoch 4/10
391/391 - 6s - 14ms/step - accuracy: 0.8610 - loss: 0.3403 - val_accuracy: 0.8692 - val_loss: 0.3228
Epoch 5/10
391/391 - 3s - 7ms/step - accuracy: 0.8717 - loss: 0.3140 - val_accuracy: 0.8734 - val_loss: 0.3099
Epoch 6/10
391/391 - 3s - 7ms/step - accuracy: 0.8818 - loss: 0.2967 - val_accuracy: 0.8758 - val_loss: 0.3006
Epoch 7/10
391/391 - 3s - 7ms/step - accuracy: 0.8878 - loss: 0.2838 - val_accuracy: 0.8780 - val_loss: 0.2957
Epoch 8/10
391/391 - 3s - 7ms/step - accuracy: 0.8927 - loss: 0.2729 - val_accuracy: 0.8735 - val_loss: 0.3001
Epoch 9/10
391/391 - 3s - 7ms/step - accuracy: 0.8950 - loss: 0.2650 - val_accuracy: 0.8747 - val_loss: 0.2

<keras.src.callbacks.history.History at 0x1d730fa66c0>

In [54]:
model.save('LSTM_TEXT_MAKE.keras')

In [43]:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(TOP_WORDS, EMBEDDING_LEN, input_length=MAX_INPUT_LEN),
    tf.keras.layers.Dropout(0.5),
    *([tf.keras.layers.LSTM(100, return_sequences=True), keras.layers.Attention()] if ADD_ATTENTION
        else [tf.keras.layers.LSTM(100), tf.keras.layers.Dense(350, activation='relu')]),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(
    train_x, train_y,
    verbose=2,
    validation_data=(test_x, test_y),
    epochs=10,
    batch_size=64
)


Epoch 1/10




391/391 - 49s - 126ms/step - accuracy: 0.7513 - loss: 0.4848 - val_accuracy: 0.8570 - val_loss: 0.3439
Epoch 2/10
391/391 - 34s - 87ms/step - accuracy: 0.8758 - loss: 0.3086 - val_accuracy: 0.8692 - val_loss: 0.3091
Epoch 3/10
391/391 - 61s - 156ms/step - accuracy: 0.8916 - loss: 0.2709 - val_accuracy: 0.8738 - val_loss: 0.2989
Epoch 4/10
391/391 - 62s - 159ms/step - accuracy: 0.9042 - loss: 0.2425 - val_accuracy: 0.8678 - val_loss: 0.3278
Epoch 5/10
391/391 - 38s - 96ms/step - accuracy: 0.9110 - loss: 0.2285 - val_accuracy: 0.8709 - val_loss: 0.3089
Epoch 6/10
391/391 - 28s - 71ms/step - accuracy: 0.9204 - loss: 0.2042 - val_accuracy: 0.8634 - val_loss: 0.3233
Epoch 7/10
391/391 - 27s - 70ms/step - accuracy: 0.9266 - loss: 0.1889 - val_accuracy: 0.8684 - val_loss: 0.3595
Epoch 8/10
391/391 - 28s - 71ms/step - accuracy: 0.9315 - loss: 0.1742 - val_accuracy: 0.8658 - val_loss: 0.3826
Epoch 9/10
391/391 - 27s - 69ms/step - accuracy: 0.9318 - loss: 0.1729 - val_accuracy: 0.8642 - val_loss

<keras.src.callbacks.history.History at 0x1d72e08e6f0>

### Attention is much faster and a bit more acurate