## Attention mechanism

In a basic RNN, each recurrent neuron receives the outputs of all neurons in the layer from the previous time step, together with the input from the current time step. This feedback loop is what the term 'recurrent' refers to.
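As a minimal sketch of this recurrence, one step of a basic RNN can be written in NumPy as follows. The weight names `W_x`, `W_h`, `b` and the toy sizes are illustrative assumptions and are not taken from the models used later in this notebook.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a basic RNN: mix the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

rng = np.random.default_rng(0)
toy_input_dim, toy_hidden, toy_steps = 4, 3, 5   # toy sizes, for illustration only
W_x = rng.normal(scale=0.5, size=(toy_input_dim, toy_hidden))
W_h = rng.normal(scale=0.5, size=(toy_hidden, toy_hidden))
b = np.zeros(toy_hidden)

toy_sequence = rng.normal(size=(toy_steps, toy_input_dim))
h = np.zeros(toy_hidden)                         # initial hidden state
for x_t in toy_sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)            # the same weights are reused at every step

print(h.shape)
```

The final `h` depends on every element of the sequence, but only through the chain of hidden states, which is why information from early steps can fade.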

In [22]:
### This cell should be hidden in the final version

import tensorflow as tf
import numpy as np
import ipywidgets as widgets
from keras.utils import pad_sequences
from jupyterquiz import display_quiz
from sklearn.metrics import accuracy_score

git_path="https://raw.githubusercontent.com/ChaosTheLegend/ML-Book/main/Quizes/"

# Keep only the 10,000 most frequent words of the IMDB vocabulary
num_words = 10000
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=num_words)

# Pad or truncate every review to exactly max_len tokens
max_len = 200
x_train = pad_sequences(x_train, maxlen=max_len, truncating='post')
x_test = pad_sequences(x_test, maxlen=max_len, truncating='post')

embedding_dim = 100
hidden_dim = 256
output_dim = 1
dropout_rate = 0.5

In [24]:
### This cell should be hidden in the final version

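from keras.datasets import imdb
import ipywidgets as widgets

word_index = imdb.get_word_index()

def make_prediction(review, model, bar):
    # Normalise the text: lowercase, keep only letters, digits and spaces
    review = review.lower()
    review = ''.join([char for char in review if char.isalnum() or char == ' '])
    words = review.split()

    # load_data() shifts every word index by 3 (0 = padding, 1 = start, 2 = unknown),
    # so apply the same offset and map unseen or rare words to the unknown token
    encoded = [word_index.get(word, -1) + 3 for word in words]
    encoded = [[1] + [index if 3 <= index < num_words else 2 for index in encoded]]
    encoded = pad_sequences(encoded, maxlen=max_len, truncating='post')

    # Predicted probability that the review is positive
    score = float(model.predict(encoded, verbose=0)[0][0])

    # Label widgets hold text, progress bars hold a number
    bar.value = f'{score:.2f}' if isinstance(bar, widgets.Label) else score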

In [15]:
simpleRNN = tf.keras.models.load_model('simpleRNN.keras')

simpleRNN.summary()

Model: "sequential_16"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 200, 100)          1000000   
                                                                 
 simple_rnn_14 (SimpleRNN)   (None, 256)               91392     
                                                                 
 dense_15 (Dense)            (None, 1)                 257       
                                                                 
Total params: 1091649 (4.16 MB)
Trainable params: 1091649 (4.16 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [16]:
# make predictions and calculate accuracy

y_pred = simpleRNN.predict(x_test)
y_pred = np.round(y_pred)

accuracy = accuracy_score(y_test, y_pred)



In [13]:
accuracy

0.55552

## The Need for an Attention Mechanism

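Basic RNNs struggle with long sequences: the simple RNN above reaches an accuracy of only about 0.56 on reviews of 200 tokens.

Training for more epochs does not improve the accuracy much, because the model is unable to learn the long-term dependencies in the data.

The underlying cause is known as the vanishing gradient problem: as the error is propagated back through many time steps, the gradients shrink towards zero, so the earliest inputs have almost no effect on the weight updates.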
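The effect can be illustrated with a small NumPy sketch. It assumes a toy tanh RNN whose recurrent weights are rescaled so that their largest singular value is 0.5; that value, the sizes and the random inputs are assumptions chosen for illustration and do not come from the model above. The sketch tracks how strongly the hidden state at step t still depends on the initial hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_steps = 8, 50                       # toy sizes, for illustration only

# Recurrent weights rescaled so that their largest singular value is 0.5 (assumed value)
W_rec = rng.normal(size=(n_hidden, n_hidden))
W_rec *= 0.5 / np.linalg.norm(W_rec, 2)

state = np.zeros(n_hidden)
grad = np.eye(n_hidden)                         # accumulates d state_t / d state_0
grad_norms = []
for _ in range(n_steps):
    state = np.tanh(W_rec @ state + rng.normal(size=n_hidden))
    jacobian = np.diag(1.0 - state**2) @ W_rec  # d state_t / d state_(t-1) for a tanh RNN
    grad = jacobian @ grad                      # chain rule across time steps
    grad_norms.append(np.linalg.norm(grad, 2))

print(grad_norms[0], grad_norms[9], grad_norms[-1])
```

Under these assumptions the norm shrinks at least geometrically with the number of steps, so after 50 steps the first input has practically no influence on the gradient.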

In [None]:
# Insert graph showing accuracy vs epochs

In [20]:
display_quiz(git_path + "quiz3.json")


In [29]:
# Text field where the reader types a review
text_field = widgets.Text()
text_field.continuous_update = True

# Progress bar from 0 (negative) to 1 (positive) that shows the prediction
bar = widgets.FloatProgress(
    value=0,
    min=0,
    max=1.0,
    step=0.01,
    description='Prediction:',
    bar_style='info',
    orientation='horizontal'
)

# Re-run the model every time the text changes
text_field.observe(lambda change: make_prediction(change['new'], simpleRNN, bar), 'value')

display(text_field)
display(bar)

Text(value='')

FloatProgress(value=0.0, bar_style='info', description='Prediction:', max=1.0)


## Attention Mechanism

In [84]:
from keras.layers import Input, Embedding, LSTM, Dense, Attention, Bidirectional, Dropout
import os

# If a saved model file exists in the current directory, load it.
# Otherwise a new model is trained and saved in the cells below.
train_new_model = True

if os.path.exists('LSTM.keras'):
    # Note: from here on the name LSTM refers to the loaded model, not the layer class
    LSTM = tf.keras.models.load_model('LSTM.keras')
    train_new_model = False


In [None]:
if train_new_model:    
    inputs = Input(shape=(max_len,))
    embedding = Embedding(input_dim=10000, output_dim=embedding_dim, input_length=max_len)(inputs)
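    # Refer to the layer class explicitly, because the name LSTM is reused for the model
    lstm = Bidirectional(tf.keras.layers.LSTM(hidden_dim, return_sequences=True))(embedding)
    # Self-attention over the LSTM outputs (query and value are both lstm)
    attention = Attention()([lstm, lstm])
    # Weight the LSTM outputs by the attention output and sum over the time steps
    context = tf.reduce_sum(attention * lstm, axis=1)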
    dropout = Dropout(dropout_rate)(context)
    output = Dense(output_dim, activation='sigmoid')(dropout)


    LSTM = tf.keras.Model(inputs=inputs, outputs=output)

    LSTM.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [62]:
LSTM.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_1 (InputLayer)        [(None, 200)]                0         []                            
                                                                                                  
 embedding_5 (Embedding)     (None, 200, 100)             1000000   ['input_1[0][0]']             
                                                                                                  
 bidirectional (Bidirection  (None, 200, 512)             731136    ['embedding_5[0][0]']         
 al)                                                                                              
                                                                                                  
 attention (Attention)       (None, 200, 512)             0         ['bidirectional[0][0]',   

In [64]:
if train_new_model:
    LSTM.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x19c27f14ed0>

In [66]:

LSTM.save('LSTM.keras')

In [86]:
y_pred = LSTM.predict(x_test)

y_pred = np.round(y_pred)

accuracy = accuracy_score(y_test, y_pred)

accuracy



0.82268

In [101]:
# Text field for a review and a label that shows the prediction
label = widgets.Label()

review2 = widgets.Text()

# Only update when the reader presses Enter or leaves the field
review2.continuous_update = False

review2.observe(lambda change: make_prediction(change['new'], LSTM, label), 'value')

display(review2)

display(label)


Text(value='', continuous_update=False)

Label(value='')

## Alignment Scores in RNN Attention Mechanism

The attention mechanism in a recurrent neural network (RNN) uses alignment scores to determine how much focus to place on each input in a sequence.

In sequence-to-sequence models, for example, an alignment score is computed for each pair of input and output positions. If "a" is the decoder’s hidden state at an output position and "b" is the encoder’s hidden state at an input position, the alignment score function often takes the following dot-product form:

$$
\text{score}(a, b) = a^Tb
$$

This score indicates how well the input at the position of "b" matches the output at the position of "a". The alignment scores for all input positions are collected into a single vector and normalized to sum to 1, which gives the attention weights. These weights determine how much 'attention' the model gives to each input time step while producing an output.

Higher alignment scores mean that the decoder pays more attention to those parts of the encoder's output.
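A minimal NumPy sketch of the dot-product score, assuming random toy vectors in place of real hidden states (the names `encoder_states` and `decoder_state` are illustrative and do not appear in the models above):

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))   # six input positions "b", each a 4-dimensional state
decoder_state = rng.normal(size=4)         # one decoder hidden state "a"

# score(a, b) = a^T b, computed for every encoder state at once
alignment = encoder_states @ decoder_state

best_position = int(np.argmax(alignment))  # input position that matches "a" best
print(alignment.shape, best_position)
```

Each entry of `alignment` is a single dot product, so the whole vector costs one matrix-vector multiplication per output position.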

Overall, the attention mechanism improves the accuracy of the RNN model when handling tasks with long input sequences, by enabling it to focus on the most relevant parts of the input to produce a given output.

## From Alignment Scores to Attention Weights

After computing the alignment scores between the input and output vectors in the attention mechanism, these scores are then converted to attention weights using the softmax function.

The softmax function is commonly used in neural networks to turn scores into probabilities. It is defined as:

$$
\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}
$$

By applying softmax, we ensure that each weight falls in the range [0, 1] and that they all sum to 1, which allows us to interpret the attention weights as probabilities.
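The definition above can be sketched in NumPy as follows. The scores are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability step that leaves the result unchanged.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

example_scores = np.array([2.0, 1.0, 0.1, -1.0])      # made-up alignment scores
example_weights = softmax(example_scores)

print(example_weights, example_weights.sum())
```

The largest score receives the largest weight, and the weights always add up to 1 regardless of the scale of the scores.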

In attention mechanisms, the softmax function is applied to the alignment scores of all input positions for a given output position. Each input element in the sequence therefore receives an attention weight, which tells the model how much attention to pay to that element when producing the output.

For a given output, positions in the input sequence with a higher attention weight have a greater influence on the computation. This means the decoder will "pay more attention" to these positions when generating that output.
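To make this influence concrete, the sketch below combines the steps for one output position: dot-product scores, softmax, and a weighted sum of the encoder states (often called the context vector). The vectors are random toy data and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
enc_states = rng.normal(size=(5, 3))      # five input positions, 3-dimensional states (toy sizes)
dec_state = rng.normal(size=3)            # decoder state for one output position

raw_scores = enc_states @ dec_state                   # dot-product alignment scores
attn = np.exp(raw_scores - raw_scores.max())
attn /= attn.sum()                                    # softmax gives the attention weights
context_vector = attn @ enc_states                    # weighted sum of the encoder states

print(attn.round(3), context_vector.shape)
```

Positions with a larger weight contribute more of their state to `context_vector`, which is the sense in which the decoder attends to them.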