# Comparative Analysis of Recurrent and Convolutional Models for Sequential Data Processing

This notebook, made as an assignment for the ENSI deep learning class, compares the performance of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and the Long Short-Term Memory (LSTM) variant of RNNs when applied to sequential data.

Both RNNs, including LSTMs, and CNNs have become popular choices for modeling sequential data due to their unique strengths in handling temporal dependencies and efficient feature extraction, respectively.

importing libraries:

In [25]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, Activation
from tensorflow.keras.layers import Bidirectional, Input, Layer
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dropout, Dense, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import concurrent.futures
from tensorflow.keras.metrics import Precision, Recall, AUC

Training locally with TensorFlow on GPU, using an RTX 3050 Ti

In [2]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("GPU is available.")
    for gpu in gpus:
        print(f"GPU name: {gpu.name}")
else:
    print("GPU is not available.")

GPU is available.
GPU name: /physical_device:GPU:0


In [3]:
!nvidia-smi

Mon Nov 11 14:58:38 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   55C    P0             13W /   60W |      10MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

### Preprocessing text files

In [4]:
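def load(path):
    """Read a file where every line looks like '__label__<n> <text>'."""
    texts = []
    labels = []
    # a separate name for the file handle avoids shadowing the path argument
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            label, text = line.split(" ", 1)
            # the label is kept exactly as stored in the file (the integer after '__label__')
            labels.append(int(label.split('__label__')[1]))
            texts.append(text.strip())
    return texts, np.array(labels)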

Considering the large size of the files, we load the training and test files concurrently, each in its own thread.

In [5]:
with concurrent.futures.ThreadPoolExecutor() as executor:
    train_future = executor.submit(load, 'train.ft.txt')
    test_future = executor.submit(load, 'test.ft.txt')
    
    train_texts, train_labels = train_future.result()
    test_texts, test_labels = test_future.result()

Tokenization

In [6]:
portion_of_texts = train_texts[:100000]

tokenizer = Tokenizer(num_words=10000)  # limit vocab to 10k
tokenizer.fit_on_texts(portion_of_texts)

In [7]:
X_train = tokenizer.texts_to_sequences(train_texts)


In [8]:
X_test = tokenizer.texts_to_sequences(test_texts)

Padding (or truncating) every sequence to a length of 100 tokens

In [9]:
X_train = pad_sequences(X_train, maxlen=100)
X_test = pad_sequences(X_test, maxlen=100)
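
`pad_sequences` is called here with its default settings, which pad and truncate at the front of each sequence (`padding='pre'`, `truncating='pre'`). The following is a minimal pure-Python sketch of that behaviour for a single sequence; `pad_pre` is an illustrative helper, not a Keras function, and it assumes `maxlen` is greater than zero.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

short = pad_pre([7, 8, 9], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 7, 8, 9]
print(long_)  # [3, 4, 5, 6, 7]
```

With these defaults, a review longer than 100 tokens keeps its last 100 tokens, not its first 100.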

Converting labels to np arrays

In [10]:
y_train = np.array(train_labels)
y_test = np.array(test_labels)
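
The loader keeps each label exactly as written after `__label__`, so the label arrays presumably hold the values 1 and 2, whereas a sigmoid output trained with binary cross-entropy expects targets of 0 and 1. Below is a minimal sketch of the remapping on made-up lines; `parse_label` and the sample reviews are illustrative and not part of the notebook.

```python
# Hypothetical sample lines in the same '__label__<n> <text>' layout as the data files.
sample_lines = [
    "__label__2 Great product, works as described\n",
    "__label__1 Stopped working after a week\n",
]

def parse_label(line):
    """Return (label, text), with the label shifted from {1, 2} to {0, 1}."""
    label, text = line.split(" ", 1)
    raw = int(label.split("__label__")[1])   # 1 or 2, as stored in the file
    return raw - 1, text.strip()             # 0 or 1, usable with a sigmoid output

parsed = [parse_label(line) for line in sample_lines]
labels = [label for label, _ in parsed]
print(labels)  # [1, 0]
```

On the NumPy arrays built above, the same shift would be `y_train - 1` and `y_test - 1`.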

# Convolutional Neural Network (CNN) with 100 token length

In [36]:
CNNmodel = Sequential()

First convolutional layer. Conv1D is used because of the sequential nature of the data (text).
Batch normalization follows the first conv layer to stabilize training; it also acts as a mild regularizer.

In [37]:
CNNmodel.add(Conv1D(64, 5, activation='relu', input_shape=(X_train.shape[1], 1)))
CNNmodel.add(BatchNormalization()) 
#maxPooling
CNNmodel.add(MaxPooling1D(2)) 



second convolutional layer 

In [38]:
CNNmodel.add(Conv1D(128, 5, activation='relu'))
CNNmodel.add(BatchNormalization())  
CNNmodel.add(MaxPooling1D(2))

third convolutional layer

In [39]:
CNNmodel.add(Conv1D(256, 3, activation='relu'))
CNNmodel.add(BatchNormalization())  # Batch normalization after the third conv layer
CNNmodel.add(MaxPooling1D(2))

global max pooling to collapse the sequence axis, keeping the maximum of each feature map (dimensionality reduction)

In [40]:
CNNmodel.add(GlobalMaxPooling1D())
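
To see how much sequence is left when the global pooling is applied, the output lengths of the three conv/pool stages can be traced by hand. The sketch below assumes the Keras defaults used above: stride 1 and `'valid'` padding for `Conv1D`, and a stride equal to the pool size for `MaxPooling1D`. The helper names are illustrative.

```python
def conv1d_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size):
    """Output length of MaxPooling1D with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1

length = 100                      # padded token length
trace = [length]
for kernel_size in (5, 5, 3):     # kernel sizes of the three Conv1D layers
    length = conv1d_len(length, kernel_size)
    trace.append(length)
    length = maxpool1d_len(length, 2)
    trace.append(length)

print(trace)  # [100, 96, 48, 44, 22, 20, 10]
```

The global max pooling therefore reduces a 10-step sequence of 256 features to a single 256-dimensional vector.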

dropout: randomly zeroing half of the activations at each training forward pass, for regularization (to limit overfitting)

In [41]:
CNNmodel.add(Dropout(0.5)) 

first fully connected dense layer with batch normalization;
this layer and the following ones are responsible for classification, using the features and patterns extracted by the convolutional layers

In [42]:
CNNmodel.add(Dense(64, activation='relu')) 
CNNmodel.add(BatchNormalization()) 

second fully connected dense layer

In [43]:
CNNmodel.add(Dense(32, activation='relu'))
CNNmodel.add(BatchNormalization())

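output layer of dimension 1: a single sigmoid unit that outputs the probability of the positive class. The dataset labels are 1 and 2, so they have to be mapped to 0 and 1 for this output and the binary cross-entropy loss to be meaningful.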

In [44]:
CNNmodel.add(Dense(1, activation='sigmoid'))

model architecture + number of parameters

In [45]:
CNNmodel.summary()

compiling + training

In [46]:
CNNmodel.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [47]:
CNNmodel.fit(X_train, y_train, epochs=10, batch_size=16, validation_data=(X_test, y_test))


Epoch 1/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 325s 1ms/step - accuracy: 0.4989 - loss: -97668.0000 - val_accuracy: 0.5000 - val_loss: -752095.0625
Epoch 2/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 296s 1ms/step - accuracy: 0.4991 - loss: -1369769.8750 - val_accuracy: 0.4999 - val_loss: -3638122.5000
Epoch 3/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 301s 1ms/step - accuracy: 0.4986 - loss: -4328500.0000 - val_accuracy: 0.5000 - val_loss: -6995024.5000
Epoch 4/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 300s 1ms/step - accuracy: 0.4991 - loss: -9000041.0000 - val_accuracy: 0.5000 - val_loss: -15811513.0000
Epoch 5/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 277s 1ms/step - accuracy: 0.4984 - loss: -15413522.0000 - val_accuracy: 0.4994 - val_loss: -20603110.0000
Epoch 6/10
225000/225000 ━━━━━━━━━━━━━━━━━━━━ 273s

<keras.src.callbacks.history.History at 0x7a1413392a50>

Evaluation

In [48]:
loss, accuracy = CNNmodel.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

12500/12500 ━━━━━━━━━━━━━━━━━━━━ 12s 882us/step - accuracy: 0.4951 - loss: -87694512.0000
Test Accuracy: 49.89%


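Here, we observe chance-level training and testing accuracy (about 50%) for the CNN model, even with the use of Conv1D layers.

One possible reading is that the model struggles to capture sequential dependencies: CNNs generally excel at extracting local patterns but can be limited in handling long-term dependencies in sequential data. The negative, steadily diverging loss points to a more basic problem, however. Binary cross-entropy cannot be negative when the targets lie in [0, 1], so the labels, which are parsed as 1 and 2, are most likely reaching the loss unmapped. The network also receives raw token ids, with no Embedding layer in front of the convolutions. Both issues should be ruled out before this result is attributed to the architecture.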
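
The negative loss values in the log above can be reproduced by hand. The sketch below evaluates the binary cross-entropy formula for a single example with targets 0, 1 and 2; it is a plain-Python illustration of the formula, not the Keras implementation (which also clips the prediction).

```python
import math

def binary_crossentropy(y_true, p):
    """Binary cross-entropy for one example, with prediction p in (0, 1)."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Valid targets (0 and 1) always give a non-negative loss.
loss_y0 = binary_crossentropy(0, 0.9)
loss_y1 = binary_crossentropy(1, 0.9)

# A target of 2 gives a negative loss that keeps decreasing as p approaches 1.
loss_y2 = binary_crossentropy(2, 0.9)
loss_y2_confident = binary_crossentropy(2, 0.999)

print(loss_y0, loss_y1, loss_y2, loss_y2_confident)
```

With a target of 2 the optimizer can lower the loss without bound by pushing the prediction towards 1, which matches a loss that grows more negative every epoch while accuracy stays at chance level.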

# Recurrent neural network with 100 token length

In [146]:
RNNmodel = Sequential()

The Embedding layer is used to learn dense vector representations of words, mapping each token to a continuous vector space where similar words are closer together. 


This layer allows the model to automatically learn meaningful word representations during training, capturing semantic relationships and patterns in the data 
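
Mechanically, an embedding layer is a lookup table with one trainable row per token id. The sketch below uses made-up sizes and random initial values; `table` and `embed` are illustrative names and no Keras code is involved.

```python
import random

random.seed(0)
vocab_size, emb_dim = 6, 3   # tiny, made-up sizes (the notebook uses 10000 and 128)

# One row per token id; training would adjust these values.
table = [[random.uniform(-0.05, 0.05) for _ in range(emb_dim)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Replace every token id by its row in the lookup table."""
    return [table[i] for i in token_ids]

sequence = [0, 0, 4, 2, 4]            # a padded sequence of token ids
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))  # 5 3
```

The same token id always maps to the same vector, so a sequence of 100 ids becomes a 100 x 128 matrix in the models below.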

In [147]:
RNNmodel.add(Embedding(input_dim=10000, output_dim=128, input_length=100))

first recurrent layer

In [148]:
RNNmodel.add(SimpleRNN(64, return_sequences=True, activation='relu'))  # RNN layer with return_sequences=True


second recurrent layer

In [149]:
RNNmodel.add(SimpleRNN(32, return_sequences=True, activation='relu'))  # Keep return_sequences=True

global 1D max pooling to reduce sequence length to single vector

In [150]:
RNNmodel.add(GlobalMaxPooling1D())


dropout for regularization/avoiding overfitting

In [151]:
RNNmodel.add(Dropout(0.3))

fully connected layers with batch normalization

In [152]:
RNNmodel.add(Dense(64, activation='relu'))
RNNmodel.add(BatchNormalization())

RNNmodel.add(Dense(32, activation='relu'))
RNNmodel.add(BatchNormalization())

output layer for binary classification: a single sigmoid unit (label 1 or label 2 in the dataset)

In [153]:
RNNmodel.add(Dense(1, activation='sigmoid'))

compile + train

In [154]:
RNNmodel.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC()]
)

In [155]:
RNNmodel.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))


Epoch 1/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 28s 2s/step - accuracy: 0.5157 - auc_8: 0.5038 - loss: 0.8301 - precision_8: 0.4764 - recall_8: 0.4858 - val_accuracy: 0.0000e+00 - val_auc_8: 0.0000e+00 - val_loss: 0.7707 - val_precision_8: 0.0000e+00 - val_recall_8: 0.0000e+00
Epoch 2/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 22s 1s/step - accuracy: 0.6241 - auc_8: 0.6561 - loss: 0.6615 - precision_8: 0.5815 - recall_8: 0.6166 - val_accuracy: 0.0000e+00 - val_auc_8: 0.0000e+00 - val_loss: 0.8347 - val_precision_8: 0.0000e+00 - val_recall_8: 0.0000e+00
Epoch 3/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 22s 1s/step - accuracy: 0.6865 - auc_8: 0.7491 - loss: 0.5943 - precision_8: 0.6729 - recall_8: 0.6614 - val_accuracy: 0.0000e+00 - val_auc_8: 0.0000e+00 - val_loss: 0.8944 - val_precision_8: 0.0000e+00 - val_recall_8: 0.0000e+00
Epoch 4/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 21s 1s/step - accuracy

<keras.src.callbacks.history.History at 0x7e3b8c4c0e00>

test evaluation

In [156]:
metrics = RNNmodel.evaluate(X_test, y_test)

print(f"Test Loss: {metrics[0]:.4f}")
print(f"Test Accuracy: {metrics[1] * 100:.2f}%")
print(f"Test Precision: {metrics[2]:.4f}")
print(f"Test Recall: {metrics[3]:.4f}")
print(f"Test AUC: {metrics[4]:.4f}")

12500/12500 ━━━━━━━━━━━━━━━━━━━━ 41s 3ms/step - accuracy: 0.0032 - auc_8: 0.0000e+00 - loss: 1.1248 - precision_8: 0.9999 - recall_8: 0.0064
Test Loss: 1.1234
Test Accuracy: 0.33%
Test Precision: 1.0000
Test Recall: 0.0065
Test AUC: 0.0000


train evaluation

In [157]:
metrics = RNNmodel.evaluate(X_train, y_train)

print(f"Train Loss: {metrics[0]:.4f}")
print(f"Train Accuracy: {metrics[1] * 100:.2f}%")
print(f"Train Precision: {metrics[2]:.4f}")
print(f"Train Recall: {metrics[3]:.4f}")
print(f"Train AUC: {metrics[4]:.4f}")

32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.9816 - auc_8: 1.0000 - loss: 0.3750 - precision_8: 1.0000 - recall_8: 0.9582
Train Loss: 0.3809
Train Accuracy: 97.90%
Train Precision: 1.0000
Train Recall: 0.9541
Train AUC: 1.0000


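Despite the severe overfitting observed here (97.90% training accuracy against 0.33% test accuracy), the RNN at least learns to fit its training data, which the CNN above did not. This is in line with the expected advantage of recurrence over convolution on sequential data such as text. The test metrics are degenerate, though (precision of 1.0000, AUC of 0, accuracy close to half the recall), which is the pattern to expect if the test labels are still encoded as 1 and 2, so these numbers cannot confirm that advantage on their own.


Recurrent models, by maintaining a memory of previous time steps, are designed to capture the dependencies, causality and contextual nuances in sequences. This enables them to model the semantic relationships in text, where the order and context of words significantly impact the meaning. In contrast, convolutional models, while effective in certain tasks, tend to struggle with such long-term dependencies, which can limit their performance on sequential data.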
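
The test metrics reported above (precision of 1, accuracy at roughly half the recall) can be reproduced on toy data under one assumption, which is not verified against the Keras implementation here: precision and recall treat any non-zero label as positive, while accuracy compares the thresholded prediction with the raw label. The labels and probabilities below are made up.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall when any non-zero label counts as positive."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t != 0 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t != 0 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Balanced toy labels encoded as 1 and 2, with arbitrary predicted probabilities.
y_true = [1, 2, 1, 2, 1, 2, 1, 2]
y_prob = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.4, 0.6]

accuracy, precision, recall = binary_metrics(y_true, y_prob)
print(accuracy, precision, recall)  # 0.25 1.0 0.5
```

Under that assumption there are no negative examples at all, so precision is always 1 and accuracy can never exceed the share of label-1 examples, whatever the model has learned.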

# Recurrent neural network with 50 token length

In [188]:
RNN50_model=Sequential()

This model will follow the same architecture as the previous RNN but with 50 tokens as the input length in the embedding layer.

Embedding layer to learn dense vector representations of tokens.

In [189]:
RNN50_model.add(Embedding(input_dim=10000, output_dim=128, input_length=50))
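
Setting `input_length=50` in the embedding layer does not shorten the data itself, and the notebook does not show how the 50-token inputs were produced. One possible way, sketched below in plain Python, is to keep the last 50 positions of the sequences already padded to 100: with front padding and front truncation this gives the same result as padding the original sequences to 50 directly. `pad_pre` is an illustrative helper, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# Three made-up sequences: shorter than 50, between 50 and 100, longer than 100.
sequences = [list(range(1, n + 1)) for n in (10, 70, 130)]

padded_100 = [pad_pre(s, 100) for s in sequences]
shortened = [row[-50:] for row in padded_100]   # keep the last 50 positions
direct = [pad_pre(s, 50) for s in sequences]    # pad the originals to 50

print(shortened == direct)  # True
```

On the NumPy arrays used in this notebook, the equivalent slice would be `X_train[:, -50:]` and `X_test[:, -50:]`.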

In [None]:
RNN50_model.add(SimpleRNN(64, return_sequences=True, activation='relu'))  

In [None]:
RNN50_model.add(SimpleRNN(32, return_sequences=True, activation='relu'))  

In [192]:
RNN50_model.add(GlobalMaxPooling1D())


In [193]:
RNN50_model.add(Dropout(0.3))

In [194]:
RNN50_model.add(Dense(64, activation='relu'))
RNN50_model.add(BatchNormalization())

RNN50_model.add(Dense(32, activation='relu'))
RNN50_model.add(BatchNormalization())

In [195]:
RNN50_model.add(Dense(1, activation='sigmoid'))

In [196]:
RNN50_model.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC()]
)

In [197]:
RNN50_model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test))


Epoch 1/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 27s 2s/step - accuracy: 0.5002 - auc_10: 0.5070 - loss: 0.8462 - precision_10: 0.4691 - recall_10: 0.5295 - val_accuracy: 1.7500e-05 - val_auc_10: 0.0000e+00 - val_loss: 0.7363 - val_precision_10: 1.0000 - val_recall_10: 2.2500e-05
Epoch 2/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 21s 1s/step - accuracy: 0.5835 - auc_10: 0.6338 - loss: 0.6676 - precision_10: 0.5666 - recall_10: 0.5439 - val_accuracy: 0.1036 - val_auc_10: 0.0000e+00 - val_loss: 0.7030 - val_precision_10: 1.0000 - val_recall_10: 0.2091
Epoch 3/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 21s 1s/step - accuracy: 0.6501 - auc_10: 0.7131 - loss: 0.6167 - precision_10: 0.6317 - recall_10: 0.5958 - val_accuracy: 0.4416 - val_auc_10: 0.0000e+00 - val_loss: 0.6690 - val_precision_10: 1.0000 - val_recall_10: 0.8775
Epoch 4/10
16/16 ━━━━━━━━━━━━━━━━━━━━ 21s 1s/step - accuracy: 0.8182 -

<keras.src.callbacks.history.History at 0x7e3b96371790>

In [198]:
metrics = RNN50_model.evaluate(X_test, y_test)

print(f"Test Loss: {metrics[0]:.4f}")
print(f"Test Accuracy: {metrics[1] * 100:.2f}%")
print(f"Test Precision: {metrics[2]:.4f}")
print(f"Test Recall: {metrics[3]:.4f}")
print(f"Test AUC: {metrics[4]:.4f}")

12500/12500 ━━━━━━━━━━━━━━━━━━━━ 42s 3ms/step - accuracy: 0.1758 - auc_10: 0.0000e+00 - loss: 0.8012 - precision_10: 1.0000 - recall_10: 0.3405
Test Loss: 0.8035
Test Accuracy: 17.58%
Test Precision: 1.0000
Test Recall: 0.3370
Test AUC: 0.0000


In [199]:
metrics = RNN50_model.evaluate(X_train, y_train)

print(f"Train Loss: {metrics[0]:.4f}")
print(f"Train Accuracy: {metrics[1] * 100:.2f}%")
print(f"Train Precision: {metrics[2]:.4f}")
print(f"Train Recall: {metrics[3]:.4f}")
print(f"Train AUC: {metrics[4]:.4f}")

32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 1.0000 - auc_10: 1.0000 - loss: 0.2919 - precision_10: 1.0000 - recall_10: 1.0000
Train Loss: 0.2887
Train Accuracy: 100.00%
Train Precision: 1.0000
Train Recall: 1.0000
Train AUC: 1.0000


We observe here that with a shorter token length the model fits the training data perfectly (100% training accuracy) but, like the 100-token model, does not generalize to the test data.

A possible explanation is that the model receives insufficient context to understand the sequence's structure and dependencies. This leads the RNN to focus on patterns within shorter sequences, which may not carry over to unseen data, so the model memorizes specific details of the training data instead of learning broader, more general patterns.

Additionally, when the sequence length is small, the model still has a large number of parameters relative to the size of the input; this excess capacity lets it learn patterns that are specific to the training data.
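
The parameter argument can be made concrete by counting the weights of the RNN defined above. The sketch uses the standard formulas for embedding, simple recurrent and dense layers; the helper names are illustrative, and the batch normalization layers are left out for brevity.

```python
def embedding_params(vocab_size, emb_dim):
    return vocab_size * emb_dim

def simple_rnn_params(input_dim, units):
    # input kernel + recurrent kernel + bias
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # kernel + bias
    return units * (input_dim + 1)

counts = {
    "embedding": embedding_params(10000, 128),
    "simple_rnn_64": simple_rnn_params(128, 64),
    "simple_rnn_32": simple_rnn_params(64, 32),
    "dense_64": dense_params(32, 64),
    "dense_32": dense_params(64, 32),
    "output": dense_params(32, 1),
}
total = sum(counts.values())
print(counts)
print(total)  # 1299681
```

None of these counts depends on the sequence length, so the 50-token and 100-token models have the same number of weights, and almost all of them sit in the embedding table.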

# Long Short-Term Memory (LSTM) with 100 tokens

In [11]:
LSTMmodel = Sequential()

Embedding layer to learn dense vector representations of tokens. 

In [12]:
LSTMmodel.add(Embedding(input_dim=10000, output_dim=128, input_length=100)) 



LSTM layer with 64 units

In [13]:
LSTMmodel.add(LSTM(64, return_sequences=True))

I0000 00:00:1731305522.112654  212521 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2270 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6


Second LSTM layer with 64 units

In [14]:
LSTMmodel.add(LSTM(64, return_sequences=False))

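return_sequences=False here: only the last hidden state is returned, giving one vector per review, which matches the single label per review (a target of shape (32, 1) for a batch of 32)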

Dropout for regularization

In [15]:
LSTMmodel.add(Dropout(0.5))

Dense layer with batch normalization

In [16]:
LSTMmodel.add(Dense(64, activation='relu'))
LSTMmodel.add(BatchNormalization()) 
LSTMmodel.add(Dense(16, activation='relu'))

Output layer 

In [17]:
LSTMmodel.add(Dense(1, activation='sigmoid'))

Compiling the model with precision, accuracy, recall and AUC

In [18]:
LSTMmodel.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC()]
)

Training

In [19]:
LSTMmodel.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))


Epoch 1/10

I0000 00:00:1731305540.081787  214243 cuda_dnn.cc:529] Loaded cuDNN version 90300

112500/112500 ━━━━━━━━━━━━━━━━━━━━ 977s 9ms/step - accuracy: 0.4990 - auc: 0.0000e+00 - loss: -43934236.0000 - precision: 1.0000 - recall: 0.9992 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -632000640.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 2/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 990s 9ms/step - accuracy: 0.5001 - auc: 0.0000e+00 - loss: -1300403712.0000 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -4710282752.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 3/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 993s 9ms/step - accuracy: 0.5001 - auc: 0.0000e+00 - loss: -6713518592.0000 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -15184941056.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 4/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 9

<keras.src.callbacks.history.History at 0x742cc2ba2780>

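The LSTM performs poorly: accuracy stays at 50% while the loss is negative and keeps diverging. This may be partly due to the low complexity of the model implemented here (two LSTM layers and two dense layers), but a negative binary cross-entropy more likely indicates that the labels are still encoded as 1 and 2 instead of 0 and 1.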
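
To put the complexity of the recurrent part into numbers, the weights of each LSTM layer can be counted with the standard formula (four gates, each with an input kernel, a recurrent kernel and a bias). `lstm_params` is an illustrative helper.

```python
def lstm_params(input_dim, units):
    """Weights of one LSTM layer: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * units * (input_dim + units + 1)

first_lstm = lstm_params(128, 64)   # fed by the 128-dimensional embedding
second_lstm = lstm_params(64, 64)   # fed by the 64 outputs of the first LSTM
print(first_lstm, second_lstm)      # 49408 33024
```

Together the two layers hold 82,432 weights, several times more than the two SimpleRNN layers of the earlier model (12,352 + 3,104 by the analogous single-gate formula), which suggests that capacity alone is unlikely to explain the chance-level accuracy.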

### Trying a different architecture with larger dense layer

In [12]:
lstmmodel2 = Sequential()
lstmmodel2.add(Embedding(input_dim=10000, output_dim=50, input_length=100, name='embedding'))
lstmmodel2.add(LSTM(64, name='lstm_layer'))
lstmmodel2.add(Dense(256, activation='relu', name='FC1'))
lstmmodel2.add(Dropout(0.5))
lstmmodel2.add(Dense(1, activation='sigmoid', name='out_layer'))

I0000 00:00:1731330271.411374  389444 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2270 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3050 Ti Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6


In [13]:
lstmmodel2.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC()]
)

In [14]:
lstmmodel2.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/10

I0000 00:00:1731330334.278305  392959 cuda_dnn.cc:529] Loaded cuDNN version 90300

112500/112500 ━━━━━━━━━━━━━━━━━━━━ 458s 4ms/step - accuracy: 0.5001 - auc: 0.0000e+00 - loss: -6050813.5000 - precision: 1.0000 - recall: 0.9999 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -51042264.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 2/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 456s 4ms/step - accuracy: 0.5003 - auc: 0.0000e+00 - loss: -80968984.0000 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -197137296.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 3/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 463s 4ms/step - accuracy: 0.4997 - auc: 0.0000e+00 - loss: -250828672.0000 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -437737056.0000 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 4/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 500s

KeyboardInterrupt: 

# Bonus: a Bi-directional LSTM with an Attention Mechanism

In [28]:
vocab_sz = 10000 
emb_dim = 100          
token_length = 100          
num_classes = 2 

Input layer: required to structure the model and to specify the input format for all subsequent layers

In [29]:
inputs = Input(shape=(token_length,))

### Embedding layer

In [30]:
x = Embedding(input_dim=vocab_sz, output_dim=emb_dim, input_length=token_length)(inputs)




### Bi-directional LSTM layer

Processes the sequence in both forward and backward directions using two LSTM layers, then concatenates the outputs from both directions at each time step (2 × 64 = 128 features per step).

In [31]:
x = Bidirectional(LSTM(64, return_sequences=True))(x)

### Attention layer


this layer lets the model focus on critical parts of the input, improving performance by selectively weighting important words or phrases that contribute to the classification.

In [32]:
class AttentionLayer(Layer):
    def __init__(self, **kwargs):
        super(AttentionLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name="att_W", shape=(input_shape[-1], input_shape[-1]), initializer="random_normal")
        self.b = self.add_weight(name="att_b", shape=(input_shape[-1],), initializer="zeros")
        super(AttentionLayer, self).build(input_shape)

    def call(self, inputs):
        score = tf.nn.tanh(tf.tensordot(inputs, self.W, axes=1) + self.b)
        att_wts = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(att_wts * inputs, axis=1)
        return context

In [33]:
x = AttentionLayer()(x)

W and b: learnable weights that help the model capture meaningful relationships within the sequence.


score: computes an attention score using tanh to introduce non-linearity, allowing more flexible weight learning.


att_wts: converts scores to probabilities (weights) across time steps via softmax, where higher weights indicate more importance. Since W is a square matrix, each feature dimension gets its own set of weights over the time steps.


context: produces a context vector by summing the LSTM outputs weighted by their importance. This context vector acts as a summary of the sequence, emphasizing relevant parts based on the learned attention weights.
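
The three steps can be followed on a tiny example. The sketch below re-implements the layer's `call` for one sequence in plain Python; the input values are made up, `W` is set to the identity and `b` to zero purely for illustration, and `attention_pool` is not part of the notebook.

```python
import math

def attention_pool(inputs, W, b):
    """inputs: T x d list of lists. Returns (weights as T x d, context of length d)."""
    T, d = len(inputs), len(inputs[0])
    # score[t][j] = tanh(sum_i inputs[t][i] * W[i][j] + b[j])
    scores = [[math.tanh(sum(inputs[t][i] * W[i][j] for i in range(d)) + b[j])
               for j in range(d)] for t in range(T)]
    # softmax over the time axis, separately for every feature j
    weights = [[0.0] * d for _ in range(T)]
    for j in range(d):
        exps = [math.exp(scores[t][j]) for t in range(T)]
        total = sum(exps)
        for t in range(T):
            weights[t][j] = exps[t] / total
    # weighted sum over time gives one value per feature
    context = [sum(weights[t][j] * inputs[t][j] for t in range(T)) for j in range(d)]
    return weights, context

inputs = [[0.1, -0.4], [0.9, 0.2], [-0.3, 0.5]]   # 3 time steps, 2 features (made up)
W = [[1.0, 0.0], [0.0, 1.0]]                      # identity, for illustration only
b = [0.0, 0.0]

weights, context = attention_pool(inputs, W, b)
print(weights)
print(context)
```

Each column of `weights` sums to 1, and each entry of `context` is a weighted average of that feature over the time steps, so it always lies between the feature's smallest and largest value.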

output layer

In [34]:
outputs = Dense(1, activation='sigmoid')(x) 

In [35]:
Attention_model = Model(inputs=inputs, outputs=outputs)

In [36]:
Attention_model.compile(
    optimizer=Adam(),
    loss='binary_crossentropy',
    metrics=['accuracy', Precision(), Recall(), AUC()]
)

In [37]:
Attention_model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/10

I0000 00:00:1731334143.225549  418288 cuda_dnn.cc:529] Loaded cuDNN version 90300

112500/112500 ━━━━━━━━━━━━━━━━━━━━ 1047s 9ms/step - accuracy: 0.5007 - auc: 0.0000e+00 - loss: -1795.0321 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -7152.3853 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 2/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 1053s 9ms/step - accuracy: 0.5002 - auc: 0.0000e+00 - loss: -8934.9355 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -14294.3936 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 3/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 1053s 9ms/step - accuracy: 0.5005 - auc: 0.0000e+00 - loss: -16065.8379 - precision: 1.0000 - recall: 1.0000 - val_accuracy: 0.5000 - val_auc: 0.0000e+00 - val_loss: -21437.0605 - val_precision: 1.0000 - val_recall: 1.0000
Epoch 4/10
112500/112500 ━━━━━━━━━━━━━━━━━━━━ 1053s 9ms/step - accuracy

<keras.src.callbacks.history.History at 0x7a92181f2300>