 practical guide to building a Recurrent Neural Network (RNN)-based classifier for Sentiment Prediction in NLP. We will use Keras and TensorFlow for building the RNN model. RNNs are great for handling sequential data such as text.


Steps:

Load a dataset: We will use the IMDb movie reviews dataset, which is readily available in Keras.

Preprocess the data: Tokenize the text and pad the sequences.

Build the RNN model: Using an Embedding layer, RNN layer (e.g., LSTM or GRU), and a Dense output layer.

Train the model.

Evaluate the model.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, LSTM, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Load the IMDb dataset
# num_words=10000 means we will only use the top 10,000 most frequent words in the dataset
max_features = 10000
max_len = 100  # We will pad sequences to a maximum length of 100 words

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)

# Pad sequences to ensure consistent input size
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

# Model architecture
model = Sequential()

# Embedding layer: Turns positive integer representations into dense vectors
model.add(Embedding(input_dim=max_features, output_dim=128, input_length=max_len))

# Recurrent layer: SimpleRNN or LSTM
model.add(LSTM(units=128, dropout=0.2, recurrent_dropout=0.2))

# Fully connected layer (Dense)
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])

# Print the model summary to check the architecture
model.summary()

# Train the model with early stopping to prevent overfitting
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_test, y_test), callbacks=[early_stop])

# Evaluate the model on the test data
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
[1m17464789/17464789[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 0us/step




Epoch 1/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m137s[0m 338ms/step - accuracy: 0.7274 - loss: 0.5331 - val_accuracy: 0.8390 - val_loss: 0.3843
Epoch 2/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m128s[0m 303ms/step - accuracy: 0.8707 - loss: 0.3229 - val_accuracy: 0.8326 - val_loss: 0.3834
Epoch 3/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m145s[0m 310ms/step - accuracy: 0.8993 - loss: 0.2624 - val_accuracy: 0.8422 - val_loss: 0.3714
Epoch 4/10
[1m391/391[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m119s[0m 304ms/step - accuracy: 0.9132 - loss: 0.2215 - val_accuracy: 0.8426 - val_loss: 0.3858
Epoch 5/10


Detailed Explanation of Code:

1. Data Loading and Preprocessing:

The IMDb dataset is already split into training and test sets.
We use imdb.load_data() to load the data, limiting the vocabulary to the 10,000 most frequent words.

Padding: Since reviews have varying lengths, we use pad_sequences() to ensure each review has the same length (set to 100 here).

2. Model Building:

Embedding Layer: The Embedding layer is used to turn integer-encoded words into dense vectors of fixed size. This allows the model to learn word representations (embeddings).

input_dim = 10,000 (the vocabulary size).

output_dim = 128 (the dimension of the embedding vectors).

input_length = 100 (the maximum length of the input sequences).

Recurrent Layer (LSTM): LSTM (Long Short-Term Memory) is used as the recurrent layer. LSTM is better at capturing long-term dependencies in sequences compared to simple RNNs.

The units parameter specifies the number of LSTM units.

dropout and recurrent_dropout help in preventing overfitting by randomly dropping connections during training.

Dense Layer: A Dense layer with 1 unit and sigmoid activation function is used to classify the sentiment (positive or negative).

3. Model Compilation:

We use Adam() optimizer, which is popular for training neural networks.
The binary cross-entropy loss function is used since we have a binary classification problem (positive/negative sentiment).

4. Training the Model:

We use early stopping to prevent overfitting, which stops training if the validation loss does not improve after 3 epochs.

The model is trained on the training data (X_train, y_train) for a maximum of 10 epochs, with a batch size of 64. Validation data (X_test, y_test) is used for monitoring validation performance.

5. Model Evaluation:
After training, we evaluate the model on the test set (X_test, y_test), which was unseen during training. We check the loss and accuracy of the model.

Explanation of Output:

Model Summary: We see the model architecture, which consists of an embedding layer, an LSTM layer, and a dense output layer.

Training Progress: During training, the loss and accuracy are printed for both the training and validation sets.

Test Results: After training, the model achieves 85.92% accuracy on the test set, which means it correctly predicted the sentiment of the reviews about 86% of the time.

Conclusion:

RNNs, specifically LSTM, are powerful for text classification tasks like sentiment analysis because they are designed to capture the sequential nature of text.

The model performs well on the IMDb dataset with 85.92% accuracy, demonstrating the effectiveness of RNN-based models in NLP tasks.

Further improvements could include using GRU (Gated Recurrent Unit) layers, tuning hyperparameters, adding bidirectional layers, or using pre-trained embeddings like GloVe or Word2Vec.