SENTIMENT ANALYSIS OF IMDB USING CNN AND RNN

Here's why using a hybrid of CNN and RNN for sentiment analysis on a dataset like IMDB is a good idea:

Strengths of Individual Models:

CNNs (Convolutional Neural Networks):

1.-Pattern Recognition: They excel at identifying sentiment-laden phrases and keywords like "terrible acting" or "laugh-out-loud funny." These are common in movie reviews and can be strong indicators of sentiment.

2.-Focus on Key Words: CNNs don't need to model the entire sentence structure; they detect these key sentiment indicators wherever they occur in a review. This is beneficial for the casual language and slang found in the IMDB dataset.


RNNs (Recurrent Neural Networks):

1.-Context Awareness: RNNs read a review one word at a time and carry a hidden state forward, so they can model word order across the whole sequence rather than only inside a fixed window. This is crucial because sentiment can depend on word order. For instance, "not bad" has a different meaning than "bad not."

2.-Long-range Dependencies: RNNs, especially LSTMs (Long Short-Term Memory networks), can handle long-distance dependencies. Sentiment can sometimes be influenced by words far apart in a review. RNNs can learn these longer-range relationships.

Benefits of the Hybrid Approach:

Combined Strengths: By combining CNNs and RNNs, you leverage the advantages of both:
CNNs capture sentiment indicators.
RNNs provide context and understand word order's impact.

Improved Accuracy: This combined approach can potentially lead to a more accurate sentiment analysis model. The model can capture both the presence of sentiment indicators and the overall flow of the review.

#1.- IMPORT LIBRARIES

In [1]:
# import tensorflow as tf: Imports the TensorFlow library under the alias tf.

# from tensorflow.keras.preprocessing.sequence import pad_sequences: Imports the
# pad_sequences function from the tensorflow.keras.preprocessing.sequence module.
# This function is used to pad sequences of different lengths to a uniform length
# for processing by the model.

# from tensorflow.keras.models import Model: Imports the Model class from the
# tensorflow.keras.models module. This class is used to define and build the neural
# network architecture.

# from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM,
# Dense, Dropout: Imports various layers used for building the neural network:
# Input: Represents the input layer of the model.
# Embedding: Converts words into numerical vectors.
# Conv1D: Applies one-dimensional convolutional filters to the sequence.
# MaxPooling1D: Performs downsampling on the output of the convolutional layers.
# LSTM: Applies Long Short-Term Memory layers to capture long-range dependencies
# in the sequence.
# Dense: Represents fully-connected layers for classification.
# Dropout: Introduces dropout regularization to prevent overfitting.

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM, Dense, Dropout



#2.- LOADING DATASET AND DIVIDING TRAIN AND TEST

In [2]:
# Loading the Dataset:

# tf.keras.datasets.imdb.load_data(): This part is responsible for loading the
# IMDB dataset from TensorFlow's built-in datasets.

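# Splitting into Training and Validation Sets:

# (X_train, y_train), (X_val, y_val): load_data() returns the dataset's predefined
# train and test splits (25,000 reviews each). This notebook uses the test split as
# its validation set. The unpacking yields four variables: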
# X_train: This variable stores the training data. It's a NumPy array where each
# element represents a movie review. A review is essentially a sequence of integers,
# where each integer represents a word based on a vocabulary.
# y_train: This variable stores the sentiment labels for the training data. It's a
# NumPy array where each element corresponds to a review in X_train and holds the
# sentiment label (0 for negative, 1 for positive).
# X_val: This variable stores the validation data, following the same format as
# X_train. It's used to evaluate the model's performance during training.
# y_val: This variable stores the sentiment labels for the validation data,
# following the same format as y_train. It corresponds to the reviews in X_val.

# Limiting Vocabulary Size (Optional Argument):

# num_words=20000: This optional argument caps the vocabulary at the 20,000 most
# frequent words in the dataset; rarer words are replaced by an out-of-vocabulary
# index. If the argument is omitted, all words are kept. Capping the vocabulary
# keeps the embedding table small and improves training efficiency.



(X_train, y_train), (X_val, y_val) = tf.keras.datasets.imdb.load_data(num_words=20000)


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
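
To make the effect of num_words concrete, here is a minimal pure-Python sketch of how a review becomes a list of integers. It assumes the documented defaults of load_data (start_char=1, oov_char=2, index_from=3). The tiny word_rank dictionary and the function name encode_review are hypothetical, made up for illustration, and mapping unseen words to the out-of-vocabulary index is an assumption of this sketch.

```python
def encode_review(tokens, word_rank, num_words=None,
                  start_char=1, oov_char=2, index_from=3):
    """Encode tokens in the layout used by the IMDB integer sequences (sketch)."""
    ids = [start_char]  # every review begins with the start marker
    for token in tokens:
        rank = word_rank.get(token)
        if rank is None:
            # Assumption of this sketch: unseen words become out-of-vocabulary.
            ids.append(oov_char)
            continue
        index = rank + index_from  # ranks are shifted to leave room for markers
        if num_words is not None and index >= num_words:
            index = oov_char  # too rare for the capped vocabulary
        ids.append(index)
    return ids


# Hypothetical frequency ranks (1 = most frequent word).
word_rank = {"the": 1, "movie": 2, "was": 3, "great": 4, "sublime": 5}
tokens = ["the", "movie", "was", "sublime"]

full = encode_review(tokens, word_rank)
capped = encode_review(tokens, word_rank, num_words=7)
print(full)    # [1, 4, 5, 6, 8]
print(capped)  # [1, 4, 5, 6, 2]
```

With the cap in place, the rare word "sublime" collapses into the shared out-of-vocabulary index, which is what happens to every word outside the top 20,000 in this notebook.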


#3.- PADDING DATA

In [3]:
# This code snippet deals with "padding sequences"

# 1. Setting Maximum Sequence Length:

# max_length = 500: This line defines a variable named max_length and assigns it a
# value of 500. This variable represents the maximum length (number of words) a
# sequence (review) can have in the processed data.

# 2. Padding Training Data:

# X_train = pad_sequences(X_train, maxlen=max_length): This line applies the
# pad_sequences function from TensorFlow's keras.preprocessing.sequence module to
# the training data stored in X_train.
# pad_sequences: This function is called here with two arguments:
# The first argument (X_train) is the training data, an array in which each
# element is a movie review. Each review is itself a sequence of integers
# representing individual words based on a vocabulary.
# The second argument (maxlen=max_length) specifies the maximum sequence length
#  (set to 500 in this case).
# The function's purpose is to ensure all sequences in the training data (X_train)
# have the same length (max_length). With the default settings, shorter reviews are
# padded with zeros at the start and longer reviews are truncated from the start,
# so the last 500 tokens are kept.

# 3. Padding Validation Data:

# X_val = pad_sequences(X_val, maxlen=max_length): This line follows the same logic
# as the previous line, but it applies the pad_sequences function to the validation
# data stored in X_val. This ensures all sequences in the validation set also have
# the same length (max_length) for consistency with the training data.


max_length = 500
X_train = pad_sequences(X_train, maxlen=max_length)
X_val = pad_sequences(X_val, maxlen=max_length)
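
The padding behaviour can be illustrated without TensorFlow. The sketch below mimics what pad_sequences does with its default settings (padding and truncating both at the start, padding value 0) for a single sequence. The helper name pad_pre is hypothetical, and the real function works on a whole batch and returns a NumPy array.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    sequence = list(sequence)
    if len(sequence) > maxlen:
        # Keep the last maxlen items (truncation from the start).
        sequence = sequence[len(sequence) - maxlen:]
    # Fill the front with the padding value (padding at the start).
    return [value] * (maxlen - len(sequence)) + sequence


short_review = pad_pre([11, 12, 13], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_review)  # [0, 0, 11, 12, 13]
print(long_review)   # [3, 4, 5, 6]
```

Padding at the start keeps the real words next to the end of the sequence, which is the part an LSTM sees last before producing its output.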



#4.- DATA SHAPE

In [4]:
# Data shapes

# Knowing the shapes of your data helps ensure everything is processed correctly.
# It verifies that the dimensions of your training and validation data are
# compatible for training your model.

# There are 25,000 reviews in the training data (X_train).
# There are 25,000 reviews in the validation data (X_val).
# Each review (sequence) in both training and validation data has been padded to a
# maximum length of 500 (specified earlier in the code).
# y_train and y_val contain labels for each review, and they likely have a single
# dimension representing the sentiment (0 or 1). The number of elements in these
# arrays should match the number of reviews in the corresponding training and
# validation sets (X_train and X_val).
# By printing the shapes, you can confirm that the padding process worked as
# expected and your data is ready for further processing in your sentiment analysis
# model.


print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)



(25000, 500) (25000, 500) (25000,) (25000,)


#5.- DEFINING VOCABULARY SIZE, EMBEDDING DIMENSION AND MAXIMUM LENGTH

In [5]:
# This snippet defines three important variables used in building a sentiment
# analysis model for the IMDB dataset:

# vocab_size = 20000:

# This variable defines the vocabulary size, which is the number of unique words
# considered in the model. Here, it's set to 20,000.

# embedding_dim = 100:

# This variable defines the embedding dimension. In sentiment analysis, each word
# in a review gets converted into a numerical vector representation. This embedding
# dimension specifies the size (length) of these vectors. Here, each word is
# represented by a 100-dimensional vector.

# max_length = 500:

# This variable defines the maximum length (number of words) for a review (sequence)
# in the processed data. Here, it's set to 500.

vocab_size = 20000
embedding_dim = 100
max_length = 500
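
As a rough illustration of what these sizes mean, an embedding layer can be thought of as a lookup table with vocab_size rows and embedding_dim columns. The sketch below builds a small table in pure Python; the function names, the toy sizes, and the random initialisation range are illustrative assumptions, not the Keras implementation.

```python
import random


def make_embedding_table(vocab_size, embedding_dim, seed=0):
    """Build a vocab_size x embedding_dim table of small random weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]


def embed(sequence, table):
    """Replace each word index by its row in the table."""
    return [table[index] for index in sequence]


# Toy sizes so the example stays small; the notebook uses 20000 and 100.
table = make_embedding_table(vocab_size=50, embedding_dim=4)
vectors = embed([0, 0, 7, 42], table)

print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[0] == vectors[1])       # True: the same index gives the same vector
print(20000 * 100)                    # 2000000 weights for the notebook's sizes
```

In the real model the rows are trainable weights, so the vectors change during training instead of staying random.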


#6.- DEFINING THE ARCHITECTURE OF THE MODEL

In [6]:
# This code defines the architecture of a neural network for sentiment analysis
# on the IMDB dataset, using a combination of Convolutional Neural Networks (CNNs)
# and Long Short-Term Memory (LSTM) networks. Here's a breakdown of each layer:

# Input Layer (input_layer):

# This layer defines the entry point for the network. It takes sequences of integers
# representing words in a review. The shape argument specifies the expected input
# shape, which is the tuple (max_length,) in this case. This means the network expects
# every review to be a sequence of exactly max_length integers (defined earlier),
# which the padding step guarantees.

# Embedding Layer (embedding_layer):

# This layer transforms integer-represented words into dense vectors. It takes
# three arguments:
# vocab_size: The number of unique words considered (set to vocab_size earlier).
# embedding_dim: The dimensionality of the word vectors (set to embedding_dim earlier).
# input_length (optional): Here, it's set to max_length to ensure the embedding
# layer processes sequences of the expected length.
# This layer maps each word index to a trainable vector. During training, these
# vectors come to capture relationships between words that are useful for the task.

# Convolutional Layer (conv_layer):

# This layer is the first CNN layer. It applies one-dimensional convolutional
# filters to the embedded sequences. The arguments are:
# 128: The number of filters used in the convolution (extracts 128 features).
# 5: The size of the filter window (considers 5 consecutive word vectors at a time).
# activation='relu': The activation function applied to the convolution output
#  (ReLU for non-linearity).
# The goal of this layer is to capture local patterns in the sequence that might
# be indicative of sentiment (e.g., presence of sentiment-laden phrases).

# Pooling Layer (pooling_layer):

# This layer performs downsampling on the output of the convolutional layer. Here,
# MaxPooling1D takes the maximum value from a window of size 4 (keeps the most
# significant feature from every 4 consecutive outputs of the convolutional layer).
# This reduces the dimensionality of the data and helps control overfitting.

# LSTM Layer (lstm_layer):

# This layer introduces an LSTM network. LSTMs are powerful for handling sequential
# data like reviews. It takes the output of the pooling layer and processes the
# sequence to capture long-range dependencies in the word order. The argument is:
# 128: The number of units in the LSTM layer (defines the internal memory of the LSTM).
# Because the LSTM reads the pooled convolutional features in order, it can learn
# how the sentiment of a review is influenced by phrases that are far apart in the
# sequence, as well as by word order (e.g., "not bad" vs. "bad not").

# Dense Layer (dense_layer):

# This layer is a fully-connected layer that transforms the LSTM output into a
# lower-dimensional space. It has:
# 64: The number of neurons in the dense layer.
# activation='relu': The activation function applied (ReLU for non-linearity).
# This layer helps extract higher-level features from the sequence data.

# Dropout Layer (dropout_layer):

# This layer introduces dropout regularization. It randomly drops a certain
# percentage of neurons (here, 50%) during training to prevent overfitting.
# The argument is:
# 0.5: The dropout rate (50% of neurons are dropped).

# Output Layer (output_layer):

# This layer is the final layer of the network. It has:
# 1: The number of neurons (as the task is binary sentiment classification: positive
# or negative).
# activation='sigmoid': The activation function applied (sigmoid for binary
# classification, outputting a value between 0 and 1 representing the probability
# of positive sentiment).
# This layer outputs the final prediction, a probability score indicating the
# sentiment of the review (closer to 1 for positive, closer to 0 for negative).




input_layer = Input(shape=(max_length,))
embedding_layer = Embedding(vocab_size, embedding_dim, input_length=max_length)(input_layer)
conv_layer = Conv1D(128, 5, activation='relu')(embedding_layer)
pooling_layer = MaxPooling1D(pool_size=4)(conv_layer)
lstm_layer = LSTM(128)(pooling_layer)
dense_layer = Dense(64, activation='relu')(lstm_layer)
dropout_layer = Dropout(0.5)(dense_layer)
output_layer = Dense(1, activation='sigmoid')(dropout_layer)
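
The sequence length shrinks as data flows through the convolution and pooling layers. The short sketch below traces that arithmetic, assuming the Keras defaults used above ('valid' padding, convolution stride 1, pooling stride equal to the pool size). The helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' (no) padding."""
    return (length - kernel_size) // stride + 1


def max_pool1d_output_length(length, pool_size, stride=None):
    """Output length of 1D max pooling; the stride defaults to the pool size."""
    stride = pool_size if stride is None else stride
    return (length - pool_size) // stride + 1


sequence_length = 500
after_conv = conv1d_output_length(sequence_length, kernel_size=5)
after_pool = max_pool1d_output_length(after_conv, pool_size=4)
print(after_conv)  # 496
print(after_pool)  # 124
```

The LSTM therefore processes 124 time steps of 128 features each instead of 500 word vectors, which makes the recurrent part considerably cheaper.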

#7.- MODEL COMPILATION

In [7]:
# This snippet defines the final steps for building and compiling the sentiment
# analysis model using the architecture defined earlier:

# 1. Model Creation (model = Model(inputs=input_layer, outputs=output_layer)):

# This line creates a Model object using the tensorflow.keras.models.Model class.
# It takes two arguments:
# inputs: This specifies the input layer of the model, which is the input_layer
# defined earlier.
# outputs: This specifies the output layer of the model, which is the output_layer
# defined earlier.
# Essentially, this line constructs the overall neural network architecture based
# on the sequence of layers you defined previously.

# 2. Model Compilation (model.compile(loss='binary_crossentropy', optimizer='adam',
# metrics=['accuracy'])):

# This line compiles the model, which configures it for training. It takes three
# arguments:
# loss: This specifies the loss function used to measure the difference between
# the model's predictions and the true labels. Here, 'binary_crossentropy' is used
# because it's a binary classification task (positive or negative sentiment).
# optimizer: This specifies the optimization algorithm used to train the model.
# Here, 'adam' is a popular optimizer choice for its efficiency.
# metrics: This is a list of metrics used to evaluate the model's performance
# during training and validation. Here, 'accuracy' is used to track the percentage
# of correct predictions.

# 3. Model Summary (model.summary()):

# This line calls the summary method on the model object. This method prints a
# summary of the model's architecture, including:
# Layer names and types
# Number of parameters in each layer
# Total number of trainable parameters
# Output shape
# Overall, this code snippet completes the model definition by creating the model
# object, configuring its training process, and providing insights into its
# architecture through the summary.


model = Model(inputs=input_layer, outputs=output_layer)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 500)]             0         
                                                                 
 embedding (Embedding)       (None, 500, 100)          2000000   
                                                                 
 conv1d (Conv1D)             (None, 496, 128)          64128     
                                                                 
 max_pooling1d (MaxPooling1  (None, 124, 128)          0         
 D)                                                              
                                                                 
 lstm (LSTM)                 (None, 128)               131584    
                                                                 
 dense (Dense)               (None, 64)                8256      
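
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for each layer type, with hypothetical helper names. The 65 parameters of the final sigmoid layer and the total are computed from the architecture defined earlier, since those rows are not shown in the output above.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per vocabulary entry.
    return vocab_size * embedding_dim


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return (kernel_size * in_channels + 1) * filters


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(inputs, units):
    # One weight per input plus one bias, for each unit.
    return (inputs + 1) * units


counts = {
    "embedding": embedding_params(20000, 100),
    "conv1d": conv1d_params(5, 100, 128),
    "lstm": lstm_params(128, 128),
    "dense": dense_params(128, 64),
    "output": dense_params(64, 1),
}
for name, count in counts.items():
    print(name, count)
print("total", sum(counts.values()))  # total 2204033
```

Most of the weights sit in the embedding table, so reducing vocab_size or embedding_dim is the most direct way to shrink this model.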
                                                             

#8.- TRAINING DATA

In [8]:
# Training Data:

# The first two arguments specify the training data:
# X_train: This is the NumPy array containing the padded sequences (reviews)
# representing the training data.
# y_train: This is the NumPy array containing the sentiment labels (0 for negative,
# 1 for positive) for each review in X_train.

# Training Parameters:

# The next two arguments define training parameters:
# epochs=10: This specifies the number of times the entire training dataset will
# be passed through the network for training. Here, the model will be trained for
# 10 epochs.
# batch_size=64: This specifies the number of samples used to update the model's
# weights in one iteration. Here, the model will update its weights after processing
# batches of 64 reviews.

# Validation Data (Optional):

# The final argument, validation_data=(X_val, y_val), is optional but highly
# recommended. It specifies the data used to measure loss and accuracy at the end
# of each epoch; the model is not trained on it:
# X_val: This is the NumPy array containing the padded sequences (reviews)
# representing the validation data.
# y_val: This is the NumPy array containing the sentiment labels (0 for negative,
# 1 for positive) for each review in X_val.

trained_model = model.fit(X_train, y_train, epochs=10, batch_size=64, validation_data=(X_val, y_val))


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
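
The relationship between epochs, batch size, and weight updates is simple arithmetic. The sketch below works it out for the values used in this notebook (25,000 training reviews, batch_size=64, epochs=10); the helper name steps_per_epoch is hypothetical.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


num_samples, batch_size, epochs = 25000, 64, 10
steps = steps_per_epoch(num_samples, batch_size)
last_batch_size = num_samples - (steps - 1) * batch_size
total_updates = steps * epochs

print(steps)            # 391
print(last_batch_size)  # 40
print(total_updates)    # 3910
```

The last batch of each epoch is smaller than the others because 25,000 is not a multiple of 64.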


#9.- MODEL EVALUATION

In [9]:
# This snippet evaluates the performance of the trained sentiment analysis model
# on the validation data and then prints the results. Here's a breakdown:

# Evaluation:

# loss, accuracy = model.evaluate(X_val, y_val): This line calls the evaluate
# method on the trained model (model). It takes two arguments:

# X_val: The NumPy array containing the padded sequences (reviews) representing
# the validation data.
# y_val: The NumPy array containing the sentiment labels (0 for negative, 1 for
# positive) for each review in X_val.

# The evaluate method performs a forward pass through the network using the
# validation data and calculates two key metrics:

# Loss: This quantifies the difference between the model's predictions and the true
# labels (usually lower is better).

# Accuracy: This represents the percentage of correct predictions made by the model
# on the validation data (usually higher is better).


loss, accuracy = model.evaluate(X_val, y_val)
print(f'Loss: {loss}')
print(f'Accuracy: {accuracy}')


Loss: 0.8321132063865662
Accuracy: 0.8678799867630005
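
Loss and accuracy measure different things, which is why a model can have good accuracy and a high loss at the same time. The sketch below implements binary cross-entropy and accuracy in pure Python. The probabilities are invented for illustration, and the clipping constant eps and the 0.5 threshold are assumptions chosen to mirror common defaults.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


y_true = [1, 1, 1, 0]
cautious = [0.7, 0.7, 0.7, 0.6]            # wrong on the last review, mildly
overconfident = [0.99, 0.99, 0.99, 0.999]  # wrong on the last review, very sure

print(accuracy(y_true, cautious), accuracy(y_true, overconfident))  # 0.75 0.75
print(binary_crossentropy(y_true, cautious))
print(binary_crossentropy(y_true, overconfident))
```

Both sets of predictions have the same accuracy, but the overconfident one has a much higher loss because a single very confident mistake is penalised heavily.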


In [None]:
# Interpretation
# Loss on the validation set (0.8321): The loss is relatively high, suggesting
# there is room for improvement in the model. A validation loss that is much higher
# than the training loss indicates that the model may be overfitting. A loss this
# high alongside good accuracy also suggests that the model is very confident in
# the reviews it misclassifies.

# Accuracy on the validation set (86.79%): An accuracy of 86.79% is quite good,
# but given that the accuracy on the training set was extremely high (99.61%),
# this gap also suggests overfitting.

# The evaluation confirms that the model performs well in terms of accuracy on
# the validation set, but the high validation loss and the gap to the training
# accuracy both point to overfitting. Here are some suggestions to address
# overfitting:

# 1.- Regularization:
# Dropout: Increase the Dropout rate.
# L2 Regularization: Add L2 regularization in the dense layers.

# 2.- Early Stopping: Implement EarlyStopping to stop training when the validation
# loss stops improving.

# 3.- Reduce the complexity of the model


# These changes were not applied here because retraining the model would take a
# very long time.
