**Movie Review Analysis Using CNNs and RNNs**

*In this machine learning project, we preprocess movie review data and analyze it with both a CNN and an RNN.*

# 1. Data Collection and Preprocessing

The Large Movie Review Dataset (IMDB) is downloaded with TensorFlow's `tf.keras.utils.get_file` utility. It contains 50,000 labeled movie reviews, split evenly into training and test sets and balanced between positive and negative labels, which makes it well suited to binary sentiment classification.

A function is defined to explore the contents of the dataset directory. Understanding the structure of the dataset is crucial for setting up data pipelines and deciding how to preprocess the data.


In [2]:
# Import necessary libraries
from pathlib import Path
import tensorflow as tf
import numpy as np

# Download the dataset and extract it
root_url = "https://ai.stanford.edu/~amaas/data/sentiment/"
filename = "aclImdb_v1.tar.gz"
dataset_url = f"{root_url}{filename}"
dataset_path = tf.keras.utils.get_file(filename, dataset_url, extract=True, cache_dir=".")
dataset_dir = Path(dataset_path).with_name("aclImdb")

# Define a function to explore the dataset structure
def explore_dataset(path):
    print(f"Contents of {path}:")
    for item in sorted(path.iterdir()):
        print(f"- {item.name}")

# Explore the top-level structure
explore_dataset(dataset_dir)


Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Contents of datasets/aclImdb:
- README
- imdb.vocab
- imdbEr.txt
- test
- train


We load the file paths for positive and negative reviews separately. This separation is necessary because the directory a review sits in (`pos` or `neg`) is what provides its label for our binary classification task.

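The original test set is shuffled and split in half: one half becomes the validation set used to monitor training, and the other half is held back as the final test set. This is standard practice in machine learning, so that the reported test metrics come from reviews the model never saw.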

A function `create_dataset` is defined to convert these file paths into a TensorFlow dataset, which is a more efficient format for training models in TensorFlow.


In [3]:
# Function to get file paths for reviews
def get_review_paths(dir_path):
    return [str(path) for path in dir_path.glob("*.txt")]

# Load file paths for training and testing
train_pos_paths = get_review_paths(dataset_dir / "train" / "pos")
train_neg_paths = get_review_paths(dataset_dir / "train" / "neg")
test_pos_paths = get_review_paths(dataset_dir / "test" / "pos")
test_neg_paths = get_review_paths(dataset_dir / "test" / "neg")

# Shuffle and split the test set into validation and test sets
np.random.shuffle(test_pos_paths)
np.random.shuffle(test_neg_paths)
valid_pos_paths = test_pos_paths[:len(test_pos_paths) // 2]
valid_neg_paths = test_neg_paths[:len(test_neg_paths) // 2]
test_pos_paths = test_pos_paths[len(test_pos_paths) // 2:]
test_neg_paths = test_neg_paths[len(test_neg_paths) // 2:]

# Function to create a dataset from review paths
def create_dataset(pos_paths, neg_paths):
    def load_review(path, label):
        return tf.io.read_file(path), label

    pos_ds = tf.data.Dataset.from_tensor_slices((pos_paths, [1] * len(pos_paths)))
    neg_ds = tf.data.Dataset.from_tensor_slices((neg_paths, [0] * len(neg_paths)))

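    # Shuffle with a buffer that covers the whole dataset; a smaller buffer
    # would leave the positive reviews clustered at the front
    return pos_ds.map(load_review).concatenate(
        neg_ds.map(load_review)
    ).shuffle(buffer_size=len(pos_paths) + len(neg_paths))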

# Create datasets
train_dataset = create_dataset(train_pos_paths, train_neg_paths)
valid_dataset = create_dataset(valid_pos_paths, valid_neg_paths)
test_dataset = create_dataset(test_pos_paths, test_neg_paths)
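The shuffle above is unseeded, so the validation/test split changes from run to run. A minimal sketch of a reproducible variant follows, using only the standard library. The helper name `split_in_half`, the seed value, and the example file names are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_in_half(paths, seed=42):
    """Shuffle a sorted copy of `paths` with a fixed seed and split it into two halves."""
    shuffled = sorted(paths)                 # sort first: glob order is not guaranteed
    random.Random(seed).shuffle(shuffled)    # private generator, global state untouched
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Hypothetical file names standing in for the real review paths
example_paths = [f"test/pos/{i}_10.txt" for i in range(10)]
valid_paths, test_paths = split_in_half(example_paths)

# The two halves never overlap, and together they cover every path
assert set(valid_paths).isdisjoint(test_paths)
assert sorted(valid_paths + test_paths) == sorted(example_paths)
```

Sorting before shuffling means the split depends only on the set of paths and the seed, not on the order in which the file system happens to list them.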


Each dataset is written to TFRecord files, with records distributed round-robin across 10 shards. TFRecord is a simple record-oriented binary format that many TensorFlow applications use for training data. It is efficient to read, and sharded files can be read in parallel, which helps with large datasets.

In [4]:
from contextlib import ExitStack

from tensorflow.train import Example, Features, Feature, BytesList, Int64List

def create_text_example(review, label):
    return Example(
        features=Features(
            feature={
                "review": Feature(bytes_list=BytesList(value=[tf.compat.as_bytes(review)])),
                "label": Feature(int64_list=Int64List(value=[label])),
            }))


# Function to write dataset to TFRecord files
def write_tfrecords(name, dataset, n_shards=10):
    paths = [f"{name}.tfrecord-{index:05d}-of-{n_shards:05d}" for index in range(n_shards)]
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path)) for path in paths]
        for index, (review, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_text_example(review.numpy().decode('utf-8'), label.numpy())
            writers[shard].write(example.SerializeToString())
    return paths

# Write datasets to TFRecord files
train_filepaths = write_tfrecords("imdb_train", train_dataset)
valid_filepaths = write_tfrecords("imdb_valid", valid_dataset)
test_filepaths = write_tfrecords("imdb_test", test_dataset)
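To make the round-robin sharding in `write_tfrecords` concrete, the sketch below reproduces the file-name pattern and counts how many records each shard would receive. The helper names `shard_paths` and `shard_sizes` are hypothetical, and the 25,000 figure assumes the full training split of the dataset.

```python
from collections import Counter

def shard_paths(name, n_shards=10):
    """Build shard file names in the same pattern as `write_tfrecords`."""
    return [f"{name}.tfrecord-{index:05d}-of-{n_shards:05d}" for index in range(n_shards)]

def shard_sizes(n_records, n_shards=10):
    """Count how many records land in each shard under round-robin assignment."""
    return Counter(index % n_shards for index in range(n_records))

paths = shard_paths("imdb_train")
print(paths[0])    # imdb_train.tfrecord-00000-of-00010
print(paths[-1])   # imdb_train.tfrecord-00009-of-00010

sizes = shard_sizes(25_000)
print(sizes[0], sizes[9])  # 2500 2500
```

Because the record index is taken modulo the shard count, shard sizes differ by at most one record.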


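A function `preprocess_text` is defined to parse each serialized record from the TFRecord files back into a `(review, label)` pair. Parsing is necessary because the files store opaque byte strings, which must be decoded into tensors before the model can be trained on them.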


In [5]:
def preprocess_text(tfrecord):
    feature_descriptions = {
        "review": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1),
    }
    parsed_example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    review = parsed_example["review"]
    label = parsed_example["label"]
    return review, label


In [6]:
# Test with a single TFRecord
for raw_record in tf.data.TFRecordDataset(train_filepaths).take(1):
    review, label = preprocess_text(raw_record)
    print("Review:", review.numpy().decode('utf-8'))
    print("Label:", label.numpy())


Review: Twenty five years ago, I showed this film in some children's classes in Entomology and can still remember the excitement of the kids; they were spellbound! It is not just about the termites who have built and live in the "Castles of Clay," but also about the other animals who use the mounds. There is a fantastic scene in which a cobra fights a monitor lizard while a colony of mongooses watch. It is a not only good for entomology classes, but also for teaching about ecology since there is so much about the interactions between the termites and other organisms and the whole ecology of all of the organisms that live in and around the mounds. <br /><br />I wish it was available on DVD, so that I could watch it again and show others.
Label: 1


Text data is tokenized and vectorized using TensorFlow's `TextVectorization` layer. This process converts text into numerical data that a neural network can process.

In [7]:
# Define the size of the vocabulary and maximum sequence length
vocab_size = 10000
max_length = 250

# Extract only the review text to adapt the vectorization layer.
# train_dataset yields (review, label) pairs; batching speeds up adapt()
train_reviews = train_dataset.map(lambda review, label: review).batch(1024)

# Create and adapt the TextVectorization layer
text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=max_length)
text_vectorization.adapt(train_reviews)
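As a rough illustration of what integer-mode vectorization does, the pure-Python sketch below lowercases text, strips punctuation, builds a frequency-ranked vocabulary, and pads or truncates to a fixed length. It is a simplified stand-in, not the Keras implementation. The function names and toy reviews are assumptions, while the reserved ids (0 for padding, 1 for out-of-vocabulary tokens) follow the layer's documented behavior.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation, similar in spirit to the layer's default."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Rank words by frequency; ids 0 (padding) and 1 (unknown) are reserved."""
    counts = Counter(word for text in texts for word in standardize(text).split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(most_common, start=2)}

def vectorize(text, vocab, sequence_length):
    """Map words to ids, then truncate or right-pad with zeros to a fixed length."""
    ids = [vocab.get(word, 1) for word in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_reviews = ["Great movie, great cast, great score!", "Terrible movie."]
toy_vocab = build_vocab(toy_reviews, max_tokens=4)
print(toy_vocab)  # {'great': 2, 'movie': 3}
print(vectorize("A great but terrible movie", toy_vocab, sequence_length=8))
# [1, 2, 1, 1, 3, 0, 0, 0]
```

With `max_tokens=4`, only the two most frequent words receive their own ids, so every other word collapses to the unknown id 1. The notebook's `vocab_size = 10000` applies the same cap at a larger scale.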


The datasets are prepared with batching and prefetching, which are key techniques for efficient data loading.

In [8]:
def imdb_dataset(filepaths, shuffle_buffer_size=None, batch_size=32, prefetch_buffer=tf.data.AUTOTUNE):
    dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=tf.data.AUTOTUNE)
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess_text, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size)
    # Apply text_vectorization to the entire batch
    dataset = dataset.map(lambda reviews, labels: (text_vectorization(reviews), labels), num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.prefetch(prefetch_buffer)

# Create the datasets
train_set = imdb_dataset(train_filepaths, shuffle_buffer_size=25000, batch_size=32)
valid_set = imdb_dataset(valid_filepaths, batch_size=32)
test_set = imdb_dataset(test_filepaths, batch_size=32)
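For a sense of scale, the sketch below computes how many batches each split produces with `batch_size=32`. It assumes the standard split sizes (25,000 training reviews, and 12,500 each for validation and test after halving the original test set) and that the final partial batch is kept, which is the default for `Dataset.batch`. The helper name is hypothetical.

```python
import math

def batches_per_epoch(n_examples, batch_size=32):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)

print(batches_per_epoch(25_000))  # 782 training steps per epoch
print(batches_per_epoch(12_500))  # 391 batches for validation, and for test
print(25_000 % 32)                # 8 reviews in the final training batch
```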


# 2. Text Analysis Using CNN

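A CNN model is constructed with an Embedding layer followed by one-dimensional convolution and pooling layers. CNNs are best known for image data, but a `Conv1D` filter sliding over a sequence of word embeddings acts like a learned n-gram detector (here, over windows of 5 tokens), which makes the architecture a reasonable fit for text as well.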


In [9]:
embedding_dim = 16

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
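The sketch below traces how the sequence length shrinks through the convolution and pooling layers and estimates the parameter count of each layer. It assumes the Keras defaults (`'valid'` padding, stride 1 for convolutions, pooling stride equal to the pool size); the helper names are hypothetical, and `model.summary()` remains the authoritative check.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(kernel_size, in_channels, filters):
    """One kernel_size x in_channels weight block per filter, plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

# Sequence length after each layer, starting from max_length = 250
after_conv1 = conv1d_length(250, 5)
after_pool = maxpool1d_length(after_conv1, 4)
after_conv2 = conv1d_length(after_pool, 5)
print(after_conv1, after_pool, after_conv2)  # 246 61 57

# Trainable parameters: embedding, two convolutions, and the dense output unit
params = [10_000 * 16, conv1d_params(5, 16, 128), conv1d_params(5, 128, 64), 64 + 1]
print(params, sum(params))  # [160000, 10368, 41024, 65] 211457
```

Most of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model's size.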


The model is trained on the movie reviews dataset. The training process involves adjusting the weights of the network to minimize the loss function, which in this case is binary cross-entropy, suitable for binary classification tasks.
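For reference, binary cross-entropy averages -[y·log(p) + (1 - y)·log(1 - p)] over the examples, where y is the label and p the predicted probability of the positive class. The pure-Python sketch below is illustrative only; the clipping constant `eps` and the example probabilities are assumptions.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
# All four predictions are on the correct side: low loss
print(binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2]))
# Same, except the last negative review is predicted positive with 99.9% confidence
print(binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.999]))
```

A single confidently wrong prediction raises the mean loss sharply while lowering accuracy by only one example, which is why loss and accuracy can tell different stories about the same model.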

In [10]:
# Train the model
history = model.fit(train_set, epochs=10, validation_data=valid_set)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


The model's performance is monitored on the validation set during training and measured on the test set after training. Test accuracy is about 84%, so the model separates positive from negative reviews reasonably well. The test loss of about 1.13 is high for that accuracy, however, which suggests the model makes some very confident wrong predictions; this is often a sign of overfitting.

In [11]:
# Evaluate the model
loss, accuracy = model.evaluate(test_set)
print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  1.1342061758041382
Accuracy:  0.8395199775695801


# 3. Text Analysis Using RNN

A separate RNN model is constructed with an Embedding layer followed by a single LSTM layer. RNNs, and particularly LSTMs, are well suited to sequence data like text, because they process tokens in order and can carry information across long spans of a sequence.

In [13]:
model_rnn = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),  # one row per id produced by text_vectorization
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model_rnn.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
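The sketch below estimates the parameter counts of the recurrent model. It assumes the standard LSTM formulation with four gates and one bias vector per gate. The helper names are hypothetical, and the two embedding sizes compare a table sized to the 10,000-token vocabulary with a larger table of 20,001 rows.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def embedding_param_count(n_rows, embedding_dim):
    """One embedding vector per token id."""
    return n_rows * embedding_dim

print(lstm_param_count(64, 64))           # 33024
print(embedding_param_count(10_000, 64))  # 640000
print(embedding_param_count(20_001, 64))  # 1280064
```

An embedding table larger than the vocabulary still runs, but the extra rows are never looked up, so they add parameters without adding capacity.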


The RNN model is also trained and evaluated on the same dataset. The performance metrics are closely observed to compare with the CNN model's performance.

In [14]:
# Train the model
history = model_rnn.fit(train_set, epochs=10, validation_data=valid_set)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


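The RNN's test accuracy (about 78%) is roughly six percentage points below the CNN's (about 84%), although its test loss (about 0.52) is less than half the CNN's (about 1.13). This difference could be a point of analysis to understand how each model type processes text data differently.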

In [15]:
# Evaluate the model
loss, accuracy = model_rnn.evaluate(test_set)
print("Loss: ", loss)
print("Accuracy: ", accuracy)

Loss:  0.5249744057655334
Accuracy:  0.782480001449585


# 4. Interpretation

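The CNN model shows higher test accuracy than the RNN model (84.0% vs. 78.2%), while the RNN reaches a lower test loss (0.52 vs. 1.13). One possible reason for the accuracy gap is that convolutions capture local dependencies in text, such as short sentiment-bearing phrases, which may carry much of the signal in movie reviews. Another possible factor is that the LSTM reads every review as a full 250-token sequence, including the trailing padding, because the Embedding layer does not mask padded positions.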
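To put the accuracy gap in concrete terms, the sketch below converts the reported test accuracies into counts of correctly classified reviews. It assumes the test set holds 12,500 reviews (half of the dataset's original 25,000 test reviews); the helper name is hypothetical.

```python
def correct_count(accuracy, n_examples):
    """Convert an accuracy fraction back into a number of correct predictions."""
    return round(accuracy * n_examples)

n_test = 12_500  # assumed: half of the 25,000 original test reviews
cnn_correct = correct_count(0.8395199775695801, n_test)
rnn_correct = correct_count(0.782480001449585, n_test)
print(cnn_correct, rnn_correct, cnn_correct - rnn_correct)  # 10494 9781 713
```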

The results also illustrate the trade-offs between model architectures: which one is more suitable depends on the nature of the dataset and the task at hand, and a single training run per model is not enough to rank them in general.

Further experiments could include hyperparameter tuning, experimenting with different architectures, or using pre-trained models for transfer learning to potentially improve the results.
