SYSEN 5888 Spring 2026

Jonathan Lloyd

Homework 2, Question 2


Goal: ConvNets, while renowned for their prowess in image processing, have also demonstrated strong capabilities in handling sequential data such as text. In this problem, you will apply the principles of CNNs to a classic problem in natural language processing: sentiment analysis.

Tools: NumPy, PyTorch, Keras (TensorFlow)

Data: IMDB movie reviews dataset provided by `tensorflow.keras.datasets.imdb`

Task: Load the IMDB dataset from Keras using a vocabulary size of 2000 for tokenization and numericalization. Each review in the dataset is already pre-processed and encoded as a sequence of word indexes. A mapping between words and their corresponding indexes is provided by the `imdb.get_word_index()` method. For consistent input to the model, pad or truncate the reviews to a uniform length of 300 words using the `maxlen` argument of the Keras `pad_sequences` function. The preprocessed NumPy arrays will then be fed into a PyTorch CNN for sentiment classification.
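To make the padding and truncation step concrete, here is a minimal NumPy sketch of what "post" padding and "post" truncation do to variable-length reviews. The helper name `pad_or_truncate` and the toy sequences are hypothetical and only for illustration; this is not the Keras implementation, and the notebook itself calls `pad_sequences`. Note that `pad_sequences` defaults to `padding="pre"` and `truncating="pre"`, so the "post" behaviour has to be requested explicitly.

```python
import numpy as np


def pad_or_truncate(sequences, maxlen, pad_value=0):
    """Return an integer array of shape (len(sequences), maxlen).

    Illustrative stand-in for pad_sequences(..., padding="post",
    truncating="post"); it is not the Keras implementation.
    """
    out = np.full((len(sequences), maxlen), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = list(seq)[:maxlen]      # "post" truncation: drop tokens from the end
        out[row, :len(kept)] = kept    # "post" padding: pad_value fills the tail
    return out


# Toy word-index sequences (hypothetical, not taken from the dataset)
toy_reviews = [[1, 14, 22], [1, 5, 9, 8, 7, 6]]
padded = pad_or_truncate(toy_reviews, maxlen=4)
print(padded)  # rows: [1, 14, 22, 0] and [1, 5, 9, 8]
```

The short review is filled with zeros at the end and the long one loses its last two tokens, so every row has the same length.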

In [1]:
# Update packages in Colab server (no torchtext)
%pip install --force-reinstall "torch==2.3.0" --upgrade numpy pandas matplotlib tensorflow

In [2]:
import torch
import tensorflow as tf

print("PyTorch version:", torch.__version__)
print("TensorFlow version:", tf.__version__)

In [3]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hyperparameters for data processing
VOCAB_SIZE = 2000  # as specified in the assignment
MAX_LEN = 300      # fixed sequence length
BATCH_SIZE = 32
SHUFFLE_SEED = 42

# Load IMDB dataset from Keras (already tokenized and indexed)
(X_train_full, y_train_full), (X_test, y_test) = imdb.load_data(num_words=VOCAB_SIZE)

# Get the word -> index mapping (for analysis / interpretability)
word_index = imdb.get_word_index()

# Pad and truncate to fixed length of 300 tokens
X_train_full = pad_sequences(X_train_full, maxlen=MAX_LEN, padding="post", truncating="post")
X_test = pad_sequences(X_test, maxlen=MAX_LEN, padding="post", truncating="post")

# Create validation split from training data
np.random.seed(SHUFFLE_SEED)
indices = np.random.permutation(len(X_train_full))
X_train_full = X_train_full[indices]
y_train_full = np.array(y_train_full)[indices]

n_val = 1000
X_val = X_train_full[:n_val]
y_val = y_train_full[:n_val]
X_train = X_train_full[n_val:]
y_train = y_train_full[n_val:]

# Summary
print(f"Training samples: {len(X_train)}, Validation samples: {len(X_val)}, Test samples: {len(X_test)}")
print(f"Sequence length: {MAX_LEN}, Batch size: {BATCH_SIZE}, Vocab size (num_words): {VOCAB_SIZE}")
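The `word_index` mapping loaded above can be used to turn an encoded review back into text, which is handy for sanity-checking the preprocessing. The sketch below assumes the default `imdb.load_data` settings, where id 0 is padding, id 1 marks the start of a review, id 2 replaces out-of-vocabulary words, and every real word is stored at its `word_index` rank plus 3. The function name `decode_review`, the placeholder strings, and the toy mapping are hypothetical.

```python
def decode_review(encoded, word_index, index_from=3):
    """Map a sequence of IMDB integer ids back to a readable string."""
    # Real words are shifted by index_from relative to their rank in word_index
    reverse_index = {rank + index_from: word for word, rank in word_index.items()}
    # Reserved ids under the default imdb.load_data settings
    specials = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    tokens = []
    for token_id in encoded:
        token_id = int(token_id)
        if token_id in specials:
            tokens.append(specials[token_id])
        else:
            tokens.append(reverse_index.get(token_id, "<unk>"))
    return " ".join(tokens)


# Toy mapping for illustration only; in the notebook pass imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "great": 3}
print(decode_review([1, 4, 5, 2, 6, 0, 0], toy_word_index))
# <start> the movie <oov> great <pad> <pad>
```

In the notebook, `decode_review(X_train[0], word_index)` should print the first training review with trailing `<pad>` tokens wherever it was shorter than 300 words.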

Architecture:

The architecture of the convolutional neural network model for this problem is as follows:

Embedding Layer:
Input Vocabulary Size: 2000 words
Embedding Dimension: 16
Input Length: 300 words

Conv1D Layer:
Filters: 128
Kernel Size: 3
Activation: ReLU
Stride: 1
Padding: Valid

GlobalMaxPooling1D Layer

Dense Layer:
Units: 1
Activation: Sigmoid
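As a quick check on the specification above, the tensor shapes and parameter counts can be worked out by hand. The arithmetic below is a sketch that assumes the standard formula for a convolution without padding and one bias term per filter and per dense unit; the variable names are illustrative.

```python
VOCAB_SIZE, EMBED_DIM, MAX_LEN = 2000, 16, 300
NUM_FILTERS, KERNEL_SIZE, STRIDE = 128, 3, 1

# "Valid" padding: output length = floor((L_in - kernel) / stride) + 1
conv_out_len = (MAX_LEN - KERNEL_SIZE) // STRIDE + 1

embedding_params = VOCAB_SIZE * EMBED_DIM                           # one vector per word index
conv_params = EMBED_DIM * KERNEL_SIZE * NUM_FILTERS + NUM_FILTERS   # weights + biases
dense_params = NUM_FILTERS * 1 + 1                                  # weights + bias
total_params = embedding_params + conv_params + dense_params

print("Embedding output:", (MAX_LEN, EMBED_DIM))        # (300, 16)
print("Conv1D output:", (conv_out_len, NUM_FILTERS))    # (298, 128)
print("GlobalMaxPooling1D output:", (NUM_FILTERS,))     # (128,)
print("Parameters:", embedding_params, conv_params, dense_params, total_params)
# Parameters: 32000 6272 129 38401
```

The total can be compared against `sum(p.numel() for p in model.parameters())` once the PyTorch model is built; the two should agree if the model follows the specification.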

In [None]:
# PyTorch CNN Model Definition
class TextCNN(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int = 16,
        num_filters: int = 128,
        kernel_size: int = 3,
    ) -> None:
        super().__init__()
        # Embedding layer: maps word indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)

        # 1D convolution over the sequence (time) dimension
        self.conv = nn.Conv1d(
            in_channels=embed_dim,
            out_channels=num_filters,
            kernel_size=kernel_size,
            stride=1,   # matches the specified stride
            padding=0,  # no padding, i.e. "valid"
        )

        # Global max pooling over the time dimension
        self.global_max_pool = nn.AdaptiveMaxPool1d(output_size=1)

        # Final classification layer
        self.fc = nn.Linear(num_filters, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass.

        x: LongTensor of shape (batch_size, seq_len)
        returns: probabilities of shape (batch_size,)
        """
        # (batch, seq_len) -> (batch, seq_len, embed_dim)
        embedded = self.embedding(x)
        # (batch, seq_len, embed_dim) -> (batch, embed_dim, seq_len)
        embedded = embedded.permute(0, 2, 1)

        # Convolution + ReLU
        conv_out = torch.relu(self.conv(embedded))  # (batch, num_filters, L')

        # Global max pooling over time -> (batch, num_filters, 1)
        pooled = self.global_max_pool(conv_out).squeeze(-1)  # (batch, num_filters)

        # Linear layer to a single logit
        logits = self.fc(pooled).squeeze(-1)  # (batch,)

        # Sigmoid for binary sentiment probability
        probs = torch.sigmoid(logits)
        return probs


# Instantiate model (will be moved to appropriate device in training cell)
EMBED_DIM = 16
NUM_FILTERS = 128
KERNEL_SIZE = 3

# VOCAB_SIZE is defined in the preprocessing cell
model = TextCNN(vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM, num_filters=NUM_FILTERS, kernel_size=KERNEL_SIZE)
print(model)


Training: The model should be compiled using 'binary_crossentropy' as the loss function and 'adam' as the optimizer, with 'accuracy' as the main metric. In PyTorch, these correspond to `nn.BCELoss` (applied to the sigmoid output) and `optim.Adam`. A subset of the training data (1000 samples) should be set aside as a validation set, while the rest should be used for training. The model should be trained for a total of 30 (or 10) epochs, with a batch size of 32. After training, the model should be evaluated on the test data to obtain the final accuracy score. This gives a measure of how well the model generalizes to unseen reviews.
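One possible shape for the training helpers is sketched below. It assumes the model returns probabilities (as a sigmoid output does), so it pairs with `nn.BCELoss`, and it uses Adam with its default learning rate. The names `run_epoch`, `fit`, and `TinyStandIn`, as well as the random data, are hypothetical and exist only so the block runs on its own; in the notebook the stand-in would be replaced by `TextCNN` and the real arrays.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    """One pass over loader: trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.to(device)
            probs = model(inputs)              # assumed to be probabilities in [0, 1]
            loss = criterion(probs, targets)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * targets.size(0)
            correct += ((probs >= 0.5).float() == targets).sum().item()
            seen += targets.size(0)
    return total_loss / seen, correct / seen


def fit(model, train_loader, val_loader, epochs, device="cpu"):
    """Train with BCELoss + Adam and record per-epoch loss and accuracy."""
    model.to(device)
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters())
    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}
    for _ in range(epochs):
        train_loss, train_acc = run_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = run_epoch(model, val_loader, criterion, None, device)
        history["train_loss"].append(train_loss)
        history["train_acc"].append(train_acc)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)
    return history


class TinyStandIn(nn.Module):
    """Hypothetical stand-in so this block runs alone; use TextCNN in the notebook."""

    def __init__(self, vocab_size=2000, embed_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, 1)

    def forward(self, x):
        pooled = self.embedding(x).mean(dim=1)  # (batch, embed_dim)
        return torch.sigmoid(self.fc(pooled)).squeeze(-1)


# Random placeholder data, only to exercise the loop
torch.manual_seed(0)
X = torch.randint(1, 2000, (64, 300))
y = torch.randint(0, 2, (64,)).float()
train_loader = DataLoader(TensorDataset(X[:48], y[:48]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[48:], y[48:]), batch_size=32)
history = fit(TinyStandIn(), train_loader, val_loader, epochs=2)
print(history)
```

With the real data, the NumPy arrays would first be converted, for example with `torch.from_numpy(X_train).long()` for the inputs and `torch.from_numpy(y_train).float()` for the labels, since `nn.Embedding` expects integer indices and `nn.BCELoss` expects float targets. Test accuracy can then be obtained by calling `run_epoch` on a test loader without an optimizer.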

In [None]:
## Helper Functions

In [None]:
## Train and Run Model

# Instantiate model 


# Run experiment 

Visualization:
Plot the accuracy and loss for both the training and validation datasets across epochs to analyze how the model's performance evolves.
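A minimal plotting sketch follows. It assumes the per-epoch metrics are stored in a dictionary with the hypothetical keys `train_acc`, `val_acc`, `train_loss`, and `val_loss`, and it overlays the training and validation curves in two side-by-side panels, which is one layout among several. The numbers in `toy_history` are placeholders to make the block runnable, not results from the model.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training vs. validation accuracy and loss against the epoch number."""
    epochs = range(1, len(history["train_loss"]) + 1)
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, metric, label in zip(axes, ("acc", "loss"), ("Accuracy", "Loss")):
        ax.plot(epochs, history[f"train_{metric}"], marker="o", label="Training")
        ax.plot(epochs, history[f"val_{metric}"], marker="o", label="Validation")
        ax.set_xlabel("Epoch")
        ax.set_ylabel(label)
        ax.set_title(f"{label} per epoch")
        ax.legend()
    fig.tight_layout()
    return fig


# Placeholder values for illustration only
toy_history = {
    "train_acc": [0.70, 0.80, 0.88],
    "val_acc": [0.68, 0.78, 0.84],
    "train_loss": [0.60, 0.45, 0.32],
    "val_loss": [0.61, 0.47, 0.40],
}
fig = plot_history(toy_history)
```

Overlaying the two curves in each panel makes any gap between training and validation performance easy to see.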

In [None]:
# Plots across epochs
# Plot accuracy and loss for both training and validation - build a quad 

Deliverables: 

1. Model Accuracy and Loss Curves: A detailed report of the performance of the model, focusing on accuracy and loss curves.
2. Analysis of Model Performance: A thorough analysis should be conducted to discuss the results obtained from the model. This analysis should include:

a. Whether the model overfits or underfits the training data. 

b. Examination of the loss and accuracy curves to identify potential indicators of the model's behavior (such as plateaus or sharp changes).

3. Code and Resources: Please make sure to submit your working code files along with the final results and the plots.

4. Bonus (+1) Model Optimization: Consider experimenting with other architectures or hyperparameters to further optimize the model's performance. Discuss the outcomes of your experiments and the effect of different parameters on the accuracy and loss.
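For the overfitting analysis in deliverable 2, a few numbers can support what the curves show. The sketch below is a rough heuristic, not a formal test: it reports the epoch with the lowest validation loss, how many epochs were trained past that point, and the final gap between training and validation accuracy. The function name `summarize_history`, the dictionary keys, and the toy values are all hypothetical.

```python
def summarize_history(history):
    """Return simple indicators of overfitting from per-epoch metrics."""
    val_loss = history["val_loss"]
    # Epochs are reported 1-indexed to match the plots
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
    return {
        "best_val_loss_epoch": best_epoch,
        "epochs_past_best": len(val_loss) - best_epoch,
        "final_acc_gap": history["train_acc"][-1] - history["val_acc"][-1],
    }


# Placeholder values for illustration only, not results from the model
toy_history = {
    "train_acc": [0.70, 0.80, 0.88, 0.93, 0.97],
    "val_acc": [0.68, 0.78, 0.84, 0.84, 0.83],
    "val_loss": [0.60, 0.45, 0.40, 0.44, 0.50],
}
summary = summarize_history(toy_history)
print(summary)
```

A validation loss that bottoms out early while training accuracy keeps climbing, together with a widening accuracy gap, is the usual sign of overfitting. If both losses are still falling at the last epoch, the model may instead be undertrained.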