## Comparative Analysis of RNN vs. BERT on Text Classification

An educational exploration of Recurrent Neural Networks (RNNs) and Transformer-based models (specifically BERT) applied to NLP tasks.

We use the IMDB movie reviews dataset to evaluate and contrast the performance of these models on a sentiment analysis task.

Key Components:
* RNN Implementation: Develops an RNN model from scratch in PyTorch, focusing on its ability to handle sequential data and its performance on the sentiment analysis task
* Transformer Implementation: Fine-tunes a pre-trained BERT model from the Hugging Face transformers library on the same sentiment analysis task for direct comparison with the RNN model
* Evaluation and Comparison: Assesses both models by test-set accuracy, providing a side-by-side comparison of how well each one classifies textual sentiment

### RNN implementation

In [5]:
# installing older versions of torchtext so we can import the necessary modules
!pip install torchtext==0.6

Collecting torchtext==0.6
  Downloading torchtext-0.6.0-py3-none-any.whl (64 kB)
Installing collected packages: torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.16.0
    Uninstalling torchtext-0.16.0:
      Successfully uninstalled torchtext-0.16.0
Successfully installed torchtext-0.6.0


In [6]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchtext.datasets import IMDB
from torchtext.data import Field, BucketIterator, LabelField

ImportError: cannot import name 'Field' from 'torchtext.data' (/usr/local/lib/python3.10/dist-packages/torchtext/data/__init__.py)

In [3]:
# Define Fields (in TorchText Fields are abstractions for defining how data should be processed)
TEXT = Field(tokenize='spacy', tokenizer_language='en_core_web_sm', include_lengths=True) # specifies how to tokenize the input text
LABEL = LabelField(dtype=torch.float) # defines how the labels should be processed, in this case, setting their data type to a floating-point tensor suitable for binary classification tasks

In [4]:
# Load IMDB dataset
train_data, test_data = IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 50.1MB/s]


In [5]:
# Build Vocabulary (limiting it to the top 25,000 words) and
# Load Pre-trained Word Embeddings (to represent words as 100-dimensional vectors)
TEXT.build_vocab(train_data, max_size=25000, vectors="glove.6B.100d")
# Build a vocabulary from the labels in the training data (it's much smaller and just consists of the unique class labels)
LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:39, 5.41MB/s]                           
100%|█████████▉| 399999/400000 [00:19<00:00, 20463.65it/s]
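
To make the vocabulary step above more concrete, here is a minimal pure-Python sketch of what a size-limited vocabulary does: count tokens, keep the most frequent ones, and map everything else to an unknown token. The helper names `build_vocab` and `numericalize` are hypothetical and this is not torchtext's implementation; it only illustrates the idea behind `max_size`.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Map the `max_size` most frequent tokens to integer ids, placed after the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency, then alphabetically, so ties are broken deterministically
    most_common = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [tok for tok, _ in most_common]  # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}        # string -> index
    return stoi, itos

def numericalize(tokens, stoi):
    """Replace each token by its id, falling back to <unk> for out-of-vocabulary tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

texts = [["a", "great", "movie"], ["a", "dull", "movie"], ["great", "cast"]]
stoi, itos = build_vocab(texts, max_size=3)
ids = numericalize(["a", "great", "plot"], stoi)
print(itos)  # special tokens first, then the three most frequent words
print(ids)   # "plot" is not in the vocabulary, so it maps to the <unk> id
```

Because the special tokens are added on top of the `max_size` limit, the resulting vocabulary is slightly larger than the limit itself.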


In [6]:
# Create Iterators (analogous to PyTorch's DataLoaders)
# for efficient batching, shuffling, and loading of the dataset during training and testing
train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), batch_size=64, sort_within_batch=True)
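
The benefit of bucketing can be sketched in a few lines: if reviews of similar length are batched together, far less padding is needed than with arbitrary batches. The functions below (`bucket_batches`, `padding_needed`) are hypothetical helpers written for illustration; this is a simplified sketch and not how `BucketIterator` is implemented internally.

```python
import random

def bucket_batches(examples, batch_size, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    # Sort indices by sequence length so that neighbours have similar lengths
    order = sorted(range(len(examples)), key=lambda i: len(examples[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    # Within each batch, put the longest sequence first
    return [[examples[i] for i in sorted(b, key=lambda i: -len(examples[i]))]
            for b in batches]

def padding_needed(batches):
    """Count pad positions if every batch is padded to its own longest sequence."""
    return sum(max(len(x) for x in b) * len(b) - sum(len(x) for x in b)
               for b in batches)

# Six toy "reviews" of very different lengths
examples = [[0] * n for n in (3, 50, 4, 48, 5, 52)]
bucketed = bucket_batches(examples, batch_size=3)
naive = [examples[0:3], examples[3:6]]  # batches taken in the original order

print(padding_needed(naive), padding_needed(bucketed))
```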


In [10]:
# Define RNN Model
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text, text_lengths):
        # Convert token indices to embeddings
        embedded = self.embedding(text)
        # Pack the padded batch so the RNN skips padding positions; the lengths must live on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu(), enforce_sorted=False)
        # Process the packed sequence step by step; `hidden` holds the final hidden state of each sequence
        packed_output, hidden = self.rnn(packed_embedded)
        # Drop the leading dimension of size 1: [1, batch, hidden] -> [batch, hidden]
        hidden = hidden.squeeze(0)
        # Map the final hidden state to the output logits
        return self.fc(hidden)
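
To illustrate what "processing sequentially while maintaining a hidden state" means, here is a tiny pure-Python sketch of a vanilla (Elman) RNN recurrence, h_t = tanh(W_ih x_t + W_hh h_{t-1} + b). The weights and inputs are made-up toy values and the function names are hypothetical; the real `nn.RNN` layer does the same kind of computation on batched tensors.

```python
import math

def rnn_step(x, h_prev, W_ih, W_hh, b):
    """One recurrence step: combine the current input with the previous hidden state."""
    h_new = []
    for j in range(len(h_prev)):
        total = b[j]
        total += sum(W_ih[j][k] * x[k] for k in range(len(x)))
        total += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def rnn_forward(sequence, W_ih, W_hh, b):
    """Run the recurrence over a whole sequence and return the final hidden state."""
    h = [0.0] * len(b)  # initial hidden state of zeros
    for x in sequence:
        h = rnn_step(x, h, W_ih, W_hh, b)
    return h

# Toy parameters: 2-dimensional inputs, 2-dimensional hidden state
W_ih = [[0.5, -0.5], [0.25, 0.75]]
W_hh = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

seq = [[1.0, 0.0], [0.0, 1.0]]
final_hidden = rnn_forward(seq, W_ih, W_hh, b)
reversed_hidden = rnn_forward(seq[::-1], W_ih, W_hh, b)
print(final_hidden, reversed_hidden)  # the two differ: token order matters
```

Each step depends on the previous one, which is why this loop cannot be parallelized across time steps.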


In [11]:
# Instantiate Model, Loss, and Optimizer
input_dim = len(TEXT.vocab)
embedding_dim = 100  # Same as GloVe vectors
hidden_dim = 256
output_dim = 1

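model = RNN(input_dim, embedding_dim, hidden_dim, output_dim)
# Note: the GloVe vectors loaded by build_vocab are not copied into the model here,
# so the embedding layer starts from random weights. To use them, one could run:
# model.embedding.weight.data.copy_(TEXT.vocab.vectors)
optimizer = optim.SGD(model.parameters(), lr=1e-3)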
criterion = nn.BCEWithLogitsLoss()
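
For intuition about the loss used above: binary cross-entropy on logits applies a sigmoid and the cross-entropy in one step, which avoids overflow for large logits. The sketch below shows one common numerically stable formulation for a single example in pure Python; it is an illustration of the idea, not PyTorch's source code, and the function names are hypothetical.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Textbook version: sigmoid first, then cross-entropy (breaks down for extreme logits)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Both versions agree for moderate logits
print(bce_with_logits(2.0, 1.0), naive_bce(2.0, 1.0))

# A logit of 0 means "50/50", so the loss is log(2) whatever the label is
print(bce_with_logits(0.0, 1.0))

# The stable form stays finite where the naive one would hit log(0)
print(bce_with_logits(1000.0, 0.0))
```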

In [12]:
# Training Loop (simplified)
for epoch in range(5):
    total_loss = 0.0  # Running loss over the current window of 100 batches
    for i, batch in enumerate(train_iterator):
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths).squeeze(1)
        loss = criterion(predictions, batch.label)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        # Log after every 100 batches
        if (i + 1) % 100 == 0:
            print(f'Epoch: {epoch+1}, Batch: {i+1}, Average Loss: {total_loss/100:.4f}')
            total_loss = 0.0  # Reset total loss for the next 100 batches

# Note that RNNs, especially on large datasets, can be slower to train due to their sequential nature,
# which can be less parallelizable compared to models like Transformers or CNNs

In [14]:
# Save the trained RNN model's state dictionary and the tokenizer used

# Save in the current working directory of Colab notebook
torch.save(model.state_dict(), 'rnn_imdb_model.pth')
torch.save(TEXT, 'rnn_imdb_tokenizer.pt')
torch.save(LABEL, 'rnn_imdb_label.pt')

# Save to Google Drive
from google.colab import drive
drive.mount('/content/drive')
torch.save(model.state_dict(), '/content/drive/My Drive/Projects/ML_daily/models/rnn_imdb_model.pth')
torch.save(TEXT, '/content/drive/My Drive/Projects/ML_daily/models/rnn_imdb_tokenizer.pt')
torch.save(LABEL, '/content/drive/My Drive/Projects/ML_daily/models/rnn_imdb_label.pt')



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
# Evaluation of the trained RNN performance

model.eval()  # Ensure the RNN model is in evaluation mode
rnn_total_acc, rnn_total_count = 0, 0

with torch.no_grad():
    for batch in test_iterator:
        text, text_lengths = batch.text
        labels = batch.label.type(torch.float)
        predictions = model(text, text_lengths).squeeze(1)
        predicted_labels = torch.sigmoid(predictions) >= 0.5
        rnn_total_acc += (predicted_labels == labels).sum().item()
        rnn_total_count += labels.size(0)

rnn_accuracy = rnn_total_acc / rnn_total_count * 100
print(f'Accuracy of the RNN model on the test set: {rnn_accuracy:.2f}%')



Accuracy of the RNN model on the test set: 53.27%
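
The accuracy computation above can be traced on a handful of made-up logits. This pure-Python sketch (the values and the helper name `binary_accuracy` are illustrative assumptions) applies the same rule: a sigmoid output of at least 0.5 counts as a positive prediction. Since the IMDB test split is balanced between positive and negative reviews, 50% corresponds to chance level, so 53.27% is only marginally better than guessing.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy(logits, labels, threshold=0.5):
    """Percentage of examples whose thresholded sigmoid output matches the label."""
    preds = [1.0 if sigmoid(z) >= threshold else 0.0 for z in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels) * 100

# Four toy examples: raw model outputs and their true labels
logits = [2.3, -1.1, 0.2, -0.4]
labels = [1.0, 0.0, 0.0, 0.0]

acc = binary_accuracy(logits, labels)
print(f'{acc:.2f}%')  # the third example is misclassified
```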


### Transformer implementation



In [None]:
# Install the Transformers Library
pip install transformers


In [1]:
# Import Necessary Modules
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of PyTorch's
import torch
from torch.utils.data import DataLoader
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset
from torchtext.vocab import build_vocab_from_iterator



### 1. Fine-tune BERT on IMDB dataset

In [9]:
# Load Pre-trained Model and Tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # num_labels = 2 for binary classification


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Prepare the Dataset
# We use the IMDB dataset for this example as well

# Load IMDB dataset
train_iter, test_iter = IMDB(split=('train', 'test')) # iterable-style datasets from the TorchText IMDB dataset
# transform iterable-style datasets into map-style ones, so they can be indexed and used with a DataLoader for batch processing
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# BERT ships with its own WordPiece vocabulary, so no separate vocabulary is built here:
# the pre-trained `tokenizer` loaded above maps text directly to the token IDs the model expects

# Depending on the torchtext version, IMDB yields 'neg'/'pos' strings or integer labels 1 (negative) / 2 (positive)
label_to_id = {'neg': 0, 'pos': 1, 1: 0, 2: 1}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Function to process a batch of data points and prepare them for input into the model
def collate_batch(batch):
    labels = torch.tensor([label_to_id[_label] for (_label, _text) in batch], dtype=torch.int64)  # Convert labels to numerical IDs
    texts = [_text for (_label, _text) in batch]
    # Tokenize the whole batch at once, truncating to BERT's 512-token limit and padding to the longest review in the batch
    encoding = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
    # Move tensors to the specified device (GPU/CPU) and return
    return labels.to(device), encoding['input_ids'].to(device), encoding['attention_mask'].to(device)

# DataLoader is a utility class provided by PyTorch that abstracts the complexity of iterating over datasets
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)


AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)

In [None]:
# Training Loop
optimizer = AdamW(bert_model.parameters(), lr=5e-5)

bert_model.to(device)
bert_model.train()

for epoch in range(3):  # Loop over the dataset multiple times
    for i, (labels, input_ids, attention_mask) in enumerate(train_dataloader):
        optimizer.zero_grad()
        # The attention mask tells BERT which positions are real tokens and which are padding
        outputs = bert_model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        if i % 100 == 99:  # Print every 100 mini-batches
            print(f'Epoch: {epoch + 1}, Batch: {i + 1}, Loss: {loss.item()}')
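
For readers curious about what `optimizer.step()` does with AdamW, here is a single-parameter sketch of the update rule in pure Python. Only `lr=5e-5` comes from the notebook; the other hyperparameter values (`beta1`, `beta2`, `eps`, `weight_decay`) are assumed typical values and library defaults differ, so treat this as an illustration rather than the exact computation performed during fine-tuning.

```python
import math

def adamw_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter at step t (t starts at 1)."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and of the squared gradient
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Step size is scaled by the gradient's running magnitude
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

new_param, m, v = adamw_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(new_param, m, v)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `lr` regardless of the gradient's size.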


In [None]:
# Save the trained BERT model's state dictionary and the tokenizer used

# Save in the current working directory of Colab notebook
bert_model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_tokenizer')

# Save to Google Drive
from google.colab import drive
drive.mount('/content/drive')
bert_model.save_pretrained('/content/drive/My Drive/Projects/ML_daily/models/bert_imdb_model')
tokenizer.save_pretrained('/content/drive/My Drive/Projects/ML_daily/models/bert_imdb_tokenizer')

### 2. Load an already fine-tuned model (DistilBERT)

In [18]:
pip install transformers



In [2]:
#  Load the Fine-Tuned Model and Tokenizer

from google.colab import drive
drive.mount('/content/drive')

model_path = '/content/drive/My Drive/Projects/DistilBERT_finetune_imdb/fine_tuned_model'
tokenizer_path = '/content/drive/My Drive/Projects/DistilBERT_finetune_imdb/fine_tuned_tokenizer'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# Load the Fine-Tuned Model

from transformers import DistilBertForSequenceClassification

bert_model_loaded = DistilBertForSequenceClassification.from_pretrained(model_path)
bert_model_loaded.eval() # Set the model to evaluation mode


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [4]:
# Load the Tokenizer

from transformers import DistilBertTokenizer

tokenizer_loaded = DistilBertTokenizer.from_pretrained(tokenizer_path)

In [17]:
!pip install --upgrade torchtext

Collecting torchtext
  Downloading torchtext-0.17.0-cp310-cp310-manylinux1_x86_64.whl (2.0 MB)
Collecting torch==2.2.0 (from torchtext)
  Downloading torch-2.2.0-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
Collecting torchdata==0.7.1 (from torchtext)
  Downloading torchdata-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.7 MB)
Collecting typing-extensions>=4.8.0 (from torch==2.2.0->torchtext)
  Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch==2.2.0->torchtext)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)

In [7]:
pip install 'portalocker>=2.0.0'

Collecting portalocker>=2.0.0
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [13]:
from torch.utils.data.dataset import IterableDataset

class IMDBDataset(IterableDataset):
    def __init__(self, data_iter, tokenizer, label_to_id, device, max_length=512):
        super().__init__()
        self.data_iter = data_iter
        self.tokenizer = tokenizer
        self.label_to_id = label_to_id
        self.device = device
        self.max_length = max_length

    def process_example(self, example):
        label, text = example
        encoding = self.tokenizer(text, padding='max_length', truncation=True, max_length=self.max_length, return_tensors="pt")
        encoding = {key: val.squeeze().to(self.device) for key, val in encoding.items()}
        label_id = torch.tensor(self.label_to_id[label], dtype=torch.long).to(self.device)
        return encoding, label_id

    def __iter__(self):
        for example in self.data_iter:
            yield self.process_example(example)

# Load the IMDB dataset iterators
train_iter, test_iter = IMDB(split=('train', 'test'))

# Create the custom dataset instances
train_dataset = IMDBDataset(train_iter, tokenizer_loaded, label_to_id, device)
test_dataset = IMDBDataset(test_iter, tokenizer_loaded, label_to_id, device)

# Function to collate data points into batches
def collate_fn(batch):
    input_ids = torch.stack([item[0]['input_ids'] for item in batch])
    attention_mask = torch.stack([item[0]['attention_mask'] for item in batch])
    labels = torch.stack([item[1] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}

# Create DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=8, collate_fn=collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=8, collate_fn=collate_fn)
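
With `padding='max_length'`, every review is padded to 512 tokens even when it is much shorter. An alternative is to pad each batch only to its longest member, which Hugging Face tokenizers support via `padding=True`. The pure-Python sketch below shows what that padding and the matching attention mask look like; `pad_batch` is a hypothetical helper and the token ids are illustrative, with 0 assumed as the padding id (consistent with `padding_idx=0` in the model printout above).

```python
def pad_batch(token_id_lists, pad_id=0, max_length=512):
    """Truncate to max_length, pad to the longest sequence in the batch, and build attention masks."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (ids are illustrative)
batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gains two pad positions, and its mask ends in two zeros so those positions do not influence attention.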


In [8]:
# Prepare the Dataset
# We use the IMDB dataset for this example as well

from torch.utils.data import DataLoader
from torchtext.datasets import IMDB
from torchtext.data.utils import get_tokenizer
from torchtext.data.functional import to_map_style_dataset

# Function to tokenize and encode a single example
def tokenize_and_encode(example):
    # Each IMDB example is a (label, text) tuple, not a dictionary
    _label, text = example
    # Tokenize and encode the text using the loaded tokenizer
    encoding = tokenizer_loaded(text, padding='max_length', truncation=True, max_length=512, return_tensors="pt")
    return encoding

# Function to process a batch of data points and prepare them for input into the model
def collate_batch(batch):
    # Initialize lists for labels and encoded texts
    input_ids_list, attention_mask_list, label_list = [], [], []
    for (_label, _text) in batch:
        # Tokenize and encode text
        encoding = tokenizer_loaded(_text, padding='max_length', truncation=True, max_length=512, return_tensors="pt")
        # Append encoded inputs and attention masks to their respective lists
        input_ids_list.append(encoding['input_ids'])
        attention_mask_list.append(encoding['attention_mask'])
        # Convert text labels to numerical IDs and append to label_list
        label_list.append(label_to_id[_label])
    # Convert lists to tensors and stack them
    input_ids = torch.stack(input_ids_list).squeeze(1).to(device)
    attention_mask = torch.stack(attention_mask_list).squeeze(1).to(device)
    labels = torch.tensor(label_list, dtype=torch.long).to(device)
    # Return a dictionary of tensors
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Load IMDB dataset
train_iter, test_iter = IMDB(split=('train', 'test'))
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

# Define label mapping (depending on the torchtext version, IMDB yields 'neg'/'pos' strings or integers 1 (negative) / 2 (positive))
label_to_id = {'neg': 0, 'pos': 1, 1: 0, 2: 1}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Set device

# Create DataLoaders for training and testing sets
# collate_batch expects raw (label, text) pairs and tokenizes them itself, so the datasets are passed in directly
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)


AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)

In [None]:
# Prepare the Dataset
# We use the IMDB dataset for this example as well



# Function to process a batch of data points and prepare them for input into the model
# (the tokenizer's vocabulary, a mapping of token strings to numerical indices, is already built and loaded)
def collate_batch(batch):
    label_list, input_ids_list, attention_mask_list = [], [], []
    for (_label, _text) in batch:
        label_list.append(label_to_id[_label]) # Convert text labels to numerical IDs and append to label_list
        encoding = tokenizer_loaded(_text, truncation=True, padding='max_length', max_length=512, return_tensors='pt')
        input_ids_list.append(encoding['input_ids'].squeeze(0))  # Remove batch dimension
        attention_mask_list.append(encoding['attention_mask'].squeeze(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    input_ids = torch.stack(input_ids_list)
    attention_masks = torch.stack(attention_mask_list)

    return label_list.to(device), input_ids.to(device), attention_masks.to(device)

# Depending on the torchtext version, IMDB yields 'neg'/'pos' strings or integer labels 1 (negative) / 2 (positive)
label_to_id = {'neg': 0, 'pos': 1, 1: 0, 2: 1}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# DataLoader instances for both the training and testing datasets
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False, collate_fn=collate_batch)



In [14]:
# Evaluation of the fine-tuned transformer model

bert_model_loaded.eval()  # Ensure the BERT model is in evaluation mode
bert_total_acc, bert_total_count = 0, 0

with torch.no_grad():
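    for i, (labels, input_ids, attention_mask) in enumerate(test_dataloader):
        # Pass the attention mask so that padding tokens are ignored; labels are not needed for inference
        outputs = bert_model_loaded(input_ids, attention_mask=attention_mask)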
        logits = outputs.logits
        bert_total_acc += (logits.argmax(1) == labels).sum().item()
        bert_total_count += labels.size(0)

bert_accuracy = bert_total_acc / bert_total_count * 100
print(f'Accuracy of the BERT model on the test set: {bert_accuracy:.2f}%')


NameError: name 'test_dataloader' is not defined

In [16]:
bert_model_loaded.eval()  # Ensure the BERT model is in evaluation mode
bert_total_acc, bert_total_count = 0, 0

with torch.no_grad():
  for batch in test_dataloader:
    print(batch)  # Just to test if iteration works
    break  # Break after the first batch to avoid lengthy outputs

  for batch in test_dataloader:
      # Extract input_ids, attention_mask, and labels from the batch
      input_ids = batch['input_ids']
      attention_mask = batch['attention_mask']
      labels = batch['labels']

      # Forward pass, no need to specify labels here unless you're calculating loss
      outputs = bert_model_loaded(input_ids=input_ids, attention_mask=attention_mask)

      # Extract logits and compute accuracy
      logits = outputs.logits
      bert_total_acc += (logits.argmax(1) == labels).sum().item()
      bert_total_count += labels.size(0)

bert_accuracy = bert_total_acc / bert_total_count * 100
print(f'Accuracy of the BERT model on the test set: {bert_accuracy:.2f}%')


AttributeError: 'NoneType' object has no attribute 'Lock'
This exception is thrown by __iter__ of _MemoryCellIterDataPipe(remember_elements=1000, source_datapipe=_ChildDataPipe)
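
Unlike the RNN, which outputs a single logit per review, the sequence classification head used here outputs one logit per class, and the prediction is the index of the larger one. The pure-Python sketch below traces the `logits.argmax(1) == labels` accuracy computation on made-up values (the logits, labels, and helper names are illustrative assumptions). It also shows that applying softmax first does not change the predicted class.

```python
import math

def softmax(row):
    """Turn a row of logits into probabilities (shifted by the max for numerical stability)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest entry in the row."""
    return max(range(len(row)), key=lambda i: row[i])

# Three toy examples with two logits each: [negative, positive]
logits = [[1.2, -0.3], [-0.5, 0.9], [0.1, 0.4]]
labels = [0, 1, 0]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels) * 100
print(predictions, f'{accuracy:.2f}%')
```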

In [None]:
# Comparison Summary

print(f'\nComparison Summary:')
print(f'RNN Model Accuracy: {rnn_accuracy:.2f}%')
print(f'BERT Model Accuracy: {bert_accuracy:.2f}%')
