In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np

In [2]:
# Load the IMDB dataset
dataset, info = tfds.load('imdb_reviews', as_supervised=True, with_info=True)
train_dataset, test_dataset = dataset['train'], dataset['test']


Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


In [3]:
# Model checkpoints to compare, and a dict that collects each model's metrics
model_names = ["distilbert-base-uncased", "bert-base-uncased", "roberta-base"]
results = {}

In [4]:
# Function to tokenize the text data
def tokenize_function(text, label, tokenizer):
    tokens = tokenizer(text.numpy().decode('utf-8'), truncation=True, padding='max_length', max_length=128)
    input_ids = tf.reshape(tokens['input_ids'], [128])
    attention_mask = tf.reshape(tokens['attention_mask'], [128])
    return input_ids, attention_mask, label
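
To make the effect of `truncation=True, padding='max_length', max_length=128` more concrete, here is a minimal pure-Python sketch of what those options do to one sequence of token ids. The helper `pad_or_truncate`, the pad id `0`, and the example ids are illustrative assumptions, not the tokenizer's actual implementation. Real tokenizers also preserve the closing special token when truncating, and the pad id differs between checkpoints.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Hypothetical helper for illustration only; pad_id=0 is an assumption.
    """
    # Truncation: keep at most max_length ids.
    ids = list(token_ids[:max_length])
    # Attention mask: 1 marks a real token, 0 marks padding.
    mask = [1] * len(ids)
    # Padding: fill on the right until the sequence is max_length long.
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# A short sequence gets padded (made-up ids, small max_length for readability).
short_ids, short_mask = pad_or_truncate([5, 6, 7, 8], max_length=8)
# A long sequence gets cut down to max_length.
long_ids, long_mask = pad_or_truncate(list(range(1, 201)), max_length=128)

print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

Because every example ends up with the same length, the tensors can be reshaped to `[128]` and batched without any further padding logic.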

In [5]:
# Wrapper to prepare dataset with tokenizer
def preprocess_dataset(dataset, tokenizer):
    # Runs eagerly inside `tf.py_function`, so `text.numpy()` is available
    def encode(text, label):
        return tokenize_function(text, label, tokenizer)

    def to_features(text, label):
        # `tf.py_function` lets the Python tokenizer run inside the tf.data pipeline
        input_ids, attention_mask, label = tf.py_function(
            func=encode, inp=[text, label],
            Tout=(tf.int32, tf.int32, tf.int64))
        # `tf.py_function` loses static shape information, so restore it here;
        # without known shapes the model cannot be built from this dataset
        features = {
            "input_ids": tf.ensure_shape(input_ids, (128,)),
            "attention_mask": tf.ensure_shape(attention_mask, (128,)),
        }
        # Separate features and labels, as expected by `model.fit`
        return features, tf.ensure_shape(label, ())

    return dataset.map(to_features, num_parallel_calls=tf.data.AUTOTUNE)

In [6]:
# Training and evaluation loop for each model
for model_name in model_names:
    print(f"\nTraining with {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = TFAutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Preprocess the datasets
    train_data = preprocess_dataset(train_dataset, tokenizer)
    test_data = preprocess_dataset(test_dataset, tokenizer)

    # Batch the data for training
    train_data = train_data.batch(16)
    test_data = test_data.batch(16)

    # Compile and train the model
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])

    history = model.fit(
        train_data,
        epochs=1,
        validation_data=test_data
    )

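    # Evaluate the model: the argmax over the two logits is the predicted class
    predictions = model.predict(test_data).logits
    predictions = np.argmax(predictions, axis=1)
    # test_data is not shuffled, so the labels come out in the same order as the predictions
    true_labels = np.concatenate([y for x, y in test_data], axis=0)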

    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='binary')

    # Store the results
    results[model_name] = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }
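
The loop trains with `from_logits=True` and then takes `np.argmax` over the raw logits, never applying a softmax. The sketch below illustrates why that is safe: softmax preserves the ordering of the logits, so the argmax is the same either way. The logit values and the helpers `softmax` and `argmax` are hypothetical, written in plain Python for illustration rather than taken from the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for three reviews: [negative, positive].
logits_batch = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]
probs_batch = [softmax(row) for row in logits_batch]

preds_from_logits = [argmax(row) for row in logits_batch]
preds_from_probs = [argmax(row) for row in probs_batch]

print(preds_from_logits)
print(preds_from_probs)
```

Both lists are identical, so skipping the softmax at prediction time changes nothing about the predicted classes; it only matters if calibrated probabilities are needed.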


Training with distilbert-base-uncased...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 


Training with bert-base-uncased...



All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training with roberta-base...



Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti



In [7]:
# Display the results
for model_name, metrics in results.items():
    print(f"\nModel: {model_name}")
    print(f"Accuracy: {metrics['accuracy']:.4f}")
    print(f"Precision: {metrics['precision']:.4f}")
    print(f"Recall: {metrics['recall']:.4f}")
    print(f"F1 Score: {metrics['f1_score']:.4f}")

# Print the best model based on accuracy
best_model = max(results, key=lambda x: results[x]['accuracy'])
print(f"\nThe best model is {best_model} with accuracy: {results[best_model]['accuracy']:.4f}")



Model: distilbert-base-uncased
Accuracy: 0.8362
Precision: 0.7724
Recall: 0.9531
F1 Score: 0.8533

Model: bert-base-uncased
Accuracy: 0.8670
Precision: 0.8313
Recall: 0.9207
F1 Score: 0.8737

Model: roberta-base
Accuracy: 0.8484
Precision: 0.7824
Recall: 0.9653
F1 Score: 0.8643

The best model is bert-base-uncased with accuracy: 0.8670
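
The four numbers printed for each model come from `accuracy_score` and `precision_recall_fscore_support(..., average='binary')`. As a rough illustration of what those calls compute, the sketch below rebuilds the same quantities from raw counts in plain Python. The helper name `binary_metrics` and the toy labels are made up for the example, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def binary_metrics(true_labels, predictions, positive=1):
    """Accuracy, precision, recall and F1 for a binary task (illustrative helper)."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = len(pairs) - tp - fp - fn                                    # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1_score": f1}


# Toy data: 4 positive and 4 negative reviews (not taken from IMDB).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0]
toy = binary_metrics(toy_true, toy_pred)

for name, value in toy.items():
    print(f"{name}: {value:.4f}")
```

In the toy data the classifier finds three of four positives (recall 0.75) but also flags two negatives as positive (precision 0.60), the same high-recall, lower-precision pattern that all three models show above.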


In [None]:
The best model in this run is BERT (bert-base-uncased), with the highest accuracy (86.70%) and the highest F1 score (0.8737) after one epoch of fine-tuning. The likely reasons are outlined below.

Reasons for BERT's Superior Performance:
Balanced Architecture:
BERT is a full-sized transformer encoder whose bidirectional self-attention draws on context from both sides of each token, which helps it capture contextual dependencies in the IMDB reviews. RoBERTa shares this architecture and DistilBERT is a distilled version of it, so architecture alone does not explain the gap between the three models.

Pretraining Objectives:
BERT is pretrained with masked language modeling (MLM) and next sentence prediction (NSP). RoBERTa drops NSP and DistilBERT is trained by distillation from BERT, so the pretraining objective is one possible source of the difference, although this experiment does not isolate its effect.

Size and Complexity:
DistilBERT is a smaller, distilled version of BERT with roughly 40% fewer parameters (about 66M versus about 110M). BERT's additional capacity plausibly accounts for its roughly three-point accuracy advantage over DistilBERT (86.70% versus 83.62%).

Performance Balance:
BERT has the smallest gap between precision (83.13%) and recall (92.07%). Its recall is the lowest of the three models, so it misses slightly more positive reviews, but its precision is five to six points higher than the others, meaning far fewer negative reviews are labeled positive.
RoBERTa has the highest recall (96.53%), but its lower precision (78.24%) pulls its F1 score (0.8643) below BERT's. IMDB labels are binary, so lower precision here means more negative reviews misclassified as positive. BERT's higher precision and more balanced metrics make it the best model for this task in this run.
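
To see what the precision and recall figures mean in absolute terms, the reported metrics can be turned back into approximate error counts. The sketch below assumes the standard IMDB test split of 12,500 positive and 12,500 negative reviews and that label 1 is the positive class; the helper `confusion_from_metrics` is hypothetical, and because the inputs are rounded to four decimals the counts are estimates, not exact values.

```python
def confusion_from_metrics(precision, recall, n_pos=12500, n_neg=12500):
    """Estimate confusion-matrix counts from precision and recall.

    Assumes a balanced test set (n_pos / n_neg are assumptions of this sketch).
    """
    tp = recall * n_pos                    # recall = tp / n_pos
    fn = n_pos - tp                        # positives that were missed
    fp = tp * (1 - precision) / precision  # precision = tp / (tp + fp)
    tn = n_neg - fp
    accuracy = (tp + tn) / (n_pos + n_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "accuracy": accuracy, "f1": f1}


# (precision, recall) as printed in the results above.
reported = {
    "distilbert-base-uncased": (0.7724, 0.9531),
    "bert-base-uncased": (0.8313, 0.9207),
    "roberta-base": (0.7824, 0.9653),
}

estimates = {name: confusion_from_metrics(p, r) for name, (p, r) in reported.items()}
for name, c in estimates.items():
    print(f"{name}: FP~{c['fp']:.0f}  FN~{c['fn']:.0f}  "
          f"accuracy~{c['accuracy']:.4f}  F1~{c['f1']:.4f}")
```

Under these assumptions, the recomputed accuracy and F1 land within rounding error of the printed results, and BERT's estimated false positives are roughly a thousand fewer than RoBERTa's, which outweighs its few hundred additional false negatives.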
