# Using an open-source model to generate embeddings

Transformer models are the gold standard for NLP these days, and [Hugging Face](https://huggingface.co/) is the most prominent site for open-source NLP models.  I asked ChatGPT to tell me about the "best" open-source model for sentiment analysis on the IMDB dataset.  Here's what it had to say:

Selecting the "best" open-source model for creating text embeddings, especially for a task like sentiment analysis on the IMDB dataset, depends on several factors including the balance between performance (accuracy) and computational efficiency (speed and memory usage). As of my last update, here are a few top contenders from the Hugging Face Transformers library, which is a popular choice for accessing pre-trained models:

### 1. BERT (Bidirectional Encoder Representations from Transformers)
- **Pros**: BERT has set a new standard in NLP for a variety of tasks, including sentiment analysis. It's capable of understanding the context of words in search queries or sentences more effectively than previous models.
- **Model to Try**: `bert-base-uncased` for a balance between size and performance, or `bert-large-uncased` for higher accuracy at the cost of computational resources.

### 2. RoBERTa (Robustly optimized BERT approach)
- **Pros**: RoBERTa builds on BERT's language masking strategy, training the model more extensively on a larger dataset. It often outperforms BERT on several benchmarks.
- **Model to Try**: `roberta-base` or `roberta-large`.

### 3. DistilBERT
- **Pros**: Offers a good trade-off between performance and resource efficiency. DistilBERT is a smaller, faster, cheaper, and lighter version of BERT.
- **Model to Try**: `distilbert-base-uncased`.

### 4. ALBERT (A Lite BERT)
- **Pros**: ALBERT modifies BERT by factorizing the embedding layer and sharing parameters across the hidden layers, significantly reducing memory consumption and increasing training speed.
- **Model to Try**: `albert-base-v2` or `albert-large-v2`.

### 5. XLNet
- **Pros**: XLNet outperforms BERT on several benchmarks by using a permutation-based training method, which allows it to capture bidirectional contexts effectively.
- **Model to Try**: `xlnet-base-cased` or `xlnet-large-cased`.

### 6. Sentence Transformers
- **Pros**: Specifically designed for generating sentence embeddings, making it potentially more directly applicable for tasks like sentiment analysis. It uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity.
- **Model to Try**: `all-MiniLM-L6-v2` for a balance between performance and speed, or `paraphrase-MiniLM-L6-v2` for high-quality embeddings suitable for various tasks including sentiment analysis.

### Selection Criteria:
- **Task and Data Specificity**: The nature of your task (e.g., sentiment analysis) and the characteristics of your data (e.g., length of texts, domain-specific language) may make some models more suitable than others.
- **Computational Resources**: Larger models may offer better accuracy but require more computational power and longer inference times. Smaller models or distilled versions may be more practical for deployment, especially in resource-constrained environments.

### Conclusion:
For a balance between performance and computational efficiency, models like `distilbert-base-uncased`, `all-MiniLM-L6-v2`, or `paraphrase-MiniLM-L6-v2` from the Sentence Transformers library are strong candidates. These models provide a good starting point for creating embeddings for sentiment analysis on the IMDB dataset. Experimentation with a few models will help determine the best choice for your specific requirements.
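
The Sentence Transformers entry above mentions comparing embeddings with cosine similarity. As a minimal sketch of what that comparison computes, here it is on made-up three-dimensional vectors (the vectors and the `cosine_similarity` helper are assumptions for illustration, not model output or a library function):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" with made-up values
review_a = [0.9, 0.1, 0.0]
review_b = [0.8, 0.2, 0.1]    # points in nearly the same direction as review_a
review_c = [-0.7, 0.1, 0.6]   # points in a very different direction

sim_ab = cosine_similarity(review_a, review_b)
sim_ac = cosine_similarity(review_a, review_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.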

# Experiments

Below, we'll show how to use two of these models to create embeddings that can be used to train a simple classifier.  First, we'll show how to use the `all-MiniLM-L6-v2` Sentence Transformer model.  Then we'll show how to use the `distilbert-base-uncased` transformer.  The first three steps are the same for either approach.

### Step 1: Install additional libraries and import packages

In [1]:
# only need to do this once (per session)
# !pip install torch torchvision torchaudio
!pip install transformers
!pip install sentence-transformers
#! pip install torchdata


In [2]:
!pip install pytorch_lightning
!pip install torchmetrics



In [2]:
from torch.utils.data import DataLoader, TensorDataset
from torchtext.datasets import IMDB
from sentence_transformers import SentenceTransformer
import torch
import torch.nn as nn
import torchmetrics

from utils import train_model

### Step 2: Load the IMDB Dataset
The torchtext library provides a simple API to load the IMDB dataset (its dataset loaders are built on torchdata, which is why both packages are installed below). We will use this dataset to train and evaluate our sentiment analysis model.

In [4]:
!pip install portalocker
!pip install torchdata
!pip install torchtext



In [3]:
train_iter, test_iter = IMDB(split=('train', 'test'))

# Convert to list for easier processing
train_data = list(train_iter)
test_data = list(test_iter)

### Step 3: Process the Labels

Read the labels and convert them into tensors of 0 and 1 floats for training.

In [4]:
def process_labels(data):
    # The torchtext IMDB dataset uses 1 for a negative review and 2 for a positive review
    # We'll use 0 and 1, respectively
    labels = [label-1 for label, _ in data]
    return torch.tensor(labels).unsqueeze(1).float()

train_labels = process_labels(train_data)
test_labels = process_labels(test_data)
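
To see what `process_labels` produces, here is a small sketch on two made-up `(label, text)` pairs (the toy data is an assumption for illustration, and the function is repeated so the snippet runs by itself). The `(N, 1)` float shape matters because `nn.BCEWithLogitsLoss` expects targets with the same shape as the classifier's single-logit output.

```python
import torch
import torch.nn as nn

def process_labels(data):
    # Map the dataset's 1 (negative) / 2 (positive) labels to 0 / 1
    labels = [label - 1 for label, _ in data]
    return torch.tensor(labels).unsqueeze(1).float()

# Made-up examples in the same (label, text) layout as the IMDB iterator
toy_data = [(1, "Dull and far too long."), (2, "A joy from start to finish.")]
toy_labels = process_labels(toy_data)
print(toy_labels.shape, toy_labels.dtype)

# Logits of 0 mean "50/50", so the loss against any 0/1 target is ln(2)
toy_logits = torch.zeros(2, 1)
toy_loss = nn.BCEWithLogitsLoss()(toy_logits, toy_labels)
print(f"loss at zero logits: {toy_loss.item():.4f}")
```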

## Using the MiniLM Sentence Transformer

Step 4, which loads the embedding model and generates the embeddings, is a little different for each of the two models.  Here we'll show the MiniLM version.

### Step 4 (MiniLM): Load the Embedding Model and Generate Text Embeddings

This step changes depending on which embedding model we use.
We will use a Sentence Transformers model to convert the IMDB reviews into embeddings. These embeddings will serve as input features for our neural network classifier.

In [5]:
# Step 4: Generate Embeddings
# Load a pre-trained Sentence Transformer model
model_name = 'all-MiniLM-L6-v2'  # A good balance between performance and speed, 22M parameters
sentence_model = SentenceTransformer(model_name)

def generate_embeddings(data):
    texts = [text for _, text in data]
    embeddings = sentence_model.encode(texts, show_progress_bar=True)
    return torch.tensor(embeddings)

train_embeddings = generate_embeddings(train_data)
test_embeddings = generate_embeddings(test_data)


### Step 5:  Choose the classifier

This step is the same for both embedding models, but the size of the embeddings may be different, so we'll create the class for our classifier here but instantiate it for each set of embeddings.

In [6]:
class SentimentClassifier(nn.Module):
    def __init__(self, input_dim):
        super(SentimentClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim,128),
            nn.LeakyReLU(0.1),
            nn.Linear(128,1)
        )

    def forward(self, x):
        return self.fc(x)
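
As a quick sanity check of this architecture, the sketch below pushes a random batch through the classifier and counts its parameters. It assumes 384-dimensional inputs, which is the embedding size of `all-MiniLM-L6-v2`; the class is repeated so the snippet is self-contained, and the random batch is purely illustrative.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.LeakyReLU(0.1),
            nn.Linear(128, 1),   # one logit per review
        )

    def forward(self, x):
        return self.fc(x)

check_model = SentimentClassifier(input_dim=384)
fake_batch = torch.randn(4, 384)           # 4 random stand-in "embeddings"
check_logits = check_model(fake_batch)
n_params = sum(p.numel() for p in check_model.parameters())

print(check_logits.shape)   # one logit per row of the batch
print(n_params)             # (384*128 + 128) + (128 + 1)
```

The output is a raw logit rather than a probability, which is why the training cells pair this model with `nn.BCEWithLogitsLoss`.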

### Step 6 (MiniLM): Build and Train the Classifier

This step is pretty similar for both embedding models, but we need to build the datasets and loaders here since the embeddings will differ.  We'll also instantiate our classifier and train it.

We'll use `train_model` from `utils` and the accuracy metric from `torchmetrics`.

In [7]:
# Create TensorDatasets
train_dataset = TensorDataset(train_embeddings, train_labels)
test_dataset = TensorDataset(test_embeddings, test_labels)

# Dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
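
To make the batching concrete, here is a small sketch with random stand-in tensors (100 fake examples with 384 features each; these sizes are assumptions for illustration only). It shows how `TensorDataset` pairs each embedding with its label and how `DataLoader` splits them into batches, with a smaller final batch when the dataset size is not a multiple of `batch_size`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

fake_embeddings = torch.randn(100, 384)                   # stand-in features
fake_labels = torch.randint(0, 2, (100, 1)).float()       # stand-in 0/1 labels

fake_dataset = TensorDataset(fake_embeddings, fake_labels)
fake_loader = DataLoader(fake_dataset, batch_size=32, shuffle=False)

# Collect the number of rows in each batch
batch_sizes = [features.shape[0] for features, _ in fake_loader]
print(len(fake_loader), batch_sizes)
```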

In [8]:
# Initialize model, loss, and metrics
input_dim = train_embeddings.shape[1]
classifier = SentimentClassifier(input_dim)
loss_function = nn.BCEWithLogitsLoss()
metrics = {
    'accuracy': torchmetrics.Accuracy(num_classes=2, task='binary'),
}

minilm_results_df = train_model(classifier, loss_function,
                         train_loader = train_loader,
                         val_loader = test_loader,
                         metrics=metrics,
                         epochs=10)

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 10/10, Training 100.00% complete, Validation 100.00% complete lr = 1.000e-03
 Epoch  train_accuracy  train_loss  val_accuracy  val_loss      Time    LR
     6         0.82456    0.387316       0.81504  0.401627 12.763296 0.001
     7         0.82796    0.379597       0.81528  0.400027 13.656353 0.001
     8         0.83276    0.372259       0.81432  0.400576 11.800880 0.001
     9         0.83812    0.365090       0.80776  0.414728 11.708164 0.001
    10         0.84372    0.356869       0.81340  0.402624 11.345714 0.001
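
`train_model` comes from a local `utils` module that isn't shown here. For readers without it, a plain PyTorch loop along the following lines covers the same ground. This is a sketch under assumptions, not the actual `utils.train_model`: it uses Adam, the function name `train_simple` is hypothetical, it tracks only training accuracy, and it is exercised on random toy data rather than the IMDB embeddings.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_simple(model, loader, epochs=10, lr=1e-3):
    """Hypothetical stand-in for utils.train_model; returns last-epoch training accuracy."""
    loss_function = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    accuracy = 0.0
    for _ in range(epochs):
        model.train()
        correct, total = 0, 0
        for features, labels in loader:
            optimizer.zero_grad()
            logits = model(features)
            loss = loss_function(logits, labels)
            loss.backward()
            optimizer.step()
            # A logit above 0 corresponds to a sigmoid probability above 0.5
            preds = (logits > 0).float()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        accuracy = correct / total
    return accuracy

# Toy, linearly separable data: the label depends only on the first feature
torch.manual_seed(0)
toy_x = torch.randn(256, 8)
toy_y = (toy_x[:, 0] > 0).float().unsqueeze(1)
toy_loader = DataLoader(TensorDataset(toy_x, toy_y), batch_size=32, shuffle=True)
toy_model = nn.Sequential(nn.Linear(8, 128), nn.LeakyReLU(0.1), nn.Linear(128, 1))

toy_accuracy = train_simple(toy_model, toy_loader, epochs=20, lr=1e-2)
print(f"toy training accuracy: {toy_accuracy:.3f}")
```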


## Using the DistilBERT Transformer

Here we modify Step 4 to load the DistilBERT model and generate the embeddings.

### Step 4 (DistilBERT): Load the Embedding Model and Generate Text Embeddings

This is pretty slow (it took slightly more than an hour on my MacBook CPU).
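
Because of that cost, it can be worth saving the embeddings once they are computed and reloading them in later sessions instead of regenerating them. A minimal sketch with `torch.save` and `torch.load`, using a small random tensor as a stand-in for the real embeddings and a temporary directory in place of a real cache location (the file name is hypothetical):

```python
import os
import tempfile
import torch

stand_in_embeddings = torch.randn(4, 768)  # placeholder for real embeddings

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "train_embeddings.pt")  # hypothetical name
    torch.save(stand_in_embeddings, cache_path)   # write the tensor to disk
    restored = torch.load(cache_path)             # read it back

print(torch.equal(stand_in_embeddings, restored))
```

In a real notebook you would point `cache_path` at a persistent location and skip the embedding step whenever the file already exists.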

In [9]:
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')


In [10]:
def detect_device(use_MPS = False):
    if torch.cuda.is_available():
        print("CUDA is available")
        return torch.device("cuda")
    elif torch.backends.mps.is_available() and use_MPS:
        print("MPS (Apple Silicon) is available")
        return torch.device("mps")
    else:
        print("Only cpu is available")
        return torch.device("cpu")

device = detect_device()

CUDA is available


In [20]:
from tqdm.auto import tqdm  # Automatically use notebook-friendly or console version
model.to(device)

def generate_embeddings(data, device, batch_size=16):
    model.eval()  # Set the model to evaluation mode
    cls_embeddings = []
    mean_embeddings = []

    # Prepare DataLoader for batch processing
    data_loader = DataLoader(data, batch_size=batch_size, shuffle=False)
    progress_bar = tqdm(data_loader, desc='Generating Embeddings')

    with torch.no_grad():  # No need to calculate gradients
        for _, texts in progress_bar:
            # Tokenize the batch; reviews longer than 512 tokens are truncated
            encoded_input = tokenizer(list(texts), padding=True, truncation=True,
                                      return_tensors='pt', max_length=512,
                                      add_special_tokens=True
                                     ).to(device)
            # The inputs are already on the device, so pass them straight in
            output = model(**encoded_input)
            # Mean over all token positions (padding positions included)
            batch_mean_embeddings = output.last_hidden_state.mean(dim=1)
            mean_embeddings.append(batch_mean_embeddings.cpu())
            # Hidden state of the first ([CLS]) token
            batch_cls_embeddings = output.last_hidden_state[:,0,:]
            cls_embeddings.append(batch_cls_embeddings.cpu())

    # Concatenate all batch embeddings
    mean_embeddings = torch.cat(mean_embeddings, dim=0)
    cls_embeddings = torch.cat(cls_embeddings, dim=0)
    return mean_embeddings, cls_embeddings

train_mean_embeddings, train_cls_embeddings = generate_embeddings(train_data, device)
test_mean_embeddings, test_cls_embeddings = generate_embeddings(test_data, device)
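
Note that the mean in `generate_embeddings` averages over every token position, padding included, so short reviews in a batch get their average diluted by padding vectors. A common alternative is to weight the average by the tokenizer's attention mask. The sketch below shows that idea on a tiny hand-made tensor; `masked_mean` is a hypothetical helper, not part of the notebook or of `transformers`, and the results reported in this notebook were produced with the plain mean.

```python
import torch

def masked_mean(last_hidden_state, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts

# One "sentence" with two real tokens and one padding position
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

pooled = masked_mean(hidden, mask)   # uses only the first two token vectors
plain = hidden.mean(dim=1)           # also averages in the padding vector
print(pooled)
print(plain)
```

Inside `generate_embeddings`, the equivalent call would take `output.last_hidden_state` and `encoded_input['attention_mask']` as its two arguments.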


### Step 6 (DistilBERT): Build and Train the Classifier

This step is the same as before, but we instantiate the datasets and loaders again since the embeddings have changed.  We'll do this twice to see if there is a difference in the results depending on how the final embedding is formed.

#### First we'll use mean embeddings

In [24]:
test_cls_embeddings.shape

torch.Size([25000, 768])

In [21]:
# Create TensorDatasets
train_dataset = TensorDataset(train_mean_embeddings, train_labels)
test_dataset = TensorDataset(test_mean_embeddings, test_labels)

# Dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [25]:
# Initialize model, loss, and metrics
input_dim = train_mean_embeddings.shape[1]
classifier = SentimentClassifier(input_dim)
loss_function = nn.BCEWithLogitsLoss()
metrics = {
    'accuracy': torchmetrics.Accuracy(num_classes=2, task='binary'),
}

distilbert_mean_results_df = train_model(classifier, loss_function,
                         train_loader = train_loader,
                         val_loader = test_loader,
                         metrics=metrics,
                         epochs=10)

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 10/10, Training 100.00% complete, Validation 100.00% complete lr = 1.000e-03
 Epoch  train_accuracy  train_loss  val_accuracy  val_loss      Time    LR
     6         0.87916    0.292371       0.87532  0.290044 14.094957 0.001
     7         0.87760    0.288965       0.87688  0.288059 13.502207 0.001
     8         0.88308    0.283877       0.87840  0.286740 13.782764 0.001
     9         0.88236    0.281740       0.87916  0.283231 13.856043 0.001
    10         0.88668    0.276065       0.87460  0.293534 14.869065 0.001


#### Second we'll try the CLS token embeddings

In [26]:
# Create TensorDatasets
train_dataset = TensorDataset(train_cls_embeddings, train_labels)
test_dataset = TensorDataset(test_cls_embeddings, test_labels)

# Dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [27]:
# Initialize model, loss, and metrics
input_dim = train_cls_embeddings.shape[1]
classifier = SentimentClassifier(input_dim)
loss_function = nn.BCEWithLogitsLoss()
metrics = {
    'accuracy': torchmetrics.Accuracy(num_classes=2, task='binary'),
}

distilbert_cls_results_df = train_model(classifier, loss_function,
                         train_loader = train_loader,
                         val_loader = test_loader,
                         metrics=metrics,
                         epochs=10)

`Trainer.fit` stopped: `max_epochs=10` reached.


Epoch 10/10, Training 100.00% complete, Validation 100.00% complete lr = 1.000e-03
 Epoch  train_accuracy  train_loss  val_accuracy  val_loss      Time    LR
     6         0.87200    0.302358       0.87160  0.299234 14.741745 0.001
     7         0.87408    0.300601       0.87248  0.301044 13.555089 0.001
     8         0.87748    0.294722       0.86592  0.310991 12.489838 0.001
     9         0.87624    0.292098       0.87056  0.302251 11.399477 0.001
    10         0.87712    0.291831       0.87420  0.294464 13.069873 0.001
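
To compare the three runs side by side, the sketch below copies the validation accuracies for epochs 6 to 10 from the tables above and reports the best and final value for each. Only the epochs shown in the tables are included, and each configuration was run once, so small differences (such as mean versus CLS pooling) should be read with caution.

```python
# Validation accuracy for epochs 6-10, copied from the three tables above
val_accuracy = {
    "MiniLM": [0.81504, 0.81528, 0.81432, 0.80776, 0.81340],
    "DistilBERT (mean)": [0.87532, 0.87688, 0.87840, 0.87916, 0.87460],
    "DistilBERT (CLS)": [0.87160, 0.87248, 0.86592, 0.87056, 0.87420],
}

# Best and last-epoch value per run
summary = {
    name: {"best": max(values), "final": values[-1]}
    for name, values in val_accuracy.items()
}

# Sort runs from highest to lowest best validation accuracy
ranking = sorted(summary, key=lambda name: summary[name]["best"], reverse=True)
for name in ranking:
    stats = summary[name]
    print(f"{name:<18} best={stats['best']:.5f}  final={stats['final']:.5f}")
```

On these runs, both DistilBERT variants land roughly six percentage points above MiniLM, while the two pooling choices differ by well under one point.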
