### **Encoder and Tokenizer Model Selection**

To compare the `sentence-transformers/all-MiniLM-L12-v2` model with other existing models for encoding and tokenization, we'll look at four performance metrics:

1. **Embedding Quality:** How well the model captures the semantic meaning of sentences, measured as the Pearson correlation between cosine similarities and human scores on the STS benchmark.
2. **Inference Speed:** Time taken to encode a batch of sentences, reported in milliseconds per sentence.
3. **Model Size:** Size of the model weights in MB, estimated from the number of parameters.
4. **Resource Utilization:** CPU usage and memory growth while encoding (GPU usage is not measured here).

We'll compare `sentence-transformers/all-MiniLM-L12-v2` with several other popular sentence-transformer models: `distilbert-base-nli-stsb-mean-tokens`, `bert-base-nli-mean-tokens`, and `all-MiniLM-L6-v2`.

#### Installing necessary libraries

In [27]:
!pip install sentence-transformers transformers datasets torch psutil scipy pandas

#### Importing libraries

In [28]:
import time
import torch
import psutil
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from scipy.stats import pearsonr
import pandas as pd
from IPython.display import display

#### Defining the models to test

In [3]:
models = {
    'all-MiniLM-L12-v2': 'sentence-transformers/all-MiniLM-L12-v2',
    'distilbert-base-nli-stsb-mean-tokens': 'sentence-transformers/distilbert-base-nli-stsb-mean-tokens',
    'bert-base-nli-mean-tokens': 'sentence-transformers/bert-base-nli-mean-tokens',
    'all-MiniLM-L6-v2': 'sentence-transformers/all-MiniLM-L6-v2'
}

#### Loading Semantic Textual Similarity (STS) benchmark dataset

In [4]:
dataset = load_dataset('stsb_multi_mt', name='en', split='test')

In [8]:
# Inspect the first few examples.
# Slicing with dataset[:5] returns a dict of columns, so iterating over it
# yields only the column names; select the first five rows instead.
for i, example in enumerate(dataset.select(range(5))):
    print(f"Example {i+1}: {example}")

In [13]:
dataset

Dataset({
    features: ['sentence1', 'sentence2', 'similarity_score'],
    num_rows: 1379
})
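
Note that slicing a `Dataset` (for example `dataset[:5]`) returns a dictionary that maps each column name to a list of values, not a list of rows, so looping over the slice yields the column names. The sketch below mimics the two layouts with plain Python structures; the two rows are made-up placeholders, not actual STS examples.

```python
# Column-oriented layout, analogous to what a Dataset slice returns.
# The sentences and scores here are hypothetical placeholders.
columns = {
    "sentence1": ["A man is cooking.", "A dog runs."],
    "sentence2": ["Someone prepares food.", "A cat sleeps."],
    "similarity_score": [4.0, 0.5],
}

# Iterating over a dict yields its keys, i.e. the column names.
column_names = list(columns)

# Transpose to row-oriented records, analogous to iterating over the Dataset itself.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]

print(column_names)
for i, row in enumerate(rows):
    print(f"Example {i + 1}: {row}")
```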

#### Sampling sentences from the dataset for speed test

In [15]:
sentences = [sample['sentence1'] for sample in dataset.select(range(100))]

#### Testing Embedding Quality

In [24]:
def measure_embedding_quality(model_name):
    model = SentenceTransformer(model_name)

    # Encode each column in a single batched call rather than one sentence at a time
    embeddings1 = model.encode(list(dataset['sentence1']), convert_to_tensor=True)
    embeddings2 = model.encode(list(dataset['sentence2']), convert_to_tensor=True)

    # Cosine similarity of each aligned sentence pair
    similarities = torch.nn.functional.cosine_similarity(embeddings1, embeddings2).cpu().tolist()

    # Gold scores are on a 0-5 scale; rescaling to 0-1 does not change the correlation
    scores = [score / 5.0 for score in dataset['similarity_score']]

    # Calculate Pearson correlation
    pearson_corr, _ = pearsonr(similarities, scores)
    return pearson_corr
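
The quality score is the Pearson correlation between the model's cosine similarities and the human ratings. As a rough, dependency-free sketch of the two formulas involved (the helper names `cosine_similarity` and `pearson`, the toy vectors, and the gold scores are illustrative assumptions, not part of the notebook), the snippet below also shows that dividing the gold scores by 5.0 leaves the correlation unchanged, because Pearson correlation is invariant to positive rescaling.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy "embedding" pairs and hypothetical gold scores on the 0-5 STS scale.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [4.8, 3.0, 0.2]

predicted = [cosine_similarity(u, v) for u, v in pairs]
r_raw = pearson(predicted, gold)
r_scaled = pearson(predicted, [g / 5.0 for g in gold])
print(r_raw, r_scaled)
```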

#### Testing Inference Speed

In [17]:
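def measure_inference_speed(model_name):
    model = SentenceTransformer(model_name)
    # perf_counter is a monotonic, high-resolution clock suited to timing.
    # Model loading is excluded, but any first-call overhead is included.
    start_time = time.perf_counter()
    model.encode(sentences, convert_to_tensor=True)
    end_time = time.perf_counter()
    return (end_time - start_time) / len(sentences) * 1000  # Time per sentence in ms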
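
A single timed call can include one-off costs (lazy initialization, cache warm-up) and is sensitive to noise. One common mitigation, sketched here with a hypothetical helper `time_per_item_ms` and a stand-in workload rather than a real model, is to run an untimed warm-up call and report the median of several timed repeats.

```python
import statistics
import time

def time_per_item_ms(fn, items, warmup=1, repeats=5):
    """Median wall-clock time per item in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn(items)  # untimed warm-up calls
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(items)
        elapsed = time.perf_counter() - start
        timings.append(elapsed / len(items) * 1000)
    return statistics.median(timings)

# Stand-in workload; in the notebook this would be something like
# lambda batch: model.encode(batch, convert_to_tensor=True)
def fake_encode(batch):
    return [len(text) for text in batch]

print(time_per_item_ms(fake_encode, ["a short sentence"] * 100))
```

The median is less affected by an occasional slow run than the mean, which matters when only a handful of repeats are taken.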

#### Checking Model Size

In [18]:
def get_model_size(model_name):
    model = SentenceTransformer(model_name)
    # Count trainable parameters and assume 4 bytes each (float32 weights)
    num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return num_params * 4 / (1024 ** 2)  # Size in MB
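
The size estimate is the parameter count multiplied by the bytes per parameter, assuming every weight is stored as a 4-byte float32. As a small worked sketch (the helper names `params_to_mb` and `mb_to_params` are assumptions for illustration): the 127.258301 MB reported for `all-MiniLM-L12-v2` in the findings corresponds to 33,360,000 trainable parameters under this formula, and the same weights stored at 2 bytes each would take half that.

```python
def params_to_mb(num_params, bytes_per_param=4):
    """Convert a parameter count to MB (1 MB = 1024**2 bytes), assuming a fixed width."""
    return num_params * bytes_per_param / (1024 ** 2)

def mb_to_params(size_mb, bytes_per_param=4):
    """Invert the formula to recover the parameter count from a size in MB."""
    return round(size_mb * (1024 ** 2) / bytes_per_param)

fp32_mb = params_to_mb(33_360_000)                     # float32 weights
fp16_mb = params_to_mb(33_360_000, bytes_per_param=2)  # hypothetical 16-bit copy
print(round(fp32_mb, 6), round(fp16_mb, 6))
```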

#### Measuring Resource Utilization

In [19]:
def measure_resource_utilization(model_name):
    model = SentenceTransformer(model_name)
    process = psutil.Process()

    # Record resident memory (RSS) before encoding and prime the CPU counter:
    # the first cpu_percent(interval=None) call on a Process returns 0.0
    memory_before = process.memory_info().rss
    cpu_before = process.cpu_percent(interval=None)

    model.encode(sentences, convert_to_tensor=True)

    # The second call reports CPU usage since the priming call
    memory_after = process.memory_info().rss
    cpu_after = process.cpu_percent(interval=None)

    memory_usage = (memory_after - memory_before) / (1024 ** 2)  # Change in RSS, in MB
    cpu_usage = cpu_after - cpu_before  # CPU usage percentage
    return memory_usage, cpu_usage
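
Process-level readings like these are coarse: the RSS difference can be close to zero if the memory was already allocated, and `cpu_percent` can exceed 100 when a process runs threads on several cores. As an alternative sketch that uses only the standard library (a swapped-in technique, not what the notebook does), CPU utilization over a block of work can be estimated as CPU time divided by wall-clock time. The helper `measure_cpu_utilization` and the busy-loop workload are hypothetical.

```python
import time

def measure_cpu_utilization(fn):
    """Return (result, cpu_percent), where cpu_percent = CPU time / wall time * 100."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = fn()
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    cpu_percent = 100.0 * cpu_elapsed / wall_elapsed if wall_elapsed > 0 else 0.0
    return result, cpu_percent

# Stand-in for model.encode(sentences): a purely CPU-bound loop.
def busy_work():
    return sum(i * i for i in range(200_000))

total, cpu_percent = measure_cpu_utilization(busy_work)
print(f"CPU utilization: {cpu_percent:.1f}%")
```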

#### Accumulating Results

In [20]:
results = {}

In [25]:
for model_name, model_path in models.items():
    print(f"Evaluating model: {model_name}")
    results[model_name] = {
        "embedding_quality": measure_embedding_quality(model_path),
        "inference_speed": measure_inference_speed(model_path),
        "model_size": get_model_size(model_path),
        "resource_utilization": measure_resource_utilization(model_path)
    }

Evaluating model: all-MiniLM-L12-v2
Evaluating model: distilbert-base-nli-stsb-mean-tokens
Evaluating model: bert-base-nli-mean-tokens
Evaluating model: all-MiniLM-L6-v2

#### Findings

In [29]:
results_df = pd.DataFrame(results).T

# Display the DataFrame
display(results_df)

| Model | embedding_quality (Pearson r) | inference_speed (ms/sentence) | model_size (MB) | resource_utilization (memory MB, CPU %) |
|---|---|---|---|---|
| all-MiniLM-L12-v2 | 0.837595 | 4.648678 | 127.258301 | (20.3828125, 30.8) |
| distilbert-base-nli-stsb-mean-tokens | 0.842519 | 5.466628 | 253.154297 | (34.75390625, 58.3) |
| bert-base-nli-mean-tokens | 0.741453 | 11.453929 | 417.641602 | (23.87890625, 35.8) |
| all-MiniLM-L6-v2 | 0.827406 | 2.355793 | 86.644043 | (20.3671875, 46.7) |
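
Because `measure_resource_utilization` returns a tuple, memory and CPU end up packed into a single column. A possible tidy-up, sketched with plain dictionaries and two rows of the reported values (the function name `flatten_results` and the new column names `memory_mb` and `cpu_percent` are assumptions), splits the tuple into separate fields before building the DataFrame.

```python
def flatten_results(results):
    """Split the (memory, cpu) tuple into two named fields per model."""
    flat = {}
    for name, metrics in results.items():
        row = {k: v for k, v in metrics.items() if k != "resource_utilization"}
        memory_mb, cpu_percent = metrics["resource_utilization"]
        row["memory_mb"] = memory_mb
        row["cpu_percent"] = cpu_percent
        flat[name] = row
    return flat

# Two rows copied from the findings, in the shape produced by the evaluation loop.
results = {
    "all-MiniLM-L12-v2": {
        "embedding_quality": 0.837595,
        "inference_speed": 4.648678,
        "model_size": 127.258301,
        "resource_utilization": (20.3828125, 30.8),
    },
    "all-MiniLM-L6-v2": {
        "embedding_quality": 0.827406,
        "inference_speed": 2.355793,
        "model_size": 86.644043,
        "resource_utilization": (20.3671875, 46.7),
    },
}

flat = flatten_results(results)
print(flat["all-MiniLM-L6-v2"])
```

Passing the flattened dictionary to `pd.DataFrame(...).T`, as in the cell above, would then give one numeric column per metric, which makes sorting and plotting simpler.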
