# Evaluating embedding models on the [Semantic Textual Similarity on STS Benchmark](https://huggingface.co/datasets/anti-ai/ViSTS) dataset

[The original source code comes from this notebook](https://colab.research.google.com/drive/1JZLWKiknSUnA92UY2RIhvS65WtP6sgqW?hl=fr#scrollTo=IkTAwPqxDTOK)

In [1]:
!pip install sentence-transformers
!pip install datasets
!pip install sacremoses
!pip install pyvi
!pip install einops
!pip install 'numpy<2'
!pip install vncorenlp
!pip install openai
!pip install tiktoken


## `convert_dataset` converts the downloaded dataset into the format needed for evaluation
The original code uses the pyvi tokenizer, but that tokenizer only suits a few models (mostly those based on PhoBERT), as the results at the end of this file show. I therefore evaluate each model both with and without the pyvi tokenizer.

Set use_pyvi_tokenizer = True to use the pyvi tokenizer.

In [2]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.readers import InputExample
from datasets import load_dataset
from pyvi.ViTokenizer import tokenize
def convert_dataset(dataset, use_pyvi_tokenizer=True):
    """Convert ViSTS rows into InputExample objects with labels scaled to 0..1."""
    dataset_samples = []
    for row in dataset:
        score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
        if use_pyvi_tokenizer:
            # Word-segment both sentences with pyvi before encoding
            texts = [tokenize(row['sentence1']), tokenize(row['sentence2'])]
        else:
            texts = [row['sentence1'], row['sentence2']]
        dataset_samples.append(InputExample(texts=texts, label=score))
    return dataset_samples
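
To see what the `use_pyvi_tokenizer` flag changes, here is a minimal, dependency-free sketch of the same conversion. `Pair`, `convert_rows`, and `toy_segment` are hypothetical names used only for illustration. `toy_segment` is a stand-in that joins one hard-coded compound word with an underscore, which mimics the shape of a word segmenter's output but is not pyvi itself.

```python
from typing import Callable, NamedTuple, Optional


class Pair(NamedTuple):
    """Hypothetical stand-in for sentence_transformers' InputExample."""
    texts: tuple
    label: float


def toy_segment(text: str) -> str:
    # Stand-in for a Vietnamese word segmenter: joins the syllables of one
    # known compound word with "_". A real segmenter handles arbitrary text.
    return text.replace("học sinh", "học_sinh")


def convert_rows(rows, segment: Optional[Callable[[str], str]] = None):
    """Turn raw rows into (sentence pair, normalized score) records."""
    pairs = []
    for row in rows:
        score = float(row["score"]) / 5.0  # gold scores are in 0..5
        s1, s2 = row["sentence1"], row["sentence2"]
        if segment is not None:
            s1, s2 = segment(s1), segment(s2)
        pairs.append(Pair(texts=(s1, s2), label=score))
    return pairs


rows = [
    {"sentence1": "học sinh đi học",
     "sentence2": "các em học sinh đến trường",
     "score": "4.0"},
]
raw = convert_rows(rows)
segmented = convert_rows(rows, segment=toy_segment)
print(raw[0].texts[0])        # text passed through unchanged
print(segmented[0].texts[0])  # compound word joined with "_"
```

Models whose vocabulary was built on word-segmented text expect the underscore-joined form, while multilingual models generally expect the raw text, which is why both variants are evaluated.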



## Models and datasets
The models listed below are evaluated on the datasets listed below.

In [3]:
model_names =["jinaai/jina-embeddings-v3",
              'openai/text-embedding-3-small',
              'openai/text-embedding-3-large',
              'openai/text-embedding-ada-002',
              "dangvantuan/vietnamese-embedding",
              "dangvantuan/vietnamese-embedding-LongContext",
              'Alibaba-NLP/gte-multilingual-base',
              'Alibaba-NLP/gte-multilingual-mlm-base',
              'BAAI/bge-m3',
              'BAAI/bge-m3-unsupervised',
              'BAAI/bge-m3-retromae',
              'intfloat/multilingual-e5-small',
              'intfloat/multilingual-e5-base',
              'intfloat/multilingual-e5-large',
              'nampham1106/bkcare-embedding',
              'VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base',
              "VoVanPhuc/sup-SimCSE-VietNamese-phobert-base",
              "keepitreal/vietnamese-sbert",
               #"bkai-foundation-models/vietnamese-bi-encoder",
               "hiieu/halong_embedding"
             ]

datasets = ['STS-B', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', 'STS-Sickr']
# datasets = ['STS-Sickr']

## Requesting OpenAI embeddings
The function get_openai_embeddings(model, sentences) calls the OpenAI API to get the embeddings that the given model produces for the sentences in the list.

To keep the API key secret, you can use the secrets feature of Google Colab or the secrets add-on of Kaggle:
- On Colab, use google.colab.userdata to load the API key
- On Kaggle, use kaggle_secrets.UserSecretsClient to load the API key
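
The two options can be combined into one loader so the same notebook runs on either platform. This is a sketch under assumptions: `load_openai_key` is a hypothetical helper, and the `OPENAI_API_KEY` environment variable check is an added fallback for local runs that the notebook does not use. The secret name `openai-key` and the two platform calls are the ones used in the cell below.

```python
import os


def load_openai_key(secret_name: str = "openai-key") -> str:
    """Return the API key from the first source that provides one."""
    # 1) Plain environment variable (assumed fallback for local runs).
    key = os.environ.get("OPENAI_API_KEY")
    if key:
        return key
    # 2) Kaggle secrets add-on.
    try:
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except ImportError:
        pass
    # 3) Google Colab secrets.
    try:
        from google.colab import userdata
        return userdata.get(secret_name)
    except ImportError:
        pass
    raise RuntimeError("No OpenAI API key found in any known secret store")
```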

In [4]:
from openai import OpenAI
from kaggle_secrets import UserSecretsClient  # On Colab: from google.colab import userdata
import concurrent.futures
import threading
import numpy as np
import torch

# Semaphore size: caps how many requests are in flight at once
# (a concurrency limit, not a true requests-per-second limit)
req_per_sec = 60
sema = threading.Semaphore(req_per_sec)

#client = OpenAI(api_key = userdata.get('openai-key'))
user_secrets = UserSecretsClient()
openai_api_key = user_secrets.get_secret("openai-key")
client = OpenAI(api_key = openai_api_key)

def get_openai_embedding(request_message):
    with sema:
        response = client.embeddings.create(
            input=request_message['text'],
            model=request_message['model']
        )
        return np.array(response.data[0].embedding)

def get_openai_embeddings(model, sentences):
    print(f"request openai {model} embedding with {len(sentences)} texts")
    embeddings = []
    request_messages = [{'text': text, 'model': model} for text in sentences]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(get_openai_embedding, request_messages)

    for result in results:
        embeddings.append(result)
    embeddings = np.stack(embeddings, 0)
    return torch.tensor(embeddings)
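
Two properties of this pattern are worth spelling out. `executor.map` yields results in input order regardless of which request finishes first, so the embeddings stay aligned with the sentences. The semaphore caps the number of requests in flight, not the number per second. The sketch below shows both with a stub in place of the API call. `fake_embed` and `embed_all` are hypothetical, and the vectors returned are meaningless.

```python
import concurrent.futures
import threading
import time

MAX_IN_FLIGHT = 4  # cap on simultaneous calls, not a per-second rate
sema = threading.Semaphore(MAX_IN_FLIGHT)


def fake_embed(text: str) -> list:
    """Stub for a network call: shorter texts take longer to 'embed'."""
    with sema:
        time.sleep(0.01 * (5 - min(len(text), 5)))
        return [float(len(text))]


def embed_all(texts):
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        # map() preserves input order even when calls finish out of order
        return list(executor.map(fake_embed, texts))


texts = ["a", "bbbb", "cc", "ddddd"]
vectors = embed_all(texts)
print(vectors)
```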

## Evaluation code
The EmbeddingSimilarityEvaluator class in sentence_transformers supports neither evaluating jina v3 with a task and prompt nor evaluating OpenAI models, so I had to rewrite it.

This version copies the original class and only adds the extra evaluation path for jina v3 when the task and prompt parameters are passed. For an OpenAI model, it calls the embedding request function defined above.

I commented out the line self.store_metrics_in_model_card_data(model, metrics) because it fails with OpenAI models (they have no model card on Hugging Face).
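
For reference, the score the evaluator reports is the correlation between the gold labels and a similarity computed from each embedding pair. The sketch below reproduces the cosine, Pearson, and Spearman part in plain Python on made-up two-dimensional vectors. The helper names are hypothetical and the numbers are illustrative, not results from any model; the evaluator itself relies on scipy and scikit-learn.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    # Spearman correlation is Pearson correlation computed on ranks
    return pearson(ranks(x), ranks(y))


# Made-up embeddings for three sentence pairs and their gold labels (0..1)
emb1 = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
emb2 = [[1.0, 0.1], [1.0, 1.0], [0.0, 1.0]]
labels = [0.9, 0.6, 0.1]

cos_scores = [cosine(u, v) for u, v in zip(emb1, emb2)]
print(round(spearman(labels, cos_scores), 4), round(pearson(labels, cos_scores), 4))
```

Spearman only looks at the ordering of the pairs, so it reaches its maximum whenever the cosine scores rank the pairs the same way the labels do, even if the relationship is not linear.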


In [5]:
from __future__ import annotations

import csv
import logging
import os
from contextlib import nullcontext
from typing import TYPE_CHECKING, Literal, Union
import pickle

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics.pairwise import paired_cosine_distances, paired_euclidean_distances, paired_manhattan_distances

from sentence_transformers.evaluation.SentenceEvaluator import SentenceEvaluator
from sentence_transformers.readers import InputExample
from sentence_transformers.similarity_functions import SimilarityFunction

if TYPE_CHECKING:
    from sentence_transformers.SentenceTransformer import SentenceTransformer

logger = logging.getLogger(__name__)


class EmbeddingSimilarityEvaluator_(SentenceEvaluator):
    """
    Evaluate a model based on the similarity of the embeddings by calculating the Spearman and Pearson rank correlation
    in comparison to the gold standard labels.
    The metrics are the cosine similarity as well as euclidean and Manhattan distance
    The returned score is the Spearman correlation with a specified metric.

    Example:
        ::

            from datasets import load_dataset
            from sentence_transformers import SentenceTransformer
            from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator, SimilarityFunction

            # Load a model
            model = SentenceTransformer('all-mpnet-base-v2')

            # Load the STSB dataset (https://huggingface.co/datasets/sentence-transformers/stsb)
            eval_dataset = load_dataset("sentence-transformers/stsb", split="validation")

            # Initialize the evaluator
            dev_evaluator = EmbeddingSimilarityEvaluator(
                sentences1=eval_dataset["sentence1"],
                sentences2=eval_dataset["sentence2"],
                scores=eval_dataset["score"],
                main_similarity=SimilarityFunction.COSINE,
                name="sts-dev",
            )
            dev_evaluator(model)
            '''
            EmbeddingSimilarityEvaluator: Evaluating the model on the sts-dev dataset:
            Cosine-Similarity :       Pearson: 0.7874 Spearman: 0.8004
            Manhattan-Distance:       Pearson: 0.7823 Spearman: 0.7827
            Euclidean-Distance:       Pearson: 0.7824 Spearman: 0.7827
            Dot-Product-Similarity:   Pearson: 0.7192 Spearman: 0.7126
            '''
            # => {'sts-dev_pearson_cosine': 0.880607226102985, 'sts-dev_spearman_cosine': 0.881019449484294, ...}
    """

    def __init__(
        self,
        sentences1: list[str],
        sentences2: list[str],
        scores: list[float],
        batch_size: int = 16,
        main_similarity: str | SimilarityFunction | None = None,
        name: str = "",
        show_progress_bar: bool = False,
        write_csv: bool = True,
        precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] | None = None,
        truncate_dim: int | None = None,
    ):
        """
        Constructs an evaluator based for the dataset.

        Args:
            sentences1 (List[str]): List with the first sentence in a pair.
            sentences2 (List[str]): List with the second sentence in a pair.
            scores (List[float]): Similarity score between sentences1[i] and sentences2[i].
            batch_size (int, optional): The batch size for processing the sentences. Defaults to 16.
            main_similarity (Optional[Union[str, SimilarityFunction]], optional): The main similarity function to use.
                Can be a string (e.g. "cosine", "dot") or a SimilarityFunction object. Defaults to None.
            name (str, optional): The name of the evaluator. Defaults to "".
            show_progress_bar (bool, optional): Whether to show a progress bar during evaluation. Defaults to False.
            write_csv (bool, optional): Whether to write the evaluation results to a CSV file. Defaults to True.
            precision (Optional[Literal["float32", "int8", "uint8", "binary", "ubinary"]], optional): The precision
                to use for the embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". Defaults to None.
            truncate_dim (Optional[int], optional): The dimension to truncate sentence embeddings to. `None` uses the
                model's current truncation dimension. Defaults to None.
        """
        super().__init__()
        self.sentences1 = sentences1
        self.sentences2 = sentences2
        self.scores = scores
        self.write_csv = write_csv
        self.precision = precision
        self.truncate_dim = truncate_dim

        assert len(self.sentences1) == len(self.sentences2)
        assert len(self.sentences1) == len(self.scores)

        self.main_similarity = SimilarityFunction(main_similarity) if main_similarity else None
        self.name = name

        self.batch_size = batch_size
        if show_progress_bar is None:
            show_progress_bar = (
                logger.getEffectiveLevel() == logging.INFO or logger.getEffectiveLevel() == logging.DEBUG
            )
        self.show_progress_bar = show_progress_bar

        self.csv_file = (
            "similarity_evaluation"
            + ("_" + name if name else "")
            + ("_" + precision if precision else "")
            + "_results.csv"
        )
        self.csv_headers = [
            "epoch",
            "steps",
            "cosine_pearson",
            "cosine_spearman",
            "euclidean_pearson",
            "euclidean_spearman",
            "manhattan_pearson",
            "manhattan_spearman",
            "dot_pearson",
            "dot_spearman",
        ]

    @classmethod
    def from_input_examples(cls, examples: list[InputExample], **kwargs):
        sentences1 = []
        sentences2 = []
        scores = []

        for example in examples:
            sentences1.append(example.texts[0])
            sentences2.append(example.texts[1])
            scores.append(example.label)
        return cls(sentences1, sentences2, scores, **kwargs)

    def __call__(
        self, model: Union[SentenceTransformer, str], output_path: str = None, epoch: int = -1, steps: int = -1, task = None, prompt = None
    ) -> dict[str, float]:
        if epoch != -1:
            if steps == -1:
                out_txt = f" after epoch {epoch}"
            else:
                out_txt = f" in epoch {epoch} after {steps} steps"
        else:
            out_txt = ""
        if self.truncate_dim is not None:
            out_txt += f" (truncated to {self.truncate_dim})"

        logger.info(f"EmbeddingSimilarityEvaluator: Evaluating the model on the {self.name} dataset{out_txt}:")

        if isinstance(model, str):
            if "openai/" in model:
                embeddings1 = get_openai_embeddings(model.split("openai/")[-1], self.sentences1)
                embeddings2 = get_openai_embeddings(model.split("openai/")[-1], self.sentences2)
                with open(f"{output_path}_sentences1_embedding.pickle", "wb") as f:
                    pickle.dump(embeddings1, f)
                with open(f"{output_path}_sentences2_embedding.pickle", "wb") as f:
                    pickle.dump(embeddings2, f)
        else:
            with nullcontext() if self.truncate_dim is None else model.truncate_sentence_embeddings(self.truncate_dim):
                if task is None:
                    embeddings1 = model.encode(
                        self.sentences1,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                    )
                    embeddings2 = model.encode(
                        self.sentences2,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                    )
                elif task == "text-matching":
                    embeddings1 = model.encode(
                        self.sentences1,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                        task = task,
                    )
                    embeddings2 = model.encode(
                        self.sentences2,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                        task = task,
                    )
                elif task == 'retrieval':
                    embeddings1 = model.encode(
                        self.sentences1,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                        task = 'retrieval.query',
                        prompt_name = 'retrieval.query',
                    )
                    embeddings2 = model.encode(
                        self.sentences2,
                        batch_size=self.batch_size,
                        show_progress_bar=self.show_progress_bar,
                        convert_to_numpy=True,
                        precision=self.precision,
                        normalize_embeddings=bool(self.precision),
                        task = 'retrieval.passage',
                        prompt_name = 'retrieval.passage',
                    )
        # Binary and ubinary embeddings are packed, so we need to unpack them for the distance metrics
        if self.precision == "binary":
            embeddings1 = (embeddings1 + 128).astype(np.uint8)
            embeddings2 = (embeddings2 + 128).astype(np.uint8)
        if self.precision in ("ubinary", "binary"):
            embeddings1 = np.unpackbits(embeddings1, axis=1)
            embeddings2 = np.unpackbits(embeddings2, axis=1)

        labels = self.scores

        cosine_scores = 1 - (paired_cosine_distances(embeddings1, embeddings2))
        manhattan_distances = -paired_manhattan_distances(embeddings1, embeddings2)
        euclidean_distances = -paired_euclidean_distances(embeddings1, embeddings2)
        dot_products = [np.dot(emb1, emb2) for emb1, emb2 in zip(embeddings1, embeddings2)]

        eval_pearson_cosine, _ = pearsonr(labels, cosine_scores)
        eval_spearman_cosine, _ = spearmanr(labels, cosine_scores)

        eval_pearson_manhattan, _ = pearsonr(labels, manhattan_distances)
        eval_spearman_manhattan, _ = spearmanr(labels, manhattan_distances)

        eval_pearson_euclidean, _ = pearsonr(labels, euclidean_distances)
        eval_spearman_euclidean, _ = spearmanr(labels, euclidean_distances)

        eval_pearson_dot, _ = pearsonr(labels, dot_products)
        eval_spearman_dot, _ = spearmanr(labels, dot_products)

        logger.info(f"Cosine-Similarity :\tPearson: {eval_pearson_cosine:.4f}\tSpearman: {eval_spearman_cosine:.4f}")
        logger.info(
            f"Manhattan-Distance:\tPearson: {eval_pearson_manhattan:.4f}\tSpearman: {eval_spearman_manhattan:.4f}"
        )
        logger.info(
            f"Euclidean-Distance:\tPearson: {eval_pearson_euclidean:.4f}\tSpearman: {eval_spearman_euclidean:.4f}"
        )
        logger.info(f"Dot-Product-Similarity:\tPearson: {eval_pearson_dot:.4f}\tSpearman: {eval_spearman_dot:.4f}")

        if output_path is not None and self.write_csv:
            csv_path = os.path.join(output_path, self.csv_file)
            output_file_exists = os.path.isfile(csv_path)
            with open(csv_path, newline="", mode="a" if output_file_exists else "w", encoding="utf-8") as f:
                writer = csv.writer(f)
                if not output_file_exists:
                    writer.writerow(self.csv_headers)

                writer.writerow(
                    [
                        epoch,
                        steps,
                        eval_pearson_cosine,
                        eval_spearman_cosine,
                        eval_pearson_euclidean,
                        eval_spearman_euclidean,
                        eval_pearson_manhattan,
                        eval_spearman_manhattan,
                        eval_pearson_dot,
                        eval_spearman_dot,
                    ]
                )

        self.primary_metric = {
            SimilarityFunction.COSINE: "spearman_cosine",
            SimilarityFunction.EUCLIDEAN: "spearman_euclidean",
            SimilarityFunction.MANHATTAN: "spearman_manhattan",
            SimilarityFunction.DOT_PRODUCT: "spearman_dot",
        }.get(self.main_similarity, "spearman_max")
        metrics = {
            "pearson_cosine": eval_pearson_cosine,
            "spearman_cosine": eval_spearman_cosine,
            "pearson_manhattan": eval_pearson_manhattan,
            "spearman_manhattan": eval_spearman_manhattan,
            "pearson_euclidean": eval_pearson_euclidean,
            "spearman_euclidean": eval_spearman_euclidean,
            "pearson_dot": eval_pearson_dot,
            "spearman_dot": eval_spearman_dot,
            "pearson_max": max(eval_pearson_cosine, eval_pearson_manhattan, eval_pearson_euclidean, eval_pearson_dot),
            "spearman_max": max(
                eval_spearman_cosine, eval_spearman_manhattan, eval_spearman_euclidean, eval_spearman_dot
            ),
        }
        metrics = self.prefix_name_to_metrics(metrics, self.name)
        #self.store_metrics_in_model_card_data(model, metrics)
        return metrics

    @property
    def description(self) -> str:
        return "Semantic Similarity"
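
The three `encode` branches in `__call__` differ only in their task-specific keyword arguments. One way to make that explicit is a small lookup that returns the extra arguments for each side of the pair. This is a refactoring sketch, not part of the notebook: `task_kwargs` is a hypothetical helper, and its values simply mirror what the branches above pass.

```python
def task_kwargs(task=None):
    """Return (kwargs for sentences1, kwargs for sentences2) for a task."""
    if task is None:
        return {}, {}
    if task == "text-matching":
        same = {"task": "text-matching"}
        return dict(same), dict(same)
    if task == "retrieval":
        # Asymmetric: the first sentence is the query, the second the passage
        query = {"task": "retrieval.query", "prompt_name": "retrieval.query"}
        passage = {"task": "retrieval.passage", "prompt_name": "retrieval.passage"}
        return query, passage
    raise ValueError(f"Unsupported task: {task!r}")


kw1, kw2 = task_kwargs("retrieval")
print(kw1["task"], kw2["task"])
```

With such a helper, each branch would collapse to a single `model.encode(..., **kw1)` call per side. Raising on an unknown task also avoids the case where no branch matches and `embeddings1` is never assigned.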

## Evaluating the models

In [6]:
import os
import shutil
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

path_org = 'STS_output/'
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

def recreate_directory(path):
    if os.path.exists(path):
        shutil.rmtree(path)
    os.mkdir(path)


# Start from an empty output directory, then create one subdirectory per dataset
recreate_directory(path_org)

for name_data in datasets:
    os.mkdir(os.path.join(path_org, name_data))

for model_name in model_names:
    if "openai/" in model_name:
        model = model_name
    else:
        model = SentenceTransformer(model_name, trust_remote_code=True, device = 'cuda')
    test_samples = []
    for name_data in datasets:
        path_data = path_org+name_data
        #os.mkdir(path_org+ '/' +name_data)
        sts_test =  load_dataset("anti-ai/ViSTS", name_data)["test"]
        test_samples.extend(convert_dataset(sts_test))
        print(name_data, len(test_samples), len(sts_test))

        print(f"Evaluating for dataset: {name_data} using model {model_name}")

        path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1]
        os.mkdir(path)
        test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                convert_dataset(sts_test, False), batch_size=16, name="sts-test")
        test_evaluator(model, output_path=path)

        path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1] + "_pyvi_tokenizer"
        os.mkdir(path)
        test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                convert_dataset(sts_test), batch_size=16, name="sts-test")
        test_evaluator(model, output_path=path)

        if model_name == "jinaai/jina-embeddings-v3":
            path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1] + '_matching'
            os.mkdir(path)
            test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                    convert_dataset(sts_test, False), batch_size=16, name="sts-test")
            test_evaluator(model, output_path=path, task = 'text-matching', prompt = 'text-matching')

            path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1] + '_retrieval'
            os.mkdir(path)
            test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                    convert_dataset(sts_test, False), batch_size=16, name="sts-test")
            test_evaluator(model, output_path=path, task = 'retrieval', prompt = 'retrieval')

            path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1] + '_matching' + "_pyvi_tokenizer"
            os.mkdir(path)
            test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                    convert_dataset(sts_test), batch_size=16, name="sts-test")
            test_evaluator(model, output_path=path, task = 'text-matching', prompt = 'text-matching')

            path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1] + '_retrieval' + "_pyvi_tokenizer"
            os.mkdir(path)
            test_evaluator = EmbeddingSimilarityEvaluator_.from_input_examples(
                    convert_dataset(sts_test), batch_size=16, name="sts-test")
            test_evaluator(model, output_path=path, task = 'retrieval', prompt = 'retrieval')
    # path_data = path_org+'mean'
    # path =path_data + '/' + model_name.split("/")[0] + "_" + model_name.split("/")[1]
    # os.mkdir(path_org+ '/' +'mean')
    # os.mkdir(path)
    # test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    #         test_samples, batch_size=8, name="sts-test")
    # test_evaluator(model, output_path=path)


STS-B 1379 1379
Evaluating for dataset: STS-B using model jinaai/jina-embeddings-v3



STS12 4487 3108
Evaluating for dataset: STS12 using model jinaai/jina-embeddings-v3



STS13 5987 1500
Evaluating for dataset: STS13 using model jinaai/jina-embeddings-v3



STS14 9737 3750
Evaluating for dataset: STS14 using model jinaai/jina-embeddings-v3



STS15 12737 3000
Evaluating for dataset: STS15 using model jinaai/jina-embeddings-v3



STS16 13923 1186
Evaluating for dataset: STS16 using model jinaai/jina-embeddings-v3



STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model jinaai/jina-embeddings-v3
STS-B 1379 1379
Evaluating for dataset: STS-B using model openai/text-embedding-3-small
request openai text-embedding-3-small embedding with 1379 texts
request openai text-embedding-3-small embedding with 1379 texts
request openai text-embedding-3-small embedding with 1379 texts
request openai text-embedding-3-small embedding with 1379 texts
STS12 4487 3108
Evaluating for dataset: STS12 using model openai/text-embedding-3-small
request openai text-embedding-3-small embedding with 3108 texts
request openai text-embedding-3-small embedding with 3108 texts
request openai text-embedding-3-small embedding with 3108 texts
request openai text-embedding-3-small embedding with 3108 texts
STS13 5987 1500
Evaluating for dataset: STS13 using model openai/text-embedding-3-small
request openai text-embedding-3-small embedding with 1500 texts
request openai text-embedding-3-small embedding with 1500 texts
req


STS-B 1379 1379
Evaluating for dataset: STS-B using model dangvantuan/vietnamese-embedding
STS12 4487 3108
Evaluating for dataset: STS12 using model dangvantuan/vietnamese-embedding
STS13 5987 1500
Evaluating for dataset: STS13 using model dangvantuan/vietnamese-embedding
STS14 9737 3750
Evaluating for dataset: STS14 using model dangvantuan/vietnamese-embedding
STS15 12737 3000
Evaluating for dataset: STS15 using model dangvantuan/vietnamese-embedding
STS16 13923 1186
Evaluating for dataset: STS16 using model dangvantuan/vietnamese-embedding
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model dangvantuan/vietnamese-embedding



A new version of the following files was downloaded from https://huggingface.co/dangvantuan/Vietnamese_impl:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/dangvantuan/Vietnamese_impl:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



STS-B 1379 1379
Evaluating for dataset: STS-B using model dangvantuan/vietnamese-embedding-LongContext
STS12 4487 3108
Evaluating for dataset: STS12 using model dangvantuan/vietnamese-embedding-LongContext
STS13 5987 1500
Evaluating for dataset: STS13 using model dangvantuan/vietnamese-embedding-LongContext
STS14 9737 3750
Evaluating for dataset: STS14 using model dangvantuan/vietnamese-embedding-LongContext
STS15 12737 3000
Evaluating for dataset: STS15 using model dangvantuan/vietnamese-embedding-LongContext
STS16 13923 1186
Evaluating for dataset: STS16 using model dangvantuan/vietnamese-embedding-LongContext
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model dangvantuan/vietnamese-embedding-LongContext



A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- configuration.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



A new version of the following files was downloaded from https://huggingface.co/Alibaba-NLP/new-impl:
- modeling.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.



Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



STS-B 1379 1379
Evaluating for dataset: STS-B using model Alibaba-NLP/gte-multilingual-base
STS12 4487 3108
Evaluating for dataset: STS12 using model Alibaba-NLP/gte-multilingual-base
STS13 5987 1500
Evaluating for dataset: STS13 using model Alibaba-NLP/gte-multilingual-base
STS14 9737 3750
Evaluating for dataset: STS14 using model Alibaba-NLP/gte-multilingual-base
STS15 12737 3000
Evaluating for dataset: STS15 using model Alibaba-NLP/gte-multilingual-base
STS16 13923 1186
Evaluating for dataset: STS16 using model Alibaba-NLP/gte-multilingual-base
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model Alibaba-NLP/gte-multilingual-base



STS-B 1379 1379
Evaluating for dataset: STS-B using model Alibaba-NLP/gte-multilingual-mlm-base
STS12 4487 3108
Evaluating for dataset: STS12 using model Alibaba-NLP/gte-multilingual-mlm-base
STS13 5987 1500
Evaluating for dataset: STS13 using model Alibaba-NLP/gte-multilingual-mlm-base
STS14 9737 3750
Evaluating for dataset: STS14 using model Alibaba-NLP/gte-multilingual-mlm-base
STS15 12737 3000
Evaluating for dataset: STS15 using model Alibaba-NLP/gte-multilingual-mlm-base
STS16 13923 1186
Evaluating for dataset: STS16 using model Alibaba-NLP/gte-multilingual-mlm-base
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model Alibaba-NLP/gte-multilingual-mlm-base



STS-B 1379 1379
Evaluating for dataset: STS-B using model BAAI/bge-m3
STS12 4487 3108
Evaluating for dataset: STS12 using model BAAI/bge-m3
STS13 5987 1500
Evaluating for dataset: STS13 using model BAAI/bge-m3
STS14 9737 3750
Evaluating for dataset: STS14 using model BAAI/bge-m3
STS15 12737 3000
Evaluating for dataset: STS15 using model BAAI/bge-m3
STS16 13923 1186
Evaluating for dataset: STS16 using model BAAI/bge-m3
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model BAAI/bge-m3



STS-B 1379 1379
Evaluating for dataset: STS-B using model BAAI/bge-m3-unsupervised
STS12 4487 3108
Evaluating for dataset: STS12 using model BAAI/bge-m3-unsupervised
STS13 5987 1500
Evaluating for dataset: STS13 using model BAAI/bge-m3-unsupervised
STS14 9737 3750
Evaluating for dataset: STS14 using model BAAI/bge-m3-unsupervised
STS15 12737 3000
Evaluating for dataset: STS15 using model BAAI/bge-m3-unsupervised
STS16 13923 1186
Evaluating for dataset: STS16 using model BAAI/bge-m3-unsupervised
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model BAAI/bge-m3-unsupervised



Some weights of XLMRobertaModel were not initialized from the model checkpoint at BAAI/bge-m3-retromae and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



STS-B 1379 1379
Evaluating for dataset: STS-B using model BAAI/bge-m3-retromae
STS12 4487 3108
Evaluating for dataset: STS12 using model BAAI/bge-m3-retromae
STS13 5987 1500
Evaluating for dataset: STS13 using model BAAI/bge-m3-retromae
STS14 9737 3750
Evaluating for dataset: STS14 using model BAAI/bge-m3-retromae
STS15 12737 3000
Evaluating for dataset: STS15 using model BAAI/bge-m3-retromae
STS16 13923 1186
Evaluating for dataset: STS16 using model BAAI/bge-m3-retromae
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model BAAI/bge-m3-retromae



STS-B 1379 1379
Evaluating for dataset: STS-B using model intfloat/multilingual-e5-small
STS12 4487 3108
Evaluating for dataset: STS12 using model intfloat/multilingual-e5-small
STS13 5987 1500
Evaluating for dataset: STS13 using model intfloat/multilingual-e5-small
STS14 9737 3750
Evaluating for dataset: STS14 using model intfloat/multilingual-e5-small
STS15 12737 3000
Evaluating for dataset: STS15 using model intfloat/multilingual-e5-small
STS16 13923 1186
Evaluating for dataset: STS16 using model intfloat/multilingual-e5-small
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model intfloat/multilingual-e5-small



STS-B 1379 1379
Evaluating for dataset: STS-B using model intfloat/multilingual-e5-base
STS12 4487 3108
Evaluating for dataset: STS12 using model intfloat/multilingual-e5-base
STS13 5987 1500
Evaluating for dataset: STS13 using model intfloat/multilingual-e5-base
STS14 9737 3750
Evaluating for dataset: STS14 using model intfloat/multilingual-e5-base
STS15 12737 3000
Evaluating for dataset: STS15 using model intfloat/multilingual-e5-base
STS16 13923 1186
Evaluating for dataset: STS16 using model intfloat/multilingual-e5-base
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model intfloat/multilingual-e5-base



STS-B 1379 1379
Evaluating for dataset: STS-B using model intfloat/multilingual-e5-large
STS12 4487 3108
Evaluating for dataset: STS12 using model intfloat/multilingual-e5-large
STS13 5987 1500
Evaluating for dataset: STS13 using model intfloat/multilingual-e5-large
STS14 9737 3750
Evaluating for dataset: STS14 using model intfloat/multilingual-e5-large
STS15 12737 3000
Evaluating for dataset: STS15 using model intfloat/multilingual-e5-large
STS16 13923 1186
Evaluating for dataset: STS16 using model intfloat/multilingual-e5-large
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model intfloat/multilingual-e5-large



STS-B 1379 1379
Evaluating for dataset: STS-B using model nampham1106/bkcare-embedding
STS12 4487 3108
Evaluating for dataset: STS12 using model nampham1106/bkcare-embedding
STS13 5987 1500
Evaluating for dataset: STS13 using model nampham1106/bkcare-embedding
STS14 9737 3750
Evaluating for dataset: STS14 using model nampham1106/bkcare-embedding
STS15 12737 3000
Evaluating for dataset: STS15 using model nampham1106/bkcare-embedding
STS16 13923 1186
Evaluating for dataset: STS16 using model nampham1106/bkcare-embedding
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model nampham1106/bkcare-embedding





STS-B 1379 1379
Evaluating for dataset: STS-B using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS12 4487 3108
Evaluating for dataset: STS12 using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS13 5987 1500
Evaluating for dataset: STS13 using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS14 9737 3750
Evaluating for dataset: STS14 using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS15 12737 3000
Evaluating for dataset: STS15 using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS16 13923 1186
Evaluating for dataset: STS16 using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model VoVanPhuc/unsup-SimCSE-VietNamese-phobert-base





STS-B 1379 1379
Evaluating for dataset: STS-B using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS12 4487 3108
Evaluating for dataset: STS12 using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS13 5987 1500
Evaluating for dataset: STS13 using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS14 9737 3750
Evaluating for dataset: STS14 using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS15 12737 3000
Evaluating for dataset: STS15 using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS16 13923 1186
Evaluating for dataset: STS16 using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model VoVanPhuc/sup-SimCSE-VietNamese-phobert-base



STS-B 1379 1379
Evaluating for dataset: STS-B using model keepitreal/vietnamese-sbert
STS12 4487 3108
Evaluating for dataset: STS12 using model keepitreal/vietnamese-sbert
STS13 5987 1500
Evaluating for dataset: STS13 using model keepitreal/vietnamese-sbert
STS14 9737 3750
Evaluating for dataset: STS14 using model keepitreal/vietnamese-sbert
STS15 12737 3000
Evaluating for dataset: STS15 using model keepitreal/vietnamese-sbert
STS16 13923 1186
Evaluating for dataset: STS16 using model keepitreal/vietnamese-sbert
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model keepitreal/vietnamese-sbert



STS-B 1379 1379
Evaluating for dataset: STS-B using model hiieu/halong_embedding
STS12 4487 3108
Evaluating for dataset: STS12 using model hiieu/halong_embedding
STS13 5987 1500
Evaluating for dataset: STS13 using model hiieu/halong_embedding
STS14 9737 3750
Evaluating for dataset: STS14 using model hiieu/halong_embedding
STS15 12737 3000
Evaluating for dataset: STS15 using model hiieu/halong_embedding
STS16 13923 1186
Evaluating for dataset: STS16 using model hiieu/halong_embedding
STS-Sickr 23850 9927
Evaluating for dataset: STS-Sickr using model hiieu/halong_embedding


# Load the evaluation results

In [7]:
import os

import pandas as pd

RESULT_FILE = 'similarity_evaluation_sts-test_results.csv'
JINA_MODEL = "jinaai/jina-embeddings-v3"
# Result-folder suffixes that exist only for jina-embeddings-v3
JINA_SUFFIXES = ["_matching", "_retrieval", "_matching_pyvi_tokenizer", "_retrieval_pyvi_tokenizer"]

# One list of scores per model variant, filled in dataset order
pearson_scores = {model: [] for model in model_names}
spearman_scores = {model: [] for model in model_names}
if JINA_MODEL in model_names:
    for suffix in JINA_SUFFIXES:
        pearson_scores[JINA_MODEL + suffix] = []
        spearman_scores[JINA_MODEL + suffix] = []
for model in model_names:
    pearson_scores[f'{model}_pyvi_tokenizer'] = []
    spearman_scores[f'{model}_pyvi_tokenizer'] = []


def append_scores(dataset, model, suffix=""):
    """Read the result CSV for one model variant on one dataset and append its scores."""
    folder = model.split("/")[0] + "_" + model.split("/")[1] + suffix
    csv_path = os.path.join(path_org, dataset, folder, RESULT_FILE)
    if os.path.exists(csv_path):
        data = pd.read_csv(csv_path)
        # Extract cosine_pearson and cosine_spearman directly
        cosine_pearson = data['cosine_pearson'].values[0]
        cosine_spearman = data['cosine_spearman'].values[0]
        print(model, dataset, cosine_pearson, cosine_spearman)
    else:
        # Missing result file: store NaN (None would break the arithmetic below)
        cosine_pearson = float('nan')
        cosine_spearman = float('nan')
    # Store the scores as percentages rounded to two decimals
    pearson_scores[model + suffix].append(round(100 * cosine_pearson, 2))
    spearman_scores[model + suffix].append(round(100 * cosine_spearman, 2))


# Loop through each dataset and model variant and collect the coefficients
for dataset in datasets:
    for model in model_names:
        suffixes = ["", "_pyvi_tokenizer"]
        if model == JINA_MODEL:
            suffixes += JINA_SUFFIXES
        for suffix in suffixes:
            append_scores(dataset, model, suffix)

# Convert dictionaries to dataframes (rows: datasets, columns: model variants)
df_pearson = pd.DataFrame(pearson_scores, index=datasets)
df_spearman = pd.DataFrame(spearman_scores, index=datasets)
# mean() skips NaN, so a missing result does not invalidate the average
df_pearson.loc['Mean'] = df_pearson.mean()
df_spearman.loc['Mean'] = df_spearman.mean()
# Transpose the dataframes so that each row is a model variant
df_pearson_transposed = df_pearson.T
df_spearman_transposed = df_spearman.T

jinaai/jina-embeddings-v3 STS-B 0.8422838111984067 0.840092968207589
jinaai/jina-embeddings-v3 STS-B 0.8210837086666871 0.8191636775440555
jinaai/jina-embeddings-v3 STS-B 0.8602765449441695 0.8650742175387514
jinaai/jina-embeddings-v3 STS-B 0.7385126877555928 0.722344610544845
jinaai/jina-embeddings-v3 STS-B 0.8470009655952146 0.8506507232960372
jinaai/jina-embeddings-v3 STS-B 0.6667334944857141 0.651195370527951
openai/text-embedding-3-small STS-B 0.7218889266061563 0.7166352133728238
openai/text-embedding-3-small STS-B 0.7044158227910041 0.6984607880015773
openai/text-embedding-3-large STS-B 0.7984850455061618 0.7894068723866893
openai/text-embedding-3-large STS-B 0.7803183427410513 0.7720234778386388
openai/text-embedding-ada-002 STS-B 0.7175421400312392 0.7135601201136826
openai/text-embedding-ada-002 STS-B 0.6960331261456955 0.694985738963807
dangvantuan/vietnamese-embedding STS-B 0.7896822055483628 0.7866684902706179
dangvantuan/vietnamese-embedding STS-B 0.8486555494051962 0.848

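[dangvantuan/vietnamese-embedding](https://huggingface.co/dangvantuan/vietnamese-embedding) was trained on this dataset, so it should be excluded from this benchmark. The two filter cells below are commented out, so the model still appears in the results tables.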

In [8]:
#df_pearson_transposed = df_pearson_transposed[df_pearson_transposed.index != 'dangvantuan/vietnamese-embedding']

In [9]:
#df_spearman_transposed = df_spearman_transposed[df_spearman_transposed.index != 'dangvantuan/vietnamese-embedding']
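
The commented-out filters above compare the index with the bare model name, so they would leave the `_pyvi_tokenizer` row of that model in place. Below is a minimal sketch of a filter that drops a model together with its suffixed variants while keeping the similarly named `-LongContext` model. The helper name `belongs_to` is hypothetical, and the row names are assumed to follow the suffix convention used in this notebook (`_pyvi_tokenizer`, `_matching`, `_retrieval`).

```python
def belongs_to(row_name: str, model: str) -> bool:
    """True if the row is `model` itself or one of its "_"-suffixed variants."""
    return row_name == model or row_name.startswith(model + "_")


# Sample row names taken from the result tables of this notebook
rows = [
    "dangvantuan/vietnamese-embedding",
    "dangvantuan/vietnamese-embedding_pyvi_tokenizer",
    "dangvantuan/vietnamese-embedding-LongContext",
    "dangvantuan/vietnamese-embedding-LongContext_pyvi_tokenizer",
    "BAAI/bge-m3",
]
excluded = "dangvantuan/vietnamese-embedding"

# "-LongContext" rows survive because they continue with "-" rather than "_"
kept = [name for name in rows if not belongs_to(name, excluded)]
print(kept)
```

The same predicate could build a boolean mask for the dataframes, for example `df_pearson_transposed.loc[[not belongs_to(r, excluded) for r in df_pearson_transposed.index]]`.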

In [10]:
# Sort the transposed dataframes by the 'Mean' column in descending order
df_pearson_sorted = df_pearson_transposed.sort_values(by='Mean', ascending=False)
df_spearman_sorted = df_spearman_transposed.sort_values(by='Mean', ascending=False)


In [11]:
def bold_max(s):
    is_max = s == s.max()
    return ['font-weight: bold' if v else '' for v in is_max]

# Apply the function to each column (axis=0), bolding the best score per dataset
df_pearson_styled = df_pearson_sorted.style.apply(bold_max, axis=0)
df_spearman_styled = df_spearman_sorted.style.apply(bold_max, axis=0)


## Results
I evaluate these models by Spearman and Pearson correlation (the `df_spearman_styled` and `df_pearson_styled` tables), as in the original code:
- Rows with the "_pyvi_tokenizer" suffix use the pyvi tokenizer.
- For jina-v3, "_matching" means the parameter task = 'text-matching' was used; "_retrieval" means the task and prompt_name parameters were set to "retrieval.query" and 'retrieval.passage'.

The results show that:
- Pyvi improves the results of some models and lowers those of others. For example, the mean Spearman score of dangvantuan/vietnamese-embedding-LongContext falls from 82.10 to 77.94 with pyvi.
- With Jina v3, adding the parameter task = 'text-matching' improves the results noticeably (mean Spearman 79.97 → 82.04).
- The best models are dangvantuan/vietnamese-embedding, dangvantuan/vietnamese-embedding-LongContext, jinaai/jina-embeddings-v3, Alibaba-NLP/gte-multilingual-base and VoVanPhuc/sup-SimCSE-VietNamese-phobert-base.
- These datasets focus on text matching, or text similarity (the closer two sentences are in meaning, the closer their embeddings). The jina-v3 results with task = 'text-matching' versus the retrieval task (on STS-B, Pearson 86.03 versus 73.85) show that this problem differs from 'retrieval':
    - Evaluating the 'retrieval' task therefore requires a different dataset, made of questions and their corresponding answers.
    - Depending on whether the target problem is text matching (as in this dataset) or retrieval (for RAG or QA models), you need to prepare a different dataset and choose a different model rather than using one model for both tasks.
- The open-source models score considerably better than the OpenAI embeddings on these datasets.
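
To make the two metrics concrete, here is a minimal pure-Python sketch of how `cosine_pearson` and `cosine_spearman` can be computed: take the cosine similarity of each sentence pair's embeddings, then correlate those similarities with the gold scores. The embeddings and gold scores below are made-up toy values, and the helper names are hypothetical. The CSV files read above were presumably produced by the sentence-transformers evaluator, which may differ in implementation details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    """Pearson correlation: measures how linear the relationship is."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy embedding pairs and gold similarity scores (illustrative values only)
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.6, 0.8]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold = [4.8, 3.0, 2.5, 0.2]

predicted = [cosine(u, v) for u, v in pairs]
print("pearson :", round(100 * pearson(predicted, gold), 2))
print("spearman:", round(100 * spearman(predicted, gold), 2))
```

In this toy case the predicted similarities order the pairs exactly like the gold scores, so Spearman is at its maximum, while Pearson is lower because the relationship is not perfectly linear. This is why the two tables can rank models differently.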

In [36]:
df_spearman_styled

Model,STS-B,STS12,STS13,STS14,STS15,STS16,STS-Sickr,Mean
dangvantuan/vietnamese-embedding_pyvi_tokenizer,84.84,79.04,85.3,81.38,87.06,79.95,79.58,82.45
dangvantuan/vietnamese-embedding-LongContext,85.25,75.77,83.82,81.69,88.48,81.5,78.2,82.10
jinaai/jina-embeddings-v3_matching,86.51,77.3,83.05,77.42,84.53,82.75,82.72,82.04
jinaai/jina-embeddings-v3,84.01,74.06,82.45,75.74,84.9,81.27,77.33,79.97
jinaai/jina-embeddings-v3_matching_pyvi_tokenizer,85.07,72.93,77.87,73.71,83.13,79.13,80.83,78.95
Alibaba-NLP/gte-multilingual-base,82.65,73.85,77.57,74.62,85.94,79.85,77.34,78.83
VoVanPhuc/sup-SimCSE-VietNamese-phobert-base_pyvi_tokenizer,81.43,76.51,79.19,74.91,81.72,76.57,76.45,78.11
BAAI/bge-m3,82.01,73.47,75.17,71.93,84.54,82.26,77.31,78.10
dangvantuan/vietnamese-embedding-LongContext_pyvi_tokenizer,81.79,72.3,77.64,75.24,85.89,76.69,76.04,77.94
jinaai/jina-embeddings-v3_pyvi_tokenizer,81.92,70.6,79.46,72.94,83.52,78.21,75.65,77.47
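
The effect of the pyvi tokenizer can be read off by pairing each row with its `_pyvi_tokenizer` counterpart. The sketch below does this for the pairs visible in the Spearman table above, using the Mean column rounded to two decimals. The function name `pyvi_deltas` is hypothetical.

```python
SUFFIX = "_pyvi_tokenizer"

# Mean Spearman scores copied from the table above (rounded to two decimals)
mean_spearman = {
    "dangvantuan/vietnamese-embedding-LongContext": 82.10,
    "dangvantuan/vietnamese-embedding-LongContext_pyvi_tokenizer": 77.94,
    "jinaai/jina-embeddings-v3_matching": 82.04,
    "jinaai/jina-embeddings-v3_matching_pyvi_tokenizer": 78.95,
    "jinaai/jina-embeddings-v3": 79.97,
    "jinaai/jina-embeddings-v3_pyvi_tokenizer": 77.47,
}


def pyvi_deltas(scores):
    """Return {base row: score with pyvi minus score without pyvi}."""
    deltas = {}
    for name, with_pyvi in scores.items():
        if name.endswith(SUFFIX):
            base = name[: -len(SUFFIX)]
            if base in scores:  # only compare when both variants are present
                deltas[base] = round(with_pyvi - scores[base], 2)
    return deltas


for base, delta in pyvi_deltas(mean_spearman).items():
    print(f"{base}: {delta:+.2f}")
```

A negative delta means the pyvi tokenizer lowered the mean score for that model variant.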


In [37]:
df_pearson_styled

Model,STS-B,STS12,STS13,STS14,STS15,STS16,STS-Sickr,Mean
dangvantuan/vietnamese-embedding-LongContext,85.06,86.21,84.32,83.28,88.3,80.87,82.86,84.41
dangvantuan/vietnamese-embedding_pyvi_tokenizer,84.87,87.23,85.39,82.94,86.91,79.39,82.77,84.21
jinaai/jina-embeddings-v3_matching,86.03,83.59,81.91,78.96,83.81,81.43,86.99,83.25
jinaai/jina-embeddings-v3,84.23,81.97,81.22,77.3,84.27,80.4,83.2,81.80
dangvantuan/vietnamese-embedding-LongContext_pyvi_tokenizer,81.96,82.19,78.0,77.49,85.69,76.35,80.78,80.35
jinaai/jina-embeddings-v3_matching_pyvi_tokenizer,84.7,79.41,76.7,75.46,82.61,78.0,85.24,80.30
Alibaba-NLP/gte-multilingual-base,82.43,81.33,77.02,75.97,85.49,79.09,80.27,80.23
VoVanPhuc/sup-SimCSE-VietNamese-phobert-base_pyvi_tokenizer,81.52,85.02,78.22,75.94,81.53,75.39,77.75,79.34
jinaai/jina-embeddings-v3_pyvi_tokenizer,82.11,78.25,78.41,74.64,82.93,77.44,81.24,79.29
BAAI/bge-m3,81.56,79.38,74.52,71.6,84.13,81.37,80.25,78.97


In [21]:
df_spearman_styled.to_excel("df_spearman_styled.xlsx")

In [22]:
df_pearson_styled.to_excel("df_pearson_styled.xlsx")