Binary file added assets/images/speed_vs_mteb_plot.png
48 changes: 22 additions & 26 deletions results/README.md
@@ -1,7 +1,9 @@
# Results

This document contains the results of the Model2Vec project. The results are presented in the following sections:
- [MTEB Results (English)](#mteb-results-english)
- [MMTEB Results (Multilingual)](#mmteb-results-multilingual)
- [Retrieval Results](#retrieval-results)
- [Training Results](#training-results)
- [Ablations](#ablations)

@@ -13,17 +15,12 @@
Note: The `potion` and `M2V` models are our static models.

| Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | Pearl | WordSim |
|:-----------------------|------------:|-------------:|--------:|--------:|------------:|-------:|-------:|-------:|-------:|--------:|----------:|
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 55.80 | 55.93 | 69.25 | 44.90 | 82.37 | 47.14 | 42.92 | 78.95 | 25.96 | 60.83 | 49.91 |
| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 52.83 | 52.13 | 71.70 | 41.25 | 78.17 | 42.45 | 32.67 | 73.93 | 24.74 | 55.37 | 55.15 |
| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | 51.32 | 51.08 | 70.34 | 39.74 | 76.62 | 41.79 | 31.11 | 72.91 | 25.06 | 53.54 | 50.75 |
| [M2V_base_output](https://huggingface.co/minishlab/M2V_base_output) | 48.77 | 47.96 | 66.84 | 33.96 | 74.90 | 39.31 | 25.36 | 68.76 | 26.61 | 54.02 | 49.18 |
| [GloVe_300d](https://huggingface.co/sentence-transformers/average_word_embeddings_glove.6B.300d) | 45.49 | 45.82 | 62.73 | 37.10 | 72.48 | 38.28 | 21.80 | 61.52 | 26.81 | 45.65 | 43.05 |
| [BPEmb_50k_300d](https://github.com/bheinzerling/bpemb) | 42.33 | 41.74 | 61.72 | 35.17 | 57.86 | 37.26 | 15.36 | 55.30 | 29.49 | 47.56 | 41.28 |


<details>
@@ -39,22 +36,22 @@
For readability, the MTEB task names are abbreviated as follows:
- Sum: Summarization
</details>

The results show that [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model. It reaches 93.21% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with an average MTEB score of 52.13 while being orders of magnitude faster.
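The relative-performance figure quoted here is simply the ratio of the two `Avg (MTEB)` scores from the table above:

```python
# The "93.21% of the performance" figure is the ratio of the two
# Avg (MTEB) scores from the table above.
potion_base_32m = 52.13   # Avg (MTEB), potion-base-32M
all_minilm_l6_v2 = 55.93  # Avg (MTEB), all-MiniLM-L6-v2

relative_performance = potion_base_32m / all_minilm_l6_v2 * 100
print(f"{relative_performance:.2f}%")  # 93.21%
```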

The figure below shows the relationship between the number of sentences encoded per second and the average MTEB score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
As the plot shows, the potion and M2V models are much faster than the other models while remaining competitive in performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
Note: for a fair comparison, we disabled multiprocessing for Model2Vec in this benchmark. All sentence-transformers models were run with the [sentence-transformers](https://github.com/UKPLab/sentence-transformers) library's default settings for `encode`.

| ![Speed vs average MTEB score](../assets/images/speed_vs_mteb_plot.png) |
|:--:|
|*Figure: The average MTEB score plotted against sentences per second. The circle size indicates model size.*|
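The sentences-per-second metric in the figure is plain wall-clock throughput. A minimal sketch of the measurement, mirroring `benchmark_model` in `make_speed_vs_mteb_plot.py` but with the encoder left abstract, looks like this:

```python
import time

def docs_per_second(encode, texts: list[str]) -> float:
    """Wall-clock throughput of an encoder over a batch of texts.

    `encode` is any callable that takes a list of strings, e.g. a loaded
    model's `encode` method (name and signature assumed for illustration).
    """
    start = time.perf_counter()
    encode(texts)
    total_time = time.perf_counter() - start
    return len(texts) / total_time

# Trivial stand-in encoder; a real run would pass e.g. model.encode.
rate = docs_per_second(lambda batch: [len(t) for t in batch], ["some text"] * 1_000)
```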


## MMTEB Results (Multilingual)
The results for the multilingual models are shown in the table below. We compare against the [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) model, as well as other multilingual static embedding models.

Note: the MMTEB leaderboard ranks models using a [Borda count](https://en.wikipedia.org/wiki/Borda_count) over per-task ranks rather than a simple average. This rewards models that perform consistently well across all tasks, rather than those that excel on one task type while performing poorly on others.
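To illustrate how a Borda count can differ from a plain average, here is a toy sketch (invented model names and scores, not MMTEB data):

```python
# Toy illustration of a Borda count over per-task ranks (invented model
# names and scores; NOT MMTEB data). With n models, the best model on a
# task earns n-1 points, the next best n-2, and so on.
def borda_scores(scores_per_task: dict[str, dict[str, float]]) -> dict[str, int]:
    points: dict[str, int] = {}
    for task_scores in scores_per_task.values():
        ranked = sorted(task_scores, key=task_scores.get, reverse=True)
        n = len(ranked)
        for rank, model in enumerate(ranked):
            points[model] = points.get(model, 0) + (n - 1 - rank)
    return points

tasks = {
    "Classification": {"A": 90.0, "B": 60.0, "C": 50.0},
    "Retrieval": {"A": 20.0, "B": 35.0, "C": 30.0},
    "STS": {"A": 65.0, "B": 64.0, "C": 40.0},
    "Clustering": {"A": 30.0, "B": 40.0, "C": 20.0},
}
# Model A has the higher plain average (51.25 vs 49.75), but the more
# consistent model B wins the Borda count (6 points vs 5).
scores = borda_scores(tasks)
```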

| Model | Mean (Task) | Mean (TaskType) | BitMining | Class | Clust | InstRet | MultiClass | PairClass | Rank | Ret | STS |
| :---------------------------------------- | :---------- | :-------------- | :------------ | :------------- | :--------- | :-------------------- | :------------------------ | :------------------ | :-------- | :-------- | :-------- |
| [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 52.07 | 45.65 | 76.35 | 54.60 | 38.08 | −3.00 | 20.12 | 75.97 | 50.20 | 33.17 | 65.35 |
@@ -74,27 +71,26 @@
For readability, the MMTEB task names are abbreviated as follows:
- Class: Classification
- Clust: Clustering
- InstRet: Instruction Retrieval
- MultiClass: Multilabel Classification
- PairClass: PairClassification
- Rank: Reranking
- Ret: Retrieval
- STS: Semantic Textual Similarity

</details>

## Retrieval Results

Some of our models are specifically designed for retrieval tasks. The results are shown in the table below, with [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) included as a transformer baseline and [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) as a general-purpose static baseline.

| Model | Retrieval Score |
|:-----------------------|------------------:|
| [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | 42.92 |
| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | 35.06 |
| [static-retrieval-mrl-en-v1](https://huggingface.co/minishlab/static-retrieval-mrl-en-v1) | 34.95 |
| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | 32.67 |

As can be seen, [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is the most performant static retrieval model, reaching 81.69% of the performance of [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) with a retrieval score of 35.06.
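For intuition, retrieval with static embeddings reduces to ranking documents by similarity between embedding vectors. A minimal cosine-similarity sketch (toy 2-d vectors, numpy assumed) looks like this:

```python
import numpy as np

# Minimal sketch of dense retrieval with static embeddings: embed the query
# and documents, then rank documents by cosine similarity. Toy 2-d vectors;
# a real evaluation would embed with e.g. potion-retrieval-32M and score the
# rankings with an IR metric such as nDCG@10.
def rank_documents(query_vec: np.ndarray, doc_vecs: np.ndarray) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))  # document indices, most similar first

docs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
query = np.array([1.0, 0.1])
order = rank_documents(query, docs)  # doc 0 is ranked first
```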

## Training Results

98 changes: 52 additions & 46 deletions results/make_speed_vs_mteb_plot.py
@@ -106,68 +106,69 @@ def benchmark_model(name: str, info: list[str], texts: list[str]) -> dict[str, float]:
    return {"docs_per_second": docs_per_second, "total_time": total_time}


def main(save_path: str, n_texts: int, force_benchmark: bool) -> None:
    """Benchmark text embedding models and generate a plot."""
    save_dir = Path(save_path)
    save_dir.mkdir(parents=True, exist_ok=True)
    json_path = save_dir / "speed_benchmark_results.json"

    summarized_results = [
        {"Model": "potion-base-2M", "Average score": 47.49, "Samples per second": None, "Params (Million)": 1.875},
        {"Model": "GloVe 6B 300d", "Average score": 45.82, "Samples per second": None, "Params (Million)": 120.000},
        {"Model": "potion-base-4M", "Average score": 49.77, "Samples per second": None, "Params (Million)": 3.750},
        {"Model": "all-MiniLM-L6-v2", "Average score": 55.93, "Samples per second": None, "Params (Million)": 23.000},
        {"Model": "potion-base-8M", "Average score": 51.08, "Samples per second": None, "Params (Million)": 7.500},
        {"Model": "bge-base-en-v1.5", "Average score": 60.77, "Samples per second": None, "Params (Million)": 109.000},
        {"Model": "BPEmb-50k-300d", "Average score": 41.74, "Samples per second": None, "Params (Million)": 15.000},
        {"Model": "potion-base-32M", "Average score": 52.13, "Samples per second": None, "Params (Million)": 32.300},
    ]

    if not force_benchmark and json_path.exists():
        logger.info(f"Loading cached timings from {json_path} (use --force-benchmark to re-run)")
        with open(json_path) as file:
            timings = json.load(file)
    else:
        # Define the models to benchmark
        models: dict[str, list[str]] = {
            "BPEmb-50k-300d": ["", "BPEmb"],
            "all-MiniLM-L6-v2": ["sentence-transformers/all-MiniLM-L6-v2", "ST"],
            "bge-base-en-v1.5": ["BAAI/bge-base-en-v1.5", "ST"],
            "GloVe 6B 300d": ["sentence-transformers/average_word_embeddings_glove.6B.300d", "ST"],
            "potion-base-8M": ["minishlab/potion-base-8M", "M2V"],
        }

        # Load the dataset
        ds = load_dataset("wikimedia/wikipedia", data_files="20231101.en/train-00000-of-00041.parquet")["train"]
        texts = ds["text"][:n_texts]

        timings = {}
        for name, info in models.items():
            timing = benchmark_model(name, info, texts)
            timings[name] = timing

        with open(json_path, "w") as file:
            json.dump(timings, file, indent=4)
        logger.info(f"Timings saved to {json_path}")

    # Update summarized results with the measured speeds
    for result in summarized_results:
        name = str(result["Model"])
        if name in timings:
            result["Samples per second"] = timings[name]["docs_per_second"]

    # Set potion-base-8M as the reference speed for the other M2V models
    potion_base_8m_speed = next(
        result["Samples per second"] for result in summarized_results if result["Model"] == "potion-base-8M"
    )
    for model_name in ["potion-base-2M", "potion-base-4M", "potion-base-32M"]:
        for result in summarized_results:
            if result["Model"] == model_name:
                result["Samples per second"] = potion_base_8m_speed

    # Create and save the plot
    df = pd.DataFrame(summarized_results)
    plot = make_plot(df)
    plot_path = save_dir / "speed_vs_mteb_plot.png"
    plot.save(plot_path, width=12, height=10)

    logger.info(f"Plot saved to {plot_path}")


@@ -179,6 +180,11 @@ def main(save_path: str, n_texts: int) -> None:
    parser.add_argument(
        "--n-texts", type=int, default=100_000, help="Number of texts to use from the dataset for benchmarking."
    )
    parser.add_argument(
        "--force-benchmark",
        action="store_true",
        help="Re-run the speed benchmark even if cached results exist.",
    )
    args = parser.parse_args()

    main(save_path=args.save_path, n_texts=args.n_texts, force_benchmark=args.force_benchmark)