From 07f289e6e8b810aca3992ad13a0476759cb0726b Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:35:11 +0100 Subject: [PATCH 01/38] Updated readme --- README.md | 41 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index baea2578..b1825496 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,10 @@ Model2Vec is a technique to turn any sentence transformer into a really small st ## Updates & Announcements +- **12/02/2024**: We released **Model2Vec training**, allowing you to fine-tune your own classification models on top of Model2Vec models. Find out more in our [documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md) and in our [blog post](LINK). + - **30/01/2024**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available. + - **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). 
NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). ## Table of Contents @@ -54,6 +57,7 @@ Model2Vec is a technique to turn any sentence transformer into a really small st - [Usage](#usage) - [Inference](#inference) - [Distillation](#distillation) + - [Training](#training) - [Evaluation](#evaluation) - [Integrations](#integrations) - [Model List](#model-list) @@ -113,7 +117,7 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar - **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home. - **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary. - **Integrated in many popular libraries**: Model2Vec can be used directly in popular libraries such as [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), [LangChain](https://github.com/langchain-ai/langchain), [txtai](https://github.com/neuml/txtai), and [Chonkie](https://github.com/bhavnicksm/chonkie). See the [Integrations](#integrations) section for more information. -- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own. +- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). ## What is Model2Vec? @@ -265,6 +269,41 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=Fa +### Training + +
+ Training a classifier +
+ +Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model: + +```python +from model2vec.train import StaticModelForClassification + +# Load a distilled model +distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") + +# Load a dataset +from datasets import load_dataset + +ds = load_dataset("setfit/subj") +train = ds["train"] +test = ds["test"] + +X_train, y_train = train["text"], train["label"] +X_test, y_test = test["text"], test["label"] + +# Train the classifier +classifier = StaticModelForClassification.from_static_model(distilled_model) +classifier.fit(X_train, y_train) + +# Evaluate the classifier +y_hat = classifier.predict(X_test) +accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 +``` + +
+ ### Evaluation From c5ce2ef564e7c19b1e195c22efdb98aae4dd0931 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:36:47 +0100 Subject: [PATCH 02/38] Updated readme --- README.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index b1825496..28ef2d57 100644 --- a/README.md +++ b/README.md @@ -275,17 +275,16 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=Fa Training a classifier
-Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model: +Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md). ```python +from datasets import load_dataset from model2vec.train import StaticModelForClassification # Load a distilled model distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") # Load a dataset -from datasets import load_dataset - ds = load_dataset("setfit/subj") train = ds["train"] test = ds["test"] From a0eee42a3a8a6630265e3201022ffa4d8e8c693a Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:40:53 +0100 Subject: [PATCH 03/38] Updated readme --- README.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 28ef2d57..9d86a0e3 100644 --- a/README.md +++ b/README.md @@ -360,16 +360,15 @@ print(make_leaderboard(task_scores))
-Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) using the `StaticEmbedding` module. +Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model: ```python from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding -# Initialize a StaticEmbedding module -static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") -model = SentenceTransformer(modules=[static_embedding]) +# Load a Model2Vec model from the Hub +model = SentenceTransformer("minishlab/potion-base-8M") +# Make embeddings embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` From 0270b2161c7667bdfa8165ca4163d6be783949f1 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:46:44 +0100 Subject: [PATCH 04/38] Refactored docs --- README.md | 388 ------------------------------------------- docs/README.md | 3 + docs/integrations.md | 168 +++++++++++++++++++ docs/usage.md | 220 ++++++++++++++++++++++++ 4 files changed, 391 insertions(+), 388 deletions(-) create mode 100644 docs/README.md create mode 100644 docs/integrations.md create mode 100644 docs/usage.md diff --git a/README.md b/README.md index 9d86a0e3..96397409 100644 --- a/README.md +++ b/README.md @@ -134,393 +134,6 @@ Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0ab For a much more extensive deepdive, please refer to our [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) and our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). -## Usage - -### Inference - -
- Inference with a pretrained model -
- -Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub. -```python -from model2vec import StaticModel - -# Load a model from the Hub. You can optionally pass a token when loading a private model -model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None) - -# Make embeddings -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) - -# Make sequences of token embeddings -token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` -
- - -
- Inference with the Sentence Transformers library -
- -The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. - -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -# Initialize a StaticEmbedding module -static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -
- -### Distillation - -
- Distilling from a Sentence Transformer -
- -The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to really small model that might be less performant. -```python -from model2vec.distill import distill - -# Distill a Sentence Transformer model -m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) - -# Save the model -m2v_model.save_pretrained("m2v_model") - -``` -
- -
- Distilling from a loaded model -
- -If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory. - -```python -from transformers import AutoModel, AutoTokenizer - -from model2vec.distill import distill_from_model - -# Assuming a loaded model and tokenizer -model_name = "baai/bge-base-en-v1.5" -model = AutoModel.from_pretrained(model_name) -tokenizer = AutoTokenizer.from_pretrained(model_name) - -m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256) - -m2v_model.save_pretrained("m2v_model") - -``` - -
- -
- Distilling with the Sentence Transformers library -
- -The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. - -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -
- - -
- Distilling with a custom vocabulary -
- -If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GLoVe or traditional word2vec, but doesn't actually require a corpus or data. -```python -from model2vec.distill import distill - -# Load a vocabulary as a list of strings -vocabulary = ["word1", "word2", "word3"] - -# Distill a Sentence Transformer model with the custom vocabulary -m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary) - -# Save the model -m2v_model.save_pretrained("m2v_model") - -# Or push it to the hub -m2v_model.push_to_hub("my_organization/my_model", token="") -``` - -By default, this will distill a model with a subword tokenizer, combining the models (subword) vocab with the new vocabulary. If you want to get a word-level tokenizer instead (with only the passed vocabulary), the `use_subword` parameter can be set to `False`, e.g.: - -```python -m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=False) -``` - -**Important note:** we assume the passed vocabulary is sorted in rank frequency. i.e., we don't care about the actual word frequencies, but do assume that the most frequent word is first, and the least frequent word is last. If you're not sure whether this is case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse. - -
- - -### Training - -
- Training a classifier -
- -Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md). - -```python -from datasets import load_dataset -from model2vec.train import StaticModelForClassification - -# Load a distilled model -distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") - -# Load a dataset -ds = load_dataset("setfit/subj") -train = ds["train"] -test = ds["test"] - -X_train, y_train = train["text"], train["label"] -X_test, y_test = test["text"], test["label"] - -# Train the classifier -classifier = StaticModelForClassification.from_static_model(distilled_model) -classifier.fit(X_train, y_train) - -# Evaluate the classifier -y_hat = classifier.predict(X_test) -accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 -``` - -
- -### Evaluation - - -
- Installation -
- -Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with: - -```bash -pip install git+https://github.com/MinishLab/evaluation.git@main -``` -
- -
- Evaluation Code -
- -The following code snippet shows how to evaluate a Model2Vec model: -```python -from model2vec import StaticModel - -from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results -from mteb import ModelMeta - -# Get all available tasks -tasks = get_tasks() -# Define the CustomMTEB object with the specified tasks -evaluation = CustomMTEB(tasks=tasks) - -# Load the model -model_name = "m2v_model" -model = StaticModel.from_pretrained(model_name) - -# Optionally, add model metadata in MTEB format -model.mteb_model_meta = ModelMeta( - name=model_name, revision="no_revision_available", release_date=None, languages=None - ) - -# Run the evaluation -results = evaluation.run(model, eval_splits=["test"], output_folder=f"results") - -# Parse the results and summarize them -parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name) -task_scores = summarize_results(parsed_results) - -# Print the results in a leaderboard format -print(make_leaderboard(task_scores)) -``` -
- -## Integrations -
- Sentence Transformers - -
- -Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): - -The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model: -```python -from sentence_transformers import SentenceTransformer - -# Load a Model2Vec model from the Hub -model = SentenceTransformer("minishlab/potion-base-8M") -# Make embeddings -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -The following code snippet shows how to distill a model directly into a Sentence Transformer model: - -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding). - -
- -
- LangChain -
- -Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) using the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the `langchain-community` package with `pip install langchain-community`: - -```python -from langchain_community.embeddings import Model2vecEmbeddings -from langchain_community.vectorstores import FAISS -from langchain.schema import Document - -# Initialize a Model2Vec embedder -embedder = Model2vecEmbeddings("minishlab/potion-base-8M") - -# Create some example texts -texts = [ - "Enduring Stew", - "Hearty Elixir", - "Mighty Mushroom Risotto", - "Spicy Meat Skewer", - "Fruit Salad", -] - -# Embed the texts -embeddings = embedder.embed_documents(texts) - -# Or, create a vector store and query it -documents = [Document(page_content=text) for text in texts] -vector_store = FAISS.from_documents(documents, embedder) -query = "Risotto" -query_vector = embedder.embed_query(query) -retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1) -``` -
- -
- Txtai -
- -Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`: - -```python -from txtai import Embeddings - -# Load a model2vec model -embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy") - -# Create some example texts -texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"] - -# Create embeddings for downstream tasks -vectors = embeddings.batchtransform(texts) - -# Or create a nearest-neighbors index and search it -embeddings.index(texts) -result = embeddings.search("Risotto", 1) -``` - -
- -
- Chonkie -
- -Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: - -```python -from chonkie import SDPMChunker - -# Create some example text to chunk -text = "It's dangerous to go alone! Take this." - -# Initialize the SemanticChunker with a potion model -chunker = SDPMChunker( - embedding_model="minishlab/potion-base-8M", - similarity_threshold=0.3 -) - -# Chunk the text -chunks = chunker.chunk(text) -``` - -
- - -
- Transformers.js -
- -To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point: - -```javascript -import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers'; - -const modelName = 'minishlab/potion-base-8M'; - -const modelConfig = { - config: { model_type: 'model2vec' }, - dtype: 'fp32', - revision: 'refs/pr/1' -}; -const tokenizerConfig = { - revision: 'refs/pr/2' -}; - -const model = await AutoModel.from_pretrained(modelName, modelConfig); -const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig); - -const texts = ['hello', 'hello world']; -const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false }); - -const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []); -const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))]; - -const flattened_input_ids = input_ids.flat(); -const modelInputs = { - input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]), - offsets: new Tensor('int64', offsets, [offsets.length]) -}; - -const { embeddings } = await model(modelInputs); -console.log(embeddings.tolist()); // output matches python version -``` - -Note that this requires that the Model2Vec has a `model.onnx` file and several required tokenizers file. To generate these for a model that does not have them yet, the following code snippet can be used: - -```bash -python scripts/export_to_onnx.py --model_path --save_path "" -``` - - -
-
- ## Model List @@ -537,7 +150,6 @@ We provide a number of models that can be used out of the box. These models are | [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | Output | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | Subword | 471M |
| - ## Results We have performed extensive experiments to evaluate the performance of Model2Vec models. The results are documented in the [results](results/README.md) folder. The results are presented in the following sections: diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..342d4ebb --- /dev/null +++ b/docs/README.md @@ -0,0 +1,3 @@ +# Documentation + +This directory contains the documentation for Model2Vec. diff --git a/docs/integrations.md b/docs/integrations.md new file mode 100644 index 00000000..5c13e402 --- /dev/null +++ b/docs/integrations.md @@ -0,0 +1,168 @@ + +# Integrations + +
+ Sentence Transformers + +
+ +Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): + +The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model: +```python +from sentence_transformers import SentenceTransformer + +# Load a Model2Vec model from the Hub +model = SentenceTransformer("minishlab/potion-base-8M") +# Make embeddings +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +The following code snippet shows how to distill a model directly into a Sentence Transformer model: + +```python +from sentence_transformers import SentenceTransformer +from sentence_transformers.models import StaticEmbedding + +static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) +model = SentenceTransformer(modules=[static_embedding]) +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding). + +
+ +
+ LangChain +
+ +Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) using the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the `langchain-community` package with `pip install langchain-community`: + +```python +from langchain_community.embeddings import Model2vecEmbeddings +from langchain_community.vectorstores import FAISS +from langchain.schema import Document + +# Initialize a Model2Vec embedder +embedder = Model2vecEmbeddings("minishlab/potion-base-8M") + +# Create some example texts +texts = [ + "Enduring Stew", + "Hearty Elixir", + "Mighty Mushroom Risotto", + "Spicy Meat Skewer", + "Fruit Salad", +] + +# Embed the texts +embeddings = embedder.embed_documents(texts) + +# Or, create a vector store and query it +documents = [Document(page_content=text) for text in texts] +vector_store = FAISS.from_documents(documents, embedder) +query = "Risotto" +query_vector = embedder.embed_query(query) +retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1) +``` +
+ +
+ Txtai +
+ +Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`: + +```python +from txtai import Embeddings + +# Load a model2vec model +embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy") + +# Create some example texts +texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"] + +# Create embeddings for downstream tasks +vectors = embeddings.batchtransform(texts) + +# Or create a nearest-neighbors index and search it +embeddings.index(texts) +result = embeddings.search("Risotto", 1) +``` + +
+ +
+ Chonkie +
+ +Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: + +```python +from chonkie import SDPMChunker + +# Create some example text to chunk +text = "It's dangerous to go alone! Take this." + +# Initialize the SDPMChunker with a potion model +chunker = SDPMChunker( + embedding_model="minishlab/potion-base-8M", + similarity_threshold=0.3 +) + +# Chunk the text +chunks = chunker.chunk(text) +``` + +
+ + +
+ Transformers.js +
+ +To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point: + +```javascript +import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers'; + +const modelName = 'minishlab/potion-base-8M'; + +const modelConfig = { + config: { model_type: 'model2vec' }, + dtype: 'fp32', + revision: 'refs/pr/1' +}; +const tokenizerConfig = { + revision: 'refs/pr/2' +}; + +const model = await AutoModel.from_pretrained(modelName, modelConfig); +const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig); + +const texts = ['hello', 'hello world']; +const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false }); + +const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []); +const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))]; + +const flattened_input_ids = input_ids.flat(); +const modelInputs = { + input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]), + offsets: new Tensor('int64', offsets, [offsets.length]) +}; + +const { embeddings } = await model(modelInputs); +console.log(embeddings.tolist()); // output matches python version +``` + +Note that this requires the Model2Vec model to have a `model.onnx` file and several required tokenizer files. To generate these for a model that does not have them yet, the following code snippet can be used: + +```bash +python scripts/export_to_onnx.py --model_path --save_path "" +``` + + +
+
diff --git a/docs/usage.md b/docs/usage.md new file mode 100644 index 00000000..bf1b2a14 --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,220 @@ + +# Usage + +## Inference + +
+ Inference with a pretrained model +
+ +Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub. +```python +from model2vec import StaticModel + +# Load a model from the Hub. You can optionally pass a token when loading a private model +model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None) + +# Make embeddings +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) + +# Make sequences of token embeddings +token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` +
+ + +
+ Inference with the Sentence Transformers library +
+ +The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. + +```python +from sentence_transformers import SentenceTransformer +from sentence_transformers.models import StaticEmbedding + +# Initialize a StaticEmbedding module +static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") +model = SentenceTransformer(modules=[static_embedding]) +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +
+ +## Distillation + +
+ Distilling from a Sentence Transformer +
+ +The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to a really small model that might be less performant. +```python +from model2vec.distill import distill + +# Distill a Sentence Transformer model +m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) + +# Save the model +m2v_model.save_pretrained("m2v_model") + +``` +
+ +
+ Distilling from a loaded model +
+ +If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory. + +```python +from transformers import AutoModel, AutoTokenizer + +from model2vec.distill import distill_from_model + +# Assuming a loaded model and tokenizer +model_name = "baai/bge-base-en-v1.5" +model = AutoModel.from_pretrained(model_name) +tokenizer = AutoTokenizer.from_pretrained(model_name) + +m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256) + +m2v_model.save_pretrained("m2v_model") + +``` + +
+ +
+ Distilling with the Sentence Transformers library +
+ +The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. + +```python +from sentence_transformers import SentenceTransformer +from sentence_transformers.models import StaticEmbedding + +static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) +model = SentenceTransformer(modules=[static_embedding]) +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +
+ + +
+ Distilling with a custom vocabulary +
+ +If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GloVe or traditional word2vec, but doesn't actually require a corpus or data. +```python +from model2vec.distill import distill + +# Load a vocabulary as a list of strings +vocabulary = ["word1", "word2", "word3"] + +# Distill a Sentence Transformer model with the custom vocabulary +m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary) + +# Save the model +m2v_model.save_pretrained("m2v_model") + +# Or push it to the hub +m2v_model.push_to_hub("my_organization/my_model", token="") +``` + +By default, this will distill a model with a subword tokenizer, combining the model's (subword) vocabulary with the new vocabulary. If you want to get a word-level tokenizer instead (with only the passed vocabulary), the `use_subword` parameter can be set to `False`, e.g.: + +```python +m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary, use_subword=False) +``` + +**Important note:** we assume the passed vocabulary is sorted by frequency rank, i.e., we don't care about the actual word frequencies, but do assume that the most frequent word is first and the least frequent word is last. If you're not sure whether this is the case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse. + +
+ + +## Training + +
+ Training a classifier +
+
+Model2Vec can be used to train a classifier on top of a distilled model, as the following code snippet shows. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md).
+
+```python
+import numpy as np
+from datasets import load_dataset
+from model2vec.train import StaticModelForClassification
+
+# Load a distilled model
+distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M")
+
+# Load a dataset
+ds = load_dataset("setfit/subj")
+train = ds["train"]
+test = ds["test"]
+
+X_train, y_train = train["text"], train["label"]
+X_test, y_test = test["text"], test["label"]
+
+# Train the classifier
+classifier = StaticModelForClassification.from_static_model(distilled_model)
+classifier.fit(X_train, y_train)
+
+# Evaluate the classifier (accuracy as a percentage)
+y_hat = classifier.predict(X_test)
+accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100
+```
+
+
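The accuracy computation at the end of the snippet relies on NumPy. A dependency-free sketch of the same calculation follows, using hypothetical toy labels rather than real classifier output; `accuracy_percent` is an illustrative helper name, not a Model2Vec function:

```python
def accuracy_percent(predictions, labels):
    """Return the percentage of predictions that equal their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty inputs")
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels) * 100


# Toy data, purely illustrative (not the output of a trained classifier)
y_hat_example = [1, 0, 1, 1]
y_test_example = [1, 0, 0, 1]
score = accuracy_percent(y_hat_example, y_test_example)
```

Three of the four toy predictions match their labels, so `score` is 75.0.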
+ +## Evaluation + + +
+ Installation +
+
+Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with:
+
+```bash
+pip install git+https://github.com/MinishLab/evaluation.git@main
+```
+
+ +
+ Evaluation Code +
+
+The following code snippet shows how to evaluate a Model2Vec model:
+```python
+from model2vec import StaticModel
+
+from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
+from mteb import ModelMeta
+
+# Get all available tasks
+tasks = get_tasks()
+# Define the CustomMTEB object with the specified tasks
+evaluation = CustomMTEB(tasks=tasks)
+
+# Load the model
+model_name = "m2v_model"
+model = StaticModel.from_pretrained(model_name)
+
+# Optionally, add model metadata in MTEB format
+model.mteb_model_meta = ModelMeta(
+    name=model_name, revision="no_revision_available", release_date=None, languages=None
+)
+
+# Run the evaluation
+results = evaluation.run(model, eval_splits=["test"], output_folder="results")
+
+# Parse the results and summarize them
+parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
+task_scores = summarize_results(parsed_results)
+
+# Print the results in a leaderboard format
+print(make_leaderboard(task_scores))
+```
+
From 99b75183d43453a5d24b55c3fd99ed2cb0d1d49e Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:55:23 +0100 Subject: [PATCH 05/38] Updated docs --- docs/usage.md | 51 +++++++++------------------------------------------ 1 file changed, 9 insertions(+), 42 deletions(-) diff --git a/docs/usage.md b/docs/usage.md index bf1b2a14..9e697c13 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -3,9 +3,7 @@ ## Inference -
- Inference with a pretrained model -
+### Inference with a pretrained model Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub. ```python @@ -20,12 +18,9 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever # Make sequences of token embeddings token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -
+### Inference with the Sentence Transformers library -
- Inference with the Sentence Transformers library -
The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. @@ -39,13 +34,9 @@ model = SentenceTransformer(modules=[static_embedding]) embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -
- ## Distillation -
- Distilling from a Sentence Transformer -
+### Distilling from a Sentence Transformer The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to really small model that might be less performant. ```python @@ -58,11 +49,8 @@ m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) m2v_model.save_pretrained("m2v_model") ``` -
-
- Distilling from a loaded model -
+### Distilling from a loaded model If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory. @@ -82,11 +70,7 @@ m2v_model.save_pretrained("m2v_model") ``` -
- -
- Distilling with the Sentence Transformers library -
+### Distilling with the Sentence Transformers library The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. @@ -99,12 +83,7 @@ model = SentenceTransformer(modules=[static_embedding]) embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -
- - -
- Distilling with a custom vocabulary -
+### Distilling with a custom vocabulary If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GLoVe or traditional word2vec, but doesn't actually require a corpus or data. ```python @@ -131,14 +110,10 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=Fa **Important note:** we assume the passed vocabulary is sorted in rank frequency. i.e., we don't care about the actual word frequencies, but do assume that the most frequent word is first, and the least frequent word is last. If you're not sure whether this is case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse. -
- ## Training -
- Training a classifier -
+### Training a classifier Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md). @@ -166,25 +141,18 @@ y_hat = classifier.predict(X_test) accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 ``` -
- ## Evaluation -
- Installation -
+### Installation Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with: ```bash pip install git+https://github.com/MinishLab/evaluation.git@main ``` -
-
- Evaluation Code -
+### Evaluation Code The following code snippet shows how to evaluate a Model2Vec model: ```python @@ -217,4 +185,3 @@ task_scores = summarize_results(parsed_results) # Print the results in a leaderboard format print(make_leaderboard(task_scores)) ``` -
From 6c2612eeb245eac41d295b0f858f2e090b2c0ed5 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 14:57:23 +0100 Subject: [PATCH 06/38] Updated docs --- docs/usage.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/docs/usage.md b/docs/usage.md index 9e697c13..6fbc2761 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,6 +1,25 @@ # Usage +This document provides an overview of how to use Model2Vec for inference, distillation, training, and evaluation. + +## Table of Contents +- [Inference](#inference) + - [Inference with a pretrained model](#inference-with-a-pretrained-model) + - [Inference with the Sentence Transformers library](#inference-with-the-sentence-transformers-library) +- [Distillation](#distillation) + - [Distilling from a Sentence Transformer](#distilling-from-a-sentence-transformer) + - [Distilling from a loaded model](#distilling-from-a-loaded-model) + - [Distilling with the Sentence Transformers library](#distilling-with-the-sentence-transformers-library) + - [Distilling with a custom vocabulary](#distilling-with-a-custom-vocabulary) +- [Training](#training) + - [Training a classifier](#training-a-classifier) +- [Evaluation](#evaluation) + - [Installation](#installation) + - [Evaluation Code](#evaluation-code) + + + ## Inference ### Inference with a pretrained model From b1fdf9c6a35c703f3d941e061931c99291c12c24 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:01:27 +0100 Subject: [PATCH 07/38] Updated docs --- docs/integrations.md | 41 ++++++++++++++--------------------------- docs/usage.md | 12 ++++-------- 2 files changed, 18 insertions(+), 35 deletions(-) diff --git a/docs/integrations.md b/docs/integrations.md index 5c13e402..efaec875 100644 --- a/docs/integrations.md +++ b/docs/integrations.md @@ -1,10 +1,16 @@ # Integrations -
- Sentence Transformers - -
+Model2Vec can be used in a variety of downstream libraries. This document provides examples of how to use Model2Vec in some of these libraries. + +## Table of Contents +- [Sentence Transformers](#sentence-transformers) +- [LangChain](#langchain) +- [Txtai](#txtai) +- [Chonkie](#chonkie) +- [Transformers.js](#transformersjs) + +## Sentence Transformers Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): @@ -31,11 +37,8 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding). -
-
- LangChain -
+## LangChain Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) using the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the `langchain-community` package with `pip install langchain-community`: @@ -66,11 +69,8 @@ query = "Risotto" query_vector = embedder.embed_query(query) retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1) ``` -
-
- Txtai -
+## Txtai Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`: @@ -91,11 +91,7 @@ embeddings.index(texts) result = embeddings.search("Risotto", 1) ``` -
- -
- Chonkie -
+## Chonkie Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: @@ -115,12 +111,7 @@ chunker = SDPMChunker( chunks = chunker.chunk(text) ``` -
- - -
- Transformers.js -
+## Transformers.js To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point: @@ -162,7 +153,3 @@ Note that this requires that the Model2Vec has a `model.onnx` file and several r ```bash python scripts/export_to_onnx.py --model_path --save_path "" ``` - - -
-
diff --git a/docs/usage.md b/docs/usage.md index 6fbc2761..cbeba2d3 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -18,8 +18,6 @@ This document provides an overview of how to use Model2Vec for inference, distil - [Installation](#installation) - [Evaluation Code](#evaluation-code) - - ## Inference ### Inference with a pretrained model @@ -40,16 +38,15 @@ token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It' ### Inference with the Sentence Transformers library - The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. ```python from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding -# Initialize a StaticEmbedding module -static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") -model = SentenceTransformer(modules=[static_embedding]) +# Load a Model2Vec model from the Hub +model = SentenceTransformer("minishlab/potion-base-8M") + +# Make embeddings embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` @@ -162,7 +159,6 @@ accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 ## Evaluation - ### Installation Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with: From 2e386805f4953f501be93846783fb44f4ade5af0 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:03:56 +0100 Subject: [PATCH 08/38] Updated docs --- docs/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 342d4ebb..6cc8de78 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,3 +1,5 @@ # Documentation -This directory contains the documentation for Model2Vec. +This directory contains the documentation for Model2Vec. 
The documentation is formatted in Markdown. The documentation is organized as follows: +- [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): This document provides a technical overview of how to use Model2Vec. +- [integrations.md]((https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md):): This document provides examples of how to use Model2Vec in various downstream libraries. From eaf840cfb6783ab4a2fa21b8aa88cf29c5a78133 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:04:14 +0100 Subject: [PATCH 09/38] Updated docs --- docs/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/README.md b/docs/README.md index 6cc8de78..f9fa9119 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,4 +2,4 @@ This directory contains the documentation for Model2Vec. The documentation is formatted in Markdown. The documentation is organized as follows: - [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): This document provides a technical overview of how to use Model2Vec. -- [integrations.md]((https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md):): This document provides examples of how to use Model2Vec in various downstream libraries. +- [integrations.md](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): This document provides examples of how to use Model2Vec in various downstream libraries. 
From 515bf8a39c56c027281efb0939357d92ed2d0707 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:09:45 +0100 Subject: [PATCH 10/38] Updated docs --- README.md | 23 +++++------------------ docs/README.md | 1 + docs/what_is_model2vec.md | 11 +++++++++++ 3 files changed, 17 insertions(+), 18 deletions(-) create mode 100644 docs/what_is_model2vec.md diff --git a/README.md b/README.md index 96397409..305688e5 100644 --- a/README.md +++ b/README.md @@ -54,12 +54,7 @@ Model2Vec is a technique to turn any sentence transformer into a really small st - [Quickstart](#quickstart) - [Main Features](#main-features) - [What is Model2Vec?](#what-is-model2vec) -- [Usage](#usage) - - [Inference](#inference) - - [Distillation](#distillation) - - [Training](#training) - - [Evaluation](#evaluation) -- [Integrations](#integrations) +- [Documentation](#documentation) - [Model List](#model-list) - [Results](#results) @@ -121,18 +116,10 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar ## What is Model2Vec? -Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. - -The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence. - -Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. 
These models are created with the following steps: -- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above. -- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus. -- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model. -- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using `smooth inverse frequency (SIF)` weighting using the following formula: `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training. - - -For a much more extensive deepdive, please refer to our [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) and our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). +Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the indiviudal tokens. After this, there are a number of post-processing steps we do that results in our best models. For a more extensive deepdive, please refer to the following resources: +- Our initial [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) +- Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). 
+- Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md) ## Model List diff --git a/docs/README.md b/docs/README.md index f9fa9119..7392176c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -3,3 +3,4 @@ This directory contains the documentation for Model2Vec. The documentation is formatted in Markdown. The documentation is organized as follows: - [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): This document provides a technical overview of how to use Model2Vec. - [integrations.md](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): This document provides examples of how to use Model2Vec in various downstream libraries. +- [what_is_model2vec.md](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): This document provides a high-level overview of how Model2Vec works. diff --git a/docs/what_is_model2vec.md b/docs/what_is_model2vec.md new file mode 100644 index 00000000..813ca4bb --- /dev/null +++ b/docs/what_is_model2vec.md @@ -0,0 +1,11 @@ +# What is Model2Vec? + +This document provides a high-level overview of how Model2Vec works. + +The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence. + +Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps: +- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above. 
+- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus. +- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model. +- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using `smooth inverse frequency (SIF)` weighting using the following formula: `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training. From 1b9aaa976effcfdc6889caff8e96887a79ed0c19 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:11:22 +0100 Subject: [PATCH 11/38] Updated docs --- README.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 305688e5..b97107b8 100644 --- a/README.md +++ b/README.md @@ -116,10 +116,12 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar ## What is Model2Vec? -Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the indiviudal tokens. After this, there are a number of post-processing steps we do that results in our best models. For a more extensive deepdive, please refer to the following resources: -- Our initial [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) -- Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). 
-- Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md) +Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. + +The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the indiviudal tokens. After this, there are a number of post-processing steps we do that results in our best models. For a more extensive deepdive, please refer to the following resources: +- Our initial [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec). Note that, while this post gives a good overview of the core idea, we've made a number of substantial improvements since then. +- Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). This post describes the Tokenlearn method we used to train our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). +- Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md). This document provides a high-level overview of how Model2Vec works. ## Model List From 495ae7045fe13c6768d91bf68c5882b5ae8fbc3a Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:14:09 +0100 Subject: [PATCH 12/38] Updated docs --- README.md | 7 +++++++ docs/what_is_model2vec.md | 2 +- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index b97107b8..87d161fb 100644 --- a/README.md +++ b/README.md @@ -123,6 +123,13 @@ The core idea is to forward pass a vocabulary through a sentence transformer mod - Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). 
This post describes the Tokenlearn method we used to train our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). - Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md). This document provides a high-level overview of how Model2Vec works. +## Documentation + +Our official documentation can be found [here](https://github.com/MinishLab/model2vec/blob/main/docs/README.md). This includes: +- [Usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): Provides a technical overview of how to use Model2Vec. +- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md):Provides examples of how to use Model2Vec in various downstream libraries. +- [Model2Vec technical documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): Provides a high-level overview of how Model2Vec works. + ## Model List diff --git a/docs/what_is_model2vec.md b/docs/what_is_model2vec.md index 813ca4bb..3413fd32 100644 --- a/docs/what_is_model2vec.md +++ b/docs/what_is_model2vec.md @@ -2,7 +2,7 @@ This document provides a high-level overview of how Model2Vec works. -The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence. +The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using SIF weighting (previously zipf weighting). During inference, we simply take the mean of all token embeddings occurring in a sentence. 
Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps: - **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above. From 2adbb7fd2ecff467d61b150c06c16ef34a78a14e Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:16:18 +0100 Subject: [PATCH 13/38] Updated docs --- README.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index 87d161fb..f5f17b6b 100644 --- a/README.md +++ b/README.md @@ -127,7 +127,7 @@ The core idea is to forward pass a vocabulary through a sentence transformer mod Our official documentation can be found [here](https://github.com/MinishLab/model2vec/blob/main/docs/README.md). This includes: - [Usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): Provides a technical overview of how to use Model2Vec. -- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md):Provides examples of how to use Model2Vec in various downstream libraries. +- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): Provides examples of how to use Model2Vec in various downstream libraries. - [Model2Vec technical documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): Provides a high-level overview of how Model2Vec works. @@ -136,14 +136,14 @@ Our official documentation can be found [here](https://github.com/MinishLab/mode We provide a number of models that can be used out of the box. 
These models are available on the [HuggingFace hub](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e) and can be loaded using the `from_pretrained` method. The models are listed below. -| Model | Language | Vocab | Sentence Transformer | Tokenizer Type | Params | Tokenlearn | -|-----------------------------------------------------------------------|-------------|------------------|-----------------------------------------------------------------|----------------|---------|-------------------| -| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | Output + Frequent C4 tokens | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 32.3M |
| -| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 7.5M |
| -| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 3.7M |
| -| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 1.8M |
| -| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | Output + Frequent C4 tokens | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 32.3M |
| -| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | Output | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | Subword | 471M |
| +| Model | Language | Sentence Transformer | Params | +|-----------------------------------------------------------------------|------------|-----------------------------------------------------------------|---------| +| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | +| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 7.5M | +| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 3.7M | +| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 1.8M | +| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | +| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 471M | ## Results @@ -161,9 +161,9 @@ MIT If you use Model2Vec in your research, please cite the following: ```bibtex @software{minishlab2024model2vec, - authors = {Stephan Tulkens, Thomas van Dongen}, + authors = {Stephan Tulkens and Thomas van Dongen}, title = {Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World}, year = {2024}, - url = {https://github.com/MinishLab/model2vec}, + url = {https://github.com/MinishLab/model2vec} } ``` From 658029f567039323a6c927fc65bab7847d747c6f Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:17:34 +0100 Subject: [PATCH 14/38] Updated docs --- README.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index f5f17b6b..5a448422 100644 --- a/README.md +++ b/README.md @@ -136,14 +136,15 @@ 
Our official documentation can be found [here](https://github.com/MinishLab/mode We provide a number of models that can be used out of the box. These models are available on the [HuggingFace hub](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e) and can be loaded using the `from_pretrained` method. The models are listed below. -| Model | Language | Sentence Transformer | Params | -|-----------------------------------------------------------------------|------------|-----------------------------------------------------------------|---------| -| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | -| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 7.5M | -| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 3.7M | -| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 1.8M | -| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | -| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 471M | + +| Model | Language | Sentence Transformer | Params | Task | +|-----------------------------------------------------------------------|------------|-----------------------------------------------------------------|---------|-----------| +| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | General | +| 
[potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 7.5M | General | +| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 3.7M | General | +| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 1.8M | General | +| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | Retrieval | +| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 471M | General | ## Results From c5be3336a0c2bfb1165e3c57b32768955a38909e Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:21:44 +0100 Subject: [PATCH 15/38] Updated docs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5a448422..01a6983f 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ -Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by 15x and making the models up to 500x faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. +Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model in the world. 
See our results [here](results/README.md), or dive in to see how it works. ## Updates & Announcements From 6b7a427b34a4babe4f3f4b1261f8bc73e247f5f3 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:26:00 +0100 Subject: [PATCH 16/38] Updated docs --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 01a6983f..1bba24f6 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ -Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. +Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. ## Updates & Announcements @@ -72,7 +72,7 @@ This will install the base inference package, which only depends on `numpy` and pip install model2vec[distill] ``` -The easiest way to get started with Model2Vec is to load one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings: +You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. 
The following code snippet shows how to load a model and make embeddings: ```python from model2vec import StaticModel @@ -107,7 +107,7 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar ## Main Features - **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GLoVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). -- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of 15, from 120M params, down to 7.5M (30 MB on disk, making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). +- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of up to 50. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is just ~30 MB on disk, and our smallest model ~8 MB (making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). - **Lightweight Dependencies**: the base package's only major dependency is `numpy`. - **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home. - **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary. From b7bf6a86a76798788444fb6d57789454b6f8ee16 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:28:07 +0100 Subject: [PATCH 17/38] Updated docs --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 1bba24f6..93fcd8fc 100644 --- a/README.md +++ b/README.md @@ -107,11 +107,11 @@ For advanced usage, such as using Model2Vec in the [Sentence Transformers librar ## Main Features - **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GLoVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). 
-- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of up to 50. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is just ~30 MB on disk, and our smallest model ~8 MB (making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). +- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of up to 50. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is just ~30 MB on disk, and our smallest model just ~8 MB (making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). - **Lightweight Dependencies**: the base package's only major dependency is `numpy`. -- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home. -- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary. -- **Integrated in many popular libraries**: Model2Vec can be used directly in popular libraries such as [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), [LangChain](https://github.com/langchain-ai/langchain), [txtai](https://github.com/neuml/txtai), and [Chonkie](https://github.com/bhavnicksm/chonkie). See the [Integrations](#integrations) section for more information. +- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. +- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. +- **Integrated in many popular libraries**: Model2Vec is integrated directly into popular libraries such as [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [LangChain](https://github.com/langchain-ai/langchain). For more information, see our [integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md).
- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). ## What is Model2Vec? From 281b4a10a1d1b7f9bfde097e981714192fbe62e1 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:33:38 +0100 Subject: [PATCH 18/38] Updated docs --- README.md | 46 ++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 38 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 93fcd8fc..cd11abc3 100644 --- a/README.md +++ b/README.md @@ -72,7 +72,7 @@ This will install the base inference package, which only depends on `numpy` and pip install model2vec[distill] ``` -You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings: +You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings, which you can use to classify texts, cluster, or build a RAG system: ```python from model2vec import StaticModel @@ -86,9 +86,8 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -And that's it. You can use the model to classify texts, to cluster, or to build a RAG system. +Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. 
The following code snippet shows how to distill a model in ~30 seconds on a CPU: -Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model: ```python from model2vec.distill import distill @@ -99,9 +98,40 @@ m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) m2v_model.save_pretrained("m2v_model") ``` -Distillation is really fast and only takes 30 seconds on CPU. Best of all, distillation requires no training data. +After distillation, you can also fine-tune your own classification models on top of the distilled model. First, make sure you install the `training` extras with: -For advanced usage, such as using Model2Vec in the [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers), please refer to the [Usage](#usage) sections. +```bash +pip install model2vec[training] +``` + +Then, you can fine-tune a model as follows: + +```python +from datasets import load_dataset +from model2vec.train import StaticModelForClassification + +# Load a distilled model +distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") + +# Load a dataset +ds = load_dataset("setfit/subj") +train = ds["train"] +test = ds["test"] + +X_train, y_train = train["text"], train["label"] +X_test, y_test = test["text"], test["label"] + +# Train the classifier +classifier = StaticModelForClassification.from_static_model(distilled_model) +classifier.fit(X_train, y_train) + +# Evaluate the classifier +y_hat = classifier.predict(X_test) +accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 +``` + + +For advanced usage, please refer to our [usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md). 
## Main Features @@ -126,9 +156,9 @@ The core idea is to forward pass a vocabulary through a sentence transformer mod ## Documentation Our official documentation can be found [here](https://github.com/MinishLab/model2vec/blob/main/docs/README.md). This includes: -- [Usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): Provides a technical overview of how to use Model2Vec. -- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): Provides examples of how to use Model2Vec in various downstream libraries. -- [Model2Vec technical documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): Provides a high-level overview of how Model2Vec works. +- [Usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): provides a technical overview of how to use Model2Vec. +- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): provides examples of how to use Model2Vec in various downstream libraries. +- [Model2Vec technical documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): provides a high-level overview of how Model2Vec works. ## Model List From 0d6e5d713f661b6908bc791a26a73372cb348151 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:39:16 +0100 Subject: [PATCH 19/38] Updated docs --- README.md | 18 ++++++------------ docs/usage.md | 4 ++-- 2 files changed, 8 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index cd11abc3..c8e616fd 100644 --- a/README.md +++ b/README.md @@ -98,7 +98,7 @@ m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) m2v_model.save_pretrained("m2v_model") ``` -After distillation, you can also fine-tune your own classification models on top of the distilled model. 
First, make sure you install the `training` extras with: +After distillation, you can also fine-tune your own classification models on top of the distilled model, or on a pre-trained model. First, make sure you install the `training` extras with: ```bash pip install model2vec[training] @@ -107,27 +107,21 @@ pip install model2vec[training] Then, you can fine-tune a model as follows: ```python +import numpy as np from datasets import load_dataset from model2vec.train import StaticModelForClassification -# Load a distilled model -distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") +# Initialize a classifier from a pre-trained model +classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") # Load a dataset ds = load_dataset("setfit/subj") -train = ds["train"] -test = ds["test"] - -X_train, y_train = train["text"], train["label"] -X_test, y_test = test["text"], test["label"] # Train the classifier -classifier = StaticModelForClassification.from_static_model(distilled_model) -classifier.fit(X_train, y_train) +classifier.fit(ds["train"]["text"], ds["train"]["label"]) # Evaluate the classifier -y_hat = classifier.predict(X_test) -accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 +accuracy = np.mean(classifier.predict(ds["test"]["text"]) == ds["test"]["label"]) * 100 ``` diff --git a/docs/usage.md b/docs/usage.md index cbeba2d3..5e925e75 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -134,11 +134,12 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=Fa Model2Vec can be used to train a classifier on top of a distilled model. The following code snippet shows how to train a classifier on top of a distilled model. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md).
```python +import numpy as np from datasets import load_dataset from model2vec.train import StaticModelForClassification # Load a distilled model -distilled_model = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") +classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") # Load a dataset ds = load_dataset("setfit/subj") @@ -149,7 +150,6 @@ X_train, y_train = train["text"], train["label"] X_test, y_test = test["text"], test["label"] # Train the classifier -classifier = StaticModelForClassification.from_static_model(distilled_model) classifier.fit(X_train, y_train) # Evaluate the classifier From 6fdb9cc63038874322b71372c5acb4e8105f359a Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:40:32 +0100 Subject: [PATCH 20/38] Updated docs --- README.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index c8e616fd..cf268449 100644 --- a/README.md +++ b/README.md @@ -117,11 +117,12 @@ classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base- # Load a dataset ds = load_dataset("setfit/subj") -# Train the classifier +# Train the classifier on text (X) and labels (y) classifier.fit(ds["train"]["text"], ds["train"]["label"]) # Evaluate the classifier -accuracy = np.mean(classifier.predict(ds["test"]["text"]) == ds["test"]["label"]) * 100 +predictions = classifier.predict(ds["test"]["text"]) +accuracy = np.mean(np.array(predictions) == np.array(ds["test"]["label"])) * 100 ``` From e2e9ee23ca8caa926a021634506e18df0301d64c Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:41:56 +0100 Subject: [PATCH 21/38] Updated docs --- docs/usage.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage.md b/docs/usage.md index 5e925e75..b2b9b214 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -138,7 +138,7 @@ import numpy as np from datasets import load_dataset from model2vec.train import StaticModelForClassification -#
Load a distilled model +# Initialize a classifier from a pre-trained model classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") # Load a dataset From 22b6352d9975e26b88be53136d6aff9753c136e3 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:48:24 +0100 Subject: [PATCH 22/38] Updated docs --- README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/README.md b/README.md index cf268449..482eeaca 100644 --- a/README.md +++ b/README.md @@ -125,10 +125,8 @@ predictions = classifier.predict(ds["test"]["text"]) accuracy = np.mean(np.array(predictions) == np.array(ds["test"]["label"])) * 100 ``` - For advanced usage, please refer to our [usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md). - ## Main Features - **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GLoVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). From f9f1f0fc803c053cc6bfdcdf186f733687b6d9a9 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:49:41 +0100 Subject: [PATCH 23/38] Updated docs --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 482eeaca..57ca6347 100644 --- a/README.md +++ b/README.md @@ -34,10 +34,10 @@ -
+ Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. From 7b18b664c4ba2142907d4ff1a405d7a05b052fab Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:52:21 +0100 Subject: [PATCH 24/38] Updated docs --- README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/README.md b/README.md index 57ca6347..c9cfb4a2 100644 --- a/README.md +++ b/README.md @@ -34,10 +34,13 @@
- +[Quickstart](#quickstart) • +[Main Features](#main-features) • +[What is Model2Vec?](#what-is-model2vec) • +[Documentation](#documentation) • +[Model List](#model-list) • +[Results](#results) + Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. @@ -50,13 +53,13 @@ Model2Vec is a technique to turn any sentence transformer into a really small st - **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). 
-## Table of Contents + ## Quickstart From 0d941fe578785ba55cf6b40f85169d941e951616 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:52:51 +0100 Subject: [PATCH 25/38] Updated docs --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index c9cfb4a2..e9fe3dd6 100644 --- a/README.md +++ b/README.md @@ -32,7 +32,6 @@ License - MIT - [Quickstart](#quickstart) • [Main Features](#main-features) • @@ -41,6 +40,10 @@ [Model List](#model-list) • [Results](#results) + + + + Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. From 5e33cdfbfad93e36d88af07a437f90fd20684f5f Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:55:52 +0100 Subject: [PATCH 26/38] Updated docs --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index e9fe3dd6..ea9f23fa 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@
-

+

Package version Supported Python versions @@ -31,7 +31,7 @@ License - MIT -

+ [Quickstart](#quickstart) • [Main Features](#main-features) • @@ -39,7 +39,7 @@ [Documentation](#documentation) • [Model List](#model-list) • [Results](#results) - +
From 0f7287fea15a97e5d28eac4b3463d08c4bd9430a Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:58:38 +0100 Subject: [PATCH 27/38] Updated docs --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index ea9f23fa..7d3942df 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,8 @@ 🤗 Models | 📚 Tutorials | 🌐 Website | - 🏆 Results + 🏆 Results | + 📖 Docs @@ -32,6 +33,7 @@ License - MIT +
[Quickstart](#quickstart) • [Main Features](#main-features) • From 6c90a39238bd64d145779a3bbc66f0e27c6a895b Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 15:59:35 +0100 Subject: [PATCH 28/38] Updated docs --- README.md | 16 ++++------------ 1 file changed, 4 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 7d3942df..33343e31 100644 --- a/README.md +++ b/README.md @@ -21,7 +21,7 @@
-

+

Package version Supported Python versions @@ -33,15 +33,7 @@ License - MIT -
- -[Quickstart](#quickstart) • -[Main Features](#main-features) • -[What is Model2Vec?](#what-is-model2vec) • -[Documentation](#documentation) • -[Model List](#model-list) • -[Results](#results) -

+
@@ -58,13 +50,13 @@ Model2Vec is a technique to turn any sentence transformer into a really small st - **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). - +- [Results](#results) ## Quickstart From 47f9a319787d9fee0e2928eb616af73f2ed20016 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:01:02 +0100 Subject: [PATCH 29/38] Updated docs --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 33343e31..79662c6c 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@

🤗 Models | 📚 Tutorials | - 🌐 Website | + 🌐 Blog | 🏆 Results | 📖 Docs

@@ -30,7 +30,7 @@ Codecov - + License - MIT From d6556648a769c740108aba46e23546a24083840a Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:07:51 +0100 Subject: [PATCH 30/38] Updated docs --- README.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/README.md b/README.md index 79662c6c..5a4cdb34 100644 --- a/README.md +++ b/README.md @@ -41,17 +41,9 @@ Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. - -## Updates & Announcements - -- **12/02/2024**: We released **Model2Vec training**, allowing you to fine-tune your own classification models on top of Model2Vec models. Find out more in our [documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md) and in our [blog post](LINK). - -- **30/01/2024**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available. 
- -- **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). - ## Table of Contents - [Quickstart](#quickstart) +- [Updates & Announcements](#updates--announcements) - [Main Features](#main-features) - [What is Model2Vec?](#what-is-model2vec) - [Documentation](#documentation) @@ -127,6 +119,14 @@ accuracy = np.mean(np.array(predictions) == np.array(ds["test"]["label"])) * 100 For advanced usage, please refer to our [usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md). +## Updates & Announcements + +- **12/02/2024**: We released **Model2Vec training**, allowing you to fine-tune your own classification models on top of Model2Vec models. Find out more in our [documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md) and in our [blog post](LINK). + +- **30/01/2024**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. 
[potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available. + +- **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). + ## Main Features - **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GLoVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). From 42acc90c5d8198044bb30d4bd15164eebb57ef36 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:11:37 +0100 Subject: [PATCH 31/38] Updated docs --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 5a4cdb34..0b4146e5 100644 --- a/README.md +++ b/README.md @@ -42,13 +42,16 @@ Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. 
## Table of Contents +[Quickstart](#quickstart) | [Updates & Announcements](#updates--announcements) | [Main Features](#main-features) | [What is Model2Vec?](#what-is-model2vec) | [Documentation](#documentation) | [Model List](#model-list) | [Results](#results) + + ## Quickstart From df71b71b69503bbb50b0b99c7edcbb609ab3214f Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:12:50 +0100 Subject: [PATCH 32/38] Updated docs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0b4146e5..2d7fe627 100644 --- a/README.md +++ b/README.md @@ -42,7 +42,7 @@ Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. ## Table of Contents -[Quickstart](#quickstart) | [Updates & Announcements](#updates--announcements) | [Main Features](#main-features) | [What is Model2Vec?](#what-is-model2vec) | [Documentation](#documentation) | [Model List](#model-list) | [Results](#results) +[Quickstart](#quickstart) • [Updates & Announcements](#updates--announcements) • [Main Features](#main-features) • [Model List](#model-list) - ## Quickstart Install the package with: @@ -67,9 +58,6 @@ pip install model2vec This will install the base inference package, which only depends on `numpy` and a few other minor dependencies. If you want to distill your own models, you can install the distillation extras with: -```bash -pip install model2vec[distill] -``` You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. 
The following code snippet shows how to load a model and make embeddings, which you can use to classify texts, cluster, or build a RAG system: ```python @@ -85,7 +73,14 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model in ~30 seconds on a CPU: +Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. First, install the `distillation` extras with: + +```bash +pip install model2vec[distill] +``` + + + Then, you can distill a model in ~30 seconds on a CPU with the following code snippet: ```python from model2vec.distill import distill From c86a1e60635856efd34a113cb9d5e9e08642533d Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:18:15 +0100 Subject: [PATCH 37/38] Updated docs --- README.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/README.md b/README.md index 9c2221d2..6bbadfc9 100644 --- a/README.md +++ b/README.md @@ -50,15 +50,12 @@ Model2Vec is a technique to turn any sentence transformer into a really small st ## Quickstart -Install the package with: +Install the lightweight base package with: ```bash pip install model2vec ``` -This will install the base inference package, which only depends on `numpy` and a few other minor dependencies. If you want to distill your own models, you can install the distillation extras with: - - You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. 
The following code snippet shows how to load a model and make embeddings, which you can use to classify texts, cluster, or build a RAG system: ```python from model2vec import StaticModel From 35a7f73cc416a0bf138ccc78d25299a1a9855044 Mon Sep 17 00:00:00 2001 From: Pringled Date: Sun, 9 Feb 2025 16:19:27 +0100 Subject: [PATCH 38/38] Updated docs --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6bbadfc9..19c6554f 100644 --- a/README.md +++ b/README.md @@ -56,7 +56,7 @@ Install the lightweight base package with: pip install model2vec ``` -You can start using Model2Vec immediately by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings, which you can use to classify texts, cluster, or build a RAG system: +You can start using Model2Vec by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings, which you can use for any task, such as text classification, retrieval, clustering, or building a RAG system: ```python from model2vec import StaticModel