diff --git a/README.md b/README.md
index baea2578..19c6554f 100644
--- a/README.md
+++ b/README.md
@@ -14,8 +14,9 @@

🤗 Models | 📚 Tutorials | - 🌐 Website | - 🏆 Results + 🌐 Blog | + 🏆 Results | + 📖 Docs

@@ -29,51 +30,33 @@ Codecov - + License - MIT + -
- - -
-Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by 15x and making the models up to 500x faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-32M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. -## Updates & Announcements +Model2Vec is a technique to turn any sentence transformer into a really small static model, reducing model size by a factor up to 50 and making the models up to 500 times faster, with a small drop in performance. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is the most performant static embedding model in the world. See our results [here](results/README.md), or dive in to see how it works. -- **30/01/2024**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available. -- **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). 
NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). +
+

-## Table of Contents -- [Quickstart](#quickstart) -- [Main Features](#main-features) -- [What is Model2Vec?](#what-is-model2vec) -- [Usage](#usage) - - [Inference](#inference) - - [Distillation](#distillation) - - [Evaluation](#evaluation) -- [Integrations](#integrations) -- [Model List](#model-list) -- [Results](#results) +[Quickstart](#quickstart) • [Updates & Announcements](#updates--announcements) • [Main Features](#main-features) • [Model List](#model-list) +

+
## Quickstart -Install the package with: +Install the lightweight base package with: ```bash pip install model2vec ``` -This will install the base inference package, which only depends on `numpy` and a few other minor dependencies. If you want to distill your own models, you can install the distillation extras with: - -```bash -pip install model2vec[distill] -``` - -The easiest way to get started with Model2Vec is to load one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings: +You can start using Model2Vec by loading one of our [flagship models from the HuggingFace hub](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings, which you can use for any task, such as text classification, retrieval, clustering, or building a RAG system: ```python from model2vec import StaticModel @@ -87,402 +70,87 @@ embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to ever token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) ``` -And that's it. You can use the model to classify texts, to cluster, or to build a RAG system. - -Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model: -```python -from model2vec.distill import distill - -# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model -m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) - -# Save the model -m2v_model.save_pretrained("m2v_model") -``` - -Distillation is really fast and only takes 30 seconds on CPU. 
Best of all, distillation requires no training data. - -For advanced usage, such as using Model2Vec in the [Sentence Transformers library](https://github.com/UKPLab/sentence-transformers), please refer to the [Usage](#usage) sections. - - -## Main Features - -- **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GLoVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). -- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of 15, from 120M params, down to 7.5M (30 MB on disk, making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). -- **Lightweight Dependencies**: the base package's only major dependency is `numpy`. -- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. Go green or go home. -- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. All you need is a model and (optionally) a custom vocabulary. -- **Integrated in many popular libraries**: Model2Vec can be used directly in popular libraries such as [Sentence Transformers](https://github.com/UKPLab/sentence-transformers), [LangChain](https://github.com/langchain-ai/langchain), [txtai](https://github.com/neuml/txtai), and [Chonkie](https://github.com/bhavnicksm/chonkie). See the [Integrations](#integrations) section for more information. -- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own. - -## What is Model2Vec? - -Model2vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. 
Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. - -The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence. - -Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps: -- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above. -- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus. -- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model. -- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings using `smooth inverse frequency (SIF)` weighting using the following formula: `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training. - - -For a much more extensive deepdive, please refer to our [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec) and our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). - -## Usage +Instead of using one of our models, you can also distill your own Model2Vec model from a Sentence Transformer model. First, install the `distill` extras with: -### Inference - -
- Inference with a pretrained model -
- -Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub. -```python -from model2vec import StaticModel - -# Load a model from the Hub. You can optionally pass a token when loading a private model -model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None) - -# Make embeddings -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) - -# Make sequences of token embeddings -token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) +```bash +pip install model2vec[distill] ``` -
- -
- Inference with the Sentence Transformers library -
-The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. + Then, you can distill a model in ~30 seconds on a CPU with the following code snippet: -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -# Initialize a StaticEmbedding module -static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -
- -### Distillation - -
- Distilling from a Sentence Transformer -
- -The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to really small model that might be less performant. ```python from model2vec.distill import distill -# Distill a Sentence Transformer model +# Distill a Sentence Transformer model, in this case the BAAI/bge-base-en-v1.5 model m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) # Save the model m2v_model.save_pretrained("m2v_model") - -``` -
- -
- Distilling from a loaded model -
- -If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory. - -```python -from transformers import AutoModel, AutoTokenizer - -from model2vec.distill import distill_from_model - -# Assuming a loaded model and tokenizer -model_name = "baai/bge-base-en-v1.5" -model = AutoModel.from_pretrained(model_name) -tokenizer = AutoTokenizer.from_pretrained(model_name) - -m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256) - -m2v_model.save_pretrained("m2v_model") - ``` -
- -
- Distilling with the Sentence Transformers library -
- -The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. - -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` - -
- - -
- Distilling with a custom vocabulary -
- -If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GLoVe or traditional word2vec, but doesn't actually require a corpus or data. -```python -from model2vec.distill import distill - -# Load a vocabulary as a list of strings -vocabulary = ["word1", "word2", "word3"] - -# Distill a Sentence Transformer model with the custom vocabulary -m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary) - -# Save the model -m2v_model.save_pretrained("m2v_model") - -# Or push it to the hub -m2v_model.push_to_hub("my_organization/my_model", token="") -``` - -By default, this will distill a model with a subword tokenizer, combining the models (subword) vocab with the new vocabulary. If you want to get a word-level tokenizer instead (with only the passed vocabulary), the `use_subword` parameter can be set to `False`, e.g.: - -```python -m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=False) -``` - -**Important note:** we assume the passed vocabulary is sorted in rank frequency. i.e., we don't care about the actual word frequencies, but do assume that the most frequent word is first, and the least frequent word is last. If you're not sure whether this is case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse. - -
- - -### Evaluation - - -
- Installation -
- -Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). Install the evaluation package with: +After distillation, you can also fine-tune your own classification models on top of the distilled model, or on a pre-trained model. First, make sure you install the `training` extras with: ```bash -pip install git+https://github.com/MinishLab/evaluation.git@main -``` -
- -
- Evaluation Code -
- -The following code snippet shows how to evaluate a Model2Vec model: -```python -from model2vec import StaticModel - -from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results -from mteb import ModelMeta - -# Get all available tasks -tasks = get_tasks() -# Define the CustomMTEB object with the specified tasks -evaluation = CustomMTEB(tasks=tasks) - -# Load the model -model_name = "m2v_model" -model = StaticModel.from_pretrained(model_name) - -# Optionally, add model metadata in MTEB format -model.mteb_model_meta = ModelMeta( - name=model_name, revision="no_revision_available", release_date=None, languages=None - ) - -# Run the evaluation -results = evaluation.run(model, eval_splits=["test"], output_folder=f"results") - -# Parse the results and summarize them -parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name) -task_scores = summarize_results(parsed_results) - -# Print the results in a leaderboard format -print(make_leaderboard(task_scores)) -``` -
- -## Integrations -
- Sentence Transformers - -
- -Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) using the `StaticEmbedding` module. - -The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model: -```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -# Initialize a StaticEmbedding module -static_embedding = StaticEmbedding.from_model2vec("minishlab/potion-base-8M") -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +pip install model2vec[training] ``` -The following code snippet shows how to distill a model directly into a Sentence Transformer model: +Then, you can fine-tune a model as follows: ```python -from sentence_transformers import SentenceTransformer -from sentence_transformers.models import StaticEmbedding - -static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) -model = SentenceTransformer(modules=[static_embedding]) -embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) -``` +import numpy as np +from datasets import load_dataset +from model2vec.train import StaticModelForClassification -For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding). +# Initialize a classifier from a pre-trained model +classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") -
+# Load a dataset +ds = load_dataset("setfit/subj") -
- LangChain -
+# Train the classifier on text (X) and labels (y) +classifier.fit(ds["train"]["text"], ds["train"]["label"]) -Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) using the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the `langchain-community` package with `pip install langchain-community`: - -```python -from langchain_community.embeddings import Model2vecEmbeddings -from langchain_community.vectorstores import FAISS -from langchain.schema import Document - -# Initialize a Model2Vec embedder -embedder = Model2vecEmbeddings("minishlab/potion-base-8M") - -# Create some example texts -texts = [ - "Enduring Stew", - "Hearty Elixir", - "Mighty Mushroom Risotto", - "Spicy Meat Skewer", - "Fruit Salad", -] - -# Embed the texts -embeddings = embedder.embed_documents(texts) - -# Or, create a vector store and query it -documents = [Document(page_content=text) for text in texts] -vector_store = FAISS.from_documents(documents, embedder) -query = "Risotto" -query_vector = embedder.embed_query(query) -retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1) +# Evaluate the classifier +predictions = classifier.predict(ds["test"]["text"]) +accuracy = np.mean(np.array(predictions) == np.array(ds["test"]["label"])) * 100 ``` -
- -
- Txtai -
-Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`: - -```python -from txtai import Embeddings - -# Load a model2vec model -embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy") - -# Create some example texts -texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"] - -# Create embeddings for downstream tasks -vectors = embeddings.batchtransform(texts) - -# Or create a nearest-neighbors index and search it -embeddings.index(texts) -result = embeddings.search("Risotto", 1) -``` - -
- -
- Chonkie -
- -Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: - -```python -from chonkie import SDPMChunker +For advanced usage, please refer to our [usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md). -# Create some example text to chunk -text = "It's dangerous to go alone! Take this." - -# Initialize the SemanticChunker with a potion model -chunker = SDPMChunker( - embedding_model="minishlab/potion-base-8M", - similarity_threshold=0.3 -) - -# Chunk the text -chunks = chunker.chunk(text) -``` - -
- - -
- Transformers.js -
- -To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point: - -```javascript -import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers'; - -const modelName = 'minishlab/potion-base-8M'; +## Updates & Announcements -const modelConfig = { - config: { model_type: 'model2vec' }, - dtype: 'fp32', - revision: 'refs/pr/1' -}; -const tokenizerConfig = { - revision: 'refs/pr/2' -}; +- **12/02/2024**: We released **Model2Vec training**, allowing you to fine-tune your own classification models on top of Model2Vec models. Find out more in our [documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md) and in our [blog post](LINK). -const model = await AutoModel.from_pretrained(modelName, modelConfig); -const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig); +- **30/01/2024**: We released two new models: [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) and [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M). [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) is our most performant model to date, using a larger vocabulary and higher dimensions. [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) is a finetune of [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) that is optimized for retrieval tasks, and is the best performing static retrieval model currently available. -const texts = ['hello', 'hello world']; -const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false }); +- **30/10/2024**: We released three new models: [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M), [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M), and [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M). 
These models are trained using [Tokenlearn](https://github.com/MinishLab/tokenlearn). Find out more in our [blog post](https://minishlab.github.io/tokenlearn_blogpost/). NOTE: for users of any of our old English M2V models, we recommend switching to these new models as they [perform better on all tasks](https://github.com/MinishLab/model2vec/tree/main/results). -const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []); -const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))]; +## Main Features -const flattened_input_ids = input_ids.flat(); -const modelInputs = { - input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]), - offsets: new Tensor('int64', offsets, [offsets.length]) -}; +- **State-of-the-Art Performance**: Model2Vec models outperform any other static embeddings (such as GloVe and BPEmb) by a large margin, as can be seen in our [results](results/README.md). +- **Small**: Model2Vec reduces the size of a Sentence Transformer model by a factor of up to 50. Our [best model](https://huggingface.co/minishlab/potion-base-8M) is just ~30 MB on disk, and our smallest model just ~8 MB (making it the smallest model on [MTEB](https://huggingface.co/spaces/mteb/leaderboard)!). +- **Lightweight Dependencies**: the base package's only major dependency is `numpy`. +- **Lightning-fast Inference**: up to 500 times faster on CPU than the original model. +- **Fast, Dataset-free Distillation**: distill your own model in 30 seconds on a CPU, without a dataset. +- **Integrated in many popular libraries**: Model2Vec is integrated directly into popular libraries such as [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) and [LangChain](https://github.com/langchain-ai/langchain). For more information, see our [integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md).
+- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). -const { embeddings } = await model(modelInputs); -console.log(embeddings.tolist()); // output matches python version -``` +## What is Model2Vec? -Note that this requires that the Model2Vec has a `model.onnx` file and several required tokenizers file. To generate these for a model that does not have them yet, the following code snippet can be used: +Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Like BPEmb, it can create subword embeddings, but with much better performance. Distillation doesn't need _any_ data, just a vocabulary and a model. -```bash -python scripts/export_to_onnx.py --model_path --save_path "" -``` +The core idea is to forward pass a vocabulary through a sentence transformer model, creating static embeddings for the individual tokens. After this, we apply a number of post-processing steps that result in our best models. For a more extensive deepdive, please refer to the following resources: +- Our initial [Model2Vec blog post](https://huggingface.co/blog/Pringled/model2vec). Note that, while this post gives a good overview of the core idea, we've made a number of substantial improvements since then. +- Our [Tokenlearn blog post](https://minishlab.github.io/tokenlearn_blogpost/). This post describes the Tokenlearn method we used to train our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062). +- Our official [documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md). This document provides a high-level overview of how Model2Vec works. +## Documentation -
-
+Our official documentation can be found [here](https://github.com/MinishLab/model2vec/blob/main/docs/README.md). This includes: +- [Usage documentation](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): provides a technical overview of how to use Model2Vec. +- [Integrations documentation](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): provides examples of how to use Model2Vec in various downstream libraries. +- [Model2Vec technical documentation](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): provides a high-level overview of how Model2Vec works. ## Model List @@ -490,15 +158,15 @@ python scripts/export_to_onnx.py --model_path --save We provide a number of models that can be used out of the box. These models are available on the [HuggingFace hub](https://huggingface.co/collections/minishlab/model2vec-base-models-66fd9dd9b7c3b3c0f25ca90e) and can be loaded using the `from_pretrained` method. The models are listed below. -| Model | Language | Vocab | Sentence Transformer | Tokenizer Type | Params | Tokenlearn | -|-----------------------------------------------------------------------|-------------|------------------|-----------------------------------------------------------------|----------------|---------|-------------------| -| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | Output + Frequent C4 tokens | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 32.3M |
| -| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 7.5M |
| -| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 3.7M |
| -| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | Output | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 1.8M |
| -| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | Output + Frequent C4 tokens | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | Subword | 32.3M |
| -| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | Output | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | Subword | 471M |
| +| Model | Language | Sentence Transformer | Params | Task | +|-----------------------------------------------------------------------|------------|-----------------------------------------------------------------|---------|-----------| +| [potion-base-32M](https://huggingface.co/minishlab/potion-base-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | General | +| [potion-base-8M](https://huggingface.co/minishlab/potion-base-8M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 7.5M | General | +| [potion-base-4M](https://huggingface.co/minishlab/potion-base-4M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 3.7M | General | +| [potion-base-2M](https://huggingface.co/minishlab/potion-base-2M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 1.8M | General | +| [potion-retrieval-32M](https://huggingface.co/minishlab/potion-retrieval-32M) | English | [bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) | 32.3M | Retrieval | +| [M2V_multilingual_output](https://huggingface.co/minishlab/M2V_multilingual_output) | Multilingual | [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) | 471M | General | ## Results @@ -516,9 +184,9 @@ MIT If you use Model2Vec in your research, please cite the following: ```bibtex @software{minishlab2024model2vec, - authors = {Stephan Tulkens, Thomas van Dongen}, + authors = {Stephan Tulkens and Thomas van Dongen}, title = {Model2Vec: The Fastest State-of-the-Art Static Embeddings in the World}, year = {2024}, - url = {https://github.com/MinishLab/model2vec}, + url = {https://github.com/MinishLab/model2vec} } ``` diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..7392176c --- /dev/null +++ b/docs/README.md @@ -0,0 +1,6 @@ +# Documentation + +This directory contains the documentation for Model2Vec. The documentation is formatted in Markdown. 
The documentation is organized as follows: +- [usage.md](https://github.com/MinishLab/model2vec/blob/main/docs/usage.md): This document provides a technical overview of how to use Model2Vec. +- [integrations.md](https://github.com/MinishLab/model2vec/blob/main/docs/integrations.md): This document provides examples of how to use Model2Vec in various downstream libraries. +- [what_is_model2vec.md](https://github.com/MinishLab/model2vec/blob/main/docs/what_is_model2vec.md): This document provides a high-level overview of how Model2Vec works. diff --git a/docs/integrations.md b/docs/integrations.md new file mode 100644 index 00000000..efaec875 --- /dev/null +++ b/docs/integrations.md @@ -0,0 +1,155 @@ + +# Integrations + +Model2Vec can be used in a variety of downstream libraries. This document provides examples of how to use Model2Vec in some of these libraries. + +## Table of Contents +- [Sentence Transformers](#sentence-transformers) +- [LangChain](#langchain) +- [Txtai](#txtai) +- [Chonkie](#chonkie) +- [Transformers.js](#transformersjs) + +## Sentence Transformers + +Model2Vec can be used directly in [Sentence Transformers](https://github.com/UKPLab/sentence-transformers): + +The following code snippet shows how to load a Model2Vec model into a Sentence Transformer model: +```python +from sentence_transformers import SentenceTransformer + +# Load a Model2Vec model from the Hub +model = SentenceTransformer("minishlab/potion-base-8M") +# Make embeddings +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +The following code snippet shows how to distill a model directly into a Sentence Transformer model: + +```python +from sentence_transformers import SentenceTransformer +from sentence_transformers.models import StaticEmbedding + +static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) +model = SentenceTransformer(modules=[static_embedding]) +embeddings = 
model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +For more documentation, please refer to the [Sentence Transformers documentation](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding). + + +## LangChain + +Model2Vec can be used in [LangChain](https://github.com/langchain-ai/langchain) using the `langchain-community` package. For more information, see the [LangChain Model2Vec docs](https://python.langchain.com/docs/integrations/text_embedding/model2vec/). The following code snippet shows how to use Model2Vec in LangChain after installing the `langchain-community` package with `pip install langchain-community`: + +```python +from langchain_community.embeddings import Model2vecEmbeddings +from langchain_community.vectorstores import FAISS +from langchain.schema import Document + +# Initialize a Model2Vec embedder +embedder = Model2vecEmbeddings("minishlab/potion-base-8M") + +# Create some example texts +texts = [ + "Enduring Stew", + "Hearty Elixir", + "Mighty Mushroom Risotto", + "Spicy Meat Skewer", + "Fruit Salad", +] + +# Embed the texts +embeddings = embedder.embed_documents(texts) + +# Or, create a vector store and query it +documents = [Document(page_content=text) for text in texts] +vector_store = FAISS.from_documents(documents, embedder) +query = "Risotto" +query_vector = embedder.embed_query(query) +retrieved_docs = vector_store.similarity_search_by_vector(query_vector, k=1) +``` + +## Txtai + +Model2Vec can be used in [txtai](https://github.com/neuml/txtai) for text embeddings, nearest-neighbors search, and any of the other functionalities that txtai offers. 
The following code snippet shows how to use Model2Vec in txtai after installing the `txtai` package (including the `vectors` dependency) with `pip install txtai[vectors]`: + +```python +from txtai import Embeddings + +# Load a model2vec model +embeddings = Embeddings(path="minishlab/potion-base-8M", method="model2vec", backend="numpy") + +# Create some example texts +texts = ["Enduring Stew", "Hearty Elixir", "Mighty Mushroom Risotto", "Spicy Meat Skewer", "Chilly Fruit Salad"] + +# Create embeddings for downstream tasks +vectors = embeddings.batchtransform(texts) + +# Or create a nearest-neighbors index and search it +embeddings.index(texts) +result = embeddings.search("Risotto", 1) +``` + +## Chonkie + +Model2Vec is the default model for semantic chunking in [Chonkie](https://github.com/bhavnicksm/chonkie). To use Model2Vec for semantic chunking in Chonkie, simply install Chonkie with `pip install chonkie[semantic]` and use one of the `potion` models in the `SemanticChunker` class. The following code snippet shows how to use Model2Vec in Chonkie: + +```python +from chonkie import SDPMChunker + +# Create some example text to chunk +text = "It's dangerous to go alone! Take this." 
+ +# Initialize the SDPMChunker with a potion model +chunker = SDPMChunker( + embedding_model="minishlab/potion-base-8M", + similarity_threshold=0.3 +) + +# Chunk the text +chunks = chunker.chunk(text) +``` + +## Transformers.js + +To use a Model2Vec model in [transformers.js](https://github.com/huggingface/transformers.js), the following code snippet can be used as a starting point: + +```javascript +import { AutoModel, AutoTokenizer, Tensor } from '@huggingface/transformers'; + +const modelName = 'minishlab/potion-base-8M'; + +const modelConfig = { + config: { model_type: 'model2vec' }, + dtype: 'fp32', + revision: 'refs/pr/1' +}; +const tokenizerConfig = { + revision: 'refs/pr/2' +}; + +const model = await AutoModel.from_pretrained(modelName, modelConfig); +const tokenizer = await AutoTokenizer.from_pretrained(modelName, tokenizerConfig); + +const texts = ['hello', 'hello world']; +const { input_ids } = await tokenizer(texts, { add_special_tokens: false, return_tensor: false }); + +const cumsum = arr => arr.reduce((acc, num, i) => [...acc, num + (acc[i - 1] || 0)], []); +const offsets = [0, ...cumsum(input_ids.slice(0, -1).map(x => x.length))]; + +const flattened_input_ids = input_ids.flat(); +const modelInputs = { + input_ids: new Tensor('int64', flattened_input_ids, [flattened_input_ids.length]), + offsets: new Tensor('int64', offsets, [offsets.length]) +}; + +const { embeddings } = await model(modelInputs); +console.log(embeddings.tolist()); // output matches python version +``` + +Note that this requires that the Model2Vec model has a `model.onnx` file and several required tokenizer files.
To generate these for a model that does not have them yet, the following code snippet can be used: + +```bash +python scripts/export_to_onnx.py --model_path --save_path "" +``` diff --git a/docs/usage.md b/docs/usage.md new file mode 100644 index 00000000..b2b9b214 --- /dev/null +++ b/docs/usage.md @@ -0,0 +1,202 @@ + +# Usage + +This document provides an overview of how to use Model2Vec for inference, distillation, training, and evaluation. + +## Table of Contents +- [Inference](#inference) + - [Inference with a pretrained model](#inference-with-a-pretrained-model) + - [Inference with the Sentence Transformers library](#inference-with-the-sentence-transformers-library) +- [Distillation](#distillation) + - [Distilling from a Sentence Transformer](#distilling-from-a-sentence-transformer) + - [Distilling from a loaded model](#distilling-from-a-loaded-model) + - [Distilling with the Sentence Transformers library](#distilling-with-the-sentence-transformers-library) + - [Distilling with a custom vocabulary](#distilling-with-a-custom-vocabulary) +- [Training](#training) + - [Training a classifier](#training-a-classifier) +- [Evaluation](#evaluation) + - [Installation](#installation) + - [Evaluation Code](#evaluation-code) + +## Inference + +### Inference with a pretrained model + +Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub. +```python +from model2vec import StaticModel + +# Load a model from the Hub. 
You can optionally pass a token when loading a private model +model = StaticModel.from_pretrained(model_name="minishlab/potion-base-8M", token=None) + +# Make embeddings +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) + +# Make sequences of token embeddings +token_embeddings = model.encode_as_sequence(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +### Inference with the Sentence Transformers library + +The following code snippet shows how to use a Model2Vec model in the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. + +```python +from sentence_transformers import SentenceTransformer + +# Load a Model2Vec model from the Hub +model = SentenceTransformer("minishlab/potion-base-8M") + +# Make embeddings +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +## Distillation + +### Distilling from a Sentence Transformer + +The following code can be used to distill a model from a Sentence Transformer. As mentioned above, this leads to a really small model that might be less performant. +```python +from model2vec.distill import distill + +# Distill a Sentence Transformer model +m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", pca_dims=256) + +# Save the model +m2v_model.save_pretrained("m2v_model") + +``` + +### Distilling from a loaded model + +If you already have a model loaded, or need to load a model in some special way, we also offer an interface to distill models in memory.
+ +```python +from transformers import AutoModel, AutoTokenizer + +from model2vec.distill import distill_from_model + +# Assuming a loaded model and tokenizer +model_name = "BAAI/bge-base-en-v1.5" +model = AutoModel.from_pretrained(model_name) +tokenizer = AutoTokenizer.from_pretrained(model_name) + +m2v_model = distill_from_model(model=model, tokenizer=tokenizer, pca_dims=256) + +m2v_model.save_pretrained("m2v_model") + +``` + +### Distilling with the Sentence Transformers library + +The following code snippet shows how to distill a model using the [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) library. This is useful if you want to use the model in a Sentence Transformers pipeline. + +```python +from sentence_transformers import SentenceTransformer +from sentence_transformers.models import StaticEmbedding + +static_embedding = StaticEmbedding.from_distillation("BAAI/bge-base-en-v1.5", device="cpu", pca_dims=256) +model = SentenceTransformer(modules=[static_embedding]) +embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."]) +``` + +### Distilling with a custom vocabulary + +If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GloVe or traditional word2vec, but doesn't actually require a corpus or data. +```python +from model2vec.distill import distill + +# Load a vocabulary as a list of strings +vocabulary = ["word1", "word2", "word3"] + +# Distill a Sentence Transformer model with the custom vocabulary +m2v_model = distill(model_name="BAAI/bge-base-en-v1.5", vocabulary=vocabulary) + +# Save the model +m2v_model.save_pretrained("m2v_model") + +# Or push it to the hub +m2v_model.push_to_hub("my_organization/my_model", token="") +``` + +By default, this will distill a model with a subword tokenizer, combining the model's (subword) vocabulary with the new vocabulary.
If you want to get a word-level tokenizer instead (with only the passed vocabulary), the `use_subword` parameter can be set to `False`, e.g.: + +```python +m2v_model = distill(model_name=model_name, vocabulary=vocabulary, use_subword=False) +``` + +**Important note:** we assume the passed vocabulary is sorted by frequency rank, i.e., we don't care about the actual word frequencies, but do assume that the most frequent word is first, and the least frequent word is last. If you're not sure whether this is the case, set `apply_zipf` to `False`. This disables the weighting, but will also make performance a little bit worse. + + +## Training + +### Training a classifier + +Model2Vec can be used to train a classifier on top of a distilled model, as the following code snippet shows. For more advanced usage, as well as results, please refer to the [training documentation](https://github.com/MinishLab/model2vec/blob/main/model2vec/train/README.md). + +```python +import numpy as np +from datasets import load_dataset +from model2vec.train import StaticModelForClassification + +# Initialize a classifier from a pre-trained model +classifier = StaticModelForClassification.from_pretrained("minishlab/potion-base-8M") + +# Load a dataset +ds = load_dataset("setfit/subj") +train = ds["train"] +test = ds["test"] + +X_train, y_train = train["text"], train["label"] +X_test, y_test = test["text"], test["label"] + +# Train the classifier +classifier.fit(X_train, y_train) + +# Evaluate the classifier +y_hat = classifier.predict(X_test) +accuracy = np.mean(np.array(y_hat) == np.array(y_test)) * 100 +``` + +## Evaluation + +### Installation + +Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation).
Install the evaluation package with: + +```bash +pip install git+https://github.com/MinishLab/evaluation.git@main +``` + +### Evaluation Code + +The following code snippet shows how to evaluate a Model2Vec model: +```python +from model2vec import StaticModel + +from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results +from mteb import ModelMeta + +# Get all available tasks +tasks = get_tasks() +# Define the CustomMTEB object with the specified tasks +evaluation = CustomMTEB(tasks=tasks) + +# Load the model +model_name = "m2v_model" +model = StaticModel.from_pretrained(model_name) + +# Optionally, add model metadata in MTEB format +model.mteb_model_meta = ModelMeta( + name=model_name, revision="no_revision_available", release_date=None, languages=None + ) + +# Run the evaluation +results = evaluation.run(model, eval_splits=["test"], output_folder=f"results") + +# Parse the results and summarize them +parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name) +task_scores = summarize_results(parsed_results) + +# Print the results in a leaderboard format +print(make_leaderboard(task_scores)) +``` diff --git a/docs/what_is_model2vec.md b/docs/what_is_model2vec.md new file mode 100644 index 00000000..3413fd32 --- /dev/null +++ b/docs/what_is_model2vec.md @@ -0,0 +1,11 @@ +# What is Model2Vec? + +This document provides a high-level overview of how Model2Vec works. + +The base model2vec technique works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using SIF weighting (previously zipf weighting). During inference, we simply take the mean of all token embeddings occurring in a sentence. 
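
The PCA reduction and mean-pooling steps described above can be illustrated with a minimal NumPy sketch. This is not the Model2Vec implementation: the random matrix stands in for the embeddings a sentence transformer would produce for each vocabulary token, the helper names `pca_reduce` and `encode` are hypothetical, and the weighting step is left out for brevity.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Project embeddings onto their top `dims` principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T


def encode(token_ids: list[int], table: np.ndarray) -> np.ndarray:
    """Embed a sentence as the mean of its static token embeddings."""
    return table[token_ids].mean(axis=0)


rng = np.random.default_rng(0)
# Assumption: stand-in for a 1000-token vocabulary passed through a sentence transformer.
vocab_embeddings = rng.normal(size=(1000, 64))

# Reduce the 64-dimensional embeddings to 16 dimensions.
table = pca_reduce(vocab_embeddings, dims=16)

# Inference is a table lookup followed by a mean over the token axis.
sentence_vector = encode([3, 17, 256], table)
```

Because inference is only a lookup and a mean, no transformer forward pass is needed at encoding time, which is where the speedup of a static model comes from.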
+ +Our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) are pre-trained using [tokenlearn](https://github.com/MinishLab/tokenlearn), a technique to pre-train model2vec distillation models. These models are created with the following steps: +- **Distillation**: We distill a Model2Vec model from a Sentence Transformer model, using the method described above. +- **Sentence Transformer inference**: We use the Sentence Transformer model to create mean embeddings for a large number of texts from a corpus. +- **Training**: We train a model to minimize the cosine distance between the mean embeddings generated by the Sentence Transformer model and the mean embeddings generated by the Model2Vec model. +- **Post-training re-regularization**: We re-regularize the trained embeddings by first performing PCA, and then weighting the embeddings with smooth inverse frequency (SIF) weighting, using the formula `w = 1e-3 / (1e-3 + proba)`. Here, `proba` is the probability of the token in the corpus we used for training.
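
The SIF formula and the cosine-distance objective from the steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the tokenlearn code: the token counts are made up, and the helper names `sif_weights` and `mean_cosine_distance` are hypothetical.

```python
import numpy as np


def sif_weights(token_counts: np.ndarray, alpha: float = 1e-3) -> np.ndarray:
    """Compute w = alpha / (alpha + proba) for each token."""
    proba = token_counts / token_counts.sum()
    return alpha / (alpha + proba)


def mean_cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Mean of (1 - cosine similarity) over paired rows of a and b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a_norm * b_norm, axis=1)))


# Assumption: made-up corpus counts for a 4-token vocabulary, most frequent first.
counts = np.array([900.0, 90.0, 9.0, 1.0])
weights = sif_weights(counts)

# Scale each token embedding by its weight; frequent tokens shrink the most.
embeddings = np.ones((4, 8))
weighted = embeddings * weights[:, None]
```

With `alpha = 1e-3`, a token that makes up 90% of the corpus gets a weight of roughly 0.001, while a token at probability 0.001 gets a weight of 0.5, so rare and informative tokens dominate the mean at inference time.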