<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

# Using Mixedbread.ai cross-encoder for reranking in Vespa.ai

First, let us recap what cross-encoders are and where they might fit in a Vespa application.

In contrast to bi-encoders, cross-encoders do NOT produce an embedding. Instead, a cross-encoder acts on _pairs_ of input sequences and produces a single scalar score, typically between 0 and 1, indicating how well the two sequences match.

> The cross-encoder model is a transformer based model with a classification head on top of the Transformer CLS token (classification token). The model has been fine-tuned using the MS Marco passage training set and is a binary classifier which classifies if a query,document pair is relevant or not.

The quote above is from [this](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-4/) blog post from 2021, which explains cross-encoders in more depth.
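Since the model described above is a binary classifier, its raw output is a logit rather than a probability. The following is a minimal sketch of how such a logit maps to a score between 0 and 1, assuming a sigmoid activation. The example logits are made up for illustration and are not real model output.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw classifier logit to a score in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical logits for three (query, document) pairs, not real model output
example_logits = [-2.0, 0.0, 3.0]
scores = [sigmoid(x) for x in example_logits]

# Sigmoid is monotonic, so sorting by logit and sorting by score give the same order
order_by_logit = sorted(range(3), key=lambda i: example_logits[i], reverse=True)
order_by_score = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(scores, order_by_logit == order_by_score)
```

Because the mapping is monotonic, a ranking expression can use the raw logit directly when only the order of the hits matters.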

## Properties of cross-encoders and where they fit in Vespa

Cross-encoders are great at comparing a query and a document, but the inference cost grows linearly with the number of documents a query is compared to, since every query-document pair needs its own forward pass through the model.

They are often at the top of leaderboards for ranking quality, such as the MS MARCO Passage Ranking leaderboard, which does not impose a strict latency requirement for returning results for a query.

This makes cross-encoders a good fit for _global-phase reranking_, where only a small number of top-ranked hits are rescored. Global-phase reranking was introduced in [this](https://blog.vespa.ai/improving-llm-context-ranking-with-cross-encoders/) blog post.

![image](https://blog.vespa.ai/assets/2023-05-08-improving-llm-context-ranking-with-cross-encoders/image1.png)
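The idea behind phased ranking can be sketched in plain Python: a cheap function scores every candidate, and the expensive model only sees the top few. This is an illustration of the concept, not of how Vespa implements it. The function names and both scoring functions are hypothetical stand-ins (word overlap in place of BM25, a trivial rule in place of the cross-encoder).

```python
from typing import Callable, List


def two_phase_rank(
    query: str,
    docs: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    rerank_count: int,
) -> List[str]:
    """Order all docs by cheap_score, then reorder only the top rerank_count."""
    first_phase = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)
    head, tail = first_phase[:rerank_count], first_phase[rerank_count:]
    reranked = sorted(head, key=lambda d: expensive_score(query, d), reverse=True)
    # Documents outside the rerank window keep their first-phase order
    return reranked + tail


def word_overlap(query: str, doc: str) -> float:
    """Stand-in for a lexical first-phase score such as BM25."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


calls = {"expensive": 0}


def stand_in_cross_encoder(query: str, doc: str) -> float:
    """Stand-in for the cross-encoder; counts how often it is invoked."""
    calls["expensive"] += 1
    return 1.0 if "wrote" in doc else 0.0


docs = [
    "harper lee wrote to kill a mockingbird",
    "moby dick was written by herman melville",
    "to kill a mockingbird is a novel",
    "jane austen was an english novelist",
]
ranked = two_phase_rank(
    "who wrote to kill a mockingbird", docs, word_overlap, stand_in_cross_encoder, rerank_count=2
)
print(ranked[0], calls["expensive"])
```

The expensive function runs exactly `rerank_count` times regardless of how many documents matched, which is what keeps the latency bounded.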

In this notebook, we will show how to use the Mixedbread.ai cross-encoder for global-phase reranking in Vespa.

Inference can also be run on GPU in [Vespa Cloud](https://cloud.vespa.ai/) to reduce reranking latency further.


## Exploring the Mixedbread.ai cross-encoder

[mixedbread.ai](https://huggingface.co/mixedbread-ai) has done an amazing job of releasing both (binary) embedding models and rerankers on Hugging Face 🤗 in recent weeks.

> Check out our previous notebook on using binary embeddings from mixedbread.ai in Vespa Cloud [here](https://pyvespa.readthedocs.io/en/latest/examples/mixedbread-binary-embeddings-with-sentence-transformers-cloud.html)

For this demo, we will use [mixedbread-ai/mxbai-rerank-xsmall-v1](https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1), but you can experiment with the larger models, depending on how you want to balance speed, accuracy, and cost (if you want to use GPU).

This model is really powerful despite its small size, as the table below shows.

Accuracy on [BEIR](http://beir.ai) (11 datasets):

| Model                  | Accuracy |
| ---------------------- | -------- |
| Lexical Search         | 66.4     |
| bge-reranker-base      | 66.9     |
| bge-reranker-large     | 70.6     |
| cohere-embed-v3        | 70.9     |
| mxbai-rerank-xsmall-v1 | 70.0     |
| mxbai-rerank-base-v1   | 72.3     |
| mxbai-rerank-large-v1  | 74.9     |

Table from their introductory [blog post](https://www.mixedbread.ai/blog/mxbai-rerank-v1).

As we can see, the `mxbai-rerank-xsmall-v1` model is almost on par with much larger models, while being much faster and cheaper to run.


## Downloading the model


In [42]:
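from pathlib import Path

import requests

url = "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/resolve/main/onnx/model_quantized.onnx"
local_model_path = "model/model_quantized.onnx"

# Create the target directory if it does not exist yet
Path(local_model_path).parent.mkdir(parents=True, exist_ok=True)

r = requests.get(url)
r.raise_for_status()  # Fail early instead of writing an error page to disk
with open(local_model_path, "wb") as f:
    f.write(r.content)
print(f"Downloaded model to {local_model_path}")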

Downloaded model to model/model_quantized.onnx


## Defining our Vespa application


In [43]:
from vespa.package import (
    Component,
    Document,
    Field,
    FieldSet,
    Function,
    GlobalPhaseRanking,
    OnnxModel,
    Parameter,
    RankProfile,
    Schema,
)

schema = Schema(
    name="doc",
    mode="index",
    document=Document(
        fields=[
            Field(name="id", type="string", indexing=["summary", "attribute"]),
            Field(
                name="text",
                type="string",
                indexing=["index", "summary"],
                index="enable-bm25",
            ),
            # Let's add a synthetic field (see https://docs.vespa.ai/en/schemas.html#field)
            # to define how the tokens are derived from the text field
            Field(
                name="body_tokens",
                type="tensor<float>(d0[512])",
                # The tokenizer will be defined in the next cell
                indexing=["input text", "embed tokenizer", "attribute", "summary"],
                is_document_field=False,  # Indicates a synthetic field
            ),
        ],
    ),
    fieldsets=[FieldSet(name="default", fields=["text"])],
    models=[
        OnnxModel(
            model_name="crossencoder",
            model_file_path=f"{local_model_path}",
            inputs={
                "input_ids": "input_ids",
                "attention_mask": "attention_mask",
            },
            outputs={"logits": "logits"},
        )
    ],
    rank_profiles=[
        RankProfile(name="bm25", first_phase="bm25(text)"),
        RankProfile(
            name="reranking",
            inherits="default",
            inputs=[("query(q)", "tensor<float>(d0[512])")],
            functions=[
                Function(
                    name="input_ids",
                    # See https://docs.vespa.ai/en/reference/rank-features.html
                    expression="tokenInputIds(512, query(q), attribute(body_tokens))",
                ),
                Function(
                    name="attention_mask",
                    expression="tokenAttentionMask(512, query(q), attribute(body_tokens))",
                ),
            ],
            first_phase="bm25(text)",
            global_phase=GlobalPhaseRanking(
                rerank_count=10,
                expression="onnx(crossencoder){d0:0,d1:0}",
            ),
        ),
    ],
)
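The `input_ids` and `attention_mask` functions above turn the query and document token tensors into model inputs. The following is a rough plain-Python sketch of that kind of sequence construction, based on my reading of the rank-features reference: a start token, the query tokens, a separator, the document tokens, a final separator, then zero padding. The helper names are hypothetical, the special-token ids are left as parameters because they depend on the tokenizer, and the truncation behavior for over-long inputs is an assumption.

```python
from typing import List, Sequence


def strip_padding(tokens: Sequence[float]) -> List[int]:
    """Drop trailing zero padding from a fixed-length token tensor."""
    ids = [int(t) for t in tokens]
    while ids and ids[-1] == 0:
        ids.pop()
    return ids


def build_input_ids(
    length: int,
    query_tokens: Sequence[float],
    doc_tokens: Sequence[float],
    cls_id: int,
    sep_id: int,
) -> List[int]:
    """Build CLS + query + SEP + document + SEP, zero-padded to `length`."""
    ids = [cls_id] + strip_padding(query_tokens) + [sep_id]
    ids += strip_padding(doc_tokens) + [sep_id]
    ids = ids[:length]  # Assumption: over-long sequences are simply cut off
    return ids + [0] * (length - len(ids))


def build_attention_mask(
    length: int, query_tokens: Sequence[float], doc_tokens: Sequence[float]
) -> List[int]:
    """Return 1 for every real token position (including special tokens), else 0."""
    used = 3 + len(strip_padding(query_tokens)) + len(strip_padding(doc_tokens))
    used = min(used, length)
    return [1] * used + [0] * (length - used)


# Illustrative values only: 101 and 102 are BERT-style special-token ids
example_ids = build_input_ids(10, [5.0, 6.0, 0.0, 0.0], [7.0, 8.0, 9.0, 0.0], cls_id=101, sep_id=102)
example_mask = build_attention_mask(10, [5.0, 6.0, 0.0, 0.0], [7.0, 8.0, 9.0, 0.0])
print(example_ids)
print(example_mask)
```

Which special-token ids a model expects depends on its vocabulary, so it is worth checking the tokenizer configuration of the model you deploy against the ids the rank feature inserts.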

In [44]:
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(
    name="reranking",
    schema=[schema],
    components=[
        Component(
            # See https://docs.vespa.ai/en/reference/embedding-reference.html#huggingface-tokenizer-embedder
            id="tokenizer",
            type="hugging-face-tokenizer",
            parameters=[
                Parameter(
                    "model",
                    {
                        "url": "https://huggingface.co/mixedbread-ai/mxbai-rerank-xsmall-v1/raw/main/tokenizer.json"
                    },
                ),
            ],
        )
    ],
)

It is useful to inspect the schema-file (see https://docs.vespa.ai/en/reference/schema-reference.html) before deploying the application.


In [45]:
print(schema.schema_to_text)

schema doc {
    document doc {
        field id type string {
            indexing: summary | attribute
        }
        field text type string {
            indexing: index | summary
            index: enable-bm25
        }
    }
    field body_tokens type tensor<float>(d0[512]) {
        indexing: input text | embed tokenizer | attribute | summary
    }
    fieldset default {
        fields: text
    }
    onnx-model crossencoder {
        file: files/crossencoder.onnx
        input input_ids: input_ids
        input attention_mask: attention_mask
        output logits: logits
    }
    rank-profile bm25 {
        first-phase {
            expression {
                bm25(text)
            }
        }
    }
    rank-profile reranking inherits default {
        inputs {
            query(q) tensor<float>(d0[512])             
        
        }
        function input_ids() {
            expression {
                tokenInputIds(512, query(q), attribute(body_tokens))
            }


In [46]:
# Optionally, we can also write the application package to disk before deploying it.
app_package.to_files("my-app")

In [47]:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker(port=8080)

vespa_docker.deploy(application_package=app_package)

Waiting for configuration server, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 0/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 5/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Waiting for application status, 10/300 seconds...
Using plain http against endpoint http://localhost:8080/ApplicationStatus
Application is up!
Finished deployment.


Vespa(http://localhost, 8080)

In [48]:
analyze_result = vespa_docker.container.exec_run(
    "bash -c '/opt/vespa/bin/vespa-analyze-onnx-model /opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/2/files/crossencoder.onnx'"
)
print(analyze_result.output.decode("utf-8"))

unspecified option[0](optimize model), fallback: true
vm_size: 167420 kB, vm_rss: 47228 kB, malloc_peak: 0 kb, malloc_curr: 1060 (before loading model)
vm_size: 344196 kB, vm_rss: 232436 kB, malloc_peak: 0 kb, malloc_curr: 177836 (after loading model)
model meta-data:
  input[0]: 'input_ids' long[batch_size][sequence_length]
  input[1]: 'attention_mask' long[batch_size][sequence_length]
  output[0]: 'logits' float[batch_size][1]
unspecified option[1](symbolic size 'batch_size'), fallback: 1
unspecified option[2](symbolic size 'sequence_length'), fallback: 1
test setup:
  input[0]: tensor(d0[1],d1[1]) -> long[1][1]
  input[1]: tensor(d0[1],d1[1]) -> long[1][1]
  output[0]: float[1][1] -> tensor<float>(d0[1],d1[1])
unspecified option[3](max concurrent evaluations), fallback: 1
vm_size: 344196 kB, vm_rss: 232576 kB, malloc_peak: 0 kb, malloc_curr: 177836 (no evaluations yet)
vm_size: 344196 kB, vm_rss: 232712 kB, malloc_peak: 0 kb, malloc_curr: 177836 (concurrent evaluations: 1)
estimated
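The `vm_rss` figures in the analyzer output give a rough idea of how much resident memory the model adds once loaded. A small sketch that extracts the two relevant values and computes the difference, using two lines copied from the output above (the regular expression is written for this specific log format):

```python
import re

# Two lines copied from the analyzer output above
log = """\
vm_size: 167420 kB, vm_rss: 47228 kB, malloc_peak: 0 kb, malloc_curr: 1060 (before loading model)
vm_size: 344196 kB, vm_rss: 232436 kB, malloc_peak: 0 kb, malloc_curr: 177836 (after loading model)
"""

pattern = re.compile(r"vm_rss: (\d+) kB.*\((.+)\)")
rss_kb = {}
for line in log.splitlines():
    match = pattern.search(line)
    if match:
        # Key by the label in parentheses, e.g. "before loading model"
        rss_kb[match.group(2)] = int(match.group(1))

delta_kb = rss_kb["after loading model"] - rss_kb["before loading model"]
print(f"Resident memory added by loading the model: {delta_kb} kB (~{delta_kb / 1024:.0f} MB)")
```

For this quantized model the difference is 185208 kB, or roughly 181 MB.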

In [49]:
from vespa.application import Vespa

app = Vespa(url="http://localhost", port=8080)

In [50]:
# Feed a few sample documents to the application
sample_docs = [
    {"id": i, "fields": {"text": text}}
    for i, text in enumerate(
        [
            "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960. It was immediately successful, winning the Pulitzer Prize, and has become a classic of modern American literature.",
            "The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil.",
            "Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961.",
            "Jane Austen was an English novelist known primarily for her six major novels, which interpret, critique and comment upon the British landed gentry at the end of the 18th century.",
            "The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era.",
            "'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan.",
        ]
    )
]

In [51]:
from vespa.io import VespaResponse


def callback(response: VespaResponse, id: str):
    if not response.is_successful():
        print(
            f"Failed to feed document {id} with status code {response.status_code}: Reason {response.get_json()}"
        )


app.feed_iterable(sample_docs, schema="doc", callback=callback)

In [52]:
!vespa visit

{"id":"id:doc:doc::2","fields":{"text":"Harper Lee, an American novelist widely known for her novel 'To Kill a Mockingbird', was born in 1926 in Monroeville, Alabama. She received the Pulitzer Prize for Fiction in 1961."}}
{"id":"id:doc:doc::4","fields":{"text":"The 'Harry Potter' series, which consists of seven fantasy novels written by British author J.K. Rowling, is among the most popular and critically acclaimed books of the modern era."}}
{"id":"id:doc:doc::1","fields":{"text":"The novel 'Moby-Dick' was written by Herman Melville and first published in 1851. It is considered a masterpiece of American literature and deals with complex themes of obsession, revenge, and the conflict between good and evil."}}
{"id":"id:doc:doc::5","fields":{"text":"'The Great Gatsby', a novel written by American author F. Scott Fitzgerald, was published in 1925. The story is set in the Jazz Age and follows the life of millionaire Jay Gatsby and his pursuit of Daisy Buchanan."}}
{"id":"id:doc:doc::3","

In [53]:
from pprint import pprint

with app.syncio(connections=1) as sync_app:
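    query = sync_app.query(
        body={
            "yql": "select * from sources * where userQuery();",
            "query": "who wrote to kill a mockingbird?",
            # Tokenize the query text to populate the query(q) input of the rank profile
            "input.query(q)": "embed(tokenizer, @query)",
            "ranking.profile": "reranking",
            "ranking.listFeatures": "true",
            "presentation.timing": "true",
        }
    )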
    for hit in query.hits:
        pprint(hit)

{'fields': {'body_tokens': {'type': 'tensor<float>(d0[512])',
                            'values': [382.0,
                                       635.0,
                                       1447.0,
                                       46926.0,
                                       280.0,
                                       261.0,
                                       266.0,
                                       2626.0,
                                       1223.0,
                                       293.0,
                                       733.0,
                                       1806.0,
                                       1107.0,
                                       260.0,
                                       3052.0,
                                       19680.0,
                                       261.0,
                                       284.0,
                                       1378.0,
                                       267.0,
       

In [32]:
query.json

{'timing': {'querytime': 0.004, 'summaryfetchtime': 0.0, 'searchtime': 0.005},
 'root': {'id': 'toplevel',
  'relevance': 1.0,
  'fields': {'totalCount': 4},
  'coverage': {'coverage': 100,
   'documents': 6,
   'full': True,
   'nodes': 1,
   'results': 1,
   'resultsFull': 1},
  'children': [{'id': 'id:doc:doc::0',
    'relevance': 3.8951633348237644,
    'source': 'reranking_content',
    'fields': {'sddocname': 'doc',
     'documentid': 'id:doc:doc::0',
     'body_tokens': {'type': 'tensor<float>(d0[512])',
      'values': [382.0,
       5175.0,
       11815.0,
       266.0,
       64338.0,
       280.0,
       269.0,
       266.0,
       2626.0,
       293.0,
       10760.0,
       2967.0,
       1378.0,
       267.0,
       5356.0,
       260.0,
       325.0,
       284.0,
       1587.0,
       1473.0,
       261.0,
       2361.0,
       262.0,
       29103.0,
       7292.0,
       261.0,
       263.0,
       303.0,
       638.0,
       266.0,
       2205.0,
       265.0,
       

It will of course be necessary to evaluate the performance of the cross-encoder in your specific use-case, but this notebook should give you a good starting point.
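When inspecting results for such an evaluation, the 512-value `body_tokens` tensors dominate the printed hits, as in the output above. A minimal sketch of a helper that keeps only the id, relevance, and a text preview of each hit. The helper name `summarize_hits` is hypothetical, and the example hit is hand-written to resemble the output above, on the assumption that the `text` summary field is present in each hit.

```python
from typing import Any, Dict, List


def summarize_hits(hits: List[Dict[str, Any]], max_chars: int = 60) -> List[Dict[str, Any]]:
    """Keep only id, relevance and a text preview for each hit."""
    summary = []
    for hit in hits:
        text = hit.get("fields", {}).get("text", "")
        summary.append(
            {
                "id": hit.get("id"),
                "relevance": hit.get("relevance"),
                "text": text[:max_chars],
            }
        )
    return summary


# Hypothetical hit shaped like the entries in the output above (token values shortened)
example_hits = [
    {
        "id": "id:doc:doc::0",
        "relevance": 3.8951633348237644,
        "fields": {
            "text": "'To Kill a Mockingbird' is a novel by Harper Lee published in 1960.",
            "body_tokens": {"type": "tensor<float>(d0[512])", "values": [382.0, 5175.0]},
        },
    }
]

for row in summarize_hits(example_hits):
    print(row)
```

With the response object from the query cell, this would be called as `summarize_hits(query.hits)`.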
