# Introduction: The models have eyes: reranking goes multi-modal!

Welcome to this short introductory notebook! In this notebook, we're assuming that you are at least vaguely familiar with the [two-step pipeline concept](https://www.answer.ai/posts/2024-09-16-rerankers.html).

Traditionally, this has been applied to text: complex document formats like images, charts and PDFs first went through elaborate data extraction pipelines, were then indexed by text-based retrieval models and re-ranked by text-only rerankers, and were finally passed to a text-only LLM for processing.

However, this is no longer the case! In recent months:

- Retrieval models have become multi-modal, with the success of models like [ColPali](https://arxiv.org/abs/2407.01449) (which you can use in a couple lines of code via our sister library [byaldi](https://github.com/answerdotai/byaldi)) and [DSE](https://arxiv.org/abs/2406.11251).
- Large Language Models are becoming Vision-Language Models (VLMs): Claude Sonnet 3.5 and GPT-4o both support images as input, and open-source models such as [Qwen2-VL](https://qwenlm.github.io/blog/qwen2-vl/) (or [PaliGemma](https://arxiv.org/abs/2407.07726), or [Pixtral](https://mistral.ai/news/pixtral-12b/), or [Phi3.5](https://huggingface.co/microsoft/Phi-3.5-vision-instruct), or even [Llama3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)...) are not far behind.
- But re-ranking models hadn't caught up! Can you imagine: in this era of AI development, it took *multiple* months for this to happen.

However, the last point has finally been addressed with the recent release of [MonoQwen2-VL-v0.1](https://huggingface.co/lightonai/MonoQwen2-VL-v0.1) by LightOn. This first model, of many to come, plays it safe by adapting the [MonoT5 approach](https://aclanthology.org/2020.findings-emnlp.63/) (which we also support, with the `t5` rerankers!) to multi-modal re-ranking, with impressive results for a first attempt.

In this notebook, we'll show how easy it is to use a multi-modal reranker in `rerankers`, both with images held in memory as base64 and with images saved as files on disk.

Let's go!

# Loading a multi-modal reranker

There's no extra trick to loading a multi-modal reranker. As always with `rerankers`, all you need is a single line of code:

In [1]:
from rerankers import Reranker

ranker = Reranker("monovlm", device='cuda')

Loading default monovlm model for language en
Default Model: lightonai/MonoQwen2-VL-v0.1
Loading MonoVLMRanker model lightonai/MonoQwen2-VL-v0.1 (this message can be suppressed by setting verbose=0)
bf16
Using dtype torch.bfloat16
Loading model lightonai/MonoQwen2-VL-v0.1, this might take a while...
Using device cuda.
Using dtype torch.bfloat16.



`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46



VLM true token set to True
VLM false token set to False


*In the future, we hope there'll be more than one type of multimodal reranker. At the moment, only the `monovlm` type is supported, as that's the type of the MonoQwen2-VL model presented above.*


And that's it, the model is loaded! Please note that, as this model is backed by an LLM, we strongly recommend running it on a GPU. It also has additional dependencies, which we recommend installing via `pip install rerankers[monovlm]`. Notably, memory consumption without flash attention 2 is much higher than we'd like for a re-ranking model, so flash attention 2 is enabled by default. If you really want to disable it, you can do so this way:

In [2]:
# ranker = Reranker("monovlm", device='cuda', attention_implementation="YOUR_ALTERNATE_IMPLEMENTATION")
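
If flash attention is not installed in your environment, one option is to pick the implementation at load time. The sketch below is only an illustration under a few assumptions: that `attention_implementation` is forwarded as-is to Hugging Face `transformers` (whose accepted values include `"flash_attention_2"`, `"sdpa"` and `"eager"`), and that the flash attention package is importable as `flash_attn`. `pick_attention_implementation` is a hypothetical helper, not part of `rerankers`.

```python
import importlib.util


def pick_attention_implementation(fallback="sdpa"):
    """Return "flash_attention_2" if flash_attn is importable, else the fallback."""
    # find_spec returns None when the top-level package is not installed.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return fallback


attention_implementation = pick_attention_implementation()
print(attention_implementation)

# Assumption: the value is passed through unchanged to the underlying model loader.
# ranker = Reranker("monovlm", device="cuda", attention_implementation=attention_implementation)
```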

Now, let's go for a basic query:

In [3]:
query = "Who is the director of Howl's Moving Castle?"

We are going to assume that, for some reason, you are using an LLM which needs context to answer this question (as it allows me to use an example that I like).

Let's now get two images to rerank, from the internet:

In [4]:
# This is a frame from the movie "Spirited Away". Should be tangentially related to Miyazaki in the model's weights, but there is no text whatsoever in the image.
spirited_frame_url = "https://cdn.theatlantic.com/thumbor/UoK2ddSqvAtWsk1k_W14V9NEMOM=/6x180:4798x2676/1200x625/media/img/mt/2022/09/spirited_away_anniversary_2/original.jpg"
# This is the poster for the movie "Howl's Moving Castle". It directly contains text showing the director's name.
howls_poster_url = "https://m.media-amazon.com/images/I/81M0Eewr7QL.jpg"

For this first example, we will convert the images to base64, which is the preferred way of manipulating image bytes:

In [5]:
import requests
import base64

# Download an image and return its content encoded as base64
def url_to_base64(url):
    response = requests.get(url)
    response.raise_for_status()  # Fail early if the download did not succeed
    # b64encode returns bytes, so decode to get an actual str
    base64_str = base64.b64encode(response.content).decode("utf-8")
    return base64_str

spirited_base64 = url_to_base64(spirited_frame_url)
howls_base64 = url_to_base64(howls_poster_url)
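
Since a failed download can silently leave you with an encoded error page rather than an image, a quick sanity check on the payload can be useful. The sketch below relies on the fact that JPEG files start with the bytes `FF D8 FF`, which is why base64-encoded JPEG data begins with `/9j/`. `looks_like_jpeg` is a hypothetical helper written for this notebook, and the stand-in payload is just four header bytes, not a real image.

```python
import base64


def looks_like_jpeg(base64_data):
    """Check that a base64 payload decodes to bytes starting with the JPEG magic number."""
    # The first 4 base64 characters decode to the first 3 raw bytes.
    head = base64.b64decode(base64_data[:4])
    return head == b"\xff\xd8\xff"


# Tiny stand-in payload: only the first four bytes of a JPEG header, not a real image.
fake_jpeg = base64.b64encode(b"\xff\xd8\xff\xe0").decode("utf-8")
print(fake_jpeg)                   # /9j/4A==
print(looks_like_jpeg(fake_jpeg))  # True
```

With the images downloaded above, you could call `looks_like_jpeg(howls_base64)` before reranking.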

That's pretty much it, we are ready to rerank! Let's see if the model is able to realise that the Spirited Away frame is useless for this query, but that the Howl's Moving Castle poster is essentially the most relevant image there can be:

In [6]:
results = ranker.rank(query, [spirited_base64, howls_base64], doc_ids=["spirited", "howls"])

for i, doc in enumerate(results.top_k(2)):
    print(f"Rank {i}:")
    print("Document ID:", doc.doc_id)
    print("Document Score:", doc.score)
    print("Document Base64:", doc.base64[:30] + '...')
    print("Document Path:", doc.image_path)

Rank 0:
Document ID: howls
Document Score: 1.0
Document Base64: /9j/4AAQSkZJRgABAQAAAQABAAD/4Q...
Document Path: None
Rank 1:
Document ID: spirited
Document Score: 0.0693359375
Document Base64: /9j/4AAQSkZJRgABAQAAAQABAAD/4g...
Document Path: None


Pretty good! Scores from `Mono-` type rerankers tend to look *pretty extreme*, so the score of 1.0 for Howl's is nothing to worry about.
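
To see why such scores saturate, recall that MonoT5-style rerankers turn the logits of a "true" token and a "false" token into a relevance probability with a two-way softmax. The sketch below illustrates that computation in plain Python. The logit values are made up for illustration, and the exact computation inside `rerankers` may differ.

```python
import math


def mono_score(true_logit, false_logit):
    """Two-way softmax over the "true"/"false" logits; returns P("true")."""
    # Subtract the max for numerical stability before exponentiating.
    m = max(true_logit, false_logit)
    exp_true = math.exp(true_logit - m)
    exp_false = math.exp(false_logit - m)
    return exp_true / (exp_true + exp_false)


# Hypothetical logits: a modest gap already gives a near-saturated score.
print(mono_score(12.0, 2.0))  # very close to 1
print(mono_score(2.0, 4.6))   # roughly 0.07
```

Because the model runs in `bfloat16`, a probability this close to 1 would round to exactly 1.0, which is consistent with the output above.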

But what if we want to use images that are stored locally, and not in-memory as base64? No problem! To show this, we'll first (temporarily) save the two images we downloaded:

In [7]:
# Save base64 images to files
def save_base64_to_file(base64_str, filename):
    with open(filename, 'wb') as f:
        f.write(base64.b64decode(base64_str))

save_base64_to_file(spirited_base64, "spirited.jpg")
save_base64_to_file(howls_base64, "howls.jpg")

We can now pretend that these files were always present on disk. To rerank them by relevance to our query, all we need to do is pass the file paths to the `rank` call:

In [8]:
# Rerank using file paths
results = ranker.rank(query, ["spirited.jpg", "howls.jpg"], doc_ids=["spirited", "howls"])

for i, doc in enumerate(results.top_k(2)):
    print(f"Rank {i}:")
    print("Document ID:", doc.doc_id)
    print("Document Score:", doc.score)
    print("Document Base64:", doc.base64[:30] + '...')
    print("Document Path:", doc.image_path)

Rank 0:
Document ID: howls
Document Score: 1.0
Document Base64: /9j/4AAQSkZJRgABAQAAAQABAAD/4Q...
Document Path: howls.jpg
Rank 1:
Document ID: spirited
Document Score: 0.0693359375
Document Base64: /9j/4AAQSkZJRgABAQAAAQABAAD/4g...
Document Path: spirited.jpg


That's it! There's no hidden trick, and you'll see that the results are identical to the ones we obtained above, indicating that nothing went wrong. As usual with `rerankers`, less-is-more: you can now add MonoQwen2-VL to your pipeline in a couple lines of code!
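
If your images live in a folder rather than in two hand-written paths, the same `rank` call works once you have gathered the paths. Here is a minimal sketch, assuming you want the file name without its extension as the document ID. `collect_image_docs` is a hypothetical helper, not part of `rerankers`, and the extension list is an arbitrary choice.

```python
from pathlib import Path


def collect_image_docs(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return (paths, doc_ids) for the image files in `folder`, sorted by name."""
    files = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    paths = [str(p) for p in files]
    doc_ids = [p.stem for p in files]  # e.g. "howls.jpg" -> "howls"
    return paths, doc_ids


# Usage, assuming `ranker` and `query` are defined as earlier in this notebook:
# paths, doc_ids = collect_image_docs(".")
# results = ranker.rank(query, paths, doc_ids=doc_ids)
```

Note that two files sharing a stem (say `a.jpg` and `a.png`) would end up with the same ID under this scheme, so you may prefer the full file name as the ID in that case.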

Let's now quickly clean up after ourselves:

In [9]:
import os

# Remove each file independently, so that one missing file doesn't skip the other
for filename in ("spirited.jpg", "howls.jpg"):
    try:
        os.remove(filename)
    except FileNotFoundError:
        # Do nothing if you were very diligent and deleted it already
        pass

In [10]:
results.top_k(1)[0]

Result(document=Document(document_type='image', text=None, base64='/9j/4AAQSkZJRgABAQAAAQABAAD/4Q...