# Multimodal RAG using Document Retrieval (ColPali) and Vision Language Models (VLMs)

_Authored by: [Sergio Paniego](https://github.com/sergiopaniego)_




**🚨WARNING**: This notebook uses a lot of computational resources. When running in Colab, it will use an A100 GPU.

This notebook demonstrates how to build a multimodal Retrieval Augmented Generation (RAG) system by pairing a multimodal document retriever (ColPali) with a Vision Language Model (Qwen2-VL) that answers questions over the retrieved pages.



## 1. Install dependencies

Let’s start by installing the necessary libraries for our project!

We install `transformers` from source because the VLM used here (Qwen2-VL) is not yet included in a packaged release. Once it is, the regular package can be installed instead.


In [None]:
!pip install -U -q byaldi pdf2image git+https://github.com/huggingface/transformers.git qwen-vl-utils flash-attn
# Tested with byaldi==....

We will also install poppler-utils, which `pdf2image` relies on to process the PDFs.

In [None]:
!sudo apt-get install -y poppler-utils

In [3]:
# Login needed because ColPali uses: https://huggingface.co/google/paligemma-3b-mix-448 // https://github.com/AnswerDotAI/byaldi?tab=readme-ov-file#colpali-access

#from huggingface_hub import notebook_login

#notebook_login()

## 2. Load Dataset 📁

For this recipe, we will use IKEA assembly instructions.

To download the assembly instructions, you can follow [these steps](https://www.ikea.com/us/en/customer-service/assembly-instructions-puba2cdc880).

In [4]:
import requests
import os

pdfs = {
    "MALM": "https://www.ikea.com/us/en/assembly_instructions/malm-4-drawer-chest-white__AA-2398381-2-100.pdf",
    "BILLY": "https://www.ikea.com/us/en/assembly_instructions/billy-bookcase-white__AA-1844854-6-2.pdf",
    "BOAXEL": "https://www.ikea.com/us/en/assembly_instructions/boaxel-wall-upright-white__AA-2341341-2-100.pdf",
    "ADILS": "https://www.ikea.com/us/en/assembly_instructions/adils-leg-white__AA-844478-6-2.pdf",
    "MICKE": "https://www.ikea.com/us/en/assembly_instructions/micke-desk-white__AA-476626-10-100.pdf"
}

output_dir = "data"
os.makedirs(output_dir, exist_ok=True)

for name, url in pdfs.items():
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # stop early instead of saving an HTTP error body as a PDF
    pdf_path = os.path.join(output_dir, f"{name}.pdf")

    with open(pdf_path, "wb") as f:
        f.write(response.content)

    print(f"Downloaded {name} to {pdf_path}")

print("Downloaded files:", os.listdir(output_dir))

Downloaded MALM to data/MALM.pdf
Downloaded BILLY to data/BILLY.pdf
Downloaded BOAXEL to data/BOAXEL.pdf
Downloaded ADILS to data/ADILS.pdf
Downloaded MICKE to data/MICKE.pdf
Downloaded files: ['ADILS.pdf', 'MALM.pdf', 'BILLY.pdf', 'MICKE.pdf', 'BOAXEL.pdf']
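
Even when a request succeeds, the body is not guaranteed to be a PDF: a server may answer with an HTML page and status 200, and the loop would still save it under a `.pdf` name. A minimal sketch of a sanity check, assuming it is enough to test for the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper, not part of the original notebook:

```python
def looks_like_pdf(content: bytes) -> bool:
    """Return True if the payload starts with the PDF file signature."""
    return content.startswith(b"%PDF-")


# Made-up payloads for illustration
print(looks_like_pdf(b"%PDF-1.7 ..."))            # a plausible PDF header
print(looks_like_pdf(b"<html>Not Found</html>"))  # an HTML error page
```

In the download loop, this could be called on `response.content` before writing the file.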


After downloading the instructions, we will convert the PDFs to images so that the pages returned by the document retrieval model (ColPali) can be looked up and passed to the VLM.

In [None]:
import os
from pdf2image import convert_from_path


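def convert_pdfs_to_images(pdf_folder):
    """Render every PDF in `pdf_folder` to a list of page images, keyed by doc_id."""
    pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith('.pdf')]
    all_images = {}

    # NOTE: doc_id follows the os.listdir order. The lookup in section 4 assumes
    # the retriever numbers the documents in the same order when indexing this folder.
    for doc_id, pdf_file in enumerate(pdf_files):
        pdf_path = os.path.join(pdf_folder, pdf_file)
        images = convert_from_path(pdf_path)  # one PIL image per page
        all_images[doc_id] = images

    return all_images

all_images = convert_pdfs_to_images("data/")
all_images[0][2]  # display the third page of the first document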

## 3. Init ColPali Multimodal Document Retrieval model 🤖

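We will load ColPali through [Byaldi](https://github.com/AnswerDotAI/byaldi), a wrapper library that makes late-interaction multimodal retrievers such as ColPali easier to use.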

In [None]:
from byaldi import RAGMultiModalModel

docs_retrieval_model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")
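
For intuition about what the retriever computes: ColPali scores a page with ColBERT-style late interaction. The query and the page are each represented by several vectors, and for every query vector the best-matching page vector is kept before summing. The toy sketch below illustrates that scoring rule in plain Python with made-up 2-D embeddings. It is not Byaldi's implementation, and `maxsim_score` is a hypothetical name:

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take the maximum dot
    product over all page vectors, then sum those maxima."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Made-up embeddings: two query-token vectors, two patch vectors per page
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for both query vectors
page_b = [[0.9, 0.1], [0.8, 0.2]]  # only matches the first query vector well

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a > score_b)  # page_a ranks above page_b
```

Because every query token is matched independently, a page only needs one relevant region per token to score well, which suits pages that mix diagrams and short labels.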

We can index our documents directly with the document retrieval model by passing the folder where the PDFs are stored.

In [None]:
docs_retrieval_model.index(
    input_path="data/",
    index_name="image_index",
    store_collection_with_index=False,
    overwrite=True
)

## 4. Let's retrieve using the Document Retrieval Model 🤔

In [None]:
text_query = "How do I assemble the Micke desk?"

results = docs_retrieval_model.search(text_query, k=3)
results

In [None]:
all_images[4][0]  # page_num values are 1-indexed, while doc_ids are 0-indexed. Source: https://github.com/AnswerDotAI/byaldi?tab=readme-ov-file#searching

In [None]:
def get_grouped_images(results, all_images):
    grouped_images = []

    for result in results:
        doc_id = result['doc_id']
        page_num = result['page_num']
        grouped_images.append(all_images[doc_id][page_num - 1])

    return grouped_images

grouped_images = get_grouped_images(results, all_images)
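
The off-by-one between `page_num` (1-indexed) and list positions (0-indexed) is easy to get wrong, so here is a small self-contained illustration of the same lookup. It uses stand-in data: strings instead of page images and plain dicts instead of retriever results. `pick_pages`, `toy_images` and `toy_results` are hypothetical names used only for this sketch:

```python
def pick_pages(results, all_images):
    # Same lookup as get_grouped_images: subtract 1 because page_num is 1-indexed
    return [all_images[r["doc_id"]][r["page_num"] - 1] for r in results]


# Stand-in data (made up): doc_id -> list of pages
toy_images = {
    0: ["adils-p1", "adils-p2"],
    1: ["micke-p1", "micke-p2", "micke-p3"],
}
toy_results = [
    {"doc_id": 1, "page_num": 1},
    {"doc_id": 1, "page_num": 3},
    {"doc_id": 0, "page_num": 2},
]

toy_grouped = pick_pages(toy_results, toy_images)
print(toy_grouped)  # ['micke-p1', 'micke-p3', 'adils-p2']
```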

In [None]:
grouped_images

## 5. Init Vision Language Model for Question Answering 🙋

In [None]:
from transformers import Qwen2VLForConditionalGeneration, Qwen2VLProcessor
from qwen_vl_utils import process_vision_info
import torch

vl_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2", # https://huggingface.co/docs/transformers/main/en/model_doc/qwen2_vl#flash-attention-2-to-speed-up-generation
)
vl_model.cuda().eval()

In [33]:
min_pixels = 224*224
max_pixels = 1024*1024 # https://huggingface.co/docs/transformers/main/en/model_doc/qwen2_vl#image-resolution-for-performance-boost
vl_model_processor = Qwen2VLProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",  # use the same checkpoint as the model loaded above
    min_pixels=min_pixels,
    max_pixels=max_pixels
)
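
To get a feel for what these bounds mean, the sketch below converts a pixel budget into a rough count of visual tokens per image. It assumes roughly one visual token per 28x28 pixel block, which is how the Qwen2-VL documentation examples express pixel budgets (for instance `256*28*28`). Treat the result as an estimate, not as the processor's exact behaviour. `approx_visual_tokens` is a hypothetical helper:

```python
PIXELS_PER_VISUAL_TOKEN = 28 * 28  # assumption: one visual token per 28x28 block


def approx_visual_tokens(num_pixels: int) -> int:
    """Rough number of visual tokens for an image with `num_pixels` pixels."""
    return num_pixels // PIXELS_PER_VISUAL_TOKEN


low = approx_visual_tokens(224 * 224)
high = approx_visual_tokens(1024 * 1024)
print(low, high)  # 64 1337
```

Under this assumption, three retrieved pages at the upper bound add up to roughly 4,000 visual tokens in the prompt, which is worth keeping in mind when choosing `k` and `max_pixels`.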

## 6. Let's assemble the VLM and test the system 🔧

In [34]:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": grouped_images[0],
            },
            {
                "type": "image",
                "image": grouped_images[1],
            },
            {
                "type": "image",
                "image": grouped_images[2],
            },
            {
                "type": "text",
                "text": text_query
            },
        ],
    }
]

In [35]:
text = vl_model_processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [44]:
image_inputs, _ = process_vision_info(messages)
inputs = vl_model_processor(
    text=[text],
    images=image_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

In [45]:
generated_ids = vl_model.generate(**inputs, max_new_tokens=500)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = vl_model_processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
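
The slicing step is there because, for decoder-only models like this one, `generate` returns the prompt tokens followed by the newly generated ones. Cutting off the first `len(in_ids)` tokens of each sequence leaves only the answer. A toy illustration with made-up token ids in plain lists instead of tensors:

```python
# Made-up token ids for illustration
prompt_ids = [[101, 7, 8, 9]]                  # one prompt of 4 tokens
generated = [[101, 7, 8, 9, 42, 43, 2]]        # prompt followed by 3 new tokens

# Drop the prompt prefix from each generated sequence
trimmed = [out[len(inp):] for inp, out in zip(prompt_ids, generated)]
print(trimmed)  # [[42, 43, 2]]
```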

In [46]:
print(output_text[0])

To assemble the Micke desk, follow these steps:

1. **Prepare the Components**: Ensure all parts are clean and free of debris. Use a screwdriver to remove the protective film from the screws.

2. **Position the Legs**: Place the legs of the desk on the floor, ensuring they are level and stable.

3. **Install the Legs**: Insert the legs into the holes provided on the desk. Use the screws to secure the legs in place.

4. **Attach the Drawers**: Place the drawers on the desk and secure them with the screws provided. Make sure the drawers are level and stable.

5. **Complete Assembly**: Once all components are in place, tighten the screws to secure the desk.

6. **Check for Proper Fit**: Make sure the desk is level and stable. Adjust the legs and drawers as needed.

7. **Final Touches**: Clean the desk with a soft cloth and apply a protective finish if desired.

By following these steps, you should be able to assemble the Micke desk successfully.


## 7. Assembling it all! 🧑‍🏭️

In [51]:
def answer_with_multimodal_rag(vl_model, docs_retrieval_model, vl_model_processor, all_images, text_query, top_k, max_new_tokens):
    # Retrieve the top_k most relevant pages and look up their images
    results = docs_retrieval_model.search(text_query, k=top_k)
    grouped_images = get_grouped_images(results, all_images)

    # Build a single user turn: the retrieved page images followed by the question
    messages = [
        {
            "role": "user",
            "content": [{"type": "image", "image": image} for image in grouped_images]
            + [{"type": "text", "text": text_query}],
        }
    ]

    # Prepare the inputs
    text = vl_model_processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(messages)
    inputs = vl_model_processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate text from the vl_model
    generated_ids = vl_model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens (drop the prompt prefix)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Decode the generated text
    output_text = vl_model_processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text

output_text = answer_with_multimodal_rag(
    vl_model=vl_model,
    docs_retrieval_model=docs_retrieval_model,
    vl_model_processor=vl_model_processor,
    all_images=all_images,
    text_query="What is shown in these images?",
    top_k=3,
    max_new_tokens=100
)
print(output_text[0])

The images in the document are part of a furniture assembly manual. Here is a detailed description of each image:

1. **Image 12**: This image shows a step where the top of a cabinet is being assembled. The assembly involves inserting a screw into a hole, which is labeled with the number 8x and the part number 131372.

2. **Image 13**: This image shows the bottom part of the cabinet being assembled. The assembly involves inserting a screw into a hole, which is labeled with the number 2x and the part number 109341.

3. **Image 39**: This image shows a step where the top of a drawer is being assembled. The assembly involves inserting a screw into a hole, which is labeled with the number 1x and the part number 131372.

4. **Image 40**: This image shows the bottom part of the drawer being assembled. The assembly involves inserting a screw into a hole, which is labeled with the number 2x and the part number 109341.

5. **Image 10**: This image shows a step where the top of a cabinet is bein

## 8. Going further 🧑‍🎓️

...