# RAT-Inspired ColQwen

This notebook implements a **RAT-inspired** methodology coupled with **ColQwen**, a VLM-based model for information retrieval. The pipeline draws on the principles of **Retrieval-Augmented Thoughts (RAT)** and adapts them to a more focused document-analysis workflow that balances iterative reasoning with multimodal document processing.

## Key Components:
1. **RAT-Inspired pipeline**:
    - **Iterative Chain-of-Thought (CoT) Refinement**:
      - Starts with a zero-shot CoT prompt to draft the initial reasoning process.
      - Each step is progressively refined using retrieved contexts, focusing on accuracy, robustness and logical consistency.
    - **Adapted Retrieval**:
      - Retrieval is tailored to specific PDF sections and multimodal content rather than broader external corpora.
    - **Batch Multimodal Processing**:
      - Extracted page images and text are processed in small batches (two pages per model call in this notebook).

2. **ColQwen Integration**:
    - **ColQwen2** (`vidore/colqwen2-v1.0`) serves as the retriever, loaded through Byaldi's `RAGMultiModalModel`; **Qwen2-VL-7B-Instruct** serves as the generator.
    - Uses **ColBERT-inspired multi-vector representation** for efficient search and retrieval.
    - Capable of processing both text and visual inputs, ensuring thorough analysis of multimodal documents.
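
To make the multi-vector idea concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring in plain Python. The function name `maxsim_score` and the toy 2-dimensional embeddings are illustrative assumptions; this is not ColQwen's actual code, which works on learned token and image-patch embeddings.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, take its best match
    among the page vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy example (hypothetical embeddings): two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # has a good match for each query token
page_b = [[0.5, 0.5], [0.4, 0.4]]  # no strong match for either token

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Each query vector is matched independently, so a page scores highly only when it contains a strong match for every part of the query rather than a single averaged similarity.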

## Pipeline Description:
1. **Document Indexing**:
    - The input PDF is indexed using the ColQwen framework for efficient query-based retrieval.
2. **Initial Reasoning**:
    - A zero-shot CoT reasoning step drafts an initial thought process for answering the query.
3. **Stepwise Refinement**:
    - Each reasoning step is revised using retrieved context, incorporating relevant data from the indexed document.
    - The process emphasizes precision, especially for numerical data, by validating numbers and their contextual alignment.
4. **Multimodal Analysis**:
    - Extracted relevant pages from the PDF are converted to images for multimodal input processing.
    - Text and images are analyzed jointly using Qwen2-VL's capabilities.
5. **Final Answer Generation**:
    - Consolidates all refined reasoning steps into a coherent, contextually accurate response.
6. **Answer Validation**:
    - Verifies the output to ensure alignment with the query, focusing on accuracy and completeness.
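
The draft, retrieve, and revise core of the pipeline above can be sketched as a plain control-flow skeleton. Everything here is a stand-in: `rat_loop`, `fake_draft`, `fake_retrieve`, and `fake_revise` are hypothetical names, and the stubs return canned strings in place of real model and retriever calls, so only the shape of the loop is meaningful.

```python
from typing import Callable, List


def rat_loop(
    query: str,
    draft_steps: Callable[[str], List[str]],
    retrieve_pages: Callable[[str], List[str]],
    revise: Callable[[str, str, List[str]], str],
) -> List[str]:
    """Draft steps zero-shot, then revise each later step against retrieved pages."""
    steps = draft_steps(query)        # initial reasoning
    if not steps:
        return []
    pages = retrieve_pages(query)     # retrieval from the indexed document
    revised = [steps[0]]              # the first step is kept as drafted
    for step in steps[1:]:            # stepwise refinement
        for page in pages:
            revised.append(revise(step, page, revised))
    return revised


# Hypothetical stubs standing in for the VLM and the retriever.
def fake_draft(query: str) -> List[str]:
    return ["Step 1: identify the target", "Step 2: locate the table"]


def fake_retrieve(query: str) -> List[str]:
    return ["page-3", "page-7"]


def fake_revise(step: str, page: str, history: List[str]) -> str:
    return f"{step} [checked against {page}, {len(history)} prior findings]"


trace = rat_loop("What was the revenue growth?", fake_draft, fake_retrieve, fake_revise)
for line in trace:
    print(line)
```

Passing the accumulated findings into each revision is what makes the refinement iterative: later steps see everything established by earlier ones.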


## Adaptation Highlights:
- **Iterative, Task-Specific Retrieval**:
    - Unlike RAT's broad external knowledge retrieval, this implementation focuses on refining reasoning within the scope of a single document.
- **Multimodal Focus**:
    - Specifically tailored to PDFs with both text and visual data, leveraging Vision-Language models for deeper analysis.
- **Simplified Scope**:
    - This adaptation prioritizes document-specific insights, making it suitable for tasks like financial report analysis or technical documentation reviews.

By taking inspiration from RAT while adapting its principles for document-specific multimodal reasoning, this implementation strikes a balance between theoretical rigor and practical application.





## Our Pipeline

![Our Pipeline](./pipeline_rat_colqwen.png)

In [None]:
!sudo apt-get update
!sudo apt-get install -y poppler-utils

from byaldi import RAGMultiModalModel
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from pdf2image import convert_from_path
from groq import Groq
import os
import PyPDF2
from typing import List, Dict

# Groq API key: replace the placeholder with your own token (avoid committing a real key)
os.environ["GROQ_API_KEY"] = "yourGroqAPIToken"

In [None]:
# init models
RAG = RAGMultiModalModel.from_pretrained("vidore/colqwen2-v1.0")
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cuda"
)

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

In [None]:
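def extract_multiple_pages(pdf_path: str, results: List[Dict], k: int = 4) -> tuple:
    """Copy the k best-ranked pages into a new PDF.

    Returns the path of the new PDF and the 0-based indices of the copied pages.
    """

    # page_num is 1-based in the retrieval results; PyPDF2 indexes pages from 0
    pages_to_extract = [result["page_num"] - 1 for result in results[:k]]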

    writer = PyPDF2.PdfWriter()
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page_num in pages_to_extract:
            writer.add_page(reader.pages[page_num])

    output_pdf_path = "/content/extracted_pages.pdf"

    with open(output_pdf_path, "wb") as output_file:
        writer.write(output_file)

    return output_pdf_path, pages_to_extract
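
As a side note on the page-index arithmetic in `extract_multiple_pages`: the function subtracts 1 from each `page_num`, which implies 1-based page numbers in the retrieval results and 0-based indexing in the PDF reader. The sketch below isolates that conversion and adds two safeguards the function does not have, dropping duplicates and out-of-range pages. The helper name `to_zero_based_indices` and the sample `results` list are hypothetical.

```python
from typing import Dict, List


def to_zero_based_indices(results: List[Dict], num_pages: int, k: int = 4) -> List[int]:
    """Convert the top-k 1-based page numbers to unique, in-range 0-based
    indices, preserving retrieval order."""
    indices: List[int] = []
    for result in results[:k]:
        index = result["page_num"] - 1
        if 0 <= index < num_pages and index not in indices:
            indices.append(index)
    return indices


# Hypothetical retrieval output: page 5 appears twice, page 99 is out of range.
results = [{"page_num": 5}, {"page_num": 2}, {"page_num": 5}, {"page_num": 99}]
page_indices = to_zero_based_indices(results, num_pages=10)
print(page_indices)
```

Whether such safeguards are needed depends on the retriever; they are cheap insurance against an `IndexError` when the index and the PDF on disk get out of sync.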

In [None]:
def create_rat_prompt(query: str, step: int, thoughts: List[str], image_index: int) -> str:
    """ Creates a prompt for RAT reasoning specialized in information retrieval"""

    return f"""You are an expert at finding and verifying specific information in documents.

    Question: {query}

    Previous findings:
    {' '.join(thoughts[:step])}

    Current step {step + 1}, examining image {image_index + 1}:
    Let's carefully analyze this page:
    1. What specific numbers/data are we looking for?
    2. Are we looking for a percentage (growth, change, margin, ...) ? an amount ? a date ?
    3. What exact numbers do you see in this image?
    4. What is the precise context of each number found? For EACH number found, give the context surrounding EACH of them
    5. How does this information connect to our query?

    ### Important Guidelines:
    - Focus on context and the explicit connection between numbers and the query.
    - The frequency of a number being found is NOT a reliable criterion for correctness.
    - Verify if the numbers are associated with clear labels (e.g., "change", "margin", "growth").
    - For changes (e.g., "from X to Y"), ensure both values (start and end) are included in the response.
    - Provide a single verified answer, ensuring all alternative numbers are justified or discarded explicitly based on context.

    Provide a clear finding that:
    - States the exact number found with its full context
    - Explains why this number is chosen as the most precise
    - Includes additional context (e.g., source section, exact table row or phrase) to justify the choice

    """

In [None]:
def process_thought(model, processor, thought: str, step_prompt: str, image_batch) -> List[str]:
    """ Revises one thought against a batch of page images with the model """

    # One system message carries the step prompt; one user message carries
    # every image in the batch, followed by the thought to revise.
    messages = [
        {"role": "system", "content": step_prompt},
        {
            "role": "user",
            "content": [
                *[{"type": "image", "image": image} for image in image_batch],
                {"type": "text", "text": thought},
            ],
        },
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")

    generate_ids = model.generate(
        **inputs,
        max_new_tokens=640,
        do_sample=True,  # sampling must be on for temperature/top_p to apply
        temperature=0.7,
        top_p=0.9,
    )
    return processor.batch_decode(
        [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

def generate_initial_thoughts(text_query: str, model, processor) -> List[str]:
    """ Generates the initial strategy for information retrieval """

    prompt = f"""Let's plan our search for this information step by step:

    Question: {text_query}

    Step 1: First, we need to identify the exact information we're looking for.
    Step 2: Then, based on the Question, identify the type of information we want to find : percentage (growth, change, margin, ...), amount, date, etc.
    Step 3: Then, locate where this type of information is typically presented.
    Step 4: Finally, find any additional context that could help verify our findings. For changes (e.g., "from X to Y"), ensure both the starting and ending values are captured along with the calculated change. Avoid assuming correctness based solely on repetition; focus on contextual accuracy.

    """

    messages = [{"role": "user",
                 "content": [
                     {"type": "text",
                      "text": prompt}
                     ]
                 }
               ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt").to("cuda")

    generate_ids = model.generate(
        **inputs,
        max_new_tokens=640,
        do_sample=True,  # sampling must be on for temperature/top_p to apply
        temperature=0.7,
        top_p=0.9,
    )

    initial_thoughts = processor.batch_decode(
        [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    initial_thoughts_list = [t.strip() for t in initial_thoughts.split("\n") if t.strip() and "Step" in t]

    # print initial sequence of thoughts
    print("\nGenerated Initial Thoughts:")
    for thought in initial_thoughts_list:
        print(thought)

    return initial_thoughts_list
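
The filtering at the end of `generate_initial_thoughts` keeps only non-empty lines that contain the substring "Step". A minimal, model-free illustration of that behaviour follows, using a made-up response string; the helper name `parse_steps` is hypothetical.

```python
from typing import List


def parse_steps(raw_output: str) -> List[str]:
    """Keep stripped, non-empty lines that contain the substring 'Step'."""
    return [
        line.strip()
        for line in raw_output.split("\n")
        if line.strip() and "Step" in line
    ]


# Made-up model output for illustration.
sample = """Here is my plan.

Step 1: Identify the metric requested.
  Step 2: Find the table that reports it.
Note: figures may be in thousands.
"""

steps = parse_steps(sample)
print(steps)
```

Because the match is case-sensitive and substring-based, a line such as "next step: ..." is dropped while any line mentioning "Step" is kept, so the result depends on the model following the "Step N:" format requested in the prompt.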

In [None]:
def rat_process_query(text_query: str, RAG, model, processor, pdf_path: str, k: int = 4) -> str:
    """ RAT-inspired process and final prompt"""

    # init sequence of thoughts
    initial_thoughts = generate_initial_thoughts(text_query, model, processor)
    if not initial_thoughts:
        raise ValueError("The model returned no 'Step' lines to refine.")
    revised_thoughts = [initial_thoughts[0]]

    # context: retrieve the k best pages and render them as images
    results = RAG.search(text_query, k=k)
    output_pdf_path, _ = extract_multiple_pages(pdf_path, results, k)
    images = convert_from_path(output_pdf_path)

    # refine each remaining thought against the pages, batch_size images at a time
    batch_size = 2
    for i in range(1, len(initial_thoughts)):
        current_thought = initial_thoughts[i]

        for batch_start in range(0, len(images), batch_size):
            image_batch = images[batch_start:batch_start + batch_size]
            step_prompt = create_rat_prompt(text_query, i, revised_thoughts, batch_start)
            revised_thought = process_thought(model, processor, current_thought, step_prompt, image_batch)
            revised_thoughts.extend(revised_thought)

    final_prompt = f"""Based on our systematic information search:

    Found information:
    {' '.join(revised_thoughts)}

    Original question: {text_query}

    Please provide:

    - NUMBERS FOUND:
    [List all relevant numbers with their exact context and source]

    - VERIFICATION:
    [Explain which number is the most precise and correct, why others are not, and include starting/ending values for changes if relevant. Avoid assuming correctness based solely on repetition. Focus instead on the detailed context and alignment with the query.]

    - FINAL ANSWER:
    [Provide only the single, verified number that answers the query with its full context. Include changes as "from X to Y" if applicable.]

    """

    # send the final prompt as a user turn so the model answers it directly
    messages = [
        {"role": "user",
         "content": [{"type": "text", "text": final_prompt}]}
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], return_tensors="pt").to("cuda")

    generate_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,  # sampling must be on for temperature/top_p to apply
        temperature=0.7,
        top_p=0.9,
    )

    final_response = processor.batch_decode(
        [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generate_ids)],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    print("\nFinal Response:")
    print(final_response)

    return final_response
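
The image loop in `rat_process_query` walks the page images in slices of `batch_size`. The slicing pattern is shown here in isolation, with strings standing in for page images; the helper name `batched` is hypothetical.

```python
from typing import List, Sequence, Tuple


def batched(items: Sequence[str], batch_size: int) -> List[Tuple[int, List[str]]]:
    """Return (start_index, batch) pairs covering items in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, list(items[start:start + batch_size]))
        for start in range(0, len(items), batch_size)
    ]


pages = ["p1", "p2", "p3", "p4", "p5"]  # placeholders for page images
batches = batched(pages, batch_size=2)
for start, batch in batches:
    print(start, batch)
```

With `k=4` pages and `batch_size=2`, this pattern yields two batches per reasoning step; the last batch is shorter whenever the page count is not a multiple of the batch size.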

In [None]:
if __name__ == "__main__":

    pdf_path = "/your/pdf.pdf"

    # doc indexing
    print("Indexing PDF...")

    RAG.index(
        input_path=pdf_path,
        index_name="multimodal_rag",
        store_collection_with_index=False,
        overwrite=True
    )

    # querying
    query = "your query"
    response = rat_process_query(query, RAG, model, processor, pdf_path, k=4)
    print("\nQuery:", query)
    print("\nResponse:", response)
