### **Project Redefinition**

The project evolves into a dual pipeline that combines two distinct approaches to processing scientific documents from ArXiv: one focused on multimodal information retrieval for QA, the other on building and using a graph for advanced reasoning. The new structure is summarized below:

---

### **1. Overall Goal**
Create a Google Colab notebook that explores two distinct methods:
1. **Multimodal Approach for QA**: Retrieval of content (text and visual) from ArXiv documents, with quantitative evaluation of performance on a multiple-choice QA task.
2. **GraphRAG for Advanced Reasoning**: Building and using a knowledge graph from ArXiv documents to answer complex questions (multi-hop reasoning and aggregations) using LLM-based extraction and graph queries.

---


### **2. Description of the Two Modules**

#### **Module 1: Multimodal Approach for QA**
- **Focus**: Extract text chunks and figures from the PDFs, use them in retrieval pipelines (multimodal and text-only), and evaluate them quantitatively on a multiple-choice QA task.
- **Technologies**:
  - Multimodal models (e.g. ColQwen2).
  - Late interaction for text-only retrieval.
  - Evaluation metrics such as Precision@k, MRR, and downstream QA accuracy.
- **Output**:
  - Numerical results on the performance of the various approaches.
  - Comparative tables and charts.
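
The evaluation metrics named for Module 1 can be sketched as follows. This is a minimal illustration, assuming each query yields a ranked list of retrieved item ids plus a set of relevant ids, and that QA predictions are option letters. The function names and the toy data are hypothetical and not part of the notebook.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[Hashable], Set[Hashable]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


def qa_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of multiple-choice questions answered with the gold option letter."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy data (hypothetical): two queries, each with one relevant image.
runs = [
    (["img_a", "img_b", "img_c"], {"img_b"}),
    (["img_d", "img_e", "img_f"], {"img_d"}),
]
print(precision_at_k(*runs[0], k=2))  # 0.5
print(mean_reciprocal_rank(runs))     # 0.75
```

Precision@k here divides by `k` rather than by the number of items actually returned, which is the usual convention when every query returns at least `k` results.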

---

#### **Module 2: GraphRAG for Advanced Reasoning**
- **Focus**: Build a knowledge graph from the ArXiv documents (entity and relation extraction) to answer complex questions.
- **Pipeline**:
  1. **Entity and Relation Extraction**:
     - Use an LLM (e.g. GPT-4o-mini) to identify nodes and edges, with variants that add relation strength or descriptions.
  2. **Graph Population**:
     - Storage in a Neo4j database for later querying.
  3. **Graph Querying**:
     - Approaches:
       - Direct neighbors of entities.
       - Community detection and clustering.
       - Cypher query generation with an LLM.
  4. **Reasoning and Answering**:
     - Generate answers conditioned on entity clusters.
     - Combine local and global answers with estimated weights.
- **Technologies**:
  - **Neo4j** as the graph database.
  - **Milvus** as the vector database for indexing.
  - **LangGraph** for the automated pipeline.
- **Output**:
  - Interactive demo with example questions.
  - Visualizations of graph queries and answers.
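
The "direct neighbors" query strategy can be illustrated without a database. The sketch below assumes the extraction step yields `(head, relation, tail)` triples and keeps them in an in-memory adjacency map. The triples, relation names, and function names are hypothetical; in the planned pipeline the same lookup would run against Neo4j instead.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def build_adjacency(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index triples by entity, treating every edge as undirected for lookup."""
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
        adjacency[tail].append((relation, head))
    return adjacency


def direct_neighbors(adjacency: Dict[str, List[Tuple[str, str]]], entity: str) -> List[str]:
    """Return the sorted, de-duplicated entities one hop away from `entity`."""
    return sorted({neighbor for _, neighbor in adjacency.get(entity, [])})


# Hypothetical triples, as an LLM extraction step might produce them.
triples = [
    ("GraphRAG", "STORES_IN", "Neo4j"),
    ("GraphRAG", "INDEXES_WITH", "Milvus"),
    ("ColQwen2", "USES", "late interaction"),
]
adjacency = build_adjacency(triples)
print(direct_neighbors(adjacency, "GraphRAG"))  # ['Milvus', 'Neo4j']
```

Assuming nodes are stored with a `name` property, a comparable Cypher lookup would be along the lines of `MATCH (e {name: $name})--(n) RETURN DISTINCT n.name`.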

---

## Setup

In [None]:
!pip install datasets pillow matplotlib
!pip install torch
!pip install -q pdf2image
!pip install git+https://github.com/huggingface/transformers.git
!pip install qwen-vl-utils
!pip install flash-attn
!pip install byaldi

In [None]:
!pip install torch torchvision

In [None]:
!sudo apt-get install -y poppler-utils



In [12]:
# Import required libraries
import os

from datasets import load_dataset

import torch
import transformers
from byaldi import RAGMultiModalModel
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

import tqdm

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

In [6]:
# Set device
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"
    
print(f"Using device: {device}")

Using device: mps


## Load Dataset

In [7]:
"""
@misc{li2024multimodal,
            title={Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models}, 
            author={Lei Li and Yuqi Wang and Runxin Xu and Peiyi Wang and Xiachong Feng and Lingpeng Kong and Qi Liu},
            year={2024},
            eprint={2403.00231},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
        }
        """

# Load parquet files inside folder data
ds = load_dataset('parquet', data_files='data/*.parquet')

In [64]:
import requests

def download_arxiv_pdf(arxiv_id, save_path):
    url = f"https://arxiv.org/pdf/{arxiv_id}"
    # Use a timeout so a stalled connection does not hang the notebook
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f"PDF downloaded successfully: {save_path}")
    else:
        print(f"Failed to download PDF. Status code: {response.status_code}")

# Example usage
arxiv_id = "1810.10511"
save_path = f"{arxiv_id}.pdf"
download_arxiv_pdf(arxiv_id, save_path)


PDF downloaded successfully: 1810.10511.pdf


In [4]:
ds['train'][0]

{'query': 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1621x1191>,
 'image_filename': 'images/1810.10511_2.jpg',
 'options': "['A. Correction causes a significant increase in surface density across all radii.', 'B. Correction results in a decrease in surface density for larger radii.', 'C. Correction causes the surface density to converge with the fspec = 1 case at larger radii.', 'D. Correction does not affect the surface density trend at all.', '-']",
 'answer': 'C',
 'page': '',
 'model': 'gpt4V',
 'prompt': '',
 'source': 'arxiv_qa'}
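
Two fields in the row above need light parsing before they can be used: `options` is a string that looks like a Python list, and `image_filename` embeds the arXiv identifier that `download_arxiv_pdf` expects. A minimal sketch, assuming that the trailing `'-'` entry is a padding placeholder and that filenames follow the `<arxiv_id>_<figure_index>.jpg` pattern seen in this sample. Both helper names are hypothetical, and other naming schemes are not handled.

```python
import ast
import os
from typing import List


def parse_options(options_field: str) -> List[str]:
    """Parse the stringified list in `options`, dropping '-' placeholders."""
    options = ast.literal_eval(options_field)  # safe parsing of a literal, no eval()
    return [opt for opt in options if opt != "-"]


def arxiv_id_from_filename(image_filename: str) -> str:
    """'images/1810.10511_2.jpg' -> '1810.10511' (assumed naming pattern)."""
    stem = os.path.splitext(os.path.basename(image_filename))[0]
    arxiv_id, _, _figure_index = stem.rpartition("_")
    return arxiv_id or stem  # fall back to the whole stem if there is no '_'


# Shortened stand-in for the row shown above
sample = {
    "image_filename": "images/1810.10511_2.jpg",
    "options": "['A. First choice.', 'B. Second choice.', '-']",
}
print(parse_options(sample["options"]))
print(arxiv_id_from_filename(sample["image_filename"]))
```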

In [25]:
# Save all images under data/images (image_filename already starts with "images/")
PATH = 'data/images'
os.makedirs(PATH, exist_ok=True)

for item in ds['train']:
    item['image'].save(os.path.join('data', item['image_filename']))

#### Generating image descriptions

In [8]:
multimodal_model_name = "Qwen/Qwen2-VL-2B-Instruct"
multimodal_model = Qwen2VLForConditionalGeneration.from_pretrained(
                                                        multimodal_model_name,
                                                        trust_remote_code=True, 
                                                        torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(multimodal_model_name, trust_remote_code=True)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.32it/s]


In [9]:
def create_messages(img):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": img,
                    "max_pixels": 720**2,
                },
                {
                    "type": "text", 
                    "text": 
                        "Based on the image, provide a detailed scientific description of the graph."
                },
            ],
        }
    ]

    return messages


In [16]:
def invoke_generation(messages):

    print('Applying vision template...')
    # Apply a chat template to the messages without tokenizing and add a generation prompt
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    print('Tokenizing text...')
    # Process vision information from the messages to get image and video inputs
    image_inputs, video_inputs = process_vision_info(messages)

    print('Preparing inputs...')
    # Prepare the inputs for the model by combining text, images, and videos
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",  # Return the inputs as PyTorch tensors
    )

    print('Moving inputs to device...')
    # Move the inputs to the specified device (e.g., GPU)
    inputs = inputs.to(device)

    print('Generating output...')
    # Generate output IDs from the model with a maximum of 500 new tokens
    generated_ids = multimodal_model.generate(**inputs, max_new_tokens=500)

    print('Decoding output...')
    # Trim the generated IDs to remove the input IDs from the beginning
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    print('Decoding text...')
    # Decode the trimmed generated IDs to get the output text
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    print('Output:', output_text)
    print('\n\n ---------------------------------------- \n\n')
    return output_text
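
The trimming step in `invoke_generation` works because, for a decoder-only model like this one, `generate` returns each prompt's token ids followed by the newly generated ids. A toy illustration with plain Python lists standing in for tensors (the token ids are made up, and `trim_prompt_tokens` is a hypothetical helper name):

```python
from typing import List


def trim_prompt_tokens(input_ids: List[List[int]], generated_ids: List[List[int]]) -> List[List[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts of three tokens each, followed by two generated tokens
input_ids = [[101, 7, 8], [101, 9, 10]]
generated_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

trimmed = trim_prompt_tokens(input_ids, generated_ids)
print(trimmed)  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would repeat the chat template and the user prompt before the model's answer.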

In [19]:
# Function to process a single example
def process_item(example):
    try:
        img = example['image']  # Extract the image
        messages = create_messages(img)  # Create messages
        output_text = invoke_generation(messages)  # Generate output text
        return {"image_filename": example["image_filename"], "output_text": output_text}
    except Exception as e:
        print(f"Error processing item: {e}")

# Function to save processed data as a checkpoint
def save_checkpoint(processed_data, checkpoint_file):
    df = pd.DataFrame(processed_data)
    df.to_parquet(checkpoint_file, index=False)
    print(f"Checkpoint saved: {len(processed_data)} items to {checkpoint_file}")

# Process dataset in batches with checkpoints
def process_dataset_with_checkpoints(ds, batch_size=10, output_file="processed_images.parquet"):
    train_data = ds['train']  # Access training data
    processed_items = []  # To store processed items
    checkpoint_file = output_file

    for idx, example in enumerate(tqdm.tqdm(train_data)):
        processed_item = process_item(example)
        # process_item returns None when generation fails; skip those rows
        if processed_item is not None:
            processed_items.append(processed_item)

        print(processed_items)

        # Save every `batch_size` items or at the end
        if (idx + 1) % batch_size == 0 or (idx + 1) == len(train_data):
            save_checkpoint(processed_items, checkpoint_file)

    return processed_items

# Run the processing
processed_dataset = process_dataset_with_checkpoints(
    ds, batch_size=10, output_file="processed_images.parquet"
)


  0%|          | 0/500 [00:00<?, ?it/s]

Applying vision template...
Tokenizing text...
Preparing inputs...
Moving inputs to device...
Generating output...


  0%|          | 1/500 [00:27<3:52:47, 27.99s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a plot of the surface density of a hypothetical object as a function of its radius. The x-axis represents the radius of the object in units of h^{-1} Mpc, while the y-axis represents the surface density in units of h^2 Mpc^{-2}. The surface density is calculated using the formula:\n\n\\[ \\Sigma = \\frac{1}{4\\pi R^2} \\]\n\nwhere \\( R \\) is the radius of the object.\n\nThere are three lines on the graph:\n\n1. **Black line (f_spec = 1)**: This line represents the surface density when the factor \\( f_{\\text{spec}} \\) is equal to 1. This line is a straight line with a slope of -1, indicating that the surface density decreases as the radius increases.\n\n2. **Blue dashed line (f_spec ≠ 1, w/o corr.)**: This line represents the surface density when the factor \\( f_{\\text{spec}} \\) is not equal to 1, but without any correction. This line is a curved line with a slope that is not -1, indicating that the surface 

  0%|          | 2/500 [00:44<2:58:15, 21.48s/it]

Decoding output...
Decoding text...
Output: ['The image is a bar chart that compares the percentage of users in different categories across three different time periods: JUL10, FEB11, and FEB11Q. The x-axis represents the thread ID, ranging from 0 to 50. The y-axis represents the percentage of users in each category, with values ranging from 0 to 1. The categories are labeled as follows:\n\n- PO: Percentage of users in the PO category\n- PiS: Percentage of users in the PiS category\n- UNK: Percentage of users in the UNK category\n\nThe chart shows the distribution of users across these categories for each time period. For example, on JUL10, the majority of users are in the PO category, with a smaller percentage in the PiS and UNK categories. On FEB11, the distribution is similar, but the percentage of users in the PiS category is slightly higher than on JUL10. On FEB11Q, the distribution is even more skewed towards the PO category, with a much higher percentage of users in this categor

  1%|          | 3/500 [01:13<3:23:29, 24.57s/it]

Decoding output...
Decoding text...
Output: ['The image depicts two different scenarios related to quantum mechanics, specifically the behavior of a wave function and the propagation of light.\n\n### Left Scenario: Wave Function\n1. **Wave Function Representation**: The wave function is represented by a series of concentric circles. Each circle represents a different energy level of the quantum system. The circles are connected by lines, indicating the transitions between these energy levels.\n2. **Energy Levels**: The wave function is shown to be a superposition of these energy levels, meaning it is a combination of all possible states at different energies.\n3. **Transition Probabilities**: The wave function is labeled with a probability amplitude, \\( | \\psi_g, J_{cl} > \\), which represents the probability of finding the system in a particular state \\( \\psi_g \\) at a given energy level \\( J_{cl} \\).\n4. **Transition Amplitudes**: The transition amplitudes are represented by t

  1%|          | 4/500 [01:42<3:38:46, 26.46s/it]

Decoding output...
Decoding text...
Output: ['This image is a scientific graph that compares the real component and residual of a star, labeled as "SVS13B q = 1.0," using three different models: a Gaussian, a point source, and a model with a power-law radial profile, denoted as \\( R^{-1.5} \\) and \\( R^{-2.0} \\). The graph is plotted against the angular scale (arcsec) on the x-axis and the UV-distance (kλ) on the y-axis.\n\n### Top Graph (Real Component)\n- **X-Axis (Angular Scale (arcsec))**: The x-axis represents the angular scale in arcseconds.\n- **Y-Axis (Real Component (mJy))**: The y-axis represents the real component in milli-Jansky (mJy).\n- **Data Points (Black Circles)**: These represent the measured real component values.\n- **Curves (Green, Purple, and Red)**: These curves represent different models:\n  - **Green Curve**: \\( R^{-1.5} \\)\n  - **Purple Curve**: \\( R^{-2.0} \\)\n  - **Red Curve**: Gaussian\n- **Error Bars**: The error bars on the data points indicate th

  1%|          | 5/500 [02:11<3:46:12, 27.42s/it]

Decoding output...
Decoding text...
Output: ['The image presents a comparative analysis of two different types of graphs, each representing a different type of network. The graphs are labeled as (a) and (b), and they are plotted on a log-log scale, which allows for a more detailed examination of the data.\n\n### Graph (a)\n- **Title**: Average arrival flux vs. average queue length\n- **Axes**:\n  - **X-axis**: Betweenness\n  - **Y-axis**: Average arrival flux\n- **Data Points**: The graph shows a scatter plot with data points representing various values of betweenness and average arrival flux. The data points are colored differently, with red points representing a higher average arrival flux and black points representing a lower average arrival flux.\n\n### Graph (b)\n- **Title**: Betweenness vs. average queue length\n- **Axes**:\n  - **X-axis**: Betweenness\n  - **Y-axis**: Average queue length\n- **Data Points**: Similar to Graph (a), this graph also shows a scatter plot with data po

  1%|          | 6/500 [02:32<3:26:29, 25.08s/it]

Decoding output...
Decoding text...
Output: ['The image is a scientific graph that compares the efficiency of a system over time for different parameter changes. The x-axis represents the simulation time, ranging from 0 to 1000 seconds. The y-axis represents the efficiency, E, which is the average of 100 runs.\n\nThere are four different lines in the graph, each representing a different parameter change:\n1. Reference: This line represents the efficiency of the system when all parameters are set to their default values.\n2. Single parameter changes: This line represents the efficiency of the system when a single parameter is changed.\n3. N=800: This line represents the efficiency of the system when the number of particles (N) is increased to 800.\n4. Δt=2×10^-3: This line represents the efficiency of the system when the time step (Δt) is decreased to 2×10^-3.\n\nThe graph shows that the efficiency decreases as the time step (Δt) is decreased and the number of particles (N) is increased

  1%|▏         | 7/500 [02:55<3:21:55, 24.58s/it]

Decoding output...
Decoding text...
Output: ['The image is a scientific graph that appears to be related to the study of cosmic microwave background (CMB) anisotropies. The graph is a 2D plot with two axes: the x-axis is labeled "log10(arclength / λ)" and the y-axis is labeled "log10(log10(Δ(−1)))". The x-axis represents the length scale of the CMB anisotropies, while the y-axis represents the logarithm of the logarithm of the fractional difference between the observed and predicted CMB power spectrum.\n\nThere are three different plots labeled a, b, and c, each with a different color scheme. The color scheme for each plot is as follows:\n- Plot a: Light blue\n- Plot b: Dark blue\n- Plot c: Dark blue\n\nEach plot contains a set of data points, represented by dots, and a fitted line. The fitted lines are labeled with the equation "log10(log10(Δ(−1))) = -0.45 + 0.0001 * arclength / λ", where "Δ(−1)" is the fractional difference between the observed and predicted CMB power spectrum.\n\nTh

  2%|▏         | 8/500 [03:25<3:34:05, 26.11s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image represents a comparison between two different functions, \\( S(\\Omega_0) \\), where \\( \\Omega_0 \\) is a variable representing a parameter or condition. The graph has two distinct lines, each labeled with a different function name:\n\n1. The blue line is labeled \\( S_{cl}^{\\text{BH}}(\\Omega_0) \\).\n2. The green line is labeled \\( S_{cl}^{\\text{HR}}(\\Omega_0) \\).\n\nThe x-axis of the graph is labeled \\( \\Omega_0 \\), which represents the parameter \\( \\Omega_0 \\). The y-axis is labeled \\( S \\), which represents the function values \\( S(\\Omega_0) \\).\n\nThe graph shows the following characteristics:\n\n- The blue line \\( S_{cl}^{\\text{BH}}(\\Omega_0) \\) starts at a higher value than the green line \\( S_{cl}^{\\text{HR}}(\\Omega_0) \\) at \\( \\Omega_0 = 0 \\).\n- As \\( \\Omega_0 \\) increases, the blue line \\( S_{cl}^{\\text{BH}}(\\Omega_0) \\) decreases more steeply than the green line \\( S_{

  2%|▏         | 9/500 [03:55<3:43:32, 27.32s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a time series plot, which shows the values of a variable \\( a_k(t) \\) over time \\( t \\). The x-axis represents time \\( t \\) in units of \\( 10^{6} \\) (which is 1 million), and the y-axis represents the value of \\( a_k(t) \\) in the range of 0 to 0.7. \n\nThe graph consists of two main lines:\n1. The upper line (black) shows the time series of \\( a_k(t) \\).\n2. The lower line (gray) shows the standard deviation of \\( a_k(t) \\).\n\n### Analysis:\n\n#### Upper Line (Black):\n- The black line shows the time series of \\( a_k(t) \\). It starts at approximately 0.3 and increases over time, reaching a peak around \\( t \\approx 3 \\times 10^6 \\). After this peak, the value decreases and oscillates around a lower value, with some fluctuations. The oscillations are periodic and have a period of approximately 1 million units of time.\n\n#### Lower Line (Gray):\n- The gray line represents the standard deviation o

  2%|▏         | 10/500 [04:24<3:48:16, 27.95s/it]

Decoding output...
Decoding text...
Output: ['### Image Description\n\n#### A: Population Density vs. SUC2 Frequency\n\n- **Axes**: \n  - **X-axis**: SUC2 Frequency (x)\n  - **Y-axis**: Population Density (Cells/μL)\n- **Data Points**: \n  - The data points represent different population densities at various SUC2 frequencies.\n  - The solid line represents the population density at a specific SUC2 frequency.\n  - The dashed line represents the population density at another SUC2 frequency.\n- **Arrows**: \n  - Arrows indicate the direction of change in population density as SUC2 frequency changes.\n  - The arrows are drawn from the solid line to the dashed line, showing how the population density changes as SUC2 frequency increases.\n\n#### B: Survival vs. SUC2 Frequency\n\n- **Axes**: \n  - **X-axis**: SUC2 Frequency (x)\n  - **Y-axis**: Survival Rate (Cells/μL)\n- **Data Points**: \n  - The data points represent different survival rates at various SUC2 frequencies.\n  - The solid line

  2%|▏         | 11/500 [04:53<3:50:40, 28.30s/it]

Decoding output...
Decoding text...
Output: ["The image depicts a deep learning model for graph neural networks (GNNs), specifically a graph convolutional neural network (GCN). The model is designed to process and learn from graph-structured data, such as social networks, biological networks, or other types of networks.\n\n### Left Panel:\n- **A**: This is a matrix representation of the input graph. The matrix is divided into blocks, each representing a node in the graph. The blocks are stacked vertically, and the rows represent the nodes, and the columns represent the edges between nodes.\n- **V**: This is the output of the GCN. It is a matrix that captures the relationships between nodes in the graph. The output is a weighted sum of the input nodes, where the weights are learned by the GCN.\n\n### Middle Panel:\n- **E**: This is the edge matrix of the graph. It is a matrix that represents the connections between nodes in the graph. The edges are represented by the edges in the matrix

  2%|▏         | 12/500 [05:22<3:52:30, 28.59s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a plot of the normalized density of states (δ(T)) as a function of temperature (T) for different temperatures (T/T2). The temperature T/T2 is a parameter that varies along the x-axis, while the temperature T is the independent variable on the y-axis. The graph shows the normalized density of states δ(T) as a function of T/T2 for different temperatures.\n\nThe inset in the graph shows the normalized density of states (δ(T)) as a function of temperature (T) for a specific temperature (T/T2 = 0.1). The inset also shows the normalized density of states (δ(T)) as a function of temperature (T) for a different temperature (T/T2 = 0.3). The inset also shows the normalized density of states (δ(T)) as a function of temperature (T) for a further different temperature (T/T2 = 0.5).\n\nThe graph shows that the normalized density of states δ(T) increases with increasing temperature T/T2. This is because the temperature T/T2 is a

  3%|▎         | 13/500 [05:45<3:37:09, 26.75s/it]

Decoding output...
Decoding text...
Output: ['The graph presented in the image is a comparative analysis of the probability density functions (PDFs) of two different datasets, represented by different colors: blue for dataset 1 and orange for dataset 2. The x-axis of the graph is labeled as "Im[τ_W]/τ_H," which represents the imaginary part of the time scale τ_W relative to the characteristic time τ_H. The y-axis is labeled as "PDF," which stands for probability density function.\n\nThe graph includes three different curves, each corresponding to a different value of the parameter η, which is denoted as 4.10, 5.86, and 10.08. These values are likely used to represent different datasets or conditions under which the PDFs were measured.\n\nIn the inset of the graph, there is a plot of the PDFs for dataset 1, which is represented by the blue curve. The inset shows a histogram with a peak at the origin, indicating that the PDF is concentrated around zero. The inset also includes a referenc

  3%|▎         | 14/500 [06:13<3:39:18, 27.08s/it]

Decoding output...
Decoding text...
Output: ["The image depicts a flowchart illustrating the process of generating a video from an audio signal. The flowchart is divided into several sections, each representing a different component of the process. Here is a detailed description of each section:\n\n1. **Monocular Reconstruction (Section 3.1)**:\n   - This section involves reconstructing a monocular image from a single image. The input to this section is a single image, and the output is a reconstructed image that captures the shape, pose, expression, and texture of the person in the image.\n\n2. **Audio Conditioned Neural Renderer (Section 3.3)**:\n   - This section uses the reconstructed image from the monocular reconstruction to condition a neural renderer. The input to this section is the reconstructed image, and the output is a rendered neural texture that captures the visual appearance of the person in the video.\n\n3. **Audio-to-Expression Generation (Section 3.2)**:\n   - This s

  3%|▎         | 15/500 [06:28<3:11:22, 23.68s/it]

Decoding output...
Decoding text...
Output: ['The graph represents the energy levels of a hydrogen atom as a function of the principal quantum number n. The energy levels are shown as a function of the principal quantum number n, ranging from n = 1 to n = 12. The energy levels are labeled on the y-axis, with the energy levels increasing from left to right. The x-axis represents the principal quantum number n, with values ranging from n = 1 to n = 12. The energy levels are represented by different colors and lines, with the lowest energy level (n = 1) represented by a solid blue line, the next lowest energy level (n = 2) represented by a dashed blue line, the next lowest energy level (n = 3) represented by a dotted blue line, and the highest energy level (n = 12) represented by a dotted red line. The energy levels are shown to be decreasing as the principal quantum number n increases.']


 ---------------------------------------- 


[{'image_filename': 'images/1810.10511_2.jpg', 'output

  3%|▎         | 16/500 [06:54<3:15:11, 24.20s/it]

Decoding output...
Decoding text...
Output: ['The figure depicts the convergence of a numerical method for solving a specific problem, which appears to be related to fluid dynamics or a similar field. The x-axis represents the iteration number \\( i \\), while the y-axis represents the quality factors \\( Q_e/Q_{lb}^{\\mathrm{TM}} \\) and \\( Q_m/Q_{lb}^{\\mathrm{TM}} \\). The quality factors are normalized to the lower bound of the quality factor, \\( Q_{lb}^{\\mathrm{TM}} \\).\n\nThe top panel shows the convergence of the quality factors over iterations. The red line represents \\( Q_e/Q_{lb}^{\\mathrm{TM}} \\), and the blue line represents \\( Q_m/Q_{lb}^{\\mathrm{TM}} \\). Both quality factors initially decrease rapidly, indicating that the method is converging to a solution. However, after a certain point, the quality factors stabilize, suggesting that the method has reached a stable solution.\n\nThe bottom panel shows the evolution of the physical variable \\( \\tilde{\\rho}_e \\

  3%|▎         | 17/500 [07:05<2:43:38, 20.33s/it]

Decoding output...
Decoding text...
Output: ['The graph is a scatter plot that displays the proportion of a variable, represented by the y-axis, as a function of another variable, represented by the x-axis. The x-axis is labeled "s," and the y-axis is labeled "Proportion." The data points are represented by blue circles, and the error bars are shown for each data point, indicating the variability or uncertainty in the measurement. The error bars are shown as horizontal lines extending from the data points, with the length of the error bars representing the standard deviation or confidence interval around the measured value. The graph appears to show a positive correlation between the two variables, with the proportion increasing as the value of "s" increases.']


 ---------------------------------------- 


[{'image_filename': 'images/1810.10511_2.jpg', 'output_text': ['The graph in the image is a plot of the surface density of a hypothetical object as a function of its radius. The x-a

  4%|▎         | 18/500 [07:26<2:45:12, 20.57s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a probability density function (PDF) plot, which is a type of statistical graph used to represent the probability distribution of a random variable. In this particular graph, the x-axis represents the value of the random variable, while the y-axis represents the probability density at that value.\n\nThe graph shows three different curves, each representing a different value of the random variable. The curves are labeled with the values 0.02, 0.04, and 0.08 on the y-axis. The x-axis ranges from 0 to 80, which likely represents the range of possible values for the random variable.\n\nThe curves are all increasing and show a sharp increase as the value of the random variable increases. This indicates that the probability density function is skewed to the right, meaning that the probability of the random variable taking on a higher value is greater than the probability of it taking on a lower value.\n\nThe curves also 

  4%|▍         | 19/500 [07:49<2:49:18, 21.12s/it]

Decoding output...
Decoding text...
Output: ['The image depicts a network structure with two distinct layers, labeled "Network Up" and "Network Down." The network is composed of a series of interconnected nodes, represented by black dots, connected by lines. The lines are arranged in a grid-like pattern, forming a network structure.\n\nThe "Network Up" layer consists of a series of horizontal lines, while the "Network Down" layer consists of a series of vertical lines. The nodes in the "Network Up" layer are connected to the nodes in the "Network Down" layer through the lines. The lines in the "Network Up" layer are connected to the nodes in the "Network Down" layer, forming a network that connects the two layers.\n\nThe lines in the "Network Up" layer are connected to the nodes in the "Network Down" layer in a specific pattern. The lines in the "Network Up" layer are connected to the nodes in the "Network Down" layer in a way that creates a network that is connected from the top to th

  4%|▍         | 20/500 [08:19<3:10:19, 23.79s/it]

Decoding output...
Decoding text...
Output: ['The image presents two graphs, labeled as "a)" and "b)", which are related to the magnetic properties of CoO/Pt multilayers. The graphs are plotted on a logarithmic scale, with the x-axis representing the magnetic field strength (μ0H) and the y-axis representing the normalized magnetization (M/Ms), where Ms is the saturation magnetization.\n\n### Graph "a)"\n- **Title:** H || in film plane\n- **Legend:** Red line: CoO/Pt multilayers, Black line: CoO single layer\n- **Data Points:** The graph shows the normalized magnetization (M/Ms) as a function of magnetic field strength (μ0H) for both CoO/Pt multilayers and CoO single layer. The multilayer data points are represented by the red line, while the single-layer data points are represented by the black line.\n- **Observation:** The multilayer data points (red line) show a higher saturation magnetization compared to the single-layer data points (black line). This indicates that the multilayer s

  4%|▍         | 21/500 [08:49<3:26:38, 25.88s/it]

Decoding output...
Decoding text...
Output: ['The image depicts a comparative analysis of two methods for generating force fields for molecular dynamics simulations. The methods are labeled as "force matching" and "relative entropy method," and they are compared against a third method labeled "flow-CG potential."\n\n### (a) Force Matching\n- **Description:** This method involves simulating a molecular dynamics (MD) trajectory and then matching the resulting force field to a reference force field.\n- **Steps:**\n  1. **Simulate:** MD trajectory is simulated.\n  2. **Force Matching:** The force field is matched to a reference force field.\n  3. **Density Estimation:** The density of the system is estimated from the force field.\n\n### (b) Relative Entropy Method\n- **Description:** This method involves estimating the relative entropy between the force field and a reference force field.\n- **Steps:**\n  1. **Density Estimation:** The density of the system is estimated from the force field

  4%|▍         | 22/500 [09:21<3:39:49, 27.59s/it]

Decoding output...
Decoding text...
Output: ['The image is a scientific diagram illustrating the effect of a classical magnetic field on the orientation of magnetic moments in a material. The diagram is divided into two sections labeled (a) and (b).\n\n### Section (a):\n- **Objects**: The diagram shows several green spheres with red arrows pointing upwards, representing magnetic moments.\n- **Classical Magnetic Field**: A purple arrow pointing upwards is labeled "Classical magnetic field," indicating that the magnetic field is applied in a classical manner, without quantum effects.\n\n### Section (b):\n- **Objects**: The diagram shows the same green spheres with red arrows pointing upwards, representing magnetic moments.\n- **Classical Magnetic Field**: The purple arrow is still pointing upwards, but it is labeled "Quantum magnetic field," indicating that the magnetic field is now quantum-mechanically influenced.\n\n### Analysis:\n1. **Classical Magnetic Field**: In the classical magne

  5%|▍         | 23/500 [09:42<3:23:12, 25.56s/it]

Decoding output...
Decoding text...
Output: ['The image depicts a comparison between two different representations of a parking lot scene. The top image is a depth map, which is a grayscale image where the intensity of each pixel represents the distance of the point it represents from the camera. The bottom image is a point cloud, which is a set of 3D points that represent the positions of all the points in the scene.\n\nThe depth map shows the relative distances of each point in the scene from the camera. The points are represented by green lines, which are perpendicular to the ground plane. The length of each line represents the distance of the point from the camera. The points are clustered together in the center of the image, indicating that they are closer to the camera.\n\nThe point cloud, on the other hand, shows the actual positions of all the points in the scene. The points are represented by green dots, which are scattered throughout the image. The points are connected by lin

  5%|▍         | 24/500 [10:21<3:55:38, 29.70s/it]

Decoding output...
Decoding text...
Output: ['This image is a scientific figure that compares the enrichment of motifs in two types of composite networks: locally modular and non-modular. The figure is divided into three main sections: A, B, and C.\n\n### Section A: Locally Modular Composite Network Motifs\n- **Motif (1)**: This motif is represented by a triangle with three nodes. The nodes are labeled with the following information:\n  - **Z_in**: 19.48\n  - **Z_out**: 11.76\n  - **N**: 1656\n  - **Motif Class**: TR feedforward loop: Various functions\n  - **Functional Theme**: TR feedforward loop: Various functions\n\n- **Motif (2)**: This motif is represented by a triangle with four nodes. The nodes are labeled with the following information:\n  - **Z_in**: 42.76\n  - **Z_out**: 22.39\n  - **N**: 1879\n  - **Motif Class**: Coregulated interacting proteins: Coregulated protein networks, various functions\n  - **Functional Theme**: Coregulated interacting proteins: Coregulated protein

  5%|▌         | 25/500 [10:54<4:01:46, 30.54s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a plot of the secrecy capacity (Nats/sec/Hz) against the signal-to-noise ratio (SNR) for various values of the number of antennas (K) and the number of users (N). The x-axis represents the SNR in dB, ranging from -10 to 30 dB, while the y-axis represents the secrecy capacity in Nats/sec/Hz.\n\nThere are five different curves represented in the graph, each corresponding to a different combination of K and N:\n\n1. BF K=1: This curve represents the secrecy capacity for a single user with K=1 and N=1.\n2. BF K=10: This curve represents the secrecy capacity for a single user with K=10 and N=1.\n3. BF K=20: This curve represents the secrecy capacity for a single user with K=20 and N=1.\n4. ZF K=1: This curve represents the secrecy capacity for a single user with K=1 and N=1.\n5. ZF K=10: This curve represents the secrecy capacity for a single user with K=10 and N=1.\n6. ZF K=20: This curve represents the secrecy capacit

  5%|▌         | 26/500 [11:24<4:01:16, 30.54s/it]

Decoding output...
Decoding text...
Output: ["The figure depicts a cross-lingual auto-encoding and unsupervised cross-modal feature mapping framework for generating captions from sentences in different languages. The framework consists of two main components: (a) Cross-lingual auto-encoding and (b) Unsupervised cross-modal feature mapping.\n\n### (a) Cross-lingual Auto-encoding\n1. **Input**: A sentence in English (Sx) and a corresponding image (Ix).\n2. **Processing**: The sentence is first passed through a sentence parser to extract the relevant information.\n3. **Encoding**: The encoded sentence is then fed into a graph encoder (Gx) to generate a graph representation of the sentence.\n4. **Decoding**: The graph representation is then decoded back into a sentence (Sy) using a decoder (Gy).\n5. **Training Data**: The paired data (Sx, Sy) is used to train the graph encoder and decoder.\n\n### (b) Unsupervised Cross-modal Feature Mapping\n1. **Input**: A sentence in English (Sx) and a c

  5%|▌         | 27/500 [11:53<3:55:20, 29.85s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a scatter plot that represents the ratio of the energy difference between two states, denoted as ΔE/E_R, as a function of the angle φ (in radians). The x-axis is labeled "φ (rad)" and ranges from 1 to 6 radians. The y-axis is labeled "ΔE/E_R" and ranges from 0 to 20. The data points are represented by black diamonds, and they are scattered across the graph.\n\nThe graph shows a general trend where the ratio ΔE/E_R increases as the angle φ increases. However, there are several notable features:\n\n1. **Initial Increase**: The ratio ΔE/E_R starts to increase rapidly as the angle φ increases from 1 to approximately 3 radians. This suggests that the energy difference between the two states is increasing with the angle.\n\n2. **Plateau**: There is a plateau region where the ratio ΔE/E_R remains relatively constant as the angle φ increases further. This plateau region is characterized by a relatively flat curve, indicati

  6%|▌         | 28/500 [12:27<4:05:45, 31.24s/it]

Decoding output...
Decoding text...
Output: ['The graph in the image is a plot of \\( r_h / r_0 \\) against \\( \\tau / r_0 \\) for a Schwarzschild-AdS system. The x-axis represents \\( \\tau / r_0 \\), which is a dimensionless time variable, and the y-axis represents \\( r_h / r_0 \\), which is a dimensionless radius variable.\n\nThe graph shows a curve that starts at a point near the origin (0,0) and then decreases as \\( \\tau / r_0 \\) increases. The curve approaches a horizontal asymptote at \\( r_h / r_0 = 0 \\) as \\( \\tau / r_0 \\) approaches infinity. This indicates that as time \\( \\tau \\) becomes very large, the radius \\( r_h \\) approaches zero.\n\nThere are two vertical dashed lines on the graph:\n1. The first vertical dashed line is at \\( \\tau_1 / r_0 \\), which is a critical point on the curve. This point is where the curve starts to deviate from the horizontal asymptote.\n2. The second vertical dashed line is at \\( \\tau_c / r_0 \\), which is another critical poi

  6%|▌         | 29/500 [12:53<3:53:09, 29.70s/it]

Decoding output...
Decoding text...
Output: ["The image represents a mind map that illustrates the logical reasoning process involved in a medical scenario. The mind map is divided into four main sections: MedNLI, RadQA, CLIP, and LLM.\n\n1. **MedNLI (Medical NLI)**:\n   - **Premise**: The patient emerged with Apgar scores of 7 and 8.\n   - **Hypothesis**: The patient had low Apgar scores.\n   - **Contradiction**: The patient had high Apgar scores.\n\n2. **RadQA (Radiology QA)**:\n   - **Context**: The emergency room clinicians requested a second read on a C-spine CT.\n   - **Finding**: There is no evidence of evidence of fracture or subluxation.\n   - **Question**: Are there any abnormalities in the C-spine?\n\n3. **CLIP (Clinical Imaging Pathology)**:\n   - **Patient**: The patient has a follow-up neck CTA and appointment with a surgery on 1978-10-18.\n   - **Appointment-related, Imaging-related, Procedure-related follow-ups**.\n\nThe mind map shows that the patient's high Apgar scor

  6%|▌         | 30/500 [13:20<3:45:17, 28.76s/it]

Decoding output...
Decoding text...
Output: ['The image is a series of six panels, each depicting a different setup of a magnetic field configuration over time. The panels are labeled with the following:\n\n1. **Initial state (time = 0 h)**: This panel shows the initial state of the magnetic field configuration at the start of the simulation.\n2. **Setup (A) (time = 10 h)**: This panel shows the magnetic field configuration after 10 hours of simulation time.\n3. **Setup (B) (time = 10 h)**: This panel shows the magnetic field configuration after 10 hours of simulation time.\n4. **Setup (C) (time = 10 h)**: This panel shows the magnetic field configuration after 10 hours of simulation time.\n5. **Setup (D) (time = 10 h)**: This panel shows the magnetic field configuration after 10 hours of simulation time.\n6. **Setup (E) (time = 20 h)**: This panel shows the magnetic field configuration after 20 hours of simulation time.\n\nEach panel includes a color bar on the left side, indicating t

## 1. Multimodal Representational Models with Late Interaction

#### Load model colqwen2-v0.1

In [6]:
RAG = RAGMultiModalModel.from_pretrained("./models/colqwen2-v0.1-merged", device=device)

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


In [11]:
try:
    RAG.index(
        input_path="data/images/",  # The path to your documents
        index_name='colqwen2-v0.1-merged-arxiv_qa',  # The name you want to give to your index. It'll be saved at `index_root/index_name/`.
        store_collection_with_index=False,  # Whether the index should store the base64 encoded documents.
        overwrite=False,  # Whether to overwrite an index if it already exists. If False, it'll return None and do nothing if `index_root/index_name` exists.
    )
except ValueError:
    RAG = RAG.from_index('colqwen2-v0.1-merged-arxiv_qa', device='cpu')

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.




In [14]:
text_query = 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?'
results = RAG.search(text_query, k=1)
results

[{'doc_id': 245, 'page_num': 1, 'score': 24.875, 'metadata': {}, 'base64': None}]
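
The `score` field above is a late-interaction relevance score. As a rough illustration only (this is not the code byaldi or ColQwen2 runs internally), the MaxSim operator used by ColBERT-style models can be sketched in plain Python. The toy 2-dimensional vectors and the names `maxsim_score`, `page_a`, and `page_b` are assumptions made for this example.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score.

    For each query-token vector, keep the best dot product against any
    document vector (token or image patch), then sum over query tokens.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings (hypothetical values, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.3, 0.3], [0.1, 0.2]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best = max(scores, key=scores.get)  # page with the highest MaxSim score
print(best, scores)
```

Because every query token is matched independently against all document vectors, the score grows with the number of query tokens, so values such as the one above are only comparable across documents for the same query.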

In [29]:
import gzip
import json

# Open the .json.gz file
with gzip.open('.byaldi/colqwen2-v0.1-merged-arxiv_qa/index_config.json.gz', 'rt', encoding='utf-8') as f:
    data = json.load(f)

# Print or process the JSON data
print(data)


{'model_name': './models/colqwen2-v0.1-merged', 'full_document_collection': False, 'highest_doc_id': 499, 'resize_stored_images': False, 'max_image_width': None, 'max_image_height': None, 'library_version': '0.0.7'}


In [27]:
image_name='/'.join(RAG.get_doc_ids_to_file_names()[results[0]['doc_id']].split('/')[-2:])

In [None]:
del RAG

NameError: name 'RAG' is not defined

In [53]:
torch.mps.empty_cache()

In [42]:
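# Find the dataset entry whose image matches the retrieved page
for i, item in enumerate(ds['train']):
    if item['image_filename'] == image_name:
        break
else:
    # Without this guard, `item` would silently stay bound to the last entry
    raise ValueError(f"{image_name} not found in ds['train']")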

#### RAG Pipeline Qwen

In [18]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name,
                                                        trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.77it/s]


In [7]:
item=ds['train'][0]

In [27]:
item

{'query': 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1621x1191>,
 'image_filename': 'images/1810.10511_2.jpg',
 'options': "['A. Correction causes a significant increase in surface density across all radii.', 'B. Correction results in a decrease in surface density for larger radii.', 'C. Correction causes the surface density to converge with the fspec = 1 case at larger radii.', 'D. Correction does not affect the surface density trend at all.', '-']",
 'answer': 'C',
 'page': '',
 'model': 'gpt4V',
 'prompt': '',
 'source': 'arxiv_qa'}
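
Note that `options` is stored as the string representation of a list and ends with a `'-'` placeholder. The sketch below shows one possible way to turn such an item into a multiple-choice prompt and to read a letter back from the generated text. The helper names (`parse_options`, `build_prompt`, `extract_letter`) are hypothetical, and the letter extraction is a simple heuristic rather than the notebook's evaluation code.

```python
import ast
import re


def parse_options(options_field):
    """Parse the stringified list in `options`, dropping placeholders like '-'."""
    options = ast.literal_eval(options_field)
    return [opt for opt in options if opt.strip() not in {"", "-"}]


def build_prompt(query, options):
    """Assemble a multiple-choice prompt from the question and its options."""
    lines = [query, "Choose the correct answer from the options below:"]
    lines.extend(options)
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def extract_letter(generated_text, valid_letters="ABCD"):
    """Return the first standalone option letter in the output, or None.

    Heuristic: an output that starts with the article "A" would be
    misread as option A, so real evaluation code may need stricter parsing.
    """
    match = re.search(rf"\b([{valid_letters}])\b", generated_text)
    return match.group(1) if match else None


# Shortened, made-up item in the same shape as the dataset entry above.
sample = {"query": "What does the curve do?", "options": "['A. Up', 'B. Down', '-']"}
options = parse_options(sample["options"])
prompt = build_prompt(sample["query"], options)
print(prompt)
print(extract_letter("The answer is B."))
```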

In [65]:
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": item['image'],
#                 "max_pixels": 720**2,
#             },
#             {
#                 "type": "text", 
#                 "text": 
#                     item['query'] + 
#                     "\n  Choose the correct answer from the options below: \n" +
#                     item['options'] +
#                     "Answer with the letter of the correct option."

#             },
#         ],
#     }
# ]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": item['image'],
                "max_pixels": 720**2,
            },
            {
                "type": "text", 
                "text": 
                    "Based on the image, provide a detailed scientific description of the graph."

            },
        ],
    }
]

In [66]:
print(messages)

[{'role': 'user', 'content': [{'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1621x1191 at 0x3B61D7C90>, 'max_pixels': 518400}, {'type': 'text', 'text': 'Based on the image, provide a detailed scientific description of the graph.'}]}]


In [67]:
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [68]:
print(text)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Based on the image, provide a detailed scientific description of the graph.<|im_end|>
<|im_start|>assistant



In [69]:
image_inputs, video_inputs = process_vision_info(messages)

In [70]:
image_inputs

[<PIL.Image.Image image mode=RGB size=812x616>]

In [71]:
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(device)

In [75]:
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)


In [76]:
print(output_text[0])

The graph in the image is a plot of the surface density of a hypothetical object as a function of its radius. The x-axis represents the radius of the object in units of h^{-1} Mpc, while the y-axis represents the surface density in units of h^2 Mpc^{-2}. The surface density is calculated using the formula:

\[ \Sigma = \frac{1}{4\pi R^2} \]

where \( R \) is the radius of the object.

There are three lines on the graph:

1. **Black line (f_spec = 1)**: This line represents the surface density when the factor \( f_{\text{spec}} \) is equal to 1. This line is a straight line with a slope of -1, indicating that the surface density decreases as the radius increases.

2. **Blue dashed line (f_spec ≠ 1, w/o corr.)**: This line represents the surface density when the factor \( f_{\text{spec}} \) is not equal to 1, but without any correction. This line is a curved line with a slope that is not -1, indicating that the surface density decreases more steeply as the radius increases.

3. **Blue so

#### RAG pipeline

In [5]:
# model_id = "google/gemma-2-2b-it"
model_id = "models/gemma-2-2b-it"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.64it/s]


In [6]:
messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipeline(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world.  So, what be yer pleasure, eh? 🦜

Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas.  I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world.  So, what be yer pleasure, eh? 🦜


#### **Part 1: Multimodal Retrieval**


3. **Approach Implementation**:
   - Multimodal retrieval with advanced models (e.g. ColQwen2).
   - Text-only retrieval with late interaction and late chunking.

4. **QA Pipeline**: Multiple-choice question answering with a generative model.

5. **Evaluation**:
   - Retrieval metrics (e.g. Precision@k, MRR).
   - Accuracy on the QA task.

6. **Visualization**: Results and comparative charts.
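
The evaluation step can be illustrated with a minimal sketch of the standard definitions of Precision@k, mean reciprocal rank (MRR), and QA accuracy. The function names and the toy document ids below are assumptions made for this example, not results from the project.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


def accuracy(predictions, gold):
    """Share of predicted answer letters that match the gold letters."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy runs: each pair is (ranked retrieved ids, set of relevant ids).
runs = [
    ([245, 12, 7], {245}),  # relevant page at rank 1
    ([3, 245, 9], {245}),   # relevant page at rank 2
    ([1, 2, 3], {99}),      # relevant page never retrieved
]
mrr = mean_reciprocal_rank(runs)
p_at_1 = precision_at_k(runs[0][0], runs[0][1], k=1)
qa_acc = accuracy(["C", "A", "B", "D"], ["C", "B", "B", "D"])
print(mrr, p_at_1, qa_acc)
```

With a single relevant page per question, as in the dataset item shown earlier, Precision@1 is simply the hit rate at rank 1.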

#### **Part 2: GraphRAG**
1. **Setup**: Install the libraries (Neo4j, Milvus, LangGraph).
2. **Triple Extraction**:
   - An LLM identifies nodes and edges.
   - Choice between adding relationship strength or descriptions.
3. **Graph Construction**:
   - Insertion into Neo4j.
4. **Querying and Reasoning**:
   - Community detection, direct neighbors, Cypher queries.
   - Generation of local and global answers.
5. **Interactive Demo**:
   - Multi-hop questions with structured answers.
6. **Visualization**:
   - Visualization of the graph and of the answers.
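
To make the direct-neighbor and multi-hop ideas concrete, here is a minimal in-memory sketch that does not use Neo4j. The triples, relation names, and helper functions are hypothetical, and in the actual pipeline these lookups would presumably be expressed as Cypher queries against the database.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples, as an LLM might extract them.
triples = [
    ("Paper A", "USES", "Method X"),
    ("Paper B", "USES", "Method X"),
    ("Method X", "EVALUATED_ON", "Dataset Y"),
]


def build_graph(triples):
    """Undirected adjacency list: node -> list of (relation, neighbor)."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((rel, subj))
    return graph


def direct_neighbors(graph, node):
    """Sorted names of all nodes one hop away from `node`."""
    return sorted({neighbor for _, neighbor in graph.get(node, [])})


def shortest_path(graph, start, goal, max_hops=3):
    """Breadth-first search for a path of at most `max_hops` edges."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 >= max_hops:
            continue  # do not expand beyond the hop budget
        for _, neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(triples)
neighbors = direct_neighbors(graph, "Method X")
path = shortest_path(graph, "Paper A", "Dataset Y")
print(neighbors)
print(path)
```

The two-hop path from a paper to a dataset through a shared method is the kind of chain a multi-hop question would need the graph to expose.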

---


### **4. Final Output**
- **Module 1**: Quantitative report on the performance of the retrieval models on the multimodal task.
- **Module 2**: Interactive knowledge graph with answers to complex questions and advanced reasoning.